
DS-4 Solutions: Variability and Spread


Section 5 — Guided Practice Solutions

GP-1 — Computing the Range (Ecologist Bird Species)

Dataset: 12, 8, 15, 10, 9, 14, 7, 11 (8 wetland sites).

(a) Find the minimum:

Scan all values: the smallest is 7.
min = 7 species.

(b) Compute the range:

max = 15 (the largest value), min = 7.

\[ \text{Range} = \max - \min = 15 - 7 = \mathbf{8} \text{ species} \]

The bird species counts span 8 species across the 8 sites.

(c) With a 9th site (37 species), new range:

The new maximum is 37. The minimum remains 7 (no new low value was added).

\[ \text{New Range} = 37 - 7 = \mathbf{30} \text{ species} \]

The range jumped from 8 to 30 — nearly a 4× increase from a single new observation. This is the range's fragility in action: one extreme value dominates it completely.

Why this matters: Before the 9th site, the range of 8 species was a reasonable description of spread (all values between 7 and 15). After adding site 9, the range of 30 is misleading — 8 of 9 sites are still tightly clustered between 7 and 15. The range now overstates the typical spread by nearly 4×. This is exactly why we need resistant measures like the IQR.
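Parts (b) and (c) can be checked with a short Python sketch. The helper name `value_range` is an illustrative choice, not something defined in the lesson.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

sites = [12, 8, 15, 10, 9, 14, 7, 11]   # the 8 wetland sites
with_ninth = sites + [37]                # add the 9th site

print(value_range(sites))        # 8
print(value_range(with_ninth))   # 30
```

Only the two extreme values enter the calculation, which is why a single new maximum changes the result so sharply.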


GP-2 — Computing Sample Variance and Standard Deviation (All 5 Variants)

Every variant follows the same procedure. The formula is:

\[ s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}, \qquad s = \sqrt{s^2} \]

We use \( n-1 \) (Bessel's correction) because we are estimating the population variance from a sample.

Variant 0 — Tomato Plant Fruit Counts (n = 5)

Data: 8, 12, 9, 11, 10

Step 1 — Mean: \( \bar{x} = \frac{8+12+9+11+10}{5} = \frac{50}{5} = 10 \)

Step 2 — Deviation table:

| \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) |
|---|---|---|
| 8 | −2 | 4 |
| 12 | +2 | 4 |
| 9 | −1 | 1 |
| 11 | +1 | 1 |
| 10 | 0 | 0 |

Check: deviations sum to \( -2+2-1+1+0 = 0 \) ✓

Step 3 — Sum of squared deviations: \( \sum(x_i - \bar{x})^2 = 4+4+1+1+0 = 10 \)

Step 4 — Sample variance: \( s^2 = \frac{10}{5-1} = \frac{10}{4} = \mathbf{2.5} \)

Step 5 — Sample standard deviation: \( s = \sqrt{2.5} \approx \mathbf{1.58} \)

Interpretation: The typical tomato plant's fruit count deviates from the mean of 10 by about 1.58 fruits.
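The five steps can be mirrored in a minimal Python sketch. The function name `sample_variance` is hypothetical; the standard library's `statistics.variance` is used only as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance with Bessel's correction (divide by n - 1)."""
    n = len(values)
    mean = sum(values) / n                         # Step 1: mean
    squared = [(x - mean) ** 2 for x in values]    # Step 2: squared deviations
    return sum(squared) / (n - 1)                  # Steps 3 and 4: sum, then divide

tomatoes = [8, 12, 9, 11, 10]
s2 = sample_variance(tomatoes)
s = math.sqrt(s2)                                  # Step 5: square root

print(s2, round(s, 2))                 # 2.5 1.58
print(statistics.variance(tomatoes))   # 2.5 (library cross-check)
```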

Variant 1 — Runner Sprint Times (n = 6)

Data: 12.1, 11.8, 12.5, 12.0, 11.6, 12.4 (seconds)

Step 1 — Mean: \( \bar{x} = \frac{72.4}{6} \approx 12.0667 \)

Step 2 — Deviation table:

| \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) |
|---|---|---|
| 12.1 | +0.0333 | 0.0011 |
| 11.8 | −0.2667 | 0.0711 |
| 12.5 | +0.4333 | 0.1878 |
| 12.0 | −0.0667 | 0.0044 |
| 11.6 | −0.4667 | 0.2178 |
| 12.4 | +0.3333 | 0.1111 |

Step 3 — Sum of squared deviations: \( \sum(x_i - \bar{x})^2 \approx 0.5933 \)

Step 4 — Sample variance: \( s^2 = \frac{0.5933}{6-1} = \frac{0.5933}{5} \approx \mathbf{0.119} \)

Step 5 — Sample standard deviation: \( s = \sqrt{0.119} \approx \mathbf{0.34} \) seconds

Interpretation: The typical runner's time deviates from the mean of 12.07 seconds by about 0.34 seconds. This small SD relative to the mean indicates a tightly clustered field.

Variant 2 — Cafe Daily Pastry Sales (n = 7)

Data: 24, 30, 22, 28, 26, 20, 25

Step 1 — Mean: \( \bar{x} = \frac{175}{7} = 25 \)

Step 2 — Deviation table:

| \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) |
|---|---|---|
| 24 | −1 | 1 |
| 30 | +5 | 25 |
| 22 | −3 | 9 |
| 28 | +3 | 9 |
| 26 | +1 | 1 |
| 20 | −5 | 25 |
| 25 | 0 | 0 |

Check: \( -1+5-3+3+1-5+0 = 0 \) ✓

Step 3 — Sum of squared deviations: \( 1+25+9+9+1+25+0 = 70 \)

Step 4 — Sample variance: \( s^2 = \frac{70}{7-1} = \frac{70}{6} \approx \mathbf{11.67} \)

Step 5 — Sample standard deviation: \( s = \sqrt{11.67} \approx \mathbf{3.42} \) pastries

Interpretation: Daily pastry sales typically deviate from the mean of 25 by about 3.4 pastries. The sales are moderately variable: the busiest and slowest days (30 and 20 pastries) each sit 5 from the mean, about 1.5 standard deviations out, and they differ from each other by 10.

Variant 3 — Container Liquid Volumes (n = 8)

Data: 250, 248, 253, 251, 249, 252, 247, 250 (mL)

Step 1 — Mean: \( \bar{x} = \frac{2000}{8} = 250 \)

Step 2 — Deviation table:

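| \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) |
|---|---|---|
| 250 | 0 | 0 |
| 248 | −2 | 4 |
| 253 | +3 | 9 |
| 251 | +1 | 1 |
| 249 | −1 | 1 |
| 252 | +2 | 4 |
| 247 | −3 | 9 |
| 250 | 0 | 0 |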

Step 3 — Sum of squared deviations: \( 0+4+9+1+1+4+9+0 = 28 \)

Step 4 — Sample variance: \( s^2 = \frac{28}{8-1} = \frac{28}{7} = \mathbf{4.0} \)

Step 5 — Sample standard deviation: \( s = \sqrt{4.0} = \mathbf{2.0} \) mL

Interpretation: The filling process is very consistent — the typical container deviates from the target of 250 mL by only 2.0 mL (less than 1% relative variation).

Variant 4 — Package Weights (n = 4)

Data: 3.2, 3.8, 3.5, 3.1 (kg)

Step 1 — Mean: \( \bar{x} = \frac{13.6}{4} = 3.4 \)

Step 2 — Deviation table:

| \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) |
|---|---|---|
| 3.2 | −0.2 | 0.04 |
| 3.8 | +0.4 | 0.16 |
| 3.5 | +0.1 | 0.01 |
| 3.1 | −0.3 | 0.09 |

Step 3 — Sum of squared deviations: \( 0.04+0.16+0.01+0.09 = 0.30 \)

Step 4 — Sample variance: \( s^2 = \frac{0.30}{4-1} = \frac{0.30}{3} = \mathbf{0.10} \)

Step 5 — Sample standard deviation: \( s = \sqrt{0.10} \approx \mathbf{0.32} \) kg

Interpretation: Packages typically deviate from the mean weight of 3.4 kg by about 0.32 kg. Note how the small sample size (n = 4) means Bessel's correction makes a noticeable difference: dividing by n = 4 would give s² = 0.075, but the corrected s² = 0.10 is 33% larger.
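The effect of the denominator can be seen directly with Python's standard library, where `statistics.pvariance` divides by n and `statistics.variance` divides by n−1. This is a small sketch on the package-weight data; the variable names are illustrative.

```python
import statistics

weights = [3.2, 3.8, 3.5, 3.1]   # package weights in kg

divide_by_n = statistics.pvariance(weights)          # denominator n
divide_by_n_minus_1 = statistics.variance(weights)   # denominator n - 1

print(round(divide_by_n, 3))           # 0.075
print(round(divide_by_n_minus_1, 3))   # 0.1
# The ratio is n / (n - 1), here 4/3, so the corrected value is about 33% larger.
print(round(divide_by_n_minus_1 / divide_by_n, 3))   # 1.333
```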

Common Mistakes in Variance / SD Computation:

  1. Forgetting to square deviations: Summing raw deviations gives 0 every time — useless. You must square first, then sum, then divide.
  2. Dividing by n instead of n−1: This is the single most frequent error. For sample data, the denominator is always n−1 (Bessel's correction). Dividing by n underestimates the true population variance.
  3. Reporting s² as s: Variance and standard deviation are different quantities. SD = √(variance). A variance of 2.5 means SD ≈ 1.58 — they are not interchangeable numbers.
  4. Using n−1 for the mean: Bessel's correction applies only to the variance denominator. The mean always divides by n: \( \bar{x} = \sum x_i / n \).

GP-3 — Five-Number Summary, IQR, and Outlier Detection (All 5 Variants)

Variant 0 — Statistics Quiz Scores (n = 10)

Data: 14, 8, 17, 11, 15, 9, 13, 18, 12, 16

Sorted: 8, 9, 11, 12, 13, 14, 15, 16, 17, 18

n = 10 (even).

Q2 (median): positions 5 and 6 → \( \frac{13+14}{2} = 13.5 \)

Lower half (positions 1–5): 8, 9, 11, 12, 13. nL = 5 (odd).
Q1 = position 3 = 11.

Upper half (positions 6–10): 14, 15, 16, 17, 18. nU = 5 (odd).
Q3 = position 3 = 16.

Five-number summary: min = 8, Q1 = 11, Q2 = 13.5, Q3 = 16, max = 18.

IQR: 16 − 11 = 5.

Fences:
Lower = 11 − 1.5 × 5 = 11 − 7.5 = 3.5
Upper = 16 + 1.5 × 5 = 16 + 7.5 = 23.5

All values are in [3.5, 23.5]. No outliers.
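The median-of-halves procedure can be sketched in Python as follows. The function name `five_number_summary` is an assumption made for illustration, and the slicing assumes the convention used in these solutions: when n is odd, the median is left out of both halves.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using median-of-halves quartiles."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # lower half (median excluded when n is odd)
    upper = data[n - half:]    # upper half (median excluded when n is odd)
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

scores = [14, 8, 17, 11, 15, 9, 13, 18, 12, 16]
print(five_number_summary(scores))   # (8, 11, 13.5, 16, 18)
```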

Variant 1 — Delivery Truck Distances (n = 7)

Data: 45, 72, 38, 61, 83, 49, 67

Sorted: 38, 45, 49, 61, 67, 72, 83

n = 7 (odd).

Q2 (median): position (7+1)/2 = 4 → 61.

Lower half (below Q2): 38, 45, 49. nL = 3 (odd).
Q1 = 45.

Upper half (above Q2): 67, 72, 83. nU = 3 (odd).
Q3 = 72.

Five-number summary: min = 38, Q1 = 45, Q2 = 61, Q3 = 72, max = 83.

IQR: 72 − 45 = 27 km.

Fences:
Lower = 45 − 1.5 × 27 = 45 − 40.5 = 4.5
Upper = 72 + 1.5 × 27 = 72 + 40.5 = 112.5

All values are in [4.5, 112.5]. No outliers.

Variant 2 — Pharmacy Prescription Counts (n = 7)

Data: 142, 98, 175, 115, 160, 88, 205

Sorted: 88, 98, 115, 142, 160, 175, 205

n = 7 (odd).

Q2 (median): position 4 → 142.

Lower half: 88, 98, 115. Q1 = 98.

Upper half: 160, 175, 205. Q3 = 175.

Five-number summary: min = 88, Q1 = 98, Q2 = 142, Q3 = 175, max = 205.

IQR: 175 − 98 = 77.

Fences:
Lower = 98 − 1.5 × 77 = 98 − 115.5 = −17.5
Upper = 175 + 1.5 × 77 = 175 + 115.5 = 290.5

All values are in [−17.5, 290.5]. The negative lower fence simply means no values can be low enough to be flagged on the low end — which is expected for count data (cannot be negative). No outliers.

Variant 3 — Car Battery Lifetimes (n = 8)

Data: 36, 48, 30, 42, 54, 24, 60, 38

Sorted: 24, 30, 36, 38, 42, 48, 54, 60

n = 8 (even).

Q2: positions 4 and 5 → \( \frac{38+42}{2} = 40 \)

Lower half (positions 1–4): 24, 30, 36, 38. nL = 4 (even).
Q1 = \( \frac{30+36}{2} = 33 \)

Upper half (positions 5–8): 42, 48, 54, 60. nU = 4 (even).
Q3 = \( \frac{48+54}{2} = 51 \)

Five-number summary: min = 24, Q1 = 33, Q2 = 40, Q3 = 51, max = 60.

IQR: 51 − 33 = 18 months.

Fences:
Lower = 33 − 1.5 × 18 = 33 − 27 = 6
Upper = 51 + 1.5 × 18 = 51 + 27 = 78

All values are in [6, 78]. No outliers.

Variant 4 — Apartment Rents (n = 11)

Data: 950, 1100, 850, 1250, 900, 1050, 1150, 1000, 1300, 800, 2000

Sorted: 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1250, 1300, 2000

n = 11 (odd).

Q2: position 6 → 1050.

Lower half (below Q2): 800, 850, 900, 950, 1000. nL = 5 (odd).
Q1 = 900.

Upper half (above Q2): 1100, 1150, 1250, 1300, 2000. nU = 5 (odd).
Q3 = 1250.

Five-number summary: min = 800, Q1 = 900, Q2 = 1050, Q3 = 1250, max = 2000.

IQR: 1250 − 900 = 350.

Fences:
Lower = 900 − 1.5 × 350 = 900 − 525 = 375
Upper = 1250 + 1.5 × 350 = 1250 + 525 = 1775

Outlier check:
800 > 375 → No low outlier.
2000 > 1775 → 2000 is a potential outlier!

All other values (800–1300) are within the fences. The $2000 apartment should be investigated: is it a luxury penthouse misclassified with standard units, a data-entry error, or a legitimate high-end rental? Do not auto-delete; flag for investigation.
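The fence check can be automated with a short sketch. The function name `tukey_fences` and its `k` parameter are illustrative assumptions; the quartiles follow the same median-of-halves convention as the worked solutions.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence, flagged_values)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[len(data) - half:])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in data if x < lower or x > upper]
    return lower, upper, flagged

rents = [950, 1100, 850, 1250, 900, 1050, 1150, 1000, 1300, 800, 2000]
print(tukey_fences(rents))   # (375.0, 1775.0, [2000])
```

The function only flags values; deciding what to do with a flagged value remains a judgment call for the analyst.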

Section 6 — Independent Practice Solutions

IP-1 — Sample Variance and Standard Deviation (Generator: generateVarianceSDProblem)

Generated fresh each time. The approach is always the same:

  1. Compute the mean: \( \bar{x} = \sum x_i / n \)
  2. Compute each deviation \( x_i - \bar{x} \) and square it
  3. Sum the squared deviations: \( \sum(x_i - \bar{x})^2 \)
  4. Divide by \( n-1 \): \( s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} \)
  5. Take the square root: \( s = \sqrt{s^2} \)

Sample worked output 1 (n = 6): Data = {12, 28, 35, 19, 42, 8}

\( \bar{x} = \frac{144}{6} = 24 \). Deviations: −12, +4, +11, −5, +18, −16.
Squared: 144, 16, 121, 25, 324, 256. Sum = 886.
\( s^2 = 886/5 = 177.20, \quad s = \sqrt{177.20} \approx 13.31 \).

Sample worked output 2 (n = 7): Data = {45, 38, 50, 42, 47, 39, 44}

\( \bar{x} = \frac{305}{7} \approx 43.57 \). Deviations: +1.43, −5.57, +6.43, −1.57, +3.43, −4.57, +0.43.
Squared: 2.04, 31.04, 41.33, 2.47, 11.76, 20.90, 0.18. Sum ≈ 109.71.
\( s^2 = 109.71/6 \approx 18.29, \quad s = \sqrt{18.29} \approx 4.28 \).

Sample worked output 3 (n = 5): Data = {22, 31, 18, 27, 25}

\( \bar{x} = \frac{123}{5} = 24.6 \). Deviations: −2.6, +6.4, −6.6, +2.4, +0.4.
Squared: 6.76, 40.96, 43.56, 5.76, 0.16. Sum = 97.20.
\( s^2 = 97.20/4 = 24.30, \quad s = \sqrt{24.30} \approx 4.93 \).

Self-check: Always verify that (1) deviations sum to approximately zero, (2) s² is positive, (3) s is smaller than the range (typically range/6 ≤ s ≤ range/2 for small samples).
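The self-check can be scripted. The sketch below is one possible version; the function name `self_check` and the dictionary keys are hypothetical. The range/6 to range/2 band is the rule of thumb quoted above, not a guarantee.

```python
import statistics

def self_check(values):
    """Run the sanity checks on a sample and return a dict of booleans."""
    mean = statistics.mean(values)
    deviations = [x - mean for x in values]
    s2 = statistics.variance(values)
    s = statistics.stdev(values)
    spread = max(values) - min(values)
    return {
        "deviations_sum_to_zero": abs(sum(deviations)) < 1e-9,
        "variance_positive": s2 > 0,
        "s_below_range": s < spread,
        "s_in_rule_of_thumb_band": spread / 6 <= s <= spread / 2,
    }

print(self_check([12, 28, 35, 19, 42, 8]))   # every check passes for sample 1
```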


IP-2 — Five-Number Summary and IQR (Generator: generateFiveNumberSummary)

Generated with n ∈ {8, 9, 10, 11, 12}. Follow the median-of-halves method used in GP-3: when n is odd, the median is excluded from both halves.

Sample worked output 1 (n = 9, odd):
Sorted data: 12, 23, 28, 35, 47, 52, 61, 74, 88

Q2: position 5 = 47.
Lower half (below Q2): 12, 23, 28, 35. nL = 4 (even). Q1 = (23+28)/2 = 25.5.
Upper half (above Q2): 52, 61, 74, 88. nU = 4 (even). Q3 = (61+74)/2 = 67.5.
Summary: min = 12, Q1 = 25.5, Q2 = 47, Q3 = 67.5, max = 88. IQR = 42.

Sample worked output 2 (n = 10, even):
Sorted data: 5, 18, 22, 31, 40, 49, 55, 63, 77, 91

Q2: (40+49)/2 = 44.5.
Lower half (positions 1–5): 5, 18, 22, 31, 40. nL = 5 (odd). Q1 = 22.
Upper half (positions 6–10): 49, 55, 63, 77, 91. nU = 5 (odd). Q3 = 63.
Summary: min = 5, Q1 = 22, Q2 = 44.5, Q3 = 63, max = 91. IQR = 41.

Sample worked output 3 (n = 11, odd):
Sorted data: 8, 15, 21, 30, 36, 44, 52, 58, 67, 75, 82

Q2: position 6 = 44.
Lower half: 8, 15, 21, 30, 36. nL = 5 (odd). Q1 = 21.
Upper half: 52, 58, 67, 75, 82. nU = 5 (odd). Q3 = 67.
Summary: min = 8, Q1 = 21, Q2 = 44, Q3 = 67, max = 82. IQR = 46.
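Software does not always use the median-of-halves convention, so computed quartiles can differ from hand answers. The sketch below compares the hand method (the helper `median_of_halves` is an illustrative name) with the interpolation-based `statistics.quantiles` from Python's standard library, using its `"inclusive"` method on sample worked output 1.

```python
import statistics

def median_of_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (median excluded when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[:half]), statistics.median(data[len(data) - half:])

data = [12, 23, 28, 35, 47, 52, 61, 74, 88]   # sample worked output 1

print(median_of_halves(data))   # (25.5, 67.5)

# A different quartile convention gives different values on the same data.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(q1, q3)                   # 28.0 61.0
```

When checking answers against a calculator or a library, it helps to confirm which quartile convention it uses before concluding that a hand computation is wrong.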


IP-3 — Outlier Detection via Fences (Generator: generateOutlierProblem)

Datasets may contain 0, 1, or 2 potential outliers. Always use 1.5 × IQR fences.

Sample 1 — No outliers:

Sorted data (n = 12): 14, 18, 22, 27, 31, 35, 39, 43, 48, 52, 58, 72

n = 12 (even). Q2 = (35+39)/2 = 37.
Lower half (pos 1–6): 14, 18, 22, 27, 31, 35. Q1 = (22+27)/2 = 24.5.
Upper half (pos 7–12): 39, 43, 48, 52, 58, 72. Q3 = (48+52)/2 = 50.

IQR = 50 − 24.5 = 25.5.
Lower fence = 24.5 − 1.5 × 25.5 = 24.5 − 38.25 = −13.75.
Upper fence = 50 + 1.5 × 25.5 = 50 + 38.25 = 88.25.

All values in [−13.75, 88.25]. No outliers detected.

Sample 2 — Single outlier:

Sorted data (n = 10): 11, 15, 19, 24, 28, 33, 37, 42, 48, 112

n = 10 (even). Q2 = (28+33)/2 = 30.5.
Lower half (pos 1–5): 11, 15, 19, 24, 28. Q1 = 19.
Upper half (pos 6–10): 33, 37, 42, 48, 112. Q3 = 42.

IQR = 42 − 19 = 23.
Lower fence = 19 − 1.5 × 23 = 19 − 34.5 = −15.5.
Upper fence = 42 + 1.5 × 23 = 42 + 34.5 = 76.5.

112 > 76.5 → 112 is a potential outlier. All other values (11–48) are within the fences. Flag 112 for investigation — could be a data-entry error (e.g., 11.2 entered as 112) or a genuinely extreme observation.

Sample 3 — One outlier in a larger sample:

Sorted data (n = 15): 8, 12, 16, 20, 23, 27, 31, 35, 40, 44, 49, 55, 62, 70, 145

n = 15 (odd). Q2: position 8 = 35.
Lower half (below Q2): 8, 12, 16, 20, 23, 27, 31. nL = 7 (odd). Q1: position 4 = 20.
Upper half (above Q2): 40, 44, 49, 55, 62, 70, 145. nU = 7 (odd). Q3: position 4 = 55.

IQR = 55 − 20 = 35.
Lower fence = 20 − 1.5 × 35 = 20 − 52.5 = −32.5.
Upper fence = 55 + 1.5 × 35 = 55 + 52.5 = 107.5.

145 > 107.5 → 145 is a potential outlier. The next-largest value, 70, is below the upper fence of 107.5, and the smallest value, 8, is well above the lower fence of −32.5, so nothing else is flagged. One value flagged.

(The generator produces varying outlier counts; the actual number of flagged values depends on the generated data and fence computation, not seeding alone.)

Remember: Flagged values are potential outliers. Investigate before deleting. A value flagged by the fence rule could be: (a) a data-entry error — fix or remove; (b) a measurement error — document and possibly exclude; (c) a valid extreme observation — retain and note in analysis.


IP-4 — Range vs. Variance: Same Range, Different Spread (Generator: generateRangeProblem)

Two datasets share the same range but differ in variance. This demonstrates that range alone is blind to internal structure.

Sample 1 — Why reordering is not enough:

Dataset A: {37, 39, 38, 40, 36, 41}
min = 36, max = 41. Range = 41 − 36 = 5.
\( \bar{x}_A = 231/6 = 38.5 \). Deviations: −1.5, +0.5, −0.5, +1.5, −2.5, +2.5.
Squared: 2.25, 0.25, 0.25, 2.25, 6.25, 6.25. Sum = 17.50.
\( s^2_A = 17.50/5 = 3.5 \), \( s_A \approx 1.87 \).

Dataset B (same values, reordered): {36, 41, 37, 40, 38, 39}
min = 36, max = 41. Range = 5 (same).
\( \bar{x}_B = 231/6 = 38.5 \). Deviations: −2.5, +2.5, −1.5, +1.5, −0.5, +0.5.
Squared: 6.25, 6.25, 2.25, 2.25, 0.25, 0.25. Sum = 17.50, so \( s^2_B = 3.5 \) as well.

This pair does not show a variance difference: B contains exactly the same six values as A, and variance depends only on which values are present, not on their order. A third ordering, {36, 38, 40, 37, 41, 39}, gives the same sum of 17.50. To get a different variance with the same range, the values between the minimum and maximum must themselves differ.

(Note: the generator uses a construction algorithm that guarantees \( s^2_B \ge 2 \times s^2_A \); the specific values vary per generation. The key insight is always the same: identical ranges can mask very different internal distributions.)

Conceptual demonstration (using the generator's typical output pattern):

Dataset A (clustered): min = 20, max = 27, Range = 7. Values clustered near centre: 22, 23, 24, 24, 25, 25, 26.
\( s^2_A \approx 1.8 \) for the seven central values listed (tight around x̄; the endpoints 20 and 27 are not among them).

Dataset B (dispersed): min = 20, max = 27, Range = 7. Values spread across the range: 20, 21, 23, 24, 25, 26, 27.
\( s^2_B \approx 6.6 \) (much larger, more than 3× the Dataset A figure).

Why range and variance diverge: The range uses only two values (min, max) and is blind to everything between. The variance uses every value and captures internal clustering or dispersion. Two datasets can have identical bookends but completely different internal structure — the range misses this entirely; the variance sees it.
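A small Python sketch makes the contrast concrete. The two datasets `clustered` and `dispersed` are hypothetical values chosen for illustration, not generator output; the last lines reuse Dataset A from Sample 1 to show that reordering leaves the variance unchanged.

```python
import statistics

def value_range(values):
    return max(values) - min(values)

# Hypothetical datasets chosen for illustration (not generator output).
clustered = [0, 5, 5, 5, 5, 5, 10]     # interior values sit at the centre
dispersed = [0, 0, 0, 5, 10, 10, 10]   # interior values sit at the extremes

print(value_range(clustered), value_range(dispersed))   # 10 10
print(round(statistics.variance(clustered), 2))         # 8.33
print(float(statistics.variance(dispersed)))            # 25.0

# Reordering a dataset never changes its variance.
a = [37, 39, 38, 40, 36, 41]
print(statistics.variance(a) == statistics.variance(sorted(a)))   # True
```

Both hypothetical datasets share the same minimum and maximum, yet the second has three times the variance because its interior values sit far from the mean.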


IP-5 — Find the Error in Spread Measure Computations (All 5 Variants)

Variant 0 — Student's SD Computation (Study Hours)

Data: 8, 6, 10, 4, 12. Student computed: x̄ = 8, s² = 8, s ≈ 2.83.

Error: The student divided by n = 5 instead of n−1 = 4.
The deviations and squares are correct: 0² + (−2)² + 2² + (−4)² + 4² = 0+4+4+16+16 = 40.
Correct computation: s² = 40/(5−1) = 40/4 = 10 (not 8).
s = √10 ≈ 3.16 hours (not 2.83).

The error understates both the variance (8 vs. 10) and standard deviation (2.83 vs. 3.16). This is exactly the bias Bessel's correction is designed to fix.

Variant 1 — Analyst's Five-Number Summary (Customer Counts)

Data: 34, 28, 41, 23, 37, 45, 30, 39, 26, 42. Sorted: 23, 26, 28, 30, 34, 37, 39, 41, 42, 45.

Error: The analyst approximated Q1 and Q3 by position ("3rd value" and "8th value") instead of using the median-of-halves method.

Correct computation:
n = 10 (even). Q2 = (34+37)/2 = 35.5.
Lower half (positions 1–5): 23, 26, 28, 30, 34. Q1 = 28 (coincidentally the same as the analyst's position-based guess in this case).
Upper half (positions 6–10): 37, 39, 41, 42, 45. Q3 = 41.

IQR = 41 − 28 = 13. The analyst's IQR was also 13 by coincidence, but the method was incorrect. In other datasets, the position-based shortcut produces wrong Q1/Q3 values.

Variant 2 — Lab Technician's Range-Only Report (Chemical Samples)

Data: 5.12, 5.08, 5.15, 5.11, 5.09, 5.14 g. Technician: range = 0.07 g, no further analysis.

Error: The technician relied solely on the range and concluded no further spread analysis was needed. The range uses only two values (min and max) and reveals nothing about the internal consistency of the measurements.

What should have been reported: The standard deviation (or IQR) in addition to the range. For these data: x̄ = 5.115, sum of squared deviations = 0.00375, s² = 0.00375/5 = 0.00075, s ≈ 0.0274 g. The SD of about 0.027 g confirms excellent consistency (CV ≈ 0.5%), but you can only know this by computing it: the range alone doesn't prove consistency, it only sets an outer bound.

Variant 3 — Student Forgets to Square Deviations (Exam Scores)

Data: 65, 70, 75, 80, 85, 90, 95. Student: x̄ = 80, s = 10.

Error: The student summed the absolute deviations (15+10+5+0+5+10+15 = 60) and divided by n−1 = 6, forgetting to square first. This computes an average absolute deviation, not the standard deviation.

Correct computation:
Deviations: −15, −10, −5, 0, +5, +10, +15.
Squared: 225, 100, 25, 0, 25, 100, 225. Sum = 700.
s² = 700/6 ≈ 116.67. s = √116.67 ≈ 10.80 points (not 10).

The error happens to give a numerically close answer (10 vs. 10.80) because these data are evenly spaced, but the method is wrong and would produce much larger errors with other datasets.
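The two methods can be compared side by side in Python. This is a minimal sketch; the variable names are illustrative.

```python
import math

scores = [65, 70, 75, 80, 85, 90, 95]
n = len(scores)
mean = sum(scores) / n

# The student's method: sum the absolute deviations, divide by n - 1.
student_value = sum(abs(x - mean) for x in scores) / (n - 1)

# The correct method: square, sum, divide by n - 1, then take the square root.
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

print(student_value)   # 10.0
print(round(s, 2))     # 10.8
```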

Variant 4 — HR Analyst's Range Comparison with an Outlier (Salaries)

Data: Dept A: 42, 48, 45, 52, 47, 44, 50, 46. Dept B: 38, 42, 55, 40, 95, 44, 41, 39.

Analyst: \( \text{Range}_A = 10 \), \( \text{Range}_B = 57 \) → "Dept B has much more pay inequity."

Error: The analyst used range, which is dominated by the single outlier ($95K) in Department B. The range makes Department B appear 5.7× more spread out than A, but this is almost entirely driven by one data point.

Better analysis: Compute the IQR for each department to compare typical pay dispersion.
Dept A (sorted): 42, 44, 45, 46, 47, 48, 50, 52. n = 8.
Q2 = (46+47)/2 = 46.5. Lower half: 42, 44, 45, 46 → Q1 = 44.5. Upper half: 47, 48, 50, 52 → Q3 = 49. \( \text{IQR}_A = 4.5 \).

Dept B (sorted): 38, 39, 40, 41, 42, 44, 55, 95. n = 8.
Q2 = (41+42)/2 = 41.5. Lower half: 38, 39, 40, 41 → Q1 = 39.5. Upper half: 42, 44, 55, 95 → Q3 = 49.5. \( \text{IQR}_B = 10.0 \).

The IQRs are much closer (4.5 vs. 10.0 — about 2.2×, not 5.7×). Department B's middle 50% is somewhat more spread out, but the dramatic 5.7× ratio from the range was almost entirely the $95K outlier, not genuine pay dispersion. The $95K should be investigated before drawing conclusions about equity.


IP-6 — Multi-Step Synthesis: Forestry Researcher's Maple Tree Data

Data: 28, 34, 22, 31, 45, 26, 38, 24, 29 (cm, diameter at breast height)

Sorted: 22, 24, 26, 28, 29, 31, 34, 38, 45

(a) Mean, Variance, and Standard Deviation

Mean: \( \sum x_i = 277, \; n = 9, \; \bar{x} = \frac{277}{9} \approx \mathbf{30.78} \) cm

Deviation table:

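| \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) |
|---|---|---|
| 22 | −8.78 | 77.09 |
| 24 | −6.78 | 45.97 |
| 26 | −4.78 | 22.85 |
| 28 | −2.78 | 7.73 |
| 29 | −1.78 | 3.17 |
| 31 | +0.22 | 0.05 |
| 34 | +3.22 | 10.37 |
| 38 | +7.22 | 52.13 |
| 45 | +14.22 | 202.21 |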

\( \sum(x_i - \bar{x})^2 \approx 421.56 \)

\( s^2 = \frac{421.56}{9-1} = \frac{421.56}{8} \approx \mathbf{52.69} \) cm²

\( s = \sqrt{52.69} \approx \mathbf{7.26} \) cm

(b) Five-Number Summary, IQR, and Outliers

n = 9 (odd). Q2: position (9+1)/2 = 5 → 29 cm.

Lower half (below Q2): 22, 24, 26, 28. nL = 4 (even).
Q1 = (24+26)/2 = 25 cm.

Upper half (above Q2): 31, 34, 38, 45. nU = 4 (even).
Q3 = (34+38)/2 = 36 cm.

Five-number summary: min = 22, Q1 = 25, Q2 = 29, Q3 = 36, max = 45.

IQR: 36 − 25 = 11 cm.

Fences:
Lower = 25 − 1.5 × 11 = 25 − 16.5 = 8.5
Upper = 36 + 1.5 × 11 = 36 + 16.5 = 52.5

All 9 values are between 8.5 and 52.5. No outliers detected.

(c) Adding a 10th Tree (68 cm) — Comparing Resistance to Outliers

New dataset (sorted): 22, 24, 26, 28, 29, 31, 34, 38, 45, 68. n = 10.

\( \sum x_i = 277 + 68 = 345, \quad \bar{x}_{\text{new}} = \frac{345}{10} = 34.5 \) cm.

New standard deviation:

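| \( x_i \) | \( x_i - 34.5 \) | \( (x_i - 34.5)^2 \) |
|---|---|---|
| 22 | −12.5 | 156.25 |
| 24 | −10.5 | 110.25 |
| 26 | −8.5 | 72.25 |
| 28 | −6.5 | 42.25 |
| 29 | −5.5 | 30.25 |
| 31 | −3.5 | 12.25 |
| 34 | −0.5 | 0.25 |
| 38 | +3.5 | 12.25 |
| 45 | +10.5 | 110.25 |
| 68 | +33.5 | 1122.25 |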

\( \sum(x_i - \bar{x}_{\text{new}})^2 = 1668.50 \).

\( s^2_{\text{new}} = \frac{1668.50}{9} \approx 185.39, \quad s_{\text{new}} = \sqrt{185.39} \approx \mathbf{13.62} \) cm.

Change in SD: \( \frac{13.62 - 7.26}{7.26} \times 100\% \approx \mathbf{+87.6\%} \)

New five-number summary (n = 10):
Q2 = (29+31)/2 = 30.
Lower half (pos 1–5): 22, 24, 26, 28, 29 → Q1 = 26.
Upper half (pos 6–10): 31, 34, 38, 45, 68 → Q3 = 38.

\( \text{IQR}_{\text{new}} = 38 - 26 = 12 \) cm.

Change in IQR: \( \frac{12 - 11}{11} \times 100\% \approx \mathbf{+9.1\%} \)

Interpretation: Adding one extreme observation (68 cm, roughly 2.2× the mean of the original 9 trees) nearly doubled the standard deviation (+87.6%) but barely moved the IQR (+9.1%). This is the practical demonstration of resistance: the SD squares deviations, so the 68 cm value contributed 1122.25 to the sum of squares (67% of the total), giving it disproportionate influence. The IQR, by contrast, depends only on the positions of Q1 and Q3, not on how extreme the tails are. Adding a 10th value re-split the halves, so Q1 moved from 25 to 26 and Q3 from 36 to 38, but the IQR rose by only 1 cm, and it would have been the same had the new tree measured 46 cm instead of 68 cm.

The principle: Standard deviation is sensitive to outliers; IQR is resistant. This directly mirrors the mean (sensitive) vs. median (resistant) relationship from DS-3. When data are symmetric without outliers, use mean + SD. When data are skewed or have outliers, use median + IQR.
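The comparison in part (c) can be reproduced with a short sketch. The helper names `iqr` and `pct_change` are illustrative assumptions; the IQR uses the same median-of-halves convention as the worked solution.

```python
import statistics

def iqr(values):
    """IQR from median-of-halves quartiles (median excluded when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[len(data) - half:]) - statistics.median(data[:half])

def pct_change(old, new):
    return (new - old) / old * 100

trees = [28, 34, 22, 31, 45, 26, 38, 24, 29]
trees_plus = trees + [68]            # add the 10th tree

sd_change = pct_change(statistics.stdev(trees), statistics.stdev(trees_plus))
iqr_change = pct_change(iqr(trees), iqr(trees_plus))

print(round(sd_change, 1))    # 87.6
print(round(iqr_change, 1))   # 9.1
```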

Section 7 — Mastery Check Solutions

Feynman Test — Why n−1? (Model Answer)

When you have the entire population, you know the true centre, so dividing by N makes sense — that's the actual average squared deviation. But with a sample, you have to estimate the centre using the sample average, which is always pulled slightly toward your sample values. This means your sample deviations tend to be just a bit smaller than the true deviations would be. Dividing by a slightly smaller number (n−1 instead of n) inflates the variance just enough to correct for this built-in underestimation.

Think of it as a statistical honesty tax: since deviations measured from the sample's own mean tend to be smaller than deviations from the true population mean, you compensate by making the variance a touch larger. The smaller the sample, the bigger the correction: with 5 values, dividing by 4 really matters; with 500 values, dividing by 499 barely changes anything.

There is also a mathematical reason: when you compute x̄ from your sample and then use it to compute deviations, those n deviations are not fully independent — once you know n−1 of them, the last one is forced (they must sum to zero). You have n−1 genuine pieces of information — n−1 degrees of freedom — so you divide by n−1.
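The underestimation can be demonstrated exactly, without simulation, by listing every possible sample from a tiny population. The sketch below assumes, purely for illustration, that the five tomato counts from GP-2 are the entire population and that samples of size 3 are drawn with replacement.

```python
import itertools
import statistics

# Assumption for illustration: treat these five values as the whole population.
population = [8, 12, 9, 11, 10]
sigma2 = statistics.pvariance(population)   # true population variance (equals 2)

# Every possible sample of size 3 drawn with replacement (5**3 = 125 samples).
samples = list(itertools.product(population, repeat=3))

avg_divide_by_n = statistics.mean(statistics.pvariance(s) for s in samples)
avg_divide_by_n_minus_1 = statistics.mean(statistics.variance(s) for s in samples)

print(round(avg_divide_by_n, 4))           # 1.3333 (too small on average)
print(round(avg_divide_by_n_minus_1, 4))   # 2.0 (matches the population variance)
```

Averaged over all possible samples, dividing by n recovers only (n−1)/n of the true variance, while dividing by n−1 recovers it exactly.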


Apply — Choosing the Right Spread Measure (Urban Planner, Two Cities)

(a) City A — Symmetric, Bell-Shaped, No Outliers

Answer: Standard deviation. When data are symmetric and free of outliers, the standard deviation is the preferred measure because it uses every observation and is directly connected to the mean.

Verification: For City A, compute both measures. The SD and IQR will give consistent, proportional information because the data are symmetric. SD is preferred because it carries more information.

(b) City B — Right-Skewed with Luxury Outliers

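Answer: IQR. When data are strongly skewed with extreme outliers (luxury homes at $1M–$3M), the IQR is the appropriate measure because it focuses on the middle 50% of the data and ignores the tails, so the luxury homes do not inflate it.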

Verification: Compute the SD for City B. It will be large (perhaps $300K+) because luxury homes at $1M–$3M produce enormous squared deviations. But the IQR might be only $50K–$80K, accurately reflecting that most homes cluster tightly. The SD says "homes typically deviate by $300K from the mean", which is false for the large majority of homes. The IQR gives the truthful picture.

(c) Principle — Why Distribution Shape Matters

The choice of spread measure mirrors the choice of centre measure from DS-3. When data are symmetric with no outliers, the standard deviation is preferred because it uses every observation and is directly connected to the mean. When data are skewed or contain outliers, the standard deviation becomes unreliable — extreme values contribute disproportionately large squared deviations, inflating the SD beyond what describes typical data.

The IQR is the resistant counterpart: it focuses on the middle 50% of data and ignores the tails entirely, so extreme values have zero influence. This makes it the natural spread companion to the median, just as SD is the natural companion to the mean.

The rule: Mean + SD for symmetric data without outliers. Median + IQR for skewed data or data with outliers. Always pair the measure of spread with the matching measure of centre.


Error Analysis — Variance vs. Standard Deviation Confusion

Data: 18, 22, 20, 24, 19, 23, 21 (°C). Both students report x̄ = 21°C.

Student 1 error: Claims SD = 4.33°C because variance = 4.33°C².
Misconception: Treating variance and standard deviation as the same number. They are different quantities: SD = √(variance). The variance is in squared units (°C²); the SD is in original units (°C).

Student 2 is correct in concept — SD = √(variance) ≈ 2.08°C.

However — verifying the variance computation:
x̄ = (18+22+20+24+19+23+21)/7 = 147/7 = 21.
Deviations: −3, +1, −1, +3, −2, +2, 0.
Squared: 9, 1, 1, 9, 4, 4, 0. Sum = 28.
Using sample formula (n−1 = 6): s² = 28/6 ≈ 4.67 (not 4.33).
Using population formula (n = 7): σ² = 28/7 = 4.00.

The reported variance of 4.33 does not match either formula. Both students appear to have made an arithmetic error in computing the sum of squared deviations (or used a different rounding). However, the specific error being tested is Student 1's confusion of variance with standard deviation — claiming s² = s without taking the square root. This is one of the most common and consequential errors in introductory statistics.

Correct values: s² = 28/6 ≈ 4.67°C², s = √4.67 ≈ 2.16°C.
(Note: the lesson's distractor text references 4.33 and 2.08 for pedagogical purposes — the core tested error is the variance/SD confusion, not the specific numeric values.)
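The correct values can be confirmed with Python's standard library. This is a short check; the variable names are illustrative.

```python
import math
import statistics

temps = [18, 22, 20, 24, 19, 23, 21]   # degrees Celsius

s2 = statistics.variance(temps)        # sample variance, divides by n - 1
sigma2 = statistics.pvariance(temps)   # population variance, divides by n

print(round(s2, 2), round(math.sqrt(s2), 2))   # 4.67 2.16
print(float(sigma2))                           # 4.0
```

Neither formula yields 4.33, which is consistent with the verification above.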

Section 8 — Boss Fight Solutions

Path A — The Analyst: Departmental Salary Equity

Engineering (12 employees): 62, 58, 71, 65, 68, 60, 64, 70, 67, 63, 66, 59 ($K)

Marketing (10 employees): 48, 52, 55, 50, 95, 53, 47, 51, 54, 49 ($K)

Task A1 — Compute the Mean Salary

Engineering: \( \sum = 773, \; n = 12, \; \bar{x}_E = \frac{773}{12} \approx \mathbf{64.42} \) ($64,420).

Marketing: \( \sum = 554, \; n = 10, \; \bar{x}_M = \frac{554}{10} = \mathbf{55.40} \) ($55,400).

Engineering's mean is about $9,000 higher than Marketing's.

Task A2 — Range and Standard Deviation

Engineering:

Range = 71 − 58 = 13 ($13,000).

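Deviations from \( \bar{x}_E = 64.42 \):

| \( x_i \) | \( x_i - 64.42 \) | \( (x_i - 64.42)^2 \) |
|---|---|---|
| 62 | −2.42 | 5.86 |
| 58 | −6.42 | 41.22 |
| 71 | +6.58 | 43.30 |
| 65 | +0.58 | 0.34 |
| 68 | +3.58 | 12.82 |
| 60 | −4.42 | 19.54 |
| 64 | −0.42 | 0.18 |
| 70 | +5.58 | 31.14 |
| 67 | +2.58 | 6.66 |
| 63 | −1.42 | 2.02 |
| 66 | +1.58 | 2.50 |
| 59 | −5.42 | 29.38 |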

\( \sum(x_i - \bar{x}_E)^2 \approx 194.92 \).
\( s^2_E = \frac{194.92}{11} \approx 17.72, \quad s_E = \sqrt{17.72} \approx \mathbf{4.21} \) ($4,210).

Marketing:

Range = 95 − 47 = 48 ($48,000).

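Deviations from \( \bar{x}_M = 55.40 \):

| \( x_i \) | \( x_i - 55.40 \) | \( (x_i - 55.40)^2 \) |
|---|---|---|
| 48 | −7.40 | 54.76 |
| 52 | −3.40 | 11.56 |
| 55 | −0.40 | 0.16 |
| 50 | −5.40 | 29.16 |
| 95 | +39.60 | 1568.16 |
| 53 | −2.40 | 5.76 |
| 47 | −8.40 | 70.56 |
| 51 | −4.40 | 19.36 |
| 54 | −1.40 | 1.96 |
| 49 | −6.40 | 40.96 |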

\( \sum(x_i - \bar{x}_M)^2 = 1802.40 \).
\( s^2_M = \frac{1802.40}{9} \approx 200.27, \quad s_M = \sqrt{200.27} \approx \mathbf{14.15} \) ($14,150).

Marketing's SD ($14,150) is about 3.4× Engineering's SD ($4,210). Note that the single $95K salary contributes 1568.16 out of 1802.40 to the sum of squares — that is 87% of the total sum of squares from one data point. The SD is not just larger — it is almost entirely a measure of how far $95K is from the mean, not a measure of typical pay dispersion.
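The 87% figure can be checked by computing each salary's share of the sum of squares. This is a minimal sketch; the dictionary keyed by salary works here only because all ten Marketing salaries are distinct.

```python
marketing = [48, 52, 55, 50, 95, 53, 47, 51, 54, 49]   # salaries in $K
mean = sum(marketing) / len(marketing)

# Squared deviation for each salary (all values are distinct, so a dict is safe).
squared = {x: (x - mean) ** 2 for x in marketing}
total = sum(squared.values())

share_of_top = squared[95] / total
print(round(total, 2))             # 1802.4
print(round(share_of_top * 100))   # 87
```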

Task A3 — Five-Number Summary and IQR

Engineering (sorted): 58, 59, 60, 62, 63, 64, 65, 66, 67, 68, 70, 71

n = 12 (even). Q2 = (64+65)/2 = 64.5.
Lower half: 58, 59, 60, 62, 63, 64. Q1 = (60+62)/2 = 61.
Upper half: 65, 66, 67, 68, 70, 71. Q3 = (67+68)/2 = 67.5.

Five-number summary (Eng): min = 58, Q1 = 61, Q2 = 64.5, Q3 = 67.5, max = 71.
\( \text{IQR}_E = 6.5 \) ($6,500).

Marketing (sorted): 47, 48, 49, 50, 51, 52, 53, 54, 55, 95

n = 10 (even). Q2 = (51+52)/2 = 51.5.
Lower half: 47, 48, 49, 50, 51. Q1 = 49.
Upper half: 52, 53, 54, 55, 95. Q3 = 54.

Five-number summary (Mkt): min = 47, Q1 = 49, Q2 = 51.5, Q3 = 54, max = 95.
\( \text{IQR}_M = 5 \) ($5,000).

Key observation: Marketing's IQR of $5K is actually smaller than Engineering's IQR of $6.5K. The middle 50% of Marketing salaries are more tightly clustered than Engineering's. The range and SD gave the opposite impression because they were dominated by one outlier.

Task A4 — Outlier Detection

Engineering:
Lower fence = 61 − 1.5 × 6.5 = 61 − 9.75 = 51.25.
Upper fence = 67.5 + 1.5 × 6.5 = 67.5 + 9.75 = 77.25.
All 12 values in [58, 71] ⊂ [51.25, 77.25]. No outliers.

Marketing:
Lower fence = 49 − 1.5 × 5 = 49 − 7.5 = 41.5.
Upper fence = 54 + 1.5 × 5 = 54 + 7.5 = 61.5.
95 > 61.5 → $95K is flagged as a potential outlier.
All other values (47–55) are within [41.5, 61.5].

Do NOT auto-delete: Flag $95K for investigation. Possible explanations: (a) data-entry error ($59K entered as $95K), (b) employee misclassified as non-managerial (actually a team lead), (c) legacy contract from a previous role, (d) a genuinely high-performing specialist correctly classified. Investigate, then decide.
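Tasks A3 and A4 can be run for both departments in one pass. The helper names `quartiles` and `flag_outliers` in this sketch are hypothetical; the quartiles follow the median-of-halves convention used above.

```python
import statistics

def quartiles(values):
    """(Q1, Q3) by median of halves; the median is excluded when n is odd."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[:half]), statistics.median(data[len(data) - half:])

def flag_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

engineering = [62, 58, 71, 65, 68, 60, 64, 70, 67, 63, 66, 59]
marketing = [48, 52, 55, 50, 95, 53, 47, 51, 54, 49]

for name, dept in [("Engineering", engineering), ("Marketing", marketing)]:
    q1, q3 = quartiles(dept)
    print(name, q3 - q1, flag_outliers(dept))
# Engineering 6.5 []
# Marketing 5 [95]
```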

Task A5 — Synthesize and Advise

Recommendation to HR Director:

Engineering has the more consistent pay structure, but the spread measures do not all point the same way, and each tells a different part of the story:

  • Range: Engineering $13K vs. Marketing $48K — but Marketing's range is inflated by the $95K outlier.
  • SD: Engineering $4.2K vs. Marketing $14.2K — but 87% of Marketing's sum of squares comes from a single data point. The SD is not measuring typical dispersion; it is measuring the distance of one salary from the mean.
  • IQR: Engineering $6.5K vs. Marketing $5.0K — the middle 50% spread is actually tighter in Marketing than in Engineering. This is the most trustworthy comparison because it is resistant to the outlier.
  • Outliers: Marketing has one (the $95K salary); Engineering has none.

Action items: (1) Investigate the $95K Marketing salary — is it a data error, a misclassification, or legitimate? (2) Report IQR alongside median for Marketing to give the fairest picture of typical pay dispersion. (3) Engineering's pay structure is consistent and equitable — no red flags. (4) If $95K is valid, re-run the analysis with it included but note it as a special case; the IQR-based comparison of the middle 50% is the most reliable indicator of equity.


Path B — The Architect: Quality-Control Study Design

Task B1 — Variable Type

"Tablet mass in mg" is quantitative continuous. Mass can take any real value within the scale's precision (e.g., 49.87 mg, 50.13 mg); it is not restricted to whole numbers. This matters because continuous measurements can be modelled with the normal distribution, which underlies the SD-based tools (standard deviation, CV, ±3s limits) used in the rest of this design.

Task B2 — Primary Spread Measure

Standard deviation is the correct choice for this scenario. The data are symmetric, bell-shaped, and free of outliers (n = 30, approximately normal), so SD uses every observation without being distorted by extreme values.

IQR is not wrong, but it discards information unnecessarily when the data are normal. Save IQR for when outliers or skew are present.

Task B3 — Outlier Detection Thresholds

Required information: Q1, Q3, and IQR computed from the batch data.
Multiplier: 1.5 (Tukey's standard).
Fences: [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR].

Tablets outside these fences are flagged for investigation. In a QC context, you can supplement this with a parallel rule: flag any tablet beyond ±3s from the target (the 3-sigma criterion, Rule 1 of the Western Electric rules for SPC). The two methods (IQR-based and SD-based) catch different patterns and can coexist in the protocol.
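One way the two rules could run side by side is sketched below. It assumes Q1, Q3, the target, and s have already been computed from the batch; the function name `flag_tablets` and every number in the example call are hypothetical, chosen only to show the two rules disagreeing.

```python
def flag_tablets(masses, q1, q3, center, s, k=1.5, z=3.0):
    """Return (iqr_flags, sd_flags): masses outside the Tukey fences and
    masses more than z standard deviations away from center."""
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    low_limit, high_limit = center - z * s, center + z * s

    iqr_flags = [m for m in masses if m < low_fence or m > high_fence]
    sd_flags = [m for m in masses if m < low_limit or m > high_limit]
    return iqr_flags, sd_flags


# Hypothetical values for illustration only (not the lesson's batch data).
masses = [49.8, 50.1, 50.3, 47.0, 56.0]
iqr_flags, sd_flags = flag_tablets(masses, q1=49.5, q3=50.9, center=50.0, s=1.8)

print(iqr_flags)  # [47.0, 56.0]
print(sd_flags)   # [56.0]
```

With these made-up inputs the IQR rule flags both 47.0 mg and 56.0 mg, while the ±3s rule flags only 56.0 mg, which is the sense in which the two rules catch different patterns.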

Task B4 — Comparative Spread Analysis (50 mg vs. 100 mg Tablets)

(a) Measure for fair comparison: Coefficient of Variation (CV). CV is unitless — it expresses SD as a percentage of the mean — so it enables comparison across different target masses. An SD of 2 mg means something very different for a 50 mg tablet (4% of target) vs. a 100 mg tablet (2% of target). CV accounts for scale automatically.

(b) Computation and interpretation:

50 mg line: x̄ = 50.2 mg, s = 1.8 mg.
\( \text{CV}_{50} = \frac{1.8}{50.2} \times 100\% \approx \mathbf{3.59\%} \)

100 mg line: x̄ = 100.4 mg, s = 2.4 mg.
\( \text{CV}_{100} = \frac{2.4}{100.4} \times 100\% \approx \mathbf{2.39\%} \)

Interpretation: The 100 mg line has a lower CV (2.39% vs. 3.59%), meaning relative variability is smaller for the higher-strength tablets. Even though the absolute SD is larger (2.4 > 1.8 mg), the process is actually more consistent relative to the target mass. This is exactly the insight CV was designed to surface — it corrects for the fact that "2.4 mg of variation" is a bigger deal for a 50 mg tablet than a 100 mg one.

Bottom line: The 100 mg manufacturing process is more consistent (lower relative variability). Both CVs are well below 5%, which is excellent for pharmaceutical manufacturing — typical QC thresholds are CV < 5% for content uniformity.
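The CV comparison is a one-line calculation, sketched here as a small Python helper. The name `cv_percent` is hypothetical, and the inputs are the means and standard deviations quoted in part (b).

```python
def cv_percent(mean, sd):
    """Coefficient of variation as a percentage: (sd / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_50 = cv_percent(50.2, 1.8)     # 50 mg line
cv_100 = cv_percent(100.4, 2.4)   # 100 mg line

print(f"{cv_50:.2f}%  {cv_100:.2f}%")  # 3.59%  2.39%
print("100 mg line more consistent:", cv_100 < cv_50)
```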

Section 9 — Challenge Problem Solutions

Challenge 1 — Proving ∑(xi − x̄) = 0 and Connecting to n−1

Algebraic Proof

Claim: For any set of numbers \( x_1, x_2, \ldots, x_n \), define \( \bar{x} = \frac{1}{n}\sum x_i \). Then \( \sum_{i=1}^{n}(x_i - \bar{x}) = 0 \).

Proof:

\[ \sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} \]

Since \( \bar{x} \) is constant across the sum: \( \sum_{i=1}^{n} \bar{x} = n \bar{x} \).

\[ = \sum_{i=1}^{n} x_i - n \bar{x} = \sum_{i=1}^{n} x_i - n \cdot \frac{1}{n}\sum_{i=1}^{n} x_i = \sum x_i - \sum x_i = 0 \quad \blacksquare \]

Connection to n−1 (Degrees of Freedom)

Because \( \sum(x_i - \bar{x}) = 0 \) always holds, the n deviations are not independent. Once you know n−1 of them, the last one is forced — it must be whatever value makes the sum exactly zero.

Concrete example: Data = {2, 5, 8}. x̄ = 5. Deviations: −3, 0, +3. If you only know the first two deviations (−3 and 0), the third must be +3. We have 3 values but only 2 independent pieces of deviation information — hence n−1 = 2 degrees of freedom.
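The same example can be checked numerically. The short Python sketch below uses the data {2, 5, 8} from the text; the variable names are illustrative only.

```python
data = [2, 5, 8]
mean = sum(data) / len(data)             # 5.0
deviations = [x - mean for x in data]    # [-3.0, 0.0, 3.0]

# The deviations sum to zero (exactly here; in general up to float rounding).
total = sum(deviations)

# Knowing the first n-1 deviations fixes the last one.
known = deviations[:-1]
forced_last = -sum(known)

print(deviations, total, forced_last)
```

Only `len(data) - 1` of the deviations are free to vary, which is the count used in the denominator of s².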

Extension — Minimum Sample Size for Variance

Question: What is the minimum number of values for which s² can be computed?

Answer: n = 2. If n = 1, then n−1 = 0 and the variance formula involves division by zero — undefined. You cannot measure spread with a single observation.

This makes intuitive sense: with one value, you know a location but have zero information about how spread out other values might be. Variability is inherently a property that requires at least two observations. The n=1 edge case also connects to the degrees-of-freedom logic: with one observation, you have 0 degrees of freedom for variability — the single deviation must be zero (x1 − x̄ = x1 − x1 = 0), telling you nothing about spread.
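Python's standard `statistics` module enforces the same minimum, which makes for a quick check. The two-value and one-value inputs below are made up for illustration.

```python
import statistics

# Two values: deviations are -2 and +2, so s^2 = (4 + 4) / (2 - 1) = 8.
two_point_variance = statistics.variance([8, 12])
print(two_point_variance)

# One value: the sample variance is undefined, and the library refuses to compute it.
raised = False
try:
    statistics.variance([10])
except statistics.StatisticsError as err:
    raised = True
    print("n = 1:", err)
```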


Challenge 2 — Comparing Spread Across Different Units with CV

Farm A (Baseline)

Plant height: x̄ = 185.3 cm, s = 28.4 cm.
\( \text{CV}_{\text{height}} = \frac{28.4}{185.3} \times 100\% \approx \mathbf{15.33\%} \)

Ear mass: x̄ = 214.7 g, s = 42.1 g.
\( \text{CV}_{\text{mass}} = \frac{42.1}{214.7} \times 100\% \approx \mathbf{19.61\%} \)

Ear mass is relatively more variable (19.61% vs. 15.33%) — about 28% more variable relative to its mean.

Variant 0 — Farm B

Height: x̄ = 152.1 cm, s = 24.8 cm.
\( \text{CV}_{\text{height}} = \frac{24.8}{152.1} \times 100\% \approx \mathbf{16.31\%} \)

Mass: x̄ = 178.3 g, s = 38.5 g.
\( \text{CV}_{\text{mass}} = \frac{38.5}{178.3} \times 100\% \approx \mathbf{21.59\%} \)

Mass remains relatively more variable than height (21.6% vs. 16.3%). Both CVs are slightly higher than Farm A's — Farm B's growing conditions produce somewhat more relative variability in both traits.

Variant 1 — Farm C (Most Consistent Height)

Height: x̄ = 210.5 cm, s = 18.9 cm.
\( \text{CV}_{\text{height}} = \frac{18.9}{210.5} \times 100\% \approx \mathbf{8.98\%} \)

Mass: x̄ = 245.0 g, s = 27.0 g.
\( \text{CV}_{\text{mass}} = \frac{27.0}{245.0} \times 100\% \approx \mathbf{11.02\%} \)

Comparison across farms (height CV):

Farm | Height CV
Farm A | 15.33%
Farm B | 16.31%
Farm C | 8.98%

Farm C has the most consistent plant height: its CV of 8.98% is only about 55–60% of Farm A's (15.33%) and Farm B's (16.31%). Even though Farm C's plants are taller on average (210.5 cm vs. 152–185 cm), they are much more uniform relative to their mean.
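The farm comparison can be reproduced with a short Python sketch. The dictionary layout and variable names are assumptions made for illustration; the means and standard deviations are those given for Farms A, B, and C.

```python
# (mean, sd) of plant height in cm, from the solutions above.
height_stats = {
    "Farm A": (185.3, 28.4),
    "Farm B": (152.1, 24.8),
    "Farm C": (210.5, 18.9),
}

# CV in percent for each farm.
height_cv = {farm: sd / mean * 100 for farm, (mean, sd) in height_stats.items()}

# The lowest CV marks the most consistent heights relative to the mean.
most_consistent = min(height_cv, key=height_cv.get)

for farm, cv in height_cv.items():
    print(f"{farm}: {cv:.2f}%")
print("Most consistent:", most_consistent)
```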

Variant 2 — Farm D (Unit-Invariance of CV)

Height measured in metres: x̄ = 1.76 m, s = 0.31 m.

\( \text{CV} = \frac{0.31}{1.76} \times 100\% \approx \mathbf{17.61\%} \)

Compare to Farm A's height CV (in cm: 28.4/185.3 = 15.33%):

Farm A in metres would be: x̄ = 1.853 m, s = 0.284 m.
CV = 0.284/1.853 × 100% = 15.33% — exactly the same as in cm!

Why CV is unit-invariant: Converting from cm to m divides every value by 100. Both the mean and standard deviation are divided by 100. Their ratio is unchanged:

\[ \frac{s/100}{\bar{x}/100} = \frac{s}{\bar{x}} \]

This property is what makes CV useful for comparisons: it does not depend on the measurement units chosen. Farm D's CV of 17.61% differs from Farm A's 15.33% because the underlying data differ (different farm, different plants), not because the units changed.
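Unit invariance is easy to confirm on raw data. The sketch below uses hypothetical plant heights (not from any of the farms) and a hypothetical helper `sample_cv_percent`. It also includes a shifted copy of the data to show that the invariance comes from rescaling (multiplying by a constant), not from adding a constant.

```python
import math
import statistics


def sample_cv_percent(values):
    """Sample CV in percent: stdev (n-1 denominator) divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical plant heights in cm, for illustration only.
heights_cm = [150.0, 172.5, 185.0, 198.5, 221.0]
heights_m = [h / 100 for h in heights_cm]          # unit change: cm -> m
heights_shifted = [h + 100 for h in heights_cm]    # a shift, not a unit change

cv_cm = sample_cv_percent(heights_cm)
cv_m = sample_cv_percent(heights_m)
cv_shifted = sample_cv_percent(heights_shifted)

print(math.isclose(cv_cm, cv_m))   # rescaling leaves CV unchanged
print(cv_shifted < cv_cm)          # adding a constant raises the mean only, so CV drops
```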

Common Mistakes to Check in Your Work

  1. Dividing by n instead of n−1 for sample variance. This is one of the most frequent errors in descriptive statistics. Bessel's correction exists because deviations measured from the sample mean x̄ (rather than the unknown population mean) are systematically too small, so dividing by n underestimates the population variance. If you see s² and the denominator is n, fix it to n−1.
  2. Reporting variance s² as the standard deviation s. Variance is in squared units; SD is in original units. They are different numbers (except when s² = 0 or s² = 1). Always take the square root when reporting standard deviation.
  3. Failing to sort data before finding quartiles. Q1, Q2, and Q3 are defined on sorted data. If you skip the sort step, your quartile values will be meaningless. Always sort first.
  4. Including Q2 in the halves when computing Q1/Q3 for odd n. For odd n, the median (Q2) belongs to neither half, so exclude it. The lower half is every value positioned before Q2 in the sorted list; the upper half is every value positioned after it. Including Q2 in either half pulls Q1 upward or Q3 downward (toward the median), which shrinks the IQR.
  5. Using the wrong fence multiplier. The standard multiplier is 1.5 (Tukey's rule). Using 1.0 narrows the fences and flags ordinary values as outliers; using 3.0 widens them so that only extreme (far) outliers are caught and mild ones are missed. Stick with 1.5 unless specifically instructed otherwise.
  6. Automatically deleting flagged outliers. The fence method identifies potential outliers for investigation — it does not license automatic deletion. Investigate: Is it a data-entry error? A measurement problem? A legitimate extreme value? Decide based on evidence, not the statistical flag alone.
  7. Comparing SD across datasets with different units without using CV. An SD of 10 cm and an SD of 8 kg cannot be compared directly. Use the coefficient of variation (CV = s/x̄ × 100%) to express spread relative to the mean in unitless terms.
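Several of these checks can be automated. The Python sketch below is one possible self-check, assuming the median-of-halves quartile method described in item 4; the helper name `quartiles_median_of_halves` is hypothetical, and the data are the tomato fruit counts from GP-2 Variant 0.

```python
import statistics


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) by the median-of-halves method.
    For odd n the median is excluded from both halves. Needs n >= 2."""
    ordered = sorted(values)                 # item 3: always sort first
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


fruit = [8, 12, 9, 11, 10]                   # GP-2 Variant 0 data

s2 = statistics.variance(fruit)              # item 1: divides by n-1, gives 2.5
divided_by_n = statistics.pvariance(fruit)   # divides by n, gives 2
s = statistics.stdev(fruit)                  # item 2: square root of s2

q1, q2, q3 = quartiles_median_of_halves(fruit)

print(s2, divided_by_n, round(s, 3))
print(q1, q2, q3, q3 - q1)
```

Note that `statistics.quantiles` and many spreadsheet functions use interpolation-based quartile definitions, so their results can differ slightly from the median-of-halves values computed here.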