
DS-5 Solutions: Position and Distribution Shape


Section 5 — Guided Practice Solutions

Guided Practice 1: Computing Z-Scores (Variant Bank)

Context for all variants: \( \mu = 120 \) calls, \( \sigma = 15 \) calls. Formula: \( z = \dfrac{x - \mu}{\sigma} \).

Variant A (147 calls):

\[ z = \frac{147 - 120}{15} = \frac{27}{15} = +1.80 \]

This agent handled 1.80 standard deviations more calls than the daily mean — a notably high-volume day.

Variant B (105 calls):

\[ z = \frac{105 - 120}{15} = \frac{-15}{15} = -1.00 \]

This agent handled exactly 1 standard deviation fewer calls than the daily mean — below average but not unusual.

Variant C (120 calls):

\[ z = \frac{120 - 120}{15} = \frac{0}{15} = 0.00 \]

This agent handled exactly the mean number of calls — right at average.

Variant D (93 calls):

\[ z = \frac{93 - 120}{15} = \frac{-27}{15} = -1.80 \]

This agent handled 1.80 standard deviations fewer calls than the mean — a notably low-volume day, symmetrically opposite to Variant A.

Variant E (158 calls):

\[ z = \frac{158 - 120}{15} = \frac{38}{15} \approx +2.53 \]

This agent handled 2.53 standard deviations more calls than the mean — an unusually high-volume day that would be flagged as exceptional.

Common mistakes: (1) Forgetting to subtract the mean before dividing — dividing the raw score by \( \sigma \) gives a meaningless ratio. (2) Dropping the sign — the sign carries essential directional information. A negative z-score means below the mean, not an error. (3) Using \( s \) for a population problem or \( \sigma \) for a sample problem.
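The five variants can be checked with a few lines of Python. This is a minimal sketch; the function name `z_score` and the `variants` dictionary are illustrative choices, not part of the lesson.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean (sign kept)."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Population parameters from the guided practice: mu = 120 calls, sigma = 15 calls.
MU, SIGMA = 120, 15
variants = {"A": 147, "B": 105, "C": 120, "D": 93, "E": 158}

for label, calls in variants.items():
    # The "+" in the format spec keeps the sign visible, since the sign carries direction.
    print(f"Variant {label}: z = {z_score(calls, MU, SIGMA):+.2f}")
```

Subtracting the mean inside the function before dividing guards against the first mistake listed above, and the signed format guards against the second.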


Guided Practice 2: Interpreting Percentiles (MCQ)

Correct answer: "A student who scored 68 performed better than approximately 50% of all test-takers."

Why the other options are wrong:


Guided Practice 3: Empirical Rule Application (Variant Bank)

Context for all variants: body temperatures approximately normally distributed with \( \mu = 37.0 \)°C, \( \sigma = 0.4 \)°C.

Variant A — Between 36.2°C and 37.8°C:

\( 36.2 = 37.0 - 0.8 = \mu - 2\sigma \) and \( 37.8 = 37.0 + 0.8 = \mu + 2\sigma \).

By the Empirical Rule: approximately 95% of healthy adults have a temperature in this range.

Variant B — Between 36.6°C and 37.4°C:

\( 36.6 = 37.0 - 0.4 = \mu - \sigma \) and \( 37.4 = 37.0 + 0.4 = \mu + \sigma \).

By the Empirical Rule: approximately 68% of healthy adults have a temperature in this range.

Variant C — 38.2°C, how many SDs above the mean?

\[ z = \frac{38.2 - 37.0}{0.4} = \frac{1.2}{0.4} = +3.00 \]

This temperature is exactly 3 SDs above the mean. By the Empirical Rule, only approximately 0.15% of adults have a temperature this high or higher, which is highly unusual and consistent with a fever.

Variant D — Percentage above 37.4°C:

\( 37.4 = \mu + \sigma \). Approximately 68% fall within 1 SD, so approximately 32% fall outside. By symmetry, approximately \( 32\%/2 = 16\% \) are above \( \mu + \sigma = 37.4 \)°C.

Variant E — Interval containing approximately 99.7% of adults:

\( \mu - 3\sigma = 37.0 - 1.2 = 35.8 \)°C and \( \mu + 3\sigma = 37.0 + 1.2 = 38.2 \)°C.

Approximately 99.7% of adults have temperatures between 35.8°C and 38.2°C.

Common mistakes: (1) Applying the Empirical Rule to a non-normal distribution. (2) Forgetting that "within 1 SD" means 68% total — some students take 68% as the one-sided area. (3) For one-sided tails: always halve the outside percentage by symmetry (32%/2 = 16%, not 32%).
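The interval and tail calculations in these variants follow one pattern, sketched below. The names `WITHIN`, `interval`, and `one_tail_percent` are assumptions made for illustration, and the percentages are the Empirical Rule approximations, valid only for roughly normal data.

```python
# Empirical Rule percentages within k standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def interval(mean, sd, k):
    """Return the (lower, upper) bounds of mean ± k * sd."""
    return (mean - k * sd, mean + k * sd)

def one_tail_percent(k):
    """Approximate percentage beyond k SDs on ONE side, using symmetry."""
    return (100.0 - WITHIN[k]) / 2

MU, SIGMA = 37.0, 0.4  # body temperature in °C, from the guided practice

for k in (1, 2, 3):
    low, high = interval(MU, SIGMA, k)
    print(f"about {WITHIN[k]}% between {low:.1f} and {high:.1f} °C; "
          f"about {one_tail_percent(k):.2f}% above {high:.1f} °C")
```

Halving the outside percentage in `one_tail_percent` is the symmetry step from mistake (3).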


Guided Practice 4: Classifying Distribution Shape (MCQ)

Correct answer: "Right-skewed; the median is a more appropriate measure of centre."

Reasoning: Mean (\$112,000) > Median (\$84,000). When the mean is pulled above the median, high-value outliers are dragging it to the right — the distribution is right-skewed. For right-skewed data, the median is resistant to extreme values and is more representative of the typical salary than the mean.

Why the other options are wrong:

Section 6 — Independent Practice Solutions

Independent Practice 1: Z-Score Generator

Generated by generateZScoreProblem(). The numbers vary each time, but the method is always the same:

  1. Identify whether the context describes a population (use \( \mu \) and \( \sigma \)) or a sample (use \( \bar{x} \) and \( s \)).
  2. Apply the formula: \[ z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s} \]
  3. Round to 2 decimal places.
  4. Interpret: state the direction (positive = above mean, negative = below mean) and magnitude (how many standard deviations away).

Example walk-through (representative values — yours will differ):

Suppose \( \mu = 80 \), \( \sigma = 12 \), \( x = 98 \).

\[ z = \frac{98 - 80}{12} = \frac{18}{12} = +1.50 \]

Interpretation: The value 98 is 1.50 standard deviations above the mean of 80. This places it in the upper portion of the distribution — above average but not unusually so (within 2 SDs of the mean).

Common mistakes: (1) Subtracting in the wrong order — always \( x - \mu \), not \( \mu - x \). (2) Using the wrong denominator — make sure to divide by \( \sigma \) (or \( s \)), not \( \sigma^2 \) (the variance). (3) Omitting the interpretation of sign — a z-score answer is incomplete without stating direction.
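Steps 2 to 4 of the method can be sketched as a small helper that returns both the rounded z-score and its direction. The function name `describe_z` is hypothetical; it is not the lesson's generator.

```python
def describe_z(x, center, spread):
    """Return (z rounded to 2 decimal places, direction relative to the mean)."""
    z = round((x - center) / spread, 2)   # always x minus the centre, divided by the SD
    if z > 0:
        direction = "above"
    elif z < 0:
        direction = "below"
    else:
        direction = "at"
    return z, direction

z, direction = describe_z(98, 80, 12)     # values from the example walk-through
print(f"z = {z:+.2f}: {abs(z):.2f} standard deviations {direction} the mean")
```

Returning the direction alongside the number makes it harder to give the incomplete answer described in mistake (3).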


Independent Practice 2: Percentile Generator

Generated by generatePercentileProblem(). The dataset and target percentile change each time, but the nearest-rank method is always the same:

  1. Confirm the data is sorted in ascending order.
  2. Compute the rank: \[ L = \left\lceil \frac{k}{100} \times n \right\rceil \] where \( k \) is the percentile and \( n \) is the number of values. The \( \lceil \cdot \rceil \) symbol means "ceiling" — round up to the next whole number.
  3. The \( L \)-th value in the sorted list is \( P_k \).
  4. State the interpretation: "Approximately \( k \)% of the data falls at or below [value]."

Example walk-through (representative values — yours will differ):

Sorted dataset (\( n = 10 \)): 14, 19, 23, 27, 31, 35, 40, 46, 52, 60. Find \( P_{75} \).

\[ L = \left\lceil 0.75 \times 10 \right\rceil = \lceil 7.5 \rceil = 8 \]

The 8th value is 46. So \( P_{75} = 46 \): at least 75% of the data (here 8 of the 10 values) falls at or below 46.

Common mistakes: (1) Forgetting to sort the data first — percentiles are meaningless on unsorted data. (2) Using \( \lfloor \cdot \rfloor \) (floor) instead of \( \lceil \cdot \rceil \) (ceiling) in the nearest-rank method. (3) Confusing the rank (the position, here 8) with the value (the actual data point, here 46).
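The nearest-rank steps translate directly into code. The sketch below is one possible implementation, with the hypothetical name `nearest_rank_percentile`; it returns the rank and the value separately so the two are not confused.

```python
import math

def nearest_rank_percentile(values, k):
    """Return (rank L, value P_k) using L = ceil(k/100 * n)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)           # step 1: never trust the input order
    n = len(ordered)
    rank = math.ceil(k * n / 100)      # step 2: ceiling, not floor
    return rank, ordered[rank - 1]     # step 3: ranks are 1-based, list indices 0-based

data = [14, 19, 23, 27, 31, 35, 40, 46, 52, 60]
rank, value = nearest_rank_percentile(data, 75)
print(f"rank L = {rank}, P_75 = {value}")
```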


Independent Practice 3: Empirical Rule (Variant Bank)

Variant A — IQ scores (\( \mu = 100 \), \( \sigma = 15 \)), between 70 and 130:

\( 70 = 100 - 30 = \mu - 2\sigma \) and \( 130 = 100 + 30 = \mu + 2\sigma \). Distribution is approximately normal — rule applies. Approximately 95% of the population has an IQ between 70 and 130.

Variant B — IQ scores, percentage below 85:

\( 85 = 100 - 15 = \mu - \sigma \). Approximately 68% fall within 1 SD, so approximately 32% fall outside. By symmetry, approximately 16% fall below \( \mu - \sigma = 85 \).

Variant C — House prices (mean \$480,000, median \$310,000, right tail): Should the Empirical Rule be used?

No. The large gap between mean (\$480,000) and median (\$310,000), combined with the described right tail, indicates heavy right skew. The Empirical Rule requires approximately normal distributions. Applying it here would produce badly wrong estimates — the rule assumes symmetry and bell shape, neither of which holds.

Variant D — IQ scores, interval containing 99.7%:

\( \mu - 3\sigma = 100 - 45 = 55 \) and \( \mu + 3\sigma = 100 + 45 = 145 \). Approximately 99.7% of IQ scores fall between 55 and 145.

Variant E — Reaction times (\( \mu = 250 \) ms, \( \sigma = 40 \) ms), value of 410 ms unusual?

\[ z = \frac{410 - 250}{40} = \frac{160}{40} = +4.00 \]

This is 4 standard deviations above the mean. By the Empirical Rule, approximately 99.7% of reaction times fall within 3 SDs, so a value at +4 SDs is extremely unusual: likely a recording error or a genuine outlier requiring investigation.


Independent Practice 4: Find the Error

Generated by generateSkewnessClassification(). The generator presents a flawed student classification of distribution shape. The general error pattern to look for:

The specific error in the lesson: The student correctly stated that approximately 68% of students scored between 62 and 82 (the \( \bar{x} \pm s \) interval). However, the student then used "68%" as if it were the z-score for a score of 80. This is a category confusion — a proportion and a z-score are completely different quantities.

The correct z-score for a score of 80 when \( \bar{x} = 72 \) and \( s = 10 \):

\[ z = \frac{80 - 72}{10} = \frac{8}{10} = +0.80 \]

The "68" in "68% of data within 1 SD" is a percentage of observations in a band — it has no connection to the z-score of any individual value. Z-scores measure distance from the mean in standard deviation units.

General method for any generated variant:

  1. Read the student's classification (right-skewed / left-skewed / symmetric).
  2. Check the mean vs. median relationship: if mean > median → right-skewed; if mean < median → left-skewed; if mean ≈ median → symmetric.
  3. Check the tail direction in the histogram description: the tail names the skew direction.
  4. If the student's answer contradicts either test, name the specific error (wrong direction, wrong tail named, or mean/median relationship inverted).
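The mean-versus-median test in step 2 can be sketched as follows. The function names, the `tolerance` cut-off for "mean ≈ median", and the sample data are all assumptions made for illustration; the histogram check in step 3 is not automated here.

```python
import statistics

def classify_skew(values, tolerance=0.05):
    """Classify shape using the mean-versus-median test from step 2.

    `tolerance` is an assumed cut-off: gaps smaller than this fraction of
    the standard deviation are treated as "mean ≈ median".
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if abs(mean - median) <= tolerance * spread:
        return "symmetric"
    return "right-skewed" if mean > median else "left-skewed"

def review_answer(values, student_answer):
    """Compare a student's classification with the mean/median test."""
    expected = classify_skew(values)
    if student_answer == expected:
        return "consistent with the mean/median test"
    return f"error: data look {expected}, student said {student_answer}"

# Hypothetical data with one large value, and a student who named the wrong direction.
print(review_answer([1, 2, 3, 4, 100], "left-skewed"))
```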

Independent Practice 5: Multi-Step Synthesis

Bolt diameters: approximately normal, \( \mu = 10.0 \) mm, \( \sigma = 0.05 \) mm. Specification limits: 9.90 mm to 10.10 mm.

(a) Z-scores for the specification limits:

\[ z_{\text{lower}} = \frac{9.90 - 10.0}{0.05} = \frac{-0.10}{0.05} = -2.00 \]

\[ z_{\text{upper}} = \frac{10.10 - 10.0}{0.05} = \frac{0.10}{0.05} = +2.00 \]

(b) The specification limits are exactly \( \mu \pm 2\sigma \). By the Empirical Rule, approximately 95% of bolts are accepted.

(c) A bolt measuring 10.12 mm:

\[ z = \frac{10.12 - 10.0}{0.05} = \frac{0.12}{0.05} = +2.40 \]

Since \( +2.40 > +2.00 \), this bolt exceeds the upper specification limit — it is rejected.

(d) With the process mean shifted to \( \mu = 10.02 \) mm (same \( \sigma = 0.05 \)):

\[ z_{\text{lower}} = \frac{9.90 - 10.02}{0.05} = \frac{-0.12}{0.05} = -2.40 \]

\[ z_{\text{upper}} = \frac{10.10 - 10.02}{0.05} = \frac{0.08}{0.05} = +1.60 \]

The acceptance band is no longer symmetric about the mean (\( z \) ranges from \( -2.40 \) to \( +1.60 \)). This asymmetric band cannot be handled cleanly with the Empirical Rule alone. More bolts now fall outside the upper limit than the lower limit. The mean shift reduces the proportion of accepted bolts — process adjustment is warranted.

Key insight: The Empirical Rule gives clean answers only when the specification limits are exactly \( \mu \pm k\sigma \) for integer \( k \). When the mean shifts, the limits are no longer symmetric z-scores, and exact proportions require a standard normal table (covered in later lessons).
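For readers who want to see the exact proportions before reaching the standard normal table, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). It assumes the bolt diameters are exactly normal, which the problem states only approximately; the function name `accepted_fraction` is hypothetical.

```python
from statistics import NormalDist

def accepted_fraction(mean, sd, lower, upper):
    """Fraction of a normal(mean, sd) process that falls within [lower, upper]."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

centered = accepted_fraction(10.00, 0.05, 9.90, 10.10)  # limits at z = -2.00 and +2.00
shifted = accepted_fraction(10.02, 0.05, 9.90, 10.10)   # limits at z = -2.40 and +1.60

print(f"centred process: {centered:.4f}")
print(f"shifted process: {shifted:.4f}")
```

The centred value is about 95.4%, which the Empirical Rule rounds to 95%; the shifted value is lower, in line with the conclusion in part (d).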

Section 7 — Mastery Check Solutions

Mastery Check 1: Feynman Explanation

Model answer: "Below average" only tells you the direction — the value is on the low side of the mean. It gives no information about how far below average the value is. A value that is 0.01 standard deviations below the mean and a value that is 3 standard deviations below the mean are both "below average," but they have completely different implications.

A z-score of \( -1.5 \) communicates both direction (negative → below the mean) and magnitude (1.5 standard deviations away from the mean). It places the value in context relative to the full spread of the distribution. In an approximately normal distribution, roughly 87% of values fall within 1.5 SDs of the mean, so a z-score of \( -1.5 \) is below average but not extreme. By contrast, "below average" is completely silent on whether the value is unusual or perfectly typical.

The z-score is also unitless, enabling comparisons across datasets measured in different units — something "below average" cannot do.


Mastery Check 2: Apply Question

Correct answer: The analyst should not apply the Empirical Rule — the large gap between mean and median signals a right-skewed distribution, which violates the rule's normality condition.

Reasoning: The mean (\$94,000) is substantially higher than the median (\$68,000) — a \$26,000 gap. This difference signals significant right skew, almost certainly driven by a small number of highly-paid executives pulling the mean upward. The Empirical Rule requires an approximately normal (bell-shaped) distribution. Applying it to this right-skewed salary data would produce misleading estimates.

What the analyst should do instead: Report the five-number summary and IQR, use the median as the measure of centre, and avoid any claim based on the Empirical Rule. If a more precise description is needed, a histogram or box plot would accurately show the skewed shape.

Why the other options are wrong:


Mastery Check 3: Error Analysis

Correct answer: The student divided the raw score by the mean instead of subtracting the mean from the raw score before dividing by the standard deviation.

The student's computation: \( z = 85/70 = 1.21 \) — this divides the score by the mean, which is not the z-score formula.

Correct computation:

\[ z = \frac{x - \bar{x}}{s} = \frac{85 - 70}{10} = \frac{15}{10} = +1.50 \]

The student's conclusion ("I am above average") happens to be directionally correct, but only by coincidence: the ratio \( x/\bar{x} \) is positive for every positive score, so this method would never produce a negative value for a score below the mean. The numerical value is wrong, and the formula used is invalid.
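A short comparison makes the difference between the two calculations concrete. This is a sketch only: the function names are hypothetical, and the second score (60) is an invented value added to show what happens below the mean.

```python
def student_ratio(x, mean):
    """The student's invalid method: score divided by the mean."""
    return x / mean

def z_score(x, mean, sd):
    """The correct method: subtract the mean, then divide by the SD."""
    return (x - mean) / sd

MEAN, SD = 70, 10  # values from the mastery check

for score in (85, 60):  # 60 is a hypothetical below-average score
    ratio = student_ratio(score, MEAN)
    z = z_score(score, MEAN, SD)
    print(f"score {score}: ratio = {ratio:.2f}, z = {z:+.2f}")
```

For the below-average score the ratio is still positive while the z-score is negative, so only the z-score carries direction.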

Why the other options are wrong:

Section 8 — Boss Fight Solutions

Path A — Exam Analyst

Given: \( \mu = 68 \), \( \sigma = 11 \). Students: 89, 45, 71.

Task 1 — Z-scores for all three students:

\[ z_1 = \frac{89 - 68}{11} = \frac{21}{11} \approx +1.91 \]

\[ z_2 = \frac{45 - 68}{11} = \frac{-23}{11} \approx -2.09 \]

\[ z_3 = \frac{71 - 68}{11} = \frac{3}{11} \approx +0.27 \]

Sanity check: Student 3 (71) is just above the mean (68), so a small positive z is expected. Student 1 (89) is well above the mean; Student 2 (45) is well below. Signs and magnitudes are consistent.

Task 2 — Empirical Rule and the 60 threshold:

\[ z_{60} = \frac{60 - 68}{11} = \frac{-8}{11} \approx -0.73 \]

This is not a whole-number standard deviation boundary — 60 is approximately 0.73 SDs below the mean. The Empirical Rule gives clean percentages only at exactly 1, 2, and 3 SDs. At \( -0.73 \) SDs we can only say: more than 16% (the one-SD lower tail) but less than 50% of students scored below 60. A precise estimate requires a standard normal table. The Empirical Rule alone is insufficient here — you can state a bound but not an exact estimate.
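The precise estimate mentioned above can be previewed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch assumes exam scores are exactly normal with the stated mean and standard deviation, which is an idealisation.

```python
from statistics import NormalDist

exam = NormalDist(mu=68, sigma=11)   # assumes scores are approximately normal
below_60 = exam.cdf(60)              # estimated P(score < 60)

print(f"Estimated share below 60: {below_60:.1%}")
```

The result is about 23%, which sits between the 16% and 50% bounds obtained from the Empirical Rule.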

Task 3 — Ranking by unusualness:

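Unusualness is measured by \( |z| \) regardless of direction:

  1. Student 2 (45): \( |z| \approx 2.09 \), most unusual.
  2. Student 1 (89): \( |z| \approx 1.91 \).
  3. Student 3 (71): \( |z| \approx 0.27 \), least unusual.

In an approximately normal distribution, scores beyond \( |z| = 2 \) occur in roughly 5% of cases. Student 2 is just past that threshold and Student 1 is just short of it, so both performances are in the tails of the distribution and may warrant attention.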
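Ranking by \( |z| \) is a sort on the absolute value of each z-score. A minimal sketch, using the Task 1 values and illustrative variable names:

```python
MU, SIGMA = 68, 11
scores = {"Student 1": 89, "Student 2": 45, "Student 3": 71}

# Signed z-scores, as in Task 1.
z_scores = {name: (x - MU) / SIGMA for name, x in scores.items()}

# Sort by distance from the mean, ignoring direction; most unusual first.
ranking = sorted(z_scores, key=lambda name: abs(z_scores[name]), reverse=True)

for name in ranking:
    print(f"{name}: z = {z_scores[name]:+.2f}, |z| = {abs(z_scores[name]):.2f}")
```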

Task 4 — Memo to the professor (model content):

The three students have z-scores of approximately \( +1.91 \), \( -2.09 \), and \( +0.27 \). Student 3's score (71) is essentially average (\( z \approx +0.27 \)) and requires no special attention. Student 1's score (89, \( z \approx +1.91 \)) represents strong performance: it sits near the top of the class, just inside the \( \pm 2 \) SD band that contains approximately 95% of scores. Student 2's score (45, \( z \approx -2.09 \)) is just over two standard deviations below the mean, which places it in roughly the bottom 2.5% of the class by the Empirical Rule, and may warrant academic support. A grade review based on raw scores alone would miss that Student 2 is performing about as unusually poorly as Student 1 is unusually well, just in opposite directions.


Path B — Process Architect

Task 1 — Empirical Rule applicability:

Task 2 — Z-scores for Line A specification limits:

Line A: \( \bar{x} = 8.0 \) mm, \( s = 0.3 \) mm. Limits: 7.1 mm to 8.9 mm.

\[ z_{\text{lower}} = \frac{7.1 - 8.0}{0.3} = \frac{-0.9}{0.3} = -3.00 \]

\[ z_{\text{upper}} = \frac{8.9 - 8.0}{0.3} = \frac{0.9}{0.3} = +3.00 \]

The limits are exactly \( \bar{x} \pm 3s \). By the Empirical Rule, approximately 99.7% of Line A tiles pass the specification.

Task 3 — Flag outliers on Line A:

Tile measuring 8.75 mm:

\[ z = \frac{8.75 - 8.0}{0.3} = \frac{0.75}{0.3} = +2.50 \]

Is it inside specification? The upper limit is 8.9 mm and \( 8.75 < 8.9 \), so yes, it passes. Is it statistically unusual? \( |z| = 2.50 > 2 \), so yes, it is in the outer 5% of the distribution. The tile passes quality control but is flagged as a statistical outlier. Continued monitoring of tiles in this range is warranted.
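The two questions in this task, whether the tile is within specification and whether it is statistically unusual, are independent checks. The sketch below keeps them separate; the function name `assess` and the default cut-off of \( |z| > 2 \) are assumptions for illustration.

```python
def assess(x, mean, sd, lower, upper, z_cutoff=2.0):
    """Report the z-score, the specification check, and the unusualness check."""
    z = (x - mean) / sd
    return {
        "z": round(z, 2),
        "in_spec": lower <= x <= upper,    # engineering question
        "unusual": abs(z) > z_cutoff,      # statistical question
    }

# Line A: mean 8.0 mm, s = 0.3 mm, limits 7.1 mm to 8.9 mm.
result = assess(8.75, 8.0, 0.3, 7.1, 8.9)
print(result)
```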

Task 4 — Engineering recommendation (model content):

Line A's output is approximately normally distributed and the Empirical Rule applies reliably — approximately 99.7% of tiles fall within specification, and quality reporting can use this estimate with confidence. Line B's output is right-skewed (mean substantially above median, long right tail) and the Empirical Rule cannot be applied. Quality reports for Line B that assume normality will underestimate the proportion of panels exceeding the upper specification limit. To better assess Line B: (1) collect a larger sample and plot a histogram to confirm shape; (2) compute the exact proportion of panels outside specification from the sample directly; (3) investigate the root cause of the right tail — are thick panels coming from a specific batch, shift, or raw material lot? Until Line B's distribution is better understood, use the sample proportion within limits rather than any rule-based approximation.

Section 9 — Challenge Problem Solutions

Challenge 1: Why Does Right-Skew Imply Mean > Median? (Variant Bank)

Variant A — Concrete example: Dataset: {1, 2, 3, 4, 100}.

Mean: \( \bar{x} = (1 + 2 + 3 + 4 + 100)/5 = 110/5 = 22 \). Median: the 3rd value in the sorted list = 3. Mean (22) > Median (3) — confirmed.

The value 100 is responsible. The mean includes 100 in its arithmetic average — one extreme value pulls the entire sum upward by 100 units, which divides across all 5 observations, shifting the mean substantially. The median, however, is determined solely by the middle rank — the value 100 has the same influence on the median as the value 4 would; only its rank matters, not its magnitude. One sentence: the extreme value (100) inflates the mean by contributing its full size to the sum, but only contributes one rank to the median, so it can displace the median by at most one position.
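The effect of the extreme value can be checked by swapping it for a modest one and recomputing both measures. In this sketch the second dataset, with 5 in place of 100, is a hypothetical comparison that is not part of the original problem.

```python
import statistics

data = [1, 2, 3, 4, 100]
tamed = [1, 2, 3, 4, 5]   # hypothetical: same ranks, no extreme value

for label, values in (("with 100", data), ("with 5", tamed)):
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{label}: mean = {mean}, median = {median}")
```

The median is 3 in both datasets, while the mean moves from 3 to 22 once the largest value becomes 100.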

Variant B — Algebraic argument:

For odd \( n \), the median is \( x_{(n+1)/2} \). The mean is:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n}\left( \sum_{i=1}^{n-1} x_i + x_n \right) \]

Rearranging: \( n\bar{x} = \sum_{i=1}^{n-1} x_i + x_n \). For \( \bar{x} > x_{(n+1)/2} \), we need \( \sum_{i=1}^{n-1} x_i + x_n > n \cdot x_{(n+1)/2} \). Hold \( x_1, \dots, x_{n-1} \) fixed and let only the largest value \( x_n \) grow; for \( n \geq 3 \) the middle-rank value \( x_{(n+1)/2} \) is not \( x_n \), so the median does not change. As \( x_n \to \infty \), the left side grows without bound while the right side (which involves only the fixed middle-rank value) stays finite. Therefore, for sufficiently large \( x_n \), the mean exceeds the median. \( \square \)

Formally: since \( \sum_{i=1}^{n-1} x_i \) is fixed and finite, and \( x_n \) is unbounded, \( n\bar{x} = \text{(fixed)} + x_n \to \infty \), while \( n \cdot x_{(n+1)/2} \) is bounded. The inequality \( \bar{x} > x_{(n+1)/2} \) must hold for large enough \( x_n \).
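As a worked check of this inequality (an illustration added here, using the Variant A dataset with the largest value \( x_5 \) left free), take \( n = 5 \) and \( x_1, \dots, x_4 = 1, 2, 3, 4 \), so the median is \( x_{(3)} = 3 \):

\[ \sum_{i=1}^{4} x_i + x_5 > 5 \cdot x_{(3)} \iff 10 + x_5 > 15 \iff x_5 > 5 \]

Any largest value above 5 therefore puts the mean above the median. With \( x_5 = 100 \) the mean is \( 110/5 = 22 > 3 \), matching Variant A; with \( x_5 = 5 \) the data are symmetric and mean and median are both 3.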

Variant C — Moment / balance-point intuition:

The mean is the balance point of the distribution — place each data value as a weight at its position on the number line, and the mean is where the see-saw balances. The median is the point that splits the number line into two halves with equal numbers of observations on each side.

In a right-skewed distribution, there are a few very large values far to the right of the bulk of the data. These extreme weights create a large clockwise torque (rightward pull) on the see-saw. To balance, the fulcrum (mean) must shift rightward past the median. The median does not move because it only counts observations by rank — the extreme values on the right are already "counted" once each, no matter how large they are. Result: mean > median in right-skewed distributions.


Challenge 2: Z-Scores Preserve Relative Order

(a) Proof that \( z_1 > z_2 \) when \( x_1 > x_2 \):

Given \( x_1 > x_2 \) and \( s > 0 \):

\[ z_1 - z_2 = \frac{x_1 - \bar{x}}{s} - \frac{x_2 - \bar{x}}{s} = \frac{(x_1 - \bar{x}) - (x_2 - \bar{x})}{s} = \frac{x_1 - x_2}{s} \]

Since \( x_1 > x_2 \), we have \( x_1 - x_2 > 0 \). Since \( s > 0 \), the fraction \( \dfrac{x_1 - x_2}{s} > 0 \). Therefore \( z_1 - z_2 > 0 \), which means \( z_1 > z_2 \). \( \square \)

(b) Why order preservation is necessary:

If standardizing reversed the order of any pair of values, a higher z-score would not reliably indicate a higher relative position in the distribution. Z-scores would be useless for ranking or comparing observations across datasets — the very purpose of standardization would be defeated.

(c) What happens when \( s = 0 \):

If \( s = 0 \), every value in the dataset is identical: \( x_i = \bar{x} \) for all \( i \). The formula \( z = (x - \bar{x})/s \) would require dividing by zero, which is undefined. This case is degenerate — a dataset with zero spread has no variation, every observation is at the mean, and the concept of "relative position" is meaningless. The preservation property requires \( s > 0 \); when \( s = 0 \), z-scores cannot be defined.
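Both the order-preservation property and the \( s = 0 \) case can be illustrated numerically. The sketch below is a check on one hypothetical dataset, not a substitute for the proof in part (a); the function name `standardize` is an assumption.

```python
import statistics

def standardize(values):
    """Return sample z-scores; refuse datasets with zero spread."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)       # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("z-scores are undefined when s = 0")
    return [(x - mean) / s for x in values]

data = [71, 45, 89, 60]                # hypothetical scores
z = standardize(data)

# Ordering the positions by raw value and by z-score should give the same result.
order_raw = sorted(range(len(data)), key=lambda i: data[i])
order_z = sorted(range(len(data)), key=lambda i: z[i])
print(order_raw == order_z)
```

Calling `standardize([5, 5, 5])` raises the `ValueError`, which mirrors the degenerate case described in part (c).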