INF-5 Solutions: Hypothesis Testing for a Population Mean

Section 5 — Guided Practice Solutions

Problem 1 — Setting Up Hypotheses and Computing z (Variants 0–2)

Variant 0 (bottling plant, \( \mu_0 = 750 \) mL, \( \sigma = 8 \) mL, \( n = 64 \), \( \bar{x} = 747.5 \) mL):

Variant 1 (delivery company, \( \mu_0 = 3.0 \) days, \( \sigma = 0.6 \) days, \( n = 36 \), \( \bar{x} = 3.2 \) days):

Variant 2 (bag fill, \( \mu_0 = 1000 \) g, \( \sigma = 15 \) g, \( n = 100 \), \( \bar{x} = 1003 \) g):

Common mistakes: (1) Writing hypotheses in terms of \( \bar{x} \) instead of the population parameter \( \mu \). (2) Forgetting to divide \( \sigma \) by \( \sqrt{n} \) when computing SE — using \( \sigma \) directly as the denominator of z. (3) Reversing \( H_0 \) and \( H_a \) (H₀ always uses "=").
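As a quick check on the arithmetic, the standard error and z for the three variants can be computed directly from the values in the variant headers. This is a minimal sketch; the helper name `z_statistic` is illustrative and not part of the lesson materials.

```python
import math

def z_statistic(x_bar, mu0, sigma, n):
    """Return (standard error, z) for a one-sample z-test with known sigma."""
    se = sigma / math.sqrt(n)   # divide sigma by sqrt(n); do not use sigma itself
    z = (x_bar - mu0) / se
    return se, z

# (x_bar, mu0, sigma, n) copied from the variant headers above
variants = {
    0: (747.5, 750, 8, 64),
    1: (3.2, 3.0, 0.6, 36),
    2: (1003, 1000, 15, 100),
}

results = {k: z_statistic(*v) for k, v in variants.items()}
for k, (se, z) in results.items():
    print(f"Variant {k}: SE = {se:.3f}, z = {z:.2f}")
```

The resulting z values match those carried into Problem 2 (-2.50, 2.00, 2.00).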


Problem 2 — p-value and Decision (Variants 0–2)

Variant 0 (\( z = -2.50 \), two-tailed, \( \alpha = 0.05 \)):

Variant 1 (\( z = 2.00 \), two-tailed, \( \alpha = 0.05 \)):

Variant 2 (\( z = 2.00 \), two-tailed, \( \alpha = 0.01 \)):

Common mistakes: (1) Reading the one-tail area and forgetting to multiply by 2 for a two-tailed test. (2) Writing "accept \( H_0 \)" — the correct language is "fail to reject \( H_0 \)." (3) Comparing \( z \) to \( \alpha \) instead of comparing \( p \) to \( \alpha \). (4) Concluding that a negative \( z \) automatically means rejection.
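The p-value and decision for each variant can be reproduced with Python's standard library. The sketch below assumes the standard normal CDF from `statistics.NormalDist`; the function names `two_tailed_p` and `decide` are illustrative only.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value: double the tail area beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p, alpha):
    # The comparison is p against alpha, never z against alpha
    return "reject H0" if p < alpha else "fail to reject H0"

# (z, alpha) copied from the variant headers above
cases = [(-2.50, 0.05), (2.00, 0.05), (2.00, 0.01)]
decisions = [decide(two_tailed_p(z), alpha) for z, alpha in cases]

for (z, alpha), decision in zip(cases, decisions):
    print(f"z = {z:+.2f}, alpha = {alpha}: p = {two_tailed_p(z):.4f} -> {decision}")
```

Variants 1 and 2 share the same z and p but reach different decisions, because only \( \alpha \) changes.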


Problem 3 — Choosing the Test Form (Traffic Speed)

Correct answer: Right-tailed test. \( H_0: \mu = 110 \); \( H_a: \mu > 110 \) km/h.

Justification: The traffic planner's concern is specifically whether speeds have increased beyond 110 km/h, which is a directional question. The test form must be chosen from the research question before examining the data. A two-tailed test would waste power: it splits \( \alpha \) across both tails, placing half of the rejection region at speeds below 110 km/h, which is irrelevant to the planner's concern. A left-tailed test would look in exactly the wrong direction.

Common mistake: Choosing "two-tailed because the sample could go either way." The test type is determined by the research question, not by what the data might show. Choosing the direction after seeing the data is data snooping — it inflates the true Type I error rate.


Problem 4 — Classifying Errors (Drug Trial)

Correct answer: Type II error — the company failed to reject a false null hypothesis.

Justification: The true mean recovery time is 11 days, so \( H_0: \mu = 14 \) is false. The company's test failed to detect this. By definition, failing to reject a false null is a Type II error (\( \beta \)). This is not a Type I error (which requires rejecting a true null). Note also: the company "failed to reject" — never "accepted" — \( H_0 \).

Common mistakes: (1) Confusing which error is which — Type I = false alarm (reject true \( H_0 \)); Type II = missed signal (fail to reject false \( H_0 \)). (2) Saying "no error occurred because they followed the rule" — the rule can still produce an error. (3) Writing "accepted \( H_0 \)" — always "failed to reject."

Section 6 — Independent Practice Solutions

Problem 1 — Two-Tailed Generator (generateTwoTailTest)

Generated values differ for each student. The five-step approach is always the same. Example with typical parameters (\( \mu_0 = 120 \), \( \sigma = 18 \), \( n = 64 \), \( \bar{x} = 124.5 \), \( \alpha = 0.05 \)):

  1. Hypotheses: \( H_0: \mu = 120 \); \( H_a: \mu \neq 120 \) (two-tailed).
  2. Conditions: \( n = 64 \geq 30 \) ✓; \( \sigma \) known ✓. CLT applies.
  3. Test statistic: \( \text{SE} = 18/\sqrt{64} = 18/8 = 2.25 \). \[ z = \frac{124.5 - 120}{2.25} = \frac{4.5}{2.25} = 2.00 \]
  4. p-value: \( p = 2(1 - \Phi(2.00)) = 2(0.0228) = 0.0456 \).
  5. Conclusion: \( p = 0.0456 < 0.05 \). We reject \( H_0 \). There is sufficient evidence at the 5% level to conclude the mean differs from 120. (Student-facing generated values will differ; the five steps and conclusion logic are identical.)

Problem 2 — One-Tailed Test Direction (Variants 0–2)

Variant 0 (battery life, \( \mu_0 = 500 \) h, \( \sigma = 40 \) h, \( n = 49 \), \( \bar{x} = 488 \) h, \( \alpha = 0.05 \)):

Variant 1 (factory emissions, \( \mu_0 = 80 \) ppm, \( \sigma = 12 \) ppm, \( n = 64 \), \( \bar{x} = 84 \) ppm, \( \alpha = 0.01 \)):

Variant 2 (meal plan, \( \mu_0 = 2000 \) kcal, \( \sigma = 150 \) kcal, \( n = 100 \), \( \bar{x} = 1975 \) kcal, \( \alpha = 0.05 \)):

Common mistakes: (1) Using the right-tail formula \( 1 - \Phi(z) \) for a left-tailed test when \( z \) is negative — for a left-tailed test, \( p = \Phi(z) \), which equals \( 1 - \Phi(|z|) \) when \( z < 0 \). (2) Choosing the test direction after seeing the data (data snooping). (3) Using \( H_0: \mu \geq 500 \) instead of the point equality \( H_0: \mu = 500 \).


Problem 3 — One-Tailed Generator (generateOneTailTest)

Generated values differ for each student. Example with typical parameters (\( \mu_0 = 75 \), \( \sigma = 12 \), \( n = 81 \), \( \bar{x} = 72 \), left-tailed, \( \alpha = 0.05 \)):

  1. Hypotheses: \( H_0: \mu = 75 \); \( H_a: \mu < 75 \) (left-tailed — the scenario specifies the mean has decreased).
  2. Conditions: \( n = 81 \geq 30 \) ✓; \( \sigma \) known ✓.
  3. Test statistic: \( \text{SE} = 12/\sqrt{81} = 12/9 = 1.\overline{3} \). \[ z = \frac{72 - 75}{12/9} = \frac{-3}{1.\overline{3}} = -2.25 \]
  4. p-value: Left-tail: \( p = P(Z < -2.25) = 1 - \Phi(2.25) = 1 - 0.9878 = 0.0122 \).
  5. Conclusion: \( p = 0.0122 < 0.05 \). We reject \( H_0 \). (Student-facing generated values will differ.)
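The tail direction determines which side of the normal curve supplies the p-value. The following sketch checks the example above; `one_tailed_p` is a hypothetical helper and is not the lesson's `generateOneTailTest`.

```python
from statistics import NormalDist
import math

def one_tailed_p(z, direction):
    """p-value for a one-tailed z-test; direction is 'left' or 'right'."""
    phi = NormalDist().cdf(z)
    if direction == "left":
        return phi        # P(Z < z)
    if direction == "right":
        return 1 - phi    # P(Z > z)
    raise ValueError("direction must be 'left' or 'right'")

# Example parameters from the worked solution above
se = 12 / math.sqrt(81)
z = (72 - 75) / se
p_left = one_tailed_p(z, "left")
p_wrong_tail = one_tailed_p(z, "right")   # the wrong tail for this test

print(f"z = {z:.2f}, left-tail p = {p_left:.4f}, wrong-tail p = {p_wrong_tail:.4f}")
```

Using the wrong tail gives a p-value near 0.99 instead of 0.012, which would reverse the decision.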

Problem 4 — Error Classification and the \( \alpha \)–\( \beta \) Trade-off (Variants 0–2)

Variant 0 (sodium content, fail to reject \( H_0 \), true mean = 430 mg):

Variant 1 (prosecutor, reject \( H_0 \), defendant was actually compliant):

Variant 2 (quality engineer, reject \( H_0 \), true mean had not shifted):

Common mistakes: (1) Confusing which error occurred based on the test outcome vs. the true state of \( H_0 \) — always check both. (2) Thinking that decreasing \( \alpha \) fixes all errors — it only moves the trade-off. (3) Saying "increasing \( \alpha \) reduces Type II errors" without noting the cost: higher Type I error rate.


Problem 5 — Multi-Step Synthesis: News Headline Interpretation

(a) The journalist's error: "No evidence of an effect" is not the same as "evidence of no effect." Failing to reject \( H_0 \) means only that the data are consistent with \( H_0 \) — it does not prove \( H_0 \) is true. A Type II error may have occurred: the method could be effective but the study lacked sufficient power to detect it. Correct interpretation: "There is insufficient evidence at the chosen significance level to conclude that the new teaching method improves test scores."

(b) Concern with \( n = 15 \): With only 15 students, the standard error \( \sigma/\sqrt{15} \) is large. A true effect would need to be very large to yield a significant test statistic. This means the test has low power — a high probability \( \beta \) of missing a real improvement. If the method works but the effect is modest, the study almost certainly could not detect it. Additionally, with \( n = 15 < 30 \) and unknown \( \sigma \), the z-test is inappropriate; a t-test (covered in INF-6) should be used.

(c) Multiple comparisons / p-hacking: Running 20 independent tests each at \( \alpha = 0.05 \) creates a familywise Type I error rate of approximately \( 1 - (0.95)^{20} \approx 0.64 \). Even if no real effect exists, about 1 of the 20 tests (\( 20 \times 0.05 = 1 \)) is expected to come out significant by chance alone. If only significant results are reported, a reader sees only the "successes" and cannot tell how many of them are pure false positives. This selective reporting is a form of data snooping and severely undermines the validity of reported findings.
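The familywise rate in part (c) follows from the complement rule for independent tests. A minimal sketch, assuming every null hypothesis is true and the tests are independent (the function name is illustrative):

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** m

fwer = familywise_error_rate(0.05, 20)
expected_false_positives = 20 * 0.05   # expected count of chance "discoveries"

print(f"Familywise error rate: {fwer:.3f}")
print(f"Expected false positives: {expected_false_positives:.1f}")
```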

Common mistakes: (1) Treating "fail to reject \( H_0 \)" as proof that the teaching method does not work. (2) Overlooking power concerns when sample size is small. (3) Not recognizing that running many tests inflates the overall false-positive rate.

Section 7 — Mastery Check Solutions

Question 1 — Feynman Test: What is a p-value?

A p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true. A small p-value means the data would be very surprising in a null-hypothesis world — strong evidence against \( H_0 \). A large p-value means the data are consistent with \( H_0 \).

What a p-value does NOT tell you:


Question 2 — Apply: Lake pH Test (\( \sigma = 0.4 \), \( n = 25 \), \( \bar{x} = 6.88 \))

Part A — Correct alternative: \( H_a: \mu \neq 7.0 \) (two-tailed). The water authority wants to know whether pH differs from 7.0 in either direction — not just whether it decreased. Although the sample mean is below 7.0, choosing the direction after seeing the data would be data snooping. The question was set as non-directional before sampling.

Part B — Full five-step solution:

  1. \( H_0: \mu = 7.0 \); \( H_a: \mu \neq 7.0 \) (two-tailed).
  2. Conditions: \( n = 25 \). Note \( n < 30 \), but we proceed with z since \( \sigma \) is known and the population is approximately normal. (With unknown \( \sigma \) and small \( n \), INF-6 requires a t-test.)
  3. \( \text{SE} = 0.4/\sqrt{25} = 0.4/5 = 0.08 \) units. \[ z = \frac{6.88 - 7.0}{0.08} = \frac{-0.12}{0.08} = -1.50 \]
  4. \( p = 2P(Z > 1.50) = 2(1 - \Phi(1.50)) = 2(1 - 0.9332) = 2(0.0668) = 0.1336 \).
  5. \( p = 0.1336 > \alpha = 0.05 \). We fail to reject \( H_0 \). There is insufficient evidence at the 5% significance level to conclude that the lake's mean pH differs from 7.0.

Common mistake: Seeing \( \bar{x} = 6.88 < 7.0 \) and switching to a left-tailed test. The direction of \( H_a \) must be set before examining the data. Using "less than" after seeing the low sample mean is data snooping and invalidates the test.


Question 3 — Error Analysis: "Proves the null is false"

Error 1 — "Proves": A hypothesis test never proves anything with certainty. Rejecting \( H_0 \) means only that the data are statistically unlikely under \( H_0 \). A test at \( \alpha = 0.05 \) still rejects a true \( H_0 \) in about 1 of every 20 such tests, so this rejection could be a Type I error. Correct language: "There is sufficient evidence at the 5% level to reject \( H_0 \)," not "proves."

Error 2 — "Definitely works": Statistical significance (\( p < \alpha \)) means only that data this extreme would be unlikely if \( H_0 \) were true. It says nothing about whether the effect is large or practically meaningful. With a very large sample, even a trivially small improvement could yield \( p = 0.049 \). The researcher needs to report effect size alongside the p-value.

Corrected statement: "There is sufficient evidence at the 5% significance level to reject \( H_0 \) and conclude that the study technique has a statistically detectable effect on exam scores. However, statistical significance does not imply the effect is large or practically important — the magnitude of improvement should also be reported."

Section 8 — Boss Fight Solutions

Path A — The Auditor: Grant Processing Times

(\( \mu_0 = 30 \) days, \( \sigma = 8 \) days, \( n = 64 \), \( \bar{x} = 32.5 \) days, \( \alpha = 0.01 \))

Task 1 — Hypotheses and justification for one-tailed test:

\( H_0: \mu = 30 \) days; \( H_a: \mu > 30 \) days (right-tailed).

The audit mandate is specifically to detect whether processing times exceed the target — not whether they differ in any direction. A one-tailed right test concentrates all the statistical power in the relevant direction. A two-tailed test would split the rejection region across both tails, reducing power to detect the specific problem of concern (delays).

Task 2 — Conditions and test statistic:

Conditions: \( n = 64 \geq 30 \) ✓; \( \sigma = 8 \) days known ✓. CLT applies.

\( \text{SE} = 8/\sqrt{64} = 8/8 = 1.0 \) day.

\[ z = \frac{32.5 - 30}{1.0} = \frac{2.5}{1.0} = 2.50 \]

Task 3 — p-value, decision, and audit conclusion:

Right-tailed: \( p = P(Z > 2.50) = 1 - \Phi(2.50) = 1 - 0.9938 = 0.0062 \).

\( p = 0.0062 < \alpha = 0.01 \). We reject \( H_0 \).

Audit report sentence: "Based on a random sample of 64 grant applications, there is sufficient evidence at the 1% significance level to conclude that the mean processing time exceeds the 30-day target (\( z = 2.50 \), \( p = 0.0062 \))."

Task 4 — \( \alpha \) reduction and Type II error consequence:

Reducing \( \alpha \) from 0.01 to 0.001 raises the critical value from \( z^* \approx 2.33 \) to \( z^* \approx 3.09 \). With the current data (\( z = 2.50 \)), the test would fail to reject at \( \alpha = 0.001 \) because \( 2.50 < 3.09 \). If the delay is real, it would go undetected: a Type II error.
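The two critical values can be recovered from the inverse normal CDF. This is a small sketch using `statistics.NormalDist.inv_cdf`; the helper name `right_tail_critical` is an assumption for illustration.

```python
from statistics import NormalDist

def right_tail_critical(alpha):
    """Critical value z* with area alpha to its right under the standard normal."""
    return NormalDist().inv_cdf(1 - alpha)

z_observed = 2.50   # test statistic from Task 2
outcomes = {}
for alpha in (0.01, 0.001):
    z_star = right_tail_critical(alpha)
    outcomes[alpha] = z_observed > z_star
    verdict = "reject H0" if outcomes[alpha] else "fail to reject H0"
    print(f"alpha = {alpha}: z* = {z_star:.2f} -> {verdict}")
```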

Cost reasoning: A Type I error (falsely flagging a compliant department) carries reputational and legal costs. A Type II error (missing a real delay problem) means public money is wasted and grant applicants are harmed. If the cost of undetected delays is judged high, lowering \( \alpha \) to 0.001 is not advisable — it increases \( \beta \) and makes the audit less likely to catch genuine problems. \( \alpha = 0.01 \) is already a demanding standard for public audit use.

Common mistakes: (1) Choosing a two-tailed test for an audit with a clear directional mandate. (2) Thinking that lower \( \alpha \) always makes a test "better" — it makes it stricter for Type I errors while increasing \( \beta \). (3) Writing "we accept \( H_0 \)" in any scenario — always "we reject" or "we fail to reject."


Path B — The Designer: Manufacturing Diameter Check

(\( \mu_0 = 25.00 \) mm, \( \sigma = 0.06 \) mm, \( n = 36 \), \( \bar{x} = 25.012 \) mm; Type I cost = €500, Type II cost = €5,000)

Task 1 — Full five-step test at \( \alpha = 0.05 \):

  1. \( H_0: \mu = 25.00 \) mm; \( H_a: \mu \neq 25.00 \) mm (two-tailed — both too small and too large are defects).
  2. Conditions: \( n = 36 \geq 30 \) ✓; \( \sigma = 0.06 \) known ✓.
  3. \( \text{SE} = 0.06/\sqrt{36} = 0.06/6 = 0.01 \) mm. \[ z = \frac{25.012 - 25.00}{0.01} = \frac{0.012}{0.01} = 1.20 \]
  4. \( p = 2P(Z > 1.20) = 2(1 - 0.8849) = 2(0.1151) = 0.2302 \).
  5. \( p = 0.2302 > 0.05 \). We fail to reject \( H_0 \). There is insufficient evidence at the 5% level to conclude the mean diameter differs from 25.00 mm.

Task 2 — Error type (true mean = 25.015 mm):

\( H_0: \mu = 25.00 \) is false (true mean = 25.015 ≠ 25.00). The test failed to reject a false null hypothesis. This is a Type II error (\( \beta \)).

In the 2×2 truth table: "Fail to reject \( H_0 \) | \( H_0 \) is false" = Type II error cell. The procedure was correct; the test simply lacked sufficient power to detect a true shift of only 0.015 mm, which is 1.5 SE above the null (\( 0.015/0.01 = 1.5 \)). The observed sample mean happened to land only 1.2 SE above the null.
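To put a number on "lacked sufficient power," the power of the two-tailed test at the true mean can be estimated from the normal model. The sketch below assumes \( \alpha = 0.05 \) and the Task 1 setup; `two_tailed_power` is a hypothetical helper.

```python
from statistics import NormalDist
import math

def two_tailed_power(mu_true, mu0, sigma, n, alpha):
    """Probability that a two-tailed z-test rejects H0 when the true mean is mu_true."""
    std_normal = NormalDist()
    se = sigma / math.sqrt(n)
    shift = (mu_true - mu0) / se                  # true shift measured in SE units
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    upper = 1 - std_normal.cdf(z_crit - shift)    # chance of landing in the upper rejection region
    lower = std_normal.cdf(-z_crit - shift)       # chance of landing in the lower rejection region
    return upper + lower

power = two_tailed_power(25.015, 25.00, 0.06, 36, 0.05)
beta = 1 - power
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Under these assumptions the test detects the shift only about one time in three, so a Type II error is the more likely outcome.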

Task 3 — \( \alpha \) direction and cost trade-off:

To reduce \( \beta \) (Type II error), increase \( \alpha \). A higher \( \alpha \) lowers the rejection threshold, making it easier to detect true process shifts.

Cost asymmetry: Type I error (rejecting a good batch) costs €500. Type II error (missing a defective batch) costs €5,000 — 10× more. This strongly favors a higher \( \alpha \) (e.g., 0.10) to reduce the costly missed defects, accepting more frequent but inexpensive false alarms (€500). The team should tolerate higher Type I error rate to protect against the €5,000 Type II consequence.

Task 4 — Sample size for 80% power to detect a 0.01 mm shift:

Parameters: \( \delta = 0.01 \) mm, \( \sigma = 0.06 \) mm, two-tailed at \( \alpha = 0.05 \) so \( z_{\alpha/2} = z_{0.025} = 1.96 \); 80% power so \( z_{\beta} = z_{0.20} = 0.842 \).

\[ n = \frac{(z_{\alpha/2} + z_{\beta})^2 \sigma^2}{\delta^2} = \frac{(1.96 + 0.842)^2 \times (0.06)^2}{(0.01)^2} \] \[ = \frac{(2.802)^2 \times 0.0036}{0.0001} = \frac{7.851 \times 0.0036}{0.0001} = \frac{0.028264}{0.0001} = 282.6 \]

Round up: \( n = \mathbf{283} \) parts.

The current \( n = 36 \) provides very low power to detect a 0.01 mm shift. Achieving 80% power requires roughly 8× as many measurements.
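The sample-size calculation can be checked with exact normal quantiles instead of table values. This is a minimal sketch of the same formula; `sample_size_two_tailed` is an illustrative name.

```python
from statistics import NormalDist
import math

def sample_size_two_tailed(delta, sigma, alpha, power):
    """Return (exact n, n rounded up) for a two-tailed z-test to detect a shift delta."""
    std_normal = NormalDist()
    z_half_alpha = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)                 # about 0.842 for 80% power
    n_exact = ((z_half_alpha + z_beta) * sigma / delta) ** 2
    return n_exact, math.ceil(n_exact)                 # always round up

n_exact, n_required = sample_size_two_tailed(0.01, 0.06, 0.05, 0.80)
print(f"n = {n_exact:.1f}, rounded up to {n_required}")
```

Exact quantiles give a value within a fraction of a part of the hand calculation, and the rounded-up answer is unchanged.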

Common mistakes: (1) Confusing \( z_{\alpha} \) with \( z_{\alpha/2} \): at \( \alpha = 0.05 \), the two-tailed formula uses \( z_{\alpha/2} = z_{0.025} = 1.96 \), whereas a one-tailed test would use \( z_{\alpha} = z_{0.05} = 1.645 \). (2) Rounding the sample size down (282 would give slightly less than 80% power; always round up). (3) Concluding that increasing \( \alpha \) is always bad; in an asymmetric cost context, a higher \( \alpha \) is the rational choice.

Section 9 — Challenge Problem Solutions

Challenge 1 — Critical-Value Approach (Variants 0–2)

Variant 0 (postal service, \( \mu_0 = 2.0 \) days, \( \sigma = 0.5 \), \( n = 100 \), \( \bar{x} = 2.09 \), \( \alpha = 0.05 \) two-tailed):

Variant 1 (widget weight, \( \mu_0 = 50 \) g, \( \sigma = 4 \), \( n = 64 \), \( \bar{x} = 51.2 \), \( \alpha = 0.01 \) two-tailed):

Variant 2 (tablet weight, \( \mu_0 = 200 \) mg, \( \sigma = 10 \), \( n = 81 \), \( \bar{x} = 203 \), \( \alpha = 0.05 \) two-tailed):

Key insight: The critical-value approach and the p-value approach always yield the same decision. They are mathematically equivalent — use whichever form the problem requests. The p-value approach provides more information (how extreme the result is relative to \( \alpha \)), while the critical-value approach gives a direct comparison in z-units.
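The agreement between the two approaches can be checked numerically for the three variants. The sketch below uses the values from the variant headers; `both_approaches` is a hypothetical helper written for this check.

```python
from statistics import NormalDist
import math

std_normal = NormalDist()

def both_approaches(x_bar, mu0, sigma, n, alpha):
    """Run a two-tailed z-test by critical value and by p-value; return both decisions."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p = 2 * (1 - std_normal.cdf(abs(z)))
    return z, z_crit, p, abs(z) > z_crit, p < alpha

# (x_bar, mu0, sigma, n, alpha) copied from the variant headers above
variants = [
    (2.09, 2.0, 0.5, 100, 0.05),
    (51.2, 50, 4, 64, 0.01),
    (203, 200, 10, 81, 0.05),
]

summary = [both_approaches(*v) for v in variants]
for i, (z, z_crit, p, by_cv, by_p) in enumerate(summary):
    print(f"Variant {i}: z = {z:.2f}, z* = {z_crit:.3f}, p = {p:.4f}, "
          f"reject by critical value = {by_cv}, reject by p-value = {by_p}")
```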


Challenge 2 — Equivalence of CI and Two-Tailed Test

(From Example 1: \( \mu_0 = 500 \) g, \( \sigma = 20 \) g, \( n = 64 \), \( \bar{x} = 495 \) g, \( \alpha = 0.05 \).)

(a) 95% confidence interval:

\( \text{SE} = 20/\sqrt{64} = 2.5 \) g. \( E = 1.96 \times 2.5 = 4.9 \) g.

\[ \text{CI} = 495 \pm 4.9 = (490.1,\; 499.9) \text{ g} \]

(b) Equivalence:

\( \mu_0 = 500 \) falls outside the interval \( (490.1, 499.9) \).

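Key insight: A 95% CI and a two-tailed test at \( \alpha = 0.05 \) always yield the same decision: the test rejects \( H_0 \) exactly when \( \mu_0 \) falls outside the interval, and fails to reject when \( \mu_0 \) falls inside it. Here \( \mu_0 = 500 \) is outside, so \( H_0 \) is rejected.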

This equivalence holds exactly for two-tailed tests. One-tailed tests have no direct CI equivalent — a one-sided confidence bound must be used instead.
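The CI and test decisions for this example can be compared side by side. A minimal sketch using the Example 1 values quoted above; the variable names are illustrative.

```python
from statistics import NormalDist
import math

mu0, sigma, n, x_bar, alpha = 500, 20, 64, 495, 0.05

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
se = sigma / math.sqrt(n)

# 95% confidence interval for the mean
ci = (x_bar - z_crit * se, x_bar + z_crit * se)

# Two-tailed test of H0: mu = mu0
z = (x_bar - mu0) / se
test_rejects = abs(z) > z_crit
mu0_outside_ci = not (ci[0] <= mu0 <= ci[1])

print(f"CI = ({ci[0]:.1f}, {ci[1]:.1f}), z = {z:.2f}")
print(f"mu0 outside CI: {mu0_outside_ci}; test rejects H0: {test_rejects}")
```

Both checks reduce to the same inequality, \( |\bar{x} - \mu_0| > z_{\alpha/2} \cdot \text{SE} \), which is why they cannot disagree.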


Challenge 3 — Two-Sample Preview Generator (generateTwoTailTest)

Generated values differ for each student. The five-step template for the single-sample case is shown in Problem 1 above. For the extension prompt: a two-sample z-test for comparing two independent means uses the test statistic

\[ z = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \]

The same five-step framework applies; only the formula for the standard error changes. This is covered formally in REG-3.
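As a preview, the two-sample statistic can be sketched in a few lines. All numbers below are hypothetical values chosen for illustration; they do not come from the lesson or from `generateTwoTailTest`, and the function name is likewise an assumption.

```python
from statistics import NormalDist
import math

def two_sample_z(x_bar1, x_bar2, sigma1, sigma2, n1, n2):
    """z statistic for H0: mu1 - mu2 = 0 with known population standard deviations."""
    se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # only the SE formula changes
    return (x_bar1 - x_bar2) / se

# Hypothetical illustration values (not from the lesson)
z = two_sample_z(x_bar1=52, x_bar2=50, sigma1=4, sigma2=3, n1=64, n2=36)
p = 2 * (1 - NormalDist().cdf(abs(z)))                # two-tailed p-value

print(f"z = {z:.3f}, p = {p:.4f}")
```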

Key takeaway: The five-step hypothesis testing framework learned in this lesson is universal. Every test — one-sample, two-sample, proportions, regression coefficients — follows the same structure: state hypotheses, check conditions, compute a test statistic, find the p-value, state a conclusion. Only the formula for the test statistic and the reference distribution change.