INF-6 Solutions: Hypothesis Testing for Small Sample Mean and Proportion

Section 5 — Guided Practice Solutions

Problem 1 — t-Test Decisions (Variants A, B, C)

Each variant below shows a small-sample t-test worked in full. The key steps are always: (1) identify df = n − 1, (2) compute SE = s/√n, (3) compute t, (4) locate t in the correct df row of the t-table to bound p, (5) compare to α.

Variant A (Nutritionist; n = 10, x̄ = 2180 kcal, s = 280 kcal, μ₀ = 2000, two-tailed, α = 0.05):

Variant B (Battery lab; n = 12, x̄ = 23.4 h, s = 3.2 h, μ₀ = 25 h, two-tailed, α = 0.05):

Variant C (Doctor; n = 16, x̄ = 37.3°C, s = 0.5°C, μ₀ = 37.0°C, two-tailed, α = 0.05):

Common mistakes: (1) Using df = n instead of df = n − 1. (2) Reading the one-tail column for a two-tailed test — for a two-tailed test at α = 0.05, use the one-tail 0.025 column. (3) Reporting an exact p-value from the t-table: the t-table gives bounds, not exact p-values. State "0.05 < p < 0.10" rather than "p = 0.07."
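As a check on the five key steps, the following Python sketch runs them for Variants A, B, and C. The helper name `t_test_bounds` and the `T_TABLE` dictionary are illustrative, not part of the lesson; the critical values are standard two-tail t-table entries, and only the df = 9, 11, and 15 rows are included.

```python
import math

# Two-tail critical values t* from a standard t-table (keys are two-tail alpha).
# Only the rows needed for these three variants are included in this sketch.
T_TABLE = {
    9:  {0.20: 1.383, 0.10: 1.833, 0.05: 2.262, 0.02: 2.821, 0.01: 3.250},
    11: {0.20: 1.363, 0.10: 1.796, 0.05: 2.201, 0.02: 2.718, 0.01: 3.106},
    15: {0.20: 1.341, 0.10: 1.753, 0.05: 2.131, 0.02: 2.602, 0.01: 2.947},
}

def t_test_bounds(n, xbar, s, mu0):
    """Return (t, df, lower, upper) with lower < p < upper for a two-tailed test."""
    df = n - 1                      # step 1: df = n - 1
    se = s / math.sqrt(n)           # step 2: SE = s / sqrt(n)
    t = (xbar - mu0) / se           # step 3: test statistic
    lower, upper = 0.0, 1.0
    for alpha, crit in T_TABLE[df].items():   # step 4: bracket |t| in the df row
        if abs(t) > crit:
            upper = min(upper, alpha)   # |t| beyond t*: p is below this alpha
        else:
            lower = max(lower, alpha)   # |t| short of t*: p is above this alpha
    return t, df, lower, upper

variants = {
    "A": (10, 2180, 280, 2000),
    "B": (12, 23.4, 3.2, 25),
    "C": (16, 37.3, 0.5, 37.0),
}
for name, args in variants.items():
    t, df, lo, hi = t_test_bounds(*args)
    # step 5: reject only if the whole p-range lies below alpha = 0.05
    decision = "reject H0" if hi <= 0.05 else "fail to reject H0"
    print(f"Variant {name}: t = {t:.3f}, df = {df}, {lo} < p < {hi} -> {decision}")
```

Under these table values, Variants A and B fail to reject H₀ and Variant C rejects it at α = 0.05.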


Problem 2 — Proportion Test Conditions and z Statistic (Variants A, B, C)

Variant A (n = 150, x = 36, p₀ = 0.20, two-tailed, α = 0.05):

Variant B (n = 80, x = 42, p₀ = 0.60, two-tailed, α = 0.05):

Variant C (n = 200, x = 82, p₀ = 0.35, two-tailed, α = 0.01):

Common mistake: Using \(\hat{p}\) instead of \(p_0\) in the SE denominator. The SE formula for a proportion test is \(\sqrt{p_0(1-p_0)/n}\) — the null value goes in the denominator because we compute the SE assuming H₀ is true.
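The three variants can be checked with a short Python sketch. The function name `two_tailed_proportion_test` is illustrative; Φ comes from the standard library's `statistics.NormalDist`, and the condition threshold of 5 follows the rule used in this lesson.

```python
import math
from statistics import NormalDist

def two_tailed_proportion_test(n, x, p0):
    """One-proportion z-test (two-tailed). SE uses p0 because H0 is assumed true."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("normal approximation conditions not met")
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)     # null value in the SE, not p_hat
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value

variants = {
    "A": (150, 36, 0.20, 0.05),
    "B": (80, 42, 0.60, 0.05),
    "C": (200, 82, 0.35, 0.01),
}
for name, (n, x, p0, alpha) in variants.items():
    p_hat, z, p_value = two_tailed_proportion_test(n, x, p0)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"Variant {name}: p_hat = {p_hat:.3f}, z = {z:.3f}, p = {p_value:.4f} -> {decision}")
```

With these inputs none of the three variants rejects H₀ at its stated α.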


Problem 3 — Choosing t vs. z vs. Proportion Test

Scenario A (Factory QC; n = 10, x̄ = 97 g, s = 4 g, μ₀ = 100 g, left-tailed):

Scenario B (Pet ownership; n = 120, x = 54, p₀ = 0.40, two-tailed):

Common mistake: Choosing the t-test for Scenario B because "σ is unknown." The t-vs-z distinction applies only to tests about a population mean, not a proportion. For proportions (when conditions are met), always use the z-distribution.
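The selection rule can be written as a small decision helper. This is a sketch only: `choose_test` and its arguments are hypothetical names, and the rule encoded is the one stated in this lesson (proportion → z; mean → t unless σ is known).

```python
import math
from statistics import NormalDist

def choose_test(parameter, sigma_known=False):
    """Pick the test family from the parameter being tested."""
    if parameter == "proportion":
        return "one-proportion z-test"   # the t-vs-z question does not arise
    if parameter == "mean":
        return "z-test" if sigma_known else "t-test"
    raise ValueError(f"unknown parameter type: {parameter!r}")

# Scenario A: mean, sigma unknown -> t statistic with df = n - 1 = 9
n_a, xbar, s, mu0 = 10, 97, 4, 100
t_a = (xbar - mu0) / (s / math.sqrt(n_a))

# Scenario B: proportion -> z statistic with p0 in the SE
n_b, x_b, p0 = 120, 54, 0.40
z_b = (x_b / n_b - p0) / math.sqrt(p0 * (1 - p0) / n_b)
p_b = 2 * (1 - NormalDist().cdf(abs(z_b)))

print(choose_test("mean"), f"t = {t_a:.3f}")
print(choose_test("proportion"), f"z = {z_b:.3f}, p = {p_b:.3f}")
```

For Scenario A, |t| ≈ 2.37 falls between the df = 9 one-tail critical values 2.262 (0.025) and 2.821 (0.01), so the left-tailed p-value is bounded by 0.01 < p < 0.025.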


Problem 4 — CI Equivalence: Same Conclusion, Two Methods

Scenario: n = 16, x̄ = 2150 kcal, s = 300 kcal, μ₀ = 2000, α = 0.05 (two-tailed), df = 15. (This is the same setup as the lesson's Section 4 Example 1.)

Method 1 — Hypothesis Test (p-value approach):

Method 2 — 95% Confidence Interval:

Why they agree: For a two-tailed test at level α, the null value μ₀ falls outside the (1 − α) CI if and only if p < α. When p is in (0.05, 0.10), μ₀ is inside the 95% CI, so both methods conclude "fail to reject." This equivalence always holds for two-tailed t-tests of a mean, because the test and the interval use the same SE and the same critical value t*.

Variant C from Problem 1 (n = 16, x̄ = 37.3°C, s = 0.5°C, μ₀ = 37.0°C) provides a "reject" example for comparison:

Common mistake: Checking whether the sample estimate (x̄, or \(\hat{p}\) in a proportion problem) lies in the interval instead of checking \(\mu_0\). The CI is centered at x̄, not at μ₀, so x̄ is always inside it. The check is: does μ₀ fall inside or outside the interval?
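The agreement between the two methods can be verified numerically. In this sketch the function name `compare_test_and_ci` is illustrative, and t* = 2.131 is the standard two-tail α = 0.05 table value for df = 15, which applies to both scenarios because both have n = 16.

```python
import math

T_STAR_DF15 = 2.131   # two-tail alpha = 0.05, df = 15 (t-table value)

def compare_test_and_ci(n, xbar, s, mu0, t_star):
    """Return the t statistic, the CI, and the decision from each method."""
    se = s / math.sqrt(n)
    t = (xbar - mu0) / se
    margin = t_star * se
    ci = (xbar - margin, xbar + margin)        # centered at xbar, not mu0
    reject_by_test = abs(t) > t_star
    reject_by_ci = not (ci[0] <= mu0 <= ci[1])  # check mu0 against the interval
    return t, ci, reject_by_test, reject_by_ci

scenarios = {
    "Problem 4 (kcal)": (16, 2150, 300, 2000),
    "Variant C (deg C)": (16, 37.3, 0.5, 37.0),
}
for label, args in scenarios.items():
    t, ci, by_test, by_ci = compare_test_and_ci(*args, T_STAR_DF15)
    print(f"{label}: t = {t:.3f}, CI = ({ci[0]:.3f}, {ci[1]:.3f}), "
          f"test rejects: {by_test}, CI excludes mu0: {by_ci}")
```

The first scenario gives "fail to reject" by both methods and the second gives "reject" by both, which is the equivalence described above.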

Section 6 — Independent Practice Solutions

Problem 1 — t-Test Generator

Generated by generateTTest(). The approach is always the same five steps:

  1. H₀ and Ha: State H₀: μ = μ₀ and the appropriate Ha (two-tailed ≠ μ₀, since the generator uses two-tailed tests).
  2. Conditions: σ is unknown (only s is given), so use the t-test. Because n < 30, also assume the population is approximately normal.
  3. Test statistic: \(\text{SE} = s/\sqrt{n}\), then \(t = (\bar{x} - \mu_0)/\text{SE}\). Record df = n − 1.
  4. p-value: Locate |t| in the df row of the t-table. Find the two critical values that bracket |t| (e.g., t* = 2.093 and t* = 2.539 for df = 19). The p-value is bounded between the corresponding two-tail α values (e.g., 0.02 < p < 0.05).
  5. Conclusion: If p < α = 0.05, reject H₀ and state in context. If p ≥ 0.05, fail to reject H₀ and state in context.

Key reminder — bounding the p-value: The t-table gives critical values for specific α levels, not exact p-values. You will always report a range such as "0.02 < p < 0.05." This is sufficient to make the reject/fail-to-reject decision: if the entire range is below α, reject; if the range straddles α or is entirely above, fail to reject.

Common mistake: Using df = n instead of df = n − 1. Always compute df first and write it down before looking up the t-table.


Problem 2 — One-Tailed Proportion Test (Variants 0–2)

Variant 0 (Medication; n = 100, x = 58, p₀ = 0.50, right-tailed, α = 0.05):

Variant 1 (Customer satisfaction; n = 120, x = 30, p₀ = 0.30, left-tailed, α = 0.05):

Variant 2 (Social media; n = 150, x = 117, p₀ = 0.70, right-tailed, α = 0.01):

Common mistake — one-tailed p-values: For a right-tailed test, p = P(Z > z), not 2 × P(Z > z). For a left-tailed test, p = P(Z < z). The two-tail multiplication applies only to two-tailed tests. Match the tail direction to the direction of Ha.
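A short Python sketch of the one-tailed calculation for Variants 0 to 2. The function name `one_tailed_p` is illustrative; Φ is taken from `statistics.NormalDist` in the standard library.

```python
import math
from statistics import NormalDist

def one_tailed_p(n, x, p0, tail):
    """Return (z, p) for a one-tailed proportion test; tail is 'right' or 'left'."""
    se = math.sqrt(p0 * (1 - p0) / n)   # SE under H0
    z = (x / n - p0) / se
    cdf = NormalDist().cdf(z)
    # Right-tailed: P(Z > z). Left-tailed: P(Z < z). No doubling in either case.
    p = 1 - cdf if tail == "right" else cdf
    return z, p

variants = {
    0: (100, 58, 0.50, "right", 0.05),
    1: (120, 30, 0.30, "left", 0.05),
    2: (150, 117, 0.70, "right", 0.01),
}
for k, (n, x, p0, tail, alpha) in variants.items():
    z, p = one_tailed_p(n, x, p0, tail)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"Variant {k}: z = {z:.3f}, p = {p:.4f}, alpha = {alpha} -> {decision}")
```

Variant 0 is a near miss (p ≈ 0.055 against α = 0.05), and Variant 2 has p ≈ 0.016, which is below 0.05 but not below its stated α = 0.01.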


Problem 3 — Proportion Test Generator

Generated by generateProportionTest(). The approach is always the same five steps:

  1. H₀ and Ha: State H₀: p = p₀ and the Ha matching the scenario direction (left, right, or two-tailed).
  2. Conditions: Compute \(np_0\) and \(n(1-p_0)\). Both must be ≥ 5.
  3. Test statistic: \(\hat{p} = x/n\), then \(\text{SE} = \sqrt{p_0(1-p_0)/n}\), then \(z = (\hat{p} - p_0)/\text{SE}\). Note: SE uses p₀, not \(\hat{p}\).
  4. p-value: Match tail to Ha: right-tailed → p = 1 − Φ(z); left-tailed → p = Φ(z); two-tailed → p = 2(1 − Φ(|z|)).
  5. Conclusion: If p < α, reject H₀ and state in context. If p ≥ α, fail to reject H₀ and state in context.

The single most-tested concept in this lesson: The SE in the proportion test denominator uses \(p_0\) (the null hypothesis value), not \(\hat{p}\) (the sample estimate). The reason: the p-value is computed assuming H₀ is true, so we use the spread we'd expect if p were exactly p₀.


Problem 4 — Find the Error

Generated by generateProportionFindTheError(). The generator rotates through three possible error types. All three share a common theme: the proportion test mechanics are partially correct, but one critical step contains a flaw.

Error Type 1 — p̂ in the denominator:

The student uses \(\hat{p}\) in the SE formula instead of \(p_0\). The correct SE is \(\sqrt{p_0(1-p_0)/n}\); the incorrect SE is \(\sqrt{\hat{p}(1-\hat{p})/n}\). This produces a different z-value and can change the conclusion, especially when \(\hat{p}\) and \(p_0\) are far apart.

Error Type 2 — Wrong p-value direction:

For a two-tailed test, the student reports only the one-tail area instead of multiplying by 2. For example, if z = 2.10, the one-tail area is 0.018, but the correct two-tailed p-value is 0.036. Failing to double the area underestimates p and may lead to a spurious rejection.

Error Type 3 — "Accept H₀" language:

The numerical work is correct, but the conclusion states "accept H₀" or "the data prove H₀." This is never valid. When p ≥ α, the correct statement is "we fail to reject H₀" — meaning only that the data are not surprising enough under H₀ to reject it. Failing to reject does not prove H₀ is true; it means there is insufficient evidence against it.

How to check your own work: Before writing a conclusion, verify three things: (1) the SE uses p₀; (2) the p-value direction matches Ha; (3) the conclusion says "reject" or "fail to reject," never "accept."
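The first two error types can be made concrete with numbers. The generator's own data are not shown here, so this sketch borrows the Problem 2 Variant 2 data (n = 150, x = 117, p₀ = 0.70, right-tailed, α = 0.01) for Error Type 1 and the z = 2.10 example for Error Type 2; the variable names are illustrative.

```python
import math
from statistics import NormalDist

phi = NormalDist().cdf

# Error Type 1: p_hat in the SE instead of p0 (right-tailed test, alpha = 0.01)
n, x, p0, alpha = 150, 117, 0.70, 0.01
p_hat = x / n
z_correct = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
z_wrong = (p_hat - p0) / math.sqrt(p_hat * (1 - p_hat) / n)
p_correct = 1 - phi(z_correct)
p_wrong = 1 - phi(z_wrong)
print(f"correct SE: z = {z_correct:.3f}, p = {p_correct:.4f}, reject: {p_correct < alpha}")
print(f"wrong SE:   z = {z_wrong:.3f}, p = {p_wrong:.4f}, reject: {p_wrong < alpha}")

# Error Type 2: forgetting to double the tail area in a two-tailed test
z = 2.10
one_tail = 1 - phi(z)
two_tail = 2 * one_tail
print(f"one-tail area = {one_tail:.4f}, two-tailed p = {two_tail:.4f}")
```

In this instance the wrong SE moves p from just above 0.01 to just below it, so the error flips the decision from "fail to reject" to "reject."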


Problem 5 — Multi-Step Synthesis: Hospital Quality Audit

Data: n = 20 patients; x̄ = 4.1 days, s = 1.2 days; 7 readmissions (\(\hat{p} = 0.35\)); claimed μ₀ = 3.5 days and p₀ = 0.25. Test both at α = 0.05.

Part (a) — t-test for mean post-operative stay:

Part (b) — z-test for readmission proportion:
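The numbers behind Parts (a) and (b) can be reproduced with the sketch below. Variable names are illustrative; the df = 19 critical values (2.093 and 2.539) are the standard two-tail table entries for α = 0.05 and α = 0.02, and both tests are treated as two-tailed, consistent with the "differs from" wording used in the interpretation.

```python
import math
from statistics import NormalDist

n = 20

# Part (a): one-sample t-test for mean stay, df = n - 1 = 19
xbar, s, mu0 = 4.1, 1.2, 3.5
se_mean = s / math.sqrt(n)
t = (xbar - mu0) / se_mean
t_star_05, t_star_02 = 2.093, 2.539     # two-tail t-table values, df = 19
reject_mean = abs(t) > t_star_05        # True means p < 0.05

# Part (b): one-proportion z-test for readmissions
x, p0 = 7, 0.25
conditions_ok = n * p0 >= 5 and n * (1 - p0) >= 5   # n*p0 is exactly 5 here
p_hat = x / n
se_prop = math.sqrt(p0 * (1 - p0) / n)              # SE uses p0
z = (p_hat - p0) / se_prop
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_prop = p_value < 0.05

print(f"(a) t = {t:.3f}, df = {n - 1}, reject H0: {reject_mean}")
print(f"(b) conditions ok: {conditions_ok}, z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_prop}")
```

Since 2.093 < t < 2.539, the p-value in Part (a) is bounded by 0.02 < p < 0.05; in Part (b) the p-value is about 0.30.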

Part (c) — Interpreting two simultaneous test results:

Rejecting the stay claim (mean differs from 3.5 days) provides statistical evidence that the hospital's stated average is inaccurate. However, it does not prove the hospital is negligent — it only shows a detectable deviation from 3.5 days. Whether a difference of 0.6 days is practically significant depends on clinical context and resource implications, not just on the p-value.

Failing to reject the readmission claim does not confirm the 25% rate is correct. With n = 20, the test has very low power — a true rate of 35% or even 40% could easily go undetected. The auditor should flag this as a limitation and recommend a larger follow-up sample for the readmission claim.

One test rejecting while another does not is not a contradiction: each test addresses a separate claim about a separate parameter, with its own sampling variability. Both conclusions are about evidence, not proof.

Part (d) — "Accept H₀" reasoning error:

The administrator's argument is a classic "fail to reject = accept" fallacy. "Fail to reject H₀" means only that the sample data are not surprising enough under H₀ to reject it at the chosen α. It does not mean H₀ is proven true, confirmed, or verified.

With n = 20, the test has extremely low power. If the true readmission rate were 35%, there would be only a modest probability of detecting it. The administrator is treating the absence of statistical evidence as positive evidence — the same logical error as concluding "no news is good news" when the test was underpowered to begin with.

The correct statement: "We do not have sufficient evidence at the 5% level to conclude the readmission proportion differs from 25%. A larger sample is needed before any affirmative conclusion can be drawn."
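To put a rough number on "low power," the sketch below estimates the power of the two-tailed proportion test at n = 20 when the true rate is 35%. It relies on the normal approximation, which is crude at this sample size, so treat the result as an order-of-magnitude figure rather than an exact one; the function name `approx_power_proportion` is illustrative.

```python
import math
from statistics import NormalDist

def approx_power_proportion(n, p0, p_true, alpha=0.05):
    """Normal-approximation power of a two-tailed one-proportion z-test."""
    std = NormalDist()
    z_star = std.inv_cdf(1 - alpha / 2)
    se_null = math.sqrt(p0 * (1 - p0) / n)           # SE the test uses (under H0)
    se_true = math.sqrt(p_true * (1 - p_true) / n)   # spread of p_hat at the true rate
    upper_cut = p0 + z_star * se_null                # reject if p_hat is above this
    lower_cut = p0 - z_star * se_null                # or below this
    upper = 1 - std.cdf((upper_cut - p_true) / se_true)
    lower = std.cdf((lower_cut - p_true) / se_true)
    return upper + lower

power = approx_power_proportion(20, 0.25, 0.35)
print(f"approximate power = {power:.2f}")
```

Under these assumptions the power comes out at roughly 0.20, meaning a true readmission rate of 35% would be missed about four times in five.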

Common mistake in Part (b): Skipping the conditions check for the proportion test. Here np₀ = 20 × 0.25 = 5 exactly, so note this in your write-up: "conditions are barely met." In practice, a slightly larger sample (n = 25 or more) would be preferred for the proportion test to ensure the normal approximation is reliable.

Section 7 — Mastery Check Solutions

Question 1 — Feynman Test: Why p₀ (not p̂) in the Denominator

Model answer:

The denominator of the z test statistic is the standard error of \(\hat{p}\) under H₀ — that is, the spread we'd expect in sample proportions if the null hypothesis were exactly right. H₀ says the true proportion is p₀. If H₀ is true, then \(\hat{p}\) varies around p₀ with standard deviation \(\sqrt{p_0(1-p_0)/n}\). That is the SE we use.

If instead we used \(\hat{p}\) in the denominator, we'd be computing the SE as if the true proportion equaled our sample estimate — circular reasoning. We'd be assuming our sample is exactly right in order to determine whether our sample is surprising. That defeats the entire purpose of the test.

The rule: SE uses p₀ because we assume H₀ is true when computing the p-value. This is the defining feature of the proportion test and the single most common error in this lesson.


Question 2 — Apply: Placement Test

Scenario: H₀: p = 0.40; Ha: p ≠ 0.40 (two-tailed); n = 120, x = 54, α = 0.05.

Part A: H₀: p = 0.40; Ha: p ≠ 0.40 (two-tailed). The null uses the claimed proportion (0.40), not the sample estimate.

Part B (conditions):
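With n = 120 and p₀ = 0.40 from the scenario, and using this lesson's rule that both expected counts must be at least 5, the condition check works out as:

\[ np_0 = 120 \times 0.40 = 48 \geq 5, \qquad n(1-p_0) = 120 \times 0.60 = 72 \geq 5 \]

Both conditions are met, so the normal approximation is reasonable.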

Part C (test statistic):

\[ \hat{p} = 54/120 = 0.45 \]

\[ \text{SE} = \sqrt{\frac{0.40 \times 0.60}{120}} = \sqrt{0.002} \approx 0.04472 \]

\[ z = \frac{0.45 - 0.40}{0.04472} = \frac{0.05}{0.04472} \approx 1.118 \]

Part D (full conclusion):

\[ p = 2P(Z > 1.118) = 2(1 - \Phi(1.118)) \approx 2(1 - 0.8682) = 2(0.1318) = 0.264 \]

p ≈ 0.264 > 0.05. We fail to reject H₀.

There is insufficient evidence at the 5% significance level to conclude that the true passage rate for the advanced mathematics placement test differs from 40%. The sample proportion of 45% is consistent with random variation from a true rate of 40%.

Common mistake: Choosing a right-tailed test because p̂ = 0.45 > p₀ = 0.40. The tail direction must be determined from the research question (does the rate differ in any direction?), not from the direction of the observed sample data. Using sample data to choose the tail is a form of data snooping.


Question 3 — Error Analysis: df Off by One

Setup: n = 10, t = 2.50, two-tailed test. Researcher uses df = 10 and finds t* = 2.228 → rejects H₀.

The error: The researcher used df = n = 10 instead of df = n − 1 = 9. For a one-sample t-test, degrees of freedom are always n − 1. Using df = 10 yields a critical value of t* = 2.228 (two-tail α = 0.05); the correct df = 9 gives t* = 2.262.

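Does the conclusion change? No. With the correct df = 9, t = 2.50 still exceeds t* = 2.262, so H₀ is still rejected at α = 0.05. The researcher reached the right decision with the wrong critical value.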

When would it matter? If t were between 2.228 and 2.262 — for example, t = 2.24 — the wrong df (10) would give a critical value of 2.228, leading to rejection; the correct df (9) would give 2.262, leading to failure to reject. The off-by-one df error is consequential near the boundary of the critical region. It is worth correcting even when the conclusion happens not to change.

The habit to build: Always write df = n − 1 before opening the t-table. Circle it. Then look up the correct row.
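The effect of the off-by-one error can be seen by comparing decisions under both df values. This is a minimal sketch: `decide` is an illustrative helper, and only the two critical values quoted above are included.

```python
# Two-tail alpha = 0.05 critical values quoted in this solution.
T_STAR_TWO_TAIL_05 = {9: 2.262, 10: 2.228}

def decide(t, df):
    """Critical-value decision for a two-tailed test at alpha = 0.05."""
    return "reject H0" if abs(t) > T_STAR_TWO_TAIL_05[df] else "fail to reject H0"

n = 10
for t in (2.50, 2.24):
    wrong = decide(t, n)        # df = n (the error)
    right = decide(t, n - 1)    # df = n - 1 (correct)
    print(f"t = {t}: df = {n} -> {wrong}; df = {n - 1} -> {right}")
```

For t = 2.50 the two rows agree; for t = 2.24 they disagree, which is the borderline case described above.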

Section 8 — Boss Fight Solutions

Path A — The Health Researcher

Data: n = 18, x̄ = 5.1 days, s = 1.2 days, μ₀ = 4.5 days, α = 0.05 (two-tailed).

Task 1 — H₀, Ha, and test choice:

Task 2 — Compute t and bound p:

\[ \text{SE} = s/\sqrt{n} = 1.2/\sqrt{18} = 1.2/4.243 \approx 0.2828 \text{ days} \]

\[ t = (5.1 - 4.5)/0.2828 = 0.6/0.2828 \approx 2.121 \]

t-table, df = 17: t* = 2.110 (two-tail 0.05); t* = 2.567 (two-tail 0.02). Since 2.110 < 2.121 < 2.567: 0.02 < p < 0.05. (Since t = 2.121 > 2.110, p < 0.05 strictly.)

Task 3 — Conclusion and Type II error:

p < 0.05 = α. Reject H₀. There is sufficient evidence at the 5% level to conclude that the mean recovery time differs from 4.5 days. The sample data suggest recovery times may be longer than the hospital claims.

Type II error in context: A Type II error (failing to reject H₀ when it is false) would mean concluding the hospital's claim of 4.5 days is consistent with the data when the true mean is actually higher. The practical consequence: patients receive inaccurate discharge expectations, and policymakers may under-allocate post-operative care resources. This is a real-world harm arising from insufficient statistical evidence.

Task 4 — z vs. t argument:

The flaw in the colleague's suggestion: the criterion for using z vs. t is not sample size but whether σ is known. Here σ is unknown; only s is given. The t-distribution is required regardless of whether n is "close to 30."

Using z* = 1.96 (the two-tailed critical value at α = 0.05) vs. t* = 2.110 (df = 17, two-tail α = 0.05): since t = 2.121 > 2.110, both reject H₀ in this case. But using z when σ is unknown produces a smaller critical value (1.96 < 2.110), making it easier to reject — this is anti-conservative. In borderline cases (e.g., if t = 2.00), z would reject (2.00 > 1.96) while t would fail to reject (2.00 < 2.110), leading to a Type I error rate above the nominal α = 0.05.
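The anti-conservative effect can be illustrated by simulation. The sketch below assumes a normal population in which H₀ is true (mean 4.5, standard deviation 1.2, chosen to match the scenario's scale), draws many samples of n = 18, and counts how often each critical value rejects. The function names, seed, and replication count are arbitrary choices for illustration.

```python
import math
import random

def one_sample_t(sample, mu0):
    """t statistic using the sample standard deviation (sigma unknown)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

def rejection_rates(n=18, mu0=4.5, sigma=1.2, reps=20000, seed=1):
    """Estimate Type I error rates when H0 is true, for z* and t* cutoffs."""
    rng = random.Random(seed)
    z_star, t_star = 1.96, 2.110      # t* is the df = 17 two-tail 0.05 value
    z_rejects = t_rejects = 0
    for _ in range(reps):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        t = abs(one_sample_t(sample, mu0))
        z_rejects += t > z_star       # wrong cutoff for sigma unknown
        t_rejects += t > t_star       # correct cutoff
    return z_rejects / reps, t_rejects / reps

rate_z, rate_t = rejection_rates()
print(f"rejection rate with z* = 1.96:  {rate_z:.3f}")
print(f"rejection rate with t* = 2.110: {rate_t:.3f}")
```

The t* cutoff should reject close to 5% of the time, while the z* cutoff rejects noticeably more often, which is the inflated Type I error rate the argument describes.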


Path B — The Policy Analyst

Data: n = 150, x = 63, p₀ = 0.35, α = 0.05 (two-tailed).

Task 1 — H₀, Ha, and conditions:

Task 2 — p̂, z, and p-value:

\[ \hat{p} = 63/150 = 0.42 \]

\[ \text{SE} = \sqrt{\frac{0.35 \times 0.65}{150}} = \sqrt{\frac{0.2275}{150}} = \sqrt{0.001517} \approx 0.03894 \]

\[ z = (0.42 - 0.35)/0.03894 = 0.07/0.03894 \approx 1.797 \]

\[ p = 2P(Z > 1.797) \approx 2(1 - \Phi(1.80)) = 2(1 - 0.9641) = 2(0.0359) = 0.0718 \]

Task 3 — Decision and CI verification:

p ≈ 0.0718 > 0.05. Fail to reject H₀. Insufficient evidence that the proportion of students working more than 20 hours per week at this university differs from 35%.

95% CI for p (using p̂ in SE, not p₀ — CIs use the sample estimate):

\[ \text{SE}_\text{CI} = \sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{0.42 \times 0.58/150} \approx 0.04030 \]

\[ \text{CI} = 0.42 \pm 1.96 \times 0.04030 = 0.42 \pm 0.0790 = (0.341,\; 0.499) \]

p₀ = 0.35 falls inside (0.341, 0.499) → fail to reject H₀ at α = 0.05. ✓ Both methods agree.

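Key distinction: The hypothesis test uses p₀ in the SE because it assumes H₀ is true; the confidence interval uses p̂ because it makes no assumption about the true proportion. Because the two SEs differ, the test and the CI for a proportion can occasionally disagree in borderline cases, unlike the t-test and its CI for a mean.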
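The two standard errors can be computed side by side. This sketch uses the Path B data; variable names are illustrative, and z* is taken from `NormalDist().inv_cdf(0.975)` rather than the rounded 1.96.

```python
import math
from statistics import NormalDist

n, x, p0 = 150, 63, 0.35
p_hat = x / n
std = NormalDist()

# Hypothesis test: SE is built from p0 (H0 assumed true)
se_test = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_test
p_value = 2 * (1 - std.cdf(abs(z)))

# Confidence interval: SE is built from p_hat (no assumption about p)
se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
z_star = std.inv_cdf(0.975)
ci = (p_hat - z_star * se_ci, p_hat + z_star * se_ci)

print(f"test: SE = {se_test:.5f}, z = {z:.3f}, p = {p_value:.4f}")
print(f"CI:   SE = {se_ci:.5f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"p0 inside CI: {ci[0] <= p0 <= ci[1]}")
```

The unrounded z gives p ≈ 0.072, slightly different from the table-based 0.0718, because the table lookup rounds z to 1.80.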

Task 4 — Practical vs. statistical significance:

Statistical significance: The test failed to reject H₀ at α = 0.05. The data are not statistically surprising under H₀: p = 0.35. The advocacy group cannot claim "dramatically higher" rates — the evidence does not support even a statistically detectable difference from 35%.

Practical significance: Even if the test had rejected H₀, a difference of p̂ − p₀ = 0.42 − 0.35 = 0.07 (7 percentage points) may or may not be "dramatic" by policy standards. Whether 7 more students per 100 working over 20 hours per week warrants major policy intervention is a domain judgment, not a statistical one. Statistical significance does not imply large or important effects.

Responsible communication: "Our sample shows 42% — 7 percentage points above the government's 35% — but we do not have sufficient statistical evidence at the 5% level to conclude the true rate at our university differs from the national figure. A larger sample is needed to draw firmer conclusions."

Section 9 — Challenge Problem Solutions

Challenge 1 — Paired Data: t-Test on Differences

The key insight for paired data: treat the differences \(d_i = \text{after}_i - \text{before}_i\) as a single sample, then apply the standard one-sample t-test to the \(d_i\) values. This eliminates between-subject variability and increases power.

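Variant 0 (Typing speed; 7 participants; H₀: \(\mu_d = 0\) vs. Ha: \(\mu_d \neq 0\)):

| Participant | Before (wpm) | After (wpm) | \(d_i\) = After − Before |
|---|---|---|---|
| 1 | 52 | 58 | 6 |
| 2 | 45 | 50 | 5 |
| 3 | 60 | 65 | 5 |
| 4 | 38 | 42 | 4 |
| 5 | 55 | 62 | 7 |
| 6 | 48 | 51 | 3 |
| 7 | 50 | 55 | 5 |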

\(\bar{d} = (6 + 5 + 5 + 4 + 7 + 3 + 5)/7 = 35/7 = 5.00\) wpm

Deviations from \(\bar{d} = 5\): 1, 0, 0, −1, 2, −2, 0. Sum of squared deviations: \(1 + 0 + 0 + 1 + 4 + 4 + 0 = 10\).

\(s_d^2 = 10/(7-1) = 10/6 \approx 1.667\); \(s_d \approx 1.291\) wpm

\(\text{SE} = s_d/\sqrt{n} = 1.291/\sqrt{7} = 1.291/2.646 \approx 0.4879\)

\(t = \bar{d}/\text{SE} = 5.00/0.4879 \approx 10.25\), df = 6

t-table, df = 6: t* = 3.707 at two-tail α = 0.01. Since 10.25 ≫ 3.707: p < 0.01. Reject H₀.

Conclusion: There is very strong evidence (p < 0.01) that the training program improves typing speed.

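Variant 1 (Reaction time; 6 athletes; H₀: \(\mu_d = 0\) vs. Ha: \(\mu_d < 0\), left-tailed, α = 0.05):

| Athlete | Before (ms) | After (ms) | \(d_i\) = After − Before |
|---|---|---|---|
| 1 | 280 | 265 | −15 |
| 2 | 310 | 295 | −15 |
| 3 | 295 | 290 | −5 |
| 4 | 320 | 305 | −15 |
| 5 | 275 | 260 | −15 |
| 6 | 300 | 285 | −15 |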

\(\bar{d} = (-15 - 15 - 5 - 15 - 15 - 15)/6 = -80/6 \approx -13.33\) ms

Deviations from −13.33: −1.67, −1.67, 8.33, −1.67, −1.67, −1.67.

Sum of squared deviations: \(2.79 + 2.79 + 69.39 + 2.79 + 2.79 + 2.79 = 83.34\)

\(s_d^2 = 83.34/5 \approx 16.67\); \(s_d \approx 4.08\) ms

\(t = -13.33/(4.08/\sqrt{6}) = -13.33/(4.08/2.449) = -13.33/1.665 \approx -8.00\), df = 5

Left-tailed: |t| = 8.00 ≫ t* = 2.015 (one-tail α = 0.05, df = 5). p ≪ 0.05. Reject H₀.

Conclusion: Strong evidence that the conditioning drill reduces reaction time.

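Variant 2 (Caloric intake; 8 patients; H₀: \(\mu_d = 0\) vs. Ha: \(\mu_d \neq 0\), α = 0.05):

| Patient | Before (kcal) | After (kcal) | \(d_i\) = After − Before |
|---|---|---|---|
| 1 | 2400 | 2200 | −200 |
| 2 | 2100 | 2050 | −50 |
| 3 | 2800 | 2500 | −300 |
| 4 | 2300 | 2250 | −50 |
| 5 | 2600 | 2350 | −250 |
| 6 | 1900 | 1950 | +50 |
| 7 | 2200 | 2100 | −100 |
| 8 | 2500 | 2300 | −200 |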

\(\bar{d} = (-200 - 50 - 300 - 50 - 250 + 50 - 100 - 200)/8 = -1100/8 = -137.5\) kcal

Deviations from −137.5: −62.5, 87.5, −162.5, 87.5, −112.5, 187.5, 37.5, −62.5.

Sum of squared deviations: \(3906.25 + 7656.25 + 26406.25 + 7656.25 + 12656.25 + 35156.25 + 1406.25 + 3906.25 = 98750\)

\(s_d^2 = 98750/7 \approx 14107\); \(s_d \approx 118.8\) kcal

\(t = -137.5/(118.8/\sqrt{8}) = -137.5/(118.8/2.828) = -137.5/42.01 \approx -3.27\), df = 7

Two-tailed, df = 7: t* = 2.365 (two-tail 0.05); t* = 2.998 (two-tail 0.02); t* = 3.499 (two-tail 0.01). Since 2.998 < |t| = 3.27 < 3.499: 0.01 < p < 0.02. Reject H₀.

Conclusion: Sufficient evidence at α = 0.05 that the dietary intervention changes mean caloric intake (specifically, reduces it).

Common mistake in paired t-tests: Running a two-sample test instead of treating the differences as a single sample. Paired designs eliminate between-subject noise; a two-sample test ignores this and is less powerful. Always compute \(d_i\) first, then apply the one-sample t-test to the \(d_i\) column.
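The paired procedure (differences first, then a one-sample t-test) can be sketched in a few lines of Python using only the standard library. The function name `paired_t` is illustrative; the data are the three variant tables above.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """One-sample t-test on the differences d_i = after - before (H0: mu_d = 0)."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    d_bar = mean(d)
    s_d = stdev(d)                       # sample SD, divisor n - 1
    t = d_bar / (s_d / math.sqrt(n))
    return d_bar, s_d, t, n - 1

variants = {
    "0 (typing, wpm)": ([52, 45, 60, 38, 55, 48, 50],
                        [58, 50, 65, 42, 62, 51, 55]),
    "1 (reaction, ms)": ([280, 310, 295, 320, 275, 300],
                         [265, 295, 290, 305, 260, 285]),
    "2 (intake, kcal)": ([2400, 2100, 2800, 2300, 2600, 1900, 2200, 2500],
                         [2200, 2050, 2500, 2250, 2350, 1950, 2100, 2300]),
}
for label, (before, after) in variants.items():
    d_bar, s_d, t, df = paired_t(before, after)
    print(f"Variant {label}: d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, t = {t:.2f}, df = {df}")
```

The t values are then compared against the t-table row for df = n − 1, exactly as in the hand calculations.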


Challenge 2 — Power of the t-Test

Setup: μ₀ = 500 mL, true μ = 505 mL, n = 16, s = 20 mL, α = 0.05 (two-tailed), df = 15.

Part (a) — Approximate power:

Non-centrality parameter (standardized shift): \(\delta = (\mu - \mu_0)/(s/\sqrt{n}) = (505 - 500)/(20/\sqrt{16}) = 5/5 = 1.00\)

Critical value: t* = 2.131 (df = 15, two-tailed α = 0.05).

Under the true distribution shifted by δ = 1.00, the two-tailed test rejects when |t| > 2.131. Approximating the shifted test statistic by a normal distribution centered at δ, the upper-tail rejection probability is approximately:

\[ P(Z > t^* - \delta) = P(Z > 2.131 - 1.00) = P(Z > 1.131) \approx 1 - \Phi(1.131) \approx 1 - 0.871 = 0.129 \]

Approximate power ≈ 13%. (The left-tail contribution is negligible here.)
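The approximation, including the left-tail term that was dropped, can be checked with a brief sketch. It uses the same normal approximation as the hand calculation, not the exact noncentral t distribution; the function name `approx_power` is illustrative.

```python
import math
from statistics import NormalDist

def approx_power(mu0, mu_true, s, n, t_star):
    """Normal-approximation power of a two-tailed t-test; returns (delta, upper, lower)."""
    delta = (mu_true - mu0) / (s / math.sqrt(n))   # standardized shift
    std = NormalDist()
    upper = 1 - std.cdf(t_star - delta)   # P(reject in the upper tail)
    lower = std.cdf(-t_star - delta)      # P(reject in the lower tail)
    return delta, upper, lower

delta, upper, lower = approx_power(500, 505, 20, 16, 2.131)
print(f"delta = {delta:.2f}, upper tail = {upper:.4f}, lower tail = {lower:.4f}")
print(f"approximate power = {upper + lower:.3f}")
```

The lower-tail term is under 0.001, which supports ignoring it in the hand calculation.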

Part (b) — What's needed for 80% power:

Power of 13% means that if the true mean has shifted by 5 mL from 500 mL, this test will detect it only about 1 time in 8. This is very low power — the test is designed to detect gross failures, not small shifts.

To achieve ≈80% power for a shift of 5 mL with σ ≈ 20: using the standard power formula \(n \approx (z_\alpha + z_\beta)^2 \times (\sigma/\Delta)^2\) where z_α = 1.96 (two-tailed α = 0.05) and z_β = 0.84 (80% power): \(n \approx (1.96 + 0.84)^2 \times (20/5)^2 = (2.80)^2 \times 16 = 7.84 \times 16 = 125.44\), so about 126 observations after rounding up. The current n = 16 is far too small to detect a 5-mL shift reliably.
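The sample-size formula translates directly into code. This sketch treats σ ≈ 20 as known and uses the normal-approximation formula from the text; `required_n` is an illustrative name, and the result is rounded up because a fractional observation is not possible.

```python
import math
from statistics import NormalDist

def required_n(sigma, shift, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-tailed test of a mean."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = std.inv_cdf(power)            # about 0.84 for 80% power
    return math.ceil(((z_alpha + z_beta) * sigma / shift) ** 2)

print(required_n(sigma=20, shift=5))
```

Halving the shift to be detected roughly quadruples the required n, since n scales with (σ/Δ)².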

Key lesson: Failing to reject H₀ is not the same as "no effect." With power of 13%, a failure to reject is almost uninformative — the test was very unlikely to detect the true shift even if it exists. Always consider power alongside the decision.