Hypothesis Testing for a Population Mean (Large Sample)

A pharmaceutical company claims its new pain reliever reduces mean recovery time to below the industry standard of 14 days. The company’s own trial of 50 patients produced a sample mean of 12.8 days. Should regulators approve the drug on that basis alone?

“The sample average looks lower” is not enough. With only 50 patients, random variation could easily produce a sample mean of 12.8 days even if the drug does nothing at all. What regulators need is a formal method — a hypothesis test — that asks a precise question: How unlikely would our data be if the drug truly had no effect?

This lesson introduces that method. You already know the key tool: the standardization formula from INF-2. The difference is what we do with the result. In INF-2 we built an interval. Here we compute a probability — the p-value — and compare it to a pre-set threshold to make a decision.

By the end of this lesson, you will be able to:

State a null hypothesis and an alternative hypothesis correctly for a given scenario.
Compute the z test statistic z = (x̄ − μ₀) / (σ / √n).
Find and interpret the p-value for a two-tailed or one-tailed test.
Distinguish one-tailed from two-tailed tests and choose the correct form from the research question.
Explain Type I and Type II errors and the trade-off between them.

What you need coming in — and why it matters today:

Sampling distribution of x̄ (INF-1, INF-2): x̄ varies from sample to sample with mean μ and standard error σ/√n. Today’s test statistic is built directly on this idea.
Standard error formula (INF-2): SE = σ/√n. The denominator of the z test statistic is exactly this.
z-standardization (INF-2, DS-5): z = (x̄ − μ)/(σ/√n) converts a sample mean to a z-score. We reuse this formula verbatim, just for a different purpose.
z-table and tail probabilities (PR-6): The standard normal table gives Φ(z) = P(Z < z). Today you’ll use P(Z > z) = 1 − Φ(z) repeatedly to find p-values.
CLT conditions (INF-1): n ≥ 30 (or normal population) justifies using the z-distribution. We need this before we can trust any test result.

Quick check — can you recall these?

What is the formula for the standard error of the sample mean?

I can write the standard error formula: SE = σ/√n
I know the CLT condition for using the z-distribution: n ≥ 30 (or normal population)
I can compute z = (x̄ − μ)/(σ/√n) given numbers
I can find P(Z > z) using the z-table as 1 − Φ(z)
I understand that x̄ is a random variable that varies from sample to sample

What changes in this lesson: In INF-2 you took the standardized value and built a symmetric interval around it. Here you take the exact same z-score and ask: “What is the probability of seeing a value this extreme, or more extreme, if H₀ is true?” That probability is the p-value. Same formula, completely different question. INF-6 will extend these same five steps to the t-distribution and to proportions; the framework you learn here carries forward exactly.

Retrieval Warm-up — from earlier lessons

A researcher builds a 95% CI and reports: (42.1, 49.9). Which of the following is the correct interpretation?

For a CI for μ, a student says “I used t* = 2.093 instead of z* = 1.96 because the sample was small.” For which scenario was t* the correct choice?

How this section is organized: Nine concepts build on each other in order. The formula in C3 looks identical to INF-2’s standardization. The difference is what we do with the result — that shift in purpose is the key idea of this entire lesson.

C1–C2: The logic and structure of hypothesis testing (what we’re doing and why)
C3–C6: How to do it (test statistic, p-value, two-tailed, one-tailed)
C7–C9: What can go wrong (error types, trade-offs, language traps)

C1 — The Logic of Hypothesis Testing

In a criminal trial, the defendant is assumed innocent until the evidence proves otherwise. The jury does not have to prove innocence — the prosecution must prove guilt beyond a reasonable doubt.

Hypothesis testing works the same way. We start with a default assumption — the null hypothesis — that nothing unusual is happening. Then we ask: if that assumption were true, how likely would our data be? If the data would be very unlikely under the null, we have evidence against it.

Null and Alternative Hypotheses

The null hypothesis H₀ is the default claim about the population, typically that a parameter equals a specific value: H₀: μ = μ₀.

The alternative hypothesis Hₐ is the claim we are trying to find evidence for: a departure from H₀ in some direction.

We write H₀ using “=” only. We never “accept” H₀; we either reject it or fail to reject it.

The court analogy: H₀ = “innocent.” Evidence that would be very rare if the defendant were innocent is strong evidence of guilt. A weak case means we cannot convict, but it does not mean the defendant is innocent. “Fail to reject H₀” is the statistical equivalent of “not guilty.”

Comparison: criminal trial vs. hypothesis test

Starting assumption
Criminal trial: Presume innocence. The defendant is innocent until proven otherwise.
Hypothesis test: Assume H₀: μ = μ₀. We start by assuming the null hypothesis is true.

Weigh the evidence
Criminal trial: Examine the evidence. Is the evidence strong enough to overturn that assumption?
Hypothesis test: Examine the data (p-value). How surprising is the data if H₀ were true?

The verdict
Criminal trial: "Guilty" means the evidence was strong enough. "Not guilty" means the evidence fell short (not "innocent").
Hypothesis test: "Reject H₀" means the data was surprising enough. "Fail to reject H₀" means the data fell short (never "accept H₀").

Absence of proof is not proof of innocence. A jury that says "not guilty" has not declared the defendant innocent — just as a test that "fails to reject H₀" has not shown H₀ is true. We never prove H₀; we only reject it or fail to reject it.

Figure: A criminal trial and a hypothesis test share the same logic. We assume a starting position (innocence ↔ H₀) and challenge it with evidence (testimony ↔ the p-value). Crucially, "fail to reject H₀" is the statistical "not guilty" — it never means we accept H₀ as true.

C2 — The Five-Step Framework

Every hypothesis test in this course, and in INF-6, follows exactly five steps. Memorize the structure now; the details change but the steps do not.

The Five-Step Framework

Step 1: State H₀ and Hₐ. Include the parameter (μ), the hypothesized value (μ₀), and the direction of Hₐ (<, >, or ≠).
Step 2: Check conditions. Verify n ≥ 30 (or population normal) and that σ is known. Only proceed if conditions are met.
Step 3: Compute the test statistic z = (x̄ − μ₀) / (σ / √n).
Step 4: Find the p-value. Use the z-table and the direction of Hₐ to determine the tail area.
Step 5: State the conclusion in context. Compare p to α: if p < α, reject H₀; otherwise, fail to reject H₀. Write the conclusion in a sentence about the real-world situation.

Step 5 conclusion language matters a great deal. Never write “we accept H₀.” Never write “the data prove Hₐ.” The only correct forms are “we reject H₀” (when p < α) and “we fail to reject H₀” (when p ≥ α). Every problem in this lesson enforces this.

See the five steps as a flowchart. The diagram below shows the complete decision path — from stating hypotheses through to the final conclusion. Each step is color-coded. Notice the early exit at Step 2: if the CLT conditions are not met, the z-test cannot be used and you must stop.

Step 1: State H₀ and Hₐ. Include μ, μ₀, and the direction of Hₐ (≠, >, or <).
↓
Step 2: Check conditions. n ≥ 30 (or normal population) and σ known.
   Not met → Stop: do not proceed with the z-test.
   Met → continue.
↓
Step 3: Compute the test statistic. z = (x̄ − μ₀) / (σ / √n)
↓
Step 4: Find the p-value. Use the z-table and the direction of Hₐ to get the tail area.
↓
Step 5: State the conclusion in context. Compare p to α and write the conclusion as a sentence about the real-world situation.
   p < α → Reject H₀: there is sufficient evidence that Hₐ is true.
   p ≥ α → Fail to reject H₀: there is not sufficient evidence that Hₐ is true.

Figure: The five-step hypothesis testing framework. Every test in this course — and in INF-6 — follows this sequence in order. Conditions must be verified at Step 2 before computing z. The conclusion at Step 5 must always be written as a sentence about the real-world situation.
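The five steps can also be sketched as a single Python function. This is a minimal illustration, not part of the lesson's materials: the function name `z_test` and its arguments are hypothetical, the numbers are the pharmaceutical scenario values quoted in this lesson's figures (μ₀ = 14, σ = 4.2, n = 49, x̄ = 12.8), and the left-tailed form is chosen because the company's claim is directional. The condition check is simplified: it stops whenever n < 30, although a normal population would also justify the test.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """One-sample z test; tail is 'two', 'left', or 'right' (Step 1 is the caller's choice)."""
    # Step 2: check conditions (sigma is assumed known by the caller).
    if n < 30:
        raise ValueError("n < 30: the z-test needs a normal population; stop here")
    # Step 3: test statistic.
    z = (xbar - mu0) / (sigma / sqrt(n))
    # Step 4: p-value from the standard normal CDF (Phi).
    phi = NormalDist().cdf
    if tail == "left":
        p = phi(z)
    elif tail == "right":
        p = 1 - phi(z)
    elif tail == "two":
        p = 2 * (1 - phi(abs(z)))
    else:
        raise ValueError("tail must be 'two', 'left', or 'right'")
    # Step 5: decision wording (never "accept H0").
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return z, p, decision


z, p, decision = z_test(xbar=12.8, mu0=14, sigma=4.2, n=49, alpha=0.05, tail="left")
print(f"z = {z:.2f}, p = {p:.4f}: {decision}")
```

The function only returns the statistical decision; the sentence about the real-world situation in Step 5 still has to be written by hand.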

C3 — The Test Statistic

The test statistic measures how far the observed sample mean is from the hypothesized value, in standard error units. It is exactly the z-score you computed in INF-2, but now μ₀, the null hypothesis value, is the reference point.

z Test Statistic for a Population Mean

z = (x̄ − μ₀) / (σ / √n)

where x̄ is the observed sample mean, μ₀ is the value stated in H₀, σ is the known population standard deviation, and n is the sample size.

A large |z| means the data are far from what H₀ predicts, which is strong evidence against H₀. A small |z| means the data are consistent with H₀.

Mini-example: Suppose , , , .

The sample mean is 3 standard errors above the null value. That’s unusual — but whether it’s unusual enough depends on the p-value.

A common error: using the sample standard deviation s instead of the population σ. This lesson’s z-test requires known σ. When σ is unknown, even for n ≥ 30, the technically correct procedure is the t-test (INF-6). For large n, the two give nearly identical numerical results, but the procedure is still called a t-test, not a z-test.

See the test statistic as a ruler reading. The visualization below puts the sampling distribution of x̄ (in original units, recovery days) above an aligned “standard-error ruler.” Drag x̄ and watch its distance from μ₀ read out both in days and in standard errors. That number of standard errors is the test statistic: the same standardization you learned in INF-2, now measuring distance from the null value.


Figure: the test statistic as a ruler reading. The top axis is the sampling distribution of x̄ in original units (days), centered at μ₀ = 14. The bottom scale is the same line, re-marked in standard-error units (SE = σ/√n = 0.6 days) — that is the z-scale. Drag x̄: how many SEs it sits from μ₀ is z = (x̄ − μ₀)/SE.
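As a worked reading of the ruler, here is the arithmetic for the scenario values quoted in this lesson's figures (μ₀ = 14 days, σ = 4.2 days, n = 49, observed x̄ = 12.8 days). It is only the standardization above with those numbers plugged in:

```latex
\[
\mathrm{SE} = \frac{\sigma}{\sqrt{n}} = \frac{4.2}{\sqrt{49}} = 0.6 \text{ days},
\qquad
z = \frac{\bar{x} - \mu_0}{\mathrm{SE}} = \frac{12.8 - 14}{0.6} = -2.00
\]
```

The sample mean sits 1.2 days, or 2 standard errors, below the null value.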

C4 — The p-value

The p-value is the probability of observing a test statistic at least as extreme as ours, assuming H₀ is true. It is not the probability that H₀ is true.

p-value

Decision rule: Reject H₀ if p < α. Fail to reject H₀ if p ≥ α.

The significance level α is set before the data are collected; it is never adjusted after seeing the result.

Interpreting p = 0.03: If H₀ were true, only 3% of all random samples of this size would produce a test statistic at least as extreme as the one we observed. This is fairly rare, so we have moderate evidence against H₀.

See the p-value as a proportion. The simulator below generates 500 sample means from a world where H₀ is true (μ₀ = 14 days, σ = 4.2, n = 49 — the pharmaceutical scenario from Section 1). Drag the slider to set your observed x̄. Blue bars are samples at least as extreme as yours — their proportion is the empirical p-value. Compare it to the theoretical value below the chart. Click New simulation to see how the empirical proportion always hovers near the theoretical value across different random draws.


Figure: P-value simulator — 500 sample means generated from a world where H₀ is true (μ₀ = 14 days, σ = 4.2, n = 49 patients). Move the slider to change your observed x̄. Blue bars are samples at least as extreme as your observed value — their proportion approximates the theoretical p-value shown below the chart.

The p-value is NOT the probability that H₀ is true. It is NOT the probability that the result occurred by chance. It is a conditional probability: given that H₀ is true, how likely is data this extreme? These are very different statements. Confusing them is the single most common error in applied statistics.
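The "p-value as a proportion" idea can be imitated in a few lines of Python. This is a rough sketch under stated assumptions, not the simulator's actual code: it draws the 500 sample means directly from the sampling distribution of x̄ under H₀ rather than simulating individual patients, it counts only the left tail as "at least as extreme" (matching the directional drug claim), and the seed value is arbitrary.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 14.0, 4.2, 49        # H0 world quoted in the simulator caption
xbar_obs = 12.8                      # observed sample mean
se = sigma / sqrt(n)

# Draw 500 sample means from the sampling distribution under H0.
# seed=1 is an arbitrary choice so the run is repeatable.
sample_means = NormalDist(mu0, se).samples(500, seed=1)

# Left-tail "at least as extreme": sample means at or below the observed one.
empirical_p = sum(m <= xbar_obs for m in sample_means) / len(sample_means)
theoretical_p = NormalDist(mu0, se).cdf(xbar_obs)

print(f"empirical p = {empirical_p:.3f}, theoretical p = {theoretical_p:.4f}")
```

Changing the seed gives a different empirical proportion each time, but it should stay close to the theoretical tail area because every draw comes from the H₀ world.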

C5 — Two-Tailed Tests

When the alternative hypothesis is that the mean simply differs from μ₀, in either direction, we use a two-tailed test.

Two-Tailed Test

H₀: μ = μ₀ versus Hₐ: μ ≠ μ₀

The p-value includes both tails: p = 2 · P(Z > |z|)

We use |z| because extreme evidence in either direction counts against H₀.

Example reading: z = −2.00 | p-value = 0.0456 | α = 0.05 → Reject H₀

Figure: Interactive hypothesis test — drag the slider to move the test statistic, then observe how the shaded p-value area (blue) compares with the rejection region (red) beyond the critical value z*.

Mini-example: , two-tailed test.

At : → reject .

A common error: computing only one tail area and forgetting to multiply by 2. For a two-tailed test, +z and −z are equally extreme, so both tails count. The p-value is always 2 × (one-tail area) for a two-tailed test.
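A worked version of the doubling step, using z = −2.00 (the reading shown in the interactive figure for this concept) and the table value Φ(2.00) = 0.9772:

```latex
\[
p = 2 \cdot P\bigl(Z > |{-2.00}|\bigr)
  = 2\,\bigl(1 - \Phi(2.00)\bigr)
  = 2\,(1 - 0.9772)
  = 2\,(0.0228)
  = 0.0456
\]
```

Stopping at the one-tail area 0.0228 would understate the p-value by half. At α = 0.05, 0.0456 < 0.05, so this z leads to rejecting H₀.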

C6 — One-Tailed Tests

When the research question is directional, specifically asking whether the mean is greater than or less than μ₀, we use a one-tailed test.

One-Tailed Tests

Right-tailed: Hₐ: μ > μ₀, with p = P(Z > z) = 1 − Φ(z)

Left-tailed: Hₐ: μ < μ₀, with p = P(Z < z) = Φ(z)

Mini-example (left-tailed): , .

At : → fail to reject .

The same z, three p-values. Drag the slider below to set a test statistic and observe how the identical z-score produces very different p-values depending on whether you specified a two-tailed, left-tailed, or right-tailed alternative. Notice the warning indicators when z points in the wrong direction for a one-tailed test — a reminder that test type must be chosen from the research question, not from the data.

Two-tailed: p = 2 · P(Z > |z|)
Left-tailed: p = P(Z < z) = Φ(z)
Right-tailed: p = P(Z > z) = 1 − Φ(z)

Figure: The same z-statistic, three different p-values. Drag the slider (or use the preset buttons) to change z and observe how the choice of test type produces a different p-value from the identical test statistic. The α = 0.05 threshold governs all three decisions. Warning indicators appear when the direction of z is inconsistent with the directional test chosen.
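The "same z, three p-values" comparison can be reproduced with a short Python sketch. The helper name `three_p_values` is hypothetical, and z = 2.00 with α = 0.05 are just example settings:

```python
from statistics import NormalDist

phi = NormalDist().cdf   # standard normal CDF, Phi(z)


def three_p_values(z):
    """Return the two-tailed, left-tailed, and right-tailed p-values for z."""
    return {
        "two-tailed": 2 * (1 - phi(abs(z))),
        "left-tailed": phi(z),
        "right-tailed": 1 - phi(z),
    }


alpha = 0.05
for tail, p in three_p_values(2.00).items():
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{tail:>12}: p = {p:.4f} -> {decision}")
```

For a positive z, the left-tailed p-value is close to 1: the data point away from that alternative, so a left-tailed test cannot reject no matter how large z is.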

You must choose one-tailed vs. two-tailed from the research question, before collecting or examining the data. Choosing a one-tailed test after seeing that the data point in a convenient direction is called data snooping, and it inflates the true Type I error rate to roughly twice α. This is a serious methodological error that can invalidate a study.
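The "twice α" figure can be checked with a small calculation. The sketch below assumes the analyst always picks whichever one-tailed test matches the sign of the observed z; under that assumption H₀ is rejected whenever |z| exceeds the one-tailed cutoff:

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05

# One-tailed critical value: cuts off area alpha in a single tail.
z_crit = std_normal.inv_cdf(1 - alpha)

# If the tail is chosen AFTER seeing the sign of z, H0 is rejected whenever
# |z| exceeds the one-tailed cutoff, so both tails contribute.
actual_type1_rate = 2 * (1 - std_normal.cdf(z_crit))

print(f"one-tailed cutoff = {z_crit:.3f}, actual Type I rate = {actual_type1_rate:.3f}")
```

With α = 0.05 the nominal rate is 5%, but the procedure described in the comment rejects a true H₀ about 10% of the time.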

C7 — Type I and Type II Errors

Any decision rule can make two types of mistakes.

Type I and Type II Errors

Type I error (false positive): Rejecting H₀ when H₀ is actually true. Probability = α.

Type II error (false negative): Failing to reject H₀ when H₀ is actually false. Probability = β.

Statistical power: The probability of correctly rejecting H₀ when it is false. Power = 1 − β.

The 2×2 truth table:

	H₀ is true	H₀ is false
Reject H₀	Type I error (α)	Correct! (Power)
Fail to reject H₀	Correct!	Type II error (β)

Students often confuse which error is which. A helpful mnemonic: Type I = false alarm (you cried wolf when there was no wolf). Type II = missed signal (the wolf was there and you said nothing). In a drug trial, a Type I error means approving a drug that doesn’t work; a Type II error means rejecting a drug that does.

Both errors, side by side. The panel below shows the pharmaceutical scenario in full. Select a decision and watch both panels update simultaneously — the same decision leads to completely different outcomes depending on whether H₀ is true. This is why neither error type can be eliminated by choice of decision alone.

Scenario: Regulators run a hypothesis test to decide whether to approve a new drug. What happens to their decision in each possible truth?

Regulator decides: Reject H₀ (approve the drug)

H₀ is TRUE Drug has no real effect

Type I Error False Positive Probability = α

The drug is approved even though it has no real effect. A false alarm — the error whose rate we directly control by choosing α. At α = 0.05, about 5 out of 100 trials where H₀ is true would wrongly approve the drug.

H₀ is FALSE Drug actually works

Correct Decision True Positive · Statistical Power Probability = 1 − β = Power

The drug is approved because the evidence was strong enough to detect the real effect. Out of 100 trials where the drug works, power tells you how many correctly reach this outcome. Higher power means fewer missed cures.

Figure: The same decision (reject H₀ or fail to reject H₀) produces completely different outcomes depending on the true state of the world. Toggle between decisions to see all four possible outcomes. α and β are the probabilities of the two error types.

C8 — The α–β Trade-Off

For a fixed sample size, decreasing α (making it harder to reject H₀) necessarily increases β (making it easier to miss a real effect). The two error rates pull in opposite directions.

The dial analogy: Imagine a single dial labeled “strictness.” Turn it toward strict (lower α): you rarely convict the innocent, but you also let guilty defendants walk free more often. Turn it toward lenient (higher α): you catch more real effects, but you also have more false alarms. There is no free lunch: you must decide which error is more costly in your context.

The only way to reduce both α and β simultaneously is to increase the sample size n.

Adjust the trade-off yourself. The visualization below places two distributions on the same axis: H₀ centered at 0, and a specific alternative Hₐ centered at δ — the effect size, meaning how many standard deviations the true mean differs from μ₀. A larger δ separates the distributions and makes the effect easier to detect. The left slider changes δ; the right slider moves the critical value z*, which controls α. Watch how Type I error (α), Type II error (β), and power (1−β) respond. Note: this diagram models a one-tailed (right-tail) scenario; for a two-tailed test, α is split equally across both tails.

Figure 2: The α–β trade-off. The H₀ distribution (left curve) and the Hₐ distribution (right curve) share the same x-axis. The vertical line marks the critical value z*. Move z* left to increase α (and decrease β); move it right to decrease α (and increase β). The effect size δ sets how far apart the two distributions are — a larger δ acts exactly like increasing n: it separates the curves and shrinks β without touching α. Increasing n is the only way to reduce both α and β simultaneously.
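The trade-off in the figure can be put into numbers with a short sketch. It models the same right-tailed picture: H₀ centered at 0, Hₐ centered at δ on the figure's axis, both with unit spread. The function name `beta_right_tailed` is hypothetical, and δ = 2.0 is an illustrative value, not one taken from the lesson.

```python
from statistics import NormalDist

std_normal = NormalDist()


def beta_right_tailed(alpha, delta):
    """Type II error rate for a right-tailed z test.

    delta is the center of the Ha curve on the figure's axis.
    """
    z_crit = std_normal.inv_cdf(1 - alpha)      # critical value z*
    # Under Ha the test statistic is centered at delta with spread 1, so we
    # fail to reject whenever it lands below z*.
    return std_normal.cdf(z_crit - delta)


delta = 2.0   # hypothetical effect, chosen only for illustration
for alpha in (0.10, 0.05, 0.01):
    beta = beta_right_tailed(alpha, delta)
    print(f"alpha = {alpha:.2f}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

Reading down the output, each stricter α produces a larger β and a smaller power, which is the dial analogy expressed numerically.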

Statistical Significance

A result is statistically significant at level α if p < α. This means only that the data would be unlikely under H₀; it does not guarantee the effect is large, important, or practically meaningful.

C9 — “Fail to Reject” ≠ “Accept”

The phrase “fail to reject H₀” is deliberate and non-negotiable. It is not the same as “accept H₀” or “prove H₀ is true.”

When p ≥ α, all we know is that the data are not surprising under H₀. We do NOT know that H₀ is true. We simply lack sufficient evidence to rule it out. This is the statistical equivalent of a jury returning “not guilty”: absence of proof is not proof of absence. Write “we fail to reject H₀” and nothing stronger.

A confidence interval tells you the same thing. A 95% CI and a two-tailed test at α = 0.05 always give the same decision: if μ₀ falls outside the CI, the test rejects H₀; if μ₀ is inside, it fails to reject. Drag the slider below to see both the CI and the p-value update together — they move in lockstep.


Figure: CI–hypothesis test duality. The blue bar is the 95% confidence interval centered at x̄. The dashed line is μ₀ = 14. The shaded regions on the axis mark where x̄ would fall to reject H₀ (|z| > 1.96). When μ₀ falls outside the CI the test rejects H₀ — the two methods always give the same decision.
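The duality can be checked numerically. In the sketch below, μ₀ = 14 comes from the figure caption, while the standard error of 0.6 days is an assumption carried over from the pharmaceutical scenario (σ = 4.2, n = 49); the helper name `ci_and_test` and the list of x̄ values are hypothetical.

```python
from statistics import NormalDist

mu0, se = 14.0, 0.6                      # se = sigma/sqrt(n), assumed from the pharmaceutical scenario
z_star = NormalDist().inv_cdf(0.975)     # about 1.96 for a 95% CI / alpha = 0.05


def ci_and_test(xbar):
    """Return (CI excludes mu0, two-tailed test rejects H0) for one x-bar."""
    lower, upper = xbar - z_star * se, xbar + z_star * se
    ci_excludes_mu0 = not (lower <= mu0 <= upper)
    z = (xbar - mu0) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    test_rejects = p < 0.05
    return ci_excludes_mu0, test_rejects


for xbar in (12.0, 12.82, 13.5, 14.4, 15.5):
    excluded, rejected = ci_and_test(xbar)
    print(f"x-bar = {xbar}: mu0 outside CI = {excluded}, reject H0 = {rejected}")
```

For every x̄ in the list the two columns agree: the test rejects exactly when μ₀ falls outside the interval.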

Why this wording matters downstream: INF-6, REG-3, and REG-4 all use the same five-step framework with the same conclusion language. If you build the habit now, you will never lose marks for writing “accept H₀” in any future lesson.

Example 1 — Fully Worked: Cereal Fill Weight (Two-Tailed)

A cereal manufacturer claims its boxes contain g on average, with g. A quality inspector randomly samples boxes and finds g. Test at .

Step 1: State H₀ and Hₐ

H₀: μ = 500 g; Hₐ: μ ≠ 500 g (two-tailed: the inspector has no prior reason to expect low or high fill)

Step 2: Check Conditions

n ≥ 30 ✓; σ known ✓. The CLT applies. We may use the z-distribution.

Step 3: Compute the Test Statistic

I notice g.

I choose the negative sign because the sample mean fell below the null value — this will matter for interpretation.

Step 4: Find the p-value

Because Hₐ is two-tailed, I use p = 2 · P(Z > |z|).

From the z-table: , so .

Step 5: Conclusion

p < α = 0.05. We reject H₀.

There is sufficient evidence at the 5% significance level to conclude that the mean fill weight is not 500 g.

Example 2 — Prediction Checkpoint: Student Commute Times (Two-Tailed)

A CEGEP registrar believes the mean student commute time is min, with min. A survey of students gives min. Test at .

Steps 1–3:

H₀: μ = 45 min; Hₐ: μ ≠ 45 min (two-tailed). Conditions: n ≥ 30 ✓, σ known ✓.

min. .

Step 4: .

Step 5: p < α = 0.05. Reject H₀.

There is sufficient evidence at the 5% level to conclude that the mean commute time differs from 45 minutes.

Notice the pattern: the same z and p-value as Example 1 lead to the same decision rule. The context and units still determine the conclusion you write.

Example 3 — Minimally Scaffolded: Sleep Hours (One-Tailed Left)

A health agency claims Quebecers sleep less than the national mean of 7.0 hours ( h). A random sample of Quebecers gives h. Test at .

Hint: The agency’s claim is directional (less than). Use a left-tailed test.

Show Solution

Step 1: H₀: μ = 7.0 h; Hₐ: μ < 7.0 h (left-tailed: the agency specifically claims less than).

Step 2: n ≥ 30 ✓, σ known ✓.

Step 3: h.

Step 4: Left-tailed: p = P(Z < z) = Φ(z).

(using ), so .

Step 5: p > α = 0.01. We fail to reject H₀.

There is insufficient evidence at the 1% significance level to conclude that Quebecers sleep less than the national mean of 7.0 hours. (Note: at , we would reject — the threshold matters.)

Example 4 — Error Analysis: Error Types in a Research Decision

Read the following analysis carefully. It contains errors.

Analysis to examine:

A researcher tests H₀: μ = 12 kg at significance level α. The true population mean is actually 12.4 kg. The researcher obtains a p-value above α and concludes: “We accept H₀. The mean is 12 kg.”

The researcher also notes: “Since we accepted H₀, there is no Type II error here.”

Error 1 (Language): “We accept H₀” is never correct. Write “we fail to reject H₀”: failing to reject is not proof that H₀ is true.

Error 2 (Classification): Here H₀ is false, but the researcher failed to reject it. That is a Type II error (probability β), not “no error.”

What could change the risk: A larger sample would reduce β and make a real shift easier to detect.

Problem 1 — Full z Test, Step by Step

Start with forced guidance: choose the hypotheses, then commit your standard error, test statistic, p-value, and conclusion. Each attempt uses a fresh scenario.

Problem 2 — α, β, and Power

After the worked cost scenario, commit to the trade-off at a fixed sample size.

Problem 3 — Choosing the Test Form

A traffic planner wants to know whether the mean speed on a highway section has increased beyond the posted limit of 110 km/h. She samples 50 vehicles.

Problem 4 — Classifying Errors

A drug company tests H₀: “mean recovery time = 14 days” at significance level α. The company fails to reject H₀. Later, an independent study confirms the drug does reduce recovery time (true mean = 11 days).

Problem 5 — Full Test, Your Way

You’ve practiced each piece of the test separately — now run one complete test from start to finish, and choose how much support you want. Work through it one committed step at a time, or solve the whole problem on paper and answer once; either way, your first answer counts, and you can always break the problem into steps afterwards to see exactly where a mistake crept in.

Problem 1 — Two-Tailed Full Test

Run a fresh two-tailed z test independently. Choose your level of support, then commit each calculation or the whole conclusion.

Problem 2 — One-Tailed Full Test

Now use the claim, not the observed sample, to choose the test direction.

Problem 3 — Another Full-Test Rematch

One more fresh scenario mixes left- and right-tailed claims. Work it solo first when you can.

Problem 4 — Error Classification and the α–β Trade-off

Classify the outcome before examining the consequence for power.

Problem 5 — Failure to Reject Is Not Proof

Find the first error in the journalist’s reasoning, then use the feedback to state the evidence correctly.

Problem 6 — Low Power and Test Choice

A study uses , has an unknown population standard deviation, and fails to reject after testing a new teaching method. Which diagnosis is best?

Problem 7 — Multiple Comparisons

A researcher runs 20 independent tests at and reports only the significant results. What is the central statistical problem?

Mixed Review — Confidence Intervals

Review Problem 1 — Confidence-Interval Interpretation (INF-2)

Find the first error in the student’s confidence-interval interpretation.

Review Problem 2 — Proportion CI Conditions and Construction (INF-4)

Check the conditions, compute the interval, then interpret the claim. Choose solo when you are ready.

Question 1 — Feynman Test

In your own words, explain what a p-value tells you — and what it does NOT tell you. Write as if explaining to a classmate who has never taken statistics. Aim for 200–500 characters.


Question 2 — Apply: Lake pH Test

A water authority tests whether a lake’s mean pH differs from the neutral standard of 7.0. The population standard deviation is units. A random sample of water readings gives .

Part A: Which alternative hypothesis is correct?

Part B: Now run a new full test with no pre-attempt solution. Choose solo to retrieve the entire procedure, or use the step sequence to locate a missed step.

Question 3 — Diagnose an Overclaim

Find the first error in the report, then identify the named misconception in the regulator’s interpretation.

Question 4 — Full Test: The Pharmaceutical Decision

We return to the opening question from Section 1. A pharmaceutical company claims its new pain reliever reduces mean recovery time below the industry standard of days. From historical data, recovery times for this condition have standard deviation days. The company enrolls patients and observes days.

Part A: Which alternative hypothesis is correct for testing the company’s claim?

Part B: Keep the directional claim in mind as you complete the independent mastery test above; its feedback will show the full five-step path after your commitment.

Part C: A regulator reads the report and claims that a p-value is the probability that the drug does not work. Find the error in that conditional-probability claim.

Self-Assessment

How confident do you feel about hypothesis testing for a population mean?

Still confused | Ready for the Boss Fight

Choose your path. Both require full five-step reasoning.

🔬 Path A: The Auditor

A provincial auditor suspects that mean grant processing times exceed the target. You must build the case — or fail to build it — with rigorous statistical evidence.

🏗️ Path B: The Designer

An engineering team must verify their manufacturing process meets tolerances. When the test fails to reject, you must trace the consequences through error types and cost trade-offs.

🔬 Path A: The Auditor

A provincial auditor suspects that the mean processing time for government grant applications exceeds the stated target of 30 business days. The population standard deviation is known to be days. A random sample of 64 recent applications shows a mean of days.

Task 1. State and with full justification for the directional choice. Explain why a one-tailed test is appropriate here rather than a two-tailed test.

Show Guidance for Task 1

H₀: μ = 30 days; Hₐ: μ > 30 days.

The auditor’s concern is specifically whether processing times exceed the target — not whether they differ in any direction. A one-tailed right test is justified because the audit mandate is to detect delays. Using a two-tailed test would waste power by splitting the rejection region across both directions, making it harder to detect the specific problem of concern.

Task 2. Check the conditions and compute the test statistic. Show all work.

Show Guidance for Task 2

Conditions: n = 64 ≥ 30 ✓; σ known ✓. CLT applies.

day.

Task 3. Find the p-value and make a decision at α = 0.01. State the conclusion in a sentence suitable for a formal audit report.

Show Guidance for Task 3

Right-tailed: p = P(Z > z) = 1 − Φ(z).

p < α = 0.01. Reject H₀.

Audit conclusion: “Based on a random sample of 64 applications, there is sufficient evidence at the 1% significance level to conclude that the mean grant processing time exceeds the 30-day target (, ).”

Task 4. A colleague argues that α should be reduced to 0.001 “to be absolutely safe before accusing the department.” Explain the Type II error consequence of this choice. Would you recommend it?

Show Guidance for Task 4

Reducing α from 0.01 to 0.001 raises the bar for rejection. The critical z-value increases from roughly 2.33 to 3.09. For the same sample size and true mean, this makes it harder to detect a real excess: β increases and power decreases.

In this case: with the observed z, the test rejected H₀ at α = 0.01 but would fail to reject at α = 0.001 (since z < 3.09). The departmental delay would go undetected.

Recommendation: The appropriate α depends on the cost of each error type. A Type I error means falsely flagging a compliant department (unfair, reputational cost). A Type II error means missing a real delay problem (public money wasted, applicants harmed). If the cost of missed delays is high, lowering α to 0.001 is not wise, because it increases the chance of letting real problems through. α = 0.01 is already strict enough for an audit context.

Reflection: Write a two-sentence conclusion suitable for a public audit report — include the decision, the significance level, and what it means for the department.


🏗️ Path B: The Designer

An engineering team wants to verify that a manufacturing process produces parts with mean diameter mm. The population standard deviation is mm. A sample of parts gives mm. Rejecting a good batch costs the company €500; missing a defective batch costs €5,000.

Task 1. Set up the appropriate hypothesis test (justify two-tailed) and conduct it at α = 0.05. Show all five steps.

Show Guidance for Task 1

Two-tailed is appropriate: the engineering team wants to know if the diameter differs from 25.00 mm in either direction — both too small and too large are defects.

H₀: μ = 25.00 mm; Hₐ: μ ≠ 25.00 mm.

Conditions: n ≥ 30 ✓, σ known ✓.

mm.

p > α = 0.05. Fail to reject H₀.

There is insufficient evidence at the 5% level to conclude that the mean diameter differs from 25.00 mm.

Task 2. The test failed to reject H₀, but an independent precise measurement confirms the true mean is actually 25.015 mm. What type of error occurred? Justify your answer using the 2×2 truth table.

Show Guidance for Task 2

H₀ was false (the true mean is 25.015 ≠ 25.00). The test failed to reject a false null. This is a Type II error (β).

In the truth table: we are in the cell “Fail to reject H₀ | H₀ is false”, which is the Type II error cell.

The error was not a mistake in the procedure; the test was performed correctly. With this sample size and a shift of only 0.015 mm (1.5 SE), the test had low power to detect such a small deviation.

Task 3. To reduce this error, should the team increase or decrease α? What is the cost trade-off given the asymmetric costs (€500 vs. €5,000)?

Show Guidance for Task 3

To reduce β (Type II error), increase α. A higher α makes it easier to reject H₀, catching more real defects.

Cost analysis: A Type I error (rejecting a good batch) costs €500. A Type II error (missing a defective batch) costs €5,000, which is 10 times more. This asymmetry argues strongly for a higher α (say 0.10) to reduce the expensive Type II error. The team is willing to accept more false alarms (€500 each) to avoid the far costlier missed defects (€5,000 each).

Task 4. How large a sample would be needed to detect a 0.01 mm shift with 80% power? Use the power formula n = ((z_{α/2} + z_β) · σ / δ)² with z_{α/2} = 1.96 (two-tailed at α = 0.05) and z_β = 0.84 (for 80% power).

Show Guidance for Task 4

mm (the shift to detect), mm.

Round up: parts.

The current has very low power to detect a 0.01 mm shift. To achieve 80% power, the team needs roughly 8 times as many measurements.
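The sample-size calculation can be sketched in Python using the usual approximation n = ((z_{α/2} + z_β) · σ / δ)², rounded up. The function name `required_n` is hypothetical, and σ = 0.06 mm in the usage line is a made-up value used only to exercise the formula; it is not the σ from this task.

```python
from math import ceil


def required_n(sigma, delta, z_alpha2=1.96, z_beta=0.84):
    """Approximate smallest n for a two-tailed z test to detect a shift of delta."""
    return ceil(((z_alpha2 + z_beta) * sigma / delta) ** 2)


# sigma = 0.06 mm is a hypothetical value used only to exercise the formula.
print(required_n(sigma=0.06, delta=0.01))   # prints 283
```

Because n grows with (σ/δ)², halving the shift to detect quadruples the required sample size.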

Reflection: Given the cost asymmetry (€500 vs. €5,000), would you advise changing α? Justify your recommendation with specific reference to the cost trade-off and what each type of error means in this manufacturing context.


Ready for more? These go beyond the lesson objectives.

Challenge 1 — Critical-Value Approach

Use a fresh scenario to compare the test statistic directly to the rejection region, then verify the p-value equivalence after your attempt.

Challenge 2 — Equivalence of a CI and a Two-Tailed Test

Construct the interval first. The post-attempt solution connects inclusion of μ₀ to the matching two-tailed test decision.

Challenge 3 — Two-Sample Preview

This optional first look uses two independent samples and keeps its learner-controlled explanation after the generated prompt.

Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include full five-step write-ups, z-critical values shown explicitly, and decision rule justifications.

View Full Solutions →

If you’re stuck: Re-read the relevant Core Concept in Section 3. For z-test problems, make sure you used σ (population standard deviation) in the denominator and divided by σ/√n, not by σ alone. For critical value lookups, confirm whether you are running a one-tailed or two-tailed test, as this changes your threshold. The solutions page shows the reasoning behind every step, not just the final answer.

Quick-Reference Formulas

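One-Sample z Test Statistic (mean, large sample / σ known): z = (x̄ − μ₀) / (σ / √n)

Decision Rule (P-Value Approach): Reject H₀ if p < α; otherwise fail to reject H₀.

Decision Rule (Critical-Value Approach):

Two-tailed test (Hₐ: μ ≠ μ₀): Reject H₀ if |z| > z*, where z* leaves area α/2 in each tail
Right-tailed test (Hₐ: μ > μ₀): Reject H₀ if z > z*, where z* leaves area α in the right tail
Left-tailed test (Hₐ: μ < μ₀): Reject H₀ if z < −z*, where z* leaves area α in one tail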

Standard Critical Values (z*):

α = 0.10: 1.645 (two-tailed) or 1.282 (one-tailed)
α = 0.05: 1.960 (two-tailed) or 1.645 (one-tailed)
α = 0.01: 2.576 (two-tailed) or 2.326 (one-tailed)
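Critical values like these can be regenerated from the inverse of the standard normal CDF instead of being memorized. The sketch below assumes the three common significance levels 0.10, 0.05, and 0.01; the helper name `critical_values` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def critical_values(alpha):
    """Return (two-tailed z*, one-tailed z*) for significance level alpha."""
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
    one_tailed = std_normal.inv_cdf(1 - alpha)       # alpha in one tail
    return two_tailed, one_tailed


for alpha in (0.10, 0.05, 0.01):
    two, one = critical_values(alpha)
    print(f"alpha = {alpha:.2f}: two-tailed z* = {two:.3f}, one-tailed z* = {one:.3f}")
```

The one-tailed value at a given α equals the two-tailed value at 2α, which is why 1.645 shows up for both α = 0.05 (one-tailed) and α = 0.10 (two-tailed).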

INF-5: Hypothesis Testing for a Population Mean (Large Sample)

Section 1: Introduction
Section 2: Prerequisites
Section 3: Core Concepts (C1–C9)
Section 4: Worked Examples (Examples 1–4)
Section 5: Guided Practice (Problems 1–5)
Section 6: Independent Practice (Problems 1–7, Mixed Review)
Section 7: Mastery Check (Questions 1–4, Self-Assessment)
Section 8: Boss Fight (Path A: The Auditor, Path B: The Designer)
Section 9: Challenge Problems (Challenges 1–3)
Section 10: Solutions Reference (Quick-Reference Formulas)