INF-5: Hypothesis Testing for a Population Mean (Large Sample)
Module 3 · Statistical Inference
Section 1: Introduction
A pharmaceutical company claims its new pain reliever reduces mean recovery time to below the industry standard of 14 days. The company’s own trial of 50 patients produced a sample mean of 12.8 days. Should regulators approve the drug on that basis alone?
“The sample average looks lower” is not enough. With only 50 patients, random variation could easily produce a sample mean of 12.8 days even if the drug does nothing at all. What regulators need is a formal method — a hypothesis test — that asks a precise question: How unlikely would our data be if the drug truly had no effect?
This lesson introduces that method. You already know the key tool: the standardization formula from INF-2. The difference is what we do with the result. In INF-2 we built an interval. Here we compute a probability — the p-value — and compare it to a pre-set threshold to make a decision.
By the end of this lesson, you will be able to:
State a null hypothesis and an alternative hypothesis correctly for a given scenario.
Compute the z test statistic $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
Find and interpret the p-value for a two-tailed or one-tailed test.
Distinguish one-tailed from two-tailed tests and choose the correct form from the research question.
Explain Type I and Type II errors and the trade-off between them.
Section 2: Prerequisites
What you need coming in — and why it matters today:
Sampling distribution of $\bar{x}$ (INF-1, INF-2): $\bar{x}$ varies from sample to sample with mean $\mu$ and standard error $\sigma/\sqrt{n}$. Today’s test statistic is built directly on this idea.
Standard error formula (INF-2): $SE = \sigma/\sqrt{n}$. The denominator of the z test statistic is exactly this.
z-standardization (INF-2, DS-5): $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ converts a sample mean to a z-score. We reuse this formula verbatim, just for a different purpose.
z-table and tail probabilities (PR-6): The standard normal table gives $P(Z \le z)$. Today you’ll use $P(Z \ge z) = 1 - P(Z \le z)$ repeatedly to find p-values.
CLT conditions (INF-1): $n \ge 30$ (or normal population) justifies using the z-distribution. We need this before we can trust any test result.
Quick check — can you recall these?
What is the formula for the standard error of the sample mean?
Success Factor:
What changes in this lesson: In INF-2 you took the standardized value and built a symmetric interval around it. Here you take the exact same z-score and ask: “What is the probability of seeing a value this extreme, or more, if $H_0$ is true?” That probability is the p-value. Same formula, completely different question. INF-6 will extend these same five steps to the t-distribution and to proportions; the framework you learn here carries forward exactly.
Retrieval Warm-up — from earlier lessons
A researcher builds a 95% CI and reports: (42.1, 49.9). Which of the following is the correct interpretation?
For a CI for $\mu$, a student says “I used t* = 2.093 instead of z* = 1.96 because the sample was small.” For which scenario was t* the correct choice?
Section 3: Core Concepts
How this section is organized: Nine concepts build on each other in order. The formula in C3 looks identical to INF-2’s standardization. The difference is what we do with the result — that shift in purpose is the key idea of this entire lesson.
C1–C2: The logic and structure of hypothesis testing (what we’re doing and why)
C3–C6: How to do it (test statistic, p-value, two-tailed, one-tailed)
C7–C9: What can go wrong (error types, trade-offs, language traps)
C1 — The Logic of Hypothesis Testing
In a criminal trial, the defendant is assumed innocent until the evidence proves otherwise. The jury does not have to prove innocence — the prosecution must prove guilt beyond a reasonable doubt.
Hypothesis testing works the same way. We start with a default assumption — the null hypothesis — that nothing unusual is happening. Then we ask: if that assumption were true, how likely would our data be? If the data would be very unlikely under the null, we have evidence against it.
Null and Alternative Hypotheses
The null hypothesis $H_0$ is the default claim about the population, typically that a parameter equals a specific value: $H_0: \mu = \mu_0$.
The alternative hypothesis $H_a$ is the claim we are trying to find evidence for: a departure from $H_0$ in some direction.
We write $H_0$ using “=” only. We never “accept” $H_0$; we either reject it or fail to reject it.
The court analogy: $H_0$ = “innocent.” Evidence that would be very rare if the defendant were innocent is strong evidence of guilt. A weak case means we cannot convict, but it does not mean the defendant is innocent. “Fail to reject $H_0$” is the statistical equivalent of “not guilty.”
C2 — The Five-Step Framework
Every hypothesis test in this course — and in INF-6 — follows exactly five steps. Memorize the structure now; the details change but the steps do not.
The Five-Step Framework
State $H_0$ and $H_a$. Include the parameter ($\mu$), the hypothesized value ($\mu_0$), and the direction of $H_a$ (<, >, or ≠).
Check conditions. Verify $n \ge 30$ (or population normal) and that $\sigma$ is known. Only proceed if conditions are met.
Compute the test statistic: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
Find the p-value. Use the z-table and the direction of $H_a$ to determine the tail area.
State the conclusion in context. Compare p to $\alpha$: if $p \le \alpha$, reject $H_0$; otherwise, fail to reject $H_0$. Write the conclusion in a sentence about the real-world situation.
Step 5 conclusion language matters a great deal. Never write “we accept $H_0$.” Never write “the data prove $H_a$.” The only correct forms are “we reject $H_0$” (when $p \le \alpha$) and “we fail to reject $H_0$” (when $p > \alpha$). Every problem in this lesson enforces this.
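The five steps can also be sketched in code. This is a minimal illustration rather than part of the lesson's materials: the function name `z_test`, its parameters, and the sample numbers are assumptions chosen for demonstration, and the sketch rejects when the p-value is at or below the significance level.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha=0.05, tail="two", population_normal=False):
    """One-sample z-test for a mean; tail is 'two', 'left', or 'right'."""
    # Step 2: conditions. Sigma is assumed known by the caller.
    if n < 30 and not population_normal:
        raise ValueError("need n >= 30 or a normal population")
    # Step 3: distance from mu0 in standard-error units.
    z = (xbar - mu0) / (sigma / sqrt(n))
    # Step 4: tail area, chosen by the direction of Ha.
    cdf = NormalDist().cdf
    if tail == "two":
        p = 2 * (1 - cdf(abs(z)))
    elif tail == "left":
        p = cdf(z)
    elif tail == "right":
        p = 1 - cdf(z)
    else:
        raise ValueError("tail must be 'two', 'left', or 'right'")
    # Step 5: decision (this sketch rejects when p <= alpha).
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    return z, p, decision

# Step 1 (hypothetical): H0: mu = 100 vs Ha: mu != 100, sigma = 15, n = 36.
z, p, decision = z_test(xbar=106, mu0=100, sigma=15, n=36)
print(round(z, 2), round(p, 4), decision)
```

The function returns only the statistical decision; the sentence about the real-world situation still has to be written by hand.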
C3 — The Test Statistic
The test statistic measures how far the observed sample mean is from the hypothesized value, in standard error units. It is exactly the z-score you computed in INF-2, but now $\mu_0$, the null hypothesis value, is the reference point.
z Test Statistic for a Population Mean
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
where $\mu_0$ is the value stated in $H_0$, $\sigma$ is the known population standard deviation, and $n$ is the sample size.
A large $|z|$ means the data are far from what $H_0$ predicts: strong evidence against $H_0$. A small $|z|$ means the data are consistent with $H_0$.
Mini-example: Suppose , , , .
The sample mean is 3 standard errors above the null value. That’s unusual — but whether it’s unusual enough depends on the p-value.
A common error: using the sample standard deviation $s$ instead of the population $\sigma$. This lesson’s z-test requires known $\sigma$. When $\sigma$ is unknown, we use the t-distribution (INF-6). If you see only $s$ given in a problem where $n \ge 30$, you may substitute $s$ for $\sigma$ as an approximation, but the test is still called a large-sample z-test in that context, not a t-test.
C4 — The p-value
The p-value is the probability of observing a test statistic at least as extreme as ours, assuming $H_0$ is true. It is not the probability that $H_0$ is true.
p-value
$$p = P(\text{test statistic at least as extreme as observed} \mid H_0 \text{ true})$$
Decision rule: Reject $H_0$ if $p \le \alpha$. Fail to reject $H_0$ if $p > \alpha$.
The significance level $\alpha$ is set before the data are collected; it is never adjusted after seeing the result.
Interpreting p = 0.03: If $H_0$ were true, only 3% of all random samples of this size would produce a test statistic at least as extreme as the one we observed. This is fairly rare, so we have moderate evidence against $H_0$.
The p-value is NOT the probability that $H_0$ is true. It is NOT the probability that the result occurred by chance. It is a conditional probability: given that $H_0$ is true, how likely is data this extreme? These are very different statements. Confusing them is one of the most common errors in applied statistics.
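One way to see what the p-value measures is to simulate a world in which the null hypothesis is true. The sketch below is illustrative only: the helper name `simulated_tail_fraction`, the seed, and the number of simulated samples are assumptions, and the cutoff is chosen so that the two-tailed p-value is 0.03.

```python
import random
from statistics import NormalDist

def simulated_tail_fraction(z_obs, trials=200_000, seed=0):
    """Fraction of null-world test statistics at least as extreme as z_obs."""
    rng = random.Random(seed)
    # Under H0 the z statistic follows the standard normal distribution.
    hits = sum(1 for _ in range(trials) if abs(rng.gauss(0, 1)) >= abs(z_obs))
    return hits / trials

# Cutoff whose two-tailed p-value is 0.03 (0.015 in each tail).
z_obs = NormalDist().inv_cdf(1 - 0.015)
exact_p = 2 * (1 - NormalDist().cdf(z_obs))
print(round(z_obs, 3), round(exact_p, 3), simulated_tail_fraction(z_obs))
```

The simulated fraction should land close to 0.03. Every simulated sample comes from a world where the null hypothesis holds, which is why the p-value cannot be read as the probability that the null hypothesis is true.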
C5 — Two-Tailed Tests
When the alternative hypothesis is that the mean simply differs from $\mu_0$, in either direction, we use a two-tailed test.
Two-Tailed Test
$H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$
The p-value includes both tails: $p = 2 \cdot P(Z \ge |z|)$
We use $|z|$ because extreme evidence in either direction counts against $H_0$.
Figure: Interactive hypothesis test on the standard normal curve. Drag the slider to move the test statistic, then observe how the shaded p-value area (blue) compares with the rejection region (red) beyond the critical value z*.
Mini-example:, two-tailed test.
At : → reject .
A common error: computing only one tail area and forgetting to multiply by 2. For a two-tailed test, $z$ and $-z$ are equally extreme, so both tails count. The p-value is always $2 \times$ (one-tail area) for a two-tailed test.
C6 — One-Tailed Tests
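When the research question is directional, specifically asking whether the mean is greater than or less than $\mu_0$, we use a one-tailed test.
One-Tailed Tests
Right-tailed: $H_a: \mu > \mu_0$, with p-value $= P(Z \ge z)$
Left-tailed: $H_a: \mu < \mu_0$, with p-value $= P(Z \le z)$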
Mini-example (left-tailed):, .
At : → fail to reject .
You must choose one-tailed vs. two-tailed from the research question, before collecting or examining the data. Choosing a one-tailed test after seeing that the data point in a convenient direction is called data snooping, and it inflates the true Type I error rate to roughly twice $\alpha$. This is a serious methodological error that can invalidate a study.
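To make the one-tailed versus two-tailed distinction concrete, the sketch below computes all three tail areas for the same test statistic. The function name `tail_areas` and the value z = -1.80 are illustrative assumptions, not values from the lesson.

```python
from statistics import NormalDist

def tail_areas(z):
    """Return (left-tailed, right-tailed, two-tailed) p-values for z."""
    cdf = NormalDist().cdf
    left = cdf(z)                  # P(Z <= z), for Ha: mu < mu0
    right = 1 - cdf(z)             # P(Z >= z), for Ha: mu > mu0
    two = 2 * (1 - cdf(abs(z)))    # both tails, for Ha: mu != mu0
    return left, right, two

left, right, two = tail_areas(-1.80)
print(round(left, 4), round(right, 4), round(two, 4))
```

With these numbers the left-tailed p-value falls below 0.05 while the two-tailed p-value does not. That gap is why choosing the tail after seeing the data roughly doubles the chance of a false rejection.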
C7 — Type I and Type II Errors
Any decision rule can make two types of mistakes.
Type I and Type II Errors
Type I error (false positive): Rejecting $H_0$ when $H_0$ is actually true. Probability = $\alpha$.
Type II error (false negative): Failing to reject $H_0$ when $H_0$ is actually false. Probability = $\beta$.
The 2×2 truth table:
| Decision | $H_0$ is true | $H_0$ is false |
| --- | --- | --- |
| Reject $H_0$ | Type I error ($\alpha$) | Correct! (Power) |
| Fail to reject $H_0$ | Correct! | Type II error ($\beta$) |
Students often confuse which error is which. A helpful mnemonic: Type I = false alarm (you cried wolf when there was no wolf). Type II = missed signal (the wolf was there and you said nothing). In a drug trial, a Type I error means approving a drug that doesn’t work; a Type II error means rejecting a drug that does.
C8 — The α–β Trade-Off
For a fixed sample size, decreasing $\alpha$ (making it harder to reject $H_0$) necessarily increases $\beta$ (making it easier to miss a real effect). The two error rates pull in opposite directions.
The dial analogy: Imagine a single dial labeled “strictness.” Turn it toward strict (lower $\alpha$): you rarely convict the innocent, but you also let guilty defendants walk free more often. Turn it toward lenient (higher $\alpha$): you catch more real effects, but you also have more false alarms. There is no free lunch; you must decide which error is more costly in your context.
The only way to reduce both $\alpha$ and $\beta$ simultaneously is to increase the sample size $n$.
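The trade-off can be checked numerically. The sketch below computes the Type II error rate and power for a right-tailed z-test using the standard normal-theory result power $= 1 - \Phi(z_\alpha - \delta\sqrt{n}/\sigma)$, where $\delta$ is the true shift above the null value. The function name `power_right_tailed` and every number in the scenario are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(delta, sigma, n, alpha):
    """Power of a right-tailed z-test when the true mean is mu0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)      # rejection cutoff
    shift = delta * sqrt(n) / sigma             # true shift in SE units
    return 1 - std_normal.cdf(z_crit - shift)   # P(reject | Ha true)

# Hypothetical scenario: true shift 2, sigma 8, compared across alpha and n.
for alpha, n in [(0.05, 64), (0.01, 64), (0.01, 256)]:
    power = power_right_tailed(delta=2, sigma=8, n=n, alpha=alpha)
    print(f"alpha={alpha}, n={n}: beta={1 - power:.3f}, power={power:.3f}")
```

In this scenario, lowering the significance level at a fixed sample size lowers power, and quadrupling the sample size more than restores it.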
Adjust the trade-off yourself. The sliders below control the critical value (your α threshold) and the effect size (how different the true mean is from H₀). Watch what happens to Type I error (α) and power (1−β) as you move z* left or right.
Figure 2: The α–β trade-off. The H₀ distribution (left curve) and the Hₐ distribution (right curve) share the same x-axis. The vertical line marks the critical value z*. Move z* left to increase α (and decrease β); move it right to decrease α (and increase β). The effect size δ sets how far apart the two distributions are.
Statistical Significance
A result is statistically significant at level $\alpha$ if $p \le \alpha$. This means only that the data would be unlikely under $H_0$; it does not guarantee the effect is large, important, or practically meaningful.
C9 — “Fail to Reject” ≠ “Accept”
The phrase “fail to reject $H_0$” is deliberate and non-negotiable. It is not the same as “accept $H_0$” or “prove $H_0$ is true.”
When $p > \alpha$, all we know is that the data are not surprising under $H_0$. We do NOT know that $H_0$ is true. We simply lack sufficient evidence to rule it out. This is the statistical equivalent of a jury returning “not guilty”: absence of proof is not proof of absence. Write “we fail to reject $H_0$” and nothing stronger.
Why this wording matters downstream: INF-6, REG-3, and REG-4 all use the same five-step framework with the same conclusion language. If you build the habit now, you will never lose marks for writing “accept $H_0$” in any future lesson.
Section 4: Worked Examples
Example 1 — Fully Worked: Cereal Fill Weight (Two-Tailed)
A cereal manufacturer claims its boxes contain g on average, with g. A quality inspector randomly samples boxes and finds g. Test at .
Step 1: State $H_0$ and $H_a$
$H_0: \mu = 500$ g; $H_a: \mu \ne 500$ g (two-tailed: the inspector has no prior reason to expect low or high fill)
Step 2: Check Conditions
✓ g known ✓. The CLT applies. We may use the z-distribution.
Step 3: Compute the Test Statistic
I notice g.
I choose the negative sign because the sample mean fell below the null value — this will matter for interpretation.
Step 4: Find the p-value
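Because $H_a$ is two-tailed, I use $p = 2 \cdot P(Z \ge |z|)$.
From the z-table: $P(Z \le 2.00) = 0.9772$, so $P(Z \ge 2.00) = 1 - 0.9772 = 0.0228$.
$p = 2(0.0228) = 0.0456$.
Step 5: Conclusion
$p = 0.0456 < 0.05 = \alpha$. We reject $H_0$.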
There is sufficient evidence at the 5% significance level to conclude that the mean fill weight is not 500 g.
Example 2 — Prediction Checkpoint: Student Commute Times (Two-Tailed)
A CEGEP registrar believes the mean student commute time is min, with min. A survey of students gives min. Test at .
Steps 1–3:
$H_0: \mu = 45$ min; $H_a: \mu \ne 45$ min (two-tailed). Conditions: ✓, $\sigma$ known ✓.
min. .
Pause here. Before reading the p-value and decision:
What is your best estimate of the p-value?
Will you reject or fail to reject ? Why?
Write down your prediction, then continue.
Show Solution (Steps 4–5)
Step 4: $p = 2 \cdot P(Z \ge 2.00) = 2(0.0228) = 0.0456$.
Step 5: $p = 0.0456 < 0.05 = \alpha$. Reject $H_0$.
There is sufficient evidence at the 5% level to conclude that the mean commute time differs from 45 minutes.
Note: This test has the same $|z| = 2.00$ and the same p-value as Example 1, a good reminder that the numbers alone don’t tell the story; context and units are always part of the conclusion.
Example 3 — Minimally Scaffolded: Sleep Hours (One-Tailed Left)
A health agency claims Quebecers sleep less than the national mean of 7.0 hours ( h). A random sample of Quebecers gives h. Test at .
Hint: The agency’s claim is directional (less than). Use a left-tailed test.
Show Solution
Step 1: $H_0: \mu = 7.0$ h; $H_a: \mu < 7.0$ h (left-tailed: the agency specifically claims less than).
Step 2: ✓, known ✓.
Step 3: h.
Step 4: Left-tailed: .
(using ), so .
Step 5: $p > 0.01 = \alpha$. We fail to reject $H_0$.
There is insufficient evidence at the 1% significance level to conclude that Quebecers sleep less than the national mean of 7.0 hours. (Note: at , we would reject — the threshold matters.)
Example 4 — Error Analysis: Error Types in a Research Decision
Read the following analysis carefully. It contains errors.
Analysis to examine:
A researcher tests $H_0: \mu = 12$ kg at . The true population mean is actually 12.4 kg. The researcher obtains and concludes: “We accept $H_0$. The mean is 12 kg.”
The researcher also notes: “Since we accepted $H_0$, there is no Type II error here.”
Show Full Analysis
Error 1 (language): “We accept $H_0$” is never correct. The proper conclusion is “we fail to reject $H_0$.” Failing to reject is not the same as proving $H_0$ true.
Error 2 (error classification): The true mean is 12.4 kg, so $H_0$ is false. The researcher failed to reject a false null hypothesis. By definition, this is a Type II error (probability $\beta$). The researcher’s claim that there is no Type II error is incorrect.
What could have been done: A larger sample size would reduce $\beta$ and make it more likely the test would detect the true mean of 12.4 kg.
Section 5: Guided Practice
Problem 1 — Setting Up Hypotheses and Computing z
Variant A: A bottling plant claims its machines fill bottles to a mean of mL with mL. An auditor samples bottles and finds mL.
Part A: What are the correct hypotheses for a two-tailed test?
Part B: What is the test statistic?
Variant B: A delivery company claims its packages arrive in a mean of days with days. A watchdog group samples deliveries and finds days.
Part A: What are the correct hypotheses for a two-tailed test?
Part B: What is the test statistic?
Variant C: A machine is calibrated to fill bags to g with g. A sample of bags gives g.
Part A: What are the correct hypotheses for a two-tailed test?
Part B: What is the test statistic?
Problem 2 — p-value and Decision
From Problem 1 Variant A (bottling plant), the test statistic is . The test is two-tailed at .
Part A: What is the p-value?
Part B: What is the conclusion?
From Problem 1 Variant B (delivery times), the test statistic is . The test is two-tailed at .
Part A: What is the p-value?
Part B: What is the conclusion at ?
From Problem 1 Variant C (bag fill), the test statistic is . The test is two-tailed at .
Part A: What is the p-value?
Part B: What is the conclusion at ?
Problem 3 — Choosing the Test Form
A traffic planner wants to know whether the mean speed on a highway section has increased beyond the posted limit of 110 km/h. She samples 50 vehicles.
Problem 4 — Classifying Errors
A drug company tests $H_0$: “mean recovery time = 14 days” at . The company fails to reject $H_0$. Later, an independent study confirms the drug does reduce recovery time (true mean = 11 days).
Section 6: Independent Practice
Problem 1 — Two-Tailed Generator
Problem 2 — One-Tailed Test Direction
A consumer group suspects a battery brand’s mean life is less than the advertised 500 hours ( h). They test batteries and find h.
Part A: Set up $H_0$ and $H_a$.
Part B: Conduct the full test at $\alpha = 0.05$.
Show Solution
h.
.
Left-tailed: .
$p < 0.05 = \alpha$. Reject $H_0$.
There is sufficient evidence at the 5% level to conclude that the mean battery life is less than 500 hours.
An environmental agency claims a factory’s mean daily emission exceeds the legal limit of 80 ppm ( ppm). They sample days and find ppm.
Part A: Set up $H_0$ and $H_a$.
Part B: Conduct the full test at $\alpha = 0.01$.
Show Solution
ppm.
.
Right-tailed: .
$p < 0.01 = \alpha$. Reject $H_0$.
There is sufficient evidence at the 1% level to conclude that the factory’s mean daily emission exceeds 80 ppm.
A dietitian claims that a new meal plan reduces mean daily caloric intake below 2000 kcal ( kcal). A sample of participants shows kcal.
Part A: Set up $H_0$ and $H_a$.
Part B: Conduct the full test at $\alpha = 0.05$.
Show Solution
kcal.
.
Left-tailed: .
$p < 0.05 = \alpha$. Reject $H_0$.
There is sufficient evidence at the 5% level to conclude that the mean daily caloric intake is below 2000 kcal under this meal plan.
Problem 3 — One-Tailed Generator
Problem 4 — Error Classification and the α–β Trade-off
A food safety agency tests whether a canned product’s mean sodium content exceeds 400 mg. They use and fail to reject $H_0$. A later comprehensive audit reveals the true mean is 430 mg.
Part A: What type of error occurred?
Part B: To reduce the chance of this type of error occurring in future tests, the agency should:
A prosecutor tests whether a defendant exceeds a legal threshold. Using , the test yields and the prosecutor rejects $H_0$. The defendant was actually compliant (true mean was within the legal threshold).
Part A: What type of error occurred?
Part B: Which of the following is true about this situation?
A quality engineer tests whether a process mean has shifted from its target. Using , the test rejects $H_0$. The true mean had not actually shifted.
Part A: What type of error occurred and what is its probability?
Part B: The engineer proposes reducing α to 0.01. What consequence does this have?
Problem 5 — Multi-Step Synthesis: News Headline Interpretation
A news headline reads: “Study Finds No Evidence That New Teaching Method Improves Test Scores — Null Hypothesis Not Rejected.”
(a) The journalist concludes from this headline: “The new teaching method has been proven ineffective.” Identify the error in this interpretation.
Show Solution (a)
Failing to reject $H_0$ is not the same as proving $H_0$ true. The study found insufficient evidence to conclude the method works; it did not prove it doesn’t work. “No evidence for an effect” is not “evidence for no effect.” A Type II error may have occurred: the method could be effective but the study lacked power to detect it (perhaps $n$ was too small).
Correct interpretation: “There is insufficient evidence at the chosen significance level to conclude that the new teaching method improves test scores.”
(b) A statistician notes the study used students. What concern does this raise?
Show Solution (b)
With a sample this small, the standard error is relatively large, so the test statistic ($z$ or $t$) would need to be very large to reach significance. This means the test has low power: a high probability of failing to detect a true effect. A Type II error is likely if the teaching method actually does improve scores but the improvement is modest.
Additionally, with $n < 30$ and unknown $\sigma$, the z-test is not appropriate; a t-test with $df = n - 1$ should be used (covered in INF-6).
(c) The study used $\alpha = 0.05$. If the researcher ran the same test on 20 different teaching methods and reported only the ones with $p < 0.05$, what problem arises?
Show Solution (c)
This is the multiple comparisons problem (selectively reporting the significant results is sometimes called “p-hacking”). When 20 independent tests are each conducted at $\alpha = 0.05$ and every null hypothesis is true, the probability of at least one false positive (Type I error) across all tests is approximately $1 - (0.95)^{20} \approx 0.64$, not 5%. Reporting only the significant results inflates the apparent significance of the findings and is a form of data snooping.
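The inflation is easy to verify. The short sketch below assumes the tests are independent and that every null hypothesis is true; the function name `familywise_error_rate` is a label chosen here, not a term from the lesson.

```python
def familywise_error_rate(alpha, num_tests):
    """P(at least one Type I error) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(0.05, k), 4))
```

With a single test the rate is the usual 5%; with 20 tests it climbs to about 64%.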
Mixed Review — Retrieval from Earlier Lessons
These problems draw on concepts from earlier in the course. Attempting them without re-reading prior lessons is the point — retrieval practice strengthens long-term memory more than re-reading.
Review Problem 1 — CI Interpretation (INF-2)
A social scientist constructs a 95% CI for mean weekly screen time: (28.4, 35.6) hours. Her report says: “Since 30 hours is inside the interval, we can say with 95% certainty that the true mean screen time is exactly 30 hours per week.”
Identify all errors in this statement and write the correct interpretation.
Show Solution
Error 1 (probability statement about $\mu$): “95% certainty” implies $\mu$ is random. $\mu$ is a fixed constant. Correct language: “We are 95% confident that $\mu$ lies between 28.4 and 35.6 hours,” referring to the method’s long-run capture rate.
Error 2 (inferring that $\mu$ equals a specific value): The CI does not say $\mu = 30$. It says 30 is a plausible value because it falls inside the interval. Every value between 28.4 and 35.6 is consistent with the data in the sense that none can be ruled out at the 5% level. Values near the center ($\bar{x} = 32.0$) are the most plausible, but 30 is not uniquely supported.
Correct interpretation: “We are 95% confident the true mean weekly screen time is between 28.4 and 35.6 hours. Because 30 hours lies within the interval, the data are consistent with a mean of 30 hours, but the CI does not pinpoint $\mu$ at any single value.”
Review Problem 2 — Proportion CI Conditions and Construction (INF-4)
An environmental group surveys 120 households and finds 42 using a programmable thermostat. Construct a 95% CI for the true proportion.
(a) Check conditions. (b) Compute the CI. (c) Can the group claim “more than 30% of households use a programmable thermostat”?
Show Solution
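(a) $\hat{p} = 42/120 = 0.35$.
$n\hat{p} = 120(0.35) = 42$ ✓
$n(1 - \hat{p}) = 120(0.65) = 78$ ✓
Conditions met.
(b) $0.35 \pm 1.96\sqrt{\dfrac{0.35(0.65)}{120}} = 0.35 \pm 0.085$, so the 95% CI is $(0.265, 0.435)$.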
(c) The lower bound is 0.265 = 26.5%, which is below 30%. Since 30% lies inside the CI, the data cannot rule out a true proportion below 30%. The group cannot claim more than 30% at 95% confidence — the evidence is consistent with values both above and below 30%.
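The interval can be reproduced with a few lines of code. This sketch uses the large-sample (Wald) interval with the survey counts from the problem; the function name `proportion_ci` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(42, 120)
print(round(low, 3), round(high, 3))
print("30% inside interval:", low < 0.30 < high)
```

Because 0.30 lies inside the interval, the code agrees with the conclusion that the group cannot claim more than 30%.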
Section 7: Mastery Check
Question 1 — Feynman Test
In your own words, explain what a p-value tells you — and what it does NOT tell you. Write as if explaining to a classmate who has never taken statistics. Aim for 200–500 characters.
Model Answer
A p-value tells you how surprising your data are, assuming the null hypothesis is true. Specifically: if the null were true, what fraction of random samples would produce a test statistic at least as extreme as the one you got?
A small p-value (say, 0.02) means your data would be very rare in a null-hypothesis world, which is evidence against $H_0$.
What a p-value does NOT tell you:
It is NOT the probability that $H_0$ is true.
It is NOT the probability that your result occurred “by chance.”
It is NOT a measure of the size or importance of an effect.
A small p-value does not mean the effect is large or practically significant.
A large p-value does not prove $H_0$ is true, only that you lack evidence against it.
Question 2 — Apply: Lake pH Test
A water authority tests whether a lake’s mean pH differs from the neutral standard of 7.0. The population standard deviation is units. A random sample of water readings gives .
Part A: Which alternative hypothesis is correct?
Part B: Conduct the full test at $\alpha = 0.05$.
Show Full Solution
Step 1: $H_0: \mu = 7.0$; $H_a: \mu \ne 7.0$ (two-tailed).
Step 2: $n < 30$, but we proceed with z because the problem states $\sigma$ is known and the population is assumed approximately normal. (In INF-6, we would use the t-distribution with unknown $\sigma$ for small samples.)
Step 3:.
Step 4:.
Step 5: $p > 0.05 = \alpha$. Fail to reject $H_0$.
There is insufficient evidence at the 5% significance level to conclude that the lake’s mean pH differs from 7.0.
Common mistake: With $\bar{x} < 7.0$, it is tempting to use a left-tailed test. But the test form was chosen before seeing the data: “differs from” is the correct framing. Using a left-tailed test after seeing the direction would be data snooping.
Question 3 — Error Analysis
Flawed statistical report:
A researcher tests whether a new study technique improves exam scores. After collecting data, they find and report: “Since , this proves the null hypothesis is false. The study technique definitely works.”
Identify and correct the errors in this statement.
Show Full Analysis
Error 1 (“proves”): A hypothesis test never proves anything with certainty. Rejecting $H_0$ means the data would be unlikely if $H_0$ were true; it does not prove $H_0$ is definitively false. A test run at $\alpha = 0.05$ rejects a true null 1 time in 20, so this rejection could be a Type I error.
Correct language: “There is sufficient evidence at the 5% level to reject $H_0$” or “the result is statistically significant at $\alpha = 0.05$.”
Error 2 (“definitely works”): Statistical significance ($p \le \alpha$) means only that the data would be unlikely under $H_0$. It says nothing about whether the effect is large enough to be practically meaningful. With a very large sample, even a tiny, practically irrelevant improvement could yield $p < 0.05$. The researcher needs to report the effect size (e.g., mean improvement in exam scores) alongside the p-value.
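The gap between statistical and practical significance shows up clearly when only the sample size changes. The numbers below are hypothetical (an average improvement of 0.2 exam points with a population standard deviation of 10), and `right_tailed_p` is a helper name introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_p(xbar, mu0, sigma, n):
    """p-value for Ha: mu > mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)

# Hypothetical: scores improve by only 0.2 points (sigma = 10).
for n in (100, 40_000):
    p = right_tailed_p(xbar=70.2, mu0=70.0, sigma=10, n=n)
    print(f"n={n}: improvement=0.2 points, p={p:.5f}")
```

The effect size is identical in both runs; only the larger sample produces a small p-value, which is why the effect size should be reported alongside it.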
Self-Assessment
How confident do you feel about hypothesis testing for a population mean?
Scale: “Still confused” to “Ready for the Boss Fight”
Section 8: Boss Fight
Choose your path. Both require full five-step reasoning.
🔬 Path A: The Auditor
A provincial auditor suspects that mean grant processing times exceed the target. You must build the case — or fail to build it — with rigorous statistical evidence.
🏗️ Path B: The Designer
An engineering team must verify their manufacturing process meets tolerances. When the test fails to reject, you must trace the consequences through error types and cost trade-offs.
🔬 Path A: The Auditor
A provincial auditor suspects that the mean processing time for government grant applications exceeds the stated target of 30 business days. The population standard deviation is known to be days. A random sample of 64 recent applications shows a mean of days.
Task 1. State $H_0$ and $H_a$ with full justification for the directional choice. Explain why a one-tailed test is appropriate here rather than a two-tailed test.
Show Guidance for Task 1
$H_0: \mu = 30$ days; $H_a: \mu > 30$ days.
The auditor’s concern is specifically whether processing times exceed the target — not whether they differ in any direction. A one-tailed right test is justified because the audit mandate is to detect delays. Using a two-tailed test would waste power by splitting the rejection region across both directions, making it harder to detect the specific problem of concern.
Task 2. Check the conditions and compute the test statistic. Show all work.
Show Guidance for Task 2
Conditions: ✓; days known ✓. CLT applies.
day.
Task 3. Find the p-value and make a decision at $\alpha = 0.01$. State the conclusion in a sentence suitable for a formal audit report.
Show Guidance for Task 3
Right-tailed: .
$p < 0.01 = \alpha$. Reject $H_0$.
Audit conclusion: “Based on a random sample of 64 applications, there is sufficient evidence at the 1% significance level to conclude that the mean grant processing time exceeds the 30-day target (, ).”
Task 4. A colleague argues that $\alpha$ should be reduced to 0.001 “to be absolutely safe before accusing the department.” Explain the Type II error consequence of this choice. Would you recommend it?
Show Guidance for Task 4
Reducing $\alpha$ from 0.01 to 0.001 raises the bar for rejection. The critical z-value increases from roughly $z^* = 2.33$ to $z^* = 3.09$. For the same sample size and true mean, this makes it harder to detect a real excess: $\beta$ increases and power decreases.
In this case: with the observed test statistic, the test rejected at $\alpha = 0.01$ but would fail to reject at $\alpha = 0.001$ (since $z < 3.09$). The departmental delay would go undetected.
Recommendation: The appropriate $\alpha$ depends on the cost of each error type. A Type I error means falsely flagging a compliant department (unfair, reputational cost). A Type II error means missing a real delay problem (public money wasted, applicants harmed). If the cost of missed delays is high, lowering $\alpha$ to 0.001 is not wise; it increases the chance of letting real problems through. $\alpha = 0.01$ is already strict enough for an audit context.
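The two critical values for a right-tailed test can be looked up with the inverse normal CDF instead of a table. This is a small sketch; `right_tail_critical_value` is a name chosen here for illustration.

```python
from statistics import NormalDist

def right_tail_critical_value(alpha):
    """z* such that P(Z > z*) = alpha for a right-tailed test."""
    return NormalDist().inv_cdf(1 - alpha)

for alpha in (0.01, 0.001):
    print(alpha, round(right_tail_critical_value(alpha), 3))
```

Any test statistic that lands between the two printed cutoffs leads to rejection at the 1% level but not at the 0.1% level.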
Reflection: Write a two-sentence conclusion suitable for a public audit report — include the decision, the significance level, and what it means for the department.
🏗️ Path B: The Designer
An engineering team wants to verify that a manufacturing process produces parts with mean diameter mm. The population standard deviation is mm. A sample of parts gives mm. Rejecting a good batch costs the company €500; missing a defective batch costs €5,000.
Task 1. Set up the appropriate hypothesis test (justify two-tailed) and conduct it at $\alpha = 0.05$. Show all five steps.
Show Guidance for Task 1
Two-tailed is appropriate: the engineering team wants to know if the diameter differs from 25.00 mm in either direction — both too small and too large are defects.
$H_0: \mu = 25.00$ mm; $H_a: \mu \ne 25.00$ mm.
Conditions: ✓, known ✓.
mm.
.
.
$p > 0.05 = \alpha$. Fail to reject $H_0$.
There is insufficient evidence at the 5% level to conclude that the mean diameter differs from 25.00 mm.
Task 2. The test failed to reject $H_0$, but an independent precise measurement confirms the true mean is actually 25.015 mm. What type of error occurred? Justify your answer using the 2×2 truth table.
Show Guidance for Task 2
$H_0$ was false (the true mean is 25.015 ≠ 25.00). The test failed to reject a false null. This is a Type II error (probability $\beta$).
In the truth table: we are in the cell “Fail to reject $H_0$ | $H_0$ is false,” which is the Type II error cell.
The error was not a mistake in the procedure; the test was performed correctly. With this sample size and a shift of only 0.015 mm (1.5 SE), the test had low power to detect such a small deviation.
Task 3. To reduce this error, should the team increase or decrease $\alpha$? What is the cost trade-off given the asymmetric costs (€500 vs. €5,000)?
Show Guidance for Task 3
To reduce $\beta$ (Type II error), increase $\alpha$. A higher $\alpha$ makes it easier to reject $H_0$, catching more real defects.
Cost analysis: A Type I error (rejecting a good batch) costs €500. A Type II error (missing a defective batch) costs €5,000, 10 times more. This asymmetry argues strongly for a higher $\alpha$ (say 0.10) to reduce the expensive Type II error. The team is willing to accept more false alarms (€500 each) to avoid the far costlier missed defects (€5,000 each).
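Task 4. How large a sample would be needed to detect a 0.01 mm shift with 80% power? Use the power formula $n = \left(\dfrac{(z_{\alpha/2} + z_\beta)\,\sigma}{\delta}\right)^2$ with $z_{\alpha/2} = 1.96$ (two-tailed at $\alpha = 0.05$) and $z_\beta = 0.84$ (for 80% power).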
Show Guidance for Task 4
mm (the shift to detect), mm.
Round up: parts.
The current sample size has very low power to detect a 0.01 mm shift. To achieve 80% power, the team needs roughly 8 times as many measurements.
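The "roughly 8 times" figure can be checked with the usual normal-approximation sample-size formula $n = ((z_{\alpha/2} + z_\beta)\,\sigma/\delta)^2$. Since the 0.015 mm shift equals 1.5 SE, the current standard error is 0.01 mm, so the target shift of 0.01 mm is exactly one current SE and the required $n$ is $(z_{\alpha/2} + z_\beta)^2$ times the current $n$. In the sketch below, `required_n` is a helper name introduced here, and the value $\sigma = 0.05$ mm in the last line is a hypothetical input used only to exercise the function.

```python
from math import ceil
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Sample size for a two-tailed z-test to detect a shift of delta."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = std_normal.inv_cdf(power)            # about 0.84
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# When delta equals one current standard error (sigma / sqrt(n_current)),
# the required n is (z_alpha + z_beta)**2 times the current n.
multiplier = (NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)) ** 2
print(round(multiplier, 2))

# Hypothetical sigma, for illustration only.
print(required_n(delta=0.01, sigma=0.05))
```

The multiplier comes out just under 8, consistent with the guidance above.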
Reflection: Given the cost asymmetry (€500 vs. €5,000), would you advise changing $\alpha$? Justify your recommendation with specific reference to the cost trade-off and what each type of error means in this manufacturing context.
Section 9: Challenge Problems
Ready for more? These go beyond the lesson objectives.
Challenge 1 — Critical-Value Approach
A postal service claims its mean delivery time is days ( days). A consumer group samples packages and finds days. Instead of computing a p-value, use the critical-value approach to test at (two-tailed).
The critical-value approach: find the rejection region and compare the test statistic directly.
Show Solution
For a two-tailed test at $\alpha = 0.05$: the critical value is $z^* = \pm 1.96$ (placing 0.025 in each tail).
Rejection region: $|z| > 1.96$, i.e., reject $H_0$ if $z < -1.96$ or $z > 1.96$.
days.
.
$|z| < 1.96$. The test statistic falls outside the rejection region. Fail to reject $H_0$.
Equivalence check:. ✓ Both approaches give the same decision.
A factory claims mean widget weight is g ( g). An inspector samples widgets; g. Use the critical-value approach at (two-tailed).
Show Solution
For two-tailed: .
Rejection region:.
g.
.
. Fail to reject .
. ✓
A pharmacy claims a drug’s mean tablet weight is mg ( mg). Quality control samples tablets; mg. Use the critical-value approach at (two-tailed).
Show Solution
For two-tailed: .
Rejection region:.
mg.
.
. Reject .
. ✓
There is sufficient evidence that mean tablet weight differs from 200 mg.
Challenge 2 — Equivalence of CI and Two-Tailed Test
From Example 1 (cereal boxes): g, g, , g, .
(a) We rejected $H_0$ at $\alpha = 0.05$ (two-tailed). Now construct a 95% confidence interval for $\mu$.
(b) Does $\mu_0 = 500$ g fall inside or outside the confidence interval? What does this tell you about the relationship between a two-tailed hypothesis test and a confidence interval at the same significance level?
Show Solution
(a) g. 95% CI: .
CI: g.
(b) falls outside the interval .
Key insight: A 95% CI and a two-tailed test at $\alpha = 0.05$ always give the same decision:
If $\mu_0$ falls outside the 95% CI → reject $H_0$ at $\alpha = 0.05$ ✓
If $\mu_0$ falls inside the 95% CI → fail to reject $H_0$ at $\alpha = 0.05$ ✓
This equivalence holds exactly for two-tailed tests. One-tailed tests correspond to one-sided confidence bounds rather than to the usual two-sided interval.
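The equivalence can be checked on made-up data. In the sketch below, the function name `ci_and_test` and all sample values are assumptions for illustration; for each sample it reports whether the null value lies outside the 95% CI and whether the two-tailed test rejects at the 5% level.

```python
from math import sqrt
from statistics import NormalDist

def ci_and_test(xbar, mu0, sigma, n, alpha=0.05):
    """Return (CI, mu0_outside_ci, reject_h0) for a two-tailed z-test."""
    se = sigma / sqrt(n)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (xbar - z_star * se, xbar + z_star * se)
    p = 2 * (1 - NormalDist().cdf(abs(xbar - mu0) / se))
    return ci, not (ci[0] <= mu0 <= ci[1]), p < alpha

# Hypothetical samples tested against mu0 = 50 (sigma = 10, n = 100).
for xbar in (48.5, 49.5, 50.8, 52.5):
    ci, outside, reject = ci_and_test(xbar, mu0=50, sigma=10, n=100)
    print(xbar, tuple(round(v, 2) for v in ci), outside, reject)
```

In every row the last two columns agree: the null value is outside the interval exactly when the test rejects.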
Challenge 3 — Two-Sample Preview (Generator)
Extension prompt: Imagine you had two independent samples instead of one — say, boxes from two different factories. How would you modify the test statistic to compare the two means? (This concept is called a two-sample z-test; you will cover it in REG-3.)
Section 10: Solutions Reference
Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include full five-step write-ups, z-critical values shown explicitly, and decision rule justifications.
If you’re stuck: Re-read the relevant Core Concept in Section 3. For z-test problems, make sure you used $\sigma$ (population standard deviation) in the denominator and did not divide by $\bar{x}$. For critical value lookups, confirm whether you are running a one-tailed or two-tailed test, as this changes your threshold. The solutions page shows the reasoning behind every step, not just the final answer.
Quick-Reference Formulas
One-Sample z Test Statistic (mean, large sample / $\sigma$ known): $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$