How to use this page: Try each problem in the lesson before checking solutions here. If your answer doesn't match, read the solution carefully — especially the part that explains why common wrong answers are wrong. Understanding the error matters more than getting the right answer the first time.
Section 5: Guided Practice
Problem 1 — The Four Elements (C1 + C2)
NBA strength training scenario: basketball analytics company surveys 60 NBA players, average 280 min/week.
Step 1 — Population: All professional basketball players.
Not just the 60 contacted (that's the sample), and not just NBA players — the analytics company wants to draw conclusions about professional basketball broadly. Population = the whole target group.
Step 2 — Statistic: \( \bar{x} = 280 \) minutes/week.
The 280 was computed from the 60-player sample — it's a statistic. The unknown average for all professional players (which we never measured) is the parameter.
Step 3 — Parameter notation: \( \mu \) — population mean strength-training time.
\( \bar{x} = 280 \) is the statistic (sample mean). \( \mu \) is what we're trying to estimate. \( s \) is the sample standard deviation: a different statistic that answers a different question.
Common mistake: Calling the 280 minutes "μ." The 280 was computed from 60 players — it's a sample computation, so it's \( \bar{x} \).
---
Problem 2 — Notation Match (C2)
2a — \( \mu = \$67{,}200 \). All 4,500 employees were measured — the whole population. When you measure everyone, the resulting number is a parameter. No sampling occurred.
2b — \( \bar{x} = \$1{,}450 \). The 80 apartments are a random sample of all Montreal apartments. The $1,450 was computed from this sample — it's a statistic estimating the population mean rent μ.
Key rule: If you measured the entire population → parameter (\( \mu \)). If you measured a sample → statistic (\( \bar{x} \)).
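The key rule can be sketched in a few lines of Python. This is only an illustration: the salary values are invented, the population and sample sizes are loosely borrowed from 2a and 2b, and the names `population`, `mu`, `sample`, and `x_bar` are hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: one salary per employee (values are made up).
population = [rng.randint(40_000, 95_000) for _ in range(4_500)]

# Measuring everyone gives a parameter (mu).
mu = statistics.mean(population)

# Measuring a random subset gives a statistic (x-bar) that estimates mu.
sample = rng.sample(population, 80)
x_bar = statistics.mean(sample)

print(f"parameter mu    = {mu:.2f}")
print(f"statistic x_bar = {x_bar:.2f}")
```

Rerunning with a different seed changes `x_bar` from sample to sample, while `mu` stays fixed for a given population. That is the practical difference between a statistic and a parameter.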
---
Problem 3 — Classify the Variable (C3) — Variant Bank
Correct answers for all 5 variants:
- Variant 0 (Membership tier: Bronze/Silver/Gold/Platinum) → Qualitative — Ordinal. Tiers have a natural order but unequal gaps.
- Variant 1 (Number of defects per batch) → Quantitative — Discrete. Whole-number counts with equal gaps.
- Variant 2 (Blood type: A/B/AB/O) → Qualitative — Nominal. Categories with no ranking.
- Variant 3 (Marathon finishing time) → Quantitative — Continuous. Measured time; any decimal value possible.
- Variant 4 (Cafeteria rating: Terrible/Poor/Okay/Good/Excellent) → Qualitative — Ordinal. Ordered categories with unequal gaps.
The two biggest traps: (1) Numbers used as labels (postal codes, phone numbers) are nominal, not quantitative. (2) An ordered rating scale (1 to 5) looks discrete quantitative, but if the numbers are labels for categories, it's ordinal qualitative. Ask: "Are the gaps between values equal and meaningful?"
---
Problem 4 — Identify the Sampling Method (C4) — Variant Bank
Correct answers for all 5 variants:
- Variant 0 (Select 4 of 15 neighbourhoods; survey all residents in chosen neighbourhoods) → Cluster sampling. Whole groups selected; all members of selected groups surveyed.
- Variant 1 (Inspect item #3, then every 20th item) → Systematic sampling. Random start + fixed interval.
- Variant 2 (Divide by year of study; randomly select 75 from each year) → Stratified sampling. Homogeneous groups (years); random sample from every group.
- Variant 3 (Interview first 50 people exiting mall) → Convenience sampling. Whoever is easiest to reach.
- Variant 4 (Randomly select cities → tracts → households at three levels) → Multistage sampling. Multiple sequential stages of random selection.
Cluster vs. Stratified — the critical distinction: Stratified = homogeneous groups, sample from ALL groups. Cluster = heterogeneous groups, sample SOME groups entirely. You divide the population in both — the difference is whether you sample from all groups or select whole groups.
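The distinction can be seen side by side in a minimal sketch. The population of five groups with 20 members each is hypothetical, and `stratified_sample` and `cluster_sample` are illustrative helper names, not functions from any library.

```python
import random

rng = random.Random(0)

# Hypothetical population: 5 groups (A-E) of 20 member IDs each.
groups = {g: [f"{g}-{i}" for i in range(20)] for g in ["A", "B", "C", "D", "E"]}

def stratified_sample(groups, per_group, rng):
    """Take a random sample from EVERY group."""
    chosen = []
    for members in groups.values():
        chosen.extend(rng.sample(members, per_group))
    return chosen

def cluster_sample(groups, n_groups, rng):
    """Randomly pick SOME groups and keep every member of each picked group."""
    picked = rng.sample(sorted(groups), n_groups)
    return [member for g in picked for member in groups[g]]

strat = stratified_sample(groups, per_group=4, rng=rng)
clust = cluster_sample(groups, n_groups=1, rng=rng)

# Both samples contain 20 people, but they cover the groups very differently.
print(len(strat), len({s.split("-")[0] for s in strat}))  # 20 5
print(len(clust), len({s.split("-")[0] for s in clust}))  # 20 1
```

Same sample size, different coverage: the stratified sample touches all five groups, while the cluster sample contains one whole group and nothing else.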
---
Problem 5 — Identify the Bias (C5)
Type of bias: Voluntary response bias.
Viewers who feel strongly about the tax issue are far more likely to text in than indifferent viewers. Self-selected responses systematically overrepresent extreme opinions.
Direction: Toward "No".
Tax opponents tend to feel more urgently motivated to act. A tax increase hurts people economically in a direct, immediate way, and that kind of tangible cost motivates stronger responses than the more diffuse benefits of public spending. The "Yes" side (those who support the tax) is likely less motivated to respond to the poll.
Sample size ≠ reliability: 4,200 responses sounds like a lot. But a poll of 4,200 strongly motivated, non-representative respondents is less reliable than a random sample of 100 voters. Bias doesn't wash out with larger samples.
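A small simulation can make this point tangible. Everything in it is an assumption chosen for illustration: an electorate split exactly 50/50, and response rates under which opponents are five times as likely to respond as supporters. None of these numbers come from the problem.

```python
import random

rng = random.Random(1)

# Hypothetical electorate: exactly half support the tax (True = "Yes").
population = [True] * 25_000 + [False] * 25_000
true_p = sum(population) / len(population)

# Assumed response rates: opponents respond five times as often as supporters.
RESPOND_PROB = {True: 0.02, False: 0.10}

# Voluntary response: each person decides for themselves whether to respond.
voluntary = [vote for vote in population if rng.random() < RESPOND_PROB[vote]]

# Simple random sample: the pollster chooses 100 people at random.
srs = rng.sample(population, 100)

p_voluntary = sum(voluntary) / len(voluntary)
p_srs = sum(srs) / len(srs)

print(f"true proportion of Yes:          {true_p:.2f}")
print(f"voluntary poll (n={len(voluntary)}):       {p_voluntary:.2f}")
print(f"simple random sample (n=100):    {p_srs:.2f}")
```

Under these assumptions the voluntary poll collects thousands of responses yet lands far from the true proportion, while the much smaller random sample stays close to it.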
---
Section 6: Independent Practice
Problem 1 — The Full Picture (C1 + C2)
Pharmaceutical drug trial: 180 patients from 12 clinics, mean blood pressure drop of 14.2 mmHg.
- (a) Population: All patients with high blood pressure who would be candidates for this drug (broadly: all adults with hypertension).
- (b) Sample: The 180 patients enrolled from 12 Canadian clinics.
- (c) Parameter: \( \mu \) = the true mean blood pressure reduction for all high-blood-pressure patients who take this drug. Unknown — we only measured 180.
- (d) Statistic: \( \bar{x} = 14.2 \) mmHg = sample mean reduction, computed from 180 patients.
- (e) Concern: Volunteer enrollment introduces self-selection bias — patients who enroll may be more health-motivated and adherent to treatment than the general hypertensive population, potentially inflating the apparent benefit.
---
Problem 2 — Is It μ or x̄? (C1 + C2)
- (a) Parameter — \( \mu = 11.4 \) years. All 847 teachers were surveyed — the entire district = the population.
- (b) Statistic — \( \hat{p} = 0.62 \) (sample proportion). The 500 polled Canadians are a sample; the 62% estimates the population proportion of all Canadians who purchased online.
- (c) Parameter — \( \mu = 2.83 \). The registrar has all students' records — the entire enrolled population was measured.
- (d) Statistic — \( \bar{x} = 632 \) bpm. 30 mice is a sample from the relevant population of lab mice.
The rule that resolves all ambiguity: Was the entire defined population measured? → Parameter. Was a subset measured? → Statistic.
---
Problem 3 — Variable Type Generator (C3)
This problem generates new scenarios randomly. The correct classification and explanation appear after you answer. General keys:
- Nominal: Category labels, no order (blood type, department, colour)
- Ordinal: Ordered categories, gaps may be unequal (ratings, education levels, military ranks)
- Discrete: Countable whole numbers (number of children, defects, goals)
- Continuous: Measured, any decimal possible (height, weight, time, temperature)
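The four keys above amount to a short decision procedure, sketched here as a hypothetical helper. `classify_variable` and its three yes/no arguments are illustrative names only; the judgment calls (for example, whether a 1 to 5 rating is really a set of category labels) still have to be made by the person answering the questions.

```python
def classify_variable(is_numeric_quantity: bool,
                      is_counted: bool = False,
                      has_natural_order: bool = False) -> str:
    """Classify a variable from three yes/no questions.

    is_numeric_quantity: do the values measure or count "how much / how many"?
    is_counted:          (quantitative only) counted in whole numbers?
    has_natural_order:   (qualitative only) do the categories have a ranking?
    """
    if is_numeric_quantity:
        return "quantitative-discrete" if is_counted else "quantitative-continuous"
    return "qualitative-ordinal" if has_natural_order else "qualitative-nominal"

# Examples taken from the variant bank in Section 5, Problem 3.
print(classify_variable(False))                          # blood type
print(classify_variable(False, has_natural_order=True))  # membership tier
print(classify_variable(True, is_counted=True))          # defects per batch
print(classify_variable(True))                           # marathon finishing time
```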
---
Problem 4 — Sampling Method Critique (C4) — Variant Bank
Core answers for all 5 variants:
- Variant 0 (City website survey): Voluntary response / convenience. Bias: residents with strong opinions about the park, for or against, over-respond. Fix: mail survey to SRS of registered residents.
- Variant 1 (First 10 morning customers): Convenience sampling (time-of-day bias). Fix: systematic sampling of transaction receipts across all hours.
- Variant 2 (Number each of 6,200 students, randomly select 300): Simple random sampling. Strength: unbiased. Weakness: by chance, some schools may be over/underrepresented. Improvement: stratify by school.
- Variant 3 (Randomly select cities → tracts → households): Multistage sampling. Advantage: practical for dispersed populations without a complete national household list. Tradeoff: clustering effect reduces effective sample size.
- Variant 4 (Select record #12, then every 20th): Systematic sampling. Bias risk: periodicity in filing order. Good choice when: records are in random order and a complete list exists.
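Systematic sampling and its periodicity risk (Variant 4) can be illustrated with a short sketch. The 1,000 numbered records and the `systematic_sample` helper are hypothetical, and the "every 20th record is special" pattern is an invented worst case.

```python
import random

def systematic_sample(records, interval, rng):
    """Pick a random start within the first interval, then every interval-th record."""
    start = rng.randrange(interval)  # 0-based index of the first pick
    return records[start::interval]

records = list(range(1, 1001))  # hypothetical record numbers 1..1000
sample = systematic_sample(records, interval=20, rng=random.Random(7))
print(sample[:3], len(sample))

# Periodicity risk: if a pattern repeats every 20 records, a step of 20
# always lands on the same phase of that pattern.
is_special = [n % 20 == 0 for n in records]
picked_flags = systematic_sample(is_special, interval=20, rng=random.Random(7))
print(set(picked_flags))  # a single value: either all True or all False
```

When the filing order has no cycle that lines up with the interval, the same procedure behaves much like a simple random sample, which is why the method works well on lists in effectively random order.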
---
Problem 5 — Bias in a News Story (C5)
Remote work productivity study: 1,200 remote employees at 5 tech companies; 84% self-report higher productivity.
- (a) The conclusion is not justified. No control group, no objective productivity measure, no random sample, and the claim "proves" is too strong for any survey study.
- (b) Biases: social desirability (remote workers have incentive to claim productivity), undercoverage (only tech workers at remote-friendly companies), and convenience sampling (no random selection from a defined population).
- (c) Excluded: workers in non-remote industries, workers who tried remote work and returned to the office, workers in non-tech sectors. These exclusions bias the population estimate upward (remote-friendly tech workers are probably the most suitable candidates for remote work).
---
Problem 6 — Survey Question Critique (C6)
- (a) "Don't you agree that our school needs better sports facilities?" → Leading question. Fix: "Do you think our school's sports facilities need improvement? [Yes / No / No opinion]"
- (b) "Are you satisfied with the price and speed?" → Double-barrelled. Fix: two separate questions — one for price, one for speed.
- (c) "How often do you exercise regularly?" → Ambiguous. Fix: "In an average week, on how many days do you exercise for at least 30 minutes? [0 / 1–2 / 3–4 / 5–6 / 7]"
- (d) "Given that most experts agree climate change is serious, do you support carbon taxes?" → Leading premise. Fix: "Do you support the introduction of carbon taxes in Canada? [Yes / No / Unsure]"
---
Section 7: Mastery Check
Question 1 — Feynman Test
Model answer (there is no single correct response — evaluate your own answer against this):
"A population is the whole group you care about — like every student in Canada. A sample is just the part you actually study — like 200 randomly chosen students. We use samples because studying millions of people is too expensive and slow. A parameter describes the full population (like the true average GPA of all Canadian students — we probably can't measure it exactly). A statistic describes the sample (like the average GPA of your 200 students — you computed it directly). The statistic is our best estimate of the parameter."
Checklist for your own answer:
- Did you explain why samples are used (not just that they are)?
- Did you make clear that parameters are usually unknown and statistics are computed?
- Did you use (or explain) the symbols μ and x̄?
- Did you explain that the statistic estimates the parameter?
---
Question 2 — Apply (C3 + C4)
Coffee chain with 240 locations in 4 regions: Plan 1 (cluster, 30 stores) vs. Plan 2 (stratified by region).
- (a) Plan 1 = Cluster sampling. Weakness: with only 30 of 240 stores, Atlantic (20 stores, only ~2–3 selected) may be barely represented. Opinions in small regions could be washed out.
- (b) Plan 2 = Stratified sampling by region. Fix: guarantees each region is represented proportionally to its store count — Atlantic always gets 5 surveyed stores regardless of random fluctuation.
- (c) Satisfaction rating = Qualitative — Ordinal. Ordered categories (Very Dissatisfied through Very Satisfied) with unequal gaps. Implication: report mode and percentage per category, not a simple mean. Computing "average satisfaction = 3.7" treats ordinal data as interval, which overclaims precision.
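The proportional allocation in (b) is simple arithmetic: each region's share of the sample equals its share of the stores. In the sketch below, only "Atlantic has 20 of 240 stores" comes from the problem. The other region names and sizes are hypothetical, and the totals of 30 and 60 surveyed stores are assumptions used to show how the Atlantic share changes. `proportional_allocation` is an illustrative helper.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to each stratum's size."""
    population = sum(stratum_sizes.values())
    return {name: total_sample * size / population
            for name, size in stratum_sizes.items()}

# Hypothetical breakdown of the 240 stores (only Atlantic = 20 is from the problem).
stores = {"Atlantic": 20, "Quebec": 60, "Ontario": 100, "West": 60}

alloc_30 = proportional_allocation(stores, total_sample=30)
alloc_60 = proportional_allocation(stores, total_sample=60)

print(alloc_30["Atlantic"])  # 2.5
print(alloc_60["Atlantic"])  # 5.0
```

A fractional share such as 2.5 has to be rounded in practice. Either way, stratifying fixes each region's count in advance instead of leaving it to chance.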
---
Question 3 — Find the Error (C4 + C5)
Food blogger Instagram poll: 847 responses, 91% prefer traditional bagels. Report calls it stratified sampling, calls 91% "μ," calls it quantitative-continuous.
Four errors in the student's analysis:
- "Huge random sample": Voluntary response / convenience sampling — not random. Followers who engage with food content are not representative of all Montrealers.
- "Reliable estimate": A biased large sample is less reliable than an unbiased small one. The 91% is skewed by the food-enthusiast audience.
- "μ" for a proportion: Population proportion is written \( p \), not \( \mu \). The sample proportion is \( \hat{p} \). \( \mu \) is reserved for means.
- "Stratified sampling": Stratification requires deliberate a priori division of the population into strata before sampling. Post-hoc grouping of respondents is not stratification. This was voluntary response sampling.
---
Section 8: Boss Fight
Path A — The Analyst: Screen Time Survey
Full solutions appear in the lesson's Boss Fight section. Summary of errors found:
- Misidentified as stratified sampling (actually voluntary response / convenience)
- Misused \( \mu \) for a sample-computed value (should be \( \bar{x} \))
- Misclassified screen time as qualitative-ordinal (it's quantitative-continuous)
- Used a leading question ("excessive hours") that biases responses downward
- Non-response bias and undercoverage from the newsletter subscriber convenience sample
Path B — The Architect: CEGEP Well-Being Study
Key design decisions:
- Method: Stratified sampling by institution size (small/medium/large), proportional allocation to enrollment, random selection within strata
- Parameter/Statistic: \( \mu \) = true mean well-being score; \( \bar{x} \) = sample mean from 3,000 students
- Survey questions: Neutral wording, one concept per question, bounded time frame
- Residual bias: Non-response from students in greatest distress (hardest to mitigate)
---
Section 9: Challenge Problems
Challenge 1 — Can a Statistic Equal a Parameter? (Variant Bank)
Core insight for all variants: \( \bar{x} = \mu \) can happen by coincidence, but it's rare and doesn't mean the sample "perfectly represents" the population. The key concept is sampling variability — different samples give different statistics. A statistic that happens to equal the parameter is still just one observation from the distribution of all possible sample means.
In Variant 1, none of the 6 possible size-2 samples produced \( \bar{x} = \mu = 60 \). But the average of all 6 sample means = 60 = μ. This illustrates unbiasedness: \( \bar{x} \) is an unbiased estimator of \( \mu \) (correct on average, not necessarily in any given sample).
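The same pattern can be checked by brute force. The four-value population below is hypothetical (it is not the lesson's Variant 1 data); it was chosen so that \( \mu = 60 \) and no pair of values averages exactly 60.

```python
from itertools import combinations

# Hypothetical population of four values with mean 60.
population = [40, 50, 60, 90]
mu = sum(population) / len(population)

# Every possible sample of size 2 (4 choose 2 = 6 samples) and its mean.
sample_means = [sum(pair) / len(pair) for pair in combinations(population, 2)]

print(mu)                                     # 60.0
print(sample_means)                           # [45.0, 50.0, 65.0, 55.0, 70.0, 75.0]
print(mu in sample_means)                     # False
print(sum(sample_means) / len(sample_means))  # 60.0
```

No individual sample hits \( \mu \), yet the sample means average out to \( \mu \) exactly. That is what "unbiased" means here.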
---
Challenge 2 — Multistage vs. Cluster Design
- (a) 30 × 900 = 27,000 students. Not practical.
- (b) Two-stage: Stage 1 — randomly select 60 schools; Stage 2 — randomly select 50 students per school. Total = 3,000.
- (c) Multistage gives more geographic/demographic diversity per student surveyed. Tradeoff: requires agreements with 60 schools vs. 30, but each school provides only 50 participants.
- (d) Population proportion of students spending >3h/day on social media is written p; the sample estimate is \( \hat{p} \).
---
Challenge 3 — Why Convenience Sampling is Always Biased
- (a) Frequent exercisers have higher \( \pi_i \) (selection probability); non-gym-goers have near-zero \( \pi_i \).
- (b) \( \bar{x} \) overestimates \( \mu \) (exercise time inflated by overrepresentation of heavy exercisers).
- (c) A larger sample still draws from the biased pool. Bigger sample = more precise estimate of the wrong thing.
- (d) Proof sketch: unequal \( \pi_i \) means individuals with systematically different values are over/underweighted in \( \bar{x} \). This creates a systematic gap between \( E[\bar{x}] \) and \( \mu \). Increasing sample size reduces variance but cannot eliminate the systematic deviation caused by the biased selection mechanism.
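One way to make the proof sketch in (d) concrete is a simplified model, which is an assumption here rather than part of the lesson. Suppose the sample consists of \( n \) independent draws, and on each draw individual \( i \) (out of \( N \)) is picked with probability \( \pi_i / \sum_j \pi_j \). Write \( \bar{\pi} \) for the average of the \( \pi_i \), and \( \sigma_\pi^2 \) for the variance of a single draw under these probabilities. Then:

\[ E[\bar{x}] = \frac{\sum_{i=1}^{N} \pi_i x_i}{\sum_{j=1}^{N} \pi_j}, \qquad E[\bar{x}] - \mu = \frac{\frac{1}{N} \sum_{i=1}^{N} (\pi_i - \bar{\pi})(x_i - \mu)}{\bar{\pi}} \]

\[ \operatorname{MSE}(\bar{x}) = \frac{\sigma_\pi^2}{n} + \left( E[\bar{x}] - \mu \right)^2 \]

The numerator of the gap is the population covariance between selection probability and the variable being measured. Under this model the gap is nonzero whenever the two are related, as they are in (a), where heavier exercisers are more likely to be selected. The first term of the MSE shrinks as \( n \) grows. The second term does not depend on \( n \) at all, which is the formal version of "more precise estimate of the wrong thing."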
---