Solutions — Statistical Vocabulary and Sampling

How to use this page: Try each problem in the lesson before checking solutions here. If your answer doesn't match, read the solution carefully — especially the part that explains why common wrong answers are wrong. Understanding the error matters more than getting the right answer the first time.


Section 5: Guided Practice Solutions


Problem 1 — The Four Elements (C1 + C2)

NBA strength training scenario: basketball analytics company surveys 60 NBA players, average 280 min/week.

Step 1 — Population: All professional basketball players.
Not just the 60 contacted (that’s the sample), and not just NBA players — the analytics company wants to draw conclusions about professional basketball broadly. Population = the whole target group.

Step 2 — Statistic: x̄ = 280 minutes/week.
The 280 was computed from the 60-player sample — it’s a statistic. The unknown average for all professional players (which we never measured) is the parameter.

Step 3 — Parameter notation: μ, the population mean strength-training time.
x̄ is the statistic (sample mean). μ is what we’re trying to estimate. s is the sample standard deviation — a different statistic that answers a different question.

Common mistake: Calling the 280 minutes “μ.” The 280 was computed from 60 players — it’s a sample computation, so it’s x̄.

Problem 2 — Notation Match (C2)

2a — μ = $67,200. All 4,500 employees were measured — the whole population. When you measure everyone, the resulting number is a parameter. No sampling occurred.

2b — x̄ = $1,450. The $1,450 was computed from this sample — it’s a statistic estimating the population mean rent μ.

Key rule: If you measured the entire population → parameter (μ). If you measured a sample → statistic (x̄).

Problem 3 — Classify the Variable (C3) — Variant Bank

Correct answers for all 5 variants:

Variant 0 (Membership tier: Bronze/Silver/Gold/Platinum) → Qualitative — Ordinal. Tiers have a natural order but unequal gaps.
Variant 1 (Number of defects per batch) → Quantitative — Discrete. Whole-number counts with equal gaps.
Variant 2 (Blood type: A/B/AB/O) → Qualitative — Nominal. Categories with no ranking.
Variant 3 (Marathon finishing time) → Quantitative — Continuous. Measured time; any decimal value possible.
Variant 4 (Cafeteria rating: Terrible/Poor/Okay/Good/Excellent) → Qualitative — Ordinal. Ordered categories with unequal gaps.

The two biggest traps: (1) Numbers coded as labels (postal codes, phone numbers) are nominal — not quantitative. (2) An ordered rating scale (1 to 5) looks discrete quantitative, but if the numbers are labels for categories, it’s ordinal qualitative. Ask: “Are the gaps between values equal and meaningful?”

Problem 4 — Identify the Sampling Method (C4) — Variant Bank

Correct answers for all 5 variants:

Variant 0 (Select 4 of 15 neighbourhoods; survey all residents in chosen neighbourhoods) → Cluster sampling. Whole groups selected; all members of selected groups surveyed.
Variant 1 (Inspect item #3, then every 20th item) → Systematic sampling. Random start + fixed interval.
Variant 2 (Divide by year of study; randomly select 75 from each year) → Stratified sampling. Homogeneous groups (years); random sample from every group.
Variant 3 (Interview first 50 people exiting mall) → Convenience sampling. Whoever is easiest to reach.
Variant 4 (Randomly select cities → tracts → households at three levels) → Multistage sampling. Multiple sequential stages of random selection.

Cluster vs. Stratified — the critical distinction: Stratified = homogeneous groups, sample from ALL groups. Cluster = heterogeneous groups, sample SOME groups entirely. You divide the population in both — the difference is whether you sample from all groups or select whole groups.
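The distinction can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson: the population of 15 groups with 40 people each, and the function names `stratified_sample` and `cluster_sample`, are made up for the example.

```python
import random

def stratified_sample(groups, per_group, rng):
    """Stratified: draw a random sample from EVERY group."""
    return {name: rng.sample(members, per_group) for name, members in groups.items()}

def cluster_sample(groups, n_groups, rng):
    """Cluster: pick SOME groups at random and keep every member of each one."""
    chosen = rng.sample(sorted(groups), n_groups)
    return {name: list(groups[name]) for name in chosen}

# Hypothetical population: 15 groups of 40 people each.
rng = random.Random(0)
groups = {f"group_{g}": [f"g{g}_person{i}" for i in range(40)] for g in range(15)}

strat = stratified_sample(groups, per_group=5, rng=rng)
clust = cluster_sample(groups, n_groups=4, rng=rng)

print(len(strat), "groups touched,", sum(map(len, strat.values())), "people")  # 15 groups touched, 75 people
print(len(clust), "groups touched,", sum(map(len, clust.values())), "people")  # 4 groups touched, 160 people
```

Both functions start from the same division into groups. The stratified draw touches all 15 groups but only part of each; the cluster draw touches 4 groups and takes everyone in them.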

Problem 5 — Identify the Bias (C5)

Type of bias: Voluntary response bias.
Viewers who feel strongly about the tax issue are far more likely to text in than indifferent viewers. Self-selected responses systematically overrepresent extreme opinions.

Direction: Toward “No”.
Tax opponents tend to feel more urgently motivated to act. A tax increase hurts people economically in a direct, immediate way — that kind of tangible cost motivates stronger responses than the more diffuse benefits of public spending. The “Yes” side (those who support the tax) is likely less motivated to respond.

Sample size ≠ reliability: 4,200 responses sounds like a lot. But 4,200 strongly motivated, non-representative respondents are less reliable than 100 randomly selected voters. Bias doesn’t wash out with larger samples.
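A small simulation can make this concrete. It is only a sketch under invented assumptions: the electorate of 100,000 voters split exactly 50/50, and the response rates (10% for opponents, 3% for supporters), are hypothetical numbers chosen for illustration, not figures from the problem.

```python
import random

rng = random.Random(42)

# Hypothetical electorate: exactly half would answer "No" to the tax question.
population = ["No"] * 50_000 + ["Yes"] * 50_000
rng.shuffle(population)
true_no_share = 0.5

# Assumed response rates: opponents are more motivated to respond.
response_rate = {"No": 0.10, "Yes": 0.03}

# Voluntary response: keep the first 4,200 people who choose to respond.
volunteers = []
for opinion in population:
    if rng.random() < response_rate[opinion]:
        volunteers.append(opinion)
    if len(volunteers) == 4_200:
        break

# Simple random sample of only 100 voters.
srs = rng.sample(population, 100)

volunteer_share = volunteers.count("No") / len(volunteers)
srs_share = srs.count("No") / len(srs)

print(f"true share of No:           {true_no_share:.2f}")
print(f"4,200 voluntary responses:  {volunteer_share:.2f}")
print(f"100 randomly chosen voters: {srs_share:.2f}")
```

Under these assumptions the voluntary-response estimate lands far from 0.50 even with 4,200 responses, while the random sample of 100 is noisy but centred on the truth.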

Section 6: Independent Practice Solutions


Problem 1 — The Full Picture (C1 + C2)

Pharmaceutical drug trial: 180 patients from 12 clinics, mean blood pressure drop of 14.2 mmHg.

(a) Population: All patients with high blood pressure who would be candidates for this drug (broadly: all adults with hypertension).
(b) Sample: The 180 patients enrolled from 12 Canadian clinics.
(c) Parameter: μ = the true mean blood pressure reduction for all high-blood-pressure patients who take this drug. Unknown — we only measured 180.
(d) Statistic: x̄ = 14.2 mmHg = sample mean reduction, computed from 180 patients.
(e) Concern: Volunteer enrollment introduces self-selection bias — patients who enroll may be more health-motivated and adherent to treatment than the general hypertensive population, potentially inflating the apparent benefit.

Problem 2 — Is It μ or x̄? (C1 + C2)

(a) Parameter — μ (in years). All 847 teachers were surveyed — the entire district = the population.
(b) Statistic — p̂ = 0.62 (sample proportion). The 500 polled Canadians are a sample; the 62% estimates the population proportion p of all Canadians who purchased online.
(c) Parameter. The registrar has all students’ records — the entire enrolled population was measured.
(d) Statistic — x̄ (in bpm). The 30 mice are a sample from the relevant population of lab mice.

The rule that resolves all ambiguity: Was the entire defined population measured? → Parameter. Was a subset measured? → Statistic.

Problem 3 — Variable Type Generator (C3)

This problem generates new scenarios randomly. The correct classification and explanation appear after you answer. General keys:

Nominal: Category labels, no order (blood type, department, colour)
Ordinal: Ordered categories, gaps may be unequal (ratings, education levels, military ranks)
Discrete: Countable whole numbers (number of children, defects, goals)
Continuous: Measured, any decimal possible (height, weight, time, temperature)

Problem 4 — Sampling Method Critique (C4) — Variant Bank

Core answers for all 5 variants:

Variant 0 (City website survey): Voluntary response / convenience. Bias: strongly-opinionated residents (pro or con park) over-respond. Fix: mail survey to SRS of registered residents.
Variant 1 (First 10 morning customers): Convenience sampling (time-of-day bias). Fix: systematic sampling of transaction receipts across all hours.
Variant 2 (Number each of 6,200 students, randomly select 300): Simple random sampling. Strength: unbiased. Weakness: by chance, some schools may be over/underrepresented. Improvement: stratify by school.
Variant 3 (Randomly select cities → tracts → households): Multistage sampling. Advantage: practical for dispersed populations without a complete national household list. Tradeoff: clustering effect reduces effective sample size.
Variant 4 (Select record #12, then every 20th): Systematic sampling. Bias risk: periodicity in filing order. Good choice when: records are in random order and a complete list exists.
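The systematic scheme in Variant 4, and its periodicity risk, can be sketched as follows. The file of 200 records and the repeating pattern are hypothetical; in practice the start position would be drawn at random from 1 to the step size rather than fixed at 12.

```python
def systematic_sample(records, start, step):
    """Return the record at 1-based position `start`, then every `step`-th record after it."""
    return records[start - 1::step]

# Hypothetical file of 200 records, identified by position 1..200.
records = list(range(1, 201))
sample = systematic_sample(records, start=12, step=20)
print(sample)  # [12, 32, 52, 72, 92, 112, 132, 152, 172, 192]

# Periodicity risk: suppose (hypothetically) a feature repeats every 20 records
# and happens to line up with the sampling interval.
flags = [pos % 20 == 12 for pos in records]
picked = systematic_sample(flags, start=12, step=20)
print(sum(flags) / len(flags))    # 0.05 of all records have the feature
print(sum(picked) / len(picked))  # 1.0 of the sampled records have it
```

When the filing order has no such pattern, the same procedure behaves much like a random sample, which is why the method is a good choice only for lists in effectively random order.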

Problem 5 — Bias in a News Story (C5)

Remote work productivity study: 1,200 remote employees at 5 tech companies; 84% self-report higher productivity.

(a) The conclusion is not justified. There is no control group, no objective productivity measure, and no random sample, and the word “proves” is too strong for any survey study.
(b) Biases: social desirability (remote workers have incentive to claim productivity), undercoverage (only tech workers at remote-friendly companies), and convenience sampling (no random selection from a defined population).
(c) Excluded: workers in non-remote industries, workers who tried remote work and returned to the office, workers in non-tech sectors. These exclusions bias the population estimate upward (remote-friendly tech workers are probably the most suitable candidates for remote work).

Problem 6 — Survey Question Critique (C6)

(a) “Don’t you agree that our school needs better sports facilities?” → Leading question. Fix: “Do you think our school’s sports facilities need improvement? [Yes / No / No opinion]”
(b) “Are you satisfied with the price and speed?” → Double-barrelled. Fix: two separate questions — one for price, one for speed.
(c) “How often do you exercise regularly?” → Ambiguous. Fix: “In an average week, on how many days do you exercise for at least 30 minutes? [0 / 1–2 / 3–4 / 5–6 / 7]”
(d) “Given that most experts agree climate change is serious, do you support carbon taxes?” → Leading premise. Fix: “Do you support the introduction of carbon taxes in Canada? [Yes / No / Unsure]”

Section 7: Mastery Check Solutions


Question 1 — Feynman Test

Model answer (there is no single correct response — evaluate your own answer against this):

“A population is the whole group you care about — like every student in Canada. A sample is just the part you actually study — like 200 randomly chosen students. We use samples because studying millions of people is too expensive and slow. A parameter describes the full population (like the true average GPA of all Canadian students — we probably can’t measure it exactly). A statistic describes the sample (like the average GPA of your 200 students — you computed it directly). The statistic is our best estimate of the parameter.”

Checklist for your own answer:

Did you explain why samples are used (not just that they are)?
Did you make clear that parameters are usually unknown and statistics are computed?
Did you use (or explain) the symbols μ and x̄?
Did you explain that the statistic estimates the parameter?

Question 2 — Apply (C3 + C4)

Coffee chain with 240 locations in 4 regions: Plan 1 (cluster, 30 stores) vs. Plan 2 (stratified by region).

(a) Plan 1 = Cluster sampling. Weakness: with only 30 of 240 stores, Atlantic (20 stores, only ~2–3 selected) may be barely represented. Opinions in small regions could be washed out.
(b) Plan 2 = Stratified sampling by region. Fix: guarantees each region is represented proportionally to its store count — Atlantic always gets 5 surveyed stores regardless of random fluctuation.
(c) Satisfaction rating = Qualitative — Ordinal. Ordered categories (Very Dissatisfied through Very Satisfied) with unequal gaps. Implication: report mode and percentage per category, not a simple mean. Computing “average satisfaction = 3.7” treats ordinal data as interval, which overclaims precision.
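The “~2–3 selected” figure in part (a) can be checked with a short calculation. This sketch assumes Plan 1 draws its 30 stores as a simple random sample from all 240, which the solution above does not state explicitly.

```python
from math import comb

total_stores = 240
atlantic_stores = 20
stores_drawn = 30  # Plan 1, assumed here to be a simple random draw of stores

# Expected number of Atlantic stores in the draw.
expected_atlantic = stores_drawn * atlantic_stores / total_stores
print(expected_atlantic)  # 2.5

# Chance that the draw contains no Atlantic store at all (hypergeometric):
# all 30 stores must come from the 220 non-Atlantic locations.
p_no_atlantic = comb(total_stores - atlantic_stores, stores_drawn) / comb(total_stores, stores_drawn)
print(f"{p_no_atlantic:.3f}")
```

The second number is the risk that stratifying by region removes entirely, since a stratified plan fixes the Atlantic count in advance.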

Question 3 — Find the Error (C4 + C5)

Food blogger Instagram poll: 847 responses, 91% prefer traditional bagels. Report calls it stratified sampling, calls 91% “μ,” and calls the variable quantitative-continuous.

Four errors in the student’s analysis:

“Huge random sample”: Voluntary response / convenience sampling — not random. Followers who engage with food content are not representative of all Montrealers.
“Reliable estimate”: A biased large sample is less reliable than an unbiased small one. The 91% is skewed by the food-enthusiast audience.
“μ” for a proportion: The population proportion is written p, not μ. The sample proportion is p̂. μ is reserved for means.
“Stratified sampling”: Stratification requires deliberate a priori division of the population into strata before sampling. Post-hoc grouping of respondents is not stratification. This was voluntary response sampling.

Section 8: Boss Fight Solutions


Path A — The Analyst: Screen Time Survey

Full solutions appear in the lesson’s Boss Fight section. Summary of errors found:

Misidentified as stratified sampling (actually voluntary response / convenience)
Misused μ for a sample-computed value (should be x̄)
Misclassified screen time as qualitative-ordinal (it’s quantitative-continuous)
Used a leading question (“excessive hours”) that biases responses downward
Non-response bias and undercoverage from the newsletter subscriber convenience sample

Path B — The Architect: CEGEP Well-Being Study

Key design decisions:

Method: Stratified sampling by institution size (small/medium/large), proportional allocation to enrollment, random selection within strata
Parameter/Statistic: μ = true mean well-being score; x̄ = sample mean from 3,000 students
Survey questions: Neutral wording, one concept per question, bounded time frame
Residual bias: Non-response from students in greatest distress (hardest to mitigate)

Section 9: Challenge Problem Solutions


Challenge 1 — Can a Statistic Equal a Parameter? (Variant Bank)

Core insight for all variants: x̄ = μ can happen by coincidence, but it’s rare and doesn’t mean the sample “perfectly represents” the population. The key concept is sampling variability — different samples give different statistics. A statistic that happens to equal the parameter is still just one observation from the distribution of all possible sample means.

In Variant 1, none of the 6 possible size-2 samples produced x̄ = μ. But the average of all 6 sample means = 60 = μ. This illustrates unbiasedness: x̄ is an unbiased estimator of μ (correct on average, not necessarily in any given sample).
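The same pattern can be reproduced by enumerating every size-2 sample from a four-value population. The values below are hypothetical, chosen only so that μ = 60 and no pair averages to 60; they are not the actual Variant 1 data.

```python
from itertools import combinations

# Hypothetical population of 4 values with mean 60 (not the lesson's data).
population = [40, 50, 65, 85]
mu = sum(population) / len(population)

# All 6 possible samples of size 2, and the mean of each one.
sample_means = [sum(pair) / 2 for pair in combinations(population, 2)]

print(mu)                                    # 60.0
print(sample_means)                          # [45.0, 52.5, 62.5, 57.5, 67.5, 75.0]
print(mu in sample_means)                    # False
print(sum(sample_means) / len(sample_means)) # 60.0
```

No single sample hits μ, yet the six sample means average to exactly μ, which is what unbiasedness claims.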

Challenge 2 — Multistage vs. Cluster Design

(a) 30 × 900 = 27,000 students. Not practical.
(b) Two-stage: Stage 1 — randomly select 60 schools; Stage 2 — randomly select 50 students per school. Total = 3,000.
(c) Multistage gives more geographic/demographic diversity per student surveyed. Tradeoff: requires agreements with 60 schools vs. 30, but each school provides only 50 participants.
(d) Population proportion of students spending >3h/day on social media is written p; the sample estimate is p̂.
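The two designs in parts (a) and (b) can be compared with a short sketch. The sampling frame of 400 schools is hypothetical (only the 900 students per school, and the 30, 60 and 50 figures, come from the solution above), and `two_stage_sample` is an illustrative helper name.

```python
import random

def two_stage_sample(schools, n_schools, n_students, rng):
    """Stage 1: pick schools at random. Stage 2: pick students at random within each."""
    chosen = rng.sample(sorted(schools), n_schools)
    return {name: rng.sample(schools[name], n_students) for name in chosen}

rng = random.Random(7)

# Hypothetical frame: 400 schools, each with student IDs 0..899.
schools = {f"school_{k:03d}": range(900) for k in range(400)}

# (a) Cluster design: 30 whole schools.
cluster_total = 30 * 900
print(cluster_total)  # 27000

# (b) Two-stage design: 60 schools, 50 students from each.
sample = two_stage_sample(schools, n_schools=60, n_students=50, rng=rng)
two_stage_total = sum(len(students) for students in sample.values())
print(len(sample), two_stage_total)  # 60 3000
```

The two-stage draw reaches twice as many schools while surveying one ninth as many students.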

Challenge 3 — Why Convenience Sampling is Always Biased

(a) Frequent exercisers have a higher selection probability; non-gym-goers have a near-zero selection probability.
(b) x̄ overestimates μ (exercise time inflated by overrepresentation of heavy exercisers).
(c) A larger sample still draws from the biased pool. Bigger sample = more precise estimate of the wrong thing.
(d) Proof sketch: unequal selection probabilities mean individuals with systematically different values are over/underweighted in x̄. This creates a systematic gap between the expected value of x̄ and μ. Increasing sample size reduces variance but cannot eliminate the systematic deviation caused by the biased selection mechanism.
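Part (c) can be illustrated with a simulation. Everything here is an assumption made for the sketch: the population of 50,000 people with exponentially distributed weekly exercise minutes, and a selection mechanism in which the chance of being surveyed is proportional to minutes exercised.

```python
import random

rng = random.Random(3)

# Hypothetical population: weekly exercise minutes for 50,000 people (mean near 90).
population = [rng.expovariate(1 / 90) for _ in range(50_000)]
mu = sum(population) / len(population)

# Assumed selection mechanism: the chance of being surveyed at the gym is
# proportional to minutes exercised. Draws are made with replacement.
estimates = {}
for n in (400, 4_000, 40_000):
    sample = rng.choices(population, weights=population, k=n)
    estimates[n] = sum(sample) / n

print(f"population mean: {mu:.1f}")
for n, x_bar in estimates.items():
    print(f"n = {n:>6}: sample mean = {x_bar:.1f}, gap = {x_bar - mu:.1f}")
```

As n grows the sample means settle down, but they settle on a value well above μ: the spread shrinks while the gap stays.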
