Statistical Vocabulary and Sampling

In 1936, a major American magazine called the Literary Digest mailed 10 million questionnaires asking people who they planned to vote for in the upcoming presidential election. Nearly 2.4 million people responded — an enormous number by any standard. Based on those responses, the magazine confidently predicted that Alfred Landon would defeat incumbent President Franklin Roosevelt in a landslide.

Roosevelt won by one of the largest margins in American electoral history.

How? The Literary Digest had a sampling problem. They drew their mailing list from telephone directories and automobile registration records — meaning their respondents were systematically wealthier than the broader voting population, which in 1936 was still recovering from the Great Depression. And people who felt strongly enough to mail back a questionnaire were not typical of all voters. Nearly 2.4 million responses, and every single one came from a biased pool. The enormous sample size made no difference — the damage was done at the point of selection.

That story illustrates a truth that runs through everything in statistics: the quality of your conclusions can never exceed the quality of your data collection. Before we can talk about calculating averages, drawing graphs, or testing hypotheses, we need to get the foundations right — the vocabulary, the concepts, and the methods that determine whether our data is trustworthy in the first place.

This lesson is where all of statistics begins.

After this lesson, you will be able to:

Distinguish a population from a sample, and explain why samples are used
Tell apart a parameter and a statistic, and match the correct notation to each
Classify any variable as qualitative or quantitative, and identify its sub-type (nominal, ordinal, discrete, or continuous)
Identify and compare six common sampling methods, and explain the strengths and weaknesses of each
Recognize sources of bias in sampling and explain how bias affects the validity of conclusions
Evaluate basic survey design for problems with question wording and non-response

This is Lesson DS-1 — the first lesson of the entire course. Everything that follows — probability, confidence intervals, hypothesis tests, regression — builds on the concepts you’ll learn here. Spend time with the vocabulary. It’s the language the rest of the course is written in.

Before diving into the mechanics of data collection, ensure you are comfortable with the basic mathematical language used to describe groups and values.

Sets and Subsets: A “population” is the complete set of all items of interest. A “sample” is a subset of that population. (This logic is the foundation for all sampling methods in C4.)
Percentages and Proportions: Converting counts to percentages. You will use this to describe sample compositions and error rates.
Variables and Values: A variable is a characteristic (like “height” or “color”); a value is the specific measurement (like “175 cm” or “Red”).
Rounding Rules: In statistics, we typically round to 2 or 3 decimal places.

Orientation Question

Here is the kind of question this lesson will teach you to answer easily. Take a guess before reading on — it doesn’t matter if you get it wrong right now.

A researcher wants to study the sleeping habits of all 5,000 students at a college. They successfully interview 200 of these students. Which of the following correctly identifies the sample?


If you weren’t sure: That’s completely expected at this stage — this is the opening lesson. The difference between “population” and “sample” is explained in detail in C1, just ahead. Read it carefully; the rest of the lesson builds on it.

C1 — Population and Sample

Every statistical study begins with a question about a group. That group is called the population. But studying an entire population is usually impossible — it’s too large, too expensive, or simply impractical. So instead, we study a subset of it: a sample.

Population

The population is the complete set of all individuals, objects, or measurements of interest in a study. It is the group we want to draw conclusions about.

Notation: population size is written N (capital letter).

Examples: all adults in Quebec; every manufacturing component produced last month; all registered voters in Canada.

Sample

A sample is a subset of the population actually selected for study. We observe the sample and use it to make inferences about the population.

Notation: sample size is written n (lowercase letter).

Examples: 400 adults randomly selected from Quebec; 50 components pulled from last month’s production; 1,200 registered voters contacted by phone.

The key idea: we study the sample, but our goal is always to say something about the population. The sample is a means to an end — the population is what we care about.

A common error is to define the population as “the people I surveyed” — that’s the sample. The population is the group you want to generalize to, whether you’ve reached them yet or not. Ask yourself: “Who do I want my conclusions to apply to?” That’s your population.

C2 — Parameter and Statistic

Now that we have a population and a sample, we need to describe them numerically. This is where a critical distinction enters — one that the entire course depends on.

Parameter

A parameter is a numerical value that describes a characteristic of the population. Parameters are usually unknown — we can’t measure the whole population.

Common parameters:

Population mean: μ (the Greek letter “mu”)
Population standard deviation: σ (the Greek letter “sigma”)
Population size: N

Statistic

A statistic is a numerical value computed from sample data. Statistics are known (we calculated them), but they vary from sample to sample.

Common statistics:

Sample mean: x̄ (read “x-bar”)
Sample standard deviation: s
Sample size: n

The relationship to remember: statistics estimate parameters. We compute x̄ from our sample and use it as our best guess for the unknown μ.

Memory trick: Parameter → Population (both start with P). Statistic → Sample (both start with S). Greek letters (μ, σ) go with the population; Roman letters (x̄, s) go with the sample.

The most common error in this lesson: Treating a sample statistic as if it were the population parameter. For example, reporting “the sample average score is x̄” is fine, but then concluding “so the population mean μ is exactly that value” is an overstatement. The statistic estimates the parameter; it is not equal to it. The difference matters for everything that comes later.

Coming in DS-2: The symbols σ (population standard deviation) and s (sample standard deviation) are introduced above so the notation is familiar when you meet them. For now, focus on μ and x̄; those are the only two you will use in this lesson’s problems.

Figure 1: A population of 50 people (left) and a random sample (right). Click Draw Sample to select a new group; x̄ shifts every time, but μ stays fixed. Switch sample sizes (n = 5 / 10 / 20) and watch the history strip: larger n means x̄ clusters more tightly around μ.

Figure 2: The four-element framework of statistical inference. The left side lives in the population: values that exist but are usually unknown. The right side lives in the sample: values we compute from data. The arrow captures the entire purpose of sampling: use what you know (x̄) to estimate what you can't observe (μ).
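The idea that x̄ changes from sample to sample while μ stays put can be tried out in a few lines of Python. This is a minimal sketch rather than part of the course materials: the population of 50 scores is randomly generated for illustration, and the names `population`, `mu`, and `sample_means` are invented for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: N = 50 made-up test scores.
population = [rng.randint(40, 100) for _ in range(50)]
mu = statistics.mean(population)  # parameter (known here only because we invented the population)


def sample_means(n, draws=200):
    """Draw `draws` simple random samples of size n and return each sample's mean (x-bar)."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(draws)]


spread = {}
for n in (5, 10, 20):
    means = sample_means(n)
    spread[n] = statistics.pstdev(means)  # how much x-bar varies from sample to sample
    first_three = [round(m, 1) for m in means[:3]]
    print(f"n={n:2d}  first x-bars: {first_three}  spread of x-bar: {spread[n]:.2f}")

print(f"mu = {mu:.2f} (the same for every draw)")
```

The printed spread should shrink as n grows, which is the tighter clustering around μ that larger samples produce.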

C3 — Types of Variables

Before collecting data, you need to know what kind of data you’re dealing with. The type of variable determines which graphs, which summaries, and which statistical tests are appropriate. Using the wrong tool for the wrong type of variable is one of the most common mistakes in applied statistics.

Qualitative (Categorical) Variables

A qualitative variable classifies individuals into categories. The values are labels or names, not numbers you can do arithmetic with.

Nominal: Categories have no natural ranking. You can only check equality, not order.

Examples: eye colour (brown, blue, green); country of birth; favourite music genre; blood type (A, B, AB, O).

Ordinal: Categories have a meaningful order, but the gaps between ranks are not necessarily equal.

Examples: education level (high school < college < university < graduate); customer satisfaction (poor / fair / good / excellent); pain scale (1 through 10 as labels).

Quantitative (Numerical) Variables

A quantitative variable takes numerical values where arithmetic makes sense. Differences are meaningful, and for many variables (such as height or counts) ratios are too.

Discrete: Takes only countable values — usually whole numbers. You can list them (at least in principle).

Examples: number of children in a household; number of defects per batch; number of goals scored in a game.

Continuous: Can take any value in an interval — including all decimals. Measured, not counted.

Examples: height (171.3 cm, 171.31 cm, …); temperature; time to complete a task; blood pressure.

Trap 1: A variable coded with numbers is not automatically quantitative. A postal code made of digits is still just a label; you can’t compute an “average postal code.” Phone numbers, student ID numbers, and jersey numbers are nominal.

Trap 2: Pain scores rated 1–10 are often treated as quantitative, but technically they’re ordinal — the gap between “3 out of 10” and “4 out of 10” may not equal the gap between “7 out of 10” and “8 out of 10.” Context matters for this distinction.

Trap 3: Age in years (18, 19, 20…) looks discrete, but age is really continuous — you’re always some fractional number of years old. The measurement convention creates the appearance of discreteness.

Ordinal data and arithmetic don’t mix. When a rating scale is coded as numbers (1 = Terrible, 5 = Excellent), it is tempting to compute an average. But ordinal gaps are not guaranteed to be equal — the jump from “Poor” to “Okay” may not represent the same real-world distance as the jump from “Good” to “Excellent.” Throughout this course, Likert-scale and ranked-category responses are treated as ordinal. A more honest summary uses the median category or a frequency table. You will see exactly why this matters in DS-2.
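To make the "median category or frequency table" summary concrete, here is a small Python sketch. The nine responses are invented for illustration, and the five labels in `LEVELS` are an assumed version of the 1-to-5 scale mentioned above.

```python
from collections import Counter

# Ordered categories, lowest to highest (the order is what makes the scale ordinal).
LEVELS = ["Terrible", "Poor", "Okay", "Good", "Excellent"]

# Hypothetical responses, invented for illustration.
responses = ["Good", "Okay", "Excellent", "Poor", "Good",
             "Good", "Okay", "Terrible", "Excellent"]

# Frequency table, listed in scale order.
counts = Counter(responses)
for level in LEVELS:
    print(f"{level:<10} {counts[level]}")

# Median category: sort the responses by rank and take the middle one.
ranked = sorted(responses, key=LEVELS.index)
median_category = ranked[len(ranked) // 2]  # odd number of responses, so one middle value
print("Median category:", median_category)
```

Notice that nothing here adds or averages the labels; the summary relies only on their order.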

Figure 3: The variable-type hierarchy. Click any example below the tree to trace its classification path. Each leaf node shows a distinct colour used consistently in this course. The ⚠ examples are traps: variables that look like one type but are actually another.

C4 — Sampling Methods

How you select your sample determines whether it will accurately represent the population. There are many methods — each with different strengths, weaknesses, and appropriate contexts.

Simple Random Sampling (SRS)

Every possible sample of size n has an equal probability of being selected. This is the gold standard for unbiased sampling.

How: Number every individual in the population 1 through N; use a random number generator or lottery to select n of them.

Strength: Unbiased; produces the mathematical properties statistics relies on.

Weakness: Requires a complete sampling frame — a list of every individual in the population — which is often unavailable or expensive to build.
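The "number everyone, then pick at random" recipe can be sketched in Python as follows. The values of `N` and `n` are hypothetical, and the fixed seed is there only so the example is repeatable.

```python
import random

rng = random.Random(2024)  # fixed seed only so the example is repeatable

N = 500   # population size (hypothetical)
n = 50    # sample size (hypothetical)

# Sampling frame: every individual numbered 1 through N.
frame = list(range(1, N + 1))

# random.sample draws n distinct individuals without replacement,
# so every subset of size n is equally likely.
srs = rng.sample(frame, n)
print(sorted(srs))
```

The code only works because `frame` lists every individual, which is exactly the requirement named in the weakness above.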

Stratified Sampling

Divide the population into homogeneous subgroups called strata (e.g., age groups, departments, regions). Then draw a simple random sample from each stratum.

Strength: Guarantees representation of important subgroups; can produce more precise estimates than SRS for the same total sample size.

Weakness: Requires knowing the strata boundaries in advance; more complex to execute than SRS.
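One way to sketch stratified sampling in Python is shown below. The population, the three age-group strata, and the helper name `stratified_sample` are all made up for illustration. Taking the same fraction from every stratum (proportional allocation) is an assumption of this sketch; the method itself only requires a random sample from each stratum.

```python
import random

rng = random.Random(7)

# Hypothetical population: (person_id, stratum) pairs with three age-group strata.
population = (
    [(i, "18-34") for i in range(0, 300)]
    + [(i, "35-54") for i in range(300, 500)]
    + [(i, "55+") for i in range(500, 600)]
)


def stratified_sample(units, fraction):
    """Draw a simple random sample of the same fraction from every stratum."""
    strata = {}
    for unit_id, stratum in units:
        strata.setdefault(stratum, []).append(unit_id)
    return {
        name: rng.sample(members, round(len(members) * fraction))
        for name, members in strata.items()
    }


sample = stratified_sample(population, fraction=0.10)
for name, chosen in sample.items():
    print(name, len(chosen))
```

Every stratum appears in the result, which is the guarantee a plain SRS cannot make.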

Systematic Sampling

Number every individual, choose a random starting point, then select every k-th individual (where k = N/n).

Example: To sample 50 from 500, select a random start between 1 and 10, then take every 10th person after that.

Strength: Easy to implement; spreads the sample evenly across the population.

Weakness: Biased if there is a periodic pattern in the list that aligns with the interval k.
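The 50-from-500 example can be written out as a short Python sketch. The variable names are invented here, and the fixed seed is only for repeatability.

```python
import random

rng = random.Random(11)

N = 500          # population size, from the example in the text
n = 50           # desired sample size
k = N // n       # sampling interval: every 10th individual

start = rng.randint(1, k)                 # random start between 1 and k, inclusive
selected = list(range(start, N + 1, k))   # start, start + k, start + 2k, ...

print("start:", start)
print("first five selected:", selected[:5])
```

Only the starting point is random; once it is chosen, the rest of the sample is fixed, which is why a periodic pattern in the list can cause trouble.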

Cluster Sampling

Divide the population into heterogeneous groups called clusters (e.g., schools, city blocks, hospitals). Randomly select a few clusters, then survey every individual in the selected clusters.

Strength: Very cost-effective when the population is geographically dispersed and there is no complete list of individuals.

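Weakness: Usually higher sampling error than SRS for the same number of individuals, because in practice people in the same cluster tend to resemble one another, so each extra person from a cluster adds less new information.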
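Here is a minimal Python sketch of the two steps in cluster sampling. The 12 schools of 25 students each are hypothetical, as are all the names used.

```python
import random

rng = random.Random(3)

# Hypothetical population: 12 schools (clusters) with 25 students each.
clusters = {
    f"school_{c:02d}": [f"s{c:02d}_{i:02d}" for i in range(25)]
    for c in range(12)
}

# Step 1: randomly select a few whole clusters.
chosen_clusters = rng.sample(sorted(clusters), 3)

# Step 2: survey every individual in the selected clusters.
sample = [student for name in chosen_clusters for student in clusters[name]]

print(chosen_clusters, len(sample))
```

The random step happens at the level of schools, not students, so no list of individual students is needed until a school has been chosen.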

Multistage Sampling

A combination of methods applied in stages. Common in large national surveys.

Example: Stage 1 — randomly select provinces. Stage 2 — within each province, randomly select cities. Stage 3 — within each city, randomly select households. Stage 4 — interview one adult per household.

Strength: Practical for very large populations spread across wide areas.

Weakness: Complex to design; errors can compound across stages.
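The staged selection in the example can be sketched in Python like this. The frame of 6 provinces, 4 cities per province, and 40 households per city is entirely made up, and the number selected at each stage is an arbitrary choice for illustration.

```python
import random

rng = random.Random(5)

# Hypothetical frame: province -> city -> list of household ids.
frame = {
    f"province_{p}": {
        f"city_{p}{c}": [f"hh_{p}{c}_{h:03d}" for h in range(40)]
        for c in "abcd"
    }
    for p in "ABCDEF"
}

households = []
for province in rng.sample(sorted(frame), 2):        # Stage 1: provinces
    cities = frame[province]
    for city in rng.sample(sorted(cities), 2):       # Stage 2: cities within each province
        households += rng.sample(cities[city], 5)    # Stage 3: households within each city

# Stage 4 (choosing one adult per household) would happen at interview time.
print(len(households), households[:3])
```

Each stage only needs a list for the units selected in the stage before it, which is what makes the approach practical for very large populations.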

Convenience Sampling

Select whoever is easiest to reach — the first n people who walk by, the first respondents to an online survey, volunteers who self-select.

Strength: Fast and cheap.

Weakness: Almost always biased. The people who are easiest to reach are systematically different from the rest of the population.

Cluster ≠ Stratified. Both divide the population into groups — but they work opposite ways. In stratified sampling, groups are homogeneous (similar inside), and you sample from all groups. In cluster sampling, groups are heterogeneous (diverse inside), and you sample only some groups entirely. A school’s students divided by grade = strata (homogeneous within grade). A school’s students divided by homeroom class = clusters (diverse within class).

Figure 4: A population of 56 people arranged in a grid. Each tab shows which people get selected under that sampling method. For Stratified sampling, dot colour shows stratum membership; for Cluster, it shows cluster membership.

C5 — Bias in Sampling

A sample is biased when it systematically favours certain outcomes over others — when it consistently misrepresents the population in one direction. Bias is not random error; it doesn’t cancel out with larger sample sizes. A biased sample of 10,000 people can be less trustworthy than an unbiased sample of 100.

Common Sources of Sampling Bias

Undercoverage: Some segments of the population have a lower probability (or zero probability) of being selected. Example: An online survey excludes people without internet access.
Voluntary response bias: People who feel strongly about an issue are more likely to respond. Example: Calling in to a radio poll about crime — only the angriest listeners call.
Non-response bias: People selected for the sample don’t respond, and those who don’t respond differ systematically from those who do. Example: A survey on work satisfaction — disengaged employees may ignore it.
Convenience bias: Any convenience sample is systematically biased toward whoever is easiest to reach.

Bias affects the validity of conclusions: a biased sample may give us an accurate picture of who responded, but a distorted picture of the population we actually care about.
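The claim that a large biased sample can be less trustworthy than a small unbiased one can be explored with a simulation. The sketch below is hypothetical throughout: the population of 100,000 voters, the 30% "wealthy" share, and the 60% versus 35% support rates are invented numbers chosen only to make undercoverage visible.

```python
import random
import statistics

rng = random.Random(1936)

# Hypothetical population of 100,000 voters: 1 = supports candidate A, 0 = does not.
# Wealthier voters (30% of the population) support A at 60%; everyone else at 35%.
population = []
for _ in range(100_000):
    wealthy = rng.random() < 0.30
    supports = rng.random() < (0.60 if wealthy else 0.35)
    population.append((wealthy, int(supports)))

true_p = statistics.mean(s for _, s in population)   # the parameter

# Biased design: a huge sample drawn only from the wealthy (undercoverage).
wealthy_only = [s for w, s in population if w]
biased_est = statistics.mean(rng.sample(wealthy_only, 10_000))

# Unbiased design: a small simple random sample from everyone.
unbiased_est = statistics.mean(s for _, s in rng.sample(population, 100))

print(f"true proportion      {true_p:.3f}")
print(f"biased,   n = 10,000 {biased_est:.3f}")
print(f"unbiased, n = 100    {unbiased_est:.3f}")
```

The biased estimate sits near the wealthy group's support rate no matter how many people are sampled, while the small random sample scatters around the true value. Increasing the biased sample size only makes the wrong answer more stable.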

Figure 5: Each dot is one sample's estimate of μ (mean weekly exercise hours). The four quadrants cross bias (centred on μ vs. off-target) with precision (tight vs. spread) — because the two are independent. Read the columns: the left (unbiased) column centres on μ whether tight or spread; the right (biased) column misses μ either way. The top-right cell is the trap — a precise, confident-looking estimate that is reliably wrong. Bias is not random error: a bigger sample shrinks the spread but never pulls a biased cloud back onto μ.

C6 — Survey Design Principles

Enrichment — beyond the assessed standards. Survey-question design is valuable real-world knowledge, but it isn’t one of this lesson’s graded standards (DS-1 covers vocabulary, classification, sampling methods, and sampling bias — i.e. who you ask, not how you word the question). Read this to become a sharper consumer of surveys; it won’t appear on a standards assessment.

Even with a perfectly selected sample, a poorly designed survey can still produce bad data. The questions themselves can introduce error.

Sources of Response Error in Surveys

Leading questions: Wording that pushes respondents toward a particular answer. Bad: “Do you agree that the current tax rate is too high?” Better: “Do you think the current tax rate is too high, about right, or too low?”
Double-barrelled questions: Two questions combined into one. Bad: “Are you satisfied with the price and quality of the product?” (What if you love the quality but not the price?)
Social desirability bias: Respondents give answers they think are socially acceptable, not honest ones. Example: Self-reported exercise frequency, hours spent studying, or income.
Ambiguous wording: Questions that different respondents interpret differently. Bad: “Do you eat regularly?” (What does “regularly” mean?)
Order effects: The order of questions influences answers. Asking about satisfaction with a specific product before asking about overall satisfaction changes the overall satisfaction rating.

Figure 6: Four common survey design flaws. For each, the problematic word or phrase is highlighted in the original question. Click a flaw type to examine it — then compare the original against the corrected version.

Good survey design rules of thumb:

Ask one thing per question
Use neutral wording — no loaded language
Pilot test the survey with a small group before the real study
Keep the questionnaire as short as possible (longer surveys = more non-response)
Guarantee anonymity where possible (reduces social desirability bias)

Let’s work through some examples together — classification and identification problems that look exactly like what you’ll see in practice. I’ll narrate my thinking at each step, not just the final answer.

Example 1 — Fully Worked: The Four Elements

Scenario: A nutritionist wants to know the average daily sugar intake of Canadian adults. She recruits 250 adult volunteers from three cities, records their sugar intake for a week, and computes an average of 87 g/day.

Identify: (a) the population, (b) the sample, (c) the parameter of interest, (d) the statistic.

Step 1 — Identify each element.

Let’s work through each piece systematically.

(a) Population: “Canadian adults” — the entire group the nutritionist wants to draw conclusions about. Not just adults in three cities; she wants to generalize to all Canadian adults. This is a large, defined group that was never fully studied.

(b) Sample: The 250 adult volunteers recruited from three cities. This is the subset actually measured. Note: these are volunteers, which raises concerns about bias (voluntary response) — we’ll come back to that.

(c) Parameter of interest: The average daily sugar intake of all Canadian adults, written μ. This is unknown. We didn’t measure every Canadian adult, so μ is never directly observed.

(d) Statistic: The sample average of 87 g/day, written x̄. This was computed from the 250 participants. It’s our best estimate of μ.

Key takeaway: We know x̄ = 87 g/day with certainty (we computed it). But we only estimate that μ ≈ 87 g/day, and given the voluntary-response design, that estimate might be off. The statistic is certain; the inference to the parameter is always uncertain.

Example 2 — Partially Scaffolded: Which Sampling Method?

Scenario: A university wants to survey its 8,000 students about library resources. Here are three different approaches they could use:

Approach A: Assign each student a number 1–8000, generate 300 random numbers, survey those students.
Approach B: Divide students by faculty (Science, Arts, Engineering, Business, etc.). Randomly select 50 students from each faculty.
Approach C: Post a survey link on the university’s social media page and survey whoever responds.

Before looking at the solution: Write down your classification for each approach. What method is each one? What’s the key clue in each description?

Solution

Approach A — Simple Random Sampling (SRS). Every student has an equal chance of being selected; we used a random number generator to choose. The defining clue: “generate random numbers” applied to a complete numbered list.

Approach B — Stratified Sampling. The strata are the faculties (homogeneous subgroups). We sampled randomly from every stratum. Key clue: “divide students by faculty” (the strata) + “randomly select from each” (the sampling step).

Approach C — Convenience Sampling (voluntary response). Posting a link and surveying whoever responds is the definition of voluntary response bias. Only students who feel motivated to click will respond — probably not a representative cross-section. Key clue: “whoever responds.”

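Which approach would you recommend? Approach B is a strong choice: it guarantees that every faculty is represented, and stratifying can give more precise estimates than SRS for the same sample size. One caution: taking 50 students from every faculty is equal allocation, not proportional allocation, so overall results should be weighted by faculty enrollment. Approach A is also good, but wouldn’t guarantee any particular faculty is represented. Approach C is problematic: students with strong opinions (for example, those frustrated with the library) are likely to be overrepresented.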

Example 3 — Minimally Scaffolded: Variable Types in the Wild

Scenario: A hospital collects the following data for each patient admitted to the emergency room:

(a) Patient’s primary language
(b) Triage level (1 = critical, 2 = urgent, 3 = less urgent, 4 = non-urgent, 5 = minor)
(c) Time (in minutes) from arrival to first physician contact
(d) Number of previous ER visits in the past year
(e) Discharge status (discharged / admitted / transferred)

Hint: For each variable, ask yourself (1) Is it a category label or a number you can do arithmetic with? (2) If it’s a category, does it have a natural order? (3) If it’s a number, can you get fractional values?

Solution

(a) Primary language → Qualitative, Nominal. Languages are category labels with no natural order. “Spanish” is not greater than or less than “Mandarin.”

(b) Triage level → Qualitative, Ordinal. The levels 1–5 have a meaningful order (1 is more severe than 5), but the gaps between levels aren’t guaranteed to be equal. The difference in severity between levels 1 and 2 need not match the difference between levels 4 and 5. Ordinal: ordered categories whose gaps may be unequal.

(c) Time to physician contact → Quantitative, Continuous. Time can take any value in a range — theoretically 14.3 minutes, 14.37 minutes, 14.371 minutes. It’s measured (not counted) and any decimal value is possible.

(d) Number of previous ER visits → Quantitative, Discrete. You count visits — you can’t have 2.7 ER visits. It takes whole number values and the gaps are equal.

(e) Discharge status → Qualitative, Nominal. Three categories with no natural order — “admitted” is not greater than “transferred.” Category label, no ranking.

Example 4 — Application Twist: Spotting Hidden Bias

Scenario: A municipal government wants to know whether residents support a new cycling infrastructure project that would repurpose some car lanes. They commission a survey and mail questionnaires to 2,000 randomly selected registered voters. They receive 380 completed responses.

The results: 71% of respondents support the project.

A city councillor says: “Great news — 71% of voters support this.” Is this claim justified? What concerns should be raised?

Solution

This is a realistic example of how a technically “random” sample can still produce unreliable conclusions.

Problem 1 — Non-response bias: Only 380 of 2,000 responded — a 19% response rate. The 81% who didn’t respond may have very different opinions. Who doesn’t bother to mail back a survey about cycling infrastructure? Possibly people who rarely cycle and have no strong motivation to engage with the topic. We simply don’t know why they didn’t respond — and that uncertainty is the problem.

Problem 2 — Wording matters: We don’t know how the question was phrased. “Do you support new cycling infrastructure?” might get different responses than “Do you support repurposing car lanes for cyclists?” Both describe the same project, but one activates different associations.

Problem 3 — The 71% is a statistic, not the parameter: The councillor said “71% of voters support this.” But the 71% is p̂ (sample proportion), not p (population proportion). The inference to all voters is valid only if the sample is unbiased, and because of the non-response issue, it isn’t.

The honest summary: “71% of the 380 respondents support the project. However, non-response bias may make this estimate unreliable as a measure of the full voter population’s opinion.”
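To see how much room the non-response leaves, it can help to compute the response rate and the two extreme cases. This is a rough sketch, not a formal method: the count of 270 supporters is an assumption (about 71% of 380, rounded to a whole person), and the bounds describe only the 2,000 voters who were mailed a questionnaire.

```python
mailed = 2000        # questionnaires sent (from the scenario)
returned = 380       # completed responses (from the scenario)
supporters = 270     # assumption: about 71% of 380, rounded to a whole person

response_rate = returned / mailed
p_hat = supporters / returned                 # sample proportion among respondents

non_respondents = mailed - returned
# Extreme cases for the 2,000 people who were mailed a questionnaire:
lowest = supporters / mailed                         # every non-respondent opposes
highest = (supporters + non_respondents) / mailed    # every non-respondent supports

print(f"response rate: {response_rate:.1%}")
print(f"p-hat among respondents: {p_hat:.1%}")
print(f"possible support among all 2,000 mailed: {lowest:.1%} to {highest:.1%}")
```

With 1,620 people unaccounted for, true support among those mailed could be anywhere from 13.5% to 94.5%. Neither extreme is likely, but the width of that range shows why the councillor's claim goes beyond the data.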

Work through each problem below. The dropdown will tell you immediately whether you’ve chosen the right answer — and if not, it explains exactly why the other choices are wrong.

Problem 1 — The Four Elements (C1 + C2)

A sports analytics company wants to know how many minutes per week professional basketball players spend on strength training. They contact 60 players from the NBA and find an average of 280 minutes per week.

Step 1: What is the population?

Step 2: What is the statistic?

Step 3: The parameter of interest is best written as:

Problem 2 — Notation Match (C1, C2)

Match each description to the correct notation.

2a. The average salary of all 4,500 employees at a company is $67,200.

2b. A researcher randomly selects 80 apartments in Montreal and finds the average rent is $1,450/month.

Problem 3 — Bias Identification and Direction (C5)

For each scenario, identify (a) the type of bias present and (b) the direction of the bias — whether it likely makes the estimate too high or too low.

Scenario A: A television station invites viewers to text their opinion on whether a proposed tax increase should pass. 78% of the 4,200 responses say “No.”

What type of bias is most clearly present?

Which direction does this bias most likely push the estimate?

Try these problems on your own. Solutions are hidden — attempt each one before revealing. The problems are intentionally interleaved across all six concepts.

Problem 1 — Population or Sample? (C1)

Identify the population or the sample in each study. Each answer is checked immediately, and the dropdown explains why the other choices are wrong.

Problem 2 — Parameter or Statistic? (C2)

For each study, decide whether the reported number is a parameter (from a census of the whole population) or a statistic (from a sample). The reveal names the matching symbol.

Problem 3 — Variable Type (Generator) (C3)

Problem 4 — Sampling Method Critique (C4)

Problem 5 — Bias in a News Story (C5)

Read the following excerpt, then answer the questions.

“A new study proves that people who work from home are more productive than those who work in offices. Researchers from TechTrend International surveyed 1,200 fully remote employees at five tech companies. 84% reported that they were more productive when working from home.”

First, identify the study design:

Now analyze the conclusions:

(a) Is the conclusion (“people who work from home are more productive”) justified by the study? Explain. (b) Identify at least two specific sources of bias. (c) Who is excluded from the population, and how might this affect the conclusion?

Solution

(a) No — the conclusion dramatically overstates what the study shows. The study surveyed remote employees who self-reported their productivity. There is no control group (in-office workers measured the same way), no objective productivity measure, and no random sample from a broadly defined population of workers. “Proves” is inappropriate — this is one survey with significant limitations.

(b) Sources of bias:

Social desirability bias: Remote workers asked whether they’re productive while working remotely may feel pressure to say “yes” — admitting low productivity could feel like justifying an end to remote work privileges.
Undercoverage: Only remote employees at five tech companies are included. Workers in industries where remote work is impossible (manufacturing, healthcare, retail) are entirely excluded.
Self-selection / convenience: The sample is limited to companies that already have remote work policies — these companies may differ from average employers in culture, resources, and employee type.

(c) The population is implicitly “all workers” or “all people who work from home,” but only tech workers at companies that already embraced remote work are included. This excludes:

Workers in industries where remote work is impractical
Employees who tried remote work and returned to the office (survivorship bias)
Workers who want remote work but haven’t had the option

The conclusion would need a much broader, randomized study design to support such a sweeping claim.

Now drill the bias types themselves. Name the bias in each scenario — the reveal defines the whole taxonomy so you can tell the confusable types apart.

Problem 6 — Survey Question Critique (C6 · Enrichment)

Enrichment — not assessed. This problem practises survey-question design (C6), which is beyond the graded standards for this lesson. Do it to sharpen your judgment — it won’t be on a standards check.

Each question below has a flaw. Identify the problem with each question and rewrite it to fix the flaw.

(a) “Don’t you agree that our school needs better sports facilities?”
(b) “Are you satisfied with the price and the speed of our delivery service?”
(c) “How often do you exercise regularly?” (with response options: Never / Sometimes / Often / Always)
(d) “Given that most experts agree climate change is a serious problem, do you support carbon taxes?”

Solution

(a) Problem: Leading question. “Don’t you agree” frames the question to pressure respondents toward a “yes.” Fix: “Do you think our school’s sports facilities need improvement?” (or offer “Yes / No / No opinion”)

(b) Problem: Double-barrelled. A respondent might love the price but be unhappy with speed — the question forces one answer for two separate things. Fix: Split into two questions: “How satisfied are you with the price of our delivery service?” and “How satisfied are you with the speed of our delivery service?”

(c) Problem: Ambiguous wording. “Regularly” means different things to different people — some might define it as once a week, others as daily. The response options (Sometimes, Often) are equally vague. Fix: “In an average week, on how many days do you exercise for at least 30 minutes?” (Response: 0 / 1–2 / 3–4 / 5–6 / 7)

(d) Problem: Leading question with a loaded premise. “Most experts agree climate change is a serious problem” primes respondents toward agreement before the actual question is asked. Fix: Remove the preamble: “Do you support the introduction of carbon taxes in Canada?” (Yes / No / Unsure)

No hints. No guidance. This section checks whether you’ve genuinely understood the material — not just recognized answers when you saw them.

Question 1 — Feynman Test (C1 + C2)

A classmate missed today’s lesson. Explain in your own words:

What is the difference between a population and a sample?
Why do statisticians use samples instead of studying the whole population?
What is the difference between a parameter and a statistic, and why does the distinction matter?

Write your explanation as if you’re texting your classmate — clear, direct, no jargon you haven’t defined. Aim for 200–400 characters.


Model answer

A population is the full group you care about — like all students at your school. A sample is just a portion you actually study — like 50 randomly chosen students. We use samples because studying everyone is usually too expensive or time-consuming.

A parameter is a number that describes the population (like the true average GPA of all students — we might never know it exactly). A statistic is a number computed from your sample (like the average GPA of your 50 students — you computed it directly). The statistic estimates the parameter. That’s the whole engine of statistics.

Key check: Does your answer clearly state that parameters are usually unknown while statistics are computed from data? That’s the heart of why this distinction matters.

Question 2 — Apply (C3 + C4)

A coffee chain with 240 store locations across Canada wants to measure customer satisfaction. Their data team proposes two approaches and asks you to evaluate them.

The chain has four regions: Atlantic (20 stores), Quebec (60 stores), Ontario (80 stores), Western Canada (80 stores).

Plan 1: Randomly select 30 stores. Visit each selected store on a random day and survey every customer who visits during a 4-hour window.

Plan 2: From each region, randomly select stores proportional to the region’s size (Atlantic: 5 stores, Quebec: 15 stores, Ontario: 20 stores, Western: 20 stores). Then survey every customer at selected stores during a random 4-hour window.

Evaluate the plans.

(a) Which sampling method does Plan 1 use?

In one sentence, what is Plan 1’s main weakness in this context? (Hint: the four regions differ in size.)

(b) Which sampling method does Plan 2 use?

How does Plan 2 address the weakness you named in (a)?

(c) Customer satisfaction rating is measured as: Very Dissatisfied / Dissatisfied / Neutral / Satisfied / Very Satisfied. What type of variable is it?

Why does that variable type matter for how you’d summarize the data?

Show model answers (the written parts)

(a) weakness: Because the 30 stores are drawn without regard to region, they can over-represent a large region by chance (e.g. 20 of 30 in Ontario), while tiny Atlantic (20 stores) may contribute almost nothing — so the sample can miss whole regions’ opinions.

(b) how Plan 2 fixes it: Treating the four regions as strata and sampling proportional to size guarantees every region is represented in the right proportion, removing the chance that one region dominates.

(c) why the type matters: Because the rating is ordinal, you can report the mode or median category and the percentage in each level, but a numerical mean (“average satisfaction = 3.7”) is questionable — the gaps between levels aren’t guaranteed equal, so the scale isn’t interval-level.
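The difference between the two plans can be checked with a short sketch. The region sizes come from the question; the function names, the seeds, and the way stores are labelled by region are illustrative assumptions, not part of the chain's actual plan.

```python
import random
from collections import Counter

# Store counts per region, taken from the question.
REGIONS = {"Atlantic": 20, "Quebec": 60, "Ontario": 80, "Western": 80}


def proportional_allocation(strata_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size.

    Assumes the shares come out as whole numbers (true for this example);
    other inputs would need an explicit rounding rule.
    """
    population = sum(strata_sizes.values())
    return {name: total_sample * size // population
            for name, size in strata_sizes.items()}


def srs_region_counts(strata_sizes, total_sample, seed):
    """Draw one simple random sample of stores, ignoring region,
    and count how many of the selected stores fall in each region."""
    stores = [name for name, size in strata_sizes.items() for _ in range(size)]
    rng = random.Random(seed)
    counts = Counter(rng.sample(stores, total_sample))
    return {name: counts.get(name, 0) for name in strata_sizes}


plan2 = proportional_allocation(REGIONS, 60)
print("Plan 2 allocation:", plan2)

# Plan 1: the regional split changes from one random draw to the next.
for seed in range(3):
    print("Plan 1 draw", seed, srs_region_counts(REGIONS, 30, seed))
```

Plan 2's allocation is fixed by design, while each Plan 1 draw produces a different regional split, which is the weakness named in (a).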

Question 3 — Find the Error (C4 + C5)

A student analyses an informal social-media poll in four claim-steps. Exactly one claim contains an error — find it, then read the feedback. Each new problem plants a different mistake, so cycle through several.

Self-Assessment

How confident are you with the material in this lesson?


If you’re under 70% confident, revisit Section 3 (Core Concepts). Focus on the concept that felt shakiest in this section — that’s the one most worth reviewing before you continue.

You’ve built the vocabulary. You’ve practised the classifications. Now it’s time to put it all together.

Choose your path. Both cover the same concepts — they just ask you to use them differently.

🔬 The Analyst

A national survey was conducted. Your job is to dissect it — identify what was done well, what was done wrong, and what conclusions are and aren’t justified.

Emphasis: critical reading, classification, identifying bias

🏗️ The Architect

A research question has been posed. Your job is to design the study — choose sampling methods, write good survey questions, and justify every decision.

Emphasis: methodology, survey design, justification

🔬 Path A — The Analyst: Dissecting a Real Survey

Below is a summary of a survey conducted by a Canadian media outlet. Read it carefully — there are problems hiding in every paragraph.

The Survey Report:

“We asked Canadians about their screen time habits. Our team emailed a survey to 5,000 subscribers of our newsletter, and 1,847 completed it — a 37% response rate. Results showed the average daily screen time was 6.4 hours, which we report as the true average for all Canadian adults ($\mu = 6.4$ h). Among respondents, we identified three groups by profession — students, office workers, and other — and found students averaged 8.1 h/day, office workers 6.0 h/day, and others 5.2 h/day. Since we collected professional background, this constitutes stratified sampling. The screen time question asked: ‘How many excessive hours do you spend on screens daily?’ Respondents answered on a scale from 1 to 5 hours (if they entered ‘3,’ this means 3 hours). The variable ‘daily screen time’ is qualitative-ordinal since people rate it on a 1–5 scale.”

Your Analysis Tasks

Task A1. The report claims this is stratified sampling. Is it? (Identify the actual method used and explain the distinction.)

Task A2. Identify at least two sources of bias. For each, explain the likely direction of the bias (does it push the screen time estimate up or down?).

Task A3. The report writes “$\mu = 6.4$ h.” Is this notation correct? What should the correct notation be, and why?

Task A4. The report calls daily screen time “qualitative-ordinal.” Is this correct? What is the actual type and why?

Task A5 (enrichment — not assessed). Identify the problem with the question “How many excessive hours do you spend on screens daily?” Rewrite it.

Show Full Solution

A1 — Not stratified sampling; it’s convenience/voluntary response: Stratified sampling requires the researcher to divide the population into strata and then randomly sample from each stratum before data collection. Here, the researchers simply emailed a convenience sample of newsletter subscribers and accepted whoever responded. Sorting respondents into student/office worker/other after the fact is post-hoc grouping — not stratification. This is voluntary response sampling (a form of convenience sampling). The professional groups discovered in the data are subgroups, not strata.

A2 — Bias sources:

Undercoverage bias (direction uncertain, plausibly up): Newsletter subscribers tend to be more highly educated, media-literate adults. Low-income individuals, the elderly, and rural Canadians with low digital engagement are underrepresented. If those groups have less screen time, leaving them out pushes the estimate up, and subscribers to a media newsletter may themselves be heavier screen users than average, which pushes the same way. Without data on the missing groups the direction cannot be confirmed, but the bias is certain.
Non-response bias (direction unclear, possibly up): Only 37% responded. People embarrassed by heavy screen use may skip the survey, which would push the estimate down, while heavy users who are proud of their tech engagement may respond more readily, which would push it up. Either way, the 63% who did not respond may differ systematically from those who did.
Question wording bias (pushes estimate down — see A5): The word “excessive” cues respondents that high screen time is bad, potentially causing underreporting.

A3 — Wrong notation: $\mu$ is the population parameter — the true average daily screen time of all Canadian adults, which we never measured. The 6.4 hours came from the 1,847 respondents — it is the sample mean, correctly written $\bar{x} = 6.4$ h. The report should say: “Our sample mean was $\bar{x} = 6.4$ h, which we use to estimate the population mean $\mu$ for Canadian adults.”

A4 — Wrong variable type, but for an interesting reason: Daily screen time measured in hours is quantitative — continuous (you can watch 2.7 hours, 4.15 hours, etc.; the variable is measured, not categorized). The report confuses the response scale (1–5 as possible answers) with the variable type. If respondents entered “3” meaning 3 hours, that’s a numerical measurement. The variable is continuous; the researcher’s measurement instrument limits precision but doesn’t change the underlying nature of the variable.

A5 — Leading question (loaded word “excessive”): “Excessive” implies that screen time is inherently bad and high amounts are excessive. Respondents who use screens a lot for legitimate work may feel pressured to report lower numbers. Better: “On an average day, approximately how many hours do you spend looking at screens (phone, tablet, computer, TV combined)?” with open numerical entry or hour ranges.

Reflection: What was the most challenging part of this analysis? What would you check first if you were a journalist reviewing this survey before publication?

🏗️ Path B — The Architect: Design the Study

You’re working as a research consultant for the Quebec Ministry of Education. They need to understand student mental well-being across all CEGEP institutions in Quebec.

Context: Quebec has 48 CEGEP institutions. Collectively, they enroll approximately 220,000 students. Institutions range from small rural CEGEPs (500 students) to large urban ones (12,000 students). The ministry wants to survey 3,000 students. Their research question: “What is the average self-reported mental well-being score of Quebec CEGEP students, and do well-being scores differ by institution size?”

Your Design Tasks

Task B1. State precisely what the population is and what your sample of 3,000 represents.

Task B2. Recommend a sampling method and justify it. (Hint: the research question mentions comparing by institution size — how does that affect your design?) Explain why you’re not choosing convenience sampling or simple random sampling.

Task B3. What parameter is the ministry trying to estimate? What statistic will you compute?

Task B4 (enrichment — not assessed). Write two well-designed questions for the survey. One should measure mental well-being (on a scale), the other should capture a demographic. For each, identify the variable type and explain why your wording avoids bias.

Note: The survey-wording portion of this task is enrichment (C6). Identifying the variable type of each question (C3) is still on-standard.

Task B5. Despite your best design, name one source of bias that could still affect your results, and suggest how to minimize it.

Show Full Solution

B1 — Population and sample: Population = all currently enrolled Quebec CEGEP students (~220,000). Sample = 3,000 students selected according to the design below. The sample represents the population only if it mirrors the population’s distribution by institution type, size, region, and program.

B2 — Recommended method: Stratified sampling by institution size. Since the research question specifically asks about differences by institution size, we must guarantee that small, medium, and large institutions are all represented. Suggested strata:

Small (under 2,000 students): 14 institutions
Medium (2,000–6,000): 20 institutions
Large (over 6,000): 14 institutions

Allocate the 3,000 surveys proportionally to enrollment within each stratum, then randomly select students within each stratum.

Why not SRS? With SRS, the 14 large institutions dominate because they have more students — small institutions might get very few respondents, making comparison across sizes unreliable. Why not convenience? It is biased toward easily reachable students (likely healthier, more engaged). Stratification ensures all size groups are represented.

B3 — Parameter and statistic: Parameter: $\mu$ = the true mean mental well-being score of all ~220,000 Quebec CEGEP students. Statistic: $\bar{x}$ = the mean well-being score computed from the 3,000 surveyed students, used to estimate $\mu$.

B4 — Survey questions:

“Over the past two weeks, how often have you felt able to handle the challenges of daily student life?” Options: Never / Rarely / Sometimes / Often / Always Variable type: Qualitative — Ordinal Why it avoids bias: Neutral phrasing (no “struggling” or “excelling”), time-bounded (last two weeks), and focused on a specific concrete feeling rather than the vague concept of “mental health.”
“Which of the following best describes your current program?” Options: Pre-university (Social Sciences / Pure and Applied Sciences / Arts and Literature / Other pre-university) / Technical program Variable type: Qualitative — Nominal Why it avoids bias: Factual question with clear categories, no value judgment, includes an “Other” option to avoid forcing respondents into incorrect categories.

B5 — Anticipated bias: Non-response bias. Even with a well-designed stratified sample, students experiencing severe mental distress — the very people most relevant to the study — may be least likely to complete a voluntary survey. This would underestimate the severity of problems in the population. Mitigation: send two follow-up reminders to non-respondents, offer the survey in both French and English, and work with institution wellness offices to encourage participation while maintaining anonymity.

Reflection: What aspect of this design was hardest to decide? Is there a sampling method you considered and rejected? What would you do if the ministry’s budget only allowed for 500 surveys instead of 3,000?

Ready for more? These go beyond the lesson objectives. They’re for students who want to think more deeply — or get a preview of where this course is headed.

Challenge 1 — Can a Statistic Equal a Parameter? (C2)

A class of exactly 5 students has quiz scores: 70, 74, 80, 85, 91. Their professor calculates the class average as 80. She then randomly selects 3 students to give a brief survey and finds their average is also 80.

(a) Is the 80 computed from all 5 students a parameter or a statistic? Use correct notation.
(b) Is the 80 computed from the 3 selected students a parameter or a statistic? Use correct notation.
(c) The two values are identical. Does this mean the sample perfectly represents the population? Explain why or why not.

Show Solution

(a) $\mu = 80$ — a parameter. All 5 students (the entire population) were measured. No sampling occurred.

(b) $\bar{x} = 80$ — a statistic. Only 3 of the 5 students were sampled, so the 80 was computed from a subset of the population.

(c) No — equal values don’t mean perfect representation. A mean of 80 says nothing about the spread or shape of the scores behind it: triples as different as (70, 80, 90) and (80, 80, 80) both average 80. Any match between the statistic and the parameter is coincidence. Other samples of 3 from these 5 students give averages such as {70,74,80}→74.7; {70,74,91}→78.3; {74,80,91}→81.7; {80,85,91}→85.3; {70,85,91}→82. In fact, with these five scores none of the 10 possible samples averages exactly 80; the closest are {74,80,85}→79.7 and {70,80,91}→80.3, so the professor’s 80 has to be read as a rounded value. This illustrates sampling variability — the topic of DS-INF-1.

A town has exactly 4 households with annual incomes (in thousands): $42, $58, $65, $75. A researcher samples 2 households.

(a) Compute the population mean income. What notation applies?
(b) List all possible samples of size 2 and their sample means.
(c) How many samples produce a sample mean exactly equal to the population mean? What does this tell us?

Show Solution

(a) $\mu = \frac{42 + 58 + 65 + 75}{4} = \frac{240}{4} = 60$ thousand dollars.

(b) All samples of size 2 from {42, 58, 65, 75}:

{42, 58}: $\bar{x} = 50$
{42, 65}: $\bar{x} = 53.5$
{42, 75}: $\bar{x} = 58.5$
{58, 65}: $\bar{x} = 61.5$
{58, 75}: $\bar{x} = 66.5$
{65, 75}: $\bar{x} = 70$

(c) Zero samples produce $\bar{x} = 60$ exactly. The sample mean almost never equals the population mean — this is sampling variability. Notice that the average of all six sample means is (50+53.5+58.5+61.5+66.5+70)/6 = 360/6 = 60 = μ. The sample mean is an unbiased estimator — on average it hits the target, but any individual sample rarely lands exactly on it.
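If you want to verify the enumeration yourself, here is a minimal sketch using only the incomes given in the problem. The variable names are illustrative; `itertools.combinations` and `statistics.mean` are standard-library functions.

```python
from itertools import combinations
from statistics import mean

incomes = [42, 58, 65, 75]  # household incomes in thousands, from the problem

mu = mean(incomes)  # population mean (a parameter)

# Every possible sample of size 2, paired with its sample mean (a statistic).
sample_means = {pair: mean(pair) for pair in combinations(incomes, 2)}

for pair, x_bar in sample_means.items():
    print(pair, x_bar)

# How many samples land exactly on the population mean?
exact_hits = sum(1 for x_bar in sample_means.values() if x_bar == mu)

# Averaging over all possible samples recovers the population mean.
mean_of_means = mean(list(sample_means.values()))

print("mu =", mu)
print("samples with x_bar == mu:", exact_hits)
print("mean of all sample means:", mean_of_means)
```

Swapping in a different population list or sample size lets you repeat the check on the other problems in this challenge.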

A company has 6 employees. Their annual salaries (in thousands) are: $45, $50, $55, $60, $70, $80. The CEO reports the average salary as $60k. A journalist randomly asks 2 employees about their salary and gets $45k and $80k.

(a) Is the CEO’s $60k a parameter or statistic?
(b) Is the journalist’s sample mean a parameter or statistic? Compute it.
(c) The two averages differ. Does this prove the CEO is lying? What does it actually illustrate?

Show Solution

(a) $\mu = 60$k — parameter (all 6 employees = the entire company = population; CEO computed this from full data).

(b) $\bar{x} = \frac{45 + 80}{2} = 62.5$k — a statistic (computed from a sample of 2).

(c) No — the discrepancy doesn’t prove lying. It illustrates sampling variability: different random samples produce different statistics. The journalist happened to sample the lowest and highest paid employees — an extreme but possible outcome. The CEO’s figure is the true population parameter. The journalist’s statistic is simply one possible sample mean, which happened not to equal μ. This is exactly why statistics need uncertainty quantification (confidence intervals — covered later in the course).

Challenge 2 — Multistage vs. Cluster: Design Choices (C4)

A national research institute wants to estimate the proportion of high school students in Canada who spend more than 3 hours/day on social media. Canada has approximately 3,200 high schools and 1.5 million high school students.

(a) A colleague proposes cluster sampling: randomly select 30 schools, then survey every student in those schools. Estimate the total number of students surveyed if each school has an average of 900 students. Is this practical?

(b) Alternatively, design a two-stage multistage sampling plan that surveys approximately 3,000 students. Be specific about what happens at each stage.

(c) Compare the two designs. Which produces a more precise estimate? Which is more practical? Is there a tradeoff?

(d) Preview question: The true proportion of students who spend >3 h/day on social media is an unknown parameter. What symbol would this parameter be given in the notation introduced in this lesson? (Hint: it’s not μ — μ is for means. Look at the context document for the course.)

Show Solution

(a) 30 schools × 900 students/school = 27,000 students. This is logistically impractical for most research budgets — surveying 27,000 students requires massive coordination across 30 schools. Cluster sampling with full enumeration works well for small clusters, but 900-student schools are too large.

(b) Two-stage plan:

Stage 1: Randomly select 60 schools from the 3,200 high schools across Canada (use SRS or stratified by province).
Stage 2: Within each selected school, randomly select 50 students (using the school’s enrollment list and a random number generator).
Total: 60 × 50 = 3,000 students surveyed.

(c) Per student surveyed, the cluster design (full enumeration of 30 schools) tends to be less statistically precise than multistage sampling, because students in the same school share similar social environments, which makes their responses correlated. The multistage design spreads 3,000 students across 60 schools, capturing more geographic and demographic diversity. Tradeoff: the cluster design requires agreements with fewer schools (30 vs. 60) but surveys far more students in each one (900 vs. 50 per school; 27,000 total vs. 3,000).

(d) The parameter is a population proportion — the true fraction of all Canadian high school students who spend >3 h/day on social media. This would be written $p$ (population proportion). You’ll see this notation formally introduced in DS-PR-6 and DS-INF-4. The sample proportion — computed from your 3,000 students — would be written $\hat{p}$ (“p-hat”).
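The two-stage plan from (b) can be sketched in a few lines. This is only an illustration: the function name, the integer school and student ids, and the assumption that every school enrolls exactly 900 students are placeholders standing in for real enrollment lists.

```python
import random


def two_stage_sample(num_schools, schools_to_pick, students_per_school,
                     school_size, seed=0):
    """Stage 1: SRS of schools. Stage 2: SRS of students within each school.

    school_size is a function returning the enrollment of a given school id;
    here it stands in for a real enrollment list.
    """
    rng = random.Random(seed)
    chosen_schools = rng.sample(range(num_schools), schools_to_pick)  # stage 1
    sample = []
    for school in chosen_schools:
        enrolled = range(school_size(school))               # student ids
        picked = rng.sample(enrolled, students_per_school)  # stage 2
        sample.extend((school, student) for student in picked)
    return sample


# Hypothetical: every school has 900 students, as in part (a).
sample = two_stage_sample(3200, 60, 50, school_size=lambda school: 900)

print("students surveyed:", len(sample))
print("schools visited:", len({school for school, _ in sample}))
print("full-enumeration cluster design:", 30 * 900)
```

Stratifying stage 1 by province, as the solution suggests, would mean running the school selection once per province instead of once overall.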

Challenge 3 — Why Convenience Sampling is Always Biased: A Proof Sketch (C5)

This problem asks you to reason carefully about why convenience sampling produces biased estimates, not just that it does.

Consider the following model: A population has $N$ individuals. Each individual $i$ has some characteristic value $x_i$ (say, hours of exercise per week). The true population mean is $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.

In a convenience sample, individuals are not selected with equal probability. Let $\pi_i$ be the probability that individual $i$ is selected. A random sample (SRS) requires $\pi_i = n/N$ for all $i$ (equal probability, where $n$ is the sample size). A convenience sample has $\pi_i$ that varies — some individuals (those “convenient”) have much higher $\pi_i$ than others.

(a) Suppose a gym posts a sign-up sheet for a study on exercise habits. Who is more likely to sign up — people who exercise frequently or people who rarely exercise? What does this say about the $\pi_i$ values for high-exercisers vs. low-exercisers?

(b) If high-exercisers have higher $\pi_i$, the sample will overrepresent them. In what direction will the sample mean $\bar{x}$ be biased relative to $\mu$?

(c) This bias doesn’t go away with a larger sample. In fact, a convenience sample of 10,000 gym members would give a worse estimate of the population mean than a random sample of 100 general-population adults. Why?

(d) Write a one-paragraph “proof sketch” (in words, no formal mathematics required) explaining why unequal selection probabilities are the fundamental cause of bias in non-random sampling.

Show Solution

(a) Frequent exercisers are far more likely to sign up — they’re proud of their habits, they’re at the gym already, and the topic seems relevant to them. Low-exercisers who feel embarrassed about their habits or simply never visit the gym have essentially zero probability of being in this sample. So the selection probability $\pi_i$ is high for high-exercisers, near zero for low-exercisers.

(b) Since high-exercisers are overrepresented, $\bar{x}$ will be systematically higher than $\mu$ — the sample will overestimate the true population mean exercise time.

(c) A larger convenience sample still draws from the same biased pool. Adding more gym members to the sample doesn’t add people who don’t go to gyms. You’re just measuring the same overrepresented group more precisely — getting a very accurate estimate of the wrong quantity. Bias comes from who is in the sample, not how many.

(d) Proof sketch: The sample mean $\bar{x}$ is a weighted average of the values in the sample. In SRS, every individual has an equal probability of selection, so the contribution of each individual’s value to the expected value of $\bar{x}$ is equal — the expected value of $\bar{x}$ equals $\mu$ (the statistic is unbiased). In a convenience sample, individuals with higher selection probability contribute more heavily to $\bar{x}$. If these high-probability individuals have systematically higher (or lower) values than the rest of the population, the expected value of $\bar{x}$ will be systematically above (or below) $\mu$. This systematic deviation is bias — and it persists no matter how large the sample grows, because the mechanism producing the bias (unequal selection probability) is unchanged by sample size.
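The proof sketch can be made concrete with a small calculation. Everything here is an assumption chosen for illustration: the ten exercise values are invented, the sign-up weights are one arbitrary way of making keener exercisers more likely to be picked, and the model simplifies sampling to independent draws with replacement.

```python
# Hypothetical population: weekly exercise hours for 10 people.
hours = [0, 0, 1, 1, 2, 3, 4, 6, 8, 10]


def expected_sample_mean(values, weights):
    """Expected value of the sample mean when each draw picks individual i
    with probability weights[i] / sum(weights) (draws with replacement).

    The sample size does not appear: every draw has the same expectation,
    so averaging more draws leaves the expected value unchanged.
    """
    total_weight = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total_weight


mu = sum(hours) / len(hours)  # true population mean

equal = [1] * len(hours)             # everyone equally likely to be drawn
gym_signup = [1 + x for x in hours]  # assumed: more exercise, more likely

print("population mean:", mu)
print("equal probabilities:", expected_sample_mean(hours, equal))
print("gym sign-up sheet:", expected_sample_mean(hours, gym_signup))
```

Because the sample size never enters the formula, collecting more sign-ups tightens the estimate around the inflated value rather than around the population mean.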

Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include worked examples, common mistakes to watch for, and interpretation guidance.


If you’re stuck: Re-read the relevant Core Concept in Section 3, then find the Worked Example that maps to that concept (e.g., Example 1 maps to Concept 1). The solutions page shows the reasoning behind every step, not just the final answer.

Quick-Reference Notation

Measure	Population (Parameter)	Sample (Statistic)
Size	$N$	$n$
Mean	$\mu$ (mu)	$\bar{x}$ (x-bar)
Proportion	$p$	$\hat{p}$ (p-hat) — introduced in DS-PR-6

Variable Type	Sub-types	Examples
Categorical (Qualitative)	Nominal	Eye color, Postal code
	Ordinal	Letter grade, Survey rating
Numerical (Quantitative)	Discrete	Number of children, Shoe size
	Continuous	Height, Weight, Time

Sampling Method	How it works
Simple Random (SRS)	Every possible sample of size $n$ has an equal chance of being selected.
Stratified	Divide into groups (strata), take SRS from every group.
Cluster	Divide into groups (clusters), randomly select some groups, sample everyone in those.
Systematic	Pick every $k$-th individual from a list.
Convenience (Biased)	Pick whoever is easiest to reach.
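The methods in the table can be compared side by side on a toy list. This is a sketch under stated assumptions: the 40-person roster, the four groups, the sample sizes, and the seed are all made up, and the random start for systematic sampling is a common practice added here rather than something the table specifies.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical roster: 40 people in 4 equal groups, ids 0..39.
roster = [{"id": i, "group": "ABCD"[i // 10]} for i in range(40)]
groups = {}
for person in roster:
    groups.setdefault(person["group"], []).append(person)

# Simple random sample: 8 people, every set of 8 equally likely.
srs = rng.sample(roster, 8)

# Stratified: an SRS of 2 from every group.
stratified = [p for members in groups.values() for p in rng.sample(members, 2)]

# Cluster: randomly pick 1 whole group and take everyone in it.
cluster = groups[rng.choice(sorted(groups))]

# Systematic: random start (an added assumption), then every k-th person.
k = 5
start = rng.randrange(k)
systematic = roster[start::k]

# Convenience: whoever comes first on the list, with no randomness at all.
convenience = roster[:8]

for name, sample in [("SRS", srs), ("stratified", stratified),
                     ("cluster", cluster), ("systematic", systematic),
                     ("convenience", convenience)]:
    print(name, [p["id"] for p in sample])
```

Comparing the printed ids shows the structural difference: stratified touches every group, cluster stays inside one, and convenience never leaves the top of the list.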
