
DS-1: Statistical Vocabulary and Sampling

Module 1 · Descriptive Statistics

Section 1: Introduction

In 1936, a major American magazine called the Literary Digest mailed 10 million questionnaires asking people who they planned to vote for in the upcoming presidential election. Nearly 2.4 million people responded — an enormous number by any standard. Based on those responses, the magazine confidently predicted that Alfred Landon would defeat incumbent President Franklin Roosevelt in a landslide.

Roosevelt won by one of the largest margins in American electoral history.

How? The Literary Digest had a sampling problem. They drew their mailing list from telephone directories and automobile registration records — meaning their respondents were systematically wealthier than the broader voting population, which in 1936 was still recovering from the Great Depression. And people who felt strongly enough to mail back a questionnaire were not typical of all voters. Nearly 2.4 million responses, and every single one came from a biased pool. The enormous sample size made no difference — the damage was done at the point of selection.

That story illustrates a truth that runs through everything in statistics: the quality of your conclusions can never exceed the quality of your data collection. Before we can talk about calculating averages, drawing graphs, or testing hypotheses, we need to get the foundations right — the vocabulary, the concepts, and the methods that determine whether our data is trustworthy in the first place.

This lesson is where all of statistics begins.

After this lesson, you will be able to:

  • Distinguish a population from a sample, and explain why samples are used
  • Tell apart a parameter and a statistic, and match the correct notation to each
  • Classify any variable as qualitative or quantitative, and identify its sub-type (nominal, ordinal, discrete, or continuous)
  • Identify and compare six common sampling methods, and explain the strengths and weaknesses of each
  • Recognize sources of bias in sampling and explain how bias affects the validity of conclusions
  • Evaluate basic survey design for problems with question wording and non-response

This is Lesson DS-1 — the first lesson of the entire course. Everything that follows — probability, confidence intervals, hypothesis tests, regression — builds on the concepts you’ll learn here. Spend time with the vocabulary. It’s the language the rest of the course is written in.

Section 2: Prerequisites

Before diving into the mechanics of data collection, ensure you are comfortable with the basic mathematical language used to describe groups and values.

  • Sets and Subsets: A “population” is the complete set of all items of interest. A “sample” is a subset of that population. (This logic is the foundation for all sampling methods in C4.)
  • Percentages and Proportions: Converting counts to percentages (e.g., 30 out of 200 is 30/200 = 0.15 = 15%). You will use this to describe sample compositions and error rates.
  • Variables and Values: A variable is a characteristic (like “height” or “color”); a value is the specific measurement (like “175 cm” or “Red”).
  • Rounding Rules: In statistics, we typically round to 2 or 3 decimal places (e.g., 2/3 = 0.6666… ≈ 0.667).
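The two arithmetic prerequisites above can be checked with a few lines of Python. This is a minimal sketch: the helper name `to_percentage` and the counts (30 out of 200) are invented for illustration and do not come from any data set in this lesson.

```python
def to_percentage(count, total, decimals=1):
    """Convert a count out of a total to a percentage, rounded."""
    return round(100 * count / total, decimals)

# Hypothetical sample composition: 30 of 200 respondents are first-year students.
proportion = 30 / 200               # proportion, as a decimal
percentage = to_percentage(30, 200)  # the same share, as a percentage

# Rounding a long decimal to 3 places, as is typical in statistics.
two_thirds = round(2 / 3, 3)

print(proportion, percentage, two_thirds)
```

A proportion is the decimal form (between 0 and 1); multiplying by 100 gives the percentage.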

Orientation Question

Here is the kind of question this lesson will teach you to answer easily. Take a guess before reading on — it doesn’t matter if you get it wrong right now.

A researcher wants to study the sleeping habits of all 5,000 students at a college. They successfully interview 200 of these students. Which of the following correctly identifies the sample?

Success Factor:

  • If you weren’t sure: That’s completely expected at this stage — this is the opening lesson. The difference between “population” and “sample” is explained in detail in C1, just ahead. Read it carefully; the rest of the lesson builds on it.

Section 3: Core Concepts

C1 — Population and Sample

Every statistical study begins with a question about a group. That group is called the population. But studying an entire population is usually impossible — it’s too large, too expensive, or simply impractical. So instead, we study a subset of it: a sample.

Population

The population is the complete set of all individuals, objects, or measurements of interest in a study. It is the group we want to draw conclusions about.

Notation: population size is written N (capital letter).

Examples: all adults in Quebec; every manufacturing component produced last month; all registered voters in Canada.

Sample

A sample is a subset of the population actually selected for study. We observe the sample and use it to make inferences about the population.

Notation: sample size is written n (lowercase letter).

Examples: 400 adults randomly selected from Quebec; 50 components pulled from last month’s production; 1,200 registered voters contacted by phone.

The key idea: we study the sample, but our goal is always to say something about the population. The sample is a means to an end — the population is what we care about.

A common error is to define the population as “the people I surveyed” — that’s the sample. The population is the group you want to generalize to, whether you’ve reached them yet or not. Ask yourself: “Who do I want my conclusions to apply to?” That’s your population.


C2 — Parameter and Statistic

Now that we have a population and a sample, we need to describe them numerically. This is where a critical distinction enters — one that the entire course depends on.

Parameter

A parameter is a numerical value that describes a characteristic of the population. Parameters are usually unknown — we can’t measure the whole population.

Common parameters:
  • Population mean: μ (the Greek letter “mu”)
  • Population standard deviation: σ (the Greek letter “sigma”)
  • Population size: N

Statistic

A statistic is a numerical value computed from sample data. Statistics are known (we calculated them), but they vary from sample to sample.

Common statistics:
  • Sample mean: x̄ (read “x-bar”)
  • Sample standard deviation: s
  • Sample size: n

The relationship to remember: statistics estimate parameters. We compute x̄ from our sample and use it as our best guess for the unknown μ.

Memory trick: Parameter → Population (both start with P). Statistic → Sample (both start with S). Greek letters (μ, σ) go with the population; Roman letters (x̄, s) go with the sample.

The most common error in this lesson: Treating a sample statistic as if it were the population parameter. For example, reporting “the average score in our sample is x̄” is fine, but then concluding “so the population mean μ is exactly that value” is an overstatement. The statistic estimates the parameter; it is not equal to it. The difference matters for everything that comes later.

Coming in DS-2: The symbols σ (population standard deviation) and s (sample standard deviation) are introduced above so the notation is familiar when you meet them. For now, focus on μ and x̄; those are the only two you will use in this lesson’s problems.

Figure 1: A population of 50 people (left) and a random sample (right). Click Draw Sample to select a new group: x̄ shifts every time, but μ stays fixed. Switch sample sizes (n = 5 / 10 / 20) and watch the history strip: larger n means x̄ clusters more tightly around μ.

Figure 2: The four-element framework of statistical inference. The left side lives in the population: values that exist but are usually unknown. The right side lives in the sample: values we compute from data. The arrow captures the entire purpose of sampling: use what you know (x̄) to estimate what you can't observe (μ).
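The behaviour described for the 50-person population (x̄ changes with every draw, μ does not) can be imitated with a short simulation. This is a sketch under stated assumptions: the population values are invented ages, the seed is arbitrary, and `sample_means` is a hypothetical helper, not part of the course materials.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of N = 50 values (invented ages between 18 and 65).
population = [rng.randint(18, 65) for _ in range(50)]
mu = statistics.mean(population)  # the parameter: fixed, and normally unknown

def sample_means(n, draws=200):
    """Draw `draws` simple random samples of size n and return each x-bar."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(draws)]

for n in (5, 10, 20):
    means = sample_means(n)
    spread = statistics.pstdev(means)  # how much x-bar varies between samples
    print(f"n = {n:2d}   mu = {mu:.2f}   spread of x-bar = {spread:.2f}")
```

In a run like this, the printed μ is identical on every line, while the spread of x̄ should shrink as n grows.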

C3 — Types of Variables

Before collecting data, you need to know what kind of data you’re dealing with. The type of variable determines which graphs, which summaries, and which statistical tests are appropriate. Using the wrong tool for the wrong type of variable is one of the most common mistakes in applied statistics.

Qualitative (Categorical) Variables

A qualitative variable classifies individuals into categories. The values are labels or names, not numbers you can do arithmetic with.

  • Nominal: Categories have no natural ranking. You can only check equality, not order.

Examples: eye colour (brown, blue, green); country of birth; favourite music genre; blood type (A, B, AB, O).

  • Ordinal: Categories have a meaningful order, but the gaps between ranks are not necessarily equal.

Examples: education level (high school < college < university < graduate); customer satisfaction (poor / fair / good / excellent); pain scale (1 through 10 as labels).

Quantitative (Numerical) Variables

A quantitative variable takes numerical values where arithmetic makes sense. Differences are meaningful, and for most variables (such as height or counts) so are ratios.

  • Discrete: Takes only countable values — usually whole numbers. You can list them (at least in principle).

Examples: number of children in a household; number of defects per batch; number of goals scored in a game.

  • Continuous: Can take any value in an interval — including all decimals. Measured, not counted.

Examples: height (171.3 cm, 171.31 cm, …); temperature; time to complete a task; blood pressure.

Trap 1: A variable coded with numbers is not automatically quantitative. Postal codes are numbers, but you can’t compute “average postal code.” Phone numbers, student ID numbers, and jersey numbers are nominal.

Trap 2: Pain scores rated 1–10 are often treated as quantitative, but technically they’re ordinal — the gap between “3 out of 10” and “4 out of 10” may not equal the gap between “7 out of 10” and “8 out of 10.” Context matters for this distinction.

Trap 3: Age in years (18, 19, 20…) looks discrete, but age is really continuous — you’re always some fractional number of years old. The measurement convention creates the appearance of discreteness.

Ordinal data and arithmetic don’t mix. When a rating scale is coded as numbers (1 = Terrible, 5 = Excellent), it is tempting to compute an average. But ordinal gaps are not guaranteed to be equal — the jump from “Poor” to “Okay” may not represent the same real-world distance as the jump from “Good” to “Excellent.” Throughout this course, Likert-scale and ranked-category responses are treated as ordinal. A more honest summary uses the median category or a frequency table. You will see exactly why this matters in DS-2.
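As an illustration of the “median category or frequency table” summary, here is a small sketch. The responses are invented for the example, and `frequency_table` and `median_category` are hypothetical helper names; the only assumption taken from the text is the ordering of the five rating labels.

```python
from collections import Counter

# Ordered categories, lowest to highest.
LEVELS = ["Terrible", "Poor", "Okay", "Good", "Excellent"]

# Hypothetical survey responses, invented for illustration.
responses = ["Good", "Poor", "Okay", "Good", "Excellent",
             "Okay", "Good", "Terrible", "Good"]

def frequency_table(values):
    """Count how many responses fall in each category, in rank order."""
    counts = Counter(values)
    return {level: counts.get(level, 0) for level in LEVELS}

def median_category(values):
    """Middle response after sorting by rank (the lower middle if n is even)."""
    ranked = sorted(values, key=LEVELS.index)
    return ranked[(len(ranked) - 1) // 2]

print(frequency_table(responses))
print(median_category(responses))
```

Neither summary relies on the gaps between categories being equal; both use only the order of the labels.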

Figure 3: The variable-type hierarchy. Click any example below the tree to trace its classification path. Each leaf node shows a distinct colour used consistently in this course. The ⚠ examples are traps: variables that look like one type but are actually another.

C4 — Sampling Methods

How you select your sample determines whether it will accurately represent the population. There are many methods — each with different strengths, weaknesses, and appropriate contexts.

Simple Random Sampling (SRS)

Every possible sample of size n has an equal probability of being selected. This is the gold standard for unbiased sampling.

How: Number every individual in the population 1 through N; use a random number generator or lottery to select n of them.

Strength: Unbiased; it provides the random-selection assumption that most statistical formulas rely on.

Weakness: Requires a complete sampling frame — a list of every individual in the population — which is often unavailable or expensive to build.
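The “number everyone, then draw n labels at random” procedure can be sketched as follows. The function name and the sizes (N = 500, n = 50) are illustrative assumptions; the sketch also assumes a complete sampling frame exists, which is the hard part in practice.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Select n distinct labels from 1..N; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), n))

# Hypothetical frame of N = 500 numbered individuals, sample of n = 50.
chosen = simple_random_sample(population_size=500, n=50, seed=42)
print(chosen[:10])
```

`random.sample` draws without replacement, so no individual can be selected twice.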

Stratified Sampling

Divide the population into homogeneous subgroups called strata (e.g., age groups, departments, regions). Then draw a simple random sample from each stratum.

Strength: Guarantees representation of important subgroups; can produce more precise estimates than SRS for the same total sample size.

Weakness: Requires knowing the strata boundaries in advance; more complex to execute than SRS.
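A stratified draw is simply an SRS repeated inside every stratum. The sketch below assumes equal allocation (the same number from each stratum); the age-group strata, their sizes, and the function name are invented for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of the same size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical strata: three age groups of different sizes.
strata = {
    "18-34": [f"A{i}" for i in range(1, 121)],
    "35-54": [f"B{i}" for i in range(1, 201)],
    "55+":   [f"C{i}" for i in range(1, 81)],
}

sample = stratified_sample(strata, n_per_stratum=10, seed=3)
for name, members in sample.items():
    print(name, len(members))
```

Every stratum appears in the result by construction, which is the guarantee SRS alone cannot give. Proportional allocation (larger strata contribute more) is a common alternative to the equal allocation assumed here.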

Systematic Sampling

Number every individual, choose a random starting point, then select every k-th individual (where k = N/n).

Example: To sample 50 from 500, select a random start between 1 and 10, then take every 10th person after that.

Strength: Easy to implement; spreads the sample evenly across the population.

Weakness: Biased if there is a periodic pattern in the list that aligns with the interval k.
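The 50-from-500 example can be sketched in a few lines. The function name is hypothetical, and the sketch assumes N is an exact multiple of n so that the interval k is a whole number.

```python
import random

def systematic_sample(population_size, n, seed=None):
    """Random start in 1..k, then every k-th label, where k = N // n."""
    k = population_size // n
    rng = random.Random(seed)
    start = rng.randint(1, k)  # randint includes both endpoints
    return [start + i * k for i in range(n)]

# N = 500, n = 50, so k = 10.
picked = systematic_sample(500, 50, seed=5)
print(picked[:5], "...", picked[-1])
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined. That is why a periodic pattern in the list can bias the result.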

Cluster Sampling

Divide the population into heterogeneous groups called clusters (e.g., schools, city blocks, hospitals). Randomly select a few clusters, then survey every individual in the selected clusters.

Strength: Very cost-effective when the population is geographically dispersed and there is no complete list of individuals.

Weakness: Higher sampling error than SRS for the same number of individuals because, in practice, individuals in the same cluster tend to be similar to one another.
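In cluster sampling the random step acts on whole groups, not on individuals. A minimal sketch, assuming six invented schools of five students each and a hypothetical function name:

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every individual in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    people = [person for name in chosen for person in clusters[name]]
    return chosen, people

# Hypothetical population: 6 schools (clusters) with 5 students each.
clusters = {
    f"School {s}": [f"S{s}-student{i}" for i in range(1, 6)]
    for s in range(1, 7)
}

chosen, people = cluster_sample(clusters, n_clusters=2, seed=11)
print(chosen, len(people))
```

Only a list of clusters is needed up front, not a list of every individual, which is where the cost saving comes from.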

Multistage Sampling

A combination of methods applied in stages. Common in large national surveys.

Example: Stage 1 — randomly select provinces. Stage 2 — within each province, randomly select cities. Stage 3 — within each city, randomly select households. Stage 4 — interview one adult per household.

Strength: Practical for very large populations spread across wide areas.

Weakness: Complex to design; errors can compound across stages.
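The staged structure in the example (provinces, then cities, then households) can be sketched as nested random draws. The frame below is invented, the function name is hypothetical, and the final stage (choosing one adult per household) is left out to keep the sketch short.

```python
import random

def multistage_sample(frame, n_provinces, n_cities, n_households, seed=None):
    """Stage 1: provinces. Stage 2: cities in each. Stage 3: households in each."""
    rng = random.Random(seed)
    selected = []
    for province in rng.sample(sorted(frame), n_provinces):
        cities = frame[province]
        for city in rng.sample(sorted(cities), n_cities):
            for household in rng.sample(cities[city], n_households):
                selected.append((province, city, household))
    return selected

# Hypothetical frame: 4 provinces x 3 cities x 10 households.
frame = {
    f"Province {p}": {
        f"City {p}{c}": [f"Household {p}{c}-{h}" for h in range(1, 11)]
        for c in "ABC"
    }
    for p in range(1, 5)
}

picked = multistage_sample(frame, n_provinces=2, n_cities=2, n_households=3, seed=7)
print(len(picked))
```

Lists of cities and households are needed only for the units selected at the previous stage, never for the whole country.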

Convenience Sampling

Select whoever is easiest to reach — the first n people who walk by, the first respondents to an online survey, volunteers who self-select.

Strength: Fast and cheap.

Weakness: Almost always biased. The people who are easiest to reach are systematically different from the rest of the population.

Cluster ≠ Stratified. Both divide the population into groups — but they work opposite ways. In stratified sampling, groups are homogeneous (similar inside), and you sample from all groups. In cluster sampling, groups are heterogeneous (diverse inside), and you sample only some groups entirely. A school’s students divided by grade = strata (homogeneous within grade). A school’s students divided by homeroom class = clusters (diverse within class).

Figure 4: A population of 56 people arranged in a grid. Each tab shows which people get selected under that sampling method. For Stratified sampling, dot colour shows stratum membership; for Cluster, it shows cluster membership.

C5 — Bias in Sampling

A sample is biased when it systematically favours certain outcomes over others — when it consistently misrepresents the population in one direction. Bias is not random error; it doesn’t cancel out with larger sample sizes. A biased sample of 10,000 people can be less trustworthy than an unbiased sample of 100.

Common Sources of Sampling Bias

  • Undercoverage: Some segments of the population have a lower probability (or zero probability) of being selected. Example: An online survey excludes people without internet access.
  • Voluntary response bias: People who feel strongly about an issue are more likely to respond. Example: Calling in to a radio poll about crime — only the angriest listeners call.
  • Non-response bias: People selected for the sample don’t respond, and those who don’t respond differ systematically from those who do. Example: A survey on work satisfaction — disengaged employees may ignore it.
  • Convenience bias: Any convenience sample is systematically biased toward whoever is easiest to reach.

Bias affects the validity of conclusions: a biased sample may give us an accurate picture of who responded, but a distorted picture of the population we actually care about.

Figure 5: Each dot is one sample's estimate of μ. The biased sampler (left) surveys gym members to estimate weekly exercise for all adults — always landing too high, no matter how many draws. The unbiased sampler (right) uses SRS — estimates scatter around the true μ and converge on it with more samples. Bias is not random error: it does not cancel out with a larger sample.
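The gym-member scenario can be simulated to see that a larger n does not repair bias. Everything numeric below is an assumption made for the sketch: the population size, the 20% membership rate, and the average exercise hours for members and non-members.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 adults: (is_gym_member, weekly exercise hours).
population = []
for _ in range(10_000):
    member = rng.random() < 0.2
    hours = max(0.0, rng.gauss(6.0 if member else 2.0, 1.0))
    population.append((member, hours))

everyone = [hours for _, hours in population]
gym_members = [hours for member, hours in population if member]
mu = statistics.mean(everyone)  # true mean for all adults

def estimate(pool, n):
    """Sample mean from n individuals drawn at random from the given pool."""
    return statistics.mean(rng.sample(pool, n))

for n in (50, 500, 1500):
    biased = estimate(gym_members, n)   # undercoverage: non-members excluded
    unbiased = estimate(everyone, n)    # SRS from the whole population
    print(f"n = {n:4d}   biased = {biased:.2f}   SRS = {unbiased:.2f}   mu = {mu:.2f}")
```

In a run like this, the SRS column should settle near μ as n grows, while the gym-only column stays well above it at every sample size.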

C6 — Survey Design Principles

Even with a perfectly selected sample, a poorly designed survey can still produce bad data. The questions themselves can introduce error.

Sources of Response Error in Surveys

  • Leading questions: Wording that pushes respondents toward a particular answer. Bad: “Do you agree that the current tax rate is too high?” Better: “Do you think the current tax rate is too high, about right, or too low?”
  • Double-barrelled questions: Two questions combined into one. Bad: “Are you satisfied with the price and quality of the product?” (What if you love the quality but not the price?)
  • Social desirability bias: Respondents give answers they think are socially acceptable, not honest ones. Example: Self-reported exercise frequency, hours spent studying, or income.
  • Ambiguous wording: Questions that different respondents interpret differently. Bad: “Do you eat regularly?” (What does “regularly” mean?)
  • Order effects: The order of questions influences answers. Asking about satisfaction with a specific product before asking about overall satisfaction changes the overall satisfaction rating.

Figure 6: Four common survey design flaws. For each, the problematic word or phrase is highlighted in the original question. Click a flaw type to examine it, then compare the original against the corrected version.

Good survey design rules of thumb:
  • Ask one thing per question
  • Use neutral wording — no loaded language
  • Pilot test the survey with a small group before the real study
  • Keep the questionnaire as short as possible (longer surveys = more non-response)
  • Guarantee anonymity where possible (reduces social desirability bias)

Section 4: Worked Examples

Let’s work through some examples together — classification and identification problems that look exactly like what you’ll see in practice. I’ll narrate my thinking at each step, not just the final answer.

Example 1 — Fully Worked: The Four Elements

Scenario: A nutritionist wants to know the average daily sugar intake of Canadian adults. She recruits 250 adult volunteers from three cities, records their sugar intake for a week, and computes an average of 87 g/day.

Identify: (a) the population, (b) the sample, (c) the parameter of interest, (d) the statistic.

Step 1 — Identify each element.

Let’s work through each piece systematically.

(a) Population: “Canadian adults” — the entire group the nutritionist wants to draw conclusions about. Not just adults in three cities; she wants to generalize to all Canadian adults. This is a large, defined group that was never fully studied.

(b) Sample: The 250 adult volunteers recruited from three cities. This is the subset actually measured. Note: these are volunteers, which raises concerns about bias (voluntary response) — we’ll come back to that.

(c) Parameter of interest: The average daily sugar intake of all Canadian adults, written μ. This is unknown. We didn’t measure every Canadian adult, so μ is never directly observed.

(d) Statistic: The sample average of 87 g/day, written x̄ = 87. This was computed from the 250 participants. It’s our best estimate of μ.

Key takeaway: We know x̄ = 87 with certainty (we computed it). But we can only estimate that μ ≈ 87, and given the voluntary-response design, that estimate might be off. The statistic is certain; the inference to the parameter is always uncertain.


Example 2 — Partially Scaffolded: Which Sampling Method?

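Scenario: A university wants to survey its 8,000 students about library resources. Here are three different approaches they could use:

  • Approach A: Number every student on the complete enrolment list, then generate random numbers to choose who is surveyed.
  • Approach B: Divide students by faculty, then randomly select students from each faculty.
  • Approach C: Post a link to the survey and include whoever responds.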

Before looking at the solution: Write down your classification for each approach. What method is each one? What’s the key clue in each description?

Solution

Approach A — Simple Random Sampling (SRS). Every student has an equal chance of being selected; we used a random number generator to choose. The defining clue: “generate random numbers” applied to a complete numbered list.

Approach B — Stratified Sampling. The strata are the faculties (homogeneous subgroups). We sampled randomly from every stratum. Key clue: “divide students by faculty” (the strata) + “randomly select from each” (the sampling step).

Approach C — Convenience Sampling (voluntary response). Posting a link and surveying whoever responds is the definition of voluntary response bias. Only students who feel motivated to click will respond — probably not a representative cross-section. Key clue: “whoever responds.”

Which approach would you recommend? Approach B is the strongest choice: it guarantees that every faculty is represented (in proportion to its size, if the sample is allocated proportionally), which can make it more precise than SRS for the same sample size. Approach A is also good, but it wouldn’t guarantee that any particular faculty is represented. Approach C is problematic: students with strong opinions (for example, those frustrated with the library) are overrepresented.


Example 3 — Minimally Scaffolded: Variable Types in the Wild

Scenario: A hospital collects the following data for each patient admitted to the emergency room:

  (a) Patient’s primary language
  (b) Triage level (1 = critical, 2 = urgent, 3 = less urgent, 4 = non-urgent, 5 = minor)
  (c) Time (in minutes) from arrival to first physician contact
  (d) Number of previous ER visits in the past year
  (e) Discharge status (discharged / admitted / transferred)

Hint: For each variable, ask yourself (1) Is it a category label or a number you can do arithmetic with? (2) If it’s a category, does it have a natural order? (3) If it’s a number, can you get fractional values?

Solution

(a) Primary language → Qualitative, Nominal. Languages are category labels with no natural order. “Spanish” is not greater than or less than “Mandarin.”

(b) Triage level → Qualitative, Ordinal. The levels 1–5 have a meaningful order (1 is more severe than 5), but the gaps between levels aren’t guaranteed to be equal. The difference in severity between levels 1 and 2 need not be the same as between levels 4 and 5. Ordinal: ordered categories whose gaps may be unequal.

(c) Time to physician contact → Quantitative, Continuous. Time can take any value in a range — theoretically 14.3 minutes, 14.37 minutes, 14.371 minutes. It’s measured (not counted) and any decimal value is possible.

(d) Number of previous ER visits → Quantitative, Discrete. You count visits — you can’t have 2.7 ER visits. It takes whole number values and the gaps are equal.

(e) Discharge status → Qualitative, Nominal. Three categories with no natural order — “admitted” is not greater than “transferred.” Category label, no ranking.


Example 4 — Application Twist: Spotting Hidden Bias

Scenario: A municipal government wants to know whether residents support a new cycling infrastructure project that would repurpose some car lanes. They commission a survey and mail questionnaires to 2,000 randomly selected registered voters. They receive 380 completed responses.

The results: 71% of respondents support the project.

A city councillor says: “Great news — 71% of voters support this.” Is this claim justified? What concerns should be raised?

Solution

This is a realistic example of how a technically “random” sample can still produce unreliable conclusions.

Problem 1 — Non-response bias: Only 380 of 2,000 responded — a 19% response rate. The 81% who didn’t respond may have very different opinions. Who doesn’t bother to mail back a survey about cycling infrastructure? Possibly people who rarely cycle and have no strong motivation to engage with the topic. We simply don’t know why they didn’t respond — and that uncertainty is the problem.

Problem 2 — Wording matters: We don’t know how the question was phrased. “Do you support new cycling infrastructure?” might get different responses than “Do you support repurposing car lanes for cyclists?” Both describe the same project, but one activates different associations.

Problem 3: the 71% is a statistic, not the parameter. The councillor said “71% of voters support this.” But the 71% is p̂ (sample proportion), not p (population proportion). The inference to all voters is valid only if the sample is unbiased, and because of the non-response issue, it isn’t.

The honest summary: “71% of the 380 respondents support the project. However, non-response bias may make this estimate unreliable as a measure of the full voter population’s opinion.”
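The arithmetic behind the non-response concern can be made explicit. The sketch below uses only the figures from the scenario (2,000 mailed, 380 returned, 71% support) and then asks a deliberately extreme what-if question: what if every non-respondent opposed the project, or every one supported it? Those two extremes are illustrative assumptions, not claims about the actual voters.

```python
mailed = 2000
responses = 380
support_rate = 0.71  # share of respondents who support the project

response_rate = responses / mailed
supporters = support_rate * responses   # about 270 people
non_respondents = mailed - responses

# Extreme case 1: every non-respondent opposes the project.
lowest = supporters / mailed
# Extreme case 2: every non-respondent supports the project.
highest = (supporters + non_respondents) / mailed

print(f"response rate: {response_rate:.0%}")
print(f"support among all 2,000 sampled voters: between {lowest:.1%} and {highest:.1%}")
```

Under these extreme assumptions, support among the 2,000 sampled voters could lie anywhere from roughly 13% to 94%. The real value is unlikely to sit at either end, but the width of that range shows how little a 19% response rate pins down on its own.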

Section 5: Guided Practice

Work through each problem below. The dropdown will tell you immediately whether you’ve chosen the right answer — and if not, it explains exactly why the other choices are wrong.

Problem 1 — The Four Elements (C1 + C2)

A sports analytics company wants to know how many minutes per week professional basketball players spend on strength training. They contact 60 players from the NBA and find an average of 280 minutes per week.

Step 1: What is the population?

Step 2: What is the statistic?

Step 3: The parameter of interest is best written as:


Problem 2 — Notation Match (C1, C2)

Match each description to the correct notation.

2a. The average salary of all 4,500 employees at a company is $67,200.

2b. A researcher randomly selects 80 apartments in Montreal and finds the average rent is $1,450/month.


Problem 3 — Variable Type Classification (C3)

A gym records each member’s membership tier (Bronze, Silver, Gold, Platinum).

What type of variable is membership tier?

A factory records the number of defects found in each batch of 500 units produced.

What type of variable is number of defects per batch?

A hospital records each patient’s blood type (A, B, AB, or O).

What type of variable is blood type?

A marathon records each runner’s finishing time in hours and minutes.

What type of variable is finishing time?

A school survey asks students: “How would you rate the quality of the cafeteria food?” with response options: Terrible / Poor / Okay / Good / Excellent.

What type of variable is the cafeteria rating?


Problem 4 — Sampling Method Identification (C4)

A public health agency wants to survey 300 residents of a large city about their physical activity habits. The city is divided into 15 neighbourhoods. The agency randomly selects 4 neighbourhoods, then surveys all residents in those 4 neighbourhoods.

Which sampling method is this?

A quality control manager needs to inspect 40 items from a production line of 800. She inspects the 3rd item, then every 20th item after that (3rd, 23rd, 43rd, 63rd, …).

Which sampling method is this?

A university is studying student mental health. Students are categorized by year of study (Year 1, Year 2, Year 3, Year 4). The researchers randomly select 75 students from each year.

Which sampling method is this?

A researcher stands outside a shopping mall exit and interviews the first 50 people who walk out.

Which sampling method is this?

A national statistics agency wants to estimate household income. They randomly select 50 cities across Canada. Within each selected city, they randomly select 10 census tracts. Within each census tract, they randomly select 20 households to survey.

Which sampling method is this?


Problem 5 — Bias Identification and Direction (C5)

For each scenario, identify (a) the type of bias present and (b) the direction of the bias — whether it likely makes the estimate too high or too low.

Scenario A: A television station invites viewers to text their opinion on whether a proposed tax increase should pass. 78% of the 4,200 responses say “No.”

What type of bias is most clearly present?

Which direction does this bias most likely push the estimate?

Section 6: Independent Practice

Try these problems on your own. Solutions are hidden — attempt each one before revealing. The problems are intentionally interleaved across all six concepts.

Problem 1 — The Full Picture (C1 + C2)

A pharmaceutical company is testing a new drug to lower blood pressure. They enroll 180 patients with high blood pressure from 12 clinics across Canada. After 3 months, the average systolic blood pressure in the group dropped by 14.2 mmHg.

For this study, identify:

  (a) The population
  (b) The sample
  (c) The parameter of interest (and its notation)
  (d) The statistic reported (and its notation)
  (e) One potential concern about whether the sample represents the population

Solution

(a) Population: All patients with high blood pressure (or more precisely: all adults with high blood pressure who would be candidates for this drug).

(b) Sample: The 180 patients enrolled from 12 clinics across Canada. These are the people actually studied.

(c) Parameter: The true mean reduction in systolic blood pressure for all patients with high blood pressure who take this drug, written μ. This is unknown: we didn’t treat all patients, just 180.

(d) Statistic: The sample mean reduction, x̄ = 14.2 mmHg. This was computed from the 180 enrolled patients.

(e) Concern: Patients from clinics may not be representative of all high-blood-pressure patients (clinic patients may be more proactive about their health, have better access to care, etc.). Also, voluntary enrollment introduces self-selection bias — enrolled patients may differ from those who declined.


Problem 2 — Is It Quantifiable? (C1 + C2)

For each statement, decide whether the numerical value described is a parameter (such as μ) or a statistic (such as x̄), and explain your reasoning.

  (a) A school board surveys all 847 teachers in the district and finds the average years of experience is 11.4 years.
  (b) A market research firm polls 500 randomly selected Canadians and finds 62% have made an online purchase in the past month.
  (c) A university’s registrar reports that the grade point average of all currently enrolled students is 2.83.
  (d) A scientist measures the resting heart rate of 30 lab mice and finds an average of 632 beats per minute.

Solution

(a) Parameter: μ = 11.4 years. “All 847 teachers in the district” is the entire population. Every teacher was surveyed, so no sampling occurred. When you measure the whole population, the result is a parameter.

(b) Statistic: the sample proportion p̂ = 0.62. The 500 polled Canadians are a sample from the population of all Canadians. The 62% was computed from this sample, so it’s a statistic estimating the population proportion p.

(c) Parameter: μ = 2.83. “All currently enrolled students” is the population. The registrar has access to all their records. No sample was taken.

(d) Statistic: x̄ = 632 bpm. The 30 mice are a sample from a larger population of mice (all relevant lab mice, all mice of this species, or all mice under study, depending on the research context). The 632 bpm is computed from the sample.


Problem 3 — Variable Type (Generator) (C3)


Problem 4 — Sampling Method Critique (C4)


Problem 5 — Bias in a News Story (C5)

Read the following excerpt, then answer the questions.

“A new study proves that people who work from home are more productive than those who work in offices. Researchers from TechTrend International surveyed 1,200 fully remote employees at five tech companies. 84% reported that they were more productive when working from home.”

First, identify the study design:

Now analyze the conclusions:

(a) Is the conclusion (“people who work from home are more productive”) justified by the study? Explain. (b) Identify at least two specific sources of bias. (c) Who is excluded from the population, and how might this affect the conclusion?

Solution

(a) No — the conclusion dramatically overstates what the study shows. The study surveyed remote employees who self-reported their productivity. There is no control group (in-office workers measured the same way), no objective productivity measure, and no random sample from a broadly defined population of workers. “Proves” is inappropriate — this is one survey with significant limitations.

(b) Sources of bias:

  • Social desirability bias: Remote workers asked whether they’re productive while working remotely may feel pressure to say “yes” — admitting low productivity could feel like justifying an end to remote work privileges.
  • Undercoverage: Only remote employees at five tech companies are included. Workers in industries where remote work is impossible (manufacturing, healthcare, retail) are entirely excluded.
  • Self-selection / convenience: The sample is limited to companies that already have remote work policies — these companies may differ from average employers in culture, resources, and employee type.

(c) The population is implicitly “all workers” or “all people who work from home,” but only tech workers at companies that already embraced remote work are included. This excludes:

  • Workers in industries where remote work is impractical
  • Employees who tried remote work and returned to the office (survivorship bias)
  • Workers who want remote work but haven’t had the option

The conclusion would need a much broader, randomized study design to support such a sweeping claim.


Problem 6 — Survey Question Critique (C6)

Each question below has a flaw. Identify the problem with each question and rewrite it to fix the flaw.

  (a) “Don’t you agree that our school needs better sports facilities?”
  (b) “Are you satisfied with the price and the speed of our delivery service?”
  (c) “How often do you exercise regularly?” (with response options: Never / Sometimes / Often / Always)
  (d) “Given that most experts agree climate change is a serious problem, do you support carbon taxes?”
Show Solution

(a) Problem: Leading question. “Don’t you agree” frames the question to pressure respondents toward a “yes.” Fix: “Do you think our school’s sports facilities need improvement?” (or offer “Yes / No / No opinion”)

(b) Problem: Double-barrelled. A respondent might love the price but be unhappy with speed — the question forces one answer for two separate things. Fix: Split into two questions: “How satisfied are you with the price of our delivery service?” and “How satisfied are you with the speed of our delivery service?”

(c) Problem: Ambiguous wording. “Regularly” means different things to different people — some might define it as once a week, others as daily. The response options (Sometimes, Often) are equally vague. Fix: “In an average week, on how many days do you exercise for at least 30 minutes?” (Response: 0 / 1–2 / 3–4 / 5–6 / 7)

(d) Problem: Leading question with a loaded premise. “Most experts agree climate change is a serious problem” primes respondents toward agreement before the actual question is asked. Fix: Remove the preamble: “Do you support the introduction of carbon taxes in Canada?” (Yes / No / Unsure)

Section 7: Mastery Check

No hints. No guidance. This section checks whether you’ve genuinely understood the material — not just recognized answers when you saw them.

Question 1 — Feynman Test (C1 + C2)

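A classmate missed today’s lesson. Explain in your own words the difference between a population and a sample, and the difference between a parameter and a statistic.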

Write your explanation as if you’re texting your classmate — clear, direct, no jargon you haven’t defined. Aim for 200–400 characters.

Show a model answer

A population is the full group you care about — like all students at your school. A sample is just a portion you actually study — like 50 randomly chosen students. We use samples because studying everyone is usually too expensive or time-consuming.

A parameter is a number that describes the population (like the true average GPA of all students — we might never know it exactly). A statistic is a number computed from your sample (like the average GPA of your 50 students — you computed it directly). The statistic estimates the parameter. That’s the whole engine of statistics.

Key check: Does your answer clearly state that parameters are usually unknown while statistics are computed from data? That’s the heart of why this distinction matters.


Question 2 — Apply (C3 + C4)

A coffee chain with 240 store locations across Canada wants to measure customer satisfaction. Their data team proposes two approaches and asks you to evaluate them.

The chain has four regions: Atlantic (20 stores), Quebec (60 stores), Ontario (80 stores), Western Canada (80 stores).

Plan 1: Randomly select 30 stores. Visit each selected store on a random day and survey every customer who visits during a 4-hour window.

Plan 2: From each region, randomly select stores proportional to the region’s size (Atlantic: 5 stores, Quebec: 15 stores, Ontario: 20 stores, Western: 20 stores). Then survey every customer at selected stores during a random 4-hour window.

Evaluate the plans:

  (a) What sampling method is Plan 1? What is its main weakness in this context?
  (b) What sampling method is Plan 2? How does it address the weakness you identified?
  (c) Customer “satisfaction rating” is measured on a scale: Very Dissatisfied / Dissatisfied / Neutral / Satisfied / Very Satisfied. What type of variable is this? Why does it matter for how you’ll analyze the data?
Show Solution

(a) Plan 1 — Cluster sampling. The randomly selected stores are the clusters, and every customer who visits during the observation window is surveyed within each cluster. Main weakness: nothing guarantees regional balance. If the 30 randomly selected stores happen to over-represent one region (e.g., 20 of 30 are in Ontario), the satisfaction results may not reflect the other regions. Atlantic has only 20 of the 240 stores, so a random draw could easily include very few Atlantic stores, or none at all, leaving Atlantic customers underrepresented.

(b) Plan 2 — Stratified sampling. The regions are the strata. By selecting stores proportional to each region’s size, Plan 2 guarantees all regions are represented in proportion. This eliminates the risk that one region dominates the sample by chance.

(c) Satisfaction rating — Qualitative, Ordinal. The five response options have a clear order (Very Dissatisfied < Dissatisfied < Neutral < Satisfied < Very Satisfied), but the gaps are not necessarily equal. This matters because you can compute the mode or median category, but computing a true numerical mean of ordinal responses (e.g., “the average satisfaction is 3.7 out of 5”) is statistically questionable — the scale isn’t guaranteed to be interval-level. Better summaries include frequency distributions and the percentage in each category.
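As a quick check on parts (a) and (b), the sketch below recomputes Plan 2's proportional allocation and estimates how often Plan 1 would miss the Atlantic region entirely. It is only an illustration: the function name `proportional_allocation` and the variable names are made up here, and the probability treats Plan 1 as a simple random draw of 30 stores from the 240.

```python
from math import comb

# Store counts per region, taken from the problem statement.
stores = {"Atlantic": 20, "Quebec": 60, "Ontario": 80, "Western": 80}
total_stores = sum(stores.values())  # 240


def proportional_allocation(strata, sample_size):
    """Split sample_size across strata in proportion to stratum size.

    Floor division is exact here because every share divides evenly;
    in general the rounding would need more care.
    """
    total = sum(strata.values())
    return {name: sample_size * size // total for name, size in strata.items()}


# Plan 2 selects 60 stores in total (5 + 15 + 20 + 20).
plan2 = proportional_allocation(stores, 60)

# Plan 1: probability that a random draw of 30 stores contains no Atlantic
# store, i.e. all 30 come from the 220 non-Atlantic stores (hypergeometric).
non_atlantic = total_stores - stores["Atlantic"]
p_no_atlantic = comb(non_atlantic, 30) / comb(total_stores, 30)

print(plan2)
print(f"P(no Atlantic store under Plan 1) = {p_no_atlantic:.3f}")
```

Under these assumptions the allocation reproduces 5 / 15 / 20 / 20, and the chance of drawing no Atlantic store at all comes out at roughly 6%. Stratifying by region removes that possibility by construction.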


Question 3 — Find the Error (C4 + C5)

A student writes the following analysis. Find and explain every error.

“A food blogger wants to know whether Montrealers prefer traditional bagels over grocery store bagels. She posts a poll on her Instagram story and gets 847 responses: 91% prefer traditional bagels. Since she’s using Instagram, she has access to a huge random sample of Montrealers, so 91% is a reliable estimate of the true population proportion μ. Her poll used stratified sampling because Instagram has users from many different demographic groups.”

Show Solution

There are four distinct errors:

  1. Error 1 — “Huge random sample”: An Instagram story poll is voluntary response sampling, not random sampling. Only people who follow the blogger’s account and actively click to vote are included. This is a convenience/voluntary response sample, not a random one. The size (847 responses) doesn’t fix the selection bias.

  2. Error 2 — “Reliable estimate”: Because the sample is biased (food bloggers’ followers are disproportionately food enthusiasts who probably care about artisan food quality), the 91% figure likely overestimates the preference among all Montrealers. A biased large sample is not more reliable than an unbiased small one.

  3. Error 3 — “μ” for a proportion: The symbol \(\mu\) denotes a population mean (average of a quantitative variable). A proportion (percentage preferring traditional bagels) is typically written \(p\) for the population proportion and \(\hat{p}\) for the sample proportion. Using \(\mu\) here is a notation error.

  4. Error 4 — “Stratified sampling”: Stratified sampling requires the researcher to deliberately divide the population into strata and then randomly sample from each. The blogger did no such thing — she posted a poll and accepted whoever responded. The fact that Instagram users are demographically diverse does not make the sample stratified. It’s still voluntary response sampling.


Self-Assessment

How confident are you with the material in this lesson?


If you’re under 70% confident, revisit Section 3 (Core Concepts). Focus on the concept that felt shakiest in this section — that’s the one most worth reviewing before you continue.

Section 8: Boss Fight

You’ve built the vocabulary. You’ve practised the classifications. Now it’s time to put it all together.

Choose your path. Both cover the same concepts — they just ask you to use them differently.

🔬 The Analyst

A national survey was conducted. Your job is to dissect it — identify what was done well, what was done wrong, and what conclusions are and aren’t justified.

Emphasis: critical reading, classification, identifying bias

🏗️ The Architect

A research question has been posed. Your job is to design the study — choose sampling methods, write good survey questions, and justify every decision.

Emphasis: methodology, survey design, justification

🔬 Path A — The Analyst: Dissecting a Real Survey

Below is a summary of a survey conducted by a Canadian media outlet. Read it carefully — there are problems hiding in every paragraph.

The Survey Report:

“We asked Canadians about their screen time habits. Our team emailed a survey to 5,000 subscribers of our newsletter, and 1,847 completed it, a 37% response rate. Results showed the average daily screen time was 6.4 hours, which we report as the true average for all Canadian adults (\(\mu = 6.4\) h). Among respondents, we identified three groups by profession (students, office workers, and other) and found students averaged 8.1 h/day, office workers 6.0 h/day, and others 5.2 h/day. Since we collected professional background, this constitutes stratified sampling. The screen time question asked: ‘How many excessive hours do you spend on screens daily?’ Respondents answered on a scale from 1 to 5 hours (if they entered ‘3,’ this means 3 hours). The variable ‘daily screen time’ is qualitative-ordinal since people rate it on a 1–5 scale.”

Your Analysis Tasks

Task A1 — Sampling method: The report claims this is stratified sampling. Is it? (Identify the actual method used and explain the distinction.)


Task A2 — Sampling bias: Identify at least two sources of bias. For each, explain the likely direction of the bias (does it push the screen time estimate up or down?).


Task A3 — Parameter notation: The report writes “\(\mu = 6.4\) h.” Is this notation correct? What should the correct notation be, and why?


Task A4 — Variable type: The report calls daily screen time “qualitative-ordinal.” Is this correct? What is the actual type and why?


Task A5 — Survey question: Identify the problem with the question “How many excessive hours do you spend on screens daily?” Rewrite it.

Show Full Solution

A1 — Not stratified sampling; it’s convenience/voluntary response: Stratified sampling requires the researcher to divide the population into strata and then randomly sample from each stratum before data collection. Here, the researchers simply emailed a convenience sample of newsletter subscribers and accepted whoever responded. Sorting respondents into student/office worker/other after the fact is post-hoc grouping — not stratification. This is voluntary response sampling (a form of convenience sampling). The professional groups discovered in the data are subgroups, not strata.

A2 — Bias sources:

  • Undercoverage bias (likely pushes estimate up, but direction is not certain): Newsletter subscribers tend to be higher-educated, media-literate adults. Low-income individuals, the elderly, and rural Canadians with low digital engagement are underrepresented. If these groups have less screen time, leaving them out pushes the estimate up. Subscribers to a media newsletter might also be heavier screen users than average, which pushes in the same direction. Without data on the excluded groups the direction cannot be confirmed, but the bias is certain.
  • Non-response bias (direction uncertain, plausibly up): Only 37% responded. People who are reflective about their digital habits, and heavy users who are proud of their tech engagement, may respond more readily, which would push the estimate up. Those embarrassed by high screen time might skip the survey, which would push it down. The net direction is unclear, but respondents cannot be assumed to resemble the 63% who did not answer.
  • Question wording bias (pushes estimate down — see A5): The word “excessive” cues respondents that high screen time is bad, potentially causing underreporting.

A3 — Wrong notation: \(\mu\) is the population parameter, the true average daily screen time of all Canadian adults, which was never measured. The 6.4 hours came from the 1,847 respondents, so it is the sample mean, correctly written \(\bar{x} = 6.4\) h. The report should say: “Our sample mean was \(\bar{x} = 6.4\) h, which we use to estimate the population mean \(\mu\) for Canadian adults.”

A4 — Wrong variable type, but for an interesting reason: Daily screen time measured in hours is quantitative — continuous (you can watch 2.7 hours, 4.15 hours, etc.; the variable is measured, not categorized). The report confuses the response scale (1–5 as possible answers) with the variable type. If respondents entered “3” meaning 3 hours, that’s a numerical measurement. The variable is continuous; the researcher’s measurement instrument limits precision but doesn’t change the underlying nature of the variable.

A5 — Leading question (loaded word “excessive”): “Excessive” implies that screen time is inherently bad and high amounts are excessive. Respondents who use screens a lot for legitimate work may feel pressured to report lower numbers. Better: “On an average day, approximately how many hours do you spend looking at screens (phone, tablet, computer, TV combined)?” with open numerical entry or hour ranges.

Reflection: What was the most challenging part of this analysis? What would you check first if you were a journalist reviewing this survey before publication?

🏗️ Path B — The Architect: Design the Study

You’re working as a research consultant for the Quebec Ministry of Education. They need to understand student mental well-being across all CEGEP institutions in Quebec.

Context: Quebec has 48 CEGEP institutions. Collectively, they enroll approximately 220,000 students. Institutions range from small rural CEGEPs (500 students) to large urban ones (12,000 students). The ministry wants to survey 3,000 students. Their research question: “What is the average self-reported mental well-being score of Quebec CEGEP students, and do well-being scores differ by institution size?”

Your Design Tasks

Task B1 — Define the population and sample: State precisely what the population is and what your sample of 3,000 represents.


Task B2 — Choose a sampling method: Recommend a sampling method and justify it. (Hint: the research question mentions comparing by institution size — how does that affect your design?) Explain why you’re not choosing convenience sampling or simple random sampling.


Task B3 — Identify the parameter and statistic: What parameter is the ministry trying to estimate? What statistic will you compute?


Task B4 — Design two survey questions: Write two well-designed questions for the survey. One should measure mental well-being (on a scale), the other should capture a demographic. For each, identify the variable type and explain why your wording avoids bias.


Task B5 — Anticipate one bias: Despite your best design, name one source of bias that could still affect your results, and suggest how to minimize it.

Show Full Solution

B1 — Population and sample: Population = all currently enrolled Quebec CEGEP students (~220,000). Sample = 3,000 students selected according to the design below. The sample represents the population only if it mirrors the population’s distribution by institution type, size, region, and program.

B2 — Recommended method: Stratified sampling by institution size. Since the research question specifically asks about differences by institution size, we must guarantee that small, medium, and large institutions are all represented. Suggested strata:

  • Small (under 2,000 students): 14 institutions
  • Medium (2,000–6,000): 20 institutions
  • Large (over 6,000): 14 institutions

Allocate the 3,000 surveys proportionally to enrollment within each stratum, then randomly select students within each stratum.

Why not SRS? With an SRS of students, the 14 large institutions dominate because they enroll far more students. Small institutions might get very few respondents, making comparison across sizes unreliable.

Why not convenience? It is biased toward easily reachable students (likely healthier and more engaged). Stratification ensures all size groups are represented.

B3 — Parameter and statistic: Parameter: \(\mu\) = the true mean mental well-being score of all ~220,000 Quebec CEGEP students. Statistic: \(\bar{x}\) = the mean well-being score computed from the 3,000 surveyed students, used to estimate \(\mu\).

B4 — Survey questions:

  1. “Over the past two weeks, how often have you felt able to handle the challenges of daily student life?”
     Options: Never / Rarely / Sometimes / Often / Always
     Variable type: Qualitative, Ordinal
     Why it avoids bias: Neutral phrasing (no “struggling” or “excelling”), time-bounded (last two weeks), and focused on a specific concrete feeling rather than the vague concept of “mental health.”

  2. “Which of the following best describes your current program?”
     Options: Pre-university (Social Sciences / Pure and Applied Sciences / Arts and Literature / Other pre-university) / Technical program
     Variable type: Qualitative, Nominal
     Why it avoids bias: Factual question with clear categories, no value judgment, includes an “Other” option to avoid forcing respondents into incorrect categories.

B5 — Anticipated bias: Non-response bias. Even with a well-designed stratified sample, students experiencing severe mental distress — the very people most relevant to the study — may be least likely to complete a voluntary survey. This would underestimate the severity of problems in the population. Mitigation: send two follow-up reminders to non-respondents, offer the survey in both French and English, and work with institution wellness offices to encourage participation while maintaining anonymity.

Reflection: What aspect of this design was hardest to decide? Is there a sampling method you considered and rejected? What would you do if the ministry’s budget only allowed for 500 surveys instead of 3,000?

Section 9: Challenge Problems

Ready for more? These go beyond the lesson objectives. They’re for students who want to think more deeply — or get a preview of where this course is headed.

Challenge 1 — Can a Statistic Equal a Parameter? (C2)

A class of exactly 5 students has quiz scores: 70, 74, 80, 86, 90. Their professor calculates the class average as 80. She then randomly selects 3 students to give a brief survey and finds their average is also 80.

(a) Is the 80 computed from all 5 students a parameter or a statistic? Use correct notation.
(b) Is the 80 computed from the 3 selected students a parameter or a statistic? Use correct notation.
(c) The two values are identical. Does this mean the sample perfectly represents the population? Explain why or why not.

Show Solution

(a) \(\mu = 80\): a parameter. All 5 students (the entire population) were measured. No sampling occurred.

(b) \(\bar{x} = 80\): a statistic. Only 3 of the 5 students were sampled. The 80 was computed from the data of a subgroup.

(c) No. Equal values don’t mean perfect representation. The 3 selected students might be {70, 80, 90} or {74, 80, 86}: two different samples with different spreads, both averaging 80. The statistic matches the parameter by coincidence. Other samples of 3 from these 5 students give different averages: {70,74,80}→74.7; {70,74,90}→78.0; {74,80,90}→81.3; {80,86,90}→85.3; {70,86,90}→82.0. Only 2 of the 10 possible samples have a mean equal to μ = 80. This illustrates sampling variability, the topic of DS-INF-1.

A town has exactly 4 households with annual incomes (in thousands): $42, $58, $65, $75. A researcher samples 2 households.

(a) Compute the population mean income. What notation applies?
(b) List all possible samples of size 2 and their sample means.
(c) How many samples produce a sample mean exactly equal to the population mean? What does this tell us?

Show Solution

(a) \(\mu = \frac{42 + 58 + 65 + 75}{4} = \frac{240}{4} = 60\) thousand dollars.

(b) All samples of size 2 from {42, 58, 65, 75}:

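  • {42, 58}: \(\bar{x} = 50\)
  • {42, 65}: \(\bar{x} = 53.5\)
  • {42, 75}: \(\bar{x} = 58.5\)
  • {58, 65}: \(\bar{x} = 61.5\)
  • {58, 75}: \(\bar{x} = 66.5\)
  • {65, 75}: \(\bar{x} = 70\)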

(c) Zero samples produce \(\bar{x} = 60\) exactly. An individual sample mean rarely equals the population mean; this is sampling variability. Notice that the average of all six sample means is (50+53.5+58.5+61.5+66.5+70)/6 = 360/6 = 60 = μ. The sample mean is an unbiased estimator: on average it hits the target, but any individual sample rarely lands exactly on it.
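The enumeration in this solution is small enough to reproduce by brute force. The following is a minimal sketch, using only the four incomes from the problem; the helper `average` and the variable names are illustrative choices, not part of the lesson.

```python
from itertools import combinations

incomes = [42, 58, 65, 75]  # household incomes, in thousands of dollars


def average(values):
    """Plain arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


mu = average(incomes)  # population mean (the parameter)

# Every possible sample of size 2, mapped to its sample mean (the statistic).
sample_means = {pair: average(pair) for pair in combinations(incomes, 2)}

for pair, x_bar in sample_means.items():
    print(pair, x_bar)

# How many samples land exactly on the population mean?
hits = sum(1 for x_bar in sample_means.values() if x_bar == mu)

# Averaging the sample means over all possible samples recovers mu.
mean_of_means = average(sample_means.values())

print("population mean:", mu)
print("samples equal to mu:", hits)
print("mean of all sample means:", mean_of_means)
```

Swapping in a different list of incomes or a different sample size is a quick way to see that the "average of all sample means equals the population mean" pattern is not specific to these four numbers.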

A company has 6 employees. Their annual salaries (in thousands) are: $45, $50, $55, $60, $70, $80. The CEO reports the average salary as $60k. A journalist randomly asks 2 employees about their salary and gets $45k and $80k.

(a) Is the CEO’s $60k a parameter or statistic?
(b) Is the journalist’s sample mean a parameter or statistic? Compute it.
(c) The two averages differ. Does this prove the CEO is lying? What does it actually illustrate?

Show Solution

(a) \(\mu = 60\) thousand dollars: a parameter (all 6 employees = the entire company = the population; the CEO computed this from the full data).

(b) \(\bar{x} = (45 + 80)/2 = 62.5\) thousand dollars: a statistic (computed from a sample of 2).

(c) No — the discrepancy doesn’t prove lying. It illustrates sampling variability: different random samples produce different statistics. The journalist happened to sample the lowest and highest paid employees — an extreme but possible outcome. The CEO’s figure is the true population parameter. The journalist’s statistic is simply one possible sample mean, which happened not to equal μ. This is exactly why statistics need uncertainty quantification (confidence intervals — covered later in the course).


Challenge 2 — Multistage vs. Cluster: Design Choices (C4)

A national research institute wants to estimate the proportion of high school students in Canada who spend more than 3 hours/day on social media. Canada has approximately 3,200 high schools and 1.5 million high school students.

(a) A colleague proposes cluster sampling: randomly select 30 schools, then survey every student in those schools. Estimate the total number of students surveyed if each school has an average of 900 students. Is this practical?

(b) Alternatively, design a two-stage multistage sampling plan that surveys approximately 3,000 students. Be specific about what happens at each stage.

(c) Compare the two designs. Which produces a more precise estimate? Which is more practical? Is there a tradeoff?

(d) Preview question: The true proportion of students who spend >3 h/day on social media is an unknown parameter. What symbol would this parameter be given in the notation introduced in this lesson? (Hint: it’s not μ — μ is for means. Look at the context document for the course.)

Show Solution

(a) 30 schools × 900 students/school = 27,000 students. This is logistically impractical for most research budgets — surveying 27,000 students requires massive coordination across 30 schools. Cluster sampling with full enumeration works well for small clusters, but 900-student schools are too large.

(b) Two-stage plan:

  • Stage 1: Randomly select 60 schools from the 3,200 high schools across Canada (use SRS or stratified by province).
  • Stage 2: Within each selected school, randomly select 50 students (using the school’s enrollment list and a random number generator).
  • Total: 60 × 50 = 3,000 students surveyed.

(c) The cluster design (full enumeration of 30 schools) tends to be less statistically precise per student surveyed than multistage sampling, because students in the same school share similar social environments, making their responses correlated. The multistage design spreads 3,000 students across 60 schools, capturing more geographic and demographic diversity. Tradeoff: the cluster design requires agreements with fewer schools (30 vs. 60) but surveys far more students overall (27,000 vs. 3,000), so the multistage design is the more practical of the two.

(d) The parameter is a population proportion: the true fraction of all Canadian high school students who spend >3 h/day on social media. This would be written p (population proportion). You’ll see this notation formally introduced in DS-PR-6 and INF-4. The sample proportion, computed from your 3,000 students, would be written \(\hat{p}\) (“p-hat”).
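The two-stage plan from part (b) can be sketched in a few lines. This is an illustration under simplifying assumptions, not a production sampling frame: school and student IDs are hypothetical integers, every school is assumed to have exactly 900 students (the average from part (a)), and the seed is arbitrary.

```python
import random

rng = random.Random(2024)  # fixed, arbitrary seed so the sketch is repeatable

N_SCHOOLS = 3200                  # from the problem statement
STUDENTS_PER_SCHOOL = 900         # assumption: every school has exactly 900 students
SCHOOLS_SAMPLED = 60              # stage 1 sample size from part (b)
STUDENTS_SAMPLED_PER_SCHOOL = 50  # stage 2 sample size from part (b)

# Stage 1: simple random sample of schools (hypothetical IDs 0..3199).
schools = rng.sample(range(N_SCHOOLS), SCHOOLS_SAMPLED)

# Stage 2: simple random sample of students within each selected school
# (hypothetical student IDs 0..899 standing in for the enrollment list).
sample = {
    school: rng.sample(range(STUDENTS_PER_SCHOOL), STUDENTS_SAMPLED_PER_SCHOOL)
    for school in schools
}

total_two_stage = sum(len(students) for students in sample.values())

# For comparison: the one-stage cluster design surveys everyone in 30 schools.
total_cluster = 30 * STUDENTS_PER_SCHOOL

print("two-stage total:", total_two_stage)
print("cluster total:", total_cluster)
```

Stage 1 could be swapped for a draw stratified by province, as the solution suggests, without changing stage 2.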


Challenge 3 — Why Convenience Sampling is Always Biased: A Proof Sketch (C5)

This problem asks you to reason carefully about why convenience sampling produces biased estimates, not just that it does.

Consider the following model: A population has N individuals. Each individual has some characteristic value x (say, hours of exercise per week). The true population mean is \(\mu\).

In a convenience sample, individuals are not selected with equal probability. Let \(\pi_i\) be the probability that individual i is selected. A random sample (SRS) requires \(\pi_i\) to be the same for all i (equal probability). A convenience sample has \(\pi_i\) that varies: some individuals (those “convenient”) have much higher \(\pi_i\) than others.

(a) Suppose a gym posts a sign-up sheet for a study on exercise habits. Who is more likely to sign up: people who exercise frequently or people who rarely exercise? What does this say about the \(\pi_i\) values for high-exercisers vs. low-exercisers?

(b) If high-exercisers have higher \(\pi_i\), the sample will overrepresent them. In what direction will the sample mean \(\bar{x}\) be biased relative to \(\mu\)?

(c) This bias doesn’t go away with a larger sample. In fact, a convenience sample of 10,000 gym members would be even more biased than a random sample of 100 general-population adults. Why?

(d) Write a one-paragraph “proof sketch” (in words, no formal mathematics required) explaining why unequal selection probabilities are the fundamental cause of bias in non-random sampling.

Show Solution

(a) Frequent exercisers are far more likely to sign up: they’re proud of their habits, they’re at the gym already, and the topic seems relevant to them. Low-exercisers who feel embarrassed about their habits or simply never visit the gym have essentially zero probability of being in this sample. So \(\pi_i\) is high for high-exercisers, near zero for low-exercisers.

(b) Since high-exercisers are overrepresented, \(\bar{x}\) will be systematically higher than \(\mu\), so the sample will overestimate the true population mean exercise time.

(c) A larger convenience sample still draws from the same biased pool. Adding more gym members to the sample doesn’t add people who don’t go to gyms. You’re just measuring the same overrepresented group more precisely — getting a very accurate estimate of the wrong quantity. Bias comes from who is in the sample, not how many.

(d) Proof sketch: The sample mean \(\bar{x}\) is a weighted average of the values in the sample. In SRS, every individual has an equal probability of selection, so the contribution of each individual’s value to the expected value of \(\bar{x}\) is equal, and the expected value of \(\bar{x}\) equals \(\mu\) (the statistic is unbiased). In a convenience sample, individuals with higher selection probability contribute more heavily to \(\bar{x}\). If these high-probability individuals have systematically higher (or lower) \(x\) values than the rest of the population, the expected value of \(\bar{x}\) will be systematically above (or below) \(\mu\). This systematic deviation is bias, and it persists no matter how large the sample grows, because the mechanism producing the bias (unequal selection probability) is unchanged by sample size.
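The proof sketch can be made concrete with a toy calculation. The sketch below uses a simplified model that is an assumption of this example, not something stated in the lesson: each draw picks one individual with probability proportional to a selection weight, with replacement. Under that model the expected sample mean is the weight-averaged value, and it does not depend on how many draws are made. The population values and weights are invented for illustration.

```python
# Hypothetical population: weekly exercise hours for six people.
hours = [0, 1, 2, 5, 8, 10]

# Hypothetical selection weights: frequent exercisers are more likely to sign up.
weights = [1, 1, 2, 5, 8, 10]

mu = sum(hours) / len(hours)  # true population mean


def expected_sample_mean(values, selection_weights):
    """Expected value of the sample mean when every draw selects individual i
    with probability proportional to selection_weights[i] (with replacement).

    The number of draws does not appear in the formula: a larger sample
    changes the spread of the sample mean, not where it is centred.
    """
    total_weight = sum(selection_weights)
    return sum(w * x for w, x in zip(selection_weights, values)) / total_weight


# Equal weights reproduce the SRS case: the expected sample mean equals mu.
srs_expectation = expected_sample_mean(hours, [1] * len(hours))

# Unequal weights shift the expected sample mean away from mu.
convenience_expectation = expected_sample_mean(hours, weights)

print("population mean:", round(mu, 3))
print("expected mean, equal weights:", round(srs_expectation, 3))
print("expected mean, convenience weights:", round(convenience_expectation, 3))
```

With these made-up numbers the convenience expectation sits well above the population mean, and nothing in the calculation would change if the sample were 100 people or 10,000.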

Section 10: Solutions Reference

Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include worked examples, common mistakes to watch for, and interpretation guidance.


If you’re stuck: Re-read the relevant Core Concept in Section 3, then find the Worked Example that maps to that concept (e.g., Example 1 maps to Concept 1). The solutions page shows the reasoning behind every step, not just the final answer.

Quick-Reference Notation

| Measure | Population (Parameter) | Sample (Statistic) |
| --- | --- | --- |
| Size | \(N\) | \(n\) |
| Mean | \(\mu\) (mu) | \(\bar{x}\) (x-bar) |
| Proportion | \(p\) | \(\hat{p}\) (p-hat); introduced in DS-PR-6 |

| Variable Type | Sub-types | Examples |
| --- | --- | --- |
| Categorical (Qualitative) | Nominal | Eye color, Postal code |
| Categorical (Qualitative) | Ordinal | Letter grade, Survey rating |
| Numerical (Quantitative) | Discrete | Number of children, Shoe size |
| Numerical (Quantitative) | Continuous | Height, Weight, Time |

| Sampling Method | How it works |
| --- | --- |
| Simple Random (SRS) | Every possible sample of size \(n\) has an equal chance. |
| Stratified | Divide into groups (strata), take SRS from every group. |
| Cluster | Divide into groups (clusters), randomly select some groups, sample everyone in those. |
| Systematic | Pick every \(k\)-th individual from a list. |
| Convenience (Biased) | Pick whoever is easiest to reach. |