REG-1: Correlation Analysis

Module 4 · Regression & Association

Section 1: Introduction

Here is a statistic that made headlines: on hot summer days in the United States, ice cream sales and shark attacks both increase. The correlation between them is real, measurable, and quite strong. Does this mean ice cream causes sharks to attack swimmers?

Of course not. Both are driven by a third factor — hot weather draws people to the beach and into the water, and also drives up ice cream sales. The ice cream has nothing to do with the sharks. Yet the numerical relationship between those two variables is genuine.

This is the central tension in correlation analysis: measuring a relationship precisely and accurately does not tell you why the relationship exists. A number can be exactly right and deeply misleading at the same time. By the end of this lesson, you will be able to compute that number correctly, interpret what it does and does not mean, and spot when a strong correlation is being used to claim something it cannot support.

The quantity we measure is called the Pearson correlation coefficient, written $r$. It answers one question: how strongly and in what direction do two quantitative variables move together in a linear pattern? It ranges from $-1$ (perfect negative linear relationship) to $+1$ (perfect positive linear relationship), with $r = 0$ meaning no linear trend at all.

After this lesson, you will be able to:

  • Describe a scatter plot in four dimensions: direction, form, strength, and outliers
  • Compute the Pearson correlation coefficient using both the definition formula and the computational formula
  • Compute the coefficient of determination and interpret it correctly as a proportion of variability explained
  • Interpret the significance of $r$ given $n$ using the $t$-statistic overview
  • Distinguish correlation from causation and identify plausible confounding variables

This lesson builds directly on the scatter plot skills from DS-2 and the standard deviation notation from DS-5. Those two prerequisites are your launching pad — everything new here adds meaning and precision to what you already know.

Section 2: Prerequisites

Correlation analysis is built on two foundational skills. Before computing $r$, make sure these are solid.

  • From DS-2: Scatter Plots. You know how to plot two quantitative variables, read axis values, and describe the overall pattern. Today we attach a precise number to that description.
  • From DS-2: Direction and Form. A scatter plot can trend upward (positive), downward (negative), or show no clear direction. The cloud can be linear (straight-line pattern) or curved. You will need both distinctions throughout this lesson.
  • From DS-5: Standard Deviation. The notation $s_x$ and $s_y$ refers to the sample standard deviations of the $x$-values and $y$-values respectively. The Pearson formula divides by $s_x$ and $s_y$, so you need to recognize what those symbols mean.
  • From DS-5: Z-Scores. A z-score measures how many standard deviations a value is from the mean. The definition form of $r$ is essentially the average product of paired z-scores, an insight that explains why $r$ always stays between $-1$ and $+1$.

Retrieval Checkpoint

A scatter plot of daily temperature ($x$-axis) vs. hot chocolate sales ($y$-axis) shows that as temperature increases, hot chocolate sales decrease. The cloud of points is fairly tight around a straight line. Which description is most accurate?

The Boundary:

  • New boundary: In DS-2, describing a scatter plot was entirely qualitative, using words like “upward trend” and “tight cluster.” In this lesson, we attach a precise number ($r$) to those words. The tricky part: $r$ only measures the linear component. A scatter plot can show a strong curved relationship and still have $r \approx 0$. The number does not replace the visual; it adds to it.

Retrieval Warm-up — from earlier lessons

A forest ecologist records soil moisture (%) and fern density (plants/m²) at 20 sites. She computes the sample standard deviations: % and plants/m². Which statement best describes what these values tell her before she computes the regression slope?

A student is reviewing INF-5 before starting this REG module. She sees the conclusion: “We fail to reject $H_0$.” Which interpretation is correct?

Section 3: Core Concepts

Ten concepts live in this section. They build in a specific order:

  • C1–C3: What a scatter plot tells you qualitatively, and what $r$ measures quantitatively.
  • C4–C5: Two critical warnings: what $r \approx 0$ does not mean, and what $r^2$ adds.
  • C6: The most common numerical error students make (using $r$ where $r^2$ belongs).
  • C7: The conceptual error everyone has seen in headlines: causation claimed from correlation.
  • C8–C10: When is $r$ statistically real? What are the alternatives? What conditions must hold?

C1 — Reading Scatter Plots Qualitatively

Before any formula, a scatter plot tells a story in four dimensions. Whenever you see a scatter plot — or are asked to interpret one — describe it in this order:

  1. Direction: Do the points trend upward (positive) or downward (negative)? Or is there no apparent trend?
  2. Form: Is the pattern approximately linear (straight-line), or is it curved?
  3. Strength: How tightly do the points cluster around the trend? Loosely scattered = weak; hugging a line = strong.
  4. Outliers: Are there any individual points far from the main pattern?

Mini-example: A scatter plot of hours studied ($x$) vs. exam score ($y$) shows points that trend upward, cluster fairly tightly around a straight line, with no obvious outliers. Description: positive direction, linear form, moderate-to-strong strength, no notable outliers.

Before we compute any number, explore what different strengths of correlation actually look like visually:

[Interactive figure: scatter plots showing correlation strength for $r$ values from −0.9 to 0.9]

C2 — Pearson Correlation Coefficient

The Pearson $r$ is the number that quantifies what C1 describes. It has a precise formula built from an elegant idea: $r$ is the average product of the paired z-scores of $x$ and $y$.

We write $r$ to mean the sample Pearson correlation coefficient, a statistic computed from our data.

Pearson Correlation Coefficient — Definition Form

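$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$, and $n$ is the number of pairs.

Range: $-1 \le r \le +1$. The sign gives direction; the magnitude gives strength.

The definition form shows the intuition: when $x_i$ is above $\bar{x}$ and $y_i$ is above $\bar{y}$ at the same time (both deviations positive), the product is positive, pushing $r$ toward $+1$. The same happens when both values are below their means (both deviations negative). When they go in opposite directions, the product is negative. When there is no systematic pattern, the positives and negatives cancel out, giving $r \approx 0$.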
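The z-score idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual convention that the paired z-score products are summed and divided by $n - 1$, with sample standard deviations. The function name and the small dataset are made up for this example.

```python
from math import sqrt

def pearson_r_definition(xs, ys):
    """Pearson r as the sum of paired z-score products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Each term is the product of one pair's two z-scores.
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Illustrative data only (not taken from the lesson).
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r_definition(hours, scores)
print(f"r = {r:.3f}")
```

Pairs where both values sit on the same side of their means contribute positive products, which is why this dataset gives a positive $r$. The sketch does not guard against a variable with zero spread, where the z-scores are undefined.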

For hand calculations, the computational form avoids computing means and deviations explicitly:

Pearson Correlation Coefficient — Computational Form

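$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Computation table structure: Set up five columns ($x$, $y$, $xy$, $x^2$, $y^2$), compute the column sums, then substitute into the formula.

Why both forms? The definition form makes the meaning transparent (products of z-scores). The computational form makes the arithmetic tractable for hand calculations, with no need to compute $\bar{x}$ and $\bar{y}$ separately.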
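The five-column table translates directly into code. The sketch below assumes the standard computational form, with numerator $n\sum xy - (\sum x)(\sum y)$ and bracket terms $n\sum x^2 - (\sum x)^2$ and $n\sum y^2 - (\sum y)^2$. The function name and data are illustrative, not part of the lesson.

```python
from math import sqrt

def pearson_r_computational(xs, ys):
    """Pearson r from the five column sums of the computation table."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    left_bracket = n * sum_x2 - sum_x ** 2
    right_bracket = n * sum_y2 - sum_y ** 2
    return numerator / sqrt(left_bracket * right_bracket)

# Illustrative data only (not taken from the lesson).
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r_computational(hours, scores)
print(f"r = {r:.3f}")
```

For this dataset the sums are $\sum x = 15$, $\sum y = 20$, $\sum xy = 66$, $\sum x^2 = 55$, $\sum y^2 = 86$, so the numerator is 30 and the brackets are 50 and 30. The means never appear explicitly, which is the practical advantage of this form.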


C3 — Interpreting $r$ — Strength Thresholds

The magnitude $|r|$ tells you how strong the linear relationship is. These conventional guidelines are widely used in introductory statistics:

$|r|$ range | Interpretation
Strong linear relationship
Moderate linear relationship
Weak linear relationship

The sign of $r$ tells you direction independently of strength: a value near $-1$ is a strong negative relationship; a small positive value is a weak positive relationship.

Mini-example: a negative $r$ whose magnitude $|r|$ falls in the moderate range. Sign: negative → as $x$ increases, $y$ tends to decrease. Full interpretation: moderate negative linear relationship.

These thresholds are conventions, not laws. In some fields (medicine, psychology), even a modest $r$ might be considered practically meaningful. In physics, a much larger $r$ might be considered disappointingly weak. Always consider the context.


C4 — $r \approx 0$ Does NOT Mean No Relationship

This is perhaps the most important conceptual trap in correlation analysis. A Pearson $r$ close to zero means there is no linear relationship. It says nothing about other kinds of relationships.

If you compute $r \approx 0$ and conclude “there is no relationship between $x$ and $y$,” you may be completely wrong. A perfect, symmetric U-shaped curve, for example, produces $r = 0$ even though $y$ is perfectly predictable from $x$. Always look at the scatter plot: $r$ alone is not enough.

Mini-example: take $x$-values placed symmetrically around zero and let $y = x^2$. The scatter plot shows a perfect U-shape. Yet $r = 0$ because the positive products of deviations on the right exactly cancel the negative products on the left. The relationship is perfect but non-linear.
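A quick numerical check makes the cancellation concrete. The sketch below assumes five $x$-values symmetric around zero with $y = x^2$; these particular values are chosen for illustration, and `pearson_r` is a helper written for this example using the five-sum computational form.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via the computational (five-sum) form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

xs = [-2, -1, 0, 1, 2]      # illustrative values, symmetric around zero
ys = [x ** 2 for x in xs]   # y is an exact function of x

r = pearson_r(xs, ys)
print(f"r = {r}")
```

Here $\sum xy = -8 - 1 + 0 + 1 + 8 = 0$ and $\sum x = 0$, so the numerator is exactly zero even though every $y$ is determined by its $x$.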


C5 — Coefficient of Determination

Squaring $r$ gives a quantity with a direct interpretive meaning that $r$ itself does not have:

Coefficient of Determination

$r^2$ is the proportion of the variability in $y$ that is explained by the linear relationship with $x$. It is usually reported as a percentage.

Formula: square $r$, then multiply by 100 to get a percentage.

The correct phrasing: “The linear relationship with $x$ explains $r^2 \times 100\%$ of the variability in $y$.”

Never say “$r^2 \times 100\%$ of the relationship” (relationship is not measurable) or “$r^2 \times 100\%$ of $y$” (it is the variability in $y$, not $y$ itself).

Mini-example: $r = 0.80$. Then $r^2 = 0.64$. Interpretation: “The linear relationship with $x$ explains 64% of the variability in $y$. The remaining 36% of variability is not explained by this linear relationship with $x$; it may be due to other variables, non-linear patterns, or measurement variability.”


C6 — Confusing $r$ with $r^2$

This is the most common numerical error students make. If $r = 0.70$, the linear relationship with $x$ explains $r^2 = 0.49$, or 49%, of the variability in $y$, not 70%.

Never use $r$ directly as a percentage of variability explained. If $r = 0.70$, do not say “x explains 70% of y.” You must square it: $r^2 = 0.49$ → “x explains 49% of the variability in y.” The squaring step is mandatory.

Notice that even when $r$ is already moderate or large, $r^2$ is dramatically smaller: $r = 0.80$ gives $r^2 = 0.64$; $r = 0.70$ gives $r^2 = 0.49$; $r = 0.50$ gives $r^2 = 0.25$. A “moderate” correlation explains only about a quarter of the variability.
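The gap between $r$ and $r^2$ is easy to tabulate. The short sketch below uses a hypothetical helper, `percent_explained`, and a handful of illustrative $r$ values to show how quickly the explained percentage falls as $r$ moves away from 1.

```python
def percent_explained(r):
    """Percentage of variability in y explained by the linear relationship with x."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    return 100 * r ** 2

# Illustrative r values; the sign disappears when squaring.
for r in (0.50, 0.70, 0.80, 0.90, -0.70):
    print(f"r = {r:+.2f}  ->  r^2 = {r ** 2:.2f}  ->  about {percent_explained(r):.0f}% explained")
```

Note that $r = -0.70$ and $r = +0.70$ explain the same share of variability: direction lives in $r$, explained variability lives in $r^2$.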


C7 — Correlation vs. Causation

A strong, statistically significant correlation between $x$ and $y$ does not mean $x$ causes $y$. Three alternative explanations are always possible:

  1. Reverse causation: $y$ causes $x$ (not the other way around)
  2. Confounding variable: A third variable causes both $x$ and $y$ to change
  3. Coincidence (spurious correlation): The correlation is a statistical accident, especially with small samples or data dredging

The ice cream and sharks example again: Ice cream sales ($x$) and shark attacks ($y$) are correlated. The confounder is summer heat: hot weather drives both more ice cream sales and more beach visits (which increase shark encounters). Ice cream does not cause sharks to attack. Both are caused by a third variable.

Even a large $|r|$ with a small p-value does not establish causation. Statistical significance means the correlation is unlikely to be due to chance alone; it does not rule out confounders or reverse causation. Establishing causation requires experimental design (random assignment), not just observation.
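A small simulation can show how a confounder manufactures a correlation. Everything in this sketch is hypothetical: the temperature range, the coefficients, and the noise levels are invented, and neither simulated series is allowed to influence the other. Only temperature feeds into both.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via the computational (five-sum) form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical model: daily temperature drives both series independently.
temperature = [rng.uniform(15, 35) for _ in range(200)]
ice_cream_sales = [10 * t + rng.gauss(0, 20) for t in temperature]
shark_encounters = [0.5 * t + rng.gauss(0, 1) for t in temperature]

r = pearson_r(ice_cream_sales, shark_encounters)
print(f"r(ice cream, shark encounters) = {r:.2f}")
```

The resulting $r$ is strongly positive even though the code contains no path from ice cream to sharks. The number is genuine; the causal story it seems to suggest is not.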


C8 — Testing the Significance of $r$ (Overview)

A correlation computed from a small sample might be large just by chance. To test whether $r$ reflects a real relationship in the population, we use a $t$-test:

t-Test for the Significance of r

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad df = n - 2$$

Compare the computed $|t|$ to the critical value $t^*$ from the $t$-table at $df = n - 2$ and the chosen $\alpha$ level. If $|t| > t^*$, the correlation is statistically significant.

The key insight: both $r$ and $n$ determine significance. A large $r$ can be non-significant with a very small sample; a tiny $r$ can be significant with a very large sample (though it may be practically meaningless). Always consider both magnitude and sample size.
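The interplay between $r$ and $n$ can be checked numerically. The sketch below assumes the standard test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom. The two scenarios are invented, and the critical values in the comments are the usual two-tailed $\alpha = 0.05$ table entries.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 pairs")
    if abs(r) >= 1:
        raise ValueError("|r| must be below 1 for the statistic to be finite")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Made-up scenario 1: large r, tiny sample (df = 3, table value 3.182).
t_small_sample = t_statistic(0.80, 5)

# Made-up scenario 2: tiny r, large sample (df = 998, table value about 1.96).
t_large_sample = t_statistic(0.10, 1000)

print(f"r = 0.80, n = 5:    t = {t_small_sample:.3f}")
print(f"r = 0.10, n = 1000: t = {t_large_sample:.3f}")
```

In the first scenario $t$ falls short of 3.182, so $r = 0.80$ is not significant with only five pairs. In the second, $t$ clears 1.96 comfortably, so $r = 0.10$ is significant even though it explains only 1% of the variability.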


C9 — Spearman’s Rank Correlation (Overview)

Pearson $r$ assumes both variables are quantitative and that the relationship is approximately linear. When those assumptions are violated, Spearman’s rank correlation $r_s$ is more appropriate. Typical cases: one variable is ordinal (ranked data), or the scatter plot shows a monotone but curved pattern.

$r_s$ is computed by replacing the raw values with their ranks and then applying the Pearson formula to the ranks. It measures whether the variables tend to increase or decrease together monotonically (not necessarily linearly). Strength thresholds are interpreted much as for Pearson $r$.
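The rank-then-Pearson recipe can be sketched as follows. The helper names are hypothetical, ties are handled by the common average-rank convention (an assumption, since the overview does not discuss ties), and the data are an invented monotone but curved example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via the computational (five-sum) form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

def average_ranks(values):
    """Rank values 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average of 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's rank correlation: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotone increasing, but curved

print(f"Pearson r   = {pearson_r(xs, ys):.3f}")
print(f"Spearman rs = {spearman_rs(xs, ys):.3f}")
```

Because $y$ always increases with $x$, the ranks line up perfectly and $r_s = 1$, while Pearson $r$ is pulled below 1 by the curvature.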


C10 — Conditions for Pearson $r$

Before computing or interpreting Pearson $r$, three conditions should be checked:

Conditions for Pearson r

  1. Both variables are quantitative (not categorical or ordinal).
  2. The relationship is approximately linear (check the scatter plot). If the pattern is clearly curved, Pearson $r$ is not the right tool.
  3. No extreme outliers that could distort $r$. A single outlier can dramatically change the value of $r$ in a small dataset.

Practical check: Always plot the data first. A scatter plot takes 30 seconds and immediately reveals whether the conditions hold. Computing $r$ without looking at the scatter plot is like computing a mean without checking whether the distribution is symmetric: you might be using the wrong tool entirely.
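The outlier condition is worth seeing in numbers. In the sketch below, five invented points show almost no linear trend; adding a single invented point far from the cloud changes $r$ drastically. The helper `pearson_r` is written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via the computational (five-sum) form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Illustrative data: a loose cloud with no clear trend.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# The same cloud plus one extreme point at (15, 15).
r_with = pearson_r(xs + [15], ys + [15])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

One point is enough to turn a weak correlation into an apparently strong one, which is why computing $r$ with and without suspicious points is a sensible routine check.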

Section 4: Worked Examples

Four examples with progressively less scaffolding. The first is fully narrated so you can see the complete thought process.

Example 1 — Reading a Scatter Plot Qualitatively (Fully Worked)

Problem: A researcher plots body mass index ($x$-axis, kg/m²) against systolic blood pressure ($y$-axis, mmHg) for a sample of 80 adults. She describes the plot: “Points generally rise from left to right, clustering fairly tightly around a straight line. There are two points in the upper right that sit noticeably above the main cloud.” Describe the scatter plot in the four dimensions of C1 and check whether the conditions for Pearson $r$ are met.

I notice: The description gives me all four pieces of information I need. I’ll address them in order, then run the conditions check.

Direction: “Points generally rise from left to right” → positive direction. As BMI increases, blood pressure tends to increase.

Form: “Clustering fairly tightly around a straight line” → linear form. No mention of curvature.

Strength: “Fairly tight” → moderate to strong. We’d need to compute $r$ to pin this down numerically, but qualitatively, this is not a loose cloud.

Outliers: “Two points in the upper right sit noticeably above the main cloud” → two potential outliers in the upper right region (high BMI, very high blood pressure). These could disproportionately influence $r$.

Conditions check:
  • Both variables (BMI, blood pressure) are quantitative ✓
  • The pattern appears approximately linear ✓ (no curvature mentioned)
  • Extreme outliers: the two upper-right points should be investigated because they could distort $r$. The researcher should compute $r$ with and without those points to assess their influence.

Conclusion: The scatter plot shows a positive, linear, moderate-to-strong relationship with two potential outliers. Pearson $r$ is appropriate, but the two outlying points warrant investigation before reporting a final value.


Example 2 — Computing $r$ and $r^2$ (Partially Scaffolded)

Problem: Five students study different amounts and score accordingly. Compute $r$ and $r^2$ for this dataset, and interpret $r^2$.

Student | Hours studied ($x$) | Test score ($y$)
A | 2 | 64
B | 4 | 70
C | 5 | 75
D | 7 | 80
E | 8 | 86

Before computing: the points trend upward. Do you expect $r$ to be positive or negative? Strong or moderate?

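Step 1: Build the computation table.

$x$ | $y$ | $xy$ | $x^2$ | $y^2$
2 | 64 | 128 | 4 | 4096
4 | 70 | 280 | 16 | 4900
5 | 75 | 375 | 25 | 5625
7 | 80 | 560 | 49 | 6400
8 | 86 | 688 | 64 | 7396
$\sum$: 26 | 375 | 2031 | 158 | 28417

Step 2: Compute the numerator.

$$n\sum xy - \left(\sum x\right)\left(\sum y\right) = 5(2031) - (26)(375) = 10155 - 9750 = 405$$

Step 3: Compute the bracket terms.

$$n\sum x^2 - \left(\sum x\right)^2 = 5(158) - 26^2 = 790 - 676 = 114$$

$$n\sum y^2 - \left(\sum y\right)^2 = 5(28417) - 375^2 = 142085 - 140625 = 1460$$

Step 4: Compute $r$.

$$r = \frac{405}{\sqrt{(114)(1460)}} = \frac{405}{\sqrt{166440}} \approx \frac{405}{407.97} \approx 0.993$$

Step 5: Compute $r^2$ and interpret.

$$r^2 = \frac{405^2}{166440} = \frac{164025}{166440} \approx 0.985$$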

Interpretation: $r \approx 0.993$, an extremely strong positive linear relationship. The linear relationship with hours studied explains approximately 98.5% of the variability in test scores. Only about 1.5% of the variability in scores is unexplained by this linear relationship with study time.

Narration of the C6 trap: Notice that $r \approx 0.99$ does not mean “99% of the variability is explained.” That would be close here ($r^2 \approx 0.985$), but the reasoning requires squaring, and for more moderate values, the difference is substantial.
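The hand calculation for the five students can be double-checked in a few lines of Python. This sketch uses the data from the problem table and assumes the standard five-sum computational form; the variable names are chosen for this example.

```python
from math import sqrt

hours = [2, 4, 5, 7, 8]        # x, from the problem table
scores = [64, 70, 75, 80, 86]  # y, from the problem table
n = len(hours)

# Column sums of the computation table.
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)

numerator = n * sum_xy - sum_x * sum_y
left_bracket = n * sum_x2 - sum_x ** 2
right_bracket = n * sum_y2 - sum_y ** 2

r = numerator / sqrt(left_bracket * right_bracket)
r_squared = r ** 2

print(f"sums: {sum_x}, {sum_y}, {sum_xy}, {sum_x2}, {sum_y2}")
print(f"numerator = {numerator}, brackets = {left_bracket}, {right_bracket}")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Checking the intermediate sums first is the fastest way to locate an arithmetic slip: if the five column totals match the table, any remaining discrepancy must come from the substitution step.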


Example 3 — Testing Significance (Behind Details)

Problem: A study of pairs of variables yields . At (two-tailed), is this correlation statistically significant?

Setup: We use the formula $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ with $df = n - 2$.

At , the critical value for a two-tailed test at is (from the -table).

Solution

Compute :

Decision: . We reject the null hypothesis that the population correlation is zero.

Conclusion: The correlation is statistically significant at the chosen $\alpha$ level. The relationship is unlikely to be a sampling artifact. (Note: $r^2 \approx 0.55$, meaning the linear relationship explains 55% of variability in $y$, a substantial but not overwhelming amount.)


Example 4 — Causation Claim from Correlation (Application Twist)

Problem: A news article reports: “A study of 500 teens found that those who spend more time on social media have lower grades. The correlation was (). Researchers conclude that social media use causes lower academic performance.”

Identify the error in the researchers’ conclusion and propose two alternative explanations.

The flawed analysis:

“The correlation was , statistically significant at . Therefore, social media causes lower grades.”

What the correlation actually shows: There is a moderate negative linear association between social media use and grades in this sample of 500 teens. The association is statistically significant — meaning it is unlikely to be due to chance sampling variation.

What the correlation does NOT show: It does not rule out confounders or reverse causation.

Error identified: The researchers committed the correlation-causation error (C7). Statistical significance rules out chance — it does not rule out confounding or reverse causation. They need experimental evidence (random assignment to social media conditions) to support a causal claim.

Two alternative explanations:

  1. Reverse causation: Students with lower grades may turn to social media as an escape from academic stress — grades cause social media use, not the other way around.
  2. Confounding variable: Students with less parental supervision may have both higher social media use and lower academic support, producing the correlation without any direct causal link between screens and grades.

Section 5: Guided Practice

Four problems to build fluency.

Problem 1 — Direction and Strength from $r$ (C1, C3)

A researcher reports between hours of study per week and final exam score.

(a) What is the direction of the relationship?

(b) How strong is the relationship?

A study finds between outdoor temperature and hot beverage sales.

(a) What is the direction of the relationship?

(b) How strong is the relationship?

A study reports between shoe size and vocabulary test score among adults.

(a) What is the direction of the relationship?

(b) How strong is the relationship?

Researchers find between daily exercise minutes and resting heart rate.

(a) What is the direction of the relationship?

(b) How strong is the relationship?

A sleep researcher finds between hours of sleep per night and daytime productivity score.

(a) What is the direction of the relationship?

(b) How strong is the relationship?


Problem 2 — Computing and Interpreting $r^2$ (C2, C5, C6)

A study finds between height ($x$) and weight ($y$).

(a) What is $r^2$? (Round to 2 decimal places.)

(b) Which statement correctly interprets $r^2$?

A study finds between outdoor temperature ($x$) and coffee sales ($y$).

(a) What is $r^2$? (Round to 2 decimal places.)

(b) Which statement correctly interprets $r^2$?

A study finds between class attendance ($x$) and GPA ($y$).

(a) What is $r^2$? (Round to 2 decimal places.)

(b) Which statement correctly interprets $r^2$?

A study finds between exercise frequency ($x$) and resting heart rate ($y$).

(a) What is $r^2$? (Round to 2 decimal places.)

(b) Which statement correctly interprets $r^2$?

A study finds between sleep hours ($x$) and productivity score ($y$).

(a) What is $r^2$? (Round to 2 decimal places.)

(b) Which statement correctly interprets $r^2$?


Problem 3 — Correlation vs. Causation Scenarios (C4, C7)

For each scenario below, answer both questions.

Scenario 1: A researcher finds a positive correlation between the number of firefighters at a fire scene ($x$) and the amount of property damage ($y$). A newspaper concludes: “Firefighters cause damage: the more you send, the worse the destruction.”

(a) Does this correlation imply causation?

(b) What is the most likely explanation?


Scenario 2: A study finds a negative correlation between physical activity level ($x$) and depression score ($y$). The headline: “Exercise cures depression.”

(a) Does this correlation imply causation?

(b) Name one plausible alternative explanation.


Scenario 3: A teacher finds a correlation between pencil length (worn down, $x$) and exam grades ($y$). Students who have used their pencils more have higher grades.

(a) Does this correlation imply causation?


Problem 4 — Compute and Interpret (Generator)

Section 6: Independent Practice

Five problems, no guided scaffolding. Concepts are interleaved. Check solutions after completing each problem on your own.

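Problem 1 — Compute $r$ from Data (C1, C2, C3, C10)

A student records study hours ($x$) and quiz scores ($y$) for 5 quizzes:

$x$ | 1 | 2 | 3 | 4 | 5
$y$ | 60 | 52 | 68 | 70 | 78

Use the computational formula to compute $r$. Then answer the three questions below.

Pre-computed sums: $\sum x = 15$, $\sum y = 328$, $\sum xy = 1038$, $\sum x^2 = 55$, $\sum y^2 = 21912$.

(a) What is the direction of the relationship?

(b) What is the strength of the relationship?

(c) Which condition check statement is false?

Solution

Numerator: $5(1038) - (15)(328) = 5190 - 4920 = 270$

Left bracket: $5(55) - 15^2 = 275 - 225 = 50$

Right bracket: $5(21912) - 328^2 = 109560 - 107584 = 1976$

$$r = \frac{270}{\sqrt{(50)(1976)}} = \frac{270}{\sqrt{98800}} \approx 0.859$$

Direction: positive. Strength: strong ($r \approx 0.86$). The false statement is (b): the relationship does appear linear for these data, so Pearson $r$ is appropriate.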

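A horticulturist applies different fertilizer amounts ($x$, grams) and records plant heights ($y$, cm):

$x$ | 2 | 4 | 6 | 8 | 10
$y$ | 85 | 72 | 78 | 62 | 60

Pre-computed sums: $\sum x = 30$, $\sum y = 357$, $\sum xy = 2022$, $\sum x^2 = 220$, $\sum y^2 = 25937$.

(a) What is the direction of the relationship?

(b) What is the strength of the relationship?

(c) Which condition check statement is false?

Solution

Numerator: $5(2022) - (30)(357) = 10110 - 10710 = -600$

Left bracket: $5(220) - 30^2 = 1100 - 900 = 200$

Right bracket: $5(25937) - 357^2 = 129685 - 127449 = 2236$

$$r = \frac{-600}{\sqrt{(200)(2236)}} = \frac{-600}{\sqrt{447200}} \approx -0.897$$

Strong negative relationship. The false statement is (b): the relationship does appear approximately linear.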

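A researcher records daily screen time ($x$, hrs) and sleep quality score ($y$, 0–100):

$x$ | 1 | 2 | 3 | 4 | 5
$y$ | 75 | 72 | 65 | 68 | 52

Pre-computed sums: $\sum x = 15$, $\sum y = 332$, $\sum xy = 946$, $\sum x^2 = 55$, $\sum y^2 = 22362$.

(a) What is the direction of the relationship?

(b) What is the strength of the relationship?

(c) Which condition check statement is false?

Solution

Numerator: $5(946) - (15)(332) = 4730 - 4980 = -250$

Left bracket: $5(55) - 15^2 = 275 - 225 = 50$

Right bracket: $5(22362) - 332^2 = 111810 - 110224 = 1586$

$$r = \frac{-250}{\sqrt{(50)(1586)}} = \frac{-250}{\sqrt{79300}} \approx -0.888$$

Strong negative relationship. The false statement is (b).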

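A cardiologist records weekly exercise ($x$, hrs) and resting heart rate ($y$, bpm) for 5 patients:

$x$ | 1 | 3 | 4 | 6 | 8
$y$ | 80 | 68 | 76 | 62 | 60

Pre-computed sums: $\sum x = 22$, $\sum y = 346$, $\sum xy = 1440$, $\sum x^2 = 126$, $\sum y^2 = 24244$.

(a) What is the direction of the relationship?

(b) What is the strength of the relationship?

(c) Which condition check statement is false?

Solution

Numerator: $5(1440) - (22)(346) = 7200 - 7612 = -412$

Left bracket: $5(126) - 22^2 = 630 - 484 = 146$

Right bracket: $5(24244) - 346^2 = 121220 - 119716 = 1504$

$$r = \frac{-412}{\sqrt{(146)(1504)}} = \frac{-412}{\sqrt{219584}} \approx -0.879$$

Strong negative relationship. The false statement is (b).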

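A researcher records age ($x$, years) and reaction time ($y$, ms) for 5 participants:

$x$ | 20 | 30 | 40 | 50 | 60
$y$ | 210 | 225 | 240 | 280 | 260

Pre-computed sums: $\sum x = 200$, $\sum y = 1215$, $\sum xy = 50150$, $\sum x^2 = 9000$, $\sum y^2 = 298325$.

(a) What is the direction of the relationship?

(b) What is the strength of the relationship?

(c) Which condition check statement is false?

Solution

Numerator: $5(50150) - (200)(1215) = 250750 - 243000 = 7750$

Left bracket: $5(9000) - 200^2 = 45000 - 40000 = 5000$

Right bracket: $5(298325) - 1215^2 = 1491625 - 1476225 = 15400$

$$r = \frac{7750}{\sqrt{(5000)(15400)}} = \frac{7750}{\sqrt{77000000}} \approx 0.883$$

Strong positive relationship. The false statement is (b).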


Problem 2 — Classify and Compute (Generator)


Problem 3 — Find the Error (C4, C5, C6)

A researcher computes $r \approx 0$ between daily calorie intake and creativity score. They write: “The Pearson correlation is $r \approx 0$, confirming that there is no relationship between diet and creativity.”

What is the researcher’s error?

Solution

Correct Answer: $r \approx 0$ only rules out a linear relationship; a strong non-linear relationship could still exist.

Explanation: Pearson correlation measures only the strength and direction of a linear relationship. If the relationship is non-linear (e.g., quadratic or U-shaped), $r$ can be close to 0 even if the two variables are perfectly related. A scatter plot must always be inspected first.

A researcher computes $r = 0.70$ and reports: “Sleep duration explains 70% of the variability in productivity.”

What is the researcher’s error?

Solution

Correct Answer: The researcher used $r$ directly instead of squaring it: $r^2 = 0.49$, so 49%, not 70%.

Explanation: To describe the proportion of explained variability, we must use the coefficient of determination $r^2$, not the correlation coefficient $r$ itself. Squaring $r = 0.70$ yields $r^2 = 0.49$.

A researcher finds a strong positive correlation between shoe size ($x$) and reading ability ($y$) in children aged 5–12. They conclude: “Larger feet cause better reading ability.”

What is the researcher’s error?

Solution

Correct Answer: Age is a confounding variable: both shoe size and reading ability increase as children grow older.

Explanation: Correlation does not imply causation. In this sample of children, age is a common cause (confounding variable) that drives both growth in shoe size and improvement in reading skills, creating a strong association without any direct causal link.

A researcher computes $r = -0.63$ between advertising spend ($x$) and customer complaint rate ($y$). They report: “The variables are negatively correlated, which means advertising spend explains 63% of the variability in complaint rate.”

What is the researcher’s error?

Solution

Correct Answer: The researcher used $|r|$ directly instead of $r^2$.

Explanation: The proportion of explained variance is given by $r^2$. Squaring $r = -0.63$ yields $r^2 \approx 0.40$, so about 40%, not 63%. The negative sign is eliminated by squaring.

A researcher states: “Since $r = 0$ exactly, the two variables are completely independent.”

What is the researcher’s error?

Solution

Correct Answer: $r = 0$ means no linear relationship, but a perfect non-linear relationship (such as a U-curve) can still yield $r = 0$.

Explanation: A correlation coefficient of exactly 0 only rules out any linear association. The two variables could still be strongly dependent in a non-linear way.


Problem 4 — Interpretation and Causation (Generator)


Problem 5 — Multi-Step Synthesis (C2, C5, C6, C7, C8, C10)

A nutritionist collects data from $n = 10$ adults on daily vegetable intake ($x$, servings/day) and a composite health score ($y$, 0–100):

$x$ | 1 | 2 | 2 | 3 | 4 | 4 | 5 | 5 | 6 | 7
$y$ | 45 | 50 | 48 | 58 | 65 | 62 | 70 | 74 | 78 | 82

Pre-computed sums: $\sum x = 39$, $\sum y = 632$, $\sum xy = 2685$, $\sum x^2 = 185$, $\sum y^2 = 41446$.

(a) Compute $r$ using the computational formula.

(b) Compute $r^2$ and interpret it in the context of this study using the correct phrasing.

(c) The nutritionist concludes: “Eating more vegetables causes better health.” Identify two reasons why this conclusion may be premature.

(d) Check the conditions for computing Pearson $r$ in this study. Identify any condition that may be difficult to verify with only summary statistics.

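Solution

(a) Computing $r$:

Numerator: $10(2685) - (39)(632) = 26850 - 24648 = 2202$

Left bracket: $10(185) - 39^2 = 1850 - 1521 = 329$

Right bracket: $10(41446) - 632^2 = 414460 - 399424 = 15036$

$$r = \frac{2202}{\sqrt{(329)(15036)}} = \frac{2202}{\sqrt{4946844}} \approx 0.990$$

(b) Computing $r^2$:

$$r^2 = \frac{2202^2}{4946844} = \frac{4848804}{4946844} \approx 0.980$$

Interpretation: The linear relationship with vegetable intake explains approximately 98% of the variability in composite health score. The remaining 2% of variability is not accounted for by this linear relationship.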

(c) Two reasons the causal conclusion is premature:

  1. Causation is not established by correlation. The study is observational — there was no random assignment. The correlation shows association, not that vegetables cause health improvement.
  2. Plausible confounders exist. People who eat more vegetables may also exercise more, sleep better, have higher incomes (better healthcare access), or have generally healthier lifestyles. Any of these could explain the health score differences. Alternatively, reverse causation is possible: people who are healthier (for other reasons) may have more energy to prepare vegetables.

(d) Condition check:

  • Both variables are quantitative ✓ (servings/day and a 0–100 composite score)
  • Linearity should be checked via scatter plot ✓ — cannot verify from summary statistics alone; we assume the pattern is roughly linear based on the data range
  • No extreme outliers visible from the data values ✓ — the values increase roughly monotonically
  • Difficult to verify from summary statistics: Independence of observations requires that the 10 adults are unrelated and that no repeated measures were taken on the same individual. This cannot be verified from the sums alone; it depends on the study design.

Also note: $n = 10$ is very small. The significance test ($df = n - 2 = 8$) indicates the correlation is statistically significant, but with small $n$, the reliability of the scatter plot assessment (linearity, outliers) is limited.


Mixed Review — Retrieval from Earlier Lessons

These problems draw on concepts from earlier in the course. Attempting them without re-reading prior lessons is the point — retrieval practice strengthens long-term memory more than re-reading.

Review Problem 1 — Z-scores and the Empirical Rule (DS-5)

A clinical psychologist administers a standardized anxiety scale to a population of college students. The scale is normally distributed with $\mu = 50$ and $\sigma = 10$.

(a) A student scores 73. Compute her z-score and interpret it in context.

(b) What percentage of students score between 40 and 70? Use the Empirical Rule.

(c) A counsellor flags students in the top 2.5% for follow-up. What is the minimum score that triggers a flag?

Solution

(a) $z = \dfrac{73 - 50}{10} = 2.3$

This student’s score is 2.3 standard deviations above the population mean. Fewer than 2% of students score this high on the anxiety scale.

(b) The Empirical Rule states that approximately 95% of values in a normal distribution fall within 2 standard deviations of the mean. Here:

  • $40 = \mu - 1\sigma$ and $70 = \mu + 2\sigma$

Scores between 40 and 50 span 1 SD below the mean → about 34% of students. Scores between 50 and 70 span 2 SDs above the mean → about 47.5% of students.

Total: approximately 81.5% of students score between 40 and 70.

(Note: The Empirical Rule is symmetric. The 40–70 range is not symmetric around $\mu = 50$, so we compute each half separately: 50 − 40 = 1 SD gives 34%; 70 − 50 = 2 SDs gives 47.5%.)

(c) The top 2.5% corresponds to the region beyond $z = +2$ in a normal distribution (since 95% fall within ±2 SDs, leaving 2.5% in each tail).

Minimum flagged score: $\mu + 2\sigma = 50 + 2(10) = 70$.
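The Empirical Rule answers can be compared with exact normal-curve values using Python's standard library. This sketch takes $\mu = 50$ and $\sigma = 10$, the values implied by the 40–50 and 50–70 spans in the solution; the variable names are chosen for this example.

```python
from statistics import NormalDist

anxiety_scale = NormalDist(mu=50, sigma=10)  # parameters implied by the worked solution

# Part (a): z-score for a raw score of 73.
z = (73 - anxiety_scale.mean) / anxiety_scale.stdev

# Part (b): exact share of scores between 40 and 70.
share_40_to_70 = anxiety_scale.cdf(70) - anxiety_scale.cdf(40)

# Part (c): exact score that cuts off the top 2.5%.
cutoff_top_2_5 = anxiety_scale.inv_cdf(0.975)

# Share of students scoring above 73.
share_above_73 = 1 - anxiety_scale.cdf(73)

print(f"z = {z:.1f}")
print(f"share between 40 and 70 = {share_40_to_70:.4f}")
print(f"top 2.5% cutoff = {cutoff_top_2_5:.1f}")
print(f"share above 73 = {share_above_73:.4f}")
```

The exact figures sit very close to the Empirical Rule approximations: the 40–70 share is within half a percentage point of 81.5%, and the exact cutoff is slightly below 70 because the rule rounds $z = 1.96$ up to 2.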


Review Problem 2 — Stating Hypotheses Correctly (INF-5)

A public health researcher wants to test whether the average resting heart rate of adults who work night shifts differs from the general population mean of 72 bpm. She collects data from 40 night-shift workers.

(a) State $H_0$ and $H_a$ using correct statistical notation.

(b) She obtains a $p$-value below $\alpha = 0.05$. State the decision and write a one-sentence conclusion in context. Avoid causal language.

(c) Identify which type of error she would be making if she rejects $H_0$ but the true population mean for night-shift workers is actually 72 bpm.

Solution

(a) $H_0: \mu = 72$ versus $H_a: \mu \neq 72$

The two-tailed form is correct here because the researcher asked whether the mean “differs,” not specifically whether it is higher or lower.

(b) Since $p < \alpha = 0.05$, we reject $H_0$.

Conclusion: “There is sufficient evidence at the 0.05 significance level that the mean resting heart rate of night-shift workers differs from the general population mean of 72 bpm.”

Note: this conclusion reports an association. It does not claim that night-shift work causes a different heart rate.

(c) This is a Type I error (false positive): rejecting $H_0$ when $H_0$ is actually true. The probability of making this error is $\alpha = 0.05$, meaning that when the null hypothesis holds, there is a 5% chance of obtaining results extreme enough to reject it.

Section 7: Mastery Check

No hints. No guided steps. Three questions to assess whether the core ideas have landed. Take your time — especially the Feynman test.

Question 1 — Feynman Test

A classmate says: “I don’t see why we need $r^2$ at all. If $r = 0.80$ is a strong correlation, doesn’t that tell us everything? Why bother squaring it?”

Explain to your classmate why $r^2$ is necessary and what it adds that $r$ alone does not provide. Use specific numbers to make your point concrete.

Show a model answer

$r$ tells you direction and strength on a standardized scale, but it doesn’t have a direct “percentage” meaning. $r^2$ does. If $r = 0.80$, you might be tempted to say “x explains 80% of y,” but that is wrong. $r^2 = 0.80^2 = 0.64$, so x actually explains 64% of the variability in y, not 80%.

The difference matters. For $r = 0.70$: $r^2 = 0.49$, just under half. For $r = 0.50$: $r^2 = 0.25$, only a quarter. The squared value shows that even “moderate” correlations leave most of the variability unexplained. $r$ creates an impression of strength that $r^2$ corrects: a “strong” correlation can still leave 36% of the variation in y unexplained.

Put simply: $r$ describes the direction and strength of the linear trend; $r^2$ tells you how much predictive power that trend actually carries.
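To make the shrinkage concrete, here is a minimal Python sketch for correlations of 0.80, 0.70, and 0.50, values chosen to match the percentages quoted in the model answer (64%, just under half, a quarter). The helper name `percent_explained` is illustrative only.

```python
# Squaring r shrinks every value strictly between 0 and 1.
r_values = [0.80, 0.70, 0.50]  # correlations matching the percentages in the model answer


def percent_explained(r):
    """Return r^2 expressed as a percentage of variability explained."""
    return 100 * r ** 2


for r in r_values:
    explained = percent_explained(r)
    print(f"r = {r:.2f} -> r^2 = {r ** 2:.2f} "
          f"({explained:.0f}% explained, {100 - explained:.0f}% unexplained)")
```

Reading down the output, the gap between $r$ and $r^2$ widens as $r$ moves away from 1, which is why a correlation that sounds strong can still leave a large share of the variability unexplained.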


Question 2 — Apply

A study of college students finds $r = -0.68$ between weekly exercise time ($x$, hours) and self-reported stress level ($y$, scale 1–10, where 10 is highest stress).

(a) What is $r^2$? Select the correct value.

(b) What is the direction and strength of the relationship?

(c) Compute the $t$-statistic for significance (, two-tailed , ). Is $r$ significant?

Show Solution

. The correlation is statistically significant at .

Interpretation: $r = -0.68$ (negative direction, moderate strength) is statistically significant for this sample. The linear relationship between exercise time and stress level is unlikely to be a sampling artifact. $r^2 \approx 0.46$ means the linear relationship with exercise time accounts for approximately 46% of the variability in self-reported stress.


Question 3 — Error Analysis

A student analyzes data on monthly rainfall ($x$) and crop yield ($y$) and computes $r \approx 0$.

The student writes: “Since $r \approx 0$, I can conclude that rainfall and crop yield have absolutely no relationship. Farmers do not need to worry about rainfall levels to predict their harvest.”

Identify the error in this conclusion.

Show Solution and Full Analysis

The error: The student committed P1, treating $r \approx 0$ as proof of no relationship. Pearson $r$ only measures the linear component of association. Rainfall and crop yield almost certainly have a non-linear relationship: too little rain means drought (low yield), optimal rain means good yield, and too much rain means flooding (again low yield). This inverted-U relationship can produce $r \approx 0$ even when the relationship is strong and biologically meaningful.

What the student should have done: Plot the data first. If the scatter plot shows a non-linear pattern (e.g., a curve), Pearson $r$ is the wrong tool. In this context, a scatter plot would almost certainly reveal that the relationship is non-linear, making a non-linear regression model more appropriate. (Spearman’s rank correlation would not rescue the analysis here, because it measures monotonic association and an inverted-U is not monotonic.)

Correct restatement: “$r \approx 0$ indicates no significant linear relationship between rainfall and crop yield. However, the scatter plot should be examined for non-linear patterns before concluding that rainfall is unrelated to yield.”
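The inverted-U argument can be checked numerically. The sketch below uses made-up rainfall and yield values (not data from the problem) in which yield depends perfectly on rainfall through a symmetric curve; `pearson_r` is a small illustrative helper, not a library function.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: rainfall in cm; yield peaks at 50 cm and falls off on both sides
rainfall = [10, 20, 30, 40, 50, 60, 70, 80, 90]
crop_yield = [100 - 0.05 * (x - 50) ** 2 for x in rainfall]

r = pearson_r(rainfall, crop_yield)
print(f"r = {r:.3f}")  # essentially zero despite a perfect (curved) dependence
```

Every yield value here is determined exactly by rainfall, yet the rising half and the falling half cancel in the linear summary, so $r$ comes out at essentially zero.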


Self-Assessment

How confident are you with the core concepts from this lesson?

Scale: “Still confused” to “Ready for the Boss Fight”

If your confidence is below 60%, revisit Section 3 (especially C2, C5, C6, C7) and re-work Examples 2 and 3 before attempting the Boss Fight. The Boss Fight requires all core concepts working together.

Section 8: Boss Fight

Two paths. Same difficulty level. Different thinking styles. Choose the one that feels more natural — both paths require every concept from this lesson.

📊 The Analyst

You have data. Compute r, assess r², check conditions, and test significance. Make a data-driven recommendation.

🔍 The Investigator

You have a headline claiming causation. Evaluate the claim, identify confounders, and explain what stronger evidence would look like.

📊 Path A: The Analyst — Screen Time and Academic Performance

A school board collects data from $n = 8$ students on average daily screen time ($x$, hours) and end-of-semester average grade ($y$, out of 100):

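| Student | $x$ (hours) | $y$ (grade) |
|---|---|---|
| 1 | 1.5 | 88 |
| 2 | 2.0 | 85 |
| 3 | 3.0 | 78 |
| 4 | 3.5 | 80 |
| 5 | 4.5 | 72 |
| 6 | 5.0 | 68 |
| 7 | 6.0 | 65 |
| 8 | 7.5 | 60 |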

Pre-computed sums: $\sum x = 33.0$, $\sum y = 596$, $\sum xy = 2320$, $\sum x^2 = 165.0$, $\sum y^2 = 45{,}086$.

Task 1. Compute $r$ using the computational formula. Show your computation table and all intermediate steps.

Show Solution — Task 1

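Computation table:

| Student | $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| 1 | 1.5 | 88 | 132.0 | 2.25 | 7744 |
| 2 | 2.0 | 85 | 170.0 | 4.00 | 7225 |
| 3 | 3.0 | 78 | 234.0 | 9.00 | 6084 |
| 4 | 3.5 | 80 | 280.0 | 12.25 | 6400 |
| 5 | 4.5 | 72 | 324.0 | 20.25 | 5184 |
| 6 | 5.0 | 68 | 340.0 | 25.00 | 4624 |
| 7 | 6.0 | 65 | 390.0 | 36.00 | 4225 |
| 8 | 7.5 | 60 | 450.0 | 56.25 | 3600 |
| Sum | 33.0 | 596 | 2320.0 | 165.00 | 45,086 |

Numerator: $n\sum xy - (\sum x)(\sum y) = 8(2320) - (33.0)(596) = 18{,}560 - 19{,}668 = -1108$

Left bracket: $n\sum x^2 - (\sum x)^2 = 8(165.0) - 33.0^2 = 1320 - 1089 = 231$

Right bracket: $n\sum y^2 - (\sum y)^2 = 8(45{,}086) - 596^2 = 360{,}688 - 355{,}216 = 5472$

$r = \dfrac{-1108}{\sqrt{(231)(5472)}} = \dfrac{-1108}{\sqrt{1{,}264{,}032}} \approx \dfrac{-1108}{1124.29} \approx -0.986$

This indicates a very strong negative linear relationship, which is plausible for a small constructed dataset.

Direction: Negative (more screen time goes with lower grades). Strength: Very strong ($|r| \approx 0.99$).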
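The computational formula can be checked with a few lines of Python. This is a minimal sketch that assumes each row of the data table reads as student number, hours of screen time, then grade; the variable names are illustrative.

```python
import math

# Path A data: (screen time in hours, grade out of 100) for the 8 students
data = [(1.5, 88), (2.0, 85), (3.0, 78), (3.5, 80),
        (4.5, 72), (5.0, 68), (6.0, 65), (7.5, 60)]

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_y2 = sum(y * y for _, y in data)

# Computational formula: numerator over the square root of the two brackets
numerator = n * sum_xy - sum_x * sum_y
left = n * sum_x2 - sum_x ** 2
right = n * sum_y2 - sum_y ** 2
r = numerator / math.sqrt(left * right)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"numerator = {numerator}, left = {left}, right = {right}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Under that reading of the table, the sums are 33.0, 596, 2320.0, 165.0, and 45086, and $r$ comes out at about $-0.986$, comfortably inside the required range of $-1$ to $1$.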


Task 2. Compute $r^2$ and provide the correct verbal interpretation for the school board.

Show Solution — Task 2

$r \approx -0.986$ (computed from the data table), so $r^2 \approx (-0.986)^2 \approx 0.971$.

Interpretation for the school board: The linear relationship with daily screen time explains approximately 97% of the variability in end-of-semester grades for this sample of 8 students. In practice, a near-perfect correlation in such a small sample warrants caution (see Task 3).


Task 3. Check the three conditions for Pearson $r$. Are there any concerns?

Show Solution — Task 3
  1. Both variables are quantitative: Hours of screen time and grade (0–100) are both quantitative ✓
  2. Approximately linear relationship: With $|r|$ near 1, the relationship appears strongly linear. Ideally we would plot the data to confirm there is no curvature, but the strong linear $r$ is consistent with linear form ✓
  3. No extreme outliers: With only $n = 8$ points, any single unusual value could have an outsized effect on $r$. The data show a steady downward trend and no point appears dramatically far from the pattern, but the tiny sample makes it difficult to be certain ✓ (with caveat)

Concern: $n = 8$ is very small. Even a very strong $r$ can be non-significant with small $n$, and the conditions check (especially for outliers) is hard to assess reliably with only 8 observations.


Task 4. Compute the $t$-statistic for significance at $\alpha = 0.05$ (two-tailed), $df = n - 2 = 6$, $t^* = 2.447$. Is the correlation significant? What recommendation would you give the school board?

Show Solution — Task 4

Using $r \approx -0.986$ (computed from the data table) with $n = 8$:

$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \dfrac{-0.986\sqrt{6}}{\sqrt{1 - 0.971}} \approx -14.2$

Since $|t| \approx 14.2 > t^* = 2.447$, the correlation is statistically significant at $\alpha = 0.05$.

Recommendation to the school board: The data show a very strong, statistically significant negative linear relationship between daily screen time and grades ($r \approx -0.986$, $r^2 \approx 0.971$, significant at $\alpha = 0.05$ with $n = 8$). However, before concluding that screen time causes lower grades, the school board should note: (1) this is a very small sample ($n = 8$); (2) correlation does not establish causation, since confounders such as family support, study habits, and sleep quality could drive both variables; (3) a larger observational or experimental study would be needed to support any policy recommendation.


🔍 Path B: The Investigator — A News Study Claims Causation

A news article reports the following:

“A new study of 120 high school students finds $r = 0.79$ ($p < 0.001$) between the number of books in the home ($x$) and the student’s GPA ($y$). Researchers conclude: ‘Having more books in the home causes students to achieve higher grades. Schools should encourage parents to fill their homes with books.’”

Task 1. Compute $r^2$ for this study and interpret it correctly. What does this tell us about the relationship?

Show Solution — Task 1

$r^2 = (0.79)^2 \approx 0.62$

Interpretation: The linear relationship between number of books in the home and student GPA explains approximately 62% of the variability in GPA for this sample. The remaining 38% of variability is not accounted for by the number of books.

This is a moderately strong result: the number of books (or whatever books are a proxy for) is associated with a substantial portion of the variation in grades. But 38% is unexplained, pointing to other important factors.


Task 2. The researchers conclude causation. Identify two specific, plausible confounding variables that could explain why homes with more books tend to produce higher-achieving students — without books themselves causing the improvement.

Show Solution — Task 2
  1. Parental education level: Parents with higher education levels tend to have more books AND to provide more academic support, higher expectations, and better study environments for their children. Both “books in the home” and “GPA” are downstream effects of parental education.

  2. Socioeconomic status (SES): Wealthier families can afford more books, better schools, tutors, more study space, and better nutrition — all of which contribute to academic performance. Books are a proxy for broader advantages, not the active ingredient.

Other valid confounders: reading culture in the home, parental time spent on homework help, access to quiet study space, school quality in wealthier neighborhoods.
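A small simulation shows how a confounder alone can manufacture a strong correlation. Everything in this sketch is hypothetical: the standardized "parental education" score, the noise levels, and the sample size are invented for illustration and are not taken from the study. Removing the confounder by simple subtraction works here only because the simulation gives it a coefficient of exactly 1.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(42)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical confounder: a standardized parental-education score
education = [random.gauss(0, 1) for _ in range(n)]

# Books and GPA each depend on education plus independent noise.
# Neither variable appears in the other's formula.
books = [z + random.gauss(0, 0.5) for z in education]
gpa = [z + random.gauss(0, 0.5) for z in education]

r_raw = pearson_r(books, gpa)

# Subtract the confounder's contribution and correlate what is left
books_resid = [b - z for b, z in zip(books, education)]
gpa_resid = [g - z for g, z in zip(gpa, education)]
r_adjusted = pearson_r(books_resid, gpa_resid)

print(f"raw r = {r_raw:.2f}, r after removing the confounder = {r_adjusted:.2f}")
```

The raw correlation is strong even though books never influence GPA in the simulation, and it collapses toward zero once the shared cause is taken out. Real data would need a regression-based adjustment rather than plain subtraction.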


Task 3. The researchers want to strengthen their causal claim. Describe what kind of study design would be needed, and what practical and ethical challenges it would face.

Show Solution — Task 3

To establish causation, the researchers would need a randomized controlled experiment:

  • Randomly assign families (or schools) to receive a large number of books (treatment) vs. no additional books (control)
  • Hold all other factors constant, or let randomization balance them across the two groups
  • Measure GPA over a sustained period (at least one academic year)

Practical challenges:

  • It is difficult to prevent families in the control group from acquiring books independently
  • Measuring “books in the home” at a single point doesn’t capture reading habits, how books are used, or parental involvement
  • The effect, if real, might take years to manifest in GPA changes

Ethical challenges:

  • Withholding books (a potentially beneficial resource) from some families raises ethical concerns
  • Families assigned to the control group might feel disadvantaged

Conclusion: Whatever the ethical and practical challenges of running an experiment, the observational correlation ($r = 0.79$) alone cannot establish that books cause higher GPA. The study should be reported as showing a strong association, not causation.


Task 4. Write a corrected one-paragraph conclusion that the researchers could have published — one that reports the correlation accurately and responsibly without overclaiming causation.

Show a model answer for Task 4

“In a sample of 120 high school students, we found a strong positive linear association between the number of books in the home and student GPA ($r = 0.79$, $p < 0.001$). The linear relationship with number of books explains approximately 62% of the variability in GPA across students ($r^2 \approx 0.62$). However, this association does not establish that books cause higher grades. Plausible confounding variables, including parental education level and socioeconomic status, could produce this correlation without books being the active ingredient. To test a causal claim, a randomized experiment providing books to randomly selected families would be necessary. Until such evidence is available, this finding suggests that homes with more books are associated with higher academic achievement, and warrants further investigation rather than immediate policy recommendations.”

Section 9: Challenge Problems

Optional stretch problems that go beyond the lesson objectives. Ready for more? These problems deepen your understanding of what $r^2$ actually means, how sample size changes everything, and why the same number can describe completely different situations.

Challenge 1 — $r^2$ and Unexplained Variance (C5, C6)

Two variables have $r = 0.50$. A student writes: “Since $r^2 = 0.25$, that means 75% of the variability is random noise.”

(a) Is the student right that 75% is “unexplained”?

(b) What might the 75% “unexplained variance” actually represent?

(c) Write a 2–3 sentence explanation of what “unexplained variance” in $r^2$ actually means, and what it does not mean.

Show a model answer

When $r^2 = 0.25$, the linear relationship between $x$ and $y$ accounts for 25% of the observed spread in $y$-values. The remaining 75% is variance that is not predicted by a straight-line relationship with $x$, but this does not mean it is random. That 75% could be explained by other variables that were not measured, by a non-linear relationship between $x$ and $y$ (which $r$ cannot detect), or by genuine individual variation that is unpredictable at the individual level but not “noise” in any physical sense. Calling unexplained variance “noise” implies we know what’s causing it, and we don’t. We only know that our specific linear model with $x$ doesn’t account for it.


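Challenge 2 — Effect of $n$ on Significance (C8)

The formula $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ shows that both $r$ and $n$ determine whether a correlation is statistically significant. Fill in the table using $t^* = 2.306$ for $df = 8$ and $t^* = 2.042$ for $df = 30$ at $\alpha = 0.05$ (two-tailed):

| $r$ | $n$ | $df$ | $t$ | Significant at $\alpha = 0.05$? |
|---|---|---|---|---|
| 0.90 | 5 | 3 | ? | ? |
| 0.90 | 10 | 8 | ? | ? |
| 0.30 | 32 | 30 | ? | ? |
| 0.30 | 10 | 8 | ? | ? |

For $df = 3$, use $t^* = 3.182$.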

Show Solution

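Row 1: $r = 0.90$, $n = 5$, $df = 3$: $t = \dfrac{0.90\sqrt{3}}{\sqrt{1 - 0.81}} \approx 3.58 > 3.182$

Significant

Row 2: $r = 0.90$, $n = 10$, $df = 8$: $t = \dfrac{0.90\sqrt{8}}{\sqrt{1 - 0.81}} \approx 5.84 > 2.306$

Significant

Row 3: $r = 0.30$, $n = 32$, $df = 30$: $t = \dfrac{0.30\sqrt{30}}{\sqrt{1 - 0.09}} \approx 1.72 < 2.042$

Not significant

Row 4: $r = 0.30$, $n = 10$, $df = 8$: $t = \dfrac{0.30\sqrt{8}}{\sqrt{1 - 0.09}} \approx 0.89 < 2.306$

Not significant

Follow-up lesson: $r = 0.90$ was significant even with $n = 5$, but $r = 0.30$ was not significant even with $n = 32$. Statistical significance depends on both the magnitude of $r$ and the sample size. A small $r$ with a large $n$ can be statistically real but practically tiny; a large $r$ with a small $n$ can be significant but based on few observations. Always report both $r$ and $n$.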
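The four rows can be checked with a short script. This sketch reads the table rows as $(r, n)$ = (0.90, 5), (0.90, 10), (0.30, 32), (0.30, 10) and uses the standard two-tailed critical values for $\alpha = 0.05$ at $df$ = 3, 8, and 30; the function and variable names are illustrative.

```python
import math


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Two-tailed critical values at alpha = 0.05, keyed by degrees of freedom
T_CRITICAL = {3: 3.182, 8: 2.306, 30: 2.042}

rows = [(0.90, 5), (0.90, 10), (0.30, 32), (0.30, 10)]
results = []
for r, n in rows:
    df = n - 2
    t = t_statistic(r, n)
    significant = abs(t) > T_CRITICAL[df]
    results.append((r, n, df, t, significant))
    verdict = "significant" if significant else "not significant"
    print(f"r = {r:.2f}, n = {n:2d}, df = {df:2d}, t = {t:.2f} -> {verdict}")
```

Holding $r$ fixed and changing only $n$ in `rows` is a quick way to see how the same correlation moves across the significance threshold as the sample grows.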


Challenge 3 — Anscombe’s Quartet (Optional Stretch) (C1, C4, C10)

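These are Anscombe’s Quartet (1973), the most famous example in statistics of why you must always plot your data. All four datasets have nearly the same $r$ and the same means and standard deviations for $x$ and $y$. Yet they describe completely different situations.

Here are four small datasets, each with $n = 11$ points:

Dataset I (standard linear cloud, $r \approx 0.82$):

| $x$ | 10 | 8 | 13 | 9 | 11 | 14 | 6 | 4 | 12 | 7 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $y$ | 8.0 | 7.0 | 7.6 | 8.8 | 8.3 | 9.9 | 7.2 | 4.3 | 10.8 | 4.8 | 5.7 |

Dataset II (perfect quadratic curve, $r \approx 0.82$):

| $x$ | 10 | 8 | 13 | 9 | 11 | 14 | 6 | 4 | 12 | 7 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $y$ | 9.1 | 8.1 | 8.7 | 8.8 | 9.3 | 8.1 | 6.1 | 3.1 | 9.1 | 7.3 | 4.7 |

Dataset III (nearly perfect line with one extreme outlier, $r \approx 0.82$):

| $x$ | 10 | 8 | 13 | 9 | 11 | 14 | 6 | 4 | 12 | 7 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $y$ | 7.5 | 6.8 | 12.7 | 7.1 | 7.8 | 8.8 | 6.1 | 5.4 | 8.1 | 6.4 | 5.7 |

Dataset IV (vertical cluster with one outlier driving everything, $r \approx 0.82$):

| $x$ | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 19 | 8 | 8 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $y$ | 6.6 | 5.8 | 7.7 | 8.8 | 8.5 | 7.0 | 5.3 | 12.5 | 7.6 | 7.9 | 6.9 |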

(a) For which dataset(s) is Pearson $r$ the most meaningful summary of the relationship?

(b) What does this demonstrate about computing $r$ without plotting the data?
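To see how similar the four correlations are, the sketch below computes Pearson $r$ for each dataset using the rounded values as listed in this problem (so results may differ slightly from Anscombe's published figures). The helper `pearson_r` and the dictionary layout are illustrative choices.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Datasets I-III share the same x-values
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I": (x_shared, [8.0, 7.0, 7.6, 8.8, 8.3, 9.9, 7.2, 4.3, 10.8, 4.8, 5.7]),
    "II": (x_shared, [9.1, 8.1, 8.7, 8.8, 9.3, 8.1, 6.1, 3.1, 9.1, 7.3, 4.7]),
    "III": (x_shared, [7.5, 6.8, 12.7, 7.1, 7.8, 8.8, 6.1, 5.4, 8.1, 6.4, 5.7]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.6, 5.8, 7.7, 8.8, 8.5, 7.0, 5.3, 12.5, 7.6, 7.9, 6.9]),
}

r_by_dataset = {name: pearson_r(xs, ys) for name, (xs, ys) in datasets.items()}
for name, r in r_by_dataset.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

The printed values sit close together even though only one of the four scatter plots looks like a linear cloud, which is the point of the exercise: the number alone cannot distinguish these situations.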

Section 10: Solutions Reference

Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include worked arithmetic, computation tables, common errors to avoid, and interpretation guidance.

View Full Solutions →

If you are stuck: Return to Section 3 and find the core concept that maps to your problem. Example 2 (Section 4) shows a complete computation table. Use it as a template whenever you need to compute $r$ by hand.

Quick-Reference Formulas

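Pearson Correlation Coefficient — Definition Form: $r = \dfrac{1}{n-1}\sum\left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$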

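Pearson Correlation Coefficient — Computational Form: $r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$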

Coefficient of Determination: $r^2 = (r)^2$

t-Test for Significance of r: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, with $df = n - 2$

Strength Thresholds:

Interpretation
Strong
to Moderate
Weak

Correct phrasing: “The linear relationship with $x$ explains $(r^2 \times 100)\%$ of the variability in $y$.”

Conditions for Pearson $r$: Both variables quantitative; approximately linear form (check scatter plot); no extreme outliers.

Key distinctions: