Solutions — Correlation Analysis

How to use this page: Try each problem in the lesson before checking solutions here. If your answer doesn't match, read the solution carefully — especially the part that explains why common wrong answers are wrong. Understanding the error matters more than getting the right answer the first time.

← Back to Lesson REG-1

Section 5: Guided Practice Solutions

Problem 1 — Direction and Strength from r (Variants 0–4)

Variant 0: positive; strong.
Variant 1: negative; moderate.
Variant 2: positive; weak.
Variant 3: negative; strong.
Variant 4: positive; moderate.

Common mistakes: (1) a negative r means negative direction, not a weak relationship; (2) compare |r| (the magnitude), not the signed value, to the strength thresholds.

Problem 2 — Computing and Interpreting r² (Variants 0–4)

Variant 0: r² ≈ 0.74 → “74% of the variability in weight is explained by the linear relationship with height.”
Variant 1: r² ≈ 0.52 (squaring removes the sign) → “52% of coffee-sales variability explained by temperature.”
Variant 2: r² ≈ 0.18 → “18% of GPA variability explained by attendance; 82% unexplained.”
Variant 3: r² ≈ 0.83 → “83% of resting-heart-rate variability explained by exercise frequency.”
Variant 4: r² ≈ 0.42 → “42% of productivity variability explained by sleep hours.”

The C6 trap: never report r directly as a percentage. With r = 0.65, the variability explained is 42% (0.65² ≈ 0.42), not 65%. The squaring step is mandatory.
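The squaring step can be sketched in a few lines of Python. This is only an illustration: the function name `percent_explained` and the sample values are assumptions made for the example, not part of the lesson's generator.

```python
def percent_explained(r: float) -> float:
    """Return the percentage of variability explained: r squared, times 100."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return (r ** 2) * 100.0


# Illustrative values only; note that squaring removes the sign.
for r in (0.65, -0.65, 0.80):
    print(f"r = {r:+.2f} -> {percent_explained(r):.0f}% explained")
```

Reporting `r * 100` instead of `percent_explained(r)` reproduces the C6 error: for r = 0.65 it would claim 65% where the correct figure is about 42%.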

Problem 3 — Correlation vs. Causation

Scenario 1 (firefighters vs. property damage): no causation. Fire size is the confounder (bigger fires draw more firefighters and cause more damage). The strength of r is irrelevant to causation.
Scenario 2 (activity vs. depression): cannot determine. Reverse causation (feeling better → more energy to exercise) or a health confounder is plausible.
Scenario 3 (pencil length vs. grades): no. Both are driven by studious behavior (a common cause).

Problem 4 — Compute r² and Interpret (Generator)

(1) Square r; (2) multiply by 100; (3) phrase the result as “x explains (r² × 100)% of the variability in y.”

Key distractor: the generator offers r itself as an option for r², which is the C6 error. Always square first.

Section 6: Independent Practice Solutions

Problem 1 — Compute r from Data (Variants 0–4)

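General approach: build the table, then compute

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

Below, “Numerator” is the top of this fraction, and “Left” and “Right” are the x term and the y term of the denominator.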

Variant 0 (study vs. quiz, ): , , , , . Numerator ; Left , Right ; (strong positive).
Variant 1 (fertilizer vs. height, ): , , , , . Numerator ; Left , Right ; .
Variant 2 (screen vs. sleep, ): .
Variant 3 (exercise vs. heart rate, ): .
Variant 4 (age vs. reaction time, ): .
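The table-and-sums approach can be sketched as follows, assuming the standard computational formula for Pearson r (numerator nΣxy − ΣxΣy over the square roots of the x and y terms). The function name and the five data points are hypothetical and are not one of the lesson's variants.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r from the five column totals of an x, y, xy, x^2, y^2 table."""
    numerator = n * sum_xy - sum_x * sum_y
    left = n * sum_x2 - sum_x ** 2    # x term under the first square root
    right = n * sum_y2 - sum_y ** 2   # y term under the second square root
    return numerator / math.sqrt(left * right)


# Hypothetical data, chosen only to keep the arithmetic easy to check by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r_from_sums(
    n=len(x),
    sum_x=sum(x),                               # 15
    sum_y=sum(y),                               # 20
    sum_xy=sum(a * b for a, b in zip(x, y)),    # 66
    sum_x2=sum(a * a for a in x),               # 55
    sum_y2=sum(b * b for b in y),               # 86
)
print(round(r, 4))  # 0.7746
```

For this data the numerator is 5·66 − 15·20 = 30, the x term is 5·55 − 15² = 50, and the y term is 5·86 − 20² = 30, so r = 30 / √1500 ≈ 0.77.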

Problem 2 — Classify and Compute r² (Generator)

(1) Direction from the sign of r; (2) strength from |r| (weak < 0.5, moderate 0.5–0.8, strong > 0.8); (3) r² from squaring r. The generator always offers r itself as the C6 distractor.
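A minimal sketch of the classify-then-square routine, assuming the cutoffs are weak below 0.5, moderate from 0.5 to 0.8, and strong above 0.8. The function name `classify_r` and the treatment of the exact boundary values are assumptions made for the example.

```python
def classify_r(r: float):
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    # Direction comes from the sign only.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # Strength comes from the magnitude |r|, never the signed value.
    magnitude = abs(r)
    if magnitude < 0.5:
        strength = "weak"
    elif magnitude <= 0.8:   # assumption: 0.5 and 0.8 both count as moderate
        strength = "moderate"
    else:
        strength = "strong"
    return direction, strength


for r in (-0.68, 0.90, 0.30):
    direction, strength = classify_r(r)
    print(f"r = {r:+.2f}: {direction}, {strength}, r^2 = {r ** 2:.2f}")
```

Using `abs(r)` for the strength check is the point of the sketch: comparing the signed value would wrongly label every negative correlation as weak.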

Problem 3 — Find the Error (Variants 0–4)

Variant 0 (near-zero r → “no relationship”): a near-zero r rules out only a linear relationship; a non-linear one may exist. Plot first.
Variant 1 (r = 0.70 → “70% of variability”): used r instead of r². Correct: r² = 0.49 → 49%.
Variant 2 (shoe size vs. reading → “feet cause reading”): age is the confounder. Correlation ≠ causation.
Variant 3 (r = 0.63 → “63% of variability”): used r instead of r². Correct: r² ≈ 0.40 → 40%.
Variant 4 (r = 0 → “completely independent”): r = 0 means no linear relationship; a symmetric U-curve gives r = 0 while being strongly dependent. Independence is a stronger claim.
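The U-curve point can be checked numerically. The sketch below uses made-up data, y = x² on a range symmetric around zero, and a small hand-written `pearson_r` helper (a name chosen for this example) so that nothing beyond the standard library is needed.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# y is completely determined by x, yet the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

Here y depends on x perfectly, but the positive and negative halves of the curve cancel in the cross-product sum, so r comes out as zero.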

Problem 4 — r² Interpretation and Causation (Generator)

(1) Square r (the dropdown includes r itself as a distractor); (2) the causation answer is always “No — correlation shows association, not causation,” regardless of how large or significant r is.

Problem 5 — Multi-Step Synthesis: Nutritionist Dataset (n = 10)

Vegetable intake vs. health score . Sums: , , , , .

(a) Numerator ; Left ; Right ; (strong positive).

(b) r² ≈ 0.98: vegetable intake explains ~98% of health-score variability (2% unexplained).

(c) Why causal conclusion is premature: (1) observational, no random assignment — association, not causation; (2) plausible confounders (exercise, sleep, income) and reverse causation (healthier people have more energy to eat vegetables).

(d) Conditions: both quantitative ✓; linearity needs a scatter plot; no extreme outliers visible (roughly monotone) ✓; independence of the 10 adults can’t be verified from sums alone.

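Bonus — significance: t = r√(n − 2) / √(1 − r²). With n = 10, df = n − 2 = 8; since |t| far exceeds the two-tailed critical value (2.306 at α = 0.05), r is highly significant.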

Section 7: Mastery Check Solutions

Problem 1 — Feynman Test: Why r² Is Necessary

r gives the direction and standardized strength of the linear trend but carries no direct “percentage” meaning; r² does. For r = 0.80, it’s tempting to say “x explains 80% of y,” but r² = 0.64, so x explains 64%, leaving 36% unexplained. The gap is starker for moderate correlations: r = 0.50 gives r² = 0.25 (only a quarter). In short: r describes the trend’s shape and direction; r² tells you how much predictive power it carries.

Problem 2 — Apply: Exercise and Stress (n = 25, r = −0.68)

(a) r² = (−0.68)² ≈ 0.46. (Traps: 0.68 = |r|; 0.82 = √|r|; 0.54 = 1 − r².)

(b) Negative direction, moderate strength (|r| = 0.68, inside the 0.5–0.8 band).

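(c) Significance: t = r√(n − 2) / √(1 − r²) ≈ −4.45. With df = n − 2 = 23, the two-tailed critical value at α = 0.05 is 2.069; since |t| > 2.069, r is significant at α = 0.05. Exercise accounts for ~46% of stress variability: meaningful but not complete.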
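The significance check can be sketched as below, assuming the lesson uses the usual t test for a Pearson correlation, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. The function name is hypothetical, and the critical value 2.069 is the standard t-table entry for df = 23 at a two-tailed α of 0.05, hard-coded here rather than computed.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1.0:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Values from this problem: n = 25, r = -0.68.
r, n = -0.68, 25
t = correlation_t_statistic(r, n)

T_CRIT_DF23 = 2.069  # t-table value: df = 23, two-tailed, alpha = 0.05

print(f"t = {t:.2f}, df = {n - 2}")
print("significant at alpha = 0.05:", abs(t) > T_CRIT_DF23)
```

The sign of t simply follows the sign of r; the decision compares |t| with the critical value.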

Problem 3 — Error Analysis: Rainfall and Crop Yield (r = 0.03)

Error: treating r = 0.03 as proof of no relationship. Pearson r measures only the linear component. Rainfall and yield likely have an inverted-U relationship (drought → low, optimal → high, flood → low), which can give r ≈ 0 despite a strong real relationship.

Should have: plotted the scatter first; if non-linear, use a non-linear model. Restatement: “r = 0.03 indicates no significant linear relationship; examine the scatter plot for non-linear patterns before concluding rainfall is unrelated to yield.”

Section 8: Boss Fight Solutions

Path A — The Analyst: Screen Time and Academic Performance (n = 8)

Sums: , , , , .

Task 1 — r: Numerator ; Left ; Right ; — negative, very strong.

Task 2 — r²: r² ≈ 0.97, so screen time explains ~97% of grade variability for these 8 students (exceptional, but with so tiny a sample, don’t generalize).

Task 3 — Conditions: both quantitative ✓; strongly linear (r² ≈ 0.97) ✓; no single dramatic outlier in a monotone decrease ✓; main concern: n = 8 is very small, so outlier/linearity checks are unreliable.

Task 4 — Significance: t = r√(n − 2) / √(1 − r²); with n = 8, df = 6 and |t| exceeds the two-tailed critical value (2.447 at α = 0.05), so r is significant. Recommendation: strong, significant negative relationship (r² ≈ 0.97, n = 8), but (1) tiny sample, (2) correlation ≠ causation (confounders: family support, study habits, sleep), (3) a larger study is needed before any policy.

Path B — The Investigator: Books and GPA (r = 0.79, n = 120)

Task 1 — r²: r² = 0.79² ≈ 0.62, so the linear relationship explains ~62% of GPA variability; 38% remains, pointing to other factors.

Task 2 — Confounders: (1) parental education (more books and more academic support); (2) socioeconomic status (books proxy for tutors, better schools, study space). Books are a proxy, not necessarily the active ingredient.

Task 3 — Study design: a randomized controlled experiment — randomly assign families to receive many books vs. none, then measure GPA over ≥1 year. Practical issues (control families acquire books; usage matters; slow effects) and ethical issues (withholding a beneficial resource) make it hard, but observational correlation alone can’t establish causation.

Task 4 — Corrected conclusion:

“In a sample of 120 high school students, we found a strong positive linear association between number of books in the home and GPA (r = 0.79, p < 0.001); the linear relationship explains ~62% of GPA variability (r² ≈ 0.62). This does not establish that books cause higher grades: confounders such as parental education and socioeconomic status could produce the correlation. A randomized experiment providing books to randomly selected families would be needed to test causation. Until then, this finding warrants further investigation rather than immediate policy recommendations.”

Section 9: Challenge Problem Solutions

Challenge 1 — r² and Unexplained Variance (r = 0.50)

(a) Partially correct: 75% is unexplained by the linear relationship, but calling it “random noise” is wrong.

(b) The other 75% reflects other variables, non-linear patterns, or measurement variability, so it is not “random” in any meaningful sense. All variance in y has causes; some just aren’t captured by the linear relationship with x.

(c) When r = 0.50, the line accounts for 25% of y’s spread; the remaining 75% could come from unmeasured variables, a non-linear relationship that r can’t detect, or genuine individual variation. Calling it “noise” wrongly implies we know its cause; we only know our linear model with x doesn’t capture it.

Challenge 2 — Anscombe’s Quartet

(a) Dataset I only is well-described by Pearson r:

I: standard linear cloud; r is meaningful ✓.
II: perfect quadratic, so non-linear; r misses the curvature. Use a non-linear model.
III: near-perfect line with one extreme outlier that distorts r; remove it and r rises to nearly 1.
IV: vertical cluster (all points except one at x = 8); the single remaining point produces the entire r.

(b) The same r can describe fundamentally different structures (linear, non-linear, outlier-driven, single-point-driven) with identical summary statistics. Always plot the data before trusting r.
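The quartet can be checked directly. The sketch below hard-codes the published Anscombe (1973) values and uses a small hand-written `pearson_r` helper (a name chosen for this example); it is a quick numerical check and no substitute for plotting.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in results.items():
    print(f"Dataset {name}: r = {r:.3f}")

# Dataset III again, with the single outlier (x = 13, y = 12.74) dropped.
xs3, ys3 = quartet["III"]
kept = [(x, y) for x, y in zip(xs3, ys3) if y != 12.74]
r3_no_outlier = pearson_r([x for x, _ in kept], [y for _, y in kept])
print(f"Dataset III without the outlier: r = {r3_no_outlier:.3f}")
```

All four datasets give r ≈ 0.82, while dropping one point from dataset III pushes r to nearly 1. In dataset IV, removing the lone point at x = 19 would leave x with no variation at all, so r would be undefined.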
