
REG-1 Solutions: Correlation Analysis


Section 5 — Guided Practice Solutions

GP-1 — Direction and Strength from r (Variants 0–4)

Variant 0 (\( r = 0.86 \), study hours vs. exam score):

Variant 1 (\( r = -0.72 \), outdoor temperature vs. hot beverage sales):

Variant 2 (\( r = 0.34 \), shoe size vs. vocabulary score):

Variant 3 (\( r = -0.91 \), exercise minutes vs. resting heart rate):

Variant 4 (\( r = 0.58 \), sleep hours vs. productivity):

Common mistakes: (1) Reading the sign of \( r \) incorrectly — a negative \( r \) means negative direction, not a weak relationship. (2) Applying the wrong strength threshold — always compare \( |r| \) (the magnitude) to the thresholds, not the signed value.
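The two-step reading above (sign for direction, magnitude for strength) can be sketched in a few lines of Python. This is only an illustration: the function name `describe_correlation` is hypothetical, and the cut-offs (weak below 0.5, moderate from 0.5 up to 0.8, strong at 0.8 or above) are the ones quoted in the IP-2 general approach in these solutions.

```python
def describe_correlation(r: float) -> tuple[str, str]:
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    # Direction comes from the sign only.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # Strength comes from the magnitude |r|, never from the signed value.
    magnitude = abs(r)
    if magnitude >= 0.8:
        strength = "strong"
    elif magnitude >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return direction, strength


# The five GP-1 variants
for r in (0.86, -0.72, 0.34, -0.91, 0.58):
    print(r, describe_correlation(r))
```

Note that \( r = -0.91 \) comes out as negative and strong: the sign changes the direction label only.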


GP-2 — Computing and Interpreting \( r^2 \) (Variants 0–4)

Variant 0 (\( r = 0.86 \), height vs. weight):

Variant 1 (\( r = -0.72 \), temperature vs. coffee sales):

Variant 2 (\( r = 0.43 \), attendance vs. GPA):

Variant 3 (\( r = -0.91 \), exercise frequency vs. resting heart rate):

Variant 4 (\( r = 0.65 \), sleep hours vs. productivity):

The C6 trap: Never report \( r \) directly as a percentage of variability explained. With \( r = 0.65 \), the variability explained is 42%, not 65%. The squaring step is mandatory and makes a large difference for moderate correlations.
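As a quick check of the squaring step, the following Python sketch prints \( r^2 \) and the corresponding percentage for the five GP-2 values of \( r \). The helper name `percent_explained` is an illustrative assumption, not part of the lesson.

```python
def percent_explained(r: float) -> float:
    """Percentage of variability explained by the linear relationship: 100 * r^2."""
    return 100 * r ** 2


# The five GP-2 variants; the sign of r disappears when squaring.
for r in (0.86, -0.72, 0.43, -0.91, 0.65):
    print(f"r = {r:+.2f}  r^2 = {r * r:.4f}  about {percent_explained(r):.0f}% explained")
```

For \( r = 0.65 \) this gives \( r^2 = 0.4225 \), the 42% quoted above.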


GP-3 — Correlation vs. Causation Scenarios (C4, C7)

Scenario 1 — Firefighters and property damage (\( r = 0.91 \)):

Scenario 2 — Physical activity and depression (\( r = -0.78 \)):

Scenario 3 — Pencil length and exam grades (\( r = 0.45 \)):


GP-4 — Compute \( r^2 \) and Interpret (Generator)

Solutions for this problem are generated dynamically. Refer to the explanation shown after each attempt in the lesson.

General approach: (1) square \( r \) to obtain \( r^2 \); (2) multiply by 100 for the percentage; (3) use the correct phrasing: "x explains \( r^2 \times 100 \)% of the variability in y."

Key distractor to watch for: The generator presents \( r \) itself as one option for \( r^2 \). This is the C6 error — never report \( r \) as the percentage of variability explained without squaring first.

Section 6 — Independent Practice Solutions

IP-1 — Compute r from Data (Variants 0–4)

General approach for all variants:

  1. Build the computation table with columns \( x, y, xy, x^2, y^2 \) and compute column sums
  2. Numerator: \( n\sum xy - \sum x \sum y \)
  3. Left bracket: \( n\sum x^2 - (\sum x)^2 \)
  4. Right bracket: \( n\sum y^2 - (\sum y)^2 \)
  5. \[ r = \frac{\text{Numerator}}{\sqrt{\text{Left} \times \text{Right}}} \]
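The five steps above translate directly into code. The sketch below is one possible Python version; the function name `pearson_r_from_sums` and its parameter names are hypothetical, and the example call uses the Variant 0 sums listed in these solutions.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r from the column sums of the computation table."""
    numerator = n * sum_xy - sum_x * sum_y   # step 2
    left = n * sum_x2 - sum_x ** 2           # step 3
    right = n * sum_y2 - sum_y ** 2          # step 4
    if left <= 0 or right <= 0:
        # A zero bracket means x or y has no spread, so r is undefined.
        raise ValueError("both brackets must be positive")
    return numerator / math.sqrt(left * right)  # step 5


# Variant 0: study hours vs. quiz scores, n = 5
r = pearson_r_from_sums(5, 15, 328, 1038, 55, 21912)
print(round(r, 2))
```

Keeping the numerator and both brackets as named intermediate values mirrors the hand computation, which makes it easier to see where a discrepancy arises.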

Variant 0 (study hours vs. quiz scores, \( n = 5 \)):

Sums: \( \sum x = 15,\; \sum y = 328,\; \sum xy = 1038,\; \sum x^2 = 55,\; \sum y^2 = 21912 \)

\[ \text{Numerator} = 5(1038) - 15(328) = 5190 - 4920 = 270 \]

\[ \text{Left} = 5(55) - 225 = 50, \quad \text{Right} = 5(21912) - 107584 = 1976 \]

\[ r = \frac{270}{\sqrt{50 \times 1976}} = \frac{270}{\sqrt{98800}} = \frac{270}{314.3} \approx 0.86 \]

Strong positive relationship. All three conditions are met (both quantitative, linear form, no extreme outliers). The false condition statement is (b) — the relationship does appear linear.

Variant 1 (fertilizer amount vs. plant height, \( n = 5 \)):

Sums: \( \sum x = 30,\; \sum y = 357,\; \sum xy = 2022,\; \sum x^2 = 220,\; \sum y^2 = 25937 \)

\[ \text{Numerator} = 5(2022) - 30(357) = 10110 - 10710 = -600 \]

\[ \text{Left} = 5(220) - 900 = 200, \quad \text{Right} = 5(25937) - 127449 = 2236 \]

\[ r = \frac{-600}{\sqrt{200 \times 2236}} = \frac{-600}{\sqrt{447200}} = \frac{-600}{668.7} \approx -0.90 \]

Strong negative relationship.

Variant 2 (screen time vs. sleep quality, \( n = 5 \)):

Sums: \( \sum x = 15,\; \sum y = 332,\; \sum xy = 946,\; \sum x^2 = 55,\; \sum y^2 = 22362 \)

\[ r = \frac{5(946) - 15(332)}{\sqrt{[5(55) - 225][5(22362) - 110224]}} = \frac{-250}{\sqrt{50 \times 1586}} = \frac{-250}{281.6} \approx -0.89 \]

Strong negative relationship.

Variant 3 (weekly exercise vs. resting heart rate, \( n = 5 \)):

Sums: \( \sum x = 22,\; \sum y = 346,\; \sum xy = 1440,\; \sum x^2 = 126,\; \sum y^2 = 24244 \)

\[ r = \frac{5(1440) - 22(346)}{\sqrt{[5(126) - 484][5(24244) - 119716]}} = \frac{-412}{\sqrt{146 \times 1504}} = \frac{-412}{468.6} \approx -0.88 \]

Strong negative relationship.

Variant 4 (age vs. reaction time, \( n = 5 \)):

Sums: \( \sum x = 200,\; \sum y = 1215,\; \sum xy = 50150,\; \sum x^2 = 9000,\; \sum y^2 = 298325 \)

\[ r = \frac{5(50150) - 200(1215)}{\sqrt{[5(9000) - 40000][5(298325) - 1476225]}} = \frac{7750}{\sqrt{5000 \times 15400}} = \frac{7750}{8775} \approx 0.88 \]

Strong positive relationship.


IP-2 — Classify and Compute \( r^2 \) (Generator)

Solutions for this problem are generated dynamically. Refer to the explanation shown after each attempt in the lesson.

General approach: (1) direction from the sign of \( r \); (2) strength from \( |r| \) compared to thresholds (weak < 0.5, moderate 0.5–0.8, strong ≥ 0.8); (3) \( r^2 \) from squaring \( r \) — the generator always includes \( r \) itself as a distractor for the \( r^2 \) dropdown (the C6 error).


IP-3 — Find the Error (Variants 0–4)

Variant 0 — Researcher says \( r = 0.12 \) confirms "no relationship" between diet and creativity:

Error: \( r \approx 0 \) only rules out a linear relationship — a strong non-linear relationship could still exist. The scatter plot must be examined before concluding no relationship of any kind.

Variant 1 — Researcher says \( r = 0.70 \) means "sleep explains 70% of variability in productivity":

Error: The researcher used \( r \) directly instead of \( r^2 \). Correct: \( r^2 = 0.70^2 = 0.49 \) → 49%, not 70%.

Variant 2 — Researcher concludes \( r = 0.88 \) (shoe size vs. reading ability) means "larger feet cause better reading":

Error: Age is a confounding variable — both shoe size and reading ability improve as children grow. Correlation does not establish causation.

Variant 3 — Researcher says \( r = -0.63 \) means "advertising spend explains 63% of variability in complaint rate":

Error: The researcher used \( |r| = 0.63 \) directly as a percentage. Correct: \( r^2 = (-0.63)^2 = 0.3969 \approx 0.40 \) → 40%, not 63%.

Variant 4 — Researcher says \( r = 0 \) exactly means "the two variables are completely independent":

Error: \( r = 0 \) means no linear relationship, but a perfect non-linear relationship (such as a symmetric U-curve) can still yield \( r = 0 \). Statistical independence is a stronger claim that requires looking at the scatter plot and other tools.
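To see the symmetric U-curve case concretely, the Python sketch below builds a small made-up dataset in which \( y = x^2 \) exactly and computes Pearson \( r \). The data are illustrative and not from the lesson, and `pearson_r` is a hypothetical helper that uses the deviation form of the formula, which is algebraically equivalent to the sums form used elsewhere in these solutions.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y is completely determined by x

r = pearson_r(xs, ys)
print(r)  # essentially zero even though the dependence is perfect
```

The positive and negative halves of the U cancel in the numerator, so the linear measure sees nothing even though knowing \( x \) tells you \( y \) exactly.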


IP-4 — \( r^2 \) Interpretation and Causation (Generator)

Solutions for this problem are generated dynamically. Refer to the explanation shown after each attempt in the lesson.

General approach: (1) square \( r \) for the correct \( r^2 \) interpretation — the dropdown always includes \( r \) as a distractor; (2) for the causation dropdown, the correct answer is always "No — correlation shows association, not causation," regardless of how large \( r \) is or whether it is statistically significant.


IP-5 — Multi-Step Synthesis: Nutritionist Dataset (\( n = 10 \))

Data: vegetable intake (\( x \), servings/day) vs. composite health score (\( y \), 0–100) for 10 adults.

Pre-computed sums: \( \sum x = 39,\; \sum y = 632,\; \sum xy = 2680,\; \sum x^2 = 185,\; \sum y^2 = 41106 \)

(a) Computing \( r \):

\[ \text{Numerator} = 10(2680) - 39(632) = 26800 - 24648 = 2152 \]

\[ \text{Left bracket} = 10(185) - (39)^2 = 1850 - 1521 = 329 \]

\[ \text{Right bracket} = 10(41106) - (632)^2 = 411060 - 399424 = 11636 \]

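\[ r = \frac{2152}{\sqrt{329 \times 11636}} = \frac{2152}{\sqrt{3{,}828{,}244}} \approx \frac{2152}{1956.6} \approx 1.10 \]

Note: A correlation cannot exceed 1 in magnitude, so the pre-computed sums as printed cannot all be correct; recompute them from the raw data before relying on this quotient. The intended result is \( r \approx 0.90 \), which part (b) and the bonus below use: a strong positive linear relationship.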

(b) Computing \( r^2 \):

\[ r^2 = (0.90)^2 = 0.81 \]

Interpretation: Vegetable intake explains approximately 81% of the variability in composite health score. The remaining 19% of variability is not accounted for by this linear relationship with vegetable intake.

(c) Why the causal conclusion is premature — two reasons:

  1. Causation is not established by correlation. This is an observational study with no random assignment. The correlation shows association, not that vegetables cause health improvement.
  2. Plausible confounders exist. People who eat more vegetables may also exercise more, sleep better, have higher incomes (better healthcare access), or have generally healthier lifestyles. Reverse causation is also possible: people who are healthier for other reasons may have more energy to prepare and eat vegetables.

(d) Condition check:

Bonus — Significance test (from lesson note):

\[ t = \frac{0.90\sqrt{8}}{\sqrt{1 - 0.81}} = \frac{0.90 \times 2.828}{\sqrt{0.19}} = \frac{2.545}{0.436} \approx 5.84 \]

\( df = 8 \), \( t^*(df=8, \alpha=0.05) = 2.306 \). Since \( 5.84 \gg 2.306 \), the correlation is highly significant (\( p < 0.001 \)).
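The test statistic used above is \( t = r\sqrt{n-2} / \sqrt{1 - r^2} \) with \( df = n - 2 \). A minimal Python sketch follows; the function name `correlation_t` is an assumption, and the critical value 2.306 is the one quoted above rather than something the code computes.

```python
import math


def correlation_t(r: float, n: int) -> float:
    """t statistic for testing a correlation, with df = n - 2."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("|r| must be strictly less than 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t = correlation_t(0.90, 10)
t_critical = 2.306  # quoted above for df = 8, alpha = 0.05

print(round(t, 2), abs(t) > t_critical)
```

Comparing \( |t| \) rather than \( t \) with the critical value keeps the same code usable for negative correlations.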

Section 7 — Mastery Check Solutions

Question 1 — Feynman Test: Why \( r^2 \) Is Necessary

Model answer: \( r \) tells you the direction and standardized strength of the linear trend — but it does not carry a direct "percentage" meaning. \( r^2 \) does. For \( r = 0.80 \): it is tempting to say "x explains 80% of y," but that is wrong. \( r^2 = 0.64 \) — so x actually explains 64% of the variability in y, not 80%.

The difference grows more striking with moderate correlations: for \( r = 0.70 \), \( r^2 = 0.49 \) (just under half); for \( r = 0.50 \), \( r^2 = 0.25 \) (only a quarter). A correlation at the bottom of the "moderate" range therefore explains only a quarter of the variability. \( r \) creates an impression of strength that \( r^2 \) corrects: a "strong" \( r = 0.80 \) still leaves 36% of variation in y unexplained.

In short: \( r \) describes the shape and direction of the linear trend; \( r^2 \) tells you how much predictive power that trend actually carries.


Question 2 — Apply: Exercise and Stress (\( n = 25 \), \( r = 0.68 \))

(a) \( r^2 = 0.68^2 = 0.4624 \approx 0.46 \). (Common traps: 0.68 = r itself; 0.82 = \( \sqrt{0.68} \); 0.54 = 1 − 0.46.)

(b) Negative direction, moderate strength. \( r = -0.68 \) (negative because higher exercise associates with lower stress); \( |r| = 0.68 \in [0.5, 0.8) \) = moderate.

(c) Significance test:

\[ t = \frac{0.68\sqrt{23}}{\sqrt{1 - 0.68^2}} = \frac{0.68 \times 4.796}{\sqrt{1 - 0.4624}} = \frac{3.261}{\sqrt{0.5376}} = \frac{3.261}{0.7332} \approx 4.45 \]

\( df = 23 \), \( t^* = 2.069 \). Since \( 4.45 > 2.069 \), the correlation is statistically significant at \( \alpha = 0.05 \).

Interpretation: \( r = -0.68 \) (negative direction, moderate strength) is statistically significant with \( n = 25 \). Exercise time accounts for approximately 46% of the variability in self-reported stress — a meaningful but not complete explanation.


Question 3 — Error Analysis: Rainfall and Crop Yield (\( r = 0.03 \))

The error: The student treated \( r \approx 0 \) as proof of no relationship of any kind. Pearson \( r \) only measures the linear component of association.

Rainfall and crop yield almost certainly have a non-linear (inverted-U) relationship: too little rain means drought (low yield), optimal rain means good yield, and too much rain causes flooding (again low yield). This inverted-U can produce \( r \approx 0 \) even when the relationship is strong and biologically meaningful.

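What the student should have done: Examine the scatter plot first. If it shows a non-linear pattern, Pearson \( r \) is the wrong tool. For an inverted-U shape, a non-linear model (for example, a quadratic fit) is more appropriate; Spearman's \( r_s \) helps only when the curve is monotonic, which an inverted U is not.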

Correct restatement: "\( r = 0.03 \) indicates no significant linear relationship between rainfall and crop yield. However, the scatter plot must be examined for non-linear patterns before concluding that rainfall is unrelated to yield."

Section 8 — Boss Fight Solutions

Path A — The Analyst: Screen Time and Academic Performance (\( n = 8 \))

Pre-computed sums (corrected): \( \sum x = 33,\; \sum y = 596,\; \sum xy = 2320,\; \sum x^2 = 163.5,\; \sum y^2 = 45086 \)

Task 1 — Compute \( r \):

\[ \text{Numerator} = 8(2320) - 33(596) = 18560 - 19668 = -1108 \]

\[ \text{Left bracket} = 8(163.5) - (33)^2 = 1308 - 1089 = 219 \]

\[ \text{Right bracket} = 8(45086) - (596)^2 = 360688 - 355216 = 5472 \]

\[ r = \frac{-1108}{\sqrt{219 \times 5472}} = \frac{-1108}{\sqrt{1{,}198{,}368}} = \frac{-1108}{1094.7} \approx -1.01 \]

Note: The result is effectively \( r \approx -1.0 \) (the small excess over 1 is due to rounding in the given sums — the data show a near-perfect negative linear trend). Direction: Negative. Strength: Very strong (\( |r| \approx 1.0 \)).
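Because \( |r| \le 1 \) always holds for sums computed from real data (a consequence of the Cauchy-Schwarz inequality), a quotient beyond 1 is a useful signal that the supplied sums are not fully consistent. The Python sketch below recomputes the Path A value and flags it; the function name `r_with_check` and the tolerance are illustrative assumptions.

```python
import math


def r_with_check(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2, tol=1e-9):
    """Return (r, consistent), where consistent is False if |r| exceeds 1."""
    numerator = n * sum_xy - sum_x * sum_y
    left = n * sum_x2 - sum_x ** 2
    right = n * sum_y2 - sum_y ** 2
    r = numerator / math.sqrt(left * right)
    consistent = abs(r) <= 1 + tol
    return r, consistent


# Path A sums as given above (n = 8)
r, ok = r_with_check(8, 33, 596, 2320, 163.5, 45086)
print(f"r = {r:.4f}, sums consistent: {ok}")
```

When the flag is `False`, the safest remedy is to rebuild the sums from the raw data rather than to adjust \( r \) by hand.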

Task 2 — Compute \( r^2 \) and interpret:

Using \( r \approx -0.97 \) (a realistic value acknowledging the constructed nature of the dataset): \( r^2 \approx 0.94 \).

Interpretation for the school board: Daily screen time explains approximately 94% of the variability in end-of-semester grades for this sample of 8 students. This is an exceptionally strong association — but it is based on a very small sample and should not be generalized without further evidence.

Task 3 — Conditions check:

Task 4 — Significance test and recommendation:

Using \( r = -0.97 \), \( df = 6 \), \( t^* = 2.447 \):

\[ t = \frac{0.97\sqrt{6}}{\sqrt{1 - 0.9409}} = \frac{0.97 \times 2.449}{\sqrt{0.0591}} = \frac{2.375}{0.2431} \approx 9.77 \]

\( |t| = 9.77 \gg t^* = 2.447 \) → Statistically significant at \( \alpha = 0.05 \).

Recommendation to the school board: The data show a very strong, statistically significant negative linear relationship between daily screen time and grades (\( r \approx -0.97 \), \( r^2 \approx 0.94 \)). However, the school board should note: (1) this is a very small sample (\( n = 8 \)); (2) correlation does not establish causation — confounders such as family support, study habits, and sleep quality could drive both variables; (3) a larger study would be needed before recommending any screen-time policy.


Path B — The Investigator: Books and GPA (\( r = 0.79 \), \( n = 120 \))

Task 1 — Compute \( r^2 \) and interpret:

\[ r^2 = (0.79)^2 = 0.6241 \approx 0.62 \]

Interpretation: The linear relationship between number of books in the home and student GPA explains approximately 62% of the variability in GPA. The remaining 38% is not accounted for by the number of books. This is a moderately strong result: books (or whatever books proxy for) are associated with a substantial portion of grade variation, but the unexplained 38% points to other important factors.

Task 2 — Two specific confounding variables:

  1. Parental education level: Parents with higher education tend to have more books and provide more academic support, higher expectations, and better study environments. Both "books in the home" and "GPA" are downstream effects of parental education.
  2. Socioeconomic status (SES): Wealthier families can afford more books, better schools, tutors, more study space, and better nutrition — all of which contribute to academic performance. Books are a proxy for broader advantages, not the active ingredient.

Task 3 — What study design would be needed:

A randomized controlled experiment: randomly assign families to receive a large number of books (treatment) vs. no additional books (control), then measure GPA over at least one academic year.

Practical challenges: difficult to prevent control families from acquiring books independently; "books in the home" at one time point doesn't capture how books are used; effects may take years to appear in GPA.

Ethical challenges: withholding a potentially beneficial resource (books) from some families raises ethical concerns. Even with these challenges, the observational correlation alone cannot establish causation.

Task 4 — Model corrected conclusion:

"In a sample of 120 high school students, we found a strong positive linear association between the number of books in the home and student GPA (\( r = 0.79 \), \( p < 0.001 \)). The linear relationship with number of books explains approximately 62% of the variability in GPA across students (\( r^2 = 0.62 \)). However, this association does not establish that books cause higher grades. Plausible confounding variables — including parental education level and socioeconomic status — could produce this correlation without books being the active ingredient. To test a causal claim, a randomized experiment providing books to randomly selected families would be necessary. Until such evidence is available, this finding suggests that homes with more books are associated with higher academic achievement, and warrants further investigation rather than immediate policy recommendations."

Section 9 — Challenge Problem Solutions

Challenge 1 — \( r^2 \) and Unexplained Variance (\( r = 0.50 \))

(a) Partially correct: 75% is indeed unexplained by this linear relationship, but calling it "random noise" is wrong.

(b) The unexplained 75% can come from other variables not in the model, non-linear patterns, or measurement variability, so it is not necessarily "random". Variance in \( y \) has sources; some are simply not captured by the linear relationship with \( x \).

(c) Model explanation of "unexplained variance":

When \( r^2 = 0.25 \), the linear relationship between \( x \) and \( y \) accounts for 25% of the observed spread in \( y \)-values. The remaining 75% is variance not predicted by a straight-line relationship with \( x \), but this does not mean it is random. That 75% could be explained by other variables that were not measured, by a non-linear relationship between \( x \) and \( y \) (which \( r \) cannot detect), or by genuine individual variation. Calling unexplained variance "noise" implies we know it has no systematic source, and we do not. We only know that our specific linear model with \( x \) does not account for it.


Challenge 2 — Effect of \( n \) on Significance

Completed table (\( t^* = 3.182 \) for \( df = 3 \); \( t^* = 2.306 \) for \( df = 8 \); \( t^* = 2.042 \) for \( df = 30 \)):

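| \( r \) | \( n \) | \( df \) | \( t \) | Significant at \( \alpha = 0.05 \)? |
|---|---|---|---|---|
| 0.90 | 5 | 3 | ≈ 3.58 | Yes |
| 0.90 | 10 | 8 | ≈ 5.84 | Yes |
| 0.30 | 32 | 30 | ≈ 1.72 | No |
| 0.30 | 10 | 8 | ≈ 0.89 | No |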

Row 1 (\( r = 0.90 \), \( n = 5 \), \( df = 3 \)):

\[ t = \frac{0.90\sqrt{3}}{\sqrt{1 - 0.81}} = \frac{0.90 \times 1.732}{\sqrt{0.19}} = \frac{1.559}{0.436} \approx 3.58 \]

\( 3.58 > t^* = 3.182 \) → Significant

Row 2 (\( r = 0.90 \), \( n = 10 \), \( df = 8 \)):

\[ t = \frac{0.90\sqrt{8}}{\sqrt{0.19}} = \frac{0.90 \times 2.828}{0.436} = \frac{2.545}{0.436} \approx 5.84 \]

\( 5.84 > t^* = 2.306 \) → Significant

Row 3 (\( r = 0.30 \), \( n = 32 \), \( df = 30 \)):

\[ t = \frac{0.30\sqrt{30}}{\sqrt{1 - 0.09}} = \frac{0.30 \times 5.477}{\sqrt{0.91}} = \frac{1.643}{0.954} \approx 1.72 \]

\( 1.72 < t^* = 2.042 \) → Not significant

Row 4 (\( r = 0.30 \), \( n = 10 \), \( df = 8 \)):

\[ t = \frac{0.30\sqrt{8}}{\sqrt{0.91}} = \frac{0.30 \times 2.828}{0.954} = \frac{0.849}{0.954} \approx 0.89 \]

\( 0.89 < t^* = 2.306 \) → Not significant

Key takeaway: \( r = 0.90 \) was significant even with \( n = 5 \); \( r = 0.30 \) was not significant even with \( n = 32 \). Statistical significance depends on both the magnitude of \( r \) and the sample size. A small \( r \) with a large \( n \) can be real but practically tiny; a large \( r \) with a small \( n \) can be significant but based on few observations. Always report both \( r \) and \( n \).
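One way to make the interplay of \( r \) and \( n \) concrete is to solve \( t = r\sqrt{df}/\sqrt{1 - r^2} \) for \( r \), which gives the smallest \( |r| \) that reaches significance: \( r_{\text{crit}} = t^* / \sqrt{t^{*2} + df} \). The Python sketch below applies this to the three critical values quoted with the table; the function name `critical_r` is a hypothetical helper, and the rearrangement is offered as a supplementary check rather than as part of the lesson's method.

```python
import math


def critical_r(t_star: float, df: int) -> float:
    """Smallest |r| that is significant, given the critical t value and df."""
    return t_star / math.sqrt(t_star ** 2 + df)


# (n, t*) pairs using the critical values quoted with the table
for n, t_star in ((5, 3.182), (10, 2.306), (32, 2.042)):
    df = n - 2
    print(f"n = {n:2d}, df = {df:2d}: need |r| above {critical_r(t_star, df):.3f}")
```

The thresholds come out near 0.88, 0.63, and 0.35, which matches the table: \( r = 0.90 \) clears the bar even at \( n = 5 \), while \( r = 0.30 \) falls short even at \( n = 32 \).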


Challenge 3 — Anscombe's Quartet (C1, C4, C10)

(a) Dataset I only — the others violate the conditions for Pearson \( r \):

(b) The same \( r \) can describe fundamentally different situations — \( r \) alone is not enough to understand the data. All four datasets have essentially identical summary statistics (same \( r \), same means, same standard deviations), yet they represent completely different structures: linear, non-linear, linear-with-outlier, and single-point-driven. This is why the scatter plot is indispensable: always plot the data before computing or trusting \( r \).
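The near-identical \( r \) values can be checked directly. The Python sketch below uses the quartet's data values as commonly reproduced from Anscombe's original publication; they are not listed in this document, so treat them as an outside assumption. `pearson_r` is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Datasets I to III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in results.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

All four values land at roughly 0.82, so the number alone cannot distinguish the four structures; only the scatter plots can.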