REG-3 Solutions: Regression Interpretation and Prediction

Section 4 — Worked Examples Solutions

Example 1 — Interpreting Slope and Intercept

See lesson text — the full solution with metacognitive narration is presented directly in the lesson.


Example 2 — Testing \( H_0: \rho = 0 \) (n = 15, r = 0.75)

See lesson text — the full five-step solution is behind the "Show Solution" toggle in the lesson.


Example 3 — Interpolation vs. Extrapolation (x = 8 g and x = 35 g)

Model: \( \hat{y} = 1.60 + 0.45x \), fertilizer (g) → tomato yield (kg), observed range \( x \in [0, 20] \) g.

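x = 8 g: \( \hat{y}(8) = 1.60 + 0.45 \times 8 = 1.60 + 3.60 = 5.20 \) kg. Since \( 8 \in [0, 20] \), this is interpolation, so the prediction is supported by the data.

x = 35 g: \( \hat{y}(35) = 1.60 + 0.45 \times 35 = 1.60 + 15.75 = 17.35 \) kg. Since \( 35 > 20 = x_{\max} \), this is extrapolation, so the value should be flagged as unreliable.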

Key habit: Always check whether \( x_{\min} \leq x \leq x_{\max} \) before using a prediction. Arithmetic always works — the question is whether the result can be trusted.
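
The range check described above can be sketched in a few lines of Python. The function name `predict_with_range_check` and its labels are illustrative choices, not part of the lesson; the coefficients and the observed range are taken from the model stated for this example.

```python
def predict_with_range_check(a, b, x, x_min, x_max):
    """Return (y_hat, label) for the line y_hat = a + b*x.

    The label only reports whether x lies inside the observed range;
    it says nothing about how well the model fits inside that range.
    """
    y_hat = a + b * x
    label = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, label


# Fertilizer (g) -> tomato yield (kg), observed range [0, 20] g
for x in (8, 35):
    y_hat, label = predict_with_range_check(1.60, 0.45, x, 0, 20)
    print(f"x = {x} g: y_hat = {y_hat:.2f} kg ({label})")
```

Both predictions come from the same arithmetic; only the label differs, which is the point of checking the range before reporting a value.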


Example 4 — Find the Error (Age → Reaction Time)

Model: \( \hat{y} = 177.4 + 1.69x \), age (years) → reaction time (ms), observed range \( x \in [20, 65] \) years. \( n = 20 \), \( r = 0.81 \), \( p = 0.001 \).

The researcher made three errors:

Error 1 — Causation language: The researcher states the regression "proves that aging causes slower reactions." Regression shows association, not causation. The observed pattern could reflect health status, lifestyle, or other age-correlated confounders. Correct language: "is associated with" or "predicts."

Error 2 — Conflating statistical significance with prediction reliability: The researcher claims that \( p = 0.001 \) means "we can trust all predictions from it." Statistical significance tells us the linear relationship is non-zero in the population — it does not validate predictions at any arbitrary \( x \) value, especially outside the observed range. Note: \( r^2 = 0.81^2 \approx 0.66 \), so the model does explain 66% of variance within the observed range. But this says nothing about extrapolated predictions.

Error 3 — Extrapolation reported as reliable: \( x = 80 \) years is outside the observed range \( [20, 65] \). The arithmetic \( \hat{y}(80) = 177.4 + 1.69 \times 80 = 312.6 \) ms is correct, but calling this "a reliable clinical prediction" is not. The linear trend observed in adults 20–65 may not hold at age 80 — neurological and physical changes at extreme ages may create non-linearities. This is extrapolation and must be flagged as such.

Section 5 — Guided Practice Solutions

GP-1 — Slope and Intercept Interpretation (Variants 0–4)

Variant 0 (\( \hat{y} = 56.90 + 3.70x \), study hours → exam score, range [1, 8] h):

Variant 1 (\( \hat{y} = 123.2 - 2.16x \), temperature (°C) → hot beverage sales, range [5, 35]°C):

Variant 2 (\( \hat{y} = 87.9 - 0.53x \), exercise (min) → resting heart rate (bpm), range [10, 60] min):

Variant 3 (\( \hat{y} = 1.60 + 0.45x \), fertilizer (g) → tomato yield (kg), range [0, 20] g):

Variant 4 (\( \hat{y} = 177.4 + 1.69x \), age (years) → reaction time (ms), range [20, 65] years):

Intercept ruling summary: The test is not "does x = 0 make real-world sense?" but "is x = 0 within or very near [x_min, x_max]?" Variants 0 (borderline) and 3 (yes) are the only ones where x = 0 is near the data range.


GP-2 — Interpolation and Extrapolation (Variants 0–4)

Variant 0 (\( \hat{y} = 56.90 + 3.70x \), range [1, 8] h):

Variant 1 (\( \hat{y} = 123.2 - 2.16x \), range [5, 35]°C):

Variant 2 (\( \hat{y} = 87.9 - 0.53x \), range [10, 60] min):

Variant 3 (\( \hat{y} = 1.60 + 0.45x \), range [0, 20] g):

Variant 4 (\( \hat{y} = 177.4 + 1.69x \), range [20, 65] years):


GP-3 — Residual Plot Diagnosis

(a) "Residuals bounce randomly above and below zero; spread roughly constant across all fitted values."

Diagnosis: Random scatter — both linearity and homoscedasticity assumptions appear satisfied. No remediation needed; proceed with the model.

(b) "For small fitted values, residuals are positive; for middle values, near zero; for large values, positive again — a U-shape."

Diagnosis: Curved band (non-linearity) — the mean residual follows a systematic curve, meaning the true relationship is non-linear. The linear model is mis-specified. Consider a quadratic or transformed model.

(c) "For low fitted values, residuals are within ±2; for high fitted values, residuals range from −15 to +15."

Diagnosis: Fan shape (heteroscedasticity) — the spread of residuals increases as fitted values increase. Equal-variance assumption is violated. Weighted regression or a variance-stabilizing transformation (e.g., log y) may be needed.
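
The three patterns can also be told apart numerically. The following is only a rough sketch: the function `diagnose_residuals`, the split into thirds, and both thresholds (`curve_factor`, `fan_ratio`) are assumptions made for illustration, and the three residual lists are synthetic. A visual check of the residual plot remains the primary tool.

```python
from statistics import mean, pstdev


def diagnose_residuals(fitted, residuals, curve_factor=0.5, fan_ratio=2.0):
    """Label a residual pattern as 'curved', 'fan', or 'random'.

    Crude heuristic: sort by fitted value, split into thirds, then compare
    the mean and spread of the residuals in each third. Both thresholds
    are arbitrary assumptions chosen for illustration.
    """
    ordered = [e for _, e in sorted(zip(fitted, residuals))]
    k = len(ordered) // 3
    low, mid, high = ordered[:k], ordered[k:len(ordered) - k], ordered[-k:]

    # Curvature: both outer thirds sit on the same side of the middle third.
    gap_low = mean(low) - mean(mid)
    gap_high = mean(high) - mean(mid)
    overall_sd = pstdev(ordered)
    if gap_low * gap_high > 0 and min(abs(gap_low), abs(gap_high)) > curve_factor * overall_sd:
        return "curved"

    # Fan shape: spread in one outer third is much larger than in the other.
    sd_low, sd_high = pstdev(low), pstdev(high)
    if min(sd_low, sd_high) > 0 and max(sd_low, sd_high) / min(sd_low, sd_high) > fan_ratio:
        return "fan"

    return "random"


# Synthetic residuals, one list per pattern (not data from the lesson)
fitted = list(range(1, 13))
random_like = [1, -1] * 6
u_shape = [(f - 6.5) ** 2 - 143 / 12 for f in fitted]
fan = [f if i % 2 == 0 else -f for i, f in enumerate(fitted)]

for name, res in [("a", random_like), ("b", u_shape), ("c", fan)]:
    print(name, diagnose_residuals(fitted, res))
```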


GP-4 — Significance Test for \( \rho \) (Generator)

Solutions are shown in the generator's solution panel.

Section 6 — Independent Practice Solutions

IP-1 — Full Interpretation Chain (Variants 0–4)

Variant 0 (\( \hat{y} = 56.90 + 3.70x \), study hours → exam score, range [1, 8] h):

Variant 1 (\( \hat{y} = 123.2 - 2.16x \), temperature (°C) → hot beverage sales, range [5, 35]°C):

Variant 2 (\( \hat{y} = 87.9 - 0.53x \), exercise (min) → resting heart rate (bpm), range [10, 60] min):

Variant 3 (\( \hat{y} = 1.60 + 0.45x \), fertilizer (g) → tomato yield (kg), range [0, 20] g):

Variant 4 (\( \hat{y} = 177.4 + 1.69x \), age (years) → reaction time (ms), range [20, 65] years):


IP-2 — Significance Test and \( r^2 \) (Generator)

Solutions are shown in the generator's solution panel.


IP-3 — Find the Error (Variants 0–4)

Variant 0 — Extrapolation without warning (\( x = 50 \)°C, range [5, 35]°C):

Variant 1 — Meaningless intercept interpretation (\( x = 0 \) for age → reaction time model, range [20, 65] years):

Variant 2 — Conflating statistical significance with practical significance (\( r = 0.18 \), \( n = 500 \)):

Variant 3 — Ignoring an influential point (\( x = 95 \) added to range [20, 60] dataset):

Variant 4 — Extrapolation producing a physically impossible result (\( x = 70 \) h/week for run time model, range [5, 30] h):


IP-4 — Prediction Risk (Generator)

Solutions are shown in the generator's solution panel.


IP-5 — Multi-Step Synthesis (Sports Science: Training Hours → 5K Time)

Context: \( n = 12 \) runners, \( r = -0.92 \), \( \bar{x} = 18 \) h, \( \bar{y} = 22.5 \) min, \( s_x = 6.2 \), \( s_y = 3.4 \). Observed range: \( x \in [5, 30] \) h.

(a) Computing b and a:

\[ b = r \cdot \frac{s_y}{s_x} = -0.92 \times \frac{3.4}{6.2} = -0.92 \times 0.548 \approx -0.50 \]

\[ a = \bar{y} - b\bar{x} = 22.5 - (-0.50)(18) = 22.5 + 9.0 = 31.5 \]

Regression equation: \( \hat{y} = 31.5 - 0.50x \)
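
As a check on the arithmetic in (a), here is a minimal Python sketch that applies the same two formulas to the summary statistics. The function name `line_from_summary` is a hypothetical helper introduced only for this check.

```python
def line_from_summary(r, x_bar, y_bar, s_x, s_y):
    """Least-squares slope and intercept from summary statistics."""
    b = r * s_y / s_x          # slope: b = r * (s_y / s_x)
    a = y_bar - b * x_bar      # intercept: a = y_bar - b * x_bar
    return a, b


a, b = line_from_summary(r=-0.92, x_bar=18, y_bar=22.5, s_x=6.2, s_y=3.4)
print(f"unrounded: b = {b:.4f}, a = {a:.2f}")

# The solution rounds b to -0.50 before computing a, which gives 31.5.
a_rounded = 22.5 - round(b, 2) * 18
print(f"with b rounded first: a = {a_rounded:.1f}")
```

Carrying the unrounded slope through gives an intercept of about 31.58 rather than 31.5. The difference is small here, but it shows how early rounding propagates into later steps.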

(b) Slope interpretation: "For each additional hour of weekly training, the predicted 5K run time decreases by 0.50 minutes (30 seconds), on average."

(c) Intercept meaningfulness: \( x = 0 \) means no training — below the observed range of [5, 30] hours. While intuitive (a non-runner would be slower), the model was not fit to data in this region. The intercept (31.5 min) is a mathematical anchor rather than a reliable prediction. Not contextually meaningful.

(d) Classification:

(e) Predictions:

\[ \hat{y}(20) = 31.5 - 0.50 \times 20 = 31.5 - 10.0 = \mathbf{21.5} \text{ min} \quad \text{(interpolation — reliable)} \]

\[ \hat{y}(35) = 31.5 - 0.50 \times 35 = 31.5 - 17.5 = \mathbf{14.0} \text{ min} \quad \text{(extrapolation — flag as risky)} \]

Concern for x = 35: This is extrapolation. The linear trend at 5–30 h/week may not continue — at extreme volumes, overtraining effects could plateau or reverse performance gains. Report with an explicit warning.

(f) Significance test for \( H_0: \rho = 0 \):

\( H_0: \rho = 0 \) vs. \( H_a: \rho \neq 0 \), \( \alpha = 0.05 \), two-tailed.

\( df = n - 2 = 12 - 2 = 10 \)

\[ t = \frac{-0.92\sqrt{10}}{\sqrt{1 - (-0.92)^2}} = \frac{-0.92 \times 3.162}{\sqrt{1 - 0.8464}} = \frac{-2.909}{\sqrt{0.1536}} = \frac{-2.909}{0.392} \approx -7.42 \]

\( |t| = 7.42 \gg t^*(df=10) = 2.228 \) → \( p \ll 0.05 \) → Reject \( H_0 \).

Conclusion: There is statistically significant evidence of a linear relationship between weekly training hours and 5K run time in this population.
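
The test statistic in (f) can be verified with a short Python sketch. The helper name `correlation_t` is an assumption for illustration; the critical value 2.228 is the table value used above, hard-coded rather than computed.

```python
import math


def correlation_t(r, n):
    """t statistic and degrees of freedom for testing H0: rho = 0."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df


t, df = correlation_t(r=-0.92, n=12)
T_CRIT = 2.228  # two-tailed, alpha = 0.05, df = 10 (from a t table)

print(f"df = {df}, t = {t:.2f}, reject H0: {abs(t) > T_CRIT}")
```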

(g) Practical significance:

\[ r^2 = (-0.92)^2 = 0.8464 \approx 0.85 \]

Training hours explain approximately 85% of the variability in 5K run times. This is both statistically significant and practically meaningful — the model accounts for the great majority of performance variability. The remaining 15% reflects individual differences, race conditions, and other factors.

(h) Coach's request for x = 38 h/week:

\( x = 38 > 30 = x_{\max} \) — this is extrapolation, 8 hours beyond the maximum observed training volume. The model should not be used confidently here. Extreme training volumes may violate the linearity assumption (overtraining non-linearity), and the observed pattern may not extend to 38 h/week. Recommendation: do not use the model for \( x = 38 \) without collecting data from high-volume athletes. At minimum, report the prediction with a clear extrapolation warning and do not use it for individual training decisions.

Section 7 — Mastery Check Solutions

Question 1 — Feynman Test: Statistical vs. Practical Significance

Model answer (~150 words):

"Statistical significance" (\( p < \alpha \)) tells you only that the data provide evidence of a non-zero linear relationship in the population: an \( r \) as extreme as the one observed would be unlikely if \( \rho \) were 0. It says nothing about the size of the relationship. With a large enough sample, even a trivially weak relationship becomes statistically significant.

\( r^2 \) measures practical significance: the proportion of variance in \( y \) explained by \( x \). A model with \( r^2 = 0.04 \) explains only 4% of the variability; the other 96% is unexplained by the model. With a large enough sample (\( n = 5000 \)), even \( r = 0.05 \) (so \( r^2 = 0.0025 \)) gives \( t \approx 3.5 \) and \( p < 0.001 \).

Concrete example: A company finds \( r = 0.18 \) between post frequency and engagement, \( n = 500 \), \( p = 0.001 \). But \( r^2 = 0.03 \) — engagement is 97% unexplained. Reporting "significant predictor" without reporting \( r^2 = 0.03 \) is misleading. A model that explains 3% of variance is practically useless for strategy decisions, even if the relationship is real. Always report both the \( p \)-value and \( r^2 \).


Question 2 — Apply: Sleep Hours → Energy Intake (n = 25, r = −0.62)

Model: \( \hat{y} = 2890 - 185x \), sleep hours → energy intake (kcal/day), range \( x \in [4, 10] \) h.

Full five-step significance test for \( H_0: \rho = 0 \):

Step 1 — Hypotheses: \( H_0: \rho = 0 \) vs. \( H_a: \rho \neq 0 \), \( \alpha = 0.05 \), two-tailed.

Step 2 — Conditions: Assume bivariate normality and independence of observations.

Step 3 — Test statistic: \( df = n - 2 = 25 - 2 = 23 \)

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.62 \times \sqrt{23}}{\sqrt{1 - 0.3844}} = \frac{-0.62 \times 4.796}{\sqrt{0.6156}} = \frac{-2.974}{0.785} \approx -3.79 \]

(Note: \( |t| \approx 3.79 \); the lesson MCQ option states \( t \approx 3.73 \) — minor rounding difference, both reach the same conclusion.)

Step 4 — p-value: \( |t| = 3.79 > t^*(df=23) = 2.069 \), so \( p < 0.05 \).

Step 5 — Conclusion: Reject \( H_0 \). There is statistically significant evidence of a linear relationship between sleep hours and energy intake in this population of adults.

Prediction at x = 7: \( \hat{y}(7) = 2890 - 185 \times 7 = 2890 - 1295 = 1595 \) kcal/day. Since \( x = 7 \in [4, 10] \), this is interpolation — the prediction is safe to use.


Question 3 — Error Analysis: Confusing \( r^2 \) with Individual Prediction Accuracy

Model: \( \hat{y} = -105 + 0.92x \), height (cm) → weight (kg), \( r^2 = 0.68 \), \( n = 40 \).

Error identified — \( r^2 \) vs. individual prediction accuracy:

\( r^2 = 0.68 \) means that 68% of the overall variability in weight across the sample is explained by the linear relationship with height. It is an aggregate measure of fit for the whole sample. It does not tell you how close any specific prediction will be to the true value for a given individual.

The student's claim, that the prediction will be "within ±12% of the true value," confuses two unrelated quantities: (1) the unexplained variance fraction \( 1 - r^2 = 0.32 \), which summarizes the whole dataset, and (2) the margin of error for an individual, which requires a prediction interval.

What is actually needed: A prediction interval for an individual at \( x = 175 \) cm, which accounts for both (a) uncertainty in the mean response and (b) natural variability among individuals with the same height. With \( r^2 = 0.68 \), a 95% prediction interval might realistically span 15–20 kg on either side of 56.0 kg — far wider than ±12% implies.

Section 8 — Boss Fight Solutions

Path A — The Diagnostician (Screen Time → Sleep Quality, n = 10)

Given: \( r = -0.82 \), \( \bar{x} = 5.0 \) h, \( \bar{y} = 58.0 \), \( s_x = 2.0 \), \( s_y = 12.5 \). Observed range: \( x \in [1, 7] \) hours for 9 subjects; one subject at \( x = 12 \) h.

Task 1 — Regression equation:

\[ b = r \cdot \frac{s_y}{s_x} = -0.82 \times \frac{12.5}{2.0} = -0.82 \times 6.25 = -5.125 \approx -5.13 \]

\[ a = \bar{y} - b\bar{x} = 58.0 - (-5.13)(5.0) = 58.0 + 25.65 = 83.65 \]

Regression equation: \( \hat{y} = 83.65 - 5.13x \)

Task 2 — Significance test:

\( H_0: \rho = 0 \) vs. \( H_a: \rho \neq 0 \), \( \alpha = 0.05 \), \( df = n - 2 = 8 \), \( t^* = 2.306 \).

\[ t = \frac{-0.82\sqrt{8}}{\sqrt{1 - 0.6724}} = \frac{-0.82 \times 2.828}{\sqrt{0.3276}} = \frac{-2.319}{0.572} \approx -4.05 \]

\( |t| = 4.05 > t^* = 2.306 \) → \( p < 0.05 \) → Reject \( H_0 \). Statistically significant evidence of a linear relationship.

Task 3 — Residual plot diagnosis:

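The nine subjects in [1, 7] hours show random scatter near zero with consistent spread, so the linearity and equal-variance assumptions appear satisfied for that portion of the data. The subject at \( x = 12 \) has a residual of \( -22 \), which is very large compared to the others. This indicates the point at \( x = 12 \) is both a high-leverage point (its \( x \)-value lies far outside the main cluster) and a regression outlier (its residual is far larger than the rest), which makes it potentially influential.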

Task 4 — Model evaluation paragraph (model answer):

"The regression of sleep quality on screen time is statistically significant (\( t \approx -4.05 \), \( p < 0.05 \)) with \( r^2 = 0.82^2 = 0.67 \), indicating that screen time explains approximately 67% of the variance in sleep quality within this sample — strong practical significance. However, a critical concern is the single subject reporting 12 hours of screen time per day. This point is far outside the main cluster (\( x \in [1, 7] \)), giving it high leverage; its residual of \( -22 \) confirms it is also an outlier. Before trusting the model, I would recommend (1) re-running the regression without this point to quantify its influence on the slope, (2) investigating whether 12 hours is a data entry error, and (3) reporting both regression estimates. Within the range of [1, 7] hours/day — where the model has reliable data — interpolated predictions appear trustworthy, provided conditions are met."


Path B — The Predictor (Running Club: \( \hat{y} = 31.5 - 0.50x \), range [5, 30] h)

Task 1 — Coach A: Predict for x = 22 h/week:

\[ \hat{y}(22) = 31.5 - 0.50 \times 22 = 31.5 - 11.0 = \mathbf{20.5} \text{ minutes} \]

Range check: \( 22 \in [5, 30] \) → Interpolation. This prediction is supported by the data and can be used with confidence.

Task 2 — Coach B: Predict for x = 40 h/week (elite athlete):

\[ \hat{y}(40) = 31.5 - 0.50 \times 40 = 31.5 - 20.0 = \mathbf{11.5} \text{ minutes} \]

Advisory note: \( x = 40 > 30 = x_{\max} \) — this is extrapolation. The model was fit to runners training 5–30 h/week; it has no data for 40 h/week. The linear trend may not hold at extreme volumes — overtraining effects (injury risk, diminishing returns) could create a non-linear plateau. The prediction of 11.5 minutes should not be used to set individual training targets. Recommend collecting data from high-volume athletes before trusting this prediction.

Task 3 — Coach C: Slope interpretation and limitation:

Slope interpretation: "For each additional hour of weekly training, the predicted 5K run time decreases by 0.50 minutes (30 seconds), on average."

Limitation for individual training design: The slope describes the average relationship across the 12 runners in the dataset — it does not guarantee that any individual runner will improve by exactly 30 seconds per additional training hour. Individual responses vary due to fitness level, running efficiency, recovery capacity, and other factors. Using the slope to set individual training prescriptions requires caution.

Task 4 — Correcting Coach D's misunderstanding (model answer):

"Coach D, I need to clarify what the p-value actually tells us. The p-value < 0.001 means there is strong statistical evidence that a linear relationship exists between training volume and race time in this population — the relationship is almost certainly not due to chance. But 'statistically significant' does not mean 'we can trust all predictions.' Statistical significance applies only within the range of the data used to fit the model, which is 5–30 hours per week. Any prediction outside this range — such as the 40 h/week request — is an extrapolation, and the linear trend may not hold there. What \( r^2 = 0.85 \) adds is a measure of practical significance: training hours explain 85% of the variance in race times among these 12 runners. That is excellent model fit within the observed range. But it still does not authorize extrapolation — a model can be a very good fit inside the data and completely wrong outside it."

Section 9 — Challenge Problem Solutions

Challenge 1 — The Effect of an Influential Point

Original dataset: \( n = 8 \), \( x \in [10, 50] \), \( \hat{y} = 5.0 + 1.1x \), \( r^2 = 0.82 \). Ninth point added at \( (90, 80) \).

(a) Regression outlier check:

\[ \hat{y}(90) = 5.0 + 1.1 \times 90 = 5.0 + 99.0 = 104.0 \]

Residual: \( e = 80 - 104 = -24 \). This is very large compared with the typical residuals of the other 8 points, which are likely to be only a few units given that the original fit has \( r^2 = 0.82 \). Yes, the ninth point is a regression outlier.

(b) High leverage and influence:

Removing the ninth point changes \( b \) from 0.75 to 1.1 — a 47% change in the slope — and improves \( r^2 \) from 0.61 to 0.82. This is a substantial change on both metrics. Yes, the ninth point is highly influential. It is both an outlier (large negative residual) and an influential point (materially changes the slope and fit).

(c) Leverage and extreme x-values:

\( x = 90 \) is 40 units beyond the original \( x_{\max} = 50 \). High leverage arises because this point is far from the \( \bar{x} \) of the original data. The regression line is "pulled" toward a remote point, giving that point enormous control over the slope. General principle: extreme \( x \)-values always have high leverage, regardless of their \( y \)-value or residual. Even if the ninth point fell exactly on the original regression line (zero residual), it would still have high leverage; it would simply not be influential, because adding or removing it would leave the slope unchanged. Leverage is the potential to move the line, and that potential is realized only when the point's \( y \)-value departs from the trend of the other points.
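
To see leverage and influence side by side, the sketch below fits a line with and without a remote ninth point. The eight base points are invented for illustration (the original dataset is not given here), so the slopes will not match the 0.75 and 1.1 quoted in (b); only the ninth point \( (90, 80) \) comes from the problem. The helpers `fit_line` and `leverage` are hypothetical names, and the leverage formula used is the standard one for simple regression, \( h_i = 1/n + (x_i - \bar{x})^2 / S_{xx} \).

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def leverage(xs, i):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / S_xx of the i-th point."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return 1 / n + (xs[i] - x_bar) ** 2 / s_xx


# Hypothetical base data scattered around y = 5 + 1.1x (NOT the lesson's data)
xs = [10, 15, 20, 25, 30, 40, 45, 50]
ys = [18.0, 19.5, 29.0, 30.5, 40.0, 47.0, 56.5, 58.0]

_, slope_without = fit_line(xs, ys)
_, slope_with = fit_line(xs + [90], ys + [80])
h_ninth = leverage(xs + [90], 8)

print(f"slope without ninth point: {slope_without:.3f}")
print(f"slope with ninth point:    {slope_with:.3f}")
print(f"leverage of ninth point:   {h_ninth:.3f}")
```

Refitting with and without the suspect point is the same check recommended for the screen-time data in Path A: it quantifies influence directly instead of inferring it from the residual alone.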


Challenge 2 — How Sample Size Affects Significance (\( r = 0.40 \))

Formula: \( t = r\sqrt{n-2} / \sqrt{1-r^2} \). With \( r = 0.40 \): \( \sqrt{1 - r^2} = \sqrt{1 - 0.16} = \sqrt{0.84} \approx 0.9165 \).

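| \( n \) | \( df \) | \( t \) | \( t^* \) (\( \alpha = 0.05 \)) | Significant? |
| --- | --- | --- | --- | --- |
| 5 | 3 | 0.76 | 3.182 | No |
| 10 | 8 | 1.23 | 2.306 | No |
| 20 | 18 | 1.85 | 2.101 | No |
| 30 | 28 | 2.31 | 2.048 | Yes |
| 50 | 48 | 3.02 | ~2.01 | Yes |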

Arithmetic for each row:
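
The row-by-row arithmetic can be reproduced with a short Python sketch. The critical values are copied from the table rather than computed; 2.011 for \( df = 48 \) is an approximation consistent with the "~2.01" shown there. The helper name `correlation_t` is illustrative.

```python
import math

R = 0.40
# (n, two-tailed t* at alpha = 0.05); critical values copied from the table
ROWS = [(5, 3.182), (10, 2.306), (20, 2.101), (30, 2.048), (50, 2.011)]


def correlation_t(r, n):
    """t statistic for H0: rho = 0, using t = r*sqrt(n-2)/sqrt(1-r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


results = []
for n, t_crit in ROWS:
    t = correlation_t(R, n)
    results.append((n, n - 2, round(t, 2), t > t_crit))
    print(f"n = {n:>2}  df = {n - 2:>2}  t = {t:.2f}  t* = {t_crit:.3f}  "
          f"significant: {t > t_crit}")
```

Only \( \sqrt{n-2} \) changes from row to row; \( r \) and \( \sqrt{1 - r^2} \) are fixed, so \( t \) grows with sample size alone.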

Follow-up answer: In this table, \( r = 0.40 \) first becomes statistically significant at \( \alpha = 0.05 \) at \( n = 30 \); the threshold lies somewhere between \( n = 20 \) and \( n = 30 \). Yet \( r^2 = 0.16 \), so the model explains only 16% of the variance in \( y \). The lesson: large samples make even weak relationships detectable. A "significant" result with \( r^2 = 0.16 \) is statistically real (the data give evidence that the relationship exists in the population) but practically weak (the model accounts for only about one-sixth of the variability). Statistical significance alone is insufficient: always report \( r^2 \) alongside the \( p \)-value to convey practical significance.


Challenge 3 — What Does \( r^2 \) Actually Measure?

Claim: "Since \( r^2 = 0.88 \), my prediction will be within ±12% of the true value."

Error 1 — \( r^2 \) is not a statement about individual prediction accuracy.

\( r^2 = 0.88 \) means that 88% of the overall variability in \( y \) across the dataset is explained by the linear relationship with \( x \). It is an aggregate measure of fit for the whole dataset: it describes how tightly the data cluster around the regression line on average. It does not tell you how close any specific prediction \( \hat{y}(x_0) = 42.5 \) will be to the true \( y \) for the particular individual at \( x_0 \). Two individuals with the same \( x_0 \) can have very different \( y \) values even when \( r^2 = 0.88 \).

Error 2 — "Within ±12%" confuses unexplained variance with an error margin.

\( 1 - r^2 = 0.12 \) is the proportion of unexplained variance — a dimensionless fraction. It is not a percentage error margin for predictions. Even if only 12% of variance is unexplained, the actual prediction error depends on the absolute scale of \( y \). The margin of error for a specific individual requires the residual standard error \( s_e = \sqrt{\text{SSE}/(n-2)} \), not the unexplained variance proportion.

What is needed to correctly quantify individual prediction uncertainty:

To build a formal prediction interval for an individual at \( x = x_0 \), you need:

  1. \( s_e = \sqrt{\text{SSE}/(n-2)} \) — the residual standard deviation, which captures how spread out individual observations are around the line.
  2. The leverage of \( x_0 \) — how far \( x_0 \) is from \( \bar{x} \), which affects precision of the mean response estimate.
  3. The \( t^* \) critical value for \( df = n - 2 \).

A 95% prediction interval for an individual is substantially wider than a confidence interval for the mean response, because it must account for both estimation uncertainty and natural individual variability. With \( r^2 = 0.88 \), the mean response is well estimated, but individual outcomes still vary around the line. Depending on the scale of \( y \) and \( s_e \), a 95% prediction interval could span many units on either side of 42.5 — far wider than the student's ±12% claim implies.
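
The difference between the two intervals can be made concrete with a small Python sketch using the standard simple-regression formulas, \( \hat{y} \pm t^* s_e \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / S_{xx}} \) for an individual and the same expression without the leading 1 for the mean response. Every number below (\( a \), \( b \), \( s_e \), \( n \), \( \bar{x} \), \( S_{xx} \)) is invented for illustration, chosen only so that \( \hat{y}(x_0) = 42.5 \); none of it comes from the problem.

```python
import math


def interval(x0, a, b, s_e, n, x_bar, s_xx, t_star, individual):
    """Interval around y_hat(x0).

    individual=True  -> prediction interval for a single new observation
    individual=False -> confidence interval for the mean response
    """
    y_hat = a + b * x0
    extra = 1.0 if individual else 0.0  # individual variability term
    margin = t_star * s_e * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat - margin, y_hat + margin


# Hypothetical inputs (assumptions, not from the problem)
params = dict(x0=10, a=2.5, b=4.0, s_e=3.0, n=30, x_bar=10, s_xx=250,
              t_star=2.048)  # t* for df = 28, two-tailed, alpha = 0.05

pi = interval(**params, individual=True)
ci = interval(**params, individual=False)
print(f"95% prediction interval: ({pi[0]:.2f}, {pi[1]:.2f})")
print(f"95% CI for mean response: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The width of the prediction interval is driven mainly by \( s_e \), which is measured in the units of \( y \). That is why it cannot be recovered from \( r^2 \) alone.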