
DS-2: Data Visualization — Solutions

Module 1 · Descriptive Statistics

How to use this page: Attempt every problem yourself before reading the solution. If your answer differs, read the solution carefully and identify where your reasoning went differently — that's where the learning happens.


Section 5 — Guided Practice

GP1 — Completing a Frequency Table

The commuter dataset (n = 40):

Question 1a: Relative frequency for the 30–44 class.

\[ f_r = \frac{f}{n} = \frac{14}{40} = 0.35 \] The 30–44 class has 14 observations out of 40. ✓

Question 1b: Cumulative frequency for the 30–44 class.

Add all classes up to and including 30–44: \( cf = 6 + 12 + 14 = 32 \). ✓

Common mistake: writing 18 (which is cf through 15–29) or 14 (which is just f for this class). Cumulative means "up to and including this class."

Question 1c: What percentage of commuters take 45 minutes or more?

"45 minutes or more" = 45–59 class (\( f_r = 0.15 \)) + 60–74 class (\( f_r = 0.05 \)).

\[ f_r(\geq 45) = 0.15 + 0.05 = 0.20 = 20\% \]

Alternatively: \( 1 - cf_r(\text{through } 44) = 1 - 0.80 = 0.20 \). ✓

Common mistake: Answering 95% — that's the cumulative relative frequency through 45–59, meaning 95% of commuters take 59 minutes or less. You want the proportion who take 45 or more minutes.
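The three GP1 answers can be checked with a short script. This is a minimal sketch, not the course's own tooling: the counts 6, 12 and 14 come from the sum in Question 1b, the last two counts are back-calculated from the relative frequencies in Question 1c (0.15 × 40 = 6 and 0.05 × 40 = 2), and the "0–14" label for the first class is an assumption, since the table itself is not reproduced on this page.

```python
from itertools import accumulate

# Class labels and frequencies for the commuter dataset (n = 40).
# ASSUMPTION: the "0–14" label is inferred; the last two counts are
# back-calculated from the relative frequencies 0.15 and 0.05.
classes = ["0–14", "15–29", "30–44", "45–59", "60–74"]
freq = [6, 12, 14, 6, 2]

n = sum(freq)
rel_freq = [f / n for f in freq]            # f_r = f / n
cum_freq = list(accumulate(freq))           # running totals of f
cum_rel_freq = [cf / n for cf in cum_freq]  # cf_r = cf / n

i = classes.index("30–44")
share_45_plus = 1 - cum_rel_freq[i]         # complement of cf_r through 44

print(f"1a: f_r = {rel_freq[i]:.2f}")
print(f"1b: cf  = {cum_freq[i]}")
print(f"1c: {share_45_plus:.0%} take 45 minutes or more")
```

Using the complement of the cumulative relative frequency mirrors the "Alternatively" line in Question 1c and avoids the 95% mistake described above.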

GP2 — Choosing the Correct Graph Type (All 5 Variants)

The key decision chain: What type of variable? → What graph?

GP3 — Reading a Frequency Table (All 5 Variants)

Reference table: quiz completion times for 50 students.

GP4 — Identify the Misleading Feature

The bar chart of store revenue (Flagship: $215M, Neighbourhood: $208M, Express: $196M) uses a y-axis starting at $180M instead of $0.

Primary flaw: Truncated y-axis. The actual range is $215M − $196M = $19M, which is 9.7% of the Express revenue. By starting the axis at $180M, the charted range is $220M − $180M = $40M. The Flagship bar occupies \( (215-180)/40 = 87.5\% \) of the axis height; the Express bar only \( (196-180)/40 = 40\% \). This makes Flagship appear more than twice as tall as Express, visually overstating a modest 9.7% actual advantage.
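The distortion can be quantified by comparing bar heights on the truncated axis with the underlying revenue ratio. The sketch below uses the $180M to $220M axis from the solution; the helper name `bar_fraction` is made up for illustration, and the zero-based axis is a hypothetical redraw.

```python
def bar_fraction(value, axis_min, axis_max):
    """Fraction of the axis height that a bar of `value` occupies."""
    return (value - axis_min) / (axis_max - axis_min)

revenue = {"Flagship": 215, "Neighbourhood": 208, "Express": 196}  # in $M

# Truncated axis as drawn in the chart: $180M to $220M.
truncated = {name: bar_fraction(v, 180, 220) for name, v in revenue.items()}
# Hypothetical honest redraw: same upper limit, axis starting at $0.
honest = {name: bar_fraction(v, 0, 220) for name, v in revenue.items()}

visual_ratio = truncated["Flagship"] / truncated["Express"]
true_ratio = revenue["Flagship"] / revenue["Express"]

print(f"Truncated axis: Flagship looks {visual_ratio:.2f}x as tall as Express")
print(f"Actual revenue: Flagship is {true_ratio:.3f}x Express")
```

On a zero-based axis the ratio of bar heights equals the ratio of revenues, which is why starting at $0 is the usual default for bar charts.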

---

Section 6 — Independent Practice

IP1 — Build a Complete Frequency Table (Generator)

This problem generates a new dataset each time. The solution method is always the same:

  1. Find min and max. Class width = \( \lceil(\text{max} - \text{min} + 1) / 5\rceil \)
  2. Define 5 class boundaries starting from min.
  3. Tally each value into its class to get \( f \).
  4. Compute \( f_r = f / n \) for each class; verify they sum to 1.00.
  5. Compute \( cf \) as running totals of \( f \); final cf must equal n = 15.
  6. Compute \( cf_r \) as running totals of \( f_r \); final cf_r must equal 1.00.

Sanity check four quantities: \( \sum f = n \), \( \sum f_r = 1.00 \), final \( cf = n \), final \( cf_r = 1.00 \).
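The six steps and the four sanity checks can be sketched as a small function. This is one possible implementation under stated assumptions: the data are integers, the function name `frequency_table` is invented here, and the 15-value dataset is made up for illustration (the generator's actual output will differ).

```python
import math
from itertools import accumulate

def frequency_table(data, k=5):
    """Build a k-class frequency table for integer data (steps 1-6)."""
    lo, hi = min(data), max(data)
    n = len(data)
    width = math.ceil((hi - lo + 1) / k)          # step 1: class width
    # Step 2: k equal-width classes starting from the minimum.
    bounds = [(lo + i * width, lo + (i + 1) * width - 1) for i in range(k)]
    # Step 3: tally each value into its class.
    f = [sum(a <= x <= b for x in data) for a, b in bounds]
    f_r = [c / n for c in f]                      # step 4: relative frequency
    cf = list(accumulate(f))                      # step 5: running totals of f
    # Step 6: computed as cf / n, which equals the running total of f_r
    # but avoids accumulating floating-point rounding.
    cf_r = [c / n for c in cf]
    return bounds, f, f_r, cf, cf_r

# HYPOTHETICAL dataset of n = 15 values, for illustration only.
data = [12, 15, 17, 18, 21, 22, 22, 24, 25, 27, 28, 30, 31, 33, 36]
bounds, f, f_r, cf, cf_r = frequency_table(data)

for (a, b), fi, fri, cfi, cfri in zip(bounds, f, f_r, cf, cf_r):
    print(f"{a}-{b}: f={fi}, f_r={fri:.2f}, cf={cfi}, cf_r={cfri:.2f}")
```

Because the width is rounded up, the five classes always reach at least as far as the maximum, so no value is left untallied.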

IP2 — Interpreting a Histogram (Daily Steps)

The histogram shows daily steps for 60 office workers (classes: 2000–3999, 4000–5999, 6000–7999, 8000–9999, 10000–11999; frequencies approximately 5, 12, 22, 15, 6).

(a) Class width: Each class spans 2,000 steps. Class width = 2,000.

(b) Workers in 6,000–7,999 steps: Read the bar height — approximately 22 workers. (The bar reaches slightly above the "20" gridline.)

(c) Percentage walking fewer than 6,000 steps:

\[ \frac{f_{2000-3999} + f_{4000-5999}}{n} = \frac{5 + 12}{60} = \frac{17}{60} \approx 0.283 = 28.3\% \]

(d) Shape: The distribution is slightly right-skewed. The peak is at 6,000–7,999 steps. The right tail (8,000–11,999) falls off gradually. Most office workers cluster in the 4,000–8,000 range, with fewer workers reaching 10,000+ steps.

IP3 — Select the Best Graph (All 5 Variants)

IP4 — Two Variables, One Graph

(a) Graph: Scatter plot. Both hours of sleep and cognitive performance score are quantitative continuous variables. The goal is to explore their relationship, so a scatter plot is the correct choice. Each of the 85 participants becomes a point: x = hours of sleep, y = test score.

(b) Axes: Hours of sleep on the x-axis (explanatory variable); cognitive performance score on the y-axis (response variable). Convention: the variable hypothesized to "explain" the other goes on x.

(c) Pattern for positive association: An upward trend from lower-left to upper-right. Points with small x (few hours of sleep) cluster toward small y (low scores), and points with large x (more sleep) cluster toward large y (higher scores). This upward pattern — called a positive association — supports the hypothesis that more sleep correlates with better performance.

IP5 — Critique the Crime Rate Graph

The graph shows crime rates: Year 1: 958, Year 2: 955, Year 3: 952, Year 4: 950, Year 5: 947, Year 6: 943 (incidents per 100,000). Y-axis: 940 to 960.

(a) Misleading technique: Truncated y-axis. Starting at 940 compresses the y-range to 20 units. The 15-unit drop (958 → 943) spans 75% of the chart height, making the trend look dramatic. On an axis from 0 to 1,000, the line would be nearly flat — accurately reflecting that crime changed by only 1.57%.

(b) True percentage decrease:

\[ \frac{958 - 943}{958} \times 100 = \frac{15}{958} \times 100 \approx 1.57\% \]

A 1.57% decline over 6 years — real, but not a "plummet."

(c) Honest redraw: Start the y-axis at 0 (or at a clearly indicated break point with a zigzag symbol). Label both the axis and each data point with its exact value. The line would appear as a very gentle downward slope — accurately conveying the modest improvement. The headline should read "Crime rate declines 1.6% over six years" rather than "plummets."
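The numbers in parts (a) and (b) follow directly from the six data points. A brief sketch, where the helper name `height_share` is illustrative rather than anything from the lesson:

```python
rates = [958, 955, 952, 950, 947, 943]  # incidents per 100,000, Years 1-6

drop = rates[0] - rates[-1]
pct_decrease = drop / rates[0] * 100

def height_share(change, axis_min, axis_max):
    """Share of the chart height that a change of `change` units spans."""
    return change / (axis_max - axis_min)

print(f"Drop: {drop} units ({pct_decrease:.2f}%)")
print(f"Axis 940-960: drop spans {height_share(drop, 940, 960):.0%} of the height")
print(f"Axis 0-1000:  drop spans {height_share(drop, 0, 1000):.1%} of the height")
```

The same 15-unit drop fills three quarters of the truncated chart but only a sliver of the zero-based one, which is the gap between "plummets" and "declines 1.6%."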

---

Section 7 — Mastery Check

M1 — Feynman Test: Histogram vs. Bar Chart

Key points a complete answer should include:

M2 — Apply: Soccer Player Dataset

Part A — Distribution of distance run per game (quantitative continuous): Histogram. Distance run is measured along a continuum. A histogram groups values into class intervals and shows where most players cluster, the spread, and the shape.

Part B — Relationship between distance run and goals scored (two quantitative variables): Scatter plot. Both variables are quantitative; the goal is to explore whether players who run more tend to score more (or less). Each player = one point; x = distance run, y = goals scored.

Part C — Number of players per position (qualitative nominal — Goalkeeper, Defender, Midfielder, Forward): Bar chart. Position is categorical with no natural numerical scale. Four bars, one per position category, with gaps between them.

M3 — Error Detection: Unequal Class Widths

The error: The student used the same bar width for all four classes, but the 30–49 class spans 20 cm while the others span only 10 cm. In a histogram, area = frequency (when displayed correctly). When class widths are unequal, the y-axis must show frequency density = f ÷ class width, so that the area of each bar correctly represents its frequency.

Frequency densities:

\[ \text{10–19: } \frac{4}{10} = 0.40 \quad \text{20–29: } \frac{9}{10} = 0.90 \quad \text{30–49: } \frac{11}{20} = 0.55 \quad \text{50–59: } \frac{6}{10} = 0.60 \]

On a frequency density histogram, the 20–29 bar (0.90) would be the tallest, not the 30–49 bar (0.55). The student's claim that "the third bar is the most frequent" is misleading because plotting raw frequency at equal bar widths visually inflates the wide class. The 20–29 range actually has the highest concentration of plants.
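The density calculation is easy to automate when class widths differ. A minimal sketch, assuming integer class limits so that a class such as 30–49 has width 49 − 30 + 1 = 20; the function name `density` is chosen here for illustration.

```python
# (lower limit, upper limit, frequency) for each class, from the solution above.
classes = [(10, 19, 4), (20, 29, 9), (30, 49, 11), (50, 59, 6)]

def density(lower, upper, f):
    """Frequency density = f / class width, for integer class limits."""
    width = upper - lower + 1
    return f / width

densities = {f"{lo}–{hi}": density(lo, hi, f) for lo, hi, f in classes}

most_frequent = max(classes, key=lambda c: c[2])  # largest raw count
densest = max(densities, key=densities.get)       # largest count per cm

print(f"Largest raw frequency: {most_frequent[0]}–{most_frequent[1]}")
print(f"Highest density:       {densest}")
```

The two answers disagree, which is exactly the student's error: the widest class wins on raw count but not on concentration.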

---

Section 8 — Boss Fight

Path A — The Analyst

Frequency table (classes: 3–5, 6–8, 9–11, 12–14, n = 25):

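| Class | \( f \) | \( f_r \) | \( cf \) | \( cf_r \) |
|-------|---------|-----------|----------|------------|
| 3–5   | 6       | 0.24      | 6        | 0.24       |
| 6–8   | 9       | 0.36      | 15       | 0.60       |
| 9–11  | 7       | 0.28      | 22       | 0.88       |
| 12–14 | 3       | 0.12      | 25       | 1.00       |
| Total | 25      | 1.00      |          |            |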

Tally verification:

Histogram description:

Class width issue: The originally proposed classes (2–4, 5–7, 8–10, 11–14) have unequal widths — the last class is 4 units wide instead of 3. Using classes 3–5, 6–8, 9–11, 12–14 (each 3 units wide) corrects this, and the data range (3 to 14) is fully covered.

Path B — The Architect

Graph 1 — Holiday Revenue:

Graph 2 — Satisfaction Distribution:

Graph 3 — Market Share Pie:

---

Section 9 — Challenge Problems

Challenge 1 — The Ogive

Using the frequency table from Example 1 (defective items per batch):

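| Class | Upper boundary | \( cf_r \) |
|-------|----------------|------------|
| 1–2   | 2.5            | 0.10       |
| 3–4   | 4.5            | 0.40       |
| 5–6   | 6.5            | 0.75       |
| 7–8   | 8.5            | 0.95       |
| 9–10  | 10.5           | 1.00       |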

Median estimate (50th percentile): The target value \( cf_r = 0.50 \) lies between the ogive points for the 3–4 class (cf_r = 0.40, upper boundary 4.5) and the 5–6 class (cf_r = 0.75, upper boundary 6.5), so the median falls inside the 5–6 class.

Linear interpolation:

\[ x_{\text{median}} = 4.5 + \frac{0.50 - 0.40}{0.75 - 0.40} \times (6.5 - 4.5) = 4.5 + \frac{0.10}{0.35} \times 2 \approx 4.5 + 0.571 \approx 5.07 \text{ defects} \]

Preview of DS-5: The ogive is the graphical tool for reading any percentile. The p-th percentile is estimated by drawing a horizontal line at \( cf_r = p/100 \) and reading the corresponding x-value.
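The interpolation generalizes to any percentile covered by the table. The sketch below is one way to code it, with `percentile_from_ogive` as an illustrative name; it only handles targets between the first and last tabulated points, because the lower boundary of the first class is not given here.

```python
# Ogive points from the table: (upper class boundary, cumulative relative frequency).
ogive = [(2.5, 0.10), (4.5, 0.40), (6.5, 0.75), (8.5, 0.95), (10.5, 1.00)]

def percentile_from_ogive(points, p):
    """Estimate the p-th percentile by linear interpolation along the ogive."""
    target = p / 100
    # Walk through consecutive pairs of points to find the bracketing segment.
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the tabulated ogive points")

median = percentile_from_ogive(ogive, 50)
print(f"Estimated median: {median:.2f} defects")
```

Calling `percentile_from_ogive(ogive, 75)` returns 6.5, the upper boundary of the 5–6 class, since that is exactly where the cumulative relative frequency reaches 0.75.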

Challenge 2 — Does Bin Width Matter?

(a) Changing bin width changes how values are aggregated. Wide bins merge many observations together, smoothing the distribution and potentially hiding structure. The 3-class histogram lumped together the 24–26 and 27–29 subgroups into one "24–29" bar, hiding the right-skew that Histogram B makes visible.

(b) Histogram B (6 classes) reveals more meaningful structure for n = 40. Sturges' rule gives \( k \approx 1 + \log_2(40) \approx 6.3 \), confirming 6 classes is appropriate. Too few bins smooth away real patterns; too many bins introduce noise.

(c) With 18 classes (width = 1 year), n = 40 gives an average of ~2.2 observations per bar. Many bars would contain 0, 1, or 2 values. The histogram would be extremely jagged — every accidental gap in the data would appear as a bar of height 0, creating a false impression of "holes" in the distribution when there are none.
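Both the Sturges estimate in (b) and the observations-per-bar figure in (c) are one-line calculations. A small sketch, with the function name `sturges` chosen here for illustration:

```python
import math

def sturges(n):
    """Sturges' rule: suggested number of classes for n observations."""
    return 1 + math.log2(n)

n = 40
k = sturges(n)
print(f"Sturges' rule for n = {n}: k ≈ {k:.1f}")

# Average observations per bar for each bin count discussed above.
for bins in (3, 6, 18):
    print(f"{bins:>2} classes: about {n / bins:.1f} observations per bar")
```

Sturges' rule is a rule of thumb, not a requirement; it simply gives a starting point that here happens to agree with Histogram B.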

Challenge 3 — The Double Y-Axis Debate

(a) Why dual y-axis misleads: The scales can be freely adjusted to make two completely unrelated variables appear to move together — or to make genuinely correlated variables appear independent. Visual alignment of two lines is entirely a function of scale choices, not of actual correlation. This makes any correlation claim from a dual y-axis graph highly suspect without independent verification.

(b) Legitimate uses: A dual y-axis is defensible when (1) both variables are contextually related and telling the same story (e.g., temperature and precipitation on a climate chart), (2) both axes are clearly labeled with units, (3) the designer does not manipulate scales to create false visual alignment, and (4) the chart does not imply a direct proportional comparison between the two scales.

(c) Confounding variable: Both ice cream sales and drowning rates are driven by a third variable — summer heat. Hot weather causes more people to buy ice cream and more people to swim (creating more opportunities to drown). When a third "lurking" or confounding variable drives both measured variables, a strong correlation arises even without any causal link between them. The ice cream → drowning inference is a classic spurious correlation. This concept will be formalized in REG-1 (Correlation Analysis) and is one of the most important principles in applied statistics.

You've completed DS-2: Data Visualization! The next lesson is DS-3: Central Tendency Measures, which uses the frequency distributions you've built here to compute means, medians, and modes.