// LEARN — SCATTERPLOT

What a scatterplot is

The perceptual mechanism

A scatterplot encodes two quantitative variables simultaneously as position on two orthogonal axes, the most accurate perceptual channel available for quantitative data. Each observation becomes a point whose X coordinate encodes one variable and whose Y coordinate encodes the other. The spatial distribution of the point cloud reveals the relationship between the two variables: the viewer's visual system detects the overall slope, clustering, spread, and outliers pre-attentively, before any conscious analysis.

No other chart type answers the question "do these two variables move together?" as directly. A bar chart can show both variables, but not their relationship. A line chart can show one variable over time, but not its relationship to a second. The scatterplot is the only chart built specifically for this task.

Reading correlation from a scatterplot

Direction: a cloud sloping up to the right is a positive correlation (both variables increase together); down to the right is negative (one rises as the other falls); a horizontal or circular cloud indicates no linear relationship. Shape: a straight diagonal band is linear; a curved band may be exponential, logarithmic, or U-shaped, and each calls for a different model. Strength: a narrow, tight band indicates strong correlation; a wide, dispersed cloud indicates weak correlation. Outliers: points far from the main cloud are worth naming, because they often carry the most analytical value.

This chart computes and displays the Pearson correlation coefficient (r) and R² live. Pearson r ranges from −1 (perfect negative) to +1 (perfect positive); 0 means no linear relationship. R² is the proportion of variance in Y explained by X under the linear model; with a single predictor it equals r squared.
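
As a minimal sketch of what such a readout computes, the Python below derives Pearson r from each variable's deviations about its mean, then squares it to obtain R². The function name `pearson_r` and the toy values are assumptions made for illustration; they are not taken from the chart's source code or its dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Toy values invented for illustration (not the chart's data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)   # about 0.775: a fairly strong positive correlation
r_squared = r ** 2      # about 0.60: share of variance in y explained by x
```

Squaring r to get R² is valid only for a straight-line fit with one predictor; a curved model would need R² computed from its own residuals.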

The trend line and its limits

The dashed trend line is an Ordinary Least Squares (OLS) linear regression line — it minimises the sum of squared vertical distances from each point to the line. It is the correct summary of a linear relationship and can be used for interpolation within the observed range. Three things it cannot do: extrapolate reliably beyond the data range; detect non-linear relationships (a curved trend would require a polynomial or log model); or prove causation.
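
The closed-form OLS fit can be sketched in a few lines. The helper names `ols_line` and `interpolate` and the toy values are hypothetical, chosen for illustration; the range check in `interpolate` is one possible way to enforce the "no extrapolation" limit, not a description of how this chart behaves.

```python
def ols_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when every x is identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # minimises squared vertical distances
    intercept = mean_y - slope * mean_x    # the line passes through the means
    return slope, intercept

def interpolate(x, xs, slope, intercept):
    """Predict y at x, refusing to extrapolate outside the observed x range."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x lies outside the observed range; extrapolation is unreliable")
    return slope * x + intercept

# Toy values invented for illustration (not the chart's data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = ols_line(xs, ys)             # roughly 0.6 and 2.2
y_mid = interpolate(2.5, xs, slope, intercept)  # roughly 3.7, inside the data range
```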

⚠ Correlation is not causation. A strong Pearson r means the two variables move together — it does not mean one causes the other. A third unobserved variable (a confounder) may be driving both. This is not a limitation of the chart type; it is a fundamental principle of statistical inference that must be stated whenever correlation is presented.
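
A small simulation can make the confounder argument concrete. In the sketch below, a hidden variable `z` drives both `x` and `y`, and neither of them influences the other. All names, the noise levels, and the sample size are assumptions chosen for illustration. Subtracting `z` removes its effect only because each variable is built here as `z` plus independent noise.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

z = [rng.gauss(0, 1) for _ in range(n)]      # hidden confounder
x = [zi + rng.gauss(0, 0.5) for zi in z]     # x depends on z plus its own noise
y = [zi + rng.gauss(0, 0.5) for zi in z]     # y depends on z plus its own noise

# x and y are strongly correlated even though neither causes the other.
r_xy = pearson_r(x, y)

# Remove the confounder's contribution; only independent noise remains.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_resid, y_resid)        # close to zero
```

With these settings the theoretical correlation between `x` and `y` is 0.8, while the correlation of the residuals is zero up to sampling noise.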

Why it was chosen for this data

The dataset consists of paired numerical observations across named entities (countries), and the message is a bivariate relationship: does economic wealth predict longevity? The scatterplot is the only correct chart for this question. A bar chart of GDP and a separate bar chart of life expectancy would show both variables but hide their relationship entirely. The scatterplot shows the positive correlation, the logarithmic saturation at high GDP, and the outliers (USA: high GDP, lower life expectancy than peers; Nigeria: low on both) all at once.

The one design decision worth knowing

The brush selection (drag on the chart) isolates a rectangular region of the point cloud and recomputes Pearson r and R² for only the selected points. This is not a decoration — it is the correct way to investigate whether correlation holds within a sub-range, or whether a different relationship structure operates in one part of the space (e.g., among high-income countries only). Local correlation analysis is how scatterplots graduate from presentation to analysis.
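
The brush-and-recompute idea can be sketched as a rectangular filter followed by the same correlation calculation. The function names `brush` and `pearson_r`, the rectangle parameters, and the toy points are hypothetical; this is not the chart's actual implementation. Returning `None` when r is undefined (fewer than two points, or a constant variable) is one possible design choice among several.

```python
import math

def pearson_r(points):
    """Pearson r for a list of (x, y) pairs, or None if it is undefined."""
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        return None  # a constant variable has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def brush(points, x0, x1, y0, y1):
    """Keep the points inside the rectangle [x0, x1] by [y0, y1], edges included."""
    return [(x, y) for x, y in points if x0 <= x <= x1 and y0 <= y <= y1]

# Toy points invented for illustration: a steep rise that flattens out.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 4), (6, 5), (7, 4), (8, 5)]

r_all = pearson_r(points)                        # whole cloud
r_low = pearson_r(brush(points, 1, 4, 0, 10))    # left half: perfectly linear
r_high = pearson_r(brush(points, 5, 8, 0, 10))   # right half: much weaker
```

In this toy cloud the overall r sits between the two local values, which is the kind of structure a brush selection is meant to expose.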

// Framework — FT Visual Vocabulary