A scatterplot encodes two quantitative variables simultaneously as position on two orthogonal axes — the most accurate perceptual channel available for quantitative data. Each observation becomes a point whose X coordinate encodes one variable and Y coordinate encodes the other. The spatial distribution of the point cloud reveals the relationship between the two variables: the viewer's visual system detects the overall slope, clustering, spread, and outliers pre-attentively, before any conscious analysis.
Few other chart types answer the question "do these two variables move together?" as directly. Paired bar charts can show both variables, but the reader has to reconstruct the relationship by comparing bars one entity at a time. A line chart shows how one variable changes over time, not how it relates to a second. The scatterplot is built specifically for this task.
Direction: a cloud sloping up to the right is a positive correlation (both variables increase together); down to the right is negative (one rises as the other falls); a horizontal or circular cloud shows no linear relationship. Shape: a straight diagonal band is linear; a curved band may be exponential, logarithmic, or U-shaped, and each requires a different model. A symmetric U-shape can produce r near 0 even though the variables are clearly related. Strength: a narrow, tight band indicates strong correlation; a wide, dispersed cloud indicates weak correlation. Outliers: points far from the main cloud are worth naming, because they often carry the most analytical value.
This chart computes and displays the Pearson correlation coefficient (r) and R² live. Pearson r ranges from −1 (perfect negative) to +1 (perfect positive); 0 means no linear relationship. R² is the proportion of variance in Y explained by X under the linear model; for a straight-line fit with a single predictor it is simply r squared.
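As a rough illustration of the two statistics described above, the sketch below computes Pearson r and R² from paired values using only the Python standard library. The function names `pearson_r` and `r_squared` and the sample values are hypothetical; this is not the chart's own implementation, which the text does not show.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """R² of the one-predictor linear fit, which equals r squared."""
    return pearson_r(xs, ys) ** 2


# Illustrative values only, not taken from the chart's dataset.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(round(pearson_r(xs, ys), 2))  # 0.8
print(round(r_squared(xs, ys), 2))  # 0.64
```

The function raises an error when either variable is constant, because the denominator of r is then zero and the coefficient is undefined rather than 0.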
The dashed trend line is an Ordinary Least Squares (OLS) linear regression line — it minimises the sum of squared vertical distances from each point to the line. It is the correct summary of a linear relationship and can be used for interpolation within the observed range. Three things it cannot do: extrapolate reliably beyond the data range; detect non-linear relationships (a curved trend would require a polynomial or log model); or prove causation.
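The least-squares line can be sketched in a few lines of Python. The helper names `ols_fit` and `interpolate` are assumptions made for illustration, and refusing to predict outside the observed range is one possible way to encode the extrapolation caveat, not something the chart is known to do.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # minimises the squared vertical distances
    intercept = mean_y - slope * mean_x    # the line passes through the point of means
    return slope, intercept


def interpolate(x, xs, ys):
    """Predict y at x, refusing to extrapolate beyond the observed x range."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x is outside the observed range; the line is not reliable there")
    slope, intercept = ols_fit(xs, ys)
    return slope * x + intercept


# Illustrative values only, not taken from the chart's dataset.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
slope, intercept = ols_fit(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 0.8 0.6
print(round(interpolate(2.5, xs, ys), 2))    # 2.6
```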
⚠ Correlation is not causation. A strong Pearson r means the two variables move together — it does not mean one causes the other. A third unobserved variable (a confounder) may be driving both. This is not a limitation of the chart type; it is a fundamental principle of statistical inference that must be stated whenever correlation is presented.
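A small simulation can make the confounder argument concrete. In the sketch below, a hidden variable z drives both x and y, while x and y have no direct link. The data are randomly generated, and the helper `partial_r` (the standard first-order partial correlation formula) is introduced here purely for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
zs = [rng.gauss(0, 1) for _ in range(2000)]   # hidden confounder
xs = [z + rng.gauss(0, 0.5) for z in zs]      # x depends only on z plus noise
ys = [z + rng.gauss(0, 0.5) for z in zs]      # y depends only on z plus noise

raw = pearson_r(xs, ys)
controlled = partial_r(xs, ys, zs)
print(f"r(x, y) = {raw:.2f}, partial r given z = {controlled:.2f}")
```

Under these assumptions the raw correlation is strong, yet it largely disappears once z is accounted for. A scatterplot of x against y alone would give no hint of this.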
The dataset consists of paired numerical observations across named entities (countries), and the message is a bivariate relationship: does economic wealth predict longevity? The scatterplot is the most direct chart for this question. A bar chart of GDP and a separate bar chart of life expectancy would show both variables but hide their relationship entirely. The scatterplot shows the positive correlation, the logarithmic saturation at high GDP (life expectancy gains flatten as GDP rises), and the outliers (USA: high GDP, lower life expectancy than peers; Nigeria: low on both) all at once.
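The effect of logarithmic saturation on Pearson r can be sketched with synthetic numbers. The points below are generated from an assumed formula, y = 40 + 9 · log10(x), and are not real country data; the variable names `gdp` and `life` are only labels for the illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic points on an exact logarithmic curve; not real country data.
gdp = [500, 1000, 2000, 5000, 10000, 20000, 50000]
life = [40 + 9 * math.log10(x) for x in gdp]

r_raw = pearson_r(gdp, life)                           # straight line through a curved cloud
r_log = pearson_r([math.log10(x) for x in gdp], life)  # straight line after log-scaling x
print(f"raw x: r = {r_raw:.3f}; log10 x: r = {r_log:.3f}")
```

Even though y is a perfect function of x here, r on the raw values falls short of 1 because Pearson r measures only linear association. Log-scaling the X axis straightens the curve and recovers it.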
The brush selection (drag on the chart) isolates a rectangular region of the point cloud and recomputes Pearson r and R² for only the selected points. This is not a decoration — it is the correct way to investigate whether correlation holds within a sub-range, or whether a different relationship structure operates in one part of the space (e.g., among high-income countries only). Local correlation analysis is how scatterplots graduate from presentation to analysis.
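The brush recomputation can be sketched as a filter followed by the same correlation formula. The function name `brush_stats`, the returned dictionary shape, and the three-point minimum are assumptions for illustration; the text does not describe how the chart handles very small selections.

```python
import math


def brush_stats(points, x0, x1, y0, y1):
    """Recompute n, r and R² for the points inside a rectangular brush.

    r and r2 are None when fewer than three points are selected (an assumed
    threshold) or when either variable is constant within the selection.
    """
    x_lo, x_hi = sorted((x0, x1))  # accept a drag in any direction
    y_lo, y_hi = sorted((y0, y1))
    selected = [(x, y) for x, y in points
                if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
    n = len(selected)
    result = {"n": n, "r": None, "r2": None}
    if n < 3:
        return result
    mean_x = sum(x for x, _ in selected) / n
    mean_y = sum(y for _, y in selected) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in selected)
    syy = sum((y - mean_y) ** 2 for _, y in selected)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in selected)
    if sxx == 0 or syy == 0:
        return result
    r = sxy / math.sqrt(sxx * syy)
    result["r"] = r
    result["r2"] = r * r
    return result


# Illustrative points: a rising band at low x, a flat plateau at high x.
points = [(1, 2), (2, 4), (3, 6), (4, 8),
          (10, 9.0), (11, 8.5), (12, 9.2), (13, 8.8)]
low = brush_stats(points, 0, 5, 0, 10)
high = brush_stats(points, 9, 14, 0, 10)
print(low["n"], round(low["r"], 2))    # 4 1.0
print(high["n"], round(high["r"], 2))  # 4 0.04
```

The two brushes return very different coefficients from the same cloud, which is the kind of local structure a single global r would hide. With only a handful of points in a selection, the recomputed r is noisy and should be read with caution.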
FT Visual Vocabulary category: Correlation — "Showing the relationship between two or more variables." Abela quadrant: Relationship (variables against each other). Tufte principle applied: the point cloud is all data; the trend line summarises it; the axis labels name it. No legend is needed for the primary encoding — position is self-explaining. Colour is added only for secondary grouping, always with a redundant text label.