Deep dive: stats + scales + guides

Notes

Modified

May 28, 2026

Learning objectives

Define the statistical transformation and scales components of the grammar of graphics
Demonstrate how to use stat_*() functions from {ggplot2}
Implement scale transformations for $x$ and $y$ axes
Modify guides to change the visual appearance of scales on plots

Data: World economic measures

The World Bank publishes a rich and detailed set of socioeconomic indicators spanning several decades and dozens of topics. Here we focus on a few key indicators for the year 2021.

gdp_per_cap - GDP per capita (current USD)
pop - Total population
life_exp - Life expectancy at birth, total (years)
female_labor_pct - Labor force, female (% of total labor force)
income_level - Classification of economies based on national income levels

world_bank <- read_rds("data/wb-indicators.rds")
glimpse(world_bank)

Rows: 174
Columns: 8
$ iso2c            <chr> "AF", "AO", "AL", "AE", "AR", "AM", "AU", "AT", "AZ", "BI", "BE…
$ country          <chr> "Afghanistan", "Angola", "Albania", "United Arab Emirates", "Ar…
$ year             <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ gdp_per_cap      <dbl> 413.7579, 2916.1366, 9730.8692, 49850.6872, 14261.8466, 8159.08…
$ female_labor_pct <dbl> 6.848413, 49.398305, 46.369111, 22.302110, 43.248796, 47.193740…
$ life_exp         <dbl> 66.03500, 64.61700, 79.60200, 82.90900, 77.39500, 77.46585, 83.…
$ pop              <dbl> 41454761, 36749906, 2414095, 10483751, 45538401, 2964300, 26659…
$ income_level     <fct> Low income, Lower middle income, Upper middle income, High inco…

Statistical transformations

Stats and geoms

Every geom_*() in {ggplot2} has an associated stat — a statistical transformation that processes the raw data before it is drawn. Most of the time the transformation is invisible, but many geoms perform meaningful transformations behind the scenes.

`stat`	`geom` that uses it
`stat_bin()`	`geom_bar()`, `geom_freqpoly()`, `geom_histogram()`
`stat_bin2d()`	`geom_bin2d()`
`stat_bindot()`	`geom_dotplot()`
`stat_binhex()`	`geom_hex()`
`stat_boxplot()`	`geom_boxplot()`
`stat_contour()`	`geom_contour()`
`stat_quantile()`	`geom_quantile()`
`stat_smooth()`	`geom_smooth()`
`stat_sum()`	`geom_count()`

Understanding this relationship matters when you need to override the default statistical transformation or add a summary statistic as an additional layer — for example, overlaying a group median on a strip plot.

Layering with stats

Suppose we want to show all individual data points and then highlight the median for each income group with a distinct marker. There are three equivalent ways to do this.

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +

  stat_summary(
    geom = "point",
    fun = "median",
    color = "red",
    size = 5,
    pch = 4,
    stroke = 2
  )

1: stat_summary() computes fun (here, median) for each x group and draws the result using the specified geom. pch = 4 draws an × symbol; stroke controls its line width.

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +

  geom_point(
    stat = "summary",
    fun = "median",
    color = "red",
    size = 5,
    pch = 4,
    stroke = 2
  )

1: Any geom can be given a custom stat argument. This is syntactically equivalent to stat_summary(geom = "point", ...).

world_bank |>
  group_by(income_level) |>
  summarize(median_life_exp = median(life_exp)) |>
  ggplot(mapping = aes(x = income_level)) +
  geom_point(data = world_bank, mapping = aes(y = life_exp), alpha = 0.5) +

  geom_point(
    mapping = aes(y = median_life_exp),
    color = "red",
    size = 5,
    pch = 4,
    stroke = 2
  )

1: Compute median life expectancy per group before plotting.
2: The raw data layer uses data = world_bank, overriding the piped-in summary data for this layer only.
3: The summary layer uses the piped-in data, which contains only the group medians.

Option 3 is more verbose but gives you full control over the summary computation and is easier to debug — you can inspect the summarized data frame before plotting it.

Scales

What is a scale?

A scale is a function that maps from data space to aesthetic space — it translates a range of data values (the domain) into a range of visual properties (the range). The axis or legend, called a guide, is the inverse: it lets the reader translate visual properties back into data values.

Every aesthetic in a {ggplot2} plot is associated with exactly one scale. When you write:

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp, color = income_level)
) +
  geom_point(alpha = 0.8)

{ggplot2} automatically adds scale_x_continuous(), scale_y_continuous(), and scale_color_discrete(). The explicit version produces the same output.

ggplot(
  data = world_bank,
  mapping = aes(x = female_labor_pct, y = life_exp, color = income_level)
) +
  geom_point(alpha = 0.8) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()

Anatomy of a scale function

Scale functions follow the pattern scale_<aes>_<type>():

scale — always the prefix
<aes> — the aesthetic being scaled: x, y, color, fill, shape, size, etc.
<type> — the type of scale: continuous, discrete, log10, brewer, viridis, etc.

Some scale functions add a fourth component indicating how a palette is applied:

scale_color_viridis_b()
scale_color_viridis_c()
scale_color_viridis_d()

1: binned — maps continuous data into discrete bins
2: continuous — maps continuous data to a color gradient
3: discrete — maps categorical data to distinct colors

Specifying scales more than once

Each aesthetic can only have one active scale. If you specify the same scale twice, the second call wins and {ggplot2} issues a message:

ggplot(
  data = world_bank,
  mapping = aes(x = female_labor_pct, y = life_exp, color = income_level)
) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (% of workforce)")

1: This scale is overridden — {ggplot2} will warn you.
2: The second scale_x_continuous() wins; this is the label that appears on the axis.

Scale for x is already present.
Adding another scale for x, which will replace the existing scale.

This is a common source of confusion when building up a plot incrementally. If you see “Scale for x is already present. Adding another scale for x, which will replace the existing scale”, you have a duplicate scale somewhere.

Incorrect scale pairings

Scale type must match the variable type. Applying a continuous scale to a discrete variable (or vice versa) produces errors or unexpected behavior:

ggplot(
  data = world_bank,
  mapping = aes(x = income_level, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()

1: income_level is a categorical variable; scale_x_continuous() expects a numeric one. This produces an error.

Error in `scale_x_continuous()`:
! Discrete value supplied to a continuous scale.
ℹ Example values: Low income, Lower middle income, Upper middle income, and High income.

ggplot(
  data = world_bank,
  mapping = aes(x = income_level, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_discrete()

1: life_exp is continuous; scale_y_discrete() will coerce it in unexpected ways. No error is thrown, but the axis is wrong.

Scale transformations

By default, continuous scales map data linearly to aesthetic space. The transform argument applies a mathematical transformation before mapping — useful when data is highly skewed.

ggplot(
  data = world_bank,
  mapping = aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5)

ggplot(
  data = world_bank,
  mapping = aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(transform = "log10")

1: transform = "log10" applies a log₁₀ transformation to the y-axis. Axis labels still show the original data values; only the spacing changes.

The log transformation reveals structure in the right tail of the GDP distribution that is invisible on the linear scale.

Common transformations available via the `transform` argument
Name	Function $f(x)$
`"log"`	$\log(x)$
`"log10"`	$\log_{10}(x)$
`"log2"`	$\log_2(x)$
`"sqrt"`	$\sqrt{x}$
`"reverse"`	$-x$
`"exp"`	$e^x$

For the most common transformations, {ggplot2} provides convenience scale functions that are equivalent:

ggplot(data = world_bank, mapping = aes(x = female_labor_pct, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_log10()

1: scale_y_log10() is shorthand for scale_y_continuous(transform = "log10"). The convenience functions (scale_y_log10(), scale_y_sqrt(), scale_x_reverse()) are less typing and slightly more readable.

Guides

What are guides?

Guides are the visual elements that help readers decode a scale — axes for position aesthetics, and legends for color, shape, size, and other aesthetics.

Common components of axes and legends in ggplot2 — Source: ggplot2-book, Chapter 14

Customizing axes with `scale_*()`

Axis labels, breaks, and limits are all controlled through the relevant scale_*() function. Here is a fully customized scatterplot built up argument by argument:

ggplot(data = world_bank, mapping = aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_currency(scale_cut = cut_short_scale())
  )

1: name sets the axis title. This is equivalent to using labs(y = ...).
2: breaks specifies where tick marks and gridlines appear. Values outside the data range appear only if they’re within limits.
3: limits sets the visible range. Observations outside this range are dropped (with a warning) rather than plotted off-scale.
4: label_currency(scale_cut = cut_short_scale()) from {scales} formats values as “$50K”, “$100K” etc. — much more readable than scientific notation or raw numbers.

Customizing legends with `guides()`

For color and other non-position aesthetics, the guides() function controls the guide type and its appearance. Let us build a base plot to illustrate:

base_plot <- ggplot(
  data = world_bank,
  mapping = aes(x = life_exp, y = gdp_per_cap, color = female_labor_pct)
) +
  geom_point() +
  scale_y_continuous(labels = label_currency(scale_cut = cut_short_scale())) +
  scale_color_viridis_c(labels = label_percent(scale = 1))
base_plot

The default colorbar guide has several customizable properties. Examples of common modifications are shown below.

base_plot +
  guides(color = guide_colorbar(reverse = TRUE))

1: reverse = TRUE flips the colorbar so the highest values appear at the bottom. Useful when your data narrative naturally reads top-to-bottom.

base_plot +
  guides(
    color = guide_colorbar(
      theme = theme(legend.key.height = unit(2, "cm"))
    )
  )

1: legend.key.height controls the height of the colorbar. Increasing it makes the gradient easier to read.

base_plot +
  guides(
    color = guide_colorbar(
      theme = theme(legend.direction = "horizontal")
    )
  )

1: legend.direction = "horizontal" rotates the colorbar. Horizontal bars work well at the bottom of a plot.

base_plot +
  guides(
    color = guide_colorbar(
      theme = theme(legend.text.position = "left")
    )
  )

1: legend.text.position controls which side of the bar the tick labels appear on.

Scale and guide types
Scale type	Default guide	Function
Continuous color/fill	colorbar	`guide_colorbar()`
Binned color/fill	colorsteps	`guide_colorsteps()`
Position scales	axis	`guide_axis()`
Discrete scales (except position)	legend	`guide_legend()`
Binned scales (except position/color/fill)	bins	`guide_bins()`

Summary

Statistical transformations (stats) process raw data before it is drawn; most geoms apply a stat automatically
Every aesthetic is associated with a scale function (scale_<aes>_<type>()); {ggplot2} adds default scales automatically
Scale type must match variable type — mixing continuous and discrete produces errors
Use transform (or convenience functions like scale_y_log10()) to apply axis transformations for skewed data
Guides (axes and legends) are the inverse of scales; customize them via scale_*() arguments or guides()
The {scales} package provides formatting functions (label_currency(), label_percent(), etc.) for readable axis labels

Acknowledgements

Material derived in part from STA 313: Advanced Data Visualization.