Deep dive: stats + scales + guides

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2025

February 4, 2025

Announcements

  • (Tentative) Second-half topic schedule posted
  • Homework 02 due tomorrow
  • Project 01

Project 01

Visualization critique

Media bias in the United States

  • What is the story?
  • How effective is the design?

Setup

World Bank indicators

library(tidyverse) # provides read_rds(), ggplot(), and the dplyr verbs used below
world_bank <- read_rds("data/wb-indicators.rds")
world_bank
# A tibble: 181 × 8
   iso2c country  year gdp_per_cap female_labor_pct life_exp    pop income_level
   <chr> <chr>   <dbl>       <dbl>            <dbl>    <dbl>  <dbl> <fct>       
 1 AO    Angola   2021       1927.             49.9     61.6 3.45e7 Lower middl…
 2 AL    Albania  2021       6377.             44.1     76.5 2.81e6 Upper middl…
 3 AE    United…  2021      44332.             17.7     78.7 9.37e6 High income 
 4 AR    Argent…  2021      10651.             42.5     75.4 4.58e7 Upper middl…
 5 AM    Armenia  2021       4973.             52.7     72.0 2.79e6 Upper middl…
 6 AU    Austra…  2021      60697.             47.2     83.3 2.57e7 High income 
 7 AT    Austria  2021      53518.             46.8     81.2 8.96e6 High income 
 8 AZ    Azerba…  2021       5408.             49.9     69.4 1.01e7 Upper middl…
 9 BI    Burundi  2021        221.             51.9     61.7 1.26e7 Low income  
10 BE    Belgium  2021      51850.             46.8     81.9 1.16e7 High income 
# ℹ 171 more rows

Stats

Stats <-> geoms

  • A statistical transformation (stat) transforms the data, typically by summarizing it
  • Many of {ggplot2}’s stats work behind the scenes to compute the values that important geoms draw
stat             geom
stat_bin()       geom_freqpoly(), geom_histogram()
stat_count()     geom_bar()
stat_bin2d()     geom_bin2d()
stat_bindot()    geom_dotplot()
stat_binhex()    geom_hex()
stat_boxplot()   geom_boxplot()
stat_contour()   geom_contour()
stat_quantile()  geom_quantile()
stat_smooth()    geom_smooth()
stat_sum()       geom_count()
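
To see the stat/geom pairing in practice, the sketch below builds the same histogram once from the geom side and once from the stat side, then compares the binned counts {ggplot2} computes. The data frame `toy` is a small made-up stand-in for `world_bank`, and `layer_data()` is used only to inspect the computed values.

```r
library(ggplot2)

# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  life_exp = c(61.6, 76.5, 78.7, 75.4, 72.0, 83.3, 81.2, 69.4, 61.7, 81.9)
)

# Geom side: geom_histogram() uses stat_bin() by default
via_geom <- ggplot(toy, aes(x = life_exp)) +
  geom_histogram(binwidth = 5)

# Stat side: stat_bin() drawn with the bar geom
via_stat <- ggplot(toy, aes(x = life_exp)) +
  stat_bin(geom = "bar", binwidth = 5)

# Extract the counts each layer computed
geom_counts <- layer_data(via_geom)$count
stat_counts <- layer_data(via_stat)$count
```

Both layers should hold the same per-bin counts, since the binning happens in the stat and the geom only draws the result.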

stat_boxplot()

Documentation for stat_boxplot().

Layering with stats

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)

Alternate: layering with stats

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  geom_point(stat = "summary", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)

Alternate alternate: do it with {dplyr}

world_bank |>
  group_by(income_level) |>
  summarize(median_life_exp = median(life_exp)) |>
  ggplot(mapping = aes(x = income_level)) +
  geom_point(data = world_bank, mapping = aes(y = life_exp), alpha = 0.5) +
  geom_point(mapping = aes(y = median_life_exp), color = "red", size = 5, pch = 4, stroke = 2)
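
The stat-based and hand-computed approaches should agree on the medians. A minimal check, assuming a made-up data frame `toy` with two hypothetical income levels, compares the values computed by `stat_summary()` against medians computed with base R's `tapply()`:

```r
library(ggplot2)

# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  income_level = factor(c("Low", "Low", "Low", "High", "High", "High")),
  life_exp = c(58, 61, 66, 79, 81, 84)
)

p <- ggplot(toy, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = median, color = "red", size = 5)

# Medians computed by the stat (layer 2 of the plot)
stat_medians <- layer_data(p, i = 2)$y

# Medians computed by hand, one per income level
manual_medians <- as.numeric(tapply(toy$life_exp, toy$income_level, median))
```

Each vector holds one median per income level (61 and 81 for this toy data).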

Scales

What is a scale?

  • Each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale)

  • The axis or legend (also known as a guide) is the inverse function: it allows you to convert visual properties back to data
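
As a rough illustration of the two bullets above, a linear position scale and its inverse can be written as a pair of plain R functions. The domain, range, and function names here are made up for the sketch and are not part of {ggplot2}:

```r
# Domain (data space) and range (aesthetic space); both are made-up numbers
domain <- c(40, 90)    # e.g. life expectancy in years
range_px <- c(0, 500)  # e.g. vertical position in pixels

# Scale: data value -> position
to_position <- function(x) {
  (x - domain[1]) / diff(domain) * diff(range_px) + range_px[1]
}

# Guide: position -> data value (the inverse function)
to_data <- function(px) {
  (px - range_px[1]) / diff(range_px) * diff(domain) + domain[1]
}

to_position(65)             # 250, halfway through the range
to_data(to_position(72.5))  # round trip recovers 72.5
```

Reading a value off an axis is the second function in action: the viewer starts from a position and recovers the data value.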

Scale specification

Every aesthetic in your plot is associated with exactly one scale:

# automatic scales
ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8)
# manual scales
ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()

Anatomy of a scale function

scale_<aes>_<type>()

  • Always starts with scale
  • <aes>: Name of the primary aesthetic (e.g., color, shape, x)
  • <type>: Name of the scale (e.g., continuous, discrete, brewer)

Some scale functions add a fourth component to the function name

scale_color_viridis_b() # binned palette
scale_color_viridis_c() # continuous palette
scale_color_viridis_d() # discrete palette
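
One way to explore the naming pattern is to list the functions {ggplot2} exports and split their names into components. This is only an exploratory sketch; the variable names are arbitrary:

```r
# List ggplot2's exported functions whose names start with scale_color_
exports <- getNamespaceExports("ggplot2")
color_scales <- sort(grep("^scale_color_", exports, value = TRUE))

# Split each name on "_" to expose its components
parts <- strsplit(color_scales, "_", fixed = TRUE)
n_parts <- lengths(parts)

# Names with a fourth component, such as scale_color_viridis_d
four_part <- color_scales[n_parts == 4]
```

Printing `color_scales` and `four_part` gives a quick overview of which scale types are available for the color aesthetic.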

Guess the output

What will the x-axis label of the following plot say?

ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (percentage of total workforce)")

“Address” messages

ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (percentage of total workforce)")
Scale for x is already present.
Adding another scale for x, which will replace the existing scale.

What happens with an incorrect pairing?

ggplot(
  data = world_bank,
  mapping = aes(
    x = income_level,
    y = life_exp
  )
) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()
Error in `scale_x_continuous()`:
! Discrete values supplied to continuous scale.
ℹ Example values: Lower middle income, Upper middle income, High income, Upper
  middle income, and Upper middle income

ggplot(
  data = world_bank,
  mapping = aes(
    x = income_level,
    y = life_exp
  )
) +
  geom_point(alpha = 0.5) +
  scale_y_discrete()

Transformations

When working with continuous data, the default is to map linearly from the data space onto the aesthetic space, but this mapping can be transformed.

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5)

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(transform = "log10") # `trans` was renamed to `transform` in {ggplot2} 3.5.0
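
A quick way to confirm what the transformation does is to inspect the values {ggplot2} actually positions on the plot. The sketch below assumes a small made-up data frame `toy` and uses `scale_y_log10()`, which {ggplot2} provides as a shorthand for a log10-transformed continuous scale:

```r
library(ggplot2)

# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  female_labor_pct = c(49.9, 44.1, 17.7, 51.9),
  gdp_per_cap = c(1927, 6377, 44332, 221)
)

p_log <- ggplot(toy, aes(x = female_labor_pct, y = gdp_per_cap)) +
  geom_point(alpha = 0.5) +
  scale_y_log10()

# The built layer data holds the transformed y values
built_y <- layer_data(p_log)$y
expected_y <- log10(toy$gdp_per_cap)
```

The axis labels still show the original units because the guide applies the inverse transformation when labeling breaks.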

Common scale transformations

Box-Cox scale transformations

Continuous scale transformations

Name        Function \(f(x)\)           Inverse \(f^{-1}(y)\)
atanh       \(\tanh^{-1}(x)\)           \(\tanh(y)\)
exp         \(e^x\)                     \(\log(y)\)
identity    \(x\)                       \(y\)
log         \(\log(x)\)                 \(e^y\)
log10       \(\log_{10}(x)\)            \(10^y\)
log2        \(\log_2(x)\)               \(2^y\)
logit       \(\log(\frac{x}{1 - x})\)   \(\frac{1}{1 + e^{-y}}\)
pow10       \(10^x\)                    \(\log_{10}(y)\)
probit      \(\Phi^{-1}(x)\)            \(\Phi(y)\)
reciprocal  \(x^{-1}\)                  \(y^{-1}\)
reverse     \(-x\)                      \(-y\)
sqrt        \(x^{1/2}\)                 \(y^2\)
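
Each transformation in {scales} is an object that bundles a `transform` function with its `inverse`. The sketch below checks two of these pairs numerically using `logit_trans()` and `probit_trans()`; recent {scales} releases also expose the same objects under `transform_*()` names. The input values are arbitrary.

```r
library(scales)

p <- c(0.1, 0.5, 0.9)

# logit: f(x) = log(x / (1 - x)), inverse 1 / (1 + exp(-y))
logit <- logit_trans()
logit_values <- logit$transform(p)
logit_round_trip <- logit$inverse(logit_values)

# probit: f(x) is the standard normal quantile function, inverse is the normal CDF
probit <- probit_trans()
probit_values <- probit$transform(p)
probit_round_trip <- probit$inverse(probit_values)
```

Applying `inverse` after `transform` should return the original probabilities in both cases.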

Convenience functions for transformations

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(transform = "log10") # `trans` was renamed to `transform` in {ggplot2} 3.5.0

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_log10()

Application exercise

ae-04

Instructions

  • Go to the course GitHub org and find your ae-04 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

Work through part 1

Implement log transformations.

07:00

Guides

What is a guide?

Guides are legends and axes:

Common components of axes and legends

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth"
  )

Customizing axes

Why do 50 and 90 not appear on the \(y\)-axis?

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10)
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    breaks = c(0, 5e04, 1e05),
    labels = c("$0", "$50,000", "$100,000")
  )

Customizing axes

library(scales)

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_currency()
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_currency(scale_cut = cut_short_scale())
  )
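
The `label_*()` functions from {scales} return functions, so they can be tried on a plain vector before being passed to a scale. A small sketch using the same break values as the manual labels earlier in this section:

```r
library(scales)

gdp_breaks <- c(0, 5e4, 1e5)

# label_currency() returns a labelling function; call it on the break values
full_labels <- label_currency()(gdp_breaks)

# With scale_cut, large values are shortened using suffixes such as K and M
short_labels <- label_currency(scale_cut = cut_short_scale())(gdp_breaks)

full_labels
short_labels
```

Testing a labeller this way makes it easier to tune arguments before committing to them in `scale_x_continuous()`.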

Modifying scale guides

Scale guides

Scale type                                          Default guide type  Function
Continuous scales for color/fill aesthetics         colorbar            guide_colorbar()
Binned scales for color/fill aesthetics             colorsteps          guide_colorsteps()
Position scales (continuous, binned and discrete)   axis                guide_axis()
Discrete scales (except position scales)            legend              guide_legend()
Binned scales (except position/color/fill scales)   bins                guide_bins()

Implementation

... +
  guides(color = guide_colorbar(theme = theme(...), ...))

... +
  scale_color_gradient(guide = guide_colorbar(theme = theme(...), ...))

Example implementation

base_plot + guides(color = guide_colorbar(reverse = TRUE))

base_plot + guides(color = guide_colorbar(theme = theme(legend.key.height = unit(2, "cm"))))

base_plot + guides(color = guide_colorbar(theme = theme(legend.direction = "horizontal")))

base_plot + guides(color = guide_colorbar(theme = theme(legend.text.position = "left")))
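
The examples above rely on a `base_plot` object that is not defined in this section. A self-contained sketch, assuming a made-up data frame `toy` and a hypothetical `base_plot` with a continuous color mapping, shows the two implementation routes side by side:

```r
library(ggplot2)

# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  gdp_per_cap = c(1927, 6377, 44332, 221),
  life_exp = c(61.6, 76.5, 78.7, 61.7),
  female_labor_pct = c(49.9, 44.1, 17.7, 51.9)
)

# Hypothetical base plot; a continuous color scale gets a colorbar guide by default
base_plot <- ggplot(toy, aes(x = gdp_per_cap, y = life_exp, color = female_labor_pct)) +
  geom_point(size = 3)

# Route 1: set the guide through guides()
via_guides <- base_plot +
  guides(color = guide_colorbar(reverse = TRUE))

# Route 2: set the guide through the scale's guide argument
via_scale <- base_plot +
  scale_color_gradient(guide = guide_colorbar(reverse = TRUE))
```

Route 1 is convenient when the default scale is otherwise fine; route 2 keeps the guide settings next to other scale customizations such as `name` or `breaks`.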

Application exercise

ae-04

Work through part 2

Recreate this plot.

10:00

Wrap up

Recap

  • {ggplot2} implements statistical transformations (typically as defaults)
  • Scales map data values to aesthetic values
  • Guides (axes and legends) let readers convert visual properties back to data values
  • {scales} package provides a wide range of transformation and formatting functions
  • Use guide_*() functions to customize the appearance of guides

Acknowledgements

Six more weeks of winter