Deep dive: stats + scales + guides

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

February 6, 2024



Visualization critique

Media bias in the United States

  • What is the story?
  • How effective is the design?


Packages + figures

# load packages
library(tidyverse)
library(scales)

# set default theme for ggplot2
theme_set(theme_minimal(base_size = 14))

YAML options

  fig-width: 7
  fig-asp: 0.618
  fig-retina: 2
  dpi: 150
  fig-align: center
  out-width: 80%

World Bank indicators

world_bank <- read_rds("data/wb-indicators.rds")
# A tibble: 181 × 8
   iso2c country  year gdp_per_cap female_labor_pct life_exp    pop income_level
   <chr> <chr>   <dbl>       <dbl>            <dbl>    <dbl>  <dbl> <fct>       
 1 AO    Angola   2021       1927.             49.9     61.6 3.45e7 Lower middl…
 2 AL    Albania  2021       6377.             44.1     76.5 2.81e6 Upper middl…
 3 AE    United…  2021      44332.             17.7     78.7 9.37e6 High income 
 4 AR    Argent…  2021      10651.             42.5     75.4 4.58e7 Upper middl…
 5 AM    Armenia  2021       4973.             52.7     72.0 2.79e6 Upper middl…
 6 AU    Austra…  2021      60697.             47.2     83.3 2.57e7 High income 
 7 AT    Austria  2021      53518.             46.8     81.2 8.96e6 High income 
 8 AZ    Azerba…  2021       5408.             49.9     69.4 1.01e7 Upper middl…
 9 BI    Burundi  2021        221.             51.9     61.7 1.26e7 Low income  
10 BE    Belgium  2021      51850.             46.8     81.9 1.16e7 High income 
# ℹ 171 more rows


Stats < > geoms

  • Statistical transformation (stat) transforms the data, typically by summarizing
  • Many of ggplot2’s stats run behind the scenes to compute the values that common geoms draw
stat              geom
stat_bin()        geom_bar(), geom_freqpoly(), geom_histogram()
stat_bin2d()      geom_bin2d()
stat_bindot()     geom_dotplot()
stat_binhex()     geom_hex()
stat_boxplot()    geom_boxplot()
stat_contour()    geom_contour()
stat_quantile()   geom_quantile()
stat_smooth()     geom_smooth()
stat_sum()        geom_count()


Documentation for stat_boxplot().
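
Because each geom has a default stat and each stat has a default geom, the same layer can usually be written from either side. A minimal sketch of that pairing follows. It uses the mpg data that ships with ggplot2 as a stand-in for world_bank, an assumption made only so the example runs on its own.

```r
library(ggplot2)

# geom-first: geom_histogram() uses stat = "bin" by default
p_geom <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 10)

# stat-first: stat_bin() draws with geom = "bar" by default
p_stat <- ggplot(mpg, aes(x = hwy)) +
  stat_bin(bins = 10)

# layer_data() returns the data frame the stat computed for a layer
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count
```

Both layers should yield the same bin counts, since the binning is done by the stat in either spelling.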

Layering with stats

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)

Alternate: layering with stats

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  geom_point(stat = "summary", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)

Alternate alternate: do it with dplyr

world_bank |>
  group_by(income_level) |>
  summarize(median_life_exp = median(life_exp)) |>
  ggplot(mapping = aes(x = income_level)) +
  geom_point(data = world_bank, mapping = aes(y = life_exp), alpha = 0.5) +
  geom_point(mapping = aes(y = median_life_exp), color = "red", size = 5, pch = 4, stroke = 2)
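
The three approaches above should place the red crosses at the same heights. One way to check, sketched here with the mpg data bundled with ggplot2 (a stand-in for world_bank, chosen only to keep the example self-contained), is to compare the values the stat computes with a dplyr summary:

```r
library(ggplot2)
library(dplyr)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = "median", color = "red", size = 5)

# medians computed by the stat (second layer), one row per class
stat_medians <- as.numeric(layer_data(p, 2)$y)

# medians computed up front with dplyr
dplyr_medians <- mpg |>
  group_by(class) |>
  summarize(median_hwy = median(hwy)) |>
  pull(median_hwy)
```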

Statistical transformations

What can you say about the distribution of average life expectancy from the following QQ plot?

ggplot(world_bank, aes(sample = life_exp)) +
  stat_qq() +
  stat_qq_line() +
  labs(y = "life_exp")


What is a scale?

  • Each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale)

  • The axis or legend is the inverse function: it allows you to convert visual properties back to data
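
As a rough illustration of the "function and inverse" idea, scales::rescale() can play the role of a linear scale. The values below are illustrative inputs, not taken from world_bank, and the size range of 1 to 6 is an arbitrary choice.

```r
library(scales)

# data space (domain): life expectancy in years, 50 to 90
# aesthetic space (range): point sizes from 1 to 6
life_exp <- c(50, 70, 90)
sizes <- rescale(life_exp, to = c(1, 6), from = c(50, 90))

# a guide inverts the mapping: from a point size back to a data value
back <- rescale(3.5, to = c(50, 90), from = c(1, 6))
```

Here the midpoint of the domain (70) lands on the midpoint of the range (3.5), and the inverse mapping recovers it.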

Scale specification

Every aesthetic in your plot is associated with exactly one scale:

# automatic scales
ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8)
# manual scales
ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()

Anatomy of a scale function


  • Always starts with scale, with the parts joined by underscores: scale_A_B()
  • A: Name of the primary aesthetic (e.g., color, shape, x)
  • B: Name of the scale (e.g., continuous, discrete, brewer)

Some scale functions add a fourth component to the function name

Binned palette
Continuous palette
Discrete palette
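
Assuming the fourth component refers to the _b, _c, and _d suffixes (the figure is not reproduced here, so this reading is an assumption), the viridis family in ggplot2 shows the pattern. The sketch only constructs the scale objects; nothing is plotted.

```r
library(ggplot2)

binned     <- scale_color_viridis_b()  # b: binned palette
continuous <- scale_color_viridis_c()  # c: continuous palette
discrete   <- scale_color_viridis_d()  # d: discrete palette
```

The suffix has to match the variable mapped to color: _d for a factor such as income_level, _c or _b for a numeric variable such as gdp_per_cap.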

Guess the output

What will the x-axis label of the following plot say?

ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (percentage of total workforce)")

“Address” messages

ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (percentage of total workforce)")
Scale for x is already present.
Adding another scale for x, which will replace the existing scale.

Guess the output

What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale? Answer in the context of the following plots.

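# continuous variable (life_exp) paired with a discrete scale
ggplot(
  data = world_bank,
  mapping = aes(
    x = income_level,
    y = life_exp
  )
) +
  geom_point(alpha = 0.5) +
  scale_y_discrete()

# discrete variable (income_level) paired with a continuous scale
ggplot(
  data = world_bank,
  mapping = aes(
    x = income_level,
    y = life_exp
  )
) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()
Error: Discrete value supplied to continuous scale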



When working with continuous data, the default is to map linearly from the data space onto the aesthetic space, but this mapping can be transformed.

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5)

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")

Common scale transformations

Box-Cox scale transformations

Continuous scale transformations

Name        Function \(f(x)\)              Inverse \(f^{-1}(y)\)
asn         \(2\sin^{-1}(\sqrt{x})\)       \(\sin^2(y / 2)\)
exp         \(e^x\)                        \(\log(y)\)
identity    \(x\)                          \(y\)
log         \(\log(x)\)                    \(e^y\)
log10       \(\log_{10}(x)\)               \(10^y\)
log2        \(\log_2(x)\)                  \(2^y\)
logit       \(\log(\frac{x}{1 - x})\)      \(\frac{1}{1 + e^{-y}}\)
pow10       \(10^x\)                       \(\log_{10}(y)\)
probit      \(\Phi^{-1}(x)\)               \(\Phi(y)\)
reciprocal  \(x^{-1}\)                     \(y^{-1}\)
reverse     \(-x\)                         \(-y\)
sqrt        \(x^{1/2}\)                    \(y^2\)
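
Each named transformation is available in the scales package as an object carrying a forward function and its inverse. A small sketch for inspecting a few of them directly follows. It uses the *_trans() constructors; newer scales releases also offer transform_*() spellings.

```r
library(scales)

log10_tr  <- log10_trans()
logit_tr  <- logit_trans()
probit_tr <- probit_trans()

# forward and inverse for log10: data value <-> position on the axis
log10_fwd <- log10_tr$transform(1000)
log10_inv <- log10_tr$inverse(3)

# logit and probit both send a probability of 0.5 to 0
logit_mid  <- logit_tr$transform(0.5)
probit_mid <- probit_tr$transform(0.5)

# applying the inverse after the transform recovers the input
round_trip <- logit_tr$inverse(logit_tr$transform(0.25))
```

Passing trans = "log10" to a scale looks up the same kind of object by name, which is why axis labels stay in the original units while positions are computed on the transformed values.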

Convenience functions for transformations

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_log10()

What is a guide?

Guides are legends and axes:

Common components of axes and legends

Source: ggplot2: Elegant Graphics for Data Analysis, Chp 14.

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth"
  )

Customizing axes

Why do 50 and 90 not appear on the \(y\)-axis?

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10)
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    breaks = c(0, 5e04, 1e05),
    labels = c("$0", "$50,000", "$100,000")
  )

Customizing axes


ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_dollar()
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_dollar(scale_cut = cut_short_scale())
  )
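
The label_*() helpers from scales return functions, so they can be tried on a vector of break values outside of any plot. A short sketch follows; the break values are the ones used on the x-axis above, and the variable names are illustrative.

```r
library(scales)

gdp_breaks <- c(0, 5e4, 1e5)

# label_dollar() builds a labelling function; calling it formats the breaks
dollar_full <- label_dollar()(gdp_breaks)

# scale_cut = cut_short_scale() abbreviates large values with unit suffixes
dollar_short <- label_dollar(scale_cut = cut_short_scale())(gdp_breaks)
```

Passing the function (rather than a fixed character vector) to labels means the labels stay correct if the breaks change.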

Modifying scale guides

Scale guides

Scale type                                          Default guide type   Function
Continuous scales for color/fill aesthetics         colorbar             guide_colorbar()
Binned scales for color/fill aesthetics             colorsteps           guide_colorsteps()
Position scales (continuous, binned and discrete)   axis                 guide_axis()
Discrete scales (except position scales)            legend               guide_legend()
Binned scales (except position/color/fill scales)   bins                 guide_bins()


... +
  guides(color = guide_colorbar(...))

... +
  scale_color_gradient(guide = guide_colorbar(...))

Example implementation

base_plot + guides(color = guide_colorbar(reverse = TRUE))

base_plot + guides(color = guide_colorbar(barheight = unit(2, "cm")))

base_plot + guides(color = guide_colorbar(direction = "horizontal"))

base_plot + guides(color = guide_colorbar(label.position = "left"))
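
base_plot is not defined in these slides. A hypothetical stand-in is any plot with a continuous color scale. The sketch below uses the mpg data bundled with ggplot2 (an assumption made for self-containment; with the lecture data, a numeric column of world_bank mapped to color would serve the same purpose) and shows the two equivalent ways of attaching a guide.

```r
library(ggplot2)

# Hypothetical base_plot: a numeric variable mapped to color gives a colorbar
base_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Option 1: set the guide through guides()
via_guides <- base_plot +
  guides(color = guide_colorbar(reverse = TRUE))

# Option 2: set the guide through the scale's guide argument
via_scale <- base_plot +
  scale_color_gradient(guide = guide_colorbar(reverse = TRUE))
```

The guides() form leaves the scale itself untouched, which is convenient when only the legend needs adjusting.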

Wrap up

  • ggplot2 implements statistical transformations (typically as defaults)
  • Scales map data values to visual properties (aesthetics)
  • Guides (axes and legends) let readers convert those visual properties back to data
  • scales package provides a wide range of transformation and formatting functions
  • Use guide_*() functions to customize the appearance of guides

