Deep dive: stats + scales + guides

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

February 6, 2024

Announcements

  • Homework 02 due tomorrow
  • Project 01

Project 01

  • Project description
  • Team assignments on Thursday
  • Deliverables
    • February 15 - proposals for peer review
    • February 19 - revised proposals for instructor review
    • March 1 - write-up and presentation

Visualization critique

Media bias in the United States

  • What is the story?
  • How effective is the design?

Setup

Packages + figures

# load packages
library(tidyverse)
library(scales)

# set default theme for ggplot2
theme_set(theme_minimal(base_size = 14))

YAML options

execute:
  fig-width: 7
  fig-asp: 0.618
  fig-retina: 2
  dpi: 150
  fig-align: center
  out-width: 80%

World Bank indicators

world_bank <- read_rds("data/wb-indicators.rds")
world_bank
# A tibble: 181 × 8
   iso2c country  year gdp_per_cap female_labor_pct life_exp    pop income_level
   <chr> <chr>   <dbl>       <dbl>            <dbl>    <dbl>  <dbl> <fct>       
 1 AO    Angola   2021       1927.             49.9     61.6 3.45e7 Lower middl…
 2 AL    Albania  2021       6377.             44.1     76.5 2.81e6 Upper middl…
 3 AE    United…  2021      44332.             17.7     78.7 9.37e6 High income 
 4 AR    Argent…  2021      10651.             42.5     75.4 4.58e7 Upper middl…
 5 AM    Armenia  2021       4973.             52.7     72.0 2.79e6 Upper middl…
 6 AU    Austra…  2021      60697.             47.2     83.3 2.57e7 High income 
 7 AT    Austria  2021      53518.             46.8     81.2 8.96e6 High income 
 8 AZ    Azerba…  2021       5408.             49.9     69.4 1.01e7 Upper middl…
 9 BI    Burundi  2021        221.             51.9     61.7 1.26e7 Low income  
10 BE    Belgium  2021      51850.             46.8     81.9 1.16e7 High income 
# ℹ 171 more rows

Stats

Stats < > geoms

  • Statistical transformation (stat) transforms the data, typically by summarizing
  • ggplot2 uses many of its stats behind the scenes to generate important geoms
stat             geom
stat_bin()       geom_freqpoly(), geom_histogram()
stat_count()     geom_bar()
stat_bin2d()     geom_bin2d()
stat_bindot()    geom_dotplot()
stat_binhex()    geom_hex()
stat_boxplot()   geom_boxplot()
stat_contour()   geom_contour()
stat_quantile()  geom_quantile()
stat_smooth()    geom_smooth()
stat_sum()       geom_count()
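
To make the stat/geom pairing concrete, the sketch below builds the same histogram twice, once from the geom side and once from the stat side, and compares the binned counts. The `toy` data frame and its values are made up for illustration and stand in for `world_bank`.

```r
library(ggplot2)

# Hypothetical toy data (not the World Bank indicators)
toy <- data.frame(life_exp = c(61, 62, 70, 71, 72, 80, 81, 83))

# Start from the geom: geom_histogram() calls stat_bin() behind the scenes
p_geom <- ggplot(toy, aes(x = life_exp)) +
  geom_histogram(binwidth = 10, boundary = 60)

# Start from the stat: stat_bin() drawn with the bar geom
p_stat <- ggplot(toy, aes(x = life_exp)) +
  stat_bin(geom = "bar", binwidth = 10, boundary = 60)

# layer_data() returns the data after the stat has been computed
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count
```

Both layers should carry the same `count` column, since the only difference is which side of the stat/geom pair is named explicitly.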

stat_boxplot()

Documentation for stat_boxplot().

Layering with stats

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)

Alternate: layering with stats

ggplot(world_bank, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  geom_point(stat = "summary", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)

Alternate alternate: do it with dplyr

world_bank |>
  group_by(income_level) |>
  summarize(median_life_exp = median(life_exp)) |>
  ggplot(mapping = aes(x = income_level)) +
  geom_point(data = world_bank, mapping = aes(y = life_exp), alpha = 0.5) +
  geom_point(mapping = aes(y = median_life_exp), color = "red", size = 5, pch = 4, stroke = 2)
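
The three approaches above should produce the same red markers. A minimal check, assuming a small made-up `toy` data frame in place of `world_bank`, compares the values computed by `stat_summary()` with medians computed by hand:

```r
library(ggplot2)

# Hypothetical toy data (not the World Bank indicators)
toy <- data.frame(
  income_level = factor(c("Low", "Low", "Low", "High", "High", "High")),
  life_exp = c(58, 61, 64, 79, 81, 83)
)

p <- ggplot(toy, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = "median", color = "red", size = 5)

# The second layer holds the summarized data: one median per income level
stat_medians <- layer_data(p, 2)$y

# The same medians computed outside ggplot2, in factor-level order
manual_medians <- as.numeric(tapply(toy$life_exp, toy$income_level, median))
```

Letting the stat do the summarizing keeps the plot code short; summarizing with dplyr first is handy when the summary is also needed elsewhere.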

Statistical transformations

What can you say about the distribution of average life expectancy from the following QQ plot?

ggplot(world_bank, aes(sample = life_exp)) +
  stat_qq() +
  stat_qq_line() +
  labs(y = "life_exp")
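
For reference, a sketch of what `stat_qq()` computes under its default normal distribution: the sorted sample goes on the y-axis and the matching theoretical normal quantiles go on the x-axis. The `sample_values` vector is invented for illustration.

```r
library(ggplot2)

# Hypothetical sample (not the World Bank indicators)
sample_values <- c(72.0, 61.6, 83.3, 76.5, 69.4, 81.2, 78.7, 61.7)

p <- ggplot(data.frame(life_exp = sample_values), aes(sample = life_exp)) +
  stat_qq()

qq <- layer_data(p)

# Manual equivalent: sorted sample vs. standard normal quantiles
manual_sample <- sort(sample_values)
manual_theoretical <- qnorm(ppoints(length(sample_values)))
```

Points that fall away from the `stat_qq_line()` reference line indicate where the sample departs from a normal distribution.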

Scales

What is a scale?

  • Each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale)

  • The axis or legend is the inverse function: it allows you to convert visual properties back to data
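
As a rough illustration of the two bullets above, `scales::rescale()` can play the role of a linear scale and of its inverse. The domain (50 to 90 years) and the range (point sizes 1 to 6) are arbitrary values chosen for this sketch.

```r
library(scales)

# Data space: hypothetical life expectancies
life_exp <- c(50, 70, 90)

# Scale: map the domain [50, 90] onto an aesthetic range of point sizes [1, 6]
sizes <- rescale(life_exp, to = c(1, 6), from = c(50, 90))

# Guide: the inverse mapping a reader performs when reading a legend
recovered <- rescale(sizes, to = c(50, 90), from = c(1, 6))
```

Here `sizes` is 1, 3.5, 6 and `recovered` returns the original life expectancies.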

Scale specification

Every aesthetic in your plot is associated with exactly one scale:

# automatic scales
ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8)

# manual scales
ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()
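
One way to see that the automatic and manual versions end up with the same kind of position scale is to inspect the built plot with `layer_scales()`. This is a sketch on a made-up `toy` data frame, not on `world_bank`.

```r
library(ggplot2)

# Hypothetical toy data (not the World Bank indicators)
toy <- data.frame(
  female_labor_pct = c(20, 35, 50),
  life_exp = c(60, 70, 80),
  income_level = c("Low", "Middle", "High")
)

p_auto <- ggplot(toy, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8)

p_manual <- p_auto +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()

# layer_scales() returns the x and y scales used by a layer
auto_x <- layer_scales(p_auto)$x
manual_x <- layer_scales(p_manual)$x
```

In both cases the x scale is a continuous position scale; ggplot2 simply adds the default for you when none is specified.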

Anatomy of a scale function

scale_A_B()

  • Always starts with scale
  • A: Name of the primary aesthetic (e.g., color, shape, x)
  • B: Name of the scale (e.g., continuous, discrete, brewer)

Some scale functions add a fourth component to the function name

scale_color_viridis_b() # binned palette
scale_color_viridis_c() # continuous palette
scale_color_viridis_d() # discrete palette

Guess the output

What will the x-axis label of the following plot say?

ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (percentage of total workforce)")

“Address” messages

ggplot(world_bank, aes(x = female_labor_pct, y = life_exp, color = income_level)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "female_labor_pct") +
  scale_x_continuous(name = "Female labor (percentage of total workforce)")
Scale for x is already present.
Adding another scale for x, which will replace the existing scale.

Guess the output

What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale? Answer in the context of the following plots.

ggplot(
  data = world_bank,
  mapping = aes(
    x = income_level,
    y = life_exp
  )
) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()
Error: Discrete value supplied to continuous scale

ggplot(
  data = world_bank,
  mapping = aes(
    x = income_level,
    y = life_exp
  )
) +
  geom_point(alpha = 0.5) +
  scale_y_discrete()
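
The mismatch can also be examined without halting a script. The sketch below, on a made-up `toy` data frame, captures the error raised when a discrete variable meets a continuous scale. The exact wording of the message may vary across ggplot2 versions.

```r
library(ggplot2)

# Hypothetical toy data (not the World Bank indicators)
toy <- data.frame(
  income_level = c("Low", "Low", "High", "High"),
  life_exp = c(58, 61, 79, 83)
)

p <- ggplot(toy, aes(x = income_level, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()

# The error is raised when the plot is built, not when it is defined
result <- tryCatch(
  {
    ggplot_build(p)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
```

The reverse pairing (a continuous variable with `scale_y_discrete()`) builds without an error, which is why only one error message appears for the two plots.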

Transformations

When working with continuous data, the default is to map linearly from the data space onto the aesthetic space, but this mapping can be transformed.

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5)

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = gdp_per_cap)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")

Common scale transformations

Box-Cox scale transformations

Continuous scale transformations

Name        Function \(f(x)\)            Inverse \(f^{-1}(y)\)
atanh       \(\tanh^{-1}(x)\)            \(\tanh(y)\)
exp         \(e ^ x\)                    \(\log(y)\)
identity    \(x\)                        \(y\)
log         \(\log(x)\)                  \(e ^ y\)
log10       \(\log_{10}(x)\)             \(10 ^ y\)
log2        \(\log_2(x)\)                \(2 ^ y\)
logit       \(\log(\frac{x}{1 - x})\)    \(\frac{1}{1 + e^{-y}}\)
pow10       \(10^x\)                     \(\log_{10}(y)\)
probit      \(\Phi^{-1}(x)\)             \(\Phi(y)\)
reciprocal  \(x^{-1}\)                   \(y^{-1}\)
reverse     \(-x\)                       \(-y\)
sqrt        \(x^{1/2}\)                  \(y ^ 2\)
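
A quick way to sanity-check a transformation and its inverse is to apply both with the transformation objects from the scales package. This sketch uses `log10_trans()`, `logit_trans()`, and `probit_trans()`; the input vectors are arbitrary.

```r
library(scales)

# log10: f(x) = log10(x), inverse 10^y
x <- c(1, 10, 100)
log10_values <- log10_trans()$transform(x)
log10_back <- log10_trans()$inverse(log10_values)

# logit and probit operate on proportions in (0, 1)
p <- c(0.1, 0.5, 0.9)
logit_values <- logit_trans()$transform(p)
logit_back <- logit_trans()$inverse(logit_values)
probit_values <- probit_trans()$transform(p)
probit_back <- probit_trans()$inverse(probit_values)
```

Applying the inverse to the transformed values should recover the inputs, which is the property a guide relies on when it labels a transformed axis in the original units.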

Convenience functions for transformations

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")

ggplot(
  world_bank,
  aes(x = female_labor_pct, y = life_exp)
) +
  geom_point(alpha = 0.5) +
  scale_y_log10()
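
The two spellings above can be compared directly. This sketch, on a made-up `toy` data frame with values chosen as powers of ten, checks that both produce the same transformed y positions.

```r
library(ggplot2)

# Hypothetical toy data (not the World Bank indicators)
toy <- data.frame(
  female_labor_pct = c(20, 35, 50),
  gdp_per_cap = c(100, 1000, 10000)
)

base <- ggplot(toy, aes(x = female_labor_pct, y = gdp_per_cap)) +
  geom_point(alpha = 0.5)

# Long form, as written in the slides
y_long <- layer_data(base + scale_y_continuous(trans = "log10"))$y

# Convenience function
y_short <- layer_data(base + scale_y_log10())$y
```

In the built plot the y values are stored on the log10 scale (2, 3, 4 here), while the axis labels still show the original units.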

Application exercise

ae-03 - Part 1

  • Go to the course GitHub org and find your ae-03 (repo name will be suffixed with your NetID).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Guides

What is a guide?

Guides are legends and axes:

Common components of axes and legends

Source: ggplot2: Elegant Graphics for Data Analysis, Chp 14.

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth"
  )

Customizing axes

Why do 50 and 90 not appear on the \(y\)-axis?

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10)
  )
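
One plausible explanation, sketched here with invented numbers: breaks that fall outside the plotted range of the axis are not drawn. The `life_exp` values below are hypothetical, and the 5% expansion is an assumption mirroring the usual default padding of a continuous axis.

```r
library(scales)

# Hypothetical life expectancies, all strictly between 50 and 90
life_exp <- c(53.0, 61.6, 76.5, 84.5)

# Requested breaks
breaks <- seq(from = 50, to = 90, by = 10)

# Approximate plotted range: the data range padded by 5% on each side
axis_range <- expand_range(range(life_exp), mul = 0.05)

# Only breaks inside the plotted range can be shown
visible <- breaks[breaks >= axis_range[1] & breaks <= axis_range[2]]
```

With these numbers only 60, 70, and 80 survive, so 50 and 90 need the axis limits to be widened before they can appear.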

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  )

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    breaks = c(0, 5e04, 1e05),
    labels = c("$0", "$50,000", "$100,000")
  )

Customizing axes

library(scales)

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_dollar()
  )
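
`label_dollar()` returns a function, which is why it can be passed to `labels`. Calling that function directly on a few break values shows the strings it would generate; the break values here are the ones used on the previous slide.

```r
library(scales)

# label_dollar() builds a labelling function
dollar_labeller <- label_dollar()

# Apply it to the break values by hand
dollar_labels <- dollar_labeller(c(0, 5e04, 1e05))
dollar_labels
# "$0" "$50,000" "$100,000"
```

Because the labeller is applied to whatever breaks the scale chooses, there is no need to keep `breaks` and `labels` in sync manually.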

Customizing axes

ggplot(world_bank, aes(x = gdp_per_cap, y = life_exp)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    name = "Life expectancy at birth",
    breaks = seq(from = 50, to = 90, by = 10),
    limits = c(50, 90)
  ) +
  scale_x_continuous(
    name = "GDP per capita",
    labels = label_dollar(scale_cut = cut_short_scale())
  )
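
The effect of `scale_cut = cut_short_scale()` can be previewed the same way, by calling the labeller on a few values. The values are arbitrary; large numbers are abbreviated with short-scale suffixes such as K for thousands.

```r
library(scales)

# Labelling function that abbreviates large values
short_labeller <- label_dollar(scale_cut = cut_short_scale())

# Apply it to example break values
short_labels <- short_labeller(c(0, 5e04, 1e05))
short_labels
```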

Modifying scale guides

Scale guides

Scale type                                           Default guide type   Function
Continuous scales for color/fill aesthetics          colorbar             guide_colorbar()
Binned scales for color/fill aesthetics              colorsteps           guide_colorsteps()
Position scales (continuous, binned and discrete)    axis                 guide_axis()
Discrete scales (except position scales)             legend               guide_legend()
Binned scales (except position/color/fill scales)    bins                 guide_bins()

Implementation

... +
  guides(color = guide_colorbar(...))

... +
  scale_color_gradient(guide = guide_colorbar(...))

Example implementation

base_plot + guides(color = guide_colorbar(reverse = TRUE))

base_plot + guides(color = guide_colorbar(barheight = unit(2, "cm")))

base_plot + guides(color = guide_colorbar(direction = "horizontal"))

base_plot + guides(color = guide_colorbar(label.position = "left"))
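
The slides do not show how `base_plot` is defined. A self-contained sketch, assuming a made-up `toy` data frame with a continuous variable mapped to color, shows the two equivalent ways of attaching a customized colorbar:

```r
library(ggplot2)

# Hypothetical toy data (not the World Bank indicators)
toy <- data.frame(
  female_labor_pct = c(20, 35, 50, 45),
  life_exp = c(60, 70, 80, 75),
  gdp_per_cap = c(500, 5000, 50000, 20000)
)

# Assumed definition of base_plot: a continuous color mapping
base_plot <- ggplot(toy, aes(x = female_labor_pct, y = life_exp, color = gdp_per_cap)) +
  geom_point()

# Option 1: set the guide through guides()
via_guides <- base_plot +
  guides(color = guide_colorbar(reverse = TRUE))

# Option 2: set the guide inside the scale function
via_scale <- base_plot +
  scale_color_gradient(guide = guide_colorbar(reverse = TRUE))

# Both versions build and carry the same points
n_guides <- nrow(layer_data(via_guides))
n_scale <- nrow(layer_data(via_scale))
```

Using `guides()` is convenient when only the guide changes; setting `guide` inside the scale keeps everything about that aesthetic in one place.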

Application exercise

ae-03 - Part 2

Recreate this plot.

Wrap up

  • ggplot2 implements statistical transformations (typically as defaults)
  • Scales map data values to aesthetic (visual) values
  • Guides (axes and legends) display scales so readers can convert visual properties back to data
  • scales package provides a wide range of transformation and formatting functions
  • Use guide_*() functions to customize the appearance of guides

Some artwork

A drawing of three bunnies by my daughter, Beverly.