Deep dive: layers (II)

Lecture 4

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2025

January 30, 2025

Announcements

  • Homework 01 due yesterday
  • Homework 02
  • Enrollment update
  • Post questions on the discussion forum

Visualization critique

Distracted boyfriends cause statisticians to move to New Jersey?

  • What is the story?
  • Is it believable?
  • Is it relevant?

Setup

From last time

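# Load packages: tidyverse for wrangling and plotting, scales for axis labels
library(tidyverse)
library(scales)

# Import the home sales data and round year built down to the decade
tompkins <- read_csv("data/tompkins-home-sales.csv") |>
  mutate(decade_built = (year_built %/% 10) * 10) |>
  mutate(
    # Collapse the earliest and latest decades into open-ended categories
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )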

mean_price_decade <- tompkins |>
  group_by(decade_built_cat) |>
  summarize(mean_price = mean(price))

Geoms

  • Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create

  • You can think of them as “the geometric shape used to represent the data”

One variable

  • Discrete

    • geom_bar(): display distribution of discrete variable
  • Continuous

    • geom_histogram(): bin and count continuous variable, display with bars

    • geom_density(): smoothed density estimate

    • geom_dotplot(): stack individual points into a dot plot

    • geom_freqpoly(): bin and count continuous variable, display with lines

Comparing across groups

Which of the following allows for easier comparison across groups?

ggplot(tompkins, aes(x = price, fill = decade_built_cat)) +
  geom_histogram(binwidth = 100000)

ggplot(tompkins, aes(x = price, color = decade_built_cat)) +
  geom_freqpoly(binwidth = 100000, linewidth = 1)

Application exercise

ae-03

Instructions

  • Go to the course GitHub org and find your ae-03 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Work through part 1

Create and compare different types of bar charts.

Two variables

Two variables - both continuous

  • geom_point(): scatterplot

  • geom_quantile(): smoothed quantile regression

  • geom_rug(): marginal rug plots

  • geom_smooth(): smoothed line of best fit

  • geom_text(): text labels

Two variables - show density

  • geom_bin2d(): bin into rectangles and count

  • geom_density2d(): smoothed 2d density estimate

  • geom_hex(): bin into hexagons and count

geom_hex()

Not so helpful for 38 observations:

tompkins |>
  filter(decade_built == 1940) |>
  ggplot(aes(x = area, y = price)) +
  geom_hex()

geom_hex()

More helpful for 1270 observations:

ggplot(tompkins, aes(x = area, y = price)) +
  geom_hex()

geom_hex()

Even more helpful for 53940 observations:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()

geom_hex()

(Maybe) even more helpful with the counts (fill) on a log scale:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  scale_fill_gradient(transform = "log10")

Two variables

  • At least one discrete
    • geom_count(): count the number of points at each distinct location
    • geom_jitter(): randomly jitter overlapping points
  • One continuous, one discrete
    • geom_col(): a bar chart of pre-computed summaries
    • geom_boxplot(): boxplots
    • geom_violin(): show density of values in each group

geom_jitter()

How are the following three plots different?

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_point()

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()

geom_jitter() and set.seed()

set.seed(531)

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()

set.seed(531)

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()
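One way to check that set.seed() makes the jitter reproducible is to compare the coordinates that ggplot2 computes for each plot. This is a rough sketch that assumes the built-in mpg data; the names p1, p2, d1, d2, and same_jitter are illustrative only.

```r
library(ggplot2)

# Set the same seed before building each plot
set.seed(531)
p1 <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_jitter()
d1 <- layer_data(p1)

set.seed(531)
p2 <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_jitter()
d2 <- layer_data(p2)

# TRUE when the jittered coordinates match exactly
same_jitter <- identical(d1$x, d2$x) && identical(d1$y, d2$y)
same_jitter
```

Without the set.seed() calls, each plot would generally receive a different random offset for every point.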

Two variables

  • One time, one continuous
    • geom_area(): area plot
    • geom_line(): line plot
    • geom_step(): step plot
  • Display uncertainty
    • geom_crossbar(): vertical bar with center
    • geom_errorbar(): error bars
    • geom_linerange(): vertical line
    • geom_pointrange(): vertical line with center
  • Spatial
    • geom_sf(): for map data

Average price per decade built

mean_price_year <- tompkins |>
  group_by(decade_built) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  )

mean_price_year
# A tibble: 23 × 4
   decade_built     n mean_price sd_price
          <dbl> <int>      <dbl>    <dbl>
 1         1800     2    262500    95459.
 2         1810     2    435000   233345.
 3         1820     6    382083.  161852.
 4         1830     5    330400    80114.
 5         1840     8    510700   196711.
 6         1850    22    258136.  138885.
 7         1860    44    271182.  124217.
 8         1870    29    381904.  221011.
 9         1880    41    327278.  221005.
10         1890    37    364171.  192543.
# ℹ 13 more rows

geom_line()

ggplot(mean_price_year, aes(x = decade_built, y = mean_price)) +
  geom_line()

geom_area()

ggplot(mean_price_year, aes(x = decade_built, y = mean_price)) +
  geom_area()

geom_step()

ggplot(mean_price_year, aes(x = decade_built, y = mean_price)) +
  geom_step()
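A summary that stores n and a standard deviation alongside the mean can also feed one of the uncertainty geoms. The sketch below is illustrative rather than part of the original analysis: it assumes the built-in diamonds data, a normal-approximation 95% interval (mean ± 1.96 standard errors), and hypothetical column names se_price, lower, and upper.

```r
library(ggplot2)
library(dplyr)

# Summarize price by cut: group size, mean, and standard deviation
price_by_cut <- diamonds |>
  group_by(cut) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  ) |>
  # Standard error of the mean and an approximate 95% interval
  mutate(
    se_price = sd_price / sqrt(n),
    lower = mean_price - 1.96 * se_price,
    upper = mean_price + 1.96 * se_price
  )

# Point at the mean, vertical line spanning the interval
p_range <- ggplot(price_by_cut, aes(x = cut, y = mean_price)) +
  geom_pointrange(aes(ymin = lower, ymax = upper))
p_range
```

For groups with very few observations, such as decades with only two homes sold, an interval like this is unreliable and should be read with caution.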

Application exercise

ae-03

Work through part 2

Create and compare different types of charts for comparing a categorical and continuous variable.

Let’s clean things up a bit!

ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(alpha = 0.2, size = 2, color = "#B31B1B") +
  scale_x_continuous(labels = label_number(big.mark = ",")) +
  scale_y_continuous(labels = label_currency(scale_cut = cut_short_scale())) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Sale prices of homes in Tompkins County, NY",
    subtitle = "2022-24",
    caption = "Source: Redfin.com"
  )

Wrap up

Recap

  • {ggplot2} uses geom_*() functions to define types of plots
  • Select appropriate geom_*() functions based on the number and types of variables you wish to visualize
  • Consider the number of observations to determine an appropriate chart type and/or adjustments to the chart

Acknowledgements

Happy birthday!

My daughter, Beverly.