Deep dive: layers (I)

Lecture 3

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2025

January 28, 2025

Announcements

  • Homework 01 due tomorrow
  • Submitting computational AEs – commit and push to your repo by 11:59pm today

Visualization critique

Gotta catch ’em all

  • What is the story?
  • Is the chart design effective?
  • Is the chart believable?

A/B testing

Data: Sale prices of houses in Tompkins County

  • Data on houses that were sold in Tompkins County, NY from 2022-24

  • Scraped from Redfin

Import the data

library(tidyverse)

tompkins <- read_csv("data/tompkins-home-sales.csv")
glimpse(tompkins)
Rows: 1,270
Columns: 12
$ sold_date    <date> 2022-09-12, 2022-09-12, 2022-09-12, 2022-09-13, 2022-07-…
$ price        <dbl> 340000, 390000, 625500, 246600, 172000, 205000, 230000, 2…
$ beds         <dbl> 2, 4, 2, 2, NA, 2, 5, 5, 3, 5, 3, 2, 2, 4, 3, 5, 4, 3, 4,…
$ baths        <dbl> 3.0, 3.0, 3.0, 1.5, NA, 1.0, 2.0, 2.0, 2.5, 4.0, 1.0, 1.5…
$ area         <dbl> 1864, 3252, 1704, 1264, 2644, 820, 2900, 2364, 2016, 2882…
$ lot_size     <dbl> 4.50000000, 0.33999082, 65.00000000, 0.21000918, 0.130004…
$ year_built   <dbl> 1999, 1988, 1988, 1953, 1870, 1932, 1850, 1985, 1984, 200…
$ hoa_month    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ town         <chr> "Newfield", "Ithaca", "Dryden", "Ithaca", "Dryden", "Itha…
$ municipality <chr> "Unincorporated", "Unincorporated", "Unincorporated", "It…
$ long         <dbl> -76.59488, -76.45546, -76.35953, -76.52435, -76.29872, -7…
$ lat          <dbl> 42.38609, 42.47046, 42.43971, 42.45208, 42.49046, 42.4273…

A simple visualization

ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Price and area of houses in Tompkins County"
  )

New variable: decade_built

# Integer-divide the year by 10, then multiply back, to round down to the decade
tompkins <- tompkins |>
  mutate(decade_built = (year_built %/% 10) * 10)

tompkins |>
  select(year_built, decade_built)
# A tibble: 1,270 × 2
   year_built decade_built
        <dbl>        <dbl>
 1       1999         1990
 2       1988         1980
 3       1988         1980
 4       1953         1950
 5       1870         1870
 6       1932         1930
 7       1850         1850
 8       1985         1980
 9       1984         1980
10       2002         2000
# ℹ 1,260 more rows

New variable: decade_built_cat

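# Pool the 1940s and earlier, and the 1990s and later, into open-ended bins;
# the decades in between keep their own label
tompkins <- tompkins |>
  mutate(
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )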

tompkins |>
  count(decade_built_cat)
# A tibble: 6 × 2
  decade_built_cat     n
  <chr>            <int>
1 1940 or before     443
2 1950               117
3 1960               120
4 1970               136
5 1980               143
6 1990 or after      311

A slightly more complex visualization

ggplot(
  tompkins,
  aes(x = area, y = price, color = decade_built_cat)
) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  scale_y_continuous(labels = scales::label_dollar()) + # {scales} is not attached by library(tidyverse)
  facet_wrap(facets = vars(decade_built_cat)) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    color = "Decade built",
    title = "Price and area of houses in Tompkins County"
  )

A/B testing

Activity

In the next two slides, the same plots are created with different “cosmetic” choices.

Test 1

Test 2

What makes figures bad?

Bad taste

Data-to-ink ratio

Tufte strongly recommends maximizing the data-to-ink ratio in The Visual Display of Quantitative Information (Tufte, 1983).

Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).

Cover of The Visual Display of Quantitative Information

Which of the plots has a higher data-to-ink ratio?

Summary statistics

mean_price_decade <- tompkins |>
  group_by(decade_built_cat) |>
  summarize(mean_price = mean(price))

mean_price_decade
# A tibble: 6 × 2
  decade_built_cat mean_price
  <chr>                 <dbl>
1 1940 or before      351273.
2 1950                330779.
3 1960                355146.
4 1970                354562.
5 1980                338600.
6 1990 or after       445540.

Barplot

ggplot(
  mean_price_decade,
  aes(y = decade_built_cat, x = mean_price)
) +
  geom_col() +
  labs(
    x = "Mean sales price", y = "Decade built",
    title = "Mean sales price of houses in Tompkins County"
  )

Scatterplot

ggplot(
  mean_price_decade,
  aes(y = decade_built_cat, x = mean_price)
) +
  geom_point(size = 4) +
  labs(
    x = "Mean sales price", y = "Decade built",
    title = "Mean sales price of houses in Tompkins County"
  )

A clip from the TV show 'Parks and Recreation'. Leslie Knope has a lollipop stuck to her sweater and tells Ron Swanson 'It's called lollipopping'.

Lollipop chart – a happy medium?

Application exercise

ae-02

Instructions

  • Go to the course GitHub org and find your ae-02 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Bad data

Bad perception

Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland.
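
One way to see this effect is to draw the same series at two aspect ratios. The sketch below is not Cleveland's original example: it uses a made-up trend with a small wiggle, and the object names (trend, p_wide, p_tall) are hypothetical.

```r
library(ggplot2)

# Made-up series (an assumption for illustration): steady upward trend plus a wiggle
trend <- data.frame(t = 1:50)
trend$value <- 0.5 * trend$t + 3 * sin(trend$t / 3)

base <- ggplot(trend, aes(x = t, y = value)) +
  geom_line()

# Same data, different panel shapes (aspect.ratio is panel height / panel width)
p_wide <- base + theme(aspect.ratio = 0.2) # short, wide panel: the rise looks gradual
p_tall <- base + theme(aspect.ratio = 2)   # tall, narrow panel: the rise looks steep
```

Printing p_wide and p_tall side by side shows identical values that read as very different rates of change.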

Aesthetic mappings in {ggplot2}

A second look: lollipop chart

ggplot(
  mean_price_decade,
  aes(y = decade_built_cat, x = mean_price)
) +
  geom_point(size = 4) +
  geom_segment(
    mapping = aes(
      x = 0, xend = mean_price,
      y = decade_built_cat, yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean sales price", y = "Decade built",
    title = "Mean sales price of houses in Tompkins County"
  )

Activity: Spot the differences I

Can you spot the differences between this code and the code on the previous slide? Do the resulting plots differ? Work with a partner to answer.

ggplot(
  mean_price_decade,
  aes(y = decade_built_cat, x = mean_price)
) +
  geom_point(size = 4) +
  geom_segment(
    mapping = aes(
      xend = 0,
      yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean sales price", y = "Decade built",
    title = "Mean sales price of houses in Tompkins County"
  )

Global vs. layer-specific aesthetics

  • Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in some combination of both.

  • Within each layer, you can add, override, or remove mappings.

  • If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers.
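
The three cases above (add, override, remove) can be sketched on a small example. This is an illustrative sketch rather than code from the lecture: the houses data frame and its era column are made up, standing in for tompkins and decade_built_cat.

```r
library(ggplot2)

# Hypothetical toy data (not the Tompkins County file)
houses <- data.frame(
  area  = c(900, 1400, 2100, 2600, 1000, 1600, 2300, 3000),
  price = c(150, 210, 300, 360, 200, 290, 380, 470) * 1000,
  era   = rep(c("older", "newer"), each = 4)
)

# Global mappings: every layer inherits x, y, and color by default
base <- ggplot(houses, aes(x = area, y = price, color = era))

# Add: this layer maps shape in addition to the inherited aesthetics
p_add <- base + geom_point(aes(shape = era))

# Override: this layer replaces the inherited color mapping with another variable
p_override <- base + geom_point(aes(color = price > 300000))

# Remove: color = NULL drops the inherited mapping for this layer only, so the
# points stay colored by era while a single trend line is fit to all of them
p_remove <- base +
  geom_point() +
  geom_smooth(aes(color = NULL), method = "lm", formula = y ~ x, se = FALSE)
```

With only geom_point() in the plot, moving color = era between ggplot() and the layer would give the same picture. The placement starts to matter in p_remove, where the second layer would otherwise inherit the color mapping and draw one line per era.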

Activity: Spot the differences II

Do you expect the following plots to be the same or different? If different, how? Discuss in a group without running the code.

# Plot A
ggplot(data = tompkins, mapping = aes(x = area, y = price)) +
  geom_point(mapping = aes(color = decade_built_cat))
# Plot B
ggplot(data = tompkins, mapping = aes(x = area, y = price)) +
  geom_point(color = "blue")
# Plot C
ggplot(data = tompkins, mapping = aes(x = area, y = price)) +
  geom_point(color = "#A493BA")

Wrap up

Recap

  • Data visualizations can be bad for aesthetic, substantive, or perceptual reasons
  • {ggplot2} graphs are constructed from layers
  • Custom charts can be generated by combining multiple layers

Acknowledgements

Film recommendation