Deep dive: layers (I)

Lecture 3

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

January 30, 2024

Announcements

  • Homework 01 due tomorrow
  • “Graded” AEs – commit and push to your repo by 11:59pm tomorrow

Visualization critique

Gotta catch ’em all

  • What is the story?
  • Is the chart design effective?
  • Is the chart believable?

A/B testing

Data: Sale prices of houses in Tompkins County

  • Data on houses that were sold in Tompkins County, NY from 2022-23

  • Scraped from Redfin

Import the data

library(tidyverse)

tompkins <- read_csv("data/tompkins-home-sales.csv")
glimpse(tompkins)
Rows: 1,897
Columns: 12
$ property_type <chr> "Single Family Residential", "Single Family Residential"…
$ address       <chr> "377 Millard Hill Rd", "113 Pinewood Pl", "373 Hunt Hill…
$ city          <chr> "Newfield", "Ithaca", "Ithaca", "Ithaca", "Dryden", "Ith…
$ state         <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "N…
$ zip_code      <dbl> 14867, 14850, 14850, 14850, 13053, 14850, 14882, 13073, …
$ price         <dbl> 340000, 390000, 625500, 246600, 172000, 205000, 230000, …
$ beds          <dbl> 2, 4, 2, 2, NA, 2, 5, 3, 5, 6, 3, 5, 3, 2, 2, 4, 3, 5, 4…
$ baths         <dbl> 3.0, 3.0, 3.0, 1.5, NA, 1.0, 2.0, 2.0, 2.0, 4.0, 2.5, 4.…
$ area          <dbl> 1864, 3252, 1704, 1264, 2644, 820, 2900, 1638, 2364, 225…
$ lot_size      <dbl> 4.50000000, 0.33999082, 65.00000000, 0.21000918, 0.13000…
$ year_built    <dbl> 1999, 1988, 1988, 1953, 1870, 1932, 1850, 1983, 1985, 19…
$ hoa_month     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

A simple visualization

ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Price and area of houses in Tompkins County"
  )

New variable: decade_built

tompkins <- tompkins |>
  mutate(decade_built = (year_built %/% 10) * 10)

tompkins |>
  select(year_built, decade_built)
# A tibble: 1,897 × 2
   year_built decade_built
        <dbl>        <dbl>
 1       1999         1990
 2       1988         1980
 3       1988         1980
 4       1953         1950
 5       1870         1870
 6       1932         1930
 7       1850         1850
 8       1983         1980
 9       1985         1980
10       1991         1990
# ℹ 1,887 more rows

New variable: decade_built_cat

tompkins <- tompkins |>
  mutate(
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )

tompkins |>
  count(decade_built_cat)
# A tibble: 6 × 2
  decade_built_cat     n
  <chr>            <int>
1 1940 or before     636
2 1950               156
3 1960               156
4 1970               192
5 1980               206
6 1990 or after      551

A slightly more complex visualization

ggplot(
  tompkins,
  aes(x = area, y = price, color = decade_built_cat)
) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  scale_y_continuous(labels = scales::label_dollar()) +
  facet_wrap(facets = vars(decade_built_cat)) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    color = "Decade built",
    title = "Price and area of houses in Tompkins County"
  )

A/B testing

In the next two slides, the same plots are created with different “cosmetic” choices. Examine the two plots given (Plot A and Plot B), and indicate your preference by voting for one of them in the Vote tab.

Test 1

Test 2

What makes figures bad?

Bad taste

Data-to-ink ratio

Tufte strongly recommends maximizing the data-to-ink ratio in The Visual Display of Quantitative Information (Tufte, 1983).

Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).

Cover of The Visual Display of Quantitative Information

Which of the plots has a higher data-to-ink ratio?
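One way to explore this question in code is to draw the same data twice and strip non-data elements from one version. The following is a minimal sketch, not the lecture's own plots: it assumes the `mpg` dataset bundled with ggplot2, and the choice of which theme elements to remove is illustrative only.

```r
library(ggplot2)

# Same data and same geom in both plots (mpg is an assumed stand-in dataset)
base_plot <- ggplot(mpg, aes(y = class)) +
  geom_bar()

# Version 1: default theme, with a grey panel background and full gridlines
p_default <- base_plot

# Version 2: less non-data ink, with no panel background, no horizontal
# gridlines (the bars already mark each category), and no minor gridlines
p_lean <- base_plot +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  )
```

Both plots encode exactly the same counts; only the amount of ink spent on non-data elements differs.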

A deeper look at the plotting code

Summary statistics

mean_area_decade <- tompkins |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))

mean_area_decade
# A tibble: 6 × 2
  decade_built_cat mean_area
  <chr>                <dbl>
1 1940 or before       1872.
2 1950                 1645.
3 1960                 1874.
4 1970                 1908.
5 1980                 1852.
6 1990 or after        2226.

Barplot

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_col() +
  labs(
    x = "Mean area (square feet)", y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )

Scatterplot

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  labs(
    x = "Mean area (square feet)", y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )

A clip from the TV show 'Parks and Recreation'. Leslie Knope has a lollipop stuck to her sweater and tells Ron Swanson 'It's called lollipopping'.

Lollipop chart – a happy medium?

Application exercise

ae-01

  • Go to the course GitHub org and find your ae-01 (repo name will be suffixed with your NetID).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Bad data

Bad perception

Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland.
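The effect can be reproduced by drawing one line chart at two panel shapes. This is a rough sketch rather than the figure from the slide: it assumes the `economics` dataset bundled with ggplot2, and the two ratio values are arbitrary choices for illustration.

```r
library(ggplot2)

# Same data and same geom; only the panel's height-to-width ratio changes
# (economics is an assumed stand-in dataset, not the one used in lecture)
base_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Tall, narrow panel: rises and falls look steep
p_tall <- base_plot + theme(aspect.ratio = 2)

# Short, wide panel: the same changes look gradual
p_wide <- base_plot + theme(aspect.ratio = 0.25)
```

In `theme()`, `aspect.ratio` is the panel height divided by its width, so larger values produce taller panels.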

Aesthetic mappings in ggplot2

A second look: lollipop chart

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  geom_segment(
    mapping = aes(
      x = 0, xend = mean_area,
      y = decade_built_cat, yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )

Activity: Spot the differences I

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  geom_segment(
    mapping = aes(
      xend = 0,
      yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )

Can you spot the differences between the code here and the one provided in the previous slide? Are there any differences in the resulting plot? Work in a pair (or group) to answer.

Global vs. layer-specific aesthetics

  • Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in some combination of both.

  • Within each layer, you can add, override, or remove mappings.

  • If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers.
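The points above can be sketched with a two-layer plot. This example assumes the `mpg` dataset bundled with ggplot2 rather than the Tompkins data, so that it runs on its own; the object names are made up for illustration.

```r
library(ggplot2)

# Global mapping: both layers inherit x, y, and color,
# so geom_smooth() fits one line per value of drv
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Layer-specific mapping: color is added only in geom_point(),
# so geom_smooth() fits a single line through all points
p_layer <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", se = FALSE)

# Removing a mapping: color is global, but set to NULL in geom_smooth(),
# which again gives colored points and a single fitted line
p_removed <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(aes(color = NULL), method = "lm", se = FALSE)
```

With only the point layer, all three specifications would draw the same points; the difference shows up in the second layer.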

Activity: Spot the differences II

Do you expect the following plots to be the same or different? If different, how? Discuss in a group without running the code.

# Plot A
ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(aes(color = decade_built_cat))
# Plot B
ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(color = "blue")
# Plot C
ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(color = "#a493ba")

Wrap up

Think back to all the plots you saw in the lecture, without flipping back through the slides. Which plot first comes to mind? Describe it in words.

Into the Maasverse

Cover of House of Flame and Shadow by Sarah J Maas