Deep dive: layers (II)

Lecture 4

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

February 1, 2024

Announcements

  • Homework 02 will be posted tomorrow
  • Post questions on the discussion forum

Visualization critique

Distracted boyfriends cause statisticians to move to New Jersey?

  • What is the story?
  • Is it believable?
  • Is it relevant?

Setup

From last time

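library(tidyverse)

# Read the home sales data and floor each year to its decade (e.g., 1987 -> 1980)
tompkins <- read_csv("data/tompkins-home-sales.csv") |>
  mutate(decade_built = (year_built %/% 10) * 10) |>
  # Collapse the earliest and latest decades into two catch-all categories
  mutate(
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )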

mean_area_decade <- tompkins |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))

mean_area_decade
# A tibble: 6 × 2
  decade_built_cat mean_area
  <chr>                <dbl>
1 1940 or before       1872.
2 1950                 1645.
3 1960                 1874.
4 1970                 1908.
5 1980                 1852.
6 1990 or after        2226.

Geoms

  • Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create

  • You can think of them as “the geometric shape used to represent the data”

One variable

  • Discrete:

    • geom_bar(): display distribution of discrete variable
  • Continuous:

    • geom_histogram(): bin and count continuous variable, display with bars

    • geom_density(): smoothed density estimate

    • geom_dotplot(): stack individual points into a dot plot

    • geom_freqpoly(): bin and count continuous variable, display with lines
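
geom_histogram() and geom_freqpoly() do the same binning and counting and differ only in how the counts are drawn. The sketch below is one way to check this, assuming the diamonds data that ships with ggplot2 as a stand-in for tompkins; the object names are illustrative.

```r
library(ggplot2)

# diamonds ships with ggplot2; it stands in for the tompkins data here
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000)

p_freq <- ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(binwidth = 1000)

# layer_data() returns the data frame ggplot2 computes for a layer,
# including the per-bin counts
hist_bins <- layer_data(p_hist)
freq_bins <- layer_data(p_freq)

# Both geoms account for every observation exactly once
sum(hist_bins$count)
sum(freq_bins$count)
```

Printing hist_bins shows the bin edges (xmin, xmax) alongside the counts, which is handy when choosing a binwidth.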

Aside

Always use “typewriter text” (monospace font) when writing function names, and follow with (), e.g.,

  • geom_freqpoly()

  • mean()

  • lm()

geom_dotplot()

What does each point represent? How are their locations determined? What do the x and y axes represent?

ggplot(tompkins, aes(x = price)) +
  geom_dotplot(binwidth = 100000, dotsize = 0.2)

Comparing across groups

Which of the following allows for easier comparison across groups?

ggplot(tompkins, aes(x = price, fill = decade_built_cat)) +
  geom_histogram(binwidth = 100000)

ggplot(tompkins, aes(x = price, color = decade_built_cat)) +
  geom_freqpoly(binwidth = 100000, linewidth = 1)

Application exercise

ae-02 - Part 1

  • Go to the course GitHub org and find your ae-02 (repo name will be suffixed with your NetID).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Two variables

Two variables - both continuous

  • geom_point(): scatterplot

  • geom_quantile(): smoothed quantile regression

  • geom_rug(): marginal rug plots

  • geom_smooth(): smoothed line of best fit

  • geom_text(): text labels

Two variables - show density

  • geom_bin2d(): bin into rectangles and count

  • geom_density2d(): smoothed 2d density estimate

  • geom_hex(): bin into hexagons and count

geom_hex()

Not so helpful for 156 observations:

tompkins |>
  filter(decade_built == 1950) |>
  ggplot(aes(x = area, y = price)) +
  geom_hex()

geom_hex()

More helpful for 1897 observations:

ggplot(tompkins, aes(x = area, y = price)) +
  geom_hex()

geom_hex()

Even more helpful for 53940 observations:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()

geom_hex()

(Maybe) even more helpful on the (natural) log scale:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  scale_fill_gradient(trans = "log")

geom_hex() and warnings

  • Requires installing the hexbin package separately!
install.packages("hexbin")
  • Otherwise you might see
Warning: Computation failed in `stat_binhex()`
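
If you cannot be sure hexbin is installed (for example, in code you share with others), one option is to check for it and fall back to rectangular bins. This is a minimal sketch, not the only approach; density_layer and p_density are illustrative names.

```r
library(ggplot2)

# Use geom_hex() only when hexbin is available; otherwise fall back to
# rectangular bins, which need no extra package
density_layer <- if (requireNamespace("hexbin", quietly = TRUE)) {
  geom_hex()
} else {
  geom_bin2d()
}

p_density <- ggplot(diamonds, aes(x = carat, y = price)) +
  density_layer

p_density
```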

Two variables

  • At least one discrete
    • geom_count(): count the number of points at distinct locations
    • geom_jitter(): randomly jitter overlapping points
  • One continuous, one discrete
    • geom_col(): a bar chart of pre-computed summaries
    • geom_boxplot(): boxplots
    • geom_violin(): show density of values in each group
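
The difference between geom_bar() and geom_col() is who computes the bar heights. The sketch below draws the same chart both ways, assuming the mpg data that ships with ggplot2 as a stand-in for tompkins; class_counts is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# mpg ships with ggplot2; it stands in for the tompkins data here
# geom_bar() counts the rows in each group for you
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col() expects the bar heights to be computed already
class_counts <- mpg |>
  count(class)

p_col <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()
```

A summary table such as mean_area_decade from the setup slide is already in the pre-computed form that geom_col() expects.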

geom_jitter()

How are the following three plots different?

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_point()

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()

geom_jitter() and set.seed()

set.seed(1234)

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()

set.seed(1234)

ggplot(tompkins, aes(x = beds, y = price)) +
  geom_jitter()
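
To see that the seed really does pin down the jitter, you can compare the computed point positions directly. This is a sketch assuming the mpg data that ships with ggplot2 as a stand-in for tompkins; jitter_plot() is a hypothetical helper defined only for this example.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the tompkins data here
jitter_plot <- function() {
  ggplot(mpg, aes(x = cyl, y = hwy)) +
    geom_jitter()
}

# Same seed before each plot: the jittered positions match
set.seed(1234)
first <- layer_data(jitter_plot())
set.seed(1234)
second <- layer_data(jitter_plot())
identical(first$x, second$x)

# No reset: the random number stream has moved on, so positions differ
third <- layer_data(jitter_plot())
identical(first$x, third$x)

# Alternative: fix the seed inside the layer itself
p_fixed <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_jitter(position = position_jitter(seed = 1234))
```

Setting the seed inside position_jitter() keeps the plot reproducible even if other code that uses random numbers runs before it.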

Two variables

  • One time, one continuous
    • geom_area(): area plot
    • geom_line(): line plot
    • geom_step(): step plot
  • Display uncertainty:
    • geom_crossbar(): vertical bar with center
    • geom_errorbar(): error bars
    • geom_linerange(): vertical line
    • geom_pointrange(): vertical line with center
  • Spatial
    • geom_sf(): for map data (more on this later…)

Average price per year built

mean_price_year <- tompkins |>
  group_by(year_built) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  )

mean_price_year
# A tibble: 161 × 4
   year_built     n mean_price sd_price
        <dbl> <int>      <dbl>    <dbl>
 1       1800     2    282500    67175.
 2       1805     1    195000       NA 
 3       1810     1    185000       NA 
 4       1814     1    260000       NA 
 5       1815     2    435000   233345.
 6       1820     3    369167.   37611.
 7       1822     1    100000       NA 
 8       1824     1    580000       NA 
 9       1825     4    372500    85098.
10       1830     1    350000       NA 
# ℹ 151 more rows
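
A summary with n, a mean, and a standard deviation is the shape of data the "display uncertainty" geoms need. The sketch below assumes the mpg data that ships with ggplot2 as a stand-in for tompkins, and the choice of plus or minus one standard error is an assumption made for illustration, not something taken from the lecture.

```r
library(ggplot2)
library(dplyr)

# mpg ships with ggplot2; it stands in for the tompkins data here
mean_hwy_cyl <- mpg |>
  group_by(cyl) |>
  summarize(
    n = n(),
    mean_hwy = mean(hwy),
    sd_hwy = sd(hwy)
  ) |>
  # Standard error of the mean; NA whenever a group has a single observation
  mutate(se_hwy = sd_hwy / sqrt(n))

# Point at the mean, vertical line spanning mean +/- one standard error
p_range <- ggplot(mean_hwy_cyl, aes(x = cyl, y = mean_hwy)) +
  geom_pointrange(aes(ymin = mean_hwy - se_hwy, ymax = mean_hwy + se_hwy))

p_range
```

Groups with a single observation, like several years in mean_price_year, have an NA standard deviation, so they have no interval to draw.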

geom_line()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_line()

geom_area()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_area()

geom_step()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_step()

Application exercise

ae-02 - Part 2

Let’s clean things up a bit!

library(scales) # provides label_number(), label_dollar(), cut_short_scale()

ggplot(tompkins, aes(x = area, y = price)) +
  geom_point(alpha = 0.2, size = 2, color = "#B31B1B") +
  scale_x_continuous(labels = label_number(big.mark = ",")) +
  scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Sale prices of homes in Tompkins County, NY",
    subtitle = "2022-23",
    caption = "Source: Redfin.com"
  )
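
The label_*() helpers from scales used above are function factories: each call returns a function that formats a numeric vector as strings. A small sketch for trying them outside a plot; the input values are made up for illustration.

```r
library(scales)

# label_*() functions return a function that turns numbers into strings
area_labeller <- label_number(big.mark = ",")
price_labeller <- label_dollar(scale_cut = cut_short_scale())

area_labels <- area_labeller(c(950, 1500, 12000))
price_labels <- price_labeller(c(250000, 500000, 1500000))

area_labels
price_labels
```

Calling the labellers directly like this is a quick way to preview axis labels before committing to them in scale_x_continuous() or scale_y_continuous().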

Wrap up

  • ggplot2 uses geom_*() functions to define types of plots
  • Select appropriate geom_*() functions based on the number and types of variables you wish to visualize
  • Consider the number of observations to determine an appropriate chart type and/or adjustments to the chart

Happy birthday!