Practicing a bunch of geoms

Suggested answers

Application exercise
Answers
Modified

February 1, 2024

Important

These are suggested answers. This document should be used as a reference only; it is not designed to be an exhaustive key.

library(tidyverse)

For the following exercises we will work with data on houses that were sold in Tompkins County, NY in 2022 and 2023.[1]

The variables include:

The dataset can be found in the data folder of your repo. It is called tompkins-home-sales.csv. We will import the data and create a new variable, decade_built_cat, which identifies the decade in which the home was built. It will include catch-all categories for homes built in the 1940s or earlier (labeled "1940 or before") and for homes built in 1990 or later (labeled "1990 or after").

tompkins <- read_csv("data/tompkins-home-sales.csv") |>
  mutate(decade_built = (year_built %/% 10) * 10) |>
  mutate(
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )
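To see how the integer division and case_when() steps bin individual years, here is a minimal sketch on made-up data. The years tibble and its values are hypothetical and chosen only to hit each branch; the binning logic is copied from the code above.

```r
library(dplyr)

# Hypothetical construction years, one or two per branch of case_when()
years <- tibble(year_built = c(1935, 1949, 1950, 1987, 1990, 2015))

binned <- years |>
  mutate(
    # %/% is integer division: 1987 %/% 10 is 198, so the decade is 1980
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )

binned
```

Note that a home built in 1949 has decade_built equal to 1940, so it falls into the "1940 or before" category.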

Part 1

Let’s start by visualizing the distribution of the number of bedrooms in the properties sold in Tompkins County, NY in 2022 and 2023. To simplify the task, let’s collapse the variable beds into a smaller number of categories and drop rows with missing values for this variable.

tompkins_beds <- tompkins |>
  mutate(beds = factor(beds) |>
    fct_collapse(
      "5+" = c("5", "6", "7", "9")
    )) |>
  drop_na(beds)
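The collapsing step can be checked on a small example. The sketch below uses a hypothetical homes tibble (not the course data) with one missing value, and follows the same factor(), fct_collapse(), and drop_na() sequence as above.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical bedroom counts, including one missing value
homes <- tibble(beds = c(1, 2, 3, 5, 7, NA, 3))

homes_beds <- homes |>
  # convert to a factor, then merge the levels "5" and "7" into "5+"
  mutate(beds = fct_collapse(factor(beds), "5+" = c("5", "7"))) |>
  # remove the row where beds is NA
  drop_na(beds)

levels(homes_beds$beds)
bed_counts <- count(homes_beds, beds)
bed_counts
```

The row with the missing value is gone, and the two largest categories are counted together under "5+".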

Since the number of bedrooms is effectively a categorical variable, we should select a geom appropriate for a single categorical variable.

Your turn: Create a bar chart visualizing the distribution of the number of bedrooms in the properties sold in Tompkins County, NY in 2022 and 2023.

ggplot(data = tompkins_beds, mapping = aes(x = beds)) +
  geom_bar() +
  labs(
    title = "Distribution of number of bedrooms",
    subtitle = "Properties sold in Tompkins County, NY (2022-23)",
    x = "Number of bedrooms",
    y = NULL
  )
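geom_bar() counts the rows in each category for you, which is why no y aesthetic is supplied. As a rough illustration on hypothetical data (the toy data frame is an assumption, not the course data), the computed bar heights can be inspected with layer_data():

```r
library(ggplot2)

# Hypothetical data: one row per property
toy <- data.frame(beds = factor(c("2", "3", "3", "4", "4", "4")))

p <- ggplot(data = toy, mapping = aes(x = beds)) +
  geom_bar()

# layer_data() returns the data ggplot2 computed for the layer;
# the count column holds the height of each bar
bar_heights <- layer_data(p)$count
bar_heights
```

If the data were already summarized to one row per category, geom_col() with an explicit y aesthetic would be the usual alternative.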

Now let’s visualize the distribution of the number of bedrooms by the decade in which the property was built. We will still use a bar chart, but with one bar per decade and the bar segments color-coded by the number of bedrooms. Now we have a few variations to consider.

  • Stacked bar chart - each bar segment represents the frequency count, and the segments are stacked vertically on top of each other.[2]
  • Dodged bar chart - each bar segment represents the frequency count, and the segments are placed side by side for each decade. This leaves each segment with a common origin, or baseline value of 0.
  • Relative frequency bar chart - each bar segment represents the relative frequency (proportion) of each category within each decade.
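One way to see what distinguishes the three variations is to build each from the same small dataset and compare how tall the bars get. This is only a sketch: the toy data frame is hypothetical, and the position values shown are the ones documented for geom_bar().

```r
library(ggplot2)

# Hypothetical data: 3 homes from the 1950s, 2 from the 1960s
toy <- data.frame(
  decade = c("1950", "1950", "1950", "1960", "1960"),
  beds   = c("2", "3", "3", "2", "2")
)

base <- ggplot(data = toy, mapping = aes(x = decade, fill = beds))

stacked <- base + geom_bar(position = "stack") # the default
dodged  <- base + geom_bar(position = "dodge")
filled  <- base + geom_bar(position = "fill")

# Upper edge of the tallest segment under each position
max_stacked <- max(layer_data(stacked)$ymax) # total homes in the fullest decade
max_dodged  <- max(layer_data(dodged)$ymax)  # largest single segment
max_filled  <- max(layer_data(filled)$ymax)  # always 1: each bar is rescaled
c(stacked = max_stacked, dodged = max_dodged, filled = max_filled)
```

Stacking shows decade totals, dodging shows individual segment counts against a shared baseline, and filling discards the totals to show shares within each decade.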

Your turn: Generate each form of the bar chart and compare the differences. Which one do you think is the most informative?

Tip

Read the documentation for geom_bar() to identify an appropriate argument for specifying each type of bar chart.

ggplot(data = tompkins_beds, mapping = aes(x = decade_built_cat, fill = beds)) +
  geom_bar() +
  labs(
    title = "Distribution of number of bedrooms",
    subtitle = "Properties sold in Tompkins County, NY (2022-23)",
    x = "Decade built",
    y = NULL,
    fill = "Number of bedrooms"
  )

ggplot(data = tompkins_beds, mapping = aes(x = decade_built_cat, fill = beds)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Distribution of number of bedrooms",
    subtitle = "Properties sold in Tompkins County, NY (2022-23)",
    x = "Decade built",
    y = NULL,
    fill = "Number of bedrooms"
  )

ggplot(data = tompkins_beds, mapping = aes(x = decade_built_cat, fill = beds)) +
  geom_bar(position = "fill") +
  labs(
    title = "Distribution of number of bedrooms",
    subtitle = "Properties sold in Tompkins County, NY (2022-23)",
    x = "Decade built",
    y = NULL,
    fill = "Number of bedrooms"
  )
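The proportions drawn by position = "fill" can also be computed by hand, which is useful when you want to report the numbers alongside the chart. The following is a sketch on hypothetical data; the toy tibble and the column name prop are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical data: 3 homes from the 1950s, 2 from the 1960s
toy <- tibble(
  decade = c("1950", "1950", "1950", "1960", "1960"),
  beds   = c("2", "3", "3", "2", "2")
)

props <- toy |>
  # frequency of each decade/bedroom combination
  count(decade, beds) |>
  # divide by the decade total to get the share within each decade
  group_by(decade) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

props
```

Within each decade the prop values sum to 1, matching the full-height bars in the relative frequency chart.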

Part 2

Now let’s evaluate the typical property size (area) by the decade in which the property was built. We will start by summarizing the data and then visualize the results using a bar chart and a boxplot.

mean_area_decade <- tompkins |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))
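One thing to watch for with mean() is missing data: a single NA in a group makes that group's mean NA unless na.rm = TRUE is set. Whether area contains missing values in the Tompkins data is not stated here, so the sketch below uses a hypothetical toy tibble to show the behavior.

```r
library(dplyr)

# Hypothetical data with one missing area in the 1960 group
toy <- tibble(
  decade_built_cat = c("1950", "1950", "1960", "1960", "1960"),
  area = c(1000, 2000, 1500, NA, 2500)
)

area_summary <- toy |>
  group_by(decade_built_cat) |>
  summarize(
    mean_area = mean(area),                      # NA if any value is missing
    mean_area_no_na = mean(area, na.rm = TRUE),  # ignores missing values
    n_missing = sum(is.na(area))
  )

area_summary
```

Reporting n_missing next to the mean makes it clear how many observations were dropped from each group.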

Your turn: Visualize the property size by the decade in which the property was built. Construct a bar chart reporting the average property size, as well as a boxplot, violin plot, and strip chart (e.g. jittered scatterplot). What does each graph tell you about the distribution of property size by decade built? Which ones do you find to be more or less effective?

ggplot(data = mean_area_decade, mapping = aes(x = decade_built_cat, y = mean_area)) +
  geom_col() +
  labs(
    title = "Average property size by decade built",
    x = "Decade built",
    y = "Average area (sq. ft)"
  )

ggplot(data = tompkins, mapping = aes(x = decade_built_cat, y = area)) +
  geom_boxplot() +
  labs(
    title = "Distribution of property size by decade built",
    x = "Decade built",
    y = "Area (sq. ft)"
  )

ggplot(data = tompkins, mapping = aes(x = decade_built_cat, y = area)) +
  geom_violin() +
  labs(
    title = "Distribution of property size by decade built",
    x = "Decade built",
    y = "Area (sq. ft)"
  )

set.seed(123) # for reproducibility
ggplot(data = tompkins, mapping = aes(x = decade_built_cat, y = area)) +
  geom_jitter(alpha = 0.3) +
  labs(
    title = "Distribution of property size by decade built",
    x = "Decade built",
    y = "Area (sq. ft)"
  )
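When comparing these charts, it helps to remember that a bar chart of means reduces each group to a single number. The sketch below uses two hypothetical groups (labeled A and B, with made-up areas) that would produce identical bars but very different boxplots, violin plots, and strip charts.

```r
library(dplyr)

# Hypothetical areas: group A is tightly clustered, group B is skewed
toy <- tibble(
  group = rep(c("A", "B"), each = 5),
  area = c(1900, 1950, 2000, 2050, 2100,
           1000, 1200, 1500, 1800, 4500)
)

spread <- toy |>
  group_by(group) |>
  summarize(
    mean_area = mean(area),
    median_area = median(area),
    max_area = max(area)
  )

spread
```

Both groups have a mean of 2000, yet group B has a lower median and one very large property, a difference that only the distribution-based charts would reveal.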

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Ventura 13.5.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-02-03
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.1)
 labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rprojroot     2.0.4   2023-11-05 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 scales        1.2.1   2024-01-18 [1] Github (r-lib/scales@c8eb772)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.1)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
 vroom         1.6.5   2023-12-05 [1] CRAN (R 4.3.1)
 withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.1)
 xfun          0.41    2023-11-01 [1] CRAN (R 4.3.1)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. Data source: Redfin.

  2. Or horizontally for a horizontal bar chart.