Considering the data-ink ratio: The lollipop chart

Application exercise

Modified

January 29, 2024

Important

Go to the course GitHub organization and locate the repo titled ae-01-YOUR_GITHUB_USERNAME to get started.

This AE is due January 31 at 11:59pm.

library(tidyverse)

# set default theme to minimal - reduce extraneous background ink
theme_set(theme_minimal())

For the following exercises we will work with data on houses that were sold in Tompkins County, NY in 2022 and 2023.¹

The variables include:

property_type - type of property (e.g. single family residential, townhouse, condo)
address - street address of property
city - city of property
state - state of property (all are New York)
zip_code - ZIP code of property
price - sale price (in dollars)
beds - number of bedrooms
baths - number of bathrooms. Full bathrooms with shower/toilet count as 1, bathrooms with just a toilet count as 0.5.
area - living area of the home (in square feet)
lot_size - size of property’s lot (in acres)
year_built - year home was built
hoa_month - monthly HOA dues. If the property is not part of an HOA, then the value is NA

The dataset can be found in the data folder of your repo. It is called tompkins-home-sales.csv. We will first import the data, and then create a new variable, decade_built_cat, which identifies the decade in which each home was built.

tompkins <- read_csv("data/tompkins-home-sales.csv")

Average home size by decade

Let’s examine the average size of homes recently sold in Tompkins County by their age. To simplify this task, we will group the homes by decade of construction, with catch-all categories for homes built in the 1940s or earlier and in the 1990s or later. Then we will calculate the average size of homes sold in each group.

# create decade variable
tompkins <- tompkins |>
  mutate(
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )

# calculate mean area by decade
mean_area_decade <- tompkins |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))
mean_area_decade

# A tibble: 6 × 2
  decade_built_cat mean_area
  <chr>                <dbl>
1 1940 or before       1872.
2 1950                 1645.
3 1960                 1874.
4 1970                 1908.
5 1980                 1852.
6 1990 or after        2226.
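
As a quick sanity check on the binning logic, the sketch below applies the same integer-division and case_when() steps to a handful of made-up construction years. The toy data frame and its years are hypothetical and not part of the course data. It shows that a home built in 1949 is floored to the 1940 decade and therefore lands in the "1940 or before" category.

```r
library(dplyr)

# hypothetical construction years, chosen to hit the category boundaries
toy <- tibble(year_built = c(1895, 1940, 1949, 1950, 1987, 1990, 2021)) |>
  mutate(
    # integer division drops the last digit; multiplying by 10 restores the decade
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )
toy
```

Because the comparison is made on the floored decade rather than on the year itself, every home built through 1949 is grouped with the oldest homes.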

Visualizing the data as a bar chart

A conventional approach to visualizing this data is a bar chart. Since we already calculated the average area, we can use geom_col() to create the bar chart. We also graph it horizontally to avoid overlapping labels for the decades.

ggplot(
  data = mean_area_decade,
  mapping = aes(x = mean_area, y = decade_built_cat)
) +
  geom_col() +
  labs(
    x = "Mean area (square feet)", y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )

Visualizing the data as a dot plot

The bar chart has a low data-ink ratio. Only the end of each bar encodes the value, so most of the ink in the bars is not necessary to convey the information. We can use a dot plot instead. The dot plot is a variation of the bar chart in which the bars are replaced by dots. It is a (potentially) better choice because it uses less ink to convey the same information.

ggplot(
  data = mean_area_decade,
  mapping = aes(x = mean_area, y = decade_built_cat)
) +
  geom_point(size = 4) +
  labs(
    x = "Mean area (square feet)", y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )

The dot plot maximizes the data-ink ratio, but it is not perfect. Unlike with a bar chart, there is no expectation that the \(x\)-axis starts at 0. The relative distance between the dots communicates the difference in mean area, so compared to the bar chart, the differences in mean area look exaggerated.
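
To get a feel for how much the axis range affects this impression, one option is to redraw the dot plot with the axis stretched to include 0. The sketch below assumes the rounded means printed in the summary table earlier (retyped by hand, so they are approximate) and uses expand_limits() from ggplot2. The object names means_rounded, area_ratio, and p_zero are illustrative only.

```r
library(ggplot2)

# rounded means retyped from the printed summary table (approximate values)
means_rounded <- data.frame(
  decade_built_cat = c(
    "1940 or before", "1950", "1960", "1970", "1980", "1990 or after"
  ),
  mean_area = c(1872, 1645, 1874, 1908, 1852, 2226)
)

# ratio of the largest to the smallest mean area (about 1.35)
area_ratio <- max(means_rounded$mean_area) / min(means_rounded$mean_area)
area_ratio

# same dot plot, but with the x-axis forced to include 0
p_zero <- ggplot(
  data = means_rounded,
  mapping = aes(x = mean_area, y = decade_built_cat)
) +
  geom_point(size = 4) +
  expand_limits(x = 0) +
  labs(
    x = "Mean area (square feet)", y = "Decade built",
    title = "Mean area of houses in Tompkins County, by decade built"
  )
p_zero
```

With 0 included, each dot sits at a distance proportional to its value, so the roughly 35% gap between the largest and smallest means reads as a 35% gap.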

Visualizing the data as a lollipop chart

The lollipop chart is a happy compromise. It uses a thin line ending in a dot to communicate each value, so it keeps the bar chart’s baseline at 0 while using far less ink.

Tip

Try to construct the chart without using geom_col(). It is possible with geom_col(), but you would have to spend more time tweaking some of the function’s parameters so it looks appropriate.

There is another geom_*() that works pretty well here.

# add code here

Footnotes

¹ Data source: Redfin.