Visualizing spatial data II

Notes
Modified

May 22, 2026

NoteLearning objectives
  • Define vector files for storing geospatial data
  • Import spatial data using the {sf} package
  • Generate maps using {ggplot2} and geom_sf()
  • Transform coordinate systems

Vector data formats

Raster maps provide a photographic background but give no access to the underlying geography. For complete control over the geographic elements of your geospatial visualization, you need the geographic boundaries as vector data: a file of coordinates and attributes.

Two formats are common:

Shapefile: A collection of files (.shp, .dbf, .prj) where .shp stores the geometry coordinates, .dbf stores attribute data, and .prj stores the projection. Shapefiles cannot be a single file — they are always a folder.

GeoJSON: A plain-text format using JSON notation. A single .geojson file contains both geometry and attributes. Easier to share and version-control; widely supported. For large or complex geometries, file size may become an issue since you cannot easily compress GeoJSON.

The {sf} package

Simple features

A feature is simply a thing or object that exists in the real world. A simple feature is a standardized way to represent geographic features as geometric shapes (points, lines, polygons) with associated attributes. The simple features standard defines how to encode spatial data in a consistent way across different software and formats.

The {sf} package represents geographic data as simple features in R. Simple features are stored as a data frame with a special geometry list column that holds the spatial coordinates.

Three cute fuzzy monsters adding spatial geometries to a table of attributes using glue and tape. Title text reads 'sf: spatial data...simplified.'

Image credit: Allison Horst

Features can be defined as one of several geometry types. The simplest1 are points, lines, and polygons, but can contain multiple geometries (e.g. a state with multiple islands) or even mixed geometry types in the same object.

The geometry column is sticky: it stays attached to the data frame even when you filter(), mutate(), or select() other columns. Standard {dplyr} verbs work on {sf} objects.

Importing spatial data

st_read() imports shapefiles, GeoJSON, and many other spatial formats:

nyc_shape <- st_read(dsn = "data/borough-boundaries")
nyc_json <- st_read(dsn = "data/borough-boundaries.geojson")
1
For shapefiles, pass the folder path — {sf} automatically finds the .shp, .dbf, and .prj files inside.
2
For GeoJSON, pass the .geojson file path directly.
nyc_shape
Simple feature collection with 5 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS84(DD)
  boro_code     boro_name shape_area shape_leng                       geometry
1         5 Staten Island 1623631283   325924.0 MULTIPOLYGON (((-74.05051 4...
2         2         Bronx 1187189499   463277.2 MULTIPOLYGON (((-73.89681 4...
3         1     Manhattan  636605816   359103.2 MULTIPOLYGON (((-74.01093 4...
4         3      Brooklyn 1934169229   728478.1 MULTIPOLYGON (((-73.86327 4...
5         4        Queens 3041397430   888238.6 MULTIPOLYGON (((-73.82645 4...
nyc_json
Simple feature collection with 5 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84
  boro_code     boro_name    shape_area    shape_leng                       geometry
1         5 Staten Island 1623631283.36 325924.002076 MULTIPOLYGON (((-74.05051 4...
2         2         Bronx  1187189499.3 463277.240478 MULTIPOLYGON (((-73.89681 4...
3         1     Manhattan 636605816.437 359103.151368 MULTIPOLYGON (((-74.01093 4...
4         3      Brooklyn 1934169228.83 728478.125489 MULTIPOLYGON (((-73.86327 4...
5         4        Queens 3041397430.33 888238.562635 MULTIPOLYGON (((-73.82645 4...

The sf object prints like a regular data frame, but with a header showing the geometry type, CRS (coordinate reference system), and bounding box.

Visualizing sf objects with geom_sf()

The key function for mapping {sf} objects is geom_sf(). It reads the geometry column automatically — no x or y aesthetic mapping is needed.

US state boundaries from {tigris}

The {tigris} package provides US Census geographic boundaries (states, counties, congressional districts, and more) as {sf} objects:

usa <- states(cb = TRUE)
1
cb = TRUE downloads a more detailed cartographic boundary file. If you get the simplest version, the resulting map will look very blocky.
ggplot(data = usa) +
  geom_sf()
1
geom_sf() with no aesthetic mappings draws the geometry. For polygon features, it fills with the default gray and draws the boundary with a thin black line.

Notice the resulting graph does not look like a typical US map. The default plot extent is the bounding box of all the geometries. This includes Alaska and Hawaii far to the west and south of the contiguous states (in fact, the tail-end of the Aleutian Islands is actually located in the Eastern Hemisphere). It also includes all U.S. territories. This makes the map look stretched out and hard to read.

A conventional (albeit overly simplistic) solution is to filter out Alaska and Hawaii along with the U.S. territories and just plot the “lower 48” states:

usa_48 <- usa |>
  filter(NAME %in% state.name) |>
  filter(NAME != "Alaska", NAME != "Hawaii")

ggplot(data = usa_48) +
  geom_sf(fill = "palegreen", color = "black")
1
state.name is a built-in R vector of the 50 state names.
2
fill and color set the polygon fill and border colors, just like any other {ggplot2} geom.

Repositioning Alaska and Hawaii with shift_geometry()

A more sophisticated approach is to use {tigris}’s shift_geometry() function, which moves Alaska and Hawaii to standard positions below the lower 48:

states_sf <- states(cb = TRUE) |>
  filter(NAME %in% state.name) |>
  shift_geometry()
1
shift_geometry() applies an Albers Equal Area projection to the contiguous states, scales Alaska to 50% of its actual size, and repositions both Alaska and Hawaii. This is the standard approach for US national maps.
Retrieving data for the year 2024
ggplot(data = states_sf) +
  geom_sf()

Overlaying point data

Since this is a {ggplot2} object, we can add additional layers on top using standard {ggplot2} syntax. For example, we can overlay point data with latitude/longitude coordinates using geom_point(). Let’s add the locations of homicides in New York City to a map of the five boroughs.

Direct point overlay

crimes <- read_csv("data/nyc-crimes.csv")
crimes_homicide <- crimes |>
  filter(ofns_desc == "MURDER & NON-NEGL. MANSLAUGHTER")
ggplot(data = nyc_json) +
  geom_sf() +
  geom_point(
    data = crimes_homicide,
    mapping = aes(x = longitude, y = latitude),
    shape = 1
  )
1
shape = 1 uses open circles, making it easier to see overlapping points.

Converting to an {sf} object

For spatial operations (joins, intersections, distance calculations), point data must be converted to an {sf} object using st_as_sf():

crimes_homicide_sf <- st_as_sf(
  x = crimes_homicide,
  coords = c("longitude", "latitude")
)
st_crs(crimes_homicide_sf) <- 4326
crimes_homicide_sf
1
coords names the columns containing the x (longitude) and y (latitude) coordinates.
2
st_crs() <- 4326 assigns the WGS 84 coordinate reference system (the standard CRS for GPS and most web maps). Without setting the CRS, {sf} cannot spatially align the points with other layers.
Simple feature collection with 269 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -74.08578 ymin: 40.59087 xmax: -73.73331 ymax: 40.90316
Geodetic CRS:  WGS 84
# A tibble: 269 × 6
   cmplnt_num   boro_nm cmplnt_fr_dt        law_cat_cd ofns_desc             geometry
 * <chr>        <chr>   <dttm>              <chr>      <chr>              <POINT [°]>
 1 240954923H1  BROOKL… 1977-12-20 05:00:00 FELONY     MURDER &… (-73.97448 40.68079)
 2 245958045H1  BROOKL… 2001-08-13 04:00:00 FELONY     MURDER &… (-73.92013 40.69862)
 3 8101169H6113 MANHAT… 2005-03-06 05:00:00 FELONY     MURDER &… (-73.94295 40.81126)
 4 8101169H6113 MANHAT… 2005-03-06 05:00:00 FELONY     MURDER &… (-73.94295 40.81126)
 5 16631466H89… BROOKL… 2006-05-24 04:00:00 FELONY     MURDER &…  (-73.91796 40.6703)
 6 246056367H1  QUEENS  2015-05-13 04:00:00 FELONY     MURDER &…  (-73.7491 40.60219)
 7 243507594H1  MANHAT… 2020-06-19 04:00:00 FELONY     MURDER &… (-73.95732 40.81194)
 8 243688124H1  BROOKL… 2021-01-31 05:00:00 FELONY     MURDER &…  (-73.8627 40.67093)
 9 240767513H1  BROOKL… 2021-02-17 05:00:00 FELONY     MURDER &… (-73.97582 40.61046)
10 240767512H1  BROOKL… 2021-05-24 04:00:00 FELONY     MURDER &… (-73.97582 40.61046)
# ℹ 259 more rows

Now we can add both layers using geom_sf()

ggplot() +
  geom_sf(data = nyc_json) +
  geom_sf(data = crimes_homicide_sf, shape = 1)
1
Notice geom_sf() automatically detects that the simple features are points and plots them accordingly.

Choropleth maps

A choropleth encodes a numeric variable through fill color applied to geographic polygons (states, counties, ZIP codes). The key requirement is that the polygon data and the attribute data share a common identifier to join on.

Importing data with {tidycensus}

The {tidycensus} package provides access to US Census and ACS data with geometry attached:

usa_fb <- get_acs(
  geography = "state",
  variables = c(
    total = "B05002_001",
    foreign = "B05002_013"
  ),
  year = 2024,
  output = "wide",
  geometry = TRUE
) |>
  mutate(pct_foreign = foreignE / totalE) |>
  filter(NAME %in% state.name) |>
  shift_geometry()
1
ACS variable codes identify specific survey questions. B05002_001 is total population; B05002_013 is foreign-born population.
2
geometry = TRUE attaches the {sf} polygon geometry to the returned data frame. No separate spatial file is needed.
3
pct_foreign is computed from the estimate (E) columns returned by get_acs().

Drawing the choropleth

ggplot(data = usa_fb) +
  geom_sf(mapping = aes(fill = pct_foreign))
1
Mapping pct_foreign to fill colors each state by its proportion of foreign-born residents. The default fill scale (blue gradient) is applied automatically.

We can do better than the default blue gradient. Let’s use a sequential palette from the {colorspace} package that is designed for perceptual uniformity:

ggplot(data = usa_fb) +
  geom_sf(mapping = aes(fill = pct_foreign)) +
  scale_fill_continuous_sequential(
    palette = "mako",
    labels = label_percent()
  )

Spatial aggregation

Spatial aggregation counts or summarizes point observations by which polygon they fall in. The st_join() function performs a spatial join: for each point, it identifies the containing polygon and attaches that polygon’s attributes.

nyc_json |>
  st_join(y = crimes_homicide_sf) |>
  count(boro_name)
1
st_join() matches each point in crimes_homicide_sf to the borough boundary in nyc_json that contains it.
2
count(boro_name) aggregates to borough level.
Simple feature collection with 5 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84
      boro_name   n                       geometry
1         Bronx 101 MULTIPOLYGON (((-73.89679 4...
2      Brooklyn  75 MULTIPOLYGON (((-73.86318 4...
3     Manhattan  46 MULTIPOLYGON (((-74.0086 40...
4        Queens  44 MULTIPOLYGON (((-73.82646 4...
5 Staten Island   3 MULTIPOLYGON (((-74.05054 4...

Now we can map the result directly:

nyc_json |>
  st_join(y = crimes_homicide_sf) |>
  count(boro_name) |>
  ggplot() +
  geom_sf(mapping = aes(fill = n)) +
  scale_fill_continuous_sequential(palette = "viridis") +
  labs(fill = "Homicides")

NoteComparing spatial visualization approaches

We can visualize the NYC homicide data in three ways:

Comparison of spatial visualization approaches for NYC homicide data

Comparison of spatial visualization approaches for NYC homicide data

Comparison of spatial visualization approaches for NYC homicide data

Comparison of spatial visualization approaches for NYC homicide data

Comparison of spatial visualization approaches for NYC homicide data

Comparison of spatial visualization approaches for NYC homicide data

The choropleth averages over each borough, hiding within-borough variation. The heatmap shows concentration but not exact locations. The scatterplot preserves all information but can be hard to read with many points.

There is no one inherent “correct” way to visualize spatial data. The best approach depends on the specific question, the data, and the intended audience. Always consider the trade-offs of each method and choose the one that best communicates your message.

Binning continuous data

For choropleths, continuous color gradients can be hard to compare across regions because small differences in color are difficult to perceive. Binning collapses the continuous variable into discrete categories:

ggplot(data = usa_fb) +
  geom_sf(aes(fill = pct_foreign)) +

  scale_fill_binned_sequential(
    palette = "inferno",
    n.breaks = 4,
    labels = label_percent()
  )
1
scale_fill_binned_sequential() from {colorspace} creates a binned sequential scale — the continuous variable is divided into discrete intervals, each assigned a distinct color.
2
n.breaks = 4 attempts to create four breaks (five bins). Fewer breaks make the map easier to read; more breaks preserve more detail.

By default, n.breaks chooses “nice” round break values. Set nice.breaks = FALSE to use exact quantiles instead:

ggplot(data = usa_fb) +
  geom_sf(aes(fill = pct_foreign)) +
  scale_fill_binned_sequential(
    palette = "inferno",
    n.breaks = 3,
    nice.breaks = FALSE,
    labels = label_percent()
  )
1
n.breaks = 3 creates quartile boundaries (3 cuts = 4 bins).
2
nice.breaks = FALSE places breaks exactly at the quartile boundaries rather than rounding to “nice” numbers.

Map projections

A coordinate reference system (CRS) defines how coordinates correspond to locations on the Earth’s surface. We can inherit the CRS defined for the simple features data frame, or manually transform the map to a different CRS using coord_sf(crs = ...).

Setting the CRS with coord_sf()

coord_sf(crs = ...) transforms the map to any CRS:

ggplot(data = usa_48) +
  geom_sf() +
  coord_sf(crs = 3857)
1
crs = 3857 is the Web Mercator projection used by Google Maps, OpenStreetMap, and most web tile services. It distorts area at high latitudes.

Mercator (EPSG:3857)

Mercator (EPSG:3857)
ggplot(data = usa_48) +
  geom_sf() +
  coord_sf(crs = 5070)
1
crs = 5070 is the US National Atlas Equal Area projection — a standard choice for US maps that preserves area relationships.

Albers Equal Area (EPSG:5070)

Albers Equal Area (EPSG:5070)

Finding an appropriate projection with {crsuggest}

The {crsuggest} package suggests appropriate CRS codes for a given {sf} object based on its geographic extent:

suggest_crs(nyc_json)
1
suggest_crs() ranks projected CRS options appropriate for the geographic extent of the input. The top results minimize distortion for the specific region.
# A tibble: 10 × 6
   crs_code crs_name                                  crs_type crs_gcs crs_units crs_proj4
   <chr>    <chr>                                     <chr>      <dbl> <chr>     <chr>    
 1 6539     NAD83(2011) / New York Long Island (ftUS) project…    6318 us-ft     +proj=lc…
 2 6538     NAD83(2011) / New York Long Island        project…    6318 m         +proj=lc…
 3 4456     NAD27 / New York Long Island              project…    4267 us-ft     +proj=lc…
 4 3628     NAD83(NSRS2007) / New York Long Island (… project…    4759 us-ft     +proj=lc…
 5 3627     NAD83(NSRS2007) / New York Long Island    project…    4759 m         +proj=lc…
 6 32118    NAD83 / New York Long Island              project…    4269 m         +proj=lc…
 7 2908     NAD83(HARN) / New York Long Island (ftUS) project…    4152 us-ft     +proj=lc…
 8 2831     NAD83(HARN) / New York Long Island        project…    4152 m         +proj=lc…
 9 2263     NAD83 / New York Long Island (ftUS)       project…    4269 us-ft     +proj=lc…
10 3748     NAD83(HARN) / UTM zone 18N                project…    4152 m         +proj=ut…

Bivariate choropleths with {biscale}

A bivariate choropleth encodes two variables simultaneously using a 2D color palette — each cell in the legend corresponds to a combination of values on both variables. The {biscale} package implements this scale in R.

Consider this example where we visualize simultaneously the percentage of smokers and binge drinkers in US counties:

library(biscale)

# Load county health data (smoking + binge drinking rates)
county_health <- read_csv(
  "data/PLACES__County_Data__GIS_Friendly_Format___2024_release_20250326.csv"
) |>
  select(
    GEOID = CountyFIPS,
    drinking = BINGE_AdjPrev,
    smoking = CSMOKING_AdjPrev
  )
# County geometries from tigris
county_sf <- counties(cb = TRUE) |>
  semi_join(
    y = fips_codes |> filter(state_name %in% state.name),
    by = join_by(STATEFP == state_code, COUNTYFP == county_code)
  ) |>
  shift_geometry()
1
bi_class() classifies each county into one of 9 cells (3×3 grid) based on quantiles of both variables.
2
bi_scale_fill() and bi_legend() apply the bivariate palette. The legend is a standalone plot combined with {patchwork}’s inset_element().
Retrieving data for the year 2024
# Build map
smoke_drink_map <- left_join(x = county_sf, y = county_health) |>
  bi_class(x = smoking, y = drinking, style = "quantile") |>
  ggplot() +
  geom_sf(mapping = aes(fill = bi_class), linewidth = 0.1) +
  bi_scale_fill(pal = "PinkGrn", guide = "none") +
  labs(
    title = "Smoking and excessive drinking in the United States",
    caption = "Source: Behavioral Risk Factor Surveillance System (2022)"
  ) +
  theme_map()
Joining with `by = join_by(GEOID)`
# Separate legend
smoke_drink_legend <- bi_legend(
  pal = "PinkGrn",
  dim = 3,
  xlab = "Higher % Smokers",
  ylab = "Higher % Binge Drinkers",
  size = 9
)

# Combine
smoke_drink_map +
  inset_element(
    smoke_drink_legend,
    left = 0,
    bottom = 0.3,
    right = 0.20,
    top = 0.575
  )

This is a genuinely ambitious design: encoding two continuous variables into a single map requires the reader to learn the bivariate legend before interpreting any particular county. The tradeoff - complexity in exchange for spatial co-location of two related variables - is worth it when the question is specifically about geographic co-occurrence.

Summary

  • Vector data stores geographic features (points, lines, polygons) as coordinates with attributes; common formats are shapefiles and GeoJSON
  • The {sf} package represents vector data as a data frame with a sticky geometry column; st_read() imports it
  • geom_sf() visualizes {sf} objects in {ggplot2} with no explicit x/y aesthetics required
  • shift_geometry() from {tigris} repositions Alaska and Hawaii for US national maps
  • st_as_sf() converts a data frame with coordinate columns to an {sf} object; always set the CRS with st_crs() <- 4326
  • st_join() performs spatial joins: points are matched to the polygon that contains them
  • scale_fill_binned_sequential() creates a binned choropleth that is easier to read than a continuous gradient
  • coord_sf(crs = ...) transforms the map to any CRS; use suggest_crs() to find an appropriate projection for your region

Acknowledgements

Material derived in part from Data Visualization with R and STA 313: Advanced Data Visualization.

Footnotes

  1. Pun intended.↩︎