Data wrangling (II)

Notes

Modified

May 28, 2026

Learning objectives

Explain the structure of relational data
Implement mutating joins to combine two tables
Implement filtering joins to subset rows based on a second table
Apply joins and data wrangling to build a complex visualization

Relational data

Data for real projects rarely lives in a single table. A hospital database might have separate tables for patients, visits, diagnoses, and prescriptions. A course system might have separate tables for students, enrollments, and assignments. These relational data structures organize information into multiple linked tables, where each table has a clear purpose and tables are connected by shared key variables.

Combining information across tables is called a join. The result of a join is a new table that brings together columns from two sources based on matching values in one or more key variables.

Example data

To illustrate how joins work, we’ll use two small tables: comic book characters and their publishers.

superheroes

# A tibble: 3 × 3
  name     alignment publisher    
  <chr>    <chr>     <chr>        
1 Deadpool Chaotic   Marvel       
2 Batman   Good      DC           
3 Sabrina  Good      Archie Comics

publishers

# A tibble: 3 × 2
  publisher year_founded
  <chr>            <dbl>
1 DC                1934
2 Marvel            1939
3 Image             1992

The shared variable is publisher. Three of the four publishers appear in superheroes; one (Image) does not. Three of the four characters (Deadpool, Batman, Sabrina) have publishers in the publishers table; Archie Comics is not in publishers.

Mutating joins

Mutating joins add new columns from the second table (y) to the first table (x), matching rows by a key variable. The four variants differ in how they handle non-matching rows.

`inner_join()`

inner_join() keeps only rows where the key appears in both tables. Non-matches in either table are dropped:

inner_join(x = superheroes, y = publishers, by = join_by(publisher))

1: join_by(publisher) specifies that publisher is the key column used to match rows. Since Sabrina has publisher = "Archie Comics" which is absent from publishers, and Image has no matching character, only Deadpool and Batman survive.

# A tibble: 2 × 4
  name     alignment publisher year_founded
  <chr>    <chr>     <chr>            <dbl>
1 Deadpool Chaotic   Marvel            1939
2 Batman   Good      DC                1934

`left_join()`

left_join() keeps all rows from x (the left table) and adds matched columns from y. Rows in x with no match in y get NA for the y columns:

left_join(x = superheroes, y = publishers, by = join_by(publisher))

1: All three characters are retained. Sabrina gets NA for year_founded because Archie Comics is not in publishers. The Image row from publishers is dropped since it has no matching character.

# A tibble: 3 × 4
  name     alignment publisher     year_founded
  <chr>    <chr>     <chr>                <dbl>
1 Deadpool Chaotic   Marvel                1939
2 Batman   Good      DC                    1934
3 Sabrina  Good      Archie Comics           NA

left_join() is the most commonly used join in practice: it says “for every row in my main dataset, bring in whatever information is available from this second table, and leave NA where there’s nothing to match.”

`right_join()`

right_join() keeps all rows from y (the right table). It is equivalent to swapping the order of the tables and using left_join():

right_join(x = superheroes, y = publishers, by = join_by(publisher))

1: All three publishers are retained. Image appears with NA for name and alignment since no character maps to it.

# A tibble: 3 × 4
  name     alignment publisher year_founded
  <chr>    <chr>     <chr>            <dbl>
1 Deadpool Chaotic   Marvel            1939
2 Batman   Good      DC                1934
3 <NA>     <NA>      Image             1992

`full_join()`

full_join() keeps all rows from both tables, filling with NA wherever there is no match:

full_join(x = superheroes, y = publishers, by = join_by(publisher))

1: Every character and every publisher appears. Sabrina gets NA for year_founded; Image gets NA for name and alignment.

# A tibble: 4 × 4
  name     alignment publisher     year_founded
  <chr>    <chr>     <chr>                <dbl>
1 Deadpool Chaotic   Marvel                1939
2 Batman   Good      DC                    1934
3 Sabrina  Good      Archie Comics           NA
4 <NA>     <NA>      Image                 1992

Filtering joins

Filtering joins do not add new columns. Instead, they use the second table to decide which rows of the first table to keep — without modifying the structure of x.

`semi_join()`

semi_join() keeps rows from x that have at least one match in y:

semi_join(x = superheroes, y = publishers, by = join_by(publisher))

1: Only Deadpool and Batman are kept — the characters whose publishers appear in the publishers table. Sabrina is dropped. No columns from publishers are added to the result.

# A tibble: 2 × 3
  name     alignment publisher
  <chr>    <chr>     <chr>    
1 Deadpool Chaotic   Marvel   
2 Batman   Good      DC

semi_join() is useful for checking data quality (“which records in x have a matching record in y?”) or for subsetting a large dataset based on a lookup table.

`anti_join()`

anti_join() keeps rows from x that have no match in y — the complement of semi_join():

anti_join(x = superheroes, y = publishers, by = join_by(publisher))

1: Only Sabrina is kept — the one character whose publisher is not in the publishers table. This is useful for diagnosing unmatched records or orphaned rows.

# A tibble: 1 × 3
  name    alignment publisher    
  <chr>   <chr>     <chr>        
1 Sabrina Good      Archie Comics

Choosing a join

Join	Rows kept	Adds columns?
`inner_join()`	Rows in both x and y	Yes
`left_join()`	All rows in x	Yes
`right_join()`	All rows in y	Yes
`full_join()`	All rows in x and y	Yes
`semi_join()`	Rows in x matched by y	No
`anti_join()`	Rows in x unmatched by y	No

In most visualization workflows, left_join() is the right default: start with your primary dataset and enrich it with information from a lookup or reference table. Use inner_join() when you genuinely want to restrict to only matched rows. Use semi_join() and anti_join() to diagnose and debug.

Joining on multiple keys

Sometimes two tables share more than one variable that together identify a unique match. Pass multiple variables to join_by():

# hypothetical example — joining on two keys
inner_join(x, y, by = join_by(first_name, last_name))

When key column names differ between tables, specify the mapping explicitly:

inner_join(x, y, by = join_by(publisher == company_name))

Summary

Relational data structures split information across multiple linked tables; joins combine them on shared key variables
Mutating joins (inner_join(), left_join(), right_join(), full_join()) add columns from the second table, differing in which non-matching rows they retain
Filtering joins (semi_join(), anti_join()) keep or drop rows from the first table based on matches in the second, without adding new columns
left_join() is the most common choice: keep all rows from your primary dataset and enrich with available data from a second table

Acknowledgements

Material derived in part from STA 313: Advanced Data Visualization and R for Data Science.