Data wrangling (II)

Notes
Modified

May 19, 2026

NoteLearning objectives
  • Explain the structure of relational data
  • Implement mutating joins to combine two tables
  • Implement filtering joins to subset rows based on a second table
  • Apply joins and data wrangling to build a complex visualization

Relational data

Data for real projects rarely lives in a single table. A hospital database might have separate tables for patients, visits, diagnoses, and prescriptions. A course system might have separate tables for students, enrollments, and assignments. These relational data structures organize information into multiple linked tables, where each table has a clear purpose and tables are connected by shared key variables.

Combining information across tables is called a join. The result of a join is a new table that brings together columns from two sources based on matching values in one or more key variables.

Example data

To illustrate how joins work, we’ll use two small tables: comic book characters and their publishers.

superheroes
# A tibble: 3 × 3
  name     alignment publisher    
  <chr>    <chr>     <chr>        
1 Deadpool Chaotic   Marvel       
2 Batman   Good      DC           
3 Sabrina  Good      Archie Comics
publishers
# A tibble: 3 × 2
  publisher year_founded
  <chr>            <dbl>
1 DC                1934
2 Marvel            1939
3 Image             1992

The shared variable is publisher. Three of the four publishers appear in superheroes; one (Image) does not. Three of the four characters (Deadpool, Batman, Sabrina) have publishers in the publishers table; Archie Comics is not in publishers.

Mutating joins

Mutating joins add new columns from the second table (y) to the first table (x), matching rows by a key variable. The four variants differ in how they handle non-matching rows.

inner_join()

inner_join() keeps only rows where the key appears in both tables. Non-matches in either table are dropped:

inner_join(x = superheroes, y = publishers, by = join_by(publisher))
1
join_by(publisher) specifies that publisher is the key column used to match rows. Since Sabrina has publisher = "Archie Comics" which is absent from publishers, and Image has no matching character, only Deadpool and Batman survive.
# A tibble: 2 × 4
  name     alignment publisher year_founded
  <chr>    <chr>     <chr>            <dbl>
1 Deadpool Chaotic   Marvel            1939
2 Batman   Good      DC                1934

left_join()

left_join() keeps all rows from x (the left table) and adds matched columns from y. Rows in x with no match in y get NA for the y columns:

left_join(x = superheroes, y = publishers, by = join_by(publisher))
1
All three characters are retained. Sabrina gets NA for year_founded because Archie Comics is not in publishers. The Image row from publishers is dropped since it has no matching character.
# A tibble: 3 × 4
  name     alignment publisher     year_founded
  <chr>    <chr>     <chr>                <dbl>
1 Deadpool Chaotic   Marvel                1939
2 Batman   Good      DC                    1934
3 Sabrina  Good      Archie Comics           NA

left_join() is the most commonly used join in practice: it says “for every row in my main dataset, bring in whatever information is available from this second table, and leave NA where there’s nothing to match.”

right_join()

right_join() keeps all rows from y (the right table). It is equivalent to swapping the order of the tables and using left_join():

right_join(x = superheroes, y = publishers, by = join_by(publisher))
1
All three publishers are retained. Image appears with NA for name and alignment since no character maps to it.
# A tibble: 3 × 4
  name     alignment publisher year_founded
  <chr>    <chr>     <chr>            <dbl>
1 Deadpool Chaotic   Marvel            1939
2 Batman   Good      DC                1934
3 <NA>     <NA>      Image             1992

full_join()

full_join() keeps all rows from both tables, filling with NA wherever there is no match:

full_join(x = superheroes, y = publishers, by = join_by(publisher))
1
Every character and every publisher appears. Sabrina gets NA for year_founded; Image gets NA for name and alignment.
# A tibble: 4 × 4
  name     alignment publisher     year_founded
  <chr>    <chr>     <chr>                <dbl>
1 Deadpool Chaotic   Marvel                1939
2 Batman   Good      DC                    1934
3 Sabrina  Good      Archie Comics           NA
4 <NA>     <NA>      Image                 1992

Filtering joins

Filtering joins do not add new columns. Instead, they use the second table to decide which rows of the first table to keep — without modifying the structure of x.

semi_join()

semi_join() keeps rows from x that have at least one match in y:

semi_join(x = superheroes, y = publishers, by = join_by(publisher))
1
Only Deadpool and Batman are kept — the characters whose publishers appear in the publishers table. Sabrina is dropped. No columns from publishers are added to the result.
# A tibble: 2 × 3
  name     alignment publisher
  <chr>    <chr>     <chr>    
1 Deadpool Chaotic   Marvel   
2 Batman   Good      DC       

semi_join() is useful for checking data quality (“which records in x have a matching record in y?”) or for subsetting a large dataset based on a lookup table.

anti_join()

anti_join() keeps rows from x that have no match in y — the complement of semi_join():

anti_join(x = superheroes, y = publishers, by = join_by(publisher))
1
Only Sabrina is kept — the one character whose publisher is not in the publishers table. This is useful for diagnosing unmatched records or orphaned rows.
# A tibble: 1 × 3
  name    alignment publisher    
  <chr>   <chr>     <chr>        
1 Sabrina Good      Archie Comics

Choosing a join

Join Rows kept Adds columns?
inner_join() Rows in both x and y Yes
left_join() All rows in x Yes
right_join() All rows in y Yes
full_join() All rows in x and y Yes
semi_join() Rows in x matched by y No
anti_join() Rows in x unmatched by y No

In most visualization workflows, left_join() is the right default: start with your primary dataset and enrich it with information from a lookup or reference table. Use inner_join() when you genuinely want to restrict to only matched rows. Use semi_join() and anti_join() to diagnose and debug.

Joining on multiple keys

Sometimes two tables share more than one variable that together identify a unique match. Pass multiple variables to join_by():

# hypothetical example — joining on two keys
inner_join(x, y, by = join_by(first_name, last_name))

When key column names differ between tables, specify the mapping explicitly:

inner_join(x, y, by = join_by(publisher == company_name))

Summary

  • Relational data structures split information across multiple linked tables; joins combine them on shared key variables
  • Mutating joins (inner_join(), left_join(), right_join(), full_join()) add columns from the second table, differing in which non-matching rows they retain
  • Filtering joins (semi_join(), anti_join()) keep or drop rows from the first table based on matches in the second, without adding new columns
  • left_join() is the most common choice: keep all rows from your primary dataset and enrich with available data from a second table

Acknowledgements

Material derived in part from STA 313: Advanced Data Visualization and R for Data Science.