Implement filtering joins to subset rows based on a second table
Apply joins and data wrangling to build a complex visualization
Relational data
Data for real projects rarely lives in a single table. A hospital database might have separate tables for patients, visits, diagnoses, and prescriptions. A course system might have separate tables for students, enrollments, and assignments. These relational data structures organize information into multiple linked tables, where each table has a clear purpose and tables are connected by shared key variables.
Combining information across tables is called a join. The result of a join is a new table that brings together columns from two sources based on matching values in one or more key variables.
Example data
To illustrate how joins work, we’ll use two small tables: comic book characters and their publishers.
superheroes
# A tibble: 3 × 3
name alignment publisher
<chr> <chr> <chr>
1 Deadpool Chaotic Marvel
2 Batman Good DC
3 Sabrina Good Archie Comics
publishers
# A tibble: 3 × 2
publisher year_founded
<chr> <dbl>
1 DC 1934
2 Marvel 1939
3 Image 1992
The shared variable is publisher. Three of the four publishers appear in superheroes; one (Image) does not. Three of the four characters (Deadpool, Batman, Sabrina) have publishers in the publishers table; Archie Comics is not in publishers.
Mutating joins
Mutating joins add new columns from the second table (y) to the first table (x), matching rows by a key variable. The four variants differ in how they handle non-matching rows.
inner_join()
inner_join() keeps only rows where the key appears in both tables. Non-matches in either table are dropped:
inner_join(x = superheroes, y = publishers, by =join_by(publisher))
1
join_by(publisher) specifies that publisher is the key column used to match rows. Since Sabrina has publisher = "Archie Comics" which is absent from publishers, and Image has no matching character, only Deadpool and Batman survive.
# A tibble: 2 × 4
name alignment publisher year_founded
<chr> <chr> <chr> <dbl>
1 Deadpool Chaotic Marvel 1939
2 Batman Good DC 1934
left_join()
left_join() keeps all rows from x (the left table) and adds matched columns from y. Rows in x with no match in y get NA for the y columns:
left_join(x = superheroes, y = publishers, by =join_by(publisher))
1
All three characters are retained. Sabrina gets NA for year_founded because Archie Comics is not in publishers. The Image row from publishers is dropped since it has no matching character.
# A tibble: 3 × 4
name alignment publisher year_founded
<chr> <chr> <chr> <dbl>
1 Deadpool Chaotic Marvel 1939
2 Batman Good DC 1934
3 Sabrina Good Archie Comics NA
left_join() is the most commonly used join in practice: it says “for every row in my main dataset, bring in whatever information is available from this second table, and leave NA where there’s nothing to match.”
right_join()
right_join() keeps all rows from y (the right table). It is equivalent to swapping the order of the tables and using left_join():
right_join(x = superheroes, y = publishers, by =join_by(publisher))
1
All three publishers are retained. Image appears with NA for name and alignment since no character maps to it.
# A tibble: 3 × 4
name alignment publisher year_founded
<chr> <chr> <chr> <dbl>
1 Deadpool Chaotic Marvel 1939
2 Batman Good DC 1934
3 <NA> <NA> Image 1992
full_join()
full_join() keeps all rows from both tables, filling with NA wherever there is no match:
full_join(x = superheroes, y = publishers, by =join_by(publisher))
1
Every character and every publisher appears. Sabrina gets NA for year_founded; Image gets NA for name and alignment.
# A tibble: 4 × 4
name alignment publisher year_founded
<chr> <chr> <chr> <dbl>
1 Deadpool Chaotic Marvel 1939
2 Batman Good DC 1934
3 Sabrina Good Archie Comics NA
4 <NA> <NA> Image 1992
Filtering joins
Filtering joins do not add new columns. Instead, they use the second table to decide which rows of the first table to keep — without modifying the structure of x.
semi_join()
semi_join() keeps rows from x that have at least one match in y:
semi_join(x = superheroes, y = publishers, by =join_by(publisher))
1
Only Deadpool and Batman are kept — the characters whose publishers appear in the publishers table. Sabrina is dropped. No columns from publishers are added to the result.
# A tibble: 2 × 3
name alignment publisher
<chr> <chr> <chr>
1 Deadpool Chaotic Marvel
2 Batman Good DC
semi_join() is useful for checking data quality (“which records in x have a matching record in y?”) or for subsetting a large dataset based on a lookup table.
anti_join()
anti_join() keeps rows from x that have no match in y — the complement of semi_join():
anti_join(x = superheroes, y = publishers, by =join_by(publisher))
1
Only Sabrina is kept — the one character whose publisher is not in the publishers table. This is useful for diagnosing unmatched records or orphaned rows.
# A tibble: 1 × 3
name alignment publisher
<chr> <chr> <chr>
1 Sabrina Good Archie Comics
Choosing a join
Join
Rows kept
Adds columns?
inner_join()
Rows in both x and y
Yes
left_join()
All rows in x
Yes
right_join()
All rows in y
Yes
full_join()
All rows in x and y
Yes
semi_join()
Rows in x matched by y
No
anti_join()
Rows in x unmatched by y
No
In most visualization workflows, left_join() is the right default: start with your primary dataset and enrich it with information from a lookup or reference table. Use inner_join() when you genuinely want to restrict to only matched rows. Use semi_join() and anti_join() to diagnose and debug.
Joining on multiple keys
Sometimes two tables share more than one variable that together identify a unique match. Pass multiple variables to join_by():
# hypothetical example — joining on two keysinner_join(x, y, by =join_by(first_name, last_name))
When key column names differ between tables, specify the mapping explicitly:
inner_join(x, y, by =join_by(publisher == company_name))
Summary
Relational data structures split information across multiple linked tables; joins combine them on shared key variables
Mutating joins (inner_join(), left_join(), right_join(), full_join()) add columns from the second table, differing in which non-matching rows they retain
Filtering joins (semi_join(), anti_join()) keep or drop rows from the first table based on matches in the second, without adding new columns
left_join() is the most common choice: keep all rows from your primary dataset and enrich with available data from a second table