Data wrangling (II)

Lecture 8

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2025

February 13, 2025

Announcements

Announcements

  • Draft proposal rendered + committed + pushed to GitHub by 11:59pm tonight
  • Peer evaluations tomorrow 💘 before you depart for February Break
  • Homework 3 distributed tomorrow, due the 26th

Visualization critique

Which is a better chart?

Design a better plot

ae-06

Design a better plot

Create a chart that emphasizes the increased reliance of part-time faculty in academia.

20:00

Relational data structures

Introduction to relational data

  • Multiple tables of data that when combined together accomplish goals
  • Relations define the important element, not just the individual tables
  • Relations are defined between a pair of tables
  • Relational verbs
    • Mutating joins
    • Filtering joins

Comic book characters

Name Alignment Gender Publisher
Deadpool Chaotic Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics

Publishers

Publisher Year founded
DC 1934
Marvel 1939
Image 1992

Mutating joins

inner_join()

inner_join()

inner_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 2 × 5
  Name     Alignment Gender Publisher `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>               <dbl>
1 Deadpool Chaotic   Male   Marvel               1939
2 Batman   Good      Male   DC                   1934

left_join()

left_join()

left_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 3 × 5
  Name     Alignment Gender Publisher        `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>                      <dbl>
1 Deadpool Chaotic   Male   "Marvel"                    1939
2 Batman   Good      Male   "DC"                        1934
3 Sabrina  Good      Female "Archie\nComics"              NA

right_join()

right_join()

right_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 3 × 5
  Name     Alignment Gender Publisher `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>               <dbl>
1 Deadpool Chaotic   Male   Marvel               1939
2 Batman   Good      Male   DC                   1934
3 <NA>     <NA>      <NA>   Image                1992

right_join() reversed

full_join()

full_join()

full_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 4 × 5
  Name     Alignment Gender Publisher        `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>                      <dbl>
1 Deadpool Chaotic   Male   "Marvel"                    1939
2 Batman   Good      Male   "DC"                        1934
3 Sabrina  Good      Female "Archie\nComics"              NA
4 <NA>     <NA>      <NA>   "Image"                     1992

Filtering joins

semi_join()

semi_join()

semi_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 2 × 4
  Name     Alignment Gender Publisher
  <chr>    <chr>     <chr>  <chr>    
1 Deadpool Chaotic   Male   Marvel   
2 Batman   Good      Male   DC       

anti_join()

anti_join()

anti_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 1 × 4
  Name    Alignment Gender Publisher       
  <chr>   <chr>     <chr>  <chr>           
1 Sabrina Good      Female "Archie\nComics"

Application exercise

Ivy League representation on the judiciary

ae-07

Instructions

  • Go to the course GitHub org and find your ae-07 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Wrap up

Recap

  • Data is often split across multiple tables and we need to bring them together to analyze and visualize
  • Use *_join() functions to combine two tables
  • Continue to practice data wrangling and transformation operations

Fondue for two