HW 03 - Viz + wrangling

Homework
Modified

February 21, 2024

Important

This homework is due Friday, February 23 at 11:59pm ET.

Getting started

  • Go to the info3312-sp24 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new project in RStudio.

Packages

library(tidyverse)
library(scales)
library(ggridges)
library(fs)
library(janitor)
library(emojifont)

Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Note

Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. You may lose points if you do not.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Turn in an organized, well-formatted document.

Exercise 1

A new day, a new plot, a new geom. The goal of this exercise is to learn about a new type of plot (ridgeline plot) and to learn how to make it.

Social scientists often use vote-based measures of political ideology to study legislative behavior. The data you will use is from the NOMINATE scores, which are a common way to measure the ideology of members of Congress. You can find individual NOMINATE scores for every legislator from every term of the U.S. Congress since the 1st Congress in 1789 in HSall_members.csv.

You will use the “first dimension” scores (nominate_dim1), which in modern times are interpreted as identifying political ideology. Negative scores are interpreted as “liberal”, and positive scores are interpreted as “conservative”. Use an appropriate function from the ggridges package to make a ridge plot of partisan polarization in the U.S. Congress. Focus only on terms of Congress since 1945, and make sure to separately visualize the House of Representatives and the Senate.

Tip

Terms of Congress last for two years and are identified sequentially (e.g., the 1st Congress ran from 1789-91, the 2nd Congress from 1791-93, etc.). The congress variable in the data set is a numeric variable that identifies the term of Congress. For interpretability, it would be helpful to instead label the graph based on the years of the term of Congress (e.g., 1945-1947, 1947-1949, etc.). Feel free to decide how best to generate these labels.
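One possible way to generate such labels is sketched below. It relies only on the fact stated above that the 1st Congress began in 1789 and that each term lasts two years. The helper name congress_years() is hypothetical, and this is just one of several reasonable approaches.

```r
# Hypothetical helper: convert a Congress number into a "start-end" year label.
# Assumes the 1st Congress started in 1789 and every term lasts two years.
congress_years <- function(congress) {
  start <- 1789 + 2 * (congress - 1)
  paste0(start, "-", start + 2)
}

congress_years(c(1, 79, 80))
# "1789-1791" "1945-1947" "1947-1949"
```

A function like this could be applied inside mutate() to create a label column before plotting.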

Also include an interpretation for your visualization. You should review feedback from your homeworks 1 and 2 to make sure you capture anything you may have missed previously.

Note

This is not a geom we introduced in class, so seeing an example of it in action will be helpful. Read the package README at https://wilkelab.org/ggridges and/or the introduction vignette at https://wilkelab.org/ggridges/articles/introduction.html. There is more information than you need for this question in the vignette; the first section on Geoms should be sufficient to help you get started.
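For orientation, here is a minimal sketch of a ridgeline plot using the built-in iris data rather than the homework data. It only illustrates the general shape of a geom_density_ridges() call (a numeric variable on x, a grouping variable on y); the variable choices and labels are for illustration and are not a template for the answer.

```r
library(ggplot2)
library(ggridges)

# Ridgeline plot on a built-in data set: one density ridge per species
p <- ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges() +
  labs(
    title = "Distribution of sepal length by species",
    x = "Sepal length (cm)",
    y = NULL
  )

p
```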

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Key lyme pie. The goal of this exercise is to recreate a pie chart in R and then improve it by presenting the same information as a bar graph. The pie chart to recreate is below and it comes from the Lyme Disease Association.

Bar chart of 2018 US reported lyme disease cases featuring top 15 states

Below are the steps I recommend you follow and some guidance on what (not) to worry about:

  • First, create the data frame: Use the annotations in the visualization provided to do this. You should create the new data frame using the tibble() or the tribble() functions.

  • Then, recreate the pie chart: When recreating the pie chart you do not need to

    • make it a 3D pie chart (2D is sufficient)
    • match the colors (default ggplot2 colors or any other color palette is fine)
    • annotate the plot in the same way (just the legend is sufficient)
    • match the entire caption (see below for what we want you to match)

    However, you should:

    • make a 2D pie chart
    • present a legend on the right that shows the mapping of the colors to states
    • match the title text, location, and alignment
    • match the text, location, and alignment of the first two lines of the caption
  • Finally, improve the visualization by presenting this information in the form of a bar graph. And as an additional challenge, imagine you’re working for the state of Connecticut, so highlight that bar corresponding to that state in some way. Write a sentence or two describing why you chose to highlight the Connecticut info the way you did.
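The general mechanics of these steps can be sketched on toy data. The categories and counts below are made up purely for illustration and are not the Lyme disease figures. The sketch assumes the common ggplot2 approach of drawing a pie chart as a stacked bar in polar coordinates, and shows one of many ways to highlight a single bar.

```r
library(ggplot2)
library(tibble)

# Hypothetical categories and counts, made up for illustration only
toy <- tribble(
  ~group, ~n,
  "A",    50,
  "B",    30,
  "C",    20
)

# A 2D pie chart is a single stacked bar drawn in polar coordinates
pie <- ggplot(toy, aes(x = "", y = n, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Toy pie chart", fill = "Group")

# Bar chart of the same data, highlighting one category ("B") by color
bars <- ggplot(toy, aes(x = n, y = reorder(group, n), fill = group == "B")) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "firebrick", "FALSE" = "gray70")) +
  labs(title = "Toy bar chart with one bar highlighted", x = "Count", y = NULL)
```

Mapping fill to a logical condition keeps the highlighting logic in the plot code itself, so it still works if the data are reordered.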

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3

Foreign Connected PACs. Only American citizens (and immigrants with green cards) can contribute to federal politics, but the American divisions of foreign companies can form political action committees (PACs) and collect contributions from their American employees. (Source: https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024).

In this exercise you will work with data from contributions to US political parties from foreign-connected PACs. The data is stored in CSV files in the data directory of your repository/project. There are 11 files, each for an election cycle between 2000 and 2022. You can load all of the data at once using the code below.

# get a list of files with "Foreign Connected PAC" in their names
list_of_files <- dir_ls(path = "data", regexp = "Foreign Connected PAC")

# read all files and row bind them
# keeping track of the file name in a new column called year
pac <- read_csv(list_of_files, id = "year")

The ultimate goal of this exercise is to recreate yet another plot. But there is a nontrivial amount of data wrangling and tidying that needs to happen before you can do that. Below are the steps you should follow so that you can obtain the necessary interim objects we will be looking for as we review your work.

  • First, clean the names of the variables in the dataset with a new function from the janitor package: clean_names(). Then clean and transform the data such that you have something like the following at the end.

    # A tibble: 2,402 × 6
        year pac_name_affiliate                    country_of_origin parent_company      dems repubs
       <dbl> <chr>                                 <chr>             <chr>              <dbl>  <dbl>
     1  2000 7-Eleven                              Japan             Ito-Yokado          1500   7000
     2  2000 ABB Group                             Switzerland       Asea Brown Boveri  17000  28500
     3  2000 Accenture                             UK                Accenture plc      23000  52984
     4  2000 ACE INA                               UK                ACE Group          12500  26000
     5  2000 Acuson Corp (Siemens AG)              Germany           Siemens AG          2000      0
     6  2000 Adtranz (DaimlerChrysler)             Germany           DaimlerChrysler AG 10000    500
     7  2000 AE Staley Manufacturing (Tate & Lyle) UK                Tate & Lyle        10000  14000
     8  2000 AEGON USA (AEGON NV)                  Netherlands       Aegon NV           10500  47750
     9  2000 AIM Management Group                  UK                AMVESCAP           10000  15000
    10  2000 Air Liquide America                   France            L'Air Liquide SA       0      0
    # ℹ 2,392 more rows
  • Then, pivot the data longer such that instead of dems and repubs columns you have a column called party with levels Democrat and Republican and another column called amount that contains the amount of contribution.

  • Then, for each election cycle (year) calculate the total amount of contributions to Democrat and Republican parties from PACs with country_of_origin UK. The resulting summary table should have two rows for each year of data, one for Democrat and one for Republican contributions.

  • Then, recreate the following visualization.

  • Finally, remake the same visualization, but for a different country. I recommend you choose a country with a substantial number of contributions to US politics. Interpret the new visualization that you make.
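The wrangling steps above can be sketched on a small toy data set. The PAC names, amounts, and raw column names below are made up and stand in for the real files, so treat this as an illustration of the clean_names(), pivot_longer(), and grouped-summary pattern rather than a solution.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(janitor)

# Hypothetical toy data; names and amounts are invented for illustration
toy <- tribble(
  ~year, ~`PAC Name (Affiliate)`, ~`Country of Origin`, ~Dems, ~Repubs,
  2000,  "PAC A",                 "UK",                 100,   200,
  2000,  "PAC B",                 "UK",                 50,    25,
  2002,  "PAC A",                 "UK",                 10,    0,
  2002,  "PAC C",                 "Japan",              5,     5
)

toy_summary <- toy |>
  # convert column names to snake_case, e.g. country_of_origin
  clean_names() |>
  # one row per PAC, year, and party
  pivot_longer(
    cols = c(dems, repubs),
    names_to = "party",
    values_to = "amount"
  ) |>
  mutate(party = if_else(party == "dems", "Democrat", "Republican")) |>
  # keep one country, then total contributions by year and party
  filter(country_of_origin == "UK") |>
  group_by(year, party) |>
  summarize(total = sum(amount), .groups = "drop")

toy_summary
```

With this toy input, toy_summary has two rows per year, one for each party.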

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 4

Hop on. We have two datasets we’ll work with in this exercise:

  • data/flights.rds: All flights out of New York City (JFK, LaGuardia, and Newark) in 2022.

  • data/planes.rds: Plane metadata for plane tail numbers found in the FAA aircraft registry in 2022.

The tasks for this question are outlined below:

  • Load the datasets and then join them such that each row is a flight out of New York City. Use tailnum as the unique identifier to join by. The resulting dataset should contain flights with tailnums that exist in both datasets and should be named nyc_flights_planes. Then, report the number of rows and columns in nyc_flights_planes.

Note

It’s possible that not all flights in flights.rds have a corresponding plane in planes.rds.

  • Create a new variable called size that categorizes the planes into four categories: small, medium, large, and jumbo. You can do this based on any information in the data that makes sense to you to use, but you should explain your reasoning and justify the cutoffs you use (with citations and/or additional visualizations of other variables in the data).

  • Create a visualization like the one below. Note that the size of the airplane emoji increases with plane size. The data presented in your plot will most likely look different than mine because you might use different criteria to determine size of the plane, and that’s ok! And the sizes of the plane emojis may not be the same either as it’s difficult (if not impossible) to tell from the plot what font sizes I used. Just match the general look and layout of the plot.

  • Time to get creative! Create another plot that displays some flight patterns in 2022. Your plot should be based on the joined nyc_flights_planes dataset and the new variable you created, size, must be one of the variables you represented. You’re free to choose any other variables you want for your plot. Along with your plot, provide an interpretation.
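The join and categorization steps can be sketched on toy data as follows. The toy tables, the seats column, and the cutoffs are all assumptions made for illustration: the real planes data may not contain a seats variable, and the cutoffs here are arbitrary placeholders, not recommendations.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; "N9" has no matching plane and "N3" has no flight
toy_flights <- tibble(flight = c(1, 2, 3), tailnum = c("N1", "N2", "N9"))
toy_planes <- tibble(tailnum = c("N1", "N2", "N3"), seats = c(50, 300, 150))

# inner_join() keeps only rows whose tailnum appears in both tables
toy_joined <- inner_join(toy_flights, toy_planes, by = "tailnum")
dim(toy_joined)

# Bin a numeric variable into four ordered categories.
# The cutoffs are arbitrary placeholders chosen only to show the pattern.
toy_sized <- toy_joined |>
  mutate(
    size = case_when(
      seats <= 100 ~ "small",
      seats <= 200 ~ "medium",
      seats <= 350 ~ "large",
      TRUE         ~ "jumbo"
    ),
    size = factor(size, levels = c("small", "medium", "large", "jumbo"))
  )
```

Storing size as a factor with explicit levels keeps the categories in a sensible order on plot axes and legends.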


Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 3312 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

  • Exercise 1: 10 points
  • Exercise 2: 10 points
  • Exercise 3: 15 points
  • Exercise 4: 15 points
  • Total: 50 points