HW 03 - Viz + wrangling

Homework

Modified

February 12, 2025

Important

This homework is due February 26 at 11:59pm ET.

Learning objectives

Clean and wrangle data for visualization
Implement relational joins
Use reference documentation to implement new chart types
Create charts using polar coordinate systems

Getting started

Go to the info3312-sp25 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio.

General guidance

Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

Update author name on your document.
Label all code chunks informatively and concisely.
Follow the Tidyverse code style guidelines.
Make at least 3 commits.
Resize figures where needed; avoid tiny or huge plots.
Turn in an organized, well-formatted document.

Packages

library(tidyverse)
library(scales)
library(ggridges)

Exercises

Exercise 1

Implementing ridgeline plots. The goal of this exercise is to learn about a new type of plot (ridgeline plot) and to learn how to make it.

Social scientists often use vote-based measures of political ideology to study legislative behavior. The data you will use is from the NOMINATE scores, which are a common way to measure the ideology of members of Congress. You can find individual NOMINATE scores for every legislator from every term of the U.S. Congress since the 1st Congress in 1789 in HSall_members.csv.

You will use the “first dimension” scores (nominate_dim1) which in modern times are interpreted as identifying political ideology. Negative scores are interpreted as “liberal”, and positive scores are interpreted as “conservative”. Use an appropriate function from the {ggridges} package to make a ridge plot of partisan polarization in the U.S. Congress. How has the ideological makeup of the House of Representatives and the Senate shifted over time? Focus only on terms of Congress since 1945, and make sure to separately visualize the House of Representatives and the Senate.

Whazza whazza what?

We have not created ridgeline plots previously. Use the package documentation to assist your implementation of this plot.
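As a starting point for reading the documentation, here is a minimal sketch of a ridgeline plot built with geom_density_ridges() from {ggridges}. The data are simulated and the names toy_scores and toy_ridges are made up for illustration; nothing here comes from the NOMINATE data.

```r
library(tidyverse)
library(ggridges)

# Simulated scores for three made-up groups (NOT the NOMINATE data)
set.seed(123)
toy_scores <- tibble(
  group = rep(c("Group A", "Group B", "Group C"), each = 200),
  score = c(
    rnorm(200, mean = -0.3, sd = 0.2),
    rnorm(200, mean = 0, sd = 0.2),
    rnorm(200, mean = 0.3, sd = 0.2)
  )
)

# One density ridge is drawn for each value of the y variable
toy_ridges <- ggplot(toy_scores, aes(x = score, y = group)) +
  geom_density_ridges(alpha = 0.7) +
  labs(
    title = "Simulated score distributions by group",
    x = "Score",
    y = NULL
  ) +
  theme_ridges()

toy_ridges
```

The key idea is that the y aesthetic is a discrete variable and each of its values gets its own density curve. The package documentation describes further arguments for controlling overlap and appearance.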

Hints

Terms of Congress last for two years and are identified sequentially (e.g., the 1st Congress ran from 1789-91, the 2nd Congress from 1791-93, etc.). The congress variable in the data set is a numeric variable that identifies the term of Congress. For interpretability, it would be helpful to instead label the graph based on the years of the term of Congress (e.g. 1945-1947, 1947-1949, etc.). Feel free to decide how best to generate these labels.
Typically humans read content from top to bottom. Make sure your chart is oriented in a way that follows the natural flow of time.
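One possible way to act on these hints is sketched below. The helper congress_to_years() is hypothetical (it is not part of any package) and relies only on the facts stated above: the 1st Congress began in 1789 and each term lasts two years. The factor at the end illustrates one way to control the vertical order of a discrete axis.

```r
# Hypothetical helper: label a term of Congress by the years it spans,
# assuming the 1st Congress began in 1789 and every term lasts two years
congress_to_years <- function(congress) {
  start_year <- 1789 + 2 * (congress - 1)
  paste0(start_year, "-", start_year + 2)
}

congress_to_years(c(1, 2, 79))
# "1789-1791" "1791-1793" "1945-1947"

# ggplot2 draws the first factor level at the bottom of a discrete y-axis,
# so reversing the levels places the earliest term at the top
terms <- congress_to_years(79:81)
term_factor <- factor(terms, levels = rev(terms))
levels(term_factor)
```

Other approaches, such as building the labels inside mutate() or reversing the order with forcats::fct_rev(), would work equally well.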

Also include an interpretation for your visualization. You should review feedback from your homeworks 1 and 2 to make sure you capture anything you may have missed previously.

Exercise 2

Follow the money. Vote-based measures of political ideology are common in social science research, but have certain limitations. Because they are based on observed voting behavior in political institutions, they can only be used to measure the ideology of individuals who have actually served in a political institution (i.e. individuals who run for office but fail to win election cannot be measured). Furthermore, voting-based measures for individuals who serve in different institutions are not directly comparable because they do not have significant overlap in the things for which they cast votes (e.g. legislators vs. executives vs. judges, national vs. state politicians).

Fortunately there are other ways to measure ideology independent of voting behavior. Adam Bonica created the Database on Ideology, Money in Politics, and Elections (DIME) which provides comprehensive ideological mapping of not only elected officials, but also other political elites, interest groups, and donors by compiling “over 850 million itemized political contributions made by individuals and organizations to local, state, and federal elections covering from 1979 to 2024”. These contributions are collapsed into a single dimension Campaign Finance Score (CFScore) which measures the estimated ideology of all electoral candidates, recipients, and donors in the United States.

Whenever researchers introduce a new measure, they often will show that it correlates with existing measures of the same concept. Use the DIME dataset located in data/dime.csv to construct a visualization that compares the first dimension NOMINATE scores to the corresponding recipient CFScore (recipient.cfscore) for all legislators from the U.S. Congress.

Hints

Your chart should visually distinguish Democrats from Republicans. Avoid using the usual red-blue color coding for the parties since that combination has accessibility concerns (more on this later in the course).
Use faceting so the reader can distinguish each two-year Congressional term.
party_code identifies the partisan affiliation of each individual in the NOMINATE database. 100 is Democrats and 200 is Republicans. Given the dearth of alternative parties in U.S. politics, you can ignore all third-party legislators.
icpsr uniquely identifies each individual in the DIME database. From the documentation:

ICPSR: Adjusted ICPSR legislator ID. Candidates that have never served in Congress are assigned IDs based off of their candidate IDs assigned by the FEC, NIMSP, or state reporting agencies. The four-digit election cycle is appended to the ID. Candidates that are active in multiple election cycles (or file to run for multiple seats during a single election cycle) will appear multiple times. This variable provides a unique row identifier.

You will need to prep this column to use it to combine the DIME dataset with the NOMINATE dataset.
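The sketch below shows one way such a key could be prepared and used in a relational join. Everything in it is an assumption for illustration: the two toy tables, their IDs, and their scores are invented, and the ID layout (an ID followed directly by a four-digit cycle) is inferred only from the documentation excerpt above. Check the real columns before adapting it.

```r
library(tidyverse)

# Hypothetical stand-in for a few DIME rows; IDs and scores are invented
dime_toy <- tribble(
  ~icpsr,          ~recipient.cfscore,
  "100012012",     -0.8,
  "100022012",      0.9,
  "H0XX000002012",  0.1
)

# Hypothetical stand-in for NOMINATE rows, where icpsr is stored as a number
nominate_toy <- tribble(
  ~icpsr, ~party_code, ~nominate_dim1,
  10001,  100,         -0.4,
  10002,  200,          0.5
)

# Drop the trailing four-digit election cycle so the IDs can match
dime_clean <- dime_toy |>
  mutate(icpsr = str_remove(icpsr, "\\d{4}$"))

# Join on keys of the same type; going through as.integer() avoids
# scientific notation, and keeping text avoids turning candidate IDs into NA
scores_joined <- nominate_toy |>
  mutate(icpsr = as.character(as.integer(icpsr))) |>
  inner_join(dime_clean, by = "icpsr")

scores_joined
```

An inner join keeps only individuals present in both tables, which here drops the candidate who never served. Because individuals appear once per election cycle in DIME and once per term in NOMINATE, a real join would likely also need a key that identifies the term; how cycles map to terms is not shown here.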

Exercise 3

Key lyme pie. The goal of this exercise is to recreate a pie chart in R and then improve it by presenting the same information as a bar graph. The pie chart to recreate is below and it comes from the Lyme Disease Association.¹

Figure: Pie chart of 2018 US reported Lyme disease cases featuring the top 15 states

Below are the steps I recommend you follow and some guidance on what (not) to worry about:

First, create the data frame: Use the annotations in the visualization provided to do this. You should create the new data frame using the tibble() or the tribble() functions.
Then, recreate the pie chart: When recreating the pie chart, you do not need to:
- make it a 3D pie chart (2D is sufficient)
- match the colors (default {ggplot2} colors or any other color palette is fine)
- annotate the plot in the same way (just the legend is sufficient)
- match the entire caption (see below for what we want you to match)
However, you should:
- make a 2D pie chart
- present a legend on the right that shows the mapping of the colors to states
- match the title text, location, and alignment
- match the text, location, and alignment of the first two lines of the caption
Finally, improve the visualization by presenting this information in the form of a bar graph. As an additional challenge, imagine you’re working for the state of New York, and highlight the bar corresponding to that state in some way. Write a sentence or two describing why you chose to highlight the New York info the way you did.
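The three steps above can be sketched on placeholder data as follows. The categories, counts, titles, and colors are all invented for illustration and are not the figures from the Lyme Disease Association chart; the object names toy_cases, toy_pie, and toy_bar are likewise hypothetical.

```r
library(tidyverse)

# Placeholder categories and counts (NOT the Lyme disease figures)
toy_cases <- tribble(
  ~state,    ~cases,
  "State A", 50,
  "State B", 30,
  "State C", 20
)

# A pie chart is a single stacked bar drawn in polar coordinates
toy_pie <- ggplot(toy_cases, aes(x = "", y = cases, fill = state)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Placeholder title", fill = "State") +
  theme_void() +
  theme(legend.position = "right")

# Bar chart of the same data, with one bar highlighted by color
toy_bar <- toy_cases |>
  mutate(
    state = fct_reorder(state, cases),
    highlight = state == "State B"
  ) |>
  ggplot(aes(x = cases, y = state, fill = highlight)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "darkorange", "FALSE" = "gray70")) +
  labs(title = "Placeholder title", x = "Reported cases", y = NULL) +
  theme_minimal()

toy_pie
toy_bar
```

Mapping a logical column to fill is only one way to draw attention to a single bar; annotations or direct labels are other options worth considering.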

Exercise 4

Federal judges and the Ivy League. Finish the visualization we started for the application exercise. Make sure the plot is finished, polished, and readily interpretable. Use the visualization to develop an answer to the question: are Democratic or Republican presidents more likely to appoint judges from the Ivy League?

Exercise 5

Federal judges and the Ivy League (redux). Use the federal judges dataset to ask and answer a new question about federal judges. For example, you might continue to explore the theme of Ivy League representation in the federal courts.

Example charts

Please don’t just replicate the examples

The examples are supposed to inspire you to explore the data and think about different ways to examine and visualize the data. Don’t just attempt to replicate one of the examples.

Or explore entirely different questions related to federal judges. For this exercise, you will be evaluated on both the quality of the final visualization as well as the effort necessary to prepare the data for the visualization.

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Wrap up

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials → Cornell University NetID and log in using your NetID credentials.
Click on your INFO 3312 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

Exercise 1: 10 points
Exercise 2: 10 points
Exercise 3: 15 points
Exercise 4: 5 points
Exercise 5: 10 points
Total: 50 points

Acknowledgments

Exercise 3 drawn from Advanced Data Visualization by Mine Çetinkaya-Rundel.

Footnotes

¹ Source: https://lymediseaseassociation.org/resources/2018-reported-lyme-cases-top-15-states