HW 01 - Prefresher

Homework
Modified

January 30, 2025

Important

This homework is due January 29 at 11:59pm ET.

Learning objectives

  • Configure your GitHub credentials
  • Review common methods for data wrangling
  • Implement basic data visualizations with {ggplot2}

Getting started

Important

If you need assistance installing and configuring software, make sure to get help from a TA during office hours.

Access RStudio

If you plan to use your own computer

If you plan to use RStudio Workbench

  • Go to https://rstudio-workbench.infosci.cornell.edu and log in with your Cornell NetID and Password.
  • Click the “New Session” button on the top of the page. Leave all the settings on their default state and click “Start Session”. You should now see an RStudio session.
Warning

If this is your first time accessing RStudio Workbench for the course, it will take a couple of minutes to prepare your session. Please be patient. When you start a session in the future, your container will already be prepared and it should start within 15 seconds.

Set up your GitHub authentication

If you are using your own computer

Run the following code in the R console to ensure you have the required packages installed:

install.packages(c("usethis", "gitcreds", "gh", "renv"))

In order to push changes to GitHub, you need to authenticate yourself. That is, you need to prove you are the owner of your GitHub account. When you log in to GitHub.com from your browser, you provide your username and password to prove your identity. But when you want to push and pull from your computer, you cannot use this method. Instead, you will prove your identity using one of two methods.

Authenticate using a Personal Access Token (PAT)

Note

This method is preferred since it allows for seamless communication between R and Git for all possible applications.

A personal access token (or PAT) is a string of characters that can be used to authenticate a user when accessing a computer system instead of a username and password. Many online services are shifting towards requiring PATs for security reasons.

With this method you will clone repositories using a regular HTTPS url like https://github.com/<OWNER>/<REPO>.git.

If you are using RStudio Workbench

Configure the Git credential helper by running the following R code in the console:

usethis::use_git_config(credential.helper = "store")

Create your personal access token

Run this code from your R console:

usethis::create_github_token(
  scopes = c("repo", "user", "gist", "workflow"),
  description = "RStudio Workbench",
  host = "https://github.coecis.cornell.edu/"
)

This is a helper function that takes you to the web form to create a PAT.

  • Give the PAT a description (e.g. “PAT for INFO 2951”)
  • Leave the remaining options on the pre-filled form selected and click “Generate token”. As the page says, you must store this token somewhere, because you’ll never be able to see it again once you leave that page or close the window. For now, you can copy it to your clipboard (we will save it in the next step).

If you lose or forget your PAT, just generate a new one.

Store your PAT

In order to store your PAT so you don’t have to reenter it every time you interact with Git, we need to run the following code:

gitcreds::gitcreds_set(url = "https://github.coecis.cornell.edu/")

When prompted, paste your PAT into the console and press return. Your credential should now be saved on your computer.

Confirm your PAT is saved

Run the following code:

gh::gh_whoami(.api_url = "https://github.coecis.cornell.edu/")

usethis::git_sitrep()

You should see output that provides information about your GitHub account.

Authenticate using Secure Shell Protocol

Note

You can use this approach to authenticate yourself on GitHub. Note that you may find some limitations communicating with Git outside of standard processes (e.g. cloning/pushing/pulling repos directly), and will still need to create a PAT for some course assignments. However, for students using RStudio Workbench, the SSH method will work for the entire semester (i.e. set it up once and never have to worry about it again).

The Secure Shell Protocol (SSH) is another method for authenticating your identity when communicating with GitHub. While a password can eventually be cracked with a brute force attack, SSH keys are nearly impossible to decipher by brute force alone. Generating a key pair provides you with two long strings of characters: a public and a private key. You can place the public key on any server (like GitHub), and then unlock it by connecting to it with a client that already has the private key (your computer or RStudio Server). When the two match up, the system unlocks without the need for a password.

The URL for SSH remotes looks like git@github.com:<OWNER>/<REPO>.git. Make sure you use this URL to clone a repository. If you accidentally use the HTTPS version, the operation will not work.
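
Because the two authentication methods expect different remote URL formats, it can help to check which kind of URL you copied before cloning. The following is a minimal sketch in base R; `remote_protocol()` is a hypothetical helper written for illustration and is not part of {usethis} or any other package.

```r
# Hypothetical helper: classify a Git remote URL by its protocol.
remote_protocol <- function(url) {
  if (grepl("^https://", url)) {
    "https"
  } else if (grepl("^git@[^:]+:", url) || grepl("^ssh://", url)) {
    "ssh"
  } else {
    "unknown"
  }
}

# The two URL shapes described above
remote_protocol("https://github.com/<OWNER>/<REPO>.git")
remote_protocol("git@github.com:<OWNER>/<REPO>.git")
```

If the result does not match the authentication method you configured, copy the other URL from the repository page before cloning.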

Set up your SSH key

Note

You only need to do this authentication process one time on a single system.

  • Type credentials::ssh_keygen() into your console.

  • R will ask “No SSH key found. Generate one now?” You should enter 1 for yes.

  • You will generate a key. It will begin with “ssh-rsa….” and look something like this:

    $key
    [1] "/home/bcs88/.ssh/id_rsa"
    
    $pubkey
    [1] "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDJYmJeave083exQwYcIqZJk/Y1mgPxdcTYCTWLL+6mlhN9MM3enjDqb2eZvVJ0JK29NYL1++DTqY/saP08IlswNIMntwaWFDNx42yLsuFrWiPqzm9hWWnRcor/d+4zTrcSIEvfAAnLsYkagNqurrCf2taO62YRepTgxErLvLOG10qn4LKhNfT+PTqdPq2Mr88jXQYYrRxGnOV6oVYf6PurKkiooTsKYxVtJWai8Ek9fhK2y5vaQd5yP0H/3Hbw8Mn+rB+O8Yj6/oQKGBCgxkDB4Aw7T91DkIXlHppneO683Y54WvUftJYvSVsnyt/XuNjvXNAir0+kHETLM32uzH6L"
  • Copy the entire string of characters (not including the quotation marks) and paste them into the settings page on GitHub. Give the key an informative title such as “INFO 2951 RStudio Workbench”. Click “Add SSH key.”

Configure Git

There is one more thing we need to do before getting started on the assignment. Specifically, we need to configure Git so that RStudio can communicate with GitHub. This requires two pieces of information: your name and email address.

To do so, you will use the use_git_config() function from the usethis package. Type the following lines of code in the RStudio console, filling in your name and the email address associated with your GitHub account.

usethis::use_git_config(
  user.name = "Your name", 
  user.email = "Email associated with your GitHub account"
  )

For example, mine would be

usethis::use_git_config(
  user.name = "Benjamin Soltoff", 
  user.email = "bcs88@cornell.edu"
  )

You are now ready to interact with GitHub via RStudio!

Clone the repo & start new RStudio project

  • Go to the course organization on GitHub at https://github.coecis.cornell.edu/info3312-sp25. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the lab.

  • Click on the green CODE button, select HTTPS or SSH based on the authentication method you set up previously. Click on the clipboard icon to copy the repo URL.

  • In RStudio, go to File → New Project → Version Control → Git.

  • Copy and paste the URL of your assignment repo into the dialog box Repository URL.

  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

  • Click hw-01-prefresher.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.

Loading {renv} cached packages

We use {renv} to ensure all students use reproducible environments. We pre-configure each assignment repo’s lockfile to list the minimum required packages that need to be installed. When you first clone the repo, run the following to access these packages:

renv::restore()

This will retrieve installed packages from your cache folder, or download and install packages you have not used before.

Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Turn in an organized, well formatted document.

Packages

library(tidyverse)
library(googlesheets4)
library(janitor)
library(scales)

Part 1: Romance novel covers

Figure 1: Romance novel covers (Trading Places, Recipe for Second Chances, Thank You for Sharing, and Kissing Kosher) from “What does a happily ever after look like?” by Alice Liang, The Pudding, 2021.

This dataset comes from The Pudding.¹ “What does a happily ever after look like?” examines representation in the literary genre of romance through the lens of novel covers. The dataset contains information on romance novels identified in Publishers Weekly’s announcements in the Romance and/or Romance & Erotica sections.

Exercise 1

Import the spreadsheet from Google Sheets. The entire dataset can be found here. Use {googlesheets4} to import the dataset. Do not download the file as a CSV and then import into R. You must use {googlesheets4} to import the data.

Once you have imported the data, clean the data so that it is ready for analysis. Specifically,

  • Reformat all column names to utilize a snake_case format.
  • Fix the date column so it is stored as a date type.
Tip

You should notice an issue if you import this column using the default settings. This is because the author changed formatting styles midway through the column. To fix this issue, I recommend that you import the spreadsheet so this column is formatted as a character vector, then write code to appropriately format each cell as a date value using the {lubridate} package.

  • Keep the following columns which will be relevant to the remaining exercises

    • Year
    • Title
    • Author
    • Publisher
    • Date published
    • Description
    • Style
    • Whether or not a man is partially unclothed on the cover
    • Whether or not a woman is partially unclothed on the cover
    • Whether or not a person of color (POC) is depicted on the cover
Note

When you are finished, use the glimpse() function to print the data frame so we can easily examine its structure.
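
The cleaning steps above can be sketched on a small made-up data frame. The column names, values, and the two date formats below are assumptions for illustration only; inspect the real spreadsheet to see which formats it actually contains. For the import itself, {googlesheets4}'s `read_sheet()` has a `col_types` argument that should let you read columns as character, but check its documentation for the exact specification you need.

```r
library(dplyr)
library(janitor)
library(lubridate)

# Toy stand-in for the imported sheet; names and values are invented.
covers_raw <- tibble(
  `Title` = c("Book A", "Book B", "Book C"),
  `Date Published` = c("3/14/2019", "2020-05-01", "12/1/2021"),
  `Extra Column` = c("x", "y", "z")
)

covers <- covers_raw |>
  # Convert all column names to snake_case
  clean_names() |>
  # Try month-day-year first, then year-month-day (assumed formats)
  mutate(
    date_published = as_date(
      parse_date_time(date_published, orders = c("mdy", "ymd"))
    )
  ) |>
  # Keep only the columns needed later
  select(title, date_published)

glimpse(covers)
```

Listing several `orders` lets `parse_date_time()` handle a column whose formatting changes partway through, which is the situation the tip describes.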

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Calculate the annual percentage of romance novel covers by their raunchiness, illustration, and diversity. These metrics are defined as:

  • Raunchiness - any novel cover which contains a man or woman partially unclothed.
  • Illustration - any novel cover which is illustrated (as opposed to photorealistic)
  • Diversity - any novel cover with a person of color

Calculate the annual percentage of novel covers which fall into each of these categories. Use these percentages to recreate the plot below.

Note

Your plot need not be exactly the same in terms of its dimensions, color palette, etc. However, it should be as close as possible.
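
One way to think about the calculation: if each metric is stored as a logical column, its mean within a year is the proportion of covers in that category. The sketch below uses an invented data frame with hypothetical column names (`raunchy`, `illustrated`, `diverse`); in the real data these indicators must first be derived from the cleaned columns according to the definitions above.

```r
library(dplyr)
library(tidyr)

# Invented data: one row per cover, one logical column per metric.
covers_toy <- tibble(
  year = c(2011, 2011, 2012, 2012),
  raunchy = c(TRUE, FALSE, FALSE, FALSE),
  illustrated = c(FALSE, FALSE, TRUE, TRUE),
  diverse = c(FALSE, TRUE, TRUE, FALSE)
)

annual_pct <- covers_toy |>
  group_by(year) |>
  # The mean of a logical vector is the share of TRUE values
  summarize(across(c(raunchy, illustrated, diverse), mean)) |>
  # One row per year-metric pair, a convenient shape for plotting
  pivot_longer(
    cols = -year,
    names_to = "metric",
    values_to = "proportion"
  )

annual_pct
```

The long format maps naturally onto {ggplot2} aesthetics, and `scales::label_percent()` can format the proportions as percentages on an axis.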

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 2: They’re eating the dogs, they’re eating the cats!

Exercise 3

Report on American attitudes on the consumption of various animals. YouGov polled 1,000 U.S. adult citizens on topics related to vegetarianism and the eating of meat. Question 19 specifically asked respondents

Setting aside your own dietary preferences, do you think it is morally acceptable or unacceptable for other people to eat the following animal under normal circumstances?

The cross-tabulation table reporting the results is stored in data/eating-animals.csv. Use the data set to reproduce the visualization below.

Note

Your plot need not be exactly the same in terms of its dimensions, color palette, etc. However, it should be as close as possible.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 3: The economics of prison commissaries

Prison commissaries in the United States have been accused of price gouging: selling essential items to incarcerated individuals at significantly higher prices than those charged outside of prison. To investigate these claims, The Appeal compiled a national database of prison commissary lists. The resulting raw price data can be found in data/commissary-prices.csv.

Use the data set to answer the following questions.

Exercise 4

Which states have the most expensive ramen on average? Calculate the average price of ramen per state and print a table reporting the 10 most expensive states and their average price.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5

Which states have the cheapest deodorant? Calculate the lowest price for deodorant per state and print a table reporting the 10 least expensive states and their minimum price.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 6

How many states sell some form of Lady Speed Stick deodorant? Report the number of states that sell at least one product from the Lady Speed Stick brand.


Now is a good time to render, commit, and push.

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment of how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 3312 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

  • Exercise 1: 8 points
  • Exercise 2: 12 points
  • Exercise 3: 12 points
  • Exercise 4: 4 points
  • Exercise 5: 4 points
  • Exercise 6: 6 points
  • Workflow + formatting: 4 points
  • Total: 50 points
Note

The “Workflow + formatting” component assesses the reproducible workflow. This includes:

  • Following {tidyverse} code style
  • All code being visible in the rendered PDF (lines of code no longer than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends
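
To check the 80-character guideline before rendering, a quick base R sketch such as the following can list overly long lines. `long_lines()` is a hypothetical helper written for illustration. It counts every line in the file, including prose outside code chunks, so treat the result as a rough guide only.

```r
# Hypothetical helper: report lines longer than `limit` characters.
long_lines <- function(path, limit = 80) {
  lines <- readLines(path, warn = FALSE)
  too_long <- which(nchar(lines) > limit)
  data.frame(line = too_long, n_chars = nchar(lines[too_long]))
}

# Example on a temporary file standing in for your .qmd document
tmp <- tempfile(fileext = ".qmd")
writeLines(c("short line", strrep("x", 85)), tmp)
long_lines(tmp)
```

In practice you would pass the path to your own document, for example `long_lines("hw-01-prefresher.qmd")`.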

Footnotes

  1. h/t to the author Alice Liang for collecting and sharing this data publicly.