HW 01 - Prefresher

Homework
Modified: January 22, 2026

Important

This homework is due January 28 at 11:59pm ET.

Learning objectives

  • Configure your GitHub credentials
  • Review common methods for data wrangling
  • Implement basic data visualizations with {ggplot2}

Getting started

Access Positron

If you plan to use Posit Workbench

  • Go to https://posit-workbench.infosci.cornell.edu and log in with your Cornell NetID and Password.
  • Click the “New Session” button at the top of the page. Select Positron Pro. Leave all the settings in their default state and click “Start Session”. You should now see a Positron session.

If you plan to use your own computer

Set up your GitHub authentication

Note: If you are using your own computer

Run the following code in the R console to ensure you have the required packages installed:

install.packages(c("usethis", "gitcreds", "gh", "renv"))

In order to push changes to GitHub, you need to authenticate yourself. That is, you need to prove you are the owner of your GitHub account. When you log in to GitHub.com from your browser, you provide your username and password to prove your identity. But when you want to push and pull from your computer, you cannot use this method. Instead, you will prove your identity using one of two methods.

Authenticate using a Personal Access Token (PAT)

Note

This method is preferred since it allows for seamless communication between R and Git for all possible applications.

A personal access token (or PAT) is a string of characters that can be used to authenticate a user when accessing a computer system instead of a username and password. Many online services are shifting towards requiring PATs for security reasons.

With this method you will clone repositories using a regular HTTPS url like https://github.com/<OWNER>/<REPO>.git.

Note: If you are using Posit Workbench

Configure the Git credential helper by running the following R code in the console:

usethis::use_git_config(credential.helper = "store")

Create your personal access token

Run this code from your R console:

usethis::create_github_token(
  scopes = c("repo", "user", "gist", "workflow"),
  description = "Posit Workbench",
  host = "https://github.coecis.cornell.edu/"
)

This is a helper function that takes you to the web form to create a PAT.

  • Give the PAT a description (e.g. “PAT for INFO 3312/5312”)
  • Leave the remaining options on the pre-filled form selected and click “Generate token”. As the page says, you must store this token somewhere because you will not be able to see it again once you leave that page or close the window. For now, you can copy it to your clipboard (we will save it in the next step).

If you lose or forget your PAT, just generate a new one.

Store your PAT

In order to store your PAT so you don’t have to reenter it every time you interact with Git, we need to run the following code:

gitcreds::gitcreds_set(url = "https://github.coecis.cornell.edu/")

When prompted, paste your PAT into the console and press return. Your credential should now be saved on your computer.

Confirm your PAT is saved

Run the following code:

gh::gh_whoami(.api_url = "https://github.coecis.cornell.edu/")

usethis::git_sitrep()

You should see output that provides information about your GitHub account.

Authenticate using Secure Shell Protocol

Note

You can use this approach to authenticate yourself on GitHub. Note that you may run into limitations when communicating with Git outside of standard processes (e.g. cloning/pushing/pulling repos directly), and you will still need to create a PAT for some course assignments. However, for students using Posit Workbench, the SSH method will work for the entire semester (i.e. set it up once and never have to worry about it again).

The Secure Shell Protocol (SSH) is another method for authenticating your identity when communicating with GitHub. While a password can eventually be cracked with a brute force attack, SSH keys are nearly impossible to decipher by brute force alone. Generating a key pair provides you with two long strings of characters: a public and a private key. You can place the public key on any server (like GitHub), and then unlock it by connecting to it with a client that already has the private key (your computer or Posit Workbench). When the two match up, the system unlocks without the need for a password.

The URL for SSH remotes looks like git@github.com:<OWNER>/<REPO>.git. Make sure you use this URL to clone a repository. If you accidentally use the HTTPS version, the operation will not work.

Set up your SSH key

Note

You only need to do this authentication process one time on a single system.

  • Type credentials::ssh_keygen() into your console.

  • R will ask “No SSH key found. Generate one now?” Enter 1 for yes.

  • R will generate a key pair. The public key will begin with “ssh-rsa…” and the output will look something like this:

    $key
    [1] "/home/bcs88/.ssh/id_rsa"
    
    $pubkey
    [1] "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDJYmJeave083exQwYcIqZJk/Y1mgPxdcTYCTWLL+6mlhN9MM3enjDqb2eZvVJ0JK29NYL1++DTqY/saP08IlswNIMntwaWFDNx42yLsuFrWiPqzm9hWWnRcor/d+4zTrcSIEvfAAnLsYkagNqurrCf2taO62YRepTgxErLvLOG10qn4LKhNfT+PTqdPq2Mr88jXQYYrRxGnOV6oVYf6PurKkiooTsKYxVtJWai8Ek9fhK2y5vaQd5yP0H/3Hbw8Mn+rB+O8Yj6/oQKGBCgxkDB4Aw7T91DkIXlHppneO683Y54WvUftJYvSVsnyt/XuNjvXNAir0+kHETLM32uzH6L"
  • Copy the entire string of characters (not including the quotation marks) and paste them into the settings page on GitHub. Give the key an informative title such as “INFO 3312/5312 Posit Workbench”. Click “Add SSH key.”

Configure Git

There is one more thing we need to do before getting started on the assignment: configure Git so that Positron can communicate with GitHub. This requires two pieces of information: your name and email address.

To do so, you will use the use_git_config() function from the {usethis} package. Type the following lines of code in the console in Positron, filling in your name and the email address associated with your GitHub account.

usethis::use_git_config(
  user.name = "Your name",
  user.email = "Email associated with your GitHub account"
)

For example, mine would be

usethis::use_git_config(
  user.name = "Benjamin Soltoff",
  user.email = "bcs88@cornell.edu"
)

You are now ready to interact with GitHub via Positron!

Clone the repo & start a new Positron workspace

  • Go to the course organization on GitHub at https://github.coecis.cornell.edu/info3312-sp26. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the homework.

  • Click on the green CODE button and select HTTPS or SSH, depending on the authentication method you set up previously. Click on the clipboard icon to copy the repo URL.

  • In Positron, open the command palette by pressing Ctrl + Shift + P (or Cmd + Shift + P on a Mac). Type Git: Clone and select it from the list.

  • Paste the URL you copied from GitHub into the dialog box that appears.

  • Choose a location to save the repo on your computer. This will create a new folder with the name of the repo in the location you selected.

  • Once the cloning is complete, Positron will prompt you to open the cloned repository. Click Open to open the repo in a new Positron workspace.

  • Click hw-01-prefresher.qmd to open the template Quarto file. This is where you will write up your code and narrative for the homework.

R and Positron

Below are the components of the Positron IDE.

Core layout elements of the Positron IDE. Source: Positron documentation

See the Positron documentation for more information on the layout.

YAML

The top portion of your Quarto file (between the two lines of three dashes, ---) is called YAML, which stands for “YAML Ain’t Markup Language”. It is a human-friendly data serialization standard that works with all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
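
To see what this metadata looks like to a program, the sketch below parses a small YAML header in R. It assumes the {yaml} package is installed. The title, author, and format values are hypothetical placeholders, not the contents of the actual template.

```r
library(yaml)

# Hypothetical YAML header, written as a string for illustration.
# In a Quarto file this text sits between the two `---` lines.
header <- '
title: "HW 01 - Prefresher"
author: "Your name"
format: pdf
'

# yaml.load() converts the YAML text into a named R list
meta <- yaml.load(header)

# Each key in the header becomes an element of the list
meta$author
```

Each key in the header (such as author) is a piece of document metadata that Quarto reads when rendering.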

Important

Open the Quarto (.qmd) file in your workspace, change the author name to your name, and render the document. Examine the rendered document.

Loading {renv} cached packages

Reproducible environments

Project-oriented workflows benefit from reproducible environments. There are three major benefits to reproducible environments:

  • Isolation: Each project has its own set of packages, avoiding conflicts between projects.
  • Portability: Projects can be shared with others without worrying about package versions or dependencies.
  • Reproducibility: Projects can be run on different systems with the same results, as all package versions are controlled.

In this class, we use the {renv} package to manage reproducible environments. It allows us to create isolated project environments with specific package versions, ensuring that everyone in the class can reproduce the same results, whether you are using Posit Workbench or your own computer.

{renv} workflow. Source: {renv} documentation

While the overall workflow is somewhat complex, we keep things simple in this class. We pre-configure each assignment repo’s lockfile to list the minimum required packages that need to be installed. To access these packages, run the following when you first clone the repo:

renv::restore()

This will retrieve installed packages from your cache folder, or download and install packages you have not used before.

General guidance

Tip: Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
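
As a minimal sketch of the plotting guidelines above, the example below uses the mpg dataset that ships with {ggplot2}. The dataset, the label text, and the theme are illustrative choices, not part of the homework.

```r
library(ggplot2)

# Scatterplot with an informative title and labeled axes and legend
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Larger engines tend to have lower highway fuel economy",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "Drive train"
  ) +
  theme_minimal()

p
```

In a Quarto document, the figure size can be adjusted with chunk options such as fig-width and fig-height.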

Tip: Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Turn in an organized, well formatted document.
Important: Built-in R code formatter

Each repository for this course has the Air formatter enabled for R scripts and Quarto documents. From the documentation:

A formatter is in charge of the layout of your R code. Formatters do not change the meaning of code; instead they ensure that whitespace, newlines, and other punctuation conform to a set of rules and standards.

The Air formatter is automatically applied each time you save an R script or Quarto document, and can also be accessed through the command palette.

Packages

Part 1: Romance novel covers

Figure 1: Covers of Game Changer, Great Big Beautiful Life, Say You’ll Remember Me, and One Golden Summer.

This dataset comes from The Pudding.¹ “What does a happily ever after look like?” examines representation in the romance genre through the lens of novel covers. The dataset contains information on romance novels identified in Publishers Weekly’s announcements in the Romance and/or Romance & Erotica sections.

Exercise 1

Import the spreadsheet from Google Sheets. The entire dataset can be found here. Use {googlesheets4} to import the dataset. Do not download the file as a CSV and then import into R. You must use {googlesheets4} to import the data.

Once you have imported the data, clean the data so that it is ready for analysis. Specifically,

  • Reformat all column names to use snake_case.
  • Fix the date column so it is stored as a date type.
Tip

You should notice an issue if you import this column using the default settings. This is because the author changed formatting styles midway through the column. To fix this issue, I recommend that you import the spreadsheet so this column is formatted as a character vector, then write code to appropriately format each cell as a date value using the {lubridate} package.

  • Keep the following columns, which are relevant to the remaining exercises:

    • Year
    • Title
    • Author
    • Publisher
    • Date published
    • Description
    • Style
    • Whether or not a man is partially unclothed on the cover
    • Whether or not a woman is partially unclothed on the cover
    • Whether or not a person of color (POC) is depicted on the cover
Note

When you are finished, use the glimpse() function to print the data frame so we can easily examine its structure.
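
The cleaning steps above can be sketched on toy data. This is only an illustration under stated assumptions: the column names, the values, the two date layouts, and the to_snake_case() helper are hypothetical, and the real spreadsheet may need different parsing orders. The {janitor} function clean_names() is a common alternative to the hand-written helper.

```r
library(dplyr)
library(lubridate)

# Toy data with hypothetical column names and two date layouts
toy <- tibble(
  `Book Title` = c("A", "B"),
  `Date Published` = c("2021-03-09", "4/13/2021")
)

# Hypothetical helper: lower-case the names, replace runs of other
# characters with underscores, and trim leading/trailing underscores
to_snake_case <- function(x) {
  x <- gsub("[^a-z0-9]+", "_", tolower(x))
  gsub("^_|_$", "", x)
}

toy_clean <- toy |>
  rename_with(to_snake_case) |>
  # parse_date_time() tries each order until a cell parses
  mutate(
    date_published = as_date(
      parse_date_time(date_published, orders = c("ymd", "mdy"))
    )
  ) |>
  select(book_title, date_published)

glimpse(toy_clean)
```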

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Calculate the annual percentage of romance novel covers by their raunchiness, illustration, and diversity. These metrics are defined as:

  • Raunchiness - any novel cover which contains a man or woman partially unclothed.
  • Illustration - any novel cover which is illustrated (as opposed to photorealistic).
  • Diversity - any novel cover with a person of color.

Calculate the annual percentage of novel covers which fall into each of these categories. Use these percentages to recreate the plot below.
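
One way to compute a yearly percentage from TRUE/FALSE columns is sketched below on toy data. The column names and values are hypothetical and will not match the cleaned dataset exactly. The sketch relies on the fact that the mean of a logical vector is the proportion of TRUE values.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per cover, with hypothetical logical columns
toy_covers <- tibble(
  year = c(2011, 2011, 2012, 2012),
  man_unclothed = c(TRUE, FALSE, FALSE, FALSE),
  woman_unclothed = c(FALSE, FALSE, TRUE, FALSE),
  illustrated = c(FALSE, TRUE, TRUE, TRUE),
  poc = c(FALSE, FALSE, TRUE, FALSE)
)

annual_pct <- toy_covers |>
  # a cover is raunchy if either flag is TRUE
  mutate(raunchy = man_unclothed | woman_unclothed) |>
  group_by(year) |>
  summarize(
    raunchiness = mean(raunchy) * 100,
    illustration = mean(illustrated) * 100,
    diversity = mean(poc) * 100
  ) |>
  # one row per year and metric, a convenient shape for plotting
  pivot_longer(cols = -year, names_to = "metric", values_to = "pct")

annual_pct
```

The long format places all three metrics in a single column, so they can be mapped to one aesthetic (such as color) in {ggplot2}.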

Note

Your plot need not be exactly the same in terms of its dimensions, color palette, etc. However, it should be as close as possible.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 2: They’re eating the dogs, they’re eating the cats!

Exercise 3

Report on American attitudes toward the consumption of various animals. YouGov polled 1,000 U.S. adult citizens on topics related to vegetarianism and the eating of meat. Question 19 specifically asked respondents:

Setting aside your own dietary preferences, do you think it is morally acceptable or unacceptable for other people to eat the following animal under normal circumstances?

The cross-tabulation table reporting the results is stored in data/eating-animals.csv. Use the data set to reproduce the visualization below.

Note

Your plot need not be exactly the same in terms of its dimensions, color palette, etc. However, it should be as close as possible.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 3: The economics of prison commissaries

Prison commissaries in the United States have been accused of selling essential items to incarcerated individuals at prices significantly higher than those charged outside of prison. To investigate these claims, The Appeal compiled a national database of prison commissary lists. The resulting raw price data can be found in data/commissary-prices.csv.

Use the data set to answer the following questions.

Exercise 4

Which states have the most expensive ramen on average? Calculate the average price of ramen per state and print a table reporting the 10 most expensive states and their average price.
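
The general "summarize by group, then keep the top rows" pattern can be sketched on toy data. The column names (state, item, price) and the values are hypothetical. The real file may label products differently and need more careful filtering.

```r
library(dplyr)

# Toy price list with hypothetical columns
toy_prices <- tibble(
  state = c("NY", "NY", "CA", "CA", "TX"),
  item = c("ramen", "ramen", "ramen", "soap", "ramen"),
  price = c(0.50, 0.70, 0.40, 1.25, 0.90)
)

top_states <- toy_prices |>
  filter(item == "ramen") |>
  group_by(state) |>
  summarize(avg_price = mean(price)) |>
  # keep the n groups with the largest average, sorted descending
  slice_max(avg_price, n = 2)

top_states
```

The same pattern with min() and slice_min() returns the groups with the smallest values instead.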

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5

Which states have the cheapest deodorant? Calculate the lowest price for deodorant per state and print a table reporting the 10 least expensive states and their minimum price.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 6

How many states sell some form of Lady Speed Stick deodorant? Report the number of states that sell at least one product from the Lady Speed Stick brand.
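
Counting the groups that contain at least one matching row can be sketched as follows. The toy data and its column names are hypothetical, and the case-insensitive match is an assumption about how product names might vary in the real file.

```r
library(dplyr)
library(stringr)

# Toy product list with hypothetical columns
toy_products <- tibble(
  state = c("NY", "NY", "CA", "TX"),
  product = c(
    "Lady Speed Stick Powder Fresh",
    "Ramen, chicken",
    "LADY SPEED STICK invisible",
    "Speed Stick regular"
  )
)

n_states <- toy_products |>
  # keep rows whose name contains the brand, ignoring case
  filter(str_detect(product, regex("lady speed stick", ignore_case = TRUE))) |>
  # count each state once, however many matching products it sells
  summarize(n_states = n_distinct(state)) |>
  pull(n_states)

n_states
```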


Now is a good time to render, commit, and push.

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment on how you used GAI tools (e.g. what tools you used, how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Render, commit, and push one last time.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 3312 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

  • Exercise 1: 8 points
  • Exercise 2: 12 points
  • Exercise 3: 12 points
  • Exercise 4: 4 points
  • Exercise 5: 4 points
  • Exercise 6: 6 points
  • Workflow + formatting: 4 points
  • Total: 50 points
Note

The “Workflow + formatting” component assesses the reproducible workflow. This includes:

  • Following {tidyverse} code style
  • All code being visible in the rendered PDF (lines no longer than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends

Footnotes

  1. h/t to the author Alice Liang for collecting and sharing this data publicly.