HW 01 - Romance novel covers

Homework
Modified

January 24, 2024

Important

This homework is due January 31 at 11:59pm ET.

Getting started

Important

Your lab TAs will lead you through the Getting Started and Packages sections.

Log in to RStudio

  • Go to https://rstudio-workbench.infosci.cornell.edu and log in with your Cornell NetID and Password.
  • Click the “New Session” button on the top of the page. Leave all the settings on their default state and click “Start Session”. You should now see an RStudio session.
Warning

If this is your first time accessing RStudio Workbench for the course, it will take a couple of minutes to prepare your session. Please be patient. When you start a session in the future, your container will already be prepared and it should start within 15 seconds.

Set up your SSH key

You will authenticate with GitHub using SSH. Below is an outline of the authentication steps; you are encouraged to follow along as your TA demonstrates the steps.

Note

You only need to do this authentication process one time on a single system. If you previously used the RStudio Workbench for INFO 2950 or 5011, you can likely skip this step. Your old SSH key is still valid.

  • Type credentials::ssh_keygen() into your console.

  • R will ask “No SSH key found. Generate one now?” Enter 1 for yes.

  • You will generate a key. It will begin with “ssh-rsa….” and look something like this:

    $key
    [1] "/home/bcs88/.ssh/id_rsa"
    
    $pubkey
    [1] "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDJYmJeave083exQwYcIqZJk/Y1mgPxdcTYCTWLL+6mlhN9MM3enjDqb2eZvVJ0JK29NYL1++DTqY/saP08IlswNIMntwaWFDNx42yLsuFrWiPqzm9hWWnRcor/d+4zTrcSIEvfAAnLsYkagNqurrCf2taO62YRepTgxErLvLOG10qn4LKhNfT+PTqdPq2Mr88jXQYYrRxGnOV6oVYf6PurKkiooTsKYxVtJWai8Ek9fhK2y5vaQd5yP0H/3Hbw8Mn+rB+O8Yj6/oQKGBCgxkDB4Aw7T91DkIXlHppneO683Y54WvUftJYvSVsnyt/XuNjvXNAir0+kHETLM32uzH6L"
  • Copy the entire string of characters (not including the quotation marks) and paste them into the settings page on GitHub. Give the key an informative title such as “INFO 3312/5312 RStudio Workbench”. Click “Add SSH key.”

Configure Git

There is one more thing we need to do before getting started on the assignment. Specifically, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your name and email address.

To do so, you will use the use_git_config() function from the usethis package. Type the following lines of code in the RStudio console, filling in your name and the email address associated with your GitHub account.

usethis::use_git_config(
  user.name = "Your name", 
  user.email = "Email associated with your GitHub account"
  )

For example, mine would be

usethis::use_git_config(
  user.name = "Benjamin Soltoff", 
  user.email = "bcs88@cornell.edu"
  )

You are now ready to interact with GitHub via RStudio!

Clone the repo & start new RStudio project

  • Go to the course organization at https://github.coecis.cornell.edu/info3312-sp24 on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the homework.

  • Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.

  • In RStudio, go to File > New Project > Version Control > Git.

  • Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.

  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

  • Click hw-01-romance.qmd to open the template Quarto file. This is where you will write up your code and narrative for the homework.

Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Note

Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. You may lose points if you do not.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Turn in an organized, well-formatted document.

Packages

library(tidyverse)
library(tidymodels)
library(googlesheets4)
library(janitor)
library(scales)
library(themis)
library(textrecipes)

Data: Romance novel covers

[Figure: four romance novel covers: Trading Places, Recipe for Second Chances, Thank You for Sharing, and Kissing Kosher]
Figure 1: Romance novel covers from “What does a happily ever after look like?” by Alice Liang, The Pudding, 2021.

This week’s dataset comes to us from The Pudding.[1] “What does a happily ever after look like?” examines representation in the literary genre of romance through the lens of novel covers. The dataset contains information on romance novels identified in Publishers Weekly’s announcements in the Romance and/or Romance & Erotica sections.

Exercises

Exercise 1

Import the spreadsheet from Google Sheets. The entire dataset can be found here. Use googlesheets4 to import the dataset. Do not download the file as a CSV and then import into R. You must use googlesheets4 to import the data.

Once you have imported the data, clean the data so that it is ready for analysis. Specifically,

  • Reformat all column names to use snake_case.

  • Fix the date column so it is stored as a date type. You should notice an issue if you import this column using the default settings. This is because the author changed formatting styles midway through the column. To fix this issue, I recommend that you import the spreadsheet so this column is formatted as a character vector, then write code to appropriately format each cell as a date value using the lubridate package.

  • Keep the following columns, which are relevant to the remaining exercises:

    • Year
    • Title
    • Author
    • Publisher
    • Date published
    • Description
    • Style
    • Whether or not a man is partially unclothed on the cover
    • Whether or not a woman is partially unclothed on the cover
    • Whether or not a person of color (POC) is depicted on the cover
Note

When you are finished, use the glimpse() function to print the data frame so we can easily examine its structure.
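As a rough illustration of the cleaning steps above, here is a minimal sketch on toy data. The column names and the two date formats are assumptions made for this example, not the contents of the real spreadsheet; inspect the imported character column to see which formats actually occur before choosing the `orders` argument.

```r
library(dplyr)
library(janitor)
library(lubridate)

# Toy data standing in for the imported sheet. The column names and
# date formats are hypothetical, not those of the real dataset.
raw <- tibble(
  `Title` = c("Book A", "Book B", "Book C"),
  `Date Published` = c("3/14/2019", "2020-07-01", "11/2/2021")
)

covers <- raw |>
  # convert column names to snake_case
  clean_names() |>
  mutate(
    # try month-day-year first, then year-month-day, for each cell
    date_published = as_date(
      parse_date_time(date_published, orders = c("mdy", "ymd"))
    )
  )

glimpse(covers)
```

Reading the column as character first keeps the original strings intact, so every cell can be parsed explicitly instead of relying on the importer's guess.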

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Calculate the annual percentage of romance novel covers by their raunchiness, illustration, and diversity. These metrics are defined as:

  • Raunchiness - any novel cover which contains a man or woman partially unclothed.
  • Illustration - any novel cover which is illustrated (as opposed to photorealistic).
  • Diversity - any novel cover with a person of color.

Use these annual percentages to recreate the plot below.
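One way to compute a yearly share from logical columns is to take the mean of a TRUE/FALSE vector within each year. The sketch below uses toy data with hypothetical column names and covers only one metric; it is meant to show the pattern, not the full solution.

```r
library(dplyr)

# Toy data; column names are hypothetical stand-ins for the cleaned dataset.
toy <- tibble(
  year = c(2019, 2019, 2020, 2020),
  man_unclothed = c(TRUE, FALSE, FALSE, FALSE),
  woman_unclothed = c(FALSE, FALSE, TRUE, FALSE)
)

toy_pct <- toy |>
  # a cover counts as raunchy if either flag is TRUE
  mutate(raunchy = man_unclothed | woman_unclothed) |>
  group_by(year) |>
  # the mean of a logical vector is the proportion of TRUE values
  summarize(pct_raunchy = mean(raunchy, na.rm = TRUE))

toy_pct
```

The resulting proportions can be formatted as percentages on a plot axis with `scales::label_percent()`.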

Note

Your plot need not be exactly the same in terms of its dimensions, color palette, etc. However, it should be as close as possible.

Now is a good time to render, commit, and push.

Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3

You will create a machine learning model to predict whether or not a novel cover depicts a man partially unclothed. To begin, you will partition the data.

First we need to format the data frame to work with tidymodels. Keep only the columns we will use for the model:

  • Title
  • Date
  • Description
  • Style
  • Whether or not a man is partially unclothed on the cover
  • Whether or not a woman is partially unclothed on the cover
  • Whether or not a person of color (POC) is depicted on the cover

Format all logical columns to instead be stored as factor columns where the levels are FALSE and TRUE.

Next, reproducibly split your data into training and test sets. Allocate 80% of observations to training, and 20% to testing. Partition the training set into 10 distinct folds for model fitting. Unless otherwise stated, you will use these sets for all the remaining exercises.
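The factor conversion and the reproducible split can be sketched as follows. The toy data frame, its column names, and the seed value are all assumptions made for illustration.

```r
library(dplyr)
library(rsample)

# Any fixed seed makes the split reproducible; this value is arbitrary.
set.seed(3312)

# Toy data with hypothetical column names.
toy <- tibble(
  title_length = rpois(100, lambda = 20),
  man_unclothed = rep(c(TRUE, FALSE), times = 50)
) |>
  # store every logical column as a factor with levels FALSE, TRUE
  mutate(across(where(is.logical), \(x) factor(x, levels = c(FALSE, TRUE))))

# 80% of rows go to training, 20% to testing
toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# 10-fold cross-validation on the training set only
toy_folds <- vfold_cv(toy_train, v = 10)
```

Calling `set.seed()` once before the split and the fold creation is what makes both steps repeatable across renders.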

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 4

Fit a null model to predict whether or not a novel cover depicts a partially unclothed man. Use the cross-validation folds, report the accuracy and ROC AUC, and interpret them in the context of the model.

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5

Fit a lasso regression model. Estimate a lasso logistic regression model to predict whether or not a novel cover depicts a partially unclothed man.

Tip

A lasso regression model is a form of penalized regression where the mixture hyperparameter is set to 1.
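In parsnip terms, the tip above might translate into a model specification like the following sketch. The object name `lasso_spec` is hypothetical.

```r
library(parsnip)
library(tune)

# mixture = 1 selects a pure lasso (L1) penalty;
# penalty = tune() marks the penalty strength for tuning later.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lasso_spec
```

Leaving `penalty` as `tune()` defers the choice of its value to the tuning step, where candidate values are compared across the cross-validation folds.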

Your feature engineering recipe should:

  • For all text predictors (title and description columns):
    • Tokenize all text predictors.
    • Remove stop words.
    • Stem the tokens.
    • Calculate all possible 1-grams, 2-grams, and 3-grams.
    • Retain the 500 most frequently occurring tokens for each predictor.
    • Calculate tf-idf scores.
  • Generate additional text features using step_textfeature(). This creates a series of numeric features based on the original character strings.
  • Convert the date column to usable features, specifically the month and year when the novel was published.
  • Impute all missing values using the median value for numeric predictors and the most frequent value for nominal predictors.
  • Convert all nominal predictors to dummy variables.
  • Normalize all predictors.[2]
  • Downsample the training data to have an equal number of TRUE and FALSE observations when fitting the model.
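The text-processing portion of such a recipe could look roughly like the sketch below, shown on a toy data frame with hypothetical column names. It covers only the tokenization-through-tf-idf steps for a single text column; the remaining steps in the list (text features, date features, imputation, dummy variables, normalization, downsampling) are deliberately left out. It assumes the stopwords and SnowballC packages are installed, since textrecipes relies on them for stop word removal and stemming.

```r
library(recipes)
library(textrecipes)

# Toy data; the column names and text are made up for illustration.
toy <- data.frame(
  description = c(
    "A duke falls for a village baker",
    "The baker's rival returns home for the holidays",
    "Two rivals reluctantly share one small bakery"
  ),
  outcome = factor(c("TRUE", "FALSE", "TRUE"), levels = c("FALSE", "TRUE"))
)

text_rec <- recipe(outcome ~ description, data = toy) |>
  step_tokenize(description) |>
  step_stopwords(description) |>
  step_stem(description) |>
  # build 1-grams, 2-grams, and 3-grams from the stemmed tokens
  step_ngram(description, num_tokens = 3, min_num_tokens = 1) |>
  # keep at most the 500 most frequent tokens
  step_tokenfilter(description, max_tokens = 500) |>
  step_tfidf(description)

# Estimate the recipe and return the processed training data
baked <- bake(prep(text_rec), new_data = NULL)
```

On a dataset this small, `step_tokenfilter()` reports that fewer than 500 tokens are available; that message is expected for the toy example.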

Tune the model over the penalty hyperparameter using a regular grid of at least 30 values.

Tip

Check out dials::grid_regular().
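A regular grid over the penalty parameter can be built as in this sketch. The object name `penalty_grid` is hypothetical, and 30 levels is simply the minimum the exercise asks for.

```r
library(dials)

# 30 evenly spaced values on the log10 scale across penalty()'s default range
penalty_grid <- grid_regular(penalty(), levels = 30)

penalty_grid
```

The resulting tibble has a single `penalty` column and can be passed to the `grid` argument of `tune::tune_grid()`.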

Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 6

Evaluate the best lasso model. Fit the recipe + lasso model with the optimized penalty hyperparameter using the full training set. Evaluate the model’s performance using the test set. Report the accuracy and ROC AUC values for this model, along with the ROC curve and confusion matrix for the predictions. How does this model perform? Does it perform equally well for novels that do and do not depict partially unclothed men on their covers, or does it have a built-in bias towards one specific outcome?

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 3312 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All pages of your homework should be associated with at least one question (i.e., should be “checked”).

Grading

  • Exercise 1: 8 points
  • Exercise 2: 8 points
  • Exercise 3: 4 points
  • Exercise 4: 4 points
  • Exercise 5: 12 points
  • Exercise 6: 8 points
  • Workflow + formatting: 6 points
  • Total: 50 points
Note

The “Workflow + formatting” component assesses the reproducible workflow. This includes:

  • Following tidyverse code style
  • All code being visible in the rendered PDF (no line of code longer than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends
  • Ensuring reproducibility by setting a random seed value.

Footnotes

  1. h/t to the author Alice Liang for collecting and sharing this data publicly.↩︎

  2. Lasso regression requires all features to be scaled and normalized so they have the same mean and variance.↩︎