Mini-project

Modified: April 6, 2026

Important dates

Report: due Wed, March 25th at 11:59pm
Oral exam: conducted between April 6th and April 17th

Learning objectives

By the end of this project, you will:

Design a visualization to communicate insights from a dataset
Create and refine your visualization
Communicate your design process
Reflect on your design choices and the effectiveness of your visualization

Introduction

TL;DR: Create a high-quality data visualization and talk about it.

In this mini-project, you will individually create a data visualization that effectively communicates insights from a dataset provided by the instructor. You will go through the process of exploring the data, sketching and ideating on an initial chart design, creating a rough draft of the chart using R, and refining your visualization to a polished finish. After submitting your visualization, you will participate in an oral exam to discuss your design choices and reflect on the effectiveness of your chart.

Deliverables

The primary deliverables for the project are:

A report documenting your entire design process
An oral examination with the instructional staff

Dataset

For the mini-project you will analyze library checkouts for the 500 “greatest” novels. This dataset contains library circulation information for books in the Top 500 “Greatest” Novels, that is, the novels most widely held in libraries according to OCLC, a major library consortium.

The library checkout data comes from the city of Seattle. The Seattle Public Library’s (SPL) open checkout data is one of the few publicly available sources of book reception data in the country (Gupta et al. 2025; Walsh 2022). The dataset presented here combines the Top 500 “Greatest” Novels list with a mirrored version of the SPL’s open checkout data, recording monthly checkouts from 2005 until February 2025.

Accessing the dataset

The dataset is available in the data/ directory of this repository as a compressed CSV file named top_500_spl_df.csv.gz.¹ You can load the dataset into R using the read_csv() function from the {readr} package, which can read compressed files directly.

library(tidyverse)

checkouts <- read_csv("data/top_500_spl_df.csv.gz")
checkouts

# A tibble: 350,893 × 46
    ...1 usageclass checkouttype materialtype checkoutyear checkoutmonth
   <dbl> <chr>      <chr>        <chr>               <dbl>         <dbl>
 1     0 Physical   Horizon      BOOK                 2006             1
 2     1 Physical   Horizon      BOOK                 2006             1
 3     2 Physical   Horizon      BOOK                 2006             1
 4     3 Physical   Horizon      BOOK                 2006             1
 5     4 Physical   Horizon      BOOK                 2006             1
 6     5 Digital    OverDrive    EBOOK                2006             1
 7     6 Physical   Horizon      BOOK                 2006             1
 8     7 Physical   Horizon      BOOK                 2006             1
 9     8 Physical   Horizon      BOOK                 2006             1
10     9 Physical   Horizon      BOOK                 2006             1
# ℹ 350,883 more rows
# ℹ 40 more variables: checkouts <dbl>, title <chr>, subjects <chr>,
#   creator <chr>, publisher <chr>, publicationyear <chr>, isbn <chr>,
#   `year-month` <date>, last_name <chr>, first_name <chr>,
#   total_checkouts <dbl>, top_500_rank <dbl>, top_500_title <chr>,
#   author <chr>, pub_year <dbl>, orig_lang <chr>, genre <chr>,
#   author_birth <chr>, author_death <chr>, author_gender <chr>, …

What’s in the data?

From the Seattle Public Library, we inherit the following columns.

usageclass: denotes if the item is “physical” or “digital.”
checkouttype: denotes the vendor tool used to check out the item.
materialtype: describes the type of item checked out (examples: book, song, movie, music, magazine).
checkoutyear: the 4-digit year of checkout for this record.
checkoutmonth: the month of checkout for this record.
checkouts: a count of the number of times the title was checked out within the “checkout month.”
isbn: a comma-separated list of ISBNs associated with the item record for the checkout.
title: the full title and subtitle of an individual item.
creator: the author or entity responsible for authoring the item according to the SPL.
subjects: the subject of the item as it appears in the catalog.
publisher: the publisher of the title.
publicationyear: the year from the catalog record in which the item was published, printed, or copyrighted.

The dataset contains extensive metadata information on the Top 500 “Greatest” Novels.

Basic Info on Novels

top_500_rank: numeric rank of text in OCLC’s original Top 500 list.
top_500_title: title of text, as recorded in OCLC’s original Top 500 list.
author: author of text, as recorded in OCLC’s original Top 500 list.
pub_year: year of first publication of text, according to Wikipedia.
orig_lang: original language of text, according to Wikipedia.
genre: genre of text, as recorded in OCLC’s original Top 500 list (filtered by the ‘choose genre’ dropdown).

Author Demographic Info

author_birth: author year of birth, according to VIAF.
author_death: author year of death, according to VIAF.
author_gender: author gender, according to VIAF. Note: VIAF only includes binary gender categories, with an alternate option of “unknown.” Although we want to resist binary categorizations of gender, we have used VIAF because it provides the most comprehensive and accurate information we could find for authors on this list, and because it can be difficult to determine whether historical authors held non-binary identities. If we find evidence that any of the authors on the list identified or identify as non-binary, we will change the gender categories to reflect their identifications.
author_primary_lang: author’s primary language of publication, according to VIAF.
author_nationality: author’s nationality, according to VIAF. VIAF includes multiple national associations for many authors, but we have only collected information on the first country associated with each author. Importantly, this does not include information on tribal citizenship or on changes in nationality across an author’s lifetime.
author_field_of_activity: author’s primary fields of activity, according to VIAF. VIAF includes data from multiple global partner institutions, but we only collect VIAF data associated with the Library of Congress (LOC).
author_occupation: author’s primary occupations, according to VIAF. VIAF includes data from multiple global partner institutions, but we only collect VIAF data associated with the Library of Congress (LOC).

Library Holdings Info

oclc_holdings: total physical library holdings listed in WorldCat for an individual work (OWI), according to Classify.
oclc_eholdings: total digital library holdings listed in WorldCat for an individual work (OWI), according to OCLC.
oclc_total_editions: total editions of an individual work (physical and digital) listed in WorldCat, according to OCLC.
oclc_holdings_rank: numeric rank of text based on total holdings recorded in WorldCat.
oclc_editions_rank: numeric rank of text based on total number of editions recorded in WorldCat.

Online Popularity Info

gr_avg_rating: average star rating for a text on Goodreads.
gr_num_ratings: total number of ratings for a text on Goodreads.
gr_num_reviews: total number of reviews for a text on Goodreads.
gr_avg_rating_rank: numeric rank of text based on average Goodreads rating.
gr_num_ratings_rank: numeric rank of text based on overall number of ratings on Goodreads.

Unique Identifiers and URLs

oclc_owi: work ID on OCLC. A work ID represents a cluster based on “author and title information from bibliographic and authority records.” A title can be represented by multiple clusters, and therefore multiple OWIs. More information about OCLC work clustering can be found here.
author_viaf: author VIAF ID.
gr_url: URL for text on Goodreads.
wiki_url: URL for text on Wikipedia.
pg_eng_url: URL for English-language text on Project Gutenberg.
pg_orig_url: URL for original-language text (where applicable) on Project Gutenberg.
full_text: full text of the novel, if it is in the public domain.

More information on the dataset

The dataset was retrieved from Responsible Datasets in Context. You can find more information about the dataset, including how it was collected and its limitations, in the original post on the Responsible Datasets in Context website.

Note that the website includes some minimal visualizations. You are welcome to read the post and view these visualizations, but the visualization for your project must be your own original work.

Project workflow

Report

You will create a single polished, high-quality visualization using R and the assigned dataset. The report will be generated using Quarto and rendered as an HTML document. Your report will automatically be published via GitHub Pages when you push your changes to the repository.

The report documents your entire design process for the data visualization. It is modeled on Nicola Rennie’s The Art of Data Visualization with ggplot2: The TidyTuesday Cookbook, and should be structured with the following sections and content.

Dataset

Load the required packages
Import the dataset
Briefly describe the dataset, its structure, and its variables

Exploratory work

Learn more about the data and begin to formulate ideas for your visualization.

Data exploration

Summarize and visualize key aspects of the dataset
Identify interesting patterns, trends, or relationships in the data
Document your findings: what are you learning through this exploration? How is this informing your visualization ideas?
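
As one possible starting point for the exploration described above, the sketch below totals checkouts by year and usage class. To keep it self-contained it runs on a small toy tibble (`toy_checkouts`, with invented values) that borrows column names from the dataset; in your report you would apply the same pipeline to the real `checkouts` data frame. The name `yearly_checkouts` is a hypothetical choice, not a column in the dataset.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `checkouts`; the values are invented for illustration only.
toy_checkouts <- tribble(
  ~top_500_title,        ~usageclass, ~checkoutyear, ~checkoutmonth, ~checkouts,
  "Pride and Prejudice", "Physical",  2019,          1,              40,
  "Pride and Prejudice", "Digital",   2019,          1,              25,
  "Pride and Prejudice", "Digital",   2020,          1,              60,
  "Beloved",             "Physical",  2019,          2,              30,
  "Beloved",             "Digital",   2020,          2,              35
)

# Sum the monthly counts into one total per year and usage class
yearly_totals <- toy_checkouts |>
  group_by(checkoutyear, usageclass) |>
  summarize(yearly_checkouts = sum(checkouts), .groups = "drop")

yearly_totals
```

A summary table like this is only a first look; plotting it, or repeating it with other grouping variables such as `genre` or `author_gender`, may surface patterns worth pursuing.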

Exploratory sketches

Sketch out at least two distinct visualization ideas on paper or using a digital tool²
For each sketch, describe:
- The grammar of graphics for the chart (e.g., layers, mappings, scales). It need not be completely worked out yet – for example, you may not have decided on the exact color palette, but you should have a clear idea of the basic structure of the chart and how the data will be mapped to visual channels.
- The rationale behind your design choices (e.g., chart type, color scheme, layout)
- The intended message or insight to be communicated
Reflect on the strengths and weaknesses of each sketch and explain why you ultimately chose one to pursue

Choosing an appropriate type of chart

There are lots of guides and tools online about how to select an appropriate chart type based on the data you have and the message you want to communicate. I personally like From Data to Viz, but there are many others out there.

Preparing a plot

Begin creating your visualization in R based on your chosen sketch.

Data wrangling

Perform any required data wrangling to prepare the dataset for visualization
Document the steps taken and explain why they were necessary for your visualization
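
A wrangling step might look something like the following sketch. It assumes that a single novel can appear under several catalog titles (for example, different editions), which you should verify against the real data, and collapses those rows into one total per novel before keeping the most-borrowed ones. The toy values and the name `novel_checkouts` are invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `checkouts`; the values are invented for illustration only.
toy_checkouts <- tribble(
  ~top_500_title,        ~title,                              ~checkouts,
  "Pride and Prejudice", "Pride and prejudice / Jane Austen", 40,
  "Pride and Prejudice", "Pride and Prejudice (unabridged)",  25,
  "Beloved",             "Beloved : a novel",                 30,
  "Beloved",             "Beloved",                           30,
  "Dracula",             "Dracula / Bram Stoker",             20
)

# Collapse item-level rows into one total per novel, then keep the top two
top_novels <- toy_checkouts |>
  group_by(top_500_title) |>
  summarize(novel_checkouts = sum(checkouts), .groups = "drop") |>
  slice_max(novel_checkouts, n = 2)

top_novels
```

In the report, a sentence explaining why the rows were collapsed (here, so that each novel is counted once regardless of edition) is the kind of justification this section asks for.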

The first plot

Create a functional first draft of your chosen visualization
It need not be polished or final, but it should convey the basic structure and message of your intended chart
Essentially it should have all the grammatical components of the chart (e.g. layers, mappings, scales, etc.) but you do not need to have any of the styling or theming worked out yet
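
To illustrate what "all the grammatical components but no styling" could mean, here is a minimal sketch of a first draft: one data frame, one set of aesthetic mappings, and one geom, with every default left untouched. The data frame `yearly_totals` and its values are invented for illustration; your own draft would use the wrangled version of the real dataset and whichever chart type you chose.

```r
library(ggplot2)

# Toy yearly totals; the numbers are invented for illustration only.
yearly_totals <- data.frame(
  checkoutyear     = c(2019, 2020, 2021, 2019, 2020, 2021),
  usageclass       = rep(c("Digital", "Physical"), each = 3),
  yearly_checkouts = c(25, 60, 70, 70, 20, 45)
)

# Data + mappings + a single layer; default scales, labels, and theme
first_plot <- ggplot(
  yearly_totals,
  aes(x = checkoutyear, y = yearly_checkouts, color = usageclass)
) +
  geom_line()

first_plot
```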

Advanced styling

Make it shine! This is where you take the basic plot you created in the previous section and refine it to a polished, high-quality visualization. Adjustments you will likely make include:

Fine-tuning colors, fonts, and other stylistic elements
Adding titles, labels, and annotations to enhance clarity
Implementing custom themes or styles to align with your design vision
Improving layout, spacing, and aspect ratio for better readability
Ensuring accessibility and usability of the visualization
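
The sketch below shows one way a few of these adjustments could be layered onto a draft: a two-color palette taken from the Okabe-Ito colorblind-friendly set, explicit titles and labels, alt text supplied through `labs(alt = ...)`, and some theme tweaks. The data, title text, and specific theme settings are all invented for illustration and are not a required style.

```r
library(ggplot2)

# Toy yearly totals; the numbers are invented for illustration only.
yearly_totals <- data.frame(
  checkoutyear     = c(2019, 2020, 2021, 2019, 2020, 2021),
  usageclass       = rep(c("Digital", "Physical"), each = 3),
  yearly_checkouts = c(25, 60, 70, 70, 20, 45)
)

polished_plot <- ggplot(
  yearly_totals,
  aes(x = checkoutyear, y = yearly_checkouts, color = usageclass)
) +
  geom_line() +
  # Blue and vermillion from the Okabe-Ito palette
  scale_color_manual(values = c(Digital = "#0072B2", Physical = "#D55E00")) +
  scale_x_continuous(breaks = 2019:2021) +
  labs(
    title = "Digital checkouts overtook physical checkouts (toy data)",
    x = NULL,
    y = "Checkouts per year",
    color = "Format",
    alt = "Line chart of yearly checkouts by format, using invented toy data."
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot",
    legend.position = "top",
    panel.grid.minor = element_blank()
  )

polished_plot
```

Whatever choices you make, be ready to explain each one; the styling should serve the message rather than decorate it.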

Reflection

Given the time constraints, it’s unlikely that your chart will be perfect. However, it’s important to reflect on your design choices and the effectiveness of your visualization. Address the following questions in your reflection:

How well does your final visualization communicate the intended message or insight?
What design choices did you make to enhance clarity and engagement?
What challenges did you encounter during the design process, and how did you address them?
If you had more time, what additional improvements or refinements would you make to your visualization?

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment on how you used GAI tools (e.g., which tools you used and how you used them to assist you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Oral exam

Each student will participate in a 15-20 minute oral exam. During the oral exam, you will discuss your design process and answer questions about your visualization. The exam will cover topics such as:

Your data exploration and insights
The rationale behind your chosen visualization design
Specific design choices and their intended effects
Reflections on the effectiveness of your visualization
Implementation details and your code

Students will sign up for oral exam time slots to be held in the two weeks following Spring Break. The oral exam will be conducted in person with either Dr. Soltoff or Catherine (the PhD TA).

Wrap up

Submission

Render your report and push your changes to the repository by the deadline. This will automatically publish your report online via GitHub Pages.
Check that your report is published online and that all links and images are working correctly. You can find your published report at https://pages.github.coecis.cornell.edu/info3312-sp26/proj-mini-NETID/. It is your responsibility to ensure that your report is published and accessible online by the deadline.
Sign up for an oral exam time slot in the two weeks following Spring Break.

Grading and evaluation criteria

Total	100 pts
Report	50 pts
Oral exam	50 pts

Report

Category	Less developed projects	Typical projects	More developed projects
Data exploration + insight	Exploration is superficial or disconnected from visualization choices. Findings are not clearly articulated or don’t motivate the final design.	Thorough exploration reveals key patterns and relationships. Clear documentation of findings that directly inform visualization choices. Shows genuine discovery process.	All expectations of typical projects + exploration uncovers non-obvious or compelling insights. Articulates a clear, focused narrative that the visualization will communicate.
Design thinking + justification	Sketches are missing or lack description. Rationale for design choices is absent or poorly explained. Little evidence of deliberate decision-making about chart type, variables, or visual encodings.	Presents multiple sketch ideas with clear descriptions of chart type, variables, and design rationale. Explains why the chosen design is appropriate for the data and message. Reflects on trade-offs between options.	All expectations of typical projects + sketches demonstrate sophisticated thinking about design alternatives and how different designs communicate different messages. Shows deep consideration of visual hierarchy, data-ink ratio, and other design principles.
Chart type + grammar	Chart type is inappropriate for the data or message. Missing or incorrect mappings of variables to visual channels. Basic grammatical components (layers, scales) are incomplete or incorrect.	Chart type is appropriate and well-justified. Variables are effectively mapped to visual channels. All grammatical components (layers, mappings, scales) are present and correct. Code is functional and readable.	All expectations of typical projects + uses sophisticated or layered designs (e.g., faceting, multiple geoms) where appropriate. Demonstrates mastery of the grammar of graphics and intentional use of visual encoding to strengthen the message.
Visual design	Visualization is difficult to read. Poor color choices, illegible fonts, or confusing layout. Lacks labels or annotations. Does not follow best practices.	Visualization is polished and easy to read. Appropriate color palette with sufficient contrast. Clean typography and well-organized layout. Clear titles, axis labels, and legends. Follows visualization best practices taught in class.	All expectations of typical projects + employs sophisticated visual design with custom themes, distinctive color schemes, or effective use of whitespace and visual hierarchy. Color choices show understanding of colorblindness accessibility. Typography and styling enhance clarity and engagement.
Accessibility + clarity	Visualization is inaccessible or unclear to the intended audience. Labels are missing or ambiguous. Reliance on color makes the visualization unclear for colorblind viewers.	All elements are clearly labeled and easy to interpret. Color is used effectively with sufficient contrast. Visualization is accessible to a broad audience. Legend or annotations clarify any non-obvious visual encodings.	All expectations of typical projects + goes beyond basic accessibility. Considers multiple ways audiences might interpret the visualization. Thoughtful use of annotations to guide interpretation. Includes alternative text description.
Reflection	Reflection is missing or superficial. Does not meaningfully address design choices or effectiveness.	Reflection addresses design choices and their effects. Shows honest assessment of what worked and what didn’t. Demonstrates understanding of how visualization choices communicate (or fail to communicate) the intended message.	All expectations of typical projects + reflection demonstrates sophisticated understanding of why certain designs are more effective. Identifies specific, actionable improvements. Shows evidence of iterative thinking and learning throughout the design process.

Oral exam

Requirements updated on Monday, April 6

Category	Less developed responses	Typical responses	More developed responses
Data exploration + insight	Unable to clearly describe key patterns or insights from the data. Connections between exploration and the final visualization are unclear or unsupported.	Accurately describes main findings from data exploration and explains how these findings informed the final visualization.	All expectations of typical responses + identifies subtle or non-obvious insights, and clearly articulates a focused message for the audience.
Design rationale + refinement	Struggles to explain why chart type, encodings, or styling choices were made. Limited evidence of iterative refinement or intentional design trade-offs.	Clearly explains the rationale for major design choices (chart type, mappings, layout, color, labeling) and describes how revisions improved clarity and communication.	All expectations of typical responses + demonstrates sophisticated reasoning about alternatives and trade-offs, including accessibility and audience interpretation, and justifies refinements with clear design principles.
Communication + reflection	Responses are vague, inconsistent, or overly memorized; difficulty answering follow-up questions. Reflection on effectiveness is superficial.	Communicates the design process clearly in conversation, responds to follow-up questions with specific examples, and reflects on strengths, limitations, and next steps.	All expectations of typical responses + demonstrates strong ownership of decisions, adapts explanations to probing questions, and offers specific, actionable improvements grounded in critical reflection.

Late work policy

No late work is accepted on this project. Your report must be submitted by the deadline. You are responsible for ensuring the rendered report is published online.

Students are expected to attend their scheduled oral exam time slot. If you miss your scheduled oral exam, we will attempt to reschedule, but this is not guaranteed and may result in a penalty. If you know in advance that you will miss your scheduled oral exam, please contact the instructor as soon as possible to discuss accommodations.

Acknowledgments

Gupta, Neel. 2026. “Library Checkouts for the Top 500 ‘Greatest’ Novels.” Edited by Melanie Walsh, Anna Preus, Amardeep Singh, Sylvia Fernandez, and Miriam Posner. Responsible Datasets in Context. February 25, 2026. https://www.responsible-datasets-in-context.com/posts/spl-500-novels/. Licensed under CC BY-SA 4.0.

References

Gupta, Neel, David Christensen, and Melanie Walsh. 2025. “Seattle Public Library’s Open Checkout Data: What Can It Tell Us about Readers and Book Popularity More Broadly?” Journal of Open Humanities Data 11 (August): 46. https://doi.org/10.5334/johd.332.

Walsh, Melanie. 2022. “Where Is All the Book Data?” Public Books. https://www.publicbooks.org/where-is-all-the-book-data/.

Footnotes

¹ The original CSV file was over 200 megabytes, so we compressed it to make it easier to work with in a Git repository.
² If sketched on paper, take a legible photo and include it in your report.