Mini-project

Modified

March 12, 2026

Important dates

  • Report: due Wed, March 25th at 11:59pm
  • Oral exam: conducted between April 6th and April 17th

Learning objectives

By the end of this project, you will:

  • Design a visualization to communicate insights from a dataset
  • Create and refine your visualization
  • Communicate your design process
  • Reflect on your design choices and the effectiveness of your visualization

Introduction

TL;DR: Create a high-quality data visualization and talk about it.

In this mini-project, you will individually create a data visualization that effectively communicates insights from a dataset provided by the instructor. You will go through the process of exploring the data, sketching and ideating on an initial chart design, creating a rough draft of the chart using R, and refining your visualization to a polished finish. After submitting your visualization, you will participate in an oral exam to discuss your design choices and reflect on the effectiveness of your chart.

Deliverables

The primary deliverables for the project are:

  1. A report documenting your entire design process
  2. An oral examination with the instructional staff

Dataset

For the mini-project, you will analyze library checkouts for the 500 “greatest” novels. This dataset contains library circulation information for books in the Top 500 “Greatest” Novels, that is, the novels most widely held in libraries according to OCLC, a major library consortium.

The library checkout data comes from the city of Seattle. The Seattle Public Library’s (SPL) open checkout data is one of the few publicly available sources of book reception data in the country (Gupta et al. 2025; Walsh 2022). The dataset presented here combines the Top 500 “Greatest” Novels list with a mirrored version of the SPL’s open checkout data, recording monthly checkouts from 2005 until February 2025.

Accessing the dataset

The dataset is available in the data/ directory of this repository as a compressed CSV file named top_500_spl_df.csv.gz (see footnote 1). You can load the dataset into R using the read_csv() function from the {readr} package, which can read compressed files directly.

library(tidyverse)

checkouts <- read_csv("data/top_500_spl_df.csv.gz")
checkouts
# A tibble: 350,893 × 46
    ...1 usageclass checkouttype materialtype checkoutyear checkoutmonth
   <dbl> <chr>      <chr>        <chr>               <dbl>         <dbl>
 1     0 Physical   Horizon      BOOK                 2006             1
 2     1 Physical   Horizon      BOOK                 2006             1
 3     2 Physical   Horizon      BOOK                 2006             1
 4     3 Physical   Horizon      BOOK                 2006             1
 5     4 Physical   Horizon      BOOK                 2006             1
 6     5 Digital    OverDrive    EBOOK                2006             1
 7     6 Physical   Horizon      BOOK                 2006             1
 8     7 Physical   Horizon      BOOK                 2006             1
 9     8 Physical   Horizon      BOOK                 2006             1
10     9 Physical   Horizon      BOOK                 2006             1
# ℹ 350,883 more rows
# ℹ 40 more variables: checkouts <dbl>, title <chr>, subjects <chr>,
#   creator <chr>, publisher <chr>, publicationyear <chr>, isbn <chr>,
#   `year-month` <date>, last_name <chr>, first_name <chr>,
#   total_checkouts <dbl>, top_500_rank <dbl>, top_500_title <chr>,
#   author <chr>, pub_year <dbl>, orig_lang <chr>, genre <chr>,
#   author_birth <chr>, author_death <chr>, author_gender <chr>, …
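
The printed tibble shows two columns that are awkward to work with: an unnamed index column that {readr} labels `...1`, and `year-month`, whose hyphen forces backticks. A minimal sketch of tidying both, assuming a tiny inline CSV with made-up values in place of the real file (the new name `year_month` is only a suggestion):

```r
library(readr)
library(dplyr)

# Tiny inline CSV standing in for the real file; the values are made up.
# The empty first header cell mimics the unnamed index column.
toy_csv <- I(",checkoutyear,checkouts,year-month\n0,2006,3,2006-01-01\n1,2006,5,2006-02-01\n")

# readr assigns the name `...1` to the blank header
toy <- read_csv(toy_csv, show_col_types = FALSE)

toy_clean <- toy |>
  select(!any_of("...1")) |>          # drop the leftover row index, if present
  rename(year_month = `year-month`)   # syntactic name, no backticks needed

names(toy_clean)
```

The same two steps could be piped directly after the read_csv() call on the real file.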

What’s in the data?

From the Seattle Public Library, we inherit the following columns.

  • usageclass: denotes if the item is “physical” or “digital.”
  • checkouttype: denotes the vendor tool used to check out the item.
  • materialtype: describes the type of item checked out (examples: book, song, movie, music, magazine).
  • checkoutyear: the 4-digit year of checkout for this record.
  • checkoutmonth: the month of checkout for this record.
  • checkouts: a count of the number of times the title was checked out within the “checkout month.”
  • isbn: a comma-separated list of ISBNs associated with the item record for the checkout.
  • title: the full title and subtitle of an individual item.
  • creator: the author or entity responsible for authoring the item according to the SPL.
  • subjects: the subject of the item as it appears in the catalog.
  • publisher: the publisher of the title.
  • publicationyear: the year from the catalog record in which the item was published, printed, or copyrighted.
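
Because isbn holds several values in one string, matching on an individual ISBN requires splitting it first. A minimal sketch, assuming made-up titles and placeholder ISBNs, and assuming {tidyr} 1.3.0 or later for separate_longer_delim():

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Made-up records for illustration; the ISBNs are placeholders, not real ones.
toy <- tibble(
  title = c("Novel A", "Novel B"),
  isbn  = c("1111111111, 2222222222", "3333333333")
)

isbn_long <- toy |>
  separate_longer_delim(isbn, delim = ",") |>  # one row per ISBN
  mutate(isbn = str_trim(isbn))                # remove spaces left after the commas

isbn_long
```

Splitting repeats every other column on each new row, including any checkout counts, so sum checkouts before splitting rather than after.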

The dataset contains extensive metadata information on the Top 500 “Greatest” Novels.

Basic Info on Novels

  • top_500_rank: numeric rank of text in OCLC’s original top 500 list.
  • top_500_title: title of text, as recorded in OCLC’s original top 500 list.
  • author: author of text, as recorded in OCLC’s original top 500 list.
  • pub_year: year of first publication of text, according to Wikipedia.
  • orig_lang: original language of text, according to Wikipedia.
  • genre: genre of text, as recorded in OCLC’s original top 500 list (filtered by the ‘choose genre’ dropdown).

Author Demographic Info

  • author_birth: author year of birth, according to VIAF.
  • author_death: author year of death, according to VIAF.
  • author_gender: author gender, according to VIAF. Note: VIAF only includes binary gender categories, with an alternate option of “unknown.” Although we want to resist binary categorizations of gender, we have used VIAF because it provides the most comprehensive and accurate information we could find for authors on this list, and because it can be difficult to know whether historical authors held non-binary identities. If we find evidence that any of the authors on the list identified or identify as non-binary, we will change the gender categories to reflect their identifications.
  • author_primary_lang: author’s primary language of publication, according to VIAF.
  • author_nationality: author’s nationality, according to VIAF. VIAF includes multiple national associations for many authors, but we have only collected information on the first country associated with each author. Importantly, this does not include information on tribal citizenship or on changes in nationality across an author’s lifetime.
  • author_field_of_activity: author’s primary fields of activity, according to VIAF. VIAF includes data from multiple global partner institutions, but we only collect VIAF data associated with the Library of Congress (LOC).
  • author_occupation: author’s primary occupations, according to VIAF. VIAF includes data from multiple global partner institutions, but we only collect VIAF data associated with the Library of Congress (LOC).

Library Holdings Info

  • oclc_holdings: total physical library holdings listed in WorldCat for an individual work (OWI), according to Classify.
  • oclc_eholdings: total digital library holdings listed in WorldCat for an individual work (OWI), according to OCLC.
  • oclc_total_editions: total editions of an individual work (physical and digital) listed in WorldCat, according to OCLC.
  • oclc_holdings_rank: numeric rank of text based on total holdings recorded in WorldCat.
  • oclc_editions_rank: numeric rank of text based on total number of editions recorded in WorldCat.

Online Popularity Info

  • gr_avg_rating: average star rating for a text on Goodreads.
  • gr_num_ratings: total number of ratings for a text on Goodreads.
  • gr_num_reviews: total number of reviews for a text on Goodreads.
  • gr_avg_rating_rank: numeric rank of text based on average Goodreads rating.
  • gr_num_ratings_rank: numeric rank of text based on overall number of ratings on Goodreads.

Unique Identifiers and URLs

  • oclc_owi: work ID on OCLC. A work ID represents a cluster based on “author and title information from bibliographic and authority records.” A title can be represented by multiple clusters, and therefore multiple OWIs. More information about OCLC work clustering can be found here.
  • author_viaf: author VIAF ID.
  • gr_url: URL for text on Goodreads.
  • wiki_url: URL for text on Wikipedia.
  • pg_eng_url: URL for English-language text on Project Gutenberg.
  • pg_orig_url: URL for original-language text (where applicable) on Project Gutenberg.
  • full_text: full text of the novel, if it is in the public domain.

Tip: More information on the dataset

The dataset was retrieved from Responsible Datasets in Context. You can find more information about the dataset, including how it was collected and its limitations, in the original post on the Responsible Datasets in Context website.

Note there are some minimal visualizations included on the website. You are welcome to read the post and these visualizations, but the visualization for your project must be your own original work.

Project workflow

Report

You will create a single polished, high-quality visualization using R and the assigned dataset. The report will be generated using Quarto and rendered as an HTML document. Your report will automatically be published via GitHub Pages when you push your changes to the repository.

The report documents your entire design process for the data visualization. It is modeled on Nicola Rennie’s The Art of Data Visualization with ggplot2: The TidyTuesday Cookbook, and should be structured with the following sections and content.

Dataset

  • Load the required packages
  • Import the dataset
  • Briefly describe the dataset, its structure, and its variables

Exploratory work

Learn more about the data and begin to formulate ideas for your visualization.

Data exploration

  • Summarize and visualize key aspects of the dataset
  • Identify interesting patterns, trends, or relationships in the data
  • Document your findings: what are you learning through this exploration, and how is it informing your visualization ideas?
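
As one example of a quick exploratory summary, the sketch below totals checkouts by format and year. It assumes made-up rows that borrow the dataset's column names; the output column name `n_checkouts` is hypothetical.

```r
library(tibble)
library(dplyr)

# Made-up rows using the dataset's column names.
toy <- tibble(
  usageclass   = c("Physical", "Physical", "Digital", "Digital", "Physical"),
  checkoutyear = c(2006, 2006, 2006, 2007, 2007),
  checkouts    = c(4, 2, 1, 6, 3)
)

# Total checkouts per year and format; `wt` sums the monthly counts
# instead of counting rows.
by_format <- toy |>
  count(checkoutyear, usageclass, wt = checkouts, name = "n_checkouts")

by_format
```

A table like this is a starting point for a question (for instance, how the balance between formats shifts over time), not a finished finding.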

Exploratory sketches

  • Sketch out at least two distinct visualization ideas on paper or using a digital tool (see footnote 2)
  • For each sketch, describe:
    • The grammar of graphics for the chart (e.g., layers, mappings, scales). It need not be completely worked out yet – for example, you may not have decided on the exact color palette, but you should have a clear idea of the basic structure of the chart and how the data will be mapped to visual channels.
    • The rationale behind your design choices (e.g., chart type, color scheme, layout)
    • The intended message or insight to be communicated
  • Reflect on the strengths and weaknesses of each sketch and explain why you ultimately chose one to pursue

Tip: Choosing an appropriate type of chart

There are lots of guides and tools online about how to select an appropriate chart type based on the data you have and the message you want to communicate. I personally like From Data to Viz, but there are many others out there.

Preparing a plot

Begin creating your visualization in R based on your chosen sketch.

Data wrangling

  • Perform any required data wrangling to prepare the dataset for visualization
  • Document the steps taken and explain why they were necessary for your visualization
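
A common wrangling step with this kind of data is collapsing monthly records into one row per novel per year. A minimal sketch, assuming made-up records and a hypothetical output column `yearly_checkouts`; your own chart may need a different unit of analysis:

```r
library(tibble)
library(dplyr)

# Made-up monthly records: the same novel can appear in several rows per month
# (for example, one per format or edition).
toy <- tibble(
  top_500_title = c("Novel A", "Novel A", "Novel A", "Novel B"),
  materialtype  = c("BOOK", "EBOOK", "BOOK", "BOOK"),
  checkoutyear  = c(2006, 2006, 2007, 2006),
  checkoutmonth = c(1, 1, 3, 2),
  checkouts     = c(4, 1, 2, 7)
)

# One row per novel per year, summing across months and material types.
yearly <- toy |>
  group_by(top_500_title, checkoutyear) |>
  summarize(yearly_checkouts = sum(checkouts), .groups = "drop")

yearly
```

In the report, a sentence explaining why this level of aggregation suits the intended message would accompany the code.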

The first plot

  • Create a functional first draft of your chosen visualization
  • It need not be polished or final, but it should convey the basic structure and message of your intended chart
  • Essentially it should have all the grammatical components of the chart (e.g. layers, mappings, scales, etc.) but you do not need to have any of the styling or theming worked out yet
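
To illustrate what "all the grammatical components" can look like, here is a minimal sketch of a first draft with explicit data, mappings, layers, and a scale. The values are made up and the line chart is only one possible design.

```r
library(tibble)
library(ggplot2)

# Made-up yearly totals; in the report these would come from the wrangled data.
toy <- tibble(
  checkoutyear     = rep(2006:2008, times = 2),
  usageclass       = rep(c("Physical", "Digital"), each = 3),
  yearly_checkouts = c(120, 110, 95, 10, 25, 60)
)

first_draft <- ggplot(
  data = toy,
  # mappings: variables to visual channels
  mapping = aes(x = checkoutyear, y = yearly_checkouts, color = usageclass)
) +
  geom_line() +                            # layer 1: trend over time
  geom_point() +                           # layer 2: the individual yearly values
  scale_x_continuous(breaks = 2006:2008)   # scale: one tick per year

first_draft
```

Default colors, labels, and theme are left untouched here; those belong to the styling stage.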

Advanced styling

Make it shine! This is where you take the basic plot you created in the previous section and refine it to a polished, high-quality visualization. Adjustments you will likely make include:

  • Fine-tuning colors, fonts, and other stylistic elements
  • Adding titles, labels, and annotations to enhance clarity
  • Implementing custom themes or styles to align with your design vision
  • Improving layout, spacing, and aspect ratio for better readability
  • Ensuring accessibility and usability of the visualization
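
The adjustments listed above might translate into code along these lines. This is a sketch under stated assumptions, not a required style: the data are made up, the title and caption are placeholders, and the Okabe-Ito palette (built into R 4.0 and later) is one colorblind-friendly option among several. It also assumes {ggplot2} 3.4.0 or later for the linewidth argument.

```r
library(tibble)
library(ggplot2)

# Made-up yearly totals for illustration only.
toy <- tibble(
  checkoutyear     = rep(2006:2008, times = 2),
  usageclass       = rep(c("Physical", "Digital"), each = 3),
  yearly_checkouts = c(120, 110, 95, 10, 25, 60)
)

# Two colors from the Okabe-Ito palette, skipping black (the first entry).
oi <- unname(palette.colors(palette = "Okabe-Ito"))
format_colors <- c(Physical = oi[2], Digital = oi[3])

polished <- ggplot(toy, aes(x = checkoutyear, y = yearly_checkouts, color = usageclass)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = 2006:2008) +
  scale_color_manual(values = format_colors) +
  labs(
    title = "Placeholder title stating the main takeaway",
    x = NULL,
    y = "Checkouts per year",
    color = "Format",
    caption = "Toy data for illustration only",
    # alt text is stored with the plot object
    alt = "Line chart of made-up yearly checkouts for physical and digital formats from 2006 to 2008."
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title.position = "plot",        # align the title with the plot edge
    panel.grid.minor = element_blank(),  # reduce visual clutter
    legend.position = "top"
  )

polished
```

In a Quarto document, alternative text can also be supplied through the fig-alt chunk option.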

Reflection

Given the time constraints, it’s unlikely that your chart will be perfect. However, it’s important to reflect on your design choices and the effectiveness of your visualization. Address the following questions in your reflection:

  • How well does your final visualization communicate the intended message or insight?
  • What design choices did you make to enhance clarity and engagement?
  • What challenges did you encounter during the design process, and how did you address them?
  • If you had more time, what additional improvements or refinements would you make to your visualization?

Generative AI (GAI) self-reflection

As stated in the syllabus, include a written reflection for this assignment on how you used GAI tools (e.g., which tools you used and how they assisted you with writing code), what skills you believe you acquired, and how you believe you demonstrated mastery of the learning objectives.

Oral exam

Each student will participate in a 15-20 minute oral exam. During the oral exam, you will discuss your design process and answer questions about your visualization. The exam will cover topics such as:

  • Your data exploration and insights
  • The rationale behind your chosen visualization design
  • Specific design choices and their intended effects
  • Reflections on the effectiveness of your visualization
  • Implementation details and your code

Students will sign up for oral exam time slots to be held in the two weeks following Spring Break. The oral exam will be conducted in person with either Dr. Soltoff or Catherine (the PhD TA).

Wrap up

Submission

  • Render your report and push your changes to the repository by the deadline. This will automatically publish your report online via GitHub Pages.
  • Check that your report is published online and that all links and images are working correctly. You can find your published report at https://pages.github.coecis.cornell.edu/info3312-sp26/proj-mini-NETID/. It is your responsibility to ensure that your report is published and accessible online by the deadline.
  • Sign up for an oral exam time slot in the two weeks following Spring Break.

Grading and evaluation criteria

  • Report: 50 pts
  • Oral exam: 50 pts
  • Total: 100 pts

Report

Each category is assessed at three levels: less developed projects, typical projects, and more developed projects.

  • Data exploration + insight
    • Less developed projects: Exploration is superficial or disconnected from visualization choices. Findings are not clearly articulated or don’t motivate the final design.
    • Typical projects: Thorough exploration reveals key patterns and relationships. Clear documentation of findings that directly inform visualization choices. Shows genuine discovery process.
    • More developed projects: All expectations of typical projects + exploration uncovers non-obvious or compelling insights. Articulates a clear, focused narrative that the visualization will communicate.
  • Design thinking + justification
    • Less developed projects: Sketches are missing or lack description. Rationale for design choices is absent or poorly explained. Little evidence of deliberate decision-making about chart type, variables, or visual encodings.
    • Typical projects: Presents multiple sketch ideas with clear descriptions of chart type, variables, and design rationale. Explains why the chosen design is appropriate for the data and message. Reflects on trade-offs between options.
    • More developed projects: All expectations of typical projects + sketches demonstrate sophisticated thinking about design alternatives and how different designs communicate different messages. Shows deep consideration of visual hierarchy, data-ink ratio, and other design principles.
  • Chart type + grammar
    • Less developed projects: Chart type is inappropriate for the data or message. Missing or incorrect mappings of variables to visual channels. Basic grammatical components (layers, scales) are incomplete or incorrect.
    • Typical projects: Chart type is appropriate and well-justified. Variables are effectively mapped to visual channels. All grammatical components (layers, mappings, scales) are present and correct. Code is functional and readable.
    • More developed projects: All expectations of typical projects + uses sophisticated or layered designs (e.g., faceting, multiple geoms) where appropriate. Demonstrates mastery of the grammar of graphics and intentional use of visual encoding to strengthen the message.
  • Visual design
    • Less developed projects: Visualization is difficult to read. Poor color choices, illegible fonts, or confusing layout. Lacks labels or annotations. Does not follow best practices.
    • Typical projects: Visualization is polished and easy to read. Appropriate color palette with sufficient contrast. Clean typography and well-organized layout. Clear titles, axis labels, and legends. Follows visualization best practices taught in class.
    • More developed projects: All expectations of typical projects + employs sophisticated visual design with custom themes, distinctive color schemes, or effective use of whitespace and visual hierarchy. Color choices show understanding of colorblindness accessibility. Typography and styling enhance clarity and engagement.
  • Accessibility + clarity
    • Less developed projects: Visualization is inaccessible or unclear to the intended audience. Labels are missing or ambiguous. Color reliance makes visualization unclear for colorblind viewers.
    • Typical projects: All elements are clearly labeled and easy to interpret. Color is used effectively with sufficient contrast. Visualization is accessible to a broad audience. Legend or annotations clarify any non-obvious visual encodings.
    • More developed projects: All expectations of typical projects + goes beyond basic accessibility. Considers multiple ways audiences might interpret the visualization. Thoughtful use of annotations to guide interpretation. Includes alternative text description.
  • Reflection
    • Less developed projects: Reflection is missing or superficial. Does not meaningfully address design choices or effectiveness.
    • Typical projects: Reflection addresses design choices and their effects. Shows honest assessment of what worked and what didn’t. Demonstrates understanding of how visualization choices communicate (or fail to communicate) the intended message.
    • More developed projects: All expectations of typical projects + reflection demonstrates sophisticated understanding of why certain designs are more effective. Identifies specific, actionable improvements. Shows evidence of iterative thinking and learning throughout the design process.

Oral exam

Each category is assessed at three levels: less developed responses, typical responses, and more developed responses.

  • Design rationale
    • Less developed responses: Struggles to explain why chart type, visual encodings, or design choices were made. Explanations are vague or generic.
    • Typical responses: Clearly explains the rationale for chosen chart type, visual encodings, and design decisions. Shows understanding of visualization principles.
    • More developed responses: All expectations of typical responses + demonstrates deep, critical thinking about design alternatives and trade-offs. Can articulate why specific choices best communicate the intended message.
  • Data exploration + insight
    • Less developed responses: Unable to describe key patterns or insights from the data. Connections between exploration and visualization are unclear.
    • Typical responses: Accurately describes main findings from data exploration and how these informed the visualization.
    • More developed responses: All expectations of typical responses + identifies subtle or non-obvious insights. Shows curiosity and genuine engagement with the data.
  • Implementation
    • Less developed responses: Cannot explain code or wrangling steps. Relies on generic or memorized answers.
    • Typical responses: Explains main steps in code and data wrangling. Shows understanding of how code produces the visualization.
    • More developed responses: All expectations of typical responses + demonstrates mastery of implementation details, can discuss alternatives, and justify choices.
  • Reflection + iteration
    • Less developed responses: Reflection is superficial or defensive. Little awareness of strengths, weaknesses, or possible improvements.
    • Typical responses: Thoughtfully reflects on what worked, what didn’t, and what could be improved. Can discuss challenges and how they were addressed.
    • More developed responses: All expectations of typical responses + demonstrates iterative thinking, learning from mistakes, and proposes specific, actionable improvements.
  • Authenticity + ownership
    • Less developed responses: Responses are formulaic, generic, or inconsistent with submitted work. Cannot answer follow-up questions about process or choices.
    • Typical responses: Responses are consistent with submitted work and show personal engagement. Can answer follow-up questions with detail and confidence.
    • More developed responses: All expectations of typical responses + demonstrates clear ownership of the project, with evidence of independent decision-making and authentic learning.

Late work policy

No late work is accepted on this project. Your report must be submitted by the deadline. You are responsible for ensuring the rendered report is published online.

Students are expected to attend their scheduled oral exam time slot. If you miss your scheduled oral exam, we will attempt to reschedule, but this is not guaranteed and may result in a penalty. If you know in advance that you will miss your scheduled oral exam, please contact the instructor as soon as possible to discuss accommodations.

Acknowledgments

References

Gupta, Neel, David Christensen, and Melanie Walsh. 2025. “Seattle Public Library’s Open Checkout Data: What Can It Tell Us about Readers and Book Popularity More Broadly?” Journal of Open Humanities Data 11 (August): 46. https://doi.org/10.5334/johd.332.
Walsh, Melanie. 2022. “Where Is All the Book Data?” Public Books. https://www.publicbooks.org/where-is-all-the-book-data/.

Footnotes

  1. The original CSV file was over 200 megabytes, so we compressed it to make it easier to work with in a Git repository.

  2. If sketched on paper, take a legible photo and include it in your report.