Project 02

Modified

June 19, 2026

Important dates

Proposal: due June 11th at 11:59pm
Report: due June 17th at 1:25pm
Oral exam: June 18th

Learning objectives

By the end of this project, you will:

Formulate questions that can be answered with data visualizations.
Design data visualizations that effectively communicate information.
Implement best practices for data visualization using {ggplot2} and the grammar of graphics.
Communicate data-driven findings in written formats.

Introduction

TL;DR: Tell a story using data visualizations.

You will use a dataset from the TidyTuesday project and apply your data visualization skills to tell a story with it. You can choose any dataset released in 2026 as part of the TidyTuesday project.

Your task for the project is to come up with two questions to answer, answer them with data visualizations, and write up and present your method and findings.

Examples of Tidy Tuesday visualizations

The Data Science Learning Community (DSLC) has facilitated visualization and data analysis through weekly datasets since 2018. As a result, there is a wealth of data visualizations published on social media under the hashtag #TidyTuesday. You can easily find examples on Bluesky and Mastodon, as well as in many blog posts and on personal websites.

Nicola Rennie is a data visualization specialist who contributed weekly to Tidy Tuesday for over three years. She turned many of her Tidy Tuesday projects into The Art of Data Visualization with ggplot2: The TidyTuesday Cookbook. It includes 12 examples of Tidy Tuesday visualizations, including the data exploration process, initial design sketches, first drafts of plots, and the final polished visualizations. These are excellent examples of the type of workflow you will need to follow for your project.

Deliverables

The primary deliverables for the project are:

A project proposal.
A report of your findings.

Organization of files in the repository

The files in your repository are organized as a Quarto Project. This lets you render all Quarto documents within the project folder with a single command and share YAML configurations across multiple documents. To render the project, use the Positron command palette to run Quarto: Render Project.

Deliverables

Proposal

Your proposal should include:

A brief description of your dataset including its provenance, dimensions, etc. (Make sure to load the data and use inline code for some of this information.)
The reason why you chose this dataset.
The two questions you want to answer.
A plan for answering each of the questions, including the variables involved, variables to be created (if any), and external data to be merged in (if any).
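
The dataset description above asks for inline code. As a minimal sketch of what that could look like, the block below computes the dimensions once and shows, in a comment, how the values might be referenced in the proposal text. The data frame `example_data` and its columns are hypothetical stand-ins; in your proposal you would read your own file from the data folder instead.

```r
# Hypothetical stand-in data. In the proposal, read your own dataset
# instead, e.g. with read.csv() or readr::read_csv() on your CSV file.
example_data <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  body_mass_g = c(3750, 5000, 3800, 3600)
)

# Store the dimensions once so the text and the data cannot drift apart.
n_rows <- nrow(example_data)
n_cols <- ncol(example_data)

# In the .qmd text (outside a code chunk), the values can then be
# referenced with inline code, for example:
#   The dataset has `r n_rows` rows and `r n_cols` columns.
```

Because the numbers are computed from the data at render time, the description stays correct if the dataset is updated or filtered.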

Choosing a dataset

The dataset you choose should have some numerical and some categorical variables. Alternatively, you should be able to recode some of the existing variables so that you ultimately have both numerical and categorical variables to work with.

It is also very important that the dataset you choose allows for two distinct questions to be asked and answered using a not-completely-overlapping set of variables. For example, Question 1 requires the use of variables x, y, and z, and Question 2 requires the use of variables a, b, c, and d, or x, a, and b. Some shared variables are ok, but the sets of variables should not be completely overlapping, i.e., Question 2 can’t also require the use of variables x, y, and z.

Framing your questions

Each of the two questions you come up with should involve more than two variables in order to answer. You should phrase them in a way that is within the scope of inference of your data. For example, if you have an observational dataset, you shouldn’t phrase your question in a causal way.

Report

Your report should consist of three parts:

Introduction (1-2 paragraphs): Brief introduction to the dataset. You may repeat some of the information about the dataset provided in the introduction to the dataset on the TidyTuesday repository, paraphrasing in your own words. Imagine that your project is a standalone document and the evaluator has no prior knowledge of the dataset.
Question 1: The title should relate to the question you’re answering.
- Introduction (1-2 paragraphs): Introduction to the question and what parts of the dataset are necessary to answer the question. Also discuss why you’re interested in this question.
- Approach (1-2 paragraphs): Describe what types of plots you are going to make to address your question. For each plot, provide a clear explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) is best for providing the information you are asking about. The two plots should be of different types, and at least one of the two plots needs to use either color mapping or facets.
- Analysis (2-3 code blocks, 2 figures, text/code comments as needed): In this section, provide the code that generates your plots. Use scale functions to provide nice axis labels and guides. You are welcome to use theme functions to customize the appearance of your plot, but you are not required to do so. All plots must be made with {ggplot2}. Do not use base R or lattice plotting functions.
- Discussion (1-3 paragraphs): In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by the plots. Speculate about why the data looks the way it does.
Question 2: Same structure as outlined for Question 1, but for your second question. The title should relate to the question you’re answering.
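
As one illustration of the Analysis requirements above (scale functions for axis labels and guides, plus color mapping and facets), here is a minimal sketch. It uses the `mpg` dataset that ships with {ggplot2} purely as a stand-in, and the label text is an assumption chosen for this example; your own plots should use your TidyTuesday dataset and labels that fit your question.

```r
library(ggplot2)

# `mpg` is bundled with ggplot2 and is used here only as a stand-in dataset.
mpg_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # Facets split the plot by a categorical variable (model year).
  facet_wrap(vars(year)) +
  # Scale functions replace raw column names with readable labels.
  scale_x_continuous(name = "Engine displacement (liters)") +
  scale_y_continuous(name = "Highway fuel economy (miles per gallon)") +
  scale_color_discrete(
    name = "Drive train",
    labels = c("4" = "Four-wheel", "f" = "Front-wheel", "r" = "Rear-wheel")
  ) +
  labs(title = "Engine size and highway fuel economy by model year")

mpg_plot
```

This single plot uses both color mapping and facets; only one of the two is required, and only in at least one of the two plots for each question.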

We encourage you to be concise. A paragraph should typically not be longer than 5 sentences.

You are not required to perform any statistical tests in this project, but you may do so if you find it helpful to answer your question.

Oral exam

Each student will participate in a 15-20 minute oral exam. During the oral exam, you will discuss your design process and answer questions about your visualization. The exam will cover topics such as:

Your rationale for choosing the dataset and questions you asked.
Your design choices for the visualizations and how they effectively communicate information.
Your interpretation of the visualizations and the insights you derived from them.

Students will sign up for oral exam time slots to be held on June 18th.

Reproducibility, organization, and code style

All written work should be reproducible, and the GitHub repo should be neatly organized.

Points for reproducibility + organization will be based on the reproducibility of the entire repository and the organization of the project GitHub repo.
The repo should be neatly organized as described under Repo organization, there should be no extraneous files, and all text in the README should be easily readable.
Code style includes not just formatting but also the use of comments, meaningful variable names, and overall readability of the code.

Repo organization

The following folders and files should be in your project repository:

/data/*: Your dataset
- /data/*.csv: Your dataset in CSV format
- /data/README.md: Metadata about your dataset including information on provenance, codebook, etc.¹
index.qmd: Your project report
proposal.qmd: Your project proposal

Wrap up

Submission

Proposal

The proposal is ungraded, but I will provide written feedback to any student who submits one. To submit your proposal, push your rendered proposal.pdf to GitHub by 11:59pm on June 11th.

Report

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials → Cornell University NetID and log in using your NetID credentials.
Click on your INFO 3312 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages for the assignment as Report and click Submit.

Oral exam

Reproducibility, organization, and code style

Ensure all your work has been pushed to GitHub by the deadline.

Grading and evaluation criteria

Total	100 pts
Report	50 pts
Oral exam	30 pts
Reproducibility and organization	10 pts
Code style	10 pts

Report

Category	Less developed projects	Typical projects	More developed projects
Introduction	Explanation of the question and dataset is unclear or missing. Fails to describe relevant variables.	Provides a clear explanation of the question and the dataset used to answer the question, including a description of all relevant variables in the dataset.	All expectations of typical projects + clearly describes why the question is important and what is at stake in the results of the analysis. Even if the reader doesn’t know much about the subject, they know why they care about the results of your analysis.
Q1: Justification of approach	The chosen analysis approach is inappropriate. Visualizations are insufficiently explained and justified.	The chosen analysis approach and visualizations are clearly explained and justified.	All elements of typical projects + shows careful consideration for the most effective chart designs. Goes beyond single layer simplistic charts where appropriate to effectively leverage the grammar of graphics for designing complex statistical charts.
Q1: Code	Code is broken or does not work correctly. Code is hard to read for a human being and lacks stylistic consistency.	Code is functional, easy to read, and properly formatted.	All elements of typical projects + code is optimized using best practices and properly documented.
Q1: Visualization	Visualizations are inappropriate, hard to read, or lack appropriate labeling.	The visualizations are appropriate, follow best practices as taught in class, are easy to read, and properly labeled.	All elements of typical projects + employ custom visual designs and/or theming. Visualizations are distinctive to the project/group.
Q1: Discussion	Discussion of results is underdeveloped. Lacks a substantial connection to the visualizations.	Discussion of results is clear and correct, and it has some depth without being excessively long.	All elements of typical projects + identifies clear insights derived from the visualizations. Analysis demonstrates an understanding of not just how to create charts but also how to interpret them effectively.
Q2: Justification of approach	The chosen analysis approach is inappropriate. Visualizations are insufficiently explained and justified.	The chosen analysis approach and visualizations are clearly explained and justified.	All elements of typical projects + shows careful consideration for the most effective chart designs. Goes beyond single layer simplistic charts where appropriate to effectively leverage the grammar of graphics for designing complex statistical charts.
Q2: Code	Code is broken or does not work correctly. Code is hard to read for a human being and lacks stylistic consistency.	Code is functional, easy to read, and properly formatted.	All elements of typical projects + code is optimized using best practices and properly documented.
Q2: Visualization	Visualizations are inappropriate, hard to read, or lack appropriate labeling.	The visualizations are appropriate, follow best practices as taught in class, are easy to read, and properly labeled.	All elements of typical projects + employ custom visual designs and/or theming. Visualizations are distinctive to the project/group.
Q2: Discussion	Discussion of results is underdeveloped. Lacks a substantial connection to the visualizations.	Discussion of results is clear and correct, and it has some depth without being excessively long.	All elements of typical projects + identifies clear insights derived from the visualizations. Analysis demonstrates an understanding of not just how to create charts but also how to interpret them effectively.

Oral exam

Category	Less developed responses	Typical responses	More developed responses
Dataset and question rationale	Unable to clearly articulate why the dataset was chosen or justify the relevance of the research questions. Questions lack clarity or depth.	Clearly explains the motivation for selecting the dataset and provides sound justification for the two research questions. Questions are well-framed and appropriate for the data.	All expectations of typical responses + demonstrates sophisticated reasoning about dataset selection, shows how questions build on each other or reveal complementary insights, and articulates broader significance of the inquiry.
Visualization design choices	Design choices appear arbitrary or poorly justified. Visualizations do not effectively communicate the intended message. Chart types, encodings, or labeling are unclear or misleading.	Clearly explains rationale for chart type selection and visual encodings. Visualizations follow best practices, use appropriate scales and labels, and effectively communicate key findings.	All expectations of typical responses + demonstrates sophisticated understanding of the grammar of graphics, justifies trade-offs between alternatives, considers audience interpretation, and employs custom design or theming that enhances clarity.
Interpretation and insight	Discussion of visualization results is superficial or disconnected from the plots. Fails to identify meaningful patterns or trends. Misinterprets the data shown.	Accurately interprets visualization results and identifies clear patterns and trends. Discussion connects findings back to the original questions and provides substantive insight into what the data reveals.	All expectations of typical responses + identifies subtle or non-obvious insights, speculates thoughtfully about underlying causes, and articulates a coherent narrative that synthesizes findings across both questions.
Communication and reflection	Responses are vague or difficult to follow. Struggles to explain design decisions or interpretation when asked follow-up questions.	Communicates the analysis process clearly in conversation, responds to follow-up questions with specific examples, and reflects on strengths and limitations of the visualization approach.	All expectations of typical responses + demonstrates strong ownership of analytical decisions, adapts explanations to probing questions, and offers specific, actionable insights about what the visualizations reveal and how they could be improved.

Reproducibility and organization

Category	Less developed projects	Typical projects
Reproducibility (code)	Required files are missing. Quarto files do not render successfully (except when a package needs to be installed).	All required files are provided. Project files (e.g. Quarto, Shiny apps, R scripts) render without issues and reproduce the necessary outputs.
Data documentation	Codebook is missing. No local copies of data files.	All datasets are stored in a data folder, a codebook is provided, and a local copy of the data file is used in the code where needed.
File organization/readability	Documents lack a clear structure. There are extraneous materials in the repo and/or files are not clearly organized.	Documents (Quarto files and R scripts) are well structured and easy to follow. No extraneous materials.

Code style

Category	Less developed projects	Typical projects
Naming conventions	Variable names, functions, and data objects use inconsistent or unclear naming (e.g., `df1`, `temp`).	Variable names, functions, and data objects use clear, descriptive naming that reflects their purpose (e.g., `sales_by_region`, `salesByRegion`).
Formatting and spacing	Code lacks consistent indentation, spacing around operators, or appropriate line breaks. Multiple statements appear on single lines.	Code follows {tidyverse} formatting conventions with consistent indentation, spaces around operators, and line breaks for long pipelines.
Use of base pipe	Pipes are absent, misused, or create overly long/deeply nested workflows that are difficult to understand.	Pipes (`|>`) create clear, linear data transformation workflows. Pipelines are reasonably sized and intermediate steps are understandable.
Deprecated functions	Uses deprecated or superseded functions. Fails to heed warnings about deprecated code and update the syntax to use the correct implementation.	Does not use deprecated or superseded functions. No warnings or messages are produced noting that the code uses deprecated functions.
Code cleanliness	Code contains unused variables, commented-out blocks, duplicated code, or unnecessary intermediate objects. Repeated logic is copy-pasted rather than factored out.	No redundant or dead code present. Where applicable, repeated logic is factored into short, focused, clearly-named functions rather than copy-pasted.
Comments	Comments restate what code does rather than explaining why. Comments are excessive, outdated, or missing where needed.	Comments explain why a step exists rather than what it does. Comments are used sparingly and are up to date with the code.
Avoidance of code smells	Code contains magic numbers, overly complex expressions, hard-coded paths, or deeply nested conditionals that reduce readability and flexibility.	Code avoids magic numbers, overly complex expressions, hard-coded paths, and deeply nested conditionals.
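
To make a few of the code style criteria above concrete, here is a minimal sketch that combines descriptive names, a named constant in place of a magic number, a short helper function, and a linear base-pipe workflow. The data, the threshold, and the helper `total_revenue_by_region()` are all hypothetical and exist only for illustration. The sketch uses base R so that it runs without extra packages (R 4.1 or later is needed for `|>`); in your project you would more likely use {tidyverse} verbs.

```r
# Hypothetical example data; column names are illustrative only.
sales <- data.frame(
  region = c("North", "South", "North", "East"),
  revenue = c(120, 80, 200, 150)
)

# Named threshold so the cutoff is documented and easy to change.
min_revenue <- 100

# Hypothetical helper: keeps the aggregation logic in one named place
# instead of copy-pasting it wherever a regional total is needed.
total_revenue_by_region <- function(data) {
  aggregate(revenue ~ region, data = data, FUN = sum)
}

revenue_by_region <- sales |>
  subset(revenue >= min_revenue) |>
  total_revenue_by_region()
```

Each step in the pipeline does one thing, and the comments explain why a step exists rather than restating what the code does.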

Late work policy

Late work is not accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.

Acknowledgments

Project instructions draw in part from STA 313: Advanced Data Visualization and INFO 2951: Introduction to Data Science with R.

Footnotes

It is ok for you to repeat some information from the TidyTuesday repository, but make sure to appropriately attribute it here.