Project 02
Important dates
Learning objectives
By the end of this project, you will:
- Formulate questions that can be answered with data visualizations.
- Design data visualizations that effectively communicate information.
- Implement best practices for data visualization using {ggplot2} and the grammar of graphics.
- Communicate data-driven findings in written formats.
Introduction
TL;DR: Tell a story using data visualizations.
You will use a dataset from the TidyTuesday project to apply your data visualization skills to tell a story. You can choose any TidyTuesday dataset released in 2026.
Your task for the project is to come up with two questions to answer, answer them with data visualizations, and write up and present your method and findings.
The Data Science Learning Community (DSLC) has facilitated visualization and data analysis through weekly datasets since 2018. As a result there is a wealth of data visualizations published on social media under the hashtag #TidyTuesday. You can easily search for examples on Bluesky and Mastodon, as well as many blog posts and personal websites.
Nicola Rennie is a data visualization specialist who contributed weekly to TidyTuesday for over three years. She turned many of her TidyTuesday projects into The Art of Data Visualization with ggplot2: The TidyTuesday Cookbook. It includes 12 examples of TidyTuesday visualizations, including the data exploration process, initial design sketches, first drafts of plots, and the final polished visualizations. These are excellent examples of the type of workflow you will need to follow for your project.
Deliverables
The primary deliverables for the project are:
- A project proposal.
- A report of your findings.
The files in your repository are organized as a Quarto Project. This enables easy rendering of all Quarto documents within the project folder with a single command, as well as the ability to share YAML configurations across multiple documents. To render the project, use the Positron command palette to run Quarto: Render Project.
Proposal
Your proposal should include:
- A brief description of your dataset including its provenance, dimensions, etc. (Make sure to load the data and use inline code for some of this information.)
- The reason why you chose this dataset.
- The two questions you want to answer.
- A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
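
As a rough illustration of the "load the data and use inline code" suggestion above, here is a minimal sketch. The tiny CSV and the names `example_path`, `example_data`, `n_rows`, and `n_cols` are made up for illustration; in your proposal you would read your own file from the data folder instead.

```r
# Hypothetical stand-in for a TidyTuesday file: write a tiny CSV to a
# temporary location so this sketch runs without downloading anything.
example_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    species = c("a", "b", "c"),
    mass_g = c(10.5, 12.1, 9.8),
    year = c(2024L, 2025L, 2025L)
  ),
  example_path,
  row.names = FALSE
)

# In a real proposal this would point at your own file in data/.
example_data <- read.csv(example_path)

# Store the dimensions once so the prose never goes out of date.
n_rows <- nrow(example_data)
n_cols <- ncol(example_data)

# In proposal.qmd, these values can then be reported with inline code:
# The dataset has `{r} n_rows` rows and `{r} n_cols` columns.
```

Computing the values from the loaded data, rather than typing them by hand, keeps the description correct if the data file changes.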
The dataset you choose should have some numerical and some categorical variables or you should be able to recode some of the existing variables so that you can ultimately have both numerical and categorical variables to work with.
It is also very important that the dataset you choose allows for two distinct questions to be asked and answered using a not-completely-overlapping set of variables, i.e., Question 1 requires the use of variables x, y, and z and Question 2 requires the use of variables a, b, c, and d or x, a, and b. Some shared variables are ok, but the set of variables should not be completely overlapping, i.e., Question 2 can’t also require the use of variables x, y, and z.
Each of the two questions you come up with should involve more than two variables in order to answer. You should phrase them in a way that is within the scope of inference of your data. For example, if you have an observational dataset, you shouldn’t phrase your question in a causal way.
Report
Your report should consist of three parts:
Introduction (1-2 paragraphs): Brief introduction to the dataset. You may repeat some of the information about the dataset provided in the introduction to the dataset on the TidyTuesday repository, paraphrasing in your own words. Imagine that your project is a standalone document and the evaluator has no prior knowledge of the dataset.
Question 1: The title should relate to the question you’re answering.
Introduction (1-2 paragraphs): Introduction to the question and what parts of the dataset are necessary to answer the question. Also discuss why you’re interested in this question.
Approach (1-2 paragraphs): Describe what types of plots you are going to make to address your question. For each plot, provide a clear explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) is best for providing the information you are asking about. The two plots should be of different types, and at least one of the two plots needs to use either color mapping or facets.
Analysis (2-3 code blocks, 2 figures, text/code comments as needed): In this section, provide the code that generates your plots. Use scale functions to provide nice axis labels and guides. You are welcome to use theme functions to customize the appearance of your plot, but you are not required to do so. All plots must be made with {ggplot2}. Do not use base R or lattice plotting functions.
Discussion (1-3 paragraphs): In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by the plots. Speculate about why the data looks the way it does.
Question 2: Same structure as outlined for Question 1, but for your second question. The title should again relate to the question you’re answering.
We encourage you to be concise. A paragraph should typically not be longer than 5 sentences.
You are not required to perform any statistical tests in this project, but you may do so if you find it helpful to answer your question.
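
To make the Analysis requirements more concrete (scale functions for axis labels and guides, plus a color mapping), here is a minimal sketch. It uses the `mpg` dataset that ships with {ggplot2} purely as a stand-in for your TidyTuesday data; the object name, labels, and title are illustrative choices, not a required design.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for your TidyTuesday dataset.
highway_by_displacement <- ggplot(
  data = mpg,
  mapping = aes(x = displ, y = hwy, color = drv)
) +
  # Partial transparency so overlapping points remain visible.
  geom_point(alpha = 0.7) +
  # Scale functions supply readable axis titles instead of raw column names.
  scale_x_continuous(name = "Engine displacement (liters)") +
  scale_y_continuous(name = "Highway fuel economy (mpg)") +
  # The color guide gets a title and human-readable category labels.
  scale_color_discrete(
    name = "Drive train",
    labels = c("4" = "Four-wheel", "f" = "Front-wheel", "r" = "Rear-wheel")
  ) +
  labs(title = "Larger engines tend to have lower highway fuel economy")

highway_by_displacement
```

Swapping the color mapping for `facet_wrap(vars(drv))` would be one way to satisfy the "color mapping or facets" requirement with facets instead.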
Oral exam
Each student will participate in a 15-20 minute oral exam. During the oral exam, you will discuss your design process and answer questions about your visualization. The exam will cover topics such as:
- Your rationale for choosing the dataset and questions you asked.
- Your design choices for the visualizations and how they effectively communicate information.
- Your interpretation of the visualizations and the insights you derived from them.
Students will sign up for oral exam time slots to be held on June 18th.
Reproducibility, organization, and code style
All written work should be reproducible, and the GitHub repo should be neatly organized.
- Points for reproducibility + organization will be based on the reproducibility of the entire repository and the organization of the project GitHub repo.
- The repo should be neatly organized as described in the Repo organization section, with no extraneous files, and all text in the README should be easily readable.
- Code style includes not just formatting but also the use of comments, meaningful variable names, and overall readability of the code.
To check reproducibility, clone the repository to a new location, install the required packages as specified in the lockfile (i.e., with renv::restore()), and try to render all Quarto files and run any R scripts. If everything runs without error, your project is (likely) reproducible!
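
The check described above could be scripted roughly as follows. This is a sketch under assumptions: `check_reproducibility()` is a hypothetical helper name, it assumes the {renv} and {quarto} R packages are installed, and it does not run standalone R scripts, which you would still need to run yourself.

```r
# Hypothetical helper: call it with the path to a fresh clone of the repo.
check_reproducibility <- function(project_dir) {
  # Install the package versions recorded in renv.lock without prompting,
  # so a missing package surfaces as an error rather than a question.
  renv::restore(project = project_dir, prompt = FALSE)

  # Render every Quarto document in the project; any error stops here.
  quarto::quarto_render(input = project_dir)

  invisible(TRUE)
}
```

For example, `check_reproducibility("path/to/fresh-clone")` would stop with an error at the first package or document that fails.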
Repo organization
The following folders and files should be in your project repository:
- /data/*: Your dataset
  - /data/*.csv: Your dataset in CSV format
  - /data/README.md: Metadata about your dataset including information on provenance, codebook, etc. (see Footnotes)
- index.qmd: Your project report
- proposal.qmd: Your project proposal
Wrap up
Submission
Report
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 3312 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages for the assignment as Report and click Submit.
Oral exam
Sign up for an oral exam time slot to be held on June 18th.
Reproducibility, organization, and code style
Ensure all your work has been pushed to GitHub by the deadline.
Grading and evaluation criteria
| Total | 100 pts |
|---|---|
| Report | 50 pts |
| Oral exam | 30 pts |
| Reproducibility and organization | 10 pts |
| Code style | 10 pts |
Report
| Category | Less developed projects | Typical projects | More developed projects |
|---|---|---|---|
| Introduction | Explanation of the question and dataset is unclear or missing. Fails to describe relevant variables. | Provides a clear explanation of the question and the dataset used to answer the question, including a description of all relevant variables in the dataset. | All expectations of typical projects + clearly describes why the question is important and what is at stake in the results of the analysis. Even if the reader doesn’t know much about the subject, they know why they care about the results of your analysis. |
| Q1: Justification of approach | The chosen analysis approach is inappropriate. Visualizations are insufficiently explained and justified. | The chosen analysis approach and visualizations are clearly explained and justified. | All elements of typical projects + shows careful consideration for the most effective chart designs. Goes beyond single layer simplistic charts where appropriate to effectively leverage the grammar of graphics for designing complex statistical charts. |
| Q1: Code | Code is broken or does not work correctly. Code is hard to read for a human being and lacks stylistic consistency. | Code is functional, easy to read, and properly formatted. | All elements of typical projects + code is optimized using best practices and properly documented. |
| Q1: Visualization | Visualizations are inappropriate, hard to read, or lack appropriate labeling. | The visualizations are appropriate, follow best practices as taught in class, are easy to read, and properly labeled. | All elements of typical projects + employ custom visual designs and/or theming. Visualizations are distinctive to the project/group. |
| Q1: Discussion | Discussion of results is underdeveloped. Lacks a substantial connection to the visualizations. | Discussion of results is clear and correct, and it has some depth without being excessively long. | All elements of typical projects + identifies clear insights derived from the visualizations. Analysis demonstrates teams understand not just how to create charts but also effectively interpret them. |
| Q2: Justification of approach | The chosen analysis approach is inappropriate. Visualizations are insufficiently explained and justified. | The chosen analysis approach and visualizations are clearly explained and justified. | All elements of typical projects + shows careful consideration for the most effective chart designs. Goes beyond single layer simplistic charts where appropriate to effectively leverage the grammar of graphics for designing complex statistical charts. |
| Q2: Code | Code is broken or does not work correctly. Code is hard to read for a human being and lacks stylistic consistency. | Code is functional, easy to read, and properly formatted. | All elements of typical projects + code is optimized using best practices and properly documented. |
| Q2: Visualization | Visualizations are inappropriate, hard to read, or lack appropriate labeling. | The visualizations are appropriate, follow best practices as taught in class, are easy to read, and properly labeled. | All elements of typical projects + employ custom visual designs and/or theming. Visualizations are distinctive to the project/group. |
| Q2: Discussion | Discussion of results is underdeveloped. Lacks a substantial connection to the visualizations. | Discussion of results is clear and correct, and it has some depth without being excessively long. | All elements of typical projects + identifies clear insights derived from the visualizations. Analysis demonstrates teams understand not just how to create charts but also effectively interpret them. |
Oral exam
| Category | Less developed responses | Typical responses | More developed responses |
|---|---|---|---|
| Dataset and question rationale | Unable to clearly articulate why the dataset was chosen or justify the relevance of the research questions. Questions lack clarity or depth. | Clearly explains the motivation for selecting the dataset and provides sound justification for the two research questions. Questions are well-framed and appropriate for the data. | All expectations of typical responses + demonstrates sophisticated reasoning about dataset selection, shows how questions build on each other or reveal complementary insights, and articulates broader significance of the inquiry. |
| Visualization design choices | Design choices appear arbitrary or poorly justified. Visualizations do not effectively communicate the intended message. Chart types, encodings, or labeling are unclear or misleading. | Clearly explains rationale for chart type selection and visual encodings. Visualizations follow best practices, use appropriate scales and labels, and effectively communicate key findings. | All expectations of typical responses + demonstrates sophisticated understanding of the grammar of graphics, justifies trade-offs between alternatives, considers audience interpretation, and employs custom design or theming that enhances clarity. |
| Interpretation and insight | Discussion of visualization results is superficial or disconnected from the plots. Fails to identify meaningful patterns or trends. Misinterprets the data shown. | Accurately interprets visualization results and identifies clear patterns and trends. Discussion connects findings back to the original questions and provides substantive insight into what the data reveals. | All expectations of typical responses + identifies subtle or non-obvious insights, speculates thoughtfully about underlying causes, and articulates a coherent narrative that synthesizes findings across both questions. |
| Communication and reflection | Responses are vague or difficult to follow. Struggles to explain design decisions or interpretation when asked follow-up questions. | Communicates the analysis process clearly in conversation, responds to follow-up questions with specific examples, and reflects on strengths and limitations of the visualization approach. | All expectations of typical responses + demonstrates strong ownership of analytical decisions, adapts explanations to probing questions, and offers specific, actionable insights about what the visualizations reveal and how they could be improved. |
Reproducibility and organization
| Category | Less developed projects | Typical projects |
|---|---|---|
| Reproducibility (code) | Required files are missing. Quarto files do not render successfully (except when a package needs to be installed). | All required files are provided. Project files (e.g. Quarto, Shiny apps, R scripts) render without issues and reproduce the necessary outputs. |
| Reproducibility (packages) | renv.lock file does not include all required packages. External users have to manually install packages in order to get code to evaluate. | renv.lock includes all required packages. Manual package installation is not required to render any code in the repo (e.g. Quarto documents, R scripts). |
| Data documentation | Codebook is missing. No local copies of data files. | All datasets are stored in a data folder, a codebook is provided, and a local copy of the data file is used in the code where needed. |
| File organization/readability | Documents lack a clear structure. There are extraneous materials in the repo and/or files are not clearly organized. | Documents (Quarto files and R scripts) are well structured and easy to follow. No extraneous materials. |
Code style
| Category | Less developed projects | Typical projects |
|---|---|---|
| Naming conventions | Variable names, functions, and data objects use inconsistent or unclear naming (e.g., df1, temp). | Variable names, functions, and data objects use clear, descriptive naming that reflects their purpose (e.g., sales_by_region, salesByRegion, etc.) |
| Formatting and spacing | Code lacks consistent indentation, spacing around operators, or appropriate line breaks. Multiple statements appear on single lines. | Code follows {tidyverse} formatting conventions with consistent indentation, spaces around operators, line breaks for long pipelines. |
| Use of base pipe | Pipes are absent, misused, or create overly long/deeply nested workflows that are difficult to understand. | Pipes (\|>) create clear, linear data transformation workflows. Pipelines are reasonably sized and intermediate steps are understandable. |
| Deprecated functions | Uses deprecated or superseded functions. Ignores warnings about deprecated code and does not update the syntax to use the correct implementation. | Does not use deprecated or superseded functions. No warnings or messages are produced noting that the code uses deprecated functions. |
| Code cleanliness | Code contains unused variables, commented-out blocks, duplicated code, or unnecessary intermediate objects. Repeated logic is copy-pasted rather than factored out. | No redundant or dead code present. Where applicable, repeated logic is factored into short, focused, clearly-named functions rather than copy-pasted. |
| Comments | Comments restate what code does rather than explaining why. Comments are excessive, outdated, or missing where needed. | Comments explain why a step exists rather than what it does. Comments are used sparingly and are up to date with the code. |
| Avoidance of code smells | Code contains magic numbers, overly complex expressions, hard-coded paths, or deeply nested conditionals that reduce readability and flexibility. | Code avoids magic numbers, overly complex expressions, hard-coded paths, and deeply nested conditionals. |
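
Several of the code style criteria above can be illustrated together. This is a minimal sketch only: it uses the `mtcars` dataset that ships with R and base R functions so it runs anywhere, and the threshold value and the names `min_horsepower` and `summarize_mean()` are made up for illustration. In your project, tidyverse verbs in a base pipe would follow the same pattern.

```r
# Assumed cutoff for illustration: naming it keeps a magic number out of
# the pipeline and makes the filtering rule easy to change in one place.
min_horsepower <- 100

# Focused helper so the same summary is not copy-pasted for each question.
summarize_mean <- function(data, summary_formula) {
  aggregate(summary_formula, data = data, FUN = mean)
}

# A short, linear base-pipe workflow with a descriptive result name.
mean_mpg_by_cylinders <- mtcars |>
  subset(hp >= min_horsepower) |>
  summarize_mean(mpg ~ cyl)

mean_mpg_by_cylinders
```

The comments record why a step exists (the reason for the named cutoff, the reason for the helper) rather than restating what each line does.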
Late work policy
No late work is accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.
Acknowledgments
- Project instructions draw in part from STA 313: Advanced Data Visualization and INFO 2951: Introduction to Data Science with R
Footnotes
It is ok for you to repeat some information from the TidyTuesday repository, but make sure to appropriately attribute it here.