Project 1
Important dates
- Proposal for peer review: due Thu, Feb 12th at 11:59pm
- Peer review feedback: due Fri, Feb 13th at 11:59pm
- Revised proposal for instructor review: due Thu, Feb 20th at 11:59pm
- Report and presentation: due Thu, Mar 5th at 1:25pm
The details will be updated as the project date approaches.
Learning objectives
By the end of this project, you will:
- Formulate questions that can be answered with data visualizations.
- Design data visualizations that effectively communicate information.
- Implement best practices for data visualization using {ggplot2} and the grammar of graphics.
- Communicate data-driven findings in written and oral formats.
Introduction
TL;DR: Tell a story using data visualizations.
You will use a dataset from the TidyTuesday project to apply your data visualization skills to tell a story. You can choose any dataset released since 2025 as part of this project.
Your task for the project is to come up with two questions to answer, answer them with data visualizations, and write up and present your method and findings.
The Data Science Learning Community (DSLC) has facilitated visualization and data analysis through weekly datasets since 2018. As a result there is a wealth of data visualizations published on social media under the hashtag #TidyTuesday. You can easily search for examples on Bluesky and Mastodon, as well as many blog posts and personal websites.
Nicola Rennie is a data visualization specialist who contributed weekly to TidyTuesday for over three years. She turned many of her TidyTuesday projects into The Art of Data Visualization with ggplot2: The TidyTuesday Cookbook. It includes 12 examples of TidyTuesday visualizations, including the data exploration process, initial design sketches, first drafts of plots, and the final polished visualizations. These are excellent examples of the type of workflow you will need to follow for your project.
Deliverables
The primary deliverables for the project are:
- A project proposal.
- A report of your findings.
- An oral presentation with slides.
There will be additional submissions throughout the semester to facilitate completion of the final product and presentation.
The files in your repository are organized as a Quarto Project. This enables easy rendering of all Quarto documents within the project folder with a single command, as well as the ability to share YAML configurations across multiple documents. To render the project use the Positron command palette to run Quarto: Render Project.
Teams
Projects will be completed in teams of 3-4 students. Every team member should be involved in all aspects of planning and executing the project. Each team member should make an equal contribution to all parts of the project. The scope of your project is based on the number of contributing team members on your project. If you have 4 contributing team members, we will expect a larger project than a team of 3 contributing team members.
Students must work with other students registered for the same course number (i.e. 3312 students will be partnered with other 3312 students, 5312 students with other 5312 students). The course staff will assign students to teams. To facilitate this process, we will provide a short survey identifying study and communication habits. Once teams are assigned, they cannot be changed.
Team conflicts
Conflict is a healthy part of any team relationship. If your team doesn’t have conflict, then your team members are likely not communicating their issues with each other. Use your team contract (written at the beginning of the project) to help keep your team dynamic healthy.
When you have conflict, you should follow this procedure:
1. Refer to the team contract and follow it to address the conflict.
2. If you resolve the conflict without issue, great! Otherwise, update the team contract and try to resolve the conflict yourselves.
3. If your team is unable to resolve your conflict, please contact soltoffbc@cornell.edu and explain your situation.
4. We’ll ask to meet with all the group members and figure out how we can work together to move forward.
Please do not avoid confrontation if you have conflict. If there’s a conflict, the best way to handle it is to bring it into the open and address it.
Project grade adjustments
Remember, do not do the work for a slacking team member. This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Your team will initially receive a final grade assuming that all team members contributed to your project. If you have a 4-person team, but only 3 persons contributed, your team will likely receive a lower grade initially because only 3 persons worth of effort exists for a 4-person project. About a week after the initial project grades are released, adjustments will be made to each individual team member’s group project grade.
We use your project’s Git history (to view the contributions of each team member) and the peer evaluations to adjust each team member’s grade. Your grade may be adjusted up or down based on your individual contributions.
For example, if you have a 4-person team, but only 3 contributing members, the 3 contributing members may have their grades increased to reflect the effort of only 3 contributing members. The non-contributing member will likely have their grade decreased significantly.
I am serious about every member of the team equitably contributing to the project. Students who fail to contribute equitably may receive up to a 100% deduction on their project grade.
Please be patient with the grade adjustments. They take time to do fairly. Please know that I handle this entire process myself, and I take it very seriously. If you think your initial group project grade is unfair, please wait for your grade adjustment before you contact us.
The slacking team member
Please do not cover for a slacking/freeloading team member. Please do not do their work for them! This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Remember, we have your Git history. We can see who contributes to the project and who doesn’t. If a team member rarely commits to Git and only makes very small commits, we can see that they did not contribute their fair share.
All students should make their project contributions through their own GitHub account. Do not commit changes to the repository from another team member’s GitHub account. Your Git history should reflect your individual contributions to the project.
Deliverables
Proposal
Your proposal should include:
- A brief description of your dataset including its provenance, dimensions, etc. (Make sure to load the data and use inline code for some of this information.)
- The reason why you chose this dataset.
- The two questions you want to answer.
- A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
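As a rough sketch of the first bullet, the dataset's dimensions can be computed once in a code chunk and then referenced with inline code instead of being typed by hand. The file path and object names below are placeholders (assumptions, not part of the assignment), and a tiny stand-in CSV is written to a temporary file only so the example runs on its own.

```r
# Stand-in data written to a temporary CSV; in your proposal you would read
# the file stored in your repository's data folder instead (path is hypothetical).
example_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(species = c("a", "b", "c"), mass_g = c(10.5, 12.1, 9.8)),
  example_path,
  row.names = FALSE
)

project_data <- read.csv(example_path)

# Values to reference with inline code rather than hard-coding them in prose
n_observations <- nrow(project_data)
n_variables <- ncol(project_data)

# In the Quarto text you could then write, for example:
# The dataset has `r n_observations` rows and `r n_variables` columns.
```

Because the numbers come from the data itself, the description stays correct if the dataset is later filtered or updated.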
The dataset you choose should have some numerical and some categorical variables or you should be able to recode some of the existing variables so that you can ultimately have both numerical and categorical variables to work with.
It is also very important that the dataset you choose allows for two distinct questions to be asked and answered using a not-completely-overlapping set of variables, i.e., Question 1 requires the use of variables x, y, and z and Question 2 requires the use of variables a, b, c, and d or x, a, and b. Some shared variables are ok, but the set of variables should not be completely overlapping, i.e., Question 2 can’t also require the use of variables x, y, and z.
Each of the two questions you come up with should involve more than two variables in order to answer. You should phrase them in a way that is within the scope of inference of your data. For example, if you have an observational dataset, you shouldn’t phrase your question in a causal way.
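One way to sanity-check a pair of questions against these requirements is sketched below. The variable names mirror the placeholder letters used above, and treating "completely overlapping" as "the two sets are identical" is an interpretation made for this example.

```r
# Hypothetical variable sets for the two questions
question_1_vars <- c("x", "y", "z")
question_2_vars <- c("x", "a", "b")

# Each question should involve more than two variables
min_vars_per_question <- 3
enough_variables <- length(question_1_vars) >= min_vars_per_question &&
  length(question_2_vars) >= min_vars_per_question

# The sets may share variables but should not be identical
shared_vars <- intersect(question_1_vars, question_2_vars)
completely_overlapping <- setequal(question_1_vars, question_2_vars)
```

Here `enough_variables` is `TRUE`, `shared_vars` contains only `"x"`, and `completely_overlapping` is `FALSE`, so this pair of questions would fit the guidelines.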
Peer review
Reviewer tasks
Critically reviewing others’ work is a crucial part of the scientific process, and INFO 3312/5312 is no exception. You will be assigned two teams to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.
The peer review assignments are as follows:
TODO
Teams will develop the review together, with discussion among all team members, but only one team member will submit it as an issue on the project repo. To do so, go to the Issues tab, click on the green New issue button on the top right, and then click on the green Get started button for the issue template titled Peer review.
This will start a new issue with a peer review form that you can fill out. You’re expected to be thorough in your review, but this doesn’t necessarily require lengthy responses.
Remember, your goal is to help the team whose project proposal you’re reviewing. The team will not lose points because of issues you point out, as long as they address them before I review their proposals. You should be critical, but respectful in your review. Peer reviews will be evaluated on the quality of the feedback left for the other teams.
Reviewee tasks
Once you receive feedback from your peers, you should address it. You should do this by directly updating your proposal or making any other updates to your repo as needed. You can make these updates all in one commit or spread them across multiple commits.
Regardless, in the last commit that addresses the peer review comments, you should use a keyword in your commit message that will close the peer review issues. These words are close, closes, closed, fix, fixes, fixed, resolve, resolves, and resolved and they need to be followed by the issue number (which you can find next to the issue title). So, your commit message can say something like “Finished updates based on peer review, fixes #1”.
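If you want to double-check a commit message before pushing, a rough local check could look like the sketch below. The helper name `closes_issue()` is hypothetical, and the regular expression only approximates the keyword rule described above; it is not GitHub's own parser.

```r
# Closing keywords listed above, each followed by "#<issue number>"
closing_pattern <- "\\b(close[sd]?|fix(e[sd])?|resolve[sd]?)\\s+#[0-9]+"

# Hypothetical helper: TRUE if the message contains a closing keyword
# followed by an issue reference such as "#1"
closes_issue <- function(commit_message) {
  grepl(closing_pattern, commit_message, ignore.case = TRUE, perl = TRUE)
}

closes_issue("Finished updates based on peer review, fixes #1")  # TRUE
closes_issue("Finished updates based on peer review")            # FALSE
```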
Report
Your report should consist of three parts:
- Introduction (1-2 paragraphs): Brief introduction to the dataset. You may repeat some of the information about the dataset provided in the introduction to the dataset on the TidyTuesday repository, paraphrasing on your own terms. Imagine that your project is a standalone document and the evaluator has no prior knowledge of the dataset.
- Question 1: The title should relate to the question you’re answering.
  - Introduction (1-2 paragraphs): Introduction to the question and what parts of the dataset are necessary to answer the question. Also discuss why you’re interested in this question.
  - Approach (1-2 paragraphs): Describe what types of plots you are going to make to address your question. For each plot, provide a clear explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) is best for providing the information you are asking about. The two plots should be of different types, and at least one of the two plots needs to use either color mapping or facets.
  - Analysis (2-3 code blocks, 2 figures, text/code comments as needed): In this section, provide the code that generates your plots. Use scale functions to provide nice axis labels and guides. You are welcome to use theme functions to customize the appearance of your plot, but you are not required to do so. All plots must be made with {ggplot2}. Do not use base R or lattice plotting functions.
  - Discussion (1-3 paragraphs): In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by the plots. Speculate about why the data looks the way it does.
- Question 2: Same structure outlined for Question 1, but for your new question. And the title should relate to the question you’re answering.
We encourage you to be concise. A paragraph should typically not be longer than 5 sentences.
You are not required to perform any statistical tests in this project, but you may do so if you find it helpful to answer your question.
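To make the Analysis requirements more concrete, here is a minimal sketch of a plot that uses color mapping and sets its axis labels and legend through scale functions. It uses the `mpg` dataset that ships with {ggplot2} as a stand-in for your TidyTuesday data; the object name and label text are illustrative assumptions, not a template you must follow.

```r
library(ggplot2)

# `mpg` ships with ggplot2 and stands in for your TidyTuesday data
highway_by_displacement <- ggplot(
  mpg,
  aes(x = displ, y = hwy, color = drv)
) +
  geom_point(alpha = 0.7) +
  # Scale functions supply readable axis titles and legend labels
  scale_x_continuous(name = "Engine displacement (liters)") +
  scale_y_continuous(name = "Highway fuel economy (miles per gallon)") +
  scale_color_discrete(
    name = "Drive train",
    labels = c("4" = "Four-wheel", "f" = "Front-wheel", "r" = "Rear-wheel")
  )
```

Printing `highway_by_displacement` in a code chunk draws the figure. A second plot of a different type (for example, a boxplot or histogram) would complete the two-figure requirement for a question.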
Presentation
Your presentation should be a concise oral presentation that identifies and answers the questions you ask in your report. Your presentation will be created using Quarto. Use the presentation to tell your story. Your report is about not only the story, but your process for writing the story. The presentation is not the same thing - just tell your story, and use data to support your arguments.
- You don’t need to explain your approach to the plots.
- We should not see any code in your slides.
- The plots should be designed for a slide presentation. Make appropriate adjustments to any plots you created for the report (e.g. improved font size, adding/removing annotations), or design completely different plots for the presentation.
As a starting point, I recommend 1 slide for introduction, 2 slides for Question 1, and 2 slides for Question 2. You can imagine spending roughly one minute on each slide. You should feel free to have more (or fewer) slides. We expect all slides to be created using Quarto. Your evaluation will be based on your content, professionalism (including sticking to time), and your performance during the Q&A (question and answer).
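One possible way to adapt a report plot for slides is to reuse the plot object and raise the base font size through a theme. The sketch below again uses the built-in `mpg` data; the plot itself and the font size of 20 are illustrative choices rather than course requirements.

```r
library(ggplot2)

# Hypothetical report plot; `mpg` ships with ggplot2
report_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(x = "Vehicle class", y = "Highway fuel economy (miles per gallon)")

# Larger base font size so text stays legible on a projected slide;
# the value 20 is an illustrative choice, not a course requirement
slide_base_font_size <- 20
slide_plot <- report_plot +
  theme_minimal(base_size = slide_base_font_size)
```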
Website
Each of your projects will have a website that looks like this. You are not expected to change the styling of the website, but if you want to, you’ll need to edit the _quarto.yml file in your repo. Feel free to Google your way around it or ask on the discussion forum/office hours!
Reproducibility, organization, and code style
All written work should be reproducible, and the GitHub repo should be neatly organized.
- Points for reproducibility + organization will be based on the reproducibility of the entire repository and the organization of the project GitHub repo.
- The repo should be neatly organized as described in the Repo organization section, with no extraneous files, and all text in the README should be easily readable.
- Code style includes not just formatting but also the use of comments, meaningful variable names, and overall readability of the code.
Have a team member clone the repository to a new location, install required packages as specified in the lockfiles (i.e. renv::restore()), and try to render all Quarto files and run any R scripts. If everything runs without error, your project is (likely) reproducible!
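The check described above could be scripted roughly as follows. This is a sketch under a few assumptions: {renv} is installed, the Quarto command-line tool is on the system path, and `project_dir` (a hypothetical argument) points at a fresh clone of your repository. The function name is also made up for illustration.

```r
# Wrapped in a function so nothing runs until you call it on a fresh clone.
# `project_dir` is a hypothetical path to the cloned repository.
check_reproducibility <- function(project_dir) {
  # Install the package versions recorded in renv.lock
  renv::restore(project = project_dir, prompt = FALSE)

  # Render every Quarto document in the project with the Quarto CLI;
  # system2() returns the command's exit status (0 means success)
  exit_status <- system2("quarto", args = c("render", shQuote(project_dir)))

  exit_status == 0
}
```

Calling `check_reproducibility("path/to/fresh/clone")` returns `TRUE` only if rendering finishes without error. It does not run standalone R scripts, so those would still need to be checked separately.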
Repo organization
The following folders and files should be in your project repository:
- /data/*: Your dataset
  - /data/*.csv: Your dataset in CSV format
  - /data/README.md: Metadata about your dataset including information on provenance, codebook, etc.¹
- index.qmd: Your project report
- proposal.qmd: Your project proposal
- presentation.qmd: Your project presentation
- _quarto.yml: Setup file for project website
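A quick way to confirm the repository contains these files is sketched below. The helper name `find_missing_files()` is hypothetical, and the check only looks for file presence, not for whether the contents are complete.

```r
# Required files, taken from the list above
required_files <- c(
  file.path("data", "README.md"),
  "index.qmd",
  "proposal.qmd",
  "presentation.qmd",
  "_quarto.yml"
)

# Hypothetical helper: return the required files that are absent from
# `repo_dir`, plus a note if the data folder holds no CSV file
find_missing_files <- function(repo_dir) {
  missing <- required_files[!file.exists(file.path(repo_dir, required_files))]
  csv_files <- list.files(file.path(repo_dir, "data"), pattern = "\\.csv$")
  if (length(csv_files) == 0) {
    missing <- c(missing, file.path("data", "*.csv"))
  }
  missing
}
```

Running `find_missing_files(".")` from the project root returns an empty character vector when everything on the list is present.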
Overall grading
| Component | Points |
|---|---|
| Proposal | 10 pts |
| Presentation (total) | 35 pts |
| Presentation: Instructor | 30 pts |
| Presentation: Peers | 5 pts |
| Report | 35 pts |
| Reproducibility and organization | 5 pts |
| Code style | 10 pts |
| Between team peer evaluation | 5 pts |
| Total | 100 pts |
Evaluation criteria
Proposal
| Category | Less developed projects | Typical projects |
|---|---|---|
| Dataset | Dataset is missing from the […]. Dataset lacks a codebook. | Dataset is in the […]. Codebook for the dataset is included as the […]. |
| Write-up | The write-up is missing one or more required components. | All required components are included in the write-up. |
| Workflow | Peer review issues are left open or do not have associated commits which respond to the feedback. | Peer review issues are closed via a commit message. |
| Teamwork | One or more team members do not have commits in the repo. | All team members contribute to the repo via commits. |
Presentation
Teaching team
| Category | Less developed projects | Typical projects | More developed projects |
|---|---|---|---|
| Time management | Only some members speak during the presentation. Team does not manage time wisely (e.g. runs out of time, finishes early without adequately presenting their project). | All members speak during the presentation. Team does not exceed the five minute limit. | Team maximally uses their five minutes. Clearly communicates their objectives and outcomes from the project. |
| Professionalism | Presentation is slapped together or haphazard. Seems like independent pieces of work patched together. | Presentation appears to be rehearsed. There is cohesion to the presentation. | All elements of typical projects + everyone says something meaningful about the project. |
| Slides | Slides contain excessive text and/or content. Team relies too heavily on slides for their presentation. | Slides are well-organized. Slides are used as a tool to assist the oral presentation. | All elements of typical projects + graphics and tables follow best-practices (e.g. all text is legible, appropriate use of color and legends). Slides are not crammed full of text. |
| Creativity/originality | Project meets the minimum requirements but not much else. Project is incomplete or does not meet the team’s objectives. | Project appears carefully thought out. Time and effort seem to have gone into the planning and implementation of the project. | All elements of typical projects + project goes above and beyond the minimum requirements. |
| Content | Questions are not clearly stated. Questions are unanswered or not supported by the data visualizations. Data visualizations are poor quality and do not adhere to practices as taught in class. | Questions are clearly articulated. Questions are answered using supporting data visualizations. Data visualizations follow good practices. Conclusions made based on the visualizations are justified. | All elements of typical projects + data visualizations are of exceptional quality. Conclusions are justified and limitations are carefully considered and articulated. |
Peers
- Content: Are the questions clearly articulated and is the data being used relevant?
- Content: Did the team use appropriate visualizations and did they interpret them accurately?
- Creativity and critical thought: Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
- Slides: Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.?
- Professionalism: How well did the team present? Does the presentation appear to be well practiced? Are they reading off of a script? Did everyone get a chance to say something meaningful about the project?
Report
| Category | Less developed projects | Typical projects | More developed projects |
|---|---|---|---|
| Introduction | Explanation of the question and dataset is unclear or missing. Fails to describe relevant variables. | Provides a clear explanation of the question and the dataset used to answer the question, including a description of all relevant variables in the dataset. | All expectations of typical projects + clearly describes why the question is important and what is at stake in the results of the analysis. Even if the reader doesn’t know much about the subject, they know why they care about the results of your analysis. |
| Q1: Justification of approach | The chosen analysis approach is inappropriate. Visualizations are insufficiently explained and justified. | The chosen analysis approach and visualizations are clearly explained and justified. | All elements of typical projects + shows careful consideration for the most effective chart designs. Goes beyond single layer simplistic charts where appropriate to effectively leverage the grammar of graphics for designing complex statistical charts. |
| Q1: Code | Code is broken or does not work correctly. Code is hard to read for a human being and lacks stylistic consistency. | Code is functional, easy to read, and properly formatted. | All elements of typical projects + code is optimized using best practices and properly documented. |
| Q1: Visualization | Visualizations are inappropriate, hard to read, or lack appropriate labeling. | The visualizations are appropriate, follow best practices as taught in class, are easy to read, and properly labeled. | All elements of typical projects + employ custom visual designs and/or theming. Visualizations are distinctive to the project/group. |
| Q1: Discussion | Discussion of results is underdeveloped. Lacks a substantial connection to the visualizations. | Discussion of results is clear and correct, and it has some depth without being excessively long. | All elements of typical projects + identifies clear insights derived from the visualizations. Analysis demonstrates that the team understands not just how to create charts but also how to interpret them effectively. |
| Q2: Justification of approach | The chosen analysis approach is inappropriate. Visualizations are insufficiently explained and justified. | The chosen analysis approach and visualizations are clearly explained and justified. | All elements of typical projects + shows careful consideration for the most effective chart designs. Goes beyond single layer simplistic charts where appropriate to effectively leverage the grammar of graphics for designing complex statistical charts. |
| Q2: Code | Code is broken or does not work correctly. Code is hard to read for a human being and lacks stylistic consistency. | Code is functional, easy to read, and properly formatted. | All elements of typical projects + code is optimized using best practices and properly documented. |
| Q2: Visualization | Visualizations are inappropriate, hard to read, or lack appropriate labeling. | The visualizations are appropriate, follow best practices as taught in class, are easy to read, and properly labeled. | All elements of typical projects + employ custom visual designs and/or theming. Visualizations are distinctive to the project/group. |
| Q2: Discussion | Discussion of results is underdeveloped. Lacks a substantial connection to the visualizations. | Discussion of results is clear and correct, and it has some depth without being excessively long. | All elements of typical projects + identifies clear insights derived from the visualizations. Analysis demonstrates that the team understands not just how to create charts but also how to interpret them effectively. |
Reproducibility and organization
| Category | Less developed projects | Typical projects |
|---|---|---|
| Reproducibility (code) | Required files are missing. Quarto files do not render successfully (except when a package needs to be installed). | All required files are provided. Project files (e.g. Quarto, Shiny apps, R scripts) render without issues and reproduce the necessary outputs. |
| Reproducibility (packages) | renv.lock file does not include all required packages. External users have to manually install packages in order to get code to evaluate. | renv.lock includes all required packages. Manual package installation is not required to render any code in the repo (e.g. Quarto documents, R scripts). |
| Data documentation | Codebook is missing. No local copies of data files. | All datasets are stored in a data folder, a codebook is provided, and a local copy of the data file is used in the code where needed. |
| File organization/readability | Documents lack a clear structure. There are extraneous materials in the repo and/or files are not clearly organized. | Documents (Quarto files and R scripts) are well structured and easy to follow. No extraneous materials. |
Code style
| Category | Less developed projects | Typical projects |
|---|---|---|
| Naming conventions | Variable names, functions, and data objects use inconsistent or unclear naming (e.g., df1, temp). | Variable names, functions, and data objects use clear, descriptive naming that reflects their purpose (e.g., sales_by_region, salesByRegion, etc.) |
| Formatting and spacing | Code lacks consistent indentation, spacing around operators, or appropriate line breaks. Multiple statements appear on single lines. | Code follows {tidyverse} formatting conventions with consistent indentation, spaces around operators, line breaks for long pipelines. |
| Use of base pipe | Pipes are absent, misused, or create overly long/deeply nested workflows that are difficult to understand. | Pipes (\|>) create clear, linear data transformation workflows. Pipelines are reasonably sized and intermediate steps are understandable. |
| Code cleanliness | Code contains unused variables, commented-out blocks, duplicated code, or unnecessary intermediate objects. Repeated logic is copy-pasted rather than factored out. | No redundant or dead code present. Where applicable, repeated logic is factored into short, focused, clearly-named functions rather than copy-pasted. |
| Comments | Comments restate what code does rather than explaining why. Comments are excessive, outdated, or missing where needed. | Comments explain why a step exists rather than what it does. Comments are used sparingly and are up to date with the code. |
| Avoidance of code smells | Code contains magic numbers, overly complex expressions, hard-coded paths, or deeply nested conditionals that reduce readability and flexibility. | Code avoids magic numbers, overly complex expressions, hard-coded paths, and deeply nested conditionals. |
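As an illustration of several of these criteria at once (descriptive names, the base pipe, named constants instead of magic numbers, and comments that explain why), consider the short sketch below. It uses the built-in `mtcars` data and base R functions only to stay self-contained; the same ideas apply to {tidyverse} pipelines, and all object names are made up for the example.

```r
# Named constants instead of magic numbers buried in the pipeline
min_efficient_mpg <- 25
pounds_per_weight_unit <- 1000  # `wt` is recorded in thousands of pounds

# `mtcars` is a built-in dataset used only for illustration
efficient_cars <- mtcars |>
  subset(mpg >= min_efficient_mpg, select = c(mpg, cyl, wt)) |>
  transform(weight_lb = wt * pounds_per_weight_unit)
```

Compare this with a version that assigns to `df1`, filters on a bare `25`, and multiplies by an unexplained `1000`: the result is identical, but a reader has to guess at the intent.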
Between team peer evaluation
Peer reviews will be graded on the extent to which they comprehensively and constructively address the components of the reviewee’s team’s proposal.
- 0 points: No peer reviews
- 1 point: Only one peer review issue open, feedback provided is not constructive or actionable
- 2 points: Both peer review issues open, feedback provided is not constructive or actionable
- 3 points: Both peer review issues open, feedback provided is not sufficiently thorough
- 4 points: Both peer review issues open, one of the reviews is not sufficiently thorough
- 5 points: Both peer review issues open, both reviews are constructive, actionable, and sufficiently thorough
Late work policy
No late work is accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.
Acknowledgments
- Project instructions draw in part from STA 313: Advanced Data Visualization and INFO 2951: Introduction to Data Science with R
Footnotes
1. It is ok for you to repeat some information from the TidyTuesday repository, but make sure to appropriately attribute it here.