Welcome to INFO 3312/5312

Lecture 1

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

January 23, 2024

Agenda

  • Course details
  • Introductions
  • Course components
  • Why we visualize

Course details

Timetable

  • Lectures
    • Tuesdays 1:25-2:40pm
    • Thursdays 1:25-2:40pm
  • Discussions
    • 201: Fridays 10:10-11:00am
    • 202: Fridays 11:15-12:05pm
    • 203: Fridays 9:05-9:55am (grad students only)

Students on the waitlist

  • INFO 3312/5312 enrollment is restricted to IS/ISST majors and IS MPS students
  • If you are not an IS/ISST major (or are still in the process of affiliating), join the waitlist through Student Center
  • PINs distributed on a rolling basis
  • We currently have 2 seats available and 16 students on the waitlist

Themes: what, why, and how

  • What: the communication (e.g., plot, table, report)
    • Specific types of visualizations for a particular purpose (e.g., maps for spatial data, Sankey diagrams for proportions, etc.)
    • Tooling to produce them (e.g., specific R packages)
  • How: the process
    • Start with a design (sketch + pseudo code)
    • Pre-process data (e.g., wrangle, reshape, join, etc.)
    • Map data to aesthetics
    • Make visual encoding decisions (e.g., address accessibility concerns)
    • Post-process for visual appeal and annotation
  • Why: the theory
    • Tie together “how” and “what” through the grammar of graphics
    • Extend to underlying theory of cognition and information processing

Introductions

Meet the instructor

Dr. Benjamin Soltoff

Lecturer in Information Science

Gates Hall 216

Headshot of Dr. Benjamin Soltoff

Meet the course team

Grad TAs

  • Clayton S
  • Muhan Z
  • Su H

Undergrad TAs

  • Aarya T
  • Jessica K
  • Steven C
  • Sahiba D

Meet each other!

  • Form a small group (3-4 individuals) with people sitting around you

  • First, introduce yourselves to each other:

    • Your name - Prof/Dr. Soltoff
    • Your major - Political science
    • The last movie you saw - To All The Boys I’ve Loved Before
    • What you hope to get out of this class - A paycheck
  • Start with the bad graphs – Share your examples of “bad” graphs and why you think they’re bad.

  • Then, share your good graphs – Same deal, share your examples of “good” graphs and why you think they’re good.

  • Finally, choose the one plot from your group that you think is most striking, either because it’s bad or because it’s good, and have one team member share the graph on this discussion post.

Course components

Homepage

https://info3312.infosci.cornell.edu/

  • All course materials
  • Links to Canvas, GitHub, RStudio Workbench, etc.
  • Let’s take a tour!

Course toolkit

All linked from the course website:

Important

Make sure you can access RStudio (Posit) Workbench before lab on Friday.

Activities: Prepare, Participate, Practice, Perform

  • Prepare: Introduce new content and prepare for lectures by completing the readings

  • Participate: Attend and actively participate in lectures, labs, office hours, and team meetings

  • Practice: Practice applying visualization techniques with application exercises during lecture, graded for completion

  • Perform: Put together what you’ve learned to analyze real-world data

    • Homework assignments x 6-ish (individual)
    • Team projects (2)

Teams

  • Team assignments
    • Assigned by course staff
    • Peer evaluation after completion
  • Expectations and roles
    • Everyone is expected to contribute equal effort
    • Everyone is expected to understand all code turned in
    • Individual contribution evaluated by peer evaluation, commits, etc.

Grading

Category Percentage
Homework 40%
Project 1 20%
Project 2 30%
Application exercises 10%

See course syllabus for how the final letter grade will be determined.

INFO 5312

Additional expectations:

  • INFO 5312 homework will at times be graded against a more stringent rubric
  • INFO 5312 students will be grouped together for all projects

15 minute rule

Support

  • Attend office hours
  • Ask and answer questions on the discussion forum
  • Reserve email for questions on personal matters and/or grades
  • Read the course support page

Announcements

  • Posted on Canvas (Announcements tool); be sure to check regularly (or forward announcements to your email)
  • I’ll assume that you’ve read an announcement by the next “business” day

Diversity + inclusion

  • I want you to feel like you belong in this class and are respected
  • We are committed to full inclusion in education for all persons
  • If you feel that we have failed to meet these goals, please either let us know or report it, and we will address the issue

Accessibility

I want this course to be accessible to students of all abilities. Please feel free to let me know if there are circumstances affecting your ability to participate in class.

Course policies

 

As long as you meet the prereqs

Prerequisites

  • INFO 2950 or INFO 5001
  • Prior experience with R and Git is required

Ideally you took INFO 2950 or 5001 with me.

If not, you need a firm understanding of R (including tidyverse and tidymodels) and Git workflows.

Late work, waivers, regrades policy

  • We have policies!
  • Read about them on the course syllabus and refer back to them when you need them

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively.

  • Homework must be completed individually. You may not directly share answers or code with others; however, you are welcome to discuss the problems in general and ask for advice.

Sharing / reusing code policy

  • We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source.

  • All code must be written by you, the human being.

Generative AI

Academic integrity

  1. A student shall in no way misrepresent his or her work.
  2. A student shall in no way fraudulently or unfairly advance his or her academic position.
  3. A student shall refuse to be a party to another student’s failure to maintain academic integrity.
  4. A student shall not in any other manner violate the principle of academic integrity.

Most importantly!

Ask if you’re not sure if something violates a policy!

Why do we visualize?

  1. Discover patterns that may not be obvious from numerical summaries

Anscombe’s quartet

   set  x     y
1    I 10  8.04
2    I  8  6.95
3    I 13  7.58
4    I  9  8.81
5    I 11  8.33
6    I 14  9.96
7    I  6  7.24
8    I  4  4.26
9    I 12 10.84
10   I  7  4.82
11   I  5  5.68
12  II 10  9.14
13  II  8  8.14
14  II 13  8.74
15  II  9  8.77

Summary statistics for Anscombe’s quartet

# A tibble: 4 × 6
  set   mean_x mean_y  sd_x  sd_y     r
  <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 I          9   7.50  3.32  2.03 0.816
2 II         9   7.50  3.32  2.03 0.816
3 III        9   7.5   3.32  2.03 0.816
4 IV         9   7.50  3.32  2.03 0.817
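
The table above can be reproduced with a short script. This is a minimal sketch, assuming the data come from the `anscombe` data frame that ships with base R (wide format, columns `x1` to `x4` and `y1` to `y4`); the name `quartet_summary` is made up for illustration, and base R is used so the example runs without extra packages.

```r
# Anscombe's quartet ships with base R as a wide data frame:
# columns x1..x4 and y1..y4, 11 rows each.
data(anscombe)

# Labels for the four sets, matching the table above.
sets <- c("I", "II", "III", "IV")

# Compute the same five summaries for each set and stack the rows.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = sets[i],
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    r      = cor(x, y)
  )
}))

print(quartet_summary, digits = 3)
```

The four rows agree to about two decimal places, which is the point: these summaries alone cannot tell the sets apart.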

Scatterplots for Anscombe’s quartet
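
One way to draw these scatterplots is sketched here. It assumes the base R `anscombe` data frame and the ggplot2 package; the reshaping step and the name `quartet_long` are illustrative choices, not necessarily how the original slide was produced.

```r
library(ggplot2)

# Reshape the wide anscombe data frame into long form:
# one row per observation, with a column naming the set.
quartet_long <- data.frame(
  set = rep(c("I", "II", "III", "IV"), each = nrow(anscombe)),
  x   = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
  y   = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4)
)

# One panel per set, each with its points and a fitted regression line.
p <- ggplot(data = quartet_long, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(vars(set))

print(p)
```

The fitted lines are nearly identical across panels, while the point patterns are clearly different.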

Just show me the data!

head(my_data, 10)
# A tibble: 10 × 2
       x     y
   <dbl> <dbl>
 1  55.4  97.2
 2  51.5  96.0
 3  46.2  94.5
 4  42.8  91.4
 5  40.8  88.3
 6  38.7  84.9
 7  35.6  79.9
 8  33.1  77.6
 9  29.0  74.5
10  26.2  71.4
mean(my_data$x)
[1] 54.26327
mean(my_data$y)
[1] 47.83225
cor(my_data$x, my_data$y)
[1] -0.06447185
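
Plotting the same data takes one more step. The sketch below assumes that `my_data` is the `dino` subset of `datasaurus_dozen` from the datasauRus package; the values printed above are consistent with that dataset, but the slides do not name the source. The helper `plot_xy()` is hypothetical and works on any data frame with numeric `x` and `y` columns.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of any data frame with columns x and y.
plot_xy <- function(df) {
  ggplot(data = df, mapping = aes(x = x, y = y)) +
    geom_point()
}

# ASSUMPTION: my_data is the "dino" subset of datasaurus_dozen.
# The block is skipped if the datasauRus package is not installed.
if (requireNamespace("datasauRus", quietly = TRUE)) {
  dozen <- datasauRus::datasaurus_dozen
  my_data <- dozen[dozen$dataset == "dino", c("x", "y")]
  print(plot_xy(my_data))
}
```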

oh no

Raw data is not enough

Why do we visualize?

  1. Discover patterns that may not be obvious from numerical summaries

  2. Convey information in a way that is otherwise difficult/impossible to convey

Impact of Omicron variant on unvaccinated populations

National risk index

Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.”

John Tukey

  • Data visualization is the creation and study of the visual representation of data
  • Many tools for visualizing data – R is one of them
  • Many approaches/systems within R for making data visualizations – ggplot2 is one of them, and that’s what we’re going to use

A fuzzy monster in a beret and scarf, critiquing their own column graph on a canvas in front of them while other assistant monsters (also in berets) carry over boxes full of elements that can be used to customize a graph (like themes and geometric shapes). In the background is a wall with framed data visualizations. Stylized text reads 'ggplot2: build a data masterpiece.' Learn more about ggplot2.

ggplot2 \(\in\) tidyverse

  • ggplot2 is tidyverse’s data visualization package
  • gg in “ggplot2” stands for Grammar of Graphics
  • Inspired by the book The Grammar of Graphics by Leland Wilkinson

Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

Hello ggplot2!

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • Structure of the code for plots can be summarized as
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
  • The ggplot2 package comes with the tidyverse
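
A concrete instance of the template above, as a minimal sketch. It assumes the `mpg` dataset bundled with ggplot2, with `displ` (engine displacement) and `hwy` (highway miles per gallon) standing in for the x- and y-variables; the choice of dataset, geom, and labels is illustrative only.

```r
library(ggplot2)

# data    -> mpg (bundled with ggplot2)
# mapping -> displ on the x axis, hwy on the y axis
# geom    -> points, giving a scatterplot
# options -> axis labels added as one more layer
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

print(p)
```

Each `+` adds a layer or option to the plot object, so swapping `geom_point()` for another geom changes the plot type without touching the data or the mapping.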

This week’s tasks

Film recommendation