INFO 370: Introduction to Data Science

Instructors: Benjamin Xie, Gregory L. Nelson
Fall 2017. Section A


In this class you'll learn how to think like a data scientist. You'll learn what data scientists do and how they do it. You'll also learn about the contexts in which a data scientist exists. By the end of the course, you should be able to enter any organization and begin to understand the social and technical contexts in which you help make decisions. If you want to be a great data scientist, this is the course for you.

Learning objectives for the course:
  1. Comprehend the practice of data science as an interactive, iterative process.
  2. Critique the quality of data, models, and results within a decision context.
  3. Consider contextual and critical perspectives in data science.
  4. Gain familiarity with computational tools that support data scientists.
  5. Understand what data scientists do in various organizational and social contexts.

Prerequisites

You should have aspirations to be a data scientist or to work closely with them. Because we'll use data to inform decisions, you should also know:

  • How to use a scripting language (R, Python) to manipulate data
  • How to use the command line
  • How to use Git and GitHub
  • How to access a Web API

The prerequisite course, INFO 201 (the Technical Foundations of Informatics), should be suitable preparation for the above. Refer to the INFO 201 online book to refresh your knowledge of the course.

Office Hours

We are available to talk about jobs, careers, graduate school, research, class, taboos, and anything else. Benji's office hours this quarter are Monday and Tuesday 9-10:30 AM in MGH-015 (the door is locked, so just knock to get in). Greg's office hours this quarter are Wednesday 2-3:20 PM and Thursday 1-3 PM at the CSE atrium tables. Occasionally we need to schedule other things during these times. To guarantee we'll be around, write to us in advance to secure a time.

Devices in Class

We will use smartphones and laptops throughout the quarter to facilitate activities and project work in class. However, research and student feedback clearly show that using devices for non-class-related activities harms not only your own learning but other students' learning as well. Therefore, we allow device usage only during activities that require devices. At all other times, you should not be using your device. We'll help you remember this by announcing when to bring devices out and when to put them away.

Typical Week

  • Sunday: Do reading assignments (read readings, review and run scripts). Complete reflection survey.
  • Monday: Go to class and participate. After class, report struggles online.
  • Tuesday: Do reading assignments (read readings, review and run scripts). Complete reflection survey.
  • Wednesday: Go to class and participate. After class, report struggles online.
  • Wednesday-Saturday: Homework or group project work

Schedule

Week 0 — What is data science?
9/27 Lecture: Data science is a process
Assigned: Homework 1. Due Fri 10/6.
Week 1 — Decision Making in Data Science
10/2 Lecture: Decision-making and Probability
10/2 Lab: Automating the data science process with an R script
10/4 Lecture: Decision contexts in data science
Assigned: Homework 2. Due Fri 10/13 (updated from Tues 10/10).
Week 2 — Using probability models to support decisions
10/9 Lecture: Building Bayesian Models
10/9 Lab: Running Statistical Models
10/11 Lecture: Improving modeling decisions using Bayes' rule
Assigned: Homework 3. Due Tues 10/17.
Week 3 — Cleaning and Selecting Data
10/16 Lecture: Bayesian Inference in Action, Data cleaning process
10/16 Lab: Applying data cleaning process using Wrangler
10/18 Lecture: Finding and selecting data sources
Assigned: Project Milestone 1: Group formation & initial questions. Due Sun 10/22.
Week 4 — Collecting and Making Sense of Data
10/23 Lecture: Collecting data from the internet
  • Reading: Web scraping
Assigned: Project Milestone 2: Pilot Study. Due Sun 10/29.
10/23 Lab: Practicing web scraping
10/25 Lecture: Exploring data using visualizations and models
  • Reading: Exploring Paroles
Week 5 — Visualization; Predictive Models
10/30 Lecture: Visualization design
  • Reading: Visualizations: the good, the bad, and the ugly
Assigned: Project Milestone 3: Project Proposal. Due Sun 11/5.
10/30 Lab: Work on class project
11/1 Lecture: Using data and simulations to decide among predictive models
  • Reading: Expressing Optimization Goals
Week 6 — Finding and Comparing Models & Parameters
11/6 Lecture: Comparing models using visualization
  • Reading: Visualizing Models, Their Difference with Data (Residuals)
Assigned: Project Milestone 4: Proposal Review. Due Tues 11/7.
11/6 Lab: Finding models using computer simulations
11/8 Lecture: Finding models for decision-making using computer simulations
  • Reading: Example decision-making models
Assigned: Project Milestone 5: Proposal Revision. Due Sun 11/12.
Week 7 — Contrasting and Interpreting Models
11/13 Lecture: Parametric and "non-parametric" models; generalization
  • Reading: Fantastic models and where to find them (the internet)
Assigned: Project Milestones 7 & 8: Presentation & Artifact. Due Mon 12/4 (Presentation) and Fri 12/8 (Artifact).
11/13 Lab: Deeper model comparison and interpretation using visualization
11/15 Lecture: Model fit, overfitting and cross-validation
Week 8 — Models, Bias, and Social Impacts
11/20 Lecture: Models, bias, and social impacts
11/20 Lab: Work on class project
11/22 No class
Week 9 — Big Data and Opacity
11/27 Lecture: Scaling to "Big Data"
  • Reading: Excerpt from Fourth Paradigm; Business Articles on Big Data
11/27 Lab: More data, more problems?
11/29 Lecture: Where is your data from: Mediocristan or Extremistan? & the Problem of Induction
  • Reading: Excerpt from Antifragile and Simulations
Week 10 — Project Work and Reflections
12/4 Lecture: Presentations
12/4 Lab: Reflecting on class projects
12/6 Lecture: Presentations
Finals week
Homework 4 (Project and Course Reflection) Due 12/14.
No class or finals will be held this week.

Grading

There are 100 points you can earn in this class:

  • Activities (13 points, 0.5 points for each class or lab). Show up and engage to get credit.
  • Reading (17 points, 1 point each). Prove you read and understood the reading.
  • Individual Homework (30 points). Prove you understand important data science topics.
  • Project (40 points, team score). Reach several milestones related to your team data science project.

After rounding your points to the nearest whole number, we'll map your 100 points to a 4.0 scale using the table below.

≥ 97 → 4.0 92 → 3.5 87 → 3.0 82 → 2.5 77 → 2.0 72 → 1.5 67 → 0.9
96 → 3.9 91 → 3.4 86 → 2.9 81 → 2.4 76 → 1.9 71 → 1.4 66 → 0.8
95 → 3.8 90 → 3.3 85 → 2.8 80 → 2.3 75 → 1.8 70 → 1.2 65 → 0.7
94 → 3.7 89 → 3.2 84 → 2.7 79 → 2.2 74 → 1.7 69 → 1.1 ≤ 64 → 0.0
93 → 3.6 88 → 3.1 83 → 2.6 78 → 2.1 73 → 1.6 68 → 1.0
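
As a rough illustration of how the table above can be read, here is a minimal Python sketch of the lookup. The function name `points_to_grade` and the dictionary `GRADE_TABLE` are hypothetical names chosen for this example, and the function assumes you pass in a point total that has already been rounded to an integer. It simply copies the table values rather than using a formula, because the steps in the table are not perfectly even (for example, 70 maps to 1.2 while 71 maps to 1.4).

```python
# Values copied from the grading table above; keys are integer point totals.
# Totals of 97 or more and 64 or fewer are handled separately below.
GRADE_TABLE = {
    96: 3.9, 95: 3.8, 94: 3.7, 93: 3.6, 92: 3.5,
    91: 3.4, 90: 3.3, 89: 3.2, 88: 3.1, 87: 3.0,
    86: 2.9, 85: 2.8, 84: 2.7, 83: 2.6, 82: 2.5,
    81: 2.4, 80: 2.3, 79: 2.2, 78: 2.1, 77: 2.0,
    76: 1.9, 75: 1.8, 74: 1.7, 73: 1.6, 72: 1.5,
    71: 1.4, 70: 1.2, 69: 1.1, 68: 1.0, 67: 0.9,
    66: 0.8, 65: 0.7,
}


def points_to_grade(points: int) -> float:
    """Map an already-rounded point total (0-100) to the 4.0 scale."""
    if points >= 97:
        return 4.0
    if points <= 64:
        return 0.0
    return GRADE_TABLE[points]


if __name__ == "__main__":
    for total in (100, 90, 71, 70, 64):
        print(total, "->", points_to_grade(total))
```

A dictionary lookup is used here instead of arithmetic so that the sketch stays faithful to the table even where the table's increments are irregular.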

Late work receives no credit unless you can provide a note from a health care professional or provost documenting the reason for your absence. However, you can miss up to 3 activities without penalty and without documentation. This should be enough to allow for sickness, unavoidable travel, or other personal matters.

If you miss a reading quiz due to sickness, you can make up the quiz credit by writing a 250-500 word critique of the reading and submitting it to your Google Drive folder within a week of the quiz you missed. Title the Google Doc with the class number and "make up quiz". For example, use "2.3 make up quiz" for the make-up quiz for week 2, class 3 (the Wednesday lecture).

Activities

Each day in class we'll practice some skill. You'll get 0.5 points if you engage in and complete the activity. How to get credit will depend on the activity: sometimes being present will be enough, sometimes arriving to class on time will be enough, and sometimes you'll have to turn something in.

Reading

To access the readings, you will do the following:

  1. Click on the reading (a link to a Google Doc) on the course schedule.
  2. Copy the Google Doc to your personal INFO 370 folder (which we shared with you at the beginning of the course). Instructions on making a copy of a file in Google Drive.
  3. Read through the Google Doc. Highlight and comment on any parts that are confusing.
  4. Complete the questions marked "TODO".

You should complete your readings and reflection before the beginning of each lecture (twice a week). The Google Doc in your personal Drive folder is your submission (we are not using Canvas for readings). Each class, you'll come prepared to discuss the assigned reading.

The day that each reading is due, we'll do the following:

  • Share what you're confused about.
  • We clarify confusions.
  • We give you some questions to answer individually about the assigned reading (a "Reading Quiz").
  • You turn in your answer.
  • You discuss your answers with your neighbor.
  • We discuss the correct answers as a class.

You will receive 0.5 points for completing the reading and reflection before class (on the Google Doc). You will receive up to another 0.5 points for getting the in-class reading quiz correct. We will give partial credit for partially correct answers on the reading quiz, at our discretion. In total, you can receive up to 1 point per reading.
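
The per-reading arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `reading_points` is hypothetical, and representing partial quiz credit as a fraction between 0 and 1 is an assumption made for the example, since partial credit is actually awarded at the instructors' discretion.

```python
def reading_points(completed_before_class: bool, quiz_fraction: float) -> float:
    """Points for one reading: 0.5 for completion plus up to 0.5 for the quiz.

    quiz_fraction is an assumed stand-in for partial credit:
    0.0 means no quiz credit and 1.0 means full quiz credit.
    """
    if not 0.0 <= quiz_fraction <= 1.0:
        raise ValueError("quiz_fraction must be between 0 and 1")
    completion = 0.5 if completed_before_class else 0.0
    quiz = 0.5 * quiz_fraction
    return completion + quiz


if __name__ == "__main__":
    # Completed the reading on time and got half credit on the quiz.
    print(reading_points(True, 0.5))
```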

Individual Homework

There will be a few individual homework assignments which are separate from reading assignments and project milestones.

  1. Review of prerequisite knowledge (5 points). Out Wed 9/27. Due Fri 10/6
  2. Analyzing a Data Science Case Study (10 points). Out Wed 10/4. Due Fri 10/13 (updated from Tues 10/10).
  3. Probability with Bayesian Inference (10 points). Out Wed 10/11. Due Tues 10/17.
  4. Project and course reflection (5 points). Out Mon 12/11. Due Mon 12/11.

All homeworks are due by 11:59:00 PM PST on the specified date.

The goal of the individual homework assignments is to ensure you understand specific concepts that are critical to data science.

Project

The project is split across 8 milestones/assignments, each worth a different amount:

  1. Group formation and initial questions (2 points). Out Wed 10/18. Due Sun 10/22.
  2. Pilot study (3 points). Out Mon 10/23. Due Sun 10/29.
  3. Proposal (4 points). Out Mon 10/30. Due Sun 11/5.
  4. Proposal Review (2 points). Out Mon 11/6. Due Tues 11/7.
  5. Proposal Revision (3 points). Out Wed 11/8. Due Sun 11/12.
  6. Project check-in meeting (2 points). Out Wed 11/16. Due Tues 11/21.
  7. Presentation (10 points). Out Mon 11/13. Due Mon 12/4.
  8. Artifact (14 points). Out Mon 11/13. Due Fri 12/8.

All assignments except the Project check-in meeting are due by 11:59:00 PM PST on the specified date.

The goal of the project is for you to practice the process of data science to make or inform a decision, so you can experience the nuances of formulating a good question, setting up process, constraints, and plans in relation to a context. Note, however, that because the timeline for the project is so short, it won't give you a deep, longitudinal experience with software engineering, nor will it give you practice with massive complexity or scale. I believe these are experiences best left to practice in industry, as they're very difficult to replicate in the artificial setting of school.

Note that in the final three weeks of the course, we'll use class time primarily for team meeting time. The TA and I will be available as free consultants, helping you debug, do research on libraries, and offer advice on implementation. You are not required to come to class those days, but I highly recommend using the time for team coordination and help.

Resources

Links to important UW resources:

  • Disability Services Office: If you require disability accommodations for this course, work with the DSO.
  • SafeCampus: Resources and points of contact to promote a safer UW community.