
STAT121 / AC209 / E-109: CS109 Data Science

This document provides an outline and overview of the CS109 Data Science course at Harvard University. It discusses what data science is, why it is important, who teaches the course, and how the course is structured. The course covers key data science topics like data munging, exploratory analysis, prediction, and communication of results. It is taught using real-world datasets and Python tools. The course aims to take students through the full data science process on projects from data collection to modeling to visualization.


STAT121 / AC209 / E-109

CS109 Data Science


Hanspeter Pfister

Joe Blitzstein
Outline
What?
Why?
Who?
How?
Data Science
To gain insights into data through
computation, statistics, and visualization
A Data Scientist Is...
A data scientist is someone who knows more
statistics than a computer scientist and more
computer science than a statistician.
- Josh Blumenstock

Data Scientist = statistician + programmer + coach + storyteller + artist
- Shlomo Argamon
Nate Silver
Nate Silver won the election
Harvard Business Review
#natesilverfacts
http://techcrunch.com/2012/11/07/nate-silver-as-software/
Nate Silver on Pundits
Silver: Pundits are no
better than a coin toss.
Stewart: Do you foresee a
coin getting its own show?
The coin toss show?

http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver
Some Key Principles
use many data sources (the plural of anecdote is not data)

understand how the data were collected (sampling is essential)

weight the data thoughtfully (not all polls are equally good)

use statistical models (not just hacking around in Excel)

understand correlations (e.g., states that trend similarly)

think like a Bayesian, check like a frequentist (reconciliation)

have good communication skills (What does a 60% probability even mean? How can we visualize, validate, and understand the conclusions?)
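
The principle of weighting polls thoughtfully can be illustrated with a minimal sketch. The poll numbers below are made up, and weighting by sample size is only one simple assumption; the slides do not say which weights to use. The pooled standard error further assumes independent simple random samples from the same population.

```python
from math import sqrt

# Hypothetical polls: (share for candidate A, sample size). Made-up numbers.
polls = [(0.52, 1000), (0.48, 500), (0.50, 500)]


def unweighted_mean(polls):
    """Treat every poll as equally informative."""
    return sum(share for share, _ in polls) / len(polls)


def weighted_mean(polls):
    """Weight each poll by its sample size (one simple choice of weights)."""
    total_n = sum(n for _, n in polls)
    return sum(share * n for share, n in polls) / total_n


def standard_error(share, n):
    """Standard error of a proportion under simple random sampling."""
    return sqrt(share * (1 - share) / n)


pooled = weighted_mean(polls)
pooled_n = sum(n for _, n in polls)
print(f"unweighted: {unweighted_mean(polls):.3f}")
print(f"weighted:   {pooled:.3f} +/- {standard_error(pooled, pooled_n):.3f}")
```

In practice the weights might also reflect a pollster's track record or how recent the poll is; sample size is used here only because it is easy to state.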
Human Genome
Microarrays

Affymetrix Chip

[wikipedia]
Sequencing
Sequencing Cost
Genome Data
Genome Visualization

[Krzywinski 2009]

[Thorvaldsdóttir 2013]

[Meyer 2009]
Personalized Therapy
...10 years from now, each cancer
patient is going to want to get a genomic
analysis of their cancer and will expect
customized therapy based on that
information.
Director, The Cancer Genome Atlas
(TCGA), Time Magazine, 6/13/11
Netflix Prize
Some Challenges
massive data (500k users, 20k movies, 100m ratings)

curse of dimensionality (very high-dimensional problem)

missing data (99% of data missing; not missing at random)

extremely complicated set of factors that affect people's ratings of movies (actors, directors, genre, ...)

need to avoid overfitting (test data vs. training data)
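
Two of these challenges can be made concrete with a short sketch. The first lines check the "99% missing" figure from the counts on the slide. The rest illustrates overfitting on synthetic data (an assumption for the demo, not the Netflix data): a 1-nearest-neighbor model, chosen here only because it memorizes its training set, is compared with a least-squares line on training and held-out test data.

```python
import random

# Sparsity of the ratings matrix, from the counts on the slide.
users, movies, ratings = 500_000, 20_000, 100_000_000
observed = ratings / (users * movies)  # fraction of user-by-movie cells filled
print(f"observed: {observed:.0%}, missing: {1 - observed:.0%}")

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Synthetic data: y = 2x + Gaussian noise (an assumption for the demo)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


def fit_line(data):
    """Ordinary least squares for y = a + b*x, in closed form."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest_neighbor(data):
    """1-nearest-neighbor: memorizes the training set, noise included."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


def mse(model, data):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


train, test = make_data(200), make_data(200)
line, nn = fit_line(train), fit_nearest_neighbor(train)

for name, model in [("line", line), ("1-NN", nn)]:
    print(f"{name}: train MSE {mse(model, train):.2f}, test MSE {mse(model, test):.2f}")
```

The memorizing model has zero error on the data it was fitted to, which says nothing about how it does on new data; only the test error measures that.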


Netflix Prize Progress

http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html
Connectome
What is the connectivity of large brain circuits?

Ramón y Cajal, 1905


Connectome Workflow
Ultra-Thin Section EM
Automatic Reconstruction

2D Segmentation
Combine Multiple 2D Segmentations with Fusion
Globally Consistent 3D Segmentation

[Kaynig et al., CVPR 10]


[Vazquez et al., ICCV 2011]
2012
Data Science
Computer Science
Statistics
Domain Science
Drew Conway


Machine: Data Management, Data Mining, Machine Learning, Business Intelligence
Visualization
Human: Human Cognition, Perception, Story Telling, Decision Making


Theory
Statistics
Data Science

Inspired by Daniel Keim, Visual Analytics: Definition, Process, and Challenges
Outline

What?
Why?
Who?
How?
The Age of Big Data

BBC, 2013
Crime Prevention
Boston Globe,
Sunday, Aug 4, 2013
Big Data

2.5 exabytes of data daily (2012)
[IBMbigdata]

[Domo]
Between the dawn of civilization and
2003, we only created five exabytes of
information; now we're creating that
amount every two days.
Eric Schmidt, Google (and others)
http://onesecond.designly.com/
Smarter Devices

Michael Franklin, UC Berkeley


Commodity Computing

Michael Franklin, UC Berkeley


Ubiquitous Connectivity

Michael Franklin, UC Berkeley


travers808, Visual.ly
1 Zettabyte = 1 Billion Terabytes
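
The unit arithmetic behind these figures is easy to check. The sketch below assumes decimal (SI) byte units and reuses the 2.5 exabytes per day quoted earlier in the slides; the constant names are chosen for illustration.

```python
# Decimal (SI) byte units.
TERABYTE = 10**12
EXABYTE = 10**18
ZETTABYTE = 10**21

# 1 zettabyte expressed in terabytes: 10**21 / 10**12 = 10**9, one billion.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# At 2.5 exabytes per day, how many days does it take to create 5 exabytes?
daily_bytes = 2.5 * EXABYTE
days_for_five_exabytes = 5 * EXABYTE / daily_bytes

# Days needed to accumulate one zettabyte at the same rate.
days_per_zettabyte = ZETTABYTE / daily_bytes

print(terabytes_per_zettabyte, days_for_five_exabytes, days_per_zettabyte)
```

The two-day result agrees with the "five exabytes every two days" quote, so the two figures in the slides are consistent with each other.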
Jim Gray, Microsoft
By 2018, the US could face a shortage
of up to 190,000 workers with analytical
skills
McKinsey Global Institute

The sexy job in the next 10 years will be statisticians. Data Scientists?
Hal Varian, Prof. Emeritus UC Berkeley
Chief Economist, Google
Hal Varian Explains...
The ability to take data, to be able to
understand it, to process it, to
extract value from it, to visualize it, to
communicate it, that's going to be a hugely
important skill in the next decades, not
only at the professional level but even at
the educational level for elementary school
kids, for high school kids, for college kids.
Because now we really do have essentially
free and ubiquitous data.
- Hal Varian
Ask an interesting question.
What is the scientific goal?
What would you do if you had all the data?
What do you want to predict or estimate?

Get the data.
How were the data sampled?
Which data are relevant?
Are there privacy issues?

Explore the data.
Plot the data.
Are there anomalies?
Are there patterns?

Model the data.
Build a model.
Fit the model.
Validate the model.

Communicate and visualize the results.
What did we learn?
Do the results make sense?
Can we tell a story?
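
As one small illustration of the "explore the data" step, the sketch below summarizes a list of values and flags possible anomalies. The readings are made up, and the rule used (a robust z-score based on the median absolute deviation, with the conventional 0.6745 scaling and a 3.5 cutoff) is an assumption chosen for the example, not something the slides prescribe.

```python
from statistics import median


def summarize(values):
    """Small summary of a sample: a quick first look before plotting."""
    s = sorted(values)
    return {"n": len(s), "min": s[0], "median": median(s), "max": s[-1]}


def flag_anomalies(values, threshold=3.5):
    """Flag points whose robust z-score (based on the MAD) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical measurements with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 95]
print(summarize(readings))
print(flag_anomalies(readings))
```

A flagged point is a prompt to look closer, not a verdict; it may be a data-entry error or the most interesting observation in the set.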
Outline

What?
Why?
Who?
How?
Hanspeter Pfister

An Wang
My Background
Grew up in Switzerland

M.Sc. in EE from ETH Zurich

Ph.D. in CS from SUNY Stony Brook

11 years in industry (MERL)

At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)

Teach CS109 / CS171, taught CS175 / CS264 / CS205

Director of the Institute of Applied Computational Science (IACS)

Two daughters, Lilly (10) and Audrey (7)


Joe Blitzstein
Professor of the Practice in Statistics,
Co-Director of Undergraduate Studies in Statistics
twitter @stat110, SC 714
CS109 Staff
Chris Beaumont, Head TF
Johanna Beyer
Nicolas Bonneel
Alex D'Amour
Rahul Dave
Brandon Haynes
Ray Jones
Steffen Kirchhoff
Seymour Knowles-Barley
Alexander Lex
Deqing Sun
Tim Brenner, A/V


About You
Outline
What?
Why?
Who?
How?
CS109 Key Facets
data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;

data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis;

exploratory data analysis to generate hypotheses and intuition about the data;

prediction based on statistical tools such as regression, classification, and clustering; and

communication of results through visualization, stories, and interpretable summaries.
Act I: Predictions
Data Science Process
Data Types and Data Munging
Probability Review
Classification & Regression
Cross Validation, Clustering
Visualization & Story Telling
Act II: Recommendations
Bayesian Thinking & Computation
Monte Carlo Methods
Machine Learning Methods
MapReduce and Amazon's EC2
Databases (Margo Seltzer)
Act III: Network Analysis
Network Visualization
Network Sampling
Community Detection
Guest Lecture
Abstractions...
...and Tools
xkcd
Homework
Real-World focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, movie reviews, Yelp! data, etc.
Final Project
Pick a project of your choosing
Teams of up to 2 students
Process books, web sites, screencasts
IPython (exceptions possible)
Best project prizes!
cs109.org
Is this course for me ???
Prerequisites
Programming experience
C, C++, Java, Python, etc.
Basic statistical knowledge
STAT100, ideally STAT110
Willingness to learn new software & tools
This can be time consuming
You will need to read online documentation
Be Patient

Be Flexible

Be Constructive

http://davidzinger.wordpress.com/2007/05/page/2/
Next Steps
HW 0
Good test of your basic skills

Installation of several Python frameworks

Not graded, do it as soon as possible

Read syllabus carefully


Do readings
Post comments to Piazza using #readings
