Introduction to Data Science
Winter 2018
Feb 13 Erik Sudderth Computer Science Computer Vision and Machine Learning
Mar 13 Padhraic Smyth Computer Science Review: Past and Future of Data Science
• 1960’s
– Development of statistical computing and exploratory data analysis
• 1980’s
– Computing allowed statisticians to explore more flexible models
– Increase in use of “non-parametric” techniques and simulation methods
• 1990’s
– Development of “machine learning”: very flexible predictive modeling techniques developed in computer science
• Today
– Data science = computing + statistics + applications
60 Terabytes/day
20 Petabytes/year
• Technological drivers
– Sensors (cheap and ubiquitous, e.g., GPS on your phone)
– Data storage (we are all “data owners”)
– Computational power
– Data analysis methods (statistics and machine learning)
– Internet and wireless communication (can collect and share data)
Online advertising
Automated recommendations
Demand forecasting
Fraud detection
Churn prediction
Automated customer support
P. Smyth: Stats 5: Data Science Seminar, Winter 2018
How does Amazon forecast how many items to stock in its warehouses?
From dailymail.co.uk
From www.formaspace.com
From linkedin.com
Data Matrix:
Rows = genes
Columns = patients
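A minimal sketch of the data matrix layout described above, assuming a plain list-of-lists representation. The gene names, patient IDs, and values are made up for illustration and do not come from any real data set.

```python
# Hypothetical toy data: names and measurements are invented for illustration.
genes = ["gene_A", "gene_B", "gene_C"]   # one row per gene
patients = ["patient_1", "patient_2"]    # one column per patient

# data_matrix[i][j] holds the measurement for gene i in patient j.
data_matrix = [
    [2.1, 0.4],
    [1.3, 1.7],
    [0.0, 3.2],
]

def row_for_gene(name):
    """Return all patients' measurements for one gene (a row)."""
    return data_matrix[genes.index(name)]

def column_for_patient(name):
    """Return all genes' measurements for one patient (a column)."""
    j = patients.index(name)
    return [row[j] for row in data_matrix]

print(row_for_gene("gene_B"))
print(column_for_patient("patient_2"))
```

Reading along a row compares one gene across patients; reading down a column gives one patient's profile across genes.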
From www.spot-7.com
From http://cimss.ssec.wisc.edu/
Extracted Data → Transformed Data → Data for Modeling → Predictive Model → Predictions/Decisions
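The stages in this pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw records, the record format, the restocking threshold, and the choice of a least-squares line as the predictive model are all hypothetical stand-ins, not the method used in any system mentioned in these slides.

```python
# Hypothetical raw records in "day,units_sold" form; values are made up.
raw_records = ["1,10", "2,12", "3,", "4,16", "5,18"]

def extract(records):
    """Extracted data: split each raw string into its fields."""
    return [r.split(",") for r in records]

def transform(rows):
    """Transformed data: drop rows with a missing value, convert to numbers."""
    return [(int(day), float(units)) for day, units in rows if units != ""]

def fit_line(pairs):
    """Predictive model: least-squares line y = a + b * x (a stand-in model)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(model, x):
    """Prediction: evaluate the fitted line at a new input."""
    a, b = model
    return a + b * x

modeling_data = transform(extract(raw_records))    # data for modeling
model = fit_line(modeling_data)                    # predictive model
forecast = predict(model, 6)                       # prediction for day 6
decision = "restock" if forecast > 15 else "hold"  # hypothetical decision rule
print(forecast, decision)
```

Each function corresponds to one box in the pipeline, so a stage can be swapped out (for example, a different model) without touching the others.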
– Database systems
– Algorithms
– Software engineering
– Machine learning
– Probabilistic and statistical models
– Quantification of uncertainty
– Data visualization
– and more…
Statistics (Mathematical and Probabilistic Foundations)
Computing (Algorithms and Software)
Data Science
Applications (Analyzing Real Data)
Statistics
– Stats 120 ABC: Intro to Prob and Stats
– Stats 68: Exploratory Data Analysis
– Stats 110-112: Statistical Methods
– (Stats 140: Multivariate Statistics)
Computing
– ICS 46: Data Structures
– IFMTX 43: Intro to Software Engineering
– CS 122A: Intro to Data Management
– CS 161: Design and Analysis of Algorithms
– (CS 131: Parallel and Distributed Computing)
– CS 178: Machine Learning
– (CS 172: Neural Networks/Deep Learning)
Applications
[Figure: plot with y-axis from 0 to 800 and x-axis “Distance in Kilometers (Log-Scale, Base 10)” from -6 to 3]