
Stats5 Seminar: Introduction to Data Science

Winter 2018

Professor Padhraic Smyth


Departments of Computer Science and Statistics
University of California, Irvine
Outline

• Class organization and topics

• History of data analysis

• Data science and real-world applications

• The Data Science Major

• Limitations of what we can do with data

P. Smyth: Stats 5: Data Science Seminar, Winter 2018: 2


Class Organization

• Meet weekly for a 40-minute seminar with a 5-10 minute discussion

• 8 topics (with guest speakers), weeks 2 through 9


– You are encouraged to ask questions during and after the talks

• Intro and wrap-up talks in weeks 1 and 10

• Class Web site is at www.ics.uci.edu/~smyth/courses/stats5


– Slides and related materials will be posted during the quarter



Schedule of Lectures
Date     Speaker                          Department or Organization    Topic
Jan 9    Padhraic Smyth                   Computer Science              Introduction to Data Science
Jan 16   Padhraic Smyth                   Computer Science              Machine Learning
Jan 23   Michael Carey                    Computer Science              Databases and Data Management
Jan 30   Sameer Singh                     Computer Science              Statistical Natural Language Processing
Feb 6    Zhaoxia Yu                       Statistics                    An Introduction to Cluster Analysis
Feb 13   Erik Sudderth                    Computer Science              Computer Vision and Machine Learning
Feb 20   John Brock                       Cylance, Inc                  Data Science and CyberSecurity
Feb 27   Video Lecture (Kate Crawford)    Microsoft Research and NYU    Bias in Machine Learning
Mar 6    Matt Harding                     Economics                     Data Science in Economics and Finance
Mar 13   Padhraic Smyth                   Computer Science              Review: Past and Future of Data Science



Submission of Review Forms (Weeks 2 to 10)

• Submit Review forms for Lectures 2 through 10

• Review forms will be available online at the start of each class


– A few relatively short questions based on the lecture that day
– Needs to be submitted to EEE by noon for each lecture
– Bring your laptop or other device

• Requirements to pass the class


– Attend and submit a review form for at least 8 of the lectures in weeks 2 through 10
(you are allowed to miss one if you need to for some reason)

• No final exam: pass/fail based on attendance and review forms



Academic Integrity

• The review form you submit each week must be


(a) written by you, and
(b) written during the lecture that week

• Failure to adhere to this policy may result in failing the class

• It is the responsibility of each student to be familiar with UCI's Academic Integrity Policies and UCI's definitions and examples of academic misconduct. See the class Web site for additional info.



A BRIEF HISTORY OF DATA ANALYSIS AND COMPUTING



Computers and Data

The historical meaning of the term “computer”:


“one who computes” (i.e., a person)

Since the 1700s, statisticians have been using “computers” to analyze data – so it’s not a new idea


For example, Karl Pearson, one of the founders of statistics, directed a team of “computers” in his lab in London around the early 1900s

...but for many years, “computers” could only work on relatively small problems



Statistics and Modern Computing

• Post World War II


– Increasing use of computing to solve algorithmic aspects of statistical analyses

• 1960’s
– Development of statistical computing and exploratory data analysis

• 1980’s
– Computing allowed statisticians to explore more flexible models
– Increase in use of “non-parametric” techniques and simulation methods

• 1990’s
– Development of “machine learning” – very flexible predictive modeling techniques
developed in computer science

• Today
– Data science = computing + statistics + applications



1985: ~ $100k per gigabyte

2015: ~ $0.3 cents per gigabyte



From http://exploringbigdata.blogspot.com/

http://www.mapsandthecity.com/wp-content/uploads/2011/11/Schermafbeelding-2011-11-09-om-10.13
Modeling Human Behavior using Social Media
From Lichman and Smyth, ACM SIGKDD 2014



Geolocated Tweets around UC Irvine



From: healthitananalytics.com
Scientific Data: Large Hadron Collider at CERN

60 Terabytes/day
20 Petabytes/year

1 Terabyte = 10^12 bytes
1 Petabyte = 10^15 bytes



A Paradigm Shift in Data Analysis

• Technological drivers
– Sensors (cheap and ubiquitous, e.g., GPS on your phone)
– Data storage (we are all “data owners”)
– Computational power
– Data analysis methods (statistics and machine learning)
– Internet and wireless communication (can collect and share data)

• Convergence... tremendous demand for data analysis

– In business, in sciences, in medicine, in engineering, and more...

• In the past, this demand was met by statistics

– Does not scale up – there are not nearly enough statisticians
– Need more tools than just statistics... need databases, algorithms, machine learning, ...



DATA SCIENCE IN THE REAL WORLD



What is Data Science?
Data science involves the full lifecycle of data: from real-world unstructured data... to predictions and decisions

Data science is broader than just databases, statistics, ML, algorithms... but these are all critical components

Key aspects of data science include


– Domain knowledge and problem definition
– Data preparation/organization/management
– Understanding of uncertainty (statistics)
– Computing, algorithms, fitting models, machine learning
– Iterative exploration and experimentation
– Human judgement and interpretation



How is Data Science used in each of these Organizations?



Organizations Data Science Applications

– Online advertising
– Automated recommendations
– Demand forecasting
– Fraud detection
– Churn prediction
– Automated customer support



How does Facebook predict what content to show you?

Graphics from Lars Backstrom, ESWC 2011


Web Search: How do search engines rank search results?



How do ad companies decide what online ads to show you?

How does Amazon forecast how many items for its warehouses?

From dailymail.co.uk

From www.formaspace.com

From linkedin.com



How do autonomous cars recognize objects in image data?



How can we use wearable data to improve our health?

Images from community.fitbit.com



How can we make personalized recommendations in medicine?

Data Matrix:
Rows = genes
Columns = patients

From www.originlab.com


Astronomy: How can we process terabytes/day of telescope data?

Large Synoptic Survey Telescope (LSST)


15 Terabytes/day
100+ Petabytes in 10 years

From Raddick et al, Astronomy Education Review, 2009



Physics: What is required to search for new physics particles?

Large Hadron Collider:


700 Mbytes/second
60 Terabytes/day
20 Petabytes/year
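
The three rates above can be checked against each other with a little arithmetic. The sketch below assumes decimal units (1 Terabyte = 10^12 bytes, 1 Petabyte = 10^15 bytes) and, as a simplifying assumption not stated on the slide, that data is recorded continuously.

```python
# Back-of-the-envelope check of the data rates quoted on the slide.
# Assumptions: decimal units and continuous recording (not stated on the slide).
MB = 10**6
TB = 10**12
PB = 10**15

bytes_per_second = 700 * MB
seconds_per_day = 24 * 60 * 60

# 700 Mbytes/second converted to Terabytes/day
tb_per_day = bytes_per_second * seconds_per_day / TB

# The slide's rounded 60 Terabytes/day converted to Petabytes/year
pb_per_year = 60 * TB * 365 / PB

print(f"{tb_per_day:.2f} TB/day")
print(f"{pb_per_year:.1f} PB/year if recording every day")
```

The first figure comes out at roughly 60 Terabytes/day, and the second at roughly 22 Petabytes/year, which is in the same range as the 20 Petabytes/year on the slide if the detector is not recording every day of the year.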



How can we detect land changes in NASA satellite images?

From www.spot-7.com

From http://cimss.ssec.wisc.edu/


How can algorithms interpret and summarize sports data?

Graphics from www.stats.com/sportvu/sportvu.asp


Politics: How can we reliably predict events like elections?

“Nowcast” forecast: Downloaded on July 25th 2016, from http://projects.fivethirtyeight.com/2016-election-forecast/



Data Pipelines
Unstructured Data → Extracted Data → Transformed Data → Data for Modeling → Predictive Model → Predictions/Decisions
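
As a rough illustration of the stages in this pipeline, here is a minimal sketch in Python. Everything in it is hypothetical: the example messages, the labels, the function names (`extract`, `transform`, `fit`, `predict`), and the very simple word-overlap "model" are made up for illustration and are not taken from the lecture.

```python
import re
from collections import Counter

# Hypothetical unstructured data: free-text messages with hand-assigned labels.
raw_messages = [
    ("WIN a FREE prize now!!!", "spam"),
    ("Lunch at noon tomorrow?", "ham"),
    ("Free entry, win cash now", "spam"),
    ("Are we still meeting tomorrow", "ham"),
]

def extract(text):
    """Unstructured data -> extracted data: pull out lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def transform(tokens):
    """Extracted data -> transformed data: word counts."""
    return Counter(tokens)

def fit(examples):
    """Data for modeling -> predictive model: total word counts per label."""
    model = {}
    for counts, label in examples:
        model.setdefault(label, Counter()).update(counts)
    return model

def predict(model, counts):
    """Predictive model -> prediction: the label whose words overlap most."""
    def score(label):
        return sum(n * model[label][word] for word, n in counts.items())
    return max(sorted(model), key=score)

training_data = [(transform(extract(text)), label) for text, label in raw_messages]
model = fit(training_data)
prediction = predict(model, transform(extract("win a free phone")))
print(prediction)
```

Each function corresponds to one arrow in the pipeline, so a change to an early stage (for example, how words are extracted) flows through to the final prediction. A real system would use a proper statistical or machine learning model in place of the word-overlap score.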



Sculley et al., NIPS 2015 Conference



THE DATA SCIENCE MAJOR



All of the applications we discussed are built on ideas from…

– Database systems
– Algorithms
– Software engineering
– Machine learning
– Probabilistic and statistical models
– Quantification of uncertainty
– Data visualization
– and more…



Components of Data Science

Data Science combines:

– Statistics (Mathematical and Probabilistic Foundations)
– Computing (Algorithms and Software)
– Applications (Analyzing Real Data)



What Classes will you take in the DS Major?

Statistics

– Stats 120 ABC: Intro to Prob and Stats
– Stats 68: Exploratory Data Analysis
– Stats 110-112: Statistical Methods
– (Stats 140: Multivariate Statistics)

Computing

– ICS 46: Data Structures
– In4matx 43: Intro to Software Engineering
– CS 122A: Intro to Data Management
– CS 161: Design and Analysis of Algorithms
– CS 178: Machine Learning
– (CS 131: Parallel and Distributed Computing)
– (CS 172: Neural Networks/Deep Learning)

Applications

Stats 170AB: Data Science Capstone Project


INF 143: Information Visualization
(INF 131: Human Computer Interaction)
(CS 121: Information Retrieval)
(CS 122B: Project in Databases/Web Applications)
(Summer internships, e.g., junior year)

(Sample electives shown in parentheses)


Sample Course of Study in the Major

Years 1 and 2: foundational courses in computer science, mathematics, statistics, including statistical computing


Years 3 and 4: more emphasis and specialization in data science topics such as
machine learning, databases, visualization, advanced statistics

Year 3: sample program


Fall
– Stats 110, Statistical Methods for Data Analysis I
– CS 161, Design and Analysis of Algorithms
– In4matx 43, Introduction to Software Engineering
– GE IV/VIII

Winter
– Stats 111, Statistical Methods for Data Analysis II
– CS 178, Machine Learning and Data-Mining
– ICS 139W, Critical Writing on Information Technology
– GE III/VII

Spring
– Stats 112, Statistical Methods for Data Analysis III
– CS 122A, Introduction to Data Management
– In4matx 143, Information Visualization
– GE VI

Year 4: two-quarter capstone “data-intensive” project, + statistics and CS electives



Research at UC Irvine in Data Science
[Figure: Histogram of Distances between Tweets. Horizontal axis: Distance in Kilometers (Log-Scale, Base 10), from -6 to 3. Vertical axis ticks from 0 to 900.]
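
A histogram like this one could be computed along the following lines. This is only a sketch under stated assumptions: the slide does not say which pairs of tweets the distances are measured between, so the code assumes consecutive points in a list; the coordinates are made up; and the helper names `haversine_km` and `log_distance_histogram` are hypothetical.

```python
import math
from collections import Counter

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def log_distance_histogram(points):
    """Count consecutive-point distances by floor(log10(distance in km))."""
    bins = Counter()
    for (lat1, lon1), (lat2, lon2) in zip(points, points[1:]):
        d = haversine_km(lat1, lon1, lat2, lon2)
        if d > 0:  # log10 is undefined for a distance of zero
            bins[math.floor(math.log10(d))] += 1
    return dict(sorted(bins.items()))

# Made-up coordinates for illustration only (not data from the slide).
points = [
    (33.6405, -117.8443),
    (33.6406, -117.8443),
    (33.6500, -117.8400),
    (34.0522, -118.2437),
]
histogram = log_distance_histogram(points)
print(histogram)
```

Binning on a log scale is what lets a single plot cover distances from a few meters to thousands of kilometers: each bin here spans one power of ten.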



LIMITATIONS OF WHAT WE CAN LEARN FROM DATA

Kidney Cancer Death Rates by County in the US
Lowest Rates Highest Rates

From A. Gelman and D. Nolan, Oxford University Press, 2002
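
The slide shows only the two maps. One reading often given for this example, stated here as an assumption rather than as the lecture's own wording, is that counties with few residents produce very noisy rates, so they can appear among both the lowest and the highest rates even when the underlying risk is the same everywhere. The simulation below is a minimal sketch of that idea with a made-up rate and made-up county sizes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_RATE = 0.001  # hypothetical per-person rate, identical in every county

def simulate_rate(population):
    """Observed rate in one county when each person is an independent draw."""
    deaths = sum(random.random() < TRUE_RATE for _ in range(population))
    return deaths / population

# Hypothetical county sizes: 30 small counties and 30 large counties.
small_rates = [simulate_rate(500) for _ in range(30)]
large_rates = [simulate_rate(20_000) for _ in range(30)]

# Spread of the observed rates within each group of counties.
small_spread = statistics.pstdev(small_rates)
large_spread = statistics.pstdev(large_rates)

print(f"small counties: min={min(small_rates):.4f} max={max(small_rates):.4f}")
print(f"large counties: min={min(large_rates):.4f} max={max(large_rates):.4f}")
```

Although every simulated county has the same true rate, the observed rates in the small counties vary far more than those in the large ones, which is a limitation of the data rather than a real difference between counties.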
From Tatem et al., Nature 2004.
(see also response letters at http://faculty.washington.edu/kenrice/natureletter.pdf)



How Much Climate Data Do We Actually Have?

Image from http://cimss.ssec.wisc.edu/

Image from ipcc.ch


Schedule of Lectures
Date     Speaker                          Department or Organization    Topic
Jan 9    Padhraic Smyth                   Computer Science              Introduction to Data Science
Jan 16   Padhraic Smyth                   Computer Science              Classification Algorithms in Machine Learning
Jan 23   Michael Carey                    Computer Science              Databases and Data Management
Jan 30   Sameer Singh                     Computer Science              Statistical Natural Language Processing
Feb 6    Zhaoxia Yu                       Statistics                    An Introduction to Cluster Analysis
Feb 13   Erik Sudderth                    Computer Science              Computer Vision and Machine Learning
Feb 20   John Brock                       Cylance, Inc                  Data Science and CyberSecurity
Feb 27   Video Lecture (Kate Crawford)    Microsoft Research and NYU    Bias in Machine Learning
Mar 6    Matt Harding                     Economics                     Data Science in Economics and Finance
Mar 13   Padhraic Smyth                   Computer Science              Review: Past and Future of Data Science
