1c. INTRODUCTION-Data-Science-basic
Except where otherwise noted, this work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Motivation for the Course:
Data Is Driving Everything
The Key Question in Data Science:
How Do We Explain & Predict the World?
Much of science and engineering derives from physics, where we have rich predictive models
• Newton’s laws, the theory of relativity, optics, how materials react under stress, etc.
• The basis of prediction tends to be simulation
Of course, in the real world we often want to combine models and data!
Outline of Topics
• Checkpoint exercises
• Practice notebook
Wordcloud for Data Science
Data Science is Interdisciplinary
(CS + STAT + MATH) ∩ (SCIENCE | ECONOMICS | SOCIOLOGY | BUSINESS | LAW | …)
Data Science Versus Data Analytics and Data Engineering
This course will initially focus on the technical and data manipulation aspects of Data Science, which we will call “Data Analytics”
• Some people also call this data engineering
Data Science is much broader, and includes domain expertise, communicating the results to users via storytelling and visualization, and policy and legal implications.
• We will get to the broader social context later, e.g. Data Ethics: privacy, fairness, accountability, transparency, …
We will give examples throughout the course of how data science is used in different domains
Data Analytics
Data and Models
[Figure: pipeline from raw data to a model, each stage shown with its description]
• Raw Data: observations sampled from the world
• Structured, Integrated Data: machine interpretable data
• Extracted features: measurable characteristics
• Descriptive or Inferential Model: of the population as a whole
Open and Closed Worlds,
Observations vs the Universe
Databases often assume a closed world: we know everything…
The company’s employees, their exact salaries, every product, …
Data science is in an open world: we can’t see the whole population
• We are given samples or incomplete observations
• We want to predict the future, or characterize the entire population
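The open-world setting can be illustrated with a minimal Python sketch: we only see a random sample, and use it to estimate a characteristic of the whole population. The synthetic population, its size, the sample size, and the random seed are all assumptions made for illustration; they are not course data.

```python
import random
import statistics

def estimate_mean(sample):
    """Return the sample mean and its standard error."""
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, std_err

# Hypothetical population: in an open world we would never see all of it.
rng = random.Random(0)
population = [rng.gauss(50_000, 10_000) for _ in range(100_000)]

# What we actually observe: a small random sample.
sample = rng.sample(population, 500)
sample_mean, std_err = estimate_mean(sample)

print(f"sample mean = {sample_mean:.0f} (standard error {std_err:.0f})")
print(f"true mean   = {statistics.mean(population):.0f} (normally unknown)")
```

In a closed-world database we would simply compute the true mean; here the standard error expresses how far the estimate may plausibly be from it.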
Raw Data vs Structured Data
[Figure: raw data (images, genes, text) passes through data + feature extraction and wrangling. The text example reads: “Do not be like the cat who wanted a fish but was afraid to get his paws wet.” William Shakespeare]
Goal: raw data to structured data
• Fields, entities, objects, machine learning features
• May be very regular or semi-structured
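As a rough sketch of going from raw data to structured data, the Python snippet below turns a raw string (the quotation from the slide) into a small record of fields that could serve as machine learning features. The function name `extract_features`, the chosen fields, and the vocabulary are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Hypothetical vocabulary: the words we choose to count as features.
VOCABULARY = ("cat", "fish", "dog")

def extract_features(text, vocabulary=VOCABULARY):
    """Turn a raw string into a structured record (a dict of fields)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "n_tokens": len(tokens),
        "n_distinct": len(counts),
        "counts": {word: counts[word] for word in vocabulary},
    }

raw = ("Do not be like the cat who wanted a fish "
       "but was afraid to get his paws wet.")
record = extract_features(raw)
print(record)
```

The output is very regular (every record has the same fields), whereas the raw string had no explicit structure at all.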
Linked Data: Find Patterns in Connectivity
(Clusters, Paths, …)
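One simple pattern in connectivity is the shortest path between two nodes. The sketch below uses breadth-first search over an adjacency list; the graph and the node names are made up for illustration.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical "who knows whom" graph as an adjacency list.
graph = {
    "ann": ["bob", "cara"],
    "bob": ["ann", "dev"],
    "cara": ["ann"],
    "dev": ["bob", "eli"],
    "eli": ["dev"],
    "zed": [],
}

print(shortest_path(graph, "ann", "eli"))
print(shortest_path(graph, "ann", "zed"))
```

Nodes that cannot reach each other, such as "ann" and "zed" here, belong to different clusters (connected components).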
Data with Complex Semantics:
Knowledge Graphs
Classes, subclasses, instances, and properties
Dynamic, High-Velocity Data: Track over Time,
Forecast the Future
Tabular (Relational) Data
and Joins / Lookups (e.g. to Web Services)
New York Taxi Data
[Figure: New York Taxi Data joined with Reverse Geocode Data and Street View]
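A join against a lookup source can be sketched in plain Python as follows. The trip records, their field names, and the coordinates are illustrative and do not follow the real New York taxi schema; `reverse_geocode` is a hypothetical stand-in for a reverse-geocoding web service, backed here by a local table.

```python
# Hypothetical trip records (not the real New York taxi schema).
trips = [
    {"trip_id": 1, "pickup": (40.75, -73.99), "fare": 12.5},
    {"trip_id": 2, "pickup": (40.64, -73.78), "fare": 52.0},
    {"trip_id": 3, "pickup": (0.0, 0.0), "fare": 7.0},
]

# Stand-in for a reverse-geocoding service: a local lookup table keyed by
# coordinates rounded to two decimal places.
GEOCODE_TABLE = {
    (40.75, -73.99): "Midtown",
    (40.64, -73.78): "JFK Airport",
}

def reverse_geocode(lat, lon):
    """Return a place name for the coordinates, or None if unknown."""
    return GEOCODE_TABLE.get((round(lat, 2), round(lon, 2)))

def join_with_places(rows):
    """Left join: keep every trip and add a 'place' field (None on a miss)."""
    return [{**row, "place": reverse_geocode(*row["pickup"])} for row in rows]

for row in join_with_places(trips):
    print(row["trip_id"], row["place"], row["fare"])
```

Keeping trips whose lookup fails (a left join) makes missing data visible instead of silently dropping rows.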
What Makes Data “Big” Data?
A gigabyte? A petabyte? A yottabyte?
A million records? A billion records? A trillion records?
What Does Data Analytics Involve?
The Goal of Data Analytics:
From Data to “Knowledge” or Action
Pattern detection: Raw data ⇒ patterns ⇒ partial understanding
• “Show me sales by region by product category”
• “Show me clusters of documents by concept”
• “Data cubes” (sales by region by quarter by type of product)
• Typically, descriptive statistics
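Queries such as "sales by region by product category" amount to grouping records by some dimensions and summing a measure. The sketch below shows the idea in plain Python; the sales records and the helper name `rollup` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"region": "East", "quarter": "Q1", "category": "toys", "amount": 100},
    {"region": "East", "quarter": "Q2", "category": "toys", "amount": 150},
    {"region": "East", "quarter": "Q1", "category": "books", "amount": 80},
    {"region": "West", "quarter": "Q1", "category": "toys", "amount": 60},
    {"region": "West", "quarter": "Q2", "category": "books", "amount": 40},
]

def rollup(rows, dims, measure="amount"):
    """Sum `measure` for every combination of values of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

by_region_category = rollup(sales, ("region", "category"))
by_region = rollup(sales, ("region",))

print(by_region_category)
print(by_region)
```

A data cube can be thought of as the collection of such roll-ups over every subset of the dimensions (region, quarter, category).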
What Does Big Data Analytics Involve?
Example: High-Throughput Gene Sequencing
23andme.com
https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
Data Science / Data Analytics:
Beware Over-Hyped Expectations!
Data Science Application Process
Disclaimer and Recap
A Word from Practitioners in Data Science
Recap: What Data Science Involves