
1c. INTRODUCTION-Data-Science-basic

This document provides an introduction to the field of data science. It discusses how data has become inexpensive to acquire and store, enabling the use of algorithms and data to understand phenomena, build models, and make predictions. The key question in data science is how to explain and predict the world, especially in areas without good predictive models, by using sampling, statistics, and data-driven approaches. Data science draws on computer science, statistics, mathematics, and various domains of science. The document outlines topics to be covered, including what data science involves, considerations around big data, and the tools and techniques used in data analytics.

Uploaded by

Gaurav

Introduction:

What Is Data Science?

Except where otherwise noted, this work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Motivation for the Course:
Data Is Driving Everything

1. Modern data acquisition is inexpensive!
• Smartphones, embedded systems, inexpensive sensors
• Medical devices, simulators, …
2. Data storage is inexpensive!
3. Parallel (compute cluster) computation is inexpensive!
• The cloud, clusters of computers, GPUs, tensor processors, …
Can we use algorithms + data to understand phenomena? Build or augment models? Build detectors? Make diagnoses?

The Key Question in Data Science:
How Do We Explain & Predict the World?
Much of science and engineering derives from physics, where we have rich predictive models:
• Newton’s laws, the theory of relativity, optics, how materials react under stress, etc.
• The basis of prediction tends to be simulation

How do we make predictions where we don’t have good models?
• E.g., human behavior, biology, the brain, whether a product will be a success, what to invest in
• We need to use sampling, statistics, and “data-first” approaches
• But we need enough representative data, and the right questions, for good models!

Of course, in the real world we often want to combine models and data!
Outline of Topics

• What is data science?
• Considerations around big data
• What does analytics involve?
• Disclaimer and recap

• Checkpoint exercises
• Practice notebook

Wordcloud for Data Science

Data Science is Interdisciplinary

(CS+STAT+MATH)

(SCIENCE | ECONOMICS | SOCIOLOGY | BUSINESS | LAW …)

Data Science Versus
Data Analytics and Data Engineering
This course will initially focus on the technical and data manipulation aspects of Data Science, which we will call “Data Analytics”
• Some people also call this data engineering
Data Science is much broader, and includes domain expertise, communicating the results to users via storytelling and visualization, and policy and legal implications.

We will get to the broader social context later, e.g., Data Ethics
• Privacy, fairness, accountability, transparency, …
We will give examples throughout the course of how data science is used in different domains
Data Analytics

Our focus – managing and creating models from data at scale:
• Complex, heterogeneous, large, high-dimensional, high-velocity data
We will touch on many related techniques and issues:
• Machine learning, distributed computing, distributed algorithms, and parallel computation
We will leverage standard tools and platforms, and point to others

Our focus is on the foundations of machine learning and cloud platforms – you should continue to study these in depth afterwards!

Data and Models
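Raw Data ⇒ Structured, Integrated Data ⇒ Extracted Features ⇒ Descriptive or Inferential Model
• Raw data: observations sampled from the world
• Structured, integrated data: machine interpretable data
• Extracted features: measurable characteristics
• Descriptive or inferential model: of the population as a whole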

Parts of the data may be:
• In different systems or organizations
• Across different documents, databases, etc.
Also, it’s often not in a form where:
• We can directly use it, e.g., it’s in text documents or HTML
• It’s clean and regular – e.g., it has missing values, spurious values, etc.
Our goal is to extract features that help us make predictions
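
As a minimal sketch of the cleaning problem described above, the following Python snippet repairs missing and spurious values in a few made-up records. The field names (`age`, `income`), the validity ranges, and the choice of median imputation are illustrative assumptions, not something the slides prescribe.

```python
from statistics import median

# Hypothetical raw records: None marks a missing value, -1 a spurious one.
raw_records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": -1},
    {"age": 41, "income": 58000},
]

def clean_column(records, field, is_valid):
    """Replace missing or invalid entries in `field` with the median of the valid ones."""
    valid = [r[field] for r in records if r[field] is not None and is_valid(r[field])]
    fill = median(valid)
    cleaned = []
    for r in records:
        value = r[field]
        if value is None or not is_valid(value):
            value = fill
        cleaned.append({**r, field: value})  # copy the record, do not mutate the input
    return cleaned

# Assumed validity rules for this toy example.
records = clean_column(raw_records, "age", lambda v: 0 < v < 120)
records = clean_column(records, "income", lambda v: v >= 0)
```

Median imputation is only one option; dropping incomplete records or flagging them for review may be more appropriate depending on the question being asked.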
Considerations around Big Data:
Structure, Cleaning, and Linking

Open and Closed Worlds,
Observations vs the Universe
Databases often assume a closed world: we know everything…
• The company’s employees, their exact salaries, every product, …
Data science is in an open world: we can’t see the whole population
• We are given samples or incomplete observations
• We want to predict the future, or characterize the entire population

How we sample affects our model!
• There are many pitfalls that can lead to biased models – a topic researchers are trying to address!
• As the SEC says: “Past performance is not indicative of future results”
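
To illustrate how the sampling method can skew an estimate, here is a small Python sketch that compares a uniform random sample with a "convenience" sample drawn from only one end of a population. The population (synthetic Gaussian values), the sample size, and the function names are all assumptions made for this example.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values, sorted so position correlates with value.
population = sorted(random.gauss(50, 10) for _ in range(10_000))

def random_sample(data, k):
    """Uniform random sample: every member is equally likely to be chosen."""
    return random.sample(data, k)

def convenience_sample(data, k):
    """Biased sample: only the k largest values (e.g. only the most active users)."""
    return data[-k:]

true_mean = mean(population)
random_estimate = mean(random_sample(population, 500))
biased_estimate = mean(convenience_sample(population, 500))

random_error = abs(random_estimate - true_mean)
biased_error = abs(biased_estimate - true_mean)
```

In this setup the random sample lands close to the population mean, while the convenience sample overestimates it substantially, even though both use the same number of observations.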
Raw Data vs Structured Data
[Figure: raw data (images, genes, text) + feature extraction, wrangling]
Text: “Do not be like the cat who wanted a fish but was afraid to get his paws wet.” William Shakespeare

Goal: raw data to structured data
• Fields, entities, objects, machine learning features
• May be very regular or semi-structured

Ultimately, goal is to go from data to information to knowledge
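
As a rough sketch of going from raw text to a structured record, the Python snippet below tokenizes the sentence shown on the slide and derives a handful of features. The particular features (`num_tokens`, `mentions_cat`, and so on) and the tokenization rule are assumptions chosen for illustration; real feature extraction depends on the task.

```python
import re
from collections import Counter

raw_text = "Do not be like the cat who wanted a fish but was afraid to get his paws wet."

def extract_features(text):
    """Turn free text into a small structured record of (assumed) features."""
    # Lowercase and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "num_tokens": len(tokens),
        "num_distinct": len(counts),
        "mentions_cat": counts["cat"] > 0,
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
    }

features = extract_features(raw_text)
```

The resulting dictionary has fixed fields, so many such records could be stacked into a table and passed to a model.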

Linked Data: Find Patterns in Connectivity
(Clusters, Paths, …)
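
One simple way to find clusters and paths in linked data is breadth-first search over a graph. The sketch below uses a tiny made-up edge list; the node names and helper functions are hypothetical, and connected components are only the most basic notion of a "cluster".

```python
from collections import defaultdict, deque

# Hypothetical "who is linked to whom" edges.
edges = [("ann", "bob"), ("bob", "cy"), ("dee", "eve")]

def build_graph(edge_list):
    """Undirected adjacency sets built from an edge list."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def clusters(graph):
    """Connected components, each found with a breadth-first search."""
    seen, result = set(), []
    for start in list(graph):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        result.append(component)
    return result

def shortest_path(graph, source, target):
    """Shortest path by hop count, or None if the target is unreachable."""
    parents, queue = {source: None}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

graph = build_graph(edges)
found_clusters = clusters(graph)
```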

Data with Complex Semantics:
Knowledge Graphs
Classes, subclasses, instances, and properties
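
A knowledge graph can be sketched as a set of (subject, predicate, object) triples. The toy example below shows how an instance inherits class membership through subclass links; the predicate names (`subclass_of`, `instance_of`) and the entities are assumptions for illustration, not a standard vocabulary.

```python
# Hypothetical triples: (subject, predicate, object).
triples = {
    ("Cat", "subclass_of", "Mammal"),
    ("Mammal", "subclass_of", "Animal"),
    ("Tom", "instance_of", "Cat"),
    ("Tom", "has_color", "grey"),  # an ordinary property
}

def superclasses(cls, kb):
    """All classes reachable from `cls` by following subclass_of edges."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in kb:
            if s == current and p == "subclass_of" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

def classes_of(instance, kb):
    """Direct classes of an instance plus those inherited via subclass_of."""
    direct = {o for s, p, o in kb if s == instance and p == "instance_of"}
    inherited = set()
    for cls in direct:
        inherited |= superclasses(cls, kb)
    return direct | inherited

tom_classes = classes_of("Tom", triples)
```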

Dynamic, High-Velocity Data: Track over Time,
Forecast the Future

Tabular (Relational) Data
and Joins / Lookups (e.g. to Web Services)
[Figure: New York taxi data joined with reverse geocode data and Street View]
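
The join / lookup idea can be sketched in Python as a left join between trip records and a lookup table. Everything here is hypothetical: the trip fields, the grid-cell keys, and the `geocode` dictionary, which stands in for a reverse-geocoding web service.

```python
# Hypothetical trip records; real taxi data has many more columns.
trips = [
    {"trip_id": 1, "pickup_cell": "A1", "fare": 12.5},
    {"trip_id": 2, "pickup_cell": "B2", "fare": 30.0},
    {"trip_id": 3, "pickup_cell": "Z9", "fare": 8.0},  # no match in the lookup table
]

# Hypothetical reverse-geocode table: grid cell -> neighborhood name.
geocode = {"A1": "Midtown", "B2": "Harlem"}

def left_join(rows, lookup, key, new_field, default=None):
    """Attach lookup[row[key]] to each row; unmatched rows get `default`."""
    return [{**row, new_field: lookup.get(row[key], default)} for row in rows]

enriched = left_join(trips, geocode, "pickup_cell", "neighborhood")
```

A left join keeps every trip even when the lookup has no answer, which makes the unmatched rows visible instead of silently dropping them.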
What Makes Data “Big” Data?
A gigabyte? A petabyte? A yottabyte?
A million records? Billion records? Trillion records?

There is no consensus definition, but from our perspective:
• Too complex for a human to understand directly
• Doesn’t fit into a single uniform memory space (e.g. variables in Python) – means we need to think carefully about I/O and/or communication
• Need more than brute force algorithms to analyze
• May require multiple computers working in parallel to process
• Possibly high dimensionality, requiring feature selection and dimensionality reduction
• May be changing rapidly (high velocity)
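
When data does not fit in memory, one common approach is to process it in chunks and combine partial results. The sketch below computes a mean this way; the chunk size and the generator standing in for a large file or stream are assumptions for the example.

```python
def chunks(iterable, size):
    """Yield lists of at most `size` items, so only one chunk is held in memory."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def streaming_mean(values, chunk_size=1000):
    """Mean computed from per-chunk partial sums and counts."""
    total, count = 0.0, 0
    for chunk in chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")

# A generator stands in for a file or stream too large to load at once.
result = streaming_mean((x for x in range(1, 1_000_001)), chunk_size=10_000)
```

Because each chunk only contributes a sum and a count, the same pattern extends to multiple machines: each worker processes its own chunks and the partial results are added together at the end.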

What Does Data Analytics Involve?

The Goal of Data Analytics:
From Data to “Knowledge” or Action
Pattern detection: Raw data ⇒ patterns ⇒ partial understanding
• “Show me sales by region by product category”
• “Show me clusters of documents by concept”
• “Data cubes” (sales by region by quarter by type of product)
• Typically, descriptive statistics
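
A query like “sales by region by product category” amounts to grouping records by one or more dimensions and summing a measure. Here is a minimal Python sketch with made-up sales records; the field names and the `rollup` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"region": "East", "category": "Books", "amount": 120.0},
    {"region": "East", "category": "Toys", "amount": 80.0},
    {"region": "West", "category": "Books", "amount": 200.0},
    {"region": "East", "category": "Books", "amount": 30.0},
]

def rollup(rows, dimensions, measure):
    """Sum `measure` for every combination of values of `dimensions`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_region_category = rollup(sales, ["region", "category"], "amount")
by_region = rollup(sales, ["region"], "amount")
```

A data cube can be thought of as the collection of such rollups over every subset of the dimensions.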

Given an observation: Hypothesis ⇒ experiment over sample ⇒ significance
• “Behavioral factor F leads to higher risk of outcome O”
• Do statistical test, measure significance vs. null hypothesis
• Typically, inferential statistics
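
One way to measure significance against a null hypothesis, without distributional assumptions, is a permutation test. The sketch below is one possible test, not the method the slides have in mind; the outcome scores and group names are invented for illustration.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, trials=10_000, seed=0):
    """Two-sided p-value for the difference in means, under the null
    hypothesis that the group labels are exchangeable."""
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # randomly reassign labels
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly 0.
    return (extreme + 1) / (trials + 1)

# Hypothetical outcome scores with and without behavioral factor F.
with_factor = [7.1, 6.8, 7.4, 7.9, 6.9, 7.6, 7.3, 7.0]
without_factor = [5.9, 6.2, 6.0, 6.4, 5.7, 6.1, 6.3, 5.8]

p_value = permutation_test(with_factor, without_factor)
```

A small p-value says the observed difference would be rare if the factor had no effect; it does not by itself show that the factor causes the outcome.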

What Does Big Data Analytics Involve?

• Acquisition, access – data may exist without being accessible
• Wrangling – data may be in the wrong form
• Integration, representation – data relationships may not be captured
• Cleaning, filtering – data may have variable quality
• Hypothesizing, querying, analyzing, modeling – from data to info
• Understanding, iterating, exploring – helping build knowledge

• And: ethical obligations – need to protect data, follow good statistical practices, present results in a non-misleading way
Example: Netflix Recommendation
(Or Amazon, Or Personalized Search, Or …)

Example: High-Throughput Gene
Sequencing

23andme.com

https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
Data Science / Data Analytics:
Beware Over-Hyped Expectations!

Data science myth:
• We’ll learn everything “bottom up” using fancy statistics and machine learning
• Basically we “turn the crank” and out pop insights!
Data + algorithms → knowledge

Data science reality:
• We’ll typically rely on human expertise to impose models over the data, the features, etc.
• Deep learning can do feature selection – but why throw away what we know!
Data + human insight + algorithms + iteration → information → knowledge

Data Science Application Process

• What question are you answering?
• What is the right scope of the project?
• What data will you use?
• What techniques are you going to try?
• How will you evaluate your results?
• What maintenance will be required?

Disclaimer and Recap

A Word from Practitioners in Data Science

At least 80-90% of their work involves not machine learning, but:
• Working with experts to understand the domain, assumptions, questions, etc.
• Trying to catalog and make sense of the data sources
• Wrangling, extracting, and integrating the data
• Cleaning the wrangled data

Recap: What Data Science Involves

• Data is high-dimensional, hard to understand, and requires an understanding of computation and I/O costs.
• Given an integrated dataset (often the hard part!), data science involves extracting and selecting features, as well as adding semantic structure to the data.
• There are many applications of Data Science, from discovery to clustering to classification to recommendation!
• These are sometimes statistical, sometimes algorithmic, and sometimes reliant on extraction.
