0% found this document useful (0 votes)
54 views

01a Data Science Introduction 1

The document discusses the rise of big data and data science. It provides key figures on the growing size of the global data sphere and data production. It highlights both celebrated successes of data science, like advances in image recognition and healthcare, as well as notorious failures, such as offensive tweets from Microsoft's AI chatbot Tay. The document defines data science as developing tools to store, manipulate, and analyze data to extract knowledge.

Uploaded by

KADDAMI Saousan
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
54 views

01a Data Science Introduction 1

The document discusses the rise of big data and data science. It provides key figures on the growing size of the global data sphere and data production. It highlights both celebrated successes of data science, like advances in image recognition and healthcare, as well as notorious failures, such as offensive tweets from Microsoft's AI chatbot Tay. The document defines data science as developing tools to store, manipulate, and analyze data to extract knowledge.

Uploaded by

KADDAMI Saousan
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 40

INF442 — Algorithms for Data Analysis in C++

Introduction to Data Science

— Steve Oudot
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

2
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

I correlated with World’s total storage capacity


— data centers and cloud (45% - 55% in 2025)

2
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

I correlated with World’s total storage capacity


— data centers and cloud (45% - 55% in 2025)

I exponential growth (+ 30% each year)


— expected to be sustained on the long run

2
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

I correlated with World’s total storage capacity


— data centers and cloud (45% - 55% in 2025)

I exponential growth (+ 30% each year)


— expected to be sustained on the long run

I small fraction of data is processed/analyzed


— shortage of trained data scientists

2
Data production
Data are produced at an unprecedented rate by:
I Industry / Economy
I Sciences

I End users

3
Challenges

Complex data Corrupted data


(non-linear, sparse, (noise, outliers,
high-dimensional) missing values)

Big data
(streamed, online, distributed)

4
Data science’s celebrated successes...

AI for games:

1997: IBM’s Deep Blue wins chess match


against world champion G. Kasparov

2016: DeepMind’s AlphaGo wins Go match


against 18-time world champion Lee Sedol

2019: DeepMind’s AlphaStar beats


Starcraft II professional players
5
Data science’s celebrated successes...

ImageNet Challenge:

I database of 40 · 106 + images, structured in 20 · 103 + categories

I images collected on the Internet

I annotation process crowdsourced


to Amazon Mechanical Turk

[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009] 5
Data science’s celebrated successes...

ImageNet Challenge:

I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

I until 2011, classification error rates around 25%

s network more than 60−→


hasbreakthrough
I 2012: million
deepparameters to tune
CNN (AlexNet) reduced error to 16%

[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012] 5
Data science’s celebrated successes...

ImageNet Challenge:

I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

I until 2011, classification error rates around 25%

s network more than 60−→


hasbreakthrough
I 2012: million
deepparameters to tune
CNN (AlexNet) reduced error to 16%

rrow I by now:
tasks: error one-against
typically rates below all
5%, performances
(e.g. recognizingbetter
cats, than
cars, human on narrow tasks
etc.) performance quality

[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012] 5
Data science’s celebrated successes...

ImageNet Challenge:

I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

I until 2011, classification error rates around 25%

s network more than 60−→


hasbreakthrough
I 2012: million
deepparameters to tune
CNN (AlexNet) reduced error to 16%

unsupervised pre-training
I unsupervised ' usingleads
pre-training auto-encoders
to conceptaslearning
feature(e.g.
generators
humantoface,
be plugged
cat face)into

this image is the optimal stimulus of the

[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012] 5
Data science’s celebrated successes...

Healthcare data:

2011: a new subgroup of breast cancers


with excellent survival rates is discovered
using exploratory data analysis techniques

[Nicolau et al.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile
and excellent survival, PNAS 2011]
5
... and notorious failures

Microsoft’s Tay:

I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

I learned from its interactions with people

6
... and notorious failures

Microsoft’s Tay:

I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

I learned from its interactions with people

I shut down only 16 hours after launch

I produced inflammatory, offensive (racist, sexually-charged) tweets

6
... and notorious failures

Microsoft’s Tay:

I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

I learned from its interactions with people

I shut down only 16 hours after launch

I produced inflammatory, offensive (racist, sexually-charged) tweets

I training overrun by trolls

I numerous questions raised


(technical, legal, ethical)

6
... and notorious failures

Some notorious AI failures in 2018:

Google Photo’s AI feature confuses skier


and mountain

Amazon’s AI recruiting tool proven to be


gender-based

Uber’s self-driving car kills pedestrian


in Arizona

6
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?

Core topics:

I statistical analysis

I machine learning

I pattern recognition

I data mining

I optimization (convex / combinatorial)

I database management and distributed systems

I high-performance computing (streaming, distributed, cloud)

7
Data?
Datum ≡ observation ≡ ”chunk of information”

8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise ven
ariables
entree par la plupart des algorithmes de t

{
x1
v1 ··· vd

observations
{
···
xn
coordinate
matrix

categorical variables: 1, 2, · · · , K (arbitrary labels)


(e.g. ”cat”, ”dog”, ”horse”)

continuous variables: real or complex values


(e.g. temperature, pressure, geographic coordinate, income, amplitude/phase)
8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise oen
bservations
entree par la plupart des algorithmes de t

{
Metric representation x1
x1 ··· xn

observations
{
···
xn
distance /
(dis-)similarity
matrix

distances: Euclidean, Hamming, geodesic, diffusion, edit, Jaccard, Wasserstein, etc.

(dis-)similarity measures: cosine, Kullback-Leibler, Bregman divergences, etc.

8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise en entree par la plupart des algorithmes de t

Metric representation

Rd

?
···

8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise en entree par la plupart des algorithmes de t

Metric representation

Rd

?
···

feature extraction

8
Programming languages for data science

I Databases / data manipulation: Structured Query Language (SQL)

te: all other modern languages are built on SQL (e.g. QBE is in fact just a front-e

9
Programming languages for data science

I Databases / data manipulation: Structured Query Language (SQL)

I Data analysis: Python (CS) / R (stats)

e: what is taught is: (1) principles of each approach; (2) how to apply it in Python

9
Programming languages for data science

I Databases / data manipulation: Structured Query Language (SQL)

I Data analysis: Python (CS) / R (stats)

I Effective data processing:


C / C++ / CUDA (GPGPU)

[...]

9
Learning paradigms
horse
Supervised learning
Input: data with labels (examples)
cat dog
Goal: predict the labels of new data
?
Typical problems:
I classification (categorical labels)
I regression (continuous labels)
I forecasting (regression on time series)
energy consumption

? weather parameters
10
Learning paradigms

Unsupervised learning
Input: data without labels

Goal: identify patterns, correlations

Typical problems:
I clustering
I dimensionality reduction
I anomaly detection / noise removal

10
Learning paradigms

Unsupervised learning

Semi-supervised learning (only a fraction of the input data has labels)

Supervised learning
10
Learning paradigms

Reinforcement learning
Input: Markov decision process:
I agent & environment states, vis. rules, actions, transition probabilities, rewards

Goal: find policy that minimizes the regret


es: - it is the (expected loss is
total loss that of measured,
reward compared to the process, hence penalizing every mistake
throughout
optimal strategy)

Typical problems:

I exploration vs. exploitation


(e.g. multi-armed
pically, problems where bandit)
exploration vs. exploitation dilemma appears can be modelled as reinforce

I control learning

10
Learning paradigms

Reinforcement learning
Input: Markov decision process:
I agent & environment states, vis. rules, actions, transition probabilities, rewards

Goal: find policy that minimizes the regret


es: - it is the (expected loss is
total loss that of measured,
reward compared to the process, hence penalizing every mistake
throughout
optimal strategy)

Typical problems:

I exploration vs. exploitation


(e.g. multi-armed
pically, problems where bandit)
exploration vs. exploitation dilemma appears can be modelled as reinforce

I control learning

[Géron 2017] 10
Learning paradigms

(source: NVIDIA)
10
Course outline
• Lecture 1: introduction to data science / C++ as C (1/2)

• Lecture 2: Nearest-Neighbor search / C++ as C (2/2)

• Lecture 3: k-means clustering / classes (1/2)


unsupervised

• Lecture 4: hierarchical clustering / classes (2/2)

• Lecture 5: density estimation and noise removal / inheritance

• Lecture 6: k-NN classifier / genericity


supervised

• Lecture 7: linear models for regression / STL

• Lecture 8: linear models for classification / -

• Lecture 9: artificial neural networks / C++11

• Lecture 10: feature extraction and dimensionality reduction / -


11

You might also like