
Practical Data Science

An Introduction to Supervised Machine Learning
and Pattern Classification: The Big Picture

Sebastian Raschka
Michigan State University
NextGen Bioinformatics Seminars - 2015

Feb. 11, 2015

A Little Bit About Myself ...


PhD candidate in Dr. L. Kuhn's lab:
Developing software & methods for
- Protein-ligand docking
- Large-scale drug/inhibitor discovery

and some other machine learning side-projects

What is Machine Learning?


"Field of study that gives computers the
ability to learn without being explicitly
programmed.
(Arthur Samuel, 1959)

By Phillip Taylor [CC BY 2.0]

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Examples of Machine Learning


Text Recognition

Biology

http://commons.wikimedia.org/wiki/
File:American_book_company_1916._letter_envelope-2.JPG#filelinks
[public domain]

Spam Filtering
https://flic.kr/p/5BLW6G [CC BY 2.0]

Examples of Machine Learning


Self-driving cars
Recommendation systems

http://commons.wikimedia.org/wiki/File:Netflix_logo.svg [public domain]


By Steve Jurvetson [CC BY 2.0]

Photo search
and many, many
more ...
http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html

How many of you have used machine learning before?

Our Agenda

Concepts and the big picture

Workflow

Practical tips & good habits

Supervised learning:
- Labeled data
- Direct feedback
- Predict outcome/future

Unsupervised learning:
- No labels
- No feedback
- Find hidden structure

Reinforcement learning:
- Decision process
- Reward system
- Learn series of actions

Unsupervised Learning
- Clustering: [DBSCAN on a toy dataset]

Supervised Learning
- Regression: [Soccer Fantasy Score prediction]
- Classification: [SVM on 2 classes of the Wine dataset]  <- Today's topic

Nomenclature

IRIS dataset
https://archive.ics.uci.edu/ml/datasets/Iris

         sepal_length   sepal_width   petal_length   petal_width   class
  1      5.1            3.5           1.4            0.2           setosa
  2      4.9            3.0           1.4            0.2           setosa
  ...
  50     6.4            3.2           4.5            1.5           versicolor
  ...
  150    5.9            3.0           5.1            1.8           virginica

Rows: instances (samples, observations)
Columns: features (attributes, dimensions)
Last column: classes (targets)
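For readers who want to follow along in code, here is a minimal sketch (assuming scikit-learn, one of the tools listed at the end, is installed) that loads the Iris data and shows the instances/features/classes terminology:

```python
# Minimal sketch: load the Iris data and inspect instances, features, and targets.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target               # X: 150 x 4 feature matrix, y: 150 class labels
print(X.shape)                              # (150, 4) -> instances x features
print(iris.feature_names)                   # sepal/petal length and width in cm
print(iris.target_names)                    # ['setosa' 'versicolor' 'virginica']
print(X[0], iris.target_names[y[0]])        # first instance and its class
```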

Classification
1) Learn from training data
2) Map unseen (new) data

[Scatter plot: class1 and class2 samples in a two-dimensional feature space (x1, x2)]

Supervised Learning

[Workflow diagram:
Raw Data Collection
  -> Pre-Processing (Missing Data, Feature Extraction, Sampling)
  -> Split into Training Dataset and Test Dataset
  -> Pre-Processing of the training data (Feature Selection, Feature Scaling, Dimensionality Reduction)
  -> Learning Algorithm, Training
       (loop: Cross Validation, Hyperparameter Optimization, Performance Metrics, Model Selection, Refinement)
  -> Final Model Evaluation on the Test Dataset
  -> Final Classification/Regression Model
  -> Prediction on New Data (with Pre-Processing and Post-Processing)]

Sebastian Raschka 2014
This work is licensed under a Creative Commons Attribution 4.0 International License.


A Few Common Classifiers


Perceptron

Naive Bayes

Decision Tree
K-Nearest Neighbor
Logistic Regression
Artificial Neural Network / Deep Learning
Support Vector Machine
Ensemble Methods: Random Forest, Bagging, AdaBoost

Discriminative Algorithms
- Map x -> y directly.
- E.g., distinguish between people speaking different languages without learning the languages.
- Logistic Regression, SVM, Neural Networks

Generative Algorithms
- Model a more general problem: how the data was generated.
- I.e., the distribution of the class; the joint probability distribution p(x, y).
- Naive Bayes, Bayesian Belief Network classifier, Restricted Boltzmann Machine

Examples of Discriminative Classifiers:
Perceptron
F. Rosenblatt. The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

[Diagram: inputs 1, x_i1, x_i2 weighted by w_0, w_1, w_2 and passed through a unit step function to produce the output ŷ_i; y ∈ {-1, 1}; decision boundary in the (x_1, x_2) plane]

Net input:  w^T x = w_0 + w_1 x_1 + w_2 x_2

where
  w_j = weight
  x_i = training sample
  y_i = desired output
  ŷ_i = actual output
  t   = iteration step
  η   = learning rate
  θ   = threshold (here: 0)

Unit step function:
  ŷ_i =  1  if w^T x_i ≥ θ
        -1  otherwise

Update rule:
  w_j(t+1) = w_j(t) + η (y_i − ŷ_i) x_i

until t+1 = max. iterations or error = 0
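As a concrete illustration of this update rule, here is a small NumPy sketch of perceptron training (the variable names and defaults are illustrative; assumes labels y in {-1, 1}):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, n_iter=10):
    """Minimal perceptron sketch: X is (n_samples, n_features), y holds labels in {-1, 1}."""
    w = np.zeros(X.shape[1] + 1)                 # w[0] is the bias weight w_0
    for t in range(n_iter):                      # stop after max. iterations ...
        errors = 0
        for xi, target in zip(X, y):
            y_hat = 1 if (w[0] + np.dot(w[1:], xi)) >= 0 else -1   # unit step, threshold 0
            update = eta * (target - y_hat)                         # eta * (y_i - y_hat_i)
            w[1:] += update * xi                                    # update feature weights
            w[0] += update                                          # update bias weight
            errors += int(update != 0.0)
        if errors == 0:                          # ... or when the training error reaches 0
            break
    return w
```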

Discriminative Classifiers:
Perceptron
F. Rosenblatt. The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

[Diagram: perceptron unit with inputs 1, x_i1, x_i2, weights w_0, w_1, w_2, output ŷ_i; y ∈ {-1, 1}; linear decision boundary in the (x_1, x_2) plane]

- Binary classifier (one vs. all, OVA)
- Convergence problems (set a maximum number of iterations)
- Modification: stochastic gradient descent
- "Modern" perceptron: Support Vector Machine (maximize the margin)
- Multilayer perceptron (MLP)

Generative Classifiers:
Naive Bayes

Bayes' Theorem:

  P(ω_j | x_i) = P(x_i | ω_j) P(ω_j) / P(x_i)

  Posterior probability = (Likelihood × Prior probability) / Evidence

Iris example:  P("Setosa" | x_i),  x_i = [4.5 cm, 7.4 cm]

Generative Classifiers:
Naive Bayes

Bayes' Theorem:

  P(ω_j | x_i) = P(x_i | ω_j) P(ω_j) / P(x_i)

Decision Rule:

  predicted class label ω_j = argmax_j P(ω_j | x_i),  j = 1, …, m

  e.g., ω_j ∈ {Setosa, Versicolor, Virginica}

Generative Classifiers:
Naive Bayes

  P(ω_j | x_i) = P(x_i | ω_j) P(ω_j) / P(x_i)

Evidence: P(x_i)  (cancels out in the decision rule)

Prior probability:  P(ω_j) = N_ω_j / N_c   (class frequency)

Class-conditional probability (here: Gaussian kernel):

  P(x_ik | ω_j) = (2π σ_ω_j²)^(-1/2) · exp( −(x_ik − μ_ω_j)² / (2 σ_ω_j²) )

  P(x_i | ω_j) = ∏_k P(x_ik | ω_j)
Generative Classifiers:
Naive Bayes

- Naive conditional independence assumption typically violated
- Works well for small datasets
- Multinomial model still quite popular for text classification (e.g., spam filter)
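A minimal scikit-learn sketch of a Gaussian naive Bayes classifier on the Iris data (the 70/30 split is an illustrative choice):

```python
# Minimal sketch: Gaussian naive Bayes on the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)   # hold out 30% for testing

nb = GaussianNB().fit(X_train, y_train)     # estimates class priors and per-class Gaussians
print(nb.score(X_test, y_test))             # mean accuracy on the held-out data
```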

Non-Parametric Classifiers:
K-Nearest Neighbor

[Illustration: a query point is assigned the majority class among its k nearest neighbors, e.g., k=1 vs. k=3]

- Simple!
- Lazy learner
- Very susceptible to the curse of dimensionality
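For reference, a short scikit-learn sketch of a k-nearest-neighbor classifier (k=3 and the split are illustrative choices):

```python
# Minimal sketch: k-nearest neighbors on the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')   # k=3, uniform vote
knn.fit(X_train, y_train)          # "lazy" learner: fit essentially stores the training data
print(knn.score(X_test, y_test))
```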

Iris Example

[Decision-region plot on the Iris data, C=3 classes (Setosa, Versicolor, Virginica): k=3, Mahalanobis distance, uniform weights; decision tree with depth = 2]

Decision Tree

  petal length <= 2.45?
    Y -> Setosa
    N -> petal length <= 4.75?
           Y -> Versicolor
           N -> Virginica

[Decision-region plot for a tree of depth = 4]

  Entropy = − Σ_i p_i log_k(p_i)

  e.g., 2 · (−0.5 · log₂(0.5)) = 1

  Information Gain = entropy(parent) − [avg entropy(children)]
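To make the entropy and information-gain calculation concrete, a small NumPy sketch (the split proportions are made-up illustrative values):

```python
import numpy as np

def entropy(p):
    """Entropy of a class-probability distribution p (log base 2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))                    # 1.0 bit, as in the example above

# Information gain = entropy(parent) - weighted avg. entropy(children)
parent = entropy([0.5, 0.5])
children = 0.5 * entropy([0.8, 0.2]) + 0.5 * entropy([0.2, 0.8])
print(parent - children)                      # > 0: the split reduces impurity
```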

"No Free Lunch" :(


D. H. Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25–42. Springer, 2002.

Our model is a simplification of reality

Simplification is based on assumptions (model bias)

Assumptions fail in certain situations

Roughly speaking:
No one model works best for all possible situations.

Which Algorithm?
What is the size and dimensionality of my training set?
Is the data linearly separable?
How much do I care about computational efficiency?
- Model building vs. real-time prediction time
- Eager vs. lazy learning / on-line vs. batch learning
- prediction performance vs. speed
Do I care about interpretability, or should it "just work well"?
...

[Supervised learning workflow diagram (see above), shown again as a section divider.]

Missing Values:
- Remove features (columns)
- Remove samples (rows)
- Imputation (mean, nearest neighbor, …)

Sampling:
- Random split into training and validation sets
- Typically 60/40, 70/30, 80/20
- Don't use the validation set until the very end! (overfitting)
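A brief scikit-learn sketch of both steps (mean imputation and a 70/30 split; the toy data is illustrative):

```python
# Minimal sketch: mean imputation of missing values and a random 70/30 split.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[5.1, 3.5], [4.9, np.nan], [6.4, 3.2], [5.9, 3.0]])   # one missing value
y = np.array([0, 0, 1, 2])

X_imp = SimpleImputer(strategy='mean').fit_transform(X)    # replace NaN with the column mean

X_train, X_val, y_train, y_val = train_test_split(
    X_imp, y, test_size=0.3, random_state=1)                # keep the validation set untouched until the end
```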

Categorical Variables

     color   size   prize   class label
0    green   M      10.1    class1
1    red     L      13.5    class2
2    blue    XL     15.3    class1

     color: nominal     size: ordinal

Mapping to numbers:
- size (ordinal):  M -> 1, L -> 2, XL -> 3
- class label:     class1 -> 0, class2 -> 1
- color (nominal, one-hot):  green -> (1,0,0), red -> (0,1,0), blue -> (0,0,1)

     color=blue   color=green   color=red   prize   size   class label
0    0            1             0           10.1    1      0
1    0            0             1           13.5    2      1
2    1            0             0           15.3    3      0
[Supervised learning workflow diagram (see above), shown again as a section divider.]

Generalization Error and Overfitting

How well does the model perform on unseen data?

Generalization Error and Overfitting

Error Metrics: Confusion Matrix

here: setosa = positive

                     predicted positive    predicted negative
actual positive      TP                    FN
actual negative      FP                    TN

[Linear SVM on sepal/petal lengths]

Error Metrics

here: setosa = positive

[Confusion matrix as above; Linear SVM on sepal/petal lengths]

(micro and macro averaging for multi-class problems)

  Accuracy = (TP + TN) / (FP + FN + TP + TN) = 1 − Error

  False Positive Rate = FP / N

  True Positive Rate = TP / P   (Recall)

  Precision = TP / (TP + FP)
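A scikit-learn sketch computing these metrics from predictions (toy binary labels; 1 = positive class):

```python
# Minimal sketch: confusion matrix and derived metrics.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))     # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))       # (TP + TN) / total
print(precision_score(y_true, y_pred))      # TP / (TP + FP)
print(recall_score(y_true, y_pred))         # TP / P  (true positive rate)
# precision_score/recall_score accept average='micro' or 'macro' for multi-class problems.
```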

Receiver Operating Characteristic (ROC) Curves

Model Selection

Complete dataset -> split into Training dataset and Test dataset

k-fold cross-validation (k=4) on the training dataset:

  1st iteration:  fold 1 = test set, folds 2-4 = training     -> calc. error
  2nd iteration:  fold 2 = test set, folds 1, 3, 4 = training -> calc. error
  3rd iteration:  fold 3 = test set, folds 1, 2, 4 = training -> calc. error
  4th iteration:  fold 4 = test set, folds 1, 2, 3 = training -> calc. error

  -> calculate avg. error
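The same procedure as a scikit-learn sketch (k=4; the classifier is an illustrative choice):

```python
# Minimal sketch: 4-fold cross-validation on the training portion of the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                         X_train, y_train, cv=4)    # one accuracy score per fold
print(scores, scores.mean())                        # per-fold scores and their average
```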

k-fold CV and ROC

Feature Selection

IMPORTANT! (Noise, overfitting, curse of dimensionality, efficiency)

- Domain knowledge
- Variance threshold
- Exhaustive search
- Decision trees
- …

Simplest example: Greedy Backward Selection

  start:  X = [x1, x2, x3, x4]
     ->   X = [x1, x3, x4]
     ->   X = [x1, x3]
  stop (if d = k)
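A rough sketch of greedy backward selection (scoring by cross-validated accuracy; the classifier and the target size k are illustrative assumptions):

```python
# Minimal sketch: greedy backward feature selection by cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
clf = KNeighborsClassifier(n_neighbors=3)

features = list(range(X.shape[1]))     # start with all feature indices
k = 2                                  # stop when d = k features remain
while len(features) > k:
    # drop the feature whose removal hurts the cross-validated score the least
    scores = {f: cross_val_score(clf, X[:, [g for g in features if g != f]], y, cv=4).mean()
              for f in features}
    features.remove(max(scores, key=scores.get))
print(features)                        # indices of the k features that were kept
```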

Dimensionality Reduction

Transformation onto a new feature subspace

e.g., Principal Component Analysis (PCA)

Find directions of maximum variance

Retain most of the information

PCA in 3 Steps

0. Standardize the data:

   z = (x_ik − μ_k) / σ_k

1. Compute the covariance matrix:

   σ_jk = 1/(n − 1) Σ_i (x_ij − μ_j)(x_ik − μ_k)

   Σ = | σ₁²   σ₁₂   σ₁₃   σ₁₄ |
       | σ₂₁   σ₂²   σ₂₃   σ₂₄ |
       | σ₃₁   σ₃₂   σ₃²   σ₃₄ |
       | σ₄₁   σ₄₂   σ₄₃   σ₄² |

PCA in 3 Steps

2. Eigendecomposition and sorting of the eigenvalues:

   Σ v = λ v

Eigenvectors
[[ 0.52237162  -0.37231836  -0.72101681   0.26199559]
 [-0.26335492  -0.92555649   0.24203288  -0.12413481]
 [ 0.58125401  -0.02109478   0.14089226  -0.80115427]
 [ 0.56561105  -0.06541577   0.6338014    0.52354627]]

Eigenvalues (from high to low)
[ 2.93035378   0.92740362   0.14834223   0.02074601]

PCA in 3 Steps

3. Select the top k eigenvectors and transform the data:

Eigenvectors
[[ 0.52237162  -0.37231836  -0.72101681   0.26199559]
 [-0.26335492  -0.92555649   0.24203288  -0.12413481]
 [ 0.58125401  -0.02109478   0.14089226  -0.80115427]
 [ 0.56561105  -0.06541577   0.6338014    0.52354627]]

Eigenvalues
[ 2.93035378   0.92740362   0.14834223   0.02074601]

[Plot: the Iris data projected onto the first 2 principal components]
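The three steps as a NumPy sketch (standardize, covariance matrix, eigendecomposition, projection onto the top k=2 components):

```python
# Minimal sketch: PCA "by hand" on the Iris data.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# 0. Standardize the data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 1. Covariance matrix (features x features)
cov = np.cov(Z.T)

# 2. Eigendecomposition; sort eigenvalues from high to low
eig_vals, eig_vecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 3. Project onto the top k=2 eigenvectors
W = eig_vecs[:, :2]                             # 4 x 2 projection matrix
X_pca = Z @ W                                   # 150 x 2 transformed data
print(eig_vals)                                 # approx. [2.93, 0.93, 0.15, 0.02], as above
```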

Hyperparameter Optimization:
GridSearch in scikit-learn

[Decision-region plots for different hyperparameter settings, e.g., SVM with C=1000, gamma=0.1; C=1; k-NN with k=11, uniform weights]
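A sketch of such a grid search (the parameter grid values are illustrative):

```python
# Minimal sketch: exhaustive grid search over SVM hyperparameters with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.001, 0.01, 0.1, 1.0],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, cv=5)    # 5-fold CV for every parameter combination
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)      # best setting and its cross-validated accuracy
```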

Non-Linear Problems
- e.g., the XOR gate

[Decision-region plot on XOR data, e.g., a decision tree of depth = 4]

Kernel Trick
Kernel function: map onto a high-dimensional space (non-linear combinations of the original features)

Kernel Trick
Trick: No explicit dot product!
Radial Basis Function (RBF) Kernel:

  k(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)
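A sketch of the kernel trick in practice (an RBF-kernel SVM on toy XOR-style data; the data generation and parameters are illustrative):

```python
# Minimal sketch: an RBF-kernel SVM separates XOR-style data that no linear model can.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)                                        # toy 2D points
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)     # XOR-style class labels

svm = SVC(kernel='rbf', gamma=1.0, C=1.0)    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
svm.fit(X, y)
print(svm.score(X, y))                       # high accuracy despite the non-linear structure
```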

Kernel PCA

[Plots: projection onto PC1 with linear PCA vs. PC1 with kernel PCA]

[Supervised learning workflow diagram (see above), shown again as a section divider.]

Thanks!
Questions?
@rasbt
[email protected]
https://github.com/rasbt

Additional Slides

Inspiring Literature
P. N. Klein. Coding the Matrix: Linear
Algebra Through Computer Science
Applications. Newtonian Press, 2013.

S. Gutierrez. Data Scientists at Work. Apress, 2014.

R. Schutt and C. O'Neil. Doing Data Science: Straight Talk from the Frontline. O'Reilly Media, Inc., 2013.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd edition. New York, 2001.

Useful Online Resources

https://www.coursera.org/course/ml

http://stats.stackexchange.com

http://www.kaggle.com

My Favorite Tools
http://scikit-learn.org/stable/
http://www.numpy.org
http://pandas.pydata.org

Seaborn

http://stanford.edu/~mwaskom/software/seaborn/
http://ipython.org/notebook.html

Which one to pick?

[Two decision boundaries fit to the same class1/class2 data: a simple model and a more complex one]

Generalization error!
The problem of overfitting
