An Introduction To Supervised Machine Learning and Pattern Classification - The Big Picture
Sebastian Raschka
Michigan State University
NextGen Bioinformatics Seminars - 2015
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Biology
http://commons.wikimedia.org/wiki/File:American_book_company_1916._letter_envelope-2.JPG#filelinks
[public domain]
Spam Filtering
https://flic.kr/p/5BLW6G [CC BY 2.0]
Photo search
and many, many more ...
http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
Our Agenda
Workflow
Three types of learning:
- Supervised Learning: labeled data; direct feedback; predict outcome/future
- Unsupervised Learning: no labels; no feedback; find hidden structure
- Reinforcement Learning: decision process; reward system; learn a series of actions
Unsupervised learning (Clustering): group unlabeled data by similarity
Supervised learning (Regression): predict a continuous outcome
Supervised learning (Classification): predict a categorical class label (today's topic)
Nomenclature

The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris):

        sepal_length  sepal_width  petal_length  petal_width  class
  1     5.1           3.5          1.4           0.2          setosa
  2     4.9           3.0          1.4           0.2          setosa
  ...
  50    6.4           3.2          4.5           1.5          versicolor
  ...
  150   5.9           3.0          5.1           1.8          virginica

Classes (targets): the entries of the class column (setosa, versicolor, virginica).
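To make the nomenclature concrete, here is a minimal Python sketch (assuming scikit-learn is installed) that loads the Iris data as a feature matrix X of 150 samples x 4 features and a target vector y of integer class labels:

```python
# Minimal sketch: load the Iris dataset with scikit-learn
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target      # X: (150, 4) feature matrix, y: (150,) integer class labels

print(iris.feature_names)          # sepal/petal length and width (in cm)
print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']
print(X[0], y[0])                  # first sample and its class (0 = setosa)
```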
Classification

1) Learn from training data
[Figure: two classes (class1, class2) scattered in a two-dimensional feature space with axes x1 and x2]
[Supervised learning workflow diagram: raw data is pre-processed (feature extraction, feature scaling, feature selection, dimensionality reduction, sampling, handling of missing data) and split into a training dataset and a test dataset; a learning algorithm is fit to the training dataset and refined via cross validation, performance metrics, hyperparameter optimization, and model selection; the final model is evaluated on the test dataset and, after post-processing, used as the final classification/regression model for prediction on new data]
[Supervised learning workflow diagram repeated; the following slides cover the learning algorithm step.]
Naive Bayes
Decision Tree
K-Nearest Neighbor
Logistic Regression
Artificial Neural Network / Deep Learning
Support Vector Machine
Ensemble Methods: Random Forest, Bagging, AdaBoost
Discriminative Algorithms
Map x → y directly.
E.g., distinguish between people speaking different languages
without learning the languages.
Logistic Regression, SVM, Neural Networks
Generative Algorithms
Models a more general problem: how the data was generated.
I.e., the distribution of the class; joint probability distribution p(x,y).
Naive Bayes, Bayesian Belief Network classifier, Restricted
Boltzmann Machine
Discriminative Classifiers: Perceptron

[Diagram: inputs $x_{i1}$, $x_{i2}$ are combined with weights $w_1$, $w_2$ and a bias weight $w_0$ and passed through a threshold unit to produce the prediction $\hat{y}_i$, with $y \in \{-1, 1\}$]

Prediction:
$$\hat{y}_i = \begin{cases} \;\;\,1 & \text{if } \mathbf{w}^{T}\mathbf{x}_i \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

Update rule (learning rate $\eta$):
$$w_j := w_j + \eta \,(y_i - \hat{y}_i)\, x_{ij}$$

Repeat until $t + 1 = \text{max iter}$ or error $= 0$.
F. Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
[Diagram repeated: a constant input 1 and the features $x_{i1}$, $x_{i2}$ are weighted by $w_0$, $w_1$, $w_2$ to produce $\hat{y}_i \in \{-1, 1\}$; the resulting decision boundary is linear in the $(x_1, x_2)$ plane]
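As an illustration of the learning rule above, here is a minimal NumPy sketch of a Rosenblatt-style perceptron; the learning rate eta, the number of epochs, and the toy AND dataset are illustrative choices, not values from the slides:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_iter=10):
    """Perceptron learning rule; y must contain labels in {-1, 1}."""
    w = np.zeros(X.shape[1] + 1)                    # w[0] is the bias weight w0
    for _ in range(max_iter):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if (w[0] + np.dot(w[1:], xi)) >= 0 else -1
            update = eta * (yi - y_hat)             # zero when the prediction is correct
            w[1:] += update * xi
            w[0] += update
            errors += int(update != 0.0)
        if errors == 0:                             # stop early: error = 0
            break
    return w

# Toy example: the logical AND function is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
print(train_perceptron(X, y))                       # learned weights [w0, w1, w2]
```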
Generative Classifiers: Naive Bayes

Bayes' theorem:
$$P(\omega_j \mid \mathbf{x}_i) = \frac{P(\mathbf{x}_i \mid \omega_j)\, P(\omega_j)}{P(\mathbf{x}_i)}$$

$$\text{Posterior probability} = \frac{\text{Likelihood} \times \text{Prior probability}}{\text{Evidence}}$$

Iris example: $P(\text{"Setosa"} \mid \mathbf{x}_i)$, the posterior probability that sample $\mathbf{x}_i$ is a Setosa flower.
Generative Classifiers: Naive Bayes

Bayes' theorem:
$$P(\omega_j \mid \mathbf{x}_i) = \frac{P(\mathbf{x}_i \mid \omega_j)\, P(\omega_j)}{P(\mathbf{x}_i)}$$

Decision rule: predict the class with the largest posterior probability,
$$\hat{y} = \underset{j}{\arg\max}\; P(\omega_j \mid \mathbf{x}_i)$$
Generative Classifiers: Naive Bayes

$$P(\omega_j \mid \mathbf{x}_i) = \frac{P(\mathbf{x}_i \mid \omega_j)\, P(\omega_j)}{P(\mathbf{x}_i)}$$

- Evidence $P(\mathbf{x}_i)$: cancels out when comparing posteriors across classes.
- Prior probability (class frequency):
$$P(\omega_j) = \frac{N_{\omega_j}}{N_c}$$
- Class-conditional probability (here: Gaussian kernel), with features treated as independent:
$$P(x_{ik} \mid \omega_j) = \frac{1}{\sqrt{2\pi \sigma_{\omega_j}^2}} \exp\!\left( -\frac{(x_{ik} - \mu_{\omega_j})^2}{2\sigma_{\omega_j}^2} \right), \qquad P(\mathbf{x}_i \mid \omega_j) = \prod_k P(x_{ik} \mid \omega_j)$$
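As a code sketch of the same model, scikit-learn's GaussianNB fits exactly this kind of Gaussian class-conditional naive Bayes classifier; fitting on the full Iris data here is purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)        # estimates per-class priors, feature means, and variances

# Posterior probabilities P(class | x_i) for one flower, e.g. P("Setosa" | x_i);
# the evidence term is normalized away automatically.
print(nb.predict_proba(X[:1]))
print(nb.predict(X[:1]))           # decision rule: pick the class with the largest posterior
```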
Generative Classifiers:
Naive Bayes
Non-Parametric Classifiers:
K-Nearest Neighbor
- k nearest neighbors, e.g., k = 1 or k = 3
- Simple!
- Lazy learner
- Very susceptible to the curse of dimensionality

Iris example: C = 3 classes, k = 3, Mahalanobis distance, uniform weights
[Figure: k-NN decision regions for Setosa, Versicolor, and Virginica]
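A corresponding scikit-learn sketch; k = 3 and uniform weights mirror the slide, while the Mahalanobis setup (passing the covariance matrix V and using brute-force search) is an assumption about how one would reproduce that distance metric:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=3,
                           weights='uniform',
                           metric='mahalanobis',
                           metric_params={'V': np.cov(X.T)},
                           algorithm='brute')
knn.fit(X, y)              # "lazy learner": fit() essentially just stores the training data
print(knn.predict(X[:3]))  # majority vote among the 3 nearest neighbors
```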
Decision Tree

First split: petal length <= 2.45? This separates Setosa from the other two classes.
Second split: petal length <= 4.75? This separates Versicolor from Virginica.

[Figure: the resulting decision regions, for trees of depth = 2 and depth = 4]

Entropy:
$$H = -\sum_i p_i \log_k(p_i)$$
e.g., for two equally likely classes: $2 \times (-0.5 \log_2 0.5) = 1$

Information Gain = entropy(parent) - [avg entropy(children)]
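A scikit-learn sketch of the same idea; criterion='entropy' and max_depth=2 are chosen to mimic the tree above, and the printed rules show how the first split on a petal measurement already isolates Setosa:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned if/else rules (splits on petal measurements for the Iris data)
print(export_text(tree, feature_names=iris.feature_names))
```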
Which Algorithm?

Roughly speaking: no one model works best for all possible situations.
- What is the size and dimensionality of my training set?
- Is the data linearly separable?
- How much do I care about computational efficiency?
  - Model building vs. real-time prediction time
  - Eager vs. lazy learning / online vs. batch learning
  - Prediction performance vs. speed
- Do I care about interpretability, or should it "just work well"?
- ...
[Supervised learning workflow diagram repeated; the following slides cover pre-processing.]
Missing Values:
- Remove features (columns)
- Remove samples (rows)
- Imputation (mean, nearest neighbor, ...)

Sampling:
- Random split into training and validation sets
- Typically 60/40, 70/30, 80/20
- Don't use the validation set until the very end! (overfitting)
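A sketch of both steps with scikit-learn; the toy array, the mean-imputation strategy, and the 70/30 split are illustrative, and SimpleImputer/train_test_split are the current module paths (newer than the 2015-era API):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])

# Imputation: replace each missing value with its column (feature) mean.
# In a real workflow, fit the imputer on the training split only to avoid leakage.
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# Sampling: random 70/30 split into training and validation data
X_train, X_val, y_train, y_val = train_test_split(
    X_imputed, y, test_size=0.3, random_state=0)
print(X_train.shape, X_val.shape)
```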
Categorical Variables

Raw data:

     color   size   prize   class label
  0  green   M      10.1    class1
  1  red     L      13.5    class2
  2  blue    XL     15.3    class1

color is nominal, size is ordinal.

Encoding:
- color (nominal) → one-hot: green = (1,0,0), red = (0,1,0), blue = (0,0,1)
- size (ordinal) → order-preserving integers: M → 1, L → 2, XL → 3
- class label → integers: class1 → 0, class2 → 1

Encoded data:

     color=blue  color=green  color=red  prize   size   class label
  0  0           1            0          10.1    1      0
  1  0           0            1          13.5    2      1
  2  1           0            0          15.3    3      0
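A pandas sketch that reproduces these three encodings; the DataFrame literal is simply the table above re-typed:

```python
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red',   'L', 13.5, 'class2'],
                   ['blue',  'XL', 15.3, 'class1']],
                  columns=['color', 'size', 'prize', 'class label'])

# Ordinal feature: map size onto integers that preserve the order M < L < XL
df['size'] = df['size'].map({'M': 1, 'L': 2, 'XL': 3})

# Class labels: map onto integers (no order implied)
df['class label'] = df['class label'].map({'class1': 0, 'class2': 1})

# Nominal feature: one-hot encode color into color=blue / color=green / color=red
df = pd.get_dummies(df, columns=['color'], prefix='color', prefix_sep='=')
print(df)
```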
[Supervised learning workflow diagram repeated; the following slides cover evaluation and model selection.]
Error Metrics (here: setosa = positive class)

Confusion matrix:

                     predicted positive    predicted negative
  actual positive    TP                    FN
  actual negative    FP                    TN

$$\text{Accuracy} = \frac{TP + TN}{FP + FN + TP + TN} = 1 - \text{Error}$$

$$\text{False Positive Rate} = \frac{FP}{N} \qquad \text{True Positive Rate (Recall)} = \frac{TP}{P}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$
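These quantities map directly onto scikit-learn's metrics module; the y_true/y_pred arrays below are made-up toy predictions with 1 as the positive class (e.g., "setosa"):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels {0, 1}, scikit-learn's confusion matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print(recall_score(y_true, y_pred))      # true positive rate: TP / P
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(fp / (fp + tn))                    # false positive rate: FP / N
```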
Model Selection: Cross-Validation

Complete dataset → split into a training dataset and a test dataset.
The training dataset is further divided into k folds (here: fold 1, fold 2, fold 3, fold 4):

  1st iteration: train on folds 2-4, use fold 1 as the test set, calc. error
  2nd iteration: train on folds 1, 3, 4, use fold 2 as the test set, calc. error
  3rd iteration: train on folds 1, 2, 4, use fold 3 as the test set, calc. error
  4th iteration: train on folds 1, 2, 3, use fold 4 as the test set, calc. error

Then calculate the avg. error over all iterations.
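In scikit-learn this is a single call; the logistic regression model and cv=4 below simply mirror the four folds in the figure and are not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 4-fold cross-validation on the training set only; the test set stays untouched
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=4)
print(scores)          # one accuracy value per fold
print(scores.mean())   # average across the folds (1 - avg. error)
```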
Feature Selection

IMPORTANT! (Noise, overfitting, curse of dimensionality, efficiency)
- Domain knowledge
- Variance threshold
- Exhaustive search
- Decision trees
- ...

Simplest example: Greedy Backward Selection
- start: with the full set of d features
- repeat: drop the feature whose removal hurts performance the least
- stop: when the remaining number of features reaches the target (if d = k)
- e.g., result: X = [x1, x3]
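A minimal hand-rolled sketch of greedy backward selection using cross-validated accuracy as the criterion; the k-NN model, cv=5, and the target of k = 2 features are illustrative choices (in practice a ready-made implementation such as mlxtend's SequentialFeatureSelector would typically be used):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

features = list(range(X.shape[1]))   # start: all d features
k = 2                                # stop: when only k features remain

while len(features) > k:
    # Remove the feature whose removal hurts cross-validated accuracy the least
    def score_without(f):
        remaining = [j for j in features if j != f]
        return cross_val_score(model, X[:, remaining], y, cv=5).mean()
    features.remove(max(features, key=score_without))

print(features)   # column indices of the k surviving features
```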
Dimensionality Reduction
PCA in 3 Steps

0. Standardize the data:
$$z = \frac{x_{ik} - \mu_k}{\sigma_k}$$

1. Compute the covariance matrix:
$$\sigma_{jk} = \frac{1}{n-1} \sum_i (x_{ij} - \mu_j)(x_{ik} - \mu_k)$$

$$\Sigma = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{bmatrix}$$
PCA in 3 Steps

2. Eigendecomposition and sorting of the eigenvalues:
$$\Sigma \mathbf{v} = \lambda \mathbf{v}$$

First eigenvector (largest eigenvalue): [ 0.52237162, -0.26335492, 0.58125401, 0.56561105 ]
Eigenvalues (sorted): [ 2.93035378, 0.92740362, 0.14834223, 0.02074601 ]
PCA in 3 Steps

3. Select the top k eigenvectors and transform the data onto the new subspace.
Eigenvectors and eigenvalues as in step 2; e.g., keep the two eigenvectors belonging to the two largest eigenvalues (2.93035378 and 0.92740362).
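The three steps as a NumPy sketch on the standardized Iris data; with this standardization the printed eigenvalues should come out close to the values shown above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Step 0: standardize each feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# Step 1: covariance matrix (4 x 4 for the four Iris features)
cov_mat = np.cov(X_std.T)

# Step 2: eigendecomposition, then sort eigenpairs by decreasing eigenvalue
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals)                # approximately [2.93, 0.93, 0.15, 0.02]

# Step 3: keep the top k = 2 eigenvectors and project the data onto them
W = eig_vecs[:, :2]            # 4 x 2 projection matrix
X_pca = X_std.dot(W)           # 150 x 2 transformed data
print(X_pca.shape)
```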
Hyperparameter Optimization: GridSearch in scikit-learn

[Figure: decision regions of the tuned classifiers; annotations: C=1000, gamma=0.1; C=1; k=11, uniform weights]
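A sketch of an exhaustive grid search over the SVM hyperparameters named above; the parameter grid and cv=5 are illustrative, and the import path reflects the current scikit-learn API rather than the 2015-era sklearn.grid_search module:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.001, 0.01, 0.1, 1.0]}

# Try every C/gamma combination with 5-fold cross-validation and keep the best
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # e.g., a combination such as C=1000, gamma=0.1
print(grid.best_score_)    # mean cross-validated accuracy of the best setting
```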
Non-Linear Problems
- e.g., the XOR gate [figure: decision regions, depth = 4]

Kernel Trick
- Kernel function / kernel: map the data onto a higher-dimensional space (non-linear combinations of the original features)
- Trick: no explicit dot product!
- Radial Basis Function (RBF) kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\gamma \,\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$$
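A sketch of the kernel trick on the XOR-style problem mentioned above, using an RBF-kernel SVM; the synthetic data and the gamma value are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: label 1 when the two features have opposite signs
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)

# A linear decision boundary cannot separate this; the RBF kernel implicitly
# maps the data into a higher-dimensional space where it becomes separable,
# without ever computing that mapping explicitly (only kernel values are used).
svm = SVC(kernel='rbf', gamma=1.0, C=1.0)
svm.fit(X, y)
print(svm.score(X, y))   # training accuracy, typically close to 1.0
```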
Kernel PCA
[Supervised learning workflow diagram repeated as a closing recap.]
Thanks!
Questions?
@rasbt
[email protected]
https://github.com/rasbt
Additional Slides
Inspiring Literature
P. N. Klein. Coding the Matrix: Linear
Algebra Through Computer Science
Applications. Newtonian Press, 2013.
https://www.coursera.org/course/ml
http://stats.stackexchange.com
http://www.kaggle.com
My Favorite Tools
http://scikit-learn.org/stable/
http://www.numpy.org
http://pandas.pydata.org
Seaborn
http://stanford.edu/~mwaskom/software/seaborn/
http://ipython.org/notebook.html
[Additional figure: class1 vs. class2 decision regions, illustrating generalization error]