Lecture 1
Machine Learning
10715, Fall 2014
Introduction &
Linear Classifiers
Reading:
© Eric Xing @ CMU, 2014
Intro to Adv. ML 10-715
Class webpage:
http://www.cs.cmu.edu/~epxing/Class/10715/
Mailing Lists:
To contact the instructors: [email protected]
Class announcements list: [email protected]
TA:
Kirthevasan Kandasamy, GHC 8015
Veeranjaneyulu Sadhanala, GHC 8005
Guest Lecturers
Yaoliang Yu
Andrew Wilson
Class Assistant:
Mallory Deptola, GHC 8001, x8-5527
Logistics
4 homework assignments: 40% of grade
Theory exercises
Implementation exercises
Policies …
What is Learning
Learning is about seeking a predictive and/or executable understanding of natural/artificial subjects, phenomena, or activities from …
Examples:
  Apoptosis + Medicine
  Grammatical rules
  Manufacturing procedures
  Natural laws
Inference: what does this mean? … Any similar article? …
It is about
  representing;
  classifying, clustering and recognizing;
  reasoning under uncertainty;
  predicting;
  and reacting to
  …
complex, real world data, based on the system's own experience with data, and (hopefully) under a unified model or mathematical framework, that …
Example tasks and domains:
  Information retrieval
  Games
  Robotic control
  Pedigree
  Evolution
  Planning
Natural language processing and
speech recognition
Now most pocket Speech Recognizers or Translators are running
on some sort of learning device --- the more you play/use them, the
smarter they become!
A. Ng 2005
Text Mining
We want:
Bioinformatics
[Slide graphic: a long genomic DNA sequence with the phrase "where is the gene" embedded in it, illustrating the gene-finding problem.]
All software apps.
This ML niche is growing (why?)
Improved machine learning algorithms
Increased data capture, networking
Software too complex to write by hand
New sensors / IO devices
Demand for self-customization to user, environment
Paradigms of Machine Learning

Supervised Learning:
Given $D = \{X_i, Y_i\}$, learn $f(\cdot): Y_i = f(X_i)$, s.t. $D^{new} = \{X_j\} \Rightarrow \{Y_j\}$
(a small sketch of this setting follows below)

Unsupervised Learning:
Given $D = \{X_i\}$, learn $f(\cdot): Y_i = f(X_i)$, s.t. $D^{new} = \{X_j\} \Rightarrow \{Y_j\}$

Semi-supervised Learning

Reinforcement Learning:
Given $D = \{$env, actions, rewards, simulator/trace/real game$\}$, learn policy: $e, r \to a$ and utility: $a, e \to r$, s.t. $\{$env, new real game$\} \Rightarrow \{a_1, a_2, a_3, \ldots\}$

Active Learning:
Given $D \sim G(\cdot)$, learn $D^{new} \sim G'(\cdot)$ and $f(\cdot)$, s.t. $D^{all} \Rightarrow \{G'(\cdot), \text{policy}, Y_j\}$
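As a small sketch of the supervised setting defined above (the data and the particular choice of f are made up for illustration; here f is simply a nearest-class-mean rule):

import numpy as np

# made-up training set D = {X_i, Y_i}: 2-D points with labels 0/1
X = np.array([[0.1, 0.2], [0.3, 0.1], [2.0, 1.8], [1.7, 2.2]])
Y = np.array([0, 0, 1, 1])

def learn_f(X, Y):
    # "learning" here is just storing one mean per class; f predicts the closest mean
    means = {c: X[Y == c].mean(axis=0) for c in np.unique(Y)}
    return lambda x: min(means, key=lambda c: np.linalg.norm(x - means[c]))

f = learn_f(X, Y)
print(f(np.array([0.2, 0.0])), f(np.array([1.9, 2.0])))   # predictions for new points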
Hypothesis (classifier)

$p(X \mid Y = 1) = p_1(X;\, \mu_1, \Sigma_1)$
$p(X \mid Y = 2) = p_2(X;\, \mu_2, \Sigma_2)$

$P(Y \mid X) = \dfrac{P(X \mid Y)\, p(Y)}{P(X)}$

Likelihood Ratio: $\Lambda(X)$

Discriminant function: $h(X)$
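The slide leaves $\Lambda(X)$ and $h(X)$ implicit; a conventional way to write them for the two-class generative model above (my completion, not taken verbatim from the lecture) is

$\Lambda(X) = \dfrac{p(X \mid Y=1)}{p(X \mid Y=2)}, \qquad h(X) = \begin{cases} 1 & \text{if } \Lambda(X) > \dfrac{p(Y=2)}{p(Y=1)} \\ 2 & \text{otherwise,} \end{cases}$

i.e. the MAP decision rule obtained by comparing the likelihood ratio against the ratio of priors.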
Learning strategies
Generative Learning
Discriminative Learning
K-Nearest-Neighbor Classifier:
where h(X) is represented by all of the training data together with a query algorithm (no explicit parametric form)
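A minimal sketch of such a classifier, assuming Euclidean distance and a majority vote over the k nearest points (both choices are illustrative, not specified on the slide):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # distance from the query point to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # labels of the k closest training points, then a majority vote
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))   # -> 0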
$L(\theta) = P(x_1, x_2, \ldots, x_N;\, \theta) = P(x_1; \theta)\, P(x_2; \theta) \cdots P(x_N; \theta) = \prod_{n=1}^{N} P(x_n; \theta)$

3. Pick the setting of the parameters most likely to have generated the data we saw:

$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)$
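A small numerical illustration of this recipe, assuming (purely for the example) that the data come from a univariate Gaussian with unknown mean and unit variance:

import numpy as np
from scipy.optimize import minimize_scalar

# toy observations standing in for x_1 ... x_N
x = np.array([2.1, 1.9, 2.4, 2.2, 1.8])

def neg_log_likelihood(mu):
    # -log L(mu) for i.i.d. N(mu, 1) observations (additive constants dropped)
    return 0.5 * np.sum((x - mu) ** 2)

mu_hat = minimize_scalar(neg_log_likelihood).x
print(mu_hat, x.mean())   # the maximizer coincides with the sample mean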
How hard is it to learn the optimal
classifier?
How do we represent these? How many parameters?
Prior, P(Y):
Suppose Y is composed of k classes
Likelihood, P(X|Y):
Suppose X is composed of n binary features
e.g.,
Equivalent to:
More generally:

How many parameters now?
Suppose X is composed of m binary features

[Graphical model residue: a class node Y with feature nodes X1, X2, X3, X4.]
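For concreteness, the standard counts implied here (a worked note added for this write-up, with $k$ classes and $n$ binary features): $P(Y)$ needs $k - 1$ parameters; a full table for $P(X \mid Y)$ over all $2^n$ feature configurations needs $k\,(2^n - 1)$ parameters; under the Naïve Bayes (conditional independence) assumption this drops to $k\,n$ Bernoulli parameters, one per feature per class:

$\underbrace{(k-1)}_{P(Y)} + \underbrace{k\,(2^n - 1)}_{\text{full } P(X \mid Y)} \quad \text{vs.} \quad \underbrace{(k-1)}_{P(Y)} + \underbrace{k\,n}_{\text{Naïve Bayes}}$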
Decision rule: pick the class $k$ with the largest value of

$\pi_k\, \frac{1}{(2\pi |\Sigma|)^{1/2}} \exp\!\big\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k) \big\}$

Given a datum $x_n$, we predict its label using the conditional probability of the label given the datum:

$p(y_n^k = 1 \mid x_n, \mu, \Sigma) = \dfrac{\pi_k\, \frac{1}{(2\pi|\Sigma|)^{1/2}} \exp\{-\frac{1}{2}(x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k)\}}{\sum_{k'} \pi_{k'}\, \frac{1}{(2\pi|\Sigma|)^{1/2}} \exp\{-\frac{1}{2}(x_n - \mu_{k'})^T \Sigma^{-1} (x_n - \mu_{k'})\}}$
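A small sketch of this posterior computation, with made-up means, priors, and a shared covariance (the specific numbers are illustrative only):

import numpy as np
from scipy.stats import multivariate_normal

priors = np.array([0.6, 0.4])                            # pi_k
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]     # mu_k
cov = np.eye(2)                                          # shared Sigma

def class_posterior(x):
    # numerator pi_k * N(x; mu_k, Sigma) for each class, then normalize over classes
    joint = np.array([p * multivariate_normal.pdf(x, mean=m, cov=cov)
                      for p, m in zip(priors, means)])
    return joint / joint.sum()

print(class_posterior(np.array([1.0, 0.5])))             # posterior over the two classes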
A Gaussian Discriminative Naïve Bayes Classifier

When X is a multivariate Gaussian vector:   [Graphical model: $Y_n \rightarrow X_n$]

The joint probability of a datum and its label is:

$p(x_n, y_n^k = 1 \mid \mu, \Sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \Sigma) = \pi_k\, \frac{1}{(2\pi|\Sigma|)^{1/2}} \exp\!\big\{-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma^{-1}(x_n - \mu_k)\big\}$

For two classes, with one Gaussian per class and a shared, diagonal covariance, the posterior reduces to a logistic function of $x_n$:

$p(y_n^1 = 1 \mid x_n) = \dfrac{\pi_1 \prod_j N(x_{nj};\, \mu_j^1, \sigma_j)}{\pi_1 \prod_j N(x_{nj};\, \mu_j^1, \sigma_j) + \pi_2 \prod_j N(x_{nj};\, \mu_j^2, \sigma_j)}$

$= \dfrac{1}{1 + \exp\!\big\{-\sum_j \frac{1}{\sigma_j^2}\big[x_{nj}(\mu_j^1 - \mu_j^2) - \tfrac{1}{2}([\mu_j^1]^2 - [\mu_j^2]^2)\big] + \log\frac{1 - \pi_1}{\pi_1}\big\}}$

$= \dfrac{1}{1 + e^{-\theta^T x_n}}$

More generally, for K classes:

$p(y_n^k = 1 \mid x_n) = \dfrac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}$
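A quick numerical check of this reduction (a sketch with made-up Gaussian NB parameters, not values from the lecture), verifying that the posterior computed directly from the two Gaussians matches the logistic form with the derived weights:

import numpy as np

# made-up parameters: 2 classes, 3 features, shared per-feature variance sigma_j^2
pi1 = 0.3
mu1 = np.array([1.0, -0.5, 2.0])
mu2 = np.array([0.0, 0.5, 1.0])
sigma2 = np.array([0.5, 1.0, 2.0])

def posterior_direct(x):
    # p(y=1|x) from the two class-conditional Gaussians and the prior
    log_n1 = -0.5 * np.sum((x - mu1) ** 2 / sigma2)
    log_n2 = -0.5 * np.sum((x - mu2) ** 2 / sigma2)
    a = np.log(pi1) + log_n1
    b = np.log(1 - pi1) + log_n2
    return np.exp(a) / (np.exp(a) + np.exp(b))

def posterior_logistic(x):
    # sigmoid(theta^T x + theta_0) with weights derived from the Gaussian parameters
    theta = (mu1 - mu2) / sigma2
    theta0 = -0.5 * np.sum((mu1**2 - mu2**2) / sigma2) + np.log(pi1 / (1 - pi1))
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

x = np.array([0.7, 0.1, 1.5])
print(posterior_direct(x), posterior_logistic(x))   # the two values agree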
Summary:
The Naïve Bayes Algorithm

Train Naïve Bayes (examples):
for each* value $y_k$
  estimate $\pi_k \equiv P(Y = y_k)$
for each* value $x_{ij}$ of each attribute $X_i$
  estimate $\theta_{ijk} \equiv P(X_i = x_{ij} \mid Y = y_k)$

Classify ($X^{new}$):
$Y^{new} \leftarrow \arg\max_{y_k}\; \pi_k \prod_i P(X_i^{new} \mid Y = y_k)$
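A compact sketch of this train/classify loop for binary features, assuming add-one (Laplace) smoothing of the counts, which the summary above does not specify:

import numpy as np

def train_naive_bayes(X, y, n_classes):
    # X: (N, d) array of binary features, y: (N,) array of class labels in {0..K-1}
    priors = np.array([(y == k).mean() for k in range(n_classes)])          # pi_k
    # smoothed per-class Bernoulli parameters P(X_i = 1 | Y = k)
    theta = np.array([(X[y == k].sum(axis=0) + 1) / ((y == k).sum() + 2)
                      for k in range(n_classes)])
    return priors, theta

def classify(x_new, priors, theta):
    # log posterior (up to a constant) for each class, then argmax
    log_post = (np.log(priors)
                + (x_new * np.log(theta) + (1 - x_new) * np.log(1 - theta)).sum(axis=1))
    return np.argmax(log_post)

# tiny usage example with made-up data
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([0, 0, 1, 1])
priors, theta = train_naive_bayes(X, y, n_classes=2)
print(classify(np.array([1, 0, 0]), priors, theta))   # -> 0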
Discriminative classifiers:

Directly assume some functional form for P(Y|X)
This is a 'discriminative' model of the data!
Estimate parameters of P(Y|X) directly from training data

[Graphical model with nodes $Y_i$ and $X_i$]
The two-class posterior derived above from the Gaussian Naïve Bayes model takes the logistic form

$p(y_n^1 = 1 \mid x_n) = \dfrac{1}{1 + e^{-\theta^T x_n}}$

Logistic regression assumes exactly this functional form for the conditional:

$p(y = 1 \mid x) = \mu(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
Training data: $D = \{(x_n, y_n)\}_{n=1}^N$, with $y_n \in \{0, 1\}$

Data likelihood (written in log form) = $\ell(\theta) = \sum_n \big[ y_n \log \mu(x_n) + (1 - y_n) \log (1 - \mu(x_n)) \big]$

The gradient: $\nabla_\theta \ell(\theta) = \sum_n \big( y_n - \mu(x_n) \big)\, x_n$
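A minimal gradient-ascent sketch for this objective (the learning rate and iteration count are arbitrary illustration choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # X: (N, d) design matrix (include a column of 1s for a bias term), y: (N,) in {0,1}
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ theta)          # mu(x_n) for every training point
        theta += lr * X.T @ (y - mu)     # ascend the conditional log-likelihood
    return theta

# tiny usage example with made-up data (first column is the bias feature)
X = np.array([[1.0, 0.2], [1.0, 1.4], [1.0, -0.5], [1.0, 2.2]])
y = np.array([0, 1, 0, 1])
print(fit_logistic_regression(X, y))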
Classify webpages
Y = {Student, professor, project, …}
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
Test phase:
For each document xnew
Use naïve Bayes decision rule
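A small sketch of how such a word-count feature vector might be constructed (the vocabulary and the document below are made up for illustration):

from collections import Counter

vocabulary = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]

def count_vector(document, vocab=vocabulary):
    # lowercase, split on whitespace, and count occurrences of each vocabulary word
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocab]

doc = "About all the gas and oil exported from Zaire ... all about Africa"
print(count_vector(doc))   # -> [0, 2, 2, 1, 0, 1, 1, 1]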
Experiment:
Solve only a two-class subset: 1 vs 2.
1768 instances, 61188 features.
Use dimensionality reduction on the data (SVD).
Use 90% as training set, 10% as test set.
Test prediction error used as accuracy measure.

[Plot: test accuracy vs. training ratio for comp.windows.x vs. rec.motorcycles]
Results: Multiple Classes

[Plot: accuracy vs. training ratio for 5-out-of-20 classes, 10-out-of-20 classes, and all 20 classes]

• 30 features.
• A fixed test set.
• Training set varied from 10% to 100% of the training set.
• Number of dimensions of the data varied from 5 to 50 in steps of 5.
Side by side, the two conditional forms being compared:

Gaussian NB:
$p(y^k = 1 \mid x) = \dfrac{\pi_k \exp\!\big\{-\sum_j \big[\frac{1}{2\sigma_{k,j}^2}(x_j - \mu_{k,j})^2 + \log \sigma_{k,j}\big]\big\}}{\sum_{k'} \pi_{k'} \exp\!\big\{-\sum_j \big[\frac{1}{2\sigma_{k',j}^2}(x_j - \mu_{k',j})^2 + \log \sigma_{k',j}\big]\big\}}$

LR:
$\mu(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
Estimation method:
NB parameter estimates are uncoupled
LR parameter estimates are coupled
Implication: if we want … for some small constant $\epsilon_0$, it suffices to pick order $m$ examples
(the result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of m-dimensional linear separators is m)
Let
Logistic regression
Functional form follows from Naïve Bayes assumptions
For Gaussian Naïve Bayes, assuming the variance does not depend on the class ($\sigma_{i,k} = \sigma_i$)
For discrete-valued Naïve Bayes too
But the training procedure picks parameters without making the conditional independence assumption
Gradient ascent/descent
– General approach when closed-form solutions are unavailable