
6.86x Machine Learning with Python

This is a cheat sheet for machine learning based on the online course given by Prof. Tommi Jaakkola and Prof. Regina Barzilay. Compiled by Janus B. Advincula.
Last updated January 17, 2020.

Introduction to Machine Learning

What is machine learning? Machine learning as a discipline aims to design, understand and apply computer programs that learn from experience (i.e., data) for the purpose of modeling, prediction or control.
Types of Machine Learning

• Supervised learning: prediction based on examples of correct behavior.
• Unsupervised learning: no explicit target, only data; the goal is to model/discover structure.
• Semi-supervised learning: supplement limited annotations with unsupervised learning.
• Active learning: learn to query the examples actually needed for learning.
• Transfer learning: how to apply what you have learned from task A to task B.
• Reinforcement learning: learning to act, not just predict; the goal is to optimize the consequences of actions.

Linear Classifiers

Key Concepts

• feature vectors, labels: x ∈ R^d, y ∈ {−1, +1}
• training set: Sn = { (x^(i), y^(i)), i = 1, . . . , n }
• classifier: h : R^d → {−1, +1}, with χ+ = {x ∈ R^d : h(x) = +1} and χ− = {x ∈ R^d : h(x) = −1}
• set of classifiers: h ∈ H
• training error: En(h) = (1/n) Σ_{i=1}^n [[ h(x^(i)) ≠ y^(i) ]], where [[·]] is 1 if there is an error and 0 otherwise
• test error: E(h)

Linear Classifiers through the Origin We consider functions of the form

    h(x; θ) = sign(θ1 x1 + · · · + θd xd) = sign(θ · x).

The classifier predicts +1 where θ · x > 0 and −1 where θ · x < 0.

Linear Classifiers with Offset We can also consider functions of the form

    h(x; θ, θ0) = sign(θ · x + θ0),

where θ0 is the offset parameter.

Linear Separation Training examples Sn are linearly separable if there exists a parameter vector θ̂ and offset parameter θ̂0 such that y^(i) (θ̂ · x^(i) + θ̂0) > 0 for all i = 1, . . . , n.

Training Error The training error for a linear classifier is

    En(θ, θ0) = (1/n) Σ_{i=1}^n [[ y^(i) (θ · x^(i) + θ0) ≤ 0 ]].
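These definitions translate directly into code. Below is a minimal NumPy sketch (my own illustration, not course code; the toy data and parameter values are made up) that evaluates a linear classifier with offset and its training error.

    import numpy as np

    def predict(X, theta, theta_0):
        """Linear classifier with offset: sign(theta . x + theta_0)."""
        return np.sign(X @ theta + theta_0)

    def training_error(X, y, theta, theta_0):
        """Fraction of points with y * (theta . x + theta_0) <= 0."""
        agreement = y * (X @ theta + theta_0)
        return np.mean(agreement <= 0)

    # Toy example (made-up data): two points in R^2.
    X = np.array([[1.0, 2.0], [-1.0, -0.5]])
    y = np.array([1.0, -1.0])
    theta, theta_0 = np.array([0.5, 0.5]), -0.1
    print(training_error(X, y, theta, theta_0))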
Linear Classifier and Perceptron

Perceptron Algorithm

procedure PERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
    θ = 0 (vector)
    for t = 1, . . . , T do
        for i = 1, . . . , n do
            if y^(i) (θ · x^(i)) ≤ 0 then
                θ = θ + y^(i) x^(i)
    return θ

Perceptron Algorithm (with offset)

procedure PERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
    θ = 0 (vector), θ0 = 0
    for t = 1, . . . , T do
        for i = 1, . . . , n do
            if y^(i) (θ · x^(i) + θ0) ≤ 0 then
                θ = θ + y^(i) x^(i)
                θ0 = θ0 + y^(i)
    return θ, θ0

Convergence Assumptions:

• There exists θ* such that y^(i) (θ* · x^(i)) / ||θ*|| ≥ γ for all i = 1, . . . , n and some γ > 0.
• All examples are bounded: ||x^(i)|| ≤ R, i = 1, . . . , n.

Then the number k of updates made by the perceptron algorithm is bounded by R²/γ².
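A direct NumPy translation of the perceptron with offset might look like the following sketch (my own; the toy data is made up for illustration).

    import numpy as np

    def perceptron(X, y, T):
        """Perceptron with offset: X is (n, d), y contains +1/-1 labels."""
        n, d = X.shape
        theta = np.zeros(d)
        theta_0 = 0.0
        for t in range(T):
            for i in range(n):
                # Mistake-driven update: only change parameters on errors.
                if y[i] * (X[i] @ theta + theta_0) <= 0:
                    theta += y[i] * X[i]
                    theta_0 += y[i]
        return theta, theta_0

    # Linearly separable toy data (made up).
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
    y = np.array([1, 1, -1, -1])
    print(perceptron(X, y, T=10))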
Hinge Loss, Margin Boundaries and Regularization

Distance from a Line to a Point The perpendicular distance from a line with equation θ · x + θ0 = 0 to a point with coordinates x0 is

    d = |θ · x0 + θ0| / ||θ||.

Decision Boundary The decision boundary is the set of points x which satisfy

    θ · x + θ0 = 0.

Margin Boundary The margin boundary is the set of points x which satisfy

    θ · x + θ0 = ±1.

The margin of the i-th example (its signed distance from the decision boundary) is

    γi(θ, θ0) = y^(i) (θ · x^(i) + θ0) / ||θ||.

Hinge Loss

    Loss_h(z) = 0        if z ≥ 1
              = 1 − z    if z < 1,

with z = y^(i) (θ · x^(i) + θ0).

Regularization Maximize the margin:

    max 1/||θ||  ⇒  min (1/2) ||θ||².

Linear Classification and Generalization

1/||θ|| is the distance from the decision boundary to the margin boundary.

Objective function

    J(θ, θ0) = (1/n) Σ_{i=1}^n Loss_h( y^(i) (θ · x^(i) + θ0) ) + (λ/2) ||θ||²,

where λ is the regularization factor.
Stochastic Gradient Descent Select i ∈ {1, . . . , n} at random and update

    θ ← θ − ηt ∇θ [ Loss_h( y^(i) (θ · x^(i) + θ0) ) + (λ/2) ||θ||² ].

ηt is the learning rate, which can vary at every iteration.

Support Vector Machine

• The Support Vector Machine finds the maximum margin linear separator by solving the quadratic program that corresponds to J(θ, θ0).
• In the realizable case, if we disallow any margin violations, the quadratic program we have to solve is: find θ, θ0 that minimize (1/2) ||θ||² subject to

    y^(i) (θ · x^(i) + θ0) ≥ 1,  i = 1, . . . , n.
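The stochastic gradient descent update for J(θ, θ0) given above can be sketched in NumPy as follows (my own illustration, not course code; the hinge loss only contributes a gradient when the margin constraint is violated, and the offset update is included for completeness).

    import numpy as np

    def sgd_hinge(X, y, lam=0.1, eta=0.01, epochs=100, seed=0):
        """SGD on (1/n) sum Loss_h(y (theta.x + theta_0)) + lam/2 ||theta||^2."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta, theta_0 = np.zeros(d), 0.0
        for _ in range(epochs * n):
            i = rng.integers(n)                      # pick an example at random
            z = y[i] * (X[i] @ theta + theta_0)
            grad_theta = lam * theta                 # gradient of the regularizer
            grad_theta_0 = 0.0
            if z < 1:                                # hinge loss is active
                grad_theta -= y[i] * X[i]
                grad_theta_0 -= y[i]
            theta -= eta * grad_theta
            theta_0 -= eta * grad_theta_0            # offset is not regularized
        return theta, theta_0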
Nonlinear Classification, Linear Regression, Collaborative Filtering

Linear Regression

Empirical Risk

    Rn(θ) = (1/n) Σ_{i=1}^n (1/2) ( y^(i) − θ · x^(i) )²    (squared error)

Gradient-based Approach We can use stochastic gradient descent to find the minima of the empirical risk.

Algorithm Initialize θ = 0. Randomly pick i ∈ {1, . . . , n} and update

    θ = θ + η ( y^(i) − θ · x^(i) ) x^(i).

η is the learning rate.

Closed Form Solution Let

    A = (1/n) Σ_{i=1}^n x^(i) (x^(i))ᵀ   and   B = (1/n) Σ_{i=1}^n y^(i) x^(i).

Then

    θ̂ = A⁻¹ B.

In matrix notation, this is

    θ̂ = (Xᵀ X)⁻¹ Xᵀ Y.
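A quick NumPy check of the closed-form solution (a sketch only; for numerical stability one would typically prefer np.linalg.lstsq over forming an explicit inverse; the data below is made up).

    import numpy as np

    def least_squares(X, Y):
        """Closed-form linear regression: theta = (X^T X)^(-1) X^T Y."""
        return np.linalg.solve(X.T @ X, X.T @ Y)

    # Made-up data: Y is roughly 2*x1 - 1*x2 plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    Y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
    print(least_squares(X, Y))   # close to [2, -1]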
Generalization and Regularization

Ridge Regression: The loss function is

    Jλ,n(θ) = (λ/2) ||θ||² + Rn(θ),

where λ is the regularization factor. We can find its minima using the gradient-based approach.

Algorithm Initialize θ = 0. Randomly pick i ∈ {1, . . . , n} and update

    θ = (1 − ηλ) θ + η ( y^(i) − θ · x^(i) ) x^(i).

Nonlinear Classification

Feature Transformation

    x ↦ φ(x),   θ · x → θ · φ(x).

Non-linear Classification

    h(x; θ, θ0) = sign( θ · φ(x) + θ0 ).

Kernel Function A kernel function is simply an inner product between two feature vectors:

    K(x, x') = φ(x) · φ(x').

Using kernels is advantageous when the inner products are faster to evaluate than using explicit vectors (e.g., when the vectors would be infinite dimensional).

Composition rules:

1. K(x, x') = 1 is a kernel function.
2. Let f : R^d → R and let K(x, x') be a kernel. Then so is K̃(x, x') = f(x) K(x, x') f(x').
3. If K1(x, x') and K2(x, x') are kernels, then K(x, x') = K1(x, x') + K2(x, x') is a kernel.
4. If K1(x, x') and K2(x, x') are kernels, then K(x, x') = K1(x, x') K2(x, x') is a kernel.

Radial Basis Kernel

    K(x, x') = exp( −(1/2) ||x − x'||² ).

Perceptron Running the perceptron update on the transformed examples,

    θ = 0
    for i = 1, . . . , n do
        if y^(i) (θ · φ(x^(i))) ≤ 0 then
            θ ← θ + y^(i) φ(x^(i)),

this algorithm gives

    θ = Σ_{j=1}^n αj y^(j) φ(x^(j)),

where αj is the number of mistakes made on example j. For the offset parameter, we get

    θ0 = Σ_{j=1}^n αj y^(j).

Kernel Perceptron Algorithm We can reformulate the perceptron algorithm so that we initialize and update the αj's instead of θ, using

    θ · φ(x^(i)) = Σ_{j=1}^n αj y^(j) φ(x^(j)) · φ(x^(i)) = Σ_{j=1}^n αj y^(j) K(x^(j), x^(i)).

procedure KERNELPERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
    Initialize α1, . . . , αn to some values
    for t = 1, . . . , T do
        for i = 1, . . . , n do
            if y^(i) Σ_{j=1}^n αj y^(j) K(x^(j), x^(i)) ≤ 0 then
                αi = αi + 1
    return α1, . . . , αn

The initialization θ = 0 is equivalent to α1 = · · · = αn = 0.

Decision Boundary The decision boundary satisfies

    Σ_{j=1}^n αj y^(j) K(x^(j), x) = 0.

Other non-linear classifiers

• We can get non-linear classifiers or regression methods by simply mapping examples into feature vectors non-linearly, and applying a linear method on the resulting vectors.
• These feature vectors can be high dimensional.
• We can turn the linear methods into kernel methods by casting the computations in terms of inner products.
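The kernel perceptron can be written almost directly from the pseudocode above. The sketch below is my own illustration (not course code); it uses the radial basis kernel, and alpha[i] counts the mistakes made on example i.

    import numpy as np

    def rbf_kernel(x, xp):
        return np.exp(-0.5 * np.sum((x - xp) ** 2))

    def kernel_perceptron(X, y, T, kernel=rbf_kernel):
        n = len(X)
        alpha = np.zeros(n)          # alpha_j = number of mistakes on example j
        K = np.array([[kernel(X[j], X[i]) for i in range(n)] for j in range(n)])
        for _ in range(T):
            for i in range(n):
                # Implicit theta . phi(x_i) = sum_j alpha_j y_j K(x_j, x_i)
                if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                    alpha[i] += 1
        return alpha

    def kernel_predict(x, X, y, alpha, kernel=rbf_kernel):
        return np.sign(sum(a * yj * kernel(xj, x)
                           for a, yj, xj in zip(alpha, y, X)))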
Recommender Systems

Problem Description We are given a matrix where each row corresponds to a user's ratings of movies, for example, and each column corresponds to the user ratings for a particular movie (it can also be product ratings, etc.); the entry Yai is user a's rating of movie i. This matrix will be very sparse. The goal is to predict user ratings for those movies that are yet to be rated.

K-Nearest Neighbor Method The K-Nearest Neighbor method makes use of ratings by K other similar users when predicting Yai. Let KNN(a) be the set of K users similar to user a, and let sim(a, b) be a similarity measure between users a and b ∈ KNN(a). The KNN method predicts a rating Yai to be

    Ŷai = Σ_{b ∈ KNN(a)} sim(a, b) Ybi / Σ_{b ∈ KNN(a)} sim(a, b).

The similarity measure sim(a, b) could be any distance function between the feature vectors xa and xb:

• Euclidean distance: ||xa − xb||
• Cosine similarity: cos θ = (xa · xb) / (||xa|| ||xb||)

Collaborative Filtering Our goal is to come up with a matrix X that has no blank entries and whose (a, i)-th entry Xai is the prediction of user a's rating of movie i. Let D be the set of all (a, i)'s for which a user rating Yai exists. A naive approach is to minimize the objective function

    J(X) = Σ_{(a,i) ∈ D} (1/2) (Yai − Xai)² + (λ/2) Σ_{(a,i)} Xai².

The results are

    X̂ai = Yai / (1 + λ)   for (a, i) ∈ D,
    X̂ai = 0               for (a, i) ∉ D.

The problem with this approach is that there is no connection between the entries of X. We can impose an additional constraint on X:

    X = U Vᵀ

for some n × d matrix U and m × d matrix V, where d is the rank of the matrix X.

Alternating Minimization Assume that U and V are rank k matrices. Then, we can write the objective function as

    J(X) = Σ_{(a,i) ∈ D} (1/2) ( Yai − [U Vᵀ]ai )² + (λ/2) ( Σ_{a,k} Uak² + Σ_{i,k} Vik² ).

To find the solution, we fix (initialize) U (or V) and minimize the objective with respect to V (or U). We plug the result back into the objective and minimize it with respect to U (or V). We repeat this alternating process until there is no change in the objective function.

Example Consider the case k = 1. Then Ua1 = ua and Vi1 = vi. If we initialize ua to some values, then we have to optimize the function

    Σ_{(a,i) ∈ D} (1/2) (Yai − ua vi)² + (λ/2) Σ_i vi².
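For the k = 1 case, each alternating step has a closed-form solution obtained by setting the derivative with respect to a single ua or vi to zero. The following sketch (my own, with a made-up observation mask; not course code) alternates those two closed-form updates.

    import numpy as np

    def alternating_min_rank1(Y, mask, lam=1.0, iters=50):
        """Rank-1 collaborative filtering: Y ~ outer(u, v) on observed entries.

        Y:    (n_users, m_movies) rating matrix (arbitrary values where unobserved)
        mask: boolean matrix, True where a rating Y[a, i] is observed
        """
        n, m = Y.shape
        u = np.ones(n)
        v = np.ones(m)
        for _ in range(iters):
            # Fix u, solve for each v_i: min sum_a 1/2 (Y_ai - u_a v_i)^2 + lam/2 v_i^2
            for i in range(m):
                obs = mask[:, i]
                v[i] = (Y[obs, i] @ u[obs]) / (lam + np.sum(u[obs] ** 2))
            # Fix v, solve for each u_a symmetrically.
            for a in range(n):
                obs = mask[a, :]
                u[a] = (Y[a, obs] @ v[obs]) / (lam + np.sum(v[obs] ** 2))
        return u, v   # the predicted matrix is np.outer(u, v)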
Neural Networks

Introduction to Feedforward Neural Networks

A Unit in a Neural Network A neural network unit is a primitive neural network that consists of only the input layer, and an output layer with only one output. It computes a non-linear weighted combination of its input:

    ŷ = f(z),   where   z = w0 + Σ_{i=1}^d xi wi,

where the wi are the weights, z is a number (the weighted sum of the inputs xi), and f is generally a non-linear function called the activation function. Examples:

• Linear function: f(z) = z
• Rectified Linear Unit (ReLU): f(z) = max{0, z}
• Hyperbolic tangent: tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) = 1 − 2 / (e^{2z} + 1)

Deep Neural Networks A deep (feedforward) neural network refers to a neural network that contains not only the input and output layers, but also hidden layers in between (the original shows a deep feedforward network with 2 hidden layers of 5 units each).

One Hidden Layer Model Consider a network with two inputs x1, x2, one hidden layer of two tanh units, and a linear output unit:

    z1 = Σ_{j=1}^2 xj Wj1 + W01,    z2 = Σ_{j=1}^2 xj Wj2 + W02,
    f1 = f(z1) = tanh(z1),          f2 = f(z2) = tanh(z2),
    z  = f1 w1′ + f2 w2′,           f = f(z) = z.

Neural Signal Transformation We can visualize what the hidden layer is doing similarly to a linear classifier: the weight vectors W⃗1 = (W11, W21) and W⃗2 = (W12, W22) map the input onto the f1–f2 axes.

Hidden Layer Representation (figures in the original) The training examples plotted in the f1–f2 coordinates, for hidden layer units with linear, tanh, and ReLU activations.
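A forward pass through the one-hidden-layer model above is a few lines of NumPy (a sketch with made-up weights, not course code).

    import numpy as np

    def forward(x, W, W0, w_out):
        """One hidden layer of tanh units followed by a linear output unit.
        x: inputs (2,); W[j, k] = W_jk; W0: biases (W_01, W_02); w_out: (w1', w2')."""
        z_hidden = x @ W + W0          # z_k = sum_j x_j W_jk + W_0k
        f_hidden = np.tanh(z_hidden)   # f_k = tanh(z_k)
        return f_hidden @ w_out        # z = f_1 w_1' + f_2 w_2', f(z) = z

    x = np.array([1.0, -1.0])
    W = np.array([[1.0, -2.0], [0.5, 1.5]])
    print(forward(x, W, W0=np.zeros(2), w_out=np.array([1.0, 1.0])))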
Summary

• Units in neural networks are linear classifiers, just with different output non-linearity.
• The units in feedforward neural networks are arranged in layers.
• By learning the parameters associated with the hidden layer units, we learn how to represent examples (as hidden layer activations).
• The representations in neural networks are learned directly to facilitate the end-to-end task.
• A simple classifier (output unit) suffices to solve complex classification tasks if it operates on the hidden layer representations.

Feedforward Neural Networks, Back Propagation, and Stochastic Gradient Descent (SGD)
Simple Example This simple neural network is made up of L hidden layers, but each layer consists of only one unit, and each unit has activation function f:

    x → (w1) → z1 → f1 → · · · → (wL) → zL → fL → y.

The quantities are

    z1 = x w1,            f1 = tanh(x w1),
    zi = f_{i−1} wi,      fi = f(zi) = tanh(zi)   for i = 2, . . . , L,
    L(y, fL) = Loss(y, fL) = (1/2) (y − fL)²,

where y is the true value and fL is the output of the neural network.

Gradient Descent The gradient descent update rule for the parameter wi is

    wi ← wi − η · ∇_{wi} L(y, fL),

where η is the learning rate. For instance, we have

    ∂L/∂w1 = (∂f1/∂w1) (∂L/∂f1),
    ∂f1/∂w1 = (1 − tanh²(x w1)) x = (1 − f1²) x,
    ∂L/∂f1 = (∂L/∂f2) (∂f2/∂f1) = (∂L/∂f2) (1 − f2²) w2.

Thus, when we back-propagate, we get

    ∂L/∂w1 = x (1 − f1²) · · · (1 − fL²) w2 · · · wL (fL − y).

Note that the above derivation applies to tanh activation.

Backpropagation Consider a general L-layer neural network. We have the following notations:

• b_j^ℓ is the bias of the j-th neuron in the ℓ-th layer.
• a_j^ℓ is the activation of the j-th neuron in the ℓ-th layer.
• w_jk^ℓ is the weight for the connection from the k-th neuron in the (ℓ−1)-th layer to the j-th neuron in the ℓ-th layer.

If the activation function is f and the loss function we are minimizing is C, then the equations describing the network are:

    a_j^ℓ = f( Σ_k w_jk^ℓ a_k^{ℓ−1} + b_j^ℓ ),    Loss = C(a^L).

Let the weighted inputs to the d neurons in layer ℓ be defined as z^ℓ ≡ w^ℓ a^{ℓ−1} + b^ℓ, where z^ℓ ∈ R^d. Then the activation of layer ℓ is also written as a^ℓ ≡ f(z^ℓ). Also, let δ_j^ℓ ≡ ∂C/∂z_j^ℓ denote the error of neuron j in layer ℓ, so that δ^ℓ ∈ R^d denotes the full vector of errors associated with layer ℓ.

Equations of Backpropagation

    δ^L = ∇_a C ⊙ f′(z^L)
    δ^ℓ = ( (w^{ℓ+1})ᵀ δ^{ℓ+1} ) ⊙ f′(z^ℓ)
    ∂C/∂b_j^ℓ = δ_j^ℓ
    ∂C/∂w_jk^ℓ = a_k^{ℓ−1} δ_j^ℓ

The symbol ⊙ represents the Hadamard (element-wise) product:

    [a b; c d] ⊙ [e f; g h] = [ae bf; cg dh].
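The four backpropagation equations translate directly into code. Below is a minimal NumPy sketch (my own illustration, not course code) of one gradient step for a network with one hidden tanh layer and a linear output, trained on the squared loss C = 1/2 (y − a²)².

    import numpy as np

    def backprop_step(x, y, W1, b1, W2, b2, eta=0.1):
        """One gradient step for a 2-layer net: tanh hidden layer, linear output."""
        # Forward pass.
        z1 = W1 @ x + b1          # weighted inputs of layer 1
        a1 = np.tanh(z1)          # activations of layer 1
        z2 = W2 @ a1 + b2         # weighted input of the output layer
        a2 = z2                   # linear output activation, f(z) = z

        # Backward pass (equations of backpropagation).
        delta2 = (a2 - y) * 1.0                    # delta^L = grad_a C (.) f'(z^L), f' = 1
        delta1 = (W2.T @ delta2) * (1 - a1 ** 2)   # delta^l = (w^{l+1}^T delta^{l+1}) (.) tanh'(z^l)

        # Gradients: dC/dw_jk = a_k^{l-1} delta_j^l, dC/db_j = delta_j^l.
        W2 -= eta * np.outer(delta2, a1)
        b2 -= eta * delta2
        W1 -= eta * np.outer(delta1, x)
        b1 -= eta * delta1
        return W1, b1, W2, b2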
Recurrent Neural Networks

Temporal/Sequence Problems

• Sequence prediction problems can be recast in a form amenable to feedforward neural networks.
• We have to engineer how history is mapped to a vector (representation). This vector is then fed into, e.g., a neural network.
• We would like to learn how to encode the history into a vector.

Key Concepts

• Encoding – e.g., mapping a sequence to a vector
• Decoding – e.g., mapping a vector to, e.g., a sequence

Example: Encoding Sentences Introduce adjustable "lego pieces" and optimize them for end-to-end performance. Let's say we want to encode the incomplete sentence "Efforts and courage are not". First, we have to represent the first word as a vector (say, a one-hot vector). This will be x1. Then,

    s1 = tanh( W^{s,x} x1 ).

The second word will be x2, and we compute s2 as

    s2 = tanh( W^{s,s} s1 + W^{s,x} x2 ).

We continue this process until we have encoded all the words in the sentence: the final state is the sentence encoded as a vector, with the same "lego piece" (encoder) applied at every step.

Differences from standard feedforward architecture

• Input is received at each layer (per word), not just at the beginning as in a typical feedforward network.
• The number of layers varies and depends on the length of the sentence.
• Parameters of each layer (representing an application of an RNN) are shared (same RNN at each step).

Basic RNN The new state (context) st is computed from the previous state s_{t−1} and the new information xt:

    st = tanh( W^{s,s} s_{t−1} + W^{s,x} xt ).

Simple Gated RNN

    gt = sigmoid( W^{g,s} s_{t−1} + W^{g,x} xt )
    st = (1 − gt) ⊙ s_{t−1} + gt ⊙ tanh( W^{s,s} s_{t−1} + W^{s,x} xt )

Long Short-Term Memory (LSTM)

    ft = sigmoid( W^{f,h} h_{t−1} + W^{f,x} xt )    forget gate
    it = sigmoid( W^{i,h} h_{t−1} + W^{i,x} xt )    input gate
    ot = sigmoid( W^{o,h} h_{t−1} + W^{o,x} xt )    output gate
    ct = ft ⊙ c_{t−1} + it ⊙ tanh( W^{c,h} h_{t−1} + W^{c,x} xt )    memory cell
    ht = ot ⊙ tanh(ct)    visible state
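The LSTM equations map one-to-one onto code. A minimal NumPy sketch of a single step (my own illustration; the dictionary of weight matrices and the omission of biases are assumptions made for brevity):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W):
        """One LSTM step. W is a dict of weight matrices keyed like the equations,
        e.g. W['f,h'] multiplies h_{t-1} in the forget gate (biases omitted)."""
        f_t = sigmoid(W['f,h'] @ h_prev + W['f,x'] @ x_t)     # forget gate
        i_t = sigmoid(W['i,h'] @ h_prev + W['i,x'] @ x_t)     # input gate
        o_t = sigmoid(W['o,h'] @ h_prev + W['o,x'] @ x_t)     # output gate
        c_t = f_t * c_prev + i_t * np.tanh(W['c,h'] @ h_prev + W['c,x'] @ x_t)  # memory cell
        h_t = o_t * np.tanh(c_t)                              # visible state
        return h_t, c_t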
Markov Language Models Let V denote the set of possible words/symbols, which includes

• an UNK symbol for any unknown word (out of vocabulary),
• a ⟨beg⟩ symbol for specifying the start of a sentence,
• an ⟨end⟩ symbol for specifying the end of the sentence.

First-order Markov Model In a first-order Markov model (bigram model), the next symbol only depends on the previous one. Each symbol (except ⟨beg⟩) in the sequence is predicted using the same conditional probability table until an ⟨end⟩ symbol is seen. The probability associated to the sentence is

    Π_i P( wi | w_{i−1} ).

Maximum Likelihood Estimation The goal is to maximize the probability that the model can generate all the observed sentences (corpus S),

    s ∈ S,   s = w1^s, w2^s, . . . , w_{|s|}^s,

    ℓ = Σ_{s ∈ S} Σ_{i=1}^{|s|} log P( wi^s | w_{i−1}^s ).

The maximum likelihood estimate is obtained as normalized counts of successive word occurrences (matching statistics):

    P̂( w′ | w ) = count(w, w′) / Σ_{w̃} count(w, w̃).
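Estimating the bigram table is just counting. A small Python sketch (my own illustration) of the normalized-count estimate:

    from collections import Counter, defaultdict

    def bigram_mle(corpus):
        """corpus: list of sentences, each a list of word strings.
        Returns P_hat[w][w_next] = count(w, w_next) / sum_w~ count(w, w~)."""
        counts = defaultdict(Counter)
        for sentence in corpus:
            tokens = ["<beg>"] + sentence + ["<end>"]
            for w, w_next in zip(tokens, tokens[1:]):
                counts[w][w_next] += 1
        return {w: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
                for w, nxt in counts.items()}

    corpus = [["efforts", "and", "courage", "are", "not", "enough"]]
    print(bigram_mle(corpus)["<beg>"])   # {'efforts': 1.0}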
Feature-based Markov Model We can also represent the Markov model as a feedforward neural network (very extendable). We define a one-hot vector φ(w_{i−1}) corresponding to the previous word. This will be an input to the feedforward neural network, whose output

    pk = P( wi = k | w_{i−1} )

is the probability of the next word given the previous word. The aggregate input to the k-th output unit is

    zk = Σ_j xj Wjk + W0k.

These input values are not probabilities. A typical transformation is the softmax transformation:

    pk = e^{zk} / Σ_j e^{zj}.

RNNs for Sequences Our RNN now also produces an output (e.g., a word) as well as updating its state, with the previous output fed back as an input:

    st = tanh( W^{s,s} s_{t−1} + W^{s,x} xt )    state
    pt = softmax( W^{o} st )                     output distribution

Decoding (figure in the original) Starting from a vector encoding of a sentence ("I have seen better lectures."), the RNN produces at each step a distribution pt over the possible words, a word is sampled and fed back in, and the process repeats until ⟨end⟩ is produced; in the figure the sampled words form the Finnish sentence "Olen nähnyt parempia luentoja" ("I have seen better lectures").

Convolutional Neural Networks

Problem Image classification.

• The presence of objects may vary in location across different images.

Patch classifier/filter The patch classifier goes through the entire image. We can think of the weights as the image that the unit prefers to see.

Convolution The convolution is an operation between two functions f and g:

    (f ∗ g)(t) ≡ ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ.

Intuitively, convolution blends the two functions f and g by expressing the amount of overlap of one function as it is shifted over another function.

Discrete Convolution For discrete functions, we can define the convolution as

    (f ∗ g)[n] ≡ Σ_{m=−∞}^{+∞} f[m] g[n − m].

Example: Image Quantization (figure in the original).

Pooling We wish to know whether a feature was there, but not exactly where it was.

Pooling (Max) The pooling region and stride may vary.

• Pooling induces translation invariance at the cost of spatial resolution.
• Stride reduces the size of the resulting feature map.

Example of CNN From LeCun (2013).
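Discrete convolution and max pooling are easy to state in code. A 1-D NumPy sketch (my own illustration):

    import numpy as np

    def conv1d(f, g):
        """Discrete convolution (f * g)[n] = sum_m f[m] g[n - m]."""
        n_out = len(f) + len(g) - 1
        out = np.zeros(n_out)
        for n in range(n_out):
            for m in range(len(f)):
                if 0 <= n - m < len(g):
                    out[n] += f[m] * g[n - m]
        return out        # same result as np.convolve(f, g)

    def max_pool1d(x, size=2, stride=2):
        """Max pooling: keep the maximum of each window; stride shrinks the map."""
        return np.array([x[i:i + size].max()
                         for i in range(0, len(x) - size + 1, stride)])

    print(conv1d(np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0])))  # [0. 1. 2. 3.]
    print(max_pool1d(np.array([1.0, 3.0, 2.0, 5.0])))               # [3. 5.]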
Unsupervised Learning

Clustering

Training set We are provided a training set but with no labels,

    Sn = { x^(i), i = 1, . . . , n },

and the goal is to find structure in the data. Example: Google News.

Clustering: Input

• Set of feature vectors Sn = { x^(i), i = 1, . . . , n }
• The number of clusters K

Clustering: Output

• A partition of indices {1, . . . , n} into K sets, C1, . . . , CK
• Representatives in each of the K partition sets, given as z1, . . . , zK

Partition A partition of a set is a grouping of the set's elements into non-empty subsets, in such a way that every element is included in one and only one of the subsets. In other words, C1, . . . , CK is a partition of {1, . . . , n} if and only if

    C1 ∪ · · · ∪ CK = {1, . . . , n}   and   Ci ∩ Cj = ∅ for any i ≠ j in {1, . . . , K}.

Cost We can calculate the total cost by summing the cost of each cluster:

    Cost(C1, . . . , CK) = Σ_{j=1}^K Cost(Cj).

Similarity Measure We use the Euclidean distance between the elements of a cluster and its representative to calculate the cost for each cluster. Then, the total cost is

    Cost(C1, . . . , CK, z1, . . . , zK) = Σ_{j=1}^K Σ_{i ∈ Cj} ||x^(i) − zj||².
K-Means Algorithm

1. Randomly select z1, . . . , zK.
2. Iterate:
   (a) Given z1, . . . , zK, assign each data point x^(i) to the closest zj, so that

       Cost(z1, . . . , zK) = Σ_{i=1}^n min_{j=1,...,K} ||x^(i) − zj||².

   (b) Given C1, . . . , CK, find the best representatives z1, . . . , zK, i.e., find z1, . . . , zK such that

       zj = argmin_z Σ_{i ∈ Cj} ||x^(i) − z||² = (1/|Cj|) Σ_{i ∈ Cj} x^(i).

K-Medoids Algorithm The K-Medoids algorithm is a variation of the K-Means algorithm that addresses some of the K-Means algorithm's limitations.

1. Randomly select {z1, . . . , zK} ⊆ {x^(1), . . . , x^(n)}.
2. Iterate:
   (a) Given z1, . . . , zK, assign each x^(i) to the closest zj, so that

       Cost(z1, . . . , zK) = Σ_{i=1}^n min_{j=1,...,K} dist(x^(i), zj).

   (b) Given Cj ∈ {C1, . . . , CK}, find the best representative zj ∈ {x^(1), . . . , x^(n)} such that

       Σ_{x^(i) ∈ Cj} dist(x^(i), zj)

       is minimal.
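The two K-Means steps (assign, then re-estimate the representatives) look like this in NumPy (a bare-bones sketch, my own; it does not handle empty clusters):

    import numpy as np

    def kmeans(X, K, iters=100, seed=0):
        """Plain K-Means: X is (n, d); returns representatives z (K, d) and assignments."""
        rng = np.random.default_rng(seed)
        z = X[rng.choice(len(X), size=K, replace=False)]   # step 1: random representatives
        for _ in range(iters):
            # (a) assign each point to the closest representative
            dists = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            # (b) move each representative to the mean of its cluster
            z = np.array([X[assign == j].mean(axis=0) for j in range(K)])
        return z, assign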
Generative Models

Generative vs. Discriminative Models Generative models work by explicitly modeling the probability distribution of each of the individual classes in the training data. Discriminative models learn an explicit decision boundary between classes.

Simple Multinomial Generative Model Consider a multinomial model M to generate text documents. Assume that M has a fixed vocabulary W and that we generate a document by sampling one word at a time from this vocabulary. Furthermore, all the words that are generated by M are independent of each other. We denote the probability that M generates a certain word w ∈ W by

    P(w | θ) = θw,   θw ≥ 0,   Σ_{w ∈ W} θw = 1.

Then, the probability of generating the document D is

    P(D | θ) = Π_{i=1}^n θ_{wi} = Π_{w ∈ W} θw^{count(w)}.

Maximum Likelihood Estimate The log-likelihood for the model is

    ℓ = log P(D | θ) = Σ_{w ∈ W} count(w) log θw,

and the maximum likelihood estimate is

    θ̂w = count(w) / Σ_{w′ ∈ W} count(w′).

Prediction Consider using a multinomial generative model M for the task of binary classification consisting of two classes: + (positive class) and − (negative class), with

• θ+ : parameter for the positive class,
• θ− : parameter for the negative class.

Suppose that we classify a new document D to belong to the positive class if and only if

    log [ P(D | θ+) / P(D | θ−) ] ≥ 0.

The generative classifier is equivalent to a linear classifier:

    log [ P(D | θ+) / P(D | θ−) ] = Σ_{w ∈ W} count(w) log( θw+ / θw− ) = Σ_{w ∈ W} count(w) θ′w,

with θ′w = log( θw+ / θw− ).

Prior, Posterior and Likelihood In the above discussion, there is an assumption that the likelihood of being in one of the classes is the same. However, we may have some prior knowledge and we want to incorporate it into our model. The posterior distribution for the positive class is then

    P(y = + | D) = P(D | θ+) P(y = +) / P(D).

The generative classifier becomes

    log [ P(y = + | D) / P(y = − | D) ] = Σ_{w ∈ W} count(w) θ′w + θ′0,

where θ′w = log( θw+ / θw− ) and θ′0 = log( P(y = +) / P(y = −) ).

Gaussian Generative Models The likelihood of x ∈ R^d being generated by a Gaussian with mean µ and standard deviation σ is

    fX(x | µ, σ²) = 1 / (2πσ²)^{d/2} · exp( −||x − µ||² / (2σ²) ).

MLE for the Mean

    µ̂ = (1/n) Σ_{i=1}^n x^(i).

MLE for the Variance

    σ̂² = (1/(nd)) Σ_{i=1}^n ||x^(i) − µ||².
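As a concrete illustration of the multinomial generative classifier described above (count the words, then threshold the log-odds), here is a small sketch of my own; the add-one smoothing is an assumption introduced only to avoid log(0), and the toy documents are made up.

    import numpy as np
    from collections import Counter

    def fit_multinomial(docs, vocab):
        """MLE theta_w = count(w) / total, with add-one smoothing (assumption)."""
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts[w] + 1 for w in vocab)
        return {w: (counts[w] + 1) / total for w in vocab}

    def classify(doc, theta_pos, theta_neg, log_prior_ratio=0.0):
        """Positive iff sum_w count(w) log(theta+_w / theta-_w) + prior term >= 0."""
        score = sum(np.log(theta_pos[w] / theta_neg[w]) for w in doc) + log_prior_ratio
        return +1 if score >= 0 else -1

    vocab = {"good", "bad", "movie"}
    theta_pos = fit_multinomial([["good", "movie"]], vocab)
    theta_neg = fit_multinomial([["bad", "movie"]], vocab)
    print(classify(["good", "good", "movie"], theta_pos, theta_neg))   # +1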
Mixture Models; EM Algorithm

Gaussian Mixture Models Instead of just a single Gaussian, we have a mixture of Gaussian components. Assume that there are K Gaussians with known means and variances, and that the mixture weights p1, . . . , pK are known. The likelihood of an observation x obtained from the model is

    p(x | θ) = Σ_{j=1}^K pj N( x; µ^(j), σj² I ).

For the training set Sn = { x^(i), i = 1, . . . , n }, the likelihood is

    P(Sn | θ) = Π_{i=1}^n Σ_{j=1}^K pj N( x^(i); µ^(j), σj² I ).

Observed Case Consider the case of hard clustering, i.e., a point either belongs to a cluster or not. Let

    δ(j | i) = 1 if x^(i) is assigned to cluster j, and 0 otherwise,

and let n̂j = Σ_i δ(j | i) denote the number of points belonging to cluster j. Maximizing the likelihood gives

    p̂j = n̂j / n,
    µ̂^(j) = (1/n̂j) Σ_{i=1}^n δ(j | i) x^(i),
    σ̂j² = (1/(n̂j d)) Σ_{i=1}^n δ(j | i) ||x^(i) − µ^(j)||².

The EM Algorithm Instead of hard clustering, the data can actually be generated from different clusters with different probabilities: we have soft clustering. We can maximize the likelihood through the EM algorithm. Randomly initialize θ: µ^(1), . . . , µ^(K), σ1², . . . , σK², p1, . . . , pK.

1. E-step:

    p(j | i) = pj N( x^(i); µ^(j), σj² I ) / p( x^(i) | θ ),

   where p( x^(i) | θ ) = Σ_{j=1}^K pj N( x^(i); µ^(j), σj² I ).

2. M-step:

    n̂j = Σ_{i=1}^n p(j | i),
    p̂j = n̂j / n,
    µ̂^(j) = (1/n̂j) Σ_{i=1}^n p(j | i) x^(i),
    σ̂j² = (1/(n̂j d)) Σ_{i=1}^n p(j | i) ||x^(i) − µ̂^(j)||².
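A compact NumPy sketch of these E and M updates for spherical Gaussians (my own illustration, not course code):

    import numpy as np

    def gaussian(x, mu, var):
        d = len(x)
        return np.exp(-np.sum((x - mu) ** 2) / (2 * var)) / (2 * np.pi * var) ** (d / 2)

    def em_step(X, p, mu, var):
        """One EM iteration for a mixture of K spherical Gaussians.
        X: (n, d); p: (K,) weights; mu: (K, d) means; var: (K,) variances."""
        n, d = X.shape
        K = len(p)
        # E-step: soft assignments p(j|i).
        post = np.array([[p[j] * gaussian(X[i], mu[j], var[j]) for j in range(K)]
                         for i in range(n)])
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        n_hat = post.sum(axis=0)
        p_new = n_hat / n
        mu_new = (post.T @ X) / n_hat[:, None]
        var_new = np.array([(post[:, j] * ((X - mu_new[j]) ** 2).sum(axis=1)).sum()
                            / (n_hat[j] * d) for j in range(K)])
        return p_new, mu_new, var_new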
Reinforcement Learning

Objectives of RL The goal of RL is to learn a good policy with no or limited supervision.

Markov Decision Processes

Definition A Markov decision process (MDP) is defined by

• a set of states s ∈ S (may be observed or unobserved);
• a set of actions a ∈ A;
• action-dependent transition probabilities T(s, a, s′) = P(s′ | s, a), so that, for each state s and action a,

    Σ_{s′ ∈ S} T(s, a, s′) = 1;

• reward functions R(s, a, s′), representing the reward for starting in state s, taking action a and ending up in state s′ after one step. (The reward function may also depend only on s, or only on s and a.)

Property MDPs satisfy the Markov property in that the transition probabilities and rewards depend only on the current state and action, and remain unchanged regardless of the history (i.e., past states and actions) that leads to the current state.
Utility Function The main problem for MDPs is to optimize the agent's behavior. We first need to specify the criterion that we are trying to maximize in terms of accumulated rewards. We define a utility function and maximize its expectation.

• Finite horizon based utility: The utility function is the sum of rewards after acting for a fixed number n of steps. When the rewards depend only on the states, the utility function is

    U[s0, s1, . . . , sn] = Σ_{i=0}^n R(si).

• (Infinite horizon) discounted reward based utility: In this setting, the reward one step into the future is discounted by a factor γ, the reward two steps ahead by γ², and so on. The goal is to continue acting (without an end) while maximizing the expected discounted reward. The discounting allows us to focus on near-term rewards, and to control this focus by changing γ. If the rewards depend only on the states, the utility function is

    U[s0, s1, . . .] = Σ_{k=0}^∞ γ^k R(sk).

Optimal Policy A policy is a function π : S → A that assigns an action π(s) to any state s. Given an MDP and a utility function U[s0, s1, . . . , sn], our goal is to find an optimal policy function that maximizes the expectation of the utility. We denote the optimal policy by π*.

Bellman Equations

Value Function Denote by Q*(s, a) the expected reward starting at s, taking action a and acting optimally. The value function V*(s) is the expected reward starting at state s and acting optimally.

The Bellman Equations These equations connect the notion of the value of a state and the value of a policy:

    V*(s) = max_a Q*(s, a) = Q*(s, π*(s)),
    Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].

We can define V*(s) recursively to get

    V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].
Value Iteration Algorithm

Definition Value iteration is an iterative algorithm that computes the values of states indexed by k. Let Vk*(s) be the expected reward from state s after k steps, so that Vk*(s) → V*(s) as k → ∞.

1. Initialization: V0*(s) = 0.
2. Iterate until Vk*(s) ≈ V*_{k+1}(s) for all s:

    V*_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ Vk*(s′) ].

3. Compute Q*(s, a) and π*(s) = argmax_a Q*(s, a).

Convergence This algorithm will converge as long as γ < 1.

Q-Value Iteration

Definition We can directly operate at the level of Q-values; Q-value iteration is a reformulation of the value iteration algorithm.

Update Rule

    Q*_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*_k(s′, a′) ].

Reinforcement Learning

MDP vs. RL In MDPs, we are given 4 quantities ⟨S, A, T, R⟩. In reinforcement learning, we are given only the states and actions ⟨S, A⟩. In the real world, transitions and rewards might not be directly available and they need to be estimated.

Estimation Consider a random variable X. The goal is to estimate

    E[ f(X) ] = Σ_x p(x) f(x).

We have access to K samples: xi, i = 1, . . . , K.

Model-based Learning

    p̂(x) = (1/K) count(x),
    E[ f(X) ] ≈ Σ_x p̂(x) f(x).

Model-free Learning

    E[ f(X) ] ≈ (1/K) Σ_{i=1}^K f(xi).

Q-Value Iteration for RL

1. Initialization: Q(s, a) = 0 for all s, a.
2. Iterate until convergence:
   (a) Collect sample: s, a, s′, R(s, a, s′).
   (b) Update:

       Q_{i+1}(s, a) ← α [ R(s, a, s′) + γ max_{a′} Qi(s′, a′) ] + (1 − α) Qi(s, a)
                     = Qi(s, a) + α [ R(s, a, s′) + γ max_{a′} Qi(s′, a′) − Qi(s, a) ].
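The sample-based update above is the tabular Q-learning rule. A minimal sketch (my own; the environment object and its reset/step interface are assumptions made for illustration):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, eps=0.1, seed=0):
        """Tabular Q-learning with an epsilon-greedy behavior policy.
        `env` is assumed to provide reset() -> s and step(s, a) -> (s_next, reward, done)."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))          # initialization: Q(s, a) = 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
                s_next, r, done = env.step(s, a)     # collect sample (s, a, s', R)
                target = r + gamma * Q[s_next].max()
                Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])   # sample-based update
                s = s_next
        return Q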
Recommended Resources

• Introduction to Machine Learning with Python (Müller and Guido)
• Machine Learning with Python – From Linear Models to Deep Learning [Lecture Slides] (http://www.edx.org)
• LaTeX File (github.com/mynameisjanus/686xMachineLearning)

Please share this cheatsheet with friends!