6.86x Machine Learning with Python: Linear Classifiers

This is a cheat sheet for machine learning based on the online course given by Prof. Tommi Jaakkola and Prof. Regina Barzilay. Compiled by Janus B. Advincula. Last Updated January 17, 2020.

Introduction to Machine Learning

What is machine learning?  Machine learning as a discipline aims to design, understand and apply computer programs that learn from experience (i.e., data) for the purpose of modeling, prediction or control.

Linear Classifiers

Linear Classifiers through the Origin  We consider functions of the form

    h(x; θ) = sign(θ1 x1 + · · · + θd xd) = sign(θ · x).

Distance from a Line to a Point  The perpendicular distance from a line with equation θ · x + θ0 = 0 to a point x0 is

    d = |θ · x0 + θ0| / ‖θ‖.

[Figure: the line θ · x = 0 with normal vector θ; points with θ · x > 0 lie on one side, points with θ · x < 0 on the other, and d is the perpendicular distance from a point x0 to the line.]
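As a quick, illustrative sketch (not part of the original cheat sheet), the two formulas above in NumPy; the function names and example vectors are my own:

    import numpy as np

    def predict_origin(theta, x):
        """Linear classifier through the origin: h(x; theta) = sign(theta . x)."""
        return np.sign(theta @ x)

    def distance_to_line(theta, theta_0, x_0):
        """Perpendicular distance from the line theta . x + theta_0 = 0 to the point x_0."""
        return abs(theta @ x_0 + theta_0) / np.linalg.norm(theta)

    theta = np.array([2.0, -1.0])
    print(predict_origin(theta, np.array([1.0, 3.0])))          # -1.0
    print(distance_to_line(theta, 1.0, np.array([0.0, 0.0])))   # 1/sqrt(5), about 0.447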
Types of Machine Learning

• Supervised learning: prediction based on examples of correct behavior.
• Unsupervised learning: no explicit target, only data; the goal is to model/discover structure in the data.
• Semi-supervised learning: supplement limited annotations with unsupervised learning.
• Active learning: learn to query the examples actually needed for learning.
• Transfer learning: how to apply what you have learned from task A to task B.
• Reinforcement learning: learning to act, not just predict; the goal is to optimize the consequences of actions.

Linear Classifiers with Offset  We can consider functions of the form

    h(x; θ, θ0) = sign(θ · x + θ0),

where θ0 is the offset parameter.

Linear Separation  Training examples Sn are linearly separable if there exists a parameter vector θ̂ and offset parameter θ̂0 such that y^(i) (θ̂ · x^(i) + θ̂0) > 0 for all i = 1, . . . , n.

Training Error  The training error for a linear classifier is

    En(θ, θ0) = (1/n) Σ_{i=1}^{n} [[ y^(i) (θ · x^(i) + θ0) ≤ 0 ]].

Hinge Loss, Margin Boundaries and Regularization

Decision Boundary  The decision boundary is the set of points x which satisfy

    θ · x + θ0 = 0.

Margin Boundary  The margin boundary is the set of points x which satisfy

    θ · x + θ0 = ±1.

Margin  The margin of the i-th training example with respect to (θ, θ0) is

    γi(θ, θ0) = y^(i) (θ · x^(i) + θ0) / ‖θ‖.
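A small sketch (mine, not from the cheat sheet) that evaluates the margins and the training error for a given (θ, θ0) on a toy data set:

    import numpy as np

    def margins(theta, theta_0, X, y):
        """Signed margins gamma_i = y_i (theta . x_i + theta_0) / ||theta||."""
        return y * (X @ theta + theta_0) / np.linalg.norm(theta)

    def training_error(theta, theta_0, X, y):
        """Fraction of training points with y_i (theta . x_i + theta_0) <= 0."""
        return np.mean(y * (X @ theta + theta_0) <= 0)

    X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0])
    theta, theta_0 = np.array([1.0, 1.0]), 0.0
    print(training_error(theta, theta_0, X, y))   # 0.0 -- every example has positive margin
    print(margins(theta, theta_0, X, y))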
[Figure: a linear classifier with regions h(x) = +1 and h(x) = −1, the decision boundary θ · x + θ0 = 0, the positive margin boundary θ · x + θ0 = 1 and the negative margin boundary θ · x + θ0 = −1.]

Hinge Loss

    Lossh(z) = 0        if z ≥ 1,
               1 − z    if z < 1,

with z = y^(i) (θ · x^(i) + θ0).

Regularization  Maximize the margin:

    max 1/‖θ‖  ⇒  min (1/2) ‖θ‖².

Linear Classifier and Perceptron

Perceptron Algorithm

    procedure PERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
        θ = 0 (vector)
        for t = 1, . . . , T do
            for i = 1, . . . , n do
                if y^(i) (θ · x^(i)) ≤ 0 then
                    θ = θ + y^(i) x^(i)
        return θ
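The pseudocode above translates almost line by line into Python; this sketch (my own, with a toy data set) implements the through-origin version:

    import numpy as np

    def perceptron(X, y, T):
        """Perceptron through the origin, following the pseudocode above.

        X: (n, d) feature vectors, y: (n,) labels in {-1, +1}, T: number of passes.
        """
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(T):
            for i in range(n):
                # A mistake (or a point exactly on the boundary) triggers an update.
                if y[i] * (theta @ X[i]) <= 0:
                    theta = theta + y[i] * X[i]
        return theta

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0])
    print(perceptron(X, y, T=10))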
Key Concepts

• feature vectors, labels: x ∈ R^d, y ∈ {−1, +1}
• training set: Sn = {(x^(i), y^(i)), i = 1, . . . , n}
• set of classifiers: h ∈ H
• classifier: h : R^d → {−1, +1}, with
      χ+ = {x ∈ R^d : h(x) = +1},   χ− = {x ∈ R^d : h(x) = −1}
• training error:
      En(h) = (1/n) Σ_{i=1}^{n} [[ h(x^(i)) ≠ y^(i) ]],
  where [[ h(x^(i)) ≠ y^(i) ]] = 1 if there is an error and 0 otherwise
• test error: E(h)

Perceptron Algorithm (with offset)

    procedure PERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
        θ = 0 (vector); θ0 = 0
        for t = 1, . . . , T do
            for i = 1, . . . , n do
                if y^(i) (θ · x^(i) + θ0) ≤ 0 then
                    θ = θ + y^(i) x^(i)
                    θ0 = θ0 + y^(i)
        return θ, θ0

Convergence  Assumptions:

• There exists θ* such that y^(i) (θ* · x^(i)) / ‖θ*‖ ≥ γ for all i = 1, . . . , n, for some γ > 0.
• All examples are bounded: ‖x^(i)‖ ≤ R, i = 1, . . . , n.

Then the number k of updates made by the perceptron algorithm is bounded by R² / γ².

Linear Classification and Generalization

1/‖θ‖ is the distance from the decision boundary to the margin boundary.

Objective function

    J(θ, θ0) = (1/n) Σ_{i=1}^{n} Lossh( y^(i) (θ · x^(i) + θ0) ) + (λ/2) ‖θ‖²,

where λ is the regularization factor.
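A minimal sketch (not from the cheat sheet) of the hinge loss and the regularized objective J(θ, θ0); hinge_loss and objective are my own names:

    import numpy as np

    def hinge_loss(z):
        """Loss_h(z) = 0 if z >= 1, and 1 - z if z < 1."""
        return np.where(z >= 1, 0.0, 1.0 - z)

    def objective(theta, theta_0, X, y, lam):
        """J(theta, theta_0): average hinge loss plus (lambda/2) ||theta||^2."""
        z = y * (X @ theta + theta_0)
        return np.mean(hinge_loss(z)) + 0.5 * lam * (theta @ theta)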
Stochastic Gradient Descent  Select i ∈ {1, . . . , n} at random and update

    θ ← θ − ηt ∇θ [ Lossh( y^(i) (θ · x^(i) + θ0) ) + (λ/2) ‖θ‖² ],

where ηt is the learning rate, which can vary at every iteration.

Support Vector Machine

• The Support Vector Machine finds the maximum margin linear separator by solving the quadratic program that corresponds to J(θ, θ0).
• In the realizable case, if we disallow any margin violations, the quadratic program we have to solve is: find θ, θ0 that minimize (1/2) ‖θ‖² subject to

      y^(i) (θ · x^(i) + θ0) ≥ 1,   i = 1, . . . , n.

Nonlinear Classification, Linear Regression, Collaborative Filtering

Feature Transformation

    x ↦ φ(x),    θ · x → θ · φ(x),    h(x; θ, θ0) = sign(θ · φ(x) + θ0).

[Figure: a two-dimensional dataset in the (X1, X2) plane with classes +1 and −1, used to illustrate non-linear classification.]

Kernel Function  A kernel function is simply an inner product between two feature vectors:

    K(x, x′) = φ(x) · φ(x′).

Using kernels is advantageous when the inner products are faster to evaluate than using explicit vectors (e.g., when the vectors would be infinite dimensional).

Radial Basis Kernel

    K(x, x′) = exp( −(1/2) ‖x − x′‖² ).

Decision Boundary  The decision boundary satisfies

    Σ_{j=1}^{n} αj y^(j) K(x^(j), x) = 0.
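Going back to the stochastic gradient descent update at the start of this block, here is one way a single step could look in code (a sketch under my own naming; the offset update is omitted, matching the formula in the text):

    import numpy as np

    def sgd_step(theta, theta_0, x_i, y_i, lam, eta):
        """One SGD step on Loss_h(y_i (theta . x_i + theta_0)) + (lambda/2) ||theta||^2."""
        z = y_i * (theta @ x_i + theta_0)
        grad = lam * theta            # gradient of the regularization term
        if z < 1:                     # hinge loss is active; its gradient w.r.t. theta is -y_i x_i
            grad = grad - y_i * x_i
        return theta - eta * grad     # theta_0 is left unchanged here, as in the update above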
Kernel Perceptron  The perceptron algorithm in feature space is

    θ = 0
    for i = 1, . . . , n do
        if y^(i) (θ · φ(x^(i))) ≤ 0 then
            θ ← θ + y^(i) φ(x^(i))

Then, written in terms of the kernel function,

    for i = 1, . . . , n do
        if y^(i) Σ_{j=1}^{n} αj y^(j) K(x^(j), x^(i)) ≤ 0 then
            αi = αi + 1

The initialization θ = 0 is equivalent to α1 = · · · = αn = 0.

Composition rules:

1. K(x, x′) = 1 is a kernel function.
2. Let f : R^d → R and K(x, x′) a kernel. Then so is K̃(x, x′) = f(x) K(x, x′) f(x′).
3. If K1(x, x′) and K2(x, x′) are kernels, then K(x, x′) = K1(x, x′) + K2(x, x′) is a kernel.
4. If K1(x, x′) and K2(x, x′) are kernels, then K(x, x′) = K1(x, x′) K2(x, x′) is a kernel.

Linear Regression (closed form)

    θ̂ = A^{−1} B.

In matrix notation, this is

    θ̂ = (X^⊤ X)^{−1} X^⊤ Y.

Generalization and Regularization

Ridge Regression  The loss function is

    Jλ,n = (λ/2) ‖θ‖² + Rn(θ),

where λ is the regularization factor. We can find its minima using a gradient-based approach.

Algorithm  Initialize θ = 0. Randomly pick i ∈ {1, . . . , n} and update

    θ = (1 − ηλ) θ + η ( y^(i) − θ · x^(i) ) x^(i).

K-Nearest Neighbor Method  The K-Nearest Neighbor method makes use of ratings by K other similar users when predicting Yai (user a's rating of item i). Let KNN(a) be the set of K users similar to user a, and let sim(a, b) be a similarity measure between users a and b ∈ KNN(a). The KNN method predicts a rating Yai to be

    Ŷai = Σ_{b∈KNN(a)} sim(a, b) Ybi / Σ_{b∈KNN(a)} sim(a, b).

The similarity measure sim(a, b) could be any distance function between the feature vectors xa and xb.

• Euclidean distance: ‖xa − xb‖
• Cosine similarity: cos θ = (xa · xb) / (‖xa‖ ‖xb‖)
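A sketch of the kernelized perceptron above with the radial basis kernel (my own function names; the kernel width is written as a parameter gamma, with gamma = 1/2 matching the formula given earlier):

    import numpy as np

    def rbf_kernel(x, xp, gamma=0.5):
        """Radial basis kernel K(x, x') = exp(-gamma ||x - x'||^2); gamma = 1/2 above."""
        diff = x - xp
        return np.exp(-gamma * (diff @ diff))

    def kernel_perceptron(X, y, kernel, T=10):
        """Kernelized perceptron: theta = sum_j alpha_j y_j phi(x_j), tracked through alpha."""
        n = X.shape[0]
        alpha = np.zeros(n)                      # alpha = 0 corresponds to theta = 0
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        for _ in range(T):
            for i in range(n):
                if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                    alpha[i] += 1                # same effect as theta <- theta + y_i phi(x_i)
        return alpha

    def decision_value(alpha, X, y, kernel, x_new):
        """sum_j alpha_j y_j K(x_j, x_new); the decision boundary is where this equals 0."""
        return sum(alpha[j] * y[j] * kernel(X[j], x_new) for j in range(len(y)))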
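For the regression formulas above, a minimal sketch (mine): the ordinary least squares closed form and one stochastic update of the ridge regression Algorithm:

    import numpy as np

    def ols_closed_form(X, Y):
        """theta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve for stability."""
        return np.linalg.solve(X.T @ X, X.T @ Y)

    def ridge_sgd_step(theta, x_i, y_i, lam, eta):
        """theta = (1 - eta*lam) theta + eta (y_i - theta . x_i) x_i."""
        return (1.0 - eta * lam) * theta + eta * (y_i - theta @ x_i) * x_i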
Collaborative Filtering  Our goal is to come up with a matrix X that has no blank entries and whose (a, i)-th entry Xai is the prediction of user a's rating of movie i. Let D be the set of all (a, i)'s for which a user rating Yai exists. A naive approach is to minimize the objective function

    J(X) = Σ_{(a,i)∈D} (1/2) (Yai − Xai)² + (λ/2) Σ_{(a,i)} Xai².

The results are

    X̂ai = Yai / (1 + λ)   for (a, i) ∈ D,
    X̂ai = 0               for (a, i) ∉ D.

The problem with this approach is that there is no connection between the entries of X. We can impose an additional constraint on X:

    X = U V^⊤

for some n × d matrix U and m × d matrix V, where d is the rank of the matrix X.

Alternating Minimization  Assume that U and V are rank k matrices. Then, we can write the objective function as

    J(X) = Σ_{(a,i)∈D} (1/2) (Yai − [U V^⊤]ai)² + (λ/2) ( Σ_{a,k} Uak² + Σ_{i,k} Vik² ).

To find the solution, we fix (initialize) U (or V) and minimize the objective with respect to V (or U). We plug the result back into the objective and minimize it with respect to U (or V). We repeat this alternating process until there is no change in the objective function.

Example  Consider the case k = 1. Then, Ua1 = ua and Vi1 = vi. If we initialize ua to some values, then we have to optimize the function

    Σ_{(a,i)∈D} (1/2) (Yai − ua vi)² + (λ/2) Σ_i vi².

Neural Networks

Introduction to Feedforward Neural Networks

A Unit in a Neural Network  A neural network unit is a primitive neural network that consists of only the input layer, and an output layer with only one output.

[Figure: inputs x1, . . . , xd enter a single unit with weights w1, . . . , wd, which outputs f(z).]

A neural network unit computes a non-linear weighted combination of its input:

    ŷ = f(z)   where   z = w0 + Σ_{i=1}^{d} xi wi,

where the wi are the weights, z is a number (the weighted sum of the inputs xi), and f is generally a non-linear function called the activation function.

Linear Function  f(z) = z

Rectified Linear Unit (ReLU)  f(z) = max{0, z}

Hyperbolic Tangent Function  tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) = 1 − 2 / (e^{2z} + 1)

Deep Neural Networks  A deep (feedforward) neural network refers to a neural network that contains not only the input and output layers, but also hidden layers in between. Below is a deep feedforward neural network of 2 hidden layers, with each hidden layer consisting of 5 units.

[Figure: an input layer (x1, x2, x3), two hidden layers of 5 units each, and an output layer.]

One Hidden Layer Model

[Figure: Layer 0 (inputs x1, x2), Layer 1 (two tanh units with weights W11, W12, W21, W22 producing z1, z2 and activations f1, f2), Layer 2 (a linear output unit producing z and f).]

    z1 = Σ_{j=1}^{2} xj Wj1 + W01,      z2 = Σ_{j=1}^{2} xj Wj2 + W02,
    f1 = f(z1) = tanh(z1),              f2 = f(z2) = tanh(z2),
    z = f1 w1′ + f2 w2′,                f = f(z) = z.

Neural Signal Transformation  We can visualize what the hidden layer is doing similarly to a linear classifier. In the figure,

    W̄1 = (W11, W21)^⊤   and   W̄2 = (W12, W22)^⊤.

They map the input onto the f1–f2 axes.

[Figure: the weight vectors W̄1 and W̄2 drawn in the (x1, x2) input plane.]

Hidden Layer Representation

[Figures: the hidden layer units shown in the input space, and the corresponding hidden layer representations of the data plotted on the f1–f2 axes for linear, tanh and ReLU activations.]
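As an illustration of the alternating minimization scheme from the Collaborative Filtering block above, here is a rank-1 (k = 1) sketch. The closed-form coordinate updates are my own derivation from the quadratic objective, and the function and argument names (als_rank1, observed, iters) are hypothetical:

    import numpy as np

    def als_rank1(Y, observed, lam, iters=50):
        """Rank-1 alternating minimization: approximate Y by the outer product u v^T.

        Y: (n, m) rating matrix; observed: boolean mask of the entries in D;
        lam: regularization factor (assumed > 0). Both u and v are regularized,
        as in the full objective above.
        """
        n, m = Y.shape
        u = np.ones(n)
        v = np.ones(m)
        for _ in range(iters):
            # Fix u and minimize over each v_i (a quadratic in v_i, solved in closed form).
            for i in range(m):
                mask = observed[:, i]
                v[i] = (Y[mask, i] @ u[mask]) / (lam + u[mask] @ u[mask])
            # Fix v and minimize over each u_a.
            for a in range(n):
                mask = observed[a, :]
                u[a] = (Y[a, mask] @ v[mask]) / (lam + v[mask] @ v[mask])
        return u, v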
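A tiny forward pass for the one hidden layer model just described (two tanh hidden units followed by a linear output unit); the function name and the example numbers are mine:

    import numpy as np

    def one_hidden_layer(x, W, W_0, w_prime):
        """Forward pass: two tanh hidden units followed by a linear output unit.

        x: (2,) input; W[j, k] = W_jk; W_0: (2,) biases (W_01, W_02);
        w_prime: (2,) output weights (w_1', w_2').
        """
        z = x @ W + W_0          # z_k = sum_j x_j W_jk + W_0k
        f = np.tanh(z)           # f_k = tanh(z_k)
        return f @ w_prime       # output unit is linear: f(z) = z

    x = np.array([1.0, -2.0])
    W = np.array([[1.0, 0.5], [-1.0, 2.0]])
    print(one_hidden_layer(x, W, W_0=np.zeros(2), w_prime=np.array([1.0, 1.0])))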
Summary

• Units in neural networks are linear classifiers, just with different output non-linearity.
• The units in feedforward neural networks are arranged in layers.
• By learning the parameters associated with the hidden layer units, we learn how to represent examples (as hidden layer activations).
• The representations in neural networks are learned directly to facilitate the end-to-end task.
• A simple classifier (output unit) suffices to solve complex classification tasks if it operates on the hidden layer representations.

Feedforward Neural Networks, Back Propagation, and Stochastic Gradient Descent (SGD)

Simple Example  This simple neural network is made up of L hidden layers, but each layer consists of only one unit, and each unit has activation function f.

[Figure: a chain network x → (w1, z1, f1) → · · · → (wL, zL, fL) → y.]

    z1 = x w1,
    f1 = tanh(x w1),
    . . .
    fL = tanh(fL−1 wL),
    L(y, fL) = Loss(y, fL) = (1/2) (y − fL)².

For i = 2, . . . , L: zi = fi−1 wi where fi−1 = f(zi−1). Also, y is the true value and fL is the output of the neural network.

Gradient Descent  The gradient descent update rule for the parameter wi is

    wi ← wi − η · ∇wi L(y, fL),

where η is the learning rate. For instance, we have

    ∂L/∂w1 = (∂f1/∂w1) (∂L/∂f1),
    ∂f1/∂w1 = (1 − tanh²(x w1)) x = (1 − f1²) x,
    ∂L/∂f1 = (∂L/∂f2) (∂f2/∂f1) = (1 − f2²) w2 (∂L/∂f2).

Thus, when we back-propagate, we get

    ∂L/∂w1 = x (1 − f1²) · · · (1 − fL²) w2 · · · wL (fL − y).

Note that the above derivation applies to tanh activation.

Backpropagation  Consider the L-layer neural network below.

[Figure: a network with Layer 0 (inputs x1, x2, x3), hidden Layers 1 and 2, and Layer 3 with the single output activation a31.]

We have the following notations:

• w^ℓ_jk is the weight for the connection from the k-th neuron in the (ℓ − 1)-th layer to the j-th neuron in the ℓ-th layer.
• b^ℓ_j is the bias of the j-th neuron in the ℓ-th layer.
• a^ℓ_j is the activation of the j-th neuron in the ℓ-th layer.

If the activation function is f and the loss function we are minimizing is C, then the equations describing the network are:

    a^ℓ_j = f( Σ_k w^ℓ_jk a^{ℓ−1}_k + b^ℓ_j ),
    Loss = C(a^L).

Let the weighted inputs to the d neurons in layer ℓ be defined as

    z^ℓ ≡ w^ℓ a^{ℓ−1} + b^ℓ,   where z^ℓ ∈ R^d.

Then the activation of layer ℓ is also written as a^ℓ ≡ f(z^ℓ). Also, let δ^ℓ_j ≡ ∂C/∂z^ℓ_j denote the error of neuron j in layer ℓ. Then δ^ℓ ∈ R^d denotes the full vector of errors associated with layer ℓ.

Equations of Backpropagation

    δ^L = ∇_a C ⊙ f′(z^L),
    δ^ℓ = ( (w^{ℓ+1})^⊤ δ^{ℓ+1} ) ⊙ f′(z^ℓ),
    ∂C/∂b^ℓ_j = δ^ℓ_j,
    ∂C/∂w^ℓ_jk = a^{ℓ−1}_k δ^ℓ_j.

The symbol ⊙ represents the Hadamard (element-wise) product:

    [a b; c d] ⊙ [e f; g h] = [ae bf; cg dh].

Recurrent Neural Networks

Temporal/Sequence Problems

• Sequence prediction problems can be recast in a form amenable to feedforward neural networks.
• We have to engineer how history is mapped to a vector (representation). This vector is then fed into, e.g., a neural network.
• We would like to learn how to encode the history into a vector.

Key Concepts

• Encoding – e.g., mapping a sequence to a vector.
• Decoding – e.g., mapping a vector to, e.g., a sequence.
• Introduce adjustable "lego pieces" and optimize them for end-to-end performance.

[Figure: an encoder cell parameterized by θ maps the previous context (or state) s_{t−1} and the current input to the new context (or state) s_t.]

Example: Encoding Sentences  Let's say we want to encode the incomplete sentence "Efforts and courage are not". First, we have to represent the first word as a vector (say, a one-hot vector). This will be x1. Then,

    s1 = tanh( W^{s,x} x1 ).

The second word will be x2, and we compute s2:

    s2 = tanh( W^{s,s} s1 + W^{s,x} x2 ).

We continue this process until we've encoded all the words in the sentence. We can visualize this as follows:

[Figure: the words "Efforts and courage are not . . ." enter as vectors x1, . . . , x5; the same encoder "lego piece" is applied at each step, updating the states s0, s1, . . . , s5, so that the final state represents the sentence as a vector.]

Differences from standard feedforward architecture

• Input is received at each layer (per word), not just at the beginning as in a typical feedforward network.
• The number of layers varies and depends on the length of the sentence.
• Parameters of each layer (representing an application of an RNN) are shared (same RNN at each step).

Basic RNN

    st = tanh( W^{s,s} s_{t−1} + W^{s,x} xt )

Simple Gated RNN

    gt = sigmoid( W^{g,s} s_{t−1} + W^{g,x} xt )
    st = (1 − gt) ⊙ s_{t−1} + gt ⊙ tanh( W^{s,s} s_{t−1} + W^{s,x} xt )

Long Short-Term Memory (LSTM)

    ft = sigmoid( W^{f,h} h_{t−1} + W^{f,x} xt )      forget gate
    it = sigmoid( W^{i,h} h_{t−1} + W^{i,x} xt )      input gate
    ot = sigmoid( W^{o,h} h_{t−1} + W^{o,x} xt )      output gate
    ct = ft ⊙ c_{t−1} + it ⊙ tanh( W^{c,h} h_{t−1} + W^{c,x} xt )      memory cell
    ht = ot ⊙ tanh(ct)      visible state
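To make the chain-rule computation in the Simple Example concrete, a sketch (my own code) of the forward pass and of ∂L/∂w1 for the single-unit-per-layer network with tanh activations and L(y, fL) = (1/2)(y − fL)²:

    import numpy as np

    def forward(x, w):
        """Forward pass of the chain network: f_1 = tanh(x w_1), f_i = tanh(f_{i-1} w_i)."""
        f = []
        a = x
        for wi in w:
            a = np.tanh(a * wi)
            f.append(a)
        return f  # f[-1] is the network output f_L

    def grad_w1(x, w, y):
        """dL/dw_1 for L(y, f_L) = (1/2)(y - f_L)^2, using the chain rule derived above."""
        f = forward(x, w)
        g = f[-1] - y                        # dL/df_L
        for i in range(len(w) - 1, 0, -1):   # propagate back through layers L, ..., 2
            g *= (1 - f[i] ** 2) * w[i]      # df_i/dz_i * dz_i/df_{i-1}
        return g * (1 - f[0] ** 2) * x       # df_1/dz_1 * dz_1/dw_1

    x, y = 0.5, 1.0
    w = [0.8, -1.2, 0.3]
    print(grad_w1(x, w, y))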
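A minimal sketch of the basic and gated RNN state updates above (bias terms omitted, as in the equations; the function names and the encode helper are mine):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def basic_rnn_step(s_prev, x_t, W_ss, W_sx):
        """s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)."""
        return np.tanh(W_ss @ s_prev + W_sx @ x_t)

    def gated_rnn_step(s_prev, x_t, W_gs, W_gx, W_ss, W_sx):
        """Simple gated RNN: the gate g_t mixes the old state with a candidate update."""
        g_t = sigmoid(W_gs @ s_prev + W_gx @ x_t)
        return (1.0 - g_t) * s_prev + g_t * np.tanh(W_ss @ s_prev + W_sx @ x_t)

    def encode(X, W_ss, W_sx):
        """Encode a sentence: start from s_0 = 0 and fold in one word vector x_t at a time."""
        s = np.zeros(W_ss.shape[0])
        for x_t in X:            # X is a sequence of word vectors x_1, ..., x_T
            s = basic_rnn_step(s, x_t, W_ss, W_sx)
        return s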
Markov Language Models  Let V denote the set of possible words/symbols. The maximum likelihood estimate is obtained as normalized counts of successive word occurrences (matching statistics):

    P̂(w′ | w) = count(w, w′) / Σ_{w̃} count(w, w̃).

Feature-based Markov Model  We can also represent the Markov model as a feedforward neural network (very extendable). We define a one-hot vector, φ(w_{i−1}), corresponding to the previous word. This will be an input to the feedforward neural network.

[Figure: the one-hot input x = φ(w_{i−1}) is fed through the weights W to produce outputs p1, p2, p3, . . . , pk.]

In the figure,

    pk = P(wi = k | w_{i−1})

is the probability of the next word, given the previous word. The aggregate input to the k-th output unit is

    zk = Σ_j xj Wjk + W0k.

These input values are not probabilities. A typical transformation is the softmax transformation:

    pk = e^{zk} / Σ_j e^{zj}.

RNNs for Sequences  Our RNN now also produces an output (e.g., a word) as well as updating its state.

[Figure: at each step the previous state is mapped (through θ) to a new state together with an output distribution over words, e.g., [0.1, 0.3, . . . , 0.2].]

Convolutional Neural Networks

Problem  Image classification:

• The presence of objects may vary in location across different images.

Patch classifier/filter  The patch classifier goes through the entire image. We can think of the weights as the image that the unit prefers to see.

Convolution  The convolution is an operation between two functions f and g:

    (f ∗ g)(t) ≡ ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ.

Intuitively, convolution blends the two functions f and g by expressing the amount of overlap of one function as it is shifted over another function.

Discrete Convolution  For discrete functions, we can define the convolution as

    (f ∗ g)[n] ≡ Σ_{m=−∞}^{+∞} f[m] g[n − m].

Pooling  We wish to know whether a feature was there but not exactly where it was.

Pooling (Max)  Pooling region and stride may vary.

• Pooling induces translation invariance at the cost of spatial resolution.
• Stride reduces the size of the resulting feature map.
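For the discrete convolution and max pooling described in the Convolutional Neural Networks block above, a 1-D sketch (np.convolve computes exactly the sum in the definition; max_pool1d and the example array are mine):

    import numpy as np

    def conv1d(f, g):
        """Discrete convolution (f * g)[n] = sum_m f[m] g[n - m] (full overlap, 1-D)."""
        return np.convolve(f, g)

    def max_pool1d(a, size, stride):
        """Max pooling: take the maximum over windows of `size`, moving by `stride`."""
        return np.array([a[i:i + size].max() for i in range(0, len(a) - size + 1, stride)])

    a = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 4.0])
    print(conv1d(a, np.array([1.0, -1.0])))     # a finite-difference filter
    print(max_pool1d(a, size=2, stride=2))      # [3., 5., 4.]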
Unsupervised Learning

Clustering

Training set  We are provided a training set, but with no labels:

    Sn = { x^(i), i = 1, . . . , n },

and the goal is to find structure in the data. The clustering output consists of:

• A partition of indices {1, . . . , n} into K sets, C1, . . . , CK
• Representatives in each of the K partition sets, given as z1, . . . , zK

Example: Google News.

Example: Image Quantization  [Figure: an original image next to its compressed (quantized) version.]

Generative Models  The probability that a model M generates a certain word w ∈ W is

    P(w | θ) = θw,   θw ≥ 0,   Σ_{w∈W} θw = 1.

For the training set Sn = { x^(i), i = 1, . . . , n }, …
Markov Decision Processes (MDPs)

• reward functions R(s, a, s′), representing the reward for starting in state s, taking action a and ending up in state s′ after one step. (The reward function may also depend only on s, or only on s and a.)

Property  MDPs satisfy the Markov property in that the transition probabilities and rewards depend only on the current state and action, and remain unchanged regardless of the history (i.e., past states and actions) that leads to the current state.

Utility Function  The main problem for MDPs is to optimize the agent's behavior. We first need to specify the criterion that we are trying to maximize in terms of accumulated rewards. We define a utility function and maximize its expectation.

• Finite horizon based utility: The utility function is the sum of rewards after acting for a fixed number n of steps. When the rewards depend only on the states, the utility function is

      U[s0, s1, . . . , sn] = Σ_{i=0}^{n} R(si).

• (Infinite horizon) discounted reward based utility: In this setting, the reward one step into the future is discounted by a factor γ, the reward two steps ahead by γ², and so on. The goal is to continue acting (without an end) while maximizing the expected discounted reward. The discounting allows us to focus on near term rewards, and control this focus by changing γ. If the rewards depend only on the states, the utility function is

      U[s0, s1, . . . ] = Σ_{k=0}^{∞} γ^k R(sk).

Optimal Policy  A policy is a function π : S → A that assigns an action π(s) to any state s. Given an MDP and a utility function U[s0, s1, . . . , sn], our goal is to find an optimal policy function that maximizes the expectation of the utility. We denote the optimal policy by π*.

Bellman Equations

Value Function  Denote by Q*(s, a) the expected reward starting at s, taking action a and acting optimally. The value function V*(s) is the expected reward starting at state s and acting optimally.

The Bellman Equations  These equations connect the notion of the value of a state and the value of a policy:

    V*(s) = max_a Q*(s, a) = Q*(s, π*(s)),
    Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].

We can define V*(s) recursively to get

    V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].

Convergence  This algorithm will converge as long as γ < 1.

Q-Value Iteration  We can directly operate at the level of Q-values. Q-value iteration is a reformulation of the value iteration algorithm.

Update Rule

    Q*_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*_k(s′, a′) ].

Reinforcement Learning

MDP vs. RL  In MDPs, we are given 4 quantities ⟨S, A, T, R⟩. In reinforcement learning, we are given only the states and actions ⟨S, A⟩. In the real world, transitions and rewards might not be directly available and they need to be estimated.

Estimation  Consider a random variable X. The goal is to estimate

    E[f(X)] = Σ_x p(x) f(x).

We have access to K samples: xi, i = 1, . . . , K.

Model-based Learning

    p̂(xi) = count(xi) / K,
    E[f(X)] ≈ Σ_{i=1}^{K} p̂(xi) f(xi).

Model-free Learning

    E[f(X)] ≈ (1/K) Σ_{i=1}^{K} f(xi).

Q-Value Iteration for RL

1. Initialization: Q(s, a) = 0 ∀ s, a
2. Iterate until convergence:
   (a) Collect sample: s, a, s′, R(s, a, s′)
   (b) Update:

       Q_{i+1}(s, a) ← α [ R(s, a, s′) + γ max_{a′} Q_i(s′, a′) ] + (1 − α) Q_i(s, a)
                     = Q_i(s, a) + α [ R(s, a, s′) + γ max_{a′} Q_i(s′, a′) − Q_i(s, a) ].

Recommended Resources

• Introduction to Machine Learning with Python (Müller and Guido)
• Machine Learning with Python – From Linear Models to Deep Learning [Lecture Slides] (http://www.edx.org)
• LaTeX File (github.com/mynameisjanus/686xMachineLearning)

Please share this cheat sheet with friends!