6.86x Machine Learning with Python: Linear Classifiers

This is a cheat sheet for machine learning based on the online course given by Prof. Tommi Jaakkola and Prof. Regina Barzilay. Compiled by Janus B. Advincula. Last Updated January 17, 2020.

Introduction to Machine Learning

What is machine learning?  Machine learning as a discipline aims to design, understand and apply computer programs that learn from experience (i.e., data) for the purpose of modeling, prediction or control.

Linear Classifiers

Linear Classifiers through the Origin  We consider functions of the form

    h(x; θ) = sign(θ1 x1 + · · · + θd xd) = sign(θ · x).

Distance from a Line to a Point  The perpendicular distance from a line with equation θ · x + θ0 = 0 to a point x0 is

    d = |θ · x0 + θ0| / ‖θ‖.

[Figure: the line θ · x = 0 with normal vector θ; points with θ · x > 0 lie on one side, points with θ · x < 0 on the other, and d is the perpendicular distance from a point x0 to the line.]
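As a quick, illustrative sketch (not part of the original cheat sheet), the two formulas above in NumPy; the function names and example vectors are my own:

    import numpy as np

    def predict_origin(theta, x):
        """Linear classifier through the origin: h(x; theta) = sign(theta . x)."""
        return np.sign(theta @ x)

    def distance_to_line(theta, theta_0, x_0):
        """Perpendicular distance from the line theta . x + theta_0 = 0 to the point x_0."""
        return abs(theta @ x_0 + theta_0) / np.linalg.norm(theta)

    theta = np.array([2.0, -1.0])
    print(predict_origin(theta, np.array([1.0, 3.0])))          # -1.0
    print(distance_to_line(theta, 1.0, np.array([0.0, 0.0])))   # 1/sqrt(5), about 0.447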
Types of Machine Learning

• Supervised learning: prediction based on examples of correct behavior.
• Unsupervised learning: no explicit target, only data; the goal is to model/discover structure in the data.
• Semi-supervised learning: supplement limited annotations with unsupervised learning.
• Active learning: learn to query the examples actually needed for learning.
• Transfer learning: how to apply what you have learned from task A to task B.
• Reinforcement learning: learning to act, not just predict; the goal is to optimize the consequences of actions.

Linear Classifiers with Offset  We can consider functions of the form

    h(x; θ, θ0) = sign(θ · x + θ0),

where θ0 is the offset parameter.

Linear Separation  Training examples Sn are linearly separable if there exists a parameter vector θ̂ and offset parameter θ̂0 such that y^(i) (θ̂ · x^(i) + θ̂0) > 0 for all i = 1, . . . , n.

Training Error  The training error for a linear classifier is

    En(θ, θ0) = (1/n) Σ_{i=1}^{n} [[ y^(i) (θ · x^(i) + θ0) ≤ 0 ]].

Hinge Loss, Margin Boundaries and Regularization

Decision Boundary  The decision boundary is the set of points x which satisfy

    θ · x + θ0 = 0.

Margin Boundary  The margin boundary is the set of points x which satisfy

    θ · x + θ0 = ±1.

Margin  The margin of the i-th training example with respect to (θ, θ0) is

    γi(θ, θ0) = y^(i) (θ · x^(i) + θ0) / ‖θ‖.
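A small sketch (mine, not from the cheat sheet) that evaluates the margins and the training error for a given (θ, θ0) on a toy data set:

    import numpy as np

    def margins(theta, theta_0, X, y):
        """Signed margins gamma_i = y_i (theta . x_i + theta_0) / ||theta||."""
        return y * (X @ theta + theta_0) / np.linalg.norm(theta)

    def training_error(theta, theta_0, X, y):
        """Fraction of training points with y_i (theta . x_i + theta_0) <= 0."""
        return np.mean(y * (X @ theta + theta_0) <= 0)

    X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0])
    theta, theta_0 = np.array([1.0, 1.0]), 0.0
    print(training_error(theta, theta_0, X, y))   # 0.0 -- every example has positive margin
    print(margins(theta, theta_0, X, y))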
[Figure: a linear classifier with regions h(x) = +1 and h(x) = −1, the decision boundary θ · x + θ0 = 0, the positive margin boundary θ · x + θ0 = 1 and the negative margin boundary θ · x + θ0 = −1.]

Hinge Loss

    Lossh(z) = 0        if z ≥ 1,
               1 − z    if z < 1,

with z = y^(i) (θ · x^(i) + θ0).

Regularization  Maximize the margin:

    max 1/‖θ‖  ⇒  min (1/2) ‖θ‖².

Linear Classifier and Perceptron

Perceptron Algorithm

    procedure PERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
        θ = 0 (vector)
        for t = 1, . . . , T do
            for i = 1, . . . , n do
                if y^(i) (θ · x^(i)) ≤ 0 then
                    θ = θ + y^(i) x^(i)
        return θ
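The pseudocode above translates almost line by line into Python; this sketch (my own, with a toy data set) implements the through-origin version:

    import numpy as np

    def perceptron(X, y, T):
        """Perceptron through the origin, following the pseudocode above.

        X: (n, d) feature vectors, y: (n,) labels in {-1, +1}, T: number of passes.
        """
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(T):
            for i in range(n):
                # A mistake (or a point exactly on the boundary) triggers an update.
                if y[i] * (theta @ X[i]) <= 0:
                    theta = theta + y[i] * X[i]
        return theta

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0])
    print(perceptron(X, y, T=10))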
Key Concepts

• feature vectors, labels: x ∈ R^d, y ∈ {−1, +1}
• training set: Sn = {(x^(i), y^(i)), i = 1, . . . , n}
• set of classifiers: h ∈ H
• classifier: h : R^d → {−1, +1}, with
      χ+ = {x ∈ R^d : h(x) = +1},   χ− = {x ∈ R^d : h(x) = −1}
• training error:
      En(h) = (1/n) Σ_{i=1}^{n} [[ h(x^(i)) ≠ y^(i) ]],
  where [[ h(x^(i)) ≠ y^(i) ]] = 1 if there is an error and 0 otherwise
• test error: E(h)

Perceptron Algorithm (with offset)

    procedure PERCEPTRON({(x^(i), y^(i)), i = 1, . . . , n}, T)
        θ = 0 (vector); θ0 = 0
        for t = 1, . . . , T do
            for i = 1, . . . , n do
                if y^(i) (θ · x^(i) + θ0) ≤ 0 then
                    θ = θ + y^(i) x^(i)
                    θ0 = θ0 + y^(i)
        return θ, θ0

Convergence  Assumptions:

• There exists θ* such that y^(i) (θ* · x^(i)) / ‖θ*‖ ≥ γ for all i = 1, . . . , n, for some γ > 0.
• All examples are bounded: ‖x^(i)‖ ≤ R, i = 1, . . . , n.

Then the number k of updates made by the perceptron algorithm is bounded by R² / γ².

Linear Classification and Generalization

1/‖θ‖ is the distance from the decision boundary to the margin boundary.

Objective function

    J(θ, θ0) = (1/n) Σ_{i=1}^{n} Lossh( y^(i) (θ · x^(i) + θ0) ) + (λ/2) ‖θ‖²,

where λ is the regularization factor.
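A minimal sketch (not from the cheat sheet) of the hinge loss and the regularized objective J(θ, θ0); hinge_loss and objective are my own names:

    import numpy as np

    def hinge_loss(z):
        """Loss_h(z) = 0 if z >= 1, and 1 - z if z < 1."""
        return np.where(z >= 1, 0.0, 1.0 - z)

    def objective(theta, theta_0, X, y, lam):
        """J(theta, theta_0): average hinge loss plus (lambda/2) ||theta||^2."""
        z = y * (X @ theta + theta_0)
        return np.mean(hinge_loss(z)) + 0.5 * lam * (theta @ theta)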
Stochastic Gradient Descent  Select i ∈ {1, . . . , n} at random and update

    θ ← θ − ηt ∇θ [ Lossh( y^(i) (θ · x^(i) + θ0) ) + (λ/2) ‖θ‖² ],

where ηt is the learning rate, which can vary at every iteration.

Support Vector Machine

• The Support Vector Machine finds the maximum margin linear separator by solving the quadratic program that corresponds to J(θ, θ0).
• In the realizable case, if we disallow any margin violations, the quadratic program we have to solve is: find θ, θ0 that minimize (1/2) ‖θ‖² subject to

      y^(i) (θ · x^(i) + θ0) ≥ 1,   i = 1, . . . , n.

Nonlinear Classification, Linear Regression, Collaborative Filtering

Feature Transformation

    x ↦ φ(x),    θ · x → θ · φ(x),    h(x; θ, θ0) = sign(θ · φ(x) + θ0).

[Figure: a two-dimensional dataset in the (X1, X2) plane with classes +1 and −1, used to illustrate non-linear classification.]

Kernel Function  A kernel function is simply an inner product between two feature vectors:

    K(x, x′) = φ(x) · φ(x′).

Using kernels is advantageous when the inner products are faster to evaluate than using explicit vectors (e.g., when the vectors would be infinite dimensional).

Radial Basis Kernel

    K(x, x′) = exp( −(1/2) ‖x − x′‖² ).

Decision Boundary  The decision boundary satisfies

    Σ_{j=1}^{n} αj y^(j) K(x^(j), x) = 0.
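Going back to the stochastic gradient descent update at the start of this block, here is one way a single step could look in code (a sketch under my own naming; the offset update is omitted, matching the formula in the text):

    import numpy as np

    def sgd_step(theta, theta_0, x_i, y_i, lam, eta):
        """One SGD step on Loss_h(y_i (theta . x_i + theta_0)) + (lambda/2) ||theta||^2."""
        z = y_i * (theta @ x_i + theta_0)
        grad = lam * theta            # gradient of the regularization term
        if z < 1:                     # hinge loss is active; its gradient w.r.t. theta is -y_i x_i
            grad = grad - y_i * x_i
        return theta - eta * grad     # theta_0 is left unchanged here, as in the update above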
Kernel Perceptron  The perceptron algorithm in feature space is

    θ = 0
    for i = 1, . . . , n do
        if y^(i) (θ · φ(x^(i))) ≤ 0 then
            θ ← θ + y^(i) φ(x^(i))

Then, written in terms of the kernel function,

    for i = 1, . . . , n do
        if y^(i) Σ_{j=1}^{n} αj y^(j) K(x^(j), x^(i)) ≤ 0 then
            αi = αi + 1

The initialization θ = 0 is equivalent to α1 = · · · = αn = 0.

Composition rules:

1. K(x, x′) = 1 is a kernel function.
2. Let f : R^d → R and K(x, x′) a kernel. Then so is K̃(x, x′) = f(x) K(x, x′) f(x′).
3. If K1(x, x′) and K2(x, x′) are kernels, then K(x, x′) = K1(x, x′) + K2(x, x′) is a kernel.
4. If K1(x, x′) and K2(x, x′) are kernels, then K(x, x′) = K1(x, x′) K2(x, x′) is a kernel.

Linear Regression (closed form)

    θ̂ = A^{−1} B.

In matrix notation, this is

    θ̂ = (X^⊤ X)^{−1} X^⊤ Y.

Generalization and Regularization

Ridge Regression  The loss function is

    Jλ,n = (λ/2) ‖θ‖² + Rn(θ),

where λ is the regularization factor. We can find its minima using a gradient-based approach.

Algorithm  Initialize θ = 0. Randomly pick i ∈ {1, . . . , n} and update

    θ = (1 − ηλ) θ + η ( y^(i) − θ · x^(i) ) x^(i).

K-Nearest Neighbor Method  The K-Nearest Neighbor method makes use of ratings by K other similar users when predicting Yai (user a's rating of item i). Let KNN(a) be the set of K users similar to user a, and let sim(a, b) be a similarity measure between users a and b ∈ KNN(a). The KNN method predicts a rating Yai to be

    Ŷai = Σ_{b∈KNN(a)} sim(a, b) Ybi / Σ_{b∈KNN(a)} sim(a, b).

The similarity measure sim(a, b) could be any distance function between the feature vectors xa and xb.

• Euclidean distance: ‖xa − xb‖
• Cosine similarity: cos θ = (xa · xb) / (‖xa‖ ‖xb‖)
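A sketch of the kernelized perceptron above with the radial basis kernel (my own function names; the kernel width is written as a parameter gamma, with gamma = 1/2 matching the formula given earlier):

    import numpy as np

    def rbf_kernel(x, xp, gamma=0.5):
        """Radial basis kernel K(x, x') = exp(-gamma ||x - x'||^2); gamma = 1/2 above."""
        diff = x - xp
        return np.exp(-gamma * (diff @ diff))

    def kernel_perceptron(X, y, kernel, T=10):
        """Kernelized perceptron: theta = sum_j alpha_j y_j phi(x_j), tracked through alpha."""
        n = X.shape[0]
        alpha = np.zeros(n)                      # alpha = 0 corresponds to theta = 0
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        for _ in range(T):
            for i in range(n):
                if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                    alpha[i] += 1                # same effect as theta <- theta + y_i phi(x_i)
        return alpha

    def decision_value(alpha, X, y, kernel, x_new):
        """sum_j alpha_j y_j K(x_j, x_new); the decision boundary is where this equals 0."""
        return sum(alpha[j] * y[j] * kernel(X[j], x_new) for j in range(len(y)))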
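For the regression formulas above, a minimal sketch (mine): the ordinary least squares closed form and one stochastic update of the ridge regression Algorithm:

    import numpy as np

    def ols_closed_form(X, Y):
        """theta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve for stability."""
        return np.linalg.solve(X.T @ X, X.T @ Y)

    def ridge_sgd_step(theta, x_i, y_i, lam, eta):
        """theta = (1 - eta*lam) theta + eta (y_i - theta . x_i) x_i."""
        return (1.0 - eta * lam) * theta + eta * (y_i - theta @ x_i) * x_i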
Collaborative Filtering  Our goal is to come up with a matrix X that has no blank entries and whose (a, i)-th entry Xai is the prediction of user a's rating of movie i. Let D be the set of all (a, i)'s for which a user rating Yai exists. A naive approach is to minimize the objective function

    J(X) = Σ_{(a,i)∈D} (1/2) (Yai − Xai)² + (λ/2) Σ_{(a,i)} Xai².

The results are

    X̂ai = Yai / (1 + λ)   for (a, i) ∈ D,
    X̂ai = 0               for (a, i) ∉ D.

The problem with this approach is that there is no connection between the entries of X. We can impose an additional constraint on X:

    X = U V^⊤

for some n × d matrix U and m × d matrix V, where d is the rank of the matrix X.

Alternating Minimization  Assume that U and V are rank k matrices. Then, we can write the objective function as

    J(X) = Σ_{(a,i)∈D} (1/2) (Yai − [U V^⊤]ai)² + (λ/2) ( Σ_{a,k} Uak² + Σ_{i,k} Vik² ).

To find the solution, we fix (initialize) U (or V) and minimize the objective with respect to V (or U). We plug the result back into the objective and minimize it with respect to U (or V). We repeat this alternating process until there is no change in the objective function.

Example  Consider the case k = 1. Then, Ua1 = ua and Vi1 = vi. If we initialize ua to some values, then we have to optimize the function

    Σ_{(a,i)∈D} (1/2) (Yai − ua vi)² + (λ/2) Σ_i vi².

Neural Networks

Introduction to Feedforward Neural Networks

A Unit in a Neural Network  A neural network unit is a primitive neural network that consists of only the input layer, and an output layer with only one output.

[Figure: inputs x1, . . . , xd enter a single unit with weights w1, . . . , wd, which outputs f(z).]

A neural network unit computes a non-linear weighted combination of its input:

    ŷ = f(z)   where   z = w0 + Σ_{i=1}^{d} xi wi,

where the wi are the weights, z is a number (the weighted sum of the inputs xi), and f is generally a non-linear function called the activation function.

Linear Function  f(z) = z

Rectified Linear Unit (ReLU)  f(z) = max{0, z}

Hyperbolic Tangent Function  tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) = 1 − 2 / (e^{2z} + 1)

Deep Neural Networks  A deep (feedforward) neural network refers to a neural network that contains not only the input and output layers, but also hidden layers in between. Below is a deep feedforward neural network of 2 hidden layers, with each hidden layer consisting of 5 units.

[Figure: an input layer (x1, x2, x3), two hidden layers of 5 units each, and an output layer.]

One Hidden Layer Model

[Figure: Layer 0 (inputs x1, x2), Layer 1 (two tanh units with weights W11, W12, W21, W22 producing z1, z2 and activations f1, f2), Layer 2 (a linear output unit producing z and f).]

    z1 = Σ_{j=1}^{2} xj Wj1 + W01,      z2 = Σ_{j=1}^{2} xj Wj2 + W02,
    f1 = f(z1) = tanh(z1),              f2 = f(z2) = tanh(z2),
    z = f1 w1′ + f2 w2′,                f = f(z) = z.

Neural Signal Transformation  We can visualize what the hidden layer is doing similarly to a linear classifier. In the figure,

    W̄1 = (W11, W21)^⊤   and   W̄2 = (W12, W22)^⊤.

They map the input onto the f1–f2 axes.

[Figure: the weight vectors W̄1 and W̄2 drawn in the (x1, x2) input plane.]

Hidden Layer Representation

[Figures: the hidden layer units shown in the input space, and the corresponding hidden layer representations of the data plotted on the f1–f2 axes for linear, tanh and ReLU activations.]
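As an illustration of the alternating minimization scheme from the Collaborative Filtering block above, here is a rank-1 (k = 1) sketch. The closed-form coordinate updates are my own derivation from the quadratic objective, and the function and argument names (als_rank1, observed, iters) are hypothetical:

    import numpy as np

    def als_rank1(Y, observed, lam, iters=50):
        """Rank-1 alternating minimization: approximate Y by the outer product u v^T.

        Y: (n, m) rating matrix; observed: boolean mask of the entries in D;
        lam: regularization factor (assumed > 0). Both u and v are regularized,
        as in the full objective above.
        """
        n, m = Y.shape
        u = np.ones(n)
        v = np.ones(m)
        for _ in range(iters):
            # Fix u and minimize over each v_i (a quadratic in v_i, solved in closed form).
            for i in range(m):
                mask = observed[:, i]
                v[i] = (Y[mask, i] @ u[mask]) / (lam + u[mask] @ u[mask])
            # Fix v and minimize over each u_a.
            for a in range(n):
                mask = observed[a, :]
                u[a] = (Y[a, mask] @ v[mask]) / (lam + v[mask] @ v[mask])
        return u, v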
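A tiny forward pass for the one hidden layer model just described (two tanh hidden units followed by a linear output unit); the function name and the example numbers are mine:

    import numpy as np

    def one_hidden_layer(x, W, W_0, w_prime):
        """Forward pass: two tanh hidden units followed by a linear output unit.

        x: (2,) input; W[j, k] = W_jk; W_0: (2,) biases (W_01, W_02);
        w_prime: (2,) output weights (w_1', w_2').
        """
        z = x @ W + W_0          # z_k = sum_j x_j W_jk + W_0k
        f = np.tanh(z)           # f_k = tanh(z_k)
        return f @ w_prime       # output unit is linear: f(z) = z

    x = np.array([1.0, -2.0])
    W = np.array([[1.0, 0.5], [-1.0, 2.0]])
    print(one_hidden_layer(x, W, W_0=np.zeros(2), w_prime=np.array([1.0, 1.0])))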
Summary

• Units in neural networks are linear classifiers, just with different output non-linearity.
• The units in feedforward neural networks are arranged in layers.
• By learning the parameters associated with the hidden layer units, we learn how to represent examples (as hidden layer activations).
• The representations in neural networks are learned directly to facilitate the end-to-end task.
• A simple classifier (output unit) suffices to solve complex classification tasks if it operates on the hidden layer representations.

Feedforward Neural Networks, Back Propagation, and Stochastic Gradient Descent (SGD)

Simple Example  This simple neural network is made up of L hidden layers, but each layer consists of only one unit, and each unit has activation function f.

[Figure: a chain network x → (w1, z1, f1) → · · · → (wL, zL, fL) → y.]

    z1 = x w1,
    f1 = tanh(x w1),
    . . .
    fL = tanh(fL−1 wL),
    L(y, fL) = Loss(y, fL) = (1/2) (y − fL)².

For i = 2, . . . , L: zi = fi−1 wi where fi−1 = f(zi−1). Also, y is the true value and fL is the output of the neural network.

Gradient Descent  The gradient descent update rule for the parameter wi is

    wi ← wi − η · ∇wi L(y, fL),

where η is the learning rate. For instance, we have

    ∂L/∂w1 = (∂f1/∂w1) (∂L/∂f1),
    ∂f1/∂w1 = (1 − tanh²(x w1)) x = (1 − f1²) x,
    ∂L/∂f1 = (∂L/∂f2) (∂f2/∂f1) = (1 − f2²) w2 (∂L/∂f2).

Thus, when we back-propagate, we get

    ∂L/∂w1 = x (1 − f1²) · · · (1 − fL²) w2 · · · wL (fL − y).

Note that the above derivation applies to tanh activation.

Backpropagation  Consider the L-layer neural network below.

[Figure: a network with Layer 0 (inputs x1, x2, x3), hidden Layers 1 and 2, and Layer 3 with the single output activation a31.]

We have the following notations:

• w^ℓ_jk is the weight for the connection from the k-th neuron in the (ℓ − 1)-th layer to the j-th neuron in the ℓ-th layer.
• b^ℓ_j is the bias of the j-th neuron in the ℓ-th layer.
• a^ℓ_j is the activation of the j-th neuron in the ℓ-th layer.

If the activation function is f and the loss function we are minimizing is C, then the equations describing the network are:

    a^ℓ_j = f( Σ_k w^ℓ_jk a^{ℓ−1}_k + b^ℓ_j ),
    Loss = C(a^L).

Let the weighted inputs to the d neurons in layer ℓ be defined as

    z^ℓ ≡ w^ℓ a^{ℓ−1} + b^ℓ,   where z^ℓ ∈ R^d.

Then the activation of layer ℓ is also written as a^ℓ ≡ f(z^ℓ). Also, let δ^ℓ_j ≡ ∂C/∂z^ℓ_j denote the error of neuron j in layer ℓ. Then δ^ℓ ∈ R^d denotes the full vector of errors associated with layer ℓ.

Equations of Backpropagation

    δ^L = ∇_a C ⊙ f′(z^L),
    δ^ℓ = ( (w^{ℓ+1})^⊤ δ^{ℓ+1} ) ⊙ f′(z^ℓ),
    ∂C/∂b^ℓ_j = δ^ℓ_j,
    ∂C/∂w^ℓ_jk = a^{ℓ−1}_k δ^ℓ_j.

The symbol ⊙ represents the Hadamard (element-wise) product:

    [a b; c d] ⊙ [e f; g h] = [ae bf; cg dh].

Recurrent Neural Networks

Temporal/Sequence Problems

• Sequence prediction problems can be recast in a form amenable to feedforward neural networks.
• We have to engineer how history is mapped to a vector (representation). This vector is then fed into, e.g., a neural network.
• We would like to learn how to encode the history into a vector.

Key Concepts

• Encoding – e.g., mapping a sequence to a vector.
• Decoding – e.g., mapping a vector to, e.g., a sequence.
• Introduce adjustable "lego pieces" and optimize them for end-to-end performance.

[Figure: an encoder cell parameterized by θ maps the previous context (or state) s_{t−1} and the current input to the new context (or state) s_t.]

Example: Encoding Sentences  Let's say we want to encode the incomplete sentence "Efforts and courage are not". First, we have to represent the first word as a vector (say, a one-hot vector). This will be x1. Then,

    s1 = tanh( W^{s,x} x1 ).

The second word will be x2, and we compute s2:

    s2 = tanh( W^{s,s} s1 + W^{s,x} x2 ).

We continue this process until we've encoded all the words in the sentence. We can visualize this as follows:

[Figure: the words "Efforts and courage are not . . ." enter as vectors x1, . . . , x5; the same encoder "lego piece" is applied at each step, updating the states s0, s1, . . . , s5, so that the final state represents the sentence as a vector.]

Differences from standard feedforward architecture

• Input is received at each layer (per word), not just at the beginning as in a typical feedforward network.
• The number of layers varies and depends on the length of the sentence.
• Parameters of each layer (representing an application of an RNN) are shared (same RNN at each step).

Basic RNN

    st = tanh( W^{s,s} s_{t−1} + W^{s,x} xt )

Simple Gated RNN

    gt = sigmoid( W^{g,s} s_{t−1} + W^{g,x} xt )
    st = (1 − gt) ⊙ s_{t−1} + gt ⊙ tanh( W^{s,s} s_{t−1} + W^{s,x} xt )

Long Short-Term Memory (LSTM)

    ft = sigmoid( W^{f,h} h_{t−1} + W^{f,x} xt )      forget gate
    it = sigmoid( W^{i,h} h_{t−1} + W^{i,x} xt )      input gate
    ot = sigmoid( W^{o,h} h_{t−1} + W^{o,x} xt )      output gate
    ct = ft ⊙ c_{t−1} + it ⊙ tanh( W^{c,h} h_{t−1} + W^{c,x} xt )      memory cell
    ht = ot ⊙ tanh(ct)      visible state
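To make the chain-rule computation in the Simple Example concrete, a sketch (my own code) of the forward pass and of ∂L/∂w1 for the single-unit-per-layer network with tanh activations and L(y, fL) = (1/2)(y − fL)²:

    import numpy as np

    def forward(x, w):
        """Forward pass of the chain network: f_1 = tanh(x w_1), f_i = tanh(f_{i-1} w_i)."""
        f = []
        a = x
        for wi in w:
            a = np.tanh(a * wi)
            f.append(a)
        return f  # f[-1] is the network output f_L

    def grad_w1(x, w, y):
        """dL/dw_1 for L(y, f_L) = (1/2)(y - f_L)^2, using the chain rule derived above."""
        f = forward(x, w)
        g = f[-1] - y                        # dL/df_L
        for i in range(len(w) - 1, 0, -1):   # propagate back through layers L, ..., 2
            g *= (1 - f[i] ** 2) * w[i]      # df_i/dz_i * dz_i/df_{i-1}
        return g * (1 - f[0] ** 2) * x       # df_1/dz_1 * dz_1/dw_1

    x, y = 0.5, 1.0
    w = [0.8, -1.2, 0.3]
    print(grad_w1(x, w, y))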
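A minimal sketch of the basic and gated RNN state updates above (bias terms omitted, as in the equations; the function names and the encode helper are mine):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def basic_rnn_step(s_prev, x_t, W_ss, W_sx):
        """s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)."""
        return np.tanh(W_ss @ s_prev + W_sx @ x_t)

    def gated_rnn_step(s_prev, x_t, W_gs, W_gx, W_ss, W_sx):
        """Simple gated RNN: the gate g_t mixes the old state with a candidate update."""
        g_t = sigmoid(W_gs @ s_prev + W_gx @ x_t)
        return (1.0 - g_t) * s_prev + g_t * np.tanh(W_ss @ s_prev + W_sx @ x_t)

    def encode(X, W_ss, W_sx):
        """Encode a sentence: start from s_0 = 0 and fold in one word vector x_t at a time."""
        s = np.zeros(W_ss.shape[0])
        for x_t in X:            # X is a sequence of word vectors x_1, ..., x_T
            s = basic_rnn_step(s, x_t, W_ss, W_sx)
        return s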
Markov Language Models  Let V denote the set of possible words/symbols. The maximum likelihood estimate is obtained as normalized counts of successive word occurrences (matching statistics):

    P̂(w′ | w) = count(w, w′) / Σ_{w̃} count(w, w̃).

Feature-based Markov Model  We can also represent the Markov model as a feedforward neural network (very extendable). We define a one-hot vector, φ(w_{i−1}), corresponding to the previous word. This will be an input to the feedforward neural network.

[Figure: the one-hot input x = φ(w_{i−1}) is fed through the weights W to produce outputs p1, p2, p3, . . . , pk.]

In the figure,

    pk = P(wi = k | w_{i−1})

is the probability of the next word, given the previous word. The aggregate input to the k-th output unit is

    zk = Σ_j xj Wjk + W0k.

These input values are not probabilities. A typical transformation is the softmax transformation:

    pk = e^{zk} / Σ_j e^{zj}.

RNNs for Sequences  Our RNN now also produces an output (e.g., a word) as well as updating its state.

[Figure: at each step the previous state is mapped (through θ) to a new state together with an output distribution over words, e.g., [0.1, 0.3, . . . , 0.2].]

Convolutional Neural Networks

Problem  Image classification:

• The presence of objects may vary in location across different images.

Patch classifier/filter  The patch classifier goes through the entire image. We can think of the weights as the image that the unit prefers to see.

Convolution  The convolution is an operation between two functions f and g:

    (f ∗ g)(t) ≡ ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ.

Intuitively, convolution blends the two functions f and g by expressing the amount of overlap of one function as it is shifted over another function.

Discrete Convolution  For discrete functions, we can define the convolution as

    (f ∗ g)[n] ≡ Σ_{m=−∞}^{+∞} f[m] g[n − m].

Pooling  We wish to know whether a feature was there but not exactly where it was.

Pooling (Max)  Pooling region and stride may vary.

• Pooling induces translation invariance at the cost of spatial resolution.
• Stride reduces the size of the resulting feature map.
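For the discrete convolution and max pooling described in the Convolutional Neural Networks block above, a 1-D sketch (np.convolve computes exactly the sum in the definition; max_pool1d and the example array are mine):

    import numpy as np

    def conv1d(f, g):
        """Discrete convolution (f * g)[n] = sum_m f[m] g[n - m] (full overlap, 1-D)."""
        return np.convolve(f, g)

    def max_pool1d(a, size, stride):
        """Max pooling: take the maximum over windows of `size`, moving by `stride`."""
        return np.array([a[i:i + size].max() for i in range(0, len(a) - size + 1, stride)])

    a = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 4.0])
    print(conv1d(a, np.array([1.0, -1.0])))     # a finite-difference filter
    print(max_pool1d(a, size=2, stride=2))      # [3., 5., 4.]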
Unsupervised Learning

Clustering

Training set  We are provided a training set, but with no labels:

    Sn = { x^(i), i = 1, . . . , n },

and the goal is to find structure in the data. The clustering output consists of:

• A partition of indices {1, . . . , n} into K sets, C1, . . . , CK
• Representatives in each of the K partition sets, given as z1, . . . , zK

Example: Google News.

Example: Image Quantization  [Figure: an original image next to its compressed (quantized) version.]

Generative Models  The probability that a model M generates a certain word w ∈ W is

    P(w | θ) = θw,   θw ≥ 0,   Σ_{w∈W} θw = 1.

For the training set Sn = { x^(i), i = 1, . . . , n }, …
Markov Decision Processes (MDPs)

• reward functions R(s, a, s′), representing the reward for starting in state s, taking action a and ending up in state s′ after one step. (The reward function may also depend only on s, or only on s and a.)

Property  MDPs satisfy the Markov property in that the transition probabilities and rewards depend only on the current state and action, and remain unchanged regardless of the history (i.e., past states and actions) that leads to the current state.

Utility Function  The main problem for MDPs is to optimize the agent's behavior. We first need to specify the criterion that we are trying to maximize in terms of accumulated rewards. We define a utility function and maximize its expectation.

• Finite horizon based utility: The utility function is the sum of rewards after acting for a fixed number n of steps. When the rewards depend only on the states, the utility function is

      U[s0, s1, . . . , sn] = Σ_{i=0}^{n} R(si).

• (Infinite horizon) discounted reward based utility: In this setting, the reward one step into the future is discounted by a factor γ, the reward two steps ahead by γ², and so on. The goal is to continue acting (without an end) while maximizing the expected discounted reward. The discounting allows us to focus on near term rewards, and control this focus by changing γ. If the rewards depend only on the states, the utility function is

      U[s0, s1, . . . ] = Σ_{k=0}^{∞} γ^k R(sk).

Optimal Policy  A policy is a function π : S → A that assigns an action π(s) to any state s. Given an MDP and a utility function U[s0, s1, . . . , sn], our goal is to find an optimal policy function that maximizes the expectation of the utility. We denote the optimal policy by π*.

Bellman Equations

Value Function  Denote by Q*(s, a) the expected reward starting at s, taking action a and acting optimally. The value function V*(s) is the expected reward starting at state s and acting optimally.

The Bellman Equations  These equations connect the notion of the value of a state and the value of a policy:

    V*(s) = max_a Q*(s, a) = Q*(s, π*(s)),
    Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].

We can define V*(s) recursively to get

    V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].

Convergence  This algorithm will converge as long as γ < 1.

Q-Value Iteration  We can directly operate at the level of Q-values. Q-value iteration is a reformulation of the value iteration algorithm.

Update Rule

    Q*_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*_k(s′, a′) ].

Reinforcement Learning

MDP vs. RL  In MDPs, we are given 4 quantities ⟨S, A, T, R⟩. In reinforcement learning, we are given only the states and actions ⟨S, A⟩. In the real world, transitions and rewards might not be directly available and they need to be estimated.

Estimation  Consider a random variable X. The goal is to estimate

    E[f(X)] = Σ_x p(x) f(x).

We have access to K samples: xi, i = 1, . . . , K.

Model-based Learning

    p̂(xi) = count(xi) / K,
    E[f(X)] ≈ Σ_{i=1}^{K} p̂(xi) f(xi).

Model-free Learning

    E[f(X)] ≈ (1/K) Σ_{i=1}^{K} f(xi).

Q-Value Iteration for RL

1. Initialization: Q(s, a) = 0 ∀ s, a
2. Iterate until convergence:
   (a) Collect sample: s, a, s′, R(s, a, s′)
   (b) Update:

       Q_{i+1}(s, a) ← α [ R(s, a, s′) + γ max_{a′} Q_i(s′, a′) ] + (1 − α) Q_i(s, a)
                     = Q_i(s, a) + α [ R(s, a, s′) + γ max_{a′} Q_i(s′, a′) − Q_i(s, a) ].

Recommended Resources

• Introduction to Machine Learning with Python (Müller and Guido)
• Machine Learning with Python – From Linear Models to Deep Learning [Lecture Slides] (http://www.edx.org)
• LaTeX File (github.com/mynameisjanus/686xMachineLearning)

Please share this cheat sheet with friends!