learning2

Mathematics For AI

Uploaded by

Surya Basnet

Lecture 3: Machine Learning 2

Roadmap

Stochastic Gradient Descent

Non-linear features

Neural networks

Feature templates

CS221 2
• In this module, we will introduce stochastic gradient descent.
Gradient descent is slow
$$\text{TrainLoss}(w) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(x,y) \in \mathcal{D}_{\text{train}}} \text{Loss}(x, y, w)$$

Algorithm: gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
    w ← w − η∇w TrainLoss(w)

Problem: each iteration requires going over all training examples, which is expensive when we have lots of data!

• So far, we’ve seen gradient descent as a general-purpose algorithm to optimize the training loss.
• But one problem with gradient descent is that it is slow.
• Recall that the training loss is a sum over the training data. If we have one million training examples, then each gradient computation requires
going through those one million examples, and this must happen before we can make any progress.
• Can we make progress before seeing all the data?
Stochastic gradient descent
$$\text{TrainLoss}(w) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(x,y) \in \mathcal{D}_{\text{train}}} \text{Loss}(x, y, w)$$

Algorithm: stochastic gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
    For (x, y) ∈ Dtrain :
        w ← w − η∇w Loss(x, y, w)

• The answer is stochastic gradient descent (SGD).
• Rather than looping through all the training examples to compute a single gradient and making one step, SGD loops through the examples
(x, y) and updates the weights w based on each example.
• Each update is not as good because we’re only looking at one example rather than all the examples, but we can make many more updates
this way.
• Aside: there is a continuum between SGD and GD called minibatch SGD, where each update consists of an average over B examples.
• Aside: There are other variants of SGD. You can randomize the order in which you loop over the training data in each iteration. Think about
why this is important if in your training data, you had all the positive examples first and the negative examples after that.
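The SGD loop and the two asides (minibatching and shuffling) can be sketched in plain Python as follows. This is a minimal sketch, not the course's code: the names `minibatch_sgd`, `grad_loss` and `squared_loss_grad`, the toy data, and the default values are all illustrative assumptions.

```python
import random

def minibatch_sgd(data, grad_loss, d, eta=0.1, epochs=10, batch_size=1, seed=0):
    """Minimize the average of Loss(x, y, w) over data.

    grad_loss(x, y, w) returns the gradient of the loss on one example as a
    list of d floats. batch_size=1 is plain SGD; batch_size=len(data) is
    ordinary gradient descent.
    """
    rng = random.Random(seed)
    w = [0.0] * d
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)  # randomize the order on every pass
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            # Average the per-example gradients over the minibatch.
            g = [0.0] * d
            for x, y in batch:
                for j, gj in enumerate(grad_loss(x, y, w)):
                    g[j] += gj / len(batch)
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

def squared_loss_grad(x, y, w):
    # Gradient of (w . x - y)^2 with respect to w.
    residual = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [2 * residual * xj for xj in x]

# Toy data generated by y = 2 * x with no noise, using phi(x) = [x].
toy = [([x], 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_sgd = minibatch_sgd(toy, squared_loss_grad, d=1, eta=0.1, epochs=50)
w_gd = minibatch_sgd(toy, squared_loss_grad, d=1, eta=0.1, epochs=50,
                     batch_size=len(toy))
```

With `batch_size=1` the loop makes one update per example, and with `batch_size=len(toy)` it makes one update per pass, so the same function covers both ends of the continuum described above.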
Step size
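$w \leftarrow w - \eta \nabla_w \text{Loss}(x, y, w)$, where $\eta$ is the step size

Question: what should η be?

[Figure: η on an axis from 0 to 1; values near 0 are conservative and more stable, values near 1 are aggressive and faster]

Strategies:
• Constant: $\eta = 0.1$
• Decreasing: $\eta = 1/\sqrt{\text{\# updates made so far}}$
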
• One remaining issue is choosing the step size, which in practice is quite important.
• Generally, larger step sizes are like driving fast. You can get faster convergence, but you might also get very unstable results and crash and
burn.
• On the other hand, with smaller step sizes you get more stability, but you might get to your destination more slowly. Note that the weights
do not change if η = 0.
• A suggested form for the step size is to set the initial step size to 1 and let the step size decrease as the inverse of the square root of the
number of updates we’ve taken so far.
• Aside: There are more sophisticated algorithms like AdaGrad and Adam that adapt the step size based on the data, so that you don’t have
to tweak it as much.
• Aside: There are some nice theoretical results showing that SGD is guaranteed to converge in this case (provided all your gradients are
bounded).
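The two strategies from the slide can be written as small functions. This is only a sketch: the function names are hypothetical, and it assumes the update counter includes the current update (so it starts at 1), which avoids dividing by zero on the first step.

```python
import math

def constant_step(num_updates, eta0=0.1):
    """Constant strategy: the same step size on every update."""
    return eta0

def decreasing_step(num_updates, eta0=1.0):
    """Decreasing strategy: eta0 / sqrt(number of updates made so far).

    Assumption: num_updates counts the current update too, so it starts at 1
    and the first step size is eta0.
    """
    return eta0 / math.sqrt(num_updates)

# Step sizes after 1, 4, 25 and 100 updates.
schedule = [round(decreasing_step(t), 3) for t in (1, 4, 25, 100)]
```

The step size shrinks slowly: it takes 100 updates to go from 1 down to 0.1.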
Stochastic gradient descent in Python

[code]

• Now let us code up stochastic gradient descent for linear regression in Python.
• First we generate a dataset large enough that speed actually matters: 1 million points with $x \sim \mathcal{N}(0, I)$ and $y \sim \mathcal{N}(w^* \cdot x, 1)$, where $w^*$ is the true weight vector, hidden from the algorithm.
• This way, we can diagnose whether the algorithm is actually working or not by checking whether it recovers something close to w∗ .
• Let’s first run gradient descent, and watch that it makes progress but it is very slow.
• Now let us implement stochastic gradient descent. It is much faster.
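The lecture's own code is not reproduced in this document, so what follows is a hedged reconstruction of the experiment described in the notes, not the original. It assumes the squared loss, a 5-dimensional true weight vector, only 5,000 points instead of 1 million (to keep pure Python fast), and an initial step size of 0.1; all names are illustrative.

```python
import math
import random

rng = random.Random(0)
true_w = [1.0, 2.0, 3.0, 4.0, 5.0]  # w*, hidden from the learning algorithm
d = len(true_w)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def generate_example():
    x = [rng.gauss(0, 1) for _ in range(d)]  # x ~ N(0, I)
    y = dot(true_w, x) + rng.gauss(0, 1)     # y ~ N(w* . x, 1)
    return x, y

# Assumption: 5,000 points rather than 1 million.
train = [generate_example() for _ in range(5000)]

def loss_gradient(x, y, w):
    # Gradient of the squared loss (w . x - y)^2 with respect to w.
    residual = dot(w, x) - y
    return [2 * residual * xj for xj in x]

def gradient_descent(num_iters, eta=0.1):
    w = [0.0] * d
    for _ in range(num_iters):
        grad = [0.0] * d
        for x, y in train:  # a full pass over the data for ONE update
            for j, gj in enumerate(loss_gradient(x, y, w)):
                grad[j] += gj / len(train)
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

def stochastic_gradient_descent(num_passes, eta0=0.1):
    w = [0.0] * d
    num_updates = 0
    for _ in range(num_passes):
        for x, y in train:  # one update PER example
            num_updates += 1
            eta = eta0 / math.sqrt(num_updates)  # decreasing step size
            grad = loss_gradient(x, y, w)
            w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

def distance_to_truth(w):
    return math.sqrt(sum((wj - tj) ** 2 for wj, tj in zip(w, true_w)))

# Both runs read the training data exactly once.
w_gd = gradient_descent(num_iters=1)
w_sgd = stochastic_gradient_descent(num_passes=1)
```

Comparing `distance_to_truth(w_gd)` with `distance_to_truth(w_sgd)` shows the point of the module: for the same single pass over the data, gradient descent has made one update while SGD has made 5,000.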
Summary
$$\text{TrainLoss}(w) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(x,y) \in \mathcal{D}_{\text{train}}} \text{Loss}(x, y, w)$$

[Figure: two panels, "gradient descent" and "stochastic gradient descent"]

Key idea: stochastic updates

It’s not about quality, it’s about quantity.

• In summary, we’ve shown how stochastic gradient descent can be faster than gradient descent.
• Gradient descent just spends too much time refining its gradient (quality), while you can get a quick and dirty estimate just from one sample and
make more updates (quantity).
• Of course, sometimes stochastic gradient descent can be unstable, and other techniques such as mini-batching can be used to stabilize it.
Roadmap

Stochastic Gradient Descent

Non-linear features

Neural networks

Feature templates

• In this module, we’ll show that even using the machinery of linear models, we can obtain much more powerful non-linear predictors.
Linear regression
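Training data:

x  y
1  1
2  3
4  3

training data → learning algorithm → predictor f (for example, input 3 → f → output 2.71)

Which predictors are possible? Hypothesis class:

$$\mathcal{F} = \{f_w(x) = w \cdot \phi(x) : w \in \mathbb{R}^d\}$$

$\phi(x) = [1, x]$

f(x) = [1, 0.57] · φ(x)
f(x) = [2, 0.2] · φ(x)

[Figure: the two example predictors drawn as lines, y against x for x from 0 to 5]
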
• We will look at regression and later turn to classification.
• Recall that in linear regression, given training data, a learning algorithm produces a predictor that maps new inputs to new outputs. The first
design decision: what are the possible predictors that the learning algorithm can consider (what is the hypothesis class)?
• For linear predictors, remember the hypothesis class is the set of predictors that map some input x to the dot product between some weight
vector w and the feature vector φ(x).
• As a simple example, if we define the feature extractor to be φ(x) = [1, x], then we can define various linear predictors with different intercepts
and slopes.
More complex data
[Figure: data points plotted as y against x, for x from 0 to 5]

How do we fit a non-linear predictor?

• But sometimes data might be more complex and not be easily fit by a linear predictor. In this case, what can we do?
• One immediate reaction might be to go to something fancier like neural networks or decision trees.
• But let’s see how far we can get with the machinery of linear predictors first.
Quadratic predictors

$\phi(x) = [1, x, x^2]$
Example: φ(3) = [1, 3, 9]

f(x) = [2, 1, −0.2] · φ(x)
f(x) = [4, −1, 0.1] · φ(x)
f(x) = [1, 1, 0] · φ(x)

$$\mathcal{F} = \{f_w(x) = w \cdot \phi(x) : w \in \mathbb{R}^3\}$$

[Figure: the three example predictors plotted as y against x, for x from 0 to 5]

Non-linear predictors just by changing φ

• The key observation is that the feature extractor φ can be arbitrary.
• So let us define it to include an $x^2$ term.
• Now, by setting the weights appropriately, we can define a non-linear (specifically, a quadratic) predictor.
• The first two examples of quadratic predictors vary in intercept, slope and curvature.
• Note that by setting the weight for feature $x^2$ to zero, we recover linear predictors.
• Again, the hypothesis class is the set of all predictors $f_w$ obtained by varying w.
• Note that the hypothesis class of quadratic predictors is a superset of the hypothesis class of linear predictors.
• In summary, we’ve seen our first example of obtaining non-linear predictors just by changing the feature extractor φ!
• Advanced: here $x \in \mathbb{R}$ is one-dimensional, so $x^2$ is just one additional feature. If $x \in \mathbb{R}^d$ were d-dimensional, then there would be $O(d^2)$ quadratic features of the form $x_i x_j$ for $i, j \in \{1, \ldots, d\}$. When d is large, $d^2$ can be prohibitively large, which is one reason that using the machinery of linear predictors to increase expressiveness can be problematic.
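To make the feature count concrete, here is a small sketch of a quadratic feature extractor for a d-dimensional input. The function names are hypothetical, and it assumes each unordered pair is kept once (i ≤ j), since $x_i x_j$ and $x_j x_i$ are the same feature.

```python
def quadratic_features(x):
    """Map x (a list of d numbers) to [1, x_1..x_d, all x_i * x_j with i <= j]."""
    d = len(x)
    phi = [1.0] + [float(xi) for xi in x]
    for i in range(d):
        for j in range(i, d):  # i <= j: x_i * x_j and x_j * x_i are the same feature
            phi.append(float(x[i] * x[j]))
    return phi

def predict(w, phi):
    # Linear predictor: the dot product w . phi(x).
    return sum(wj * pj for wj, pj in zip(w, phi))

phi_3 = quadratic_features([3])        # the slide's example phi(3) = [1, 3, 9]
f_3 = predict([2, 1, -0.2], phi_3)     # first example predictor evaluated at x = 3
num_features_d100 = len(quadratic_features([0] * 100))
```

For d = 100 this already gives 1 + 100 + 5050 = 5151 features, which illustrates the quadratic growth mentioned in the advanced note.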
Piecewise constant predictors

$\phi(x) = [\mathbf{1}[0 < x \le 1], \mathbf{1}[1 < x \le 2], \mathbf{1}[2 < x \le 3], \mathbf{1}[3 < x \le 4], \mathbf{1}[4 < x \le 5]]$

Example: φ(2.3) = [0, 0, 1, 0, 0]

f(x) = [1, 2, 4, 4, 3] · φ(x)
f(x) = [4, 3, 3, 2, 1.5] · φ(x)

$$\mathcal{F} = \{f_w(x) = w \cdot \phi(x) : w \in \mathbb{R}^5\}$$

[Figure: the two example predictors plotted as y against x, for x from 0 to 5]

Expressive non-linear predictors by partitioning the input space

• Quadratic predictors are still a bit restricted: they can only go up and then down smoothly (or vice-versa).
• We introduce another type of feature extractor which divides the input space into regions and allows the predicted value of each region to
vary independently, yielding piecewise constant predictors (see figure).
• Specifically, each component of the feature vector corresponds to one region (e.g., (0, 1]) and is 1 if x lies in that region and 0 otherwise.
• Assuming the regions are disjoint, the weight associated with a component/region is exactly the predicted value.
• As you make the regions smaller, you have more features, and the expressiveness of your hypothesis class increases. In the limit, you can
essentially capture any predictor you want.
• Advanced: what happens if x were not a scalar, but a d-dimensional vector? If each component gets broken up into B bins, then there
will be $B^d$ features! For each feature, we need to fit its weight, and there will in general be too few examples to fit all the features.
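The indicator features on the slide can be sketched as follows. The function names and the `edges` parameter are illustrative assumptions; the regions are the half-open intervals (lo, hi] used in the slide's definition of φ.

```python
def piecewise_constant_features(x, edges=(0, 1, 2, 3, 4, 5)):
    """One indicator feature per region (lo, hi]: 1 if lo < x <= hi, else 0."""
    return [1 if lo < x <= hi else 0 for lo, hi in zip(edges, edges[1:])]

def predict(w, phi):
    # With disjoint regions, this just picks out the weight of x's region.
    return sum(wj * pj for wj, pj in zip(w, phi))

phi_23 = piecewise_constant_features(2.3)   # slide example
f_23 = predict([1, 2, 4, 4, 3], phi_23)     # first example predictor at x = 2.3

def num_grid_features(bins_per_component, d):
    # B bins for each of d components gives B^d regions.
    return bins_per_component ** d
```

Since exactly one indicator is active, the prediction at x = 2.3 is simply the third weight. The last helper shows how quickly the count grows: 10 bins in 6 dimensions already means a million features.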
Predictors with periodicity structure

$\phi(x) = [1, x, x^2, \cos(3x)]$

Example: φ(2) = [1, 2, 4, 0.96]

f(x) = [1, 1, −0.1, 1] · φ(x)
f(x) = [3, −1, 0.1, 0.5] · φ(x)

$$\mathcal{F} = \{f_w(x) = w \cdot \phi(x) : w \in \mathbb{R}^4\}$$

[Figure: the two example predictors plotted as y against x, for x from 0 to 5]

Just throw in any features you want

• Quadratic and piecewise constant predictors are just two examples of an unboundedly large design space of possible feature extractors.
• Generally, the choice of features is informed by the prediction task that we wish to solve (either prior knowledge or preliminary data exploration).
• For example, if x represents time and we believe the true output y varies according to some periodic structure (e.g., traffic patterns repeat
daily, sales patterns repeat annually), then we might use periodic features such as cosine to capture these trends.
• Each feature might represent some type of structure in the data. If we have multiple types of structures, these can just be "thrown into"
the feature vector.
• Features represent what properties might be useful for prediction. If a feature is not useful, then the learning algorithm can assign a weight
close to zero to that feature. Of course, the more features one has, the harder learning becomes.
Linear in what?
Prediction:

fw (x) = w · φ(x)

Linear in w? Yes
Linear in φ(x)? Yes
Linear in x? No!

Key idea: non-linearity

• Expressiveness: score w · φ(x) can be a non-linear function of x

• Efficiency: score w · φ(x) always a linear function of w

• Wait a minute... how are we able to obtain non-linear predictors if we’re still using the machinery of linear predictors? It’s a linguistic sleight
of hand, as "linear" is ambiguous.
• The score w · φ(x) is linear in w and φ(x). However, the score is not linear in x (linearity in x might not even make sense, because x need not be a
vector at all: it could be a string or a PDF file).
• The significance is as follows: From the feature extractor’s viewpoint, we can define arbitrary features that yield very non-linear functions in
x.
• From the learning algorithm’s viewpoint (which only looks at φ(x), not x), linearity enables efficient weight optimization.
• Advanced: if the score is linear in w and the loss function Loss is convex (which holds for the squared, hinge, logistic losses but not the
zero-one loss), then minimizing the training loss TrainLoss is a convex optimization problem, and gradient descent with a proper step size is
guaranteed to converge to the global minimum.
Linear classification

$\phi(x) = [x_1, x_2]$
$f(x) = \text{sign}([-0.6, 0.6] \cdot \phi(x))$

[Figure: decision boundary in the $(x_1, x_2)$ plane, both axes from −3 to 3]

Decision boundary is a line

• Now let’s turn from regression to classification.
• The story is pretty much the same: you can define arbitrary features to yield non-linear classifiers.
• Recall that in binary classification, the classifier (predictor) returns the sign of the score.
• The classifier can therefore be represented by its decision boundary, which divides the input space into two regions: points with positive
score and points with negative score.
• Note that the classifier fw (x) is a non-linear function of x (and φ(x)) no matter what (due to the sign function), so it is not helpful to talk
about whether fw is linear or non-linear. Instead we will ask whether the decision boundary corresponding to fw is linear or not.
Quadratic classifiers
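$\phi(x) = [x_1, x_2, x_1^2 + x_2^2]$

$f(x) = \text{sign}([2, 2, -1] \cdot \phi(x))$

Equivalently:

$$f(x) = \begin{cases} 1 & \text{if } (x_1 - 1)^2 + (x_2 - 1)^2 \le 2 \\ -1 & \text{otherwise} \end{cases}$$

[Figure: decision boundary in the $(x_1, x_2)$ plane, both axes from −3 to 3]

Decision boundary is a circle
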
• Let us see how we can define a classifier with a non-linear decision boundary.
• Let’s try to construct a feature extractor that induces a decision boundary that is a circle: the inside is classified +1 and the outside is
classified -1.
• We will add a new feature $x_1^2 + x_2^2$ into the feature vector, and define the weights to be as follows.
• Then rewrite the classifier to make it clear that it is the equation for the interior of a circle with radius $\sqrt{2}$ centered at (1, 1).
• As a sanity check, you can see that x = [0, 0] results in a score of 0, which means that it is on the decision boundary. And as either of $x_1$
or $x_2$ grows in magnitude (either $|x_1| \to \infty$ or $|x_2| \to \infty$), the contribution of the third feature dominates and the sign of the score will be
negative.
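The equivalence between the weighted-feature form and the circle form can be checked numerically. The sketch below compares the two on an integer grid; the function names are hypothetical, and it assumes a score of exactly 0 is labeled +1, to match the "≤ 2" in the circle form.

```python
def phi(x):
    x1, x2 = x
    return [x1, x2, x1 ** 2 + x2 ** 2]

W = [2, 2, -1]  # weights from the slide

def score(x):
    return sum(wj * pj for wj, pj in zip(W, phi(x)))

def classify(x):
    # Assumption: a score of exactly 0 is labeled +1 (boundary counts as inside).
    return 1 if score(x) >= 0 else -1

def classify_circle(x):
    # Same classifier written as the interior of a circle centered at (1, 1).
    x1, x2 = x
    return 1 if (x1 - 1) ** 2 + (x2 - 1) ** 2 <= 2 else -1

# Integer grid over the plotted range, so all arithmetic is exact.
grid = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]
agree = all(classify(p) == classify_circle(p) for p in grid)
```

The two forms agree because $2x_1 + 2x_2 - x_1^2 - x_2^2 = 2 - (x_1 - 1)^2 - (x_2 - 1)^2$, which follows from completing the square.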
Visualization in feature space
Input space: x = [x1 , x2 ], decision boundary is a circle

Feature space: $\phi(x) = [x_1, x_2, x_1^2 + x_2^2]$, decision boundary is a hyperplane

• Let’s try to understand the relationship between the non-linearity in x and linearity in φ(x).
• Click on the image to see the linked video (which is about polynomial kernels and SVMs, but the same principle applies here).
• In the input space x, the decision boundary which separates the red and blue points is a circle.
• We can also visualize the points in feature space, where each point is given an additional dimension $x_1^2 + x_2^2$.
• In this three-dimensional feature space, a linear predictor (which is now defined by a hyperplane instead of a line) can in fact separate the red
and blue points.
• This corresponds to the non-linear predictor in the original two-dimensional space.
Summary

fw (x) = w · φ(x)
linear in w, φ(x)
non-linear in x

• Regression: non-linear predictor, classification: non-linear decision boundary

• Types of non-linear features: quadratic, piecewise constant, etc.

Non-linear predictors with linear machinery

• To summarize, we have shown that the term "linear" is ambiguous: a predictor in regression is non-linear in the input x but is linear in the
feature vector φ(x).
• The score is also linear with respect to the weights w, which is important for efficient learning.
• Classification is similar, except we talk about (non-)linearity of the decision boundary.
• We also saw many types of non-linear predictors that you could create by concocting various features (quadratic predictors, piecewise constant
predictors).
• So next time someone on the street asks you about linear predictors, you should first ask them "linear in what?"
Roadmap

Stochastic Gradient Descent

Non-linear features

Neural networks

Feature templates

• In this module, I will present neural networks, a way to construct non-linear predictors via problem decomposition.
Non-linear predictors
Linear predictors:
$f_w(x) = w \cdot \phi(x)$, $\phi(x) = [1, x]$

Non-linear (quadratic) predictors:
$f_w(x) = w \cdot \phi(x)$, $\phi(x) = [1, x, x^2]$

Non-linear neural networks:
$f_w(x) = w \cdot \sigma(V\phi(x))$, $\phi(x) = [1, x]$

[Figure: one example plot per predictor type, y against x for x from 0 to 5]

• Recall that our first hypothesis class was linear (in x) predictors, which for regression means that the predictors are lines.
• However, we also showed that you could get non-linear (in x) predictors by simply changing the feature extractor φ. For example, by adding
the feature $x^2$, one obtains quadratic predictors.
• One disadvantage of this approach is that if x were d-dimensional, one would need $O(d^2)$ features and corresponding weights, which presents
considerable computational and statistical challenges.
• We will show that with neural networks, we can leave the feature extractor alone, but increase the complexity of the predictor, which can also
produce non-linear (though not necessarily quadratic) predictors.
• It is a common misconception that neural networks allow you to express more complex predictors. You can define φ to include essentially all
predictors (as is done in kernel methods).
• Rather, neural networks yield non-linear predictors in a more compact way. For instance, you might not need $O(d^2)$ features to represent the
desired non-linear predictor.
Motivating example
Example: predicting car collision

Input: positions of two oncoming cars x = [x1 , x2 ]

Output: whether safe (y = +1) or collide (y = −1)

Unknown: safe if cars sufficiently far: y = sign(|x1 − x2 | − 1)

x1  x2  y
0   2   +1
2   0   +1
0   0   −1
2   2   −1

[Figure: decision boundary in the $(x_1, x_2)$ plane, both axes from −3 to 3]

• As a motivating example, consider the problem of predicting whether two cars are going to collide given their positions (measured as the
distance from one side of the road). In particular, let $x_1$ be the position of one car and $x_2$ be the position of the other car.
• Suppose the true output is 1 (safe) whenever the cars are separated by a distance of at least 1. This relationship can be represented by
the decision boundary which labels all points in the interior region between the two red lines as negative, and everything on the exterior (on
either side) as positive. Of course, this true input-output relationship is unknown to the learning algorithm, which only sees training data.
Consider a simple training dataset consisting of four points. (This is essentially the famous XOR problem that was impossible to fit using
linear classifiers.)
Decomposing the problem
Test if car 1 is far right of car 2:
$h_1(x) = \mathbf{1}[x_1 - x_2 \ge 1]$

Test if car 2 is far right of car 1:
$h_2(x) = \mathbf{1}[x_2 - x_1 \ge 1]$

Safe if at least one is true:
$f(x) = \text{sign}(h_1(x) + h_2(x))$

[Figure: the boundaries of $h_1(x)$ and $h_2(x)$ drawn as two lines in the $(x_1, x_2)$ plane, both axes from −3 to 3]

x       h1(x)  h2(x)  f(x)
[0, 2]  0      1      +1
[2, 0]  1      0      +1
[0, 0]  0      0      −1
[2, 2]  0      0      −1

• One way to motivate neural networks (without appealing to the brain) is problem decomposition.
• The intuition is to break up the full problem into two subproblems: the first subproblem tests if car 1 is to the far right of car 2; the second
subproblem tests if car 2 is to the far right of car 1. Then the final output is 1 iff at least one of the two subproblems returns 1.
• Concretely, we can define h1 (x) to be the output of the first subproblem, which is a simple linear decision boundary (in fact, the right line in
the figure).
• Analogously, we define h2 (x) to be the output of the second subproblem.
• Note that h1 (x) and h2 (x) take on values 0 or 1 instead of -1 or +1.
• The points can then be classified by first computing h1 (x) and h2 (x), and then combining the results into f (x).
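The decomposition can be written directly as three small functions that reproduce the table on the slide. This is a sketch; it assumes sign(0) is treated as −1, since the table assigns −1 when both subproblems return 0.

```python
def h1(x):
    # Is car 1 far to the right of car 2?
    return 1 if x[0] - x[1] >= 1 else 0

def h2(x):
    # Is car 2 far to the right of car 1?
    return 1 if x[1] - x[0] >= 1 else 0

def f(x):
    # Safe (+1) if at least one subproblem fires.
    # Assumption: sign(0) is taken to be -1, as in the slide's table.
    return 1 if h1(x) + h2(x) > 0 else -1

points = ([0, 2], [2, 0], [0, 0], [2, 2])
table = [(x, h1(x), h2(x), f(x)) for x in points]
```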
Rewriting using vector notation
Intermediate subproblems:

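$h_1(x) = \mathbf{1}[x_1 - x_2 \ge 1] = \mathbf{1}[[-1, +1, -1] \cdot [1, x_1, x_2] \ge 0]$

$h_2(x) = \mathbf{1}[x_2 - x_1 \ge 1] = \mathbf{1}[[-1, -1, +1] \cdot [1, x_1, x_2] \ge 0]$

$$h(x) = \mathbf{1}\left[\begin{bmatrix} -1 & +1 & -1 \\ -1 & -1 & +1 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \ge 0\right]$$

Predictor:

$f(x) = \text{sign}(h_1(x) + h_2(x)) = \text{sign}([1, 1] \cdot h(x))$
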
• Now let us rewrite this predictor f (x) using vector notation.
• We can define a feature vector [1, x1 , x2 ] and a corresponding weight vector, where the dot product thresholded yields exactly h1 (x).
• We do the same for h2 (x).
• We put the two subproblems into one equation by stacking the weight vectors into one matrix. Recall that left-multiplication by a matrix is
equivalent to taking the dot product with each row. By convention, the thresholding at 0 (1[· ≥ 0]) applies component-wise.
• Finally, we can define the predictor in terms of a simple dot product.
• Now of course, we don’t know the weight vectors, but we can learn them from the training data!
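The same predictor in stacked form, as a sketch in plain Python lists: each row of `V` is one subproblem's weight vector, and the threshold is applied component-wise. The variable names follow the notation introduced later in the lecture (V for the matrix, w for the final weight vector); as on the decomposition slide's table, sign(0) is assumed to be −1.

```python
V = [[-1, +1, -1],   # row for h1: -1 + x1 - x2 >= 0
     [-1, -1, +1]]   # row for h2: -1 - x1 + x2 >= 0
w = [1, 1]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    return [1, x[0], x[1]]

def h(x):
    # Matrix-vector product row by row, then threshold each component at 0.
    return [1 if dot(row, phi(x)) >= 0 else 0 for row in V]

def f(x):
    # Assumption: sign(0) is taken to be -1, as in the slide's table.
    return 1 if dot(w, h(x)) > 0 else -1

points = ([0, 2], [2, 0], [0, 0], [2, 2])
hidden = [h(x) for x in points]
labels = [f(x) for x in points]
```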
Avoid zero gradients
Problem: gradient of $h_1(x)$ with respect to $v_1$ is 0
$h_1(x) = \mathbf{1}[v_1 \cdot \phi(x) \ge 0]$

Solution: replace with an activation function σ with non-zero gradients

• Threshold: $\mathbf{1}[z \ge 0]$
• Logistic: $\frac{1}{1 + e^{-z}}$
• ReLU: $\max(z, 0)$

[Figure: the three activation functions σ(z) plotted against $z = v_1 \cdot \phi(x)$, for z from −5 to 5]

$h_1(x) = \sigma(v_1 \cdot \phi(x))$

• Later we’ll show how to perform learning using gradient descent, but we can anticipate one problem, which we encountered when we tried to
optimize the zero-one loss.
• The gradient of h1 (x) with respect to v1 is always zero because of the threshold function.
• To fix this, we replace the threshold function with an activation function with non-zero gradients.
• Classically, neural networks used the logistic function σ(z), which looks roughly like the threshold function but has non-zero gradients
everywhere.
• Even though the gradients are non-zero, they can be quite small when |z| is large (a phenomenon known as saturation). This makes optimizing
with the logistic function still difficult.
• In 2011, Glorot et al. introduced the ReLU activation function, which is simply max(z, 0). This has the advantage that at least on the
positive side, the gradient does not vanish (though on the negative side, the gradient is always zero). As a bonus, ReLU is easier to compute
(only max, no exponentiation). In practice, ReLU works well and has become the activation function of choice.
• Note that if the activation function were linear (e.g., the identity function), then the gradients would always be nonzero, but you would lose
the power of a neural network, because you would simply get the product of the final-layer weight vector and the weight matrix ($w^\top V$),
which is equivalent to optimizing over a single weight vector.
• Therefore, there is a tension between wanting an activation function that is non-linear and wanting one that has non-zero gradients.
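The three activation functions and their derivatives can be sketched as below, which makes the zero-gradient and saturation issues easy to inspect numerically. The function names are illustrative, and the value chosen for the ReLU derivative at z = 0 (where it is not differentiable) is an assumption.

```python
import math

def threshold(z):
    return 1.0 if z >= 0 else 0.0

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def threshold_grad(z):
    # Zero wherever the derivative is defined (it is undefined at z = 0).
    return 0.0

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Assumption: use 0 at z = 0, where ReLU is not differentiable.
    return 1.0 if z > 0 else 0.0

# (threshold, logistic, ReLU) gradients at a few values of z.
grads = {z: (threshold_grad(z), logistic_grad(z), relu_grad(z))
         for z in (-10.0, 0.0, 10.0)}
```

At z = ±10 the logistic gradient is non-zero but tiny (saturation), while the ReLU gradient is exactly 1 on the positive side and exactly 0 on the negative side.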
Two-layer neural networks
Intermediate subproblems:
$h(x) = \sigma(V\phi(x))$

Predictor (classification):
$f_{V,w}(x) = \text{sign}(w \cdot h(x))$

Interpret h(x) as a learned feature representation!

Hypothesis class:
$$\mathcal{F} = \{f_{V,w} : V \in \mathbb{R}^{k \times d}, w \in \mathbb{R}^k\}$$

• Now we are finally ready to define the hypothesis class of two-layer neural networks.
• We start with a feature vector φ(x).
• We multiply it by a weight matrix V (whose rows can be interpreted as the weight vectors of the k intermediate subproblems).
• Then we apply the activation function σ to each of the k components to get the hidden representation $h(x) \in \mathbb{R}^k$.
• We can actually interpret h(x) as a learned feature vector (representation), which is derived from the original non-linear feature vector φ(x).
• Given h(x), we take the dot product with a weight vector w to get the score used to drive either regression or classification.
• The hypothesis class is the set of all such predictors obtained by varying the first-layer weight matrix V and the second-layer weight vector
w.
Deep neural networks
1-layer neural network:
$\text{score} = w \cdot \phi(x)$

2-layer neural network:
$\text{score} = w \cdot \sigma(V\phi(x))$

3-layer neural network:
$\text{score} = w \cdot \sigma(V_2 \sigma(V_1 \phi(x)))$

• We can push these ideas to build deep neural networks, which are neural networks with many layers.
• Warm up: for a one-layer neural network (a.k.a. a linear predictor), the score that drives prediction is simply a dot product between a weight
vector and a feature vector.
• We just saw that for a two-layer neural network, we apply a linear layer V first, followed by a non-linearity σ, and then take the dot product.
• To obtain a three-layer neural network, we apply another linear layer and non-linearity (this is the basic building block). This can be iterated any
number of times. No matter how deep the neural network is, the top layer is always a linear function, and all the layers below that can be
interpreted as defining a (possibly very complex) hidden feature vector.
• In practice, you would also have a bias term (e.g., Vφ(x) + b). We have omitted all bias terms for notational simplicity.
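A forward pass for networks of any depth can be sketched in a few lines. The function and variable names and the small weight matrices are made up for illustration, bias terms are omitted as in the notes, and ReLU is assumed as the default activation.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def relu(z):
    return max(z, 0.0)

def identity(z):
    return z

def network_score(w, hidden_layers, phi_x, sigma=relu):
    """score = w . sigma(V_k ... sigma(V_1 phi(x))), hidden_layers = [V_1, ..., V_k]."""
    h = list(phi_x)
    for V in hidden_layers:
        h = [sigma(z) for z in matvec(V, h)]  # linear layer, then non-linearity
    return dot(w, h)                          # top layer is always linear

phi_x = [1.0, 2.0]
V1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]   # 3 x 2 (made-up weights)
V2 = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0]]      # 2 x 3 (made-up weights)

one_layer = network_score([3.0, -1.0], [], phi_x)            # linear predictor
two_layer = network_score([1.0, 1.0, 1.0], [V1], phi_x)
three_layer = network_score([1.0, -1.0], [V1, V2], phi_x)

# With the identity activation, two layers collapse to one weight vector w^T V.
w = [1.0, 1.0, 1.0]
linear_two_layer = network_score(w, [V1], phi_x, sigma=identity)
collapsed_w = [sum(wk * V1[k][j] for k, wk in enumerate(w)) for j in range(2)]
```

The last two lines illustrate why a non-linear activation is needed: with the identity activation, `linear_two_layer` equals `dot(collapsed_w, phi_x)`, so the extra layer adds nothing.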
[figure from Honglak Lee]

Layers represent multiple levels of abstractions

• It can be difficult to understand what a sequence of (matrix multiply, non-linearity) operations buys you.
• To provide intuition, suppose the input feature vector φ(x) is a vector of all the pixels in an image.
• Then each layer can be thought of as producing an increasingly abstract representation of the input. The first layer detects edges, the second
detects object parts, the third detects objects. What is shown in the figure is, for each component j of the hidden representation h(x), the
input image φ(x) that maximizes the value of $h_j(x)$.
• Though we haven’t talked about learning neural networks, it turns out that the "levels of abstraction" story is actually borne out visually
when we learn neural networks on real data (e.g., images).
Why depth?
$\phi(x) \to h_1(x) \to h_2(x) \to h_3(x) \to h_4(x) \to \text{score}$

Intuitions:

• Multiple levels of abstraction

• Multiple steps of computation

• Empirically works well

• Theory is still incomplete

• Beyond learning hierarchical feature representations, deep neural networks can be interpreted in a few other ways.
• One perspective is that each layer can be thought of as performing some computation, and therefore deep neural networks can be thought of
as performing multiple steps of computation.
• But ultimately, the real reason why deep neural networks are interesting is because they work well in practice.
• From a theoretical perspective, we have quite an incomplete explanation for why depth is important. The original motivation from
McCulloch/Pitts in 1943 showed that neural networks can be used to simulate a bounded computation logic circuit. Separately it has been
shown that depth k + 1 logic circuits can represent more functions than depth k. However, neural networks are real-valued and might have
types of computations which don’t fit neatly into the logical paradigm. Obtaining a better theoretical understanding is an active area of research
in statistical learning theory.
Summary
$\text{score} = w \cdot \sigma(V\phi(x))$

• Intuition: decompose problem into intermediate parallel subproblems

• Deep networks iterate this decomposition multiple times

• Hypothesis class contains predictors ranging over weights for all layers

• To summarize, we started with a toy problem (the XOR problem) and used it to motivate neural networks, which decompose a problem into
intermediate subproblems, which are solved in parallel.
• Deep networks iterate this multiple times to build increasingly high-level representations of the input.
Roadmap

Stochastic Gradient Descent

Non-linear features

Neural networks

Feature templates

• In this module, we’ll talk about how to use feature templates to construct features in a flexible way.
Feature extraction + learning
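$$\mathcal{F} = \{f_w(x) = \text{sign}(w \cdot \phi(x)) : w \in \mathbb{R}^d\}$$

[Figure: within the set of all predictors, feature extraction picks out the hypothesis class F, and learning picks out a predictor $f_w$ inside F]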

• Feature extraction: choose F based on domain knowledge

• Learning: choose fw ∈ F based on data

Want F to contain good predictors but not be too big

• Recall that the hypothesis class F is the set of predictors considered by the learning algorithm. In the case of linear predictors, F is given by
some function of w · φ(x) for all w (sign for classification, no sign for regression). This can be visualized as a set in the figure.
• Learning is the process of choosing a particular predictor fw from F given training data.
• But the question that will concern us in this module is how do we choose F? We saw some options already: linear predictors, quadratic
predictors, etc., but what makes sense for a given application?
• If the hypothesis class doesn’t contain any good predictors, then no amount of learning can help. So the question when extracting features is
really whether they are powerful enough to express good predictors. It’s okay and expected that F will contain bad ones as well. Of course,
we don’t want F to be too big, or else learning becomes hard, not just computationally but statistically (as we’ll explain when we talk about
generalization).
Feature extraction with feature names
Example task:
string (x) → classifier $f_w(x) = \text{sign}(w \cdot \phi(x))$ → valid email address? (y)

Question: what properties of x might be relevant for predicting y?

Feature extractor: Given x, produce set of (feature name, feature value) pairs

x (arbitrary!): [email protected] → feature extractor φ → φ(x):

length>10    : 1
fracOfAlpha  : 0.85
contains @   : 1
endsWith com : 1
endsWith org : 0

• To get some intuition about feature extraction, let us consider the task of predicting whether a string is a valid email address or not.
• We will assume the classifier fw is a linear classifier, which is given by some feature extractor φ.
• Feature extraction is a bit of an art that requires intuition about both the task and also what machine learning algorithms are capable of.
The general principle is that features should represent properties of x which might be relevant for predicting y.
• Think about the feature extractor as producing a set of (feature name, feature value) pairs. For example, we might extract information about
the length, or fraction of alphanumeric characters, whether it contains various substrings, etc.
• It is okay to add features which turn out to be irrelevant, since the learning algorithm can always in principle choose to ignore the feature,
though it might take more data to do so.
• We have been associating each feature with a name so that it’s easier for us (humans) to interpret and develop the feature extractor. The
feature names act like the analogue of comments in code. Mathematically, the feature name is not needed by the learning algorithm and
erasing them does not change prediction or learning.
Prediction with feature names
Feature name    Weight vector $w \in \mathbb{R}^d$    Feature vector $\phi(x) \in \mathbb{R}^d$
length>10       −1.2                 1
fracOfAlpha     0.6                  0.85
contains @      3                    1
endsWith com    2.2                  1
endsWith org    1.4                  0

Score: weighted combination of features

$$w \cdot \phi(x) = \sum_{j=1}^{d} w_j \phi(x)_j$$

Example: −1.2(1) + 0.6(0.85) + 3(1) + 2.2(1) + 1.4(0) = 4.51

• A feature vector formally is just a list of numbers, but we have endowed each feature in the feature vector with a name.
• The weight vector is also just a list of numbers, but we can endow each weight with the corresponding name as well.
• Recall that the score is simply the dot product between the weight vector and the feature vector. In other words, the score aggregates the
contribution of each feature, weighted appropriately.
• Each feature weight wj determines how the corresponding feature value φj (x) contributes to the prediction.
• If wj is positive, then the presence of feature j (φj (x) = 1) favors a positive classification (e.g., ending with com). Conversely, if wj is
negative, then the presence of feature j favors a negative classification (e.g., length greater than 10). The magnitude of wj measures the
strength or importance of this contribution.
• Advanced: while tempting, it can be a bit misleading to interpret feature weights in isolation, because the learning algorithm treats w
holistically. In particular, a feature weight wj produced by a learning algorithm will change depending on the presence of other features. If
the weight of a feature is positive, it doesn’t necessarily mean that feature is positively correlated with the label.
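With named features, both vectors can be stored as maps from feature name to value, and the score is the sum over shared names. The sketch below reuses the numbers from the slide; representing the vectors as Python dictionaries is an assumption for illustration, as are the function names.

```python
weights = {
    "length>10": -1.2,
    "fracOfAlpha": 0.6,
    "contains @": 3.0,
    "endsWith com": 2.2,
    "endsWith org": 1.4,
}

phi = {
    "length>10": 1,
    "fracOfAlpha": 0.85,
    "contains @": 1,
    "endsWith com": 1,
    "endsWith org": 0,
}

def score(w, phi):
    # Dot product over feature names; a feature with no weight contributes 0.
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

def classify(w, phi):
    return 1 if score(w, phi) >= 0 else -1

s = score(weights, phi)   # -1.2 + 0.51 + 3 + 2.2 + 0
```

Renaming every feature consistently in both dictionaries would leave `s` unchanged, which is the sense in which the names are not needed by the learning algorithm.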
Organization of features?
x (arbitrary!): [email protected] → feature extractor φ → φ(x):

length>10    : 1
fracOfAlpha  : 0.85
contains @   : 1
endsWith com : 1
endsWith org : 0

Which features to include? Need an organizational principle...

• How would we go about creating good features?
• Here, we used our prior knowledge to define certain features (contains @) which we believe are helpful for detecting email addresses.
• But this is ad-hoc, and it’s easy to miss useful features (e.g., endsWith us), and there might be other features which are predictive but not
intuitive.
• We need a more systematic way to go about this.
Feature templates

Definition: feature template

A feature template is a group of features all computed in a similar way.

[email protected]  →  last three characters equals ___  →

endsWith aaa : 0
endsWith aab : 0
endsWith aac : 0
...
endsWith com : 1
...
endsWith zzz : 0

Define types of pattern to look for, not particular patterns

• A useful organizational principle is a feature template, which groups all the features that are computed in a similar way. (People often use
the word "feature" when they really mean "feature template".)
• Rather than defining individual features like endsWith com, we can define a single feature template which expands into all the features that
test whether the last three characters of the input x equal a particular string.
• Typically, we will write a feature template as an English description with a blank (___), which is to be filled in with an arbitrary value.
• The upshot is that we don’t need to know which particular patterns (e.g., which three-character suffixes) are useful, only that the existence
of a certain type of pattern (e.g., three-character suffixes) is a useful cue to look at.
• It is then up to the learning algorithm to figure out which patterns are useful by assigning the appropriate feature weights.
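To make the expansion concrete, here is a small Python sketch that enumerates every feature generated by the template "last three characters equals ___". It assumes the blank ranges over lowercase three-letter strings (aaa through zzz, as on the slide); the function names and the input `user@example.com` are hypothetical and chosen only for illustration.

```python
import itertools
import string

PREFIX = "endsWith "


def suffix_template_names():
    """Yield every feature name the template expands into (aaa ... zzz)."""
    for chars in itertools.product(string.ascii_lowercase, repeat=3):
        yield PREFIX + "".join(chars)


def suffix_features(x):
    """Dense view of the template: one 0/1 value per possible suffix."""
    return {
        name: int(x.endswith(name[len(PREFIX):]))
        for name in suffix_template_names()
    }


phi = suffix_features("user@example.com")  # hypothetical input
print(len(phi))             # number of features in the template
print(phi["endsWith com"])  # the single feature that fires
```

One template yields 26^3 = 17,576 features, yet only one of them is non-zero for any given input.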
Feature templates example 1
Input:

[email protected]

Feature template                         Example feature

Last three characters equals ___         Last three characters equals com : 1
Length greater than ___                  Length greater than 10 : 1
Fraction of alphanumeric characters      Fraction of alphanumeric characters : 0.85

• Here are some other examples of feature templates.
• Note that an isolated feature (e.g., fraction of alphanumeric characters) can be treated as a trivial feature template with no blanks to be
filled.
• In many cases, the feature values are binary (0 or 1), but they can also be real numbers.
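The three templates in this example could be combined into one feature extractor along the following lines. This is a sketch, not the course's reference code: the function name `extract_features`, the default threshold of 10, and the sample input are assumptions made for illustration.

```python
def extract_features(x, length_threshold=10):
    """Return a dictionary feature vector built from three feature templates."""
    phi = {}
    # Template 1: "Last three characters equals ___" (blank filled from x).
    phi["Last three characters equals " + x[-3:]] = 1
    # Template 2: "Length greater than ___" (blank is a chosen threshold).
    phi[f"Length greater than {length_threshold}"] = int(len(x) > length_threshold)
    # Template 3: a single real-valued feature with no blanks.
    alnum = sum(c.isalnum() for c in x)
    phi["Fraction of alphanumeric characters"] = alnum / len(x) if x else 0.0
    return phi


print(extract_features("user@example.com"))  # hypothetical input
```

The first two features are binary, while the third is a real number between 0 and 1.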
Feature templates example 2
Input:

Latitude: 37.4068176
Longitude: -122.1715122

Feature template: Pixel intensity of image at row ___ and column ___ (___ channel)
Example feature name: Pixel intensity of image at row 10 and column 93 (red channel) : 0.8

Feature template: Latitude is in [ ___, ___ ] and longitude is in [ ___, ___ ]
Example feature name: Latitude is in [ 37.4, 37.5 ] and longitude is in [ -122.2, -122.1 ] : 1

• As another example application, suppose the input is an aerial image along with the latitude/longitude corresponding to where the image was
taken. This type of input arises in poverty mapping and land cover classification.
• In this case, we might define one feature template corresponding to the pixel intensities at various pixel-wise row/column positions in the
image across all the 3 color channels (e.g., red, green, blue).
• Another feature template might define a family of binary features, one for each region of the world, where each region is defined by a bounding
box over latitude and longitude.
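As an illustration of the bounding-box template, the sketch below maps a latitude/longitude pair to the one region feature that fires. It assumes a regular grid of 0.1-degree cells, which matches the example feature on the slide but is otherwise an arbitrary choice; `region_feature` and `cells_per_degree` are hypothetical names.

```python
import math


def region_feature(lat, lon, cells_per_degree=10):
    """Return the single non-zero bounding-box feature for (lat, lon)."""
    # Integer cell indices avoid accumulating floating-point error.
    lat_idx = math.floor(lat * cells_per_degree)
    lon_idx = math.floor(lon * cells_per_degree)
    lat_lo, lat_hi = lat_idx / cells_per_degree, (lat_idx + 1) / cells_per_degree
    lon_lo, lon_hi = lon_idx / cells_per_degree, (lon_idx + 1) / cells_per_degree
    name = (
        f"Latitude is in [{lat_lo}, {lat_hi}] "
        f"and longitude is in [{lon_lo}, {lon_hi}]"
    )
    return {name: 1}


print(region_feature(37.4068176, -122.1715122))
```

Every other region feature is implicitly zero, so this template produces a very sparse feature vector.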
Sparsity in feature vectors
[email protected]  →  last character equals ___  →

endsWith a : 0
endsWith b : 0
endsWith c : 0
endsWith d : 0
endsWith e : 0
endsWith f : 0
endsWith g : 0
endsWith h : 0
endsWith i : 0
endsWith j : 0
endsWith k : 0
endsWith l : 0
endsWith m : 1
endsWith n : 0
endsWith o : 0
endsWith p : 0
endsWith q : 0
endsWith r : 0
endsWith s : 0
endsWith t : 0
endsWith u : 0
endsWith v : 0
endsWith w : 0
endsWith x : 0
endsWith y : 0
endsWith z : 0

Compact representation:
{"endsWith m": 1}
• In general, a feature template corresponds to many features, and sometimes, for a given input, most of the feature values are zero; that is,
the feature vector is sparse.
• Of course, different feature vectors have different non-zero features.
• In this case, it would be inefficient to represent all the features explicitly. Instead, we can just store the values of the non-zero features,
assuming all other feature values are zero by default.
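The relationship between the explicit and compact representations can be sketched as a pair of conversion functions. The fixed ordering in `FEATURE_NAMES` and the helper names `to_sparse` and `to_dense` are assumptions for this example.

```python
import string

# Fixed feature ordering for the template "last character equals ___".
FEATURE_NAMES = ["endsWith " + c for c in string.ascii_lowercase]


def to_sparse(dense):
    """Keep only the non-zero entries, keyed by feature name."""
    return {name: v for name, v in zip(FEATURE_NAMES, dense) if v != 0}


def to_dense(sparse):
    """Expand back to a full list; missing features default to zero."""
    return [sparse.get(name, 0) for name in FEATURE_NAMES]


dense = [0] * len(FEATURE_NAMES)
dense[string.ascii_lowercase.index("m")] = 1

print(to_sparse(dense))  # prints {'endsWith m': 1}
```

The sparse form stores one entry instead of 26, and the saving grows with the number of features the template generates.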
Two feature vector implementations
Arrays (good for dense features):

pixelIntensity(0,0) : 0.8
pixelIntensity(0,1) : 0.6
pixelIntensity(0,2) : 0.5
pixelIntensity(1,0) : 0.5
pixelIntensity(1,1) : 0.8
pixelIntensity(1,2) : 0.7
pixelIntensity(2,0) : 0.2
pixelIntensity(2,1) : 0
pixelIntensity(2,2) : 0.1

[0.8, 0.6, 0.5, 0.5, 0.8, 0.7, 0.2, 0, 0.1]

Dictionaries (good for sparse features):

fracOfAlpha : 0.85
contains a : 0
contains b : 0
contains c : 0
contains d : 0
contains e : 0
...
contains @ : 1
...

{"fracOfAlpha": 0.85, "contains @": 1}

• In general, there are two common ways to implement feature vectors: using arrays and using dictionaries.
• Arrays assume a fixed ordering of the features and store the feature values as an array. This implementation is appropriate when the number
of nonzeros is significant (the features are dense). Arrays are especially efficient in terms of space and speed (and you can take advantage of
GPUs). In computer vision applications, features (e.g., the pixel intensity features) are generally dense, so arrays are more common.
• However, when we have sparsity (few nonzeros), it is typically more efficient to implement the feature vector as a dictionary (map) from
strings to doubles rather than a fixed-size array of doubles. The features not in the dictionary implicitly have a default value of zero. This
sparse implementation is useful for natural language processing with linear predictors, and is what allows us to work efficiently over millions
of features. In Python, one would define a feature vector φ(x) as the dictionary {"endsWith "+x[-3:]: 1}. Dictionaries do incur extra
overhead compared to arrays, and therefore dictionaries are much slower when the features are not sparse.
• One advantage of the sparse feature implementation is that you don’t have to instantiate the full set of possible features in advance; the weight
vector can be initialized to {}, and only when a feature weight becomes non-zero do we store it. This means we can dynamically update a
model with incrementally arriving data, which might instantiate new features.
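The two implementations differ mainly in how the dot product and the weight update are written. The sketch below is one possible way to do it in plain Python; `dense_dot`, `sparse_dot`, and `increment` are illustrative names, and the scale values stand in for whatever step an SGD update on a linear predictor would produce.

```python
def dense_dot(w, phi):
    """Dot product of two equal-length lists (array implementation)."""
    return sum(w_j * phi_j for w_j, phi_j in zip(w, phi))


def sparse_dot(w, phi):
    """Dot product of two dictionaries; missing keys count as zero."""
    small, large = sorted((w, phi), key=len)  # iterate over the shorter one
    return sum(value * large.get(name, 0.0) for name, value in small.items())


def increment(w, scale, phi):
    """In-place update w <- w + scale * phi, creating weights on demand."""
    for name, value in phi.items():
        if value != 0:
            w[name] = w.get(name, 0.0) + scale * value


w = {}  # no features instantiated in advance
increment(w, 0.5, {"endsWith com": 1, "contains @": 1})
increment(w, -0.25, {"endsWith org": 1, "contains @": 1})  # new feature appears
print(w)
print(sparse_dot(w, {"endsWith com": 1, "contains @": 1}))
```

The second update introduces a feature the model has never seen, and the weight dictionary simply grows to accommodate it.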
Summary
$$\mathcal{F} = \{ f_w(x) = \operatorname{sign}(w \cdot \phi(x)) : w \in \mathbb{R}^d \}$$

Feature template:

[email protected]  →  last three characters equals ___  →

endsWith aaa : 0
endsWith aab : 0
endsWith aac : 0
...
endsWith com : 1
...
endsWith zzz : 0

Dictionary implementation:
{"endsWith com": 1}

• The question we are concerned with in this module is how to define the hypothesis class F, which in the case of linear predictors is the
question of what the feature extractor φ is.
• We showed how feature templates can be useful for organizing the definition of many features, and that we can use dictionaries to represent
sparse feature vectors efficiently.
• Stepping back, feature engineering is one of the most critical components in the practice of machine learning. It often does not get as much
attention as it deserves, mostly because it is a bit of an art and somewhat domain-specific.
• More powerful predictors such as neural networks will alleviate some of the burden of feature engineering, but even neural networks use feature
vectors as the initial starting point, and therefore their effectiveness is ultimately governed by how good the features are.
Overall Summary
• Stochastic Gradient Descent: faster gradient descent using sample gradients

• Non-Linear Features: Linear in weights w, but nonlinear in inputs x

• Neural networks: Learning hierarchical feature representations

• Feature templates: useful for organizing the definition of many features

• Next: Backpropagation, k-means, generalization, best practices

• In summary, we started with stochastic gradient descent, which can be faster than gradient descent at the cost of noisier updates.
• Then, we discussed non-linear features and outlined a general recipe for building linear models that are non-linear in the original inputs by
using the feature vector mapping.
• Then, we discussed neural networks, which can be interpreted as a parametric approach for learning flexible hierarchical feature representations.
• Finally, we covered how feature templates can be useful for organizing the definition of many features.
• In the next lecture, we will cover backpropagation, briefly cover k-means for unsupervised learning, then discuss generalization and best practices.

