Week 4 Lecture Notes
Non-linear Hypotheses
Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted
to create a hypothesis from three (3) features that included all the quadratic terms:
$$g(\theta_0 + \theta_1 x_1^2 + \theta_2 x_1 x_2 + \theta_3 x_1 x_3 + \theta_4 x_2^2 + \theta_5 x_2 x_3 + \theta_6 x_3^2)$$
That gives us 6 features. The exact way to calculate how many features we get for all polynomial terms is the combination function with repetition, $\frac{(n+r-1)!}{r!\,(n-1)!}$ (http://www.mathsisfun.com/combinatorics/combinations-permutations.html). In this case we are taking all two-element combinations of three features: $\frac{(3+2-1)!}{2!\cdot(3-1)!} = \frac{4!}{4} = 6$. (Note: you do not have to know these formulas, I just found it helpful for understanding.)
For 100 features, if we wanted to make them quadratic we would get $\frac{(100+2-1)!}{2!\cdot(100-1)!} = 5050$ resulting new features.
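As a quick sanity check of these counts, here is a minimal Python sketch (the function name `num_quadratic_features` is just illustrative) that computes the number of quadratic terms as combinations with repetition:

```python
from math import comb

def num_quadratic_features(n):
    # Number of degree-2 terms over n features, counted as combinations
    # with repetition: C(n + 2 - 1, 2) = n * (n + 1) / 2
    return comb(n + 2 - 1, 2)

print(num_quadratic_features(3))    # 6, matching the three-feature example
print(num_quadratic_features(100))  # 5050
```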
We can approximate the growth of the number of new features we get with all quadratic terms with $\mathcal{O}(n^2/2)$. And if you wanted to include all cubic terms in your hypothesis, the features would grow asymptotically at $\mathcal{O}(n^3)$. These are very steep growths, so as the number of our features increases, the number of quadratic or cubic features increases very rapidly and quickly becomes impractical.
Example: let our training set be a collection of 50 × 50 pixel black-and-white photographs, and our goal will be to classify which ones are photos of cars. Our feature set size is then n = 2500 if we use the intensity of every pixel as a feature.
Now let's say we need to make a quadratic hypothesis function. With quadratic features, our growth is $\mathcal{O}(n^2/2)$. So our total number of features will be about $2500^2/2 = 3{,}125{,}000$, which is very impractical.
Neural networks offer an alternative way to perform machine learning when we have complex hypotheses with many features.
There is evidence that the brain uses only one "learning algorithm" for all its different functions. Scientists have tried cutting (in an animal brain) the connection between the ears and the auditory cortex and rewiring the optic nerve to the auditory cortex, and found that the auditory cortex literally learns to see.
This principle is called "neuroplasticity" and has many examples and experimental evidence.
Model Representation I
Let's examine how we will represent a hypothesis function using neural networks.
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals (called "spikes") which are channeled to outputs (axons).
In our model, our dendrites are like the input features x1 ⋯ xn , and the output is the result of our hypothesis
function:
In this model our x0 input node is sometimes called the "bias unit." It is always equal to 1.
In neural networks, we use the same logistic function as in classification, $\frac{1}{1+e^{-\theta^T x}}$; in neural networks, however, we sometimes call it the sigmoid (logistic) activation function.
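For reference, here is a minimal sketch of this sigmoid/logistic activation in Python (NumPy assumed; the function name `sigmoid` is just a conventional choice):

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z)), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5; large positive z -> ~1, large negative z -> ~0
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```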
Our "theta" parameters are sometimes instead called "weights" in the neural networks model.
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \big[\ \ \big] \rightarrow h_\theta(x)$$
Our input nodes (layer 1) go into another node (layer 2), and are output as the hypothesis function.
The first layer is called the "input layer" and the final layer the "output layer," which gives the final value computed by the hypothesis.
We can have intermediate layers of nodes between the input and output layers called the "hidden layer."
We label these intermediate or "hidden" layer nodes $a_0^{(2)} \cdots a_n^{(2)}$ and call them "activation units."

$$a_i^{(j)} = \text{"activation" of unit } i \text{ in layer } j$$
$$\Theta^{(j)} = \text{matrix of weights controlling function mapping from layer } j \text{ to layer } j+1$$
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix} \rightarrow h_\theta(x)$$
This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row
of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the
logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet
another parameter matrix Θ (2) containing the weights for our second layer of nodes.
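Written out term by term, the computation described above looks like the following (a sketch assuming the 3-input, 3-activation-unit layout in the diagram, with $\Theta^{(1)}$ the 3×4 matrix and $\Theta^{(2)}$ the second-layer weights):

$$
\begin{aligned}
a_1^{(2)} &= g\big(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3\big) \\
a_2^{(2)} &= g\big(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3\big) \\
a_3^{(2)} &= g\big(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3\big) \\
h_\Theta(x) &= a_1^{(3)} = g\big(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}\big)
\end{aligned}
$$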
If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
The +1 comes from the addition in $\Theta^{(j)}$ of the "bias nodes," $x_0$ and $\Theta_0^{(j)}$. In other words, the output nodes will not include the bias nodes, while the inputs will.
Example: if layer 1 has 2 input nodes and layer 2 has 4 activation nodes, the dimension of $\Theta^{(1)}$ is going to be $4 \times 3$, since $s_j = 2$ and $s_{j+1} = 4$, so $s_{j+1} \times (s_j + 1) = 4 \times 3$.
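As a small illustration of this dimension rule (the helper name `theta_shape` is just for this sketch):

```python
def theta_shape(s_j, s_j_plus_1):
    # Theta^(j) maps layer j (s_j units plus a bias unit) to layer j+1
    # (s_{j+1} units), so its dimension is s_{j+1} x (s_j + 1)
    return (s_j_plus_1, s_j + 1)

print(theta_shape(2, 4))  # (4, 3), matching the example above
```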
Model Representation II
In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $z_k^{(j)}$ that encompasses the parameters inside our $g$ function. In our previous example, if we replaced all the parameters by the variable $z$, we would get:

$$a_1^{(2)} = g(z_1^{(2)})$$
$$a_2^{(2)} = g(z_2^{(2)})$$
$$a_3^{(2)} = g(z_3^{(2)})$$
In other words, for layer $j = 2$ and node $k$, the variable $z$ will be:

$$z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n$$

The vector representation of $x$ and $z^{(j)}$ is:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ \cdots \\ x_n \end{bmatrix} \qquad z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \cdots \\ z_n^{(j)} \end{bmatrix}$$
Setting $x = a^{(1)}$, we can rewrite the equation as:

$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$
We are multiplying our matrix $\Theta^{(j-1)}$, with dimensions $s_j \times (n+1)$ (where $s_j$ is the number of our activation nodes), by our vector $a^{(j-1)}$, with height $(n+1)$. This gives us our vector $z^{(j)}$ with height $s_j$.
Now we can get a vector of our activation nodes for layer j as follows:
$$a^{(j)} = g(z^{(j)})$$
We can then add a bias unit (equal to 1) to layer $j$ after we have computed $a^{(j)}$. This will be element $a_0^{(j)}$ and will be equal to 1.
To compute our final hypothesis, let's first compute another $z$ vector:

$$z^{(j+1)} = \Theta^{(j)} a^{(j)}$$

We get this final $z$ vector by multiplying the next theta matrix after $\Theta^{(j-1)}$ with the values of all the activation nodes we just got. This last theta matrix $\Theta^{(j)}$ will have only one row, so that our result is a single number.
$$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$$
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in
logistic regression.
Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and
more complex non-linear hypotheses.
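To make the vectorized steps above concrete, here is a minimal NumPy sketch of forward propagation for one example (the weight matrices and layer sizes below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    # x: input features without the bias unit
    # thetas: list of Theta^(j) matrices, each of shape (s_{j+1}, s_j + 1)
    a = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))   # add the bias unit a_0 = 1
        z = theta @ a                    # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                   # a^(j+1) = g(z^(j+1))
    return a                             # h_Theta(x)

# Example with made-up weights: 3 inputs -> 3 hidden units -> 1 output
theta1 = np.ones((3, 4)) * 0.1
theta2 = np.ones((1, 4)) * 0.1
print(forward_propagate(np.array([1.0, 2.0, 3.0]), [theta1, theta2]))
```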
Examples and Intuitions I
A simple example of applying neural networks is predicting $x_1$ AND $x_2$, the logical AND operator, which is true only if both $x_1$ and $x_2$ are 1. The graph of our functions will look like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \big[\, g(z^{(2)}) \,\big] \rightarrow h_\Theta(x)$$

Remember that $x_0$ is our bias variable and is always 1. Let's set our first theta matrix as:

$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$$
This will cause the output of our hypothesis to only be positive if both x1 and x2 are 1. In other words:
$x_1 = 0$ and $x_2 = 0$ then $g(-30) \approx 0$
$x_1 = 0$ and $x_2 = 1$ then $g(-10) \approx 0$
$x_1 = 1$ and $x_2 = 0$ then $g(-10) \approx 0$
$x_1 = 1$ and $x_2 = 1$ then $g(10) \approx 1$
So we have constructed one of the fundamental operations in computers by using a small neural network
rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical
gates.
AND:
$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$$
NOR:
$$\Theta^{(1)} = \begin{bmatrix} 10 & -20 & -20 \end{bmatrix}$$
OR:
$$\Theta^{(1)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$$
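We can check these weight choices directly with the sigmoid sketch from above (a quick verification, not part of the original notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(theta, x1, x2):
    # Single-unit network: h(x) = g(theta . [1, x1, x2]), rounded to 0 or 1
    return round(float(sigmoid(np.dot(theta, [1.0, x1, x2]))))

AND = [-30, 20, 20]
NOR = [10, -20, -20]
OR  = [-10, 20, 20]

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, gate(AND, x1, x2), gate(NOR, x1, x2), gate(OR, x1, x2))
# AND: 0 0 0 1, NOR: 1 0 0 0, OR: 0 1 1 1
```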
We can combine these to get the XNOR logical operator (which gives 1 if x1 and x2 are both 0 or both 1).
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \end{bmatrix} \rightarrow \big[\, a^{(3)} \,\big] \rightarrow h_\Theta(x)$$
For the transition between the first and second layer, we'll use a $\Theta^{(1)}$ matrix that combines the values for AND and NOR:
$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}$$
For the transition between the second and third layer, we'll use a $\Theta^{(2)}$ matrix that uses the value for OR:

$$\Theta^{(2)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$$
Let's write out the values for all our nodes:

$$a^{(2)} = g(\Theta^{(1)} \cdot x)$$
$$a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)})$$
$$h_\Theta(x) = a^{(3)}$$
And there we have the XNOR operator, using one hidden layer with two nodes!
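As a quick check of this composition (again just an illustrative sketch, reusing the sigmoid helper from above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta1 = np.array([[-30, 20, 20],     # AND unit
                   [ 10, -20, -20]])  # NOR unit
theta2 = np.array([[-10, 20, 20]])    # OR unit applied to [1, AND, NOR]

def xnor(x1, x2):
    a2 = sigmoid(theta1 @ np.array([1.0, x1, x2]))       # hidden layer a^(2)
    a3 = sigmoid(theta2 @ np.concatenate(([1.0], a2)))   # output layer a^(3)
    return round(float(a3))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xnor(x1, x2))  # 1, 0, 0, 1
```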
Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we want to classify our data into one of four final resulting classes. Our network might look like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \cdots \\ x_n \end{bmatrix} \rightarrow \begin{bmatrix} a_0^{(2)} \\ a_1^{(2)} \\ a_2^{(2)} \\ \cdots \end{bmatrix} \rightarrow \begin{bmatrix} a_0^{(3)} \\ a_1^{(3)} \\ a_2^{(3)} \\ \cdots \end{bmatrix} \rightarrow \cdots \rightarrow \begin{bmatrix} h_\Theta(x)_1 \\ h_\Theta(x)_2 \\ h_\Theta(x)_3 \\ h_\Theta(x)_4 \end{bmatrix}$$
The last hidden layer of nodes, when multiplied by its theta matrix, results in another vector, to which we apply the logistic function $g$ to get our vector of hypothesis values.
Our resulting hypothesis for one set of inputs may look like:
$$h_\Theta(x) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$
In which case our resulting class is the third one down, or hΘ (x)3 .
Our final value of our hypothesis for a set of inputs will be one of the elements in $y$, where each class label in $y$ is itself one of these vectors (e.g., $[0\ 0\ 1\ 0]^T$ for the third class).
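In practice, one common way to pick the class is to take the position of the largest hypothesis value; a minimal sketch with made-up output values (np.argmax is zero-indexed, so we add 1 to match the numbering here):

```python
import numpy as np

h = np.array([0.1, 0.2, 0.9, 0.05])   # example hypothesis output h_Theta(x)
predicted_class = int(np.argmax(h)) + 1
print(predicted_class)                 # 3, i.e. the third class
```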