Coursera Machine Learning Specialization

Notes taken from Coursera's Machine Learning Specialization course

by Parsa Bashari

Summer 2023
Contents

3 Unsupervised Learning
  3.1 Week 1: Clustering and Anomaly Detection
    3.1.1 Clustering Intuition
    3.1.2 K-means Clustering Algorithm
    3.1.3 Anomaly Detection
    3.1.4 Anomaly Detection vs. Supervised Learning
  3.2 Week 2: Recommender Systems
    3.2.1 Collaborative Filtering
    3.2.2 Recommender Systems Implementation Details
    3.2.3 TensorFlow Implementation of Collaborative Filtering
    3.2.4 Limitations of Collaborative Filtering
    3.2.5 Content-based Filtering
    3.2.6 TensorFlow Implementation of Content-based Filtering
    3.2.7 Retrieval and Ranking
    3.2.8 Principal Component Analysis
  3.3 Week 3: Reinforcement Learning
    3.3.1 Return, Policy, and Markov Decision Process
    3.3.2 State-Action Value Function (Q-Function)
    3.3.3 Deep Q-Learning for Continuous State Spaces
Chapter 1
Supervised Machine Learning
2. Unsupervised Learning: only the inputs are given (the outputs are not specified)
1.1.2 Terminology
• Training Set: Data used to train the model
f_{w,b}(x) = wx + b
where w and b are some constants. Most of the time, we simplify this notation to the following
format:
f (x) = wx + b
This model is called univariate linear regression because there is only one input
variable x in our model. The numbers w and b are called parameters.
where ŷ^{(i)} is the predicted value of our model function corresponding to the input x^{(i)}. So the
above formula can be written as follows:
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2
Note that our goal is to find some w and b to minimize the cost function J(w, b).
As an example, assume that we are given a training set like this:
{(x, y)} = {(0, 0), (1, 1), (2, 2), (3, 3)}
[Figure: scatter plot of the training examples (x, y) for x from 0 to 3.]
It is obvious that the best approximation of this plot is the line y = x, which occurs when we choose
w = 1 and b = 0. So we expect the cost function to attain its minimum at the point (w, b) = (1, 0).
Now let’s form the cost function as follows:
J(w, b) = \frac{1}{8} \sum_{i=1}^{4} \left( (w - 1)x^{(i)} + b \right)^2 = \frac{1}{4} \left( 2b^2 + 6b(w - 1) + 7(w - 1)^2 \right)
Then we can plot this cost function:
[Figure: surface plot of the cost function J(w, b) over w and b in the range [−10, 10].]
Another way to visualize the cost function is using a contour plot. This is the contour plot
representation of the cost function above:
[Figure: contour plot of the cost function J(w, b), with contours centered around the minimum at (w, b) = (1, 0).]
The main part of this algorithm is to calculate the new w and b. Note that α is called the
learning rate; it is a small positive number that controls the size of each step:

w = w - \alpha \frac{\partial}{\partial w} J(w, b)

b = b - \alpha \frac{\partial}{\partial b} J(w, b)
Note that if α is too small, the algorithm will still work properly but it will be slow. If α is
too large, gradient descent may fail to converge and never reach the minimum. Also note that
as we approach the local minimum, gradient descent automatically takes smaller steps, because
\frac{\partial}{\partial w} J(w, b) and \frac{\partial}{\partial b} J(w, b) get smaller.
Now let’s use the gradient descent algorithm to find the proper w and b in the linear
regression model. Recall that the model function is f_{w,b}(x) = wx + b, so we can calculate the
partial derivatives with respect to w and b:
\frac{\partial}{\partial w} J(w, b) = \frac{\partial}{\partial w} \left( \frac{1}{2m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2 \right) = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)}) x^{(i)}

\frac{\partial}{\partial b} J(w, b) = \frac{\partial}{\partial b} \left( \frac{1}{2m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2 \right) = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})
In the linear regression model, the cost function always forms a convex surface¹ that has
only one local minimum, which is also its global minimum. Thus, the gradient descent algorithm
will always work if the learning rate is selected properly, and the initial values of w and b
do not matter.
We call this procedure Batch Gradient Descent because each step of the algorithm uses
all the training examples.
¹A surface that has a supporting plane (a plane that contains a point of the surface but does not separate
any two points of it) at each of its points.
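Putting these pieces together, here is a minimal NumPy sketch of batch gradient descent for the univariate model f_{w,b}(x) = wx + b (the function name, learning rate, and iteration count are illustrative, not from the course):

import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Fit f(x) = w*x + b by batch gradient descent on the squared-error cost."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = w * x + b - y           # f_wb(x^(i)) - y^(i) for every example
        dj_dw = (err * x).mean()      # (1/m) * sum(err * x)
        dj_db = err.mean()            # (1/m) * sum(err)
        w -= alpha * dj_dw            # simultaneous update of w and b
        b -= alpha * dj_db
    return w, b

# Training set from the example above: points on the line y = x
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
w, b = batch_gradient_descent(x_train, y_train)
print(w, b)   # expected to approach w = 1, b = 0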
1.2 Week 2: Regression with Multiple Variables
• x_j: the j-th feature
• n: the number of features
• \vec{x}^{(i)}: features of the i-th training example
• x_j^{(i)}: value of feature j in the i-th training example
And our linear regression model function will look like this:

f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b

where \vec{w} and b are the parameters of the model and \vec{x} is the input. Now we can rewrite the
function f using a dot product:

f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b
1.2.2 Vectorization
Vectorization is a way to perform vector calculations faster in code. The numpy library supports
vectorization, and using its methods we can do vector calculations much faster than with a
naive implementation (e.g. using a for loop). Now, look at the following two different
implementations of f_{\vec{w},b}(\vec{x}) in Python:
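The original listings did not survive here, so the following is only a sketch of what the two implementations might look like (names and values are illustrative):

import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4.0
x = np.array([10.0, 20.0, 30.0])

# Naive implementation: loop over the n features one at a time
def predict_loop(x, w, b):
    f = 0.0
    for j in range(len(w)):
        f += w[j] * x[j]
    return f + b

# Vectorized implementation: a single dot product
def predict_vectorized(x, w, b):
    return np.dot(w, x) + b

print(predict_loop(x, w, b), predict_vectorized(x, w, b))  # both give the same result

On long vectors the np.dot version is much faster, for the reason discussed next.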
The reason for the faster calculation in the vectorized version is the use of parallel processing
hardware. Now we can implement the gradient descent algorithm for multiple linear regression.
Notice that the cost function is now of the form J(\vec{w}, b) and returns a scalar. We have to
calculate \frac{\partial}{\partial w_j} J(\vec{w}, b) for j from 1 to n. So each step of the algorithm will look like this:

w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (\vec{w} \cdot \vec{x}^{(i)} + b - y^{(i)}) x_j^{(i)}, \quad j = 1, 2, ..., n
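As a hedged illustration (not the course's own code), the update above for all j at once, together with the analogous update for b derived earlier, can be written in vectorized NumPy form:

import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One gradient descent step for multiple linear regression.
    X: (m, n) inputs, y: (m,) targets, w: (n,) weights, b: scalar bias."""
    m = X.shape[0]
    err = X @ w + b - y            # (m,) vector of w.x^(i) + b - y^(i)
    dj_dw = (X.T @ err) / m        # (n,) vector of partial derivatives w.r.t. each w_j
    dj_db = err.mean()             # partial derivative w.r.t. b
    return w - alpha * dj_dw, b - alpha * dj_db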
where x_1 is the size of the house and x_2 is the number of bedrooms. In this case, when we
run gradient descent, the algorithm may bounce around many times before reaching the minimum
point, because in the contour plot the ovals are long and narrow.
To fix this issue, we scale the features such that their ranges become the same. For example,
we can divide each feature by the maximum of its range. Thus in our example, we can scale as
follows:
x_{1,scaled} = \frac{x_1}{max_1} = \frac{x_1}{2000}, \qquad x_{2,scaled} = \frac{x_2}{max_2} = \frac{x_2}{5}
Another way to scale the features is called Mean Normalization. In this method, we
scale features to the interval [−1, 1]. To do this, we first find the average of each feature (say
µ1 , µ2 , ...). Then we use the following relation to scale the features:
x_{j,scaled} = \frac{x_j - \mu_j}{max_j - min_j}
The last method of feature scaling is called Z-score Normalization. In this method, in
addition to the average, we calculate the standard deviation, σ, and then we scale the feature
as follows:
x_{j,scaled} = \frac{x_j - \mu_j}{\sigma_j}
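A minimal sketch of Z-score normalization in NumPy (the helper name and example values are illustrative):

import numpy as np

def zscore_normalize(X):
    """Z-score normalize each column (feature) of X: (x_j - mu_j) / sigma_j."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Example: house size in [0, 2000] and number of bedrooms in [0, 5]
X = np.array([[2000.0, 5.0], [1200.0, 3.0], [850.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)

Note that the same μ_j and σ_j computed on the training set should also be applied to new inputs at prediction time.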
[Figure: plot of the Sigmoid function over the interval [−6, 6]; it increases from 0 toward 1 and crosses 0.5 at x = 0.]
²In the course, the Sigmoid/Logistic function is denoted as g(x).
where S is the Sigmoid function. This function can be interpreted as the probability that y is equal
to 1. Therefore, the logistic regression model function is also written as

f_{\vec{w},b}(\vec{x}) = P(y = 1 \mid \vec{x}; \vec{w}, b)

which is the probability that y = 1, given input \vec{x} and parameters \vec{w}, b.
[Figure: scatter plot of the training examples in the (x_1, x_2) plane.]
Then the decision boundary is the line below when w1 = 1, w2 = 1, and b = −3:
z = w1 x1 + w2 x2 + b = 0 =⇒ x1 + x2 = 3
which is the purple line in the following diagram:
[Figure: plot of the decision boundary x_1 + x_2 = 3 in the (x_1, x_2) plane.]
Recall from Section 1.2.6 that we can employ feature engineering (and especially polynomial
regression) to form complex nonlinear decision boundary curves using additional features.
This function measures how well the model predicts a single training example. Let’s plot
the loss function. Note that the domain is (0, 1) because the Sigmoid function only outputs
values between 0 and 1:
[Figure: plot of the loss versus the prediction f_{\vec{w},b}(\vec{x}^{(i)}) for the two cases y^{(i)} = 1 and y^{(i)} = 0.]
You can check that in both cases, the further the prediction f_{\vec{w},b}(\vec{x}^{(i)}) is from the target y^{(i)}, the
higher the loss. Now we can define the new cost function for logistic regression as follows:

J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)})
which turns out to generate a convex surface on which gradient descent works properly.
Using the fact that y^{(i)} can only be 0 or 1, we can simplify the loss function above and
write it as follows:

L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))
which is much easier to implement. Eventually, the cost function for logistic regression is
calculated as follows:

J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)})) \right]
\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})

which are exactly the same equations as in Section 1.1.5, but with a different model
function f_{\vec{w},b}(\vec{x}^{(i)}).
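As an illustrative sketch (function names are not from the course), the logistic regression cost and gradient above can be implemented as:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Cost J(w, b) for logistic regression (binary cross-entropy)."""
    f = sigmoid(X @ w + b)                                   # predictions in (0, 1)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def logistic_gradient(X, y, w, b):
    """Partial derivatives of J(w, b); same form as for linear regression."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    dj_dw = (X.T @ err) / m
    dj_db = err.mean()
    return dj_dw, dj_db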
underfit/high bias ✓ ✗ ✗
generalized - - ✓
overfit/high variance ✗ ✓ -
Addressing Overfitting
2. Only select useful features when insufficient training examples are available (Feature
Selection)
1.3.6 Regularization
Regularization is a technique to reduce the problem of overfitting. The main idea of regular-
ization is to shrink the parameters w_j in order to lower the effect of each feature on the final
function. To achieve this, we add a penalty to the cost function that penalizes the model when the
parameters get large. Here is our new regularized cost function, for both linear and logistic
regression:
J_{reg}(w, b) = J(w, b) + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
where λ is called the regularization parameter. Note that regularizing the parameter b is
optional and is often omitted.
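A small sketch of how the penalty term can be added on top of any unregularized cost function such as the ones above (the wrapper name is illustrative):

import numpy as np

def regularized_cost(X, y, w, b, lambda_, cost_fn):
    """J_reg(w, b) = J(w, b) + (lambda / 2m) * sum(w_j^2); b is not regularized."""
    m = X.shape[0]
    return cost_fn(X, y, w, b) + (lambda_ / (2 * m)) * np.sum(w ** 2)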
Chapter 2
Advanced Learning Algorithms
To mimic the functionality of the human brain, we model each neuron as a node that gets some
inputs and gives an output called activation which is sent to the next node (neuron). As an
example, assume that we want to predict if a T-shirt will be a top seller. Then we design a
neural network like this:
We call the middle layer the hidden layer. We can represent the input layer as a vector \vec{x}, the
hidden layer as \vec{a}, and the output layer as a scalar a. Note that in a neural network, the
features computed by the hidden layers are learned by the algorithm itself; we do not have to
engineer them by hand.
where S is the Sigmoid or logistic function, which is called the activation function because it
creates the activation values of a layer. Note that according to this notation, the input layer is
denoted as a^{[0]} so that the relation above works correctly for any value of l. Computing the
activations layer by layer in this way, from the input layer to the output layer, is called forward
propagation; a common architecture choice is for each hidden layer to have fewer units than the
previous one.
import numpy as np

def my_dense_v(A_in, W, b, g):
    """
    Computes a dense layer.
    Args:
      A_in (ndarray (m, n)): data, m examples with n features each
      W    (ndarray (n, j)): weight matrix, n features per unit, j units
      b    (ndarray (1, j)): bias vector, j units
      g                    : activation function (e.g. sigmoid, relu, ...)
    Returns:
      A_out (ndarray (m, j)): m examples, j units
    """
    Z = np.matmul(A_in, W) + b
    A_out = g(Z)
    return A_out
Listing 2.2: with vectorization
[Figure: plots of three activation functions over z ∈ [−5, 5]: Linear Activation, Logistic Activation, and ReLU Activation g(z) = max(0, z).]
It is important to learn how to choose an activation function. We first decide about the output
layer. Here is a guide:
For the hidden layers, it turns out that the ReLU activation function is the most common
choice compared to the logistic function. Here are some benefits of the ReLU over the logistic
activation function:
1. It is faster to calculate.
2. It is faster to learn (because of having less flat parts than the logistic function).
Note that the linear activation function should not be used in hidden layers, because that
would be completely equivalent to using plain linear or logistic regression.
a_1 + a_2 + ... + a_N = 1
The loss function here is derived from the case N = 2 (logistic regression):
L(a_1, a_2, ..., a_N, y) = \begin{cases} -\log a_1 & \text{if } y = 1 \\ -\log a_2 & \text{if } y = 2 \\ \quad \vdots \\ -\log a_N & \text{if } y = N \end{cases}
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='softmax')   # softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)
model.fit(
    X_train, y_train,
    epochs=10
)
Listing 2.4: normal implementation
preferred_model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='linear')
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)
preferred_model.fit(
    X_train, y_train,
    epochs=10
)
Listing 2.5: preferred implementation
You may have wondered about the additional arguments passed to the compile() function. The
Adam algorithm is an optimization of the gradient descent algorithm: it makes gradient
descent faster by using a different learning rate for each parameter of the model.
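One practical consequence of the preferred implementation is that the last layer now outputs raw logits rather than probabilities; from_logits=True lets TensorFlow combine the softmax with the cross-entropy loss internally, which is numerically more stable. So, continuing the listing above, probabilities have to be recovered explicitly when they are needed:

# The preferred model outputs logits, so apply softmax explicitly at prediction time.
logits = preferred_model(X_train)
probabilities = tf.nn.softmax(logits)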
In normal (dense) layers, each unit uses all elements of the input vector (the activations of the
previous layer). But it turns out that if each unit of a layer looks at only a part of the previous
layer's output, we gain some benefits. For instance:
[Figure: plot of J_{train}(\vec{w}, b) and J_{cv}(\vec{w}, b) (error) versus the degree of the polynomial.]
Now where Jtrain is high (Jtrain ≈ Jcv ), we have high bias (underfit), and where Jtrain is low
and Jcv ≫ Jtrain , we have high variance (overfit).
We can also use this approach to choose the best regularization parameter λ. If we plot
J_{cv}(\vec{w}, b) and J_{train}(\vec{w}, b) with respect to λ, we get something like this:

[Figure: plot of J_{train}(\vec{w}, b) and J_{cv}(\vec{w}, b) (error) versus the regularization parameter λ.]
And we can apply the previous conclusions in the case of choosing λ, too.
But what exactly did we mean when we talked about "high" or "low" error so far? In
order to give a concrete definition of high and low, we should establish a baseline level of
performance. There are some ways to do this:
• Human-level performance
• Competing algorithms performance
• Guess based on experience
Then the two key quantities to measure are the difference between the training error and the
baseline error, and the difference between the training error and the cross-validation error.
According to these quantities, we can analyze our model the same as before.
Learning Curve
A learning curve is the functionality of our model in the case of different training set sizes. In
general, if we plot the error of our model with respect to mtrain , we get something like this:
[Figure: plot of J_{train}(\vec{w}, b) and J_{cv}(\vec{w}, b) (error) versus the training set size m_{train}.]
NOTE: If a learning algorithm suffers from high bias, getting more training data will not (by
itself) help much (the plot will be flattened out after a specific point). But in the high variance
case, getting more training data is likely to help.
• Try increasing λ
• Try decreasing λ
A large neural network will usually do as well or better than a smaller one so long as
regularization is chosen appropriately. In practice, to add regularization to a neural network
layer, we can use the following piece of code:
layer = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))
Listing 2.6: neural network regularization
In the error analysis step, we can manually examine the misclassified examples in the cross-
validation set and categorize them based on common traits. This will help a lot in deciding
what to do next. For example, you can collect more data with a specific property that had
been seen a lot in the misclassified examples.
Data Augmentation is a data-collecting technique that modifies an existing training
example to create a new training example. For instance, if we are developing a character
recognition model and we have an image of the letter A, then we can rotate, enlarge, shrink, or
mirror the image and create new training examples with different input but the same output
label. As another example, in a speech recognition model, we can add noisy background sounds
to the input and make new examples with the same output sentence.
During the past decades, most machine learning researchers' attention has been on the conven-
tional model-centric approach, which focuses on the algorithm/model itself. Thanks to that
paradigm of ML research, there are algorithms that are already very good and will work well
for many applications. So, sometimes it can be more fruitful to spend more time taking a
data-centric approach, in which you focus on data collection and augmentation.
2.3.6 Deployment
The full cycle of a machine learning project is something like this:
The most common way to deploy an ML system is to put it on a server and develop a client (a
website or mobile app). When the client calls an API, the server passes the input to the ML
system and gets the output (prediction ŷ). Then it sends the result back to the client to be shown
to the user. Note that developing, maintaining, and monitoring this pipeline requires software
engineering work.
It turns out that there is a trade-off between precision and recall that follows this curve:
[Figure: plot of precision (vertical axis) against recall (horizontal axis); precision falls as recall increases.]
To decide which model to choose, we should look at both precision (P) and recall (R). To make
the decision easier, we define a new metric called the F1 score, which is a combination of P and R
and is defined as follows:

F_1 \text{ score} = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)} = \frac{2PR}{P + R}
Now we can choose the algorithm with the highest F1 score.
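For example (with made-up precision and recall values), the F1 score can be computed as:

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Compare two hypothetical models by their F1 score
p1, r1 = 0.5, 0.4
p2, r2 = 0.7, 0.1
print(f1_score(p1, r1))   # approx. 0.444
print(f1_score(p2, r2))   # approx. 0.175 -> the first model is preferred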
• Continuous Feature: In this case, we set a threshold to convert this feature into a
binary feature. For example, if we have the feature weight that gets continuous values,
we replace it with a binary feature that tells if the weight is less than 5kg or not.
The following figure indicates a simple decision tree for our previous example:
where S is the set of data for which the entropy is calculated, c ranges over the classes in S, and p_c
is the proportion of data points in S that belong to class c. In our classification problem, where
there are only two classes, the entropy function becomes:
H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)
where p1 is the proportion of cats in the set. The plot of the entropy function is as follows:
[Figure: plot of the entropy function H(p_1) for p_1 ∈ [0, 1]; it equals 0 at p_1 = 0 and p_1 = 1 and reaches its maximum of 1 at p_1 = 0.5.]
Note that the entropy function reaches its highest value (1) when p1 = 0.5, which means the
set is completely impure. Also note that when p1 = 1 or 0, the entropy is 0, which means the
set is completely pure. When we split a set at a node according to a specific feature, we come
up with two new entropy values for the left and right branches. To be able to decide based on
these values, we define a new function called Information Gain which shows how much the
entropy has decreased by splitting a set.
\text{Information Gain} = H(p_1^{node}) - \left( w^{left} H(p_1^{left}) + w^{right} H(p_1^{right}) \right)
where wleft and wright are the proportions of new sets after we split the old set (note that
wleft + wright = 1). In a more general form, the information gain function is defined as follows:
\text{Gain}(S, a) = H(S) - \sum_{v \in \text{Values}(a)} \frac{|S_v|}{|S|} H(S_v)
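A small sketch of the binary-entropy and information-gain computations above (the example numbers are made up):

import numpy as np

def entropy(p1):
    """H(p1) for a binary split; defined as 0 when the set is pure."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p_node, p_left, w_left, p_right, w_right):
    """Reduction in entropy obtained by splitting a node into left/right branches."""
    return entropy(p_node) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# Example: a node with 5 cats out of 10 examples, split into branches of size 4 and 6
gain = information_gain(p_node=0.5, p_left=0.75, w_left=0.4, p_right=1/3, w_right=0.6)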
So far, you can solve classification problems with decision trees. In order to solve regression
problems, all we have to do is to replace entropy with variance. And also when we reach a
leaf node, we report the mean of the examples to make regression predictions.
It turns out that even though we sample randomly each time, in many cases there is no
change in the feature selected for the root node and for the nodes in the first few layers. In order
to solve this problem, at each node, when choosing a feature to split on, if n features are
available, we pick a random subset of k < n features and allow the algorithm to choose only
from that subset of features (a common choice is k = \sqrt{n}).
Decision Trees:
• work well on tabular (structured) data
• small decision trees may be human readable

Neural Networks:
• work well on all types of data
• easy to string together to build large systems
Chapter 3
Unsupervised Learning
• µk : cluster centroid k
• µc(i) : cluster centroid of cluster to which example x(i) has been assigned
In each iteration of the K-means algorithm, we are trying to minimize a cost function called
the distortion function, which is defined as follows:

J(c, \mu) = \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2
It turns out that choosing different initial centroids will cause significant changes in our final
clustering model. Thus, it is common to run the algorithm multiple times and pick the one
that gave the lowest distortion function.
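The assignment and update steps of K-means are not reproduced in these notes; the following is only a minimal NumPy sketch of the algorithm and of the run-it-several-times strategy (random initialization, empty clusters not handled, function names illustrative):

import numpy as np

def kmeans(X, K, num_iters=100):
    """Minimal K-means: returns centroids, assignments and the distortion J."""
    m = X.shape[0]
    centroids = X[np.random.choice(m, K, replace=False)]   # random initialization
    for _ in range(num_iters):
        # Assignment step: c^(i) = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])
    # Final assignments and distortion J(c, mu)
    c = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    distortion = np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
    return centroids, c, distortion

def kmeans_best_of(X, K, n_runs=50):
    """Run K-means several times and keep the clustering with the lowest distortion."""
    return min((kmeans(X, K) for _ in range(n_runs)), key=lambda result: result[2])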
The number of clusters (K) highly depends on the underlying purpose of the clustering task.
But if we have no prior knowledge about the number of clusters, there is a method called
elbow to help us choose the right value for K. In this method, we calculate the cost function
(distortion) for each value of K and if we can find an elbow shape in the plot, we choose that
value as K. The following plot shows what we mean by an elbow shape:
[Figure: plot of the cost function J versus K (number of clusters); the curve decreases steeply at first and then flattens, forming an elbow.]
In the above plot, there is an elbow shape at the point (3, 3), so we prefer to use K = 3 as the
number of clusters. Note that we cannot simply choose the K that minimizes the cost function,
because J(\mu, c) keeps decreasing as K grows.
Then we can define our model as the joint probability density function:
p(\vec{x}) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)
Now in the inference phase, when we are given a new example ⃗x, we calculate p(⃗x) and based
on the obtained probability, we can decide if it is a normal example or is an anomalous one.
y = \begin{cases} 1 & \text{if } p(x) < \epsilon \text{ (anomaly)} \\ 0 & \text{if } p(x) \geq \epsilon \text{ (normal)} \end{cases}
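A minimal sketch of this density-estimation procedure (function names are illustrative; ε would be chosen on the cross-validation set as described below):

import numpy as np

def fit_gaussians(X):
    """Estimate mu_j and sigma_j^2 for each feature from the (mostly normal) training set."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def p(x, mu, var):
    """p(x) = product over features of the univariate Gaussian densities."""
    densities = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(densities)

def is_anomaly(x, mu, var, epsilon):
    return p(x, mu, var) < epsilon   # y = 1 (anomaly) when p(x) < epsilon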
For example, if we have 2 features and the following points (normal examples):
[Figure: scatter plot of the normal training examples in the (x_1, x_2) plane.]
¹These estimators are derived from the maximum likelihood estimation (MLE) method.
Then we can fit the following joint normal distribution with parameters µ1 = 5, σ1 = 2 and
µ2 = 3, σ2 = 1:
[Figure: surface plot of the fitted joint Gaussian density over (x_1, x_2).]
You can see both the training examples and the Gaussian distribution in the following
contour plot:
[Figure: contour plot of the fitted Gaussian density overlaid on the training examples in the (x_1, x_2) plane.]
Algorithm Evaluation
In anomaly detection tasks, we often have a large number of unlabeled examples (but we know
that most of them are normal). In order to evaluate our model, we need some labeled examples
to put in cross-validation and test sets. Then we can use evaluation metrics like precision,
recall, and F1 -score. It is common to use the cross-validation set to choose parameter ϵ and
then use the test set to evaluate the model.
The most common bug in anomaly detection models is that p(x) is large for both normal and
anomalous examples. This problem usually means that the features we have chosen are not
able to clearly reveal the anomaly in the system. In this case, we often add other features that
may be relevant to the anomalous behavior in the data.
For instance, suppose we want to monitor computers in a data center and we have the
following features:
Then adding the following features will cause small values for p(x) for anomalous x in the
cross-validation set:
x_5 = \frac{x_3}{x_4} \qquad x_6 = \frac{(x_3)^2}{x_4}
The other common problem in anomaly detection systems occurs when our features are non-
Gaussian, so we cannot efficiently fit a normal distribution to them. In order to make the
features more Gaussian, it is recommended to apply one of the following transformations:

x \leftarrow \log(x) \qquad x \leftarrow \log(x + c) \qquad x \leftarrow \sqrt{x} \qquad x \leftarrow x^{1/3}
Anomaly Detection:
• Very small number of positive examples (0-20 is common) and a large number of negative examples.
• Many different types of anomalies. Hard for any algorithm to learn from the positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we have seen so far.
• Finding new, previously unseen defects.

Supervised Learning:
• Large number of positive and negative examples.
• Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.
• Finding known, previously seen defects.
The columns are users and the rows are movies, so y^{(i,j)} is the rating given by user j to movie i.
The ? signs in the matrix Y show that a user has not rated a movie so far. We can define
the following matrix, which indicates whether a user has rated a movie or not:

R = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & \cdots & 1 \end{pmatrix}_{n_m \times n_u}
Suppose we have features for the movies (number of features = n). For example, if x_1 is the
amount of romance, x_2 is the amount of action, ..., then we can show the features using
the following matrix:

X = \begin{pmatrix} 0.9 & 0.0 & \cdots & 0.2 \\ 1.0 & 0.01 & \cdots & 0.7 \\ \vdots & \vdots & \ddots & \vdots \\ 0.1 & 0.99 & \cdots & 0.0 \end{pmatrix}_{n_m \times n}
In the case that we have the features of each movie (we are given the matrix X), we can predict
user j’s rating for movie i as
w^{(j)} \cdot x^{(i)} + b^{(j)}
In this case, we are fitting a linear regression model for each user separately. So the cost
function for user j is

J(w^{(j)}, b^{(j)}) = \frac{1}{2} \sum_{i:\, R_{ij} = 1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (w_k^{(j)})^2
In real-life situations, we often do not have the features of each movie. Now imagine that,
instead, we are given the parameters W and b. Then the cost function for the feature vector
x^{(i)} is defined as follows:

J(x^{(i)}) = \frac{1}{2} \sum_{j:\, R_{ij} = 1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (x_k^{(i)})^2
Now in the case where we have neither features nor parameters, we can put the two equations
together and come up with the following cost function:
" #
nu Xn nm X n
1 X λ X (j) λ X (i)
J(X, W, b) = (w(j) · x(i) + b(j) − y (i,j) )2 + (w )2 + (x )2
2 2 j=1 k=1 k 2 i=1 k=1 k
(i,j):Rij =1
| {z }
regularization
Then we can use gradient descent to find the minimum of J(X, W, b), with the difference that
here X is a parameter as well as W and b. So in each iteration of gradient descent, we now
have an extra update for X, and the algorithm becomes:
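The algorithm listing is missing from these notes; as a hedged sketch, the joint cost J(X, W, b) can be minimized with TensorFlow's automatic differentiation, treating X, W, and b all as trainable variables (the data below are random placeholders and cofi_cost is an illustrative name):

import numpy as np
import tensorflow as tf

# Random placeholder data: ratings Y and indicator matrix R (1 = rated)
num_movies, num_users, num_features = 5, 4, 3
Y = tf.constant(np.random.randint(0, 6, (num_movies, num_users)), dtype=tf.float32)
R = tf.constant(np.random.randint(0, 2, (num_movies, num_users)), dtype=tf.float32)

# X, W and b are all trainable parameters
X = tf.Variable(tf.random.normal((num_movies, num_features)))
W = tf.Variable(tf.random.normal((num_users, num_features)))
b = tf.Variable(tf.zeros((1, num_users)))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

def cofi_cost(X, W, b, Y, R, lambda_):
    """Collaborative filtering cost J(X, W, b), summed only over rated entries (R_ij = 1)."""
    err = (tf.matmul(X, tf.transpose(W)) + b - Y) * R
    return 0.5 * tf.reduce_sum(err ** 2) + \
           (lambda_ / 2) * (tf.reduce_sum(W ** 2) + tf.reduce_sum(X ** 2))

for _ in range(200):
    with tf.GradientTape() as tape:
        cost = cofi_cost(X, W, b, Y, R, lambda_=1.0)
    grads = tape.gradient(cost, [X, W, b])
    optimizer.apply_gradients(zip(grads, [X, W, b]))   # update X, W and b together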
There are many situations in recommender systems where our data is binary. For example, a user
can like or dislike a post on social media, or purchase or not purchase a specific item in an
online shop. In these cases, we use the logistic function (by analogy with moving from regression
to classification). In more concrete terms, we predict that the probability of y^{(i,j)} = 1 is given by

f_{w,b,x}(x) = g(w^{(j)} \cdot x^{(i)} + b^{(j)}), \quad \text{where } g(z) = \frac{1}{1 + e^{-z}}
In addition to our model function, the cost function also changes. The loss function for a single
example is as follows:
L(fw,b,x (x), y (i,j) ) = −y (i,j) log(fw,b,x (x)) − (1 − y (i,j) ) log(1 − fw,b,x (x))
When we are using collaborative filtering, we often cannot interpret the features x^{(i)} of item i
(the features do not represent something specific like genre). So, to find other items related to
item i, we can find the item k whose feature vector is nearest in Euclidean distance, i.e. the one
that minimizes \|x^{(k)} - x^{(i)}\|^2.
After we have trained our neural network, if we want to find movies similar to movie i, we should
find the k that minimizes \|v_m^{(k)} - v_m^{(i)}\|^2. Note that this can be pre-computed ahead of time.
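A small illustrative sketch of this pre-computation (V is assumed to be a matrix holding one learned vector v_m per item):

import numpy as np

def most_similar_items(V, i, top=5):
    """Return the indices of the items whose learned vectors are closest to item i.
    V is an (n_items, d) matrix with one learned vector v_m per row."""
    dists = np.linalg.norm(V - V[i], axis=1) ** 2   # squared Euclidean distances
    dists[i] = np.inf                               # exclude the item itself
    return np.argsort(dists)[:top]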
1. Retrieval:
2. Ranking:
Retrieving more items in the first step results in better performance but slower recommenda-
tions. To analyze and optimize this trade-off, carry out offline experiments to see whether
retrieving additional items results in more relevant recommendations.
PCA in scikit-learn
We can use scikit-learn to implement PCA in code. First, we perform feature scaling as an
optional pre-processing step. Then we use the fit function to fit the data and obtain 2 or 3 new
axes (principal components). There is an optional step in which we can examine how much variance
is explained by each principal component using explained_variance_ratio_. In the final step,
we use the transform function to project the data onto the new axes. A code example is given
here:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 2], [-1, -1], [-2, -1], [-3, -2]])
pca = PCA(n_components=1)
pca.fit(X)
pca.explained_variance_ratio_
X_trans = pca.transform(X)
X_reduced = pca.inverse_transform(X_trans)
where t is the terminal state and γ is the discount factor, which is used to make the agent
"impatient" (i.e. to make it prefer a sooner, smaller reward over a later, larger one).
In order to make a decision in each state, we define a policy, which is a function π(s) = a
mapping from states to actions that tells us what action a to take in a given state s. The goal
of a reinforcement learning model is to find a policy π that tells you what action to take in
every state so as to maximize the return.
A Markov Decision Process (MDP) is the framework used in reinforcement learning to model
an agent that continuously interacts with the environment, takes actions, and receives rewards.
The Markov property states that the future is independent of the past, given the present. This
means that, given the present state, the next state can be predicted without needing the
sequence of previous states.
There is a key equation in reinforcement learning that helps us to compute the Q-function
called Bellman equation. If s and a are current state and action, s′ is the state you get
to after taking action a, and a′ is the action you take in state s′ , then the Bellman equation
expresses that:
Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
In stochastic (random) environments, the goal is to maximize the expected return, and the
Bellman equation becomes

Q(s, a) = R(s) + \gamma \, \mathbb{E}\left[ \max_{a'} Q(s', a') \right]
You may wonder how it is possible to use the Q-function during the training process. The
answer is that at first we don't know what Q(s, a) is, so we initialize the neural network
randomly as a guess. The training process algorithm is given here:
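The algorithm listing itself is missing from these notes. In the course's Deep Q-Learning setup, stored experience tuples (s, a, R(s), s') are sampled in mini-batches and the network is trained toward targets y = R(s) + γ max_{a'} Q(s', a'); below is only a minimal sketch of that target computation (names and the default γ are illustrative):

import numpy as np

def q_learning_targets(rewards, next_states, done, q_network, gamma=0.995):
    """Compute training targets y = R(s) + gamma * max_a' Q(s', a') for a mini-batch
    of stored experiences; for terminal transitions (done = 1) the target is just R(s)."""
    q_next = q_network.predict(next_states)              # Q(s', a') for every action a'
    return rewards + gamma * (1.0 - done) * q_next.max(axis=1)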
You may have noticed that in line 3 of the above algorithm, we have to take actions while
still learning. We could take actions completely at random, which turns out not to be a good
idea. We could also always pick the action a that maximizes Q(s, a). There is another way to
choose actions that turns out to be very effective, called the ϵ-greedy policy. In this method, we
combine two behaviours, Exploitation and Exploration. In each state, we choose one of the following:
• Exploitation: with probability 1 − ϵ, pick the action a that maximizes Q(s, a).
• Exploration: with probability ϵ, pick an action a at random.
This random action picking helps resolve the problems caused by the random initialization
of the neural network. Since the Q-function gets better and better during the training process,
it is common to start with a high ϵ and then gradually decrease it. This makes our model
choose more random actions at first, when the Q-function is not yet accurate.
You may have noticed that in order to find max_a Q(s, a), we have to run inference n times
(where n is the number of actions), once for each action, and then take the maximum. To fix
that, we can change the architecture of our network so that it takes only s as input and outputs n
numbers: Q(s, a_1), Q(s, a_2), ..., Q(s, a_n). This improvement results in much more efficient
inference.
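A hedged sketch of such an architecture in Keras (the state dimension, number of actions, and layer sizes are illustrative):

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Hypothetical sizes: an 8-dimensional state vector and 4 possible actions
state_dim, num_actions = 8, 4

q_network = Sequential([
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(num_actions, activation="linear"),   # one output Q(s, a_k) per action
])

# A single forward pass returns Q(s, a_1), ..., Q(s, a_n) for a batch of states,
# so max_a Q(s, a) needs only one inference call:
q_values = q_network(tf.random.normal((1, state_dim)))
best_action = tf.argmax(q_values, axis=1)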
Reinforcement learning is an exciting research direction with potential for future applications,
but at the moment it has some limitations. For instance, it is much easier to get it to work in
a simulation environment than on a real robot, and it currently has fewer applications than
supervised and unsupervised learning.