A Layman's Guide To The Project
Figure 1: To describe the supervised learning problem slightly more formally, our goal is, given a
training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding
value of y. Here, x represents input and y represents output.
Classification Problem
In our project, we are only concerned with the classification problem. Here,
y can take on only a small number of discrete values.
Figure 3: An example of a classification problem. Here, we are trying to separate the green
points from the red points.
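Our hypothesis $h_\theta(x)$ needs to satisfy $0 \le h_\theta(x) \le 1$. This is
accomplished by plugging $\theta^T x$ into the Logistic Function, also known as the
Sigmoid Function:
$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
Here, $g$ is the sigmoid function (written $\Phi$ in Figure 4).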
Figure 4: The sigmoid function 𝛷 (z), shown here, maps any real number to the (0, 1) interval,
making it useful for transforming an arbitrary-valued function into a function better suited for
classification.
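As a minimal sketch of the hypothesis described above, assuming plain Python lists for θ and x (the names `sigmoid` and `hypothesis` are ours, chosen for illustration, and this is not the project's own code):

```python
import math

def sigmoid(z):
    """Map any real number z into the interval (0, 1)."""
    # Branch on the sign of z so that exp() never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x), with x[0] == 1 as the bias input."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0.0))  # 0.5
```

An input of 0 sits exactly on the decision boundary, which is why the output there is 0.5.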
Cost Function
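We can measure the accuracy of our hypothesis function by using a cost
function. The cost function J(θ) for logistic regression over m training examples is:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$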
To break it apart, we are calculating the cost for the parameter vector $\theta$,
where $h_\theta(x^{(i)})$ is the predicted value for the $i$-th training example and $y^{(i)}$ is the
corresponding actual value.
For a single training example:
- If our correct answer 'y' is 0, then the cost will be 0 if our hypothesis
function also outputs 0. If our hypothesis approaches 1, then the cost
will approach infinity.
- If our correct answer 'y' is 1, then the cost will be 0 if our hypothesis
function outputs 1. If our hypothesis approaches 0, then the cost will
approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for
logistic regression.
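The behaviour described above can be checked numerically. The following is a minimal sketch, assuming the usual cross-entropy form of the logistic regression cost; the function name `logistic_cost` and the clipping constant `eps` are our own additions for illustration:

```python
import math

def logistic_cost(predictions, labels, eps=1e-12):
    """Average cross-entropy cost J over m examples.

    predictions: hypothesis outputs h_theta(x^(i)), each in [0, 1]
    labels:      true values y^(i), each 0 or 1
    eps:         hypothetical clipping constant so log(0) is never evaluated
    """
    m = len(labels)
    total = 0.0
    for h, y in zip(predictions, labels):
        h = min(max(h, eps), 1.0 - eps)
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / m

print(logistic_cost([0.9, 0.1], [1, 0]))  # small cost: confident and right
print(logistic_cost([0.1, 0.9], [1, 0]))  # large cost: confident and wrong
```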
Our main goal here is to minimise the cost function. To achieve that goal, we have
to find $\min_\theta J(\theta)$.
Figure 5: In this graph, we plot $\theta_0$ vs $\theta_1$ vs $J(\theta_0, \theta_1)$ to find the global minimum of $J(\theta)$.
Learning Algorithm
For the purpose of this project, we choose gradient descent to find the
minimum value of cost function J(𝜃).
Gradient Descent
So we have our hypothesis function and we have a way of measuring how
well it fits the data. Now we need to estimate the parameters in the
hypothesis function. That's where gradient descent comes in. In the gradient descent
algorithm, we initialise the parameter vector θ with some random values and run the
algorithm for a number of iterations, until θ converges to the values that give the
global minimum of J(θ). We also have to check that the value of J(θ) decreases in
every iteration.
The way we do this is by taking the derivative of our cost function. The
derivative at a point is the slope of the tangent line to the function at that
point, and it gives us a direction to move in. We take steps down the
cost function in the direction of steepest descent. The size of each step is
determined by the parameter α, which is called the learning rate.
The gradient descent algorithm is:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
This update is applied for each input feature j, and all the components of the
parameter vector θ should be updated simultaneously.
On a side note, we should adjust our parameter ‘α’ to ensure that the
gradient descent algorithm converges in a reasonable time. Failure to converge, or
taking too much time to reach the minimum, implies that our step size is wrong.
While a small value of α takes a long time to converge, a large value may never
converge at all, oscillating on both sides of the minimum.
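Solving the derivative $\frac{\partial}{\partial \theta_j} J(\theta)$ for the logistic regression cost function
gives the following update rule, where m is the number of training examples:
$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$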
Figure 6: In GD, we initialise the parameter vector with some random values and run the algorithm
for a number of iterations, until the value of θ converges to the global minimum.
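To make the procedure concrete, here is a minimal sketch of batch gradient descent for logistic regression in plain Python. It assumes the standard gradient of the cross-entropy cost, $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}$; the toy data, the function name and the default values are made up for illustration and are not the project's code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.5, iterations=1000):
    """Batch gradient descent for logistic regression.

    X: list of feature rows, each starting with the bias input 1.0
    y: list of 0/1 labels
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n  # zeros are fine here; random values also work
    for _ in range(iterations):
        # Errors h_theta(x^(i)) - y^(i) under the current theta.
        errors = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) - yi
                  for row, yi in zip(X, y)]
        # Gradient for every j, computed before any theta_j changes,
        # so that the update is simultaneous.
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
                for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up): the label is 1 when the single feature exceeds about 2.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0], [1.0, 3.5]]
y = [0, 0, 0, 1, 1, 1]
theta = gradient_descent(X, y)
print(theta)
```

Computing the whole gradient before touching θ is what makes the update simultaneous; updating each θ_j in place inside the loop would mix old and new values.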
Multiclass Classification: One-vs-all
As our problem is to classify a single digit (0-9) for a given input, we
generalize logistic regression to multiclass problems.
Since y ∈ {0, 1, ..., n}, we divide our problem into n+1 (+1 because the index
starts at 0) binary classification problems; in each one, we predict the probability
that 'y' is a member of one of our classes.
We are basically choosing one class and then lumping all the others into a
single second class. We do this repeatedly, applying binary logistic regression to
each case, and then use the hypothesis that returned the highest value as our
prediction.
Figure 7: This figure shows how one could classify 3 classes: $h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$ (i = 1, 2, 3)
To summarize:
- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the
probability that y = i.
- To make a prediction on a new x, pick the class i that maximizes $h_\theta^{(i)}(x)$.
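The prediction step summarized above can be sketched as follows, assuming one already-trained parameter vector per class. The parameter values below are hand-picked for illustration, not trained, and the function name is hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, x):
    """Return the index of the classifier with the highest output for x.

    all_theta: one parameter vector per class (already trained)
    x:         feature vector with the bias input 1.0 in position 0
    """
    scores = [sigmoid(sum(t * xi for t, xi in zip(theta, x)))
              for theta in all_theta]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical, hand-picked parameters for three classes (not trained values).
all_theta = [[ 2.0, -1.0, -1.0],
             [-1.0,  2.0, -1.0],
             [-1.0, -1.0,  2.0]]
print(predict_one_vs_all(all_theta, [1.0, 3.0, 0.5]))  # 1
```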
For the purpose of our project, instead of using the plain gradient descent
algorithm, we used an optimised routine, fmincg(). This function
is more sophisticated and provides a faster way to optimize θ that can be
used instead of gradient descent. It works on a continuous, differentiable
multivariate function. We just need to provide it with the following two quantities for a
given input value θ:
1. The cost function J(θ).
2. The partial derivatives of the cost function, $\frac{\partial}{\partial \theta_j} J(\theta)$.
Then, over multiple iterations, "fmincg()" can quickly reach the global minimum of
the cost function.
This function works faster and more efficiently with larger datasets.
Figure 8: The problem of overfitting (or underfitting) is caused by the wrong choice of the
hypothesis function and it can either result in a function too simple or a very complicated
function which does not generalize to predict new data.
This terminology is applied to both linear and logistic regression. There are two
main options to address the issue of overfitting:
Regularization
Recall that our cost function for logistic regression was:
Now we can apply gradient descent on J(θ) to get the desired result.
Our goal is to minimise the cost function J(θ) while still using all the
features and their parameters in θ. With the help of a validation set, we can choose the
best-fitting value of λ to achieve this goal.
Part II
Biological Neuron
Figure 9: A biological neuron gets the input signals along the dendrites and based on them,
sends an output signal along the axon.
Biological neural networks in our brain have inspired the design of artificial
neural networks (ANN). They originated from attempts to find algorithms that mimic
the brain. They were very widely used in the 80s and early 90s, but their popularity
diminished in the late 90s due to the lack of enough computing power. They have seen a
recent resurgence thanks to highly improved computing performance, and are now
state-of-the-art techniques for many applications.
Figure 10: An artificial neuron, analogous to the biological neuron: the yellow circle is the body
of the neuron, taking input from the black circles and passing the output to the next (level) neuron.
An artificial neuron is a mathematical function conceived as a model of a
biological neuron. Artificial neurons are the elementary units of an artificial neural
network. In an artificial neuron, the input features are $x_1 \cdots x_n$, and the output is
the result of our hypothesis function. In the above model, the $x_0$ input node is
called the "bias unit." It is always equal to 1. For this project, we used the same transfer
function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$, which is called the sigmoid (logistic) activation
function. Connecting each input to the neuron, there are "weights" (here
represented by the "theta" parameters).
It can also be represented as:
Our input nodes (layer 1), also known as the "input layer", go into another
node (layer 2), known as the "output layer", which finally outputs the
hypothesis function.
Figure 11: A: A single neuron can only form a linear decision boundary separating two sets of
points. B: If the set of points is not linearly separable, we need to use a neural network
consisting of at least two layers of neurons to form a non-linear decision boundary.
Figure 12: A two-layer neural network consists of 1 input layer, 1 hidden layer and 1 output layer.
Feedforward Propagation
Figure 13: In forward propagation, the activation values of the neurons are calculated using the
inputs and the weights of the connections; and propagated forward to their next neurons.
In this example, we label these intermediate or "hidden" layer nodes $a_0^{(2)} \cdots a_n^{(2)}$ and
call them "activation units."
We get this final z vector by multiplying the next theta matrix after $\Theta^{(j-1)}$
with the values of all the activation nodes we just got. This last theta matrix $\Theta^{(j)}$
will have only one row, which is multiplied by one column $a^{(j)}$, so that our result is
a single number. We then get our final result with:
$$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)}), \qquad z^{(j+1)} = \Theta^{(j)} a^{(j)}$$
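The layer-by-layer computation can be sketched in a few lines of Python. This is a minimal illustration, assuming each theta matrix is stored as a list of rows with the bias weight in column 0; the weights below are hand-set (not trained) so that the tiny network computes a non-linear function (XNOR) of two binary inputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Feed x forward through the network and return the output activations.

    thetas: one weight matrix per layer transition; row k holds the weights
            into unit k of the next layer, with the bias weight in column 0
    x:      input features without the bias unit
    """
    a = list(x)
    for theta in thetas:
        a = [1.0] + a  # prepend the bias unit a_0 = 1
        z = [sum(w * ai for w, ai in zip(row, a)) for row in theta]
        a = [sigmoid(zk) for zk in z]
    return a

# Hypothetical hand-set weights: 2 inputs -> 2 hidden units -> 1 output.
theta1 = [[-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
          [10.0, -20.0, -20.0]]   # hidden unit 2: (NOT x1) AND (NOT x2)
theta2 = [[-10.0, 20.0, 20.0]]    # output: hidden 1 OR hidden 2
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward([theta1, theta2], x)[0]))
```

The output is close to 1 only when both inputs agree, a boundary that a single neuron cannot form.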
Figure 14: To classify data into multiple classes, we let our hypothesis function return a vector of
values. Here we wanted to classify our data into one of four categories.
A training set is in the form of $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$.
In the above figure, we have $h_\Theta(x) \in \mathbb{R}^4$, so the ANN has to identify 4 possible
outcomes. So the output vector from the neural network can be:
Each $y^{(i)}$ represents a different case. Each of the inner layers provides us with some
new information, which leads to our final hypothesis function. The setup looks like:
In our project, the ANN has to decide among multiple choices (10 to be
specific). To address this problem, the output layer will have 10 neurons: the first
output neuron will try to classify whether it is case 1 or not; and so will the rest of
the neurons try to classify their corresponding cases. To get a more precise
output, each neuron will give the probability for each case.
Back-propagation
Cost Function:
Let's first define a few variables that we will need to use:
Recall that the cost function for regularized logistic regression was:
In the regularization part, after the square brackets, we must account for
multiple theta matrices. The number of columns in our current theta matrix is
equal to the number of nodes in our current layer (including the bias unit). The
number of rows in our current theta matrix is equal to the number of nodes in the
next layer (excluding the bias unit). As before with logistic regression, we square
every term.
Algorithm:
"Back-propagation" is neural-network terminology for minimizing our cost
function, just like what we were doing with gradient descent in logistic and linear
regression. Our goal is to compute $\min_\Theta J(\Theta)$.
That is, we want to minimize our cost function J using an optimal set of
parameters in theta. In this section we'll look at the equations we use to compute
the partial derivative of J(Θ):
Figure 15: In backpropagation, the error values are propagated backwards, starting from the
output, until each neuron has an associated error value which roughly represents its
contribution to the original output.
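4. Compute the delta values of each hidden layer:
$$\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \circ g'(z^{(l)})$$
The delta values of layer l are calculated by multiplying the delta values in
the next layer with the theta matrix of layer l. We then element-wise multiply (∘) that
with a function called g', which is the derivative of the activation function g
evaluated with the input values given by $z^{(l)}$.
The g-prime derivative terms can also be written out as:
$$g'(z^{(l)}) = a^{(l)} \circ \left(1 - a^{(l)}\right)$$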
5.
or with vectorization,
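As a rough sketch of how the delta values of one hidden layer follow from those of the next layer, assuming the sigmoid activation (so that g'(z) = a(1 − a)) and a theta matrix stored as rows with the bias weight in column 0. The function names and the numbers are made up for illustration:

```python
def sigmoid_gradient(a):
    """g'(z) expressed through the activation a = g(z): a * (1 - a)."""
    return a * (1.0 - a)

def hidden_delta(theta, delta_next, a):
    """Error terms of layer l from those of layer l + 1.

    theta:      weights from layer l to l + 1 (rows: units of layer l + 1,
                columns: units of layer l, bias weight in column 0)
    delta_next: error terms of layer l + 1
    a:          activations of layer l, without the bias unit
    """
    deltas = []
    for j, aj in enumerate(a):
        # Column j + 1 skips the bias column, which receives no error term.
        back = sum(row[j + 1] * d for row, d in zip(theta, delta_next))
        deltas.append(back * sigmoid_gradient(aj))
    return deltas

# Made-up numbers: one output unit with error 0.2, two hidden units.
print(hidden_delta([[0.1, 0.5, -0.4]], [0.2], [0.5, 0.5]))
```

Each hidden unit's error is the output error scaled by the weight that connects them, then damped by the slope of the sigmoid at that unit.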
Putting It Together
First, we have to pick a network architecture:
choose the layout of our neural network, including how many hidden units in each
layer and how many layers in total we want to have.
- Number of input units = dimension of features $x^{(i)}$
- Number of output units = number of classes
- Number of hidden units per layer = usually the more the better (must balance
with the cost of computation, as it increases with more hidden units)
- Defaults: 1 hidden layer. If we have more than 1 hidden layer, then it is
recommended that we have the same number of units in every hidden
layer.
Part III
Figure 16: A few images from the MNIST dataset, 20 images of each digit. Note: The original
images were white-on-black. They have been inverted for display here.
Each of these images is an 8-bit grayscale image, 28 by 28 pixels in size.
Figure 17: Each image is 28 by 28 pixels (= a total of 784 pixels) which is fed into the neural
network. So the input layer of the NN consists of 784 (one for each pixel) + 1 (for bias) neurons.
There are 60,000 training examples in the MNIST dataset, where each training
example is a 28 pixel by 28 pixel grayscale image of a digit. Each pixel is
represented by a floating point number indicating the grayscale intensity at that
location. The 28 by 28 grid of pixels is “unrolled” into a 784-dimensional vector.
Each of these training examples becomes a single row in our data matrix X. This
gives us a 60,000 by 784 matrix X where every row is a training example for a
handwritten digit image:
The second part of the training set is a 60,000-dimensional vector y that contains
labels for the training set.
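The "unrolling" of one image into a row of X can be sketched as below. Row-by-row order is an assumption made for this illustration; any fixed order works as long as it is the same for every image:

```python
def unroll(image):
    """Flatten a 2-D grid of pixels (list of rows) into one feature vector."""
    return [pixel for row in image for pixel in row]

# A blank 28 x 28 image stands in for a real MNIST digit here.
image = [[0] * 28 for _ in range(28)]
image[3][5] = 255  # row 3, column 5
x = unroll(image)
print(len(x), x.index(255))  # 784 89
```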
Feature Scaling
We can speed up gradient descent (or other optimization algorithm) by
having each of our input values in roughly the same range. This is because θ (the
weights) will descend quickly on small ranges and slowly on large ranges, and so
will oscillate inefficiently down to the optimum when the variables are very
uneven.
The way to prevent this is to modify the ranges of our input variables so
that they are all roughly the same. One way to achieve this is feature scaling (it is
also known as data normalization). It involves dividing the input values by the
range (i.e. the maximum value minus the minimum value) of the input variable,
resulting in a new range of just 1.
Figure 18: Without feature scaling, gradient descent (its path is drawn in red) could take a long
time, going back and forth, to find the optimal solution. If instead we scale our features, the
contours of the cost function might look like circles; then gradient descent can take a much
straighter path and reach the optimal point much faster.
In the MNIST dataset, all the pixels are represented by their grayscale
values (0-255). So, we divided each pixel value by 255 to bring the values into the range 0-1.
This dramatically reduces the number of iterations required for training.
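A minimal sketch of this scaling step, assuming pixel values stored as a plain Python list (the function name is ours):

```python
def scale_pixels(row, max_value=255.0):
    """Divide every grayscale value by the range so it falls in [0, 1]."""
    return [p / max_value for p in row]

print(scale_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
```

Because the minimum grayscale value is 0, dividing by the range (255 − 0) is the same as dividing by the maximum.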
Training
Figure 19: Our neural network with 3 layers – an input layer, a hidden layer and an output layer.
Back-Propagation
We will implement the backpropagation algorithm to train the NN and minimize
the cost. The intuition behind the back-propagation algorithm is as follows:
1. Given a training example $(x^{(t)}, y^{(t)})$, we will first run a “forward pass” to
compute all the activations throughout the network, including the output
value of the hypothesis $h_\Theta(x)$.
2. Then, for each node j in layer l, we would compute an “error term” $\delta_j^{(l)}$ that
measures how much that node was “responsible” for any errors in our
output.
3. For an output node, we can directly measure the difference between the
network’s activation and the true target value, and use that to define $\delta_j^{(3)}$
(since layer 3 is the output layer).
4. For the hidden units, we will compute $\delta_j^{(l)}$ based on a weighted average of
the error terms of the nodes in layer (l + 1).
Figure 20: Visualizing 25 hidden neurons after training the NN with the MNIST dataset.
Model Selection using Validation Set
Just because a learning algorithm fits a training set well, that does not mean
it is a good hypothesis. It could overfit, and as a result our predictions on the test
set would be poor. The error of our hypothesis as measured on the data set with
which we trained the parameters will typically be lower than the error on any other data
set.
Given many models with different hyperparameters, we can use a
systematic approach to identify the 'best' function. So, we will define a dataset to
"test" the model in the training phase (i.e., the validation dataset), in order to limit
problems like overfitting and others. For our project, we followed the holdout
method of cross-validation. In the holdout method, we randomly assign data
points to two sets d0 and d1, usually called the training set and the validation set,
respectively. The size of each of the sets is arbitrary although typically the
validation set is smaller than the training set. We then train on d0 and validate the
model on d1. In typical cross-validation, multiple runs are aggregated together; in
contrast, the holdout method, in isolation, involves a single run.
We divided the training set of the MNIST dataset containing 60,000 images
into two parts: 70% training set (for training the different models), and the
remaining 30% is the validation set (for testing the different models). Then there is
the test set containing 10,000 images of the MNIST database for testing the best
chosen model on unseen data.
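The 70/30 holdout split can be sketched as follows. This is only an illustration in Python, with example indices standing in for the images; the fixed `seed` is our own addition so that the split is reproducible:

```python
import random

def holdout_split(examples, train_fraction=0.7, seed=0):
    """Randomly split examples into a training set and a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train, validation = holdout_split(range(60000))
print(len(train), len(validation))  # 42000 18000
```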
The steps we followed to choose the best suited value of the hyper-parameters
are:
1. Optimize the parameters in Θ using the training set for each value of the
hyper-parameters.
2. Find the value of the hyper-parameter with the least error (misclassification
error) using the validation set.
3. Use this value of the hyper-parameters to form the final model and check
the error of this model using the test set (unseen data). This way, the hyper-
parameters have not been trained using the test set.
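The selection steps above can be sketched as a simple loop. Here `train` and `validation_error` are hypothetical callbacks standing in for the real training and evaluation code, and the error curve used in the demonstration is invented so that the sketch runs on its own:

```python
def select_hyperparameter(candidates, train, validation_error):
    """Return the candidate whose trained model has the lowest validation error.

    train(value)            -> model fitted on the training set (hypothetical)
    validation_error(model) -> misclassification error on the validation set
    """
    best_value, best_error = None, float("inf")
    for value in candidates:
        model = train(value)
        error = validation_error(model)
        if error < best_error:
            best_value, best_error = value, error
    return best_value, best_error

# Stand-in callbacks so the sketch runs; the error curve is invented and
# merely has its minimum at 0.1.
lambdas = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
best, err = select_hyperparameter(
    lambdas,
    train=lambda lam: lam,
    validation_error=lambda model: abs(model - 0.1),
)
print(best)  # 0.1
```

The test set never appears in this loop; it is only used once, on the model built with the selected value.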
The two hyper-parameters we are concerned with, in our neural network model
are the regularization parameter λ and the number of neurons in the hidden layer:
Regularization parameter λ
We checked with these values of the regularization parameter λ:
[0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10]
Figure 21: Checking the different models of the neural network using the validation set, we get
the best accuracy at λ = 0.1.
The training and the validation error we got with different values of the
regularization parameter (lambda) are:
So, we chose the value of the regularization parameter to be 0.1 for our final
model.
Figure 22: We gain more accuracy if we increase the number of hidden neurons, but then the
accuracy decreases at some point due to overfitting the training set.
As you can see, we gain more accuracy if we increase the number of hidden
neurons, but then the accuracy decreases at some point (results may differ a bit
due to random initialization of weights). As we increase the number of neurons,
our model will be able to capture more features, but if we capture too many
features, then we end up overfitting our model to the training data and it won't do
well with unseen data.
We found that we can achieve high accuracy on the validation set with 200 hidden
neurons.
Note:
1. There is another hyper-parameter, the learning rate (alpha) in the gradient
descent learning algorithm. We can achieve good performance at alpha =
0.5, but we are using fmincg(), an optimized learning algorithm for working
with large datasets like the MNIST dataset. This algorithm doesn’t need an
explicit learning rate.
2. We didn’t use K-Fold Cross-Validation. One of the main reasons for using
cross-validation instead of conventional validation (the holdout
method) is that there is not enough data available to partition it into
separate training and test sets. It can then help avoid overfitting the
function to the validation set. But we are using the MNIST dataset, where
enough data is already available, with separate data for training and testing.
Testing
We used the validation set and found that the best value of regularization
parameter is 0.1. We also found that with increasing the number of neurons in the
hidden layer, we can achieve higher accuracies on the test data.
We used the cross-entropy loss as the cost function; backpropagation for
updating the weights of each neuron; and fmincg() algorithm as the cost function
optimizer.
In each iteration (epoch), we trained using all the 60,000 images (batch-size)
of the training set. And we used the classification accuracy (i.e., the percentage of
images correctly recognized by our NN out of all the images in the set) to report the
training and test set accuracy.
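For reference, the accuracy measure and its complement, the misclassification error, can be computed as in this small sketch (the labels are made up):

```python
def accuracy(predicted, actual):
    """Percentage of examples whose predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

acc = accuracy([7, 2, 1, 0, 4], [7, 2, 1, 0, 9])
print(acc, 100.0 - acc)  # 80.0 20.0
```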
We tried different combinations of the number of hidden units and the
number of iterations and recorded the accuracies on the dataset.
Hidden units | Iterations | Training accuracy (%) | Test accuracy (%)
50           | 30         | 91.54                 | 91.57
100          | 30         | 90.29                 | 90.94
200          | 100        | 96.261667             | 96.07
Note:
1. Due to the random initialization of the weights of the neural network, we might get
slightly different values when training and testing again.
2. If we train using the first 1000 images of the MNIST Training Set for 1000
iterations, we can achieve an MNIST Test Set accuracy of 87.37%.
This is really interesting, as we can achieve accuracies close to 90% using just a
small part of the MNIST Training Set.
This result was obtained using Gradient Descent (for 1000 iterations, every time using the
first 1000 images of the MNIST Training Set) with learning rate = 0.5, regularisation
parameter = 0.1 and no. of hidden units = 25.
3. We haven’t used any ready-made toolkit for achieving these results.
Part IV
Figure 23: Input image with multiple digits before and after segmentation
Segmentation
We take an image containing more than one digit.
We detect the individual objects and use ‘regionprops’ to find the bounding
boxes around each such object.
The next step is to return a structure whose elements are
the digits within the bounding boxes.
The bounding box property contains four elements: the x and y coordinates
of the starting point of the box around each object, and the width and the
height of the box.
If we want to see the segmented digits within the image, we can draw a
coloured box around each object.
Next, we store each individual object in a cell array, so that we can extract
each digit whenever we need it.
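To illustrate the idea behind the bounding boxes without relying on ‘regionprops’, here is a small stand-in sketch in plain Python. It is not the project's code: it works on a binary image stored as lists of 0/1 pixels and uses 4-connectivity, a simplification chosen for this sketch:

```python
def bounding_boxes(image):
    """Bounding box (x, y, width, height) of each connected foreground blob.

    image: list of rows of 0/1 pixels; 1 marks ink.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one blob, tracking the extreme coordinates.
            stack = [(r, c)]
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while stack:
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return sorted(boxes)  # left-to-right order, like reading the digits

def crop(image, box):
    """Cut the pixels inside one bounding box out of the image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

image = [[0, 1, 0, 0, 0, 0],
         [0, 1, 0, 1, 1, 0],
         [0, 1, 0, 1, 1, 0],
         [0, 0, 0, 0, 0, 0]]
boxes = bounding_boxes(image)
print(boxes)  # [(1, 0, 1, 3), (3, 1, 2, 2)]
digits = [crop(image, b) for b in boxes]
```

The list `digits` plays the role of the cell array: each entry holds the pixels of one segmented object.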
Normalization
In our project for digit recognition, we apply normalization in the sense that
the image to be fed to the neural network, be reduced to the size that our
neural network will accept.
Hence, we normalize our image to 28x28 pixels. We have tried our best to
maintain the aspect ratio of the image or else the accuracy for our result
could have been lesser.
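One way to keep the aspect ratio while still producing a 28x28 input is to scale the longer side to 28 pixels and pad the shorter side with background. The sketch below shows only the size computation and the padding; this padding approach is an assumption made for illustration, since the exact method used in the project is not described here:

```python
def fit_size(width, height, target=28):
    """New (width, height) that fits in target x target, keeping aspect ratio."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

def pad_to_square(image, target=28, background=0):
    """Centre an already-resized image on a target x target background."""
    h, w = len(image), len(image[0])
    top, left = (target - h) // 2, (target - w) // 2
    canvas = [[background] * target for _ in range(target)]
    for r in range(h):
        canvas[top + r][left:left + w] = image[r]
    return canvas

print(fit_size(40, 100))  # (11, 28)
```

A tall digit such as "1" would then occupy a narrow strip in the middle of the 28x28 canvas instead of being stretched to fill it.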