A Layman's Guide To The Project

Uploaded by

feiairic

Part I

About the Technology Used

Supervised Learning
For the purpose of our project we used supervised learning. It is a learning
method that can infer a function from labeled training data. The training data
consist of a set of training examples. Each example is a pair consisting of an input
object (typically a vector) and a desired output value (also called the supervisory
signal). A supervised learning algorithm analyzes the training data and produces an
inferred function, which can be used for mapping new examples. An optimal
scenario will allow for the algorithm to correctly determine the class labels for
unseen instances. This requires the learning algorithm to generalize from the
training data to unseen situations in a "reasonable" way.

Figure 1: To describe the supervised learning problem slightly more formally, our goal is, given a
training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding
value of y. Here, x represents input and y represents output.

Supervised learning can be broadly classified into regression and classification
problems.
In a regression problem, we try to predict the value of a continuous-valued
function, given a number of input data.
Figure 2: In a regression problem, we try to find the best values of the parameters
θ0 and θ1. Choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
In classification, by contrast, we try to find the correct class label for the given
input. So, in Figure 1, if the 'predicted y' is continuous, we call it a regression
problem, and if the 'predicted y' takes discrete values only, we call it a
classification problem.

Classification Problem
In our project, we are only concerned with the classification problem. Here,
y can take on only a small number of discrete values.

Figure 3: An example of a classification problem. Here, we are trying to separate the green
points from the red points.

Our hypothesis hθ(x) needs to satisfy 0 ≤ hθ(x) ≤ 1. This is
accomplished by plugging θ^T x into the logistic function, also known as the
sigmoid function. Here,

$$h_\theta(x) = g(z), \qquad z = \theta^T x, \qquad g(z) = \frac{1}{1 + e^{-z}}$$

i.e.,

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Figure 4: The sigmoid function 𝛷 (z), shown here, maps any real number to the (0, 1) interval,
making it useful for transforming an arbitrary-valued function into a function better suited for
classification.

hθ(x) gives us the probability that our output is 1. For example, hθ(x) = 0.7
gives us a probability of 70% that our output is 1. The probability that our
prediction is 0 is just the complement of the probability that it is 1 (e.g. if the
probability that it is 1 is 70%, then the probability that it is 0 is 30%).
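
The hypothesis above can be sketched in a few lines of Python. This is only an illustration: the parameter values are made up, and the names `sigmoid` and `hypothesis` are our own, not part of the project code.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to start with the bias term x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]          # illustrative parameters, not values from the project
x = [1.0, 0.5]               # x_0 = 1 (bias), x_1 = 0.5
p1 = hypothesis(theta, x)    # z = -1 + 2 * 0.5 = 0, so P(y = 1) = 0.5
p0 = 1.0 - p1                # P(y = 0) is the complement
```

Splitting the sigmoid into two branches is a design choice that keeps `math.exp` from overflowing for very negative z.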

Cost Function
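We can measure the accuracy of our hypothesis function by using a cost
function. The cost function J(θ) for logistic regression is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)$$

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

To break it apart, we are calculating the cost for parameter vector θ, where
hθ(x^(i)) is the predicted value for the i-th training example, y^(i) is the
corresponding actual value, and m is the total number of training examples.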
Using these equations:
- If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis
function also outputs 0. If our hypothesis approaches 1, then the cost function
will approach infinity.
- If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis
function outputs 1. If our hypothesis approaches 0, then the cost function will
approach infinity.

Note that writing the cost function in this way guarantees that J(θ) is convex for
logistic regression.
Our main goal here is to minimise the cost function. To achieve that goal, we have
to find min_θ J(θ).
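
As a small sketch of how this cost behaves, the following Python computes J(θ) on a made-up two-example data set. The data and parameter values are assumptions chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta), averaged over the m training examples."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m

# Toy data: each row starts with the bias term x_0 = 1
X = [[1.0, 0.0], [1.0, 1.0]]
y = [0, 1]

j_zero = cost([0.0, 0.0], X, y)     # h = 0.5 for every example, so J = log(2)
j_good = cost([-5.0, 10.0], X, y)   # confident and correct: cost close to 0
j_bad = cost([5.0, -10.0], X, y)    # confident and wrong: cost is large
```

The three values show the behaviour described above: the cost is small when the hypothesis agrees with the labels and grows quickly when it confidently disagrees.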

Figure 5: In the above graph we plot θ0 vs θ1 vs J(θ0, θ1) to find the global minimum of J(θ).

Learning Algorithm
For the purpose of this project, we choose gradient descent to find the
minimum value of cost function J(𝜃).

Gradient Descent
So we have our hypothesis function and we have a way of measuring how
well it fits the data. Now we need to estimate the parameters in the
hypothesis function. That's where gradient descent comes in. In the gradient
descent algorithm we initialise the parameter vector θ with some random values and
run the algorithm for a number of iterations, until θ converges to the minimum of
J(θ). We should also check that the value of J(θ) decreases in every iteration.
The way we do this is by taking the derivative of our cost function, i.e. the slope
of the tangent line to the function at the current point. This slope gives us a
direction to move towards. We make steps down the cost function in the direction of
steepest descent. The size of each step is determined by the parameter α, which is
called the learning rate.
The gradient descent algorithm is:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

All the parameters θj (one for each input feature j, plus θ0) should be updated
simultaneously.
On a side note, we should adjust the learning rate α to ensure that the
gradient descent algorithm converges in a reasonable time. Failure to converge, or
taking too much time to reach the minimum, implies that our step size is wrong.
While a small value of α takes a long time to converge, a large value may never
converge at all, oscillating on both sides of the minimum.
Solving the derivative ∂/∂θj J(θ) gives:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$$

where m is the total number of training examples.

Figure 6: In gradient descent, we initialise the parameter vector with some random values and
run the algorithm for a number of iterations, until θ converges to the global minimum.
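
The following is a minimal sketch of batch gradient descent for logistic regression, assuming the standard gradient (1/m) Σ (hθ(x) − y) x_j. The toy data, the zero initialisation, the learning rate and the iteration count are all illustrative choices, not the project's settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y):
    m = len(y)
    return sum(-yi * math.log(predict_prob(theta, xi))
               - (1 - yi) * math.log(1 - predict_prob(theta, xi))
               for xi, yi in zip(X, y)) / m

def gradient(theta, X, y):
    """Partial derivatives of J with respect to every theta_j."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict_prob(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    return grad

def gradient_descent(X, y, alpha, iterations):
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        # Simultaneous update: every theta_j uses the gradient of the old theta
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
    return theta, history

# Toy data (made up): bias term plus one feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta, history = gradient_descent(X, y, alpha=0.5, iterations=200)
```

Recording the cost after every step in `history` makes it easy to check that J(θ) keeps decreasing, which is the convergence check mentioned above.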

Multiclass Classification: One-vs-all
As our problem is to classify a single digit (0-9) for a given input, we generalize
logistic regression to multiclass problems.
Since y ∈ {0, 1, ..., n}, we divide our problem into n+1 (+1 because the index
starts at 0) binary classification problems; in each one, we predict the probability
that 'y' is a member of one of our classes.

We are basically choosing one class and then lumping all the others into a
single second class. We do this repeatedly, applying binary logistic regression to
each case, and then use the hypothesis that returned the highest value as our
prediction.

Figure 7: This figure shows how one could classify 3 classes: hθ^(i)(x) = P(y = i | x; θ) (i = 1, 2, 3)

To summarize:
- Train a logistic regression classifier hθ^(i)(x) for each class i to predict the
probability that y = i.
- To make a prediction on a new x, pick the class i that maximizes hθ^(i)(x).
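
The two steps above can be sketched as follows. The three parameter vectors are hand-picked for illustration (they are not trained values), and the helper names are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_labels(y, cls):
    """Relabel y as 1 for the chosen class and 0 for all the other classes."""
    return [1 if yi == cls else 0 for yi in y]

def predict_one_vs_all(all_theta, x):
    """Return (class, probability) for the classifier with the highest output."""
    probs = [sigmoid(sum(t * v for t, v in zip(theta, x))) for theta in all_theta]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

# Hypothetical parameters for 3 classes; each row is one binary classifier
all_theta = [
    [1.0, -3.0, -3.0],
    [-1.0, 3.0, -3.0],
    [-2.0, 0.0, 3.0],
]
label, prob = predict_one_vs_all(all_theta, [1.0, 0.0, 1.0])
binary_targets = one_vs_all_labels([0, 1, 2, 1], cls=1)
```

In training, `one_vs_all_labels` would be called once per class to build the binary targets for that class's classifier.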
For the purpose of our project, instead of using the plain gradient descent
algorithm, we used an optimised alternative, fmincg(). This function is more
sophisticated and provides us with a faster way to optimize θ than gradient
descent. It works on a continuous, differentiable multivariate function. We just
need to provide it with the following two quantities for a given input value θ:
1. The cost function J(θ).
2. The partial derivatives ∂/∂θj J(θ) of the cost function.

Then, over multiple iterations, fmincg() can quickly reach the global minimum of
the cost function.
This function works faster and more efficiently than gradient descent on larger
datasets.

The Problem of Overfitting

Figure 8: The problem of overfitting (or underfitting) is caused by the wrong choice of the
hypothesis function and it can either result in a function too simple or a very complicated
function which does not generalize to predict new data.

Underfitting, or high bias, is when the form of our hypothesis function hθ(x)
maps poorly to the trend of the data. It is usually caused by a function that
is too simple or uses too few features.
At the other extreme, overfitting, or high variance, is caused by a hypothesis
function that fits the available data but does not generalize well to predict new
data. It is usually caused by a complicated function that creates a lot of
unnecessary curves and angles unrelated to the data.
Between these two extremes lies the hypothesis function which is just right, and
provides the best fit to the data.

This terminology is applied to both linear and logistic regression. There are two
main options to address the issue of overfitting:

1) Reduce the number of features:

- Manually select which features to keep.
- Use a model selection algorithm.
2) Regularization
- Keep all the features, but reduce the magnitude of parameters 𝜃𝑗 .
- Regularization works well when we have a lot of slightly useful features.

Regularization
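Recall that our cost function for logistic regression was (written as a single
expression):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]$$

We can regularize this equation by adding a term to the end:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Here λ is the regularization parameter. It determines how much the costs of
our θ parameters are inflated. Because the added sum starts at j = 1, we penalize
all the parameters except θ0.
Using the above cost function with the extra summation, we can smooth
the output of our hypothesis function to reduce overfitting. If λ is chosen to be
too large, it may smooth out the function too much and cause underfitting.
Thus, when running gradient descent, we should repeatedly update the two
following equations:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \Big[ \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)} + \frac{\lambda}{m} \theta_j \Big], \qquad j \in \{1, 2, \ldots, n\}$$

Now we can apply gradient descent on J(θ) to get the desired result.
Our goal is to minimise the cost function J(θ) using all n features of the
parameter vector θ. With the help of a validation set, we can choose the best
value of λ to achieve this goal.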
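
A minimal sketch of the regularized cost and its gradient is given below, assuming the usual L2 penalty (λ/2m) Σ θj² that leaves θ0 unpenalized. The data and parameter values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost_and_gradient(theta, X, y, lam):
    """Return (J, gradient) for logistic regression with an L2 penalty on theta[1:]."""
    m = len(y)
    cost = 0.0
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        cost += (-yi * math.log(h) - (1 - yi) * math.log(1 - h)) / m
        for j, xij in enumerate(xi):
            grad[j] += (h - yi) * xij / m
    # The penalty skips theta[0], the bias parameter
    cost += lam / (2 * m) * sum(t * t for t in theta[1:])
    for j in range(1, len(theta)):
        grad[j] += lam / m * theta[j]
    return cost, grad

# Toy inputs (assumed values)
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
theta = [0.5, 2.0]
cost_plain, grad_plain = regularized_cost_and_gradient(theta, X, y, lam=0.0)
cost_reg, grad_reg = regularized_cost_and_gradient(theta, X, y, lam=1.0)
```

Comparing the two calls shows the effect of λ: the cost grows by (λ/2m) θ1², the gradient for θ1 grows by (λ/m) θ1, and the gradient for θ0 is unchanged.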

Part II

Understanding Neural Networks

Biological Neuron

Figure 9: A biological neuron gets the input signals along the dendrites and based on them,
sends an output signal along the axon.

Biological neural networks in our brain have inspired the design of artificial
neural networks (ANNs). They originated from attempts to find algorithms that mimic
the brain. They were very widely used in the 80s and early 90s, but their popularity
diminished in the late 90s due to the lack of sufficient computing power. They have
seen a recent resurgence thanks to highly improved computing performance, and are
state-of-the-art techniques for many applications.

Neuron model: Logistic Unit

Figure 10: An artificial neuron analogous to the biological neuron. The yellow circle is the body
of the neuron, taking input from the black circles and passing the output to the next (level) neuron.

An artificial neuron is a mathematical function conceived as a model of a
biological neuron. Artificial neurons are the elementary units in an artificial
neural network. In an artificial neuron, the input features are x1 ⋯ xn, and the
output is the result of our hypothesis function. In the above model, the x0 input
node is called the "bias unit." It is always equal to 1. For this project, we used
the same transfer function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$,
called the sigmoid (logistic) activation function. Connecting each input to the
neuron, there are "weights" (here represented by the "theta" parameters).
It can also be represented as:

Our input nodes (layer 1), also known as the "input layer", go into another
node (layer 2), which finally outputs the hypothesis function, known as the
"output layer".

Building A Neural Network

A single neuron is a type of linear classifier, i.e. a classification algorithm
that can separate two set of input points which are linearly separable. But if the
set of points is not linearly separable (ie. we need a non-linear decision boundary),
then we require a neural network containing at least two layers of neurons.

Figure 11: A: A single neuron can only form a linear decision boundary separating two set of
points. B: If the set of points is not linearly separable, we need to use a neural network
consisting of at least two layers of neurons, to form a non-linear decision boundary.

In a neural network, there are multiple neurons (also called nodes), arranged into
layers. Our input nodes (layer 1), also known as the "input layer", go into another
node (layer 2), which finally outputs the hypothesis function, known as the "output
layer". We can have intermediate layers of nodes between the input and output
layers, called the "hidden layers." With hidden layers, the final hypothesis
function can be a complex non-linear function, able to fit complex distributions of
points.

Figure 12: A two-layer neural network consists of 1 input layer, 1 hidden layer and 1 output layer.

Feedforward Propagation

Figure 13: In forward propagation, the activation values of the neurons are calculated using the
inputs and the weights of the connections; and propagated forward to their next neurons.

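In this example, we label the intermediate or "hidden" layer nodes a_0^(2) ⋯ a_n^(2)
and call them "activation units."

If we had one hidden layer, it would look like:

$$[x_0, x_1, x_2, x_3]^T \rightarrow [a_1^{(2)}, a_2^{(2)}, a_3^{(2)}]^T \rightarrow h_\Theta(x)$$

The values for each of the "activation" nodes are obtained as follows:

$$a_1^{(2)} = g\big(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\big)$$
$$a_2^{(2)} = g\big(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\big)$$
$$a_3^{(2)} = g\big(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\big)$$
$$h_\Theta(x) = a_1^{(3)} = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big)$$

We compute our activation nodes by using a 3×4 matrix of parameters Θ^(1). We
apply each row of the parameters to our inputs to obtain the value for one
activation node. Our hypothesis output is the logistic function applied to the sum
of the values of our activation nodes, which have been multiplied by yet another
parameter matrix Θ^(2) containing the weights for our second layer of nodes.
Each layer gets its own matrix of weights, Θ^(j).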
Vectorized Implementation
In this section we'll do a vectorized implementation of the above functions.
Doing this allows us to more elegantly produce interesting and more complex
non-linear hypotheses. We're going to define a new variable z_k^(j) that encompasses
the parameters inside our g function. In our previous example, if we replace all the
parameters by the variable z, we get:

$$a_1^{(2)} = g(z_1^{(2)}), \qquad a_2^{(2)} = g(z_2^{(2)}), \qquad a_3^{(2)} = g(z_3^{(2)})$$

In other words, for layer j = 2 and node k, the variable z will be:

$$z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n$$

The vector representation of x and z^(j) is:

$$x = [x_0, x_1, \ldots, x_n]^T, \qquad z^{(j)} = [z_1^{(j)}, z_2^{(j)}, \ldots, z_n^{(j)}]^T$$

Setting x = a^(1), we can rewrite the equation as:

$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$

Now we can get a vector of our activation nodes for layer j as follows:

$$a^{(j)} = g(z^{(j)})$$

Here the function g is applied element-wise to the vector z^(j).

We can then add a bias unit (equal to 1) to layer j after we have computed
a^(j). This will be element a_0^(j) and will be equal to 1. To compute our final
hypothesis, let's first compute another z vector:

$$z^{(j+1)} = \Theta^{(j)} a^{(j)}$$

We get this final z vector by multiplying the next theta matrix after Θ^(j−1)
with the values of all the activation nodes we just got. This last theta matrix Θ^(j)
will have only one row, which is multiplied by one column a^(j), so that our result
is a single number. We then get our final result with:

$$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$$
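
The layer-by-layer computation z = Θa, a = g(z) can be sketched in plain Python as below. The weights are hand-picked so that the small network computes the logical XNOR of two binary inputs. They are an illustrative assumption and unrelated to the project's trained weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Return the activations of every layer; thetas[l] maps layer l+1 to layer l+2."""
    a = list(x)
    activations = []
    for theta in thetas:
        a = [1.0] + a                       # add the bias unit a_0 = 1
        activations.append(a)
        z = [sum(w * v for w, v in zip(row, a)) for row in theta]
        a = [sigmoid(zk) for zk in z]       # g applied element-wise
    activations.append(a)                   # output layer, no bias unit
    return activations

# Hypothetical weights: hidden units compute AND and NOR, the output unit computes OR
theta1 = [[-30.0, 20.0, 20.0],
          [10.0, -20.0, -20.0]]
theta2 = [[-10.0, 20.0, 20.0]]

outputs = {inputs: forward([theta1, theta2], inputs)[-1][0]
           for inputs in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]}
```

Each row of a theta matrix produces one activation in the next layer, which matches the "one row per activation node" description above.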

Multiple output units: One-vs-all
In this section we discuss one-vs-all classification with multiple output units.

Figure 14: To classify data into multiple classes, we let our hypothesis function return a vector of
values. Here we wanted to classify our data into one of four categories.

A training set is in the form of (x^(1), y^(1)), (x^(2), y^(2)), ...,
(x^(m), y^(m)).

In the above figure, we have hΘ(x) ∈ R^4, so the ANN has to identify 4 possible
outcomes. So the output vector from the neural network can be:

$$h_\Theta(x) \approx [1, 0, 0, 0]^T, \quad [0, 1, 0, 0]^T, \quad [0, 0, 1, 0]^T, \quad \text{or} \quad [0, 0, 0, 1]^T$$

Each y^(i) is one of these vectors and represents a different case. The inner layers
each provide us with some new information which leads to our final hypothesis
function. The setup looks like:

$$x \rightarrow a^{(2)} \rightarrow a^{(3)} \rightarrow \cdots \rightarrow h_\Theta(x)$$

In our project, the ANN has to decide among multiple choices (10 to be
specific). To address this problem, the output layer will have 10 neurons: the first
output neuron will try to classify whether it is case 1 or not, and likewise the rest
of the neurons will try to classify their corresponding cases. To get a more precise
output, each neuron will give the probability for each case.
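
As a small sketch, the label vectors and the final prediction can be written as follows. We assume here that output neuron k stands for digit k. The actual mapping used in the project is not stated, and the probabilities below are invented.

```python
def one_hot(label, num_classes):
    """Encode a class label as a vector with a single 1 in position `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

def predict_class(output):
    """Pick the index of the output neuron with the highest probability."""
    return max(range(len(output)), key=lambda k: output[k])

y_vec = one_hot(3, 10)                       # target vector for the digit 3
# Hypothetical network output for one image
output = [0.02, 0.01, 0.05, 0.91, 0.03, 0.10, 0.01, 0.04, 0.20, 0.02]
digit = predict_class(output)
```

Note that the ten outputs are separate per-class probabilities and do not need to sum to 1.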

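Back-propagation
Cost Function:
Let's first define a few variables that we will need to use:

L = total number of layers in the network
s_l = number of units (not counting the bias unit) in layer l
K = number of output units/classes

Recall that in neural networks, we may have many output nodes. We
denote hΘ(x)_k as the hypothesis that results in the k-th output. Our cost
function for neural networks is going to be a generalization of the one we used for
logistic regression.

Recall that the cost function for regularized logistic regression was:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

For neural networks, it is going to be slightly more complicated:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})_k\big) + (1 - y_k^{(i)}) \log\big(1 - h_\Theta(x^{(i)})_k\big) \Big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2$$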

We have added a few nested summations to account for our multiple output nodes.
In the first part of the equation, before the square brackets, we have an
additional nested summation that loops through the number of output nodes.

In the regularization part, after the square brackets, we must account for
multiple theta matrices. The number of columns in our current theta matrix is
equal to the number of nodes in our current layer (including the bias unit). The
number of rows in our current theta matrix is equal to the number of nodes in the
next layer (excluding the bias unit). As before with logistic regression, we square
every term.
Algorithm:
"Back-propagation" is neural-network terminology for the algorithm that computes
the gradient of our cost function, so that we can minimize it just as we did with
gradient descent in logistic and linear regression. Our goal is to compute: min_Θ J(Θ)
That is, we want to minimize our cost function J using an optimal set of
parameters in theta. In this section we'll look at the equations we use to compute
the partial derivative of J(Θ):

$$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)$$

Figure 15: In backpropagation, the error values are propagated backwards, starting from the
output, until each neuron has an associated error value which roughly represents its
contribution to the original output.

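Given training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}

Set Δ_{i,j}^(l) := 0 for all (l, i, j) (hence we end up having a matrix full of zeros
for each layer).

For training example t = 1 to m:

1. Set a^(1) := x^(t)

2. Perform forward propagation to compute a^(l) for l = 2, 3, …, L

3. Using y^(t), compute δ^(L) = a^(L) − y^(t)
where L is our total number of layers and a^(L) is the vector of outputs of the
activation units for the last layer. So our "error values" for the last layer are
simply the differences between our actual results in the last layer and the correct
outputs in y. To get the delta values of the layers before the last layer, we can
use an equation that steps us back from right to left:

4. Compute δ^(L−1), δ^(L−2), …, δ^(2) using

$$\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \odot g'(z^{(l)})$$

The delta values of layer l are calculated by multiplying the delta values in
the next layer with the theta matrix of layer l. We then element-wise multiply (⊙)
that with a function called g', which is the derivative of the activation function g
evaluated with the input values given by z^(l).
The g-prime derivative terms can also be written out as:

$$g'(z^{(l)}) = a^{(l)} \odot (1 - a^{(l)})$$

5. Update the accumulator:

$$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$

or with vectorization,

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$$

Hence we update our new Δ matrix.

The capital-delta matrix Δ is used as an "accumulator" to add up our values as we
go along and eventually compute our partial derivative. Thus we get:

$$D_{i,j}^{(l)} := \frac{1}{m} \big( \Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)} \big) \quad \text{if } j \neq 0$$
$$D_{i,j}^{(l)} := \frac{1}{m} \Delta_{i,j}^{(l)} \quad \text{if } j = 0$$

where the matrix D holds the partial derivatives: ∂/∂Θ_{i,j}^(l) J(Θ) = D_{i,j}^(l).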
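
The steps above can be sketched for a network with a single hidden layer as follows. This is an illustrative plain-Python version with our own function names, assuming sigmoid activations and the cross-entropy cost with L2 regularization. It is not the project's actual implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta1, theta2, x):
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, a1))) for row in theta1]
    a3 = [sigmoid(sum(w * v for w, v in zip(row, a2))) for row in theta2]
    return a1, a2, a3

def cost(theta1, theta2, X, Y, lam):
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        a3 = forward(theta1, theta2, x)[2]
        total += sum(-yk * math.log(hk) - (1 - yk) * math.log(1 - hk)
                     for yk, hk in zip(y, a3))
    reg = sum(w * w for theta in (theta1, theta2) for row in theta for w in row[1:])
    return total / m + lam / (2 * m) * reg

def backprop(theta1, theta2, X, Y, lam):
    """Return the gradient matrices D^(1), D^(2) of the regularized cost."""
    m = len(X)
    acc1 = [[0.0] * len(row) for row in theta1]   # accumulator Delta^(1)
    acc2 = [[0.0] * len(row) for row in theta2]   # accumulator Delta^(2)
    for x, y in zip(X, Y):
        a1, a2, a3 = forward(theta1, theta2, x)
        d3 = [hk - yk for hk, yk in zip(a3, y)]   # delta^(3) = a^(3) - y
        # delta^(2) = (Theta^(2))^T delta^(3) .* a^(2) .* (1 - a^(2)), bias entry dropped
        d2 = [sum(theta2[k][j] * d3[k] for k in range(len(d3))) * a2[j] * (1 - a2[j])
              for j in range(1, len(a2))]
        for i, di in enumerate(d2):
            for j, aj in enumerate(a1):
                acc1[i][j] += di * aj
        for i, di in enumerate(d3):
            for j, aj in enumerate(a2):
                acc2[i][j] += di * aj

    def finish(acc, theta):
        # D = (Delta + lambda * Theta) / m, without the penalty for the bias column j = 0
        return [[(acc[i][j] + (lam * theta[i][j] if j > 0 else 0.0)) / m
                 for j in range(len(theta[i]))] for i in range(len(theta))]

    return finish(acc1, theta1), finish(acc2, theta2)
```

A common way to gain confidence in such code is gradient checking: perturb one weight by a small ε, recompute `cost`, and compare the finite difference with the corresponding entry returned by `backprop`.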

Putting It Together
First, we have to pick a network architecture: choose the layout of our neural
network, including how many hidden units in each layer and how many layers in total
we want to have.

- Number of input units = dimension of features x^(i)
- Number of output units = number of classes
- Number of hidden units per layer = usually the more the better (must balance
with the cost of computation, as it increases with more hidden units)
- Defaults: 1 hidden layer. If we have more than 1 hidden layer, then it is
recommended that we have the same number of units in every hidden layer.

Training a Neural Network

1. Randomly initialize the weights.
2. Implement forward propagation to get ℎ𝛩 (x(i)) for any x(i).
3. Implement the cost function.
4. Implement back-propagation to compute partial derivatives.
5. Use gradient descent or a built-in optimization function to minimize the cost
function with the weights in theta

Part III

Working With Data

Data Source - MNIST

The first thing we'll need is a data set to learn from - a training data set.
We'll use the MNIST data set, which contains 70,000 scanned images of
handwritten digits (60,000 for training and 10,000 for testing), together with their
correct classifications. The name 'MNIST' comes from the fact that it is a modified
subset of two data sets collected by NIST, the United States' National Institute of
Standards and Technology.

Here are a few images from MNIST:

Figure 16: A few images from the MNIST dataset, 20 images of each digit. Note: The original
images were white-on-black. They have been inverted to show here.

Each of these images is an 8-bit grayscale image, 28 by 28 pixels in size.

The dataset can be broadly divided into two parts:

1. The first part contains 60,000 images to be used as training data; about 6K
training examples of each digit from 0 to 9. The images are scanned
handwriting samples from 250 people, half of whom were US Census
Bureau employees, and remaining half were high school students.
2. The second part of the MNIST data set is 10,000 images to be used as test
data; about 1K testing examples of each digit from 0 to 9. We'll use the test
data to evaluate how well our neural network has learned to recognize
digits. To make this a good test of performance, the test data was taken
from a different set of 250 people than the original training data (a group
split between Census Bureau employees and high school students).

Inputs and outputs of NN

Figure 17: Each image is 28 by 28 pixels (= a total of 784 pixels) which is fed into the neural
network. So the input layer of the NN consists of 784 (one for each pixel) + 1 (for bias) neurons.

There are 60,000 training examples in the MNIST dataset, where each training
example is a 28 pixel by 28 pixel grayscale image of a digit. Each pixel is
represented by a floating point number indicating the grayscale intensity at that
location. The 28 by 28 grid of pixels is "unrolled" into a 784-dimensional vector.
Each of these training examples becomes a single row in our data matrix X. This
gives us a 60,000 by 784 matrix X where every row is a training example for a
handwritten digit image:

The second part of the training set is a 60,000-dimensional vector y that contains
labels for the training set.

Feature Scaling
We can speed up gradient descent (or other optimization algorithm) by
having each of our input values in roughly the same range. This is because θ (the
weights) will descend quickly on small ranges and slowly on large ranges, and so
will oscillate inefficiently down to the optimum when the variables are very
uneven.
The way to prevent this is to modify the ranges of our input variables so
that they are all roughly the same. One way to achieve this is feature scaling (it is
also known as data normalization). It involves dividing the input values by the
range (i.e. the maximum value minus the minimum value) of the input variable,
resulting in a new range of just 1.

Figure 18: The gradients (the path of gradient is drawn in red) could take a long time and go
back and forth to find the optimal solution. Instead if we scaled our feature, the contour of the
cost function might look like circles; then the gradient can take a much more straight path and
achieve the optimal point much faster.

In the MNIST dataset, all the pixels were represented by their grayscale
values (0-255). So, we divided each pixel value by 255 to get the results within 0-1.
This dramatically reduces the number of iterations required for training.
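
A minimal sketch of this scaling is shown below. `scale_pixels` mirrors the divide-by-255 step, and `scale_by_range` is the general divide-by-range form described above. Both function names and the sample values are our own.

```python
def scale_pixels(image, max_value=255.0):
    """Map 8-bit grayscale values (0-255) into the interval [0, 1]."""
    return [p / max_value for p in image]

def scale_by_range(values):
    """Divide every value by (max - min), so that the new range is 1."""
    span = max(values) - min(values)
    if span == 0:
        return list(values)          # constant feature: nothing to scale
    return [v / span for v in values]

pixels = [0, 51, 255]                # made-up sample pixels
scaled_pixels = scale_pixels(pixels)
scaled_feature = scale_by_range([10.0, 20.0, 30.0])
```

For MNIST pixels the minimum is 0 and the maximum is 255, so dividing by 255 and dividing by the range give the same result.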

Training

Figure 19: Our neural network with 3 layers – an input layer, a hidden layer and an output layer.

We will implement a neural network to recognize handwritten digits using
the MNIST training set. The neural network will be able to take the MNIST images for
training and form complex non-linear hypotheses to fit the data, and achieve good
accuracy in recognizing the test set images.
We will implement a neural network with 3 layers: an input layer, a hidden layer
and an output layer:
1. The input layer consists of (784 + 1) input neurons, one for taking in each
pixel of an image. The one extra neuron is for the bias.
2. The output layer consists of 10 neurons, one each for predicting the
probability (values between 0 and 1) of the image being one of the ten
digits (from 0 to 9). The prediction from the neural network will be the digit
that has the highest probability.
3. The hidden layer consists of 25 (+ 1, for bias) neurons. This number of
hidden units is good for achieving decent accuracy in recognizing test set
images, while giving good performance on average-specced personal
computers. Later on, we will carry out model selection using validation set
to choose the number of hidden neurons, in addition to choosing the other
hyper-parameters of the neural network.
Feedforward Propagation
We will implement feedforward propagation for the neural network.
The grayscale values of the pixels of each image are fed to the input layer of
the neural network. Then, the outputs of all the input neurons are fed as inputs to
the hidden neurons, and the outputs of the hidden neurons are fed as inputs to the
neurons of the output layer. Finally, the output of each output neuron is the
probability of the image being a particular digit.

Mathematically speaking, the NN computes the output hΘ(x^(i)) for every
training example i and returns the associated predictions. To classify data into
multiple classes (from 0 to 9), we let our hypothesis function return a 10-element
vector of decimal values between 0 and 1. Similar to the one-vs-all classification
strategy, the prediction from the neural network will be the label that has the
largest output.
Random Initialization of Weights
When training neural networks, it is important to randomly initialize the
parameters (the weights) for symmetry breaking. One effective strategy for
random initialization is to randomly select values for the weights uniformly in the
range [-ϵ, ϵ].
One effective strategy for choosing ϵ is to base it on the number of units in
the network. A good choice of ϵ is ϵ = √6 / √(L_in + L_out), where L_in = s_l and
L_out = s_(l+1) are the number of units in the layers adjacent to Θ^(l).
This range of values ensures that the parameters are kept small and makes the
learning more efficient.
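
As a sketch, the initialization can be written as below, assuming ϵ = √6 / √(L_in + L_out) and one extra column per matrix for the bias weights. The fixed seed and the helper names are illustrative choices.

```python
import math
import random

def init_epsilon(l_in, l_out):
    """epsilon = sqrt(6) / sqrt(L_in + L_out)"""
    return math.sqrt(6) / math.sqrt(l_in + l_out)

def random_weights(l_in, l_out, rng):
    """Return an l_out x (l_in + 1) matrix with entries drawn uniformly from [-eps, eps]."""
    eps = init_epsilon(l_in, l_out)
    return [[rng.uniform(-eps, eps) for _ in range(l_in + 1)] for _ in range(l_out)]

rng = random.Random(0)                    # fixed seed only to make the sketch repeatable
theta1 = random_weights(784, 25, rng)     # input layer -> hidden layer: 25 x 785
theta2 = random_weights(25, 10, rng)      # hidden layer -> output layer: 10 x 26
```

For the 784-input, 25-hidden-unit layer this gives ϵ of roughly 0.086, so all initial weights are small and distinct, which is what breaks the symmetry between hidden units.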

Regularized Cost Function

We used the cross-entropy error function as the cost function, with L2
regularization on the weights to prevent the problem of overfitting. We will
choose the best value of the regularization parameter λ using the validation set.
Note that we should not be regularizing the terms that correspond to the bias.

Back-Propagation
We will implement the backpropagation algorithm to train the NN and minimize
the cost. The intuition behind the back-propagation algorithm is as follows:
1. Given a training example (x^(t), y^(t)), we will first run a "forward pass" to
compute all the activations throughout the network, including the output
value of the hypothesis hΘ(x).
2. Then, for each node j in layer l, we would compute an "error term" δ_j^(l) that
measures how much that node was "responsible" for any errors in our
output.
3. For an output node, we can directly measure the difference between the
network's activation and the true target value, and use that to define δ_j^(3)
(since layer 3 is the output layer).
4. For the hidden units, we will compute δ_j^(l) based on a weighted average of
the error terms of the nodes in layer (l + 1).

Figure 20: Visualizing 25 hidden neurons after training the NN with the MNIST dataset.

Model Selection using Validation Set
Just because a learning algorithm fits a training set well, that does not mean
it is a good hypothesis. It could overfit, and as a result our predictions on the
test set would be poor. The error of our hypothesis as measured on the data set with
which we trained the parameters will typically be lower than the error on any other
data set.
Given many models with different hyperparameters, we can use a
systematic approach to identify the 'best' function. So, we will define a dataset to
"test" the model in the training phase (i.e., the validation dataset), in order to limit
problems like overfitting and others. For our project, we followed the holdout
method of cross-validation. In the holdout method, we randomly assign data
points to two sets d0 and d1, usually called the training set and the validation set,
respectively. The size of each of the sets is arbitrary although typically the
validation set is smaller than the training set. We then train on d0 and validate the
model on d1. In typical cross-validation, multiple runs are aggregated together; in
contrast, the holdout method, in isolation, involves a single run.
We divided the training set of the MNIST dataset containing 60,000 images
into two parts: 70% training set (for training the different models), and the
remaining 30% is the validation set (for testing the different models). Then there is
the test set containing 10,000 images of the MNIST database for testing the best
chosen model on unseen data.
The steps we followed to choose the best suited value of the hyper-parameters
are:
1. Optimize the parameters in Θ using the training set for each value of the
hyper-parameters.
2. Find the value of the hyper-parameter with the least error (misclassification
error) using the validation set.
3. Use this value of the hyper-parameters to form the final model and check
the error of this model using the test set (unseen data). This way, the hyper-
parameters have not been trained using the test set.
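
The holdout procedure above can be sketched as follows. The 70/30 split mirrors the text. The `validation_error` lookup table stands in for "train a model with this λ and measure its validation error", and its numbers are invented for illustration, not results from the project.

```python
import random

def holdout_split(n_examples, train_fraction, rng):
    """Randomly assign example indices to a training set and a validation set."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = round(train_fraction * n_examples)
    return indices[:cut], indices[cut:]

def select_hyperparameter(candidates, validation_error):
    """Return the candidate value with the lowest validation error."""
    return min(candidates, key=validation_error)

train_idx, val_idx = holdout_split(60000, 0.7, random.Random(0))

# Made-up validation errors, used only to demonstrate the selection step
fake_errors = {0.01: 0.06, 0.1: 0.04, 1.0: 0.05}
best_lambda = select_hyperparameter(list(fake_errors), lambda lam: fake_errors[lam])
```

The test set plays no part in either function, so the selected hyper-parameter is never tuned on the data used for the final evaluation.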
The two hyper-parameters we are concerned with in our neural network model
are the regularization parameter λ and the number of neurons in the hidden layer:

Regularization parameter λ
We checked with these values of the regularization parameter λ:
[0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10]

Figure 21: Checking the different models of the neural network using the validation set, we get
the best accuracy at λ = 0.1.

The training and the validation error we got with different values of the
regularization parameter (lambda) are:

So, we chose the value of the regularization parameter to be 0.1 for our final
model.

Number of hidden neurons

Deciding the number of neurons in the hidden layer is a very important part
of deciding our overall neural network architecture. Though the hidden layer does
not directly interact with the external environment, it has a tremendous influence
on the final output.
Using too few neurons in the hidden layer will result in underfitting.
Underfitting occurs when there are too few neurons in the hidden layer to
adequately detect the signals in a complicated data set. Too many neurons in the
hidden layer may result in overfitting.
We applied the validation set to select the number of hidden neurons.

Figure 22: We gain more accuracy if we increase the number of hidden neurons, but then the
accuracy decreases at some point due to overfitting the training set.

As you can see, we gain more accuracy if we increase the number of hidden
neurons, but then the accuracy decreases at some point (results may differ a bit
due to random initialization of weights). As we increase the number of neurons,
our model will be able to capture more features, but if we capture too many
features, then we end up overfitting our model to the training data and it won't do
well with unseen data.
We found that we can achieve high accuracy on the validation set with 200 hidden
neurons.

Note:
1. There is another hyper-parameter, the learning rate (alpha) of the gradient
descent learning algorithm. With gradient descent we can achieve good
performance at alpha = 0.5, but we are using fmincg(), an optimized
learning algorithm for working with large datasets like the MNIST dataset.
This algorithm doesn’t need an explicit learning rate.
2. We didn’t use K-Fold Cross-Validation. One of the main reasons for using
cross-validation instead of the conventional validation (holdout method) is
that there is not enough data available to partition it into separate
training and test sets. In that case, cross-validation can also help avoid
overfitting the function to the validation set. But we are using the MNIST
dataset, where enough data is already available, with separate data for
training and testing.
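
To make the role of the learning rate in note 1 concrete, here is a minimal sketch of plain gradient descent on a toy one-parameter cost. The cost function and the names used here are assumptions chosen for illustration only; they are not our network's cost function or our training code.

```python
def gradient_descent(gradient, start, alpha, iterations):
    """Repeat the update w := w - alpha * dJ/dw a fixed number of times."""
    w = start
    for _ in range(iterations):
        w = w - alpha * gradient(w)
    return w


def toy_gradient(w):
    # Toy cost J(w) = 0.5 * (w - 3)^2, so dJ/dw = w - 3 (minimum at w = 3).
    return w - 3.0


# With alpha = 0.5 the distance to the minimum is halved at every step.
w_final = gradient_descent(toy_gradient, start=0.0, alpha=0.5, iterations=50)
```

A learning rate that is too small makes this loop converge slowly, and one that is too large makes it overshoot, which is why alpha has to be tuned when plain gradient descent is used.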

Testing
We used the validation set and found that the best value of the regularization
parameter is 0.1. We also found that increasing the number of neurons in the
hidden layer lets us achieve higher accuracies on the test data.
We used the cross-entropy loss as the cost function; backpropagation to
compute the gradients used for updating the weights; and the fmincg() algorithm
as the cost function optimizer.
In each iteration (epoch), we trained using all the 60,000 images (batch-size)
of the training set. We measured the training and test set accuracy as the
percentage of images in the set that were correctly recognized by our NN; the
misclassification error is 100% minus this accuracy.
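
As a minimal sketch of how the two measures relate (the function names are illustrative, not taken from our code), accuracy and misclassification error can be computed from the predicted and the true labels like this:

```python
def accuracy_percent(predicted, actual):
    """Percentage of examples whose predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def misclassification_error_percent(predicted, actual):
    """Percentage of examples that were labelled wrongly."""
    return 100.0 - accuracy_percent(predicted, actual)


# Toy example: three of the four digits are recognized correctly.
toy_accuracy = accuracy_percent([7, 2, 1, 0], [7, 2, 1, 4])
```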
We tried different combinations of the number of hidden units and the
number of iterations and recorded the accuracies on the dataset.

Number of hidden   Number of iterations   Training Set    Test Set
neurons            (epochs)               Accuracy (%)    Accuracy (%)
25                 30                     89.9533         90.18
50                 30                     91.54           91.57
100                30                     90.29           90.94
200                100                    96.261667       96.07
200                1000                   100             98.29

We achieved maximum accuracy of 98.29% (error = 1.71%) on the test set
of the MNIST dataset.

Note:
1. Due to random initialization of the weights of the neural network, we might get
slightly different values when training and testing again.
2. If we train using only the first 1000 images of the MNIST Training Set for 1000
iterations, we can achieve an MNIST Test Set accuracy of 87.37%. This is really
interesting, as we can achieve an accuracy close to 90% using just a small part
of the MNIST Training Set. This model was trained using Gradient Descent with
learning rate = 0.5, regularisation parameter = 0.1 and no. of hidden units = 25.
3. We haven’t used any ready-made toolkit for achieving these results.

 Part IV

Applying our Neural Network on External Image

Noise Removal
• Noise present in the images can be removed before or after segmentation.
Because it is more practical to do it for all the digits at once, we apply
noise removal to the original picture before segmentation.
• First of all, we remove what we call ‘Salt and Pepper noise’ using a Median
Filter.
• Next, we remove Gaussian Noise using a Linear Filter.
• Then, we apply thresholding to convert the grayscale image into a binary
image.
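
Two of the steps above, median filtering and thresholding, can be sketched in a few lines. This is a simplified illustration, not our actual code: the 3x3 window, the border handling and the threshold level of 128 are assumptions, and the image is represented as a plain list of rows of grayscale values.

```python
from statistics import median


def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    At the borders only the neighbours that exist are used (an assumption
    made to keep the sketch short).
    """
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            result[r][c] = median(window)
    return result


def threshold(image, level=128):
    """Convert a grayscale image to binary: 1 where pixel >= level, else 0."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in image]
```

An isolated bright or dark pixel is an outlier within its window, so the median discards it, which is why this filter suits salt and pepper noise.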

Figure 23: Input image with multiple digits before and after segmentation

Segmentation
• We take an image containing more than one digit.
• We detect the individual objects and use ‘regionprops’ to find the bounding
box around each such object.
• The next step returns a structure whose elements are the digits within the
bounding boxes.
• The bounding box property contains four elements: the x and y coordinates
of the starting point of the box around each object, and the width and the
height of the box.
• If we want to see the segmented digits within the image, we can give a
colour to the box around the objects.
• Next, we store each individual object in a cell array, so that we can extract
each digit whenever we need it.
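
The idea behind these steps can be sketched without any toolbox. The code below is only a stand-in for what ‘regionprops’ does for us, under stated assumptions: the input is a binary image given as a list of rows (1 = object, 0 = background), objects are 8-connected, and each box is reported as (x, y, width, height). The function names are illustrative.

```python
from collections import deque


def bounding_boxes(binary):
    """Return one (x, y, width, height) box per 8-connected object."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over one object, tracking its extent.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return sorted(boxes)  # sorted by x, i.e. left to right


def crop(image, box):
    """Cut the region described by an (x, y, width, height) box out of the image."""
    x, y, width, height = box
    return [row[x:x + width] for row in image[y:y + height]]
```

Collecting `crop(image, box)` for every box gives a list of digit images that plays the role of the cell array mentioned above.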

Normalization
• In our project for digit recognition, we apply normalization in the sense that
the image to be fed to the neural network is reduced to the size that our
neural network will accept.
• Hence, we normalize our image to 28x28 pixels. We have tried our best to
maintain the aspect ratio of the image, or else the accuracy of our results
could have been lower.
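
One possible way to resize while keeping the aspect ratio is sketched below. This is not necessarily how our code does it; it assumes nearest-neighbour sampling, a background value of 0, and centring the digit on the 28x28 canvas, and the function name is illustrative.

```python
def resize_keep_aspect(image, size=28):
    """Scale the image so its longer side equals `size`, then pad to size x size.

    Nearest-neighbour sampling is used, and the padding value 0 is assumed to
    be the background colour.
    """
    rows, cols = len(image), len(image[0])
    scale = size / max(rows, cols)
    new_rows = max(1, round(rows * scale))
    new_cols = max(1, round(cols * scale))
    # Pick, for every target pixel, the nearest source pixel.
    resized = [
        [image[min(rows - 1, int(r / scale))][min(cols - 1, int(c / scale))]
         for c in range(new_cols)]
        for r in range(new_rows)
    ]
    # Centre the resized digit on an empty size x size canvas.
    top = (size - new_rows) // 2
    left = (size - new_cols) // 2
    canvas = [[0] * size for _ in range(size)]
    for r in range(new_rows):
        for c in range(new_cols):
            canvas[top + r][left + c] = resized[r][c]
    return canvas
```

Scaling both sides by the same factor keeps a tall, thin digit such as "1" from being stretched into a square, which would make it look unlike the training images.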

Feeding To Our Neural Network

• The image is stored in the form of a matrix with grayscale pixel values
between 0 and 255. The grayscale values are divided by 255 to get
fractional values between 0 and 1. This feature scaling is performed so that
we can fit the function in fewer iterations while training.
• The network has 784 input neurons, hence we reshape the matrix from a
28x28 one to a 1x784 one for easier input feeding.
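
The scaling and reshaping described above amount to very little code. In this minimal sketch the image is a list of 28 rows of 28 grayscale values and is flattened row by row; that ordering is an assumption, and whatever order is used must match the one used for the training images.

```python
def to_input_vector(image):
    """Scale pixel values from 0..255 to 0..1 and flatten the rows into one list."""
    return [pixel / 255 for row in image for pixel in row]


# Example: an all-white 28x28 image becomes 784 inputs equal to 1.0.
white = [[255] * 28 for _ in range(28)]
inputs = to_input_vector(white)
```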
Acknowledgement
Effective noise removal from external images is a very complex process and
requires in-depth knowledge of this domain. Without good noise
removal, it is impossible to achieve a good success rate in detecting
digits from external images.

As we prioritised getting good results on the test data set over the processing
of external images, the results on external images are not good and are
inconsistent.

The segmentation algorithm can also be vastly improved to identify individual
objects from all types of images, without false positives.
