Lesson 7.0 Supervised Learning With Neural Networks
NEURAL NETWORKS
BIOLOGICAL NEURAL NETWORKS
• Artificial neurons come in many types. The simplest type is called a perceptron. Perceptrons were
developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by
Warren McCulloch and Walter Pitts.
• Today, it's more common to use other models of artificial neurons such as the sigmoid neuron. To
understand sigmoid neurons we need to first understand perceptrons.
• A perceptron takes several binary inputs, x1, x2, …, and produces a single binary output. Take a
simple decision: should I attend a festival this weekend? To make my decision I need to know:
• 1. Is the weather good? … x1
• 2. Does my friend want to go? … x2
• 3. Is it near public transportation? … x3
• These questions can be represented as binary variables x1, x2, and x3, taking the value 0 for the
answer no and 1 for yes.
• But some variables are more important than others. We introduce weights w1, w2, and w3 that
indicate how important each factor is in making the decision. For example, if we really hate bad
weather but care less about going with our friend or about public transit, we could pick the
weights 6, 2 and 2.
• To make the decision you need to consider all the factors and their weights. Mathematically this is
modeled as follows.
MATHEMATICAL MODEL
• You can summarize the decision-making process as shown, where the output is the decision: it is
0 if the weighted sum is below a threshold value and 1 if it is above. Recall that this is similar
to the equation of a line (linear models) or a plane (SVM), so you are just separating data points
using the weights that you have created.
• We can simplify the way we describe perceptrons by writing the sum as a dot product and moving
the threshold to the other side of the inequality sign, thus:
• ∑_j w_j x_j ≤ threshold
• w·x + (−threshold) ≤ 0
• w·x + b ≤ 0, where the bias b = −threshold
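• As a minimal sketch, the festival decision above can be written directly in code; the weights 6, 2 and 2 are from the example, while the threshold of 5 (and therefore the bias b = −5) is an assumed value for illustration:

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron rule: output 1 if w.x + b > 0, otherwise 0 (b = -threshold)."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([6, 2, 2])      # weights for: good weather, friend going, near transport
b = -5                       # bias = -threshold (threshold of 5 assumed for illustration)
x = np.array([1, 0, 1])      # good weather, friend not going, near public transport
print(perceptron(x, w, b))   # 6*1 + 2*0 + 2*1 - 5 = 3 > 0, so the output is 1 (go)
```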
• To learn, small changes are made to some weight (or bias) and the corresponding change in the
output is observed to find out whether the results become more accurate. The network then learns
the best combination of weights.
• Perceptron neurons, however, are difficult to control. A small change in a weight or input can
produce a big change in the output and completely flip the outcome (from 0 to 1), making it
difficult for the model to learn gradually.
• Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights
and bias cause only a small change in their output, hence making gradual learning possible.
• Their output is obtained by passing the perceptron's weighted sum w·x + b through a smooth
function: output = σ(w·x + b).
• The function σ is known as the sigmoid or logistic function. Its value ranges between 0 and 1:
a large weighted sum gives an output close to 1, while a strongly negative sum gives an output
close to 0.
• Unlike perceptrons, which handle only binary values (0 or 1), sigmoid neurons produce continuous
outputs. Because σ is smooth, a small change in a weight or the bias produces only a small,
gradual change in the output, rather than the perceptron's abrupt step from 0 to 1, and this is
what makes learning easier.
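• A minimal sketch of a sigmoid neuron, reusing the illustrative festival weights above, shows that a small change in a weight only nudges the output instead of flipping it:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Sigmoid neuron: sigma(w.x + b) instead of a hard 0/1 step."""
    return sigmoid(np.dot(w, x) + b)

w, b = np.array([6.0, 2.0, 2.0]), -5.0
x = np.array([1.0, 0.0, 1.0])
print(sigmoid_neuron(x, w, b))           # about 0.95
print(sigmoid_neuron(x, w + 0.01, b))    # still about 0.95: the output changes only slightly
```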
ACTIVATION FUNCTIONS
• Functions such as the sigmoid are known as activation functions, and recently other improved
functions with specific benefits have been developed.
• The activation function is what makes neural networks more powerful than a linear model, which
only uses w·x + b.
• The sigmoid function is the easiest to explain; however, other activation functions exist, such
as the following (a short code sketch of several of them appears after the list):
1) hyperbolic tan (Tanh) – The Tanh function is a nonlinear activation function that maps input
values to a range between -1 and 1. It's zero-centered, meaning that outputs can be negative,
positive, or zero.
2) Maxout - Maxout is an activation function that selects the maximum value from a set of inputs.
It is often used in combination with dropout for robust and adaptable learning.
3) Softmax - The Softmax function converts a vector of values into a probability distribution, with
the sum of the probabilities equal to 1. This function is typically used in classification problems
for output layers.
4) Rectified linear unit (ReLU) – Popular for CNN learning. It is a simple and nonlinear
activation function that replaces negative input values with zero while retaining positive values
as is. It is computationally efficient and helps address vanishing gradient problems.
5) Leaky rectified linear unit – It is a variant of ReLU that allows a small, non-zero slope for
negative input values, preventing neurons from becoming inactive due to zero gradients.
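A minimal sketch of several of these activation functions (Maxout is omitted because it acts on a group of pre-activations rather than a single value):

```python
import numpy as np

def tanh(z):
    """Zero-centred: squashes the input into the range (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Replaces negative values with zero and keeps positive values as they are."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope (alpha) for negative inputs."""
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    """Turns a vector into a probability distribution that sums to 1."""
    e = np.exp(z - np.max(z))    # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.5, 3.0])
print(tanh(z), relu(z), leaky_relu(z), softmax(z))
```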
LEARNING OPTIMIZATION ALGORITHMS
• Basically, learning calls for adjusting the weights. But how are these weights adjusted?
• For example, if you were to train a model to predict who will pass the course, you can use
attendance (x1) and previous performance (x2) to train the model. Using the architecture shown,
you can start by selecting the weights randomly. In the diagram, a data point for a student who
attended 4 lectures and scored 5 in the last class was predicted as 0.1 (≈ 0) while the correct
answer was 1. The randomly selected weights resulted in a loss, and we can calculate this loss.
• There are many ways of calculating this loss. Common methods are mean squared error (MSE) loss
and binary cross-entropy loss; the former is common where the output is continuous, the latter
where the output is binary. The MSE is the average of the squared differences between the correct
and predicted outputs over all the data points used for training: MSE = (1/n) ∑ (yᵢ − ŷᵢ)².
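• A minimal sketch of both loss functions, applied to the single student data point above (correct answer 1, prediction 0.1):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for 0/1 targets; eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true, y_pred = np.array([1.0]), np.array([0.1])
print(mse(y_true, y_pred))                    # (1 - 0.1)^2 = 0.81
print(binary_cross_entropy(y_true, y_pred))   # -log(0.1), roughly 2.30
```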
• Gradient descent is a learning optimization algorithm that can help us find the weights that
minimize this loss. It uses the slope of the loss function to find the direction of descent and
then takes a small step in that direction in each iteration. This process continues until it
reaches the minimum value of the function.
• The following is an illustration of the algorithm (a code sketch follows the list):
1. Initialize random weights W.
2. Repeat until the loss reaches its minimum value:
3. Calculate the slope of the loss with respect to each weight (the gradient).
4. Update the weights by a small step in the direction that reduces the loss.
5. Return the weights.
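• A minimal sketch of this loop, training a single sigmoid neuron with MSE loss on the course-prediction example; the point (4, 5) → 1 is from the example above, while the other data points, the learning rate and the iteration count are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: [attendance, previous score] -> pass (1) / fail (0).
# Only the point (4, 5) -> 1 comes from the text; the rest is illustrative.
X = np.array([[4.0, 5.0], [1.0, 2.0], [5.0, 6.0], [2.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0               # 1. initialize random weights
lr = 0.1                                     # learning rate (step size)

for _ in range(1000):                        # 2. repeat
    y_hat = sigmoid(X @ w + b)               #    current predictions
    err = y_hat - y
    grad_w = 2 * (err * y_hat * (1 - y_hat)) @ X / len(y)   # 3. slope of the MSE loss w.r.t. w
    grad_b = 2 * np.mean(err * y_hat * (1 - y_hat))         #    ... and w.r.t. b
    w -= lr * grad_w                         # 4. step against the slope
    b -= lr * grad_b

print(w, b, sigmoid(X @ w + b))              # 5. return weights (and check the predictions)
```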
• How fast you descend the slope of the function determines how quickly you learn, and it is
controlled by a parameter known as the learning rate. A small learning rate means that the
algorithm takes small steps in each iteration and thus a long time to converge (find the minimum).
A large learning rate can cause the algorithm to overshoot the minimum and thus fail to converge.
• The Adam optimization algorithm (Adaptive Moment Estimation) offers a smoother way to descend:
it adapts the step size of each weight using running estimates of the gradient and its square.
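• A sketch of the standard Adam update for a single step; the hyperparameter values shown are the commonly used defaults:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: running averages of the gradient (m) and its square (v)
    give each weight its own, smoothly adapted step size."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)             # bias-correct the running averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: start with m = v = 0 (same shape as w) and t = 1, incrementing t every step.
```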
ARTIFICIAL NEURAL NETWORKS / MULTILAYER PERCEPTRONS
• By building a network where the outputs from some perceptrons are used as the inputs of other
perceptrons, we build more complex models which are more accurate. This is called an Artificial
Neural Network (ANN) or Multilayer Perceptron (MLP).
• In the first layer each perceptron makes a decision, which acts as input for the perceptrons in
the next layer. In the second layer, the first layer's decisions are weighted and combined to make
further decisions, and the last layer weighs those decisions to make the final decision. Each
layer therefore makes a more complex decision than the previous layer.
• The leftmost layer's neurons are called input neurons, the rightmost layer's neurons output
neurons, and the layers in between hidden layers.
• A network can have a single hidden layer or multiple hidden layers. These are known as shallow and
deep neural networks respectively.
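• A minimal sketch of a forward pass through such a network with one hidden layer; the layer sizes and the random, untrained weights are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers):
    """Forward pass: each layer applies its weights, bias and activation,
    and its output becomes the input of the next layer."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
# 3 input neurons -> 4 hidden neurons -> 1 output neuron.
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(1, 4)), np.zeros(1))]
print(mlp_forward(np.array([1.0, 0.0, 1.0]), layers))
```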
APPLICATIONS - IMAGE PROCESSING
• When trying to determine whether a handwritten image depicts a
number "8" or not, we start by encoding the intensities of the
image pixels into the input neurons.
• If the image is a 28 by 28 greyscale image, then you will have
784=28×28 pixels which become input neurons, with the
intensities scaled appropriately between 0 and 1.
• The output layer will contain just a single neuron: a high output value indicates the image is
an 8, while a low value indicates it is not.
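• A minimal sketch of how such an image could be encoded as the 784 scaled input values; a random array stands in for a real handwritten digit:

```python
import numpy as np

# Hypothetical 28x28 greyscale image with pixel intensities between 0 and 255.
image = np.random.default_rng(0).integers(0, 256, size=(28, 28))

x = image.reshape(784) / 255.0    # flatten to 784 input neurons, scaled to [0, 1]
print(x.shape, x.min(), x.max())
```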
DESIGN OF THE HIDDEN LAYERS
• While the design of the input and output layers of a neural network is
straightforward, the design process for the hidden layers is not.
• There are, however, guidelines which help people get the behavior they want out of their nets.
For example, when deciding on the number of layers: although deeper networks can be more
accurate, they take more time to train, hence a trade-off is required.
• Similarly when deciding on the number of nodes, we need to understand
the problem and what the nodes are doing. This may call for some
experiments.
• In the handwriting example, when classifying all ten digits rather than just detecting an 8, we
use 10 output nodes instead of 4 because they perform better. To understand why, assume that with
4 output nodes each node is responsible for determining roughly a quarter of the image. When
predicting a number such as 0, each output node would see only its portion of the input image and
use it to determine the number. While it can make the prediction, the process is more accurate if
we use 10 nodes, one per digit.
• However, even 4 neurons in the output layer would still work well if the algorithm finds good
weights.
TYPES OF NEURAL NETWORKS