
2 DeepLearning

The document discusses neural networks and deep learning. It defines key concepts like deep learning, artificial neural networks, machine learning vs deep learning, the neuron, artificial neuron, linear perceptron, feedforward neural networks, layers in neural networks, and provides a TensorFlow code example to create a neural network model. It compares machine learning and deep learning based on factors like data requirements, accuracy, computation power, cognitive ability, hardware needs and time taken.
Copyright
© All Rights Reserved

Neural Network and Deep Learning
Samatrix Consulting Pvt Ltd
Deep Learning
• Deep learning is a branch of machine learning.
• Deep learning uses artificial neural networks to understand the content of
images, natural language, and speech.
• Machine learning, in turn, is a part of artificial intelligence.
• The origins of deep learning can be attributed to Walter Pitts and Warren
McCulloch.
• In 1943, they created a computer model by taking inspiration from neural
networks present in the human brain.
• Deep learning uses artificial neural networks (ANN) to help a machine
learn.
Machine Learning vs Deep Learning
We can compare machine learning and deep learning based on six
important characteristics.
1. Quantity of data required for training
2. High accuracy while avoiding overfitting
3. Computational power
4. Cognitive ability
5. Hardware requirement
6. Time taken
• Quantity of data required for training
• Traditional machine learning algorithms can work with relatively small datasets, whereas deep learning models need a much larger amount of data.
• High accuracy while avoiding overfitting
• With limited training data, traditional machine learning algorithms can be accurate enough, while deep learning models tend to overfit.
• When a large amount of training data is available, deep learning models generally achieve higher accuracy, because they keep improving as the data grows.
• Computational Power
• Because machine learning algorithms typically train on less data, they need less computational power than deep learning algorithms.
• Deep learning algorithms require more computational power to process the data and train the model.
• Cognitive ability
• Cognitive ability here refers to the ability of the algorithm to recognize its own inaccuracies and correct them without outside help.
• Traditional machine learning models have a lower cognitive ability in this sense. They typically rely on features chosen by hand, so to adapt to a change in the training data or to improve the accuracy of the predictions, a programmer has to make the necessary changes and retrain the model.
• Deep learning models, on the other hand, have a higher cognitive ability.
• They learn useful features directly from the data, so they can adapt with far less manual intervention.
• Hardware requirement
• The traditional machine learning models can be trained on low-end systems.
• Deep learning models, on the other hand, usually require high-end machines equipped with GPUs.
• Time taken
• Compared to the machine learning algorithms, the deep learning algorithms
need a longer time to train the models.
The Neuron
• In the previous section, we saw that deep learning uses artificial neural networks to solve complex problems without being explicitly programmed.
• The neural network, or artificial neural network, is inspired by and modeled after biological neural networks.
• The foundational unit of the human brain is the neuron.
• A piece of the brain the size of a grain of rice contains over 10,000 neurons.
• Each of these neurons forms an average of 6,000 connections with other neurons.
• With the help of such a massive biological network, we can experience the
world around us.
• The neuron receives the information from other
neurons.
• It processes this information in a unique way and
then sends the result to other cells.
• The neuron receives the information through
dendrites.
• The strength of each incoming connection determines the weight given to its signal.
• The cell body calculates the total input by summing the incoming signals, each weighted by the strength of its connection.
• This sum is transformed into a new signal that is
propagated along the cell’s axon and sent off to
other neurons.
Artificial Neuron
• This functionality of neurons in our brain can be represented using artificial
neurons.
• An artificial neuron also takes some number of inputs, $x_1, x_2, \dots, x_n$. Each input is multiplied by a specific weight, $w_1, w_2, \dots, w_n$.
• We add the weighted inputs together to produce the logit of the neuron, $z = \sum_{i=1}^{n} w_i x_i$.
• A bias, which is a constant, is also part of the logit, but it is not shown in figure 3.
• We pass the logit to a function $f$ to produce the output $y = f(z)$.
• We transmit the output to other neurons.
• In vector form, we can re-express the output of the neuron as $y = f(\mathbf{x} \cdot \mathbf{w} + b)$, where $b$ is the bias.
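The computation of a single artificial neuron can be sketched in a few lines of plain Python. This is only an illustration: the function name `neuron_output` and the input, weight, and bias values are made up for the example and do not come from the slides.

```python
import math

def neuron_output(x, w, b, f):
    """Return y = f(x . w + b) for one artificial neuron (hypothetical helper)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # the logit, bias included
    return f(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0, 3.0]     # inputs (made-up values)
w = [0.5, -0.25, 0.1]   # weights (made-up values)
b = 0.2                 # bias (made-up value)

y_linear = neuron_output(x, w, b, lambda z: z)   # identity activation
y_sigmoid = neuron_output(x, w, b, sigmoid)      # sigmoid activation
print(y_linear, y_sigmoid)
```

Swapping the function passed as `f` changes the type of neuron without touching the weighted sum.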
Artificial Neuron Architecture
• An artificial neuron can be described in terms of the following parts:
1. Input layer: This layer takes inputs from other neurons or networks.
2. Summation layer: This layer aggregates the signals it receives.
3. Activation layer: This layer takes the aggregated input and returns a value if the input crosses a certain threshold; otherwise, the neuron does not fire.
4. Output layer: This layer might be connected to other neurons or networks. At the end of a network, it acts as the final output layer and is used for predictions.
Linear Perceptron
• The linear perceptron is a simple algorithm that, given an input vector $\mathbf{x}$ of $n$ values $(x_1, x_2, \dots, x_n)$, often called input features, outputs either a 1 (yes) or a 0 (no):
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le 0 \end{cases}$$
• The linear perceptron is used to classify the data into two parts using a linear hyperplane, as shown in figure 4.
• It is also known as a linear binary classifier.
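As a small illustration of the perceptron rule, the sketch below classifies the four inputs of a logical AND. The weights and bias are hand-picked for this example (an assumption, not learned values and not taken from the slides).

```python
def perceptron(x, w, b):
    """Linear perceptron: output 1 if w . x + b > 0, otherwise 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z > 0 else 0

# Hand-picked parameters that place the separating line between (1, 1)
# and the other three points.
w, b = [1.0, 1.0], -1.5

and_outputs = [perceptron(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```

The line $x_1 + x_2 = 1.5$ plays the role of the linear hyperplane that splits the data into two parts.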
Feedforward Neural Network
• Artificial neural networks in which the connections between the neurons do not form a cycle are called feedforward neural networks.
• In these networks, the connections run only in the forward direction, from the input layer through the hidden layers to the output layer, so information flows in one direction only.
• Every feedforward neural network has at least two layers: an input layer and an output layer.
• The feedforward neural network approximates a function: input values are fed in at the input layer, and the final output values are read from the output layer.
• During training, the output values are compared with the label values.
Shallow Feedforward Neural Network
• When a model has only an input layer and an output layer for function approximation, it is called a shallow feedforward neural network or single-layer perceptron.
• We can directly compute the output values using the relationship $y = f(\mathbf{w} \cdot \mathbf{x} + b)$.
• Shallow feedforward neural networks are not useful for approximating nonlinear functions.
• For that, we need hidden layers between the input and output layers.
Deep Feedforward Neural Network
• In a deep feedforward neural network or multilayer perceptron (figure 6), we add one or more hidden layers between the input layer and the output layer so that we can approximate more complex functions.
• In this architecture, every neuron is connected to all the neurons in the next layer and uses an activation function.
• That is why these networks are also called fully connected neural networks.
• With enough hidden neurons and nonlinear activation functions, deep neural networks can approximate a very wide range of linear and nonlinear functions. Hence, they are widely used to solve real-world problems.
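To see why a hidden layer matters, here is a minimal sketch of a forward pass through a tiny fully connected network that computes XOR, a function no single linear perceptron can represent. The helper names and the hand-set weights are assumptions chosen for illustration; a real network would learn its weights from data.

```python
def relu(z):
    return max(0.0, z)

def dense(x, weights, biases, activation):
    """One fully connected layer: every output neuron sees every input."""
    return [activation(sum(w * x_i for w, x_i in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def xor_network(x):
    # Hidden layer: two ReLU neurons (hand-set weights, not learned).
    hidden = dense(x, [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    # Output layer: one linear neuron combining the hidden activations.
    output = dense(hidden, [[1.0, -2.0]], [0.0], lambda z: z)
    return output[0]

xor_outputs = [xor_network(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```

Removing the ReLU from the hidden layer would collapse the two layers into a single linear map, and the XOR pattern could no longer be produced.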
Layers in Feedforward Neural Network
The generic neural network architecture consists of three types of
layers:

• An Input Layer
• An Output Layer
• A number of hidden layers
Input Layer
• The very first layer of the feedforward neural network is known as the
input layer.
• This layer is used to feed data into the network.
• No activation function is applied on the input layer.
• Its sole purpose is to get the data into the system.
• Ideally, the number of neurons in the input layer should be equal to the number of features.
• For example, if our model uses four input variables to predict one
response variable, we should use four neurons in the input layer.
Output Layer
• The very last layer of the feedforward neural network is known as the
output layer.
• This layer is used to output the prediction.
• Based on the nature of the problem, we decide on the number of
neurons in the output layer.
• For regression, we need to predict a single value, hence, we require
only one neuron in the output layer.
• For binary classification, we need either a single neuron with a sigmoid activation or two neurons with a softmax activation in the output layer.
• For multi-class classification with five different classes, we need five
neurons in the output layer.
Hidden Layer
• In the feedforward neural network, the hidden layer is located
between the input and output layers.
• The hidden layers are responsible for the nonlinear transformation of
the input that has entered into the network.
TensorFlow Code for Neural Network
• Sequential API is the simplest way to create a deep neural network
model in TensorFlow 2.0.
• A Sequential() model creates a stack of neural network layers.
• The following code fragment defines a single layer of 10 artificial neurons that expects 784 input variables (features).
• Our neural network is dense, which means that each neuron in a layer
is connected to all the neurons located in the previous layer, and to all
the neurons in the following layer:
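import tensorflow as tf

NB_CLASSES = 10   # number of output classes, one neuron per class
RESHAPED = 784    # number of input features per example

# Build a model with a single dense (fully connected) softmax layer.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(NB_CLASSES,
                                input_shape=(RESHAPED,),
                                kernel_initializer='zeros',
                                name='dense_layer',
                                activation='softmax'))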
We can initialize the weights of each neuron using the kernel_initializer parameter, with values such as:

• random_uniform: Weights are initialized to uniformly random values in the range -0.05 to 0.05.
• random_normal: Weights are initialized from a normal distribution with zero mean and a standard deviation of 0.05.
• zeros: All weights are initialized to zero.
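What these three initialization schemes mean can be sketched in plain Python, without TensorFlow. This is not how Keras implements them; `init_weights` is a hypothetical helper that simply draws a 784 x 10 weight matrix according to the descriptions above.

```python
import random

def init_weights(n_inputs, n_neurons, scheme, rng):
    """Return an n_inputs x n_neurons weight matrix (hypothetical helper)."""
    if scheme == "random_uniform":
        draw = lambda: rng.uniform(-0.05, 0.05)   # uniform in [-0.05, 0.05]
    elif scheme == "random_normal":
        draw = lambda: rng.gauss(0.0, 0.05)       # mean 0, std deviation 0.05
    elif scheme == "zeros":
        draw = lambda: 0.0                        # every weight is zero
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [[draw() for _ in range(n_neurons)] for _ in range(n_inputs)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
uniform_w = init_weights(784, 10, "random_uniform", rng)
normal_w = init_weights(784, 10, "random_normal", rng)
zero_w = init_weights(784, 10, "zeros", rng)
print(len(uniform_w), len(uniform_w[0]))  # 784 10
```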
Limitations of Linear Neurons
• Linear neurons are easy to compute, but they have serious limitations.
• Any feedforward neural network built only from linear neurons can be expressed as a network without any hidden layer.
• Real-world problems are very complex and can rarely be solved by a linear model.
• To solve complex real-world problems, we need to build nonlinearity into our model.
• That can be achieved using hidden layers whose neurons apply a nonlinear function.
Sigmoid Neuron
• The sigmoid neurons use the function
$$f(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \mathbf{w}^T \mathbf{x} + b$$
• It means that if the logit is very negative, the output of the logistic neuron is close to zero.
• If the logit is very large and positive, the output of the logistic neuron is close to one.
• Between these two extremes, the output follows an S-shaped curve, as shown in figure 7.
• In other words, the sigmoid squashes arbitrary values into the (0, 1) interval, so its output can be interpreted as a probability.
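The squashing behavior of the sigmoid is easy to check numerically. The sketch below evaluates the function at a few arbitrarily chosen logits; note that this naive form can overflow in `math.exp` for extremely negative inputs, which a production implementation would guard against.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Logits chosen only to show the two extremes and the midpoint.
sigmoid_values = {z: sigmoid(z) for z in (-10.0, 0.0, 10.0)}
for z, y in sigmoid_values.items():
    print(z, y)
```

A logit of 0 maps to exactly 0.5, while logits of -10 and 10 land very close to 0 and 1 respectively.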
Tanh Neuron
• Similar to sigmoid neurons, the tanh neurons also use S-shaped
nonlinearity.
• However, the output of tanh neurons ranges from -1 to 1.
• On several occasions, we prefer tanh neurons over sigmoid neurons
because the tanh neuron is zero-centered.
ReLU (Rectified Linear Unit)
• ReLU neuron uses a different kind of nonlinearity.
• It uses the function 𝑓 𝑧 = max(0, 𝑧).
• It results in a characteristic hockey-stick-shaped response.
• ReLU is one of the most popular activation functions.
• It is widely used in solving computer vision problems.
• ReLU zeroes out the negative values, as shown in figure 9.
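The difference between the tanh and ReLU nonlinearities can be seen by evaluating both on the same inputs. This is a minimal sketch using Python's standard `math.tanh`; the sample inputs are arbitrary.

```python
import math

def relu(z):
    """ReLU: max(0, z), which zeroes out negative inputs."""
    return max(0.0, z)

samples = (-2.0, 0.0, 2.0)  # arbitrary logits
tanh_values = [math.tanh(z) for z in samples]
relu_values = [relu(z) for z in samples]
print(tanh_values)
print(relu_values)  # [0.0, 0.0, 2.0]
```

The tanh outputs are symmetric around zero and stay inside (-1, 1), while ReLU passes positive inputs through unchanged.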
SoftMax Output Layer
• Often, we want our output vector to be a probability distribution over a set of mutually exclusive labels.
• An example is recognizing handwritten digits (ten mutually exclusive digits, 0 through 9) from the MNIST dataset using neural networks.
• However, we cannot classify each digit with 100% confidence.
• So, we calculate a probability vector $[p_0, p_1, \dots, p_9]$ over the digits such that $\sum_{i=0}^{9} p_i = 1$.
• We can achieve this using a special output layer which is known as the
softmax layer.
• In the softmax layer, the output of a neuron depends on the output of all
other neurons in the softmax layer.
• We can calculate the output of a particular neuron in the softmax layer by dividing the exponential of its logit by the sum of the exponentials of the logits of all the neurons in the layer: $p_i = e^{z_i} / \sum_{j} e^{z_j}$.
• In the case of a strong prediction, the output from one of the neurons in
the softmax layer will be close to 1 whereas the output from the rest of the
neurons in the layer will be close to 0.
• In the case of a weak prediction, multiple neurons in the softmax layer will
have more or less equal values.
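A strong and a weak prediction can be compared with a small softmax sketch. It uses the standard softmax definition, which exponentiates each logit before normalizing; the logit values are invented for the example, and subtracting the maximum logit is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

strong = softmax([8.0, 1.0, 0.5])  # one logit dominates (made-up values)
weak = softmax([1.0, 1.1, 0.9])    # logits are nearly equal (made-up values)
print(strong)
print(weak)
```

In the first case nearly all of the probability mass lands on one neuron; in the second, the three outputs are all close to one third.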
Activation Functions
• In neural network jargon, Sigmoid, Tanh, ReLU, and softmax are called activation functions.
• The activation functions are the basic building blocks of a learning
algorithm. An example of an activation function that is applied after a
linear function is illustrated in Figure – 10.
• We can compare the ReLU, Tanh, and Sigmoid functions as follows:
• The ReLU function is a general-purpose activation function that is widely used in neural networks. It is normally used in hidden layers.
• The sigmoid function is well suited to the output layer of a binary classification model.
• Sigmoid and Tanh functions saturate for large positive or negative inputs, which can cause vanishing gradient problems in deep networks.
• A reasonable strategy is to start with ReLU and then try other activation functions to check if the performance improves.
Loss (Cost or Error) Function
• We use loss functions to measure the performance of a deep learning
model for given data.
• The loss function is generally based on error terms.
• The error term is the difference between the real (measured) value and the value predicted by the trained model:
$$\text{Error} = \text{Measured Value} - \text{Predicted Value}$$
$$e_i = y_i - \hat{y}_i$$
• A function that aggregates these error terms is referred to as a loss function, cost function, or error function.
• In deep learning, we use several loss functions to evaluate the
performance.
• We should choose the right error function to find the optimum
solution for our problem.
• The selection of the loss function depends on the nature of the problem.
• For example, for regression problems, the mean squared error (MSE) or root mean squared error (RMSE) is a common choice.
• For multi-class classification problems, we should use multi-class (categorical) crossentropy.
• For deep learning tasks, several loss functions are available.
• Root mean squared error (RMSE), mean squared error (mean_squared_error), mean absolute error (mean_absolute_error), and mean absolute percentage error (mean_absolute_percentage_error) are appropriate loss functions for regression.
• For two-class classification problems, you should use binary crossentropy (binary_crossentropy).
• For many-class classification problems, you should use categorical crossentropy (categorical_crossentropy).
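To make these loss functions concrete, the sketch below computes a few of them by hand in plain Python. The function names mirror the loss names listed above, but these are standalone illustrations rather than the Keras implementations, and the sample values are invented.

```python
import math

def mean_squared_error(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(labels, probs):
    """Average of -[y log p + (1 - y) log(1 - p)] over the examples."""
    terms = [y * math.log(p) + (1 - y) * math.log(1 - p)
             for y, p in zip(labels, probs)]
    return -sum(terms) / len(terms)

# Regression example (made-up measured and predicted values).
y_true, y_pred = [3.0, 5.0, 2.0], [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)

# Two-class example: true labels and predicted probabilities of class 1.
bce = binary_crossentropy([1, 0], [0.9, 0.2])
print(mse, rmse, mae, bce)
```

The squared-error losses punish the single large error (2.0) more heavily than the mean absolute error does.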
• Crossentropy is a quantity from the field of Information Theory.
• It measures the distance between probability distributions.
• In this case, it measures the difference between true distribution and
our predictions.
Information Theory
Information theory lets us quantify the amount of information that an event carries. Its principles are as follows:

• If the probability of an event is high, the event is considered less informative. On the other hand, if the probability is low, the event is considered highly informative.
• The information from independent events can be calculated by adding their individual information content.

The amount of information of an event $x$ is defined as follows:

$$I(x) = -\log P(x)$$
• In this case, $\log$ is the natural logarithm. For example, if the probability of an event is $P(x) = 0.8$, then $I(x) = 0.22$.
• Alternatively, if $P(x) = 0.2$, then $I(x) = 1.61$.
• Hence, we can see that the information content of an event is inversely related to its probability.
• With the natural logarithm, self-information is measured in a natural unit of information called the nat.
• We can also use the base-2 logarithm, i.e., $I(x) = -\log_2 P(x)$. In this case, we measure it in bits.
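The two worked values above can be reproduced with a short sketch. The function name `self_information` is chosen for this example only.

```python
import math

def self_information(p, base=math.e):
    """I(x) = -log P(x); base e gives nats, base 2 gives bits."""
    return -math.log(p, base)

nats_likely = self_information(0.8)    # about 0.22 nats
nats_rare = self_information(0.2)      # about 1.61 nats
bits_half = self_information(0.5, 2)   # an event with P = 0.5 carries 1 bit
print(round(nats_likely, 2), round(nats_rare, 2), bits_half)
```

The rarer event carries more information, and switching the base only changes the unit, not the ordering.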
• Since there is no difference in principle between the two versions, we will use the natural logarithm version in this section.
• The example given above relates to a single outcome.
• We can also extend it to multiple outcomes by measuring the amount of information over the probability distribution of a random variable.
• We can denote it using $I(X)$, where $X$ is a discrete random variable.
• The mean (or expected value) of a discrete random variable is the sum of all its possible values, each multiplied by its probability.
• In the same way, we multiply the information content of each event by the probability of that event and add up the results.
Shannon Entropy
• We call this measure Shannon entropy (or just entropy). We can define Shannon entropy as follows:
$$H(X) = E[I(X)] = -\sum_{i=1}^{n} P(X = x_i) \log P(X = x_i)$$
• In this case, $x_i$ represents a value of the discrete variable. Events with higher probabilities carry more weight than events with lower probabilities.
• Let us compute the entropy using coin toss examples.
Example 1: Let's assume $P(\text{heads}) = P(\text{tails}) = 0.5$. In this case, the entropy is

$$H(X) = -P(\text{heads}) \log P(\text{heads}) - P(\text{tails}) \log P(\text{tails})$$
$$= -0.5 \times (-0.69) - 0.5 \times (-0.69) = 0.69$$

Example 2: Let's assume that the coin is biased and the outcomes are not equally likely, with $P(\text{heads}) = 0.2$ and $P(\text{tails}) = 0.8$.

$$H(X) = -P(\text{heads}) \log P(\text{heads}) - P(\text{tails}) \log P(\text{tails})$$
$$= -0.2 \times (-1.61) - 0.8 \times (-0.22) = 0.50$$
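Both coin toss examples can be checked with a few lines of Python. This is a minimal sketch using the natural logarithm, as in the slides; the function name `entropy` is chosen for the example.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p * log(p), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])    # heads and tails equally likely
biased_coin = entropy([0.2, 0.8])  # heads 0.2, tails 0.8
print(round(fair_coin, 2), round(biased_coin, 2))  # 0.69 0.5
```

The fair coin has the higher entropy of the two, since its outcome is harder to predict.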
• Hence, we can say that the entropy is
highest when the outcomes are
equally likely and decreases when one
outcome becomes prevalent.
• So, we can use entropy as a
measurement of uncertainty or chaos.
• In the following diagram, we have
shown the graph of the entropy 𝐻(𝑋)
over a binary event (such as the coin
toss), depending on the probability
distribution of the two outcomes.
Cross-Entropy
• Now, let’s assume that we have a discrete random variable, 𝑋, and
two different probability distributions over it.
• In deep learning, this is a usual scenario. For example, the neural
network produces some probability distribution 𝑄(𝑋) and we want to
compare it with a target distribution 𝑃(𝑋) during training.
• Using cross-entropy, we can measure the difference between these
two distributions. The cross-entropy can be defined as follows:
$$H(P, Q) = -\sum_{i=1}^{n} P(X = x_i) \log Q(X = x_i)$$
For example, let's calculate the cross-entropy between two probability distributions from the previous coin toss scenario.

Predicted distribution: $Q(\text{heads}) = 0.2$, $Q(\text{tails}) = 0.8$

Target (true) distribution: $P(\text{heads}) = P(\text{tails}) = 0.5$

The cross-entropy can be calculated as follows:

$$H(P, Q) = -P(\text{heads}) \log Q(\text{heads}) - P(\text{tails}) \log Q(\text{tails})$$
$$= -0.5 \times (-1.61) - 0.5 \times (-0.22) = 0.915$$
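The same cross-entropy can be computed without rounding the logarithms first. In this sketch (function name chosen for the example), the unrounded result is about 0.916, which agrees with the hand calculation up to the rounding of the logs to two decimals.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) * log Q(x), in nats."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.5]  # target distribution: heads, tails
q = [0.2, 0.8]  # predicted distribution: heads, tails

h_pq = cross_entropy(p, q)
print(round(h_pq, 3))  # 0.916
```

When the prediction matches the target exactly, `cross_entropy(p, p)` reduces to the entropy of the target distribution.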


Kullback-Leibler divergence (KL divergence)
KL divergence is another measure of the difference between two probability distributions.
$$D_{KL}(P \,\|\, Q) = \sum_{i=1}^{n} P(X = x_i) \log \frac{P(X = x_i)}{Q(X = x_i)}$$
$$= \sum_{i=1}^{n} P(X = x_i) \left[\log P(X = x_i) - \log Q(X = x_i)\right]$$
$$= \sum_{i=1}^{n} \left[P(X = x_i) \log P(X = x_i) - P(X = x_i) \log Q(X = x_i)\right]$$
$$= H(P, Q) - H(P)$$

We can see that the KL divergence measures the difference between the target and the predicted log probabilities.
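The KL divergence of the coin toss example is as follows:

$$D_{KL}(P \,\|\, Q) = P(\text{heads}) \left[\log P(\text{heads}) - \log Q(\text{heads})\right] + P(\text{tails}) \left[\log P(\text{tails}) - \log Q(\text{tails})\right]$$
$$= 0.5 (\log 0.5 - \log 0.2) + 0.5 (\log 0.5 - \log 0.8) = 0.22$$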
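As a check on both the worked value and the identity $D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$, the sketch below computes the KL divergence directly and via cross-entropy minus entropy. The function names are chosen for this example.

```python
import math

def entropy(p):
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(x) * (log P(x) - log Q(x)), in nats."""
    return sum(p_i * (math.log(p_i) - math.log(q_i))
               for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.5]  # target distribution: heads, tails
q = [0.2, 0.8]  # predicted distribution: heads, tails

kl = kl_divergence(p, q)
kl_from_identity = cross_entropy(p, q) - entropy(p)
print(round(kl, 2), round(kl_from_identity, 2))  # 0.22 0.22
```

The divergence is zero when the two distributions are identical and is not symmetric in general, so `kl_divergence(q, p)` gives a different value.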
Thanks