DL Unit 4 Notes
Deep learning is one of the major subfields of machine learning. Machine learning is the study of the design of algorithms inspired by the model of the human brain. Deep learning is becoming increasingly popular in data science fields such as robotics, artificial intelligence (AI), audio and video recognition, and image recognition. The artificial neural network is the core of deep learning methodologies. Deep learning is supported by various libraries such as Theano, TensorFlow, Caffe, MXNet, etc. Keras is one of the most powerful and easy-to-use Python libraries for creating deep learning models; it is built on top of popular deep learning libraries such as TensorFlow and Theano.
Overview of Keras
Keras runs on top of open-source machine learning libraries such as TensorFlow, Theano, or the Microsoft Cognitive Toolkit (CNTK). Theano is a Python library used for fast numerical computation. TensorFlow is the most famous symbolic math library used for creating neural networks and deep learning models; it is very flexible, and its primary benefit is distributed computing. CNTK is a deep learning framework developed by Microsoft that can be used from Python, C#, or C++, or as a standalone machine learning toolkit. Theano and TensorFlow are very powerful libraries, but they are difficult to use directly for creating neural networks.
Keras is based on a minimal structure that provides a clean and easy way to create deep learning models on top of TensorFlow or Theano. Keras is designed to let you define deep learning models quickly, which makes it an optimal choice for deep learning applications.
Features
Keras leverages various optimization techniques to make its high-level neural network API easier and more performant. It offers a consistent, simple, and extensible API, runs on both CPU and GPU, and supports multiple backends.
Benefits
Keras is a highly powerful and dynamic framework with a large, active community. Its models are built from modular, composable components, which makes them easy to define, test, and extend.
Keras provides two ways to define a model: the Sequential API and the Functional API.
Consider the following eight steps to create a deep learning model in Keras −
We will use the Jupyter Notebook for execution and display of output as shown below −
Step 1 − Loading the data and preprocessing the loaded data is done first so that the deep learning model can be trained on it.
import warnings
warnings.filterwarnings('ignore') # suppress library warnings in the notebook output

import numpy as np
np.random.seed(123) # for reproducibility
This step can be defined as “Import libraries and Modules” which means all the libraries and
modules are imported as an initial step.
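A minimal sketch of what the data loading and preprocessing in this step might look like, assuming the MNIST handwritten digits dataset; the dataset choice and the variable names X_train, y_train, X_test, y_test are assumptions, not specified in the notes:

from keras.datasets import mnist
from keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Reshape to (samples, 28, 28, 1) and scale pixel values to [0, 1]
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255

# One-hot encode the 10 digit classes
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)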
Step 2 − In this step, we will define the model architecture −
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation = 'relu', input_shape = (28, 28, 1))) # 32 filters of size 3x3 over 28x28 grayscale images
model.add(Conv2D(32, (3, 3), activation = 'relu'))
model.add(MaxPool2D(pool_size = (2, 2))) # downsample the feature maps
model.add(Dropout(0.25)) # regularization
model.add(Flatten()) # flatten the feature maps into a vector for the dense layers
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation = 'softmax')) # one output per digit class
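The notes stop at the model definition, but the remaining steps of the eight-step workflow (compiling, training, and evaluating the model) typically look like the following sketch, assuming the MNIST-style data and the variable names from the Step 1 sketch above:

model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy']) # compile the model

model.fit(X_train, y_train, batch_size = 32, epochs = 10, validation_data = (X_test, y_test)) # train on the training data

score = model.evaluate(X_test, y_test, verbose = 0) # evaluate on unseen data
print('Test loss:', score[0], 'Test accuracy:', score[1])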
Keras, TensorFlow, Theano, and CNTK are all related to deep learning and neural networks,
but they serve different purposes and have different roles in the machine learning ecosystem.
1. Theano:
o Theano was an open-source numerical computation library for Python.
o Developed by the Montreal Institute for Learning Algorithms (MILA) at the
University of Montreal.
o Provided a low-level interface for tensor operations, which allowed for
efficient computation on CPUs and GPUs.
o Theano is no longer actively developed or supported as of September 2017.
2. TensorFlow:
o An open-source machine learning framework developed by the Google Brain
team.
o Offers a flexible platform for building and deploying machine learning
models, including neural networks.
o Provides both high-level APIs (like Keras) for quick development and low-
level APIs for more fine-grained control.
o Supports distributed computing and deployment on various platforms.
o TensorFlow is widely used in both research and industry.
3. Keras:
o Originally developed as a high-level neural networks API written in Python.
o Designed to be user-friendly, modular, and extensible.
o In the past, Keras could be used with different backends, including
TensorFlow, Theano, and Microsoft Cognitive Toolkit (CNTK).
o Since TensorFlow version 2.0, Keras has been integrated as the official high-
level API for TensorFlow, making it the default choice for most TensorFlow
users.
o Keras provides a simple and consistent interface for building and training
neural networks.
4. Microsoft Cognitive Toolkit (CNTK):
o Developed by Microsoft, CNTK is an open-source deep learning framework.
o Designed for efficient training and evaluation of deep neural networks.
o Supports both high-level and low-level APIs.
o CNTK is particularly known for its efficiency in handling large datasets and
complex neural network architectures.
o While CNTK was once an option as a backend for Keras, it is not as
commonly used as TensorFlow in the broader deep learning community.
In summary, Keras is a high-level API for building neural networks; TensorFlow is now its default backend and the most widely used framework, while Theano and CNTK are earlier backends that are no longer actively developed.
Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to process sequences of data. They work especially well for tasks involving sequences, such as time series data, speech, and natural language.
An RNN works on the principle of saving the output of a particular layer and feeding it back to the input in order to predict the output of the layer.
Below is how you can convert a Feed-Forward Neural Network into a Recurrent Neural
Network:
The nodes in different layers of the neural network are compressed to form a single layer of
recurrent neural networks. A, B, and C are the parameters of the network.
Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output layer. A, B, and C are the network parameters (weights) used to improve the output of the model. At any given time t, the hidden state is computed from the current input x(t) together with the previous hidden state h(t-1), and the output at each step is fed back into the network to improve the next output.
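As a concrete illustration of this recurrence, here is a minimal NumPy sketch; the weight names Wx and Wh and the tanh activation are conventional choices, not taken from the notes. The hidden state h is computed from the current input and the previous hidden state, reusing the same parameters at every time step:

import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    # h(t) = tanh(Wx · x(t) + Wh · h(t-1) + b): the same weights are reused at every step
    h = np.zeros(Wh.shape[0])
    outputs = []
    for x_t in xs: # walk through the sequence one time step at a time
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        outputs.append(h)
    return np.stack(outputs)

# Toy usage with random weights (hypothetical sizes: 5 time steps of 3 features, 4 hidden units)
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
Wx, Wh, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
print(rnn_forward(xs, Wx, Wh, b).shape) # (5, 4)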
Now that you understand what a recurrent neural network is, let's look at how it works and at the different types of recurrent neural networks.
RNNs were created because a feed-forward neural network has a few issues: it cannot handle sequential data, it considers only the current input, and it cannot memorize previous inputs.
The solution to these issues is the RNN. An RNN can handle sequential data, accepting the
current input data, and previously received inputs. RNNs can memorize previous inputs due
to their internal memory.
In Recurrent Neural networks, the information cycles through a loop to the middle hidden
layer.
Fig: Working of Recurrent Neural Network
The input layer ‘x’ takes in the input to the neural network, processes it, and passes it on to the middle layer.
The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions, weights, and biases. If you have a neural network where the parameters of the different hidden layers are not affected by the previous layer, i.e., the neural network has no memory, then you can turn it into a recurrent neural network.
The Recurrent Neural Network will standardize the different activation functions and weights
and biases so that each hidden layer has the same parameters. Then, instead of creating
multiple hidden layers, it will create one and loop over it as many times as required.
A feed-forward neural network allows information to flow only in the forward direction, from
the input nodes, through the hidden layers, and to the output nodes. There are no cycles or
loops in the network.
Applications of Recurrent Neural Networks
Time Series Prediction
Any time series problem, like predicting the prices of stocks in a particular month, can be solved using an RNN.
Natural Language Processing
Text mining and sentiment analysis can be carried out using an RNN for Natural Language Processing (NLP).
Machine Translation
Given an input in one language, RNNs can be used to translate the input into different languages as output.
Advantages of Recurrent Neural Network
Recurrent Neural Networks (RNNs) have several advantages over other types of neural
networks, including:
Ability to Handle Variable-Length Sequences
RNNs are designed to handle input sequences of variable length, which makes them well-suited for tasks such as speech recognition, natural language processing, and time series analysis.
Memory of Past Inputs
RNNs have a memory of past inputs, which allows them to capture information about the context of the input sequence. This makes them useful for tasks such as language modeling, where the meaning of a word depends on the context in which it appears.
Parameter Sharing
RNNs share the same set of parameters across all time steps, which reduces the number of
parameters that need to be learned and can lead to better generalization.
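As a small illustration of parameter sharing, the following hedged Keras sketch builds a single SimpleRNN layer; its parameter count depends only on the input and hidden sizes, not on the sequence length, because the same weights are reused at every time step (the layer sizes here are arbitrary choices):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape = (None, 10)), # None: sequences of any length are accepted
    layers.SimpleRNN(16),             # 16*(16 + 10 + 1) = 432 parameters, shared across all time steps
])
model.summary()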
Non-Linear Mapping
RNNs use non-linear activation functions, which allows them to learn complex, non-linear
mappings between inputs and outputs.
Sequential Processing
RNNs process input sequences one step at a time, maintaining a hidden state that summarizes what has been seen so far, which makes them a natural fit for ordered data such as text and time series.
Flexibility
RNNs can be adapted to a wide range of tasks and input types, including text, speech, and
image sequences.
Improved Accuracy
By modeling the order of the data directly, RNNs can achieve better accuracy on sequence tasks than models that ignore the sequential structure of the input.
These advantages make RNNs a powerful tool for sequence modeling and analysis, and have
led to their widespread use in a variety of applications, including natural language processing,
speech recognition, and time series analysis.
Although Recurrent Neural Networks (RNNs) have several advantages, they also have some
disadvantages. Here are some of the main disadvantages of RNNs:
Vanishing And Exploding Gradients
RNNs can suffer from the problem of vanishing or exploding gradients, which can make it
difficult to train the network effectively. This occurs when the gradients of the loss function
with respect to the parameters become very small or very large as they propagate through
time.
Computational Complexity
RNNs can be computationally expensive to train, especially when dealing with long
sequences. This is because the network has to process each input in sequence, which can be
slow.
Difficulty Capturing Long-Term Dependencies
Although RNNs are designed to capture information about past inputs, they can struggle to
capture long-term dependencies in the input sequence. This is because the gradients can
become very small as they propagate through time, which can cause the network to forget
important information.
Lack Of Parallelism
RNNs are inherently sequential, which makes it difficult to parallelize the computation. This
can limit the speed and scalability of the network.
Difficulty Choosing the Right Architecture
There are many different variants of RNNs, each with its own advantages and disadvantages.
Choosing the right architecture for a given task can be challenging, and may require
extensive experimentation and tuning.
Difficulty Interpreting the Output
The output of an RNN can be difficult to interpret, especially when dealing with complex
inputs such as natural language or audio. This can make it difficult to understand how the
network is making its predictions.
These disadvantages are important when deciding whether to use an RNN for a given task.
However, many of these issues can be addressed through careful design and training of the
network and through techniques such as regularization and attention mechanisms.
There are four types of Recurrent Neural Networks:
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One RNN
This type of neural network is known as the Vanilla Neural Network. It's used for general machine learning problems that have a single input and a single output.
One to Many RNN
This type of neural network has a single input and multiple outputs. An example of this is image captioning.
Many to One RNN
This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good example of this kind of network, where a given sentence can be classified as expressing positive or negative sentiments.
Many to Many RNN
This RNN takes a sequence of inputs and generates a sequence of outputs. Machine
translation is one of the examples.
Recurrent Neural Networks enable you to model time-dependent and sequential data problems, such as stock market prediction, machine translation, and text generation. You will find, however, that RNNs are hard to train because of the gradient problem.
RNNs suffer from the problem of vanishing gradients. The gradients carry information used
in the RNN, and when the gradient becomes too small, the parameter updates become
insignificant. This makes the learning of long data sequences difficult.
While training a neural network, if the slope tends to grow exponentially instead of decaying,
this is called an Exploding Gradient. This problem arises when large error gradients
accumulate, resulting in very large updates to the neural network model weights during the
training process.
Long training time, poor performance, and bad accuracy are the major issues in gradient
problems.
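One common and simple remedy for exploding gradients is gradient clipping, which rescales gradients whose norm grows too large. Here is a hedged Keras sketch; the model architecture and clipping threshold are illustrative choices, not prescribed by the notes:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.SimpleRNN(32, input_shape = (None, 1)),
    layers.Dense(1),
])

# clipnorm rescales any gradient whose L2 norm exceeds 1.0 before the weight update
optimizer = keras.optimizers.Adam(learning_rate = 1e-3, clipnorm = 1.0)
model.compile(optimizer = optimizer, loss = 'mse')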
Now, let’s discuss the most popular and efficient way to deal with gradient problems,
i.e., Long Short-Term Memory Network (LSTMs).
Suppose you want to predict the last word in the text: “The clouds are in the ______.”
The most obvious answer to this is the “sky.” We do not need any further context to predict
the last word in the above sentence.
Consider this sentence: “I have been staying in Spain for the last 10 years…I can speak fluent
______.”
The word you predict will depend on the previous few words in context. Here, you need the
context of Spain to predict the last word in the text, and the most suitable answer to this
sentence is “Spanish.” The gap between the relevant information and the point where it's
needed may have become very large. LSTMs help you solve this problem.
Recurrent Neural Networks (RNNs) use activation functions just like other neural networks
to introduce non-linearity to their models. Here are some common activation functions used
in RNNs:
Sigmoid Function:
The sigmoid function is commonly used in RNNs. It has a range between 0 and 1, which
makes it useful for binary classification tasks. The formula for the sigmoid function is:
σ(x) = 1 / (1 + e^(-x))
Hyperbolic Tangent (Tanh) Function:
The tanh function is also commonly used in RNNs. It has a range between -1 and 1, which makes it useful for non-linear classification tasks. The formula for the tanh function is:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
ReLU Function:
The ReLU function is a non-linear activation function that is widely used in deep neural
networks. It has a range between 0 and infinity, which makes it useful for models that require
positive outputs. The formula for the ReLU function is:
ReLU(x) = max(0, x)
Leaky ReLU Function:
The Leaky ReLU function is similar to the ReLU function, but it introduces a small slope for negative values, which helps to prevent "dead neurons" in the model. The formula for the Leaky ReLU function is:
Leaky ReLU(x) = max(αx, x), where α is a small constant (e.g., 0.01)
Softmax Function:
The softmax function is often used in the output layer of RNNs for multi-class classification tasks. It converts the network output into a probability distribution over the possible classes. The formula for the softmax function is:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
These are just a few examples of the activation functions used in RNNs. The choice of
activation function depends on the specific task and the model's architecture.
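For reference, the activation functions above can be written directly in NumPy. This is a small illustrative sketch, not tied to any particular RNN implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # range (0, 1)

def tanh(x):
    return np.tanh(x)                     # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)             # range [0, infinity)

def leaky_relu(x, alpha = 0.01):
    return np.where(x > 0, x, alpha * x)  # small slope alpha for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract the max for numerical stability
    return e / e.sum()                    # probabilities that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1]))) # e.g. [0.659 0.242 0.099]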
In a typical RNN, one input is fed into the network at a time, and a single output is obtained. But in backpropagation through time, you use the current as well as the previous inputs. This is called a timestep, and one timestep will consist of many time series data points entering the RNN simultaneously.
Once the neural network has trained on a timestep and given you an output, that output is used to calculate and accumulate the errors. After this, the network is rolled back up, and the weights are recalculated and updated keeping the errors in mind.
Long Short-Term Memory (LSTM):
LSTM is a type of RNN that is designed to handle the vanishing gradient problem that can
occur in standard RNNs. It does this by introducing three gating mechanisms that control the
flow of information through the network: the input gate, the forget gate, and the output gate.
These gates allow the LSTM network to selectively remember or forget information from the
input sequence, which makes it more effective for long-term dependencies.
Gated Recurrent Unit (GRU):
GRU is another type of RNN that is designed to address the vanishing gradient problem. It
has two gates: the reset gate and the update gate. The reset gate determines how much of the
previous state should be forgotten, while the update gate determines how much of the new
state should be remembered. This allows the GRU network to selectively update its internal
state based on the input sequence.
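In Keras, swapping between the two gated variants is a one-line change. A hedged sketch for a sequence classification task follows; the sequence length, feature size, and number of classes are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

timesteps, features, classes = 50, 8, 3 # hypothetical sizes

lstm_model = keras.Sequential([
    layers.Input(shape = (timesteps, features)),
    layers.LSTM(64),                     # input, forget and output gates
    layers.Dense(classes, activation = 'softmax'),
])

gru_model = keras.Sequential([
    layers.Input(shape = (timesteps, features)),
    layers.GRU(64),                      # reset and update gates; fewer parameters than the LSTM
    layers.Dense(classes, activation = 'softmax'),
])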
Bidirectional RNNs:
Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be
useful for speech recognition and natural language processing tasks.
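A hedged Keras sketch of a bidirectional layer for a binary classification task such as sentiment analysis; the embedding size and output layer are illustrative choices:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape = (None, 128)),         # e.g. a sequence of 128-dimensional embeddings
    layers.Bidirectional(layers.LSTM(64)),     # one LSTM reads the sequence forward, another backward
    layers.Dense(1, activation = 'sigmoid'),
])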
Encoder-Decoder RNNs:
Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder
network that generates the output sequence based on the encoder's representation. This
architecture is commonly used for sequence-to-sequence tasks such as machine translation.
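A minimal sketch of this encoder-decoder pattern with the Keras functional API, in the style of a character-level translation model; the vocabulary sizes and hidden dimension are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256 # hypothetical sizes

# Encoder: read the source sequence and keep only its final states
encoder_inputs = keras.Input(shape = (None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state = True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: generate the target sequence, initialised with the encoder states
decoder_inputs = keras.Input(shape = (None, num_decoder_tokens))
decoder_outputs, _, _ = layers.LSTM(latent_dim, return_sequences = True, return_state = True)(
    decoder_inputs, initial_state = encoder_states)
decoder_outputs = layers.Dense(num_decoder_tokens, activation = 'softmax')(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)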
Attention Mechanisms
Attention mechanisms are a technique that can be used to improve the performance of RNNs
on tasks that involve long input sequences. They work by allowing the network to attend to
different parts of the input sequence selectively rather than treating all parts of the input
sequence equally. This can help the network focus on the input sequence's most relevant parts
and ignore irrelevant information.
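A hedged sketch of dot-product attention using the built-in keras.layers.Attention layer, where a single decoder step attends over a sequence of encoder outputs; all shapes are illustrative:

from tensorflow import keras
from tensorflow.keras import layers

query = keras.Input(shape = (1, 64))   # one decoder time step
value = keras.Input(shape = (20, 64))  # 20 encoder time steps to attend over

# The layer scores each encoder step against the query and returns their weighted sum
context = layers.Attention()([query, value])

model = keras.Model([query, value], context)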
These are just a few examples of the many variant RNN architectures that have been
developed over the years. The choice of architecture depends on the specific task and the
characteristics of the input and output sequences.
LSTMs also have a chain-like structure, but the repeating module has a slightly different structure. Instead of having a single neural network layer, there are four interacting layers communicating with each other.
Step 1: Decide How Much Past Data It Should Remember
The first step in the LSTM is to decide which information should be omitted from the cell in that particular time step. The sigmoid function (the forget gate) determines this. It looks at the previous state h(t-1) along with the current input x(t) and computes the function.
Let the output of h(t-1) be “Alice is good in Physics. John, on the other hand, is good at
Chemistry.”
Let the current input at x(t) be “John plays football well. He told me yesterday over the phone
that he had served as the captain of his college football team.”
The forget gate realizes there might be a change in context after encountering the first full
stop. It compares with the current input sentence at x(t). The next sentence talks about John,
so the information on Alice is deleted. The position of the subject is vacated and assigned to
John.
Step 2: Decide How Much This Unit Adds to the Current State
In the second layer, there are two parts: one is the sigmoid function, and the other is the tanh function. The sigmoid function decides which values to let through (0 or 1). The tanh function gives weightage to the values that are passed, deciding their level of importance (-1 to 1).
With the current input at x(t), the input gate analyzes the important information — John plays
football, and the fact that he was the captain of his college team is important.
“He told me yesterday over the phone” is less important; hence it's forgotten. This process of
adding some new information can be done via the input gate.
Step 3: Decide What Part of the Current Cell State Makes It to the Output
The third step is to decide what the output will be. First, we run a sigmoid layer, which
decides what parts of the cell state make it to the output. Then, we put the cell state through
tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid
gate.
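Putting the three steps together, a single LSTM time step can be sketched in NumPy as follows; the weight layout, with one matrix holding all four gate pre-activations, is just one common convention:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    H = h_prev.shape[0]
    z = np.concatenate([h_prev, x_t]) @ W + b # pre-activations for all four gates
    f = sigmoid(z[0:H])        # forget gate: Step 1, how much past data to keep
    i = sigmoid(z[H:2*H])      # input gate: Step 2, which new values to write
    g = np.tanh(z[2*H:3*H])    # candidate values, scaled to (-1, 1)
    o = sigmoid(z[3*H:4*H])    # output gate: Step 3, what part of the cell state to expose
    c_t = f * c_prev + i * g   # new cell state: forget old information, add new information
    h_t = o * np.tanh(c_t)     # new hidden state / output
    return h_t, c_t

# Toy usage with random weights (hypothetical sizes: 4 hidden units, 3 input features)
H, D = 4, 3
rng = np.random.default_rng(0)
W, b = rng.normal(size = (H + D, 4 * H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size = (5, D)): # run over a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, b)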
Let’s consider this example to predict the next word in the sentence: “John played
tremendously well against the opponent and won for his team. For his contributions, brave
____ was awarded player of the match.”
There could be many choices for the empty space. The current input, “brave”, is an adjective, and adjectives describe a noun. So, “John” could be the best output after “brave”.
Now that you understand how LSTMs work, let’s do a practical implementation to predict the
prices of stocks using the “Google stock price” data.
Based on the stock price data between 2012 and 2016, we will predict the stock prices of
2017.
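A hedged sketch of how such an implementation usually looks; the file name 'Google_Stock_Price_Train.csv', the use of the 'Open' column, the 60-day window, and the layer sizes are all assumptions, not specified in the notes:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

train = pd.read_csv('Google_Stock_Price_Train.csv') # hypothetical file with the 2012-2016 prices
prices = train[['Open']].values

scaler = MinMaxScaler(feature_range = (0, 1)) # scale prices to [0, 1]
prices_scaled = scaler.fit_transform(prices)

# Sliding windows: use the previous 60 days to predict the next day's price
X, y = [], []
for i in range(60, len(prices_scaled)):
    X.append(prices_scaled[i - 60:i, 0])
    y.append(prices_scaled[i, 0])
X = np.array(X).reshape(-1, 60, 1)
y = np.array(y)

model = Sequential()
model.add(LSTM(50, return_sequences = True, input_shape = (60, 1)))
model.add(Dropout(0.2))
model.add(LSTM(50))
model.add(Dropout(0.2))
model.add(Dense(1)) # predicted (scaled) next-day price
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X, y, epochs = 25, batch_size = 32)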