
Curriculum

Tuesday, February 15, 2022 3:30 PM

Course Announcement Page 1


Features
Tuesday, February 15, 2022 3:31 PM

Course Announcement Page 2


Prerequisites
Tuesday, February 15, 2022 3:31 PM

Course Announcement Page 3


Extra Content
Tuesday, February 15, 2022 3:32 PM

Course Announcement Page 4


What is Deep Learning?
16 February 2022 06:32

Deep learning is a subfield of Artificial Intelligence and Machine Learning that is
inspired by the structure of the human brain.
Deep learning algorithms attempt to draw conclusions similar to those a human would by
continually analyzing data with a given logical structure called a neural network.

Deep learning is part of a broader family of machine learning methods
based on artificial neural networks with representation learning.

Deep learning algorithms use multiple layers to progressively extract
higher-level features from the raw input. For example, in image
processing, lower layers may identify edges, while higher layers may
identify concepts relevant to a human, such as digits, letters, or
faces.
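
As a minimal illustration of "multiple layers" in code (the layer sizes are arbitrary assumptions, not the course's model), a small Keras network that passes raw inputs through successive layers:

# A tiny multi-layer network in Keras (assumes TensorFlow is installed).
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                # e.g. flattened 28x28 pixels
    layers.Dense(64, activation="relu"),         # first layer of learned features
    layers.Dense(32, activation="relu"),         # higher-level features
    layers.Dense(10, activation="softmax"),      # e.g. 10 output classes
])
model.summary()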

D1 What is Deep Learning Page 5


D1 What is Deep Learning Page 6
Deep Learning VS Machine Learning
16 February 2022 06:33

1. Data Dependency
2. Hardware Dependency
3. Training Time
4. Feature Selection
5. Interpretability

D1 What is Deep Learning Page 7


Why now?
Wednesday, February 16, 2022 6:59 AM

D1 What is Deep Learning Page 8


D1 What is Deep Learning Page 9
D1 What is Deep Learning Page 10
Code Example
Saturday, February 19, 2022 7:10 AM

D3 What is Perceptron Page 11


Sunday, February 20, 2022 11:02 AM

D3 What is Perceptron Page 12


D3 What is Perceptron Page 13
D3 What is Perceptron Page 14
Recap
22 February 2022 14:12

D5 Perceptron Loss Functions Page 15


Problem with Perceptron Trick
22 February 2022 16:10

D5 Perceptron Loss Functions Page 16


Loss Function
22 February 2022 16:46

D5 Perceptron Loss Functions Page 17


Perceptron Loss Function
22 February 2022 17:47

D5 Perceptron Loss Functions Page 18


D5 Perceptron Loss Functions Page 19
Explanation of Loss Function
23 February 2022 08:07

D5 Perceptron Loss Functions Page 20


Gradient Descent
23 February 2022 08:50

D5 Perceptron Loss Functions Page 21


Loss Function Differentiation
23 February 2022 12:57

D5 Perceptron Loss Functions Page 22


More Loss Functions
23 February 2022 13:35

D5 Perceptron Loss Functions Page 23


D5 Perceptron Loss Functions Page 24
The Problem
24 February 2022 13:41

D7 MLP Intuition Page 25


Perceptron with Sigmoid
24 February 2022 15:35

D7 MLP Intuition Page 26


Abstract Solution
24 February 2022 15:36

D7 MLP Intuition Page 27


Mathematical Reasoning
24 February 2022 15:36

D7 MLP Intuition Page 28


Adding Weights
24 February 2022 15:37

D7 MLP Intuition Page 29


The Idea of MLP
24 February 2022 15:37

D7 MLP Intuition Page 30


Adding nodes in hidden layer
25 February 2022 07:52

D7 MLP Intuition Page 31


Adding nodes in input
25 February 2022 08:25

D7 MLP Intuition Page 32


Adding nodes in output node
25 February 2022 08:33

D7 MLP Intuition Page 33


Deep Neural Network
25 February 2022 08:41

D7 MLP Intuition Page 34


Sunday, March 6, 2022 4:56 PM

D7 MLP Intuition Page 35


MNIST Dataset
Wednesday, March 9, 2022 6:39 AM

D7 MLP Intuition Page 36


MLP Notation
26 February 2022 07:02

D8 MLP Notation Page 37


Forward Propagation
03 March 2022 06:07

D9 Forward Propogation Page 38


What is Loss Function?
23 March 2022 07:15

D13 Loss Functions in Deep Learning Page 39


D13 Loss Functions in Deep Learning Page 40
D13 Loss Functions in Deep Learning Page 41
D13 Loss Functions in Deep Learning Page 42
D13 Loss Functions in Deep Learning Page 43
23 March 2022 07:31

D13 Loss Functions in Deep Learning Page 44


Backpropagation
30 March 2022 13:18

D14 Backpropogation Page 45


D14 Backpropogation Page 46
D14 Backpropogation Page 47
D14 Backpropogation Page 48
D14 Backpropogation Page 49
D14 Backpropogation Page 50
D14 Backpropogation Page 51
D14 Backpropogation Page 52
D14 Backpropogation Page 53
MLP Memoization
Thursday, April 7, 2022 7:45 AM

D15 Memoization Page 54


Gradient Descent
Thursday, April 7, 2022 7:44 AM

Gradient descent is one of the most popular algorithms to perform
optimization and by far the most common way to optimize neural
networks.

Gradient descent is a way to minimize an objective
function J(θ), parameterized by a model's parameters θ ∈ R^d, by
updating the parameters in the opposite direction of the gradient of
the objective function ∇θJ(θ) with respect to the parameters. The learning
rate η determines the size of the steps we take to reach a (local)
minimum. In other words, we follow the direction of the slope of the
surface created by the objective function downhill until we reach a
valley.

There are three variants of gradient descent, which differ in how much
data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the
accuracy of the parameter update and the time it takes to perform an
update.
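
As a minimal illustration of the update rule θ ← θ − η ∇θJ(θ) (a toy objective J(θ) = θ², not the course's code):

# Gradient descent on J(theta) = theta**2, whose minimum is at theta = 0.
import numpy as np

def grad_J(theta):
    return 2 * theta            # gradient of theta**2

theta = np.array(5.0)           # initial parameter value
eta = 0.1                       # learning rate
for step in range(100):
    theta = theta - eta * grad_J(theta)   # step against the gradient

print(theta)                    # very close to 0, the (local) minimum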

D16 Gradient Descent Page 55


D16 Gradient Descent Page 56
D16 Gradient Descent Page 57
Vanishing Gradient Problem
Thursday, April 7, 2022 7:45 AM

D17 Vanishing Gradients Page 58


D17 Vanishing Gradients Page 59
D17 Vanishing Gradients Page 60
Saturday, April 9, 2022 10:01 AM

D17 Vanishing Gradients Page 61


D17 Vanishing Gradients Page 62
D17 Vanishing Gradients Page 63
D17 Vanishing Gradients Page 64
D17 Vanishing Gradients Page 65
D17 Vanishing Gradients Page 66
D17 Vanishing Gradients Page 67
How to improve a neural network
29 April 2022 13:51

D18 How to improve a neural network Page 68


D18 How to improve a neural network Page 69
D18 How to improve a neural network Page 70
Feature Scaling
05 May 2022 17:37

D20 Feature Scaling Page 71


D20 Feature Scaling Page 72
The Problem of Overfitting
12 May 2022 09:32

D21 Dropouts Page 73


Possible Solutions
12 May 2022 09:33

D21 Dropouts Page 74


The concept of Dropouts
12 May 2022 09:33

D21 Dropouts Page 75


Why this works?
12 May 2022 09:33

D21 Dropouts Page 76


Random Forest Analogy
12 May 2022 09:33

D21 Dropouts Page 77


How prediction works?
12 May 2022 09:35

D21 Dropouts Page 78


Regression Code Example
12 May 2022 09:34

D21 Dropouts Page 79


Classification Code Example
12 May 2022 09:34

D21 Dropouts Page 80


Effect of p
12 May 2022 10:06

D21 Dropouts Page 81


Practical Tips and Tricks
12 May 2022 09:34

D21 Dropouts Page 82


Drawbacks
12 May 2022 09:34

D21 Dropouts Page 83


Resources
12 May 2022 12:38

D21 Dropouts Page 84


Overfitting
20 May 2022 08:14

D22 Regularization Page 85


Why Neural Networks Overfit?
19 May 2022 07:34

D22 Regularization Page 86


D22 Regularization Page 87
D22 Regularization Page 88
Ways to solve overfitting
20 May 2022 08:13

D22 Regularization Page 89


Regularization
20 May 2022 08:14

D22 Regularization Page 90


D22 Regularization Page 91
Intuition behind Regularization
20 May 2022 08:15

D22 Regularization Page 92


Code Demo
20 May 2022 08:15

D22 Regularization Page 93


Further Resources
20 May 2022 08:15

D22 Regularization Page 94


What are Activation Functions?
31 May 2022 14:49

In artificial neural networks, each neuron forms a weighted sum of its inputs and
passes the resulting scalar value through a function referred to as an activation
function or transfer function. If a neuron has n inputs x1, …, xn with weights
w1, …, wn and a bias b, then the output or activation of the neuron is

a = g(w1 x1 + w2 x2 + … + wn xn + b)

This function g is referred to as the activation function.
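
A tiny illustrative sketch (toy numbers, with a sigmoid chosen as g):

# One neuron: weighted sum of inputs passed through an activation function g.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # n = 3 inputs
w = np.array([0.4, 0.7, -0.2])   # one weight per input
b = 0.1                          # bias term

a = sigmoid(np.dot(w, x) + b)    # a = g(w . x + b)
print(a)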

D23 Activation Functions Page 95


Why Activation Functions are needed?
31 May 2022 14:50

D23 Activation Functions Page 96


Ideal Activation Function
31 May 2022 14:50

D23 Activation Functions Page 97


Sigmoid Activation Function
31 May 2022 14:50

D23 Activation Functions Page 98


D23 Activation Functions Page 99
Tanh Activation Function
31 May 2022 14:50

D23 Activation Functions Page 100


Relu Activation Function
01 June 2022 16:43

D23 Activation Functions Page 101


Dying Relu Problem
08 June 2022 22:53

D24 Relu Variants Page 102


D24 Relu Variants Page 103
Leaky Relu
08 June 2022 22:53

D24 Relu Variants Page 104


Parametric Relu
08 June 2022 22:53

D24 Relu Variants Page 105


Elu - Exponential Linear Unit
09 June 2022 00:29

D24 Relu Variants Page 106


Selu - Scaled Exponential Linear Unit
09 June 2022 00:29

D24 Relu Variants Page 107


Why Weight Initialization is Important?
22 June 2022 12:48

D25 Weight Initialization Techiniques Page 108


What not to do?
22 June 2022 12:49

D25 Weight Initialization Techiniques Page 109


D25 Weight Initialization Techiniques Page 110
D25 Weight Initialization Techiniques Page 111
D25 Weight Initialization Techiniques Page 112
Key Insights
22 June 2022 12:49

D25 Weight Initialization Techiniques Page 113


D25 Weight Initialization Techiniques Page 114
Weight Initialization Techniques
22 June 2022 12:49

D25 Weight Initialization Techiniques Page 115


What is Batch Norm?
27 June 2022 11:00

Batch Normalization (BN) is an algorithmic method which makes
the training of Deep Neural Networks (DNN) faster and more
stable.

It consists of normalizing the activation vectors from hidden
layers using the mean and variance of the current batch. This
normalization step is applied right before (or right after) the
nonlinear function.
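
A rough Keras sketch of where the layer sits (layer sizes are assumptions, not the course's model): BatchNormalization is placed between the linear transform and the nonlinearity.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(64),                      # linear transform
    layers.BatchNormalization(),           # normalize with the batch mean/variance
    layers.Activation("relu"),             # nonlinearity applied afterwards
    layers.Dense(1, activation="sigmoid"),
])
model.summary()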

D26 Batch Normalization Page 116


Why use Batch Norm?
30 June 2022 16:08

D26 Batch Normalization Page 117


Internal Covariate Shift
27 June 2022 11:02

D26 Batch Normalization Page 118


D26 Batch Normalization Page 119
Batch Norm - The How
27 June 2022 11:02

D26 Batch Normalization Page 120


D26 Batch Normalization Page 121
Batch Norm during test
27 June 2022 11:03

D26 Batch Normalization Page 122


Advantages
27 June 2022 11:02

D26 Batch Normalization Page 123


Keras Implementation
27 June 2022 11:03

D26 Batch Normalization Page 124


Introduction
05 July 2022 09:57

D27 Optimizers Page 125


Role of Optimizer
05 July 2022 10:01

D27 Optimizers Page 126


Types of Optimizers
05 July 2022 10:02

D27 Optimizers Page 127


Challenges
05 July 2022 10:02

D27 Optimizers Page 128


What next?
05 July 2022 10:02

D27 Optimizers Page 129


The What
05 July 2022 20:15

D28 EWMA Page 130


Mathematical Formulation
05 July 2022 20:16

D28 EWMA Page 131


D28 EWMA Page 132
Understanding Graphs
20 July 2022 12:03

D29 SGD with Momentum Page 133


Convex Vs Non-Convex Optimization
20 July 2022 13:06

D29 SGD with Momentum Page 134


Momentum Optimization - The Why?
20 July 2022 14:28

D29 SGD with Momentum Page 135


Momentum Optimization - The What?
20 July 2022 14:33

D29 SGD with Momentum Page 136


Momentum Optimization - Mathematics(The How?)
20 July 2022 14:56

D29 SGD with Momentum Page 137


Effect of beta
20 July 2022 15:09

D29 SGD with Momentum Page 138


Problem with momentum optimization
20 July 2022 15:16

D29 SGD with Momentum Page 139


Demo
24 July 2022 13:17

D30 NAG Page 140


Mathematical Intuition
24 July 2022 13:17

D30 NAG Page 141


Geometric Intuition
24 July 2022 13:17

D30 NAG Page 142


Disadvantage
24 July 2022 13:18

D30 NAG Page 143


Keras Code
24 July 2022 13:18

D30 NAG Page 144


AdaGrad Intro
02 August 2022 18:45

D31 Adagrad Page 145


How optimizers behave(Why?)
02 August 2022 18:44

D31 Adagrad Page 146


Adagrad mathematics + Intuition
02 August 2022 18:44

D31 Adagrad Page 147


Adagrad Demo
02 August 2022 18:44

D31 Adagrad Page 148


Keras Implementation
02 August 2022 18:44

D31 Adagrad Page 149


Disadvantage
02 August 2022 18:43

D31 Adagrad Page 150


The Why?
03 August 2022 14:03

D32 RMSProp Page 151


Mathematical Formulation
03 August 2022 14:04

D32 RMSProp Page 152


D32 RMSProp Page 153
Disadvantage
03 August 2022 14:54

D32 RMSProp Page 154


Introduction
04 August 2022 10:45

D33 ADAM Page 155


Mathematical Formulation
04 August 2022 10:45

D33 ADAM Page 156


Demo
04 August 2022 10:45

D33 ADAM Page 157


Verdict
04 August 2022 10:45

D33 ADAM Page 158


What is a CNN?
17 August 2022 06:47

Convolutional neural networks, also known as convnets or CNNs, are a
special kind of neural network for processing data that has a known
grid-like topology, such as time-series data (1D) or images (2D).
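
A minimal Keras sketch of such a network (layer sizes are assumptions, not the course's model), stacking convolution and pooling layers on 28x28 grayscale images:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),   # learn local filters
    layers.MaxPooling2D((2, 2)),                    # downsample feature maps
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),         # e.g. 10 digit classes
])
model.summary()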

D35 Intro to CNN Page 159


Why not use ANN?
17 August 2022 06:47

1. High computation cost
2. Overfitting
3. Loss of important information, such as the spatial arrangement of pixels
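
As a rough illustration of point 1 (the image size and layer width are assumptions): flattening even a modest 224x224 RGB image and feeding it to a single fully connected layer already produces an enormous weight matrix.

# Parameter count for one Dense layer on a flattened 224x224x3 image.
inputs = 224 * 224 * 3            # 150,528 input values after flattening
hidden_units = 1000
params = inputs * hidden_units + hidden_units   # weights + biases
print(params)                     # ~150.5 million parameters in a single layer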

D35 Intro to CNN Page 160


CNN Intuition
17 August 2022 06:48

D35 Intro to CNN Page 161


CNN Applications
17 August 2022 06:48

D35 Intro to CNN Page 162


Roadmap
17 August 2022 06:48

D35 Intro to CNN Page 163


Human Visual Cortex
18 August 2022 16:05

D36 CNN vs Visual Cortex Page 164


The Experiment
18 August 2022 16:09

D36 CNN vs Visual Cortex Page 165


Conclusion
18 August 2022 17:57

D36 CNN vs Visual Cortex Page 166


Development
18 August 2022 16:15

D36 CNN vs Visual Cortex Page 167


Introduction
19 August 2022 16:45

D37 Convolution Operation Page 168


Basics of Images
19 August 2022 16:53

D37 Convolution Operation Page 169


Edge Detection (Convolution Operation)
19 August 2022 16:53

D37 Convolution Operation Page 170


Demo
19 August 2022 16:54

D37 Convolution Operation Page 171


Working with RGB Images
19 August 2022 16:54

D37 Convolution Operation Page 172


Multiple Filters
23 August 2022 08:24

Image taken from Andrew Ng's lecture

D37 Convolution Operation Page 173


Problem with Convolution
26 August 2022 14:25

D38 Padding and Stride Page 174


What is Padding?
26 August 2022 14:26

D38 Padding and Stride Page 175


D38 Padding and Stride Page 176
Strides
27 August 2022 08:23

D38 Padding and Stride Page 177


D38 Padding and Stride Page 178
Why Strides are required?
27 August 2022 08:24

D38 Padding and Stride Page 179


The Problem with Convolution
01 September 2022 09:55

D39 Pooling Page 180


Pooling
01 September 2022 09:55

D39 Pooling Page 181


Demo
01 September 2022 09:57

D39 Pooling Page 182


Pooling on Volumes
01 September 2022 09:56

D39 Pooling Page 183


Advantages of Pooling
01 September 2022 09:56

D39 Pooling Page 184


D39 Pooling Page 185
Keras Code
01 September 2022 09:56

D39 Pooling Page 186


Types of Pooling
01 September 2022 09:57

D39 Pooling Page 187


Disadvantages of Pooling
01 September 2022 09:57

D39 Pooling Page 188


CNN Architecture
02 September 2022 10:55

D40 CNN Architecture Page 189


D40 CNN Architecture Page 190
LeNet
02 September 2022 10:55

D40 CNN Architecture Page 191


Guidelines
02 September 2022 10:58

D40 CNN Architecture Page 192


Keras Code
02 September 2022 14:58

D40 CNN Architecture Page 193


CNN Vs ANN
06 September 2022 10:00

D41 CNN Vs ANN Page 194


D41 CNN Vs ANN Page 195
Backpropagation in CNN
10 September 2022 11:12

D42 Backprop in CNN Page 196


D42 Backprop in CNN Page 197
D42 Backprop in CNN Page 198
Backpropagation in CNN
15 September 2022 09:37

Day 43 Backpropogation in CNN Part 2 Page 199


Day 43 Backpropogation in CNN Part 2 Page 200
Day 43 Backpropogation in CNN Part 2 Page 201
Day 43 Backpropogation in CNN Part 2 Page 202
Why use Pretrained models?
03 October 2022 12:52

Day 46 Using Pretrained Models Page 203


ImageNET Dataset
03 October 2022 12:36

Day 46 Using Pretrained Models Page 204


ILSVRC
03 October 2022 12:38

Day 46 Using Pretrained Models Page 205


Famous Architectures
03 October 2022 12:39

Day 46 Using Pretrained Models Page 206


Idea of Pretrained Models
03 October 2022 12:39

Day 46 Using Pretrained Models Page 207


Keras Demo
03 October 2022 12:40

Day 46 Using Pretrained Models Page 208


Problem with training your own model
10 October 2022 10:48

Day 47 Transfer Learning Page 209


Using Pretrained Models
10 October 2022 10:48

Day 47 Transfer Learning Page 210


Transfer Learning
10 October 2022 10:49

Transfer learning is a research problem in machine learning that focuses on storing
knowledge gained while solving one problem and applying it to a different but
related problem.
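
A hedged Keras sketch of the idea (the choice of VGG16 and the layer sizes are assumptions, not the course's exact code): reuse a convolutional base pretrained on ImageNet, freeze it, and train only a new task-specific head.

import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pretrained knowledge

model = tf.keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # new task-specific head
    layers.Dense(2, activation="softmax"),   # e.g. a 2-class target problem
])
model.compile(optimizer="adam", loss="categorical_crossentropy")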

Day 47 Transfer Learning Page 211


Day 47 Transfer Learning Page 212
Why Transfer Learning works
10 October 2022 10:50

Day 47 Transfer Learning Page 213


Ways of doing Transfer Learning
10 October 2022 10:50

Day 47 Transfer Learning Page 214


Code
10 October 2022 10:50

Day 47 Transfer Learning Page 215


Problem with Sequential Model
14 October 2022 16:00

Day 48 Functional Models Page 216


Day 48 Functional Models Page 217
A Simple Example
14 October 2022 16:01

Day 48 Functional Models Page 218


Multi Output Model
14 October 2022 16:02

Day 48 Functional Models Page 219


Multi Input Model
14 October 2022 16:02

Day 48 Functional Models Page 220


Shared Layers Model
14 October 2022 16:02

Day 48 Functional Models Page 221


Sequential Data
22 October 2022 13:09

D49 Why RNN Page 222


Why use RNN?
22 October 2022 13:37

D49 Why RNN Page 223


D49 Why RNN Page 224
RNN Applications
22 October 2022 13:37

D49 Why RNN Page 225


Roadmap
22 October 2022 13:37

D49 Why RNN Page 226


Why RNNs?
29 October 2022 13:30

D50 Forward Propagation in RNNs Page 227


Data for RNN
29 October 2022 13:30

D50 Forward Propagation in RNNs Page 228


RNN Architecture
29 October 2022 13:30

D50 Forward Propagation in RNNs Page 229


RNN Forward Prop
29 October 2022 17:02

D50 Forward Propagation in RNNs Page 230


Simplified Representation
29 October 2022 13:32

D50 Forward Propagation in RNNs Page 231


Code
29 October 2022 13:32

D50 Forward Propagation in RNNs Page 232


State and Memory
29 October 2022 13:33

D50 Forward Propagation in RNNs Page 233


Keras Code Example
06 November 2022 09:14

In natural language processing, word embedding is a term used for the
representation of words for text analysis, typically in the form of a
real-valued vector that encodes the meaning of the word such that the
words that are closer in the vector space are expected to be similar in
meaning.
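
A small sketch of a Keras Embedding layer (the vocabulary size, dimensions, and word indices are assumptions): it maps integer word indices to dense real-valued vectors that are learned during training.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10000      # number of distinct words we index
embed_dim = 32          # each word becomes a 32-dimensional vector

embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

sentence = np.array([[12, 57, 903, 4]])     # one sentence of 4 word indices
vectors = embedding(sentence)
print(vectors.shape)                        # (1, 4, 32)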

D51 Code Example Keras Page 234


D51 Code Example Keras Page 235
D51 Code Example Keras Page 236
D51 Code Example Keras Page 237
D51 Code Example Keras Page 238
D51 Code Example Keras Page 239
D51 Code Example Keras Page 240
D51 Code Example Keras Page 241
Till Now
17 November 2022 17:01

D52 Types of RNN Page 242


Many to One
17 November 2022 17:02

D52 Types of RNN Page 243


One to Many
17 November 2022 17:02

D52 Types of RNN Page 244


Many to Many
17 November 2022 17:02

D52 Types of RNN Page 245


One to One
17 November 2022 17:02

D52 Types of RNN Page 246


Backpropagation in RNN
01 December 2022 16:43

D53 Backpropogation through time Page 247


D53 Backpropogation through time Page 248
D53 Backpropogation through time Page 249
Problem with RNN
19 December 2022 16:33

Problems with RNN Page 250


Problems with RNN Page 251
Recap
21 August 2023 11:55

D54 LSTM The What Page 252


LSTM Core Idea
21 August 2023 17:27

D54 LSTM The What Page 253


LSTM Architecture
21 August 2023 18:41

D54 LSTM The What Page 254


D54 LSTM The What Page 255
LSTM Gates
21 August 2023 19:06

D54 LSTM The What Page 256


Summary
21 August 2023 19:29

D54 LSTM The What Page 257


The Architecture
29 August 2023 07:57

D55 LSTM Architecture Page 258


The Gates
29 August 2023 08:09

D55 LSTM Architecture Page 259


What are Ct and ht
29 August 2023 08:08

D55 LSTM Architecture Page 260


What is Xt
29 August 2023 17:40

D55 LSTM Architecture Page 261


What are ft, it, ot and Ct
29 August 2023 08:09

D55 LSTM Architecture Page 262


Pointwise Operations
29 August 2023 18:26

D55 LSTM Architecture Page 263


Neural Network Layers
29 August 2023 18:34

D55 LSTM Architecture Page 264


The Forget Gate
29 August 2023 19:58

D55 LSTM Architecture Page 265


The Input Gate
30 August 2023 04:38

D55 LSTM Architecture Page 266


The Output Gate
30 August 2023 05:27

D55 LSTM Architecture Page 267


What is a Next Word Predictor
08 September 2023 08:50

D56 Next Word Predictor Page 268


The Strategy
08 September 2023 08:51

D56 Next Word Predictor Page 269


D56 Next Word Predictor Page 270
The Architecture
08 September 2023 08:55

D56 Next Word Predictor Page 271


How to improve performance?
08 September 2023 08:51

D56 Next Word Predictor Page 272


What is GRU
05 October 2023 00:09

D57 Gated Recurrent Unit Page 273


The Big Idea Behind GRU
05 October 2023 00:47

D57 Gated Recurrent Unit Page 274


The Setup
05 October 2023 01:07

D57 Gated Recurrent Unit Page 275


The Input Xt
05 October 2023 01:52

D57 Gated Recurrent Unit Page 276


Architecture
05 October 2023 02:10

D57 Gated Recurrent Unit Page 277


What exactly is hidden state?
05 October 2023 02:19

• There was a king, Vikram, who was very strong and powerful.
• There was an enemy king, Kaali.
• Both had a war, and Kaali killed Vikram.
• Vikram had a son, Vikram Jr, who grew up to become very strong,
just like his father.
• He also attacked Kaali, but got killed.
• Vikram Jr too had a son, called Vikram Super Jr, and when he grew
up he also fought Kaali.
• And he killed Kaali, taking revenge for his father and grandfather.

[Power, Conflict, Tragedy, Revenge]

D57 Gated Recurrent Unit Page 278


D57 Gated Recurrent Unit Page 279
Calculating the reset gate
05 October 2023 14:24

D57 Gated Recurrent Unit Page 280


D57 Gated Recurrent Unit Page 281
LSTM vs GRU
05 October 2023 16:45

Here are the main differences between LSTM and GRU:

1. Number of Gates:
• LSTM: Has three gates — input (or update) gate, forget gate, and output gate.
• GRU: Has two gates — reset gate and update gate.

2. Memory Units:
• LSTM: Uses two separate states - the cell state (ct) and the hidden state (ht). The cell
state acts as an "internal memory" and is crucial for carrying long-term dependencies.
• GRU: Simplifies this by using a single hidden state (ht) to both capture and output the
memory.

3. Parameter Count:
• LSTM: Generally has more parameters than a GRU because of its additional gate and
separate cell state. For an input size of d and a hidden size of h, the LSTM has
4 × ((d×h) + (h×h) + h) parameters.

• GRU: Has fewer parameters. For the same sizes, the GRU has 3 × ((d×h) + (h×h) + h)
parameters. (A quick Keras check of both counts is sketched after this list.)

4. Computational Complexity:
• LSTM: Due to the extra gate and cell state, LSTMs are typically more computationally
intensive than GRUs.
• GRU: Is simpler and can be faster to compute, especially on smaller datasets or when
computational resources are limited.

5. Empirical Performance:
• LSTM: In many tasks, especially more complex ones, LSTMs have been observed to
perform slightly better than GRUs.

• GRU: Can perform comparably to LSTMs on certain tasks, especially when data is
limited or tasks are simpler. They can also train faster due to fewer parameters.

6. Choice in Practice:
• The choice between LSTM and GRU often comes down to empirical testing. Depending
on the dataset and task, one might outperform the other. However, GRUs, due to their
simplicity, are often the first choice when starting out.
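
As a sanity check of point 3, a small Keras sketch (the sizes d = 8 and h = 16 are arbitrary assumptions) comparing actual parameter counts with the two formulas; reset_after=False is used so the GRU matches the simpler 3 × (…) count, since Keras' default GRU variant carries an extra bias per gate.

# Compare Keras parameter counts with the formulas above (TensorFlow assumed).
import tensorflow as tf

d, h = 8, 16                                      # input size and hidden size
inp = tf.keras.Input(shape=(None, d))

lstm = tf.keras.layers.LSTM(h)                    # 3 gates + candidate cell state
gru = tf.keras.layers.GRU(h, reset_after=False)   # 2 gates + candidate state
_ = lstm(inp), gru(inp)                           # build the layers

print(lstm.count_params(), 4 * ((d * h) + (h * h) + h))   # 1600  1600
print(gru.count_params(),  3 * ((d * h) + (h * h) + h))   # 1200  1200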

D57 Gated Recurrent Unit Page 282


What is Deep RNN
17 October 2023 16:28

D58 Deep RNNs Page 283


Architecture
17 October 2023 16:29

D58 Deep RNNs Page 284


D58 Deep RNNs Page 285
Notation
17 October 2023 16:29

D58 Deep RNNs Page 286


Why and When to use?
17 October 2023 16:29

1. Hierarchical Representation
2. Customization for Advanced Tasks

D58 Deep RNNs Page 287


Code Example
17 October 2023 16:30

D58 Deep RNNs Page 288


Variants
17 October 2023 16:30

D58 Deep RNNs Page 289


Disadvantages
17 October 2023 16:30

D58 Deep RNNs Page 290


The Why?
26 October 2023 15:19

D59 Bidirectional RNNs Page 291


Bidirectional RNN Architecture
26 October 2023 15:19

D59 Bidirectional RNNs Page 292


Code
26 October 2023 15:21

D59 Bidirectional RNNs Page 293


Applications and Drawbacks
26 October 2023 15:21

D59 Bidirectional RNNs Page 294


Introduction
16 November 2023 16:15

D60 Seq2Seq Models Page 295


Sequence Tasks and its types
16 November 2023 16:16

D60 Seq2Seq Models Page 296


Seq2Seq Tasks
16 November 2023 16:16

D60 Seq2Seq Models Page 297


History of Seq2Seq Models
16 November 2023 16:16

D60 Seq2Seq Models Page 298


Stage 1 - Encoder Decoder Architecture
18 November 2023 16:16

Ilya Sutskever

D60 Seq2Seq Models Page 299


Stage 2 - Attention Mechanism
20 November 2023 10:59

"Sadly mistaken, he realized that the job offer was


actually an incredible opportunity that would lead
to significant personal and professional growth."

D60 Seq2Seq Models Page 300


D60 Seq2Seq Models Page 301
Stage 3 - Transformers
20 November 2023 12:18

D60 Seq2Seq Models Page 302


Stage 4 - Transfer Learning
20 November 2023 15:39

Transfer learning (TL) is a technique in which knowledge learned from a
task is re-used in order to boost performance on a related task.

For example, for image classification, knowledge gained while learning
to recognize cars could be applied when trying to recognize trucks.

D60 Seq2Seq Models Page 303


D60 Seq2Seq Models Page 304
D60 Seq2Seq Models Page 305
Stage 5 - LLMs
20 November 2023 18:51

D60 Seq2Seq Models Page 306


D60 Seq2Seq Models Page 307
The Grand Finale - ChatGPT
22 November 2023 10:13

D60 Seq2Seq Models Page 308


D60 Seq2Seq Models Page 309
Seq2Seq Data
08 December 2023 17:12

D61 Encoder Decoder Module Page 310


Before Starting
08 December 2023 19:24

D61 Encoder Decoder Module Page 311


High Level Overview
08 December 2023 17:14

D61 Encoder Decoder Module Page 312


What's under the hood?
08 December 2023 17:14

D61 Encoder Decoder Module Page 313


D61 Encoder Decoder Module Page 314
Training the Architecture using Backpropagation
08 December 2023 17:15

D61 Encoder Decoder Module Page 315


D61 Encoder Decoder Module Page 316
Prediction
08 December 2023 17:15

D61 Encoder Decoder Module Page 317


Improvement 1 - Embeddings
08 December 2023 17:16

D61 Encoder Decoder Module Page 318


Improvement 2 - Deep LSTMs
08 December 2023 17:16

D61 Encoder Decoder Module Page 319


Improvement 3 - Reversing the Input
08 December 2023 17:16

D61 Encoder Decoder Module Page 320


The Sutskever Architecture
08 December 2023 17:18

Application to Translation: The model focused on translating English to French, demonstrating
the effectiveness of sequence-to-sequence learning in neural machine translation.

Special End-of-Sentence Symbol: Each sentence in the dataset was terminated with a unique
end-of-sentence symbol ("<EOS>"), enabling the model to recognize the end of a sequence.

Dataset: The model was trained on a subset of 12 million sentences, comprising 348 million
French words and 304 million English words, taken from a publicly available dataset.

Vocabulary Limitation: To manage computational complexity, fixed vocabularies for both
languages were used, with the 160,000 most frequent words for English and 80,000 for French.
Words not in these vocabularies were replaced with a special "UNK" token.

Reversing Input Sequences: The input sentences (English) were reversed before feeding them
into the model, which was found to significantly improve the model's learning efficiency,
especially for longer sentences.

Word Embeddings: The model used a 1000-dimensional word embedding layer to represent
input words, providing dense, meaningful representations of each word.

Architecture Details: Both the input (encoder) and output (decoder) models had 4 layers, with
each layer containing 1,000 units, showcasing a deep LSTM-based architecture.

Output Layer and Training: The output layer employed a Softmax function to generate the
probability distribution over the target vocabulary. The model was trained end-to-end with
these settings.

Performance - BLEU Score: The model achieved a BLEU score of 34.81, surpassing the baseline
Statistical Machine Translation (SMT) system's score of 33.30 on the same dataset, marking a
significant advancement in neural machine translation.

D61 Encoder Decoder Module Page 321


The Why
20 December 2023 13:35

Once upon a time in a small Indian village, a mischievous monkey
stole a turban from a sleeping barber, wore it to a wedding, danced
with the bewildered guests, accidentally got crowned the 'Banana
King' by the local kids, and ended up leading a vibrant, impromptu
parade of laughing villagers, cows, and street dogs, all while
balancing a stack of mangoes on its head, creating a hilariously
unforgettable spectacle and an amusing legend that the village still
chuckles about every monsoon season.

D62 Attention Mechanism Page 322


The Solution
20 December 2023 17:32

Once upon a time in a small Indian village, a mischievous monkey
stole a turban from a sleeping barber, wore it to a wedding, danced
with the bewildered guests, accidentally got crowned the 'Banana
King' by the local kids, and ended up leading a vibrant, impromptu
parade of laughing villagers, cows, and street dogs, all while
balancing a stack of mangoes on its head, creating a hilariously
unforgettable spectacle and an amusing legend that the village still
chuckles about every monsoon season.

D62 Attention Mechanism Page 323


The What
21 December 2023 06:04

D62 Attention Mechanism Page 324


D62 Attention Mechanism Page 325
Recap
16 January 2024 16:10

D63 Bandanau Attention Vs Luong Attention Page 326


Bahdanau Attention
16 January 2024 16:11

D63 Bandanau Attention Vs Luong Attention Page 327


D63 Bandanau Attention Vs Luong Attention Page 328
Luong Attention
17 January 2024 00:09

D63 Bandanau Attention Vs Luong Attention Page 329


What is Transformer?
27 January 2024 18:41

D64 Transformers Part 1 - Introduction Page 330


Impact of Transformers
27 January 2024 20:03

D64 Transformers Part 1 - Introduction Page 331


The Origin Story!
27 January 2024 22:38

D64 Transformers Part 1 - Introduction Page 332


D64 Transformers Part 1 - Introduction Page 333
The Timeline
28 January 2024 00:55

D64 Transformers Part 1 - Introduction Page 334


Advantages
28 January 2024 01:00

D64 Transformers Part 1 - Introduction Page 335


Famous Applications
28 January 2024 01:17

D64 Transformers Part 1 - Introduction Page 336


Disadvantages
28 January 2024 01:51

D64 Transformers Part 1 - Introduction Page 337


Future
28 January 2024 02:09

D64 Transformers Part 1 - Introduction Page 338


The What

D65 Transformers Part 1 - Self Attention Page 339


Embeddings

D65 Transformers Part 1 - Self Attention Page 340


D65 Transformers Part 1 - Self Attention Page 341
First Principle Approach
04 February 2024 17:55

e_bank is the embedding of the word "bank", an n-dimensional vector.

The similarity term measures how similar e_money is to e_bank, and it
appears as the coefficient of e_bank in the weighted sum.
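
A hedged sketch of this first-principles idea (the three-word sentence and toy vectors are illustrative assumptions): the contextual embedding of "bank" is a weighted sum of every word's embedding, with each coefficient given by that word's normalized dot-product similarity to e_bank.

import numpy as np

embeddings = {                       # toy 3-dimensional word embeddings
    "money": np.array([0.9, 0.1, 0.0]),
    "bank":  np.array([0.8, 0.2, 0.1]),
    "grows": np.array([0.0, 0.7, 0.3]),
}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

e_bank = embeddings["bank"]
words = list(embeddings)
sims = np.array([embeddings[w] @ e_bank for w in words])   # dot-product similarity
weights = softmax(sims)                                    # normalized coefficients

contextual_bank = sum(w * embeddings[word] for w, word in zip(weights, words))
print(dict(zip(words, weights)))
print(contextual_bank)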

D65 Transformers Part 1 - Self Attention Page 342


D65 Transformers Part 1 - Self Attention Page 343
Progress
06 February 2024 00:42

D65 Transformers Part 1 - Self Attention Page 344


Query, Key & Value Vectors
06 February 2024 13:32

D65 Transformers Part 1 - Self Attention Page 345


Revision
28 February 2024 16:25

D66 Why Scaling is needed for Self Attention Page 346


D66 Why Scaling is needed for Self Attention Page 347
D66 Why Scaling is needed for Self Attention Page 348
What is dk
28 February 2024 16:59

D66 Why Scaling is needed for Self Attention Page 349


Recap
08 March 2024 15:14

D67 Self Attention Geometric Visualization Page 350


Geometric Intuition
08 March 2024 15:16

D67 Self Attention Geometric Visualization Page 351


D67 Self Attention Geometric Visualization Page 352
Recap of Attention
11 March 2024 17:09

D68 Why call it self attention Page 353


Recap of Self Attention
08 April 2024 14:46

D69 Multihead Attention Page 354


Problem with Self Attention
08 April 2024 14:47

The man saw the astronomer with a telescope

The man saw the astronomer with a telescope


D69 Multihead Attention Page 355


Multi-head Attention
10 April 2024 09:50

The man saw the astronomer with a telescope

D69 Multihead Attention Page 356


D69 Multihead Attention Page 357
15 April 2024 16:36

random Page 358


The Why
23 May 2024 14:33

D70 Positional Encodings Page 359


Proposing a simple solution
23 May 2024 15:37

D70 Positional Encodings Page 360


D70 Positional Encodings Page 361
The sine function as a solution
23 May 2024 17:17

D70 Positional Encodings Page 362


D70 Positional Encodings Page 363
Positional Encoding
23 May 2024 20:35

D70 Positional Encodings Page 364


D70 Positional Encodings Page 365
Interesting Observations
24 May 2024 02:06

D70 Positional Encodings Page 366


D70 Positional Encodings Page 367
Agenda
07 June 2024 02:03

D71 Layer Normalization Page 368


What is Normalization
05 June 2024 10:32

Normalization in deep learning refers to the process of transforming data or model outputs to
have specific statistical properties, typically a mean of zero and a variance of one.

What do we normalize?

Benefits of Normalization in Deep Learning

• Improved Training Stability:

○ Normalization helps to stabilize and accelerate the training process by reducing the
likelihood of extreme values that can cause gradients to explode or vanish.

• Faster Convergence:

○ By normalizing inputs or activations, models can converge more quickly because the
gradients have more consistent magnitudes. This allows for more stable updates
during backpropagation.

• Mitigating Internal Covariate Shift:

○ Internal covariate shift refers to the change in the distribution of layer inputs during
training. Normalization techniques, like batch normalization, help to reduce this
shift, making the training process more robust.

• Regularization Effect:

○ Some normalization techniques, like batch normalization, introduce a slight
regularizing effect by adding noise to the mini-batches during training. This can help
to reduce overfitting.
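
As a tiny illustration of the definition above (toy numbers), standardizing a vector to mean 0 and variance 1:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
x_norm = (x - x.mean()) / np.sqrt(x.var() + 1e-5)   # small epsilon for numerical safety

print(x_norm.mean().round(6), x_norm.var().round(6))   # ~0.0 and ~1.0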

D71 Layer Normalization Page 369


D71 Layer Normalization Page 370
Batch Norm(Revision)
05 June 2024 10:39

D71 Layer Normalization Page 371


Why don't we use Batch Norm in Transformers?
05 June 2024 10:40

Review               Sentiment
Hi Nitish            1
How are you today    0
I am good            0
You?                 1

Embedding dimension - 3

Batch Size - 2

D71 Layer Normalization Page 372


D71 Layer Normalization Page 373
Layer Norm
05 June 2024 10:40

D71 Layer Normalization Page 374


Layer Norm in Transformers
05 June 2024 10:40

D71 Layer Normalization Page 375


Recap
09 July 2024 08:47

D72 Transformer Encoder Page 376


Simplified Representation!
09 July 2024 13:05

D72 Transformer Encoder Page 377


D72 Transformer Encoder Page 378
Encoder Architecture
09 July 2024 16:31

D72 Transformer Encoder Page 379


D72 Transformer Encoder Page 380
Some questions
09 July 2024 20:48

1. Why use residual connections?
2. Why use a FFNN?
3. Why use 6 encoder blocks?
D72 Transformer Encoder Page 381


Recap
23 July 2024 17:09

D73 Masked Self Attention Page 382


Autoregressive models
23 July 2024 17:39

The Transformer decoder is autoregressive at
inference time and non-autoregressive at
training time.

In the context of deep learning, autoregressive models are a class of
models that generate data points in a sequence by conditioning each
new point on the previously generated points.
D73 Masked Self Attention Page 383


D73 Masked Self Attention Page 384
Transformer as an Autoregressive Model
23 July 2024 23:16

The Transformer decoder is autoregressive at
inference time and non-autoregressive at
training time.

Inference

Query Sentence -> I am fine

D73 Masked Self Attention Page 385


Training

S.No   English Sentence   Hindi Sentence
1      How are you?
2      Congratulations
3      Thank you

D73 Masked Self Attention Page 386


The problem in parallelizing
25 July 2024 22:55

S.No   English Sentence   Hindi Sentence
1      How are you?
2      Congratulations
3      Thank you

D73 Masked Self Attention Page 387


D73 Masked Self Attention Page 388
Finding the answer
26 July 2024 00:21

c_e1 = w11 * … + w12 * … + w13 * …

c_e2 = w21 * … + w22 * … + w23 * …

c_e3 = w31 * … + w32 * … + w33 * …
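
A hedged NumPy sketch of the idea behind these weighted sums (toy scores and value vectors, not the notes' numbers): masking future positions with -inf before the softmax forces w12, w13, and w23 to zero, so c_1 depends only on token 1, c_2 on tokens 1-2, and c_3 on all three, while every row can still be computed in parallel.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 0.2, 0.3],
                   [0.5, 2.0, 0.1],
                   [0.4, 0.9, 1.5]])                       # raw attention scores, 3 tokens
mask = np.triu(np.ones_like(scores), k=1).astype(bool)     # future positions
scores = np.where(mask, -np.inf, scores)

W = softmax(scores)            # row i holds the weights w_i1, w_i2, w_i3
V = np.array([[1.0, 0.0],      # toy vectors for the 3 tokens
              [0.0, 1.0],
              [1.0, 1.0]])
C = W @ V                      # contextual embeddings c_1, c_2, c_3
print(np.round(W, 2))
print(C)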

D73 Masked Self Attention Page 389


Plan of Action
12 August 2024 17:53

D74 Cross Attention Page 390


What is Cross Attention
12 August 2024 17:53

Cross-attention is a mechanism used in transformer architectures, particularly in tasks
involving sequence-to-sequence data like translation or summarization. It allows a model to
focus on different parts of an input sequence when generating an output sequence.

Cross-attention is conceptually very similar to self-attention.

Self-Attention Vs Cross Attention

1. The input
2. The processing
3. The output
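
A hedged NumPy sketch of the processing difference (toy shapes and random matrices): in cross-attention the queries come from the decoder's sequence, while the keys and values come from the encoder's output, so each decoder token attends over the source sentence.

import numpy as np

rng = np.random.default_rng(1)
d_model = 4
encoder_out = rng.normal(size=(5, d_model))   # 5 encoder (source) tokens
decoder_hid = rng.normal(size=(3, d_model))   # 3 decoder (target) tokens

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = decoder_hid @ Wq           # queries from the decoder
K = encoder_out @ Wk           # keys from the encoder
V = encoder_out @ Wv           # values from the encoder

scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ V          # one context vector per decoder token
print(context.shape)           # (3, 4)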

D74 Cross Attention Page 391


Self-Attention Vs Cross Attention (Input)
13 August 2024 08:22

D74 Cross Attention Page 392


Self-Attention Vs Cross Attention (Processing)
12 August 2024 17:54

D74 Cross Attention Page 393


Self-Attention Vs Cross Attention [Output]
12 August 2024 18:05

D74 Cross Attention Page 394


Cross Attention Vs Bahdanau/Luong Attention
12 August 2024 18:06

D74 Cross Attention Page 395


Use-cases
12 August 2024 18:07

D74 Cross Attention Page 396


Plan of Attack
22 August 2024 15:36

Training not Inference

D75 Decoder Architecture Page 397


Simplified View
22 August 2024 00:33

D75 Decoder Architecture Page 398


D75 Decoder Architecture Page 399
Decoder Architecture
22 August 2024 00:56
Eng             | Hindi
I am good       | Mai badhiya hu
We are friends  | Hum dost hai

1. Shifting
2. Tokenization
3. Embedding
4. Positional Encoding
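
A small sketch of step 1 (Shifting) on the example pair above (the <start>/<end> token names are illustrative assumptions): during training the target sentence is shifted right to form the decoder input, and the unshifted sentence plus an end token forms the labels.

target = ["Hum", "dost", "hai"]

decoder_input = ["<start>"] + target          # what the decoder sees at each step
decoder_label = target + ["<end>"]            # what it is trained to predict

print(decoder_input)   # ['<start>', 'Hum', 'dost', 'hai']
print(decoder_label)   # ['Hum', 'dost', 'hai', '<end>']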

D75 Decoder Architecture Page 400


D75 Decoder Architecture Page 401
Machine Translation Task
English Sentence - We are friends

Hindi Sentence - Hum dost hai

D75 Decoder Architecture Page 402


D75 Decoder Architecture Page 403
Plan of Attack
03 September 2024 00:37

D76 Transformer Inference Page 404


The Setup
03 September 2024 00:39

S.No   English Sentence   Hindi Sentence
1      How are you        Aap kaise ho
2      Thank you          Dhanyawad
3      You are welcome    Apka swagat hai

Query Sentence

We are friends

D76 Transformer Inference Page 405


Transformer during Inference
03 September 2024 00:50

D76 Transformer Inference Page 406


D76 Transformer Inference Page 407
D76 Transformer Inference Page 408
