
Deep Learning Module-02

The document provides an overview of feedforward neural networks, including their architecture, activation functions, and historical context. It discusses gradient-based learning, cost functions, and various optimization techniques such as batch normalization and dropout. Additionally, it covers practical implementation considerations and common issues faced in deep learning, along with solutions.

Uploaded by

sanjana sm

21CS743 | DEEP LEARNING

Module-02

Feedforward Networks and Deep Learning

1. Introduction to Feedforward Neural Networks

1.1 Basic Concepts

• A feedforward neural network is the simplest form of artificial neural network (ANN)

• Information moves in only one direction: forward, from input nodes through hidden nodes to output nodes

• No cycles or loops exist in the network structure

1.2 Historical Context

1. Origins

o Inspired by biological neural networks

o First proposed by Warren McCulloch and Walter Pitts (1943)

o Significant advancement with the perceptron by Frank Rosenblatt (1958)

2. Evolution

o Single-layer to multi-layer networks

o Backpropagation popularized for training multi-layer networks (1986)

o Modern deep learning revolution (2012-present)


1.3 Network Architecture

1. Input Layer

o Receives raw input data

o No computation performed

o Number of neurons equals number of input features



o Standardization/normalization often applied here

2. Hidden Layers

o Perform intermediate computations

o Can have multiple hidden layers

o Each neuron connected to all neurons in previous layer


o Feature extraction and transformation occur here

3. Output Layer

o Produces final network output

o Number of neurons depends on problem type

o Classification: typically one neuron per class

o Regression: usually one neuron

1.4 Activation Functions

1. Sigmoid (Logistic)

o Formula: σ(x) = 1/(1 + e^(-x))

o Range: (0, 1)

o Used in binary classification

o Properties:

▪ Smooth gradient

▪ Clear prediction probability

▪ Suffers from vanishing gradient

2. Hyperbolic Tangent (tanh)

o Formula: tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))


o Range: (-1, 1)

o Often performs better than sigmoid

o Properties:

▪ Zero-centered


▪ Stronger gradients

▪ Still has vanishing gradient issue

3. ReLU (Rectified Linear Unit)

o Formula: f(x) = max(0,x)

o Most commonly used

o Helps mitigate the vanishing gradient problem

o Properties:

▪ Computationally efficient

▪ No saturation in positive region

▪ Dying ReLU problem

4. Leaky ReLU

o Formula: f(x) = max(0.01x, x)

o Addresses dying ReLU problem

o Small negative slope



o Properties:

▪ Never completely dies

▪ Allows for negative values



▪ More robust than standard ReLU
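
The four activation functions above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names and the `slope` parameter are chosen here for the example, and `sigmoid_grad` is added only to show how the sigmoid gradient shrinks for large inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # For 0 < slope < 1 this is the same as max(slope * x, x)
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, 0.0, 2.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, fn(x))

# The sigmoid gradient becomes very small away from zero (saturation)
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
```

The last line shows the saturation behind the vanishing gradient issue: the sigmoid derivative is 0.25 at zero and several orders of magnitude smaller at x = 10.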

2. Gradient-Based Learning

2.1 Understanding Gradients

1. Definition


o Gradient is a vector of partial derivatives

o Points in direction of steepest increase

o Used to minimize loss function

2. Properties

o Direction indicates fastest increase

o Magnitude indicates steepness

o Negative gradient used for minimization
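
As a small illustration of these properties, the sketch below uses a hypothetical two-parameter loss J(w) = w1² + 3·w2² (chosen only for this example) and compares a step along the gradient with a step against it.

```python
import numpy as np

def J(w):
    # Hypothetical loss: a simple bowl with its minimum at w = (0, 0)
    return w[0] ** 2 + 3.0 * w[1] ** 2

def grad_J(w):
    # Vector of partial derivatives: (dJ/dw1, dJ/dw2)
    return np.array([2.0 * w[0], 6.0 * w[1]])

w = np.array([1.0, 1.0])
g = grad_J(w)
step = 0.1

w_down = w - step * g  # move against the gradient
w_up = w + step * g    # move along the gradient

print("J(w)      =", J(w))
print("J(w_down) =", J(w_down))
print("J(w_up)   =", J(w_up))
```

Moving against the gradient lowers the loss and moving along it raises the loss, which is why the update rule subtracts the gradient.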

2.2 Cost Functions

1. Mean Squared Error (MSE)

o Used for regression problems

o Formula: MSE = (1/n)Σ(y_true - y_pred)²

o Properties:

▪ Always non-negative

▪ Penalizes larger errors more

▪ Differentiable

2. Cross-Entropy Loss

o Used for classification problems



o Formula: -Σ(y_true * log(y_pred))

o Properties:

▪ Measures probability distribution difference

▪ Better for classification than MSE


▪ Provides stronger gradients

3. Huber Loss

o Combines MSE and MAE

o Less sensitive to outliers

o Formula:

▪ L = 0.5(y - f(x))² if |y - f(x)| ≤ δ

▪ L = δ|y - f(x)| - 0.5δ² otherwise
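
The three cost functions can be written in NumPy roughly as below. This is a sketch under stated assumptions: cross-entropy is averaged over samples and clipped with a small `eps` to avoid log(0), and Huber loss is averaged over samples. Neither choice is specified in the notes.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot rows; y_pred: predicted class probabilities
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def huber(y_true, y_pred, delta=1.0):
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * r - 0.5 * delta ** 2
    return np.mean(np.where(r <= delta, quadratic, linear))

# Regression example with one outlier (last target)
y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0])
print("MSE  :", mse(y_true, y_pred))
print("Huber:", huber(y_true, y_pred, delta=1.0))

# Classification example: two samples, two classes, uniform predictions
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
print("CE   :", cross_entropy(labels, probs))
```

With a single residual of 6, MSE is dominated by the squared error while Huber grows only linearly, which illustrates the reduced sensitivity to outliers.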

2.3 Gradient Descent Types

1. Batch Gradient Descent

o Uses entire dataset for each update

o More stable but slower

o Formula: θ = θ - α∇J(θ)

o Memory intensive for large datasets

2. Stochastic Gradient Descent (SGD)



o Updates parameters after each sample

o Faster but less stable

o Better for large datasets



o High variance in parameter updates

3. Mini-batch Gradient Descent

o Compromise between batch and SGD

o Updates parameters after small batches


o Most commonly used in practice

o Typical batch sizes: 32, 64, 128

4. Advanced Optimizers

a) Adam (Adaptive Moment Estimation)

o Combines momentum and RMSprop

o Adaptive learning rates

o Update uses running estimates of the first and second moments of the gradient

b) RMSprop

o Adaptive learning rates

o Divides the step by the root of a running average of squared gradients

c) Momentum

o Adds fraction of previous update

o Helps escape local minima

o Reduces oscillation
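
The update rules of the three optimizers can be sketched as single-step functions. The hyperparameter defaults below are commonly used values, not values given in these notes, and the loss J(w) = 0.5·||w||² is a hypothetical test problem chosen because its gradient is simply w.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # Keep a fraction beta of the previous update, then add the new step
    v = beta * v - lr * grad
    return w + v, v

def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    # s is a running average of squared gradients
    s = rho * s + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, m, v, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # m: first moment (mean), v: second moment (uncentered variance)
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def grad(w):
    # Gradient of the hypothetical loss J(w) = 0.5 * ||w||^2
    return w

w_mom, v_mom = np.array([1.0, -2.0]), np.zeros(2)
w_adam, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w_mom, v_mom = momentum_step(w_mom, v_mom, grad(w_mom))
    w_adam, m, v = adam_step(w_adam, m, v, grad(w_adam), t)

print("momentum:", np.linalg.norm(w_mom))
print("adam    :", np.linalg.norm(w_adam))
```

Both runs move the parameters toward the minimum at the origin. How quickly each one gets there depends on the learning rate and decay settings, so the comparison here is illustrative only.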

3. Backpropagation and Chain Rule



3.1 Chain Rule Fundamentals

1. Mathematical Basis

o df/dx = df/dy * dy/dx



o Allows computation of composite function derivatives

o Essential for neural network training

2. Application in Neural Networks

o Computes gradients layer by layer


o Propagates error backwards

o Updates weights based on contribution to error
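
A worked example of the chain rule on a single neuron may help. The setup is an assumption made for illustration: z = w·x + b, a = sigmoid(z), and loss L = (a - y)², with arbitrary numeric values. The chain-rule gradient is compared with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative values
x, y = 1.5, 1.0
w, b = 0.4, -0.2

# Forward computation
z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Finite-difference check of the same derivative
def loss(w_):
    return (sigmoid(w_ * x + b) - y) ** 2

eps = 1e-6
dL_dw_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print("chain rule:", dL_dw)
print("numeric   :", dL_dw_numeric)
```

Backpropagation applies this same product of local derivatives at every layer, reusing the factors that are shared between layers.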

3.2 Forward Pass

1. Input Processing

o Data normalization

o Weight initialization

o Bias addition

2. Layer Computation

```python
# Forward pass; `layers` is a list of (W, b, activation) tuples
def forward(A, layers):
    for W, b, activation in layers:
        Z = W @ A + b      # Linear transformation (matrix product)
        A = activation(Z)  # Apply activation function
    return A
```

3. Output Generation

o Final layer activation

o Prediction computation

o Error calculation

3.3 Backward Pass

1. Error Calculation

o Compare output with target


o Calculate loss using cost function

o Initialize gradient computation

2. Weight Updates

o Calculate gradients using chain rule

o Update weights: w_new = w_old - learning_rate * gradient

o Update biases similarly

3. Detailed Steps

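```python
import numpy as np

# Backward pass for a two-layer network (columns of X are samples)
def backward(X, Y, Z1, A1, A2, W2, activation_derivative):
    m = X.shape[1]
    # Output layer
    # dZ = A - Y holds for sigmoid/softmax with cross-entropy,
    # or for a linear output with (1/2) * squared error
    dZ2 = A2 - Y
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    # Hidden layer
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * activation_derivative(Z1)
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```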
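
One common way to test a backward pass is a gradient check: compare the backpropagated gradients with finite-difference estimates. The sketch below assumes a small network chosen for illustration (tanh hidden layer, sigmoid output, binary cross-entropy loss, columns of `X` as samples). None of these choices comes from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    A2 = sigmoid(W2 @ A1 + b2)
    return Z1, A1, A2

def loss(params, X, Y):
    A2 = forward(params, X)[2]
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    Z1, A1, A2 = forward(params, X)
    dZ2 = A2 - Y                                 # sigmoid + cross-entropy
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)           # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numerical_grads(params, X, Y, eps=1e-6):
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = loss(params, X, Y)
            p[idx] = old - eps
            minus = loss(params, X, Y)
            p[idx] = old                         # restore the parameter
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                      # 3 features, 5 samples
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
params = [rng.normal(size=(4, 3)), np.zeros((4, 1)),
          rng.normal(size=(1, 4)), np.zeros((1, 1))]

analytic = backward(params, X, Y)
numeric = numerical_grads(params, X, Y)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print("largest gradient difference:", max_diff)
```

A very small difference suggests the backward pass matches the loss. A large one usually points to a wrong derivative or a shape error.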

4. Regularization for Deep Learning

4.1 L1 Regularization


1. Mathematical Form

o Adds absolute value of weights to loss

o Formula: L1 = λΣ|w|

o Promotes sparsity

2. Properties

o Feature selection capability

o Produces sparse models

o Less sensitive to outliers

4.2 L2 Regularization

1. Mathematical Form

o Adds squared weights to loss

o Formula: L2 = λΣw²

o Prevents large weights

2. Properties

o Smooth weight decay

o No sparse solutions

o More stable training
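
The two penalties and their gradients can be compared directly. This is a minimal sketch: the function names and the example weights are made up for illustration, and the L1 gradient uses the subgradient convention of 0 at w = 0.

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

def l1_grad(w, lam):
    # Subgradient: constant magnitude lam, sign of the weight (0 at w == 0)
    return lam * np.sign(w)

def l2_grad(w, lam):
    # Proportional to the weight itself
    return 2.0 * lam * w

w = np.array([0.5, -0.05, 2.0])
lam = 0.1

print("L1 penalty:", l1_penalty(w, lam), "gradient:", l1_grad(w, lam))
print("L2 penalty:", l2_penalty(w, lam), "gradient:", l2_grad(w, lam))
```

The L1 gradient pushes every nonzero weight toward zero with the same force, so small weights tend to reach exactly zero (sparsity). The L2 gradient shrinks with the weight, which gives smooth decay without exact zeros.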



4.3 Dropout

1. Basic Concept

o Randomly deactivate neurons

o Probability p of keeping neurons


o Different network for each training batch

2. Implementation Details

```python
import numpy as np

# Inverted dropout (training only); p is the probability of keeping a neuron
def dropout(A, p):
    mask = np.random.binomial(1, p, size=A.shape)
    A = A * mask
    return A / p  # Scale to maintain expected value
```

3. Training vs. Testing

o Used only during training

o Scaling is applied either during training (inverted dropout, dividing by p) or at inference (multiplying by p), not both

o Acts as model ensemble
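
The difference between training and inference can be sketched with a single function that takes a `training` flag. The function name, the flag, and the use of a seeded NumPy generator are assumptions made for this example. The scaling follows the inverted-dropout convention, in which the division by the keep probability happens during training.

```python
import numpy as np

def dropout_forward(A, p_keep, training, rng):
    if not training:
        # Inverted dropout: activations pass through unchanged at inference
        return A
    mask = rng.binomial(1, p_keep, size=A.shape)
    return A * mask / p_keep

rng = np.random.default_rng(0)
A = np.ones((1000, 100))

train_out = dropout_forward(A, 0.8, True, rng)
test_out = dropout_forward(A, 0.8, False, rng)

print("training mean :", train_out.mean())
print("inference mean:", test_out.mean())
```

During training roughly 20% of the activations are zeroed and the rest are scaled up, so the mean stays close to the inference-time mean of 1.0.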

4.4 Early Stopping

1. Implementation

o Monitor validation error

o Save best model

o Stop when validation error has not improved for a set number of epochs (patience)



2. Benefits

o Prevents overfitting

o Reduces training time

o Automatic model selection
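
The early stopping logic can be sketched on a list of per-epoch validation losses. The function name, the `patience` parameter, and the loss values are hypothetical. In a real training loop the losses would come from evaluating the model each epoch, and the best model would be checkpointed where the comment indicates.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
            # A real loop would save a checkpoint of the model here
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: improvement, then overfitting
val_losses = [0.90, 0.70, 0.60, 0.62, 0.58, 0.59, 0.61, 0.65]
best, stopped = early_stopping_epoch(val_losses, patience=2)
print("best epoch:", best, "stopped at epoch:", stopped)
```

With a patience of 2 the brief rise at epoch 3 is tolerated, and training stops only after two consecutive epochs without improvement.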


5. Advanced Concepts

5.1 Batch Normalization

1. Purpose

o Normalizes layer inputs

o Reduces internal covariate shift

o Speeds up training

2. Algorithm

```python
import numpy as np

# Batch normalization (training mode); x has shape (batch_size, features)
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta
```
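
At inference time batch statistics are usually replaced by running averages collected during training. The class below is a sketch of that idea, not a library API: the class name, the `training` flag, and the `momentum` convention are assumptions. Here `momentum` weights the old running value, and some libraries define it the other way round.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)   # learnable scale
        self.beta = np.zeros(num_features)   # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Update running statistics for later use at inference
            self.running_mean = (self.momentum * self.running_mean
                                 + (1 - self.momentum) * mean)
            self.running_var = (self.momentum * self.running_var
                                + (1 - self.momentum) * var)
        else:
            mean, var = self.running_mean, self.running_var
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

bn = BatchNorm1D(4)
out = bn(x, training=True)
print("mean per feature:", out.mean(axis=0))
print("std per feature :", out.std(axis=0))
```

In training mode each feature of the output has mean close to 0 and standard deviation close to 1 before `gamma` and `beta` are learned away from their initial values.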

5.2 Weight Initialization

1. Xavier/Glorot Initialization

o Variance = 2/(n_in + n_out)



o Suitable for tanh activation

2. He Initialization

o Variance = 2/n_in

o Better for ReLU activation
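
Both schemes can be sketched by drawing weights from a zero-mean normal distribution with the stated variance. The normal distribution and the layer sizes are assumptions for this example, and uniform variants with the same variance are also used in practice.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # Variance = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    # Variance = 2 / n_in
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W_xavier = xavier_normal(256, 128, rng)
W_he = he_normal(256, 128, rng)

print("Xavier: sample var", W_xavier.var(), "target", 2.0 / (256 + 128))
print("He    : sample var", W_he.var(), "target", 2.0 / 256)
```

The sample variances land close to the targets, and He initialization gives the larger spread, which compensates for ReLU zeroing roughly half of the pre-activations.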


6. Practical Implementation

6.1 Network Design Considerations

1. Architecture Choices

o Number of layers

o Neurons per layer

o Activation functions

2. Hyperparameter Selection

o Learning rate

o Batch size

o Regularization strength

6.2 Training Process


1. Data Preparation

o Splitting data

o Normalization

o Augmentation

2. Training Loop

o Forward pass

o Loss computation

o Backward pass

o Parameter updates
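
The data preparation and training loop steps can be put together in a short end-to-end sketch. Everything specific here is an assumption for illustration: the synthetic data, the single-layer (logistic regression) model, and the hyperparameter values. Augmentation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (hypothetical)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Data preparation: split, then normalize with training statistics only
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_val = (X_val - mu) / sigma

w = np.zeros(2)
b = 0.0
lr, batch_size, epochs = 0.1, 32, 20

def predict(X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

for epoch in range(epochs):
    order = rng.permutation(len(X_train))        # reshuffle every epoch
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X_train[idx], y_train[idx]
        p = predict(xb)                          # forward pass
        loss = -np.mean(yb * np.log(p + 1e-12)   # loss computation
                        + (1 - yb) * np.log(1 - p + 1e-12))
        dz = (p - yb) / len(idx)                 # backward pass
        grad_w, grad_b = xb.T @ dz, dz.sum()
        w -= lr * grad_w                         # parameter updates
        b -= lr * grad_b

val_acc = np.mean((predict(X_val) > 0.5) == (y_val == 1.0))
print("validation accuracy:", val_acc)
```

Normalizing the validation set with the training mean and standard deviation avoids leaking validation information into the model.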

Practice Problems and Exercises


1. Basic Concepts

o Explain the role of activation functions in neural networks

o Compare and contrast different types of gradient descent

o Describe the vanishing gradient problem

2. Mathematical Problems

o Calculate gradients for a simple 2-layer network

o Implement batch normalization equations

o Compute different loss functions

3. Implementation Challenges

o Design a network for MNIST classification

o Implement dropout in Python

o Create a custom loss function

Key Formulas Reference Sheet

1. Activation Functions

o Sigmoid: σ(x) = 1/(1 + e^(-x))

o tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))

o ReLU: f(x) = max(0,x)



2. Loss Functions

o MSE = (1/n)Σ(y_true - y_pred)²

o Cross-Entropy = -Σ(y_true * log(y_pred))

3. Regularization


o L1 = λΣ|w|

o L2 = λΣw²

4. Gradient Descent

o Update: w = w - α∇J(w)

o Momentum: v = βv - α∇J(w), then w = w + v

Common Issues and Solutions

1. Vanishing Gradients

o Use ReLU activation

o Implement batch normalization

o Try residual connections

2. Overfitting

o Add dropout

o Use regularization

o Implement early stopping



3. Poor Convergence

o Adjust learning rate

o Try different optimizers



o Check data normalization
