50 Most Important CNN Interview Questions
Q1. What is a convolutional neural network (CNN), and how does it differ from traditional neural networks?
A convolutional neural network (CNN) is a type of neural network that is particularly effective in
analyzing visual data such as images. It differs from traditional neural networks by using
convolutional layers, which apply filters or kernels to input data to extract features. CNNs also utilize
pooling layers to downsample feature maps and reduce dimensionality. The architecture of CNNs is
designed to capture spatial hierarchies and patterns in data, making them well-suited for tasks such
as image classification, object detection, and image segmentation.
Q3. How does the backpropagation algorithm work in the context of CNNs?
Backpropagation in CNNs is the algorithm used to update the network's weights and biases based
on the calculated gradients of the loss function. During training, the network's predictions are
compared to the ground truth labels, and the loss is computed. The gradients of the loss with
respect to the network's parameters are then propagated backward through the network, layer by
layer, using the chain rule of calculus. This allows the gradients to be efficiently calculated, and the
weights and biases are updated using optimization algorithms such as stochastic gradient descent
(SGD) to minimize the loss.
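To make this concrete, here is a minimal TensorFlow sketch of a single training step in which the forward pass, loss computation, backpropagation of gradients, and SGD update described above are made explicit; the tiny model and the random batch are placeholder assumptions for illustration only:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

images = tf.random.normal((32, 28, 28, 1))                    # dummy input batch
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int32)  # dummy ground truth

with tf.GradientTape() as tape:
    logits = model(images, training=True)   # forward pass
    loss = loss_fn(labels, logits)          # compare predictions to labels
grads = tape.gradient(loss, model.trainable_variables)            # backpropagation
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # SGD update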
Object tracking using CNNs involves the task of following and locating a specific object of interest
over time in a sequence of images or a video. There are different approaches to object tracking using
CNNs, including Siamese networks, correlation filters, and online learning-based methods. Siamese
networks utilize twin networks to embed the appearance of the target object and perform similarity
comparison between the target and candidate regions in subsequent frames. Correlation filters
employ filters to learn the appearance model of the target object and use correlation operations to
track the object across frames. Online learning-based methods continuously update the appearance
model of the target object during tracking, adapting to changes in appearance and conditions. These
approaches enable robust and accurate object tracking for applications such as video surveillance,
object recognition, and augmented reality.
Object segmentation in CNNs refers to the task of segmenting or partitioning an image into distinct
regions corresponding to different objects or semantic categories. Unlike object detection, which
provides bounding boxes around objects, segmentation aims to assign a label or class to each pixel
within an image. CNN-based semantic segmentation methods typically employ an encoder-decoder
architecture, such as U-Net or Fully Convolutional Networks (FCN), which leverages the hierarchical
feature representations learned by the encoder to generate pixel-level segmentation maps in the
decoder. These methods enable precise and detailed segmentation, facilitating applications like
image editing, medical imaging analysis, and autonomous driving.
Image embedding in CNNs refers to the process of mapping images into lower-dimensional vector
representations, also known as image embeddings. These embeddings capture the semantic and
visual information of the images in a compact and meaningful way. CNN-based image embedding
methods typically utilize the output of intermediate layers in the network, often referred to as the
"bottleneck" layer or the "embedding layer." The embeddings can be used for various tasks such as
image retrieval, image similarity calculation, or as input features for downstream machine learning
algorithms. By embedding images into a lower-dimensional space, it becomes easier to compare
and manipulate images based on their visual characteristics and semantic content.
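As an illustrative sketch, a pretrained backbone can serve as the embedding extractor in TensorFlow/Keras; the choice of MobileNetV2 and the cosine-similarity retrieval step below are assumptions for demonstration, not prescribed by the text above:

import tensorflow as tf

# Pretrained CNN with the classification head removed; global average pooling
# turns the final feature maps into one 1280-dimensional embedding per image.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         pooling="avg", input_shape=(224, 224, 3))

images = tf.random.uniform((4, 224, 224, 3))  # dummy batch; real images should first
                                              # go through mobilenet_v2.preprocess_input
embeddings = base(images)                     # shape (4, 1280)

# Cosine similarity between two embeddings, e.g. for image retrieval
a, b = embeddings[0], embeddings[1]
similarity = tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b))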
Q12. Describe the challenges and techniques for distributed training of CNNs.
Distributed training of CNNs refers to the process of training a CNN model across multiple machines
or devices in a distributed computing environment. This approach allows for parallel processing of
large datasets and the ability to leverage multiple computing resources to speed up the training
process. However, distributed training comes with its challenges, including communication
overhead, synchronization, and load balancing. Techniques such as data parallelism, where each
device processes a subset of the data, and model parallelism, where different devices handle
different parts of the model, can be used to distribute the workload.
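As a sketch of the data-parallel approach in TensorFlow, MirroredStrategy mirrors the model's variables across the local GPUs and splits each batch between them; the model below is an illustrative placeholder:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU

with strategy.scope():   # variables created here are mirrored on every device
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# model.fit(train_dataset) would now split each batch across the replicas and
# aggregate the gradients automatically.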
Q13. Discuss the benefits of using GPUs for CNN training and inference.
Parallel Processing: GPUs (Graphics Processing Units) are designed to handle parallel processing
tasks efficiently. CNNs involve intensive matrix operations which can be processed concurrently
across thousands of cores in a GPU, significantly speeding up computations compared to CPUs.
High Performance: GPUs are optimized for numerical computations and can perform many floating-
point operations per second (FLOPS). This high performance makes them well-suited for training
deep neural networks like CNNs, which require massive amounts of computation.
Speed: Due to their parallel architecture and high computational power, GPUs can train CNNs much
faster than CPUs. This enables researchers and practitioners to experiment with larger datasets,
more complex models, and iterate more quickly during the development process.
Large Model Support: CNNs are becoming increasingly complex with deeper architectures and more
parameters. GPUs provide the computational power necessary to train these large models efficiently.
Without GPUs, training such models would be prohibitively slow or even infeasible.
Reduced Training Time: Faster training times on GPUs mean that researchers and developers can
experiment with different model architectures, hyperparameters, and optimization techniques more
rapidly. This accelerated experimentation can lead to faster innovation and better-performing
models.
Scalability: GPU clusters can be easily scaled up by adding more GPUs to a system. This scalability
allows for distributed training of CNNs across multiple GPUs, further reducing training times for large
datasets and complex models.
Q14. Explain the concept of occlusion and how it affects CNN performance.
Occlusion refers to the process of partially or completely covering a portion of an input image to
observe its impact on the CNN's performance. Occlusion analysis helps understand the robustness
and sensitivity of CNNs to different parts of the image. By occluding specific regions of the input
image, it is possible to observe changes in the CNN's predictions. If occluding certain regions
consistently leads to a drop in prediction accuracy, it suggests that those regions are crucial for the
CNN's decision-making process.
Occlusion analysis provides insights into the CNN's understanding of different image components
and can reveal potential biases or vulnerabilities in the model. It can also be used to interpret and
explain the model's behavior and identify the features or regions the model relies on for making
predictions. By occluding different parts of an image and observing the resulting predictions,
researchers and practitioners can gain valuable insights into the inner workings of CNNs and
improve their understanding and trustworthiness.
Q15. Why do we prefer Convolutional Neural Networks (CNNs) over Artificial Neural Networks (ANNs) for image data as input?
1. A feedforward neural network can learn a single feature representation of an image, but for complex images an ANN will fail to give good predictions because it cannot learn the pixel dependencies present in images.
2. A CNN can learn multiple layers of feature representations of an image by applying filters, or transformations.
3. In a CNN, the number of parameters the network must learn is significantly lower than in a multilayer fully connected network, since the number of units in the network decreases; this reduces the chance of overfitting.
4. A CNN also considers the context information in a small neighborhood of pixels, a property that is very important for achieving better predictions on data like images. Since digital images are large grids of pixel values, it makes sense to use CNNs to analyze them: downsampling reduces the size of the representation, enabling training with less computational power and little information loss.
Q16. Explain the different layers of a CNN architecture.
Input Layer:
The input layer receives the raw input data, typically in the form of images in CNNs.
The data is represented by a three-dimensional matrix, where the dimensions correspond to width,
height, and number of channels (e.g., RGB channels for color images).
Unlike in a plain feedforward network, the input is not flattened into a single vector; its spatial structure is preserved for the convolutional layers.
Convolutional Layer:
The convolutional layer applies a set of learnable filters (kernels) to the input data.
Each filter performs a convolution operation by sliding across the input image and computing dot
products to extract features.
The output of this layer is a set of feature maps, each representing the presence of specific features
in the input image.
ReLU Layer:
The Rectified Linear Unit (ReLU) layer introduces non-linearity to the network by applying the ReLU
activation function element-wise to the feature maps.
ReLU sets all negative values to zero and leaves positive values unchanged, effectively introducing
sparsity and enabling the network to learn complex patterns.
Pooling Layer:
The pooling layer performs down-sampling operations to reduce the spatial dimensions of the
feature maps while preserving important features.
Common pooling operations include max pooling, average pooling, and global pooling, which extract
the maximum, average, or summary statistics from local regions of the feature maps, respectively.
Fully Connected Layer:
The fully connected layer, also known as the dense layer, connects every neuron in the previous
layer to every neuron in the subsequent layer.
These layers perform classification based on the features extracted by the convolutional layers.
The output of the fully connected layer typically represents class scores or probabilities for different
classes.
For multi-class classification, a softmax layer is used to normalize the outputs into a probability
distribution over multiple classes.
Output Layer:
The output layer contains the labels or predictions in the form of one-hot encoded vectors, where
each element represents the likelihood of a particular class.
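Putting these layers together, here is a minimal TensorFlow/Keras sketch of such an architecture (the input shape, filter counts, and class count are illustrative assumptions, not specified in the text above):

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolutional layer: 32 learnable 3x3 filters produce 32 feature maps
    layers.Conv2D(32, (3, 3), input_shape=(32, 32, 3)),
    # ReLU layer: element-wise non-linearity applied to the feature maps
    layers.ReLU(),
    # Pooling layer: 2x2 max pooling halves the spatial dimensions
    layers.MaxPooling2D((2, 2)),
    # Flatten the 3D feature maps into a 1D vector for the dense layers
    layers.Flatten(),
    # Fully connected layer performing classification on the extracted features
    layers.Dense(64, activation="relu"),
    # Output layer: softmax produces a probability distribution over 10 classes
    layers.Dense(10, activation="softmax"),
])
model.summary()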
Q17. Explain the significance of the ReLU activation function in a Convolutional Neural Network.
ReLU Layer – After each convolution operation, a ReLU operation is applied. ReLU is a non-linear activation function: it is applied element-wise to the feature map, replacing all negative values with zero and leaving positive values unchanged.
The mapping from raw pixel values to the concepts a network must recognize is highly non-linear, and a stack of convolutions alone is a linear operation that cannot capture it. Applying ReLU after each convolution introduces the non-linearity the network needs to learn such patterns.
Therefore this layer helps in the detection of features: converting negative values to zero introduces sparsity and allows the network to detect variations of features.
Q19. An input image has been converted into a matrix of size 12 × 12, along with a filter of size 3 × 3 and a stride of 1. Determine the size of the convolved matrix.
To calculate the size of the convolved matrix, we use the generalized equation:
C = ((n - f + 2p) / s) + 1
where n is the input size, f the filter size, p the padding, and s the stride. Here n = 12, f = 3, p = 0, and s = 1, so C = ((12 - 3 + 0) / 1) + 1 = 10. The convolved matrix is therefore 10 × 10.
Q20. Explain the terms “Valid Padding” and “Same Padding” in CNN.
Valid Padding:
This type is used when there is no requirement for padding. The output matrix after convolution will have the dimension (n – f + 1) × (n – f + 1).
Same Padding:
Here, padding elements are added all around the input matrix so that the output of the convolution has the same dimensions as the input. Padding ensures the convolution covers the entire input, preserving spatial information and symmetry in the network architecture.
Q21. Explain the different types of pooling.
Spatial pooling can be of different types: max pooling, average pooling, and sum pooling.
Max pooling: Once we obtain the feature map of the input, we apply a filter of a determined shape across the feature map and take the maximum value from the portion it covers. It is also known as subsampling because, from the entire portion of the feature map covered by the filter or kernel, we sample one single maximum value.
Average pooling: Computes the average value of the portion of the feature map covered by the kernel or filter.
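A minimal TensorFlow/Keras sketch of these pooling operations on a dummy 8 × 8 feature map with 16 channels (the shapes are illustrative):

import tensorflow as tf

feature_map = tf.random.normal((1, 8, 8, 16))  # (batch, height, width, channels)

max_pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(feature_map)
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=2)(feature_map)
global_avg = tf.keras.layers.GlobalAveragePooling2D()(feature_map)

print(max_pooled.shape)   # (1, 4, 4, 16) -- spatial dimensions halved
print(avg_pooled.shape)   # (1, 4, 4, 16)
print(global_avg.shape)   # (1, 16)       -- one summary value per channel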
Q22. Does the size of the feature map always reduce upon applying the filters? Explain why or why
not.
No; with a stride of 1 and no padding, the convolution operation shrinks the matrix of pixels (the input image) only if the size of the filter is greater than 1, i.e., f > 1.
When we apply a 1×1 filter, there is no reduction in the size of the image and hence no loss of information.
Q23 What is Stride? What is the effect of high Stride on the feature map?
Stride:
Stride refers to the number of pixels the convolutional filter moves across the input data at each step
during the convolution operation.
A stride of 1 means the filter moves one pixel at a time, while a stride of 2 means it moves two pixels
at a time.
High stride values result in a more aggressive downsampling of the feature map.
With a higher stride, the convolutional filter skips more pixels, leading to fewer output activations.
This results in a reduction in the spatial dimensions of the feature map.
As a consequence, higher stride values lead to a decrease in the size of the feature map, potentially
resulting in loss of spatial information and finer details.
However, high stride values can also help in reducing computational complexity and memory usage,
especially in deeper networks or when dealing with large input images.
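The effect is easy to verify in TensorFlow/Keras; this sketch reuses the 12 × 12 input and 3 × 3 filter from Q19:

import tensorflow as tf

image = tf.random.normal((1, 12, 12, 1))

conv_s1 = tf.keras.layers.Conv2D(1, 3, strides=1)(image)
conv_s2 = tf.keras.layers.Conv2D(1, 3, strides=2)(image)

print(conv_s1.shape)  # (1, 10, 10, 1) -- stride 1
print(conv_s2.shape)  # (1, 5, 5, 1)   -- stride 2 downsamples more aggressively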
The key hyperparameters of a pooling layer are:
Filter size
Stride
Type of pooling (max or average)
Q26. Can CNNs perform Dimensionality Reduction? If so, which layer is responsible for this task in CNN architectures?
Yes, CNNs can perform dimensionality reduction. The layer responsible for this task in CNN
architectures is the pooling layer.
Q27. What are some common problems associated with the Convolution operation in Convolutional
Neural Networks (CNNs), and how can these problems be resolved?
Some common problems associated with the Convolution operation in CNNs include:
Border Effects: Convolutional operations near the borders of the input can lead to incomplete
receptive fields, causing loss of information.
Overfitting: Deep CNN architectures with many convolutional layers can suffer from overfitting,
especially when trained on small datasets.
Vanishing Gradients: Deep CNNs with many layers may encounter vanishing gradient problems
during training, hindering learning in earlier layers.
Computational Complexity: The convolution operation can be computationally expensive, especially
for large input images and deep networks.
These problems can be mitigated as follows:
Padding: Adding appropriate padding to the input data (e.g., using "same" padding) can mitigate
border effects and ensure complete receptive fields.
Regularization: Techniques such as dropout, weight decay, and batch normalization can help
prevent overfitting in CNNs.
Skip Connections: Introducing skip connections, as in Residual Networks (ResNets), can alleviate
vanishing gradient problems and facilitate training of deeper networks.
Pooling and Striding: Using pooling layers with appropriate stride values can reduce computational
complexity and spatial dimensions while preserving important features.
Q28. Discuss some popular CNN architectures.
AlexNet:
Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, AlexNet won the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) in 2012.
It consists of eight layers, including five convolutional layers and three fully connected layers.
AlexNet introduced several key innovations, such as the extensive use of data augmentation,
dropout regularization, ReLU activation functions, and overlapping pooling.
VGG:
VGG was developed by the Visual Geometry Group at the University of Oxford.
The architecture is characterized by its simplicity and uniformity, consisting of a series of
convolutional layers followed by max-pooling layers.
VGG models are typically deeper than AlexNet, with variants such as VGG16 and VGG19, which
have 16 and 19 weight layers, respectively.
Despite its simplicity, VGG achieved competitive performance on the ImageNet dataset.
Inception (GoogLeNet):
The Inception architecture, also known as GoogLeNet, was developed by Szegedy et al. at Google.
It is characterized by its inception modules, which consist of multiple parallel convolutional layers of
different filter sizes.
Inception modules are designed to capture features at different spatial scales efficiently.
Inception v3 and Inception v4 are later versions of the original architecture, which further improved
performance and computational efficiency.
MobileNet:
MobileNet, proposed by Google researchers Howard et al., is designed for mobile and embedded
vision applications where computational resources are limited.
It utilizes depthwise separable convolutions, which significantly reduce the number of parameters
and computations compared to traditional convolutions.
MobileNet achieves a good balance between model size, accuracy, and computational efficiency,
making it suitable for resource-constrained environments.
Q29. Explain the architecture of AlexNet.
Structure: AlexNet consists of eight layers, including five convolutional layers followed by three fully connected layers.
Key Components:
Convolutional Layers: The first five layers are convolutional layers, which extract features from the
input images.
ReLU Activation: Rectified Linear Unit (ReLU) activation functions are applied after each
convolutional layer to introduce non-linearity.
Max Pooling: Max pooling layers are used for down-sampling and feature reduction after certain
convolutional layers.
Local Response Normalization (LRN): LRN layers are employed for normalization, enhancing
generalization and reducing overfitting.
Fully Connected Layers: The last three layers are fully connected layers, performing classification
based on the extracted features.
Dropout: Dropout regularization is applied to the fully connected layers to prevent overfitting.
Softmax Activation: The final layer employs softmax activation to produce class probabilities for
classification.
Q30. Regularization Techniques in CNNs:
Dropout: Randomly deactivates neurons during training to prevent over-reliance on specific features.
Weight Decay (L2 Regularization): Penalizes large weights to encourage simpler models and prevent
overfitting.
Batch Normalization: Normalizes layer activations to stabilize training and improve generalization.
Data Augmentation: Increases training dataset size by applying random transformations to input
data, reducing overfitting.
Early Stopping: Monitors validation performance and stops training when performance starts to
degrade to prevent overfitting.
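A minimal TensorFlow/Keras sketch showing how several of these techniques appear in a model definition (the layer sizes and coefficients are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3),
                  kernel_regularizer=regularizers.l2(1e-4)),  # weight decay (L2)
    layers.BatchNormalization(),    # normalizes layer activations
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),            # randomly deactivates neurons during training
    layers.Dense(10, activation="softmax"),
])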
Fully-Connected Layer
Dropout Layer
Flattening Layer:
Transforms the 3D feature maps from the previous layer into a 1D vector.
Prepares the input for the subsequent fully-connected layer and reduces complexity.
Example:
import tensorflow as tf
from tensorflow.keras import layers, models
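# A minimal sketch completing the example; the 4x4x8 input shape is a
# hypothetical feature-map size from a preceding convolutional block.
model = models.Sequential([
    layers.InputLayer(input_shape=(4, 4, 8)),  # 3D feature maps (H, W, channels)
    layers.Flatten(),                          # flattening layer -> 1D vector of 128 values
    layers.Dense(32, activation="relu"),       # fully-connected layer
    layers.Dropout(0.5),                       # dropout layer
    layers.Dense(10, activation="softmax"),    # output probabilities
])
model.summary()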
Depth refers to the number of filters or channels applied to the input data.
Each filter generates a single output channel in the feature map.
Increasing depth allows the network to learn more complex features.
Deeper layers increase computational cost and parameters.
Depth is chosen based on task complexity and computational resources.
Convolutional Neural Networks (CNNs) handle overfitting through a combination of techniques that
are tailored to their unique architecture, data characteristics, and computational demands.
Data Augmentation: Amplifying the dataset by generating synthetic training examples, such as
rotated or flipped images, helps to improve model generalization.
Dropout: During each training iteration, a random subset of neurons in the network is temporarily
removed, minimizing reliance on specific features and enhancing robustness.
Regularization: L1 and L2 penalties are used to restrict the size of the model's weights. This discourages extreme weight values, reducing overfitting.
Ensemble Methods: Combining predictions from several models reduces the risk of modeling the
noise in the dataset.
Adaptive Learning Rate: Using strategies like RMSprop or Adam, the learning rate is adjusted for
each parameter, ensuring that the model doesn't get stuck in local minima.
Early Stopping: The model's training is halted when the accuracy on a validation dataset ceases to
improve, an indication that further training would lead to overfitting.
Weight Initialization: Starting the model training with thoughtful initial weights ensures that the model
doesn't get stuck in undesirable local minima, making training more stable.
Batch Normalization: Normalizing the inputs of each layer within a mini-batch can accelerate training
and often acts as a form of regularization.
Minimum Complexity: Choosing a model complexity that best matches the dataset, as overly
complex models are more prone to overfitting.
Minimum Convolutional Filter Size: Striking a balance between capturing local features and avoiding
excessive computations supports good generalization.
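Two of these techniques sketched in TensorFlow/Keras; the model and datasets referenced in the final comment are hypothetical placeholders:

import tensorflow as tf
from tensorflow.keras import layers

# Data augmentation as preprocessing layers (synthetic flips and rotations)
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
])

# Early stopping: halt training when validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])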
Q36. What is the difference between a fully connected layer and a convolutional layer?
Local Connectivity:
Convolutional Layer: Neurons in a convolutional layer are only connected to a small, localized region
of the input data, known as the receptive field. This local connectivity allows the network to focus on
extracting local features and patterns.
Fully Connected Layer: Neurons in a fully connected layer are connected to all neurons in the
previous layer, resulting in global connectivity. This enables the network to learn complex, global
relationships between features.
Parameter Sharing:
Convolutional Layer: In convolutional layers, the same set of weights (filters) is applied across
different spatial locations of the input data. This parameter sharing reduces the number of learnable
parameters and facilitates feature learning.
Fully Connected Layer: Each neuron in a fully connected layer has its own set of weights, resulting in
a larger number of learnable parameters compared to convolutional layers.
Spatial Structure:
Convolutional Layer: Convolutional layers preserve the spatial structure of the input data. The
arrangement of neurons in the feature maps reflects the spatial relationships between features in the
input.
Fully Connected Layer: Fully connected layers flatten the spatial structure of the input data into a 1D
vector. As a result, spatial information is lost, and the network treats all input features as
independent.
Usage in CNNs:
Convolutional Layer: Convolutional layers are primarily used for feature extraction in CNNs. They are
well-suited for processing high-dimensional data such as images, where local features are important.
Fully Connected Layer: Fully connected layers are commonly used for classification or regression
tasks in CNNs. They aggregate information from the extracted features and produce final
predictions.
Computational Efficiency:
Convolutional Layer: Due to parameter sharing and local connectivity, convolutional layers are
computationally efficient, especially for processing large input data like images.
Fully Connected Layer: Fully connected layers involve a large number of parameters and
computations, making them more computationally expensive, especially for high-dimensional input
data.
In Convolutional Neural Networks (CNNs), Feature Mapping refers to the process of transforming the
input image or feature map into higher-level abstractions. It strategically uses filters and activation
functions to identify and enhance visual patterns.
Feature Extraction: Through carefully designed filters, it identifies visual characteristics like edges,
corners, and textures that are essential for the interpretation of the image.
Feature Localization: By highlighting specific areas of the image (for instance, the presence of an
edge or a texture), it helps the network understand the spatial layout and object relationships.
In CNNs, bias is a learnable parameter associated with each filter; it is added to the filter's output before the activation function is applied, independent of the input data. Bias adds flexibility to the model, enabling it to better represent the underlying data.
Core Mechanism: Reducing Overfitting and Computational Costs
Parameter sharing minimizes overfitting and computational demands by using the same set of weights across different receptive fields in the input.
Overfitting Reduction: Sharing parameters helps prevent models from learning noise or specific
features that might be unique to certain areas or examples in the dataset.
Computational Efficiency: By reusing weights during convolutions, CNNs significantly reduce the
number of parameters to learn, leading to more efficient training and inference.
Q41. Why are CNNs particularly well-suited for image recognition tasks?
Convolutional Neural Networks (CNNs) are optimized for image recognition. They efficiently handle
visual data by leveraging convolutional layers, pooling, and other architectural elements tailored to
image-specific features.
Weight Sharing: CNNs utilize the same set of weights across the entire input (image), which is
especially beneficial for grid-like data such as images.
Local Receptive Fields: By processing input data in small, overlapping sections, CNNs are adept at
recognizing local patterns.
Convolutional Layers apply a set of filters to the input image, identifying features like edges, textures,
and shapes. This process is often combined with pooling layers that reduce spatial dimensions,
retaining pertinent features while reducing the computational burden.
Pooling makes CNNs robust to shifts in the input, noise, and variation in the appearance of detected
features.
Convolution: Imagine a filter sliding over an image, detecting features in a localized manner.
Pooling: Visualize a partitioned grid of the image. The max operation captures the most important
feature from each grid section, condensing the data.
Receptive fields in the context of Convolutional Neural Networks (CNNs) refer to the area of an input
volume that a particular layer is "looking" at. The receptive field size dictates which portions of the
input volume contribute to the computation of a given activation.
The concept of local receptive fields lies at the core of CNNs. Neurons in a convolutional layer are
connected to a small region of the input, rather than being globally connected. During convolution,
the weights in the filter are multiplied with the input values located within this local receptive field.
Pooling operations are often interspersed between convolutional layers to reduce the spatial dimensions of the representation. Both max-pooling and average-pooling slide a window over the input feature map, typically moving with a stride equal to the window size so that the pooled regions do not overlap.
Additionally, subsampling layers shrink the input space, typically by discarding every nth pixel; they have largely been phased out in practical applications.
Receptive fields play a crucial role in learning hierarchical representations of visual data.
In early layers, neurons extract simple features from small local input regions. Moving deeper into the network, neurons have progressively larger receptive fields, allowing them to combine lower-level features from previous layers into more complex ones. For example, a neuron after two stacked 3 × 3 convolutions (stride 1) effectively sees a 5 × 5 region of the original input.
Local Response Normalization (LRN) was originally proposed for AlexNet, the CNN that won the
2012 ImageNet Challenge. However, the technique has mostly been superseded by others like batch
and layer normalization.
Advantages
Improved Detectability: By enhancing the activations of "strong" cells relative to their neighbors, LRN
can lead to better feature responses.
Implicit Feature Integration: The technique can promote feature map cooperation, making the CNN
more robust and comprehensive in its learned representations.
Disadvantages
Lack of Widespread Adoption: The reduced popularity of LRN in modern architectures and
benchmarks makes it a non-standard choice. Moreover, implementing LRN across different
frameworks can be challenging, leading to its disuse in production networks.
Redundancy with Other Normalization Methods: More advanced normalization techniques like batch
and layer normalization have shown similar performance without being locally limited.
Transfer Learning:
Reusing a CNN pretrained on a large dataset (e.g., ImageNet) as a feature extractor for a new task, typically by freezing the pretrained layers and training only a new classification head.
Fine-Tuning:
Unfreezing some or all of the pretrained layers and continuing training at a low learning rate so the learned features adapt to the new task.
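A minimal TensorFlow/Keras sketch of both ideas; the MobileNetV2 backbone and the 5-class head are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

# Transfer learning: reuse a pretrained backbone, train only a new head
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(224, 224, 3))
base.trainable = False   # freeze the pretrained weights

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),   # hypothetical 5-class task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# ...train the new head first, then fine-tune:
base.trainable = True    # unfreeze the backbone
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),   # low learning rate
              loss="sparse_categorical_crossentropy")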
Normalization:
Normalize pixel values to a common scale, typically in the range [0, 1] or [-1, 1]. Normalization helps
stabilize training and ensures that features contribute equally to the learning process.
Mean Subtraction:
Subtracting the mean pixel value of the dataset from each pixel helps center the data around zero,
which can improve convergence during training.
Data Augmentation:
Data augmentation techniques such as random rotation, flipping, scaling, and cropping are applied
to artificially increase the diversity of the training dataset. This helps prevent overfitting and improves
the model's generalization ability.
Image Augmentation:
Images may need augmentation, such as adjusting brightness, contrast, or saturation levels, to
increase the robustness of the model to variations in lighting conditions.
Data Splitting:
The dataset is typically divided into training, validation, and test sets. The training set is used to train
the model, the validation set is used to tune hyperparameters and monitor performance, and the test
set is used to evaluate the final model's performance.
Label Encoding:
If the dataset labels are in categorical form (e.g., text labels), they may need to be encoded into
numerical format (e.g., one-hot encoding) to be compatible with the model's output layer.
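Several of these steps sketched with NumPy/TensorFlow on a hypothetical array of images (the shapes and split ratios are illustrative):

import numpy as np
import tensorflow as tf

# Hypothetical raw data: 1000 RGB images with integer class labels 0..9
images = np.random.randint(0, 256, (1000, 32, 32, 3)).astype("float32")
labels = np.random.randint(0, 10, (1000,))

images /= 255.0                    # normalization to [0, 1]
images -= images.mean(axis=0)      # mean subtraction

one_hot = tf.keras.utils.to_categorical(labels, num_classes=10)  # label encoding

# Data splitting: 80% train, 10% validation, 10% test (labels split likewise)
n = len(images)
train, val, test = np.split(images, [int(0.8 * n), int(0.9 * n)])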
Q47. What are the metrics commonly used to evaluate the performance of Convolutional Neural Networks (CNNs)?
Accuracy:
Accuracy measures the proportion of correctly classified samples out of the total number of samples
in the dataset. It is a straightforward metric for overall performance evaluation but may not be
suitable for imbalanced datasets.
Precision:
Precision measures the proportion of true positive predictions (correctly identified instances of a
class) out of all positive predictions. It indicates the model's ability to avoid false positives.
Recall (Sensitivity):
Recall measures the proportion of true positive predictions out of all actual positive instances in the
dataset. It indicates the model's ability to capture all positive instances, also known as sensitivity.
F1 Score:
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of both
metrics. It is particularly useful when there is an imbalance between the classes in the dataset.
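A minimal sketch computing these metrics with scikit-learn, assuming it is installed; y_true and y_pred are hypothetical binary labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))         # 2PR / (P + R)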
Q49. Discuss the benefits of using skip connections in CNN architectures like ResNet.
Skip (residual) connections add a layer's input directly to its output, so the layer only needs to learn a residual function. This improves gradient flow during backpropagation and alleviates the vanishing gradient problem, making it possible to train much deeper networks. Identity shortcuts also make optimization easier, since a block can default to the identity mapping, and they encourage feature reuse across layers.
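A minimal TensorFlow/Keras sketch of a residual block (the shapes and filter counts are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # y = F(x) + x: the skip connection adds the input back to the block output
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])   # skip connection
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)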
Q50. Explain the concept of batch normalization and its role in CNN training.
Batch Normalization (BN) is a technique used in CNNs to stabilize training and accelerate
convergence by normalizing the activations of each layer within a mini-batch.
Stabilizes Training: Reduces internal covariate shift, ensuring consistent input distributions across
mini-batches.
Accelerates Convergence: Enables higher learning rates, leading to faster and more stable learning.
Improves Gradient Flow: Enhances gradient flow during backpropagation, aiding in effective training
of deep networks.
Regularization Effect: Acts as a form of regularization, preventing overfitting and improving
generalization.
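A minimal TensorFlow/Keras sketch of the common Conv -> BatchNorm -> ReLU ordering (the shapes are illustrative):

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, 3, input_shape=(32, 32, 3)),
    layers.BatchNormalization(),   # normalizes the mini-batch activations
    layers.ReLU(),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])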