21CS743 | Deep Learning | Search Creators

Module-01: Introduction to Deep Learning, Machine Learning Basics

Chapter-01: Introduction to Deep Learning

Deep learning, a subset of machine learning, has revolutionized various fields by enabling systems to learn and make decisions with minimal human intervention. At its core, deep learning leverages artificial neural networks with multiple layers (hence "deep") to model complex patterns in data. This introduction provides an overview of deep learning models, their architectures, applications, and significance in today's technological landscape.

What is Deep Learning?

Deep learning involves training artificial neural networks (computational models inspired by the human brain) to recognize patterns and make decisions based on vast amounts of data. Unlike traditional machine learning, which may require feature engineering and manual intervention, deep learning models automatically discover representations and features from raw data, making them particularly effective for tasks like image and speech recognition.

Core Components of Deep Learning Models

1. Neural Networks: The foundational structure in deep learning, consisting of interconnected layers of nodes (neurons). Each neuron processes input data, applies a transformation, and passes the result to the next layer.
2. Layers:
   - Input Layer: Receives the raw data.
   - Hidden Layers: Intermediate layers where computations are performed. The "deep" in deep learning refers to the presence of multiple hidden layers.
   - Output Layer: Produces the final prediction or classification.
3. Activation Functions: Non-linear functions (e.g., ReLU, Sigmoid, Tanh) applied to neuron outputs to introduce non-linearity, enabling the network to learn complex patterns.
4. Loss Function: Measures the difference between the model's predictions and the actual outcomes, guiding the optimization process.
5. Optimization Algorithms: Techniques (e.g., Stochastic Gradient Descent, Adam) used to adjust the network's weights to minimize the loss function.

(A short NumPy sketch after the architecture list below ties these components together.)

Popular Deep Learning Architectures

1. Convolutional Neural Networks (CNNs):
   - Purpose: Primarily used for image and video recognition.
   - Key Features: Utilize convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.
   - Applications: Image classification, object detection, facial recognition.
2. Recurrent Neural Networks (RNNs):
   - Purpose: Designed for sequential data processing.
   - Key Features: Incorporate loops to maintain information across time steps, making them suitable for tasks where context is essential.
   - Variants: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks address issues like vanishing gradients.
   - Applications: Language modeling, machine translation, speech recognition.
3. Transformer Models:
   - Purpose: Handle sequential data without relying on recurrence.
   - Key Features: Utilize self-attention mechanisms to weigh the importance of different parts of the input data.
   - Applications: Natural language processing tasks like text generation, translation, and understanding (e.g., GPT, BERT).
4. Generative Adversarial Networks (GANs):
   - Purpose: Generate new data samples that resemble a given dataset.
   - Key Features: Consist of two networks, a generator and a discriminator, that compete against each other, improving the quality of generated data over time.
   - Applications: Image generation, style transfer, data augmentation.
5. Autoencoders:
   - Purpose: Learn efficient data encodings in an unsupervised manner.
   - Key Features: Comprise an encoder that compresses the data and a decoder that reconstructs it.
   - Applications: Dimensionality reduction, anomaly detection, denoising data.
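To make the core components concrete, here is a minimal sketch (not part of the original notes) of a one-hidden-layer network in NumPy: a ReLU hidden layer, an MSE loss, and a single gradient-descent weight update. The layer sizes, data, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples, 4 input features, 1 continuous target (illustrative only)
X = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 1))

# Parameters of a 4 -> 5 -> 1 network (input, hidden, output layers)
W1, b1 = rng.normal(scale=0.1, size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(5, 1)), np.zeros(1)
lr = 0.1  # learning rate (assumed value)

# Forward pass: hidden layer with ReLU activation, linear output layer
Z1 = X @ W1 + b1
A1 = np.maximum(0, Z1)             # ReLU activation
y_pred = A1 @ W2 + b2
loss = np.mean((y_pred - y) ** 2)  # MSE loss function

# Backward pass (chain rule) and one gradient-descent update
dY = 2 * (y_pred - y) / len(X)
dW2, db2 = A1.T @ dY, dY.sum(axis=0)
dA1 = dY @ W2.T
dZ1 = dA1 * (Z1 > 0)               # ReLU derivative
dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
for param, grad in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]:
    param -= lr * grad             # optimizer step: gradient descent
print("loss:", loss)
```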
Applications of Deep Learning

Deep learning models have a wide array of applications across various industries:

- Healthcare: Medical image analysis, drug discovery, personalized treatment plans.
- Automotive: Autonomous driving, driver assistance systems.
- Finance: Fraud detection, algorithmic trading, risk management.
- Entertainment: Content recommendation, video game AI, music composition.
- Natural Language Processing: Chatbots, language translation, sentiment analysis.
- Robotics: Object manipulation, navigation, human-robot interaction.

Advantages of Deep Learning

- Automatic Feature Extraction: Eliminates the need for manual feature engineering, allowing models to learn directly from raw data.
- Scalability: Can handle large volumes of data and complex models with millions of parameters.
- Versatility: Applicable to diverse domains and tasks, from vision and speech to text and beyond.
- Performance: Achieves state-of-the-art results in many benchmark tasks, often surpassing human-level performance.

Challenges and Considerations

- Data Requirements: Deep learning models typically require vast amounts of labeled data, which can be costly and time-consuming to obtain.
- Computational Resources: Training deep models demands significant computational power, often necessitating specialized hardware like GPUs.
- Interpretability: Deep networks are often considered "black boxes," making it difficult to understand how decisions are made.
- Overfitting: Models can become too tailored to training data, reducing their ability to generalize to new, unseen data.

Future of Deep Learning

- As technology advances, deep learning continues to evolve with innovations in architectures, optimization techniques, and applications.
- Areas like unsupervised and self-supervised learning aim to reduce reliance on labeled data, while efforts in explainable AI seek to make models more transparent.
- Additionally, integrating deep learning with other AI fields, such as reinforcement learning and symbolic reasoning, holds promise for creating more robust and versatile intelligent systems.

Historical Trends in Deep Learning

Deep learning, a branch of machine learning, has experienced tremendous growth and transformation over the decades. While its core principles date back to the mid-20th century, it has undergone several stages of advancement due to technological innovations, better algorithms, and increased computational power. Below is a timeline highlighting key historical trends in deep learning.

1. Early Foundations (1940s-1960s)

The foundation for deep learning lies in early research on neural networks and the imitation of human cognition in machines. Several key milestones shaped the beginnings of the field:

- 1943: McCulloch and Pitts: The concept of a neuron as a binary classifier was introduced by Warren McCulloch and Walter Pitts. They proposed a mathematical model of a neuron that laid the groundwork for later neural network research.
- 1958: Perceptron by Frank Rosenblatt: The perceptron was a simple neural network designed to perform binary classification tasks. It could learn by adjusting weights based on input-output relationships, similar to modern deep learning models. However, limitations in handling non-linearly separable data, such as the XOR problem, restricted its capabilities.
- 1960s: Backpropagation Concept Introduced: Although it wasn't widely used until much later, the concept of backpropagation, the algorithm for training multilayer neural networks, was introduced by multiple researchers, including Bryson and Ho.

2. Dormant Period (1970s-1980s)

After initial interest, neural networks entered a period of decline, often called the "AI winter." There was disappointment in the limitations of single-layer perceptrons, and other machine learning methods, such as support vector machines (SVMs) and decision trees, gained traction.

- 1970s: The limitations of early neural networks like the perceptron led to reduced funding and enthusiasm for the approach.
- 1980s: Interest was revived through theoretical work, and some breakthroughs in deep learning principles were laid during this period, though they wouldn't be fully realized for decades.

3. The Reawakening of Neural Networks (1980s-1990s)

- 1986: Backpropagation Popularized: The backpropagation algorithm, rediscovered and popularized by Geoffrey Hinton, David Rumelhart, and Ronald J. Williams, enabled the training of multi-layer perceptrons, which overcame the limitations of single-layer models. This development reignited interest in neural networks and laid the groundwork for future deep learning models.
- 1989: Convolutional Neural Networks (CNNs) Introduced: Yann LeCun developed the first CNN, LeNet, designed for image classification tasks. LeNet was able to recognize handwritten digits and was used by banks to process checks, marking one of the earliest practical applications of deep learning.
- 1990s: Recurrent Neural Networks (RNNs): Researchers like Jürgen Schmidhuber and Sepp Hochreiter developed Long Short-Term Memory (LSTM) networks in 1997, solving the problem of vanishing gradients in standard RNNs and allowing neural networks to better handle sequential data.

4. Emergence of Deep Learning (2000s)

- 2006: Deep Belief Networks (DBNs): Geoffrey Hinton and his team proposed the idea of using deep belief networks, a type of unsupervised deep neural network. This marked the beginning of modern deep learning, where the goal was to train deeper neural networks that could learn complex representations.
- 2007-2009: GPU Acceleration: The adoption of Graphics Processing Units (GPUs) for deep learning computations drastically improved the ability to train deeper networks faster. This technological breakthrough allowed for more practical training of neural networks with multiple layers.

5. Breakthrough Era (2010s)

The 2010s are often referred to as the "Golden Age" of deep learning. With the combination of better hardware (especially GPUs), large datasets, and advanced algorithms, deep learning achieved state-of-the-art performance across various domains.
- 2012: AlexNet and the ImageNet Competition: A deep CNN called AlexNet, developed by Alex Krizhevsky together with Ilya Sutskever and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge by a large margin. This victory demonstrated the power of deep learning in image recognition and spurred widespread interest in the field.
- 2014: Generative Adversarial Networks (GANs): Introduced by Ian Goodfellow, GANs became one of the most revolutionary architectures in deep learning. GANs consist of two networks, a generator and a discriminator, that compete against each other, enabling the creation of highly realistic synthetic data.
- 2014-2015: VGGNet and ResNet: VGGNet and ResNet were breakthroughs in CNN architectures that allowed deeper networks to be trained without performance degradation. ResNet's introduction of skip connections solved the problem of vanishing gradients for very deep networks.
- 2017: Transformers and Attention Mechanisms: The introduction of the Transformer model by Vaswani et al. transformed the field of natural language processing (NLP). The Transformer, which uses self-attention mechanisms to process sequences in parallel, has since become the foundation of cutting-edge NLP models, including BERT and GPT.
- 2018-2019: Transfer Learning and Pre-trained Models: Large pre-trained models like BERT (from Google) and GPT-2 (from OpenAI) demonstrated the power of transfer learning, where a model pre-trained on massive datasets can be fine-tuned for specific tasks with smaller datasets, drastically reducing training time and improving performance.

6. Modern Trends (2020s and Beyond)

The 2020s have seen deep learning evolve further, with a focus on more efficient models, ethical AI practices, and novel applications.

- Transformer Dominance: The transformer architecture has become ubiquitous, particularly in NLP. Models like GPT-3 (2020) and ChatGPT have demonstrated unprecedented language generation abilities, paving the way for practical AI applications in content generation, summarization, and conversational AI.
- Deep Reinforcement Learning: Deep learning has been integrated with reinforcement learning to create AI agents capable of mastering complex environments. Breakthroughs like AlphaGo and AlphaZero (developed by DeepMind) demonstrate the potential of AI in learning strategies through trial and error in dynamic environments.
- Ethics and Interpretability: As deep learning models are increasingly deployed in real-world applications, attention has shifted toward ensuring fairness, reducing biases, and improving the interpretability of these "black box" models.
- Resource Efficiency: There has been growing interest in optimizing deep learning models to make them more resource-efficient, addressing concerns about the environmental impact of training massive models. Techniques like pruning, quantization, and distillation aim to reduce the computational and energy demands of deep learning models.

Chapter-02: Machine Learning Basics

Machine learning allows computers to learn from data to improve their performance on certain tasks. The main components of machine learning are the task (T), the performance measure (P), and the experience (E). These three elements form the basis of any machine learning algorithm.
1. The Task (T)

The task in machine learning is the problem that we want the system to solve. It could be recognizing images, predicting numbers, translating languages, or even detecting fraud. The task doesn't include learning itself; it refers to the goal or action we want the machine to perform.

Some common tasks include:

- Classification: The algorithm assigns an input (like an image) into one of several categories. For example, identifying whether an image is of a cat or a dog is a classification task.
- Regression: The algorithm predicts a continuous value, like forecasting house prices or stock market trends.
- Transcription: The algorithm converts unstructured data into a structured format, such as recognizing text in images (optical character recognition) or converting speech into text.
- Machine Translation: Translating text from one language to another, like English to French.
- Anomaly Detection: Finding unusual patterns or behaviors, such as detecting fraud in transactions.
- Structured Output: Tasks where the output involves multiple values that are connected, such as generating captions for images.
- Synthesis and Sampling: The algorithm creates new data that is similar to the training data, like generating realistic images or audio.
- Imputation of Missing Values: Predicting missing data points based on the available information.
- Denoising: Cleaning up corrupted data by predicting what the original data was before it got corrupted.
- Density Estimation: Learning the probability distribution that explains how data points are spread out in the dataset.

2. The Performance Measure (P)

The performance measure tells us how well the machine learning algorithm is doing. It helps us compare the system's predictions with the actual results. Different tasks require different performance measures.

For example, in classification tasks, the performance measure might be accuracy, which tells us how many predictions were correct. Alternatively, we can measure the error rate, which counts how many predictions were wrong. In some cases, we may want a more detailed performance measure, such as giving partial credit for partially correct answers. (A small example appears after the discussion of experience below.)

For tasks that don't involve predicting categories (like density estimation), accuracy isn't useful, so we use other performance measures, like log-probability.

3. The Experience (E)

The experience refers to the data that the algorithm learns from. There are different types of experiences:

- Supervised Learning: The system is trained using data that includes both input features and their corresponding outputs or labels. For example, training a model with labeled images of cats and dogs, so it learns to classify them.
- Unsupervised Learning: The system is trained using data without labels. It tries to find patterns or structure in the data, such as grouping similar data points together (clustering) or estimating the data distribution (density estimation).
- Semi-Supervised Learning: Some examples in the training data have labels, but others don't. This is useful when getting labeled data is difficult or expensive.
- Reinforcement Learning: The system learns by interacting with an environment and receiving feedback based on its actions. This approach is used in robotics and game playing, where the system gets rewards or penalties based on the decisions it makes.
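The sketch below (not part of the original notes) computes accuracy, error rate, and average log-probability for a toy set of predictions, illustrating the performance measures mentioned above; the arrays are made-up examples.

```python
import numpy as np

# Made-up ground-truth labels and predicted class probabilities for 5 examples
y_true = np.array([0, 1, 1, 0, 1])
p_class1 = np.array([0.2, 0.9, 0.6, 0.4, 0.3])    # predicted P(y = 1)

y_pred = (p_class1 >= 0.5).astype(int)            # hard predictions

accuracy = np.mean(y_pred == y_true)              # fraction of correct predictions
error_rate = 1.0 - accuracy                       # fraction of wrong predictions

# Average log-probability the model assigns to the true labels
p_true = np.where(y_true == 1, p_class1, 1 - p_class1)
avg_log_prob = np.mean(np.log(p_true))

print(accuracy, error_rate, avg_log_prob)
```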
Example: Linear Regression

To make the concept clearer, we can look at an example of a machine learning task called linear regression, which predicts a continuous value. In linear regression, the algorithm uses the input data (represented as a vector) to predict a value by calculating a linear combination of the input features.

For example, if you want to predict the price of a house based on its size and location, the algorithm might use a linear function to estimate the price. The output is calculated by multiplying the input features by their corresponding weights and summing them up. The weights are the parameters that the algorithm adjusts during training. The goal is to find the weights that minimize the mean squared error (MSE), which measures how far off the predictions are from the actual values.

[Figure: Linear regression example (fitted line) and optimization of the weight w.]

Supervised Learning Algorithms

Supervised learning algorithms learn to map inputs (x) to outputs (y) using a training set. These outputs often require human intervention to label but can also be collected automatically.

1. Probabilistic Supervised Learning

- Most supervised learning algorithms estimate the probability of the output y given the input x, written p(y | x). This can be done using maximum likelihood estimation, which finds the best parameters θ for a distribution.

2. Logistic Regression

- In linear regression, we predict continuous values using a normal distribution.
- For classification tasks (e.g., binary classification), we predict a class by squashing the output into a probability between 0 and 1 using the logistic sigmoid function σ(θᵀx).
- This technique is known as logistic regression. Despite its name, it is used for classification, not regression.

3. Finding Optimal Weights

- Linear regression allows us to compute optimal weights using a simple formula (the normal equations); a short sketch at the end of this section shows the idea.
- Logistic regression does not have a closed-form solution. Instead, the optimal weights are found by minimizing the negative log-likelihood (NLL) using gradient descent.

4. k-Nearest Neighbors (k-NN)

- k-NN is a non-parametric algorithm used for classification or regression. It doesn't have a traditional training phase; instead, it stores all training data.
- At test time, it finds the k nearest neighbors of a test point and predicts the output by averaging their values.
- For classification, it averages over one-hot encoded vectors to get a probability distribution over classes.
- Strength: It can achieve high accuracy when plenty of training examples are available.
- Weakness: It struggles with small datasets and with computational efficiency, and because it treats all features equally, irrelevant features hurt its performance.

5. Decision Trees

- Decision trees divide the input space into regions based on decisions made at each node of the tree. Internal nodes make binary decisions, and leaf nodes map each region to a constant output.
- Strength: They are easy to understand and interpret.
- Weakness: Decision trees may struggle with problems where decision boundaries aren't axis-aligned, requiring many nodes to approximate simple boundaries.
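As a follow-up to the linear-regression example and the note on the normal equations above, here is a minimal NumPy sketch (an illustration, not from the notes) that fits the weights in closed form and reports the MSE; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y is roughly a linear function of two features plus noise
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w + 0.5 + 0.1 * rng.normal(size=100)

# Add a column of ones so the bias term is learned as an extra weight
X_b = np.hstack([X, np.ones((100, 1))])

# Normal equations: w = (X^T X)^(-1) X^T y  (closed-form MSE minimizer)
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

mse = np.mean((X_b @ w - y) ** 2)
print("weights:", w, "MSE:", mse)

# Logistic regression has no such closed form; its weights would instead be
# found by gradient descent on the negative log-likelihood.
```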
Unsupervised Learning Algorithms

Unsupervised learning algorithms deal with data that contains only features and no labeled targets. They aim to extract meaningful patterns or structures from the data without human supervision, and they are often used for tasks like clustering, density estimation, and learning data representations.

1. Goals of Unsupervised Learning

The main goal in unsupervised learning is often to find the best representation of the data. A good representation preserves the most important information about the data while simplifying it or making it easier to work with.

2. Types of Representations

There are three common types of data representations:

- Low-Dimensional Representations: Compress the data into fewer dimensions while retaining as much information as possible.
- Sparse Representations: Map the data into a higher-dimensional space where most of the values are zero. This structure makes the representation more efficient and reduces redundancy.
- Independent Representations: Try to separate the underlying sources of variation in the data, making the features statistically independent.

3. Benefits of Good Representations

- Reducing the dimensionality of the data helps with compression and makes it easier to identify and use the key features.
- Sparse and independent representations make the data easier to interpret and process in machine learning algorithms.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction and data representation. It finds a lower-dimensional representation of the data while preserving as much information as possible.

1. Goals of PCA

PCA reduces the dimensionality of the data while ensuring that the new representation's features are decorrelated (no linear correlations between the features). It is a step toward achieving statistical independence of the features, though PCA only removes linear relationships.

2. How PCA Works

- Linear Transformation: PCA projects the data onto new axes that capture the directions of maximum variance in the data.
- The algorithm learns an orthogonal transformation that projects an input x to a new representation z = xᵀW, where W is a matrix of principal components (the directions of maximum variance).
- The first principal component explains the most variance in the data, and each subsequent component captures the remaining variance while being orthogonal to the previous ones.

3. Covariance and Dimensionality Reduction

- PCA transforms the data such that the covariance matrix of the new representation is diagonal, meaning the new features are uncorrelated.
- It uses the eigenvectors of the data's covariance matrix, or singular value decomposition (SVD), to find the directions of maximum variance.
- The result is a compact, decorrelated representation of the data that can be used for further analysis while minimizing information loss.

k-Means Clustering

k-Means clustering is a simple and widely used unsupervised learning algorithm. It divides a dataset into k clusters, grouping examples that are close to each other in the feature space. Each data point is assigned to the nearest cluster, and the algorithm iteratively refines these clusters.

1. How k-Means Works

- The algorithm begins by initializing k centroids (cluster centers), which are assigned random values.
- Assignment Step: Each data point is assigned to the nearest centroid, forming clusters.
- Update Step: Each centroid is recalculated as the mean of the points assigned to it.
- This process repeats until the centroids no longer change significantly, signaling convergence.
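A minimal NumPy sketch of the assignment and update steps just described (illustrative only; real implementations add smarter initialization and convergence criteria):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))          # toy dataset: 60 points, 2 features
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization

for _ in range(20):                    # fixed iteration budget for simplicity
    # Assignment step: index of the nearest centroid for every point
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centroids, centroids):   # convergence check
        break
    centroids = new_centroids

print(centroids)
```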
2. One-Hot Representation

- k-means clustering provides a one-hot representation for each data point. If a point belongs to cluster i, its representation vector h has a 1 at position i and 0 everywhere else.
- This is an example of a sparse representation, because only one element in the vector is non-zero for each point.
- However, this representation is limited because it treats clusters as mutually exclusive and doesn't capture relationships between different clusters.

3. Limitations of k-Means

- Ill-posed Problem: There is no single, definitive way to evaluate how well the clustering reflects real-world structures. For example, clustering based on vehicle color (red vs. gray) is as valid as clustering based on type (car vs. truck), but each reveals different information.
- Lack of Fine-Grained Similarity: k-means provides a strict one-hot output, which doesn't capture nuanced similarities between examples. For instance, it can't show that red cars are more similar to gray cars than to gray trucks.

4. Comparison with Distributed Representations

- In contrast to one-hot encoding, a distributed representation captures multiple attributes for each data point. For example, vehicles could be described by both color and type (e.g., car or truck), allowing for more detailed comparisons.
- Distributed representations are more flexible and can capture complex relationships between data points, reducing the burden on the algorithm to find a single attribute for clustering.

Module-02: Feedforward Networks and Deep Learning

Introduction to Feedforward Neural Networks

1.1 Basic Concepts

- A feedforward neural network is the simplest form of artificial neural network (ANN).
- Information moves in only one direction: forward, from input nodes through hidden nodes to output nodes.
- No cycles or loops exist in the network structure.

1.2 Historical Context

1. Origins
   - Inspired by biological neural networks.
   - First proposed by Warren McCulloch and Walter Pitts (1943).
   - Significant advancement with the perceptron by Frank Rosenblatt (1958).
2. Evolution
   - Single-layer to multi-layer networks.
   - Development of backpropagation in 1986.
   - Modern deep learning revolution (2012-present).

[Figure: Feedforward network with an input layer, hidden layers, and an output layer.]

1.3 Network Architecture

1. Input Layer
   - Receives raw input data.
   - No computation performed.
   - Number of neurons equals the number of input features.
   - Standardization/normalization often applied here.
2. Hidden Layers
   - Perform intermediate computations.
   - A network can have multiple hidden layers.
   - Each neuron is connected to all neurons in the previous layer.
   - Feature extraction and transformation occur here.
3. Output Layer
   - Produces the final network output.
   - Number of neurons depends on the problem type.
   - Classification: typically one neuron per class.
   - Regression: usually one neuron.
1.4 Activation Functions

1. Sigmoid (Logistic)
   - Formula: σ(x) = 1 / (1 + e^(-x))
   - Range: (0, 1)
   - Used in binary classification.
   - Properties: smooth gradient; clear prediction probability; suffers from vanishing gradients.
2. Hyperbolic Tangent (tanh)
   - Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
   - Range: (-1, 1)
   - Often performs better than sigmoid.
   - Properties: zero-centered; stronger gradients; still has the vanishing gradient issue.
3. ReLU (Rectified Linear Unit)
   - Formula: f(x) = max(0, x)
   - Most commonly used.
   - Helps solve the vanishing gradient problem.
   - Properties: computationally efficient; no saturation in the positive region; "dying ReLU" problem.
4. Leaky ReLU
   - Formula: f(x) = max(0.01x, x)
   - Addresses the dying ReLU problem with a small negative slope.
   - Properties: never completely dies; allows negative values; more robust than standard ReLU.

2. Gradient-Based Learning

2.1 Understanding Gradients

1. Definition
   - The gradient is a vector of partial derivatives.
   - It points in the direction of steepest increase.
   - It is used to minimize the loss function.
2. Properties
   - Direction indicates the fastest increase.
   - Magnitude indicates steepness.
   - The negative gradient is used for minimization.

2.2 Cost Functions

1. Mean Squared Error (MSE)
   - Used for regression problems.
   - Formula: MSE = (1/n) Σ (y_true - y_pred)²
   - Properties: always positive; penalizes larger errors more; differentiable.
2. Cross-Entropy Loss
   - Used for classification problems.
   - Formula: -Σ y_true · log(y_pred)
   - Properties: measures the difference between probability distributions; better for classification than MSE; provides stronger gradients.
3. Huber Loss
   - Combines MSE and MAE; less sensitive to outliers.
   - Formula:
     L = 0.5 (y - f(x))²          if |y - f(x)| <= δ
     L = δ |y - f(x)| - 0.5 δ²    otherwise

2.3 Gradient Descent Types

1. Batch Gradient Descent
   - Uses the entire dataset for each update.
   - More stable but slower.
   - Formula: θ = θ - α ∇J(θ)
   - Memory intensive for large datasets.
2. Stochastic Gradient Descent (SGD)
   - Updates parameters after each sample.
   - Faster but less stable.
   - Better for large datasets.
   - High variance in parameter updates.
3. Mini-batch Gradient Descent
   - Compromise between batch and SGD.
   - Updates parameters after small batches.
   - Most commonly used in practice.
   - Typical batch sizes: 32, 64, 128.
4. Advanced Optimizers
   a) Adam (Adaptive Moment Estimation)
      - Combines momentum and RMSprop.
      - Adaptive learning rates.
      - The update formula includes first and second moments of the gradient.
   b) RMSprop
      - Adaptive learning rates.
      - Divides by a running average of gradient magnitudes.
   c) Momentum
      - Adds a fraction of the previous update.
      - Helps escape local minima and reduces oscillation.

3. Backpropagation and the Chain Rule

3.1 Chain Rule Fundamentals

1. Mathematical Basis
   - df/dx = df/dy * dy/dx
   - Allows computation of derivatives of composite functions.
   - Essential for neural network training.
2. Application in Neural Networks
   - Computes gradients layer by layer.
   - Propagates the error backwards.
   - Updates weights based on their contribution to the error.

3.2 Forward Pass

1. Input Processing
   - Data normalization.
   - Weight initialization.
   - Bias addition.
2. Layer Computation

```python
# Pseudo-code for the forward pass
for layer in network:
    Z = W @ A + b          # linear transformation
    A = activation(Z)      # apply activation function
```

3. Output Generation
   - Final layer activation.
   - Prediction computation.
   - Error calculation.
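Before deriving the backward pass, it can help to see the chain rule verified numerically. The sketch below (illustrative, not from the notes) compares an analytic gradient with a finite-difference estimate for a tiny linear layer plus MSE loss; the sizes and perturbation value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 3))      # 5 samples, 3 features (toy data)
y = rng.normal(size=(5, 1))
W = rng.normal(size=(3, 1))

def loss(W):
    """Forward pass: linear layer followed by an MSE loss."""
    return np.mean((X @ W - y) ** 2)

# Analytic gradient from the chain rule: dL/dW = (2/m) * X^T (XW - y)
grad_analytic = 2 * X.T @ (X @ W - y) / len(X)

# Numerical gradient: central finite differences on each weight
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, 0] += eps
    W_minus[i, 0] -= eps
    grad_numeric[i, 0] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be very close to zero
```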
3.3 Backward Pass

1. Error Calculation
   - Compare the output with the target.
   - Calculate the loss using the cost function.
   - Initialize the gradient computation.
2. Weight Updates
   - Calculate gradients using the chain rule.
   - Update weights: w_new = w_old - learning_rate * gradient
   - Update biases similarly.
3. Detailed Steps

```python
# Pseudo-code for the backward pass
# Output layer (MSE loss)
dZ = A - Y
dW = (1/m) * dZ @ A_prev.T
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)

# Hidden layers
dZ = dA * activation_derivative(Z)
dW = (1/m) * dZ @ A_prev.T
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
```

4. Regularization for Deep Learning

4.1 L1 Regularization

1. Mathematical Form
   - Adds the absolute values of the weights to the loss.
   - Formula: L1 = λ Σ |w|
   - Promotes sparsity.
2. Properties
   - Feature selection capability.
   - Produces sparse models.
   - Less sensitive to outliers.

4.2 L2 Regularization

1. Mathematical Form
   - Adds the squared weights to the loss.
   - Formula: L2 = λ Σ w²
   - Prevents large weights.
2. Properties
   - Smooth weight decay.
   - No sparse solutions.
   - More stable training.

4.3 Dropout

1. Basic Concept
   - Randomly deactivate neurons.
   - Probability p of keeping each neuron.
   - A different sub-network is used for each training batch.
2. Implementation Details

```python
# Pseudo-code for (inverted) dropout
mask = np.random.binomial(1, p, size=layer_size)
A = A * mask
A = A / p          # scale to maintain the expected value
```

3. Training vs. Testing
   - Used only during training.
   - Scaled appropriately during inference.
   - Acts as a model ensemble.

4.4 Early Stopping

1. Implementation
   - Monitor the validation error.
   - Save the best model.
   - Stop when the validation error increases.
2. Benefits
   - Prevents overfitting.
   - Reduces training time.
   - Automatic model selection.

5. Advanced Concepts

5.1 Batch Normalization

1. Purpose
   - Normalizes layer inputs.
   - Reduces internal covariate shift.
   - Speeds up training.
2. Algorithm

```python
# Pseudo-code for batch normalization
mean = np.mean(x, axis=0)
var = np.var(x, axis=0)
x_norm = (x - mean) / np.sqrt(var + eps)
out = gamma * x_norm + beta
```

5.2 Weight Initialization

1. Xavier/Glorot Initialization
   - Variance = 2 / (n_in + n_out)
   - Suitable for tanh activation.
2. He Initialization
   - Variance = 2 / n_in
   - Better for ReLU activation.
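A minimal NumPy sketch of the two initialization schemes above (illustrative; the layer sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out = 256, 128               # example layer sizes (assumed)

# Xavier/Glorot: variance 2 / (n_in + n_out), suited to tanh layers
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

# He: variance 2 / n_in, suited to ReLU layers
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

print(W_xavier.var(), W_he.var())    # empirical variances near 2/(n_in+n_out) and 2/n_in
```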
6. Practical Implementation

6.1 Network Design Considerations

1. Architecture Choices
   - Number of layers.
   - Neurons per layer.
   - Activation functions.
2. Hyperparameter Selection
   - Learning rate.
   - Batch size.
   - Regularization strength.

6.2 Training Process

1. Data Preparation
   - Splitting the data.
   - Normalization.
   - Augmentation.
2. Training Loop
   - Forward pass.
   - Loss computation.
   - Backward pass.
   - Parameter updates.

Practice Problems and Exercises

1. Basic Concepts
   - Explain the role of activation functions in neural networks.
   - Compare and contrast different types of gradient descent.
   - Describe the vanishing gradient problem.
2. Mathematical Problems
   - Calculate gradients for a simple 2-layer network.
   - Implement the batch normalization equations.
   - Compute different loss functions.
3. Implementation Challenges
   - Design a network for MNIST classification.
   - Implement dropout in Python.
   - Create a custom loss function.

Key Formulas Reference Sheet

1. Activation Functions
   - Sigmoid: σ(x) = 1 / (1 + e^(-x))
   - tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
   - ReLU: f(x) = max(0, x)
2. Loss Functions
   - MSE = (1/n) Σ (y_true - y_pred)²
   - Cross-Entropy = -Σ y_true · log(y_pred)
3. Regularization
   - L1 = λ Σ |w|
   - L2 = λ Σ w²
4. Gradient Descent
   - Update: w = w - α ∇J(w)
   - Momentum: v = βv - α ∇J(w)

Common Issues and Solutions

1. Vanishing Gradients
   - Use ReLU activation.
   - Implement batch normalization.
   - Try residual connections.
2. Overfitting
   - Add dropout.
   - Use regularization.
   - Implement early stopping.
3. Poor Convergence
   - Adjust the learning rate.
   - Try different optimizers.
   - Check data normalization.

Module-03: Optimization for Training Deep Models

Introduction to Optimization in Deep Learning

Definition

- Optimization: Adjusting model parameters (weights, biases) to minimize the loss function.
- Loss Function: Measures the error between predicted outputs and actual targets.
- Goal: Find parameters that reduce the error and improve predictions.

[Figure: Loss plotted against a weight value, illustrating the search for the lowest point.]

Key Objective

- Generalization: Ensure the model performs well on new, unseen data.
  - Underfitting: The model is too simple and doesn't capture the patterns.
  - Overfitting: The model is too complex, learns noise, and performs poorly on new data.

Challenges

1. High Dimensionality of the Parameter Space
   - Deep learning models have millions of parameters.
   - Exploring this vast space is computationally challenging.
2. Non-convex Loss Surfaces
   - Loss surfaces are complex, with many local minima and saddle points.
     - Local Minima: Points where the loss is low, but not the lowest.
     - Saddle Points: Flat regions that slow down optimization.
   - It is hard to find the absolute best solution (the global minimum).

Strategies to Overcome Challenges

- Gradient Descent Variants:
  - Stochastic Gradient Descent (SGD): Efficiently updates parameters using small batches of data.
  - Adam, RMSprop: Advanced methods that adapt learning rates during training.
- Regularization Techniques:
  - L1/L2 Regularization: Adds penalties to prevent overfitting.
  - Dropout: Randomly disables neurons during training to reduce reliance on specific neurons.
- Learning Rate Scheduling:
  - Dynamically adjusts the learning rate to ensure better convergence.
- Momentum and Adaptive Methods:
  - Momentum: Helps move faster towards the minima by considering past gradients.
  - Adaptive Methods: Adjust learning rates based on gradient history for stable training.

Empirical Risk Minimization (ERM)

Concept

- Empirical Risk Minimization (ERM) is a foundational concept in machine learning.
- It involves minimizing the average loss on the training data to approximate the true risk, i.e., the error over the entire data distribution.
- The objective of ERM is to train a model that performs well on unseen data by minimizing the empirical risk derived from the training set.

[Figure: Empirical risk and confidence as functions of the complexity of the function set.]

Mathematical Formulation

The empirical risk is calculated as the average loss over the training set:

R_emp(w) = (1/n) Σ_{i=1..n} L(f(x_i; w), y_i)

where:
- n = number of training examples,
- f(x_i; w) = the model's prediction for input x_i with parameters w,
- y_i = the true target for example i,
- L = the loss function that measures the error between the prediction and the target.

The goal is to find the parameters w that minimize R_emp(w).
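The sketch below (illustrative, not from the notes) evaluates the empirical risk of a linear model on a toy training set using squared-error loss, mirroring the formula above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy training set and a candidate linear model f(x; w) = x . w
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)
w = np.array([0.8, -0.9, 1.7])           # some candidate parameters

def squared_error(pred, target):
    return (pred - target) ** 2           # per-example loss L

# Empirical risk: average loss over the n training examples
predictions = X @ w
R_emp = np.mean(squared_error(predictions, y))
print("empirical risk:", R_emp)
```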
Overfitting vs. Generalization

1. Overfitting:
   - Occurs when the model performs extremely well on the training data but poorly on unseen test data.
   - The model learns the noise and specific patterns in the training set, which do not generalize.
   - Symptoms: high training accuracy, low test accuracy.
2. Generalization:
   - The ability of a model to perform well on new, unseen data.
   - A generalized model strikes a balance between fitting the training data and maintaining good performance on the test data.
   - Symptoms: balanced performance on both training and test datasets.

Regularization Techniques

To combat overfitting and enhance generalization, several regularization techniques are employed:

1. L1/L2 Regularization: These techniques add penalty terms to the loss function, discouraging overly complex models (L1 encourages sparse weights, L2 penalizes large weights).
2. Dropout: A regularization method that randomly "drops out" a fraction of neurons during training.
   - This prevents units from co-adapting too much, forcing the network to learn more robust features.
   - During each training iteration, some neurons are ignored (set to zero), which helps in reducing overfitting and improving generalization.

Challenges in Neural Network Optimization

1. Non-Convexity
   - Nature: Loss surfaces in neural networks are non-convex.
   - Challenges:
     - Multiple Local Minima: The loss is low but not the lowest globally.
     - Saddle Points: Gradients are zero but not at minima or maxima, causing slow convergence.
   - Visualization: Loss landscape diagrams show complex terrains with hills, valleys, and flat regions.

2. Vanishing and Exploding Gradients
   - Vanishing Gradients:
     - Problem: Gradients become very small as they backpropagate.
     - Impact: Slow learning, especially in earlier layers.
   - Exploding Gradients:
     - Problem: Gradients grow excessively large.
     - Impact: Unstable updates, leading to divergence or very large parameter values.
   - Solutions:
     - ReLU Activation: Prevents vanishing gradients by not saturating for positive inputs.
     - Gradient Clipping: Caps gradients to prevent them from becoming too large.

3. Poorly Scaled Parameter Updates
   - Definition: Occurs when parameter updates are poorly scaled.
   - Impact: Inefficient training, with some parameters updating too quickly or too slowly.
   - Solution: Normalization techniques:
     - Batch Normalization: Normalizes layer inputs for consistent scaling.
     - Other Normalization: Layer Normalization, Group Normalization.

Basic Algorithms: Stochastic Gradient Descent (SGD)

1. Gradient Descent (GD)
   - Concept: Gradient Descent is an optimization algorithm used to minimize a loss function by updating the model's parameters iteratively.
   - Update rule: w = w - α ∇_w L(w), where α is the learning rate.
   - Process:
     - Compute the gradient of the loss function.
     - Update the parameters in the opposite direction of the gradient.
     - Repeat until convergence.

2. Stochastic Gradient Descent (SGD)
   - Concept: Stochastic Gradient Descent improves upon standard GD by updating the model parameters using a randomly selected mini-batch of the training data rather than the entire dataset.
   - Update rule: w = w - α ∇_w L(w), with the gradient computed on the current mini-batch.
   - Advantages:
     - Faster Updates: Each update is quicker since it uses a small batch of data.
     - Efficiency: Reduces computational cost, especially for large datasets.
   - Challenges:
     - Noisier Convergence: Due to randomness, the convergence path is less smooth and can fluctuate.
     - Requires More Iterations: Often requires more epochs to converge.
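A small sketch (illustrative only) of one epoch of mini-batch SGD for a linear model, showing the shuffling and batching that distinguish SGD from full-batch gradient descent; the batch size and learning rate are assumed values.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(256, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=256)

w = np.zeros(4)
lr, batch_size = 0.05, 32

# One epoch of mini-batch SGD on squared-error loss
perm = rng.permutation(len(X))                   # shuffle examples each epoch
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]         # indices of the current mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of MSE on the batch
    w -= lr * grad                               # SGD parameter update
print(w)
```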
3. Learning Rate
   - Definition: The learning rate controls the size of the step taken towards minimizing the loss during each update.
   - Impact:
     - Too high: causes overshooting of the minimum.
     - Too low: leads to slow convergence.
   - Strategies:
     - Learning Rate Decay: Gradually reduce the learning rate as training progresses.
     - Warm Restarts: Periodically reset the learning rate to a higher value to escape local minima.

4. Momentum
   - Concept: Momentum helps accelerate convergence by combining the current gradient with a fraction of the previous update, smoothing updates and reducing oscillations.
   - Update rule: v = βv - α ∇_w L(w),  w = w + v, where β is the momentum coefficient.
   - Benefits:
     - Smoother Updates: Reduces fluctuations in updates, leading to more stable convergence.
     - Faster Convergence: Helps convergence, especially in regions with shallow gradients.

Importance of Parameter Initialization

- Prevents Vanishing/Exploding Gradients:
  - Proper initialization ensures that gradients remain within a manageable range during backpropagation.
  - Poor initialization can lead to gradients that either vanish (become too small) or explode (become too large), hindering effective learning.
- Accelerates Convergence:
  - Well-initialized parameters help the network converge faster, reducing training time.
  - It ensures that the model starts training with meaningful gradients, leading to efficient optimization.

2. Initialization Strategies

a. Xavier Initialization (Glorot Initialization)
   - Concept:
     - Designed for sigmoid and tanh activations.
     - Ensures that the variance of the outputs of a layer remains roughly constant across layers.
   - Formula: weights are drawn with variance Var(W) = 2 / (n_in + n_out), where n_in and n_out are the numbers of input and output units of the layer.
   - Benefits:
     - Balances the scale of gradients flowing in both the forward and backward directions.
     - Helps prevent saturation in sigmoid/tanh activations, maintaining effective learning.

b. He Initialization
   - Concept:
     - Specifically designed for ReLU and its variants.
     - Accounts for the fact that ReLU activation outputs are not symmetrically distributed around zero.
   - Formula: weights are drawn with variance Var(W) = 2 / n_in.
   - Benefits:
     - Prevents the dying ReLU problem (where neurons output zero for all inputs).
     - Maintains gradient flow and supports faster convergence.

3. Practical Impact
   - Faster Convergence: Proper initialization provides a good starting point for optimization, reducing the number of iterations required to converge.
   - Better Final Accuracy: Empirical studies show that networks with proper initialization not only converge faster but also achieve better final accuracy. Poor initialization can lead to suboptimal solutions or longer training times.

Algorithms with Adaptive Learning Rates

1. Motivation
   - Need for Adaptive Learning Rates:
     - Fixed learning rates can be ineffective because they do not account for the varying characteristics of different layers or the nature of the training data.
     - Certain parameters may require larger updates, while others may need smaller adjustments.
     - Adaptive learning rates enable the model to adjust learning based on the training dynamics.

2. AdaGrad
   - Concept:
     - AdaGrad (Adaptive Gradient Algorithm) adapts the learning rate for each parameter based on the past gradients.
     - It increases the learning rate for infrequent features and decreases it for frequent features, making it particularly effective for sparse data scenarios.
   - Update Rule: accumulate squared gradients r = r + g², then update w = w - (α / (sqrt(r) + ε)) · g, where g is the current gradient.
   - Advantages:
     - Good for Sparse Data: AdaGrad performs well in scenarios where features have varying frequencies, such as in natural language processing tasks.
     - Diminishing Learning Rate: As training progresses, the learning rates decrease, preventing overshooting of the minimum.
   - Challenges:
     - Rapid Learning Rate Decay: The learning rate can decrease too quickly, leading to premature convergence and potentially suboptimal solutions.

3. RMSProp
   - Concept:
     - RMSProp (Root Mean Square Propagation) improves upon AdaGrad by using a moving average of squared gradients, addressing the rapid decay issue of AdaGrad's learning rate.
   - Update Rule: w = w - (α / sqrt(E[g²] + ε)) · g
   - Advantages:
     - More Stable Convergence: By maintaining a moving average, RMSProp helps stabilize updates, ensuring the learning rate does not decrease too quickly.
     - Effective for Non-Stationary Objectives: It performs well on problems where the data distribution may change over time.
   - Note: the moving average of squared gradients is computed using a decay factor β: E[g²] = β E[g²] + (1 - β) g².

Choosing the Right Optimization Algorithm

1. Factors to Consider
   - Data Size:
     - Large datasets may require optimization algorithms that can handle more frequent updates (e.g., SGD or mini-batch variants).
     - Smaller datasets may benefit from adaptive methods that adjust learning rates (e.g., AdaGrad or Adam).
   - Model Complexity:
     - Complex models (deep networks) can benefit from algorithms that adjust learning rates dynamically (e.g., RMSProp or Adam) to navigate complex loss surfaces effectively.
     - Simpler models may work well with standard SGD.
   - Computational Resources:
     - Resource availability may dictate the choice of algorithm. Some algorithms (e.g., Adam) are more computationally intensive because they maintain additional state information (like momentum and moving averages).

2. Comparison of Optimization Algorithms
   - Stochastic Gradient Descent (SGD):
     - Pros: Simple and effective; widely used in practice.
     - Cons: Requires careful tuning of learning rates and may converge slowly.
   - AdaGrad:
     - Pros: Adapts learning rates based on parameter frequency; effective for sparse data.
     - Cons: Tends to slow down learning too quickly due to rapid decay of learning rates.
   - RMSProp:
     - Pros: Balances learning rates dynamically; provides stable convergence, especially in non-stationary problems.
     - Cons: Requires tuning of the decay rate parameter.
   - Adam (Adaptive Moment Estimation):
     - Pros: Combines momentum with adaptive learning rates; generally performs well across a wide range of tasks and is robust to hyperparameter settings.
     - Cons: More complex to implement and requires careful tuning for optimal performance.

3. Practical Tips
   - Start with Adam: For most tasks, beginning with the Adam optimizer is recommended due to its versatility and strong performance in various scenarios.
   - Fine-Tune Learning Rates: Experiment with different learning rates to find the best fit for your specific model and data. A common approach is to perform a learning rate search or use techniques like cyclical learning rates.
   - Use Learning Rate Scheduling: Implement learning rate schedules (e.g., decay, step-wise, or cosine annealing) to adjust the learning rate dynamically during training for improved convergence and performance.
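The sketch below (illustrative only) implements two of the schedules just mentioned, step decay and cosine annealing, as plain Python functions of the epoch number; the base rate and schedule constants are assumed values.

```python
import math

base_lr = 0.1          # initial learning rate (assumed)
total_epochs = 50

def step_decay(epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

def cosine_annealing(epoch, min_lr=0.001):
    """Smoothly decay from base_lr to min_lr over total_epochs."""
    cos_term = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term

for epoch in (0, 10, 25, 49):
    print(epoch, round(step_decay(epoch), 4), round(cosine_annealing(epoch), 4))
```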
Case Studies and Practical Implementations

1. Image Classification with a CNN
   - Objective:
     - Train a Convolutional Neural Network (CNN) on the CIFAR-10 dataset using Stochastic Gradient Descent (SGD) and RMSProp.
     - Compare the performance in terms of learning curves, loss, and accuracy.
   - Dataset:
     - CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The classes include airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
   - Model Architecture:
     - Use a simple CNN architecture with convolutional layers, ReLU activations, pooling layers, and a fully connected output layer.
   - Training Process:
     - Implement two training runs: one using SGD and the other using RMSProp.
     - Hyperparameters:
       - Learning Rate: Set initial values (e.g., 0.01 for SGD, 0.001 for RMSProp).
       - Batch Size: Use mini-batches (e.g., 32).
       - Number of Epochs: Train for a predetermined number of epochs (e.g., 50).
   - Comparison Metrics:
     - Learning Curves: Plot training and validation accuracy and loss over epochs for both optimizers.
     - Loss and Accuracy: Analyze the final training and validation loss and accuracy after training completion.
   - Expected Results:
     - RMSProp is anticipated to achieve faster convergence and higher accuracy compared to SGD, particularly in the later epochs, due to its adaptive learning rates.

2. NLP Task with an RNN/Transformer
   - Objective:
     - Train a Recurrent Neural Network (RNN) or Transformer model on text data to highlight vanishing gradient issues and compare different optimizers (SGD, AdaGrad, RMSProp).
   - Dataset:
     - Use a text dataset such as IMDB reviews for sentiment analysis, or any sequence data suitable for RNNs or Transformers.
   - Model Architecture:
     - Implement either an RNN or Transformer architecture, depending on the chosen approach.
     - Include layers such as LSTM or GRU for RNNs, or attention mechanisms for Transformers.
   - Training Process:
     - Conduct training with different optimizers: SGD, AdaGrad, and RMSProp.
     - Hyperparameters:
       - Learning Rates: Start with different learning rates for each optimizer.
       - Batch Size: Use appropriate batch sizes for the model.
       - Number of Epochs: Set a common epoch count for all models.
   - Vanishing Gradient Issues:
     - Discuss how RNNs are susceptible to vanishing gradients, leading to difficulties in learning long-range dependencies in sequences. This problem can be less pronounced in Transformers due to their attention mechanism.
   - Comparison Metrics:
     - Loss Curves: Visualize the loss curves for each optimizer to show convergence behavior.
     - Training Performance: Analyze the final training and validation accuracy and loss.
   - Expected Results:
     - RMSProp and AdaGrad may show better performance than SGD, particularly in tasks where the data is sparse or where gradients can vanish, leading to slower convergence.

Visualization

- Loss Curves:
  - Plot the training and validation loss curves for each optimizer used in both case studies. This visualization will demonstrate:
    - Convergence Behavior: How quickly each optimizer converges to a lower loss value.
    - Stability: The stability of the loss reduction over time and the presence of fluctuations.
- Learning Curves:
  - Include plots of training and validation accuracy over epochs for visual comparison of model performance across different optimizers.
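A minimal matplotlib sketch (not from the notes) of the kind of loss-curve plot described above; the loss values here are made-up placeholders standing in for the per-epoch losses logged during training.

```python
import matplotlib.pyplot as plt

epochs = list(range(1, 11))
# Placeholder per-epoch training losses for two optimizers (illustrative numbers only)
loss_sgd     = [2.3, 2.0, 1.8, 1.7, 1.6, 1.55, 1.5, 1.47, 1.45, 1.43]
loss_rmsprop = [2.3, 1.8, 1.5, 1.3, 1.2, 1.15, 1.1, 1.08, 1.05, 1.03]

plt.plot(epochs, loss_sgd, label="SGD")
plt.plot(epochs, loss_rmsprop, label="RMSProp")
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.title("Loss curves for two optimizers")
plt.legend()
plt.show()
```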
Convolution

1. Definition of Convolution

- Convolution: A mathematical operation that combines two functions (an input signal/image and a filter/kernel) to produce a third function.
- Purpose: Captures important patterns and structures in the input data, crucial for tasks like image recognition.

[Figure: CNN pipeline from an image patch through two hidden layers of feature maps (4 and 8 feature maps) to the class units, using a 9x9x1 convolution kernel and 5x5 pooling kernels.]

2. Formulation

For a 2D input I(x, y) and a 2D filter K:

(I * K)(x, y) = Σ_m Σ_n I(m, n) K(x - m, y - n)

where:
- I(x, y): input image,
- K: filter/kernel,
- (I * K)(x, y): output feature map after convolution.

3. Parameters of Convolution

a. Stride
- Definition: The number of pixels the filter moves over the input.
- Types:
  - Stride of 1: The filter moves one pixel at a time, resulting in a detailed output.
  - Stride of 2: The filter moves two pixels at a time, reducing the output size (downsampling).

b. Padding
- Definition: Adding extra pixels around the input image.
- Types:
  - Valid Padding: No padding applied; results in a smaller output feature map.
  - Same Padding: Padding applied to maintain the same output dimensions as the input.

4. Significance in Neural Networks
- Application: Used in the convolutional layers of CNNs to extract features from images.
- Learning Hierarchical Representations: Stacked convolutional layers enable learning of complex patterns, essential for image classification and other tasks.

Pooling

1. Purpose of Pooling
- Spatial Size Reduction: Decreases the dimensions of the feature maps.
- Parameter and Computation Reduction: Reduces the number of parameters and computations in the network.
- Overfitting Control: Helps control overfitting by providing a form of translational invariance.

[Figure: A CNN composed of convolution, pooling, and dense layers.]

2. Types of Pooling

a. Max Pooling
- Definition: Selects the maximum value from each patch (sub-region) of the feature map.
- Purpose: Captures the most prominent features while reducing spatial dimensions.

b. Average Pooling
- Definition: Takes the average value from each patch of the feature map.
- Purpose: Provides a smooth representation of features, reducing sensitivity to noise.

3. Operation of Pooling

For a pooling layer with pooling size p and stride s (max pooling shown):

output(x, y) = max over 0 <= i, j < p of input(x·s + i, y·s + j)

where output(x, y) is the resulting value at position (x, y) in the output feature map.

4. Significance in Neural Networks
- Feature Extraction: Reduces the size of the feature maps while retaining the most relevant features.
- Efficiency: Decreases the computational load, allowing deeper networks to train faster.
- Robustness: Provides a degree of invariance to small translations in the input, making the model more robust.
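A small NumPy sketch (illustrative, not from the notes) of a valid, stride-1 2D convolution followed by 2x2 max pooling with stride 2, matching the operations defined above:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D convolution with stride 1 (kernel flipped, as in the definition)."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * flipped)
    return out

def max_pool(feature_map, p=2, s=2):
    """Max pooling with pooling size p and stride s."""
    out_h = (feature_map.shape[0] - p) // s + 1
    out_w = (feature_map.shape[1] - p) // s + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = feature_map[y * s:y * s + p, x * s:x * s + p].max()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 filter
features = conv2d_valid(image, kernel)             # 5x5 feature map
pooled = max_pool(features)                        # 2x2 pooled map
print(features.shape, pooled.shape)
```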
Convolution and Pooling as Infinitely Strong Priors

1. Convolution as an Infinitely Strong Prior
- Focus on Local Patterns: Emphasizes the importance of local patterns in the data (e.g., edges and textures) over global patterns.
- Effectiveness in CNNs: This locality assumption enhances the effectiveness of Convolutional Neural Networks (CNNs) for image and video analysis.

2. Pooling as an Infinitely Strong Prior
- Enhances Translational Invariance: Allows the network to recognize objects regardless of their position within the image.
- Reduces Sensitivity to Position: By downsampling, pooling reduces sensitivity to the exact location of features, improving generalization.

3. Significance in Neural Networks
- Feature Learning: Both operations prioritize local features, enabling efficient learning of essential characteristics from the input data.
- Improved Generalization: The combination of convolution and pooling enhances the model's ability to generalize across various input variations.

Variants of the Basic Convolution Function

1. Dilated Convolutions
- Definition: Introduce spacing (dilation) between kernel elements.
- Wider Context: Allow the model to incorporate a wider context of the input data without significantly increasing the number of parameters.
- Applications: Useful in tasks where understanding broader spatial relationships is important, such as semantic segmentation.

2. Depthwise Separable Convolutions
- Two-Stage Process:
  - Depthwise Convolution: Applies a separate convolution for each input channel, reducing computational complexity.
  - Pointwise Convolution: Uses 1x1 convolutions to combine the outputs from the depthwise convolution.
- Parameter Efficiency: Reduce the number of parameters and computations compared to standard convolutions while maintaining performance.
- Applications: Commonly used in lightweight models, such as MobileNets, for mobile and edge devices.

Structured Outputs

1. Definition of Structured Outputs
- Structured Outputs: Refers to tasks where the output has a specific structure or spatial arrangement, such as pixel-wise predictions in image segmentation or keypoint localization in object detection.

2. Importance in Semantic Segmentation
- Maintaining Spatial Structure: For tasks like semantic segmentation, it is crucial to maintain the spatial relationships between pixels in the predictions, so that the output accurately represents the original input image.

3. Specialized Networks
- Network Design: Specialized neural network architectures, such as Fully Convolutional Networks (FCNs), are designed to handle structured outputs by replacing fully connected layers with convolutional layers, allowing for spatially consistent predictions.
- Skip Connections: Techniques like skip connections (used in U-Net and ResNet) help preserve high-resolution features from earlier layers, improving the accuracy of the output.

4. Adjusted Loss Functions
- Loss Function Modification: Loss functions may be adjusted to enforce structural consistency in the predictions. Common approaches include:
  - Pixel-wise Loss: Evaluating the loss on a per-pixel basis (e.g., cross-entropy loss for segmentation).
  - Structural Loss: Incorporating penalties for structural deviations, such as Dice Loss or Intersection over Union (IoU) metrics, which consider the overlap between predicted and true regions.
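To make the overlap-based losses concrete, here is a small NumPy sketch (illustrative, not from the notes) computing the Dice coefficient and IoU for a pair of binary masks, along with the corresponding Dice loss.

```python
import numpy as np

# Toy binary segmentation masks: 1 = object pixel, 0 = background
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=float)
true = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=float)

eps = 1e-7                               # avoids division by zero for empty masks
intersection = np.sum(pred * true)       # pixels predicted and labeled as object
union = np.sum(pred) + np.sum(true) - intersection

iou = intersection / (union + eps)                             # Intersection over Union
dice = 2 * intersection / (np.sum(pred) + np.sum(true) + eps)  # Dice coefficient
dice_loss = 1 - dice                     # a common way to turn Dice into a loss

print("IoU:", iou, "Dice:", dice, "Dice loss:", dice_loss)
```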