ML Combined
Dr.S. Nagaraju
Adjunct Faculty,
Computer Science and Engineering, IIITDMK
Detailed Syllabus
Arthur Samuel, a pioneer in artificial intelligence and computer gaming, coined
the term “Machine Learning”.
The ML process starts with feeding good-quality datasets and then training our machines (computers) by building machine learning models from the data using different algorithms.
The choice of algorithms depends on what dataset we have and what kind of
task we are trying to automate.
ML models are also used to power autonomous vehicles, drones, and robots,
making them more intelligent and adaptable to changing environments.
The Machine Learning Process
Step 1: Import the necessary Python Libraries:
i. numpy: It provides support for large, multi-dimensional arrays and matrices, along
with a variety of mathematical functions.
ii. pandas: It provides data structures like DataFrames to efficiently handle and
analyze datasets.
iii. scikit-learn: It provides a wide range of modules for classification, regression,
clustering, dimensionality reduction, and more.
iv. TensorFlow: It is an open-source deep learning framework that supports both high-
level APIs and low-level operations for building and training neural networks.
v. PyTorch: It is widely used for research and applications in machine learning and
deep learning.
vi. Matplotlib: It provides a wide range of 2D plot types and customization options.
vii. Seaborn: Built on top of Matplotlib, Seaborn provides a higher-level interface for
creating attractive and informative statistical graphics. It's particularly well-suited for
visualizing statistical relationships in your data.
1. Import the necessary Python Libraries:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
# Load your dataset into a pandas DataFrame
df = pd.read_csv('your_dataset.csv')
# Select independent (features) and dependent (target) columns using iloc[] from the DataFrame
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
iloc selects rows and columns by index; the selected features and target values are stored in the variables X and y, respectively.
Cleaning the dataset involves preparing your dataset by handling missing values,
removing duplicates, and addressing inconsistencies.
# Step 1: Handle Missing Values
# Count the number of missing values in each column
missing_values = df.isnull().sum()
print(missing_values)
# Fill missing age values with the median age
df['Age'].fillna(df['Age'].median(), inplace=True)
# Fill missing salary values with the mean salary
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
# Step 2: Remove Duplicates
df.drop_duplicates(inplace=True)
# Step 3: Replace inconsistent gender values
df['Gender'] = df['Gender'].str.lower()
df['Gender'].replace({'m': 'male', 'f': 'female'}, inplace=True)
# Encode categorical variables using techniques like one-hot encoding or label encoding.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
Here, the goal is to estimate the values of c and m that minimize the
differences between actual targets and the predicted targets by the model.
Assume the following table shows the given x and y training set values:
The cost function quantifies the error between the predicted values made by the linear regression model and the actual target values in the training data.
It provides a measure of the discrepancy between the model's predictions and the
actual target values.
The goal is to minimize the error or loss, which typically involves finding the
optimal parameters (slope m and intercept c) for the linear equation that defines
the best-fitting line.
The cost function most commonly used in linear regression is the Mean Squared
Error (MSE).
The MSE is calculated by taking the average of the squared differences between the predicted values and the actual target values:
MSE = (1/n) * Σ (y_i − ŷ_i)^2
where ŷ_i = m * x_i + c is the model's prediction, y_i is the actual target value, and n is the number of training examples.
It is one of the most used optimization techniques in machine learning projects for
updating the parameters of a model in order to minimize a cost function.
The main aim of gradient descent is to find the best parameters of a model which
gives the highest accuracy on training as well as testing datasets.
In gradient descent, the gradient is a vector that points in the direction of the
steepest increase of the function at a specific point.
Moving in the opposite direction of the gradient allows the algorithm to gradually descend towards lower values of the function, eventually reaching the minimum of the function.
Step 1 We first initialize the parameters of the model randomly.
Step 2 Compute the gradient of the cost function with respect to each parameter.
Step 3 Update the parameters of the model by taking steps in the opposite direction of the gradient.
Step 4 Repeat steps 2 and 3 iteratively to get the best parameter for the defined
model.
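As an illustration of these steps, here is a minimal NumPy sketch (with assumed toy data) that fits the slope m and intercept c of a line by repeatedly stepping against the gradient of the MSE cost:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # assumed example inputs
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # assumed example targets (roughly y = 2x + 1)

m, c = 0.0, 0.0          # Step 1: initialize the parameters
learning_rate = 0.01
n = len(x)

for epoch in range(1000):
    y_pred = m * x + c                          # current predictions
    dm = (-2.0 / n) * np.sum(x * (y - y_pred))  # Step 2: gradient of MSE w.r.t. m
    dc = (-2.0 / n) * np.sum(y - y_pred)        # Step 2: gradient of MSE w.r.t. c
    m -= learning_rate * dm                     # Step 3: move against the gradient
    c -= learning_rate * dc

print(m, c)  # should approach m ≈ 2, c ≈ 1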
Learning systems refer to the algorithms and methodologies used by machines to
improve their performance on a specific task through experience.
These systems enable machines to learn from data and adapt their behavior without
being explicitly programmed.
The goal is to learn a mapping from inputs to outputs so that the algorithm can
make predictions on new, unseen data.
Support Vector Machines (SVM), Linear and Logistic Regression are common
examples of supervised learning.
Learning: The algorithm learns patterns in the text, subject, sender, etc.,
associated with spam emails.
Here, the algorithm is given an unlabeled dataset and is tasked with finding
patterns, structures, or relationships within the data.
Learning: The algorithm identifies hidden patterns and clusters customers with
similar buying habits.
This approach combines elements of both supervised and unsupervised learning.
It uses a small amount of labelled data along with a larger amount of unlabelled
data to improve the learning process.
Learning: The model learns translation patterns from the labeled data and
generalizes to translate new, unlabelled sentences.
In reinforcement learning, an agent learns to interact with an environment to
achieve a specific goal.
The agent takes actions and receives rewards or penalties based on its actions.
The goal is to learn a policy that maximizes the cumulative reward over time.
Task: The agent learns to make moves that maximize its chances of winning
(maximizing rewards) over time.
Learning: The agent explores different moves and learns to associate actions
with rewards through trial and error.
Deep learning is a subset of machine learning that involves neural networks with
multiple layers (deep neural networks).
These networks are capable of automatically learning features from raw data and
have been particularly successful in tasks like image and speech recognition.
Task: Train a deep neural network to recognize and classify objects in images.
Transfer learning is particularly useful when labelled data is insufficient for the target task.
Task: Use the pretrained model to classify medical images (e.g., X-rays).
Learning: The model leverages knowledge from its previous task to improve
performance on the new, related task.
In online learning, the model is updated continuously as new data becomes
available.
This is useful for applications where data arrives in a stream and the model needs
to adapt over time.
Learning: The model updates its predictions as new stock prices arrive,
adapting to changing market conditions.
Active learning involves an iterative process where the model selects the most
informative samples from a pool of unlabeled data and requests labels for those
samples.
This helps the model to learn more efficiently with fewer labelled examples.
Task: Select the most informative documents to label for training a classifier.
Learning: The model actively chooses documents that will provide the most
improvement in classification accuracy.
GANs consist of two networks, a generator and a discriminator, that compete with
each other.
The generator creates data instances, while the discriminator tries to distinguish
between real and generated data.
Cross-validation helps to obtain a more robust estimate of the model's performance compared to a single train-test split.
Divide your dataset into k subsets, where k is the number of folds in cross-
validation. Common choices are 5 or 10 folds.
Each subset is called a fold. For this example, let's consider k = 5.
Step 2: Cross-Validation Loop
Evaluate the trained model on the test set and calculate the performance
metrics of interest (e.g., accuracy, precision, recall, F1-score).
Repeat steps 3-5 for each fold, using a different fold as the test set in each
iteration.
Step 7: Performance Aggregation
After completing all iterations (i.e., testing the model on all folds), aggregate the
performance metrics obtained from each fold to get an overall assessment of the
model's performance.
Step 9: Result
The final result is a more accurate and robust estimate of the model's
performance than what would have been obtained from a single train-test split.
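As a hedged illustration, the following scikit-learn sketch runs 5-fold cross-validation end to end; the iris dataset and logistic regression model are assumed placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score performs the split/train/evaluate loop and returns one score per fold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())  # aggregated performance across folds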
In LOOCV, each data point is used as the test set once, while the remaining points are used
for training.
This means that if you have N data points, you'll have N iterations.
LOOCV provides low bias since it uses almost all the data for training in each iteration.
It ensures that each fold maintains the same class distribution in the training and test sets.
This helps to prevent situations where a fold contains only one class in
training/test set, making it difficult to train and evaluate the model properly.
k-Fold CV: A common choice for general model evaluation when computational
resources are limited.
Stratified CV: Specifically useful for classification tasks with imbalanced classes.
LOOCV: Lower bias, higher variance due to training on almost the entire dataset.
k-Fold CV: More efficient than LOOCV, suitable for moderate-sized datasets.
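For reference, a small sketch of how the three strategies are instantiated in scikit-learn (n_splits=5 is an assumed choice):

from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut

kf = KFold(n_splits=5, shuffle=True, random_state=42)             # general-purpose k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios in each fold
loo = LeaveOneOut()                                               # one test sample per iteration
# Any of these can be passed as the cv argument, e.g. cross_val_score(model, X, y, cv=skf)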
It is a type of supervised learning, where the algorithm learns from labeled training
data to make predictions about new, unseen data points.
The goal is to build a classification model that can automatically classify emails
as either "spam" or "not spam" (ham).
For instance, you might tokenize the text (splitting it into words), remove stop
words (common words like "the," "is," etc. that don't carry much meaning), and
convert the words into numerical representations.
The dataset is usually split into two parts: the training set and the testing set.
During training, the model learns the patterns and relationships between the
features and the corresponding labels.
After training, you use the testing set to evaluate how well the model performs
on new, unseen data.
You measure various metrics such as accuracy, precision, recall, and F1-score to assess the model's performance.
Logistic Regression: Despite its name, logistic regression is used for binary
classification. It models the probability of belonging to a certain class using a
logistic function.
Decision Trees: These create a tree-like model of decisions based on the values of
features. Each internal node represents a decision based on a feature, and each leaf
node represents a class label.
Random Forest: This is an ensemble method that combines multiple decision
trees to improve accuracy and robustness.
Support Vector Machines (SVM): SVM finds a hyperplane that best separates
different classes, aiming to maximize the margin between the classes.
K-Nearest Neighbors (KNN): KNN classifies data points based on the classes of
their k nearest neighbors in the training data.
Naive Bayes: This algorithm uses Bayes' theorem to calculate the probability of a
data point belonging to a particular class based on its features.
It is used for binary classification, which involves predicting one of two possible
outcomes (usually represented as 0 and 1) based on input features.
Despite its name, logistic regression is primarily used for classification rather than
regression tasks.
It models the probability of the binary outcome occurring as a function of the input
features.
It serves as the foundation for more complex algorithms like Support Vector
Machines and neural networks.
The sigmoid function maps any input value to a value between 0 and 1, which
can be interpreted as a probability.
The output of the sigmoid function, which ranges from 0 to 1, represents the
probability of the Class (1).
The optimization process adjusts the weights and bias to minimize the cost
function and improve the model's predictions.
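The following is a minimal sketch of the sigmoid function and a scikit-learn logistic regression classifier; the breast-cancer dataset is an assumed example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real value to a probability in (0, 1)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)                  # optimization adjusts the weights and bias

probs = clf.predict_proba(X_test)[:, 1]    # probability of Class (1)
preds = (probs >= 0.5).astype(int)         # threshold the probability at 0.5
print("Test accuracy:", (preds == y_test).mean())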
The following performance evaluation metrics are essential to assess the quality and effectiveness of classification algorithms/models:
Confusion Matrix
Accuracy
Precision
Recall
F1-Score
True Negative Rate
False Positive Rate
Precision-Recall Curve
Area Under the PR Curve
ROC Curve
Area Under the ROC Curve
Suppose you're working on a binary classification task, such as diagnosing whether patients have a certain disease (cancer) or detecting spam.
As a worked example, consider a spam detection dataset with 100 emails and a binary classification model.
Confusion Matrix: A table showing the counts of true positives, true negatives, false positives, and false negatives.
True Positives (TP): 40
True Negatives (TN): 50
False Positives (FP): 5
False Negatives (FN): 5

                   Predicted Positive   Predicted Negative
Actual Positive    40 (TP)              5 (FN)
Actual Negative    5 (FP)               50 (TN)
Accuracy: The proportion of correctly classified instances out of the total
instances.
(TP + TN) / (TP + TN + FP + FN) = (40 + 50) / 100 = 90%
Precision: The ratio of true positive (TP) predictions to the total positive
predictions (TP + false positives).
TP / (TP + FP) = 40 / (40 + 5) = 88.89%
Recall (Sensitivity or True Positive Rate): The ratio of true positive predictions
to the total actual positives (TP + false negatives).
TP / (TP + FN) = 40 / (40 + 5) = 88.89%
False Positive Rate: The ratio of false positive predictions to the total actual
negatives (FP + TN).
FP / (FP + TN) = 5 / (5 + 50) = 9.09%
Area Under the ROC Curve (AUC-ROC): The area under the ROC curve,
which measures the overall performance of a binary classification model.
AUC value from the ROC curve.
Precision-Recall Curve: A graphical representation of precision against recall at
various thresholds.
Graph plotting Precision against Recall at various thresholds.
Area Under the Precision-Recall Curve (AUC-PR): The area under the
precision-recall curve, providing an alternative to AUC-ROC for imbalanced
datasets.
AUC value from the precision-recall curve.
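A hedged scikit-learn sketch that reproduces the worked example above (assumed label/prediction arrays giving TP=40, TN=50, FP=5, FN=5):

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1] * 45 + [0] * 55)                       # 45 actual positives, 55 actual negatives
y_pred = np.array([1] * 40 + [0] * 5 + [1] * 5 + [0] * 50)   # 40 TP, 5 FN, 5 FP, 50 TN

print(confusion_matrix(y_true, y_pred))              # [[50 5] [5 40]] (rows: actual 0, actual 1)
print("Accuracy :", accuracy_score(y_true, y_pred))  # 0.90
print("Precision:", precision_score(y_true, y_pred)) # ~0.889
print("Recall   :", recall_score(y_true, y_pred))    # ~0.889
print("F1-score :", f1_score(y_true, y_pred))        # ~0.889
# ROC/PR curves and their AUCs additionally require predicted probabilities rather than hard labels.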
Decision Trees: These create a tree-like model of decisions based on the values of
features.
Each internal node represents a decision based on a feature, and each leaf node
represents a class label.
Inertia (in k-means clustering): A lower inertia value indicates that data points within clusters are closer to their cluster centers, implying better clustering.
It is used for dimensionality reduction and data analysis.
Its primary goal is to transform a dataset with multiple correlated variables into a
new set of principal components.
These components capture the most significant patterns of variance present in the
original data.
PCA helps to improve the model performance and visualize the data in 3D/2D.
PCA involves the following steps to reduce the dimensionality of a dataset while preserving its most important patterns:
1. Standardization:
Given a dataset with n samples and p features, the first step is to standardize the
data by subtracting the mean and dividing by the standard deviation for each
feature.
2. Calculate the Covariance Matrix:
The covariance matrix of the standardized data is calculated.
For a feature matrix X with dimensions n x p, the covariance matrix C is
calculated as: C = (X^T * X) / (n - 1)
3. Compute Eigenvectors and Eigenvalues:
Eigenvectors are the directions in which the data varies the most, and
eigenvalues represent the amount of variance explained by each eigenvector.
The eigenvectors are also known as the principal components.
The equation C * v = λ * v, where v is an eigenvector and λ is its corresponding
eigenvalue, gives us the eigenvectors and eigenvalues.
4. Sort Eigenvectors:
Sort the eigenvectors in decreasing order of their eigenvalues. These
eigenvectors (principal components) capture the most important directions of
variance in the data.
5. Choose Number of Principal Components:
Select a subset of the top k eigenvectors based on the cumulative variance you
want to retain.
6. Project Data onto Lower-Dimensional Space:
Multiply the standardized data by the selected eigenvectors (principal
components) to project the data onto the lower-dimensional space.
This transformation yields the new set of coordinates, called the eigen-
coordinates.
7. Interpret Principal Components:
Each principal component captures a specific pattern of variance in the original
data.
These patterns can be interpreted to understand which features or relationships
contribute most to the variations in the dataset.
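A minimal NumPy sketch following these steps, on assumed random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # assumed: 100 samples, 5 features

X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # 1. standardize each feature
C = (X_std.T @ X_std) / (X_std.shape[0] - 1)    # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)            # 3. eigenvalues/eigenvectors (C is symmetric)

order = np.argsort(eigvals)[::-1]               # 4. sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                           # 5. keep the top k components (assumed k = 2)
X_pca = X_std @ eigvecs[:, :k]                  # 6. project onto the lower-dimensional space

print(X_pca.shape)                              # (100, 2)
print(eigvals[:k] / eigvals.sum())              # 7. variance explained by the retained components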
Recognition systems play a crucial role in various fields, such as machine learning,
computer vision, natural language processing, and signal processing.
These systems are designed to identify and classify patterns, objects, or entities
from input data, which could be images, audio signals, text, or other forms of data.
These models enable computers to perform tasks like image recognition, speech
recognition, natural language processing, and more.
The design cycle for recognition systems follows a series of steps to create the
models.
Breakdown of the design cycle:
1. Problem Definition:
Clearly define the problem you want to solve, such as image classification,
voice recognition, etc.
Specify the input data format and the desired output (classes or labels).
4. Algorithm Selection:
Choose a suitable algorithm based on the problem type and the characteristics
of the data.
Common algorithms include neural networks (for deep learning), support
vector machines, random forests, and k-nearest neighbors, and more.
5. Model Training:
Split the dataset into training and validation sets.
Train the model on the training set, adjusting model parameters and
hyperparameters to optimize performance.
6. Model Evaluation:
Test the trained model on a separate testing dataset to assess its generalization
performance.
Measure the model's performance using appropriate evaluation metrics
(accuracy, precision, recall, F1-score, etc.).
7. Hyperparameter Tuning:
Adjust the model's hyperparameters to find the best configuration that yields
the optimal performance on the testing set.
8. Deployment:
Once satisfied with the model's performance, deploy it into the target system or environment.
9. Monitoring and Maintenance:
Continuously monitor the model's performance in real-world scenarios.
Update the model as needed to address issues or adapt to changing conditions.
10. Feedback and Iteration:
Collect feedback from users and system performance to identify areas for
improvement.
Iterate on the model by refining the data collection, preprocessing, feature
extraction, and model training processes
Non-linearly separable problems refer to classification problems where the classes cannot be separated by a straight line (or, more generally, by a single hyperplane in higher dimensions).
In such cases, using linear classifiers like linear regression or linear support vector
machines (SVMs) may not yield satisfactory results.
Experimentation and testing with different approaches are essential to find the
solution that best fits the problem's characteristics.
Cover's theorem, also known as Cover's theorem on separability, establishes the existence of a hyperplane that can separate any set of points in a high-dimensional space.
"For any given finite set of points in a high-dimensional feature space, there exists
a hyperplane that can separate them perfectly, provided the number of dimensions
is sufficiently high."
While this theorem is theoretically interesting, it's important to note that it doesn't
provide a practical method for finding or constructing such hyperplanes.
The theorem is more of an existence statement and doesn't guarantee that such
separations can be easily identified or used in real-world scenarios.
To address non-linearly separable problems, various techniques and models can be
used:
Support Vector Machines (SVMs) are powerful classifiers, but they are
inherently linear.
However, by using the kernel trick, SVMs can be extended to handle non-linear
data.
The radial basis function (RBF) kernel can map the input data into a higher-
dimensional space where it becomes linearly separable.
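A minimal sketch of this kernel-trick approach with scikit-learn, using the two-moons generator as an assumed non-linearly separable dataset:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')  # RBF kernel handles the non-linear boundary
svm_rbf.fit(X_train, y_train)
print("Test accuracy:", svm_rbf.score(X_test, y_test))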
Neural Networks:
They can capture intricate relationships in the data through hidden layers and
activation functions.
RBF Networks are a type of neural network that use radial basis functions as
activation functions in the hidden layer.
They can approximate non-linear features and are useful for non-linear
classification tasks.
K-Nearest Neighbors (KNN) is a straightforward algorithm that can be used for both classification and regression tasks.
It does not build an explicit model during a training phase; instead, it memorizes the entire training dataset and uses it to make predictions when new data points need to be classified or predicted.
4. Standardization:
Standardize the data by subtracting the mean and dividing by the standard deviation for each feature.
5. Prediction Phase:
When a new, unseen data point is presented for prediction, the algorithm identifies the 'k' closest data points (neighbors) from the training dataset based on Euclidean distance.
The value of 'k' is a user-defined parameter that determines the number of
neighbors to consider.
It's an important hyperparameter that affects the performance of the algorithm.
6. Classification (for classification tasks):
For classification tasks, the algorithm counts the occurrences of each class among
the 'k' nearest neighbors.
The new data point is assigned the class label that is most common among its 'k'
nearest neighbors.
This can be determined by a majority vote.
7. Regression (for regression tasks):
For regression tasks, KNN calculates the average or weighted average of the target
values of the 'k' nearest neighbors.
The predicted value for the new data point is set to this average.
8. Model Evaluation
After training, you use the testing set to evaluate how well the model performs on
new, unseen data.
You measure various metrics such as accuracy, precision, recall, and F1-score to assess the model's performance.
We want to classify points into two classes: "Red" and "Blue".
We have a training dataset with two features (x1 and x2) and corresponding class
labels:
x1 x2 Class
2 3 Red
4 2 Red
4 4 Blue
6 2 Blue
Now, let's say we want to classify a new data point, (x1_new, x2_new) = (5, 3),
using KNN with k = 3.
Calculate Euclidean distances between the new point and all training points:
Distance to (2, 3): sqrt((5-2)^2 + (3-3)^2) = 3
Distance to (4, 2): sqrt((5-4)^2 + (3-2)^2) = sqrt(2)
Distance to (4, 4): sqrt((5-4)^2 + (3-4)^2) = sqrt(2)
Distance to (6, 2): sqrt((5-6)^2 + (3-2)^2) = sqrt(2)
Select the 'k' nearest neighbors (k = 3): (4, 2) Red, (4, 4) Blue, (6, 2) Blue.
Assign the new point to the class with the majority among these neighbors:
Since there are more Blue neighbors (two Blue vs. one Red), the new point is classified as Blue.
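The same toy example can be reproduced with scikit-learn's KNeighborsClassifier (a hedged sketch):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[2, 3], [4, 2], [4, 4], [6, 2]])
y_train = np.array(['Red', 'Red', 'Blue', 'Blue'])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[5, 3]]))                # ['Blue'] -- majority vote among the 3 nearest neighbors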
In this example, we'll use a simplified dataset with one feature and a target value.
Our goal is to predict the target value for a new data point using KNN regression.
We want to predict the target value for a new data point with a feature value of 6
using KNN regression with k = 3.
Calculate Euclidean distances between the new data point and all training points:
Distance to (2, 10): sqrt((6-2)^2) = 4
Distance to (3, 12): sqrt((6-3)^2) = 3
Distance to (5, 15): sqrt((6-5)^2) = 1
Distance to (7, 20): sqrt((6-7)^2) = 1
Distance to (9, 22): sqrt((6-9)^2) = 3
Select the 'k' nearest neighbors (k = 3):
Nearest neighbors: (5, 15), (7, 20), (3, 12).
Predicted target value = (15 + 20 + 12) / 3 ≈ 15.67.
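A corresponding hedged sketch with scikit-learn's KNeighborsRegressor:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[2], [3], [5], [7], [9]])
y_train = np.array([10, 12, 15, 20, 22])

knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_train, y_train)
# ~15.67 if the distance tie at 3 resolves to x = 3, matching the worked example above
print(knn_reg.predict([[6]]))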
Deep Learning is the most exciting and powerful branch of Machine Learning.
Employs deep neural networks with multiple layers to automatically learn complex patterns
and representations from data.
[Figure: Biological Neuron vs. Artificial Neuron]
BNNs are composed of biological neurons, which are complex and highly
specialized cells found in the nervous system of living organisms, including
humans.
ANNs on the other hand, are composed of artificial neurons or nodes, which are
mathematical abstractions designed to mimic the basic processing units of
biological neurons.
Each artificial neuron computes a weighted sum of its inputs and applies an
activation function to produce an output.
ANNs, being computational constructs, are robust in the sense that they can run
without degradation as long as the hardware or software is maintained. However,
their energy efficiency depends on the hardware they are implemented on.
Components of a learning system: Model, Loss Function, Learning Algorithm.
Sigmoid activation functions compress the neuron's output to lie between 0 and
1, which can be used to represent a probability.
The hyperbolic tangent (tanh) function limits the output to lie between -1 and 1.
The Softmax activation function, typically used in the output layer of multi-
class classification problems, gives a distribution of probabilities over multiple
classes.
Certain activation functions, like the sigmoid or tanh, can saturate in regions where their gradients are nearly zero (when the output is close to its extreme values). This can slow down or halt learning.
Other activation functions, such as ReLU (Rectified Linear Unit), do not saturate
in the positive region, which can accelerate convergence.
A model with more sparse activations is often more easily interpretable and can
lead to a more efficient model.
Formula: f(x) = 1 / (1 + e^(−x))
Range: (0, 1)
Pros:
◦ Smooth gradient, preventing “jumps” in output values.
◦ Outputs a probability between 0 and 1.
Cons:
◦ Vanishing gradient problem for very high or very low values of x.
◦ Not zero-centred.
◦ Computationally expensive.
Formula: f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Range: (-1, 1)
Pros:
◦ Smooth gradient.
◦ Zero-centred.
Cons:
◦ Still has the vanishing gradient problem, though not as severe as the sigmoid
function.
◦ Computationally expensive.
Formula: f(x)=max(0 , x)
Range: (0, ∞)
Pros:
◦ Helps mitigate the vanishing gradient problem.
◦ Computationally efficient (no expensive e^x) and allows the model to converge quickly.
Cons:
◦ Dying ReLU problem where neurons can sometimes be stuck during training and not
activate for any data point.
◦ Not zero-centred.
Formula: f(x)=max(0.01x , x)
Range: (-∞, ∞)
Pros:
◦ Addresses the Dying ReLU problem by allowing small negative values.
◦ Does not saturate in positive or negative region.
◦ Computationally efficient (no expensive e^x).
◦ Close to Zero-centred outputs.
Cons:
◦ Performance and benefits are data-dependent.
Formula: f(x_i) = e^(x_i) / Σ_j e^(x_j)
Pros:
◦ Outputs a probability distribution over multiple classes.
Cons: Only used in the output layer.
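For reference, a small NumPy sketch of the activation functions discussed above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # range (0, 1)

def tanh(x):
    return np.tanh(x)                       # range (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)               # range [0, inf)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # small slope for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))               # shift for numerical stability
    return e / e.sum()                      # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), softmax(z), sep='\n')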
Why we need activation functions?
What are the most commonly used activation functions and their mathematical
expressions?
What are the advantages and disadvantages of the Sigmoid activation function?
What is vanishing gradient problem?
Why is ReLU and its variants preferred over sigmoid and tanh for deeper
networks?
What are the potential problems with ReLU, and how are they addressed?
How does the Softmax activation function differ from other activation functions,
and when is it used?
Optimization algorithms play a pivotal role in training deep learning models.
Gradient Descent (GD) is the foundational optimization algorithm.
It is used to minimize the loss or error of a model by updating the model's parameters.
Updates model weights and bias using the gradient of the entire dataset.
Numerous GD variants and entirely new approaches have been developed to
achieve faster convergence.
Pros:
Straightforward and deterministic approach.
Cons:
◦ Can be very slow for large datasets as it processes all data for a single update.
Here's a step-by-step breakdown of the gradient descent algorithm:
Initialization:
Initialize w and b randomly.
Set the learning rate.
Iterate over data:
Compute the prediction ŷ using the sigmoid activation function, i.e., ŷ = 1 / (1 + e^(−(w·x + b)))
The core idea of gradient descent remains same across these variants, with
differences primarily in computational efficiency, convergence speed, and
stability.
What is gradient descent and why is it important?
What is the difference between gradient descent and stochastic gradient descent?
The algorithm computes the gradient of the loss function with respect to
the model's parameters and updates the parameters by iteratively adjusting.
Measures the squared difference between the actual and predicted values.
Formula: MSE = (1/N) * Σ (y_i − ŷ_i)^2
Example: MSE = 1/3 [(2.5 − 3.0)^2 + (3.0 − 3.2)^2 + (3.5 − 3.7)^2] = 0.11
Regression Losses:
Mean Absolute Error (MAE) / L1 Loss:
Measures the absolute difference between the actual and predicted values.
Formula: MAE = (1/N) * Σ |y_i − ŷ_i|
MAE=1/3[|2.5−3.0|+|3.0−3.2|+|3.5−3.7|]=0.3
Classification Losses:
Binary Cross-Entropy (Log Loss):
Measures the logarithmic difference between the actual label and the predicted
probability.
Formula: BCE = −(1/N) * Σ [y_i * log(ŷ_i) + (1 − y_i) * log(1 − ŷ_i)]
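A NumPy sketch that verifies the regression losses on the worked example above and evaluates binary cross-entropy on assumed labels and probabilities:

import numpy as np

y_true = np.array([2.5, 3.0, 3.5])
y_pred = np.array([3.0, 3.2, 3.7])
print("MSE:", np.mean((y_true - y_pred) ** 2))   # 0.11
print("MAE:", np.mean(np.abs(y_true - y_pred)))  # 0.30

labels = np.array([1, 0, 1])                     # assumed true labels
probs = np.array([0.9, 0.2, 0.7])                # assumed predicted probabilities
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print("BCE:", bce)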
The exploding gradient problem occurs when the gradients of the loss with
respect to the model parameters become very large, leading to very large
updates and causing the training to diverge.
Cause:
Example: Let's say we have a neural network layer with 500 input units and 200
output units.
Change Activation Functions:
The choice of the activation function can influence the magnitude of the
gradients.
Examples:
ReLU (Rectified Linear Unit): ReLU can help with the vanishing gradient
problem but might cause exploding gradients in some cases.
Leaky ReLU: An improvement over ReLU, it allows a small gradient for
negative values, reducing the chances of units "dying" during training.
Overfitting and underfitting are fundamental problems in the domain of machine
learning and deep learning, representing two extremes in model performance
concerning its complexity and ability to generalize.
Overfitting:
When a model performs exceptionally well on the training data but poorly on
unseen data (like a validation or test set), it is said to be overfitting.
Overfitting indicates that the model has learned the training data's noise and
outliers, rather than the underlying distribution.
Causes:
High Model Complexity: Deep neural networks with many layers or many neurons
can capture intricate patterns in the training data, including noise and outliers.
Insufficient Training Data: If the model has too many parameters relative to the
number of training samples, it's more likely to overfit.
Lack of Regularization: Regularization techniques help in preventing overfitting.
Underfitting:
Underfitting occurs when a model cannot capture the underlying patterns of the
data.
It performs poorly on both training and unseen data.
This suggests that the model is too simplistic and hasn't captured the
complexities of the training data's distribution.
Causes:
Low Model Complexity: A neural network might not capture complex, non-
linear relationships in the data.
Over-Regularization: Applying too much regularization can restrict the
model's capacity to learn.
Poor Feature Representation: If the input features don't capture the
necessary information about the data, the model might underfit.
It is a type of deep learning model specifically designed for processing grid-like
data, such as images.
CNNs utilize layers with convolutional filters that can automatically and
adaptively learn spatial hierarchies of features.
CNN Applications:
Image Classification: Determining if a given picture is that of a cat, dog, car, etc.
Object Detection: Detecting and drawing bounding boxes around cars in a street image.
Image Segmentation: Marking every pixel of an image as 'cat', 'background', 'dog', etc.
Face Recognition: Unlocking smartphones using facial data or tagging people in photos.
Medical Imaging: Detecting tumors in MRI scans or abnormalities in X-ray images, and many more.
A convolutional layer performs a convolution operation, which involves taking a filter (or kernel, also called a feature detector) and sliding it over the input data (such as an image) to produce a feature map or activation map.
The goal of the convolution operation is to identify specific patterns in the input
data.
These patterns can be simple features like edges or textures in early layers, and as
we move deeper into the network, the patterns can represent more complex
structures.
Filters (or kernels) in a convolutional layer are small weight matrices that are used
to slide over the input data.
For a 2D image, filters are also 2D matrices, typically of size 3x3, 5x5, or 7x7.
The depth of the filter matches the depth of the input data.
For instance, for an RGB image, the input depth is 3, so filters also have a depth of
3.
ReLU introduces non-linearity into the model.
This is crucial for learning complex patterns.
Without a non-linear activation function, no matter how many layers the neural
network has, it would still behave like a single-layer model.
Max pooling layers offer a way to reduce spatial dimensions and retain important
features within CNNs, making the network more robust and efficient.
Many CNN architectures in deep learning use max pooling or average pooling.
Fully Connected Layer
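Putting the convolution, ReLU, pooling, and fully connected pieces together, here is a hedged Keras sketch of a small CNN; the input shape and number of classes are assumptions for illustration:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # 3x3 filters over an RGB image
    MaxPooling2D(pool_size=(2, 2)),                                  # downsample the feature maps
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                                                       # feature maps -> vector
    Dense(128, activation='relu'),                                   # fully connected layer
    Dense(10, activation='softmax'),                                 # assumed 10 output classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()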
Pretrained CNN models have already been trained on ImageNet dataset.
This dataset contains millions of images across thousands of categories.
These models can be used for a wide range of computer vision tasks, including
classification, detection, and more complex tasks.
Here are some popular pretrained CNN models that are frequently used in various
applications:
AlexNet
Depth: 8 layers (5 convolutional and 3 fully connected).
Parameter Count: 60 million parameters.
Learning Capacity: Much greater than shallow networks due to deeper architecture.
Utilized ReLU activation for faster training and used data augmentation and dropout to
combat overfitting.
Won the ImageNet challenge; set a new standard for deep learning in computer vision.
ZFNet (Zeiler & Fergus Net)
Depth: Similar to AlexNet.
Parameter Count: Similar to AlexNet.
Learning Capacity: Improved performance on image recognition tasks.
Features: Introduced a visualization technique for understanding what the intermediate layers learn.
Use Cases: Modified AlexNet to win the ImageNet challenge the following year.
VGG
Depth: 16 to 19 layers.
Parameter Count: 138 million parameters in VGG16, even more in VGG19.
Learning Capacity: Greater than AlexNet and ZFNet due to increased depth.
Features: Utilized smaller (3x3) convolution filters throughout which allowed deeper
networks.
Use Cases: Popular for feature extraction in various image processing tasks.
GoogLeNet
Depth: 22 layers.
Parameter Count: About 4 million.
Learning Capacity: Better than VGG.
Features: Inception modules perform several convolutions in parallel.
Use Cases: Won the ImageNet challenge.
ResNet
Depth: 18 to 152 layers, with ResNet-50, ResNet-101, and ResNet-152 being popular
variants.
Parameter Count: ResNet-50 has around 25 million parameters.
Learning Capacity: Ability to train very deep networks effectively.
Use Cases: Deep ResNets have achieved state-of-the-art results in various tasks.
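As a hedged sketch of how such pretrained models are reused (transfer learning / feature extraction) with Keras Applications; the layer sizes and 5-class output are assumptions:

from keras.applications import ResNet50
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pretrained convolutional layers

x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)
outputs = Dense(5, activation='softmax')(x)  # assumed number of target classes

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])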
Regularization techniques are used to prevent overfitting.
Overfitting occurs when a model learns the detail and noise in the training data to
the extent that it negatively impacts the performance of the model on new data.
Most common regularization techniques used in deep learning:
L2 Regularization
Early Stopping
Data Augmentation
Drop-out
Batch Normalization
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l1_l2

timesteps, features = 10, 1   # assumed example input shape (sequence length, features per step)
model = Sequential()
model.add(LSTM(50, input_shape=(timesteps, features), kernel_regularizer=l1_l2(l1=0.01, l2=0.01)))
model.add(Dense(1))
This snippet is creating a simple Sequential model with one LSTM layer followed by a
Dense layer.
The LSTM layer is using both L1 and L2 regularization with regularization factors of
0.01. The kernel_regularizer applies this regularization to the weights of the LSTM units.
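In the same style, a hedged sketch of two of the other listed techniques, Drop-out and Early Stopping (the input shape and patience value are assumptions):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping

model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dropout(0.5),                      # randomly drop 50% of the units during training
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Stop training when the validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])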
Attention mechanisms enable models to dynamically focus on relevant parts of
the input for making decisions.
They assign weights to different parts of the input, indicating the importance of
each part in the context of the task.
This mechanism addresses the limitations of RNNs and LSTMs in processing
very long sequences.
Applications:
Machine Translation
Sentiment Analysis
Computer Vision etc.
Can be computationally expensive with large inputs, and more prone to overfitting with smaller datasets.
Also known as a feedforward deep neural network or a multi-layer perceptron.
These fundamental networks consist of multiple layers of neurons, including an input layer, several hidden layers, and an output layer; data moves in one direction (forward).
Can capture hidden patterns in data due to multiple layers.
It is used in NLP for machine translation, language modeling, and text
classification.
Needs substantial amounts of labeled data for training to avoid overfitting.
Without proper regularization, they can overfit to training data, reducing
generalization.
Performance heavily relies on the choice of hyperparameters, which often
requires trial and error.
RNN is a type of ANN specifically designed to recognize patterns in sequences
of data.
It can maintain information in 'memory' over time, which helps in understanding
context in text or time series data.
Ability to process entire sequences of data (like a sentence or a time series),
which is essential in many applications including vehicle traffic prediction.
Applications:
Due to the vanishing gradient problem, it's challenging for RNNs to learn
and remember long-range dependencies in a sequence.
RNNs need to selectively read, write and forget the information to learn and
remember long-range dependencies in a sequence.
LSTM, a type of RNN, is designed to address the limitations of traditional RNNs
in handling long-term dependencies.
It effectively captures long-range connections in sequential data, making it highly
suitable for tasks where context over long sequences is crucial.
GRUs often train faster and require fewer resources than LSTMs.
GRUs are suitable for predicting stock prices, weather forecasting, and other
sequential data analysis.
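A hedged Keras sketch of a small GRU model for sequence regression; the sequence length and feature count are assumptions:

from keras.models import Sequential
from keras.layers import GRU, Dense

timesteps, features = 30, 1                       # assumed sequence length and features per step
model = Sequential([
    GRU(32, input_shape=(timesteps, features)),   # GRU layer summarizes the input sequence
    Dense(1),                                     # single regression output (e.g., next value)
])
model.compile(optimizer='adam', loss='mse')
model.summary()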