
ML Combined

The document provides a comprehensive overview of machine learning, including its definition, process, and various applications. It outlines the steps involved in building machine learning models, such as data preparation, algorithm selection, and model evaluation, while also discussing different types of regression techniques and optimization methods like gradient descent. Additionally, it categorizes learning systems into various types, including supervised, unsupervised, and reinforcement learning.

Uploaded by

kumaraayush1807

Machine Learning

Dr.S. Nagaraju
Adjunct Faculty,
Computer Science and Engineering, IIITDMK
Detailed Syllabus
 Arthur Samuel, a pioneer in artificial intelligence and computer gaming, coined
the term “Machine Learning”.

 He defined machine learning as a “field of study that gives machines
(computers) the capability to learn without being explicitly programmed”.

 The ML process starts with feeding in good-quality datasets and then training our
machines (computers) by building machine learning models from the dataset using
different algorithms.

 The choice of algorithms depends on what dataset we have and what kind of
task we are trying to automate.

 The learning process of computers can be automated and improved based on
massive amounts of historical data.
 ML is a subset of AI that develops algorithms by learning the hidden patterns in
datasets and using them to make predictions on new/unseen data.

 ML applications range from image and speech recognition to natural language
processing, recommendation systems, fraud detection, portfolio optimization,
automated tasks, etc.

 ML models are also used to power autonomous vehicles, drones, and robots,
making them more intelligent and adaptable to changing environments.
The Machine Learning Process
Step 1: Import the necessary Python libraries.
i. numpy: It provides support for large, multi-dimensional arrays and matrices, along
with a variety of mathematical functions.
ii. pandas: It provides data structures like DataFrames to efficiently handle and
analyze datasets.
iii. scikit-learn: It provides a wide range of modules for classification, regression,
clustering, dimensionality reduction, and more.
iv. TensorFlow: It is an open-source deep learning framework that supports both high-
level APIs and low-level operations for building and training neural networks.
v. PyTorch: It is widely used for research and applications in machine learning and
deep learning.
vi. Matplotlib: It provides a wide range of 2D plot types and customization options.
vii. Seaborn: Built on top of Matplotlib, Seaborn provides a higher-level interface for
creating attractive and informative statistical graphics. It's particularly well-suited for
visualizing statistical relationships in your data.
1. Import the necessary Python libraries:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
# Load your dataset into a pandas DataFrame
df = pd.read_csv('your_dataset.csv')

# Use the iloc[] indexer to select rows and columns by integer position

# Select the first three rows and the first two columns
X = df.iloc[0:3, 0:2]
In this example, df.iloc[0:3, 0:2] selects the first three rows (index 0, 1, and 2) and the
first two columns (index 0 and 1) of the DataFrame df.

# Select the independent (feature) columns and the dependent (target) column
# from a pandas DataFrame using iloc[]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
Here iloc selects every column except the last as the feature matrix X and the last column as the target vector y; .values converts each selection to a NumPy array.
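The iloc selections above can be tried without a CSV file. The following is a minimal sketch on a small in-memory DataFrame; the column names and values are hypothetical and only serve to show which rows and columns each slice returns.

```python
import pandas as pd

# Hypothetical toy dataset: two feature columns and one target column
df = pd.DataFrame({
    "Age": [25, 32, 47],
    "Salary": [50000, 64000, 120000],
    "Purchased": ["No", "Yes", "Yes"],
})

# First two rows (positions 0 and 1) and first two columns (positions 0 and 1)
first_rows = df.iloc[0:2, 0:2]

# All rows; every column except the last as features, the last column as target
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

print(first_rows)
print(X.shape, y.shape)
```

Note that iloc slices exclude the end position, just like ordinary Python slices.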
 Cleaning the dataset involves preparing your dataset by handling missing values,
removing duplicates, and addressing inconsistencies.
# Step 1: Handle Missing Values
# Count the number of missing values in each column
missing_values = df.isnull().sum()
print(missing_values)
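# Fill missing age values with the median age
# (assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing salary values with the mean salary
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())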
# Step 2: Remove Duplicates
df.drop_duplicates(inplace=True)
# Step 3: Replace inconsistent gender values
df['Gender'] = df['Gender'].str.lower()
df['Gender'] = df['Gender'].replace({'m': 'male', 'f': 'female'})
# Encode categorical variables using techniques like one-hot encoding or label encoding.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the categorical column at index 0; pass the other columns through unchanged
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

from sklearn.preprocessing import LabelEncoder
# Encode the target labels as integers
le = LabelEncoder()
y = le.fit_transform(y)
 Splitting the dataset into training and testing sets is a crucial step in machine learning
to evaluate your model's performance.

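from sklearn.model_selection import train_test_split
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.preprocessing import StandardScaler
# Standardize the numeric columns (index 3 onward); fit the scaler on the training set only
ss = StandardScaler()
X_train[:, 3:] = ss.fit_transform(X_train[:, 3:])
# Reuse the training-set mean and standard deviation for the test set
X_test[:, 3:] = ss.transform(X_test[:, 3:])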
Training Set and Test Set
Feature Scaling
Normalization
 Choose the appropriate type of machine learning algorithm based on your problem (e.g.,
classification, regression, clustering).
 Consider different algorithms such as decision trees, support vector machines, neural
networks, etc.
 Choose an algorithm that suits the dataset size, complexity, and available computational
resources.

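from sklearn.linear_model import LinearRegression
# Fit a linear regression model on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predict the targets for the held-out test set
y_pred = regressor.predict(X_test)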
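The split, scale, train and predict steps can be run end to end as follows. This is a minimal sketch, assuming synthetic data in place of your_dataset.csv; the generating equation, the noise level and the variable names ending in _scaled are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: y = 3*x1 - 2*x2 + 5 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on the training set and evaluate on the unseen test set
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)
y_pred = regressor.predict(X_test_scaled)
test_r2 = regressor.score(X_test_scaled, y_test)

print(X_train.shape, X_test.shape, round(test_r2, 3))
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into training.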
 Regression is used in data analysis to model the relationship between a
dependent variable and one or more independent variables.
 Its main purpose is to predict or estimate the value of the dependent variable
based on the values of the independent variables.
 Regression is widely used for various purposes, such as making predictions for
Stock Price, Climate, Air Quality, Drug Dosage etc.
 The key components of regression are:
 Dependent Variable (Target): This is the variable that you're trying to
predict. It's typically denoted as "y".
 Independent Variables (Predictor Variables or Features): These are the
variables that you use to predict the dependent variable. They are usually
denoted as "X".
 Regression Line/Equation: The goal of regression is to find a mathematical
line that best fits the data points in a way that minimizes the difference between
the predicted values and the actual observed values of the dependent variable.
The equation of the regression line is used to make predictions for new data
points.
 Regression analysis involves various steps, including data collection, model
selection, parameter estimation, and model evaluation.
 Regression models are assessed using metrics such as Mean Squared Error
(MSE), Root Mean Squared Error (RMSE) and the coefficient of determination
(R²).
 These metrics help determine how well the model fits the data and makes
accurate predictions.
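The three metrics named above can be computed directly from their definitions. The sketch below is one possible implementation; the function name regression_metrics and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for actual and predicted targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))                       # average squared error
    rmse = float(np.sqrt(mse))                              # same units as y
    ss_res = float(np.sum(errors ** 2))                     # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation in y
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Hypothetical actual and predicted values
mse, rmse, r2 = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(mse, rmse, r2)
```

For these values the squared errors sum to 2 and the total variation in y is 20, so MSE = 0.5 and R² = 0.9.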
 Polynomial Regression: When the relationship between variables is better
represented by a polynomial curve rather than a straight line, polynomial
regression is used.
 The equation for polynomial regression is y = m1*x + m2*x^2 + ... + mn*x^n + c
Where,
x^1…x^n are the polynomial terms of different degrees.
 Multiple Regression: This extends linear regression to include multiple
independent variables.
 The equation becomes y = m1*x1 + m2*x2 + ... + mn*xn + c, where
m1…mn are the slopes (coefficients) associated with each independent
variable.
Types of Regression Techniques

 Simple linear regression is used to model a linear relationship between one
independent variable (X) and a dependent variable (y).
 It aims to find the straight line that minimizes the difference between the actual
targets and the predicted targets.
 In mathematical terms, the simple linear regression model can be represented
as:
y = m ⋅X + c
Where,
y is the dependent variable.
X is the independent variable.
c is the intercept, which represents the value of y when X is 0.
m is the slope, which represents the change in y for a unit change in X

 Here, the goal is to estimate the values of c and m that minimize the
differences between the actual targets and the targets predicted by the model.
 Assume the following table shows the given x and y training set values:

 Step 1: Calculate the means of x and y, i.e., xmean = 3 and ymean = 14.6
 Step 2: Calculate the slope m = Sxy/Sxx, where Sxy = Σ(x – xmean)(y – ymean) and
Sxx = Σ(x – xmean)^2, i.e., 28/10 = 2.8
 Step 3: Calculate the intercept c = ymean – slope * xmean => 14.6 – 2.8 * 3 = 6.2
 Step 4: The desired equation of the regression model is y = 2.8 x + 6.2
 Step 5: We shall use the Step 4 equation to predict the values of y for the given
values of x.
 The performance of this model can be analysed by calculating the Root Mean
Squared Error (RMSE), which measures the magnitude of the prediction errors, and
the R² value, which measures the proportion of the variance in y explained by the model.
 Sum of Squared Errors (SSE) = Σ Error^2 = 10.8, MSE = SSE/n = 10.8/5 = 2.16 and
RMSE = sqrt(MSE) ≈ 1.47
 Coefficient of Determination (R²) = 1 – SSE/SSy, where SSy = Σ(y – ymean)^2
=> 1 – 10.8/89.2 = 0.8789
 If R² is close to 1 and RMSE is relatively low, it indicates a model that explains a
large portion of the variance and makes accurate predictions.
 If R² is close to 0 and RMSE is relatively high, it indicates a model that does not
explain much variance and makes less accurate predictions.
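The closed-form calculation in the steps above can be sketched in a few lines. The function name fit_simple_linear_regression and the demo data are hypothetical; the second half of the block only re-checks the arithmetic from the summary statistics quoted in the worked example, with n = 5 inferred from MSE = SSE/n.

```python
import math

def fit_simple_linear_regression(x, y):
    """Least-squares slope and intercept via m = Sxy / Sxx and c = ymean - m * xmean."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1
demo_slope, demo_intercept = fit_simple_linear_regression([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])

# Re-check the worked example from its quoted summary statistics
s_xy, s_xx = 28.0, 10.0
x_mean, y_mean = 3.0, 14.6
sse, ss_y = 10.8, 89.2
n = 5  # assumption: inferred from MSE = SSE / n = 2.16

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
rmse = math.sqrt(sse / n)
r2 = 1 - sse / ss_y

print(demo_slope, demo_intercept)
print(round(slope, 2), round(intercept, 2), round(rmse, 4), round(r2, 4))
```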
 Multiple linear regression is used to model a linear relationship between multiple
independent variables (X) and a dependent variable (y).
 It aims to find the linear function (a plane or hyperplane rather than a single
straight line) that minimizes the difference between the actual targets and the
predicted targets.
 In mathematical terms, the multiple linear regression model can be represented
as:
y = m1*x1 + m2*x2 + ... + mn*xn + c
Where,
y is the dependent variable.
x1…xn are the independent variables.
c is the intercept, which represents the value of y when all of x1…xn are 0.
m1…mn are the slopes (coefficients).
 Here, the goal is to estimate the values of c and m1…mn that minimize the
differences between the actual targets and the targets predicted by the model.
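One way to estimate c and m1…mn is ordinary least squares. The sketch below uses NumPy's least-squares solver on hypothetical, noise-free data generated from y = 2*x1 + 3*x2 + 1, so the recovered coefficients can be checked by eye; it is an illustration, not the only estimation method.

```python
import numpy as np

# Hypothetical data with two independent variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# Append a column of ones so the intercept c is estimated with the slopes
design = np.column_stack([X, np.ones(len(X))])

# Solve min ||design @ coeffs - y||^2
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
m1, m2, c = coeffs

print(m1, m2, c)
```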
 Polynomial regression extends the concept of linear regression by fitting a
polynomial equation to the relationship between the independent variable(s) and
the dependent variable.
 It captures more complex and nonlinear relationships between
variables by introducing polynomial terms of higher degrees into the regression
equation.
 In mathematical terms, the polynomial regression model can be represented as:
y = m1*x + m2*x^2 + ... + mn*x^n + c
Where,
y is the dependent variable, x is the independent variable.
x^1…x^n are the polynomial terms of different degrees.
c is the intercept, which represents the value of y when x is 0.
m1…mn are the coefficients of the polynomial terms.
 Here, the goal is to estimate the values of c and m1…mn that minimize the
differences between the actual targets and the targets predicted by the model.
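A quick way to see polynomial regression in action is NumPy's polyfit. The sketch below assumes hypothetical, noise-free data from y = 3*x^2 + 2*x + 1; note that polyfit returns coefficients from the highest degree down, so the order is m2, m1, c.

```python
import numpy as np

# Hypothetical data generated from y = 3*x^2 + 2*x + 1
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 3 * x ** 2 + 2 * x + 1

# Fit a degree-2 polynomial by least squares
m2, m1, c = np.polyfit(x, y, deg=2)

# Predict for a new value of x using the fitted coefficients
y_new = m2 * 3.0 ** 2 + m1 * 3.0 + c

print(m1, m2, c, y_new)
```

Although the fitted curve is nonlinear in x, the model is still linear in its coefficients, which is why least squares applies unchanged.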
1. Forward Pass: The ML algorithm takes in the input dataset and produces predictions.
2. Loss Function: The loss function, also known as the error or cost function, is
used to evaluate the accuracy of the predictions made by the model.
i. The function compares the predicted output of the model to the actual output and
calculates the difference between them.
ii. This difference is known as error or loss.
iii. The goal of the model is to minimize the error or loss function by adjusting its
internal parameters.
3. Model Optimization Process: The process of adjusting the internal parameters
of the model to minimize the error or loss function.
i. This is done using an optimization algorithm, such as gradient descent.
ii. The optimization algorithm calculates the gradient of the error function with respect
to the model’s parameters and uses this information to adjust the parameters to
reduce the error.
iii. The algorithm repeats this process until the error is minimized to a satisfactory level.
 Cost Function: The cost function, also known as a loss function or objective
function, is typically represented by the Mean Squared Error (MSE) in linear regression.

 The cost function quantifies the error between the values predicted by the
linear regression model and the actual target values in the dataset.

 It provides a measure of the discrepancy between the model's predictions and the
actual target values.

 The goal is to minimize the error or loss, which typically involves finding the
optimal parameters (slope m and intercept c) for the linear equation that defines
the best-fitting line.
 The cost function most commonly used in linear regression is the Mean Squared
Error (MSE).

 The MSE is calculated by taking the average of the squared differences between
the predicted values and the actual target values.

 Mathematically, the MSE is expressed as:

MSE = (1/n) * Σ (yi – ŷi)^2, summed over i = 1…n

Where

 n is the number of data points in the dataset.
 yi is the actual target value for the ith data point.
 ŷi is the predicted target value for the ith data point.
 Here, the goal is to estimate the values of m and c that minimize the MSE, that
is, the average squared difference between the actual targets and the targets
predicted by the model.

 This is often achieved using optimization algorithms like gradient descent.

 The optimization process involves iteratively adjusting the parameters m and c in
the direction that reduces the MSE, ultimately leading to a line that best fits the
target data.
 Gradient Descent is an iterative optimization algorithm that tries to find the
minimum of a cost function.

 It is one of the most used optimization techniques in machine learning projects for
updating the parameters of a model in order to minimize a cost function.

 The main aim of gradient descent is to find the model parameters that minimize the
cost function on the training data, so that the model predicts well on both the
training and testing datasets.

 In gradient descent, the gradient is a vector that points in the direction of the
steepest increase of the function at a specific point.

 Moving in the opposite direction of the gradient allows the algorithm to gradually
descend towards lower values of the function, eventually reaching a
minimum of the function.
 Step 1: Initialize the parameters of the model randomly.

 Step 2: Compute the gradient of the cost function with respect to each parameter.

 Step 3: Update the parameters of the model by taking a step in the direction opposite
to the gradient, scaled by the learning rate.

 Step 4: Repeat steps 2 and 3 iteratively until the cost stops decreasing, to get the
best parameters for the defined model.
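The four steps above can be sketched for the line y = m*x + c with the MSE as the cost function. This is a minimal illustration: the learning rate, iteration count and toy data are assumptions, and the parameters start at zero rather than at random values so that the run is reproducible.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iterations=2000):
    """Fit y = m*x + c by minimizing the MSE with batch gradient descent."""
    m, c = 0.0, 0.0  # Step 1 (simplified): fixed start instead of a random one
    n = len(x)
    for _ in range(n_iterations):
        error = (m * x + c) - y
        # Step 2: gradients of MSE = (1/n) * sum(error^2) with respect to m and c
        grad_m = (2.0 / n) * np.sum(error * x)
        grad_c = (2.0 / n) * np.sum(error)
        # Step 3: move against the gradient, scaled by the learning rate
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c  # Step 4 is the loop itself

# Hypothetical data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
m, c = gradient_descent(x, y)

print(round(m, 4), round(c, 4))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence needs many more iterations.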
 Learning systems refer to the algorithms and methodologies used by machines to
improve their performance on a specific task through experience.

 These systems enable machines to learn from data and adapt their behavior without
being explicitly programmed.

 There are several types of learning systems in machine learning:

 Supervised Learning
 Unsupervised Learning
 Semi-supervised Learning
 Active Learning
 Reinforcement Learning
 Deep Learning
 Transfer Learning
 Self-supervised Learning
 Online Learning
 Generative Adversarial Networks
 In supervised learning, the algorithm is provided with a labeled dataset containing
input-output pairs.

 The goal is to learn a mapping from inputs to outputs so that the algorithm can
make predictions on new, unseen data.

 Support Vector Machines (SVM), Linear and Logistic Regression are common
examples of supervised learning.

 Example: Email Spam Detection

 Data: A dataset of emails labelled as either "spam" or "not spam."

 Task: Train a classifier to predict whether a new email is spam or not.

 Learning: The algorithm learns patterns in the text, subject, sender, etc.,
associated with spam emails.
 In unsupervised learning, the algorithm is given an unlabeled dataset and is tasked
with finding patterns, structures, or relationships within the data.

 Clustering and dimensionality reduction are common examples of unsupervised
learning.

 Example: Customer Segmentation

 Data: Customer purchase history (unlabelled).

 Task: Group customers into segments based on their purchasing behavior.

 Learning: The algorithm identifies hidden patterns and clusters customers with
similar buying habits.
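The customer segmentation example can be sketched with k-means clustering. The purchase figures below are hypothetical, and the choice of two segments is an assumption made for illustration; no labels are given to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchase history: [orders per year, average basket value]
purchases = np.array([
    [2, 15], [3, 18], [1, 12],        # occasional, low-spend customers
    [40, 210], [38, 190], [42, 205],  # frequent, high-spend customers
], dtype=float)

# Group the customers into two segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(purchases)

print(segments)
```

The cluster numbers themselves are arbitrary; only the grouping of customers is meaningful.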
 Semi-supervised learning combines elements of both supervised and unsupervised learning.

 It uses a small amount of labelled data along with a larger amount of unlabelled
data to improve the learning process.

 Example: Language Translation

 Data: A small set of translated sentences (labelled) and a large set of
untranslated sentences (unlabelled).

 Task: Develop a translation model to translate sentences from one language to
another.

 Learning: The model learns translation patterns from the labeled data and
generalizes to translate new, unlabelled sentences.
 In reinforcement learning, an agent learns to interact with an environment to
achieve a specific goal.

 The agent takes actions and receives rewards or penalties based on its actions.

 The goal is to learn a policy that maximizes the cumulative reward over time.

 Example: Game Playing (e.g., Chess)

 Environment: Chess board and rules.

 Agent: AI player (learner).

 Task: The agent learns to make moves that maximize its chances of winning
(maximizing rewards) over time.

 Learning: The agent explores different moves and learns to associate actions
with rewards through trial and error.
 Deep learning is a subset of machine learning that involves neural networks with
multiple layers (deep neural networks).

 These networks are capable of automatically learning features from raw data and
have been particularly successful in tasks like image and speech recognition.

 Example: Image Classification

 Data: Labelled images of different objects.

 Task: Train a deep neural network to recognize and classify objects in images.

 Learning: The neural network learns hierarchical features (edges, textures,
shapes) to make accurate predictions.
 Transfer learning involves training a model on one task and then transferring its
learned knowledge to a related task.

 This is particularly useful when labelled data is insufficient for the target task.

 Example: Fine-tuning Pretrained Model

 Data: Pretrained model on a large dataset (e.g., ImageNet).

 Task: Use the pretrained model to classify medical images (e.g., X-rays).

 Learning: The model leverages knowledge from its previous task to improve
performance on the new, related task.
 In online learning, the model is updated continuously as new data becomes
available.

 This is useful for applications where data arrives in a stream and the model needs
to adapt over time.

 Example: Stock Price Prediction

 Data: Continuous stream of stock prices.

 Task: Develop a model to predict future stock prices.

 Learning: The model updates its predictions as new stock prices arrive,
adapting to changing market conditions.
 Active learning involves an iterative process where the model selects the most
informative samples from a pool of unlabeled data and requests labels for those
samples.

 This helps the model to learn more efficiently with fewer labelled examples.

 Example: Document Classification

 Data: Large pool of unlabelled documents.

 Task: Select the most informative documents to label for training a classifier.

 Learning: The model actively chooses documents that will provide the most
improvement in classification accuracy.
 GANs consist of two networks, a generator and a discriminator, that compete with
each other.

 The generator creates data instances, while the discriminator tries to distinguish
between real and generated data.

 This results in the generator producing increasingly realistic data.

 Example: Image Generation

 Generator: Creates fake images.

 Discriminator: Distinguishes between real and fake images.

 Learning: The generator improves its image generation to fool the
discriminator, resulting in increasingly realistic images.
 Cross-validation is a technique used to assess the performance of a machine
learning model by dividing the dataset into multiple subsets and training and
evaluating the model on different combinations of these subsets.

 This helps to obtain a more robust estimate of the model's performance compared
to a single train-test split.

 Here's a step-by-step example of how cross-validation works:

 Suppose you're working on a binary classification task to predict whether an
email is spam or not spam, and you want to evaluate the performance of a
classifier using k-Fold Cross-Validation.

 Step 1: Dataset Splitting

 Divide your dataset into k subsets, where k is the number of folds in
cross-validation. Common choices are 5 or 10 folds.
 Each subset is called a fold. For this example, let's consider k = 5.
 Step 2: Cross-Validation Loop

 For each fold, perform the following steps:

 Step 3: Splitting Train-Test Sets (test set = one fold of about N/k samples;
training set = the remaining k – 1 folds)

 Step 4: Model Training

 Train your classifier on the training set.

 Step 5: Model Evaluation

 Evaluate the trained model on the test set and calculate the performance
metrics of interest (e.g., accuracy, precision, recall, F1-score).

 Step 6: Iteration Updates

 Repeat steps 3-5 for each fold, using a different fold as the test set in each
iteration.
 Step 7: Performance Aggregation

 After completing all iterations (i.e., testing the model on all folds), aggregate the
performance metrics obtained from each fold to get an overall assessment of the
model's performance.

 Step 8: Result

 The final result is a more reliable and robust estimate of the model's
performance than would be obtained from a single train-test split.

 k-Fold CV strikes a balance between computational efficiency and reliable
performance estimation.
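The cross-validation loop described in the steps above is available in scikit-learn. The sketch below assumes a small synthetic binary classification dataset and a logistic regression classifier as stand-ins for the email data and the spam classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical dataset: the class depends on the sign of x1 + x2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# k = 5 folds; each fold serves as the test set exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]

# Train and evaluate once per fold, then aggregate the per-fold accuracies
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold, scoring="accuracy")
mean_accuracy = scores.mean()

print(fold_sizes, scores, mean_accuracy)
```

Reporting the standard deviation of the per-fold scores alongside the mean gives a sense of how stable the estimate is.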

 Plain k-Fold CV is not well suited to imbalanced class distributions, because a
fold can end up with very few (or no) samples of a minority class.


 Common types of cross-validation techniques include k-Fold CV, LOOCV,
Stratified CV, and Stratified k-Fold CV.

 Leave One Out Cross-Validation (LOOCV)

 In LOOCV, each data point is used as the test set once, while the remaining points are used
for training.

 This means that if you have N data points, you'll have N iterations.

 LOOCV provides a low-bias estimate since it uses almost all the data for training in each iteration.

 However, LOOCV can be computationally expensive, especially for large datasets.


 Stratified CV is especially useful for classification tasks when the dataset has
imbalanced class distributions.

 It ensures that each fold maintains approximately the same class proportions as
the full dataset in both the training and test sets.

 This helps to prevent situations where a fold contains only one class in
training/test set, making it difficult to train and evaluate the model properly.

 Stratified k-Fold CV is a combination of stratification and k-Fold CV, which is
widely used for classification tasks.
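The effect of stratification, and the number of LOOCV iterations, can be checked on a small imbalanced label vector. The 8-to-4 class split below is a hypothetical example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0 and 4 of class 1
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified 4-fold CV: every test fold keeps the 2:1 class ratio
skf = StratifiedKFold(n_splits=4)
test_class_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]

# LOOCV: one iteration per data point
n_loocv_iterations = LeaveOneOut().get_n_splits(X)

print(test_class_counts, n_loocv_iterations)
```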
 Use Cases:

 LOOCV: Useful for small datasets when computational resources allow.

 k-Fold CV: A common choice for general model evaluation when computational
resources are limited.

 Stratified CV: Specifically useful for classification tasks with imbalanced classes.

 Bias and Variance:

 LOOCV: Lower bias, higher variance due to training on almost the entire dataset.

 k-Fold CV: Balanced bias-variance.

 Stratified CV: Similar bias-variance trade-off as k-Fold CV but addresses class
distribution concerns.
 Computational Complexity:

 LOOCV: Can be computationally expensive, especially for large datasets.

 k-Fold CV: More efficient than LOOCV, suitable for moderate-sized datasets.

 Stratified CV: Similar computational complexity to k-Fold CV.

Separability of Problems


 In machine learning, particularly in the field of classification, "separability" refers to
how well distinct classes or categories of data points can be separated by a decision
boundary.
 The decision boundary is the line, curve, or surface (in the linear case, a straight
line, plane, or hyperplane) that separates data points of different classes in feature space.
 The degree of separability influences the ease with which a classification algorithm can
accurately classify data points.
 Linear Separability:
 Linear separability refers to a scenario where classes of data points can be cleanly
separated by a straight line, plane, or hyperplane in a multi-dimensional space.
 In two-dimensional space, a linear decision boundary is a straight line; in three-
dimensional space, it's a plane; and in higher dimensions, it's a hyperplane.
 Linearly separable data is relatively simple to classify using linear algorithms like
Support Vector Machines (SVMs), logistic regression, and perceptrons.
 Examples of linearly separable data include scenarios where classes do not overlap
and can be well-separated by a single linear boundary.
 Non-linear Separability:
 Non-linear separability occurs when classes of data points cannot be effectively
separated by a straight line, plane, or hyperplane in the feature space.
 In these cases, the decision boundary needs to be more complex and non-linear, such
as a curve, to accurately separate the classes.
 Non-linear classifiers like decision trees, random forests, neural networks, and
kernel methods (e.g., kernel SVMs) are used to capture these complex decision
boundaries.
 Real-world data often exhibits non-linear separability due to overlapping classes or
intricate relationships between features and classes.
 The concept of separability influences the choice of machine learning algorithms and
preprocessing techniques.
 It's worth noting that not all real-world data is easily separable.
 Sometimes, feature engineering or dimensionality reduction can be employed to
transform the data into a space where separation becomes easier.
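XOR is a standard illustration of these ideas: no straight line separates its two classes, but adding one engineered feature makes the data linearly separable. The sketch below is a hand-built demonstration; the grid of candidate boundaries and the weights used in the expanded space were chosen by hand for illustration rather than learned.

```python
import itertools
import numpy as np

# XOR: class 1 when exactly one of the two inputs is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def linear_accuracy(features, weights, bias):
    """Accuracy of the linear rule: predict 1 when features @ weights + bias > 0."""
    predictions = (features @ weights + bias > 0).astype(int)
    return float(np.mean(predictions == y))

# Try a coarse grid of straight-line boundaries in the original 2-D space
grid = np.linspace(-2, 2, 9)
best_linear = max(
    linear_accuracy(X, np.array([w1, w2]), b)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

# Add the engineered feature x1 * x2; a plane now separates the classes
X_expanded = np.column_stack([X, X[:, 0] * X[:, 1]])
expanded_accuracy = linear_accuracy(X_expanded, np.array([1.0, 1.0, -2.0]), -0.5)

print(best_linear, expanded_accuracy)
```

No linear boundary can classify more than three of the four XOR points correctly, whereas the plane x1 + x2 – 2*x1*x2 = 0.5 in the expanded space classifies all four.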
 The "curse of dimensionality" arises when working with high-dimensional data.
 As the number of features (dimensions) increases, the following problems emerge:
 Increased Sparsity:
 As the number of dimensions increases, data points become more spread out in the
feature space.
 This sparsity makes it harder to find meaningful patterns or relationships between data
points.
 Data Scarcity:
 The amount of data needed to maintain a certain level of data density grows
exponentially with the number of dimensions.
 Collecting sufficient data becomes more challenging as the dimensionality increases.
 Diminished Discriminative Power:
 In classification tasks, as dimensions increase, the volume of the feature space grows
exponentially.
 This can result in data points from different classes becoming indistinguishable,
making accurate classification harder.
 Overfitting:
 Models become more prone to overfitting, where they capture noise in the data
rather than true underlying patterns.
 This can lead to poor generalization on new, unseen data.
 Computational Complexity: As dimensions increase, calculations become more
complex and time-consuming, requiring more computational resources.
 Visualization Challenges: Beyond three dimensions, visualizing data becomes nearly
impossible. Understanding relationships between multiple features becomes
challenging.
 Feature Selection and Engineering: In high-dimensional spaces, feature selection and
engineering become critical to focus on relevant information and reduce noise.
 To mitigate the curse of dimensionality, we need to use techniques such as
dimensionality reduction (e.g., Principal Component Analysis), feature selection, and
specialized algorithms designed to handle high-dimensional data efficiently.
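The exponential growth behind the sparsity and data-scarcity problems can be made concrete by counting grid cells. The sketch below assumes each feature is divided into 10 bins and the dataset has 1,000 samples; both numbers and the function name cells_in_grid are illustrative.

```python
def cells_in_grid(bins_per_feature, n_features):
    """Number of cells when every feature axis is cut into the same number of bins."""
    return bins_per_feature ** n_features

n_samples = 1_000
for n_features in (1, 2, 3, 6):
    cells = cells_in_grid(10, n_features)
    # Average number of samples available per cell at this dimensionality
    print(n_features, cells, n_samples / cells)
```

With the sample size fixed, the average number of points per cell falls from 100 in one dimension to 0.001 in six dimensions, which is why far more data is needed to keep the same density.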
 These techniques are used to manage the curse of dimensionality of datasets, improve
model performance, and extract relevant information from the data.
 These techniques are used based on the specific problem and dataset characteristics.
 Feature Selection:
 Feature selection involves choosing a subset of the most relevant features (attributes)
from the original set to build a model.
 The goal is to improve model efficiency, and reduce overfitting.
 Irrelevant or redundant features are excluded.
 Common methods for feature selection include:
 Filter Methods: These methods rank features based on statistical metrics like
correlation, mutual information, or variance threshold.
 Wrapper Methods: These methods evaluate feature subsets using the model's
performance on a validation set.
 Embedded Methods: These methods incorporate feature selection within the model
training process (e.g., L1 regularization).
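As an example of a filter method, the sketch below uses scikit-learn's VarianceThreshold to drop a feature that never changes. The small matrix is hypothetical; its first column is constant and therefore carries no information.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix; the first column is constant
X = np.array([
    [0.0, 2.0, 0.0],
    [0.0, 1.0, 4.0],
    [0.0, 1.0, 1.0],
])

# Keep only the features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)
kept = selector.get_support()

print(kept, X_selected.shape)
```

Wrapper and embedded methods, by contrast, need a model in the loop, so they cost more to run but take the model's behaviour into account.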
 Feature reduction techniques aim to reduce the dimensionality of the data by
transforming it into a lower-dimensional space while retaining as much of the
original information as possible.
 This helps combat the curse of dimensionality and improves computational
efficiency.
 Key techniques for feature reduction include:
 Principal Component Analysis (PCA): It is used to reduce the
dimensionality of high-dimensional datasets while retaining as much variance
in the data as possible. This is particularly useful when working with large
datasets, improving computational efficiency in various applications.
 Linear Discriminant Analysis (LDA): Like PCA, it projects the data onto linear
combinations of features, but it is supervised: it finds the combinations that
maximize class separation.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): Non-linear
technique for embedding high-dimensional data in a low-dimensional space by
preserving the pairwise similarities between data points.
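PCA can be illustrated on a tiny dataset whose two features are perfectly correlated, so a single component retains all of the variance. The data below is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: the second feature is always twice the first
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

# Reduce from two dimensions to one
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_

print(X_reduced.shape, explained)
```

On real data the explained variance ratio is usually below 1, and it is commonly used to decide how many components to keep.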
 Feature expansion involves creating new features from the existing ones to
provide additional information and potentially improve the performance of
machine learning models.
 This can be particularly useful when certain relationships between features are
not linear.
 Techniques for feature expansion include:
 Polynomial Features: Creating new features by raising existing features to
different powers.
 Interaction Features: Also known as interaction terms, these are new features
created by combining two or more existing features in a dataset (for example, by multiplying them).
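Both kinds of feature expansion are available through scikit-learn's PolynomialFeatures. The sketch below expands a single hypothetical sample with x1 = 2 and x2 = 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features: x1 = 2, x2 = 3
X = np.array([[2.0, 3.0]])

# Polynomial features up to degree 2: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Interaction terms only: x1, x2, x1*x2 (no squared terms)
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interactions.fit_transform(X)

print(X_poly, X_inter)
```

The number of generated features grows quickly with the degree and the number of inputs, so expansion is often paired with feature selection.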
 Classification is a fundamental concept in machine learning that involves
categorizing data points into predefined classes or categories based on their
features.

 It is a type of supervised learning, where the algorithm learns from labeled training
data to make predictions about new, unseen data points.

 Example: Email Spam Detection

 The goal is to build a classification model that can automatically classify emails
as either "spam" or "not spam" (ham).

 Step 1: Data Collection and Preparation

 Collect a dataset of emails, each labeled as "spam" or "ham". Features
might include things like the text, the presence of specific phrases, the sender's
address, etc.
 Step 2: Feature Extraction

 Preprocess the raw email data to extract relevant features.

 For instance, you might tokenize the text (splitting it into words), remove stop
words (common words like "the," "is," etc. that don't carry much meaning), and
convert the words into numerical representations.

 Step 3: Splitting the Data

 The dataset is usually split into two parts: the training set and the testing set.

 Step 4: Model Selection

 Choose a classification algorithm to train your model.

 Common choices include logistic regression, decision trees, random forests, support vector machines, and neural networks.
 Step 5: Model Training

 You train the model using the training data.

 During training, the model learns the patterns and relationships between the
features and the corresponding labels.

 It adjusts its internal parameters to minimize the classification error.

 Step 6: Model Evaluation

 After training, you use the testing set to evaluate how well the model performs
on new, unseen data.

 You measure various metrics such as accuracy, precision, recall, and F1-score to assess the model's performance.
 Logistic Regression: Despite its name, logistic regression is used for binary
classification. It models the probability of belonging to a certain class using a
logistic function.
 Decision Trees: These create a tree-like model of decisions based on the values of
features. Each internal node represents a decision based on a feature, and each leaf
node represents a class label.
 Random Forest: This is an ensemble method that combines multiple decision
trees to improve accuracy and robustness.
 Support Vector Machines (SVM): SVM finds a hyperplane that best separates
different classes, aiming to maximize the margin between the classes.
 K-Nearest Neighbors (KNN): KNN classifies data points based on the classes of
their k nearest neighbors in the training data.
 Naive Bayes: This algorithm uses Bayes' theorem to calculate the probability of a
data point belonging to a particular class based on its features.
 Logistic regression is used for binary classification, which involves predicting one of two possible outcomes (usually represented as 0 and 1) based on input features.

 Despite its name, logistic regression is primarily used for classification rather than
regression tasks.

 It models the probability of the binary outcome occurring as a function of the input
features.

 Logistic Regression is a simple yet effective technique for binary classification.

 It serves as the foundation for more complex algorithms like Support Vector
Machines and neural networks.

 Logistic Regression is widely used in Medical Diagnostics, Spam Detection, Fraud Detection (identifying fraudulent transactions based on transaction patterns), etc.
 Sigmoid Function (Logistic Function):

 The sigmoid function maps any input value to a value between 0 and 1, which
can be interpreted as a probability.

 The formula for the sigmoid function is:

 P(Class = 1) = sigmoid(z) = 1 / (1 + e^(-z)), where "z" is a linear combination of the input features, coefficients, and intercept.

 The output of the sigmoid function, which ranges from 0 to 1, represents the
probability of the Class (1).

 A threshold (often 0.5) is used to determine the class prediction.

 If the predicted probability is greater than or equal to the threshold, the predicted class is 1; otherwise, it's 0.
 Linear Combination (z):
 z = w₁x₁ + w₂x₂ + ... + wnxn + c
 where "z" is a linear combination of the input features(x₁ .. xn) and their
associated weights/coefficients(w₁…wn), and c is a bias term (intercept).
 To train the model, we need to find the optimal values for the weights and the
bias that minimize the log-loss (cross-entropy) cost function.
 The log-loss function for a single data point (x, y) is given by:
 Cost(x, y) = - y * log(P(Class = 1)) - (1 - y) * log(1 - P(Class = 1))
 Where:
"y" is the true class label (0 or 1).
"P(Class = 1)" is the predicted probability of Class 1.
 Training and Optimization: The model is trained using optimization algorithms
like gradient descent or variants such as stochastic gradient descent.

 The optimization process adjusts the weights and bias to minimize the cost
function and improve the model's predictions.
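The pieces described above (linear combination, sigmoid, log-loss, and gradient descent) can be put together in a minimal sketch. The one-feature toy dataset, the learning rate, and the epoch count are illustrative assumptions, not values from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    # z = w1*x1 + ... + wn*xn + c, then P(Class = 1) = sigmoid(z)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def log_loss(y, p, eps=1e-12):
    # Cost(x, y) = -y*log(p) - (1-y)*log(1-p); p is clipped to avoid log(0).
    p = min(max(p, eps), 1 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def train(X, Y, learning_rate=0.5, epochs=2000):
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(X, Y):
            # For log-loss with a sigmoid output, dCost/dz = p - y.
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj / n
            grad_b += error / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical toy data: small feature values are class 0, large are class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
Y = [0, 0, 0, 1, 1, 1]

weights, bias = train(X, Y)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
mean_loss = sum(log_loss(y, predict_proba(weights, bias, x)) for x, y in zip(X, Y)) / len(X)
print(predictions, round(mean_loss, 4))
```

The 0.5 threshold in the last step matches the default threshold mentioned earlier; a different threshold trades precision against recall.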
 The following performance evaluation metrics are essential to assess the quality and effectiveness of classification algorithms/models:
 Confusion Matrix
 Accuracy
 Precision
 Recall
 F1-Score
 True Negative Rate
 False Positive Rate
 Precision-Recall Curve
 Area Under the PR Curve
 ROC Curve
 Area Under the ROC Curve
 Suppose you're working on a medical diagnosis task to classify whether patients
have a certain disease (Cancer) or not.

 You have a dataset of 100 patients, and you're using a binary classification model.
 Confusion Matrix: A table showing the counts of true positives, true negatives, false positives, and false negatives.
 True Positives (TP): 40
 True Negatives (TN): 50
 False Positives (FP): 5
 False Negatives (FN): 5

                  Predicted Positive   Predicted Negative
Actual Positive   40 (TP)              5 (FN)
Actual Negative   5 (FP)               50 (TN)
 Accuracy: The proportion of correctly classified instances out of the total
instances.
 (TP + TN) / (TP + TN + FP + FN) = (40 + 50) / 100 = 90%

 Precision: The ratio of true positive (TP) predictions to the total positive
predictions (TP + false positives).
 TP / (TP + FP) = 40 / (40 + 5) = 88.89%

 Recall (Sensitivity or True Positive Rate): The ratio of true positive predictions
to the total actual positives (TP + false negatives).
 TP / (TP + FN) = 40 / (40 + 5) = 88.89%

 F1-Score: The harmonic mean of precision and recall, providing a balanced measure between them.
 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8889 * 0.8889) / (0.8889 + 0.8889) = 88.89%
 Specificity (True Negative Rate): The ratio of true negative predictions to the
total actual negatives (TN + false positives).
 TN / (TN + FP) = 50 / (50 + 5) = 90.91%

 False Positive Rate: The ratio of false positive predictions to the total actual
negatives (FP + TN).
 FP / (FP + TN) = 5 / (5 + 50) = 9.09%

 Receiver Operating Characteristic (ROC) Curve: A graphical representation of the true positive rate against the false positive rate at various thresholds.
 Graph plotting TPR (Recall) against FPR at various thresholds.

 Area Under the ROC Curve (AUC-ROC): The area under the ROC curve,
which measures the overall performance of a binary classification model.
 AUC value from the ROC curve.
 Precision-Recall Curve: A graphical representation of precision against recall at
various thresholds.
 Graph plotting Precision against Recall at various thresholds.

 Area Under the Precision-Recall Curve (AUC-PR): The area under the
precision-recall curve, providing an alternative to AUC-ROC for imbalanced
datasets.
 AUC value from the precision-recall curve.
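The threshold-free metrics above can be checked with a short sketch that computes them directly from the four confusion-matrix counts of the worked example. The function name `classification_metrics` is an assumption for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the single-threshold metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts from the worked example: TP = 40, TN = 50, FP = 5, FN = 5.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The ROC and precision-recall curves are not covered by this sketch because they require predicted probabilities at many thresholds rather than a single confusion matrix.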
 Decision Trees: These create a tree-like model of decisions based on the values of
features.

 Each internal node represents a decision based on a feature, and each leaf node
represents a class label.

 Example: Predicting Play Tennis


 To predict whether to play tennis based on two features: weather conditions and
wind speed.
 The target variable (class label) is whether to play tennis (Yes or No).
            Weather = Sunny?
              /          \
            Yes           No
            /               \
      Wind <= 10?        Play = No
        /      \
      Yes       No
      /           \
Play = Yes     Play = No
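The tree above can be read as nested if/else rules. The sketch below hard-codes that one tree; the function name and the string encoding of the weather feature are assumptions, and a real decision tree would learn the splits from data.

```python
def predict_play_tennis(weather, wind_speed):
    """Follow the example tree from the root to a leaf."""
    # Internal node 1: is the weather sunny?
    if weather == "Sunny":
        # Internal node 2: is the wind speed at most 10?
        if wind_speed <= 10:
            return "Yes"  # leaf: Play = Yes
        return "No"       # leaf: Play = No
    return "No"           # leaf: Play = No


print(predict_play_tennis("Sunny", 5))
print(predict_play_tennis("Sunny", 20))
print(predict_play_tennis("Rainy", 5))
```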
 The following performance evaluation metrics are essential to assess the quality and effectiveness of clustering algorithms/models:
 Silhouette Score
 Davies-Bouldin Index
 Calinski-Harabasz Index
 Inertia
 Adjusted Rand Index
 Normalized Mutual Information
 Homogeneity, Completeness, and V-measure
 Jaccard Index
 Rand Index
 Purity

 Imagine you're using k-means clustering to group customers based on their purchase behavior. You have 150 customers and are aiming to create 3 clusters.
 After running the clustering algorithm, you compare the predicted clusters with
some ground truth information:
 Adjusted Rand Index (ARI): 0.72
 Silhouette Score: 0.58
 Davies-Bouldin Index: 0.62
 Inertia: 256.5
 Adjusted Rand Index (ARI): A value close to 1 indicates that the predicted
clusters are in good agreement with the ground truth.

 Silhouette Score: A value close to 1 indicates well-separated clusters, where each data point is closer to its own cluster than to neighboring clusters.

 Davies-Bouldin Index: A lower value indicates better clustering; values close to 0 suggest tight, well-separated clusters.

 Inertia: A lower inertia value indicates that data points within clusters are closer to
their cluster centers, implying better clustering.
 Principal Component Analysis (PCA) is used for dimensionality reduction and data analysis.
 Its primary goal is to transform a dataset with multiple correlated variables into a new set of principal components.
 These components capture the most significant patterns of variance present in the original data.
 PCA can help improve model performance and makes it possible to visualize high-dimensional data in 2D or 3D.
 PCA involves the following steps to reduce the dimensionality of a dataset while preserving its most important patterns.
1. Standardization:
 Given a dataset with n samples and p features, the first step is to standardize the
data by subtracting the mean and dividing by the standard deviation for each
feature.
 2. Calculate the Covariance Matrix:
 The covariance matrix of the standardized data is calculated.
 For a feature matrix X with dimensions n x p, the covariance matrix C is
calculated as: C = (X^T * X) / (n - 1)
 3. Compute Eigenvectors and Eigenvalues:
 Eigenvectors are the directions in which the data varies the most, and
eigenvalues represent the amount of variance explained by each eigenvector.
 The eigenvectors are also known as the principal components.
 The equation C * v = λ * v, where v is an eigenvector and λ is its corresponding
eigenvalue, gives us the eigenvectors and eigenvalues.
 4. Sort Eigenvectors:
 Sort the eigenvectors in decreasing order of their eigenvalues. These
eigenvectors (principal components) capture the most important directions of
variance in the data.
 5. Choose Number of Principal Components:
 Select a subset of the top k eigenvectors based on the cumulative variance you
want to retain.
 6. Project Data onto Lower-Dimensional Space:
 Multiply the standardized data by the selected eigenvectors (principal
components) to project the data onto the lower-dimensional space.
 This transformation yields the new set of coordinates, called the eigen-
coordinates.
 7. Interpret Principal Components:
 Each principal component captures a specific pattern of variance in the original
data.
 These patterns can be interpreted to understand which features or relationships
contribute most to the variations in the dataset.
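The steps above can be traced on a two-feature dataset, where the eigenvalues of the 2 x 2 covariance matrix have a closed form and no linear-algebra library is needed. This is a minimal sketch under that assumption; the function names and the toy data are hypothetical, and real datasets with more features would use a numerical eigen-solver.

```python
import math


def standardize(column):
    """Step 1: subtract the mean and divide by the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / std for v in column]


def pca_2d(x, y):
    n = len(x)
    xs, ys = standardize(x), standardize(y)

    # Step 2: covariance matrix C = X^T X / (n - 1) = [[a, b], [b, d]].
    a = sum(v * v for v in xs) / (n - 1)
    d = sum(v * v for v in ys) / (n - 1)
    b = sum(u * v for u, v in zip(xs, ys)) / (n - 1)

    # Steps 3-4: eigenvalues of a symmetric 2x2 matrix, sorted in decreasing order.
    mid = (a + d) / 2
    half_gap = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + half_gap, mid - half_gap]

    # Eigenvector of the largest eigenvalue solves (C - lambda*I) v = 0.
    if abs(b) > 1e-12:
        v = (b, eigenvalues[0] - a)
    else:
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    component = (v[0] / norm, v[1] / norm)

    # Steps 5-6: keep k = 1 component and project the standardized data onto it.
    scores = [u * component[0] + w * component[1] for u, w in zip(xs, ys)]
    return eigenvalues, component, scores


# Hypothetical, strongly correlated toy data.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 2.2, 2.9, 4.1, 5.0]

eigenvalues, component, scores = pca_2d(x, y)
explained = eigenvalues[0] / sum(eigenvalues)
print(eigenvalues, component, round(explained, 4))
```

The ratio `explained` is the fraction of total variance retained by the first principal component, which is the quantity used in step 5 to choose k.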
 Recognition systems play a crucial role in various fields, such as machine learning,
computer vision, natural language processing, and signal processing.

 These systems are designed to identify and classify patterns, objects, or entities
from input data, which could be images, audio signals, text, or other forms of data.

 Recognition systems in the context of machine learning involve creating effective and accurate models that can identify and classify patterns in datasets.

 These models enable computers to perform tasks like image recognition, speech
recognition, natural language processing, and more.

 The design cycle for recognition systems follows a series of steps to create the
models.
 Breakdown of the design cycle:

1. Problem Definition:
 Clearly define the problem you want to solve, such as image classification,
voice recognition, etc.
 Specify the input data format and the desired output (classes or labels).

2. Data Collection and Preparation:


 Gather a diverse dataset that covers the different scenarios your recognition
system will encounter.
 Clean, preprocess, and transform the data as necessary, which might involve tasks such as normalization or encoding.
3. Feature Extraction and Selection:
 Extract relevant features from the raw data that will be used to train the
model.
 Select or engineer features that are important for the recognition task and
could help the model distinguish between different classes.

4. Algorithm Selection:
 Choose a suitable algorithm based on the problem type and the characteristics
of the data.
 Common algorithms include neural networks (for deep learning), support vector machines, random forests, k-nearest neighbors, and more.
5. Model Training:
 Split the dataset into training and validation sets.
 Train the model on the training set, adjusting model parameters and
hyperparameters to optimize performance.
6. Model Evaluation:
 Test the trained model on a separate testing dataset to assess its generalization
performance.
 Measure the model's performance using appropriate evaluation metrics
(accuracy, precision, recall, F1-score, etc.).
7. Hyperparameter Tuning:
 Adjust the model's hyperparameters to find the configuration that yields the best performance on the validation set; the testing set should be kept aside for the final evaluation.
8. Deployment:
 Once satisfied with the model's performance, deploy it into the target system or environment.
9. Monitoring and Maintenance:
 Continuously monitor the model's performance in real-world scenarios.
 Update the model as needed to address issues or adapt to changing conditions.
10. Feedback and Iteration:
 Collect feedback from users and system performance to identify areas for
improvement.
 Iterate on the model by refining the data collection, preprocessing, feature extraction, and model training processes.
 Non-linearly separable problems are classification problems where the classes cannot be separated by a straight line (or, in higher dimensions, by a hyperplane).

 In such cases, linear classifiers like logistic regression or linear support vector machines (SVMs) may not yield satisfactory results.

 Non-linearly separable problems are common in real-world scenarios where data points of different classes are intermixed and cannot be cleanly divided by a single linear boundary.

 When encountering a non-linearly separable problem, it's important to choose an appropriate model that can effectively capture the underlying patterns in the data.

 Experimentation and testing with different approaches are essential to find the
solution that best fits the problem's characteristics.
 Cover's theorem, also known as Cover's theorem of separability, states that a classification problem cast non-linearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

 Cover's theorem can be stated as follows:

"For any given finite set of points in a high-dimensional feature space, there exists
a hyperplane that can separate them perfectly, provided the number of dimensions
is sufficiently high."

 While this theorem is theoretically interesting, it's important to note that it doesn't
provide a practical method for finding or constructing such hyperplanes.

 The theorem is more of an existence statement and doesn't guarantee that such
separations can be easily identified or used in real-world scenarios.
To address non-linearly separable problems, various techniques and models can be
used:

 Kernel Trick in SVMs:

 Support Vector Machines (SVMs) are powerful classifiers, but they are
inherently linear.

 However, by using the kernel trick, SVMs can be extended to handle non-linear
data.

 For example, the radial basis function (RBF) kernel implicitly maps the input data into a higher-dimensional space where it may become linearly separable.
 Neural Networks:

 Deep neural networks are highly flexible models capable of approximating complex functions, making them well-suited for non-linearly separable problems.

 They can capture intricate relationships in the data through hidden layers and
activation functions.

 Radial Basis Function Networks:

 RBF Networks are a type of neural network that use radial basis functions as
activation functions in the hidden layer.

 They can approximate non-linear features and are useful for non-linear
classification tasks.
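The idea shared by these techniques, mapping the inputs into a space where a linear boundary works, can be illustrated with the XOR problem, which is not linearly separable in two dimensions. The sketch below uses an explicit feature map rather than a kernel, and the weights are chosen by hand; both are assumptions made for illustration.

```python
def feature_map(x1, x2):
    # Lift the 2D input into 3D by adding the product x1 * x2.
    return (x1, x2, x1 * x2)


def linear_classifier(features, weights=(1.0, 1.0, -2.0), bias=-0.5):
    # A plain hyperplane in the lifted 3D space (hand-picked, not learned).
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# XOR: the label is 1 exactly when the two inputs differ.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predictions = [linear_classifier(feature_map(*x)) for x, _ in xor_data]
print(predictions)
```

A kernel SVM achieves a similar effect without computing the lifted features explicitly, and a neural network learns a suitable mapping in its hidden layers.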
 K-Nearest Neighbors (KNN) is a straightforward algorithm that can be used for
both classification and regression tasks.

 It falls under the category of instance-based or lazy learning algorithms, meaning that it doesn't explicitly build a model during the training phase.

 Instead, it memorizes the entire training dataset and uses it to make predictions
when new data points need to be classified or predicted.

 KNN Algorithm Steps:


1. Import dataset
2. Splitting the dataset into Training and Test sets.
3. Training Phase:
 During the training phase, KNN simply stores the training features (attributes)
and their corresponding labels (for classification) or target values (for
regression) in memory.

4. Standardization:
 Standardize the data by subtracting the mean and dividing by the standard deviation for each feature, so that no single feature dominates the distance calculation.

5. Prediction Phase:
 When a new, unseen data point is presented for prediction, the algorithm identifies the 'k' closest data points (neighbors) from the training dataset based on a distance metric such as Euclidean distance.
 The value of 'k' is a user-defined parameter that determines the number of
neighbors to consider.
 It's an important hyperparameter that affects the performance of the algorithm.
6. Classification (for classification tasks):
 For classification tasks, the algorithm counts the occurrences of each class among
the 'k' nearest neighbors.
 The new data point is assigned the class label that is most common among its 'k'
nearest neighbors.
 This can be determined by a majority vote.
7. Regression (for regression tasks):
 For regression tasks, KNN calculates the average or weighted average of the target
values of the 'k' nearest neighbors.
 The predicted value for the new data point is set to this average.
8. Model Evaluation
 After training, you use the testing set to evaluate how well the model performs on
new, unseen data.
 You measure metrics such as accuracy, precision, recall, and F1-score (for classification) or mean squared error (for regression) to assess the model's performance.
 We want to classify points into two classes: "Red" and "Blue".

 We have a training dataset with two features (x1 and x2) and corresponding class
labels:
x1 x2 Class
2 3 Red
4 2 Red
4 4 Blue
6 2 Blue
 Now, let's say we want to classify a new data point, (x1_new, x2_new) = (5, 3),
using KNN with k = 3.

 Calculate Euclidean distances between the new point and all training points:
Distance to (2, 3): sqrt((5-2)^2 + (3-3)^2) = 3
Distance to (4, 2): sqrt((5-4)^2 + (3-2)^2) = sqrt(2)
Distance to (4, 4): sqrt((5-4)^2 + (3-4)^2) = sqrt(2)
Distance to (6, 2): sqrt((5-6)^2 + (3-2)^2) = sqrt(2)
 Select the 'k' nearest neighbors (k = 3):

 Nearest neighbors: (4, 2), (4, 4), (6, 2)

 Count the class occurrences among the nearest neighbors:

 Among the neighbors, there's 1 Red and 2 Blue.

 Assign the new point to the class with the majority among neighbors:

 Since there are more Blue neighbors, the new point is classified as Blue.
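The worked example above can be reproduced with a short sketch. The function name `knn_classify` is an assumption, and standardization is skipped because both features are already on a similar scale.

```python
import math
from collections import Counter


def knn_classify(training_data, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(training_data, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote over the labels of the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


training_data = [
    ((2, 3), "Red"),
    ((4, 2), "Red"),
    ((4, 4), "Blue"),
    ((6, 2), "Blue"),
]

predicted = knn_classify(training_data, query=(5, 3), k=3)
print(predicted)
```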
 In this example, we'll use a simplified dataset with one feature and a target value.

 Our goal is to predict the target value for a new data point using KNN regression.

 Suppose we have the following training dataset:


Feature (X) Target (Y)
2 10
3 12
5 15
7 20
9 22

 We want to predict the target value for a new data point with a feature value of 6
using KNN regression with k = 3.
 Calculate Euclidean distances between the new data point and all training points:
 Distance to (2, 10): sqrt((6-2)^2) = 4
 Distance to (3, 12): sqrt((6-3)^2) = 3
 Distance to (5, 15): sqrt((6-5)^2) = 1
 Distance to (7, 20): sqrt((6-7)^2) = 1
 Distance to (9, 22): sqrt((6-9)^2) = 3
 Select the 'k' nearest neighbors (k = 3):
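 Nearest neighbors: (5, 15), (7, 20), (3, 12). Note that (3, 12) and (9, 22) are tied at distance 3; here the tie is broken in favor of (3, 12), the point that appears first in the dataset.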

 Calculate the average target value of the 'k' nearest neighbors:


 Average target value = (15 + 20 + 12) / 3 = 15.67

 Predict the target value for the new data point:


 Predicted target value = 15.67
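The regression example can be sketched the same way. The function name `knn_regress` is an assumption; ties in distance are resolved by dataset order because Python's `sorted` is stable, which matches the choice of (3, 12) over (9, 22) in the example.

```python
def knn_regress(training_data, query, k):
    """Predict the target for `query` as the mean target of its k nearest points."""
    # With a single feature, the Euclidean distance reduces to |x - query|.
    by_distance = sorted(training_data, key=lambda item: abs(item[0] - query))
    neighbors = by_distance[:k]
    return sum(target for _, target in neighbors) / k


training_data = [(2, 10), (3, 12), (5, 15), (7, 20), (9, 22)]

predicted = knn_regress(training_data, query=6, k=3)
print(round(predicted, 2))
```

A weighted average, with weights inversely proportional to distance, is a common alternative that makes tie-breaking matter less.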
Deep Learning
Dr.S. Nagaraju
Adjunct Faculty,
Computer Science and Engineering, IIITDMK
Detailed Syllabus
 Geoffrey Hinton is often referred to as one of the "godfathers" of deep learning due to his pioneering contributions to the field.

 Deep Learning is the most exciting and powerful branch of Machine Learning.

 It employs deep neural networks with multiple layers to automatically learn complex patterns and representations from data.

 Deep Learning models can be used for a variety of complex tasks:


[Figure: a biological neuron (showing the axon) alongside an artificial neuron]

 Dendrites are used to bring the inputs from other neurons.


 Synapse represents the strength of interaction of input neurons based on experience and
learning.
 Soma is a distributed and parallel processing unit that sends output to other neurons through
the Axon.
 Artificial neural networks (ANNs) are inspired by the structure and function of biological neural networks (BNNs) in the human brain.

 BNNs are composed of biological neurons, which are complex and highly
specialized cells found in the nervous system of living organisms, including
humans.

 These neurons communicate through electrical impulses and chemical synapses.

 ANNs on the other hand, are composed of artificial neurons or nodes, which are
mathematical abstractions designed to mimic the basic processing units of
biological neurons.

 These artificial neurons are computational constructs.


 In BNNs, information processing is highly complex and involves a combination of
electrical and chemical signals.

 Neurons in the brain process information in a distributed and parallel manner.

 ANNs simplify information processing by using mathematical operations.

 Each artificial neuron computes a weighted sum of its inputs and applies an
activation function to produce an output.

 ANNs process information sequentially, layer by layer.
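The computation performed by a single artificial neuron, a weighted sum followed by an activation function, can be sketched as below. The sigmoid activation and the example inputs and weights are assumptions chosen for illustration.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights: z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)
```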


 BNNs are capable of learning and adaptation based on experience.

 ANNs learn through learning algorithms, such as gradient descent with backpropagation, which adjust the weights to minimize the difference between predicted and actual outputs.

 This learning is a form of mathematical optimization.


 BNN: Biological neurons are highly robust and energy-efficient, capable of
functioning for many years.

 ANNs, being computational constructs, are robust in the sense that they can run
without degradation as long as the hardware or software is maintained. However,
their energy efficiency depends on the hardware they are implemented on.
Model

Loss Function

Learning Algorithm
Loss Function

Squared error loss is equivalent to perceptron loss when the outputs are Booleans.
// An epoch is one pass through the entire dataset; training repeats for several epochs until the algorithm converges.
// The learning rate determines the step size of each update.
For a misclassified positive label:
weight = weight + learning_rate * input
bias = bias + learning_rate
For a misclassified negative label:
weight = weight - learning_rate * input
bias = bias - learning_rate
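The update rules above can be turned into a runnable sketch. The AND-gate dataset, the zero initialization, the learning rate of 1.0, and the `> 0` decision rule are assumptions chosen so the example converges in a few epochs.

```python
def predict(weights, bias, x):
    # Perceptron output: 1 if the weighted sum plus bias is positive, else 0.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, label in data:
            if predict(weights, bias, x) == label:
                continue
            errors += 1
            # Misclassified positive label: add the input; negative label: subtract it.
            sign = 1 if label == 1 else -1
            weights = [w + sign * learning_rate * xi for w, xi in zip(weights, x)]
            bias += sign * learning_rate
        if errors == 0:
            # A full epoch without mistakes means the algorithm has converged.
            break
    return weights, bias


# Hypothetical dataset: the logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(weights, bias, predictions)
```

Convergence is only guaranteed when the data is linearly separable, as it is for AND; on XOR the loop would simply run until `max_epochs`.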
 Activation functions play a pivotal role in the functioning of deep learning models.
 Activation functions allow neural networks to capture more complex relationships in the data, to learn from errors, and to solve intricate problems beyond just linear separability.
 The choice of activation function depends on the specific problem and the nature
of the data.
 The specific choice of activation function can have a significant impact on a neural
network's performance.
 It's often a good practice to experiment with different activation functions and
observe the network's performance.
 For instance, ReLU and its variants are widely popular in most convolutional
neural networks.
 Meanwhile, sigmoid and tanh might be used in models like RNN, LSTM, GRU,
etc.
 The derivatives of activation functions play a crucial role during backpropagation,
allowing weights to be updated based on the error gradient.
 Activation functions that have gradients which are neither too steep nor too flat are
generally conducive for faster and more stable convergence during training.
 Activation functions can be used to squash the outputs of a neuron to a desired
range.
 For example:

 Sigmoid activation functions compress the neuron's output to lie between 0 and
1, which can be used to represent a probability.

 The hyperbolic tangent (tanh) function limits the output to lie between -1 and 1.

 The Softmax activation function, typically used in the output layer of multi-
class classification problems, gives a distribution of probabilities over multiple
classes.
 Certain activation functions, like the sigmoid or tanh, can saturate, i.e. their outputs flatten out in regions where their gradients are nearly zero. This can slow down or halt learning.
 Other activation functions, such as ReLU (Rectified Linear Unit), do not saturate
in the positive region, which can accelerate convergence.

 Activation functions like ReLU introduce sparsity.

 A model with more sparse activations is often more easily interpretable and can
lead to a more efficient model.
 Formula (Sigmoid): f(x) = 1 / (1 + e^(-x))
 Range: (0, 1)
 Pros:
◦ Smooth gradient, preventing “jumps” in output values.
◦ Outputs a probability between 0 and 1.
 Cons:
◦ Vanishing gradient problem for very high or very low values of x.
◦ Not zero-centred.
◦ Computationally expensive.
 Formula (Tanh): f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
 Range: (-1, 1)
 Pros:
◦ Smooth gradient.
◦ Zero-centred.
 Cons:
◦ Still has the vanishing gradient problem, though not as severe as the sigmoid
function.
◦ Computationally expensive.
 Formula (ReLU): f(x) = max(0, x)
 Range: [0, ∞)
 Pros:
◦ Helps mitigate the vanishing gradient problem.
◦ Computationally efficient (no expensive e^x) and allows the model to converge quickly.
 Cons:
◦ Dying ReLU problem where neurons can sometimes be stuck during training and not
activate for any data point.
◦ Not zero-centred.
 Formula (Leaky ReLU): f(x) = max(0.01x, x)
 Range: (-∞, ∞)

 Pros:
◦ Addresses the Dying ReLU problem by allowing small negative values.
◦ Does not saturate in the positive or negative region.
◦ Computationally efficient (no expensive e^x).
◦ Close to zero-centred outputs.

 Cons:
◦ Performance and benefits are data-dependent.
 Formula (ELU): f(x) = x if x > 0; f(x) = α(e^x - 1) if x ≤ 0

 Range: (-α, ∞) for some positive α

 Example: Let's consider two values, one positive and one negative. Assume α=1.
 For x=2: ELU(2) = 2
 For x=−2: ELU(−2) = 1⋅(e^(-2) − 1) ≈ 1⋅(0.135335 − 1) ≈ −0.864665
 Hence: For x=2, ELU outputs 2. For x=−2, ELU outputs approximately -0.865.
 Pros:
◦ Results in better training dynamics and improved generalization.
◦ Addresses the dying ReLU problem.
 Cons: Computationally more expensive due to the exponential operation.
 Formula (Swish): f(x) = x⋅σ(βx), where σ is the sigmoid function and β is a learnable parameter.
 Range: bounded below and unbounded above (approximately [-0.28, ∞) for β=1)
 Example: Let's consider β=1 and x=2.
 First, compute the sigmoid part: σ(2) = 1 / (1 + e^(-2)) ≈ 1 / (1 + 0.135335) ≈ 0.880797
 Now, compute Swish(2) = 2⋅0.880797 ≈ 1.761594
 Hence: For x=2, Swish outputs approximately 1.761594.
 Pros:
◦ Often outperforms ReLU in deeper models.
◦ Smooth and non-monotonic.
 Cons: Computationally more expensive due to the exponential operation.
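 Formula (Softmax): S(z_i) = e^(z_i) / Σ_{j=1..C} e^(z_j), where C is the number of classes
Example: Consider a 3-class classification problem. Assume the raw scores from the last hidden layer to the output layer neurons are:
z = [2.0, 1.0, 0.1]
Let's compute the softmax values for each of these scores:
e^2.0 ≈ 7.389, e^1.0 ≈ 2.718, e^0.1 ≈ 1.105, and their sum ≈ 11.212
S(z) ≈ [0.659, 0.242, 0.099]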
 Pros:
◦ Outputs a probability distribution over multiple classes.
 Cons: Only used in the output layer.
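The activation functions covered above can be collected in one sketch and checked against the worked examples. The parameter names `slope`, `alpha`, and `beta` and their defaults are assumptions; the max-subtraction in `softmax` is a common numerical-stability step that does not change the result.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return max(slope * x, x)


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def swish(x, beta=1.0):
    return x * sigmoid(beta * x)


def softmax(scores):
    # Subtracting the maximum score avoids overflow in exp().
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(elu(-2.0))                 # ELU example with alpha = 1
print(swish(2.0))                # Swish example with beta = 1
print(softmax([2.0, 1.0, 0.1]))  # Softmax example
```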
 Why do we need activation functions?
 What are the most commonly used activation functions and their mathematical
expressions?
 What are the advantages and disadvantages of the Sigmoid activation function?
 What is vanishing gradient problem?
 Why are ReLU and its variants preferred over sigmoid and tanh for deeper networks?
 What are the potential problems with ReLU, and how are they addressed?
 How does the Softmax activation function differ from other activation functions,
and when is it used?
 Optimization algorithms play a pivotal role in training deep learning models.
 Gradient Descent (GD) is the foundational optimization algorithm.
 It is used to minimize the loss or error of a model by updating the model's
parameters.
 Updates model weights and bias using the gradient of the entire dataset.
 Numerous GD variants and entirely new approaches have been developed to
achieve faster convergence.
 Pros:
 Straightforward and deterministic approach.
 Cons:
◦ Can be very slow for large datasets as it processes all data for a single update.
Here's a step-by-step breakdown of the gradient descent algorithm:
 Initialization:
 Initialize w and b randomly.
 Set the learning rate.
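 Iterate over data:
 Compute the prediction using the sigmoid activation function, i.e., ŷ = 1 / (1 + e^(-(w*x + b)))

 Compute the squared error loss, i.e., L = (ŷ - y)^2 / 2

 Compute the change in weights, i.e., Δw = (ŷ - y) * ŷ * (1 - ŷ) * x and Δb = (ŷ - y) * ŷ * (1 - ŷ), then update w = w - learning_rate * Δw and b = b - learning_rate * Δb

 Repeat until satisfied (the loss is zero or very small, a fixed number of epochs such as 10/100/1000/10000 is reached, or there is not much change between the previous and current iteration)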
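The breakdown above can be sketched for a single sigmoid neuron with one weight and one bias. The two data points, the learning rate, and the zero initialization (used instead of random initialization so the run is reproducible) are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def total_loss(w, b, data):
    # Squared error loss, (y_hat - y)^2 / 2, summed over all data points.
    return sum((sigmoid(w * x + b) - y) ** 2 / 2 for x, y in data)


def gradient_descent(data, learning_rate=0.5, epochs=1000):
    w, b = 0.0, 0.0  # fixed start for reproducibility (an assumption)
    for _ in range(epochs):
        dw, db = 0.0, 0.0
        for x, y in data:
            y_hat = sigmoid(w * x + b)
            # Chain rule: dL/dz = (y_hat - y) * y_hat * (1 - y_hat)
            delta = (y_hat - y) * y_hat * (1.0 - y_hat)
            dw += delta * x
            db += delta
        # Step against the gradient computed over the entire dataset.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


# Hypothetical data: (input, target) pairs with targets between 0 and 1.
data = [(0.5, 0.2), (2.5, 0.9)]

initial_loss = total_loss(0.0, 0.0, data)
w, b = gradient_descent(data)
final_loss = total_loss(w, b, data)
print(w, b, initial_loss, final_loss)
```

In practice the loop would also stop early once the loss or its change between iterations falls below a tolerance, matching the stopping conditions listed above.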
Here are the primary types of gradient descent algorithm:

 Batch Gradient Descent (BGD):


 Uses the entire dataset to compute the gradient of the cost function.
 Converges to the global minimum for convex loss functions (and to a local minimum otherwise), given a suitable learning rate.
 Computation can be expensive for large datasets.

 Stochastic Gradient Descent (SGD):


 Uses only one training sample (randomly chosen) to compute the gradient at
each step.
 Introduces a lot of variance and oscillates heavily.
 Faster and is suitable for large datasets and online learning.
 Mini-batch Gradient Descent:
 A compromise between BGD and SGD.
 It uses a mini-batch of 'n' training samples (where 'n' is much smaller than the
total dataset) to compute the gradient at each step.
 Usually converges faster than both BGD and SGD.
 The size of the mini-batch (often called "batch size") is a hyperparameter.

 The core idea of gradient descent remains the same across these variants, with differences primarily in computational efficiency, convergence speed, and stability.
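The three variants differ only in how many samples feed each update, so a single sketch with a `batch_size` parameter covers all of them. The linear model, the mean squared error loss, the noise-free toy data, and the hyperparameter values are assumptions for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.02, epochs=2000, seed=0):
    """Fit y = w*x + b by gradient descent on mean squared error.

    batch_size == len(xs) gives batch GD, batch_size == 1 gives SGD,
    and anything in between gives mini-batch GD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

batch_result = minibatch_gd(xs, ys, batch_size=len(xs))
sgd_result = minibatch_gd(xs, ys, batch_size=1)
mini_result = minibatch_gd(xs, ys, batch_size=2)
print(batch_result, sgd_result, mini_result)
```

On noisy data the SGD and mini-batch estimates would keep fluctuating around the solution, which is the variance mentioned above; a decaying learning rate is a common remedy.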
 What is gradient descent and why is it important?

 How does gradient descent work?

 What are the main variants of gradient descent?

 What is the vanishing and exploding gradient problem?

 What is the difference between gradient descent and stochastic gradient descent?

 How does gradient descent differ from other optimization algorithms?


 What is gradient descent and why is it important?
Gradient descent is an optimization algorithm used to minimize the loss
in machine learning and deep learning models by iteratively adjusting the
model's parameters.

 How does gradient descent work?

The algorithm computes the gradient of the loss function with respect to the model's parameters and repeatedly updates the parameters by taking a step, scaled by the learning rate, in the direction opposite to the gradient.

 What are the main variants of gradient descent?

Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent.
 What is the difference between gradient descent and stochastic gradient descent?

Gradient descent computes gradients using the entire dataset, while stochastic gradient descent computes gradients using a single training sample.

 What is the vanishing and exploding gradient problem?

It occurs when gradients become extremely small (vanish) or extremely large (explode) during backpropagation, making training slow or unstable.

 How does gradient descent differ from other optimization algorithms?

Gradient descent is based on iteratively adjusting parameters in the direction of the steepest decrease of the loss function. There are other optimization methods not based on gradients, such as genetic algorithms or simulated annealing.
 Loss functions, also known as cost functions or objective functions, quantify how
well a model's predictions match the true data.
 In deep learning, different loss functions are used depending on the specific task.
 Regression Losses:
 Mean Squared Error (MSE) / Quadratic Loss / L2 Loss:

 Measures the squared difference between the actual and predicted values.

 Penalizes larger errors more than smaller ones.

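 Formula: $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

 Actual values: $y = [2.5, 3.0, 3.5]$ and predicted values: $\hat{y} = [3.0, 3.2, 3.7]$

 $\mathrm{MSE} = \frac{1}{3}\left[(2.5-3.0)^2 + (3.0-3.2)^2 + (3.5-3.7)^2\right] = \frac{0.25 + 0.04 + 0.04}{3} = 0.11$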
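The arithmetic of this example can be checked with a few lines of code. This is a minimal sketch; the function name is an illustrative choice.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.5, 3.0, 3.5]
y_pred = [3.0, 3.2, 3.7]
mse = mean_squared_error(y_true, y_pred)
print(round(mse, 4))  # 0.11
```

The squared terms are 0.25, 0.04, and 0.04, so the mean is 0.33 / 3 = 0.11.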
 Mean Absolute Error (MAE) / L1 Loss:

 Measures the absolute difference between the actual and predicted values.

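 Formula: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i \rvert$

 Actual values: $y = [2.5, 3.0, 3.5]$ and predicted values: $\hat{y} = [3.0, 3.2, 3.7]$

 $\mathrm{MAE} = \frac{1}{3}\left[\lvert 2.5-3.0 \rvert + \lvert 3.0-3.2 \rvert + \lvert 3.5-3.7 \rvert\right] = \frac{0.5 + 0.2 + 0.2}{3} = 0.3$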
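A short sketch reproduces this value and also illustrates how MAE reacts to a single large error. The outlier prediction (8.5) is a made-up value used only for comparison.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.5, 3.0, 3.5]
y_pred = [3.0, 3.2, 3.7]
mae = mean_absolute_error(y_true, y_pred)
print(round(mae, 4))  # 0.3

# Hypothetical prediction with one large error of 5.0 on the last sample.
y_outlier = [3.0, 3.2, 8.5]
mae_outlier = mean_absolute_error(y_true, y_outlier)
mse_outlier = sum((t - p) ** 2 for t, p in zip(y_true, y_outlier)) / len(y_true)
print(round(mae_outlier, 2), round(mse_outlier, 2))
```

With the outlier, the absolute error grows linearly with the 5.0 deviation while the squared error grows with its square, which is the sense in which MSE penalizes larger errors more.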
 Classification Losses:
 Binary Cross-Entropy (Log Loss):

 Measures the logarithmic difference between the actual label and the predicted
probability.

 Used for binary classification.

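 Formula: $\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$

 Actual values: $y = [0, 1]$ and predicted values: $\hat{y} = [0.2, 0.8]$

 $\mathrm{BCE} = -\frac{1}{2}\left[\log(1 - 0.2) + \log(0.8)\right] \approx 0.223$ (natural logarithm)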
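The binary cross-entropy for these values can be evaluated as follows. This is a sketch using the natural logarithm; the `eps` clipping constant is an added assumption that guards against log(0) and is not part of the slides.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

bce = binary_cross_entropy([0, 1], [0.2, 0.8])
print(round(bce, 4))  # 0.2231
```

Both samples are predicted with probability 0.8 for the correct class, so the loss equals -log(0.8).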


 Categorical Cross-Entropy:

 Measures the logarithmic difference between the actual (one-hot) label and the
predicted probability distribution over classes.

 Used for Multi-class classification.

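 Formula: $\mathrm{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$

 Actual values: $y = [[1,0,0],[0,1,0]]$ and predicted values: $\hat{y} = [[0.7,0.2,0.1],[0.1,0.6,0.3]]$

 $\mathrm{CCE} = -\frac{1}{2}\left[\log(0.7) + \log(0.6)\right] \approx 0.434$ (natural logarithm)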
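For one-hot labels, only the predicted probability of the true class contributes to the loss. The sketch below evaluates the example values with the natural logarithm; the `eps` floor is an added assumption to avoid log(0).

```python
import math

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum_c y_c * log(p_c) for one-hot labels y."""
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(labels, probs))
    return total / len(y_true)

y_true = [[1, 0, 0], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
cce = categorical_cross_entropy(y_true, y_prob)
print(round(cce, 4))  # 0.4338
```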


 A common issue in training deep neural networks is the "vanishing and exploding
gradients" problem.
 Vanishing Gradient Problem:
 Definition:
 The vanishing gradient problem arises when the gradients of the loss with respect to
the model parameters approach zero.
 This makes the network hard to train, as the weights and biases of the initial layers of
the network are updated very slowly, making them almost stagnant.
 Cause:
 Deep neural networks, especially those using activation functions like sigmoid or
tanh, can cause the gradients to become very small.
 When this small gradient is multiplied repeatedly through backpropagation, the
gradient shrinks exponentially.
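The exponential shrinkage can be illustrated numerically. The sketch below considers only the sigmoid derivative and ignores the weight terms that also enter the backpropagated product (a simplifying assumption), so it shows the best case for a sigmoid network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25 (reached at z = 0)

# Best case for the sigmoid: every layer contributes its maximum derivative, 0.25.
for depth in (5, 10, 20):
    factor = sigmoid_derivative(0.0) ** depth
    print(depth, factor)
```

Even in this best case the factor after 20 layers is below 1e-11, so the earliest layers receive almost no gradient signal.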
 Exploding Gradient Problem:
 Definition:

 The exploding gradient problem occurs when the gradients of the loss with
respect to the model parameters become very large, leading to very large
updates and causing the training to diverge.

 Cause:

  This is typically a problem in deep or recurrent neural networks where large
gradients (often caused by large weights) get multiplied many times during
backpropagation, causing them to grow exponentially.
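Both gradient problems come from the same repeated multiplication. As a rough sketch, assume every layer scales the backpropagated gradient by the same factor (a simplification; real networks have a different factor per layer). The function name and the factors below are illustrative.

```python
def backprop_scale(per_layer_factor, depth):
    """Magnitude of a gradient after being multiplied by the same factor in every layer."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale

for factor in (0.5, 1.0, 1.5):
    print(factor, backprop_scale(factor, depth=50))
```

A factor slightly below 1 drives the gradient toward zero, while a factor slightly above 1 makes it grow without bound as depth increases.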
 Weight Initialization:
 Weight initialization plays a critical role in the stable training and faster convergence of
deep neural networks.
 Proper weight initialization can significantly speed up convergence and helps
avoid gradient problems by ensuring that the weights are neither too small nor too
large, and that they are never all initialized to zero or to the same value.
 Weights connected to the same neuron should never be initialized to the same value.
 Otherwise symmetry is never broken: neurons in the same layer receive identical
updates, their weights stay equal to one another, and all of them learn the same
features during training.
 Always normalize the inputs (for example, so that they lie between 0 and 1).
 Methods:
 Xavier/Glorot Initialization.
 Especially effective for the sigmoid and tanh activation functions.
 If $n_{in}$ is the number of input units for the layer and $n_{out}$ is the number of
output units, then weights are initialized from a distribution with a variance of:
$\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$.
 He Initialization:
 Named after Kaiming He, it's designed specifically for ReLU (Rectified Linear
Units) and its variants.
 If $n_{in}$ is the number of input units for the layer, weights are initialized from a
distribution with a variance of: $\mathrm{Var}(W) = \frac{2}{n_{in}}$.

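 Example: Let's say we have a neural network layer with 500 input units and 200
output units. Xavier initialization then uses $\mathrm{Var}(W) = 2/(500 + 200) \approx 0.00286$,
and He initialization uses $\mathrm{Var}(W) = 2/500 = 0.004$.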
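The two variance rules can be applied directly to this 500-input, 200-output layer. The sketch below assumes weights are drawn from a zero-mean normal distribution (a uniform distribution with the same variance is another common choice); the helper names are illustrative.

```python
import math
import random

def xavier_std(n_in, n_out):
    return math.sqrt(2.0 / (n_in + n_out))

def he_std(n_in):
    return math.sqrt(2.0 / n_in)

def init_weights(n_in, n_out, std, seed=0):
    """Draw an n_in x n_out weight matrix from a zero-mean normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

n_in, n_out = 500, 200
print(round(xavier_std(n_in, n_out) ** 2, 5))  # 0.00286
print(round(he_std(n_in) ** 2, 5))             # 0.004

# Empirical check: the sampled He weights should have variance close to 2 / n_in.
weights = init_weights(n_in, n_out, he_std(n_in))
flat = [w for row in weights for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)
print(empirical_var)
```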
 Change Activation Functions:
 The choice of the activation function can influence the magnitude of the
gradients.
 Examples:
 ReLU (Rectified Linear Unit): Its gradient is 1 for positive inputs, which helps
with the vanishing gradient problem, but because its output is unbounded it
might cause exploding gradients in some cases.
 Leaky ReLU: An improvement over ReLU, it allows a small gradient for
negative values, reducing the chances of units "dying" during training.
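The difference between the two activations shows up in their gradients for negative inputs. In this sketch the slope `alpha=0.01` for Leaky ReLU is a commonly used value chosen here as an assumption.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # no gradient flows for negative inputs

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha  # a small gradient survives for negative inputs

for z in (-2.0, 3.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

A unit whose input stays negative gets zero gradient under ReLU and stops learning, whereas Leaky ReLU still passes a small gradient.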
 Overfitting and underfitting are fundamental problems in the domain of machine
learning and deep learning, representing two extremes in model performance
concerning its complexity and ability to generalize.
 Overfitting:
 When a model performs exceptionally well on the training data but poorly on
unseen data (like a validation or test set), it is said to be overfitting.
 Overfitting indicates that the model has learned the training data's noise and
outliers, rather than the underlying distribution.
 Causes:
 High Model Complexity: Deep neural networks with many layers or many neurons
can capture intricate patterns in the training data, including noise and outliers.
 Insufficient Training Data: If the model has too many parameters relative to the
number of training samples, it's more likely to overfit.
 Lack of Regularization: Without regularization techniques (such as L2 regularization
or dropout), nothing discourages the model from fitting noise in the training data.
 Underfitting:
 Underfitting occurs when a model cannot capture the underlying patterns of the
data.
 It performs poorly on both training and unseen data.
 This suggests that the model is too simplistic and hasn't captured the
complexities of the training data's distribution.
 Causes:
 Low Model Complexity: A neural network with too few layers or neurons might
not capture complex, non-linear relationships in the data.
 Over-Regularization: Applying too much regularization can restrict the
model's capacity to learn.
 Poor Feature Representation: If the input features don't capture the
necessary information about the data, the model might underfit.
 A Convolutional Neural Network (CNN) is a type of deep learning model specifically
designed for processing grid-like data, such as images.
 CNNs utilize layers with convolutional filters that can automatically and
adaptively learn spatial hierarchies of features.
 CNN Applications:
 Image Classification: Determining if a given picture is that of a cat, dog, car, etc.

 Object Detection: Detecting and drawing bounding boxes around cars in a street image.

 Image Segmentation: Marking every pixel of an image as 'cat', 'background', 'dog', etc.

 Face Recognition: Unlocking smartphones using facial data or tagging people in photos.

 Video Analysis: Detecting anomalies in surveillance videos.

 Medical Imaging: Detecting tumors in MRI scans or abnormalities in X-ray images, and
many more.
 A convolutional layer performs a convolution operation, which involves taking a
filter (also called a kernel or feature detector) and sliding it over the input data
(such as an image) to produce a feature map or activation map.
 The goal of the convolution operation is to identify specific patterns in the input
data.
 These patterns can be simple features like edges or textures in early layers, and as
we move deeper into the network, the patterns can represent more complex
structures.
 Filters (or kernels) in a convolutional layer are small weight matrices that are used
to slide over the input data.
 For a 2D image, filters are also 2D matrices, typically of size 3x3, 5x5, or 7x7.
 The depth of the filter matches the depth of the input data.
 For instance, for an RGB image, the input depth is 3, so filters also have a depth of
3.
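The sliding-window operation can be written out for a single-channel image. This is a minimal sketch assuming stride 1, no padding, and a square kernel; as in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation. The image and the edge filter are made-up values.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 sliding-window product of a 2D image and kernel."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel element-wise with the window at (i, j) and sum.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        feature_map.append(row)
    return feature_map

# 4x4 image with a vertical edge between the dark (0) and bright (1) halves.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# Hypothetical 3x3 vertical-edge filter (right column minus left column).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
print(conv2d(image, kernel))  # [[3, 3], [3, 3]]
```

A 4x4 input and a 3x3 filter give a 2x2 feature map, and every window overlaps the edge, so each response is strongly positive.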
 ReLU introduces non-linearity into the model.
 This is crucial for learning complex patterns.
 Without a non-linear activation function, no matter how many layers the neural
network has, it would still behave like a single-layer linear model.
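The collapse of stacked linear layers can be verified on a tiny example. The weight matrices and the input below are arbitrary integers chosen for illustration, and biases are omitted for brevity.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def relu_matrix(m):
    return [[max(0, v) for v in row] for row in m]

W1 = [[1, -1], [2, 0]]   # first layer weights (hypothetical values)
W2 = [[1, 1], [0, -1]]   # second layer weights (hypothetical values)
x = [[1], [3]]           # input as a column vector

two_linear_layers = matmul(W2, matmul(W1, x))
one_linear_layer = matmul(matmul(W2, W1), x)
print(two_linear_layers == one_linear_layer)  # True

# With ReLU between the layers the result changes, so no single matrix reproduces it in general.
with_relu = matmul(W2, relu_matrix(matmul(W1, x)))
print(two_linear_layers, with_relu)
```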
 Max pooling layers offer a way to reduce spatial dimensions and retain important
features within CNNs, making the network more robust and efficient.
 Many CNN architectures use max pooling or average pooling.
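Max pooling keeps the largest value in each window. The sketch below assumes non-overlapping 2x2 windows (stride equal to the window size), a common configuration; the feature map values are made up.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        pooled.append(row)
    return pooled

feature_map = [[1, 3, 2, 0],
               [4, 6, 1, 1],
               [0, 2, 9, 5],
               [1, 1, 3, 7]]
print(max_pool2d(feature_map))  # [[6, 2], [2, 9]]
```

The 4x4 map shrinks to 2x2 while the strongest activation in each region is retained.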
Fully Connected Layer
 Pretrained CNN models have already been trained on the ImageNet dataset.
 This dataset contains millions of images across thousands of categories.
 These models can be used for a wide range of computer vision tasks, including
classification, detection, and more complex tasks.
 Here are some popular pretrained CNN models that are frequently used in various
applications:
 AlexNet
 Depth: 8 layers (5 convolutional and 3 fully connected).
 Parameter Count: 60 million parameters.
 Learning Capacity: Much greater than shallow networks due to deeper architecture.
 Utilized ReLU activation for faster training and used data augmentation and dropout to
combat overfitting.
 Won the ImageNet challenge (ILSVRC 2012); set a new standard for deep learning in computer vision.
 ZFNet (Zeiler & Fergus Net)
 Depth: Similar to AlexNet.
 Parameter Count: Similar to AlexNet.
 Learning Capacity: Improved performance on image recognition tasks.
 Features: Introduced a visualization technique for understanding what the
intermediate layers of a CNN learn.
 Use Cases: A modified AlexNet that won the ImageNet challenge the following year.
 VGG
 Depth: 16 to 19 layers.
 Parameter Count: 138 million parameters in VGG16, even more in VGG19.
 Learning Capacity: Greater than AlexNet and ZFNet due to increased depth.
 Features: Utilized smaller (3x3) convolution filters throughout which allowed deeper
networks.
 Use Cases: Popular for feature extraction in various image processing tasks.
 GoogLeNet
 Depth: 22 layers.
 Parameter Count: About 4 million.
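 Learning Capacity: Outperformed VGG in the ILSVRC 2014 classification task while
using far fewer parameters.
 Features: Inception modules perform several convolutions of different sizes in
parallel and concatenate their outputs.
 Use Cases: Won the ImageNet challenge (ILSVRC 2014).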
 ResNet
 Depth: 18 to 152 layers, with ResNet-50, ResNet-101, and ResNet-152 being popular
variants.
 Parameter Count: ResNet-50 has around 25 million parameters.
 Learning Capacity: Residual (skip) connections make it possible to train very deep
networks effectively.
 Use Cases: Deep ResNets have achieved state-of-the-art results in various tasks.
 Regularization techniques are used to prevent overfitting.
 Overfitting occurs when a model learns the detail and noise in the training data to
the extent that it negatively impacts the performance of the model on new data.
 The most common regularization techniques used in deep learning are:
 L2 Regularization
 Early Stopping
 Data Augmentation
 Dropout
 Batch Normalization
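Of the techniques listed above, early stopping is the simplest to sketch: training halts once the validation loss stops improving. The `patience` parameter, the function name, and the loss values are illustrative assumptions, not part of the slides.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical validation losses: improving, then rising as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val_losses))  # 3
```

The loss bottoms out at epoch 3; after two epochs without improvement the loop stops and the weights from epoch 3 would be restored.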
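    from keras.models import Sequential
    from keras.layers import LSTM, Dense
    from keras.regularizers import l1_l2

    # Example input dimensions (assumed; the original leaves them undefined)
    timesteps, features = 10, 8

    model = Sequential()
    # kernel_regularizer adds L1 and L2 penalties on the LSTM input weights
    model.add(LSTM(50, input_shape=(timesteps, features),
                   kernel_regularizer=l1_l2(l1=0.01, l2=0.01)))
    model.add(Dense(1))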
 This snippet creates a simple Sequential model with one LSTM layer followed by a
Dense layer.
 The LSTM layer uses both L1 and L2 regularization, each with a regularization factor of
0.01. The kernel_regularizer applies this regularization to the input (kernel) weights of
the LSTM layer.
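The penalty that combined L1 and L2 regularization adds to the loss can be computed by hand. This is a plain-Python sketch of the formula, l1 * sum(|w|) + l2 * sum(w^2), with made-up weight values; it is not the Keras implementation itself.

```python
def l1_l2_penalty(weights, l1=0.01, l2=0.01):
    """Penalty added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]  # hypothetical weight values
print(round(l1_l2_penalty(weights), 4))  # 0.0875
```

The L1 term (0.01 * 3.5) pushes weights toward exactly zero, while the L2 term (0.01 * 5.25) penalizes large weights more heavily.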
 Attention mechanisms enable models to dynamically focus on relevant parts of
the input for making decisions.
 They assign weights to different parts of the input, indicating the importance of
each part in the context of the task.
 This mechanism addresses the limitations of RNNs and LSTMs in processing
very long sequences.

 Applications:
 Machine Translation
 Sentiment Analysis
 Computer Vision etc.
 Limitations: Attention can be computationally expensive with long inputs and is
more prone to overfitting on smaller datasets.
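The weighting idea described above can be sketched with scaled dot-product attention for a single query. This is one common formulation chosen as an assumption, since the slides do not specify which attention variant they mean; the vectors are made-up values.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # importance of each input position, summing to 1
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Hypothetical 2-dimensional keys and values for three input positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 0.0], keys, values)
print(weights, output)
```

Positions whose keys align with the query receive larger weights, so their values dominate the output. The cost grows with the number of positions, which is the expense noted above.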
 A deep neural network (DNN) is also known as a feedforward deep neural network
or a multi-layer perceptron.
 It consists of multiple layers of neurons: an input layer, several hidden layers,
and an output layer. Data moves in one direction, from input to output.
 Can capture hidden patterns in data due to multiple layers.
 It is used in NLP for machine translation, language modeling, and text
classification.
 Needs substantial amounts of labeled data for training to avoid overfitting.
 Without proper regularization, they can overfit to training data, reducing
generalization.
 Performance heavily relies on the choice of hyperparameters, which often
requires trial and error.
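The one-directional flow of data through such a network can be sketched as a forward pass. The layer sizes, weights, and choice of ReLU and sigmoid activations below are hypothetical values used only for illustration.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    # Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
    hidden = dense(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu)
    return dense(hidden, [[1.0, -2.0]], [0.0], sigmoid)

print(forward([2.0, 1.0]))  # [0.5]
```

Here the hidden activations are 1.0 and 0.5, the output pre-activation is 1.0 - 2 * 0.5 = 0, and sigmoid(0) = 0.5.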
 A Recurrent Neural Network (RNN) is a type of artificial neural network (ANN)
specifically designed to recognize patterns in sequences of data.
 It can maintain information in 'memory' over time, which helps in understanding
context in text or time series data.
 Ability to process entire sequences of data (like a sentence or a time series),
which is essential in many applications including vehicle traffic prediction.
 Due to the vanishing gradient problem, it's challenging for RNNs to learn
and remember long-range dependencies in a sequence.
 To learn and remember long-range dependencies in a sequence, RNNs need to
selectively read, write, and forget information.
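The 'memory' of an RNN is a hidden state that is updated at every time step. The sketch below assumes a scalar vanilla RNN with a tanh activation and hand-picked weights (all hypothetical) to show how the influence of an early input fades over time.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # the same weights are reused at every step
        states.append(h)
    return states

# An input pulse at the first step followed by zeros: its trace in h fades over time.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(h, 3) for h in states])
```

Each step shrinks the trace of the first input a little further, which is a small-scale picture of why long-range dependencies are hard for a plain RNN.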
 Long Short-Term Memory (LSTM), a type of RNN, is designed to address the
limitations of traditional RNNs in handling long-term dependencies.
 It effectively captures long-range connections in sequential data, making it highly
suitable for tasks where context over long sequences is crucial.

 Their complexity can lead to overfitting, especially with smaller datasets.
 The LSTM learns an output gate $o_{t-1}$ from the data. The gate restricts what
fraction of the state $s_{t-1}$ is passed on to the next state $s_t$; the gated result
is the hidden representation $h_{t-1}$ of $s_{t-1}$.
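The gating can be sketched for a scalar LSTM cell. This follows the widely used formulation with forget, input, and output gates and a tanh on the cell state, which may differ in detail from the notation on the slides; the parameter values and the dictionary layout are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))       # fraction of the old cell state to keep
    i = sigmoid(pre("input"))        # fraction of the candidate to read in
    o = sigmoid(pre("output"))       # fraction of the cell state to expose
    candidate = math.tanh(pre("candidate"))
    s = f * s_prev + i * candidate   # selective forget plus selective read
    h = o * math.tanh(s)             # selective write: hidden representation
    return h, s

# Hypothetical parameters, identical for every gate to keep the sketch short.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, s = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, s = lstm_step(x, h, s, params)
    print(round(h, 4), round(s, 4))
```

Because the output gate lies between 0 and 1, it scales how much of the cell state `s` reaches the hidden representation `h`; a gate near 0 hides the state entirely.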
 Gated Recurrent Units (GRUs) are designed to model temporal dependencies and
sequences effectively.

 They address the vanishing gradient problem common in traditional recurrent
neural networks (RNNs) by using gating mechanisms.

 GRUs often train faster and require fewer resources than LSTMs because they have
fewer gates and therefore fewer parameters.

 GRUs are suitable for predicting stock prices, weather forecasting, and other
sequential data analysis.

 They can overfit on smaller datasets due to their complex structure.
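The resource difference between GRUs and LSTMs can be estimated by counting parameters. The sketch assumes the textbook formulations, in which an LSTM has four weight blocks (forget, input, output, candidate) and a GRU has three (update, reset, candidate); specific library implementations may add extra bias terms. The layer sizes are arbitrary.

```python
def recurrent_params(n_inputs, n_hidden, n_gate_blocks):
    """Weights and biases of a recurrent layer built from n_gate_blocks blocks,
    each with an input matrix, a recurrent matrix, and a bias vector."""
    per_block = n_hidden * n_inputs + n_hidden * n_hidden + n_hidden
    return n_gate_blocks * per_block

n_inputs, n_hidden = 8, 50
lstm = recurrent_params(n_inputs, n_hidden, 4)  # forget, input, output, candidate
gru = recurrent_params(n_inputs, n_hidden, 3)   # update, reset, candidate
print(lstm, gru)  # 11800 8850
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.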
