Introduction to Machine Learning
By Sneha Sureddy
Syllabus

 Introduction to machine learning, basic mathematics, logistic regression, multilayered perceptron (MLP), fundamentals of deep learning, simpler models.
 Transfer learning, model selection, history of neural networks, learning deep networks as a minimization problem of a mathematical function.
 Validation methods, gradient descent algorithm, optimization using gradient descent, stochastic gradient descent, evaluating the networks, early stopping.
Artificial Intelligence

 Artificial: made by humans. Intelligence: the ability to understand or think.
 Enables machines to mimic human behaviour.
 The aim is to create machines that behave, think, and make decisions like humans.
 The term "Artificial Intelligence" was coined by John McCarthy.
Machine Learning

 Machine Learning, a term coined by Arthur Samuel in 1959, allows a machine to learn from examples and experience without being explicitly programmed.
 Machine Learning is the study of algorithms that improve their performance at some task with experience.
Machine Learning and Deep Learning

 Machine Learning is suitable for structured data: data that fits neatly into data tables and includes discrete data types such as numbers, short text, and dates.
 Deep Learning is suitable for unstructured data, which does not fit neatly into a data table because of its size or nature: for example, audio and video files and large text documents.
Why “Learn”?

 Learning is used when:
 Human expertise does not exist (navigating on Mars),
 Humans are unable to explain their expertise (speech recognition),
 The solution needs to be adapted to particular cases (user biometrics).

Traditional Learning vs. Machine Learning
How do machines learn?

 Machine learning is a method of teaching computers to make predictions based on data.
How To Make A Machine Learn

 An ML algorithm is trained using a training data set to create a model.
 When new input data is introduced to the ML algorithm, it makes a prediction on the basis of the model.
 The prediction is evaluated for accuracy, and if the accuracy is acceptable, the ML algorithm is deployed.
 If the accuracy is not acceptable, the ML algorithm is trained again and again with an augmented training data set.

Training the machine
Supervised Learning

Figure: labeled examples (apple, orange) passed through a decision function / hypothesis (supervised classification).

 In a supervised learning model, the algorithm learns on a labeled dataset, which provides an answer key the algorithm can use to evaluate its accuracy.
 Supervised learning is the setting where you have input variables (x) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output.
Unsupervised Learning

Figure: unlabeled examples passed through a decision function / hypothesis (unsupervised classification).

 An unsupervised model, in contrast, is given unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own.
 It detects patterns based on the typical characteristics of the input data.
 It groups similar data samples and identifies different clusters within the data.
Binary Classification

Figure: a decision function / hypothesis separating the data into two groups.

 Binary classification is the task of classifying the elements of a given set into two groups (predicting which group each one belongs to) on the basis of a classification rule.
Multiclass Classification

Figure: a decision function / hypothesis separating the data into more than two classes.

 A classification task with more than two classes; e.g., classifying a set of images of fruits that may be oranges, apples, or bananas.
REINFORCEMENT LEARNING

 Reinforcement learning can be thought of as a trial-and-error method of learning.
 The machine gets a reward or penalty point for each action it performs: if the action is correct, the machine gains a reward point; if the response is wrong, it gets a penalty point.
Applications of Machine Learning

• Speech and handwriting recognition
• Robotics
• Search engines (information retrieval)
• Learning to classify new astronomical structures
• Medical diagnosis
• Computer vision (object detection algorithms)
• Email filtering
• Stock market analysis
• Game playing, etc.
Figure: sample ImageNet test images with their true labels (mite, container ship, motor scooter, leopard, grille, mushroom, cherry, Madagascar cat) alongside the labels assigned by the machine.

Source: Krizhevsky et al. 2012, ImageNet Classification with Deep Convolutional Neural Networks
Deep Learning in Games

Figure: AlphaGo, machine vs. human (Ke Jie).
Learned Model Parameters

Figure: a training set of examples (x1, y1), …, (xN, yN), each xi with features xi1 … xiM, is fed into a mathematical model to produce learned parameters, which are then used to predict the outcome for a new input.
Logistic Regression

 Logistic regression can be applied when the data is linearly separable.

Figure: linearly separable data; a training set of inputs x1, …, xN with outcomes y1, …, yN is used to learn a model that predicts the outcome for a new input.
Linear Predictive Model

 zi = (b1 × xi1) + (b2 × xi2) + ⋯ + (bM × xiM) + b0

Figure: the features xi1, xi2, …, xiM of a data point are weighted by the parameters b1, …, bM and combined with the bias b0 to produce zi.

 In logistic regression, the bias (also known as the intercept or constant term) is a parameter added to the model to shift the decision boundary or the predicted probability.
Simple Linear Predictive Model

 A very simple idea: multiply every component of the vector xi by a parameter, where
 parameter b1 is multiplied by the first component of xi,
 parameter b2 is multiplied by the second component of xi,
 and parameter bM is multiplied by the M-th component.
 We do all of those multiplications, add them up, and then add an additional constant, called the bias, b0.
 This gives a mapping from the data xi to a number zi.
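As a minimal sketch of this mapping (assuming NumPy; the feature values and parameters below are made up purely for illustration):

```python
# Compute z_i = b1*x_i1 + b2*x_i2 + ... + bM*x_iM + b0 for one data point.
import numpy as np

x_i = np.array([0.5, 0.80, 75.0, 1.2])   # hypothetical features of one data point
b   = np.array([0.9, 1.1, -0.02, 0.4])   # hypothetical weights b1..bM
b0  = -0.5                                # hypothetical bias (intercept)

z_i = np.dot(b, x_i) + b0                 # weighted sum of the features plus the bias
print(z_i)
```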
Will it Rain?

 xi = features for day i; yi = 1 means yes (it rained), yi = 0 means no.

 Cloud Cover | Humidity | Temperature | Air Pressure | Did it Rain
 0.5         | 80%      | 75          | 1.2          | 1
 0.2         | 95%      | 83          | 1.3          | 0

WEATHER:  z1 = (b1 × 0.5) + (b2 × 0.80) + (b3 × 75) + (b4 × 1.2) + b0,  y1 = 1 (RAIN)
          z2 = (b1 × 0.2) + (b2 × 0.95) + (b3 × 83) + (b4 × 1.3) + b0,  y2 = 0
Sigmoid function

 The sigmoid function, also known as the logistic function, is a mathematical function that maps any real-valued number to a value between 0 and 1. It is defined as:
   sigmoid(z) = 1 / (1 + exp(-z))
 The sigmoid function has an S-shaped curve, where:
   for z very negative, sigmoid(z) approaches 0;
   for z very positive, sigmoid(z) approaches 1.
 Sigmoid outputs probabilities, making it suitable for binary classification problems where the target variable is 0 or 1, yes or no, etc.
 Example: the sigmoid of 100 is
   sigmoid(100) = 1 / (1 + e^(-100)) ≈ 1 / (1 + 3.7e-44) ≈ 1
 So the sigmoid of 100 is approximately 1. The sigmoid function approaches 1 as the input value increases and approaches 0 as the input value decreases; an input of 100 is large enough that the output is very close to 1.
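A minimal sketch of the sigmoid function in Python (NumPy assumed); the inputs are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5
print(sigmoid(100))   # ~1.0, since e^(-100) is about 3.7e-44
print(sigmoid(-100))  # ~0.0
```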
Convert to a Probability

 zi = (b1 × 0.5) + (b2 × 0.8) + (b3 × 75) + (b4 × 1.2) + b0

 Passing zi through the sigmoid function gives the chance of rain:
   p(yi = 1 | xi) = σ(zi)

Figure: the features (cloud cover 0.5, humidity 80%, temperature 75, air pressure 1.2) are combined into zi and mapped through the S-shaped sigmoid curve to a probability between 0 and 1.

 The b parameters tell us how important the data variables are to the prediction.
Learned Model Parameters

Figure: the training set (x1, y1), …, (xN, yN) is used to fit the logistic regression model (or “network”): zi = (b1 × xi1) + (b2 × xi2) + ⋯ + (bM × xiM) + b0, followed by σ(zi). The learned parameters are (b0, b1, …, bM).
Motivation of the Multi-Layered Perceptron (MLP)

Extended Logistic Regression

Figure: the features xi1, xi2, …, xiM of the data feed K parallel units zi1, …, ziK, whose outputs σ(zi1), …, σ(ziK) are the probabilities of K latent processes/features.

Figure: those K probabilities are in turn combined into ζi and passed through σ(ζi), the probability of a particular outcome.
Images are Encoded as Numbers

Figure: a 28 × 28 grid of pixel intensity values (0–255) representing a handwritten digit; most entries are 0 (background), and larger values trace the shape of the digit.

Data source: MNIST dataset by LeCun et al. (1999) / CC-BY-SA 3.0
Single Filter (Shallow Learning)

 Use of a single filter only looks for the average shape.
Layer

 A layer is a horizontal slice of a neural network, comprising multiple nodes (neurons) that process inputs and produce outputs.
 A layer can be thought of as a container for multiple filters.
 Layers are responsible for transforming the input data, and each layer builds on the previous one to extract more complex features.
Filter

 A filter is a small, sliding window that scans the input data, performing a specific operation at each position.
 A filter is a vertical slice of a layer.
 Filters are responsible for detecting specific features or patterns in the input data.
Example:
 A convolutional layer might contain multiple filters (filter 1, filter 2, filter 3, etc.) that scan the input image to detect different features (edges, lines, textures, etc.).
 Each filter (filter 1, filter 2, etc.) is applied to the entire input image, generating a feature map that highlights specific features.
Summary

 Layers are the building blocks of a neural network, processing inputs and producing outputs.
 Filters are small, sliding windows within a layer that detect specific features or patterns in the input data.
Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP) typically consists of three types of layers:
1. Input Layer: receives the input features or data. The number of nodes in this layer corresponds to the number of input features.
2. Hidden Layers: perform complex representations and transformations of the input data. There can be one or multiple hidden layers, each consisting of a set of nodes (neurons) with nonlinear activation functions. The hidden layers allow the MLP to learn and represent more abstract features.
3. Output Layer: generates the predicted output. The number of nodes in this layer corresponds to the number of output classes or targets.
In a typical MLP architecture, the layers are fully connected, meaning every node in one layer is connected to every node in the next layer.
BIOLOGICAL MOTIVATION

 Parts of a human neuron:
• DENDRITES: accept the inputs.
• SOMA: processes the inputs.
• AXON: turns the processed inputs into outputs.
 The human brain contains a densely interconnected network of approximately 10^11 neurons, each connected, on average, to 10^4 others.
NEURAL NETWORK REPRESENTATION

 Artificial Neural Networks (ANNs) are programs designed to solve problems by trying to mimic the structure and the function of our nervous system.
 Neural networks are based on simulated neurons, which are joined together in a variety of ways to form networks (the first example is a 3-4-1 architecture, the second a 3-4-2 architecture).

Figure: ANN Model


 A 3-4-2 neural network architecture refers to a neural network with:
• 3 neurons in the input layer,
• 4 neurons in the hidden layer,
• 2 neurons in the output layer.
 Here’s a description of how the architecture looks:
1. Input Layer: consists of 3 neurons. Each of these neurons represents a feature or an input value.
2. Hidden Layer: has 4 neurons. Each neuron in the hidden layer is connected to every neuron in the input layer, with associated weights.
3. Output Layer: has 2 neurons, representing the final output values. Each neuron in the output layer is connected to every neuron in the hidden layer.
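A minimal forward-pass sketch for this 3-4-2 architecture, assuming sigmoid activations in both layers; the weights below are random placeholders, not learned values:

```python
import numpy as np

rng = np.random.default_rng(0)

x  = np.array([0.2, 0.7, 0.1])     # 3 input features (input layer)
W1 = rng.normal(size=(4, 3))       # hidden layer: 4 neurons, each connected to the 3 inputs
b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4))       # output layer: 2 neurons, each connected to the 4 hidden neurons
b2 = np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = sigmoid(W1 @ x + b1)           # hidden activations, shape (4,)
y = sigmoid(W2 @ h + b2)           # output activations, shape (2,)
print(y)
```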
Neuron or Perceptron

 A neuron typically consists of:
1. Inputs: receive data from other neurons or external inputs.
2. Weights: assign importance to each input.
3. Bias: adds a constant value to the weighted sum.
4. Activation function: determines the output.
5. Output: sends the result to other neurons or to the output of the network.
Activation function

 The activation function is used to determine the output of a node given its input. It introduces non-linearity into the output.
 Common activation functions include:
1. Sigmoid: maps the input to a value between 0 and 1.
2. ReLU (Rectified Linear Unit): maps all negative values to 0 and leaves all positive values unchanged.
3. Tanh (Hyperbolic Tangent): maps the input to a value between -1 and 1.
4. Softmax: used for multi-class classification problems; it maps the input to a probability distribution over all classes.
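A minimal sketch of these activation functions (NumPy assumed; the input vector is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes each value into (0, 1)

def relu(z):
    return np.maximum(0.0, z)            # negatives become 0, positives pass through

def tanh(z):
    return np.tanh(z)                    # squashes each value into (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / e.sum()                   # probability distribution over the classes

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), tanh(z), softmax(z))
```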
Neural networks & Deep learning

 Neural networks are a type of machine learning model inspired by the structure and function of the human brain.
 They consist of layers of interconnected nodes (neurons) that process and transmit information.
 Deep learning is a subfield of machine learning that focuses on neural networks with multiple layers, typically more than three.
 Deep learning models are designed to learn complex patterns and representations in data, such as images, speech, and text.
 Here's a simple analogy to illustrate the difference: neural networks are like a network of roads, while deep learning is like a high-speed highway system that uses those roads to quickly and efficiently transport data.
Machine learning and deep learning
Machine Learning is used when:
 Data is relatively small to medium-sized (tens of thousands to hundreds of
thousands of samples).
 Relationships between features and targets are relatively simple.
 Computational resources are limited.
Deep Learning is used when:
 Data is large and complex (millions to billions of samples).
 Relationships between features and targets are non-linear and complex.
 Computational resources like high-end GPUs or TPUs are required
Model Selection

 Logistic regression is restricted to a linear classifier, whereas the multilayer perceptron allows a nonlinear classifier. That nonlinearity in the classification decision yields improved performance.
 In general, if your problem requires:
 simple models with few parameters, use machine learning;
 complex models with many parameters, use deep learning.
Transfer learning
 Transfer learning is a machine learning technique where a model
trained on one task or dataset is re-used or fine-tuned for
another related task or dataset.
 The idea is to leverage the knowledge and features learned from
the initial task to improve performance on the new task, rather
than training a new model from scratch.
 Transfer learning is commonly used in deep learning, where a
pre-trained model is used as a starting point for a new task.
For example, a model trained on image recognition tasks like
ImageNet can be fine-tuned for a specific task like object detection,
segmentation, or facial recognition.

Transfer learning has many benefits, including:
• Reduced training time and data requirements
• Improved performance on the new task
• Sharing knowledge across related tasks
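A minimal fine-tuning sketch, assuming PyTorch and a recent torchvision are available; the 10-class target task and the training details are hypothetical:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a model pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new task (10 classes assumed).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the parameters of the new layer are updated during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```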


Early history of Neural Networks

• 1960, Multilayer Perceptron: the multilayer perceptron was invented around 1960. Computing power was not what it is today, so while the multilayer perceptron was developed, its use was limited and the amount of data available was significantly less than we have today.
• 1986, Back Propagation: back propagation is a method that allows us to learn the parameters of the model in a very efficient way.
• 1989, Convolutional Neural Network: the convolutional neural network was a key technology for analysing images.
The Seasons of Neural Networks

• 1990–1994, Neural Nets in the Wild: the models did not perform well; the amount of data required to train such a model well is significantly larger than was available.
• 1995, Long Short-Term Memory: long short-term memory is a very key technology for analysing data that varies as a function of time.
• 1998–2005, More Neural Nets in the Wild: neural networks did not perform as well as often advertised, and other methods in machine learning came to the forefront.
The Seasons of Neural Networks

• People did not speak about neural networks at all.
• Researchers decided to rebrand the technology with the idea that it really could work effectively, but nobody would pay attention to it just because of the name.
• GPUs provided a computational platform. ImageNet is a dataset of over a million images.
• AlphaGo was based upon convolutional neural networks and reinforcement learning. AlphaGo demonstrated the capacity to play the game Go at a level that exceeded the performance of humans.
Learning deep networks as a minimization problem
 Learn parameters to give us the best performance.
 A loss function defines a penalty for poor predictions.
 We want to minimize the average loss over the training set:

   b* = arg min over b of (1/N) Σ_i ℓ(yi, σ(zi))

 where b* are the optimal parameters, σ(zi) is the predicted probability (the guess), and yi is the true label.
 Cross Entropy, mathematical form:
   ℓ(y, σ(z)) = -y log σ(z) - (1 - y) log(1 - σ(z))
 Empirical Risk Minimization (ERM): the goal is to minimize the difference between the network's predictions and the true labels, measured by a loss function (e.g., cross-entropy for classification problems, mean squared error for regression problems).
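A minimal sketch of the average cross-entropy loss defined above; the labels and predicted probabilities are made-up values for illustration:

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 1.0])        # true labels y_i
p = np.array([0.9, 0.2, 0.7, 0.6])        # predicted probabilities sigma(z_i)
print(cross_entropy(y, p))                # average loss over the N examples
```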
Backpropagation (BP)

 BP allows for efficient computation of gradients of the loss function with respect to the model parameters, which is essential for optimization.
 Backpropagation (BP) works as follows:
1. Forward pass: input data flows through the network, layer by layer, to produce an output.
2. Error calculation: the difference between the predicted output and the actual output (target) is calculated, resulting in an error or loss.
3. Backward pass: the error is propagated backwards through the network, layer by layer, to calculate the gradients of the loss function with respect to each parameter.
4. Weight update: the gradients are used to update the model parameters (weights and biases) to minimize the loss function.
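A minimal sketch of the four steps for a single logistic-regression unit, where the gradient can be written in closed form; the data and learning rate are illustrative:

```python
import numpy as np

X = np.array([[0.5, 0.80],
              [0.2, 0.95],
              [0.9, 0.10],
              [0.1, 0.40]])               # four training examples (made-up features)
y = np.array([1.0, 0.0, 1.0, 0.0])        # true labels
b, b0, lr = np.zeros(2), 0.0, 0.5         # weights, bias, learning rate

for step in range(1000):
    z = X @ b + b0                        # 1. forward pass
    p = 1.0 / (1.0 + np.exp(-z))          #    predicted probabilities
    error = p - y                         # 2. error (dLoss/dz for sigmoid + cross-entropy)
    grad_b = X.T @ error / len(y)         # 3. backward pass: gradient w.r.t. the weights
    grad_b0 = error.mean()                #    gradient w.r.t. the bias
    b -= lr * grad_b                      # 4. weight update (opposite to the gradient)
    b0 -= lr * grad_b0

print(b, b0)                              # learned parameters
```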
Overfitting

 The model performs well during training but does not perform well during testing.
 Causes of overfitting:
1. Model complexity: using a model with too many parameters or layers.
2. Too much training: training the model for too many epochs or iterations.
3. Small training dataset: using a dataset that is too small to capture the underlying patterns.
4. Noise in the data: presence of noise or outliers in the training data.
 To overcome overfitting, techniques like validation, early stopping, and regularization can be used.
Validation methods

 Validation methods in deep learning are essential techniques used to evaluate the performance of a model during training and prevent overfitting.
 Validation, in the context of machine learning and deep learning, refers to the process of evaluating a model's performance on a separate dataset, called the validation set, during training.
 This dataset is not used for training the model, but rather for assessing its performance and making adjustments as needed.
Validation Process:
 Split data: divide the available data into training, validation, and testing sets (e.g., 70% for training, 10% for validation, and 20% for testing).
 Train model: train the model on the training set.
 Evaluate on validation set: evaluate the model's performance on the validation set during training.
 Adjust hyperparameters: based on validation performance, adjust hyperparameters or training parameters.
 Repeat: repeat these steps until validation performance improves.
 Final evaluation: evaluate the final model on the testing set to estimate its performance on new, unseen data.
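A minimal sketch of the 70/10/20 split described above, assuming scikit-learn is available; X and y are placeholder arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)          # hypothetical features
y = np.random.randint(0, 2, 1000)    # hypothetical binary labels

# First carve out 20% for testing, then take 1/8 of the remainder (10% overall)
# for validation, leaving 70% for training.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.125, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 700 100 200
```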
Split Data in Separate Groups

Figure: the available data (x1, y1), …, (xN, yN) is randomly assigned to training, validation, and testing sets.

Figure: the model is trained on the training set, refined using performance estimated on the validation set, and the learned parameters (b0, b1, …, bM) receive a final performance evaluation on the testing set.
 K-Fold Cross-Validation is one of the validation methods. In this method, we divide the dataset into k folds, train the model on k-1 folds, and evaluate on the remaining fold. We repeat this process k times and average the performance metrics.
 These validation methods help ensure that the model generalizes well to new, unseen data and does not overfit the training data. By monitoring performance on a separate validation set, you can adjust hyperparameters to improve the model's performance.
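A minimal k-fold cross-validation sketch (k = 5), assuming scikit-learn; the model and data are placeholders for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 4)           # hypothetical features
y = np.random.randint(0, 2, 200)     # hypothetical binary labels

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))             # evaluate on the held-out fold

print(np.mean(scores))               # average performance over the 5 folds
```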
Gradient Descent (GD)
 Gradient Descent (GD) is an optimization algorithm used to minimize the loss
function in machine learning. It iteratively updates the model parameters to
find the optimal values that result in the lowest loss.
 Here's a step-by-step explanation:
1. Initialize parameters: Start with initial values for the model parameters.
2. Compute loss: Calculate the loss function using the current parameters.
3. Compute gradient: Calculate the gradient of the loss function with respect to
each parameter. The gradient indicates the direction of the steepest ascent.
4. Update parameters: Update the parameters in the opposite direction of the
gradient to minimize the loss
Figure: gradient descent on a one-dimensional loss f(b); from the current point, the slope determines the update step toward the minimum.
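A minimal gradient descent sketch on a simple one-dimensional loss f(b) = (b - 3)^2; the starting point and learning rate are arbitrary choices:

```python
def f(b):
    return (b - 3.0) ** 2                 # loss as a function of the parameter b

def grad_f(b):
    return 2.0 * (b - 3.0)                # derivative of the loss w.r.t. b

b = 0.0                                   # 1. initialize the parameter
lr = 0.1                                  # learning rate (step size)
for step in range(50):
    loss = f(b)                           # 2. compute the loss (for monitoring)
    g = grad_f(b)                         # 3. compute the gradient at the current point
    b = b - lr * g                        # 4. update in the opposite direction of the gradient

print(b, f(b))                            # b approaches 3, the minimizer of f
```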
Gradient Descent Optimizer

 Visualize the entire hypothesis space of possible weight vectors and their associated E values.
 w0, w1: weights of a linear unit.
 E: error for a fixed set of training examples.
 Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps.
 At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface.
Derivation of the Gradient Descent Rule

 How can we calculate the direction of steepest descent along the error surface?
 This direction can be found by computing the derivative of E with respect to each component of the weight vector w:
   ∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
 The weights are then updated in the opposite direction: w ← w - η ∇E(w).
 η is the learning rate; if the step size is too big, we may miss the global minimum, and if it is too small, it takes a lot of time to converge.
Gradient Descent (GD):
1. Batch processing: GD uses the entire training dataset to compute the gradient of the loss function.
2. GD calculates the exact gradient of the loss function.
3. GD updates the model parameters after processing the entire dataset.

Stochastic Gradient Descent (SGD):
1. Online processing: SGD uses a single sample to compute the gradient of the loss function.
2. SGD approximates the gradient of the loss function using a single sample.
3. SGD updates the model parameters after processing each sample or small batch.
Difference between standard gradient descent and stochastic gradient descent:

 Standard gradient descent                          | Stochastic gradient descent
 ---------------------------------------------------|---------------------------------------------------
 Error is summed over all examples before updating  | Weights are updated upon examining each training
 the weights                                        | example
 More computation time per weight update            | Less computation time per weight update
 Used with a larger step size per weight update     | Used with a smaller step size per weight update
Summary

SGD has several advantages over GD:
1. SGD converges faster than GD, especially for large datasets.
2. SGD requires less memory since it only processes a single sample or small batch at a time.
 GD is more accurate but slower, while SGD is faster but less accurate. The choice between GD and SGD depends on the specific problem, dataset size, and computational resources.
 Mini-Batch Gradient Descent can be used in practice. It uses a small batch of samples to compute the gradient.
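A minimal mini-batch stochastic gradient descent sketch for the logistic-regression model used earlier; the data, batch size, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 4))                                        # hypothetical features
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0.2).astype(float)    # hypothetical labels

b, b0, lr, batch = np.zeros(4), 0.0, 0.1, 32      # parameters, learning rate, batch size

for epoch in range(20):
    order = rng.permutation(len(X))               # shuffle the samples each epoch
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]          # one mini-batch of samples
        z = X[idx] @ b + b0
        p = 1.0 / (1.0 + np.exp(-z))
        error = p - y[idx]                        # gradient of cross-entropy w.r.t. z
        b -= lr * X[idx].T @ error / len(idx)     # update after each mini-batch
        b0 -= lr * error.mean()

print(b, b0)
```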
Evaluating neural networks

Evaluating neural networks involves assessing their performance on a given task. Here are some ways to evaluate neural networks:
1. Accuracy: the proportion of correct predictions.
   Accuracy = (TP + TN) / (TP + TN + FP + FN)
   where TP = true positives (correctly predicted instances), TN = true negatives (correctly predicted non-instances), FP = false positives (incorrectly predicted instances), FN = false negatives (incorrectly predicted non-instances).
2. Loss: the difference between predicted and actual outputs.
   Cross-entropy loss: Loss = -(1/n) * Σ (y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
   where y_true = actual labels, y_pred = predicted labels (probabilities), n = number of samples.
3. Precision: the proportion of true positives among all positive predictions.
   Precision = TP / (TP + FP)
4. Recall: the proportion of true positives among all actual positive instances.
   Recall = TP / (TP + FN)
5. F1-score: the harmonic mean of precision and recall.
   F1-score = 2 * (Precision * Recall) / (Precision + Recall)
Note: these formulas assume a binary classification problem.
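A minimal sketch of these binary-classification metrics; the label arrays are made-up values for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # hard 0/1 predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```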
Early stopping

 Early stopping is a technique used to prevent overfitting in neural networks.
 Early stopping uses the validation set to stop training when the model's performance on the validation set starts to degrade.
 Monitor the validation loss during training and stop training when the loss stops improving (i.e., when the loss starts to increase).
This helps to:
 1. Prevent overfitting.
 2. Save computational resources: training stops earlier, reducing the computational time.
 3. Improve generalization: the model is encouraged to generalize better to unseen data.
 By implementing early stopping, you can train more efficient and effective neural networks.
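A minimal early-stopping sketch using a patience counter; the validation losses below are simulated numbers standing in for a real training loop:

```python
# Simulated validation losses: improving at first, then degrading (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.44, 0.46, 0.47, 0.50, 0.55, 0.60]

best_loss = float("inf")
patience, waited = 3, 0                   # stop after 3 epochs without improvement
best_epoch = 0

for epoch, val_loss in enumerate(val_losses):
    # ... one pass over the training set would happen here ...
    if val_loss < best_loss:
        best_loss, best_epoch, waited = val_loss, epoch, 0   # improvement: keep training
    else:
        waited += 1                        # no improvement this epoch
        if waited >= patience:
            break                          # validation loss stopped improving: stop early

print(best_epoch, best_loss)               # the model from this epoch would be kept
```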
 Linear regression and logistic regression are both supervised learning algorithms used for prediction, but they differ in their approach and application:

Linear Regression
 1. Continuous output: predicts a continuous value (e.g., price, temperature).
 2. Linear relationship: assumes a linear relationship between inputs and output.
 3. Mean squared error: optimizes the mean squared error between predicted and actual values.
 4. Regression: used for regression tasks, like predicting a continuous value.
 No activation function is used here; the model captures linear relationships.
Logistic Regression
 1. Binary output: predicts a binary value (e.g., 0/1, yes/no).
 2. Non-linear relationship: uses a sigmoid function to model a non-linear relationship between inputs and output.
 3. Cross-entropy loss: optimizes the cross-entropy loss between predicted probabilities and actual labels.
 4. Classification: used for classification tasks, like predicting a binary label.
 Here an activation function (sigmoid) is used to convert the linear regression equation into the logistic regression equation, i.e., the model captures non-linear relationships.
Underfitting and overfitting

 Underfitting and overfitting are two common problems in machine learning:
Underfitting
 Occurs when a model is too simple or has too few parameters to capture the underlying patterns in the data.
 The model fails to learn from the training data and performs poorly on both training and test data.
 Symptoms: high bias, low variance, poor performance on training data, poor performance on test data.
 Solutions: increase model complexity, add more features and samples, use a different algorithm.
Overfitting
 Occurs when a model is too complex or has too many parameters, fitting the noise in the training data rather than the underlying patterns.
 The model performs well on training data but poorly on test data.
 Symptoms: low bias, high variance, good performance on training data, poor performance on test data.
 Solutions: regularization, early stopping, data augmentation, cross-validation.
To avoid underfitting and overfitting, aim for a balance between model complexity and data complexity. Use techniques like cross-validation, regularization, and early stopping to find the optimal balance.
Bias-Variance

 Bias: error introduced by simplifying assumptions or approximations in the model.
 Variance: error introduced by sensitivity to small fluctuations in the training data.
 Tradeoff: high bias (underfitting) means the model is too simple and misses important patterns; high variance (overfitting) means the model is too complex and fits noise in the training data.
 Optimal model: balances bias and variance, finding a sweet spot between underfitting and overfitting.
 Consequences of imbalance:
 High bias: the model is too simplistic, misses important relationships, and performs poorly on both training and test data.
 High variance: the model is too complex, fits noise in the training data, and performs poorly on test data (but well on training data).
 Techniques to manage the bias-variance tradeoff: regularization, cross-validation, early stopping, ensemble methods (e.g., bagging, boosting).
 By understanding and managing the bias-variance tradeoff, you can build more accurate and generalizable machine learning models.
