Machine Learning Fundamentals

Uploaded by

Khushi

Machine Learning

Fundamentals
Feature Selection Techniques
Goal: To find the best set of features that allows one to build optimized models of studied
phenomena.

Task: Classification problem where we aim to predict whether a given email is spam.

Dataset columns (candidate features):
Length of the email
Number of exclamation marks
Presence of certain keywords (e.g., "free", "offer", "discount")
Number of spelling errors
Presence of attachments
Use of capital letters.

Which ones should be selected as good Features for this particular task?
Simple Ways to Find Good Features
Correlation Analysis: We can analyze the correlation between each feature and the target
variable (spam or not spam). Features whose correlation with the target is large in absolute
value are likely to be more informative. For example, if emails containing certain keywords
are more likely to be spam, then the presence of those keywords would be a relevant feature.

Feature Importance: Train a model (such as a decision tree or a random forest) and examine
the feature importances provided by the model. Features with higher importance scores
contribute more to the predictive performance of the model and are thus more relevant.

Domain Knowledge: Sometimes, certain features might look irrelevant from a statistical
standpoint but are important from a domain perspective. For instance, the number of
spelling errors may show only a weak correlation with the spam label in a given sample, yet
it could still be an important feature, because domain experience says that spam emails tend
to have more spelling errors due to their low-quality content.
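
The correlation analysis described above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the eight emails, the feature names `has_keyword` and `email_length`, and their values are hypothetical toy data, and the Pearson coefficient is only one possible correlation measure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: 1 = spam, 0 = not spam.
is_spam = [1, 1, 1, 0, 0, 0, 1, 0]
has_keyword = [1, 1, 0, 0, 0, 0, 1, 0]            # "free", "offer", ...
email_length = [120, 300, 80, 310, 90, 150, 200, 210]

scores = {
    "has_keyword": pearson(has_keyword, is_spam),
    "email_length": pearson(email_length, is_spam),
}
# Rank features by the absolute value of their correlation with the target.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
print(ranked, scores)
```

On this toy data the keyword flag ranks well above the email length, which matches the intuition in the text.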
Utilities of Feature Selection
To reduce the dimensionality of feature space.

To speed up a learning algorithm.

To improve the predictive accuracy of a classification algorithm.

To improve the comprehensibility of the learning results.

Methods of Feature Selection
Filter Methods:

Techniques to Implement Filter Methods

Information Gain:
The amount of information a feature provides for identifying the
target value, measured as the reduction in entropy of the target once
the feature is known. The features with the least information gain
are filtered out.
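
As a rough sketch of the information-gain idea, the code below computes the entropy of the labels and how much it drops after splitting on a discrete feature. The function names and the four-sample toy vectors are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = [1, 1, 0, 0]                   # hypothetical spam labels
informative = [1, 1, 0, 0]              # feature that separates the classes
uninformative = [0, 1, 0, 1]            # feature unrelated to the classes

print(information_gain(informative, labels))    # gain equals the label entropy
print(information_gain(uninformative, labels))  # no reduction in entropy
```

A filter based on this score would keep the first feature and drop the second.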
Techniques for Filter Methods
Correlation Analysis: The correlation between each input attribute and the output
attribute is computed, and the inputs with the least correlation are filtered out.
Variance Threshold: The features that have the least variance in their data are
filtered out.

Mean Absolute Difference: Similar to variance, but it averages the absolute
deviations from the mean instead of the squared deviations.

Dispersion Ratio: Dispersion ratio is defined as the ratio of the arithmetic
mean (AM) to the geometric mean (GM) of a given feature (defined for
positive-valued features). A higher dispersion ratio implies a more relevant
feature.
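
The three dispersion-based scores above can be sketched as follows. The feature names, their values, and the threshold of 0.0 are hypothetical; the variance here is the population variance, which is one of several conventions.

```python
import math

def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mean_absolute_difference(values):
    """Mean of absolute deviations from the mean (no squaring)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def dispersion_ratio(values):
    """Arithmetic mean divided by geometric mean; needs strictly positive values."""
    am = sum(values) / len(values)
    gm = math.exp(sum(math.log(v) for v in values) / len(values))
    return am / gm

# Hypothetical feature columns.
features = {
    "exclamation_marks": [1, 2, 8, 1, 16, 4],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}

threshold = 0.0  # assumed cut-off: drop features with no variance at all
kept = [name for name, col in features.items() if variance(col) > threshold]
print(kept)
```

A constant column has zero variance and a dispersion ratio of 1, so every one of these filters would discard it.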
Method for Feature Selection:
Wrapper Method
Greedy algorithms that train the model on a subset of features in an iterative manner. Based on the
conclusions drawn from the previous round of training, features are added to or removed from the
subset.
Stopping criteria for selecting the best subset are usually pre-defined by the person training the model,
such as when the performance of the model decreases or a specific number of features has been
reached.
The main advantage of wrapper methods over filter methods is that they evaluate feature subsets with
the model itself, which usually results in better accuracy than filter methods. Because the search is
greedy, the subset found is not guaranteed to be the optimal one.
The main drawback is that they are computationally more expensive.
Techniques for Wrapper Methods
Forward selection – This method is an iterative approach where we start with an empty set of
features and, in each iteration, add the feature that best improves our model. We stop when
adding a new variable no longer improves the performance of the model.
Backward elimination – This method is also an iterative approach where we start with all
features and, in each iteration, remove the least significant feature. We stop when removing a
feature no longer improves the performance of the model.
Bi-directional elimination – This method applies forward selection and backward elimination
together, adding and removing features until it settles on a single solution.
Exhaustive selection – This technique is the brute-force approach to evaluating feature subsets.
It creates all possible subsets, trains a model on each subset, and selects the subset whose
model performs best.
Recursive elimination – This greedy optimization method selects features by recursively considering
smaller and smaller sets of features. The estimator is trained on an initial set of features and their
importance is obtained from an attribute such as feature_importances_. The least important features
are then removed from the current set until we are left with the required number of features.
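
Forward selection, as described above, can be sketched with any scoring function. Here the score is the training accuracy of a small nearest-centroid classifier, chosen only to keep the example self-contained; the classifier, the function names, and the six-row toy dataset (column 0 is an informative keyword flag, column 1 is noise) are all assumptions.

```python
def centroid_accuracy(rows, labels, cols):
    """Training accuracy of a nearest-centroid classifier restricted to `cols`."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(m[c] for m in members) / len(members) for c in cols]
    correct = 0
    for r, y in zip(rows, labels):
        pred = min(
            centroids,
            key=lambda cls: sum((r[c] - centroids[cls][i]) ** 2 for i, c in enumerate(cols)),
        )
        correct += pred == y
    return correct / len(rows)

def forward_selection(rows, labels, n_features, score=centroid_accuracy):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best_score = [], 0.0
    while len(selected) < n_features:
        candidates = [c for c in range(n_features) if c not in selected]
        top_score, top_col = max((score(rows, labels, selected + [c]), c) for c in candidates)
        if top_score <= best_score:
            break  # adding any remaining feature does not improve the model
        selected.append(top_col)
        best_score = top_score
    return selected, best_score

# Hypothetical data: [keyword_flag, noise]
rows = [[1, 5], [1, 2], [1, 7], [0, 4], [0, 3], [0, 5]]
labels = [1, 1, 1, 0, 0, 0]
selected, best_score = forward_selection(rows, labels, n_features=2)
print(selected, best_score)
```

In practice the score would normally come from cross-validation rather than training accuracy; backward elimination follows the same loop but starts from all features and removes one per round.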
Methods for Feature Selection:
Embedded Methods:
In embedded methods, the feature selection algorithm is built into the learning
algorithm, which has its own feature selection mechanism.
Embedded methods counter the drawbacks of filter and wrapper methods and merge their
advantages.
These methods are fast, like filter methods, but more accurate than filter methods,
and they also take combinations of features into consideration.
Techniques for Embedded Methods
Regularization – This method adds a penalty to different parameters of
the machine learning model to avoid over-fitting of the model. With an
L1 (Lasso) penalty, some coefficients shrink to exactly zero, which
removes the corresponding features.

Tree-based methods – Methods such as Random Forest and Gradient
Boosting provide feature importances, which can also be used to select
features. Feature importance tells us which features have more impact
on the target.
Loss Functions In Machine Learning
A loss function measures how far your model's
predictions are from the expected outcome.
The loss is computed directly from the predictions
of the model you’ve built. If the loss value is low,
the predictions are close to the expected outcomes
on the data it was computed on.
Mean square error
The Mean Squared Error (MSE) measures how
close a regression line is to a set of data
points.
It is a risk function corresponding to the
expected value of the squared error loss.
MSE is calculated by taking the mean of the
squared errors between the data and the
function's predictions:
MSE = (1/n) * Σ (yᵢ - ŷᵢ)², where yᵢ is the
actual value and ŷᵢ the predicted value.
Advantages of MSE
It offers faster convergence in scenarios where the error values
are relatively small and consistent.
The key to its rapid convergence lies in the way squaring
amplifies errors.
For larger errors, the squared term magnifies their impact (the
gradient grows in proportion to the error), which accelerates the
minimization process during training.
Disadvantages of MSE
Outliers are data points that significantly deviate from the norm
and often don’t conform to the overall trend. Because MSE squares
every error, large errors are weighted far more heavily than small
ones, which means outliers have a substantial impact on the loss
calculation.
This can compromise model performance on normal data points. In
essence, MSE amplifies the influence of outliers, undermining the
model’s ability to generalize effectively.
Mean absolute error
MAE is calculated as the sum of absolute
errors divided by the sample size.

Instead of squaring the error terms, MAE takes the absolute
value of the differences between predicted and actual
values.
This makes MAE more robust to outliers than MSE.
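
A small numerical comparison makes the outlier behaviour of the two losses concrete. The target and prediction values below are made up for illustration; the function names are not from the source.

```python
def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_pred = [2.0, 6.0, 6.0, 10.0]
y_clean = [3.0, 5.0, 7.0, 9.0]        # every error has magnitude 1
y_outlier = [3.0, 5.0, 7.0, 29.0]     # last target is an outlier (error of 19)

print(mse(y_clean, y_pred), mae(y_clean, y_pred))      # 1.0 1.0
print(mse(y_outlier, y_pred), mae(y_outlier, y_pred))  # 91.0 5.5
```

A single outlier multiplies the MSE by 91 but the MAE by only 5.5, which is the sense in which MAE is the more robust of the two.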
Binary cross-entropy
Let’s consider a simple classification problem:
Our dataset consists of only one feature, x:
x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]
Labels = Red/Green
Given a feature x, what is its label?
Binary cross-entropy
Is the point Green?

What is the probability that the point is Green?

What should our loss function look like?

It should measure how good or bad our predicted probabilities
are.
It should return high values for bad
predictions and low values for good predictions.
Binary cross-entropy
The cross-entropy loss decreases as the predicted probability
converges to the actual label.
It measures the performance of a classification model whose
predicted output is a probability value between 0 and 1.
Binary Cross Entropy

Say, we assign the class green to 1, and red to 0.

Since we’re trying to compute a loss, we need to penalize bad predictions.
If the probability associated with the true class is 1.0, we need
its loss to be zero.
Conversely, if that probability is low, say, 0.01, we need its loss to be HUGE!
Since the log of values between 0.0 and 1.0 is negative, we take the
negative log to obtain a positive value for the loss.
Binary cross-entropy
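
Putting the negative-log idea together gives the usual form of the loss, BCE = -(1/N) * Σ [yᵢ*log(pᵢ) + (1 - yᵢ)*log(1 - pᵢ)], where yᵢ is the label (green = 1, red = 0) and pᵢ the predicted probability of green. The sketch below is one way to compute it; the `eps` clipping is an added safeguard against log(0) and is an assumption, not something the slides specify.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1], [0.99])  # small loss
confident_wrong = binary_cross_entropy([1], [0.01])  # large loss
print(confident_right, confident_wrong)
```

A green point predicted with probability 0.99 costs about 0.01, while the same point predicted with probability 0.01 costs about 4.6.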
Hinge Loss
Hinge loss penalizes wrong
predictions, and also right
predictions that are not
confident (those inside the margin).

L = max(0, margin – y*f(x))

y – The actual class (1 or -1)

f(x) – the output of the classifier for the datapoint

Case 1: Correct classification and |f(x)| ≥ margin (margin = 1 in the graph)
Case 2: Correct classification and |f(x)| < margin (margin = 1 in the graph)
Hinge Loss: Correct Classification and |f(x)| ≥ Margin

In this case, y*f(x) is positive and at least 1, so the loss is always 0.
Hinge Loss: Correct Classification and |f(x)| < 1

In this case, y*f(x) is positive but less than 1, so 1 - y*f(x) is greater than 0
and some penalty is applied.

Here, although the model has correctly classified the data, we are
penalizing the model because it has not classified it with much
confidence: the classification score |f(x)| is less than 1.
Hinge Loss: Incorrect Classification
In this case y and f(x) have opposite signs,
so the product y*f(x) is always negative.
The loss max(0, 1 - y*f(x)) is therefore
always equal to 1 - y*f(x).
Here the loss increases linearly with |f(x)|,
i.e., with how far the point lies on the
wrong side of the boundary.
Hinge Loss

Margin = 0.22

L = max(0, margin - y*f(x))

Case 0: y*f(x) = +1*(+0.560) = 0.560; 0.22 - 0.560 = -0.340; L = max(0, -0.340) = 0
Case 2: y*f(x) = 0.150; 0.22 - 0.150 = 0.07; L = max(0, 0.07) = 0.07
Case 3: y*f(x) = +1*(-0.24) = -0.24; 0.22 - (-0.24) = 0.460; L = max(0, 0.460) = 0.460
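
The three worked cases can be checked with a one-line implementation of L = max(0, margin - y*f(x)). The function name and the dictionary of cases are introduced here for illustration; the margin and scores are the ones from the example.

```python
def hinge_loss(y, score, margin=1.0):
    """Hinge loss for a label y in {+1, -1} and a raw classifier output f(x)."""
    return max(0.0, margin - y * score)

margin = 0.22
cases = {
    "confident and correct": hinge_loss(+1, 0.560, margin),   # outside the margin
    "correct, inside margin": hinge_loss(+1, 0.150, margin),  # small penalty
    "incorrect": hinge_loss(+1, -0.24, margin),                # largest penalty
}
for name, loss in cases.items():
    print(f"{name}: {loss:.3f}")
```

The losses come out as 0, 0.07 and 0.46, in line with the hand calculation.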
Optimization Algorithm in Machine Learning
Optimization is the process of training the model iteratively so that
the function being evaluated reaches a maximum or a minimum.

Why do we need to optimize?

To compare the results of every iteration, changing the
hyperparameters at each step until we reach the optimum results.
This helps to create an accurate model with a lower error rate.
Maxima and Minima
A maximum is the largest and a minimum
is the smallest value of a function
within a given range.
Global Maxima and Minima: The
maximum value and minimum value,
respectively, over the entire domain
of the function.
Local Maxima and Minima: The
maximum value and minimum value,
respectively, of the function within
a given range.
A function has only one global
minimum value and one global maximum
value, but it can have more than one
local minimum and local maximum.
Differentiable Functions
Gradient Descent is an optimization algorithm.
It finds a local minimum of a differentiable
function.
It is a minimization algorithm that minimizes a given
function.
A function f(x) is differentiable at a point a if f’(a) exists.

f(x) is differentiable on an entire range if it is
differentiable at each point therein.
GRADIENT DESCENT
The derivative at a point is the
slope of the tangent there, tan(Theta).
tan(Theta) < 0 on the LHS
of the curve (left of the minimum).
tan(Theta) > 0 on the RHS
of the curve (right of the minimum).
GRADIENT DESCENT
The slope changes
its sign from negative
to positive at the
minimum.
As we move closer
to the minimum, the
magnitude of the slope reduces.
GRADIENT DESCENT
Objective: Calculate X*, the local minimum of the function Y = X².
Pick an initial point X₀ at random.
Calculate X₁ = X₀ - r[df/dx] at X₀. r is the learning rate, and df/dx is nothing but the gradient.
(Let us take r = 1. Note that for Y = X², where df/dx = 2X, this gives X₁ = -X₀, so the
iterates only converge with a smaller rate such as r = 0.1.)
Calculate X₂ = X₁ - r[df/dx] at X₁.
Calculate for all the points: X₁, X₂, X₃, ..., Xᵢ₋₁, Xᵢ
General formula for calculating the local minimum: Xᵢ = Xᵢ₋₁ - r[df/dx] at Xᵢ₋₁
When |Xᵢ - Xᵢ₋₁| is small, i.e., when Xᵢ₋₁ and Xᵢ converge, we stop the iteration
and declare X* = Xᵢ.
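
The update rule above translates almost directly into code. This is a minimal sketch for Y = X², whose gradient is 2X; the starting point 5.0, the tolerance, and the iteration cap are assumed values, and the second call shows what happens with r = 1 on this particular function.

```python
def gradient_descent(grad, x0, r=0.1, tol=1e-6, max_iter=10_000):
    """Iterate x_i = x_{i-1} - r * grad(x_{i-1}) until successive points are within tol."""
    x_prev = x0
    for i in range(1, max_iter + 1):
        x_new = x_prev - r * grad(x_prev)
        if abs(x_new - x_prev) < tol:
            return x_new, i
        x_prev = x_new
    return x_prev, max_iter  # did not converge within max_iter steps

# Y = X**2, so df/dx = 2X.
x_star, steps = gradient_descent(lambda x: 2 * x, x0=5.0, r=0.1)
print(x_star, steps)

# With r = 1 each step maps x to -x, so the iterates never settle.
x_osc, steps_osc = gradient_descent(lambda x: 2 * x, x0=5.0, r=1.0, max_iter=50)
print(x_osc, steps_osc)
```

With r = 0.1 each step multiplies X by 0.8, so the iterates shrink steadily towards the minimum at 0.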
Learning Rate in GD and why do we need it
Learning Rate is a hyperparameter or tuning parameter that
determines the step size at each iteration while
moving towards minima in the function.
X₁ = X₀-r[df/dx] at X₀. r is
Learning Rate.
For example, if r = 0.1 in the initial step, it can be taken as
r=0.01 in the next step.
Likewise it can be reduced exponentially as we iterate further.
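
The exponential reduction mentioned above can be written as r_k = r₀ * decay^k. The sketch below reproduces the 0.1 to 0.01 example from the text; the function name and the choice of a decay factor of 0.1 per step are assumptions made to match that example.

```python
def exponential_decay(r0, decay, step):
    """Learning rate at a given step: r0 * decay**step."""
    return r0 * decay ** step

# r = 0.1 in the initial step, 0.01 in the next one, and so on.
rates = [exponential_decay(0.1, 0.1, k) for k in range(3)]
print(rates)
```

In practice the decay factor is usually much closer to 1, so that the step size shrinks gradually over many iterations.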
Learning Rate in GD and why do we need it
What can happen if we keep the learning rate constant?
If the constant rate is too large, the updates can overshoot the minimum
and keep oscillating around it instead of converging.

Disadvantages of Gradient Descent

Can get stuck in a local minimum.
When n (the number of data points) is large, the time it takes
for k iterations to compute the optimum vector becomes very
large.
Computation complexity O(knd), where:
k is number of iterations.
n is the number of samples in your dataset
d is the number of features you explore.
Stochastic Gradient Descent
In GD, the whole dataset is used in each iteration of the
algorithm, so it is sometimes also called ‘Batch’ GD.
Instead of using all the samples of our dataset, we randomly
select one of them and perform a GD step on it.
SGD is stochastic in nature, i.e., it picks a “random” instance of
the training data at each step and then computes the gradient on it,
making each step much faster as there is much less data to
manipulate at a single time.
The convergence path of SGD is a little noisy.
Mini-Batch Gradient Descent
In each iteration, a small subset (mini-batch) of the dataset is chosen and used
to compute the gradient and update the parameters.

Mini-Batch GD with Momentum
Momentum is an optimization technique that accelerates the
optimization process by adding a fraction of the previous update to the
current update.
MBGD with Momentum: Steps
1. Initialize the model parameters, and set the momentum term to zero.
2. Divide the training dataset into mini-batches.
3. For each mini-batch:
   a. Perform a forward pass to compute predictions.
   b. Calculate the loss and its gradients with respect to the parameters on the mini-batch.
   c. Update the momentum term using the current gradients and the momentum
      hyperparameter.
   d. Update the model’s parameters using the momentum-adjusted gradient updates.
4. Repeat step 3 for each mini-batch in an iteration.
5. Update the learning rate and momentum term if needed for subsequent epochs.
6. Repeat steps 2 to 5 for a predefined number of iterations.
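
The steps above can be sketched for a one-feature linear model y ≈ w*x + b trained with a squared-error loss. Everything specific here is an assumption for illustration: the synthetic data y = 2x + 1, the hyperparameter values, and the momentum convention v = beta*v + lr*gradient (other formulations exist). The learning rate is kept fixed, so step 5 is omitted.

```python
import random

def minibatch_gd_momentum(xs, ys, lr=0.05, beta=0.9, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent plus momentum."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0          # step 1: parameters
    v_w, v_b = 0.0, 0.0      # step 1: momentum terms start at zero
    indices = list(range(len(xs)))
    for _ in range(epochs):                      # step 6: repeat for several epochs
        rng.shuffle(indices)                     # step 2: new mini-batches each epoch
        for start in range(0, len(indices), batch_size):   # steps 3 and 4
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]    # forward pass and error
                grad_w += 2 * err * xs[i] / len(batch)   # gradient of the mean squared error
                grad_b += 2 * err / len(batch)
            v_w = beta * v_w + lr * grad_w       # fraction of previous update + current one
            v_b = beta * v_b + lr * grad_b
            w -= v_w
            b -= v_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(-10, 11)]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_gd_momentum(xs, ys)
print(round(w, 3), round(b, 3))
```

Setting batch_size=1 turns this into SGD, and batch_size=len(xs) into batch GD, which is the main difference between the three variants.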
Difference between GD, SGD, and MBGD
Convergence in BGD, SGD, MBGD
Supervised Learning: Example Application
An emergency room in a hospital measures 17 variables (e.g., blood
pressure, age, etc) of newly admitted patients.
A decision is needed: whether to put a new patient in an intensive-care
unit.
Due to the high cost of ICU, those patients who may survive less than a
month are given higher priority.
Problem: to predict high-risk patients and discriminate them from low-risk
patients.
Intuition
Like human learning from past experiences.
A computer does not have “experiences”.
A computer system learns from data, which represent some “past
experiences” of an application domain.
Our focus: Learn a target function that can be used to predict the
values of an attribute.
The task is commonly called: Supervised learning, classification, or
inductive learning.
Supervised learning process: two steps
Learning (training): Learn a model using the training data.
Testing: Test the model using unseen test data to assess the model
accuracy.
The Process of Learning
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T
if after learning the system’s performance on T improves as
measured by M.
In other words, the learned model helps the system to perform T
better as compared to no learning.

CS583, Bing Liu, UIC

An example
Data: Loan application data
Task: Predict whether a loan should
be approved or not.
Performance measure: accuracy.

No learning: classify all future applications (test data) to the majority
class (i.e., Yes):
Accuracy = 9/15 = 60%.
We can do better than 60% with
learning.
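
The "no learning" baseline is easy to reproduce: always predict the most common class. The sketch assumes only the class counts given above (9 "Yes" out of 15 applications); the function name and the ordering of the labels are made up.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority_class, count = Counter(labels).most_common(1)[0]
    return majority_class, count / len(labels)

labels = ["Yes"] * 9 + ["No"] * 6   # 15 loan applications, 9 approved
majority_class, accuracy = majority_baseline(labels)
print(majority_class, accuracy)     # Yes 0.6
```

Any learned model is worth keeping only if its accuracy on unseen data beats this 60% figure.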


Linear Regression
Two different types of variable are there:
Independent (=predictor) variable X
Dependent (=outcome) variable Y.

Exploration of Linear
relation between these two
variables:
Y=mX+B
Linear Regression
In Y = mX + B, m is the slope and B is the intercept.
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
Regression Equation
Expected value of y at a given x: E(y | x) = m*x + B
Predicted value for an individual: yᵢ = m*xᵢ + B + random errorᵢ
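
For a single predictor, the slope and intercept of Y = mX + B can be estimated by ordinary least squares: m is the covariance of X and Y divided by the variance of X, and B is mean(Y) - m*mean(X). The sketch below applies this to hypothetical points that lie exactly on y = 2x + 1; the function name and data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept B for Y = mX + B."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # hypothetical data on the line y = 2x + 1
m, b = fit_line(xs, ys)
print(m, b)                   # slope 2: each 1-unit change in X changes Y by 2
```

With noisy data the fitted line no longer passes through every point, and the residuals play the role of the random error term.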
Assumptions for Linear Regression
Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of
variances/ homoscedasticity)
4. The observations are independent
Homoscedasticity
The standard error of Y given X is the average variability around the
regression line at any given value of X. It is assumed to be equal at
all values of X.
Types of Linear Regression
Simple Linear Regression: A single independent variable
is used to predict the value of a numerical dependent variable.
Multiple Linear Regression: More than one independent
variable is used to predict the value of a numerical dependent
variable.
Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent
variables. Linear regression additionally assumes that there is
little or none of it.
Types of Regression Line
A straight line showing the relationship between the dependent and independent variables is called
a regression line.

Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is called a negative linear relationship.
