Machine Learning Fundamentals

Uploaded by

Khushi
Copyright
© © All Rights Reserved
Machine Learning

Fundamentals
Feature Selection Techniques
Goal: To find the best set of features that allows one to build optimized models of studied
phenomena.

Task: Classification problem where we aim to predict whether a given email is spam.

Dataset Heads:
Length of the email
Number of exclamation marks
Presence of certain keywords (e.g., "free", "offer", "discount")
Number of spelling errors
Presence of attachments
Use of capital letters.

Which ones should be selected as good Features for this particular task?
Simple Ways to Find Good Features
Correlation Analysis: We can analyze the correlation between each feature and the target
variable (spam or not spam). Features with high correlation are likely to be more
informative. For example, if emails containing certain keywords are more likely to be spam,
then the presence of those keywords would be a relevant feature.

Feature Importance: Train a model (such as a decision tree or a random forest) and examine
the feature importances provided by the model. Features with higher importance scores
contribute more to the predictive performance of the model and are thus more relevant.

Domain Knowledge: Sometimes a feature looks weak from a statistical standpoint but still matters from a domain perspective. For instance, the number of spelling errors may not correlate strongly with the spam label in a given sample, yet it can still be a useful feature if spam emails tend to have more spelling errors due to their low-quality content.
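A minimal sketch of the correlation analysis described above, using only the Python standard library. The feature values and labels are made up for illustration, and `pearson` is a hypothetical helper written here, not a library function.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

# Illustrative (made-up) data: one entry per email, label 1 = spam, 0 = not spam.
is_spam = [1, 1, 1, 0, 0, 0]
features = {
    "exclamation_marks": [5, 7, 4, 0, 1, 0],
    "has_keyword": [1, 1, 1, 0, 0, 1],
    "email_length": [120, 300, 80, 110, 310, 90],
}

# Rank features by the absolute value of their correlation with the target.
ranking = sorted(
    ((name, pearson(values, is_spam)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

On this toy data the email length ends up at the bottom of the ranking, which is the kind of signal that would suggest dropping it. A real analysis would use far more samples.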
Utilities of Feature Selection
To reduce the dimensionality of feature space.

To speed up a learning algorithm.

To improve the predictive accuracy of a classification algorithm.

To improve the comprehensibility of the learning results.


Methods of Feature Selection
Filter Methods:

Techniques to Implement Filter Methods.


Information Gain:
The amount of information a feature provides for identifying the target value, measured as the reduction in entropy (uncertainty) about the target once the feature is known. The features with the least information gain are filtered out.
Techniques for Filter Methods
Correlation Analysis: The correlation between each input attribute and the output attribute is computed, and the inputs with the least correlation are filtered out.
Variance Threshold: The features with the least variance in their data are filtered out.

Mean Absolute Difference: Similar to variance, but it averages the absolute deviations from the mean instead of the squared deviations.


Dispersion Ratio: Dispersion ratio is defined as the ratio of the Arithmetic
mean (AM) to that of Geometric mean (GM) for a given feature. Higher
dispersion ratio implies a more relevant feature.
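The three statistics above can be sketched in a few lines of plain Python. The function names, the toy columns, and the threshold of 0.0 are assumptions made for illustration. The dispersion ratio is only defined for strictly positive values, because the geometric mean requires them.

```python
from math import exp, log

def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mean_absolute_difference(values):
    """Mean of absolute deviations from the mean (no squaring)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def dispersion_ratio(values):
    """Arithmetic mean divided by geometric mean; needs positive values."""
    n = len(values)
    arithmetic_mean = sum(values) / n
    geometric_mean = exp(sum(log(v) for v in values) / n)
    return arithmetic_mean / geometric_mean

# Made-up columns: one is constant, one actually varies across emails.
features = {
    "constant_flag": [1, 1, 1, 1],
    "spelling_errors": [1, 2, 8, 9],
}

# Variance threshold: keep only features whose variance exceeds the threshold.
threshold = 0.0
kept = [name for name, col in features.items() if variance(col) > threshold]
print(kept)
```

A constant column has zero variance and a dispersion ratio of 1, so every one of these filters would discard it.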
Method for Feature Selection:
Wrapper Method
Greedy algorithms that train the model on a subset of features in an iterative manner. Based on the results of the previous training round, features are added to or removed from the subset.
Stopping criteria for selecting the best subset are usually pre-defined by the person training the model, such as when the performance of the model decreases or a specific number of features has been reached.
The main advantage of wrapper methods over filter methods is that they tune the feature subset to the model being trained, which usually results in better accuracy than filter methods.
The main disadvantage is that they are computationally more expensive.
Techniques for Wrapper Methods
Forward selection – This method is an iterative approach where we start with an empty set of features and, in each iteration, add the feature that best improves our model. We stop when adding a new variable no longer improves the performance of the model.
Backward elimination – This method is also an iterative approach where we start with all features and, after each iteration, remove the least significant feature. We stop when removing a feature no longer improves the performance of the model.
Bi-directional elimination – This method uses both the forward selection and backward elimination techniques simultaneously to reach one unique solution.
Exhaustive selection – This technique is considered the brute-force approach for the evaluation of feature subsets. It creates all possible subsets, builds a model for each subset, and selects the subset whose model performs best.
Recursive elimination – This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on an initial set of features and their importance is obtained from the model, for example through the feature_importances_ attribute (as it is named in scikit-learn). The least important features are then removed from the current set of features until we are left with the required number of features.
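Forward selection, as described above, can be sketched as a greedy loop. In practice the scoring function would train a model on the candidate subset and return a validation metric. Here `toy_score` and its `TOY_GAIN` table are hypothetical stand-ins with made-up numbers, so that the example runs without any dataset.

```python
def forward_selection(candidates, score_fn):
    """Greedily add the feature that most improves score_fn; stop when none does."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        # Try each remaining feature and keep the best single addition.
        trial_scores = {f: score_fn(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # stopping criterion: no addition improves the score
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score

# Hypothetical stand-in for "train a model and return validation accuracy".
TOY_GAIN = {
    "has_keyword": 0.20,
    "exclamation_marks": 0.10,
    "spelling_errors": 0.03,
    "email_length": 0.00,
}

def toy_score(feature_subset):
    base = 0.60                          # accuracy with no features
    penalty = 0.02 * len(feature_subset)  # made-up cost per extra feature
    return base + sum(TOY_GAIN[f] for f in feature_subset) - penalty

selected, score = forward_selection(list(TOY_GAIN), toy_score)
print(selected, round(score, 2))
```

Backward elimination follows the same pattern in reverse: start from all features and remove one per round while the score does not get worse.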
Methods for Feature Selection:
Embedded Methods:
In embedded methods, feature selection is built into the learning algorithm itself, which has its own built-in feature selection mechanism.
Embedded methods counter the drawbacks of filter and wrapper methods and merge their advantages.
They are fast, like filter methods, but more accurate than filter methods, and they also take combinations of features into consideration.
Techniques for Embedded Methods
Regularization – This method adds a penalty on the parameters of the machine learning model to avoid over-fitting. With an L1 penalty (Lasso), some coefficients shrink to exactly zero, so the corresponding features are effectively removed.

Tree-based methods – Methods such as Random Forest and Gradient Boosting provide feature importances, which can also be used to select features. Feature importance tells us which features have the most impact on predicting the target.
Loss Functions In Machine Learning
Loss functions measure how far your model's predictions are from the expected outcome.
The loss function is directly related to the predictions of the model you've built. If the loss value is low, the model's predictions are close to the expected outcomes.
Mean square error
The Mean Squared Error measures how
close a regression line is to a set of data
points.
It is a risk function corresponding to the
expected value of the squared error loss.
Mean square error is calculated by taking
the average, specifically the mean, of
errors squared from data as it relates to a
function.
Advantages of MSE
It offers faster convergence in scenarios where the error values
are relatively small and consistent.
The key to its rapid convergence lies in the error amplification
mechanism it employs.
For larger errors, the squared term magnifies their impact,
which accelerates the minimization process during training.
Disadvantages of MSE
Outliers are data points that significantly deviate from the norm and often don't conform to the overall trend. Because MSE squares every error, large errors are weighted far more heavily than small ones, which means outliers have a substantial impact on the loss calculation.
This can lead to compromised model performance on normal data points. In essence, MSE amplifies the influence of outliers, undermining the model's ability to generalize effectively.
Mean absolute error
MAE is calculated as the sum of absolute errors divided by the sample size.

Instead of squaring the error terms, MAE takes the absolute value of the differences between predicted and actual values.
Because every error contributes only in proportion to its size, MAE is more robust to outliers than MSE.
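To make the outlier behavior concrete, here is a small sketch that computes both losses, MSE = (1/n) Σ (yᵢ − ŷᵢ)² and MAE = (1/n) Σ |yᵢ − ŷᵢ|, on made-up numbers. The data values are assumptions chosen so that every error is 1, except for one outlier with an error of 21.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_pred = [2.0, 6.0, 8.0, 8.0]
y_clean = [3.0, 5.0, 7.0, 9.0]     # every error has magnitude 1
y_outlier = [3.0, 5.0, 7.0, 29.0]  # last target is an outlier (error 21)

print("clean:  ", mean_squared_error(y_clean, y_pred),
      mean_absolute_error(y_clean, y_pred))
print("outlier:", mean_squared_error(y_outlier, y_pred),
      mean_absolute_error(y_outlier, y_pred))
```

A single outlier moves MSE from 1 to 111, while MAE only moves from 1 to 6.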
Binary cross-entropy
Let's consider a simple classification problem:
Our dataset consists of only one feature, x:
x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]
Labels = Red/Green
Given a feature x, what is its label?
Binary cross-entropy
Is the point Green?

What is the probability that the point is Green?

How should our loss function behave?

It should measure how good or bad our predicted probabilities are.
It should return high values for bad predictions and low values for good predictions.
Binary cross-entropy
The cross-entropy loss decreases as the predicted probability
converges to the actual label.
It measures the performance of a classification model whose
predicted output is a probability value between 0 and 1.
Binary Cross Entropy

Say we assign the class green to 1 and red to 0.

Since we're trying to compute a loss, we need to penalize bad predictions.
If the probability associated with the true class is 1.0, we need its loss to be zero.
Conversely, if that probability is low, say 0.01, we need its loss to be HUGE!
Since the log of values between 0.0 and 1.0 is negative, we take the negative log to obtain a positive value for the loss.
Binary cross-entropy
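A minimal sketch of the standard binary cross-entropy, BCE = −(1/n) Σ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)], where yᵢ is 1 for green and 0 for red and pᵢ is the predicted probability of green. The clipping constant `eps` is an assumption added here to avoid taking log(0).

```python
from math import log

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

# A green point (label 1) predicted with high vs. low probability of green.
print(binary_cross_entropy([1], [0.99]))  # confident and right: small loss
print(binary_cross_entropy([1], [0.01]))  # confident and wrong: large loss
```

The loss for a true-class probability of 0.01 is −log(0.01), roughly 4.6, while a probability of 1.0 gives a loss of essentially zero.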
Hinge Loss
Hinge loss penalizes wrong predictions, and also right predictions that are not confident (those that fall within the margin).

L = max(0, margin – y*f(x))

y – the actual class (1 or -1)
f(x) – the output of the classifier for the datapoint

Case 1: Correct classification and |f(x)| ≥ margin (1 in the graph)
Case 2: Correct classification and |f(x)| < margin (1 in the graph)
Hinge Loss: Correct Classification and |f(x)| ≥ Margin

In this case, y*f(x) is always positive and at least 1 (the margin), so 1 – y*f(x) ≤ 0 and the loss is always 0.
Hinge Loss: Correct Classification and |f(x)| < 1

In this case, y and f(x) have the same sign, so y*f(x) = |f(x)|, which lies between 0 and 1. The loss 1 – y*f(x) is therefore positive, so there is some penalty.

Here, although the model has correctly classified the data, we penalize it because it has not classified the point with much confidence: the classification score |f(x)| is less than 1.
Hinge Loss: Incorrect Classification
In this case y and f(x) have opposite signs, so the product y.f(x) will always be negative.
The loss function value max(0, 1 – y.f(x)) will always be the value given by (1 – y.f(x)), which is greater than 1.
Here the loss value increases linearly with |f(x)|, that is, with how confidently the point is misclassified.
Hinge Loss

Margin = 0.22

L = max(0, margin – y*f(x))

Case 1: y*f(x) = +1*(+0.560) = 0.560; 0.22 – 0.560 = -0.340; L = max(0, -0.340) = 0
Case 2: y*f(x) = 0.150; 0.22 – 0.150 = 0.07; L = max(0, 0.07) = 0.07
Case 3: 0.22 – (+1*(-0.24)) = 0.22 + 0.24 = 0.460; L = max(0, 0.460) = 0.460
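The worked numbers above can be checked with a short sketch of L = max(0, margin – y*f(x)). The function name and argument order are choices made for this example, and the three calls reuse the slide's margin of 0.22 with y = +1.

```python
def hinge_loss(y, score, margin=1.0):
    """Hinge loss for a true label y in {+1, -1} and a classifier output score."""
    return max(0.0, margin - y * score)

margin = 0.22
confident_correct = hinge_loss(+1, 0.560, margin)  # correct, outside the margin
unsure_correct = hinge_loss(+1, 0.150, margin)     # correct, inside the margin
incorrect = hinge_loss(+1, -0.24, margin)          # wrong side of the boundary

# Rounded for display, since floating-point subtraction is not exact.
print(round(confident_correct, 3), round(unsure_correct, 3), round(incorrect, 3))
```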
Optimization Algorithm in Machine Learning
Optimization is the process of training the model iteratively so that an objective function reaches its maximum or minimum value.

Why do we need to optimize?

To compare the results in every iteration while adjusting the model's parameters (and, between training runs, its hyperparameters) in each step until we reach the optimum results. This helps to create an accurate model with a lower error rate.
Maxima and Minima
A maximum is the largest and a minimum is the smallest value of a function within a given range.
Global Maxima and Minima: The maximum and minimum values, respectively, over the entire domain of the function.
Local Maxima and Minima: The maximum and minimum values, respectively, of the function within a given range.
A function has only one global minimum value and one global maximum value, but it can have more than one local minimum and local maximum.
Differentiable Functions
Gradient Descent is an optimization algorithm.
It finds a local minimum of a differentiable function.
It is a minimization algorithm that minimizes a given function.
A function f(x) is differentiable at a point a if f'(a) exists.

f(x) is differentiable over an entire range if it is differentiable at each point therein.
GRADIENT DESCENT
Y =
The derivative here is tan(Theta), the slope of the tangent to the curve.
tan(Theta) < 0 on the LHS of the curve.
tan(Theta) > 0 on the RHS of the curve.
GRADIENT DESCENT
The slope changes its sign from negative to positive at the minimum.
As we move closer to the minimum, the magnitude of the slope reduces.
GRADIENT DESCENT
Objective: Calculate X*, the local minimum of the function Y = X².
Pick an initial point X₀ at random.
Calculate X₁ = X₀ - r[df/dx] at X₀.
r is the learning rate, and df/dx is nothing but the gradient. (The step size matters: for Y = X², df/dx = 2X, so r = 1 gives X₁ = X₀ - 2X₀ = -X₀ and the iterates bounce between X₀ and -X₀ forever. A smaller value such as r = 0.1 converges.)
Calculate X₂ = X₁ - r[df/dx] at X₁.
Calculate for all the points: X₁, X₂, X₃, ..., Xᵢ₋₁, Xᵢ
General formula for calculating local minima: Xᵢ = Xᵢ₋₁ - r[df/dx] at Xᵢ₋₁
When |Xᵢ - Xᵢ₋₁| is small, i.e., when Xᵢ₋₁ and Xᵢ converge, we stop the iteration and declare X* = Xᵢ.
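The update rule above can be sketched for Y = X², whose gradient is df/dx = 2X. The starting point, tolerance, and iteration cap are assumptions chosen for illustration. The sketch uses r = 0.1, because for this particular function r = 1 makes each step jump from X to -X, so the iterates never converge.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tolerance=1e-6, max_iters=10_000):
    """Iterate x_i = x_{i-1} - r * grad(x_{i-1}) until successive points converge."""
    x_prev = x0
    for i in range(1, max_iters + 1):
        x_next = x_prev - learning_rate * grad(x_prev)
        if abs(x_next - x_prev) < tolerance:
            return x_next, i  # converged: declare X* = x_i
        x_prev = x_next
    return x_prev, max_iters  # gave up without converging

def grad_of_square(x):
    """Gradient of Y = X^2."""
    return 2 * x

x_star, steps = gradient_descent(grad_of_square, x0=5.0)
print(f"X* is approximately {x_star:.6f} after {steps} steps")

# With r = 1 the iterates flip between +5 and -5 and never settle.
x_stuck, _ = gradient_descent(grad_of_square, x0=5.0, learning_rate=1.0, max_iters=100)
print(f"with r = 1: still at |x| = {abs(x_stuck)}")
```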
Learning Rate in GD and why do we need it
The learning rate is a hyperparameter (tuning parameter) that determines the step size at each iteration while moving towards a minimum of the function.
X₁ = X₀ - r[df/dx] at X₀, where r is the learning rate.
The learning rate does not have to stay fixed. For example, if r = 0.1 in the initial step, it can be taken as r = 0.01 in the next step.
Likewise, it can be reduced exponentially as we iterate further.
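One common way to reduce the learning rate exponentially is r_t = r_0 · γ^t, where t is the step index. The function below is a sketch of that schedule. The decay factor γ = 0.1 is an assumption chosen only to reproduce the 0.1 to 0.01 example, and gentler factors are often used in practice.

```python
def exponential_decay(initial_rate, decay_factor, step):
    """Learning rate after `step` updates: initial_rate * decay_factor ** step."""
    return initial_rate * decay_factor ** step

# Learning rate for the first four steps, starting at 0.1 and shrinking 10x per step.
schedule = [exponential_decay(0.1, 0.1, t) for t in range(4)]
print([f"{r:.4f}" for r in schedule])
```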
Learning Rate in GD and why do we need it
What can happen if we keep the learning rate constant?

Disadvantages of Gradient Descent

It can get stuck in a local minimum.
When n (the number of data points) is large, the time it takes for k iterations to calculate the optimum vector becomes very large.
The computational complexity is O(knd), where:
k is the number of iterations,
n is the number of samples in your dataset,
d is the number of features you explore.
Stochastic Gradient Descent
In GD, the whole dataset is used in each iteration of the algorithm, so it is sometimes also called 'Batch' GD.
In SGD, instead of using all the samples in our dataset, we randomly select one of them and perform a GD step on it.
SGD is stochastic in nature, i.e., it picks a "random" instance of training data at each step and then computes the gradient, making each step much faster because there is far less data to process at a time.
The convergence path for SGD is a little noisy.
Mini-Batch Gradient Descent
A small subset (mini-batch) of the dataset is chosen to compute each update of the parameters.

Mini-Batch GD with Momentum
Momentum is a technique that accelerates the optimization process by adding a fraction of the previous update to the current update.
MBGD with Momentum: Steps
1. Initialize model parameters and momentum term to zero.
2. Divide the training dataset into mini-batches.
3. For each mini-batch:
   Perform a forward pass to compute predictions.
   Calculate the loss and the gradients with respect to the mini-batch.
   Update the momentum term using the current gradients and the momentum hyperparameter.
   Update the model's parameters using the momentum-adjusted gradient updates.
4. Repeat step 3 for each mini-batch in an iteration.
5. Update the learning rate and momentum term if needed for subsequent epochs.
6. Repeat steps 2 to 5 for a predefined number of iterations.
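The steps above can be sketched for a one-feature linear model y = w·x + b trained with MSE. Everything concrete here is an assumption for illustration: the noise-free toy data, the batch size, the learning rate, the momentum value of 0.8, and the choice to keep both hyperparameters fixed across epochs. Setting `batch_size=1` turns the same loop into SGD, and `batch_size=len(xs)` turns it into batch GD.

```python
import random

def minibatch_gd_momentum(xs, ys, batch_size=4, learning_rate=0.05,
                          momentum=0.8, epochs=300, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent with momentum."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # step 1: model parameters start at zero
    v_w, v_b = 0.0, 0.0  # step 1: momentum terms start at zero
    indices = list(range(len(xs)))
    for _ in range(epochs):  # step 6: repeat for a fixed number of epochs
        rng.shuffle(indices)  # step 2: new mini-batch split each epoch
        for start in range(0, len(indices), batch_size):  # steps 3 and 4
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]  # forward pass
                grad_w += 2 * error * xs[i] / len(batch)  # d(MSE)/dw
                grad_b += 2 * error / len(batch)          # d(MSE)/db
            # Momentum: keep a fraction of the previous update.
            v_w = momentum * v_w - learning_rate * grad_w
            v_b = momentum * v_b - learning_rate * grad_b
            w += v_w
            b += v_b
        # Step 5 (adjusting the hyperparameters) is skipped: both stay fixed.
    return w, b

# Made-up data lying exactly on the line y = 2x + 1.
xs = [i / 4 for i in range(8)]
ys = [2 * x + 1 for x in xs]
w_fit, b_fit = minibatch_gd_momentum(xs, ys)
print(f"w is approximately {w_fit:.3f}, b is approximately {b_fit:.3f}")
```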
Difference between GD, SGD, and MBGD
Convergence in BGD, SGD, MBGD
Supervised Learning: Example Application
An emergency room in a hospital measures 17 variables (e.g., blood
pressure, age, etc) of newly admitted patients.
A decision is needed: whether to put a new patient in an intensive-care
unit.
Due to the high cost of ICU, those patients who may survive less than a
month are given higher priority.
Problem: to predict high-risk patients and discriminate them from low-risk
patients.
Intuition
Like human learning from past experiences.
A computer does not have “experiences”.
A computer system learns from data, which represent some “past
experiences” of an application domain.
Our focus: Learn a target function that can be used to predict the values of an attribute.
The task is commonly called supervised learning, classification, or inductive learning.
Supervised learning process: two steps
Learning (training): Learn a model using the training data.
Testing: Test the model using unseen test data to assess the model accuracy.
The Process of Learning
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T
if after learning the system’s performance on T improves as
measured by M.
In other words, the learned model helps the system to perform T
better as compared to no learning.

CS583, Bing Liu, UIC


An example
Data: Loan application data
Task: Predict whether a loan should
be approved or not.
Performance measure: accuracy.

No learning: classify all future applications (test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
We can do better than 60% with
learning.
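The "no learning" baseline can be sketched as follows. The label list is reconstructed only from the counts stated above (9 "Yes" out of 15 applications) and is not the actual loan table. The accuracy is measured on the same labels, purely to reproduce the 9/15 figure.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    majority_label, count = Counter(labels).most_common(1)[0]
    return majority_label, count / len(labels)

# Reconstructed from the stated counts only: 9 approved, 6 not approved.
loan_labels = ["Yes"] * 9 + ["No"] * 6

label, accuracy = majority_class_baseline(loan_labels)
print(f"Always predict {label!r}: accuracy = {accuracy:.0%}")
```

Any learned model has to beat this number to be worth using.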


Linear Regression
There are two types of variable:
Independent (= predictor) variable X
Dependent (= outcome) variable Y

Linear regression explores the linear relation between these two variables:
Y = mX + B, where m is the slope and B is the intercept.
Linear Regression
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
Regression Equation
Expected value of y at a given x=
Predicted value for an individual
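As an illustration of fitting Y = mX + B, here is a sketch of the ordinary least-squares estimates for a single predictor: m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and B = ȳ − m·x̄. The data points are made up and lie exactly on a line with slope 2, matching the slope example given earlier.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Made-up points on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
predicted = m * 5.0 + b  # predicted value of Y for an individual with X = 5
print(m, b, predicted)
```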
Assumptions for Linear Regression
Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of
variances/ homoscedasticity)
4. The observations are independent
Homoscedasticity
The standard error of Y given X is the average variability around the
regression line at any given value of X. It is assumed to be equal at
all values of X.
Types of Linear Regression
Simple Linear Regression: A single independent variable is used to predict the value of a numerical dependent variable.
Multiple Linear Regression: More than one independent variable is used to predict the value of a numerical dependent variable.
Small or no multicollinearity between the features: Multiple linear regression assumes little or no multicollinearity. Multicollinearity means high correlation between the independent variables.
Types of Regression Line
A straight line showing the relationship between the dependent and independent variables is called a regression line.

Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.
Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.
