
Unit - 2

Syllabus: Regression: Linear Regression, Decision Trees, Overfitting, Instance-Based Learning, Feature Reduction, Collaborative Filtering-Based Recommendation Systems

Regression Analysis in Machine Learning


Regression analysis is a statistical method for modelling the relationship
between a dependent (target) variable and one or more independent
(predictor) variables. More specifically, regression analysis helps us
understand how the value of the dependent variable changes with respect to
one independent variable while the other independent variables are held
fixed. It predicts continuous/real values such as temperature, age, salary,
price, etc.

Regression is a supervised learning technique which helps in finding the
correlation between variables and enables us to predict a continuous output
variable from one or more predictor variables. It is mainly used for
prediction, forecasting, time-series modelling, and determining the
cause-and-effect relationship between variables.

In regression, we fit a line or curve to the given datapoints; using this fit,
the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through the datapoints
on the target-predictor graph in such a way that the vertical distance between
the datapoints and the regression line is minimum." The distance between the
datapoints and the line tells whether the model has captured a strong
relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we
want to predict or understand is called the dependent variable. It is also
called the target variable.
o Independent Variable: The factors which affect the dependent variable,
or which are used to predict the values of the dependent variable, are
called independent variables, also known as predictors.
o Outliers: An outlier is an observation which contains either a very low or
a very high value in comparison to the other observed values. An outlier may
distort the results, so it should be handled carefully.
o Multicollinearity: If the independent variables are highly correlated with
each other, then such a condition is called multicollinearity. It should not
be present in the dataset, because it creates problems when ranking the most
influential variables.
o Underfitting and Overfitting: If our algorithm works well with the
training dataset but not with the test dataset, then such a problem is
called overfitting. And if our algorithm does not perform well even with the
training dataset, then such a problem is called underfitting.

Why do we use Regression Analysis?


Regression analysis helps in the prediction of a continuous variable.
There are various real-world scenarios where we need future predictions, such
as weather conditions, sales, or marketing trends, and for such cases we need
a technique that can make these predictions accurately. Regression analysis
is such a statistical method, used widely in machine learning and data
science. Below are some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the
independent variables.
o It is used to find trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor,
the least important factor, and how each factor affects the others.

Types of Regression
There are various types of regression which are used in data science and
machine learning. Each type has its own importance in different scenarios,
but at their core, all regression methods analyze the effect of the
independent variables on the dependent variable. Here we discuss some
important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for
predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression
and shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence it is called
linear regression.
o If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input
variable, then it is called multiple linear regression.
o The relationship between the variables in a linear regression model can be
explained using the example of predicting the salary of an employee on the
basis of years of experience.
o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used
to solve classification problems. In classification problems, we have a
dependent variable in a binary or discrete format, such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as
0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm which works on the concept of
probability.
o Logistic regression is a type of regression, but it differs from the linear
regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (logistic function), which
leads to a more complex cost function. The sigmoid function used to model the
data in logistic regression can be represented as:

f(x) = 1 / (1 + e^(-x))

Where,

o f(x) = output, a value between 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives an S-shaped
curve.

o It uses the concept of threshold levels: values above the threshold level
are rounded up to 1, and values below the threshold level are rounded down
to 0.

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
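
As a small illustration of these ideas, the sketch below (assuming scikit-learn and NumPy are installed; the synthetic data is made up) fits a logistic regression and shows how the sigmoid output is turned into a class label via a threshold.

```python
# A minimal sketch of logistic regression on synthetic binary data,
# assuming scikit-learn is available; the data and values are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # two numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # binary target (0 or 1)

model = LogisticRegression()
model.fit(X, y)

# predict_proba applies the sigmoid to the linear combination of the inputs,
# giving probabilities between 0 and 1; a 0.5 threshold yields the class label.
print(model.predict_proba([[0.2, -0.1]]))
print(model.predict([[0.2, -0.1]]))
```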

Polynomial Regression:

o Polynomial regression is a type of regression which models a non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve
between the value of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a non-linear
fashion; in such a case, linear regression will not fit those datapoints
well. To cover such datapoints, we need polynomial regression.
o In polynomial regression, the original features are transformed into
polynomial features of a given degree and then modelled using a linear
model. This means the datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is also derived from the linear
regression equation, that is, the linear regression equation Y = b0 + b1x is
transformed into the polynomial regression equation
Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
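
A brief sketch of this transform-then-fit idea, assuming scikit-learn is available (the degree and the synthetic data are arbitrary illustrative choices):

```python
# A sketch of polynomial regression: expand x into polynomial features,
# then fit an ordinary linear model on the expanded features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=100)  # non-linear data

# degree=3 turns x into [x, x^2, x^3]; the linear model then learns b1, b2, b3.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))
```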

Support Vector Regression:

Support Vector Machine (SVM) is a supervised learning algorithm which can be
used for regression as well as classification problems. When it is used for
regression problems, it is termed Support Vector Regression (SVR).

Support Vector Regression is a regression algorithm which works for
continuous variables. Below are some keywords which are used in Support
Vector Regression:

o Kernel: A function used to map lower-dimensional data into a
higher-dimensional space.
o Hyperplane: In ordinary SVM, it is the separation line between two classes,
but in SVR it is the line which helps to predict the continuous variable and
covers most of the datapoints.
o Boundary lines: The two lines drawn at a distance from the hyperplane,
which create a margin for the datapoints.
o Support vectors: The datapoints which are nearest to the hyperplane and to
the opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so
that the maximum number of datapoints is covered within that margin. The main
goal of SVR is to include the maximum number of datapoints within the
boundary lines, with the hyperplane (best-fit line) passing through as many
datapoints as possible.

In such a plot, the middle line is the hyperplane, and the two lines on
either side of it are the boundary lines.
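
A minimal SVR sketch, assuming scikit-learn (the kernel, C, and epsilon values are illustrative, not prescribed):

```python
# A sketch of Support Vector Regression on noisy synthetic data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon controls the width of the margin (the "boundary lines") around the
# hyperplane; points inside the epsilon-tube contribute no loss.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
print(model.predict([[2.5]]))
```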

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for
solving both classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision tree regression builds a tree-like structure in which each
internal node represents a "test" on an attribute, each branch represents the
outcome of the test, and each leaf node represents the final decision or
result.
o A decision tree is constructed starting from the root node/parent node
(the whole dataset), which splits into left and right child nodes (subsets of
the dataset). These child nodes are further divided into their own children,
thereby becoming the parent nodes of those nodes.

A typical example of decision tree regression is a model trying to predict a
person's choice between a sports car and a luxury car.
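
A short sketch of decision tree regression with scikit-learn, assuming synthetic data; max_depth is only used to keep the tree small:

```python
# A sketch of decision tree regression: each leaf predicts the mean target
# value of the training samples that reach it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.where(X[:, 0] < 5, 20.0, 50.0) + rng.normal(scale=2.0, size=100)

# Each internal node tests a feature threshold; the tree splits until
# max_depth is reached or the leaves are (nearly) pure.
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)
print(tree.predict([[3.0], [8.0]]))
```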

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms,
capable of performing both regression and classification tasks.
o Random forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output based on the average of
the individual tree outputs. The combined decision trees are called base
models, and the ensemble can be represented more formally as:

g(x) = f0(x) + f1(x) + f2(x) + ...

o Random forest uses the Bagging or Bootstrap Aggregation technique of
ensemble learning, in which the aggregated decision trees run in parallel and
do not interact with each other.
o With the help of random forest regression, we can prevent overfitting in
the model by creating random subsets of the dataset.
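
A sketch of random forest regression with scikit-learn; n_estimators and the synthetic data are illustrative assumptions:

```python
# A sketch of random forest regression: many trees trained on bootstrap
# samples, with the forest prediction being the average of the tree outputs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(scale=1.0, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[2.5]]))
```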
Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression,
in which a small amount of bias is introduced so that we can get better
long-term predictions.
o The amount of bias added to the model is known as the ridge regression
penalty. This penalty term is computed by multiplying lambda by the squared
weight of each individual feature.
o The cost function for ridge regression adds this L2 penalty to the usual
sum of squared errors:

Cost = Σ (yi − ŷi)² + λ Σ wj²

o A general linear or polynomial regression will fail if there is high
collinearity between the independent variables; to solve such problems, ridge
regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.

Lasso Regression:

o Lasso regression is another regularization technique used to reduce the
complexity of the model.
o It is similar to ridge regression except that the penalty term contains the
absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0,
whereas ridge regression can only shrink it close to 0.
o It is also called L1 regularization. The cost function for lasso regression
adds the L1 penalty to the sum of squared errors:

Cost = Σ (yi − ŷi)² + λ Σ |wj|
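
The difference between the two penalties can be seen in the following sketch, assuming scikit-learn; alpha plays the role of lambda and its values are illustrative:

```python
# A sketch comparing ridge (L2) and lasso (L1) regularization.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all coefficients towards zero; lasso can set the
# coefficients of irrelevant features exactly to zero.
print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```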

Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular machine learning
algorithms. It is a statistical method that is used for predictive analysis.
Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a
dependent (y) variable and one or more independent (x) variables, hence it is
called linear regression. Since linear regression shows a linear
relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the
relationship between the variables.

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values of the x and y variables are the training data used to fit the
linear regression model.
Types of Linear Regression
Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a
numerical dependent variable, then such a linear regression algorithm
is called simple linear regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of
a numerical dependent variable, then such a linear regression
algorithm is called multiple linear regression.

Linear Regression Line

A straight line showing the relationship between the dependent and
independent variables is called a regression line. A regression line can
show two types of relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is termed a
positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is called a
negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit
line, which means that the error between the predicted values and the actual
values should be minimized. The best fit line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give
different regression lines, so we need to calculate the best values for
a0 and a1 to find the best fit line; to calculate this, we use a cost function.

Cost function-

o Different values for the weights or coefficients of the line (a0, a1) give
different regression lines, and the cost function is used to estimate the
values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights. It
measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function
is also known as the hypothesis function.

For linear regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and
the actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (yi − (a1xi + a0))²

Where,

N = total number of observations
yi = actual value
(a1xi + a0) = predicted value

Residuals: The distance between an actual value and the predicted value is
called a residual. If the observed points are far from the regression line,
the residuals will be high and so will the cost function. If the scatter
points are close to the regression line, the residuals will be small and
hence the cost function will be small as well.

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of
the cost function.
o A regression model uses gradient descent to update the coefficients of the
line by reducing the cost function.
o This is done by randomly selecting initial coefficient values and then
iteratively updating them to reach the minimum of the cost function, as
sketched below.
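
The update rule can be sketched from scratch in a few lines of NumPy; the learning rate, iteration count, and synthetic data below are illustrative assumptions, not prescribed values:

```python
# A from-scratch sketch of gradient descent for simple linear regression,
# minimizing the MSE cost (1/N) * sum((y - (a1*x + a0))**2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 4.0 + 2.5 * x + rng.normal(scale=1.0, size=200)   # true a0 = 4.0, a1 = 2.5

a0, a1 = 0.0, 0.0          # starting values for the coefficients
lr, n_iters = 0.01, 2000   # learning rate and iteration count
N = len(x)

for _ in range(n_iters):
    y_pred = a1 * x + a0
    # partial derivatives of the MSE with respect to a0 and a1
    grad_a0 = (-2.0 / N) * np.sum(y - y_pred)
    grad_a1 = (-2.0 / N) * np.sum((y - y_pred) * x)
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(f"a0 ≈ {a0:.2f}, a1 ≈ {a1:.2f}")   # should approach 4.0 and 2.5
```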

Model Performance:
The goodness of fit determines how well the regression line fits the set of
observations. The process of finding the best model out of various candidate
models is called optimization. It can be assessed by the method below:

1. R-squared method:

o R-squared is a statistical measure that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the
predicted values and the actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of
multiple determination for multiple regression.
o It can be calculated from the formula:

R² = 1 − (sum of squared residuals / total sum of squares) = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
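
A small sketch of computing R-squared both by hand and with scikit-learn's r2_score (the toy values are made up):

```python
# A sketch of the R-squared calculation from the formula above.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)
print(r2_score(y_true, y_pred))                  # same value
```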

Assumptions of Linear Regression

Below are some important assumptions of linear regression. These are formal
checks to make while building a linear regression model, which ensure that we
get the best possible result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent
and independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent
variables. Due to multicollinearity, it may be difficult to find the true
relationship between the predictors and the target variable; in other
words, it is difficult to determine which predictor variable is affecting
the target variable and which is not. So, the model assumes either little
or no multicollinearity between the features or independent variables.
o Homoscedasticity assumption:
Homoscedasticity is the situation where the error term has the same
variance for all values of the independent variables. With
homoscedasticity, there should be no clear pattern in the distribution of
the residuals in a scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal
distribution. If the error terms are not normally distributed, then
confidence intervals will become either too wide or too narrow, which may
cause difficulties in estimating the coefficients. This can be checked
using a Q-Q plot: if the plot shows a straight line without any deviation,
the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error
terms. If there is any correlation in the error terms, it will drastically
reduce the accuracy of the model. Autocorrelation usually occurs if there
is a dependency between residual errors.

Simple Linear Regression in Machine Learning

Simple Linear Regression is a type of regression algorithm that models the
relationship between a dependent variable and a single independent variable.
The relationship shown by a simple linear regression model is linear (a
sloped straight line), hence it is called simple linear regression.

The key point in simple linear regression is that the dependent variable must
be a continuous/real value. However, the independent variable can be measured
on a continuous or categorical scale.

The simple linear regression algorithm has two main objectives:

o Model the relationship between the two variables, such as the
relationship between income and expenditure, or experience and salary.
o Forecast new observations, such as weather forecasting according to
temperature, or the revenue of a company according to the investments in a
year.

Simple Linear Regression Model:

The simple linear regression model can be represented using the equation
below:

y = a0 + a1x + ε

Where,

a0 = the intercept of the regression line (can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is
increasing or decreasing
ε = the error term (for a good model it will be negligible)
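
A minimal sketch of fitting a simple linear regression with scikit-learn; the experience/salary figures are made up for illustration:

```python
# A sketch of simple linear regression (one predictor) with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

experience = np.array([[1], [2], [3], [4], [5], [6]])      # years (single feature)
salary = np.array([35000, 40000, 47000, 52000, 58000, 64000])

model = LinearRegression()
model.fit(experience, salary)

print("a1 (slope):", model.coef_[0])
print("a0 (intercept):", model.intercept_)
print("predicted salary for 7 years:", model.predict([[7]])[0])
```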

Multiple Linear Regression

In the previous topic, we learned about Simple Linear Regression, where a
single independent/predictor variable (X) is used to model the response
variable (Y). But there may be various cases in which the response variable
is affected by more than one predictor variable; for such cases, the Multiple
Linear Regression (MLR) algorithm is used.

Multiple Linear Regression is one of the important regression algorithms; it
models the linear relationship between a single dependent continuous variable
and more than one independent variable.

MLR equation:
In multiple linear regression, the target variable (Y) is a linear
combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is
an enhancement of simple linear regression, the same form is applied for the
multiple linear regression equation, which becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............ (a)

Where,

Y = output/response variable

b0, b1, b2, b3, ..., bn = coefficients of the model

x1, x2, x3, x4, ... = the various independent/feature variables

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the target and the predictor
variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.
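
A short sketch of multiple linear regression with two predictors, assuming scikit-learn; the coefficients and data are synthetic:

```python
# A sketch of multiple linear regression: Y = b0 + b1*x1 + b2*x2 + error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))                 # x1, x2
y = 5.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=150)

model = LinearRegression()
model.fit(X, y)

# b0 is the intercept; b1, b2 are the learned coefficients.
print("b0:", round(model.intercept_, 2))
print("b1, b2:", np.round(model.coef_, 2))
```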

Decision Tree Classification Algorithm

o Decision Tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred for
solving classification problems. It is a tree-structured classifier, where
internal nodes represent the features of a dataset, branches represent the
decision rules, and each leaf node represents the outcome.
o In a decision tree, there are two kinds of nodes, the Decision Node and the
Leaf Node. Decision nodes are used to make a decision and have multiple
branches, whereas leaf nodes are the outputs of those decisions and do not
contain any further branches.
o The decisions or tests are performed on the basis of the features of the
given dataset.
o It is a graphical representation for getting all the possible solutions to
a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts at the
root node, which expands into further branches and constructs a tree-like
structure.
o In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No),
further splits the tree into subtrees.

Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best
algorithm for the given dataset and problem is the main point to remember
while creating a machine learning model. Below are two reasons for using a
decision tree:

o Decision trees usually mimic the human thinking process while making a
decision, so they are easy to understand.
o The logic behind a decision tree can be easily understood because it has a
tree-like structure.

Decision Tree Terminologies

 Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

 Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further
after reaching a leaf node.

 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

 Branch/Sub-Tree: A tree formed by splitting the tree.

 Pruning: Pruning is the process of removing unwanted branches from the tree.

 Parent/Child node: The root node of the tree is called the parent node, and the other nodes are
called child nodes.
How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm
starts from the root node of the tree. The algorithm compares the value of
the root attribute with the corresponding attribute of the record and, based
on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues this process until it reaches
a leaf node of the tree. The complete process can be better understood using
the algorithm below (a code sketch follows the steps):

o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where the nodes cannot be classified further; such final nodes are called
leaf nodes.
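
A brief sketch of the above procedure using scikit-learn's DecisionTreeClassifier on the iris dataset; the criterion and depth are illustrative choices:

```python
# A sketch of decision tree classification; "entropy" selects information
# gain as the attribute selection measure (Gini index is the default).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The tree repeatedly picks the attribute/threshold with the best score
# until the leaves are (nearly) pure or max_depth is reached.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```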

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems
there is a technique called the Attribute Selection Measure (ASM). Using this
measurement, we can easily select the best attribute for the nodes of the
tree. There are two popular ASM techniques:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split
first. It can be calculated using the formula below:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute.
It specifies the randomness in the data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no

2. Gini Index:

o The Gini index is a measure of impurity or purity used while creating a
decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high
Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index
to create binary splits.
o The Gini index can be calculated using the formula below:

Gini Index = 1 − Σj Pj²
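
Both measures can be computed directly from these formulas; the sketch below (the class counts are made up) evaluates the entropy, the Gini index, and the information gain of a candidate split:

```python
# A sketch of computing entropy, Gini index, and information gain by hand.
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p * log2(p)) over the class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini = 1 - sum(p_j ** 2) over the class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array(["yes"] * 9 + ["no"] * 5)
left   = np.array(["yes"] * 6 + ["no"] * 2)   # one branch of a candidate split
right  = np.array(["yes"] * 3 + ["no"] * 3)   # the other branch

weighted_child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("information gain:", entropy(parent) - weighted_child_entropy)
print("gini of parent:", gini(parent))
```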

Pruning: Getting an Optimal Decision Tree

Pruning is the process of deleting unnecessary nodes from a tree in order to
get the optimal decision tree.

A tree that is too large increases the risk of overfitting, and a small tree
may not capture all the important features of the dataset. Therefore, a
technique that decreases the size of the learning tree without reducing
accuracy is known as pruning. There are mainly two types of tree pruning
techniques used:

o Cost Complexity Pruning
o Reduced Error Pruning

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process which a human
follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes of a problem.
o There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For more class labels, the computational complexity of the decision tree
may increase.

Overfitting and Underfitting in Machine Learning

Overfitting and underfitting are the two main problems that occur in machine
learning and degrade the performance of machine learning models.

The main goal of each machine learning model is to generalize well. Here
generalization refers to the ability of an ML model to provide suitable
output for new, unseen input. It means that after being trained on the
dataset, the model can produce reliable and accurate output. Hence,
underfitting and overfitting are the two terms that need to be checked to
assess the performance of the model and whether it is generalizing well or
not.

Before understanding overfitting and underfitting, let's understand some
basic terms that will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the
machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the
performance of the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithm; it is the difference between
the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training
dataset but does not perform well with the test dataset, then variance
occurs.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the
data points, or more data points than required, in the given dataset. Because
of this, the model starts capturing noise and inaccurate values present in
the dataset, and all these factors reduce the efficiency and accuracy of the
model. An overfitted model has low bias and high variance.

The chances of overfitting increase the more we train our model: the longer
we train, the higher the chance of ending up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the output of a
linear regression model whose fitted curve tries to pass through every data
point in the scatter plot. It may look efficient, but in reality it is not,
because the goal of a regression model is to find the best fit line; a curve
that chases every point has not found a general fit and will generate
prediction errors on new data.

How to avoid Overfitting in a Model

Both overfitting and underfitting degrade the performance of a machine
learning model. But the more common problem is overfitting, and there are
several ways by which we can reduce its occurrence in our model:

o Cross-Validation
o Training with more data
o Removing features
o Early stopping of training
o Regularization
o Ensembling

Underfitting
Underfitting occurs when our machine learning model is not able to capture
the underlying trend of the data. To avoid overfitting, the feeding of
training data can be stopped at too early a stage, due to which the model may
not learn enough from the training data. As a result, it may fail to find the
best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the
training data, which reduces the accuracy and produces unreliable
predictions.

An underfitted model has high bias and low variance.

Example: Underfitting can be seen in the output of a linear regression model
whose fitted line is unable to capture the trend of the data points in the
plot.

How to avoid underfitting:

o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit
The term "goodness of fit" is taken from statistics, and the goal of machine
learning models is to achieve a good fit. In statistical modelling, it
defines how closely the results or predicted values match the true values of
the dataset.

A model with a good fit lies between an underfitted and an overfitted model;
ideally, it makes predictions with zero error, but in practice this is
difficult to achieve.

As we train our model for a while, the errors on the training data go down,
and the same happens with the test data. But if we train the model for too
long, the performance of the model may decrease due to overfitting, as the
model also learns the noise present in the dataset. The errors on the test
dataset then start increasing, so the point just before the test error starts
rising is the good point, and we can stop there to achieve a good model.

There are two other methods by which we can find a good point for our model:
the resampling method to estimate model accuracy, and a validation dataset.
Overfitting in Machine Learning
In the real world, the available dataset will never be clean and perfect:
every dataset contains impurities, noisy data, outliers, missing data, or
imbalanced data. Due to these impurities, different problems occur that
affect the accuracy and performance of the model. One such problem is
overfitting in machine learning. Overfitting is a problem that a model can
exhibit.

A statistical model is said to be overfitted if it cannot generalize well to unseen data.

Before understanding overfitting, we need to know some basic terms, which
are:

Noise: Noise is meaningless or irrelevant data present in the dataset. It
affects the performance of the model if it is not removed.

Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithm; it is the difference between
the predicted values and the actual values.

Variance: If the machine learning model performs well with the training
dataset but does not perform well with the test dataset, then variance
occurs.

Generalization: It shows how well a model is trained to predict unseen data.

What is Overfitting?
o Overfitting and underfitting are the two main errors/problems in a
machine learning model that cause poor performance.
o Overfitting occurs when the model fits more data than required and tries
to capture each and every datapoint fed to it. Hence it starts capturing
noise and inaccurate data from the dataset, which degrades the performance
of the model.
o An overfitted model does not perform accurately on the test/unseen dataset
and cannot generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are three
students, X, Y, and Z, and all three are preparing for an exam. X has studied
only three sections of the book and left all other sections. Y has a good
memory and has memorized the whole book. The third student, Z, has studied
and practiced all the questions. In the exam, X will only be able to solve
questions related to the sections he has studied. Student Y will only be able
to solve questions that appear exactly as given in the book. Student Z will
be able to solve all the exam questions properly.

The same happens in machine learning: if an algorithm learns from only a
small part of the data, it is unable to capture the required patterns and is
hence underfitted.

Suppose the model memorizes the training dataset like student Y. It performs
very well on the seen dataset but performs badly on unseen data or unknown
instances. In such cases, the model is said to be overfitting.

And if the model performs well both on the training dataset and on the
test/unseen dataset, like student Z, it is said to be a good fit.

How to detect Overfitting?

Overfitting in a model can only be detected once you test the model on unseen
data. To detect the issue, we can perform a train/test split, as sketched
below.

In the train-test split of the dataset, we divide our dataset into random
training and test datasets. We train the model with the training dataset,
which is about 80% of the total dataset. After training the model, we test it
with the test dataset, which is the remaining 20% of the total dataset.

Now, if the model performs well with the training dataset but not with the
test dataset, then it is likely to have an overfitting issue.

For example, if the model shows 85% accuracy with the training data and 50%
accuracy with the test dataset, it means the model is not performing well.
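
A sketch of this check with scikit-learn; the synthetic dataset and the deliberately unrestricted tree are illustrative:

```python
# A sketch of detecting overfitting via a train/test split: a large gap
# between training and test accuracy suggests the model has overfitted.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=0)     # no depth limit: prone to overfit
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```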

Ways to prevent Overfitting

Although overfitting is an error in machine learning which reduces the
performance of the model, we can prevent it in several ways. With the use of
a linear model we can sometimes avoid overfitting; however, many real-world
problems are non-linear, so it is important to prevent overfitting in more
flexible models as well. Below are several ways that can be used to prevent
overfitting:

1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization

Early Stopping
In this technique, training is paused before the model starts learning the
noise in the data. While training the model iteratively, we measure its
performance after each iteration and continue only as long as a new iteration
improves the performance of the model.

After that point, the model begins to overfit the training data; hence we
need to stop the process before the learner passes that point.

Stopping the training process before the model starts capturing noise from
the data is known as early stopping.

However, this technique may lead to the underfitting problem if training is
paused too early. So, it is very important to find the "sweet spot" between
underfitting and overfitting.

Feature Selection
While building an ML model, we have a number of parameters or features that
are used to predict the outcome. However, sometimes some of these features
are redundant or less important for the prediction, and for this the feature
selection process is applied. In the feature selection process, we identify
the most important features in the training data, and the other features are
removed. Further, this process helps to simplify the model and reduces noise
from the data. Some algorithms have built-in automatic feature selection; if
not, we can perform this process manually.

Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.

In the general k-fold cross-validation technique, we divide the dataset into
k equal-sized subsets of data; these subsets are known as folds. The model is
trained on k−1 folds and validated on the remaining fold, and the process is
repeated k times so that each fold is used once for validation; the scores
are then averaged, as sketched below.
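
A minimal sketch with scikit-learn's cross_val_score; the estimator and dataset are illustrative choices:

```python
# A sketch of 5-fold cross-validation: cv=5 splits the data into 5 folds and
# each fold is used once as the validation set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```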

Data Augmentation
Data augmentation is a data analysis technique which is an alternative to
adding more data to prevent overfitting. In this technique, instead of adding
more training data, slightly modified copies of the existing data are added
to the dataset.

The data augmentation technique makes each data sample appear slightly
different every time it is processed by the model. Hence each sample appears
somewhat new to the model, which helps prevent overfitting.

Ensemble Methods
In ensemble methods, predictions from different machine learning models are
combined to identify the most popular result.

The most commonly used ensemble methods are Bagging and Boosting.

In bagging, individual data points can be selected more than once. After
collecting several sample datasets, the models are trained independently, and
depending on the type of task (regression or classification), the average or
majority vote of those predictions is used to produce a more accurate result.
Moreover, bagging reduces the chances of overfitting in complex models.

In boosting, a large number of weak learners arranged in a sequence are
trained in such a way that each learner in the sequence learns from the
mistakes of the learner before it. It combines all the weak learners to come
up with one strong learner. In addition, it improves the predictive
flexibility of simple models.
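
A brief sketch contrasting the two ideas with scikit-learn; the estimators and hyperparameters are illustrative assumptions:

```python
# Bagging: many decision trees (the default base learner) trained in parallel
# on bootstrap samples, with their votes combined.
# Boosting: weak learners trained sequentially, each correcting the previous ones.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

print("bagging CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```
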
Instance-Based Learning:

Machine learning is a field of artificial intelligence that deals with giving
machines the ability to learn without being explicitly programmed. In this
context, instance-based learning and model-based learning are two different
approaches used to create machine learning models. While both approaches can
be effective, they also have distinct differences that must be taken into
account when building a machine learning system. Let's explore the
differences between these two types of machine learning.

Instance-based learning (also known as memory-based learning or lazy
learning) involves memorizing the training data in order to make predictions
about future data points. This approach doesn't require any prior knowledge
or assumptions about the data, which makes it easy to implement and
understand. However, it can be computationally expensive, since all of the
training data needs to be stored in memory before making a prediction.
Additionally, this approach doesn't always generalize well to unseen data,
because its predictions are based on memorized examples rather than learned
models.
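
k-nearest neighbours is a typical instance-based learner; the sketch below, assuming scikit-learn, shows that training essentially just stores the examples (k = 3 is an illustrative choice):

```python
# A sketch of instance-based (lazy) learning using k-nearest neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# fit() essentially just stores the training instances; the real work happens
# at prediction time, when each query is compared against the stored examples.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```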

Feature Engineering for Machine Learning

Feature engineering is the pre-processing step of machine learning which is
used to transform raw data into features that can be used for creating a
predictive model using machine learning or statistical modelling. Feature
engineering in machine learning aims to improve the performance of models.

What is a feature?
Machine learning algorithms take input data to generate output. The input
data is usually in tabular form, consisting of rows (instances or
observations) and columns (variables or attributes), and these attributes are
often known as features.

What is Feature Engineering?

Feature engineering is the pre-processing step of machine learning which
extracts features from raw data. It helps to represent the underlying problem
to predictive models in a better way, which, as a result, improves the
accuracy of the model on unseen data. The predictive model contains predictor
variables and an outcome variable, and the feature engineering process
selects the most useful predictor variables for the model.

Since 2016, automated feature engineering has also been used in various
machine learning software packages, helping to automatically extract features
from raw data. Feature engineering in ML mainly contains four processes:
Feature Creation, Transformations, Feature Extraction, and Feature Selection.

1. Feature Creation: Feature creation is finding the most useful variables
to be used in a predictive model. The process is subjective, and it
requires human creativity and intervention. New features are created by
combining existing features using addition, subtraction, and ratios, and
these new features offer great flexibility.
2. Transformations: The transformation step of feature engineering involves
adjusting the predictor variables to improve the accuracy and performance
of the model. For example, it ensures that the model is flexible enough to
take input from a variety of data, and that all the variables are on the
same scale, making the model easier to understand. It improves the model's
accuracy and ensures that all the features are within an acceptable range
to avoid any computational error.
3. Feature Extraction: Feature extraction is an automated feature
engineering process that generates new variables by extracting them from
the raw data. The main aim of this step is to reduce the volume of data so
that it can be easily used and managed for data modelling. Feature
extraction methods include cluster analysis, text analytics, edge
detection algorithms, and principal components analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few
variables in the dataset are useful for building the model; the remaining
features are either redundant or irrelevant. If we feed the dataset with
all these redundant and irrelevant features into the model, it may
negatively impact and reduce the overall performance and accuracy of the
model. Hence it is very important to identify and select the most
appropriate features from the data and remove the irrelevant or less
important features, which is done with the help of feature selection in
machine learning. "Feature selection is a way of selecting the subset of
the most relevant features from the original feature set by removing the
redundant, irrelevant, or noisy features."

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that researchers can easily
interpret it.
o It reduces the training time.
o It reduces overfitting, hence enhancing generalization.
Feature Selection Techniques in Machine Learning
Feature selection is a way of selecting the subset of the most relevant features from the
original feature set by removing the redundant, irrelevant, or noisy features.

Feature selection is one of the important concepts of machine learning, and
it highly impacts the performance of the model. As machine learning works on
the principle of "Garbage In, Garbage Out", we always need to feed the most
appropriate and relevant dataset to the model in order to get a better
result.

In this topic, we will discuss different feature selection techniques for
machine learning. But before that, let's first understand some basics of
feature selection:

o What is Feature Selection?
o Need for Feature Selection
o Feature Selection Methods/Techniques
o Feature Selection Statistics

What is Feature Selection?

A feature is an attribute that has an impact on the problem or is useful for
the problem, and choosing the important features for the model is known as
feature selection. Each machine learning process depends on feature
engineering, which mainly contains two processes: Feature Selection and
Feature Extraction. Although feature selection and extraction may have the
same objective, they are completely different from each other. The main
difference between them is that feature selection is about selecting a subset
of the original feature set, whereas feature extraction creates new features.
Feature selection is a way of reducing the input variables for the model by
using only relevant data, in order to reduce overfitting in the model.

So, we can define feature selection as, "It is a process of automatically or
manually selecting the subset of the most appropriate and relevant features
to be used in model building." Feature selection is performed by either
including the important features or excluding the irrelevant features from
the dataset without changing them.

Need for Feature Selection

Before implementing any technique, it is really important to understand why
it is needed, and the same holds for feature selection. As we know, in
machine learning it is necessary to provide a pre-processed and good input
dataset in order to get better outcomes. We collect a huge amount of data to
train our model and help it learn better. Generally, the dataset consists of
noisy data, irrelevant data, and some useful data. Moreover, the huge amount
of data also slows down the training process of the model, and with noise and
irrelevant data, the model may not predict and perform well. So, it is very
necessary to remove such noise and less-important data from the dataset, and
to do this, feature selection techniques are used.

Selecting the best features helps the model to perform well. For example,
suppose we want to create a model that automatically decides which car should
be crushed for spare parts, and to do this we have a dataset containing the
model of the car, the year, the owner's name, and the mileage. In this
dataset, the name of the owner does not contribute to the model performance,
as it does not decide whether the car should be crushed or not; so we can
remove this column and select the rest of the features (columns) for model
building.

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that it can be easily
interpreted by researchers.
o It reduces the training time.
o It reduces overfitting, hence enhancing generalization.

Feature Selection Techniques

There are mainly two types of feature selection techniques:

o Supervised Feature Selection techniques
Supervised feature selection techniques consider the target variable and
can be used for labelled datasets.
o Unsupervised Feature Selection techniques
Unsupervised feature selection techniques ignore the target variable and
can be used for unlabelled datasets.

There are mainly three techniques under supervised feature selection:

1. Wrapper Methods
In the wrapper methodology, the selection of features is treated as a search
problem, in which different combinations are made, evaluated, and compared
with other combinations. The algorithm is trained iteratively using different
subsets of features.

On the basis of the output of the model, features are added or removed, and
the model is trained again with the new feature set.

Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process which begins
with an empty set of features. After each iteration, it keeps adding a
feature and evaluates the performance to check whether it is improving or
not. The process continues until the addition of a new variable/feature no
longer improves the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach,
but it is the opposite of forward selection. This technique begins by
considering all the features and removes the least significant feature.
This elimination process continues until removing further features no
longer improves the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the
best-performing feature selection methods; it evaluates every feature set
by brute force. This means the method tries each possible combination of
features and returns the best-performing feature set.
o Recursive Feature Elimination - Recursive feature elimination is a
recursive greedy optimization approach, where features are selected by
recursively taking smaller and smaller subsets of features. An estimator is
trained with each set of features, and the importance of each feature is
determined using the coef_ attribute or the feature_importances_ attribute
(see the sketch after this list).
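
A sketch of a wrapper-style search using scikit-learn's RFE; the estimator and the number of features to keep are illustrative:

```python
# A sketch of recursive feature elimination: the estimator is fitted
# repeatedly, features are ranked by coef_, and the weakest one is dropped
# until only the requested number of features remains.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

print("selected feature mask:", selector.support_)
print("feature ranking:", selector.ranking_)
```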

2. Filter Methods
In filter methods, features are selected on the basis of statistical
measures. These methods do not depend on the learning algorithm and choose
the features as a pre-processing step.

The filter method filters out irrelevant features and redundant columns from
the model by using different metrics through ranking.

The advantage of filter methods is that they need little computational time
and do not overfit the data.

Some common techniques of filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio

Information Gain: Information gain determines the reduction in entropy when
the dataset is split on a variable. It can be used as a feature selection
technique by calculating the information gain of each variable with respect
to the target variable.

Chi-square Test: The chi-square test is a technique for determining the
relationship between categorical variables. The chi-square value is
calculated between each feature and the target variable, and the desired
number of features with the best chi-square values is selected.

Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature
selection. It ranks the variables by Fisher's criterion in descending order;
we can then select the variables with the largest Fisher's scores.

Missing Value Ratio:

The missing value ratio can be used for evaluating each feature against a
threshold value. It is obtained by dividing the number of missing values in a
column by the total number of observations. Variables whose missing value
ratio exceeds the threshold can be dropped.
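
A minimal filter-method sketch using scikit-learn's SelectKBest with the chi-squared score; k = 2 is an illustrative choice (chi2 requires non-negative feature values, which holds for the iris measurements):

```python
# A sketch of a filter method: score every feature against the target with
# the chi-squared statistic, then keep the k best-scoring features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("chi-squared scores:", selector.scores_)
print("reduced shape:", X_selected.shape)   # (150, 2)
```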

3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features along with low computational cost.
They are fast processing methods similar to filter methods, but more
accurate.

These methods are also iterative: they evaluate each training iteration and
optimally find the features that contribute the most to that iteration. Some
techniques of embedded methods are:
o Regularization - Regularization adds a penalty term to the parameters of
the machine learning model to avoid overfitting. This penalty term is added
to the coefficients; hence it shrinks some coefficients to zero. The
features with zero coefficients can be removed from the dataset (see the
sketch after this list). The types of regularization techniques are L1
regularization (Lasso regularization) and Elastic Nets (L1 and L2
regularization).
o Random Forest Importance - Various tree-based feature selection methods
provide feature importances as a way of selecting features. Here, feature
importance specifies which feature has more importance in model building or
has a greater impact on the target variable. Random Forest is such a
tree-based method; it is a type of bagging algorithm that aggregates a
number of decision trees. It automatically ranks the nodes by their
performance, or by the decrease in impurity (Gini impurity), over all the
trees. The nodes are arranged according to the impurity values, which
allows pruning of the tree below a specific node. The remaining nodes
correspond to a subset of the most important features.
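
A sketch of the regularization-based embedded approach, assuming scikit-learn; the lasso alpha is an illustrative value:

```python
# A sketch of an embedded method: L1 (lasso) regularization drives the
# coefficients of irrelevant features to zero, and SelectFromModel keeps
# only the features with non-zero coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
selector = SelectFromModel(estimator=lasso)
selector.fit(X, y)

print("kept features:", np.where(selector.get_support())[0])
print("reduced shape:", selector.transform(X).shape)
```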

How to choose a Feature Selection Method?

For machine learning engineers, it is very important to understand which
feature selection method will work properly for their model. The better we
know the data types of the variables, the easier it is to choose the
appropriate statistical measure for feature selection.
To know this, we first need to identify the types of the input and output
variables. In machine learning, variables are mainly of two types:

o Numerical Variables: Variables with continuous values, such as integers or
floats.
o Categorical Variables: Variables with categorical values, such as Boolean,
ordinal, or nominal values.

Below are some univariate statistical measures which can be used for
filter-based feature selection:

1. Numerical Input, Numerical Output:

Numerical input variables with a numerical output correspond to regression
predictive modelling. The common measure used for such a case is the
correlation coefficient:

o Pearson's correlation coefficient (for linear correlation).
o Spearman's rank coefficient (for non-linear correlation).

2. Numerical Input, Categorical Output:

Numerical input with a categorical output corresponds to classification
predictive modelling problems. In this case also, correlation-based
techniques should be used, but with a categorical output:

o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (non-linear).

3. Categorical Input, Numerical Output:

This is the case of regression predictive modelling with categorical input.
It is a less common example of a regression problem. We can use the same
measures as in the previous case, but in reverse order.

4. Categorical Input, Categorical Output:

This is the case of classification predictive modelling with categorical
input variables.

The commonly used technique for such a case is the Chi-squared test. We can
also use information gain in this case.

Introduction to Dimensionality Reduction Technique

What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given
dataset is known as its dimensionality, and the process of reducing these
features is called dimensionality reduction.

In many cases a dataset contains a huge number of input features, which
makes the predictive modeling task more complicated. Because it is very
difficult to visualize or make predictions for a training dataset with a
high number of features, dimensionality reduction techniques are required
in such cases.

A dimensionality reduction technique can be defined as "a way of
converting a higher-dimensional dataset into a lower-dimensional dataset
while ensuring that it provides similar information." These techniques are
widely used in machine learning for obtaining a better-fitting predictive
model while solving classification and regression problems.

It is commonly used in fields that deal with high-dimensional data, such
as speech recognition, signal processing, and bioinformatics. It can also
be used for data visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice, a problem
commonly known as the curse of dimensionality. As the dimensionality of
the input dataset increases, any machine learning algorithm and model
becomes more complex. As the number of features grows, the number of
samples needed to generalize well also grows, and the chance of
overfitting increases. A machine learning model trained on high-
dimensional data is therefore prone to overfitting and poor performance.

Hence, it is often required to reduce the number of features, which can be
done with dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying the dimensionality reduction technique to a
given dataset are given below:

o By reducing the dimensions of the features, the space required to store
the dataset is also reduced.
o Less computation/training time is required for reduced dimensions of
features.
o Reduced feature dimensions help in visualizing the data quickly.
o It removes redundant features (if present) by taking care of
multicollinearity.

Disadvantages of Dimensionality Reduction

There are also some disadvantages of applying dimensionality reduction,
which are given below:

o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, the number of principal
components to retain is sometimes not known in advance.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are
given below:

Feature Selection
Feature selection is the process of selecting the subset of the relevant
features and leaving out the irrelevant features present in a dataset to
build a model of high accuracy. In other words, it is a way of selecting the
optimal features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of filters method
are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrappers Methods

The wrapper method has the same goal as the filter method, but it uses a
machine learning model for its evaluation. In this method, some features
are fed to the ML model and its performance is evaluated. The performance
decides whether to add or remove those features to increase the accuracy
of the model. This method is more accurate than the filter method but more
complex to apply. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination
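
As an illustration of the wrapper idea, here is a small sketch using
scikit-learn's Recursive Feature Elimination (RFE), which repeatedly fits
a model and removes the weakest features; the estimator, dataset and the
number of features to keep are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so the coefficients are comparable

# The wrapper trains the model, drops the weakest feature(s), and repeats
# until only the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature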

3. Embedded Methods: Embedded methods check the different training
iterations of the machine learning model and evaluate the importance of
each feature. Some common techniques of embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.

Feature Extraction:
Feature extraction is the process of transforming the space containing
many dimensions into space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer
resources while processing the information.

Some common feature extraction techniques are:

a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated
features with the help of an orthogonal transformation. These new
transformed features are called the Principal Components. It is one of
the popular tools used for exploratory data analysis and predictive
modeling.

PCA works by considering the variance of each attribute, because high
variance indicates a good split between the classes, and hence it reduces
the dimensionality. Some real-world applications of PCA are image
processing, movie recommendation systems, and optimizing the power
allocation in various communication channels.
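
A minimal PCA sketch with scikit-learn, assuming the built-in Iris data
and two components purely for illustration:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Standardize first so every attribute contributes on the same scale.
X_std = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto 2 orthogonal principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component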

Backward Feature Elimination

The backward feature elimination technique is mainly used while developing
a Linear Regression or Logistic Regression model. The below steps are
performed in this technique to reduce dimensionality or for feature
selection:

o In this technique, firstly, all the n variables of the given dataset are
taken to train the model.
o The performance of the model is checked.
o Now we remove one feature at a time, train the model on n-1 features n
times, and compute the performance of the model.
o We check for the variable whose removal has made the smallest (or no)
change in the performance of the model and then drop that variable; after
that, we are left with n-1 features.
o Repeat the complete process until no more features can be dropped.

In this technique, by selecting the optimum performance of the model and
the maximum tolerable error rate, we can define the optimal number of
features required for the machine learning algorithm.

Forward Feature Selection

Forward feature selection follows the inverse of the backward elimination
process. In this technique, we don't eliminate features; instead, we find
the best features that produce the highest increase in the performance of
the model. The below steps are performed in this technique:

o We start with a single feature only, and progressively add one feature
at a time.
o Here we train the model on each feature separately.
o The feature with the best performance is selected.
o The process is repeated until adding features no longer gives a
significant increase in the performance of the model.
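
Both procedures can be sketched with scikit-learn's
SequentialFeatureSelector (available from version 0.24), where
direction="forward" implements forward selection and direction="backward"
implements backward elimination; the estimator and the number of features
to keep are illustrative assumptions:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_diabetes(return_X_y=True)

# direction="forward" adds the most helpful feature at each step;
# direction="backward" starts from all features and drops the least useful one.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)

print(sfs.get_support())   # boolean mask of the selected features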

Missing Value Ratio

If a dataset has too many missing values, we drop those variables, as they
do not carry much useful information. To perform this, we can set a
threshold level, and if a variable has a missing-value ratio higher than
that threshold, we drop that variable. The higher the threshold value, the
more efficient the reduction.
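
A small pandas sketch of this idea, with a made-up DataFrame and a
threshold chosen only for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4, 5],
    "b": [np.nan, np.nan, np.nan, 4, 5],
    "c": [1, 2, 3, 4, 5],
})

threshold = 0.5                                 # maximum tolerable missing-value ratio
missing_ratio = df.isnull().mean()              # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]

print(df_reduced.columns.tolist())              # column "b" (60% missing) is dropped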

Low Variance Filter

As with the missing value ratio technique, data columns with very little
variation in their values carry less information. Therefore, we calculate
the variance of each variable, and all data columns with variance lower
than a given threshold are dropped, because low-variance features will not
affect the target variable.
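
A minimal sketch with scikit-learn's VarianceThreshold on a made-up array
(the threshold is an illustrative choice):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 2.0, 0.1],
    [0.0, 1.9, 0.2],
    [0.0, 2.1, 0.9],
    [0.0, 2.0, 0.8],
])

# Drop every column whose variance does not exceed the threshold;
# the first column is constant, so it is removed.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)   # (4, 2)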

High Correlation Filter

High correlation refers to the case when two variables carry approximately
the same information, which can degrade the performance of the model. The
correlation between independent numerical variables is measured by the
calculated correlation coefficient. If this value is higher than a
threshold value, we can remove one of the variables from the dataset,
preferring to keep the variable that shows the higher correlation with the
target variable.
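
A small pandas sketch of the high correlation filter, using made-up data
and a 0.9 threshold purely for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] + rng.normal(scale=0.05, size=100)   # nearly a copy of x1
df["x3"] = rng.normal(size=100)

corr = df.corr().abs()
# Look only at the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)   # ['x2'] is flagged as highly correlated with 'x1'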

Random Forest
Random Forest is a popular and very useful feature selection algorithm in
machine learning. The algorithm has a built-in feature importance measure,
so we do not need to program it separately. In this technique, we generate
a large set of trees against the target variable and use the usage
statistics of each attribute to find the subset of the most important
features.
The random forest algorithm takes only numerical variables, so we need to
convert categorical input data into numeric data using one-hot encoding.
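
A minimal sketch of random forest feature importance with scikit-learn,
using the built-in breast cancer data as an illustrative stand-in:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, aggregated over all trees in the forest.
importances = forest.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(data.feature_names[i], round(importances[i], 3))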

Factor Analysis
Factor analysis is a technique in which variables are kept within a group
according to their correlation with other variables; variables within a
group can have a high correlation among themselves but a low correlation
with variables of other groups.

We can understand it with an example: suppose we have two variables,
income and spending. These two variables have a high correlation, which
means people with high income spend more, and vice versa. So such
variables are put into a group, and that group is known as a factor. The
number of these factors will be small compared to the original
dimensionality of the dataset.
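
A short sketch with scikit-learn's FactorAnalysis, assuming the Iris data
and two latent factors chosen only for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

# Summarize the correlated variables with a small number of latent factors.
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)

print(X_factors.shape)   # (150, 2)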

Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder,
a type of artificial neural network (ANN) whose main aim is to copy its
inputs to its outputs. The input is compressed into a latent-space
representation, and the output is produced from this representation. It
has two main parts:

o Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the
latent-space representation.
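
A minimal Keras sketch of such an auto-encoder, with the input size, latent
size and random training data chosen purely for illustration:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 30, 5

# Encoder: compress the input into the latent-space representation.
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)
# Decoder: reconstruct the input from the latent representation.
decoded = layers.Dense(input_dim, activation="linear")(encoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim).astype("float32")    # placeholder data
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

X_reduced = encoder.predict(X)    # reduced representation, shape (1000, 5)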

Feature Extraction for Style Transferring

After loading the images into memory, we implement the style transfer. To
achieve style transfer, it is necessary to separate the style of an image
from its contents. After that, it is possible to transfer the style
elements of one image onto the content elements of a second image. This is
done mainly using feature extraction from standard convolutional neural
networks.

These features are then manipulated to extract either content information
or style information. This process involves three images: a style image, a
content image and finally a target image. The style of the style image is
combined with the content of the content image to create the final target
image.

This process begins by selecting a few layers within our model to extract
features from. By selecting a few layers to extract features from, we get
a good idea of how our image is being processed throughout the neural
network. We extract the model features of our style image and content
image, and then we extract features from our target image and compare them
to our style image features and our content image features.
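
A hedged sketch of this kind of feature extraction with a pretrained VGG19
from torchvision; the chosen layer indices follow the common style-transfer
convention for torchvision's VGG19 layout and are assumptions rather than
something fixed by the text above:

import torch
from torchvision import models

# Keep only the convolutional feature extractor of a pretrained VGG19.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()

style_layers = {0, 5, 10, 19, 28}   # conv1_1 ... conv5_1 (assumed indices)
content_layer = 21                  # conv4_2 (assumed index)

def extract_features(image):
    # Run the image through the network and collect activations at the chosen layers.
    features, x = {}, image
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in style_layers or idx == content_layer:
                features[idx] = x
    return features

# Usage: image is a normalized tensor of shape (1, 3, H, W).
# feats = extract_features(image)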

Collaborative Filtering and its Types in Python

Collaborative Filtering is the most popular method for creating
intelligent prediction models that get better at making recommendations as
more data about users is gathered.

Collaborative Filtering is used by the majority of large websites, like
Netflix, Amazon, and YouTube, as part of their advanced recommendation
algorithms. This method can create recommenders that make recommendations
to a user based on the shared preferences of similar users.

Collaborative Filtering: What Is It?

With the help of collaborative filtering, items can be filtered for a user
based on the opinions of other users who share that user's interests.

It operates by looking through a big group of people and identifying a
smaller group of users with tastes comparable to a certain user. It
considers the products they enjoy and combines them to produce a ranked
list of recommendations.

Selecting similar users and combining their choices to get a list of
suggestions can be done in various ways. This post will demonstrate how to
accomplish that in Python.

DATASET:
To experiment with recommendation algorithms, you will require data that
includes a set of items and a set of users who have responded to some of
the items.

The response may be explicit (a rating on a scale of 1 to 5, likes or
dislikes) or implicit (viewing an item, adding it to a wish list, the time
spent on an article).

When working with such data, you will typically see it as a matrix of user
responses to various items from a collection of items. Each row lists one
user's ratings, and each column lists the ratings given to one item. An
example of a matrix with five users and five items would be:
Rating Table

o The matrix shows that five people rated various products on a scale
of 1 to 5. For instance, the third item has a rating of 4 from the first
user.
o Since consumers often only rate a small number of things, the
matrix's cells are frequently vacant. It's improbable that every user
will review or comment on every item. A sparse matrix is one in
which most cells are vacant, whereas a dense matrix is the reverse,
with most of the cells filled.
o For study and benchmarking, many datasets have been gathered
and made public. Here is a list of reliable data sources from which
you can select.
o The MovieLens dataset collected by GroupLens Research would be
the ideal one to begin with. The MovieLens 100k dataset, in
particular, is a reliable benchmark dataset with 100,000 ratings for
1682 films from 943 users, with each user having rated at least
20 films.

This dataset consists of numerous files detailing the movies, the users,
and the ratings people have assigned to the films they have seen.

The following files are noteworthy:

o u.item: the list of movies
o u.data: the list of ratings given by users

The file u.data, which holds the ratings, lists the user ID, item ID,
rating, and timestamp as tab-separated values. The file's opening few
lines are as follows:

MovieLens 100k Data's First 5 Rows

As previously mentioned, the file contains the rating that each user
assigned to a specific movie. These 100,000 ratings will be used to
forecast user ratings for movies they have not yet seen.
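
A minimal pandas sketch for loading this file; the path "ml-100k/u.data"
assumes the dataset has been downloaded and unzipped next to the script:

import pandas as pd

# Columns of u.data in the MovieLens 100k download (tab-separated).
columns = ["user_id", "item_id", "rating", "timestamp"]
ratings = pd.read_csv("ml-100k/u.data", sep="\t", names=columns)

print(ratings.head())   # the first five of the 100,000 ratings
print(ratings["user_id"].nunique(), "users,", ratings["item_id"].nunique(), "items")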

Collaborative Filtering Procedure


The first stage in creating a system that can automatically suggest
products to users based on the preferences of other users is to identify
comparable individuals or products. The second step is predicting user
ratings for things that still need to be rated. Consequently, you will require
the following information:
o How can you tell which users or things are comparable to one
another?
o How do you predict the rating a user will give a product based on
the ratings of similar users, given that you know which users are
similar?
o How do you assess the reliability of the ratings you generate?

Step 1: The answers to the first two questions vary. Collaborative
Filtering is a family of algorithms that offers numerous methods for
locating comparable users or items and numerous methods for determining
ratings based on the ratings of comparable users. Depending on your
decisions, you end up with a particular collaborative filtering strategy.
In this post, you'll learn about the various methods for determining
similarity and predicting ratings.

Step 2: The age of users, the movie's genre, or any other information
about users or objects are not used in an approach that relies solely on
collaborative Filtering to determine how similar two items are.

Step 3: It is determined only by the explicit or implicit rating a user
gives to a product. For instance, despite having a significant age gap,
two users can be deemed comparable if they assign the same scores to ten
films.

Step 4: There are several ways to test the accuracy of your predictions,
and the third issue likewise has many possible solutions, including error
calculation methods that apply to other applications besides collaborative
filtering recommenders.

Step 5: The Root Mean Square Error (RMSE), which involves predicting
ratings for a test dataset of user-item pairings whose rating values are
previously known, is one method for gauging the accuracy of your
conclusion. The error would be the discrepancy between the known value
and the forecasted value. Finding the average (or mean) of the test set's
error values, squaring them all, and then taking the square root of that
average will yield the RMSE.

Step 6: Mean Absolute Error (MAE), which finds the amount of error by
obtaining its absolute value and then taking the average of all error
values, is another statistic to gauge accuracy.
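
Both metrics are easy to compute by hand; a tiny numpy sketch with made-up
known and predicted ratings:

import numpy as np

actual = np.array([4.0, 3.0, 5.0, 2.0])        # known ratings in the test set
predicted = np.array([3.5, 3.0, 4.0, 2.5])     # ratings produced by the recommender

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then take the square root
mae = np.mean(np.abs(errors))          # average of the absolute errors

print("RMSE:", rmse, "MAE:", mae)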

Let's examine the many algorithms that make up the collaborative filtering
family.

Memory Based
The first group of algorithms comprises memory-based ones that compute
predictions using statistical methods on the complete dataset.

The following steps are taken to determine the rating R a user U would
assign to item I:

o Finding users who have rated item I similarly to U.
o Calculating the rating R based on the ratings of those users.

Each of these steps is described in further detail in the sections that
follow.

How to Find Comparable Users Using Ratings?

Let's first construct a straightforward dataset to comprehend the idea of
similarity.

The data includes four people named A, B, C, and D who have rated two
films. The ratings are stored in lists, and each list comprises two
numbers representing the rating of each film:

Ratings from A are [1.0, 2.0], from B [2.0, 4.0], from C [2.5, 4.0], and
from D [4.5, 5.0].

To get started with a visual cue, plot the user ratings for the two movies
on a graph and look for a pattern. Each point in such a plot represents a
user, placed according to the ratings they gave the two films.

Measuring similarity by examining the distance between the points is a
good method. The distance can be calculated with the formula for the
Euclidean distance between two points. The following program demonstrates
how to use a scipy function to do this:
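
The code itself is missing from this copy of the text; a minimal
reconstruction using scipy.spatial.distance.euclidean on the rating
vectors above would be:

from scipy import spatial

a = [1.0, 2.0]
b = [2.0, 4.0]
c = [2.5, 4.0]
d = [4.5, 5.0]

# Distance from C's rating vector to each of the other users.
print(spatial.distance.euclidean(c, a))   # 2.5
print(spatial.distance.euclidean(c, b))   # 0.5
print(spatial.distance.euclidean(c, d))   # 2.23606797749979

By this measure, user C's tastes are closest to user B's.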

User-Based vs. Item-Based Collaborative Filtering

The method used in the examples above is user-based collaborative
filtering. It uses the rating matrix to identify users who are similar to
one another based on the ratings they give. The method is known as
item-based collaborative filtering if you use the rating matrix to locate
comparable items based on the ratings users have given them.

Although the two methods are conceptually distinct, they are technically
very similar. Here is a comparison between the two:

o User-based: For a user U and an item I that U has not rated, the
rating is found by selecting N users similar to U who have already
rated item I and computing the rating from these N ratings. The
similarity is computed on rating vectors made up of the users' item
ratings (a toy sketch follows this list).
o Item-based: For an item I, a set of comparable items is determined
based on user ratings. The rating by a user U who hasn't rated I is
found by selecting N comparable items that U has rated and computing
the rating from these N ratings.
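
A toy user-based sketch, with a made-up rating matrix and a deliberately
simplified similarity computed on the first two (co-rated) items only:

import numpy as np

# Rows are users, columns are items; np.nan marks "not rated".
R = np.array([
    [1.0, 2.0, 2.5],
    [2.0, 4.0, np.nan],
    [2.5, 4.0, 4.0],
    [4.5, 5.0, 5.0],
])

def predict_user_based(R, user, item, k=2):
    # Average the item's ratings from the k users closest to `user`.
    candidates = np.where(~np.isnan(R[:, item]))[0]
    candidates = candidates[candidates != user]
    dists = np.array([np.linalg.norm(R[user, :2] - R[u, :2]) for u in candidates])
    nearest = candidates[np.argsort(dists)[:k]]
    return R[nearest, item].mean()

print(predict_user_based(R, user=1, item=2))   # 3.25 with this toy data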

Model-Based
The huge yet sparse user-item matrix is reduced or compressed as part of
the model-based techniques, which fall under the second group. A
fundamental understanding of data pre-processing can be very beneficial
for comprehending this phase.

Diminished Dimensions

The user-item matrix has two dimensions:

1. The total number of users
2. The number of items.

If the matrix is largely empty, reducing the number of dimensions can
enhance the algorithm's performance in terms of both space and time.
Various techniques, including matrix factorization and autoencoders, can
be used to do this.

A huge matrix can be divided into smaller ones by matrix factorization.
This is comparable to integer factorization, where twelve can be expressed
as either 4 x 3 or 6 x 2. In the case of matrices, a matrix A with
dimensions m x n can be broken down into two matrices, X and Y, with
dimensions m x p and p x n, respectively.

The users and items are represented as separate entities in the reduced
matrices. In the first matrix, the m rows stand for the m users, while the
p columns hold information on the attributes or traits of the users. The
item matrix has n items described by the same p attributes.
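
A small sketch of this factorization with scikit-learn's TruncatedSVD, on
a made-up 5 x 5 rating matrix where zeros stand for missing ratings and
p = 2 latent features are assumed:

import numpy as np
from sklearn.decomposition import TruncatedSVD

R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 0],
    [1, 0, 0, 4, 0],
    [0, 1, 5, 4, 0],
], dtype=float)

p = 2                                   # number of latent features
svd = TruncatedSVD(n_components=p, random_state=0)
X = svd.fit_transform(R)                # user matrix, shape (m, p)
Y = svd.components_                     # item matrix, shape (p, n)

R_hat = X @ Y                           # reconstructed matrix used for predictions
print(np.round(R_hat, 1))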

For Example:

Amazon created item-based collaborative Filtering. Item-based Filtering is
quicker and more reliable than user-based Filtering in systems with more
users than items. It works because the average rating an item receives
typically changes more slowly than the average rating a user gives to
different items. It is also known to do better than the user-based
approach when the rating matrix is sparse.

However, the item-based technique performs poorly for datasets featuring
browsing or entertainment-related items, like MovieLens, where the
recommendations it gives seem very obvious to the target users. Such
datasets perform better with content-based filtering or hybrid
recommenders that also consider the content of the data, such as the
genre, as you will see in the next section.

Product Recommendation Machine Learning


Product recommendation is a popular application of machine learning that
aims to personalize the customer shopping experience. By analyzing
customer behavior, preferences, and purchase history, a recommendation
engine can suggest products more likely to interest a particular customer.

The task of proposing a product or products to a consumer based on their
purchasing history is known as "product recommendation" in machine
learning. A product recommender system is a machine learning model that
suggests products, content, or services to a specific consumer. Here,
we've developed a C#.NET Core console application that serves as a
product recommender system using data from Amazon's product co-purchasing
network.

Different product recommendation algorithms can be used to generate
personalized product recommendations. One popular approach is
collaborative filtering, which makes recommendations based on the
behavior and preferences of similar users. For example, if two customers
have purchased similar products in the past, the algorithm may suggest
similar products to both customers.

Another approach is content-based filtering, which makes
recommendations based on the products' attributes. For example, the
algorithm may suggest other products if a customer has purchased a
particular clothing brand.

A more advanced approach is a hybrid recommendation, combining
collaborative and content-based filtering strengths. The hybrid approach
considers both the behavior and preferences of similar users and the
attributes of the products themselves. This can result in more accurate
and relevant recommendations.

A recommendation engine must first be trained on a customer behavior
and product information dataset to generate personalized product
recommendations. This dataset can include purchase history, browsing
history, and customer ratings and reviews.

Once the recommendation engine has been trained, it can generate
recommendations for individual customers. The recommendations can be
presented in various ways, such as a list of products or personalized
product recommendations.

Using product recommendations in the e-commerce industry is becoming
increasingly popular to increase sales and customer satisfaction. A
recommendation engine can increase the chances that a customer will
make a purchase by suggesting products that are more likely to be of
interest to a particular customer.

One of the key benefits of product recommendation is that it can help
increase a customer's average order value (AOV). A recommendation
engine can increase the number of items that a customer purchases
during a single transaction by suggesting additional products that are
likely to be of interest to a customer.

In addition to increasing sales and AOV, product recommendations can
help improve the customer experience. By suggesting products that are
more likely to interest a particular customer, a recommendation engine
can help save customers time and effort when browsing for products.

Moreover, it can also help to increase customer loyalty and retention. By
suggesting products that are more likely to be of interest to a particular
customer, a recommendation engine can help to build a stronger
relationship with the customer and increase the chances that the
customer will return to the website in the future.

Recommender systems are used in various contexts, including movies,
music, news, books, research articles, search queries, social tagging, and
items in general. They have grown in popularity in recent years. The bulk
of today's E-Commerce sites, including eBay, Amazon, Alibaba, etc.,
employ their proprietary recommendation algorithms to better match
customers with the goods they are likely to like. These algorithms are
mostly used in the digital space.

Another important aspect of product recommendation is the ability to
handle the cold-start problem. A cold-start problem occurs when a new
customer visits a website, and the recommendation engine needs more
information about the customer to make personalized recommendations.

A hybrid recommender system uses several different recommendation methods
to produce its output. The recommendation accuracy is typically greater in
hybrid recommender systems than in collaborative or content-based systems
alone, because collaborative filtering lacks knowledge of domain
dependencies, while content-based systems lack knowledge of people's
preferences.

Both factors work together to increase shared knowledge, which improves
suggestions. Exploring novel approaches to integrate content data into
content-based algorithms and collaborative filtering algorithms with user
activity data is especially intriguing, given the increase in knowledge.

One way to handle the cold-start problem is to use a hybrid approach that
combines content-based filtering and demographic information. For
example, suppose a new customer is browsing for men's clothing. In that
case, the recommendation engine can suggest products based on the
most popular men's clothing items and the customer's age and location.

Types of recommendation systems


There are several types of recommendation systems in machine learning,
including:

Content-based filtering: Recommends items based on their similarity to
items the user has previously liked.

Collaborative filtering: Recommends items based on the preferences of
similar users.

Hybrid: Combines both content-based and collaborative filtering to make
recommendations.

Hybrid with memory-based and model-based: Memory-based
recommendation is a way to make recommendations based on the
similarity between items and the users' past behavior, whereas model-
based recommendation uses machine learning algorithms to model the
user behavior and make recommendations.
Hybrid with demographic and user-based: Demographic-based
recommendation is a way to make recommendations based on user
demographic information, and user-based recommendation is a way to
make recommendations based on the similarity of users.

Hybrid with demographic and item-based: Demographic-based
recommendation is a way to make recommendations based on user
demographic information, and item-based recommendation is a way to
make recommendations based on the similarity of items.

Content-based Filtering
Content-based filtering is a recommendation system that suggests items
to users based on their previous interactions with similar items. This
system typically uses the features or attributes of the items to identify
similar items.

For example, if a user has previously watched several action movies, a
content-based recommendation system would suggest other action
movies to the user based on the genre, actors, and other similar attributes
of the movies they have previously watched.

One of the key advantages of content-based filtering is that it can
recommend items to users even if they have not interacted with many items
in the past. It can also make recommendations to users who have unique
tastes and preferences.
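
A minimal content-based sketch using TF-IDF over made-up item
descriptions; the descriptions and the "liked" item are purely
illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions (e.g., genre/keyword strings for movies).
items = [
    "action thriller explosion chase",
    "action adventure hero chase",
    "romance drama love story",
]

tfidf = TfidfVectorizer().fit_transform(items)
similarity = cosine_similarity(tfidf)

liked = 0                      # the user previously liked item 0
scores = similarity[liked].copy()
scores[liked] = -1             # do not recommend the item itself
print("Recommend item", scores.argmax())   # item 1, the other action movie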

Collaborative Filtering
Collaborative filtering is a recommendation system that suggests items to
a user based on similar users' preferences. This system does not use the
attributes or features of the items to make recommendations but instead
uses the past behavior of users to identify similar users and recommend
items that similar users have liked.

There are two main types of collaborative filtering:

User-based collaborative filtering: This method finds similar users based
on their past interactions with items and then recommends items that
similar users have liked. For example, if two users have similar viewing
histories on Netflix, the system may recommend the same movie to both
users.

Item-based collaborative filtering: This method finds similar items based
on how users have interacted with them and then recommends those
similar items to a user. For example, if a user has liked several movies of
a particular genre, the system may recommend other movies of that
genre to the user.
Collaborative filtering is a powerful technique; it can recommend items to
users even if they have not previously interacted with many items. It can
also adapt to changes in users' preferences over time. Additionally, it can
recommend new and diverse items to users.

Hybrid with Memory-based and Model-based


Hybrid recommendation systems that combine memory-based and model-
based approaches are becoming increasingly popular as they can take
advantage of the strengths of both methods while addressing some of
their weaknesses.

A memory-based recommendation system is a way to make
recommendations based on the similarity between items and the users'
past behavior. It typically uses a user-item matrix to store the interactions
between users and items, and then uses a similarity measure, such as
cosine similarity, to find similar items or users.
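
A brief sketch of the memory-based half, computing cosine similarities on
a made-up user-item matrix where zeros stand for unrated items:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
])

user_similarity = cosine_similarity(R)      # user-to-user similarities
item_similarity = cosine_similarity(R.T)    # item-to-item similarities

print(np.round(user_similarity, 2))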

On the other hand, a model-based recommendation system uses machine
learning algorithms to model user behavior and make recommendations.
These algorithms can be used to learn the underlying patterns in the data
and make predictions about which items a user may be interested in.

A hybrid recommendation system that combines memory-based and
model-based approaches can take advantage of the scalability and
interpretability of memory-based methods while addressing some of their
limitations.

For example, a hybrid system can use memory-based methods to make
recommendations quickly and easily while using model-based methods to
learn the underlying patterns in the data and make recommendations for
new users or items.

Additionally, a hybrid system can use memory-based methods to make
recommendations for items like those a user has liked in the past while
using model-based methods to recommend items that may be different
but still of interest to the user.

Overall, hybrid recommendation systems that combine memory-based
and model-based approaches have the potential to overcome some of the
limitations of each method individually and provide more accurate and
diverse recommendations to users.
