Unit - 2 MLA
Syllabus: Regressions: Linear regression, Decision trees, Overfitting, Instance-based learning,
Feature Reduction, Collaborative filtering-based Recommendation Systems
In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells us whether the model has captured a strong relationship or not.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
Y = aX + b
Here Y is the dependent (target) variable, X is the independent (predictor) variable, a is the slope of the line, and b is the intercept.
Logistic Regression:
When we provide the input values (data) to the logistic (sigmoid) function, it produces an S-shaped curve.
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
Polynomial Regression:
o The equation for polynomial regression is derived from the linear regression equation; that is, the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our independent/input variable. A short fitting sketch is given below.
Support Vector Regression:
In the SVR graph, the blue line is called the hyperplane, and the other two lines are known as the boundary lines.
Decision Tree Regression:
A typical example of decision tree regression is a model that tries to predict a person's choice between a sports car and a luxury car.
Lasso Regression:
Lasso regression adds an L1 penalty to the linear regression cost function, which shrinks some coefficients to exactly zero (see Regularization later in this unit).
Linear Regression Model Representation:
y = a0 + a1x + ε
The values of the x and y variables are the training data used to build the Linear Regression model representation.
Types of Linear Regression
Linear regression can be further divided into two types of algorithm: Simple Linear Regression and Multiple Linear Regression.
Different values for the weights or line coefficients (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to calculate this, we use the cost function.
Cost function:
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) * Σ (yi - (a1xi + a0))^2
Where N is the total number of observations, yi is the actual value of the i-th observation, and (a1xi + a0) is the corresponding predicted value.
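A small numeric sketch of this formula, computed with numpy; the sample inputs, outputs, and coefficients below are illustrative assumptions.

# Mean Squared Error for a simple linear model (illustrative values).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # inputs
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])   # actual outputs
a0, a1 = 0.1, 2.0                          # assumed intercept and slope

y_pred = a1 * x + a0                       # predicted values
mse = np.mean((y - y_pred) ** 2)           # average of the squared errors
print(mse)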
Gradient Descent:
Gradient descent is used to minimize the MSE by iteratively updating the coefficients a0 and a1 in the direction of the negative gradient of the cost function until the cost stops decreasing.
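The following is a minimal sketch of batch gradient descent for the simple linear model above; the learning rate, iteration count, and data are assumptions chosen only for illustration.

# Batch gradient descent for y = a1*x + a0, minimising the MSE (illustrative only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

a0, a1 = 0.0, 0.0   # start from zero coefficients
lr = 0.01           # assumed learning rate
n = len(x)

for _ in range(2000):
    y_pred = a1 * x + a0
    # Gradients of the MSE with respect to a0 and a1
    grad_a0 = (-2 / n) * np.sum(y - y_pred)
    grad_a1 = (-2 / n) * np.sum((y - y_pred) * x)
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(a0, a1)       # approaches the best-fit intercept and slope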
Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various candidate models is called optimization. It can be achieved by the method below:
1. R-squared method:
R-squared is a statistical measure of goodness of fit; on a 0-100% scale it indicates how much of the variation in the dependent variable is explained by the model, so a high R-squared means a small difference between the predicted and actual values.
Simple Linear Regression:
y = a0 + a1x + ε
Where a0 is the intercept of the line, a1 is the linear regression coefficient (slope of the line), and ε is the random error term.
Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same form applies, and the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............ (a)
Where,
Y = output/response variable
b0 = intercept of the regression line
b1, b2, ..., bn = regression coefficients of the model
x1, x2, ..., xn = independent/predictor variables
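A hedged sketch of fitting equation (a) with scikit-learn's LinearRegression; the three-feature dataset below is synthetic and used only for illustration.

# Multiple Linear Regression sketch (synthetic data, scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))   # three predictor variables x1, x2, x3
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, y)
print(model.intercept_)    # estimate of b0
print(model.coef_)         # estimates of b1, b2, b3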
Decision Tree Terminologies:
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node. For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes. (A minimal implementation sketch is given after this list.)
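To make the steps concrete, here is a minimal, hedged sketch that trains a decision tree with scikit-learn on the Iris dataset; the dataset and hyperparameters are assumptions chosen only for demonstration.

# Decision tree classification sketch (scikit-learn, Iris dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# criterion="entropy" uses information gain; "gini" would use the Gini index instead.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))  # accuracy on unseen data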
Attribute Selection Measures: while implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. Two popular ASM techniques are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest information gain
is split first. It can be calculated using the formula below:
Information Gain = Entropy(S) - [(weighted average) x Entropy(each feature)]
Where entropy measures the impurity in the data: Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), with P(yes) and P(no) being the proportions of positive and negative examples in the sample S.
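As a hedged numeric illustration of these formulas, the sketch below computes the entropy of a toy labelled sample and the information gain of a hypothetical binary split.

# Entropy and information gain for a toy binary-labelled dataset (illustrative).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array(["yes"] * 9 + ["no"] * 5)   # 9 yes, 5 no before the split
left   = np.array(["yes"] * 6 + ["no"] * 2)   # one branch after the split
right  = np.array(["yes"] * 3 + ["no"] * 3)   # the other branch

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(info_gain)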
2. Gini Index:
The Gini index is a measure of impurity or purity used while creating a decision tree in the CART algorithm; an attribute with a low Gini index should be preferred. It is calculated as Gini Index = 1 - Σj Pj^2, where Pj is the proportion of samples belonging to class j.
Pruning:
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. Pruning is the technique of decreasing the size of the learning tree without reducing accuracy. There are mainly two types of tree pruning techniques used: Cost Complexity Pruning and Reduced Error Pruning.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance.
Ways to avoid overfitting in a model:
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it gives lower accuracy and produces unreliable predictions.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the
machine learning models to achieve the goodness of fit. In statistics
modeling, it defines how closely the result or predicted values match the
true values of the dataset.
The model with a good fit is between the underfitted and overfitted model,
and ideally, it makes predictions with 0 errors, but in practice, it is difficult
to achieve it.
As when we train our model for a time, the errors in the training data go
down, and the same happens with test data. But if we train the model for
a long duration, then the performance of the model may decrease due to
the overfitting, as the model also learn the noise present in the dataset.
The errors in the test dataset start increasing, so the point, just before the
raising of errors, is the good point, and we can stop here for achieving a
good model.
There are two other methods by which we can get a good point for our
model, which are the resampling method to estimate model accuracy
and validation dataset.
Overfitting in Machine Learning
In the real world, the available dataset will never be clean and perfect; every dataset contains impurities such as noisy data, outliers, missing values, or imbalanced classes. Due to these impurities, different problems occur that affect the accuracy and performance of the model. One such problem is overfitting, which a model can exhibit.
A statistical model is said to be overfitted if it cannot generalize well to unseen data.
Variance: if the machine learning model performs well with the training dataset but does not perform well with the test dataset, then variance occurs.
What is Overfitting?
o Overfitting & underfitting are the two main errors/problems in the
machine learning model, which cause poor performance in Machine
Learning.
o Overfitting occurs when the model fits more data than required, and
it tries to capture each and every datapoint fed to it. Hence it starts
capturing noise and inaccurate data from the dataset, which
degrades the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen
dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are
three students, X, Y, and Z, and all three are preparing for an exam. X has
studied only three sections of the book and left all other sections. Y has a
good memory, hence memorized the whole book. And the third student,
Z, has studied and practiced all the questions. So, in the exam, X will only be able to solve the questions if the exam has questions related to the three sections he studied. Student Y will only be able to solve questions if they appear exactly the same as given in the book. Student Z will be able to solve all the exam questions properly.
The same happens with machine learning: if the algorithm learns from only a small part of the data, it is unable to capture the required data points and is hence underfitted.
If the model memorizes the training dataset, like student Y, it performs very well on the seen data but badly on unseen data or unknown instances. In such cases, the model is said to be overfitting.
And if the model performs well with the training dataset and also with the
test/unseen dataset, similar to student Z, it is said to be a good fit.
In the train-test split of the dataset, we divide our dataset into random training and test datasets. We train the model with the training dataset, which is about 80% of the total data. After training, we test the model with the test dataset, which is the remaining 20%.
Now, if the model performs well with the training dataset but not with the test dataset, then it is likely to have an overfitting issue. For example, if the model shows 85% accuracy with the training data and only 50% accuracy with the test dataset, the model is not generalizing well. A short sketch of this check is given below.
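A short sketch of this train/test comparison with scikit-learn; the unrestricted decision tree and the synthetic dataset are deliberate assumptions used to provoke a visible gap between training and test accuracy.

# Comparing training vs. test accuracy to spot overfitting (illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)   # unrestricted depth, prone to overfitting
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically close to 1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower => overfitting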
The following techniques can be used to prevent overfitting:
1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
Early Stopping
In this technique, the training is paused before the model starts learning the noise in the data. While training the model iteratively, we measure its performance after each iteration and continue for as long as a new iteration improves the performance of the model.
After that point, the model begins to overfit the training data; hence we need to stop the process before the learner passes that point. Stopping the training process before the model starts capturing noise from the data is known as early stopping.
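A minimal sketch of the idea: monitor validation error after each training pass and stop once it has not improved for a few rounds. The model, patience value, and data are assumptions made for illustration.

# Manual early stopping using a held-out validation set (illustrative).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDRegressor(random_state=0)
best_err, patience, bad_rounds = np.inf, 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train)              # one more pass over the training data
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_err, bad_rounds = err, 0
    else:
        bad_rounds += 1
    if bad_rounds >= patience:                       # validation error stopped improving
        print("early stopping at epoch", epoch)
        break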
Feature Selection
While building an ML model, we have a number of parameters or features that are used to predict the outcome. However, some of these features may be redundant or less important for the prediction, and for this the feature selection process is applied. In feature selection, we identify the most important features within the training data, and the other features are removed. This process helps to simplify the model and reduces noise in the data. Some algorithms have built-in feature selection; if not, we can perform this process manually.
Cross-Validation
Cross-validation is one of the most powerful techniques for preventing overfitting. In k-fold cross-validation, the training data is divided into k subsets (folds); the model is trained on k-1 folds and validated on the remaining fold, and the process is repeated k times so that each fold serves once as the validation set.
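A hedged sketch of 5-fold cross-validation with scikit-learn; the dataset and model are assumptions used only for illustration.

# 5-fold cross-validation sketch (scikit-learn, Iris dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(scores)        # accuracy on each of the 5 validation folds
print(scores.mean()) # average accuracy across folds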
Data Augmentation
Data Augmentation is a data analysis technique, which is an alternative to
adding more data to prevent overfitting. In this technique, instead of
adding more training data, slightly modified copies of already existing
data are added to the dataset.
Ensemble Methods
In ensemble methods, the predictions from different machine learning models are combined to identify the most popular result.
In bagging (bootstrap aggregation), several random samples of the training data are drawn with replacement, so individual data points can be selected more than once. After the collection of these sample datasets, a model is trained independently on each of them, and depending on the type of task (regression or classification), the average or the majority vote of those predictions is used to produce a more accurate result. Moreover, bagging reduces the chance of overfitting in complex models. A short bagging sketch is given below.
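The sketch below shows bagging with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the synthetic dataset and number of estimators are assumptions for illustration.

# Bagging sketch: many models trained on bootstrap samples, predictions combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each of the 50 estimators is trained on a bootstrap sample drawn with replacement.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

print(bagging.score(X_test, y_test))  # accuracy of the combined (voted) prediction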
What is a feature?
Machine learning algorithms take input data to generate an output. The input data is usually in tabular form, consisting of rows (instances or observations) and columns (variables or attributes); these attributes are often known as features.
Selecting the best features helps the model perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and we have a dataset containing the Model of the car, Year, Owner's name, and Miles. In this dataset, the name of the owner does not contribute to the model's performance, as it does not decide whether the car should be crushed, so we can remove this column and select the rest of the features (columns) for model building.
1. Wrapper Methods
In the wrapper methodology, the selection of features is treated as a search problem, in which different combinations of features are made, evaluated, and compared with other combinations. The algorithm is trained iteratively using subsets of features.
On the basis of the output of the model, features are added or removed, and the model is trained again with the new feature set.
2. Filter Methods
In the filter method, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out the irrelevant features and redundant columns from the model by ranking them with different metrics.
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks the variables by Fisher's criterion in descending order; we can then select the variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used to evaluate a feature against a threshold value. It is obtained by dividing the number of missing values in a column by the total number of observations. Variables whose missing value ratio exceeds the threshold can be dropped; a short sketch is given below.
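A small pandas sketch of the missing value ratio; the toy table and the 50% threshold are assumptions for illustration.

# Missing value ratio per column, dropping columns above a threshold (illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "model": ["A", "B", "C", "D", "E"],
    "year":  [2015, np.nan, 2018, np.nan, np.nan],
    "miles": [40000, 52000, np.nan, 61000, 73000],
})

missing_ratio = df.isnull().sum() / len(df)  # missing values / total observations
print(missing_ratio)

threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index
print(df.drop(columns=to_drop))              # 'year' (60% missing) is removed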
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast, like the filter method, but more accurate than the filter method.
These methods are iterative: each training iteration is evaluated, and the features that contribute the most to that iteration are identified as the most important. Some techniques of embedded methods are:
o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty is applied to the coefficients and shrinks them; with L1 regularization, some coefficients are shrunk exactly to zero, and the features with zero coefficients can be removed from the dataset. Common regularization techniques are L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net (a combination of L1 and L2).
o Random Forest Importance - Tree-based methods of feature selection provide feature importance scores, which give a way of selecting features. Here, feature importance specifies which feature has more importance in model building or a greater impact on the target variable. Random Forest is such a tree-based method; it is a type of bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e., the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their impurity values, which allows the tree to be pruned below a specific node; the remaining nodes form a subset of the most important features. (A short sketch follows this list.)
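A hedged sketch of both embedded ideas on synthetic regression data: Lasso shrinking some coefficients to zero and a random forest reporting feature importances. The data and hyperparameters are assumptions.

# Embedded feature selection sketch: Lasso coefficients and random forest importances.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_)              # several coefficients shrink to 0

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("RF importances:", forest.feature_importances_)  # higher value = more important feature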
The statistical measures listed above, such as Information Gain, the Chi-square test, Fisher's Score, and the Missing Value Ratio, are univariate measures that can be used for filter-based feature selection.
Feature Selection
Feature selection is the process of selecting the subset of the relevant
features and leaving out the irrelevant features present in a dataset to
build a model of high accuracy. In other words, it is a way of selecting the
optimal features from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of filters method
are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and the performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods
Embedded methods evaluate feature importance during the training of the machine learning model itself. Some common techniques of embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
Feature Extraction:
Feature extraction is the process of transforming the space containing
many dimensions into space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer
resources while processing the information.
Principal Component Analysis (PCA)
PCA works by considering the variance of each attribute, because high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. A small sketch is given below.
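A minimal PCA sketch with scikit-learn, projecting the 4-dimensional Iris data onto 2 principal components; the dataset and the number of components are assumptions for illustration.

# PCA sketch: reduce 4 features to 2 principal components (scikit-learn, Iris dataset).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is variance-based, so scale features first

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component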
Backward Feature Elimination
o In this technique, firstly, all n variables of the given dataset are used to train the model.
o The performance of the model is checked.
o Now we remove one feature at a time, train the model on the remaining n-1 features (doing this n times), and compute the performance of the model each time.
o We check for the variable whose removal made the smallest (or no) change in the performance of the model and then drop that variable; after that, we are left with n-1 features.
o Repeat the complete process until no more features can be dropped.
Forward Feature Selection
o We start with a single feature only, and progressively add one feature at a time.
o Here we train the model on each feature separately.
o The feature with the best performance is selected.
o The process is repeated, adding features, until there is no longer a significant increase in the performance of the model.
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. It has a built-in feature importance measure, so we do not need to program it separately. In this technique, we generate a large set of trees against the target variable and use the usage statistics of each attribute to find the subset of most important features.
The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
Factor Analysis
Factor analysis is a technique in which each variable is placed within a group according to its correlation with other variables: variables within a group can have a high correlation between themselves, but they have a low correlation with variables of other groups.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN (artificial neural network) whose main aim is to copy its inputs to its outputs. The input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has two main parts (a small sketch follows the list):
o Encoder: the function of the encoder is to compress the input to form the latent-space representation.
o Decoder: the function of the decoder is to recreate the output from the latent-space representation.
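A minimal autoencoder sketch, assuming TensorFlow/Keras is available; the input size, latent size, and random data are illustrative assumptions only.

# Minimal autoencoder: Dense encoder compresses, Dense decoder reconstructs (illustrative).
import numpy as np
from tensorflow.keras import layers, models

input_dim, latent_dim = 20, 3  # hypothetical input and latent-space sizes

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)    # encoder: compress
decoded = layers.Dense(input_dim, activation="linear")(encoded)  # decoder: reconstruct

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(500, input_dim)  # toy data for illustration
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

encoder = models.Model(inputs, encoded)  # reusable encoder for dimensionality reduction
print(encoder.predict(X[:5]).shape)      # (5, 3) latent representations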
This process begins by selecting a few layers of the model to extract features from; doing so gives a good idea of how an image is being processed throughout the neural network. We extract the model features of our style image and our content image, then extract features from our target image and compare them with the style image features and the content image features.
DATASET:
To experiment with recommendation algorithms, you will require data that includes a set of items and a set of users who have responded to some of the items.
When working with such data, you will typically see it as a matrix of user responses to various items from a collection of items. Each row lists one user's ratings, and each column lists the ratings given to one item. An example of a matrix with five users and five items would be:
Rating Table
o The matrix shows that five people rated various products on a scale
of 1 to 5. For instance, the third item has a rating of 4 from the first
user.
o Since consumers often only rate a small number of things, the
matrix's cells are frequently vacant. It's improbable that every user
will review or comment on every item. A sparse matrix is one in
which most cells are vacant, whereas a dense matrix is the reverse,
with most of the cells filled.
o For study and benchmarking, many datasets have been gathered
and made public. Here is a list of reliable data sources from which
you can select.
o The MovieLens dataset amassed by GroupLens Research would be
the ideal one, to begin with. The MovieLens 100k dataset, in
particular, is a reliable benchmark dataset with 100,000 ratings for
1682 films from 943 individuals, with each user having rated at least
20 films.
This dataset consists of several files detailing the movies, the users, and the ratings users have assigned to the films they have seen.
The ratings are held in the file u.data as a tab-separated list of user ID, item ID, rating, and timestamp; a short loading sketch is given below. The file contains the rating that each user assigned to a specific movie, and these 100,000 ratings will be used to forecast user ratings for movies they haven't seen.
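A hedged loading sketch with pandas; the relative path "ml-100k/u.data" is an assumption about where the downloaded dataset sits.

# Loading the MovieLens 100k ratings file (tab-separated: user, item, rating, timestamp).
import pandas as pd

columns = ["user_id", "item_id", "rating", "timestamp"]
ratings = pd.read_csv("ml-100k/u.data", sep="\t", names=columns)

print(ratings.head())  # first few user-item-rating rows
print(ratings.shape)   # (100000, 4)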
Step 1: The answers to the first two questions can differ. Collaborative Filtering is a family of algorithms that offers numerous methods for locating comparable users or items and numerous methods for determining ratings based on the ratings of comparable users. Depending on your choices, you end up with a particular collaborative filtering strategy. Below, you'll see the various methods for determining similarity and predicting ratings.
Step 2: An approach that relies solely on collaborative filtering does not use the age of users, the movie's genre, or any other information about users or items to determine how similar they are.
Step 4: There are several ways to test the accuracy of your predictions, and this question likewise has many possible solutions, including error calculation methods that also apply to applications other than collaborative filtering recommenders.
Step 5: The Root Mean Square Error (RMSE), which involves predicting ratings for a test dataset of user-item pairs whose rating values are already known, is one method for gauging the accuracy of your predictions. The error is the discrepancy between the known value and the forecasted value. Squaring the error values of the test set, finding their average (or mean), and then taking the square root of that average yields the RMSE.
Step 6: Mean Absolute Error (MAE), which finds the size of each error by taking its absolute value and then averaging all the error values, is another statistic for gauging accuracy. A short sketch of both metrics is given below.
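A small numeric sketch of both metrics; the actual and predicted ratings below are illustrative values only.

# RMSE and MAE for a handful of known vs. predicted ratings (illustrative).
import numpy as np

actual    = np.array([4.0, 3.0, 5.0, 2.0, 4.5])
predicted = np.array([3.5, 3.0, 4.0, 2.5, 5.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))  # square the errors, average them, take the square root
mae  = np.mean(np.abs(errors))        # average of the absolute errors
print(rmse, mae)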
Memory Based
The first group of algorithms comprises memory-based ones that compute
predictions using statistical methods on the complete dataset.
The data includes four people named A, B, C, and D who have rated two films. The ratings are stored in lists, and each list contains two numbers representing the rating given to each film:
Ratings from A are [1.0, 2.0], from B [2.0, 4.0], from C [2.5, 4.0], and from D [4.5, 5.0].
Plot the user ratings for the two movies on a graph, then look for a pattern to get started with a visual cue. Each point in the graph represents a user, plotted against the ratings they gave to the two films.
Measuring similarity by examining the distance between the points is a good method. The distance can be calculated with the formula for the Euclidean distance between two points. The following program demonstrates how to use a scipy function for this:
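The exact program is not reproduced in the source, so the following is a sketch of what it might look like, using scipy.spatial.distance.euclidean on the rating lists from the example above.

# Euclidean distance between users' two-movie rating vectors (scipy).
from scipy import spatial

a = [1.0, 2.0]
b = [2.0, 4.0]
c = [2.5, 4.0]
d = [4.5, 5.0]

print(spatial.distance.euclidean(c, a))  # 2.5
print(spatial.distance.euclidean(c, b))  # 0.5 -> C's ratings are closest to B's
print(spatial.distance.euclidean(c, d))  # about 2.24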
Although the two approaches are technically very similar, they are distinct concepts. Here is a comparison between the two:
o User-based: for a user U, the rating for an item I that U hasn't rated is found by selecting N users similar to U who have already rated item I and computing the rating from these N ratings. The similarity is determined using rating vectors made up of the users' item ratings.
o Item-based: for an item I, a set of comparable items is determined based on user ratings; the rating by a user U who hasn't reviewed I is found by selecting N comparable items that have been rated by U and determining the rating from these N ratings.
Model-Based
The huge yet sparse user-item matrix is reduced or compressed as part of
the model-based techniques, which fall under the second group. A
fundamental understanding of data pre-processing can be very beneficial
for comprehending this phase.
Diminished Dimensions
In the reduced matrices, the users and the items are represented as separate entities. In the first matrix, the m rows stand for the m users, while the p columns describe the latent attributes or traits of the users. The item matrix is similar, with n rows for the n items and the same p columns of attributes. A small factorization sketch is given below.
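A hedged factorization sketch using truncated SVD from numpy; the small rating matrix and the choice of p = 2 latent factors are illustrative assumptions.

# Reduced-dimension sketch: factorize a small user-item matrix into user and item factors.
import numpy as np

# Rows = 4 users, columns = 3 items; 0.0 stands in for a missing rating (illustrative).
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 1.0, 4.0],
])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
p = 2                                   # number of latent factors to keep
user_factors = U[:, :p] * s[:p]         # m x p matrix describing the users
item_factors = Vt[:p, :].T              # n x p matrix describing the items

approx = user_factors @ item_factors.T  # reconstructed (dense) rating matrix
print(np.round(approx, 2))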
For Example: One way to handle the cold-start problem (when a new user or item has little or no interaction history) is to use a hybrid approach that combines content-based filtering and demographic information. Suppose a new customer is browsing for men's clothing; in that case, the recommendation engine can suggest products based on the most popular men's clothing items together with the customer's age and location.
Content-based Filtering
Content-based filtering is a recommendation system that suggests items
to users based on their previous interactions with similar items. This
system typically uses the features or attributes of the items to identify
similar items.
Collaborative Filtering
Collaborative filtering is a recommendation system that suggests items to
a user based on similar users' preferences. This system does not use the
attributes or features of the items to make recommendations but instead
uses the past behavior of users to identify similar users and recommend
items that similar users have liked.
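To close, here is a hedged sketch of the collaborative filtering idea in its item-based form: items are compared through the cosine similarity of their rating columns. The toy matrix is an assumption for illustration.

# Item-based collaborative filtering sketch: cosine similarity between item rating columns.
import numpy as np

# Rows = users, columns = items (toy rating matrix).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(R[:, 0], R[:, 1]))  # items 1 and 2 get similar ratings -> high similarity
print(cosine(R[:, 0], R[:, 2]))  # items 1 and 3 attract opposite ratings -> lower similarity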