
R20-MACHINE LEARNING

Unit-II:
Unit II: Supervised Learning (Regression/Classification): Basic Methods: Distance-based Methods, Nearest Neighbours, Decision Trees, Naive Bayes; Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models; Support Vector Machines; Binary Classification: Multiclass/Structured outputs, MNIST, Ranking.

Basic Methods: Distance-based Methods:


Introduction:
A distance function provides the distance between the elements of a set. If the distance between two elements is zero, the elements are equivalent; otherwise they are different from each other.
Distance-based algorithms are nonparametric methods that can be used for classification and clustering.
These algorithms classify objects by the dissimilarity between them as measured by distance functions.

A commonly used family of distance functions is the Minkowski distance of order p:
Dis_p(x, y) = ( Σi |xi − yi|^p )^(1/p)
The 2-norm gives the familiar Euclidean distance, which measures distance ‘as the crow flies’. The 1-norm denotes Manhattan distance, also called cityblock distance, which sums the absolute differences along each coordinate. The ∞-norm gives the Chebyshev distance, the largest absolute difference along any single coordinate.

We can manipulate the value of p and calculate the distance in three different ways:

p = 1, Manhattan Distance

p = 2, Euclidean Distance

p = ∞, Chebychev Distance
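As a small illustration of these three special cases, the following sketch (assuming two made-up 2-D points) computes the Manhattan, Euclidean and Chebyshev distances with NumPy:

import numpy as np

x = np.array([0.0, 3.0])   # hypothetical points, chosen only for illustration
y = np.array([4.0, 0.0])

def minkowski(x, y, p):
    # Minkowski distance of order p between two points
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski(x, y, 1))        # Manhattan distance: 7.0
print(minkowski(x, y, 2))        # Euclidean distance: 5.0
print(np.max(np.abs(x - y)))     # Chebyshev distance (p -> infinity): 4.0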

A proper distance function satisfies the following properties:
1. distances between a point and itself are zero: Dis(x, x) = 0;
2. all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
3. distances are symmetric: Dis(y, x) = Dis(x, y);
4. detours cannot shorten the distance (triangle inequality): Dis(x, z) ≤ Dis(x, y) + Dis(y, z).

Nearest Neighbours:
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset; when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data set that are similar to the cat and dog images and, based on the most similar features, it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance between the new data point and the training points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
o Step-6: Our model is ready.

Replacing the Linear Regression model with k-Nearest Neighbors regression in the below code example 1.1
is as simple as replacing these two lines:

Example 1-1. Training and running a linear model using Scikit-Learn
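The original code listing is not reproduced above, so the following is only a sketch of what Example 1-1 might look like, using a small made-up dataset X (one feature) and y (continuous targets); the point is that switching from Linear Regression to k-Nearest Neighbors regression only requires swapping the two model lines:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one feature, continuous target
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

# Train and run a linear model
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.predict([[2.5]]))

# Replacing it with k-Nearest Neighbors regression is as simple as
# replacing these two lines:
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X, y)
print(knn_reg.predict([[2.5]]))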

Suppose we have a new data point and we need to put it in the required category. Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √( (x2 − x1)² + (y2 − y1)² )

o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find
the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the
model.
o Large values for K reduce the effect of noise, but they can smooth over local structure and make the boundaries between categories less distinct.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points for all the
training samples.
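As a hedged illustration of the classification use of KNN described above, the sketch below uses Scikit-Learn's KNeighborsClassifier on a synthetic two-class dataset; the dataset and K = 5 are arbitrary choices for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-class dataset (think of it as Category A / Category B)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# K = 5 neighbours, Euclidean distance (p=2 is the default metric)
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)

print(knn.predict(X_test[:5]))      # predicted categories for new points
print(knn.score(X_test, y_test))    # classification accuracy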

Decision Trees:

o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further
gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a
leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the
given conditions.
Branch/Sub Tree: A subtree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and, based
on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.

o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.

Attribute Selection Measures:

While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. By this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the below
formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.
Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)


Where,
o S= Total number of samples
o P(yes)= probability of yes
o P(no)= probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
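A minimal sketch of how the entropy, information gain and Gini index formulas above could be computed in Python; the class counts (9 'yes' / 5 'no' in the parent node and the two child nodes) are made-up illustrative numbers:

import numpy as np

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)log2 P(yes) - P(no)log2 P(no), with 0*log(0) treated as 0
    terms = [p * np.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

def gini(p_yes, p_no):
    # Gini Index = 1 - sum_j Pj^2
    return 1 - (p_yes ** 2 + p_no ** 2)

# Hypothetical parent node: 9 'yes' and 5 'no' samples
parent = entropy(9 / 14, 5 / 14)

# Hypothetical split into two children, weighted by their sizes
left = entropy(6 / 8, 2 / 8)      # child with 8 samples
right = entropy(3 / 6, 3 / 6)     # child with 6 samples
weighted_children = (8 / 14) * left + (6 / 14) * right

information_gain = parent - weighted_children
print(parent, information_gain, gini(9 / 14, 5 / 14))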

The ID3 algorithm performs the following tasks recursively to construct a decision tree:
1. Create root node for the tree
2. If all examples are positive, return leaf node ‘positive’
3. Else if all examples are negative, return leaf node ‘negative’
4. Calculate the entropy of current state H(S)
5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)
6. Select the attribute which has maximum value of IG(S, x)
7. Remove the attribute that offers highest IG from the set of attributes
8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.

Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features
of the dataset. Therefore, a technique that decreases the size of the learning tree without reducing accuracy is
known as Pruning. There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.

From the O'Reilly textbook:


Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are very powerful algorithms, capable of fitting complex datasets.
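As a small, hedged example of the above, the following sketch trains a CART decision tree with Scikit-Learn on the Iris dataset; the criterion ('gini' vs 'entropy') and max_depth=3 are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Scikit-Learn uses the CART algorithm; criterion can be "gini" or "entropy"
tree_clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree_clf.fit(X, y)

# Inspect the learned decision rules (root node, branches, leaf nodes)
print(export_text(tree_clf, feature_names=iris.feature_names))
print(tree_clf.predict([[5.1, 3.5, 1.4, 0.2]]))  # predict the class of one flower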

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
o Decision trees, and in particular very deep trees that were not subject to pruning, are heavily reliant
on their training set. A small change in the training set can result in a dramatic change of the
resulting decision tree. Their inferior predictive accuracy, however, is a direct consequence of the
bias–variance tradeoff. Specifically, a decision tree model generally exhibits a high variance.

To overcome the above limitations, several promising approaches like ensemble learning techniques such
as bagging, random forest, and boosting are introduced.

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used
for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, helping to build fast machine learning models that can make quick predictions.

o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis,
Hybrid Recommender System and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = [ P(B|A) × P(A) ] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Example:
Suppose we have a training data set of weather conditions and a corresponding target variable ‘Play’ (suggesting the possibility of playing). Now, we need to classify whether players will play or not based on the weather condition. Let’s follow the below steps to perform it.
Step 1: Convert the data set into a frequency table
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and probability
of playing is 0.64.

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class
with the highest posterior probability is the outcome of prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using above discussed method of posterior probability.

P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60, which is the higher probability, so the prediction is that players will play.

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o In situations where there are noisy and missing data, it performs well.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.
o If the dataset contains a large number of numeric features, then the reliability of the outcome becomes limited.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
Since continuous values are present in the dataset, the formula for the conditional probability changes to
P(xi | y) = (1 / √(2πσy²)) × exp( −(xi − μy)² / (2σy²) )
where μy and σy² are the mean and variance of feature xi for class y.

o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular document
belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
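A minimal sketch using Scikit-Learn's GaussianNB on the Iris dataset (continuous features); MultinomialNB and BernoulliNB would be used the same way for word-count or Boolean features. The dataset and split are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # MultinomialNB / BernoulliNB for counts / booleans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian Naive Bayes assumes each feature follows a normal distribution per class
nb = GaussianNB()
nb.fit(X_train, y_train)

print(nb.predict(X_test[:5]))        # predicted classes
print(nb.predict_proba(X_test[:1]))  # class probabilities for one instance
print(nb.score(X_test, y_test))      # accuracy on the test set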

NOTE: See in Unit-3 Also for Naïve Bayes classification

Linear Models:
Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
Linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
A linear model makes a prediction by simply computing a weighted sum of the input features, plus a
constant called the bias term (also called the intercept term).
Linear Regression model prediction equation:

ŷ = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn

where ŷ is the predicted value, n is the number of features, xi is the i-th feature value, and θj is the j-th model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).

This can be written much more concisely using the vectorized form ŷ = hθ(x) = θᵀx.

Note: In Machine Learning, vectors are often represented as column vectors, which are 2D arrays with a single column. If θ and x are column vectors, then the prediction is ŷ = θᵀx, where θᵀ is the transpose of θ (a row vector instead of a column vector) and θᵀx is the matrix multiplication of θᵀ and x.

The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

Mathematically, we can represent a linear regression as:
y= a0+a1x+ ε
Here,
Y=Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Types of Linear Regression:
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then
such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis,
then such a relationship is called a negative linear relationship.

Finding the best fit line:


Training a model means setting its parameters so that the model best fits the training set. For this purpose,
we first need a measure of how well (or poorly) the model fits the training data.
The most common performance measure of a regression model is the Root Mean Square Error (RMSE):

RMSE(X, h) = √( (1/m) Σi ( h(x(i)) − y(i) )² )

Therefore, to train a Linear Regression model, you need to find the value of θ that minimizes the RMSE. It is simpler to minimize the Mean Square Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).
The MSE of a Linear Regression hypothesis hθ on a training set X is calculated using

MSE(X, hθ) = (1/m) Σi ( θᵀx(i) − y(i) )²
When working with linear regression, our main goal is to find the best fit line that means the error between
predicted values and actual values should be minimized. The best fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.
Cost function-
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear regression
model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the above linear equation, it can be written as:

MSE = (1/N) Σi ( yi − (a1xi + a0) )²

Where,
N=Total number of observation
Yi = Actual value
(a1xi+a0)= Predicted value.

Performing linear regression using Scikit-Learn is quite simple:

The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for “least squares”),
which you could call directly:
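The original listings are not shown above; the following is a hedged sketch of what they would typically contain, using a small synthetic dataset whose true relationship (y = 4 + 3x plus noise) is an arbitrary choice:

import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 4 + 3x + Gaussian noise (illustrative values)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)   # should be close to 4 and 3
print(lin_reg.predict([[0.0], [2.0]]))

# The LinearRegression class is based on scipy.linalg.lstsq ("least squares"),
# which you can call directly on the bias-augmented feature matrix:
X_b = np.c_[np.ones((100, 1)), X]          # add x0 = 1 to each instance
theta_best, residuals, rank, s = lstsq(X_b, y)
print(theta_best)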

The Bias/Variance Tradeoff:

An important theoretical result of statistics and Machine Learning is the fact that a model’s generalization
error can be expressed as the sum of three very different errors:
Bias
This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.
Variance
This part is due to the model’s excessive sensitivity to small variations in the training data. A model with
many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus
to overfit the training data.
Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up
the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers). Increasing a
model’s complexity will typically increase its variance and reduce its bias.
Conversely, reducing a model’s complexity increases its bias and reduces its variance.
This is why it is called a tradeoff.

Note: Overfitting corresponds to high variance, i.e. a high error on test samples (test error).
Underfitting corresponds to high bias, i.e. a high error on training samples (training error).

Regularization
There are extensions of the training of the linear model called regularization methods. These seek both to minimize the sum of the squared errors of the model on the training data (using ordinary least squares) and to reduce the complexity of the model (for example, the number or the absolute size of the sum of all coefficients in the model).
For a linear model, regularization is typically achieved by constraining the weights of the model. We will
now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways
to constrain the weights.
Two popular examples of regularization procedures for linear regression are:
o Lasso Regression (Least Absolute Shrinkage and Selection Operator Regression): Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization).

Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is
another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization
term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the
ℓ2 norm
Sample Code:
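The sample listing is not shown above; a minimal sketch of Lasso Regression in Scikit-Learn (the tiny dataset and alpha value are arbitrary) might look like:

from sklearn.linear_model import Lasso

X = [[0.0], [1.0], [2.0], [3.0]]   # hypothetical single-feature data
y = [0.1, 1.1, 1.9, 3.2]

lasso_reg = Lasso(alpha=0.1)       # alpha controls the strength of the l1 penalty
lasso_reg.fit(X, y)
print(lasso_reg.predict([[1.5]]))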

o Ridge Regression: Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term equal to α Σi θi² is added to the cost function.

These methods are effective to use when there is collinearity in your input values and ordinary least squares
would overfit the training data.
Sample code:
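A corresponding minimal sketch for Ridge Regression (again with arbitrary data, alpha and solver choices):

from sklearn.linear_model import Ridge

X = [[0.0], [1.0], [2.0], [3.0]]   # hypothetical single-feature data
y = [0.1, 1.1, 1.9, 3.2]

ridge_reg = Ridge(alpha=1.0, solver="cholesky")  # l2-penalized linear regression
ridge_reg.fit(X, y)
print(ridge_reg.predict([[1.5]]))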

And using Stochastic Gradient Descent:
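A sketch of the same L2-style penalty applied via Stochastic Gradient Descent, with illustrative hyperparameters:

import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # hypothetical data
y = np.array([0.1, 1.1, 1.9, 3.2])

sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000, tol=1e-3)
sgd_reg.fit(X, y)                            # y must be a 1D array here
print(sgd_reg.predict([[1.5]]))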

Note: It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as
it is sensitive to the scale of the input features. This is true of most regularized models.

Elastic Net:
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a
simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r. When r = 0,
Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression
Elastic Net cost function:
J(θ) = MSE(θ) + r α Σi |θi| + ((1 − r)/2) α Σi θi²

Sample Code:
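A minimal Elastic Net sketch, where l1_ratio plays the role of the mix ratio r described above (values are illustrative):

from sklearn.linear_model import ElasticNet

X = [[0.0], [1.0], [2.0], [3.0]]   # hypothetical data
y = [0.1, 1.1, 1.9, 3.2]

# l1_ratio is the mix ratio r: 0 -> pure Ridge, 1 -> pure Lasso
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.predict([[1.5]]))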

Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of finding
the best model out of various models is called optimization. It can be achieved by below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a
scale of 0-100%.
o The high value of R-square determines the less difference between the predicted values and actual
values and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple determination for multiple
regression.
o It can be calculated from the below formula:
R-squared = Explained variation / Total variation = 1 − ( Σi (yi − ŷi)² / Σi (yi − ȳ)² )

Advantages and Disadvantages


Advantages:
o Linear regression performs exceptionally well for linearly separable data
o Easier to implement, interpret and efficient to train
o It handles overfitting pretty well using dimensionality reduction techniques, regularization, and cross-validation
o One more advantage is the extrapolation beyond a specific data set

Disadvantages:
o The assumption of linearity between dependent and independent variables
o It is often quite prone to noise and overfitting
o Linear regression is quite sensitive to outliers
o It is prone to multicollinearity
Linear Regression Applications:
o Sales Forecasting
o Risk Analysis
o Housing Applications: to predict prices and other factors
o Finance Applications: to predict stock prices, investment evaluation, etc.
Simple Linear Regression:
Simple Linear Regression is a type of Regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value.
However, the independent variable can be measured on continuous or categorical values.
Simple Linear regression algorithm has mainly two objectives:
o Model the relationship between the two variables. Such as the relationship between Income and
expenditure, experience and Salary, etc.

o Forecasting new observations. Such as Weather forecasting according to temperature, Revenue of a
company according to the investments in a year, etc.
Simple Linear Regression Model:
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε or y = β0 + β1 x
Where,
a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)

β1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² ,  β0 = ȳ − β1 x̄
Multiple Linear Regression
Multiple Linear Regression is an extension of Simple Linear regression as it takes more than one predictor
variable to predict the response variable.
We can define it as:
Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent variable.
Some key points about MLR:
o For MLR, the dependent or target variable(Y) must be the continuous/real, but the predictor or
independent variable may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.
MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple predictor variables
x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear Regression, so the same is applied for the
multiple linear regression equation, the equation becomes:
Ŷ = b0 + b1x1 + b2x2 + b3x3 + ⋯ + bnxn    ... (a)
Where,
Ŷ = Output/Response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model
x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:


o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variable) in data.
In multiple regression, there are two primary problems: multicollinearity and heteroskedasticity. Two variables are perfectly collinear if there is an exact linear relationship between them. Multicollinearity is the situation in which there is a strong correlation not only between the dependent variable and the independent variables, but also among the independent variables themselves. Heteroskedasticity refers to the changing variance of the error term: if the variance of the error term is not constant across the data, the predictions will be erroneous.

Improving Accuracy of the Linear Regression Model:
o Let us first understand bias and variance in the regression model before exploring how to improve it.
o The concept of bias and variance is similar to accuracy and prediction. Accuracy refers to how close the estimation is to the actual value, whereas prediction refers to continuous estimation of the value.
o High bias = low accuracy (not close to the real value)
  High variance = low prediction (values are scattered)
  Low bias = high accuracy (close to the real value)
  Low variance = high prediction (values are close to each other)
In the linear regression model, it is assumed that the number of observations (n) is greater than the number
of parameters (k) to be estimated, i.e. n > k, and in that case, the least squares estimates tend to have low
variance and hence will perform well on test observations.
However, if observations (n) is not much larger than parameters (k), then there can be high variability in the
least squares fit, resulting in overfitting and leading to poor predictions.
If k > n, then linear regression is not usable. This also indicates infinite variance, and so, the method cannot
be used at all.
Accuracy of linear regression can be improved using the following three methods:
1. Shrinkage Approach
2. Subset Selection
3. Dimensionality (Variable) Reduction

1 Shrinkage (Regularization) approach:


By limiting (shrinking) the estimated coefficients, we can try to reduce the variance at the cost of a
negligible increase in bias. This can in turn lead to substantial improvements in the accuracy of the model.
Few variables used in the multiple regression model are in fact not associated with the overall response and
are called as irrelevant variables; this may lead to unnecessary complexity in the regression model.
This approach involves fitting a model involving all predictors. However, the estimated coefficients are
shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization)
has the effect of reducing the overall variance. Some of the coefficients may also be estimated to be exactly
zero, thereby indirectly performing variable selection.
The two best-known techniques for shrinking the regression coefficients towards zero are
1. ridge regression
2. lasso (Least Absolute Shrinkage Selector Operator)
See in above concepts discussed in this unit.
2 Subset selection:
There are two methods in which subset of the regression can be selected:
1. Best subset selection (considers all the possible 2^k subsets)
2. Stepwise subset selection
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
In best subset selection, we fit a separate least squares regression for each possible subset of the k predictors.
For computational reasons, best subset selection cannot be applied with very large value of predictors (k).
The best subset selection procedure considers all the possible (2^k) models containing subsets of the k predictors.

The stepwise subset selection method can be applied to choose the best subset. There are two stepwise
subset selection:
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
Forward stepwise selection begins with a model containing no predictors, and then, predictors are added one
by one to the model, until all the k predictors are included in the model. In particular, at each step, the
variable (X) that gives the highest additional improvement to the fit is added.
Backward stepwise selection begins with the least squares model which contains all k predictors and then
iteratively removes the least useful predictor one by one.

3 Dimensionality reduction (Variable reduction):


The earlier methods, namely subset selection and shrinkage, control variance either by using a subset of the
original variables or by shrinking their coefficients towards zero. In dimensionality reduction, predictors (X)
are transformed, and the model is set up using the transformed variables after dimensionality reduction. The
number of variables is reduced using the dimensionality reduction method. Principal component analysis is
one of the most important dimensionality (variable) reduction techniques.

Logistic Regression (Logit Regression):


o Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0
and 1.
o Logistic Regression is much similar to Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that
an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the
estimated probability is greater than 50%, then the model predicts that the instance belongs to that
class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the
negative class, labeled “0”). This makes it a binary classifier.
o Logistic Regression model estimated probability (vectorized form):
p̂ = hθ(x) = σ(θᵀx)
The logistic function, noted σ(·), is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1:
σ(t) = 1 / (1 + exp(−t))

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the cells
are cancerous or not, a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification. The below image is showing
the logistic function:

The logistic formulae are stated in terms of the probability that Y = 1, which is referred to as P. The probability that Y is 0 is 1 − P.

ln( P / (1 − P) ) = a + bX

The ‘ln’ symbol refers to the natural logarithm and a + bX is the regression line equation. Probability (P) can also be computed from the regression equation:

P = exp(a + bX) / (1 + exp(a + bX))

So, if we know the regression equation, we could, theoretically, calculate the expected probability that Y = 1 for a given value of X. ‘exp’ is the exponential function, which is sometimes also written as e.


Let us say we have a model that can predict whether a person is male or female on the basis of their height.
Given a height of 150 cm, we need to predict whether the person is male or female.
We know that the coefficients of a = −100 and b = 0.6.
Using the above equation, we can calculate the probability of male given a height of 150 cm, or more formally P(male | height = 150):

P = exp(−100 + 0.6 × 150) / (1 + exp(−100 + 0.6 × 150)) = exp(−10) / (1 + exp(−10)) ≈ 0.0000454,

or a probability of near zero that the person is a male.

Logistic Regression model prediction:

ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5

Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if θᵀx is positive, and 0 if it is negative.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic
function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Assumptions for Logistic Regression:
o There exists a linear relationship between the logit function and the independent variables
o The dependent variable Y must be categorical (1/0) and take a binary value,
e.g. if pass then Y = 1; else Y = 0
o The data meets the ‘iid’ (independent and identically distributed) criterion, i.e. the error terms, ε, are independent from one another and identically distributed
o The error term follows a binomial distribution [n, p]
o n = # of records in the data
o p = probability of success (pass, responder)

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ⋯ + bnxn
o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 − y):
y / (1 − y);  0 for y = 0, and infinity for y = 1
o But we need a range between −[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + ⋯ + bnxn

The above equation is the final equation for Logistic Regression.


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the same
steps as we have done in previous topics of Regression. Below are the steps:
o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.
Training and Cost Function:
This idea is captured by the cost function shown in the below equation for a single training instance x.
Cost function of a single training instance:
c(θ) = −log(p̂) if y = 1, and c(θ) = −log(1 − p̂) if y = 0
Logistic Regression cost function (log loss):
J(θ) = −(1/m) Σi [ y(i) log(p̂(i)) + (1 − y(i)) log(1 − p̂(i)) ]
Logistic cost function partial derivatives:
∂J(θ)/∂θj = (1/m) Σi ( σ(θᵀx(i)) − y(i) ) xj(i)

Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature. First let’s
load the data:

Now let’s train a Logistic Regression model:

Let’s look at the model’s estimated probabilities for flowers with petal widths varying from 0 to 3 cm
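The listings for this example are not reproduced above; the sketch below reconstructs the same idea under the assumption that only the petal width feature (the last Iris column) is used and that class 2 corresponds to Iris-Virginica:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the data and keep only the petal width feature
iris = load_iris()
X = iris.data[:, 3:]                      # petal width (cm)
y = (iris.target == 2).astype(int)        # 1 if Iris-Virginica, else 0

# Train a Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities for petal widths from 0 to 3 cm
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
print(y_proba[:3])                        # [P(not virginica), P(virginica)]
print(log_reg.predict([[1.7], [1.5]]))    # predictions near the decision boundary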

Output: [Figure: Estimated probabilities and decision boundary]

Softmax Regression:

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression, or Multinomial Logistic Regression.
When given an instance x, the Softmax Regression model first computes a score sk(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. The equation to compute sk(x) should look familiar, as it is just like the equation for Linear Regression prediction.
Softmax score for class k:
sk(x) = xᵀθ(k)
Note that each class has its own dedicated parameter vector θ(k). All these vectors are typically stored as rows in a parameter matrix Θ.
Softmax function:
σ(s(x))k = exp(sk(x)) / Σ(j=1..K) exp(sj(x))
• K is the number of classes.


• s(x) is a vector containing the scores of each class for the instance x.
• σ(s(x))k is the estimated probability that the instance x belongs to class k given the scores of each
class for that instance.
Softmax Regression classifier prediction:
ŷ = argmax(k) σ(s(x))k = argmax(k) sk(x)
• The argmax operator returns the value of a variable that maximizes a function. In this equation, it returns the value of k that maximizes the estimated probability σ(s(x))k.

Note: The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput), so it should be used only with mutually exclusive classes such as different types of plants. You cannot use it to recognize multiple people in one picture.
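As a hedged illustration, Softmax Regression can be obtained in Scikit-Learn by using LogisticRegression with multi_class="multinomial"; the feature subset, solver and C value below are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, (2, 3)]   # petal length, petal width
y = iris.target            # three mutually exclusive classes

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))         # predicted class index
print(softmax_reg.predict_proba([[5, 2]]))   # softmax probabilities for each class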

Generalized Linear Models:


GLMs can be used to construct the models for regression and classification problems by using the type of
distribution which best describes the data or labels given for training the model.
The normal linear model deals with continuous response variables — such as height and crop yield — and
continuous or discrete explanatory variables. Given the feature vectors {xi}, the responses {Yi} are
independent of each other, and each has a normal distribution with mean xiT β, where xiT is the i-th row of
the model matrix X. Generalized linear models allow for arbitrary response distributions, including discrete
ones.
Definition: In a generalized linear model (GLM) the expected response for a given feature vector x = [x1, . . . , xp]ᵀ is of the form

E[Y | x] = h(xᵀβ)
for some function h, which is called the activation function. The distribution of Y (for a given x) may
depend on additional dispersion parameters that model the randomness in the data that is not explained by x.
The inverse of function h is called the link function.

Support Vector Machines:
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of
performing linear or nonlinear classification, regression, and even outlier detection.
SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

SVM is a model which can do linear classification as well as regression. SVM is based on the concept of a surface, called a hyperplane, which draws a boundary between data instances plotted in the multi-dimensional feature space. The output prediction of an SVM is one of two conceivable classes which are already defined in the training data. In summary, the SVM algorithm builds an N-dimensional hyperplane model that assigns future instances into one of the two possible output classes.

Training data sets in which the classes form clearly separated groups (a substantial grouping periphery) work well with SVM. Generalization error, in terms of SVM, is the measure of how accurately and precisely the SVM model can predict values for previously unseen (new) data. A hard margin, in terms of SVM, means that the SVM model is inflexible in classification and tries to fit the training set exceptionally well, thereby causing overfitting.

Support Vectors: Support vectors are the data points (representing classes), the critical components of a data set, which lie nearest to the identified hyperplane. If the support vectors are removed, the position of the dividing hyperplane will change.

Hyperplane and Margin: For an N-dimensional feature space, hyperplane is a flat subspace of
dimension (N−1) that separates and classifies a set of data. For example, if we consider a two-dimensional
feature space (which is nothing but a data set having two features and a class variable), a hyperplane will be
a one-dimensional subspace or a straight line. In the same way, for a three-dimensional feature space (data
set having three features and a class variable),hyperplane is a two-dimensional subspace or a simple plane.
However, quite understandably, it is difficult to visualize a feature space greater than three dimensions,
much like for a subspace or hyperplane having more than three dimensions.

1.2 Identifying the correct hyperplane in SVM:

As we have already discussed, there may be multiple options for hyperplanes dividing the data instances
belonging to the different classes. We need to identify which one will result in the best classification. Let us
examine a few scenarios before arriving to that conclusion. For the sake of simplicity of visualization, the
hyperplanes have been shown as straight lines in most of the diagrams.

Scenario 1

As depicted in Figure 7.16, in this scenario, we have three hyperplanes: A, B, and C. Now, we need to
identify the correct hyperplane which better segregates the two classes represented by the triangles and
circles. As we can see, hyperplane ‘A’ has performed this task quite well.

Scenario 2

As depicted in Figure 7.17, we have three hyperplanes: A, B, and C. We have to identify the correct
hyperplane which classifies the triangles and circles in the best possible way. Here, maximizing the distances
between the nearest data points of both the classes and hyperplane will help us decide the correct
hyperplane. This distance is called as margin.

In Figure 7.17b, you can see that the margin for hyperplane A is high as compared to those for both B
and C. Hence, hyperplane A is the correct hyperplane. Another quick reason for selecting the hyperplane
with higher margin (distance) is robustness. If we select a hyperplane having a lower margin (distance), then
there is a high probability of misclassification.

Scenario 3

Use the rules as discussed in the previous section to identify the correct hyperplane in the scenario
shown in Figure 7.18. Some of you might have selected hyperplane B as it has a higher margin (distance
from the class) than A. But, here is the catch; SVM selects the hyperplane which classifies the classes
accurately before maximizing the margin. Here, hyperplane B has a classification error, and A has classified
all data instances correctly. Therefore, A is the correct hyperplane.

Scenario 4

In this scenario, as shown in Figure 7.19a, it is not possible to distinctly segregate the two classes by
using a straight line, as one data instance belonging to one of the classes (triangle) lies in the territory of the
other class (circle) as an outlier. One triangle at the other end is like an outlier for the triangle class. SVM
has a feature to ignore outliers and find the hyperplane that has the maximum margin (hyperplane A, as
shown in Fig. 7.19b). Hence, we can say that SVM is robust to outliers.

So, by summarizing the observations from the different scenarios, we can say that

1. The hyperplane should segregate the data instances belonging to the two classes in the best possible
way.

2. It should maximize the distances between the nearest data points of both the classes, i.e. maximize the
margin.

3. If there is a need to prioritize between higher margin and lesser misclassification, the hyperplane
should try to reduce misclassifications.

Support vectors, as can be observed in Figure 7.20, are data instances from the two classes which are closest to the MMH (maximum margin hyperplane). Quite understandably, there should be at least one support vector from each class. The
identification of support vectors requires intense mathematical formulation, which is out of scope of this
book. However, it is fairly intuitive to understand that modelling a problem using SVM is nothing but
identifying the support vectors and MMH corresponding to the problem space.
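A minimal, hedged sketch of a soft-margin linear SVM classifier with Scikit-Learn; the Iris-based binary task, the scaling step and C = 1 are illustrative choices (C controls the margin/misclassification trade-off discussed above):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data[:, (2, 3)]                  # petal length, petal width
y = (iris.target == 2).astype(int)        # Iris-Virginica vs the rest (binary task)

# Smaller C -> wider margin but more margin violations (soft margin)
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, loss="hinge"))
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))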

Binary Classification:

MNIST:
The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.
The following code fetches the MNIST dataset:
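The fetching code is not reproduced above; a common way to do it (assuming the 'mnist_784' dataset name on OpenML) is sketched below:

from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML; X has 70,000 rows of 784 pixel features, y has the labels
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]
print(X.shape, y.shape)   # (70000, 784) (70000,)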

Datasets loaded by Scikit-Learn generally have a similar dictionary structure including:


• A DESCR key describing the dataset
• A data key containing an array with one row per instance and one column per feature
• A target key containing an array with the labels
Training a Binary Classifier:
Let's simplify the problem for now and only try to identify one digit, for example the number 5. This "5-detector" will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and not-5.
Let’s create the target vectors for this classification task:

Now let's pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. This classifier has the advantage of being capable of handling very large datasets efficiently. This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning), as we will see later.
Let’s create an SGDClassifier and train it on the whole training set:

Now you can use it to detect images of the number 5:
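The listings referenced above are not reproduced; the sketch below fills them in under the assumption that X and y come from the MNIST fetch shown earlier and that the first 60,000 images form the training set:

import numpy as np
from sklearn.linear_model import SGDClassifier

# Assumes X, y from the MNIST fetch above; conventional 60,000/10,000 split
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# Target vectors for the "5-detector": True for 5s, False for all other digits
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

some_digit = X_train[0]                   # one image (784 pixel values)
print(sgd_clf.predict([some_digit]))      # True if the classifier thinks it is a 5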

Classification Metrics:

Performance Measures:
Confusion matrix: The confusion matrix is used to have a more complete picture when assessing the
performance of a model. It is defined as follows:

Main metrics: The following metrics are commonly used to assess the performance of classification
models:
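The metric tables are not reproduced above; as a sketch, the usual quantities can be computed with Scikit-Learn as follows, using made-up true and predicted labels:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical true and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN), also the TPR
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall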

ROC: The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:

Regression metrics:
Basic metrics: Given a regression model f, the following metrics are commonly used to assess the
performance of the model:

Coefficient of determination: The coefficient of determination, often noted R^2 or r^2, provides a measure
of how well the observed outcomes are replicated by the model and is defined as follows:

Main metrics: The following metrics are commonly used to assess the performance of regression models,
by taking into account the number of variables n that they take into consideration:

Model selection:
Vocabulary: When selecting a model, we distinguish 3 different parts of the data that we have, as follows: the training set (on which the model is trained), the validation set (on which the model is assessed, e.g. for hyperparameter tuning), and the test set (on which the model makes its final predictions).

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These
are represented in the figure below:

Cross-validation: It also noted CV, is a method that is used to select a model that does not rely too much on
the initial training set. The different types are summed up in the table below:

The most commonly used method is called k-fold cross-validation and splits the training data into k folds to
validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error
is then averaged over the k folds and is named cross-validation error.
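A minimal sketch of k-fold cross-validation with Scikit-Learn; the model and k = 5 are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, validate on the remaining fold, 5 times
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores)           # one score per fold
print(scores.mean())    # cross-validation estimate of performance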

Regularization: The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:

Diagnostics:
Bias: The bias of a model is the difference between the expected prediction and the correct model that we try
to predict for given data points.
Variance: The variance of a model is the variability of the model prediction for given data points.
Bias/variance tradeoff: The simpler the model, the higher the bias, and the more complex the model, the
higher the variance.

Error analysis: Error analysis is analyzing the root cause of the difference in performance between the
current and the perfect models.
Ablative analysis: Ablative analysis is analyzing the root cause of the difference in performance between
the current and the baseline models.

Confusion Matrix Example:

Multiclass/Structured outputs:

Handling more than two classes: Certain concepts are fundamentally binary. For instance, the notion of a
coverage curve does not easily generalize to more than two classes.
We will now consider general issues related to having more than two classes in classification, scoring and
class probability estimation.
The discussion will address two issues:

o how to evaluate multi-class performance,
o how to build multi-class models out of binary models.
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial
classifiers) can distinguish between more than two classes.
Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling
multiple classes directly. Others (such as Support Vector Machine classifiers or Linear classifiers) are
strictly binary classifiers.
If we have k classes, performance of a classifier can be assessed using a k-by-k contingency table. Assessing
performance is easy if we are interested in the classifier’s accuracy, which is still the sum of the descending
diagonal of the contingency table, divided by the number of test instances.

Multi-label Classification:
Until now each instance has always been assigned to just one class. In some cases you may want your
classifier to output multiple classes for each instance. For example, consider a face-recognition classifier:
what should it do if it recognizes several people on the same picture? Of course it should attach one tag per
person it recognizes. Say the classifier has been trained to recognize three faces, Alice, Bob, and Charlie;
then when it is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning “Alice yes, Bob no,
Charlie yes”). Such a classification system that outputs multiple binary tags is called a multilabel
classification system.
Let's look at a simpler example, just for illustration purposes:

This code creates a y_multilabel array containing two target labels for each digit image: the first indicates
whether or not the digit is large (7, 8, or 9) and the second indicates whether or not it is odd. The next lines
create a KNeighborsClassifier instance (which supports multilabel classification, but not all classifiers do)
and we train it using the multiple targets array. Now you can make a prediction, and notice that it outputs
two labels:
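The listing described above is not reproduced; the sketch below reconstructs it under the assumption that X_train and y_train come from the MNIST section earlier:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, y_train from the MNIST section above (y_train holds digit labels)
y_train_digits = y_train.astype(np.uint8)
y_train_large = (y_train_digits >= 7)       # first label: is the digit large (7, 8 or 9)?
y_train_odd = (y_train_digits % 2 == 1)     # second label: is the digit odd?
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

some_digit = X_train[0]                     # e.g. an image of a 5
print(knn_clf.predict([some_digit]))        # e.g. [[False  True]] -> not large, odd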

And it gets it right! The digit 5 is indeed not large (False) and odd (True)

Multioutput Classification:

The last type of classification task we are going to discuss here is called multioutput-multiclass classification (or simply multioutput classification). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
To illustrate this, let’s build a system that removes noise from images. It will take as input a noisy digit
image, and it will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like
the MNIST images. Notice that the classifier’s output is multilabel (one label per pixel) and each label can
have multiple values (pixel intensity ranges from 0 to 255). It is thus an example of a multioutput
classification system.
Ranking:
A ranking is defined as a total order on a set of instances, possibly with ties. Ranking means positives are ranked above negatives (i.e., +ve > −ve).
Suppose x and x′ are two instances such that x receives a lower score. Since higher scores express a stronger belief that the instance in question is positive, this would be fine except in one case: if x is an actual positive and x′ is an actual negative. We will call this a ranking error (x is actually positive, but it is scored lower, i.e. as if it were negative). In short, a ranking error means the score of a negative instance is higher than the score of a positive instance.

Ranking is sorting documents by relevance in order to find content of interest with respect to a query. This is a fundamental problem of Information Retrieval, but the task also arises in many other

Applications:

1. Search Engines — Given a user profile (location, age, sex, …) a textual query, sort web pages results by
relevance.
2. Recommender Systems — Given a user profile and purchase history, sort the other items to find new
potentially interesting products for the user.
3. Travel Agencies — Given a user profile and filters (check-in/check-out dates, number and age of travelers,
…), sort available rooms by relevance.

Ranking models typically work by predicting a relevance score s = f(x) for each input x = (q, d) where q is
a query and d is a document. Once we have the relevance of each document, we can sort (i.e. rank) the documents
according to those scores.

Ranking models rely on a scoring function.

The scoring model can be implemented using various approaches.

o Vector Space Models: Compute a vector embedding (e.g. using Tf-Idf or BERT) for each query and document, and then compute the relevance score f(x) = f(q, d) as the cosine similarity between the vector embeddings of q and d.

o Learning to Rank: The scoring model is a Machine Learning model that learns to predict a score s given an input x = (q, d) during a training phase where some sort of ranking loss is minimized.

Ranking Evaluation Metrics

Before analyzing various ML models for Learning to Rank, we need to define which metrics are used to evaluate
ranking models. These metrics are computed on the predicted documents ranking, i.e. the k-th top retrieved
document is the k-th document with highest predicted score s.

Mean Average Precision (MAP)


Mean Average Precision is used for tasks with binary relevance, i.e. when the true score y of a document d can be
only 0 (non relevant) or 1 (relevant).

For a given query q and corresponding documents D = {d₁, …, dₙ}, we check how many of the top k retrieved documents are relevant (y = 1) or not (y = 0), in order to compute precision Pₖ and recall Rₖ. For k = 1…n we get different Pₖ and Rₖ values that define the precision-recall curve: the area under this curve is the Average Precision (AP).

Finally, by computing the average of AP values for a set of m queries, we obtain the Mean Average Precision
(MAP).
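As a hedged sketch, Average Precision per query and the mean over queries can be computed with Scikit-Learn's average_precision_score; the relevance labels and scores below are invented for illustration:

import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical binary relevance labels (y) and predicted scores (s) for two queries
queries = [
    (np.array([1, 0, 1, 1, 0]), np.array([0.9, 0.8, 0.7, 0.4, 0.1])),
    (np.array([0, 1, 0, 1, 1]), np.array([0.6, 0.5, 0.4, 0.3, 0.2])),
]

# Average Precision (area under the precision-recall curve) for each query
ap_values = [average_precision_score(y, s) for y, s in queries]

# Mean Average Precision over the set of queries
map_score = np.mean(ap_values)
print(ap_values, map_score)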
