ML Unit - 2
Unit II:
Distance metrics are a key part of several machine learning algorithms. These distance metrics are
used in both supervised and unsupervised learning, generally to calculate the similarity between
data points. An effective distance metric improves the performance of our machine learning model.
Let's say you need to create clusters using a clustering algorithm such as K-Means, or use the k-nearest neighbor (knn) algorithm, which relies on nearest neighbors to solve a classification or regression problem. How will you define the similarity between different observations? How can
we say that two points are similar to each other? This will happen if their features are similar,
right? When we plot these points, they will be closer to each other by distance.
Hence, we can calculate the distance between points and then define the similarity between them.
Here’s the million-dollar question – how do we calculate this distance, and what are the different
distance metrics in machine learning? Also, are these metrics different for different learning
problems?
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
Let’s start with the most commonly used distance metric – Euclidean Distance
Euclidean Distance
Euclidean Distance represents the shortest distance between two vectors. It is the square root of the sum of the squares of the differences between the corresponding elements of the two vectors.
The Euclidean distance metric corresponds to the L2-norm of the difference between the two vectors. (The cosine similarity, by contrast, is proportional to the dot product of two vectors and inversely proportional to the product of their magnitudes.)
Most machine learning algorithms, including K-Means, use this distance metric to measure the similarity between observations. Let's say we have two points A(x1, y1) and B(x2, y2), as shown below:
So, the Euclidean Distance between these two points, A and B, will be:
d(A, B) = √((x2 – x1)² + (y2 – y1)²)
We use this formula when we are dealing with 2 dimensions. We can generalize this for an n-dimensional space as:
d = √(Σ (pi – qi)²), where the sum runs over the n dimensions of points p and q.
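As a quick illustration of the 2-D formula above, here is a minimal NumPy sketch; the points (1, 2) and (4, 6) are made up for the example:

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two points given as arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # sqrt(3^2 + 4^2) = 5.0
```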
Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all the dimensions.
Since the above representation is 2-dimensional, to calculate the Manhattan Distance we will take the sum of absolute distances in both the x and y directions. So, the Manhattan distance in a 2-dimensional space is given as:
d(A, B) = |x2 – x1| + |y2 – y1|
And the generalized formula for an n-dimensional space is:
d = Σ |pi – qi|
Where,
n = number of dimensions
pi, qi = values of the two points in the i-th dimension
Minkowski Distance
Minkowski Distance is the generalized form of the Euclidean and Manhattan Distance metrics. For two n-dimensional points it is defined as:
d = (Σ |pi – qi|^p)^(1/p)
Here, p = 1 gives the Manhattan Distance and p = 2 gives the Euclidean Distance.
Hamming Distance
Hamming Distance measures the similarity between two strings of the same length. The Hamming Distance between two strings of the same length is the number of positions at which the corresponding characters are different.
Let's understand the concept using an example. Let's say we have two strings:
"euclidean" and "manhattan"
Since the length of these strings is equal, we can calculate the Hamming Distance. We will go character by character and match the strings. The first character of both the strings (e and m, respectively) is different. Similarly, the second character of both the strings (u and a) is different, and so on.
Look carefully – seven characters are different, whereas two characters (the last two characters) are similar:
Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between
two strings, the more dissimilar those strings will be (and vice versa).
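A small sketch of the remaining metrics, using plain Python/NumPy implementations rather than any particular library; the numeric points are made up, and the Hamming example reuses the two strings above:

```python
import numpy as np

def manhattan_distance(a, b):
    # Sum of absolute differences across all dimensions (L1 norm).
    return np.sum(np.abs(np.asarray(a, float) - np.asarray(b, float)))

def minkowski_distance(a, b, p):
    # Generalised metric: p=1 gives Manhattan, p=2 gives Euclidean.
    diff = np.abs(np.asarray(a, float) - np.asarray(b, float))
    return np.sum(diff ** p) ** (1.0 / p)

def hamming_distance(s1, s2):
    # Number of positions at which two equal-length strings differ.
    if len(s1) != len(s2):
        raise ValueError("Strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(manhattan_distance([1, 2], [4, 6]))           # 7.0
print(minkowski_distance([1, 2], [4, 6], p=3))      # ~4.50
print(hamming_distance("euclidean", "manhattan"))   # 7
```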
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors; here we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
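A minimal scikit-learn sketch of these steps; the toy coordinates and labels below are made up, and k = 5 matches the choice above:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]  # feature pairs
y_train = ["A", "A", "A", "B", "B", "B"]                    # category labels

# k = 5 neighbors, Euclidean distance as the similarity measure
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

print(knn.predict([[3, 4]]))  # predicted category of the new data point x1
```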
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K are good, but they may lead to some difficulties.
Advantages of the K-NN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of the K-NN Algorithm:
o It always needs the value of K to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.
Decision trees are a popular machine learning algorithm that can be used for both regression and
classification tasks. They are easy to understand, interpret, and implement, making them an ideal
choice for beginners in the field of machine learning. In this comprehensive guide, we will cover
all aspects of the decision tree algorithm, including the working principles, different types of
decision trees, the process of building decision trees, and how to evaluate and optimize decision
trees.
A decision tree is a hierarchical model used in decision support that depicts decisions and their
potential outcomes, incorporating chance events, resource expenses, and utility. This algorithmic
model utilizes conditional control statements and is non-parametric, supervised learning, useful
for both classification and regression tasks. The tree structure is comprised of a root node,
branches, internal nodes, and leaf nodes, forming a hierarchical, tree-like structure.
It is a tool that has applications spanning several different areas. Decision trees can be used for
classification as well as regression problems. The name itself suggests that it uses a flowchart like
a tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with decisions made at the leaf nodes.
Root Node: The initial node at the beginning of a decision tree, where the entire population or dataset starts dividing based on various features or conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification or decision.
Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent overfitting and simplify the model.
Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch or sub-tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a parent node, and the sub-nodes emerging from it are referred to as child nodes. The parent node represents a decision or condition, while the child nodes represent the potential outcomes or further decisions based on that condition.
Decision trees are upside down which means the root is at the top and then this root is split into
various several nodes. Decision trees are nothing but a bunch of if-else statements in layman terms.
It checks if the condition is true and if it is then it goes to the next node attached to that decision.
In the below diagram, the tree will first ask: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it will go to the next feature, which is humidity or wind. It will again check whether the wind is strong or weak; if it's a weak wind and it's rainy, then the person may go and play.
Did you notice anything in the above flowchart? We see that if the weather is cloudy then we must go and play. Why didn't the tree split this node any further?
To answer this question, we need to know about a few more concepts like entropy, information gain, and the Gini index. But in simple terms, I can say here that the output for the training dataset is always "yes" for cloudy weather; since there is no disorderliness here, we don't need to split the node further.
The goal of machine learning is to decrease uncertainty or disorder in the dataset, and for this we need a way to measure that disorder.
Now you must be thinking: how do I know what the root node should be? What should the decision nodes be? When should I stop splitting? To decide this, there is a metric called "Entropy".
Several assumptions are made to build effective models when creating decision trees. These
assumptions help guide the tree's construction and impact its performance. Here are some common assumptions and considerations made when creating decision trees:
Binary Splits
Decision trees typically make binary splits, meaning each node divides the data into two subsets
based on a single feature or condition. This assumes that each decision can be represented as a
binary choice.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node is divided into child nodes,
and this process continues until a stopping criterion is met. This assumes that the data can be effectively divided into smaller, more manageable subsets.
Feature Independence
Decision trees often assume that the features used for splitting nodes are independent. In practice,
feature independence may not hold, but decision trees can still perform well if features are
correlated.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node, meaning that the samples
within a node are as similar as possible regarding the target variable. This assumption helps in making clear-cut decisions at the leaf nodes.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach, where each split is chosen to maximize information gain or minimize impurity at the current node. This may not always result in the globally optimal tree.
Categorical and Numerical Features
Decision trees can handle both categorical and numerical features. However, they may require different preprocessing or splitting strategies for each type.
Overfitting
Decision trees are prone to overfitting when they capture noise in the data. Pruning and setting appropriate stopping criteria are used to address this issue.
Impurity Measures
Decision trees use impurity measures such as Gini impurity or entropy to evaluate how well a split separates classes. The choice of impurity measure can impact tree construction.
No Missing Values
Decision trees assume that there are no missing values in the dataset or that missing values have been handled appropriately beforehand.
Equal Importance of Features
Decision trees may assume equal importance for all features unless feature scaling or weighting is applied.
No Outliers
Decision trees are sensitive to outliers, and extreme values can influence their construction.
Sensitivity to Sample Size
Small datasets may lead to overfitting, and large datasets may result in overly complex trees. The sample size should be appropriate for the complexity of the problem.
Entropy
Entropy is nothing but the uncertainty in our dataset, or a measure of disorder. Let me try to explain this with the help of an example.
Suppose you have a group of friends who decides which movie they can watch together on Sunday.
There are 2 choices for movies, one is “Lucy” and the second is “Titanic” and now everyone has
to tell their choice. After everyone gives their answer we see that “Lucy” gets 4
votes and “Titanic” gets 5 votes. Which movie do we watch now? Isn't it hard to choose one movie now, because the votes for both the movies are somewhat equal?
This is exactly what we call disorder: there is an equal number of votes for both the movies,
and we can’t really decide which movie we should watch. It would have been much easier if the
votes for “Lucy” were 8 and for “Titanic” it was 2. Here we could easily say that the majority of
votes are for “Lucy” hence everyone will be watching this movie.
The formula for entropy is:
E(S) = – p(+) log2 p(+) – p(–) log2 p(–)
Here,
p(+) is the probability of the positive class, p(–) is the probability of the negative class, and S is the subset of training examples under consideration.
Now we know what entropy is and what its formula is. Next, we need to know how exactly it works inside a decision tree.
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells
how random our data is. A pure sub-split means that either you should be getting all "yes" or all "no" in the node.
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split, the left node gets 5 'yes' and 2 'no' while the right node gets the remaining 3 'yes' and 2 'no'.
We see here the split is not pure. Why? Because we can still see some negative classes in both the nodes. In order to make a decision tree, we need to calculate the impurity of each split, and when the purity is 100%, we make the node a leaf node.
To check the impurity of feature 2 and feature 3, we take the help of the Entropy formula and calculate the entropy of each resulting split.
We can clearly see from the tree itself that the left node has lower entropy (more purity) than the right node, since the left node has a greater number of "yes" and it is easy to decide there.
Always remember that the higher the Entropy, the lower will be the purity and the higher will be
the impurity.
As mentioned earlier the goal of machine learning is to decrease the uncertainty or impurity in the
dataset, here by using the entropy we are getting the impurity of a particular node, we don’t know
if the parent entropy or the entropy of a particular node has decreased or not.
For this, we bring in a new metric called "Information Gain", which tells us how much the parent entropy has decreased after splitting the node on some feature.
Information Gain
Information gain measures the reduction of uncertainty given some feature and it is also a deciding
factor for which attribute should be selected as a decision node or root node.
It is just the entropy of the full dataset minus the entropy of the dataset given some feature:
Information Gain = E(parent) – E(parent | feature)
where E(parent | feature) is the weighted average entropy of the child nodes obtained by splitting on that feature.
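A small sketch of these two formulas in plain Python; the 8 "yes" / 4 "no" node and its split mirror the earlier example, and the helper names are just illustrative:

```python
import math

def entropy(labels):
    # E(S) = -sum over classes of p * log2(p)
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_labels, splits):
    # IG = E(parent) - weighted average entropy of the child splits
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in splits)
    return entropy(parent_labels) - weighted

# The 8 "yes" / 4 "no" node from the earlier example, split into two children.
parent = ["yes"] * 8 + ["no"] * 4
left = ["yes"] * 5 + ["no"] * 2
right = ["yes"] * 3 + ["no"] * 2
print(round(entropy(parent), 3))
print(round(information_gain(parent, [left, right]), 3))
```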
To understand this better, let's consider an example. Suppose our entire population has a total of 30 instances. The dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is "Energy", which takes two values, "high" and "low".
Feature 2 is "Motivation", which takes 3 values: "No motivation", "Neutral" and "Highly motivated".
Let’s see how our decision tree will be made using these 2 features. We’ll use information gain to
decide which feature should be the root node and which feature should be placed after the split.
Now we have the value of E(Parent) and E(Parent|Energy); the information gain will be:
Information Gain = E(Parent) – E(Parent|Energy) ≈ 0.37
Our parent entropy was near 0.99 and after looking at this value of information gain, we can say
that the entropy of the dataset will decrease by 0.37 if we make “Energy” as our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its information gain.
Now we have the value of E(Parent) and E(Parent|Motivation); the information gain for "Motivation" is computed in the same way.
We now see that the “Energy” feature gives more reduction which is 0.37 than the “Motivation”
feature. Hence we will select the feature which has the highest information gain and then split the node on that feature.
In this example, "Energy" will be our root node and we'll do the same for the sub-nodes. Here we can see that when the energy is "high" the entropy is low, and hence we can say a person will definitely go to the gym if he has high energy. But what if the energy is low? We will again split the node using the remaining feature, "Motivation".
You must be asking this question to yourself that when do we stop growing our tree? Usually, real-
world datasets have a large number of features, which will result in a large number of splits, which
in turn gives a huge tree. Such trees take time to build and can lead to overfitting. That means the
tree will give very good accuracy on the training dataset but will give bad accuracy in test data.
There are many ways to tackle this problem through hyperparameter tuning. We can set the maximum depth of our decision tree using the max_depth parameter. The greater the value of max_depth, the more complex your tree will be. The training error will of course decrease if we increase the max_depth value, but when our test data comes into the picture, we will get a very bad accuracy. Hence you need a value that will neither overfit nor underfit the data, and for this we can tune max_depth, for example with cross-validation.
Another way is to set the minimum number of samples for each split. It is denoted by min_samples_split.
For example, we can use a minimum of 10 samples to allow a split. That means if a node has fewer than 10 samples then, using this parameter, we can stop the further splitting of this node and make it a leaf node.
min_samples_leaf – represents the minimum number of samples required to be in the leaf node.
The more you increase the number, the more is the possibility of overfitting.
max_features – it helps us decide what number of features to consider when looking for the best
split.
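A minimal scikit-learn sketch tying these hyperparameters together; the dataset and the specific values chosen are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    criterion="entropy",      # impurity measure (could also be "gini")
    max_depth=3,              # limits tree depth to control overfitting
    min_samples_split=10,     # a node needs at least 10 samples to split
    min_samples_leaf=5,       # every leaf must keep at least 5 samples
    max_features=None,        # number of features considered per split
    random_state=42,
)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```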
Pruning
Pruning is another method that can help us avoid overfitting. It helps in improving the performance
of the tree by cutting the nodes or sub-nodes which are not significant. Additionally, it removes the branches which have very low importance.
Pre-pruning – we can stop growing the tree earlier, which means we can prune/remove/cut a node if it has low importance while the tree is being grown.
Post-pruning – once our tree is built to its full depth, we can start pruning the nodes based on their
significance.
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described
as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), we conclude that on a sunny day the player can play the game.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
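A minimal scikit-learn sketch of the Gaussian variant described above; the iris dataset here is just an illustrative stand-in for continuous features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()   # assumes each feature follows a normal distribution per class
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```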
Linear regression is a type of statistical analysis used to predict the relationship between two
variables. It assumes a linear relationship between the independent variable and the dependent
variable, and aims to find the best-fitting line that describes the relationship. The line is determined
by minimizing the sum of the squared differences between the predicted values and the actual
values.
Linear regression is commonly used in many fields, including economics, finance, and social
sciences, to analyze and predict trends in data. It can also be extended to multiple linear regression,
where there are multiple independent variables, and logistic regression, which is used for binary
classification problems.
In a simple linear regression, there is one independent variable and one dependent variable. The
model estimates the slope and intercept of the line of best fit, which represents the relationship
between the variables. The slope represents the change in the dependent variable for each unit
change in the independent variable, while the intercept represents the predicted value of the dependent variable when the independent variable is zero.
Linear regression is one of the simplest statistical regression methods used for predictive analysis in machine learning. Linear regression shows the linear relationship between the independent (predictor) variable, i.e. the X-axis, and the dependent (output) variable, i.e. the Y-axis. If there is a single input variable X (independent variable), such linear regression is called simple linear regression.
The graph above presents the linear relationship between the output (y) and predictor (X) variables. The blue line is referred to as the best-fit straight line. Based on the given data points, we attempt to plot a line that fits the points the best.
To calculate the best-fit line, linear regression uses the traditional slope-intercept form, which is given below:
Yi = β0 + β1Xi
where Yi = dependent variable, β0 = intercept, β1 = slope coefficient and Xi = independent variable.
This algorithm explains the linear relationship between the dependent (output) variable y and the independent (predictor) variable X using a straight line.
But how does linear regression find out which is the best-fit line?
The goal of the linear regression algorithm is to get the best values for B0 and B1 to find the best fit line. The best fit line is the line that has the least error, which means the error between the predicted values and the actual values should be minimized.
Random Error(Residuals)
In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ypredicted) is called the residual:
εi = ypredicted – yi
where ypredicted = B0 + B1 Xi
In simple terms, the best fit line is a line that fits the given scatter plot in the best way.
Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares(RSS).
The cost function helps to work out the optimal values for B0 and B1, which provide the best fit line for the data points.
In Linear Regression, the Mean Squared Error (MSE) cost function is generally used; it is the average of the squared errors between ypredicted and yi:
MSE = (1/n) Σ (yi – (B0 + B1 Xi))²
Using the MSE function, we'll update the values of B0 and B1 such that the MSE value settles at the minima. These parameters can be determined using the gradient descent method such that the cost function reaches its minimum value.
Gradient Descent is one of the optimization algorithms that optimizes the cost function (objective function) to reach the optimal minimal solution. To find the optimal solution, we need to reduce the cost function (MSE) for all data points. This is done by updating the values of B0 and B1 iteratively until we reach the best fit line.
A regression model uses the gradient descent algorithm to update the coefficients of the line by starting from random coefficient values and then iteratively updating them so as to reduce the cost function.
Let’s take an example to understand this. Imagine a U-shaped pit. And you are standing at the
uppermost point in the pit, and your motive is to reach the bottom of the pit. Suppose there is a
treasure at the bottom of the pit, and you can only take a discrete number of steps to reach the
bottom. If you opted to take one step at a time, you would get to the bottom of the pit in the end, but this would take a longer time. If you decide to take larger steps each time, you may reach the bottom sooner, but there's a probability that you could overshoot the bottom of the pit and not even land near the bottom. In the gradient descent algorithm, the size of the steps you're taking can be
considered as the learning rate, and this decides how fast the algorithm converges to the minima.
To update B0 and B1, we take gradients from the cost function. To find these gradients, we take partial derivatives of the cost function with respect to B0 and B1.
We need to minimize the cost function J. One of the ways to achieve this is to apply the batch gradient descent algorithm. In batch gradient descent, the values are updated in each iteration using all the training samples.
The partial derivatives are the gradients, and they are used to update the values of B0 and B1. Alpha is the learning rate.
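A rough sketch of batch gradient descent for simple linear regression, following the update rule described above; the toy data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # roughly y = 2x

b0, b1 = 0.0, 0.0        # intercept and slope start at zero
alpha = 0.01             # learning rate
n = len(X)

for _ in range(5000):
    y_pred = b0 + b1 * X
    error = y_pred - y
    # Gradients of the MSE cost with respect to B0 and B1
    grad_b0 = (2.0 / n) * np.sum(error)
    grad_b1 = (2.0 / n) * np.sum(error * X)
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(round(b0, 3), round(b1, 3))  # should approach the least-squares fit
```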
The strength of any linear regression model can be assessed using various evaluation metrics.
These evaluation metrics usually provide a measure of how well the observed outputs are being generated by the model. The most commonly used metrics are:
1. Coefficient of Determination or R-Squared (R2)
2. Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)
R-Squared is a number that explains the amount of variation that is explained/captured by the developed model. It always ranges between 0 and 1. Overall, the higher the value of R-squared, the better the model fits the data.
R2 = 1 – (RSS/TSS)
Residual sum of Squares (RSS) is defined as the sum of squares of the residual for each data
point in the plot/data. It is the measure of the difference between the expected and the actual
observed output.
Total Sum of Squares (TSS) is defined as the sum of squared errors of the data points from the mean of the response variable.
The Root Mean Squared Error is the square root of the variance of the residuals. It specifies the absolute fit of the model to the data, i.e. how close the observed data points are to the predicted values.
To make this estimate unbiased, one has to divide the sum of the squared residuals by the degrees of freedom rather than the total number of data points in the model. This term is then called the Residual Standard Error (RSE).
R-squared is a better measure than RMSE, because the value of Root Mean Squared Error depends on the units of the variables (i.e. it is not a normalized measure); it can change when the units of the variables change.
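A small NumPy sketch computing RSS, TSS, R-squared and RMSE for illustrative observed and predicted values (the numbers are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

rss = np.sum((y_true - y_pred) ** 2)             # Residual Sum of Squares
tss = np.sum((y_true - y_true.mean()) ** 2)      # Total Sum of Squares
r_squared = 1 - rss / tss
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # Root Mean Squared Error

print(r_squared, rmse)
```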
Regression is a parametric approach, which means that it makes assumptions about the data for
the purpose of analysis. For successful regression analysis, it’s essential to validate the following
assumptions.
1. Linear relationship between X and Y: There should be a linear relationship between the independent and dependent variables.
2. Independence of residuals: The error terms should not be dependent on one another (as in time-series data, where the next value is dependent on the previous one). There should be no correlation between the residual terms; the presence of such correlation is known as Autocorrelation.
3. Normal distribution of residuals: The error terms should be normally distributed, following a distribution with a mean equal to zero or close to zero. This is done in order to check whether the selected line is actually the line of best fit or not. If the error terms are non-normally distributed, it suggests that there are a few unusual data points that must be studied closely to make a better model.
4. The equal variance of residuals: The error terms must have constant variance. This phenomenon is known as homoscedasticity, and the presence of non-constant variance is referred to as heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
Once you have fitted a straight line on the data, you need to ask, “Is this straight line a significant fit for the data?” or “Does the beta coefficient explain the variance in the data plotted?” And here comes the idea of hypothesis testing on the beta coefficient. The null and alternate hypotheses are:
H0: B1 = 0
HA: B1 ≠ 0
To test this hypothesis we use a t-test; the test statistic for the beta coefficient is given by
t = B1 / SE(B1)
where SE(B1) is the standard error of the estimated coefficient.
1. t statistic: It is used to determine the p-value and hence helps in determining whether the coefficient is statistically significant or not.
2. F statistic: It is used to assess whether the overall model fit is significant or not. Generally, the higher the value of the F-statistic, the more significant the model turns out to be.
Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables.
The formulation for multiple linear regression is also similar to simple linear regression, with the small change that instead of having one beta variable, you will now have a beta for each of the variables used:
Y = β0 + β1X1 + β2X2 + … + βnXn
All the four assumptions made for Simple Linear Regression still hold true for Multiple Linear Regression. Apart from these, there are a few other considerations:
1. Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set. This phenomenon is known as overfitting of a model. This usually leads to high training accuracy and very low test accuracy.
2. Multicollinearity: Multicollinearity is the phenomenon where some of the independent variables in the model are correlated with (interdependent on) each other.
3. Feature Selection: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task.
Multicollinearity
As multicollinearity makes it difficult to find out which variable is actually contributing towards the prediction of the response variable, it may lead one to conclude incorrectly about the effects of a variable on the target variable. Though it does not affect the precision of the predictions, it is essential to properly detect and deal with the multicollinearity present in the model, as random removal of any of these correlated variables from the model causes the coefficient values to swing wildly and even change signs.
Multicollinearity can be detected using the following methods:
1. Pairwise Correlations: Checking the pairwise correlations between different pairs of independent variables can throw useful insight into detecting multicollinearity.
2. Variance Inflation Factor (VIF): Pairwise correlations may not always be useful as it is
possible that just one variable might not be able to completely explain some other variable
but some of the variables combined could be ready to do this. Thus, to check these sorts of
relations between variables, one can use VIF. VIF basically explains the relationship of
one independent variable with all the other independent variables. VIF is given by:
VIF_i = 1 / (1 – R_i²)
where i refers to the i-th variable, which is regressed as a linear combination of the rest of the independent variables, and R_i² is the R-squared of that regression.
The common heuristic followed for the VIF values is: if VIF > 10, the value is definitely high and the variable should be dropped; if VIF is around 5, it may be acceptable but should be inspected first; and if VIF is below 5, the variable is generally fine to keep.
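A hedged sketch of computing VIF with statsmodels; the synthetic DataFrame and the deliberately correlated column x3 are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
df["x3"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=100)  # strongly correlated with x1

X = sm.add_constant(df)  # add an intercept column to the design matrix
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x3 should show clearly inflated values
```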
There have always been situations where a model performs well on training data but not on the
test data. While training models on a dataset, overfitting, and underfitting are the most common
Before understanding overfitting and underfitting one must know about bias and variance.
Bias:
Bias is a measure to determine how accurate the model is likely to be on future unseen data. Complex models, assuming there is enough training data available, can make predictions accurately, whereas models that are too naive are very likely to perform badly in their predictions.
Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand, but in general less flexible, implying lower predictive performance on complex problems.
Variance:
Variance is the sensitivity of the model towards the training data, that is, it quantifies how much the model changes when it is trained on a different dataset.
Ideally, the model shouldn't change too much from one training dataset to the next, which will mean that the algorithm is good at picking out the hidden underlying patterns between the inputs and the output variable.
Ideally, a model should have lower variance, which means that the model doesn't change drastically after changing the training data (it is generalizable). Having higher variance will make a model change drastically even with small changes in the training data.
The aim of any supervised machine learning algorithm is to achieve low bias and low variance, as this leads to good prediction performance.
There is no escape from the relationship between bias and variance in machine learning.
There is a trade-off that plays out between these two concepts, and the algorithms must find a balance between them.
As a matter of fact, one cannot calculate the real bias and variance error terms, because we do not know the actual underlying target function.
Overfitting
When a model learns each and every pattern and noise in the data to such an extent that it affects the performance of the model on unseen future data, it is referred to as overfitting. The model fits the data so well that it interprets noise as patterns in the data.
When a model has low bias and higher variance, it ends up memorizing the data and causing overfitting. Overfitting causes the model to become specific rather than generic. This usually leads to high training accuracy and very low test accuracy.
Detecting overfitting is useful, but it doesn't solve the actual problem. There are several ways to prevent overfitting, stated below:
Cross-validation
If the training data is too small to train on, add more relevant and clean data.
If the training data is too large, do some feature selection and remove unnecessary features.
Regularization
Regularization
Underfitting:
Underfitting is not discussed as often as overfitting. When the model fails to learn from the training dataset and is also not able to generalize on the test dataset, it is referred to as underfitting. This type of problem can be detected very easily by the performance metrics.
When a model has high bias and low variance, it ends up not generalizing the data and causing underfitting. It is unable to find the hidden underlying patterns in the data. This usually leads to low training accuracy and very low test accuracy. The ways to prevent underfitting are stated below:
Increase the model complexity
Increase the number of features by performing feature engineering
Remove noise from the data
Increase the duration of training
In polynomial regression, we describe the relationship between the independent variable x and the dependent variable y as an nth-degree polynomial in x. The conditional expectation of y given x, E(y | x), characterizes fitting a nonlinear relationship between the x value and the conditional mean of y. Typically, this corresponds to the least-squares method. The least-squares approach minimizes the coefficient variance according to the Gauss-Markov Theorem. This represents a type of Linear Regression where the dependent and independent variables exhibit a curvilinear relationship and a polynomial equation is fitted to the data.
A quadratic equation is a general term for a second-degree polynomial equation. This degree, on the other hand, can go up to nth values. Here is the categorization of Polynomial Regression:
1. Linear – if the degree is 1
2. Quadratic – if the degree is 2
3. Cubic – if the degree is 3, and so on for higher degrees.
We cannot simply apply polynomial regression to every dataset and expect a better judgment. We can still do it, but there should be specific constraints on the dataset in order to get the best results:
A dependent variable's behaviour can be described by a linear, curvilinear, or additive link between the dependent variable and a set of k independent variables.
We employ datasets featuring independently distributed errors with a normal distribution, having zero mean and constant variance.
Here we are dealing with mathematics; rather than going deep, just understand the basic structure. We all know the equation of a linear model is a straight line. If we have many features, we opt for multiple regression, simply adding more feature terms. Then what about polynomial regression? It is not about increasing the number of features, but about changing the structure of the equation to a quadratic (or higher-degree) one, so you can fit a curve rather than a straight line.
Rather than focusing on the distinctions between linear and polynomial regression, we may
comprehend the importance of polynomial regression by starting with linear regression. We build
our model and realize that it performs abysmally. We examine the difference between the actual
value and the best fit line we predicted, and it appears that the true value has a curve on the graph,
but our line is nowhere near cutting the mean of the points. This is where polynomial regression
comes into play; it predicts the best-fit line that matches the pattern of the data (curve).
One important distinction between Linear and Polynomial Regression is that Polynomial
Regression does not require a linear relationship between the independent and dependent variables
in the data set. When the Linear Regression Model fails to capture the points in the data and the
Linear Regression fails to adequately represent the optimum, then we use Polynomial Regression.
Before delving into the topic, let us first understand why we prefer Polynomial Regression over
Linear Regression in some situations, say the non-linear condition of the dataset, by programming
and visualization.
We need to enhance the model's complexity to overcome under-fitting. In this sense, we need to make our linear model fit a polynomial by adding higher-degree terms of the feature (for example x²) as new features.
Because the weights associated with the features are still linear, this is still a linear model; x² (x squared) is only a transformed feature. However, the curve we're trying to fit is quadratic in nature.
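A minimal scikit-learn sketch of this idea: PolynomialFeatures adds the x² term while LinearRegression keeps the model linear in the weights; the toy data is simulated for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 30).reshape(-1, 1)
noise = np.random.default_rng(0).normal(scale=0.3, size=30)
y = 0.5 * X.ravel() ** 2 + X.ravel() + noise   # quadratic relationship plus noise

# degree=2 adds the x^2 feature; the regression itself stays linear in the weights
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```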
Overfitting vs Under-fitting
If we keep increasing the degree, we will see better and better results on the training data, but eventually the over-fitting problem appears: pushing the degree parameter well past its optimal value leads to over-fitting, with the r2 value on the training data approaching 100 per cent. The conclusion is that we must avoid both overfitting and underfitting issues.
Note: To avoid over-fitting, we can increase the number of training samples so that the algorithm does not learn the system's noise and becomes more generalized.
How do we pick the best model? To address this question, we must first comprehend the trade-off between bias and variance.
The error due to the model's overly simple assumptions in fitting the data is referred to as bias. A high bias indicates that the model is unable to capture the patterns in the data, resulting in under-fitting.
The error caused by a complicated model trying to match the data too closely is referred to as variance. When a model has high variance, it passes through the majority of the data points, causing it to overfit the data.
When the degree is 1, which means plain linear regression, the model shows underfitting, which means high bias and low variance. And when we get an r2 value of 100 on the training data, it means low bias and very high variance, i.e. overfitting.
As the model complexity grows, the bias reduces while the variance increases, and vice versa. A machine learning model should, in theory, have minimal variance and bias. However, having both is nearly impossible. As a result, a trade-off must be made in order to build a strong model that balances the two.
We need to find the right degree-of-polynomial parameter in order to avoid overfitting and underfitting problems:
Forward selection: increase the degree parameter until you get the optimal result.
Backward selection: start from a high degree and decrease the degree parameter until you get the optimal result.
The Cost Function is a function that evaluates a Machine Learning model’s performance for a
given set of data. The Cost Function is a single real number that calculates the difference between
anticipated and expected values. Many people don't know the difference between the Cost Function and the Loss Function. To put it another way, the Cost Function is the average of the n-sample errors in the data, whereas the Loss Function is the error for an individual data point. In other words, the Loss Function refers to a single training example, whereas the Cost Function refers to the entire training set.
The Mean Squared Error may also be used as the Cost Function of polynomial regression: it is the average of the squared differences between the predicted and actual values.
We now know that the Cost Function’s optimum value is 0 or a close approximation to 0. To get
an optimal Cost Function, we may use Gradient Descent, which changes the weight and, as a result,
reduces mistakes.
Gradient Descent is an optimization algorithm that iteratively adjusts the model's parameters in order to minimize a cost function (cost). It may decrease the Cost function (minimizing the MSE value) and thereby achieve the best fit line.
The values of slope (m) and slope-intercept (b) will be set to 0 at the start of the function, and the
learning rate (α) will be introduced. The learning rate (α) is set to an extremely low number,
perhaps between 0.01 and 0.0001. The learning rate is a tuning parameter in an optimization
algorithm that sets the step size at each iteration as it moves toward the cost function’s minimum.
The partial derivative of the cost function equation is then determined in terms of m, as well as in terms of b.
With the aid of the resulting update equations, m and b are updated once the derivatives are determined.
Gradient indicates the steepest climb of the loss function, but the steepest fall is the inverse of the
gradient, which is why the gradient is subtracted from the weights (m and b). The process of
updating the values of m and b continues until the cost function achieves or approaches the ideal
value of 0. The current values of m and b will be the best fit line’s optimal value.
Polynomial regression is used to model results obtained from various experimental techniques in which the independent and dependent variables have a clearly nonlinear relationship.
The best approximation of the connection between the dependent and independent variables is a polynomial. It can accommodate a wide range of functions, since a polynomial is a type of curve that can fit a wide range of curvature.
One or two outliers in the data might have a significant impact on the outcomes of the nonlinear analysis; these models are overly reliant on outliers. Furthermore, there are fewer model validation methods for detecting outliers in nonlinear regression than there are for linear regression.
Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. Although it is used for classification, it is named logistic regression; it is referred to as regression because it takes the output of a linear regression function as input and uses a sigmoid function to estimate the probability for the given class. The difference between linear regression and logistic regression is that the output of linear regression is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class or not.
It is used for predicting the categorical dependent variable using a given set of independent
variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as
0 and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except for how each is used.
Linear Regression is used for solving Regression problems, whereas Logistic regression is
used for solving the classification problems.
In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within the range of 0 and 1.
The value of the logistic regression must be between 0 and 1, and it cannot go beyond this limit, so it forms a curve like the "S" form.
The S-form curve is called the Sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: values above the threshold tend towards 1, and values below the threshold tend towards 0.
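A small sketch of the sigmoid and the threshold idea; the linear score z = b0 + b1·x and the coefficient values below are made up:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.5                  # illustrative coefficients
x = np.array([1.0, 2.0, 3.0, 4.0])
prob = sigmoid(b0 + b1 * x)         # probabilities between 0 and 1
label = (prob >= 0.5).astype(int)   # threshold at 0.5 -> class 0 or 1
print(prob.round(3), label)
```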
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “low”, “Medium”, or “High”.
Learning GLM lets you understand how we can use probability distributions as building blocks for
modeling. I assume you are familiar with linear regression and normal distribution.
Linear regression is used to predict the value of continuous variable y by the linear combination of
explanatory variables X.
Linear regression:
yi = b0 + b1xi + εi, with εi ~ N(0, σ²)
Here, i indicates the index of each sample. Notice this model assumes a normal distribution for the noise term (and hence for y given x).
Poisson regression
So is linear regression all you need to know? Definitely not. If you'd like to apply statistical modeling to various kinds of real-world data, linear regression alone is often not appropriate.
For example, assume you need to predict the number of defective products (Y) from a sensor value (x) recorded on a production line.
There are several problems if you try to apply linear regression to this kind of data:
1. The relationship between X and Y does not look linear. It's more likely to be exponential.
2. The variance of Y does not look constant with regard to X. Here, the variance of Y seems to increase as X increases.
3. As Y represents a number of products, it always has to be a non-negative integer. In other words, Y is a discrete variable. However, the normal distribution used for linear regression assumes continuous variables. This also means the prediction by linear regression can be negative. It's unnatural to predict negative counts.
Here, the more proper model you can think of is the Poisson regression model. Poisson regression is an example of a generalized linear model (GLM). There are three components in a GLM:
1. Linear predictor
2. Link function
3. Probability distribution
In the case of Poisson regression, it's formulated like this:
Linear predictor: z = b0 + b1x
Link function: log(λ) = z
Probability distribution: y ~ Poisson(λ)
Linear predictor is just a linear combination of parameter (b) and explanatory variable (x).
Link function literally “links” the linear predictor and the parameter for probability distribution.
In the case of Poisson regression, the typical link function is the log link function. This is because
the parameter for Poisson regression must be positive (explained later).
The last component is the probability distribution which generates the observed variable y. As we use the Poisson distribution here, the model is called Poisson regression.
The Poisson distribution is used to model count data. It has only one parameter, λ, which is both the mean and the variance of the distribution. This means the larger the mean, the larger the variance (and the standard deviation). See below.
Now, let’s apply Poisson regression to our data. The result should look like this.
The magenta curve is the prediction by Poisson regression. I added the bar plot of the probability
mass function of Poisson distribution to make the difference from linear regression clear.
The prediction curve is exponential as the inverse of the log link function is an exponential function.
From this, it is also clear that the parameter for Poisson regression calculated from the linear predictor is guaranteed to be positive.
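A hedged statsmodels sketch of Poisson regression with the log link; the sensor-style data below is simulated, not the data from the example above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(lam=np.exp(0.5 + 1.2 * x))       # counts generated through a log link

X = sm.add_constant(x)                            # linear predictor: b0 + b1*x
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # default link is log
print(result.params)                              # estimates should be near (0.5, 1.2)
```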
Linear regression is also an example of a GLM. It just uses the identity link function (the linear predictor and the parameter for the probability distribution are identical) and the normal distribution as the probability distribution:
Linear regression: yi ~ N(μi, σ²), with μi = b0 + b1xi
If you use logit function as the link function and binomial / Bernoulli distribution as the
probability distribution, the model is called logistic regression.
Logistic regression:
z = b0 + b1x, logit(p) = z, y ~ Bernoulli(p)
If you represent the linear predictor with z, the above equation is equivalent to the following:
Logistic function: p = 1 / (1 + exp(–z))
The right-hand side of the second equation is called the logistic function; therefore, this model is called logistic regression.
As the logistic function returns values between 0 and 1 for arbitrary inputs, it maps the linear predictor to a valid probability, which makes the logit a proper link function for the Bernoulli / binomial distribution.
Logistic regression is used mostly for binary classification problems. Below is an example of fitting a logistic regression model to binary data.
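A minimal statsmodels sketch, with simulated binary data, assuming the GLM formulation above (logit link, binomial family):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
p = 1.0 / (1.0 + np.exp(-(0.3 + 2.0 * x)))   # logistic function of the linear predictor
y = rng.binomial(1, p)                        # Bernoulli outcomes

X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(result.params)                          # estimates should be near (0.3, 2.0)
```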
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a model that can
accurately identify whether it is a cat or dog, so such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points of
different classes in a feature space. In the case of linear classifications, it will be a linear
equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, and they play a critical role in deciding the hyperplane and the margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The wider
margin indicates better classification performance.
4. Kernel: A kernel is a mathematical function used in SVM to map the original input data points into high-dimensional feature spaces, so that the hyperplane can be found easily even if the data points are not linearly separable in the original input space. Some of the common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a
soft margin technique. Each data point has a slack variable introduced by the soft-margin
SVM formulation, which softens the strict margin requirement and permits certain
misclassifications or violations. It discovers a compromise between increasing the margin
and reducing violations.
7. C: The regularisation parameter C in SVM balances margin maximisation against the penalty for misclassifications. It decides the penalty for going over the margin or misclassifying data items. A stricter penalty is imposed with a greater value of C, which results in a smaller margin and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently formed by
combining it with the regularisation term.
9. Dual Problem: A dual Problem of the optimisation problem that requires locating the
Lagrange multipliers related to the support vectors can be used to solve SVM. The dual
formulation enables the use of kernel tricks and more effective computing.
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:
Since it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both the
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated
as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back into 2-d space with z = 1, the boundary becomes a circle of radius 1, as shown in the image.
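A minimal scikit-learn sketch comparing a linear and an RBF-kernel SVM; the "moons" dataset is just an illustrative non-linearly separable example:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```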
There are perhaps four main types of classification tasks that you may encounter; they are:
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Typically, binary classification tasks involve one class that is the normal state and another class
that is the abnormal state.
For example “not spam” is the normal state and “spam” is the abnormal state. Another example is
“cancer not detected” is the normal state of a task that involves a medical test and “cancer
detected” is the abnormal state.
The class for the normal state is assigned the class label 0 and the class with the abnormal state is
assigned the class label 1.
It is common to model a binary classification task with a model that predicts a Bernoulli probability
distribution for each example.
The Bernoulli distribution is a discrete probability distribution that covers a case where an event
will have a binary outcome as either a 0 or 1. For classification, this means that the model predicts
a probability of an example belonging to class 1, or the abnormal state.
Popular algorithms that can be used for binary classification include:
Logistic Regression
k-Nearest Neighbors
Decision Trees
Support Vector Machine
Naive Bayes
Some algorithms are specifically designed for binary classification and do not natively support
more than two classes; examples include Logistic Regression and Support Vector Machines.
Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two class labels.
Examples include:
Face classification.
Plant species classification.
Optical character recognition.
Unlike binary classification, multi-class classification does not have the notion of normal and
abnormal outcomes. Instead, examples are classified as belonging to one among a range of known
classes.
The number of class labels may be very large on some problems. For example, a model may predict
a photo as belonging to one among thousands or tens of thousands of faces in a face recognition
system.
Problems that involve predicting a sequence of words, such as text translation models, may also
be considered a special type of multi-class classification. Each word in the sequence of words to
be predicted involves a multi-class classification where the size of the vocabulary defines the
number of possible classes that may be predicted and could be tens or hundreds of thousands of
words in size.
It is common to model a multi-class classification task with a model that predicts a Multinoulli
probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers a case where an event
will have a categorical outcome, e.g. K in {1, 2, 3, …, K}. For classification, this means that the
model predicts the probability of an example belonging to each class label.
Many algorithms used for binary classification can also be used for multi-class classification. Popular algorithms that can be used for multi-class classification include:
k-Nearest Neighbors.
Decision Trees.
Naive Bayes.
Random Forest.
Gradient Boosting.
Algorithms that are designed for binary classification can be adapted for use for multi-class
problems.
This involves using a strategy of fitting multiple binary classification models for each class vs. all
other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-one).
One-vs-Rest: Fit one binary classification model for each class vs. all other classes.
One-vs-One: Fit one binary classification model for each pair of classes.
Binary classification algorithms that can use these strategies for multi-class classification include:
Logistic Regression.
Support Vector Machine.
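A hedged scikit-learn sketch of the one-vs-rest strategy, wrapping a binary classifier (logistic regression) around a 3-class problem; the iris dataset is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression model is fitted per class vs. all other classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_))   # 3 classes -> 3 binary models
```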
Multi-Label Classification
Multi-label classification refers to those classification tasks that have two or more class labels,
where one or more class labels may be predicted for each example.
Consider the example of photo classification, where a given photo may have multiple objects in
the scene and a model may predict the presence of multiple known objects in the photo, such as
“bicycle,” “apple,” “person,” etc.
This is unlike binary classification and multi-class classification, where a single class label is
predicted for each example.
It is common to model multi-label classification tasks with a model that predicts multiple outputs, with each output predicted as a Bernoulli probability distribution. This is essentially a model that makes multiple binary classification predictions for each example.
Classification algorithms used for binary or multi-class classification cannot be used directly for multi-label classification. Specialized versions of standard classification algorithms can be used, the so-called multi-label versions of the algorithms, including:
Multi-label Decision Trees.
Multi-label Random Forests.
Multi-label Gradient Boosting.
Imbalanced Classification
Imbalanced classification refers to classification tasks where the number of examples in each class
is unequally distributed.
Typically, imbalanced classification tasks are binary classification tasks where the majority of
examples in the training dataset belong to the normal class and a minority of examples belong to
the abnormal class.
Examples include:
Fraud detection.
Outlier detection.
Medical diagnostic tests.
These problems are modeled as binary classification tasks, although they may require specialized
techniques.
Specialized techniques may be used to change the composition of samples in the training dataset
by undersampling the majority class or oversampling the minority class.
Examples include:
Random Undersampling.
SMOTE Oversampling.
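A minimal sketch of both resampling strategies, assuming the third-party imbalanced-learn package (imported as imblearn) is installed; the dataset is synthetic, with roughly a 90/10 class split.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                                                         # roughly 900 vs. 100 examples

X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)                 # synthesize new minority examples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop majority examples
print(Counter(y_over), Counter(y_under))                                  # both are now balanced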
Specialized modeling algorithms may be used that pay more attention to the minority class when
fitting the model on the training dataset, such as cost-sensitive machine learning algorithms.
Examples include:
Cost-sensitive Logistic Regression.
Cost-sensitive Decision Trees.
Cost-sensitive Support Vector Machines.
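One simple form of cost-sensitive learning is available in scikit-learn through class weights, as in this minimal sketch; the weighting makes errors on the minority class cost more during fitting.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)  # cost-sensitive variant
# the weighted model predicts the minority class more often than the unweighted one
print(plain.predict(X).sum(), weighted.predict(X).sum())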
Finally, alternative performance metrics may be required as reporting the classification accuracy
may be misleading.
Examples include:
Precision.
Recall.
F-Measure.
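The sketch below, using scikit-learn's metric functions on a small hypothetical set of labels, shows why accuracy can mislead on imbalanced data while precision, recall, and F-measure do not.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced ground truth: only two positives
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # the model misses one of the two positives

print("accuracy :", accuracy_score(y_true, y_pred))    # 0.9 looks good but is misleading
print("precision:", precision_score(y_true, y_pred))   # 1.0
print("recall   :", recall_score(y_true, y_pred))      # 0.5 reveals the missed positive
print("f1       :", f1_score(y_true, y_pred))          # about 0.67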
In classification, the target domain is a set of discrete class labels, and the loss is usually the 0-1 loss, i.e.
counting the misclassifications. In regression, the target domain is the real numbers, and the loss
is usually mean squared error. In structured prediction, both the target domain and the loss are
more or less arbitrary. This means the goal is not to predict a label or a number, but a possibly
much more complicated object like a sequence or a graph.
In structured prediction, we often deal with finite, but large output spaces Y. This situation could
be dealt with using classification with a very large number of classes. The idea behind structured
prediction is that we can do better than this, by making use of the structure of the output space. For
example, when predicting the words of a sentence from noisy observations, the expression “car door”
is far more likely than “car boar”, even though the two could easily be confused if each word were
predicted individually. For a similar example, see OCR Letter sequence recognition.
Structured prediction tries to overcome these problems by considering the output (here the
sentence) as a whole and using a loss function that is appropriate for this domain.
A formalism
I hope I have convinced you that structured prediction is a useful thing. So how are we going to
formalize this? Having functions that produce arbitrary objects seems a bit hard to handle. There is
one very basic formula at the heart of structured prediction:
y* = argmax_{y in Y} f(x, y)
Here x is the input, Y is the set of all possible outputs and f is a compatibility function that says
how well y fits the input x. The prediction for x is y*, the element of Y that maximizes the
compatibility.
This very simple formula allows us to predict arbitrarily complex outputs, as long as we can say
how compatible a given output is with the input.
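A toy sketch of this formula in plain Python, with a hypothetical compatibility function and a tiny hand-picked output space Y (real structured predictors never enumerate Y exhaustively like this):

# y* = argmax over y in Y of f(x, y), by brute force over a tiny output space
candidate_outputs = ["car door", "car boar", "cart or"]       # a tiny stand-in for Y

def compatibility(x, y):
    # hypothetical f(x, y): count the positions where the candidate agrees with the input
    return sum(a == b for a, b in zip(x, y))

x = "car door"                                                # the (possibly noisy) input
y_star = max(candidate_outputs, key=lambda y: compatibility(x, y))
print(y_star)                                                 # "car door"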
This approach opens up two questions:
How do we specify f? How do we compute y*?
As I said above, the output set Y is usually a finite but very large set (all graphs, all sentences in
the English language, all images of a given resolution). Finding the argmax in the above equation
by exhaustive search is therefore out of the question. We need to restrict ourselves to f such that
we can do the maximization over y efficiently. The most popular tool for building such f is using
energy functions or conditional random fields (CRFs).
There are basically three challenges in doing structured learning and prediction: choosing a
parametric form for f, computing the argmax y* (inference), and learning the parameters of f from
training data.
PyStruct takes f to be a linear function of some parameters w and a joint feature function of x and y:
f(x, y) = w · joint_feature(x, y)
Here w are the parameters that are learned from data, and joint_feature is defined by the user-
specified structure of the model. The definition of joint_feature is given by the Models. PyStruct
assumes that y is a discrete vector, and most models in PyStruct assume a pairwise decomposition
of the energy f over entries of y, that is
f(x, y) = Σ_i f_i(x, y_i) + Σ_(i,j) f_ij(x, y_i, y_j)
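A sketch of how this looks in PyStruct, under the assumption that its ChainCRF model and FrankWolfeSSVM learner behave as in its documented examples; the random sequences here only illustrate the expected data layout, not a meaningful task.

import numpy as np
from pystruct.models import ChainCRF
from pystruct.learners import FrankWolfeSSVM

rng = np.random.RandomState(0)
X_train = [rng.rand(6, 8) for _ in range(20)]                 # 20 chains of 6 nodes, 8 features per node
y_train = [rng.randint(0, 3, size=6) for _ in range(20)]      # 3 possible states per node

crf = ChainCRF()                                              # pairwise energies along the chain define joint_feature
ssvm = FrankWolfeSSVM(model=crf, C=0.1, max_iter=10)          # max-margin learner for the parameters w
ssvm.fit(X_train, y_train)
print(ssvm.predict(X_train[:1]))                              # predicted label sequence for the first chain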
The MNIST (Modified National Institute of Standards and Technology) database is a large
database of handwritten numbers or digits that are used for training various image processing
systems. The dataset is also widely used for training and testing in the field of machine learning.
The set of images in the MNIST database are a combination of two of NIST's databases: Special
Database 1 and Special Database 3.
The MNIST dataset has 60,000 training images and 10,000 testing images.
The MNIST dataset is available online, and it is essentially a database of various handwritten digits.
The MNIST dataset has a large amount of data and is commonly used to demonstrate the real
power of deep neural networks. Our brain and eyes work together to recognize any numbered
image. Our mind is a potent tool, and it's capable of categorizing any image quickly. There are so
many shapes of a number, and our mind can easily recognize these shapes and determine what
number it is, but the same task is not simple for a computer to complete. A practical way to do this
is to use a deep neural network, which allows us to train a computer to classify the handwritten
digits effectively.
So far, we have only dealt with data consisting of simple points on a Cartesian coordinate system,
and from the start until now we have worked with binary-class datasets. For those, the sigmoid
activation function was quite useful, squashing values into the range between 0 and 1. The sigmoid
function is not effective for multi-class datasets, however, and for this purpose we use the softmax
activation function, which is capable of dealing with them.
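A minimal NumPy sketch contrasting the two activations: sigmoid squashes a single score into (0, 1) for binary problems, while softmax turns a vector of scores into class probabilities that sum to 1.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # one probability, suited to two classes

def softmax(z):
    e = np.exp(z - np.max(z))                    # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(2.0))                              # about 0.88 for the positive class
print(softmax(np.array([2.0, 1.0, 0.1])))        # one probability per class, summing to 1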
The major difference between the datasets that we have used before and the MNIST dataset is the
method in which MNIST data is fed into a neural network.
In the perceptron model and the linear regression model, each data point was defined by a simple
x and y coordinate. This means that the input layer needs only two nodes to input a single data
point.
In the MNIST dataset, a single data point comes in the form of an image. The images included
in the MNIST dataset are typically 28*28 pixels: 28 pixels along the horizontal axis and 28 pixels
along the vertical axis. This means that a single image from the MNIST database has a total of
784 pixels that must be analyzed, so the input layer of our neural network needs 784 nodes to
represent one of these images.
Here, we will see how to create a model for recognizing handwritten digits by looking at each pixel
in the image. We then use TensorFlow to train the model to predict the digit by showing it thousands
of examples that are already labeled, and finally check the model's accuracy with a test dataset.
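One possible TensorFlow/Keras sketch of such a model (the layer sizes and number of epochs are illustrative choices, not fixed by the text): the 28*28 images are flattened into 784 inputs and a softmax layer outputs a probability for each of the 10 digits.

import tensorflow as tf

# load the MNIST images and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),       # 28*28 = 784 input nodes
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),     # one probability per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)                    # train on the labeled examples
print(model.evaluate(x_test, y_test))                    # loss and accuracy on the test dataset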
5.5.4 Ranking
Ranking is a type of machine learning that sorts data in a relevant order. Companies use ranking
to optimize search and recommendations.
Ranking is a type of supervised machine learning (ML) that uses labeled datasets to train models
that classify future data and predict outcomes. Quite simply, the goal of a ranking model is to sort
data in an optimal and relevant order.
Ranking was first largely deployed within search engines. People search for a topic, the ranking
algorithm orders the results (for example, using PageRank), and the search engine displays the
most relevant results to its users.
Until recently, most ranking models, and ML as a whole, were limited in their scope of use, as most
companies didn’t have enough data to power these algorithms. Better methods for data collection
and more intuitive ML tools have made it possible for nearly anyone to deploy a successful ranking
model within their business.
As we’ll discuss below, ranking is incredibly versatile and dependent on the data a
company has. Even so, a common framework guides the construction of all ranking models.
Ranking models are made up of 2 main factors: queries and documents. Queries are any input
value, such as a question on Google or an interaction on an e-commerce site. Documents are the
output value or results of the query. Given the query, and the associated documents, a function,
given a list of parameters to rank on, will score the documents to be sorted in order of relevancy.
The learning-to-rank algorithm takes the scores from this model and uses them to predict future
outcomes on a new and unseen list of documents.
As an example, a search for “Mage” is done on Google Search (“Mage” is the query). After the
search, a list of associated documents matching the query will be displayed (Mage A.I., Mage
definition, Mage World of Warcraft, etc.). The function will score each of the documents based on
their relevance to the query (Mage A.I. = 1, Mage definition = 2, Mage World of Warcraft = 3, and
so on, where 1 is the most relevant). The more relevant a document’s score, the higher it will be
ranked when there is a search for Mage; a toy illustration of this scoring-and-sorting step follows.
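A toy sketch of that scoring-and-sorting step; the features and weights here are hypothetical stand-ins for whatever a learned ranking model would produce.

# hypothetical documents returned for the query "Mage", with made-up features
documents = [
    {"title": "Mage World of Warcraft", "clicks": 120, "text_match": 0.4},
    {"title": "Mage A.I.",              "clicks": 300, "text_match": 0.9},
    {"title": "Mage definition",        "clicks": 150, "text_match": 0.7},
]

def relevance(doc, w_clicks=0.001, w_match=1.0):
    # hypothetical scoring function combining popularity and how well the text matches the query
    return w_clicks * doc["clicks"] + w_match * doc["text_match"]

ranked = sorted(documents, key=relevance, reverse=True)   # most relevant document first
print([d["title"] for d in ranked])                       # ['Mage A.I.', 'Mage definition', 'Mage World of Warcraft']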
Data required for a ranking model consists of documents from a query, user profiles, user
behaviors, search history, clicks, etc.
Ranking ensures that the most relevant results appear first on a customer’s search, maximizing the
chances they will find something of interest, and minimizing the chances of churn. With so many
options for organic web search, the need to stay competitive has never been greater. According to
a Google study, 61% of users said if they didn’t find what they were looking for right away, they
would quickly move on to another site. Depending on available data, companies can use ranking
within their web pages and apps to serve their customers the most relevant results as soon as they
enter.
Use cases:
The most successful companies are using ranking within their software to improve the user
experience. Ranking has allowed these companies to create customized feeds for each user based
on their past search and buying history. Ranking carries many use cases across industries; nearly
anyone with data can and should be using ranking in some capacity to optimize their business. A
few use cases are:
1. Search results
2. Targeted ads
3. Recommendations
Here are a few companies who have used ranking to maximize user engagement.
Amazon
With millions of listings or documents, for every product search or query, Amazon needed
to find a way to rank its products in order to maximize the chance of purchase. Using a
combination of individual preferences, gathered from users' search and purchasing history
and a product’s popularity, Amazon created a ranking system that would display the most
relevant products at the top of their feed. Additionally, ranking was used in Amazon’s
recommendation system, which would use users' ranked preferences in order to predict
what products a user is most likely to purchase in the future.
Netflix
Similar to Amazon, Netflix uses ranking to fuel their recommendation system. The
recommendation system predicts what content a user is most likely to watch and displays
the most relevant content at the top of the home page. Netflix uses a few different features
to rank and recommend content, such as watch history, search history, and general
popularity. They also use ranking to fuel their collaborative filtering.
TikTok
TikTok’s standout feature is the For You page which is built on a ranking system. This
feature has allowed TikTok to customize each home page to be reflective of the preferences
and interests of each user. TikTok uses similar metrics to Netflix to rank its content: watch
history, re-watch rate, and engagement. Similar to Netflix, TikTok’s ranking system also
aids in collaborative filtering.
Starbucks
Starbucks found great success with their mobile app, which is one of the most downloaded apps
on the App Store. The app allows Starbucks to create a custom user experience for their customers
even when they’re not within a physical coffee shop. The app uses ranking to recommend the most
relevant products to users. Taking into account order history, new products and general popularity
of other products, Starbucks is able to keep customers' favorite orders at the top of the
recommended search while introducing them to new products that they are most likely to enjoy.
For the companies listed above, entire teams of data scientists and AI engineers were assembled to
create and maintain the ranking systems in place. The cost to build these teams is impractical for most
businesses. Recently, great tools have emerged which allow ranking models to be built and deployed
easily, with little to no programming experience.
Mage allows for the building and deployment of a ranking model with no ML programming
knowledge. To use Mage, a database containing a list of queries and documents is first uploaded.
Queries could be a list of clothes or menu items, and their documents could be the amount of
engagement (clicks and purchases) each received. The greater the quality and quantity of the data
uploaded, the better the ranking predictions Mage is able to produce.
Once the data is uploaded, users will be given the option to transform their datasets by removing
and adding columns, applying transformer actions (splitting and filtering data, grouping values, and
aggregating data), and identifying which columns they would like to rank. Mage will then produce a ranking
model which can be deployed into your data warehouses, downloaded to a CSV file, or saved
directly to a Mage dataset.