
Gajjala Ashok’s Unit -II : Machine Learning

Unit II:

Supervised Learning (Regression/Classification): Basic Methods: Distance-based Methods, Nearest Neighbours, Decision Trees, Naive Bayes; Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models; Support Vector Machines; Binary Classification: Multiclass/Structured Outputs, MNIST, Ranking.


2.1 Distance based Methods


What Are Distance Metrics?

Distance metrics are a key part of several machine learning algorithms. These distance metrics are

used in both supervised and unsupervised learning, generally to calculate the similarity between

data points. An effective distance metric improves the performance of our machine learning model,

whether that’s for classification tasks or clustering.

Let’s say you need to create clusters using a clustering algorithm such as K-Means, or to solve a classification or regression problem with the k-nearest neighbour (k-NN) algorithm, which relies on the nearest neighbours of a point. How will you define the similarity between different observations? How can we say that two points are similar to each other? They are similar if their features are similar, right? When we plot such points, they will be closer to each other by distance.

Hence, we can calculate the distance between points and then define the similarity between them.
Here’s the million-dollar question – how do we calculate this distance, and what are the different


distance metrics in machine learning? Also, are these metrics different for different learning
problems?

Types of Distance Metrics in Machine Learning

1. Euclidean Distance

2. Manhattan Distance

3. Minkowski Distance

4. Hamming Distance

Let’s start with the most commonly used distance metric – Euclidean Distance

Euclidean Distance

Euclidean Distance represents the shortest (straight-line) distance between two vectors. It is the square root of the sum of squares of the differences between corresponding elements.

The Euclidean distance metric corresponds to the L2-norm of the difference between the two vectors. (By contrast, cosine similarity is proportional to the dot product of two vectors and inversely proportional to the product of their magnitudes.)

Most machine learning algorithms, including K-Means use this distance metric to measure

the similarity between observations. Let’s say we have two points, as shown below:


So, the Euclidean Distance between two points A = (x1, y1) and B = (x2, y2) will be:

d(A, B) = √((x2 – x1)² + (y2 – y1)²)

We use this formula when we are dealing with 2 dimensions. We can generalize it for an n-dimensional space as:

d(p, q) = √( Σ (pi – qi)² ), summing over i = 1, 2, …, n

Where, n = number of dimensions

pi, qi = the i-th coordinates of the data points p and q


Manhattan Distance

Manhattan Distance is the sum of absolute differences between points across all the dimensions.

We can represent Manhattan Distance in 2 dimensions as:

d(A, B) = |x2 – x1| + |y2 – y1|

Since the above representation is 2-dimensional, to calculate Manhattan Distance we take the sum of absolute differences in both the x and y directions. The generalized formula for an n-dimensional space is given as:

d(p, q) = Σ |pi – qi|, summing over i = 1, 2, …, n

Where,

 n = number of dimensions

 pi, qi = the i-th coordinates of the data points p and q


Minkowski Distance

Minkowski Distance is the generalized form of Euclidean and Manhattan Distance:

d(A, B) = ( Σ |ai – bi|^p )^(1/p), summing over i = 1, 2, …, n

Here, p represents the order of the norm: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
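As a small illustration (a minimal sketch in plain Python, not part of the original notes; the helper name and the sample points are our own), the three numeric distance metrics above can be computed with a single function:

def minkowski_distance(p, q, order=2):
    # Generalized Minkowski distance; order=1 gives Manhattan, order=2 gives Euclidean.
    return sum(abs(pi - qi) ** order for pi, qi in zip(p, q)) ** (1 / order)

A = (1.0, 2.0, 3.0)
B = (4.0, 6.0, 3.0)

print(minkowski_distance(A, B, order=2))  # Euclidean distance: 5.0
print(minkowski_distance(A, B, order=1))  # Manhattan distance: 7.0
print(minkowski_distance(A, B, order=3))  # Minkowski distance of order 3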

Hamming Distance

Hamming Distance measures the similarity between two strings of the same length. The Hamming

Distance between two strings of the same length is the number of positions at which the

corresponding characters are different.

Let’s understand the concept using an example. Let’s say we have two strings:

“euclidean” and “manhattan”

Since the length of these strings is equal, we can calculate the Hamming Distance. We will go

character by character and match the strings. The first character of both the strings (e and m,

respectively) is different. Similarly, the second character of both the strings (u and a) is different.

and so on.

Look carefully – seven characters are different, whereas two characters (the last two characters)

are similar:

Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between

two strings, the more dissimilar those strings will be (and vice versa).
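A minimal sketch (our own helper, assuming the two strings have equal length) that reproduces the example above:

def hamming_distance(s1, s2):
    # Number of positions at which the corresponding characters differ.
    if len(s1) != len(s2):
        raise ValueError("Strings must be of the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("euclidean", "manhattan"))  # 7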


2.1.1 Nearest Neighbours

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-
suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then
it classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but
we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will compare the features of
the new image with the images of cats and dogs and, based on the most similar
features, put it in either the cat or the dog category.


Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1. In which of these categories will this data point lie? To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:


o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:


o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
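The steps above can be reproduced with scikit-learn's KNeighborsClassifier. This is a minimal sketch; the two-dimensional points are made-up values for illustration, not the points from the figures:

from sklearn.neighbors import KNeighborsClassifier

# Illustrative training points for Category A (label 0) and Category B (label 1)
X_train = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# Step 1: choose K = 5; Steps 2-5 (distance computation and voting) happen inside predict()
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

new_point = [[3, 4]]
print(knn.predict(new_point))  # predicted category of the new data point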

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K are good, but it may find some difficulties.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o We always need to determine the value of K, which may sometimes be complex.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.


2.1.2 Decision Trees

Decision trees are a popular machine learning algorithm that can be used for both regression and
classification tasks. They are easy to understand, interpret, and implement, making them an ideal
choice for beginners in the field of machine learning. In this comprehensive guide, we will cover
all aspects of the decision tree algorithm, including the working principles, different types of
decision trees, the process of building decision trees, and how to evaluate and optimize decision
trees.

What is a Decision Tree?

A decision tree is a non-parametric supervised learning algorithm for classification and


regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal
nodes, and leaf nodes. Decision trees are used for classification and regression tasks, providing
easy-to-understand models.

A decision tree is a hierarchical model used in decision support that depicts decisions and their

potential outcomes, incorporating chance events, resource expenses, and utility. This algorithmic

model utilizes conditional control statements and is non-parametric, supervised learning, useful

for both classification and regression tasks. The tree structure is comprised of a root node,

branches, internal nodes, and leaf nodes, forming a hierarchical, tree-like structure.

It is a tool that has applications spanning several different areas. Decision trees can be used for

classification as well as regression problems. The name itself suggests that it uses a flowchart like

a tree structure to show the predictions that result from a series of feature-based splits. It starts

with a root node and ends with a decision made by leaves.


Decision Tree Terminologies

 Root Node: The initial node at the beginning of a decision tree, where the entire population or

dataset starts dividing based on various features or conditions.

 Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes.

These nodes represent intermediate decisions or conditions within the tree.

 Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification

or outcome. Leaf nodes are also referred to as terminal nodes.

 Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a sub-section of a decision

tree is referred to as a sub-tree. It represents a specific portion of the decision tree.

 Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent

overfitting and simplify the model.

 Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch or sub-tree.

It represents a specific path of decisions and outcomes within the tree.

 Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a

parent node, and the sub-nodes emerging from it are referred to as child nodes. The parent node


represents a decision or condition, while the child nodes represent the potential outcomes or further

decisions based on that condition.

Example of Decision Tree

Let’s understand decision trees with the help of an example:


Decision trees are upside down which means the root is at the top and then this root is split into

various several nodes. Decision trees are nothing but a bunch of if-else statements in layman terms.

It checks if the condition is true and if it is then it goes to the next node attached to that decision.

In the below diagram, the tree will first ask: what is the weather? Is it sunny, cloudy, or rainy? Depending on the

answer, it will go to the next feature, which is humidity or wind. It will again check whether the

wind is strong or weak; if it is a weak wind and it is rainy, then the person may go and play.


Did you notice anything in the above flowchart? We see that if the weather is cloudy then we must

go to play. Why didn’t it split more? Why did it stop there?

To answer this question, we need to know about a few more concepts like entropy, information gain,

and the Gini index. But in simple terms, I can say here that the output for the training dataset is always

"yes" for cloudy weather; since there is no disorderliness here, we don’t need to split the node further.

The goal of machine learning is to decrease uncertainty or disorders from the dataset and for this,

we use decision trees.

Now you must be thinking how do I know what should be the root node? what should be the

decision node? when should I stop splitting? To decide this, there is a metric called “Entropy”

which is the amount of uncertainty in the dataset.

Decision Tree Assumptions

Several assumptions are made to build effective models when creating decision trees. These

assumptions help guide the tree’s construction and impact its performance. Here are some common

assumptions and considerations when creating decision trees:


Binary Splits

Decision trees typically make binary splits, meaning each node divides the data into two subsets

based on a single feature or condition. This assumes that each decision can be represented as a

binary choice.

Recursive Partitioning

Decision trees use a recursive partitioning process, where each node is divided into child nodes,

and this process continues until a stopping criterion is met. This assumes that data can be

effectively subdivided into smaller, more manageable subsets.

Feature Independence

Decision trees often assume that the features used for splitting nodes are independent. In practice,

feature independence may not hold, but decision trees can still perform well if features are

correlated.

Homogeneity

Decision trees aim to create homogeneous subgroups in each node, meaning that the samples

within a node are as similar as possible regarding the target variable. This assumption helps in

achieving clear decision boundaries.

Top-Down Greedy Approach

Decision trees are constructed using a top-down, greedy approach, where each split is chosen to

maximize information gain or minimize impurity at the current node. This may not always result

in the globally optimal tree.

Categorical and Numerical Features

Decision trees can handle both categorical and numerical features. However, they may require

different splitting strategies for each type.


Overfitting

Decision trees are prone to overfitting when they capture noise in the data. Pruning and setting

appropriate stopping criteria are used to address this assumption.

Impurity Measures

Decision trees use impurity measures such as Gini impurity or entropy to evaluate how well a split

separates classes. The choice of impurity measure can impact tree construction.

No Missing Values

Decision trees assume that there are no missing values in the dataset or that missing values have

been appropriately handled through imputation or other methods.

Equal Importance of Features

Decision trees may assume equal importance for all features unless feature scaling or weighting is

applied to emphasize certain features.

No Outliers

Decision trees are sensitive to outliers, and extreme values can influence their construction.

Preprocessing or robust methods may be needed to handle outliers effectively.

Sensitivity to Sample Size

Small datasets may lead to overfitting, and large datasets may result in overly complex trees. The

sample size and tree depth should be balanced.

Entropy

Entropy is nothing but the uncertainty in our dataset or measure of disorder. Let me try to explain

this with the help of an example.


Suppose you have a group of friends who decides which movie they can watch together on Sunday.

There are 2 choices for movies, one is “Lucy” and the second is “Titanic” and now everyone has

to tell their choice. After everyone gives their answer we see that “Lucy” gets 4

votes and “Titanic” gets 5 votes. Which movie do we watch now? Isn’t it hard to choose 1 movie

now because the votes for both the movies are somewhat equal.

This is exactly what we call disorderness, there is an equal number of votes for both the movies,

and we can’t really decide which movie we should watch. It would have been much easier if the

votes for “Lucy” were 8 and for “Titanic” it was 2. Here we could easily say that the majority of

votes are for “Lucy” hence everyone will be watching this movie.

In a decision tree, the output is mostly “yes” or “no”

The formula for Entropy is shown below:

E(S) = – p+ log2(p+) – p– log2(p–)

Here,

 p+ is the probability of the positive class

 p– is the probability of the negative class

 S is the subset of the training examples

How do Decision Trees use Entropy?

Now we know what entropy is and what is its formula, Next, we need to know that how exactly

does it work in this algorithm.


Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells

how random our data is. A pure sub-split means that either you should be getting only "yes", or you

should be getting only "no".

Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and

2 "no" whereas the right node gets 3 "yes" and 2 "no".

We see here the split is not pure. Why? Because we can still see some negative classes in both

nodes. In order to make a decision tree, we need to calculate the impurity of each split, and when

the purity is 100%, we make it a leaf node.

To check the impurity of feature 2 and feature 3 we will take the help of the Entropy formula.


For feature 3,

We can clearly see from the tree itself that the left node has lower entropy, or more purity, than the right node,

since the left node has a greater number of "yes" and it is easy to decide here.

Always remember that the higher the Entropy, the lower will be the purity and the higher will be

the impurity.
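As a quick check of the numbers described above, here is a minimal sketch in plain Python (the helper is our own) that computes the entropy of the parent node (8 "yes", 4 "no") and of the two child nodes (5/2 and 3/2):

import math

def entropy(counts):
    # E(S) = -sum p_i log2(p_i) over the classes; a count of 0 contributes nothing.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

print(round(entropy([8, 4]), 3))  # parent node with 8 "yes" and 4 "no"
print(round(entropy([5, 2]), 3))  # left child: purer, so lower entropy
print(round(entropy([3, 2]), 3))  # right child: less pure, so higher entropy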


As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the

dataset. By using the entropy we are getting the impurity of a particular node, but we still don’t know

whether the entropy of the parent has actually decreased after splitting on a feature.

For this, we bring a new metric called “Information gain” which tells us how much the parent

entropy has decreased after splitting it with some feature.

Information Gain

Information gain measures the reduction of uncertainty given some feature, and it is also the deciding

factor for which attribute should be selected as a decision node or the root node.

It is simply the entropy of the full dataset minus the entropy of the dataset given some feature:

Information Gain = E(Parent) – E(Parent | feature)

To understand this better let’s consider an example:Suppose our entire population has a total of 30

instances. The dataset is to predict whether the person will go to the gym or not. Let’s say 16

people go to the gym and 14 people don’t

Now we have two features to predict whether he/she will go to the gym or not.

 Feature 1 is “Energy” which takes two values “high” and “low”

 Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly

motivated”.

Let’s see how our decision tree will be made using these 2 features. We’ll use information gain to

decide which feature should be the root node and which feature should be placed after the split.


Let’s calculate the entropy of the parent node:

E(Parent) = – (16/30) log2(16/30) – (14/30) log2(14/30) ≈ 0.99

To get the weighted average of the entropy of the child nodes, we take:

E(Parent|Energy) = Σ (number of samples in the child node / total number of samples) × E(child node)

Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:

Information Gain = E(Parent) – E(Parent|Energy)


Our parent entropy was near 0.99 and after looking at this value of information gain, we can say

that the entropy of the dataset will decrease by 0.37 if we make “Energy” as our root node.

Similarly, we will do this with the other feature “Motivation” and calculate its information gain.


Let’s calculate the entropy here in the same way: compute E(Parent|Motivation) as the weighted average of the

entropies of the child nodes created by splitting on "Motivation".

Now that we have the values of E(Parent) and E(Parent|Motivation), the information gain will be:

Information Gain = E(Parent) – E(Parent|Motivation)

We now see that the “Energy” feature gives a larger reduction in entropy (0.37) than the “Motivation”

feature. Hence we select the feature which has the highest information gain and then split the

node based on that feature.

In this example “Energy” will be our root node and we’ll do the same for sub-nodes. Here we can

see that when the energy is “high” the entropy is low and hence we can say a person will definitely

go to the gym if he has high energy, but what if the energy is low? We will again split the node

based on the new feature which is “Motivation”.
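The information-gain computation above can be sketched in plain Python. The exact per-child counts for "Energy" come from figures not reproduced here, so the child split below is a hypothetical placeholder; only the parent counts (16 go to the gym, 14 don’t) are taken from the text, and the resulting gain depends on the assumed split:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    # IG = E(parent) - weighted average of the child entropies
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

parent = [16, 14]                     # 16 "go to gym", 14 "don't" (from the text)
energy_children = [[13, 2], [3, 12]]  # hypothetical counts for Energy = high / low
print(f"E(Parent) = {entropy(parent):.3f}")                       # close to the 0.99 quoted above
print(f"IG for Energy = {information_gain(parent, energy_children):.2f}")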

When to Stop Splitting?

You must be asking yourself: when do we stop growing our tree? Usually, real-

world datasets have a large number of features, which results in a large number of splits, which

in turn gives a huge tree. Such trees take time to build and can lead to overfitting. That means the

tree will give very good accuracy on the training dataset but bad accuracy on test data.

There are many ways to tackle this problem through hyperparameter tuning. We can set the

maximum depth of our decision tree using the max_depth parameter. The higher the value

of max_depth, the more complex your tree will be. The training error will of course decrease if

we increase the max_depth value, but when our test data comes into the picture, we will get a very

bad accuracy. Hence you need a value that will neither overfit nor underfit the data, and for this,

you can use GridSearchCV.


Another way is to set the minimum number of samples for each split. It is denoted

by min_samples_split. Here we specify the minimum number of samples required to do a split.

For example, we can require a minimum of 10 samples to reach a decision. That means if a node has

fewer than 10 samples, then using this parameter we can stop the further splitting of this node and

make it a leaf node.

There are more hyperparameters such as :

 min_samples_leaf – represents the minimum number of samples required to be in a leaf node.

Increasing this number makes the tree more conservative and helps reduce overfitting.

 max_features – it helps us decide what number of features to consider when looking for the best

split.

To read more about these hyperparameters, refer to the scikit-learn documentation. A minimal tuning sketch is shown below.
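This sketch uses scikit-learn's DecisionTreeClassifier with GridSearchCV; the Iris dataset is only a stand-in, and the parameter grid values are illustrative rather than recommended settings:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

# GridSearchCV tries every combination with cross-validation and keeps the best one
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))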

Pruning

Pruning is another method that can help us avoid overfitting. It helps in improving the performance

of the tree by cutting the nodes or sub-nodes which are not significant. Additionally, it removes

the branches which have very low importance.

There are mainly 2 ways for pruning:

 Pre-pruning – we can stop growing the tree earlier, which means we can prune/remove/cut a node

if it has low importance while growing the tree.

 Post-pruning – once our tree is built to its depth, we can start pruning the nodes based on their

significance.


Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability
of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described
as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple without
depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.


P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and corresponding target variable "Play". So
using this dataset we need to decide that whether we should play or not on a particular day
according to the weather conditions. So to solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes


7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather    No            Yes            P(Weather)

Overcast   0             5              5/14 = 0.35

Rainy      2             2              4/14 = 0.29

Sunny      2             3              5/14 = 0.35

All        4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
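The same calculation can be reproduced in plain Python. This is a minimal sketch that simply counts frequencies from the 14-row table above; the helper name is our own:

data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

def posterior(play, outlook):
    # P(play | outlook) = P(outlook | play) * P(play) / P(outlook)
    n = len(data)
    n_play = sum(1 for o, p in data if p == play)
    n_outlook = sum(1 for o, p in data if o == outlook)
    n_both = sum(1 for o, p in data if o == outlook and p == play)
    return (n_both / n_play) * (n_play / n) / (n_outlook / n)

print(round(posterior("Yes", "Sunny"), 2))  # 0.6
print(round(posterior("No", "Sunny"), 2))   # 0.4 (the rounded hand calculation above gives 0.41)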

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.


Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present or not in a document. This model is also well known for document classification tasks.

2.2 Linear Models:

2.2.1 What is Linear Regression?

Linear regression is a type of statistical analysis used to predict the relationship between two

variables. It assumes a linear relationship between the independent variable and the dependent

variable, and aims to find the best-fitting line that describes the relationship. The line is determined

by minimizing the sum of the squared differences between the predicted values and the actual

values.


Linear regression is commonly used in many fields, including economics, finance, and social

sciences, to analyze and predict trends in data. It can also be extended to multiple linear regression,

where there are multiple independent variables, and logistic regression, which is used for binary

classification problems.

Simple Linear Regression

In a simple linear regression, there is one independent variable and one dependent variable. The

model estimates the slope and intercept of the line of best fit, which represents the relationship

between the variables. The slope represents the change in the dependent variable for each unit

change in the independent variable, while the intercept represents the predicted value of the

dependent variable when the independent variable is zero.

Linear regression is the simplest statistical regression method used for predictive

analysis in machine learning. It shows the linear relationship between the

independent (predictor) variable, i.e. the X-axis, and the dependent (output) variable, i.e. the Y-axis. If there

is a single input variable X (independent variable), such linear regression

is called simple linear regression.


The graph above presents the linear relationship between the output(y) and predictor(X)

variables. The blue line is referred to as the best-fit straight line. Based on the given data points,

we attempt to plot a line that fits the points the best.

To calculate best-fit line linear regression uses a traditional slope-intercept form which is given

below,

Yi = β0 + β1Xi

where Yi = dependent variable, β0 = constant/intercept, β1 = slope/coefficient, Xi = independent

variable.

This algorithm explains the linear relationship between the dependent (output) variable Y and the

independent (predictor) variable X using a straight line Y = B0 + B1X.

But how does linear regression find out which is the best-fit line?


The goal of the linear regression algorithm is to get the best values for B0 and B1 to find the best

fit line. The best fit line is a line that has the least error which means the error between predicted

values and actual values should be minimum.

Random Error(Residuals)

In regression, the difference between the observed value of the dependent variable (yi) and the

predicted value (ypredicted) is called the residual:

εi = yi – ypredicted

where ypredicted = B0 + B1 Xi

What is the best fit line?

In simple terms, the best fit line is a line that fits the given scatter plot in the best way.

Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares(RSS).

Cost Function for Linear Regression

The cost function helps to work out the optimal values for B0 and B1, which provides the best fit

line for the data points.

In Linear Regression, generally the Mean Squared Error (MSE) cost function is used, which is the

average of the squared errors between ypredicted and yi:

MSE = (1/n) Σ (yi – (B0 + B1 xi))²


Using the MSE function, we’ll update the values of B0 and B1 such that the MSE value settles at

the minima. These parameters can be determined using the gradient descent method such that the

value for the cost function is minimum.

Gradient Descent for Linear Regression

Gradient Descent is one of the optimization algorithms that optimize the cost function(objective

function) to reach the optimal minimal solution. To find the optimum solution we need to reduce

the cost function(MSE) for all data points. This is done by updating the values of B0 and

B1 iteratively until we get an optimal solution.

A regression model uses the gradient descent algorithm to update the coefficients of the line:

the coefficient values are initialized (often randomly) and then iteratively

updated so as to reduce the cost function until its minimum is reached.

Let’s take an example to understand this. Imagine a U-shaped pit. And you are standing at the

uppermost point in the pit, and your motive is to reach the bottom of the pit. Suppose there is a


treasure at the bottom of the pit, and you can only take a discrete number of steps to reach the

bottom. If you opted to take one step at a time, you would get to the bottom of the pit in the end

but, this would take a longer time. If you decide to take larger steps each time, you may achieve

the bottom sooner but, there’s a probability that you could overshoot the bottom of the pit and not

even near the bottom. In the gradient descent algorithm, the number of steps you’re taking can be

considered as the learning rate, and this decides how fast the algorithm converges to the minima.

To update B0 and B1, we take gradients from the cost function. To find these gradients, we take

partial derivatives for B0 and B1.


We need to minimize the cost function J. One way to achieve this is to apply the batch

gradient descent algorithm. In batch gradient descent, the values are updated in each iteration:

B0 = B0 – α × (∂J/∂B0), where ∂J/∂B0 = (2/n) Σ (ypredicted – yi)

B1 = B1 – α × (∂J/∂B1), where ∂J/∂B1 = (2/n) Σ (ypredicted – yi) × xi

The partial derivatives are the gradients, and they are used to update the values of B0 and B1. Alpha (α)

is the learning rate.
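A minimal batch gradient descent sketch in plain Python follows; the toy data, number of iterations, and learning rate are chosen purely for illustration:

# Toy data roughly following y = 2x + 1
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.1, 4.9, 7.2, 9.0, 11.1]

b0, b1 = 0.0, 0.0     # initial coefficients
alpha = 0.01          # learning rate
n = len(X)

for _ in range(5000):                  # iterations of batch gradient descent
    preds = [b0 + b1 * x for x in X]
    errors = [p - y for p, y in zip(preds, Y)]
    grad_b0 = (2 / n) * sum(errors)                             # dJ/dB0
    grad_b1 = (2 / n) * sum(e * x for e, x in zip(errors, X))   # dJ/dB1
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(round(b0, 3), round(b1, 3))  # should end up close to 1 and 2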


Evaluation Metrics for Linear Regression

The strength of any linear regression model can be assessed using various evaluation metrics.

These evaluation metrics usually provide a measure of how well the observed outputs are being

generated by the model.

The most used metrics are,

1. Coefficient of Determination or R-Squared (R2)

2. Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)

Coefficient of Determination or R-Squared (R2)

R-Squared is a number that explains the amount of variation that is explained/captured by the

developed model. It always ranges between 0 & 1 . Overall, the higher the value of R-squared, the

better the model fits the data.

Mathematically it can be represented as,

R2 = 1 – ( RSS/TSS )

 Residual Sum of Squares (RSS) is defined as the sum of squares of the residuals for each data

point in the plot/data. It is a measure of the difference between the expected and the actual

observed output:

RSS = Σ (yi – ypredicted)²

 Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of

the response variable. Mathematically, TSS is:

TSS = Σ (yi – ȳ)²

where ȳ (y bar) is the mean of the sample data points.

Root Mean Squared Error

The Root Mean Squared Error is the square root of the variance of the residuals. It specifies the

absolute fit of the model to the data, i.e. how close the observed data points are to the predicted

values. Mathematically it can be represented as:

RMSE = √( RSS / n ) = √( Σ (yi – ypredicted)² / n )

To make this estimate unbiased, one has to divide the sum of the squared residuals by the degrees

of freedom rather than the total number of data points in the model. This term is then called

the Residual Standard Error (RSE). Mathematically it can be represented as:

RSE = √( RSS / d.f. ), where d.f. = n – 2 for simple linear regression

R-squared is a better measure than RMSE. Because the value of Root Mean Squared Error depends

on the units of the variables (i.e. it is not a normalized measure), it can change with the change in

the unit of the variables.
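A minimal sketch computing these metrics for some predictions; the arrays are illustrative values, and scikit-learn's r2_score and mean_squared_error would give matching results:

import math

y_actual = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.3, 6.9, 9.2]

n = len(y_actual)
y_mean = sum(y_actual) / n

rss = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred))   # residual sum of squares
tss = sum((ya - y_mean) ** 2 for ya in y_actual)                # total sum of squares

r_squared = 1 - rss / tss
rmse = math.sqrt(rss / n)
rse = math.sqrt(rss / (n - 2))   # degrees of freedom = n - 2 for simple linear regression

print(round(r_squared, 4), round(rmse, 4), round(rse, 4))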

Assumptions of Linear Regression


Regression is a parametric approach, which means that it makes assumptions about the data for

the purpose of analysis. For successful regression analysis, it’s essential to validate the following

assumptions.

1. Linearity of residuals: There needs to be a linear relationship between the dependent

variable and independent variable(s).

2. Independence of residuals: The error terms should not be dependent on one another (as in

time-series data, wherein the next value is dependent on the previous one). There should be no

correlation between the residual terms. The presence of such correlation is known

as Autocorrelation.

There should not be any visible patterns in the error terms.


3. Normal distribution of residuals: The residuals should follow a normal

distribution with a mean equal to zero or close to zero. This is checked in order to verify whether the

selected line is actually the line of best fit or not.

If the error terms are non-normally distributed, it suggests that there are a few unusual data points

that must be studied closely to make a better model.

4. The equal variance of residuals: The error terms must have constant variance. This

phenomenon is known as Homoscedasticity.

The presence of non-constant variance in the error terms is referred to as Heteroscedasticity.

Generally, non-constant variance arises in the presence of outliers or extreme leverage values.


Hypothesis in Linear Regression

Once you have fitted a straight line on the data, you need to ask, “Is this straight line a significant

fit for the data?” or “Does the beta coefficient explain the variance in the plotted data?” This is

where the idea of hypothesis testing on the beta coefficient comes in. The null and alternate

hypotheses in this case are:

H0: B1 = 0

HA: B1 ≠ 0

To test this hypothesis we use a t-test; the test statistic for the beta coefficient is given by:

t = B1 / SE(B1), where B1 is the estimated coefficient and SE(B1) is its standard error.

Assessing the model fit

Some other parameters to assess a model are:

1. t statistic: It is used to determine the p-value and hence, helps in determining whether the

coefficient is significant or not

2. F statistic: It is used to assess whether the overall model fit is significant or not. Generally,

the higher the value of the F-statistic, the more significant a model turns out to be.

Multiple Linear Regression

Multiple linear regression is a technique to understand the relationship between a single dependent

variable and multiple independent variables.

The formulation for multiple linear regression is also similar to simple linear regression with

the small change that instead of having one beta variable, you will now have betas for all the

variables used. The formula is given as:


Y = B0 + B1X1 + B2X2 + … + BpXp + ε

Considerations of Multiple Linear Regression

All the four assumptions made for Simple Linear Regression still hold true for Multiple Linear

Regression along with a few new additional assumptions.

1. Overfitting: When more and more variables are added to a model, the model may become

far too complex and usually ends up memorizing all the data points in the training set. This

phenomenon is known as the overfitting of a model. This usually leads to high training

accuracy and very low test accuracy.

2. Multicollinearity: It is the phenomenon where a model with several independent variables,

may have some variables interrelated.

3. Feature Selection: With more variables present, selecting the optimal set of predictors

from the pool of given features (many of which might be redundant) becomes an important

task for building a relevant and better model.

Multicollinearity

As multicollinearity makes it difficult to find out which variable is actually contributing towards

the prediction of the response variable, it can lead one to draw incorrect conclusions about the effects of a variable

on the target variable. Though it does not affect the precision of the predictions, it is essential to

properly detect and deal with the multicollinearity present in the model, as random removal of any

of these correlated variables from the model causes the coefficient values to swing wildly and even

change signs.

Multicollinearity can be detected using the following methods.


1. Pairwise Correlations: Checking the pairwise correlations between different pairs of

independent variables can throw useful insights in detecting multicollinearity.

2. Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, as it is

possible that just one variable might not be able to completely explain some other variable,

but several variables combined could be able to do this. Thus, to check these sorts of

relations between variables, one can use VIF. VIF basically explains the relationship of

one independent variable with all the other independent variables. VIF is given by:

VIF_i = 1 / (1 – R_i²)

where R_i² is the R-squared obtained by regressing the ith variable on the rest of the independent

variables, i.e. the ith variable represented as a linear combination of the others.

The common heuristic followed for the VIF values is: if VIF > 10 then the value is definitely high

and the variable should be dropped. If VIF is around 5, it may still be acceptable but should be inspected first. If

VIF < 5, then it is considered a good VIF value. A computation sketch is shown below.
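This sketch uses statsmodels' variance_inflation_factor on a synthetic feature matrix in which x3 is nearly a linear combination of x1 and x2, so all three columns should show very large VIFs; the data is made up for illustration:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2 * x1 + 3 * x2 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1, x2
X = np.column_stack([np.ones(100), x1, x2, x3])          # column of ones = intercept

for i in range(1, X.shape[1]):                           # skip the intercept column
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.2f}")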

Overfitting and Underfitting in Linear Regression

There have always been situations where a model performs well on training data but not on the

test data. While training models on a dataset, overfitting, and underfitting are the most common

problems faced by people.

Before understanding overfitting and underfitting one must know about bias and variance.

Bias:

Bias is a measure to determine how accurate is the model likely to be on future unseen data.

Complex models, assuming there is enough training data available, can do predictions accurately.


Whereas models that are too naive are very likely to perform badly with respect to

predictions. Simply put, bias is the error arising from the simplifying assumptions the model makes about the training data.

Generally, linear algorithms have a high bias, which makes them fast to learn and easier to

understand, but in general less flexible. This implies lower predictive performance on complex

problems where those simple assumptions fail to hold.

Variance:

Variance is the sensitivity of the model towards training data, that is it quantifies how much the

model will react when input data is changed.

Ideally, the model shouldn’t change too much from one training dataset to the next training data,

which will mean that the algorithm is good at picking out the hidden underlying patterns between

the inputs and the output variables.

Ideally, a model should have lower variance which means that the model doesn’t change drastically

after changing the training data(it is generalizable). Having higher variance will make a model

change drastically even on a small change in the training dataset.

Let’s understand what is a bias-variance tradeoff is.

Bias Variance Tradeoff

The aim of any supervised machine learning algorithm is to achieve low bias and low variance as

it is more robust. So that the algorithm should achieve better performance.

There is no escape from the relationship between bias and variance in machine learning.


There is an inverse relationship between bias and variance,

 An increase in bias will decrease the variance.

 An increase in the variance will decrease the bias.

There is a trade-off that plays between these two concepts and the algorithms must find a balance

between bias and variance.

As a matter of fact, one cannot calculate the real bias and variance error terms because we do not

know the actual underlying target function.

Now coming to the overfitting and underfitting.

Overfitting

When a model learns each and every pattern and noise in the data to such extent that it affects the

performance of the model on the unseen future dataset, it is referred to as overfitting. The model

fits the data so well that it interprets noise as patterns in the data.


When a model has low bias and higher variance it ends up memorizing the data and causing

overfitting. Overfitting causes the model to become specific rather than generic. This usually leads

to high training accuracy and very low test accuracy.

Detecting overfitting is useful, but it doesn’t solve the actual problem. There are several ways to

prevent overfitting, which are stated below:

 Cross-validation

 If the training data is too small to train add more relevant and clean data.

 If the training data is too large, do some feature selection and remove unnecessary features.

 Regularization

Underfitting:

Underfitting is not discussed as often as overfitting. When the model fails to

learn from the training dataset and is also not able to generalize to the test dataset, it is referred to

as underfitting. This type of problem can be very easily detected by the performance metrics.

When a model has high bias and low variance it ends up not generalizing the data and causing

underfitting. It is unable to find the hidden underlying patterns from the data. This usually leads to

low training accuracy and very low test accuracy. The ways to prevent underfitting are stated

below,

 Increase the model complexity

 Increase the number of features in the training data

 Remove noise from the data

2.2.2 Polynomial Regression


What is Polynomial Regression?

In polynomial regression, we describe the relationship between the independent variable x and the

dependent variable y using an nth-degree polynomial in x. Polynomial regression, denoted as E(y

| x), characterizes fitting a nonlinear relationship between the x value and the conditional mean of

y. Typically, this corresponds to the least-squares method. The least-square approach minimizes

the coefficient variance according to the Gauss-Markov Theorem. This represents a type of Linear

Regression where the dependent and independent variables exhibit a curvilinear relationship and

the polynomial equation is fitted to the data.

Types of Polynomial Regression

A second-degree polynomial equation is commonly called a quadratic equation; the degree, however,

can go up to any nth value. Here is the categorization of Polynomial Regression:

1. Linear – if degree as 1

2. Quadratic – if degree as 2

3. Cubic – if degree as 3 and goes on, on the basis of degree.

Assumption of Polynomial Regression


We cannot take just any dataset and use polynomial regression to make a

better judgment. We can still try, but the dataset should satisfy specific constraints in order

to get the best polynomial regression results.

 The behaviour of the dependent variable can be described by a linear or curvilinear, additive relationship between

the dependent variable and a set of k independent variables.

 The independent variables lack any interrelationship.

 We employ datasets featuring independently distributed errors with a normal distribution, having

a mean of zero and a constant variance.

Simple Math to Understand Polynomial Regression

Here we are dealing with mathematics; rather than going deep, just understand the basic structure.

The equation of a linear model is a straight line, y = B0 + B1x. If we have many

features, we opt for multiple regression, which simply adds more feature terms: y = B0 + B1x1 + … + Bnxn.

Polynomial regression, however, is not about adding features but about changing the structure to a polynomial (e.g. quadratic) equation:

y = B0 + B1x + B2x² + … + Bnxⁿ, as you can visually understand from the diagram:

Linear Regression vs Polynomial Regression


Rather than focusing on the distinctions between linear and polynomial regression, we may

comprehend the importance of polynomial regression by starting with linear regression. We build

our model and realize that it performs abysmally. We examine the difference between the actual

value and the best fit line we predicted, and it appears that the true value has a curve on the graph,

but our line is nowhere near cutting the mean of the points. This is where polynomial regression

comes into play; it predicts the best-fit line that matches the pattern of the data (curve).

One important distinction between Linear and Polynomial Regression is that Polynomial

Regression does not require a linear relationship between the independent and dependent variables

in the data set. When the Linear Regression Model fails to capture the points in the data and the

Linear Regression fails to adequately represent the optimum, then we use Polynomial Regression.

Before delving into the topic, let us first understand why we prefer Polynomial Regression over

Linear Regression in some situations, say the non-linear condition of the dataset, by programming

and visualization.

Non-linear data in Polynomial Regression


We need to enhance the model’s complexity to overcome under-fitting. In this sense, we need to

perform the linear analysis in a non-linear way, statistically, by using a polynomial such as:

y = B0 + B1x + B2x²

Because the weights associated with the features are still linear, this is still a linear model; x²

is simply treated as another input feature. However, the curve we’re trying to fit is quadratic in nature.
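A minimal sketch with scikit-learn's PolynomialFeatures; the quadratic toy data is generated here only for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data generated from a quadratic curve with a little noise
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() + 0.5 * X.ravel() ** 2 + np.random.normal(scale=0.2, size=30)

# degree=2 adds the x^2 column; the model itself is still linear in its weights
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict([[2.0]]))   # prediction on the fitted quadratic curve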

Overfitting vs Under-fitting

As we keep on increasing the degree, the fit on the training data improves, but then comes the over-fitting

problem, where the r2 value for a particular degree may even show 100 per cent.

When analyzing a dataset linearly, we encounter an under-fitting problem, and

polynomial regression can correct this.

However, when fine-tuning the degree parameter to the optimal value, we encounter an over-fitting

problem, resulting in a 100 per cent r2 value. The conclusion is that we must avoid both overfitting

and underfitting issues.

Note: To avoid over-fitting, we can increase the number of training samples so that the algorithm

does not learn the system’s noise and becomes more generalized.

Bias vs Variance Tradeoff

How do we pick the best model? To address this question, we must first comprehend the trade-off

between bias and variance.

The error due to the model's overly simple assumptions in fitting the data is referred to as bias. A high bias indicates that the model is unable to capture the patterns in the data, resulting in under-fitting.

The error caused by a complicated model trying to match the data too closely is referred to as variance. When a model has high variance, it passes through most of the data points, causing the model to overfit.

In the example above, when the degree is 1 (plain linear regression) the model under-fits, which means high bias and low variance. When the R² value reaches 100 per cent, we have low bias and high variance, which means overfitting.

As the model complexity grows, the bias reduces while the variance increases, and vice versa. A

machine learning model should, in theory, have minimal variance and bias. However, having both

is nearly impossible. As a result, a trade-off must be made in order to build a strong model that

performs well on both train and unseen data.


Degree – How to Find the Right One?

We need to find the right degree for the polynomial parameter in order to avoid overfitting and underfitting problems:

 Forward selection: start from a low degree and increase the degree parameter until you reach the optimal result.

 Backward selection: start from a high degree and decrease the degree parameter until you reach the optimal result (see the sketch below).
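A hedged sketch of such a degree sweep on synthetic data (scikit-learn and NumPy assumed); the degree where the validation score peaks is the one to keep:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = np.linspace(-3, 3, 80).reshape(-1, 1)
y = x.ravel() ** 3 - 2 * x.ravel() + rng.normal(scale=2.0, size=80)

x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=1)

# Sweep the degree and compare training score against validation score
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          round(model.score(x_train, y_train), 3),   # train R^2
          round(model.score(x_val, y_val), 3))       # validation R^2

# A degree whose train R^2 keeps rising while the validation R^2 falls is over-fitting.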

Loss and Cost Function – Polynomial Regression


The Cost Function is a function that evaluates a Machine Learning model's performance for a given set of data. The Cost Function is a single real number that measures the difference between the predicted and the actual values. Many people do not know the difference between the Cost Function and the Loss Function. To put it simply, the Cost Function is the average of the errors over the n samples in the data, whereas the Loss Function is the error for an individual data point. In other words, the Loss Function refers to a single training example, whereas the Cost Function refers to the complete training set.

The Mean Squared Error may also be used as the Cost Function of Polynomial regression;

however, the equation will vary somewhat.
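For reference, a standard way to write this cost for a degree-n polynomial model (not taken verbatim from these notes) is:

J(b_0, b_1, \dots, b_n) = \frac{1}{N} \sum_{i=1}^{N} \Big( y_i - \big( b_0 + b_1 x_i + b_2 x_i^2 + \dots + b_n x_i^n \big) \Big)^2

The squared term for a single index i is the loss for that example; averaging it over all N examples gives the cost.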

We now know that the Cost Function’s optimum value is 0 or a close approximation to 0. To get

an optimal Cost Function, we may use Gradient Descent, which changes the weight and, as a result,

reduces mistakes.

Gradient Descent – Polynomial Regression

Gradient descent is a method of determining the values of a function’s parameters (coefficients)

in order to minimize a cost function (cost). It may decrease the Cost function (minimizing MSE

value) and achieve the best fit line.

The values of the slope (m) and the intercept (b) will be set to 0 at the start, and the learning rate (α) will be introduced. The learning rate (α) is set to an extremely small number, perhaps between 0.01 and 0.0001. The learning rate is a tuning parameter in an optimization algorithm that sets the step size at each iteration as it moves toward the minimum of the cost function.


The partial derivative of the cost function is then determined with respect to m, as well as the derivative with respect to b.

Once the derivatives are determined, m and b are updated: each parameter is reduced by its derivative multiplied by the learning rate α.

The gradient indicates the direction of steepest ascent of the loss function, and the steepest descent is the opposite of the gradient, which is why the gradient (scaled by α) is subtracted from the weights (m and b). The process of updating the values of m and b continues until the cost function reaches or approaches its ideal value of 0. The final values of m and b are then the optimal values for the best fit line.
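A minimal NumPy sketch of this update rule for the straight-line case (m and b start at 0; the learning rate of 0.01 is an arbitrary choice for illustration):

import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0                         # slope and intercept start at 0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + b
        # Partial derivatives of the MSE cost with respect to m and b
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        db = (-2.0 / n) * np.sum(y - y_pred)
        # Move against the gradient (steepest descent), scaled by the learning rate
        m -= lr * dm
        b -= lr * db
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
print(gradient_descent(x, y))   # converges towards (2.0, 1.0)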

Application of Polynomial Regression

Polynomial regression is used to obtain results in various experimental techniques where the independent and dependent variables have a well-defined connection. For example, it is:

 Used to figure out what isotopes are present in sediments.

 Utilized to look at the spread of various illnesses across a population

 Research on creation of synthesis.

Advantage of Polynomial Regression

A polynomial provides the best approximation of the relationship between the dependent and independent variables. It can fit a wide range of functions, since a polynomial curve can accommodate a wide variety of curvatures.

Disadvantages of Polynomial Regression


One or two outliers in the data can have a significant impact on the outcome of the nonlinear analysis; polynomial fits are overly sensitive to outliers. Furthermore, there are fewer model validation methods for detecting outliers in nonlinear regression than there are for linear regression.

2.2.3 Logistic Regression

Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. Although it is used for classification, it is named logistic regression because it takes the output of a linear regression function as input and uses a sigmoid function to estimate the probability for the given class. The difference between linear regression and logistic regression is that the output of linear regression is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class or not.

It is used for predicting the categorical dependent variable using a given set of independent

variables.

 Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
 It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as
0 and 1, it gives the probabilistic values which lie between 0 and 1.
 Logistic Regression is very similar to Linear Regression except for how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
 In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
 The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
 Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
 Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
 The sigmoid function is a mathematical function used to map the predicted values to
probabilities.


 It maps any real value into another value within the range of 0 and 1.
 The value produced by logistic regression must be between 0 and 1; it cannot go beyond this limit, so it forms a curve like the “S” form.
 The S-form curve is called the Sigmoid function or the logistic function.
 In logistic regression, we use the concept of a threshold value, which defines the boundary between predicting 0 and 1: values above the threshold tend towards 1, and values below the threshold tend towards 0 (see the sketch below).
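A small sketch of the sigmoid and the threshold rule (NumPy assumed):

import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5: above -> class 1, below -> class 0
print(probs)    # approximately [0.018 0.269 0.5 0.731 0.982]
print(labels)   # [0 0 1 1 1]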
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “low”, “Medium”, or “High”.

Differences between Linear Regression and Logistic Regression:

1. Linear regression is used to predict a continuous dependent variable using a given set of independent variables, whereas logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
2. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
3. In linear regression we predict the value of a continuous variable; in logistic regression we predict the value of a categorical variable.
4. In linear regression we find the best fit line; in logistic regression we find an S-curve.
5. Linear regression uses the least squares estimation method to estimate accuracy; logistic regression uses the maximum likelihood estimation method.
6. The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value, such as 0 or 1, Yes or No, etc.
7. Linear regression requires a linear relationship between the dependent and independent variables; logistic regression does not require a linear relationship.
8. In linear regression there may be collinearity between the independent variables; in logistic regression there should not be collinearity between the independent variables.


2.3 Generalized linear models

Learning GLM lets you understand how we can use probability distributions as building blocks for

modeling. I assume you are familiar with linear regression and normal distribution.

Linear regression revisited

Linear regression is used to predict the value of continuous variable y by the linear combination of

explanatory variables X.

In the univariate case, linear regression can be expressed as follows:

y_i = b_0 + b_1 x_i + ε_i,  with ε_i ~ N(0, σ²)

Here, i indicates the index of each sample. Notice this model assumes normal distribution for the

noise term. The model can be illustrated as follows;

Linear regression illustrated


Poisson regression

So linear regression is all you need to know? Definitely not. If you’d like to apply statistical

modeling in real problems, you must know more than that.

For example, assume you need to predict the number of defect products (Y) with a sensor value (x)

as the explanatory variable. The scatter plot looks like this.

There are several problems if you try to apply linear regression for this kind of data.

1. The relationship between X and Y does not look linear. It’s more likely to be exponential.

2. The variance of Y does not look constant with regard to X. Here, the variance of Y seems to

increase when X increases.

3. As Y represents the number of products, it always has to be a positive integer. In other words,

Y is a discrete variable. However, the normal distribution used for linear regression assumes


continuous variables. This also means the prediction by linear regression can be negative. It’s

not appropriate for this kind of count data.

Here, the more proper model you can think of is the Poisson regression model. Poisson regression

is an example of generalized linear models (GLM).

There are three components in generalized linear models.

1. Linear predictor

2. Link function

3. Probability distribution
In the case of Poisson regression, it is formulated like this: the linear predictor is z_i = b_0 + b_1 x_i, the log link sets log(λ_i) = z_i, and the observed count is generated as y_i ~ Poisson(λ_i).

Linear predictor is just a linear combination of parameter (b) and explanatory variable (x).

Link function literally “links” the linear predictor and the parameter for probability distribution.

In the case of Poisson regression, the typical link function is the log link function. This is because
the parameter for Poisson regression must be positive (explained later).

The last component is the probability distribution which generates the observed variable y. As we

use Poisson distribution here, the model is called Poisson regression.

Poisson distribution is used to model count data. It has only one parameter, which stands for both the mean and the variance of the distribution. This means the larger the mean, the larger the spread of the distribution. See below.


Poisson distribution with mean=1, 5, 10

Now, let’s apply Poisson regression to our data. The result should look like this.

Poisson regression illustrated

The magenta curve is the prediction by Poisson regression. I added the bar plot of the probability

mass function of Poisson distribution to make the difference from linear regression clear.

The prediction curve is exponential, as the inverse of the log link function is the exponential function. From this, it is also clear that the parameter for Poisson regression calculated from the linear predictor is guaranteed to be positive.


Inverse of log link function
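As a hedged sketch of how such a Poisson GLM might be fitted in Python (statsmodels is an assumption here; the count data is synthetic, not the defect data from the plots):

import numpy as np
import statsmodels.api as sm

# Synthetic count data: the mean grows exponentially with the sensor value x
rng = np.random.RandomState(0)
x = np.linspace(0, 2, 100)
lam = np.exp(0.5 + 1.2 * x)                # true mean, always positive
y = rng.poisson(lam)

# Poisson regression = GLM with a Poisson distribution and its default log link
X = sm.add_constant(x)                     # linear predictor b0 + b1 * x
model = sm.GLM(y, X, family=sm.families.Poisson())
result = model.fit()
print(result.params)                       # estimates of b0 and b1
print(result.predict(X)[:5])               # predicted means, guaranteed positive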

2.3.1 Other typical GLM

1. Normal distribution: identity function

2. Poisson distribution: log function


3. Binomial distribution: logit function

Linear regression is also an example of GLM. It just uses identity link function (the linear

predictor and the parameter for the probability distribution are identical) and normal

distribution as the probability distribution.

Linear regression

If you use the logit function as the link function and the binomial / Bernoulli distribution as the probability distribution, the model is called logistic regression:

log(p_i / (1 − p_i)) = b_0 + b_1 x_i,  with y_i ~ Bernoulli(p_i)

If you represent the linear predictor with z, the above equation is equivalent to the following:

p_i = 1 / (1 + e^(−z_i))


Logistic function

The right-hand side of the second equation is called logistic function. Therefore, this model is called

logistic regression.

As the logistic function returns values between 0 and 1 for arbitrary inputs, it is a proper link

function for the binomial distribution.

Logistic regression is used mostly for binary classification problems. Below is an example to fit

logistic regression to some data.

Logistic regression illustrated


2.4 Support Vector Machines


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) from each class. On the basis of the support vectors, it will classify the creature as a cat. Consider the diagram below:


SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Support Vector Machine Terminology

1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points of
different classes in a feature space. In the case of linear classifications, it will be a linear
equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, which makes
a critical role in deciding the hyperplane and margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The wider
margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original input
data points into high-dimensional feature spaces, so, that the hyperplane can be easily found
out even if the data points are not linearly separable in the original input space. Some of the
common kernel functions are linear, polynomial, radial basis function(RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a
soft margin technique. Each data point has a slack variable introduced by the soft-margin
SVM formulation, which softens the strict margin requirement and permits certain
misclassifications or violations. It discovers a compromise between increasing the margin
and reducing violations.
7. C: Margin maximisation and misclassification penalties are balanced by the regularisation
parameter C in SVM. It decides the penalty for violating the margin or misclassifying data
items. A stricter penalty is imposed with a greater value of C, which results in a
smaller margin and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently formed by
combining it with the regularisation term.
9. Dual Problem: A dual Problem of the optimisation problem that requires locating the
Lagrange multipliers related to the support vectors can be used to solve SVM. The dual
formulation enables the use of kernel tricks and more effective computing.
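A minimal scikit-learn sketch tying a few of these terms together (the linear kernel, the C parameter, and the fitted support vectors); the blob data is synthetic:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: a linearly separable toy problem
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# Linear kernel; C balances a wide margin against penalties for violations
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors per class:", clf.n_support_)      # the closest points to the hyperplane
print("A few support vectors:")
print(clf.support_vectors_[:3])
print("Predictions for two training points:", clf.predict(X[:2]))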

2.4.1 Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data and the classifier used is called the Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-


dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of
the hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence
called a Support vector.

How does SVM works?

Linear SVM:


The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both the
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.


Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated
as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:


So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in
2d space with z=1, then it will become as:


Hence we get a circular decision boundary of radius 1 in the case of non-linear data.
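A sketch of this idea in code: first lifting the points explicitly with z = x² + y², then letting an RBF-kernel SVM do an equivalent lifting implicitly via the kernel trick (scikit-learn assumed; the circular data is synthetic):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Points on two concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit third dimension z = x^2 + y^2: the classes become separable by a plane
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3d = np.hstack([X, z])
print("Linear SVM on the lifted 3-D data:",
      SVC(kernel="linear").fit(X3d, y).score(X3d, y))

# The RBF kernel performs a similar lifting implicitly (the kernel trick)
print("RBF-kernel SVM on the original 2-D data:",
      SVC(kernel="rbf").fit(X, y).score(X, y))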

2.5 Binary Classification

There are perhaps four main types of classification tasks that you may encounter; they are:

 Binary Classification
 Multi-Class Classification
 Multi-Label Classification
 Imbalanced Classification

2.5.1 Binary Classification


Binary classification refers to those classification tasks that have two class labels.
Examples include:

 Email spam detection (spam or not).


 Churn prediction (churn or not).
 Conversion prediction (buy or not).


Typically, binary classification tasks involve one class that is the normal state and another class
that is the abnormal state.

For example “not spam” is the normal state and “spam” is the abnormal state. Another example is
“cancer not detected” is the normal state of a task that involves a medical test and “cancer
detected” is the abnormal state.
The class for the normal state is assigned the class label 0 and the class with the abnormal state is
assigned the class label 1.

It is common to model a binary classification task with a model that predicts a Bernoulli probability
distribution for each example.
The Bernoulli distribution is a discrete probability distribution that covers a case where an event
will have a binary outcome as either a 0 or 1. For classification, this means that the model predicts
a probability of an example belonging to class 1, or the abnormal state.

Popular algorithms that can be used for binary classification include:

 Logistic Regression
 k-Nearest Neighbors
 Decision Trees
 Support Vector Machine
 Naive Bayes
Some algorithms are specifically designed for binary classification and do not natively support
more than two classes; examples include Logistic Regression and Support Vector Machines.


Scatter Plot of Binary Classification Dataset
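A minimal binary-classification sketch on synthetic data (scikit-learn assumed), showing the per-example Bernoulli-style probabilities mentioned above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class problem: label 0 = normal state, label 1 = abnormal state
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))

# predict_proba returns P(class 0) and P(class 1) for each example,
# i.e. the probability of the normal and the abnormal state
print("Probabilities for the first test example:", clf.predict_proba(X_test[:1]))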

Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two class labels.
Examples include:
 Face classification.
 Plant species classification.
 Optical character recognition.

Unlike binary classification, multi-class classification does not have the notion of normal and
abnormal outcomes. Instead, examples are classified as belonging to one among a range of known
classes.

The number of class labels may be very large on some problems. For example, a model may predict
a photo as belonging to one among thousands or tens of thousands of faces in a face recognition
system.

Problems that involve predicting a sequence of words, such as text translation models, may also
be considered a special type of multi-class classification. Each word in the sequence of words to
be predicted involves a multi-class classification where the size of the vocabulary defines the
number of possible classes that may be predicted and could be tens or hundreds of thousands of
words in size.

It is common to model a multi-class classification task with a model that predicts a Multinoulli
probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers a case where an event
will have a categorical outcome, e.g. K in {1, 2, 3, …, K}. For classification, this means that the
model predicts the probability of an example belonging to each class label.
Many algorithms used for binary classification can be used for multi-class classification.

Popular algorithms that can be used for multi-class classification include:

 k-Nearest Neighbors.
 Decision Trees.
 Naive Bayes.
 Random Forest.
 Gradient Boosting.


Algorithms that are designed for binary classification can be adapted for use for multi-class
problems.

This involves using a strategy of fitting multiple binary classification models for each class vs. all
other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-one).

 One-vs-Rest: Fit one binary classification model for each class vs. all other classes.
 One-vs-One: Fit one binary classification model for each pair of classes.
Binary classification algorithms that can use these strategies for multi-class classification include:

 Logistic Regression.
 Support Vector Machine.

Scatter Plot of Multi-Class Classification Dataset
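A small sketch of both strategies on a synthetic three-class problem (scikit-learn assumed):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One-vs-Rest: one binary logistic regression per class vs. all other classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("One-vs-Rest binary models fitted:", len(ovr.estimators_))   # 3

# One-vs-One: one binary SVM per pair of classes, i.e. K*(K-1)/2 models
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print("One-vs-One binary models fitted:", len(ovo.estimators_))    # 3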

Multi-Label Classification
Multi-label classification refers to those classification tasks that have two or more class labels,
where one or more class labels may be predicted for each example.
Consider the example of photo classification, where a given photo may have multiple objects in
the scene and a model may predict the presence of multiple known objects in the photo, such as
“bicycle,” “apple,” “person,” etc.
This is unlike binary classification and multi-class classification, where a single class label is
predicted for each example.
It is common to model multi-label classification tasks with a model that predicts multiple outputs, with each output predicted as a Bernoulli probability distribution. This is essentially a model that makes multiple binary classification predictions for each example.


Classification algorithms used for binary or multi-class classification cannot be used directly for
multi-label classification. Specialized versions of standard classification algorithms can be used,
so-called multi-label versions of the algorithms, including:

 Multi-label Decision Trees


 Multi-label Random Forests
 Multi-label Gradient Boosting
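A minimal multi-label sketch on synthetic data (scikit-learn assumed; its random forest accepts a multi-label target natively, standing in for the "multi-label random forest" idea above):

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# Each example can carry several labels at once (e.g. "bicycle", "apple", "person")
X, Y = make_multilabel_classification(n_samples=300, n_features=10,
                                      n_classes=4, random_state=0)
print("Label indicator matrix shape:", Y.shape)   # (300, 4): one binary column per label

# scikit-learn's random forest accepts such a multi-label target directly
clf = RandomForestClassifier(random_state=0).fit(X, Y)
print("Predicted label vector for one example:", clf.predict(X[:1]))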

Imbalanced Classification
Imbalanced classification refers to classification tasks where the number of examples in each class
is unequally distributed.

Typically, imbalanced classification tasks are binary classification tasks where the majority of
examples in the training dataset belong to the normal class and a minority of examples belong to
the abnormal class.

Examples include:

 Fraud detection.
 Outlier detection.
 Medical diagnostic tests.

These problems are modeled as binary classification tasks, although may require specialized
techniques.

Specialized techniques may be used to change the composition of samples in the training dataset
by undersampling the majority class or oversampling the minority class.

Examples include:

 Random Undersampling.
 SMOTE Oversampling.
Specialized modeling algorithms may be used that pay more attention to the minority class when
fitting the model on the training dataset, such as cost-sensitive machine learning algorithms.

Examples include:

 Cost-sensitive Logistic Regression.


 Cost-sensitive Decision Trees.
 Cost-sensitive Support Vector Machines.


Finally, alternative performance metrics may be required as reporting the classification accuracy
may be misleading.

Examples include:

 Precision.
 Recall.
 F-Measure.
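A hedged sketch that combines a simple cost-sensitive model (class weighting in logistic regression) with precision, recall and the F-measure on an imbalanced synthetic dataset (scikit-learn assumed):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced problem: roughly 95% normal class (0) and 5% abnormal class (1)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes minority-class mistakes more heavily,
# a simple form of cost-sensitive logistic regression
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Precision, recall and F-measure per class are more informative than plain accuracy here
print(classification_report(y_test, clf.predict(X_test)))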

2.5.2 Structured Learning

What is structured learning?


Structured prediction is a generalization of the standard paradigms of supervised learning,
classification and regression. All of these can be thought of finding a function that minimizes some
loss over a training set. The differences are in the kind of functions that are used and the losses.
Or
Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values.

In classification, the target domain are discrete class labels, and the loss is usually the 0-1 loss, i.e.
counting the misclassifications. In regression, the target domain is the real numbers, and the loss
is usually mean squared error. In structured prediction, both the target domain and the loss are
more or less arbitrary. This means the goal is not to predict a label or a number, but a possibly
much more complicated object like a sequence or a graph.
In structured prediction, we often deal with finite, but large output spaces Y. This situation could
be dealt with using classification with a very large number of classes. The idea behind structured
prediction is that we can do better than this, by making use of the structure of the output space.

A (very simplified) example


Let’s say we want to generate text from spoken sentences. Viewed as a pure classification problem,
we could see each possible sentence as a class. This has several drawbacks: we have many classes,
and to do correct predictions, we have to have all possible sentences in the training set. That doesn’t
work well. Also, we might not care about getting the sentence completely right.
If we misinterpret a single word, this might not be as bad as misinterpreting every word. So a 0-1 loss on sentences seems inappropriate. We could also try to view every word as a separate class and try to predict each word individually. This seems somewhat better, since we could learn to get most of the words in a sentence right. On the other hand, we lose all context. So for example the


expression “car door” is way more likely than “car boar”, while predicted individually these could
be easily confused. For a similar example, see OCR Letter sequence recognition.
Structured prediction tries to overcome these problems by considering the output (here the
sentence) as a whole and using a loss function that is appropriate for this domain.

A formalism
I hope I have convinced you that structured prediction is a useful thing. So how are we going to
formalize this? Having functions that produce arbitrary objects seems a bit hard to handle. There is one very basic formula at the heart of structured prediction:

y* = argmax_{y ∈ Y} f(x, y)

Here x is the input, Y is the set of all possible outputs and f is a compatibility function that says
how well y fits the input x. The prediction for x is y*, the element of Y that maximizes the
compatibility.
This very simple formula allows us to predict arbitrarily complex outputs, as long as we can say
how compatible a given output is with the input.
This approach opens up two questions:
How do we specify f ? How do we compute y*?
As I said above, the output set Y is usually a finite but very large set (all graphs, all sentences in
the English language, all images of a given resolution). Finding the argmax in the above equation
by exhaustive search is therefore out of the question. We need to restrict ourselves to f such that
we can do the maximization over y efficiently. The most popular tool for building such f is using
energy functions or conditional random fields (CRFs).
There are basically three challenges in doing structured learning and prediction:

 Choosing a parametric form of f.
 Solving the prediction problem y* = argmax_{y ∈ Y} f(x, y).
 Learning parameters for f to minimize a loss.

PyStruct takes f to be a linear function of some parameters w and a joint feature function joint_feature(x, y):

f(x, y) = w^T joint_feature(x, y)

So the prediction is given by

y* = argmax_{y ∈ Y} w^T joint_feature(x, y)

Here w are the parameters that are learned from data, and joint_feature is defined by the user-specified structure of the model. The definition of joint_feature is given by the Models. PyStruct


assumes that y is a discrete vector, and most models in PyStruct assume a pairwise decomposition of the energy f over entries of y, that is, f is a sum of terms that each depend on a single entry of y or on a pair of entries of y.

2.5.3 MNIST Dataset

The MNIST (Modified National Institute of Standards and Technology) database is a large
database of handwritten numbers or digits that are used for training various image processing
systems. The dataset also widely used for training and testing in the field of machine learning.
The set of images in the MNIST database are a combination of two of NIST's databases: Special
Database 1 and Special Database 3.

The MNIST dataset has 60,000 training images and 10,000 testing images.

The MNIST dataset can be found online, and it is essentially a database of various handwritten digits. The MNIST dataset has a large amount of data and is commonly used to demonstrate the real power of deep neural networks. Our brain and eyes work together to recognize any numbered image. Our mind is a potent tool, capable of categorizing any image quickly. There are many shapes a digit can take, and our mind can easily recognize these shapes and determine what number it is, but the same task is not simple for a computer to complete. The way to do this is to use a deep neural network, which allows us to train a computer to classify handwritten digits effectively.


So far, we have only dealt with data consisting of simple data points on a Cartesian coordinate system, and we have worked with binary-class datasets. The sigmoid activation function is quite useful for classifying binary datasets, as it is effective in squeezing values between 0 and 1, but it is not effective for multiclass datasets. For multiclass datasets such as MNIST, we instead use the softmax activation function, which is capable of dealing with them.

The major difference between the datasets that we have used before and the MNIST dataset is the method in which MNIST data is input into a neural network.

In the perceptron model and the linear regression model, each data point was defined by a simple x and y coordinate. This means that the input layer needed only two nodes to input a single data point.

In the MNIST dataset, a single data point comes in the form of an image. The images in the MNIST dataset are typically 28*28 pixels: 28 pixels along the horizontal axis and 28 pixels along the vertical axis. This means that a single image from the MNIST database has a total of 784 pixels that must be analyzed, so the input layer of our neural network has 784 nodes, one for each pixel of an image.


Here, we will see how to create a function that is a model for recognizing handwritten digits by
looking at each pixel in the image. Then using TensorFlow to train the model to predict the image
by making it look at thousands of examples which are already labeled. We will then check the
model's accuracy with a test dataset.

The MNIST dataset in TensorFlow contains information on handwritten digits split into three parts:

o Training Data (mnist.train) - 55,000 data points
o Validation Data (mnist.validate) - 5,000 data points
o Test Data (mnist.test) - 10,000 data points
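A minimal sketch of training such a classifier with the Keras API bundled with TensorFlow (an assumption here; note that tf.keras loads MNIST as a 60,000/10,000 train/test split, and a validation set is carved out of the training data rather than provided as mnist.validate):

import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 test images of 28x28 pixels
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0      # scale pixel values to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),     # 784 input nodes, one per pixel
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # one probability per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print("Test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])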

2.5.4 Ranking
Ranking is a type of machine learning that sorts data in a relevant order. Companies use ranking
to optimize search and recommendations.

What is a ranking model?

Ranking is a type of supervised machine learning (ML) that uses labeled datasets to train its data
and models to classify future data to predict outcomes. Quite simply, the goal of a ranking model
is to sort data in an optimal and relevant order.
Ranking was first largely deployed within search engines. People search for a topic, while the
ranking algorithm reorders search results based on the PageRank, and the search engine is able to
display the most relevant results to its customers.
Until recently, most ranking models, and ML as a whole, were limited in their scope of use, as most companies didn't have enough data to power these algorithms. Better methods for data collection and more intuitive ML tools have made it possible for nearly anyone to deploy a successful ranking model within their business.


How does ranking work?

As we’ll discuss later in this blog, ranking is incredibly versatile and dependent on the data a
company has. Even so, a common framework guides the construction of all ranking models.
Ranking models are made up of 2 main factors: queries and documents. Queries are any input
value, such as a question on Google or an interaction on an e-commerce site. Documents are the
output value or results of the query. Given the query, and the associated documents, a function,
given a list of parameters to rank on, will score the documents to be sorted in order of relevancy.
The machine learning algorithm learning to rank takes the scores from this model, and uses them
to predict future outcomes on a new and unseen list of documents.

As an example, a search for “Mage” is done on Google Search (“Mage” is the query). After the
search, a list of associated documents matching the query will be displayed (Mage A.I., Mage
definition, Mage World of Warcraft, etc.). The function will score each of the documents based on
their relevance to the query (Mage A.I. = 1, Mage definition = 2, Mage World of Warcraft =3, and
so on). The documents with higher scores will be ranked higher when there is a search for Mage.
Data required for a ranking model consists of documents from a query, user profiles, user
behaviors, search history, clicks, etc.

Why should I care?

Ranking ensures that the most relevant results appear first on a customer’s search, maximizing the
chances they will find something of interest, and minimizing the chances of churn. With so many
options for organic web search, the need to stay competitive has never been greater. According to
a Google study, 61% of users said if they didn’t find what they were looking for right away, they
would quickly move on to another site. Depending on available data, companies can use ranking
within their web pages and apps to serve their customers the most relevant results as soon as they
enter.


Use cases:

The most successful companies are using ranking within their software to improve the user
experience. Ranking has allowed these companies to create customized feeds for each user based
on their past search and buying history. Ranking carries many use cases across industries, nearly
anyone with data can and should be using ranking in some capacity to optimize their business. A
few use cases are:

1. Search results
2. Targeted ads
3. Recommendations

Here are a few companies who have used ranking to maximize user engagement.

 Amazon
With millions of listings or documents, for every product search or query, Amazon needed
to find a way to rank its products in order to maximize the chance of purchase. Using a
combination of individual preferences, gathered from users' search and purchasing history
and a product’s popularity, Amazon created a ranking system that would display the most
relevant products at the top of their feed. Additionally, ranking was used in Amazon’s
recommendation system, which would use users' ranked preferences in order to predict
what products a user is most likely to purchase in the future.
 Netflix
Similar to Amazon, Netflix uses ranking to fuel their recommendation system. The
recommendation system predicts what content a user is most likely to watch and displays
the most relevant content at the top of the home page. Netflix uses a few different features
to rank and recommend content; such as: watch history, search history, and general
popularity. They also use ranking to fuel their collaborative filtering.
 TikTok
TikTok’s standout feature is the For You page which is built on a ranking system. This
feature has allowed TikTok to customize each home page to be reflective of the preferences
and interests of its user. TikTok uses similar metrics to Netflix to rank its content: watch
history, re-watch rate, and engagement. Similar to Netflix, TikTok’s ranking system also
aids in collaborative filtering.

 Starbucks
Starbucks found great success with their mobile app, which is one of the most downloaded apps
on the App Store. The app allows Starbucks to create a custom user experience for their customers
even when they’re not within a physical coffee shop. The app uses ranking to recommend the most
relevant products to users. Taking into account order history, new products and general popularity
of other products, Starbucks is able to keep customers' favorite orders at the top of the
recommended search while introducing them to new products that they are most likely to enjoy.

The fastest way to build a ranking model


For the companies listed above, entire teams of data scientists and AI engineers were built to create
and maintain the ranking systems in place. The cost to build these teams is impractical for most
businesses. Recently, there have been great tools emerging which allow for the easy building and
deployment of ranking models–this with little to no programming experience.
Mage allows for the building and deployment of a ranking model with no ML programming
knowledge. To use Mage, a database containing a list of queries and documents is first uploaded.
Queries could contain a list of clothes or menu items, their documents could be the number of
engagement (clicks and purchases) each received. The greater the quality and quantity of data
uploaded, the better that Mage is able to produce ranking predictions.
Once the data is uploaded, users will be given the option to transform their datasets by removing
and adding columns, applying transformer actions: split and filter data, group values, aggregate
data, and identifying what columns they would like to rank. Mage will then produce a ranking
model which can be deployed into your data warehouses, downloaded to a CSV file, or saved
directly to a Mage dataset.
