
UNIT V NON PARAMETRIC MACHINE LEARNING

k-Nearest Neighbors – Decision Trees – Branching – Greedy Algorithm – Multiple Branches – Continuous attributes – Pruning. Random Forests: ensemble learning. Boosting – AdaBoost algorithm. Support Vector Machines – Large Margin Intuition – Loss Function – Hinge Loss – SVM Kernels

Non-Parametric Machine Learning


k- Nearest Neighbors

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised
learning classifier, which uses proximity to make classifications or predictions about the grouping of
an individual data point. While it can be used for either regression or classification problems, it is
typically used as a classification algorithm, working off the assumption that similar points can be
found near one another.
For classification problems, a class label is assigned on the basis of a majority vote— i.e. the label
that is most frequently represented around a given data point is used. While this is technically
considered “plurality voting”, the term, “majority vote” is more commonly used in literature. The
distinction between these terminologies is that “majority voting” technically requires a majority of
greater than 50%, which primarily works when there are only two categories. When you have
multiple classes—e.g. four categories, you don’t necessarily need 50% of the vote to make a
conclusion about a class; you could assign a class label with a vote of greater than 25%.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to those of the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1; which of these categories will this data point lie in? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data points to that category for which the number of
the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.

Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as:

Euclidean distance = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance, we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K reduce the effect of noise, but a value that is too large can smooth over local patterns and increase the computation.

Distance Metrics Used in KNN Algorithm


Distance metrics are crucial for calculating the similarity between data points in KNN. Here are the
commonly used metrics:
1. Euclidean Distance
Euclidean distance is the most common distance metric, calculated as the straight-line distance between
two points in Euclidean space. For two points P(x1,y1) and Q(x2,y2) , the Euclidean distance is calculated

as:

d(P, Q) = √((x2 − x1)² + (y2 − y1)²)
Euclidean distance is suitable for continuous variables and is easy to compute, making it a popular
choice in KNN.
2. Manhattan Distance
Manhattan distance (or L1 distance) measures the distance between two points along the axes at right
angles. For two points P(x1,y1) and Q(x2,y2), the Manhattan distance is calculated as: d(P, Q) = |x2 − x1| + |y2 − y1|

It is useful for grid-like paths (e.g., city blocks) and is often employed when variables are more
discrete.
3. Minkowski Distance
Minkowski distance is a generalization of both Euclidean and Manhattan distances. It is defined as: d(P, Q) = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)

When p=2, it becomes Euclidean distance, and when p=1, it is equivalent to Manhattan distance.
Minkowski distance provides flexibility by adjusting the value of p for different scenarios.
Choosing the appropriate distance metric depends on the data type and the specific problem at hand.
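As a quick illustration, here is a minimal Python sketch of the three metrics above (the function names are illustrative, not from any particular library):

import math

def euclidean_distance(p, q):
    # Straight-line (L2) distance between two equal-length points
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p, q):
    # City-block (L1) distance
    return sum(abs(qi - pi) for pi, qi in zip(p, q))

def minkowski_distance(p, q, power=2):
    # Generalization: power=1 gives Manhattan, power=2 gives Euclidean
    return sum(abs(qi - pi) ** power for pi, qi in zip(p, q)) ** (1 / power)

print(euclidean_distance((1, 2), (4, 6)))     # 5.0
print(manhattan_distance((1, 2), (4, 6)))     # 7
print(minkowski_distance((1, 2), (4, 6), 1))  # 7.0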
Algorithm for K-Nearest Neighbor (KNN)
Here’s a simplified version of the KNN algorithm:
Algorithm Steps:
1. Select the number of neighbors k.
2. Calculate the distance between the query point and all other points in the dataset using a chosen
distance metric.
3. Sort the distances in ascending order and select the top k-nearest neighbors.
4. For classification: Assign the query point the class of the majority of its neighbors.
5. For regression: Predict the value of the query point as the average of the k-nearest neighbors.
Pseudo-code:
def knn(query_point, dataset, k):
    distances = []
    for point in dataset:
        distance = compute_distance(query_point, point)
        distances.append((point, distance))
    distances.sort(key=lambda x: x[1])  # Sort based on distance
    neighbors = distances[:k]
    prediction = majority_vote(neighbors)  # or average for regression
    return prediction
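The pseudo-code above assumes two helpers, compute_distance and majority_vote. A minimal sketch of one possible implementation is shown below (it assumes each dataset entry is a (features, label) pair, uses Euclidean distance, and picks the label by plurality vote):

import math
from collections import Counter

def compute_distance(query_point, point):
    # Euclidean distance between the query and the feature part of a (features, label) pair
    return math.sqrt(sum((q - x) ** 2 for q, x in zip(query_point, point[0])))

def majority_vote(neighbors):
    # neighbors is a list of ((features, label), distance) tuples; return the most common label
    labels = [point[1] for point, _ in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Example usage with the knn function above
dataset = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.2), "B"), ((4.8, 5.1), "B")]
print(knn((1.1, 0.9), dataset, k=3))  # expected: "A"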

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which can be complex at times.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.
Applications of k-NN in machine learning
The k-NN algorithm has been utilized within a variety of applications,
largely within classification. Some of these use cases include:
- Data preprocessing: Datasets frequently have missing values, but the KNN algorithm can estimate those values in a process known as missing data imputation.

- Recommendation Engines: Using clickstream data from websites, the


KNN algorithm has been used to provide automatic recommendations to
users on additional content.

- Finance: It has also been used in a variety of finance and economic use
cases.

- Healthcare: KNN has also had application within the healthcare industry,
making predictions on the risk of heart attacks and prostate cancer. The
algorithm works by calculating the most likely gene expressions.

- Pattern Recognition: KNN has also assisted in identifying patterns, such as


in text and digit classification

Decision tree
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes. It works like a flowchart, helping to make decisions step by step, where:
 Internal nodes represent attribute tests
 Branches represent attribute values
 Leaf nodes represent final decisions or predictions.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.

The decisions or tests are performed on the basis of the features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
 A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
 Below diagram explains the general structure of a decision tree:

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.

Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub- nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.

Types of Decision Tree


 ID3: This algorithm measures how mixed up the data is at a node using something called entropy. It then chooses the feature that helps to clarify the data the most.
 C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.
 CART: This algorithm uses a different measure called Gini impurity to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. The algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The
complete process can be better understood using the below algorithm:

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is then called a leaf node.
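As a quick, hedged illustration of these steps in practice, the sketch below fits scikit-learn's decision tree (which implements CART) to a small made-up dataset; the feature values, labels and parameter settings are purely illustrative:

from sklearn.tree import DecisionTreeClassifier

# Toy dataset: each row is [age, income]; label 1 = "buys", 0 = "does not buy"
X = [[25, 30], [32, 60], [40, 80], [22, 20], [55, 90], [48, 40]]
y = [0, 1, 1, 0, 1, 0]

# criterion="entropy" selects splits by information gain; "gini" uses the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict([[30, 70]]))  # predicted class for a new record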
Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A, then

Gain(S, A) = Entropy(S) − Σ over v in Values(A) of (|Sv| / |S|) × Entropy(Sv)
Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
Definition: If S is a set of instances and pi is the proportion of instances in S that belong to class i, then

Entropy(S) = − Σ over classes i of pi × log2(pi)
Building a Decision Tree using Information Gain
The essentials:
 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that would be
classified down that path in the tree.
 If all positive or all negative training instances remain, label that node "yes" or "no" accordingly
 If no attributes remain, label with a majority vote of training instances left at that
node
 If no instances remain, label with a majority vote of the parent’s training
instances.
Example: Now let us draw a Decision Tree for the following data using Information gain. Training
set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

Here, we have 3 features and 2 output classes. To build a decision tree using information gain, we will take each of the features and calculate the information gain for each feature.

Entropy of the full set S (2 instances of class I and 2 of class II):
Entropy(S) = −(2/4)·log2(2/4) − (2/4)·log2(2/4) = 1
Splitting on X: X = 1 gives {I, I, II} (entropy ≈ 0.918) and X = 0 gives {II} (entropy 0), so Gain(X) = 1 − (3/4)(0.918) ≈ 0.31
Splitting on Y: Y = 1 gives {I, I} and Y = 0 gives {II, II}, both pure, so Gain(Y) = 1 − 0 = 1
Splitting on Z: Z = 1 gives {I, II} and Z = 0 gives {I, II}, both with entropy 1, so Gain(Z) = 1 − 1 = 0
From these calculations we can see that the information gain is maximum when we make a split on feature Y. So, for the root node, the best-suited feature is feature Y. While splitting the dataset by feature Y, each child contains a pure subset of the target variable, so we don't need to split the dataset any further. The final tree therefore tests Y at the root, with a leaf predicting class I for Y = 1 and a leaf predicting class II for Y = 0.
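A short plain-Python sketch that reproduces the information-gain calculation above for this training set:

import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(rows, feature, target="C"):
    # Gain(S, A) = Entropy(S) - sum over v of (|Sv|/|S|) * Entropy(Sv)
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    for v in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

data = [{"X": 1, "Y": 1, "Z": 1, "C": "I"},
        {"X": 1, "Y": 1, "Z": 0, "C": "I"},
        {"X": 0, "Y": 0, "Z": 1, "C": "II"},
        {"X": 1, "Y": 0, "Z": 0, "C": "II"}]

for f in ("X", "Y", "Z"):
    print(f, round(information_gain(data, f), 3))  # X 0.311, Y 1.0, Z 0.0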

2. Gini Index
 Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified.
 It means an attribute with a lower Gini index should be preferred.
 Sklearn supports “Gini” criteria for Gini Index and by default, it takes “gini” value.
For example, if we have a group of people where all bought the product (100% "Yes"), the Gini Index is 0, indicating perfect purity. But if the group has an equal mix of "Yes" and "No", the Gini Index would be 0.5, showing high impurity or uncertainty. The formula for the Gini Index is:

Gini(S) = 1 − Σ pi², where pi is the proportion of instances in S that belong to class i
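A minimal sketch that checks this formula on the two example groups just mentioned:

def gini_index(labels):
    # Gini(S) = 1 - sum of p_i^2 over the classes present in S
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini_index(["Yes"] * 10))              # 0.0 -> perfectly pure group
print(gini_index(["Yes"] * 5 + ["No"] * 5))  # 0.5 -> equal mix, high impurity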

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process which a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Branching

A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. It can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically.

A decision tree typically starts with a single node, which branches into possible
outcomes. Each of those outcomes leads to additional nodes, which branch off into
other possibilities. This gives it a treelike shape.

There are three different types of nodes: chance nodes, decision nodes, and end nodes. A
chance node, represented by a circle, shows the probabilities of certain results. A
decision node, represented by a square, shows a decision to be made, and an end node
shows the final outcome of a decision path.

Decision trees can also be drawn with flowchart symbols, which some people find easier
to read and understand.

Decision tree symbols

Decision node (square): Indicates a decision to be made.
Chance node (circle): Shows multiple uncertain outcomes.
Alternative branches: Each branch indicates a possible outcome or action.
Rejected alternative: Shows a choice that was not selected.
Endpoint node (triangle): Indicates a final outcome.

How to draw a decision tree

To draw a decision tree, first pick a medium. You can draw it by hand on paper or a
whiteboard, or you can use special decision tree software. In either case, here are the
steps to follow:

1. Start with the main decision. Draw a small box to represent this point,
then draw a line from the box to the right for each possible solution or
action. Label them accordingly.

2. Add chance and decision nodes to expand the tree as follows:

 If another decision is necessary, draw another box.


 If the outcome is uncertain, draw a circle (circles represent chance
nodes). If the problem is solved, leave it blank (for now)

From each decision node, draw possible solutions. From each chance node, draw lines representing
possible outcomes. If you intend to analyze your options numerically, include the probability of
each outcome and the cost of each action.
3. Continue to expand until every line reaches an endpoint, meaning that there are no more choices to be made or chance outcomes to consider. Then, assign a value to each possible outcome. It could be an abstract score or a financial value. Add triangles to signify endpoints.
With a complete decision tree, you’re now ready to begin analyzing the decision you face.
Advantages and disadvantages
Decision trees remain popular for reasons like these:
 They are easy to understand
 They can be useful with or without hard data, and any data requires minimal preparation
 New options can be added to existing trees
 They help in picking out the best of several options
 They combine easily with other decision-making tools
However, decision trees can become excessively complex. In such cases, a more compact
influence diagram can be a good alternative. Influence diagrams narrow the focus to critical
decisions, inputs, and objectives.
Multiple Branches
Multiclass classification is a popular problem in supervised machine learning.
Problem – Given a dataset of m training examples, each of which contains information
in the form of various features and a label. Each label corresponds to a class, to which the
training example belongs. In multiclass classification, we have a finite set of classes.
Each training example also has n features.
For example, in the case of identification of different types of fruits, "Shape", "Color", and "Radius" can be features, and "Apple", "Orange", and "Banana" can be different class labels.
A decision tree classifier is a systematic approach for multiclass classification. It poses a
set of questions to the dataset (related to its attributes/features). The decision tree
classification algorithm can be visualized on a binary tree. On the root and each
of the internal nodes, a question is posed and the data on that node is further split into
separate records that have different characteristics. The leaves of the tree refer to the
classes in which the dataset is split.
Greedy Algorithm
The greedy algorithm in data science is an approach to problem-solving that makes the best choice based on the circumstances at hand. This method disregards the possibility that the best result obtained now may not be the ultimate best outcome. The algorithm never goes back and corrects an inaccurate judgment.
This straightforward, understandable approach can be used to resolve any
optimization issue that calls for the maximum or minimal ideal outcome. The simplicity
of this algorithm makes it the best choice for implementation.
With a greedy solution, the runtime complexity is fairly manageable. However,
you can only use a greedy solution if the issue statement satisfies the two following
criteria:
Greedy Choice Property: Selecting the ideal option at every stage might result in
an overall (global) optimal solution, which is known as the greedy choice property.
Optimum Substructure: A problem has an ideal substructure if an ideal strategy
for the whole problem includes optimal solutions to each of its subproblems.

Characteristics of a Greedy Algorithm
 The algorithm solves its problem by finding an optimal solution. This solution can be a
maximum or minimum value. It makes choices based on the best option available.
 The algorithm is fast and efficient with time complexity of O(n log n) or O(n). Therefore
applied in solving large-scale problems.
 The search for optimal solution is done without repetition – the algorithm runs once.
 It is straightforward and easy to implement.
Steps to create Greedy Algorithm
You can create a greedy solution for a given problem statement using the following steps:
 Step 1: Find the ideal substructure or subproblem in the given problem
 Step 2: Choose the components of the solution (e.g., largest sum, shortest path).
 Step 3: Establish an iterative procedure for examining each subproblem and coming up
with the best answer.
Components of Greedy Algorithm
The following elements can be incorporated into the greedy algorithm:
 Candidate set: the set of elements from which a solution is created.
 Selection Function: Use this function to choose the candidates or subsets that can be
included in the solution.
 Feasibility function: A function used to assess a candidate’s or subset’s potential to
contribute to the solution is the feasibility function.
 Objective function: An objective function is used to give the solution or a portion of the
solution a value.
 Solution Function: This function indicates whether or not a complete solution has been reached.
How Greedy Algorithms Work?
Greedy algorithms solve problems by following a simple, step-by-step approach:
1. Identify the Problem: The first step is to understand the problem and
determine that it can be solved using a greedy approach. This usually involves
recognizing that the problem can be broken down into a series of decisions.

2. Make the Greedy Choice: At each step of the problem, the algorithm makes
the best choice that seems optimal at the moment. This choice is made based on the
greedy choice property, which means selecting the option that offers the most immediate
benefit.
3. Solve the Subproblems: After making a greedy choice, the problem is reduced
to a smaller subproblem. The algorithm then repeats the process—making another greedy
choice for the smaller problem.
4. Combine the Solutions: The algorithm continues making greedy choices and
solving subproblems until the entire problem is solved. The solution to the overall
problem is simply the combination of all the greedy choices made at each step.
Greedy Algorithm: Pseudo Code
Algorithm Greedy(a, n)
{
    sol := 0;
    for i := 1 to n do
    {
        x := select(a);
        if feasible(sol, x) then
            sol := union(sol, x);
    }
    return sol;
}
The algorithm described above is greedy. The solution is first given a value of zero. The array and the element count are passed to the greedy algorithm. Inside the for loop we select one element at a time and check whether adding it keeps the solution feasible; if it does, we perform the union.
Let’s consider a real-world issue and come up with a greedy solution for this.
Problem Statement:
James is very busy, which is a problem. He has allotted T hours to complete some
intriguing activities. He wants to accomplish as many things as he can in the time provided. In order to do that, he has made an array A containing timestamps for each item
on his agenda.
Here, we need to calculate how many tasks James can finish in the given amount of time (T).
Solution: The presented issue is a simple greedy problem. Keeping track of currentTime and numberOfThings, at each iteration we must choose from array A the item that takes the shortest time to complete. We must complete the actions listed below in order to come up with a solution (a short code sketch follows the steps):
 Sort the array A in ascending order.
 Choose one timestamp at a time.
 After obtaining the timestamp, add its value to currentTime.
 Add one more thing to numberOfThings.
 Repeat steps 2 through 4 until the currentTime value reaches T.
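A minimal Python sketch of this greedy task-scheduling solution (the array of timestamps and the time limit T are illustrative values):

def max_tasks(durations, time_limit):
    # Greedy choice: always take the remaining task with the shortest duration
    current_time = 0
    number_of_things = 0
    for d in sorted(durations):            # step 1: sort ascending
        if current_time + d > time_limit:
            break                          # no further task fits in the allotted time
        current_time += d                  # steps 3-4: accumulate time, count the task
        number_of_things += 1
    return number_of_things

A = [4, 2, 7, 1, 5]      # example timestamps (hours per item on the agenda)
T = 8                    # total hours available
print(max_tasks(A, T))   # 3 -> the tasks of length 1, 2 and 4 fit within 8 hours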
Example of Greedy Algorithm

From the origin to the destination, we must travel as cheaply as possible. Since the three possible paths have costs of 10, 20, and 5 respectively, the path of cost 5 is the least expensive option, making it the best choice. This is the local optimum, and in order to compute the globally optimal solution, we find the local optimum at each stage in this manner.
Greedy Algorithms Examples
Below are some examples of greedy algorithm:
1. Activity Selection Problem

The Activity Selection Problem involves selecting the maximum number of
activities that don't overlap with each other, given their start and end times. The goal is to
maximize the number of activities that can be attended.
Working:
 Step 1: Sort the activities based on their finish times (earliest finish time first).
 Step 2: Select the first activity (the one that finishes the earliest).
 Step 3: For each subsequent activity, select it only if its start time is greater than or equal
to the finish time of the last selected activity.
 Step 4: Repeat until all activities have been considered.
Example:
Given activities with the following start and end times:
 Activity 1: (1, 4)
 Activity 2: (3, 5)
 Activity 3: (0, 6)
 Activity 4: (5, 7)
 Activity 5: (8, 9)
The greedy choice would select Activity 1, then Activity 4, and finally Activity 5,
as these do not overlap. The maximum number of non-overlapping activities is 3.
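A short sketch of the greedy activity-selection procedure applied to the five activities above:

def select_activities(activities):
    # Greedy: sort by finish time, keep an activity if it starts
    # no earlier than the last selected activity finishes
    selected = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:
            selected.append((start, finish))
            last_finish = finish
    return selected

activities = [(1, 4), (3, 5), (0, 6), (5, 7), (8, 9)]
print(select_activities(activities))       # [(1, 4), (5, 7), (8, 9)]
print(len(select_activities(activities)))  # 3 non-overlapping activities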
2. Huffman Coding
Huffman Coding is a method of data compression that assigns variable-length
codes to input characters, with shorter codes assigned to more frequent characters. The
goal is to minimize the total length of the encoded message.
Working:
 Step 1: Count the frequency of each character in the input.
 Step 2: Create a priority queue (min-heap) where each node represents a character and its
frequency.
 Step 3: Extract the two nodes with the lowest frequency from the queue.
 Step 4: Create a new node with these two nodes as children and with a frequency equal
to the sum of their frequencies.
 Step 5: Insert this new node back into the queue.

 Step 6: Repeat until only one node remains in the queue, which becomes the root of the
Huffman Tree.
Example:
Consider the characters and their frequencies:
 A: 5
 B: 9
 C: 12
 D: 13
 E: 16
 F: 45
The greedy algorithm will build a Huffman Tree by repeatedly merging the two
least frequent nodes until all characters are encoded, leading to an efficient prefix code
where no code is a prefix of any other.
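A compact sketch of Huffman tree construction for these frequencies, using Python's heapq module as the min-heap (the exact bit patterns can differ between implementations, but the code lengths follow the frequencies):

import heapq

def huffman_codes(freqs):
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

freqs = {"A": 5, "B": 9, "C": 12, "D": 13, "E": 16, "F": 45}
print(huffman_codes(freqs))  # F gets a 1-bit code; A and B get 4-bit codes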
3. Prim’s Algorithm
Prim’s Algorithm is used to find the Minimum Spanning Tree (MST) of a
connected, undirected graph. The MST is a subset of the edges that connects all vertices
together without any cycles and with the minimum possible total edge weight.
Working:
 Step 1: Start with any arbitrary vertex and add it to the MST.
 Step 2: At each step, add the smallest edge that connects a vertex in the MST to a vertex
outside the MST.
 Step 3: Repeat until all vertices are included in the MST.
Example:
Consider a graph with the following vertices and edges:
 Vertices: {A, B, C, D, E}
 Edges with weights: (A-B, 1), (B-C, 4), (A-C, 3), (C-D, 2), (B-D, 5), (D-E, 1)
Prim’s algorithm would start at any vertex, say A, and select the edge with the
smallest weight that connects to a new vertex, building the MST step by step. The
resulting MST might include edges (A-B), (A-C), (C-D), and (D-E) with the total
minimum weight.
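A small sketch of Prim's algorithm on the example graph above, using a priority queue of candidate edges:

import heapq

def prim_mst(graph, start):
    # graph: {vertex: [(weight, neighbour), ...]}
    visited = {start}
    edges = list(graph[start])
    heapq.heapify(edges)
    total = 0
    order = []
    while edges and len(visited) < len(graph):
        weight, v = heapq.heappop(edges)   # smallest edge leaving the current tree
        if v in visited:
            continue
        visited.add(v)
        order.append(v)
        total += weight
        for e in graph[v]:
            if e[1] not in visited:
                heapq.heappush(edges, e)
    return order, total

graph = {
    "A": [(1, "B"), (3, "C")],
    "B": [(1, "A"), (4, "C"), (5, "D")],
    "C": [(3, "A"), (4, "B"), (2, "D")],
    "D": [(5, "B"), (2, "C"), (1, "E")],
    "E": [(1, "D")],
}
print(prim_mst(graph, "A"))  # (['B', 'C', 'D', 'E'], 7) -> edges A-B, A-C, C-D, D-E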
Types of Greedy Algorithms

1. Fractional Greedy Algorithms
These algorithms work by making the best possible choice at each step based on
fractions. They are typically used in problems where items can be divided into smaller
parts.
Example: Fractional Knapsack Problem – where the goal is to maximize the total
value in the knapsack by taking fractions of items if necessary.
2. Pure Greedy Algorithms
These algorithms make the best possible choice at each step without considering
future consequences and without the possibility of revisiting previous decisions. Once a
choice is made, it is final.
Example: Activity Selection Problem – where activities are selected based on
their finish times to maximize the number of non-overlapping activities.
3. Recursive Greedy Algorithms:
These algorithms use recursion to make a series of greedy choices. The problem is
divided into subproblems, and the algorithm makes a greedy choice and then recursively
solves the subproblem.
Example: Prim’s Algorithm – which builds a minimum spanning tree by making
the greedy choice of adding the smallest edge at each step and recursively continuing this
process.
4. Greedy Choice Property Algorithms:
These algorithms rely on the greedy choice property, which ensures that a
globally optimal solution can be reached by making locally optimal choices.
Example: Huffman Coding – where characters are assigned variable-length codes
based on their frequencies, with the goal of minimizing the total length of the encoded
message.
5. Adaptive Greedy Algorithms
These algorithms adapt their strategies based on the current state of the problem.
They may reconsider previous decisions if necessary to improve the overall solution.
Example: Greedy Graph Coloring – where the algorithm assigns colors to the
vertices of a graph, potentially adjusting choices as new vertices are colored.
6. Constructive Greedy Algorithms
These algorithms build the solution step by step by adding elements to the
solution set based on a specific criterion.

Example: Kruskal’s Algorithm – which constructs a minimum spanning tree by
adding the smallest edges that do not form a cycle.
7. Non-Adaptive Greedy Algorithms:
These algorithms make decisions that are final and do not adapt or change once a
choice is made.
Example: Dijkstra’s Algorithm – which finds the shortest path from a source to
all other vertices in a graph by making locally optimal choices at each step.
Time Complexity of Greedy Algorithms
The time complexity of common greedy algorithms is mostly O(n log n) or O(n).

Algorithm                        Time Complexity
Activity Selection Problem       O(n log n)
Huffman Coding                   O(n log n)
Fractional Knapsack Problem      O(n log n)
Coin Change Problem              O(n)
Job Sequencing with Deadlines    O(n log n)

Space Complexity of Greedy Algorithms

Algorithm                        Space Complexity
Activity Selection Problem       O(1)
Huffman Coding                   O(n)
Fractional Knapsack Problem      O(1)
Coin Change Problem              O(1)
Job Sequencing with Deadlines    O(n)

Applications of Greedy Algorithms


 Network Routing: Greedy algorithms like Dijkstra’s algorithm are used to find the
shortest path in network routing, optimizing data packet delivery across the internet.
 Task Scheduling: Used in job sequencing problems to maximize profit or minimize
deadlines by selecting jobs in order of priority.
 Resource Allocation: Applied in scenarios like the fractional knapsack problem to
allocate resources efficiently based on value-to-weight ratios.
 Data Compression: Huffman coding uses a greedy approach to compress data efficiently
by encoding frequently used characters with shorter codes.
 Coin Change: A greedy approach is often used to provide the minimum number of coins
for a given amount of money, particularly when the denominations are standard (like in
most currencies).
 Greedy Coloring: Applied in graph theory for problems like graph coloring, where the
goal is to minimize the number of colors needed to color a graph while ensuring no two
adjacent vertices share the same color.
 Traveling Salesman Problem (TSP) Heuristic: Greedy heuristics are used in
approximating solutions to the TSP by selecting the nearest unvisited city at each step.
Advantages of Greedy Algorithms
 Simplicity: Greedy algorithms are easy to understand and implement because they
follow a straightforward approach of making the best choice at each step.
 Efficiency: They often have lower time complexity compared to other algorithms like
dynamic programming, making them faster for certain problems.
 Locally Optimal Decisions: By making locally optimal choices, greedy algorithms can
quickly arrive at a solution without the need for complex backtracking.
 Works Well for Certain Problems: Greedy algorithms are particularly effective for
problems that exhibit the greedy choice property and optimal substructure, such as
Minimum Spanning Tree and Huffman Coding.
Disadvantages of Greedy Algorithms

 May Not Always Produce Optimal Solutions: Greedy algorithms focus on immediate
benefits, which can sometimes lead to suboptimal solutions for problems where a global
perspective is needed.
 Limited Applicability: Greedy algorithms only work well for problems that satisfy the
greedy choice property and optimal substructure; they may fail or give incorrect results
otherwise.
 No Backtracking: Once a choice is made, it cannot be undone, which can lead to
incorrect results if the wrong choice is made early on.
 Lack of Flexibility: Greedy algorithms are rigid in their approach, and modifying them
to account for different problem constraints can be challenging.
The 0/1 Knapsack Problem
The 0/1 Knapsack Problem cannot be solved by a greedy algorithm because it
does not fulfill the greedy choice property, and the optimal substructure property, as
mentioned earlier.
The 0/1 Knapsack Problem
Rules:
 Every item has a weight and value.
 Your knapsack has a weight limit.
 Choose which items you want to bring with you in the knapsack.
 You can either take an item or not, you cannot take half of an item for example.
Goal:
 Maximize the total value of the items in the knapsack.
The Traveling Salesman Problem
The Traveling Salesman Problem is a famous problem that also cannot be solved by a greedy algorithm, because it does not fulfill the greedy choice property and the optimal substructure property, as mentioned earlier.
The Traveling Salesman Problem states that you are a salesperson with a number of cities or towns you must visit to sell your things.
The Traveling Salesman Problem
 Rules: Visit every city only once, then return back to the city you started in.
 Goal: Find the shortest possible route.
Pruning

What is Decision Tree Pruning?
Decision tree pruning is a technique used to prevent decision trees
from overfitting the training data. Pruning aims to simplify the decision tree by removing
parts of it that do not provide significant predictive power, thus improving its ability to
generalize to new data.
Decision Tree Pruning removes unwanted nodes from an overfitted decision tree to make it smaller in size, which results in faster, more accurate and more effective predictions.
There are two approaches to pruning: a) Pre-pruning: stop growing the tree before it reaches perfection. b) Post-pruning: allow the tree to grow entirely and then prune some of the branches from it.
In the case of pre-pruning, the tree is stopped from further growing once it reaches a
certain number of decision nodes or decisions. Hence, in this strategy, the algorithm
avoids overfitting as well as optimizes computational cost. However, it also risks ignoring important information contributed by a feature that was skipped, thereby missing certain patterns in the data.
In the case of post-pruning, the tree is allowed to grow to its full extent. Then, by using a certain pruning criterion, e.g. error rates at the nodes, the size of the tree is reduced.
This is a more effective approach in terms of classification accuracy as it considers all
minute information available from the training data. However, the computational cost is
obviously more than that of pre-pruning.
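A hedged sketch of both strategies with scikit-learn's DecisionTreeClassifier: pre-pruning through the max_depth / min_samples_split limits, and post-pruning through cost-complexity pruning (the ccp_alpha parameter). The dataset and parameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop the tree early with depth / split-size limits
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then prune back with a cost-complexity penalty
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))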
Random forest model

 Random forest is an ensemble classifier, i.e. a combining classifier that uses and
combines many decision tree classifiers.
 Ensembling is usually done using the concept of bagging with different feature sets. The reason for using a large number of trees in random forest is to train the trees enough so that the contribution from each feature comes in a number of models.
 After the random forest is generated by combining the trees, majority vote is applied to
combine the output of the different trees.

How does random forest work?

1. If there are N variables or features in the input data set, select a subset
of ‘m’ (m < N) features at random out of the N features. Also, the
observations or data instances should be picked randomly.
2. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
3. Keep splitting the nodes to child nodes till the tree is grown to the maximum possible extent.
4. Select a different subset of the training data ‘with replacement’ to train
another decision tree following steps (1) to (3). Repeat this to build and train ‘n’
decision trees.
5. Final class assignment is done on the basis of the majority votes from the ‘n’ trees.

Algorithm
Here is an outline of the random forest algorithm.
1. The random forests algorithm generates many classification trees. Each tree is generated as
follows:
(a) If the number of examples in the training set is N, take a sample of N examples at random -
but with replacement, from the original data. This sample will be the training set for generating
the tree.
(b) If there are M input variables, a number m is specified such that at each node, m variables are
selected at random out of the M and the best split on these m is used to split the node. The value
of m is held constant during the generation of the various trees in the forest.
(c) Each tree is grown to the largest extent possible.
2. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest.
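A hedged usage sketch with scikit-learn's RandomForestClassifier, which follows the same recipe: bootstrap samples, a random subset of features at each split (max_features), and majority voting across n_estimators trees. The data is synthetic and the parameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees ('n' in the steps above)
    max_features="sqrt",  # 'm' features considered at each split
    oob_score=True,       # estimate error from out-of-bag samples
    random_state=42,
)
forest.fit(X, y)

print("OOB score:", forest.oob_score_)  # out-of-bag accuracy estimate
print(forest.predict(X[:5]))            # majority vote of the 100 trees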

Strengths and weaknesses of Random forest
Strengths of random forest
1) It runs efficiently on large and expansive data sets.
2) It has a robust method for estimating missing data and maintains precision when
a large proportion of the data is absent.
3) It has powerful techniques for balancing errors in a class population of
unbalanced data sets.
4) It gives estimates (or assessments) about which features are the most important
ones in the overall classification.
5) It generates an internal unbiased estimate (gauge) of the generalization error
as the forest generation progresses.
6) Generated forests can be saved for future use on other data.
7) Lastly, the random forest algorithm can be used to solve both
classification and regression problems.
Weaknesses of random forest
1) This model, because it combines a number of decision tree models, is not as
easy to understand as a decision tree model.
2) It is computationally much more expensive than a simple model like a decision tree.

Application of random forest


Random forest is a very powerful classifier which combines the versatility of
many decision tree models into a single model. Because of the superior results,
this ensemble model is gaining wide adoption and popularity amongst the
machine learning practitioners to solve a wide range of classification problems.

Out-of-bag (OOB) error in random forest


 In random forests, we have seen that each tree is constructed using a different bootstrap sample from the original data. The samples left out of the bootstrap and
not used in the construction of the i-th tree can be used to measure the
performance of the model.
 At the end of the run, predictions for each such sample evaluated each time are
tallied, and the final prediction for that sample is obtained by taking a vote.

 The total error rate of predictions for such samples is termed as out-of-bag
(OOB) error rate.
 The error rate shown in the confusion matrix reflects the OOB error rate. Because
of this reason, the error rate displayed is often surprisingly high.

The following steps explain the working of the Random Forest algorithm:
Step 1: Select random samples from a given dataset or training set.
Step 2: The algorithm will construct a decision tree for every training sample.
Step 3: Voting will take place by averaging the predictions of the decision trees.
Step 4: Finally, select the most voted prediction result as the final prediction result.

This combination of multiple models is called an Ensemble. Ensembling uses two methods:

1. Bagging: Creating a different training subset from sample training data with
replacement is called Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential models such that the final model has the highest accuracy is called Boosting. Examples: AdaBoost, XGBoost.

Bagging: From the principle mentioned above, we can understand that Random Forest uses the Bagging technique. Now, let us understand this concept in detail. Bagging is also known as Bootstrap Aggregation and is used by random forest. The process begins with the original data; rows are sampled with replacement and organised into subsets known as Bootstrap Samples, a process known as Bootstrapping. Further, the models are trained individually on these samples, yielding different results, and combining those results is known as Aggregation. In the last step, all the results are combined and the generated output is based on majority voting. This step is known as Bagging and is done using an Ensemble Classifier.

Essential Features of Random Forest


 Miscellany: Each tree has unique attributes and features compared with the other trees; not all trees are the same.

 Immune to the curse of dimensionality: Since each tree does not consider all of the features, the feature space is reduced.

 Parallelization: We can fully use the CPU to build random forests, since each tree is created autonomously from different data and features.

 Train-Test split: In a Random Forest, we don't have to set aside separate train and test data, because there will always be a portion of the data (roughly 30%) that a given decision tree never sees, which can be used for validation.

 Stability: The final result is based on Bagging, meaning the result is based on
majority voting or average.
Ensemble learning
Ensemble learning helps improve machine learning results by combining several
models. This approach allows the production of better predictive performance
compared to a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.

Advantage : Improvement in predictive accuracy.


Disadvantage : It is difficult to understand an ensemble of classifiers

Types of Ensemble Classifier

Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Suppose we have a set D of d tuples. At each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample), and a classifier model Mi is learned on Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).
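A minimal sketch of this bagging procedure, written directly with NumPy and scikit-learn decision trees (each Di is drawn with replacement, a model Mi is fitted on it, and M* takes the majority vote); the toy data and number of rounds are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, random_state=0):
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))                   # bootstrap sample Di
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # classifier Mi
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])  # each Mi casts a vote
    # M*: for every sample, assign the class with the most votes
    return np.array([np.bincount(votes[:, j]).argmax() for j in range(X.shape[0])])

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])
models = bagging_fit(X, y)
print(bagging_predict(models, np.array([[2], [11]])))  # expected: [0 1]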

Boosting
Boosting is an ensemble learning method that combines a set of weak learners
into a strong learner to minimize training errors. In boosting, a random sample
of data is selected, fitted with a model and then trained sequentially—that is,
each model tries to compensate for the weaknesses of its predecessor. With
each iteration, the weak rules from each individual classifier are combined to
form one, strong prediction rule

Types of boosting
Boosting methods are focused on iteratively combining weak learners to
build a strong learner that can predict more accurate outcomes. As a reminder, a
weak learner classifies data slightly better than random guessing. This approach
can provide robust results for prediction problems, and can even outperform
neural networks and support vector machines for tasks like image retrieval.

Boosting algorithms can differ in how they create and aggregate weak learners
during the sequential process. Three popular types of boosting methods
include:

 Adaptive boosting or AdaBoost: Yoav Freund and Robert Schapire are credited
with the creation of the AdaBoost algorithm. This method operates iteratively,
identifying misclassified data points and adjusting their weights to minimize the
training error. The model continues to optimize in a sequential fashion until it yields the strongest predictor.
 Gradient boosting: Building on the work of Leo Breiman, Jerome H.
Friedman developed gradient boosting, which works by sequentially adding
predictors to an ensemble with each one correcting for the errors of its
predecessor. However, instead of changing weights of data points like AdaBoost,
the gradient boosting trains on the residual errors of the previous predictor. The
name, gradient boosting, is used since it combines the gradient descent
algorithm and boosting method.
 Extreme gradient boosting or XGBoost: XGBoost is an implementation of
gradient boosting that’s designed for computational speed and scale.
XGBoost leverages multiple cores on the CPU, allowing for learning to occur
in parallel during training.

AdaBoost:
Freund and Schapire (1996) proposed a variant, named AdaBoost, short for adaptive boosting, that uses the same training set over and over and thus the training set need not be large; however, the classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base learners, not just three.
There are many machine learning algorithms to choose from for your problem statements. One
of these algorithms for predictive modeling is called AdaBoost.

AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as
an Ensemble Method in Machine Learning. It is called Adaptive Boosting as the weights are
re-assigned to each instance, with higher weights assigned to incorrectly classified instances.

What this algorithm does is that it builds a model and gives equal weights to all the data points. It
then assigns higher weights to points that are wrongly classified. Now all the points with higher weights are given more importance in the next model. It will keep training models until a lower error is achieved.

AdaBoost algorithm
How AdaBoost Works (Simplified):
1. Initialization: Assign equal weights to all training data points.
2. Training: Train a weak learner on the weighted data.
3. Prediction: Predict the outcome for each data point.
4. Weight Adjustment: Increase the weight of misclassified data points.
5. Iteration: Repeat steps 2-4 until a stopping criterion is met (e.g., a predefined number of
iterations or a satisfactory level of accuracy).

6. Ensemble: Combine the predictions of all weak learners, weighted by their importance,
to make the final prediction.
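For reference, a hedged sketch of the same loop using scikit-learn's AdaBoostClassifier, whose default weak learner is a one-level decision tree (a decision stump); the dataset and parameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# 50 boosting rounds; each round re-weights the data to focus on earlier mistakes
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
model.fit(X, y)

print(model.score(X, y))     # training accuracy of the boosted ensemble
print(model.predict(X[:5]))  # weighted vote of the 50 weak learners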
The Working of the AdaBoost Algorithm
Let's understand how this algorithm works under the hood with the following tutorial.
Step 1: Assigning Weights
The Image shown below is the actual representation of our dataset. Since the target column is
binary, it is a classification problem. First of all, these data points will be assigned some weights.
Initially, all the weights will be equal.

The formula to calculate the sample weights is:

Sample weight w(i) = 1 / N, where N is the total number of data points.


Here since we have 5 data points, the sample weights assigned will be 1/5.
Step 2: Classify the Samples
We start by seeing how well “Gender” classifies the samples and will see how the variables
(Age, Income) classify the samples.
We’ll create a decision stump for each of the features and then calculate the Gini Index of each
tree. The tree with the lowest Gini Index will be our first stump.
Here in our dataset, let’s say Gender has the lowest gini index, so it will be our first stump.
Step 3: Calculate the Influence
We'll now calculate the "Amount of Say" or "Importance" or "Influence" for this classifier in classifying the data points using this formula:

Amount of Say (α) = (1/2) × ln((1 − Total Error) / Total Error)
The total error is nothing but the summation of all the sample weights of misclassified data
points.
Here in our dataset, let's assume there is 1 wrong output, so our total error will be 1/5, and the alpha (performance of the stump) will be:

α = (1/2) × ln((1 − 1/5) / (1/5)) = (1/2) × ln(4) ≈ 0.69
Note: Total error will always be between 0 and 1.


0 Indicates perfect stump, and 1 indicates horrible stump.

From this relationship between the total error and alpha, we can see that when there is no misclassification (Total Error = 0), the "amount of say (alpha)" will be a large number.
When the classifier predicts half right and half wrong, then the Total Error = 0.5, and the
importance (amount of say) of the classifier will be 0.

If all the samples have been incorrectly classified, then the error will be very high (close to 1), and hence our alpha value will be a large negative number.
Step 4: Calculate TE and Performance
You might be wondering about the significance of calculating the Total Error (TE) and
performance of an Adaboost stump. The reason is straightforward – updating the weights is
crucial. If identical weights are maintained for the subsequent model, the output will mirror what
was obtained in the initial model.
The wrong predictions will be given more weight, whereas the correct predictions weights will
be decreased. Now when we build our next model after updating the weights, more preference
will be given to the points with higher weights.
After finding the importance of the classifier and the total error, we need to finally update the weights, and for this we use the following formula:

New sample weight = Old sample weight × e^(±α)

The amount of say (α) is taken with a negative sign when the sample is correctly classified and with a positive sign when the sample is misclassified.
There are four correctly classified samples and 1 wrong. Here, the sample weight of that
datapoint is 1/5, and the amount of say/performance of the stump of Gender is 0.69.
New weights for correctly classified samples are: New weight = (1/5) × e^(−0.69) ≈ 0.1004

For the wrongly classified sample, the updated weight will be: New weight = (1/5) × e^(0.69) ≈ 0.3988

Note: observe the sign of alpha when substituting the values. Alpha is negative when the data point is correctly classified, and this decreases the sample weight from 0.2 to 0.1004. It is positive when there is a misclassification, and this increases the sample weight from 0.2 to 0.3988.

We know that the total sum of the sample weights must be equal to 1, but here if we sum up all
the new sample weights, we will get 0.8004. To bring this sum equal to 1, we will normalize
these weights by dividing all the weights by the total sum of updated weights, which is 0.8004.
So, after normalizing the sample weights, we get this dataset, and now the sum is equal to 1.
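A tiny numeric sketch that reproduces the weight arithmetic above (the worked example uses the rounded value α = 0.69; with the unrounded α = ½·ln(4) ≈ 0.693 the new weights come out to exactly 0.1 and 0.4):

import numpy as np

total_error = 1 / 5
alpha = 0.5 * np.log((1 - total_error) / total_error)  # ≈ 0.693, rounded to 0.69 in the text

w_correct = 0.2 * np.exp(-0.69)  # ≈ 0.1003 (shown as 0.1004 in the worked example)
w_wrong = 0.2 * np.exp(0.69)     # ≈ 0.3987 (shown as 0.3988 in the worked example)
weights = np.array([w_correct] * 4 + [w_wrong])

print(weights.sum())            # ≈ 0.80 (0.8004 with the example's rounded weights)
print(weights / weights.sum())  # normalized weights, now summing to 1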

Step 5: Decrease Errors


Now, we need to make a new dataset to see if the errors decreased or not. For this, we will
remove the “sample weights” and “new sample weights” columns and then, based on the “new
sample weights,” divide our data points into buckets.

Step 6: New Dataset
We are almost done. Now, what the algorithm does is selects random numbers from 0-1. Since
incorrectly classified records have higher sample weights, the probability of selecting those
records is very high.
Suppose the 5 random numbers our algorithm takes are 0.38, 0.26, 0.98, 0.40, 0.55.
Now we will see where these random numbers fall in the bucket, and according to it, we’ll make
our new dataset shown below.

This comes out to be our new dataset, and we see the data point, which was wrongly classified,
has been selected 3 times because it has a higher weight.
Step 7: Repeat Previous Steps
Now this acts as our new dataset, and we need to repeat all of the above steps, i.e.
 Assign equal weights to all the data points.

 Find the stump that does the best job classifying the new collection of samples by finding
their Gini Index and selecting the one with the lowest Gini index.
 Calculate the “Amount of Say” and “Total error” to update the previous sample weights.
 Normalize the new sample weights.
Iterate through these steps until a low training error is achieved.
Suppose, with respect to our dataset, we have constructed 3 decision trees (DT1, DT2, DT3) in
a sequential manner. If we send our test data now, it will pass through all the decision trees, and
finally, we will see which class has the majority, and based on that, we will do predictions
for our test dataset.
Program:
import numpy as np


class DecisionStump:
    """A one-level decision tree (weak learner) used as the base classifier."""

    def __init__(self):
        self.polarity = 1         # direction of the inequality test
        self.feature_idx = None   # index of the feature used for the split
        self.threshold = None     # threshold value of the split
        self.alpha = None         # amount of say of this stump

    def predict(self, X):
        n_samples = X.shape[0]
        predictions = np.ones(n_samples)
        feature_column = X[:, self.feature_idx]
        if self.polarity == 1:
            predictions[feature_column < self.threshold] = -1
        else:
            predictions[feature_column > self.threshold] = -1
        return predictions


class AdaBoost:
    """AdaBoost classifier built from decision stumps; expects labels in {-1, +1}."""

    def __init__(self, n_clf=5):
        self.n_clf = n_clf
        self.clfs = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Step 1: start with equal sample weights
        w = np.full(n_samples, 1 / n_samples)

        for _ in range(self.n_clf):
            clf = DecisionStump()
            min_error = float('inf')

            # Step 2: find the stump (feature, threshold, polarity) with the
            # lowest weighted classification error
            for feature_i in range(n_features):
                X_column = X[:, feature_i]
                thresholds = np.unique(X_column)
                for threshold in thresholds:
                    p = 1
                    predictions = np.ones(n_samples)
                    predictions[X_column < threshold] = -1
                    # weighted error = sum of weights of misclassified samples
                    error = np.sum(w[y != predictions])
                    # if the error is above 0.5, flipping the polarity does better
                    if error > 0.5:
                        error = 1 - error
                        p = -1
                    if error < min_error:
                        clf.polarity = p
                        clf.threshold = threshold
                        clf.feature_idx = feature_i
                        min_error = error

            # Step 3: amount of say (alpha) of this stump; EPS avoids division by zero
            EPS = 1e-10
            clf.alpha = 0.5 * np.log((1.0 - min_error + EPS) / (min_error + EPS))

            # Step 4: update the sample weights and normalize them to sum to 1
            predictions = clf.predict(X)
            w *= np.exp(-clf.alpha * y * predictions)
            w /= np.sum(w)

            self.clfs.append(clf)

    def predict(self, X):
        # final prediction: sign of the alpha-weighted sum of the stumps' votes
        clf_preds = [clf.alpha * clf.predict(X) for clf in self.clfs]
        y_pred = np.sum(clf_preds, axis=0)
        return np.sign(y_pred)
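A short usage sketch of the class above (assuming labels encoded as -1/+1, which is what this implementation expects; the synthetic dataset is only for illustration):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
y = np.where(y == 0, -1, 1)          # relabel the classes as -1 / +1

model = AdaBoost(n_clf=10)
model.fit(X, y)
accuracy = np.mean(model.predict(X) == y)
print("Training accuracy:", accuracy)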
Support Vector Machine
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, used for Classification as well as Regression problems. However, it is primarily used for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat.

Consider the below diagram. The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
 Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
 Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM. The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect its position are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM: The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify any pair (x1, x2) of coordinates as either green or blue. Consider the below image:

So, as it is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors.
The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
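As an illustration, the maximum-margin hyperplane and its support vectors can be obtained with scikit-learn's SVC using a linear kernel (a minimal sketch on synthetic two-class data; the dataset and the parameter values are assumptions for illustration only):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated blobs standing in for the green and blue classes
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# a large C approximates a hard-margin linear SVM
clf = SVC(kernel="linear", C=1000)
clf.fit(X, y)

print("Hyperplane coefficients (w):", clf.coef_)
print("Intercept (b):", clf.intercept_)
print("Support vectors:\n", clf.support_vectors_)

# for a linear SVM the margin width is 2 / ||w||
w = clf.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))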
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-y plane. If we convert it back to 2-D space by taking z = 1, it becomes:

Hence we get a circle of radius 1 as the decision boundary in the case of non-linear data.
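This feature transformation can be demonstrated directly in code (a minimal sketch, assuming concentric-circle data: the points are not linearly separable in (x, y), but adding the feature z = x² + y² makes them separable with a straight hyperplane):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# points on two concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# add the third feature z = x^2 + y^2
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

# a linear SVM in the lifted 3-D space now separates the two classes easily
clf = SVC(kernel="linear").fit(X3, y)
print("Training accuracy in the lifted space:", clf.score(X3, y))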


SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions. This makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM.
Linear Kernel
It can be used as the dot product between any two observations. The formula of the linear kernel is as below:

K(x, xi) = sum(x ∗ xi)

From the above formula, we can see that the kernel value for two vectors, say x and xi, is the sum of the products of each pair of input values, i.e. their dot product.
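Written out in code, the linear kernel is just a dot product (a minimal sketch with two hypothetical vectors):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
xi = np.array([4.0, 5.0, 6.0])

# K(x, xi) = sum(x * xi): element-wise products summed, i.e. the dot product
k_sum = np.sum(x * xi)
k_dot = np.dot(x, xi)
print(k_sum, k_dot)   # both print 32.0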
 SVM has a technique called the kernel trick to deal with non-linearly separable data.
 As shown in the figure below, kernels are functions which can transform a lower-dimensional input space into a higher-dimensional space. In the process, they convert linearly non-separable data into linearly separable data.

Some of the common kernel functions, for transforming from a lower dimension ‘i’ to a higher dimension ‘j’, used by different SVM implementations are as follows:
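Common kernel choices in most SVM implementations include the linear, polynomial, RBF (Gaussian) and sigmoid kernels. A hedged scikit-learn sketch comparing them on the concentric-circles data used earlier (the dataset and parameter values are assumptions for illustration):

from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# compare common kernel choices; the parameters are illustrative defaults
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")
    scores = cross_val_score(clf, X, y, cv=5)
    print(kernel, "mean accuracy:", round(scores.mean(), 3))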

Strengths and weaknesses of SVM
Strengths of SVM

 SVM can be used for both classification and regression.
 It is robust, i.e. not much impacted by data with noise or outliers.
 The prediction results using this model are very promising.
Weaknesses of SVM
 SVM is applicable only for binary classification, i.e. when there are only two
classes in the problem domain.
 The SVM model is very complex, almost like a black box, when it deals with a high-dimensional data set. Hence, it is very difficult, and close to impossible, to understand the model in such cases.
 It is slow for a large dataset, i.e. a data set with either a large number of
features or a large number of instances.
 It is quite memory-intensive.

Application of SVM
 SVM is most effective when it is used for binary classification, i.e. for solving a
machine learning problem with two classes.
 One common problem on which SVM can be applied is in the field of
bioinformatics – more specifically, in detecting cancer and other genetic
disorders.
 It can also be used in detecting the image of a face by binary classification of
images into face and nonface components.
