Unit V Non Parametric Machine Learning
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised
learning classifier, which uses proximity to make classifications or predictions about the grouping of
an individual data point. While it can be used for either regression or classification problems, it is
typically used as a classification algorithm, working off the assumption that similar points can be
found near one another.
For classification problems, a class label is assigned on the basis of a majority vote— i.e. the label
that is most frequently represented around a given data point is used. While this is technically
considered “plurality voting”, the term “majority vote” is more commonly used in the literature. The distinction between these terms is that “majority voting” technically requires a majority of greater than 50%, which primarily works when there are only two categories. When you have multiple classes, e.g. four categories, you don’t necessarily need 50% of the vote to make a conclusion about a class; you could assign a class label with a vote of greater than 25%.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm simply stores the dataset during the training phase, and when it gets new data, it classifies that data into the category most similar to the new data.
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories does this data point belong to? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o As we can see, 3 of the 5 nearest neighbors are from Category A, hence this new data point must belong to Category A.
1. Euclidean Distance
Euclidean distance (or L2 distance) is the straight-line distance between two points. For two points P(x1, y1) and Q(x2, y2), it is calculated as:
d(P, Q) = √((x2 − x1)² + (y2 − y1)²)
Euclidean distance is suitable for continuous variables and is easy to compute, making it a popular
choice in KNN.
2. Manhattan Distance
Manhattan distance (or L1 distance) measures the distance between two points along the axes at right angles. For two points P(x1, y1) and Q(x2, y2), the Manhattan distance is calculated as:
d(P, Q) = |x2 − x1| + |y2 − y1|
It is useful for grid-like paths (e.g., city blocks) and is often employed when variables are more
discrete.
3. Minkowski Distance
Minkowski distance is a generalization of both Euclidean and Manhattan distances. It is defined as:
d(P, Q) = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)
When p=2, it becomes Euclidean distance, and when p=1, it is equivalent to Manhattan distance.
Minkowski distance provides flexibility by adjusting the value of p for different scenarios.
Choosing the appropriate distance metric depends on the data type and the specific problem at hand.
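To make the comparison concrete, here is a minimal sketch of the three metrics for numeric feature vectors; the function names are ours, not from any particular library:

import math

def euclidean(p, q):
    # L2 distance: straight-line distance between p and q
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # L1 distance: sum of absolute differences along each axis
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=2):
    # Generalization: r = 1 gives Manhattan, r = 2 gives Euclidean
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

print(euclidean((1, 2), (4, 6)))     # 5.0
print(manhattan((1, 2), (4, 6)))     # 7
print(minkowski((1, 2), (4, 6), 1))  # 7.0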
Algorithm for K-Nearest Neighbor (KNN)
Here’s a simplified version of the KNN algorithm:
Algorithm Steps:
1. Select the number of neighbors k.
2. Calculate the distance between the query point and all other points in the dataset using a chosen
distance metric.
3. Sort the distances in ascending order and select the top k-nearest neighbors.
4. For classification: Assign the query point the class of the majority of its neighbors.
5. For regression: Predict the value of the query point as the average of the k-nearest neighbors.
Pseudo-code:
import math
from collections import Counter

def compute_distance(p, q):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn(query_point, dataset, k):
    # dataset is a list of (features, label) pairs
    distances = []
    for features, label in dataset:
        distances.append((label, compute_distance(query_point, features)))
    distances.sort(key=lambda x: x[1])              # sort based on distance
    neighbors = [label for label, _ in distances[:k]]
    return Counter(neighbors).most_common(1)[0][0]  # majority vote (use the mean for regression)
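For example, with a small made-up two-class dataset, the function above could be called as follows:

# Hypothetical toy data: (features, label) pairs for Category A and Category B
training_data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
                 ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]

print(knn((1.1, 1.0), training_data, k=3))   # prints: A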
Applications of KNN
- Finance: It has also been used in a variety of finance and economic use cases.
- Healthcare: KNN has also had application within the healthcare industry, making predictions on the risk of heart attacks and prostate cancer. The algorithm works by calculating the most likely gene expressions.
Decision tree
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. It works like a flowchart that helps to make decisions step by step, where:
Internal nodes represent attribute tests
Branches represent attribute values
Leaf nodes represent final decisions or predictions.
In a decision tree, there are two types of nodes:
Decision Node and
Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
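As a minimal sketch of this process in practice, a decision tree can be fitted with scikit-learn; the tiny dataset below is made up for illustration, and criterion="entropy" selects information gain as the attribute selection measure:

from sklearn.tree import DecisionTreeClassifier

# Toy dataset: 3 binary features per row, two output classes ("I" and "II")
X = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y = ["I", "I", "II", "II"]

# criterion="entropy" splits on information gain; criterion="gini" (the default) uses the Gini index
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(tree.predict([[1, 1, 0]]))   # expected: ['I']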
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. This process is known as attribute selection. There are two popular attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv), where the sum runs over all v in Values(A),
and Entropy(S) = − Σ p_i log2(p_i), with p_i the proportion of instances in S that belong to class i.
Feature 1   Feature 2   Feature 3   Class
    1           1           1         I
    1           1           0         I
    0           0           1         II
    1           0           0         II
Here, we have 3 features and 2 output classes. To build a decision tree using information gain, we take each of the features and calculate the information gain for each of them.
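As an illustration, the calculation can be sketched as follows; the helper functions are ours, and the rows correspond to the table above (features indexed 0, 1, 2):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    # Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv))
    total = len(rows)
    gain = entropy([label for _, label in rows])
    for value in {features[feature_index] for features, _ in rows}:
        subset = [label for features, label in rows if features[feature_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

rows = [((1, 1, 1), "I"), ((1, 1, 0), "I"), ((0, 0, 1), "II"), ((1, 0, 0), "II")]
print(information_gain(rows, 0))   # gain for Feature 1 (about 0.31)
print(information_gain(rows, 1))   # gain for Feature 2 (1.0, a perfect split)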
2. Gini Index
Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified. An attribute with a lower Gini index should be preferred. Sklearn supports the “gini” criterion for the Gini Index, and it uses “gini” by default.
For example, if we have a group of people who all bought the product (100% "Yes"), the Gini Index is 0, indicating perfect purity. But if the group has an equal mix of "Yes" and "No", the Gini Index would be 0.5, showing high impurity or uncertainty. The formula for the Gini Index is:
Gini = 1 − Σ (p_i)²
where p_i is the probability of an element being classified into class i.
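The same calculation as a short sketch (the helper is ours, matching the formula above):

from collections import Counter

def gini_index(labels):
    # Gini = 1 - sum(p_i^2) over the classes present in the group
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_index(["Yes", "Yes", "Yes", "Yes"]))   # 0.0 -> perfect purity
print(gini_index(["Yes", "Yes", "No", "No"]))     # 0.5 -> maximum impurity for two classes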
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement for data cleaning compared to other algorithms.
Branching
A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically.
A decision tree typically starts with a single node, which branches into possible
outcomes. Each of those outcomes leads to additional nodes, which branch off into
other possibilities. This gives it a treelike shape.
There are three different types of nodes: chance nodes, decision nodes, and end nodes. A
chance node, represented by a circle, shows the probabilities of certain results. A
decision node, represented by a square, shows a decision to be made, and an end node
shows the final outcome of a decision path.
Decision trees can also be drawn with flowchart symbols, which some people find easier
to read and understand.
To draw a decision tree, first pick a medium. You can draw it by hand on paper or a
whiteboard, or you can use special decision tree software. In either case, here are the
steps to follow:
1. Start with the main decision. Draw a small box to represent this point,
then draw a line from the box to the right for each possible solution or
action. Label them accordingly.
2. From each decision node, draw possible solutions. From each chance node, draw lines representing possible outcomes. If you intend to analyze your options numerically, include the probability of each outcome and the cost of each action.
3. Continue to expand until every line reaches an endpoint, meaning that there are no more choices to be made or chance outcomes to consider. Then, assign a value to each possible outcome.
Greedy Algorithm
From the origin to the destination, we must travel as cheaply as possible. Since the three possible solutions have cost pathways of 10, 20, and 5, respectively, 5 is the least expensive option, making it the best choice. This is the local optimum, and in order to compute the global optimal solution, we discover the local optimum at each stage in this manner.
Greedy Algorithms Examples
Below are some examples of greedy algorithms:
1. Activity Selection Problem (see the sketch below)
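As a brief illustration of the greedy strategy, here is a sketch of the activity selection problem: repeatedly pick the activity with the earliest finish time among those compatible with the choices already made. The interval data is made up:

def select_activities(activities):
    # activities: list of (start, finish) pairs
    # Greedy choice: always take the activity with the earliest finish time
    # that starts after the previously selected activity has finished.
    selected = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:
            selected.append((start, finish))
            last_finish = finish
    return selected

print(select_activities([(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9), (6, 10), (8, 11)]))
# [(1, 4), (5, 7), (8, 11)]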
Random Forest
Random forest is an ensemble classifier, i.e., a combining classifier that uses and combines many decision tree classifiers.
Ensembling is usually done using the concept of bagging with different feature sets. The reason for using a large number of trees in a random forest is to train the trees enough so that the contribution from each feature appears in a number of models.
After the random forest is generated by combining the trees, majority vote is applied to
combine the output of the different trees.
1. If there are N variables or features in the input data set, select a subset
of ‘m’ (m < N) features at random out of the N features. Also, the
observations or data instances should be picked randomly.
2. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
3. Keep splitting the nodes to child nodes till the tree is grown to the maximum possible extent.
4. Select a different subset of the training data ‘with replacement’ to train
another decision tree following steps (1) to (3). Repeat this to build and train ‘n’
decision trees.
5. Final class assignment is done on the basis of the majority votes from the ‘n’ trees.
Algorithm
Here is an outline of the random forest algorithm.
1. The random forests algorithm generates many classification trees. Each tree is generated as
follows:
(a) If the number of examples in the training set is N, take a sample of N examples at random -
but with replacement, from the original data. This sample will be the training set for generating
the tree.
(b) If there are M input variables, a number m is specified such that at each node, m variables are
selected at random out of the M and the best split on these m is used to split the node. The value
of m is held constant during the generation of the various trees in the forest.
(c) Each tree is grown to the largest extent possible.
2. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes over all the trees in the forest.
Step 1: Select random samples from the given data or training set.
Step 2: The algorithm will construct a decision tree for every training sample.
Step 3: Voting will take place over the predictions of the decision trees (majority vote for classification, averaging for regression).
Step 4: Finally, select the most voted prediction result as the final prediction result.
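A minimal sketch of these steps using scikit-learn's RandomForestClassifier; the toy data is made up, n_estimators corresponds to the number of trees 'n', and max_features to the random feature subset 'm':

from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 4 samples with 3 features each, two classes
X = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y = ["I", "I", "II", "II"]

forest = RandomForestClassifier(n_estimators=100,    # number of trees
                                max_features="sqrt", # random subset of features per split
                                bootstrap=True)      # sample training data with replacement
forest.fit(X, y)
print(forest.predict([[1, 1, 0]]))  # majority vote across the trees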
1. Bagging: Creating a different training subset from sample training data with
replacement is called Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential models such that the final model has the highest accuracy is called Boosting.
Parallelization: We can fully use the CPU to build random forests since each tree is built independently from different data and attributes.
Stability: The final result is based on Bagging, meaning the result is based on
majority voting or average.
Ensemble learning
Ensemble learning helps improve machine learning results by combining several
models. This approach allows the production of better predictive performance
compared to a single model. Basic idea is to learn a set of classifiers (experts)
and to allow them to vote.
Types of Ensemble Classifier – Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Suppose we have a set D of d tuples. At each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample). Then a classifier model Mi is learned for each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (the unknown sample).
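A short sketch of bagging with scikit-learn's BaggingClassifier, which by default bags decision trees; the toy data is made up:

from sklearn.ensemble import BaggingClassifier

X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 2], [2, 1]]
y = [1, 1, 0, 0, 1, 0]

# Each of the n_estimators base trees is trained on a bootstrap sample
# drawn with replacement from the training set, and predictions are
# combined by voting.
bagged = BaggingClassifier(n_estimators=10, bootstrap=True)
bagged.fit(X, y)
print(bagged.predict([[1, 1]]))   # class with the most votes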
Types of boosting
Boosting methods are focused on iteratively combining weak learners to
build a strong learner that can predict more accurate outcomes. As a reminder, a
weak learner classifies data slightly better than random guessing. This approach
can provide robust results for prediction problems, and can even outperform
neural networks and support vector machines for tasks like image retrieval.
Boosting algorithms can differ in how they create and aggregate weak learners
during the sequential process. Three popular types of boosting methods
include:
Adaptive boosting or AdaBoost: Yoav Freund and Robert Schapire are credited
with the creation of the AdaBoost algorithm. This method operates iteratively,
identifying misclassified data points and adjusting their weights to minimize the
training error. The model continues to optimize in a sequential fashion until it yields the strongest predictor.
Gradient boosting: Building on the work of Leo Breiman, Jerome H.
Friedman developed gradient boosting, which works by sequentially adding
predictors to an ensemble with each one correcting for the errors of its
predecessor. However, instead of changing the weights of data points as AdaBoost does, gradient boosting trains on the residual errors of the previous predictor. The
name, gradient boosting, is used since it combines the gradient descent
algorithm and boosting method.
Extreme gradient boosting or XGBoost: XGBoost is an implementation of
gradient boosting that’s designed for computational speed and scale.
XGBoost leverages multiple cores on the CPU, allowing for learning to occur
in parallel during training.
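For illustration, a minimal gradient boosting sketch with scikit-learn's GradientBoostingClassifier (the toy data is made up; XGBoost exposes a similar fit/predict interface):

from sklearn.ensemble import GradientBoostingClassifier

X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Each new tree is fit to the residual errors of the current ensemble;
# learning_rate shrinks each tree's contribution.
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=2)
model.fit(X, y)
print(model.predict([[5, 5]]))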
AdaBoost:
Freund and Schapire (1996) proposed a variant, named AdaBoost, short for adaptive boosting, that uses the same training set over and over and thus need not be large; however, the classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base learners, not just three.
There are many machine learning algorithms to choose from for your problem statements. One
of these algorithms for predictive modeling is called AdaBoost.
What this algorithm does is build a model and give equal weights to all the data points. It then assigns higher weights to points that are wrongly classified. All the points with higher weights are given more importance in the next model. It keeps training models until the error becomes small.
AdaBoost algorithm
How AdaBoost Works (Simplified):
1. Initialization: Assign equal weights to all training data points.
2. Training: Train a weak learner on the weighted data.
3. Prediction: Predict the outcome for each data point.
4. Weight Adjustment: Increase the weight of misclassified data points.
5. Iteration: Repeat steps 2-4 until a stopping criterion is met (e.g., a predefined number of
iterations or a satisfactory level of accuracy).
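These steps can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a one-level decision tree (a stump); the toy data is made up:

from sklearn.ensemble import AdaBoostClassifier

X = [[1, 85], [1, 90], [0, 60], [0, 65], [1, 70], [0, 95]]
y = ["yes", "yes", "no", "no", "yes", "no"]

# n_estimators weak learners are trained sequentially; after each round
# the weights of misclassified samples are increased.
model = AdaBoostClassifier(n_estimators=50)
model.fit(X, y)
print(model.predict([[1, 80]]))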
The amount of say (alpha) of a weak learner is computed from its total error as alpha = ½ ln((1 − Total Error) / Total Error). When there is no misclassification, the Total Error is 0, so the amount of say (alpha) is a large number.
When the classifier predicts half right and half wrong, the Total Error is 0.5, and the importance (amount of say) of the classifier will be 0.
In the weight-update step, the exponent of alpha is negative when a sample is correctly classified (its weight shrinks) and positive when a sample is misclassified (its weight grows).
There are four correctly classified samples and 1 wrong one. Here, the sample weight of each data point is 1/5, and the amount of say/performance of the stump of Gender is 0.69.
New weights are computed as new weight = old weight × e^(±amount of say), with a negative exponent for correctly classified samples and a positive one for the misclassified sample. The new weight for each correctly classified sample is 1/5 × e^(−0.69) ≈ 0.1004, and for the misclassified sample it is 1/5 × e^(0.69) ≈ 0.3988.
Note
We know that the total sum of the sample weights must be equal to 1, but here if we sum up all
the new sample weights, we will get 0.8004. To bring this sum equal to 1, we will normalize
these weights by dividing all the weights by the total sum of updated weights, which is 0.8004.
So, after normalizing the sample weights, we get a dataset in which the weights sum to 1. A new dataset is then formed by sampling from the original data in proportion to these weights, and we see that the data point which was wrongly classified has been selected 3 times because it has a higher weight.
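The numbers in this worked example can be reproduced with a short sketch (5 samples, equal initial weights of 1/5, one sample misclassified):

import math

total_error = 1 / 5                                  # 1 of 5 samples misclassified
alpha = 0.5 * math.log((1 - total_error) / total_error)
print(round(alpha, 2))                               # 0.69 -> amount of say of the stump

w = 1 / 5                                            # initial sample weight
w_correct = w * math.exp(-alpha)                     # weight shrinks for correct samples
w_wrong = w * math.exp(alpha)                        # weight grows for the misclassified sample
total = 4 * w_correct + w_wrong                      # about 0.80 (0.8004 with the rounded values used above)

# Normalize so the weights again sum to 1
new_weights = [w_correct / total] * 4 + [w_wrong / total]
print([round(x, 3) for x in new_weights])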
Step 7: Repeat Previous Steps
Now this act as our new dataset, and we need to repeat all the above steps i.e.
Assign equal weights to all the data points.
Support Vector Machine (SVM)
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. Since the SVM creates a decision boundary between these two classes (cat and dog) and chooses extreme cases (support vectors), it will look at the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat.
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane.
SVM algorithm finds the closest point of the lines from both the classes. These points are
called support vectors.
The distance between the vectors and the hyperplane is called as margin. And the goal of
SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
This is done with the help of a kernel function. The simplest kernel is the linear kernel, computed as K(x, xi) = sum(x * xi).
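A minimal sketch of training an SVM classifier with scikit-learn's SVC; the toy points are made up, kernel="linear" fits a linear kernel, and kernel="rbf" could be used instead for non-linear data:

from sklearn.svm import SVC

# Toy, linearly separable data: two classes on either side of a boundary
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = SVC(kernel="linear", C=1.0)   # maximizes the margin around the hyperplane
model.fit(X, y)
print(model.predict([[3, 2]]))        # expected: ['cat']
print(model.support_vectors_)         # the extreme points that define the margin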
Application of SVM
SVM is most effective when it is used for binary classification, i.e. for solving a
machine learning problem with two classes.
One common problem on which SVM can be applied is in the field of
bioinformatics – more specifically, in detecting cancer and other genetic
disorders.
It can also be used in detecting the image of a face by binary classification of images into face and non-face components.