UNIT 4 Supervised Learning

Supervised Learning: Learning a Class from Examples

Supervised learning is a machine learning approach where an algorithm learns a function that
maps an input to an output based on labeled examples. It involves training a model on a set of
inputs and their corresponding correct outputs, allowing the model to learn the underlying
patterns and make predictions on unseen data.

Key Takeaways

• Supervised learning is a machine learning technique that predicts outputs based on labeled examples.
• It involves training a model using a dataset and learning the underlying patterns to make predictions on unseen data.
• Supervised learning is widely used in various applications, including image recognition, natural language processing, and fraud detection.

The Basics of Supervised Learning

Supervised learning starts with a dataset that contains input examples paired with
their correct outputs, also known as labels. The dataset is divided into a training set and a test
set. The training set is used to train the model, while the test set evaluates the model’s
performance. During training, the model learns to identify patterns and correlations between
input features and the corresponding output labels.

The Process of Supervised Learning

1. Preprocessing the data: This involves handling missing values, feature scaling, and
encoding categorical variables to prepare the dataset for training.
2. Splitting the dataset: The dataset is split into a training set and a test set to assess the
performance of the model on unseen data.

3. Selecting an algorithm: Choosing the most suitable algorithm for the given problem,
such as linear regression, decision trees, or support vector machines.

4. Training the model: The algorithm is trained on the training set by adjusting its
internal parameters to minimize the prediction errors.

5. Evaluating the model: The trained model is evaluated using metrics like accuracy,
precision, recall, or mean squared error, depending on the problem.

6. Making predictions: Once the model is trained and evaluated, it can make predictions
on new, unseen data points by applying the learned function.
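This workflow can be sketched in a few lines of scikit-learn. The snippet below is an illustrative sketch, not part of the original text; the Iris dataset and the decision-tree classifier are arbitrary choices used only to show the six steps.

# Minimal sketch of the supervised learning process described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 2: split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: preprocessing (feature scaling)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 3-4: select an algorithm and train it on the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 5: evaluate the trained model on unseen data
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: make a prediction for a new, unseen data point
print("prediction:", model.predict(X_test[:1]))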

The Importance of Supervised Learning

*Supervised learning enables machines to learn patterns and make predictions based on
existing knowledge.* By providing labeled examples, this approach allows the model to
make informed decisions and predictions in various domains.

Supervised learning has numerous applications across industries, including:

• Image recognition: Classifying images into different categories, such as identifying objects or recognizing faces.
• Natural language processing: Analyzing and understanding human language, enabling chatbots and voice assistants.
• Fraud detection: Identifying fraudulent transactions or activities based on historical data patterns.

Machine learning has fundamentally altered the way we handle complicated issues in a
variety of fields. The categorization of data points into specified classes or categories is one
of the fundamental tasks in machine learning. Multiclass classification expands on the idea of
binary classification by handling more than two classes.
Binary Classification:
• Binary classification involves a dataset with only two class instances.
• It requires only one classifier model.
• Confusion Matrix is easy to derive and understand.
• In a binary classification problem, you might predict whether an animal is Dog or Cat.
Multiclass Classification:
• Multiclass classification involves datasets with multiple class labels.
• The number of classifier models depends on the chosen classification technique.
• One vs. All: for N classes, N binary classifier models are required.
• One vs. One: for N classes, N * (N-1)/2 binary classifier models are required.
• A multiclass classification problem could involve classifying animals into categories such as Dog, Cat, and Snake.


Techniques to Implement Multiclass Classification
1. One vs. Rest:
In the One vs. Rest classification approach, tailored for a dataset with N distinct classes, N binary classifiers are generated, each corresponding to one specific class. The number of classifiers equals the number of unique class labels present in the dataset.
For instance, consider a dataset comprising three classes: Dog, Cat, and Snake. In adherence
to the One-vs-Rest strategy, three classifiers are created, each designed for its respective
class:
Classifier 1: Developed to distinguish instances of ‘Dog’ from a consolidated class of ‘Cat’
and ‘Snake’
• Positive class: ‘Dog’
• Negative classes: ‘Cat,’ ‘Snake’
Classifier 2: Developed to distinguish instances of ‘Cat’ from a consolidated class of ‘Dog’
and ‘Snake’
• Positive class: ‘Cat’
• Negative classes: ‘Dog,’ ‘Snake’
Classifier 3: Developed to distinguish instances of ‘Snake’ from a consolidated class of ‘Dog’
and ‘Cat.’
• Positive class: ‘Snake’
• Negative classes: ‘Dog,’ ‘Cat’
This approach ensures that each classifier specializes in recognizing one specific class while
collectively covering the entire set of classes present in the dataset. During prediction, the
class associated with the classifier exhibiting the highest confidence or probability is assigned
as the predicted class for a given input.

One vs. Rest
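As an illustrative sketch (not from the original text), scikit-learn's OneVsRestClassifier implements this strategy; the tiny two-feature animal dataset below is made up.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# made-up features for Dog, Cat and Snake examples
X = np.array([[5, 1], [6, 2], [1, 5], [2, 6], [9, 9], [8, 8]])
y = np.array(["Dog", "Dog", "Cat", "Cat", "Snake", "Snake"])

# one binary LogisticRegression classifier is fitted per class (3 in total)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))       # 3 binary classifiers
print(ovr.predict([[5.5, 1.5]]))  # the class whose classifier is most confident wins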


2. One vs. One:
In the One vs. One classification strategy tailored for a dataset with N distinct classes, a total
of N * (N-1) / 2 binary classifiers are generated. This approach involves creating a binary
classifier for every possible pair of classes. Consider a classification problem with three
types: Dog, Cat, and Snake (N=3).
For this scenario, we divide the primary dataset into N * (N-1) / 2 = 3 binary classifier
problems:
Classifier 1: Trained to distinguish between instances of ‘Dog’ and ‘Cat’
• Positive class: ‘Dog’
• Negative class: ‘Cat’
Classifier 2: Trained to distinguish between instances of ‘Dog’ and ‘Snake’
• Positive class: ‘Dog’
• Negative class: ‘Snake’
Classifier 3: Trained to distinguish between instances of ‘Cat’ and ‘Snake’
• Positive class: ‘Cat’
• Negative class: ‘Snake’
Each binary classifier exclusively predicts one class label. During the testing phase, when
inputting test data to the classifiers, the final result is determined based on a majority voting
mechanism. The class associated with the majority of votes across all binary classifiers is
considered the predicted class for the given input.
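As an illustrative sketch (not from the original text), scikit-learn's OneVsOneClassifier implements this pairwise strategy on the same made-up animal data.

import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X = np.array([[5, 1], [6, 2], [1, 5], [2, 6], [9, 9], [8, 8]])
y = np.array(["Dog", "Dog", "Cat", "Cat", "Snake", "Snake"])

# one binary classifier per pair of classes: N * (N - 1) / 2 = 3 in total
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_))       # 3 pairwise binary classifiers
print(ovo.predict([[8.5, 8.5]]))  # majority vote across the pairwise classifiers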

Generalization Error in Machine Learning


In supervised learning applications in machine learning and statistical learning theory, generalization error (also known as the out-of-sample error) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.
Notice that the gap between predictions and observed data is induced by model
inaccuracy, sampling error, and noise. Some of the errors are reducible but some are not. Choosing
the right algorithm and tuning parameters could improve model accuracy, but we will never be able to
make our predictions 100% accurate.
Bias-variance decomposition
An important way to understand generalization error is bias-variance decomposition.
Bias is the systematic part of the error: a model has a high bias when, for example, it fails to capture meaningful patterns in the data. Bias is measured by the difference between the expected predicted values and the observed values in the dataset D when the prediction variables are at the level of x (X = x).

In contrast with bias, variance reflects an algorithm's flexibility to learn patterns in the observed data: it is the amount by which the fitted model would change if a different dataset were used. A model has high variance when, for instance, it tries so hard to fit the data that it captures not only the meaningful patterns but also the meaningless noise (overfitting).

As a result, a model's generalization error indicates how well it generalizes to unknown data. It can be broken down into three components: bias, variance, and irreducible error, which is the error contribution of noise.
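For reference, under squared-error loss this decomposition is commonly written as follows (a standard result; here \hat{f} is the fitted model, f the true function, and \sigma^2 the noise variance):

E\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}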

Generalization error can be measured by the MSE. As model capacity increases, the bias decreases because the model fits the training data better. However, the variance increases: as the model becomes sophisticated enough to fit more patterns of the current dataset, changing the dataset (even to another one drawn from the same distribution) has a larger impact on the fitted model.
VC Dimension

In the world of machine learning, understanding the complexity of a model is crucial for
ensuring good performance on new, unseen data. The Vapnik-Chervonenkis (VC) dimension
is a fundamental concept that helps quantify the complexity of a hypothesis class. Introduced
by Vladimir Vapnik and Alexey Chervonenkis, this measure provides insight into a model’s
capacity to fit a variety of functions. This blog explores the concept of VC dimension, its
mathematical foundations, and its implications for machine learning models and
generalization.
What is VC Dimension?

The VC dimension of a hypothesis class H is defined as the maximum number of points that can be shattered by H. A set of points is said to be shattered by H if, for every possible subset of these points, there exists a hypothesis in H that correctly classifies the points in the subset. In simpler terms, the VC dimension measures the capacity of a model to fit different data configurations.
Example: Consider a set of points in a two-dimensional space. If a linear classifier can classify all possible subsets of these points correctly, the set is shattered by the hypothesis class of linear classifiers. For instance, a linear classifier can shatter any three points in general position (not all on a single line), giving a VC dimension of 3 for this class.
Mathematical Foundations of VC Dimension
1. Formal Definition
Formally, the VC dimension of a hypothesis class H is the largest number d such that there exists a set of d points that can be shattered by H. If no such d exists, the VC dimension is infinite.
Example: In the case of linear classifiers in a 2D space, the VC dimension is 3. This is because a linear classifier can correctly classify all possible labelings of some set of three points, but not of any set of four points.
2. Shattering
To understand shattering, consider a set of points {x1, x2, …, xd}. The hypothesis class H shatters this set if, for every possible labeling of the points (i.e., every possible assignment of +1 or -1 to each point), there exists a hypothesis in H that correctly classifies the points according to that labeling.
Example: For a set of two points, a linear classifier can shatter the set because it can draw a line to separate the points in both possible ways. However, for three collinear points, a linear classifier cannot shatter the set, as it cannot separate the middle point from the others without misclassifying at least one point.
Importance of VC Dimension
1. Model Complexity and Generalization
The VC dimension provides a measure of model complexity. A higher VC dimension
indicates a more complex model that can fit more intricate patterns in the data. However,
higher complexity also increases the risk of overfitting, where the model captures noise instead of the
underlying data distribution. Understanding the VC dimension helps in selecting models that balance
complexity and generalization.
Example: In practice, a polynomial classifier with a high degree can shatter more points than a linear
classifier, indicating higher complexity. However, this increased capacity can lead to overfitting if not
managed properly.
2. Sample Complexity
The VC dimension directly influences the sample complexity of a learning algorithm — the number
of training examples required to ensure good generalization. According to statistical learning theory,
the number of samples needed to learn a hypothesis with high probability and low error is
proportional to the VC dimension of the hypothesis class.
Example: If a model has a high VC dimension, more training examples are needed to achieve the
same level of confidence and accuracy compared to a model with a lower VC dimension.

Let H = {set of linear classifiers in 2 dimensions}.
Then any 3 points in general position can be classified correctly by H with a separating hyperplane. That is why the VC dimension of H is 3: for any 4 points in the 2D plane, a linear classifier cannot shatter all possible labelings of the points; for at least one labeling of such a set, no separating hyperplane can be drawn. So the VC dimension is 3.
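The shattering argument can also be checked empirically. The sketch below is my own illustration (not from the original text): for every possible +1/-1 labeling of a small point set, it fits a linear classifier with a very large C (to approximate a hard margin) and tests whether the labeling can be reproduced exactly.

import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    # return True if a linear classifier can realize every labeling of the points
    points = np.array(points)
    for labels in itertools.product([-1, 1], repeat=len(points)):
        labels = np.array(labels)
        if len(set(labels.tolist())) == 1:
            continue  # a single-class labeling is trivially realizable
        clf = SVC(kernel="linear", C=1e6)
        clf.fit(points, labels)
        if not np.array_equal(clf.predict(points), labels):
            return False
    return True

# three points in general position can be shattered by linear classifiers ...
print(can_shatter([[0, 0], [1, 0], [0, 1]]))            # True
# ... but three collinear points cannot (the middle point cannot be isolated)
print(can_shatter([[0, 0], [1, 0], [2, 0]]))            # False
# and no set of four points in 2D can be shattered, hence VC dimension 3
print(can_shatter([[0, 0], [1, 0], [0, 1], [1, 1]]))    # False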
Model Assumptions in Regression

Regression is a parametric approach: it makes assumptions about the data for the purpose of analysis. Because of this parametric nature, regression is restrictive; it fails to deliver good results on datasets that do not fulfill its assumptions. Therefore, for a successful regression analysis, it is essential to validate these assumptions.

Let’s look at the important assumptions in regression analysis:

1. There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s). A linear relationship suggests that the change in response Y due to a one-unit change in X1 is constant, regardless of the value of X1. An additive relationship suggests that the effect of X1 on Y is independent of the other variables.

2. There should be no correlation between the residual (error) terms. The presence of correlation among the error terms is known as autocorrelation.

3. The independent variables should not be correlated with each other. The presence of such correlation is known as multicollinearity.

4. The error terms must have constant variance. This property is known as homoskedasticity; the presence of non-constant variance is referred to as heteroskedasticity.

5. The error terms must be normally distributed.
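Some of these assumptions can be checked programmatically. The sketch below is an illustration (not from the original text) using statsmodels on made-up synthetic data; the thresholds in the comments are common rules of thumb.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # two made-up predictors
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()

# Assumption 2: no autocorrelation in the residuals (Durbin-Watson near 2 is good)
print("Durbin-Watson:", durbin_watson(results.resid))

# Assumption 3: no multicollinearity (VIF near 1 is good; above roughly 10 is a warning sign)
for i in range(1, X_const.shape[1]):
    print("VIF for predictor", i, ":", variance_inflation_factor(X_const, i))

# Assumption 4 (constant variance) is usually checked by plotting residuals vs. fitted
# values, and Assumption 5 (normal residuals) with a Q-Q plot or a normality test.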


Regularization

This is a form of regression that constrains, regularizes, or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

A simple relation for linear regression looks like this. Here Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X):

Y ≈ β0 + β1X1 + β2X2 + …+ βpXp

The fitting procedure involves a loss function, known as the residual sum of squares or RSS. The coefficients are chosen such that they minimize this loss function.

Now, this will adjust the coefficients based on your training data. If
there is noise in the training data, then the estimated coefficients won’t generalize well to the
future data. This is where regularization comes in and shrinks or regularizes these learned
estimates towards zero.

Ridge Regression
Ridge regression modifies the RSS by adding a shrinkage quantity: the sum of the squared coefficients multiplied by a tuning parameter λ. The coefficients are then estimated by minimizing this penalized function. Here, λ decides how much we want to penalize the flexibility of our model. The increase in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the penalized function, these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except the intercept β0; this intercept is a measure of the mean value of the response when xi1 = xi2 = …= xip = 0.

When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares. However, as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach zero. As can be seen, selecting a good value of λ is critical; cross-validation comes in handy for this purpose. The penalty used by this method is based on the L2 norm.

The coefficients produced by the standard least squares method are scale equivariant, i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c. Therefore, regardless of how the predictor is scaled, the product of predictor and coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and therefore we need to standardize the predictors, i.e. bring them to the same scale (typically by subtracting each predictor's mean and dividing by its standard deviation), before performing ridge regression.
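A minimal sketch of ridge regression with standardized predictors, using scikit-learn on made-up data (scikit-learn's alpha parameter plays the role of the tuning parameter λ; RidgeCV could be used to choose it by cross-validation):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.3, size=50)

# StandardScaler brings all predictors to the same scale before the penalty is applied
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)       # shrunken coefficient estimates
print(model.named_steps["ridge"].intercept_)  # the intercept is not penalized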
Lasso

Lasso is another variation, in which the RSS plus a penalty proportional to the sum of the absolute values of the coefficients is minimized. This variation differs from ridge regression only in how it penalizes high coefficients: it uses |βj| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.

A standard least squares model tends to have some variance in it, i.e. the model won't generalize well for a dataset different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. So the tuning parameter λ, used in the regularization techniques described above, controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias and thus underfitting. Therefore, the value of λ should be carefully selected.
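The practical difference between the two penalties can be seen in a small sketch (made-up data and arbitrary alpha values, not from the original text): the lasso drives the coefficients of irrelevant predictors exactly to zero, while ridge only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(100, 5)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only two predictors matter

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, none exactly zero
print(Lasso(alpha=0.2).fit(X, y).coef_)  # irrelevant coefficients driven to exactly zero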

Elastic Net Regression


Linear regression refers to a model that assumes a linear relationship between the input variables and the target variable. With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum of squared errors between the predictions (yhat) and the expected target values (y).

loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that the estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (samples), or with more input predictors (p) than samples (n) (so-called p >> n problems).

One approach to addressing the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

One popular penalty is to penalize a model based on the sum of the squared coefficient values. This is called an L2 penalty. An L2 penalty minimizes the size of all coefficients, but it does not remove any coefficients from the model (none are set exactly to zero).

l2_penalty = sum j=0 to p beta_j^2

Another popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be driven exactly to zero, which removes the corresponding predictor from the model.

l1_penalty = sum j=0 to p abs(beta_j)


Elastic net is a penalized linear regression model that includes both
the L1 and L2 penalties during training.

A hyperparameter “alpha” is provided to assign how much weight is given to each of the L1 and L2 penalties. Alpha is a value between 0 and 1: alpha weights the contribution of the L1 penalty, and (1 – alpha) weights the L2 penalty.
elastic_net_penalty = (alpha * l1_penalty) + ((1 – alpha) * l2_penalty)
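A minimal sketch of elastic net in scikit-learn on made-up data. Note a naming difference: scikit-learn's l1_ratio corresponds to the “alpha” mixing weight described above, while its alpha parameter controls the overall penalty strength.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# l1_ratio=0.5 gives equal weight to the L1 and L2 penalties
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_, model.intercept_)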

Principle of Naive Bayes Classifier:

A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on Bayes' theorem.

Bayes Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent; that is, the presence of one particular feature does not affect the others. Hence it is called naive.

Example:

Let us take an example to get some better intuition. Consider the problem of playing golf, with a dataset whose columns represent the features of the day (outlook, temperature, humidity, windy) and whose rows represent individual entries. We classify whether the day is suitable for playing golf, given these features. If we take the first row of the dataset, we can observe that the day is not suitable for playing golf when the outlook is rainy, the temperature is hot, the humidity is high, and it is not windy. We make two assumptions here. First, as stated above, we consider these predictors to be independent; that is, if the temperature is hot, it does not necessarily mean that the humidity is high. The second assumption is that all the predictors have an equal effect on the outcome; that is, the day being windy does not have more importance than the other features in deciding whether to play golf.

According to this example, Bayes' theorem can be rewritten as:

P(y|X) = P(X|y) * P(y) / P(X)

The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not given the conditions. The variable X represents the parameters/features.

X is given as:

X = (x_1, x_2, …, x_n)

Here x_1, x_2, …, x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule together with the independence assumption, we get:

P(y|x_1, …, x_n) = P(x_1|y) * P(x_2|y) * … * P(x_n|y) * P(y) / (P(x_1) * P(x_2) * … * P(x_n))

Now, you can obtain the values for each term by looking at the dataset and substituting them into the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality introduced:

P(y|x_1, …, x_n) ∝ P(y) * P(x_1|y) * P(x_2|y) * … * P(x_n|y)

In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the classification is multiclass. Therefore, we need to find the class y with maximum probability:

y = argmax over y of P(y) * P(x_1|y) * P(x_2|y) * … * P(x_n|y)

Using the above function, we can obtain the class, given the predictors.
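A hedged sketch of the play-golf example with a categorical naive Bayes model in scikit-learn. Only the first row below comes from the text; the remaining rows and the new day are made up for illustration.

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# features: outlook, temperature, humidity, windy
X_raw = [
    ["Rainy", "Hot", "High", "False"],      # first row from the text: not suitable
    ["Rainy", "Hot", "High", "True"],
    ["Overcast", "Hot", "High", "False"],
    ["Sunny", "Mild", "High", "False"],
    ["Sunny", "Cool", "Normal", "False"],
    ["Sunny", "Cool", "Normal", "True"],
    ["Overcast", "Cool", "Normal", "True"],
]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes"]

encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)   # encode the categories as integers

model = CategoricalNB()            # estimates P(y) and P(x_i | y) from counts
model.fit(X, y)

# predict whether a new day is suitable for playing golf
new_day = encoder.transform([["Sunny", "Hot", "Normal", "False"]])
print(model.predict(new_day), model.predict_proba(new_day))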

K-Nearest Neighbor(KNN) Algorithm for Machine Learning


o K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on Supervised Learning technique.
o K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification
problems.
o K-NN is a non-parametric algorithm, which means it does not
make any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1: to which of these categories will this data point belong? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below
algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of
neighbors
o Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
o Step-4: Among these k neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A, hence this new data point must belong to category A.
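A short scikit-learn sketch of these steps on a made-up two-category dataset (not from the original text), with k = 5:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [3, 2],   # Category A
              [7, 7], [8, 7], [7, 8], [8, 8], [9, 8]])  # Category B
y = ["A"] * 5 + ["B"] * 5

knn = KNeighborsClassifier(n_neighbors=5)   # Step-1: choose K
knn.fit(X, y)                               # "training" just stores the dataset

new_point = [[3, 3]]
# Steps 2-5: compute Euclidean distances, take the 5 nearest neighbours,
# count them per category and assign the majority category
print(knn.predict(new_point))                    # expected: 'A'
print(knn.kneighbors(new_point, n_neighbors=5))  # distances and indices of the neighbours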

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. A commonly used starting value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and make the model sensitive to outliers.
o Larger values for K smooth out noise, but if K is too large the model may have difficulty capturing local class boundaries.
Advantages of KNN Algorithm:
o It is simple to implement.

o It is robust to noisy training data.

o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o It always needs the value of K to be determined, which can be complex at times.
o The computation cost is high because the distance between the new data point and all the training samples must be calculated.

Logistic Regression

Logistic regression is a supervised algorithm used when the value of the target variable is categorical in nature. Logistic regression is most commonly used when the data in question has a binary output, i.e. when each example belongs to one class or the other and is labeled either 0 or 1.
Remember that classification tasks have discrete categories, unlike regression tasks.

Here, starting from the idea of using a regression model to solve the classification problem, a natural question is whether we can draw a hypothesis function that fits a binary dataset. For simplicity, we consider only the binary classification problem.

The answer is that you will have to use a type of function, different
from linear functions, called a logistic function, or a sigmoid
function.

Sigmoid Function
The sigmoid function (logistic function) is a function that resembles an "S" shaped curve when plotted on a graph. It takes any real-valued input and "squishes" it to a value between 0 and 1, pushing values towards the margins at the top and bottom of the curve so they can be labeled as 0 or 1.
The equation for the sigmoid function is:

sigmoid(z) = 1 / (1 + e^(-z))

Let's see how the sigmoid function represents the given dataset. It gives a value extremely close to 0 if z is a large negative value and close to 1 if z is a large positive value. In logistic regression, a typical linear function of the inputs is first computed and then run through the sigmoid, so that the squeezed output can be used to put the inputs into distinct categories.

It is important to understand that logistic regression should only be


used when the target variables fall into discrete categories and that if
there’s a range of continuous values the target value might be, logistic
regression should not be used. Examples of situations you might use
logistic regression in include:

• Predicting if an email is spam or not spam
• Whether a tumor is malignant or benign
• Whether a mushroom is poisonous or edible.

When using logistic regression, a threshold is usually specified that


indicates at what value the example will be put into one class vs. the
other class. In the spam classification task, a threshold of 0.5 might be
set, which would cause an email with a 50% or greater probability of
being spam to be classified as “spam” and any email with probability
less than 50% classified as “not spam”.
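A minimal sketch of logistic regression for a binary spam-style problem, using scikit-learn on made-up one-dimensional data (not from the original text):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-3.0], [-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0], [3.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = "not spam", 1 = "spam"

model = LogisticRegression()
model.fit(X, y)

# the model outputs a probability via the sigmoid of a linear function of x;
# a 0.5 threshold turns that probability into a class label
probs = model.predict_proba([[0.2], [-2.5]])[:, 1]
print(probs)
print((probs >= 0.5).astype(int))      # apply the 0.5 threshold manually
print(model.predict([[0.2], [-2.5]]))  # predict() applies the same default threshold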

Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms that are used both for classification and regression, but generally they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their unique way of implementation as compared to other machine learning algorithms. Lately, they have become extremely popular because of their ability to handle multiple continuous and categorical variables.
Working of SVM
An SVM model is basically a representation of different classes in a
hyperplane in multidimensional space. The hyperplane will be
generated in an iterative manner by SVM so that the error can be
minimized. The goal of SVM is to divide the datasets into classes to
find a maximum marginal hyperplane (MMH).

The following are important concepts in SVM −

• Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
• Hyperplane − A decision plane or space that divides a set of objects having different classes.
• Margin − The gap between the two lines through the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

The main goal of SVM is to divide the datasets into classes by finding a maximum marginal hyperplane (MMH), which can be done in the following two steps −

• First, SVM generates hyperplanes iteratively that segregate the classes in the best way.
• Then, it chooses the hyperplane that separates the classes correctly with the largest margin.
SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions. This makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM.

Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear kernel is as below −

K(x, xi) = sum(x * xi)

From the above formula, we can see that the product between two vectors x and xi is the sum of the multiplication of each pair of input values.

Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

K(X, Xi) = (1 + sum(X * Xi))^d

Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.
Pros and Cons of SVM Classifiers
Pros of SVM classifiers
SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM classifiers use only a subset of the training points, and as a result use very little memory.
Cons of SVM classifiers
They have a high training time, so in practice they are not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.
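A minimal sketch of SVM classification with the linear and polynomial kernels in scikit-learn, on a made-up two-class dataset (not from the original text):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel="linear").fit(X, y)
poly_svm = SVC(kernel="poly", degree=3).fit(X, y)   # degree corresponds to d above

print(linear_svm.support_vectors_)   # the data points closest to the separating hyperplane
print(linear_svm.predict([[4, 4]]))
print(poly_svm.predict([[4, 4]]))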

Decision Tree:

A decision tree is a flowchart-like structure in which each internal


node represents a test on a feature (e.g. whether a coin flip comes up
heads or tails) , each leaf node represents a class label (decision taken
after computing all features) and branches represent conjunctions of
features that lead to those class labels. The paths from root to leaf
represent classification rules. The diagram below illustrates the basic flow of a decision tree for decision making, with labels Rain (Yes) and No Rain (No).
Decision Tree for Rain Forecasting

Decision tree is one of the predictive modelling approaches used


in statistics, data mining and machine learning.

Decision trees are constructed via an algorithmic approach that


identifies ways to split a data set based on different conditions. It is
one of the most widely used and practical methods for supervised
learning. Decision Trees are a non-parametric supervised
learning method used for both classification and regression tasks.

Tree models where the target variable can take a discrete set of values
are called classification trees. Decision trees where the target variable
can take continuous values (typically real numbers) are
called regression trees. Classification And Regression Tree (CART) is a general term for this.

Throughout this post I will try to explain using examples.


Data Format

Data comes in records of the form:

(x, Y) = (x1, x2, x3, …, xk, Y)

The dependent variable, Y, is the target variable that we are trying to


understand, classify or generalize. The vector x is composed of the
features, x1, x2, x3 etc., that are used for that task.

Approach to make decision tree

While making a decision tree, at each node of the tree we ask different types of questions. Based on the question asked, we calculate the corresponding information gain.

Information Gain

Information gain is used to decide which feature to split on at each


step in building the tree. Simplicity is best, so we want to keep our tree
small. To do so, at each step we should choose the split that results in
the purest daughter nodes. A commonly used measure of purity is
called information. For each node of the tree, the information
value measures how much information a feature gives us about the
class. The split with the highest information gain will be taken as
the first split, and the process will continue until all child nodes
are pure, or until the information gain is 0.

The algorithm for constructing a decision tree usually works top-down, by choosing at each step the variable that best splits the set of items. Different algorithms use different metrics for measuring the "best" split.
Gini Impurity

First let’s understand the meaning of Pure and Impure.

Pure

Pure means that, in a selected sample of the dataset, all the data belongs to the same class (PURE).

Impure

Impure means the data is a mixture of different classes.

Definition of Gini Impurity

Gini Impurity is a measurement of the likelihood of an incorrect


classification of a new instance of a random variable, if that new
instance were randomly classified according to the distribution of class
labels from the data set.

If our dataset is pure, then the likelihood of incorrect classification is 0. If our sample is a mixture of different classes, then the likelihood of incorrect classification will be high.

Steps for Making decision tree

• Get the list of rows (the dataset) which are taken into consideration for making the decision tree (recursively at each node).

• Calculate the uncertainty of our dataset, i.e. its Gini impurity, or how mixed up the data is.

• Generate the list of all questions which need to be asked at that node.

• Partition the rows into True rows and False rows based on each question asked.

• Calculate the information gain based on the Gini impurity and the partition of data from the previous step.

• Update the highest information gain seen so far based on each question asked.

• Update the best question based on the information gain (the higher the information gain, the better).

• Divide the node on the best question. Repeat from the first step until we get pure nodes (leaf nodes).
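The Gini impurity and information gain calculations used in these steps can be written in a few lines of Python. The functions below are my own illustration (the names and the toy labels are assumptions, not from the text):

from collections import Counter

def gini_impurity(labels):
    # probability of misclassifying a randomly drawn item if it were labeled
    # randomly according to the class distribution of the node
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

def information_gain(parent_labels, true_labels, false_labels):
    # impurity of the parent node minus the weighted impurity of the two partitions
    p = len(true_labels) / len(parent_labels)
    return gini_impurity(parent_labels) - (
        p * gini_impurity(true_labels) + (1 - p) * gini_impurity(false_labels)
    )

labels = ["Yes", "Yes", "No", "No", "No"]
print(gini_impurity(labels))                   # 0.48 for this mixed (impure) node
print(gini_impurity(["Yes", "Yes", "Yes"]))    # 0.0 for a pure node
# a question that splits the rows into ["Yes", "Yes"] and ["No", "No", "No"]
print(information_gain(labels, ["Yes", "Yes"], ["No", "No", "No"]))  # 0.48, a perfect split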

Advantage of Decision Tree

• Easy to use and understand.

• Can handle both categorical and numerical data.

• Resistant to outliers, hence require little data preprocessing.

Disadvantage of Decision Tree

• Prone to overfitting.

• Require some kind of measurement as to how well they are doing.

• Need to be careful with parameter tuning.

• Can create biased learned trees if some classes dominate.


Random Forest Algorithm

Random Forest is a popular machine learning algorithm that belongs to


the supervised learning technique. It can be used for both Classification
and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.

As the name suggests, “Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset”. Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, predicts the final output.
A greater number of trees in the forest generally leads to higher accuracy and helps prevent the problem of overfitting.

The below diagram explains the working of the Random Forest


algorithm:
Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of
the dataset, it is possible that some decision trees may predict the
correct output, while others may not. But together, all the trees predict
the correct output. Therefore, below are two assumptions for a better
Random forest classifier:

o There should be some actual values in the feature variable of the


dataset so that the classifier can predict accurate results rather
than a guessed result.
o The predictions from each tree must have very low correlations.

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points
(Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
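The sketch below is an illustration of this two-phase idea using scikit-learn's RandomForestClassifier on the built-in Iris dataset (an arbitrary choice, not from the original text):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# n_estimators is the number N of decision trees; each tree is trained on a
# bootstrap sample (random subset) of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# the forest aggregates the trees' predictions (scikit-learn averages their
# predicted class probabilities, which acts like a soft majority vote)
print("accuracy:", accuracy_score(y_test, forest.predict(X_test)))
print("prediction for one new point:", forest.predict(X_test[:1]))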

The working of the algorithm can be better understood by the below


example:
Example: Suppose there is a dataset that contains multiple fruit
images. So, this dataset is given to the Random forest classifier. The
dataset is divided into subsets and given to each decision tree. During
the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the
Random Forest classifier predicts the final decision. Consider the
below image:

Advantages of Random Forest


o Random Forest is capable of performing both Classification and
Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting
issue.

Disadvantages of Random Forest


o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
