
UNIT-II

Supervised Learning
(Regression/Classification)
Distance-based Methods
Distance-based models are the second class of
Geometric models. Like Linear models, distance-based
models are based on the geometry of data.
 As the name implies, distance-based models work on the concept of distance. In the context of Machine Learning, distance is not merely the physical distance between two points; instead, we can think of the distance between two points in terms of the mode of travel between them.
 Travelling between two cities by plane covers less physical distance than travelling by train, because a plane is unrestricted by terrain or tracks.
 Similarly, in chess, the concept of distance depends
on the piece used – for example, a Bishop can move
diagonally.
 Thus, depending on the entity and the mode of travel,
the concept of distance can be experienced differently.
The distance metrics commonly used
are Euclidean, Minkowski, Manhattan,
and Mahalanobis.
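As a quick illustration, these metrics can be computed with SciPy; the two 2-D points and the small sample used for the Mahalanobis covariance below are hypothetical values chosen only for this sketch.

import numpy as np
from scipy.spatial import distance

# Two example points in a 2-D feature space (hypothetical values)
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(distance.euclidean(a, b))       # sqrt((4-1)^2 + (6-2)^2) = 5.0
print(distance.cityblock(a, b))       # Manhattan distance: |4-1| + |6-2| = 7.0
print(distance.minkowski(a, b, p=3))  # Minkowski distance of order 3

# Mahalanobis distance needs the inverse covariance of the data the points come from
X = np.array([[1.0, 2.0], [4.0, 6.0], [2.0, 3.0], [3.0, 5.0]])
VI = np.linalg.inv(np.cov(X.T))
print(distance.mahalanobis(a, b, VI))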
Distance is applied through the concept
of neighbours and exemplars. Neighbours are
points in proximity with respect to the distance
measure expressed through exemplars.
 Exemplars are either centroids that find a centre of
mass according to a chosen distance metric
or medoids that find the most centrally located data
point.
The most commonly used centroid is the arithmetic
mean, which minimises squared Euclidean distance to
all other points.
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of all the points in the figure. This definition extends to any object in n-dimensional space: its centroid is the mean position of all the points.
Medoids are similar in concept to means or centroids.
Medoids are most commonly used on data when a mean
or centroid cannot be defined. They are used in contexts
where the centroid is not representative of the dataset,
such as in image data.
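As a small sketch of the difference, the centroid is the arithmetic mean of the points, while the medoid is the actual data point with the smallest total distance to the others; the cluster below is made up for illustration.

import numpy as np

# A small hypothetical cluster of 2-D points (the last point is an outlier)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [8.0, 8.0]])

# Centroid: the arithmetic mean, which minimises squared Euclidean distance
centroid = X.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to all others
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoid = X[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)  # need not coincide with any data point
print("medoid:  ", medoid)    # always one of the original points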
Examples of distance-based models include the nearest-
neighbour models, which use the training data as
exemplars – for example, in classification.
The K-means clustering algorithm also uses exemplars
to create clusters of similar data points.
K-Nearest Neighbours
K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification
problems.
K-NN is a non-parametric algorithm, which means
it does not make any assumption on underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of
the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number
of neighbors
Step-3: Take the K nearest neighbors as per the
calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of
the data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.
Suppose we have a new data point and we need to put
it in the required category. Consider the below image:
Firstly, we will choose the number of neighbors, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
By calculating the Euclidean distance, we got the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:
As we can see, the 3 nearest neighbors are from Category A; hence this new data point must belong to Category A.
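The same steps can be tried in a few lines with scikit-learn; the training points and labels below are invented purely for illustration, with Category A coded as 0 and Category B as 1.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points for Category A (0) and Category B (1)
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=5)  # Step-1: choose K = 5
knn.fit(X_train, y_train)                  # lazy learner: this just stores the data

# Steps 2-5 (distances, neighbour counting, majority vote) happen inside predict()
print(knn.predict([[3, 3]]))               # -> [0], i.e. Category A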
How to select the value of K in the K-NN
Algorithm?
Below are some points to remember while selecting
the value of K in the K-NN algorithm:
There is no particular way to determine the best value
for "K", so we need to try some values to find the best
out of them. The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
Large values for K are good, but a value that is too large can over-smooth the decision boundary and increase the computation required.
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
It always needs to determine the value of K, which may be complex at times.
The computation cost is high because the distance to all the training samples must be calculated for each new data point.
Decision Trees
Decision Tree is a Supervised learning technique that can
be used for both classification and Regression problems, but
mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent
the decision rules and each leaf node represents the
outcome.
In a Decision tree, there are two nodes, which are
the Decision Node and Leaf Node. Decision nodes are
used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do
not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical representation for getting all the
possible solutions to a problem/decision based on given
conditions.
It is called a decision tree because, similar to a tree, it starts
with the root node, which expands on further branches and
constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which
stands for Classification and Regression Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
Below diagram explains the general structure of a decision
tree:
Why use Decision Trees?
There are various algorithms in Machine learning, so
choosing the best algorithm for the given dataset and
problem is the main point to remember while creating
a machine learning model. Below are the two reasons
for using the Decision tree:
Decision Trees usually mimic the human thinking process while making a decision, so they are easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies:
Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given
conditions.
Branch/Sub Tree: A subtree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given
dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root
attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
Example: Suppose there is a candidate who has a job
offer and wants to decide whether he should accept
the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by
ASM). The root node splits further into the next
decision node (distance from the office) and one leaf
node based on the corresponding labels. The next
decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision
node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:
Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
 To solve such problems there is a technique called the Attribute Selection Measure, or ASM. By this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
Information Gain
Gini Index
1. Information Gain:
Information gain is the measurement of changes in entropy
after the segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us
about a class.
According to the value of information gain, we split the
node and build the decision tree.
A decision tree algorithm always tries to maximize the
value of information gain, and a node/attribute having the
highest information gain is split first. It can be calculated
using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity
in a given attribute. It specifies randomness in data.
Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
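As a small numeric sketch of these formulas (the class counts below are made up), entropy and information gain for one candidate split can be computed as:

import numpy as np

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); 0*log2(0) is taken as 0
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

# Hypothetical parent node: 9 "yes" and 5 "no" samples
parent = entropy(9 / 14, 5 / 14)                       # ~0.940

# A split with two branches: (6 yes, 2 no) and (3 yes, 3 no)
left = entropy(6 / 8, 2 / 8)                           # ~0.811
right = entropy(3 / 6, 3 / 6)                          # 1.0

info_gain = parent - (8 / 14 * left + 6 / 14 * right)  # weighted-average form
print(round(info_gain, 3))                             # ~0.048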
2. Gini Index:
Gini index is a measure of impurity or purity used
while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index.
The CART algorithm creates only binary splits, and it uses the Gini index to choose them.
The Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)²
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes
from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, and a
small tree may not capture all the important features
of the dataset. Therefore, a technique that decreases
the size of the learning tree without reducing accuracy
is known as Pruning.
There are mainly two types of tree pruning techniques used:
Cost Complexity Pruning
Reduced Error Pruning.
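For reference, scikit-learn's DecisionTreeClassifier implements CART and exposes cost-complexity pruning through its ccp_alpha parameter; the sketch below uses the built-in Iris data, and the value 0.02 is an arbitrary choice for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CART tree with the Gini index as the attribute selection measure
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)

# Cost-complexity pruning: a larger ccp_alpha removes more branches
pruned_tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02,
                                     random_state=0).fit(X_train, y_train)

print(full_tree.get_depth(), pruned_tree.get_depth())  # the pruned tree is typically shallower
print(pruned_tree.score(X_test, y_test))               # accuracy on held-out data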
Advantages of the Decision Tree
It is simple to understand, as it follows the same process a human follows while making a decision in real life.
It can be very useful for solving decision-related problems.
It helps to think about all the possible outcomes for a problem.
It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
The decision tree contains lots of layers, which makes it
complex.
It may have an overfitting issue, which can be resolved using
the Random Forest algorithm.
For more class labels, the computational complexity of the
decision tree may increase.
Naive Bayes
Naïve Bayes algorithm is a supervised learning algorithm,
which is based on Bayes theorem and used for solving
classification problems.
It is mainly used in text classification that includes a high-
dimensional training dataset.
Naïve Bayes Classifier is one of the simplest and most effective Classification algorithms, helping to build fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on
the basis of the probability of an object.
Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognised as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the
principle of Bayes' Theorem.
 Bayes' Theorem:
 Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.
 The formula for Bayes' theorem is given as:
P(A|B) = [P(B|A) × P(A)] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing
the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with
the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1) Convert the given dataset into frequency tables.
2) Generate Likelihood table by finding the probabilities of
given features.
3) Now, use Bayes theorem to calculate the posterior
probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:
Frequency table for the weather conditions:
Likelihood table for the weather conditions:
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
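The same arithmetic can be reproduced directly in Python; the counts below are the frequency table assumed by this example (14 records: 10 "Yes" and 4 "No", of which 3 and 2 respectively are Sunny).

# Frequency counts from the weather table used above (hypothetical 14-record table)
total, yes, no = 14, 10, 4
sunny_yes, sunny_no = 3, 2

p_yes, p_no = yes / total, no / total      # priors: ~0.71 and ~0.29
p_sunny = (sunny_yes + sunny_no) / total   # evidence: ~0.35

p_yes_given_sunny = (sunny_yes / yes) * p_yes / p_sunny
p_no_given_sunny = (sunny_no / no) * p_no / p_sunny

# ~0.60 vs ~0.40 (the 0.41 above comes from rounding the intermediate values)
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")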
Advantages of Naïve Bayes Classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for Binary as well as Multi-class
Classifications.
It performs well in Multi-class predictions as compared
to the other Algorithms.
It is the most popular choice for text classification
problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between
features.
Applications of Naïve Bayes Classifier:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve
Bayes Classifier is an eager learner.
It is used in Text classification such as Spam
filtering and Sentiment analysis.
Linear Models
Linear Regression
Linear regression is one of the easiest and most popular
Machine Learning algorithms. It is a statistical method
that is used for predictive analysis.
Linear regression makes predictions for continuous/real
or numeric variables such as sales, salary, age, product
price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression.
Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the below image:
Mathematically, we can represent a linear regression
as:
y = a0 + a1x + ε
Here,
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for x and y variables are training datasets for
Linear Regression model representation.
Types of Linear Regression
Linear regression can be further divided into two types
of the algorithm:
Simple Linear Regression:
If a single independent variable is used to predict the
value of a numerical dependent variable, then such a
Linear Regression algorithm is called Simple Linear
Regression.
Multiple Linear regression:
If more than one independent variable is used to
predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called
Multiple Linear Regression.
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
Positive Linear Relationship:
If the dependent variable increases on the Y-axis and
independent variable increases on X-axis, then such a
relationship is termed as a Positive linear relationship.
Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and
independent variable increases on the X-axis, then
such a relationship is called a negative linear
relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this, we use a cost function.
Cost function
The cost function is used to estimate the values of the coefficients for the best fit line.
The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing.
We can use the cost function to find the accuracy of
the mapping function, which maps the input
variable to the output variable. This mapping function
is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:
MSE = (1/N) Σ (yi − (a0 + a1xi))²
Where N is the total number of observations, yi is the actual value of the i-th observation, and (a0 + a1xi) is the corresponding predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so will the cost function. If the scatter points are close to the regression line, the residuals will be small and hence the cost function will be small.
Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
This is done by randomly selecting initial coefficient values and then iteratively updating them to reach the minimum of the cost function.
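A minimal gradient-descent sketch for simple linear regression; the toy data below is invented and roughly follows y = 2x + 1, so the learned coefficients should approach a0 ≈ 1 and a1 ≈ 2.

import numpy as np

# Hypothetical 1-D training data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

a0, a1 = 0.0, 0.0   # start from arbitrary coefficient values
lr = 0.01           # learning rate

for _ in range(5000):
    error = (a0 + a1 * x) - y
    # Gradients of MSE = (1/N) * sum((a0 + a1*x - y)^2) with respect to a0 and a1
    a0 -= lr * 2 * error.mean()
    a1 -= lr * 2 * (error * x).mean()

print(round(a0, 2), round(a1, 2))  # approaches the best-fit intercept and slope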
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the R-squared method, a statistical measure of goodness of fit.
Logistic Regression
Logistic regression is one of the most popular
Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for
predicting the categorical dependent variable using a
given set of independent variables.
Logistic regression predicts the output of a categorical
dependent variable. Therefore the outcome must be a
categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression is much like Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
In Logistic regression, instead of fitting a regression
line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning
algorithm because it has the ability to provide
probabilities and classify new data using continuous
and discrete datasets.
Logistic Regression can be used to classify the
observations using different types of data and can
easily determine the most effective variables used for
the classification. The below image is showing the
logistic function:
Note: Logistic regression uses the concept of predictive modelling like regression; therefore, it is called logistic regression. However, it is used to classify samples, so it falls under the classification algorithms.
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used
to map the predicted values to probabilities.
It maps any real value into another value within a
range of 0 and 1.
The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" shape. The S-shaped curve is called the sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which determines whether an output is classified as 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variables should not have multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from
the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given
below:
We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 − y):
y / (1 − y);  0 for y = 0, and infinity for y = 1
But we need a range between −[infinity] and +[infinity], so we take the logarithm of the equation, and it becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can
be classified into three types:
Binomial: In binomial Logistic regression, there can be
only two possible types of the dependent variables,
such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there
can be 3 or more possible unordered types of the
dependent variable, such as "cat", "dogs", or "sheep"
Ordinal: In ordinal Logistic regression, there can be 3 or
more possible ordered types of dependent variables,
such as "low", "Medium", or "High".
Generalized Linear Models
Generalized Linear Model (GLiM, or GLM) is an
advanced statistical modelling technique formulated
by John Nelder and Robert Wedderburn in 1972.
 It is an umbrella term that encompasses many other
models, which allows the response variable y to have
an error distribution other than a normal distribution.
The models include Linear Regression, Logistic
Regression, and Poisson Regression.
GLM models allow us to build a linear relationship
between the response and predictors, even though
their underlying relationship is not linear.
 This is made possible by using a link function, which
links the response variable to a linear model. Unlike
Linear Regression models, the error distribution of the
response variable need not be normally distributed.
The errors in the response variable are assumed to follow an exponential family of distributions (e.g. normal, binomial, Poisson, or gamma distributions).
Since we are trying to generalize a linear regression
model that can also be applied in these cases, the
name Generalized Linear Models.
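As an illustration of a GLM with a non-normal error distribution, the sketch below fits a Poisson regression with statsmodels on synthetic count data; the data-generating coefficients 0.5 and 1.2 are chosen arbitrarily for the example.

import numpy as np
import statsmodels.api as sm

# Synthetic count data: y is a non-negative count that grows with x
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = rng.poisson(np.exp(0.5 + 1.2 * x))   # true relationship: log(E[y]) = 0.5 + 1.2x

X = sm.add_constant(x)                   # add the intercept term
poisson_glm = sm.GLM(y, X, family=sm.families.Poisson())  # log link by default
result = poisson_glm.fit()
print(result.params)                     # estimates should be close to [0.5, 1.2]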
Why GLM?
Linear Regression model is not suitable if,
The relationship between X and y is not linear. There exists
some non-linear relationship between them. For example, y
increases exponentially as X increases.
The variance of the errors in y is not constant and varies with X, violating the homoscedasticity assumption of Linear Regression.
Response variable is not continuous, but discrete/categorical.
Linear Regression assumes a normal distribution of the response variable, which can only be applied to continuous data. If we try to build a linear regression model on a discrete/binary y variable, the model may predict negative values for the corresponding response variable, which is inappropriate.
For Example, Consider a linear model as follows:
A simple example of a mobile price in an e-commerce
platform:
Price = 12500 + 1.5*Screen size – 3*Battery Backup
(less than 4hrs)
Data available for,
Price of the mobile
Screen size (in inches)
Is battery backup less than 4hrs – with values either as
‘yes’, or ‘no’.
In this example, if the screen size increases by 1 unit, then the price of the mobile increases by 1.5 units, keeping the intercept (12500) and the Battery Backup value constant. Likewise, if Battery Backup of less than 4 hrs is 'yes', then the mobile price reduces by 3 units. If Battery Backup of less than 4 hrs is 'no', then the mobile price is unaffected, as the term (3*Battery Backup) becomes 0 in the linear model. The intercept 12500 indicates the default price for a standard value of screen size. This is a valid model.
However, if we get a model as below:
Price = 12500 +1.5*Screen size + 3*Battery
Backup(less than 4hrs)
Here, if the battery backup less than 4 hrs is 'yes', then the model says the price of the phone increases by 3 units. Clearly, from practical knowledge, we know this is incorrect: there will be less demand for such mobiles. These are going to be very old mobiles which, compared to the current range of mobiles with the latest features, will be much lower in price. This is because the relationship between the two variables is not linear, but we are trying to express it as a linear relationship. Hence, an invalid model is built.
Similarly, suppose we are trying to predict whether a particular phone will be sold or not, using the same independent variables; the target now has only binary outcomes.
Using Linear Regression, we get a model like,
Sales = 12500 +1.5*Screen size – 3*Battery
Backup(less than 4hrs)
This model doesn't tell us whether the mobile will be sold or not, because the output of a linear regression model is a continuous value, and it is possible to get negative values as the output. It does not translate to our actual objective of whether phones with certain specifications, based on the predictors, will sell or not (a binary outcome).
Similarly, if we are trying to predict the number of sales of this mobile that will happen in the next month, a negative value means nothing. Here, the minimum value is 0 (no sale happened), or a positive value corresponding to the count of sales. Having the count as a negative value is not meaningful to us.
Support Vector Machines
Support Vector Machine or SVM is one of the most
popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems.
However, primarily, it is used for Classification
problems in Machine Learning.
The goal of the SVM algorithm is to create the best
line or decision boundary that can segregate n-
dimensional space into classes so that we can easily
put the new data point in the correct category in the
future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in
creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the
below diagram in which there are two different
categories that are classified using a decision boundary
or hyperplane:
Example: SVM can be understood with the example that we
have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm.
We will first train our model with lots of images of cats and
dogs so that it can learn about different features of cats and
dogs, and then we test it with this strange creature.
 The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image
classification, text categorization, etc.
Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable
data, which means if a dataset can be classified into
two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
Non-linear SVM: Non-Linear SVM is used for non-
linearly separated data, which means if a dataset
cannot be classified by using a straight line, then such
data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM
algorithm:
Hyperplane: There can be multiple lines/decision
boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision
boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM
The working of the SVM algorithm can be understood
by using an example. Suppose we have a dataset that
has two tags (green and blue), and the dataset has two
features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of
coordinates in either green or blue. Consider the
below image:
As this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.
 The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
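A minimal linear SVM sketch with scikit-learn; the (x1, x2) points and tags below are invented for illustration. The fitted model exposes the support vectors that define the maximum-margin hyperplane.

from sklearn.svm import SVC

# Hypothetical linearly separable points with two tags (0 = green, 1 = blue)
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_)   # the extreme points that define the maximum margin
print(clf.predict([[3, 2]]))  # classify a new (x1, x2) pair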
Non-Linear SVM:
If data is linearly arranged, then we can separate it by
using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below
image:
So to separate these data points, we need to add one
more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add
a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as shown in the below image:
 So now, SVM will divide the datasets into classes in the following way. Consider the below image:
 Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it into 2-D space with z = 1, it becomes:
Hence, we get a circumference of radius 1 in the case of non-linear data.
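The same idea in code: the circular toy data below is made up, the third dimension z = x² + y² is added explicitly so a linear SVM can separate the classes, and an RBF-kernel SVM is shown doing an equivalent mapping implicitly.

import numpy as np
from sklearn.svm import SVC

# Hypothetical non-linear data: an inner circle (class 0) inside an outer ring (class 1)
angles = np.linspace(0, 2 * np.pi, 20, endpoint=False)
inner = np.c_[0.5 * np.cos(angles), 0.5 * np.sin(angles)]
outer = np.c_[2.0 * np.cos(angles), 2.0 * np.sin(angles)]
X = np.vstack([inner, outer])
y = np.array([0] * 20 + [1] * 20)

# Option 1: add the third dimension z = x^2 + y^2 by hand and use a linear SVM
Z = np.c_[X, (X ** 2).sum(axis=1)]
print(SVC(kernel="linear").fit(Z, y).score(Z, y))  # separable in 3-D: accuracy 1.0

# Option 2: let an RBF kernel do the mapping implicitly (the usual non-linear SVM)
print(SVC(kernel="rbf").fit(X, y).score(X, y))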
Binary Classification
Multiclass/Structured outputs
The last type of classification task we are going to discuss here is called multioutput-multiclass classification (or simply multioutput classification).
 It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
To illustrate this, let’s build a system that removes
noise from images. It will take as input a noisy digit
image, and it will (hopefully) output a clean digit
image, represented as an array of pixel intensities, just
like the MNIST images.
 Notice that the classifier’s output is multilabel (one
label per pixel) and each label can have multiple values
(pixel intensity ranges from 0 to 255). It is thus an
example of a multioutput classification system.
Let’s start by creating the training and test sets by
taking the MNIST images and adding noise to their
pixel intensities with NumPy’s randint() function. The
target images will be the original images:
Let’s take a peek at an image from the test
set (yes, we’re snooping on the test data, so
you should be frowning right now)
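A sketch of this setup along the lines described above, loading MNIST through scikit-learn's fetch_openml (which downloads the data) and using only a small slice so that it runs quickly; the slice sizes and random seed are arbitrary choices.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier

# Load MNIST as 70,000 flattened 28x28 images; keep a small slice for speed
X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:5000], X[5000:5100]

# Add random noise to the pixel intensities; the clean images become the targets
rng = np.random.RandomState(42)
X_train_mod = X_train + rng.randint(0, 100, X_train.shape)
X_test_mod = X_test + rng.randint(0, 100, X_test.shape)
y_train_mod, y_test_mod = X_train, X_test

# One label per pixel, each label taking values 0-255: multioutput classification
knn = KNeighborsClassifier()
knn.fit(X_train_mod, y_train_mod)
clean_digit = knn.predict([X_test_mod[0]])  # a denoised 784-pixel image
print(clean_digit.shape)                    # (1, 784)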
MNIST
The MNIST (Modified National Institute of
Standards and Technology) database is a large
database of handwritten numbers or digits that are used
for training various image processing systems.
The dataset is also widely used for training and testing in the field of machine learning. The set of images in the MNIST database is a combination of two of NIST's databases: Special Database 1 and Special Database 3.
The MNIST dataset has 60,000 training images
and 10,000 testing images.
The MNIST dataset is available online, and it is essentially a database of various handwritten digits. The MNIST dataset has a large amount of data and is commonly used to demonstrate the real power of deep neural networks.
 Our brain and eyes work together to recognize any
numbered image. Our mind is a potent tool, and it's
capable of categorizing any image quickly.
There are many shapes of each number, and our mind can easily recognize these shapes and determine what number it is, but the same task is not simple for a computer to complete.
One way to do this is to use a deep neural network, which allows us to train a computer to classify handwritten digits effectively.
The MNIST dataset is a multiclass dataset consisting of 10 classes, in which we can classify digits from 0 to 9.
The major difference between the datasets that we
have used before and the MNIST dataset is the method
in which MNIST data is inputted in a neural network.
In the perceptron model and the linear regression model, each data point was defined by a simple x and y coordinate. This means that the input layer needs two nodes to input a single data point.
In the MNIST dataset, a single data point comes in the form of an image. The images included in the MNIST dataset are typically 28*28 pixels, i.e., 28 pixels along the horizontal axis and 28 pixels along the vertical axis.
 This means that a single image from the MNIST database has a total of 784 pixels that must be analyzed. The input layer of our neural network therefore has 784 nodes to represent one of these images.
Here, we will see how to create a function that is a model for
recognizing handwritten digits by looking at each pixel in
the image.
Then we use TensorFlow to train the model to predict the image by making it look at thousands of examples which are already labeled. We will then check the model's accuracy with a test dataset.
The MNIST dataset in TensorFlow contains information on handwritten digits split into three parts:
Training Data (mnist.train) - 55,000 data points
Validation Data (mnist.validate) - 5,000 data points
Test Data (mnist.test) - 10,000 data points
Now before we start, it is important to note that every
data point has two parts: an image (x) and a
corresponding label (y) describing the actual image
and each image is a 28x28 array, i.e., 784 numbers. The
label of the image is a number between 0 and 9
corresponding to the TensorFlow MNIST image. To
download and use the MNIST dataset, use the following commands:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
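Note that the tutorials module above dates from TensorFlow 1.x and has been removed in recent releases; in TensorFlow 2.x the dataset is typically loaded through the Keras datasets API instead. A minimal sketch, with the one-hot encoding done explicitly:

import tensorflow as tf

# Load MNIST through Keras (28x28 uint8 images and integer labels, 60,000 train / 10,000 test)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x_train = x_train.reshape(-1, 784).astype("float32") / 255.0  # flatten to 784 inputs
y_train = tf.keras.utils.to_categorical(y_train, 10)          # one-hot labels, like one_hot=True

print(x_train.shape, y_train.shape)  # (60000, 784) (60000, 10)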
Ranking
Rank is an active and connected transformation that performs filtering of data based on groups and ranks. The rank transformation also provides the feature to do ranking based on groups.
The rank transformation has an output port, and it is
used to assign a rank to the rows.
In Informatica, it is used to select a bottom or top range
of data. While string value ports can be ranked, the
Informatica Rank Transformation is used to rank
numeric port values. One might think MAX and MIN
functions can accomplish this same task.
However, the rank transformation allows groups of records
to be listed instead of a single value or record. The rank
transformation is created with the following types of ports.
Input port (I)
Output port (O)
Variable port (V)
Rank Port (R)
Rank Port
The port that participates in the rank calculation is known as the Rank port.
Variable Port
A port that allows us to develop an expression to store data temporarily for the rank calculation is known as a variable port.
 Configuring the Rank Transformation
 Let’s see how to configure the following properties of Rank
transformation:
 Cache Directory: The directory is a space where the integration service
creates the index and data cache files.
 Top/Bottom: It specifies whether we want to select the top or bottom
rank of data.
 Number of Ranks: It specifies the number of rows that we want to rank.
 Case-Sensitive String Comparison: It is used to sort strings in a case-sensitive manner.
 Tracing Level: The amount of logging to be tracked in the session log
file.
 Rank Data Cache Size: The data cache size default value is 2,000,000
bytes. We can set a numeric value or Auto for the data cache size. In the
case of Auto, the Integration Service determines the cache size at runtime.
 Rank Index Cache Size: The index cache size default value is 1,000,000
bytes. We can set a numeric value or Auto for the index cache size. In the
case of Auto, the Integration Service determines the cache size at runtime.
Example
Suppose we want to load top 5 salaried employees for
each department; we will implement this using rank
transformation in the following steps, such as:
Step 1: Create a mapping having source EMP and target
EMP_TARGET
Step 2: Then in the mapping,
Select the transformation menu.
And click on the Create option.
Step 3: In the create transformation window,
Select rank transformation.
Enter transformation name "rnk_salary".
And click on the Create button.
Step 4: The rank transformation will be created in the
mapping, select the done button in the window.
Step 5: Connect all the ports from source qualifier to the
rank transformation.
Step 6: Double click on the rank transformation, and it
will open the "edit transformation window". In this
window,
Select the properties menu.
Select the "Top" option from the Top/Bottom property.
Enter 5 in the number of ranks.
Step 7: In the "edit transformation" window again,
Select the ports tab.
Select group by option for the Department number column.
Select Rank in the Salary Column.
Click on the OK button.
Step 8: Connect the ports from rank transformation to the
target table.
Now, save the mapping and execute it after creating the session and workflow. The source qualifier will fetch all the records, but the rank transformation will pass only the records with the five highest salaries for each department.
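For comparison, the same "top 5 salaries per department" ranking can be expressed outside Informatica, for example with pandas; the column names and values below are hypothetical.

import pandas as pd

# Hypothetical EMP-style data
emp = pd.DataFrame({
    "DEPTNO": [10, 10, 10, 20, 20, 20, 20],
    "ENAME": ["A", "B", "C", "D", "E", "F", "G"],
    "SAL": [5000, 3000, 4000, 6000, 2000, 4500, 3500],
})

# Sort by salary, then keep the top 5 rows within each department
top5 = (emp.sort_values("SAL", ascending=False)
           .groupby("DEPTNO")
           .head(5))
print(top5)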
