Unit II
Supervised Learning
(Regression/Classification)
Distance based Methods
Distance-based models are the second class of
Geometric models. Like Linear models, distance-based
models are based on the geometry of data.
As the name implies, distance-based models work on
the concept of distance. In the context of Machine
learning, the concept of distance is not based on
merely the physical distance between two points.
Instead, we could think of the distance between two
points considering the mode of transport between
two points.
Travelling between two cities by plane covers less physical distance than travelling by train, because a plane can take an unrestricted, straight-line path.
Similarly, in chess, the concept of distance depends
on the piece used – for example, a Bishop can move
diagonally.
Thus, depending on the entity and the mode of travel,
the concept of distance can be experienced differently.
The distance metrics commonly used
are Euclidean, Minkowski, Manhattan,
and Mahalanobis.
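For concreteness, these metrics can be computed with SciPy; the sketch below uses made-up vectors, and the random data used only to build a covariance matrix for the Mahalanobis distance is likewise illustrative.

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # square root of the sum of squared differences
print(distance.cityblock(a, b))        # Manhattan distance: sum of absolute differences
print(distance.minkowski(a, b, p=3))   # Minkowski distance of order p
# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.random.rand(100, 3)
VI = np.linalg.inv(np.cov(data.T))
print(distance.mahalanobis(a, b, VI))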
Distance is applied through the concept
of neighbours and exemplars. Neighbours are
points in proximity with respect to the distance
measure expressed through exemplars.
Exemplars are either centroids that find a centre of
mass according to a chosen distance metric
or medoids that find the most centrally located data
point.
The most commonly used centroid is the arithmetic
mean, which minimises squared Euclidean distance to
all other points.
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of all the points in the figure. This definition extends to any object in n-dimensional space: its centroid is the mean position of all the points.
Medoids are similar in concept to means or centroids.
Medoids are most commonly used on data when a mean
or centroid cannot be defined. They are used in contexts
where the centroid is not representative of the dataset,
such as in image data.
Examples of distance-based models include the nearest-
neighbour models, which use the training data as
exemplars – for example, in classification.
The K-means clustering algorithm also uses exemplars
to create clusters of similar data points.
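As a rough illustration of both kinds of exemplar, scikit-learn offers a nearest-neighbour classifier (where the training points themselves act as exemplars) and K-means clustering (where the centroids act as exemplars); the toy points below are made up purely for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy 2-D points and labels, purely for illustration
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Nearest-neighbour model: the training data itself serves as the exemplars
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2]]))           # expected: class 0

# K-means: the cluster centroids serve as the exemplars
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)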
K-Nearest Neighbours
K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
The KNN algorithm simply stores the dataset at the training phase, and when it gets new data, it classifies that data into the category most similar to the new data.
Example:
Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of
the below algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to the training data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.
Suppose we have a new data point and we need to put
it in the required category. Consider the below image:
Firstly, we will choose the number of neighbors, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points A(x1, y1) and B(x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
Since three of the five nearest neighbors belong to category A, the new data point is assigned to category A.
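The steps above can be sketched directly in Python; this is a from-scratch illustration with made-up 2-D points rather than the data in the figure.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Steps 2-3: Euclidean distances from the new point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
    # Steps 4-5: majority vote among the k neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: category 'A' points near the origin, category 'B' points farther away
X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 7]])
y_train = np.array(['A', 'A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=5))   # -> 'A' (3 votes vs 2)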
Naïve Bayes' Classifier
The Naïve Bayes classifier is based on Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with
the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1) Convert the given dataset into frequency tables.
2) Generate Likelihood table by finding the probabilities of
given features.
3) Now, use Bayes theorem to calculate the posterior
probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:
Frequency table for the weather conditions:
Likelihood table for the weather conditions:
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
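The same arithmetic can be checked with a few lines of Python, using the probabilities read off the likelihood table above (the exact posterior for Yes comes out to about 0.61, which the text rounds to 0.60):

# Probabilities taken from the likelihood table above
p_sunny_given_yes = 3 / 10     # P(Sunny | Yes)
p_yes = 0.71                   # P(Yes)
p_sunny = 0.35                 # P(Sunny)
p_sunny_given_no = 2 / 4       # P(Sunny | No)
p_no = 0.29                    # P(No)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny    # about 0.61
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny       # about 0.41
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")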
Advantages of Naïve Bayes Classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a data point.
It can be used for Binary as well as Multi-class
Classifications.
It performs well in Multi-class predictions as compared
to the other Algorithms.
It is the most popular choice for text classification
problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between
features.
Applications of Naïve Bayes Classifier:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve
Bayes Classifier is an eager learner.
It is used in Text classification such as Spam
filtering and Sentiment analysis.
Linear Models
Linear Regression
Linear regression is one of the easiest and most popular
Machine Learning algorithms. It is a statistical method
that is used for predictive analysis.
Linear regression makes predictions for continuous/real
or numeric variables such as sales, salary, age, product
price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression.
Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight
line representing the relationship between the
variables. Consider the below image:
Mathematically, we can represent a linear regression
as:
y = a0 + a1x + ε
Here,
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for x and y variables are training datasets for
Linear Regression model representation.
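As a minimal sketch, the line y = a0 + a1x can be fitted with scikit-learn; the toy experience-versus-salary numbers below are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: years of experience (x) vs salary in thousands (y)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 42, 48, 55])

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])   # estimates of a0 and a1
print(model.predict([[6]]))               # prediction for a new x value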
Types of Linear Regression
Linear regression can be further divided into two types
of the algorithm:
Simple Linear Regression:
If a single independent variable is used to predict the
value of a numerical dependent variable, then such a
Linear Regression algorithm is called Simple Linear
Regression.
Multiple Linear regression:
If more than one independent variable is used to
predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called
Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the
dependent and independent variables is called
a regression line. A regression line can show two
types of relationship:
Positive Linear Relationship:
If the dependent variable increases on the Y-axis and
independent variable increases on X-axis, then such a
relationship is termed as a Positive linear relationship.
Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and
independent variable increases on the X-axis, then
such a relationship is called a negative linear
relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to calculate this, we use a cost function.
Cost function
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
We can use the cost function to find the accuracy of
the mapping function, which maps the input
variable to the output variable. This mapping function
is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:
MSE = (1/N) * Σ (yi − (a0 + a1xi))²
where N is the total number of observations, yi is the actual value of the i-th observation, and a0 + a1xi is its predicted value.
Residuals: The distance between an actual value and the predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence so will the cost function.
Gradient Descent:
Gradient descent is used to minimize the MSE by
calculating the gradient of the cost function.
A regression model uses gradient descent to update the
coefficients of the line by reducing the cost function.
This is done by randomly selecting initial values for the coefficients and then iteratively updating them to reach the minimum of the cost function.
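A bare-bones sketch of gradient descent on the MSE cost for a0 and a1; the toy data, learning rate and iteration count are arbitrary choices for illustration.

import numpy as np

# Toy data (assumptions for illustration): x values and observed y values
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)

a0, a1 = 0.0, 0.0          # start from arbitrary coefficient values
lr, n = 0.01, len(x)       # learning rate and number of points

for _ in range(10000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of MSE = (1/n) * sum(error^2) with respect to a0 and a1
    grad_a0 = (2 / n) * error.sum()
    grad_a1 = (2 / n) * (error * x).sum()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(a0, a1)              # approaches the least-squares estimates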
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved with the R-squared method, which measures the strength of the relationship between the dependent and independent variables on a scale of 0 to 100%.
Logistic Regression
Logistic regression is one of the most popular
Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for
predicting the categorical dependent variable using a
given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression is much like Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
In Logistic regression, instead of fitting a regression
line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning
algorithm because it has the ability to provide
probabilities and classify new data using continuous
and discrete datasets.
Logistic Regression can be used to classify the
observations using different types of data and can
easily determine the most effective variables used for
the classification. The below image is showing the
logistic function:
Note: Logistic regression uses the concept of predictive modelling as regression, therefore it is called logistic regression; but because it is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used
to map the predicted values to probabilities.
It maps any real value into another value within a
range of 0 and 1.
The value of the logistic regression must be between 0
and 1, which cannot go beyond this limit, so it forms a
curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
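A small sketch of the sigmoid function and a threshold; the 0.5 threshold used here is a typical choice, not a fixed rule.

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)                        # S-shaped values between 0 and 1
print((probs >= 0.5).astype(int))   # apply a 0.5 threshold to get 0/1 classes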
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variables should not have multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from
the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given
below:
We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
In Logistic Regression y can be between 0 and 1 only, so we divide the above equation by (1 − y):
y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
But we need the range to be between −infinity and +infinity, so taking the logarithm of the equation gives the Logistic Regression equation:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
Binary Classification
Multiclass/Structured outputs
The last type of classification task we are going to discuss here is called multioutput–multiclass classification (or simply multioutput classification). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
To illustrate this, let’s build a system that removes
noise from images. It will take as input a noisy digit
image, and it will (hopefully) output a clean digit
image, represented as an array of pixel intensities, just
like the MNIST images.
Notice that the classifier’s output is multilabel (one
label per pixel) and each label can have multiple values
(pixel intensity ranges from 0 to 255). It is thus an
example of a multioutput classification system.
Let’s start by creating the training and test sets by
taking the MNIST images and adding noise to their
pixel intensities with NumPy’s randint() function. The
target images will be the original images:
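A sketch of what that preparation might look like, assuming X_train and X_test already hold the flattened 784-pixel MNIST images (these variable names and the 0-100 noise range are assumptions for illustration):

import numpy as np

# Assumes X_train and X_test already hold flattened 28x28 MNIST images (784 values each)
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise

# The targets are the original, clean images
y_train_mod = X_train
y_test_mod = X_test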
Let’s take a peek at an image from the test
set (yes, we’re snooping on the test data, so
you should be frowning right now)
MNIST
The MNIST (Modified National Institute of
Standards and Technology) database is a large
database of handwritten numbers or digits that are used
for training various image processing systems.
The dataset is also widely used for training and testing in
the field of machine learning. The set of images in the
MNIST database are a combination of two of NIST's
databases: Special Database 1 and Special Database 3.
The MNIST dataset has 60,000 training images
and 10,000 testing images.
The MNIST dataset is available online, and it is essentially a database of various handwritten digits. The MNIST dataset has a large amount of data and is commonly used to demonstrate the real power of deep neural networks.
Our brain and eyes work together to recognize any
numbered image. Our mind is a potent tool, and it's
capable of categorizing any image quickly.
There are many shapes a number can take, and our mind can easily recognize these shapes and determine what number it is, but the same task is not simple for a computer to complete.
One effective way to do this is to use a deep neural network, which allows us to train a computer to classify the handwritten digits effectively.
The MNIST dataset is a multiclass dataset consisting of 10 classes, in which we classify the digits from 0 to 9.
The major difference between the datasets that we
have used before and the MNIST dataset is the method
in which MNIST data is inputted in a neural network.
In the perceptron model and the linear regression model, each data point was defined by a simple x and y coordinate. This means that the input layer needed only two nodes to input a single data point.
In the MNIST dataset, a single data point comes in the form of an image. The images included in the MNIST dataset are typically 28x28 pixels: 28 pixels along the horizontal axis and 28 pixels along the vertical axis.
This means that a single image from the MNIST database has a total of 784 pixels that must be analyzed. The input layer of our neural network therefore has 784 nodes, one for each pixel of an image.
Here, we will see how to create a function that is a model for
recognizing handwritten digits by looking at each pixel in
the image.
Then we will use TensorFlow to train the model to predict the digit by making it look at thousands of examples which are already labeled. We will then check the model's accuracy with a test dataset.
The MNIST dataset in TensorFlow contains information on handwritten digits split into three parts:
Training Data (mnist.train) - 55,000 data points
Validation Data (mnist.validation) - 5,000 data points
Test Data (mnist.test) - 10,000 data points
Now, before we start, it is important to note that every data point has two parts: an image (x) and a corresponding label (y) describing the actual image. Each image is a 28x28 array, i.e., 784 numbers, and the label of the image is a number between 0 and 9 corresponding to the digit in the image. To download and use the MNIST dataset, use the following commands:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
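Once loaded, the three splits can be inspected as a quick sanity check; note that the validation split is exposed as mnist.validation, and the label shape below assumes one_hot=True as in the command above.

print(mnist.train.num_examples)        # 55000 training data points
print(mnist.validation.num_examples)   # 5000 validation data points
print(mnist.test.num_examples)         # 10000 test data points
print(mnist.train.images.shape)        # (55000, 784): flattened 28x28 images
print(mnist.train.labels.shape)        # (55000, 10): one-hot labels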
Ranking
Rank is an active and connected transformation that filters data based on groups and ranks. The Rank transformation also provides the feature to perform ranking based on groups.
The rank transformation has an output port, and it is
used to assign a rank to the rows.
In Informatica, it is used to select a bottom or top range
of data. While string value ports can be ranked, the
Informatica Rank Transformation is used to rank
numeric port values. One might think MAX and MIN
functions can accomplish this same task.
However, the rank transformation allows groups of records
to be listed instead of a single value or record. The rank
transformation is created with the following types of ports.
Input port (I)
Output port (O)
Variable port (V)
Rank Port (R)
Rank Port
The port that participates in the rank calculation is known as the Rank port.
Variable Port
A port that allows us to develop an expression to store data temporarily for the rank calculation is known as a Variable port.
Configuring the Rank Transformation
Let’s see how to configure the following properties of Rank
transformation:
Cache Directory: The directory is a space where the integration service
creates the index and data cache files.
Top/Bottom: It specifies whether we want to select the top or bottom
rank of data.
Number of Ranks: It specifies the number of rows that we want to rank.
Case-Sensitive String Comparison: It is used to sort strings using case-sensitive comparison.
Tracing Level: The amount of logging to be tracked in the session log
file.
Rank Data Cache Size: The data cache size default value is 2,000,000
bytes. We can set a numeric value or Auto for the data cache size. In the
case of Auto, the Integration Service determines the cache size at runtime.
Rank Index Cache Size: The index cache size default value is 1,000,000
bytes. We can set a numeric value or Auto for the index cache size. In the
case of Auto, the Integration Service determines the cache size at runtime.
Example
Suppose we want to load the top 5 salaried employees for each department; we will implement this using the Rank transformation in the following steps:
Step 1: Create a mapping having source EMP and target
EMP_TARGET
Step 2: Then in the mapping,
Select the transformation menu.
And click on the Create option.
Step 3: In the create transformation window,
Select rank transformation.
Enter transformation name "rnk_salary".
And click on the Create button.
Step 4: The rank transformation will be created in the
mapping, select the done button in the window.
Step 5: Connect all the ports from source qualifier to the
rank transformation.
Step 6: Double click on the rank transformation, and it
will open the "edit transformation window". In this
window,
Select the properties menu.
Select the "Top" option from the Top/Bottom property.
Enter 5 in the number of ranks.
Step 7: In the "edit transformation" window again,
Select the ports tab.
Select group by option for the Department number column.
Select Rank in the Salary Column.
Click on the OK button.
Step 8: Connect the ports from rank transformation to the
target table.
Now, save the mapping and execute it after creating the session and workflow. The source qualifier will fetch all the records, but the rank transformation will pass only the records having the top five salaries for each department.
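For intuition only, the result of this mapping (the top 5 salaries per department) can be sketched in Python with pandas; this is just an analogy for the transformation's output, not Informatica itself, and the column names EMPNO, DEPTNO and SAL are assumptions.

import pandas as pd

# Toy employee data standing in for the EMP source (values are made up)
emp = pd.DataFrame({
    "EMPNO": range(1, 9),
    "DEPTNO": [10, 10, 10, 20, 20, 20, 20, 20],
    "SAL":    [5000, 3000, 4000, 4500, 4800, 3900, 5200, 2500],
})

# Group by department and keep the top 5 salaries in each group,
# mirroring a Rank transformation configured with Top and Number of Ranks = 5
top5 = (emp.sort_values("SAL", ascending=False)
           .groupby("DEPTNO")
           .head(5))
print(top5)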