
UNIT 3 - SUPERVISED LEARNING

S.NO  TOPICS
1     INTRODUCTION TO MACHINE LEARNING
2     LINEAR REGRESSION MODELS
3     LEAST SQUARES, SINGLE AND MULTIPLE VARIABLES
4     BAYESIAN LINEAR REGRESSION, GRADIENT DESCENT, LINEAR CLASSIFICATION MODELS
5     DISCRIMINANT FUNCTIONS, PROBABILISTIC DISCRIMINATIVE MODEL
6     LOGISTIC REGRESSION, PROBABILISTIC GENERATIVE MODEL, NAÏVE BAYES
7     MAXIMUM MARGIN CLASSIFIER - SUPPORT VECTOR MACHINE, DECISION TREE, RANDOM FORESTS

Introduction to Machine Learning

• Machine learning is a growing technology which enables computers to learn automatically from past data.

• Machine learning uses various algorithms for building mathematical models and making predictions using historical data or information.

• Currently, it is being used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed.

Fig.3.1.1 Schematic representation of Machine Learning

Fig.3.1.2 Classification of Machine Learning

Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.

The goal of supervised learning is to map input data to the output data.

Supervised learning can be grouped further into two categories of algorithms:
• Classification
• Regression

Unsupervised Learning

• Unsupervised learning is a learning method in which a machine learns without any supervision.

• The training is provided to the machine with a set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision.

• In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data.

• It can be further classified into two categories of algorithms:
• Clustering
• Association

Semi Supervised Learning

 Semi-supervised learning is a type of machine learning that falls in between supervised and unsupervised learning.

 It is a method that uses a small amount of labeled data and a large amount of unlabeled data to train a model.

 The goal of semi-supervised learning is to learn a function that can accurately predict the output variable based on the input variables, similar to supervised learning.

 However, unlike supervised learning, the algorithm is trained on a dataset that contains both labeled and unlabeled data.

Semi-supervised learning is particularly useful when there is a large amount of unlabeled data available, but it’s too expensive or difficult to label all of it.

Reinforcement Learning

• Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and gets a penalty for each wrong action.

• In reinforcement learning, the agent interacts with the environment and explores it.

• The goal of an agent is to get the most reward points, and hence, it improves its performance.

Deep Learning

 Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input.

 For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human such as digits or letters or faces.

Linear Regression Models

Regression Analysis in Machine learning

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes corresponding to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which does various advertisements every year and gets sales from them. The list below shows the advertisement spend made by the company in the last 5 years and the corresponding sales:

Fig.3.2.1 Advertisement and Sales made by a company in the last 5 years

Now, the company wants to spend $200 on advertisement in the current year and wants to know the prediction about the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining the cause-effect relationship between variables.

In regression, we plot a graph between the variables which best fits the given datapoints; using this plot, the machine learning model can make predictions about the data.

In simple words, "Regression shows a line or curve that passes through all the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether a model has captured a strong relationship or not.

Some examples of regression can be as:

• Prediction of rain using temperature and other factors
• Determining market trends
• Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.

o Independent Variable: The factors which affect the dependent variable or which are used to predict the values of the dependent variable are called independent variables, also called predictors.

o Outliers: An outlier is an observation which contains either a very low value or a very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided.

o Multicollinearity: If the independent variables are highly correlated with each other, then such a condition is called multicollinearity. It should not be present in the dataset, because it creates a problem while ranking the most affecting variable.

o Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with the test dataset, then such a problem is called overfitting. And if our algorithm does not perform well even with the training dataset, then such a problem is called underfitting.

Fig.3.2.2 Types of Regression

Linear Regression

o Linear regression is a statistical regression method which is used for predictive analysis.

o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.

o It is used for solving regression problems in machine learning.

o Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence it is called linear regression.

o If there is only one input variable (x), then such linear regression is called simple linear regression. And if there is more than one input variable, then such linear regression is called multiple linear regression.

o The relationship between variables in the linear regression model can be explained using the image below. Here we are predicting the salary of an employee on the basis of years of experience.

Fig.3.2.3 Example for Linear Regression

o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable), X = independent variable (predictor variable), and a and b are the linear coefficients (the slope and the intercept).
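A minimal Python sketch of this idea follows; the years/salary values are hypothetical and only illustrate fitting Y = aX + b to data.

# Fit Y = aX + b with NumPy's least-squares polynomial fit (hypothetical data).
import numpy as np

years = np.array([1, 2, 3, 4, 5, 6], dtype=float)         # X: years of experience
salary = np.array([30, 35, 42, 48, 55, 61], dtype=float)  # Y: salary in thousands

a, b = np.polyfit(years, salary, deg=1)   # returns slope a and intercept b
print(f"Y = {a:.2f}*X + {b:.2f}")
print("Predicted salary for 7 years of experience:", a * 7 + b)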

Some popular applications of linear regression are:

 Analyzing trends and sales estimates
 Salary forecasting
 Real estate prediction
 Arriving at ETAs in traffic.

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1.

o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or not spam, etc.

o It is a predictive analysis algorithm which works on the concept of probability.

o Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is used.

o Logistic regression uses the sigmoid function, also called the logistic function, to model the data. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value
o x = input to the function
o e = base of the natural logarithm
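A minimal numeric sketch of this function, matching the formula above:

# Sigmoid (logistic) function f(x) = 1 / (1 + e^(-x)).
import numpy as np

def sigmoid(x):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]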

When we provide the input values (data) to the function, it gives the S-curve as follows:

Fig. 3.3.1 Sigmoid Function

Least Square Linear Regression

Least squares is a method to apply linear regression. It helps us predict results based on an existing set of data as well as clear anomalies in our data. Anomalies are values that are too good, or bad, to be true or that represent rare cases.

For example, say we have a list of how many topics future engineers here at freeCodeCamp can solve if they invest 1, 2, or 3 hours continuously. Then we can predict how many topics will be covered after 4 hours of continuous study even without that data being available to us.

This method is used by a multitude of professionals, for example


statisticians, accountants, managers, and engineers (like in machine learning
problems).

Setting up an example

Before we jump into the formula and code, let's define the data we're going to
use.

To do that let's expand on the example mentioned earlier.

Let's assume that our objective is to figure out how many topics are covered
by a student per hour of learning.

Each pair (X, Y) will represent a student. Since we all have different rates of
learning, the number of topics solved can be higher or lower for the same time
invested.

Hours   Topics Solved        Hours   Topics Solved
1       1.5                  2.7     7.1
1.2     2                    3       10
1.5     3                    3.1     6
2       1.8                  3.2     5
2.3     2.7                  3.6     8.9
2.5     4.7

You can read it like this: "Someone spent 1 hour and solved 2 topics" or "One student after 3 hours solved 10 topics".

In a graph these points look like this:

Each point is a student (X, Y), showing how long it took that specific student to complete a certain number of topics.

The formula

Y = a + bX

a is the intercept: the value of Y that the line predicts when X = 0.

b is the slope or coefficient: the number of additional topics solved for each extra hour (X) spent studying.

To compute a and b with least squares, we first need the means of X and Y:

x̄ = (1 + 1.2 + 1.5 + 2 + 2.3 + 2.5 + 2.7 + 3 + 3.1 + 3.2 + 3.6) / 11 = 2.37

ȳ = (1.5 + 2 + 3 + 1.8 + 2.7 + 4.7 + 7.1 + 10 + 6 + 5 + 8.9) / 11 = 4.79
Now that we have the average we can expand our table to include the new results:

Table 3.3.1 Hours and Topics solved

Hours (X)   Topics Solved (Y)   (X - x̄)   (Y - ȳ)   (X - x̄)(Y - ȳ)   (X - x̄)²
1           1.5                 -1.37      -3.29      4.51              1.88
1.2         2                   -1.17      -2.79      3.26              1.37
1.5         3                   -0.87      -1.79      1.56              0.76
2           1.8                 -0.37      -2.99      1.11              0.14
2.3         2.7                 -0.07      -2.09      0.15              0.00
2.5         4.7                  0.13      -0.09     -0.01              0.02
2.7         7.1                  0.33       2.31      0.76              0.11
3           10                   0.63       5.21      3.28              0.40
3.1         6                    0.73       1.21      0.88              0.53
3.2         5                    0.83       0.21      0.17              0.69
3.6         8.9                  1.23       4.11      5.06              1.51

∑(x - x̄)(y - ȳ) = 4.51 + 3.26 + 1.56 + 1.11 + 0.15 + (-0.01) + 0.76 + 3.28 + 0.88 + 0.17 + 5.06 = 20.73

∑(x - x̄)² = 1.88 + 1.37 + 0.76 + 0.14 + 0.00 + 0.02 + 0.11 + 0.40 + 0.53 + 0.69 + 1.51 = 7.41

And finally we do 20.73 / 7.41 and we get b = 2.8.

Calculating "a"

All that is left is a, for which the formula is ȳ = a + b·x̄. We've already obtained all those other values, so we can substitute them and we get:

 4.79 = a + 2.8 * 2.37
 4.79 = a + 6.64
 a = -6.64 + 4.79
 a = -1.85

The result
Our final formula becomes:

Y = -1.85 + 2.8*X

Now we replace the X in our formula with each value that we have:

Hours   Topics Solved (-1.85 + 2.8*X)        Hours   Topics Solved (-1.85 + 2.8*X)
1       0.95                                 2.7     5.71
1.2     1.51                                 3       6.55
1.5     2.35                                 3.1     6.83
2       3.75                                 3.2     7.11
2.3     4.59                                 3.6     8.23
2.5     5.15

If we want to predict how many topics we expect a student to solve with 8 hours of
study, we replace it in our formula:

Y = -1.85 + 2.8*8

Y = 20.55
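The whole worked example above can be reproduced with a few lines of Python; this is a minimal sketch using the same hours/topics data.

# Least squares slope b and intercept a for the hours/topics data above.
hours  = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6]
topics = [1.5, 2, 3, 1.8, 2.7, 4.7, 7.1, 10, 6, 5, 8.9]

x_bar = sum(hours) / len(hours)     # ~2.37
y_bar = sum(topics) / len(topics)   # ~4.79

num = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, topics))  # ~20.73
den = sum((x - x_bar) ** 2 for x in hours)                           # ~7.41

b = num / den           # slope, ~2.8
a = y_bar - b * x_bar   # intercept, ~ -1.85
print(f"Y = {a:.2f} + {b:.2f}*X")
print("Prediction for 8 hours of study:", a + b * 8)   # ~20.5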

Multiple Linear Regressions

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

• How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).

• The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

How to perform a multiple linear regression

Multiple linear regression formula

The formula for a multiple linear regression is:

y = B0 + B1X1 + B2X2 + … + BnXn + ε

 y = the predicted value of the dependent variable
 B0 = the y-intercept (value of y when all other parameters are set to 0)
 B1X1 = the regression coefficient (B1) of the first independent variable (X1) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)
 … = do the same for however many independent variables you are testing
 BnXn = the regression coefficient of the last independent variable
 ε = model error (a.k.a. how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression
calculates three things:

 The regression coefficients that lead to the smallest overall model error.

 The t statistic of the overall model.

 The associated p value (how likely it is that the t statistic would have
occurred by chance if the null hypothesis of no relationship between the
independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the
model.
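A minimal sketch of such a fit follows, using hypothetical rainfall/fertilizer/yield data; statsmodels is used here because its summary also reports the coefficients, t statistics, and p values mentioned above.

# Multiple linear regression y = B0 + B1*X1 + B2*X2 + error (hypothetical data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
rainfall = rng.uniform(50, 150, size=100)      # X1 (hypothetical)
fertilizer = rng.uniform(0, 10, size=100)      # X2 (hypothetical)
crop_yield = 2.0 + 0.05 * rainfall + 1.5 * fertilizer + rng.normal(0, 1, 100)

X = sm.add_constant(np.column_stack([rainfall, fertilizer]))  # adds the B0 column
model = sm.OLS(crop_yield, X).fit()
print(model.summary())    # coefficients, t statistics, p values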

Bayesian Linear Regression

In the Bayesian viewpoint, we formulate linear regression using probability distributions rather than point estimates. The response, y, is not estimated as a single value, but is assumed to be drawn from a probability distribution. The model for Bayesian linear regression, with the response sampled from a normal distribution, is:

y ~ N(βᵀX, σ²I)

The output y is generated from a normal (Gaussian) distribution characterized by a mean and variance. The mean for linear regression is the transpose of the weight matrix multiplied by the predictor matrix. The variance is the square of the standard deviation σ (multiplied by the identity matrix because this is a multi-dimensional formulation of the model).

The aim of Bayesian linear regression is not to find the single “best” value of the model parameters, but rather to determine the posterior distribution for the model parameters. Not only is the response generated from a probability distribution, but the model parameters are assumed to come from a distribution as well. The posterior probability of the model parameters is conditional upon the training inputs and outputs:

P(β|y, X) = P(y|β, X) · P(β|X) / P(y|X)

Here, P(β|y, X) is the posterior probability distribution of the model parameters given the inputs and outputs. This is equal to the likelihood of the data, P(y|β, X), multiplied by the prior probability of the parameters and divided by a normalization constant. This is a simple expression of Bayes' Theorem, the fundamental underpinning of Bayesian inference:

Posterior = (Likelihood × Prior) / Normalization

Here we can observe the two primary benefits of Bayesian Linear Regression.

Priors: If we have domain knowledge, or a guess for what the model


parameters should be, we can include them in our model, unlike in the frequentist
approach which assumes everything there is to know about the parameters comes
from the data. If we don’t have any estimates ahead of time, we can use non-
informative priors for the parameters such as a normal distribution.

Posterior: The result of performing Bayesian linear regression is a distribution of possible model parameters based on the data and the prior. This allows us to quantify our uncertainty about the model: if we have fewer data points, the posterior distribution will be more spread out.

Bayesian Linear Regression Model Results with 500 (top) and 15000
observations (bottom)

There is much more variation in the fits when using fewer data points, which
represents a greater uncertainty in the model. With all of the data points, the OLS
and Bayesian Fits are nearly identical because the priors are washed out by the
likelihoods from the data.

When predicting the output for a single datapoint using our Bayesian Linear
Model, we also do not get a single value but a distribution. Following is the
probability density plot for the number of calories burned exercising for 15.5
minutes. The red vertical line indicates the point estimate from OLS.
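A minimal numeric sketch of this idea, assuming a Gaussian prior over the weights and a known noise variance σ² (this is one standard conjugate formulation, not necessarily the exact model used for the plots above):

# Bayesian linear regression with a Gaussian prior and known noise variance:
#   S_N = (S0^-1 + X^T X / sigma^2)^-1        (posterior covariance)
#   m_N = S_N (S0^-1 m0 + X^T y / sigma^2)    (posterior mean)
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.uniform(0, 10, 20)])   # bias + one feature
true_w = np.array([1.0, 2.5])
y = X @ true_w + rng.normal(0, 1.0, 20)                      # noisy observations

sigma2 = 1.0                      # assumed noise variance
m0 = np.zeros(2)                  # prior mean
S0 = np.eye(2) * 10.0             # broad, weakly informative prior covariance

S_N = np.linalg.inv(np.linalg.inv(S0) + X.T @ X / sigma2)    # posterior covariance
m_N = S_N @ (np.linalg.inv(S0) @ m0 + X.T @ y / sigma2)      # posterior mean

print("posterior mean of weights:", m_N)                 # concentrates near true_w
print("posterior std of weights:", np.sqrt(np.diag(S_N)))  # shrinks as data grows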

Gradient Descent in Linear Regression

In linear regression, the model aims to find the best-fit regression line to predict the value of y based on the given input value (x). While training, the model calculates the cost function, which measures the Root Mean Squared Error between the predicted value (pred) and the true value (y). The model aims to minimize this cost function.

To minimize the cost function, the model needs to have the best values of θ1 and θ2. Initially the model selects θ1 and θ2 values randomly and then iteratively updates these values in order to minimize the cost function until it reaches the minimum. By the time the model achieves the minimum cost function, it will have the best θ1 and θ2 values. Using these final updated values of θ1 and θ2 in the hypothesis equation, the model predicts the value of y in the best manner it can.

Therefore, the question arises: how do the θ1 and θ2 values get updated?

Linear Regression Cost Function:

J(θ) = (1/2m) Σᵢ (hθ(xᵢ) - yᵢ)²

Gradient descent update rule:

θj := θj - α · ∂J(θ)/∂θj

-> θj : weights of the hypothesis
-> hθ(xᵢ) : predicted y value for the i-th input
-> j : feature index number (can be 0, 1, 2, ..., n)
-> α : learning rate of gradient descent

Gradient descent steps down the cost function in the direction of the steepest descent. The size of each step is determined by the parameter α, known as the learning rate.

In the gradient descent algorithm, one can infer two points:

 If the slope is +ve : θj = θj – (+ve value). Hence the value of θj decreases.

 If the slope is -ve : θj = θj – (-ve value). Hence the value of θj increases.

If we choose α to be very large, gradient descent can overshoot the minimum. It may fail to converge or even diverge.

If we choose α to be very small, gradient descent will take small steps to reach the local minimum and will take a longer time to reach the minimum.
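A minimal sketch of batch gradient descent for a one-variable linear regression, on toy data generated from y = 1 + 2x:

# Gradient descent for h(x) = theta0 + theta1*x with mean-squared-error cost.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)    # toy data, exactly y = 1 + 2x

theta0, theta1 = 0.0, 0.0
alpha = 0.05                                   # learning rate
for _ in range(2000):
    error = (theta0 + theta1 * x) - y
    grad0 = error.mean()             # dJ/dtheta0 for J = (1/2m) * sum(error^2)
    grad1 = (error * x).mean()       # dJ/dtheta1
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)                # should approach 1.0 and 2.0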

Advantages:
 Flexibility: Gradient Descent can be used with various cost functions and
can handle non-linear regression problems.
 Scalability: Gradient Descent is scalable to large datasets since it updates
the parameters for each training example one at a time.
 Convergence: Gradient Descent can converge to the global minimum of the
cost function, provided that the learning rate is set appropriately.

Disadvantages:
 Sensitivity to Learning Rate: The choice of learning rate can be critical in
Gradient Descent since using a high learning rate can cause the algorithm to
overshoot the minimum, while a low learning rate can make the algorithm
converge slowly.
 Slow Convergence: Gradient Descent may require more iterations to
converge to the minimum since it updates the parameters for each training
example one at a time.
 Local Minima: Gradient Descent can get stuck in local minima if the cost
function has multiple local minima.
 Noisy updates: The updates in Gradient Descent are noisy and have a high
variance, which can make the optimization process less stable and lead to
oscillations around the minimum.

Linear Classification Models

Discriminant Functions

Maximum-likelihood and Bayesian parameter estimation techniques assume that the forms of the underlying probability densities are known, and use the training samples to estimate the values of their parameters. Here, instead, we assume we know the proper forms of the discriminant functions, and use the samples to estimate the values of the parameters of the classifier. Some of these discriminant functions are statistical and some of them are not. None of them, however, requires knowledge of the forms of the underlying probability distributions, and in this limited sense they can be said to be nonparametric.

Linear discriminant functions have a variety of pleasant analytical


properties. They can be optimal if the underlying distributions are cooperative,
such as Gaussians having equal covariance, as might be obtained through an
intelligent choice of feature detectors. Even when they are not optimal, we might be
willing to sacrifice some performance in order to gain the advantage of their
simplicity. Linear discriminant functions are relatively easy to compute and in the
absence of information suggesting otherwise, linear classifiers are attractive
candidates for initial, trial classifiers.

The problem of finding a linear discriminant function will be formulated as a problem of minimizing a criterion function. The obvious criterion function for classification purposes is the sample risk, or training error: the average loss incurred in classifying the set of training samples. It is difficult to derive the minimum-risk linear discriminant, and for that reason it will be suitable to investigate several related criterion functions that are analytically more tractable.

Linear Discriminant Functions and Decision Surfaces

A discriminant function that is a linear combination of the components of x can be written as

g(x) = wᵀx + w₀

where w is the weight vector and w₀ the bias or threshold weight. Linear discriminant functions are going to be studied for the two-category case, the multicategory case, and the general case (Figure 3.6.1). For the general case there will be c such discriminant functions, one for each of c categories.

Figure 3.6.1: Linear discriminant functions.


The Two-Category Case
For a discriminant function of the form above, a two-category classifier implements the following decision rule: decide w1 if g(x) > 0 and w2 if g(x) < 0. Thus, x is assigned to w1 if the inner product wᵀx exceeds the threshold -w₀, and to w2 otherwise. If g(x) = 0, x can ordinarily be assigned to either class, or can be left undefined. The equation g(x) = 0 defines the decision surface that separates points assigned to w1 from points assigned to w2. When g(x) is linear, this decision surface is a hyperplane. If x1 and x2 are both on the decision surface, then

wᵀx1 + w₀ = wᵀx2 + w₀

or

wᵀ(x1 - x2) = 0

and this shows that w is normal to any vector lying in the hyperplane. In general,
the hyperplane H divides the feature space into two half-spaces: decision
region R1 for w1 and region R2 for w2.

Because g(x)>0 if x is in R1, it follows that the normal vector w points into R1. It
is sometimes said that any x in R1 is on the positive side of H, and any x in R2 is
on the negative side (Figure 3.6.2).

The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane. The easiest way to see this is to express x as

x = xp + r·(w/||w||)

where xp is the normal projection of x onto H, and r is the desired algebraic distance, which is positive if x is on the positive side and negative if x is on the negative side. Then, because g(xp) = 0,

g(x) = r·||w||, so r = g(x)/||w||

In particular, the distance from the origin to H is given by w₀/||w||. If w₀ > 0, the origin is on the positive side of H, and if w₀ < 0, it is on the negative side. If w₀ = 0, then g(x) has the homogeneous form wᵀx, and the hyperplane passes through the origin.

A linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w₀. The discriminant function g(x) is proportional to the signed distance from x to the hyperplane, with g(x) > 0 when x is on the positive side, and g(x) < 0 when x is on the negative side.
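A minimal sketch of the two-category decision rule just described, with a hypothetical weight vector and bias:

# Decide w1 if g(x) > 0, w2 if g(x) < 0, where g(x) = w^T x + w0;
# the signed distance to the hyperplane is g(x)/||w||.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (hypothetical)
w0 = -1.0                   # bias / threshold weight (hypothetical)

def classify(x):
    g = w @ x + w0
    distance = g / np.linalg.norm(w)    # signed distance from x to the hyperplane
    return ("w1" if g > 0 else "w2"), distance

print(classify(np.array([2.0, 1.0])))   # g = 2, positive side -> w1
print(classify(np.array([0.0, 1.0])))   # g = -2, negative side -> w2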

The Multicategory Case

There is more than one way to devise multicategory classifiers employing linear discriminant functions. For example, we might reduce the problem to c two-class problems, where the i-th problem is solved by a linear discriminant function that separates points assigned to wi from those not assigned to wi.

A more extravagant approach would be to use c(c-1)/2 linear discriminants, one for every pair of classes. Both of these approaches can lead to regions in which the classification is undefined. We shall avoid this problem by defining c linear discriminant functions

gi(x) = wiᵀx + wi0,   i = 1, ..., c

and assigning x to wi if gi(x) > gj(x) for all j ≠ i; in case of ties, the classification is left undefined. The resulting classifier is called a linear machine.

A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in region Ri. If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane Hij defined by

gi(x) = gj(x), i.e. (wi - wj)ᵀx + (wi0 - wj0) = 0

Thus, with the linear machine it is not the weight vectors themselves but their differences that are important. While there are c(c-1)/2 pairs of regions, they need not all be contiguous, and the total number of hyperplane segments appearing in the decision surfaces is often fewer than c(c-1)/2.

Fig. 3.6.4 Decision boundaries defined by a linear machine.

Discriminative Machine Learning Model

A discriminative model refers to a class of models used in statistical classification, especially in supervised machine learning. Also known as conditional models, discriminative modeling learns the boundary between classes or labels in a dataset. A generative model, by contrast, models the joint probability of the data points and can create new instances using probability estimates and maximum likelihood. Unlike generative models, discriminative models have the advantage of being more robust to outliers. Discriminative models in machine learning include:

 Logistic regression
 Support vector machine
 Decision tree
 Random forests

Fig.3.7.1 Generative and Discriminative Model

Most of the Machine Learning and Deep Learning problems that you solve are conceptualized from the Generative and Discriminative Models. In Machine Learning, one can clearly distinguish between the two modelling types:

 Classifying an image as a dog or a cat falls under Discriminative Modelling.

 Producing a realistic dog or cat image is a Generative Modelling problem.

The more neural networks got adopted, the more the generative and discriminative domains grew. To understand the algorithms based on these models, you need to study the theory and all the modelling concepts.

Discriminative Modelling is typically used to solve classification problems in Machine Learning or Deep Learning, but it is not limited to classification tasks. It is also widely used in object detection, semantic segmentation, panoptic segmentation, keypoint detection, regression problems, and language modelling.

The discriminative model falls under the supervised learning branch. In a classification task, given that the data is labelled, it tries to distinguish among classes, for example, a car, a traffic light and a truck. Also known as classifiers, these models map image samples X to class labels Y, and discover the probability of an image sample X belonging to a class label Y.

They learn to model the decision boundaries among classes (such as cats, dogs and tigers). The decision boundary could be linear or non-linear. The data points that are far away from the decision boundary (i.e. the outliers) are not very important. The discriminative model tries to learn a boundary that separates the positive from the negative class, and comes up with the decision boundary. Only the points closest to this boundary are considered.

Discriminative models classify data points, without providing a model of how the points were generated.

Fig.3.7.2 Image showing the classification of data points by Discriminative models.

In trying to classify a sample x belonging to class label y, the discriminative model


indirectly learns certain features of the dataset that make its task easier. For
example, a car has four wheels of a circular shape and more length than width,
while the traffic light is vertical with three circular rings. These features help the
model distinguish between the two classes.
The discriminative models could be:

 Probabilistic
 • logistic regression
 • a deep neural network, which models P(Y|X)
 Non-probabilistic
 • Support Vector Machine (SVM), which tries to learn the mappings directly from the data points to the classes with a hyperplane.

Discriminative modelling learns to model the conditional probability of class label y given a set of features x, as P(Y|X).


Some of the discriminative models are:

 Support Vector Machine


 Logistic Regression
 k-Nearest Neighbour (kNN)
 Random Forest
 Deep Neural Network ( such as AlexNet, VGGNet, and ResNet )

Support Vector Machine

Support Vector Machine (SVM) is a non-parametric, supervised learning technique very popular with engineers because it produces excellent results with significantly less compute. A Machine Learning algorithm, it can be applied to both classification (output is discrete) and regression (output is continuous) problems. It is largely used in text classification, image classification, and protein and gene classification.

Fig. 3.7.3 Image showing the Kernel Trick technique and the dimensional space, using which SVM can be easily implemented.

SVM can separate both linear and non-linear data points. A kernel trick helps separate non-linear points. Having no kernel, the linear SVM finds the maximum-margin linear solution to the problem. The boundary points in the feature space are called support vectors (as shown in the figure above). Based on their relative position, the maximum margin is derived and an optimal hyperplane is drawn at the midpoint.

The hyperplane is (N-1)-dimensional, where N is the number of features present in the given dataset. For example, a line will be the decision boundary if a dataset has two features (2D input space).

Why do we need a hyperplane with maximum margin?

The decision boundary with the maximum margin works best, as it increases the chance of generalization. Giving enough freedom to the boundary points reduces the chance of misclassification. On the other hand, a decision boundary with smaller margins usually leads to overfitting.
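A minimal sketch of a maximum-margin linear SVM using scikit-learn on toy, hypothetical points; after fitting, the support vectors are exactly the boundary points discussed above.

# Linear (no kernel trick) maximum-margin SVM on toy 2D data.
from sklearn import svm

X = [[1, 1], [2, 1], [1, 2],     # class 0 (hypothetical points)
     [4, 4], [5, 4], [4, 5]]     # class 1 (hypothetical points)
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)
print("prediction for [3, 3]:", clf.predict([[3, 3]]))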

Fig. 3.7.4 Image depicting the selection of the hyperplane that maximises the
margin between the data points.

Logistic Regression

o Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives the probabilistic values which lie between 0 and 1.

o Logistic Regression is very similar to Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.

o Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables used for the classification. The image below shows the logistic function:

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to probabilities.

o It maps any real value into another value within a range of 0 and 1.

o The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic function.

o In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.

o The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

o In logistic regression, y can be between 0 and 1 only, so for this let's divide the above equation by (1-y):

y / (1-y) ; 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:

log[ y / (1-y) ] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for logistic regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dogs", or "sheep".

o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as "low", "Medium", or "High".

Steps in Logistic Regression: To implement Logistic Regression using Python, we will use the same steps as we have done in previous topics of Regression (a minimal code sketch follows the list below). Below are the steps:

o Data Pre-processing step

o Fitting Logistic Regression to the Training set

o Predicting the test result

o Test accuracy of the result(Creation of Confusion matrix)

o Visualizing the test set result.
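A minimal sketch of these steps with scikit-learn, on a hypothetical two-feature dataset (the data here is synthetic, only to make the snippet self-contained):

# Logistic Regression workflow: preprocess, fit, predict, evaluate.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # hypothetical binary target

# 1. Data pre-processing step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 2. Fitting Logistic Regression to the training set
clf = LogisticRegression().fit(X_train, y_train)

# 3. Predicting the test result
y_pred = clf.predict(X_test)

# 4. Test accuracy of the result (confusion matrix)
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))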

Terminologies involved in Logistic Regression:


Here are some common terms involved in logistic regression:

 Independent variables: The input characteristics or predictor factors applied to the dependent variable’s predictions.
 Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
which we are trying to predict.
 Logistic function: The formula used to represent how the independent and dependent variables relate to one another. The logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
 Odds: The ratio of something occurring to something not occurring. It is different from probability, as probability is the ratio of something occurring to everything that could possibly occur.
 Log-odds: The log-odds, also known as the logit function, is the natural
logarithm of the odds. In logistic regression, the log odds of the dependent
variable are modeled as a linear combination of the independent variables
and the intercept.
 Coefficient: The logistic regression model’s estimated parameters, show
how the independent and dependent variables relate to one another.
 Intercept: A constant term in the logistic regression model, which
represents the log odds when all independent variables are equal to
zero.
 Maximum likelihood estimation: The method used to estimate the
coefficients of the logistic regression model, which maximizes the likelihood
of observing the data given the model.

Naïve Bayes Classifier

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.

It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.

• To start with, let us consider a dataset.


• Consider a fictional dataset that describes the weather conditions for playing
a game of golf.
• Given the weather conditions, each tuple classifies the conditions as
fit(“Yes”) or unfit(“No”) for playing golf.
• Here is a tabular representation of our dataset.

Outlook Temperature Humidity Windy Play Golf

0 Rainy Hot High False No

1 Rainy Hot High True No

2 Overcast Hot High False Yes

3 Sunny Mild High False Yes

4 Sunny Cool Normal False Yes

5 Sunny Cool Normal True No

6 Overcast Cool Normal True Yes

7 Rainy Mild High False No

8 Rainy Cool Normal False Yes

9 Sunny Mild Normal False Yes

10 Rainy Mild Normal True Yes

11 Overcast Mild High True Yes

12 Overcast Hot Normal False Yes

13 Sunny Mild High True No

• The dataset is divided into two parts, namely, feature matrix and
the response vector.
• Feature matrix contains all the vectors(rows) of dataset in which each vector
consists of the value of dependent features.
• In above dataset, features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and
‘Windy’.
• Response vector contains the value of class variable(prediction or output)
for each row of feature matrix.
• In above dataset, the class variable name is ‘Play golf’.

Assumption:

• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.

With relation to our dataset, this concept can be understood as:

We assume that no pair of features are dependent.

For example, the temperature being ‘Hot’ has nothing to do with the
humidity or the outlook being ‘Rainy’ has no effect on the winds.

Hence, the features are assumed to be independent.

Secondly, each feature is given the same weight(or importance).

For example, knowing only temperature and humidity alone can’t predict the
outcome accurately.

None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.

Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) * P(A) / P(B)

The reverse is also true; for example:

P(B|A) = P(A|B) * P(B) / P(A)

where A and B are events and P(B) ≠ 0.

Basically, we are trying to find probability of event A, given the event B is


true. Event B is also termed as evidence.

P(A) is the prior probability of A (i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).

P(A|B) is the a posteriori probability of A, i.e. the probability of the event after the evidence is seen.

Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

P(y|X) = P(X|y) * P(y) / P(X)

where y is the class variable and X is a dependent feature vector (of size n):

X = (x1, x2, x3, …, xn)

Just to be clear, an example of a feature vector and corresponding class variable is (refer to the 1st row of the dataset):

X = (Rainy, Hot, High, False)

y = No

Naive assumption

Now, it's time to add the naive assumption to Bayes’ theorem, which is independence among the features.

So now, we split the evidence into independent parts.

Now, if any two events A and B are independent, then:

P(A,B) = P(A)P(B)

Hence, we reach the result:

P(y|x1, x2, …, xn) = [ P(x1|y) P(x2|y) … P(xn|y) P(y) ] / [ P(x1) P(x2) … P(xn) ]

which can be expressed as:

P(y|x1, x2, …, xn) ∝ P(y) ∏ᵢ₌₁ⁿ P(xi|y)

Now, we need to create a classifier model.

For this, we find the probability of a given set of inputs for all possible values of the class variable y and pick the output with maximum probability:

y = argmaxᵧ P(y) ∏ᵢ₌₁ⁿ P(xi|y)

So, finally, we are left with the task of calculating P(y) and P(xi | y).

Please note that P(y) is also called class probability and P(xi | y) is
called conditional probability.

Let us test it on a new set of features (let us call it today):

today = (Sunny, Hot, Normal, False)

So, the probability of playing golf is given by:

P(Yes|today) = P(Sunny Outlook|Yes) · P(Hot Temperature|Yes) · P(Normal Humidity|Yes) · P(No Wind|Yes) · P(Yes) / P(today)

The probability of not playing golf is given by:

P(No|today) = P(Sunny Outlook|No) · P(Hot Temperature|No) · P(Normal Humidity|No) · P(No Wind|No) · P(No) / P(today)

Since P(today) is common to both probabilities, we can drop that term and simply compare the numerators; the class with the larger value is the prediction.
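A minimal from-scratch sketch of this calculation on the golf dataset above: compute P(y) and P(xi|y) from counts, then pick the class that maximises P(y)·∏P(xi|y) for today = (Sunny, Hot, Normal, False).

# Naive Bayes by counting, on the play-golf table from this section.
from collections import Counter

data = [  # (Outlook, Temperature, Humidity, Windy, PlayGolf)
    ("Rainy","Hot","High","False","No"),     ("Rainy","Hot","High","True","No"),
    ("Overcast","Hot","High","False","Yes"), ("Sunny","Mild","High","False","Yes"),
    ("Sunny","Cool","Normal","False","Yes"), ("Sunny","Cool","Normal","True","No"),
    ("Overcast","Cool","Normal","True","Yes"),("Rainy","Mild","High","False","No"),
    ("Rainy","Cool","Normal","False","Yes"), ("Sunny","Mild","Normal","False","Yes"),
    ("Rainy","Mild","Normal","True","Yes"),  ("Overcast","Mild","High","True","Yes"),
    ("Overcast","Hot","Normal","False","Yes"),("Sunny","Mild","High","True","No"),
]
today = ("Sunny", "Hot", "Normal", "False")

labels = Counter(row[-1] for row in data)            # class counts: Yes=9, No=5
scores = {}
for y, n_y in labels.items():
    score = n_y / len(data)                          # class probability P(y)
    for i, value in enumerate(today):                # conditional probabilities P(xi|y)
        n_match = sum(1 for row in data if row[i] == value and row[-1] == y)
        score *= n_match / n_y
    scores[y] = score

print(scores)                                        # unnormalised posteriors
print("prediction:", max(scores, key=scores.get))    # expected: "Yes"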

Support Vector Machine - Maximal Margin Classifier

Hyperplane:

If we have a p-dimensional space, a hyperplane is a flat subspace with dimension p-1.

• For example, in two-dimensional space a hyperplane is a straight line.

• In three-dimensional space, a hyperplane is a two-dimensional subspace.

• Imagine a knife cutting through a piece of cheese that is in cubical shape
and dividing it into two parts.

 Then put both pieces back and observe that space in between both pieces,
that is a two-dimensional subspace in a three-dimensional space.

• For p greater than 3, the imagination can be harder, but the intuition
remains the same.

This is the equation of a hyperplane in a two-dimensional space:

β0 + β1X1 + β2X2 = 0

Similarly, the equation can be extended to the p-dimensional setting and looks like this:

β0 + β1X1 + β2X2 + … + βpXp = 0

• Each beta is a parameter for one of the many dimensions we have in our space.

• Therefore, if we have a point X that satisfies the above equation, then it means the point is on the hyperplane.

• In other words, if we have a point in, say, 5-dimensional space, then X1 to X5 will have a value for that single point in that space, and if, after you put the values into the equation above, it ends up equal to zero, then the point lies on the hyperplane.

• If the equation evaluates to less than zero, the point does not lie on the hyperplane but rather below it.

And, if the point lies on the upper side of the plane, then the equation holds like this:

β0 + β1X1 + β2X2 + … + βpXp > 0

Fig.3.10.1 A graph showing the plot for previous 2 equations
3.11 Random Forest Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of predictions, it predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

The diagram below explains the working of the Random Forest algorithm:

Fig.3.11.1 Schematic diagram for Random Forest

Assumptions for Random Forest

• Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct
output, while others may not.

• But together, all the trees predict the correct output.

• Therefore, below are two assumptions for a better Random forest classifier:

• There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.

• The predictions from each tree must have very low correlations.

How does the Random Forest algorithm work?

• Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase. The steps are listed below (a minimal code sketch follows the list):

• Step-1: Select random K data points from the training set.

• Step-2: Build the decision trees associated with the selected data points (subsets).

• Step-3: Choose the number N of decision trees that you want to build.

• Step-4: Repeat Steps 1 & 2.

• Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority votes.
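A minimal sketch of this workflow with scikit-learn, using synthetic toy data in place of a real dataset:

# Random Forest: N trees built on random subsets, majority vote for new points.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 trees
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("majority vote for first test point:", forest.predict(X_test[:1]))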

Example: Suppose there is a dataset that contains multiple fruit images.

• So, this dataset is given to the Random forest classifier.

• The dataset is divided into subsets and given to each decision tree.

Fig.3.11.2 An Example for Random Forest

Applications of Random Forest

• There are mainly four sectors where Random Forest is mostly used:

• Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.

• Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.

• Land Use: We can identify the areas of similar land use by this algorithm.

• Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression


tasks.

o It is capable of handling large datasets with high dimensionality.

o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

3.12 Decision Trees

• Goal: Build a decision tree to classify examples as positive or negative


instances of a concept using supervised learning from a training set

• A decision tree is a tree where

– each non-leaf node has associated with it an attribute (feature)

– each leaf node has associated with it a classification (+ or -)

– each arc has associated with it one of the possible values of the
attribute at the node from which the arc is directed

• Generalization: allow for >2 classes

– e.g., {sell, hold, buy}

Fig.3.12.1 A Decision Tree

Inductive learning and bias

• Suppose that we want to learn a function f(x) = y and we are given some
sample (x,y) pairs, as in figure (a)

• There are several hypotheses we could make about this function, e.g.: (b),
(c) and (d)

• A preference for one over the others reveals the bias of our learning
technique, e.g.:

– prefer piece-wise functions

– prefer a smooth function

- prefer a simple function and treat outliers as noise

Preference bias: Ockham’s Razor

• A.k.a. Occam’s Razor, Law of Economy, or Law of Parsimony

• Principle stated by William of Ockham (1285-1347/49), a scholastic, that

– “non sunt multiplicanda entia praeter necessitatem”

– or, entities are not to be multiplied beyond necessity

• The simplest consistent explanation is the best

• Therefore, the smallest decision tree that correctly classifies all of the
training examples is best.

• Finding the provably smallest decision tree is NP-hard, so instead of


constructing the absolute smallest tree consistent with the training
examples, construct one that is pretty small

R&N’s restaurant domain

• Develop a decision tree to model the decision a patron makes when


deciding whether or not to wait for a table at a restaurant

• Two classes: wait, leave

• Ten attributes: Alternative available? Bar in restaurant? Is it Friday? Are we


hungry? How full is the restaurant? How expensive? Is it raining? Do we
have a reservation? What type of restaurant is it? What’s the purported
waiting time?

• Training set of 12 examples

• ~ 7000 possible cases

Fig.3.12.2 R&N’s Restaurant Problem with examples

Fig.3.12.3 Decision Tree for the R&Ns Restaurant problem

Information theory

• If there are n equally probable possible messages, then the probability p of


each is 1/n

• Information conveyed by a message is -log(p) = log(n)

• E.g., if there are 16 messages, then log(16) = 4 and we need 4 bits to


identify/send each message

• In general, if we are given a probability distribution P = (p1, p2, .., pn)

• Then the information conveyed by the distribution (a.k.a. the entropy of P) is:

I(P) = -(p1*log(p1) + p2*log(p2) + .. + pn*log(pn))
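A minimal sketch of this entropy formula (log base 2, so the result is in bits):

# Entropy I(P) = -sum_i p_i * log2(p_i).
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/16] * 16))   # 16 equally likely messages -> 4.0 bits
print(entropy([0.5, 0.5]))    # a 50/50 split -> 1.0 bit
print(entropy([0.9, 0.1]))    # a skewed split conveys less information (~0.47 bits)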

Evaluation methodology

• Standard methodology:

1. Collect a large set of examples (all with correct classifications)

2. Randomly divide collection into two disjoint sets: training and test

3. Apply learning algorithm to training set giving hypothesis H


4. Measure performance of H w.r.t. test set

• Important: keep the training and test sets disjoint!

• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for
different training sets and sizes of training sets

• If you improve your algorithm, start again with step 1 to avoid evolving the
algorithm to work well on just this collection

Fig.3.12.4 Learning curve

Summary: Decision tree learning

• Inducing decision trees is one of the most widely used learning methods in
practice
• Can out-perform human experts in many problems
• Strengths include
– Fast
– Simple to implement
– Can convert result to a set of easily interpretable rules
– Empirically valid in many commercial products
– Handles noisy data
• Weaknesses include:
– Univariate splits/partitioning using only one attribute at a time so
limits types of possible trees

– Large decision trees may be hard to understand

– Requires fixed-length feature vectors

– Non-incremental (i.e., batch method)
