Unit 3

S.NO TOPICS
1 INTRODUCTION TO MACHINE LEARNING
2 LINEAR REGRESSION MODELS
3 LEAST SQUARES, SINGLE AND MULTIPLE VARIABLES
4 BAYESIAN LINEAR REGRESSION, GRADIENT DESCENT, LINEAR CLASSIFICATION MODELS
5 DISCRIMINANT FUNCTIONS, PROBABILISTIC DISCRIMINATIVE MODEL
6 LOGISTIC REGRESSION, PROBABILISTIC GENERATIVE MODEL, NAÏVE BAYES
7 MAXIMUM MARGIN CLASSIFIER - SUPPORT VECTOR MACHINE, DECISION TREE, RANDOM FORESTS
Introduction to Machine Learning
Fig.3.1.2 Classification of Machine Learning
Supervised Learning
The goal of supervised learning is to map input data to the output data. Supervised learning can be further divided into two categories of algorithms:
• Classification
• Regression
Unsupervised Learning
• Unsupervised learning is a learning method in which a machine learns
without any supervision.
• The training is provided to the machine with the set of data that has not
been labeled, classified, or categorized, and the algorithm needs to act on
that data without any supervision.
Semi Supervised Learning
It is a method that uses a small amount of labeled data and a large amount
of unlabeled data to train a model.
Reinforcement Learning
• The goal of an agent is to get the most reward points, and hence, it improves
its performance.
Deep Learning
For example, in image processing, lower layers may identify edges, while
higher layers may identify the concepts relevant to a human such as digits
or letters or faces.
We can understand the concept of regression analysis using the below example:
Example: Suppose a marketing company spends a certain amount on advertisement every year and records the corresponding sales. Now, the company wants to spend $200 on advertisement in the current year and wants to know the predicted sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
In regression, we plot a graph between the variables that best fits the given datapoints; using this plot, the machine learning model can make predictions about the data.
In simple words, "Regression shows a line or curve that passes through all the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
o Outliers: An outlier is an observation that has either a very low or a very high value in comparison to the other observed values. An outlier may hamper the result, so it should be avoided.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. And if our algorithm does not perform well even with the training dataset, the problem is called underfitting.
Linear Regression
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-
axis), hence called linear regression.
o If there is only one input variable (x), then such linear
regression is called simple linear regression. And if there is
more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using the below image. Here we are predicting the salary of an employee on the basis of the years of experience.
Y = aX + b
Logistic Regression:
o Logistic regression is another supervised learning algorithm
which is used to solve the classification problems. In
classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical
variable such as 0 or 1, Yes or No, True or False, Spam or not
spam, etc.
o It is a predictive analysis algorithm which works on the concept
of probability.
o Logistic regression is a type of regression, but it is different from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function or logistic function, which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
For example, say we have a list of how many topics future engineers here at
freeCodeCamp can solve if they invest 1, 2, or 3 hours continuously. Then we can
predict how many topics will be covered after 4 hours of continuous study even
without that data being available to us.
Setting up an example
Before we jump into the formula and code, let's define the data we're going to
use.
Let's assume that our objective is to figure out how many topics are covered
by a student per hour of learning.
Each pair (X, Y) will represent a student. Since we all have different rates of
learning, the number of topics solved can be higher or lower for the same time
invested.
Hours of study (X)   Topics solved (Y)
1     1.5
1.2   2
1.5   3
2     1.8
2.3   2.7
2.5   4.7
2.7   7.1
3     10
3.1   6
3.2   5
3.6   8.9
You can read it like this: "Someone spent 1 hour and solved 2 topics" or
"One student after 3 hours solved 10 topics".
Each point is a student (X, Y) and how long it took that specific student to complete
a certain number of topics
The formula
Y = a + bX
a is the intercept, in other words the value that we expect, on average, from a
student that practices for one hour. One hour is the least amount of time we're
going to accept into our example data set.
b is the slope or coefficient, in other words the number of additional topics solved per extra hour of study. As the hours (X) spent studying increase, Y increases by b for each additional hour.
x̄ -> (1 + 1.2 + 1.5 + 2 + 2.3 + 2.5 + 2.7 + 3 + 3.1 + 3.2 + 3.6) / 11 = 2.37
ȳ -> (1.5 + 2 + 3 + 1.8 + 2.7 + 4.7 + 7.1 + 10 + 6 + 5 + 8.9) / 11 = 4.79
Now that we have the average we can expand our table to include the new results:
b = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)²
∑(x − x̄)(y − ȳ) -> … + 0.01 + 0.76 + 3.28 + 0.88 + 0.17 + 5.06 = 20.73
∑(x − x̄)² -> 1.88 + 1.37 + 0.76 + 0.14 + 0.00 + 0.02 + 0.11 + 0.40 + 0.53 + 0.69 + 1.51 = 7.41
And finally we divide 20.73 by 7.41 and we get b = 2.8
Calculating "a"
All that is left is a, for which the formula is ȳ = a + b·x̄. We've already obtained all those other values, so we can substitute them and we get:
4.79 = a + 2.8 * 2.37
4.79 = a + 6.64
a = 4.79 - 6.64
a = -1.85
The result
Our final formula becomes:
Y = -1.85 + 2.8*X
Now we replace the X in our formula with each value that we have:
Hours (X)   Predicted topics (-1.85 + 2.8 * X)
1     0.95
1.2   1.51
1.5   2.35
2     3.75
2.3   4.59
2.5   5.15
2.7   5.71
3     6.55
3.1   6.83
3.2   7.11
3.6   8.23
If we want to predict how many topics we expect a student to solve with 8 hours of
study, we replace it in our formula:
Y = -1.85 + 2.8*8
Y = 20.55
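As a quick check of the arithmetic above, here is a minimal Python sketch (plain Python, no libraries) that reproduces b, a and the 8-hour prediction from the example hours/topics data; small rounding differences from the hand calculation are expected.

hours  = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6]
topics = [1.5, 2, 3, 1.8, 2.7, 4.7, 7.1, 10, 6, 5, 8.9]

x_mean = sum(hours) / len(hours)    # ~2.37
y_mean = sum(topics) / len(topics)  # ~4.79

# b = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
num = sum((x - x_mean) * (y - y_mean) for x, y in zip(hours, topics))
den = sum((x - x_mean) ** 2 for x in hours)
b = num / den               # ~2.8
a = y_mean - b * x_mean     # ~-1.85

print(f"Y = {a:.2f} + {b:.2f} * X")
print("Prediction for 8 hours:", round(a + b * 8, 2))  # ~20.5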
How to perform a multiple linear regression
To find the best-fit line for each independent variable, multiple linear regression calculates three things:
• The regression coefficients that lead to the smallest overall model error.
• The t statistic of the overall model.
• The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).
It then calculates the t statistic and p value for each regression coefficient in the model.
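A rough sketch of how these quantities (coefficients, t statistics and p values) can be obtained in Python. The two-variable dataset below is synthetic and purely illustrative, and statsmodels' OLS is only one of several libraries that report these values.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # two independent variables
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_with_const = sm.add_constant(X)             # add the intercept column
model = sm.OLS(y, X_with_const).fit()

print(model.params)    # estimated regression coefficients
print(model.tvalues)   # t statistic for each coefficient
print(model.pvalues)   # p value for each coefficient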
The aim of Bayesian Linear Regression is not to find the single “best” value of the model parameters, but rather to determine the posterior distribution of the model parameters. Not only is the response generated from a probability distribution, but the model parameters are assumed to come from a distribution as well. The posterior probability of the model parameters is conditional upon the training inputs and outputs:
P(β | y, X) = P(y | β, X) · P(β | X) / P(y | X)
Here we can observe the two primary benefits of Bayesian Linear Regression: the priors (domain knowledge or an estimate of the parameters can be included, and non-informative priors can be used when nothing is known in advance) and the posterior (the result is a distribution over the model parameters rather than a single point estimate).
Bayesian Linear Regression Model Results with 500 (top) and 15000
observations (bottom)
There is much more variation in the fits when using fewer data points, which
represents a greater uncertainty in the model. With all of the data points, the OLS
and Bayesian Fits are nearly identical because the priors are washed out by the
likelihoods from the data.
When predicting the output for a single datapoint using our Bayesian Linear
Model, we also do not get a single value but a distribution. Following is the
probability density plot for the number of calories burned exercising for 15.5
minutes. The red vertical line indicates the point estimate from OLS.
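A minimal sketch of Bayesian linear regression in Python using scikit-learn's BayesianRidge. This is not the exact model behind the figures above; the exercise-duration/calories data here is synthetic and only meant to show that the prediction comes back as a distribution (mean and standard deviation) rather than a single value.

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
duration = rng.uniform(5, 30, size=200).reshape(-1, 1)        # minutes exercised
calories = 7.0 * duration.ravel() + rng.normal(scale=15, size=200)

model = BayesianRidge()
model.fit(duration, calories)

# The prediction for a single data point is summarised by its mean and
# standard deviation, i.e. a distribution rather than a point estimate.
mean, std = model.predict(np.array([[15.5]]), return_std=True)
print(f"Calories for 15.5 minutes: {mean[0]:.1f} +/- {std[0]:.1f}")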
In linear regression, the model aims to find the best-fit regression line to predict the value of y based on the given input value (x). While training the model, it calculates the cost function, which measures the Root Mean Squared Error between the predicted value (pred) and the true value (y). The model aims to minimize this cost function.
To minimize the cost function, the model needs the best values of θ1 and θ2. Initially the model selects θ1 and θ2 randomly and then iteratively updates these values in order to minimize the cost function until it reaches the minimum. By the time the model achieves the minimum cost function, it will have the best θ1 and θ2 values. Using these final values of θ1 and θ2 in the hypothesis equation, the model predicts the value of y in the best manner it can.
Therefore, the question arises: how do the θ1 and θ2 values get updated?
Linear Regression Cost Function:
J(θ) = (1/2m) ∑ (hθ(xi) − yi)², summed over the m training examples
Gradient Descent update rule:
θj := θj − α · ∂J(θ)/∂θj
-> θj : weights of the hypothesis.
-> hθ(xi) : predicted y value for the ith input.
-> j : feature index number (can be 0, 1, 2, ..., n).
-> α : learning rate of Gradient Descent.
Gradient Descent steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, known as the learning rate.
In the Gradient Descent algorithm, one can infer two points:
• If we choose α to be very small, Gradient Descent will take small steps and will take a longer time to reach the minimum.
• If we choose α to be very large, Gradient Descent can overshoot the minimum and may fail to converge, or even diverge.
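A small Python sketch of batch gradient descent for simple linear regression on made-up toy data; the two parameters below play the roles of θ1 and θ2 above, and the cost being minimised is the squared-error cost described earlier.

import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    theta0, theta1 = 0.0, 0.0              # initial parameter values
    m = len(x)
    for _ in range(iterations):
        pred = theta0 + theta1 * x         # current hypothesis h(x)
        error = pred - y
        grad0 = error.sum() / m            # partial derivative w.r.t. theta0
        grad1 = (error * x).sum() / m      # partial derivative w.r.t. theta1
        theta0 -= alpha * grad0            # step in direction of steepest descent
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)   # generated from y = 1 + 2x
print(gradient_descent(x, y))                  # converges close to (1, 2)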
Advantages:
Flexibility: Gradient Descent can be used with various cost functions and
can handle non-linear regression problems.
Scalability: Gradient Descent is scalable to large datasets since it updates
the parameters for each training example one at a time.
Convergence: Gradient Descent can converge to the global minimum of the
cost function, provided that the learning rate is set appropriately.
Disadvantages:
Sensitivity to Learning Rate: The choice of learning rate can be critical in
Gradient Descent since using a high learning rate can cause the algorithm to
overshoot the minimum, while a low learning rate can make the algorithm
converge slowly.
Slow Convergence: Gradient Descent may require more iterations to
converge to the minimum since it updates the parameters for each training
example one at a time.
Local Minima: Gradient Descent can get stuck in local minima if the cost
function has multiple local minima.
Noisy updates: The updates in Gradient Descent are noisy and have a high
variance, which can make the optimization process less stable and lead to
oscillations around the minimum.
Linear Classification
A linear discriminant function has the form
g(x) = wᵀx + w0
where w is the weight vector and w0 the bias or threshold weight. Linear discriminant functions are going to be studied for the two-category case, the multicategory case, and the general case (Figure 3.6.1). For the general case there will be c such discriminant functions, one for each of c categories.
If x1 and x2 are both on the decision surface, then
wᵀx1 + w0 = wᵀx2 + w0
or
wᵀ(x1 − x2) = 0
and this shows that w is normal to any vector lying in the hyperplane. In general,
the hyperplane H divides the feature space into two half-spaces: decision
region R1 for w1 and region R2 for w2.
Because g(x)>0 if x is in R1, it follows that the normal vector w points into R1. It
is sometimes said that any x in R1 is on the positive side of H, and any x in R2 is
on the negative side (Figure 3.6.2).
In particular, the distance from the origin to H is given by w0 / ||w||. If w0 > 0, the origin is on the positive side of H, and if w0 < 0, it is on the negative side. If w0 = 0, then g(x) has the homogeneous form wᵀx, and the hyperplane passes through the origin.
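A small Python sketch of a two-category linear discriminant g(x) = wᵀx + w0; the weight vector, bias and points are made up for illustration. The sign of g(x) selects the decision region, and g(x)/||w|| is the signed distance from x to the hyperplane H.

import numpy as np

w = np.array([2.0, -1.0])   # weight vector, normal to H
w0 = 0.5                    # bias / threshold weight

def g(x):
    return w @ x + w0

for x in [np.array([1.0, 1.0]), np.array([-1.0, 2.0])]:
    value = g(x)
    distance = value / np.linalg.norm(w)        # signed distance to H
    region = "R1 (class w1)" if value > 0 else "R2 (class w2)"
    print(x, f"g(x) = {value:+.2f}", f"distance = {distance:+.2f}", region)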
The Multicategory Case
A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in region Ri. If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane Hij defined by
gi(x) = gj(x), i.e. (wi − wj)ᵀx + (wi0 − wj0) = 0
Thus, with the linear machine it is not the weight vectors themselves but
their differences that are important. While there are c(c-1)/2 pairs of regions, they
need not all be contiguous, and the total number of hyperplane segments
appearing in the decision surfaces is often fewer than c(c-1)/2.
Fig. 3.6.4 Decision boundaries defined by a linear machine.
• Logistic regression
• Support vector machine
• Decision tree
• Random forests
Fig.3.7.1 Generative and Discriminative Model
Most of the Machine Learning and Deep Learning problems that you solve are
conceptualized from the Generative and Discriminative Models. In Machine
Learning, one can clearly distinguish between the two modelling types:
As neural networks became more widely adopted, the generative and discriminative domains grew with them. To understand the algorithms based on these models, you need to study the theory and all the modelling concepts.
They learn to model the decision boundaries among classes (such as cats, dogs and tigers). The decision boundary could be linear or non-linear. The data points that are far away from the decision boundary (i.e. the outliers) are not very important. The discriminative model tries to learn a boundary that separates the positive from the negative class, and comes up with the decision boundary. Only the data points close to the boundary really matter.
Discriminative models classify data points without providing a model of how the data points were actually generated.
Discriminative models can be probabilistic or non-probabilistic:
• Probabilistic: logistic regression; a deep neural network, which models P(Y|X).
• Non-probabilistic: the Support Vector Machine (SVM), which tries to learn the mappings directly from the data points to the classes with a hyperplane.
SVM typically requires significantly less compute. A classical Machine Learning algorithm, it can be applied to both classification (discrete output) and regression (continuous output) problems. It is largely used in text classification, image classification, and protein and gene classification.
Fig. 3.7.3 Image showing the Kernel Trick technique, which maps the data into a higher-dimensional space in which SVM can be easily applied.
SVM can separate both linear and non-linear data points. A kernel trick helps separate non-linear points. Without a kernel, the linear SVM finds the maximum-margin hyperplane, a linear solution to the problem. The boundary points in the feature space are called support vectors (as shown in the figure above). Based on their relative positions, the maximum margin is derived and an optimal hyperplane is drawn at the midpoint.
The decision boundary with the maximum margin works best, as it increases the chance of generalization. Giving enough freedom to the boundary points reduces the chance of misclassification. On the other hand, a decision boundary with smaller margins usually leads to overfitting.
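A hedged scikit-learn sketch of the maximum-margin classifier on synthetic data: a linear kernel for linearly separable points, and an RBF kernel to illustrate the kernel trick on non-linearly separable data.

from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable case: a linear SVM finds the maximum-margin hyperplane.
X_lin, y_lin = make_blobs(n_samples=100, centers=2, random_state=0)
linear_svm = SVC(kernel="linear", C=1.0).fit(X_lin, y_lin)
print("support vectors (linear):", linear_svm.support_vectors_.shape[0])

# Non-linear case: the RBF kernel implicitly maps the points into a higher
# dimensional space in which a separating hyperplane exists.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_circ, y_circ)
print("training accuracy (RBF):", rbf_svm.score(X_circ, y_circ))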
Fig. 3.7.4 Image depicting the selection of the hyperplane that maximises the
margin between the data points.
Logistic Regression
o Logistic Regression is very similar to Linear Regression except in how the two are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not
based on its weight, etc.
Logistic Function (Sigmoid Function):
o It maps any real value into another value within a range of 0 and 1.
o The value produced by logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of the straight line:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 - y):
y / (1 - y), which is 0 for y = 0 and infinity for y = 1
o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn
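A minimal sketch tying the sigmoid/log-odds algebra above to code; the single-feature pass/fail dataset is synthetic and only illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # maps any real value into (0, 1)

rng = np.random.default_rng(2)
hours = rng.uniform(0, 10, size=200).reshape(-1, 1)
passed = (hours.ravel() + rng.normal(scale=1.5, size=200) > 5).astype(int)

clf = LogisticRegression().fit(hours, passed)
b0, b1 = clf.intercept_[0], clf.coef_[0][0]

x_new = 6.0
log_odds = b0 + b1 * x_new        # the log-odds are linear in the input
probability = sigmoid(log_odds)   # matches clf.predict_proba for this input
print(f"log-odds = {log_odds:.2f}, P(pass) = {probability:.2f}")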
On the basis of the categories, Logistic Regression can be classified into three types: binomial (only two possible categories of the dependent variable, such as 0 or 1), multinomial (three or more unordered categories) and ordinal (three or more ordered categories).
Key terminologies involved in Logistic Regression:
Sigmoid (logistic) function: An S-shaped curve that describes how the independent and dependent variables relate to one another. The logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
Odds: The ratio of something occurring to something not occurring. It is different from probability, as probability is the ratio of something occurring to everything that could possibly occur.
Log-odds: The log-odds, also known as the logit function, is the natural
logarithm of the odds. In logistic regression, the log odds of the dependent
variable are modeled as a linear combination of the independent variables
and the intercept.
Coefficient: The logistic regression model’s estimated parameters, show
how the independent and dependent variables relate to one another.
Intercept: A constant term in the logistic regression model, which
represents the log odds when all independent variables are equal to
zero.
Maximum likelihood estimation: The method used to estimate the
coefficients of the logistic regression model, which maximizes the likelihood
of observing the data given the model.
It is not a single algorithm but a family of algorithms where all of them share
a common principle, i.e. every pair of features being classified is independent of
each other.
Consider the standard weather ("play golf") dataset; a sample row from it reads: Outlook = Sunny, Temperature = Cool, Humidity = Normal, Windy = False, Play golf = Yes.
• The dataset is divided into two parts, namely, feature matrix and
the response vector.
• Feature matrix contains all the vectors(rows) of dataset in which each vector
consists of the value of dependent features.
• In above dataset, features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and
‘Windy’.
• Response vector contains the value of class variable(prediction or output)
for each row of feature matrix.
• In above dataset, the class variable name is ‘Play golf’.
Assumption:
• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.
For example, the temperature being ‘Hot’ has nothing to do with the humidity, and the outlook being ‘Rainy’ has no effect on the winds.
Secondly, each feature is given the same weight (or importance).
For example, knowing only temperature and humidity alone can’t predict the
outcome accurately.
Bayes’ Theorem
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the posterior probability of A given the evidence B, and P(A) is the prior of A (the prior probability, i.e. the probability of the event before evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).
Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:
P(y|X) = P(X|y) · P(y) / P(X)
where y is the class variable (for example, y = No) and X is the dependent feature vector (of size n):
X = (x1, x2, x3, ..., xn)
Naive assumption
Now, it's time to put the naive assumption into Bayes’ theorem, which is independence among the features: for independent events A and B,
P(A, B) = P(A) · P(B)
Hence, we reach the result:
P(y | x1, x2, ..., xn) = [ P(x1|y) · P(x2|y) · ... · P(xn|y) · P(y) ] / [ P(x1) · P(x2) · ... · P(xn) ]
For this, we find the probability of given set of inputs for all possible values of the
class variable y and pick up the output with maximum probability.
y = argmax over y of P(y) · ∏ (i = 1 to n) P(xi | y)
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that P(y) is also called class probability and P(xi | y) is
called conditional probability.
• P(Yes | today) = [ P(Sunny Outlook | Yes) · P(Hot Temperature | Yes) · P(Normal Humidity | Yes) · P(No Wind | Yes) · P(Yes) ] / P(today)
• P(No | today) = [ P(Sunny Outlook | No) · P(Hot Temperature | No) · P(Normal Humidity | No) · P(No Wind | No) · P(No) ] / P(today)
Since P(today) is common to both probabilities, we can remove that term and simply compare the two remaining products.
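A rough Python sketch of the computation above. The class prior and conditional probabilities below are assumed values of the kind one would estimate by counting rows in the golf training table; they are not taken from the text.

# hypothetical class priors P(Play golf = Yes) and P(Play golf = No)
p_yes = 9 / 14
p_no = 5 / 14

# hypothetical conditional probabilities estimated from a training table
cond_yes = {"Sunny": 2/9, "Hot": 2/9, "Normal": 6/9, "NoWind": 6/9}
cond_no  = {"Sunny": 3/5, "Hot": 2/5, "Normal": 1/5, "NoWind": 2/5}

today = ["Sunny", "Hot", "Normal", "NoWind"]

# multiply the conditional probabilities under the naive independence assumption
score_yes, score_no = p_yes, p_no
for feature in today:
    score_yes *= cond_yes[feature]
    score_no *= cond_no[feature]

# P(today) is common to both, so only the relative scores matter
total = score_yes + score_no
print("P(Yes | today) ~", round(score_yes / total, 3))
print("P(No  | today) ~", round(score_no / total, 3))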
Support Vector Classifier
Hyperplane:
• Imagine a knife cutting through a piece of cheese that is in cubical shape
and dividing it into two parts.
Then put both pieces back and observe that space in between both pieces,
that is a two-dimensional subspace in a three-dimensional space.
• For p greater than 3, the imagination can be harder, but the intuition
remains the same.
Similarly, the equation can be extended to the p-dimensional setting and looks like this:
β0 + β1X1 + β2X2 + ... + βpXp = 0
• Each beta is a parameter for one of the many dimensions we have in our space.
• This means that if the equation gives a value less than zero for a point, the point does not lie on the plane but below the hyperplane:
β0 + β1X1 + β2X2 + ... + βpXp < 0
And if a point lies on the upper side of the plane, then the equation holds like this:
β0 + β1X1 + β2X2 + ... + βpXp > 0
Fig.3.10.1 A graph showing the plot for the previous two equations
3.11 Random Forest Algorithm
A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Fig.3.11.1 Schematic diagram for Random Forest
• Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct
output, while others may not.
• Therefore, below are two assumptions for a better Random forest classifier:
• There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
• The predictions from each tree must have very low correlations.
• Step-1: Select K random data points from the training set.
• Step-2: Build the decision trees associated with the selected data points (subsets).
• Step-3: Choose the number N of decision trees that you want to build.
• Step-4: Repeat Step-1 and Step-2 until N trees have been built.
• Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes (a code sketch of this procedure follows below).
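A hedged scikit-learn sketch of the procedure in the steps above; the iris dataset stands in for the fruit-image example, and n_estimators plays the role of the number N of trees.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each of the N trees is built on a bootstrapped subset of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# each tree votes; the majority vote becomes the prediction for a new point
print("test accuracy:", forest.score(X_test, y_test))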
Example: Suppose there is a dataset that contains multiple fruit images.
• The dataset is divided into subsets and given to each decision tree.
• There are mainly four sectors where Random Forest is mostly used:
• Banking: the banking sector mostly uses this algorithm for the identification of loan risk.
• Medicine: with the help of this algorithm, disease trends and risks of the disease can be identified.
• Land use: we can identify areas of similar land use with this algorithm.
• Marketing: marketing trends can be identified using this algorithm.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is not as suitable for regression tasks.
– each arc has associated with it one of the possible values of the
attribute at the node from which the arc is directed
• Suppose that we want to learn a function f(x) = y and we are given some
sample (x,y) pairs, as in figure (a)
• There are several hypotheses we could make about this function, e.g.: (b),
(c) and (d)
• A preference for one hypothesis over the others reveals the bias of our learning technique, e.g. prefer the simplest hypothesis that is consistent with the data (Ockham's razor).
• Therefore, the smallest decision tree that correctly classifies all of the training examples is best.
Fig.3.12.2 R&N’s Restaurant Problem with examples
Fig.3.12.3 Decision Tree for the R&N's Restaurant problem
Information theory
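Decision-tree induction typically chooses the attribute to split on using information-theoretic measures such as entropy and information gain. A minimal Python sketch on a tiny hypothetical label set:

from math import log2
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the label proportions
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(labels, attribute_values):
    # gain = entropy before the split minus the weighted entropy after it
    base = entropy(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for lab, v in zip(labels, attribute_values) if v == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

labels  = ["Yes", "Yes", "No", "No", "Yes", "No"]
outlook = ["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Sunny"]
print("entropy:", round(entropy(labels), 3))
print("gain(Outlook):", round(information_gain(labels, outlook), 3))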
Evaluation methodology
• Standard methodology:
1. Collect a large set of examples
2. Randomly divide the collection into two disjoint sets: training and test
3. Apply the learning algorithm to the training set to obtain a hypothesis H
4. Measure the percentage of the test set correctly classified by H
• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different training sets and sizes of training sets (a brief code sketch of this loop follows below)
• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection
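A brief Python sketch of this methodology: split the collection into disjoint training and test sets, induce a decision tree on the training portion, and measure test-set accuracy over several random splits (the iris data is only a stand-in for a real collection).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# repeating with different random splits / training-set sizes gives an idea
# of the efficiency and robustness of the learning algorithm
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"split {seed}: test accuracy = {tree.score(X_te, y_te):.3f}")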
• Inducing decision trees is one of the most widely used learning methods in
practice
• Can out-perform human experts in many problems
• Strengths include
– Fast
– Simple to implement
– Can convert result to a set of easily interpretable rules
– Empirically valid in many commercial products
– Handles noisy data
• Weaknesses include:
– Univariate splits/partitioning (using only one attribute at a time), which limits the types of possible trees