ML Unit 3 Notes
UNIT-3: SYLLABUS
Regression
Introduction to regression analysis, Simple
linear regression, Multiple linear regression,
Assumptions in Regression Analysis, Main
Problems in Regression Analysis, Improving
Accuracy of the linear regression model,
Polynomial Regression Model, Logistic
Regression, Regularization, Regularized
Linear Regression, Regularized Logistic
Regression.
3.1 Introduction
In the context of regression, the dependent variable (Y) is the one whose value is to be predicted,
e.g. the price quote of a real estate property.
The dependent variable (Y) is functionally related to one
(say, X) or more independent variables called
predictors.
Regression is essentially finding a relationship (or)
association between the dependent variable (Y) and the
independent variable(s) (X), i.e. to find the function ‘f ’
for the association Y = f (X).
The most common regression algorithms are
◦ Simple linear regression
◦ Multiple linear regression
◦ Polynomial regression
◦ Multivariate adaptive regression splines
◦ Logistic regression
◦ Maximum likelihood estimation (least squares)
3.2 Simple Linear Regression
Simple linear regression is the simplest regression
model which involves only one predictor.
This model assumes a linear relationship between the
dependent variable and the predictor variable, as shown in the figure.
In the context of real estate problem, if we take Price of a
Property as the dependent variable and the Area of the
Property (in sq. m.) as the predictor variable, we can build a
model using simple linear regression.
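As a minimal sketch of how such a model could be fitted, assuming made-up area/price values (the notes do not include the actual data), a NumPy least-squares fit might look like this:

```python
# Minimal sketch: simple linear regression Price = a + b * Area.
# The area/price values are hypothetical placeholders, not data
# from the notes.
import numpy as np

area = np.array([50.0, 75.0, 100.0, 125.0, 150.0])  # sq. m. (hypothetical)
price = np.array([30.0, 44.0, 62.0, 73.0, 95.0])    # price (hypothetical units)

# np.polyfit with deg=1 returns the coefficients [b, a] of Y = a + b*X.
b, a = np.polyfit(area, price, deg=1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict the price of a 110 sq. m. property.
print("predicted price:", a + b * 110)
```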
3.2.1 Slope of the simple linear regression model
1. Linear positive slope
A positive slope always moves upward on a graph from left to right: as the X value increases, the Y value also increases.
Scenario 1 for positive slope: ΔY is positive and ΔX is positive
Scenario 2 for positive slope: ΔY is negative and ΔX is negative
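In both scenarios the ratio of the changes is positive, since the slope is defined as the change in Y divided by the change in X:

$$ b = \frac{\Delta Y}{\Delta X} = \frac{Y_2 - Y_1}{X_2 - X_1} $$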
2. Curve linear positive slope
Curves in these graphs slope upward from left to
right.
The slope with respect to X may vary between the two graphs, but it will always be positive; hence, the above graphs are called graphs with a curve linear positive slope.
3. Linear negative slope
A negative slope always moves downward on a graph
from left to right.
As the X value (on the X-axis) increases, the Y value decreases.
Scenario 1 for negative slope: ΔY is positive and ΔX is negative
Scenario 2 for negative slope: ΔY is negative and ΔX is positive
4. Curve linear negative slope
Curves in these graphs slope downward from left
to right.
The slope with respect to X may vary between the two graphs, but it will always be negative; hence, the above graphs are called graphs with a curve linear negative slope.
3.2.2 No relationship graph
The scatter graph shown in the figure indicates a ‘no relationship’ curve, as it is very difficult to conclude whether the relationship between X and Y is positive or negative.
3.2.3 Error in simple regression
For a regression model, X and Y values are
provided to the machine, and it identifies the
values of a (intercept) and b (slope) by
relating the values of X and Y.
Identifying the exact match of values for a
and b is not always possible. There will be
some error value (ɛ) associated with it. This
error is called marginal or residual error.
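With this error term included, the simple linear regression model is written as

$$ Y = a + bX + \varepsilon $$

where a is the intercept, b is the slope, and ε is the residual error for each observation.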
3.2.4 Example of simple regression
A college professor believes that if the
grade for internal examination is high in
a class, the grade for external
examination will also be high.
A random sample of 15 students in that
class was selected, and the data is given
below:
A scatter plot was drawn to explore the relationship between the independent variable (internal marks) mapped to the X-axis and the dependent variable (external marks) mapped to the Y-axis.
• We can observe from the graph that the line (i.e. the regression line) does not predict the data exactly.
• Instead, it just cuts through the data.
• Some predictions are lower than expected, while some others are
higher than expected.
Residual is the distance between the
predicted point on the regression line and the
actual point.
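A short sketch of how these residuals could be computed, with placeholder marks standing in for the 15 (internal, external) pairs from the table:

```python
# Sketch: fit the regression line and compute residuals.
# The marks below are hypothetical placeholders, NOT the 15
# student records from the notes' table.
import numpy as np

internal = np.array([20, 25, 28, 30, 35], dtype=float)  # X: internal marks
external = np.array([45, 52, 60, 58, 70], dtype=float)  # Y: external marks

b, a = np.polyfit(internal, external, deg=1)  # fit Y = a + b*X
predicted = a + b * internal

# Residual = actual point minus the predicted point on the line.
residuals = external - predicted
print(residuals)
```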
In simple linear regression, the line is drawn using the regression formula Y = a + bX.
For a polynomial of degree 3, the regression line becomes slightly curved for the above 15 data points.
3.7 Polynomial Regression Model
Example: Let us use the below data set of (X, Y) for a degree 14 polynomial.
At degree 14 (an extreme case), the regression line will be overfitting into all the original values of X.
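A sketch of this effect, using a hypothetical roughly linear data set of 15 points (the notes' actual (X, Y) values are not reproduced here):

```python
# Sketch: training error of polynomial fits of increasing degree.
# With 15 points, a degree-14 polynomial can pass through every
# point, driving the training error toward zero (overfitting).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 15)
Y = 2.0 * X + rng.normal(scale=2.0, size=X.size)  # roughly linear + noise

for degree in (1, 3, 14):
    coeffs = np.polyfit(X, Y, deg=degree)  # NumPy may warn that the
    fitted = np.polyval(coeffs, X)         # degree-14 fit is poorly conditioned
    sse = np.sum((Y - fitted) ** 2)
    print(f"degree {degree:2d}: training SSE = {sse:.4f}")
```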
3.8 Logistic Regression
Logistic regression is one of the most popular
Machine Learning algorithms, which comes under
the Supervised Learning technique.
It is used for predicting the categorical
dependent variable using a given set of
independent variables.
Logistic regression predicts the output of a categorical dependent variable; therefore, the outcome must be a categorical or discrete value.
It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
In Logistic regression, instead of fitting a regression
line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, e.g. whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
In logistic regression, dependent variable (Y) is
binary (0,1) and independent variables (X) are
continuous in nature.
The probabilities describing the possible
outcomes (probability that Y = 1) of a single trial
are modeled as a logistic function of the
predictor variables.
In the logistic regression model, a chi-square
test is used to gauge how well the logistic
regression model fits the data.
The goal of logistic regression is to predict the
likelihood that Y is equal to 1 given certain
values of X.
So, we are predicting probabilities rather than
the scores of the dependent variable.
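As an illustrative sketch of predicting probabilities rather than scores (using scikit-learn as one possible implementation, with hypothetical data not taken from the notes):

```python
# Sketch: logistic regression with a single continuous predictor.
# Data is hypothetical: years of experience (X) vs. project
# success 0/1 (Y).
import numpy as np
from sklearn.linear_model import LogisticRegression

years = np.array([[1], [3], [5], [8], [12], [15], [18], [20]])
success = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(years, success)

# predict_proba returns [P(Y=0), P(Y=1)] for each input row.
print(model.predict_proba([[10]]))  # probability of success at 10 years
print(model.predict([[10]]))        # hard 0/1 class label
```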
Example: We might try to predict whether a small project will succeed or fail on the basis of the number of years of experience of the project manager handling the project.
We presume that those project managers who have been
managing projects for many years will be more likely to succeed.
This means that as X (the number of years of experience of
project manager) increases, the probability that Y will be equal to
1 (success of the new project) will tend to increase.
If we take a hypothetical example in which 60 already executed projects were studied and the years of experience of the project managers ranges from 0 to 40 years, we could represent this tendency to increase the probability that Y = 1 with a graph.
To illustrate this, it is convenient to segregate years of experience into categories (i.e. 0–8, 9–16, 17–24, 25–32, 33–40). If we compute the mean score on Y (averaging the 0s and 1s) for each category of years of experience, we will get something like the values sketched below.
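A small sketch of this bucketing step, assuming pandas and hypothetical project records (the 60 actual projects are not listed in the notes):

```python
# Sketch: mean of the 0/1 outcomes per experience band.
# The records are hypothetical, for illustration only.
import pandas as pd

df = pd.DataFrame({
    "years":   [2, 6, 10, 14, 19, 23, 27, 30, 35, 38],
    "success": [0, 0, 0,  1,  1,  1,  1,  1,  1,  1],
})
bands = pd.cut(df["years"], bins=[0, 8, 16, 24, 32, 40],
               labels=["0-8", "9-16", "17-24", "25-32", "33-40"],
               include_lowest=True)

# Averaging the 0s and 1s per band estimates P(Y = 1) in that band.
print(df.groupby(bands, observed=True)["success"].mean())
```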
When the graph is drawn for the above values of X and Y, it appears like the S-shaped curve in the figure.
As X increases, the probability that Y = 1 increases.
In other words, when the project manager has more
years of experience, a larger percentage of projects
succeed.
A perfect relationship is represented by a perfectly curved S rather than a straight line.
In logistic regression, we use a logistic function,
which always takes values between zero and one.
The logistic formulae are stated in terms of the
probability that Y = 1, which is referred to as P. The
probability that Y is 0 is 1 − P.
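Written out, the standard one-predictor logistic function (reusing the intercept a and slope b notation from earlier) is:

$$ P = \frac{e^{a+bX}}{1 + e^{a+bX}} = \frac{1}{1 + e^{-(a+bX)}}, \qquad P(Y=0) = 1 - P $$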