CH 5
REGRESSION
INTRODUCTION TO REGRESSION
• Regression analysis is the premier method of supervised learning.
• Given a training dataset D containing N training points (xi, yi), where
i = 1...N, regression analysis is used to model the relationship
between one or more independent variables xi and a dependent
variable yi.
• The relationship between the dependent and independent variables
can be represented as a function as follows:
y = f(x)
• A line of the form y = ax + b can be fitted to the data points to
indicate the relationship between x and y.
• Multiple Regression It is a type of regression where a linear model is fitted to describe the
relationship between two or more independent variables and one dependent variable.
• Polynomial Regression It is a type of non-linear regression method for describing relationships among
variables, where an nth-degree polynomial is used to model the relationship between one independent
variable and one dependent variable.
• Polynomial multiple regression is used to model two or more independent variables and one dependent variable.
• Logistic Regression It is used for predicting categorical variables and involves one or more independent
variables and one dependent variable. When the dependent variable is binary, it is also known as a binary classifier.
• Lasso and Ridge Regression Methods These are special variants of regression where regularization
methods are used to limit the number (lasso) and size (ridge) of the coefficients of the independent
variables (see the sketch after this list).
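As a quick illustration of the variants above, the scikit-learn sketch below fits each of them on synthetic data; the data, polynomial degree, and regularization strengths (alpha) are illustrative assumptions, not values from the text.

```python
# A minimal scikit-learn sketch of the regression variants above,
# fitted on synthetic data (data and parameters are illustrative
# assumptions, not from the text).
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                 # two independent variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, 100)

multiple = LinearRegression().fit(X, y)               # multiple regression
poly = make_pipeline(PolynomialFeatures(degree=3),    # polynomial regression on
                     LinearRegression()).fit(X[:, :1], y)  # one independent variable
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: drives some coefficients to 0
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficient sizes
logreg = LogisticRegression().fit(X, (y > y.mean()).astype(int))  # binary classifier
```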
Limitations of Regression Method
• Outliers - Outliers are abnormal data points. They can bias the outcome
of the regression model, as outliers pull the regression line towards them.
• Number of cases - The ratio of cases to independent variables
should be at least 20:1. For every explanatory variable,
there should be at least 20 samples. At least five samples per
variable are required in extreme cases.
• Missing data - Missing data in the training dataset can make the model
unfit for the sampled data.
• Multicollinearity - If the explanatory variables are highly correlated (0.9
and above), the regression is vulnerable to bias. Singularity leads to
perfect correlation of 1. The remedy is to remove the explanatory
variables that exhibit correlation above this threshold (a sketch of this
check follows this list). If there is a tie, then the tolerance (1 - R squared)
is used to eliminate the variables that have the greatest value.
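As a hedged illustration of the multicollinearity check, the sketch below flags explanatory-variable pairs whose correlation exceeds the 0.9 threshold mentioned above; the data are synthetic assumptions.

```python
# A minimal sketch of a multicollinearity check, assuming the 0.9
# correlation threshold from the text; the data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)            # pairwise correlation matrix
n = corr.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if abs(corr[i, j]) > 0.9:              # flag highly correlated pairs
            print(f"x{i+1} and x{j+1} are highly correlated: r = {corr[i, j]:.3f}")
```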
INTRODUCTION TO LINEAR REGRESSION
• In the simplest form, the linear regression model can be created by
fitting a line among the scattered data points. The line is of the form
given below:
y = a0 + a1 * x + e
• Here, a0 is the intercept, which represents the bias, and a1 represents
the slope of the line.
• These are called regression coefficients. e is the error in prediction.
The assumptions of linear regression are listed
as follows:
• The observations (y) are random and are mutually independent.
• The difference between the predicted and true values is called an
error. The errors are also mutually independent, with the same
distribution, such as a normal distribution with zero mean and
constant variance.
• The distribution of the error term is independent of the joint
distribution of the explanatory variables.
• The unknown parameters of the regression model are constants.
• The idea of linear regression is based on the Ordinary Least Squares
(OLS) approach. In this method, the data points are modelled using a
straight line.
• Any arbitrarily drawn line is not an optimal line.
• In other words, OLS is an optimization technique where the sum of the
squared differences between the data points and the line is minimized,
as sketched below.
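A minimal sketch of OLS for the line y = a0 + a1 * x, using the standard closed-form estimates; the sample data are illustrative assumptions, not taken from the text.

```python
# A minimal OLS sketch for y = a0 + a1 * x, using the standard
# closed-form estimates; the sample data are illustrative assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

x_mean, y_mean = x.mean(), y.mean()
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
a0 = y_mean - a1 * x_mean                                             # intercept

y_pred = a0 + a1 * x
sse = np.sum((y - y_pred) ** 2)   # the sum of squared errors OLS minimizes
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, SSE = {sse:.3f}")
```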
Linear Regression in Matrix Form
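In matrix form, the model can be written as Y = Xa + e, and the standard OLS estimate is a = (XᵀX)⁻¹XᵀY. The NumPy sketch below illustrates this formulation; the sample data are illustrative assumptions.

```python
# Linear regression in matrix form: Y = X a + e, with the standard
# OLS estimate a = (X^T X)^(-1) X^T Y.  Data here are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])   # first column of 1s for the intercept a0
a = np.linalg.solve(X.T @ X, X.T @ y)       # solves the normal equations
print(f"a0 = {a[0]:.3f}, a1 = {a[1]:.3f}")  # matches the closed-form OLS result
```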
VALIDATION OF REGRESSION METHODS
Coefficient of Determination
• The sum of the squares of the differences between the y value of each
data pair and the average of y is called the total variation, i.e.,
total variation = Σ(yi - ȳ)².
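Building on the total variation just defined, the coefficient of determination is commonly computed as R² = 1 - SSE/SST, where SST is the total variation and SSE is the residual (unexplained) variation; the sketch below assumes this standard definition, with illustrative data.

```python
# A minimal sketch of the coefficient of determination, assuming the
# standard definition R^2 = 1 - SSE/SST; the data are illustrative.
import numpy as np

y      = np.array([1.2, 1.9, 3.2, 3.8, 5.1])   # observed values
y_pred = np.array([1.1, 2.1, 3.0, 4.0, 4.9])   # values predicted by a fitted line

sst = np.sum((y - y.mean()) ** 2)   # total variation about the mean of y
sse = np.sum((y - y_pred) ** 2)     # unexplained (residual) variation
r2 = 1.0 - sse / sst                # fraction of variation explained by the model
print(f"R^2 = {r2:.3f}")
```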