
INTRODUCTION TO REGRESSION
• Regression analysis is the premier method of supervised learning.
• Given a training dataset D containing N training points (xi, yi), where
i = 1...N, regression analysis is used to model the relationship
between one or more independent variables xi and a dependent
variable yi.
• The relationship between the dependent and independent variables
can be represented as a function as follows:

y = f(x)

• The feature variable x is also known as an explanatory variable, a
predictor variable, an independent variable, a covariate, or a domain
point.
• y is a dependent variable. Dependent variables are also called labels,
target variables, or response variables.
• Regression analysis determines the change in the response variable when
one explanatory variable is varied while keeping all other variables
constant.
• This is used to determine the relationship that each explanatory
variable exhibits with the response variable. Thus, regression analysis is
used for prediction and forecasting.
INTRODUCTION TO LINEARITY, CORRELATION,
AND CAUSATION
• The quality of the regression analysis is determined by factors
such as correlation and causation.
Regression and Correlation:
• Correlation between two variables can be assessed effectively using a
scatter plot, which is a plot of the explanatory variable against the
response variable.
• It is a 2D graph showing the relationship between two variables.
• The x-axis of the scatter plot shows the independent (input or predictor)
variable.
• The y-axis of the scatter plot shows the output (dependent or predicted)
variable.
• Positive, negative, and random correlations are illustrated in the figure.
• In a positive correlation, a change in one variable is associated with a
change in the other variable in the same direction.
• In a negative correlation, the relationship between the variables is
reciprocal (they move in opposite directions), while in a random
correlation, no relationship exists between the variables.
• While correlation is about the relationship between variables, say x and
y, regression is about predicting one variable given the other (a brief
sketch of assessing correlation with a scatter plot is given below).
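
As an illustration (not from the source), a minimal Python sketch that draws a scatter plot of an explanatory variable x against a response variable y and computes their Pearson correlation coefficient; the variable names and the sample data are assumptions made for the example:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample data: x is the explanatory variable, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")   # close to +1 => strong positive correlation

# Scatter plot: x-axis = independent variable, y-axis = dependent variable.
plt.scatter(x, y)
plt.xlabel("x (explanatory variable)")
plt.ylabel("y (response variable)")
plt.title("Scatter plot of x vs y")
plt.show()
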
Regression and Causation
• Causation is about a causal relationship between variables, say x and y.
• Causation means knowing whether x causes y to happen or vice versa.
• "x causes y" is often denoted as "x implies y". Correlation and regression
relationships are not the same as causation relationships.
• For example, the correlation between economic background and marks
scored does not imply that economic background causes high marks.
Linearity and Non-linearity Relationships
• A linear relationship between the variables means that the relationship
between the dependent and independent variables can be visualized as a
straight line.
• A line of the form y = ax + b can be fitted to the data points to indicate
the relationship between x and y (a brief fitting sketch follows this list).
• By linearity, it is meant that as one variable increases, the
corresponding variable also increases in a linear manner.
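
A minimal sketch (an illustration, not from the source) of fitting a line of the form y = ax + b to data points with NumPy; the sample data are assumed:

import numpy as np

# Hypothetical data points with an approximately linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# np.polyfit with degree 1 returns the slope a and intercept b of y = a*x + b.
a, b = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {a:.2f} * x + {b:.2f}")
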
Types of Regression Methods
• Linear Regression: It is a type of regression where a line is fitted upon given data for finding the linear
relationship between one independent variable and one dependent variable to describe relationships.
• Multiple Regression: It is a type of regression where a linear model is fitted for finding the linear relationship
between two or more independent variables and one dependent variable to describe relationships among variables.
• Polynomial Regression: It is a non-linear regression method of describing relationships among variables,
where an nth-degree polynomial is used to model the relationship between one independent variable and one
dependent variable.
• Polynomial multiple regression is used to model two or more independent variables and one dependent variable.
• Logistic Regression: It is used for predicting categorical variables involving one or more independent variables
and one dependent variable. With a binary target, it is also known as a binary classifier.
• Lasso and Ridge Regression Methods: These are special variants of regression where regularization
methods are used to limit the number and size of the coefficients of the independent variables (a brief
scikit-learn sketch of these methods follows this list).
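
For illustration only, a short scikit-learn sketch showing how each of these regression types might be set up; the dataset X, y and all parameter values are assumptions, not part of the source:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Hypothetical data: two independent variables, one dependent variable.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]], dtype=float)
y = np.array([3.0, 3.1, 7.2, 7.0, 11.1, 11.0])

linear = LinearRegression().fit(X[:, :1], y)     # one independent variable
multiple = LinearRegression().fit(X, y)          # two or more independent variables
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X[:, :1], y)
ridge = Ridge(alpha=1.0).fit(X, y)               # shrinks the sizes of coefficients
lasso = Lasso(alpha=0.1).fit(X, y)               # can drive some coefficients to zero

# Logistic regression predicts a categorical (here binary) target.
y_class = (y > 7).astype(int)
logistic = LogisticRegression().fit(X, y_class)
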
Limitations of Regression Method
• Outliers - Outliers are abnormal data points. They can bias the outcome of the
regression model, as outliers pull the regression line towards them.
• Number of cases - The ratio of cases to independent variables should be
at least 20:1. For every explanatory variable, there should be at least 20
samples. At least five samples per variable are required in extreme cases.
• Missing data - Missing values in the training data can make the model
unfit for the sampled data.
• Multicollinearity - If explanatory variables are highly correlated (0.9
and above), the regression is vulnerable to bias. Singularity leads to a
perfect correlation of 1. The remedy is to remove explanatory variables
that exhibit such high correlation. If there is a tie, then the tolerance
(1 - R squared) is used, eliminating the variable with the greatest R
squared (i.e., the lowest tolerance). A brief sketch of this check follows.
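
As an illustrative sketch (data and threshold values assumed, not from the source), multicollinearity can be checked by computing pairwise correlations among the explanatory variables and the tolerance (1 - R squared) of each variable regressed on the others:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical design matrix: three explanatory variables, the third nearly a copy of the first.
X = np.array([[1.0, 2.0, 1.1],
              [2.0, 1.0, 2.1],
              [3.0, 4.0, 3.0],
              [4.0, 3.0, 4.2],
              [5.0, 6.0, 5.1],
              [6.0, 5.0, 6.0]])

# Pairwise correlations among explanatory variables (values of 0.9 and above signal trouble).
print(np.corrcoef(X, rowvar=False))

# Tolerance of each variable = 1 - R^2 when it is regressed on the other variables.
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"variable {j}: tolerance = {1 - r2:.3f}")   # small tolerance => multicollinearity
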
INTRODUCTION TO LINEAR REGRESSION
• In the simplest form, the linear regression model can be created by
fitting a line among the scattered data points. The line is of the form
given below:

y = a0 + a1 * x + e

• Here, a0 is the intercept, which represents the bias, and a1 represents
the slope of the line.
• These are called regression coefficients. e is the error in prediction.
The assumptions of linear regression are listed
as follows:
• The observations (y) are random and are mutually independent.
• The difference between the predicted and true values is called the
error. The errors are also mutually independent, with the same
distribution, such as a normal distribution with zero mean and
constant variance.
• The distribution of the error term is independent of the joint
distribution of explanatory variables.
• The unknown parameters of the regression models are constants.
• The idea of linear regression is based on the Ordinary Least Squares
(OLS) approach. In this method, the data points are modelled using a
straight line.
• Any arbitrarily drawn line is not an optimal line, as shown in the figure.
• In other words, OLS is an optimization technique in which the sum of
squared differences between the data points and the line is minimized.
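
A minimal sketch (illustration only, with assumed data) of the OLS idea for a single explanatory variable: the slope a1 and intercept a0 that minimize the sum of squared differences have the closed form a1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2) and a0 = y_mean - a1 * x_mean:

import numpy as np

# Hypothetical data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 5.9, 8.2, 9.9])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS estimates for y = a0 + a1 * x + e.
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a0 = y_mean - a1 * x_mean

residuals = y - (a0 + a1 * x)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, sum of squared errors = {np.sum(residuals**2):.3f}")
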
Linear Regression in Matrix Form
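
As an illustration under standard assumptions (data assumed, not from the source), the matrix form stacks the observations as y = X * a + e, where X contains a column of ones for the intercept, and the OLS estimate is given by the normal equation a_hat = (X^T X)^(-1) X^T y:

import numpy as np

# Hypothetical data: one explanatory variable, plus a column of ones for the intercept a0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 5.9, 8.2, 9.9])
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) a_hat = X^T y.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(f"a0 = {a_hat[0]:.3f}, a1 = {a_hat[1]:.3f}")   # matches the closed-form result above
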
VALIDATION OF REGRESSION METHODS
Coefficient of Determination
• The sum of the squares of the differences between the observed value of
y and the average of y is called the total variation.
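
As an illustrative sketch (data assumed), the total variation is sum((yi - y_mean)^2), the unexplained variation is the sum of squared errors sum((yi - yi_predicted)^2), and the coefficient of determination is R^2 = 1 - (unexplained variation / total variation):

import numpy as np

# Hypothetical observed values and predictions from a fitted regression line.
y      = np.array([2.2, 4.1, 5.9, 8.2, 9.9])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

total_variation = np.sum((y - y.mean()) ** 2)          # SST
unexplained_variation = np.sum((y - y_pred) ** 2)      # SSE
r_squared = 1 - unexplained_variation / total_variation
print(f"R^2 = {r_squared:.3f}")
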
