QT - Unit 2 - Part B - Regression

UNIT II – Part B

REGRESSION
Regression Analysis
Establishing correlation is a prerequisite for linear regression: we cannot use
linear regression unless there is a linear correlation between the variables.
Correlation analysis describes the present or past situation; it uses sample data
to infer a property of the source population or process, and does not look into
the future. Linear regression, by contrast, is used to predict results.

Correlation analysis studies whether the variables under study are related or not,
and to what degree. Correlation analysis does not attempt to identify a
cause-effect relationship; regression does.
In correlation, we ask to what degree the plotted data forms a shape that seems
to follow an imaginary line through it, but we do not try to specify that line.
In linear regression, establishing that line is the whole point: we calculate a
best-fit line through the data, y = a + bx.
Regression analysis establishes the "nature of relationship" between the
variables. It studies the functional relationship and provides a mechanism for
prediction or forecasting.
Regression analysis is a statistical method to model the relationship
between a dependent (target or outcome) variable and one or more
independent (predictor) variables.
Regression analysis helps us to understand how the value of the dependent
variable is changing corresponding to an independent variable when other
independent variables are held fixed. It predicts continuous/real values
such as temperature, age, salary, price, etc.
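As a minimal sketch of this idea, the following fits a line y = a + bx to a handful of experience/salary pairs and predicts a new value. The figures are made-up illustrative data, not taken from the text:

```python
# A minimal sketch of simple linear regression (y = a + b*x) using only the
# standard library. The experience/salary figures below are made-up
# illustrative data.
from statistics import mean

x = [1, 2, 3, 4, 5]            # independent variable: years of experience
y = [30, 35, 42, 48, 53]       # dependent variable: salary (in thousands)

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)   # slope: b_yx
a = y_bar - b * x_bar                    # intercept

def predict(xi):
    """Predicted value of the dependent variable for a given xi."""
    return a + b * xi

print(round(predict(6), 2))   # salary predicted for 6 years of experience
```

Here salary is the dependent (target) variable and years of experience the independent (predictor) variable.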

Dependent Variable: The main factor in regression analysis which we want to
predict or understand is called the dependent variable. It is also called the
outcome or target variable.

Independent Variable: The factors which affect the dependent variable, or which
are used to predict its values, are called independent variables, also called
predictor variables.
(In the accompanying scatter plot, the green points are the actual data points.)
Least Squares Method
Line of Best Fit/Regression Line
The least-squares regression method is a technique commonly used in
Regression Analysis. It is a mathematical method used to find the best fit
line that represents the relationship between an independent and
dependent variable in such a way that the error is minimized.
The Line of best fit is drawn across a scatter plot of data points in order to
represent a relationship between those data points.
The least squares method is one of the most effective ways to draw the line of
best fit. It is based on the idea that the sum of the squares of the errors
(residuals) must be minimized as far as possible, hence the name "least squares"
method.
Regression Line
A regression line is a statistical concept that describes the relationship
between an independent variable and a dependent variable and facilitates
prediction. It is a straight line that reflects the best-fit connection between
the independent and dependent variables in a dataset.

• The independent variable is generally shown on the X-axis and the dependent
variable on the Y-axis.
• The main purpose of developing a regression line is to predict or estimate the
value of the dependent variable based on the values of one or more independent
variables.
There are always two lines of regression:
Y on X – predicts the value of Y from known values of X.
X on Y – predicts the value of X from known values of Y.

Each line is "best fit" in the least-squares sense: the line of Y on X minimizes
the sum of the squares of the vertical distances from the observed points to the
line, while the line of X on Y minimizes the sum of the squares of the
horizontal distances.
When X is known and Y is to be predicted, Y on X is used.
When Y is known and X is to be predicted, X on Y is used.
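A short sketch of the two lines on the same illustrative data set. Y on X minimizes vertical distances and X on Y minimizes horizontal ones, so in general the two lines differ (they coincide only when r = ±1):

```python
# Sketch of the two regression lines on one data set. The data are
# illustrative, not from the text.
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

def reg_coeff(u, v):
    """Regression coefficient of v on u: cov(u, v) / var(u)."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / \
           sum((ui - u_bar) ** 2 for ui in u)

b_yx = reg_coeff(x, y)   # coefficient for the line Y on X
b_xy = reg_coeff(y, x)   # coefficient for the line X on Y

# Y on X: predict Y when X = 6; X on Y: predict X when Y = 7.
y_hat = mean(y) + b_yx * (6 - mean(x))
x_hat = mean(x) + b_xy * (7 - mean(y))
```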
Assumptions of (Simple Linear) Regression
We make a few assumptions when we use linear regression to model the
relationship between independent and dependent variables. These
assumptions are essentially conditions that should be met before we draw
inferences regarding the model estimates or before we use a model to make a
prediction.

Regression fails to deliver good results with data sets that do not fulfil its
assumptions. Therefore, for a successful regression analysis, it is essential to
validate these assumptions:
1. Linear relationship - There should be a linear relationship between dependent
(response) variable and independent (predictor) variable(s).
2. Normality of Errors- The errors or residuals must be normally distributed.
3. Homoscedasticity (or, equal variance around the line) - The error terms must
have constant variance.
4. No multicollinearity - The independent variables should not be correlated.
5. No autocorrelation - There should be no correlation between the residuals
(errors). The presence of correlation in the error terms drastically reduces the
model's accuracy.
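A quick residual check related to the assumptions above, assuming a line fitted by least squares. The residuals always average to zero by construction, and a crude lag-1 correlation of the residuals gives a rough autocorrelation check. The data are illustrative:

```python
# Residual diagnostics sketch for a least-squares fit. Data are illustrative.
from statistics import mean

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
assert abs(mean(residuals)) < 1e-9   # holds for any least-squares fit

# Crude lag-1 autocorrelation of residuals; values near 0 are reassuring.
e = residuals
lag1 = sum(e[i] * e[i + 1] for i in range(len(e) - 1)) / sum(ei ** 2 for ei in e)
```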
Coefficient of Regression
The regression coefficient b_yx measures the average change in Y per unit change
in X; likewise, b_xy measures the average change in X per unit change in Y.

Regression Equations/Lines
Y on X:  y − ȳ = b_yx (x − x̄)
X on Y:  x − x̄ = b_xy (y − ȳ)

Regression Coefficient – Some Formulas
1. From original data: b_yx = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²)
2. From actual means: b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
3. From assumed means (dx = x − A, dy = y − B):
   b_yx = (nΣdx·dy − Σdx·Σdy) / (nΣdx² − (Σdx)²)
4. From covariance & standard deviation: b_yx = Cov(X, Y) / σx²
5. From correlation coefficient & standard deviation: b_yx = r·(σy/σx)
(The corresponding b_xy formulas follow by interchanging x and y, e.g.
b_xy = r·(σx/σy).)

Relation between Correlation Coefficient & Regression Coefficient
b_yx · b_xy = r², so r = ±√(b_yx · b_xy), taking the sign common to the two
regression coefficients.
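The relation between the regression coefficients and r can be checked numerically. The sketch below computes b_yx, b_xy, and r from covariance and standard deviations on illustrative data and verifies b_yx = r·(σy/σx) and b_yx·b_xy = r²:

```python
# Sketch verifying the relation between regression coefficients and r.
# Data are illustrative.
from statistics import mean, pstdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
cov = mean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])  # population covariance
sx, sy = pstdev(x), pstdev(y)   # population standard deviations

b_yx = cov / sx ** 2
b_xy = cov / sy ** 2
r = cov / (sx * sy)

assert abs(b_yx - r * sy / sx) < 1e-9
assert abs(b_yx * b_xy - r ** 2) < 1e-9
```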
Properties of Regression Coefficients
• The correlation coefficient is the geometric mean of the two regression
coefficients: r = ±√(b_yx · b_xy).
• Both regression coefficients have the same sign, which is also the sign of r.
• If one regression coefficient is numerically greater than 1, the other must be
less than 1, since b_yx · b_xy = r² ≤ 1.
• Regression coefficients are independent of change of origin but not of change
of scale.
• The two regression lines intersect at the point (x̄, ȳ).
Coefficient of Determination
• The coefficient of determination (R²) measures how well a
statistical model predicts an outcome. The outcome is represented by the
model’s dependent variable.
• The coefficient of determination is written as R², pronounced "R squared."
• The lowest possible value of R² is 0 and the highest possible value is 1. Put
simply, the better a model is at making predictions, the closer R² will be to 1.
You can see in the first dataset that when R² is high, the observations are
close to the model's predictions; in other words, most points are close to the
line of best fit.

In contrast, in the second dataset, when R² is low, the observations are far
from the model's predictions; in other words, many points are far from the line
of best fit.

Note: The coefficient of determination is always positive, even when the
correlation is negative.
Calculating the Coefficient of Determination
You can choose between two formulas to calculate the coefficient of
determination (R²) of a simple linear regression. The first is specific to
simple linear regression: R² is simply the square of the correlation
coefficient,
R² = r².
Alternatively, a formula that can be used for many types of statistical models
compares the residual sum of squares (RSS) with the total sum of squares (TSS):
R² = 1 − RSS/TSS = 1 − Σ(y − ŷ)² / Σ(y − ȳ)².
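The two approaches can be checked against each other on illustrative data: (i) square the correlation coefficient, and (ii) compute 1 − RSS/TSS from the fitted line. For simple linear regression they agree:

```python
# Sketch computing R² two ways for a simple linear regression and checking
# that they agree. Data are illustrative.
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r_squared = sxy ** 2 / (sxx * syy)   # method (i): r²

b = sxy / sxx                        # fitted slope
a = y_bar - b * x_bar                # fitted intercept
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
r_squared_alt = 1 - rss / syy        # method (ii): 1 - RSS/TSS

assert abs(r_squared - r_squared_alt) < 1e-9
```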
Interpretation of Coefficient of Determination
• R² is the proportion of variance that is shared between the independent and
dependent variables.
• The coefficient of determination (R²) is interpreted as the proportion of
variance in the dependent variable that is predicted by the statistical model.
• R² is the proportion of variance "explained" or "accounted for" by the model.
The proportion that remains (1 − R²) is the variance that is not predicted by
the model.
Correlation vs. Regression

1. Correlation means the relationship between two or more variables which vary
in sympathy, so that movements in one tend to be accompanied by corresponding
movements in the other.
   Regression is a mathematical measure expressing the average relationship
between the two variables.
2. Correlation analysis attempts to determine the "degree and direction of
relationship" between variables.
   Regression analysis attempts to determine the "nature and extent of
relationship" between variables, i.e. the functional relationship between them.
3. Correlation need not imply a cause-and-effect relationship between the
variables under study.
   Regression analysis indicates a cause-and-effect relationship between the
variables.
4. There may be nonsense or spurious correlation between two variables, which is
due to pure chance and has no practical relevance.
   There is no such thing as nonsense regression.
5. The correlation coefficient is symmetric in the two variables:
r(X, Y) = r(Y, X).
   Regression coefficients are not symmetric in X and Y: b_yx is not equal to
b_xy.
6. Correlation cannot be used for forecasting/prediction purposes.
   Regression is a forecasting device: it can be used to predict the value of
the dependent variable from a given value of the independent variable.
7. Correlation analysis is confined to the study of linear relationships between
the variables and therefore has limited applications.
   Regression analysis has much wider applications, as it studies linear as well
as non-linear relationships between the variables.
8. The correlation coefficient is independent of change of origin and scale.
   Regression coefficients are independent of change of origin but not of scale.

Similarities between correlation and regression
In addition to the differences, there are some similarities between correlation
and regression:
• Both work to quantify the direction and strength of the relationship between
two numeric variables.
• Any time the correlation is negative, the regression coefficients will also be
negative; any time the correlation is positive, the regression coefficients will
be positive.
• The correlation coefficient always lies between −1 and +1. Regression
coefficients are not bounded in this way, but their product b_yx · b_xy = r² can
never exceed 1.
Uses of Regression Analysis
The regression analysis as a statistical tool has a number of uses, or utilities for which it
is widely used:

• It provides a functional relationship between two or more related variables with the
help of which we can easily estimate or predict the unknown values of one variable
from the known values of another variable.

• It provides a measure of the errors of estimates made through the regression
line. Little scatter of the observed (actual) values around the relevant
regression line indicates good estimates, and vice versa.

• It provides a measure of the coefficient of correlation, obtained by taking
the square root of the product of the two regression coefficients:
r = ±√(b_xy · b_yx).

• It provides a measure of the coefficient of determination, computed by taking
the product of the two regression coefficients: R² = b_xy · b_yx.
• It provides a formidable tool of statistical analysis in the field of business and
commerce where people are interested in predicting the future events viz.:
consumption, production, investment, prices, sales, profits, etc. and success of
businessmen depends very much on the degree of accuracy in their various estimates.

• It provides a valuable tool for measuring and estimating the cause and effect
relationship among the economic variables that constitute the essence of economic
theory and economic life. It is highly used in the estimation of Demand curves, Supply
curves, Production functions, Cost functions, Consumption functions etc.

• This technique is highly used in our day-to-day life and sociological studies as well to
estimate the various factors viz. birth rate, death rate, tax rate, yield rate, etc.

• Last but not least, the regression analysis technique gives us an idea about
the relative variation of a series.
Pitfalls of Correlation & Regression Analysis

• It involves a very lengthy and complicated procedure of calculations and
analysis.
• It cannot be used in the case of qualitative phenomena, viz. honesty, crime,
etc.
• Regression relationships can change over time, as do correlations. This is
called parameter instability.
• Regression analysis is difficult to apply when identifying the independent
variable and the dependent variable is challenging.
• Limited generalizability: Results may not apply beyond the data. The
functional relationship that is established between any two or more
variables on the basis of some limited data may not hold good if more
and more data are taken into consideration.
• Sensitivity to outliers: Correlation and regression can be affected by
extreme values.
• Correlation and regression may not handle non-normal data.
