Day7-Linear Regression New


Linear Regression

Introduction to Regression Analysis


• Regression analysis is a form of predictive modeling.

• It estimates the relationship between two or more variables: how the dependent variable changes when one of the independent variables changes.

• It predicts the value of a dependent variable from the value of at least one independent variable, and explains the impact of changes in an independent variable on the dependent variable.

Dependent variable: the variable we wish to explain, the main factor you are trying to
understand and predict
Independent variable: the variable used to explain the dependent variable, the factors
that might influence the dependent variable.
Linear Regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation
(straight line) to the observed data. One variable is considered to be an explanatory variable
(e.g. your income), and the other is considered to be a dependent variable (e.g. your expenses).

The sloped straight line that best fits the given data is called the regression line. It is also called the best-fit line.
Uses of Regression?
Three major uses for regression analysis are:

1. Determining the strength of predictors: identifying how strongly the independent variable(s) affect a dependent variable. For example, what is the strength of the relationship between dose and effect, sales and marketing spending, or age and income?

2. Forecasting an effect or the impact of changes: how much the dependent variable changes with a change in one or more independent variables. For example, "How much additional sales income do I get for each additional $1000 spent on marketing?"

3. Trend forecasting: for example, "What will the price of gold be in 6 months?"


Where is Linear Regression used?

• Evaluating Trends and Sales Estimates

• Analyzing the impact of Price Changes

• Assessment of risk in financial services and insurance domain


Linear regression equation
Mathematically, a linear regression is defined by this equation:

y = mx + c

Where:

• x is an independent variable.

• y is a dependent variable.

• c is the Y-intercept, which is the expected mean value of y when all x variables are equal to 0. On a regression graph, it's the point where the line crosses the Y axis.

• m is the slope of a regression line, which is the rate of change for y as x changes.

[Figure: a regression line with positive slope m and y-intercept c; axis label "Speed of Vehicle". Example line: Y = 0.4X + 2.4, where c = 2.4.]
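The roles of the slope and the intercept can be checked with a few lines of Python. This is a minimal sketch that uses the example line Y = 0.4X + 2.4 from this slide; the function name `predict` and the chosen x values are illustrative assumptions, not part of the original deck.

```python
def predict(x, m=0.4, c=2.4):
    """Evaluate the line y = m*x + c (defaults: the example line Y = 0.4X + 2.4)."""
    return m * x + c


# At x = 0 the line returns the intercept c.
y_at_zero = predict(0)

# A prediction for an arbitrary x value.
y_at_ten = predict(10)

# Increasing x by one unit changes y by the slope m.
change_per_unit = predict(11) - predict(10)

print(y_at_zero, y_at_ten, round(change_per_unit, 6))
```

The intercept is what the line returns at x = 0, and the slope is the change in y for each one-unit change in x.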
Linear regression in Excel with Analysis ToolPak
Enable the Analysis ToolPak add-in:
1. In Excel, click File > Options.
2. In the Excel Options dialog box, select Add-ins on the left sidebar, make sure Excel Add-ins is selected in the Manage box, and click Go.
3. In the Add-ins dialog box, tick Analysis ToolPak, and click OK.
Source data. Rain Fall (mm) is the independent variable (X range); Umbrellas sold is the dependent variable (Y range).

Month     Rain Fall (mm)   Umbrellas sold
Jan'18    82               15
Feb       92.5             25
Mar       83.2             17
Apr       97.7             28
May       131.9            41
Jun       141.3            47
Jul       165.4            50
Aug       140              46
Sep       126.7            37
Oct       97.8             22
Nov       86.2             20
Dec       99.6             30
Jan'19    87               14
Feb       97.5             27
Mar       88.2             14
Apr       102.7            30
May       123              43
Jun       146.3            49
Jul       160              49
Aug       145              44
Sep       131.7            39
Oct       118              36
Nov       91.2             20
Dec       104.6            32
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.957666798
R Square             0.917125697
Adjusted R Square    0.913358683
Standard Error       3.58141382
Observations         24

ANOVA
             df   SS            MS         F          Significance F
Regression   1    3122.774784   3122.775   243.4623   2.21604E-13
Residual     22   282.1835489   12.82652
Total        23   3404.958333

                 Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept        -19.07410899   3.372182168      -5.65631   1.09E-05   -26.06758677   -12.08063122   -26.06758677   -12.08063122
Rain Fall (mm)   0.45000132     0.02884018       15.60328   2.22E-13   0.390190448    0.509812192    0.390190448    0.509812192

RESIDUAL OUTPUT

Observation   Predicted Umbrellas sold   Residuals
1             17.82599924                -2.825999237
2             22.5510131                 2.448986904
3             18.36600082                -1.366000821
4             24.89101996                3.10898004
5             40.2810651                 0.7189349
6             44.51107751                2.488922493
7             55.35610932                -5.356109317
8             43.92607579                2.073924208
9             37.94105824                -0.941058237
10            24.93602009                -2.936020092
11            19.71600478                0.283995219
12            25.74602247                4.253977532
13            20.07600584                -6.076005837
14            24.8010197                 2.198980304
15            20.61600742                -6.616007421
16            27.14102656                2.858973441
17            36.27605335                6.723946647
18            46.76108411                2.238915893
19            52.92610219                -3.926102189
20            46.17608239                -2.176082391
21            40.19106484                -1.191064836
22            34.02604675                1.973953246
23            21.96601138                -1.966011381
24            27.99602907                4.003970933
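The Intercept and Rain Fall (mm) coefficients in the output above can be reproduced outside Excel. The following is a minimal sketch in plain Python, assuming the standard ordinary least-squares formulas for a single predictor; the function name `fit_line` and the variable names are illustrative and do not come from Excel or from the original slides.

```python
# Monthly source data from the example: rainfall (mm) and umbrellas sold.
rainfall = [82, 92.5, 83.2, 97.7, 131.9, 141.3, 165.4, 140, 126.7, 97.8,
            86.2, 99.6, 87, 97.5, 88.2, 102.7, 123, 146.3, 160, 145,
            131.7, 118, 91.2, 104.6]
umbrellas = [15, 25, 17, 28, 41, 47, 50, 46, 37, 22,
             20, 30, 14, 27, 14, 30, 43, 49, 49, 44,
             39, 36, 20, 32]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(rainfall, umbrellas)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

The printed slope and intercept should agree with the Coefficients column of the Excel output to the displayed precision.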
Linear regression equation
Mathematically, a linear regression is defined by this equation:

y = bx + a

• x is an independent variable.
• y is a dependent variable.

• a is the Y-intercept, which is the expected mean value of y when all x variables are equal to
0. On a regression graph, it's the point where the line crosses the Y axis.

• b is the slope of a regression line, which is the rate of change for y as x changes.

Y (number of umbrellas sold) = rainfall coefficient * X (average monthly rainfall) + intercept


Y = 0.45 * X - 19.07
For example, with the average monthly rainfall equal to 82 mm, umbrella sales would be approximately 17.8:
0.45 * 82 - 19.074 = 17.8

Regression analysis output: residuals

If you compare the estimated and actual number of umbrellas sold for the monthly rainfall of 82 mm, you will see that these numbers are slightly different:

• Estimated: 17.8 (calculated above)
• Actual: 15 (row 2 of the source data)

Why the difference? Because independent variables are never perfect predictors of the dependent variable. The residuals help you understand how far the actual values are from the predicted values:

RESIDUAL OUTPUT

Observation   Predicted Umbrellas sold   Residuals
1             17.82599924                -2.825999237
2             22.5510131                 2.448986904
3             18.36600082                -1.366000821
4             24.89101996                3.10898004
5             40.2810651                 0.7189349
6             44.51107751                2.488922493
7             55.35610932                -5.356109317
8             43.92607579                2.073924208
9             37.94105824                -0.941058237
10            24.93602009                -2.936020092
11            19.71600478                0.283995219
12            25.74602247                4.253977532
13            20.07600584                -6.076005837
14            24.8010197                 2.198980304
15            20.61600742                -6.616007421
16            27.14102656                2.858973441
17            36.27605335                6.723946647
18            46.76108411                2.238915893
19            52.92610219                -3.926102189
20            46.17608239                -2.176082391
21            40.19106484                -1.191064836
22            34.02604675                1.973953246
23            21.96601138                -1.966011381
24            27.99602907                4.003970933
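Each residual is the actual value minus the predicted value. The sketch below recomputes the first three rows of the residual table from the full-precision coefficients in the Excel output; the helper names `predict` and `residual` are illustrative assumptions.

```python
# Coefficients taken from the Excel output for this example.
SLOPE = 0.45000132
INTERCEPT = -19.07410899

# First three observations: (rainfall in mm, umbrellas actually sold).
observations = [(82.0, 15), (92.5, 25), (83.2, 17)]


def predict(rainfall_mm):
    """Predicted umbrellas sold for a given monthly rainfall."""
    return SLOPE * rainfall_mm + INTERCEPT


def residual(rainfall_mm, actual):
    """Residual = actual value minus predicted value."""
    return actual - predict(rainfall_mm)


rows = [(x, y, predict(x), residual(x, y)) for x, y in observations]
for x, y, y_hat, res in rows:
    print(f"{x:6.1f} mm  actual={y:2d}  predicted={y_hat:7.3f}  residual={res:+.3f}")
```

A negative residual means the model overestimated sales for that month; a positive residual means it underestimated them.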
Linear regression graph in Excel
1. Select the two columns with your data, including headers (E4:F28).
2. On the Insert tab, in the Charts group, click the Scatter chart icon, and select the Scatter thumbnail (the first one).
[Steps 3 to 5: screenshots only; no step text is available.]
Regression analysis output: Summary Output
This part tells you how well the calculated linear regression equation fits your source data.

Here's what each piece of information means:

Multiple R. It is the Correlation Coefficient, which measures the strength of a linear relationship between two variables. The correlation coefficient can be any value between -1 and 1, and its absolute value indicates the relationship strength. The larger the absolute value, the stronger the relationship:
• 1 means a perfect positive linear relationship
• -1 means a perfect negative linear relationship
• 0 means no linear relationship at all
R Square. It is the Coefficient of Determination, which is used as an indicator of the goodness of fit. It shows what proportion of the variation in the dependent variable is explained by the regression line. The R2 value is calculated from the total sum of squares (the sum of the squared deviations of the original data from the mean) and the residual sum of squares.
In our example, R2 is 0.92 (rounded to 2 digits), which is fairly good. It means that about 92% of the variation in the dependent variable (y-values) is explained by the independent variable (x-values). As a rule of thumb, an R Square of 95% or more is considered a good fit, although what counts as good depends on the field and the data.
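One common way to compute R Square is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below applies that formula to the SS values reported in the ANOVA part of the output for this example, and also checks that, for a simple regression with one predictor, R Square equals Multiple R squared. The variable names are illustrative.

```python
# Values taken from the Excel output for this example.
SS_RESIDUAL = 282.1835489   # Residual SS from the ANOVA table
SS_TOTAL = 3404.958333      # Total SS from the ANOVA table
MULTIPLE_R = 0.957666798    # Multiple R from Regression Statistics

# Share of the total variation that the model explains.
r_squared = 1 - SS_RESIDUAL / SS_TOTAL

# With a single predictor, R Square is the square of the correlation coefficient.
r_squared_from_r = MULTIPLE_R ** 2

print(f"R Square from SS: {r_squared:.6f}")
print(f"R Square from Multiple R: {r_squared_from_r:.6f}")
```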
Adjusted R Square. It is the R Square adjusted for the number of independent variables in the model. Use this value instead of R Square for multiple regression analysis.
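The adjustment can be sketched with the usual formula 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k is the number of independent variables. The inputs below are the values from this example's output; the function name `adjusted_r_squared` is an illustrative assumption.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjust R Square for the number of independent variables in the model."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Values from the example: R Square, 24 observations, 1 predictor (rainfall).
adj_r2 = adjusted_r_squared(0.917125697, 24, 1)
print(f"Adjusted R Square: {adj_r2:.6f}")
```

For a fixed R Square and number of observations, adding predictors lowers the adjusted value, which is why it is preferred when comparing models with different numbers of independent variables.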
Standard Error. It is another goodness-of-fit measure that shows the precision of your regression analysis: the smaller the number, the more certain you can be about your regression equation. While R2 represents the percentage of the dependent variable's variance that is explained by the model, Standard Error is an absolute measure that shows the typical distance of the data points from the regression line, in the units of the dependent variable.
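The Standard Error in the Regression Statistics block is the square root of the residual mean square, that is, the residual sum of squares divided by its degrees of freedom. A minimal sketch using the values from this example; the variable names are illustrative.

```python
import math

# Values taken from the ANOVA part of the Excel output for this example.
SS_RESIDUAL = 282.1835489
N_OBSERVATIONS = 24
N_PREDICTORS = 1

# Residual degrees of freedom: n - k - 1 (22 in this example).
df_residual = N_OBSERVATIONS - N_PREDICTORS - 1

# Standard error of the regression, in the units of the dependent variable.
standard_error = math.sqrt(SS_RESIDUAL / df_residual)
print(f"Standard Error: {standard_error:.5f} umbrellas")
```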
Observations. It is simply the number of observations in your model.
Regression analysis output: ANOVA
The second part of the output is Analysis of Variance (ANOVA):

Basically, it splits the sum of squares into individual components that give information about the levels of variability within
your regression model:
df is the number of degrees of freedom associated with each source of variance.
SS is the sum of squares. The smaller the Residual SS compared with the Total SS, the better your model fits the data.
MS is the mean square, that is, SS divided by df.
F is the F statistic, or F-test for the null hypothesis. It is used to test the overall significance of the model.
Significance F is the P-value of F.
The ANOVA part is rarely used for a simple linear regression analysis in Excel, but you should definitely have a close look at the last component. The Significance F value gives an idea of how reliable (statistically significant) your results are. If Significance F is less than 0.05 (5%), your model is OK. If it is greater than 0.05, you'd probably better choose another independent variable. In our example, Significance F is 2.21604E-13, which is far below 0.05.
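The columns of the ANOVA table are linked by simple arithmetic: each MS is its SS divided by its df, and F is the regression MS divided by the residual MS. The sketch below recomputes those columns from the SS and df values in this example's output. The variable names are illustrative, and the P-value (Significance F) is not computed here because it requires the F distribution, which the Python standard library does not provide.

```python
# SS and df values taken from the ANOVA table of this example.
SS_REGRESSION, DF_REGRESSION = 3122.774784, 1
SS_RESIDUAL, DF_RESIDUAL = 282.1835489, 22

# Total SS and df are the sums of the two components.
ss_total = SS_REGRESSION + SS_RESIDUAL
df_total = DF_REGRESSION + DF_RESIDUAL

# Mean squares: sum of squares divided by degrees of freedom.
ms_regression = SS_REGRESSION / DF_REGRESSION
ms_residual = SS_RESIDUAL / DF_RESIDUAL

# F statistic: explained variance per df over unexplained variance per df.
f_statistic = ms_regression / ms_residual

print(f"Total SS={ss_total:.4f}, df={df_total}")
print(f"MS regression={ms_regression:.3f}, MS residual={ms_residual:.5f}")
print(f"F={f_statistic:.4f}")
```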
