Exercise 4: Simple and Multiple Linear Regression Analysis
In [1]:
%%html
<style>
table {float:left}
</style>
1. Simple regression
Temperature data for a certain month (November 1977) is available from Falun (Dalarna), Gävle (Gästrikland) and
Knon (Värmland) (file: temp_falun.txt). For Falun the data series is not complete.
We want to fill the missing data for Falun using the best correlated of three possible data sets: (1) Gävle, (2) Knon and (3) an inverse-distance-weighted combination of Gävle and Knon.
Question1: Compute the correlation between Falun and (1), (2) and (3) and
determine which one shall be used as the independent variable.
In [2]:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 1/15
9/26/2020 Jupyter Notebook Viewer
In [3]:
Out[3]:
The third data set, T_Galve_Knon, is calculated from the T_Gavle and T_Knon temperature data using the inverse distance weighting method.
The equation used for this calculation is equation 1:

$$T_{Galve\_Knon} = \frac{(1/82)^2}{(1/82)^2 + (1/110)^2}\, T_{Gavle} + \frac{(1/110)^2}{(1/82)^2 + (1/110)^2}\, T_{Knon} \qquad (1)$$
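The weights implied by the inverse-distance calculation can be checked with a short sketch. It assumes, as the notebook code suggests, that 82 belongs to Gävle and 110 to Knon; the function name `idw_weights` and the two sample temperatures are hypothetical and only serve as an illustration.

```python
def idw_weights(distances, power=2):
    """Return normalised inverse-distance weights (they sum to 1)."""
    raw = [(1.0 / d) ** power for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


# Distances as they appear in equation 1 (assumed order: Gavle, Knon).
w_gavle, w_knon = idw_weights([82.0, 110.0])

# Hypothetical temperatures for a single day, for illustration only.
t_gavle, t_knon = 2.0, 4.0
t_combined = w_gavle * t_gavle + w_knon * t_knon

print(round(w_gavle, 4), round(w_knon, 4), round(t_combined, 4))
```

Because the weights are normalised, the combined value always lies between the two station values, and the nearer station (Gävle under the assumption above) receives the larger weight.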
In [4]:
# Calculating the third datasets by using equation 1 and inserting the calculated dat
df_temp['T_Galve_Knon']= (((1/82)**2/((1/82)**2+(1/110)**2))* df_temp['T_Gavle'] + ((
df_temp.head(2)  # show only the first 2 rows of the DataFrame
Out[4]:
In [5]:
Out[5]:
The table above shows that there are no temperature observations at Falun for days 22 to 30, so the table has
gaps for these days.
For the correlation and regression calculations only the data from day 1 to day 21 are used. Otherwise the other
series would not be comparable with the data from Falun.
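Restricting the calculation to the days with a Falun observation can be sketched as follows. The temperature values are hypothetical (they are not the contents of temp_falun.txt), and `pearson_r` is a helper written for this illustration rather than the function used in the notebook.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values; None marks a day without a Falun observation.
t_falun = [1.0, 2.5, 0.5, -1.0, None, None]
t_gavle = [1.5, 2.0, 1.0, -0.5, 2.0, 0.0]

# Keep only the days on which Falun has a value.
pairs = [(f, g) for f, g in zip(t_falun, t_gavle) if f is not None]
falun_obs = [f for f, _ in pairs]
gavle_obs = [g for _, g in pairs]

r = pearson_r(falun_obs, gavle_obs)
print(len(pairs), round(r, 4))
```

With pandas, `DataFrame.corr()` excludes missing values pairwise, so a similar result can be obtained there without filtering by hand.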
In [6]:
In [7]:
Conclusion:
Question2: Calculate the regression coefficients and how much of the variance is
explained by the regression model, i.e. the R² values.
$$y = a + bx \qquad (2)$$
where
a is the intercept and
b is the slope.
Together, a and b are called the regression coefficients; they can be calculated with the Python function
shown below:
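As a cross-check on the library call, the least-squares estimates of a and b in equation 2 and the R² value can also be computed by hand. This is a minimal sketch on hypothetical data; `fit_line` is an illustrative helper and not part of the original notebook.

```python
def fit_line(x, y):
    """Least-squares intercept a, slope b and R^2 for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot    # share of variance explained
    return a, b, r2


# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b, r2 = fit_line(x, y)
print(round(a, 4), round(b, 4), round(r2, 4))
```

For a simple regression with one independent variable, R² equals the square of the correlation coefficient between x and y, which is why the best correlated series also gives the highest R².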
In [8]:
# Create a fitted model with T_Falun as the dependent variable and T_Gavle as the independent variable
fg = smf.ols(formula='T_Falun ~ T_Gavle', data=df_temp_21).fit()
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
y = −0.3989 + 0.9292 ∗ x
Linear regression equation for T_Falun as the dependent variable and T_Knon as the independent variable
In [9]:
# Create a fitted model with T_Falun as the dependent variable and T_Knon as the independent variable
fg = smf.ols(formula='T_Falun ~ T_Knon', data=df_temp_21).fit()
#print summary statistics
print(fg.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
With T_Falun as the dependent variable and T_Knon as the independent variable, the linear regression equation
and coefficient of determination (R²) become:
y = 1.2902 + 0.8280 ∗ x
Linear regression equation for T_Falun as the dependent variable and T_Galve_Knon as the independent variable
In [10]:
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
With T_Falun as the dependent variable and T_Galve_Knon as the independent variable, the linear regression
equation and coefficient of determination (R²) become:
y = 0.1841 + 0.9176 ∗ x
In summary:
Table 1
Conclusion:
Table 1 shows that the coefficient of determination (R²) is highest for the linear regression model with T_Falun as
the dependent variable and T_Galve_Knon as the independent variable, so this model should be used to
predict the missing temperatures at the Falun station.
From Table 1, the selected regression model is y = 0.1841 + 0.9176 ∗ x, because it has the highest
coefficient of determination.
Table 2
Next, hypothesis tests are formulated to check whether each coefficient is significantly different from zero.
For the intercept a, the hypotheses are:
H0 : a = 0
Ha : a ≠ 0
All the statistics required for this test come from the summary of the best-fit model and are
shown in Table 2, where t is the calculated t-value and p is the corresponding probability (p-value).
Testing approach 1:
Approach 2:
Based on P value
Here the p-value (0.187) is not smaller than α = 0.05, so H0 is not rejected.
Approach 3:
Since the 95% confidence interval (-0.097 to 0.466) contains the value 0, H0 is not rejected.
Conclusion
All of the tests above show that H0 is not rejected: the value of a is not significantly
different from 0 at the 95% confidence level.
H0 : b = 0
Ha : b ≠ 0
Testing approach 1:
Approach 2:
Based on P value
Approach 3:
Since the 95% confidence interval (0.866 to 0.969) does not contain the value 0, H0 is rejected.
Conclusion
The test above shows that H0 is rejected: the value of b is significantly different
from 0 at the 95% confidence level.
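The p-value and confidence-interval approaches used above reduce to two simple decision rules, sketched here with the numbers quoted in the text. The function names are hypothetical, and the t-value approach is left out because the t-values themselves are not reproduced in this text.

```python
ALPHA = 0.05  # significance level used in the exercise


def reject_by_p_value(p_value, alpha=ALPHA):
    """Approach 2: reject H0 when the p-value is smaller than alpha."""
    return p_value < alpha


def reject_by_interval(lower, upper):
    """Approach 3: reject H0 when 0 lies outside the confidence interval."""
    return not (lower <= 0.0 <= upper)


# Intercept a: p-value and confidence interval quoted in the text.
reject_a_p = reject_by_p_value(0.187)
reject_a_ci = reject_by_interval(-0.097, 0.466)

# Slope b: confidence interval quoted in the text.
reject_b_ci = reject_by_interval(0.866, 0.969)

print(reject_a_p, reject_a_ci, reject_b_ci)
```

Both rules agree for the intercept (H0 is kept), while the interval for the slope excludes 0 (H0 is rejected), which matches the two conclusions above.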
Question4: Plot the time series of the observed and calculated dependent variable
including the extended values on the same graph
Of all the regressions considered above, the best fit is obtained between
T_Falun and T_Galve_Knon. The model used for the estimation is therefore
y = 0.1841 + 0.9176 ∗ x. Using this equation, the missing T_Falun data are estimated and plotted as
follows:
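The gap-filling step can be sketched without pandas as follows. The coefficients are those of the selected model; the temperature lists are hypothetical, and `fill_missing` is an illustrative helper that mimics what a `fillna` call does in the notebook.

```python
A, B = 0.1841, 0.9176  # intercept and slope of the selected model


def fill_missing(observed, predictor, a=A, b=B):
    """Replace missing observations (None) with a + b * predictor."""
    return [a + b * x if obs is None else obs
            for obs, x in zip(observed, predictor)]


# Hypothetical values; None marks a missing Falun observation.
t_falun = [1.0, -0.5, None, None]
t_galve_knon = [1.2, -0.4, 2.0, 3.0]

t_falun_fill = fill_missing(t_falun, t_galve_knon)
print([round(v, 4) for v in t_falun_fill])
```

Observed values are kept unchanged; only the gaps are replaced by model estimates, so the filled series is a mix of measurements and predictions and should be plotted in a way that keeps the two distinguishable.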
In [11]:
# Filling missing value for T_Falun and assigning the name T_Falun_Fill
df_temp['T_Falun_Fill'] = df_temp['T_Falun'].fillna(Estimated)
In [12]:
In [21]:
Out[21]:
<matplotlib.text.Text at 0x7f6adb77b750>
2. Multiple regression
a) The file multidata.txt contains a number of numerical variables. Choose Y as the dependent variable and x1, x2, x3
as independent variables. Perform a forward stepwise multiple regression and also a standard multiple regression.
In a forward stepwise multiple regression, start by performing a simple regression using the independent variable
that is best correlated with the dependent variable. Then add a second independent variable: the one
with the highest partial correlation with the dependent variable once the influence
of the first independent variable has been removed. Continue this procedure to see whether adding a third independent
variable is helpful. In a standard multiple regression, all the independent variables are used in the regression
model. By analysing the result of the regression, you can find out whether some independent variables do not
contribute significantly to the regression. If there are any, remove them from the model and redo the regression with
only the significant independent variables.
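The first two steps of the forward stepwise procedure can be sketched with plain correlation coefficients. The data below are hypothetical (they are not taken from multidata.txt), and the lower-case names `x1`, `x2`, `x3`, `y` as well as the helpers `pearson_r` and `partial_r` are assumptions made for this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(r_yk, r_yj, r_jk):
    """Correlation of y and x_k with the linear influence of x_j removed."""
    return (r_yk - r_yj * r_jk) / math.sqrt((1 - r_yj ** 2) * (1 - r_jk ** 2))


# Hypothetical data, for illustration only.
data = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [3, 1, 2, 3, 1, 2],
    "x3": [1, -1, 1, -1, 1, -1],
}
y = [5.2, 0.9, 9.1, 4.8, 13.0, 9.3]

# Step 1: choose the variable best correlated with y.
r_y = {name: pearson_r(values, y) for name, values in data.items()}
first = max(r_y, key=lambda name: abs(r_y[name]))

# Step 2: among the rest, choose the highest partial correlation with y
# once the influence of the first variable is removed.
partials = {
    name: partial_r(r_y[name], r_y[first], pearson_r(data[name], data[first]))
    for name in data if name != first
}
second = max(partials, key=lambda name: abs(partials[name]))

print(first, second)
```

Whether the second (or a third) variable is actually worth keeping is then decided with a significance test, not by the size of the partial correlation alone.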
c) In the forward stepwise method present also your F-test results (use α = 5%)
In [14]:
Out[14]:
X1 X2 X3 Y
0 1 2 10 5.077
1 2 2 9 32.330
2 3 3 5 65.140
3 4 4 4 47.270
4 5 2 9 80.570
In [15]:
Out[15]:
X1 X2 X3 Y
This table shows that X1 has the highest correlation coefficient with the dependent variable Y,
followed by X3, which has the second highest. The first
regression equation therefore becomes:
In [16]:
# Fit the regression with the OLS function, with Y as the dependent variable
# and X1 as the independent variable.
ols1 = smf.ols(formula='Y ~ X1', data=multi).fit()
print(ols1.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
/home/bikascb/anaconda/lib/python2.7/site-packages/scipy/stats/stats.py:12
int(n))
y = 9.9956 + 11.9826 ∗ x
In [17]:
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
In [18]:
# Calculate the regression coefficients with Y as the dependent variable and X1, X2 and X3 as independent variables
ols3 = smf.ols(formula='Y ~ X1 + X2 + X3', data=multi).fit()
print(ols3.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
Table 3
Conclusion
Of all the combinations above, the model with Y as the dependent variable and X1, X2 and X3 as
independent variables has the highest R² value, so it fits the data most closely among the models compared.
F-test for testing the significance
The F-test can be used to find out whether adding a variable is significant or not. The test statistic is

$$F = \frac{(1 - R^2_{n-1})\,(N - n - 1)}{(1 - R^2_{n})\,(N - n - 2)} \qquad (3)$$
where N is the number of observations and n is the number of independent variables used. If
$F > F_{1-\alpha;\,N-n-1;\,N-n-2}$, the addition of $X_n$ is significant. For the three variables X1, X3 and Y,
at a significance level of α = 0.05, one obtains:
R²n−1 = 0.390
R²n = 0.856
N = 17
In [19]:
round(((1-0.390)*14)/((1-0.856)*13),2)
Out[19]:
4.56
Here,
F = 4.56
$F_{0.95;14;13}$ = 2.55
Conclusion: Since 4.56 > 2.55, the test condition is satisfied, which means that adding X3 to the regression is
significant.
For X2 as a third independent variable, the condition to check is $F > F_{0.95;13;12}$, with:
R²n−1 = 0.856
R²n = 0.873
N = 17
In [20]:
round(((1-0.856)*13)/((1-0.873)*12),2)
Out[20]:
1.23
Conclusion: Since F = 1.23 does not satisfy this condition, adding X2 as a third independent variable to the
regression is not significant. A regression with X1 and X3 as independent variables is therefore sufficiently
accurate.
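The two F computations above can be wrapped in one small function that follows equation 3. This is a sketch: the function and argument names are hypothetical, while the R² values, N = 17 and the critical value 2.55 are the ones quoted in the text.

```python
def added_variable_f(r2_prev, r2_new, n_obs, n_vars):
    """F statistic of equation 3 for adding the n-th independent variable.

    r2_prev: R^2 of the model with n - 1 independent variables
    r2_new:  R^2 of the model with n independent variables
    n_obs:   number of observations N
    n_vars:  number of independent variables n in the larger model
    """
    numerator = (1.0 - r2_prev) * (n_obs - n_vars - 1)
    denominator = (1.0 - r2_new) * (n_obs - n_vars - 2)
    return numerator / denominator


# Adding X3 to the model with X1 (n = 2).
f_x3 = added_variable_f(0.390, 0.856, n_obs=17, n_vars=2)
# Adding X2 to the model with X1 and X3 (n = 3).
f_x2 = added_variable_f(0.856, 0.873, n_obs=17, n_vars=3)

# Critical value F_{0.95;14;13} as quoted in the text.
x3_is_significant = f_x3 > 2.55

print(round(f_x3, 2), round(f_x2, 2), x3_is_significant)
```

The critical value for the second test, $F_{0.95;13;12}$, is not quoted in the text and would have to be taken from an F table or a statistics library before making the same comparison for X2.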