Exercise 4: Simple and Multiple Linear Regression Analysis

This document presents an exercise on simple and multiple linear regression analysis using temperature data from three locations in Sweden (Falun, Gävle and Knon) for the month of November 1977. Simple linear regression is performed to estimate missing temperature values for Falun using data from the other two locations. The correlation between Falun and combinations of the other datasets is calculated, with the highest correlation found for the inverse-distance-weighted combination of the Gävle and Knon data. The regression coefficients are then calculated for the model using this combined dataset to estimate Falun temperatures, giving an R² value of 0.987, indicating the model explains 98.7% of the variance in Falun temperatures.

9/26/2020 Jupyter Notebook Viewer


In [1]:

%%html
<style>
table {float:left}
</style>

Exercise 4: Simple and multiple linear regression analysis
Name: Bikas Chandra Bhattarai

Date: 28 September, 2015

1. Simple regression

Temperature data for a certain month (November 1977) is available from Falun (Dalarna), Gävle (Gästrikland) and
Knon (Värmland) (file: temp_falun.txt). For Falun the data series is not complete.

We want to fill the missing data for Falun using the best correlated data set of the three possible data sets:

1. Only the data from Gävle


2. Only the data from Knon
3. Both Gävle and Knon and the information about distances (Gävle-Falun = 82 km, Knon-Falun = 110 km)

Question 1: Compute the correlation between Falun and (1), (2) and (3), and determine which one shall be used as the independent variable.

In [2]:

# this allows plots to appear directly in the notebook


%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats
from __future__ import division
import matplotlib.pyplot as plt

https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 1/15

In [3]:

temp_data = pd.read_table('temp_falun.dat')  # reading the table


df_temp = pd.DataFrame(temp_data)  # defining the dataframe
df_temp.head(2)  # printing the first 2 rows of the dataframe

Out[3]:

   Day  T_Falun  T_Gavle  T_Knon
0    1      8.2      9.0     6.5
1    2      6.4      7.8     4.8

Calculating the third dataset from the temperature data of T_Gavle and T_Knon by using the inverse distance weighting method.

The equation used for the calculation of the third dataset is given as equation (1) below:

T_Gavle+Knon = [ (1/82)² / ((1/82)² + (1/110)²) ] · T_Gavle + [ (1/110)² / ((1/82)² + (1/110)²) ] · T_Knon ...................(1)
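As a quick sanity check on equation (1), the two inverse-distance-squared weights can be computed directly (a minimal sketch using only the distances given above; the variable names are illustrative):

```python
# Inverse-distance-squared weights for Gävle (82 km) and Knon (110 km from Falun)
w_gavle = (1 / 82) ** 2 / ((1 / 82) ** 2 + (1 / 110) ** 2)
w_knon = (1 / 110) ** 2 / ((1 / 82) ** 2 + (1 / 110) ** 2)

# The weights sum to 1; applying them to day 1 (9.0 °C in Gävle, 6.5 °C in Knon)
t_day1 = w_gavle * 9.0 + w_knon * 6.5
print(round(w_gavle, 3), round(w_knon, 3), round(t_day1, 2))  # 0.643 0.357 8.11
```

This reproduces the 8.11 °C obtained for day 1 when the equation is applied to the dataframe.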

In [4]:

# Calculating the third dataset by using equation 1 and inserting the calculated data into the dataframe
df_temp['T_Galve_Knon'] = (((1/82)**2/((1/82)**2+(1/110)**2))*df_temp['T_Gavle']
                           + ((1/110)**2/((1/82)**2+(1/110)**2))*df_temp['T_Knon']).round(2)
df_temp.head(2)  # printing the first 2 rows of the dataframe

Out[4]:

   Day  T_Falun  T_Gavle  T_Knon  T_Galve_Knon
0    1      8.2      9.0     6.5          8.11
1    2      6.4      7.8     4.8          6.73

In [5]:

# printing the last 10 rows of the dataset


df_temp.tail(10)

Out[5]:

    Day  T_Falun  T_Gavle  T_Knon  T_Galve_Knon
20   21       -1     -2.0    -3.1         -2.39
21   22      NaN      0.1    -1.1         -0.33
22   23      NaN     -6.2    -5.4         -5.91
23   24      NaN      0.5    -1.9         -0.36
24   25      NaN     -1.9    -3.7         -2.54
25   26      NaN     -6.4   -10.7         -7.94
26   27      NaN     -7.6   -14.9        -10.21
27   28      NaN      0.9    -5.8         -1.49
28   29      NaN      0.5   -10.0         -3.25
29   30      NaN     -1.2   -14.2         -5.84

The table above shows that there are no temperature observations in Falun for days 22 – 30, hence the gaps (NaN) in the table for these days.


For the calculation of the correlations, only the data from the 1st until the 21st day are used. Otherwise the series would not be comparable to the data from Falun.

In [6]:

# removing the last 9 rows and the Day column from the dataset


df_temp_21 = df_temp[:-9].drop('Day', axis=1)  # [:-9] removes the last 9 rows; drop removes the 'Day' column

In [7]:

# calculating the correlation between the temperature datasets


df_corr = df_temp_21.corr()
print(np.round(df_corr, 3))

              T_Falun  T_Gavle  T_Knon  T_Galve_Knon
T_Falun         1.000    0.984   0.970         0.993
T_Gavle         0.984    1.000   0.937         0.991
T_Knon          0.970    0.937   1.000         0.976
T_Galve_Knon    0.993    0.991   0.976         1.000

Conclusion:

As the correlation coefficient r = 0.993 between T_Falun and T_Galve_Knon is the highest, this combination of the two samples is best linearly correlated with Falun. Therefore T_Galve_Knon should be used as the independent variable to calculate the temperature in Falun.

Question 2: Calculate the regression coefficients and determine how much of the variance is explained by the regression model, i.e. the R² value.

The simple linear regression equation can be written in the following form:

y = a + b·x ............(2)

Where,

a is the intercept

b is the coefficient (slope) for x

Together, a and b are called the regression coefficients and can be calculated by using the python function described below:
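For reference, the two coefficients can also be computed directly from the closed-form least-squares formulas, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄ (a sketch with made-up numbers, not the exercise data):

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates of the slope b and intercept a
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Coefficient of determination R² = 1 - SS_res / SS_tot
y_hat = a + b * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(a, 2), round(b, 2), round(r2, 3))  # 0.05 1.99 0.997
```

statsmodels' `ols` reports the same a, b and R² in its summary table, together with the standard errors and t-statistics used later for the significance tests.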


In [8]:

# Importing the statistical model


import statsmodels.formula.api as smf

# create a fitted model with T_Falun as dependent variable and T_Gavle as independent variable
fg = smf.ols(formula='T_Falun ~ T_Gavle', data=df_temp_21).fit()

#print summary statistics


print(fg.summary())

OLS Regression Results


==========================================================================
Dep. Variable: T_Falun R-squared: 0
Model: OLS Adj. R-squared: 0
Method: Least Squares F-statistic: 5
Date: Mon, 28 Sep 2015 Prob (F-statistic): 1.36
Time: 11:53:40 Log-Likelihood: -25
No. Observations: 21 AIC: 5
Df Residuals: 19 BIC: 5
Df Model: 1
Covariance Type: nonrobust
==========================================================================
coef std err t P>|t| [95.0% Conf.
-------------------------------------------------------------------------
Intercept -0.3989 0.222 -1.800 0.088 -0.863 0
T_Gavle 0.9292 0.039 23.759 0.000 0.847
==========================================================================
Omnibus: 2.569 Durbin-Watson:
Prob(Omnibus): 0.277 Jarque-Bera (JB):
Skew: -0.672 Prob(JB): 0
Kurtosis: 3.031 Cond. No.
==========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co

So the equation (2) becomes:

y = −0.3989 + 0.9292 ∗ x

and the coefficient of determination (R² = 0.967)

Linear regression equation for T_Falun as dependent variable and T_Knon as independent variable


In [9]:

# create a fitted model with T_Falun as dependent variable and T_Knon as independent variable
fg = smf.ols(formula='T_Falun ~ T_Knon', data=df_temp_21).fit()
#print summary statistics
print(fg.summary())

OLS Regression Results


==========================================================================
Dep. Variable: T_Falun R-squared: 0
Model: OLS Adj. R-squared: 0
Method: Least Squares F-statistic: 3
Date: Mon, 28 Sep 2015 Prob (F-statistic): 4.04
Time: 11:53:40 Log-Likelihood: -3
No. Observations: 21 AIC: 6
Df Residuals: 19 BIC: 7
Df Model: 1
Covariance Type: nonrobust
==========================================================================
coef std err t P>|t| [95.0% Conf.
-------------------------------------------------------------------------
Intercept 1.2902 0.262 4.930 0.000 0.742
T_Knon 0.8280 0.048 17.376 0.000 0.728 0
==========================================================================
Omnibus: 4.658 Durbin-Watson:
Prob(Omnibus): 0.097 Jarque-Bera (JB): 2
Skew: -0.860 Prob(JB): 0
Kurtosis: 3.500 Cond. No.
==========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co

When T_Falun is the dependent variable and T_Knon the independent variable, the linear regression equation and coefficient of determination (R²) become:

y = 1.2902 + 0.8280 ∗ x

and the coefficient of determination (R² = 0.941)

Linear regression equation for T_Falun as dependent variable and T_Galve_Knon as independent variable


In [10]:

# create a fitted model with T_Falun as dependent variable and T_Galve_Knon as independent variable


fg = smf.ols(formula='T_Falun ~ T_Galve_Knon', data=df_temp_21).fit()
#print summary statistics
print(fg.summary())

OLS Regression Results


==========================================================================
Dep. Variable: T_Falun R-squared: 0
Model: OLS Adj. R-squared: 0
Method: Least Squares F-statistic:
Date: Mon, 28 Sep 2015 Prob (F-statistic): 3.00
Time: 11:53:40 Log-Likelihood: -16
No. Observations: 21 AIC: 3
Df Residuals: 19 BIC: 3
Df Model: 1
Covariance Type: nonrobust
==========================================================================
coef std err t P>|t| [95.0% Conf
-------------------------------------------------------------------------
Intercept 0.1841 0.134 1.369 0.187 -0.097
T_Galve_Knon 0.9176 0.025 37.353 0.000 0.866
==========================================================================
Omnibus: 0.498 Durbin-Watson:
Prob(Omnibus): 0.780 Jarque-Bera (JB): 0
Skew: 0.103 Prob(JB): 0
Kurtosis: 3.059 Cond. No.
==========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co

When T_Falun is the dependent variable and T_Galve_Knon the independent variable, the linear regression equation and coefficient of determination (R²) become:

y = 0.1841 + 0.9176 ∗ x

and the coefficient of determination (R² = 0.987)

In summary:

Table 1

| Dependent and independent variables | Linear Regression Equation | R² |
| --- | --- | --- |
| T_Falun as dependent, T_Gavle as independent | y = −0.3989 + 0.9292·x | 0.967 |
| T_Falun as dependent, T_Knon as independent | y = 1.2902 + 0.8280·x | 0.941 |
| T_Falun as dependent, T_Galve_Knon as independent | y = 0.1841 + 0.9176·x | 0.987 |

Conclusion:

Table 1 shows that the coefficient of determination (R²) is highest for the linear regression model with T_Falun as dependent variable and T_Galve_Knon as independent variable, so this model should be used to predict the missing temperatures for the Falun station.

Question 3: Test the significance of the regression coefficients


From Table 1 above, our selected regression model is y = 0.1841 + 0.9176·x, on the basis of its highest coefficient of determination.

Table 2

| Coefficient | t | P | 95% Conf. Int. |
| --- | --- | --- | --- |
| a | 1.369 | 0.187 | −0.097 to 0.466 |
| b | 37.353 | 0.000 | 0.866 to 0.969 |

Hypothesis test for a

Now formulating the test hypothesis for the intercept, to test whether the coefficient is significantly different from zero:

H0 : a = 0

Ha : a ≠ 0

All the statistics required for this test (from the summary statistics of the best-fit model) are shown in Table 2, where t is the calculated t-value and P is the probability.

Testing approach 1: based on the critical t-value

If |t| > t_critical, then H0 is rejected.

From the t-table, t_critical = t_(1−α/2; n−2) = 2.093, while |t| = 1.369.

Hence |t| is smaller than t_critical, so H0 is not rejected.

Approach 2: based on the P-value

If P < α, then H0 is rejected.

Here the P-value (0.187) is larger than α = 0.05, so H0 is not rejected.

Approach 3: based on the confidence interval

The 95% confidence interval for a is (−0.097 to 0.466), so the value 0 lies within this confidence interval and H0 is not rejected.

Conclusion

From all the tests above it is clear that H0 is not rejected, and we conclude that the value of a is not significantly different from 0 at the 95% confidence level.
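The critical value used above does not have to be read from a table; it can be computed with scipy, which the notebook already imports (a small sketch; `stats.t.ppf` returns the quantile of Student's t-distribution):

```python
from scipy import stats

alpha = 0.05
n = 21                 # days 1-21 used to fit the regression
dof = n - 2            # degrees of freedom for a regression coefficient

# Two-sided critical value t_(1 - alpha/2; n - 2)
t_crit = stats.t.ppf(1 - alpha / 2, dof)
print(round(t_crit, 3))       # 2.093

# Decision for the intercept a (t = 1.369 from the OLS summary)
print(abs(1.369) > t_crit)    # False -> H0 is not rejected
```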

Hypothesis test for b

H0 : b = 0

Ha : b ≠ 0

Testing approach 1: based on the critical t-value

If |t| > t_critical, then H0 is rejected.

From the t-table, t_critical = t_(1−α/2; n−2) = 2.093, and |t| = 37.353.

Hence |t| is greater than t_critical, so H0 is rejected.

Approach 2: based on the P-value

If P < α, then H0 is rejected.

Here the P-value (0.000) is smaller than α = 0.05, so H0 is rejected.

Approach 3: based on the confidence interval

The 95% confidence interval for b is (0.866 to 0.969), so the value 0 does not lie within this confidence interval and H0 is rejected.

Conclusion

From the tests above it is clear that H0 is rejected, and we conclude that the value of b is significantly different from 0 at the 95% confidence level.

Question 4: Plot the time series of the observed and calculated dependent variable, including the extended values, on the same graph

Of all the regression models considered above, the best fit is obtained from the regression of T_Falun on T_Galve_Knon. Hence our model for the estimation becomes:

y = 0.1841 + 0.9176·x. Using this equation, the missing T_Falun data are estimated and plotted as follows:


In [11]:

# Estimating the temperature by using best fitted regression equation


Estimated = (0.1841 +0.9176 * df_temp['T_Galve_Knon']).round(2)

# Filling missing value for T_Falun and assigning the name T_Falun_Fill
df_temp['T_Falun_Fill'] = df_temp['T_Falun'].fillna(Estimated)

In [12]:

# Showing the lower part of the dataframe with the missing data filled


print(df_temp.tail(10))

Day T_Falun T_Gavle T_Knon T_Galve_Knon T_Falun_Fill


20 21 -1 -2.0 -3.1 -2.39 -1.00
21 22 NaN 0.1 -1.1 -0.33 -0.12
22 23 NaN -6.2 -5.4 -5.91 -5.24
23 24 NaN 0.5 -1.9 -0.36 -0.15
24 25 NaN -1.9 -3.7 -2.54 -2.15
25 26 NaN -6.4 -10.7 -7.94 -7.10
26 27 NaN -7.6 -14.9 -10.21 -9.18
27 28 NaN 0.9 -5.8 -1.49 -1.18
28 29 NaN 0.5 -10.0 -3.25 -2.80
29 30 NaN -1.2 -14.2 -5.84 -5.17

In [21]:

# Plotting Observed and an Estimated temperature for T_Falun


plt.plot(df_temp['T_Falun_Fill'],'bo--')
plt.plot(df_temp['T_Falun'], 'ro-')
plt.plot(df_temp['T_Galve_Knon'],'go-')
plt.legend(['Estimated', 'Observed','Temp_G_K'])
plt.xlabel('Day', size = 15)
plt.ylabel('Temperature', size = 15)

Out[21]:

[Figure: daily temperature for November 1977, with the Estimated (T_Falun_Fill), Observed (T_Falun) and Temp_G_K (T_Galve_Knon) series plotted against Day]

2. Multiple linear regression

a) In the file multidata.txt there are a number of numerical variables. Choose Y as the dependent variable and x1, x2, x3 as independent variables. Perform a forward stepwise multiple regression and also a standard multiple regression.


In a forward stepwise multiple regression, start by performing a simple regression using the independent variable that is best correlated with the dependent variable. Then add another independent variable: this second variable should be the one with the highest partial correlation with the dependent variable once the influence of the first independent variable is removed. Continue this procedure to see whether the addition of a third independent variable is helpful. In a standard multiple regression, all the independent variables are used in the regression model. By analysing the result of the regression, you can determine whether some independent variables do not contribute significantly to the regression. If there are any, remove them from the model and redo the regression with only the significant independent variables.

b) Present in each case the R² values and the regression equations.

c) In the forward stepwise method present also your F-test results (use α = 5%)

d) What are your conclusions?
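The forward stepwise procedure described above can be sketched as a loop (an illustration only: the data here are synthetic stand-ins for multidata.txt, and the stopping rule uses the coefficient P-value, which for a single added variable is equivalent to the partial F-test):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for multidata.txt (hypothetical values)
rng = np.random.default_rng(0)
data = pd.DataFrame({'X1': rng.normal(size=40),
                     'X2': rng.normal(size=40),
                     'X3': rng.normal(size=40)})
data['Y'] = 2.0 * data['X1'] + 1.0 * data['X3'] + rng.normal(scale=0.5, size=40)

selected, remaining = [], ['X1', 'X2', 'X3']
while remaining:
    # Fit one candidate model per remaining variable and keep the best addition
    pvals = {}
    for var in remaining:
        formula = 'Y ~ ' + ' + '.join(selected + [var])
        pvals[var] = smf.ols(formula, data=data).fit().pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:   # stop when no remaining variable is significant
        break
    selected.append(best)
    remaining.remove(best)

print(selected)
```

With this synthetic data the loop picks up the two variables that actually drive Y (X1 and X3) and normally stops before adding the pure-noise X2.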

In [14]:

multi = pd.read_table('multidata.txt') #reading in the data


#defining the dataframe
df = pd.DataFrame(multi)
df.head()

Out[14]:

   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570

In [15]:

#calculating correlation between all


multi.corr()

Out[15]:

          X1        X2        X3         Y
X1  1.000000 -0.048404 -0.142982  0.624132
X2 -0.048404  1.000000  0.252610  0.264732
X3 -0.142982  0.252610  1.000000  0.586944
Y   0.624132  0.264732  0.586944  1.000000

From this table it is clear that, of the independent variables, X1 has the highest correlation coefficient with the dependent variable Y, and X3 the second highest. So the first regression equation becomes:


In [16]:

# calculating the regression equation by using the ols function in python, with Y as dependent
# and X1 as independent variable
ols1 = smf.ols(formula='Y ~ X1', data=multi).fit()
print(ols1.summary())

OLS Regression Results


==========================================================================
Dep. Variable: Y R-squared: 0
Model: OLS Adj. R-squared: 0
Method: Least Squares F-statistic: 9
Date: Mon, 28 Sep 2015 Prob (F-statistic): 0.0
Time: 11:53:41 Log-Likelihood: -78
No. Observations: 17 AIC:
Df Residuals: 15 BIC:
Df Model: 1
Covariance Type: nonrobust
==========================================================================
coef std err t P>|t| [95.0% Conf.
-------------------------------------------------------------------------
Intercept 9.9956 14.461 0.691 0.500 -20.828 40
X1 11.9826 3.873 3.094 0.007 3.727 20
==========================================================================
Omnibus: 1.684 Durbin-Watson:
Prob(Omnibus): 0.431 Jarque-Bera (JB):
Skew: 0.570 Prob(JB): 0
Kurtosis: 2.198 Cond. No.
==========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co


So the regression equation becomes:

y = 9.9956 + 11.9826·x1


In [17]:

# calculating the regression equation by using the ols function in python


ols2 = smf.ols(formula='Y ~ X1 + X3', data=multi).fit()
print(ols2.summary())

OLS Regression Results


==========================================================================
Dep. Variable: Y R-squared: 0
Model: OLS Adj. R-squared: 0
Method: Least Squares F-statistic: 4
Date: Mon, 28 Sep 2015 Prob (F-statistic): 1.26
Time: 11:53:41 Log-Likelihood: -66
No. Observations: 17 AIC:
Df Residuals: 14 BIC:
Df Model: 2
Covariance Type: nonrobust
==========================================================================
coef std err t P>|t| [95.0% Conf.
-------------------------------------------------------------------------
Intercept -14.2231 8.102 -1.756 0.101 -31.600 3
X1 13.8775 1.965 7.062 0.000 9.663 18
X3 1.3498 0.200 6.744 0.000 0.921
==========================================================================
Omnibus: 3.571 Durbin-Watson:
Prob(Omnibus): 0.168 Jarque-Bera (JB):
Skew: 0.561 Prob(JB): 0
Kurtosis: 3.901 Cond. No.
==========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co

In this case the equation becomes: y = −14.2231 + 13.8775·x1 + 1.3498·x3


In [18]:

# calculating the regression coefficients with Y as dependent and X1, X2 and X3 as independent variables
ols3 = smf.ols(formula='Y ~ X1 + X2 + X3', data=multi).fit()
print(ols3.summary())

OLS Regression Results


==========================================================================
Dep. Variable: Y R-squared: 0
Model: OLS Adj. R-squared: 0
Method: Least Squares F-statistic: 2
Date: Mon, 28 Sep 2015 Prob (F-statistic): 4.27
Time: 11:53:41 Log-Likelihood: -65
No. Observations: 17 AIC:
Df Residuals: 13 BIC:
Df Model: 3
Covariance Type: nonrobust
==========================================================================
coef std err t P>|t| [95.0% Conf.
-------------------------------------------------------------------------
Intercept -22.8259 10.270 -2.223 0.045 -45.013 -0
X1 13.9098 1.917 7.257 0.000 9.769 18
X2 3.8826 2.961 1.311 0.212 -2.514 10
X3 1.2841 0.202 6.372 0.000 0.849
==========================================================================
Omnibus: 2.908 Durbin-Watson: 2
Prob(Omnibus): 0.234 Jarque-Bera (JB):
Skew: 0.709 Prob(JB): 0
Kurtosis: 3.232 Cond. No.
==========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co

Table 3

| Dependent and independent variables | Regression Equation | R² |
| --- | --- | --- |
| Y as dependent, X1 as independent | y = 9.9956 + 11.9826·x1 | 0.390 |
| Y as dependent, X1, X3 as independent | y = −14.2231 + 13.8775·x1 + 1.3498·x3 | 0.856 |
| Y as dependent, X1, X2, X3 as independent | y = −22.8259 + 13.9098·x1 + 3.8826·x2 + 1.2841·x3 | 0.873 |

Conclusion

Of all the combinations above, the model with Y as dependent variable and X1, X2 and X3 as independent variables has the highest R² value, so it is the best-fitting model among the three.

Perform the F-test (α = 0.05) for adding independent variables.

Test for adding the second independent variable (X3)

F-Test for Testing the Significance

By using the F-test one can find out whether adding a variable is significant or not. The test statistic is

F = ( (1 − R²_(n−1)) · (N − n − 1) ) / ( (1 − R²_n) · (N − n − 2) ) ...................(3)


where N is the number of data points and n is the number of independent variables used. If F > F_(1−α; N−n−1; N−n−2), the addition of Xn is significant. In the case of the three variables X1, X3 and Y one obtains, at a significance level of α = 0.05:

Calculating the value of F by using equation 3, where

R²_(n−1) = 0.390

R²_n = 0.856

N = 17 (n = 2)

The calculation is done in the code cell below:

In [19]:

round(((1-0.390)*14)/((1-0.856)*13),2)

Out[19]:

4.56

Here,

F = 4.56 and F_(0.95; 14; 13) = 2.55

Conclusion: Since 4.56 > 2.55, the inequality F > F_(0.95; 14; 13) holds, which means that adding X3 to the regression is significant.
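The comparison above can be wrapped in a small helper that also computes the critical value with scipy instead of reading it from a table (a sketch; the function name `partial_f` is just illustrative):

```python
from scipy import stats

def partial_f(r2_prev, r2_new, N, n, alpha=0.05):
    """F statistic of equation (3) for adding the n-th independent variable,
    plus the critical value F_(1-alpha; N-n-1; N-n-2)."""
    F = ((1 - r2_prev) * (N - n - 1)) / ((1 - r2_new) * (N - n - 2))
    F_crit = stats.f.ppf(1 - alpha, N - n - 1, N - n - 2)
    return round(F, 2), round(F_crit, 2)

# Adding X3 to the X1-only model: n = 2 independent variables, N = 17 data points
print(partial_f(0.390, 0.856, N=17, n=2))   # F = 4.56 vs F_crit ~ 2.55
```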

Test for adding the third independent variable (X2)

Again the F-test is performed to check whether adding the independent variable X2 to the regression is significant, using the same equation as above (α = 0.05). The addition is significant if

F > F_(0.95; 13; 12)

Calculating the value of F by using equation 3, where

R²_(n−1) = 0.856

R²_n = 0.873

N = 17 (n = 3)

The calculation is done in the code cell below:

In [20]:

round(((1-0.856)*13)/((1-0.873)*12),2)

Out[20]:

1.23

Conclusion: Since F = 1.23 does not exceed the critical value F_(0.95; 13; 12), adding X2 as a third independent variable to the regression is not significant. A regression with X1 and X3 as independent variables is therefore adequate.


