Linear Regression Assignment
Delivered by
Name : Rizka Amelia Dwi Safira
NRP : 03311940000044
Subject : Applied Statistics and Probabilities
Find your own data that show a relationship between one dependent variable and more than
one independent variable.
1. Perform linear regression analysis between the dependent variable and one
independent variable. Show whether the regression equation is good enough using the
goodness of fit and residual analysis.
2. Perform multiple linear regression analysis between the dependent variable and two or
more independent variables. Show whether the regression equation is good enough using the
goodness of fit and residual analysis.
3. Analyze whether it is true that more independent variables give a better model.
DATA
The data used for the regression analysis are taken from the POWER Data Access
Viewer (DAV). The viewer is served via ArcGIS Online and gives the community access to
meteorological and solar-related parameters for assessing and designing renewable
energy systems. POWER (Prediction of Worldwide Energy Resources) is a NASA project
that provides solar and meteorological data to three user communities:
1) Renewable Energy (RE); 2) Sustainable Buildings (SB); and 3) Agroclimatology (AG).
The whole dataset can be accessed through: https://power.larc.nasa.gov/data-access-viewer/.
Here, the data from the Renewable Energy (RE) community are used. The detailed
description follows:
• Temporal average of data : Daily
• Location of study : 40.4054°N and 3.6981°W (Madrid, Spain)
• Time extent : 01-April-2021 to 30-June-2021
Among the available parameters, the analysis takes the All Sky Surface UV Index as the
dependent variable and relates it to four independent variables:
Table 1 Parameters for Dependent and Independent Variables
Type of Variable      | Parameter(s)                         | Code                | Variable
Dependent Variable    | All Sky Surface UV Index             | ALLSKY_SFC_UV_INDEX | y
Independent Variable  | Temperature at 2 meters (°C)         | T2M                 | x1
Independent Variable  | Earth Skin Temperature (°C)          | TS                  | x2
Independent Variable  | Specific Humidity at 2 meters (g/kg) | QV2M                | x3
Independent Variable  | Wind Speed at 2 meters (m/s)         | WS2M                | x4
Table 2 Data for Linear Regression Analysis
MO DY y (ALLSKY_SFC_UV_INDEX) x1 (T2M) x2 (TS) x3 (QV2M) x4 (WS2M)
4 1 1.2 12.9 12.97 6.23 2.3
4 2 0.99 11.13 10.86 7.57 1.16
4 3 1.04 11.19 11.23 6.96 2.34
4 4 1.38 9.9 10.26 4.82 2.34
4 5 1.34 11.19 11.27 4.64 1.54
4 6 1.44 11.8 11.98 5.13 2.27
4 7 1.59 9.88 10.59 4.46 2.79
4 8 1.05 10.44 10.94 6.23 2.27
4 9 0.66 9.3 9.58 6.47 1.81
4 10 1.03 10.97 11.65 6.9 1.62
4 11 1.03 10.45 11.12 6.65 2.39
4 12 1.48 8.22 8.84 3.91 1.88
4 13 0.69 10.12 10.69 6.29 1.37
4 14 1.06 12.73 13.46 7.51 1.51
4 15 0.95 11.53 12.48 6.04 1.55
4 16 1.3 9.02 10.02 4.27 2.73
4 17 1.32 7.01 7.95 3.91 3.01
4 18 1.41 9.12 9.8 4.39 1.8
4 19 1.28 11.77 10.85 5.55 1.17
4 20 1.16 12.01 12.15 6.96 1.33
4 21 1.17 10.65 11.15 6.77 1.42
4 22 1.23 11.59 12.26 6.9 0.88
4 23 1.02 11.24 11.82 6.65 2.93
4 24 1.27 11.8 12.62 7.02 3.8
4 25 0.45 10.19 10.58 7.32 2.41
4 26 1.27 12.26 12.66 7.2 1.38
4 27 1.29 13.39 14.1 7.75 0.86
4 28 1.12 12.52 12.81 7.87 2.19
4 29 1.27 11.25 11.73 7.02 2.69
4 30 1.33 9.8 10.1 5.92 1.41
5 1 1.33 9.8 10.45 5.55 1.86
5 2 1.42 10.34 11.72 5.62 1.45
5 3 1.17 10.99 11.66 6.59 1.73
5 4 1.8 13.21 13.73 6.29 1.04
5 5 1.99 14.81 14.87 6.35 1.52
5 6 1.97 16.09 15.83 7.26 2.02
5 7 1.92 17.1 17.15 7.87 1.59
5 8 1.88 18.12 17.2 7.69 2.3
5 9 0.95 13.42 13.56 8 4.05
5 10 1.62 10.05 11.08 5.92 2.82
5 11 1.52 10.22 11.55 5.25 2.84
5 12 1.75 11.97 12.7 6.29 3.57
5 13 1.36 12.83 13.72 6.77 3.14
5 14 1.98 14.36 15.39 6.53 1.95
5 15 1.86 17.2 17.48 8.42 2.82
5 16 1.85 17.55 18.31 9.09 3.03
5 17 2.49 15.69 16.94 6.35 1.78
5 18 2.37 17.34 17.94 7.51 2.62
5 19 2.32 16.08 17.51 6.35 2.58
5 20 2.3 18.46 19.73 7.81 1.48
5 21 2.26 19.87 20.32 8.06 2.54
5 22 1.76 13.5 14.41 5.62 2.76
5 23 1.33 13.63 15.19 6.16 1.61
5 24 1.51 14.16 15.55 6.59 2.3
5 25 1.94 16.16 18.19 6.65 1.83
5 26 1.84 17.67 19.38 7.81 1.48
5 27 1.57 18.88 19.75 9.03 1.62
5 28 1.71 18.9 20.03 9.83 1.3
5 29 2.05 21.18 22.83 9.46 1.16
5 30 1.92 22.52 23.54 10.01 1.45
5 31 1.92 22.7 23.73 10.68 1.99
6 1 1.86 19.12 20.16 9.64 2.55
6 2 2.1 19.09 19.96 7.87 1.6
6 3 2.12 20.1 21.26 8.18 2.34
6 4 2.32 19.87 21.75 7.69 2.12
6 5 1.04 16.04 16.22 9.4 1.91
6 6 2.46 20.63 22.05 8.3 0.81
6 7 2.5 22.94 24.48 8.3 1.48
6 8 2.55 23.99 25.02 8.24 1.62
6 9 2.41 23.87 25.31 8.24 2.21
6 10 2.49 24.16 25.51 8.54 1.7
6 11 2.23 23.93 25.86 8.97 2.45
6 12 2.34 23.15 24.8 9.46 2.75
6 13 2.27 24.62 25.9 8.91 2.78
6 14 2.13 25.3 27.51 9.4 1.7
6 15 1.92 24.98 26.41 9.52 3.28
6 16 1.89 23.85 25.08 10.31 2.31
6 17 1.19 17.28 17.73 10.01 2.02
6 18 1.56 18.69 19.2 9.52 2.18
6 19 1.87 18.51 19.8 6.77 2.31
6 20 1.42 16.73 17.85 8.18 4.01
6 21 1.93 16.47 18.02 7.39 2.9
6 22 1.34 17.03 18.94 7.81 1.91
6 23 1.8 17.18 19.32 6.53 2.98
6 24 2.16 19.06 21.56 6.96 1.88
6 25 2.13 23 24.62 8.18 1.63
6 26 2.29 24.57 25.53 8.42 2.66
6 27 2.37 22.05 23.25 6.65 3.28
6 28 2.38 18.81 20.92 5.68 2.59
6 29 2.32 21.3 23.19 6.65 1.32
6 30 2.56 22.81 24.29 6.84 2.3
• x-variance:
$$s_x^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right) = \frac{1}{90}\left(25007.874 - \frac{(1439.300)^2}{91}\right) = 24.92455$$
• Determining regression equation:
$$b_1 = \frac{s_{xy}}{s_x^2} = \frac{2.01997}{24.92455} = 0.08104$$
$$b_0 = y_{avg} - b_1 x_{avg} = 1.675 - (0.08104 \times 15.816) = 0.39323$$
So, the regression equation is: $\hat{y} = 0.39323 + 0.08104x$
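The arithmetic above can be re-checked with a few lines of R. This is a minimal sketch that uses only the summary quantities quoted in the text; the variable names are illustrative, and reading s_xy as the sample covariance of x and y is an assumption based on the notation.

```r
# Summary quantities quoted in the text (n = 91 daily records).
n      <- 91
sum_x  <- 1439.300   # sum of x
sum_x2 <- 25007.874  # sum of x^2
s_xy   <- 2.01997    # s_xy as quoted (assumed: sample covariance of x and y)
y_avg  <- 1.675      # mean of y, rounded as in the text

# Sample variance of x via the computational formula.
s_x2  <- (sum_x2 - sum_x^2 / n) / (n - 1)
x_avg <- sum_x / n

# Least-squares slope and intercept.
b1 <- s_xy / s_x2
b0 <- y_avg - b1 * x_avg

print(round(c(s_x2 = s_x2, b1 = b1, b0 = b0), 5))
```

Because y_avg enters as the rounded value 1.675, the intercept from this sketch can differ from 0.39323 in the last digits.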
BY USING GOODNESS OF FIT
In the goodness-of-fit approach, the regression relationship is evaluated by the ratio
SSR/SST, which is called the coefficient of determination ($r^2$):
$$r^2 = \frac{SSR}{SST} = \frac{14.733}{23.572} = 0.62504$$
Hence, the coefficient of determination in this analysis is 0.62504 or 62.504%. This
indicates that the regression model accounts for 62.504% of the variability of the data.
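The sums of squares behind this ratio can be sketched in R. The example uses only the first 12 rows of Table 2 to stay short, so its r² will not equal 0.62504. Using T2M (x1) as the single predictor is an assumption, and the names sst, ssr and sse are illustrative.

```r
# First 12 rows of Table 2 (1-12 April 2021); a subset for illustration only.
y <- c(1.2, 0.99, 1.04, 1.38, 1.34, 1.44, 1.59, 1.05, 0.66, 1.03, 1.03, 1.48)
x <- c(12.9, 11.13, 11.19, 9.9, 11.19, 11.8, 9.88, 10.44, 9.3, 10.97, 10.45, 8.22)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)      # total sum of squares
ssr <- sum((y_hat - mean(y))^2)  # regression sum of squares
sse <- sum((y - y_hat)^2)        # error (residual) sum of squares

r2 <- ssr / sst                  # coefficient of determination for the subset
r2_reported <- 14.733 / 23.572   # ratio quoted in the text for all 91 rows

print(c(r2_subset = r2, r2_lm = summary(fit)$r.squared, r2_reported = r2_reported))
```

The value computed by hand should agree with `summary(fit)$r.squared`, and SST should equal SSR + SSE for a least-squares fit with an intercept.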
Figure 1 Regression Line Obtained Using Excel (Left) and R Program (Right)
Result:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.76907 -0.22682 0.01444 0.22959 0.82520
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.393233 0.110301 3.565 0.000588 ***
x 0.081043 0.006654 12.180 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Figure 2 Residual Analysis Plot Obtained Using Excel (Left) and R Program (Right)
Excel and R both provide built-in features that determine the values of b0, b1, …, bn
without manual computation. Shown here are the results from both Excel and R: the
y-intercept and the slope for each parameter, together with the standard error, t-stat,
and P-value.
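# Fit the multiple regression; y and x1..x4 are assumed to hold the columns of Table 2.
model <- lm(y ~ x1 + x2 + x3 + x4)
# Extract the intercept and the four slopes from the coefficient vector.
b0 <- coef(model)[1]
print(b0)
b1 <- coef(model)[2]
b2 <- coef(model)[3]
b3 <- coef(model)[4]
b4 <- coef(model)[5]
print(b1)
print(b2)
print(b3)
print(b4)
# Draw the diagnostic plots, including residuals against fitted values.
plot(model)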
The Result:
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Coefficients:
(Intercept) x1 x2 x3 x4
1.18056 0.11684 0.01069 -0.20817 -0.01140
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Residuals:
Min 1Q Median 3Q Max
-0.48304 -0.16450 -0.00303 0.13733 0.63720
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.18056 0.14818 7.967 6.17e-12 ***
x1 0.11684 0.05251 2.225 0.0287 *
x2 0.01069 0.04684 0.228 0.8199
x3 -0.20817 0.02629 -7.919 7.70e-12 ***
x4 -0.01140 0.03545 -0.322 0.7486
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # Get the Intercept and coefficients as vector elements.
> cat("# # # # The Coefficient Values # # # ","\n")
# # # # The Coefficient Values # # #
>
> b0 <- coef(model)[1]
> print(b0)
(Intercept)
1.18056
>
> print(b1)
x1
0.116844
> print(b2)
x2
0.01069345
> print(b3)
x3
-0.2081744
> print(b4)
x4
-0.01139604
The R Square, the intercept, and the coefficients for variables x1 to x4 match the
manual computation.
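One common way to compute the coefficients by hand is to solve the normal equations (X'X)b = X'y. The sketch below assumes that approach, which may differ from the manual computation referred to above, and uses only the first 12 rows of Table 2, so its coefficients differ from the full-data values.

```r
# First 12 rows of Table 2 (1-12 April 2021); a subset for illustration only.
y  <- c(1.2, 0.99, 1.04, 1.38, 1.34, 1.44, 1.59, 1.05, 0.66, 1.03, 1.03, 1.48)
x1 <- c(12.9, 11.13, 11.19, 9.9, 11.19, 11.8, 9.88, 10.44, 9.3, 10.97, 10.45, 8.22)
x2 <- c(12.97, 10.86, 11.23, 10.26, 11.27, 11.98, 10.59, 10.94, 9.58, 11.65, 11.12, 8.84)
x3 <- c(6.23, 7.57, 6.96, 4.82, 4.64, 5.13, 4.46, 6.23, 6.47, 6.9, 6.65, 3.91)
x4 <- c(2.3, 1.16, 2.34, 2.34, 1.54, 2.27, 2.79, 2.27, 1.81, 1.62, 2.39, 1.88)

# Design matrix with a leading column of ones for the intercept.
X <- cbind(1, x1, x2, x3, x4)

# Solve the normal equations (X'X) b = X'y for b.
b_manual <- solve(t(X) %*% X, t(X) %*% y)

# Fit the same model with lm() and put both sets of coefficients side by side.
fit <- lm(y ~ x1 + x2 + x3 + x4)
print(cbind(manual = as.vector(b_manual), lm = unname(coef(fit))))
```

The two columns should agree up to rounding error. `lm()` uses a QR decomposition internally, which is numerically more stable than forming X'X when predictors are strongly correlated, as T2M and TS are here.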
Figure 3 Residual Analysis Plot for Multiple Regression Obtained Using Excel (Left) and R Program (Right)
3. ANALYSIS OF WHETHER MORE INDEPENDENT VARIABLES GIVE A BETTER MODEL
A better model is indicated by a higher coefficient of determination (R²) and a more
constant spread of the residuals (ɛ). The following table summarizes these values for the
simple linear regression and the multiple linear regression.
Table 5 Comparison of the Analysis Results Between Simple and Multiple Linear Regression
Type of Regression                                                            | R² from Goodness of Fit | Residual Analysis Plot
Simple Linear Regression (1 Dependent Variable vs 1 Independent Variable)     | 0.62504                 | (plot)
Multiple Linear Regression (1 Dependent Variable vs 4 Independent Variables)  | 0.79614                 | (plot)
Based on the summary table above, it can be concluded that more independent variables
give a better model here. This is shown mathematically by the higher coefficient of
determination (R²) of the model with 4 independent variables (0.79614 against 0.62504),
and graphically by its more constant residuals (confirmed by the narrower spread of the
residuals in the plot).
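The comparison can be sketched in R by fitting both models and tabulating R² and the residual standard error. The example uses only the first 12 rows of Table 2 and assumes T2M (x1) is the predictor of the simple model, so the numbers will not match the table above. For least-squares fits, R² can never decrease when predictors are added to a nested model, which is why the residual spread is listed as well.

```r
# First 12 rows of Table 2 (1-12 April 2021); a subset for illustration only.
y  <- c(1.2, 0.99, 1.04, 1.38, 1.34, 1.44, 1.59, 1.05, 0.66, 1.03, 1.03, 1.48)
x1 <- c(12.9, 11.13, 11.19, 9.9, 11.19, 11.8, 9.88, 10.44, 9.3, 10.97, 10.45, 8.22)
x2 <- c(12.97, 10.86, 11.23, 10.26, 11.27, 11.98, 10.59, 10.94, 9.58, 11.65, 11.12, 8.84)
x3 <- c(6.23, 7.57, 6.96, 4.82, 4.64, 5.13, 4.46, 6.23, 6.47, 6.9, 6.65, 3.91)
x4 <- c(2.3, 1.16, 2.34, 2.34, 1.54, 2.27, 2.79, 2.27, 1.81, 1.62, 2.39, 1.88)

simple   <- lm(y ~ x1)                 # one independent variable (assumed: T2M)
multiple <- lm(y ~ x1 + x2 + x3 + x4)  # four independent variables

comparison <- data.frame(
  model       = c("simple", "multiple"),
  r_squared   = c(summary(simple)$r.squared, summary(multiple)$r.squared),
  residual_sd = c(sigma(simple), sigma(multiple))  # residual standard error
)
print(comparison)

# Residuals against fitted values, side by side, to compare the spread visually.
op <- par(mfrow = c(1, 2))
plot(fitted(simple), resid(simple), main = "Simple", xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
plot(fitted(multiple), resid(multiple), main = "Multiple", xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
par(op)
```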
However, it is more realistic to compute an adjusted value for R² to avoid overestimating
the impact of adding extra variables, by the formula:
$R_a^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-p-1}\right)$ where n is sample size