
(TRANSLATED) Generalized Linear Model

Uploaded by

miftahul irfan

Generalized Linear Modelling on COVID-19 Cases in Indonesia

Abstract: The ongoing COVID-19 outbreak continues to prompt scientific study, most
of it focused on predicting and modeling COVID-19 data. This study likewise models
COVID-19 data. The most commonly used model is the linear model. However, when the
classical assumption of normality is not fulfilled, a different method is needed. One
such method is the generalized linear model (GLM), which assumes that the data follow
a distribution from the exponential family. Of the three exponential family
distributions considered (Gaussian, Poisson, and Gamma), the GLM with the Gaussian
distribution gives the best result.

Keywords: Generalized Linear Model, Gaussian, Poisson, Gamma, and AIC

1. Introduction

In 2019-2020, an outbreak called COVID-19 (coronavirus disease 2019) emerged and
spread across the world. COVID-19 is an infectious disease caused by severe acute
respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus first appeared in China
in December 2019 and has since spread rapidly to various regions of the world. As
of 7 December 2020, the WHO had recorded 65.8 million confirmed cases of COVID-19,
1.5 million of which ended in death, and the virus had spread to 220 countries or
regions (WHO, 2020). The WHO declared COVID-19 a pandemic on 11 March 2020. In
Indonesia, the number of COVID-19 cases is still increasing day by day. As of 10
December 2020, there were 598,933 positive cases with 491,975 recoveries and
18,336 deaths (Indonesia's Ministry of Health, 2020).

The COVID-19 pandemic has led many scientists and academics to study it, most often
through time series analysis and regression analysis of COVID-19 data. In regression
analysis, however, researchers may find that the response variable Y is not normally
distributed. Continuing the analysis under the assumption of normality then gives
poor results, so a method that can handle a non-normal response is needed. One such
method is the Generalized Linear Model (GLM).

Studies using GLM have been conducted by David (2015), Sancetta (2014), Belotti et
al. (2020) on air pollution data, Bolance & Vernic (2019), Kawano et al. (2018), Guo
et al. (2020), Nordin et al. (2020), Chandler (2020), George (2015), Habahbeh
(2018), Ubaidillah et al. (2017) on data concerning household expenditure per capita,
Ummah (2012), and Calabrese (2011). GLM has also been applied to COVID-19 data, by
To et al. (2021) for Canada and by Rath (2020) for India. Rath (2020) concluded
that GLM is a good method for modeling and estimating COVID-19 data in India.
This study therefore applies GLM to COVID-19 data in Indonesia, with the number of
deaths as the response variable and the number of confirmed cases as the predictor
variable.

2. Generalized Linear Model (GLM)

GLM is a generalization of the linear model. In the classical linear model, Y is
assumed to be normally distributed with mean $E(Y) = \mu$ and variance $\sigma^2$.
In GLM, the response variable Y may follow a distribution other than the normal,
provided that the distribution belongs to the exponential family. In moving from the
linear model to the generalized linear model, the model is described through three
components (McCullagh & Nelder, 1989), as follows.
1. Random component. The observed values of the response Y, which are mutually
independent and follow a particular distribution.
2. Systematic component. A linear combination of the explanatory variables X with a
parameter vector $\beta$, denoted by $\eta = X\beta$.
3. Link between the random and systematic components (link function). A function
g(.) that connects the expected value of the response variable, $\mu_i = E(Y_i)$,
to the explanatory variables through the linear predictor. It is denoted as
$\eta_i = g(\mu_i)$.
Of these three components, the link function determines the form of the model used
in GLM. The simplest link function is g(μ) = μ, which is called the identity link. A
GLM with the identity link is a linear model for a continuous response. Other link
functions connect μ to the linear predictor in a nonlinear manner.

2.1 Exponential family distributions

According to Tirta (2015), the response distribution in a GLM can be any of several
members of the exponential family. A random variable Y has a distribution belonging
to the exponential family if its density can be written in the following form.
$$f_Y(y; \theta) = \exp\left[a(y)\,b(\theta) + c(\theta) + d(y)\right]$$
In some cases, the functions a, b, c, and d may contain another parameter, called a
nuisance (disturbance) parameter. Some of the distributions often used in GLM are
described as follows.
1. Normal / Gaussian distribution
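The probability density function of a random variable Y with a Normal (Gaussian)
distribution is as follows.

$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y-\theta}{\sigma}\right)^2\right], \quad -\infty < y < \infty$$
$$= \exp\left[\frac{y\theta}{\sigma^2} - \frac{\theta^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) - \frac{y^2}{2\sigma^2}\right]$$
With:
$$a(y) = y, \quad b(\theta) = \frac{\theta}{\sigma^2}, \quad c(\theta) = -\frac{\theta^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2), \quad d(y) = -\frac{y^2}{2\sigma^2}.$$
Here, $\sigma^2$ is the nuisance parameter. For this distribution, $E[Y] = \theta$ and
$Var[Y] = \sigma^2$.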

2. Poisson distribution
The probability mass function of a random variable Y with a Poisson distribution is
as follows.
$$f(y) = \frac{\theta^{y} e^{-\theta}}{y!}, \quad y = 0, 1, 2, 3, \ldots$$
$$= \exp\left[y \log\theta - \theta - \log y!\right]$$
The equation above indicates $a(y) = y$, $b(\theta) = \log\theta$, $c(\theta) = -\theta$,
and $d(y) = -\log y!$. For this distribution, $E[Y] = \theta$ and $Var[Y] = \theta$.

3. Gamma distribution
The probability density function of a random variable Y with a Gamma distribution is
as follows.
$$f(y) = \frac{\theta\,(y\theta)^{\phi-1} e^{-y\theta}}{\Gamma(\phi)}, \quad y > 0$$
$$= \exp\left[-y\theta + \phi\log\theta - \log\Gamma(\phi) + (\phi-1)\log y\right]$$

With:

$$a(y) = y, \quad b(\theta) = -\theta, \quad c(\theta) = \phi\log\theta - \log\Gamma(\phi), \quad d(y) = (\phi-1)\log y$$

For this distribution, $E[Y] = \phi/\theta$ and $Var[Y] = \phi/\theta^2$. Here, $\phi$ is the nuisance parameter.
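
The three decompositions above can be checked numerically. The following is a minimal sketch, assuming only the Python standard library. The function names are illustrative and not part of the original study. Each pair of functions evaluates the usual density and the form exp[a(y)b(θ) + c(θ) + d(y)], which should agree up to rounding.

```python
import math

def gaussian_pdf(y, theta, sigma2):
    # Usual Gaussian density with mean theta and variance sigma2.
    return math.exp(-0.5 * (y - theta) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def gaussian_exp_form(y, theta, sigma2):
    a = y
    b = theta / sigma2
    c = -theta ** 2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
    d = -y ** 2 / (2 * sigma2)
    return math.exp(a * b + c + d)

def poisson_pmf(y, theta):
    # Usual Poisson probability mass function with mean theta.
    return theta ** y * math.exp(-theta) / math.factorial(y)

def poisson_exp_form(y, theta):
    a, b, c = y, math.log(theta), -theta
    d = -math.lgamma(y + 1)  # lgamma(y + 1) equals log(y!)
    return math.exp(a * b + c + d)

def gamma_pdf(y, theta, phi):
    # Gamma density in the parametrization used in the text (mean phi / theta).
    return theta * (y * theta) ** (phi - 1) * math.exp(-y * theta) / math.gamma(phi)

def gamma_exp_form(y, theta, phi):
    a, b = y, -theta
    c = phi * math.log(theta) - math.lgamma(phi)
    d = (phi - 1) * math.log(y)
    return math.exp(a * b + c + d)

# Arbitrary illustrative parameter values (assumptions, not from the study).
print(gaussian_pdf(1.2, 0.5, 2.0), gaussian_exp_form(1.2, 0.5, 2.0))
print(poisson_pmf(3, 2.5), poisson_exp_form(3, 2.5))
print(gamma_pdf(1.7, 0.8, 3.0), gamma_exp_form(1.7, 0.8, 3.0))
```

Agreement of each printed pair indicates that the listed a, b, c, and d reproduce the density for the chosen values. It is a spot check, not a proof.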

2.2 Characteristics of members of the exponential family


The distribution of members of the exponential family as discussed above has
respective characteristics, as follows.
1. The characteristics of the Gaussian distribution are as follows.
a. Having continuous scale with a range of −∞ < y < ∞
b. Being symmetrical
c. Having variance that is independent of the mean (constant variance)
2. The characteristics of the Gamma distribution are as follows.
a. Having continuous scale with a range of 0< y< ∞
b. Being asymmetrical
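c. Having variance that is quadratic in the mean ($\sigma^2 = \phi\mu^2$, where $\phi$
here denotes the dispersion parameter, the reciprocal of the parameter $\phi$ in
Section 2.1)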
3. The characteristics of the Poisson distribution are as follows.
a. Having discrete scale with a range of 0 ≤ y <∞ , y =0 , 1, 2 , ⋯
b. Being asymmetrical
c. Having variance that is linearly related to the mean ($\sigma^2 = \phi\mu$, with
dispersion $\phi = 1$ for the Poisson distribution itself)
4. The characteristics of the Bernoulli distribution (Binomial with n = 1) are as
follows.
a. Having discrete scale with a binary range of y = 0, 1
b. Having the symmetry that depends on the value of p
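
The mean-variance relationships listed above can be written as a single function. This is a minimal sketch in which `phi` is assumed to be the dispersion parameter (equal to the variance itself for the Gaussian case and to 1 for the Poisson case). The Bernoulli entry uses the standard result Var[Y] = μ(1 − μ), which the list above does not state explicitly. The function name is illustrative.

```python
def variance_function(family, mu, phi=1.0):
    """Variance of Y as a function of its mean mu for each family."""
    if family == "gaussian":
        return phi                # constant, independent of the mean
    if family == "poisson":
        return phi * mu           # linear in the mean
    if family == "gamma":
        return phi * mu ** 2      # quadratic in the mean
    if family == "bernoulli":
        return mu * (1.0 - mu)    # standard result, largest at mu = 0.5
    raise ValueError(f"unknown family: {family}")

# Illustrative values only (assumptions, not taken from the study).
for family in ("gaussian", "poisson", "gamma"):
    print(family, [variance_function(family, mu, phi=0.5) for mu in (1.0, 2.0, 4.0)])
```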

2.3 Link Function

Because the response does not always follow the Gaussian distribution, its range is
not always the whole set of real numbers. The response may be restricted to positive
continuous values, whole numbers, or binary values. Meanwhile, the linear combination
of predictors, commonly referred to as the linear predictor,
$$\eta_i = \sum_{j=0}^{p} x_{ij}\beta_j,$$
can take any real value. For that reason, we need a function that connects the
response to the linear predictor and at the same time reconciles their ranges. This
function is called the link function. The link function also keeps the model linear
in the predictors while the ranges of the linear predictor and of the mean $\mu_y$
of Y remain in sync. Among the link functions that can be used, there is the
so-called canonical link function, namely the link function for which
$$b(\theta) = \eta = \sum_{j=0}^{p} \beta_j x_j$$
(Jong & Heller, 2008).

1. For the binomial distribution, the functions that can be used are as follows.
a. The logit function
$$\eta = \log\left(\frac{\mu}{1-\mu}\right)$$
b. The probit function
$$\eta = \Phi^{-1}(\mu)$$
where $\Phi$ is the cumulative distribution function of the standard normal
distribution, namely as follows.
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}z^2\right] dz$$
c. The complementary log-log function
$$\eta = \log\left[-\log(1-\mu)\right]$$
2. For the Gaussian distribution, the canonical link function is the identity, $\mu_i = \eta_i$.
3. For the Gamma distribution, the canonical link function is the reciprocal,
$1/\mu_i = \eta_i$, but the log link, $\log(\mu_i) = \eta_i$, is also often used.
4. For the Poisson distribution, the canonical link function is the log, $\log(\mu_i) = \eta_i$.
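
The link functions listed above, together with their inverses, can be sketched as follows. This is an illustrative table built on the Python standard library. The dictionary `LINKS` and the helper `round_trip` are hypothetical names introduced here, and the probit pair relies on `statistics.NormalDist` for Φ and its inverse.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit link

# Each entry maps a link name to (g, g_inverse), where eta = g(mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "reciprocal": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (
        lambda mu: math.log(-math.log(1.0 - mu)),
        lambda eta: 1.0 - math.exp(-math.exp(eta)),
    ),
}

def round_trip(name, mu):
    """Map a mean to the linear-predictor scale and back again."""
    g, g_inv = LINKS[name]
    return g_inv(g(mu))

# mu = 0.3 lies in the valid range of every link above.
for name in LINKS:
    print(name, LINKS[name][0](0.3), round_trip(name, 0.3))
```

The round trip returning the original mean shows that each inverse matches its link. In a fitted GLM the inverse link is what converts a linear predictor into a predicted mean.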

2.4 Akaike Information Criterion (AIC)

In selecting the best model, this study uses the Akaike Information Criterion (AIC).
The best model is the one with the smallest AIC value, which is computed as follows.

$$AIC = -2 \log L + 2P$$

Where:
L = the maximum value of the likelihood function of the fitted model
P = number of independent variables in the fitted model
(Tustianto, Kris, & Soehono, 2012, who state the criterion for the Cox PH regression
model)
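
As a small worked illustration of the formula, the sketch below computes the AIC from a maximized log-likelihood. The helper `gaussian_log_likelihood` is a hypothetical convenience that evaluates the Gaussian log-likelihood at the maximum-likelihood variance estimate. The residuals are made up, and the way P is counted follows the definition in the text, although conventions can differ between software packages.

```python
import math

def aic(log_likelihood, n_params):
    """AIC = -2 log L + 2 P."""
    return -2.0 * log_likelihood + 2.0 * n_params

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given model residuals."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n  # ML estimate of the variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2_hat) + 1.0)

# Made-up residuals, for illustration only.
toy_residuals = [1.0, -1.0, 1.0, -1.0]
toy_log_l = gaussian_log_likelihood(toy_residuals)
print(toy_log_l, aic(toy_log_l, n_params=1))
```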

3. Methods

The steps in carrying out this study were as follows.


1. Collecting the data on covid19.go.id.
2. Selecting response variables and predictor variables (the response variable is the
number of deaths per day due to COVID-19, while the predictor variable is the
number of confirmed COVID-19 cases per day).
3. Carrying out descriptive statistics.
4. Identifying a suitable distribution.
5. Carrying out the linearity test.
6. Performing Generalized Linear Modelling with 3 distributions.
7. Selecting the best model.
8. Estimating the fitted Y values.

4. Results and Discussion

4.1 Descriptive statistics

Table 1. Descriptive statistics


Variables   Min    Quartile 1   Median   Mean    Quartile 3   Max
X           106    689          1853     2214    3737         6267
Y           7.00   36.00        70.00    69.16   98.00        169.00

From the descriptive statistics above, the mean number of confirmed COVID-19 cases
per day is 2214 and the mean number of deaths per day is about 69. The number of
confirmed cases per day ranges from 106 to 6267, while the number of deaths due to
COVID-19 per day ranges from 7 to 169.
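
A summary with the same columns as Table 1 can be produced as in the following minimal sketch. The data are hypothetical and are not the study's data, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which is an assumption about how the quartiles in the table were defined.

```python
from statistics import mean, quantiles

def describe(values):
    """Return min, quartiles, mean, and max in the order used by Table 1."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values),
        "Quartile 1": q1,
        "Median": median,
        "Mean": mean(values),
        "Quartile 3": q3,
        "Max": max(values),
    }

# Hypothetical daily death counts, NOT the study's data.
toy_deaths = [10, 20, 30, 40, 50]
summary = describe(toy_deaths)
for label, value in summary.items():
    print(f"{label}: {value}")
```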

4.2 Distribution identification

Before performing the analysis using the generalized linear model, the distribution of
the response variable Y must be identified. To find an appropriate distribution, a
goodness-of-fit test is carried out for several candidate distributions. The results
are presented in the following table.

Table 2. Distribution testing


Distributions              AD statistic   p-value
Normal                     1.715          < 0.005
Lognormal                  7.163          < 0.005
Exponential                18.758         < 0.003
2-Parameter Exponential    12.596         < 0.010
Weibull                    2.467          < 0.010
3-Parameter Weibull        2.299          < 0.005
Smallest Extreme Value     2.954          < 0.010
Largest Extreme Value      2.923          < 0.010
Gamma                      3.855          < 0.005
Logistic                   2.140          < 0.005
Log-logistic               5.1743         < 0.005

From the table above, all distributions have a p-value of < 0.05. Therefore, H0 (that
the data follow the given distribution) is rejected in every case, and it can be
concluded that the data do not follow any of those distributions.
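
For readers who want to see how a goodness-of-fit statistic of this kind is obtained, the sketch below computes the Anderson-Darling statistic for the normal case only, assuming that the AD column in Table 2 refers to that statistic. It uses the standard formula A² = −n − (1/n) Σ (2i − 1)[ln F(x₍ᵢ₎) + ln(1 − F(x₍ₙ₊₁₋ᵢ₎))] with the mean and standard deviation estimated from the sample. It does not compute p-values, the data are synthetic, and it is not the software used in the study.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """Anderson-Darling statistic for a normal fit with estimated mean and sd."""
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    cdf = [fitted.cdf(v) for v in x]
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n

# Synthetic samples (assumptions): evenly spaced normal quantiles, and their
# exponentials, which form a right-skewed sample.
normal_like = [NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
skewed = [math.exp(v) for v in normal_like]

a2_normal = anderson_darling_normal(normal_like)
a2_skewed = anderson_darling_normal(skewed)
print(a2_normal, a2_skewed)
```

A larger statistic indicates a poorer fit, so the skewed sample should yield the larger value of the two.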

4.3 Linearity testing

Linearity is checked by inspecting the scatter plot of the data. With the help of the
R program, we obtain the following plot.

Figure 1. Linear plot of data

From the figure above, the data show an approximately linear increasing trend.
Therefore, it can be concluded that the data meet the assumption of linearity. The
data are then analyzed with a generalized linear model.

4.4 Generalized Linear Model (GLM)


At this stage, the modeling using GLM is carried out on three distributions of the
exponential family, namely the Gaussian distribution, the Poisson distribution, and
the Gamma distribution.

a. Gaussian
In GLM, the Gaussian distribution uses the identity link function. With the help of
the R program, we obtain the following model.

$$\hat{y} = 23.06763 + 0.02082x$$

With:

Table 3. The output for the Gaussian distribution


      Std. Error   t-value   Pr(>|t|)
β0    2.031        11.36     < 0.00000002
β1    0.0007477    27.84     < 0.00000002

From the Pr(>|t|) column of the table above, it can be seen that the variable x has a
significant effect on y, with a p-value of < 0.05.
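
The t-values in Table 3 are the ratio of each estimate to its standard error. A minimal sketch of that arithmetic, using the coefficients from the fitted equation and the standard errors from the table, is given below. The name `wald_statistic` is illustrative, and small differences from the table are expected because the published coefficients are rounded.

```python
def wald_statistic(estimate, std_error):
    """Test statistic for H0: coefficient = 0."""
    return estimate / std_error

# (estimate, standard error) pairs taken from the Gaussian model and Table 3.
gaussian_fit = {
    "beta0": (23.06763, 2.031),
    "beta1": (0.02082, 0.0007477),
}

t_values = {name: wald_statistic(est, se) for name, (est, se) in gaussian_fit.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```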

b. Poisson
In GLM, the Poisson distribution uses the log link function. With the help of the R
program, we obtain the following model, expressed on the log scale.

$$\log(\hat{y}) = 3.489 + 0.0002892x$$

With:

Table 4. The output for the Poisson distribution


      Std. Error     z-value   Pr(>|z|)
β0    0.01614        216.18    < 0.0000002
β1    0.000004795    60.31     < 0.0000002

From the Pr(>|z|) column of the table above, it can be seen that the variable x has a
significant effect on y, with a p-value of < 0.05.

c. Gamma
In GLM, the Gamma distribution here uses the log link function. With the help of the
R program, we obtain the following model, expressed on the log scale.

$$\log(\hat{y}) = 3.3420523 + 0.0003462x$$

With:

Table 5. The output for the Gamma distribution


      Std. Error   t-value   Pr(>|t|)
β0    0.0426629    78.34     < 0.000001
β1    0.0000157    22.05     < 0.000001

From the Pr(>|t|) column of the table above, it can be seen that the variable x has a
significant effect on y, with a p-value of < 0.05.

Based on the modeling of the three distributions above, it is found that the variable x
has a significant effect on the variable y. However, we need to compare those three
models to obtain the best model.
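
To make the three fitted models concrete, the sketch below evaluates each at a given number of confirmed cases. It assumes, following the link functions stated in the text, that the Gaussian coefficients act on the original scale and that the Poisson and Gamma coefficients act on the log scale, so their predictions are obtained by exponentiating the linear predictor. The function names are illustrative, and x = 2214 is the mean of X from Table 1.

```python
import math

def predict_gaussian(x):
    # Identity link: the linear predictor is the predicted mean.
    return 23.06763 + 0.02082 * x

def predict_poisson(x):
    # Log link: exponentiate the linear predictor.
    return math.exp(3.489 + 0.0002892 * x)

def predict_gamma(x):
    # Log link: exponentiate the linear predictor.
    return math.exp(3.3420523 + 0.0003462 * x)

x_mean = 2214  # mean daily confirmed cases, from Table 1
for name, predict in (
    ("Gaussian", predict_gaussian),
    ("Poisson", predict_poisson),
    ("Gamma", predict_gamma),
):
    print(f"{name}: predicted deaths per day at x = {x_mean}: {predict(x_mean):.2f}")
```

At the mean of X, the Gaussian model returns approximately the mean of Y reported in Table 1, as expected for a least-squares line, while the two log-link models give somewhat lower values.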

4.5 The selection of the best model

From the three distributions that have been analyzed, we then calculate their AIC
value. The best model is the model of the distribution that has the smallest AIC value.
The following is a table of AIC values obtained with the help of the R program.

Table 6. The AIC values


Models   Distribution   AIC
1        Gaussian       2127
2        Poisson        3246
3        Gamma          2239.9

From the table above, the smallest AIC value is in model 1, namely GLM with a
Gaussian distribution with an AIC value of 2127. Therefore, the best model is model
1, namely GLM with a Gaussian distribution.
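
The selection rule applied to Table 6 amounts to taking the minimum AIC. A short sketch using the tabulated values is shown below. The differences from the best model are added here only as a convenient way to see the size of the gaps and are not reported in the original analysis.

```python
# AIC values as reported in Table 6.
aic_values = {"Gaussian": 2127.0, "Poisson": 3246.0, "Gamma": 2239.9}

# The best model is the one with the smallest AIC.
best = min(aic_values, key=aic_values.get)

# Difference of each model's AIC from the best model's AIC.
deltas = {name: value - aic_values[best] for name, value in aic_values.items()}

print("Best model:", best)
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: AIC difference = {delta:.1f}")
```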

4.6 Estimating the fitted Y values

After obtaining the best model, we estimate the value of y for each day. The values
estimated using the GLM with a Gaussian distribution are presented in the appendix.

5. Conclusion

The conclusions from the analysis above are as follows.


1. The number of confirmed cases of COVID-19 in Indonesia has a significant
influence on the number of deaths due to COVID-19 in Indonesia.
2. After fitting GLMs with three distributions and comparing the three models,
the best model is the one using the Gaussian distribution, with an AIC
value of 2127.

References

David, M. 2015. Auto insurance premium calculation using generalized linear
models. Procedia Economics and Finance. 20; 147-156.
Belotti, J. T., Castanho, D. S., Arujo, L. N., da Silva, L. V., Alves, T. A., Tadano, Y.
S., Stevan Jr, S. L., Correa, F. C., & Siqueira, H. V. 2020. Environmental
Research. 191.
Bolance, C & Vernic, R. 2019. Multivariate count data generalized linear models:
Three approaches based on the Sarmanov distribution. Insurance: Mathematics
and Economics. 85; 89-103.
Calabrese, A., Schumacher, J. W., Schneider, D. M., Paninski, L., & Woolley, S. M.
N. 2011. A Generalized Linear Model for Estimating Spectrotemporal
Receptive Fields from Responses to Natural Sounds. PLoS ONE.
Chandler, R. E. 2020. Multisite, multivariate weather generation based on generalized
linear models. Environmental Modelling and Software. 134.
George, J., Letha, J., & Jairaj, P. G. 2015. Daily Rainfall Prediction using
Generalized Linear Bivariate Model- A Case Study. Procedia Technology. 24;
31-38.
Guo, J., Alam, M. S., Wang, J., Li, S., & Yuan, W. 2020. Optimal intensity measures
for probabilistic seismic demand models of a cable-stayed bridge based on
generalized linear regression models. Soil Dynamics and Earthquake
Engineering. 131.
Habahbeh, A., Fadiya, S. O., & Akkaya, M. 2018. Factors influencing SMEs
CloudERP adoption: A test with generalized linear model and artificial neural
network
Jong, P. D. & Heller, G. Z. 2008. Generalized Linear Models For Insurance Data.
Cambridge University Press, New York.
Kawano, S., Fujisawa, H., Takada, T., & Shiroishi, T. 2018. Sparse principal
component regression for generalized linear models. 124; 180-190.
McCullagh, P. & Nelder, J. A. 1989. Generalized Linear Models. 2nd Edition,
Chapman and Hall, London.
Nordin, N. D., Zan, M. S. D., & Abdullah, F. 2020. Generalized linear model for
enhancing the temperature measurement performance in Brillouin optical time
domain analysis fiber sensor. Optical Fiber Technology. 58.
Rath, S., Tripathy, A., & Tripathy, A. R. 2020. Prediction of new active cases of
coronavirus disease (COVID-19) pandemic using multiple linear regression
model. Diabetes & Metabolic Syndrome: Clinical Research & Reviews. 14;
1467-1474
Sancetta, A. 2014. Semiparametric estimation of a class of generalized linear models
without smoothing. Journal of Multivariate Analysis. 134; 141-154.
Tirta, I. M. 2015. On Line Dynamic Statistics Module Theori Model Linear
Tergeneralisir (GLM) dengan Variabel Kualitatif (Dummy), Natural Spline dan
B-Spline. UNEJ.PONSTAT.
To, T., Zhang, K., Bryan, M., Terebessy, E., Fong, I., Parikh, S., Zhu, J., & Su, Y.
2021. UV, ozone, and COVID-19 transmission in Ontario, Canada using
generalised linear models. Environmental Research. 194.
Tustianto, Kris, Loekito, A., & Soehono. 2012. Pemodelan Regresi Cox Proportional
Hazard Faktor-Faktor Lama Proses IMB (Izin Mendirikan Bangunan) Kota
Malang. Universitas Brawijaya Malang, Malang.
Ubaidillah, A., Kurnia, A., & Sadik, K. 2017. Generalized multilevel linear model
dengan pendekatan bayesian untuk pemodelan data pengeluaran perkapita
rumahtangga. Jurnal Aplikasi Statistika & Komputasi Statistik. 9; 1.
Ummah, Z., Suliyanto, & Sediono. 2012. Estimasi Model Linier Tergeneralisasi
Gaussian Berdasarkan Maximum Likelihood Estimator Dengan Menggunakan
Algoritma Fisher Scoring. Jurnal Matematika-FTS. 1; 110-120.
WHO. 2020. COVID-19 Weekly Epidemiological Update. World Health
Organization.
