(TRANSLATED) Generalized Linear Model
Abstract: The ongoing COVID-19 outbreak continues to motivate scientific study, most
of which concerns the prediction and modeling of COVID-19 data. In line with that
work, this study also addresses COVID-19 modeling. The most commonly used model is
the linear model. However, when the classical assumption of normality is not
fulfilled, a different method is needed. One method that can overcome this problem is
the generalized linear model (GLM), which assumes that the data follow a distribution
from the exponential family. Of the three exponential family distributions
considered, the best result is obtained from the GLM with the Gaussian distribution.
1. Introduction
In 2019–2020, an outbreak called COVID-19 (coronavirus disease 2019) emerged and
spread across the world. COVID-19 is an infectious disease caused by severe acute
respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus first appeared in China
in December 2019 and has since spread rapidly to various regions of the world. As of
7 December 2020, the WHO had recorded 65.8 million confirmed cases of COVID-19,
1.5 million of which ended in death, and the virus had spread to 220 countries or
regions (WHO, 2020). The WHO declared COVID-19 a pandemic on 11 March 2020. In
Indonesia, the number of COVID-19 cases is still increasing day by day. As of 10
December 2020, there were 598,933 positive cases, with 491,975 recoveries and
18,336 deaths (Indonesia's Ministry of Health, 2020).
The COVID-19 pandemic, which has hit many countries, has led many scientists and
academics to study it. The most common studies are time series analyses and
regression analyses of COVID-19 data. In regression analysis, however, researchers
may find that the response variable Y is not normally distributed. If the analysis
proceeds under the assumption of normality anyway, the results will be poor.
Therefore, a method that can handle this non-normality is needed. One such method is
the Generalized Linear Model (GLM).
Studies using GLM have been conducted by David (2015), Sancetta (2014), Belotti et
al. (2020) on air pollution data, Bolance & Vernic (2019), Kawano et al. (2018), Guo
et al. (2020), Nordin et al. (2020), Chandler (2020), George (2015), Habahbeh
(2018), Ubaidillah et al. (2017) on data concerning household expenditure per capita,
Ummah (2012), and Calabrese (2011). Furthermore, GLM has also been applied to
COVID-19 data, as done by To et al. (2021) on COVID-19 data in Canada and Rath
(2020) on COVID-19 data in India. Rath (2020) concluded that GLM is a good method
for modeling and estimating COVID-19 data in India. Therefore, the author is
interested in applying GLM to COVID-19 data in Indonesia, in which the response
variable is the number of deaths, while the predictor variable is the number of
confirmed cases.
GLM is a generalization of the linear model. In the classical linear model, Y is
assumed to be normally distributed with mean $E(Y) = \mu$ and variance $\sigma^2$.
In GLM, the response variable Y may follow a distribution other than the normal, but
that distribution must belong to the exponential family. As a transition from the
linear model to the generalized linear model, the model is described through three
components (McCullagh & Nelder, 1989), namely as follows.
1. Random Component. It consists of the observed values of the response Y, which are
mutually independent and follow a particular distribution.
2. Systematic Component. It is a linear combination of the variables X with a
parameter vector $\beta$, denoted by $\eta = X\beta$.
3. Link between random and systematic components (link function). It is a function
that connects the expected value of the response variable Y, $\mu_i = E(Y_i)$, with
the explanatory variables through the linear predictor. It is denoted as
$\eta_i = g(\mu_i)$. This function g(.) is called the link function.
Of these three components, the link function determines the model to be used in GLM.
The simplest link function is g(μ) = μ, which is called the identity link. If the
GLM uses the identity link, then the GLM is a linear model with a continuous
response. Other link functions connect μ to the predictor in a nonlinear manner.
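As a minimal sketch of how the three components fit together, the following Python snippet (standard library only) computes the linear predictor and maps it to the mean through the inverse link. The function names, the coefficient values, and the single observation are hypothetical and chosen only for illustration; they are not taken from this study.

```python
import math

def linear_predictor(x_row, beta):
    """Systematic component: eta = sum_j x_j * beta_j."""
    return sum(xj * bj for xj, bj in zip(x_row, beta))

# Each entry holds (link g: mu -> eta, inverse link: eta -> mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
}

def mean_response(x_row, beta, link="identity"):
    """Return mu = g^{-1}(eta) for the chosen link."""
    _, inverse = LINKS[link]
    return inverse(linear_predictor(x_row, beta))

# Hypothetical coefficients (intercept, slope) and one observation with x = 2.0.
beta = [1.0, 0.5]
x_row = [1.0, 2.0]  # the leading 1.0 carries the intercept

mu_identity = mean_response(x_row, beta, "identity")  # mean equals eta
mu_log = mean_response(x_row, beta, "log")            # mean equals exp(eta)
```

With the identity link the mean is the linear predictor itself, which is why that case reduces to the ordinary linear model; with the log link the same linear predictor is mapped to a strictly positive mean.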
According to Tirta (2015), the response distribution in a GLM can be any of various
members of the exponential family. A random variable Y has a distribution belonging
to the exponential family if its probability function can be written in the
following form.
$$f_Y(y; \theta) = \exp\left[a(y)\,b(\theta) + c(\theta) + d(y)\right]$$
In some cases, the functions a, b, c, and d may contain another parameter, called a
nuisance (disturbance) parameter. Some distributions that are often used in GLM are
described as follows.
1. Normal / Gaussian distribution
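The probability density function of a random variable Y with a Normal (Gaussian)
distribution is as follows.
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y-\theta}{\sigma}\right)^2\right], \quad -\infty < y < \infty$$
$$= \exp\left[\frac{y\theta}{\sigma^2} - \frac{\theta^2}{2\sigma^2} - \frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{y^2}{2\sigma^2}\right]$$
With:
$$a(y) = y, \quad b(\theta) = \frac{\theta}{\sigma^2}, \quad c(\theta) = -\frac{\theta^2}{2\sigma^2} - \frac{1}{2}\log\left(2\pi\sigma^2\right), \quad d(y) = -\frac{y^2}{2\sigma^2}.$$
Here, $\sigma^2$ is the nuisance parameter. Therefore, $E[Y] = \theta$ and
$\mathrm{Var}[Y] = \sigma^2$.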
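The exponential-family form of the Gaussian density can be checked numerically. The sketch below assumes the decomposition a(y) = y, b(θ) = θ/σ², c(θ) = −θ²/(2σ²) − ½ log(2πσ²), d(y) = −y²/(2σ²), and compares it with the usual density formula at a few arbitrary test points. The function names and test values are illustrative only.

```python
import math

def normal_pdf(y, theta, sigma2):
    """Usual form of the N(theta, sigma2) density."""
    return math.exp(-0.5 * (y - theta) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def normal_pdf_expfam(y, theta, sigma2):
    """Same density written as exp[a(y) b(theta) + c(theta) + d(y)]."""
    a = y
    b = theta / sigma2
    c = -theta ** 2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
    d = -y ** 2 / (2 * sigma2)
    return math.exp(a * b + c + d)

# Arbitrary (y, theta, sigma2) triples used only to compare the two forms.
checks = [(0.3, 1.2, 2.0), (-1.5, 0.0, 0.5), (4.0, 3.0, 1.0)]
max_gap = max(abs(normal_pdf(*t) - normal_pdf_expfam(*t)) for t in checks)
```

The two forms should agree up to floating-point rounding, so `max_gap` is expected to be negligibly small.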
2. Poisson distribution
A random variable Y with a Poisson distribution has the probability mass function
shown in the following equation.
$$f(y) = \frac{\theta^{y} e^{-\theta}}{y!}, \quad y = 0, 1, 2, 3, \ldots$$
$$= \exp\left[y \log\theta - \theta - \log y!\right]$$
The equation above indicates $a(y) = y$, $b(\theta) = \log\theta$,
$c(\theta) = -\theta$, and $d(y) = -\log y!$. Therefore, $E[Y] = \theta$ and
$\mathrm{Var}[Y] = \theta$.
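The equality of the Poisson mean and variance can be illustrated numerically. The sketch below evaluates the probability through the exponential-family form exp[y log θ − θ − log y!] and sums over a truncated support. The value of θ and the truncation point are arbitrary choices for illustration.

```python
import math

def poisson_pmf(y, theta):
    """Poisson probability via exp[y*log(theta) - theta - log(y!)],
    using log(y!) = lgamma(y + 1)."""
    return math.exp(y * math.log(theta) - theta - math.lgamma(y + 1))

theta = 3.5              # illustrative parameter value
support = range(0, 200)  # truncated support; the omitted tail mass is negligible here

total = sum(poisson_pmf(y, theta) for y in support)
mean = sum(y * poisson_pmf(y, theta) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, theta) for y in support)
```

Both `mean` and `var` should come out numerically equal to `theta`. This equality is the restriction that makes the Poisson family unsuitable when the observed variance is much larger than the mean.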
3. Gamma distribution
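A random variable Y with a Gamma distribution has the probability density function
shown in the following equation.
$$f(y) = \frac{\theta\,(y\theta)^{\phi-1} e^{-y\theta}}{\Gamma(\phi)}, \quad y > 0$$
$$= \exp\left[-y\theta + \phi\log\theta + (\phi-1)\log y - \log\Gamma(\phi)\right]$$
With:
$$a(y) = y, \quad b(\theta) = -\theta, \quad c(\theta) = \phi\log\theta - \log\Gamma(\phi), \quad d(y) = (\phi-1)\log y.$$
Here, $\phi$ is the nuisance parameter. Therefore, $E[Y] = \phi/\theta$ and
$\mathrm{Var}[Y] = \phi/\theta^2$.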
Because response data do not always follow the Gaussian distribution, the range of
the data is not always the whole real line: the response may be restricted to
positive continuous values, to whole numbers, or to binary values. Meanwhile, the
linear combination of predictors, commonly referred to as the linear predictor,
$\eta_i = \sum_{j=0}^{p} x_{ij}\beta_j$, can take any real value. For that reason, we
need a function that connects the response to the linear predictor and reconciles
their ranges. This function is called the link function. In addition, the link
function preserves the linearity of the predictor while keeping the range of the
linear predictor in sync with that of Y or $\mu_Y$. Among the link functions that can
be used, there is the so-called canonical link function, namely the link function
obtained from the equation
$$b(\theta) = \eta = \sum_{j=0}^{p} \beta_j x_j$$
(Jong & Heller, 2008).
1. For the binomial distribution, the link functions that can be used are as follows.
a. The logit function
$$\eta = \log\left(\frac{\mu}{1-\mu}\right)$$
b. The probit function
$$\eta = \Phi^{-1}(\mu)$$
where $\Phi$ is the cumulative distribution function of the standard normal
distribution, namely as follows.
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}z^2\right] dz$$
c. The complementary log-log function
$$\eta = \log\left[-\log(1-\mu)\right]$$
2. For the Gaussian distribution, the canonical link function is the identity,
$\mu_i = \eta_i$.
3. For the Gamma distribution, the canonical link function is the reciprocal,
$1/\mu_i = \eta_i$, but the log link, $\log(\mu_i) = \eta_i$, is also often used.
4. For the Poisson distribution, the canonical link function is the log,
$\log(\mu_i) = \eta_i$.
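The link functions listed above, together with their inverses, can be sketched in a few lines of Python. This is an illustration only: the dictionary layout and the helper name `round_trip_error` are assumptions made for this example, and the probit pair relies on `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: provides Phi (cdf) and its inverse

# name: (link g: mu -> eta, inverse link: eta -> mu)
LINKS = {
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1 - mu)),
                lambda eta: 1 - math.exp(-math.exp(eta))),
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "reciprocal": (lambda mu: 1 / mu, lambda eta: 1 / eta),
}

def round_trip_error(name, mu):
    """Apply the link and then its inverse; the result should recover mu."""
    g, g_inv = LINKS[name]
    return abs(g_inv(g(mu)) - mu)

mu = 0.3  # a value valid for every link above (inside (0, 1) and positive)
errors = {name: round_trip_error(name, mu) for name in LINKS}
```

The binomial links map a mean in (0, 1) to the whole real line, the log link maps a positive mean to the whole real line, and the reciprocal link maps a positive mean to a positive linear predictor, which is one reason the log link is often preferred for Gamma responses.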
In selecting the best model, this study uses the Akaike Information Criterion (AIC).
The best model is the one with the smallest AIC value, which is defined as follows.
$$\mathrm{AIC} = -2 \log L + 2P$$
Where:
log L = the maximized value of the log-likelihood function of the fitted model
P = number of parameters in the fitted model (Tustianto, Kris, & Soehono, 2012)
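The AIC comparison can be sketched as follows. The data are hypothetical, and the Gaussian log-likelihood is evaluated at the maximum-likelihood variance estimate RSS/n. One stated assumption: the parameter count P here includes the variance as well as the regression coefficients, which is the convention R's `AIC` uses for Gaussian models; other conventions shift every model's AIC by a constant.

```python
import math

def gaussian_loglik(y, fitted):
    """Maximized Gaussian log-likelihood, with variance estimated as RSS / n."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(loglik, n_params):
    """AIC = -2 log L + 2 P."""
    return -2 * loglik + 2 * n_params

# Hypothetical response and fitted values from two candidate models.
y = [1.0, 2.0, 3.0, 4.0]
fitted_linear = [1.1, 1.9, 3.2, 3.8]  # intercept + slope + variance: P = 3
fitted_mean = [2.5, 2.5, 2.5, 2.5]    # intercept + variance: P = 2

aic_linear = aic(gaussian_loglik(y, fitted_linear), 3)
aic_mean = aic(gaussian_loglik(y, fitted_mean), 2)
best = "linear" if aic_linear < aic_mean else "mean-only"
```

The penalty term 2P means an extra parameter is only worthwhile if it raises the log-likelihood by more than one unit.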
3. Methods
From the table of descriptive statistics above, the average number of confirmed
COVID-19 cases is 2214 people per day and the average number of patients who died is
69 people per day. The smallest number of confirmed COVID-19 cases in a day is 106
people and the largest is 6267 people. Meanwhile, the number of patients who died
due to COVID-19 ranged from 7 people in a day to 169 people in a day.
Before performing the analysis using the generalized linear model, the distribution of
the response variable Y must be identified. To see the appropriate distribution, a
goodness-of-fit test is carried out. The results of the goodness-of-fit test of several
distributions are presented in the following table.
From the table above, all distributions have a p-value of < 0.05. Therefore, H0 is
rejected and it can be concluded that the data do not follow any of those
distributions.
The linearity check is conducted by inspecting the scatter plot. With the help of the
R program, we obtain the following graph.
From the figure above, the data show an approximately linear increasing trend.
Therefore, it can be concluded that the data meet the assumption of linearity.
Furthermore, the data are analyzed with a generalized linear model.
a. Gaussian
In GLM, the Gaussian distribution uses the identity link function. With the help of
the R program, we obtain the following model.
With:
From the table above in the column Pr(>|t|), it can be seen that the variable x has a
significant effect on y with a p-value of < 0.05.
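A Gaussian GLM with the identity link and a single predictor has the same coefficient estimates as ordinary least squares, so its fit can be sketched with the closed-form formulas. The data below are synthetic and noise-free, chosen only so that the recovered coefficients are easy to verify; they are not the data of this study, and the function name is hypothetical.

```python
def fit_gaussian_identity(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x,
    which are also the ML estimates of a Gaussian GLM with identity link."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Synthetic data generated from y = 5 + 0.03 x (no noise).
x = [100.0, 500.0, 1000.0, 2000.0, 4000.0]
y = [5.0 + 0.03 * xi for xi in x]

intercept, slope = fit_gaussian_identity(x, y)
```

Because the link is the identity, the slope is read directly on the scale of the response: each additional unit of x changes the expected y by `slope` units.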
b. Poisson
In GLM, the Poisson distribution uses the log link function. With the help of the R
program, we obtain the following model, where $\hat{\mu}$ is the fitted mean of y.
$$\log(\hat{\mu}) = 3.489 + 0.0002892\,x$$
With:
From the table above in the column Pr(>|t|), it can be seen that the variable x has a
significant effect on y with a p-value of < 0.05.
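For a Poisson GLM with the log link there is no closed-form solution, and the coefficients are found iteratively. The sketch below uses Newton-Raphson on the log-likelihood for one predictor, which for a canonical link coincides with Fisher scoring. It is a simplified illustration on synthetic data, not the routine used to produce the results in this study; the function name, starting values, and stopping rule are assumptions.

```python
import math

def fit_poisson_log(x, y, iters=25, tol=1e-12):
    """Newton-Raphson for log(mu_i) = b0 + b1 * x_i under a Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: derivatives of the log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries (weights are mu_i for the log link).
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system (information) * step = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Synthetic means generated from log(mu) = 1 + 0.5 x, used directly as responses
# so that the exact solution is known.
x = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(1.0 + 0.5 * xi) for xi in x]

b0_hat, b1_hat = fit_poisson_log(x, y)
```

On these synthetic values the iteration should recover an intercept close to 1 and a slope close to 0.5. With real count data the same update applies, but rescaling a predictor with a large range (such as daily case counts) may help numerical stability.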
c. Gamma
In GLM, the Gamma distribution uses the log link function. With the help of the R
program, we obtain the following model, where $\hat{\mu}$ is the fitted mean of y.
$$\log(\hat{\mu}) = 3.3420523 + 0.0003462\,x$$
With:
From the table above in the column Pr(>|t|), it can be seen that the variable x has a
significant effect on y with a p-value of < 0.05.
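Because both the Poisson and the Gamma models use the log link, their coefficients act on the log scale, and a prediction on the original scale requires exponentiating the linear predictor. The sketch below applies this to the coefficients reported above, evaluated at the average of 2214 confirmed cases per day; the helper name `predict_log_link` is hypothetical.

```python
import math

def predict_log_link(intercept, slope, x):
    """Fitted mean under a log link: mu = exp(intercept + slope * x)."""
    return math.exp(intercept + slope * x)

x_mean = 2214  # average number of confirmed cases per day reported in this study

# Coefficients as reported for the two log-link models.
poisson_pred = predict_log_link(3.489, 0.0002892, x_mean)
gamma_pred = predict_log_link(3.3420523, 0.0003462, x_mean)

# Multiplicative effect on the expected deaths of 1000 additional confirmed cases.
poisson_ratio_per_1000 = math.exp(0.0002892 * 1000)
gamma_ratio_per_1000 = math.exp(0.0003462 * 1000)
```

At the average number of confirmed cases, both models predict roughly 60 deaths per day, somewhat below the observed average of 69. Under the log link the slope is interpreted as a multiplicative effect rather than an additive one.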
Based on the modeling of the three distributions above, it is found that the variable x
has a significant effect on the variable y. However, we need to compare those three
models to obtain the best model.
For each of the three fitted models, we then calculate the AIC value. The best model
is the one with the smallest AIC value. The following is a table of AIC values
obtained with the help of the R program.
From the table above, the smallest AIC value belongs to model 1, namely the GLM with
a Gaussian distribution, with an AIC value of 2127. Therefore, the best model is
model 1, namely the GLM with a Gaussian distribution.
5. Conclusion
References