İstanbul Gedik University: Social Science Institute
Autumn 2020
Task: Assignments – II
Ercan GÖK
18.01.2021
Abstract
Earthquakes are an inescapable part of our lives and can cause sociological trauma in our
societies and countries. To limit their negative effects, we need to take precautions and,
where possible, predict earthquakes before they occur.
Earthquakes are recorded by a seismographic network. Each seismic station in the
network measures the movement of the ground at its site. The slip of one block
of rock over another in an earthquake releases energy that makes the ground vibrate.
That vibration pushes the neighboring piece of ground and causes it to vibrate, and so
the energy travels outward from the earthquake hypocenter as a wave. Magnitude is the most
common measure of an earthquake's size. It is a measure of the size of the earthquake
source and is the same number no matter where you are or what the shaking feels like.
We want to find out whether there is a causal relationship between an earthquake's magnitude
and the number of stations that report it. To analyze this, we use a linear regression model:
first a simple linear regression model, and then a deeper multiple linear regression
analysis with some robustness checks.
Introduction
Linear regression analysis is suitable for many different data sets, and it is applied here to
a topic of current interest. In addition, this study can be extended by adding new variables,
and any of the variables in the study could serve as the dependent variable.
Seismology is a data-rich and data-driven science. Applying statistical learning to gain
new insights from seismic data is a rapidly evolving sub-field of seismology. The availability
of large amounts of seismic data and computational resources, together with the development
of advanced techniques, can foster more robust models and algorithms to process and analyze
seismic signals. Known examples, or labeled data sets, are an essential prerequisite for building
supervised models. Here, however, we are simply trying to find a causal relationship that captures
earthquakes' effects. In statistics, linear regression is a linear approach to modeling the
relationship between a scalar response and one or more explanatory variables (also known as
dependent and independent variables). The case of one explanatory variable is called simple
linear regression; for more than one, the process is called multiple linear regression. This term
is distinct from multivariate linear regression, where multiple correlated dependent variables
are predicted, rather than a single scalar variable.
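For reference, the two model forms described above can be written out explicitly. The notation
below ($y_i$, $x_{ij}$, $\beta_j$, $\varepsilon_i$) is our own labelling and is an assumption of
this sketch, not notation taken from the cited studies.

```latex
% Simple linear regression: one explanatory variable
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n

% Multiple linear regression: k explanatory variables
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i
```

In this paper, $y_i$ is the number of stations reporting earthquake $i$, and the first
explanatory variable is its magnitude.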
Previous Studies
In the geological literature, high and low earthquake magnitudes have been related
to many factors. I want to analyze the relationship between the number of stations that report
an earthquake and its magnitude. From a policy point of view, if the number of reporting stations
increases significantly as magnitude increases, with everything else held constant, we could
construct more stations in order to limit the destructive effects of high-magnitude earthquakes.
Wells and Coppersmith (1994) fit simple linear regression (SLR) models
to explain linear relations between magnitude and the logarithms of depth and width, respectively.
Chu and Zhuang (2016) extend their analyses to multiple linear regression (MLR) models
by considering two or more predictors. They found that regressing moment magnitude
on the logarithms of rupture area and maximum displacement provides the best-fitting
two-predictor model. When maximum displacement is unavailable, they use surface length and depth
as dependent variables. They also verified that magnitude is a significant predictor for
both. They proposed policy recommendations for governments to avoid the detrimental
effects of high-magnitude earthquakes.
Some studies have been carried out on earthquake data sets before. One conducted an exploratory
analysis of a public data set reporting earthquakes and similar events over a 30-day time frame,
creating map visualizations with different R packages. Other studies have used time series
analysis. Since the variable we predict in this study is treated as continuous,
regression analysis can be used.
Theoretical Framework
In a linear regression model, we are interested in whether there is a linear relationship
between a predictor variable (in our case, "mag") and a response variable (in our case,
"stations"). Using our data set, we examine whether there is a linear relationship
between an earthquake's magnitude and the number of stations that reported it. The
intuition here is that as the magnitude of a quake changes, the number of stations that
report it changes in some predictable manner. First, to get a better feel for the quakes
data, we draw a scatter plot. A scatter plot shows us the general shape of the data and can give
us some hints about what the relationship between the magnitude variable and the stations
variable might be.
There are many practical uses of linear regression. If the goal is prediction, forecasting,
or error reduction, linear regression can be used to fit a predictive model to an observed data
set of values of the response and explanatory variables. After developing such a model, if
additional values of the explanatory variables are collected without an accompanying response
value, the fitted model can be used to make a prediction of the response. Our goal in this
paper, however, is to explain variation in the response variable that can be attributed to
variation in the explanatory variables. For this purpose, linear regression analysis can be
applied to quantify the strength of the relationship between the response and the explanatory
variables, and in particular to determine whether some explanatory variables may have no linear
relationship with the response at all, or to identify which subsets of explanatory variables may
contain redundant information about the response. Our explanatory variable is magnitude, and our
response variable is the number of stations that reported each earthquake. First, let us look at
a rough sketch of the variables of interest with a scatter plot.
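# attach() makes the columns of quakes (mag, stations, ...) available by name
attach(quakes)
# jitter() spreads out overlapping points before plotting
plot(jitter(mag, amount = 0.05), stations,
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))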
[Figure: scatter plot; x-axis: Magnitude, y-axis: # of Stations Reporting]
Data
Description: Our data set gives the locations of 1000 seismic events of MB > 4.0. The events
occurred in a cube near Fiji since 1964. Because earthquakes are natural events, the observations
are in effect randomly selected, so there is probably no selection bias in this kind of
data set unless bias was introduced intentionally. The data set is one of the Harvard
PRIM-H project data sets, which were in turn obtained from Dr. John Woodhouse, Dept. of
Geophysics, Harvard University.
long, Longitude
##      lat   long depth mag stations
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12
str(quakes)
summary(quakes)
The descriptive statistics of the data set are shown in the summary table. The table shows
that there are no missing observations. It also reports the minimum, maximum, mean, and median
of each variable. For example, the mean of the variable depth is 311.4, and the maximum value
of the variable stations is 132.
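The figures quoted above can be checked directly. A minimal sketch, assuming the built-in R
data set quakes is the one used throughout; the object names na_per_column, mean_depth, and
max_stations are ours:

```r
# Count missing values in each column of the built-in quakes data set
na_per_column <- colSums(is.na(quakes))
print(na_per_column)

# Mean depth and maximum number of reporting stations, as quoted in the text
mean_depth <- mean(quakes$depth)
max_stations <- max(quakes$stations)
print(round(mean_depth, 1))
print(max_stations)
```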
The boxplots below show the median, quartiles, and outliers of each variable, and the
histograms show their distributions.
library(ggplot2)
ggplot(quakes) + geom_histogram(aes(x = mag), binwidth = .1, fill = 'grey30')
[Figure: histograms of mag, depth, and stations; y-axis: count]
ggplot(quakes) + geom_histogram(aes(x = lat), binwidth = 0.5, fill = 'grey30')
[Figure: histograms of lat and long; y-axis: count]
[Figure: panels of event locations by magnitude value; x-axis: long]
boxplot(quakes$mag)
[Figure: boxplot of mag]
boxplot(quakes$stations)
[Figure: boxplot of stations]
boxplot(quakes$depth)
[Figure: boxplot of depth]
boxplot(quakes$long)
[Figure: boxplot of long]
boxplot(quakes$lat)
[Figure: boxplot of lat]
plot(quakes)
[Figure: scatter plot matrix of lat, long, depth, mag, and stations]
From the scatter plot above, we can see that magnitude and the number of stations move
together. We are trying to develop some initial ideas about the relationship
between the magnitude of an earthquake and the number of stations that report it.
Generally speaking, as magnitude rises, the number of reporting stations increases. We
start by giving our model a specific name that we can refer to later on, quake.linear.regression,
and then regress the number of stations on magnitude:
quake.linear.regression <- lm(stations ~ mag)
quake.linear.regression
##
## Call:
## lm(formula = stations ~ mag)
##
## Coefficients:
## (Intercept) mag
## -180.42 46.28
$$\widehat{\text{Stations}} = -180.42 + 46.28 \times \text{Magnitude}$$
As noted above, our linear regression model provides a numerical relationship,
based on our sample data set, between the magnitude of an earthquake and the number of stations
that reported it. From the slope coefficient, we deduce that a one-unit increase in
magnitude is associated, on average, with 46.28 more reporting stations. Because
the slope is positive, the model estimates a positive association
between magnitude and the number of reporting stations. The intercept coefficient tells us that
if the magnitude of an earthquake were zero, -180.42 stations would report it. This value has no
physical meaning: the data contain no events below magnitude 4, so a magnitude of zero lies far
outside the observed range. Starting from the scatter plot above, we can add the regression line
with abline, using the intercept and slope coefficients as its parameters.
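The regression line can also be drawn without retyping the coefficients, because abline()
accepts a fitted lm object directly. A minimal sketch, assuming the built-in quakes data; fit is
a name we introduce here, and the line colour is an arbitrary choice:

```r
# Refit the simple model, passing the data frame explicitly instead of relying on attach()
fit <- lm(stations ~ mag, data = quakes)

plot(jitter(quakes$mag, amount = 0.05), quakes$stations,
     pch = 20,
     xlab = "Magnitude",
     ylab = "# of Stations Reporting",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))

# abline() reads the intercept and slope from the fitted model
abline(fit, col = "red", lwd = 2)
```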
[Figure: Fiji Earthquakes Magnitude and Reporting, with regression line; x-axis: Magnitude, y-axis: # of Stations Reporting]
We can see that the regression line follows the data fairly well. However, this positive relation
could reflect correlation rather than causation. Before relying on the model, I check the
assumptions of the linear regression model (linearity, homoscedasticity, normally distributed
errors, and independent errors) to make sure I am only using it in appropriate circumstances.
Homoscedasticity means that the variance of our residuals is constant across all earthquake
magnitudes. Put another way, the variance of our residuals is independent of our predictor
variable.
plot(mag, stations,
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))
# Two parallel guide lines, chosen by eye, for comparing the vertical spread
# of the points across magnitudes
abline(-185, 56, col = "green", lwd = 2)
abline(-270, 56, col = "green", lwd = 2)
[Figure: Fiji Earthquakes Magnitude and Reporting, with guide lines; x-axis: Magnitude, y-axis: # of Stations Reporting]
As we compare the spread of the data, the variation appears to be relatively constant across
the plot except for the lowest earthquake magnitudes.
The primary option for checking the variance of the residuals is a residual plot with our model's
fitted values on the X-axis and residual size on the Y-axis. Compared with the scatter plot, this
residual plot lets us see changes in the variance of the residuals across all magnitudes more
clearly. I created two objects, one for the residuals (quake.residuals) and one for the fitted
values (quake.fitted.values), and then constructed the residual plot with a horizontal line at Y
equals zero as a reference point for the variation in residuals.
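The code that builds these two objects is not shown, so the following is one plausible
reconstruction rather than the original script. It assumes the simple model is refitted on the
built-in quakes data (the name fit and the axis labels are ours) and uses the base R extractors
residuals() and fitted():

```r
fit <- lm(stations ~ mag, data = quakes)

# Residuals (observed minus fitted) and fitted values from the simple model
quake.residuals <- residuals(fit)
quake.fitted.values <- fitted(fit)

plot(quake.fitted.values, quake.residuals,
     pch = 20,
     xlab = "Fitted Values",
     ylab = "Residual",
     main = "Residual Plot",
     col = rgb(0.1, 0.2, 0.8, 0.3))

# Horizontal reference line at zero
abline(h = 0, col = "red", lwd = 2)
```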
[Figure: Residual Plot; x-axis: Magnitude, y-axis: Residual]
As we compare variation across magnitudes, the residual variation appears
slightly lower for magnitudes 4.0-4.25 than for magnitudes greater than 4.25. While the
homoscedasticity assumption is not perfectly met, the data are reasonable enough to allow us to
proceed without completely disregarding our model's findings.
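The visual impression can be backed by a quick numerical comparison. A sketch, assuming the same
simple model on the built-in quakes data; the 4.25 cut-off is taken from the text, and the names
fit, res, low_mag, and spread are ours:

```r
fit <- lm(stations ~ mag, data = quakes)
res <- residuals(fit)

# Split events at the 4.25 cut-off mentioned in the text
low_mag <- quakes$mag <= 4.25

# Standard deviation of the residuals within each group
spread <- c("mag <= 4.25" = sd(res[low_mag]),
            "mag > 4.25"  = sd(res[!low_mag]))
print(spread)
```

Similar values in the two groups would support the constant-variance assumption; a large gap
would argue against it.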
The third assumption for linear models is that our residuals follow a normal distribution.
We should check characteristics about the normal distribution:
Bell-Shaped Curve
Majority of data within one and two standard deviations of the mean
sd(quake.residuals)
## [1] 11.49485
Symmetrical in shape
An easy way to see whether our residuals follow a normal distribution is to show them in
a histogram.
hist(quake.residuals, breaks=25,
xlab="Residual Value",
ylab="Frequency",
main="Histogram of Residuals",
col="blue")
[Figure: Histogram of Residuals; x-axis: Residual Value, y-axis: Frequency]
From the histogram above and the value computed earlier, the standard deviation of our residuals
is approximately 11.5, and about 99 percent of our data points fall within two standard
deviations of zero. The fourth assumption of our linear regression model is that each error term
is independent of the other error terms. For our data set, there is little reason to believe that
the residual number of stations reporting one earthquake would depend on the residual for
another earthquake. Based on our simple knowledge of earthquake reporting, we therefore assume
that independence between residuals holds and continue with our linear model.
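The share of residuals within one and two standard deviations can be computed rather than read
off the histogram. A sketch under the same assumptions (simple model refitted on the built-in
quakes data; the names fit, res, s, within_1sd, and within_2sd are ours). For an exactly normal
distribution the benchmarks are roughly 68 and 95 percent:

```r
fit <- lm(stations ~ mag, data = quakes)
res <- residuals(fit)
s <- sd(res)

# Proportion of residuals within one and two standard deviations of zero
within_1sd <- mean(abs(res) <= s)
within_2sd <- mean(abs(res) <= 2 * s)
print(c(within_1sd = within_1sd, within_2sd = within_2sd))
```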
If we want a measure of the usefulness of our model, we can begin by looking at
R-squared. R-squared is known as the simple coefficient of determination. When comparing
two variables, R-squared represents the proportion of total variation in the response variable
that is explained by the linear regression model. A higher R-squared indicates that the
predictor variable explains more of the variation in the response variable.
In our case, R-squared gives the proportion of variation in the number of stations reporting that
can be explained by our quake regression. R-squared is provided by the summary command,
for which the only argument is the model name. Multiple R-squared, located at the bottom of
the summary output, can be interpreted as follows: 72.45 percent of the total variation in the
number of stations reporting a quake can be explained by our linear model. R-squared values
range from 0 to 1, so 72.45 percent is noteworthy.
summary(quake.linear.regression)
##
## Call:
## lm(formula = stations ~ mag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.871 -7.102 -0.474 6.783 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243 4.1899 -43.06 <2e-16 ***
## mag 46.2822 0.9034 51.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared: 0.7245,Adjusted R-squared: 0.7242
## F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16
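The Multiple R-squared value reported above can be reproduced by hand from its definition, one
minus the ratio of the residual sum of squares to the total sum of squares. A sketch assuming
the built-in quakes data; the names fit, ss_res, ss_tot, and r_squared are ours:

```r
fit <- lm(stations ~ mag, data = quakes)

# Residual sum of squares: variation left unexplained by the model
ss_res <- sum(residuals(fit)^2)
# Total sum of squares: variation of stations around its mean
ss_tot <- sum((quakes$stations - mean(quakes$stations))^2)

r_squared <- 1 - ss_res / ss_tot
print(r_squared)

# Should agree with the value reported by summary()
print(summary(fit)$r.squared)
```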
As we have seen, checking the assumptions of the linear regression model has both objective and
subjective components, which ultimately leaves the decision to proceed with the model in
our hands. Now that we understand the basics of our linear model, we can go deeper.
cor(quakes)
cor.test(mag,lat)
##
## Pearson’s product-moment correlation
##
## data: mag and lat
## t = -1.5962, df = 998, p-value = 0.1108
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11210404 0.01156762
## sample estimates:
## cor
## -0.05046165
cor.test(mag,depth)
##
## Pearson’s product-moment correlation
##
## data: mag and depth
## t = -7.488, df = 998, p-value = 1.535e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2885057 -0.1710909
## sample estimates:
## cor
## -0.2306377
cor.test(mag,long)
##
## Pearson’s product-moment correlation
##
## data: mag and long
## t = -5.5512, df = 998, p-value = 3.637e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2325652 -0.1122788
## sample estimates:
## cor
## -0.1730673
quake.multiple.regression <- lm(stations ~ mag + lat + long + depth)
summary(quake.multiple.regression)
##
## Call:
## lm(formula = stations ~ mag + lat + long + depth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.158 -7.019 -0.145 6.329 44.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.644e+02 1.218e+01 -21.706 < 2e-16 ***
## mag 4.904e+01 8.953e-01 54.777 < 2e-16 ***
## lat 3.568e-01 7.444e-02 4.793 1.89e-06 ***
## long 4.179e-01 6.291e-02 6.642 5.07e-11 ***
## depth 1.171e-02 1.659e-03 7.057 3.20e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.91 on 995 degrees of freedom
## Multiple R-squared: 0.7529,Adjusted R-squared: 0.7519
## F-statistic: 757.8 on 4 and 995 DF, p-value: < 2.2e-16
In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one
or more relevant variables. The bias results in the model attributing the effect of the missing
variables to those that were included. After controlling for variables that may be
correlated with our explanatory variable, the coefficient on magnitude increases from 46.28 to
49.04. This happens because magnitude is negatively correlated with the added variables,
significantly so for depth and long (the correlation with lat is not significant at the 5 percent
level), while each added variable has a positive coefficient. As a result, the magnitude
coefficient in the simple model is biased downward.
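The direction of the bias can be checked with the standard omitted-variable decomposition for
ordinary least squares: the coefficient on magnitude in the short model equals its coefficient in
the long model plus, for each omitted variable, that variable's long-model coefficient times the
slope from regressing it on magnitude. A sketch assuming the built-in quakes data; short_fit,
long_fit, omitted, delta, and bias are names we introduce:

```r
short_fit <- lm(stations ~ mag, data = quakes)
long_fit <- lm(stations ~ mag + lat + long + depth, data = quakes)

omitted <- c("lat", "long", "depth")

# Slope from regressing each omitted variable on magnitude
delta <- sapply(omitted, function(v) {
  aux_fit <- lm(reformulate("mag", response = v), data = quakes)
  unname(coef(aux_fit)["mag"])
})

# Bias in the short-model coefficient on mag
bias <- sum(coef(long_fit)[omitted] * delta)
print(bias)

# Short coefficient = long coefficient + bias
print(unname(coef(long_fit)["mag"]) + bias)
print(unname(coef(short_fit)["mag"]))
```

A negative value of bias corresponds to a downward-biased magnitude coefficient in the simple
model.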
Beyond point estimates, we are interested in knowing the relevant range that these parameters
could fall into given a different sample, i.e. the precision of our regression coefficients.
confint(quake.multiple.regression, level=.95)
## 2.5 % 97.5 %
## (Intercept) -288.35289258 -240.53881078
## mag 47.28505992 50.79883045
## lat 0.21073634 0.50290857
## long 0.29441479 0.54130957
## depth 0.00845289 0.01496518
We can interpret the results as follows. For our linear model's intercept, we are 95 percent
confident that the true intercept lies in the range [-288.35, -240.54].
The 2.5 percent and 97.5 percent column headings mark the lower and upper bounds of each
interval, which capture the middle 95 percent of the estimate's sampling distribution. For the
slope on magnitude, we are 95 percent confident that the true slope of the linear relationship
between magnitude and stations, holding the other variables constant, lies in the
range [47.285, 50.799].
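These intervals follow the usual recipe of estimate plus or minus a t critical value times the
standard error. A sketch that rebuilds them by hand, assuming the multiple model is refitted on
the built-in quakes data; fit, est, se, t_crit, and ci_manual are names we introduce:

```r
fit <- lm(stations ~ mag + lat + long + depth, data = quakes)

# Point estimates and their standard errors
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))

# Critical value of the t distribution with the model's residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_manual)
```

The result should match the output of confint() at the 95 percent level.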
Conclusion
Here, the earthquake data set was modeled with both simple and multiple linear regression.
First, the relationship between a single independent variable and the dependent variable was
examined. Then multiple regression was performed and the relationships among the variables
were examined. The two models were compared on criteria such as p-values, t-statistics, and
F-statistics.
Estimating the intensity of earthquakes is an important area of study, especially in view of the
earthquakes of recent years. In this study, the models explained a large share of the
variation in the number of reporting stations (R-squared of 0.72 for the simple model and 0.75
for the multiple model). This study can therefore serve as an example for other researchers.
In addition, it can be extended using non-linear estimation methods.
Similar studies will continue to be conducted in future research. Natural disasters are known
to have devastating immediate impacts, but their long-run effect on economic growth is not well
understood. For any given natural hazard type, population growth and urbanization will further
increase the impact in the future. The number of climate-related extreme events is also likely
to increase with climate change. Understanding the immediate and long-run impacts of natural
disasters will therefore become even more relevant in the future than it is today. As
this study suggests, the harm caused by earthquakes may be reduced by measures such as increasing
the number of seismic stations. This study will also be extended using different statistical
learning methods. As more successful models of earthquake intensity are established, predictions
of future events should improve. In addition, the models in this study can be applied to
other data sets.
References
Chu, Annie, and Jiancang Zhuang. "Multiple linear regression analyses on the relationships
among magnitude, rupture length, rupture width, rupture area, and surface displacement."
Fracture and Earthquake Assessment: 221.
Wells, Donald L., and Kevin J. Coppersmith. "New empirical relationships among magnitude,
rupture length, rupture width, rupture area, and surface displacement." Bulletin of the
Seismological Society of America 84.4 (1994): 974-1002.
Okal, Emile A., José Borrero, and Costas E. Synolakis. "The earthquake and tsunami of 1865
November 17: evidence for far-field tsunami hazard from Tonga." Geophysical Journal
International 157.1 (2004): 164-174.
Lal, Padma Narsey, et al. Relationship between natural disasters and poverty: a Fiji case study.
SOPAC, 2009.
Tibi, R., C. H. Estabrook, and G. Bock. "The 1996 June 17 Flores Sea and 1994 March 9
Fiji-Tonga earthquakes: source processes and deep earthquake mechanisms." Geophysical Journal
International 138.3 (1999): 625-642.
Kiyani, Amna, et al. "Seismo ionospheric anomalies possibly associated with the 2018 Mw 8.2
Fiji earthquake detected with GNSS TEC." Journal of Geodynamics 140 (2020): 101782.
Gibowicz, Slawomir J., and Stanislaw Lasocki. "Analysis of shallow and deep earthquake
doublets in the Fiji-Tonga-Kermadec region." Pure and Applied Geophysics 164.1 (2007): 53-74.