
Statistical Models and Stochastic Methods

Final Project
Lung Cancer Dataset analysis

Authors: Albert Vidal Cáceres and Lluís Giménez Gabarró


Date: 17-12-2021
Course: 2021-2022
Index
1. Context and goals.
1.1. Adding variables
1.2. Explain what result we expect before doing any computations.

2. Description of the dataset.


2.1. Descriptive analysis of the data.
2.1.1. Expected distribution.
2.1.2. Descriptive statistics.
2.1.3. Plot the data.
2.2. Explicit reference of the data.

3. Description of the techniques used to analyse the data.

4. Results of the analysis of the data.


4.1. Import the data and remove all duplicates
4.2. Linear model and Poisson model
4.3. Poisson Model
4.4. Test for overdispersion and zero-inflation
4.5. Quasi-Poisson Models
4.6. Other link functions
4.7. Predictions
4.8. Confidence Intervals

5. Conclusion and discussion

6. Bibliography

7. Appendix

1. Context and goals
This dataset contains counts of incident lung cancer cases and population size in four neighbouring
Danish cities, split by six age groups, during the period 1968-1971:
- Danish cities: a factor with 4 levels (Fredericia, Vejle, Horsens, Kolding)
- Age groups: 40-54, 55-59, 60-64, 65-69, 70-74, 75+
- Population: the number of individuals in each city for each age group.
- Cases: the number of lung cancer cases in each city for each age group.

1.1 Adding a Variable


We must also take into consideration that Fredericia has a petrochemical plant ("Shell") in the
harbour area, so it is useful to add a grouping variable called "Industry" that indicates whether there
was any petrochemical industry in the city during that period.

We think this is a good idea because the presence of such an industry may be a risk factor for lung
cancer development and therefore a useful predictor. A minimal sketch of how this variable could be
added in R is shown below.
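A possible encoding (the variable name "Industry" and the Yes/No coding are our own choice, assuming the eba1977 data from the ISwR package):

# Sketch: add a grouping variable "Industry" that is "Yes" only for Fredericia,
# the city with the petrochemical plant in the harbour area.
library(ISwR)                         # provides the eba1977 dataset
working_data <- eba1977
working_data$Industry <- factor(working_data$city == "Fredericia", labels = c("No", "Yes"))
table(working_data$city, working_data$Industry)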

1.2 Explain what result we expect before doing any computations.


Our main goal is to investigate if the age of an individual and the city they reside in are significant
predictors for lung cancer development, taking into consideration that each city could have a
petrochemical industry. It is also possible that some cities have other data that is unaccounted for, like the
percentage of smokers, air pollution or healthy habits.

In light of the above, we expect that age is a significant predictor for the response variable since aging is
the inevitable time-dependent decline in physiological organ function and therefore a major risk factor for
cancer development [1].
Therefore, if we look at the different age groups, we expect a direct relationship. The higher the age
group, the higher the number of lung cancer cases.

We also believe that having certain types of industry near the city is going to be a significant
predictor for the number of lung cancer cases, because such industries produce contaminating residues
and air pollution, which are especially linked to lung cancer [2].
Again, we expect a direct relationship: if there is a petrochemical industry in the city, the
probability of developing lung cancer increases.

We are dealing with rate data, since the population at risk in each city and age group differs. So, we
expect a larger number of lung cancer cases where the population variable is larger.

2. Description of the Dataset
2.1 Descriptive analysis of the data
In this study we are going to work with a lung cancer dataset, which is composed of 4 predictor variables
(City, Age, Population and Industry) and a response variable (number of lung cancer cases).

2.1.1 Expected distribution


By just looking at the data, we can see that we are going to work with rates, since it makes no sense
to compare the number of lung cancer cases between two populations that do not have the same size.
We can also see that we are working with low count data. Therefore, simply by inspecting the dataset,
our intuition tells us that the response is going to follow a Poisson distribution.

2.1.2 Descriptive statistics


A descriptive statistic is a summary statistic that quantitatively
describes or summarizes features from a collection of information
[3]. We will analyze the four variables; there are 6 observations per
city and 4 observations per age group (a short sketch is given below).
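A brief sketch of these descriptive statistics in R (assuming the eba1977 data from the ISwR package):

# Sketch: quick descriptive overview of the dataset.
library(ISwR)
summary(eba1977)                              # summaries of city, age, pop and cases
table(eba1977$city)                           # 6 observations per city
table(eba1977$age)                            # 4 observations per age group
aggregate(cases ~ age, data = eba1977, mean)  # mean number of cases per age group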

2.1.3 Plot the data


First of all, we are going to plot the number of cases against the population to see if there is
actually a direct relationship (if the population is higher, we should expect a higher number of
cancer cases).

Surprisingly, we can see that as the population increases the number of lung cancer cases decreases,
which is strange. The reason why this happens is that we are not taking into consideration other
variables such as Age or Industry. For example, the point in the bottom right corner is from Kolding
(which does not have a petrochemical industry nearby) and corresponds to a young age group (which has
a lower probability of developing lung cancer). On the other hand, many points in the top left corner
correspond to Fredericia and older age groups.
So, the groups with a higher population correspond to young people.

To corroborate this, we can investigate the relationship between each city and the population of each
age group with another scatterplot.
As we can see, in all cities there is a large population of young individuals (2500-3200 individuals),
while the number of people in the other age groups decreases considerably in each city.

To shed light on this, we can compute the rate of cancer cases (cases divided by population) and make
two boxplots:
- The first one will show the relationship between the age of the individuals and the rate of cancer.
- The second one will show the relationship between the city (and thus the presence of industry) and
the rate of cancer.

In the first boxplot, we can see there is a clear relationship between the age of the individuals and
the rate of cancer in the group. As the groups increase in age, the rate of cancer also increases,
which makes sense given that age is a known factor in cancer development. Surprisingly, the 75+ age
group has the same median rate as the 60-64 group and a higher spread, but we can suppose this is
because these individuals are less likely to survive cancer. So, it is coherent to think that many
people who developed cancer by the age of 75+ have already died (which is why we observe a lower
number of cancer cases).

In the second boxplot, we can see that the median rate of cancer is approximately equal in all cities
except Fredericia, where it is much higher. This could be explained by the petrochemical industry in
the city, which would increase the rate of cancer. There is an outlier in the Fredericia group with a
significantly lower rate of cancer than the rest: it corresponds to the youngest age group, which has
not been exposed to the industry long enough to feel its effects.

2.2. Explicit reference of the data


The dataset used is from the library “ISwR” and it is called “eba1977”.

E.B. Andersen (1977), Multiplicative Poisson models with unequal cell rates, Scandinavian Journal of
Statistics, 4:153–158. J. Clemmensen et al. (1974), Ugeskrift for Læger, pp. 2260–2268.

3. Description of the techniques used to analyse the data
During this project we are going to use models to describe the phenomenon that we are studying, which
is lung cancer. First of all, we are going to use a simple linear model to describe the relationship
between two variables. In this case, we are working with a statistical or stochastic model, since the
variables are not exactly related: there is an associated error.
So, we are fitting a line through a cloud of points to summarise the relationship between the response
and the predictor with a linear equation.
The theoretical model underlying linear regression is:

Yi = β0 + β1·xi + εi

β0 is a constant (the intercept); β1 is a constant that represents the effect of the predictor on the
response: if x goes up by one unit, the response grows by β1 units; xi is the predictor; εi is the
associated error. The model has a random part (εi, and hence Yi) and a systematic part (β0 + β1·xi),
and there are some assumptions that can be found in the class slides.

So, we will check whether the predictor variables are useful to predict the response variable, i.e.
whether they are significant.

In standard linear regression, it is required that the response variable is normally distributed. To
deal with other distributions, a broader class of models was created: the generalized linear model.
Many methods are related to the generalized linear model, one of them being Poisson regression, which
is designed to model count or rate data that follow a Poisson distribution.

The Poisson distribution is part of the exponential family and is given by the probability mass
function:

P(Y = y) = (λ^y · e^(−λ)) / y!,  y = 0, 1, 2, ...

where y is the number of occurrences of the event and λ is the average rate of occurrence in the
desired time interval.
A generalized linear model has three components; the link function is one of them, and it literally
"links" the linear predictor to the parameter of the probability distribution. The log link is the
default for Poisson regression because the Poisson parameter must be positive [4], but we can also use
other link functions such as the identity or square-root links.

In Poisson regression, it is required that the variance equals the expectation. When this is not the
case, we say that the data shows over- or underdispersion (if the variance is higher or lower than the
mean, respectively), and a different model must be applied.
To deal with this, we can use the quasi-Poisson family, where an overdispersion parameter φ is
introduced. It can be estimated as φ = X² / df, i.e. the Pearson chi-square statistic divided by the
residual degrees of freedom (a sketch is given below).
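A minimal sketch of this estimate in R, assuming a fitted Poisson model called poi_model as in the appendix:

# Sketch: estimate the dispersion parameter as Pearson chi-square / residual df.
phi_hat <- sum(residuals(poi_model, type = "pearson")^2) / df.residual(poi_model)
phi_hat   # values clearly above 1 suggest overdispersion, below 1 underdispersion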

It is also possible for the data to be zero-inflated, which happens when the data has a higher probability
of being zero than in the Poisson distribution. In this case, a zero-inflated Poisson model must be applied,
which will take into account that excess of zeroes.

When fitting the different models, we are going to make a number of hypothesis tests to check whether
a null hypothesis can be rejected or not. One of the statistics that we obtain when fitting the
regression models is the p-value, which denotes the probability (so it goes from 0 to 1) of observing
a test statistic as extreme or more extreme than the one we have observed, under the assumption that
the null hypothesis is true. It therefore quantifies the evidence against the null hypothesis: the
smaller the p-value, the more evidence we have against it.
The p-value is compared with the significance level of the test:
- p-value < 𝝰: we reject H0
- p-value > 𝝰: we do not reject H0
The significance level 𝛼, or Type I error rate, denotes the probability of rejecting the null
hypothesis when it is in fact true (a false positive).

There are multiple ways to compare the goodness of fit of different models. Let's begin with the
Akaike Information Criterion, or AIC. It is a goodness-of-fit measure, meaning it quantifies the
quality of a model. The standalone number is meaningless; it only gains meaning when contrasted with
another model's. The AIC is defined as

AIC = 2K − 2·ln(L),

where L is the maximized likelihood of the model and K is the number of parameters. Since we want the
likelihood to be as large as possible, a good AIC must be as small as possible; therefore, the smaller
the AIC, the better the model. The term K penalises complicated models with many parameters.
If the AIC of a model is lower than the AIC of another, we can claim the former fits better (a sketch
is given below).
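For instance, a minimal sketch in R, assuming the full and reduced Poisson models fitted in the appendix:

# Sketch: comparing two fitted models by AIC (the lower value indicates the better trade-off).
AIC(poi_model, poi_reduced)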

The analysis of variance, or ANOVA, is a family of methods used to compare the means of two or more
groups. In our case, we are interested in using it to compare models, which is only possible if they
are nested (the parameters of one model are a subset of the other's). Using a chi-square test, it
checks whether the decrease in residual deviance is significant or not (a sketch is given below).
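A minimal sketch of such a comparison in R, using the full and reduced Poisson models as fitted in the appendix:

# Sketch: chi-square ANOVA comparing the reduced (age-only) model against the full model.
anova(poi_reduced, poi_model, test = "Chisq")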

A scatter plot with deviance residuals on the y axis and fitted values on the x axis can also be used to
visually estimate how well a model fits. If the scatter plot has no discernible pattern (the points are
distributed in a random-looking cloud), we can claim the model fits well because the residuals are well
behaved.

Finally, the CI (confidence interval) gives us a range of plausible values for the population
parameter, depending on the point estimate, the confidence level of the interval (95% by default, i.e.
𝛼 = 5%) and the standard error of the point estimate. So, we obtain an interval constructed by a
procedure that captures the true population parameter 95% of the time.

4. Results of the analysis of the data


4.1. Import the data and remove all duplicates
To analyse the data we must check that it is correctly annotated. So, we can check whether there are
any duplicates, which there are not (see the sketch below). We could also look for outliers and remove
them, but since removing a row of the dataset would imply discarding information from hundreds of
people, we will not do it.
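A minimal sketch of this check in R:

# Sketch: count duplicated rows in the dataset (none are expected here).
sum(duplicated(eba1977))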
4.2. Linear model and Poisson model
We will fit both a linear model and a Poisson model and compare them to see which fits the data better.
Regarding the linear model, we obtain a total of 9 coefficients and none of them is significant.
We can also see that the adjusted R-squared statistic is negative, so with all these predictors we
cannot explain the variance in lung cancer cases, even though we previously saw a relationship in the
boxplots.
On the other hand, we can actually see that all the values of the obtained coefficients make sense.
The reason why is the following:
Firstly, both City and Age predictors are factors and therefore, when looking at the results, we are
comparing each level (different cities and age groups) with the reference level (which is the intercept).
All the city estimates are negative and therefore the rate of cancer is lower than in the intercept, Fredericia
(which makes total sense for the reasons we gave before). Regarding age estimates, they are all positive
and have an increasing tendency, because as the age increases the risk of cancer also increases (except for
the last group that we have mentioned before). Thus, the rate of cancer is higher than in the intercept,
youngest age group (which also makes total sense).

For a Poisson regression, the estimates still make sense, since they are consistent with the estimates on the
linear model. The p-values, however, are different, telling us that many predictors have a significant effect
on the response as we expected. The residual deviance and degrees of freedom are relatively similar,
indicating that there may not be overdispersion.
Full Linear Model Full Poisson Model

To check once again that the Poisson model fits the data better than the linear model, we can plot the
residuals against the fitted values.

Looking at the scatterplot, we cannot see a pattern indicating that the model does not fit well. The
residuals are randomly distributed, some positive and some negative; therefore, the residuals are well
behaved.

Actually, if we pay close attention, we can see small signs of heteroscedasticity (a non-constant
variance): going from left to right, the variance decreases. We thought of log-transforming the
response variable, but when doing this and re-plotting the residuals against the fitted values we
observed an increase in heteroscedasticity. Thus, we are not going to work with the log-transformed
response variable.

In light of the above, the Poisson model fits the data better, which makes sense because we are
working with rates and low count data.

4.3. Poisson model
To check whether it follows an exact Poisson distribution, we will calculate both the mean and the
variance and check whether their ratio is equal to 1.
The mean is equal to 9.333 and the variance is equal to 9.971, so the ratio is 0.936 (see the sketch
below). This gives us some evidence that the data follows a distribution similar to a Poisson
distribution.
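A minimal sketch of this computation in R:

# Sketch: compare the mean and variance of the response (values reported above).
mean(eba1977$cases)                       # approx. 9.333
var(eba1977$cases)                        # approx. 9.971
mean(eba1977$cases) / var(eba1977$cases)  # ratio approx. 0.936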

Before checking for over- or underdispersion, we can look again at the p-values of the different
parameters. Surprisingly, some of the cities are not significant, so we can fit a nested reduced model
that only takes into account the age groups. The result obtained from the reduced model is the
following:
We are unsure which model fits better, since the reduced one has a slightly lower AIC but a slightly
higher residual deviance, so we will carry on with the overdispersion tests and decide which model
fits better later on.

We are not really interested in comparing the different age groups with the intercept (age group
40-54), since we could equally compare, say, the 65-69 age group with the other groups. We just want
to know if AGE, as a whole, affects the risk of lung cancer. To do this we perform a likelihood ratio
test: we fit a model with only the constant and a model with the AGE variable and compare both (a
sketch of these tests is given below).

We get a table with the null model, which only has the constant, and the model with AGE.
The reduction in deviance is about 100 units and the p-value is very small, so AGE is a highly
significant predictor.
We can do exactly the same thing to know if CITY, as a whole, affects the risk of lung cancer.

The p-value obtained is equal to 33.5%, so it looks like CITY is not a significant predictor. This is
strange, because at the exploratory level we saw that the city plays an important role in the
development of lung cancer, especially in the case of Fredericia.
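A minimal sketch of these likelihood ratio tests in R (the model names null_model, age_model and city_model are our own):

# Sketch: likelihood ratio tests for AGE and CITY as a whole.
null_model <- glm(cases ~ 1 + offset(log(pop)), family = poisson, data = eba1977)
age_model  <- glm(cases ~ age + offset(log(pop)), family = poisson, data = eba1977)
city_model <- glm(cases ~ city + offset(log(pop)), family = poisson, data = eba1977)
anova(null_model, age_model, test = "Chisq")    # does AGE reduce the deviance significantly?
anova(null_model, city_model, test = "Chisq")   # does CITY reduce the deviance significantly?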

4.4. Overdispersion and zero-inflation
Now we must check whether there is over- or underdispersion, to assess if the data is better explained
by a Poisson or a quasi-Poisson model.
By just looking at the ratio of the mean and the variance, we can see that there is a little bit of
underdispersion. We can double-check this by using the dispersiontest() function, performing a
one-sided test with the alternative equal to "less", since we are testing for underdispersion (see the
sketch below). In this case:
- H0: Φ = 1, meaning that there is no underdispersion.
- H1: Φ < 1, meaning that there is underdispersion.
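A minimal sketch of this test in R (dispersiontest() comes from the AER package; poi_model is the full Poisson model from the appendix):

# Sketch: one-sided test for underdispersion.
library(AER)
dispersiontest(poi_model, alternative = "less")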

As we can see from the test, the p-value is much larger than 0.05, so we cannot reject the null
hypothesis. This means there is no evidence that the dispersion parameter differs from 1 and,
therefore, no evidence of underdispersion.
On the other hand, the data is not zero-inflated. This could be an issue if we worked with very young
age groups, such as 0-5 years old, since lung cancer at such an early age is quite rare, but it does
not happen here (we are working with adults), so it should not be an issue (a quick check is sketched
below).
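A quick sketch of such a check in R, assuming the fitted Poisson model poi_model:

# Sketch: compare observed zero counts with the proportion of zeros expected under the fit.
sum(eba1977$cases == 0)                       # observed zero counts
mean(dpois(0, lambda = fitted(poi_model)))    # expected proportion of zeros under the Poisson model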

4.5. Quasi-Poisson Models

We know that there is no underdispersion, but we can fit a quasi-Poisson model and compare it to the
Poisson model to confirm this idea.
We can see that, when refitting the full model, the city predictors remain non-significant, which is
strange because we expected people from Fredericia to have a higher risk of lung cancer because of the
petrochemical industry. Also, as we saw in the previous section, the city predictor is not significant
according to the ANOVA calculation.
With this in mind, we again create a reduced quasi-Poisson model that does not contain the city
predictor, as we did previously, to see if it fits better.

Full Model Reduced Model

In the reduced model, each age-group level remains significant. Now we can compare both models by
performing an ANOVA test.

From the results of the ANOVA calculation, we see that the reduced model has a larger residual
deviance than the full model, which means that it fits the data worse. As a consequence, the following
tests will be performed using the full model. Needless to say, both models are nested, which is why we
are able to perform an ANOVA calculation on them: the reduced model is just the full model with fewer
predictor variables.

We can compare the models obtained above with the ones obtained with the Poisson distribution.
The estimates do not change, but the standard errors and p-values do increase in the quasi-Poisson
model, as expected.

4.6. Other link functions

The basic idea of Poisson regression is to do the regression on a transformed scale: instead of
modelling the expected count directly, we model its logarithm as a linear function of the predictors.
So, as we said, the natural log is the default link function for the Poisson error distribution. It
works well for count data because it forces all of the predicted values to be positive [5].
But, since we are aware that there are other link functions, we will use some of them to check whether
they explain the data better or not.

We are going to use the log, identity and sqrt link functions.

LOG link IDENTITY link

SQRT link

As we can see, the p-values vary a lot between the different link functions; in fact, with the
identity link there are no significant predictors.
On the other hand, the residual vs fitted value plots that can be found in the appendix (Plots 1, 2,
3) are all well behaved, since they look like a random cloud of points with no clear pattern.
We can also see that the reduction in deviance with the LOG link function is greater than with the
other link functions.
Therefore, we are going to stick with the LOG link function.

4.7. Predictions
Now we can make predictions about the number of lung cancer cases for each age group and city. To do
this, we first make a dataframe with the predictions and the observed results. As we can see, some of
the predictions are really good and some of them are quite poor.
We make the predictions using the reduced Poisson model and plot the result in boxplots of the rate
cases/population against age and against city.

The boxes in red correspond to the actual values, and the blue ones are the predicted values. As we
can see in both boxplots, the predicted results follow a similar pattern to the observed lung cancer
cases.

4.8. Confidence Intervals
Now we can quantify the effect of the different predictors on the response, and give a 95% confidence
interval for their true effect.
Firstly, both City and Age predictors are factors and therefore, when looking at the confidence intervals,
we are comparing each level (different cities and age groups) with the reference level (which is the
intercept).

So, with 95% confidence, the baseline rate (an individual from Fredericia in the 40-54 age group, i.e.
the intercept) lies inside the interval (0.002, 0.005). In the other cases, we compare against this
reference level. So:
- With 95% confidence, the rate is multiplied by something inside the interval (0.504, 1.026) when the
individual is from Horsens rather than from the reference level.
- We can do the same for each predictor.

All the city estimates are lower than one and therefore the rate of cancer is lower than in the
intercept, Fredericia (which makes total sense for the reasons we gave before). Regarding the age
estimates, they are greater than one and have an increasing tendency, because as age increases the
risk of cancer also increases (except for the last group that we mentioned at the beginning of the
project). Thus, the rate of cancer is higher than in the intercept, the youngest age group (which also
makes total sense). An alternative way of obtaining these intervals is sketched below.
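As a side note, a minimal sketch of an alternative way to obtain these Wald intervals in R (model name as in the appendix):

# Sketch: Wald 95% confidence intervals back-transformed to the rate-ratio scale.
exp(confint.default(poi_model))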

5. Conclusion and discussion


Looking at the results, the lung cancer cases are better explained by a Poisson model than by a linear
model, as we expected, because we are working with rates and low count data. Even though the estimated
dispersion parameter is slightly below one, this difference is not significant (the p-value of the
one-sided dispersion test is equal to 40%), so there is no over- or underdispersion and no
zero-inflation. This means we can work with a Poisson model instead of a quasi-Poisson model.
With respect to the different link functions (log, identity and sqrt) that we can use in the Poisson
model, it is better to use the default one, the log link function.

Regarding the results obtained from the analyses of the different predictors, they are somewhat
contradictory:
The AIC of the full Poisson model is higher than that of the reduced Poisson model, which suggests the
reduced model fits the data better. However, the deviance of the full model is lower than that of the
reduced model, which suggests the full model explains the data better. On the other hand, if we do an
ANOVA of the null model against the model with city as the only predictor, we obtain a high p-value,
so the predictor city is not significant.

In light of the above, we conclude that the predictor City is not significant, which we did not
expect. This is actually strange, since we expected that having a petrochemical industry in the city
would increase the probability of lung cancer development, and we saw a clear pattern in the
exploratory analysis.
Nevertheless, when making predictions with the reduced model (which only takes into consideration the
different age groups), the predictions are good, meaning that the model is actually useful.

On the other hand, the predictor Age is highly significant and thus really helpful for predicting the
number of lung cancer cases. As we expected, as age increases the rate of cancer also increases, which
makes sense given that age is a known factor in cancer development, as we mentioned before.
Surprisingly, the 75+ age group has the same median rate as the 60-64 group and a higher spread, but
we can suppose this is because older people who developed cancer may not have survived.

So, the conclusion is that age is a significant predictor, while the city of residence (and hence the
presence of a petrochemical industry in the city) is not significant for assessing the probability of
developing lung cancer.

6. Bibliography
[1] Aunan J.R., Cho W.C., Søreide K. The Biology of Aging and Cancer: A Brief Overview of Shared and
Divergent Molecular Hallmarks. Aging Dis. 2017;8:628–642. doi: 10.14336/AD.2017.0103.
[2] Canadian Cancer Society: Air pollution.
https://cancer.ca/en/cancer-information/reduce-your-risk/know-your-environment/air-pollution
[3] Mann, Prem S. (1995). Introductory Statistics (2nd ed.). Wiley. ISBN 0-471-31009-3.
[4] https://towardsdatascience.com/generalized-linear-models-9cbf848bb8ab
[5] https://www.theanalysisfactor.com/generalized-linear-models-in-r-part-6-poisson-regression-count-variables/

7. Appendix
# Libraries and data
library(ISwR)            # eba1977 dataset
library(AER)             # dispersiontest()
library(RColorBrewer)    # colour palettes for the plots
library(ggplot2)         # ggplot() used below

# Importing data
summary(eba1977)
working_data <- eba1977
attach(working_data)

# Descriptive statistics
plot(log(pop), cases, xlab = "log-scale population", ylab = "cases",
     main = "Lung cancer cases against population")
boxplot(cases/pop ~ age, col = brewer.pal(length(levels(age)), "YlOrBr"),
        xlab = "Age groups", ylab = "Rate of cases taking into account population",
        main = "Cases against age group")
boxplot(cases/pop ~ city, col = brewer.pal(length(levels(city)), "YlOrBr"),
        xlab = "Cities", ylab = "Rate of cases taking into account population",
        main = "Cases against cities")
p <- ggplot(working_data, aes(x = city, y = pop, color = as.factor(age))) +
  xlab("City") + ylab("Population") + labs(color = "Age") +
  geom_point(alpha = .8) +
  scale_color_manual(values = c("red", "yellow", "blue", "green", "pink", "black"))
p <- p + theme_bw()
p

# Linear and Poisson models
LM_cancer <- lm(cases ~ city + age + offset(log(pop)), data = working_data)
summary(LM_cancer)
poi_model <- glm(cases ~ city + age + offset(log(pop)), family = "poisson", data = eba1977)
summary(poi_model)

# Reduced Poisson model (age only) and chi-square ANOVA of its terms
poi_reduced <- glm(cases ~ age + offset(log(pop)), family = "poisson", data = eba1977)
anova(poi_reduced, test = "Chisq")

# Residuals vs fitted values, coloured by age group and with a symbol per city
plot(predict(poi_model, type = "resp"), resid(poi_model),
     pch = seq(1, length(levels(city))), xlab = "Linear predictor", ylab = "Deviance residual",
     col = brewer.pal(n = length(levels(age)), name = "Set1"))
abline(0, 0, col = "orange")
legend("topright", legend = levels(age), pch = 16,
       col = brewer.pal(n = length(levels(age)), name = "Set1"))
legend("bottomright", legend = levels(city), pch = seq(1, length(levels(city))))

# Test for underdispersion (one-sided, alternative "less")
dispersiontest(poi_model, alternative = "less")

# Residuals vs fitted values
plot(predict(poi_model, type = "resp"), resid(poi_model),
     xlab = "Linear predictor", ylab = "Deviance residual")
abline(0, 0, col = "orange")

# Poisson model using the log-transformed response variable
poi_model_log <- glm(log(cases) ~ city + age + offset(log(pop)), family = "poisson", data = eba1977)
plot(predict(poi_model_log, type = "resp"), resid(poi_model_log),
     xlab = "Linear predictor", ylab = "Deviance residual")
abline(0, 0, col = "orange")

# Quasi-Poisson models (full and reduced) and their comparison
quasi_model <- glm(cases ~ city + age + offset(log(pop)),
                   family = quasipoisson(link = "log"), data = eba1977)
summary(quasi_model)
age_quasi_model <- glm(cases ~ age + offset(log(pop)),
                       family = quasipoisson(link = "log"), data = eba1977)
summary(age_quasi_model)
anova(quasi_model, age_quasi_model)

# Different link functions
model_link_identity <- glm(cases ~ city + age + offset(log(pop)),
                           family = poisson(link = "identity"), data = eba1977)
summary(model_link_identity)
model_link_sqrt <- glm(cases ~ city + age + offset(log(pop)),
                       family = poisson(link = "sqrt"), data = eba1977)
summary(model_link_sqrt)

plot(predict(model_link_identity, type = "resp"), resid(model_link_identity),
     xlab = "Linear predictor", ylab = "Deviance residual")
abline(0, 0, col = "orange")

plot(predict(model_link_sqrt, type = "resp"), resid(model_link_sqrt),
     xlab = "Linear predictor", ylab = "Deviance residual")
abline(0, 0, col = "orange")

# Confidence intervals (Wald, back-transformed to the rate-ratio scale)
M <- summary(poi_model)$coefficients
se <- M[, 2]                      # standard errors of the coefficients
b <- coef(poi_model)
llb <- b - qnorm(0.975) * se      # lower bounds on the link scale
ulb <- b + qnorm(0.975) * se      # upper bounds on the link scale
llb <- exp(llb)                   # back-transform to the rate-ratio scale
ulb <- exp(ulb)

Plot 1. Deviance vs fitted values using the log link function

Plot 2. Deviance vs fitted values using the identity link function

Plot 3. Deviance vs fitted values using the SQRT link function

# Predictions: actual vs predicted cases from the reduced Poisson model
predicted_values <- predict(poi_reduced, type = "response")
eba_predict <- eba1977
eba_predict$cases <- predicted_values
predict_vect <- append(rep("actual", 24), rep("predicted", 24))
new_df <- data.frame(
  city = append(eba1977$city, eba_predict$city),
  age = append(eba1977$age, eba_predict$age),
  predicted = predict_vect,
  pop = append(eba1977$pop, eba_predict$pop),
  cases = append(eba1977$cases, eba_predict$cases)
)

ggplot(new_df, aes(x = age, y = cases/pop, fill = predicted)) +
  geom_boxplot()
ggplot(new_df, aes(x = city, y = cases/pop, fill = predicted)) +
  geom_boxplot()
