Final Project
Albert Vidal, Lluis Gimenez
Lung Cancer Dataset analysis
1. Context and goals
This dataset contains counts of incident lung cancer cases and population sizes in four neighbouring Danish cities, broken down by six age groups, during the period 1968-1971:
- City: a factor with four levels (Fredericia, Vejle, Horsens, Kolding).
- Age group: 40-54, 55-59, 60-64, 65-69, 70-74, 75+.
- Population: the number of individuals in each city and age group.
- Cases: the number of lung cancer cases recorded in each city and age group.
We think including an Industry indicator is a good idea because the presence of such an industry may be a risk factor for lung cancer development and therefore a useful predictor.
In light of the above, we expect that age is a significant predictor for the response variable since aging is
the inevitable time-dependent decline in physiological organ function and therefore a major risk factor for
cancer development [1].
Therefore, if we look at the different age groups, we expect a direct relationship. The higher the age
group, the higher the number of lung cancer cases.
We also believe that having certain types of industry near a city will be a significant predictor of the number of lung cancer cases, since such industries produce contaminating residues and air pollution, which are strongly linked to lung cancer [2].
Again, we expect a direct relationship: if there is a petrochemical industry in the city, the probability of developing lung cancer increases.
We are dealing with rate data, since the population at risk differs across cities and age groups. Accordingly, we expect a larger number of lung cancer cases where the population variable is larger.
2. Description of the Dataset
2.1 Descriptive analysis of the data
In this study we are going to work with a lung cancer dataset, which is composed of 4 predictor variables
(City, Age, Population and Industry) and a response variable (number of lung cancer cases).
To corroborate this, we can investigate the relationship between each city and the population of each age group with another scatterplot. As we can see, every city has a large population of young individuals (2500-3200 people), while the number of people in the older age groups decreases considerably.
In the first boxplot, we can see a clear correlation between the age of the individuals and the rate of cancer in each group. As the groups increase in age, the rate of cancer also increases, which makes sense given that age is a known factor in cancer development. Surprisingly, the 75+ age group has roughly the same median rate as the 60-64 group and a higher spread; we can suppose this is because people who develop lung cancer are less likely to survive to this age, so part of the population that would have appeared as cases at 75+ has already died (which lowers the observed number of cases).
In the second boxplot, we can see that the median rate of cancer is approximately equal in all cities except Fredericia, where it is much higher. This could be explained by the petrochemical industry located in that city, which would increase the rate of cancer. There is also an outlier in the Fredericia group with a significantly lower rate than the rest; it corresponds to the youngest age group, which has not been exposed to the industry long enough to feel its effects.
The dataset was originally analysed in E.B. Andersen (1977), "Multiplicative Poisson models with unequal cell rates", Scandinavian Journal of Statistics, 4:153-158, and draws on J. Clemmensen et al. (1974), Ugeskrift for Læger, pp. 2260-2268.
The simple linear model is Yi = β0 + β1·xi + εi, where β0 is the intercept, β1 is the coefficient quantifying the effect of the predictor on the response (if x goes up by one unit, the expected response changes by β1 units), xi is the predictor and εi is the associated error. The model has a systematic part (β0 + β1·xi) and a random part (εi), together with the usual assumptions (linearity, independence of errors, constant variance and normality).
So, we will check whether the predictor variables are useful for predicting the response variable, that is, whether they are significant.
Standard linear regression requires the response variable to be normally distributed. To deal with responses that violate this, a new kind of linear model was created: the generalized linear model. Many methods are related to it, one of them being Poisson regression, which is designed to model count or rate data that follow a Poisson distribution.
The Poisson distribution is part of the exponential family and has probability mass function

P(Y = y) = e^(-λ) λ^y / y!,  y = 0, 1, 2, ...

where y is the number of occurrences and λ is the average rate of occurrences in the desired time interval.
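As an illustrative aside (the report's own analysis is in R), the probability mass function above can be evaluated in a few lines of Python; the rate λ = 9 is an arbitrary choice close to the sample mean discussed later:

```python
from math import exp, factorial

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) = e^(-lam) * lam^y / y! for a Poisson(lam) variable."""
    return exp(-lam) * lam ** y / factorial(y)

lam = 9.0
print(round(poisson_pmf(9, lam), 4))  # probability of exactly 9 cases
print(round(sum(poisson_pmf(y, lam) for y in range(60)), 6))  # probabilities sum to ~1
```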
A generalized linear model has three components: a random component, a systematic component (the linear predictor) and a link function, which literally "links" the linear predictor to the parameter of the probability distribution. For Poisson regression the log link is the default, because the Poisson mean must be positive [4], but other link functions such as the identity or square-root links can also be used.
Poisson regression requires the variance to equal the expectation. When this is not the case, we say the data show overdispersion or underdispersion (variance higher or lower than the mean, respectively), and a different model must be applied.
To deal with this, we can use the quasi-Poisson model, in which a dispersion parameter φ is introduced. It can be estimated as φ = X²/df, where X² is the Pearson statistic and df the residual degrees of freedom.
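As a sketch of how φ is estimated (illustrative Python with made-up numbers, not the report's data):

```python
def dispersion_estimate(observed, fitted, n_params):
    """phi-hat = Pearson X^2 / residual degrees of freedom."""
    x2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    df = len(observed) - n_params
    return x2 / df

# Hypothetical observed counts and fitted means from some Poisson fit:
obs = [11, 9, 7, 12, 8, 10, 6, 9]
fit = [10.0, 9.5, 7.5, 11.0, 8.5, 9.5, 7.0, 9.0]
phi = dispersion_estimate(obs, fit, n_params=2)
print(phi < 1)  # True here: these toy data are underdispersed
```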
It is also possible for the data to be zero-inflated, which happens when zeros occur with higher probability than a Poisson distribution allows. In this case, a zero-inflated Poisson model must be applied, which accounts for that excess of zeros.
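A zero-inflated Poisson mixes a point mass at zero with an ordinary Poisson component. A minimal Python sketch (the values of π and λ are arbitrary illustration choices):

```python
from math import exp, factorial

def zip_pmf(y: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson: extra probability pi of a 'structural' zero,
    otherwise an ordinary Poisson(lam) count."""
    poisson = exp(-lam) * lam ** y / factorial(y)
    extra = pi if y == 0 else 0.0
    return extra + (1 - pi) * poisson

# With pi = 0.2, a zero is far more likely than plain Poisson(3) predicts:
print(round(zip_pmf(0, lam=3.0, pi=0.2), 3))  # inflated zero probability
print(round(exp(-3.0), 3))                    # plain Poisson zero probability
```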
When fitting the different models, we are going to perform a number of hypothesis tests to check whether a null hypothesis can be rejected. One of the statistics we obtain from the regression models is the p-value, which denotes the probability (between 0 and 1) of observing a test statistic as extreme as, or more extreme than, the one observed, under the assumption that the null hypothesis is true. It therefore quantifies the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence against it.
The p-value is compared with the significance level of the test:
- p-value < α: we reject H0
- p-value ≥ α: we do not reject H0
The significance level α, or Type I error rate, is the probability of rejecting the null hypothesis when it is actually true (a false positive).
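The decision rule can be sketched in Python; the z value below is a hypothetical Wald statistic, not one taken from the report:

```python
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """P(|Z| >= |z|) for a standard-normal test statistic under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
z = 2.3  # hypothetical estimate / standard error
p = two_sided_p(z)
print("reject H0" if p < alpha else "do not reject H0")
```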
There are multiple ways to compare the goodness of fit of different models. Let's begin with the Akaike Information Criterion, or AIC. It is a goodness-of-fit measure, meaning it quantifies the quality of a model. The standalone number is meaningless; it only gains meaning when contrasted with another model's. The AIC is defined as

AIC = 2k − 2 ln(L̂),

where L̂ is the maximized likelihood and k the number of parameters.
Since we want the likelihood to be as large as possible, a good AIC must be as small as possible: the smaller the AIC, the better the model. The 2k term penalises complicated models with many parameters. If the AIC of one model is lower than that of another, we can claim the former fits better.
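A toy comparison (hypothetical log-likelihoods, not the fitted models of this report) shows the penalty at work:

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2*ln(L-hat); smaller is better."""
    return 2 * k - 2 * log_likelihood

# A richer model barely improves the likelihood but pays for 6 extra parameters:
aic_small = aic(log_likelihood=-60.0, k=3)  # 126.0
aic_big = aic(log_likelihood=-59.5, k=9)    # 137.0
print(aic_small < aic_big)  # True: prefer the smaller model here
```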
The analysis of variance, or ANOVA, is a family of methods used to compare the means of two or more groups. In our case, we are interested in using it to compare models, which is only possible if they are nested (the parameters of one model are a subset of the other's). For generalized linear models, a chi-square test assesses whether the decrease in residual deviance is significant.
A scatter plot with deviance residuals on the y axis and fitted values on the x axis can also be used to
visually estimate how well a model fits. If the scatter plot has no discernible pattern (the points are
distributed in a random-looking cloud), we can claim the model fits well because the residuals are well
behaved.
Finally, a confidence interval (CI) gives a range of plausible values for the population parameter, built from the point estimate, the confidence level (95% by default, i.e. α = 0.05) and the standard error of the point estimate. The resulting interval comes from a procedure that captures the true population parameter 95% of the time.
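A Python sketch of the same Wald interval on the rate-ratio scale, with a made-up coefficient and standard error (the appendix computes the analogous quantity in R):

```python
from math import exp
from statistics import NormalDist

def rate_ratio_ci(beta: float, se: float, level: float = 0.95):
    """Wald CI for exp(beta): exponentiate beta +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for 95%
    return exp(beta - z * se), exp(beta + z * se)

lo, hi = rate_ratio_ci(beta=0.40, se=0.15)  # hypothetical values
print(lo < exp(0.40) < hi)  # True: the point estimate sits inside its CI
```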
For a Poisson regression, the estimates still make sense, since they are consistent with the estimates on the
linear model. The p-values, however, are different, telling us that many predictors have a significant effect
on the response as we expected. The residual deviance and degrees of freedom are relatively similar,
indicating that there may not be overdispersion.
(Summary outputs: full linear model and full Poisson model.)
In light of the above, the Poisson model fits the data better, which makes sense because we are working with low-count and rate data.
4.3. Poisson model
To check whether the response is consistent with a Poisson distribution, we calculate both the mean and the variance and check whether their ratio is close to 1. The mean is 9.333 and the variance is 9.971, so the ratio is 0.936. This gives us some evidence that the data follow a distribution similar to a Poisson distribution.
We are not really interested in comparing each age group with the reference level (age group 40-54), since we could equally compare, say, the 55-59 group with the other groups. We just want to know whether AGE, as a whole, affects the risk of lung cancer. To do this we perform a likelihood ratio test: we fit a model with only the constant and a model that adds the AGE variable, and compare the two. We obtain a table with the null model (constant only) and the model with AGE.
The reduction in deviance is about 100 units and the p-value is very small, so AGE is a highly significant predictor.
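The mechanics of such a likelihood ratio test can be sketched in Python. The deviances below are hypothetical, and the chi-square tail probability uses the Wilson-Hilferty normal approximation rather than an exact CDF:

```python
from statistics import NormalDist

def lrt_pvalue(dev_null: float, dev_full: float, df: int) -> float:
    """Deviance drop between nested models ~ chi-square(df) under H0.
    Tail probability via the Wilson-Hilferty approximation."""
    x = dev_null - dev_full
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / (2 / (9 * df)) ** 0.5
    return 1 - NormalDist().cdf(z)

# A deviance drop of ~100 on 5 degrees of freedom is overwhelming evidence:
print(lrt_pvalue(dev_null=130.0, dev_full=30.0, df=5) < 0.001)  # True
```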
We can do the same to test whether CITY, as a whole, affects the risk of lung cancer. The p-value obtained is 33.5%, so CITY does not appear to be a significant predictor. This is strange, because in the exploratory analysis we saw that the city seems to play an important role in the development of lung cancer, especially in the case of Fredericia.
4.4. Overdispersion and zero-inflated
Now we must check whether there is over- or underdispersion, to assess whether the data are better explained by a Poisson or a quasi-Poisson model. By just looking at the ratio of the variance to the mean, we can see a little underdispersion. We can double-check this with the dispersiontest() function from the AER package, performing a one-sided test with alternative = "less", since we are testing for underdispersion. In this case:
- H0: φ = 1, meaning there is no underdispersion.
- H1: φ < 1, meaning there is underdispersion.
In the reduced model, each group level remains significant. Now we can compare both models with an ANOVA test. From the results, we see that the reduced model has a larger residual deviance than the full model, which means it fits the data worse. As a consequence, the following tests will be performed using the full model. Note that both models are nested, which is why we can compare them with ANOVA: the reduced model is just the full model with fewer predictor variables.
We can compare the models obtained above with those obtained with the Poisson distribution. The estimates do not change, but the standard errors and p-values increase in the quasi-Poisson model, as expected.
We are going to use the log, identity and sqrt link functions.
(Model output using the sqrt link function.)
4.7. Predictions
Now we can make predictions of the number of lung cancer cases for each age group and city. To do this, we first build a dataframe with the predictions and the observed results. As we can see, some of the predictions are very close to the observed counts, while others deviate substantially.
We can make the prediction using the reduced Poisson model and display the results in boxplots of the rate cases/population against age and against city. The boxes in red correspond to the actual values and the blue ones to the predicted values. As we can see in both boxplots, the predicted results follow a similar pattern to the observed lung cancer cases.
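Under the log link with an offset of log(pop), a prediction is just pop × exp(linear predictor). A Python sketch with hypothetical coefficients (not the fitted values from the report):

```python
from math import exp

def expected_cases(pop: float, beta0: float, beta_age: float = 0.0) -> float:
    """Poisson model with log link and offset log(pop):
    E[cases] = pop * exp(beta0 + beta_age)."""
    return pop * exp(beta0 + beta_age)

# Hypothetical coefficients: a large young group vs a smaller older group.
young = expected_cases(pop=3000, beta0=-5.8)              # reference age group
older = expected_cases(pop=900, beta0=-5.8, beta_age=1.4)
print(older / 900 > young / 3000)  # True: higher per-capita rate when beta_age > 0
```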
4.8. Confidence Intervals
Now we can quantify the effect of the different predictors on the response, and give a 95% confidence
interval for their true effect.
Firstly, both City and Age predictors are factors and therefore, when looking at the confidence intervals,
we are comparing each level (different cities and age groups) with the reference level (which is the
intercept).
All the city estimates are lower than one, so the rate of cancer is lower than in the reference city, Fredericia (which makes sense for the reasons given before). The age estimates are greater than one and show an increasing tendency, because as age increases the risk of cancer also increases (except for the last group, mentioned at the beginning of the project). Thus, the rate of cancer is higher than in the reference level, the youngest age group (which also makes sense).
The results obtained from the analyses of the different predictors are somewhat contradictory. The AIC of the full Poisson model is higher than that of the reduced Poisson model, so by this criterion the reduced model fits the data better. However, the deviance of the full model is lower than that of the reduced model, so by deviance alone the full model explains the data better. On the other hand, an ANOVA of the null model against the model with CITY as the only predictor gives a high p-value, so the parameter CITY is not significant.
In light of the above, we conclude that the predictor CITY is not significant, which we did not expect. This is strange, since having a petrochemical industry in the city is believed to increase the probability of developing lung cancer, and we saw a clear association in the exploratory analysis. Nevertheless, when making predictions with the reduced model (which only takes the age groups into consideration), the predictions are good, meaning the model is actually useful.
On the other hand, the predictor AGE is highly significant and thus very helpful for predicting the number of lung cancer cases. As expected, the rate of cancer increases with age, which makes sense given that age is a known factor in cancer development, as mentioned before. Surprisingly, the 75+ age group has the same median rate as the 60-64 group and a higher spread, but we can suppose this is because older people who developed cancer may not have survived. So, the conclusion is that age is a significant predictor, while the city of residence (i.e. having a petrochemical industry in the city) is not significant for assessing the probability of developing lung cancer.
6. Bibliography
[1] Aunan J.R., Cho W.C., Søreide K. The Biology of Aging and Cancer: A Brief Overview of Shared and
Divergent Molecular Hallmarks. Aging Dis. 2017;8:628–642. doi: 10.14336/AD.2017.0103.
[2] Canadian Cancer Society: Air pollution.
https://cancer.ca/en/cancer-information/reduce-your-risk/know-your-environment/air-pollution
[3] Mann, Prem S. (1995). Introductory Statistics (2nd ed.). Wiley. ISBN 0-471-31009-3.
[4] https://towardsdatascience.com/generalized-linear-models-9cbf848bb8ab
[5] https://www.theanalysisfactor.com/generalized-linear-models-in-r-part-6-poisson-regression-count-variables/
7. Appendix
# Libraries and data
library(ISwR)          # provides the eba1977 dataset
library(AER)           # dispersiontest()
library(RColorBrewer)  # colour palettes
library(ggplot2)       # needed for the scatterplot below

# Importing data
summary(eba1977)
working_data <- eba1977
attach(working_data)

# Descriptive statistics
plot(log(pop), cases, xlab = "log-scale population", ylab = "cases",
     main = "Lung cancer cases against population")
boxplot(cases/pop ~ age, col = brewer.pal(length(levels(age)), "YlOrBr"),
        xlab = "Age groups", ylab = "Rate of cases taking into account population",
        main = "Cases against age group")
boxplot(cases/pop ~ city, col = brewer.pal(length(levels(city)), "YlOrBr"),
        xlab = "Cities", ylab = "Rate of cases taking into account population",
        main = "Cases against cities")
p <- ggplot(working_data, aes(x = city, y = pop, color = as.factor(age))) +
  xlab("City") + ylab("Population") + labs(color = "Age") +
  geom_point(alpha = .8) +
  scale_color_manual(values = c("red", "yellow", "blue", "green", "pink", "black"))
p <- p + theme_bw()
p
# Residuals vs fitted values
plot(predict(poi_model, type = "resp"), resid(poi_model),
     xlab = "Fitted values", ylab = "Deviance residual")
abline(0, 0, col = "orange")
# Quasi-Poisson model
quasi_model <- glm(cases ~ city + age + offset(log(pop)),
                   family = quasipoisson(link = "log"), data = eba1977)
summary(quasi_model)
age_quasi_model <- glm(cases ~ age + offset(log(pop)),
                       family = quasipoisson(link = "log"), data = eba1977)
summary(age_quasi_model)
anova(quasi_model, age_quasi_model, test = "F")  # F test is appropriate for quasi models
# Confidence intervals (Wald, on the rate-ratio scale)
M <- summary(poi_model)$coefficients
se <- M[, 2]                       # standard errors
b <- coef(poi_model)               # point estimates
llb <- exp(b - qnorm(0.975) * se)  # lower 95% limits
ulb <- exp(b + qnorm(0.975) * se)  # upper 95% limits
Plot 3. Deviance vs fitted values using the SQRT link function
# Predictions
predicted_values <- predict(poi_reduced, type = "response")
eba_predict <- eba1977
eba_predict$cases <- predicted_values
predict_vect <- append(rep("actual", 24), rep("predicted", 24))
new_df <- data.frame(
  city = append(eba1977$city, eba_predict$city),
  age = append(eba1977$age, eba_predict$age),
  predicted = predict_vect,
  pop = append(eba1977$pop, eba_predict$pop),
  cases = append(eba1977$cases, eba_predict$cases)
)