Sample 2 For Group Project Report
Project 2
Group 7
Multivariate Assessment for
Life Expectancy
Robert Lesperance, Megan Withers,
Nick Cellitti, and Amanda Fisher
Multivariate Assessment for Life Expectancy
Table of Contents
Table of Figures
Executive Summary
Introduction
VIFs
Correlation Matrix
Conclusion
Bibliography
Appendix A
Appendix B
Table of Figures
Executive Summary
This report was prepared on behalf of the actuary division of S.C. Insurance Agency to support
predictions for the new fiscal year. The statistical analysis in this report examines the
relationship between average income, obesity, cancer deaths, smoking, and life expectancy across all
50 states in the United States. The data came from the Kaiser Family Foundation and the U.S. Census
Bureau. The data were analyzed using multivariate regression analysis and hypothesis testing to
determine the overall effect that the independent variables have on the dependent variable, life
expectancy. Using Mega-Stat, S.C. Insurance Agency calculated the correlation coefficient, which
showed that the linear model was significant and had a strong positive correlation (0.925). The
coefficient of determination indicates that 85.6% of the variability in life expectancy is explained
by the independent variables; the remaining 14.4% is unexplained by the variables in this model.
The overall fit of the model was found significant at the 10 percent significance level, while the
adult obesity/overweight rate and average income were not found significant at the 10 percent
significance level. Further testing revealed that multicollinearity did exist; if the smoking adults
variable is removed, the overall significance of the independent variables increases and all
remaining variables become significant. There were no violations of the regression assumptions. There
were three outliers in the data: Utah, Minnesota, and Maryland. States that have a lower life
expectancy and show lower significance for all four variables will in return give S.C. Insurance the
most profit. Conversely, if life expectancy is high and strongly correlated with all four variables,
S.C. Insurance will lose money. Future testing on genetics, lifestyle, and extracurricular activities
should be conducted
Introduction
Every year S.C. Insurance Agency looks for ways to increase profit margins and improve the company
as a whole. One of the easiest ways to do this is to determine which states will help S.C. Insurance
achieve the highest possible total profit and then to focus marketing on those states. Within S.C.
Insurance Agency, this task falls to the actuary division. Due to advances in medicine and new
technological procedures, insurance payouts have become higher and higher. Clients are requesting
more insurance money to cover a larger portion of the expenses incurred for medical procedures and
tests. The primary objective of this department, as company statisticians, is to forecast the best
possible insurance practices that will benefit S.C. Insurance financially while still providing
adequate health insurance to potential clients. The division will be separated into four groups,
each testing a different independent variable to determine which state has the shortest lifespan and
which factors affect lifespan most. Once each report on the four independent variables has been
composed, they will
This report will focus on the impact that cancer deaths per 100,000, average income per state, rate
of obese/overweight adults, rate of smoking adults, and geographic location (Northern or Southern
part of the U.S.) have on average life expectancy. Based on the Kaiser Family Foundation's
findings¹, the average life expectancy (in years) is determined from each state's own observations
(State Health Facts, 2014). The percentages of adults (18 years or older) who smoke or are
obese/overweight are calculated per state as well (State Health Facts, 2014). The target population
for samples in 2010 consists of persons living in households who have a working cellphone, are aged
18 and older, and received 90 percent or more of their calls on cellphones. Samples are chosen
randomly for the 50 states from a set area, determined by area code and response to phone calls
(Behavioral Risk Factor Surveillance System, 2013). Average income per state is based on the real
median household income, which includes the income of the householder and of any individuals 15
years or older in that household, even if they are unrelated to the householder (Noss, 2012). The
data provided by the site are based on sample estimates and closely represent the entire household
and group quarters populations for all states, as recorded by the U.S. Census Bureau (Noss, 2012;
Christie, 2009). Finally, cancer death rates are age-adjusted rates per 100,000 standard population
(Number of Cancer Deaths per 100,000 Population, 2013). Again, the Kaiser Family Foundation compiled
these findings, recorded by the CDC², and provided the figures for the

¹ Statistics are taken from the CDC, the Centers for Disease Control and Prevention.
The correlation will be analyzed to determine whether the regression analysis is significant and
useful for S.C. Insurance Agency's goal of maximizing profits. The report will begin with a
multivariate regression analysis, followed by a series of hypothesis tests that will help conclude
whether each independent variable is in fact significant in relation to life expectancy. This report
will also include a multicollinearity test, a normality test, a non-constant variance test, and an
autocorrelation test to evaluate potential errors in the data set. All of the testing and
computational analysis will be completed in Microsoft
² CDC: Centers for Disease Control and Prevention.
The multiple regression analysis was performed using Mega-Stat to analyze the multiple data sets.
The original data can be referenced in Appendix A. The summary of the analysis, provided by
Mega-Stat, can be found in Figure 1. The regression equation for the model is:

Life Expectancy = 91.1658 - 20.7315 (Smoking Adults) - 0.0311 (Cancer Death Rate)
+ 0.00001391 (Average Income) - 5.0349 (Adult Obesity/Overweight Rate) - 1.1314 (North or South)

The multiple correlation coefficient (R = 0.925) indicates that the data set has a strong positive
linear correlation. The regression analysis shows that if all independent factors are zero, the
predicted average life expectancy would still be 91.1658 years. This value is not completely
accurate because of other factors that may have a negative impact on average life expectancy. Some
of this can be explained by genetics, lifestyle, and extracurricular activities. The coefficient of
determination (R²) measures the strength of the linear relationship between the independent
variables and the dependent variable. The coefficient of determination (R²) is equal to 0.856 and
indicates that 85.6% of the variability in Life Expectancy can be attributed to smoking adults,
cancer death rates, average income, and obesity. The adjusted R² value (0.840) penalizes the
regression for adding unnecessary predictor variables that do not prove helpful for the overall
model. Because the R² value (0.856) is close to the adjusted R² value (0.840), this demonstrates
that the
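The regression equation and the adjusted R² can be checked with a short sketch. It assumes that the smoking and obesity rates enter the model as proportions (for example 0.20 rather than 20), which the size of the coefficients suggests but the report does not state. The example state is hypothetical and is not taken from Appendix A.

```python
# Coefficients as reported in the regression equation above.
INTERCEPT = 91.1658
COEFFICIENTS = {
    "smoking_rate": -20.7315,      # adult smoking rate (assumed proportion, 0 to 1)
    "cancer_death_rate": -0.0311,  # age-adjusted cancer deaths per 100,000
    "average_income": 0.00001391,  # median household income in dollars
    "obesity_rate": -5.0349,       # adult obesity/overweight rate (assumed proportion)
    "south": -1.1314,              # binary location: North = 0, South = 1
}


def predict_life_expectancy(state):
    """Return the fitted life expectancy (years) for a dict of predictor values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * state[name] for name in COEFFICIENTS)


def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical state, for illustration only.
example_state = {
    "smoking_rate": 0.20,
    "cancer_death_rate": 180.0,
    "average_income": 50_000.0,
    "obesity_rate": 0.63,
    "south": 0,
}

print(round(predict_life_expectancy(example_state), 2))
print(round(adjusted_r_squared(0.856, n=50, k=5), 3))
```

With n = 50 states and k = 5 predictors, the adjusted value computed this way agrees with the reported 0.840 to three decimal places.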
Overall Fit
To test the overall goodness of fit for the multiple regression, a hypothesis test was performed
comparing the calculated F value to the critical F value. The calculated F value (52.50) is obtained
by dividing the mean square of the regression (explained) by the mean square of the error. Based on
Figure 2, the model shows a significant overall fit, and at least one of the predictors in the
regression model is significant, because 52.50 is greater than the critical F value (1.982752).
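As a rough cross-check, the F statistic can also be recovered from R² alone, using the standard identity F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below uses the rounded R² of 0.856, so it only approximates the 52.50 that Mega-Stat computed from the unrounded mean squares. The critical value is the one quoted in the report.

```python
def f_statistic_from_r2(r_squared, n, k):
    """Overall F statistic implied by R^2 for n observations and k predictors."""
    explained_per_predictor = r_squared / k
    unexplained_per_df = (1 - r_squared) / (n - k - 1)
    return explained_per_predictor / unexplained_per_df


F_CRITICAL = 1.982752  # critical value quoted in the report (10% level)

f_calc = f_statistic_from_r2(0.856, n=50, k=5)
model_is_significant = f_calc > F_CRITICAL

print(round(f_calc, 2), model_is_significant)
```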
Evans' Rule is used to assess whether the fit of the model is reliable. It verifies whether R² is
reliable given the sample size and number of predictors. Evans' Rule states that at least 10
observations per predictor are necessary to avoid overfitting the model, whereas Doane's Rule is
more relaxed and requires only 5 observations per predictor (see Figure 3). These rules limit the
number of predictors based on sample size. The data in the regression analysis show that the model
meets the minimum requirement for observations, which means the coefficient of determination (R²) is
reliable and there is no overfitting in this model.
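Both rules reduce to a ratio of observations to predictors, as in this minimal sketch. The values n = 50 and k = 5 are inferred from the 50 states and the five predictors in the regression equation.

```python
def observations_per_predictor(n, k):
    """Ratio of sample size to number of predictors."""
    return n / k


def passes_evans_rule(n, k):
    """Evans' Rule (conservative): at least 10 observations per predictor."""
    return observations_per_predictor(n, k) >= 10


def passes_doanes_rule(n, k):
    """Doane's Rule (relaxed): at least 5 observations per predictor."""
    return observations_per_predictor(n, k) >= 5


n_states, n_predictors = 50, 5
print(observations_per_predictor(n_states, n_predictors))
print(passes_evans_rule(n_states, n_predictors), passes_doanes_rule(n_states, n_predictors))
```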
The coefficient of determination (0.856) indicates that 85.6% of the variation in the dependent
variable is explained by the independent variables, and 14.4% of the variation remains unexplained.
The predictor values are used to determine the impact of each independent variable on the dependent
variable. For this model, we chose a predictor value of 50 by which to multiply each slope (β)
value. In each prediction test, we set one specific independent variable to zero to show its impact
on the entire model clearly. Figure 4 shows the prediction model. The prediction model portrays the
regression equation with a predictor value of 50
Hypothesis Tests
The intercept test shows that the intercept is significant at the 10% level because t_calc = 21.621
> t_crit = 1.68023. The values for t_calc come directly from the regression output table. This tells
S.C. Insurance that with all independent variables at zero the model would still predict an average
life expectancy of 91.2 years, and that the independent variables do in fact influence the dependent
variable, life expectancy.
The hypothesis test in Figure 6 shows that smoking adults is significant at the 10% significance
level. Smoking is in fact a significant variable for S.C. insurance to compare with Life Expectancy.
The hypothesis test in Figure 7 shows that cancer death rate is significant at the 10% significance
level. Cancer death rate is a significant variable for S.C. insurance to compare with Life Expectancy.
The hypothesis test in Figure 8 shows that average income is significant at the 10% significance
level. Average Income is a significant variable for S.C. insurance to compare with Life Expectancy.
The hypothesis test in Figure 9 shows that obesity is not significant at the 10% significance level.
Obesity rate is not a significant variable for S.C. insurance to compare with Life Expectancy.
The binary significance test enables S.C. Insurance to capture the effects of non-quantitative
variables; in this model, that variable is location. The categorical predictor is coded North = 0
and South = 1. South is considered significant, based on the test in Figure 10, as there is a
relationship between living in warmer versus colder climates in the United States and Life
Expectancy. The division of the United States was done unscientifically, based on an estimate of
which states lean toward warmer climates and which are generally colder. The states classified as
warmer can be found in Appendix A. The binary variable in this case shifts the intercept but does
not affect the slopes of the multiple regression model.
The binary predictors are important because they enable us to view the effects of categorical data
(such as state location). If the binary coefficient is found to be significantly different from
zero, then we can conclude that the binary predictor is a significant predictor for y. The
regression equation including the binary variable is Life Expectancy = 91.1658 - 20.7315 (Smoking
Adults) - 0.0311 (Cancer Death Rate) + 0.00001391 (Average Income) - 5.0349 (Adult
Obesity/Overweight Rate) - 1.1314 (North or South). The difference between the equations, depicted
below, shows the shift in the graph (the intercept changes from 91.1658 to 90.0344) that occurs for
Southern states (South = 1) because of the binary variable.
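The intercept shift can be verified directly from the reported coefficients. This is a minimal sketch in which only the binary term is applied, and the remaining slopes are left untouched.

```python
INTERCEPT = 91.1658          # reported intercept (North = 0)
SOUTH_COEFFICIENT = -1.1314  # reported coefficient of the North/South binary


def effective_intercept(south):
    """Intercept of the regression line for North (south=0) or South (south=1)."""
    return INTERCEPT + SOUTH_COEFFICIENT * south


print(round(effective_intercept(0), 4))
print(round(effective_intercept(1), 4))
```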
VIFs
Multicollinearity occurs when there are strong correlations between independent variables. It
inflates the standard errors of the coefficients, so variables that would otherwise be significant
may appear insignificant. Looking at the variance inflation factors (VIFs) in Figure 1, multiple
values are highlighted in blue. A VIF of 1 means no collinearity; since no values equal 1 and some
values are greater than 5, the determination is that multicollinearity exists (Doane & Seward,
2013). In this instance, the variable with the highest VIF is smoking adults; the simplest
correction for multicollinearity is to remove that independent variable and rerun the regression
analysis.
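For reference, the VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on the other predictors. The sketch below applies that formula and the cutoff of 5 used in this report. The VIF values in the example are hypothetical placeholders, since the figures from Figure 1 are not reproduced in the text.

```python
def variance_inflation_factor(r_squared_j):
    """VIF for a predictor whose auxiliary regression on the other predictors has R_j^2."""
    if not 0 <= r_squared_j < 1:
        raise ValueError("R_j^2 must be in the interval [0, 1)")
    return 1 / (1 - r_squared_j)


def flag_high_vif(vifs, threshold=5.0):
    """Return the predictor names whose VIF exceeds the threshold."""
    return [name for name, vif in vifs.items() if vif > threshold]


# Hypothetical VIFs, for illustration only (not the values in Figure 1).
hypothetical_vifs = {"smoking": 6.2, "cancer": 3.1, "income": 2.4, "obesity": 5.4}
print(flag_high_vif(hypothetical_vifs))
```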
Figure 11 demonstrates the changes when the independent variable Smoking Adults rate is removed from
the data set. As indicated in the far-right column of the table, all VIFs are reduced by removing
the variable (no blue highlights). Also, all variables are now showing as
Correlation Matrix
Another way to test for multicollinearity is the correlation matrix; the values highlighted in
yellow in Figure 12 demonstrate high collinearity between variables. This indicates that there are
strong relationships between independent variables, which could be affecting the data set in a
positive way. It could also affect the significance of other variables, had certain variables not
been introduced to this data set. According to Klein's Rule, we should not worry about the stability
of the regression because the pairwise predictor correlations do not exceed the multiple correlation
coefficient R (0.925). This tells us that the
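Klein's Rule amounts to comparing each pairwise predictor correlation with the multiple correlation coefficient R. The sketch below shows the comparison. The pairwise correlations are hypothetical placeholders, since the matrix in Figure 12 is not reproduced in the text.

```python
MULTIPLE_R = 0.925  # multiple correlation coefficient reported for the model


def klein_rule_violations(pairwise_correlations, multiple_r=MULTIPLE_R):
    """Return predictor pairs whose absolute correlation exceeds the multiple R."""
    return sorted(
        pair for pair, r in pairwise_correlations.items() if abs(r) > multiple_r
    )


# Hypothetical pairwise correlations, for illustration only.
hypothetical_correlations = {
    ("cancer", "smoking"): 0.78,
    ("obesity", "smoking"): 0.71,
    ("income", "smoking"): -0.62,
}
print(klein_rule_violations(hypothetical_correlations))
```

An empty result means no pair exceeds R, which is the situation the report describes.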
In Figure 13, the independent variable Smoking Adults rate is removed from the regression model.
With the variable removed, it is apparent that the correlation between independent variables has
The analysis rests on three major assumptions about the errors of the model, which are estimated by
the residuals. The assumptions are that the errors are normally distributed, the errors have
constant variance, and the errors are independent of one another (Doane & Seward, 2013). Figure 14
shows a histogram of the standardized residuals. The histogram shows that the data, shown in
Appendix B, do not contain any major outliers; all values fall within -1.5 to +1.5 and follow a
normal bell-curve distribution.
The first assumption (normally distributed errors) is also tested in Figure 15. The figure shows
that the plotted fit is highly linear, with possible unusual observations. The coefficient of
determination of this fit indicates whether Figure 15 passes the normality of errors assumption. The
large R² value (0.9898) indicates a nearly perfect linear fit (98.98% of the variation is
explained), which means it
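Assuming Figure 15 is a normal probability plot of the residuals, an R² of this kind can be approximated as the squared correlation between the sorted residuals and their expected normal scores. The plotting positions (i - 0.5)/n are one common convention and may differ from what Mega-Stat uses. The residuals in the example are hypothetical.

```python
from statistics import NormalDist, mean


def normal_plot_r_squared(residuals):
    """Squared correlation between sorted residuals and their expected normal scores."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Expected normal scores at plotting positions (i - 0.5) / n (an assumed convention).
    scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mean_x, mean_z = mean(ordered), mean(scores)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(ordered, scores))
    sxx = sum((x - mean_x) ** 2 for x in ordered)
    szz = sum((z - mean_z) ** 2 for z in scores)
    return sxz ** 2 / (sxx * szz)


# Hypothetical standardized residuals, for illustration only.
hypothetical_residuals = [-1.4, -0.9, -0.6, -0.3, -0.1, 0.1, 0.2, 0.5, 0.8, 1.3]
print(round(normal_plot_r_squared(hypothetical_residuals), 4))
```

Values close to 1 indicate that the residuals track the normal scores closely, while a strongly skewed sample produces a noticeably lower value.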
The second assumption that needs to be tested is whether the errors have constant variance. If the
variance is constant among the errors, the model is considered homoscedastic, which is the ideal
condition. If the variance is not constant, the errors are considered heteroscedastic; in that case
the estimated variances of the coefficients are biased, which makes the estimates inefficient and
can overstate t values. The residual plot in Figure 16 is used to test for homoscedasticity. Since
the residual plot does not fan out or funnel in, it passes the homoscedasticity test, and the second
assumption is met.
Autocorrelation Tests
The final assumption for the residual tests is that the errors are independent of one another.
Figure 17 will be used to test assumption 3: non-autocorrelated errors. If the errors are in fact
independent of one another, the data are considered non-autocorrelated. If the errors are
autocorrelated, the estimated variances are biased, which affects the confidence intervals. The
Durbin-Watson test will also be used to check for autocorrelation in the data set. As a general
rule, the residual plot should cross the centerline at zero about (n/2) times, or 50/2 = 25. If the
number of centerline crossings is more or less than 25, the data are considered negatively or
positively autocorrelated, respectively. There are 26 values above and 24 values below the 0.00
axis. This is indicative of a slight negative autocorrelation, but it is near enough to be called
uncorrelated. No autocorrelation exists for this test.
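Counting centerline crossings is a matter of counting sign changes between consecutive residuals, as in this minimal sketch. Residuals that are exactly zero are skipped here, which is an assumption about how ties are handled. The residual series is hypothetical.

```python
def count_centerline_crossings(residuals):
    """Count sign changes between consecutive nonzero residuals."""
    signs = [r > 0 for r in residuals if r != 0]
    return sum(1 for previous, current in zip(signs, signs[1:]) if previous != current)


def expected_crossings(n):
    """Rule-of-thumb number of crossings for uncorrelated errors."""
    return n / 2


# Hypothetical residuals in observation order, for illustration only.
hypothetical_residuals = [0.4, -0.2, 0.1, 0.3, -0.5, -0.1, 0.2, -0.3]
print(count_centerline_crossings(hypothetical_residuals))
print(expected_crossings(len(hypothetical_residuals)))
```

Following the rule stated above, noticeably more crossings than n/2 would point toward negative autocorrelation and noticeably fewer toward positive autocorrelation.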
The Durbin-Watson test statistic from the regression analysis, seen in Figure 18, is 2.68. This
value suggests a slight negative autocorrelation because it is greater than 2, but it is not close
to 4. This is consistent with the centerline crossings, which should number (50/2) = 25. The
residual plot actually has 26 centerline crossings, and more than (n/2) centerline crossings
suggests negative autocorrelation. Based on the Durbin-Watson test, which has slightly more validity
than the runs test, it can be concluded that no autocorrelation exists for this model.
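The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch follows, using hypothetical residuals rather than those in Appendix B.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 means no autocorrelation, near 0 positive, near 4 negative."""
    numerator = sum((current - previous) ** 2
                    for previous, current in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in observation order, for illustration only.
hypothetical_residuals = [0.4, -0.2, 0.1, 0.3, -0.5, -0.1, 0.2, -0.3]
print(round(durbin_watson(hypothetical_residuals), 2))
```

The statistic depends on the order of the observations, so for cross-sectional state data its meaning depends on how the states are ordered in the data set.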
Unusual Observations
The last step in the regression analysis is to evaluate the unusual observations. This is done with
a leverage test. If the leverage is beyond (5/n) = 0.10, there is an indication of high leverage.
A list of all the leverages for the data set can be found in Appendix B. This data set shows that
Colorado and Hawaii have very high leverage relative to (5/n) = 0.1, and North Dakota and Vermont
still have high leverage but are not as extreme as Colorado and Hawaii. These four data points are
considered unusual
Another method for identifying unusual residuals is to evaluate the studentized residuals in
Appendix B. The studentized residual values help reveal extreme outliers, those outside the -2 to 2
range. For this data set, three states fall below -2: Maryland, Minnesota, and Utah. Utah and
Minnesota have very high life expectancies, whereas Maryland has a moderate life expectancy. Utah
and Maryland both have unusual values for the independent variables; Utah has a low smoking rate and
a low obesity rate, while Maryland has the highest average income level. A possible explanation for
Utah's low obesity and smoking rates is the religious groups located in Utah (Cannon, 2010). The
religious beliefs established there have apparently affected lifestyle, improving these rates.
Minnesota also has a very high life expectancy; a contributing factor could be statewide bans on
smoking and unhealthy living practices (McCarthy, 2014). Maryland has a high average income due to
its cost of living; the cost of living is about 52% higher than that in Mississippi (Christie,
2009).
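The two screening rules used in this section can be sketched together: leverage above the 5/n cutoff quoted in this report, and a studentized residual outside the -2 to 2 range. The rows below are hypothetical placeholders rather than the Appendix B values.

```python
def flag_unusual_observations(rows, n, residual_cutoff=2.0):
    """Flag rows of (name, leverage, studentized_residual) using the report's cutoffs."""
    leverage_cutoff = 5 / n  # cutoff quoted in the report (0.10 for n = 50)
    flags = {}
    for name, leverage, studentized_residual in rows:
        reasons = []
        if leverage > leverage_cutoff:
            reasons.append("high leverage")
        if abs(studentized_residual) > residual_cutoff:
            reasons.append("outlier")
        if reasons:
            flags[name] = reasons
    return flags


# Hypothetical rows, for illustration only.
hypothetical_rows = [
    ("State A", 0.25, 0.4),
    ("State B", 0.05, -2.3),
    ("State C", 0.06, 1.1),
]
print(flag_unusual_observations(hypothetical_rows, n=50))
```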
Conclusion
A multivariate regression was performed to evaluate the relationship between the independent
variables (obesity, cancer, smoking, and income) and the dependent variable, life expectancy. The
regression equation Life Expectancy = 91.1658 - 20.7315 (Smoking Adults) - 0.0311 (Cancer Death
Rate) + 0.00001391 (Average Income) - 5.0349 (Adult Obesity/Overweight Rate) - 1.1314 (North or
South) shows that there is a relationship between our independent variables and our dependent
variable. This relationship indicates that life expectancy depends strongly on these variables.
Holding the other variables constant, every $1 increase in average income is associated with an
increase in life expectancy of 0.00001391 years. Assuming the rates enter the model as proportions,
every one-percentage-point increase in the obesity rate is associated with a decrease in life
expectancy of about 0.050 years (5.0349 x 0.01). Every additional cancer death per 100,000 is
associated with a decrease of 0.0311 years. Every one-percentage-point increase in the smoking
adults rate is associated with a decrease of about 0.207 years (20.7315 x 0.01). From the R² value
of 0.856 we conclude that 85.6% of the variability in life expectancy is explained by the
independent variables. The R² value, along with the series of hypothesis tests of the overall
model, slopes, and intercept, indicates that the independent variables are in fact significant in
relation to the dependent variable (life expectancy). While 85.6% of the variability is explained by
the regression analysis, 14.4% is not explained. This 14.4% could be attributed to many different
variables, for example genetics, extracurricular activities, and lifestyle.
This tells us that S.C. Insurance Agency can conclude that average income, obesity rate, cancer
deaths, and smoking are in fact significant variables in relation to life expectancy, and this will
help the agency better the company as a whole. It will help show which states S.C. Insurance Agency
will profit most from and which states will lose the most money. For example, the states that have a
lower life expectancy and show lower significance for all four variables will in return give S.C.
Insurance the most profit. On the other side, if the variables for life expectancy show high and
have a strong
Bibliography
Behavioral Risk Factor Surveillance System. (2013, August 15). Retrieved May 31, 2014, from Centers for Disease
Control and Prevention: http://www.cdc.gov/brfss/data_documentation/index.htm
Number of Cancer Deaths per 100,000 Population. (2013, May 8). Retrieved June 6, 2014, from The Kaiser Family
Foundation: http://kff.org/other/state-indicator/cancer-death-rate-per-100000/
State Health Facts. (2014). Retrieved May 31, 2014, from The Henry J. Kaiser Family Foundation:
http://kff.org/other/state-indicator/smoking-adults/#
Cannon, M. W. (2010, April 13). UCLA Study Proves Mormons live longer. Retrieved June 3, 2014, from Deseret
News: http://www.deseretnews.com/article/705377709/UCLA-study-proves-Mormons-live-longer.html?pg=all
Christie, L. (2009, September 22). Where to find the fattest paychecks. Retrieved June 6, 2014, from CNN Money:
http://money.cnn.com/2009/09/21/news/economy/highest_income_census/?postversion=2009092203
Doane, D., & Seward, L. (2013). Applied Statistics in Business and Economics. New York: McGraw-Hill/Irwin.
McCarthy, J. (2014, March 13). In U.S., Smoking Rate Highest in Kentucky, Lowest in Utah. Retrieved June 3, 2014,
from Gallup Well-being: http://www.gallup.com/poll/167771/smoking-rate-lowest-utah-highest-kentucky.aspx?utm_source=rss&utm_medium=rss&utm_campaign=in-u-s-smoking-rate-lowest-in-utah-highest-in-kentucky-smoking-rate-in-alaska-has-dropped-the-most-since-2008
Noss, A. (2012, September 1). Household Income for States: 2010 and 2011. Retrieved June 6, 2014, from U.S.
Census: http://www.census.gov/prod/2012pubs/acsbr11-02.pdf
Appendix A
Appendix B