Midterm Project Group 6
Group 6
Major: Logistics and Supply Chain Management
Course: Applied Statistics
Member
Nguyễn Đình Thái Dương 20223587
2023-12-27
I. Introduction
1. Motivation
The global insurance industry is worth trillions of dollars, so it is no surprise that insurers
are turning to technology to help them cut costs, better understand their customers and
develop new products. The insurance industry is no stranger to fraud. In fact, it’s estimated
that 10% of all insurance claims are fraudulent. This costs the industry billions of dollars
every year.
By applying data analysis in the business, enterprises can address these issues and create
new ways to sell, distribute, and underwrite insurance products.
2. Preparation
• Load required libraries
library(stats)      # base statistical functions
library(dplyr)      # data manipulation
library(knitr)      # report formatting
library(stargazer)  # regression tables
library(car)        # ncvTest() and durbinWatsonTest() used below
library(broom)      # tidy model output
library(gridExtra)  # arranging multiple plots
• Import Dataset
setwd("D:/Project")
# Read the insurance dataset (file name assumed; the original read call was lost in extraction)
Insurance <- read.csv("insurance.csv")
dim(Insurance)
## [1] 1338 7
# Attach so that variables can be referenced by name
attach(Insurance)
table(sex)
## sex
## female male
## 662 676
table(smoker)
## smoker
## no yes
## 1064 274
table(region)
## region
## northeast northwest southeast southwest
## 324 325 364 325
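The variable summaries quoted in the description below were presumably produced with summary(); a minimal reconstruction, since the original call was not preserved:
# Numerical summary of every variable in the dataset
summary(Insurance)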
Description
From the summary above, we examined an insurance dataset of 1338 individuals covering 7
variables: age, sex, BMI, number of children, smoking status, region, and insurance
charges. 662 subjects were female and 676 were male. Ages ranged from 18 to 64 years,
with a mean of 39.21 years. BMI values ranged from 15.96 to 53.13, with a mean of 30.66,
indicating that the average individual is overweight. The number of children per individual
ranged from 0 to 5, averaging about 1.1 children per person.
Out of 1338 people, only 274 smoked while the rest did not. Insurance charges varied from
1122 to 63770, with a mean of 13270. The dataset also includes regional information,
which may affect insurance charges: 324 individuals live in the Northeast, 325 in the
Northwest, 364 in the Southeast, and 325 in the Southwest.
This dataset offers a comprehensive view of the demographic and lifestyle factors related
to insurance charges, presenting opportunities for further analysis.
• Illustration of “Gender”
# Count of Gender
gender_count <- table(Insurance$sex)
barplot(gender_count,
main = "Distribution of Gender",
xlab = "Gender",
ylab = "Count",
ylim = c(0,700),
col = c("lightblue","lightpink"),
border = "black")
• Illustration of “Smoker”
#Count of Smoker
smoker_count <- table(Insurance$smoker)
barplot(smoker_count,
main = "Distribution of Smoker",
xlab = "Smoker",
ylab = "Count",
col = c("lightblue","lightpink"),
ylim = c(0,1200)
)
• Illustration of “Region”
#Count of the region
count_region <- table(Insurance$region)
#Create bar chart
barplot(count_region,
main = "Distribution of Region",
ylab = "Count",
xlab = "Region",
ylim = c(0,400),
col = "lightblue")
• Illustration of “Children”
# Count the number of children of each individual and plot the counts
children_counts <- table(Insurance$children)
barplot(children_counts, main = "Distribution of Children", xlab = "Children", ylab = "Count", col = "lightblue")
• Illustration of “Charges”
# Distribution of Charges
# Create a histogram of charges
hist(Insurance$charges,
main = "Distribution of Charges",
xlab='Charges',
xlim = c(0,70000),
ylim = c(0,400),
col = "lightblue")
4. Consider the relationship between the variables pairwise
We examine the relationships between pairs of variables, beginning with age and gender.
• Relationship between “Age” and “Sex”
#Create boxplot to compare the age distribution of Male and Female
boxplot( age ~ sex, data = Insurance,
col = c("lightblue", "lightpink"),
xlab = "Gender",
ylab = "Age",
ylim = c(10,70),
main = "Age vs Gender")
# Create histograms to illustrate more details
par(mfrow = c(1,2))
# Histogram for Female
hist(Insurance$age[Insurance$sex == "female"],
col = "lightpink",
xlab = "Age",
ylab = "Frequency",
ylim = c(0,100),
main = "Female")
# Histogram for Male
hist(Insurance$age[Insurance$sex == "male"],
col = "lightblue",
xlab = "Age",
ylab = "Frequency",
ylim = c(0,100),
main = "Male")
# Add legend
legend("topright",
legend = c("Male", "Female"),
fill = c("lightblue", "lightpink"),
bg = "white")
• Relationship between “Age” and “Charges”
# Scatterplot of age vs. charges (NOT taking into account smoking exposure)
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs Charges",
)
# Scatterplot of Age vs. charges (taking into account smoking exposure)
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs Charges",
col = ifelse(Insurance$smoker == "yes","red","blue"))
# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “BMI” and “Charges”
# Scatterplot of BMI vs. charges (NOT taking into account smoking exposure)
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
)
# Scatterplot of bmi vs. charges taking into account smoking exposure
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue"))
# Add a legend
legend("topright",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “Smoker” and “Charges”
boxplot(charges ~ smoker,data = Insurance,
main = "Smoker vs Charges",
ylab = "Charge",
xlab = "Smoker",
col = "lightblue")
• Relationship between “Children” and “Charges”
plot(Insurance$children, Insurance$charges,
ylab = "Charge",
xlab = "Children",
main = "Children vs Charges",
col = "blue")
• Computation of correlation between variables
# Compute the correlation matrix for numerical variables
correlation_matrix <- cor(Insurance[, c("age", "bmi", "children",
"charges")])
correlation_matrix
5. Simple Linear Regression
Model of Age in predicting Charges
# Fit a simple linear regression of charges on age (the original call was lost in extraction)
age_regression <- lm(charges ~ age, data = Insurance)
summary(age_regression)
##
## Call:
## lm(formula = charges ~ age, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## age 257.7 22.5 11.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 2.2e-16
Description:
The data were obtained from 1338 individuals. The relationship between age and charges
was studied, with charges as the dependent variable to be estimated from the independent
variable, age. On the basis of the data, the following regression line was determined:
Y = 3165.9 + 257.7X, where X is age and Y is charges. The intercept is 3165.9, meaning that
the predicted value of Y (charges) is 3165.9 when X (age) equals zero. Since the youngest
individual in the dataset is 18, this is an extrapolation with no practical interpretation.
The residuals, which represent the differences between the observed values and the
predicted values, range from -8059 to 47829. The first quartile (1Q) is -6671, the median is
-5939, and the third quartile (3Q) is 5440.
Now let's look at the slope, which is 257.7. The slope is interpreted as follows: Y (charges)
is predicted to increase by 257.7 when X (age) increases by one.
The t-value for the coefficient of “age” is 11.453, which indicates that it is statistically
significant. The corresponding p-value is < 2e-16, which is extremely small, providing
strong evidence against the null hypothesis that the coefficient is zero.
The residual standard error is 11560, which represents the average amount by which the
observed values deviate from the predicted values. The model has been fitted to 1336
degrees of freedom.
The multiple R-squared value is 0.08941, indicating that approximately 8.941% of the
variance in the dependent variable can be explained by the independent variable “age.” The
adjusted R-squared value, which considers the number of predictors in the model, is
0.08872.
The F-statistic is 131.2, with a corresponding p-value of < 2.2e-16. This suggests that the
overall model is statistically significant, indicating that there is a relationship between the
independent variable “age” and the dependent variable “charges.”
In summary, the simple linear regression model suggests that there is a significant positive
relationship between age and charges in the given dataset. For each year increase in age,
the charges are estimated to increase by approximately 257.7, after accounting for the
intercept term. However, the overall predictive power of the model is relatively low, with
the independent variable “age” explaining only around 8.941% of the variance in the
dependent variable “charges.”
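To quantify the uncertainty around these estimates, confidence intervals for the coefficients and a prediction at an illustrative age (40, an assumed value not taken from the report) can be obtained as follows:
# 95% confidence intervals for the intercept and the age slope
confint(age_regression)
# Predicted charges for a 40-year-old, with a confidence interval
predict(age_regression, newdata = data.frame(age = 40), interval = "confidence")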
• Graph the scatter plot with the regression line
# Describe using scatter plot
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue")) # Color points based on smoker status
# Add the fitted regression line
abline(age_regression, lwd = 2)
# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
Diagnostic test for Model of age predicting charges
par(mfrow = c(1,1))
• Normality test
# Extract the residuals
Res_age <- residuals(age_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_age)
##
## Shapiro-Wilk normality test
##
## data: Res_age
## W = 0.65773, p-value < 2.2e-16
From the Shapiro-Wilk normality test, the p-value is smaller than 0.05, so we reject the
null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually, we can inspect the Q-Q plot: the residuals do not follow the normal reference
line and appear under-dispersed.
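The Q-Q plot referred to above can be drawn as follows (the original plotting call was not preserved):
# Q-Q plot of the residuals of the age model
qqnorm(Res_age)
qqline(Res_age, col = "red")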
• Testing for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test <- ncvTest(age_regression)
print(bp_test)
As the p-value is larger than the typical significance level (0.05), we fail to reject the null
hypothesis of the Breusch-Pagan test: the data meet the assumption of homoscedasticity.
This suggests that there is no significant evidence of heteroscedasticity in the residuals of
the linear regression model. In other words, the assumption of constant variance is not
violated, and the variability of the residuals is reasonably consistent across all levels of the
predictor.
Visually, in the residuals vs. fitted plot, the residuals spread equally along the range of the
predictor.
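The residuals vs. fitted plot mentioned above is the first of R's built-in diagnostic plots:
# Residuals vs. fitted values for the age model
plot(age_regression, which = 1)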
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(age_regression)
The Durbin-Watson statistic is 2.033284, which is close to 2. This suggests that there is
little to no evidence of serial correlation (autocorrelation) in the residuals of the linear
regression model.
The p-value associated with the test is 0.518. Since this is greater than the typical
significance level of 0.05, there is insufficient evidence to reject the null hypothesis of no
serial correlation: there is no significant autocorrelation in the residuals.
Model of BMI in predicting Charges
# Fit a simple linear regression of charges on BMI (the original call was lost in extraction)
bmi_regression <- lm(charges ~ bmi, data = Insurance)
summary(bmi_regression)
##
## Call:
## lm(formula = charges ~ bmi, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20956 -8118 -3757 4722 49442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1192.94 1664.80 0.717 0.474
## bmi 393.87 53.25 7.397 2.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
## F-statistic: 54.71 on 1 and 1336 DF, p-value: 2.459e-13
Description:
The data were obtained from 1338 individuals. The relationship between BMI and charges
was studied, with charges as the dependent variable to be estimated from the independent
variable, BMI. On the basis of the data, the following regression line was determined:
Y = 1192.94 + 393.87X, where X is the BMI score and Y is charges. The intercept is 1192.94,
meaning that the predicted value of Y (charges) is 1192.94 when X (BMI) equals zero.
However, a BMI below 10 is implausible for a human being (the minimum observed here is
15.96), so the intercept has no practical interpretation.
The residuals, which represent the differences between the observed values and the
predicted values, range from -20956 to 49442. The first quartile (1Q) is -8118, the median
is -3757, and the third quartile (3Q) is 4722.
Now let's look at the slope, which is 393.87. The slope is interpreted as follows: Y (charges)
is predicted to increase by 393.87 when X (BMI) increases by one.
The t-value for the coefficient of “bmi” is 7.397, which indicates that it is statistically
significant. The corresponding p-value is 2.46e-13, which is very small, suggesting strong
evidence against the null hypothesis that the coefficient is zero.
The residual standard error is 11870, which represents the average amount by which the
observed values deviate from the predicted values. The model has been fitted to 1336
degrees of freedom.
The multiple R-squared value is 0.03934, indicating that approximately 3.93% of the
variance in the dependent variable can be explained by the independent variable “bmi.”
The adjusted R-squared value, which takes into account the number of predictors in the
model, is 0.03862.
The F-statistic is 54.71, with a corresponding p-value of 2.459e-13. This suggests that the
overall model is statistically significant, indicating that there is a relationship between the
independent variable “bmi” and the dependent variable “charges.”
In summary, the simple linear regression model suggests that there is a significant positive
relationship between BMI and charges in the given dataset. For each unit increase in BMI,
the charges are estimated to increase by approximately 393.87, after accounting for the
intercept term. However, the overall predictive power of the model is relatively low, with
the independent variable “bmi” explaining only around 3.93% of the variance in the
dependent variable “charges.”
• Graph the scatterplot with the regression line
# Describe the scatterplot
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue")) # Color points based on smoker status
# Add the fitted regression line
abline(bmi_regression, lwd = 2)
# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
Diagnostic Test for Model of BMI in predicting Charges
• Normality test
# Extract the residuals
Res_bmi <- residuals(bmi_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_bmi)
##
## Shapiro-Wilk normality test
##
## data: Res_bmi
## W = 0.86198, p-value < 2.2e-16
From the Shapiro-Wilk normality test, the p-value is smaller than 0.05, so we reject the
null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually, we can inspect the Q-Q plot: the residuals do not follow the normal reference
line, and the distribution appears skewed to the right.
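As before, the Q-Q plot can be produced with R's second built-in diagnostic plot (the original call was not preserved):
# Normal Q-Q plot of the residuals of the BMI model
plot(bmi_regression, which = 2)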
• Testing for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test2 <- ncvTest(bmi_regression)
print(bp_test2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 178.9783, Df = 1, p = < 2.22e-16
As the p-value is smaller than 0.05, we reject the null hypothesis of the Breusch-Pagan test
and conclude that there is strong evidence of heteroscedasticity in the residuals of the
linear regression model. This means that the assumption of constant variance is violated:
the variability of the residuals is not the same across all levels of the predictor.
For visual inspection, we look at the residuals vs. fitted plot: the residuals spread
unequally along the range of the predictor, taking a diverging cone/triangular shape.
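A common remedy for heteroscedasticity of this shape, not attempted in the original analysis, is to model the logarithm of charges instead; a brief sketch:
# Refit with log-transformed charges to stabilize the residual variance
bmi_log_regression <- lm(log(charges) ~ bmi, data = Insurance)
ncvTest(bmi_log_regression) # re-check the constant-variance assumption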
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(bmi_regression)
The Durbin-Watson statistic is 1.983154, which is close to 2. This suggests that there is
little to no evidence of serial correlation (autocorrelation) in the residuals of the linear
regression model.
The p-value associated with the test is 0.764. Since this is greater than the typical
significance level of 0.05, there is insufficient evidence to reject the null hypothesis of no
serial correlation: there is no significant autocorrelation in the residuals.
6. Multiple Regression
Similar to Simple Linear Regression, an important aspect when building a multiple linear
regression model is to make sure that the following key assumptions are met.
1. Residual values are normally distributed
2. There is a linear relationship between the dependent and the independent variables
3. There is no multicollinearity among the independent variables
4. The homoscedasticity (constant variance) assumption holds
Our group considers the independent variables Age, BMI, Children, and Smoker (we
encoded the value of smoker as 1 (yes) and 0 (no)).
# Encode the smoker variable
smoker_encoded <- ifelse(Insurance$smoker == "yes", 1, 0)
head(smoker_encoded)
## [1] 1 0 0 0 0 0
Model of BMI, Age, Children and Smoker in predicting Charges
# Define the formula for the regression model
formula <- charges ~ age + bmi + children + smoker_encoded
# Fit the multiple regression model (the original fitting call was lost in extraction)
multiple_regression <- lm(formula, data = Insurance)
summary(multiple_regression)
##
## Call:
## lm(formula = formula, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11897.9 -2920.8 -986.6 1392.2 29509.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12102.77 941.98 -12.848 < 2e-16 ***
## age 257.85 11.90 21.675 < 2e-16 ***
## bmi 321.85 27.38 11.756 < 2e-16 ***
## children 473.50 137.79 3.436 0.000608 ***
## smoker_encoded 23811.40 411.22 57.904 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7489
## F-statistic: 998.1 on 4 and 1333 DF, p-value: < 2.2e-16
Description:
The residuals represent the differences between the actual insurance charges and the
predicted charges from the model. The minimum residual is -11,897.9, and the maximum
residual is 29,509.6. The residuals’ distribution shows a range of positive and negative
values, which suggests some variability in the model’s predictions.
The multiple R-squared value is 0.7497, which indicates that approximately 74.97% of the
variation in insurance charges can be explained by the predictor variables included in the
model.
The adjusted R-squared value is 0.7489. It takes into account the number of predictor
variables and adjusts the R-squared value accordingly. In this case, the adjusted R-squared
value is very close to the multiple R-squared value.
The F-statistic tests the overall significance of the model. The obtained F-statistic is 998.1,
with a very low p-value (p < 0.001), indicating that the model as a whole is statistically
significant in predicting insurance charges.
The residual standard error is a measure of the average deviation of the observed
insurance charges from the predicted values. In this model, the residual standard error is
6068, indicating the average difference between the observed and predicted charges.
In the given multiple regression model, the regression line can be represented as:
Insurance Charges = -12,102.77 + 257.85 * age + 321.85 * bmi + 473.50 * children +
23,811.40 * smoker_encoded
The intercept term of -12,102.77 represents the estimated insurance charges when all
predictor variables (age, bmi, children, and smoker_encoded) are zero.
Each predictor variable is multiplied by its respective coefficient estimate obtained from
the model.
For the “age” variable, the coefficient estimate is 257.85. This means that, on average, for
each one-unit increase in age, the estimated insurance charges increase by 257.85,
assuming all other variables are held constant.
The “bmi” variable has a coefficient estimate of 321.85. For each one-unit increase in BMI,
the estimated insurance charges increase by 321.85, on average, assuming all other
variables are held constant.
The “children” variable has a coefficient estimate of 473.50. For each additional child, the
estimated insurance charges increase by 473.50, on average, assuming all other variables
are held constant.
The “smoker_encoded” variable is a binary variable representing whether an individual is a
smoker or not. It has a coefficient estimate of 23,811.40. This means that smokers have
estimated insurance charges that are 23,811.40 higher, on average, compared to
non-smokers, assuming all other variables are held constant.
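To illustrate the fitted equation, we can predict the charges for a hypothetical individual, say a 40-year-old smoker with a BMI of 30 and 2 children (illustrative values, not from the original report):
# By hand: -12102.77 + 257.85*40 + 321.85*30 + 473.50*2 + 23811.40*1 = 32625.13
new_person <- data.frame(age = 40, bmi = 30, children = 2, smoker_encoded = 1)
predict(multiple_regression, newdata = new_person)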
In conclusion, the multiple regression analysis suggests that age, BMI, the number of
children, and smoking status are significant predictors of insurance charges. The model
explains a considerable proportion of the variation in insurance charges, as indicated by
the high R-squared value and the significant F-statistic. However, it’s important to note that
there may be other factors not included in the model that could also influence insurance
charges.
• Graph the scatter plots with the regression line
# Extract the model coefficients
coefficients <- coef(multiple_regression)
# Create a list of variable names
variable_names <- c("age", "bmi", "children", "smoker_encoded")
par(mfrow = c(1,1))
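The scatter plots themselves were not preserved in this output; a minimal sketch of how each predictor could be plotted against charges with the corresponding fitted slope overlaid (a marginal view with the other predictors held at zero, offered as an assumption about the original figures):
# Plot each predictor against charges and overlay its fitted slope
predictor_list <- list(Insurance$age, Insurance$bmi, Insurance$children, smoker_encoded)
for (i in 1:4) {
  plot(predictor_list[[i]], Insurance$charges,
       xlab = variable_names[i], ylab = "Charges")
  abline(a = coefficients[1], b = coefficients[i + 1], col = "red")
}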
All four coefficients have p-values near 0 (smaller than 0.05), so we reject the null
hypothesis that each coefficient is zero: every predictor contributes significantly to the
model.
• Linearity test
Visually, looking at the residuals vs. fitted plot, the residuals do not scatter randomly;
they follow a particular pattern, and the red trend line does not stay horizontal. This casts
doubt on the linearity assumption.
# Residuals vs. Fitted Values Plot
plot(multiple_regression, which = 1)
• Normality test
# Extract the residuals
Res_multiple <- residuals(multiple_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_multiple)
##
## Shapiro-Wilk normality test
##
## data: Res_multiple
## W = 0.89958, p-value < 2.2e-16
From the Shapiro-Wilk normality test, the p-value is smaller than 0.05, so we reject the
null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually inspecting the Q-Q plot, the residuals do not follow the normal reference line
(they should lie along it) and appear under-dispersed.
• Test for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test3 <- ncvTest(multiple_regression)
print(bp_test3)
Since the p-value is extremely small, much smaller than the typical significance level of
0.05, we reject the null hypothesis of the Breusch-Pagan test and conclude that there is
strong evidence of heteroscedasticity in the residuals of the model. This means that the
assumption of constant variance is violated: the variability of the residuals is not the same
across all levels of the predictors.
Visually, in the residuals vs. fitted values plot, the residuals do not spread equally along
the range of predictors and tend to fan out in a cone/triangular shape.
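When the constant-variance assumption fails, a standard response (not part of the original analysis) is to report heteroscedasticity-robust standard errors; a sketch assuming the sandwich and lmtest packages are available:
library(sandwich)
library(lmtest)
# Coefficient tests with HC3 heteroscedasticity-robust standard errors
coeftest(multiple_regression, vcov = vcovHC(multiple_regression, type = "HC3"))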
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(multiple_regression)
IV. References
1. Journal of Healthcare Engineering. (2022). Estimation and Prediction of Hospitalization
and Medical Care Costs Using Regression in Machine Learning. Retrieved from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8906954/
2. Bhumika Bhatt. (2019). Predictors of medical expenses. Retrieved from
https://www.kaggle.com/code/bbhatt001/predictors-of-medical-expenses/log