
Hanoi University of Science and Technology

CASE STUDY REPORT

Products game
Group 6
Major: Logistics and Supply Chain Management
Course: Applied Statistics

Members
Nguyễn Đình Thái Dương 20223587

Nguyễn Đức Nam 20223630

Hoàng Đức Dũng 20223545

Hoàng Minh Tuấn 20223671


Mid-term Project
Group 6

2023-12-27

I. Introduction
1. Motivation
The global insurance industry is worth trillions of dollars, so it is no surprise that insurers
are turning to technology to help them cut costs, better understand their customers and
develop new products. The insurance industry is no stranger to fraud. In fact, it’s estimated
that 10% of all insurance claims are fraudulent. This costs the industry billions of dollars
every year.
By applying data analysis to the business, insurers can address these issues and create new ways to sell, distribute, and underwrite insurance products.

2. Preparation
• Load required libraries
library(stats)
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

library(knitr)
library(stargazer)

##
## Please cite as:

## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary
## Statistics Tables. R package version 5.2.3.
## https://CRAN.R-project.org/package=stargazer


library(ggplot2)
library(car)

## Loading required package: carData

##
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
##
##     recode

library(broom)

• For data visualization


library(ggcorrplot)
library(gridExtra)

##
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
##
##     combine

• Import Dataset
setwd("D:/Project")

The dataset is available at "https://www.kaggle.com/code/bbhatt001/predictors-of-medical-expenses/notebook" under the headline "Predictors of medical expenses" by Bhumika Bhatt, published in 2019.
• Load the data set
Insurance <- read.csv("insurance.csv",header = T)

II. Project Questions


1. Variable Explanation
• Illustrate the header of the data set
head(Insurance)

## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
What could be the variation to be explained?
In the context of insurance charges, the variation to be explained is the amount of charges (the "charges" column), modeled as a function of factors such as Age, Sex, BMI, Number of Children, Smoking status, and Region.
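To quantify this variation directly, here is a minimal sketch using only base R to measure the spread of the "charges" column:
# Spread of the response variable we aim to explain
var(Insurance$charges) # variance of charges
sd(Insurance$charges)  # standard deviation of charges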

2. Description of the Data


The code below summarises the quantitative variables, namely "charges", "bmi", "children", and "age". Furthermore, it provides frequency tables for the categorical variables "sex", "smoker", and "region".
• Summary of the data (Quantitative)
# Retrieve the dimension of the Data
dim(Insurance)

## [1] 1338 7

# Summarise the data
summary(Insurance)

## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770

• Summary of categorical variables


# Frequency tables for the categorical variables
attach(Insurance)
table(sex)

## sex
## female male
## 662 676

table(smoker)

## smoker
## no yes
## 1064 274
table(region)

## region
## northeast northwest southeast southwest
## 324 325 364 325

Description
From the summary above, we examined an insurance dataset of 1338 individuals, covering 7 key factors: age, sex, BMI, number of children, smoking status, region, and insurance charges. 662 subjects were female and 676 were male. Ages ranged from 18 to 64 years, with a mean of 39.21 years. BMI values ranged from 15.96 to 53.13, with a mean of 30.66, indicating that the average individual is overweight. The number of children per individual ranged from 0 to 5, averaging about 1.1.
Out of 1338 people, only 274 smoked while the others did not. Insurance charges varied from 1122 to 63770, with a mean of 13270. The dataset also includes regional information, which may affect insurance charges: 324 individuals live in the North East, 325 in the North West, 364 in the South East, and 325 in the South West.
This dataset offers a comprehensive view of the demographic and lifestyle factors related to insurance charges, presenting opportunities for further analysis.
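As a small supplementary sketch using base R's prop.table(), the category counts above can also be expressed as shares of the sample:
# Proportions of smokers and of each region (derived from the tables above)
prop.table(table(Insurance$smoker))
prop.table(table(Insurance$region))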

3. Characteristics of all the variables


• Illustrations of “Age”
# Create a boxplot for the 'age' variable
boxplot(Insurance$age,
main = "Distribution of Age",
xlab = "Age",
col = "lightblue",
border = "black",
horizontal = T)
# Count of age column
barplot(table(Insurance$age),
main = "Distribution of Ages",
xlab = "Age",
ylab = "Frequency",
col = "skyblue",
border = "black",
ylim = c(0,70)
)
• Illustration of “Gender”

# Count of Gender
gender_count <-table(Insurance$sex)
barplot(gender_count,
main = "Distribution of gender",
xlab = "Gender",
ylab = "Count",
ylim = c(0,700),
col = c("lightblue","lightpink"),
border = "black",
)
• Illustration of “Smoker”
#Count of Smoker
smoker_count <- table(Insurance$smoker)
barplot(smoker_count,
main = "Distribution of Smoker",
xlab = "Smoker",
ylab = "Count",
col = c("lightblue","lightpink"),
ylim = c(0,1200)
)
• Illustration of “Region”
#Count of the region
count_region <- table(Insurance$region)
#Create bar chart
barplot(count_region,
main = "Distribution of Region",
ylab = "Count",
xlab = "Region",
ylim = c(0,400),
col = c("lightblue"))
• Illustration of “Children”
# Count the number of children of each individual in the dataset
children_counts <- table(Insurance$children)

# Create a bar chart for the number of children


barplot(children_counts,
xlab = "Number of Children",
ylab = "Count",
ylim = c(0,600),
col = "lightblue")
• Illustrations of “BMI”
# Distribution of BMI
# Create the first plot (histogram)
plot1 <- ggplot(Insurance, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "white",
                 aes(y = ..density..)) +
  labs(title = "BMI Distribution", x = "BMI", y = "Density") +
  theme_minimal()

# Create the second plot (histogram with density curve)
plot2 <- ggplot(Insurance, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "white",
                 aes(y = ..density..)) +
  geom_density(alpha = 0.5, fill = "orange", color = "black") +
  labs(title = "BMI Distribution with Density Curve", x = "BMI",
       y = "Density") +
  theme_minimal()

# Define the combined plot
combined_plot <- grid.arrange(plot1, plot2, ncol = 2)

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

• Illustration of “Charges”
# Distribution of Charges
# Create a histogram of charges
hist(Insurance$charges,
main = "Distribution of Charges",
xlab='Charges',
xlim = c(0,70000),
ylim = c(0,400),
col = "lightblue")
4. Consider the relationship between the variables pairwise
We take into account pairwise relationships between the variables, such as age and gender.
• Relationship between “Age” and “Sex”
#Create boxplot to compare the age distribution of Male and Female
boxplot( age ~ sex, data = Insurance,
col = c("lightblue", "lightpink"),
xlab = "Gender",
ylab = "Age",
ylim = c(10,70),
main = "Age vs Gender")
# Create Histogram to illustrate more details
par(mfrow = c(1,2))
# Histogram for Female
hist(Insurance$age[Insurance$sex == "female"],
col = "lightpink",
xlab = "Age",
ylab = "Frequency",
ylim = c(0,100),
main = "Female")

# Histogram for Male


hist(Insurance$age[Insurance$sex == "male"],
col = "lightblue",
xlab = "Age",
ylab = "Frequency",
ylim = c(0,100),
main = "Male")
• Relationship between “BMI” and “Age”
# Create scatter plot of BMI vs Age (not taking gender into account)
plot(Insurance$age,Insurance$bmi,
ylab = "BMI",
xlab = "Age",
col = "black"
)
# Create scatter plot of BMI vs Age (taking gender into account)
plot(Insurance$age,Insurance$bmi,
ylab = "BMI",
xlab = "Age",
col = ifelse(Insurance$sex == "female","red","blue"))

# Add legend
legend("topright",
legend = c("Male", "Female"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “Age” and “Charges”
# Scatter plot of age vs charges (NOT taking smoking status into account)
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs Charges",
)
# Scatterplot of Age vs. charges (taking into account smoking exposure)
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs Charges",
col = ifelse(Insurance$smoker == "yes","red","blue"))

# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “BMI” and “Charges”
# Scatterplot of BMI vs charges (NOT taking smoking status into account)
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
)
# Scatterplot of bmi vs. charges taking into account smoking exposure
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue"))

# Add a legend
legend("topright",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “Smoker” and “Charges”
boxplot(charges ~ smoker,data = Insurance,
main = "Smoker vs Charges",
ylab = "Charge",
xlab = "Smoker",
col = "lightblue")
• Relationship between “Children” and “Charges”
plot(Insurance$children, Insurance$charges,
ylab = "Charge",
xlab = "Children",
main = "Children vs Charges",
col = "blue")
• Computation of correlation between variables
# Compute the correlation matrix for numerical variables
correlation_matrix <- cor(Insurance[, c("age", "bmi", "children",
"charges")])
correlation_matrix

## age bmi children charges
## age 1.0000000 0.1092719 0.04246900 0.29900819
## bmi 0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges 0.2990082 0.1983410 0.06799823 1.00000000

# Create a heatmap of the correlation matrix
heatmap(correlation_matrix,
        col = colorRampPalette(c("blue", "white", "red"))(100), # color palette
        symm = TRUE,          # show symmetrically
        margins = c(10, 10))  # add margins for labels
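Since ggcorrplot was loaded in the preparation step but not used above, an alternative visualization is possible; the following is a sketch, not part of the original output:
# Correlation heatmap with the coefficients printed in each cell
ggcorrplot(correlation_matrix,
           type = "lower", # show only the lower triangle
           lab = TRUE,     # print the correlation values
           colors = c("blue", "white", "red"))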
5. Simple regression
The regression investigation considers only the numeric variables. The correlation matrix above shows positive (though modest) correlations of "age" and "bmi" with "charges". We therefore describe the relationship between each independent variable ("age" and "bmi") and the dependent variable ("charges"). We can use R to check that our data meet the four main assumptions for linear regression:
1. Independence of observations (because each model has only one independent variable and one dependent variable, we do not need to test for hidden relationships among predictors)
2. Normality
3. Linearity
4. Homoscedasticity
Model of age predicting charges
• Fitting linear models
# fit and summarise the LM
age_regression <-lm (charges ~ age, data = Insurance)
summary (age_regression)

##
## Call:
## lm(formula = charges ~ age, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## age 257.7 22.5 11.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 2.2e-16

Description:
The data were obtained from 1338 individuals. The relationship between age and charges was studied, with charges as the dependent variable to be estimated from the independent variable, age. On the basis of the data, the following regression line was determined: Y = 3165.9 + 257.7X, where X is age and Y is charges. The intercept is 3165.9, meaning the predicted value of Y (charges) is 3165.9 when X (age) equals zero.
The residuals, which represent the differences between the observed values and the
predicted values, range from -8059 to 47829. The first quartile (1Q) is -6671, the median is
-5939, and the third quartile (3Q) is 5440.
Now , let’s look at the slope, which is 257.7 . The slope is interpreted as follows: y (Charges)
is predicted to increase 257.7 when x (age) increase by one.
The t-value for the coefficient of “age” is 11.453, which indicates that it is statistically
significant. The corresponding p-value is < 2e-16, which is extremely small, providing
strong evidence against the null hypothesis that the coefficient is zero.
The residual standard error is 11560, which represents the average amount by which the
observed values deviate from the predicted values. The model has been fitted to 1336
degrees of freedom.
The multiple R-squared value is 0.08941, indicating that approximately 8.941% of the
variance in the dependent variable can be explained by the independent variable “age.” The
adjusted R-squared value, which considers the number of predictors in the model, is
0.08872.
The F-statistic is 131.2, with a corresponding p-value of < 2.2e-16. This suggests that the
overall model is statistically significant, indicating that there is a relationship between the
independent variable “age” and the dependent variable “charges.”
In summary, the simple linear regression model suggests that there is a significant positive
relationship between age and charges in the given dataset. For each year increase in age,
the charges are estimated to increase by approximately 257.7, after accounting for the
intercept term. However, the overall predictive power of the model is relatively low, with
the independent variable “age” explaining only around 8.941% of the variance in the
dependent variable “charges.”
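To complement these point estimates, confidence intervals for the coefficients can be extracted with base R's confint(); a minimal sketch assuming the age_regression object fitted above:
# 95% confidence intervals for the intercept and the age slope
confint(age_regression, level = 0.95)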
• Graph the scatter plot with the regression line
# Describe using scatter plot
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue")) # color points by smoker status

# Add the regression line from the lm_model to the plot


abline(age_regression, col = "green", lwd = 1)

# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
Diagnostic test for Model of age predicting charges

Sketch the standard Residual plot in R


par(mfrow = c(2,2))

#Create Residuals vs. Fitted Values Plot


plot(age_regression, which = 1)

# Create Normal Q-Q Plot


plot(age_regression, which = 2)

# Create Scale-Location Plot


plot(age_regression, which = 3)

# Create Residuals vs. Leverage Plot


plot(age_regression, which = 5)

par(mfrow = c(1,1))

• Normality test
# Extract the residuals
Res_age <- residuals(age_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_age)
##
## Shapiro-Wilk normality test
##
## data: Res_age
## W = 0.65773, p-value < 2.2e-16

From the Shapiro-Wilk normality test, the p-value for the age model's residuals is smaller than 0.05, so we reject the null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually, the Q-Q plot confirms this: the residuals deviate substantially from the reference line rather than following the normal distribution.
• Testing for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test <- ncvTest(age_regression)
print(bp_test)

## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0007999315, Df = 1, p = 0.97744

As the p-value is larger than the typical significance level (0.05), we fail to reject the null hypothesis of the Breusch-Pagan test: the data meet the homoscedasticity assumption. There is no significant evidence of heteroscedasticity in the residuals of this linear regression model; the assumption of constant variance is not violated, and the variability of the residuals is reasonably consistent across all levels of the predictor.
Visually, in the Residuals vs Fitted plot, the residuals spread roughly equally along the range of the predictor.
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(age_regression)

## lag Autocorrelation D-W Statistic p-value
## 1 -0.01715458 2.033284 0.53
## Alternative hypothesis: rho != 0

The Durbin-Watson statistic is 2.033, which is close to 2. This suggests little to no evidence of serial correlation (autocorrelation) in the residuals of this linear regression model.
The p-value associated with the test is 0.53. Since this is greater than the typical significance level of 0.05, there is insufficient evidence to reject the null hypothesis of no serial correlation in the residuals.

Model of BMI in predicting Charges


• Fitting linear model
# fit and summarise the LM
bmi_regression <- lm(charges ~ bmi, data = Insurance)

# Summarise the model


summary (bmi_regression)

##
## Call:
## lm(formula = charges ~ bmi, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20956 -8118 -3757 4722 49442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1192.94 1664.80 0.717 0.474
## bmi 393.87 53.25 7.397 2.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
## F-statistic: 54.71 on 1 and 1336 DF, p-value: 2.459e-13

Description:
The data were obtained from 1338 individuals. The relationship between BMI and charges was studied, with charges as the dependent variable to be estimated from the independent variable, BMI. On the basis of the data, the following regression line was determined: Y = 1192.94 + 393.87X, where X is the BMI score and Y is charges. The intercept is 1192.94, meaning the predicted value of Y (charges) is 1192.94 when X (BMI) equals zero. Note, however, that this is an extrapolation: a human being is very unlikely to have a BMI below 10, let alone zero.
The residuals, which represent the differences between the observed values and the
predicted values, range from -20956 to 49442. The first quartile (1Q) is -8118, the median
is -3757, and the third quartile (3Q) is 4722.
Now , let’s look at the slope, which is 393.87 . The slope is interpreted as follows: y
(Charges) is predicted to increase 393.87 when x (age) increase by one.
The t-value for the coefficient of “bmi” is 7.397, which indicates that it is statistically
significant. The corresponding p-value is 2.46e-13, which is very small, suggesting strong
evidence against the null hypothesis that the coefficient is zero.
The residual standard error is 11870, which represents the average amount by which the
observed values deviate from the predicted values. The model has been fitted to 1336
degrees of freedom.
The multiple R-squared value is 0.03934, indicating that approximately 3.93% of the
variance in the dependent variable can be explained by the independent variable “bmi.”
The adjusted R-squared value, which takes into account the number of predictors in the
model, is 0.03862.
The F-statistic is 54.71, with a corresponding p-value of 2.459e-13. This suggests that the
overall model is statistically significant, indicating that there is a relationship between the
independent variable “bmi” and the dependent variable “charges.”
In summary, the simple linear regression model suggests that there is a significant positive
relationship between BMI and charges in the given dataset. For each unit increase in BMI,
the charges are estimated to increase by approximately 393.87, after accounting for the
intercept term. However, the overall predictive power of the model is relatively low, with
the independent variable “bmi” explaining only around 3.93% of the variance in the
dependent variable “charges.”
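Because the intercept lies far outside the observed BMI range, predictions are better made at realistic values. A hedged sketch using base R's predict() on the fitted bmi_regression object, with a BMI of 30 chosen purely for illustration:
# Predicted charges at a typical BMI of 30, with a 95% confidence interval
predict(bmi_regression,
        newdata = data.frame(bmi = 30),
        interval = "confidence")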
• Graph the scatterplot with the abline
# Decribe the scatterplot
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue")) # color points by smoker status

# Add the regression line from the lm_model to the plot


abline(bmi_regression, col = "green")

# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
Diagnostic Test for Model of BMI in predicting Charges

Sketch the standard Residual plot in R


par(mfrow = c(2,2))

#Create Residuals vs. Fitted Values Plot


plot(bmi_regression, which = 1)

# Create Normal Q-Q Plot


plot(bmi_regression, which = 2)

# Create Scale-Location Plot


plot(bmi_regression, which = 3)

# Create Residuals vs. Leverage Plot


plot(bmi_regression, which = 5)
par(mfrow = c(1,1))

• Normality test
# Extract the residuals
Res_bmi <- residuals(bmi_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_bmi)

##
## Shapiro-Wilk normality test
##
## data: Res_bmi
## W = 0.86198, p-value < 2.2e-16

From the Shapiro-Wilk normality test, the p-value for the BMI model's residuals is smaller than 0.05, so we reject the null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually, the Q-Q plot shows the residuals do not follow the normal distribution; the distribution appears skewed to the right.
• Testing for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test2 <- ncvTest(bmi_regression)
print(bp_test2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 178.9783, Df = 1, p = < 2.22e-16

As the p-value is smaller than 0.05, we reject the null hypothesis of the Breusch-Pagan test and conclude that there is strong evidence of heteroscedasticity in the residuals of this linear regression model. This means that the assumption of constant variance is violated: the variability of the residuals is not the same across all levels of the predictor.
For visual inspection, the Residuals vs Fitted plot shows the residuals spreading unequally along the range of the predictor, in a diverging cone/triangular shape.
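A common remedy, offered here only as a sketch and not as part of the group's original analysis, is to model log(charges), which often stabilizes the variance of right-skewed cost data:
# Refit with a log-transformed response and re-check for heteroskedasticity
bmi_log_regression <- lm(log(charges) ~ bmi, data = Insurance)
ncvTest(bmi_log_regression) # Breusch-Pagan test on the transformed model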
• Test for autocorrelation
# Perform the Durbin-Watson test
durbinWatsonTest(bmi_regression)

## lag Autocorrelation D-W Statistic p-value
## 1 0.007641922 1.983154 0.74
## Alternative hypothesis: rho != 0

The Durbin-Watson statistic is 1.983, which is close to 2. This suggests little to no evidence of serial correlation (autocorrelation) in the residuals of this linear regression model.
The p-value associated with the test is 0.74. Since this is greater than the typical significance level of 0.05, there is insufficient evidence to reject the null hypothesis of no serial correlation in the residuals.

6. Multiple Regression
As with simple linear regression, an important aspect of building a multiple linear regression model is making sure that the following key assumptions are met:
1. Residual values are normally distributed
2. Linear relationship between the dependent and the independent variables
3. No multicollinearity among the independent variables
4. Homoscedasticity
Our group considers the independent variables Age, BMI, Children, and Smoker (we encoded the value of smoker as 1 (yes) and 0 (no)).
# Encode the smoker variable
smoker_encoded <- ifelse(Insurance$smoker == "yes", 1, 0)
head(smoker_encoded)

## [1] 1 0 0 0 0 0
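Manual 0/1 encoding works; as a side note (a sketch, not the code used below), lm() would produce an equivalent fit if the character column smoker were included directly, since R dummy-codes it automatically and labels the coefficient smokeryes:
# Equivalent model letting lm() dummy-code the categorical variable itself
multiple_regression_alt <- lm(charges ~ age + bmi + children + smoker,
                              data = Insurance)
coef(multiple_regression_alt)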
Model of BMI, Age, Children and Smoker in predicting Charges
# Define the formula for the regression model
attach(Insurance)

## The following objects are masked from Insurance (pos = 3):
##
##     age, bmi, charges, children, region, sex, smoker

formula <- charges ~ age + bmi + children + smoker_encoded

# Fit the multivariate linear regression model


multiple_regression <- lm(formula, data = Insurance)

# Summary of the regression model


summary(multiple_regression)

##
## Call:
## lm(formula = formula, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11897.9 -2920.8 -986.6 1392.2 29509.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12102.77 941.98 -12.848 < 2e-16 ***
## age 257.85 11.90 21.675 < 2e-16 ***
## bmi 321.85 27.38 11.756 < 2e-16 ***
## children 473.50 137.79 3.436 0.000608 ***
## smoker_encoded 23811.40 411.22 57.904 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7489
## F-statistic: 998.1 on 4 and 1333 DF, p-value: < 2.2e-16

Description :
The residuals represent the differences between the actual insurance charges and the
predicted charges from the model. The minimum residual is -11,897.9, and the maximum
residual is 29,509.6. The residuals’ distribution shows a range of positive and negative
values, which suggests some variability in the model’s predictions.
The multiple R-squared value is 0.7497, which indicates that approximately 74.97% of the
variation in insurance charges can be explained by the predictor variables included in the
model.
The adjusted R-squared value is 0.7489. It takes into account the number of predictor
variables and adjusts the R-squared value accordingly. In this case, the adjusted R-squared
value is very close to the multiple R-squared value.
The F-statistic tests the overall significance of the model. The obtained F-statistic is 998.1,
with a very low p-value (p < 0.001), indicating that the model as a whole is statistically
significant in predicting insurance charges.
The residual standard error is a measure of the average deviation of the observed
insurance charges from the predicted values. In this model, the residual standard error is
6068, indicating the average difference between the observed and predicted charges.
In the given multiple regression model, the regression line can be represented as:
Insurance Charges = -12,102.77 + 257.85 * age + 321.85 * bmi + 473.50 * children +
23,811.40 * smoker_encoded
The intercept term of -12,102.77 represents the estimated insurance charges when all
predictor variables (age, bmi, children, and smoker_encoded) are zero.
Each predictor variable is multiplied by its respective coefficient estimate obtained from
the model.
For the “age” variable, the coefficient estimate is 257.85. This means that, on average, for
each one-unit increase in age, the estimated insurance charges increase by 257.85,
assuming all other variables are held constant.
The “bmi” variable has a coefficient estimate of 321.85. For each one-unit increase in BMI,
the estimated insurance charges increase by 321.85, on average, assuming all other
variables are held constant.
The “children” variable has a coefficient estimate of 473.50. For each additional child, the
estimated insurance charges increase by 473.50, on average, assuming all other variables
are held constant.
The “smoker_encoded” variable is a binary variable representing whether an individual is a
smoker or not. It has a coefficient estimate of 23,811.40. This means that smokers have estimated insurance charges that are 23,811.40 higher, on average, compared to non-smokers, assuming all other variables are held constant.
In conclusion, the multiple regression analysis suggests that age, BMI, the number of
children, and smoking status are significant predictors of insurance charges. The model
explains a considerable proportion of the variation in insurance charges, as indicated by
the high R-squared value and the significant F-statistic. However, it’s important to note that
there may be other factors not included in the model that could also influence insurance
charges.
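To make the fitted equation concrete, here is a hedged example prediction for a hypothetical individual; the input values are chosen purely for illustration:
# Predicted charges for a hypothetical 40-year-old non-smoker with BMI 30
# and two children (illustrative values, not a row from the dataset)
new_person <- data.frame(age = 40, bmi = 30, children = 2, smoker_encoded = 0)
predict(multiple_regression, newdata = new_person, interval = "prediction")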
• Graph the scatter plots with the regression line
# Extract the model coefficients
coefficients <- coef(multiple_regression)
# Create a list of variable names
variable_names <- c("age", "bmi", "children", "smoker_encoded")

# Create individual scatter plots for each variable


plots <- lapply(seq_along(variable_names), function(i) {
ggplot(Insurance, aes_string(x = variable_names[i], y = "charges")) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
labs(x = variable_names[i], y = "Charges") +
ggtitle(paste("Charges vs.", variable_names[i])) +
theme_minimal()
})

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Arrange the plots in a grid


gridExtra::grid.arrange(grobs = plots, ncol = 2)

Diagnostic test for Model of different variables predicting charges


par(mfrow = c(2,2))
# Create Residuals vs. Fitted Values Plot
plot(multiple_regression, which = 1)

# Create Normal Q-Q Plot


plot(multiple_regression, which = 2)

# Create Scale-Location Plot


plot(multiple_regression, which = 3)

# Create Residuals vs. Leverage Plot


plot(multiple_regression, which = 5)

par(mfrow = c(1,1))

• ANOVA test
ANOVA requires one continuous quantitative dependent variable (corresponding to the measurements the question relates to) and at least one qualitative independent variable (with at least 2 levels, which determine the groups to compare). In our dataset, charges is the dependent variable and the other variables are independent.
#Describe ANOVA table
anova(multiple_regression)

## Analysis of Variance Table
##
## Response: charges
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 1.7530e+10 1.7530e+10 476.130 < 2.2e-16 ***
## bmi 1 5.4464e+09 5.4464e+09 147.929 < 2.2e-16 ***
## children 1 5.7152e+08 5.7152e+08 15.523 8.574e-05 ***
## smoker_encoded 1 1.2345e+11 1.2345e+11 3352.911 < 2.2e-16 ***
## Residuals 1333 4.9078e+10 3.6818e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows that each of the 4 predictors has a p-value of essentially 0 (smaller than 0.05), so we reject the null hypothesis for each of them: every predictor explains a significant share of the variation in charges.
• Linearity test
Visually, in the Residuals vs Fitted plot, the residuals do not scatter randomly but follow a particular pattern, and the red trend line is not horizontal. This suggests the linearity assumption is questionable.
# Residuals vs. Fitted Values Plot
plot(multiple_regression, which = 1)

• Normality test
# Extract the residuals
Res_multiple <- residuals(multiple_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_multiple)

##
## Shapiro-Wilk normality test
##
## data: Res_multiple
## W = 0.89958, p-value < 2.2e-16

From the Shapiro-Wilk normality test, the p-value for the multiple regression residuals is smaller than 0.05, so we reject the null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually inspecting the Q-Q plot, the residuals do not follow the normal distribution (they should follow the reference line but do not).
• Test for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test3 <- ncvTest(multiple_regression)
print(bp_test3)

## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 230.5625, Df = 1, p = < 2.22e-16

Since the p-value is extremely small, much smaller than the typical significance level of 0.05, we reject the null hypothesis of the Breusch-Pagan test and conclude that there is strong evidence of heteroscedasticity in the residuals of this linear regression model. This means the assumption of constant variance is violated: the variability of the residuals is not the same across all levels of the predictors.
Visually, in the Residuals vs Fitted plot, the residuals do not spread equally along the range of predictors and tend to form a cone/triangular shape.
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(multiple_regression)

## lag Autocorrelation D-W Statistic p-value
## 1 -0.04506494 2.087394 0.12
## Alternative hypothesis: rho != 0

The Durbin-Watson statistic of 2.087 suggests little to no evidence of autocorrelation in the residuals of this linear regression model.
Since the p-value (0.12) is greater than the typical significance level of 0.05, there is insufficient evidence to conclude that significant autocorrelation is present in the residuals.
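The multicollinearity assumption listed at the start of this section was not explicitly tested above; a sketch of such a check using vif() from the already loaded car package:
# Variance inflation factors for the predictors; values well below 5
# suggest multicollinearity is not a concern in this model
vif(multiple_regression)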
III. Conclusion
In conclusion, we applied various statistical methods and tools to analyze a dataset of 1338
individuals and their related factors, such as age, sex, BMI, number of children, smoking
status, and region.
We performed descriptive statistics, graphical analysis, correlation analysis, simple and
multiple linear regression, and diagnostic tests to explore the data and test the
assumptions of the regression models.
We found that age, BMI, the number of children, and smoking status were significant predictors of insurance charges; sex and region were not included in the final model. The multiple regression model explained about 75% of the variation in insurance charges and was statistically significant overall.
However, we also encountered some limitations and challenges in our analysis, such as the
violation of normality and homoscedasticity assumptions, the presence of outliers and
influential points, and the potential omission of other relevant variables.
Therefore, we suggest that further research and data collection are needed to improve the
accuracy and validity of the regression model, and to better understand the factors that
affect insurance charges.

IV. References
1. Journal of Healthcare Engineering. (2022). Estimation and Prediction of Hospitalization and Medical Care Costs Using Regression in Machine Learning. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8906954/
2. Bhumika Bhatt. (2019). Predictors of medical expenses. Retrieved from https://www.kaggle.com/code/bbhatt001/predictors-of-medical-expenses/log
3. Matt Dunham. (2020). Attempting to Predict Medical Costs. Retrieved from https://rpubs.com/MattDunham99/medical_costs_code
