Midterm Project Group 6
Group 6
Major: Logistics and Supply Chain Management
Course: Applied Statistics
Member
Nguyễn Đình Thái Dương 20223587
2023-12-27
I. Introduction
1. Motivation
The global insurance industry is worth trillions of dollars, so it is no surprise that insurers
are turning to technology to help them cut costs, better understand their customers and
develop new products. The insurance industry is no stranger to fraud. In fact, it’s estimated
that 10% of all insurance claims are fraudulent. This costs the industry billions of dollars
every year.
By applying data analysis in the business, enterprises can address these issues and create
new ways to sell, distribute, and underwrite insurance products.
2. Preparation
• Load required libraries
library(stats)      # base statistical functions
library(dplyr)      # data manipulation
library(knitr)      # report formatting
library(stargazer)  # regression tables
library(car)        # ncvTest() and durbinWatsonTest() used below
library(broom)      # tidy model output
library(gridExtra)  # arranging multiple plots
• Import Dataset
setwd("D:/Project")
# Read the insurance dataset (file name assumed; the original read call was lost in extraction)
Insurance <- read.csv("insurance.csv")
dim(Insurance)
## [1] 1338 7
# Attach so that variables can be referenced by name
attach(Insurance)
table(sex)
## sex
## female male
## 662 676
table(smoker)
## smoker
## no yes
## 1064 274
table(region)
## region
## northeast northwest southeast southwest
## 324 325 364 325
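The variable summaries quoted in the description below were presumably produced with summary(); a minimal reconstruction, since the original call was not preserved:
# Numerical summary of every variable in the dataset
summary(Insurance)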
Description
From the summary above, we examined an insurance dataset of 1338 individuals covering 7
variables: age, sex, BMI, number of children, smoking status, region, and insurance
charges. 662 subjects were female and 676 were male. Ages ranged from 18 to 64 years,
with a mean of 39.21 years. BMI values ranged from 15.96 to 53.13, with a mean of 30.66,
indicating that the average individual is overweight. The number of children per individual
ranged from 0 to 5, averaging about 1.1 children per person.
Out of 1338 people, only 274 smoked while the rest did not. Insurance charges varied from
1122 to 63770, with a mean of 13270. The dataset also includes regional information,
which may affect insurance charges: 324 individuals live in the Northeast, 325 in the
Northwest, 364 in the Southeast, and 325 in the Southwest.
This dataset offers a comprehensive view of the demographic and lifestyle factors related
to insurance charges, presenting opportunities for further analysis.
• Illustration of “Gender”
# Count of Gender
gender_count <- table(Insurance$sex)
barplot(gender_count,
main = "Distribution of Gender",
xlab = "Gender",
ylab = "Count",
ylim = c(0,700),
col = c("lightblue","lightpink"),
border = "black")
• Illustration of “Smoker”
#Count of Smoker
smoker_count <- table(Insurance$smoker)
barplot(smoker_count,
main = "Distribution of Smoker",
xlab = "Smoker",
ylab = "Count",
col = c("lightblue","lightpink"),
ylim = c(0,1200)
)
• Illustration of “Region”
#Count of the region
count_region <- table(Insurance$region)
#Create bar chart
barplot(count_region,
main = "Distribution of Region",
ylab = "Count",
xlab = "Region",
ylim = c(0,400),
col = "lightblue")
• Illustration of “Children”
# Count the number of children of each individual and plot the counts
children_counts <- table(Insurance$children)
barplot(children_counts, main = "Distribution of Children", xlab = "Children", ylab = "Count", col = "lightblue")
• Illustration of “Charges”
# Distribution of Charges
# Create a histogram of charges
hist(Insurance$charges,
main = "Distribution of Charges",
xlab='Charges',
xlim = c(0,70000),
ylim = c(0,400),
col = "lightblue")
4. Consider the relationship between the variables pairwise
We examine the relationships between pairs of variables, beginning with age and gender.
• Relationship between “Age” and “Sex”
#Create boxplot to compare the age distribution of Male and Female
boxplot( age ~ sex, data = Insurance,
col = c("lightblue", "lightpink"),
xlab = "Gender",
ylab = "Age",
ylim = c(10,70),
main = "Age vs Gender")
# Create histograms to illustrate more details
par(mfrow = c(1,2))
# Histogram for Female
hist(Insurance$age[Insurance$sex == "female"],
col = "lightpink",
xlab = "Age",
ylab = "Frequency",
ylim = c(0,100),
main = "Female")
# Histogram for Male
hist(Insurance$age[Insurance$sex == "male"],
col = "lightblue",
xlab = "Age",
ylab = "Frequency",
ylim = c(0,100),
main = "Male")
# Add legend
legend("topright",
legend = c("Male", "Female"),
fill = c("lightblue", "lightpink"),
bg = "white")
• Relationship between “Age” and “Charges”
# Scatterplot of age vs. charges (NOT taking into account smoking exposure)
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs Charges",
)
# Scatterplot of Age vs. charges (taking into account smoking exposure)
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs Charges",
col = ifelse(Insurance$smoker == "yes","red","blue"))
# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “BMI” and “Charges”
# Scatterplot of BMI vs. charges (NOT taking into account smoking exposure)
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
)
# Scatterplot of bmi vs. charges taking into account smoking exposure
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue"))
# Add a legend
legend("topright",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
• Relationship between “Smoker” and “Charges”
boxplot(charges ~ smoker,data = Insurance,
main = "Smoker vs Charges",
ylab = "Charge",
xlab = "Smoker",
col = "lightblue")
• Relationship between “Children” and “Charges”
plot(Insurance$children, Insurance$charges,
ylab = "Charge",
xlab = "Children",
main = "Children vs Charges",
col = "blue")
• Computation of correlation between variables
# Compute the correlation matrix for numerical variables
correlation_matrix <- cor(Insurance[, c("age", "bmi", "children",
"charges")])
correlation_matrix
5. Simple Linear Regression
Model of Age in predicting Charges
# Fit a simple linear regression of charges on age (the original call was lost in extraction)
age_regression <- lm(charges ~ age, data = Insurance)
summary(age_regression)
##
## Call:
## lm(formula = charges ~ age, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## age 257.7 22.5 11.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 2.2e-16
Description:
The data were obtained from 1338 individuals. The relationship between age and charges
was studied, with charges as the dependent variable to be estimated from the independent
variable, age. On the basis of the data, the following regression line was determined:
Y = 3165.9 + 257.7X, where X is age and Y is charges. The intercept is 3165.9, meaning that
the predicted value of Y (charges) is 3165.9 when X (age) equals zero. Since the youngest
individual in the dataset is 18, this is an extrapolation with no practical interpretation.
The residuals, which represent the differences between the observed values and the
predicted values, range from -8059 to 47829. The first quartile (1Q) is -6671, the median is
-5939, and the third quartile (3Q) is 5440.
Now let's look at the slope, which is 257.7. The slope is interpreted as follows: Y (charges)
is predicted to increase by 257.7 when X (age) increases by one.
The t-value for the coefficient of “age” is 11.453, which indicates that it is statistically
significant. The corresponding p-value is < 2e-16, which is extremely small, providing
strong evidence against the null hypothesis that the coefficient is zero.
The residual standard error is 11560, which represents the average amount by which the
observed values deviate from the predicted values. The model has been fitted to 1336
degrees of freedom.
The multiple R-squared value is 0.08941, indicating that approximately 8.941% of the
variance in the dependent variable can be explained by the independent variable “age.” The
adjusted R-squared value, which considers the number of predictors in the model, is
0.08872.
The F-statistic is 131.2, with a corresponding p-value of < 2.2e-16. This suggests that the
overall model is statistically significant, indicating that there is a relationship between the
independent variable “age” and the dependent variable “charges.”
In summary, the simple linear regression model suggests that there is a significant positive
relationship between age and charges in the given dataset. For each year increase in age,
the charges are estimated to increase by approximately 257.7, after accounting for the
intercept term. However, the overall predictive power of the model is relatively low, with
the independent variable “age” explaining only around 8.941% of the variance in the
dependent variable “charges.”
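To quantify the uncertainty around these estimates, confidence intervals for the coefficients and a prediction at an illustrative age (40, an assumed value not taken from the report) can be obtained as follows:
# 95% confidence intervals for the intercept and the age slope
confint(age_regression)
# Predicted charges for a 40-year-old, with a confidence interval
predict(age_regression, newdata = data.frame(age = 40), interval = "confidence")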
• Graph the scatter plot with the regression line
# Describe using scatter plot
plot(Insurance$age, Insurance$charges,
xlab = "Age", ylab = "Charges",
main = "Age vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue")) # Color points based on smoker status
# Add the fitted regression line
abline(age_regression, lwd = 2)
# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
Diagnostic test for Model of age predicting charges
par(mfrow = c(1,1))
• Normality test
# Extract the residuals
Res_age <- residuals(age_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_age)
##
## Shapiro-Wilk normality test
##
## data: Res_age
## W = 0.65773, p-value < 2.2e-16
From the Shapiro-Wilk normality test, the p-value is smaller than 0.05, so we reject the
null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually, we can inspect the Q-Q plot: the residuals do not follow the normal reference
line and appear under-dispersed.
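The Q-Q plot referred to above can be drawn as follows (the original plotting call was not preserved):
# Q-Q plot of the residuals of the age model
qqnorm(Res_age)
qqline(Res_age, col = "red")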
• Testing for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test <- ncvTest(age_regression)
print(bp_test)
As the p-value is larger than the typical significance level (0.05), we fail to reject the null
hypothesis of the Breusch-Pagan test: the data meet the assumption of homoscedasticity.
This suggests that there is no significant evidence of heteroscedasticity in the residuals of
the linear regression model. In other words, the assumption of constant variance is not
violated, and the variability of the residuals is reasonably consistent across all levels of the
predictor.
Visually, in the residuals vs. fitted plot, the residuals spread equally along the range of the
predictor.
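The residuals vs. fitted plot mentioned above is the first of R's built-in diagnostic plots:
# Residuals vs. fitted values for the age model
plot(age_regression, which = 1)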
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(age_regression)
The Durbin-Watson statistic is 2.033284, which is close to 2. This suggests that there is
little to no evidence of serial correlation (autocorrelation) in the residuals of the linear
regression model.
The p-value associated with the test is 0.518. Since this is greater than the typical
significance level of 0.05, there is insufficient evidence to reject the null hypothesis of no
serial correlation: there is no significant autocorrelation in the residuals.
Model of BMI in predicting Charges
# Fit a simple linear regression of charges on BMI (the original call was lost in extraction)
bmi_regression <- lm(charges ~ bmi, data = Insurance)
summary(bmi_regression)
##
## Call:
## lm(formula = charges ~ bmi, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20956 -8118 -3757 4722 49442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1192.94 1664.80 0.717 0.474
## bmi 393.87 53.25 7.397 2.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
## F-statistic: 54.71 on 1 and 1336 DF, p-value: 2.459e-13
Description:
The data were obtained from 1338 individuals. The relationship between BMI and charges
was studied, with charges as the dependent variable to be estimated from the independent
variable, BMI. On the basis of the data, the following regression line was determined:
Y = 1192.94 + 393.87X, where X is the BMI score and Y is charges. The intercept is 1192.94,
meaning that the predicted value of Y (charges) is 1192.94 when X (BMI) equals zero.
However, a BMI below 10 is implausible for a human being (the minimum observed here is
15.96), so the intercept has no practical interpretation.
The residuals, which represent the differences between the observed values and the
predicted values, range from -20956 to 49442. The first quartile (1Q) is -8118, the median
is -3757, and the third quartile (3Q) is 4722.
Now let's look at the slope, which is 393.87. The slope is interpreted as follows: Y (charges)
is predicted to increase by 393.87 when X (BMI) increases by one.
The t-value for the coefficient of “bmi” is 7.397, which indicates that it is statistically
significant. The corresponding p-value is 2.46e-13, which is very small, suggesting strong
evidence against the null hypothesis that the coefficient is zero.
The residual standard error is 11870, which represents the average amount by which the
observed values deviate from the predicted values. The model has been fitted to 1336
degrees of freedom.
The multiple R-squared value is 0.03934, indicating that approximately 3.93% of the
variance in the dependent variable can be explained by the independent variable “bmi.”
The adjusted R-squared value, which takes into account the number of predictors in the
model, is 0.03862.
The F-statistic is 54.71, with a corresponding p-value of 2.459e-13. This suggests that the
overall model is statistically significant, indicating that there is a relationship between the
independent variable “bmi” and the dependent variable “charges.”
In summary, the simple linear regression model suggests that there is a significant positive
relationship between BMI and charges in the given dataset. For each unit increase in BMI,
the charges are estimated to increase by approximately 393.87, after accounting for the
intercept term. However, the overall predictive power of the model is relatively low, with
the independent variable “bmi” explaining only around 3.93% of the variance in the
dependent variable “charges.”
• Graph the scatterplot with the regression line
# Describe the scatterplot
plot(Insurance$bmi, Insurance$charges,
xlab = "BMI", ylab = "Charges",
main = "BMI vs. Charges",
col = ifelse(Insurance$smoker == "yes", "red", "blue")) # Color points based on smoker status
# Add the fitted regression line
abline(bmi_regression, lwd = 2)
# Add a legend
legend("topleft",
legend = c("Non-Smoker", "Smoker"),
col = c("blue", "red"),
pch = 1,
bg = "white")
Diagnostic Test for Model of BMI in predicting Charges
• Normality test
# Extract the residuals
Res_bmi <- residuals(bmi_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_bmi)
##
## Shapiro-Wilk normality test
##
## data: Res_bmi
## W = 0.86198, p-value < 2.2e-16
From the Shapiro-Wilk normality test, the p-value is smaller than 0.05, so we reject the
null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually, we can inspect the Q-Q plot: the residuals do not follow the normal reference
line, and the distribution appears skewed to the right.
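As before, the Q-Q plot can be produced with R's second built-in diagnostic plot (the original call was not preserved):
# Normal Q-Q plot of the residuals of the BMI model
plot(bmi_regression, which = 2)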
• Testing for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test2 <- ncvTest(bmi_regression)
print(bp_test2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 178.9783, Df = 1, p = < 2.22e-16
As the p-value is smaller than 0.05, we reject the null hypothesis of the Breusch-Pagan test
and conclude that there is strong evidence of heteroscedasticity in the residuals of the
linear regression model. This means that the assumption of constant variance is violated:
the variability of the residuals is not the same across all levels of the predictor.
For visual inspection, we look at the residuals vs. fitted plot: the residuals spread
unequally along the range of the predictor, taking a diverging cone/triangular shape.
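A common remedy for heteroscedasticity of this shape, not attempted in the original analysis, is to model the logarithm of charges instead; a brief sketch:
# Refit with log-transformed charges to stabilize the residual variance
bmi_log_regression <- lm(log(charges) ~ bmi, data = Insurance)
ncvTest(bmi_log_regression) # re-check the constant-variance assumption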
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(bmi_regression)
The Durbin-Watson statistic is 1.983154, which is close to 2. This suggests that there is
little to no evidence of serial correlation (autocorrelation) in the residuals of the linear
regression model.
The p-value associated with the test is 0.764. Since this is greater than the typical
significance level of 0.05, there is insufficient evidence to reject the null hypothesis of no
serial correlation: there is no significant autocorrelation in the residuals.
6. Multiple Regression
Similar to Simple Linear Regression, an important aspect when building a multiple linear
regression model is to make sure that the following key assumptions are met.
1. Residual values are normally distributed
2. There is a linear relationship between the dependent and the independent variables
3. There is no multicollinearity among the independent variables
4. The homoscedasticity (constant variance) assumption holds
Our group considers the independent variables Age, BMI, Children, and Smoker (we
encoded the value of smoker as 1 (yes) and 0 (no)).
# Encode the smoker variable
smoker_encoded <- ifelse(Insurance$smoker == "yes", 1, 0)
head(smoker_encoded)
## [1] 1 0 0 0 0 0
Model of BMI, Age, Children and Smoker in predicting Charges
# Define the formula for the regression model
formula <- charges ~ age + bmi + children + smoker_encoded
# Fit the multiple regression model (the original fitting call was lost in extraction)
multiple_regression <- lm(formula, data = Insurance)
summary(multiple_regression)
##
## Call:
## lm(formula = formula, data = Insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11897.9 -2920.8 -986.6 1392.2 29509.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12102.77 941.98 -12.848 < 2e-16 ***
## age 257.85 11.90 21.675 < 2e-16 ***
## bmi 321.85 27.38 11.756 < 2e-16 ***
## children 473.50 137.79 3.436 0.000608 ***
## smoker_encoded 23811.40 411.22 57.904 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7489
## F-statistic: 998.1 on 4 and 1333 DF, p-value: < 2.2e-16
Description:
The residuals represent the differences between the actual insurance charges and the
predicted charges from the model. The minimum residual is -11,897.9, and the maximum
residual is 29,509.6. The residuals’ distribution shows a range of positive and negative
values, which suggests some variability in the model’s predictions.
The multiple R-squared value is 0.7497, which indicates that approximately 74.97% of the
variation in insurance charges can be explained by the predictor variables included in the
model.
The adjusted R-squared value is 0.7489. It takes into account the number of predictor
variables and adjusts the R-squared value accordingly. In this case, the adjusted R-squared
value is very close to the multiple R-squared value.
The F-statistic tests the overall significance of the model. The obtained F-statistic is 998.1,
with a very low p-value (p < 0.001), indicating that the model as a whole is statistically
significant in predicting insurance charges.
The residual standard error is a measure of the average deviation of the observed
insurance charges from the predicted values. In this model, the residual standard error is
6068, indicating the average difference between the observed and predicted charges.
In the given multiple regression model, the regression line can be represented as:
Insurance Charges = -12,102.77 + 257.85 * age + 321.85 * bmi + 473.50 * children +
23,811.40 * smoker_encoded
The intercept term of -12,102.77 represents the estimated insurance charges when all
predictor variables (age, bmi, children, and smoker_encoded) are zero.
Each predictor variable is multiplied by its respective coefficient estimate obtained from
the model.
For the “age” variable, the coefficient estimate is 257.85. This means that, on average, for
each one-unit increase in age, the estimated insurance charges increase by 257.85,
assuming all other variables are held constant.
The “bmi” variable has a coefficient estimate of 321.85. For each one-unit increase in BMI,
the estimated insurance charges increase by 321.85, on average, assuming all other
variables are held constant.
The “children” variable has a coefficient estimate of 473.50. For each additional child, the
estimated insurance charges increase by 473.50, on average, assuming all other variables
are held constant.
The “smoker_encoded” variable is a binary variable representing whether an individual is a
smoker or not. It has a coefficient estimate of 23,811.40. This means that smokers have
estimated insurance charges that are 23,811.40 higher, on average, compared to
non-smokers, assuming all other variables are held constant.
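To illustrate the fitted equation, we can predict the charges for a hypothetical individual, say a 40-year-old smoker with a BMI of 30 and 2 children (illustrative values, not from the original report):
# By hand: -12102.77 + 257.85*40 + 321.85*30 + 473.50*2 + 23811.40*1 = 32625.13
new_person <- data.frame(age = 40, bmi = 30, children = 2, smoker_encoded = 1)
predict(multiple_regression, newdata = new_person)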
In conclusion, the multiple regression analysis suggests that age, BMI, the number of
children, and smoking status are significant predictors of insurance charges. The model
explains a considerable proportion of the variation in insurance charges, as indicated by
the high R-squared value and the significant F-statistic. However, it’s important to note that
there may be other factors not included in the model that could also influence insurance
charges.
• Graph the scatter plots with the regression line
# Extract the model coefficients
coefficients <- coef(multiple_regression)
# Create a list of variable names
variable_names <- c("age", "bmi", "children", "smoker_encoded")
par(mfrow = c(1,1))
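The scatter plots themselves were not preserved in this output; a minimal sketch of how each predictor could be plotted against charges with the corresponding fitted slope overlaid (a marginal view with the other predictors held at zero, offered as an assumption about the original figures):
# Plot each predictor against charges and overlay its fitted slope
predictor_list <- list(Insurance$age, Insurance$bmi, Insurance$children, smoker_encoded)
for (i in 1:4) {
  plot(predictor_list[[i]], Insurance$charges,
       xlab = variable_names[i], ylab = "Charges")
  abline(a = coefficients[1], b = coefficients[i + 1], col = "red")
}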
All four coefficients have p-values near 0 (smaller than 0.05), so we reject the null
hypothesis that each coefficient is zero: every predictor contributes significantly to the
model.
• Linearity test
Visually, looking at the residuals vs. fitted plot, the residuals do not scatter randomly;
they follow a particular pattern, and the red trend line does not stay horizontal. This casts
doubt on the linearity assumption.
# Residuals vs. Fitted Values Plot
plot(multiple_regression, which = 1)
• Normality test
# Extract the residuals
Res_multiple <- residuals(multiple_regression)
# Perform Shapiro-Wilk normality test
shapiro.test(Res_multiple)
##
## Shapiro-Wilk normality test
##
## data: Res_multiple
## W = 0.89958, p-value < 2.2e-16
From the Shapiro-Wilk normality test, the p-value is smaller than 0.05, so we reject the
null hypothesis of normality and conclude that the residuals are not normally distributed.
Visually inspecting the Q-Q plot, the residuals do not follow the normal reference line
(they should lie along it) and appear under-dispersed.
• Test for heteroskedasticity
# Perform Breusch-Pagan test to check for heteroskedasticity
bp_test3 <- ncvTest(multiple_regression)
print(bp_test3)
Since the p-value is extremely small, much smaller than the typical significance level of
0.05, we reject the null hypothesis of the Breusch-Pagan test and conclude that there is
strong evidence of heteroscedasticity in the residuals of the model. This means that the
assumption of constant variance is violated: the variability of the residuals is not the same
across all levels of the predictors.
Visually, in the residuals vs. fitted values plot, the residuals do not spread equally along
the range of predictors and tend to fan out in a cone/triangular shape.
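When the constant-variance assumption fails, a standard response (not part of the original analysis) is to report heteroscedasticity-robust standard errors; a sketch assuming the sandwich and lmtest packages are available:
library(sandwich)
library(lmtest)
# Coefficient tests with HC3 heteroscedasticity-robust standard errors
coeftest(multiple_regression, vcov = vcovHC(multiple_regression, type = "HC3"))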
• Test for autocorrelation
#Perform the Durbin-Watson test
durbinWatsonTest(multiple_regression)
IV. References
1. Journal of Healthcare Engineering. (2022). Estimation and Prediction of Hospitalization
and Medical Care Costs Using Regression in Machine Learning. Retrieved from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8906954/
2. Bhumika Bhatt. (2019). Predictors of medical expenses. Retrieved from
https://www.kaggle.com/code/bbhatt001/predictors-of-medical-expenses/log