Exploratory Data Analysis and Modeling for
Insurance and User Data
Submitted by: Ayushkumar Parmar [21IM30005]
Introduction
This Jupyter Notebook conducts exploratory data analysis (EDA) and modeling for two datasets:
insurance and user_data. The analysis includes visualizations, a linear regression model to
predict insurance charges, and a logistic regression model for binary classification on the
user data.
Insurance Dataset
The exploration starts with loading and understanding the insurance dataset:
# Importing the libraries
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")  # suppress warnings for cleaner output

# Load the two datasets (Colab file paths)
user_data = pd.read_csv('/content/User_Data.csv')
insurance = pd.read_csv('/content/insurance.csv')
df = insurance.copy()
print('Rows and columns in the dataset:', df.shape)
df.head() # Top 5 rows
df.info()
Rows and columns in the dataset: (1338, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Fitting a Line - Charges vs. BMI
A linear regression line is fitted to understand the relationship between BMI and insurance
charges:
sns.lmplot(x='bmi', y='charges', data=df, aspect=2, height=5)
plt.xlabel('Body Mass Index (independent variable)')
plt.ylabel('Insurance Charges (dependent variable)')
plt.title('Charges vs. BMI')
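The plot alone does not report the fitted slope; a minimal sketch using
scipy.stats.linregress (scipy is already imported as sp) quantifies the line:
# Sketch: quantify the fitted line shown above
res = sp.stats.linregress(df['bmi'], df['charges'])
print(f"slope={res.slope:.2f}, intercept={res.intercept:.2f}, r={res.rvalue:.3f}")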
Exploratory Data Analysis (EDA)
EDA includes descriptive statistics, checking for missing values, and exploring correlations:
df.describe()
# Check for missing values
missing_values = df.isnull()
# Summarize the missing values
missing_values_count = missing_values.sum()
# Display the result
print("Missing values in each column:")
print(missing_values_count)
# Correlation heatmap (numeric_only=True is required on pandas >= 2.0,
# since df contains object columns)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap='viridis', annot=True)
Missing values in each column:
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
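To read the heatmap numerically, the same correlations can be ranked against the
target (a small sketch reusing the corr matrix computed above):
# Sketch: rank numeric features by correlation with charges
print(corr['charges'].sort_values(ascending=False))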
Distribution and Relationships Visualization
Visualizations include distribution plots, violin plots, box plots, and scatter plots:
# Distribution of insurance charges (raw and log10-transformed);
# sns.distplot is deprecated, so histplot with kde=True is used instead
f = plt.figure(figsize=(12, 4))
sns.histplot(df['charges'], bins=50, color='r', kde=True, ax=f.add_subplot(121))
sns.histplot(np.log10(df['charges']), bins=40, color='b', kde=True,
             ax=f.add_subplot(122))
f.suptitle('Distribution of insurance charges')
# Violin plots of charges by sex and by smoker status
f = plt.figure(figsize=(13, 6))
sns.violinplot(x='sex', y='charges', data=df, palette='Wistia',
               ax=f.add_subplot(121))
sns.violinplot(x='smoker', y='charges', data=df, palette='magma',
               ax=f.add_subplot(122))
# Box plot of charges by number of children, split by sex
plt.figure(figsize=(13, 6))
sns.boxplot(x='children', y='charges', hue='sex', data=df, palette='rainbow')

# Scatter plots of charges vs. age and BMI, colored by smoker status
f = plt.figure(figsize=(13, 6))
sns.scatterplot(x='age', y='charges', data=df, hue='smoker',
                palette='magma', ax=f.add_subplot(121))
sns.scatterplot(x='bmi', y='charges', data=df, hue='smoker',
                palette='viridis', ax=f.add_subplot(122))
Dummy Variable and Log Transform
Creating dummy variables for categorical columns and log-transforming the charges:
# One-hot encode the categorical columns, dropping the first level of each
# to avoid the dummy variable trap
categorical_columns = ['sex', 'smoker', 'region']
df_encode = pd.get_dummies(data=df, prefix='OHE', prefix_sep='_',
                           columns=categorical_columns,
                           drop_first=True, dtype='int8')

# Log-transform charges to reduce right skew
df_encode['charges'] = np.log(df_encode['charges'])
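As a quick numeric check of why the transform helps, skewness can be compared
before and after (a minimal sketch using scipy's skew via the existing sp import;
df still holds the untransformed copy):
# Sketch: the log transform tames the right skew of charges
print('skew of charges     :', sp.stats.skew(df['charges']))
print('skew of log(charges):', sp.stats.skew(np.log(df['charges'])))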
Linear Regression Model
Building and evaluating a linear regression model:
# Train-test split
from sklearn.model_selection import train_test_split
X = df_encode.drop('charges', axis=1)  # independent variables
y = df_encode['charges']               # dependent variable (log scale)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=23)

# Model building
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Get the coefficients and the intercept
coefficients = lin_reg.coef_
intercept = lin_reg.intercept_
coefficients_df = pd.DataFrame({'Feature': X_train.columns,
                                'Coefficient': coefficients})
print("Intercept: ", intercept)
print("Coefficients:")
print(coefficients_df)
# Predict on the test set
y_pred_sk = lin_reg.predict(X_test)

# Evaluation: MSE and R-squared
from sklearn.metrics import mean_squared_error
J_mse_sk = mean_squared_error(y_test, y_pred_sk)  # (y_true, y_pred) order
R_square_sk = lin_reg.score(X_test, y_test)

# Adjusted R-squared
n = len(y_test)        # number of test observations
p = X_test.shape[1]    # number of predictors
adjusted_r_squared = 1 - (1 - R_square_sk) * ((n - 1) / (n - p - 1))
print("Adjusted R-squared: ", adjusted_r_squared)
Intercept:  7.07542806626153
Coefficients:
         Feature  Coefficient
0            age     0.033057
1            bmi     0.013706
2       children     0.101695
3       OHE_male    -0.069942
4        OHE_yes     1.547184
5  OHE_northwest    -0.054662
6  OHE_southeast    -0.145206
7  OHE_southwest    -0.135254
Adjusted R-squared:  0.7722401938297871
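Since the model was trained on log(charges), each coefficient acts multiplicatively
on the original charge scale: exp(beta) is the factor change in charges per unit
increase of the feature. A short sketch using the objects defined above:
# Back-transform log-scale predictions to original currency units
charges_pred = np.exp(y_pred_sk)
# Multiplicative effect of each feature on charges
effects = pd.Series(np.exp(coefficients_df['Coefficient'].values),
                    index=coefficients_df['Feature'])
print(effects)
# e.g. exp(1.547) is about 4.70: smokers' expected charges are roughly
# 4.7 times those of otherwise comparable non-smokers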
Model Validation and Checks
Checking for linearity, residual normality and zero mean, multivariate normality
(Q-Q plot), homoscedasticity, and multicollinearity:
# Check for linearity and residual normality
f = plt.figure(figsize=(13, 5))
ax = f.add_subplot(121)
ax.scatter(y_test, y_pred_sk, color='r')
ax.set_title('Check for linearity: Actual vs. Predicted')
ax = f.add_subplot(122)
sns.histplot(y_test - y_pred_sk, kde=True, ax=ax, color='b')  # distplot is deprecated
ax.axvline((y_test - y_pred_sk).mean(), color='k', linestyle='--')
ax.set_title('Check for residual normality and mean: Residual error')
ax.set_xlabel('Residuals')
# Check for multivariate normality with a Q-Q plot
f, ax = plt.subplots(1, 2, figsize=(13, 6))
_, (_, _, r) = sp.stats.probplot(y_test - y_pred_sk, fit=True, plot=ax[0])
ax[0].set_title('Check for Multivariate Normality: Q-Q plot')

# Check for homoscedasticity
sns.scatterplot(x=y_pred_sk, y=(y_test - y_pred_sk), ax=ax[1], color='r')
ax[1].set_title('Check for Homoscedasticity: Residual vs. Predicted')

# Check for multicollinearity: a single model-level Variance Inflation Factor
# derived from the overall R-squared (a per-feature check follows below)
VIF = 1 / (1 - R_square_sk)
print('VIF: ', VIF)
VIF: 4.479966203229661
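The value above is a model-level VIF; the standard diagnostic is a per-feature VIF.
A minimal sketch, assuming statsmodels is installed (it is not imported elsewhere
in this notebook):
# Sketch: per-feature VIF (assumes statsmodels is available)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = sm.add_constant(X_train.astype(float))  # add intercept column
vif_df = pd.DataFrame({
    'Feature': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i)
            for i in range(X_vif.shape[1])]
})
print(vif_df)  # rule of thumb: values above 5-10 flag multicollinearity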
Logistic Regression for User Data
Moving on to the user data, logistic regression is performed for binary classification:
df = user_data.copy()

# Train-test split: a single feature column (index 2) and the binary
# target column (index 4)
X = df.iloc[:, 2:3].values
y = df.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Feature scaling: fit the scaler on training data only to avoid leakage
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Fit logistic regression on the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)
# Model evaluation: classification report, accuracy, and confusion matrix
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
report = classification_report(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
# Display results
print("Classification report:\n", report)
print("Accuracy: ", accuracy)
# Confusion matrix visualization
cm_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'],
                     columns=['Predicted 0', 'Predicted 1'])
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.97      0.93        68
           1       0.92      0.75      0.83        32

    accuracy                           0.90       100
   macro avg       0.91      0.86      0.88       100
weighted avg       0.90      0.90      0.90       100

Accuracy:  0.9
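As a usage sketch, the fitted pipeline can score a new observation; the raw value
(45 here is an arbitrary example) must pass through the same fitted scaler before
prediction:
# Sketch: score a new observation (45 is an arbitrary raw feature value)
x_new = sc_X.transform([[45]])
print(classifier.predict(x_new))        # hard class label (0 or 1)
print(classifier.predict_proba(x_new))  # [[P(class 0), P(class 1)]]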
Conclusion
This analysis identifies the main drivers of insurance charges, most notably smoker
status, whose coefficient dominates the linear model (adjusted R-squared of about
0.77 on the log-transformed target), and builds a logistic regression classifier
that reaches 90% accuracy on the user data. Together, the visualizations, model
building, and diagnostic checks give a thorough picture of both datasets.