
Exploratory Data Analysis and Modeling for Insurance and User Data


Submitted by: Ayushkumar Parmar [21IM30005]

Introduction
This Jupyter Notebook conducts exploratory data analysis (EDA) and modeling for two datasets: insurance and user_data. The analysis includes visualizations, linear regression modeling to predict insurance charges, and logistic regression modeling for binary classification on the user data.

Insurance Dataset

The exploration starts with loading and understanding the insurance dataset:

# Importing the libraries
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

user_data = pd.read_csv('/content/User_Data.csv')
insurance = pd.read_csv('/content/insurance.csv')

df = insurance.copy()

print('Rows and columns in the dataset:', df.shape)

df.head()  # Top 5 rows
df.info()

Rows and columns in the dataset: (1338, 7)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Fitting a Line - Charges vs. BMI


A linear regression line is fitted to understand the relationship between BMI and insurance
charges:

sns.lmplot(x='bmi', y='charges', data=df, aspect=2, height=5)
plt.xlabel('Body Mass Index (independent variable)')
plt.ylabel('Insurance Charges (dependent variable)')
plt.title('Charges vs. BMI')

Text(0.5, 1.0, 'Charges vs. BMI')
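
The lmplot above draws the fitted line but does not report its parameters. As a quick check, the slope and intercept can be recovered with scipy (a minimal sketch using the df defined above; the printed format is illustrative):

# Recover the fitted line's parameters for charges vs. bmi
slope, intercept, r_value, p_value, std_err = sp.stats.linregress(df['bmi'], df['charges'])
print(f"charges ≈ {slope:.2f} * bmi + {intercept:.2f} (r = {r_value:.3f})")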

Exploratory Data Analysis (EDA)


EDA includes descriptive statistics, checking for missing values, and exploring correlations:

df.describe()

# Check for missing values
missing_values = df.isnull()

# Summarize the missing values
missing_values_count = missing_values.sum()

# Display the result
print("Missing values in each column:")
print(missing_values_count)

# Correlation heatmap (numeric_only avoids errors on the object
# columns in newer pandas versions)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap='viridis', annot=True)

Missing values in each column:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

<Axes: >

Distribution and Relationships Visualization


Visualizations include distribution plots, violin plots, box plots, and scatter plots:

# Distribution of insurance charges
# (histplot with kde=True replaces the deprecated distplot)
f = plt.figure(figsize=(12, 4))
sns.histplot(df['charges'], bins=50, color='r', kde=True, ax=f.add_subplot(121))
sns.histplot(np.log10(df['charges']), bins=40, color='b', kde=True, ax=f.add_subplot(122))
f.suptitle('Distribution of insurance charges')

# Violin plots
f = plt.figure(figsize=(13, 6))
sns.violinplot(x='sex', y='charges', data=df, palette='Wistia', ax=f.add_subplot(121))
sns.violinplot(x='smoker', y='charges', data=df, palette='magma', ax=f.add_subplot(122))

# Box plot
plt.figure(figsize=(13, 6))
sns.boxplot(x='children', y='charges', hue='sex', data=df, palette='rainbow')

# Scatter plots
f = plt.figure(figsize=(13, 6))
sns.scatterplot(x='age', y='charges', data=df, palette='magma', hue='smoker', ax=f.add_subplot(121))
sns.scatterplot(x='bmi', y='charges', data=df, palette='viridis', hue='smoker', ax=f.add_subplot(122))

<Axes: xlabel='bmi', ylabel='charges'>


Dummy Variable and Log Transform
Creating dummy variables for categorical columns and log-transforming the charges:

# Dummy variable
categorical_columns = ['sex', 'smoker', 'region']
df_encode = pd.get_dummies(data=df, prefix='OHE', prefix_sep='_',
                           columns=categorical_columns,
                           drop_first=True,
                           dtype='int8')

# Log transform
df_encode['charges'] = np.log(df_encode['charges'])
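
To illustrate what drop_first=True does here, a minimal sketch on a toy frame (the toy data is hypothetical, not drawn from the insurance dataset):

toy = pd.DataFrame({'smoker': ['yes', 'no', 'yes']})
print(pd.get_dummies(toy, prefix='OHE', prefix_sep='_', drop_first=True, dtype='int8'))
# Only OHE_yes remains: 'no' becomes the implicit baseline, which avoids
# perfect multicollinearity (the dummy variable trap) in the regression.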

Linear Regression Model


Building and evaluating a linear regression model:

# Train test split
from sklearn.model_selection import train_test_split
X = df_encode.drop('charges', axis=1)  # Independent variables
y = df_encode['charges']               # Dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

# Model building
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Get the coefficients and the intercept
coefficients = lin_reg.coef_
intercept = lin_reg.intercept_

# Print them
coefficients_df = pd.DataFrame({'Feature': X_train.columns,
                                'Coefficient': coefficients})
print("Intercept: ", intercept)
print("Coefficients:")
print(coefficients_df)

# Predict on the test set
y_pred_sk = lin_reg.predict(X_test)

# Evaluation: MSE and R-squared
from sklearn.metrics import mean_squared_error
J_mse_sk = mean_squared_error(y_test, y_pred_sk)  # (y_true, y_pred) order
R_square_sk = lin_reg.score(X_test, y_test)

# Adjusted R-squared
n = len(y_test)
p = X_test.shape[1]
adjust_r_squared = 1 - (1 - R_square_sk) * ((n-1)/(n-p-1))
print("Adjusted Rsquared: ", adjust_r_squared)

Intercept:  7.07542806626153
Coefficients:
         Feature  Coefficient
0            age     0.033057
1            bmi     0.013706
2       children     0.101695
3       OHE_male    -0.069942
4        OHE_yes     1.547184
5  OHE_northwest    -0.054662
6  OHE_southeast    -0.145206
7  OHE_southwest    -0.135254
Adjusted R-squared:  0.7722401938297871
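
Because the target was log-transformed, these predictions are on the log scale. A minimal sketch for reading them back in the original currency units (the RMSE shown here is an added check, not part of the original results):

# Invert the earlier np.log transform with np.exp
pred_charges = np.exp(y_pred_sk)
actual_charges = np.exp(y_test)
rmse = np.sqrt(((pred_charges - actual_charges) ** 2).mean())
print("RMSE on the original charge scale:", rmse)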

Model Validation and Checks


Checking for linearity, residual normality, mean, multivariate normality, homoscedasticity, and
multicollinearity:

# Check for linearity and residual normality
f = plt.figure(figsize=(13, 5))
ax = f.add_subplot(121)
ax.scatter(y_test, y_pred_sk, color='r')
ax.set_title('Check for linearity: Actual vs. Predicted')
ax = f.add_subplot(122)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(y_test - y_pred_sk, kde=True, ax=ax, color='b')
ax.axvline((y_test - y_pred_sk).mean(), color='k', linestyle='--')
ax.set_title('Check for residual normality and mean: Residual error')
ax.set_xlabel('Residuals')

# Check for Multivariate Normality
f, ax = plt.subplots(1, 2, figsize=(13, 6))
_, (_, _, r) = sp.stats.probplot(y_test - y_pred_sk, fit=True, plot=ax[0])
ax[0].set_title('Check for Multivariate Normality: Q-Q plot')

# Check for Homoscedasticity
sns.scatterplot(x=y_pred_sk, y=(y_test - y_pred_sk), ax=ax[1], color='r')
ax[1].set_title('Check for Homoscedasticity: Residual vs. Predicted')

# Check for Multicollinearity - Variance Inflation Factor
VIF = 1 / (1 - R_square_sk)
print('VIF: ', VIF)

VIF: 4.479966203229661
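
Note that the VIF above is a single value derived from the model's overall R-squared; the standard diagnostic computes one VIF per predictor. A sketch of both a per-feature VIF (using statsmodels, assumed to be installed) and a formal Shapiro-Wilk normality test to complement the visual residual checks; neither result appears in the original notebook:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_vif = sm.add_constant(X_train)  # add an intercept column
vifs = pd.Series([variance_inflation_factor(X_vif.values, i)
                  for i in range(X_vif.shape[1])], index=X_vif.columns)
print(vifs)

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
stat, p = sp.stats.shapiro(y_test - y_pred_sk)
print("Shapiro-Wilk statistic:", stat, "p-value:", p)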
Logistic Regression for User Data
Moving on to the user data, logistic regression is performed for binary classification:

df = user_data.copy()

# Train test split
X = df.iloc[:, 2:3].values  # third column as the single feature
y = df.iloc[:, 4].values    # fifth column as the binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling: fit the scaler on the training set only,
# then apply the same transformation to the test set
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Fitting Logistic Regression on the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred = classifier.predict(X_test)

# Model Evaluation - Classification Report, Accuracy, and Confusion Matrix
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
report = classification_report(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

# Display results
print("Classification report:\n", report)
print("Accuracy: ", accuracy)

# Confusion Matrix Visualization


cm_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'],
columns=['Predicted 0', 'Predicted 1'])
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')

Classification report:
              precision    recall  f1-score   support

           0       0.89      0.97      0.93        68
           1       0.92      0.75      0.83        32

    accuracy                           0.90       100
   macro avg       0.91      0.86      0.88       100
weighted avg       0.90      0.90      0.90       100

Accuracy: 0.9

<Axes: >
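
To score a new user with the fitted classifier, the raw feature must pass through the same scaler that was fitted on the training data. A minimal sketch (the value 45 is a hypothetical input, assuming the selected feature column holds age):

new_obs = [[45]]  # same single-feature shape as X; 45 is illustrative
new_obs_scaled = sc_X.transform(new_obs)
print("P(class 1):", classifier.predict_proba(new_obs_scaled)[0, 1])
print("Predicted class:", classifier.predict(new_obs_scaled)[0])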
Conclusion
This analysis provides insights into the relationships within the insurance dataset and performs binary classification on the user data using logistic regression. The visualizations, model building, evaluation metrics, and validation checks together give a thorough understanding of both datasets.
