Exploratory Data Analysis and Modeling for
Insurance and User Data
Submitted by: Ayushkumar Parmar [21IM30005]
Introduction
This Jupyter Notebook conducts exploratory data analysis (EDA) and modeling for two datasets:
insurance and user_data. The analysis includes visualizations, a linear regression model to
predict insurance charges, and a logistic regression model for binary classification on the
user data.
Insurance Dataset
The exploration starts with loading and understanding the insurance dataset:
# Importing the libraries
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")  # suppress warnings for cleaner output

# Load the two datasets (Colab file paths)
user_data = pd.read_csv('/content/User_Data.csv')
insurance = pd.read_csv('/content/insurance.csv')
df = insurance.copy()
print('Rows and columns in the dataset:', df.shape)
df.head() # Top 5 rows
df.info()
Rows and columns in the dataset: (1338, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Fitting a Line - Charges vs. BMI
A linear regression line is fitted to understand the relationship between BMI and insurance
charges:
sns.lmplot(x='bmi', y='charges', data=df, aspect=2, height=5)
plt.xlabel('Body Mass Index (independent variable)')
plt.ylabel('Insurance Charges (dependent variable)')
plt.title('Charges vs. BMI')
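The plot alone does not report the fitted slope; a minimal sketch using
scipy.stats.linregress (scipy is already imported as sp) quantifies the line:
# Sketch: quantify the fitted line shown above
res = sp.stats.linregress(df['bmi'], df['charges'])
print(f"slope={res.slope:.2f}, intercept={res.intercept:.2f}, r={res.rvalue:.3f}")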
Exploratory Data Analysis (EDA)
EDA includes descriptive statistics, checking for missing values, and exploring correlations:
df.describe()
# Check for missing values
missing_values = df.isnull()
# Summarize the missing values
missing_values_count = missing_values.sum()
# Display the result
print("Missing values in each column:")
print(missing_values_count)
# Correlation heatmap (numeric_only=True is required on pandas >= 2.0,
# since df contains object columns)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap='viridis', annot=True)
Missing values in each column:
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
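To read the heatmap numerically, the same correlations can be ranked against the
target (a small sketch reusing the corr matrix computed above):
# Sketch: rank numeric features by correlation with charges
print(corr['charges'].sort_values(ascending=False))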
Distribution and Relationships Visualization
Visualizations include distribution plots, violin plots, box plots, and scatter plots:
# Distribution of insurance charges (raw and log10-transformed);
# sns.distplot is deprecated, so histplot with kde=True is used instead
f = plt.figure(figsize=(12, 4))
sns.histplot(df['charges'], bins=50, color='r', kde=True, ax=f.add_subplot(121))
sns.histplot(np.log10(df['charges']), bins=40, color='b', kde=True,
             ax=f.add_subplot(122))
f.suptitle('Distribution of insurance charges')
# Violin plots of charges by sex and by smoker status
f = plt.figure(figsize=(13, 6))
sns.violinplot(x='sex', y='charges', data=df, palette='Wistia',
               ax=f.add_subplot(121))
sns.violinplot(x='smoker', y='charges', data=df, palette='magma',
               ax=f.add_subplot(122))
# Box plot of charges by number of children, split by sex
plt.figure(figsize=(13, 6))
sns.boxplot(x='children', y='charges', hue='sex', data=df, palette='rainbow')

# Scatter plots of charges vs. age and BMI, colored by smoker status
f = plt.figure(figsize=(13, 6))
sns.scatterplot(x='age', y='charges', data=df, hue='smoker',
                palette='magma', ax=f.add_subplot(121))
sns.scatterplot(x='bmi', y='charges', data=df, hue='smoker',
                palette='viridis', ax=f.add_subplot(122))
Dummy Variable and Log Transform
Creating dummy variables for categorical columns and log-transforming the charges:
# One-hot encode the categorical columns, dropping the first level of each
# to avoid the dummy variable trap
categorical_columns = ['sex', 'smoker', 'region']
df_encode = pd.get_dummies(data=df, prefix='OHE', prefix_sep='_',
                           columns=categorical_columns,
                           drop_first=True, dtype='int8')

# Log-transform charges to reduce right skew
df_encode['charges'] = np.log(df_encode['charges'])
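As a quick numeric check of why the transform helps, skewness can be compared
before and after (a minimal sketch using scipy's skew via the existing sp import;
df still holds the untransformed copy):
# Sketch: the log transform tames the right skew of charges
print('skew of charges     :', sp.stats.skew(df['charges']))
print('skew of log(charges):', sp.stats.skew(np.log(df['charges'])))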
Linear Regression Model
Building and evaluating a linear regression model:
# Train-test split
from sklearn.model_selection import train_test_split
X = df_encode.drop('charges', axis=1)  # independent variables
y = df_encode['charges']               # dependent variable (log scale)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=23)

# Model building
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Get the coefficients and the intercept
coefficients = lin_reg.coef_
intercept = lin_reg.intercept_
coefficients_df = pd.DataFrame({'Feature': X_train.columns,
                                'Coefficient': coefficients})
print("Intercept: ", intercept)
print("Coefficients:")
print(coefficients_df)
# Predict on the test set
y_pred_sk = lin_reg.predict(X_test)

# Evaluation: MSE and R-squared
from sklearn.metrics import mean_squared_error
J_mse_sk = mean_squared_error(y_test, y_pred_sk)  # (y_true, y_pred) order
R_square_sk = lin_reg.score(X_test, y_test)

# Adjusted R-squared
n = len(y_test)        # number of test observations
p = X_test.shape[1]    # number of predictors
adjusted_r_squared = 1 - (1 - R_square_sk) * ((n - 1) / (n - p - 1))
print("Adjusted R-squared: ", adjusted_r_squared)
Intercept:  7.07542806626153
Coefficients:
         Feature  Coefficient
0            age     0.033057
1            bmi     0.013706
2       children     0.101695
3       OHE_male    -0.069942
4        OHE_yes     1.547184
5  OHE_northwest    -0.054662
6  OHE_southeast    -0.145206
7  OHE_southwest    -0.135254
Adjusted R-squared:  0.7722401938297871
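Since the model was trained on log(charges), each coefficient acts multiplicatively
on the original charge scale: exp(beta) is the factor change in charges per unit
increase of the feature. A short sketch using the objects defined above:
# Back-transform log-scale predictions to original currency units
charges_pred = np.exp(y_pred_sk)
# Multiplicative effect of each feature on charges
effects = pd.Series(np.exp(coefficients_df['Coefficient'].values),
                    index=coefficients_df['Feature'])
print(effects)
# e.g. exp(1.547) is about 4.70: smokers' expected charges are roughly
# 4.7 times those of otherwise comparable non-smokers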
Model Validation and Checks
Checking for linearity, residual normality and zero mean, multivariate normality
(Q-Q plot), homoscedasticity, and multicollinearity:
# Check for linearity and residual normality
f = plt.figure(figsize=(13, 5))
ax = f.add_subplot(121)
ax.scatter(y_test, y_pred_sk, color='r')
ax.set_title('Check for linearity: Actual vs. Predicted')
ax = f.add_subplot(122)
sns.histplot(y_test - y_pred_sk, kde=True, ax=ax, color='b')  # distplot is deprecated
ax.axvline((y_test - y_pred_sk).mean(), color='k', linestyle='--')
ax.set_title('Check for residual normality and mean: Residual error')
ax.set_xlabel('Residuals')
# Check for multivariate normality with a Q-Q plot
f, ax = plt.subplots(1, 2, figsize=(13, 6))
_, (_, _, r) = sp.stats.probplot(y_test - y_pred_sk, fit=True, plot=ax[0])
ax[0].set_title('Check for Multivariate Normality: Q-Q plot')

# Check for homoscedasticity
sns.scatterplot(x=y_pred_sk, y=(y_test - y_pred_sk), ax=ax[1], color='r')
ax[1].set_title('Check for Homoscedasticity: Residual vs. Predicted')

# Check for multicollinearity: a single model-level Variance Inflation Factor
# derived from the overall R-squared (a per-feature check follows below)
VIF = 1 / (1 - R_square_sk)
print('VIF: ', VIF)
VIF: 4.479966203229661
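The value above is a model-level VIF; the standard diagnostic is a per-feature VIF.
A minimal sketch, assuming statsmodels is installed (it is not imported elsewhere
in this notebook):
# Sketch: per-feature VIF (assumes statsmodels is available)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = sm.add_constant(X_train.astype(float))  # add intercept column
vif_df = pd.DataFrame({
    'Feature': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i)
            for i in range(X_vif.shape[1])]
})
print(vif_df)  # rule of thumb: values above 5-10 flag multicollinearity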
Logistic Regression for User Data
Moving on to the user data, logistic regression is performed for binary classification:
df = user_data.copy()

# Train-test split: a single feature column (index 2) and the binary
# target column (index 4)
X = df.iloc[:, 2:3].values
y = df.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Feature scaling: fit the scaler on training data only to avoid leakage
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Fit logistic regression on the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)
# Model evaluation: classification report, accuracy, and confusion matrix
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
report = classification_report(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
# Display results
print("Classification report:\n", report)
print("Accuracy: ", accuracy)
# Confusion matrix visualization
cm_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'],
                     columns=['Predicted 0', 'Predicted 1'])
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.97      0.93        68
           1       0.92      0.75      0.83        32

    accuracy                           0.90       100
   macro avg       0.91      0.86      0.88       100
weighted avg       0.90      0.90      0.90       100

Accuracy:  0.9
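As a usage sketch, the fitted pipeline can score a new observation; the raw value
(45 here is an arbitrary example) must pass through the same fitted scaler before
prediction:
# Sketch: score a new observation (45 is an arbitrary raw feature value)
x_new = sc_X.transform([[45]])
print(classifier.predict(x_new))        # hard class label (0 or 1)
print(classifier.predict_proba(x_new))  # [[P(class 0), P(class 1)]]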
Conclusion
This analysis identifies the main drivers of insurance charges, most notably smoker
status, whose coefficient dominates the linear model (adjusted R-squared of about
0.77 on the log-transformed target), and builds a logistic regression classifier
that reaches 90% accuracy on the user data. Together, the visualizations, model
building, and diagnostic checks give a thorough picture of both datasets.