Sla4a 21im30005
Introduction
This Jupyter Notebook conducts exploratory data analysis (EDA) and modeling for two datasets:
insurance and user_data. The analysis includes visualizations, linear regression modeling for
insurance charges prediction, and logistic regression modeling for binary classification in the
user data.
Insurance Dataset
The exploration starts with loading and understanding the insurance dataset:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")
user_data = pd.read_csv('/content/User_Data.csv')
insurance = pd.read_csv('/content/insurance.csv')
df = insurance.copy()
df.describe()
# Correlation heatmap (numeric columns only)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap='viridis', annot=True)
# Violin plots
f = plt.figure(figsize=(13,6))
sns.violinplot(x='sex', y='charges', data=df, palette='Wistia',
ax=f.add_subplot(121))
sns.violinplot(x='smoker', y='charges', data=df, palette='magma',
ax=f.add_subplot(122))
# Box plot
plt.figure(figsize=(13,6))
sns.boxplot(x='children', y='charges', hue='sex', data=df,
palette='rainbow')
# Scatter plots
f = plt.figure(figsize=(13,6))
sns.scatterplot(x='age', y='charges', data=df, palette='magma',
hue='smoker', ax=f.add_subplot(121))
sns.scatterplot(x='bmi', y='charges', data=df, palette='viridis',
hue='smoker', ax=f.add_subplot(122))
# Dummy variable
categorical_columns = ['sex', 'smoker', 'region']
df_encode = pd.get_dummies(data=df, prefix='OHE', prefix_sep='_',
columns=categorical_columns,
drop_first=True,
dtype='int8')
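To see what drop_first=True produces, here is a tiny illustrative frame (the values are made up, not taken from the insurance data): the first category of each column is dropped, leaving one indicator per remaining category.

```python
import pandas as pd

# Toy frame mirroring one of the categorical columns in the insurance data
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "charges": [100.0, 200.0, 150.0],
})

encoded = pd.get_dummies(toy, prefix="OHE", prefix_sep="_",
                         columns=["sex"], drop_first=True, dtype="int8")

# 'female' (the first category alphabetically) is dropped,
# so only an OHE_male indicator remains alongside charges
print(encoded.columns.tolist())  # ['charges', 'OHE_male']
```

Dropping the first level avoids the dummy-variable trap: with an intercept in the model, a full set of indicators would be perfectly collinear.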
# Log transform
df_encode['charges'] = np.log(df_encode['charges'])
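The log transform is used because medical charges are typically right-skewed. A quick sketch on synthetic lognormal data (a stand-in for the real charges column; the distribution parameters are illustrative) shows how the transform pulls skewness toward zero:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "charges"; lognormal mimics cost-like data
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9, sigma=0.9, size=1000))

# Skewness drops sharply after the log transform
print("skew before:", charges.skew())
print("skew after :", np.log(charges).skew())
```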
# Model building: split features/target, then fit
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df_encode.drop(columns='charges')
y = df_encode['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)
lin_reg = LinearRegression().fit(X_train, y_train)
# Get the coefficients and the intercepts
coefficients = lin_reg.coef_
intercept = lin_reg.intercept_
# Print them
coefficients_df = pd.DataFrame({'Feature': X_train.columns,
'Coefficient': coefficients})
print("Intercept: ", intercept)
print("Coefficients:")
print(coefficients_df)
# Adjusted R-squared on the test set
R_square_sk = lin_reg.score(X_test, y_test)
n = len(y_test)
p = X_test.shape[1]
adjust_r_squared = 1 - (1 - R_square_sk) * ((n - 1) / (n - p - 1))
print("Adjusted Rsquared: ", adjust_r_squared)
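The adjusted R-squared formula can be checked end to end on synthetic data; everything here (feature count, coefficients, random seed) is illustrative, not taken from the insurance model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic linear data: 3 features with known coefficients plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)
n, p = X_te.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adjusted R^2 penalizes for the number of predictors, so it is <= R^2
print(r2, adj_r2)
```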
Intercept: 7.07542806626153
Coefficients:
Feature Coefficient
0 age 0.033057
1 bmi 0.013706
2 children 0.101695
3 OHE_male -0.069942
4 OHE_yes 1.547184
5 OHE_northwest -0.054662
6 OHE_southeast -0.145206
7 OHE_southwest -0.135254
Adjusted Rsquared: 0.7722401938297871
VIF: 4.479966203229661
Logistic Regression for User Data
Moving on to the user data, logistic regression is performed for binary classification:
df = user_data.copy()
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assuming the common User_Data.csv schema: Age, EstimatedSalary, Purchased
X = df[['Age', 'EstimatedSalary']].values
y = df['Purchased'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Feature Scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fit, predict, and display results
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
y_pred = LogisticRegression().fit(X_train, y_train).predict(X_test)
print("Classification report:\n", classification_report(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))
Classification report:
precision recall f1-score support
Accuracy: 0.9
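For completeness, the whole logistic-regression pipeline can be sketched on synthetic two-feature data, a stand-in for the Age and EstimatedSalary features; all names and parameters below are illustrative, so the scores will differ from the 0.9 above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data with two informative features
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Scale using statistics from the training split only
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Fitting the scaler on the training split alone, as the notebook also does, prevents test-set information from leaking into the features.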
Conclusion
This comprehensive analysis provides insights into the relationships within the insurance
dataset and performs a binary classification on the user data using logistic regression.
Visualizations, model building, and evaluations contribute to a thorough understanding of the
datasets.