Assignment Questions
Q1. DAD hospital wants to understand the key factors influencing its
treatment costs. The hospital wants to offer treatment packages
(fixed-price contracts) to patients at the time of admission. Can the
hospital build a model using historical data to estimate the cost of
treatment?
Ans 1.
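# Yes: historical records can be used to fit a regression model with total cost as the target.
import pandas as pd
import numpy as np
data = pd.read_csv("hospital_data.csv")
# Separate the predictors from the target variable
X = data.drop('total_cost', axis=1)
y = data['total_cost']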
Q2. Build a correlation matrix between all the numeric features in the
dataset. Report the features which are correlated at a cut-off of 0.70.
What actions will you take on the features which are highly
correlated?
Ans 2.
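import seaborn as sns
import matplotlib.pyplot as plt
numeric_features = data.select_dtypes(include=[np.number])
corr_matrix = numeric_features.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.show()
# Keep only the upper triangle so each pair appears once and the diagonal (always 1.0) is excluded
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
highly_correlated_features = pairs[pairs.abs() >= 0.7]
print(f"Highly correlated feature pairs (|r| >= 0.7):\n{highly_correlated_features}")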
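One common action for the last part of Q2 is to keep one feature from each highly correlated pair and drop the other. The sketch below illustrates this on synthetic data. The column names `length_of_stay` and `icu_days`, the data, and the helpers `correlated_pairs` and `drop_correlated` are hypothetical and not taken from the hospital dataset.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, cutoff=0.70):
    """Return (feature_a, feature_b, r) for every pair with |r| >= cutoff."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so no self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= cutoff:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

def drop_correlated(df, cutoff=0.70):
    """Drop the second feature of every highly correlated pair."""
    to_drop = {b for _, b, _ in correlated_pairs(df, cutoff)}
    return df.drop(columns=sorted(to_drop))

# Synthetic data (assumption): icu_days is constructed to track length_of_stay closely
rng = np.random.default_rng(0)
length_of_stay = rng.normal(10, 3, 200)
toy = pd.DataFrame({
    "length_of_stay": length_of_stay,
    "icu_days": 0.5 * length_of_stay + rng.normal(0, 0.3, 200),
    "age": rng.normal(40, 15, 200),
})

pairs = correlated_pairs(toy)
reduced = drop_correlated(toy)
print(pairs)
print(list(reduced.columns))
```

Which member of a pair to keep is a judgment call. Keeping the feature that is easier to collect at admission, or more strongly related to the target, is often preferred over simply dropping the second one as this sketch does.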
Q3. Select the features that can be used to build a model to estimate
the cost to the hospital.
Ans 3.
selected_features = ['age', 'admission_days', 'diagnosis_code',
'hospital_type', 'previous_conditions']
X_selected = X[selected_features]
Q4. Identify which features are numerical and which are categorical.
Create a new DataFrame with the selected numeric features and
categorical features. Encode the categorical features and create
dummy features.
Ans 4.
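numeric_features = X_selected.select_dtypes(include=[np.number])
categorical_features = X_selected.select_dtypes(exclude=[np.number])
# One-hot encode the categorical columns; drop_first removes one level per feature
# and dtype=float keeps the dummies numeric for the statsmodels steps that follow
X_encoded = pd.get_dummies(X_selected, drop_first=True, dtype=float)
X_encoded.head()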
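A numeric dtype does not always mean a numeric feature: a code column stored as integers is categorical in meaning and would be missed by a dtype-based split. The sketch below shows one way to handle this on a toy frame. The column names come from Ans 3, but the values, and the assumption that `diagnosis_code` is stored as integers, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption): values are invented purely for illustration
toy = pd.DataFrame({
    "age": [34, 58, 41, 67],
    "admission_days": [3, 10, 5, 8],
    "diagnosis_code": [101, 205, 101, 330],  # integer codes, but categorical in meaning
    "hospital_type": ["public", "private", "public", "private"],
})

# Treat the integer-coded column as categorical before splitting by dtype
toy["diagnosis_code"] = toy["diagnosis_code"].astype("category")

numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

# One-hot encode only the categorical columns; drop_first removes one level per feature
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=float)

print(numeric_cols)
print(categorical_cols)
print(list(encoded.columns))
```

Whether a given column in the real dataset needs this cast depends on how it is stored, which should be checked with `data.dtypes` before encoding.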
Q5. Which features show symptoms of multicollinearity and need
to be removed from the model?
Ans 5.
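import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Include an intercept column so the VIFs are not inflated by uncentered features
X_vif = sm.add_constant(X_encoded.astype(float))
vif_data = pd.DataFrame()
vif_data["Feature"] = X_vif.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif_data = vif_data[vif_data["Feature"] != "const"]
print(vif_data)
# Drop features whose VIF exceeds the common rule-of-thumb threshold of 10
X_encoded = X_encoded.drop(vif_data.loc[vif_data['VIF'] > 10, 'Feature'], axis=1)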
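For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below computes this directly with NumPy on synthetic data so that the threshold of 10 can be seen in action. The helper `vif_table`, the column names, and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = df.to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(df.columns):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        rows.append((name, 1.0 / (1.0 - r2)))
    return pd.DataFrame(rows, columns=["Feature", "VIF"])

# Synthetic data (assumption): a_plus_b is nearly a linear combination of a and b
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),  # independent of the other columns
})

vif = vif_table(toy)
high = vif.loc[vif["VIF"] > 10, "Feature"].tolist()
print(vif)
print(high)
```

Note that all three related columns exceed the threshold here, although removing only one of them would resolve the collinearity. Dropping the single highest-VIF feature and recomputing, rather than dropping everything above 10 at once, is a more conservative alternative.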
Q6. Find the outliers in the dataset using Z-score and Cook’s distance.
If required, remove the observations from the dataset.
Ans 6.
import statsmodels.api as sm
from scipy.stats import zscore
from statsmodels.stats.outliers_influence import OLSInfluence
# Flag a row if any feature lies more than 3 standard deviations from its mean
z_scores = np.abs(zscore(X_encoded.astype(float)))
outliers = (z_scores > 3).any(axis=1)
print(f"Number of outliers detected: {outliers.sum()}")
X_clean = X_encoded[~outliers]
y_clean = y[~outliers]
model = sm.OLS(y_clean, sm.add_constant(X_clean.astype(float))).fit()
influence = OLSInfluence(model)
cooks_d = influence.cooks_distance[0]
# Common rule of thumb: Cook's distance above 4/n marks an influential observation
outliers_cooks = cooks_d > 4 / len(X_clean)
print(f"Number of outliers detected by Cook’s Distance: {outliers_cooks.sum()}")
X_clean = X_clean[~outliers_cooks]
y_clean = y_clean[~outliers_cooks]
Q7. Split the data into a training set and a test set. Use 80% of the data
for model training and 20% for model testing.
Ans 7.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean,
test_size=0.2, random_state=42)
Q8. Build a regression model with statsmodels.api to estimate the total
cost to the hospital. How do you interpret the model outcome?
Q9. Which features are statistically significant in predicting the total
cost to the hospital?
Q10. Build a linear regression model with the significant features and
report model performance.
Q11. Conduct residual analysis using a P-P plot to find out if the model is
valid.
Q12. Predict the total cost using the test set and report the RMSE of the
model.