Heart Failure Prediction
This project aims to predict the occurrence of heart failure using multiple classification
algorithms.
import pandas as pd

# os.getcwd()
# os.chdir('your directory goes here')
df = pd.read_csv('heart.csv')
Out[3]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
In [5]: df.describe()
# this section mainly shows the overall value distribution of each variable
Feature Information
age: age in years
Out[6]: age 41
sex 2
cp 4
trestbps 49
chol 152
fbs 2
restecg 3
thalach 91
exang 2
oldpeak 40
slope 3
ca 5
thal 4
target 2
dtype: int64
Out[8]: 723
Visualizing data
1. Descriptive statistics
# it seems that all the categorical vars are associated with HF occurrence,
# but some illegal values (outliers) occur in ca (ca = 4) and thal (thal = 0)
1. Some illegal values (outliers) occur in ca (ca = 4) and thal (thal = 0).
2. Because of the severe class imbalance in restecg, I decided to merge categories 1 and
2 after checking the interpretation of the resting electrocardiographic results.
These are the issues that need to be solved during data preprocessing.
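The two preprocessing issues listed above could be handled along the following lines. This is a minimal sketch on a toy frame, not the notebook's actual cell: the replacement targets for ca and thal follow the notebook's own preprocessing comments, while folding restecg = 2 into restecg = 1 is an assumption about how the merge was carried out.

```python
import pandas as pd

# Toy stand-in for the heart dataset (values are illustrative only)
toy = pd.DataFrame({
    'ca':      [0, 4, 2, 1],
    'thal':    [0, 2, 3, 1],
    'restecg': [0, 1, 2, 1],
})

# Map the illegal codes onto the nearest legal category instead of dropping rows
toy['ca'] = toy['ca'].replace(4, 3)
toy['thal'] = toy['thal'].replace(0, 1)

# Assumption: the rare restecg category 2 is folded into category 1
toy['restecg'] = toy['restecg'].replace(2, 1)

print(toy)
```

Replacing values rather than deleting rows keeps the sample size unchanged, which matters for a dataset of this size.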
1. The distributions of age and thalach are slightly negatively skewed, while those of trestbps and
chol are positively skewed to different extents. The distribution of oldpeak, however, is almost
exponential.
2. There are no absolute outliers given the medical use case.
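To put a number on the skew described above, one option is the sample skewness (the third standardized moment). The sketch below is a pure-Python illustration on made-up values; the helper name `sample_skewness` is hypothetical and not part of the notebook. With pandas, `df[col].skew()` reports a closely related, bias-corrected estimate.

```python
from statistics import fmean

def sample_skewness(values):
    """Skewness as the mean of cubed z-scores (no bias correction)."""
    n = len(values)
    mean = fmean(values)
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return sum(((v - mean) / std) ** 3 for v in values) / n

symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 1, 6]   # long tail to the right, like oldpeak
left_tailed = [-6, -1, 0, 0, 0]  # mirror image, long tail to the left

print(sample_skewness(symmetric))     # approximately zero
print(sample_skewness(right_tailed))  # positive
print(sample_skewness(left_tailed))   # negative
```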
# it seems that 'oldpeak' and 'slope' are the pair most strongly correlated with each other;
# both are values derived from the ECG (electrocardiogram).
# however, the correlation is not strong enough to be a concern
Feature Preprocessing
In [16]: # handle outliers (without deleting rows)
# some illegal values (outliers) occur in ca (ca = 4) and thal (thal = 0)
# df['ca'] == 4 -> 3; df['thal'] == 0 -> 1
df['ca'] = df['ca'].replace(4,3)
df['thal'] = df['thal'].replace(0, 1)
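cat_var = ['sex', 'fbs', 'exang', 'target', 'cp', 'restecg', 'slope', 'ca', 'thal']
df[cat_var] = df[cat_var].astype('object')  # assumed cast, consistent with the dtypes reported below
df.info()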
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1025 non-null int64
1 sex 1025 non-null object
2 cp 1025 non-null object
3 trestbps 1025 non-null int64
4 chol 1025 non-null int64
5 fbs 1025 non-null object
6 restecg 1025 non-null object
7 thalach 1025 non-null int64
8 exang 1025 non-null object
9 oldpeak 1025 non-null float64
10 slope 1025 non-null object
11 ca 1025 non-null object
12 thal 1025 non-null object
13 target 1025 non-null object
dtypes: float64(1), int64(4), object(9)
memory usage: 112.2+ KB
df.head()
Out[18]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
In [19]: # multi-category variables need one-hot encoding (transform them into
from sklearn.preprocessing import OneHotEncoder
enc_ohe = OneHotEncoder()
enc_ohe.fit(df[multi_cat])
df = OneHotEncoding(df, enc_ohe, multi_cat)
In [20]: df.info()
# 'cp', 'slope', 'ca', 'thal' are encoded as dummy vars
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1025 non-null int64
1 sex 1025 non-null float64
2 trestbps 1025 non-null int64
3 chol 1025 non-null int64
4 fbs 1025 non-null float64
5 restecg 1025 non-null float64
6 thalach 1025 non-null int64
7 exang 1025 non-null float64
8 oldpeak 1025 non-null float64
9 target 1025 non-null float64
10 cp_0 1025 non-null float64
11 cp_1 1025 non-null float64
12 cp_2 1025 non-null float64
13 cp_3 1025 non-null float64
14 slope_0 1025 non-null float64
15 slope_1 1025 non-null float64
16 slope_2 1025 non-null float64
17 ca_0 1025 non-null float64
18 ca_1 1025 non-null float64
19 ca_2 1025 non-null float64
20 ca_3 1025 non-null float64
21 thal_1 1025 non-null float64
22 thal_2 1025 non-null float64
23 thal_3 1025 non-null float64
dtypes: float64(20), int64(4)
memory usage: 192.3 KB
df[numeric_var] = scaler.transform(df[numeric_var])
df.head()
Out[21]: age sex trestbps chol fbs restecg thalach exang oldpeak target
0 -0.268437 1.0 -0.377636 -0.659332 0.0 1.0 0.821321 0.0 -0.060888 0.0
1 -0.158157 1.0 0.479107 -0.833861 1.0 0.0 0.255968 1.0 1.727137 0.0
2 1.716595 1.0 0.764688 -1.396233 0.0 1.0 -1.048692 1.0 1.301417 0.0
3 0.724079 1.0 0.936037 -0.833861 0.0 1.0 0.516900 0.0 -0.912329 0.0
4 0.834359 0.0 0.364875 0.930822 1.0 1.0 -1.874977 0.0 0.705408 0.0
5 rows × 24 columns
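The `scaler` and `numeric_var` objects used above are defined in a cell that is not part of this export. The scaled values look like z-scores, so the sketch below assumes a scikit-learn `StandardScaler` fitted on the five continuous columns; both the scaler type and the column list are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed list of continuous columns (the non-categorical ones in df.info())
numeric_var = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Toy rows, illustrative values only
toy = pd.DataFrame({
    'age': [52, 53, 70, 61],
    'trestbps': [125, 140, 145, 148],
    'chol': [212, 203, 174, 203],
    'thalach': [168, 155, 125, 161],
    'oldpeak': [1.0, 3.1, 2.6, 0.0],
}).astype(float)

scaler = StandardScaler()
scaler.fit(toy[numeric_var])
toy[numeric_var] = scaler.transform(toy[numeric_var])

# After scaling, each column has mean 0 and (population) standard deviation 1
print(toy[numeric_var].mean().round(6))
print(toy[numeric_var].std(ddof=0).round(6))
```

In a train/test workflow the scaler would normally be fitted on the training portion only and then applied to both parts.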
print('training data has ' + str(x_train.shape[0]) + ' observation with ' + str(x_t
print('test data has ' + str(x_test.shape[0]) + ' observation with ' + str(x_test.s
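The split that produced `x_train`, `x_test`, `y_train`, and `y_test` is not included in this export. One common way to obtain them is scikit-learn's `train_test_split`; the sketch below uses synthetic data, and the `test_size`, `stratify`, and `random_state` settings are assumptions rather than the notebook's actual values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 observations, 10 features, binary target
x, y = make_classification(n_samples=200, n_features=10, random_state=0)

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.1,      # assumed hold-out fraction
    stratify=y,         # keep the class ratio similar in both parts
    random_state=0,
)

print(f'training data has {x_train.shape[0]} observations with {x_train.shape[1]} features')
print(f'test data has {x_test.shape[0]} observations with {x_test.shape[1]} features')
```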
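from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Logistic Regression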
classifier_logistic = LogisticRegression()
# K Nearest Neighbors
classifier_KNN = KNeighborsClassifier()
# Random Forest
classifier_RF = RandomForestClassifier()
# Support Vector Classification
classifier_SVC = SVC(probability=True)
# GB classifier
classifier_GB = GradientBoostingClassifier()
# cross validation
scores = model_selection.cross_val_score(classifier_logistic, x_train, y_train, cv
print(f'For Logistic Regressional Classifier, the acc is {round(scores.mean() * 100
({round(scores.mean() * 100 - scores.std() * 100 * 1.96, 2)}\
~ {round(scores.mean() * 100, 2) + round(scores.std() * 100 * 1.96, 2)}) %')
# Confusion Matrix
cm = metrics.confusion_matrix(y_train, y_predict)
plt.matshow(cm)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(metrics.classification_report(y_train, y_predict))
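The interval printed in these cells is the mean cross-validation accuracy plus or minus 1.96 standard deviations of the fold scores. A small helper can keep that arithmetic in one place and round the upper bound once, after the addition, so that no floating-point residue shows up in the printed value. The function name `summarize_cv` and the example scores are hypothetical.

```python
from statistics import fmean, pstdev

def summarize_cv(scores, z=1.96):
    """Return (mean, lower, upper) in percent, each rounded to 2 decimals."""
    mean = fmean(scores) * 100
    spread = pstdev(scores) * 100 * z   # population std, like numpy's default
    return round(mean, 2), round(mean - spread, 2), round(mean + spread, 2)

fold_scores = [0.80, 0.85, 0.90, 0.85, 0.80]  # made-up fold accuracies
mean, lower, upper = summarize_cv(fold_scores)
print(f'acc is {mean} ({lower} ~ {upper}) %')
```

The same helper could be called with the array returned by `model_selection.cross_val_score` for each classifier.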
KNN Classifier
In [25]: #@title KNN Classifier
classifier_KNN.fit(x_train, y_train) # train model
y_predict = classifier_KNN.predict(x_train) # predict results
# cross validation
scores = model_selection.cross_val_score(classifier_KNN, x_train, y_train, cv = 10)
print(f'For KNN, the acc is {round(scores.mean() * 100, 2)} \
({round(scores.mean() * 100 - scores.std() * 100 * 1.96, 2)}\
~ {round(scores.mean() * 100, 2) + round(scores.std() * 100 * 1.96, 2)}) %')
# Confusion Matrix
cm = metrics.confusion_matrix(y_train, y_predict)
plt.matshow(cm)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(metrics.classification_report(y_train, y_predict))
For KNN, the acc is 82.86 (74.92 ~ 90.78999999999999) %
Random Forest
# cross validation
scores = model_selection.cross_val_score(classifier_RF, x_train, y_train, cv = 10)
print(f'For RF, the acc is {round(scores.mean() * 100, 2)} \
({round(scores.mean() * 100 - scores.std() * 100 * 1.96, 2)}\
~ {round(scores.mean() * 100, 2) + round(scores.std() * 100 * 1.96, 2)}) %')
# Confusion Matrix
cm = metrics.confusion_matrix(y_train, y_predict)
plt.matshow(cm)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(metrics.classification_report(y_train, y_predict))
# Every observation in the training dataset is classified correctly; is that overfitting?
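One way to probe the question raised in the comment above is to compare accuracy on the training data with the cross-validated accuracy of the same model: a large gap suggests the model is memorizing the training set. The sketch below uses synthetic data, so the numbers it prints say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with some label noise (flip_y) so that memorization is visible
x, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(x, y)

train_acc = clf.score(x, y)                        # accuracy on data the model has seen
cv_acc = cross_val_score(clf, x, y, cv=10).mean()  # accuracy on held-out folds

print(f'train acc: {train_acc:.3f}, cross-validated acc: {cv_acc:.3f}')
print(f'gap: {train_acc - cv_acc:.3f}')
```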
SVC
# cross validation
scores = model_selection.cross_val_score(classifier_SVC, x_train, y_train, cv = 10)
print(f'For SVC, the acc is {round(scores.mean() * 100, 2)} \
({round(scores.mean() * 100 - scores.std() * 100 * 1.96, 2)}\
~ {round(scores.mean() * 100, 2) + round(scores.std() * 100 * 1.96, 2)}) %')
# Confusion Matrix
cm = metrics.confusion_matrix(y_train, y_predict)
plt.matshow(cm)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(metrics.classification_report(y_train, y_predict))
GB Classifier
# cross validation
scores = model_selection.cross_val_score(classifier_GB, x_train, y_train, cv = 10)
print(f'For GB Classifier, the acc is {round(scores.mean() * 100, 2)} \
({round(scores.mean() * 100 - scores.std() * 100 * 1.96, 2)}\
~ {round(scores.mean() * 100, 2) + round(scores.std() * 100 * 1.96, 2)}) %')
# Confusion Matrix
cm = metrics.confusion_matrix(y_train, y_predict)
plt.matshow(cm)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(metrics.classification_report(y_train, y_predict))
Naive Bayes
In [29]: classifier_NB.fit(x_train, y_train, sample_weight=None) # train model
y_predict = classifier_NB.predict(x_train) # predict results
# cross validation
scores = model_selection.cross_val_score(classifier_NB, x_train, y_train, cv = 10)
print(f'For Naive Bayes Classifier, the acc is {round(scores.mean() * 100, 2)} \
({round(scores.mean() * 100 - scores.std() * 100 * 1.96, 2)}\
~ {round(scores.mean() * 100, 2) + round(scores.std() * 100 * 1.96, 2)}) %')
# Confusion Matrix
cm = metrics.confusion_matrix(y_train, y_predict)
plt.matshow(cm)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(metrics.classification_report(y_train, y_predict))
Optimize Hyperparameters
In [31]: parameters = {
'penalty':('l2','l1'),
'C':(0.036, 0.037, 0.038, 0.039, 0.040, 0.041, 0.042)
}
Grid_LR = GridSearchCV(LogisticRegression(solver='liblinear'),parameters, cv = 10)
Grid_LR.fit(x_train, y_train)
best_LR_model.predict(x_test)
print('The test acc of the "best" model for logistic regression is', best_LR_model.
# mapping the relationship between each parameter and the corresponding acc
LR_models = pd.DataFrame(Grid_LR.cv_results_)
res = (LR_models.pivot(index='param_penalty', columns='param_C', values='mean_test_score')
)
_ = sns.heatmap(res, cmap='viridis')
The test acc of the "best" model for logistic regression is 86.40776699029125 %
C:\Users\Raymo\AppData\Local\Temp\ipykernel_16328\314962870.py:10: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)
res = (LR_models.pivot(index='param_penalty', columns='param_C', values='mean_test_score')
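`best_LR_model` is used in the grid-search cell without its definition appearing in this export. With `GridSearchCV`, the refitted best model is available as the `best_estimator_` attribute, so a plausible reconstruction looks like the sketch below. It runs on synthetic data with an illustrative grid, and the assignment to `best_LR_model` is an assumption about the missing line.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

x, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

parameters = {
    'penalty': ('l2', 'l1'),
    'C': (0.01, 0.1, 1.0),   # illustrative grid, coarser than the notebook's
}
# liblinear supports both the l1 and l2 penalties
Grid_LR = GridSearchCV(LogisticRegression(solver='liblinear'), parameters, cv=5)
Grid_LR.fit(x_train, y_train)

best_LR_model = Grid_LR.best_estimator_   # assumed definition of the missing variable
print('best parameters:', Grid_LR.best_params_)
print('test acc:', best_LR_model.score(x_test, y_test) * 100, '%')
```

The same pattern would apply to the KNN, RF, SVC, GB, and NB searches, each with its own parameter grid.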
Model 2 - KNN Model
In [33]: # timing
start = time.time()
end = time.time()
print(f'For KNN, it took {(end - start)/(9 * 2 * 7)} seconds per parameter attempt'
best_KNN_model.predict(x_test)
print('The test acc of the "best" model for KNN is', best_KNN_model.score(x_test, y
# too many dimensions to map the relationship between the hyperparameters and acc...
Model 3 - RF
In [35]: # timing
start = time.time()
end = time.time()
print(f'For Random Forest, it took {(end - start)/(6 * 4)} seconds per parameter at
best_RF_model.predict(x_test)
print('The test acc of the "best" model for RF is', best_RF_model.score(x_test, y_t
Model 4 - SVC
In [37]: # timing
start = time.time()
best_SVC_model.predict(x_test)
print('The test acc of the "best" model for SVC is', best_SVC_model.score(x_test, y
Model 5 - GB Classifier
warnings.warn(some_fits_failed_message, FitFailedWarning)
C:\Users\Raymo\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:952: UserWarning: One or more of the test scores are non-finite: [0.95549322 0.94788453 nan 0.95004675 0.94788453 nan
0.94244974 0.95005844 nan 0.95220898 0.94897148 nan
0.94788453 0.95331931 nan 0.9500935 0.95440626 nan
0.94462366 0.953331 nan 0.94352501 0.9479079 nan
0.94460028 0.95115708 nan 0.94789621 0.95661524 nan
0.95875409 0.95552828 nan 0.95658018 0.95552828 nan
0.94465872 0.95980598 nan 0.95010519 0.9619799 nan
0.95008181 0.95981767 nan 0.94458859 0.95765545 nan
0.94358345 0.95545816 nan 0.94680926 0.95330762 nan
0.95220898 0.94249649 nan 0.9544647 0.94250818 nan
0.94578074 0.94251987 nan 0.95765545 0.96097475 nan
0.95548153 0.9544647 nan 0.95549322 0.95556335 nan
0.9500935 0.93602151 nan 0.95330762 0.94359514 nan
0.95549322 0.94465872 nan]
warnings.warn(
best_GB_model.predict(x_test)
print('The test acc of the "best" model for GB classifier is', best_GB_model.score(
best_NB_model.predict(x_test)
print('The test acc of the "best" model for Gaussian Naive Bayes classifier is', be
The test acc of the "best" model for Gaussian Naive Bayes classifier is 80.58252427184466 %
f1-Score: (2 * P * R) / (P + R)
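As a quick worked example of the formula above, with P as precision and R as recall, the sketch below derives both from raw counts and then combines them. The counts are made up for illustration.

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision (P) and recall (R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (2 * precision * recall) / (precision + recall)

# 40 true positives, 10 false positives, 20 false negatives
# P = 40 / 50 = 0.8, R = 40 / 60 = 0.667, F1 = 0.727
print(round(f1_score_from_counts(40, 10, 20), 3))
```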
print(metrics.classification_report(y_test, best_LR_model.predict(x_test)))
precision recall f1-score support
print(metrics.classification_report(y_test, best_KNN_model.predict(x_test)))
precision recall f1-score support
Model 3 - RF
print(metrics.classification_report(y_test, best_RF_model.predict(x_test)))
precision recall f1-score support
Model 4 - SVC
print(metrics.classification_report(y_test, best_SVC_model.predict(x_test)))
precision recall f1-score support
Model 5 - GB Classifier
print(metrics.classification_report(y_test, best_GB_model.predict(x_test)))
precision recall f1-score support
print(metrics.classification_report(y_test, best_NB_model.predict(x_test)))
precision recall f1-score support
# AUC
print('The AUC of LR model is', metrics.auc(fpr_lr,tpr_lr))
Model 2 - KNN
# AUC
print('The AUC of KNN model is', metrics.auc(fpr_knn,tpr_knn))
The AUC of KNN model is 1.0
# AUC
print('The AUC of RF model is', metrics.auc(fpr_rf,tpr_rf))
The AUC of RF model is 1.0
Model 4 - SVC
# AUC
print('The AUC of SVC model is', metrics.auc(fpr_svc,tpr_svc))
The AUC of SVC model is 1.0
Model 5 - GB Classifier
# AUC
print('The AUC of GB Classifier is', metrics.auc(fpr_gb,tpr_gb))
The AUC of GB Classifier is 0.9705660377358492
In [55]: # Use predict_proba to get the probability results of Gaussian Naive Bayes Classifi
y_pred_nb = best_NB_model.predict_proba(x_test)[:, 1]
fpr_nb, tpr_nb, _ = roc_curve(y_test, y_pred_nb)
# AUC
print('The AUC of NB Classifier is', metrics.auc(fpr_nb,tpr_nb))
The AUC of NB Classifier is 0.8905660377358491
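For reference, the AUC values above come from `roc_curve` followed by `metrics.auc`. A tiny hand-checkable sketch: with four made-up observations, a positive outranks a negative in three of the four positive/negative pairs, so the AUC is 0.75.

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities of class 1

fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
roc_auc = metrics.auc(fpr, tpr)
print('AUC:', roc_auc)   # 0.75
```

An AUC of 1.0 therefore means every positive observation received a higher score than every negative one.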
KNN, RF, and SVC appear to be the most suitable models in this case, each correctly predicting
all observations in the test dataset.
Of these, KNN has the shortest average training time (0.48 s per hyperparameter
attempt), so it appears to be the most efficient.
indices = np.argsort(importances)[::-1]
From the result above, we can see that chest pain type 0 (cp_0) and the absence of major vessels
colored by fluoroscopy (ca_0) have a strong impact on the occurrence of heart failure.
Apart from that, ST depression after exercise on the ECG (oldpeak) and maximum heart rate
achieved (thalach) also have a relatively major impact on HF occurrence.
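The ranking behind these observations comes from sorting the importance scores in descending order with `np.argsort`. A minimal sketch of that step, using made-up importance values (the numbers and the shortened feature list are hypothetical, not the model's actual output):

```python
import numpy as np

feature_names = ['age', 'cp_0', 'thalach', 'ca_0', 'oldpeak']
importances = np.array([0.10, 0.30, 0.15, 0.25, 0.20])  # hypothetical scores

# argsort sorts ascending, so reverse it to get the most important feature first
indices = np.argsort(importances)[::-1]
ranked = [feature_names[i] for i in indices]

for rank, i in enumerate(indices, start=1):
    print(f'{rank}. {feature_names[i]}: {importances[i]:.2f}')
```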
Insight
KNN, RF, and SVC excel at predicting the occurrence of heart failure from the 13
features in this dataset, given proper feature preprocessing. However, more data is needed to
verify the models' predictions and to train them in a way that avoids overfitting.