Final
About the Data
The dataset used in this study includes various
potential risk factors for coronary artery disease in
healthy individuals. The variables collected are
Age, Sex, Arterial Calcification Score (a measure
of total arterial wall calcification), various blood
pressure measures such as systolic and diastolic
readings, heart rate, height and weight
measurements such as BMI, glucose, BUN, creatinine, sodium, potassium, chloride, uric acid, protein, albumin, globulin, A/G ratio, calcium,
phosphorus, alkaline phosphatase, SGOT, LDH,
bilirubin, GGTP, iron, white blood cell count, red
blood cell count, hemoglobin, hematocrit, MCV,
MCH, MCHC, platelets, RDW, neutrophils,
lymphocytes, monocytes, eosinophils, basophils,
total cholesterol, LDL cholesterol, HDL cholesterol,
and cholesterol ratio.
Process:
1. Data Cleaning & Finding Missing Values
To preprocess the dataset in Python, I first identified that the "." sign represented the missing values and converted them all to the NaN type to simplify processing. I then counted the NaNs in each column: some columns had just one missing value, while others had up to 58, roughly 9% of the entire dataset. Next, I used the KNNImputer method to fill in the missing values with the k-nearest neighbors algorithm. This technique identifies the k nearest neighbors of each observation with a missing value, based on the variables that are not missing, and assigns the average of those neighbors' values to the missing entry. KNNImputer is well suited to this dataset because it contains both continuous and categorical variables, and the technique requires no assumptions about the data's distribution.
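A minimal sketch of this step with pandas and scikit-learn; the file name cad_data.csv and k = 5 are placeholders, and the categorical columns (e.g., Sex) are assumed to be numerically coded already, since KNNImputer needs numeric input:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Read the raw file; "." marks a missing entry, so map it to NaN on load.
df = pd.read_csv("cad_data.csv", na_values=".")  # file name is a placeholder

# Count the NaNs in each column to locate the missing values.
print(df.isna().sum().sort_values(ascending=False))

# Fill each missing entry from its 5 nearest neighbors, found using the
# columns that are not missing; the neighbors' values are averaged.
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```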
2. Making the CAC Score Categorical
The histogram of the CAC score showed that a large number of patients had a score of 0, while the remaining scores ranged up to 1600. To understand the relative risk associated with different CAC score ranges, I referred to a study (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5487233/) that groups CAC scores into risk levels, and binned the score accordingly, as sketched below.
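Since the study's reference table is not reproduced here, the cutoffs in this sketch (0, 1-99, 100-399, >=400, a commonly used four-level scheme) stand in for the exact ranges from the cited study, and the column name cac_score is a placeholder:

```python
import pandas as pd

# Bin the continuous CAC score into four ordered risk levels.
# Cutoffs 0 / 1-99 / 100-399 / >=400 are assumed here, standing in
# for the ranges taken from the cited study.
bins = [-1, 0, 99, 399, float("inf")]
df_imputed["cac_level"] = pd.cut(df_imputed["cac_score"],  # column name assumed
                                 bins=bins, labels=[1, 2, 3, 4])
```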
3. Reducing Variables
To reduce the number of variables in the dataset, I first examined all 46 variables and identified overlap. For instance, 'BMI', 'WT/kilo', 'HT/in', and 'HT/meters' all describe a patient's body weight and height; I kept BMI because it is a standardized measure of both. Next, I used the truncated Singular Value Decomposition (SVD) method to reduce the number of predictors. A loop in Python ran the SVD with an increasing number of components until the retained components explained 92% of the variance (see the sketch after this paragraph). This decomposes the data matrix into orthogonal components, retaining only the most significant singular values and their corresponding singular vectors. The resulting components are linear combinations of the original variables that capture the most essential information in the data while minimizing noise and redundancy. Guided by this decomposition, I reduced the predictors to seven variables: 'platelets', 'total cholesterol', 'triglycerides', 'LDH', 'sodium', 'LDL cholesterol', and 'systolic'. Based on my reading, I also added the age and sex variables, as I believe they are related to the outcome. To check the reduced dataset, I plotted pairwise scatter plots of the reduced feature variables and found no patterned relationships among them.
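A sketch of that loop with scikit-learn's TruncatedSVD; the matrix X below is random stand-in data for the predictor matrix so the snippet runs on its own:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# X stands in for the predictor matrix (rows = patients, columns = variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 40))

# Grow the number of retained components until they explain >= 92% of the variance.
n_components = 1
while n_components < X.shape[1]:
    svd = TruncatedSVD(n_components=n_components).fit(X)
    if svd.explained_variance_ratio_.sum() >= 0.92:
        break
    n_components += 1

print(f"{n_components} components explain "
      f"{svd.explained_variance_ratio_.sum():.1%} of the variance")
X_reduced = svd.transform(X)  # rows projected onto the retained components
```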
Random Forest Classifier:
I chose to implement the random forest classifier
as my primary model. Initially, I split the dataset
into a training and testing set, with a ratio of 75%
for training and 25% for testing. Subsequently, I standardized the predictors using the scale function because they had differing scales; scaling does not change the splits a tree-based model makes, but it keeps the predictors on a common footing across the models compared here. I also generated a decision-process graph for a tree from the random forest model, and visualized the feature importances, calculated with the Gini importance method; a higher Gini value indicates that a feature mattered more during decision-making. From the confusion matrix, we can see that the model predicted levels 1 and 2 well, but not levels 3 and 4.
Figures: decision tree, Gini importance, and confusion matrix.
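A sketch of this modeling step with scikit-learn; the stand-in data below substitutes for the reduced predictors and the binned CAC levels, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 7 reduced variables plus age and sex; labels = CAC level 1-4.
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(600, 9))
y = rng.integers(1, 5, size=600)

# 75/25 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.25, random_state=42)

# Standardize the predictors; the scaler is fit on the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit the forest, then inspect Gini importances and the confusion matrix.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.feature_importances_)                       # Gini importance per feature
print(confusion_matrix(y_test, rf.predict(X_test)))  # rows = true, cols = predicted
```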
Neural Network:
I also employed the neural network method, but I decided not to use the reduced dataset, since neural networks can handle a large number of inputs and are straightforward in their approach. The table below shows the performance metrics of the top 5 neural network models over 5 epochs of training. From the table, we can see that the training loss and validation loss decrease over the epochs, indicating that the model is getting better at predicting the output. The training accuracy and validation accuracy also increase, indicating that the model is becoming more accurate in its predictions.
The best model after the testing process is at epoch 10, with a training loss of 0.490232, a training accuracy of 0.808602, a validation loss of 0.762039, and a validation accuracy of 0.71.
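The report does not reproduce the network itself; the sketch below is one plausible Keras-style setup consistent with the description, with illustrative layer sizes, stand-in data for the full (unreduced) predictors, and labels assumed to be coded 0-3:

```python
import numpy as np
import tensorflow as tf

# Stand-in data: the model trains on the full, unreduced predictors;
# 40 random features substitute for them here, with labels coded 0-3.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(600, 40)).astype("float32")
y = rng.integers(0, 4, size=600)

# A small feed-forward classifier with a softmax over the four CAC levels.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# fit() reports training/validation loss and accuracy per epoch, which is
# where metrics like those quoted above come from.
history = model.fit(X_full, y, epochs=10, validation_split=0.2, batch_size=32)
```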
Conclusion
In conclusion, this study aimed to build a predictive
model to identify potential risk factors for coronary
artery disease (CAD) using a clinical and
laboratory measurements dataset. The data
preprocessing involved data cleaning, imputation
of missing values, categorization of the arterial
calcification score, and variable reduction using
the truncated Singular Value Decomposition (SVD)
method. The reduced dataset comprised age, sex, platelets, total cholesterol, triglycerides, LDH, sodium, LDL cholesterol, and systolic blood pressure. A random forest classifier was implemented as the primary model to predict CAD risk. The final model showed an accuracy of 76%, with a sensitivity of 0.61 and a specificity of 0.80. The findings suggest that the reduced set of variables can effectively predict CAD risk and that the model could help identify individuals at high risk of CAD. Further studies are needed to validate the model on a larger population and to assess its clinical utility.