Final
About the Data
The dataset used in this study includes various
potential risk factors for coronary artery disease in
healthy individuals. The variables collected are
Age, Sex, Arterial Calcification Score (a measure
of total arterial wall calcification), various blood
pressure measures such as systolic and diastolic
readings, heart rate, height and weight
measurements such as BMI, glucose, BUN, creatinine, sodium, potassium, chloride, uric acid, protein, albumin, globulin, A/G ratio, calcium,
phosphorus, alkaline phosphatase, SGOT, LDH,
bilirubin, GGTP, iron, white blood cell count, red
blood cell count, hemoglobin, hematocrit, MCV,
MCH, MCHC, platelets, RDW, neutrophils,
lymphocytes, monocytes, eosinophils, basophils,
total cholesterol, LDL cholesterol, HDL cholesterol,
and cholesterol ratio.
Process:
1. Data Cleaning & Finding Missing Values
To preprocess the dataset in Python, I first identified that the "." sign represented the missing values and converted them all to the NaN type to simplify processing. I then counted the NaNs in each column: some columns had just one missing value, while others had up to 58, roughly 9% of the entire dataset. Next, I used the KNNImputer method to fill in the missing values with the k-nearest neighbors algorithm. This technique identifies the k nearest neighbors of each observation with a missing value, based on the variables that are not missing, and assigns the average of those neighbors' values to the missing entry. KNNImputer is well suited to this dataset because it contains both continuous and categorical variables, and the technique requires no assumptions about the data's distribution.
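A minimal sketch of this step with pandas and scikit-learn; the file name cad_data.csv and k = 5 are placeholders, and the categorical columns (e.g., Sex) are assumed to be numerically coded already, since KNNImputer needs numeric input:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Read the raw file; "." marks a missing entry, so map it to NaN on load.
df = pd.read_csv("cad_data.csv", na_values=".")  # file name is a placeholder

# Count the NaNs in each column to locate the missing values.
print(df.isna().sum().sort_values(ascending=False))

# Fill each missing entry from its 5 nearest neighbors, found using the
# columns that are not missing; the neighbors' values are averaged.
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```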
2. Making the CAC Score Categorical
The histogram of the CAC score showed that a large number of patients had a score of 0, while the remaining scores ranged up to 1600. To understand the relative risk associated with different CAC score ranges, I referred to a study (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5487233/) that groups CAC scores into risk levels, and binned the score accordingly, as sketched below.
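Since the study's reference table is not reproduced here, the cutoffs in this sketch (0, 1-99, 100-399, >=400, a commonly used four-level scheme) stand in for the exact ranges from the cited study, and the column name cac_score is a placeholder:

```python
import pandas as pd

# Bin the continuous CAC score into four ordered risk levels.
# Cutoffs 0 / 1-99 / 100-399 / >=400 are assumed here, standing in
# for the ranges taken from the cited study.
bins = [-1, 0, 99, 399, float("inf")]
df_imputed["cac_level"] = pd.cut(df_imputed["cac_score"],  # column name assumed
                                 bins=bins, labels=[1, 2, 3, 4])
```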
3. Reducing Variables
To reduce the number of variables in the dataset, I first examined all 46 variables and identified overlap. For instance, 'BMI', 'WT/kilo', 'HT/in', and 'HT/meters' all describe a patient's body weight and height; I kept BMI because it is a standardized measure of both. Next, I used the truncated Singular Value Decomposition (SVD) method to reduce the number of predictors. A loop in Python ran the SVD with an increasing number of components until the retained components explained 92% of the variance (see the sketch after this paragraph). This decomposes the data matrix into orthogonal components, retaining only the most significant singular values and their corresponding singular vectors. The resulting components are linear combinations of the original variables that capture the most essential information in the data while minimizing noise and redundancy. Guided by this decomposition, I reduced the predictors to seven variables: 'platelets', 'total cholesterol', 'triglycerides', 'LDH', 'sodium', 'LDL cholesterol', and 'systolic'. Based on my reading, I also added the age and sex variables, as I believe they are related to the outcome. To check the reduced dataset, I plotted pairwise scatter plots of the reduced feature variables and found no patterned relationships among them.
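A sketch of that loop with scikit-learn's TruncatedSVD; the matrix X below is random stand-in data for the predictor matrix so the snippet runs on its own:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# X stands in for the predictor matrix (rows = patients, columns = variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 40))

# Grow the number of retained components until they explain >= 92% of the variance.
n_components = 1
while n_components < X.shape[1]:
    svd = TruncatedSVD(n_components=n_components).fit(X)
    if svd.explained_variance_ratio_.sum() >= 0.92:
        break
    n_components += 1

print(f"{n_components} components explain "
      f"{svd.explained_variance_ratio_.sum():.1%} of the variance")
X_reduced = svd.transform(X)  # rows projected onto the retained components
```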
Random Forest Classifier:
I chose to implement the random forest classifier
as my primary model. Initially, I split the dataset
into a training and testing set, with a ratio of 75%
for training and 25% for testing. Subsequently, I standardized the predictors using the scale function because they had differing scales; scaling does not change the splits a tree-based model makes, but it keeps the predictors on a common footing across the models compared here. I also generated a decision-process graph for a tree from the random forest model, and visualized the feature importances, calculated with the Gini importance method; a higher Gini value indicates that a feature mattered more during decision-making. From the confusion matrix, we can see that the model predicted levels 1 and 2 well, but not levels 3 and 4.
Figures: decision tree, Gini importance, and confusion matrix.
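A sketch of this modeling step with scikit-learn; the stand-in data below substitutes for the reduced predictors and the binned CAC levels, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 7 reduced variables plus age and sex; labels = CAC level 1-4.
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(600, 9))
y = rng.integers(1, 5, size=600)

# 75/25 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.25, random_state=42)

# Standardize the predictors; the scaler is fit on the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit the forest, then inspect Gini importances and the confusion matrix.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.feature_importances_)                       # Gini importance per feature
print(confusion_matrix(y_test, rf.predict(X_test)))  # rows = true, cols = predicted
```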
Neural Network:
I also employed the neural network method, but I decided not to use the reduced dataset, since neural networks can handle a large number of inputs and are straightforward in their approach. The table below shows the performance metrics of the top 5 neural network models over 5 epochs of training. From the table, we can see that the training loss and validation loss decrease over the epochs, indicating that the model is getting better at predicting the output. The training accuracy and validation accuracy also increase, indicating that the model is becoming more accurate in its predictions.
The best model after the testing process is at epoch 10, with a training loss of 0.490232, a training accuracy of 0.808602, a validation loss of 0.762039, and a validation accuracy of 0.71.
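The report does not reproduce the network itself; the sketch below is one plausible Keras-style setup consistent with the description, with illustrative layer sizes, stand-in data for the full (unreduced) predictors, and labels assumed to be coded 0-3:

```python
import numpy as np
import tensorflow as tf

# Stand-in data: the model trains on the full, unreduced predictors;
# 40 random features substitute for them here, with labels coded 0-3.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(600, 40)).astype("float32")
y = rng.integers(0, 4, size=600)

# A small feed-forward classifier with a softmax over the four CAC levels.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# fit() reports training/validation loss and accuracy per epoch, which is
# where metrics like those quoted above come from.
history = model.fit(X_full, y, epochs=10, validation_split=0.2, batch_size=32)
```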
Conclusion
In conclusion, this study aimed to build a predictive
model to identify potential risk factors for coronary
artery disease (CAD) using a clinical and
laboratory measurements dataset. The data
preprocessing involved data cleaning, imputation
of missing values, categorization of the arterial
calcification score, and variable reduction using
the truncated Singular Value Decomposition (SVD)
method. The reduced dataset comprised age, sex, platelets, total cholesterol, triglycerides, LDH, sodium, LDL cholesterol, and systolic blood pressure. A random forest classifier was implemented as the primary model to predict CAD risk. The final model showed an accuracy of 76%, with a sensitivity of 0.61 and a specificity of 0.80. The findings suggest that the reduced set of variables can effectively predict CAD risk and that the model could help identify individuals at high risk of CAD. Further studies are needed to validate the model on a larger population and to assess its clinical utility.