
First Synopsis of The Project



Submitted to:

GURU NANAK DEV UNIVERSITY


AMRITSAR
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR

THE DEGREE OF

MCA (FYIC) 6TH SEMESTER (2024)

Supervised By: Dr. Sarveshwar Bharti
Department of Computer Science
Guru Nanak Dev University, ASR.

Submitted By: Kanika Kumari
Roll no: 27502100065
MCA (FYIC) 6th Semester

CONTENTS

1. Introduction to project and problem statements
2. Objectives
3. Features
4. Hardware and Software Requirement
5. Technology and Languages


INTRODUCTION
The healthcare sector generates a large amount of data about patients,
diseases, and diagnoses, but this data is not analyzed properly, so it does
not have the impact on patient health that it should. Heart disease is the
leading cause of death. According to the World Health Organization,
cardiovascular diseases (CVDs) are the largest cause of mortality globally,
resulting in the deaths of an estimated 17.9 million individuals each year.

CVDs include coronary artery disease, rheumatic heart disease, vascular
disease, and various other heart and blood vessel problems. Four out of
every five CVD fatalities are caused by strokes or heart attacks, and
one-third of these deaths occur in persons below the age of 70. Sex,
smoking, age, family history, poor diet, cholesterol, physical inactivity,
high blood pressure, excess weight, and alcohol use are the key risk
factors for heart disease. Heart disease is also linked to hereditary risk
factors such as diabetes and high blood pressure. Physical inactivity,
obesity, and an unhealthy diet are among the secondary factors that
increase the risk. Fatigue, palpitations, sweating, back pain, chest pain,
shoulder and arm pain, shortness of breath, and overall weakness are the
most common symptoms. The most frequent sign of insufficient blood flow to
the heart is chest pain, known in medical terminology as angina.
Examinations such as X-rays, MRI scans, and angiography are available to
help diagnose the disease. However, in an emergency there is sometimes a
shortage of resources because medical equipment is unavailable. In
cardiovascular disease, time is critical: every moment spent diagnosing and
treating the disease counts.

Cardiac centers and outpatient departments produce large volumes of data
on the diagnosis of heart diseases, and the potential for big data
analytics to improve cardiovascular care and patient outcomes is vast.
However, because the data is noisy, incomplete, and irregular, it is hard
to make specific, accurate, and well-grounded decisions from it. Nowadays,
AI plays an important role in the field of cardiology, thanks to massive
advances in equipment, big data, and knowledge storage, acquisition, and
retrieval. Researchers apply preprocessing methods to the data and then use
data mining techniques and various ML models to make decisions. In the
classification of patients with genetic cardiac illnesses and control
subjects, a wide range of ML algorithms and their variants is used to
predict the early stages of heart failure. KNN, DT, SVC, LR, and RF are
examples of machine learning algorithms used for heart attack prediction.
Machine learning approaches can be divided into three categories:
Supervised ML: task-driven, labeled data (classification/regression);
Unsupervised ML: data-driven, unlabeled data (clustering); Reinforcement
Learning: learning from mistakes (playing games).
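The difference between the first two categories can be illustrated with a minimal sketch. It assumes scikit-learn is available and uses randomly generated data from `make_classification` as a stand-in for real patient records; the sample size, feature count, and choice of `KMeans` are illustrative assumptions, not part of the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption): 200 samples, 5 numeric features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised: the model is fitted on the features AND the known labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
predicted_labels = clf.predict(X)

# Unsupervised: the model sees only the features and groups similar rows.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("classes predicted:", np.unique(predicted_labels))
print("clusters found:", np.unique(cluster_ids))
```

The cluster numbers produced by the unsupervised model carry no meaning of their own; only the supervised model's outputs correspond to the labels it was trained on.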

In this study, supervised ML classifiers are used to show how different
models can predict the presence of heart disease, and the accuracy of these
classifiers is compared. The classifiers include Logistic Regression (LR),
k-Nearest Neighbors (kNN), XGBoost (XGB), Support Vector Machine (SVM),
Stochastic Gradient Boosted Tree (GBT), Naive Bayes (NB), Neural Network
(NN), Decision Tree (DT), Radial Basis Function (RBF), Random Forest (RF),
and Multi-Layer Perceptron (MLP).
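A comparison of this kind might be sketched as follows. This is only an illustration under stated assumptions: it covers six of the listed classifiers that ship with scikit-learn, uses synthetic data in place of the heart dataset, and the hyperparameters shown are defaults or arbitrary choices rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 13-feature heart dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit every model on the same training split and score it on the same test split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the classifiers from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Using one shared train/test split keeps the comparison fair, since every classifier is evaluated on exactly the same held-out rows.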

Proposed Methodology
The proposed methodology is to build a machine learning system that can predict whether
a person has heart disease or not. Here are the steps involved:

1. Get the data: This involves downloading a dataset containing information about people's
health, including features like age, sex, chest pain type, and blood sugar level, and a target
variable indicating the presence or absence of heart disease.
2. Process the data: The data is preprocessed to make it suitable for the machine learning
model. This may involve handling missing values, converting categorical variables into
numerical ones, and scaling the data.
3. Split the data: The data is split into two sets: training data and testing data. The training
data is used to train the machine learning model, and the testing data is used to evaluate
the model's performance.
4. Train the model: A logistic regression model is chosen because it is suitable for binary
classification tasks like this one. The model is trained on the training data, learning the
relationships between the features and the target variable.

5. Evaluate the model: The model's performance is evaluated on the testing data using
accuracy score. This metric measures the proportion of predictions that the model makes
correctly.
6. Build a predictive system: Once the model is trained and evaluated, it can be used to
make predictions on new data. This involves feeding the features of a new person into the
model and getting a prediction of whether they have heart disease or not.
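
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual notebook: the data is randomly generated (a real run would load the downloaded dataset instead), only six of the features are used, the median imputation and standard scaling are assumed preprocessing choices, and the new person's values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: get the data. Random stand-in values (an assumption); a real run
# would load the downloaded dataset, for example with pd.read_csv.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(29, 78, n).astype(float),
    "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),
    "trestbps": rng.normal(130, 15, n),
    "chol": rng.normal(245, 50, n),
    "fbs": rng.integers(0, 2, n),
})
# Synthetic target loosely tied to two features so there is signal to learn.
score = 0.04 * (df["age"] - 54) + 0.8 * (df["cp"] - 1.5) + rng.normal(0, 1, n)
df["target"] = (score > 0).astype(int)
df.loc[df.sample(10, random_state=0).index, "chol"] = np.nan  # simulate gaps

# Step 3: split into training and testing data.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Steps 2 and 4: preprocessing (fill missing values, scale) followed by
# logistic regression, fitted on the training data only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Step 5: accuracy = correct predictions / total predictions.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.3f}")

# Step 6: predict for one new (made-up) person.
new_person = pd.DataFrame([{
    "age": 57, "sex": 1, "cp": 2, "trestbps": 140.0, "chol": 260.0, "fbs": 0,
}])
prediction = model.predict(new_person)[0]
print("heart disease predicted" if prediction == 1 else "no heart disease predicted")
```

Placing the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training data alone, so nothing from the testing data leaks into the model.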

Flowchart of proposed methodology



PROBLEM STATEMENT

Heart disease can be managed effectively with a combination of lifestyle changes,
medicine and, in some cases, surgery. With the right treatment, the symptoms of heart
disease can be reduced and the functioning of the heart improved. The predicted
results can be used for prevention and thus reduce the cost of surgical treatment and
other expensive procedures. The overall objective of my work is to predict the
presence of heart disease accurately with few tests and attributes. The attributes
considered form the primary basis for tests and give reasonably accurate results. Many
more input attributes could be taken, but our goal is to predict the risk of heart
disease quickly and with few attributes. Decisions are often made based on doctors'
intuition and experience rather than on the knowledge-rich data hidden in data sets
and databases. This practice leads to unwanted biases, errors and excessive medical
costs, which affect the quality of service provided to patients. Data mining holds
great potential for the healthcare industry: it enables health systems to
systematically use data and analytics to identify inefficiencies and best practices
that improve care and reduce costs. According to Wurz & Takala (2006), the
opportunities to improve care and reduce costs concurrently could apply to as much as
30% of overall healthcare spending. The successful application of data mining in
highly visible fields like e-business, marketing and retail has led to its application
in other industries and sectors. Healthcare is among the sectors just discovering it.
The healthcare environment is still "information rich" but "knowledge poor". There is
a wealth of data available within the healthcare systems. However, there is a lack of
effective analysis tools to discover hidden relationships and trends in the data for
African genres.

OBJECTIVES

1. Main Objectives
The main objective of this research is to develop a heart disease prediction system.
The system can discover and extract hidden knowledge associated with diseases from a
historical heart data set. The heart disease prediction system aims to apply data
mining techniques to a medical data set to assist in the prediction of heart diseases.

2. Specific Objectives
• Provide a new approach to discovering concealed patterns in the data.

• Help avoid human bias.

3. Justification
Clinical decisions are often made based on doctors' insight and experience rather than
on the knowledge-rich data hidden in the dataset. This practice leads to unwanted
biases, errors and excessive medical costs, which affect the quality of service
provided to patients. The proposed system will integrate clinical decision support with
computer-based patient records (data sets). This will reduce medical errors, enhance
patient safety, decrease unwanted practice variation, and improve patient outcomes.
This suggestion is promising because data modeling and analysis tools, e.g., data
mining, have the potential to generate a knowledge-rich environment that can help to
significantly improve the quality of clinical decisions. The medical data domain
contains voluminous records, so it has become necessary to use data mining techniques
to help in decision support and prediction in the field of healthcare. Medical data
mining therefore contributes to business intelligence that is useful for diagnosing
disease.

4. Scope and Limitation

• Scope
The scope of the project is that integrating clinical decision support with
computer-based patient records could reduce medical errors, enhance patient safety,
decrease unwanted practice variation, and improve patient outcomes. This suggestion is
promising because data modeling and analysis tools, e.g., data mining, have the
potential to generate a knowledge-rich environment that can help to significantly
improve the quality of clinical decisions.

• Limitations
Medical diagnosis is a significant yet intricate task that needs to be carried out
precisely and efficiently, and automating it would be highly beneficial. Clinical
decisions are often made based on doctors' intuition and experience rather than on the
knowledge-rich data hidden in the database. This practice leads to unwanted biases,
errors and excessive medical costs, which affect the quality of service provided to
patients. Data mining has the potential to generate a knowledge-rich environment that
can help to significantly improve the quality of clinical decisions.

FEATURES

In a project for predicting heart disease, the features include:

1. Demographic Information: Age, gender, and possibly ethnicity.
• Age: Age of the individual, as age is a significant risk factor for heart
disease.
• Sex: Gender of the individual, as heart disease risk can vary between
males and females (1=Male, 0=Female).

2. Biometric Measurements: Blood pressure, cholesterol level (both HDL and
LDL), body mass index (BMI).
• Resting Blood Pressure (Trestbps): The resting blood pressure of the
individual, measured in millimeters of mercury (mmHg).
• Cholesterol (Chol) Levels: Total cholesterol level as well as the levels of
HDL (high-density lipoprotein) and LDL (low-density lipoprotein)
cholesterol.
• Fasting Blood Sugar (Fbs): The level of glucose in the blood after
fasting for a certain period.
• Coronary Calcium (CA): It looks for calcium deposits in the heart
arteries. A buildup of calcium can narrow the arteries and reduce blood
flow to the heart.

3. Symptoms: Presence of chest pain, shortness of breath, fatigue, etc.
• Chest Pain (CP) Type: The type of chest pain experienced by the
individual (e.g., typical angina, atypical angina, non-anginal pain, or
asymptomatic).
• Exercise-Induced Angina (Exang): Whether the individual experiences
angina (chest pain) during exercise stress testing.
• Slope: The slope of the peak exercise ST segment can be used reliably to
predict the presence or absence and the severity of coronary artery disease
in individual patients.

4. Clinical Tests: Results of ECG, stress test, and other diagnostic tests.
• Resting Electrocardiographic Result (Restecg): Result from resting
ECG tests, which can indicate abnormal heart rhythms or other cardiac
abnormalities.
• ST Depression (Oldpeak): The amount of ST-segment depression
observed during exercise stress testing, which can indicate ischemia
(lack of blood flow to the heart).
• Maximum Heart Rate Achieved (Thalach): The maximum heart rate
achieved during exercise stress testing.
• Thallium Stress Test Result (Thal): Result of the thallium stress
testing, which can indicate areas of reduced blood flow to the heart
muscle.

Target: Whether the individual has heart disease or not.
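
As an illustration of how these features might be organized in code, the sketch below builds a small pandas DataFrame and one-hot encodes the categorical columns that take more than two values. The three rows are made-up values, and the split into continuous and categorical columns follows the feature list in this section; both are assumptions for demonstration only.

```python
import pandas as pd

# Column groups follow the feature list; the rows are made up.
continuous_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

records = pd.DataFrame({
    "age": [63, 41, 56], "sex": [1, 0, 1], "cp": [3, 1, 0],
    "trestbps": [145, 130, 120], "chol": [233, 204, 236], "fbs": [1, 0, 0],
    "restecg": [0, 0, 1], "thalach": [150, 172, 178], "exang": [0, 0, 0],
    "oldpeak": [2.3, 1.4, 0.8], "slope": [0, 2, 2], "ca": [0, 0, 0],
    "thal": [1, 2, 2], "target": [1, 1, 0],
})

# One-hot encode the multi-valued categorical columns so that a model does
# not read their integer codes as ordered quantities.
multi_valued = ["cp", "restecg", "slope", "thal"]
encoded = pd.get_dummies(records, columns=multi_valued, dtype=int)

print(encoded.columns.tolist())
```

Binary columns such as sex, fbs, and exang are already 0/1 and can be passed to a model unchanged.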

• Description of features of the dataset

| SL NO. | Features | Types | Description | Range of features increasing the probability of heart disease |
|---|---|---|---|---|
| 1. | Age | Continuous | Age in years | NA |
| 2. | Sex | Categorical | Male, Female | 1 = Male, 0 = Female |
| 3. | Cp | Categorical | Chest pain type. 0: Typical angina; 1: Atypical angina; 2: Non-anginal pain; 3: Asymptomatic | People with cp equal to 1, 2, 3 are more likely to have heart disease than people with cp equal to 0. |
| 4. | Trestbps | Continuous | Resting blood pressure in mm Hg | Concerning above 130-140. |
| 5. | Chol | Continuous | Serum cholesterol in mg/dl | Serum = LDL + HDL + 0.2 * triglycerides. Above 200 is a matter of concern. |
| 6. | Fbs | Categorical | Fasting blood sugar > 120 mg/dl | 1 = true, 0 = false. > 126 mg/dl signals diabetes. People with Fbs equal to 1 have a higher probability of suffering from heart disease than people with Fbs equal to 0. |
| 7. | Restecg | Categorical | Resting electrocardiographic results. People with value 1 (signals non-normal heart beat, can range from mild symptoms to severe problems) are more likely to have heart disease. | 0: Nothing to note; 1: Non-normal heart beat, can range from mild symptoms to severe problems; 2: Possible or definite left ventricular hypertrophy (enlarged heart's main pumping chamber) |
| 8. | Thalach | Continuous | Maximum heart rate achieved | More than 140 are more likely to have heart disease. |
| 9. | Exang | Categorical | Exercise induced angina | 1 = yes, 0 = no. People with value 0 have more probability of suffering from heart disease than people with value 1. |
| 10. | Oldpeak | Continuous | ST depression induced by exercise relative to rest; looks at stress of heart | During exercise an unhealthy heart will stress more. |
| 11. | Slope | Categorical | The slope of the peak exercise ST segment | 0: Up sloping: better heart rate with exercise (uncommon); 1: Flat sloping: minimal change (typical healthy heart); 2: Down sloping: signs of unhealthy heart |
| 12. | Ca | Categorical | Number of major vessels (0-3) colored by fluoroscopy | Ca equal to 0 are more likely to have heart disease. |
| 13. | Thal | Categorical | Thallium stress result | 1, 3: normal; 6: fixed defect: used to be defect; 7: reversible defect: no proper blood movement when exercising. People with thal value equal to 2 are more likely to have heart disease. |
| 14. | Target | Categorical | Have heart disease or not | 1 = yes, 0 = no |
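
The serum cholesterol formula and the thresholds quoted in the table can be turned into a small worked example. The function names and the sample readings below are hypothetical, and the cut-off of 130 for resting blood pressure is simply the lower end of the 130-140 range given in the table; none of this is a clinical rule.

```python
def total_cholesterol(ldl, hdl, triglycerides):
    """Serum cholesterol estimate in mg/dl: LDL + HDL + 0.2 * triglycerides."""
    return ldl + hdl + 0.2 * triglycerides


def concern_flags(trestbps, chol, fbs_mg_dl):
    """Return which of the table's thresholds a set of readings exceeds."""
    return {
        "high_resting_bp": trestbps > 130,    # table: concerning above 130-140
        "high_cholesterol": chol > 200,       # table: above 200 is a concern
        "fbs_feature": int(fbs_mg_dl > 120),  # dataset encoding: 1 if > 120 mg/dl
    }


# Made-up readings: 130 + 50 + 0.2 * 150 = 210 mg/dl.
chol = total_cholesterol(ldl=130, hdl=50, triglycerides=150)
print(chol)
print(concern_flags(trestbps=142, chol=chol, fbs_mg_dl=110))
```

The `fbs_feature` value shows how the Fbs column is derived: the dataset stores only whether fasting blood sugar exceeds 120 mg/dl, not the reading itself.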

SOFTWARE REQUIREMENT

The main software used in this project is a Colab notebook running Python.
Colab is a hosted Jupyter notebook that allows you to write and execute
arbitrary Python code through the browser. The other software used in this
project is:

• Operating system: A 64-bit operating system (Windows, Linux or macOS).

• Python: A programming language used for data processing and machine
learning.

• Jupyter Notebook: An open-source web application that allows you to create
and share documents that contain live code, equations, visualizations and
narrative text.

• NumPy: A library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.

• Pandas: A software library written for the Python programming language for
data manipulation and analysis.

• Scikit-learn: A free software machine learning library for the Python
programming language.

• Matplotlib: A plotting library for the Python programming language and its
numerical mathematics extension NumPy.

• Seaborn: A Python data visualization library based on Matplotlib.

• TensorFlow: An open-source platform for machine learning and artificial
intelligence.

• Logistic Regression: A statistical model that is used in machine learning for
binary classification problems. It is a predictive analysis technique used to
explain the relationship between one dependent binary variable and one or
more nominal, ordinal, interval, or ratio-level independent variables.

These are the main software components used in this project, but other
libraries and tools may be used as well in the Colab notebook.
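
Whether the listed libraries are present in a given environment can be checked with the standard library alone. The sketch below is a hypothetical helper, not part of the project code; the package names are the distribution names under which these libraries are published on PyPI.

```python
from importlib import metadata

# Distribution names as published on PyPI for the libraries listed above.
required = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn", "tensorflow"]


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(required)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

A missing package is reported rather than raising an error, so the check can run before anything has been installed.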

HARDWARE REQUIREMENT

The hardware requirements for running the code in the Colab notebook are:

• A machine with at least 4 vCPUs, either x86_64 or arm64 architecture, with
higher clock frequency preferred.

• At least 8 GB RAM.

• At least 500 MB of free disk space.

• The server needs to have at least 50 Mbps downstream capacity from the
internet.

• The connection between client and server should have at least 20 Mbps
bandwidth, and no more than 200 ms latency.

For the best performance, enabling swap and using local SSD storage is
recommended. Additionally, the machine should have at least 8 GB available.
Note that these requirements are for running the code on a local machine. If we
are using a remote development environment, the requirements may be different.
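
Two of these requirements, the CPU count and the free disk space, can be checked on a local machine with Python's standard library. The sketch below is a hypothetical helper with assumed names; RAM and network bandwidth are left out because the standard library offers no portable way to measure them.

```python
import os
import shutil

MIN_CPUS = 4                           # minimum vCPUs from the list above
MIN_FREE_DISK_BYTES = 500 * 1024 ** 2  # 500 MB of free disk space


def check_hardware(path="."):
    """Compare CPU count and free disk space with the stated minimums."""
    cpus = os.cpu_count() or 0
    free_bytes = shutil.disk_usage(path).free
    return {
        "cpus": cpus,
        "cpus_ok": cpus >= MIN_CPUS,
        "free_disk_mb": free_bytes // 1024 ** 2,
        "disk_ok": free_bytes >= MIN_FREE_DISK_BYTES,
    }


result = check_hardware()
print(result)
```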

TECHNOLOGY AND LANGUAGES

The language used in this project is Python, a popular high-level programming
language used for general-purpose programming. Python is known for its
simplicity, readability, and large standard library, making it a good choice
for a wide range of applications, from web development to data analysis.
In addition to Python, the code uses several libraries and tools, including
pandas, numpy, matplotlib, seaborn, and sklearn. These libraries provide
additional functionality for data manipulation, visualization, and machine
learning.
Pandas is a library for data manipulation and analysis, providing data structures
and functions for working with tabular data. Numpy is a library for numerical
computing, providing support for large, multi-dimensional arrays and matrices,
as well as a wide range of mathematical functions. Matplotlib and Seaborn are
libraries for data visualization, providing functions for creating charts, graphs,
and other visual representations of data.
Sklearn is a library for machine learning, providing a wide range of algorithms
and tools for building predictive models. In this project, Sklearn is used to
train a logistic regression model for predicting heart disease. Overall, the
technology used in this project is a combination of Python and several popular
libraries for data manipulation, visualization, and machine learning.
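
To show what a logistic regression model computes when it makes a prediction, the sketch below applies the logistic (sigmoid) function to a weighted sum of features using NumPy. The weights, bias, and feature values are hypothetical numbers chosen for illustration; in practice Sklearn learns the weights and bias from the training data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and bias for three scaled features (assumptions).
weights = np.array([0.8, -0.5, 1.2])
bias = -0.3
features = np.array([1.0, 0.5, -0.2])

z = float(np.dot(weights, features) + bias)  # weighted sum of the features
probability = sigmoid(z)                     # estimated probability of class 1
label = int(probability >= 0.5)              # threshold at 0.5 for a 0/1 label

print(f"z = {z:.2f}, probability = {probability:.3f}, label = {label}")
```

The 0.5 threshold is the usual default for binary classification; a lower threshold would flag more people as at risk at the cost of more false alarms.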
