Project Deliverable 3

This document outlines an analysis aimed at identifying factors associated with cardiovascular disease, focusing on demographic and health indicators like age, blood pressure, and cholesterol levels. It describes the dataset, hypotheses for testing, and the statistical methods used, including t-tests and logistic regression. The findings suggest significant differences in resting blood pressure and maximum heart rate between patients with and without heart disease, with recommendations for improved health monitoring and lifestyle interventions.

Uploaded by

zille.huma
1. Problem Statement
The primary objective of this analysis is to investigate potential factors associated
with cardiovascular disease. Specifically, we test whether selected demographic and
health indicators (age, resting blood pressure, serum cholesterol, and maximum heart
rate) are significantly associated with the presence of cardiovascular disease. The
analysis uses hypothesis tests to compare group means and a logistic regression model
to identify key risk factors.

2. Dataset Description
2.1. Dataset Overview
The dataset contains the following variables. The descriptions follow the conventions
of typical cardiovascular datasets:
• patientid: Unique identifier for each patient.
• age: Patient's age in years.
• gender: Gender of the patient (1 = Male, 0 = Female).
• chestpain: Type of chest pain experienced (0–3, with different types indicating
various risks of heart disease).
• restingBP: Resting blood pressure in mm Hg.
• serumcholestrol: Serum cholesterol level in mg/dL.
• fastingbloodsugar: Whether fasting blood sugar > 120 mg/dL (1 = Yes, 0 = No).
• restingrelectro: Resting electrocardiographic results (0–2, with higher values
possibly indicating abnormalities).
• maxheartrate: Maximum heart rate achieved.
• exerciseangia: Exercise-induced angina (1 = Yes, 0 = No).
• oldpeak: ST depression induced by exercise relative to rest.
• slope: The slope of the peak exercise ST segment (0–2).
• noofmajorvessels: Number of major vessels (0–3) colored by fluoroscopy.
• target: Outcome variable (1 = Heart disease, 0 = No heart disease).
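Several of these variables are numeric codes rather than measurements, so it can help to recode them as labelled factors before tabulating or plotting. The sketch below uses a small made-up data frame (the `toy` rows are hypothetical, not taken from the dataset) and only the codings stated in the list above.

```r
# Hypothetical toy rows that reuse the column names listed above (not real data).
toy <- data.frame(
  gender        = c(1, 0, 1, 0),
  chestpain     = c(0, 2, 3, 1),
  exerciseangia = c(0, 1, 1, 0),
  target        = c(0, 1, 1, 0)
)

# Recode the 0/1 indicators and the chest pain code as factors.
toy$gender        <- factor(toy$gender, levels = c(0, 1), labels = c("Female", "Male"))
toy$chestpain     <- factor(toy$chestpain, levels = 0:3)
toy$exerciseangia <- factor(toy$exerciseangia, levels = c(0, 1), labels = c("No", "Yes"))
toy$target        <- factor(toy$target, levels = c(0, 1),
                            labels = c("No heart disease", "Heart disease"))

str(toy)  # every column is now a factor with readable levels
```

Keeping the outcome numeric (0/1) also works for `glm()`, so this recoding is mainly a convenience for tables and plot legends.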
3. Hypotheses
Based on the research objectives, we test the following three hypotheses:

Hypothesis 1:
• There is a significant difference in the mean resting blood pressure between
patients with and without heart disease.
• Null Hypothesis (H0): There is no difference in mean resting blood pressure
between patients with and without heart disease.
• Alternative Hypothesis (H1): Patients with heart disease have a different mean
resting blood pressure than those without.

Hypothesis 2:
• There is a significant difference in the mean maximum heart rate between
patients with and without heart disease.
• Null Hypothesis (H0): There is no difference in mean maximum heart rate between
patients with and without heart disease.
• Alternative Hypothesis (H1): Patients with heart disease have a different mean
maximum heart rate than those without.

Hypothesis 3:
• Age and cholesterol levels are associated with the risk of heart disease.
• This is tested with a logistic regression in which age and serum cholesterol are
the predictors and target (heart disease status) is the outcome variable.
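In symbols, the three hypotheses can be written as below. The notation is introduced here for illustration and is not part of the original analysis: \mu_0 and \mu_1 are the group means for target = 0 and target = 1, and p is the probability that target = 1.

```latex
\begin{align*}
\text{Hypotheses 1 and 2:}\quad & H_0 : \mu_1 = \mu_0 \quad \text{vs.} \quad H_1 : \mu_1 \neq \mu_0 \\
\text{Hypothesis 3:}\quad & \log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{serumcholestrol}, \\
& H_0 : \beta_j = 0 \quad \text{vs.} \quad H_1 : \beta_j \neq 0, \qquad j = 1, 2
\end{align*}
```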

4. Conducting Significance Testing and Regression Analysis
The goal of this analysis is to understand relationships between various health
indicators and the likelihood of heart disease (the target variable). We will use
statistical hypothesis testing and regression modeling to:
1. Compare Means: Identify if there are significant differences in specific health
metrics (e.g., resting blood pressure, maximum heart rate) between patients
with and without heart disease.
2. Predictive Modeling: Investigate if variables such as age and serum
cholesterol levels are associated with a higher likelihood of heart disease.
For this purpose, we will conduct t-tests for comparing means and logistic regression
for predictive modeling.
# Load necessary libraries
library(ggplot2)     # plots
library(ggcorrplot)  # correlation heatmap
library(pscl)        # pR2() for pseudo R-squared

# Load the data (skip this step if a data frame named `data` already exists)
data <- read.csv("CardiovascularDisease.csv")

# --- Section 1: Descriptive Statistics and Visualizations ---

# 1.1 Descriptive statistics for numerical variables
summary(data)

# 1.2 Frequency counts for categorical variables
table(data$gender)
table(data$chestpain)
table(data$target)

# 1.3 Proportion of target outcomes (Heart Disease vs. No Heart Disease)
prop.table(table(data$target))

# 1.4 Graphs to illustrate descriptive statistics

# Histogram for Age Distribution
ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

# Bar Plot for Chest Pain Types
ggplot(data, aes(x = factor(chestpain))) +
  geom_bar(fill = "lightgreen") +
  labs(title = "Chest Pain Type Distribution", x = "Chest Pain Type", y = "Count")

# Boxplot for Resting Blood Pressure by Target
ggplot(data, aes(x = factor(target), y = restingBP, fill = factor(target))) +
  geom_boxplot() +
  labs(title = "Resting BP by Heart Disease Status",
       x = "Heart Disease (1 = Yes, 0 = No)", y = "Resting Blood Pressure")

# Scatter Plot for Age vs. Max Heart Rate by Target
ggplot(data, aes(x = age, y = maxheartrate, color = factor(target))) +
  geom_point() +
  labs(title = "Age vs Max Heart Rate by Heart Disease Status", x = "Age", y = "Max Heart Rate")

# Bar Plot for Exercise-Induced Angina vs Heart Disease
ggplot(data, aes(x = factor(exerciseangia), fill = factor(target))) +
  geom_bar(position = "dodge") +
  labs(title = "Exercise-Induced Angina vs Heart Disease",
       x = "Exercise-Induced Angina", y = "Count")

# Correlation Heatmap for Numerical Variables
# (patientid is only an identifier, so it is dropped before computing correlations)
num_vars <- data[sapply(data, is.numeric)]
num_vars$patientid <- NULL
cor_matrix <- cor(num_vars)
ggcorrplot(cor_matrix, lab = TRUE)

# --- Section 2: Hypothesis Testing ---

# 2.1 T-test for Resting Blood Pressure by Heart Disease Status
# (t.test() runs Welch's two-sample test by default)
t_test_restingBP <- t.test(restingBP ~ target, data = data)
print(t_test_restingBP)

# 2.2 T-test for Maximum Heart Rate by Heart Disease Status
t_test_maxheartrate <- t.test(maxheartrate ~ target, data = data)
print(t_test_maxheartrate)

# --- Section 3: Logistic Regression Model ---

# Logistic Regression with age and serum cholesterol as predictors
log_reg_model <- glm(target ~ age + serumcholestrol, data = data, family = "binomial")
summary(log_reg_model)

# --- Section 4: Model Diagnostics and Goodness of Fit ---

# 4.1 Calculate Pseudo R-squared
pR2(log_reg_model)

# 4.2 Confusion Matrix for Model Predictions (0.5 probability cutoff)
predicted_class <- ifelse(predict(log_reg_model, type = "response") > 0.5, 1, 0)
table(Predicted = predicted_class, Actual = data$target)

5. Results and Discussion


5.1. T-Tests Interpretation
5.1.1. Resting Blood Pressure:
o The t-test comparing resting blood pressure (BP) between patients with heart
disease (target = 1) and those without (target = 0) yields a t-value of -17.342 with
a p-value < 2.2e-16. Because the p-value is far below 0.05, we reject the null
hypothesis: mean resting BP differs significantly between the two groups. The
t-value is negative because R subtracts the target = 1 mean from the target = 0 mean.
o Patients with heart disease have a higher mean resting BP (164.04 mm Hg) than those
without (134.77 mm Hg). This suggests that higher resting BP is associated with the
presence of heart disease in this dataset.
5.1.2. Maximum Heart Rate:
o The t-test yields a t-value of -7.0404 with a p-value of 4.488e-12, so we again
reject the null hypothesis. Mean maximum heart rate differs significantly between
the groups.
o The mean max heart rate is lower in patients without heart disease (136.31) than
in those with heart disease (152.12). In this dataset, a higher maximum heart rate
is therefore associated with the presence of heart disease.
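The sign of the t-value depends on the order in which R subtracts the group means. The sketch below illustrates this on simulated values (the data frame `sim`, its group sizes, means, and standard deviations are made up for illustration and are not the project's data). It recomputes Welch's statistic by hand and compares it with the output of `t.test()`.

```r
set.seed(1)  # simulated values only; not the project's data
sim <- data.frame(
  target    = rep(c(0, 1), times = c(40, 60)),
  restingBP = c(rnorm(40, mean = 135, sd = 20), rnorm(60, mean = 160, sd = 25))
)

tt <- t.test(restingBP ~ target, data = sim)  # Welch test by default

# Recompute the Welch statistic by hand: mean(group 0) - mean(group 1),
# divided by the standard error built from the two group variances.
x0 <- sim$restingBP[sim$target == 0]
x1 <- sim$restingBP[sim$target == 1]
t_manual <- (mean(x0) - mean(x1)) /
  sqrt(var(x0) / length(x0) + var(x1) / length(x1))

tt$estimate   # group means, in the order target = 0, target = 1
tt$conf.int   # 95% confidence interval for mean(0) - mean(1)
all.equal(unname(tt$statistic), t_manual)  # the two values should agree
```

Reporting `tt$conf.int` alongside the p-value gives a sense of how large the difference is, not only whether it is statistically significant.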

5.2. Logistic Regression Interpretation


The logistic regression model examined age and serum cholesterol as predictors of
heart disease status (target):
• Age: The p-value for age is 0.966, so age is not a significant predictor of
heart disease in this model.
• Serum Cholesterol: The coefficient for serum cholesterol is positive (0.0031
log-odds per mg/dL) and highly significant (p < 0.001), indicating that higher
serum cholesterol is associated with an increased likelihood of heart disease.
McFadden's pseudo R-squared (R² = 0.028) is close to zero, so the model fits only
marginally better than a model with no predictors.
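A logistic regression coefficient is on the log-odds scale, so exponentiating it gives an odds ratio that is easier to read. The short calculation below takes the reported cholesterol coefficient (0.0031) as given and converts it to odds ratios for a few increases in serum cholesterol. The increments of 1, 10, and 50 mg/dL are chosen here purely for illustration.

```r
beta_chol <- 0.0031  # reported coefficient: log-odds per 1 mg/dL of serum cholesterol

# Odds ratios for increases of 1, 10 and 50 mg/dL (illustrative increments)
increase   <- c(1, 10, 50)
odds_ratio <- exp(beta_chol * increase)

round(odds_ratio, 3)  # approximately 1.003, 1.031, 1.168
```

Under this model, a 50 mg/dL increase corresponds to roughly 17% higher odds of heart disease, which shows that a highly significant coefficient can still describe a modest effect.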

5.3. Model Evaluation (Confusion Matrix)


The confusion matrix shows that:
• The model correctly predicted 82 out of 420 patients without heart disease
and 480 out of 580 patients with heart disease.
• However, it misclassified a large number of patients (338 false positives
and 100 false negatives). The model is therefore not accurate enough for
prediction, most likely because age and serum cholesterol alone have limited
predictive power.
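The counts above can be turned into standard summary measures. The sketch below uses only the four reported counts. The variable names and the "always predict heart disease" baseline are added here for comparison and are not part of the original analysis.

```r
# Counts reported above
tn <- 82;  fp <- 338   # actual target = 0 (420 patients)
fn <- 100; tp <- 480   # actual target = 1 (580 patients)
n  <- tn + fp + fn + tp

accuracy    <- (tp + tn) / n    # 0.562
sensitivity <- tp / (tp + fn)   # about 0.828
specificity <- tn / (tn + fp)   # about 0.195
baseline    <- (tp + fn) / n    # 0.58: accuracy of always predicting heart disease

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

The overall accuracy (56.2%) is slightly below the 58% obtained by labelling every patient as having heart disease, and the low specificity shows that most patients without heart disease are flagged incorrectly.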

5.4. Decision-Making Suggestions

• Resting Blood Pressure and Maximum Heart Rate: Since both of these
metrics show a significant difference between patients with and without heart
disease, these could be valuable metrics for clinical evaluation and early risk
assessment.
• Serum Cholesterol: The logistic regression analysis shows it is a significant
predictor of heart disease. Thus, reducing serum cholesterol levels through
lifestyle changes or medication could potentially lower heart disease risk.
• Model Improvement: The current logistic regression model could be
improved by including additional predictors, such as other health indicators or
demographic factors, to increase predictive accuracy.
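One way to check whether an added predictor improves the model is to compare nested models by AIC and a likelihood-ratio test. The sketch below does this on simulated data. The data frame `sim`, the assumed effect sizes, and the choice of restingBP as the extra predictor are all assumptions made for illustration, so the output says nothing about the real dataset.

```r
set.seed(42)  # simulated data for illustration only
n <- 500
sim <- data.frame(
  age             = round(runif(n, 30, 80)),
  serumcholestrol = rnorm(n, mean = 250, sd = 50),
  restingBP       = rnorm(n, mean = 140, sd = 20)
)

# Assumed data-generating model: risk rises strongly with resting BP.
p <- plogis(-8 + 0.002 * sim$serumcholestrol + 0.055 * sim$restingBP)
sim$target <- rbinom(n, size = 1, prob = p)

# Two-predictor model (as in this report) and a model with one added predictor
small <- glm(target ~ age + serumcholestrol, data = sim, family = binomial)
full  <- update(small, . ~ . + restingBP)

AIC(small, full)                    # lower AIC indicates the better trade-off
anova(small, full, test = "Chisq")  # likelihood-ratio test for the added term
```

The same two calls can be applied to models fitted on the project data to judge whether each additional predictor is worth keeping.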

Plots:
6. Improvement and Suggestions for Decision Making
Health Screening and Monitoring:
• Given the strong associations, implement regular monitoring of resting BP and
maximum heart rate for patients, particularly those in high-risk categories, as
early indicators of cardiovascular risk.
• Serum cholesterol management, including lifestyle adjustments and, if
necessary, medication, is recommended as a preventative measure against heart
disease.

Enhanced Risk Models:
• Develop more comprehensive models by including additional variables beyond
age and cholesterol (e.g., lifestyle factors, genetic predisposition) to improve
predictive accuracy for heart disease.

Preventative Care and Lifestyle Counseling:
• Educate patients, especially those with elevated BP, heart rate, and cholesterol,
on lifestyle changes (such as diet, exercise, and stress management) that can
lower their risk. This proactive approach could reduce the incidence of heart
disease over time.

Tailored Intervention Programs:
• For high-risk patients, consider creating personalized health programs focusing
on BP, heart rate, and cholesterol control, which may reduce their likelihood of
developing heart disease.
