Programming With Python - Final Assignment - Valerie Riady Huette
Business case: From the viewpoint of a hospital, I would like to propose a "Stroke
prevention program". Patients who are at high risk for stroke (those who consistently have
glucose levels above the threshold, are overweight, and are older than 65) would be
actively encouraged to enroll. The program would consist of a monthly blood test, and
those who opt for a premium plan could receive an additional glucose monitoring device
to check their average glucose levels at home. Classes on nutrition, exercise, and
"Stroke 101" would also be offered so that patients can improve their lifestyle.
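The enrollment criteria above can be sketched as a simple filter. This is only an illustration: the patient records are made up, the column names are assumptions, and the cut-offs (age over 65, BMI over 24.9, average glucose over 179 mg/dL) are the ones quoted elsewhere in this report rather than clinical guidance.

```python
import pandas as pd

# Made-up patient records for illustration only; not taken from the dataset.
patients = pd.DataFrame({
    "age": [70, 50, 82, 67],
    "bmi": [31.0, 27.5, 22.0, 26.1],
    "avg_glucose_level": [210.0, 190.0, 230.0, 185.0],
})

# Cut-offs quoted in this report (assumed here to define "at risk").
AGE_MIN = 65               # older than 65 years
BMI_MAX_NORMAL = 24.9      # above this counts as overweight
GLUCOSE_MAX_NORMAL = 179   # mg/dL; above this counts as high


def flag_candidates(df):
    """Return the rows that meet all three enrollment criteria."""
    mask = (
        (df["age"] > AGE_MIN)
        & (df["bmi"] > BMI_MAX_NORMAL)
        & (df["avg_glucose_level"] > GLUCOSE_MAX_NORMAL)
    )
    return df[mask]


candidates = flag_candidates(patients)
print(candidates)
```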
Question to be answered: Can age, BMI, average glucose level, and hypertension status
predict a stroke?
Python, Jupyter, and various libraries will be used to aid the exploratory data analysis
(EDA), calculations, and visualization of the dataset to obtain insights and correlations
between stroke and other health indicators.
Initial information about the data set is obtained through various functions as shown below:
The list of columns (health indicators) is obtained by using the .columns attribute as shown
below:
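A minimal sketch of this first look at the data, assuming pandas. The frame is a tiny made-up stand-in and the column names are assumptions. In practice the frame would come from pd.read_csv on the dataset file, whose name is not given in this report.

```python
import pandas as pd

# Tiny made-up frame; column names are assumptions, not confirmed by the report.
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0],
    "hypertension": [0, 1, 0],
    "avg_glucose_level": [228.69, 95.12, 105.92],
    "bmi": [36.6, None, 32.5],
    "stroke": [1, 0, 1],
})

print(df.shape)        # (number of rows, number of columns)
print(df.head())       # first rows of the table
df.info()              # dtypes and non-null counts per column
print(df.describe())   # summary statistics for numeric columns

# .columns is an attribute (an Index of column labels), not a function call.
column_names = list(df.columns)
print(column_names)
```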
For this analysis, several columns that are not relevant are excluded, such as marriage
status, type of work, and residence type. This keeps the focus on more quantitative data
such as age, BMI, and average glucose levels to predict stroke.
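Excluding columns could look like the sketch below. The column names (ever_married, work_type, Residence_type) are assumptions standing in for marriage status, type of work, and residence type, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the column names are assumed, not confirmed by the report.
df = pd.DataFrame({
    "age": [67.0, 45.0],
    "ever_married": ["Yes", "No"],
    "work_type": ["Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural"],
    "avg_glucose_level": [228.69, 95.12],
    "bmi": [36.6, 28.1],
    "stroke": [1, 0],
})

excluded = ["ever_married", "work_type", "Residence_type"]

# drop returns a new frame; errors="ignore" skips any name that is absent.
df_small = df.drop(columns=excluded, errors="ignore")
print(list(df_small.columns))
```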
Average glucose levels were analyzed the same way as BMI, with the following results:
10.8% of stroke patients had an average glucose level below normal (< 70 mg/dL), 57% had
normal average glucose levels (between 70 and 179 mg/dL), and 32.1% had average
glucose levels above normal (> 179 mg/dL). Since more than half of stroke patients had
a normal glucose level, no clear relationship between stroke and high glucose can be
established: high glucose levels do not necessarily cause stroke.
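The percentage breakdown above can be reproduced with a short sketch like the following. The readings are made up, so the printed shares will not match the figures in this report. The cut-offs (below 70, 70 to 179, above 179 mg/dL) are the ones stated in the text.

```python
import numpy as np
import pandas as pd

# Made-up average glucose readings (mg/dL) for stroke patients.
stroke_glucose = pd.Series(
    [65.0, 82.5, 110.0, 150.0, 178.0, 195.0, 220.0, 240.0, 90.0, 100.0]
)

# Conditions are checked in order: below 70, then up to 179, else above.
categories = np.select(
    [stroke_glucose < 70, stroke_glucose <= 179],
    ["below normal", "normal"],
    default="above normal",
)

# Share of patients in each category, as a percentage.
shares = pd.Series(categories).value_counts(normalize=True) * 100
print(shares.round(1))
```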
Using the .bins function, it was seen that the first bin has an average glucose level
range of 31.9 mg/dL, the second has a larger range of 82.2 mg/dL, and the third has the
largest range of 101.5 mg/dL. This shows that the average glucose levels for patients are
concentrated in the 1st bin, while the data points in the 2nd bin are more spread out,
and those in the 3rd bin are the most spread out.
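The report does not show the call behind these bin ranges. One plausible reading, offered only as an assumption, is quantile-based binning with pandas.qcut, which places roughly equal numbers of patients in each bin, so a narrow bin indicates concentrated values. The readings below are made up. By contrast, value_counts(bins=3) would give bins of equal width.

```python
import pandas as pd

# Made-up average glucose readings (mg/dL).
glucose = pd.Series(
    [56.0, 60.0, 70.0, 75.0, 80.0, 88.0, 95.0, 120.0, 150.0, 200.0, 230.0, 270.0]
)

# Three quantile bins: each holds about one third of the patients.
binned = pd.qcut(glucose, q=3)

# Patients per bin, listed from the lowest bin to the highest.
counts = binned.value_counts().sort_index()

# Width of each bin; a narrow bin means the values are packed closely.
widths = [interval.right - interval.left for interval in binned.cat.categories]

print(counts)
print([round(w, 1) for w in widths])
```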
The same analysis was completed for age, and it was found that only 4% of stroke
patients were under the age of 45, 33.3% were between 45 and 65 years old, and 62.7% were
above the age of 65. The number of stroke patients increases rapidly after the age of 45
and even more after the age of 65. This suggests that stroke occurrence is positively
associated with age.
Using the .bins function, it was seen that the first bin has an age range of 61.7 years,
the second has a smaller range of 14 years, and the third has the smallest range of only
5 years. This shows that patient ages are concentrated in the 3rd bin, while the data
points in the 2nd bin are more spread out, and those in the 1st bin are the most spread out.
Correlation coefficients for various health indicators for stroke patients were determined
using the .corr function.
● The correlation between age and average glucose level is positive, with a correlation
coefficient of 0.11 (almost negligible). This indicates no meaningful linear relationship
between age and average glucose level.
● Surprisingly, the correlation between BMI and age is negative, with a correlation
coefficient of -0.30, meaning that as age increases, BMI tends to decrease in this
dataset. However, a coefficient of -0.30 is too weak to establish a relationship. This
result was unexpected, as I predicted that BMI would be higher as people age due to
slower metabolism.
● The correlation between BMI and average glucose level is positive, with a correlation
coefficient of 0.34. As BMI increases, glucose levels tend to increase, but the
relationship is not clear because 0.34 is a small coefficient.
It was interesting and unexpected to see that the correlation coefficients for non-stroke
patients were higher than those for stroke patients.
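The comparison above can be sketched by splitting the data on the stroke flag and calling DataFrame.corr on each subset, which computes Pearson coefficients by default. The rows and column names below are made-up assumptions, so the coefficients will differ from those reported.

```python
import pandas as pd

# Made-up records; the column names are assumptions.
df = pd.DataFrame({
    "age": [78, 65, 81, 59, 30, 42, 55, 23],
    "bmi": [27.1, 31.4, 25.0, 33.2, 22.5, 28.0, 30.1, 21.0],
    "avg_glucose_level": [105.2, 210.4, 98.7, 189.9, 85.0, 92.3, 130.5, 79.8],
    "stroke": [1, 1, 1, 1, 0, 0, 0, 0],
})

cols = ["age", "bmi", "avg_glucose_level"]

# One correlation matrix per group (Pearson is the default method).
stroke_corr = df.loc[df["stroke"] == 1, cols].corr()
non_stroke_corr = df.loc[df["stroke"] == 0, cols].corr()

print(stroke_corr.round(2))
print(non_stroke_corr.round(2))
```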
III. Visualization:
Plotting functions were used to generate various charts that make the analyzed data
easier to interpret.
A bar chart for the number of stroke patients based on their BMI was generated. The
chart shows more normal-weight and overweight patients than underweight patients,
consistent with the results from the .bins function.
A bar chart for the number of stroke patients based on their average glucose level was
also generated. Since the average glucose level differs for nearly every patient, this
plot does not visualize the data well. A better visualization could be generated by
grouping average glucose levels (e.g. 55-60 mg/dL, 60-65 mg/dL, etc.).
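The grouping suggested above could be sketched as follows, using pandas.cut with 5 mg/dL wide groups and a matplotlib bar chart. The readings are made up. The Agg backend line only lets the sketch run without a display and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up average glucose readings (mg/dL) for stroke patients.
glucose = pd.Series([56.2, 58.9, 61.3, 63.8, 64.1, 67.5, 71.0, 72.4, 73.9, 78.8])

# Group edges every 5 mg/dL: 55, 60, 65, 70, 75, 80.
edges = np.arange(55, 85, 5)

# right=False makes each group include its lower edge: [55, 60), [60, 65), ...
groups = pd.cut(glucose, bins=edges, right=False)
counts = groups.value_counts().sort_index()

# One bar per group instead of one bar per distinct reading.
labels = [f"{iv.left}-{iv.right}" for iv in counts.index]
fig, ax = plt.subplots()
ax.bar(labels, counts.to_numpy())
ax.set_xlabel("Average glucose level (mg/dL)")
ax.set_ylabel("Number of stroke patients")
```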
A bar chart for the number of stroke patients based on their age was also generated, and
it shows the frequency of stroke increasing as age increases. On the other hand, below
the age of 45 the number of strokes is very small.
To better visualize the trends and correlations between health indicators, scatterplots
were generated.
● For age and BMI, a scatterplot of stroke and non-stroke patients was generated. Orange
represents stroke patients and blue represents non-stroke patients. For non-stroke
patients, BMI is distributed evenly across ages. For stroke patients, the data are
heavily concentrated at 70-80 years old, indicating a higher prevalence of stroke at
older ages. The BMI of stroke patients does not seem to have a relationship with age.
● The scatterplot of average glucose level and age for both stroke and non-stroke
patients showed no clear trend. However, there are sparse regions around the 150 to
200 mg/dL level. Data are heavily concentrated below the 125 mg/dL level, with some
points above the 200 mg/dL level, for both stroke and non-stroke patients. Again, more
points were seen in the higher age range for stroke patients, indicating a prevalence of
stroke in older patients.
● In the scatterplot of BMI and average glucose level, the data are concentrated on the
left side, with BMI lower than 40. The same sparse region appears at average glucose
levels of around 150 - 200 mg/dL. No clear relationship between BMI and average glucose
level was established from this scatterplot.
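A two-color scatterplot of the kind described above could be drawn as in this sketch, assuming matplotlib. The records and column names are made up. The colors follow the text (orange for stroke, blue for non-stroke). The Agg backend line only lets the sketch run without a display and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up records; the column names are assumptions.
df = pd.DataFrame({
    "age": [78, 72, 81, 75, 30, 42, 55, 23, 64, 18],
    "bmi": [27.1, 31.4, 25.0, 33.2, 22.5, 28.0, 30.1, 21.0, 35.5, 19.8],
    "stroke": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

colors = {0: "tab:blue", 1: "tab:orange"}
names = {0: "non-stroke", 1: "stroke"}

fig, ax = plt.subplots()
# One scatter call per group so each group gets its own color and legend entry.
for flag, group in df.groupby("stroke"):
    ax.scatter(group["age"], group["bmi"],
               color=colors[flag], label=names[flag], alpha=0.6)

ax.set_xlabel("Age (years)")
ax.set_ylabel("BMI")
ax.legend()
```

The same loop works for the other pairs by swapping the two column names passed to ax.scatter.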
Correlation matrices were plotted for both stroke and non-stroke patients to better
visualize results from part 1 of the EDA, where the correlation coefficients were
discussed in detail.
IV. Results
After doing various analyses on stroke patient data, mainly focusing on age, BMI, and
average glucose level, it was seen that patients who suffer from stroke are mainly
patients above 40 years old, and the older the patient gets, the higher the rate of stroke.
For age, 4% of stroke patients are below 45 years old, 33% are between 45 and 65, and
63% are above 65 years old.
For BMI, 83% of stroke patients have a BMI over 24.9, indicating overweight; 16% have a
normal BMI, and only 0.5% have an underweight BMI.
For average glucose levels, 57% of stroke patients were seen to have normal glucose
levels, whereas about 43% had abnormal glucose levels (32% with high glucose levels).
For stroke patients, BMI has a negative correlation with age (-0.30) and a positive
correlation with average glucose level (0.34). The magnitude of each coefficient is
greater than 0.1, showing some trend, but it is still far below 0.9, so neither pair of
factors shows a definite correlation.
V. Conclusion
A trend toward overweight can be seen among stroke patients. For average glucose levels,
no clear trend was seen, as 57% of stroke patients had normal glucose levels. Further
statistical analysis, such as regression and t-tests with their p-values, should be
completed to analyze the data further.
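As one example of the follow-up analysis suggested above, the sketch below compares the mean age of stroke and non-stroke patients with Welch's t-test from scipy.stats. This test was not part of the original analysis. The ages are made up and the choice of Welch's variant (no equal-variance assumption) is mine.

```python
import pandas as pd
from scipy import stats

# Made-up ages; not taken from the dataset.
df = pd.DataFrame({
    "age": [72, 68, 80, 77, 59, 34, 41, 52, 29, 60, 45, 38],
    "stroke": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

stroke_age = df.loc[df["stroke"] == 1, "age"]
non_stroke_age = df.loc[df["stroke"] == 0, "age"]

# equal_var=False selects Welch's t-test, which allows unequal group variances.
result = stats.ttest_ind(stroke_age, non_stroke_age, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that the difference in mean age between the two groups is unlikely to be due to chance alone.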
Although no clear relationship can be seen, patient health data such as BMI, glucose
levels, and age could be used to monitor a patient's susceptibility to stroke and to
support the stroke prevention program. A more data-driven approach to healthcare
management could also be built by using available patient data to create more
personalized healthcare or disease prevention plans.