[LSS2024] (B) Programming with Python

Data project - Final assignment

Valerie Riady Huette


24 July 2024
I. Introduction
According to the World Health Organization (WHO), stroke is the second leading cause of
death and the third leading cause of disability worldwide [Benjamin, 2023 & WHO, 2020].
One in four adults above the age of 25 is predicted to have a stroke in their lifetime
[World Stroke Organization, 2020]. For this project, health data of 5110 stroke and
non-stroke patients were obtained to study the relationship between stroke and other
health indicators such as BMI, average glucose level, hypertension, and age.

Business case: from the viewpoint of a hospital, I would like to propose a "Stroke
prevention program". Patients who are at high risk for stroke (those who consistently
have glucose levels above the threshold, are overweight, and are older than 65) would be
actively encouraged to enroll. The program would consist of a monthly blood work test,
and those who opt for a premium plan could get an additional glucose monitoring device
to check their average glucose levels at home. There would also be classes on nutrition,
exercise, and stroke basics ("Stroke 101") in which patients can enroll to improve their
lifestyle.

Question to be answered: Can age, BMI, average glucose level, and hypertension status
predict a stroke?

Python, Jupyter, and various libraries will be used to aid the exploratory data analysis
(EDA), calculations, and visualization of the dataset to obtain insights into the
correlation between stroke and other health indicators.

II. Preliminary Exploratory Data Analysis (EDA)

Healthcare data (.csv) from Kaggle is imported into JupyterLite and read with Python
code. The required libraries were also imported into Jupyter. The .head function is then
used to display the top 5 rows of data, as shown below, with various health indicators:
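
As a minimal sketch of this loading step, the snippet below reads a few made-up rows with pandas and shows the first five. The columns age, avg_glucose_level, and bmi are named in this report; the other column names and all values are assumptions for illustration, and the in-memory text stands in for the actual Kaggle file.

```python
import io

import pandas as pd

# Made-up sample standing in for the Kaggle CSV (illustrative values only).
csv_text = """id,age,hypertension,avg_glucose_level,bmi,stroke
1,67,0,228.69,36.6,1
2,61,0,202.21,,1
3,80,0,105.92,32.5,1
4,49,0,171.23,34.4,0
5,79,1,174.12,24.0,0
6,35,0,83.50,22.1,0
"""

# With a real file, pass its path to read_csv instead of a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default.
top_rows = df.head()
print(top_rows)
```

The empty bmi field in the second row is read as a missing value (NaN), which is how missing BMI entries would show up after loading.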

Initial information about the data set is obtained through various functions as shown below:

● The .info function is used to determine the data type of each column and how many
cells are empty. Most cells in the data set are filled, except for BMI, which has 201
empty entries.
● Data types align with expectation: the numerical data (age, avg_glucose_level, and
bmi) are either integer or float. Columns with the object data type will not be used in
this analysis, so no data type conversion is necessary.
● The shape of the data set was determined using .shape: 5110 data entries (rows, not
counting the header row) and 12 columns of health indicators.
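
The checks in the list above can be sketched as follows on a small made-up table. The gender column name and all values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows; one BMI value is deliberately missing.
df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23],
    "bmi": [36.6, None, 32.5, 34.4],
    "gender": ["Male", "Female", "Male", "Female"],
})

# info() prints each column's dtype and its non-null count.
df.info()

# Number of missing cells per column.
missing_per_column = df.isna().sum()

# shape is (rows, columns); the header row is not counted as a row.
n_rows, n_cols = df.shape
print(missing_per_column)
print(n_rows, n_cols)
```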

The list of columns (health indicators) is obtained by using .columns as shown
below:

A new column is created to map the stroke status of each patient. Initially the stroke
status is coded as either 0 or 1. Mapping these codes to descriptive labels makes a
patient's status easier to identify.
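
A minimal sketch of such a mapping is shown here. The new column name stroke_status and the label text are hypothetical choices, not taken from the original notebook.

```python
import pandas as pd

# Made-up rows with the 0/1 stroke code.
df = pd.DataFrame({"id": [1, 2, 3], "stroke": [1, 0, 0]})

# Map the numeric code to a readable label in a new column.
status_labels = {0: "No stroke", 1: "Stroke"}
df["stroke_status"] = df["stroke"].map(status_labels)
print(df)
```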

For this analysis, several columns are excluded as they are not relevant, such as
marriage status, type of work, and residence type. This keeps the focus on more
quantitative data such as age, BMI, and average glucose level to predict stroke.

Using the .describe function, preliminary information such as the mean and percentiles
of each health indicator (column) is generated. The summary statistics for the ID column
are not meaningful because it is only an identifier. For the hypertension, heart
disease, and stroke columns, entries are only 1 or 0, indicating a “Yes” or a “No”
answer, so their percentiles do not give useful quantitative results. Their mean values,
however, tell us what fraction of all patients have that specific condition, since a “1”
indicates a yes. BMI has a mean of 28.9, which is higher than expected (BMI > 24.9 is
considered overweight) since most of the patients (about 95%) in the study are
non-stroke patients. On the other hand, average glucose level has a mean of 106.1 mg/dL,
which is within the normal range.
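
The point about 0/1 columns can be illustrated with a small sketch on made-up values: describe() reports the usual summary statistics, and the mean of a 0/1 column equals the share of patients coded 1.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "bmi": [36.6, 32.5, 24.0, 22.1],
    "hypertension": [0, 0, 1, 0],
    "stroke": [1, 0, 0, 0],
})

# count, mean, std, min, percentiles, and max for each numeric column.
summary = df.describe()
print(summary)

# For a 0/1 column the mean is the fraction of rows equal to 1.
stroke_share = df["stroke"].mean()
print(stroke_share)
```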

The .nunique function is used to count the unique values of each indicator. The ID
column has 5110 unique entries, matching the number of rows from .shape, so no ID is
repeated. Heart disease, hypertension, and stroke status have only 2 unique values each,
as expected for a yes or no answer. The other numerical indicators have unique-value
counts that align with expectation.
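
A short sketch of this uniqueness check, using made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "hypertension": [0, 1, 0, 0],
    "stroke": [0, 0, 1, 0],
})

# Number of distinct values in each column.
unique_counts = df.nunique()
print(unique_counts)

# If every id is distinct, the unique count equals the number of rows.
ids_are_unique = unique_counts["id"] == len(df)
print(ids_are_unique)
```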

A filter is used to select only stroke patients for the first part of the analysis,
where we analyze the relationship between stroke and the other health indicators below.
The .shape function shows 249 rows, that is, 249 stroke patients. Since we will not be
using mean or sum as a standard, patients with missing data are kept in the analysis
(rows with missing data are neither eliminated nor replaced). The .info function showed
that BMI had only 209 entries, with the remaining 40 rows missing BMI data.
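
One common way to do this filtering in pandas is a boolean mask, sketched here on made-up rows. Whether the original notebook used a mask or another method is an assumption.

```python
import pandas as pd

# Made-up rows; one stroke patient has no BMI recorded.
df = pd.DataFrame({
    "age": [67, 61, 35, 49],
    "bmi": [36.6, None, 22.1, 34.4],
    "stroke": [1, 1, 0, 0],
})

# Keep only rows where the stroke code is 1 (and, separately, 0).
stroke_df = df[df["stroke"] == 1]
non_stroke_df = df[df["stroke"] == 0]

# Each row is one patient, so the row count is the patient count.
n_stroke = stroke_df.shape[0]

# Rows with a recorded BMI; rows with missing BMI are kept, just not counted here.
bmi_known = stroke_df["bmi"].notna().sum()
print(n_stroke, bmi_known)
```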

The hypertension status of stroke patients was obtained, and only 26.5% of stroke
patients have hypertension. On its own, this share is not large enough to establish a
clear relationship between hypertension and stroke.

The value counts function is used to count the number of stroke patients for each BMI
value to see trends of BMI amongst stroke patients. These BMI counts are then grouped
into 3 categories: underweight (BMI < 18.5), normal weight (18.5 ≤ BMI ≤ 24.9), and
overweight (BMI > 24.9). Only 0.5% (1 patient) of stroke patients are underweight, 16.7%
have normal weight, and 82.8% are overweight, indicating a strong trend between stroke
and being overweight.
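
The grouping into three BMI categories could be done with pd.cut, as in this sketch on made-up BMI values. Using pd.cut and half-open intervals is an assumption: here a BMI of exactly 18.5 counts as normal and exactly 24.9 as overweight.

```python
import pandas as pd

# Made-up BMI values for stroke patients; None marks a missing entry.
bmi = pd.Series([17.0, 22.0, 24.0, 27.5, 30.1, 36.6, None, 41.0])

# right=False makes the intervals [0, 18.5), [18.5, 24.9), [24.9, inf).
categories = pd.cut(
    bmi,
    bins=[0, 18.5, 24.9, float("inf")],
    right=False,
    labels=["underweight", "normal", "overweight"],
)

# Percentage of patients per category; missing BMI values are left out.
share = categories.value_counts(normalize=True) * 100
print(share.round(1))
```

The same pattern, with different cut points, applies to the glucose and age groupings.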

A binning function (.bins) was used to find the BMI ranges when the data set is split
into 3 equal parts, as shown here. The first bin has a range of 10.4, the second a
smaller range of only 4.5, and the third a larger range of 24.8. This shows that the BMI
data is concentrated in the 2nd bin, while the data points in the 3rd bin are widely
spread out.
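
The unequal ranges reported above would be consistent with equal-count (quantile) binning, which pd.qcut performs. The sketch below assumes that is what the .bins step corresponds to and uses made-up BMI values.

```python
import pandas as pd

# Made-up BMI values for illustration only.
bmi = pd.Series([16.0, 20.0, 24.0, 26.0, 27.0, 28.0, 30.0, 40.0, 55.0])

# Split into 3 bins holding (roughly) the same number of patients each.
binned, edges = pd.qcut(bmi, q=3, retbins=True)

# Width of each bin: a narrow bin means the values are packed closely together.
widths = edges[1:] - edges[:-1]

# Patients per bin, in bin order.
counts = binned.value_counts(sort=False)
print(counts)
print(widths)
```

With equal counts per bin, a narrow middle bin indicates that values cluster there, while a wide last bin indicates a long, sparse upper tail.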

Average glucose level is analyzed the same way as BMI, with the following results:
10.8% of stroke patients had an average glucose level below normal (< 70 mg/dL), 57% had
a normal average glucose level (70 to 179 mg/dL), and 32.1% had an average glucose level
above normal (> 179 mg/dL). Since more than half of stroke patients had a normal glucose
level, no clear relationship between stroke and high glucose can be established from
these shares alone; high glucose levels do not necessarily lead to stroke.

Using the .bins function, the first bin has an average glucose level range of 31.9, the
second a larger range of 82.2, and the third the largest range of 101.5. This shows that
the average glucose levels for patients are concentrated in the 1st bin, while the 2nd
bin has more spread-out data points and the 3rd bin has the most spread-out data points.
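
The glucose shares above describe stroke patients only. A complementary check, sketched here on made-up data, is the stroke rate within each glucose group across all patients, since a group can hold most of the strokes simply because it holds most of the patients. The group names and all values are hypothetical.

```python
from collections import Counter

# Made-up (avg_glucose_level, stroke) pairs for all patients, not just stroke patients.
patients = [
    (60.0, 0),
    (80.0, 0), (85.0, 0), (90.0, 0), (95.0, 1), (100.0, 0),
    (105.0, 0), (110.0, 0), (120.0, 0), (130.0, 1), (150.0, 0),
    (190.0, 1), (220.0, 0),
]

def glucose_group(level):
    """Assign a glucose level to the cut points used in this report."""
    if level < 70:
        return "low"
    if level <= 179:
        return "normal"
    return "high"

totals = Counter(glucose_group(g) for g, _ in patients)
strokes = Counter(glucose_group(g) for g, s in patients if s == 1)

# Share of all strokes that falls in each group (what the report computes).
share_of_strokes = {grp: strokes[grp] / sum(strokes.values()) for grp in totals}

# Stroke rate inside each group (strokes divided by patients in that group).
stroke_rate = {grp: strokes[grp] / totals[grp] for grp in totals}
print(share_of_strokes)
print(stroke_rate)
```

In this made-up sample most strokes fall in the normal group, yet the stroke rate is higher in the high group, which is why the two measures are worth reading together.
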

The same analysis was completed for age and it was found that only 4% of stroke
patients were under the age of 45, 33.3% between 45 and 65 years old, and 62.7% were
above the age of 65. It can be seen that the number of stroke patients increases rapidly
after the age of 45 and even more after the age of 65. We can say that stroke has a
positive correlation with age.

Using the .bins function, the first bin has an age range of 61.7, the second a smaller
range of 14, and the third the smallest range of only 5. This shows that patient ages
are concentrated in the 3rd bin, while the 2nd bin has more spread-out data points and
the 1st bin has the most spread-out data points.

Correlation coefficients for various health indicators for stroke patients were
determined using the .corr function.
● The correlation between age and average glucose level is positive, with a correlation
coefficient of 0.11 (almost negligible). This shows no meaningful relationship between
age and average glucose level.
● Surprisingly, the correlation between BMI and age is negative, with a correlation
coefficient of -0.30, meaning that as age increases, BMI tends to decrease in this
dataset. However, a coefficient of -0.30 is too weak to establish a clear relationship.
This result was unexpected, as I predicted BMI would be higher as people age due to
slower metabolism.

● The correlation between BMI and average glucose level is positive, with a correlation
coefficient of 0.34. As BMI increases, glucose levels tend to increase, but with no
clear relationship, as 0.34 is a small coefficient.
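
The pairwise coefficients in the list above can be produced in one call, as in this sketch on made-up values (DataFrame.corr uses the Pearson coefficient by default). The numbers here are illustrative and unrelated to the results reported above.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "age": [50, 60, 70, 80],
    "bmi": [34.0, 31.0, 29.0, 26.0],
    "avg_glucose_level": [100.0, 180.0, 90.0, 210.0],
})

# Symmetric matrix of pairwise Pearson correlation coefficients.
corr = df.corr()
print(corr.round(2))

# A single pair can be read off by row and column label.
age_bmi = corr.loc["age", "bmi"]
print(age_bmi)
```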

For non-stroke patients, no negative correlations were seen, with details below:

● The correlation between age and BMI is positive, with a correlation coefficient of
0.35. However, the number is too small to validate a clear correlation between the 2
indicators.
● Average glucose level also had a positive correlation with age (0.22), but lower than
that of BMI. Again, no clear relationship between age and average glucose level can be
made.
● The correlation between BMI and average glucose level is positive at 0.165. This does
not show a strong relationship between BMI and average glucose level.

It was interesting and unexpected to see that the correlation coefficients involving age
were higher for non-stroke patients than for stroke patients, whereas the coefficient
between BMI and average glucose level was lower (0.165 versus 0.34).

III. Visualization:
Visualization tools/functions were used to generate various charts for easier
visualizations of the analyzed data.

A bar chart of the number of stroke patients by BMI was generated. More normal-weight
and overweight patients than underweight patients are seen on this chart, consistent
with the earlier BMI analysis.

A bar chart of the number of stroke patients by average glucose level was also
generated. Since the average glucose level differs for nearly every patient, this plot
does not visualize the data well. A better visualization could be generated by grouping
average glucose levels (e.g. 55-60 mg/dL, 60-65 mg/dL, etc.).

A bar chart of the number of stroke patients by age was also generated, and a trend of
increasing stroke frequency can be seen as age increases. Below the age of 45, on the
other hand, the number of strokes is very small.
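
The grouping suggested for the glucose bar chart could look like the following sketch, which counts made-up glucose values in 5 mg/dL groups before plotting. The values and the range of edges are illustrative assumptions.

```python
import pandas as pd

# Made-up average glucose levels (mg/dL) for illustration only.
glucose = pd.Series([56.1, 57.8, 61.2, 63.0, 64.9, 71.5])

# Group edges every 5 mg/dL: 55, 60, 65, 70, 75.
edges = list(range(55, 80, 5))

# right=False gives groups [55, 60), [60, 65), and so on.
grouped = pd.cut(glucose, bins=edges, right=False)

# Patients per group, ordered by group rather than by count.
counts = grouped.value_counts().sort_index()
print(counts)
```

The resulting counts can then be passed to a bar chart (for example with counts.plot.bar(), which requires matplotlib) so that each bar covers a range of values instead of a single reading.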

To better visualize the trends and correlations between health indicators, scatterplots
were generated.

● For age and BMI, a scatterplot of stroke and non-stroke patients was generated.
Orange represents stroke patients and blue represents non-stroke patients. For
non-stroke patients, BMI is distributed evenly across ages. For stroke patients, the
data is heavily concentrated at 70-80 years old, indicating a prevalence of stroke at
higher age. The BMI of stroke patients does not seem to have a relationship with age.

● The scatterplot of average glucose level and age for both stroke and non-stroke
patients showed no clear trend. However, we do see blank spots in the data around the
150 to 200 mg/dL level. Data is heavily concentrated below the 125 mg/dL level, with
some above the 200 mg/dL level, for both stroke and non-stroke patients. Again, more
points were seen in the higher age range for stroke patients, indicating a prevalence of
stroke in older patients.

● For the scatterplot of BMI and average glucose level, the data is concentrated on the
left side, with BMI lower than 40. We also see the same blank spots at average glucose
levels of around 150-200 mg/dL. No clear relationship between BMI and average glucose
level was established from this scatterplot.

Correlation matrices were plotted for both stroke and non-stroke patients to better
visualize the results from part 1 of the EDA, where the correlation coefficients were
discussed in detail.

IV. Results

After various analyses of the stroke patients' data, mainly focusing on age, BMI, and
average glucose level, it was seen that patients who suffer from stroke are mainly above
45 years old, and the older the patient, the higher the rate of stroke. For age, 4% of
stroke patients are below 45 years old, 33% are between 45 and 65, and 63% are above 65
years old.

For BMI, 83% of stroke patients have a BMI over 24.9, indicating overweight; 17% have a
normal BMI, and only 0.5% have an underweight BMI.

For average glucose level, 57% of stroke patients were seen to have normal glucose
levels, whereas about 43% had abnormal glucose levels (32% with high glucose levels).

BMI has a negative correlation with age (-0.30) and a positive correlation with average
glucose level (0.34) for stroke patients. The magnitude of each coefficient is greater
than 0.1, showing some trend, but still far less than 0.9, showing no definite
correlation between the factors.
V. Conclusion

A trend toward overweight can be seen among stroke patients. For average glucose level,
no clear trend was seen, as 57% of stroke patients had normal glucose levels. Further
statistical analysis, such as regression and t-tests with their p-values, should be
completed to analyze the data further.
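
As one illustration of the suggested follow-up, the sketch below computes Welch's t statistic for the difference in mean age between two groups. The ages are made up and the function name welch_t is hypothetical; turning the statistic into a p-value additionally needs the t distribution, for example via scipy.stats.ttest_ind with equal_var=False.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    # statistics.variance is the sample variance (divides by n - 1).
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Made-up ages for illustration only.
stroke_ages = [67, 80, 74, 69, 81]
non_stroke_ages = [35, 49, 52, 28, 61]

t_stat = welch_t(stroke_ages, non_stroke_ages)
print(round(t_stat, 2))
```

A large positive value means the first group's mean age is well above the second's relative to the sampling noise.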

Although no clear relationship can be seen, patient health data such as BMI, glucose
levels, and age could be used to monitor a patient's susceptibility to stroke and to
support the stroke prevention program. A more data-driven approach to healthcare
management could also be constructed by utilizing available patient data to create a
more personalized healthcare plan or disease prevention plan.

VI. References

1. Benjamin, E. J., Blaha, M. J., Chiuve, S. E., Cushman, M., Das, S. R., Deo, R., ...
& Muntner, P. (2023). Heart disease and stroke statistics—2023 update: A report
from the American Heart Association. Journal of the American Heart Association,
12(2), e027044. https://doi.org/10.1161/JAHA.122.027044
2. World Health Organization. (n.d.). The top 10 causes of death. Retrieved July 23,
2024, from https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
3. World Stroke Organization. (n.d.). Impact of stroke. Retrieved July 23, 2024, from
https://www.world-stroke.org/world-stroke-day-campaign/about-stroke/impact-of-stroke
