Eda-Ml-Decision-Tree - Ipynb - Colab

The document describes a Python environment set up for data analysis, specifically focusing on a heart disease dataset containing various health indicators and risk factors. It outlines the dataset's structure, including columns such as age, gender, blood pressure, and cholesterol levels, which can be used for health research and analysis. Additionally, it includes code snippets for data loading, exploration, and visualization using libraries like pandas and plotly.

Uploaded by

Syeda Maria Shahzadi

# This Python 3 environment comes with many helpful analytics libraries installed

# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# For example, here's several helpful packages to load

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory

# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Sav
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/heart-disease/heart_disease.csv

pip install plotly

Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages (5.24.1)

Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly) (9.0.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from plotly) (24.2)
Note: you may need to restart the kernel to use updated packages.

DATASET:
This dataset contains various health indicators and risk factors related to heart disease. Parameters such as age, gender, blood pressure,
cholesterol levels, smoking habits, and exercise patterns have been collected to analyze heart disease risk and contribute to health research.
The dataset can be used by healthcare professionals, researchers, and data analysts to examine trends related to heart disease, identify risk
factors, and perform various health-related analyses.

Columns:

Age: The individual's age.

Gender: The individual's gender (Male or Female).

Blood Pressure: The individual's blood pressure (systolic).

Cholesterol Level: The individual's total cholesterol level.

Exercise Habits: The individual's exercise habits (Low, Medium, High).

Smoking: Whether the individual smokes or not (Yes or No).

Family Heart Disease: Whether there is a family history of heart disease (Yes or No).

Diabetes: Whether the individual has diabetes (Yes or No).

BMI: The individual's body mass index.

High Blood Pressure: Whether the individual has high blood pressure (Yes or No).

Low HDL Cholesterol: Whether the individual has low HDL cholesterol (Yes or No).

High LDL Cholesterol: Whether the individual has high LDL cholesterol (Yes or No).

Alcohol Consumption: The individual's alcohol consumption level (None, Low, Medium, High).

Stress Level: The individual's stress level (Low, Medium, High).

Sleep Hours: The number of hours the individual sleeps.

Sugar Consumption: The individual's sugar consumption level (Low, Medium, High).

Triglyceride Level: The individual's triglyceride level.

Fasting Blood Sugar: The individual's fasting blood sugar level.

CRP Level: The C-reactive protein level (a marker of inflammation).

Homocysteine Level: The individual's homocysteine level (an amino acid that affects blood vessel health).

Heart Disease Status: The individual's heart disease status (Yes or No).

import pandas as pd
import plotly.express as px
import pprint
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

df = pd.read_csv('/kaggle/input/heart-disease/heart_disease.csv')
df.head()

/usr/local/lib/python3.10/dist-packages/pandas/io/formats/format.py:1458: RuntimeWarning: invalid value encountered in greater

has_large_values = (abs_vals > 1e6).any()
/usr/local/lib/python3.10/dist-packages/pandas/io/formats/format.py:1459: RuntimeWarning: invalid value encountered in less
has_small_values = ((abs_vals < 10 ** (-self.digits)) & (abs_vals > 0)).any()
/usr/local/lib/python3.10/dist-packages/pandas/io/formats/format.py:1459: RuntimeWarning: invalid value encountered in greater
has_small_values = ((abs_vals < 10 ** (-self.digits)) & (abs_vals > 0)).any()

   Age   Gender  Blood Pressure  Cholesterol Level  Exercise Habits  Smoking  Family Heart Disease  Diabetes  BMI        High Blood Pressure  ...  High LDL Cholesterol  Alcohol Consumption  Stress Level
0  56.0  Male    153.0           155.0              High             Yes      Yes                   No        24.991591  Yes                  ...  No                    High                 Medium
1  69.0  Female  146.0           286.0              High             No       Yes                   Yes       25.221799  No                   ...  No                    Medium               High
2  46.0  Male    126.0           216.0              Low              No       No                    No        29.855447  No                   ...  Yes                   Low                  Low
3  32.0  Female  122.0           293.0              High             Yes      Yes                   No        24.130477  Yes                  ...  Yes                   Low                  High
4  60.0  Male    166.0           242.0              Low              Yes      Yes                   Yes       20.486289  Yes                  ...  No                    Low                  High

5 rows × 21 columns

df.shape

(10000, 21)

EDA

df.isnull().sum()

Age 29
Gender 19
Blood Pressure 19
Cholesterol Level 30
Exercise Habits 25
Smoking 25
Family Heart Disease 21
Diabetes 30
BMI 22
High Blood Pressure 26
Low HDL Cholesterol 25
High LDL Cholesterol 26
Alcohol Consumption 2586
Stress Level 22
Sleep Hours 25
Sugar Consumption 30
Triglyceride Level 26
Fasting Blood Sugar 22
CRP Level 26
Homocysteine Level 20
Heart Disease Status 0
dtype: int64

df.dtypes

Age float64
Gender object
Blood Pressure float64
Cholesterol Level float64
Exercise Habits object
Smoking object
Family Heart Disease object
Diabetes object
BMI float64
High Blood Pressure object
Low HDL Cholesterol object
High LDL Cholesterol object
Alcohol Consumption object
Stress Level object
Sleep Hours float64
Sugar Consumption object
Triglyceride Level float64
Fasting Blood Sugar float64
CRP Level float64
Homocysteine Level float64
Heart Disease Status object
dtype: object

df['Age'].describe()

count 9971.000000
mean 49.296259
std 18.193970
min 18.000000
25% 34.000000
50% 49.000000
75% 65.000000
max 80.000000
Name: Age, dtype: float64

lists = df.columns
lists

Index(['Age', 'Gender', 'Blood Pressure', 'Cholesterol Level',

'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'BMI',
'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL Cholesterol',
'Alcohol Consumption', 'Stress Level', 'Sleep Hours',
'Sugar Consumption', 'Triglyceride Level', 'Fasting Blood Sugar',
'CRP Level', 'Homocysteine Level', 'Heart Disease Status'],
dtype='object')

dataframe = df

invalid_column_for_displaying = []
for name in lists:
    data = dataframe[name]

    # Columns in which every non-null value is unique cannot be summarized as counts
    if data.count() != data.nunique():
        # Fill missing values for display purposes only
        if data.isnull().sum() != 0:
            if data.dtype == 'float64' or data.dtype == 'int64':
                data = data.fillna(0)
            else:
                data = data.fillna('missing')

        if data.unique().shape[0] <= 10:
            # Few distinct values: pie chart
            fig = px.pie(data, names=name, color=name, title=f'{name} Distribution')
            fig.show()
        else:
            # Many distinct values: bar chart of the value counts
            counts = data.value_counts().reset_index()
            counts.columns = [name, 'count']
            fig = px.bar(counts, x=name, y='count', title=f'{name} Distribution', text_auto=True)
            fig.show()
            print(f'Describe: {data.describe()}')
    else:
        invalid_column_for_displaying.append(name)

print(f"List of columns in the data that couldn't be printed because there are an equal number of unique values as to the shape of the c

[Figure: bar chart "Age Distribution"; x-axis: Age, y-axis: count]

Describe: count 10000.000000

mean 49.153300
std 18.359959
min 0.000000
25% 34.000000
50% 49.000000
75% 65.000000
max 80.000000
Name: Age, dtype: float64

[Figure: pie chart "Gender Distribution"; slice shares: 49.8%, 50%, 0.19% (missing)]

[Figure: bar chart "Blood Pressure Distribution"; x-axis: Blood Pressure, y-axis: count]

Describe: count 10000.000000

mean 149.473200
std 18.728528
min 0.000000
25% 134.000000
50% 150.000000
75% 165.000000
max 180.000000
Name: Blood Pressure, dtype: float64

[Figure: bar chart "Cholesterol Level Distribution"; x-axis: Cholesterol Level, y-axis: count]

Describe: count 10000.000000

mean 224.749300
std 45.223467
min 0.000000
25% 187.000000
50% 225.000000
75% 263.000000
max 300.000000
Name: Cholesterol Level, dtype: float64

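[Figure: pie chart "Exercise Habits Distribution"; slice shares: 33.3%, 33.7%, 32.7%, 0.25% (missing)]

[Figure: pie chart "Smoking Distribution"; slice shares: 48.5%, 51.2%, 0.25% (missing)]

[Figure: pie chart "Family Heart Disease Distribution"; slice shares: 49.8%, 50%, 0.21% (missing)]

[Figure: pie chart "Diabetes Distribution"; slice shares: 49.5%, 50.2%, 0.3% (missing)]

[Figure: pie chart "High Blood Pressure Distribution"; slice shares: 49.5%, 50.2%, 0.26% (missing)]

[Figure: pie chart "Low HDL Cholesterol Distribution"; slice shares: 49.8%, 50%, 0.25% (missing)]

[Figure: pie chart "High LDL Cholesterol Distribution"; slice shares: 49.4%, 50.4%, 0.26% (missing)]

[Figure: pie chart "Alcohol Consumption Distribution"; slice shares: 25%, 25.9% (missing), 24.9%, 24.3%]

[Figure: pie chart "Stress Level Distribution"; slice shares: 33.2%, 33.9%, 32.7%, 0.22% (missing)]

[Figure: pie chart "Sugar Consumption Distribution"; slice shares: 33.3%, 33.9%, 32.5%, 0.3% (missing)]
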
[Figure: bar chart "Triglyceride Level Distribution"; x-axis: Triglyceride Level, y-axis: count]

Describe: count 10000.000000

mean 250.082500
std 87.886504
min 0.000000
25% 175.750000
50% 250.000000
75% 326.000000
max 400.000000
Name: Triglyceride Level, dtype: float64

[Figure: bar chart "Fasting Blood Sugar Distribution"; x-axis: Fasting Blood Sugar, y-axis: count]

Describe: count 10000.000000

mean 119.877900
std 24.221277
min 0.000000
25% 99.000000
50% 120.000000
75% 141.000000
max 160.000000
Name: Fasting Blood Sugar, dtype: float64

[Figure: pie chart "Heart Disease Status Distribution"; slice shares: 20%, 80%]

List of columns in the data that couldn't be printed because there are an equal number of unique values as to the shape of the co
['BMI', 'Sleep Hours', 'CRP Level', 'Homocysteine Level']
Based on the analysis, every non-missing value in the columns listed in 'invalid_column_for_displaying' is unique, so these columns
cannot be summarized with a count-based chart. They are continuous measurements with no repeated values rather than columns without
variability. The notebook nevertheless treats them as not useful for predicting the outcome of heart disease and removes them.

List of columns removed:

'BMI',
'Sleep Hours',
'CRP Level',
'Homocysteine Level'
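
The check behind this list can be sketched in isolation. The snippet below is a minimal illustration on made-up values (the `sample` frame and the helper name `all_values_unique` are assumptions, not part of the notebook): a column is flagged when its number of non-null values equals its number of distinct values.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset
sample = pd.DataFrame({
    "BMI": [24.99, 25.22, 29.86, 24.13, None],    # every non-null value differs
    "Smoking": ["Yes", "No", "No", "Yes", None],  # values repeat
})


def all_values_unique(column: pd.Series) -> bool:
    """Return True when every non-null value in the column occurs exactly once."""
    return int(column.count()) == int(column.nunique())


# Collect the columns that a count-based chart cannot summarize
flagged = [name for name in sample.columns if all_values_unique(sample[name])]
print(flagged)
```

A flag of this kind only says that a value-count chart is uninformative for the column; a histogram or binning would be one way to still look at such a column.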

Removing invalid columns based on the per-column display

df = df.drop(columns=invalid_column_for_displaying)
df.columns

Index(['Age', 'Gender', 'Blood Pressure', 'Cholesterol Level',

'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes',
'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL Cholesterol',
'Alcohol Consumption', 'Stress Level', 'Sugar Consumption',
'Triglyceride Level', 'Fasting Blood Sugar', 'Heart Disease Status'],
dtype='object')

Separate categorical and numerical columns - handling missing values

columns_categorical = []
columns_numerical = []

for i in df.columns:
    if df[i].dtype == object:
        columns_categorical.append(i)
    else:
        columns_numerical.append(i)
print(columns_categorical)
print(columns_numerical)

['Gender', 'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL
['Age', 'Blood Pressure', 'Cholesterol Level', 'Triglyceride Level', 'Fasting Blood Sugar']
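
The same split can probably be written more compactly with `DataFrame.select_dtypes`. This is a minimal sketch on a made-up frame (the `sample` data is an assumption); it treats every non-numeric column as categorical, which matches the loop above for this dataset.

```python
import pandas as pd

# Tiny made-up frame standing in for the heart disease data
sample = pd.DataFrame({
    "Age": [56.0, 69.0, None],
    "Gender": ["Male", "Female", None],
    "Blood Pressure": [153.0, 146.0, 126.0],
    "Smoking": ["Yes", "No", "No"],
})

# Numeric dtypes go to the numerical list, everything else to the categorical list
columns_numerical = sample.select_dtypes(include="number").columns.tolist()
columns_categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(columns_categorical)
print(columns_numerical)
```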

df['Gender'].unique()

array(['Male', 'Female', nan], dtype=object)

df.shape

(10000, 17)

df.isnull().sum()

Age 29
Gender 19
Blood Pressure 19
Cholesterol Level 30
Exercise Habits 25
Smoking 25
Family Heart Disease 21
Diabetes 30
High Blood Pressure 26
Low HDL Cholesterol 25
High LDL Cholesterol 26
Alcohol Consumption 2586
Stress Level 22
Sugar Consumption 30
Triglyceride Level 26
Fasting Blood Sugar 22
Heart Disease Status 0
dtype: int64

Handling the missing values - numerical

dataframe = df.copy()

for i in columns_numerical:
    data = dataframe[i]

    # Calculate skewness while excluding missing values
    skewness = skew(data.dropna())

    plt.figure(figsize=(8, 5))
    sns.kdeplot(data, color="skyblue")

    # Overlay mean and median
    plt.axvline(data.mean(), color='red', linestyle='dashed', label=f'Mean: {data.mean():.2f}')
    plt.axvline(data.median(), color='blue', linestyle='dashed', label=f'Median: {data.median():.2f}')

    # Titles and legends
    plt.title(f"{i} Distribution | Skewness: {skewness:.2f}", fontsize=14)
    plt.xlabel(i)
    plt.ylabel("Frequency")
    plt.legend()
    plt.show()

    if skewness == 0:
        print(f'{i} : {skewness:.2f} - Normally distributed')
    elif skewness > 0:
        print(f'{i} : {skewness:.2f} - Right skew (Positive skew)')
    else:
        print(f'{i} : {skewness:.2f} - Left skew (Negative skew)')

/usr/local/lib/python3.10/dist-packages/seaborn/_oldcore.py:1119: FutureWarning:

use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.

Age : -0.01 - Left skew (Negative skew)

Blood Pressure : 0.01 - Right skew (Positive skew)

Cholesterol Level : -0.01 - Left skew (Negative skew)
Triglyceride Level : 0.01 - Right skew (Positive skew)

Fasting Blood Sugar : -0.01 - Left skew (Negative skew)

Based on the results from the skewness of the data and the missing values from each feature for the numeric columns, the following
decision has been made:

Age:

skewness = -0.01
missing = 0.29 %
imputer = median
Reason: The data is slightly left-skewed, and the median is more robust to outliers, ensuring a more accurate central tendency.

Blood Pressure:

skewness = 0.01
missing = 0.19 %
imputer = median
Reason: Since the skewness is close to 0, the data is nearly symmetrical, making the median a reliable choice that minimizes
distortion from outliers.

Cholesterol Level:

skewness = -0.01
missing = 0.30 %
imputer = median
Reason: A near-normal distribution but slightly skewed; using the median prevents the impact of extreme cholesterol values.

Triglyceride Level:

skewness = 0.01
missing = 0.26 %
imputer = median
Reason: The data is slightly right-skewed; the median is less affected by extreme values, providing a stable imputation.

Fasting Blood Sugar:

skewness = -0.01
missing = 0.22 %
imputer = median
Reason: Given the slight skewness, the median helps avoid potential biases introduced by extreme blood sugar levels.
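
The figures quoted in these decisions can be reproduced with a short summary table. The sketch below uses made-up values (the `sample` frame is an assumption) and pandas' own `skew()`, which applies a bias correction and can therefore differ slightly from `scipy.stats.skew` as used in the notebook.

```python
import pandas as pd

# Illustrative values only; the notebook works on the 10,000-row dataset
sample = pd.DataFrame({
    "Age": [56.0, 69.0, 46.0, None, 60.0],
    "Blood Pressure": [153.0, 146.0, 126.0, 122.0, 166.0],
})

summary = pd.DataFrame({
    "missing_pct": sample.isnull().mean() * 100,  # share of missing rows, in percent
    "skewness": sample.skew(),                    # NaN values are skipped by default
    "median": sample.median(),                    # the value a median imputer would fill in
})
print(summary.round(2))

# The same arithmetic for the Age figure quoted above: 29 missing out of 10,000 rows
age_missing_pct = 29 / 10000 * 100
print(f"{age_missing_pct:.2f} %")
```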

Handling the missing values - categorical

columns_categorical

['Gender',
'Exercise Habits',
'Smoking',
'Family Heart Disease',
'Diabetes',
'High Blood Pressure',
'Low HDL Cholesterol',
'High LDL Cholesterol',
'Alcohol Consumption',
'Stress Level',
'Sugar Consumption',
'Heart Disease Status']

df[columns_categorical].isnull().sum()

Gender 19
Exercise Habits 25
Smoking 25
Family Heart Disease 21
Diabetes 30
High Blood Pressure 26
Low HDL Cholesterol 25
High LDL Cholesterol 26
Alcohol Consumption 2586
Stress Level 22
Sugar Consumption 30
Heart Disease Status 0
dtype: int64

Based on the results for the missing values for the categorical columns, the following decision has been made:

Most frequent:

['Gender', 'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'High Blood Pressure', 'Low HDL Cholesterol',
'High LDL Cholesterol', 'Stress Level', 'Sugar Consumption']

No action:

Heart Disease Status

Remove column:

Alcohol Consumption: 25.86% of its values are missing, which is a very high rate. The missing values might indicate non-drinkers
or data entry issues. Dropping the column might be best.
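
One possibility worth checking before dropping the column: the dataset description lists "None" as a valid Alcohol Consumption level, and recent pandas versions (2.0 and later, as far as I know) treat the literal string "None" as a missing value when reading a CSV. The sketch below uses made-up CSV text, not the real file, to show how `keep_default_na=False` keeps such a label as a category; whether this explains the 2586 missing values here is an assumption that would need to be verified against the actual file.

```python
import io

import pandas as pd

# Made-up CSV text: one literal "None" label and one genuinely empty cell
csv_text = "Alcohol Consumption,Age\nNone,30\nLow,41\n,52\nHigh,63\n"

# Default parsing: depending on the pandas version, "None" may be read as NaN
default_parse = pd.read_csv(io.StringIO(csv_text))
print(default_parse["Alcohol Consumption"].isnull().sum())

# Explicit parsing: keep "None" as a category; only empty cells count as missing
explicit_parse = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])
print(explicit_parse["Alcohol Consumption"].value_counts(dropna=False))
```

If the two counts differ on the real file, the column carries a "None" level rather than missing data, and imputing or dropping it would discard information.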

Imputer = Filling in the missing values - Using SimpleImputer

numerical

columns_numerical

['Age',
'Blood Pressure',
'Cholesterol Level',
'Triglyceride Level',
'Fasting Blood Sugar']

dataframe=df

dataframe

      Age   Gender  Blood Pressure  Cholesterol Level  Exercise Habits  Smoking  Family Heart Disease  Diabetes  High Blood Pressure  Low HDL Cholesterol  High LDL Cholesterol  Alcohol Consumption  Stress Level
0     56.0  Male    153.0           155.0              High             Yes      Yes                   No        Yes                  Yes                  No                    High                 Medium
1     69.0  Female  146.0           286.0              High             No       Yes                   Yes       No                   Yes                  No                    Medium               High
2     46.0  Male    126.0           216.0              Low              No       No                    No        No                   Yes                  Yes                   Low                  Low
3     32.0  Female  122.0           293.0              High             Yes      Yes                   No        Yes                  No                   Yes                   Low                  High
4     60.0  Male    166.0           242.0              Low              Yes      Yes                   Yes       Yes                  No                   No                    Low                  High
...   ...   ...     ...             ...                ...              ...      ...                   ...       ...                  ...                  ...                   ...                  ...
9995  25.0  Female  136.0           243.0              Medium           Yes      No                    No        Yes                  No                   Yes                   Medium               High
9996  38.0  Male    172.0           154.0              Medium           No       No                    No        Yes                  No                   Yes                   NaN                  High
9997  73.0  Male    152.0           201.0              High             Yes      No                    Yes       No                   Yes                  Yes                   NaN                  Low
9998  23.0  Male    142.0           299.0              Low              Yes      No                    Yes       Yes                  No                   Yes                   Medium               High
9999  38.0  Female  128.0           193.0              Medium           Yes      Yes                   Yes       No                   Yes                  Yes                   High                 Medium

10000 rows × 17 columns

name_columns_num = dataframe.columns

# Median imputation for the numerical columns (fill_value is only used by strategy='constant')
imputer_median = SimpleImputer(strategy='median')

column_trans = ColumnTransformer(
    transformers=[
        ('impute_age', imputer_median, ['Age']),
        ('impute_blood_pressure', imputer_median, ['Blood Pressure']),
        ('impute_cholesterol', imputer_median, ['Cholesterol Level']),
        ('impute_triglyceride', imputer_median, ['Triglyceride Level']),
        ('impute_fasting_blood_sugar', imputer_median, ['Fasting Blood Sugar']),
    ],
    remainder='passthrough'
)

df_transformed = column_trans.fit_transform(dataframe)
print(df_transformed)

[[56.0 153.0 155.0 ... 'Medium' 'Medium' 'No']

[69.0 146.0 286.0 ... 'High' 'Medium' 'No']
[46.0 126.0 216.0 ... 'Low' 'Low' 'No']
...
[73.0 152.0 201.0 ... 'Low' 'Low' 'Yes']
[23.0 142.0 299.0 ... 'High' 'Medium' 'Yes']
[38.0 128.0 193.0 ... 'Medium' 'High' 'Yes']]

new_name_columns = columns_numerical+columns_categorical

new_name_columns

['Age',
'Blood Pressure',
'Cholesterol Level',
'Triglyceride Level',
'Fasting Blood Sugar',
'Gender',
'Exercise Habits',
'Smoking',
'Family Heart Disease',
'Diabetes',
'High Blood Pressure',
'Low HDL Cholesterol',
'High LDL Cholesterol',
'Alcohol Consumption',
'Stress Level',
'Sugar Consumption',
'Heart Disease Status']

df_num_imputed = pd.DataFrame(df_transformed,columns=new_name_columns)
df_num_imputed.head()

   Age   Blood Pressure  Cholesterol Level  Triglyceride Level  Fasting Blood Sugar  Gender  Exercise Habits  Smoking  Family Heart Disease  Diabetes  High Blood Pressure  Low HDL Cholesterol  High LDL Cholesterol
0  56.0  153.0           155.0              342.0               120.0                Male    High             Yes      Yes                   No        Yes                  Yes                  No
1  69.0  146.0           286.0              133.0               157.0                Female  High             No       Yes                   Yes       No                   Yes                  No
2  46.0  126.0           216.0              393.0               92.0                 Male    Low              No       No                    No        No                   Yes                  Yes
3  32.0  122.0           293.0              293.0               94.0                 Female  High             Yes      Yes                   No        Yes                  No                   Yes
4  60.0  166.0           242.0              263.0               154.0                Male    Low              Yes      Yes                   Yes       Yes                  No                   No

dataframe = df_num_imputed[name_columns_num]
dataframe

      Age   Gender  Blood Pressure  Cholesterol Level  Exercise Habits  Smoking  Family Heart Disease  Diabetes  High Blood Pressure  Low HDL Cholesterol  High LDL Cholesterol  Alcohol Consumption  Stress Level
0     56.0  Male    153.0           155.0              High             Yes      Yes                   No        Yes                  Yes                  No                    High                 Medium
1     69.0  Female  146.0           286.0              High             No       Yes                   Yes       No                   Yes                  No                    Medium               High
2     46.0  Male    126.0           216.0              Low              No       No                    No        No                   Yes                  Yes                   Low                  Low
3     32.0  Female  122.0           293.0              High             Yes      Yes                   No        Yes                  No                   Yes                   Low                  High
4     60.0  Male    166.0           242.0              Low              Yes      Yes                   Yes       Yes                  No                   No                    Low                  High
...   ...   ...     ...             ...                ...              ...      ...                   ...       ...                  ...                  ...                   ...                  ...
9995  25.0  Female  136.0           243.0              Medium           Yes      No                    No        Yes                  No                   Yes                   Medium               High
9996  38.0  Male    172.0           154.0              Medium           No       No                    No        Yes                  No                   Yes                   NaN                  High
9997  73.0  Male    152.0           201.0              High             Yes      No                    Yes       No                   Yes                  Yes                   NaN                  Low
9998  23.0  Male    142.0           299.0              Low              Yes      No                    Yes       Yes                  No                   Yes                   Medium               High
9999  38.0  Female  128.0           193.0              Medium           Yes      Yes                   Yes       No                   Yes                  Yes                   High                 Medium

10000 rows × 17 columns

dataframe[columns_numerical].isnull().sum()

Age 0
Blood Pressure 0
Cholesterol Level 0
Triglyceride Level 0
Fasting Blood Sugar 0
dtype: int64
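
The numerical imputation above reorders the columns, which is why the notebook rebuilds the column names by hand afterwards. A minimal alternative sketch, on made-up data (the `sample` frame and the `numerical` list are assumptions), fits `SimpleImputer` on the numerical columns only and assigns the result back, so column order and the other columns stay untouched:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Age": [56.0, np.nan, 46.0, 32.0],
    "Gender": ["Male", "Female", None, "Female"],
    "Blood Pressure": [153.0, 146.0, np.nan, 122.0],
})
numerical = ["Age", "Blood Pressure"]

# Impute only the numerical columns and write them back in place
imputed = sample.copy()
imputed[numerical] = SimpleImputer(strategy="median").fit_transform(sample[numerical])

print(imputed)
print(imputed[numerical].isnull().sum())
```

This is a design alternative rather than the notebook's method; it also keeps the numerical columns as floats instead of the object dtype that a mixed passthrough array produces.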

categorical

# Drop Alcohol Consumption
dataframe = dataframe.drop(columns='Alcohol Consumption')

cat_columnn='Alcohol Consumption'
columns_categorical = list(filter(lambda x:x!=cat_columnn,columns_categorical))

feature_column ='Heart Disease Status'

columns_categorical = list(filter(lambda x:x!=feature_column,columns_categorical))

print(columns_categorical)

['Gender', 'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL

dataframe[columns_categorical].isnull().sum()

Gender 19
Exercise Habits 25
Smoking 25
Family Heart Disease 21
Diabetes 30
High Blood Pressure 26
Low HDL Cholesterol 25
High LDL Cholesterol 26
Stress Level 22
Sugar Consumption 30
dtype: int64

# Most-frequent imputation for the categorical columns (fill_value is only used by strategy='constant')
imputer_most_freq = SimpleImputer(strategy='most_frequent')

column_trans_cat = ColumnTransformer(
    transformers=[
        ('impute_gender', imputer_most_freq, ['Gender']),
        ('impute_exercise', imputer_most_freq, ['Exercise Habits']),
        ('impute_smoking', imputer_most_freq, ['Smoking']),
        ('impute_f_h_d', imputer_most_freq, ['Family Heart Disease']),
        ('impute_Diabetes', imputer_most_freq, ['Diabetes']),
        ('impute_h_b_p', imputer_most_freq, ['High Blood Pressure']),
        ('impute_l_hdl_c', imputer_most_freq, ['Low HDL Cholesterol']),
        ('impute_h_ldl_c', imputer_most_freq, ['High LDL Cholesterol']),
        ('impute_stress_level', imputer_most_freq, ['Stress Level']),
        ('impute_sugar_consumption', imputer_most_freq, ['Sugar Consumption'])
    ],
    remainder='passthrough'
)

new_name_columns = columns_categorical+columns_numerical
new_name_columns.append(feature_column)

new_name_columns

['Gender',
'Exercise Habits',
'Smoking',
'Family Heart Disease',
'Diabetes',
'High Blood Pressure',
'Low HDL Cholesterol',
'High LDL Cholesterol',
'Stress Level',
'Sugar Consumption',
'Age',
'Blood Pressure',
'Cholesterol Level',
'Triglyceride Level',
'Fasting Blood Sugar',
'Heart Disease Status']
df_transformed=column_trans_cat.fit_transform(dataframe)
df_cat_imputed = pd.DataFrame(df_transformed,columns=new_name_columns)
df_cat_imputed.head()

   Gender  Exercise Habits  Smoking  Family Heart Disease  Diabetes  High Blood Pressure  Low HDL Cholesterol  High LDL Cholesterol  Stress Level  Sugar Consumption  Age   Blood Pressure  Cholesterol Level
0  Male    High             Yes      Yes                   No        Yes                  Yes                  No                    Medium        Medium             56.0  153.0           155.0
1  Female  High             No       Yes                   Yes       No                   Yes                  No                    High          Medium             69.0  146.0           286.0
2  Male    Low              No       No                    No        No                   Yes                  Yes                   Low           Low                46.0  126.0           216.0
3  Female  High             Yes      Yes                   No        Yes                  No                   Yes                   High          High               32.0  122.0           293.0
4  Male    Low              Yes      Yes                   Yes       Yes                  No                   No                    High          High               60.0  166.0           242.0

df_imputed_cleaned = df_cat_imputed

Encode the data

df_imputed_cleaned[columns_categorical].head()

   Gender  Exercise Habits  Smoking  Family Heart Disease  Diabetes  High Blood Pressure  Low HDL Cholesterol  High LDL Cholesterol  Stress Level  Sugar Consumption
0  Male    High             Yes      Yes                   No        Yes                  Yes                  No                    Medium        Medium
1  Female  High             No       Yes                   Yes       No                   Yes                  No                    High          Medium
2  Male    Low              No       No                    No        No                   Yes                  Yes                   Low           Low
3  Female  High             Yes      Yes                   No        Yes                  No                   Yes                   High          High

df_imputed_cleaned[columns_categorical].columns

Index(['Gender', 'Exercise Habits', 'Smoking', 'Family Heart Disease',

'Diabetes', 'High Blood Pressure', 'Low HDL Cholesterol',
'High LDL Cholesterol', 'Stress Level', 'Sugar Consumption'],
dtype='object')

# Non-ordinal categorical columns

# Encode the label into 1/0: Male = 1, Female = 0
df_imputed_cleaned['Gender'] = (df_imputed_cleaned['Gender']=='Male').astype(int)

df_imputed_cleaned['Smoking'] = (df_imputed_cleaned['Smoking']=='Yes').astype(int)

df_imputed_cleaned['Family Heart Disease'] = (df_imputed_cleaned['Family Heart Disease']=='Yes').astype(int)

df_imputed_cleaned['Diabetes'] = (df_imputed_cleaned['Diabetes']=='Yes').astype(int)

df_imputed_cleaned['High Blood Pressure'] = (df_imputed_cleaned['High Blood Pressure']=='Yes').astype(int)

df_imputed_cleaned['Low HDL Cholesterol'] = (df_imputed_cleaned['Low HDL Cholesterol']=='Yes').astype(int)

df_imputed_cleaned['High LDL Cholesterol'] = (df_imputed_cleaned['High LDL Cholesterol']=='Yes').astype(int)

df_imputed_cleaned['Heart Disease Status'] = (df_imputed_cleaned['Heart Disease Status']=='Yes').astype(int)

df_imputed_cleaned['Exercise Habits'].unique()

array(['High', 'Low', 'Medium'], dtype=object)

df_imputed_cleaned['Stress Level'].unique()

array(['Medium', 'High', 'Low'], dtype=object)

df_imputed_cleaned['Sugar Consumption'].unique()

array(['Medium', 'Low', 'High'], dtype=object)

# Ordinal Categorical column

ordinal_encoder = OrdinalEncoder(categories=[['Low','Medium','High']])
df_imputed_cleaned['Exercise Habits'] = ordinal_encoder.fit_transform(df_imputed_cleaned[['Exercise Habits']])
df_imputed_cleaned['Stress Level'] = ordinal_encoder.fit_transform(df_imputed_cleaned[['Stress Level']])
df_imputed_cleaned['Sugar Consumption'] = ordinal_encoder.fit_transform(df_imputed_cleaned[['Sugar Consumption']])

df_imputed_cleaned.head()

   Gender  Exercise Habits  Smoking  Family Heart Disease  Diabetes  High Blood Pressure  Low HDL Cholesterol  High LDL Cholesterol  Stress Level  Sugar Consumption  Age   Blood Pressure  Cholesterol Level
0  1       2.0              1        1                     0         1                    1                    0                     1.0           1.0                56.0  153.0           155.0
1  0       2.0              0        1                     1         0                    1                    0                     2.0           1.0                69.0  146.0           286.0
2  1       0.0              0        0                     0         0                    1                    1                     0.0           0.0                46.0  126.0           216.0
3  0       2.0              1        1                     0         1                    0                    1                     2.0           2.0                32.0  122.0           293.0
4  1       0.0              1        1                     1         1                    0                    0                     2.0           2.0                60.0  166.0           242.0
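
The three ordinal columns share the same level order, so they can likely be encoded in one call by passing one category list per column. A minimal sketch on made-up rows (the `sample` frame, `ordinal_columns`, and `levels` names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Exercise Habits": ["High", "Low", "Medium"],
    "Stress Level": ["Medium", "High", "Low"],
    "Sugar Consumption": ["Medium", "Low", "High"],
})
ordinal_columns = list(sample.columns)
levels = ["Low", "Medium", "High"]  # encoded as Low = 0, Medium = 1, High = 2

# One category list per column keeps the intended order for all three columns
encoder = OrdinalEncoder(categories=[levels] * len(ordinal_columns))
encoded = sample.copy()
encoded[ordinal_columns] = encoder.fit_transform(sample[ordinal_columns])

print(encoded)
```

Stating the level order explicitly matters here: without `categories`, the encoder would sort the labels alphabetically (High, Low, Medium) and lose the ordering.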

Correlation matrix

correlation_matrix = df_imputed_cleaned.corr(method='pearson')
plt.figure(figsize=(20,20))
sns.heatmap(correlation_matrix,cmap='viridis_r',annot=True)
plt.show()
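
A 16 × 16 heatmap can be hard to read, so one way to summarize it is to rank the features by the absolute value of their Pearson correlation with the target. The sketch below uses a made-up frame (the `sample` values are assumptions and say nothing about the real data):

```python
import pandas as pd

# Made-up encoded rows for illustration only
sample = pd.DataFrame({
    "Age": [56, 69, 46, 32, 60, 25],
    "Smoking": [1, 0, 0, 1, 1, 1],
    "Heart Disease Status": [0, 0, 0, 0, 1, 1],
})
target = "Heart Disease Status"

correlation_matrix = sample.corr(method="pearson")

# Keep the target column of the matrix, drop the target's self-correlation, sort by strength
ranking = correlation_matrix[target].drop(target).abs().sort_values(ascending=False)
print(ranking)
```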

6 pages
Gearbox
100% (1)
Gearbox
5 pages
1brochure - Machine Learning PDF
No ratings yet
1brochure - Machine Learning PDF
5 pages
Bio-Signal Analysis For Smoking
No ratings yet
Bio-Signal Analysis For Smoking
1 page
Xe155ucr Spec
No ratings yet
Xe155ucr Spec
20 pages
Dear Sir/Madam,: IITH Campus Recruitment Program 2019-20
No ratings yet
Dear Sir/Madam,: IITH Campus Recruitment Program 2019-20
2 pages
04 Long Tables Cetking CAT DILR150 Frequently Repeated Questions
No ratings yet
04 Long Tables Cetking CAT DILR150 Frequently Repeated Questions
7 pages
A Dog Named Duke
No ratings yet
A Dog Named Duke
12 pages
How To Lower Your Cholesterol
From Everand
How To Lower Your Cholesterol
Jeannine Hill
No ratings yet
Journal of King Saud University - Computer and Information Sciences
No ratings yet
Journal of King Saud University - Computer and Information Sciences
23 pages
Avigilon Vertical Brochure - CriticalInfrastructure - ENG
No ratings yet
Avigilon Vertical Brochure - CriticalInfrastructure - ENG
6 pages
Nashik Car Deler List
No ratings yet
Nashik Car Deler List
8 pages
How to Have Naturally Healthy Cholesterol Levels: the best book on essentials on how to lower bad LDL & boost good HDL via foods/diet, medications, exercise & knowing cholesterol myths for clarity
From Everand
How to Have Naturally Healthy Cholesterol Levels: the best book on essentials on how to lower bad LDL & boost good HDL via foods/diet, medications, exercise & knowing cholesterol myths for clarity
Jessica Caplain
No ratings yet
Unit 3 Exam - Hands-On - Part 1
No ratings yet
Unit 3 Exam - Hands-On - Part 1
2 pages
Bid Evaluation Report - 23H00003
No ratings yet
Bid Evaluation Report - 23H00003
3 pages
Form Pelaporan Ukl Upl
No ratings yet
Form Pelaporan Ukl Upl
3 pages
Unit 3, Lesson 2
No ratings yet
Unit 3, Lesson 2
2 pages
Top 10 SSD Hard Disk in 2019
No ratings yet
Top 10 SSD Hard Disk in 2019
7 pages
American Choral Directors Association The Choral Journal
No ratings yet
American Choral Directors Association The Choral Journal
3 pages
CL - 4 - NSTSE-2024-Paper-1P204 Key-Updated
No ratings yet
CL - 4 - NSTSE-2024-Paper-1P204 Key-Updated
3 pages
CYCLOPENTANE
No ratings yet
CYCLOPENTANE
2 pages
Some, Any, Much, Many, A Lot Of, How Many, How Mu
No ratings yet
Some, Any, Much, Many, A Lot Of, How Many, How Mu
1 page
BlakeBlossomXXX OnlyFans Pictures & Videos Complete Siterip 3 Download
No ratings yet
BlakeBlossomXXX OnlyFans Pictures & Videos Complete Siterip 3 Download
1 page
Remote Viewing Dialogues Daz Smith PDF Download
No ratings yet
Remote Viewing Dialogues Daz Smith PDF Download
87 pages