Exploratory Data Analysis of Heart Disease Dataset 1737826105
Heart disease or Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood vessels. Cardiovascular diseases are the
leading cause of death globally. This is true in all areas of the world except Africa. Together CVD resulted in 17.9 million deaths (32.1%) in
2015. Deaths, at a given age, from CVD are more common and have been increasing in much of the developing world, while rates have
declined in most of the developed world since the 1970s.
So, in this kernel, I have conducted Exploratory Data Analysis (EDA) of the heart disease dataset. EDA is a
critical first step in analyzing a new dataset. Its primary objective is to examine the data for distribution, outliers and anomalies,
which enables us to direct specific testing of hypotheses. It includes analysing the data to find its distribution and main
characteristics, identifying patterns and creating visualizations. It also provides tools for hypothesis generation by visualizing and understanding the
data through graphical representation.
Table of Contents
The table of contents for this project is as follows:
1. Introduction to EDA
2. Objectives of EDA
3. Types of EDA
4. Import libraries
5. Import dataset
6. Exploratory data analysis
localhost:8888/doc/tree/extensive-analysis-visualization-with-python.ipynb 1/50
1/24/25, 9:47 AM extensive-analysis-visualization-with-python
1. Introduction to EDA
Several questions come to mind when we come across a new dataset. The list below sheds light on some of these questions:
• Are there any missing numerical values, outliers or anomalies in the dataset?
• How can we be sure that our dataset is ready for input to a machine learning algorithm?
The answer is Exploratory Data Analysis. It enables us to answer all of the above questions.
2. Objectives of EDA
ii. Check for missing numerical values, outliers or other anomalies in the dataset.
3. Types of EDA
EDA is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate
or multivariate (usually bivariate). The non-graphical methods provide insight into the characteristics and the distribution of the variable(s) of
interest. So, non-graphical methods involve calculation of summary statistics while graphical methods include summarizing the data
diagrammatically.
There are four types of exploratory data analysis (EDA) based on the above cross-classification. Each of these types is
described below:
Side-by-Side Boxplots
Scatterplots
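As a rough illustration of the cross-classification above, the sketch below computes a univariate and two multivariate non-graphical summaries on a small made-up DataFrame. The column values are invented for illustration and are not rows from heart.csv. The graphical counterparts would be, for example, a histogram of one column and a scatterplot or side-by-side boxplots of two.

```python
import pandas as pd

# Toy data, invented for illustration only (not rows from heart.csv)
toy = pd.DataFrame({
    "age": [45, 52, 61, 38, 57, 66],
    "chol": [210, 245, 290, 180, 260, 300],
    "target": [1, 1, 0, 1, 0, 0],
})

# Univariate, non-graphical: summary statistics of a single variable
age_summary = toy["age"].describe()

# Bivariate, non-graphical: correlation between two numerical variables
age_chol_corr = toy["age"].corr(toy["chol"])

# Bivariate, non-graphical: mean of a numerical variable per category
mean_age_by_target = toy.groupby("target")["age"].mean()

print(age_summary)
print(round(age_chol_corr, 3))
print(mean_age_by_target)
```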
4. Import libraries
In [1]: # This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.
We can see that the input folder contains one input file named heart.csv .
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
I have imported the libraries. The next step is to import the datasets.
5. Import dataset
Back to Table of Contents
I will import the dataset with the usual pandas read_csv() function which is used to import CSV (Comma Separated Value) files.
The scene has been set up. Now let the actual fun begin.
Now, we can see that the dataset contains 303 instances and 14 variables.
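A minimal sketch of the loading step, assuming the file is read with pd.read_csv() and inspected with the shape attribute. To stay self-contained it reads a few made-up rows from an in-memory string. In the kernel the argument would be the path to heart.csv printed earlier (the exact path is not shown in this document), and the names csv_text and sample_df are hypothetical.

```python
import io
import pandas as pd

# Stand-in for heart.csv: three invented rows with a subset of the columns
csv_text = """age,sex,cp,trestbps,chol,target
63,1,3,145,233,1
41,0,1,130,204,1
67,1,0,160,286,0
"""

# In the kernel, pass the path to heart.csv instead of the StringIO object
sample_df = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)
print(sample_df.head())
```

On the full dataset the same shape check would report the 303 instances and 14 variables mentioned above.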
Out[6]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
Summary of dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
Dataset description
If we simply want to check the data type of each column, we can use the following command.
In [8]: df.dtypes
Same is the case with several other variables - fbs , exang and target .
fbs (fasting blood sugar) should be a categorical variable as it contains only 0 and 1 as values (1 = true; 0 = false). Because the
values are stored as the integers 0 and 1, its data type is given as int64.
exang (exercise induced angina) should also be a categorical variable as it contains only 0 and 1 as values (1 = yes; 0 = no). Again,
its data type is given as int64.
target should also be a categorical variable. It also contains only 0 and 1 as values, so its data type is given as int64.
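If we wanted pandas to treat these 0/1 columns as categorical, one option is to cast them with astype("category"). The sketch below uses an invented toy frame with the same encoding. It is an illustration of the idea, not a step the kernel itself performs.

```python
import pandas as pd

# Toy frame with the same 0/1 encoding as fbs, exang and target (values invented)
toy = pd.DataFrame({
    "fbs": [0, 1, 0, 0],
    "exang": [1, 0, 0, 1],
    "target": [1, 1, 0, 0],
})

binary_cols = ["fbs", "exang", "target"]

# astype("category") keeps the 0/1 values but marks the columns as categorical
toy_cat = toy.astype({col: "category" for col in binary_cols})

print(toy_cat.dtypes)
# describe() on categorical columns reports count, unique, top and freq
print(toy_cat.describe())
```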
Out[9]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 0.683168 0.966997 131.623762 246.264026 0.148515 0.528053 149.646865 0.326733 1.039604 1.399340
std 9.082101 0.466011 1.032052 17.538143 51.830751 0.356198 0.525860 22.905161 0.469794 1.161075 0.616226
min 29.000000 0.000000 0.000000 94.000000 126.000000 0.000000 0.000000 71.000000 0.000000 0.000000 0.000000
25% 47.500000 0.000000 0.000000 120.000000 211.000000 0.000000 0.000000 133.500000 0.000000 0.000000 1.000000
50% 55.000000 1.000000 1.000000 130.000000 240.000000 0.000000 1.000000 153.000000 0.000000 0.800000 1.000000
75% 61.000000 1.000000 2.000000 140.000000 274.500000 0.000000 1.000000 166.000000 1.000000 1.600000 2.000000
max 77.000000 1.000000 3.000000 200.000000 564.000000 1.000000 2.000000 202.000000 1.000000 6.200000 2.000000
If we want to view the statistical properties of character variables, we should run the following command -
df.describe(include=['object'])
If we want to view the statistical properties of all the variables, we should run the following command -
df.describe(include='all')
In [10]: df.columns
7. Univariate analysis
It is integer valued as it contains two integers 0 and 1 - (0 stands for absence of heart disease and 1 for presence of heart disease).
In [11]: df['target'].nunique()
Out[11]: 2
We can see that there are 2 unique values in the target variable.
In [12]: df['target'].unique()
Comment
So, the unique values are 1 and 0. (1 stands for presence of heart disease and 0 for absence of heart disease).
In [13]: df['target'].value_counts()
Out[13]: target
1 165
0 138
Name: count, dtype: int64
Comment
1 stands for presence of heart disease. So, there are 165 patients suffering from heart disease.
Similarly, 0 stands for absence of heart disease. So, there are 138 patients who do not have any heart disease.
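The counts above can also be expressed as shares of the 303 patients. The sketch below rebuilds the value_counts() result by hand from the two reported numbers; on the real DataFrame, df['target'].value_counts(normalize=True) should give the same shares directly.

```python
import pandas as pd

# Counts reported by df['target'].value_counts() above
target_counts = pd.Series({1: 165, 0: 138}, name="count")

# Share of each class among all patients
target_share = target_counts / target_counts.sum()

print(target_share.round(3))
```

So roughly 54.5% of the patients have heart disease and 45.5% do not, which means the two classes are fairly balanced.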
Interpretation
The above plot confirms the findings that -
There are 165 patients suffering from heart disease.
There are 138 patients who do not have any heart disease.
In [15]: df.groupby('sex')['target'].value_counts()
Comment
sex variable contains two integer values 1 and 0 : (1 = male; 0 = female).
target variable also contains two integer values 1 and 0 : (1 = Presence of heart disease; 0 = Absence of heart disease)
So, out of 96 females - 72 have heart disease and 24 do not have heart disease.
Similarly, out of 207 males - 93 have heart disease and 114 do not have heart disease.
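Raw counts are hard to compare because there are more than twice as many males as females. The sketch below turns the counts quoted above into per-sex rates. The small table is typed in by hand from those numbers; on the real DataFrame, pd.crosstab(df['sex'], df['target'], normalize='index') should produce the same proportions.

```python
import pandas as pd

# Counts stated in the comment above (sex: 1 = male, 0 = female)
counts = pd.DataFrame(
    {"disease": [72, 93], "no_disease": [24, 114]},
    index=pd.Index([0, 1], name="sex"),
)

# Row totals: 96 females and 207 males
totals = counts.sum(axis=1)

# Fraction of each sex with heart disease
disease_rate = counts["disease"] / totals

print(disease_rate.round(3))
```

This gives 75% for females (72 of 96) and about 44.9% for males (93 of 207).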
We can visualize the value counts of the sex variable wrt target as follows -
plt.show()
In [20]: f, ax = plt.subplots(figsize=(8, 6))
# The original run raised an AttributeError at the countplot call (traceback truncated in the source).
# Casting the integer hue column to str is a common workaround and is assumed here.
ax = sns.countplot(x="sex", hue="target", hue_order=["0", "1"], data=df.astype({"target": str}))
plt.show()
Interpretation
We can see that the values of target variable are plotted wrt sex : (1 = male; 0 = female).
target variable also contains two integer values 1 and 0 : (1 = Presence of heart disease; 0 = Absence of heart disease)
Out of 96 females - 72 have heart disease and 24 do not have heart disease.
Similarly, out of 207 males - 93 have heart disease and 114 do not have heart disease.
Comment
The above plot segregates the values of the target variable and plots them in two different columns labelled (sex = 0, sex = 1).
In [22]: f, ax = plt.subplots(figsize=(8, 6))
# The original run raised an AttributeError at the countplot call (traceback truncated in the source).
# Casting the integer hue column to str is a common workaround and is assumed here.
ax = sns.countplot(y="target", hue="sex", hue_order=["0", "1"], data=df.astype({"sex": str}))
plt.show()
Comment
I have visualized the target values distribution wrt sex .
We can follow the same principles and visualize the target values distribution wrt fbs (fasting blood sugar) and exang
(exercise induced angina) .
In [25]: f, ax = plt.subplots(figsize=(8, 6))
# The original run raised an AttributeError at the countplot call (traceback truncated in the source).
# Casting the integer hue column to str is a common workaround and is assumed here.
ax = sns.countplot(x="target", hue="fbs", hue_order=["0", "1"], data=df.astype({"fbs": str}))
plt.show()
In [26]: f, ax = plt.subplots(figsize=(8, 6))
# The original run raised an AttributeError at the countplot call (traceback truncated in the source).
# Casting the integer hue column to str is a common workaround and is assumed here.
ax = sns.countplot(x="target", hue="exang", hue_order=["0", "1"], data=df.astype({"exang": str}))
plt.show()
It is integer valued as it contains two integers 0 and 1 - (0 stands for absence of heart disease and 1 for presence of heart disease).
1 stands for presence of heart disease. So, there are 165 patients suffering from heart disease.
Similarly, 0 stands for absence of heart disease. So, there are 138 patients who do not have any heart disease.
There are 138 patients who do not have any heart disease.
Out of 96 females - 72 have heart disease and 24 do not have heart disease.
Similarly, out of 207 males - 93 have heart disease and 114 do not have heart disease.
8. Bivariate Analysis
The target variable is target . So, we should check how each attribute correlates with the target variable. We can do it as follows:-
In [28]: correlation['target'].sort_values(ascending=False)
When the correlation coefficient is close to +1, there is a strong positive correlation. We can see that no variable has a strong
positive correlation with the target variable.
When it is close to -1, there is a strong negative correlation. We can see that no variable has a strong
negative correlation with the target variable.
When it is close to 0, there is no linear correlation. So, there is no linear correlation between target and fbs .
We can see that the cp and thalach variables are mildly positively correlated with target variable. So, I will analyze the interaction
between these features and target variable.
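The correlation object used above is not defined in the surviving cells. A plausible reading is that it was produced by df.corr(), which is an assumption here. The sketch below shows that pattern on synthetic data, where the column names pos, neg and noise are invented so that the three cases (positive, negative, near zero) are easy to see.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration (not heart.csv)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": (x > 0).astype(int),                   # binary label derived from x
    "pos": x + rng.normal(scale=0.5, size=200),      # rises with x
    "neg": -x + rng.normal(scale=0.5, size=200),     # falls with x
    "noise": rng.normal(size=200),                   # unrelated to x
})

# Pearson correlation matrix; every entry lies in [-1, 1]
correlation = toy.corr()

# Correlation of every column with 'target', strongest positive first
target_corr = correlation["target"].sort_values(ascending=False)
print(target_corr)
```

The first entry is always target itself with a coefficient of 1, so the informative values start from the second row.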
Explore cp variable
In [ ]: df['cp'].nunique()
In [ ]: df['cp'].value_counts()
Comment
It can be seen that cp is a categorical variable and it contains 4 types of values - 0, 1, 2 and 3.
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(x="cp", data=df)
plt.show()
In [ ]: df.groupby('cp')['target'].value_counts()
Comment
cp variable contains four integer values 0, 1, 2 and 3.
target variable contains two integer values 1 and 0 : (1 = Presence of heart disease; 0 = Absence of heart disease)
So, the above analysis gives the counts of patients with and without heart disease for each value of the cp
variable.
We can visualize the value counts of the cp variable wrt target as follows -
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(x="cp", hue="target", data=df)
plt.show()
Interpretation
We can see that the values of target variable are plotted wrt cp .
target variable contains two integer values 1 and 0 : (1 = Presence of heart disease; 0 = Absence of heart disease)
In [ ]: df['thalach'].nunique()
So, the number of unique values in the thalach variable is 91. Hence, it is a numerical variable.
In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['thalach']
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(x, bins=10, kde=True, stat="density")
plt.show()
Comment
We can see that the thalach variable is slightly negatively skewed.
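Skewness can also be checked numerically rather than by eye. The sketch below uses Series.skew() on a handful of invented values with a long left tail; on the real data the equivalent call would be df['thalach'].skew(), whose value is not reported in this document.

```python
import pandas as pd

# Invented values with a long left tail, for illustration only
values = pd.Series([90, 120, 140, 150, 155, 160, 162, 165, 168, 170])

# Series.skew() returns the sample skewness; negative means a longer left tail
skewness = values.skew()

# For a left-skewed variable the mean usually sits below the median
print(round(skewness, 2), values.mean(), values.median())
```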
We can use Pandas series object to get an informative axis label as follows :
In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['thalach']
x = pd.Series(x, name="thalach variable")
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(x, bins=10, kde=True, stat="density")
plt.show()
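In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['thalach']
# Passing the data as y draws the histogram horizontally (replaces distplot's vertical=True)
ax = sns.histplot(y=x, bins=10, kde=True, stat="density")
plt.show()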
The KDE plot plots the density of observations on one axis with height along the other axis.
In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['thalach']
x = pd.Series(x, name="thalach variable")
ax = sns.kdeplot(x)
plt.show()
We can shade under the density curve and use a different color as follows:
In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['thalach']
x = pd.Series(x, name="thalach variable")
# fill=True replaces the deprecated shade=True
ax = sns.kdeplot(x, fill=True, color='r')
plt.show()
Histogram
A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of
observations that fall in each bin.
In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['thalach']
# histplot plus rugplot replaces the deprecated distplot(kde=False, rug=True)
ax = sns.histplot(x, bins=10)
sns.rugplot(x, ax=ax)
plt.show()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.stripplot(x="target", y="thalach", data=df)
plt.show()
Interpretation
We can see that those people suffering from heart disease (target = 1) have relatively higher heart rate (thalach) as compared to people
who are not suffering from heart disease (target = 0).
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.stripplot(x="target", y="thalach", data=df, jitter = 0.01)
plt.show()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x="target", y="thalach", data=df)
plt.show()
Interpretation
The above boxplot confirms our finding that people suffering from heart disease (target = 1) have relatively higher heart rate (thalach) as
compared to people who are not suffering from heart disease (target = 0).
There is no variable which has strong positive correlation with target variable.
There is no variable which has strong negative correlation with target variable.
The cp and thalach variables are mildly positively correlated with target variable.
The people suffering from heart disease (target = 1) have relatively higher heart rate (thalach) as compared to people who are not
suffering from heart disease (target = 0).
9. Multivariate analysis
The objective of the multivariate analysis is to discover patterns and relationships in the dataset.
I will use heat map and pair plot to discover the patterns and relationships in the dataset.
Heat Map
In [ ]: plt.figure(figsize=(16,12))
plt.title('Correlation Heatmap of Heart Disease Dataset')
a = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
a.set_xticklabels(a.get_xticklabels(), rotation=90)
a.set_yticklabels(a.get_yticklabels(), rotation=30)
plt.show()
Interpretation
From the above correlation heat map, we can conclude that :-
target and cp variable are mildly positively correlated (correlation coefficient = 0.43).
target and thalach variable are also mildly positively correlated (correlation coefficient = 0.42).
target and slope variable are weakly positively correlated (correlation coefficient = 0.35).
target and exang variable are mildly negatively correlated (correlation coefficient = -0.44).
target and oldpeak variable are also mildly negatively correlated (correlation coefficient = -0.43).
target and ca variable are weakly negatively correlated (correlation coefficient = -0.39).
target and thal variable are also weakly negatively correlated (correlation coefficient = -0.34).
Pair Plot
In [ ]: num_var = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'target' ]
sns.pairplot(df[num_var], kind='scatter', diag_kind='hist')
plt.show()
Comment
I have defined a variable num_var . Here age , trestbps , chol , thalach and oldpeak are numerical variables and target
is the categorical variable.
In [ ]: df['age'].nunique()
In [ ]: df['age'].describe()
Interpretation
The mean value of the age variable is 54.37 years.
Now, I will plot the distribution of age variable to view the statistical properties.
In [ ]: f, ax = plt.subplots(figsize=(10,6))
x = df['age']
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(x, bins=10, kde=True, stat="density")
plt.show()
Interpretation
The age variable distribution is approximately normal.
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.stripplot(x="target", y="age", data=df)
plt.show()
Interpretation
We can see that the people suffering from heart disease (target = 1) and people who are not suffering from heart disease (target = 0)
have comparable ages.
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x="target", y="age", data=df)
plt.show()
Interpretation
The above boxplot tells two different things :
The median age of the people who have heart disease is less than the median age of the people who do not have heart disease (the line inside each box marks the median, not the mean).
The dispersion or spread of age of the people who have heart disease is greater than the dispersion or spread of age of the people
who do not have heart disease.
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.scatterplot(x="age", y="trestbps", data=df)
plt.show()
Interpretation
The above scatter plot shows that there is no correlation between age and trestbps variable.
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.regplot(x="age", y="trestbps", data=df)
plt.show()
Interpretation
The above line shows that a linear regression model is not a good fit to the data.
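One way to put a number on how well a straight line fits is the coefficient of determination, which for a simple linear regression equals the squared Pearson correlation. The sketch below uses synthetic data with a deliberately weak linear relationship. The numbers are invented and are not the age and trestbps values from the dataset.

```python
import numpy as np

# Synthetic data, invented for illustration: y depends only weakly on x
rng = np.random.default_rng(1)
x = rng.uniform(30, 75, size=300)
y = 120 + 0.3 * x + rng.normal(scale=17, size=300)

# Least-squares line: y is approximated by slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson r; for simple linear regression, R^2 equals r squared
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

print(round(slope, 3), round(r, 3), round(r_squared, 3))
```

A small R^2 means the line explains only a small part of the variation in y, which is what a widely scattered regplot suggests visually.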
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.scatterplot(x="age", y="chol", data=df)
plt.show()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.regplot(x="age", y="chol", data=df)
plt.show()
Interpretation
The above plot confirms that there is a slightly positive correlation between age and chol variables.
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.scatterplot(x="chol", y = "thalach", data=df)
plt.show()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
ax = sns.regplot(x="chol", y="thalach", data=df)
plt.show()
Interpretation
The above plot shows that there is no correlation between chol and thalach variable.
None: None is a Python singleton object that is often used for missing data in Python code.
NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE
floating-point representation.
Below, I will list some useful commands to deal with missing values.
df.isnull()
The above command checks whether each cell in a dataframe contains missing values or not. If the cell contains a missing value, it returns True,
otherwise it returns False.
df.isnull().sum()
The above command returns the total number of missing values in each column in the dataframe.
df.isnull().sum().sum()
It returns the total number of missing values in the dataframe.
df.isnull().mean()
It returns the fraction of missing values in each column.
df.isnull().any()
It checks which columns have null values and which do not. Columns that have null values return True, and False otherwise.
df.isnull().any().any()
It returns a boolean value indicating whether the dataframe has missing values or not. If the dataframe contains missing values it returns True, and
False otherwise.
df.isnull().values.any()
It also returns a single boolean value indicating whether any value in the dataframe is missing.
df.isnull().values.sum()
It returns the total number of missing values in the dataframe.
df.isnull().sum()
Interpretation
We can see that there are no missing values in the dataset.
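Since heart.csv has no missing values, the commands listed earlier all return zeros or False on it. The sketch below runs them on a small invented frame that does contain gaps, so the different return shapes are visible.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing entries (values invented for illustration)
toy = pd.DataFrame({
    "age": [63, np.nan, 41],
    "chol": [233, 250, None],
    "sex": [1, 0, 1],
})

per_column = toy.isnull().sum()            # missing values in each column
total_missing = toy.isnull().sum().sum()   # missing values in the whole frame
share_missing = toy.isnull().mean()        # fraction missing in each column
has_missing = toy.isnull().any()           # True for columns with any missing value
any_missing = toy.isnull().values.any()    # single boolean for the whole frame

print(per_column)
print(total_missing, any_missing)
```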
We can use an assert statement to programmatically check that no missing, unexpected 0 or negative values are present.
An assert statement does nothing if the condition being tested is true and raises an AssertionError if it is false.
Asserts
assert pd.notnull(df).all().all()
Interpretation
The above two commands do not throw any error. Hence, it is confirmed that there are no missing or negative values in the dataset.
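Only the first of the two assert commands survives in this document. The sketch below pairs it with a non-negativity check written as (toy >= 0).all().all(), which is an assumption about what the second command looked like. It runs on an invented toy frame and also shows what a failing check does.

```python
import pandas as pd

# Toy frame without missing or negative values (invented for illustration)
toy = pd.DataFrame({"age": [63, 41, 57], "oldpeak": [2.3, 0.0, 1.4]})

# Passes silently: every cell is non-null
assert pd.notnull(toy).all().all()

# Passes silently: every cell is greater than or equal to zero
assert (toy >= 0).all().all()

# A failing check raises AssertionError; it is caught here to show the behaviour
toy_bad = toy.copy()
toy_bad.loc[0, "oldpeak"] = -1.0
try:
    assert (toy_bad >= 0).all().all()
    check_failed = False
except AssertionError:
    check_failed = True

print(check_failed)
```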
age variable
In [ ]: df['age'].describe()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=df["age"])
plt.show()
trestbps variable
In [ ]: df['trestbps'].describe()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=df["trestbps"])
plt.show()
chol variable
In [ ]: df['chol'].describe()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=df["chol"])
plt.show()
thalach variable
In [ ]: df['thalach'].describe()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=df["thalach"])
plt.show()
oldpeak variable
In [ ]: df['oldpeak'].describe()
In [ ]: f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=df["oldpeak"])
plt.show()
Findings
The age variable does not contain any outlier.
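A boxplot with seaborn's default settings draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies the same rule numerically. The helper name iqr_outliers and the sample values are invented for illustration and are not part of the kernel.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Invented values for illustration; 564 lies far above the rest
chol_like = pd.Series([211, 240, 274, 230, 250, 260, 225, 564])

outliers = iqr_outliers(chol_like)
print(outliers.tolist())
```

With the age quartiles from the summary table (Q1 = 47.5, Q3 = 61), the IQR is 13.5 and the fences are 27.25 and 81.25. The observed ages run from 29 to 77, which lies inside the fences and is consistent with the finding that age has no outliers.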
13. Conclusion
In this kernel, we have explored the heart disease dataset and implemented many of the strategies presented in the book
Think Stats - Exploratory Data Analysis in Python by Allen B Downey . The feature variable of interest is the target variable. We have
analyzed it alone and checked its interaction with other variables. We have also discussed how to detect missing data and outliers.
Thanks
14. References