05 AIHC Exp02


Vidyavardhini’s College of Engineering & Technology

Name: Durvesh Kajrekar


Class: BE/CSE-DS
Experiment No. 2
Perform Exploratory data analysis of Healthcare Data.
Date of Performance: 2/8/24
Date of Submission: 9/8/24

HAIMLSBL701 AI&ML in Healthcare Lab



Aim: Perform Exploratory data analysis of Healthcare Data.

Objective: The objective of this experiment is to perform exploratory data analysis on
healthcare data using Python NumPy functions.

Theory:
Exploratory data analysis (EDA) was promoted by John Tukey to encourage statisticians to
explore data and possibly formulate hypotheses that could lead to new data collection and
experiments. More narrowly, EDA checks the assumptions required for model fitting and
hypothesis testing, and it covers handling missing values and transforming variables as
needed.

EDA builds a robust understanding of the data and of issues associated with either the data or
the process that generated it. It is a scientific approach to getting the story of the data.

TYPES OF EXPLORATORY DATA ANALYSIS:


1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the simplest form of data analysis, because only one
variable is used to study the data. The standard goal of univariate non-graphical EDA is to
understand the underlying sample distribution and make observations about the population.
Outlier detection is also part of the analysis.

The characteristics of the population distribution include:

● Central tendency: The central tendency, or location, of a distribution concerns its
typical or middle values. The commonly used measures of central tendency are the
mean, the median, and sometimes the mode, of which the mean is the most common.
For a skewed distribution, or when there is concern about outliers, the median may be
preferred.
● Spread: Spread indicates how far from the center the data values tend to lie. The
standard deviation and the variance are two useful measures of spread. The variance
is the mean of the squared individual deviations, and the standard deviation is the
square root of the variance.
● Skewness and kurtosis: Two more useful univariate descriptors are the skewness and
kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a
more subtle measure of peakedness compared to a normal distribution.
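
The measures above can be computed directly with NumPy. The following is a minimal sketch, assuming a small made-up sample of blood pressure readings (the values and the variable names are hypothetical and are not taken from the lab dataset). It uses the population form of the variance (mean of squared deviations), matching the definition given above.

```python
import numpy as np

# Hypothetical sample of resting blood pressure readings (mmHg); illustrative only
values = np.array([118, 120, 122, 125, 130, 130, 135, 140, 145, 180], dtype=float)

# Central tendency
mean = values.mean()
median = np.median(values)

# Spread: variance is the mean of squared deviations, std is its square root
variance = values.var()
std_dev = np.sqrt(variance)

# Moment-based skewness and excess kurtosis (0 for a normal distribution)
z = (values - mean) / std_dev
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

print("mean:", mean, "median:", median)
print("variance:", variance, "std:", std_dev)
print("skewness:", skewness, "excess kurtosis:", excess_kurtosis)
```

Because the single large reading pulls the mean above the median, this sample is right-skewed, which is the situation where the median is the safer summary.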


2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are usually used to
show the relationship between two or more variables in the form of either cross-tabulation
or statistics.
For categorical data, an extension of tabulation called cross-tabulation is extremely useful.
For two variables, cross-tabulation is performed by making a two-way table with column
headings that match the levels of one variable and row headings that match the levels of
the other variable, then filling in the counts of all subjects that share the same pair of
levels.
For one categorical variable and one quantitative variable, we compute statistics for the
quantitative variable separately for each level of the categorical variable, then compare the
statistics across the levels of the categorical variable.
Comparing the means is an informal version of one-way ANOVA, and comparing medians is a
robust version of one-way ANOVA.
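
Both ideas can be sketched with pandas. The example below assumes a tiny made-up table of patient records (the column names `sex`, `smoker`, and `chol` and all values are hypothetical). It builds a two-way table of counts with `pd.crosstab` and then summarizes a quantitative variable per level of a categorical one with `groupby`.

```python
import pandas as pd

# Hypothetical patient records; column names and values are illustrative only
records = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes", "yes"],
    "chol": [240, 200, 210, 190, 260, 230],
})

# Cross-tabulation: counts of subjects sharing each pair of levels
table = pd.crosstab(records["sex"], records["smoker"])

# Statistics of a quantitative variable for each level of a categorical variable
chol_by_sex = records.groupby("sex")["chol"].agg(["mean", "median"])

print(table)
print(chol_by_sex)
```

Comparing the `mean` column across rows is the informal counterpart of ANOVA described above, and comparing the `median` column is the more outlier-resistant version.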

3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do
not give the complete picture of the data; therefore graphical methods, which involve a
degree of subjective analysis, are also required. Common types of univariate graphics are:
● Histogram: The most basic graph is the histogram, a bar plot in which each bar
represents the frequency (count) or proportion (count/total count) of cases for a
range of values. Histograms are one of the simplest ways to quickly learn a lot about
your data, including central tendency, spread, modality, shape, and outliers.
● Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
● Box plots: Another very useful univariate graphical technique is the boxplot. Boxplots
are excellent at presenting information about central tendency and show robust
measures of location and spread, as well as providing information about symmetry and
outliers, although they can be misleading about aspects such as multimodality. One
of the best uses of boxplots is in the form of side-by-side boxplots.
● Quantile-normal plots: The final univariate graphical EDA technique is the most
intricate. It is called the quantile-normal (QN) plot or, more generally, the
quantile-quantile (QQ) plot. It is used to see how well a particular sample follows a
particular theoretical distribution. It allows detection of non-normality and diagnosis
of skewness and kurtosis.
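
Three of the plots listed above can be sketched side by side with Matplotlib. This is only an illustration under stated assumptions: the `ages` sample is synthetic (drawn from a normal distribution), the output file name is hypothetical, and the quantile-normal plot is built by hand from the standard library's `NormalDist` rather than with a dedicated plotting helper.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from statistics import NormalDist

# Synthetic ages; illustrative only, not the lab dataset
rng = np.random.default_rng(0)
ages = rng.normal(loc=54, scale=9, size=200)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: frequency of cases per range of values
axes[0].hist(ages, bins=15)
axes[0].set_title("Histogram")

# Box plot: median, quartiles, and points flagged as outliers
axes[1].boxplot(ages)
axes[1].set_title("Box plot")

# Quantile-normal plot: sorted sample against standard normal quantiles
n = len(ages)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
sample = np.sort(ages)
axes[2].scatter(theoretical, sample, s=10)
axes[2].set_title("Quantile-normal plot")

fig.tight_layout()
fig.savefig("univariate_eda.png")  # hypothetical output file name
```

If the points in the third panel fall close to a straight line, the sample is roughly normal; systematic curvature at the ends suggests skewness or heavy tails.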
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The most commonly used is a grouped bar plot, with each
group representing one level of one of the variables and each bar within a group
representing a level of the other variable.
Other common types of multivariate graphics are:
● Scatterplot: For two quantitative variables, the basic graphical EDA technique is the
scatter plot, which has one variable on the x-axis, the other on the y-axis, and a
point for each case in your dataset.
● Run chart: It’s a line graph of data plotted over time.
● Heat map: It’s a graphical representation of data where values are depicted by color.
● Multivariate chart: It’s a graphical representation of the relationships between factors
and response.
● Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.

In a nutshell: You should always perform appropriate EDA before further analysis of your
data. Perform whatever steps are necessary to become more familiar with your data, check
for obvious mistakes, learn about variable distributions, and study relationships
between variables. EDA is not an exact science, but it is very important.

TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:


Some of the most common tools used to create an EDA are:
1. R: An open-source programming language and free software environment for statistical
computing and graphics, supported by the R Foundation for Statistical Computing. The R
language is widely used among statisticians for making statistical observations and
analyzing data.
2. Python: An interpreted, object-oriented programming language with dynamic semantics.
Its high-level built-in data structures, combined with dynamic binding, make it very
attractive for rapid application development, as well as for use as a scripting or glue
language to connect existing components. Python is often used in EDA to identify
missing values in a data set, which is important so that you can decide how to handle
missing values for machine learning.
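
As a minimal sketch of that missing-value check, assuming a small made-up table (the column names and values are hypothetical), pandas can count the missing entries per column and then fill them with each column's median, which is one common choice among several:

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps; illustrative only
df_demo = pd.DataFrame({
    "age": [63, 45, np.nan, 52],
    "chol": [233, np.nan, 250, np.nan],
    "sex": [1, 0, 1, 1],
})

# Count missing values per column
missing_counts = df_demo.isnull().sum()

# One possible handling strategy: fill each numeric column with its median
df_imputed = df_demo.fillna(df_demo.median(numeric_only=True))

print(missing_counts)
print(df_imputed)
```

Median imputation is robust to outliers but shrinks the spread of the filled column, so dropping rows or a model-based imputation may be preferable depending on how much data is missing.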

Apart from the functions described above, EDA can also involve:

K-means clustering: an unsupervised learning algorithm in which the data points are
assigned to k clusters (groups). K-means clustering is commonly used in market
segmentation, image compression, and pattern recognition.
Predictive models: EDA is often used alongside predictive models such as linear
regression, which are used to predict outcomes.
Univariate, bivariate, and multivariate visualization: used for summary statistics,
establishing relationships between variables, and understanding how different fields
within the data interact with one another.
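
To make the k-means idea concrete, here is a minimal NumPy sketch of the standard assign-then-update loop (Lloyd's algorithm). It is an illustration under stated assumptions, not part of the experiment's code: the function name `kmeans`, its parameters, and the six (age, cholesterol) points are all hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Cluster points into k groups by alternating assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1), centroids

# Hypothetical (age, cholesterol) pairs forming two well-separated groups
points = np.array([[50, 200], [52, 210], [48, 205],
                   [70, 300], [72, 310], [68, 295]], dtype=float)
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In practice the features would usually be scaled first, since a variable with a larger numeric range (here cholesterol) otherwise dominates the distance.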


Code: -
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and take a first look at it
df = pd.read_csv('heart.csv')
print("Initial Data:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Description:")
print(df.describe())

# Fill missing numeric values with the column median, then drop duplicate rows
df_filled = df.fillna(df.median(numeric_only=True))
df_no_duplicates = df_filled.drop_duplicates()

# Keep only rows whose numeric values all lie within 3 standard deviations of the mean
numeric_cols = df_no_duplicates.select_dtypes(include=[np.number]).columns
z_scores = np.abs(stats.zscore(df_no_duplicates[numeric_cols]))
# .copy() so that the new columns added below do not modify a view of the original frame
df_no_outliers = df_no_duplicates[(z_scores < 3).all(axis=1)].copy()

# Standardize (zero mean, unit variance) and min-max normalize (range 0 to 1)
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_no_outliers[numeric_cols]),
                         columns=numeric_cols)
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(min_max_scaler.fit_transform(df_no_outliers[numeric_cols]),
                             columns=numeric_cols)

# Bin age into groups; right=False makes the bins [20, 40), [40, 60), [60, 80) to match the labels
df_no_outliers['age_group'] = pd.cut(df_no_outliers['age'], bins=[20, 40, 60, 80],
                                     labels=['20-39', '40-59', '60-79'], right=False)
# Replace the numeric sex codes with readable labels, then one-hot encode them
df_no_outliers['sex'] = df_no_outliers['sex'].map({0: 'female', 1: 'male'})
df_encoded = pd.get_dummies(df_no_outliers, columns=['sex'])

# Univariate: distribution of age
plt.figure(figsize=(10, 6))
sns.histplot(df_no_outliers['age'], kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Side-by-side boxplots of age for each sex
plt.figure(figsize=(10, 6))
sns.boxplot(x='sex', y='age', data=df_no_outliers)
plt.title('Boxplot of Age by Sex')
plt.xlabel('Sex')
plt.ylabel('Age')
plt.show()

# Pairwise scatter plots of the quantitative variables, colored by sex
# ('sex_male' exists only in df_encoded, so the 'sex' column is used as the hue here)
sns.pairplot(df_no_outliers, vars=['age', 'trestbps', 'chol', 'thalach', 'oldpeak'],
             hue='sex')
plt.show()

# Correlation heat map; numeric_only=True skips the text and categorical columns
correlation_matrix = df_no_outliers.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

print("\nProcessed Data (first few rows):")
print(df_no_outliers.head())
print("\nEncoded Data (first few rows):")
print(df_encoded.head())

Output:


Google Colaboratory Link: -
https://colab.research.google.com/drive/1PFQ_65TEJMjPtDC6UKmoZM8puWpw_Td5?usp=sharing

Conclusion: -
Q. Comment on the importance of EDA. After using your Healthcare related dataset, what
observations did you make about the data?
Exploratory Data Analysis (EDA) is crucial for understanding a dataset's structure, ensuring
data quality, and uncovering patterns. It identifies missing values, outliers, and relationships
between variables, which guides cleaning, feature engineering, and modeling choices. In the
healthcare dataset, EDA revealed a typical age distribution, differences in age by sex
suggesting varying disease risk profiles, and important feature relationships such as between
`age`, `chol`, and `thalach`. Additionally, correlations between features like `chol` and `age`
provided insights into how variables interact, which is essential for understanding disease risk
and making informed decisions.
