Vidyavardhini’s College of Engineering & Technology

Name: Durvesh Kajrekar

Class: BE/CSE-DS
Experiment No. 2
Perform Exploratory data analysis of Healthcare Data.
Date of Performance: 2/8/24
Date of Submission: 9/8/24

HAIMLSBL701 AI&ML in Healthcare Lab

Aim: Perform Exploratory data analysis of Healthcare Data.

Objective: The objective of this experiment is to perform exploratory data analysis on
healthcare data using Python NumPy functions.

Theory:
Exploratory data analysis (EDA) was promoted by John Tukey to encourage statisticians to
explore data, and possibly formulate hypotheses that might lead to new data collection and
experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly
on checking the assumptions required for model fitting and hypothesis testing, handling
missing values, and transforming variables as needed.

EDA builds a robust understanding of the data and of the issues associated with either the
data or the process that produced it. It is a scientific approach to getting the story of the data.

TYPES OF EXPLORATORY DATA ANALYSIS:

1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the simplest form of data analysis, as it uses just one
variable to study the data. The standard goal of univariate non-graphical EDA is to understand
the underlying sample distribution of the data and make observations about the population.
Outlier detection is also part of the analysis.

The characteristics of population distribution include:

● Central tendency: The central tendency or location of a distribution has to do with its
typical or middle values. The commonly useful measures of central tendency are the
mean, the median, and sometimes the mode, of which the most common is the mean.
For skewed distributions, or when there is concern about outliers, the median may be
preferred.
● Spread: Spread is an indicator of how far from the center we are likely to find the data
values. The standard deviation and variance are two useful measures of spread. The
variance is the mean of the squared individual deviations, and the standard deviation
is the square root of the variance.
● Skewness and kurtosis: Two more useful univariate descriptors are the skewness and
kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a
more subtle measure of peakedness compared to a normal distribution.
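
The descriptors above can be sketched with NumPy and SciPy. This is a minimal illustration, not part of the original experiment: the `chol` array is made-up data with one deliberate outlier, chosen to show why the median can be preferred over the mean.

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol readings (mg/dl); 410 is a deliberate outlier
chol = np.array([180, 195, 200, 210, 215, 220, 230, 240, 250, 410])

mean = chol.mean()
median = np.median(chol)
variance = chol.var()                    # mean of the squared deviations from the mean
std_dev = chol.std()                     # square root of the variance
skewness = stats.skew(chol)              # positive when the right tail is longer
excess_kurtosis = stats.kurtosis(chol)   # Fisher definition: 0 for a normal distribution

print(f"mean={mean:.1f}, median={median:.1f}")
print(f"variance={variance:.1f}, std_dev={std_dev:.1f}")
print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

The single outlier pulls the mean above the median and makes the skewness positive, which is the situation in which the median is the safer summary.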

2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are usually used to
show the relationship between two or more variables in the form of either cross-tabulation
or statistics.
For categorical data, an extension of tabulation called cross-tabulation is extremely useful.
For two variables, cross-tabulation is performed by making a two-way table with column
headings that match the levels of one variable and row headings that match the levels of
the other variable, then filling in the counts of all subjects that share the same pair of
levels.
For one categorical variable and one quantitative variable, we compute statistics for the
quantitative variable separately for each level of the categorical variable, then compare the
statistics across the levels of the categorical variable.
Comparing the means is an informal version of ANOVA, and comparing medians is a
robust version of one-way ANOVA.
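
As a rough sketch of both ideas, the snippet below builds a two-way table with `pd.crosstab` and per-level statistics with `groupby`. The `patients` table and its column names are hypothetical and serve only to illustrate the technique.

```python
import pandas as pd

# Hypothetical patient records; column names are illustrative only
patients = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "disease": ["yes", "no", "yes", "yes", "no", "no"],
    "chol": [250, 200, 270, 230, 210, 190],
})

# Cross-tabulation: rows are the levels of sex, columns are the levels of disease,
# and each cell counts the subjects sharing that pair of levels
counts = pd.crosstab(patients["sex"], patients["disease"])

# Statistics of a quantitative variable, computed separately per level of a
# categorical variable so that the levels can be compared
chol_by_sex = patients.groupby("sex")["chol"].agg(["mean", "median"])

print(counts)
print(chol_by_sex)
```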

3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do not
give the complete picture of the data; therefore graphical methods, which involve a degree of
subjective analysis, are also required. Common types of univariate graphics are:
● Histogram: The most basic graph is a histogram, which is a bar plot in which each bar
represents the frequency (count) or proportion (count/total count) of cases for a range
of values. Histograms are one of the simplest ways to quickly learn a lot about your
data, including central tendency, spread, modality, shape and outliers.
● Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
● Box plots: Another very useful univariate graphical technique is the boxplot. Boxplots
are excellent at presenting information about central tendency and show robust
measures of location and spread, as well as providing information about symmetry and
outliers, although they can be misleading about aspects like multimodality. One of the
best uses of boxplots is in the form of side-by-side boxplots.
● Quantile-normal plots: The final univariate graphical EDA technique is the most
intricate. It is called the quantile-normal or QN plot, or more generally the
quantile-quantile or QQ plot. It is used to see how well a particular sample follows a
particular theoretical distribution. It allows detection of non-normality and diagnosis of
skewness and kurtosis.
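
The quantities these plots draw can be computed directly, which may help when reading them. The sketch below assumes a synthetic `ages` sample (not the experiment's data) and computes the bar heights of a histogram, the ingredients of a boxplot, and the point pairs of a quantile-normal plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ages = rng.normal(loc=54, scale=9, size=200)   # synthetic "age" sample

# Histogram: number of cases falling in each of 8 equal-width bins
counts, bin_edges = np.histogram(ages, bins=8)

# Boxplot ingredients: quartiles, plus the usual 1.5 * IQR fences for flagging outliers
q1, med, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# Quantile-normal plot: ordered sample values against theoretical normal quantiles.
# r close to 1 means the points lie near a straight line, i.e. roughly normal data.
(theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(ages, dist="norm")

print(counts)
print(q1, med, q3, len(outliers))
print(r)
```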
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The most commonly used is a grouped bar plot, with each
group representing one level of one of the variables and each bar within a group representing
the levels of the other variable.
Other common types of multivariate graphics are:
● Scatterplot: For two quantitative variables, the essential graphical EDA technique is
the scatterplot, which has one variable on the x-axis, one on the y-axis, and a point
for every case in your dataset.
● Run chart: A line graph of data plotted over time.
● Heat map: A graphical representation of data where values are depicted by color.
● Multivariate chart: A graphical representation of the relationships between factors
and a response.
● Bubble chart: A data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
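
A correlation matrix is the table of numbers that a correlation heat map colors, and each off-diagonal entry summarizes one scatterplot. The sketch below uses synthetic data; the relationship between `age` and `thalach` is an assumption built into the example, not a finding from the experiment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = rng.uniform(30, 75, size=100)
# Assumed relationship for illustration: maximum heart rate falls with age, plus noise
thalach = 220 - age + rng.normal(0, 8, size=100)
chol = rng.normal(240, 40, size=100)   # generated independently of age

data = pd.DataFrame({"age": age, "thalach": thalach, "chol": chol})

# Pearson correlation between every pair of columns (the pandas default)
corr = data.corr()
print(corr.round(2))
```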

In a nutshell: You should always perform appropriate EDA before further analysis of your
data. Perform whatever steps are necessary to become more familiar with your data, check
for obvious mistakes, learn about variable distributions, and study relationships
between variables. EDA is not an exact science, but it is very important!

TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:

Some of the most common tools used for EDA are:
1. R: An open-source programming language and free software environment for statistical
computing and graphics, supported by the R Foundation for Statistical Computing. The R
language is widely used among statisticians for developing statistical observations and data
analysis.
2. Python: An interpreted, object-oriented programming language with dynamic semantics.
Its high-level, built-in data structures, combined with dynamic binding, make it very
attractive for rapid application development, as well as for use as a scripting or glue
language to connect existing components together. Python and EDA are often used together
to spot missing values in the data set, which is vital so you can decide how to handle
missing values for machine learning.
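
A minimal sketch of spotting and handling missing values with pandas follows. The `records` table is made up for illustration, and median imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps (NaN marks a missing value)
records = pd.DataFrame({
    "age": [63, 45, np.nan, 58],
    "chol": [233, np.nan, np.nan, 250],
    "sex": [1, 0, 1, 1],
})

# Count the missing values in each column
missing_per_column = records.isna().sum()

# One possible handling: replace each gap with that column's median
filled = records.fillna(records.median(numeric_only=True))

print(missing_per_column)
print(filled)
```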

Apart from the functions described above, EDA can also:

Perform k-means clustering: This is an unsupervised learning algorithm in which the data
points are assigned to k clusters (groups). K-means clustering is commonly used in market
segmentation, image compression, and pattern recognition.
Support predictive models such as linear regression, which are used to predict outcomes.
Provide univariate, bivariate, and multivariate visualization for summary statistics,
establishing relationships between variables, and understanding how different fields
within the data interact with one another.
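
As an illustrative sketch of k-means, the snippet below clusters two synthetic, well-separated groups with scikit-learn's `KMeans`. The data, the choice of k = 2, and the feature meanings are assumptions made for the example; on real data the features would normally be scaled first.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of (age, chol) points, far apart so the clusters are unambiguous
group_a = rng.normal(loc=[40, 180], scale=[3, 10], size=(50, 2))
group_b = rng.normal(loc=[65, 280], scale=[3, 10], size=(50, 2))
points = np.vstack([group_a, group_b])

# Assign every point to one of k = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(kmeans.cluster_centers_)
print(np.bincount(labels))
```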

Code: -
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and take a first look
df = pd.read_csv('heart.csv')
print("Initial Data:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Description:")
print(df.describe())

# Fill missing numeric values with the column median, then drop duplicate rows
df_filled = df.fillna(df.median(numeric_only=True))
df_no_duplicates = df_filled.drop_duplicates()

# Keep only rows where every numeric column is within 3 standard deviations of its mean
numeric_cols = df_no_duplicates.select_dtypes(include=[np.number]).columns
z_scores = np.abs(stats.zscore(df_no_duplicates[numeric_cols]))
# .copy() avoids SettingWithCopyWarning when columns are added below
df_no_outliers = df_no_duplicates[(z_scores < 3).all(axis=1)].copy()

# Standardize (zero mean, unit variance) and min-max normalize (range [0, 1])
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_no_outliers[numeric_cols]),
                         columns=numeric_cols)
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(min_max_scaler.fit_transform(df_no_outliers[numeric_cols]),
                             columns=numeric_cols)

# Bin age; right=False gives [20, 40), [40, 60), [60, 80) so the bins match the labels
df_no_outliers['age_group'] = pd.cut(df_no_outliers['age'], bins=[20, 40, 60, 80],
                                     labels=['20-39', '40-59', '60-79'], right=False)
df_no_outliers['sex'] = df_no_outliers['sex'].map({0: 'female', 1: 'male'})
df_encoded = pd.get_dummies(df_no_outliers, columns=['sex'])

# Histogram of age with a kernel density estimate
plt.figure(figsize=(10, 6))
sns.histplot(df_no_outliers['age'], kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Side-by-side boxplots of age for each sex
plt.figure(figsize=(10, 6))
sns.boxplot(x='sex', y='age', data=df_no_outliers)
plt.title('Boxplot of Age by Sex')
plt.xlabel('Sex')
plt.ylabel('Age')
plt.show()

# Pairwise scatterplots colored by sex. 'sex_male' exists only in df_encoded,
# so the labelled 'sex' column of df_no_outliers is used as the hue instead.
sns.pairplot(df_no_outliers, vars=['age', 'trestbps', 'chol', 'thalach', 'oldpeak'],
             hue='sex')
plt.show()

# Correlation heat map; numeric_only=True skips the text 'sex' and
# categorical 'age_group' columns, which would otherwise raise an error
correlation_matrix = df_no_outliers.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

print("\nProcessed Data (first few rows):")
print(df_no_outliers.head())
print("\nEncoded Data (first few rows):")
print(df_encoded.head())

Output:

Google Colaboratory Link: -
https://colab.research.google.com/drive/1PFQ_65TEJMjPtDC6UKmoZM8puWpw_Td5?usp=sharing

Conclusion: -
Q. Comment on the importance of EDA. After using your Healthcare related dataset, what
observations did you make about the data?
Exploratory Data Analysis (EDA) is crucial for understanding a dataset's structure, ensuring
data quality, and uncovering patterns. It identifies missing values, outliers, and relationships
between variables, which guides cleaning, feature engineering, and modeling choices. In the
healthcare dataset, EDA revealed a typical age distribution, differences in age by sex
suggesting varying disease risk profiles, and important feature relationships such as between
`age`, `chol`, and `thalach`. Additionally, correlations between features like `chol` and `age`
provided insights into how variables interact, which is essential for understanding disease risk
and making informed decisions.
