05 AIHC Exp02
Theory:
Exploratory data analysis (EDA) was promoted by John Tukey to encourage statisticians to
explore data and possibly formulate hypotheses that could lead to new data collection and
experiments. In practice, EDA also covers checking the assumptions required for model fitting
and hypothesis testing, handling missing values, and transforming variables as needed.
EDA builds a robust understanding of the data and of issues associated with either the data
or the process that produced it. It is a scientific approach to getting the story of the data.
1. Univariate non-graphical: This is the simplest form of data analysis, as it uses just one
variable to study the data. The standard goal of univariate non-graphical EDA is to
understand the underlying sample distribution and make observations about the population.
Outlier detection is also part of the analysis.
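A minimal sketch of univariate non-graphical EDA on a single variable is given here. The values are hypothetical cholesterol readings made up for illustration (they are not taken from heart.csv), and the 1.5 * IQR rule is one common outlier convention, not the only one.

```python
import pandas as pd

# Hypothetical cholesterol readings (mg/dl); 564 is a deliberately extreme value.
chol = pd.Series([233, 250, 204, 236, 354, 192, 294, 263, 199, 564], name="chol")

# Centre, spread and shape of the sample distribution.
summary = chol.describe()
print(summary)
print("median:", chol.median())
print("skewness:", chol.skew())

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = chol.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = chol[(chol < q1 - 1.5 * iqr) | (chol > q3 + 1.5 * iqr)]
print("outliers:", outliers.tolist())
```

With these made-up numbers only the value 564 falls outside the fences, which shows how a single extreme reading can be found without drawing any plot.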
3. Univariate graphical: Non-graphical methods are quantitative and objective, but they
cannot give the complete picture of the data; graphical methods, which involve a degree of
subjective analysis, are therefore also required. Common types of univariate graphics are:
● Histogram: The most basic graph is the histogram, a barplot in which each bar
represents the frequency (count) or proportion (count/total count) of cases for a
range of values. Histograms are one of the simplest ways to quickly learn a lot about
your data, including central tendency, spread, modality, shape and outliers.
● Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
● Box plots: Another very useful univariate graphical technique is the boxplot. Boxplots
are excellent at presenting information about central tendency and show robust
measures of location and spread, as well as providing information about symmetry and
outliers, although they can be misleading about aspects like multimodality. One
of the best uses of boxplots is in the form of side-by-side boxplots.
● Quantile-normal plots: The final univariate graphical EDA technique is the most
intricate. It is called the quantile-normal or QN plot or, more generally, the
quantile-quantile or QQ plot. It is used to see how well a particular sample follows a
particular theoretical distribution. It allows detection of non-normality and diagnosis of
skewness and kurtosis.
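The idea behind the quantile-normal plot can be sketched numerically with scipy.stats.probplot, which pairs ordered sample values with theoretical normal quantiles. The two samples below are synthetic and generated only for illustration; the seed and sizes are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(loc=50, scale=10, size=500),  # roughly bell-shaped
    "skewed": rng.exponential(scale=10, size=500),     # long right tail
}

results = {}
for name, sample in samples.items():
    # probplot returns the theoretical and ordered sample quantiles, plus a
    # least-squares line through them; r close to 1 means the points lie
    # near a straight line, i.e. the sample looks normal.
    (theoretical_q, ordered_q), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    results[name] = r
    print(name,
          "r =", round(r, 3),
          "skewness =", round(stats.skew(sample), 3),
          "excess kurtosis =", round(stats.kurtosis(sample), 3))
```

To draw the plot itself, pass plot=plt (with matplotlib.pyplot imported as plt) to stats.probplot; points that bend away from the straight line indicate skewness or heavy tails.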
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The most commonly used one is a grouped
barplot, with each group representing one level of one of the variables and each bar within a
group representing a level of the other variable.
Other common types of multivariate graphics are:
● Scatterplot: For two quantitative variables, the basic graphical EDA technique is the
scatterplot, which has one variable on the x-axis, one on the y-axis, and a point for
every case in your dataset.
● Run chart: It’s a line graph of data plotted over time.
● Heat map: It’s a graphical representation of data where values are depicted by color.
● Multivariate chart: It’s a graphical representation of the relationships between factors
and response.
● Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a two-
dimensional plot.
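The grouped barplot described under multivariate graphical EDA can be sketched as follows. The toy table is hypothetical; its column names only mirror the sex and age_group columns used in the code for this experiment.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, made up for illustration.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "male"],
    "age_group": ["40-59", "40-59", "60-79", "60-79", "40-59", "60-79"],
})

# Rows become the groups (age_group); columns become the bars in each group (sex).
counts = pd.crosstab(toy["age_group"], toy["sex"])
print(counts)

# One cluster of bars per age group, one bar per sex within each cluster.
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("Age group")
ax.set_ylabel("Number of patients")
ax.set_title("Patients by age group and sex")
```

In a notebook the figure renders on its own; in a plain script, finish with plt.show(). The cross-tabulation is the non-graphical counterpart of the same chart, so printing it first is a cheap check on what the bars should look like.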
In a nutshell: You should always perform appropriate EDA before further analysis of your
data. Perform whatever steps are necessary to become more familiar with your data, check
for obvious mistakes, learn about variable distributions, and study relationships
between variables. EDA is not an exact science, but it is very important.
Code: -
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('heart.csv')
print("Initial Data:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Description:")
print(df.describe())
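# Fill missing numeric values with each column's median, then drop duplicate rows
df_filled = df.fillna(df.median(numeric_only=True))
df_no_duplicates = df_filled.drop_duplicates()
# Keep only rows whose numeric values all lie within 3 standard deviations of the mean
z_scores = np.abs(stats.zscore(df_no_duplicates.select_dtypes(include=[np.number])))
df_no_outliers = df_no_duplicates[(z_scores < 3).all(axis=1)].copy()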
# Standardize (zero mean, unit variance) and min-max scale (0 to 1) the numeric columns
numeric_df = df_no_outliers.select_dtypes(include=[np.number])
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(numeric_df),
                         columns=numeric_df.columns, index=numeric_df.index)
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(min_max_scaler.fit_transform(numeric_df),
                             columns=numeric_df.columns, index=numeric_df.index)
# right=False makes the bins [20, 40), [40, 60), [60, 80) so they match the labels
df_no_outliers['age_group'] = pd.cut(df_no_outliers['age'], bins=[20, 40, 60, 80],
                                     labels=['20-39', '40-59', '60-79'], right=False)
df_no_outliers['sex'] = df_no_outliers['sex'].map({0: 'female', 1: 'male'})
df_encoded = pd.get_dummies(df_no_outliers, columns=['sex'])
plt.figure(figsize=(10, 6))
sns.histplot(df_no_outliers['age'], kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x='sex', y='age', data=df_no_outliers)
plt.title('Boxplot of Age by Sex')
plt.xlabel('Sex')
plt.ylabel('Age')
plt.show()
# 'sex_male' exists only in df_encoded, so color by the 'sex' column of df_no_outliers
sns.pairplot(df_no_outliers, vars=['age', 'trestbps', 'chol', 'thalach', 'oldpeak'],
             hue='sex')
plt.show()
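# 'sex' and 'age_group' are now non-numeric, so restrict the correlation to numeric columns
correlation_matrix = df_no_outliers.corr(numeric_only=True)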
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
print("\nProcessed Data (first few rows):")
print(df_no_outliers.head())
print("\nEncoded Data (first few rows):")
print(df_encoded.head())
Output:
https://colab.research.google.com/drive/1PFQ_65TEJMjPtDC6UKmoZM8puWpw_Td5?usp=sharing
Conclusion: -
Q. Comment on the importance of EDA. After using your Healthcare related dataset, what
observations did you make about the data?
Exploratory Data Analysis (EDA) is crucial for understanding a dataset's structure, ensuring
data quality, and uncovering patterns. It identifies missing values, outliers, and relationships
between variables, which guides cleaning, feature engineering, and modeling choices. In the
healthcare dataset, the histogram showed the distribution of age, and the side-by-side boxplot
showed differences in age by sex, which may suggest varying disease risk profiles. The pairplot
and correlation heatmap highlighted relationships among features such as `age`, `chol`, and
`thalach`. These relationships show how variables interact, which is essential for
understanding disease risk and making informed decisions.