Module 1 - 2 - EDA


Dr. Abhishek Bhatt, Faculty, Centre for AI, MITS-DU, Gwal


Exploratory data analysis:
The graphical summaries
❖The set of observations is called a dataset.

❖By exploring the dataset we can gain insight into what probability model suits the
phenomenon.

❖To graphically represent univariate datasets, consisting of repeated measurements of
one particular quantity, we discuss the classical histogram, the more recently introduced
kernel density estimates, and the empirical distribution function.

❖To represent a bivariate dataset, which consists of repeated measurements of two
quantities, we use the scatterplot.
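
Two of the univariate summaries named above, the histogram and the empirical distribution function, can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the function names `ecdf` and `histogram_counts` and the sample values are made up, and equal-width bins are an assumed choice.

```python
from bisect import bisect_right


def ecdf(data):
    """Return F, where F(x) is the fraction of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


def histogram_counts(data, n_bins):
    """Count observations in n_bins equal-width bins spanning [min, max].

    Assumes the data are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum sits on the right edge, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical repeated measurements of one quantity.
sample = [2.1, 2.4, 2.4, 3.0, 3.3, 3.9, 4.2, 4.8, 5.0, 6.5]

F = ecdf(sample)
print("F(3.0) =", F(3.0))
print("bin counts:", histogram_counts(sample, 4))
```

The bin counts are what a histogram draws as bar heights; the function `F` is the step function plotted as the empirical distribution function.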

Exploratory vs Confirmatory Data Analysis
EDA
• No hypothesis at first
• Generate hypothesis
• Uses graphical methods (mostly)

CDA
• Start with hypothesis
• Test the null hypothesis
• Uses statistical models

Exploratory data analysis (EDA):
What is EDA?
• It involves analyzing and visualizing data to understand its key
characteristics, uncover patterns, and identify relationships between variables.
• It refers to the method of studying and exploring datasets to understand
their main characteristics, discover patterns, locate outliers, and identify
relationships between variables.
• EDA is normally carried out as a preliminary step before
undertaking more formal statistical analyses or modeling.
Key aspects of EDA include:
• Distribution of Data: Understand each variable's range, central tendency
(mean, median), and dispersion (variance, standard deviation).
• Graphical Representations: Visualize relationships in the data.
• Outlier Detection: Identify unusual values.
• Correlation Analysis: Find the relationships between variables to
understand how they might affect each other.
• Handling Missing Values: Apply imputation or removal, depending
on the impact of the missing values.
• Summary Statistics: Give insight into data trends.
• Testing Assumptions: Check that the data meet the conditions that
later statistical models require.
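
The choice between removal and imputation can be sketched as follows. This is an illustrative sketch only: missing entries are assumed to be encoded as `None`, median imputation is just one possible strategy, and the helper names and the `ages` values are hypothetical.

```python
from statistics import median


def drop_missing(values):
    """Removal: keep only the observed (non-None) entries."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Imputation: replace each None with the median of the observed entries."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [23, None, 31, 27, None, 45]

# The share of missing values helps judge their impact before choosing a strategy.
missing_share = ages.count(None) / len(ages)
print("share missing:", round(missing_share, 2))
print("after removal:", drop_missing(ages))
print("after imputation:", impute_median(ages))
```

Removal shrinks the dataset, while imputation keeps its size but adds values that were never observed, so the share of missing entries is worth checking first.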
Why Is Exploratory Data Analysis Important?
• Understanding Data Structures: The dataset, its features, and its key aspects
• Especially important in the context of statistical modelling
• Identifying Patterns and Relationships:
• Detecting Anomalies and Outliers: Identifying errors or unusual data
points that may adversely affect the results of your analysis.
• Testing Assumptions: If the assumptions do not hold, the conclusions
drawn from the model could be invalid.
• Informing Feature Selection and Engineering: Which features are
most relevant to include in a model and how to transform them
(scaling, encoding) to improve model performance.
• Optimizing Model Design: Decide on the complexity of the model,
and better tune model parameters
• Facilitating Data Cleaning: Spotting missing values and errors in the
data.
• Enhancing Communication: Visual and statistical summaries make the
findings easy to understand for people without technical backgrounds.
Types of Exploratory Data Analysis
• Univariate
• Bivariate
• Multivariate
Univariate
• Histograms: Used to visualize the distribution of a variable.
• Box plots: Useful for detecting outliers and understanding the
spread and skewness of the data.
• Bar charts: Employed for categorical data to show the
frequency of each category.
• Summary statistics: Calculations like mean, median, mode,
variance, and standard deviation that describe the central
tendency and dispersion of the data.
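
The summary statistics listed above can be computed with Python's built-in `statistics` module. This is a small sketch on made-up values: the `summarize` helper is hypothetical, the variance and standard deviation are the sample versions (divisor n - 1), and the quartiles use the module's default "exclusive" method, which is only one of several conventions.

```python
from statistics import mean, median, mode, quantiles, stdev, variance


def summarize(data):
    """Collect central tendency, dispersion, and quartiles for one variable."""
    q1, q2, q3 = quantiles(data, n=4)  # default method="exclusive"
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "variance": variance(data),  # sample variance, divisor n - 1
        "std_dev": stdev(data),      # square root of the sample variance
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,              # the box length in a box plot
    }


# Hypothetical univariate dataset.
values = [2, 4, 4, 5, 7, 9, 11]

for name, value in summarize(values).items():
    print(f"{name}: {value}")
```

The quartiles and the interquartile range are the quantities a box plot draws, which is why they sit next to the mean and variance here.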
Bivariate
• Scatter Plots: These are one of the most common tools used in bivariate analysis. A
scatter plot helps visualize the relationship between two continuous variables.
• Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient
for linear relationships) quantifies the degree to which two variables are related.
• Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze
the relationship between two categorical variables. It shows the frequency distribution
of categories of one variable in rows and the other in columns, which helps in
understanding the relationship between the two variables.
• Line Graphs: In the context of time series data, line graphs can be used to compare two
variables over time. This helps in identifying trends, cycles, or patterns that emerge in
the interaction of the variables over the specified period.
• Covariance: Covariance is a measure used to determine how much two random variables
change together. However, it is sensitive to the scale of the variables, so it’s often
supplemented by the correlation coefficient for a more standardized assessment of the
relationship.
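
The scale sensitivity of covariance, and the way the correlation coefficient removes it, can be checked with a short sketch. The data are invented, the helper names are hypothetical, and the sample (n - 1) form of covariance is assumed.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical paired measurements.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 58, 66, 75, 80]

# The same heights expressed in centimetres instead of metres.
heights_cm = [h * 100 for h in heights_m]

print("cov (m, kg): ", covariance(heights_m, weights_kg))
print("cov (cm, kg):", covariance(heights_cm, weights_kg))
print("r   (m, kg): ", pearson(heights_m, weights_kg))
print("r   (cm, kg):", pearson(heights_cm, weights_kg))
```

Changing the unit multiplies the covariance by 100 but leaves the correlation coefficient unchanged, which is why the latter is the more standardized measure.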
Multivariate
• Pair plots: Visualize relationships across several variables
simultaneously to capture a comprehensive view of potential
interactions.
• Principal Component Analysis (PCA): A technique that reduces the
dimensionality of large datasets while preserving as much variance
as possible.
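
The phrase "preserving as much variance as possible" can be made concrete for the smallest case, two variables, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is a teaching sketch under that two-variable assumption, with invented data and a hypothetical function name; real datasets with many variables would normally go through a numerical library.

```python
from math import sqrt


def explained_variance_2d(xs, ys):
    """Share of total variance captured by each principal component (2 variables).

    Assumes the data are not constant in both variables (total variance > 0).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: centre +/- radius.
    centre = (sxx + syy) / 2
    radius = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = centre + radius, centre - radius
    total = largest + smallest  # equals sxx + syy, the total variance
    return largest / total, smallest / total


# Hypothetical, strongly related pair of variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

first, second = explained_variance_2d(xs, ys)
print(f"PC1 explains {first:.1%}, PC2 explains {second:.1%}")
```

When the first component explains nearly all of the variance, keeping it alone reduces two dimensions to one with little loss.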
Steps for Performing Exploratory Data Analysis
Steps
• Step 1: Understand the Problem and the Data: Know the problem being addressed
• Step 2: Import and Inspect the Data: Structure, variable types
• Step 3: Handle Missing Data: Noisy, NA, and missing values
• Step 4: Explore Data Characteristics: Statistical description (mean,
mode, variance, skewness, kurtosis, etc.)
• Step 5: Perform Data Transformation: Scaling, normalizing,
aggregation, encoding
• Step 6: Visualize Data Relationships: Frequency tables, charts,
plots, correlation matrix
• Step 7: Handle Outliers: Z-score, IQR, etc.
• Step 8: Communicate Findings and Insights: Patterns and critical
analysis of results
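
Steps 5 and 7 can be sketched in a few lines of standard-library Python. The readings are invented and the function names are hypothetical. The IQR multiplier 1.5 is the usual rule of thumb, and the z-score cut-off of 2.5 is an assumption chosen for this tiny sample (3 is the more common choice for larger datasets).

```python
from statistics import mean, quantiles, stdev


def zscore_outliers(data, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold std devs."""
    centre, spread = mean(data), stdev(data)
    return [x for x in data if abs(x - centre) / spread > threshold]


def iqr_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def min_max_scale(data):
    """Rescale values linearly to [0, 1]; assumes max(data) > min(data)."""
    low, high = min(data), max(data)
    return [(x - low) / (high - low) for x in data]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]

print("z-score outliers:", zscore_outliers(readings, threshold=2.5))
print("IQR outliers:    ", iqr_outliers(readings))

# Scaling after removing the flagged values keeps them from squeezing
# the remaining data into a narrow band near 0.
flagged = iqr_outliers(readings)
cleaned = [x for x in readings if x not in flagged]
print("scaled:", [round(v, 2) for v in min_max_scale(cleaned)])
```

The order matters in this sketch: min-max scaling uses the extremes of the data, so a single outlier left in place would dominate the scaled range.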
