
BI-LEc 3

Uploaded by

Ayesha Asad
Exploratory Data Analysis (EDA)
Lecture 3 - Miss Ushna Tasleem

Introduction
• Exploratory Data Analysis (EDA) is the process of describing data using
statistical and visualization techniques to bring important aspects of
that data into focus for further analysis. This involves inspecting the
dataset from many angles, and describing and summarizing it without
making any assumptions about its contents.
• EDA is a significant step to take before diving into statistical modeling
or machine learning, to ensure the data is really what it is claimed to be
and that there are no obvious errors. It should be part of data science
projects in every organization.
Exploratory Data Analysis (EDA)
• EDA is like exploring a new place. You look around, observe things, and
try to understand what is going on. Similarly, in EDA you look at a
dataset, check out its different parts, and try to figure out what is
happening in the data.
• Data scientists use EDA to analyze and investigate data sets and
summarize their main characteristics, often employing data visualization
methods.
• It involves using statistics and visual tools to understand and summarize
data, helping data scientists and data analysts inspect the dataset from
various angles without making assumptions about its contents.
• The main purpose of EDA is to help look at data before making any
assumptions. It can help identify obvious errors, better understand
patterns within the data, detect outliers or anomalous events, and find
interesting relations among the variables.
Process
• Look at the Data: Gather information about the data, such as the number
of rows and columns and the type of information each column contains.
This includes understanding single variables and their distributions.
• Clean the Data: Fix issues like missing or incorrect values.
Preprocessing is essential to ensure the data is ready for analysis and
predictive modeling.
• Make Summaries: Summarize the data to get a general idea of its
contents, such as average values, common values, or value distributions.
Calculating quantiles and checking for skewness can provide insights into
the data's distribution.
• Visualize the Data: Use interactive charts and graphs to spot trends,
patterns, or anomalies. Bar plots, scatter plots, and other
visualizations help in understanding relationships between variables.
Python libraries like pandas, NumPy, Matplotlib, Seaborn, and Plotly are
commonly used for this purpose.
• Ask Questions: Formulate questions based on your observations, such as
why certain data points differ or whether there are relationships between
different parts of the data.
• Find Answers: Dig deeper into the data to answer these questions, which
may involve further analysis or creating models.
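The first three steps of the process (look, clean, summarize) can be sketched with pandas, one of the libraries named on the slide. This is only a minimal illustration: the `orders` table, its column names, and its values are invented for the example, and filling gaps with the median is just one of several possible cleaning choices.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
orders = pd.DataFrame({
    "region": ["North", "South", "North", "East", "North"],
    "units": [10, 3, None, 7, 5],
    "price": [2.5, 4.0, 2.5, 3.0, 4.0],
})

# Look at the data: size of the table and the type of each column.
n_rows, n_cols = orders.shape
column_types = orders.dtypes

# Clean the data: count missing values, then fill them with the column median.
missing_per_column = orders.isna().sum()
median_units = orders["units"].median()
cleaned = orders.fillna({"units": median_units})

# Make summaries: descriptive statistics and the most common category.
summary = cleaned["units"].describe()
most_common_region = cleaned["region"].mode()[0]

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(summary)
print(most_common_region)
```

Printing `summary` shows the count, mean, standard deviation, minimum, quartiles, and maximum in one table, which is usually enough to decide which questions to ask next.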
Types of exploratory data analysis
• There are four primary types of EDA:
• Univariate non-graphical: This is the simplest form of data analysis,
where the data being analyzed consists of just one variable. Since it is
a single variable, it does not deal with causes or relationships. The
main purpose of univariate analysis is to describe the data and find
patterns that exist within it.
• Univariate graphical: Non-graphical methods do not provide a full
picture of the data, so graphical methods are also required. Common types
of univariate graphics include:
  • Stem-and-leaf plots, which show all data values and the shape of the
  distribution.
  • Histograms, bar plots in which each bar represents the frequency
  (count) or proportion (count/total count) of cases for a range of
  values.
  • Box plots, which graphically depict the five-number summary of
  minimum, first quartile, median, third quartile, and maximum.
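The numbers behind a box plot and a histogram can be computed without drawing anything. The sketch below uses only the Python standard library; the `ages` sample is invented, and the "inclusive" quartile method and the 10-year bin width are assumptions chosen for the example (other quartile conventions give slightly different values).

```python
import statistics
from collections import Counter

# Hypothetical sample of one variable (customer ages); values are invented.
ages = [22, 25, 25, 31, 34, 38, 41, 45, 52, 67]

# Five-number summary: the values a box plot draws.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_number = (min(ages), q1, median, q3, max(ages))

# Histogram counts: frequency of values falling in each 10-year range.
bin_width = 10
counts = Counter((age // bin_width) * bin_width for age in ages)

print("five-number summary:", five_number)
for start in sorted(counts):
    # One '#' per observation gives a rough text histogram.
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```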
• Multivariate non-graphical: Multivariate data arises from more than one
variable. Multivariate non-graphical EDA techniques generally show the
relationship between two or more variables of the data through
cross-tabulation or statistics.
• Multivariate graphical: Multivariate graphical EDA uses graphics to
display relationships between two or more sets of data. The most used
graphic is a grouped bar plot or bar chart, with each group representing
one level of one of the variables and each bar within a group
representing the levels of the other variable.
Other common types of multivariate graphics include:
  • Scatter plot, which plots data points on a horizontal and a vertical
  axis to show how strongly one variable is related to another.
  • Multivariate chart, which is a graphical representation of the
  relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple
  circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are
  depicted by color.
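Cross-tabulation, the multivariate non-graphical technique mentioned above, amounts to counting how often each combination of levels occurs. A minimal standard-library sketch follows; the survey rows and the two variable names (segment, channel) are invented for illustration.

```python
from collections import Counter

# Hypothetical survey rows: (customer segment, preferred channel). Invented data.
rows = [
    ("new", "online"), ("new", "online"), ("new", "store"),
    ("returning", "store"), ("returning", "store"), ("returning", "online"),
    ("returning", "store"),
]

# Cross-tabulation: count how often each pair of levels occurs together.
crosstab = Counter(rows)
segments = sorted({segment for segment, _ in rows})
channels = sorted({channel for _, channel in rows})

# Print the contingency table, one row per segment and one column per channel.
print("segment".ljust(12) + "".join(c.rjust(8) for c in channels))
for segment in segments:
    cells = "".join(str(crosstab[(segment, c)]).rjust(8) for c in channels)
    print(segment.ljust(12) + cells)
```

The same counts are what a grouped bar plot displays: one group per segment, one bar per channel.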
Why is Exploratory Data Analysis Important?
• Exploratory Data Analysis (EDA) is an essential step in the data
analysis process. It involves analyzing and visualizing data to
understand its main characteristics, uncover patterns, and identify
relationships between variables.
• EDA is crucial because raw data is usually skewed and may have outliers
or too many missing values. A model built on such data performs
sub-optimally. In the hurry to get to the machine learning stage, some
data professionals either skip the EDA process entirely or do a very
mediocre job. This is a mistake, because it forfeits the benefits of EDA,
including:
  • Insight Generation: EDA helps uncover patterns, trends, and
  relationships in data that drive business decisions.
  • Anomaly Detection: Identifying outliers and unusual patterns that may
  indicate data quality issues or business anomalies.
  • Hypothesis Formation: Helps in formulating hypotheses about the data
  that can be tested further using more sophisticated analytical
  techniques.
  • Improved Data Quality: By cleaning and transforming data, EDA ensures
  that subsequent analyses are based on accurate and relevant
  information.
  • Decision-Making Support: Provides decision-makers with a deeper
  understanding of the data, enabling more informed and effective
  business strategies.
EDA Techniques
1. Descriptive Statistics
• Summary Statistics: Measures like mean, median, mode, standard
deviation, and variance provide a quick understanding of the data
distribution.
• Frequency Distribution: Analyzes how often values occur within a
dataset, helping identify common patterns or outliers.
• Example: A retail company analyzes the average, median, and mode of
daily sales figures to understand general sales trends. The standard
deviation is also calculated to assess sales variability.
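The retail example can be reproduced with Python's built-in `statistics` module. The daily sales figures below are invented, and the sample (rather than population) versions of the standard deviation and variance are an assumption of this sketch.

```python
import statistics
from collections import Counter

# Hypothetical daily sales for two weeks (units sold); values are invented.
daily_sales = [120, 135, 120, 150, 160, 120, 135, 400, 125, 130, 120, 145, 155, 135]

mean_sales = statistics.mean(daily_sales)
median_sales = statistics.median(daily_sales)
mode_sales = statistics.mode(daily_sales)
stdev_sales = statistics.stdev(daily_sales)        # sample standard deviation
variance_sales = statistics.variance(daily_sales)  # sample variance

# Frequency distribution: how often each value occurs.
frequency = Counter(daily_sales)

print(f"mean={mean_sales:.1f} median={median_sales} mode={mode_sales}")
print(f"stdev={stdev_sales:.1f} variance={variance_sales:.1f}")
print(frequency.most_common(3))
```

Here the mean sits well above the median because a single day with 400 units pulls it up, which is exactly the kind of skew and outlier that summary statistics are meant to reveal.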
2. Data Visualization
• Histograms: Used to visualize the distribution of a single variable,
showing the frequency of data points within specified ranges.
• Box Plots: Display the distribution of data based on a five-number
summary (minimum, first quartile, median, third quartile, and maximum)
and help identify outliers.
• Scatter Plots: Reveal relationships or correlations between two
numerical variables, helping to identify trends or patterns.
• Heatmaps: Encode values as colors; often used to show the concentration
of data points or to represent correlation matrices.
• Bar Charts and Pie Charts: Visualize categorical data to compare
different groups or categories.
3. Data Cleaning
• Handling Missing Data: Identifying and addressing missing values through
techniques such as imputation, deletion, or filling with a default value.
• Outlier Detection: Identifying and managing outliers that could skew the
results. Techniques include the use of z-scores or the IQR (interquartile
range).
Examples:
• Handling Missing Data: An organization notices that some customer
demographic data is missing. They decide to impute missing age values
using the median age of the entire customer base to maintain analysis
accuracy.
• Outlier Detection: In analyzing website traffic data, an outlier
detection method identifies a sudden spike in traffic on a particular
day, which is traced back to a marketing campaign.
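Both examples can be sketched in a few lines of standard-library Python. The ages and visit counts are invented; the 1.5 x IQR cut-off is the conventional rule of thumb and is used here as an assumption, not as the method the examples necessarily had in mind.

```python
import statistics

# Hypothetical customer ages with missing entries (None); values are invented.
ages = [34, None, 29, 41, None, 38, 25]
known_ages = [a for a in ages if a is not None]
median_age = statistics.median(known_ages)
# Median imputation: replace each missing value with the median of the known ones.
imputed_ages = [median_age if a is None else a for a in ages]

# Hypothetical daily website visits containing one campaign-day spike.
visits = [510, 495, 530, 505, 520, 1900, 515, 500]
q1, _, q3 = statistics.quantiles(visits, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# IQR rule: values more than 1.5 * IQR outside the quartiles are flagged.
outliers = [v for v in visits if v < lower or v > upper]

print(imputed_ages)
print(outliers)
```

Flagging a value is only the first step: as in the traffic example, the spike turns out to be a real event, so it should be explained rather than silently deleted.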
4. Data Transformation
• Normalization/Standardization: Adjusting data scales to ensure variables
are comparable, especially when they are on different scales.
• Log Transformation: Used to reduce skewness in data distributions,
making patterns more visible.
• Binning: Grouping continuous data into discrete bins to simplify
analysis and reveal trends.
Examples:
• Normalization/Standardization: A company standardizes its sales and
marketing data (e.g., scaling the data so that each variable has a mean
of 0 and a standard deviation of 1) to compare the effects of different
variables on revenue.
• Log Transformation: A BI team uses a log transformation on highly skewed
revenue data to bring the distribution closer to normal, facilitating
better modeling and trend analysis.
• Binning: Customer ages are binned into ranges (e.g., 18-25, 26-35, etc.)
to simplify the analysis of age-related purchasing behavior.
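The three transformations can be written out directly. In this sketch the revenue figures and ages are invented, the population standard deviation is used for scaling, base-10 logarithms are an arbitrary choice, and the bins beyond 18-25 and 26-35 are assumptions.

```python
import math
import statistics

# Hypothetical monthly revenue (in thousands); invented, right-skewed values.
revenue = [10, 12, 15, 20, 30, 100, 1000]

# Standardization: subtract the mean and divide by the standard deviation.
mean_rev = statistics.mean(revenue)
std_rev = statistics.pstdev(revenue)  # population standard deviation
z_scores = [(x - mean_rev) / std_rev for x in revenue]

# Log transformation: compresses large values and reduces right skew.
log_revenue = [math.log10(x) for x in revenue]

# Binning: the first two ranges follow the slide; the rest are assumed.
bins = [(18, 25), (26, 35), (36, 50), (51, 120)]

def age_bin(age):
    """Return the label of the range containing `age`, or 'other'."""
    for low, high in bins:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"

ages = [19, 25, 26, 40, 63, 17]
age_groups = [age_bin(a) for a in ages]

print([round(z, 2) for z in z_scores])
print([round(v, 2) for v in log_revenue])
print(age_groups)
```

After standardization the values have mean 0 and standard deviation 1; after the log transformation the gap between 10 and 1000 shrinks from a factor of 100 to a difference of 2.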
5. Correlation Analysis
• Correlation Coefficients: Quantify the relationship between two
variables (e.g., Pearson or Spearman correlation), helping to identify
variables that are strongly related.
• Correlation Matrices: Tables of correlation coefficients for multiple
variables, often displayed visually, aiding in identifying potential
predictors.
Examples:
• Correlation Coefficients: A business examines the Pearson correlation
between customer satisfaction scores and customer retention rates,
finding a strong positive correlation that suggests higher satisfaction
is associated with better retention.
• Correlation Matrices: A BI analyst creates a correlation matrix to
understand how different financial KPIs (e.g., revenue, profit margin,
customer acquisition cost) are interrelated, guiding strategic decisions.
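Both coefficients are short enough to compute by hand. The sketch below implements Pearson directly and Spearman as the Pearson correlation of ranks; the per-store figures are invented, and the ranking helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of spreads."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-store figures; values are invented.
satisfaction = [6.1, 7.0, 7.4, 8.2, 8.9, 9.5]
retention = [0.52, 0.60, 0.58, 0.71, 0.80, 0.97]
revenue = [100, 120, 125, 150, 170, 210]

r = pearson(satisfaction, retention)
rho = spearman(satisfaction, retention)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")

# Correlation matrix for several variables, stored as nested dictionaries.
variables = {"satisfaction": satisfaction, "retention": retention, "revenue": revenue}
matrix = {a: {b: pearson(variables[a], variables[b]) for b in variables} for a in variables}
for name, row in matrix.items():
    print(name.ljust(13), [round(v, 2) for v in row.values()])
```

A high coefficient shows that the two variables move together; it does not by itself show that one causes the other.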
6. Feature Engineering
• Derived Variables: Creating new variables from existing ones to capture
more complex relationships (e.g., interaction terms, polynomial
features).
• Dimensionality Reduction: Techniques like Principal Component Analysis
(PCA) reduce the number of variables while retaining most of the
variance, making it easier to interpret data.
Examples:
• Derived Variables: A BI team creates a new feature called "Customer
Lifetime Value (CLV)" by combining average purchase amount, purchase
frequency, and customer retention rate, providing a powerful metric for
marketing analysis.
• Dimensionality Reduction: To simplify a complex dataset with many
variables, a company uses Principal Component Analysis (PCA) to reduce
the number of features while retaining most of the variability in the
data, making it easier to visualize and analyze.
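A derived variable is simply a new column computed from existing ones. The sketch below builds a CLV-style feature; the slides do not give a formula, so the one used here (annual spend multiplied by an expected lifetime of 1 / (1 - retention rate) years) is an assumed simplification, and the customer records and field names are invented.

```python
# Hypothetical customers; field names and values are invented for illustration.
customers = [
    {"id": "A", "avg_purchase": 50.0, "purchases_per_year": 4, "retention_rate": 0.80},
    {"id": "B", "avg_purchase": 20.0, "purchases_per_year": 12, "retention_rate": 0.50},
]

def customer_lifetime_value(c):
    """Assumed simplification: annual spend times expected lifetime in years."""
    expected_years = 1.0 / (1.0 - c["retention_rate"])
    return c["avg_purchase"] * c["purchases_per_year"] * expected_years

for c in customers:
    # Interaction term: the product of two existing features.
    c["annual_spend"] = c["avg_purchase"] * c["purchases_per_year"]
    # Derived variable: one number combining three existing columns.
    c["clv"] = customer_lifetime_value(c)

print([(c["id"], c["annual_spend"], round(c["clv"], 2)) for c in customers])
```

Customer B spends more per year, yet customer A has the higher derived value because of better retention, which is the kind of relationship a single raw column would not show.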
7. Hypothesis Testing
• T-tests and ANOVA: Used to compare means between groups and determine
whether observed differences are statistically significant.
• Chi-Square Test: Used to examine the relationship between categorical
variables.
Examples:
• T-tests and ANOVA: A BI analyst conducts a t-test to compare the average
sales before and after a new marketing campaign, determining whether the
observed increase is statistically significant.
• Chi-Square Test: An e-commerce company uses a Chi-Square test to examine
the relationship between customer gender and product category
preference, finding significant associations that inform targeted
marketing strategies.
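The test statistics themselves are simple arithmetic. The sketch below computes Welch's t statistic (one variant of the t-test, chosen here as an assumption) and the Pearson chi-square statistic for a 2x2 table; all sales figures and counts are invented.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    return (statistics.mean(sample_b) - statistics.mean(sample_a)) / math.sqrt(var_a + var_b)

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical daily sales before and after a campaign; invented values.
before = [100, 104, 98, 101, 97, 103]
after = [108, 112, 105, 110, 109, 111]
t_stat = welch_t(before, after)

# Hypothetical counts: rows = customer group, columns = product category.
purchases = [[30, 10], [20, 40]]
chi2 = chi_square(purchases)
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
significant = chi2 > 3.841

print(f"t = {t_stat:.2f}, chi-square = {chi2:.2f}, significant = {significant}")
```

In practice a library routine such as `scipy.stats.ttest_ind` or `scipy.stats.chi2_contingency` would also report a p-value; note that the latter applies a continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected one above.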
8. Time Series Analysis
• Trend Analysis: Identifying trends over time, such as seasonal patterns
or long-term shifts, using line plots or time series decomposition.
• Autocorrelation: Measures the correlation of a time series with its own
past values, helping to identify patterns over time.
Examples:
• Trend Analysis: A BI team analyzes monthly sales data over several years
to identify seasonal trends, such as increased sales during the holiday
season, and uses this insight for inventory planning.
• Autocorrelation: An autocorrelation function is applied to daily website
traffic data to detect any repeating patterns or cycles, such as weekly
peaks in traffic.
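The weekly-peak example can be checked with a hand-written autocorrelation function. The traffic series below is invented (one week repeated four times), and the normalisation by the full-series variance is the common sample estimator, used here as an assumption.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance

# Hypothetical daily visits over four weeks with a weekend peak; invented values.
week = [100, 105, 98, 102, 110, 180, 175]
traffic = week * 4

# Compare lags of 1 to 8 days and pick the one with the strongest correlation.
by_lag = {lag: autocorrelation(traffic, lag) for lag in range(1, 9)}
best_lag = max(by_lag, key=by_lag.get)

for lag, value in by_lag.items():
    print(lag, round(value, 2))
print("strongest lag:", best_lag)
```

The strongest correlation appears at a lag of 7 days, which is how a weekly cycle shows up in an autocorrelation plot.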
9. Clustering and Segmentation
• K-Means Clustering: Grouping data points into clusters based on
similarity, often used for customer segmentation in BI.
• Hierarchical Clustering: Creating a tree of clusters to understand data
groupings at different levels of similarity.
Examples:
• K-Means Clustering: A company segments its customers into clusters based
on purchasing behavior using K-means clustering, identifying distinct
customer groups such as "bargain hunters" and "premium buyers."
• Hierarchical Clustering: A BI team uses hierarchical clustering to
organize products into categories based on features such as price,
brand, and customer ratings, enabling more effective product
recommendations.
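K-means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is a bare-bones version for illustration only; the customer data, the two features, and the hand-picked starting centroids are all assumptions.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers as (average basket value, share of discounted items).
customers = [(20, 0.8), (25, 0.7), (22, 0.9), (90, 0.1), (110, 0.2), (100, 0.0)]
# Starting centroids are picked by hand here; libraries usually choose them randomly.
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])

print(centroids)
print(clusters)
```

The first cluster (small baskets, mostly discounted) matches the "bargain hunters" and the second the "premium buyers". Because the two features are on very different scales, basket value dominates the distance here; standardizing the features first is the usual remedy.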
10. Data Profiling
• Univariate Analysis: Examining each variable individually to understand
its distribution, central tendency, and variability.
• Multivariate Analysis: Exploring relationships between multiple
variables simultaneously to uncover complex interactions.
Examples:
• Univariate Analysis: A BI analyst examines each variable individually,
such as the distribution of customer ages, to understand the basic
characteristics of the customer base.
• Multivariate Analysis: A retail company performs multivariate analysis
on sales data, examining how variables like product category, customer
location, and purchase time interact to influence overall sales.
Summary
• Exploratory Data Analysis (EDA) is a critical process in data analysis
that involves summarizing and visualizing data to uncover patterns,
relationships, and anomalies. It helps businesses make informed decisions
by providing insights into data distributions and correlations.
• EDA can be performed through univariate, bivariate, and multivariate
analyses, using both graphical and non-graphical techniques. It also
includes methods like dimensionality reduction and data transformation to
enhance data quality and interpretability.
• Overall, EDA is essential for understanding and preparing data for more
advanced analyses in Business Intelligence.
