Exploratory Data Analysis unit 2

Exploratory Data Analysis (EDA) is a method for analyzing data sets to uncover patterns, anomalies, and insights through descriptive statistics and visualization techniques. It emphasizes understanding the data's structure and assumptions, which are crucial for ensuring valid analysis and guiding further statistical modeling. EDA differs from classical data analysis by focusing on data exploration rather than imposing predetermined models, utilizing graphical techniques, and being less formal in approach.

Exploratory Data Analysis
Understanding your data through visualization and statistics

Presented By:
Prof. Tithirupa Tapaswini
Assistant Professor
Department of CSE
Medicaps University Indore
What is EDA?
Exploratory Data Analysis (EDA) is an approach to analyzing and understanding
data sets with the goal of discovering patterns, identifying anomalies, testing
hypotheses, and checking assumptions. It involves summarizing the main
characteristics of the data, often using visual methods.
Here are some key components of EDA:
1. Descriptive Statistics: Summarizing data through measures such as mean, median, standard deviation, and variance.
2. Data Visualization: Using graphs and plots to visualize distributions, relationships, and trends.
3. Data Cleaning: Identifying and addressing missing values, outliers, and inconsistencies in the data.
4. Univariate Analysis: Examining individual variables to understand their distribution and central tendency.
5. Bivariate and Multivariate Analysis: Exploring relationships between two or more variables to identify correlations and patterns.
6. Feature Engineering: Creating new features or modifying existing ones to improve the quality of the data or make it more suitable for modeling.

EDA is often a preliminary step before more formal statistical analysis or machine
learning, helping to guide the analysis and ensure that any modeling is based on a
sound understanding of the data.
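As a minimal, hedged sketch of these components in practice (assuming a pandas DataFrame named df loaded from a hypothetical file measurements.csv, with placeholder column names "x" and "y"), the steps above might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a data set (the file name is a placeholder).
df = pd.read_csv("measurements.csv")

# Descriptive statistics: mean, quartiles, standard deviation, etc.
print(df.describe())
print(df.median(numeric_only=True))

# Data cleaning: count missing values in each column.
print(df.isna().sum())

# Univariate analysis / visualization: histogram of each numeric variable.
df.hist(figsize=(10, 6))
plt.tight_layout()
plt.show()

# Bivariate analysis: correlation matrix and a scatter plot
# ("x" and "y" are placeholder column names).
print(df.corr(numeric_only=True))
df.plot.scatter(x="x", y="y")
plt.show()
```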
Assumptions in EDA:
The gamut of scientific and engineering experimentation is virtually limitless. In this sea of diversity, is there any common basis that allows the analyst to systematically and validly arrive at supportable, repeatable research conclusions?
Underlying Assumptions
There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand "behave like":
1. random drawings;
2. from a fixed distribution;
3. with the distribution having fixed location;
4. with the distribution having fixed variation.
Understanding and addressing the underlying assumptions in Exploratory Data
Analysis (EDA) is crucial for several reasons:
1. Validity of Analysis:
Importance: Assumptions help ensure that the methods and techniques used
during EDA are valid and appropriate for the data. If assumptions are violated, the
results of the analysis may be misleading or incorrect.
Example: If you assume data independence but your observations are correlated
(e.g., time series data), statistical tests that assume independence might produce
unreliable results.

2. Guidance for Further Analysis:
Importance: Verified assumptions guide the choice of subsequent analyses, indicating which statistical tests, models, and transformations are appropriate for the data.
Example: Knowing that data should follow a normal distribution helps in choosing appropriate statistical tests, such as parametric tests, or deciding if transformations are needed.

3. Data Preparation:
•Importance: Addressing assumptions helps in cleaning and preparing data
effectively. This includes handling missing values, correcting errors, and transforming
variables to meet assumptions.
•Example: If the data is not normally distributed, you might need to apply
transformations or use non-parametric methods.
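A hedged sketch of this kind of preparation, assuming a 1-D NumPy array or pandas Series of positive values named values, is shown below; the Shapiro-Wilk test and a log transform are only one of several reasonable choices:

```python
import numpy as np
from scipy import stats

def check_and_transform(values, alpha=0.05):
    """Check approximate normality; log-transform if it is clearly violated."""
    # Shapiro-Wilk test: a small p-value suggests departure from normality.
    stat, p_value = stats.shapiro(values)
    print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")

    if p_value < alpha:
        # A log transform is a common remedy for right-skewed, positive data;
        # non-parametric methods are an alternative when no transform helps.
        return np.log(values)
    return values
```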
4. Accuracy of Insights:
•Importance: Assumptions impact the accuracy and reliability of the insights derived
from EDA. If the assumptions are not met, the insights may be flawed or incomplete.
•Example: In linear regression, if the relationship between variables is not linear, the
model's predictions and interpretations could be inaccurate.
5. Detecting Problems:
•Importance: Checking assumptions helps in identifying and diagnosing potential
problems or limitations in the data or analysis process.
•Example: Detecting multicollinearity (high correlation between independent
variables) can prevent issues in regression analysis, such as inflated standard
errors.
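As an illustrative sketch (not the only diagnostic), multicollinearity is often quantified with the variance inflation factor (VIF); assuming a pandas DataFrame X containing the independent variables, a statsmodels-based check might look like:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance inflation factor for each independent variable in X."""
    exog = sm.add_constant(X)  # include an intercept column
    rows = [{"variable": name,
             "VIF": variance_inflation_factor(exog.values, i)}
            for i, name in enumerate(exog.columns) if name != "const"]
    # Rule of thumb: VIF values well above about 5-10 suggest multicollinearity.
    return pd.DataFrame(rows)
```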
6. Improving Model Performance:
•Importance: Properly addressing assumptions can improve the performance of
statistical models and machine learning algorithms. It ensures that models are
well-suited to the data, which enhances their predictive accuracy and
generalizability.
•Example: Addressing heteroscedasticity (non-constant variance of errors) in
regression analysis can lead to more reliable coefficient estimates and confidence
intervals.
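A minimal sketch of detecting heteroscedasticity before addressing it, assuming a response array y and a predictor matrix X, could use the Breusch-Pagan test from statsmodels:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

def breusch_pagan_pvalue(y, X):
    """Fit an OLS model and test the residuals for non-constant variance."""
    exog = sm.add_constant(X)
    model = sm.OLS(y, exog).fit()
    # het_breuschpagan returns (LM stat, LM p-value, F stat, F p-value).
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, exog)
    # A small p-value suggests heteroscedasticity; remedies include transforming
    # y or using heteroscedasticity-robust (e.g., HC3) standard errors.
    return lm_pvalue
```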
7. Communicating Results:
Importance: Being aware of and communicating the assumptions made during EDA
helps stakeholders understand the context and limitations of the analysis. This
transparency builds trust in the findings and helps in making informed decisions.
Example: Clearly stating that the data was assumed to be normally distributed
helps stakeholders understand the basis of the statistical tests used and their
appropriateness.

In summary, understanding and verifying the underlying assumptions in EDA is essential for conducting robust and credible data analysis. It ensures that the results are reliable, the models are appropriate, and the insights derived are meaningful and actionable.
Importance of Assumptions in EDA:
In EDA, assumptions are important because they guide the choice of analytical
techniques, help validate the results, and ensure accurate interpretations. They
inform whether data needs transformation and influence model selection,
impacting the overall reliability of insights drawn from the data.
1. Guiding Analysis
2. Model Building
3. Interpreting Results
4. Data Transformation
5. Validation and Robustness
6. Insight Generation
What are the EDA Goals?
The primary goal of EDA is to maximize the analyst's insight into a data set and
into the underlying structure of a data set, while providing all of the specific items
that an analyst would want to extract from a data set, such as:
1. a good-fitting, parsimonious model
2. a list of outliers
3. a sense of robustness of conclusions
4. estimates for parameters
5. uncertainties for those estimates
6. a ranked list of important factors
7. conclusions as to whether individual factors are statistically significant
8. optimal settings
Insight into the Data:
Insight implies detecting and uncovering underlying structure in the data.
To get a "feel" for the data, it is not enough for the analyst to know what is in the
data; the analyst also must know what is not in the data, and the only way to do
that is to draw on our own human pattern-recognition and comparative abilities in
the context of a series of judicious graphical techniques applied to the data.

The Role of Graphics


Statistics and data analysis procedures can broadly be split into two parts:
1. Quantitative
2. Graphical
1. Quantitative: Quantitative techniques are the set of statistical procedures that yield numeric or tabular output. Examples of quantitative techniques include:
hypothesis testing
analysis of variance
point estimates and confidence intervals
least squares regression

2. Graphical: On the other hand, there is a large collection of statistical tools that we generally refer to as graphical techniques. These include:
scatter plots
histograms
probability plots
residual plots
box plots and block plots
An EDA/Graphics Example
Data:
X Y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
The goal of the analysis is to compute summary statistics and determine the best linear fit for Y as a function of X; the results might be given as:
N=8
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
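A minimal NumPy sketch for computing these summary statistics and the least-squares fit is given below; it uses the eight (X, Y) pairs listed above, so its output may differ from the quoted values if the original analysis included additional observations:

```python
import numpy as np

x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26])

n = len(x)
print("N =", n)
print("Mean of X =", x.mean())
print("Mean of Y =", y.mean())

# Least-squares line y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, 1)
print("Intercept =", intercept, "Slope =", slope)

# Residual standard deviation (N - 2 degrees of freedom for a fitted line).
residuals = y - (intercept + slope * x)
print("Residual standard deviation =", np.sqrt(np.sum(residuals**2) / (n - 2)))

print("Correlation =", np.corrcoef(x, y)[0, 1])
```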
The above quantitative analysis, although valuable, gives us only limited insight into the data.
Scatter Plot: In contrast, a simple scatter plot of the data suggests the following:
1. The data set "behaves like" a linear curve with some scatter;
2. there is no justification for a more complicated model (e.g., quadratic);
3. there are no outliers;
4. the vertical spread of the data appears to be of equal height irrespective of the
X-value; this indicates that the data are equally-precise throughout and so a
"regular" (that is, equi-weighted) fit is appropriate.

This kind of characterization for the data serves as the core for getting insight/feel
for the data.
When the same summary statistics are produced by four different data sets (as in the well-known Anscombe quartet), scatter plots reveal their very different structures. Conclusions from the four scatter plots are:
1. data set 1 is clearly linear with some scatter.
2. data set 2 is clearly quadratic.
3. data set 3 clearly has an outlier.
4. data set 4 is obviously the victim of a poor experimental design with a
single point far removed from the bulk of the data "wagging the dog".
Techniques for Testing Assumptions
The following EDA techniques are simple, efficient, and powerful for the
routine testing of underlying assumptions:
1. run sequence plot (Yi versus i)
2. lag plot (Yi versus Yi-1)
3. histogram (counts versus subgroups of Y)
4. normal probability plot (ordered Y versus theoretical ordered Y)
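A minimal sketch of such a 4-plot, assuming y is a 1-D NumPy array of measurements in run order (the synthetic data at the end is purely illustrative), might be:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def four_plot(y):
    """Run sequence plot, lag plot, histogram, and normal probability plot."""
    y = np.asarray(y)
    fig, axes = plt.subplots(2, 2, figsize=(8, 8))

    # 1. Run sequence plot: Y(i) versus i, to check fixed location/variation.
    axes[0, 0].plot(np.arange(1, len(y) + 1), y, marker=".", linestyle="-")
    axes[0, 0].set_title("Run sequence plot")

    # 2. Lag plot: Y(i) versus Y(i-1), to check randomness.
    axes[0, 1].scatter(y[:-1], y[1:], s=10)
    axes[0, 1].set_title("Lag plot")

    # 3. Histogram: to check the shape of the distribution.
    axes[1, 0].hist(y, bins="auto")
    axes[1, 0].set_title("Histogram")

    # 4. Normal probability plot: ordered data versus theoretical quantiles.
    stats.probplot(y, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title("Normal probability plot")

    plt.tight_layout()
    plt.show()

# Example with synthetic in-control data (for illustration only).
four_plot(np.random.default_rng(0).normal(10.0, 1.0, size=200))
```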
Sample Plot: Assumptions Do Not Hold
If one or more of the four underlying assumptions do not hold, this will show up in the various plots, as demonstrated in the following example.
Interpretation of 4-Plot
The four EDA plots discussed on the previous page are used to
test the underlying assumptions:
1. Fixed Location: If the fixed location assumption holds, then
the run sequence plot will be flat and non-drifting.
2. Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.
3. Randomness: If the randomness assumption holds, then the
lag plot will be structureless and random.
4. Fixed Distribution: If the fixed distribution assumption holds, in particular if the fixed normal distribution assumption holds, then
1. the histogram will be bell-shaped, and
2. the normal probability plot will be linear.
If all four of the assumptions hold, then the process is said definitionally to
be "in statistical control".

Consequences:
The following sections discuss in more detail the
consequences of invalid assumptions:
1. Consequences of non-randomness
2. Consequences of non-fixed location parameter
3. Consequences of non-fixed variation
4. Consequences related to distributional assumptions
Consequences of Non-Randomness
The randomness assumption is the most critical but the least tested.
If the randomness assumption does not hold, then
1. All of the usual statistical tests are invalid.
2. The calculated uncertainties for commonly used statistics
become meaningless.
3. The calculated minimal sample size required for a pre-
specified tolerance becomes meaningless.
4. The simple model: y = constant + error becomes invalid.
5. The parameter estimates become suspect and non-supportable.
Autocorrelation:
One of the common problems underlying non-randomness is autocorrelation. Autocorrelation is the correlation between Y(t) and Y(t-k), where k is an integer that defines the lag for the autocorrelation. That is, autocorrelation is a time-dependent non-randomness.

Autocorrelation is typically detected via an autocorrelation plot or a lag plot. If the data are not random due to autocorrelation, then
1. Adjacent data values may be related.
2. There may not be n independent snapshots of the phenomenon under study.
3. There may be undetected "junk" outliers.
4. There may be undetected "information-rich" outliers.
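A minimal sketch for quantifying and visualizing autocorrelation, assuming y is a pandas Series of observations in time order, is:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot, lag_plot

def check_autocorrelation(y: pd.Series, k: int = 1) -> float:
    """Report the lag-k autocorrelation and draw lag/autocorrelation plots."""
    # Lag-k autocorrelation: correlation between Y(t) and Y(t-k).
    r_k = y.autocorr(lag=k)
    print(f"lag-{k} autocorrelation = {r_k:.3f}")

    # A structured lag plot indicates non-randomness; the autocorrelation
    # plot shows correlation at every lag with rough significance bands.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    lag_plot(y, lag=k, ax=ax1)
    autocorrelation_plot(y, ax=ax2)
    plt.tight_layout()
    plt.show()
    return r_k
```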
Consequences of Non-Fixed Location Parameter
The usual estimate of location is the mean, Ybar = (Y1 + Y2 + ... + YN) / N, computed from N measurements Y1, Y2, ..., YN.
If the run sequence plot does not support the assumption of fixed location, then
1. The location may be drifting.
2. The single location estimate may be meaningless (if the process is drifting).
3. The choice of location estimator (e.g., the sample mean) may be sub-optimal.
4. The usual formula for the uncertainty of the mean, s / sqrt(N), may be invalid and its numerical value optimistically small.
5. The location estimate may be poor.
6. The location estimate may be biased.
Consequences of Non-Fixed Variation Parameter
The usual estimate of variation is the standard deviation, s = sqrt( sum((Yi - Ybar)^2) / (N - 1) ).
If the run sequence plot does not support the assumption of fixed variation, then
1. The variation may be drifting.
2. The single variation estimate may be meaningless (if the process variation is drifting).
3. The variation estimate may be poor.
4. The variation estimate may be biased.
Consequences Related to Distributional Assumptions
Distribution:
1. The distribution may be changing.
2. The single distribution estimate may be meaningless (if the process distribution is changing).
3. The distribution may be markedly non-normal.
4. The distribution may be unknown.
5. The true probability distribution for the error may remain unknown.
Model:
1. The model may be changing.
2. The single model estimate may be meaningless.
3. The default model Y = constant + error may be invalid.
4. If the default model is insufficient, information about a better model may remain undetected.
5. A poor deterministic model may be fit.
6. Information about an improved model may go undetected.
How Does Exploratory Data Analysis differ from
Classical Data Analysis?
EDA is a data analysis approach. What other data analysis
approaches exist and how does EDA differ from these other
approaches? Three popular data analysis approaches are:
1.Classical
2.Exploratory
3.Bayesian

The difference is the sequence and focus of the intermediate steps.
For classical analysis, the sequence is:
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is:
Problem => Data => Analysis => Model => Conclusions
For Bayesian, the sequence is:
Problem => Data => Model => Prior Distribution => Analysis => Conclusions
Focusing on EDA versus classical, these two approaches
differ as follows:
1.Model
2.Focus
3.Techniques
4.Rigor
5.Data Treatment
6.Assumptions
1. Model
Classical: The classical approach imposes models (both deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and analysis of variance (ANOVA) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed; this assumption affects the validity of the ANOVA F tests.
Exploratory: The Exploratory Data Analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data.
2. Focus
Classical: The two approaches differ substantially in focus. For classical analysis, the focus is on the model: estimating parameters of the model and generating predicted values from the model.
Exploratory: For exploratory data analysis, the focus is on the data: its structure, outliers, and models suggested by the data.
3. Techniques
Classical: Classical techniques are generally quantitative in nature. They include ANOVA, t tests, chi-squared tests, and F tests.
Exploratory: EDA techniques are generally graphical. They include scatter plots, character plots, box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.
4. Rigor
Classical: Classical techniques serve as the probabilistic foundation of science and engineering; the most important characteristic of classical techniques is that they are rigorous, formal, and "objective".
Exploratory: EDA techniques do not share in that rigor or formality. EDA techniques make up for that lack of rigor by being very suggestive, indicative, and insightful about what the appropriate model should be. EDA techniques are subjective and depend on interpretation, which may differ from analyst to analyst, although experienced analysts commonly arrive at identical conclusions.
5. Data Treatment
Classical: Classical estimation techniques have the characteristic of taking all of the data and mapping the data into a few numbers ("estimates"). This is both a virtue and a vice. The virtue is that these few numbers focus on important characteristics (location, variation, etc.) of the population. The vice is that concentrating on these few characteristics can filter out other characteristics (skewness, tail length, autocorrelation, etc.) of the same population. In this sense there is a loss of information due to this "filtering" process.
Exploratory: The EDA approach, on the other hand, often makes use of (and shows) all of the available data. In this sense there is no corresponding loss of information.
6. Assumptions
Classical: The "good news" of the classical approach is that tests based on classical techniques are usually very sensitive; that is, if a true shift in location, say, has occurred, such tests frequently have the power to detect such a shift and to conclude that such a shift is "statistically significant". The "bad news" is that classical tests depend on underlying assumptions (e.g., normality), and hence the validity of the test conclusions becomes dependent on the validity of the underlying assumptions.
Exploratory: Many EDA techniques make little or no assumptions; they present and show the data, all of the data, as is, with fewer encumbering assumptions.
How Does Exploratory Data Analysis Differ from Summary Analysis?
Summary: A summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is in the past. Quite commonly, its purpose is simply to arrive at a few key statistics (for example, mean and standard deviation) which may then either replace the data set or be added to the data set in the form of a summary table.
Exploratory: In contrast, EDA has as its broadest goal the desire to gain insight into the engineering/scientific process behind the data. Whereas summary statistics are passive and historical, EDA is active and futuristic. In an attempt to "understand" the process and improve it in the future, EDA uses the data as a "window" to peer into the heart of the process that generated the data.
