EDA Unit-2

Exploratory Data Analysis (EDA) is a data analysis philosophy that emphasizes graphical techniques to gain insights, uncover structures, and identify important variables in a dataset. It differs from classical data analysis by focusing on data exploration without imposing models upfront, while classical analysis emphasizes model parameter estimation. EDA aims to maximize insight into data, detect outliers, and suggest appropriate models, making it an active and insightful approach compared to passive summary analysis.

UNIT-2

What is EDA?
Approach
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.
Focus
The EDA approach is precisely that--an approach--not a set of techniques, but an attitude/philosophy about how a data analysis should be
carried out.
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the
heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore,
and graphics gives the analyst unparalleled power to do so, enticing the data to reveal its structural
secrets, and being always ready to gain some new, often unsuspected, insight into the data. In
combination with the natural pattern-recognition capabilities that we all possess, graphics provides,
of course, unparalleled power to carry this out. The particular graphical techniques employed in EDA
are often quite simple, consisting of various techniques of:

Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block
plots, and Youden plots).

Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots
of the raw data.

Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using
multiple plots per page.

How Does Exploratory Data Analysis Differ from Classical Data Analysis?
EDA is a data analysis approach. What other data analysis approaches exist and how does EDA
differ from these other approaches? Three popular data analysis approaches are:

1. Classical
2. Exploratory (EDA)
3. Bayesian
These three approaches are similar in that they all start with a general science/engineering problem
and all yield science/engineering conclusions. The difference is the sequence and focus of the
intermediate steps.
For classical analysis, the sequence is

Problem => Data => Model => Analysis => Conclusions

For EDA, the sequence is

Problem => Data => Analysis => Model => Conclusions

For Bayesian, the sequence is

Problem => Data => Model => Prior Distribution => Analysis => Conclusions

Thus for classical analysis, the data collection is followed by the imposition of a model (normality,
linearity, etc.) and the analysis, estimation, and testing that follows are focused on the parameters of that
model. For EDA, the data collection is not followed by a model imposition; rather it is followed
immediately by analysis with a goal of inferring what model would be appropriate. Finally, for a
Bayesian analysis, the analyst attempts to incorporate scientific/engineering knowledge/expertise into the
analysis by imposing a data-independent distribution on the parameters of the selected model; the
analysis thus consists of formally combining both the prior distribution on the parameters and the
collected data to jointly make inferences and/or test assumptions about the model parameters.

In the real world, data analysts freely mix elements of all of the above three approaches (and other
approaches). The above distinctions were made to emphasize the major differences among the three
approaches.

Focusing on EDA versus classical, these two approaches differ as follows:


1. Model
2. Focus
3. Techniques
4. Rigor
5. Data Treatment
6. Assumptions

1. Model

Classical: The classical approach imposes models (both deterministic and probabilistic) on the data.
Deterministic models include, for example, regression models and analysis of variance (ANOVA)
models. The most common probabilistic model assumes that the errors about the deterministic model
are normally distributed--this assumption affects the validity of the ANOVA F tests.

Exploratory: The Exploratory Data Analysis approach does not impose deterministic or probabilistic
models on the data. On the contrary, the EDA approach allows the data to suggest admissible models
that best fit the data.
2. Focus

Classical: The two approaches differ substantially in focus. For classical analysis, the focus is on the
model--estimating parameters of the model and generating predicted values from the model.

Exploratory: For exploratory data analysis, the focus is on the data--its structure, outliers, and models
suggested by the data.

3. Techniques

Classical: Classical techniques are generally quantitative in nature. They include ANOVA, t tests,
chi-squared tests, and F tests.

Exploratory: EDA techniques are generally graphical. They include scatterplots, character plots,
box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.
4. Rigor

Classical: Classical techniques serve as the probabilistic foundation of science and engineering;
the most important characteristic of classical techniques is that they are rigorous, formal, and
"objective".

Exploratory: EDA techniques do not share in that rigor or formality. EDA techniques make up for
that lack of rigor by being very suggestive, indicative, and insightful about what the appropriate
model should be. EDA techniques are subjective and depend on interpretation, which may differ from
analyst to analyst, although experienced analysts commonly arrive at identical conclusions.

5. Data Treatment

Classical: Classical estimation techniques have the characteristic of taking all of the data and
mapping the data into a few numbers ("estimates"). This is both a virtue and a vice.
The virtue is that these few numbers focus on important characteristics (location,
variation, etc.) of the population. The vice is that concentrating on these few
characteristics can filter out other characteristics (skewness, tail length,
autocorrelation, etc.) of the same population. In this sense there is a loss of
information due to this "filtering" process.

Exploratory: The EDA approach, on the other hand, often makes use of (and shows) all of the
available data. In this sense there is no corresponding loss of information.
6. Assumptions

Classical: The "good news" of the classical approach is that tests based on classical techniques
are usually very sensitive--that is, if a true shift in location, say, has occurred, such
tests frequently have the power to detect such a shift and to conclude that such a shift
is "statistically significant". The "bad news" is that classical tests depend on
underlying assumptions (e.g., normality), and hence the validity of the test
conclusions becomes dependent on the validity of the underlying assumptions. Worse
yet, the exact underlying assumptions may be unknown to the analyst, or if known,
untested. Thus the validity of the scientific conclusions becomes intrinsically linked to
the validity of the underlying assumptions. In practice, if such assumptions are
unknown or untested, the validity of the scientific conclusions becomes suspect.

Exploratory: Many EDA techniques make little or no assumptions--they present and show the
data--all of the data--as is, with fewer encumbering assumptions.
How Does Exploratory Data Analysis Differ from Summary Analysis?

Summary: A summary analysis is simply a numeric reduction of a historical data set. It is quite
passive. Its focus is in the past. Quite commonly, its purpose is to simply arrive at a
few key statistics (for example, mean and standard deviation) which may then either
replace the data set or be added to the data set in the form of a summary table.

Exploratory: In contrast, EDA has as its broadest goal the desire to gain insight into the
engineering/scientific process behind the data. Whereas summary statistics are passive
and historical, EDA is active and futuristic. In an attempt to "understand" the process
and improve it in the future, EDA uses the data as a "window" to peer into the heart
of the process that generated the data. There is an archival role in the research and
manufacturing world for summary statistics, but there is an enormously larger role for
the EDA approach.

What are the EDA Goals?


Primary and Secondary Goals
The primary goal of EDA is to maximize the analyst's insight into a data set and into the underlying
structure of a data set, while providing all of the specific items that an analyst would want to extract
from a data set, such as:

1. a good-fitting, parsimonious model
2. a list of outliers
3. a sense of robustness of conclusions
4. estimates for parameters
5. uncertainties for those estimates
6. a ranked list of important factors
7. conclusions as to whether individual factors are statistically significant
8. optimal settings

Insight into the Data


Insight implies detecting and uncovering underlying structure in the data. Such underlying
structure may not be encapsulated in the list of items above; such items serve as the specific targets of
an analysis, but the real insight and "feel" for a data set comes as the analyst judiciously probes and
explores the various subtleties of the data. The "feel" for the data comes almost exclusively from the
application of various graphical techniques, the collection of which serves as the window into the
essence of the data. Graphics are irreplaceable--there are no quantitative analogues that will give the
same insight as well-chosen graphics.

To get a "feel" for the data, it is not enough for the analyst to know what is in the data; the analyst
also must know what is not in the data, and the only way to do that is to draw on our own human
pattern-recognition and comparative abilities in the context of a series of judicious graphical
techniques applied to the data.

The Role of Graphics


Statistics and data analysis procedures can broadly be split into two parts:
1. Quantitative
2. Graphical

1. Quantitative techniques are the set of statistical procedures that yield numeric or tabular output.
Examples of quantitative techniques include:
hypothesis testing
analysis of variance
point estimates and confidence intervals
least squares regression

2. Graphical
On the other hand, there is a large collection of statistical tools that we generally refer to as graphical
techniques. These include:
scatterplots
histograms
probability plots
residual plots
box plots
block plots
The EDA approach relies heavily on these and similar graphical techniques. Graphical procedures are
not just tools that we could use in an EDA context; they are tools that we must use. Such graphical tools
are the shortest path to gaining insight into a data set in terms of:
testing assumptions
model selection
model validation
estimator selection
relationship identification
factor effect determination
outlier detection

If one is not using statistical graphics, then one is forfeiting insight into one or more aspects of the
underlying structure of the data.
An EDA/Graphics Example
Data:

X Y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68

If the goal of the analysis is to compute summary statistics plus determine the best linear fit for Y as a
function of X, the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
The above quantitative analysis, although valuable, gives us only limited insight into the data.
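These summary statistics can be reproduced with a few lines of code. The following is a minimal sketch,
assuming NumPy and SciPy are available (the data are the X, Y pairs tabulated above):

import numpy as np
from scipy import stats

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Least squares fit of Y as a function of X
fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)
resid_sd = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))   # residual standard deviation

print("N =", len(x))                                        # 11
print("Mean of X =", round(x.mean(), 1))                    # 9.0
print("Mean of Y =", round(y.mean(), 1))                    # 7.5
print("Intercept =", round(fit.intercept, 1))               # 3.0
print("Slope =", round(fit.slope, 2))                       # 0.5
print("Residual standard deviation =", round(resid_sd, 3))  # 1.237
print("Correlation =", round(fit.rvalue, 3))                # 0.816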
Scatter Plot
In contrast, a simple scatter plot of the data suggests the following:
1. The data set "behaves like" a linear curve with some scatter;
2. there is no justification for a more complicated model (e.g., quadratic);
3. there are no outliers;
4. the vertical spread of the data appears to be of equal height irrespective of the X-value; this indicates
that the data are equally precise throughout and so a "regular" (that is, equi-weighted) fit is appropriate.

This kind of characterization for the data serves as the core for getting insight/feel for the data. Such
insight/feel does not come from the quantitative statistics; on the contrary, calculations of
quantitative statistics such as intercept and slope should be subsequent to the characterization and
will make sense only if the characterization is true. To illustrate the loss of information that results
when the graphics insight step is skipped, consider the following three data sets.

X2 Y2 X3 Y3 X4 Y4
10.00 9.14 10.00 7.46 8.00 6.58
8.00 8.14 8.00 6.77 8.00 5.76
13.00 8.74 13.00 12.74 8.00 7.71
9.00 8.77 9.00 7.11 8.00 8.84
11.00 9.26 11.00 7.81 8.00 8.47
14.00 8.10 14.00 8.84 8.00 7.04
6.00 6.13 6.00 6.08 8.00 5.25
4.00 3.10 4.00 5.39 19.00 12.50
12.00 9.13 12.00 8.15 8.00 5.56
7.00 7.26 7.00 6.42 8.00 7.91
5.00 4.74 5.00 5.73 8.00 6.89

A quantitative analysis on data set 2 yields


N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
which is identical to the analysis for data set 1.
One might naively assume that the two data sets are "equivalent" since that is what the statistics tell
us; but what do the statistics not tell us?
Remarkably, a quantitative analysis on data sets 3 and 4 also yields
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.236
Correlation = 0.816 (0.817 for data set 4)
which implies that in some quantitative sense, all four of the data sets are "equivalent". In fact, the four
data sets are far from "equivalent" and a scatter plot of each data set, which would be step 1 of any EDA
approach, would tell us that immediately.

Conclusions from the scatter plots are:
1. data set 1 is clearly linear with some scatter.
2. data set 2 is clearly quadratic.
3. data set 3 clearly has an outlier.
4. data set 4 is obviously the victim of a poor experimental design with a single point far removed
from the bulk of the data "wagging the dog".
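A sketch of how those scatter plots can be drawn, assuming NumPy and Matplotlib are available (the
arrays below are the four data sets tabulated above; data sets 2 and 3 share the X values of data set 1):

import numpy as np
import matplotlib.pyplot as plt

x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

fig, ax = plt.subplots(2, 2, figsize=(8, 8), sharex=True, sharey=True)
panels = [(x1, y1, "Data set 1: linear with scatter"),
          (x1, y2, "Data set 2: quadratic"),
          (x1, y3, "Data set 3: outlier"),
          (x4, y4, "Data set 4: single influential point")]
for axis, (x, y, title) in zip(ax.flat, panels):
    axis.scatter(x, y)          # one scatter plot per data set
    axis.set_title(title)

plt.tight_layout()
plt.show()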
Importance:
These points are exactly the substance that provides and defines "insight" and "feel" for a data set.
They are the goals and the fruits of an open exploratory data analysis (EDA) approach to the data.
Quantitative statistics are not wrong per se, but they are incomplete. They are incomplete because
they are numeric summaries which in the summarization operation do a good job of focusing on a
particular aspect of the data (e.g., location, intercept, slope, degree of relatedness, etc.) by judiciously
reducing the data to a few numbers. Doing so also filters the data, necessarily omitting and screening
out other sometimes crucial information in the focusing operation. Quantitative statistics focus but
also filter; and filtering is exactly what makes the quantitative approach incomplete at best and
misleading at worst. The estimated intercepts (= 3) and slopes (= 0.5) for data sets 2, 3, and 4 are
misleading because the estimation is done in the context of an assumed linear model and that
linearity assumption is the fatal flaw in this analysis. The EDA approach of deliberately postponing
the model selection until further along in the analysis has many rewards, not the least of which is the
ultimate convergence to a much-improved model and the formulation of valid and supportable
scientific and engineering conclusions.

EDA Assumptions:
The gamut of scientific and engineering experimentation is virtually limitless. In this sea of diversity
is there any common basis that allows the analyst to systematically and validly arrive at supportable,
repeatable research conclusions? Fortunately, there is such a basis and it is rooted in the fact that
every measurement process, however complicated, has certain underlying assumptions. This section
deals with what those assumptions are, why they are important, how to go about testing them, and
what the consequences are if the assumptions do not hold.
Underlying Assumptions
There are four assumptions that typically underlie all measurement processes; namely, that the data
from the process at hand "behave like":
1. random drawings;
2. from a fixed distribution;
3. with the distribution having fixed location; and
4. with the distribution having fixed variation.
The "fixed location" referred to in item 3 above differs for different problem types. The simplest
problem type is univariate; that is, a single variable.
For the univariate problem, the general model
    response = deterministic component + random component
becomes
    response = constant + error
For this case, the "fixed location" is simply the unknown constant. We can thus imagine the process at
hand to be operating under constant conditions that produce a single column of data with the
properties that:
the data are uncorrelated with one another;
the random component has a fixed distribution;
the deterministic component consists of only a constant; and
the random component has fixed variation.
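As a small illustration, such a univariate process can be simulated as a constant plus independent
random error. This is only a sketch, assuming NumPy is available, with an arbitrary constant of 10 and
unit error variation:

import numpy as np

rng = np.random.default_rng(0)

# Univariate model: response = constant + error
constant = 10.0                       # deterministic component (assumed value)
error = rng.normal(0.0, 1.0, 100)     # random component: fixed distribution, fixed variation
response = constant + error

# The sample mean estimates the unknown constant; the sample standard
# deviation estimates the fixed variation of the random component.
print("estimated location:", response.mean())
print("estimated variation:", response.std(ddof=1))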
Importance
Predictability is an all-important goal in science and engineering. If the four underlying assumptions
hold, then we have achieved probabilistic predictability--the ability to make probability statements
not only about the process in the past, but also about the process in the future. In short, such
processes are said to be "in statistical control".
Moreover, if the four assumptions are valid, then the process is amenable to the generation of valid
scientific and engineering conclusions. If the four assumptions are not valid, then the process is
drifting (with respect to location, variation, or distribution), unpredictable, and out of control. A
simple characterization of such processes by a location estimate, a variation estimate, or a
distribution "estimate" inevitably leads to engineering conclusions that are not valid, are not
supportable (scientifically or legally), and which are not repeatable in the laboratory.
Techniques for Testing Assumptions
The following EDA techniques are simple, efficient, and powerful for the routine testing of
underlying assumptions:
1. run sequence plot (Yi versus i)
2. lag plot (Yi versus Yi-1)
3. histogram (counts versus subgroups of Y)
4. normal probability plot (ordered Y versus theoretical ordered Y)
The four EDA plots can be juxtaposed for a quick look at the characteristics of the data. The plots
below are ordered as follows:
1. Run sequence plot - upper left
2. Lag plot - upper right
3. Histogram - lower left
4. Normal probability plot - lower right.

Sample Plot: Assumptions Hold
This 4-plot reveals a process that has fixed location, fixed variation, is random, apparently has a fixed,
approximately normal distribution, and has no outliers.
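A 4-plot of this kind can be produced with standard plotting tools. The following is a minimal sketch,
assuming NumPy, Matplotlib, and SciPy are available; the in-control data are simulated here purely for
illustration:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
y = rng.normal(loc=10.0, scale=1.0, size=200)   # simulated in-control process data

fig, ax = plt.subplots(2, 2, figsize=(8, 8))

# 1. Run sequence plot (upper left): Y[i] versus i
ax[0, 0].plot(np.arange(1, len(y) + 1), y, marker=".", linestyle="-")
ax[0, 0].set_title("Run Sequence Plot")

# 2. Lag plot (upper right): Y[i] versus Y[i-1]
ax[0, 1].scatter(y[:-1], y[1:], s=10)
ax[0, 1].set_title("Lag Plot")

# 3. Histogram (lower left): counts versus subgroups of Y
ax[1, 0].hist(y, bins=20)
ax[1, 0].set_title("Histogram")

# 4. Normal probability plot (lower right): ordered Y versus theoretical quantiles
stats.probplot(y, dist="norm", plot=ax[1, 1])
ax[1, 1].set_title("Normal Probability Plot")

plt.tight_layout()
plt.show()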
Sample Plot: Assumptions Do Not Hold
If one or more of the four underlying assumptions do not hold, then it will show up in the various plots
as demonstrated in the following example.
This 4-plot reveals a process that has fixed location, fixed variation, is non-random (oscillatory), has a
non-normal, U-shaped distribution, and has several outliers.

Interpretation of 4-Plot
The four EDA plots discussed above are used to test the underlying assumptions:
1. Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and
non-drifting.
2. Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence
plot will be approximately the same over the entire horizontal axis.
3. Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.
4. Fixed Distribution: If the fixed distribution assumption holds, in particular if the fixed normal
distribution assumption holds, then
1. the histogram will be bell-shaped, and
2. the normal probability plot will be linear.
Conversely, the underlying assumptions are tested using the EDA plots:
Run Sequence Plot: If the run sequence plot is flat and non-drifting, the fixed-location assumption holds. If
the run sequence plot has a vertical spread that is about the same over the entire plot, then the fixed-variation
assumption holds.
Lag Plot: If the lag plot is structureless, then the randomness assumption holds.
Histogram: If the histogram is bell-shaped, the underlying distribution is symmetric and perhaps
approximately normal.
Normal Probability Plot: If the normal probability plot is linear, the underlying distribution is approximately
normal.

If all four of the assumptions hold, then the process is said definitionally to be "in statistical control".

Consequences
The primary goal is to have correct, validated, and complete scientific/engineering conclusions flowing from
the analysis. This usually includes intermediate goals such as the derivation of a good-fitting model and the
computation of realistic parameter estimates. It should always include the ultimate goal of an understanding
and a "feel" for "what makes the process tick". There is no more powerful catalyst for discovery than the
bringing together of an experienced/expert scientist/engineer and a data set ripe with intriguing "anomalies"
and characteristics.
The following sections discuss in more detail the consequences of invalid assumptions:
1. Consequences of non-randomness
2. Consequences of non-fixed location parameter
3. Consequences of non-fixed variation
4. Consequences related to distributional assumptions

Consequences of Non-Randomness
There are four underlying assumptions:
1. randomness;
2. fixed location;
3. fixed variation; and
4. fixed distribution.
The randomness assumption is the most critical but the least tested.
Consequences of Non-Randomness
If the randomness assumption does not hold, then
1. All of the usual statistical tests are invalid.
2. The calculated uncertainties for commonly used statistics become meaningless.
3. The calculated minimal sample size required for a pre-specified tolerance becomes meaningless.
4. The simple model: y = constant + error becomes invalid.
5. The parameter estimates become suspect and non-supportable.
One specific and common type of non-randomness is autocorrelation. Autocorrelation is the correlation
between Yt and Yt-k, where k is an integer that defines the lag for the autocorrelation. That is,
autocorrelation is a time dependent non-randomness.
This means that the value of the current point is highly dependent on the previous point if k = 1 (or k points
ago if k is not 1). Autocorrelation is typically detected via an autocorrelation plot or a lag plot. If the data are
not random due to autocorrelation, then
1. Adjacent data values may be related.
2. There may not be n independent snapshots of the phenomenon under study.
3. There may be undetected "junk" outliers.
4. There may be undetected "information-rich" outliers.
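As a quick check for this kind of non-randomness, the lag-1 autocorrelation can be computed and a lag
plot drawn directly from the data. The following is a minimal sketch, assuming NumPy and Matplotlib
are available and using a simulated autocorrelated series for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulate an autocorrelated (AR(1)) series for illustration
y = np.zeros(200)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()

def autocorr(y, k=1):
    """Sample autocorrelation between Y[t] and Y[t-k]."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    return np.sum(yc[k:] * yc[:-k]) / np.sum(yc ** 2)

print("lag-1 autocorrelation:", autocorr(y, k=1))   # near 0.8 here; near 0 for random data

# Lag plot: visible structure (e.g., a diagonal band) indicates non-randomness
plt.scatter(y[:-1], y[1:], s=10)
plt.xlabel("Y[t-1]")
plt.ylabel("Y[t]")
plt.title("Lag Plot")
plt.show()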

Consequences of Non-Fixed Location Parameter


The usual estimate of location is the mean, Ȳ = (Y1 + Y2 + ... + YN) / N,
computed from N measurements Y1, Y2, ..., YN.

If the run sequence plot does not support the assumption of fixed location, then

1. The location may be drifting.

2. The single location estimate may be meaningless (if the process is drifting).

3. The choice of location estimator (e.g., the sample mean) may be sub-optimal.

4. The usual formula for the uncertainty of the mean, s(Ȳ) = s/√N, may be invalid and the numerical
value optimistically small.

5. The location estimate may be poor.


6. The location estimate may be biased.
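For reference, the usual location estimate and its nominal uncertainty can be computed as below. This is
a minimal sketch, assuming NumPy is available and reusing the Y data from the earlier example; the
result is only meaningful when the randomness and fixed-location assumptions hold:

import numpy as np

y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

n = len(y)
ybar = y.mean()              # usual location estimate (the mean)
s = y.std(ddof=1)            # sample standard deviation
se_mean = s / np.sqrt(n)     # usual uncertainty of the mean, s / sqrt(N)

print(f"mean = {ybar:.3f}, standard error of the mean = {se_mean:.3f}")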

Consequences of Non-Fixed Variation Parameter


The usual estimate of variation is the standard deviation,
    s = sqrt( Σ (Yi - Ȳ)² / (N - 1) ),
computed from N measurements Y1, Y2, ..., YN.

Consequences of Non-Fixed Variation
If the run sequence plot does not support the assumption of fixed variation, then
1. The variation may be drifting.
2. The single variation estimate may be meaningless (if the process variation is drifting).
3. The variation estimate may be poor.
4. The variation estimate may be biased.
Consequences Related to Distributional Assumptions
For certain distributions, the mean is a poor choice of location estimator. For any given distribution,
there exists an optimal choice--that is, the estimator with minimum variability/noisiness. This optimal
choice may be, for example, the median, the midrange, the midmean, the mean, or something else. The
implication of this is to "estimate" the distribution first, and then--based on the distribution--choose
the optimal estimator. The resulting engineering parameter estimators will have less variability than if
this approach is not followed.
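For example, the competing location estimators mentioned above are all easy to compute. The following
is a minimal sketch, assuming NumPy is available; here the "midrange" is taken as the average of the
minimum and maximum, and the "midmean" as the mean of the observations between the 25th and 75th
percentiles (these working definitions are assumptions for illustration):

import numpy as np

y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

mean = y.mean()
median = np.median(y)
midrange = (y.min() + y.max()) / 2.0

# Midmean: mean of the observations between the 25th and 75th percentiles
ys = np.sort(y)
q1, q3 = np.percentile(y, [25, 75])
midmean = ys[(ys >= q1) & (ys <= q3)].mean()

print(f"mean={mean:.3f} median={median:.3f} midrange={midrange:.3f} midmean={midmean:.3f}")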
Other consequences that flow from problems with distributional assumptions are:
Distribution:
1. The distribution may be changing.
2. The single distribution estimate may be meaningless (if the process distribution is changing).
3. The distribution may be markedly non-normal.
4. The distribution may be unknown.
5. The true probability distribution for the error may remain unknown.

Model:
1. The model may be changing.
2. The single model estimate may be meaningless.
3. The default model Y = constant + error may be invalid.
4. If the default model is insufficient, information about a better model may remain undetected.
5. A poor deterministic model may be fit.
6. Information about an improved model may go undetected.
