EDA Unit-2
What is EDA?
Approach
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.
Focus
The EDA approach is precisely that--an approach--not a set of techniques, but an attitude/philosophy about how a data analysis should be
carried out.
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the
heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore,
and graphics gives the analyst unparalleled power to do so, enticing the data to reveal its structural
secrets, and being always ready to gain some new, often unsuspected, insight into the data. In
combination with the natural pattern-recognition capabilities that we all possess, graphics provides,
of course, unparalleled power to carry this out. The particular graphical techniques employed in EDA
are often quite simple, consisting of various techniques of:
Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block
plots, and Youden plots).
Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots
of the raw data.
There are three popular approaches to data analysis:
1. Classical
2. Exploratory (EDA)
3. Bayesian
These three approaches are similar in that they all start with a general science/engineering problem
and all yield science/engineering conclusions. The difference is the sequence and focus of the
intermediate steps.
For classical analysis, the sequence is
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is
Problem => Data => Analysis => Model => Conclusions
For Bayesian analysis, the sequence is
Problem => Data => Model => Prior Distribution => Analysis => Conclusions
Thus for classical analysis, the data collection is followed by the imposition of a model (normality,
linearity, etc.), and the analysis, estimation, and testing that follow are focused on the parameters of that
model. For EDA, the data collection is not followed by a model imposition; rather it is followed
immediately by analysis with a goal of inferring what model would be appropriate. Finally, for a
Bayesian analysis, the analyst attempts to incorporate scientific/engineering knowledge/expertise into the
analysis by imposing a data-independent distribution on the parameters of the selected model; the
analysis thus consists of formally combining both the prior distribution on the parameters and the
collected data to jointly make inferences and/or test assumptions about the model parameters.
In the real world, data analysts freely mix elements of all of the above three approaches (and other
approaches). The above distinctions were made to emphasize the major differences among the three
approaches.
1. Model
Classical The classical approach imposes models (both deterministic and probabilistic) on the data.
Deterministic models include, for example, regression models and analysis of variance (ANOVA)
models. The most common probabilistic model assumes that the errors about the deterministic model
are normally distributed--this assumption affects the validity of the ANOVA F tests.
Exploratory The EDA approach does not impose deterministic or probabilistic models on the data;
instead, it allows the data to suggest admissible models that best fit the data.
2. Focus
Classical The two approaches differ substantially in focus. For classical analysis, the focus is on
the model--estimating parameters of the model and generating predicted values from the model.
Exploratory For exploratory data analysis, the focus is on the data--its structure, outliers, and
models suggested by the data.
3. Techniques
Classical Classical techniques are generally quantitative in nature. They include ANOVA, t tests,
chi-squared tests, and F tests.
Exploratory EDA techniques are generally graphical. They include scatter plots, character plots,
box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.
4. Rigor
Classical Classical techniques serve as the probabilistic foundation of science and engineering;
the most important characteristic of classical techniques is that they are rigorous,
formal, and "objective".
Exploratory EDA techniques do not share in that rigor or formality. EDA techniques make up for
that lack of rigor by being very suggestive, indicative, and insightful about what the
appropriate model should be. EDA techniques are subjective and depend on interpretation, which may
differ from analyst to analyst, although experienced analysts commonly arrive at identical
conclusions.
5. Data Treatment
Classical Classical estimation techniques have the characteristic of taking all of the data and
mapping the data into a few numbers ("estimates"). This is both a virtue and a vice.
The virtue is that these few numbers focus on important characteristics (location,
variation, etc.) of the population. The vice is that concentrating on these few
characteristics can filter out other characteristics (skewness, tail length,
autocorrelation, etc.) of the same population. In this sense there is a loss of
information due to this "filtering" process.
Exploratory The EDA approach, on the other hand, often makes use of (and shows) all of the
available data. In this sense there is no corresponding loss of information.
6. Assumptions
Classical The "good news" of the classical approach is that tests based on classical techniques
are usually very sensitive--that is, if atrue shift in location, say, has occurred, such
tests frequently have the power to detect such a shift and to conclude that such a shift
is "statistically significant". The "bad news" is that classical tests depend on
underlying assumptions (e.g., normality), and hence the validity of the test
conclusions becomes dependent on the validity of the underlying assumptions. Worse
yet, the exact underlying assumptions may be unknown to the analyst, or if known,
untested. Thus the validity of the scientific conclusions becomes intrinsicallylinked to
the validity of the underlying assumptions. In practice, if such assumptions are
unknown or untested, the validity of the scientific conclusions becomes suspect.
Exploratory Many EDA techniques make little or no assumptions--they present and show the
data--all of the data--as is, with fewerencumbering assumptions.
How Does Exploratory Data Analysis Differ from
Summary Analysis?
Summary A summary analysis is simply a numeric reduction of a historical data set. It is quite
passive. Its focus is in the past. Quite commonly, its purpose is to simply arrive at a
few key statistics (for example, mean and standard deviation) which may then either
replace the data set or be added to the data set in the form of a summary table.
Exploratory In contrast, EDA has as its broadest goal the desire to gain insight into the
engineering/scientific process behind the data. Whereas summary statistics are passive
and historical, EDA is active and futuristic. In an attempt to "understand" the process
and improve it in the future, EDA uses the data as a "window" to peer into the heart
of the process that generated the data. There is an archival role in the research and
manufacturing world for summary statistics, but there is an enormously larger role for
the EDA approach.
Insight implies detecting and uncovering underlying structure in the data. Such underlying structure
may not be encapsulated in the list of items above; such items serve as the specific targets of an
analysis, but the real insight and "feel" for a data set comes as the analyst judiciously probes and
explores the various subtleties of the data. The "feel" for the data comes almost exclusively from the
application of various graphical techniques, the collection of which serves as the window into the
essence of the data. Graphics are irreplaceable--there are no quantitative analogues that will give the
same insight as well-chosen graphics.
To get a "feel" for the data, it is not enough for the analyst toknow what is in the data; the analyst
also must know what isnot in the data, and the only way to do that is to draw on our own human
pattern-recognition and comparative abilities in the context of a series of judicious graphical
techniques applied to the data.
To get a "feel" for the data, it is not enough for the analyst toknow what is in the data; the analyst
also must know what isnot in the data, and the only way to do that is to draw on our own human
pattern-recognition and comparative abilities in the context of a series of judicious graphical
techniques applied to the data.
1. Quantitative
Quantitative techniques are the set of statistical procedures that yield numeric or tabular output.
Examples of quantitative techniques include:
hypothesis testing
analysis of variance
point estimates and confidence intervals
least squares regression
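A minimal sketch of a few of these quantitative procedures, assuming Python with numpy and scipy (no software is named in the original notes) and using simulated data purely for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.arange(30, dtype=float)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # illustrative data

# Hypothesis test: is the mean of y equal to a hypothesized value of 10?
t_stat, p_value = stats.ttest_1samp(y, popmean=10.0)

# Point estimate and 95% confidence interval for the mean of y
mean_y = y.mean()
ci_low, ci_high = stats.t.interval(0.95, len(y) - 1, loc=mean_y, scale=stats.sem(y))

# One-way analysis of variance across three illustrative groups
f_stat, p_anova = stats.f_oneway(y[:10], y[10:20], y[20:])

# Least squares regression of y on x
fit = stats.linregress(x, y)

print(t_stat, p_value, mean_y, (ci_low, ci_high), f_stat, fit.slope, fit.intercept)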
2. Graphical
On the other hand, there is a large collection of statistical tools that we generally refer to as graphical
techniques. These include:
scatterplots
histograms
probability plots
residual plots
box plots
block plots
The EDA approach relies heavily on these and similar graphical techniques. Graphical procedures are
not just tools that we could use in an EDA context, they are tools that we must use. Such graphical tools
are the shortest path to gaining insight into a data set in terms of:
testing assumptions
model selection
model validation
estimator selection
relationship identification
factor effect determination
outlier detection
If one is not using statistical graphics, then one is forfeiting insight into one or more aspects of the
underlying structure of the data.
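A minimal sketch of several of the graphical tools listed above, again assuming numpy and matplotlib with simulated data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # illustrative data

coef = np.polyfit(x, y, 1)            # least squares line for the residual plot
resid = y - np.polyval(coef, x)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(x, y)              # scatter plot: relationship identification
axes[0, 0].set_title("Scatter plot")
axes[0, 1].hist(y, bins=15)           # histogram: distributional shape, testing assumptions
axes[0, 1].set_title("Histogram")
axes[1, 0].scatter(x, resid)          # residual plot: model validation
axes[1, 0].set_title("Residual plot")
axes[1, 1].boxplot(y)                 # box plot: outlier detection
axes[1, 1].set_title("Box plot")
plt.tight_layout()
plt.show()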
An EDA/Graphics Example
Data:
X Y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68
If the goal of the analysis is to compute summary statistics plus determine the best linear fit for Y as a
function of X, the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
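The summary values above can be reproduced directly from the data table; a minimal sketch, assuming numpy and scipy (the notes do not name any particular software):

import numpy as np
from scipy import stats

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)
resid_sd = np.sqrt(np.sum(resid**2) / (len(y) - 2))  # residual SD, n - 2 degrees of freedom

print(len(y), x.mean(), y.mean())   # N = 11, mean of X = 9.0, mean of Y = 7.5
print(fit.intercept, fit.slope)     # intercept = 3, slope = 0.5 (approximately)
print(resid_sd, fit.rvalue)         # residual standard deviation = 1.237, correlation = 0.816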
The above quantitative analysis, although valuable, gives us only limited insight into the data.
Scatter Plot
In contrast, a simple scatter plot of the data suggests the following:
1. The data set "behaves like" a linear curve with some scatter;
2. there is no justification for a more complicated model (e.g., quadratic);
3. there are no outliers;
4. the vertical spread of the data appears to be of equal height irrespective of the X-value; this indicates
that the data are equally precise throughout, and so a "regular" (that is, equi-weighted) fit is appropriate.
This kind of characterization for the data serves as the core for getting insight/feel for the data. Such
insight/feel does not come from the quantitative statistics; on the contrary, calculations of
quantitative statistics such as intercept and slope should be subsequent to the characterization and
will make sense only if the characterization is true. To illustrate the loss of information that results
when the graphics insight step is skipped, consider the following three data sets.
X2 Y2 X3 Y3 X4 Y4
10.00 9.14 10.00 7.46 8.00 6.58
8.00 8.14 8.00 6.77 8.00 5.76
13.00 8.74 13.00 12.74 8.00 7.71
9.00 8.77 9.00 7.11 8.00 8.84
11.00 9.26 11.00 7.81 8.00 8.47
14.00 8.10 14.00 8.84 8.00 7.04
6.00 6.13 6.00 6.08 8.00 5.25
4.00 3.10 4.00 5.39 19.00 12.50
12.00 9.13 12.00 8.15 8.00 5.56
7.00 7.26 7.00 6.42 8.00 7.91
5.00 4.74 5.00 5.73 8.00 6.89
Conclusions from the scatter plots are:
1. data set 1 is clearly linear with some scatter;
2. data set 2 is clearly quadratic;
3. data set 3 clearly has an outlier;
4. data set 4 is obviously the victim of a poor experimental design with a single point far removed
from the bulk of the data "wagging the dog".
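A minimal sketch, assuming matplotlib, that draws the four scatter plots from the data tables above (note that X2 and X3 are identical to X):

import matplotlib.pyplot as plt

x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

panels = [(x1, y1, "Data set 1: linear with scatter"),
          (x1, y2, "Data set 2: quadratic"),
          (x1, y3, "Data set 3: outlier"),
          (x4, y4, "Data set 4: single influential point")]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (x, y, title) in zip(axes.flat, panels):
    ax.scatter(x, y)
    ax.set_title(title)
plt.tight_layout()
plt.show()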
Importance:
These points are exactly the substance that provide and define "insight" and "feel" for a data set.
They are the goals and the fruits of an open exploratory data analysis (EDA) approach to the data.
Quantitative statistics are not wrong per se, but they are incomplete. They are incomplete because
they are numeric summaries which in the summarization operation do a good job of focusing on a
particular aspect of the data (e.g., location, intercept, slope, degree of relatedness, etc.) by judiciously
reducing the data to a few numbers. Doing so also filters the data, necessarily omitting and screening
out other sometimes crucial information in the focusing operation. Quantitative statistics focus but
also filter; and filtering is exactly what makes the quantitative approach incomplete at best and
misleading at worst. The estimated intercepts (= 3) and slopes (= 0.5) for data sets 2, 3, and 4 are
misleading because the estimation is done in the context of an assumed linear model and that
linearity assumption is the fatal flaw in this analysis. The EDA approach of deliberately postponing
the model selection until further along in the analysis has many rewards, not the least of which is the
ultimate convergence to a much-improved model and the formulation of valid and supportable
scientific and engineering conclusions.
EDA Assumptions:
The gamut of scientific and engineering experimentation is virtually limitless. In this sea of diversity
is there any common basis that allows the analyst to systematically and validly arrive at supportable,
repeatable research conclusions? Fortunately, there is such a basis and it is rooted in the fact that
every measurement process, however complicated, has certain underlying assumptions. This section
deals with what those assumptions are, why they are important, how to go about testing them, and
what the consequences are if the assumptions do not hold.
Underlying Assumptions
There are four assumptions that typically underlie all measurement processes; namely, that the data
from the process at hand "behave like":
1. random drawings;
2. from a fixed distribution;
3. with the distribution having fixed location; and
4. with the distribution having fixed variation.
The "fixed location" referred to in item 3 above differs for different problem types. The simplest
problem type is univariate; that is, a single variable.
For the univariate problem, the general model
response = deterministic component + random component
becomes
response = constant + error
For this case, the "fixed location" is simply the unknown constant. We can thus imagine the process at
hand to be operating under constant conditions that produce a single column of data with the
properties that
the data are uncorrelated with one another;
the random component has a fixed distribution;
the deterministic component consists of only a constant; and
the random component has fixed variation.
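Written out with symbols chosen here purely for illustration (they do not appear in the original notes), this univariate model is
Yi = C + Ei,   i = 1, 2, ..., N
where C is the unknown constant (the fixed location) and the errors Ei are uncorrelated draws from a single fixed distribution with fixed variation.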
Importance
Predictability is an all-important goal in science and engineering. If the four underlying assumptions
hold, then we have achieved probabilistic predictability--the ability to make probability statements
not only about the process in the past, but also about the process in the future. In short, such
processes are said to be "in statistical control".
Moreover, if the four assumptions are valid, then the process is amenable to the generation of valid
scientific and engineering conclusions. If the four assumptions are not valid, then the process is
drifting (with respect to location, variation, or distribution), unpredictable, and out of control. A
simple characterization of such processes by a location estimate, a variation estimate, or a
distribution "estimate" inevitably leads to engineering conclusions that are not valid, are not
supportable (scientifically or legally), and which are not repeatable in the laboratory.
Techniques for Testing Assumptions
The following EDA techniques are simple, efficient, and powerful for the routine testing of
underlying assumptions:
1. run sequence plot (Yi versus i)
2. lag plot (Yi versus Yi-1)
3. histogram (counts versus subgroups of Y)
4. normal probability plot (ordered Y versus theoretical ordered Y)
The four EDA plots can be juxtaposed for a quick look at the characteristics of the data. The plots
below are ordered as follows:
1. Run sequence plot - upper left
2. Lag plot - upper right
3. Histogram - lower left
4. Normal probability plot - lower right.
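A minimal sketch of such a 4-plot, assuming matplotlib and scipy and using simulated, normally distributed data in place of real process measurements:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=1.0, size=200)    # stand-in for the measured process

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(np.arange(1, y.size + 1), y)     # run sequence plot: Yi versus i
axes[0, 0].set_title("Run sequence plot")
axes[0, 1].scatter(y[:-1], y[1:])                # lag plot: Yi versus Yi-1
axes[0, 1].set_title("Lag plot")
axes[1, 0].hist(y, bins=20)                      # histogram
axes[1, 0].set_title("Histogram")
stats.probplot(y, dist="norm", plot=axes[1, 1])  # normal probability plot
axes[1, 1].set_title("Normal probability plot")
plt.tight_layout()
plt.show()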
This 4-plot reveals a process that has fixed location, fixed variation, is random, apparently has a fixed approximately normal distribution,
and has no outliers.
Sample Plot: Assumptions Do Not Hold
If one or more of the four underlying assumptions do not hold, then this will show up in the various plots, as demonstrated in the
following example.
This 4-plot reveals a process that has fixed location, fixed variation, is non-random (oscillatory), has a
non-normal, U-shaped distribution, and has several outliers.
Interpretation of 4-Plot
The four EDA plots discussed above are used to test the underlying assumptions:
1. Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and non-
drifting.
2. Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot
will be approximately the same over the entire horizontal axis.
3. Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.
4. Fixed Distribution: If the fixed distribution assumption holds, in particular if the fixed normal distribution
assumption holds, then
1. the histogram will be bell-shaped, and
2. the normal probability plot will be linear.
Conversely, the underlying assumptions are tested using the EDA plots:
Run Sequence Plot: If the run sequence plot is flat and non-drifting, the fixed-location assumption holds. If
the run sequence plot has a vertical spread that is about the same over the entire plot, then the fixed-variation
assumption holds.
Lag Plot: If the lag plot is structureless, then the randomness assumption holds.
Histogram: If the histogram is bell-shaped, the underlying distribution is symmetric and perhaps
approximately normal.
Normal Probability Plot: If the normal probability plot is linear, the underlying distribution is approximately
normal.
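The linearity of the normal probability plot can also be checked numerically: scipy's probplot returns the correlation coefficient of the least-squares line through the plotted points, and a value close to 1 indicates an approximately linear plot. A minimal sketch, with simulated data standing in for real measurements:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(size=100)   # illustrative data

# (osm, osr) are the theoretical and ordered sample quantiles;
# r is the correlation coefficient of the fitted line through the plot.
(osm, osr), (slope, intercept, r) = stats.probplot(y, dist="norm")
print(r)   # near 1 for approximately normal data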
If all four of the assumptions hold, then the process is said definitionally to be "in statistical control".
Consequences
The primary goal is to have correct, validated, and complete scientific/engineering conclusions flowing from
the analysis. This usually includes intermediate goals such as the derivation of a good-fitting model and the
computation of realistic parameter estimates. It should always include the ultimate goal of an understanding
and a "feel" for "what makes the process tick". There is no more powerful catalyst for discovery than the
bringing together of an experienced/expert scientist/engineer and a data set ripe with intriguing "anomalies"
and characteristics.
The following sections discuss in more detail the consequences of invalid assumptions:
1. Consequences of non-randomness
2. Consequences of non-fixed location parameter
3. Consequences of non-fixed variation
4. Consequences related to distributional assumptions
Consequences of Non-Randomness
There are four underlying assumptions:
1. randomness;
2. fixed location;
3. fixed variation; and
4. fixed distribution.
The randomness assumption is the most critical but the least tested.
Consequences of Non-Randomness
If the randomness assumption does not hold, then
1. All of the usual statistical tests are invalid.
2. The calculated uncertainties for commonly used statistics become meaningless.
3. The calculated minimal sample size required for a pre-specified tolerance becomes meaningless.
4. The simple model: y = constant + error becomes invalid.
5. The parameter estimates become suspect and non-supportable.
One specific and common type of non-randomness is autocorrelation. Autocorrelation is the correlation
between Yt and Yt-k, where k is an integer that defines the lag for the autocorrelation. That is,
autocorrelation is a time dependent non-randomness.
This means that the value of the current point is highly dependent on the previous point if k = 1 (or k points
ago if k is not 1). Autocorrelation is typically detected via an autocorrelation plot or a lag plot. If the data are
not random due to autocorrelation, then
1. Adjacent data values may be related.
2. There may not be n independent snapshots of the phenomenon under study.
3. There may be undetected "junk" outliers.
4. There may be undetected "information-rich" outliers.
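A minimal sketch, assuming numpy and using a simulated autocorrelated series purely for illustration, of estimating the lag-k autocorrelation directly from the definition above:

import numpy as np

rng = np.random.default_rng(3)
y = np.empty(200)
y[0] = 0.0
for t in range(1, y.size):               # each value depends on the previous one
    y[t] = 0.8 * y[t - 1] + rng.normal()

def autocorr(y, k):
    """Sample autocorrelation between Yt and Yt-k."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return np.sum(d[k:] * d[:-k]) / np.sum(d**2)

print(autocorr(y, 1))   # well away from 0 here, so the data are not random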
Consequences of Non-Fixed Location Parameter
If the run sequence plot does not support the assumption of fixed location, then
1. The location may be drifting.
2. The single location estimate may be meaningless (if the process is drifting).
3. The choice of location estimator (e.g., the sample mean) may be sub-optimal.