Exploratory data analysis in the context of data mining and resampling

ABSTRACT
Today there are quite a few widespread misconceptions about exploratory data analysis (EDA). One
of these misconceptions is that EDA is opposed to statistical modeling. Actually, the
essence of EDA is not about putting aside all modeling and preconceptions; rather, researchers
are urged not to begin the analysis with strong preconceptions alone, and thus modeling is still
legitimate in EDA. In addition, the nature of EDA has been changing due to the emergence of
new methods and convergence between EDA and other methodologies, such as data mining
and resampling. Therefore, conventional conceptual frameworks of EDA might no longer be
capable of coping with this trend. In this article, EDA is introduced in the context of data mining
and resampling with an emphasis on three goals: cluster detection, variable selection, and
pattern recognition. Two Step clustering, classification trees, and neural networks, which are
powerful techniques to accomplish the preceding goals, respectively, are illustrated with
concrete examples.

Key words: exploratory data analysis, data mining, resampling, cross-validation, data visualization, clustering, classification trees, neural networks

CONVENTIONAL VIEWS OF EDA


Exploratory data analysis was named by Tukey (1977) as an alternative to confirmatory data analysis (CDA). As mentioned
before, EDA is an attitude or philosophy about how data analysis should be carried out, instead
of being a fixed set of techniques. Tukey (1977) often related EDA to detective work. In EDA, the
role of the researcher is to explore the data in as many ways as possible until a plausible “story”
of the data emerges. Therefore, the “data detective” should be skeptical of the “face” value of
the data and keep an open mind to unanticipated results when the hidden patterns are
unearthed. Throughout many years, different researchers formulated different definitions,
classifications, and taxonomies of EDA. For example, Velleman and Hoaglin (1981) outlined four
basic elements of exploratory data analysis: residual, re-expression (data transformation),
resistant, and display (data visualization). Based upon Velleman and Hoaglin’s framework,
Behrens and Yu (2003) elaborated the above four elements with updated techniques, and
renamed “display” to “revelation.” Each of them is briefly introduced as follows:
1. Residual analysis: EDA follows the formula that data = fit + residual or data = model + error.
The fit or the model is the expected values of the data whereas the residual or the error is the
values that deviate from that expected value. By examining the residuals, the researcher can
assess the model’s adequacy (Yu, 2009b).
2. Re-expression or data transformation: When the distribution is skewed or the data structure obscures the pattern, the data can be rescaled in order to improve interpretability. Typical examples of data transformation include using natural log transformation or inverse probability transformation to normalize a distribution, using square root transformation to stabilize variances, and using logarithmic transformation to linearize a trend (Yu, 2009b).
3. Resistance
procedures: Parametric tests are based on the mean estimation, which is sensitive to outliers or
skewed distributions. In EDA, resistant estimators are usually used. The following are common
examples: median, trimean (a measure of central tendency based on the arithmetic average of
the values of the first quartile, the third quartile, and the median counted twice), Winsorized
mean (a robust version of the mean in which extreme scores are pulled back to the majority of
the data), and trimmed mean (a mean computed after the most extreme scores are removed). It is important to point out that there is a subtle difference between “resistance” and “robustness,” though the two terms are often used interchangeably. Resistance is about being immune to outliers, while robustness is about being immune to assumption violations. In the former the goal is to obtain a data summary, while in the latter the goal is to make a probabilistic inference. (A brief computational sketch of re-expression and resistant estimation appears after this list.)
4. Revelation or data visualization:
Graphing is a powerful tool for revealing hidden patterns and relationships among variables.
Typical examples of graphical tools for EDA are Trellis displays and 3D plots (Yu & Stockford,
2003). Although the use of scientific and statistical visualization is fundamental to EDA, the two should not be equated, because data visualization is concerned with just one aspect of data characterization (patterns), whereas EDA encompasses a wider focus, as introduced in the previous three elements (NIST Sematech, 2006).
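
To make elements 2 and 3 concrete, here is a minimal computational sketch of re-expression and resistant estimation. The skewed variable is simulated, and the 10% trimming and Winsorizing levels are illustrative assumptions, not recommendations from the article.

```python
# Minimal sketch of re-expression (element 2) and resistant estimation
# (element 3) on a hypothetical right-skewed variable.
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3.0, sigma=1.0, size=500)   # simulated skewed data

# Re-expression: a natural log transformation pulls in the long right tail;
# a square root transformation is a milder alternative.
x_log = np.log(x)
x_sqrt = np.sqrt(x)

# Resistant estimators of central tendency (less sensitive to outliers).
q1, median, q3 = np.percentile(x, [25, 50, 75])
trimean = (q1 + 2 * median + q3) / 4               # median counted twice
trimmed = stats.trim_mean(x, 0.1)                  # drop the extreme 10% in each tail
winsorized = mstats.winsorize(x, limits=(0.1, 0.1)).mean()  # pull extremes back, then average

print("mean:", x.mean(), "median:", median, "trimean:", trimean,
      "trimmed mean:", trimmed, "Winsorized mean:", winsorized)
```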
According to NIST Sematech (2006), EDA entails a variety of techniques for accomplishing the following tasks: 1) maximize insight; 2)
uncover underlying structure; 3) extract important variables; 4) detect outliers and anomalies;
5) test underlying assumptions; 6) develop parsimonious models; and 7) determine optimal
factor settings. Comparing NIST's EDA approach with Velleman and Hoaglin's, and Behrens and Yu's, it is not difficult to see many common threads. For example, “maximize insight” and “uncover underlying structure” are similar to revelation.

LIMITATIONS OF CONVENTIONAL VIEWS OF EDA

Although the preceding EDA framework provides researchers with helpful guidelines in
EDA Although the preceding EDA framework provides researchers with helpful guidelines in
data analysis, some of the above elements are no longer as important as before due to the
emergence of new methods and convergence between EDA and other methodologies, such as
data mining and resampling. Data mining is a cluster of techniques that has been employed in
the Business Intelligence (BI) field for many years (Han & Kamber, 2006). According to Larose
(2005), data mining is the process of automatically extracting useful information and
relationships from immense quantities of data. Data mining does not start with a strong
preconception, a specific question, or a narrow hypothesis; rather, it aims to detect patterns that
are already present in the data. Similarly, Luan (2002) views data mining as an extension of EDA.
Like EDA, resampling departs from theoretical distributions used by CDA. Rather, its inference is
based upon repeated sampling within the same sample, and that is why this school is called
resampling (Yu, 2003, 2007). How these two methodologies alter the features of EDA will be
discussed next.

CATEGORIES AND TECHNIQUES OF EDA

Clustering: Two Step cluster analysis

Clustering is essentially grouping observations based upon their proximity to each other on multiple
dimensions. At first glance, clustering analysis is similar to discriminant analysis. But in the
latter the analyst must know the group membership for the classification in advance. Because
discriminant analysis assigns cases to pre-existing groups, it is not as exploratory as cluster
analysis, which aims to identify the grouping categories in the first place. If there are just two
dimensions (variables), the analyst could simply use a scatterplot to look for the clumps. But
when there are many variables, the task becomes more challenging and thus it necessitates
algorithms. There are three major types of clustering algorithms: 1) Hierarchical clustering, 2)
non-hierarchical clustering (k-means clustering), and 3) Two Step clustering. The last one is
considered the most versatile because it has several desirable features that are absent in other
clustering methods. For example, both hierarchical clustering and k-means clustering could
handle continuous variables only, but Two Step clustering accepts both categorical and
continuous variables. This is the case because in Two Step clustering the distance measurement
is based on the log-likelihood method (Chiu et al., 2001). In computing log-likelihood, the
continuous variables are assumed to have a normal distribution and the categorical variables
are assumed to have a multinomial distribution. Nevertheless, the algorithm is reasonably
robust against the violation of these assumptions, and thus assumption checking is unnecessary.
Second, while k-means clustering requires a pre-specified number of clusters and therefore
strong prior knowledge is required, Two Step clustering is truly data-driven due to its capability
of automatically returning the number of clusters. Last but not least, while hierarchical
clustering is suitable for small data sets only, Two Step clustering is so scalable that it can
analyze thousands of observations efficiently. As the name implies, Two Step clustering is
composed of two steps. The first step is called pre-clustering. In this step, the procedure
constructs a cluster features (CF) tree by scanning all cases one by one (Zhang et al., 1996).
When a case is scanned, the pre-cluster algorithm applies the log likelihood distance measure to
determine whether the case should be merged with other cases or form a new pre-cluster on its
own and wait for similar cases in further scanning. After all cases are exhausted, all pre-clusters
are treated as entities and become the raw data for the next step. In this way, the task is
manageable no matter how large the sample size is, because the size of the distance matrix is
dependent on just a few pre-clusters rather than all cases. Also, the researcher has the option to
turn on outlier handling. If this option is selected, entries that cannot fit into any pre-clusters
are treated as outliers at the end of CF-tree building. Further, in this pre-clustering step, all
continuous variables are automatically standardized. In other words, there is no need for the
analyst to perform outliers detection and data transformation in separate steps. In step two,
the hierarchical clustering algorithm is applied to the pre-clusters and then proposes a set of solutions. To determine the best number of clusters, the solutions are compared against each
other based upon the Akaike Information Criterion (AIC) (Akaike, 1973) or the Bayesian
Information Criterion (BIC) (Schwarz, 1978). AIC is a fitness index for trading off the complexity
of a model against how well the model fits the data. To reach a balance between fitness and
parsimony, AIC not only rewards goodness of fit, but also gives a penalty to overfitting and
complexity. Hence, the best model is the one with the lowest AIC value. However, both Berk
(2008) and Shmueli (2009) agreed that although AIC is a good measure of predictive accuracy, it
can be over-optimistic in estimating fitness. In addition, because AIC aims to yield a predictive
model, using AIC for model selection is inappropriate for a model of causal explanation. BIC was
developed as a remedy to AIC. Like AIC, BIC also uses a penalty against complexity, but this
penalty is much stronger than that of AIC. In this sense, BIC is in alignment with Ockham's razor: all other things being equal, the simplest model tends to be the best one.
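
For reference, the standard forms of the two criteria, where k is the number of estimated parameters, L is the maximized likelihood, and n is the sample size, are:

```latex
\mathrm{AIC} = 2k - 2\ln L, \qquad \mathrm{BIC} = k\ln(n) - 2\ln L
```

Because the BIC penalty k ln(n) exceeds the AIC penalty 2k whenever n is larger than about 7, BIC favors simpler solutions at virtually any realistic sample size.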
To illustrate Two Step clustering, a data set listing 400 of the world's best colleges and universities compiled
by U.S. News and World Report (2009) was utilized. The criteria used by U.S. News and World
Report for selecting the best institutions include: Academic peer review score, employer review
score, student to faculty score, international faculty score, international students score, and
citations per faculty score. However, an educational researcher might not find the list helpful
because the report ranks these institutions by the overall scores. It is tempting for the
educational researcher to learn about how these best institutions relate to each other and what
their common threads are. In addition to the preceding measures, geographical location could
be taken into account. Because the data set contains both categorical and continuous variables,
the researcher employed Two Step cluster analysis in Predictive Analytics SoftWare
(PASW) Statistics (SPSS Inc., 2009). It is important to note that the clustering result may be
affected by the order of the cases in the file. In the original data set, the table has been sorted
by the rank in an ascending order. In an effort to minimize the order effect, the cases were re-
arranged in random order before the analysis was conducted. To run a Two Step cluster analysis,
the researcher must assign the categorical and continuous variables into the proper fields, as
shown in Figure 1; in this example, BIC rather than AIC was used for simplicity.
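
Two Step clustering itself is implemented in PASW/SPSS rather than in common open-source libraries, but its key data-driven idea, letting an information criterion choose the number of clusters, can be sketched with scikit-learn's GaussianMixture as a rough stand-in. Unlike Two Step, this stand-in handles only continuous variables, and the file and column names below are hypothetical.

```python
# Minimal sketch: choose the number of clusters by BIC, in the spirit of
# Two Step clustering's automatic cluster-count selection.
# NOTE: GaussianMixture is a stand-in, not the SPSS/PASW Two Step algorithm.
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names for the ranking data set.
df = pd.read_csv("best_colleges_2009.csv")
continuous_cols = ["peer_review", "employer_review", "student_faculty",
                   "intl_faculty", "intl_students", "citations"]
X = StandardScaler().fit_transform(df[continuous_cols])

bic_by_k = {}
for k in range(2, 11):                        # candidate numbers of clusters
    gm = GaussianMixture(n_components=k, n_init=5, random_state=1).fit(X)
    bic_by_k[k] = gm.bic(X)                   # lower BIC = preferred solution

best_k = min(bic_by_k, key=bic_by_k.get)
labels = GaussianMixture(n_components=best_k, n_init=5,
                         random_state=1).fit_predict(X)
print("Cluster solution chosen by BIC:", best_k)
```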

EDA AND RESAMPLING


At first glance, exploratory data mining is very similar to conventional EDA except that the
former employs certain advanced algorithms for automation. Actually, the differences between
conventional EDA and exploratory data mining could be found at the epistemological level. As
mentioned before, EDA suggests variables, constructs and hypotheses that are worth pursuing
and CDA takes the next step to confirm the findings. However, using resampling (Yu, 2003,
2007), data mining is capable of suggesting and validating a model at the same time. One may
argue that data mining should be classified as a form of CDA when validation has taken place. It
is important to point out that exploratory data mining usually aims to yield prediction rather
than theoretical explanations of the relationships between variables (Shmueli & Koppius, 2008;
Yu, in press). Hence, the researcher still has to construct a theoretical model in the context of
CDA (e.g. structural equation modeling) if explanation is the research objective.
Resampling in the context of exploratory data mining addresses two important issues, namely,
generalization across samples and under-determination of theory by evidence (Kieseppa, 2001).
It is very common that one sample yields one set of best predictors in a regression analysis, but another sample yields a different set (Thompson, 1995). In other words, this kind of analysis can provide a post hoc model for an existing sample (in-sample forecasting), but the model may fail in out-of-sample forecasting. This occurs when a specific model is overfitted to a specific data set, which weakens the generalizability of the conclusion. Further, even if a researcher finds a so-called best-fitting model, there may be
numerous possible models to fit the same data. To counteract the preceding problems, most
data mining procedures employ cross-validation to enhance generalizability. For example, to remediate the problem of under-determination of theory by data, neural networks can search through different candidate models with a genetic algorithm, which begins by randomly generating pools of equations. These initial randomly generated equations are fitted to the training data set
and prediction accuracy of the outcome measure is assessed using the test set to identify a
family of the fittest models. Next, these equations are hybridized or randomly recombined to
create the next generation of equations. Parameters from the surviving population of equations
may be combined or excluded to form new equations as if they were genetic traits inherited
from their “parents.” This process continues until no further improvement in predicting the
outcome measure of the test set can be achieved (Baker & Richards, 1999). In addition to cross-
validation, bootstrapping, another resampling technique, is also widely employed in data mining
(Salford Systems, 2009), but it is beyond the scope of this article to introduce bootstrapping.
Interested readers are encouraged to consult Yu (2003, 2007).
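
To make the cross-validation idea concrete, the following minimal sketch estimates the out-of-sample accuracy of a classification tree with 10-fold cross-validation in scikit-learn; the data file, the binary "retained" outcome, and the tree settings are hypothetical.

```python
# Minimal sketch: k-fold cross-validation estimates out-of-sample accuracy
# instead of judging a model only by its fit to the sample at hand.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("retention.csv")            # hypothetical data set
X = df.drop(columns=["retained"])            # predictors
y = df["retained"]                           # binary outcome

tree = DecisionTreeClassifier(max_depth=4, random_state=1)

# 10-fold cross-validation: each fold serves once as the hold-out test set.
scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
print("Mean out-of-sample accuracy: %.3f (SD %.3f)" % (scores.mean(), scores.std()))
```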

CONCLUDING REMARKS
This article introduces several new EDA tools, including Two Step clustering, recursive classification trees, and neural networks, in the context of data mining and resampling, but these are just a fraction of the plethora of exploratory data mining tools available. In each category of EDA
there are different methods to accomplish the same goal, and each method has numerous
options (e.g., the number of folds in k-fold cross-validation). In evaluating the efficacy of classification
trees and other classifiers, Wolpert and Macready (1997) found that there is no single best method and termed this phenomenon "no free lunch": every output comes with a price
(drawback). For instance, simplicity is obtained at the expense of fitness, and vice versa. As
illustrated before, sometimes simplicity could be an epistemologically sound criterion for
selecting the “best” solution. In the example of PISA data, the classification tree model is
preferable to the logistic regression model because of its predictive accuracy. Also, in the example of the world's best universities, BIC, which tends to impose heavier penalties on complexity, is more favorable than AIC. But in the example of the retention study, when the
researcher suspected that there are entangled relationships among variables, a complex,
nonlinear neural net was constructed even though this black box lacks transparency. One way or the other, the data explorer must pay a price. Ultimately, whether a simple or a complex
approach should be adopted is tied to usefulness. Altman and Royston (2000) asserted that
“usefulness is determined by how well a model works in practice, not by how many zeros there
are in associated p values" (p. 454). While this statement pinpoints the blind faith in p values among users of inferential statistics, it is also applicable to EDA. A data explorer should not hop around
solutions and refuse to commit himself/herself to a conclusion in the name of exploration;
rather, he/she should contemplate which solution could yield more implications for the
research community. Last but not least, exploratory data mining techniques could be
simultaneously or sequentially employed. For example, because both neural networks and
classification trees are capable of selecting important predictors, they could be run side by side
and evaluated by classification agreement and ROC curves (a brief sketch of this side-by-side comparison appears below).
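
A minimal sketch of such a side-by-side comparison, assuming a hypothetical data set with a binary outcome and using scikit-learn's decision tree and multilayer perceptron as stand-ins for the classification tree and neural network:

```python
# Minimal sketch: run a classification tree and a neural network side by side,
# then compare them by classification agreement and area under the ROC curve.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("retention.csv")                     # hypothetical data set
X, y = df.drop(columns=["retained"]), df["retained"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
net = make_pipeline(StandardScaler(),                 # scale inputs for the net
                    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                                  random_state=1)).fit(X_train, y_train)

agreement = (tree.predict(X_test) == net.predict(X_test)).mean()
print("Classification agreement:", round(agreement, 3))
print("Tree ROC AUC:", roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))
print("Net ROC AUC: ", roc_auc_score(y_test, net.predict_proba(X_test)[:, 1]))
```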
On other occasions, a sequential approach might be more appropriate. For instance, if the researcher suspects that the
observations are too heterogeneous to form a single population, clustering could be conducted
to divide the sample into sub-samples. Next, variable selection procedures could be run to
narrow down the predictor list for each sub-sample. Last, the researcher could focus on the
inter-relationships among just a few variables using pattern recognition methods. The
combinations and possibilities are virtually limitless. Data detectives are encouraged to explore
the data with skepticism and openness.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model?
Statistics in Medicine, 19, 453-473.

Baker, B. D., & Richards, C. E. (1999). A comparison of conventional linear regression methods
and neural networks for forecasting educational spending. Economics of Education Review, 18,
405-415.

NIST Sematech. (2006). What is EDA? Retrieved September 30, 2009, from http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm

Salford Systems. (2009). Random forest. [Computer software and manual]. San Diego, CA:
Author.

SPSS, Inc. (2009). PASW Statistics 17 [Computer software and manual]. Chicago, IL: Author.

Shmueli, G., Patel, N., & Bruce, P. (2007). Data mining for business intelligence: Concepts,
techniques, and applications in Microsoft Office Excel with XLMiner. Hoboken, N.J.: Wiley-
Interscience.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply
here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.


Velleman, P. F., & Hoaglin, D. C. (1981). Applications, basics, and computing of exploratory data analysis. Boston, MA: Duxbury Press.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation, 1(1), 67–82.

Yu, C. H. (2003). Resampling methods: Concepts, applications, and justification. Practical Assessment, Research and Evaluation, 8(19). Retrieved July 4, 2009, from http://pareonline.net/getvn.asp?v=8&n=19
Yu, C. H. (2007). Resampling: A conceptual and procedural introduction. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 283-298). Thousand Oaks, CA: Sage Publications.
Yu, C. H. (2009b). Exploratory data analysis and data visualization. Retrieved October 10, 2009, from http://www.creativewisdom.com/teaching/WBI/EDA.shtml
