Exploratory data analysis in the context of data mining and resampling
ABSTRACT
Today there are quite a few widespread misconceptions about exploratory data analysis (EDA). One of these misconceptions is that EDA is opposed to statistical modeling. Actually, the essence of EDA is not about putting aside all modeling and preconceptions; rather, researchers are urged not to start the analysis with strong preconceptions alone, and thus modeling is still legitimate in EDA. In addition, the nature of EDA has been changing due to the emergence of new methods and the convergence between EDA and other methodologies, such as data mining and resampling. Therefore, conventional conceptual frameworks of EDA might no longer be capable of coping with this trend. In this article, EDA is introduced in the context of data mining and resampling with an emphasis on three goals: cluster detection, variable selection, and pattern recognition. TwoStep clustering, classification trees, and neural networks, which are powerful techniques for accomplishing these goals, respectively, are illustrated with concrete examples.
Key words: exploratory data analysis, data mining, resampling, cross-validation, data visualization, clustering, classification trees, neural network
CONCLUDING REMARKS
This article has introduced several new EDA tools, including TwoStep clustering, recursive classification trees, and neural networks, in the context of data mining and resampling, but these are just a fraction of the plethora of exploratory data mining tools. In each category of EDA there are different methods for accomplishing the same goal, and each method has numerous options (e.g., the number of folds in k-fold cross-validation). In evaluating the efficacy of classification trees and other classifiers, Wolpert and Macready (1997) found that there is no single best method; they termed this phenomenon "no free lunch" because every output comes with a price (a drawback). For instance, simplicity is obtained at the expense of fit, and vice versa.

As illustrated earlier, simplicity can sometimes be an epistemologically sound criterion for selecting the "best" solution. In the example of the PISA data, the classification tree model is preferable to the logistic regression model because of its predictive accuracy. Similarly, in the example of the world's best universities, BIC, which penalizes complexity more heavily, is more favorable than AIC. But in the example of the retention study, when the researcher suspected entangled relationships among the variables, a complex, nonlinear neural net was constructed even though this black box lacks transparency. One way or another, the data explorer must pay a price.

Ultimately, whether a simple or a complex approach should be adopted is tied to usefulness. Altman and Royston (2000) asserted that "usefulness is determined by how well a model works in practice, not by how many zeros there are in associated p values" (p. 454). While this statement pinpoints the blind faith in p values found in inferential statistics, it is also applicable to EDA. A data explorer should not hop among solutions and refuse to commit to a conclusion in the name of exploration; rather, he or she should contemplate which solution could yield more implications for the research community.

Last but not least, exploratory data mining techniques can be employed simultaneously or sequentially. For example, because both neural networks and classification trees are capable of selecting important predictors, they could be run side by side and evaluated by classification agreement and ROC curves. On other occasions, a sequential approach might be more appropriate. For instance, if the researcher suspects that the observations are too heterogeneous to form a single population, clustering could be conducted to divide the sample into sub-samples. Next, variable selection procedures could be run to narrow down the predictor list for each sub-sample. Last, the researcher could focus on the inter-relationships among just a few variables using pattern recognition methods. The combinations and possibilities are virtually limitless. Data detectives are encouraged to explore the data with skepticism and openness.
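As a minimal illustration of how the number of folds enters the analysis as a tuning choice, the following pure-Python sketch estimates held-out accuracy by k-fold cross-validation. The classifier here is a hypothetical one-feature decision stump fit to synthetic two-class data; it stands in for the trees and neural networks discussed above and is not any model from the article itself.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the n case indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stump_fit(X, y):
    """Decision stump on feature 0: threshold at the midpoint of the class means."""
    m0 = [x[0] for x, c in zip(X, y) if c == 0]
    m1 = [x[0] for x, c in zip(X, y) if c == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

def stump_predict(threshold, X):
    return [1 if x[0] > threshold else 0 for x in X]

def cross_val_accuracy(X, y, k=5):
    """Mean accuracy over k held-out folds; k itself is an analytic option."""
    accs = []
    for fold in k_fold_indices(len(X), k):
        test = set(fold)
        X_train = [X[i] for i in range(len(X)) if i not in test]
        y_train = [y[i] for i in range(len(X)) if i not in test]
        threshold = stump_fit(X_train, y_train)
        preds = stump_predict(threshold, [X[i] for i in fold])
        accs.append(sum(p == y[i] for p, i in zip(preds, fold)) / len(fold))
    return sum(accs) / len(accs)

# Synthetic two-class data: class 1 is shifted upward on the single feature.
rng = random.Random(42)
X = [[rng.gauss(0, 1)] for _ in range(100)] + [[rng.gauss(2, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100
print(round(cross_val_accuracy(X, y, k=5), 2))
```

Rerunning `cross_val_accuracy` with different values of k (e.g., 5 versus 10) shows how the fold count trades bias against variance in the accuracy estimate, which is the kind of methodological option the text refers to.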
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19, 453-473.
Baker, B. D., & Richards, C. E. (1999). A comparison of conventional linear regression methods
and neural networks for forecasting educational spending. Economics of Education Review, 18,
405-415.
NIST Sematech. (2006). What is EDA? Retrieved September 30, 2009, from
http://www.itl.nist.gov/div898/handbook/eda/secti on1/eda11.htm
Salford Systems. (2009). Random forest. [Computer software and manual]. San Diego, CA:
Author.
SPSS, Inc. (2009). PASW Statistics 17 [Computer software and manual]. Chicago, IL: Author.
Shmueli, G., Patel, N., & Bruce, P. (2007). Data mining for business intelligence: Concepts, techniques, and applications in Microsoft Office Excel with XLMiner. Hoboken, NJ: Wiley-Interscience.
Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply
here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.