Exploratory Data Analysis in the Context of Data Mining and Resampling
International Journal of Psychological Research, 2010, Vol. 3, No. 1, 9-22.
ISSN printed: 2011-2084; ISSN electronic: 2011-2079.
To cite this article: Yu, Chong Ho (2010). Exploratory data analysis in the context of data mining and resampling. International Journal of Psychological Research, 3(1), 9-22.
Chong Ho Yu
Arizona State University
ABSTRACT
Today there are quite a few widespread misconceptions about exploratory data analysis (EDA). One of these misconceptions is that EDA is opposed to statistical modeling. Actually, the essence of EDA is not about putting aside all modeling and preconceptions; rather, researchers are urged not to start the analysis with only a strong preconception, and thus modeling is still legitimate in EDA. In addition, the nature of EDA has been changing due to the emergence of new methods and the convergence between EDA and other methodologies, such as data mining and resampling. Therefore, conventional conceptual frameworks of EDA might no longer be capable of coping with this trend. In this article, EDA is introduced in the context of data mining and resampling with an emphasis on three goals: cluster detection, variable selection, and pattern recognition. TwoStep clustering, classification trees, and neural networks, which are powerful techniques for accomplishing these respective goals, are illustrated with concrete examples.
Key words: exploratory data analysis, data mining, resampling, cross-validation, data visualization, clustering,
classification trees, neural networks
Article received: December 2009. Article accepted: March 15, 2010.
Correspondence: Chong Ho Yu, Ph.D., Director of Research and Assessment, Applied Learning Technologies Institute, Arizona State University, 1475 N Scottsdale Rd, Scottsdale, AZ 85257. Email: [email protected]; [email protected]
Exploratory data analysis (EDA) was introduced by Tukey and his colleagues about four decades ago (Tukey, 1969, 1977, 1986a, 1986b, 1986c; Tukey & Wilk, 1986), and since then numerous publications regarding EDA have become available to researchers (e.g., Behrens, 1997; Behrens & Yu, 2003; Fielding, 2007; Martinez, 2005; Myatt, 2007; Schwaiger & Opitz, 2001; Velleman & Hoaglin, 1981). Although EDA is no longer considered a new methodology, the author of this article, based upon teaching and consulting experiences, has observed that there are still quite a few widespread misconceptions about EDA. This phenomenon is partly due to the fact that EDA is a philosophy or mentality (skepticism and openness) (Hartwig & Dearing, 1979) rather than a fixed set of formal procedures, and partly owing to the trend that emerging methods, such as data mining and resampling, have been gradually changing the nature of EDA. As a remedy to those misconceptions, this paper starts by clarifying what EDA is not, then introduces conventional EDA and its limitations; next, EDA in the new context of data mining and resampling is illustrated with concrete examples. Although these examples are from education or educational psychology, the principles of analyzing these data sets could be extended to experimental psychology as well as other branches of psychology.

… structure as possible into a model and then using graphs to find patterns that represent deviations from the current model (Gelman, 2004). Following this line of reasoning, model-based clustering, which is based upon certain probabilistic inferences, is considered legitimate in EDA (Martinez, 2005).

It is difficult for a data analyst to start with a "blank mind" and explore the data without any reference. Traditionally, researchers classify the modes of reasoning in research as induction (data-driven) and deduction (theory- or hypothesis-driven). Actually, there is a third avenue: abduction. Abductive reasoning does not necessarily start with fully developed models, or with no models at all. For example, when Kepler developed his astronomical model, he had some basic preconceptions: very general "hunches" about the nature of motion and forces, and the basic idea that the Sun is the source of the forces driving the planetary system. It is beyond the scope of this article to discuss abductive logic thoroughly; interested readers are advised to consult Yu (1994, 2006, 2009a). In alignment with abduction, the essence of EDA is not about putting aside all modeling and preconceptions; rather, researchers are urged not to start the analysis with only a strong preconception.

CONVENTIONAL VIEWS OF EDA
… pattern, the data could be rescaled in order to improve interpretability. Typical examples of data transformation include using a natural log or inverse probability transformation to normalize a distribution, using a square root transformation to stabilize variances, and using a logarithmic transformation to linearize a trend (Yu, 2009b).
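As a quick illustration of these re-expressions, consider a minimal Python sketch (the lognormal sample is invented for demonstration; it is not data from the article):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    income = rng.lognormal(mean=10, sigma=1, size=500)  # a right-skewed variable, like annual income

    log_income = np.log(income)    # natural log: pulls in the long right tail
    sqrt_income = np.sqrt(income)  # square root: a milder, variance-stabilizing re-expression

    # Skewness closer to 0 indicates a more symmetric, near-normal shape
    print(skew(income), skew(log_income), skew(sqrt_income))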
3. Resistance procedures: Parametric tests are based on mean estimation, which is sensitive to outliers and skewed distributions. In EDA, resistant estimators are usually used instead. Common examples are the median; the trimean (a measure of central tendency based on the arithmetic average of the first quartile, the third quartile, and the median counted twice); the Winsorized mean (a robust version of the mean in which extreme scores are pulled back toward the majority of the data); and the trimmed mean (a mean computed without outliers). It is important to point out that there is a subtle difference between "resistance" and "robustness," although the two terms are often used interchangeably. Resistance is about being immune to outliers, while robustness is about being immune to assumption violations. In the former, the goal is to obtain a data summary; in the latter, the goal is to make a probabilistic inference.
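These resistant summaries are easy to compute directly from their definitions; a minimal Python sketch follows (the scores and the 10% trimming/Winsorizing proportions are illustrative choices, not prescriptions from the article):

    import numpy as np
    from scipy import stats
    from scipy.stats.mstats import winsorize

    x = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 40])  # hypothetical scores with one outlier

    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    trimean = (q1 + 2 * median + q3) / 4          # quartiles plus the median counted twice
    trimmed = stats.trim_mean(x, 0.1)             # drop the most extreme 10% in each tail
    winsorized = winsorize(x, limits=[0.1, 0.1]).mean()  # pull extremes back toward the majority

    print(np.mean(x), median, trimean, trimmed, winsorized)
    # The ordinary mean (8.3) is dragged up by the outlier; the resistant estimators stay near 5.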
4. Revelation or data visualization: Graphing is a powerful tool for revealing hidden patterns and relationships among variables. Typical examples of graphical tools for EDA are Trellis displays and 3D plots (Yu & Stockford, 2003). Although the use of scientific and statistical visualization is fundamental to EDA, the two should not be equated, because data visualization is concerned with just one aspect of data characterization (patterns), whereas EDA encompasses a wider focus, as introduced in the previous three elements (NIST Sematech, 2006).

According to NIST Sematech (2006), EDA entails a variety of techniques for accomplishing the following tasks: 1) maximize insight; 2) uncover underlying structure; 3) extract important variables; 4) detect outliers and anomalies; 5) test underlying assumptions; 6) develop parsimonious models; and 7) determine optimal factor settings. Comparing NIST's EDA approach with Velleman and Hoaglin's and with Behrens and Yu's, it is not difficult to see many common threads. For example, "maximize insight" and "uncover underlying structure" are similar to revelation.

LIMITATIONS OF CONVENTIONAL VIEWS OF EDA

Although the preceding EDA framework provides researchers with helpful guidelines for data analysis, some of the above elements are no longer as important as before, due to the emergence of new methods and the convergence between EDA and other methodologies, such as data mining and resampling. Data mining is a cluster of techniques that has been employed in the Business Intelligence (BI) field for many years (Han & Kamber, 2006). According to Larose (2005), data mining is the process of automatically extracting useful information and relationships from immense quantities of data. Data mining does not start with a strong preconception, a specific question, or a narrow hypothesis; rather, it aims to detect patterns that are already present in the data. Similarly, Luan (2002) views data mining as an extension of EDA. Like EDA, resampling departs from the theoretical distributions used by confirmatory data analysis (CDA); instead, its inference is based upon repeated sampling within the same sample, which is why this school is called resampling (Yu, 2003, 2007). How these two methodologies alter the features of EDA is discussed next.

Checking assumptions

In multiple regression analysis, the assumption of the absence of multicollinearity (high correlations among predictors) must be met for the independent variables. If multicollinearity exists, the variance, standard error, and parameter estimates are probably all inflated. In addition to computing the variance inflation factor, it is common practice to use a scatterplot matrix, a data visualization technique for EDA, to examine the inter-relationships among the predictors. While checking underlying assumptions plays an important role in conventional EDA, many new EDA techniques based upon data mining are non-parametric in nature. For example, recursive partition trees and neural networks are immune to multicollinearity (Carpio & Hermosilla, 2002; Fielding, 2007).
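The variance inflation factor itself follows directly from its definition, VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by regressing predictor j on the remaining predictors. A minimal sketch, assuming only a numeric predictor matrix:

    import numpy as np

    def vif(X):
        """Variance inflation factor for each column of predictor matrix X."""
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        out = []
        for j in range(p):
            y = X[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(n), others])     # add an intercept column
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS fit of X_j on the rest
            resid = y - A @ beta
            r2 = 1 - resid.var() / y.var()
            out.append(1.0 / (1.0 - r2))                  # VIF_j = 1 / (1 - R^2_j)
        return np.array(out)

    # Values above roughly 10 are a common rule-of-thumb red flag for multicollinearity.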
Spotting outliers

In the past it was correct to say that outliers were detrimental to data analysis, because the slope of a regression line could be driven by a single extreme datum point; thus it was logical to assert that spotting outliers is an indispensable step in EDA. However, TwoStep clustering, a sophisticated EDA algorithm, has built-in mechanisms to handle outliers during the clustering process. Indeed, before the analysis the researcher could not tell which case is an outlier, because the references (the clusters) have not been formed yet. Further, the recursive partition tree, a newer EDA technique arising from data mining, is also immune to outliers (Fielding, 2007).

Data transformation

Data transformation is used as a powerful technique to improve the interpretability of the data. But in the recursive partition tree, the independent variables do not require any transformation at all (Fielding, 2007). In addition, Osborne (2002) asserted that many transformations that reduce non-normality by changing the spacing between data points raise issues in the interpretation of data rather than improving it.
If transformations are done correctly, all data points remain in the same relative order as before the transformation, so the ordering does not affect researchers' interpretations of scores; the changed spacing between values, however, might be problematic if the original variables were meant to be interpreted in a straightforward fashion, such as annual income and age. After the transformations, the new variables might become much more difficult to interpret. And even when transformation is needed, some data mining procedures, such as neural networks, perform this task in a hidden layer without the intervention of the analyst.

Transparency and interpretability

Data visualization aims to improve the transparency of the analytical process. While hypothesis testers submit the data to complicated algorithms without understanding how the Wilks' Lambda and the p-value are computed, data visualizers can directly "see" the pattern on the graph. Not only do data analysts like the transparency and interpretability that result from visualization, but most teachers and speakers also like to employ graphing techniques to present abstract results and complicated data structures in a concrete and appealing manner (Yu & Stockford, 2003). Interestingly enough, although variable selection is considered an objective of EDA by NIST Sematech (2006) and many other exploratory data analysts, traditional variable selection procedures, such as stepwise regression, are usually excluded from the arena of EDA for lacking visualization and transparency. However, it is important to note that the neural network, another new EDA technique based on data mining, is considered a "black box" because of the lack of transparency in its process (Fielding, 2007). Nevertheless, it is still a powerful tool for pattern recognition.

Resampling and validation

Confirmatory data analysis employs probabilistic inferences, and thus the results yielded from CDA are said to possess a high degree of generalizability. In contrast, EDA focuses on pattern recognition using the data at hand. For this reason, EDA is said to aim at hypothesis generation as a complementary approach to CDA (Behrens & Yu, 2003). Traditional EDA techniques might pass the initial findings (suggested factors or hypotheses) to CDA for further inquiry. However, with the use of resampling, new EDA can go beyond the initial sample to validate the findings; this feature will be further discussed in a later section. Moreover, in the past, comparing EDA and CDA results was just like comparing an apple and an orange; for example, EDA does not return a p-value at all. Nevertheless, today some new data mining-based EDA techniques allow the researcher to compare EDA results against those produced by conventional procedures (e.g., regression). How well different solutions concur with each other could be viewed as a type of validation.

A NEW EDA FRAMEWORK

Goal-oriented, not means-oriented

Nevertheless, certain conventional EDA elements are still indispensable. For example, in data mining many iterative processes still rely on residual analysis, and no doubt data visualization is essential to examining hidden patterns. But taking all of the above into account, it is obvious that some of the conventional elements of EDA are not fully applicable to the new developments. This does not imply that checking assumptions, spotting outliers, transforming data, and so on are obsolete; they could still be useful in some situations, but there are now other EDA procedures that get around them. Hence, it is time to reconsider the appropriateness of the existing EDA framework. One of the problems of the conventional approaches is that the characteristics of EDA are tied both to the attributes of the data (distribution, variability, linearity, outliers, measurement scales, etc.) and to the final goals (detecting clusters, screening variables, and unearthing hidden patterns and complex relationships). In fact, dealing with the attributes of the data is just the means rather than the end, and, as demonstrated above, some data characteristics are no longer considered problematic in new EDA. If EDA is instead characterized by a goal-oriented approach, then detecting clusters, screening variables, and unearthing hidden relationships would remain applicable no matter what advanced procedures are introduced in the future.

In the following section each of the three goals of EDA stated above will be discussed. There are numerous new EDA techniques belonging to the preceding three categories; due to space limitations, only one technique will be illustrated in each category. In addition, because variable selection and pattern recognition methods are guided by a response variable, they are considered "supervised learning methods." On the other hand, clustering techniques have no dependent variable as a reference, and thus they are called "unsupervised learning methods." "Learning" in this context means these approaches are data-driven, i.e., the algorithms learn from the data.
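The distinction shows up directly in how the two families of algorithms are invoked; a minimal sketch (scikit-learn is used here as a stand-in, not software from the article, and the arrays are toy data):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    X = np.random.default_rng(1).normal(size=(100, 4))  # toy predictor matrix
    y = (X[:, 0] + X[:, 1] > 0).astype(int)             # toy response variable

    DecisionTreeClassifier().fit(X, y)       # supervised: learning is guided by the response y
    KMeans(n_clusters=3, n_init=10).fit(X)   # unsupervised: no response variable at all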
CATEGORIES AND TECHNIQUES OF EDA

Clustering: TwoStep cluster analysis

Clustering is essentially grouping observations based upon their proximity to each other on multiple dimensions. At first glance, cluster analysis is similar to discriminant analysis, but in the latter the analyst must know the group membership for the classification in advance. Because discriminant analysis assigns cases to pre-existing groups, it is not as exploratory as cluster analysis, which aims to identify the grouping categories in the first place.
If there are just two dimensions (variables), the analyst could simply use a scatterplot to look for the clumps. But when there are many variables, the task becomes more challenging and thus necessitates algorithms. There are three major types of clustering algorithms: 1) hierarchical clustering; 2) non-hierarchical clustering (k-means clustering); and 3) TwoStep clustering. The last is considered the most versatile because it has several desirable features that are absent in the other clustering methods. First, both hierarchical clustering and k-means clustering can handle continuous variables only, but TwoStep clustering accepts both categorical and continuous variables. This is the case because in TwoStep clustering the distance measurement is based on the log-likelihood method (Chiu et al., 2001). In computing the log-likelihood, the continuous variables are assumed to have a normal distribution and the categorical variables are assumed to have a multinomial distribution; nevertheless, the algorithm is reasonably robust against violations of these assumptions, and thus assumption checking is unnecessary. Second, while k-means clustering requires a pre-specified number of clusters, and therefore strong prior knowledge, TwoStep clustering is truly data-driven due to its capability of automatically returning the number of clusters. Last but not least, while hierarchical clustering is suitable for small data sets only, TwoStep clustering is so scalable that it can analyze thousands of observations efficiently.

As the name implies, TwoStep clustering is composed of two steps. The first step is called preclustering. In this step, the procedure constructs a cluster features (CF) tree by scanning all cases one by one (Zhang et al., 1996). When a case is scanned, the precluster algorithm applies the log-likelihood distance measure to determine whether the case should be merged with other cases or should form a new precluster of its own and wait for similar cases in further scanning. After all cases are exhausted, the preclusters are treated as entities and become the raw data for the next step. In this way, the task is manageable no matter how large the sample size is, because the size of the distance matrix depends on just a few preclusters rather than on all cases. The researcher also has the option to turn on outlier handling; if this option is selected, entries that cannot fit into any precluster are treated as outliers at the end of CF-tree building. Further, in this preclustering step all continuous variables are automatically standardized. In other words, there is no need for the analyst to perform outlier detection and data transformation in separate steps.

In step two, a hierarchical clustering algorithm is applied to the preclusters to propose a set of solutions. To determine the best number of clusters, the solutions are compared against each other based upon the Akaike Information Criterion (AIC) (Akaike, 1973) or the Bayesian Information Criterion (BIC) (Schwarz, 1978). AIC is a fitness index for trading off the complexity of a model against how well the model fits the data. To reach a balance between fitness and parsimony, AIC not only rewards goodness of fit but also penalizes over-fitting and complexity; hence, the best model is the one with the lowest AIC value. However, both Berk (2008) and Shmueli (2009) agreed that although AIC is a good measure of predictive accuracy, it can be over-optimistic in estimating fitness. In addition, because AIC aims to yield a predictive model, using AIC for model selection is inappropriate for a model of causal explanation. BIC was developed as a remedy to AIC. Like AIC, BIC uses a penalty against complexity, but this penalty is much stronger than that of AIC. In this sense, BIC is in alignment with Ockham's razor: all things being equal, the simplest model tends to be the best one.
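For reference, the two criteria have simple closed forms: AIC = -2 ln(L) + 2k and BIC = -2 ln(L) + k ln(n), where L is the maximized likelihood, k is the number of free parameters, and n is the sample size. The TwoStep algorithm in PASW is proprietary, but its two-stage logic (condense the cases into preclusters, then cluster the preclusters and choose the number of clusters by BIC) can be roughly approximated. In the sketch below, a Gaussian mixture stands in for the hierarchical step because it supplies the likelihood that BIC needs; precluster sizes are ignored, only continuous variables are handled, and all settings are illustrative assumptions:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.mixture import GaussianMixture

    X = np.random.default_rng(0).normal(size=(5000, 6))  # placeholder data matrix
    Xs = StandardScaler().fit_transform(X)               # step 0: standardize, as TwoStep does

    # Step 1 (precluster): condense thousands of cases into a few representative centers
    centers = MiniBatchKMeans(n_clusters=50, n_init=10).fit(Xs).cluster_centers_

    # Step 2 (cluster the preclusters): compare solutions by BIC = -2 ln L + k ln n
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(centers).bic(centers)
           for k in range(2, 7)}
    best_k = min(bic, key=bic.get)  # the lowest BIC wins
    print(bic, best_k)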
To illustrate TwoStep clustering, a data set listing 400 of the world's best colleges and universities, compiled by US News and World Report (2009), was utilized. The criteria used by US News and World Report for selecting the best institutions include academic peer review score, employer review score, student-to-faculty score, international faculty score, international students score, and citations per faculty score. However, an educational researcher might not find the list helpful, because the report ranks these institutions by their overall scores. It is tempting for the educational researcher to learn how these best institutions relate to each other and what their common threads are. In addition to the preceding measures, geographical location could be taken into account.

Because the data set contains both categorical and continuous variables, the researcher employed the TwoStep clustering analysis in Predictive Analytics Software (PASW) Statistics (SPSS Inc., 2009). It is important to note that the clustering result may be affected by the order of the cases in the file. In the original data set, the table was sorted by rank in ascending order; in an effort to minimize this order effect, the cases were re-arranged in random order before the analysis was conducted. To run a TwoStep cluster analysis, the researcher must assign the categorical and continuous variables to the proper fields, as shown in Figure 1; for simplicity, BIC was used instead of AIC.

In this analysis a three-cluster solution is suggested (see Figure 2). Cluster 1 is composed of all European institutions, whereas Cluster 2 includes colleges and universities in Australia, New Zealand, Asia, and Africa. Cluster 3 consists of North American and South American institutions.
Other characteristics of these clusters will be discussed next.

Figure 1. Options in TwoStep cluster analysis.

PASW returns many tables and graphs for the analyst to examine the results; due to space constraints, only a few will be discussed here. For example, three variables are considered important in distinguishing Cluster 3 from the other two clusters: citations per faculty score, international students score, and international faculty score, because their t-statistics exceed the critical value (see Figure 3).

Figure 3. Importance of variables for setting clusters apart.

The attributes of each cluster could be further examined using the centroids table (Table 1). Cluster 1 is characterized by a high international students score, a high international faculty score, and a moderate citations per faculty score. On the other hand, Cluster 2 possesses the following characteristics: a moderate international students score, a moderate international faculty score, and a low citations per faculty score. In the last cluster, both the international faculty score and the international students score are the lowest, but its citations per faculty score is the best.

Table 1. Centroids

                                               Cluster 1   Cluster 2   Cluster 3   Combined
    Academic Peer Review Score    Mean           54.13       58.86       64.40      58.53
                                  Std. Dev.      21.512      24.352      24.860     23.678
    Employer Review Score         Mean           52.97       62.40       59.72      57.48
                                  Std. Dev.      26.972      23.897      26.545     26.338
    Student to Faculty Score      Mean           55.24       51.90       52.94      53.67
                                  Std. Dev.      25.418      24.268      26.305     25.388
    International Faculty Score   Mean           59.90       50.10       43.68      52.35
                                  Std. Dev.      26.841      31.734      21.432     27.538
    International Students Score  Mean           62.66       46.22       43.85      52.61
                                  Std. Dev.      25.117      31.604      21.571     27.354
    Citations per Faculty Score   Mean           51.81       43.54       67.21      54.48
                                  Std. Dev.      19.975      21.671      24.554     23.711
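A centroids table like Table 1 is straightforward to reproduce once cluster labels are attached to the data; a minimal sketch (the column names, toy values, and cluster labels are hypothetical stand-ins for the PASW output):

    import pandas as pd

    # df holds the criterion scores plus a 'cluster' column of labels (1, 2, 3)
    df = pd.DataFrame({
        "citations_per_faculty": [51.0, 44.2, 67.9, 66.5],
        "intl_students": [62.1, 46.8, 43.2, 44.5],
        "cluster": [1, 2, 3, 3],
    })  # toy rows for illustration

    centroids = df.groupby("cluster").agg(["mean", "std"])  # mean and SD per cluster
    print(centroids)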
… different settings. Also, it is tempting to use some stopping rule to prune the tree, and "minimum complexity" might be attractive to researchers who favor a simple model (see Figure 5). But it is better to select "none" for pruning because, as mentioned before, premature stopping prevents the researcher from seeing what is possible and better.

Figure 5. K-fold cross-validation and pruning criterion.
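The k-fold procedure named in Figure 5 is easy to reproduce outside any particular package; a minimal sketch (scikit-learn and synthetic data stand in for the software and data used in the article):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # toy data

    # 10-fold CV: fit on 9 folds, score on the held-out fold, and rotate through all 10
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(scores.mean(), scores.std())  # an honest estimate of out-of-sample accuracy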
… educational software, frequent use of computers for writing documents, frequent use of computers for writing programs, frequent use of computers for downloading music, the number of TV sets, frequent use of spreadsheets, frequent use of computers for playing games, the number of computers at home, frequent use of graphics programs, frequent use of computers for online communication, and frequent use of computers for collaborating on the Internet. However, when there are too many predictors, the reliability of the parameter estimates decreases (Fielding, 2007). The predictive power of the two approaches was evaluated by both classification agreement and ROC curves. Table 2 indicates that the classification tree outperforms the logistic regression model in predicting both high performers (1) and weaker performers (0).
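A head-to-head comparison of this kind can be scripted in a few lines; a minimal sketch comparing a classification tree with logistic regression by the area under the ROC curve (scikit-learn and synthetic data, not the PISA files used in the article):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # Area under the ROC curve on held-out data: higher is better
    for name, model in [("tree", tree), ("logistic", logit)]:
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(name, round(auc, 3))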
Figure 7. ROC curves comparing the classification tree and logistic regression.

Figure 8. Three layers of a typical neural network.

Figure 11. Interaction between residency and transferred hours.

… a researcher found the so-called best-fit model, there may be numerous possible models that fit the same data.
… exploratory data mining tools. In each category of EDA there are different methods to accomplish the same goal, and each method has numerous options (e.g., the number of folds in k-fold cross-validation). In evaluating the efficacy of classification trees and other classifiers, Wolpert and Macready (1997) found that there is no single best method, and they termed this phenomenon "no free lunch": every output comes with a price (drawback). For instance, simplicity is obtained at the expense of fitness, and vice versa. As illustrated before, sometimes simplicity could be an epistemologically sound criterion for selecting the "best" solution. In the example of the PISA data, the classification tree model is preferable to the logistic regression model because of its predictive accuracy. Likewise, in the example of the world's best universities, BIC, which introduces heavier penalties for complexity, is more favorable than AIC. But in the example of the retention study, when the researcher suspected that there were entangled relationships among the variables, a complex, nonlinear neural net was constructed even though this black box lacks transparency. One way or the other, the data explorer must pay a price. Ultimately, whether a simple or a complex approach should be adopted is tied to usefulness. Altman and Royston (2000) asserted that "usefulness is determined by how well a model works in practice, not by how many zeros there are in associated p values" (p. 454). While this statement pinpoints the blind faith in p values in inferential statistics, it is also applicable to EDA. A data explorer should not hop around solutions and refuse to commit himself/herself to a conclusion in the name of exploration; rather, he/she should contemplate which solution could yield more implications for the research community.

Last but not least, exploratory data mining techniques could be employed simultaneously or sequentially. For example, because both neural networks and classification trees are capable of selecting important predictors, they could be run side by side and evaluated by classification agreement and ROC curves. On other occasions, a sequential approach might be more appropriate. For instance, if the researcher suspects that the observations are too heterogeneous to form a single population, clustering could be conducted to divide the sample into sub-samples. Next, variable selection procedures could be run to narrow down the predictor list for each sub-sample. Last, the researcher could focus on the inter-relationships among just a few variables using pattern recognition methods. The combinations and possibilities are virtually limitless. Data detectives are encouraged to explore the data with skepticism and openness.
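One way to make the sequential idea concrete (cluster first, then select variables within each sub-sample, then model) is sketched below; scikit-learn and synthetic data are assumptions of the sketch, and every setting is illustrative rather than a recipe from the article:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

    # Step 1: split a possibly heterogeneous sample into sub-samples
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for c in np.unique(labels):
        Xc, yc = X[labels == c], y[labels == c]
        # Step 2: narrow the predictor list within each sub-sample
        selector = SelectKBest(f_classif, k=5).fit(Xc, yc)
        Xc_small = selector.transform(Xc)
        # Step 3: look for patterns among the few surviving variables
        acc = DecisionTreeClassifier(max_depth=4).fit(Xc_small, yc).score(Xc_small, yc)
        print(f"cluster {c}: kept {selector.get_support().sum()} predictors, in-sample fit = {acc:.2f}")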
REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), International Symposium on Information Theory (pp. 267-281). Budapest: Akademia Kiado.

Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19, 453-473.

Baker, B. D., & Richards, C. E. (1999). A comparison of conventional linear regression methods and neural networks for forecasting educational spending. Economics of Education Review, 18, 405-415.

Behrens, J. T., & Yu, C. H. (2003). Exploratory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Volume 2: Research methods in psychology (pp. 33-64). New Jersey: John Wiley & Sons.

Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.

Berk, R. A. (2008). Statistical learning from a regression perspective. New York: Springer.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth International Group.

Carpio, K. J. E., & Hermosilla, A. Y. (2002). On multicollinearity and artificial neural networks. Complexity International, 10. Retrieved October 8, 2009, from http://www.complexity.org.au/ci/vol10/hermos01/

Chiu, T., Fang, D., Chen, J., Wang, Y., & Jeris, C. (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 263-268.

Fielding, A. H. (2007). Cluster and classification techniques for the biosciences. New York, NY: Cambridge University Press.

Gelman, A. (2004). Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics, 13, 755-779.

Gonzalez, J., & DesJardins, S. (2002). Artificial neural networks: A new approach to predicting application behavior. Research in Higher Education, 43, 235-258.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). Boston, MA: Elsevier.

Hartwig, F., & Dearing, B. E. (1979). Exploratory data analysis. Beverly Hills, CA: Sage Publications.

Kieseppa, I. A. (2001). Statistical model selection criteria and the philosophical problem of underdetermination. British Journal for the Philosophy of Science, 52, 761-794.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In C. S. Melish (Ed.), Proceedings of the 14th International Joint Conference on Artificial Intelligence (pp. 1137-1143). San Francisco, CA: Morgan Kaufmann.

Krus, D. J., & Fuller, E. A. (1982). Computer-assisted multicross-validation in regression analysis. Educational and Psychological Measurement, 42, 187-193.

Kuan, C., & White, H. (1994). Artificial neural networks: An econometric perspective. Econometric Reviews, 13, 1-91.

Larose, D. (2005). Discovering knowledge in data: An introduction to data mining. NJ: Wiley-Interscience.

Luan, J. (2002). Data mining and its applications in higher education. In A. Serban & J. Luan (Eds.), Knowledge management: Building a competitive advantage in higher education (pp. 17-36). PA: Josey-Bass.

Martinez, W. L. (2005). Exploratory data analysis with MATLAB. London: Chapman & Hall/CRC.

McMenamin, J. S. (1997). A primer on neural networks for forecasting. Journal of Business Forecasting, 16, 17-22.

Myatt, G. (2007). Making sense of data: A practical guide to exploratory data analysis. Hoboken, NJ: John Wiley & Sons.

NIST Sematech. (2006). What is EDA? Retrieved September 30, 2009, from http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm

Organization for Economic Cooperation and Development. (2006). Programme for international student assessment. Retrieved July 2, 2009, from http://www.oecd.org/pages/0,3417,en_32252351_32235731_1_1_1_1_1,00.html

Osborne, J. (2002). Notes on the use of data transformations. Practical Assessment, Research & Evaluation, 8(6). Retrieved September 30, 2009, from http://PAREonline.net/getvn.asp?v=8&n=6

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.

Salford Systems. (2009). Random Forest [Computer software and manual]. San Diego, CA: Author.

SAS Institute. (2007). JMP 8 [Computer software and manual]. Cary, NC: Author.

Schwaiger, M., & Opitz, O. (Eds.). (2001). Exploratory data analysis in empirical analysis. New York: Springer.

Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Shmueli, G. (2009). To explain or to predict? Retrieved March 1, 2009, from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1351252

Shmueli, G., & Koppius, O. (2008). Contrasting predictive and explanatory modeling in IS research. Robert H. Smith School Research Paper No. RHS 06-058. Retrieved August 21, 2009, from http://ssrn.com/abstract=1112893

Shmueli, G., Patel, N., & Bruce, P. (2007). Data mining for business intelligence: Concepts, techniques, and applications in Microsoft Office Excel with XLMiner. Hoboken, NJ: Wiley-Interscience.

Somers, M. J., & Casal, J. C. (2009). Using artificial neural networks to model nonlinearity: The case of the job satisfaction-job performance relationship. Organizational Research Methods, 12, 403-417.

SPSS, Inc. (2009). PASW Statistics 17 [Computer software and manual]. Chicago, IL: Author.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.

TIBCO. (2009). Spotfire Miner [Computer software and manual]. Palo Alto, CA: Author.

Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Tukey, J. W. (1986a). Data analysis, computation and mathematics. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965-1986 (pp. 753-775). Pacific Grove, CA: Wadsworth. (Original work published 1972)

Tukey, J. W. (1986b). Exploratory data analysis as part of a larger whole. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965-1986 (pp. 793-803). Pacific Grove, CA: Wadsworth. (Original work published 1973)

Tukey, J. W. (1986c). The future of data analysis. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. III. Philosophy and principles of data analysis: 1949-1964 (pp. 391-484). Pacific Grove, CA: Wadsworth. (Original work published 1962)

Tukey, J. W., & Wilk, M. B. (1986). Data analysis and statistics: An expository overview. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965-1986 (pp. 549-578). Pacific Grove, CA: Wadsworth. (Original work published 1966)

US News and World Report. (2009, June 18). World's best colleges and universities. Retrieved October 5, 2009, from http://www.usnews.com/articles/education/worlds-best-colleges/2009/06/18/worlds-best-colleges-top-400.html

Velleman, P. F., & Hoaglin, D. C. (1981). Applications, basics, and computing of exploratory data analysis. Boston, MA: Duxbury Press.