
International Journal of Psychological Research
ISSN: 2011-2084
[email protected]
Universidad de San Buenaventura
Colombia

Yu, Chong Ho
Exploratory data analysis in the context of data mining and resampling.
International Journal of Psychological Research, vol. 3, no. 1, 2010, pp. 9-22
Universidad de San Buenaventura
Medellín, Colombia

Available at: http://www.redalyc.org/articulo.oa?id=299023509014


Exploratory data analysis in the context of data mining and resampling.

Análisis de Datos Exploratorio en el contexto de extracción de datos y remuestreo.

Chong Ho Yu
Arizona State University

ABSTRACT

Today there are quite a few widespread misconceptions of exploratory data analysis (EDA). One of these
misperceptions is that EDA is said to be opposed to statistical modeling. Actually, the essence of EDA is not about putting
aside all modeling and preconceptions; rather, researchers are urged not to start the analysis with a strong preconception
only, and thus modeling is still legitimate in EDA. In addition, the nature of EDA has been changing due to the emergence
of new methods and convergence between EDA and other methodologies, such as data mining and resampling. Therefore,
conventional conceptual frameworks of EDA might no longer be capable of coping with this trend. In this article, EDA is
introduced in the context of data mining and resampling with an emphasis on three goals: cluster detection, variable
selection, and pattern recognition. TwoStep clustering, classification trees, and neural networks, which are powerful
techniques to accomplish the preceding goals, respectively, are illustrated with concrete examples.

Key words: exploratory data analysis, data mining, resampling, cross-validation, data visualization, clustering,
classification trees, neural networks

RESUMEN

Hoy por hoy existen diseminadas varias definiciones erróneas acerca del análisis de datos exploratorio (ADE). Una de tales definiciones afirma que el ADE es opuesto a la modelación estadística. De hecho, en ADE no se trata de obviar modelaciones y pre-concepciones; al contrario, se trata de hacer análisis usando no únicamente pre-concepciones fuertes, lo que en sí hace legítimo el uso de modelación en ADE. Además, la naturaleza del ADE ha estado cambiando debido a la emergencia de nuevos métodos y la convergencia del ADE con otras metodologías, tales como la extracción de datos y el remuestreo. Por tanto, las definiciones convencionales de ADE no dan cuenta de su estado actual. En este artículo, el ADE se presenta en el contexto de la extracción de datos y el remuestreo haciendo énfasis en tres objetivos: detección de conglomerados, selección de variables y reconocimiento de patrones. Las técnicas de clasificación en dos pasos, árboles de clasificación y redes neuronales sirven como ejemplos para lograr los objetivos delineados.

Palabras clave: análisis de datos exploratorio, extracción de datos, remuestreo, validación cruzada, visualización de datos, clasificación, árboles de clasificación, redes neuronales.

Artículo recibido/Article received: December 2009, Artículo aceptado/Article accepted: March 15/2010
Dirección correspondencia/Mail Address:
Chong Ho Yu, Ph.D. Director of Research and Assessment. Applied Learning Technologies Institute. Arizona State University 1475 N Scottsdale Rd Scottsdale, AZ 85257.
Email: [email protected]; [email protected]

INTERNATIONAL JOURNAL OF PSYCHOLOGICAL RESEARCH esta incluida en PSERINFO, CENTRO DE INFORMACION PSICOLOGICA DE COLOMBIA,
OPEN JOURNAL SYSTEM, BIBLIOTECA VIRTUAL DE PSICOLOGIA (ULAPSY-BIREME), DIALNET y GOOGLE SCHOLARS. Algunos de sus articulos aparecen en
SOCIAL SCIENCE RESEARCH NETWORK y está en proceso de inclusion en diversas fuentes y bases de datos internacionales.
INTERNATIONAL JOURNAL OF PSYCHOLOGICAL RESEARCH is included in PSERINFO, CENTRO DE INFORMACIÓN PSICOLÓGICA DE COLOMBIA, OPEN
JOURNAL SYSTEM, BIBLIOTECA VIRTUAL DE PSICOLOGIA (ULAPSY-BIREME ), DIALNET and GOOGLE SCHOLARS. Some of its articles are in SOCIAL
SCIENCE RESEARCH NETWORK, and it is in the process of inclusion in a variety of sources and international databases.




Exploratory data analysis (EDA) was introduced by Tukey and his colleagues about four decades ago (Tukey, 1969, 1977, 1986a, 1986b, 1986c; Tukey & Wilk, 1986), and since then numerous publications regarding EDA have become available to researchers (e.g. Behrens, 1997; Behrens & Yu, 2003; Fielding, 2007; Martinez, 2005; Myatt, 2007; Schwaiger & Opitz, 2001; Velleman & Hoaglin, 1981). Although EDA is no longer considered a new methodology, the author of this article, based upon teaching and consulting experiences, observed that today there are still quite a few widespread misconceptions of EDA. This phenomenon is partly due to the fact that EDA is a philosophy or mentality (skepticism and openness) (Hartwig & Dearing, 1979) rather than a fixed set of formal procedures, and it is also partly owing to the trend that emerging methods, such as data mining and resampling, have been gradually changing the nature of EDA. As a remedy to those misconceptions, this paper will start by clarifying what EDA is not, and then introduce conventional EDA and its limitations. Next, EDA in the new context of data mining and resampling will be illustrated with concrete examples. Although these examples are from education or educational psychology, the principles of analyzing these data sets could be extended to experimental psychology as well as other branches of psychology.

WHAT IS NOT EDA?

When some people claim that their methodology is exploratory, what they actually mean is that they are not sure what they are looking for. Unfortunately, poor research is often implemented in the name of EDA. During data collection, some researchers flood their subjects with hundreds of survey items since their research questions are not clearly defined and their variables are not identified. While it is true that EDA does not require a pre-determined hypothesis to be tested, that does not justify the absence of research questions or ill-defined variables.

Another common misperception is that EDA is said to be opposed to statistical modeling. Because EDA is different from confirmatory data analysis (CDA), a set of statistical procedures aiming to confirm a pre-formulated hypothesis using either p-values or confidence intervals, some researchers believe that anything associated with modeling or pre-conceived ideas about the data would disqualify the analysis as a form of EDA. Gelman (2004) found that EDA is often implemented in the absence of modeling, or that EDA is used only in the early stages of model formulation but disappears from the radar screen after the model is generated. Actually, EDA employs data visualization as a primary tool, which is often used in model diagnostics. For example, a quantile-quantile plot can be drawn to examine the gap between the data and the empirical distribution of a model. Sometimes, data should be explored in an iterative fashion by fitting as much structure as possible into a model and then using graphs to find patterns that represent deviations from the current model (Gelman, 2004). Following this line of reasoning, model-based clustering, which is based upon certain probabilistic inferences, is considered legitimate in EDA (Martinez, 2005).

It is difficult for a data analyst to start with a "blank mind" and explore the data without any reference. Traditionally, researchers classify the modes of reasoning in research as induction (data-driven) and deduction (theory- or hypothesis-driven). Actually, there is a third avenue: abduction. Abductive reasoning does not necessarily start with fully developed models or with no models at all. For example, when Kepler developed his astronomical model, he had some basic preconceptions, which were very general "hunches" about the nature of motion and forces, and also the basic idea that the Sun is the source of the forces driving the planetary system. It is beyond the scope of this article to thoroughly discuss abductive logic; interested readers are advised to consult Yu (1994, 2006, 2009a). In alignment with abduction, the essence of EDA is not about putting aside all modeling and preconceptions; rather, researchers are urged not to start the analysis with a strong preconception only.

CONVENTIONAL VIEWS OF EDA

Exploratory data analysis was named by Tukey (1977) as an alternative to CDA. As mentioned before, EDA is an attitude or philosophy about how data analysis should be carried out, instead of a fixed set of techniques. Tukey (1977) often related EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible "story" of the data emerges. Therefore, the "data detective" should be skeptical of the "face" value of the data and keep an open mind to unanticipated results when the hidden patterns are unearthed.

Throughout many years, different researchers formulated different definitions, classifications, and taxonomies of EDA. For example, Velleman and Hoaglin (1981) outlined four basic elements of exploratory data analysis: residual, re-expression (data transformation), resistant, and display (data visualization). Based upon Velleman and Hoaglin's framework, Behrens and Yu (2003) elaborated the above four elements with updated techniques, and renamed "display" to "revelation." Each of them is briefly introduced as follows:

1. Residual analysis: EDA follows the formula that data = fit + residual, or data = model + error. The fit or the model is the expected value of the data, whereas the residual or the error consists of the values that deviate from that expected value. By examining the residuals, the researcher can assess the model's adequacy (Yu, 2009b).




2. Re-expression or data transformation: When the distribution is skewed or the data structure obscures the pattern, the data could be rescaled in order to improve interpretability. Typical examples of data transformation include using a natural log transformation or inverse probability transformation to normalize a distribution, using a square root transformation to stabilize variances, and using a logarithmic transformation to linearize a trend (Yu, 2009b).

3. Resistance procedures: Parametric tests are based on mean estimation, which is sensitive to outliers or skewed distributions. In EDA, resistant estimators are usually used instead. Common examples are the median, the trimean (a measure of central tendency based on the arithmetic average of the values of the first quartile, the third quartile, and the median counted twice), the Winsorized mean (a robust version of the mean in which extreme scores are pulled back to the majority of the data), and the trimmed mean (a mean computed without outliers); see the sketch after this list. It is important to point out that there is a subtle difference between "resistance" and "robustness," though the two terms are often used interchangeably. Resistance is about being immune to outliers, while robustness is about being immune to assumption violations. In the former, the goal is to obtain a data summary, while in the latter the goal is to make a probabilistic inference.

4. Revelation or data visualization: Graphing is a powerful tool for revealing hidden patterns and relationships among variables. Typical examples of graphical tools for EDA are Trellis displays and 3D plots (Yu & Stockford, 2003). Although the use of scientific and statistical visualization is fundamental to EDA, the two should not be equated, because data visualization is concerned with just one data characterization aspect (patterns) whereas EDA encompasses a wider focus, as introduced in the previous three elements (NIST Sematech, 2006).
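To make the first three elements concrete, here is a minimal Python sketch (not part of the original article) that computes a naive fit and its residuals, re-expresses a skewed variable, and contrasts the mean with resistant estimators; the income values are invented for illustration, and the mean serves as a deliberately naive "fit":

```python
import numpy as np
from scipy import stats

# Invented, right-skewed sample with one extreme value (illustration only).
income = np.array([22, 25, 27, 30, 31, 33, 36, 40, 48, 250], dtype=float)

# 1. Residual analysis: data = fit + residual. With the mean as the "fit",
#    the residuals expose the deviant case.
fit = income.mean()
residuals = income - fit

# 2. Re-expression: a log transformation pulls in the long right tail.
log_income = np.log(income)

# 3. Resistant estimators: summaries that one outlier barely moves.
median = np.median(income)
q1, q3 = np.percentile(income, [25, 75])
trimean = (q1 + 2 * median + q3) / 4              # median counted twice
trimmed = stats.trim_mean(income, 0.1)            # drop extreme 10% per tail
winsorized = stats.mstats.winsorize(income, limits=(0.1, 0.1)).mean()

print(f"mean={fit:.1f}  median={median:.1f}  trimean={trimean:.1f}  "
      f"trimmed={trimmed:.1f}  winsorized={winsorized:.1f}")
```

Running the sketch shows the mean dragged toward the outlier while the resistant summaries stay near the bulk of the data, which is exactly the "data summary" behavior described above.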
According to NIST Sematech (2006), EDA entails a variety of techniques for accomplishing the following tasks: 1) maximize insight; 2) uncover underlying structure; 3) extract important variables; 4) detect outliers and anomalies; 5) test underlying assumptions; 6) develop parsimonious models; and 7) determine optimal factor settings. Comparing the NIST EDA approach with Velleman and Hoaglin's, and with Behrens and Yu's, it is not difficult to see many common threads. For example, "maximize insight" and "uncover underlying structure" are similar to revelation.

LIMITATIONS OF CONVENTIONAL VIEWS OF EDA

Although the preceding EDA framework provides researchers with helpful guidelines in data analysis, some of the above elements are no longer as important as before, due to the emergence of new methods and the convergence between EDA and other methodologies, such as data mining and resampling. Data mining is a cluster of techniques that has been employed in the Business Intelligence (BI) field for many years (Han & Kamber, 2006). According to Larose (2005), data mining is the process of automatically extracting useful information and relationships from immense quantities of data. Data mining does not start with a strong preconception, a specific question, or a narrow hypothesis; rather, it aims to detect patterns that are already present in the data. Similarly, Luan (2002) views data mining as an extension of EDA. Like EDA, resampling departs from the theoretical distributions used by CDA. Rather, its inference is based upon repeated sampling within the same sample, and that is why this school is called resampling (Yu, 2003, 2007). How these two methodologies alter the features of EDA will be discussed next.

Checking assumptions

In multiple regression analysis the assumption of the absence of multicollinearity (high correlations among predictors) must be met for the independent variables. If multicollinearity exists, the variance, standard error, and parameter estimates are probably all inflated. In addition to computing the variance inflation factor, it is a common practice to use a scatterplot matrix, a data visualization technique for EDA, to examine the inter-relationships among the predictors. While checking underlying assumptions plays an important role in conventional EDA, many new EDA techniques based upon data mining are non-parametric in nature. For example, recursive partition trees and neural networks are immune to multicollinearity (Carpio & Hermosilla, 2002; Fielding, 2007).
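As a companion to the scatterplot matrix, the variance inflation factor itself is easy to compute. The following sketch (numpy only, with a fabricated near-collinear predictor) regresses each predictor on the others and reports VIF_j = 1 / (1 - R²_j):

```python
import numpy as np

# Fabricated predictors: x2 is almost a linear copy of x0 + x1, so it
# should show a large variance inflation factor.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x0, x1, x2])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing predictor j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 / (1 - r2)

for j in range(X.shape[1]):
    print(f"VIF(x{j}) = {vif(X, j):.1f}")   # values above ~10 flag trouble
```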
Spotting outliers

In the past it was correct to say that outliers were detrimental to data analysis, because the slope of a regression line could be driven by just a single extreme datum point. Thus, it is logical to assert that spotting outliers is an indispensable step in EDA. However, TwoStep clustering, a sophisticated EDA algorithm, has built-in mechanisms to handle outliers during the clustering process. Actually, before the analysis the researcher could not tell which case is an outlier, because the references (clusters) have not been made yet. Further, the recursive partition tree, which is a newer EDA technique arising from data mining, is also immune against outliers (Fielding, 2007).

Data transformation

Data transformation is used as a powerful technique to improve the interpretability of the data. But in the recursive partition tree, the independent variables do not require any transformation at all (Fielding, 2007). In addition, Osborne (2002) asserted that many transformations that reduce non-normality by changing the spacing between data points raise issues in the interpretation of data, rather than improving it. If transformations are done correctly, all data points remain in the same relative order as prior to transformation, and this preserved ordering does not affect researchers' interpretations of scores. Still, transformation might be problematic if the original variables were meant to be interpreted in a straightforward fashion, such as annual income and age. After the transformations, the new variables might become much more complex to interpret. Even if transformation is needed, some data mining procedures, such as neural networks, perform this task in a hidden layer without the intervention of the analyst.

Transparency and interpretability

Data visualization aims to improve the transparency of the analytical process. While hypothesis testers submit the data to complicated algorithms without understanding how Wilks' Lambda and the p-value are computed, data visualizers could directly "see" the pattern on the graph. Not only do data analysts like the transparency and interpretability that result from visualization, but most teachers and speakers also like to employ graphing techniques to present abstract results and complicated data structures in a concrete and appealing manner (Yu & Stockford, 2003). Interestingly enough, although variable selection is considered an objective of EDA by NIST Sematech (2006) and many other exploratory data analysts, traditional variable selection procedures, such as stepwise regression, are usually excluded from the arena of EDA for lacking visualization and transparency. However, it is important to note that the neural network, another new EDA technique based on data mining, is considered a "black box" because of a lack of transparency in the process (Fielding, 2007). Nevertheless, it is still a powerful tool for pattern recognition.

Resampling and validation

Confirmatory data analysis employs probabilistic inferences, and thus the results yielded from CDA are said to possess a high degree of generalizability. In contrast, EDA focuses on pattern recognition using the data at hand. For this reason, EDA is said to aim at hypothesis generation as a complementary approach to CDA (Behrens & Yu, 2003). Traditional EDA techniques might pass the initial findings (suggested factors or hypotheses) to CDA for further inquiry. However, with the use of resampling, new EDA can go beyond the initial sample to validate the finding. This feature will be further discussed in a later section. Moreover, in the past, comparing EDA and CDA results was just like comparing an apple and an orange. For example, EDA does not return a p-value at all. Nevertheless, today some new data mining-based EDA techniques allow the researcher to compare EDA results against those produced from conventional procedures (e.g. regression). How different solutions concur with each other could be viewed as a type of validation.

A NEW EDA FRAMEWORK

Goal-oriented, not means-oriented

Nevertheless, certain conventional EDA elements are still indispensable. For example, in data mining many iterative processes still rely on residual analysis, and no doubt data visualization is essential to examining hidden patterns. But taking all of the above into account, it is obvious that some of the conventional elements of EDA are not fully applicable to the new developments. This does not necessarily imply that checking assumptions, spotting outliers, transforming data, and so on are obsolete; rather, they could still be useful in some situations. However, there are other EDA procedures that allow us to get around them. Hence, it is time to reconsider the appropriateness of the existing EDA framework. One of the problems of those conventional approaches is that the characteristics of EDA are tied to both the attributes of the data (distribution, variability, linearity, outliers, measurement scales, etc.) and the final goals (detecting clusters, screening variables, and unearthing hidden patterns and complex relationships). In fact, dealing with the attributes of the data is just the means instead of the ends, and as demonstrated above, some data characteristics are no longer considered problematic to new EDA. However, if EDA is characterized by a goal-oriented approach, then detecting clusters, screening variables, and unearthing hidden relationships would still be applicable no matter what advanced procedures are introduced in the future.

In the following sections each of the three goals of EDA stated above will be discussed. There are numerous new EDA techniques belonging to the preceding three categories; due to space limitations, only one technique will be illustrated in each category. In addition, because variable selection and pattern recognition methods are guided by a response variable, they are considered "supervised learning methods." On the other hand, clustering techniques have no dependent variable as a reference, and thus they are called "unsupervised learning methods." "Learning" in this context means these approaches are data-driven, i.e., the algorithms learn from the data.

CATEGORIES AND TECHNIQUES OF EDA

Clustering: TwoStep cluster analysis

Clustering is essentially grouping observations based upon their proximity to each other on multiple dimensions. At first glance, cluster analysis is similar to discriminant analysis.



But in the latter the analyst must know the group membership for the classification in advance. Because discriminant analysis assigns cases to pre-existing groups, it is not as exploratory as cluster analysis, which aims to identify the grouping categories in the first place.

If there are just two dimensions (variables), the analyst could simply use a scatterplot to look for the clumps. But when there are many variables, the task becomes more challenging and thus necessitates algorithms. There are three major types of clustering algorithms: 1) hierarchical clustering, 2) non-hierarchical clustering (k-means clustering), and 3) TwoStep clustering. The last one is considered the most versatile because it has several desirable features that are absent in the other clustering methods. First, both hierarchical clustering and k-means clustering can handle continuous variables only, but TwoStep clustering accepts both categorical and continuous variables. This is the case because in TwoStep clustering the distance measurement is based on the log-likelihood method (Chiu et al., 2001). In computing the log-likelihood, the continuous variables are assumed to have a normal distribution and the categorical variables are assumed to have a multinomial distribution. Nevertheless, the algorithm is reasonably robust against violations of these assumptions, and thus assumption checking is unnecessary. Second, while k-means clustering requires a pre-specified number of clusters, and therefore strong prior knowledge, TwoStep clustering is truly data-driven due to its capability of automatically returning the number of clusters. Last but not least, while hierarchical clustering is suitable for small data sets only, TwoStep clustering is so scalable that it can analyze thousands of observations efficiently.

As the name implies, TwoStep clustering is composed of two steps. The first step is called preclustering. In this step, the procedure constructs a cluster features (CF) tree by scanning all cases one by one (Zhang et al., 1996). When a case is scanned, the precluster algorithm applies the log-likelihood distance measure to determine whether the case should be merged with other cases or form a new precluster on its own and wait for similar cases in further scanning. After all cases are exhausted, all preclusters are treated as entities and become the raw data for the next step. In this way, the task is manageable no matter how large the sample size is, because the size of the distance matrix depends on just a few preclusters rather than on all cases. Also, the researcher has the option to turn on outlier handling; if this option is selected, entries that cannot fit into any precluster are treated as outliers at the end of CF-tree building. Further, in this preclustering step, all continuous variables are automatically standardized. In other words, there is no need for the analyst to perform outlier detection and data transformation in separate steps.
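The CF-tree idea of Zhang et al. (1996) is available outside PASW as the BIRCH algorithm. The sketch below uses scikit-learn's Birch on synthetic data (not the US News file) to show the precluster-then-cluster flow; it is an analogue of TwoStep's first step, not SPSS's exact implementation:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic scores standing in for the six university measures.
X, _ = make_blobs(n_samples=5000, centers=3, random_state=42)

# Step-one analogue: grow a CF tree that condenses 5,000 cases into a
# much smaller set of "precluster" entities.
pre = Birch(threshold=0.8, n_clusters=None).fit(X)
print("preclusters:", pre.subcluster_centers_.shape[0])

# Step-two analogue: cluster the compact preclusters rather than the
# raw cases, then read off the final assignments.
final = Birch(threshold=0.8, n_clusters=3).fit(X)
print("cluster sizes:", np.bincount(final.labels_))
```

The scalability argument is visible here: the final clustering works on the handful of subcluster centers, not on all 5,000 observations.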
In step two, a hierarchical clustering algorithm is applied to the preclusters and proposes a set of solutions. To determine the best number of clusters, the solutions are compared against each other based upon the Akaike Information Criterion (AIC) (Akaike, 1973) or the Bayesian Information Criterion (BIC) (Schwarz, 1978). AIC is a fitness index for trading off the complexity of a model against how well the model fits the data. To reach a balance between fitness and parsimony, AIC not only rewards goodness of fit, but also penalizes overfitting and complexity. Hence, the best model is the one with the lowest AIC value. However, both Berk (2008) and Shmueli (2009) agreed that although AIC is a good measure of predictive accuracy, it can be over-optimistic in estimating fitness. In addition, because AIC aims to yield a predictive model, using AIC for model selection is inappropriate for a model of causal explanation. BIC was developed as a remedy to AIC. Like AIC, BIC also imposes a penalty against complexity, but this penalty is much stronger than that of AIC. In this sense, BIC is in alignment with Ockham's razor: all things being equal, the simplest model tends to be the best one.
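The AIC/BIC trade-off can be tried directly in model-based clustering. The sketch below (scikit-learn Gaussian mixtures on synthetic data, again only an analogue of the TwoStep procedure) fits one- to six-cluster solutions and scores each with AIC = 2k - 2lnL and BIC = k*ln(n) - 2lnL, where k counts free parameters:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data; in the article the inputs would be the six scores.
X, _ = make_blobs(n_samples=400, centers=3, random_state=7)

# Score one- to six-cluster solutions; the lowest criterion value wins,
# and BIC's ln(n) weight penalizes complexity harder than AIC's 2k.
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(f"k={k}  AIC={gm.aic(X):10.1f}  BIC={gm.bic(X):10.1f}")
```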
To illustrate TwoStep clustering, a data set listing 400 of the world's best colleges and universities compiled by US News and World Report (2009) was utilized. The criteria used by US News and World Report for selecting the best institutions include: academic peer review score, employer review score, student to faculty score, international faculty score, international students score, and citations per faculty score. However, an educational researcher might not find the list helpful, because the report ranks these institutions by their overall scores. It is tempting for the educational researcher to learn about how these best institutions relate to each other and what their common threads are. In addition to the preceding measures, geographical location could be taken into account.

Because the data set contains both categorical and continuous variables, the researcher employed the TwoStep clustering analysis in Predictive Analytics Software (PASW) Statistics (SPSS Inc., 2009). It is important to note that the clustering result may be affected by the order of the cases in the file. In the original data set, the table had been sorted by rank in ascending order; in an effort to minimize this order effect, the cases were re-arranged in random order before the analysis was conducted. To run a TwoStep cluster analysis, the researcher must assign the categorical and continuous variables into the proper fields, as shown in Figure 1, using BIC instead of AIC for simplicity.

In this analysis a three-cluster solution is suggested (see Figure 2). Cluster 1 is composed of all European institutions, whereas Cluster 2 includes colleges and universities in Australia, New Zealand, Asia, and Africa.




Cluster 3 consists of North American and South American institutions. Other characteristics of these clusters will be discussed next.

Figure 1. Options in TwoStep cluster analysis

Figure 2. Barchart of within-cluster percentage

PASW returns many tables and graphs for the analyst to examine the results. Due to space constraints, only a few will be discussed here. For example, three variables are considered important in distinguishing Cluster 3 from the other two clusters: citations per faculty score, international students score, and international faculty score, because their t-statistics exceed the critical value (see Figure 3).

Figure 3. Importance of variables for setting clusters apart

The attributes of each cluster could be further examined using the centroids table (Table 1). Cluster 1 is characterized by a high international students score, a high international faculty score, and a moderate citations per faculty score. On the other hand, Cluster 2 possesses the following characteristics: moderate international students score, moderate international faculty score, and low citations per faculty score. In the last cluster, both the international faculty score and the international students score are the lowest, but its citations per faculty score is the best.

The 95% confidence intervals of the citations per faculty score clearly indicate that Cluster 3 substantively outperforms the two other clusters (see Figure 4). Actually, most institutions in Cluster 3 are located in the US, and this implies that although the best American universities are successful in research in terms of citations and recognition, they lack a strong international component compared with their overseas counterparts. At the end, the researcher labels the three clusters as follows: 1) Cluster 1: international-emphasis institutions; 2) Cluster 3: research-emphasis institutions; and 3) Cluster 2: balanced (between international-emphasis and research-emphasis) institutions.

Table 1. Centroids table.

                                              Cluster 1   Cluster 2   Cluster 3   Combined
Academic Peer Review Score    Mean              54.13       58.86       64.40      58.53
                              Std. Deviation    21.512      24.352      24.860     23.678
Employer Review Score         Mean              52.97       62.40       59.72      57.48
                              Std. Deviation    26.972      23.897      26.545     26.338
Student to Faculty Score      Mean              55.24       51.90       52.94      53.67
                              Std. Deviation    25.418      24.268      26.305     25.388
International Faculty Score   Mean              59.90       50.10       43.68      52.35
                              Std. Deviation    26.841      31.734      21.432     27.538
International Students Score  Mean              62.66       46.22       43.85      52.61
                              Std. Deviation    25.117      31.604      21.571     27.354
Citations per Faculty Score   Mean              51.81       43.54       67.21      54.48
                              Std. Deviation    19.975      21.671      24.554     23.711
score, moderate international faculty score, and low




Figure 4. 95% confidence intervals

Variable selection: Recursive partition trees

Classification trees, developed by Breiman et al. (1984), aim to find which independent variable(s) can successfully make a decisive split of the data by dividing the original group of observations into pairs of subgroups on the dependent variable. Because classification trees can provide guidelines for decision-making, they are also known as decision trees. In addition, because at each decision point the data are partitioned, and each partition is further partitioned independently of all other partitioned data until all possible splits are exhausted, they are also called recursive partition trees (Fielding, 2007).

In programming terminology, a classification tree can be viewed as a set of "nested-if" logical statements. Breiman et al. used the following example to illustrate nested-if logic. When heart attack patients are admitted to a hospital, three pieces of information are most relevant to the survival of patients: What is the patient's minimum systolic blood pressure over the initial 24-hour period? What is his/her age? Does he/she display sinus tachycardia? The answers to these three questions can help the doctor make a quick decision: "If the patient's minimum systolic blood pressure over the initial 24 hour period is greater than 91, then if the patient's age is over 62.5 years, then if the patient displays sinus tachycardia, then and only then the patient is predicted not to survive for at least 30 days." These nested-if decisions can be translated into a graphical form as a tree structure.
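That quoted rule transcribes directly into code; a literal (and purely illustrative) Python rendering of the nested-if logic:

```python
def predicted_to_survive(min_systolic_bp: float, age: float,
                         sinus_tachycardia: bool) -> bool:
    """Literal transcription of the quoted rule: only when all three
    conditions hold is the patient predicted not to survive 30 days."""
    if min_systolic_bp > 91:
        if age > 62.5:
            if sinus_tachycardia:
                return False   # predicted not to survive
    return True

print(predicted_to_survive(100, 70.0, True))    # False: all conditions hold
print(predicted_to_survive(85, 70.0, True))     # True: first branch fails
```

Each `if` corresponds to one split node of the tree, which is why the graphical tree and the nested-if statements carry the same information.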
As mentioned before, classification trees can accept the original data without transformation, regardless of the distribution and scaling. Specifically, the algorithm is invariant to any monotonic transformation that retains the rank order of the observations; thus, a logarithmic transformation of the data will lead to the same result. Additionally, classification trees are robust against outliers, because the data set is partitioned into many nodes during the exploratory process, and as a result, the effect of outliers is confined to their own nodes. In other words, those outliers have no effect on other nodes or on the efficacy of the overall result (Fielding, 2007).

Like many other data mining procedures, classification trees employ cross-validation (Krus & Fuller, 1982), which is a form of resampling, to enhance their predictive power. Put simply, cross-validation divides the data set into training sets and testing sets. Exploratory modeling using the training data set inevitably tends to overfit the data, but in the subsequent modeling using the testing data set, the overfitted model is revised in order to enhance its generalizability. It is better to overfit the model and then scale back to the optimal point: if a model is built from a forward stepping approach and ended by a stopping rule, the researcher will miss the opportunity of seeing what might be possible and better ahead (Quinlan, 1993).

In the following discussion, the data set of the Programme for International Student Assessment (PISA) was utilized to illustrate classification trees. PISA is a series of assessments in science, mathematics, and reading. It is sponsored by the Organization for Economic Cooperation and Development (OECD, 2006), and administered internationally to 15-year-olds from different countries. In addition to test scores, PISA also administers many other instruments, such as the cognitive item test, the school questionnaire, the student demographic questionnaire, and the information and communication technology familiarity questionnaire for students. In this example, using the US and Canadian observations (n = 22,601), the researcher would like to find out which variables could best predict performance in the science test. Because the researcher was burdened with hundreds of variables listed in all the preceding instruments, he turned to classification trees in Spotfire Miner (TIBCO, 2009).

Using the logit yielded from an Item Response Theory (IRT) analysis, students were divided into high and low performers, and this grouping variable became the outcome variable. To run a classification tree, the researcher simply entered the dependent variable (performance in terms of the logit) and all potential predictors. As mentioned before, outlier detection, data transformation, and assumption checking are not needed. Classification trees have built-in cross-validation (resampling) mechanisms, but it is important to note that by default Spotfire sets the K of K-fold cross-validation to 0; it is advisable to change it to 2 or more. Kohavi (1995) suggested that 10-fold partitioning produced the best results; however, 10 is not a magic number.

Thus, the researcher could try out different settings. Also, it is tempting to use some stopping rule to prune the tree, and "minimum complexity" might be attractive to researchers who favor a simple model (see Figure 5). But it is better to select "none" for pruning because, as mentioned before, premature stopping prevents the researcher from seeing what is possible and better.
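Spotfire Miner's dialog cannot be reproduced here, but the same K-fold check can be sketched with scikit-learn on synthetic data; nothing below comes from the PISA file, and the point is only how the choice of K is exercised:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PISA predictors and the high/low outcome.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# An unpruned tree is allowed to overfit; K-fold cross-validation then
# estimates how it generalizes. Try several K values, since 10 is a
# common default rather than a magic number.
for k in (2, 5, 10):
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=k)
    print(f"{k:>2}-fold accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```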
Figure 5. K-fold cross-validation and pruning criterion.

Figure 6. Optimal partition tree after pruning.

After the job was submitted, Spotfire returned a suggested tree model, as shown in Figure 6; the orange portion of each rectangle depicts high performers while the blue portion signifies weaker performers. The classification tree identified science enjoyment, the number of books at home, frequent use of educational software, frequent use of computers for writing documents, science interest, and science value as the most important predictors of performance in the PISA science test. This model is considered optimal because when the tree grows by further partitioning, these variables keep recurring. In other words, increasing complexity does not yield additional useful information, and thus the redundant components were manually pruned.

A logistic regression model was run side by side with the preceding classification tree. Unlike its classification tree counterpart, the logistic regression model suggested a longer list of important predictors: science enjoyment, the number of books at home, frequent use of educational software, frequent use of computers for writing documents, frequent use of computers for writing programs, frequent use of computers for downloading music, the number of TV sets, frequent use of spreadsheets, frequent use of computers for playing games, the number of computers at home, frequent use of graphics programs, frequent use of computers for online communication, and frequent use of computers for collaborating on the Internet. However, when there are too many predictors, the reliability of the parameter estimates decreases (Fielding, 2007). The predictive power of the two approaches was evaluated by both classification agreement and ROC curves. Table 2 indicates that the classification tree outperforms the logistic regression model in predicting both high (1) and weaker (0) performers.

Table 2. Classification agreement between the predicted and observed for all students.

                      Predicted and observed   Predicted and observed
                      matched (1)              matched (0)              Overall
Classification tree   84.1%                    40.4%                    67.0%
Logistic regression   83.9%                    39.0%                    65.9%

This assessment is bolstered by the overlaid ROC curves, which illustrate sensitivity (the true positive rate) against 1 – specificity (the false positive rate). The ideal prediction outcomes are 100% sensitivity (all true positives are found) and 100% specificity (no false positives are found). In Figure 7, the 45-degree diagonal gray line represents the baseline: when there is no modeling, the probability is .5. Thus, a good classifier should depict a ROC curve leaning towards the upper left of the graph. Figure 7 shows that overall the classification tree, shown by a blue line, is superior to the logistic regression, presented by a red line. Specifically, while attempting to achieve the highest true positive rate, the logistic regression model is more liberal than its decision tree counterpart. In other words, it makes positive classifications with weak evidence and tends to get positive cases correct at the expense of a high false positive rate. For example, when the true positive rate of the logistic regression is .7, its false positive rate is as high as .55; but when the decision tree reaches the same true positive rate, its false positive rate is just .425. It is true that in the lower left of the chart (true positive rate < .5) the logistic regression is more conservative than the classification tree, but in that area the difference between the two models in terms of the false positive rate is narrow. In summary, no matter whether simplicity, classification agreement, or ROC curves are used as the criterion for determining the model choice, it is obvious that the classification tree approach is more advantageous than logistic regression.
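The side-by-side evaluation reads roughly as follows in scikit-learn (synthetic data and arbitrary tuning, so the numbers will not reproduce Table 2 or Figure 7):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (not the PISA file).
X, y = make_classification(n_samples=5000, n_features=15, n_informative=8,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Classification agreement (accuracy) plus area under the ROC curve;
# sklearn.metrics.roc_curve would give the full sensitivity versus
# 1 - specificity trace of the kind plotted in Figure 7.
for name, model in (("tree", tree), ("logistic", logit)):
    acc = accuracy_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:>8}: accuracy={acc:.3f}  AUC={auc:.3f}")
```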




Figure 7. ROC curves comparing the classification tree and logistic regression.

Pattern recognition: Neural networks

While classification trees aim to identify predictors, neural networks can be used both for selecting variables and for examining complex relationships (Gonzalez & DesJardins, 2002). Neural networks, as the name implies, try to mimic the interconnected neurons in animal brains in order to make the algorithm capable of the complex learning needed for extracting patterns and detecting trends (Kuan, 1994; McMenamin, 1997). Because this approach artificially mimics human neurons in computers, it is also named artificial neural networks. It is built upon the premise that real-world data structures are complex and nonlinear, and thus necessitate complex learning systems. Unlike regression modeling, which assumes linearity, neural networks can model non-linearity, and thus they have typically outperformed regression (Somers & Casal, 2009).

A trained neural network can be viewed as an "expert" in the category of information it has been given to analyze. This expert system can provide projections given new solutions to a problem and answer "what if" questions. A typical neural network is composed of three types of layers, namely, the input layer, the hidden layer, and the output layer (see Figure 8). It is important to note that there are three types of layers, not three layers, in the network: there may be more than one hidden layer, depending on how complex the researcher wants the model to be. The input layer contains the input data; the output layer is the result; and the hidden layer performs data transformation and manipulation.

Figure 8. Three layers of a typical neural network

As mentioned in the third section, preliminary data transformation is unnecessary in many data mining techniques, including neural networks. In traditional linear regression the researcher might try different transformations of the predictors, interactions between predictors, or both (e.g. using centered scores for interaction terms). But in neural networks these are automatically processed in the hidden layer. In this sense, linear regression and logistic regression can be viewed as special cases of neural networks that omit the hidden layer (Shmueli, Patel, & Bruce, 2007; Yu, 2009c). Because the input and the output are mediated by the hidden layer, which is not transparent to the analyst, neural networks are commonly seen as a "black box."

The network is completely connected in the sense that each node in one layer is connected to each node in the next layer. Each connection has a weight, and at the initial stage these weights are just randomly assigned. A common technique for fitting a neural network model is called back propagation. During the process of back propagation, the residuals between the predicted and the actual values of the initial model are fed back into the network. In this sense, back propagation is in a similar vein to residual analysis in conventional EDA (Behrens & Yu, 2003). Since the network performs problem-solving through learning by examples, its operation can be unpredictable; thus, this iterative loop continues one layer at a time until the errors are minimized. Neural networks use multiple paths for model construction. Each path-searching process is called a "tour," and the desired result is that only one best model emerges out of many tours. Like other data mining techniques, neural networks also incorporate cross-validation to avoid capitalization on chance in one single sample.
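JMP's neural platform is interactive, but the ingredients named here (a hidden layer, random starting weights, back propagation, and cross-validation) can be sketched with scikit-learn's MLPClassifier on synthetic data; the retention file itself is not public, so everything below is a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the retention predictors and the binary outcome.
X, y = make_classification(n_samples=3000, n_features=12, random_state=3)

# One hidden layer with three nodes (an arbitrary choice for this sketch).
# Weights start random; back propagation feeds prediction errors back
# through the network until they stop shrinking (max_iter caps the loop).
net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)

# 5-fold cross-validation guards against capitalizing on one sample.
print("5-fold accuracy:", cross_val_score(net, X, y, cv=5).mean().round(3))
```

Removing the hidden layer would reduce the model to a logistic regression, which is the "special case" relationship noted above.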
In the following illustration, the example data set was compiled by tracking the continuous enrollment or withdrawal of 6690 sophomore students enrolled at a US university starting in 2003. The dependent variable is a dichotomous variable, retention. In this study, retention is defined as persisting enrollment within the given time frame (the 2003-2004 academic years, excluding summer).




There are three sets of potential predictors: 1) demographic: this set of predictors includes gender, ethnicity, residence (in state/out of state), and location (living on campus/off campus); 2) pre-college or external academic performance indicators: this set of variables includes high school GPA, high school class rank, SAT scores, ACT scores, transferred hours, and university mathematics placement test scores; and 3) online class hours as a percentage of total hours during the sophomore year. Like the PISA data set, this data set contains so many variables that using CDA might be difficult. Hence, the researcher turned to neural networks for exploring the inter-relationships among these variables.

Figure 9. Dialog box of neural networks in JMP

For this analysis, neural networks in JMP (SAS Institute, 2009) were utilized. The very essence of EDA is the freedom of exploration; there is no single best approach. Thus, the researcher could freely enter the numbers of hidden nodes, tours, maximum iterations, and folds of cross-validation, as shown in the dialog box in Figure 9. After several trials with different settings, the researcher could select the most interpretable result out of a set of suggested results.

Taking clarity of interpretation as the major criterion, the results of the neural net using three hidden layers, three tours, and 5-fold cross-validation were retained for the following discussion. A neural network allows the analyst to examine all possible interactions (see Figure 10). On the right panel of the graph, each rectangle contains the value range of a variable from the lowest to the highest, and inside each rectangle there is a slider for manipulation. When the value of the variable changes, there is a corresponding change in the graph. The analyst can use the slider to superimpose a value grid on the graph, and at the same time the rightmost cell shows the exact value of the variable of interest. It is crucial to emphasize that these are not the regular 3-D plots commonly found in most EDA packages, in which frequencies or raw values are usually plotted; rather, the probabilities on the Z-axis result from adaptive learning through iterative loops.

The neural net indicates that the interaction effect among these variables is complicated and non-linear. The Y-axis (vertical axis) of Figure 10 represents the predicted probability of retention, the X-axis denotes the number of transferred hours, and the Z-axis depicts the ethnic groups, coded as: White = 1, Asian = 2, Hispanic = 3, Black = 4, and Native American = 5. For White and Hispanic students, as the number of transferred hours increases, the probability of retention slightly increases, which is indicated by the gradual slope on the outermost right. For Asian students, an increase in the number of transferred hours does not affect the retention rate at all. However, for Black and Native American students, when the number of transferred hours is low, the probability of continuing enrollment is still high. But there is a sharp drop in the probability of retention for Native Americans when the number of transferred credits is between 19 and 31. For Black students, the sudden depression of probability happens between 18 and 38 transferred hours; afterwards, the probability rises along with the transferred hours.

The interaction between residency and transferred hours is another noteworthy phenomenon. While the probability of retention for non-residents slightly increases as the number of transferred hours increases, the probability of retention climbs up sharply after 42 transferred hours (see Figure 11). It is important to note that 42 is by no means a "magic" cutoff; this may vary from sample to sample, and even from population to population. The main point is that there exists an interaction effect between transferred hours and residency.

EDA AND RESAMPLING

At first glance, exploratory data mining is very similar to conventional EDA, except that the former employs certain advanced algorithms for automation. Actually, the differences between conventional EDA and exploratory data mining could be found at the epistemological level. As mentioned before, EDA suggests variables, constructs, and hypotheses that are worth pursuing, and CDA takes the next step to confirm the findings. However, using resampling (Yu, 2003, 2007), data mining is capable of suggesting and validating a model at the same time. One may argue that data mining should be classified as a form of CDA when validation has taken place. It is important to point out that exploratory data mining usually aims to yield prediction rather than theoretical explanations of the relationships between variables (Shmueli & Koppius, 2008; Yu, in press). Hence, the researcher still has to construct a theoretical model in the context of CDA (e.g. structural equation modeling) if explanation is the research objective.




Figure 10. Interaction between ethnic groups and transferred hours.

Figure 11. Interaction between residency and transferred hours.

Resampling in the context of exploratory data mining addresses two important issues, namely, generalization across samples and the under-determination of theory by evidence (Kieseppa, 2001). It is very common that in one sample a set of best predictors is yielded from regression analysis, but in another sample a different set of best predictors is found (Thompson, 1995). In other words, this kind of model can provide a post hoc model for an existing sample (in-sample forecasting), but cannot be useful in out-of-sample forecasting. This occurs when a specific model is overfitted to a specific data set, which weakens the generalizability of the conclusion. Further, even if a researcher found the so-called best-fit model, there may be numerous other possible models that fit the same data.

To counteract the preceding problems, most data mining procedures employ cross-validation to enhance generalizability. For example, to remediate the problem of under-determination of theory by data, neural networks exhaust different models by the genetic algorithm, which begins by randomly generating pools of equations. These initial randomly generated equations are estimated on the training data set, and the prediction accuracy of the outcome measure is assessed using the test set to identify a family of the fittest models. Next, these equations are hybridized or randomly recombined to create the next generation of equations. Parameters from the surviving population of equations may be combined or excluded to form new equations, as if they were genetic traits inherited from their "parents." This process continues until no further improvement in predicting the outcome measure of the test set can be achieved (Baker & Richards, 1999). In addition to cross-validation, bootstrapping, another resampling technique, is also widely employed in data mining (Salford Systems, 2009), but it is beyond the scope of this article to introduce bootstrapping; interested readers are encouraged to consult Yu (2003, 2007).
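A toy version of that generate-evaluate-recombine loop can be written in a few lines. The sketch below uses invented data and predictor-subset selection rather than Baker and Richards' equation pools, so it only illustrates the mechanics: random initial population, test-set fitness, survival of the fittest, crossover, and mutation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: y truly depends on only 3 of 10 candidate predictors.
n, p = 400, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 3] + 0.5 * X[:, 7] + rng.normal(scale=0.5, size=n)
train, test = slice(0, 300), slice(300, None)

def fitness(mask):
    """Fit OLS on the training set; score = negative test-set MSE."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return -np.inf
    A = np.column_stack([np.ones(300), X[train][:, cols]])
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    B = np.column_stack([np.ones(100), X[test][:, cols]])
    return -np.mean((y[test] - B @ beta) ** 2)

# Randomly generated initial "pool": each row flags a predictor subset.
pop = rng.integers(0, 2, size=(30, p)).astype(bool)
for generation in range(25):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]        # survival of the fittest
    children = []
    for _ in range(20):                            # hybridize ("crossover")
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = int(rng.integers(1, p))
        child = np.concatenate([a[:cut], b[cut:]])
        child[rng.integers(p)] ^= True             # small random mutation
        children.append(child)
    pop = np.vstack([parents, np.array(children)])

best = pop[int(np.argmax([fitness(m) for m in pop]))]
print("selected predictors:", np.flatnonzero(best))  # ideally [0, 3, 7]
```

Because fitness is judged on the held-out test set, a subset that merely overfits the training data tends to die out, which is the resampling safeguard described above.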
CONCLUDING REMARKS




exploratory data mining tools. In each category of EDA Information Theory (pp.267–81). Budapest:
there are different methods to accomplish the same goal, Akademia Kiado.
and each method has numerous options (e.g. the number of Altman, D. G., & Royston, P. (2000).What do we mean by
k-fold cross-validation). In evaluating the efficacy of validating a prognostic model? Statistics in
classification trees and other classifers, Wolpert and Medicine, 19, 453-473.
Macready (1997) found that there is no single best method Baker, B. D., & Richards, C. E. (1999). A comparison of
and they termed this phenomenon “no free lunch” – every conventional linear regression methods and neural
output comes with a price (drawback). For instance, networks for forecasting educational spending.
simplicity is obtained at the expense of fitness, and vice Economics of Education Review, 18, 405-415.
versa. As illustrated before, sometimes simplicity could be Behrens, J. T. & Yu, C. H. (2003). Exploratory data
an epistemologically sound criterion for selecting the “best” analysis. In J. A. Schinka & W. F. Velicer, (Eds.),
solution. In the example of PISA data, the classification tree Handbook of psychology Volume 2: Research
model is preferable to the logistic regression model because methods in Psychology (pp. 33-64). New Jersey:
of predictive accuracy. And also in the example of world’s John Wiley & Sons, Inc.
best universities, BIC, which tends to introduce heavy Behrens, J. T. (1997). Principles and procedures of
penalties to complexity, is more favorable than AIC. But in exploratory data analysis. Psychological Methods,
the example of the retention study, when the researcher 2, 131-160.
suspected that there are entangled relationships among Berk, R. A. (2008). Statistical learning from a regression
variables, a complex, nonlinear neural net was constructed perspective. New York: Springer.
even though this black box lacks transparency. In one way Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J.
or the other the data explorer must pay a price. Ultimately, (1984). Classification and regression trees.
whether a simple and complex approach should be adopted Monterey, CA: Wadsworth International Group.
is tied to usefulness. Altman and Royston (2000) asserted Carpio, K.J.E. & Hermosilla, A.Y. (2002), On
that “usefulness is determined by how well a model works multicollinearity and artificial neural networks,
in practice, not by how many zeros there are in associated p Complexity International, 10, Retrieved October 8,
values” (p.454). While this statement pinpoints the blind 2009, from
faith to p values in using inferential statistics, it is also http://www.complexity.org.au/ci/vol10/hermos01/.
applicable to EDA. A data explorer should not hop around Chiu, T., Fang, D., Chen, J., Wang, Y., & Jeris, C. (2001).
solutions and refuse to commit himself/herself to a A robust and scalable clustering algorithm for
conclusion in the name of exploration; rather, he/she should mixed type attributes in large database
contemplate about which solution could yield more environment. Proceedings of the seventh ACM
implications for the research community. SIGKDD International Conference on Knowledge
Last but not least, exploratory data mining Discovery and Data Mining, San Francisco, CA
techniques could be simultaneously or sequentially 263-268.
employed. For example, because both neural networks and Fielding, A. H. (2007). Cluster and classification
classification trees are capable of selecting important techniques for the biosciences. New York, NY:
predictors, they could be run side by side and evaluated by Cambridge University Press.
classification agreement and ROC curves. On other Gelman, A. (2004). Exploratory data analysis for complex
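As a hedged sketch of such a side-by-side run, the code below fits a classification tree and a small feed-forward neural network to the same training sample and compares them on a hold-out sample by classification agreement and by the area under the ROC curve. The data set, the 70/30 split, and all tuning values are assumptions made for illustration, not settings from the studies described earlier.

```python
# Run a classification tree and a neural network side by side and
# compare them by classification agreement and ROC AUC (illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000,
                                  random_state=0)).fit(X_tr, y_tr)

# Classification agreement: how often the two models assign the same
# label to the same hold-out case, regardless of the true label.
agreement = (tree.predict(X_te) == net.predict(X_te)).mean()

for name, model in (("tree", tree), ("neural net", net)):
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
print(f"agreement between the two classifiers: {agreement:.1%}")
```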
On other occasions, a sequential approach might be more appropriate. For instance, if the researcher suspects that the observations are too heterogeneous to form a single population, clustering can first be conducted to divide the sample into sub-samples. Next, variable selection procedures can be run to narrow the predictor list for each sub-sample. Last, the researcher can focus on the inter-relationships among just a few variables using pattern recognition methods.
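A minimal sketch of this cluster-then-select-then-model sequence is given below. It substitutes k-means for the TwoStep procedure and a random forest importance ranking for the variable selection step discussed in this article, and every data value, cluster count, and model setting is an assumption made purely for illustration.

```python
# Sequential exploratory pipeline sketch: (1) cluster heterogeneous
# observations, (2) narrow the predictor list within each cluster,
# (3) hand the surviving predictors to a flexible pattern recognizer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))          # synthetic, invented data
y = (X[:, 0] + X[:, 3] ** 2 + rng.normal(size=600) > 1).astype(int)

# Step 1: split the sample into sub-samples (k = 2 is assumed).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for c in (0, 1):
    Xc, yc = X[labels == c], y[labels == c]
    # Step 2: variable selection, keeping the three predictors ranked
    # most important by a forest's impurity-based importances.
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xc, yc)
    top = np.argsort(forest.feature_importances_)[-3:]
    # Step 3: pattern recognition on the reduced predictor set.
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000,
                                      random_state=0)).fit(Xc[:, top], yc)
    print(f"cluster {c}: kept predictors {sorted(top.tolist())}, "
          f"training accuracy = {net.score(Xc[:, top], yc):.2f}")
```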
The combinations and possibilities are virtually limitless. Data detectives are encouraged to explore the data with skepticism and openness.

REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), International Symposium on Information Theory (pp. 267–281). Budapest: Akademia Kiado.
Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19, 453-473.
Baker, B. D., & Richards, C. E. (1999). A comparison of conventional linear regression methods and neural networks for forecasting educational spending. Economics of Education Review, 18, 405-415.
Behrens, J. T., & Yu, C. H. (2003). Exploratory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Volume 2: Research methods in psychology (pp. 33-64). New Jersey: John Wiley & Sons, Inc.
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.
Berk, R. A. (2008). Statistical learning from a regression perspective. New York: Springer.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth International Group.
Carpio, K. J. E., & Hermosilla, A. Y. (2002). On multicollinearity and artificial neural networks. Complexity International, 10. Retrieved October 8, 2009, from http://www.complexity.org.au/ci/vol10/hermos01/
Chiu, T., Fang, D., Chen, J., Wang, Y., & Jeris, C. (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 263-268.
Fielding, A. H. (2007). Cluster and classification techniques for the biosciences. New York, NY: Cambridge University Press.
Gelman, A. (2004). Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics, 13, 755-779.
Gonzalez, J., & DesJardins, S. (2002). Artificial neural networks: A new approach to predicting application behavior. Research in Higher Education, 43, 235-258.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). Boston, MA: Elsevier.
Hartwig, F., & Dearing, B. E. (1979). Exploratory data analysis. Beverly Hills, CA: Sage Publications.
Kieseppa, I. A. (2001). Statistical model selection criteria and the philosophical problem of underdetermination. British Journal for the Philosophy of Science, 52, 761–794.


Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In C. S. Mellish (Ed.), Proceedings of the 14th International Joint Conference on Artificial Intelligence (pp. 1137-1143). San Francisco, CA: Morgan Kaufmann.
Krus, D. J., & Fuller, E. A. (1982). Computer-assisted multicross-validation in regression analysis. Educational and Psychological Measurement, 42, 187-193.
Kuan, C., & White, H. (1994). Artificial neural networks: An econometric perspective. Econometric Reviews, 13, 1-91.
Larose, D. (2005). Discovering knowledge in data: An introduction to data mining. NJ: Wiley-Interscience.
Luan, J. (2002). Data mining and its applications in higher education. In A. Serban & J. Luan (Eds.), Knowledge management: Building a competitive advantage in higher education (pp. 17-36). PA: Jossey-Bass.
Martinez, W. L. (2005). Exploratory data analysis with MATLAB. London: Chapman & Hall/CRC.
McMenamin, J. S. (1997). A primer on neural networks for forecasting. Journal of Business Forecasting, 16, 17–22.
Myatt, G. (2007). Making sense of data: A practical guide to exploratory data analysis. Hoboken, NJ: John Wiley & Sons.
NIST Sematech. (2006). What is EDA? Retrieved September 30, 2009, from http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
Organization for Economic Cooperation and Development. (2006). Programme for international student assessment. Retrieved July 2, 2009, from http://www.oecd.org/pages/0,3417,en_32252351_32235731_1_1_1_1_1,00.html
Osborne, J. (2002). Notes on the use of data transformations. Practical Assessment, Research & Evaluation, 8(6). Retrieved September 30, 2009, from http://PAREonline.net/getvn.asp?v=8&n=6
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.
Salford Systems. (2009). Random Forest [Computer software and manual]. San Diego, CA: Author.
SAS Institute. (2007). JMP 8 [Computer software and manual]. Cary, NC: Author.
Schwaiger, M., & Opitz, O. (Eds.). (2001). Exploratory data analysis in empirical analysis. New York: Springer.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Shmueli, G. (2009). To explain or to predict? Retrieved March 1, 2009, from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1351252
Shmueli, G., & Koppius, O. (2008). Contrasting predictive and explanatory modeling in IS research. Robert H. Smith School Research Paper No. RHS 06-058. Retrieved August 21, 2009, from http://ssrn.com/abstract=1112893
Shmueli, G., Patel, N., & Bruce, P. (2007). Data mining for business intelligence: Concepts, techniques, and applications in Microsoft Office Excel with XLMiner. Hoboken, NJ: Wiley-Interscience.
Somers, M. J., & Casal, J. C. (2009). Using artificial neural networks to model nonlinearity: The case of the job satisfaction–job performance relationship. Organizational Research Methods, 12, 403-417.
SPSS, Inc. (2009). PASW Statistics 17 [Computer software and manual]. Chicago, IL: Author.
Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.
TIBCO. (2009). Spotfire Miner [Computer software and manual]. Palo Alto, CA: Author.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Tukey, J. W. (1986a). Data analysis, computation and mathematics. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965-1986 (pp. 753-775). Pacific Grove, CA: Wadsworth. (Original work published 1972)
Tukey, J. W. (1986b). Exploratory data analysis as part of a larger whole. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965-1986 (pp. 793-803). Pacific Grove, CA: Wadsworth. (Original work published 1973)
Tukey, J. W. (1986c). The future of data analysis. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. III. Philosophy and principles of data analysis: 1949-1964 (pp. 391-484). Pacific Grove, CA: Wadsworth. (Original work published 1962)
Tukey, J. W., & Wilk, M. B. (1986). Data analysis and statistics: An expository overview. In L. V. Jones (Ed.), The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965-1986 (pp. 549-578). Pacific Grove, CA: Wadsworth. (Original work published 1966)
US News and World Report. (2009, June 18). World's best colleges and universities. Retrieved October 5, 2009, from http://www.usnews.com/articles/education/worlds-best-colleges/2009/06/18/worlds-best-colleges-top-400.html
Velleman, P. F., & Hoaglin, D. C. (1981). Applications, basics, and computing of exploratory data analysis. Boston, MA: Duxbury Press.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.
Yu, C. H. (1994, April). Induction? Deduction? Abduction? Is there a logic of EDA? Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA. (ERIC Document Reproduction Service No. ED 376 173)
Yu, C. H. (2003). Resampling methods: Concepts, applications, and justification. Practical Assessment, Research & Evaluation, 8(19). Retrieved July 4, 2009, from http://pareonline.net/getvn.asp?v=8&n=19
Yu, C. H. (2006). Philosophical foundations of quantitative research methodology. Lanham, MD: University Press of America.
Yu, C. H. (2007). Resampling: A conceptual and procedural introduction. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 283-298). Thousand Oaks, CA: Sage Publications.
Yu, C. H. (2009a). Causal inferences and abductive reasoning: Between automated data mining and latent constructs. Saarbrücken, Germany: VDM-Verlag.
Yu, C. H. (2009b). Exploratory data analysis and data visualization. Retrieved October 10, 2009, from http://www.creative-wisdom.com/teaching/WBI/EDA.shtml
Yu, C. H. (2009c). Multi-collinearity, variance inflation, and orthogonalization in regression. Retrieved October 5, 2009, from http://www.creative-wisdom.com//computer/sas/collinear_deviation.html
Yu, C. H. (in press). A model must be wrong to be useful: The role of linear modeling and false assumptions in theoretical explanation. Open Statistics and Probability Journal.
Yu, C. H., & Shawn, S. (2003). Evaluating spatial- and temporal-oriented multi-dimensional visualization techniques. Practical Assessment, Research & Evaluation, 8(17). Retrieved September 30, 2009, from http://PAREonline.net/getvn.asp?v=8&n=17
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference on Management of Data, 103-114.
