
ORIGINAL RESEARCH
published: 23 November 2018
doi: 10.3389/fpsyg.2018.02231

Data Mining Techniques in Analyzing Process Data: A Didactic

Xin Qiao* and Hong Jiao
University of Maryland, College Park, College Park, MD, United States

Due to the increasing use of technology-enhanced educational assessment, data mining
methods have been explored to analyze process data in log files from such assessments.
However, most studies were limited to one data mining technique under one specific
scenario. The current study demonstrates the usage of four frequently used supervised
techniques, Classification and Regression Trees (CART), gradient boosting,
random forest, and support vector machine (SVM), and two unsupervised methods,
Self-organizing Map (SOM) and k-means, fitted to the same assessment data. The USA
sample (N = 426) from the 2012 Program for International Student Assessment (PISA)
responding to problem-solving items is extracted to demonstrate the methods. After
concrete feature generation and feature selection, classifier development procedures
are implemented using the illustrated techniques. Results show satisfactory classification
accuracy for all the techniques. Suggestions for the selection of classifiers are presented
based on the research questions, the interpretability, and the simplicity of the classifiers.
Interpretations of the results from both supervised and unsupervised learning methods
are provided.

Keywords: data mining, log file, process data, educational assessment, psychometric

Edited by:
Holmes Finch, Ball State University, United States

Reviewed by:
Daniel Bolt, University of Wisconsin-Madison, United States
Hongyun Liu, Beijing Normal University, China

*Correspondence:
Xin Qiao
[email protected]

Specialty section:
This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 14 March 2018
Accepted: 29 October 2018
Published: 23 November 2018

Citation:
Qiao X and Jiao H (2018) Data Mining Techniques in Analyzing Process Data: A Didactic. Front. Psychol. 9:2231. doi: 10.3389/fpsyg.2018.02231

INTRODUCTION

With the advance of technology incorporated in educational assessment, researchers have been intrigued by a new type of data, process data, generated from computer-based assessment, or new sources of data, such as keystroke or eye-tracking data. Most often, such data, often referred to as a "data ocean," are of very large volume and come with few ready-to-use features. How to explore, discover, and extract useful information from such an ocean has been challenging.

What analyses should be performed on such process data? Even though specific analytic methods are needed for different data sources with specific features, some common analyses can be performed based on the generic characteristics of log files. Hao et al. (2016) summarized several common analytic actions when introducing glassPy, a package in Python. Summary information about the log file, such as the number of sessions, the time duration of each session, and the frequency of each event, can be obtained through a summary function. In addition, event n-grams, or event sequences of different lengths, can be formed for further use with similarity measures to classify and compare persons' performances. To take temporal information into account, hierarchical vectorization of the rank-ordered time intervals and the time-interval distribution of event pairs were also introduced. In addition to these common analytic techniques, other existing data analytic methods for process data are Social Network Analysis (SNA; Zhu et al., 2016), Bayesian Networks/Bayes nets (BNs; Levy, 2014), Hidden Markov Models (Jeong et al., 2010), Markov Item Response Theory (Shu et al., 2017), digraphs (DiCerbo et al., 2011), and process mining (Howard et al., 2010). Further, modern data mining techniques,

Frontiers in Psychology | www.frontiersin.org 1 November 2018 | Volume 9 | Article 2231


Qiao and Jiao Data Mining in Process Data

including cluster analysis, decision trees, and artificial neural networks, have been used to reveal useful information about students' problem-solving strategies in various technology-enhanced assessments (e.g., Soller and Stevens, 2007; Kerr et al., 2011; Gobert et al., 2012).

The focus of the current study is data mining techniques, and this section provides a brief review of related techniques that have been frequently utilized, and the lessons learned, in analyzing process data in technology-enhanced educational assessment. Two major classes of data mining techniques are supervised and unsupervised learning methods (Fu et al., 2014; Sinharay, 2016). Supervised methods are used when subjects' memberships are known and the purpose is to train a classifier that can precisely classify subjects into their own category (e.g., score) and then generalize efficiently to new datasets. Unsupervised methods are utilized when subjects' memberships are unknown and the goal is to categorize subjects into clearly separated groups based on features that can tell them apart. Decision trees, a supervised classification method, have been used very often in analyzing process data in educational assessment. DiCerbo and Kidwai (2013) used Classification and Regression Tree (CART) methodology to create a classifier that detects a player's goal in a gaming environment. The authors demonstrated the building of the classifier, including feature generation and the pruning process, and evaluated the results using precision, recall, Cohen's Kappa, and A' (Hanley and McNeil, 1982). This study showed that CART could be a reliable automated detector and illustrated the process of building such a detector with a relatively small sample size (n = 527). On the other hand, cluster analysis and Self-Organizing Maps (SOMs; Kohonen, 1997) are two well-established unsupervised techniques for categorizing students' problem-solving strategies. Kerr et al. (2011) showed that cluster analysis can consistently identify key features in 155 students' performances in log files extracted from an educational gaming and simulation environment called Save Patch (Chung et al., 2010), which measures mathematical competence. The authors described how they manipulated the data for the application of clustering algorithms and showed evidence that fuzzy cluster analysis is more appropriate than hard cluster analysis for log-file process data from game/simulation environments. Most importantly, the authors demonstrated how cluster analysis can identify both effective strategies and the misconceptions students hold with respect to the related construct. Soller and Stevens (2007) showed the power of SOM for pattern recognition. They used SOM to categorize 5,284 individual problem-solving performances into 36 different problem-solving strategies, each exhibiting different solution frequencies. The authors noted that the 36 strategy classifications can be used as input to a test-level scoring process or externally validated by associating them with other measures. Such detailed classifications can also serve as valuable feedback to students and instructors. Chapters in Williamson et al. (2006) also discussed extensively the promising future of data mining techniques, like SOM, as automated scoring methods. Fossey (2017) evaluated three unsupervised methods, k-means, SOM, and Robust Clustering Using Links (ROCK), on process data in log files from a game-based assessment scenario.

To date, however, no study has demonstrated the utilization of both supervised and unsupervised data mining techniques for the analysis of the same process data. This study aims to fill this gap and provides a didactic of analyzing process data from the 2012 PISA log files retrieved from one of the problem-solving items using both types of data mining methods. This log file is well-structured and representative of what researchers may encounter in complex assessments, and thus suitable for demonstration purposes. The goal of the current study is 3-fold: (1) to demonstrate the use of data mining methods on process data in a systematic way; (2) to evaluate the consistency of the classification results from different data mining techniques, either supervised or unsupervised, with one data file; and (3) to illustrate how the results from supervised and unsupervised data mining techniques can be used to deal with psychometric issues and challenges.

The subsequent sections are organized as follows. First, the PISA 2012 public dataset, including the participants and the problem-solving item analyzed, is introduced. Second, the data analytic methods used in the current study are elaborated and the concrete classifier development processes are illustrated. Third, the results from the data analyses are reported. Lastly, the interpretations of the results, limitations of the current study, and future research directions are discussed.

METHODS

Participants
The USA sample (N = 429) was extracted from the 2012 PISA public dataset. Students were from 15 years 3 months old to 16 years 2 months old, representing 15-year-olds in the USA (Organisation for Economic Co-operation and Development, 2014). Three students with missing student IDs and school IDs were deleted, yielding a sample of 426 students. There were no missing responses. The dataset was randomly partitioned into a training dataset (n = 320, 75.12%) and a test dataset (n = 106, 24.88%). The training dataset is usually about 2 to 3 times the size of the test dataset to increase the precision in prediction (e.g., Sinharay, 2016; Fossey, 2017).

Instrumentation
There are 42 problem-solving questions in 16 units in 2012 PISA. These items assess cognitive processes in solving real-life problems in computer-based simulated scenarios (Organisation for Economic Co-operation and Development, 2014). The problem-solving item TICKETS task2 (CP038Q01) was analyzed in the current study. It is a level-5 question (there were six levels in total) that requires a higher level of exploring and understanding ability in solving this complex problem (Organisation for Economic Co-operation and Development, 2014). This interactive question requires students to explore and collect necessary information to make a decision. The main cognitive processes involved in this task are planning and executing. Given the problem-solving scenario, students need to come up with a plan, test it, and modify it if needed. The item asks students to use their concession fare to


find and buy the cheapest ticket that allows them to take 4 trips around the city on the subway within 1 day. One possible solution is to choose 4 individual concession tickets for the city subway, which cost 8 zeds, while the other is to choose one daily concession ticket for the city subway, which costs 9 zeds. Figure 1 includes these two options. Students can always use the "CANCEL" button before "BUY" to make changes. Correctly completing this task requires students to consider these two alternative solutions, make comparisons in terms of cost, and end up choosing the cheaper one.

This item is scored polytomously with three score points: 0, 1, or 2. Students who derive only one solution and fail to compare it with the other get partial credit. Students who do not come up with either of the two solutions, but rather buy the wrong ticket, get no credit on this item. For example, the last picture in Figure 1 illustrates the purchase of four individual full-fare tickets for country trains, which cost 72 zeds. "COUNTRY TRAINS" and "FULL FARE" are considered unrelated actions because they are not necessary to accomplish the task this item requires. In terms of scoring, unrelated actions are allowed as long as the students buy the correct ticket in the end and make comparisons during the action process.

FIGURE 1 | PISA 2012 problem-solving question TICKETS task2 (CP038Q01) screenshots. (For a clearer view, please see http://www.oecd.org/pisa/test-2012/testquestions/question5/ ).

Data Description
The PISA 2012 log file dataset for the problem-solving item was downloaded at http://www.oecd.org/pisa/pisaproducts/


database-cbapisa2012.htm. The dataset consists of 4722 actions from 426 students as rows and 11 variables as columns. The eleven variables (see Figure 2) are: cnt indicates the country, which is USA in the present study; schoolid and StIDStd indicate the unique school and student IDs, respectively; event_number (ranging from 1 to 47) indicates the cumulative number of actions the student took; event_value (see the raw event_values presented in Table 1) gives the specific action the student took at one time stamp, and time indicates the exact time stamp (in seconds) corresponding to the event_value. Event notes the nature of the action (start item, end item, or actions in process). Lastly, network, fare_type, ticket_type, and number_trips all describe the current choice the student had made. The variables used were schoolid, StIDStd, event_value, and time. The ID variables helped to identify students, while the event_value and time variables were used to generate features. The scores for the students were not provided in the log file and were thus hand coded and carefully double-checked based on the scoring rule. Among the 426 students, 121 (28.4%) got full credit, 224 (52.6%) got partial credit, and 81 (19.0%) did not get any credit. Full, partial, and no credit were coded as 2, 1, and 0, respectively.

Feature Generation and Selection
Feature Generation
The features generated can be categorized into time features and action features, as summarized in Table 1. Four time features were created: T_time, A_time, S_time, and E_time, indicating total response time, action time spent in process, starting time spent on the first action, and ending time spent on the last action, respectively. It was assumed that students with different ability levels may differ in the time they read the question (starting time spent on the first action), the time they spent during the response (action time spent in process), and the time they used to make the final decision (ending time spent on the last action). Different researchers have proposed various joint modeling approaches for both response accuracy and response times, which explain the relationship between the two (e.g., van der Linden, 2007; Bolsinova et al., 2017). Thus, the total response times are expected to differ as well.

The action features, in turn, were created by coding different lengths of adjacent action sequences together. This study generated 12 action features consisting of only one action (unigrams), 18 action features containing two ordered adjacent actions (bigrams), and 2 action features created from four sequential actions (four-grams). Further, all generated action sequences were assumed to have equal importance, and no weights were assigned to any action sequence. In Table 1, "concession" is a unigram, consisting of only one action, that is, the student bought the concession fare; "S_city," on the other hand, is a bigram consisting of two actions, "Start" and "city_subway," representing that the student selected the city subway ticket after starting the item.

Sao Pedro et al. (2012) showed that generated features should be theoretically important to the construct to achieve better interpretability and efficiency. Following their suggestion, features were generated as indicators of the problem-solving ability measured by this item, which is supported by the scoring rubric. For example, one action sequence consisting of four actions, coded as "city_con_daily_cancel," is crucial to scoring. If the student first chose "city_subway" to tour the city, then used the student's concession fare ("concession"), next looked at the price of the daily pass ("daily"), and lastly clicked "Cancel" to see the other option, this action sequence is necessary but not sufficient for a full credit.

The final recoded dataset for analysis is made up of 426 students as rows and 36 features (32 action sequence features and 4 time features) as columns. Scores for each student served as known labels when applying the supervised learning methods. The frequency of each generated action feature was calculated for each student.

FIGURE 2 | The screenshot of the log file for one student.
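The unigram, bigram, and four-gram features described above can be generated mechanically from each student's ordered event_values. The paper's analyses were done in R; the following is a minimal Python sketch of the idea (function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def ngram_features(actions, n_values=(1, 2, 4)):
    """Count n-grams (ordered adjacent action sequences) of the given
    lengths in one student's sequence of event_values."""
    feats = Counter()
    for n in n_values:
        for i in range(len(actions) - n + 1):
            feats["_".join(actions[i:i + n])] += 1
    return feats

# Hypothetical ordered event_values for one student:
seq = ["Start", "city_subway", "concession", "daily", "Cancel", "Buy"]
f = ngram_features(seq)
print(f["city_subway_concession"])                # a bigram count
print(f["city_subway_concession_daily_Cancel"])   # the four-gram behind "city_con_daily_cancel"
```

Applying such a function to every student and stacking the counts yields the students-by-features frequency matrix described above.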


TABLE 1 | 15 raw event values and 36 generated features.

Event_value (15): Start, End, city_subway, concession, full_fare, daily, Cancel, country_trains, individual, Buy, trip_1, trip_2, trip_3, trip_4, trip_5
Time features (4): T_time, A_time, S_time, E_time
Single actions (12): all raw event_values except Start, End, and Buy
Two actions coded together (18): S_city (Start → city_subway), S_country (Start → country_trains), city_full (city_subway → full_fare), city_concession (city_subway → concession), country_full (country_trains → full_fare), country_concession (country_trains → concession), concession_daily (concession → daily), concession_individual (concession → individual), full_daily (full_fare → daily), full_individual (full_fare → individual), individual_trip4 (individual → trip_4), other_cancel (other → Cancel), daily_cancel (daily → Cancel), trip4_cancel (trip_4 → Cancel), daily_buy (daily → Buy), trip4_buy (trip_4 → Buy), individual_other (individual → other), other_buy (other → Buy)
Four actions coded together (2): city_con_ind_4 (city_subway → concession → individual → trip_4), city_con_daily_cancel (city_subway → concession → daily → Cancel)

Feature Selection
The selection of features should be based on both the theoretical framework and the algorithms used. As the features were generated from a purely theoretical perspective in this study, no such consideration is needed in feature selection.

Two other issues that need consideration are redundant variables and variables with little variance. Tree-based methods handle these two issues well and have built-in mechanisms for feature selection. The feature importance indicated by the tree-based methods is shown in Figure 3. In both random forest and gradient boosting, the most important feature is "city_con_daily_cancel." The next most important is "other_buy," which means the student did not choose trip_4 before the action "Buy." The feature importance indicated by tree-based methods is especially helpful when selection has to be made among hundreds of features, as it can help narrow down the number of features to track, analyze, and interpret. The classification accuracy of the support vector machine (SVM) is reduced by redundant variables. However, given that the number of features (36) is relatively small in the current study, deleting highly correlated variables (ρ ≥ 0.8) did not improve classification accuracy for the SVM.

Clustering algorithms are affected by variables with near-zero variance. Fossey (2017) and Kerr et al. (2011) discarded variables with 5 or fewer attempts in their studies. However, their data were binary, and no clear-cut criterion exists for feature elimination when using cluster algorithms in the analysis of process data. In the current study, 5 features with variance no larger than 0.09 in both the training and test datasets were removed to achieve optimal classification results. Descriptive statistics for all 36 features can be found in Table A1 in Appendix A.

In summary, the full set of 36 features was retained for the tree-based methods and the SVM, while 31 features were selected for SOM and k-means after the deletion of the features with little variance.

Data Mining Techniques
This study demonstrates how to utilize data mining techniques to map the selected features (both action and time) to students' item performance on this problem-solving item in 2012 PISA. Given that students' item scores are available in the data file, supervised learning algorithms can be trained to classify students based on their known item performance (i.e., score category) in the training dataset, while unsupervised learning algorithms categorize students into groups based on input variables without knowing their item performance. No assumptions about the data distribution are made by these data mining techniques.

Four supervised learning methods, Classification and Regression Tree (CART), gradient boosting, random forest, and SVM, are explored to develop classifiers, while two unsupervised learning methods, Self-organizing Map (SOM) and k-means, are utilized to further examine the different strategies used by students in both the same and different score categories. CART was chosen because it worked effectively in a previous study (DiCerbo and Kidwai, 2013) and is known for its quick computation and simple interpretation. However, it might not have optimal performance compared with other methods. Furthermore, small changes in the data can change the tree structure dramatically (Kuhn, 2013). Thus, gradient boosting and random forest, which can improve the performance of trees via ensemble methods, were also used for comparison. Though SVM has not been used much in the analysis of process data yet, it has been applied as one of the most popular and flexible supervised learning techniques for other psychometric analyses such as automated scoring (Vapnik, 1995). The two clustering algorithms, SOM and k-means, have been applied in the analysis of process data in log files (Stevens and Casillas, 2006; Fossey, 2017). Researchers have suggested using more than one clustering method to validate clustering solutions (Xu et al., 2013). All the analyses were conducted in RStudio (RStudio Team, 2017).

Classifier Development
The general classifier-building process for the supervised learning methods consists of three steps: (1) train the classifier by estimating model parameters; (2) determine the values of the tuning parameters to avoid issues such as "overfitting" (i.e., the statistical model fits too closely to one dataset but fails to generalize to other datasets) and finalize the classifier; (3) calculate the accuracy of the classifier on the test dataset. In general, training and tuning are conducted on the same training dataset, though some studies further split the training dataset into two parts, one for training and the other for tuning. Though tree-based methods are not affected by scaling issues, the training and test datasets were scaled for SVM, SOM, and k-means.

Given the relatively small sample size of the current dataset, the training and tuning processes were both conducted on the training dataset. Classification accuracy was evaluated with the test dataset. For the CART technique, the cost-complexity


FIGURE 3 | Feature importance indicated by tree-based methods.

parameter (cp) was tuned to find the optimal tree depth using the R package rpart. Gradient boosting was carried out using the R package gbm. The tuning parameters for gradient boosting were the number of trees, the complexity of the trees, the learning rate, and the minimum number of observations in the trees' terminal nodes. Random forest was tuned over the number of predictors sampled for splitting at each node (mtry) using the R package randomForest. A radial basis function kernel SVM, carried out in the R package kernlab, was tuned through two parameters that determine the complexity of the decision boundary: the scale function σ and the cost value C. After the parameters were tuned, the classifiers were trained by fitting them to the training dataset. 10-fold cross-validation was conducted for the supervised learning methods in the training process. Cross-validation is not necessary for random forest when estimating test error due to its statistical properties (Sinharay, 2016).

For the unsupervised learning methods, SOM was carried out in the R package kohonen. The learning rate declined from 0.05 to 0.01 over the updates from 2000 iterations. k-means was carried out using the kmeans function in the stats R package with 2000 iterations. Euclidean distance was used as the distance measure for both methods. The number of clusters ranged from 3 to 10. The lower bound was set to 3 due to the three score categories in this dataset. The upper bound was set to 10 given the relatively small number of features and the small sample size in the current study. The R code for both the supervised and unsupervised methods can be found in Appendix B.

Evaluation Criterion
For the supervised methods, students in the test dataset are classified by the classifier developed on the training dataset. The performance of the supervised learning techniques was evaluated in terms of classification accuracy. Outcome measures include overall accuracy, balanced accuracy, sensitivity, specificity, and Kappa. Since the item scores have three categories, 0, 1, and 2, sensitivity, specificity, and balanced accuracy were calculated as follows:

Sensitivity = True Positives / (True Positives + False Negatives)    (1)


Specificity = True Negatives / (True Negatives + False Positives)    (2)

Balanced Accuracy = (Sensitivity + Specificity) / 2    (3)

where sensitivity measures the ability to predict positive cases, specificity measures the ability to predict negative cases, and balanced accuracy is the average of the two. Overall accuracy and Kappa were calculated for each method based on the following formulas:

Overall Accuracy = (True Positives + True Negatives) / Total Cases    (4)

Kappa = (po − pe) / (1 − pe)    (5)

where overall accuracy measures the proportion of all correct predictions. The Kappa statistic is a measure of concordance for categorical data. In its formula, po is the observed proportion of agreement and pe is the proportion of agreement expected by chance. The larger these five statistics are, the better the classification decisions.

For the two unsupervised learning methods, the better-fitting method and the number of clusters were determined for the training dataset by the following criteria:

1. The Davies-Bouldin Index (DBI; Davies and Bouldin, 1979), calculated as in Equation 6, can be applied to compare the performance of multiple clustering algorithms (Fossey, 2017). The algorithm with the lower DBI is considered the better-fitting one, having higher between-cluster variance and smaller within-cluster variance.

DBI = (1/k) Σ(i=1 to k) max(i≠j) [(Si + Sj) / Mij]    (6)

where k is the number of clusters, Si and Sj are the average distances from the cluster center to each case in cluster i and cluster j, respectively, and Mij is the distance between the centers of clusters i and j. Cluster j is the cluster that has the smallest between-cluster distance to cluster i, or the highest within-cluster variance, or both (Davies and Bouldin, 1979).

2. The Kappa value (see Equation 5) is a measure of classification consistency between the two unsupervised algorithms. It is usually expected to be no smaller than 0.8 (Landis and Koch, 1977).

To check the stability and consistency of the classifications obtained in the training dataset, the methods were repeated in the test dataset, and the DBI and Kappa values were computed.

RESULTS

The tuning and training results for the four supervised learning techniques are reported first, followed by the evaluation of their performance on the test dataset. Lastly, the results for the unsupervised learning methods are presented.

Supervised Learning Methods
The tuning processes for all the classifiers reached satisfactory results. For the CART, cp was set to 0.02 to achieve the minimum error and the simplest tree structure (error < 0.2, number of trees < 6), as shown in Figure 4. The final tuning parameters for gradient boosting were: number of trees = 250, depth of trees = 10, learning rate = 0.01, and minimum number of observations in the trees' terminal nodes = 10. Figure 5 shows that when the maximum tree depth equaled 10, the RMSE reached its minimum as the iterations reached 250, with the simplest tree structure. The number of predictors sampled for splitting at each node (mtry) in the random forest was set to 4 to achieve the largest accuracy, as shown in Figure 6. In the SVM, the scale function σ was set to 1 and the cost value C to 4 to reach the smallest training error, 0.038.

FIGURE 4 | The CART tuning results for cost-complexity parameter (cp).

FIGURE 5 | The Gradient Boosting tuning results.

The performance of the four supervised techniques is summarized in Table 2. All four methods performed satisfactorily, with almost all values larger than 0.90. Gradient boosting showed the best classification accuracy overall, exhibiting the highest Kappa and overall accuracy (Kappa = 0.94, overall accuracy = 0.96). Most of its subclass specificity and balanced accuracy values also ranked top, with only sensitivity


FIGURE 6 | The random forest tuning results (peak point corresponds to mtry = 4).

TABLE 2 | Average of accuracy measures of the scores.

Method                  R package     Kappa  Overall accuracy  Sensitivity (0/1/2)  Specificity (0/1/2)  Balanced accuracy (0/1/2)
CART                    rpart         0.92   0.95              0.89/0.97/0.97       0.98/0.96/0.99       0.93/0.96/0.98
Random Forest           randomForest  0.92   0.95              0.89/0.95/1.00       0.99/0.96/0.97       0.94/0.95/0.99
Gradient Boosting       gbm           0.94   0.96              0.89/0.97/1.00       0.99/0.96/0.99       0.94/0.96/0.99
Support Vector Machine  kernlab       0.92   0.95              0.94/0.93/1.00       0.98/0.98/0.97       0.96/0.96/0.99

for score = 0, specificity for score = 1, and balanced accuracy
for score = 0 smaller than those from SVM. SVM, random
forest, and CART performed similarly well, all with slightly
smaller Kappa and overall accuracy values (Kappa = 0.92, overall
accuracy = 0.95).

Among the four supervised methods, the single tree structure that CART built from the training dataset is the easiest to interpret and is plotted in Figure 7. Three colors represent the three score categories: red (no credit), gray (partial credit), and green (full credit). The darker the color, the more confident the predicted score in that node and the more precise the classification. Each node displays three lines of numbers. The first line indicates the main score category in that node. The second line gives the proportions of each score category, in the order of scores 0, 1, and 2. The third line is the percentage of students falling into that node. CART has a built-in characteristic of automatically choosing useful features. As shown in Figure 7, only five features, "city_con_daily_cancel," "other_buy," "trip4_buy," "concession," and "daily_buy," were used in branching before the final stage. At each branch, if the student performed the action (>0.5), he/she is classified to the right; otherwise, to the left. As a result, students with full credit were branched into one class, in which 96% truly belonged to this category, accounting for 29% of the total data points. Students who earned partial credit were partitioned into two classes, one consisting purely of students in this group and the other consisting of 98% students who truly got partial credit. For the no credit group, students were classified into three classes: one consisted purely of students in this group, and the other two included 10% and 18% of students from other categories. One major benefit of this plot is that we can clearly tell the specific action sequences that led students into each class.

Unsupervised Learning Methods

As shown in Table 3, the candidates for the best clustering solution from the training dataset were k-means with 5 clusters (DBI = 0.19, kappa = 0.84) and SOM with 9 clusters (DBI = 0.25, kappa = 0.96), which satisfied the criterion of a smaller DBI value and a kappa value ≥ 0.8. When validated with the test dataset, the DBI values for both k-means and SOM increased, possibly because of the smaller sample size of the test dataset. Due to the low kappa value for the 5-cluster solution in the validation sample, the final decision on the clustering solution was SOM with 9 clusters. The percentage of students in each score category in each cluster is presented in Figure 8. The cluster analysis results based on both SOM and k-means can be found in Table A2 in Appendix A.
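The branching logic described above — binary action features split at 0.5 — can be illustrated with a toy tree. This is a hypothetical reconstruction with invented feature names and an invented scoring rule, not the authors' fitted model:

```python
# Hypothetical sketch: a small CART on 0/1 action features, printed as rules.
# The 0.5 split threshold mirrors the "performed the action (>0.5)" reading
# of Figure 7; feature names and the scoring rule are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
features = ["daily_buy", "trip4_buy", "concession"]       # invented names
X = rng.integers(0, 2, size=(200, 3))
# Toy rubric: full credit (2) needs both key actions, partial (1) needs one.
y = X[:, 0] + X[:, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))
```

Because the features are 0/1 indicators, every split prints as `feature <= 0.50`: the right branch means the action was performed, the left branch that it was not — exactly how the paths in Figure 7 are read.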


FIGURE 7 | The CART classification.

TABLE 3 | Clustering algorithms' fit (DBI) and agreement (Cohen's kappa).

Training dataset: n = 320; test dataset: n = 106.

| Number of clusters | Train DBI (k-means) | Train DBI (SOM) | Train Kappa | Test DBI (k-means) | Test DBI (SOM) | Test Kappa |
|---|---|---|---|---|---|---|
| 3 | 1.427 | 1.540 | 0.037 | 1.741 | 1.696 | 0.900 |
| 4 | 1.792 | 1.447 | 0.061 | 1.444 | 1.178 | 0.078 |
| 5 | 0.188** | 1.296 | 0.843 | 1.098 | 1.133 | 0.320** |
| 6 | 1.448 | 1.087 | 0.934 | 1.057 | 1.171 | 0.390 |
| 7 | 1.413 | 1.023 | 0.835 | 1.177 | 0.920 | 0.891 |
| 8 | 0.198 | 1.057 | 0.753 | 1.063 | 1.034 | 0.894 |
| 9 | 1.099 | 0.249* | 0.959 | 1.288 | 0.979 | 0.831 |
| 10 | 1.442 | 0.251 | 0.884 | 1.288 | 0.816 | 0.627 |

**Best fitting solution with the training dataset but lower Kappa value with the test dataset, indicating the disagreement between k-means and SOM. *Final chosen solution. Bold values indicate potential final clustering solutions and are discussed in the text.

FIGURE 8 | Percentage in each score category in the final SOM clustering solution with 9 clusters from the training dataset.
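The model-selection loop behind Table 3 — fit each candidate number of clusters, compute the Davies–Bouldin index, and prefer the smallest value — can be sketched as follows (illustrative Python covering only the k-means side; the paper used R and also fit SOM):

```python
# Sketch: choose the k-means solution with the smallest Davies-Bouldin index,
# analogous to the DBI columns of Table 3 (toy blobs, not the PISA features).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=320, centers=5, cluster_std=0.6, random_state=2)

scores = {}
for k in range(3, 11):                      # candidate cluster counts, as in Table 3
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)        # smaller DBI = tighter, better-separated clusters
print(best_k, round(scores[best_k], 3))
```

The DBI rewards compact clusters that are far apart, which is why a lower value is taken as better fit; the paper additionally required the k-means and SOM partitions to agree (kappa ≥ 0.8) before accepting a solution.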


To interpret, label, and group the resulting clusters, it is necessary to examine and generalize the students' features and the strategy pattern in each cluster. In alignment with the scoring rubrics and for ease of interpretation, the nine clusters identified in the training dataset are grouped into five classes and interpreted as follows.

1. Incorrect (cluster 1): students bought neither individual tickets for 4 trips nor a daily ticket.
2. Partially correct (clusters 4 and 5): students bought either individual tickets for 4 trips or a daily ticket but did not compare the prices.
3. Correct (clusters 7 and 8): students compared the prices between individual tickets and a daily ticket and chose to buy the cheaper option (individual tickets for 4 trips).
4. Unnecessary actions (clusters 2, 3, and 6): students tried options not required by the question, e.g., a country train ticket or a different number of individual tickets.
5. Outlier (cluster 9): the student made too many attempts and is identified as an outlier.

Such grouping and labeling can help researchers better understand the common strategies used by students in each score category. It also helps to identify the errors students made and can be a good source of feedback to students. The students mislabeled above still share the major characteristics of their cluster. For example, 4% of students who got no credit in cluster 4 in the training dataset bought a daily ticket for the city subway without comparing the prices, but they bought the full fare instead of using the student's concession fare. These students are different from those in cluster 1, who bought neither daily tickets nor individual tickets for 4 trips. Thus, students in the same score category were classified into different clusters, indicating


that they made different errors or took different actions during the problem-solving process. In summary, though students in the same score category generally shared the actions they took, they can also follow distinct problem-solving processes, and students in different score categories can also share a similar problem-solving process.

SUMMARY AND DISCUSSIONS

This study analyzed the process data in the log file from one of the 2012 PISA problem-solving items using data mining techniques. The data mining methods used, including CART, gradient boosting, random forest, SVM, SOM, and k-means, yielded satisfactory results with this dataset. The three major purposes of the current study are summarized as follows.

First, to demonstrate the analysis of process data using both supervised and unsupervised techniques, concrete steps in feature generation, feature selection, classifier development, and outcome evaluation were presented in the current study. Among all the steps, feature generation was the most crucial one because the quality of the features determines the classification results to a large extent. Good features should be created based on a thorough understanding of the item scoring procedure and the construct. Key action sequences that can distinguish correct and incorrect answers served as features with good performance. Unexpectedly, time features, including total response time and its pieces, did not turn out to be important features for classification. This means that considerable variance of response time existed in each score group and the differences in response time distributions among the groups were not large enough to clearly distinguish the groups (see Figure A1 in Appendix A). This study generated features based on theoretical beliefs about the construct measured and used students as the unit of analysis. The data could be structured in other ways according to different research questions. For example, instead of using students as the unit of analysis, the attempts students made can be used as rows and actions as columns; then the attempts, rather than the students, are classified. Fossey (2017) included a detailed tutorial on clustering algorithms with such a data structure in a game-based assessment.

Second, to evaluate the classification consistency of these frequently used data mining techniques, the current study compared four supervised techniques with different properties, namely, CART, gradient boosting, random forest, and SVM. All four methods achieved satisfactory classification accuracy based on various outcome measures, with gradient boosting showing slightly better overall accuracy and Kappa values. In general, easy interpretability and graphical visualization are the major advantages of trees. Trees also deal with noisy and incomplete data well (James et al., 2013). However, trees are easily influenced by even small changes in the data due to their hierarchical splitting structure (Hastie et al., 2009). SVM, on the contrary, generalizes well because once the hyperplane is found, small changes to the data cannot greatly affect it (James et al., 2013). Given the specific dataset in the current study, even the CART method worked very well. In addition, the CART method is easy to understand and provided enough information about the detailed classifications between and within each score category. Thus, based on the results in the current study, the CART method is sufficient for future studies on similar datasets. The unsupervised learning algorithms, SOM and k-means, also showed convergent clustering results based on DBI and Kappa values. In the final clustering solution, students were grouped into 9 clusters, revealing the specific problem-solving processes they went through.

Third, supervised and unsupervised learning methods serve to answer different research questions. Supervised learning methods can be used to train an algorithm to predict memberships in future data, as in automatic scoring. Unsupervised methods can reveal problem-solving strategy patterns and further differentiate students in the same score category. This is especially helpful for formative purposes. Students can be provided with more detailed and individualized diagnostic reports. Teachers can better understand students' strengths and weaknesses, and adjust instruction in the classroom accordingly or provide more targeted tutoring to specific students. In addition, it is necessary to check for any indication of cheating behavior in the misclassified or outlier cases from both types of data mining methods. For example, a student answering the item correctly within an extremely short amount of time can imply item compromise.

This study has its own limitations. Other data mining methods, such as other decision tree algorithms and clustering algorithms, are worthy of investigation. However, the procedure demonstrated in this study can be easily generalized to other algorithms. In addition, the six methods were compared on the same set of data rather than on data under various conditions. Therefore, the generalization of the current study is limited by factors such as sample size and number of features. Future studies can use a larger sample size and extract more features from more complicated assessment scenarios. Lastly, the current study focused on only one item for didactic purposes. In future studies, process data for more items can be analyzed simultaneously to obtain a comprehensive picture of the students.

To sum up, the selection of data mining techniques for the analysis of process data in assessment depends on the purpose of the analysis and the data structure. Supervised and unsupervised techniques essentially serve different purposes for data mining, with the former as a confirmatory approach and the latter as an exploratory approach.

AUTHOR CONTRIBUTIONS

XQ, as the first author, conducted the major part of the study design, data analysis, and manuscript writing. HJ, as the second author, participated in the formulation and refinement of the study design and provided crucial guidance in the statistical analysis and manuscript composition.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.02231/full#supplementary-material


REFERENCES

Bolsinova, M., De Boeck, P., and Tijmstra, J. (2017). Modeling conditional dependence between response time and accuracy. Psychometrika 82, 1126–1148. doi: 10.1007/s11336-016-9537-6

Chung, G. K. W. K., Baker, E. L., Vendlinski, T. P., Buschang, R. E., Delacruz, G. C., Michiuye, J. K., et al. (2010). "Testing instructional design variations in a prototype math game," in Current Perspectives From Three National R&D Centers Focused on Game-Based Learning: Issues in Learning, Instruction, Assessment, and Game Design, ed R. Atkinson (Chair) (Denver, CO: Structured poster session at the annual meeting of the American Educational Research Association).

Davies, D. L., and Bouldin, D. W. (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227. doi: 10.1109/TPAMI.1979.4766909

DiCerbo, K. E., and Kidwai, K. (2013). "Detecting player goals from game log files," in Poster presented at the Sixth International Conference on Educational Data Mining (Memphis, TN).

DiCerbo, K. E., Liu, J., Rutstein, D. W., Choi, Y., and Behrens, J. T. (2011). "Visual analysis of sequential log data from complex performance assessments," in Paper presented at the annual meeting of the American Educational Research Association (New Orleans, LA).

Fossey, W. A. (2017). An Evaluation of Clustering Algorithms for Modeling Game-Based Assessment Work Processes. Unpublished doctoral dissertation, University of Maryland, College Park. Available online at: https://drum.lib.umd.edu/bitstream/handle/1903/20363/Fossey_umd_0117E_18587.pdf?sequence=1 (Accessed August 26, 2018).

Fu, J., Zapata-Rivera, D., and Mavronikolas, E. (2014). Statistical Methods for Assessments in Simulations and Serious Games (ETS Research Report Series No. RR-14-12). Princeton, NJ: Educational Testing Service.

Gobert, J. D., Sao Pedro, M. A., Baker, R. S., Toto, E., and Montalvo, O. (2012). Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds. J. Educ. Data Min. 4, 111–143. Available online at: https://jedm.educationaldatamining.org/index.php/JEDM/article/view/24 (Accessed November 9, 2018).

Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36. doi: 10.1148/radiology.143.1.7063747

Hao, J., Smith, L., Mislevy, R. J., von Davier, A. A., and Bauer, M. (2016). Taming Log Files From Game/Simulation-Based Assessments: Data Models and Data Analysis Tools (ETS Research Report Series No. RR-16-10). Princeton, NJ: Educational Testing Service.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd Edn. New York, NY: Springer. doi: 10.1007/978-0-387-84858-7

Howard, L., Johnson, J., and Neitzel, C. (2010). "Examining learner control in a structured inquiry cycle using process mining," in Proceedings of the 3rd International Conference on Educational Data Mining, 71–80. Available online at: https://files.eric.ed.gov/fulltext/ED538834.pdf#page=83 (Accessed August 26, 2018).

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, Vol. 112. New York, NY: Springer.

Jeong, H., Biswas, G., Johnson, J., and Howard, L. (2010). "Analysis of productive learning behaviors in a structured inquiry cycle using hidden Markov models," in Proceedings of the 3rd International Conference on Educational Data Mining, 81–90. Available online at: http://educationaldatamining.org/EDM2010/uploads/proc/edm2010_submission_59.pdf (Accessed August 26, 2018).

Kerr, D., Chung, G., and Iseli, M. (2011). The Feasibility of Using Cluster Analysis to Examine Log Data From Educational Video Games (CRESST Report No. 790). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Center for Studies in Education, UCLA. Available online at: https://files.eric.ed.gov/fulltext/ED520531.pdf (Accessed August 26, 2018).

Kohonen, T. (1997). Self-Organizing Maps. Heidelberg: Springer-Verlag. doi: 10.1007/978-3-642-97966-8

Kuhn, M. (2013). Predictive Modeling With R and the caret Package [PDF Document]. Available online at: https://www.r-project.org/conferences/useR-2013/Tutorials/kuhn/user_caret_2up.pdf (Accessed November 9, 2018).

Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33, 159–174. doi: 10.2307/2529310

Levy, R. (2014). Dynamic Bayesian Network Modeling of Game-Based Diagnostic Assessments (CRESST Report No. 837). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Center for Studies in Education, UCLA. Available online at: https://files.eric.ed.gov/fulltext/ED555714.pdf (Accessed August 26, 2018).

Organisation for Economic Co-operation and Development (2014). PISA 2012 Results: Creative Problem Solving: Students' Skills in Tackling Real-Life Problems, Vol. 5. Paris: PISA, OECD Publishing.

RStudio Team (2017). RStudio: Integrated Development Environment for R (Version 3.4.1) [Computer software]. Available online at: http://www.rstudio.com/

Sao Pedro, M. A., Baker, R. S. J., and Gobert, J. D. (2012). "Improving construct validity yields better models of systematic inquiry, even with less information," in User Modeling, Adaptation, and Personalization: Proceedings of the 20th UMAP Conference, eds J. Masthoff, B. Mobasher, M. C. Desmarais, and R. Nkambou (Heidelberg: Springer-Verlag), 249–260. doi: 10.1007/978-3-642-31454-4_21

Shu, Z., Bergner, Y., Zhu, M., Hao, J., and von Davier, A. A. (2017). An item response theory analysis of problem-solving processes in scenario-based tasks. Psychol. Test Assess. Model. 59, 109–131. Available online at: https://www.psychologie-aktuell.com/fileadmin/download/ptam/1-2017_20170323/07_Shu.pdf (Accessed November 9, 2018).

Sinharay, S. (2016). An NCME instructional module on data mining methods for classification and regression. Educ. Meas. Issues Pract. 35, 38–54. doi: 10.1111/emip.12115

Soller, A., and Stevens, R. (2007). Applications of Stochastic Analyses for Collaborative Learning and Cognitive Assessment (IDA Document D-3421). Arlington, VA: Institute for Defense Analysis.

Stevens, R. H., and Casillas, A. (2006). "Artificial neural networks," in Automated Scoring of Complex Tasks in Computer-Based Testing, eds D. M. Williamson, R. J. Mislevy, and I. I. Bejar (Mahwah, NJ: Lawrence Erlbaum Associates, Publishers), 259–312.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika 72, 287–308. doi: 10.1007/s11336-006-1478-z

Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag. doi: 10.1007/978-1-4757-2440-0

Williamson, D. M., Mislevy, R. J., and Bejar, I. I. (eds) (2006). Automated Scoring of Complex Tasks in Computer-Based Testing. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers. doi: 10.4324/9780415963572

Xu, B., Recker, M., Qi, X., Flann, N., and Ye, L. (2013). Clustering educational digital library usage data: a comparison of latent class analysis and k-means algorithms. J. Educ. Data Mining 5, 38–68. Available online at: https://jedm.educationaldatamining.org/index.php/JEDM/article/view/21 (Accessed November 9, 2018).

Zhu, M., Shu, Z., and von Davier, A. A. (2016). Using networks to visualize and analyze process data for educational assessment. J. Educ. Meas. 53, 190–211. doi: 10.1111/jedm.12107

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Qiao and Jiao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
