Capri
All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
DATA MINING
PRINCIPLES, APPLICATIONS
AND EMERGING CHALLENGES
HAROLD L. CAPRI
EDITOR
Copyright 2014. Nova Science Publishers, Inc.
New York
EBSCO Publishing : eBook Academic Collection (EBSCOhost) - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS
AN: 956104 ; Ma, Xiaolei, Capri, Harold L..; Data Mining: Principles, Applications and Emerging Challenges
Copyright © 2015 by Nova Science Publishers, Inc.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system or
transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical
photocopying, recording or otherwise without the written permission of the Publisher.
For permission to use material from this book please contact us:
[email protected]
Independent verification should be sought for any data, advice or recommendations contained in
this book. In addition, no responsibility is assumed by the publisher for any injury and/or damage
to persons or property arising from any methods, products, instructions, ideas or otherwise
contained in this publication.
This publication is designed to provide accurate and authoritative information with regard to the
subject matter covered herein. It is sold with the clear understanding that the Publisher is not
engaged in rendering legal or any other professional services. If legal or any other expert
assistance is required, the services of a competent person should be sought. FROM A
DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE
AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS.
Additional color graphics may be available in the e-book version of this book.
EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
CONTENTS
Preface
Chapter 1 Transit Passenger Origin Inference Using Smart Card Data and GPS Data
Xiaolei Ma, Ph.D. and Yinhai Wang, Ph.D.
Chapter 2 Knowledge Extraction from an Automated Formative Evaluation Based on Odala Approach Using the Weka Tool?
Farida Bouarab-Dahmani and Razika Tahi
Chapter 3 Modeling Nations’ Failure via Data Mining Techniques
Mohamed M. Mostafa, Ph.D.
Chapter 4 An Evolutionary Self-Adaptive Algorithm for Mining Association Rules
José María Luna, Alberto Cano and Sebastián Ventura
Index
PREFACE
on the check-in and check-out scan. The bus arrival time at each stop can be
inferred from GPS data, and individual passenger’s boarding stop is then
estimated by fusing the identified bus arrival time with smart card data. In
addition, a Markov chain based Bayesian decision tree algorithm is proposed
to mine the passengers’ origin information when GPS data are absent. Both
passenger origin mining algorithms are validated based on either on-board
transit survey data or personal GPS logger data. The results demonstrate the
effectiveness and efficiency of the proposed algorithms in extracting
passenger origin information. The estimated passenger origin data are highly
valuable for transit system planning and route optimization.
Chapter 2 - Differentiation between learners and adapted, personalized
learning are active research directions in technology for human learning
today. They lead to the design of educational systems that integrate
learner-monitoring strategies to assist each learner by evaluating his or her
knowledge and skills on the one hand, and by detecting and analyzing errors and
obstacles on the other. In this respect, formative evaluation is the process
used to capture data on the strengths and weaknesses of a learner. To be
useful, these data must be objectively analyzed so that they can be used to
manage the following sessions. Different data mining tools use different
algorithms for data analysis and knowledge extraction. Can these tools be used
in computer-based systems? If so, is it possible to apply a variety of
general-purpose algorithms directly to learning data analysis? The authors
discuss a learning cycle, which can be a learning session with a feedback
loop, that integrates formative evaluation followed by a knowledge
extraction process using data mining algorithms. The authors’ experiments,
presented in this work, comprise a set of tests exploring learners'
errors obtained from a self e-learning-by-doing tool for the algorithmic
domain. The authors used the data mining algorithms implemented in the
Weka tool: the C4.5 algorithm for classification, the Apriori algorithm for
association rule deduction, and K-Means for clustering. The results of these
experiments demonstrate the value of classification and clustering as
implemented in Weka. However, the Apriori algorithm in some cases gives
results that are difficult to interpret, so it needs specific optimization to
detect frequent patterns adequately.
Chapter 3 - Since the concept of ‘failed states’ was coined in the early
1990s, it has come to occupy a top-tier position on the international peace and
security agenda. This study uses data mining techniques to examine the
effect of various social, economic and political factors on states’ failure at the
global level. Data mining techniques use a broad family of computationally
intensive methods that include decision trees, neural networks, rule induction,
machine learning and graphic visualization. Three artificial neural network
models, namely the multi-layer perceptron neural network (MLP), the radial basis
function neural network (RBFNN) and the self-organizing map neural network (SOM),
and one machine learning technique, support vector machines (SVM), are
compared to a standard statistical method, linear discriminant analysis (LDA).
The variable sets considered are demographic pressures, movement of
refugees, group paranoia, human flight, regional economic development,
economic decline, de-legitimization of the state, public services’ performance,
human rights status, security apparatus, elites’ behavior and the role played by
other states or external political actors. The study shows how it is possible to
identify various dimensions of states’ failure by uncovering complex patterns
in the dataset, and also shows the classification abilities of data mining
techniques.
Chapter 4 - This paper presents a novel self-adaptive grammar-guided
genetic programming proposal for mining association rules. It generates
individuals through a context-free grammar, which allows rules to be defined in
an expressive and flexible way over different domains. Each rule is
represented as a derivation tree that shows a solution (described using the
language) denoted by the grammar. Unlike existing evolutionary algorithms
for mining association rules, the proposed algorithm requires only a small
number of parameters, making it easy for non-expert users to discover
association rules. More specifically, this algorithm
does not require any threshold, and uses a novel parent selector based on a
niche-crowding model to group rules. This approach keeps the best individuals
in a pool and restricts the extraction of similar rules by analysing the instances
covered. The authors compare their approach with the G3PARM algorithm, the
first grammar-guided genetic programming algorithm for the extraction of
association rules. G3PARM was described as a high-performance algorithm,
obtaining important results and overcoming the drawbacks of current
exhaustive search and evolutionary algorithms. Experimental results show that
the authors’ new proposal obtains very interesting and reliable rules with
higher support values.
In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.
Chapter 1
ABSTRACT
To improve customer satisfaction and reduce operating costs, transit
authorities have been striving to monitor their transit service quality and
identify the key factors that attract transit riders. Traditional manual
data collection methods cannot satisfy transit system
optimization and performance measurement requirements because they are
expensive and labor-intensive. The recent advent of passive data
collection techniques (e.g., Automatic Fare Collection and Automatic
Vehicle Location) has turned a data-poor environment into a data-rich
one, and offers transit agencies the opportunity to conduct
comprehensive transit system performance measurement. Although
highly valuable information can be collected from ubiquitous transit
data, data usability and accessibility remain difficult. Most Automatic
Fare Collection (AFC) systems are not designed for transit performance
monitoring, and additional passenger trip information cannot be directly
INTRODUCTION
According to the 2000 Census in the United States, approximately 76% of
people chose privately owned vehicles to commute to work in 2000 (ICF
consulting, 2003). More recent results from the 2009 American
Community Survey indicate that 79.5% of home-based workers drive alone for
commuting (McKenzie and Rapino, 2009). Many developing countries, e.g.,
China, also rely on privately owned vehicles for commuting. For example, more
than 34% of Beijing residents chose cars as their primary travel mode,
while only 28.2% chose transit, in 2010 (Beijing Transportation Research
Center, 2012). Public transit is considered an effective
countermeasure to reduce congestion, air pollution, and energy consumption
(Federal Highway Administration, 2002). According to the 2005 Urban Mobility
Report by the Texas Transportation Institute (2005), travel delay in
2003 would have increased by 27 percent without public transit; in the
most congested U.S. metropolitan areas in particular, public transit services
saved more than 1.1 billion hours of travel time. Moreover, public transit can help
enhance business and reduce urban sprawl through transit-oriented development
(TOD). During certain emergency scenarios, public transit can even act as a
RESEARCH BACKGROUND
Data from the AFC system and the AVL system are the two primary sources in
this study. Beijing Transit Incorporated began to issue smart cards on May 10,
2006. The smart card can be used in both the Beijing bus and subway systems.
Owing to the discounted fares (up to 60% off) offered with the smart card, more than
90% of transit riders paid for their transit trips with smart cards in
2010 (Beijing Transportation Research Center, 2010). Two types of AFC
systems exist in Beijing transit: flat fare and distance-based fare. On flat
fare buses, riders pay a fixed rate when entering by tapping their
smart cards on the card reader, so only check-in scans are necessary. On
distance-based fare buses, riders need to scan their smart cards
at both check-in and check-out, holding the card near the card reader
to complete the transaction when entering
or exiting the bus. The smart card can be used in the Beijing subway system as well,
where passengers need to tap their smart card on top of the fare gates during
Due to a design deficiency in the smart card scan system, the AFC system
on flat fare buses does not save any boarding location information, whereas
the AFC system on distance-based fare buses stores boarding and alighting
locations but not boarding time. Key information stored in the
database includes smart card ID, route number, driver ID, transaction time,
remaining balance, transaction amount, boarding stop (only available for
distance-based fare buses), and alighting stop (only available for
distance-based fare buses).
More than 16 million smart card transactions are generated per day.
Among these transactions, 52% are from flat-rate bus riders. These smart card
transactions are scattered across a large-scale transit network with 52,386 links and
43,432 nodes, as presented in Figure 1.
inference algorithms are validated using external data (e.g., on-board survey
data and GPS data).
Figure 2. Flow Chart for Passenger Origin Inference with GPS Data.
the roadway network due to the satellite signal fluctuation. Data preprocessing
is required prior to bus arrival time estimation. A program is written to parse
and import raw GPS data into a database in an automatic manner. Key fields
of a GPS record are shown in Table 1.
The first step is to estimate the bus arrival time at each stop by joining
GPS data with the stop-level geo-location data. A buffer area can be created
around each stop of a given transit route using GIS software.
Within this area, several GPS records are likely to be captured. However,
identifying the GPS record geospatially closest to a particular stop is
challenging, since the buffer zone may contain a number of GPS records
whose travel direction is unknown. Using the geospatial
analysis functions in GIS, each link (i.e., polyline) on which a transit stop is
located is composed of a start node and an end node, which implies that the
direction of each GPS record can be inferred by comparing the
link direction with the direction change between two consecutive GPS records.
With the identified direction, the distance from each GPS point to the
stop can be calculated, and the timestamp of the record with the minimum
distance is regarded as the bus arrival time at that stop. Figure 3
visually demonstrates the above procedure. The inbound stop represents
the physical location of a particular transit stop, and this stop is snapped to a
transit link, whose direction is determined by its start node and end node.
By comparing the driving direction from GPS records with the link direction,
the GPS record nearest to this stop can be identified; it is marked by
the red five-pointed star on the map. The timestamp associated with this
five-pointed star is taken as the arrival time for this inbound stop. The
merit of the bus arrival time estimation algorithm lies in its efficiency. Rather
than searching all the GPS data to identify the traveling direction for each stop,
the proposed algorithm narrows the search area and filters out
unlikely GPS records. This greatly reduces the computational burden
and makes the algorithm relatively easy to apply to large-scale datasets, which is
critical for processing very large volumes of data within an
acceptable time period.
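The arrival-time step described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' GIS implementation: the planar coordinates in meters, the `GpsRecord` type, the function name `estimate_arrival_time`, and the 100 m buffer radius are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # planar (x, y) in meters; an assumption of this sketch


@dataclass
class GpsRecord:
    t: float  # timestamp in seconds
    x: float
    y: float


def estimate_arrival_time(records: List[GpsRecord], stop: Point,
                          link_start: Point, link_end: Point,
                          buffer_m: float = 100.0) -> Optional[float]:
    """Timestamp of the direction-consistent GPS record nearest the stop."""
    # Direction of the link the stop is snapped to (start node -> end node).
    lx, ly = link_end[0] - link_start[0], link_end[1] - link_start[1]
    best_t, best_d = None, float("inf")
    for prev, cur in zip(records, records[1:]):
        d = hypot(cur.x - stop[0], cur.y - stop[1])
        if d > buffer_m:
            continue  # outside the buffer around the stop
        # Heading from two consecutive records; keep the record only if the
        # heading agrees with the link direction (positive dot product).
        hx, hy = cur.x - prev.x, cur.y - prev.y
        if hx * lx + hy * ly <= 0:
            continue
        if d < best_d:
            best_t, best_d = cur.t, d
    return best_t


# Hypothetical trace: the bus passes the stop eastbound, then returns westbound.
trace = [GpsRecord(0, -150, 0), GpsRecord(30, -40, 0), GpsRecord(60, 20, 0),
         GpsRecord(90, 160, 0), GpsRecord(600, 30, 0), GpsRecord(630, -90, 0)]
arrival = estimate_arrival_time(trace, stop=(0.0, 0.0),
                                link_start=(-500.0, 0.0), link_end=(500.0, 0.0))
```

Only records inside the buffer are examined, which mirrors the narrowing of the search area described in the text. Reversing `link_start` and `link_end` selects the westbound pass instead.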
Figure 3. Boarding Time Estimation with GPS Data and Transit Stop Location Data.
In addition, because the arrival times at all stops of a particular transit
route can be estimated, the average travel time, and hence the average speed,
between two adjacent stops can be calculated as well. This speed statistic is
not only critical for transit performance measures, but also provides prior
information for passenger origin inference when GPS data are absent.
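A minimal sketch of how such segment statistics could be derived from estimated arrival times is given below. The data layout (`trips`, `stop_dist_m`) and the function name `segment_speed_stats` are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def segment_speed_stats(trips: List[List[float]], stop_dist_m: List[float]
                        ) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Mean and standard deviation of speed (m/s) between adjacent stops.

    trips[r][s] is the estimated arrival time (seconds) of run r at stop s;
    stop_dist_m[s] is the distance from stop s to stop s + 1.
    """
    stats = {}
    for s, dist in enumerate(stop_dist_m):
        speeds = [dist / (run[s + 1] - run[s]) for run in trips
                  if run[s + 1] > run[s]]  # skip zero or negative intervals
        if len(speeds) >= 2:  # stdev needs at least two observations
            stats[(s, s + 1)] = (mean(speeds), stdev(speeds))
    return stats


# Hypothetical arrival times (seconds) for three runs over three stops.
runs = [[0, 100, 260], [0, 125, 285], [0, 80, 240]]
stats = segment_speed_stats(runs, stop_dist_m=[500.0, 800.0])
```

The resulting mean and standard deviation per stop pair are the kind of prior information the text refers to for the case where GPS data are absent.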
Validation
Compared with bus arrival time, door opening time can be matched more
accurately with smart card transaction time, because a bus
may not stop exactly at each transit stop for passenger boarding. The inferred
bus arrival time is therefore prone to error when it is matched with smart
card data. To validate the accuracy of the proposed data fusion algorithm for
passenger origin inference, an on-board transit survey was undertaken to collect
bus door opening time and arrival location for each stop of route 651 on
January 13, 2013. Handheld GPS devices were used to track the
geospatial location of moving buses every 15 seconds. The survey lasted
from 8:00 AM to 1:00 PM, and a total of 75 bus door opening times were
manually recorded. These door opening time records were then matched
with the smart card transactions of 417 passengers, and the resulting boarding stops
are treated as ground-truth data. By comparing the ground-truth
data with the results of the proposed GPS data fusion approach, we found that 406
boarding stops were inferred correctly and 11 boarding stops differed from the
ground truth by no more than one stop. The proposed algorithm
thus achieves an accuracy of 97.4% (406 of 417).
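The matching and accuracy calculation can be illustrated with a short Python sketch. The matching rule used here (a transaction is assigned to the last stop whose arrival time is not later than the transaction time) is an assumption made for illustration, since the exact fusion rule is not given in this excerpt; the function names and sample times are likewise hypothetical. Only the counts 406 and 417 come from the text.

```python
from bisect import bisect_right
from typing import List


def assign_boarding_stop(tx_time: float, arrival_times: List[float]) -> int:
    """Index of the last stop whose arrival time is not after the transaction."""
    # arrival_times must be sorted in stop order along the trip.
    idx = bisect_right(arrival_times, tx_time) - 1
    return max(idx, 0)  # a scan logged before the first arrival maps to stop 0


def accuracy(inferred: List[int], truth: List[int]) -> float:
    """Share of transactions whose inferred stop equals the ground truth."""
    hits = sum(1 for a, b in zip(inferred, truth) if a == b)
    return hits / len(truth)


arrivals = [0.0, 120.0, 300.0, 420.0]  # hypothetical arrival times (seconds)
stops = [assign_boarding_stop(t, arrivals) for t in (5.0, 130.0, 299.0, 425.0)]
reported = 406 / 417  # counts reported in the text
```

With the reported counts, `round(reported * 100, 1)` evaluates to 97.4, which agrees with the accuracy stated above.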
A fair number of buses still lack GPS devices, and thus the
bus arrival time at each transit stop is not directly measured. However, most
passengers scan their cards immediately when boarding, and almost all
passengers should complete the check-in scan before arriving at the next stop.
This indicates that the first passenger’s transaction time can be safely taken
as the boarding time of the group of passengers at the same stop. The challenge is
then to identify the bus location at the moment of the smart card (SC) transaction so that we
can infer the boarding stop for that passenger. However, this is not easy
because the SC system for flat-rate buses does not record bus location. We
know the time at which each transaction occurred on a bus of a particular route under
the operation of a particular driver, but nothing else is known from the SC
transaction database. Nonetheless, we are able to extract boarding volume
changes over time and identify passengers who made transfers. By mining these data
and combining them with transit route maps, we may be able to accomplish our goal.
Therefore, a two-step approach is designed for passenger origin data
extraction: smart card data clustering and transit stop recognition. To
implement the proposed algorithm efficiently, a Markov chain
based optimization approach is applied to reduce the computational
complexity.
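The clustering step can be pictured with a simple gap-based grouping, sketched below in Python. This is not the authors' clustering method, which is not shown in this excerpt; the 60-second gap threshold and the function names are hypothetical choices made only to illustrate how the first transaction of each group can serve as the group's boarding time.

```python
from typing import List


def cluster_transactions(times: List[float], max_gap_s: float = 60.0
                         ) -> List[List[float]]:
    """Group sorted scan times; a gap above max_gap_s starts a new cluster."""
    clusters: List[List[float]] = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap_s:
            clusters[-1].append(t)  # same boarding event as the previous scan
        else:
            clusters.append([t])    # first scan at a new stop
    return clusters


def boarding_times(clusters: List[List[float]]) -> List[float]:
    """Use the first transaction in each cluster as the group's boarding time."""
    return [c[0] for c in clusters]


taps = [10, 14, 21, 200, 204, 520]  # hypothetical scan times in seconds
groups = cluster_transactions(taps)
first_scans = boarding_times(groups)
```

Each resulting cluster corresponds to one time step in the stop recognition stage that follows.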
(1) We assume the alighting stop in the previous route is spatially and
temporally the closest to the boarding stop for the next route. This is
reasonable because most passengers choose the closest stop for transit
transfer within a short period of time (Chu, 2008). Assume a
passenger k makes a transfer from route i to route j within n minutes.
If route i is a distance-based-rate bus line or a subway line, then we
can identify the transfer station that is also the boarding stop of route
j. Even if both routes are flat-rate bus routes, if the transferring
location is unique, we can still use the transfer information to identify
the transfer bus stop ID and name. In this study, the transfer time
duration n is 30 minutes, and the maximum distance between two
transfer stops is 300 meters.
(2) We assume that the alighting time and the boarding time at each
particular stop are similar. In this case, we can substitute one passenger's
alighting stop for another passenger's boarding stop. Assume a
passenger k makes a transfer from route i to route j. If route j is a
subway line, where both the boarding location and time are available,
then we can estimate passenger k’s alighting stop on route i, and
this alighting stop can also be considered the boarding stop for
those passengers who get on the bus at the same time.
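The spatial and temporal transfer conditions in assumption (1) can be sketched in Python as follows. The 30-minute and 300-meter thresholds come from the text; the haversine distance, the function names, and the sample coordinates are assumptions made for this illustration.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * asin(sqrt(h))


def candidate_boarding_stops(alight_stop: LatLon, alight_time_s: float,
                             board_time_s: float,
                             route_stops: Dict[str, LatLon],
                             max_gap_s: float = 30 * 60,
                             max_dist_m: float = 300.0) -> List[str]:
    """Stops of the next route that satisfy the transfer thresholds, nearest first."""
    gap = board_time_s - alight_time_s
    if not 0 <= gap <= max_gap_s:
        return []  # not a transfer under the time threshold
    near = sorted((haversine_m(alight_stop, pos), sid)
                  for sid, pos in route_stops.items())
    return [sid for d, sid in near if d <= max_dist_m]


# Hypothetical stops of route j, north of the alighting stop on route i.
stops_j = {"A": (39.9010, 116.4000), "B": (39.9020, 116.4000),
           "C": (39.9100, 116.4000)}
found = candidate_boarding_stops((39.9000, 116.4000), 0.0, 600.0, stops_j)
```

When the returned list has a single entry, the transfer location is unique and the boarding stop is identified directly; several entries correspond to the ambiguous case that requires further inference.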
The walking distance between the two stops should be taken into account when
inferring the time at which the flat-rate bus arrives at the transfer stop. However,
several possible boarding stops may exist because the direction is unknown in
flat-rate smart card transactions, and thus additional data mining techniques are
needed to find the boarding stop with the maximum likelihood. These data
mining techniques will be detailed in the next section.
Based on the identified transfer stops, we can further segment the
transaction cluster sequence into shorter cluster series. Each series is bounded
by either the termini or the identified bus stops. The segmented series of
transaction clusters will be used as the input for the subsequent transit stop
inference algorithm.
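$$
\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
p_{(n-1)1} & p_{(n-1)2} & \cdots & p_{(n-1)n} \\
p_{n1} & p_{n2} & \cdots & p_{nn}
\end{bmatrix}
=
\begin{bmatrix}
1 - \sum_{i=2}^{n} p_{1i} & p_{12} & \cdots & p_{1n} \\
0 & 1 - \sum_{i>2}^{n} p_{2i} & \cdots & p_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{(n-1)n} \\
0 & 0 & \cdots & 1
\end{bmatrix}
\tag{3}
$$
where n is the total number of stops on the bus route. This transition probability
matrix plays a vital role in determining the potential stop ID for the next time
step.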
$$
V_{ij} = \frac{D_{ij}}{t_{k+1} - t_k} \tag{4}
$$
For the normal distribution of speed, the mean travel speed $\mu_{ij}$ and standard
deviation $\sigma_{ij}$ can be calculated from all buses with GPS devices on the same
route. In that case, the boarding time at each stop can be inferred
by matching GPS data with stop location information. Using the inferred
boarding time difference and the distance between stop i and stop j, we can
calculate the mean travel speed $\mu_{ij}$ and standard deviation $\sigma_{ij}$ as a priori
information. Note that the speed mean and standard deviation do
not depend exclusively on GPS data; they can also be obtained from other data sources
such as distance-based-rate SC transaction data. A sensitivity analysis further
$$
p_{ij} = \Pr(S_{k+1} = j \mid S_k = i)
= \frac{1}{\sqrt{2\pi}} \int_{z_{ij}}^{z_{ij} + \Delta} \exp(-z^2/2)\, dz
\approx \frac{1}{\sqrt{2\pi}} \exp(-z_{ij}^2/2)\, \Delta \tag{5}
$$
where $z_{ij} = (V_{ij} - \mu_{ij}) / \sigma_{ij}$ is the standardized travel speed between stop i
and stop j, and $\Delta$ is a small increment in travel speed. $\Delta$ does not affect
the algorithm result, since it is a common factor of every transition
probability. In practice, to avoid fast growth of the Bayesian decision tree, the
transition probability can be bounded by a minimum probability to eliminate
those unlikely stops during calculation.
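A minimal Python sketch of how one row of transition probabilities could be computed from Equations (4) and (5) is shown below. The function name, the dictionary layout, and the default values of `delta` and `p_min` are assumptions introduced for illustration; they are not taken from the chapter.

```python
from math import exp, pi, sqrt
from typing import Dict


def transition_probs(dist_m: Dict[int, float], dt_s: float,
                     mu: Dict[int, float], sigma: Dict[int, float],
                     delta: float = 0.1, p_min: float = 1e-4
                     ) -> Dict[int, float]:
    """Approximate p_ij for each candidate next stop j from the current stop i.

    dist_m[j] is the distance from stop i to stop j, dt_s is the time between
    two consecutive transaction clusters, and mu[j], sigma[j] are the prior
    speed mean and standard deviation for the pair (i, j).
    """
    probs = {}
    for j, d in dist_m.items():
        v = d / dt_s                                 # Equation (4)
        z = (v - mu[j]) / sigma[j]                   # standardized speed
        p = exp(-z * z / 2) / sqrt(2 * pi) * delta   # Equation (5), approximation
        if p >= p_min:                               # prune unlikely stops
            probs[j] = p
    return probs


# Hypothetical case: from stop 1, candidates 2, 3 and 4 after 100 seconds.
row = transition_probs(dist_m={2: 500.0, 3: 1000.0, 4: 3000.0}, dt_s=100.0,
                       mu={2: 5.0, 3: 5.0, 4: 5.0},
                       sigma={2: 1.0, 3: 1.0, 4: 1.0})
```

In this example only stop 2 survives the minimum-probability bound, because reaching stops 3 or 4 in 100 seconds would require speeds far from the prior mean.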
Each element in transition matrix can be quantified in the same way as
shown in Equation (5). With the complete transition matrix, the unknown
pattern of SC transaction series can be recognized as:
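$$
[S_{k+1}, S_k, S_{k-1}, \ldots, S_1]
= \arg\max_{S_1 \ldots S_{k+1}} \Pr(S_{k+1}, S_k, S_{k-1}, \ldots, S_1)
= \arg\max_{S_1 \ldots S_{k+1}} \left( \sqrt[k+1]{\prod_{n=1}^{k} \Pr(S_{n+1} = j \mid S_n = i)} \right)
$$
Here, $P(k+1) = \sqrt[k+1]{\prod_{n=1}^{k} \Pr(S_{n+1} = j \mid S_n = i)}$ denotes the geometric mean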
Implementation
As mentioned in the previous sections, due to the nature of transaction
data, several issues need to be addressed when applying the Markov chain based
Bayesian decision tree algorithm:
1. Direction identification
The Beijing transit AFC system does not log the travel direction for
each route. We need to determine whether the bus is traveling inbound or
outbound before executing the algorithm. The solution is to construct two
Bayesian decision trees, one for each direction. The probabilities of the most
likely stop sequence from each tree are then compared, and the one with the
higher path probability wins.
2. Outlier removal
$$
p_{ii} = 1 - \sum_{j=i+1}^{N} p_{ij} \tag{7}
$$
This probability better captures the situation where passengers
delay swiping their smart cards for a certain period after boarding.
A bus trip is defined as the journey from the initial bus stop to the terminus.
The bus terminus is designed for bus turning, layover, and driver
change. It is also the starting stop on the bus timetable. However, in Beijing’s
transit network, some bus termini are located on busy streets or have limited
space. Hence, buses using these termini have to begin their next trip within a short
time to avoid causing an obstruction. This is a challenging issue for
passenger origin inference, since the initial stop (root node) in the
Bayesian decision tree may be misidentified if the bus trip is detected
incorrectly. The solution is to model the travel time probability of
each transaction cluster series. As indicated in the transaction cluster sequence
segmentation section, a transaction cluster sequence can be segmented into
several series using the aforementioned spatiotemporal transfer relationships. Each
identified series is bounded by possible inferred stops. By calculating the travel
time for multiple combinations of inferred stops and comparing it with the
actual time difference, we are able to determine the existence of a bus trip
based on the highest probability. Figure 5 demonstrates the procedure of
identifying a bus trip.
[Figure 5. A 20-minute transaction cluster series divided into Segment 1 and Segment 2; actual stop IDs: 5 (inbound), 12 (inbound), 2 (outbound).]
As presented in Figure 5, the starting point and ending point of the series
can each correspond to several possible stops in different directions, and the
duration of this transaction cluster series is known to be 20 minutes. A variety of
trips may exist for this transaction cluster sequence:
Trip 1: The bus travels from the 5th inbound stop to the 11th inbound stop.
Trip 2: The bus travels from the 5th inbound stop to the 2nd outbound stop.
Trip 3: The bus travels from the 13th outbound stop to the 11th inbound
stop.
Trip 4: The bus travels from the 13th outbound stop to the 2nd outbound
stop.
The maximum and minimum travel time for any trip can be obtained
from GPS data or from distance-based fare buses. In addition, the maximum bus
layover time can be assumed to be 30 minutes. According to the central limit
theorem, bus travel time on a known road segment should follow a normal
distribution; therefore, we can compute the probability of each scenario
and choose the trip with the maximum probability. Let the travel time from stop
i to stop j be denoted as $t_{ij}$. The probability density function of $t_{ij}$ is defined
as:
$$
p(t_{ij}) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(t_{ij} - \mu_{ij})^2}{2\sigma_{ij}^2}\right) dt_{ij} \tag{8}
$$
where $\mu_{ij}$ is the average travel time from stop i to stop j, and $\sigma_{ij}$ is the
standard deviation of travel time from stop i to stop j. If the maximum and
minimum travel time (plus maximum and minimum bus layover time) between
stop i and stop j are $\max(t_{ij})$ and $\min(t_{ij})$ respectively, then the 95%
confidence interval of travel time can be further expressed as:
$$
p(t_{ij}) = \frac{1}{\sqrt{2\pi\left(\frac{\max(t_{ij}) - \min(t_{ij})}{3.92}\right)^2}}
\exp\left(-\frac{\left(t_{ij} - \frac{\max(t_{ij}) + \min(t_{ij})}{2}\right)^2}{2\left(\frac{\max(t_{ij}) - \min(t_{ij})}{3.92}\right)^2}\right) dt_{ij} \tag{10}
$$
The probabilities of the four trips above are calculated as 0.54, 0.87,
0.0003 and 0, respectively. Therefore, the transaction cluster sequence starts at the 5th
inbound stop and ends at the 2nd outbound stop, and thus a terminus should
exist within this trip. This result matches the actual bus trip. The Bayesian
decision tree algorithm can be further utilized to infer other uncertain stops
within this identified bus trip.
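The travel-time model of Equation (10) can be sketched in Python as below. Because a 95% interval of a normal distribution spans the mean plus or minus 1.96 standard deviations, the standard deviation is taken as (max − min) / 3.92 and the mean as (max + min) / 2. The candidate travel-time bounds are hypothetical, and the values returned are densities used only to rank the candidates; they are not the probabilities reported in the text.

```python
from math import exp, pi, sqrt
from typing import Dict, Tuple


def trip_density(t_obs: float, t_min: float, t_max: float) -> float:
    """Normal density of t_obs, with [t_min, t_max] read as a 95% interval."""
    mu = (t_max + t_min) / 2
    sigma = (t_max - t_min) / 3.92  # 95% interval spans mu +/- 1.96 sigma
    return exp(-(t_obs - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)


def most_likely_trip(t_obs: float, bounds: Dict[str, Tuple[float, float]]) -> str:
    """Name of the candidate trip whose travel-time model best fits t_obs."""
    return max(bounds, key=lambda name: trip_density(t_obs, *bounds[name]))


# Hypothetical (min, max) travel times in minutes for four candidate trips.
candidates = {"trip 1": (8.0, 16.0), "trip 2": (15.0, 27.0),
              "trip 3": (40.0, 60.0), "trip 4": (55.0, 80.0)}
best = most_likely_trip(20.0, candidates)
```

With these illustrative bounds, an observed duration of 20 minutes is best explained by the second candidate, whose interval is the only one that contains it.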
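[Figure: example Bayesian decision tree. Time step 1: stop 1. Time step 2: stops 2, 3, 4. Time step 3: stops 3, 4, 5 (from stop 2); stops 4, 5, 6 (from stop 3); stops 5, 6, 7 (from stop 4).]
Path probability: 1->2->3: 0.36; 1->2->4: 0.32; 1->2->5: 0.27; 1->3->4: 0.31; 1->3->5: 0.21; 1->3->6: 0.19; 1->4->5: 0.12; 1->4->6: 0.07; 1->4->7: 0.04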
Assume the initial boarding stop is 1. The potential stops in the next step
could be stop 2, stop 3, or stop 4 because they are all within reachable range.
Assuming that the situation is similar for the remaining stops, a decision tree
can be fully established. The traditional exhaustive search traverses each
potential path and selects the one with the maximum probability. With this method, we
need to calculate the path probability nine times, and the number
of paths to be calculated increases exponentially with the number of time steps.
However, at time step 3, two or more paths end at each of stops 4, 5
and 6. Before carrying on the computation in the next time step, we can
compare the probability of the paths with the same ending stop, and choose the
maximum one, which is also called the partial best path.
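The selection of partial best paths can be checked with a few lines of Python. The mapping of the nine path probabilities to paths assumes that the values in the example tree are listed in left-to-right order (children 3, 4, 5 of stop 2, then 4, 5, 6 of stop 3, then 5, 6, 7 of stop 4); the function name is hypothetical.

```python
from typing import Dict, Tuple

# Path probabilities at time step 3, read from the example tree in the text.
paths: Dict[Tuple[int, ...], float] = {
    (1, 2, 3): 0.36, (1, 2, 4): 0.32, (1, 2, 5): 0.27,
    (1, 3, 4): 0.31, (1, 3, 5): 0.21, (1, 3, 6): 0.19,
    (1, 4, 5): 0.12, (1, 4, 6): 0.07, (1, 4, 7): 0.04,
}


def partial_best_paths(candidates: Dict[Tuple[int, ...], float]
                       ) -> Dict[int, Tuple[Tuple[int, ...], float]]:
    """Keep, for each ending stop, only the path with the highest probability."""
    best: Dict[int, Tuple[Tuple[int, ...], float]] = {}
    for path, p in candidates.items():
        end = path[-1]
        if end not in best or p > best[end][1]:
            best[end] = (path, p)
    return best


survivors = partial_best_paths(paths)
kept = sorted(path for path, _ in survivors.values())
```

Nine candidate paths are reduced to five, one per ending stop, before the next time step is expanded.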
In time step 3, only the following five paths are selected: 1->2->3,
1->2->4, 1->2->5, 1->3->6, and 1->4->7. Recall that the Markov chain model
states that the probability of the current state given a previous state sequence
depends only on the previous state. Hence, the five paths calculated in time step 3
guarantee the most probable paths in time step 4 without extra computation
of other paths. According to Equation (11), we can express the optimized
procedure in mathematics as:
We can now calculate the probability at each time step recursively until
the end of the route. Computing the probability in this way is far less
computationally expensive than calculating the probabilities for all
sequences. If we denote the total number of stops for a specific route as n,
and the SC transactions are classified into m clusters, which correspond to m
time steps in the Bayesian decision tree, then the computational complexity of
the exhaustive approach can be written as O(m^n). With the optimized
algorithm, the computational complexity is only O(mn). With this optimization,
the algorithm finishes in a reasonable time and can be applied efficiently in
practice.
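The recursion can be sketched as a Viterbi-style dynamic program that stores a single best probability and one back-pointer per ending stop at every time step. This is a minimal illustration and not the authors' implementation: the stationary jump probabilities (one, two or three stops ahead), the function name and the route size are all hypothetical, and the route is assumed to be long enough that at least one stop is reachable at every step.

```python
def best_sequence(n_stops, n_steps, jump_probs, start_stop=1):
    """Most probable stop sequence, keeping one partial best path per ending stop."""
    best = {start_stop: 1.0}   # ending stop -> probability of its best path
    back_pointers = []         # one {stop: previous stop} dict per time step
    for _ in range(n_steps - 1):
        candidates, pointers = {}, {}
        for prev, prob in best.items():
            for jump, p_jump in jump_probs.items():
                nxt = prev + jump
                if nxt > n_stops:
                    continue   # beyond the end of the route
                p = prob * p_jump
                if p > candidates.get(nxt, 0.0):
                    candidates[nxt] = p
                    pointers[nxt] = prev
        best = candidates
        back_pointers.append(pointers)
    # Trace the best final stop back to the start.
    end_stop = max(best, key=best.get)
    path = [end_stop]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[end_stop]

# Hypothetical transition model: the bus advances 1, 2 or 3 stops per time step.
jump_probs = {1: 0.5, 2: 0.3, 3: 0.2}
print(best_sequence(n_stops=10, n_steps=4, jump_probs=jump_probs))
```

Because each of the m - 1 steps touches at most n ending stops and a fixed number of jumps, the work in this sketch grows roughly in proportion to m times n, whereas enumerating every path would grow exponentially with the number of steps.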
Validation
By installing GPS receivers on flat-rate buses, we can collect geospatial
information and spot speed data in real time. Approximately 50% of the buses
in Beijing are equipped with GPS devices, and GPS data are updated every 30
seconds. These data provide the opportunity to validate the Markov chain based
Bayesian decision tree algorithm developed in this study for passenger origin
data extraction. GPS coordinates and timestamps can be used to determine bus
boarding and alighting locations and times. First, the geographical features
of bus stops and the consecutive GPS records for each bus are joined using
latitude and longitude coordinates. Then, by matching the passenger check-in
time in the SC transaction database, the boarding stop ID can be associated
with each transaction. Because the stop IDs inferred from GPS data have been
validated using the bus on-board survey method, they can be considered
‘ground truth’ data for comparison purposes.
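The second matching step can be illustrated with a small sketch. It assumes that the first step has already produced, for one bus, a time-sorted list of stop arrivals derived from GPS, and it assigns each check-in to the most recent stop reached at or before the check-in time. This matching rule, the function name and the sample values are assumptions made for illustration only.

```python
from bisect import bisect_right

def assign_boarding_stops(stop_arrivals, checkin_times):
    """Match each check-in time to the latest stop reached at or before it.

    stop_arrivals: list of (arrival_time_seconds, stop_id), sorted by time.
    Returns None for check-ins that precede the first recorded arrival.
    """
    arrival_times = [t for t, _ in stop_arrivals]
    assigned = []
    for t in checkin_times:
        idx = bisect_right(arrival_times, t) - 1
        assigned.append(stop_arrivals[idx][1] if idx >= 0 else None)
    return assigned

# Hypothetical GPS-derived arrivals and smart card check-in times (seconds).
arrivals = [(100, 1), (220, 2), (400, 3)]
checkins = [90, 100, 250, 399, 500]
print(assign_boarding_stops(arrivals, checkins))
```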
In this section, the Markov chain based Bayesian decision tree algorithm
is first validated using GPS data for route 22, and then, several sensitivity
analyses are conducted to investigate impacts of different parameter settings in
Bayesian decision tree. Finally, a computational complexity experiment is also
included at the end of this section.
Algorithm Validation
Flat-rate route 22 was selected to infer unknown boarding locations using
the Markov chain based Bayesian decision tree algorithm, and GPS data
associated with route 22 were also collected to verify the result. The SC
transaction data and GPS data were all recorded on April 7, 2010. The minimum
stop probability is defined as 0.05: if the transition probability of a stop
is less than 0.05, the stop is abandoned. Route 22 contains a total of 34
inbound and outbound stops as shown in Figure 7.
22. Error is defined as the stop ID difference (two stops that are adjacent to
each other should have consecutive IDs) between the ground truth stop based
on GPS data and the stop inferred by the proposed algorithm. For Route 22,
95% of passenger boarding stops were inferred by the proposed algorithm.
55.8% of the results exactly match the stops derived from GPS data.
There are 11,645 inferred boarding stops within a three-stop distance of
the actual boarding stop, accounting for approximately 96.7% of the total
identified records or 91.6% of the total records.
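The accuracy figures above can be computed from paired stop IDs as in the following sketch. It assumes that stop IDs are consecutive integers, that an unidentified boarding stop is stored as None, and that the exact-match share is taken over identified records; the function name and the sample data are hypothetical.

```python
def accuracy_summary(ground_truth, inferred, max_error=3):
    """Shares of identified, exactly matched and near-matched boarding stops."""
    # Error = absolute stop ID difference, for identified records only.
    errors = [abs(g - i) for g, i in zip(ground_truth, inferred) if i is not None]
    identified = len(errors)
    exact = sum(1 for e in errors if e == 0)
    within = sum(1 for e in errors if e <= max_error)
    total = len(ground_truth)
    return {
        "identified_share": identified / total,
        "exact_share": exact / identified,
        "within_share_of_identified": within / identified,
        "within_share_of_total": within / total,
    }

# Hypothetical ground truth (GPS) and inferred stop IDs for five transactions.
gps_stops = [5, 6, 7, 8, 9]
inferred_stops = [5, 7, None, 12, 9]
print(accuracy_summary(gps_stops, inferred_stops))
```

The two "within three stops" shares differ only in their denominator, which is why the text reports both 96.7% (of identified records) and 91.6% (of all records).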
Figure 8. Bayesian Decision Tree Algorithm Accuracy for Route 22 based on GPS
Speed.
The results are very encouraging. In Beijing’s transit network, an error
within three stops is acceptable for transit planning-level studies, since
these stops mostly belong to the same traffic analysis zone (TAZ) due to the
high transit network density.
Sensitivity Analysis
1. Source of travel speed calculation
Recall that, in computing the transition matrix, the mean and standard
deviation of travel speed were extracted from GPS data. However, many
flat-rate routes still have no GPS devices. A sensitivity analysis is
therefore carried out to understand how the algorithm’s results change when
the travel speed mean and standard deviation are inaccurate. Table 4 and
Figure 9 show the results when the mean and standard deviation of travel
speed are retrieved from distance-based fare routes that share common stops
with the “no-GPS” flat-fare route. Because both the boarding stop and the
alighting stop are known for distance-based fare buses, we can still
extract the mean and standard deviation of travel speed between adjacent
stops for transition matrix construction.
Figure 9. Bayesian Decision Tree Algorithm Accuracy for Route 22 Based on Speed
from Distance-based Fare Routes.
Minimum stop probability plays a vital role in both the accuracy and the
efficiency of the proposed algorithm. A threshold that is too high may
eliminate possible boarding stop candidates, whereas a threshold that is too
low may consume additional computational resources. In this sensitivity
analysis, the minimum stop probability is set to 0.1, which means that if the
calculated transition probability of a particular stop is lower than 0.1, the
stop is considered an unlikely boarding stop. The comparison result is
presented in Table 5 and Figure 10.
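The effect of the threshold on the candidate set can be sketched as follows. The candidate stops and their transition probabilities are hypothetical, as is the function name; only the two threshold values (0.05 and 0.1) come from the text.

```python
def prune_candidates(transition_probs, min_stop_probability=0.05):
    """Drop candidate stops whose transition probability is below the threshold."""
    return {stop: p for stop, p in transition_probs.items()
            if p >= min_stop_probability}

# Hypothetical transition probabilities for four candidate boarding stops.
candidates = {2: 0.62, 3: 0.27, 4: 0.08, 5: 0.03}
print(sorted(prune_candidates(candidates, 0.05)))
print(sorted(prune_candidates(candidates, 0.1)))
```

In this toy case, raising the threshold from 0.05 to 0.1 removes one more candidate, which reduces the branching of the decision tree but also discards a stop that might have been the true boarding location.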
When the minimum stop probability increases, fewer boarding stops can be
inferred using the proposed algorithm. In addition, the inferred boarding
stops are less accurate than those obtained with a minimum stop probability of
0.05. This is a reasonable result since a rigorous probability threshold may
limit the propagation of errors. However, a trade-off exists between algorithm
accuracy and efficiency.
Figure 10. Bayesian Decision Tree Algorithm Accuracy for Route 22 with Minimum
Stop Probability as 0.1.
Figure 11. Markov Chain based Bayesian Decision Tree Algorithm Run Time
Analysis.
The Markov chain based BDC algorithm saves a significant amount of run
time compared with the basic BDC algorithm. On average, it runs 142 times
faster than the basic algorithm. This is because most of the redundant
calculation steps have already been excluded using the Markov chain property.
CONCLUSION
Unlike most entry-only AFC systems in other countries, Beijing’s
AFC system does not record boarding location information when passengers
board the buses and swipe their smart cards. This creates challenges for
passenger OD estimation.
This study aims to tackle this issue. After further investigation of the SC
transaction data, we proposed a Markov chain based Bayesian decision tree
algorithm to infer passengers’ boarding stops. This algorithm is based on
Bayesian inference theory, and the normal distribution of travel speed between
adjacent stops is used to depict the randomness of passenger boarding stops.
Both the mean and the standard deviation can be obtained from GPS data or
distance-based fare routes. Moreover, the stationary Markov chain property is
incorporated to further reduce the computational complexity of the algorithm
to a linear load. The accuracy of the optimized algorithm is demonstrated
using the SC transaction data.
This algorithm can be improved in various ways. For instance, it does not
perform well when the travel speeds between adjacent stops are not distinct,
i.e., when the travel speed probability calculated for each stop is similar.
A potential countermeasure for this issue is to incorporate heterogeneity,
e.g., the accessibility of a subway station or a central business district
(CBD) for each transit stop.
In summary, the Markov chain based Bayesian decision tree algorithm
provides an effective and efficient data mining approach for passenger origin
data extraction. It lays a solid foundation for mining transit passenger ODs
from the SC transaction data for transit system planning and operations.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the funding support from the
National Natural Science Foundation of China (51408019) and the
Fundamental Research Funds for the Central Universities. All data used for
this study were provided by Beijing Transportation Research Center (BTRC).
We are grateful to BTRC for their data support.
In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.
Chapter 2
ABSTRACT
Differentiation between learners and adapted, personalized learning are
active research directions in technology for human learning today. They
lead to the design of educational systems that integrate learner-monitoring
strategies to assist each learner by evaluating his knowledge and skills on
the one hand, and by detecting and analyzing his errors and obstacles on
the other. In this respect, formative evaluation is the process used to
capture data on the strengths and weaknesses of a learner. To be useful,
these data must be objectively analyzed so that they can be used to manage
the following sessions. There are different data mining
tools using different algorithms for data analysis and knowledge
34 Farida Bouarab-Dahmani and Razika Tahi
1. INTRODUCTION
Educational data mining (Romero, C. and Ventura, S., 2007) is an area of
research in which appropriate methods and technical means are tested to
produce useful knowledge from different types of data (marks, errors, data on
the learner, log files ...). These data can be produced by any kind of
learning: face-to-face learning, e-learning or even blended learning.
Learner differentiation and adaptive learning are fundamental to the
competency-based approach, which is widely discussed in educational science
today. The value of a competency-based approach in our new world of global
economy and advanced technologies is obvious. However, the mistrust and
reluctance of teachers and staff, often observed when it comes to applying
this approach in the field, are justified by the difficulty of assessing
skills. In addition, this new approach is oriented toward individual
monitoring, which requires more work from everyone in educational
institutions.
With the introduction of the LMD (License, Master, Doctorate) system in
Algerian universities, for example, the competency-based approach has to be
applied by individually tracking each student. However, this objective is not
attainable with a large number of students, which makes learner
differentiation a tedious task for every faculty member.
Knowledge Extraction from an Automated Formative Evaluation … 35
The study proposed in this chapter discusses the possible add-ons of data
mining tools to computer-based learning systems that support competency-based
pedagogies, in order to obtain adaptive learning. We are particularly
interested in data mining algorithms for knowledge extraction from data
produced by a formative and automated evaluation based on learners’ errors,
integrated in the ODALA approach. This approach was developed and evaluated
in our previous work (Bouarab-Dahmani F., 2010) (Bouarab-Dahmani F. et al.,
2011). The data mining algorithms considered in this study are C4.5 (with
the J48 implementation) for classification, the A Priori algorithm for
association rule deduction and K-means for clustering.
First, we give an overview of the ODALA approach and data mining
technology. Then, in the third section, we present our tests on mining
learning data using some of the algorithms implemented in the Weka tool.
In the fourth section, we discuss the test results and the implementation of
data mining in a CEHL. The conclusion summarizes our contribution and the
possible research perspectives of the presented work.
[Figure: Knowledge extraction (data selection, preprocessing, application of
data mining algorithms, interpretation)]
For that, we are studying already existing algorithms and data mining
tools in order to see their possible use for the exploration of formative
The relational tables considered in the tests are given in Table 1 with only
the attributes used in the experiments.
The approach followed for each test, in which data are analyzed by a data
mining algorithm implemented in the Weka tool, is composed of these steps:
Setting of a goal
Selection of the relational tables involved
Relationship | Commentary
Learner (Id_learner, Name, E-mail, level) | Belongs to the learner’s model
Session (session_id, start_date, end_date, id_learner) | Session of a learner
) | exercises table
Error (id_error, label_error, type_error, id_ic) | Table of potential errors. Id_ic is the identifier of the knowledge item, a granular component of the domain, to which it corresponds
Systems that construct classifiers are among the commonly used tools in data
mining. They take as input a collection of cases, each belonging to one of a
small number of classes and described by its values for a fixed set of
attributes, and output a classifier that can accurately predict the class to
which a new case belongs (Wu et al., 2008).
The C4.5 algorithm (Quinlan, J.R., 1993) is one of the most cited
algorithms for classification. It is an extension of the ID3 algorithm
proposed by Quinlan for decision tree construction. One of the most attractive
aspects of decision trees lies in the interpretation and construction of
decision rules. The confidence of a rule is the proportion of records in the
leaf node for which the decision rule is true. If the confidence is 100%
(= 1), the leaf node is pure and the decision rule is perfect.
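The notion of rule confidence can be illustrated with a short sketch. It assumes that the rule attached to a leaf predicts the majority class of the records that reach that leaf; the function name and the sample labels are hypothetical and do not come from the WebSiela data.

```python
from collections import Counter

def rule_confidence(leaf_labels):
    """Return the majority class of a leaf and the confidence of its rule."""
    counts = Counter(leaf_labels)
    predicted, hits = counts.most_common(1)[0]
    return predicted, hits / len(leaf_labels)

# Hypothetical leaf reached by ten learners: eight "good" and two "average".
print(rule_confidence(["good"] * 8 + ["average"] * 2))
# A pure leaf has a confidence of 1.
print(rule_confidence(["bad"] * 5))
```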
We use the J48 algorithm, which is an implementation of C4.5, with the
data collected during the execution of the WebSiela system, a prototype we
developed for algorithmic learning with automated correction of solutions
freely built by learners to open questions. The input data table for the data
mining process is derived from a combination of the relational tables given in
Table 1. We chose a very simple example concerning the classification of
students according to their level, which can be good, average or bad. This
classification clarifies the relationship between a class (a level) and the
number of errors (deduced from the input data).
To test the reliability and usability of the algorithm, we introduced
perturbations into the input data to observe how the algorithm reacts. In most
cases, the generated results were consistent with the inputs. For example,
(see Figure 5), gave new adequate thresholds and detected the
misclassification.
The A Priori algorithm (Agrawal R., Srikant R., 1994) is based on the fact
that every subset of a frequent itemset is itself frequent. Indeed, if
the set {A, B} is frequent in the database, the sets {A} and {B} are themselves
frequent in the database. This algorithm generates rules involving correlated
data. We only have to specify the set of attributes concerned by the analysis.
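The subset property is what allows a level-wise search for frequent itemsets. The following sketch is a simplified pure-Python version written for illustration, not the Weka implementation: candidates of size k are built only from frequent itemsets of size k - 1 and are discarded as soon as one of their subsets is not frequent. The error identifiers and the support threshold are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Support = share of transactions that contain the candidate.
        support = {c: sum(1 for t in transactions if c <= t) / n for c in current}
        level = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k - 1)-subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical sets of error identifiers detected in four learning sessions.
sessions = [{"e1", "e2"}, {"e1", "e2", "e3"}, {"e1", "e3"}, {"e2"}]
for itemset, support in frequent_itemsets(sessions, min_support=0.5).items():
    print(sorted(itemset), support)
```

In this toy example the pair {e2, e3} is not frequent, so the triple {e1, e2, e3} is pruned without its support ever being counted.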
The objective chosen in this case was to look for correlations between the
exercises, learners and errors detected in a learning session. The data to be
included in the input table of the A Priori algorithm, called input_base2 (see
Figure 6), are obtained by algebraic calculus on the two tables Session and
com-Error. We ran the algorithm directly on the three attributes id_learner,
id_exercise and id_error to extract all possible rules (each rule is given in
if ... then form).
Figure 8. A result of A Priori execution under the tool Weka (Test 2).
After different tests, we concluded that this algorithm could be of great
help in enriching the formative evaluation or in obtaining a global evaluation
from the CEHL. However, an automated interpretation of the derived rules will
not be simple to develop, and a “manual” interpretation will be very time
consuming, since not all of the generated rules are significant.
decreases with increasing k till it hits zero when the number of clusters equals
the number of distinct data-points. K-means remains the most widely used
algorithm for clustering in practice (Wu et al., 2008).
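The behaviour described above as k grows can be observed with a small pure-Python version of K-means. The sketch assumes that the quantity in question is the usual K-means cost (the sum of squared distances from each point to its nearest centroid), uses the first k distinct points as initial centroids, and runs on hypothetical (number of errors, number of exercises) pairs; it is not the Weka implementation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; returns the centroids and the final cost."""
    centroids = list(dict.fromkeys(points))[:k]  # first k distinct points
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(members) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    cost = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, cost

# Hypothetical learners described by (number of errors, exercises done).
learners = [(1, 1), (2, 1), (9, 2), (10, 3)]
for k in range(1, 5):
    print(k, kmeans(learners, k)[1])
```

On this small data set the cost shrinks at every increase of k and reaches zero at k = 4, the number of distinct points. With other data or other initial centroids, Lloyd iterations can stop in a local minimum, so the decrease is only guaranteed for the optimal clustering.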
entries so that we can easily follow and analyze the results. The data mining
objective in this test is to obtain clusters of learners according to the
number of their errors and/or the number of exercises done.
Figure 10 shows the execution results of K-means via Weka with the
number of clusters K=2. The results can be interpreted as follows:
Full data: the general average of each attribute over all the instances.
Cluster 0 (the first cluster found) has an average of 9.75 errors and an
average of 2.375 exercises done. This cluster has sixteen (16) learners.
Cluster 1 (the second cluster found) contains fourteen (14) learners, with
an average of 2.0714 errors and an average of 1.0714 exercises done.
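As a consistency check on these figures, the general average of each attribute should equal the size-weighted average of the two cluster means. The sketch below recomputes it from the values quoted in the text; the dictionary keys and the function name are hypothetical.

```python
# Cluster sizes and means as reported for K=2.
clusters = [
    {"size": 16, "mean_errors": 9.75, "mean_exercises": 2.375},
    {"size": 14, "mean_errors": 2.0714, "mean_exercises": 1.0714},
]

def overall_mean(clusters, key):
    """Size-weighted average of a per-cluster mean."""
    total = sum(c["size"] for c in clusters)
    return sum(c["size"] * c[key] for c in clusters) / total

print(round(overall_mean(clusters, "mean_errors"), 2))
print(round(overall_mean(clusters, "mean_exercises"), 2))
```

Under this assumption, the 30 learners average about 6.17 errors and about 1.77 exercises done.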
Figure 10. Screen result of clustering with the tool Weka (test1).
One possible interpretation is that the algorithm divided the learners into
two clusters: those who did one (1) exercise, with a number of errors between
0 and 5, and those who did 2 or 3 exercises, with a number of errors between
6 and 15. Figure 11 shows a clustering with K=3 on two attributes: the number
of exercises done and the number of errors. We can see that the
characteristics of the clusters found have changed.
Figure 11. Screen result of clustering with the tool Weka (test2).
Figure 12. Screen result of clustering with the tool Weka (test3).
Figure 13. Interface of the WebSiela data mining with Weka algorithms.
or for a given use case. In addition, clustering and classification need only
a simple interpretation program.
CONCLUSION
We explored, in this chapter, the use and integration of the Weka
algorithms for educational data mining within a CEHL adopting learning by
doing in the context of the competency-based approach. This exploration was
done by executing some of the algorithms implemented in the Weka tool, first
on a limited set of data and then on data collected in the WebSiela
system. These algorithms were used to explore the results of a formative
evaluation based on the ODALA approach. The observed results, especially
those obtained through predictive classification with the C4.5 algorithm,
association rule detection with the A Priori algorithm and clustering with
the K-means algorithm, are useful for tracking learners’ progress. However,
the A Priori algorithm as implemented in Weka gives results that are
difficult to interpret automatically, and even manually in the case of big
data. Finally, to close the loop of the learning process with a knowledge
extraction module, we need to develop specific programs that can use the Weka
API in the implementation step for clustering and classification. For
frequent itemset detection, we have to study the A Priori algorithm in more
depth, as well as other association rule detection algorithms, to get
interpretable results. Another research perspective is the development of an
automated and reusable filter and interpreter of the algorithms’ results,
completing the loop of the learning process, in which the data mining results
are used to update the learners’ models or e-portfolios.
REFERENCES
Agrawal, R., Srikant, R. (1994). “Fast Algorithms for Mining Association Rules
in Large Databases”. Proc. of the 20th Int’l Conf. on Very Large Data
Bases (VLDB), June 1994, pp. 478-499. IBM Research Report RJ 9839.
Baker, R.S.J.d., Corbett, A.T., Aleven, V. (2008). “More Accurate Student
Modelling Through Contextual Estimation of Slip and Guess Probabilities
in Bayesian Knowledge Tracing”. Proceedings of the 9th International
Conference on Intelligent Tutoring Systems, pp. 406-415.
Beck, J.E., Mostow, J. (2008). “How who should practice: Using learning
decomposition to evaluate the efficacy of different types of practice for
different types of students”. Proceedings of the 9th International
Conference on Intelligent Tutoring Systems, pp. 353-362.
Bouarab-Dahmani, F. (2010). Modélisation basée ontologies pour
l’apprentissage interactif - Application à l’évaluation des connaissances
de l’apprenant. Doctoral thesis in computer science, Mouloud
Mammeri University, Tizi Ouzou, Algeria.
Bouarab-Dahmani, F., Si-Mohammed, M., Comparot, C., Charrel, P.J. (2011).
“Adaptive Exercises Generation using an Automated Evaluation and a
Domain Ontology: The ODALA+ Approach”. International Journal of
Emerging Technologies in Learning (IJET), Vol. 6, Issue 2, June 2011, pp. 4-10.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan
Kaufmann.
Romero, C., Ventura, S. (2007). “Educational data mining: A survey from
1995 to 2005”. Expert Systems with Applications, (33), pp. 135-146.
Shih, B., Koedinger, K.R., Scheines, R. (2008). “A Response-Time Model for
Bottom-Out Hints as Worked Examples”. Proceedings of the First
International Conference on Educational Data Mining, pp. 117-126.
Witten, I.H., Frank, E. (1999). Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann, October
1999.
Witten, I.H., Frank, E., Hall, M.A. (2011). Data Mining: Practical Machine
Learning Tools and Techniques. Third edition, Morgan Kaufmann Publishers,
January 2011 (ISBN: 978-0-12-374856-0).
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H.,
McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M.,
Hand, D.J., Steinberg, D. (2008). “Top 10 algorithms in data mining”.
Knowledge and Information Systems, 14:1–37. DOI 10.1007/s10115-007-0114-2.
In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.
Chapter 3
ABSTRACT
Since the concept of ‘failed states’ was coined in the early 1990s, it
has come to occupy a top-tier position on the international peace and
security agenda. This study uses data mining techniques to examine the
effect of various social, economic and political factors on states’ failure at
the global level. Data mining techniques use a broad family of
computationally intensive methods that include decision trees, neural
networks, rule induction, machine learning and graphic visualization.
Three artificial neural network models: multi-layer perceptron neural
network (MLP), radial basis function neural network (RBFNN) and self-
organizing maps neural network (SOM) and one machine learning
technique (support vector machines [SVM]) are compared to a standard
statistical method (linear discriminant analysis (LDA). The variable sets
considered are demographic pressures, movement of refugees, group
paranoia, human flight, regional economic development, economic
decline, delegitimization of the state, public services’ performance,
human rights status, security apparatus, elites’ behavior and the role
played by other states or external political actors. The study shows how it
is possible to identify various dimensions of states’ failure by uncovering
complex patterns in the dataset, and also shows the classification abilities
of data mining techniques.
54 Mohamed M. Mostafa
Modeling Nations’ Failure via Data Mining Techniques 55
To determine the major factors that affect state failure at the
global level; and
To benchmark the performance of computational intelligence models
against traditional statistical techniques.
Thus, this paper makes at least two important contributions to the broader
literature on state failure. First, most previous studies devoted to understanding
the state failure phenomenon consist of case-study evaluations of a few
failed states (e.g., Lemarchand, 2003; Reno, 2003). Our study includes 177
nations, which makes it the most comprehensive study so far. By doing so, the
study adds depth to the knowledge base on state failure. Second, by employing
computational intelligence methods such as neural networks and support
vector machines, the study adds breadth to the debate over the causes of state
failure at the global level. The paper is organized as follows. The next section
summarizes the methodology used to conduct the analysis. The subsequent
section presents the empirical results of the analysis. The final section sets
out some implications of the analysis, discusses the research limitations and
explores avenues for future research.
2. METHOD
2.1. Multi-Layer Perceptron
An MLP consists of sensory units that make up the input layer, one or more
hidden layers of processing units (perceptrons), and one output layer of
processing units (perceptrons). The MLP performs a functional mapping from
the input space to the output space. The output of the MLP is compared to a
target output and an error is calculated. This error is back-propagated through
the network and used to adjust the weights. The process aims at
minimizing the mean square error between the network’s predicted output and
the target output.
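The forward mapping, error calculation and weight adjustment described above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the toy XOR data, the three sigmoid hidden units, the linear output unit and every variable name are hypothetical choices, not the configuration used in this study.

```r
# Minimal MLP sketch: 2 inputs -> 3 sigmoid hidden units -> 1 linear output.
# Toy data, layer sizes and names are illustrative assumptions only.
set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))

X <- matrix(c(0, 0, 0, 1, 1, 0, 1, 1), ncol = 2, byrow = TRUE)  # 4 patterns
target <- matrix(c(0, 1, 1, 0), ncol = 1)                        # XOR targets

W1 <- matrix(rnorm(2 * 3, sd = 0.5), nrow = 2)  # input -> hidden weights
b1 <- rep(0, 3)                                 # hidden biases
W2 <- matrix(rnorm(3, sd = 0.5), nrow = 3)      # hidden -> output weights
b2 <- 0                                         # output bias
rate <- 0.1                                     # learning rate

mse <- numeric(2000)
for (epoch in seq_along(mse)) {
  # Forward pass: map the inputs to the network output
  H <- sigmoid(sweep(X %*% W1, 2, b1, "+"))
  out <- H %*% W2 + b2
  err <- out - target
  mse[epoch] <- mean(err^2)

  # Backward pass: propagate the error and form the MSE gradients
  d_out <- 2 * err / nrow(X)
  d_hid <- (d_out %*% t(W2)) * H * (1 - H)

  # Gradient-descent weight update, once per epoch
  W2 <- W2 - rate * t(H) %*% d_out
  b2 <- b2 - rate * sum(d_out)
  W1 <- W1 - rate * t(X) %*% d_hid
  b1 <- b1 - rate * colSums(d_hid)
}
```

Plotting `mse` against the epoch number shows how the mean square error evolves as the weights are adjusted; the sketch makes no claim that this small network fits the toy targets exactly.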
One of the first successful applications of the MLP was reported by Lapedes and
Farber (1988). Using two deterministic chaotic time series generated by the
logistic map and the Mackey-Glass equation, they designed an MLP that can
The basic architecture of an RBFNN is a three-layer network. The input layer
is simply a fan-out layer and does no processing. The second, or hidden, layer
performs a non-linear mapping from the input space into a higher-dimensional
space in which the patterns become linearly separable. The final layer
therefore performs a simple weighted sum with a linear output.
The unique feature of the RBFNN is the process performed in the hidden
layer. The idea is that the patterns in the input space form clusters. If the
centers of these clusters are known, then the distance from each cluster center
can be measured. Furthermore, this distance measure is made non-linear, so
that a pattern lying close to a cluster center gives a value
close to 1. Beyond this area, the value drops dramatically. The notion is that
this area is radially symmetrical around the cluster center, which is why the
non-linear function is known as the radial basis function.
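The behavior of a hidden unit described above can be illustrated with a Gaussian radial basis function. The Gaussian form, the width parameter `sigma` and the example points are assumptions made for this sketch; the chapter does not state which basis function the software used.

```r
# Gaussian radial basis function: 1 at the center, decaying with distance.
# The Gaussian form and the width `sigma` are illustrative assumptions.
rbf <- function(x, center, sigma = 1) {
  d2 <- sum((x - center)^2)          # squared Euclidean distance to the center
  exp(-d2 / (2 * sigma^2))
}

center <- c(5, 5)
at_center <- rbf(c(5, 5), center)      # activation exactly at the center
nearby    <- rbf(c(5.2, 4.9), center)  # close to the center: near 1
far_away  <- rbf(c(9, 1), center)      # far from the center: near 0

# Radial symmetry: points at the same distance give the same activation
east  <- rbf(center + c(2, 0), center)
north <- rbf(center + c(0, 2), center)
```

The output layer of an RBFNN would then form a weighted sum of such activations, one per hidden unit.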
Since the RBFNN has only one hidden layer and converges quickly,
it is widely used for non-linear mappings between inputs and outputs.
Examples include detecting spam email (Jiang, 2007), financial distress
prediction (Cheng et al., 2006), public transportation (Celikoglu & Cigizoglu,
2007), classification of active components in traditional medicine (Liu et al.,
2009), classification of audio signals (Dhanalakshmi et al., 2009), prediction
of athletes’ performance (Iyer & Sharda, 2009), and face recognition
(Balasubramanian et al., 2009).
(Coussement and Van den Poel, 2008), text categorization (Hmeidi et al.,
2008), spam classification (Yu & Xu, 2008) and estimating production levels
(Chen & Wang, 2007).
The SOM, also called the Kohonen map, is a heuristic model for exploring
and visualizing patterns in high-dimensional datasets. It was first introduced to
the neural networks community by Kohonen (1982). The SOM can be viewed as a
clustering technique that identifies clusters in a dataset without the rigid
assumptions of linearity or normality made by more traditional statistical
techniques. Indeed, like k-means, it clusters data based on an unsupervised
competitive algorithm where each cluster has a fixed coordinate in a
topological map (Audrain-Pontevia, 2006). The SOM is trained with an
unsupervised training algorithm in which no target output is provided and the
network evolves until convergence. Based on Gladyshev’s theorem, it has been
shown that SOM models converge almost surely (Lo & Bavarian, 1993).
The SOM consists of only two layers: the input layer which classifies data
according to their similarity, and the output layer of radial neurons arranged in
a two-dimensional map. Output neurons will self-organize to an ordered map
and neurons with similar weights are placed together. They are connected to
adjacent neurons by a neighborhood relation, dictating the topology of the map
(Moreno et al., 2006). The number of neurons can vary from a few dozen to
several thousand. Since the SOM compresses information while preserving the
most important topological and metrical relationships of the primary data
elements on the display, it can also be used for pattern classification (Silven et
al., 2003).
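The competitive, neighborhood-based training described above can be sketched in base R as follows. This is a simplified illustration, not the algorithm of any particular package: the 4 x 4 rectangular grid, the Gaussian neighborhood function, the linear decay schedules and the toy two-cluster data are all assumptions chosen for brevity.

```r
# Sequential SOM sketch on a 4 x 4 grid; toy data and names are assumptions.
set.seed(42)
centers <- rbind(c(0.2, 0.2, 0.2), c(0.8, 0.8, 0.8))
samples <- centers[rep(1:2, each = 100), ] +
  matrix(rnorm(200 * 3, sd = 0.05), ncol = 3)        # 200 samples, 3 variables
grid  <- expand.grid(row = 1:4, col = 1:4)           # unit positions on the map
codes <- matrix(0.5 + rnorm(nrow(grid) * 3, sd = 0.01), ncol = 3)  # unit weights

# Mean distance from each sample to its best-matching unit (BMU)
mean_bmu_dist <- function(samples, codes) {
  mean(apply(samples, 1, function(x) sqrt(min(colSums((t(codes) - x)^2)))))
}
before <- mean_bmu_dist(samples, codes)

n_iter <- 2000
for (step in seq_len(n_iter)) {
  x <- samples[sample(nrow(samples), 1), ]           # present one input vector
  bmu <- which.min(colSums((t(codes) - x)^2))        # competitive step
  alpha  <- 0.5 * (1 - step / n_iter)                # decaying learning rate
  radius <- 0.5 + 2 * (1 - step / n_iter)            # shrinking neighborhood
  grid_d <- sqrt((grid$row - grid$row[bmu])^2 + (grid$col - grid$col[bmu])^2)
  h <- exp(-grid_d^2 / (2 * radius^2))               # neighborhood weights
  # Move the BMU and its map neighbors toward the presented sample
  codes <- codes + alpha * h * (matrix(x, nrow(codes), 3, byrow = TRUE) - codes)
}
after <- mean_bmu_dist(samples, codes)
```

Comparing `before` and `after` gives a rough indication of how well the unit weights have adapted to the data; units that are neighbors on the grid end up with similar weight vectors because they are updated together.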
Owing to the unsupervised character of their learning algorithm and their
excellent visualization ability, SOMs have recently been used in myriad
classification and clustering tasks. Examples include classifying cognitive
performance in schizophrenic patients and healthy individuals (Silver &
Shmoish, 2008), mutual funds classification (Moreno et al., 2006), speech
quality assessment (Mahdi, 2006), vehicle routing (Ghaziri & Osman, 2006),
network intrusion detection (Zhong et al., 2007), anomalous behavior in
communication networks (Frota et al., 2007), compounds pattern recognition
(Yan, 2006), market segmentation (Kuo et al., 2002), clustering green
consumer behavior (Mostafa, 2009) and classifying magnetic resonance brain
images (Chaplot et al., 2006).
3. RESULTS
3.1. Preliminary Data Analysis
Failed states data in this study were taken from the Fund for Peace
(FundForPeace.org) and Foreign Policy magazine (2009). The failed states
index (FSI) rates 12 social, economic and political indicators, derived from
open-source materials. These 12 indicators are: mounting demographic
pressures, massive movement of refugees and internally displaced persons,
legacy of vengeance-seeking group grievance, chronic and sustained human
flight, uneven economic development along group lines, sharp or severe
economic decline, criminalization or delegitimization of the state, progressive
deterioration of public services, widespread violation of human rights, security
apparatus as a “state within a state”, rise of factionalized elites, and intervention
of other states or external actors. The rank order of the states shown in Figure
1 is based on the total scores of the 12 indicators. For each indicator, the
ratings are placed on a scale of 0 to 10, with 0 being the lowest intensity (most
stable) and 10 being the highest intensity (least stable). The total score is the
sum of the 12 indicators and is on a scale of 0-120. In the 2009 index there are
177 states, compared to only 146 in 2006 and 75 in 2005. Only recognized
sovereign states, as determined by UN membership, are included in the analysis.
Thus, several territories such as Taiwan, Palestine and Kosovo are excluded
from the index. In 2009 the FSI ranged from 18.3 for Norway to 114.7 for
Somalia. Furthermore, the Fund for Peace places nations into four categories
based on their scores. The most at-risk countries are placed in the “Alert”
category, which includes nations with indices between 91 and 120; the
“Warning” category is reserved for countries scoring between 61 and 90; the
“Monitoring” category includes nations with a score ranging from 31 to 60;
and the “Sustainable” category includes nations scoring between 12 and 30.
Figure 2 shows boxplots for all variables used in the analysis.
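The four score bands described above can be expressed as a small lookup. The sketch below is an illustration only: the function name is hypothetical, and treating each upper bound (30, 60, 90, 120) as inclusive is an assumption made so that fractional totals such as 90.4 still receive a category, since the text gives only integer ranges.

```r
# Map total FSI scores (0-120) to the four categories described in the text.
# Assumption: each upper bound (30, 60, 90, 120) is inclusive.
fsi_category <- function(score) {
  cut(score,
      breaks = c(-Inf, 30, 60, 90, 120),
      labels = c("Sustainable", "Monitoring", "Warning", "Alert"))
}

scores <- c(Norway = 18.3, Somalia = 114.7)   # 2009 values quoted in the text
categories <- as.character(fsi_category(scores))
names(categories) <- names(scores)
```

With these cut points the lowest-scoring state in 2009 falls in the “Sustainable” band and the highest-scoring state in the “Alert” band.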
Since the FSI is supposed to measure a single “failure” dimension of states,
factor analysis was conducted to ascertain the unidimensionality of the index.
Using the standard eigenvalue cutoff of 1.0 (Child, 1990) and an inspection of
the scree plot (Figure 3), factor analysis yielded one factor. The total variance
explained (79.06 percent) exceeds the 60 percent threshold commonly used in
the social sciences to establish satisfaction with the solution (Hair et al.,
1998). We used the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy
(Kaiser, 1970) to assess the adequacy of the sample for factor extraction.
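The eigenvalue criterion mentioned above can be illustrated on a hypothetical correlation matrix. The uniform pairwise correlation of 0.8 among 12 indicators is an assumption chosen to mimic a strongly one-dimensional index; it is not the correlation structure observed in the FSI data, and the sketch uses a principal-components view of the eigenvalues.

```r
# Eigenvalue-greater-than-one rule on a hypothetical correlation matrix.
# Assumption: 12 indicators with a uniform pairwise correlation of 0.8.
p <- 12
R <- matrix(0.8, nrow = p, ncol = p)
diag(R) <- 1

ev <- eigen(R, symmetric = TRUE)$values   # eigenvalues, largest first
n_factors <- sum(ev > 1)                  # factors retained by the rule
var_explained <- ev[1] / p                # share of total variance, factor 1
```

Because the eigenvalues of a correlation matrix sum to the number of variables, dividing the first eigenvalue by `p` gives the proportion of total variance attributed to the first factor. A scree plot is simply `plot(ev, type = "b")`.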
Figure 1. Failed States Index. Sources: The Fund for Peace (FundForPeace.org) and
Foreign Policy (July/August, 2009, pp. 80-83).
There are many software packages available for analyzing MLP models.
We chose the SPSS Neural Networks package (SPSS, 2007). This software
package applies artificial intelligence techniques to automatically find an
efficient MLP architecture (the MLP design used in this study is shown in Figure
4). Typically, the application of an MLP requires a training data set and a testing
data set (Lek and Guegan, 1999). The training data set is used to train the MLP
and must contain enough examples to be representative of the overall
problem. The testing data set should be independent of the training set and is
used to assess the classification accuracy of the MLP after training. Following
Lim and Kirikoshi (2005), an error back-propagation algorithm with weight
updates occurring after each epoch was used for MLP training. The learning
rate was set at 0.1. Table 1 reports the properties of the MLP model. Table 2
shows the MLP classification accuracy. From Table 2 we see that the MLP
classifier predicted the training sample with 97.2% accuracy and the testing
sample with 94.4% accuracy.
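As a rough illustration of the training/testing partition described above, the following sketch draws a random hold-out split in base R. The sizes of 141 training and 36 testing cases are those implied by the row totals of Table 2; the seed and the variable names are hypothetical, and the actual partition used by the software is not known.

```r
# Random hold-out split of 177 cases into 141 training and 36 testing cases,
# the sample sizes implied by Table 2. Seed and names are illustrative.
set.seed(2009)
n <- 177
train_idx <- sample(n, size = 141)           # cases used to fit the network
test_idx  <- setdiff(seq_len(n), train_idx)  # independent cases for assessment
```

Keeping the two index sets disjoint is what makes the testing accuracy an honest estimate of performance on unseen states.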
                                  Predicted
Sample    Observed          bdrline  critical  in_dang  stable  Percent Correct
Training  bdrline                27         0        2       0            93.1%
          critical                0        34        0       0           100.0%
          in_dang                 0         1       65       0            98.5%
          stable                  1         0        0      11            91.7%
          Overall Percent     19.9%     24.8%    47.5%    7.8%            97.2%
Testing   bdrline                 3         0        0       1            75.0%
          critical                0         4        0       0           100.0%
          in_dang                 1         0       26       0            96.3%
          stable                  0         0        0       1           100.0%
          Overall Percent     11.1%     11.1%    72.2%    5.6%            94.4%
Dependent Variable: Failed_Status.
The RBFNN was also implemented using the SPSS Neural Networks package
(SPSS, 2007); the RBFNN design used in this study is shown in Figure 5. The
basic configuration of the RBFNN is shown in Table 3. The learning
rates for the RBFNN parameters were varied between 0.001 and 0.1, and those for
the weights between 0.1 and 0.7. Training was stopped when either the
error goal of 0.001 was reached or the maximum misclassification fell below
one percent. Table 4 reports the RBFNN classification results. From Table 4
we see that the hit ratio for the training sample is 97.9% and the hit ratio for
the testing sample is 97.0%.
                                  Predicted
Sample    Observed          bdrline  critical  in_dang  stable  Percent Correct
Training  bdrline                26         0        0       0           100.0%
          critical                0        28        1       0            96.6%
          in_dang                 0         1       76       0            98.7%
          stable                  1         0        0      11            91.7%
          Overall Percent     18.8%     20.1%    53.5%    7.6%            97.9%
Testing   bdrline                 7         0        0       0           100.0%
          critical                0         9        0       0           100.0%
          in_dang                 1         0       15       0            93.8%
          stable                  0         0        0       1           100.0%
          Overall Percent     24.2%     27.3%    45.5%    3.0%            97.0%
Dependent Variable: Failed_Status.
[Figure: “Performance of `svm'”. Horizontal axis: log10(gamma); vertical axis: C.]
                                           Predicted Group Membership
                           Failed_Code     1.0     2.0     3.0     4.0    Total
Original            Count          1.0      38       0       0       0       38
                                   2.0       6      85       2       0       93
                                   3.0       0       1      29       3       33
                                   4.0       0       0       0      13       13
                    %              1.0   100.0      .0      .0      .0    100.0
                                   2.0     6.5    91.4     2.2      .0    100.0
                                   3.0      .0     3.0    87.9     9.1    100.0
                                   4.0      .0      .0      .0   100.0    100.0
Cross-validated(a)  Count          1.0      38       0       0       0       38
                                   2.0       9      81       3       0       93
                                   3.0       0       3      22       8       33
                                   4.0       0       0       1      12       13
                    %              1.0   100.0      .0      .0      .0    100.0
                                   2.0     9.7    87.1     3.2      .0    100.0
                                   3.0      .0     9.1    66.7    24.2    100.0
                                   4.0      .0      .0     7.7    92.3    100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each
case is classified by the functions derived from all cases other than that case.
b. 93.2% of original grouped cases correctly classified.
c. 86.4% of cross-validated grouped cases correctly classified.
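The hit ratios in the notes can be recomputed from the counts. The sketch below uses the cross-validated counts reported above, with rows as observed groups and columns as predicted groups; the object names are hypothetical.

```r
# Hit ratios recomputed from the cross-validated LDA counts reported above.
# Rows are observed groups, columns are predicted groups (codes 1 to 4).
cv <- matrix(c(38,  0,  0,  0,
                9, 81,  3,  0,
                0,  3, 22,  8,
                0,  0,  1, 12),
             nrow = 4, byrow = TRUE,
             dimnames = list(observed = 1:4, predicted = 1:4))

overall   <- sum(diag(cv)) / sum(cv)   # share of all cases on the diagonal
per_group <- diag(cv) / rowSums(cv)    # hit ratio within each observed group
round(100 * overall, 1)                # 86.4
round(100 * per_group, 1)              # 100.0 87.1 66.7 92.3
```

The same two lines applied to the original (non-cross-validated) counts give the 93.2% figure in note b.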
Figure 9 displays the MLP cumulative gains chart (a similar figure was
obtained for the RBFNN). This chart shows the percentage of the overall number
of cases in a given category gained by targeting a percentage of the total
number of cases. For example, the first point on the curve for the in danger
category is at (10%, 20%), meaning that if a dataset is scored with the network
and all the cases are sorted by predicted pseudo-probability of the in danger
category, we would expect the top 10% to contain approximately 20% of all of
the cases that actually fall in the in danger category. Likewise, the top 20%
would contain approximately 40% of in danger states, and so on.
The diagonal line is the baseline curve; if 10% of the cases are selected
from the scored dataset at random, we would expect to gain approximately
10% of all of the cases that actually fall in a given category. The farther above
the baseline a curve lies, the greater the gain.
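To make the construction of a cumulative gains curve concrete, the sketch below computes one for a single target category from made-up data. Here `score` stands in for the predicted pseudo-probability of the category and `is_target` marks the cases that actually belong to it; both vectors are hypothetical and unrelated to the study's results.

```r
# Cumulative gains for one target category from hypothetical scores.
# `score` and `is_target` are made-up illustration data.
score     <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05)
is_target <- c(1,    1,    0,    1,    0,    1,    0,    0,    0,    0)

ord <- order(score, decreasing = TRUE)            # best-scored cases first
pct_cases <- seq_along(ord) / length(ord)         # share of cases targeted
gain <- cumsum(is_target[ord]) / sum(is_target)   # share of targets captured
baseline <- pct_cases                             # random selection (diagonal)
```

Plotting `gain` against `pct_cases` and adding the `baseline` diagonal reproduces the layout of a gains chart; the vertical gap between the two lines is the gain over random selection.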
Despite the satisfactory classification performance of the MLP, RBFNN
and SVM in this study, such models are often criticized as black boxes that do
not allow decision-makers to infer how the input variables
affect the models’ results. One way to address this issue is to conduct a
variable impact analysis (VIA). The purpose of VIA is to measure the
sensitivity of the network’s predictions to changes in the independent
variables. Figure 10 shows that the most important input variables for the MLP
are refugees, security apparatus and external intervention. Similar results were
obtained using the RBFNN. The lower the percent value for a given variable, the
less that variable affects the predictions. The results of the analysis can help
in selecting a new set of independent variables that allows more
accurate predictions. For example, a variable with a low impact value can be
eliminated in favor of some new variables.
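The idea of measuring how sensitive predictions are to each input can be sketched with a simple one-at-a-time perturbation. This is only an illustration of the general principle: `predict_fn` is a made-up stand-in for a trained network, the shift of 0.1 is arbitrary, and the procedure is not necessarily the one the software used in this study applies when it reports variable impact.

```r
# One-at-a-time sensitivity sketch: perturb each input, record the mean
# absolute change in the prediction, and normalize to percentages.
# `predict_fn` is a hypothetical stand-in for a trained network.
predict_fn <- function(X) {
  1 / (1 + exp(-(3 * X[, "a"] + 1 * X[, "b"] + 0 * X[, "c"])))
}

set.seed(7)
X <- matrix(rnorm(300), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
base <- predict_fn(X)

impact <- sapply(colnames(X), function(v) {
  Xp <- X
  Xp[, v] <- Xp[, v] + 0.1                 # small shift in one input only
  mean(abs(predict_fn(Xp) - base))
})
impact_pct <- 100 * impact / sum(impact)   # relative importance in percent
```

In this toy setting input `a` receives the largest share and input `c`, which the stand-in model ignores, receives none, which mirrors the reading of low-percentage variables as candidates for removal.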
There are many software packages available for analyzing SOM models.
We chose the SOMine package, version 5.0 (Viscovery software, 2008). This
software applies artificial intelligence techniques to automatically find
efficient SOM clusters. To visualize the cluster structure, some authors use the
unified distance matrix (U-matrix) (e.g., Vijayakumar et al., 2007; Stavrou et
al., 2010). However, this method does not give crisp boundaries to the clusters
(Worner & Gevrey, 2006). In this study a hierarchical cluster analysis with the
Ward linkage method was applied to the SOM to clearly delineate the edges of
each cluster. The number of neurons was set to 2000. There are two
learning algorithms for the SOM (Kohonen, 2001): the sequential, or stochastic,
learning algorithm and the batch learning algorithm. In the former, the
reference vectors are updated immediately after a single input vector is
presented. In the latter, the update is done using all input vectors. While the
batch algorithm does not suffer from convergence problems, the sequential
algorithm is stochastic in nature and is less likely to be trapped in a local
minimum. Following Ding & Patra (2007), we chose the sequential learning
algorithm to train the SOM.
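The step of applying Ward hierarchical clustering to the trained map can be sketched with base R's `hclust`. The matrix `codes` below is a hypothetical stand-in for trained unit weights (100 units by 12 indicators, built around four made-up levels), and `"ward.D2"` is base R's implementation of the Ward criterion on Euclidean distances; the commercial software used in this study may differ in its details.

```r
# Ward hierarchical clustering of SOM unit weights into four clusters.
# `codes` is a hypothetical stand-in for trained weights (100 units x 12).
set.seed(3)
levels4 <- c(2, 4, 6, 8)                                  # made-up cluster means
codes <- matrix(rep(levels4, each = 25), nrow = 100, ncol = 12) +
  matrix(rnorm(100 * 12, sd = 0.2), nrow = 100)

tree <- hclust(dist(codes), method = "ward.D2")
cluster <- cutree(tree, k = 4)        # cluster label for every map unit
table(cluster)
```

Cutting the tree at four clusters assigns every unit a crisp label, which is what allows sharp cluster borders to be drawn on the map.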
The SOM cluster results are shown in Figure 11. This two-dimensional
hexagonal grid shows a clear division of the input patterns into four clusters.
Since the order on the grid reflects the neighborhood within the data, features
of the data distribution can be read off from the emerging landscape on the
grid. Figure 11 shows four discernible clusters of failed states. This
four-cluster solution meets the qualitative criteria of Siew et al. (2002) for
selecting the representative SOM model. These criteria include
representability, explainability and level of sophistication. Representability
refers to the fact that the variables in each cluster should be distinct and carry
some information of their own. This means that the resulting profile for each
cluster should be unique and meaningful. Explainability means that the
clusters themselves are distinct. Level of sophistication means that the size of
each cluster should be monitored so that there are neither overly large clusters
that might hide more distinct groups, nor overly small clusters that
might indicate artificial clusters.
When assessing the quality of a clustering model for validation purposes,
quantitative criteria can also be used (Zhuang et al., 2009). We used the
Kohonen software package (Wehrens and Buydens, 2007) to validate the
cluster results.
[Figure: SOM map with cluster labels “in danger”, “stable”, “critical” and “border line”.]
Figure 12 shows both the SOM counts and the mapping quality. In the left
plot, the background color of a unit corresponds to the number of samples
mapped to that unit. The figure shows that the samples are reasonably spread
over the map. One of the units is empty (depicted in grey), meaning that
no samples have been mapped to it. The right plot shows the quality of the
mapping. It represents the mean distance between the objects mapped to a
particular unit and the codebook vector of that unit. A good mapping should show
small distances everywhere in the map. An alternative method, called
bi-directional Kohonen mapping (Melssen et al., 2006), has also been
implemented. The results obtained are very similar to those obtained above.
Another way to check the validity of the SOM during the training phase is
to see whether the codebook vectors become more and more similar to the
closest objects in the dataset. In Figure 13 we see the effect of the
neighborhood shrinking to include only the winning unit. This implies that
there is no need for more iterations to optimize the training parameters.
[Figure: “Training progress”. Horizontal axis: iteration (0 to 100); vertical axis: mean distance to closest unit; series: X and Y.]
Table 6. SOM cluster summary
Seg.*  Freq %  Dem   Ref   Gr_Gr  Hu_Fl  Un_Dev  Ec_Dec  S_De  P_Ser  H_Ri  Sec_Ap  F_El  Ex_In
1      21.47   8.18  7.99  8.50   7.17   8.45    7.13    8.69  7.72   8.31  8.18    8.69  7.74
2      37.85   7.51  5.59  6.37   6.30   7.48    6.69    7.53  7.05   6.88  6.65    7.09  6.74
3      20.34   5.94  3.60  4.92   6.14   6.79    5.51    5.83  5.19   4.88  4.65    4.82  5.46
4      20.34   3.45  2.48  3.69   2.53   4.10    3.41    2.94  2.44   2.98  2.21    2.83  3.03
Seg.*: 1 = critical; 2 = in danger; 3 = borderline; and 4 = stable.
# Assumes the e1071 package is loaded and that cost.v, error.cv, fsi,
# train.df, test.df and diagnosis.df are defined earlier in the script.
error.ho <- c()
for (i in cost.v) {
  # 10-fold cross-validated error (%) for cost value i
  m.cv <- svm(Class ~ .,
              data = fsi,
              type = "C-classification",
              kernel = "linear",
              cost = i,
              cross = 10)
  e <- 100 - summary(m.cv)$tot.accuracy
  error.cv <- c(error.cv, e)

  # Hold-out error (%) for cost value i
  m.ho <- svm(Class ~ .,
              data = train.df,
              type = "C-classification",
              kernel = "linear",
              cost = i,
              cross = 0)
  p <- predict(m.ho, test.df)
  correct <- sum(p == diagnosis.df[[1]])
  e <- (nrow(test.df) - correct) / nrow(test.df) * 100
  error.ho <- c(error.ho, e)
}

# Plot both error curves against the index of the cost value
y.max <- max(error.ho, error.cv)
plot.new()
plot.window(xlim = c(1, length(cost.v)), ylim = c(0, y.max))
box()
title(xlab = "SVM model complexity", ylab = "Error rate")
xticks <- seq(1, length(cost.v), 1)
yticks <- seq(0, y.max, 1)
axis(1, at = xticks, labels = xticks)
axis(2, at = yticks, labels = yticks)
points(1:length(cost.v), error.cv, type = "b", col = "red")   # cross-validation
points(1:length(cost.v), error.ho, type = "b", col = "blue")  # hold-out

# Fit an SVM with default settings on columns 2 to 13 of the data
x <- fsi[, 2:13]
y <- fsi[, 14]
model <- svm(x, y)
print(model)
summary(model)
pred <- predict(model, x)
table(pred, y)

# Grid search over gamma and cost on the first 100 cases
tobj <- tune.svm(Class ~ ., data = fsi[1:100, ], gamma = 10^(-6:-3),
                 cost = 10^(1:2))
summary(tobj)
plot(tobj, transform.x = log10, xlab = expression(log[10](gamma)),
     ylab = "C")
REFERENCES
Aiken, M. & Bsat, M. (1999). Forecasting market trends with neural networks.
Information Systems Management, 16, 42-49.
Aminian, F., Suarez, E., Aminian, M., & Walz, D. (2006). Forecasting
economic data with neural networks. Computational Economics, 28, 71-
88.
Anandarajan, M., Lee, P., & Anandarajan, A. (2001). Bankruptcy prediction of
financially stressed firms: an examination of the predictive accuracy of
artificial neural networks. International Journal of Intelligent Systems in
Accounting, Finance & Management, 10, 69-81.
Audrain-Pontevia, A. (2006). Kohonen self-organizing maps: A neural
approach for studying the links between attributes and overall satisfaction
in a services context. Journal of Consumer Satisfaction, Dissatisfaction
and Complaining Behavior, 19, 128-137.
Awad, M., & Motai, Y. (2008). Dynamic classification for video stream using
support vector machine. Applied Soft Computing, 8, 1314-1325.
Balasubramanian, M., Palanivel, S. & Ramalingam, V. (2009). Real time face
and mouth recognition using radial basis function neural networks. Expert
Systems with Applications, 36, 6879-6888.
Bensic, M., Sarlija, N., & Zekic-Susac, M. (2005). Modelling small-business
credit scoring by using logistic regression, neural networks and decision
trees. Intelligent Systems in Accounting, Finance and Management, 13,
133-150.
Berrueta, L., Alonso-Salces, R., & Heberger, K. (2007). Supervised pattern
recognition in food analysis. Journal of Chromatography A, 1158, 196-
214.
Bishop, C. (2006). Pattern recognition and machine learning, 2nd edition,
Springer, New York.
Calderon, T., & Cheh, J. (2002). A roadmap for future neural networks
research in auditing and risk assessment. International Journal of
Accounting Information Systems, 3, 203-236.
Celikoglu, H., & Cigizoglu, H. (2007). Modeling public transport trips by
radial basis function neural networks. Mathematical and Computer
Modeling, 45, 480-489.
Chaplot, S., Patnaik, L., & Jagannathan, N. (2006). Classification of magnetic
resonance brain images using wavelets as input to support vector machines
and neural network. Biomedical Signal Processing and Control, 1, 86-92.
Chen, K., & Wang, C. (2007). A hybrid SARIMA and support vector
machines in forecasting the production values of the machinery industry in
Taiwan. Expert Systems with Applications, 32, 254-264.
Cheng, C., Chen, C. & Fu, C. (2006). Financial distress prediction by radial
basis function network with logit analysis learning. Computers and
Mathematics with Applications, 51, 579-588.
Child, D. (1990). The Essentials of Factor Analysis. Cassell, London.
Churilov, L., & Flitman, A. (2006). Towards fair ranking of Olympics
achievements: The case of Sydney 2000. Computers & Operations
Research, 33, 2057-2082.
Coussement, K., & Van den Poel, D. (2008). Churn prediction in subscription
services: An application of support vector machines while comparing two
parameter-selection techniques. Expert Systems with Applications, 34,
313-327.
Darbellay, G. & Slama, M. (2000). Forecasting the short-term demand for
electricity: Do neural networks stand a better chance? International
Journal of Forecasting, 16, 71-83.
Dhanalakshmi, P., Palanivel, S. & Ramalingam, V. (2009). Classification of audio
signals using SVM and RBFNN. Expert Systems with Applications, 36,
6069-6075.
Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., & Weingessel, A. (2005).
E1071: Misc. functions of the Department of Statistics (e1071), TU Wien,
Version 1.5-11. Available from http://cran.R-project.org.
Ding, C., & Patra, J. (2007). User modeling for personalized web search with
self-organizing map. Journal of the American Society for Information
Science and Technology, 58, 494-507.
Ferreira, C. (2001). Gene expression programming: a new adaptive algorithm
for solving problems. Complex Systems, 13, 87-129.
Ferreira, C. (2004). Gene expression programming and the evolution of
computer programs. In Leonardo de Castro and Fernando Von Zuben
(Eds.). Recent Developments in biologically inspired computing. Idea
Group Publishing, pp. 82-103.
Foreign Policy (2009). The failed states index, 80-83 (July/August).
Frota, R., Barreto, G., & Mota, J. (2007). Anomaly detection in mobile
communication network using the self-organizing map. Journal of
Intelligent and Fuzzy Systems, 18, 493-500.
Fund for Peace (FundForPeace.org).
Ghaziri, H. & Osman, I. (2006). Self-organizing feature maps for the vehicle
routing problem with backhauls. Journal of Scheduling, 9, 97-114.
Gorr, W., Nagin, D., & Szczypula, J. (1994). Comparative study of artificial
neural network and statistical models for predicting student point
averages. International Journal of Forecasting, 10, 17-34.
Hair, J., Anderson, R., Tatham, R. & Black, W. (1998). Multivariate data
analysis with readings.
Hardy, Y., & Steeb, W. (2002). Gene expression programming and one-
dimensional chaotic maps. International Journal of Modern Physics C,
13-24.
Harvey, C., Travers, K., & Costa, M. (2000). Forecasting emerging market
returns using neural networks. Emerging Markets Quarterly, 4, 43-55.
Hecht-Nielsen, R. (1989). Theory of the back-propagation neural network.
International Joint Conference on Neural Networks. Washington, DC,
593-605.
Hehir, A. (2007). The myth of the failed state and the war on terror: a
challenge to the conventional wisdom. Journal of Intervention and
Statebuilding, 1, 307-332.
Helman, G., & Ratner, S. (1993). Saving failed states. Foreign Policy, 89, 3-
21.
Howard, T. (2008). Revisiting state failure: developing a causal model of state
failure based upon theoretical insight. Civil Wars, 10, 125-147.
Iyer, S. & Sharda, R. (2009). Prediction of athletes’ performance using neural
networks: an application in cricket team selection. Expert Systems with
Applications, 36, 5510-5522.
Jiang, E. (2007). Detecting spam email by radial basis function networks.
International Journal of Knowledge-based and Engineering Systems, 11,
409-418.
Kahyaoglu, T. (2008). Optimization of the pistachio nut roasting process using
response surface methodology and gene expression programming. LWT-
Food Science and Technology, 41, 26-33.
Kaiser, H. (1970). A second generation little jiffy. Psychometrika, 35, 401-
415.
Karatzoglou, A., Meyer, D., & Hornik, K. (2005). Support vector machines in
R. Available at http://statmath.wu-wien.ac.at.
Kecman, V. (2005). Support vector machines: An introduction. In L. Wang
(Ed.), Support vector machines: Theory and applications. Springer-Verlag,
Berlin, 1-48.
Kim, J., & Mueller, C. (1978). Introduction to Factor Analysis. Sage
Publications. Beverly Hills. CA.
Koetz, B., Morsdorf, F., Linden, S., Curt, T., & Allgower, B. (2008). Multi-
source land cover classification for forest management based on imaging
spectrometry and LIDAR data. Forest Ecology and Management, 256,
263-271.
Luan, F., Liu, T., Wen, Y., & Zhang, X. (2008). Classification of the fragrance
properties of chemical compounds based on support vector machine and
linear discriminant analysis. Flavor and Fragrance Journal, 23, 232-238.
Mahdi, A. (2006). Perceptual non-intrusive speech quality assessment using a
self-organizing map. Journal of Enterprise Information Management, 19,
148-164.
McMenamin, J. & Monforte, F. (1998). Short term energy forecasting with
neural networks, Energy Journal, 19: 43-52.
Melssen, W., Wehrens, R., & Buydens, L. (2006). Supervised Kohonen
networks for classification problems. Chemometrics and Intelligent
Laboratory Systems, 83, 99-113.
Mirmirani, S., & Li, H. (2004). Gold price, neural networks and genetic
algorithm. Computational Economics, 23, 193-200.
Moreno, D., Marco, P., and Olmeda, I. (2006). Self-organizing maps could
improve the classification of Spanish mutual funds. European Journal of
Operational Research, 147, 1039-1054.
Mostafa, M. (2009). Shades of green: a psychographic segmentation of the
green consumer in Kuwait using self-organizing maps. Expert Systems
with Applications, 36, 11030-11038.
Nam, K. & Yi, J. (1997). Predicting airline passenger volume, Journal of
Business Forecasting Methods & Systems, 16: 14-17.
Piazza, J. (2008). Incubators of terror: do failed and failing states promote
transnational terrorism? International Studies Quarterly, 52, 469-488.
Poh, H., Yao, J., & Jasic, T. (1998). Neural networks for the analysis and
forecasting of advertising impact. International Journal of Intelligent
Systems in Accounting, Finance & Management, 7, 253-268.
Prentice Hall (Englewood Cliffs, NJ).
Reno, W. (2003). Sierra Leone: warfare in a post-state society. In Robert I.
Rotberg (ed). State failure and state weakness in a time of terror. World
Peace Foundation, Washington, DC, 71-100.
Ruiz-Suarez, J., Mayora-Ibarra, O., Torres-Jimenez, J., & Ruiz-Suarez, L.
(1995). Short term ozone forecasting by artificial neural network.
Advances in Engineering Software, 23, 143-149.
Sakai, S., Kobayashi, K., Toyabe, S., Mandai, N., Kanda, T., & Akazawa, K.
(2007). Comparison of the levels of accuracy of an artificial neural
network model and a logistic regression model for the diagnosis of acute
appendicitis. Journal of Medical Systems. 31, 357-364.
Shan, Y., Zhao, R., Xu, G., Liebich., & Zhang, Y. (2002). Application of
probabilistic neural network in the clinical diagnosis of cancers based on
clinical chemistry data. Analytica Chimica Acta, 471, 77-86.
Siew, E., Smith, K., Churilov, L., & Ibrahim, M. (2002). A neural clustering
approach for Iso-Resource groupings for acute healthcare in Australia.
Proceedings of the 35th Annual Hawaii International Conference on
Systems Science (HICS35). IEEE Computer Society, Hawaii, USA.
Silven, O., Niskanen, M., & Kauppinen, H. (2003). Wood inspection with
non-supervised clustering. Machine Vision and Applications, 13, 275-285.
Silver, H., & Shmoish, M. (2008). Analysis of cognitive performance in
schizophrenia patients and healthy individuals with unsupervised
clustering models. Psychiatry Research, 159, 167-179.
SPSS Neural Networks Version 17.0, SPSS Corporation (2007).
Stavrou, E., Spiliotis, S., & Charalambous, C. (2010). Flexible working
arrangements in context: an empirical investigation through
self-organising maps. European Journal of Operational Research, 202,
893-902.
Swicegood, P., & Clark, J. (2001). Off-site monitoring systems for predicting
bank underperformance: a comparison of neural networks, discriminant
analysis, and professional human judgment. International Journal of
Intelligent Systems in Accounting, Finance & Management, 10, 169-186.
Teodorescu, L. & Sherwood, D. (2008). High energy physics event selection
with gene expression programming. Computer Physics Communications,
178, 409-419.
Thuneberg, H., & Hotulainen, R. (2006). Contributions of data mining for
psycho-educational research: What self-organizing maps tell us about the
well-being of gifted learners. High Ability Studies, 17, 87-100.
Van de Walle, N. (2004). The economic correlates of state failure: taxes,
foreign aid and policies. In Robert I. Rotberg (ed). When states fail:
causes and consequences. Princeton University Press, 94-115.
Vapnik, V. (1995). The nature of statistical learning theory. Springer, Berlin.
Vesanto, J. & Alhoniemi, E. (2000). Clustering of the self-organizing map.
IEEE Transactions on Neural Networks, 11, 586-600.
Vesanto, J. (1999). SOM-based data visualisation methods. Intelligent Data
Analysis, 3, 111-126.
Videnova, I., Nedialkova, D., Dimitrova, M., & Popova, S. (2006). Neural
networks for air pollution forecasting. Applied Artificial Intelligence, 20,
493-506.
In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.
Chapter 4
Abstract
This paper presents a novel self-adaptive grammar-guided genetic
programming proposal for mining association rules. It generates
individuals through a context-free grammar, which allows rules to be
defined in an expressive and flexible way over different domains. Each
rule is represented as a derivation tree that shows a solution (described
using the language) denoted by the grammar. Unlike existing evolutionary
algorithms for mining association rules, the proposed algorithm requires
only a small number of parameters, making it easy for non-expert users to
discover association rules. More specifically, this algorithm does not
require any threshold, and uses a novel parent selector based on a
niche-crowding model to group rules. This approach keeps the
∗ E-mail address: [email protected]
† E-mail address: [email protected]
‡ E-mail address: [email protected] (Corresponding author)
90 José María Luna, Alberto Cano and Sebastián Ventura
1. Introduction
Association rule mining (ARM), an important area of data mining, has received
enormous attention since its introduction by Agrawal et al. [1, 2] in the early
1990s. ARM searches for strong relationships among items that are hidden in
datasets. An association rule (AR) is defined as an implication of the form
Antecedent → Consequent, where Antecedent and Consequent are sets of items
with no items in common. An AR means that if the antecedent is satisfied,
then it is highly probable that the consequent will also be satisfied.
Most existing proposals for the extraction of ARs follow an exhaustive
search methodology based on a support–confidence framework [2, 6, 35], where
the support is the proportion of transactions covered by the rule, and the
confidence specifies how reliable the rule is. In these proposals, frequent
patterns are mined first and are then used to extract reliable ARs. Both steps
hinder the mining process, since they require a large amount of computational
time and memory. Moreover, with the growing interest in the storage of
information, real-world datasets containing numerical values have become
essential to the research community. Since numerical domains typically contain
many distinct values, the search space in these domains is larger than in
categorical domains. In such situations, exhaustive search algorithms for ARM
cannot deal directly with numerical domains, as they become unmanageable.
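The two-step exhaustive procedure described above can be sketched in a few lines of Python. This is only an illustrative brute-force version under stated assumptions, not the algorithm of any cited proposal: the toy dataset, the function names and the thresholds min_support and min_confidence are all hypothetical, support is taken as the fraction of transactions containing every item of the rule, and confidence as that count divided by the count for the antecedent alone.

```python
from itertools import combinations

# Toy transactional dataset (hypothetical, for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions covered by `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of antecedent-covering transactions that also cover the consequent."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / n_antecedent


def exhaustive_rules(transactions, min_support, min_confidence):
    """Step 1: enumerate frequent itemsets. Step 2: split each one into rules."""
    items = sorted(set().union(*transactions))
    # Step 1: every itemset is tested, so the cost grows exponentially
    # with the number of distinct items.
    frequent = [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(candidate, transactions) >= min_support
    ]
    # Step 2: each frequent itemset is split into antecedent and consequent.
    rules = []
    for itemset in frequent:
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                consequent = itemset - set(antecedent)
                conf = confidence(antecedent, consequent, transactions)
                if conf >= min_confidence:
                    rules.append((frozenset(antecedent), consequent, conf))
    return rules


if __name__ == "__main__":
    for antecedent, consequent, conf in exhaustive_rules(TRANSACTIONS, 0.5, 0.7):
        print(sorted(antecedent), "->", sorted(consequent), round(conf, 2))
```

Even on this toy dataset the first step tests every subset of the items, which hints at why the approach scales poorly once numerical attributes introduce many distinct values.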
Evolutionary algorithms (EAs) have been widely used in data mining tasks
where the search for solutions can be framed as an optimization problem. Many
researchers have focused on the ARM problems from an evolutionary perspec-
An Evolutionary Self-Adaptive Algorithm for Mining ... 91
which is measured by the coverage. It is noteworthy that, in order to avoid
mismatched rules, the instances covered by each rule are analysed. The
results show the good behaviour and efficiency of this proposal compared to
G3PARM, which was previously compared with other exhaustive and genetic
algorithms in [21]. More specifically, the empirical comparison demonstrates
that the new proposal obtains interesting rules with high support, high
confidence, and high coverage of dataset instances.
This paper is structured as follows: Section 2 discusses the main drawbacks of
the support–confidence framework and presents the most relevant related work;
Section 3 describes the proposed model and its main characteristics; Section 4
describes the experiments, including the datasets used and the algorithm
set-up, and discusses the results obtained; finally, Section 5 outlines some
concluding remarks.
2. Preliminaries
The support–confidence framework is the most frequently used combination of
measures in the ARM field. Most existing proposals use the support measure for
mining frequent patterns, whereas the confidence measure is used for
discovering reliable rules. As discussed next, new measures are required for
the extraction of interesting ARs, since the support–confidence framework
alone is insufficient. Finally, the most relevant ARM proposals are described.
The first proposals were designed following an exhaustive search methodology.
Currently, exhaustive search methods are no longer considered a good option,
and evolutionary ARM proposals are the most studied.
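A small numerical illustration may help show why support and confidence alone can be insufficient. The sketch below uses a hypothetical toy dataset in which a rule reaches a confidence of 0.8 even though its antecedent and consequent are statistically independent. Lift, a standard interestingness measure, is used here only as an example of an additional measure; it is an assumption of this sketch and not necessarily the measure adopted in this chapter.

```python
# Hypothetical dataset: 10 transactions over the items "tea" and "coffee".
TRANSACTIONS = (
    [{"tea", "coffee"}] * 4   # both items
    + [{"tea"}] * 1           # tea only
    + [{"coffee"}] * 4        # coffee only
    + [set()] * 1             # neither item
)


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)


def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = count(antecedent, transactions)
    n_c = count(consequent, transactions)
    n_ac = count(set(antecedent) | set(consequent), transactions)
    support = n_ac / n
    confidence = n_ac / n_a
    # lift = P(A and C) / (P(A) * P(C)); a value of 1 means independence.
    lift = (n_ac * n) / (n_a * n_c)
    return support, confidence, lift


if __name__ == "__main__":
    sup, conf, lift = rule_measures({"tea"}, {"coffee"}, TRANSACTIONS)
    print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

In this dataset the rule tea → coffee looks reliable under a confidence threshold, yet buying tea does not change the probability of buying coffee (0.8 in both cases), which is the kind of misleading rule that motivates measures beyond support and confidence.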