
COMPUTER SCIENCE, TECHNOLOGY AND APPLICATIONS


DATA MINING
PRINCIPLES, APPLICATIONS
AND EMERGING CHALLENGES

HAROLD L. CAPRI
EDITOR
Copyright 2014. Nova Science Publishers, Inc.

New York

EBSCO Publishing : eBook Academic Collection (EBSCOhost) - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS
AN: 956104 ; Ma, Xiaolei, Capri, Harold L..; Data Mining: Principles, Applications and Emerging Challenges
Account: s8501869.main.ehost
Copyright © 2015 by Nova Science Publishers, Inc.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system or
transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical
photocopying, recording or otherwise without the written permission of the Publisher.

For permission to use material from this book please contact us:
[email protected]

NOTICE TO THE READER


The Publisher has taken reasonable care in the preparation of this book, but makes no expressed
or implied warranty of any kind and assumes no responsibility for any errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of
information contained in this book. The Publisher shall not be liable for any special,
consequential, or exemplary damages resulting, in whole or in part, from the readers’ use of, or
reliance upon, this material. Any parts of this book based on government reports are so indicated
and copyright is claimed for those parts to the extent applicable to compilations of such works.

Independent verification should be sought for any data, advice or recommendations contained in
this book. In addition, no responsibility is assumed by the publisher for any injury and/or damage
to persons or property arising from any methods, products, instructions, ideas or otherwise
contained in this publication.

This publication is designed to provide accurate and authoritative information with regard to the
subject matter covered herein. It is sold with the clear understanding that the Publisher is not
engaged in rendering legal or any other professional services. If legal or any other expert
assistance is required, the services of a competent person should be sought. FROM A
DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE
AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS.

Additional color graphics may be available in the e-book version of this book.

Library of Congress Cataloging-in-Publication Data

Data mining (Nova Science Publishers)


Data mining : principles, applications and emerging challenges / [edited by] Harold L. Capri.
pages cm. -- (Computer science, technology and applications)
Includes bibliographical references and index.
ISBN:  (eBook)
1. Data mining. I. Capri, Harold L., editor. II. Ma, Xiaolei, 1985- Transit passenger origin
inference using smart card data and GPS data. III. Title.
QA76.9.D343D365 2014
006.3'12--dc23
2014047181

Published by Nova Science Publishers, Inc., New York

CONTENTS

Preface
Chapter 1  Transit Passenger Origin Inference Using Smart Card Data and GPS Data
           Xiaolei Ma, Ph.D. and Yinhai Wang, Ph.D.
Chapter 2  Knowledge Extraction from an Automated Formative Evaluation Based on Odala Approach Using the Weka Tool?
           Farida Bouarab-Dahmani and Razika Tahi
Chapter 3  Modeling Nations' Failure via Data Mining Techniques
           Mohamed M. Mostafa, Ph.D.
Chapter 4  An Evolutionary Self-Adaptive Algorithm for Mining Association Rules
           José María Luna, Alberto Cano and Sebastián Ventura
Index

PREFACE

Data mining is an area of research where appropriate methodological and technical means are employed to produce useful knowledge from different types of data. Data mining techniques use a broad family of computationally intensive methods that include decision trees, neural networks, rule induction, machine learning and graphic visualization. This book discusses the principles, applications and emerging challenges of data mining.
Chapter 1 - To improve customer satisfaction and reduce operation costs,
transit authorities have been striving to monitor their transit service quality and
identify the key factors to attract the transit riders. Traditional manual data
collection methods are unable to satisfy the transit system optimization and
performance measurement requirements due to their expensive and labor-
intensive nature. The recent advent of passive data collection techniques (e.g.,
Automated Fare Collection and Automated Vehicle Location) has shifted a
data-poor environment to a data-rich environment, and offered the
opportunities for transit agencies to conduct comprehensive transit system
performance measures. Although it is possible to collect highly valuable
information from ubiquitous transit data, data usability and accessibility are
still limited. Most Automatic Fare Collection (AFC) systems are not designed
for transit performance monitoring, and additional passenger trip information
cannot be directly retrieved. Interoperating and mining heterogeneous datasets
would enhance both the depth and breadth of transit-related studies. This study
proposed a series of data mining algorithms to extract individual transit rider’s
origin using transit smart card and GPS data. The primary data source of this
study comes from the AFC system in Beijing, where a passenger’s boarding
stop (origin) and alighting stop (destination) on a flat-rate bus are not recorded


by the fare collection system. The bus arrival time at each stop can be
inferred from GPS data, and individual passenger’s boarding stop is then
estimated by fusing the identified bus arrival time with smart card data. In
addition, a Markov chain based Bayesian decision tree algorithm is proposed
to mine the passengers’ origin information when GPS data are absent. Both
passenger origin mining algorithms are validated based on either on-board
transit survey data or personal GPS logger data. The results demonstrate the
effectiveness and efficiency of the proposed algorithms in extracting
passenger origin information. The estimated passenger origin data are highly
valuable for transit system planning and route optimization.
Chapter 2 - Differentiation between learners and adapted, personalized learning are important research directions in technology for human learning today. This issue leads to the design of educational systems integrating strategies for monitoring learners, to assist each one by evaluating his or her knowledge and skills on the one hand, and detecting and analyzing his or her errors and obstacles on the other. In this respect, formative evaluation is the process used to capture data on the strengths and weaknesses of a learner. These data, to be useful, must be objectively analyzed so that they can be used to manage the following sessions. There are different data mining tools using different algorithms for data analysis and knowledge extraction. Can we use these tools in computer based systems? In such cases, is it possible to directly use a variety of general-purpose algorithms for learning data analysis? The authors discuss in this paper a learning cycle that can be a learning session with a feedback loop integrating formative evaluation followed by a knowledge extraction process using data mining algorithms. The authors' experiments, presented in this work, show a set of tests about the exploration of learners' errors, obtained from a learning-by-doing e-learning tool for the algorithmic domain. The authors used the data mining algorithms implemented in the Weka tool: the C4.5 algorithm for classification, the Apriori algorithm for association rule deduction and K-Means for clustering. The results given by these experiments have proved the interest of classification and clustering as implemented in Weka. However, the Apriori algorithm in some cases gives results that are difficult to interpret, so it needs specific optimization to achieve adequate frequent itemset detection.
Chapter 3 - Since the concept of ‘failed states’ was coined in the early
1990s, it has come to occupy a top-tier position on the international peace and security agenda. This study uses data mining techniques to examine the
effect of various social, economic and political factors on states’ failure at the
global level. Data mining techniques use a broad family of computationally


intensive methods that include decision trees, neural networks, rule induction,
machine learning and graphic visualization. Three artificial neural network
models: multi-layer perceptron neural network (MLP), radial basis function
neural network (RBFNN) and self-organizing maps neural network (SOM)
and one machine learning technique (support vector machines [SVM]) are
compared to a standard statistical method, linear discriminant analysis (LDA).
The variable sets considered are demographic pressures, movement of
refugees, group paranoia, human flight, regional economic development,
economic decline, de-legitimization of the state, public services’ performance,
human rights status, security apparatus, elites’ behavior and the role played by
other states or external political actors. The study shows how it is possible to
identify various dimensions of states’ failure by uncovering complex patterns
in the dataset, and also shows the classification abilities of data mining
techniques.
Chapter 4 - This paper presents a novel self-adaptive grammar-guided
genetic programming proposal for mining association rules. It generates
individuals through a context-free grammar, which allows rules to be defined in
an expressive and flexible way over different domains. Each rule is
represented as a derivation tree that shows a solution (described using the
language) denoted by the grammar. Unlike existing evolutionary algorithms
for mining association rules, the proposed algorithm only requires a small
number of parameters, providing the possibility of discovering association
rules in an easy way for non-expert users. More specifically, this algorithm
does not require any threshold, and uses a novel parent selector based on a
niche-crowding model to group rules. This approach keeps the best individuals
in a pool and restricts the extraction of similar rules by analyzing the instances covered. The authors compare their approach with the G3PARM algorithm, the
first grammar-guided genetic programming algorithm for the extraction of
association rules. G3PARM was described as a high-performance algorithm,
obtaining important results and overcoming the drawbacks of current
exhaustive search and evolutionary algorithms. Experimental results show that
the authors' new proposal obtains interesting and reliable rules with
higher support values.

In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.

Chapter 1

TRANSIT PASSENGER ORIGIN INFERENCE


USING SMART CARD DATA AND GPS DATA

Xiaolei Ma1, Ph.D. and Yinhai Wang2, Ph.D.


1
School of Transportation Science and Engineering,
Beihang University, Beijing, China
2
Department of Civil and Environmental Engineering,
University of Washington, Seattle, WA, US

ABSTRACT
To improve customer satisfaction and reduce operation costs, transit
authorities have been striving to monitor their transit service quality and
identify the key factors to attract the transit riders. Traditional manual
data collection methods are unable to satisfy the transit system
optimization and performance measurement requirements due to their
expensive and labor-intensive nature. The recent advent of passive data
collection techniques (e.g., Automated Fare Collection and Automated
Vehicle Location) has shifted a data-poor environment to a data-rich
environment, and offered the opportunities for transit agencies to conduct
comprehensive transit system performance measures. Although it is
possible to collect highly valuable information from ubiquitous transit
data, data usability and accessibility are still limited. Most Automatic
Fare Collection (AFC) systems are not designed for transit performance
monitoring, and additional passenger trip information cannot be directly


* Email: [email protected]


retrieved. Interoperating and mining heterogeneous datasets would enhance both the depth and breadth of transit-related studies. This study
proposed a series of data mining algorithms to extract individual transit
rider’s origin using transit smart card and GPS data. The primary data
source of this study comes from the AFC system in Beijing, where a
passenger’s boarding stop (origin) and alighting stop (destination) on a
flat-rate bus are not recorded by the fare collection system. The bus
arrival time at each stop can be inferred from GPS data, and individual
passenger’s boarding stop is then estimated by fusing the identified bus
arrival time with smart card data. In addition, a Markov chain based
Bayesian decision tree algorithm is proposed to mine the passengers’
origin information when GPS data are absent. Both passenger origin
mining algorithms are validated based on either on-board transit survey
data or personal GPS logger data. The results demonstrate the effectiveness and efficiency of the proposed algorithms in extracting
passenger origin information. The estimated passenger origin data are
highly valuable for transit system planning and route optimization.

Keywords: Automated fare collection system, transit GPS, passenger origin inference, Bayesian decision tree, Markov chain

INTRODUCTION
According to the Census of 2000 in the United States, approximately 76%
of people chose privately owned vehicles to commute to work in 2000 (ICF
consulting, 2003). Recent studies conducted by the 2009 American
Community Survey indicate 79.5% of home-based workers drive alone for
commuting (McKenzie and Rapino, 2009). Many developing countries, e.g.,
China, also rely on privately owned vehicles to commute. For example, more
than 34% of the Beijing residents chose cars as their primary travel mode
while only 28.2% chose transit in 2010 (Beijing Transportation Research
Center, 2012). Public transit has been considered as an effective
countermeasure to reduce congestion, air pollution, and energy consumption
(Federal Highway Administration, 2002). According to the 2005 Urban Mobility Report conducted by the Texas Transportation Institute (2005), travel delay in 2003 would have increased by 27 percent without public transit; in the most congested metropolitan cities of the U.S. in particular, public transit services have saved more than 1.1 billion hours of travel time. Moreover, public transit can help enhance business and reduce urban sprawl through transit-oriented development (TOD). During certain emergency scenarios, public transit can even act as a


safe and efficient transportation mode for evacuation (Federal Highway Administration, 2002). Based on the aforementioned reasons, it is of critical
importance to improve the efficiency of public transit system, and promote
more roadway users to utilize public transit. To fulfill these objectives, transit
agencies need to understand the areas where improvements can be further
made, and whether community goals are being met, etc. A well-developed
performance measure system will facilitate decision making for transit
agencies. Transit agencies can evaluate the transit ridership trends with fare
policy changes and identify where and when better transit service should be
provided. In addition, transit agencies are also required to summarize transit
performance statistics for reporting to either the National Transit Database
(Kittelson & Associates et al., 2003), or the general public who are interested
knowing how well transit service is being provided. Nevertheless, developing
a set of structured performance measures often requires a large amount of data
and the corresponding domain knowledge to process and analyze these data.
These obstacles make such an undertaking difficult for transit agencies in terms of time and effort. Traditionally, transit agencies heavily rely on manual data
collection methods to gather transit operation and planning data (Ma et al.,
2012). However, traditional data collection methods (e.g., travel diary, survey,
etc.) are fairly costly and difficult to implement at a multiday level due to their
low response rate and accuracy. Transit agencies have spent tremendous
manpower and resource undertaking manual data collections, and consumed a
significant amount of energy and time to post-process the raw data. With
advances in information technologies in intelligent transportation systems
(ITS), the availability of public transit data has been increasing in the past
decades, which has gradually shifted public transit system into a data-rich
paradigm. Automatic Fare Collection (AFC) system and Automatic Vehicle
Track (AVL) system are two common passive data collection methods. AFC
system, also known as Smart Card system, records and processes the fare
related information using either contactless or contact card to complete the
financial transaction (Chu, 2010). There exist two typical types of AFC
systems: entry-only AFC system and distance-based AFC system. In the entry-
only AFC system, passengers are only required to swipe their smart cards over
the card reader during boarding, while passengers need to check in and check
out during both their boarding and alighting procedures for the distance-based
AFC system. AVL and AFC technologies hold substantial promise for transit
performance analysis and management at a relatively low cost. However,
historically, both AVL and AFC data have not been used to their full
potentials. Many AVL and AFC systems do not archive data in a readily


utilized manner (Furth, 2006). The AFC system was initially designed to reduce workloads of tedious manual fare collections, not for transit operation and
planning purposes, and thereby, certain critical information, such as specific
spatial location for each transaction, may not be directly captured. AVL
system tracks transit vehicles’ geospatial locations by Global Positioning
System (GPS) at either a constant or varying time interval. The accuracy of
GPS occasionally suffers from signal loss due to tall building obstructions in
the urban area (Ma et al., 2011). Both of the AFC system and AVL system
have their inherent drawbacks in monitoring transit system performance, and
require analytical approaches to eliminate the erroneous data, remedy the
missing values, and mine the unseen and indirect information.
The remainder of this paper is organized as follows: transit smart card data
and GPS data are described in the next section. Based on these data sets, a data
fusion method is initially proposed to integrate with roadway geospatial data
to estimate transit vehicle arrival information. Then, a Bayesian decision
tree algorithm is presented to estimate each passenger’s boarding stop when
GPS data are unavailable. Considering the expensive computational burden of
decision tree algorithms, the Markov chain property is taken into account to
reduce the algorithm complexity. On-board survey and GPS data from the
Beijing transit system are used to test and verify the proposed algorithms.
Conclusion and future research efforts are summarized at the end of this paper.

RESEARCH BACKGROUND
Data from AFC system and AVL system are the two primary sources in
this study. Beijing Transit Incorporated began to issue smart cards on May 10,
2006. The smart card can be used in both the Beijing bus and subway systems.
Due to discounted fares (up to 60% off) provided by the smart card, more than
90% of the transit riders paid for their transit trips with their smart cards in
2010 (Beijing Transportation Research Center, 2010). Two types of AFC
systems exist in Beijing transit: flat fare and distance-based fare. Transit riders
pay at a fixed rate for those flat fare buses when entering by tapping their
smart cards on the card reader. Thus, only check-in scans are necessary. For
the distance-based AFC system, transit riders need to swipe their smart cards
during both check-in and check-out processes. Transit riders need to hold their
smart cards near the card reader device to complete transactions when entering
or exiting buses. Smart card can be used in Beijing subway system as well,
where passengers need to tap their smart card on top of fare gates during


entering and exiting subway stations. Both boarding and alighting information (time and location) are recorded by the fare gates. Although transit
smart card exhibits superior convenience and efficiency, the following issues still prevent transit agencies from fully taking advantage of smart card data for operational purposes:

• Passenger boarding and alighting information missing

Due to a design deficiency in the smart card scan system, the AFC system
on flat fare buses does not save any boarding location information, whereas
the AFC system on distance-based fare buses stores boarding and alighting locations but not boarding times. Key information stored in the
database includes smart card ID, route number, driver ID, transaction time,
remaining balance, transaction amount, boarding stop (only available for
distance-based fare buses), and alighting stop (only available for distance-
based fare buses).
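For illustration, the key fields listed above can be sketched as a simple record type. This is a hypothetical layout (field names and types are assumptions, not the actual Beijing AFC schema); the two stop fields stay empty for flat-fare buses, which is exactly the gap the origin-inference algorithms address.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AfcTransaction:
    """One smart card transaction record (hypothetical field names)."""
    card_id: str
    route_number: str
    driver_id: str
    transaction_time: datetime
    remaining_balance: float
    transaction_amount: float
    boarding_stop: Optional[str] = None   # only on distance-based fare buses
    alighting_stop: Optional[str] = None  # only on distance-based fare buses

# A flat-fare transaction: no boarding or alighting stop is recorded.
rec = AfcTransaction("10000001", "00022", "D123",
                     datetime(2010, 4, 7, 9, 29, 5), 37.6, 0.4)
```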

• Massive data sets

More than 16 million smart card transaction records are generated per day.
Among these transactions, 52% are from flat-rate bus riders. These smart card
transactions are scattered in a large-scale transit network with 52,386 links and 43,432 nodes, as presented in Figure 1:

Figure 1. Beijing Transit GIS Network.


• Limited external data with poor quality

Only approximately 50% of transit vehicles in Beijing are equipped with GPS devices for tracking. GPS data are periodically sent to the central server
at a pre-determined interval of 30 seconds. However, the collected GPS data
suffer from two major data quality issues: (1) vehicle direction information is
missing; (2) GPS point fluctuation (Lou et al., 2009). Map matching
algorithms are needed to align the inaccurate GPS spatial records onto the road
network. In addition, most transit routes are not designed to have fixed
schedules because of high ridership demands, and only certain routes with a
long distance or headway follow schedules at each stop (Chen, 2009). The
above characteristics of the Beijing AFC and AVL systems create more
challenges to process and mine useful information.
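As a minimal sketch of the snapping step such map matching involves, the fragment below projects one GPS point onto a single candidate road link, using planar coordinates for simplicity (a real implementation would work in projected map coordinates and score many candidate links):

```python
def project_point_to_link(px, py, ax, ay, bx, by):
    """Project point P onto segment A-B; return (snapped_x, snapped_y, dist)."""
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        t = 0.0
    else:
        # Clamp the projection parameter so the snap stays on the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    sx, sy = ax + t * dx, ay + t * dy
    dist = ((px - sx) ** 2 + (py - sy) ** 2) ** 0.5
    return sx, sy, dist

# A fluctuating fix just off a horizontal link snaps straight down onto it.
sx, sy, d = project_point_to_link(2.0, 1.0, 0.0, 0.0, 5.0, 0.0)
# snapped point (2.0, 0.0), one unit off the link
```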
It is noteworthy that the AFC system used in Beijing is not a unique case.
Most cities in China also employ a similar AFC system where passengers’
origin information is absent, such as Chongqing City (Gao and Wu, 2011),
Nanning City (Chen, 2009), Kunming City (Zhou et al., 2007). In other
developing countries, such as Brazil, the AFC system does not record any boarding location information either (Farzin, 2008). Therefore, a solution for
passenger boarding and alighting information extraction is beneficial to those
transit agencies with imperfect smart card data internationally.

TRANSIT PASSENGER ORIGIN INFERENCE


Because smart card readers in the flat-rate buses do not record passengers’
boarding stops, it is desirable to infer individual boarding locations using smart
card transaction data. In this section, two primary approaches are presented to
achieve this goal. Approximately 50% of transit vehicles in the Beijing entry-only AFC system are equipped with GPS devices. Therefore, a data fusion method
with GPS data, smart card data and GIS data is first developed to estimate
each bus’s arrival time at each stop and infer individual passenger’s boarding
stop. Then, for those buses without GPS devices, a Bayesian decision tree
algorithm is proposed to utilize smart card transaction time and apply
Bayesian inference theory to depict the likelihood of each possible boarding
stop. In order to expand the usability of the proposed Bayesian decision tree
algorithm in large-scale datasets, Markov chain optimization is used to reduce
the algorithm’s computational complexity. Both transit passenger origin


inference algorithms are validated using external data (e.g., on-board survey
data and GPS data).

Passenger Origin Inference with GPS Data

In the first step, a GPS-based arrival information inference algorithm is presented to estimate the arrival time for each transit stop, and then, the
inferred stop-level arrival time will be matched with the timestamp recorded in
AFC system. The temporally closest smart card transaction record will be
assigned to each known stop ID. The logic flow chart is demonstrated in
Figure 2. The major data processing procedure will be detailed below.

Figure 2. Flow Chart for Passenger Origin Inference with GPS Data.

Bus Arrival Time Extraction


Three primary data sources are involved in the passenger information
extraction: vehicle GPS data; transit stop spatial location data; and flat-fare-
based smart card transaction data. A transit GIS network contains the
geospatial location of each stop for any transit routes. The GPS device
mounted in the bus can record each bus’s location and timestamp every 30
seconds, but the data quality of the collected GPS records is not satisfactory: no directional information is recorded in the Beijing AVL system, and GPS points are off


the roadway network due to the satellite signal fluctuation. Data preprocessing
is required prior to bus arrival time estimation. A program is written to parse
and import raw GPS data into a database in an automatic manner. Key fields
of a GPS record are shown in Table 1.

Table 1. Examples of GPS raw data

Vehicle ID   Date time             Latitude   Longitude   Spot speed   Route ID
00034603     2010-04-07 09:28:57   39.73875   116.1355    9.07         00022
00034603     2010-04-07 09:29:27   39.73710   116.1358    14.26        00022
00034603     2010-04-07 09:29:58   39.73592   116.1357    19.63        00022
00034603     2010-04-07 09:30:28   39.73479   116.1357    0            00022
00034603     2010-04-07 09:30:58   39.73420   116.1357    3.52         00022

The first step is to estimate the bus arrival time for each stop by joining
GPS data and the stop-level geo-location data. A buffer area can be created
around each particular stop for a certain transit route using the GIS software.
Within this area, several GPS records are likely to be captured. However,
identifying the geospatially closest GPS record to each particular stop is
challenging since there could be a certain number of unknown directional GPS
records within the specified buffer zone. Thanks to the powerful geospatial
analysis function in GIS, each link (i.e., polyline) where each transit stop is
located is composed of both start node and end node, and this implies that the
directional information for each GPS record can be inferred by comparing the
link direction and the direction changes from two consecutive GPS records.
With the identified direction, the distance from each GPS point to this
particular stop can be calculated, and the timestamp with the minimum
distance will be regarded as the bus arrival time at the particular stop. Figure 3
visually demonstrates the above algorithm procedure. Inbound stop represents
the physical location of a particular transit stop, and this stop is snapped to a
transit link, whose direction is regulated by both a start node and an end node.
By comparing the driving direction from GPS records with the link direction,
the nearest GPS records to this particular stop can be identified, and marked by
the red five-pointed star on the map. The timestamp associated with this five-
pointed star will be considered as the arrival time for this inbound stop. The


merit of the bus arrival time estimation algorithm lies in its efficiency. Rather
than searching all the GPS data to identify the traveling direction for each stop,
the proposed algorithm shrinks down the searching area, and filters out those
unlikely GPS data. This operation greatly alleviates the computational burden and is relatively easy to implement on large-scale datasets, which is particularly critical for processing tremendous amounts of data within an acceptable time period.
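A simplified sketch of this procedure is given below, with assumed data shapes (timestamped lon/lat fixes inside a stop's buffer zone, and a link heading in degrees): the travel direction is derived from consecutive fixes, fixes moving against the link direction are filtered out, and the timestamp of the nearest remaining fix is taken as the arrival time. Planar angles and distances are used purely for illustration.

```python
import math

def heading(p, q):
    """Planar bearing in degrees from fix p to fix q, given as (lon, lat)."""
    return math.degrees(math.atan2(q[0] - p[0], q[1] - p[1])) % 360.0

def arrival_time(fixes, link_heading, stop, tol=45.0):
    """fixes: [(timestamp, lon, lat), ...] inside the stop's buffer zone.
    Keep fixes whose travel direction matches the link heading, then
    return the timestamp of the remaining fix nearest to the stop."""
    best_t, best_d = None, float("inf")
    for (_, *p0), (t1, *p1) in zip(fixes, fixes[1:]):
        h = heading(p0, p1)
        # Discard fixes heading the wrong way along the link.
        if min(abs(h - link_heading), 360.0 - abs(h - link_heading)) > tol:
            continue
        d = math.dist(p1, stop)
        if d < best_d:
            best_t, best_d = t1, d
    return best_t

# A bus driving "south" past a stop at (0.0, 0.5) on a link with heading 180.
fixes = [("09:28:57", 0.0, 3.0), ("09:29:27", 0.0, 2.0),
         ("09:29:58", 0.0, 1.0), ("09:30:28", 0.0, 0.0)]
est = arrival_time(fixes, 180.0, (0.0, 0.5))  # "09:29:58"
```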

Figure 3. Boarding Time Estimation with GPS Data and Transit Stop Location Data.

Passenger Boarding Location Identification with Smart Card Data


For each smart card transaction record, the boarding stop can be estimated
by matching the recorded timestamp with the identified bus arrival time. As
presented in Figure 4, for each smart card transaction record, the transaction
time is compared with the inferred bus arrival time at each stop. The record is
assigned to the particular stop whose bus arrival time is temporally closest to
its transaction time. Since passengers board the bus within a relatively short
time interval, this data fusion method is able to recover almost all missing
boarding stops.
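The matching rule can be sketched as follows; the data shapes are hypothetical. Each transaction is simply assigned to the stop whose inferred arrival time is temporally closest to the tap time.

```python
def assign_boarding_stops(transactions, arrival_times):
    """Assign each smart card transaction to a boarding stop.

    transactions: list of transaction timestamps (e.g., seconds of day).
    arrival_times: dict mapping stop_id -> inferred bus arrival time.
    Returns a list of (transaction_time, stop_id) pairs.
    """
    assignments = []
    for t in transactions:
        # the stop whose arrival time is closest to the tap time wins
        stop = min(arrival_times, key=lambda s: abs(arrival_times[s] - t))
        assignments.append((t, stop))
    return assignments
```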

10 Xiaolei Ma and Yinhai Wang

Figure 4. Boarding Stop Identification with Bus Arrival Time.

In addition, because the arrival times at all stops of a particular transit
route can be estimated, the average travel time between two adjacent stops can
be calculated as well. These speed statistics are not only critical for transit
performance measures, but also provide prior information for passenger
origin inference when GPS data are absent.

Validation
Compared with bus arrival time, door opening time can be more accurately
matched with smart card transaction time, because a bus may not stop exactly
at each transit stop for passenger boarding, so the inferred bus arrival time is
subject to errors when it is matched with smart card data. To validate the
accuracy of the proposed data fusion algorithm for passenger origin inference,
an on-board transit survey was undertaken to collect the bus door opening time
and arrival location for each stop of route 651 on January 13, 2013. Handheld
GPS devices were used to track the geospatial location of moving buses every
15 seconds. The survey lasted from 8:00 AM to 1:00 PM, and a total of 75 bus
door opening times were manually recorded. These door opening time records
were then matched with smart card transactions from 417 passengers, and the
resulting estimated stops can be considered as the ground-truth data. By
comparing the ground-truth
data with the results from the proposed GPS data fusion approach, 406
boarding stops were accurately inferred, and 11 boarding stops differed from
the ground-truth data within a one-stop error range. The proposed algorithm
thus demonstrates an accuracy as high as 97.4%.

Passenger Origin Inference with Smart Card Data

There is still a fair number of buses without GPS devices, for which the bus
arrival time at each transit stop is not directly measured. However, most
passengers scan their cards immediately upon boarding, and almost all
passengers complete the check-in scan before the bus arrives at the next stop.
This indicates that the first passenger's transaction time can safely be assumed
to be the boarding time of the whole group of passengers at the same stop. The
challenge is then to identify the bus location at the moment of the SC
transaction so that we can infer the boarding stop for that passenger. However,
this is not easy because the SC system for flat-rate buses does not record bus
location. We know the time each transaction occurred on a bus of a particular
route under the operation of a particular driver, but nothing else is known from
the SC transaction database. Nonetheless, we are able to extract boarding
volume changes over time and identify passengers who made transfers. By
mining these data and combining them with transit route maps, we may be able
to accomplish our goal. Therefore, a two-step approach is designed for
passenger origin data extraction: smart card data clustering and transit stop
recognition. To implement the proposed algorithm in an efficient manner, a
Markov chain based optimization approach is applied to reduce the
computational complexity.

Smart Card Data Clustering

Transaction Data Classification


First of all, we need to sort SC transactions by the transit vehicle number.
This results in a list of SC transactions in each vehicle for the entire period of
operations for each day. During the operational period, a vehicle may make
two to ten round-trip runs depending on the round-trip length and roadway
conditions. At a terminal station, a transit vehicle may take a break or continue
running, so there is no obvious signal for the end of a trip (a trip is defined as
the journey from one terminus to the other terminus). Meanwhile, there are a
varying number of passengers at each stop, including some stops with no
passengers.
For stops with several boarding passengers, all transactions can be
classified into one group based on the interval between consecutive
transactions. Thus, the clustered SC transactions can be represented by a time
series of check-in passenger volumes at stops, as shown in Table 2.

Table 2. Examples of Clustered SC Transactions

Cluster No.   Stop ID   Stop Name   Total Transactions   Transaction Timestamp   Time Difference
1             Unknown   Unknown     18                   5:26:36                 0:14:26
2             Unknown   Unknown     9                    5:41:02                 0:03:16
3             Unknown   Unknown     11                   5:44:18                 0:04:35
4             Unknown   Unknown     27                   5:48:53                 0:01:00

In Table 2, total transactions indicates the number of boarding passengers at
one stop; transaction timestamp records the time when the first passenger
boards at this stop; and time difference is the elapsed time between the
boarding time at this stop and at the next stop with boarding passengers. Unlike most
entry-only AFC systems in the United States, the stop name and ID for each
transaction are unknown in Beijing's AFC system. Most buses in service follow
the predefined order of stops; however, it is still possible that no passenger
boards at a specific stop, so two consecutive SC transaction clusters do not
necessarily correspond to two physically consecutive stops. Obviously, this
further complicates the situation, and the algorithm must map each cluster to
the corresponding boarding stop ID.
In summary, the smart card data clustering algorithm contains three steps:
Step 1: All transaction data for each bus are sorted by the transaction
timestamp in ascending order.
Step 2: For two consecutive records, if their transaction time difference is
within 60 sec, the two transactions are included in one cluster; otherwise, a
new cluster is initiated.
Step 3: If the transaction time difference for two consecutive records is
greater than 30 min, or a driver change occurs, it is likely that the bus has
arrived at a terminus and one bus trip has been completed. The next record
will be the beginning of the next bus trip.
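The three steps can be condensed into a small routine. The 60-second and 30-minute thresholds come from the text; the record fields ('time', 'driver') are illustrative assumptions.

```python
def cluster_transactions(records, gap_cluster=60, gap_trip=1800):
    """Group SC transactions into per-stop clusters and per-trip lists.

    records: list of dicts with 'time' (seconds) and 'driver'.
    Returns a list of trips; each trip is a list of clusters (Step 2),
    and a gap > 30 min or a driver change closes a trip (Step 3).
    """
    records = sorted(records, key=lambda r: r['time'])          # Step 1
    trips, trip, cluster = [], [], []
    for prev, cur in zip([None] + records[:-1], records):
        if prev is None:
            cluster = [cur]
            continue
        gap = cur['time'] - prev['time']
        if gap > gap_trip or cur['driver'] != prev['driver']:   # Step 3
            trip.append(cluster)
            trips.append(trip)
            trip, cluster = [], [cur]
        elif gap > gap_cluster:                                 # Step 2
            trip.append(cluster)
            cluster = [cur]
        else:
            cluster.append(cur)
    if cluster:
        trip.append(cluster)
        trips.append(trip)
    return trips
```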


The result of the clustering process is several sequences of clustered
transactions. Each sequence may contain one or more trips of the transit
vehicle. For particular routes, due to limited space at the terminus or a busy
transit schedule, bus layover time may be too short to be used as a separation
marker for trips. Such buses may have a very long clustered sequence that
makes the pattern discovery process very challenging. Furthermore, passengers
unfamiliar with the system, or passengers boarding through the check-out
doors (this happens on very crowded buses), may take longer than 60 seconds
to scan their cards. These delayed transactions may cause cluster assignment
errors. Again, this adds an extra challenge to the follow-up passenger origin
extraction process.

Transaction Cluster Sequence Segmentation


Beijing has a huge transit network with nearly 1,000 routes, and it is quite
common to see passengers transfer between transit routes. Through transfer
activity analysis, we can further segment the clustered transaction sequence
into shorter series to reduce the uncertainty in passenger OD estimation (Jang,
2010). Two key principles used in transfer stop identification are:

(1) We assume the alighting stop on the previous route is spatially and
    temporally the closest to the boarding stop on the next route. This is
    reasonable because most passengers choose the closest stop for a
    transit transfer within a short period of time (Chu, 2008). Assume a
    passenger k makes a transfer from route i to route j within n minutes.
    If route i is a distance-based-rate bus line or a subway line, then we
    can identify the transfer station, which is also the boarding stop of
    route j. Even if both routes are flat-rate bus routes, as long as the
    transfer location is unique, we can still use the transfer information to
    identify the transfer bus stop ID and name. In this study, the transfer
    time window n is 30 minutes, and the maximum distance between two
    transfer stops is 300 meters.
(2) We assume that the alighting time and the boarding time at a
    particular stop are similar. In this case, we can substitute one
    passenger's boarding stop with another passenger's alighting stop.
    Assume a passenger k makes a transfer from route i to route j. If route
    j is a subway line, where both the boarding location and time are
    available, then we can estimate passenger k's alighting stop on route i,
    and this alighting stop can also be considered as the boarding stop for
    those passengers who get on the bus at the same time.
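Principle (1) reduces to a spatiotemporal test; below is a minimal sketch under the stated 30-minute and 300-meter thresholds, with hypothetical record fields and planar coordinates in meters.

```python
def is_transfer(prev_alight, next_board, max_minutes=30, max_meters=300):
    """Spatiotemporal transfer test between two taps (illustrative fields).

    prev_alight / next_board: dicts with 'time' (minutes), 'x', 'y'
    (planar coordinates in meters), and 'route'.
    """
    if prev_alight['route'] == next_board['route']:
        return False            # same route: not a transfer
    dt = next_board['time'] - prev_alight['time']
    dx = next_board['x'] - prev_alight['x']
    dy = next_board['y'] - prev_alight['y']
    close_in_time = 0 <= dt <= max_minutes
    close_in_space = (dx * dx + dy * dy) ** 0.5 <= max_meters
    return close_in_time and close_in_space
```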


The walking distance between the two stops should be taken into account
when inferring the time the flat-rate bus arrives at the transfer stop. However,
several possible boarding stops may exist due to the unknown direction in the
flat-rate smart card transactions, and thus additional data mining techniques are
needed to find the boarding stop with the maximum likelihood. These data
mining techniques are detailed in the next section.
Based on the identified transfer stops, we can further segment the
transaction cluster sequence into shorter cluster series. Each series is bounded
by either the termini or the identified bus stops. The segmented series of
transaction clusters are used as the input for the subsequent transit stop
inference algorithm.

Data Mining for Transit Stop Recognition

Bayesian Decision Tree Inference


If we treat each segmented series of transaction clusters as an unknown
pattern, this unknown pattern can be considered a sample of the sequential
stops on the bus route. If every stop has boarding passengers, the unknown
pattern is identical to the known bus stop sequence. Also, since the distance
and speed limit between stops are known, the travel time between stops is
highly predictable if there is no traffic jam. In reality, however, there may be a
varying distribution of passengers boarding at any given stop, and roadway
congestion may cause unpredictable delays. Therefore, recognizing the
unknown pattern is a very challenging issue. Once the unknown pattern is
recognized, the boarding stop for any passenger becomes clear.
The Bayesian decision tree algorithm is one of the widely used data mining
techniques for pattern recognition (Janssens et al., 2006). Each node in the
Bayesian decision tree is connected through Bayesian conditional probability,
and the entire tree is constructed directionally from the root node to the leaf
nodes. Applying this technique to the current problem, we can represent the
known starting stop as the root. If we denote the current boarding stop ID at
time step k as S_k, and the next boarding stop ID at time step k+1 as S_{k+1},
then according to Bayesian inference theory (Bayes and Price, 1763), S_{k+1}
can be calculated as:

Sk 1  arg max(Pr(Sk 1  j | S1 , S2 ...Sk )) (1)


j


where \Pr(S_{k+1} = j \mid S_1, S_2, \ldots, S_k) is the conditional probability of
the next boarding stop being S_{k+1} = j, given the previous boarding stop
sequence S_1, S_2, \ldots, S_k.
A Bayesian decision tree represents many possible known patterns. We
need to compute the probability for each known pattern to match the unknown
pattern. By further observation, we can find that, due to the nature of a transit
route, the probability of passengers boarding at S_{k+1} at time step k+1 is only
related to whether the last boarding stop was S_k at time step k. This is
because, if the transaction time and corresponding bus location for SC
transaction cluster k are known, the next SC transaction cluster k+1 only relies
on how fast the bus travels during the time period between SC transaction
clusters k and k+1. In this case, an SC transaction series can be recognized as a
Markov chain process. A Markov chain is a stochastic process with the
property that the next state only relies on the current state. Therefore, S_{k+1}
can be rewritten as:

Sk 1  arg max(Pr( Sk 1  j | S1 , S2 ...Sk ))  arg max(Pr( Sk 1  j | Sk  i))


j j
(2)
subject to i  j

The single-step Markov transition probability is defined as
\Pr(S_{k+1} = j \mid S_k = i), also denoted as p_{ij}, with i and j being stop IDs.
Without loss of generality, we assume the bus is moving outbound with stop
IDs increasing toward the destination. Then the transition probability matrix
\Pi can be simplified as:

\Pi = \begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
p_{(n-1)1} & p_{(n-1)2} & \cdots & p_{(n-1)n} \\
p_{n1} & p_{n2} & \cdots & p_{nn}
\end{pmatrix}
= \begin{pmatrix}
1 - \sum_{i=2}^{n} p_{1i} & p_{12} & \cdots & p_{1n} \\
0 & 1 - \sum_{i=3}^{n} p_{2i} & \cdots & p_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{(n-1)n} \\
0 & 0 & \cdots & 1
\end{pmatrix}    (3)

where n = the total number of stops on the bus route. This transition probability
matrix plays a vital role in determining the potential stop ID for the next time
step.
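Under the outbound assumption (stop IDs increasing toward the destination), the matrix of Equation (3) is upper triangular, with each diagonal entry absorbing the remaining probability mass of its row. The sketch below builds such a matrix from any per-pair forward probability function; the names are illustrative, and the rescaling guard is an added safeguard rather than part of the chapter's formulation.

```python
def build_transition_matrix(n, pair_prob):
    """Build the upper-triangular transition matrix of Equation (3).

    n: number of stops; pair_prob(i, j): forward probability for i < j
    (1-based stop IDs). Each row's diagonal entry is set so the row sums
    to one; the last row is absorbing (probability 1 of staying).
    """
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        forward = [pair_prob(i + 1, j + 1) for j in range(i + 1, n)]
        total = sum(forward)
        scale = 1.0 / total if total > 1.0 else 1.0   # guard: mass > 1
        for k, j in enumerate(range(i + 1, n)):
            P[i][j] = forward[k] * scale
        P[i][i] = max(0.0, 1.0 - sum(P[i][i + 1:]))   # staying probability
    return P
```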


Bayesian Decision Tree Inference


To recognize the unknown pattern, it is critical to develop a measure to
quantify p_{ij}, the probability of the next boarding stop being stop j
conditioned on the previous boarding stop being stop i. The higher p_{ij} is, the
more likely the next SC transaction cluster corresponds to passengers boarding
at stop j. In other words, p_{ij} represents the probability of the next SC
transaction cluster timestamp being the bus boarding time at stop j. That is to
say, the boarding time at stop j for cluster k+1 can be predicted based on the
travel distance from stop i to stop j and the average bus speed. The calculated
time can then be used as an indicator to compare against the real transaction
timestamp for cluster k+1. From this point of view, the average speed between
stops i and j is a key variable. If the timestamp for cluster k is t_k and that for
cluster k+1 is t_{k+1}, then the bus travel time from time step k to time step
k+1 is t_{k+1} - t_k; if the distance between stop j and stop i is D_{ij}, then
the average bus travel speed V_{ij} can be expressed as:

V_{ij} = \frac{D_{ij}}{t_{k+1} - t_k}    (4)

where V_{ij} is a random variable depending on the traffic condition at the
moment. V_{ij} is considered to be normally distributed, and its probability
density function can be adopted to quantify p_{ij}.

In the speed normal distribution, the mean travel speed \mu_{ij} and standard
deviation \sigma_{ij} can be calculated from all GPS-equipped buses on the same
route. Under this circumstance, the boarding time for each stop can be inferred
by matching GPS data with stop location information. Using the inferred
boarding time difference and the distance between stop i and stop j, we can
calculate the mean travel speed \mu_{ij} and standard deviation \sigma_{ij} as a
priori information. It is noteworthy that the speed mean and standard deviation
do not depend on GPS data alone, but can also be obtained from other data
sources such as distance-based-rate SC transaction data. A sensitivity analysis
further
demonstrates the algorithm's robustness even with different speed data sources.
Then, the transition probability can be reformulated as:

p_{ij} = \Pr(S_{k+1} = j \mid S_k = i)
       = \int_{z_{ij}}^{z_{ij} + \Delta} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz
       \approx \frac{1}{\sqrt{2\pi}} \exp(-z_{ij}^2/2)\, \Delta    (5)

where z_{ij} = (V_{ij} - \mu_{ij}) / \sigma_{ij} is the standardized travel speed
between stop j and stop i, and \Delta is a small increment in travel speed;
\Delta does not impact the algorithm result, since it is a common factor in
every transition probability. In practice, to avoid the fast growth of the
Bayesian decision tree, the transition probability can be bounded by a
minimum probability to eliminate unlikely stops during calculation.
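Equations (4) and (5) can be evaluated directly from the standard normal density, together with the minimum-probability cutoff just mentioned. The function below is an illustrative sketch with assumed parameter names and units.

```python
import math

def transition_prob(dist_ij, t_k, t_k1, mu_ij, sigma_ij, delta=1.0, p_min=0.05):
    """Transition probability p_ij in the spirit of Equation (5).

    dist_ij: distance between stops i and j; t_k, t_k1: cluster timestamps;
    mu_ij, sigma_ij: prior mean and std of travel speed between the stops.
    """
    v_ij = dist_ij / (t_k1 - t_k)                  # Equation (4): implied speed
    z_ij = (v_ij - mu_ij) / sigma_ij               # standardized speed
    p = math.exp(-z_ij ** 2 / 2) / math.sqrt(2 * math.pi) * delta
    return p if p >= p_min else 0.0                # drop unlikely stops
```

Since \Delta is a common factor across all transitions, its value only rescales the probabilities and does not change which stop maximizes them.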
Each element in the transition matrix can be quantified in the same way, as
shown in Equation (5). With the complete transition matrix, the unknown
pattern of the SC transaction series can be recognized as:

[ Sk 1 , Sk , Sk 1 ,..., S1 ]
 arg max Pr( Sk 1 , Sk , Sk 1 ,..., S1 )
S1 ... S k 1

 arg max  Pr( Sk 1 | Sk , Sk 1 ,..., S1 ) Pr( Sk , Sk 1 ,..., S1 ) 


S1 ... S k 1

 arg max  Pr( Sk 1 | Sk ) Pr( Sk | Sk 1 ) Pr( S2 | S1 )  (6)


S1 ... S k 1
k
 arg max ( Pr( Sn 1  j | Sn  i ))
S1 ... S k 1
n 1

k
 arg max ( k 1  Pr( Sn 1  j | Sn  i))
S1 ... S k 1
n 1

 arg max ( P(k  1))


S1 ... S k 1

Here, P(k+1) = \sqrt[k+1]{\prod_{n=1}^{k} \Pr(S_{n+1} = j \mid S_n = i)} denotes
the geometric mean probability of the passenger boarding stop sequence at
time step k+1. It is also the probability for the identified stop sequence to
match the unknown pattern.


Algorithm Implementation and Optimization

Implementation
As mentioned in the previous sections, due to the nature of the transaction
data, several issues need to be addressed in the Markov chain based Bayesian
decision tree algorithm:

1. Direction identification

The Beijing transit AFC system does not log travel direction information for
each route. We need to determine whether the bus is traveling inbound or
outbound before algorithm execution. The solution is to construct two
Bayesian decision trees, one for each direction. The probabilities of the most
likely stop sequences from the two trees are then compared, and the one with
the higher path probability wins.

2. Outlier removal

As mentioned in the Smart Card Data Clustering section, delayed
transactions can in some cases impact the accuracy of the clustering algorithm,
and these abnormal transactions are labeled as outliers. The principal difficulty
is that two SC transactions with inconsistent timestamps that should be
classified into one cluster may be read separately, so the latter will be
classified into another cluster for the next stop. For instance, at a particular
stop, if one passenger boarded the bus and paid the fare at 8:00 AM, another
passenger may have swiped his smart card at 8:10 AM. Due to the relatively
large transaction timestamp gap, the second transaction will be assigned to
another cluster, and its boarding stop ID will be misidentified.
The strategy used to remove these outliers is to allow a probability that a
passenger remains at the same stop. If the previous stop ID is defined as i, and
the number of total stops in each possible direction is denoted as N, then the
probability that a passenger stays at stop i in the next time step can be
expressed as:

jN
pii  1  p
j i 1
ij (7)


This probability better depicts the situation where passengers may delay
swiping their smart cards for a certain period during boarding.

3. Bus trip detection

The journey from the initial bus stop to the terminus is defined as a bus
trip. The bus terminus is designed for bus turning, layover, and driver change;
it is also the starting stop on the bus timetable. However, in Beijing's transit
network, some bus termini are located on busy streets or have limited space.
Hence, buses using these termini have to begin their next trip within a short
time period to avoid causing an obstruction. This is a challenging issue in the
procedure of passenger origin inference, since the initial stop (root node) in the
Bayesian decision tree may be misidentified if the bus trip is mistakenly
detected. The solution to this issue is to model the travel time probability of
each transaction cluster series. As indicated in the Transaction Cluster
Sequence Segmentation section, a transaction cluster sequence can be
segmented into several series using the aforementioned spatiotemporal transfer
relationships. Each identified series is bounded by possible inferred stops; by
calculating the travel time for multiple combinations of inferred stops and
comparing it with the actual time difference, we are able to determine the
existence of a bus trip based on the highest probability. Figure 5 demonstrates
the procedure of identifying a bus trip.

[Figure 5 diagram: candidate starting stops are stop 5 (inbound) or stop 13
(outbound); candidate ending stops are stop 11 (inbound) or stop 2 (outbound).
The actual stop sequence is stop 5 (inbound) and stop 12 (inbound) in Segment
1, a bus trip end, then stop 2 (outbound) in Segment 2; the series duration is
20 minutes.]

Figure 5. Bus Trip Identification.

As presented in Figure 5, the starting point and ending point of the series
can each be identified as one of several possible stops in different directions,
and the duration of this transaction cluster series is known to be 20 minutes.
Several candidate trips may exist for this transaction cluster sequence:


Trip 1: The bus travels from the 5th inbound stop to the 11th inbound stop.
Trip 2: The bus travels from the 5th inbound stop to the 2nd outbound stop.
Trip 3: The bus travels from the 13th outbound stop to the 11th inbound
stop.
Trip 4: The bus travels from the 13th outbound stop to the 2nd outbound
stop.

The maximum and minimum travel time for any trip can be obtained from
GPS data or distance-based-rate buses. In addition, the maximum bus layover
time is assumed to be 30 minutes. According to the central limit theorem, bus
travel time on a known road segment should follow a normal distribution;
therefore, we can compute the probability of each scenario and choose the trip
with the maximum probability. If the travel time from stop i to stop j is
denoted as t_{ij}, then the probability density function of t_{ij} is defined as:

p(t_{ij}) = \frac{1}{\sqrt{2\pi \sigma_{ij}^2}} \exp\left( -\frac{(t_{ij} - \mu_{ij})^2}{2\sigma_{ij}^2} \right)    (8)

where ij is the average travel time from stop i to stop j, and  ij is the
standard deviation of travel time from stop i to stop j. If the maximum and
minimum travel time (plus maximum and minimum bus layover time) between
stop i to stop j are max(tij ) and min(tij ) respectively, then the 95%
confidence interval of travel time can be further expressed as:

[\mu_{ij} - 1.96\sigma_{ij},\ \mu_{ij} + 1.96\sigma_{ij}] = [\min(t_{ij}),\ \max(t_{ij})]    (9)

The probability density function of tij can be rewritten as:

p(t_{ij}) = \frac{1}{\sqrt{2\pi \left( \frac{\max(t_{ij}) - \min(t_{ij})}{3.92} \right)^2}}
\exp\left( -\frac{\left( t_{ij} - \frac{\max(t_{ij}) + \min(t_{ij})}{2} \right)^2}{2 \left( \frac{\max(t_{ij}) - \min(t_{ij})}{3.92} \right)^2} \right)    (10)


The probabilities of the above four trips are calculated as 0.54, 0.87,
0.0003, and 0, respectively. Therefore, the transaction cluster sequence starts
at the 5th inbound stop and ends at the 2nd outbound stop, so a terminus
should exist during this trip. This result matches the actual bus trip. The
Bayesian decision tree algorithm can be further utilized to infer the remaining
uncertain stops within this identified bus trip.
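Assuming Equations (9) and (10), the trip-scoring step can be sketched as follows: each candidate trip's travel-time bounds give a normal density with mean (max+min)/2 and standard deviation (max-min)/3.92, evaluated at the observed series duration, and the most probable trip is kept. The candidate bounds in the test case are hypothetical.

```python
import math

def trip_likelihood(duration, t_min, t_max):
    """Normal density of the observed duration given travel-time bounds."""
    mu = (t_max + t_min) / 2.0
    sigma = (t_max - t_min) / 3.92      # the 95% interval spans the bounds
    z = (duration - mu) / sigma
    return math.exp(-z * z / 2.0) / (sigma * math.sqrt(2.0 * math.pi))

def best_trip(duration, candidates):
    """candidates: dict mapping a trip label -> (min_time, max_time)."""
    return max(candidates,
               key=lambda k: trip_likelihood(duration, *candidates[k]))
```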

Computational Performance Optimization


Although the mathematical form of the Markov chain based Bayesian
decision tree has been illustrated in theory, the algorithm presented above has
not yet been applied to a real dataset. Cooper (1990) proved Bayesian decision
tree inference to be an NP (Non-deterministic Polynomial)-hard problem,
which means the algorithm cannot be solved in polynomial time. The
conventional approach of calculating the path probability for all potential
boarding stop sequences is computationally expensive, especially for long
sequences. To better explain this challenge, an example is shown as follows:

[Figure 6 diagram: a decision tree rooted at stop 1, with second-level nodes at
stops 2, 3, and 4, whose children are stops 3/4/5, 4/5/6, and 5/6/7; the nine
path probabilities are 0.36, 0.32, 0.27, 0.31, 0.21, 0.19, 0.12, 0.07, and 0.04.]

Figure 6. A Bayesian Decision Tree Algorithm Example.

Assume the initial boarding stop is stop 1. The potential stops in the next
step could be stop 2, stop 3, or stop 4, because they are all within reachable
range. Assuming the situations are similar for the remaining stops, a decision
tree is fully established. The traditional exhaustive search traverses every
potential path and selects the one with the maximum probability. Based on this
method, we need to calculate the path probability nine times, and the number
of paths to be calculated increases exponentially as the time step increases.
However, at time step 3, there are two or more paths ending at each of stops 3,
4, and 5. Before carrying on the computation in the next time step, we can
compare the probabilities of the paths that share the same ending stop and keep
only the maximum one, which is called the partial best path.
In time step 3, only the following five paths are selected: 1->2->3, 1->2->4,
1->2->5, 1->3->6, and 1->4->7. Recall that the Markov chain model states that
the probability of the current state given a previous state sequence depends
only on the previous state. Hence, the five paths calculated in time step 3
guarantee the most probable paths in time step 4 without extra computations of
other paths. The optimized procedure can be expressed mathematically as:

P(k  1)  max( P(k )(k 1 Pr( Sk 1  j | Sk  i))) (11)


i, j

We can now calculate the probability at each time step recursively until the
end of the route. Computing the probability in this way is far less
computationally expensive than calculating the probabilities for all sequences.
If we denote the total number of stops for a specific route as n, and the SC
transactions are classified into m clusters, which correspond to m time steps in
the Bayesian decision tree, then the computational complexity of the
exhaustive approach is O(n^m), while with the optimized algorithm the
computational complexity is only O(mn). With the optimization, the algorithm
can be solved in a finite time and can be efficiently applied in practice.
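The partial-best-path recursion is essentially the Viterbi algorithm. Below is a compact sketch over a generic single-step transition-probability function; the names are illustrative assumptions.

```python
def most_probable_stop_sequence(n_stops, n_steps, trans_prob, start_stop=1):
    """Viterbi-style search for the most probable boarding-stop sequence.

    trans_prob(i, j): single-step transition probability from stop i to j.
    Only the best path ending at each stop is kept per time step, so the
    work grows linearly with the number of time steps instead of
    exponentially, as in the exhaustive search.
    """
    # best[j] = (path probability, path) for the best path ending at stop j
    best = {start_stop: (1.0, [start_stop])}
    for _ in range(n_steps):
        new_best = {}
        for i, (p, path) in best.items():
            for j in range(i, n_stops + 1):      # the bus only moves forward
                q = p * trans_prob(i, j)
                if q > new_best.get(j, (0.0, None))[0]:
                    new_best[j] = (q, path + [j])
        best = new_best
    prob, path = max(best.values(), key=lambda v: v[0])
    return path, prob
```

For clarity, the sketch maximizes the raw path product; taking the (k+1)-th root as in Equation (11) does not change the arg max for sequences of a fixed length.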

Validation
By installing GPS receivers on flat-rate buses, we can collect geospatial
information and spot speed data in real time. Approximately 50% of buses in
Beijing are equipped with GPS devices, and GPS data are updated every 30
seconds. These data provide the opportunity to validate the Markov chain
based Bayesian decision tree algorithm developed in this study for passenger
origin data extraction. GPS coordinates and timestamps can be used to
determine bus boarding and alighting locations and times. First, the
geographical features of bus stops and the consecutive GPS records for each
bus are joined using latitude and longitude coordinates. Then, by matching the
passenger check-in time in the SC transaction database, the boarding stop ID
can be associated with each transaction. Since the stop IDs inferred from GPS
data have been validated using the bus on-board survey method, they can be
considered as the 'ground truth' data for comparison purposes.


In this section, the Markov chain based Bayesian decision tree algorithm
is first validated using GPS data for route 22; then, several sensitivity analyses
are conducted to investigate the impacts of different parameter settings in the
Bayesian decision tree. Finally, a computational complexity experiment is
included at the end of this section.

Algorithm Validation
Flat-rate route 22 was selected for inferring unknown boarding locations
using the Markov chain based Bayesian decision tree algorithm, and GPS data
associated with route 22 were also collected to verify the results. The SC
transaction data and GPS data were all recorded on April 7, 2010. The
minimum stop probability is defined as 0.05: if a stop's transition probability is
less than 0.05, the stop is abandoned. Route 22 contains a total of 34 inbound
and outbound stops, as shown in Figure 7.

Figure 7. Route 22 in Beijing Transit Network.

The algorithm results are listed in Table 3 and Figure 8. In Table 3, there
are a total of 12,675 SC transactions mapped with GPS data for Route
22. Error is defined as the stop ID difference (two stops adjacent to each other
have consecutive IDs) between the ground truth stop based on GPS data and
the inferred stop from the proposed algorithm. For Route 22, 95% of passenger
boarding stops were deduced by the proposed algorithm, and 55.8% of the
results perfectly match the stops inferred from GPS data. There are 11,645
recognized boarding stops within a three-stop distance of the actual boarding
stop, accounting for approximately 96.7% of the total identified records or
91.6% of total records.

Table 3. Results of Bayesian Decision Tree Algorithm for Route 22
Based on GPS Speed

Route 22          Number of   Accumulated percentage   Accumulated percentage
                  records     in inferred records      in total records
Stop ID error<1   7062        58.6%                    55.8%
Stop ID error<2   10371       86.1%                    81.8%
Stop ID error<3   11341       94.2%                    89.5%
Stop ID error<4   11645       96.7%                    91.9%
Total             12043       N/A                      97.9%

Figure 8. Bayesian Decision Tree Algorithm Accuracy for Route 22 based on GPS
Speed.

The results are very encouraging. In Beijing's transit network, an error
within three stops is acceptable for a transit-planning-level study, since such
stops are mostly affiliated with the same traffic analysis zone (TAZ) due to the
high transit network density.


Sensitivity Analysis
1. Source of travel speed calculation
Recall that in computing the transition matrix, the mean travel speed \mu and
standard deviation \sigma were extracted from GPS data. However, there are
still many flat-rate routes without GPS devices. To understand how the
algorithm results change when the travel speed mean and standard deviation
are inaccurate, a sensitivity analysis was carried out. Table 4 and Figure 9
show the results when the mean and standard deviation of travel speed are
retrieved from distance-based fare routes that share common stops with the
"no-GPS" flat-fare route. Because both the boarding stop and the alighting stop
are known on distance-based fare buses, we are still able to extract the mean
and standard deviation of travel speed between adjacent stops for transition
matrix construction.

Table 4. Results of Bayesian Decision Tree Algorithm for Route 22
(Based on Speed from Distance-based Fare Routes)

Route 22          Number of records   Accumulated percentage   Accumulated percentage
                                      in inferred records      in total records
Stop ID error<1   6841                58.5%                    54.0%
Stop ID error<2   10319               88.2%                    81.4%
Stop ID error<3   11296               96.6%                    89.1%
Stop ID error<4   11509               98.4%                    90.8%
Total             11694               N/A                      92.2%

Figure 9. Bayesian Decision Tree Algorithm Accuracy for Route 22 Based on Speed
from Distance-based Fare Routes.

26 Xiaolei Ma and Yinhai Wang

Different data sources only slightly influence the percentage of inferred
stops: 92.2% of boarding stops can be estimated using the speed generated
from distance-based fare routes, and the accuracy within a three-stop error is
90.8%. The result indicates that the proposed algorithm is not sensitive to the
source of travel speed; even without GPS data, we are still able to correctly
identify passenger boarding stops using other data sources. This is not
surprising: for a normal distribution, the mean and standard deviation only
influence the shape of the probability density function, so as long as we make
a reasonable assumption for bus travel speed calculation, the algorithm’s
results will not fluctuate significantly.
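The normal-speed assumption above can be turned into a tiny numeric sketch. The chapter does not spell out the exact scoring formula, so the following stdlib Python is only an illustration under one plausible reading: each candidate stop is scored by the normal density of the average speed implied by reaching it within the elapsed time between two consecutive smart card swipes (all route numbers below are hypothetical).

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma) at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def stop_probabilities(elapsed_s, cum_dist_m, mu, sigma):
    """Score each candidate stop by the normal density of the average speed
    implied by covering its cumulative distance in `elapsed_s` seconds,
    then normalize the scores into a probability distribution."""
    scores = [normal_pdf(d / elapsed_s, mu, sigma) for d in cum_dist_m]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical example: four candidate stops 1200-3600 m from the last
# known position, 300 s between swipes, speed ~ N(6 m/s, 1.5 m/s).
probs = stop_probabilities(300, [1200, 1800, 2400, 3600], mu=6.0, sigma=1.5)
# The second stop (implied speed 6 m/s, right at the mean) scores highest.
```

Because only the shape of the density matters after normalization, a moderately wrong μ or σ reshuffles the scores far less than it shifts the raw densities, which is consistent with the insensitivity observed above.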

2. Minimum stop probability

The minimum stop probability has a vital impact on both the accuracy
and the efficiency of the proposed algorithm. Too high a threshold may
eliminate plausible boarding stop candidates, while too low a threshold
consumes additional computational resources. In this sensitivity analysis, the
minimum stop probability is set to 0.1: if the calculated transition probability
of a particular stop is lower than 0.1, the stop is considered an unlikely
boarding stop. The comparison results are presented in Table 5 and Figure 10.
When the minimum stop probability increases, fewer boarding stops can
be inferred by the proposed algorithm. In addition, the inferred boarding stops
are less accurate compared with those obtained with a minimum stop
probability of 0.05. This is a reasonable result, since a rigorous probability
threshold may limit the propagation of errors. However, a trade-off exists
between algorithm accuracy and efficiency.
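The thresholding rule itself is easy to sketch. The helper below is not from the chapter; it merely illustrates, with made-up probabilities, how raising the minimum stop probability drops weak candidates and renormalizes the rest:

```python
def prune_candidates(probs, min_stop_prob):
    """Drop candidate stops whose transition probability falls below the
    threshold, then renormalize the survivors to sum to one.
    Returns {stop_index: probability}."""
    kept = {i: p for i, p in enumerate(probs) if p >= min_stop_prob}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

candidates = [0.45, 0.30, 0.17, 0.08]
loose = prune_candidates(candidates, 0.05)   # keeps all four candidates
strict = prune_candidates(candidates, 0.10)  # drops the weakest candidate
```

With the stricter threshold the candidate set stays smaller at every step, which is where the efficiency gain, and the accuracy loss seen in Table 5, comes from.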

Table 5. Results of Bayesian Decision Tree Algorithm for Route 22
(with Minimum Stop Probability as 0.1)

Route 22          Number of records   Accumulated percentage   Accumulated percentage
                                      in inferred records      in total records
Stop ID error<1   6011                55.2%                    47.4%
Stop ID error<2   9157                84.0%                    72.2%
Stop ID error<3   10139               93.1%                    80.0%
Stop ID error<4   10589               97.2%                    83.5%
Total             10894               N/A                      85.9%


Figure 10. Bayesian Decision Tree Algorithm Accuracy for Route 22 with Minimum
Stop Probability as 0.1.

3. Computational complexity comparison

As mentioned in the algorithm optimization section, the computational
complexity should also be taken into account when the proposed algorithm is
implemented in a large-scale transit network. To compare the efficiency of the
basic Bayesian decision tree algorithm (Basic BDC) and the Markov chain
based Bayesian decision tree algorithm (Markov-chain BDC), seven transit
routes with an increasing number of total stops were tested, using 10,000
smart card transactions per route on April 7, 2010. The experimental results
are listed in Table 6 and Figure 11.

Table 6. Computation Complexity Comparison between Basic
and Markov-chain Based Bayesian Decision Tree Algorithms

Route ID   Number of stops   Running time for Basic   Running time for Markov-
                             BDC (milliseconds)       chain BDC (milliseconds)
00616      23                493740                   3798
00647      36                674820                   4890
00005      53                937387                   7747
00839      66                1947348                  17082
00355      74                2486378                  21071
00646      80                4556010                  23979
00603      86                5560774                  29114


Figure 11. Markov Chain based Bayesian Decision Tree Algorithm Run Time
Analysis.

The Markov chain based BDC algorithm saves a significant amount of
run time compared with the Basic BDC algorithm: on average, it runs about
142 times faster. This is because most of the redundant calculation steps are
eliminated by exploiting the Markov chain property.
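One way to see where the saving comes from: under the stationary Markov assumption, the stop distribution after each new transaction depends only on the previous distribution and the one-step transition matrix, so each update is a single matrix-vector product instead of a re-enumeration of full paths. A toy sketch (the 3-stop matrix is invented for illustration):

```python
def propagate(prior, transition):
    """One update of the stationary Markov chain: given the distribution
    over stops for transaction t (`prior`) and the one-step transition
    matrix, return the distribution for transaction t+1."""
    n = len(transition)
    return [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 3-stop chain: from stop 0 the bus reaches stop 1 or 2;
# stops 1 and 2 both lead to the terminal stop 2.
T = [[0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
dist = [1.0, 0.0, 0.0]
for _ in range(2):  # two transactions later, all mass is at the terminal
    dist = propagate(dist, T)
```

Per transaction this costs O(n²) for n stops (O(n) if the matrix is banded), which is consistent with the chapter's claim of reducing the load to linear growth in the number of transactions.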

CONCLUSION
Unlike most entry-only AFC systems in other countries, Beijing’s AFC
system does not record boarding location information when passengers board
buses and swipe their smart cards. This creates challenges for passenger OD
estimation.
This study aims to tackle this issue. Based on further investigation of SC
transaction data, we proposed a Markov chain based Bayesian decision tree
algorithm to infer passengers’ boarding stops. The algorithm is grounded in
Bayesian inference theory, and a normal distribution of travel speed between
adjacent stops is used to capture the randomness of passenger boarding stops.
Both the mean and the standard deviation can be obtained from GPS data or
from distance-based fare routes. Moreover, the stationary Markov chain
property is incorporated to further reduce the computational complexity of the algorithm


to a linear load. The accuracy of the optimized algorithm was validated using
the SC transaction data.
The algorithm can be improved in various ways. For instance, it does not
perform well when the travel speeds between adjacent stops are not distinct,
i.e., when the travel speed probability calculated for each stop is similar. A
potential countermeasure is to incorporate stop heterogeneity, e.g., the
accessibility of a subway station or a central business district (CBD) for each
transit stop.
In summary, the Markov chain based Bayesian decision tree algorithm
provides an effective and efficient data mining approach for passenger origin
data extraction. It lays a solid foundation for mining transit passenger ODs
from SC transaction data for transit system planning and operations.

ACKNOWLEDGMENTS
The authors gratefully acknowledge the funding support from the
National Natural Science Foundation of China (51408019) and the
Fundamental Research Funds for the Central Universities. All data used for
this study were provided by the Beijing Transportation Research Center
(BTRC). We are grateful to BTRC for their data support.

REFERENCES
Bayes, Thomas; Price, Mr. An essay towards solving a problem in the doctrine
of chances, Philosophical Transactions of the Royal Society of London 53
(0): 370–418, 1763.
Beijing Transportation Research Center, Beijing transportation smart card
usage survey, Research Report. 2010.
Beijing Transportation Research Center, the 4th Beijing Comprehensive
Transport Survey Summary Report, Jan. 2012.
Chen, J., 2009. Research on travel demand analysis of urban public
transportation based on smart card data information, Ph.D. dissertation,
Tongji University.
Chu, K. K. A. and Chapleau, R., Enriching archived smart card transaction
data for transit demand modeling, Transportation Research Record:


Journal of the Transportation Research Board, Vol. 2063, 2008, pp. 63-
72.
Chu, K.K. and Chapleau, R. Augmenting transit trip characterization and
travel behavior comprehension: multiday location-stamped smart card
transactions. Transportation Research Record: Journal of the
Transportation Research Board, No. 2183, Transportation Research
Board of the National Academies, Washington, DC, 2010, pp.29–40.
Cooper, G. F., The computational complexity of probabilistic inference using
Bayesian belief networks, Artificial Intelligence, Vol. 42, 1990, pp. 393-
405.
Farzin, J. M., Constructing an automated bus Origin-Destination matrix using
farecard and Global Positioning System data in Sao Paulo, Brazil,
Transportation Research Record: Journal of the Transportation Research
Board, Vol. 2072, 2008, pp. 30-37.
Federal Highway Administration, Travel Time Reliability: Making it there on
time, all the time, 2006. Accessed online at: http://ops.fhwa.dot.gov
/publications/tt_reliability/, on Apr. 18th, 2013.
Furth, P. G., Hemily, B., Muller, T. H. J., and Strathman, J. G., TCRP report
113: Using archived AVL-APC data to improve transit performance and
management, Transportation Research Board, 2006.
Gao, L.X. and Wu, J. P., An algorithm for mining passenger flow information
from smart card data, Journal of Beijing University of Posts and
Telecommunications, Jun. 2011, vol. 34, No.3, 2011, pp. 94-97.
ICF consulting, Center for urban transportation research, Nelson/Nygaard,
ESTC. Strategies for Increasing the Effectiveness of Commuter Benefits
Programs. TCRP report 87, Transportation Research Board, 2003.
Jang, W, Travel time and transfer analysis using transit smart card data,
Transportation Research Record: Journal of the Transportation Research
Board, Vol. 2144, 2010, pp.142–149.
Janssens, D., Wets, W., Brijs, T., Vanhoof, K., Arentze, T., Timmermans, H.,
Integrating Bayesian networks and decision trees in a sequential rule-
based transportation model, European Journal of Operational Research,
Vol. 175, Issue 1, 2006, pp. 16-34.
Kittelson & Associates, Inc., Urbitran, Inc., LKC Consulting Services, Inc.,
Morpace International, Inc., Queensland University of Technology, and
Nakanishi, Y., TCRP Report 88: A guidebook for developing a transit
performance-measurement system, Transportation Research Board,
National Research Council, Washington, D.C., 2003.


Lou, Y., Zhang, C., Zheng, Y., Xie, X., Wang, W., and Huang, Y., Map-
matching for low-sampling-rate GPS trajectories, Proceedings of the 17th
ACM SIGSPATIAL International Conference on Advances in Geographic
Information Systems, pp. 352-361, 2009.
Ma, X., McCormack, E., Wang, Y., Processing Commercial GPS Data to
Develop a Web-Based Truck Performance Measures Program,
Transportation Research Record: Journal of the Transportation Research
Board. Vol.2246, 2011, pp.92-100.
Ma, X., Wang, Y., Feng, C., and Liu, J. Transit smart card data mining for
passenger origin information extraction. Journal of Zhejiang University
Science C, Vol. 13, No. 10, 2013, pp. 750-760.
McKenzie, B. and Rapino, M. Commuting in the United States: 2009,
American Community Survey Reports. Accessed online at:
http://www.census.gov/prod/2011pubs/acs-15.pdf, on Oct. 7th, 2012.
Texas Transportation Institute, 2005 urban mobility report, Texas A&M
University, 2005.
Zhou, T., Zhai C., and Gao Z., Approaching bus OD matrices based on data
reduced from bus IC cards. Urban Transport of China, vol. 5, no.3, 2007,
pp. 48-52.

In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.

Chapter 2

KNOWLEDGE EXTRACTION FROM AN
AUTOMATED FORMATIVE EVALUATION
BASED ON ODALA APPROACH USING
THE WEKA TOOL

Farida Bouarab-Dahmani1 and Razika Tahi2

1 The Computer Science Department, FGEI Faculty,
University of Tizi-Ouzou, Tizi-Ouzou, Algeria
2 The Laboratory for the Electrification of Industrial Enterprises,
Department of Economics, University of Boumerdes, Boumerdes, Algeria

ABSTRACT
Differentiation between learners and adapted, personalized learning
are interesting research directions in technology for human learning
today. This issue leads to the design of educational systems integrating
strategies for monitoring learners, assisting each one by evaluating his
knowledge and skills on the one hand, and by detecting and analyzing his
errors and obstacles on the other. In this respect, formative evaluation is the
process used to capture data on the strengths and weaknesses of a learner.
These data, to be useful, must be objectively analyzed so that they can be
used to manage the following sessions. There are different data mining
tools using different algorithms for data analysis and knowledge


[email protected]

[email protected]


extraction. Can we use these tools in computer-based systems? In such
cases, is it possible to directly use a variety of general-purpose algorithms
for learning data analysis? We discuss in this chapter a learning cycle,
i.e., a learning session with a feedback loop integrating formative
evaluation, followed by a knowledge extraction process using data mining
algorithms. Our experiments, presented in this work, consist of a set of
tests on the exploration of learners' errors, obtained from a self e-learning-
by-doing tool for the algorithmic domain. We used the data mining
algorithms implemented in the Weka tool: the C4.5 algorithm for
classification, the A Priori algorithm for association rule deduction and
K-Means for clustering. The results of these experiments have proved the
interest of classification and clustering as implemented in Weka.
However, the A Priori algorithm in some cases gives results that are
difficult to interpret, so it needs a specific optimization to obtain adequate
detection of frequent patterns.

Keywords: educational data mining, formative assessment, A Priori
algorithm, C4.5 algorithm, K-means clustering algorithm, the Weka tool

1. INTRODUCTION
Educational data mining (Romero, C. and Ventura, S., 2007) is an area of
research where appropriate methodological and technical means are applied to
produce useful knowledge from different types of data (marks, errors, data on
the learner, log files, ...). These data can be produced by any kind of learning:
face-to-face learning, e-learning or even blended learning.
Learner differentiation and adaptive learning are fundamental to the
competency-based approach that is much discussed nowadays in educational
science. The value of a competency-based approach in our new world of
global economy and advanced technologies is obvious. However, the mistrust
and reluctance of teachers and staff, often observed when it comes to applying
this approach in the field, are justified by the difficulty of assessing skills. In
addition, this new approach requires individually oriented monitoring, which
means more work for everyone in educational institutions.
With the introduction of the LMD (License, Master, Doctorate) system in
Algerian universities, for example, the competency-based approach has to be
applied by individually tracking each student. However, this objective is hard
to reach with large numbers of students, which makes learners’ differentiation
a tedious task for every faculty member.


The study proposed in this chapter discusses the possible add-ons of data
mining tools to computer-based learning systems that support competency-
based pedagogies, in order to achieve adaptive learning. We are particularly
interested in data mining algorithms for knowledge extraction from data
produced by a formative, automated evaluation based on learners’ errors,
integrated in the ODALA approach. This approach was developed and
evaluated in our previous works (Bouarab-Dahmani F., 2010) (Bouarab-
Dahmani F. et al., 2011). Among the data mining algorithms considered in
this study are: C4.5 (with the J48 implementation) for classification, the A
Priori algorithm for association rule deduction and K-means for clustering.
First, we give an overview of the ODALA approach and data mining
technology. Then, in the third section, we present our tests on mining learning
data using algorithms implemented in the Weka tool. In the fourth section, we
discuss the test results and data mining implementation in a CEHL. The
conclusion is a synthesis of our contribution and the research perspectives of
the presented work.

2. DATA MINING AND ODALA APPROACH


Personalization in computer environments for human learning (CEHL) is
possible through the implementation of strategies defined to monitor and
assist the learner, based on a recurrent and continuous evaluation of the
learner’s knowledge. Formative evaluation is then a recommended tool to
capture data on the strengths and weaknesses of a learner during a given
period. These data cannot easily help the decision staff without an objective
analysis producing useful knowledge for learning management. However,
such analysis processes are time-consuming to program and integrate into
each CEHL system. Hence the interest of using already existing data mining
tools; but which algorithm? Which tool or library? For what kind of input data
and learning management objectives?
Data mining is a field at the intersection of statistics and information
technology (databases, artificial intelligence, machine learning, etc.) that
provides tools to discover structures, "interesting" knowledge and patterns in
large data sets. It is also defined as the set of algorithms and methods for the
exploration and analysis of (often large) computer databases to detect in the
data rules, associations and unknown trends that can enhance decision systems
(I.H. Witten et al., 2011). The knowledge extraction process from data that


commonly uses data mining algorithms is a succession of three main steps:
data pre-processing, pattern recognition and interpretation of results.
Data mining tools can be used within a CEHL to extract knowledge from
learning sessions’ databases. This knowledge is then used to give adaptive
guidance to learners so that they improve their progression. Four application
areas have received special attention in the field of educational data mining:

 To improve the learner’s model and provide detailed information
about his/her characteristics such as knowledge, motivation or
attitudes.
 To use the data mining process in modelling individual differences
among learners (Baker et al., 2008).
 To discover or improve, in their areas of application, the models
concerning the structure of contents or of the taught domains, such as
in (Shih, B., Koedinger, K.R., Scheines, R., 2008).
 To study the educational support provided by CEHL systems, by
studying how each type of instructional support can improve learning.
Among the works on this problem, we have (Beck, J.E., Mostow, J.,
2008).

We found few research works on knowledge extraction from an
automated, formative evaluation based on error detection in learning-by-doing
mode. In this chapter, we focus on improving the learner’s model through a
knowledge extraction process, using data mining algorithms and data from
formative evaluation results according to the ODALA approach.
The ODALA (Ontology Driven Auto-evaluation for Learning Approach)
evaluation approach (Bouarab-Dahmani F., 2010) proposes a methodology
and techniques for the development of an evaluation system based on the
teaching domain (or discipline) ontology Onto-TDM. It relies on the learners’
potential errors, classified into different types (semantic, lexico-syntactic, ...),
in the case of learning by doing. The content to teach, which we call the
teaching domain, is represented by a domain ontology whose main concepts
are: notions composed of sub-notions and knowledge items, evaluation units,
errors and examples. The ODALA process (Bouarab-Dahmani F., 2010)
(Bouarab-Dahmani F. et al., 2011) comprises: form analysis of the learner’s
solutions, semantic analysis, marking and learner’s model update. After
several ODALA cycles, the learning system holds a set of data about the learner’s


learning and behavior, such as committed errors, comprehensive indicators
and the exercise resolution matrix.
We discuss, in this chapter, a learning cycle with a feedback loop
incorporating formative assessment, followed by the application of data
mining algorithms after each training session (see Figure 1). The knowledge
resulting from the extraction process is injected, in an appropriate format, into
the module that handles the management and monitoring of the learner, to
facilitate appropriate decisions.

[Figure 1: a learning cycle in which a Learning Activity (exercise resolution,
quizzes, ...) feeds a Formative Evaluation with ODALA (error diagnosis,
marking, ...), whose results pass through Knowledge Extraction (data
selection, preprocessing, data mining algorithm application, interpretation)
and back into the Learner’s Management and Monitoring module, which
drives the next Learning Activity.]

Figure 1. Learning Cycle integrating formative evaluation and knowledge extraction.

Currently, our interest mainly concerns the learner’s management,
monitoring and/or scaffolding based on formative evaluation results, through:

 Recommendation of adequate learning content (courses, exercises, ...)
 Identification of learning and performance differences between
groups of learners
 Discovery of frequent patterns and/or relationships between errors
and exercises, errors and learner profiles, ...
 Obstacle detection

For that, we study already existing algorithms and data mining
tools in order to assess their possible use for the exploration of formative


evaluation data based on the ODALA approach. We conducted a set of tests
in which the data obtained from the evaluation process of the Websiela
system were analyzed, mainly using data mining algorithms implemented in
the Weka tool (Ian, H. W., Eibe, F., 1999).
Weka (as a Java library or as a tool) is an open source system developed
at the University of Waikato in New Zealand. Websiela (Bouarab-Dahmani
F., 2010) is a self-learning prototype for the algorithmic domain based on the
ODALA approach. The next section illustrates these tests with a limited set of
data, to make visible the results of applying the C4.5 algorithm for
classification, the A Priori algorithm for association rule discovery and
K-means for clustering.

3. MINING LEARNING DATA USING THE WEKA TOOL


In addition to the ontological model of the domain to teach, data about the
learner (from the learner’s model) and the applied pedagogy (scenarios,
parameters, ... from the pedagogy model), we rely on further structures for
the learning sessions, where data about the learning activities (such as the
exercises done, the errors detected, the pages visited, ...) are stored. These
data are the most important to analyze. They obey semantic rules, represented
in the database as tables and used in the tests described in this chapter to
extract knowledge. For example, we have these rules:

 A student can be enrolled in one or more sessions,
 each exercise can have more than one solution,
 an exercise can evaluate different knowledge items,
 an error refers to one knowledge item,
 a learner can perform different errors in a session when solving an
exercise.

The relational tables considered in the tests are given in Table 1, with
only the attributes used in the experiments.
The approach followed for each test, where data are analyzed by a data
mining algorithm implemented in the Weka tool, is composed of these steps:

 Setting of a goal
 Selection of the relational tables involved


 Pre-processing of the selected tables by algebraic calculation
(application of relational algebraic operators such as selection,
natural join, ...) and translation to the adequate data format (the .arff
format in the case of input data for algorithms implemented in Weka)
 Choice of the appropriate algorithm
 Execution of the algorithm from the Weka tool and/or from a Java
application using the Weka API
 Analysis and interpretation of the results.
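Weka reads its input from .arff text files with `@relation`, `@attribute` and `@data` sections. The translation step listed above can be sketched as follows (the `to_arff` helper and the error-count/level attributes are illustrative, not taken from the Websiela code):

```python
def to_arff(relation, attributes, rows):
    """Serialize rows into Weka's .arff text format.
    `attributes` is a list of (name, type) pairs where type is either
    the string 'numeric' or a list of nominal values."""
    lines = [f"@relation {relation}", ""]
    for name, atype in attributes:
        spec = atype if atype == "numeric" else "{" + ",".join(atype) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"] + [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

# Hypothetical export: learners described by error count and level class.
arff = to_arff("learners",
               [("nb_errors", "numeric"), ("level", ["good", "average", "bad"])],
               [(3, "good"), (12, "bad")])
```

The resulting string can be written to a file and loaded directly in the Weka Explorer or through the Weka Java API.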

Table 1. Extract from the relational diagram of a CEHL database

Relationship                                        Commentary
- Learner (Id_learner, Name, E-mail, level)         Belongs to the learner’s model
- Session (session_id, start_date, end_date,        Session of a learner
  id_learner)
- … (id_exercise, …)                                Exercises table
- Error (id_error, label_error, type_error, id_ic)  Table of potential errors. Id_ic is the
                                                    identifier of the knowledge item, a
                                                    granular component of the domain, to
                                                    which the error corresponds
- … (id_ic, …)                                      Table of knowledge items, the granular
                                                    components of the field to teach
- Com-Error (session_id, id_error, id_exercise,     Table of the errors committed by a given
  Nb-occ)                                           learner at a given session when solving a
                                                    given exercise
- Evaluates (id_exercise, id_ic, importance-        Association between knowledge items and
  degree)                                           exercises, with indication of the degree of
                                                    importance of the knowledge item for the
                                                    exercise

We tested different algorithms on different input data and for different
objectives. The C4.5 algorithm for classification, K-Means for clustering and
the A Priori algorithm for association rule discovery attracted our attention
during the various trials, for the readability and relevance of their results. In
the following subsections, we comment on the tests related to each of these
three algorithms.


3.1. Classification of Learners with the J48 Algorithm

Classifiers are one of the most commonly used tools in data mining. They
take as input a collection of cases, each belonging to one of a small number of
classes and described by its values for a fixed set of attributes, and output a
classifier that can accurately predict the class to which a new case belongs
(Wu et al., 2008).
The C4.5 algorithm (Quinlan, J.R., 1993) is one of the most cited
classification algorithms. It is an extension of the ID3 algorithm proposed
by Quinlan for decision tree construction. One of the most attractive aspects
of decision trees lies in the interpretation and construction of decision
rules. The confidence of a rule is the proportion of records in the leaf node for
which the decision rule is true. If the confidence is 100% (= 1), the leaf node
is pure and the decision rule is perfect.
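In Weka's J48 output, a leaf such as `bad (3.0/1.0)` reports the number of covered records and, after the slash, how many of them are misclassified, so the confidence just defined can be read off directly. A one-line sketch (the helper name `leaf_confidence` is ours):

```python
def leaf_confidence(covered, misclassified=0.0):
    """Confidence of the decision rule at a J48 leaf labelled
    '(covered/misclassified)': the share of covered records the rule
    classifies correctly. A pure leaf, e.g. '(3.0)', has confidence 1."""
    return (covered - misclassified) / covered

conf = leaf_confidence(3.0, 1.0)  # the leaf 'bad (3.0/1.0)' -> 2/3
```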
We use the J48 algorithm, an implementation of C4.5, with the data
collected during the execution of the WebSiela system, a prototype we
developed for algorithmic learning with automated correction of solutions
freely built by learners in answer to open questions. The input data table for
the data mining process is deduced from a combination of the relational tables
given in Table 1. We chose a very simple example concerning the
classification of students according to their level, which can be good, average
or bad. This classification clarifies the relationship between a class (a level)
and the number of errors (deduced from the input data).
To test the reliability and usability of the algorithm, we introduced
disturbances in the input data and observed the reactions of the algorithm. In
most cases, the generated results were adequate to the inputs. For example:

 In test 1, a learner is misclassified; he is ranked in the good level
although his number of errors is 12. This shows up in the resulting
decision tree as the leaf bad (3.0/1.0), as seen in Figure 2. So the
algorithm can detect anomalies in our classification. For this case, we
find the choice of two levels (3 and 4) perfectly adequate for the
input data.
 In test 2, there are no misclassifications. We just added occurrences
that might change the classes’ thresholds (numbers of errors). As a
result, shown in Figure 3 and Figure 4, the threshold for the "bad"
class increased to 10.
 Test 3 showed a case of misclassification after replacing the learners’
error counts. The algorithm evolved with the data

(see Figure 5), gave new adequate thresholds and detected the
misclassification.

Figure 2. The result displayed in test 1.


Figure 3. The textual result of test 2.

Figure 4. The graph of the classification (test 2).


Figure 5. The textual result of test 3.

According to the results of our tests on different input data, we conclude
that the C4.5 algorithm is an interesting choice for predictive classification of
learners based on errors. The algorithm is flexible enough (one can change the
criteria and select different attributes) to suit different objectives.

3.2. Discovery of Frequent Patterns with the A Priori Algorithm

The A Priori algorithm (Agrawal R., Srikant R., 1994) is based on the
fact that all subsets of a frequent itemset are themselves frequent. Indeed, if
the set {A, B} is frequent in the database, then the sets {A} and {B} are
frequent in the database as well. The algorithm generates rules involving
correlated data; we only have to specify the set of attributes concerned by the
analysis.
The objective chosen in this case was to look for correlations between the
exercises, learners and errors detected in a learning session. The data to be
included in the input table of the A Priori algorithm, called input_base2 (see
Figure 6), is obtained by algebraic calculus on the two tables Session and
Com-Error. We ran the algorithm directly on the three attributes id_learner,
id_exercise and id_error to extract all possible rules (each rule is given in
if ... then form).

Figure 6. The pre-processed input table 2.

Figure 7. A result of A Priori execution under the tool Weka (test1).


We also introduced various disturbances into the input data. For example, in
test 1 we obtained a rule (Rule 1 in the screenshot of Figure 7) that we
interpreted as: “regardless of the learner, the resolution of exercises 3 and 4
gave no error”. Changing some instances of the input table produced a
significant change in the generated rules; for example, we no longer had
Rule 1 of test 1 about exercises 3 and 4 (see Figure 8).

Figure 8. A result of A Priori execution under the tool Weka (Test 2).

After different tests, we concluded that the use of this algorithm could be of
great help to enrich the formative evaluation or to obtain a global evaluation
from the CEHL. However, an automated interpretation of the derived rules will
not be a simple task to develop, and a “manual” interpretation would be very
time consuming, since not all the generated rules are significant.

3.3. Learners Clustering with K-Means Algorithm

The k-means algorithm is a simple iterative method to partition a given
dataset into a user-specified number of clusters, k (Wu et al., 2008). This
algorithm has been discovered by several researchers across different
disciplines, most notably Lloyd (1957, 1982), Forgy (1965), Friedman and
Rubin (1967), and MacQueen (1967). The cost of the optimal solution
decreases with increasing k until it hits zero when the number of clusters
equals the number of distinct data points. K-means remains the most widely
used clustering algorithm in practice (Wu et al., 2008).
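The assign-then-recompute iteration that k-means performs can be sketched compactly; this is an illustrative from-scratch toy on invented (errors, exercises) rows, not the Weka SimpleKMeans implementation used below:

```python
import random

def kmeans(points, k, iters=100, seed=10):
    """Plain k-means on 2-D points: assign each point to its nearest centroid,
    then recompute centroids as cluster means, until assignments stabilize."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:               # converged
            break
        centroids = new
    return centroids, clusters

# Hypothetical (errors, exercises done) rows in the spirit of the input table.
data = [(11, 2), (12, 3), (10, 3), (2, 1), (3, 1), (1, 1)]
centroids, clusters = kmeans(data, k=2)
```

On this toy data the two centroids settle near (2, 1) and (11, 2.67), mirroring the "few errors" versus "many errors" split discussed below.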

Figure 9. The input table of clustering tests.

We used the K-means algorithm implemented in Weka on different datasets
derived from our educational database. In what follows, we propose an example
(see Figure 9) in which we have a view of each learner with the number of
exercises done and the number of errors committed. The input table has just
thirty (30) entries, so that we can easily follow and analyze the results. The
data mining objective in this test is to obtain clusters of learners according
to the number of their errors and/or the number of exercises done.
Figure 10 shows the execution results of K-means via Weka with the
number of clusters K=2. We can easily interpret the results given, where:
Full data is the overall average of each attribute over all the instances.
Cluster 0 (the first cluster found) has an average of 11 for the error
numbers and 2.5 for the numbers of exercises done. This cluster contains
twelve (12) learners.
Cluster 1 (the second cluster found) contains eighteen (18) learners, with
an average of 2.9444 for the error numbers and 1.2778 for the numbers of
exercises done.

=== Run information ===


Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: table1
Instances: 30
Attributes: 3
nombre_err
nombre_exo
Ignored:
id_app
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 2.2169753086419752
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(30) (12) (18)
=============================================
nombre_err 6.1667 11 2.9444
nombre_exo 1.7667 2.5 1.2778
Time taken to build model (full training data) : 0.02 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 12 ( 40%)
1 18 ( 60%)

Figure 10. Screen result of clustering with the tool Weka (test1).

The interpretation can be: the algorithm divided the learners into two
clusters, those who did one (1) exercise with a number of errors between 0 and
5, and those who did 2 or 3 exercises with a number of errors between 6 and
15. Figure 11 shows a clustering with K=3 on the same two attributes: the
number of exercises done and the number of errors. We can see that the
characteristics of the clusters found have changed.


=== Run information ===


Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance
-R first-last" -I 500 -S 10
Relation: table1
Instances: 30
Attributes: 3
nombre_err
nombre_exo
Ignored: id_app
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 3
Within cluster sum of squared errors: 0.49184667184667175
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2
(30) (6) (13) (11)
========================================================
nombre_err 6.1667 12.1667 1.9231 7.9091
nombre_exo 1.7667 3 1 2
Time taken to build model (full training data): 0 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 6 ( 20%)
1 13 ( 43%)
2 11 ( 37%)

Figure 11. Screen result of clustering with the tool Weka (test2).

=== Run information ===


Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: table1
Instances: 30
Attributes: 3
nombre_exo
Ignored:
id_app
nombre_err
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 0.9705882352941179
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(30) (17) (13)
=============================================
nombre_exo 1.7667 2.3529 1
Time taken to build model (full training data) : 0 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 17 ( 57%)
1 13 ( 43%)

Figure 12. Screen result of clustering with the tool Weka (test3).


Figure 12 shows clustering on the number of exercises done only (the
other attributes of the input table are “ignored”).
We note that the characteristics of the clusters found have changed
compared to those of Figure 10.

4. DATA MINING IMPLEMENTATION IN A CEHL


We have deduced, from the tests described above, the value of some data
mining algorithms, such as C4.5, K-Means and A Priori, for analyzing
formative evaluation results. The question that follows is: “How can such
algorithms be integrated into a CEHL?”
We have studied this problem for the case of the WebSiela system and the
Weka API, and developed two alternatives. The first is a simple interface
giving access to the Weka tool from WebSiela, with recovery of the results; in
this case, the interpretation step is done by a human expert. The second is a
module for an automated knowledge extraction process that calls the data
mining algorithms, with possible integration of automated preprocessing and
interpretation (see Figure 13 for the developed prototype interface).

Figure 13. Interface of the WebSiela data mining with Weka algorithms.


Figure 14. The links to the data mining tasks.

At this stage of our work, we have implemented the preprocessing and
knowledge extraction using the Weka API. The preprocessing module
automatically generates the needed .arff file from the relational database,
using adequate queries and attribute selection. The data mining of the
obtained file is then performed by executing the Weka API algorithms. We
tested the three algorithms cited above, A Priori, J48 and K-Means, each for a
specific task as seen in the interface of Figure 14.
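The preprocessing step essentially renders query results in Weka's ARFF text format. A minimal sketch of that rendering, with invented rows and the attribute names taken from the run outputs above (the database query itself is omitted):

```python
def rows_to_arff(relation, attributes, rows):
    """Render numeric rows as an ARFF document, the input format Weka expects:
    an @relation header, one @attribute line per column, then @data rows."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

# Hypothetical extract of the per-learner view (errors, exercises done).
rows = [(11, 2), (3, 1), (7, 2)]
print(rows_to_arff("table1", ["nombre_err", "nombre_exo"], rows))
```

The resulting string can be written to a .arff file and loaded by any of the Weka algorithms discussed above.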
In both cases, when the algorithms implemented in Weka are used without
improvements or changes, the generated results cannot be integrated and
effectively used in the learning loop given in Figure 1. Consequently,
specific programs must be created, particularly for interpreting the
algorithms’ results and even for preprocessing, since the input data can
initially come in different formats (XML, OWL, text …). For the case of
frequent itemset detection, we think that we need an improvement of the
A Priori algorithm or another method for association rule detection.
That is one side of our study. On the other side, the results of our
experiments show the value of the information that could be automatically
generated by data mining for global formative evaluation, personalization,
recommendation and other activities for learner management and monitoring,
which are particularly important in the case of a competency-based approach.
The Weka system can be used to test the interest and feasibility of an
algorithm before programming it for a specific educational data mining
purpose or for a given use case. In addition, clustering and classification
need only a simple interpreting program.

CONCLUSION
We explored, in this chapter, the use and integration of the Weka
algorithms for educational data mining within a CEHL that adopts learning by
doing in the context of a competency-based approach. This exploration was
done by executing some algorithms implemented in the Weka tool, first on a
limited set of data and then on data collected in the WebSiela system. These
algorithms were used to explore the results of a formative evaluation based
on the ODALA approach. The observed results, especially through the execution
of predictive classification with the C4.5 algorithm, association rule
detection with the A Priori algorithm and clustering with the K-means
algorithm, are interesting for tracking learners’ progress. However, the
A Priori algorithm as implemented in Weka gives results that are difficult to
interpret automatically, and even by a human in the case of big data.
Finally, to realize the loop of the learning process with a knowledge
extraction module, we need to develop specific programs that can use the Weka
API in the implementation step for clustering and classification. For
frequent itemset detection, we have to study the A Priori algorithm more
deeply, and even other association rule detection algorithms, to get
interpretable results. A further research perspective concerns the
development of an automated and reusable filter and interpreter of the
algorithms’ results, to complete the loop of the learning process in which
the data mining results are used to update the learners’ models or
e-portfolios.

REFERENCES
Agrawal, R., Srikant, R. (1994). “Fast algorithms for mining association rules
in large databases”. Proceedings of the 20th International Conference on
Very Large Data Bases (VLDB), June 1994, 478-499. IBM Research Report RJ 9839.
Baker, R.S.J.d., Corbett, A.T., Aleven, V. (2008) “More Accurate Student
Modelling Through Contextual Estimation of Slip and Guess Probabilities
in Bayesian Knowledge Tracing” Proceedings of the 9th International
Conference on Intelligent Tutoring Systems, 406-415, 2008.


Beck, J.E., Mostow, J. (2008) “How who should practice: Using learning
decomposition to evaluate the efficacy of different types of practice for
different types of students” Proceedings of the 9th International
Conference on Intelligent Tutoring Systems, 353-362.
Bouarab-Dahmani F. (2010). Modélisation basée ontologies pour
l’apprentissage interactif - Application à l’évaluation des connaissances
de l’apprenant. Doctorate thesis in computer science of Mouloud
Mammeri University, Tizi Ouzou, Algeria, 2010.
Bouarab-Dahmani F., Si-Mohammed M., Comparot C., Charrel P. J. (2011)
“Adaptive Exercises Generation using an Automated Evaluation and a
Domain Ontology: The ODALA+ Approach”, International journal of
emerging technologies in learning, IJET, Vol.6, Issue 2, June 2011, 4-10.
Witten, I. H., Frank, E. (1999). Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann, October
1999.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan
Kaufmann 1993.
Romero, C. and Ventura, S. (2007). “Educational data mining: A survey from
1995 to 2005”. Expert Systems with Applications, (33), pp. 135-146.
Shih, B., Koedinger, K.R., Scheines, R. (2008) “A Response-Time Model for
Bottom-Out Hints as Worked Examples”. Proceedings of the First
International Conference on Educational Data Mining, 117-126, 2008.
Witten I.H., Eibe F., Hall M.A. (2011). Data Mining, Practical Machine
Learning Tools and Techniques. Third edition published in January 2011
by Morgan Kaufmann Publishers (ISBN: 978-0-12-374856-0).
Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H.,
McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M.,
Hand, D. J., Steinberg, D. (2008). Top 10 algorithms in data mining.
Knowledge and Information Systems, 14, 1-37. DOI 10.1007/s10115-007-0114-2.

In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.

Chapter 3

MODELING NATIONS’ FAILURE VIA DATA MINING TECHNIQUES

Mohamed M. Mostafa, Ph.D.
Gulf University for Science and Technology, Kuwait

ABSTRACT
Since the concept of ‘failed states’ was coined in the early 1990s, it
has come to occupy a top-tier position on the international peace and
security agenda. This study uses data mining techniques to examine the
effect of various social, economic and political factors on states’ failure at
effect of various social, economic and political factors on states’ failure at
the global level. Data mining techniques use a broad family of
computationally intensive methods that include decision trees, neural
networks, rule induction, machine learning and graphic visualization.
Three artificial neural network models, the multi-layer perceptron (MLP),
the radial basis function neural network (RBFNN) and the self-organizing
map (SOM), and one machine learning technique (support vector machines,
SVM) are compared to a standard statistical method (linear discriminant
analysis, LDA). The variable sets
considered are demographic pressures, movement of refugees, group
paranoia, human flight, regional economic development, economic
decline, delegitimization of the state, public services’ performance,
human rights status, security apparatus, elites’ behavior and the role
played by other states or external political actors. The study shows how it
is possible to identify various dimensions of states’ failure by uncovering
complex patterns in the dataset, and also shows the classification abilities
of data mining techniques.

54 Mohamed M. Mostafa

Keywords: Failed states, neural networks, machine learning, computational


intelligence

1. INTRODUCTION AND LITERATURE SURVEY


Helman and Ratner (1993) defined failed states as those states that are
simply unable to function as independent entities. Helman and Ratner’s
classification included nations such as Haiti, Yugoslavia, the USSR, Sudan,
Liberia, and Cambodia. Zartman (1995) defined failed states as those in which
“the basic functions of the state are no longer performed” (p. 2). Zartman’s
classification included nations such as Congo of the 1960s; Chad, Ghana and
Uganda of the early 1980s; and Somalia, Liberia and Ethiopia of the early
1990s. Hehir (2007) argues that failed states suffer from both “coercive
incapacity”, reflected in the state’s failure to monopolize the use of force,
and “administrative incapacity”, which involves a failure to provide the basic
services that most citizens expect from modern governments, such as a certain
level of personal security, economic stability, and functioning bureaucratic and
judicial systems. Using a multinomial logit model, Howard (2008) found that
the transition to a failing state is positively influenced by the presence of a
strong autocratic regime, state corruption and economic insecurity. However,
Lambach (2004) argues that there is no clear dividing line between state
failure and non-failure. The author rather distinguishes between weak states
that may still be
able to provide some level of political goods and collapsed states that cannot
guarantee even a modicum of order.
The 9/11 attacks brought failed states to the top tier of the international
peace and security agenda. Afghanistan’s failure to combat Al-Qaeda lent
new attention to the concept, because failed states were seen as safe harbors
and launching pads for terrorism and terrorist organizations. For example, in a
recent empirical study, Piazza (2008) found that the Failed States Index is a
significant predictor of transnational terrorism. Using a series of negative
binomial regressions, the author also found that states plagued by chronic state
failures are statistically more likely to host terrorist groups and are more likely
to be targeted by transnational terrorists themselves.
Although the failed states concept emerged as a testable concept
almost 20 years ago, no previous studies have attempted to use computational
intelligence techniques to predict, classify and cluster failed nations and
nations with high political risk. In this research we aim to fill this research gap
by predicting, classifying and clustering state failure across 177 nations

Modeling Nations’ Failure via Data Mining Techniques 55

through the use of intelligent modeling techniques. More specifically, the
purpose of this research is twofold:

• To determine the major factors that affect state failure at the
global level; and
• To benchmark the performance of computational intelligence models
against traditional statistical techniques.

Thus, this paper makes at least two important contributions to the broader
literature on state failure. First, most previous studies devoted to
understanding the state failure phenomenon are comprised of case study
evaluations of a few failed states (e.g., Lemarchand, 2003; Reno, 2003).
Our study includes 177
nations, which makes it the most comprehensive study so far. By doing so the
study adds depth to the knowledge base on state failure. Second, by employing
computational intelligence methods such as neural networks and support
vector machines, the study adds breadth to the debate over the causes of state
failure at the global level. The paper is organized as follows. The next section
summarizes the methodology used to conduct the analysis. The subsequent
section presents empirical results of the analysis. Next, the paper sets out some
implications of the analysis. This section also deals with the research
limitations and explores avenues for future research.

2. METHOD
2.1. Multi-Layer Perceptron

MLP consists of sensory units that make up the input layer, one or more
hidden layers of processing units (perceptrons), and one output layer of
processing units (perceptrons). The MLP performs a functional mapping from
the input space to the output space. The output of an MLP is compared to a
target output and an error is calculated. This error is back-propagated to the
neural network and used to adjust the weights. This process aims at
minimizing the mean square error between the network’s prediction output and
the target output.
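The forward-pass and back-propagation cycle just described can be condensed into a short sketch. This is an illustrative from-scratch toy on invented, linearly separable data, not the configuration used in this study:

```python
import math
import random

def train_mlp(samples, hidden=3, lr=0.5, epochs=2000, seed=1):
    """One-hidden-layer perceptron trained by error back-propagation:
    weights are adjusted to reduce the squared output-target error."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # The last weight of each row is the bias.
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(epochs):
        for x, t in samples:
            h = [sig(wr[-1] + sum(wi * xi for wi, xi in zip(wr, x))) for wr in w1]
            y = sig(w2[-1] + sum(wi * hi for wi, hi in zip(w2, h)))
            # Back-propagate the error: output delta, then hidden deltas.
            d_out = (y - t) * y * (1 - y)
            d_hid = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_hid[j] * x[i]
                w1[j][-1] -= lr * d_hid[j]
            w2[-1] -= lr * d_out
    def predict(x):
        h = [sig(wr[-1] + sum(wi * xi for wi, xi in zip(wr, x))) for wr in w1]
        return sig(w2[-1] + sum(wi * hi for wi, hi in zip(w2, h)))
    return predict

# Toy separable task: high (normalized) indicator values -> class 1.
samples = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.9, 0.8), 1), ((1.0, 1.0), 1)]
predict = train_mlp(samples)
```

After training, the network outputs values below 0.5 for the low-indicator points and above 0.5 for the high-indicator points.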
One of the first successful applications of MLP is reported by Lapedes and
Farber (1988). Using two deterministic chaotic time series generated by the
logistic map and the Glass-Mackey equation, they designed an MLP that can


accurately mimic and predict such dynamic nonlinear systems. There is an
extensive literature on financial applications of MLP (e.g., Kumar and
Bhattacharya, 2006; Harvey et al., 2000). Another major application of MLP is
in electric load consumption (e.g., Darbellay and Slama, 2000; McMenamin
and Monforte, 1998). Many other problems have been solved by MLP. A short
list includes air pollution forecasting (e.g., Videnova et al., 2006), maritime
traffic forecasting (Mostafa, 2009), airline passenger traffic forecasting (Nam
and Yi., 1997), railway traffic forecasting (Zhuo et al., 2007), commodity
prices (Kohzadi et al., 1996), ozone level (Ruiz-Suarez et al., 1995), student
grade point averages (Gorr et al., 1994), forecasting macroeconomic data
(Aminian et al., 2006), advertising (Poh et al., 1998), and market trends (Aiken
and Bsat, 1999).
The MLP is the most frequently used neural network technique in pattern
recognition (Bishop, 2006) and classification problems (Sharda, 1994).
However, numerous researchers document the disadvantages of the MLP
approach. For example, Calderon and Cheh (2002) argue that the standard
MLP network is subject to problems of local minima. Swicegood and Clark
(2001) claim that there is no formal method of deriving a MLP network
configuration for a given classification task. Thus, there is no direct method of
finding the ultimate structure for modeling process. Consequently, the refining
process can be lengthy, accomplished by iterative testing of various
architectural parameters and keeping only the most successful structures.
Wang (1995) argues that standard MLP provides unpredictable solutions in
terms of classifying statistical data.

2.2. Radial Basis Function Neural Network

The basic architecture for a RBFNN is a 3-layer network. The input layer
is simply a fan-out layer and does no processing. The second or hidden layer
performs a non-linear mapping from the input space into a higher dimensional
space in which the patterns become linearly separable. The final layer
therefore performs a simple weighted sum with a linear output.
The unique feature of the RBFNN is the process performed in the hidden
layer. The idea is that the patterns in the input space form clusters. If the
centers of these clusters are known, then the distance from the cluster centre
can be measured. Furthermore, this distance measure is made non-linear, so
that if a pattern is in an area that is close to a cluster centre it gives a value
close to 1. Beyond this area, the value drops dramatically. The notion is that
this area is radially symmetrical around the cluster centre, so that the
non-linear function becomes known as the radial basis function.
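The hidden-layer computation just described can be sketched directly; this is an illustrative fragment with invented cluster centres (found beforehand, e.g. by k-means), not a full RBFNN as used in the study. The output layer would then take a weighted linear sum of these activations:

```python
import math

def rbf_activations(x, centres, width=1.0):
    """Hidden-layer output of an RBF network: one Gaussian bump per cluster
    centre, near 1 close to the centre and decaying sharply with distance."""
    acts = []
    for c in centres:
        dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        acts.append(math.exp(-dist_sq / (2 * width ** 2)))
    return acts

# Hypothetical cluster centres in a 2-D input space.
centres = [(0.0, 0.0), (5.0, 5.0)]
print(rbf_activations((0.1, 0.0), centres))  # ~[0.995, ~0]
```

A point near the first centre activates its unit almost fully while the distant unit stays near zero, which is what makes the patterns linearly separable for the final layer.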
Since the RBFNN has only one hidden layer and has fast convergence
speed, it is widely used for non-linear mappings between inputs and outputs.
Examples include detecting spam email (Jiang, 2007), financial distress
prediction (Cheng et al., 2006), public transportation (Celikoglu & Cigizoglu,
2007), classification of active components in traditional medicine (Liu et al.,
2009), classification of audio signals (Dhanalakshmi et al., 2009), prediction
of athletes’ performance (Iyer & Sharda, 2009), and face recognition
(Balasubramanian et al., 2009).

2.3. Support Vector Machines

SVMs were developed by Vapnik (1995) as a novel type of machine
learning. SVMs are a set of related supervised learning methods used for
classification and regression. In the case of classification, SVMs obtain the
‘optimal’ boundary of two classes in a vector space independently on the
probabilistic distributions of training vectors in the data set. If the categories
are linearly separated, the aim of SVMs is to find the ‘optimal’ hyperplane
boundary which separates both classes, classifying not only the training set but
also unknown samples. When the classes are not linearly separable, the input
data are implicitly mapped into a higher dimensional space by a kernel
function, e.g., the Gaussian radial basis function (Berrueta et al., 2007).
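For the linearly separable case, the search for a separating boundary can be sketched with a from-scratch sub-gradient descent on the regularized hinge loss. This toy (invented data, no kernel) only illustrates the idea; it is not the SVM implementation used in the study:

```python
def train_linear_svm(data, lam=0.01, lr=0.1, epochs=500):
    """Linear soft-margin SVM fitted by batch sub-gradient descent on the
    regularized hinge loss; labels must be +1/-1."""
    dim = len(data[0][0])
    w = [0.0] * (dim + 1)                  # last component acts as the bias
    for _ in range(epochs):
        grad = [lam * wi for wi in w]      # gradient of the regularizer
        for x, y in data:
            xa = list(x) + [1.0]           # augment with a constant feature
            if y * sum(wi * xi for wi, xi in zip(w, xa)) < 1:
                # Point violates the margin: hinge-loss sub-gradient applies.
                for i in range(dim + 1):
                    grad[i] -= y * xa[i] / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical 2-D data: stable states (-1) vs. failing states (+1).
data = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
w = train_linear_svm(data)
```

The learned weight vector gives a negative decision value for the (-1) points and a positive one for the (+1) points; kernelized SVMs replace the inner product with a kernel evaluation.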
SVMs are based on the structural risk minimization (SRM) principle,
which has been shown to be superior to the traditional empirical risk
minimization (ERM) principle employed by neural networks (Kecman, 2005). SRM
minimizes an upper bound of the generalization error based on the Vapnik-
Chervonenkis (VC) dimension, as opposed to ERM, which minimizes the
training error. Recently, Li et al. (2008) found that SVMs have better
modeling performance than the MLP in short-term freeway traffic volume
forecasting. Compared to neural networks, other SVMs advantages include
their strong theoretical basis which provides them with high generalization
capability. SVMs always have a solution, which can be quickly obtained by a
standard quadratic programming algorithm.
SVMs have been used in a range of areas like financial applications (e.g.,
Ravi et al., 2008), classification of fragrance properties (Luan et al., 2008),
dynamic classification for video stream (Awad & Motai, 2008), classification
of forest fire types (Koetz et al., 2008), consumer churn prediction


(Coussement and Van den Poel, 2008), text categorization (Hmeidi et al.,
2008), spam classification (Yu & Xu, 2008) and estimating production levels
(Chen & Wang, 2007).

2.4. Self-Organizing Maps

The SOM, also called Kohonen map, is a heuristic model for exploring
and visualizing patterns in high dimensional datasets. It was first introduced to
the neural networks community by Kohonen (1982). SOM can be viewed as a
clustering technique that identifies clusters in a dataset without the rigid
assumptions of linearity or normality of more traditional statistical techniques.
Indeed, like k-means, it clusters data based on an unsupervised competitive
algorithm where each cluster has a fixed coordinate in a topological map
(Audrain-Pontevia, 2006). The SOM is trained based on an unsupervised
training algorithm where no target output is provided and the network evolves
until convergence. Based on Gladyshev’s theorem, it has been shown that
SOM models have almost sure convergence (Lo & Bavarian, 1993).
The SOM consists of only two layers: the input layer which classifies data
according to their similarity, and the output layer of radial neurons arranged in
a two-dimensional map. Output neurons will self-organize to an ordered map
and neurons with similar weights are placed together. They are connected to
adjacent neurons by a neighborhood relation, dictating the topology of the map
(Moreno et al., 2006). The number of neurons can vary from a few dozen to
several thousand. Since the SOM compresses information while preserving the
most important topological and metrical relationships of the primary data
elements on the display, it can also be used for pattern classification (Silven et
al., 2003).
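The competitive, neighborhood-based update that drives a SOM can be sketched for a small one-dimensional map; this is an illustrative toy on invented data, with units initialized from randomly sampled data points (a common choice), not the SOM configuration used in the study:

```python
import random

def train_som(data, n_units=4, epochs=50, seed=3):
    """1-D self-organizing map: the best-matching unit (BMU) and its map
    neighbours move toward each sample; learning rate and radius shrink."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [list(p) for p in rng.sample(data, n_units)]
    for e in range(epochs):
        lr = 0.5 * (1 - e / epochs)              # decaying learning rate
        radius = 1 if e < epochs // 2 else 0     # shrinking neighbourhood
        for x in data:
            bmu = min(range(n_units),
                      key=lambda u: sum((units[u][i] - x[i]) ** 2
                                        for i in range(dim)))
            for u in range(n_units):
                if abs(u - bmu) <= radius:       # neighbours on the 1-D map
                    for i in range(dim):
                        units[u][i] += lr * (x[i] - units[u][i])
    return units

# Hypothetical normalized profiles forming two well-separated groups.
data = [(0.1, 0.1), (0.2, 0.1), (0.15, 0.2),
        (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
units = train_som(data)
```

After training, each group of profiles maps to its own unit on the grid, which is the property that makes the SOM useful for visual clustering.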
Due to the unsupervised character of their learning algorithm and the
excellent visualization ability, SOMs have been recently used in myriad
classification and clustering tasks. Examples include classifying cognitive
performance in schizophrenic patients and healthy individuals (Silver &
Shmoish, 2008), mutual funds classification (Moreno et al., 2006), speech
quality assessment (Mahdi, 2006), vehicle routing (Ghaziri & Osman, 2006),
network intrusion detection (Zhong et al., 2007), anomalous behavior in
communication networks (Frota et al., 2007), compounds pattern recognition
(Yan, 2006), market segmentation (Kuo et al., 2002), clustering green
consumer behavior (Mostafa, 2009) and classifying magnetic resonance brain
images (Chaplot et al., 2006).


3. RESULTS
3.1. Preliminary Data Analysis

Failed states data in this study were taken from the Fund for Peace
(FundForPeace.org) and Foreign Policy magazine (2009). The failed states
index (FSI) rates 12 social, economic and political indicators, derived from
open-source materials. These 12 indicators are: mounting demographic
pressures, massive movement of refugees and internally displaced persons,
legacy of vengeance-seeking group grievance, chronic and sustained human
flight, uneven economic development along group lines, sharp or severe
economic decline, criminalization or delegitimization of the state, progressive
deterioration of public services, widespread violation of human rights, the
security apparatus operating as a “state within a state”, rise of factionalized
elites, and intervention
of other states or external actors. The rank order of the states shown in Figure
1 is based on the total scores of the 12 indicators. For each indicator, the
ratings are placed on a scale of 0 to 10, with 0 being the lowest intensity (most
stable) and 10 being the highest intensity (least stable). The total score is the
sum of the 12 indicators and is on a scale of 0-120. In the 2009 index there are
177 states, compared to only 146 in 2006 and 75 in 2005. Only recognized
sovereign states based on the UN membership are included in the analysis.
Thus, several territories such as Taiwan, Palestine and Kosovo are excluded
from the index. In 2009 the FSI ranged from 18.3 in Norway to 114.7 in
Somalia. Furthermore, the Fund for Peace places nations into four categories
based on their scores. The most at-risk countries are placed in the “Alert”
category. This group includes nations having indices between 91 and 120; the
“Warning” category is reserved for countries scoring between 61 and 90; the
“Monitoring” category includes nations with a score ranging from 31 to 60;
and the “Sustainable” category includes nations scoring between 12 and 30.
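The banding scheme just described can be expressed as a small function; this is an illustrative sketch, and the handling of fractional scores exactly at a band edge is our assumption, not the Fund for Peace's:

```python
def fsi_category(total_score):
    """Map an FSI total (sum of 12 indicators, each rated 0-10, so 0-120)
    to the Fund for Peace category bands described in the text."""
    if total_score > 90:
        return "Alert"        # 91-120: most at-risk countries
    if total_score > 60:
        return "Warning"      # 61-90
    if total_score > 30:
        return "Monitoring"   # 31-60
    return "Sustainable"      # 12-30

# The 2009 extremes reported above: Somalia (114.7) and Norway (18.3).
print(fsi_category(114.7), fsi_category(18.3))  # Alert Sustainable
```

This mirrors how the 177 nations of the 2009 index are grouped into the four categories shown in Figure 1.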
Figure 2 shows boxplots for all variables used in the analysis.
Since the FSI is supposed to measure a single “failure” dimension of
states, factor analysis was conducted to ascertain the unidimensionality of
the index. Using the standard eigenvalue criterion of 1.0 (Child, 1990) and an
inspection of the scree plot (Figure 3), factor analysis yielded one factor.
The total variance explained (79.06 percent) exceeds the 60 percent threshold
commonly used in the social sciences to establish satisfaction with the
solution (Hair et al., 1998). We used the Kaiser-Meyer-Olkin (KMO) measure of
sampling adequacy (Kaiser, 1970) to measure the adequacy of the sample for
factor extraction.


Figure 1. Failed States Index (map legend: Alert; Warning; Monitoring/stable; Sustainable/most stable; No information/dependent territory). Sources: The Fund for Peace (FundForPeace.org) and Foreign Policy (July/August, 2009, pp. 80-83).

Figure 2. Boxplots of variables used in the analysis.

Modeling Nations’ Failure via Data Mining Techniques 61

The KMO value found (0.942) is indicative of a data set considered to be highly desirable for factor analysis (Kim and Mueller, 1978). Bartlett's test of sphericity, which tests whether the correlation matrix is an identity matrix, is significant (approx. chi-square = 3055.735, df = 66, p < 0.001). This indicates that the factor model is appropriate.
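The two retention checks used above, Kaiser's eigenvalue-greater-than-one rule and the share of total variance explained, can be sketched as follows. The eigenvalues below are hypothetical placeholders, except that the first is chosen to reproduce the reported 79.06 percent: with 12 standardized indicators, total variance is 12, so 0.7906 × 12 = 9.4872.

```python
def retained_factors(eigenvalues, cutoff=1.0):
    """Kaiser criterion: keep components whose eigenvalue exceeds the cutoff."""
    return [ev for ev in eigenvalues if ev > cutoff]

def pct_variance_explained(retained, n_variables):
    # With standardized variables, total variance equals the number of variables.
    return 100.0 * sum(retained) / n_variables

# Hypothetical scree: only the first eigenvalue is pinned down by the text.
eigenvalues = [9.4872, 0.62, 0.45, 0.35, 0.28, 0.22,
               0.17, 0.13, 0.11, 0.09, 0.06, 0.0328]
kept = retained_factors(eigenvalues)
print(len(kept))                                    # one factor retained
print(round(pct_variance_explained(kept, 12), 2))   # -> 79.06
```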

Figure 3. Scree plot of eigenvalues vs. components.

3.2. MLP, RBFNN and SVM-Based Classification

There are many software packages available for fitting MLP models. We chose the SPSS Neural Networks package (SPSS, 2007), which applies artificial intelligence techniques to automatically find an efficient MLP architecture (the MLP design used in this study is shown in Figure 4). Typically, the application of an MLP requires a training data set and a testing
data set (Lek and Guegan, 1999). The training data set is used to train the MLP
and must have enough examples of data to be representative for the overall
problem. The testing data set should be independent of the training set and is
used to assess the classification accuracy of the MLP after training. Following
Lim and Kirikoshi (2005), an error back-propagation algorithm with weight updates occurring after each epoch was used for MLP training. The learning
rate was set at 0.1. Table 1 reports the properties of the MLP model, and Table 2 shows the MLP classification accuracy. From Table 2 we see that the MLP classifier predicted the training sample with 97.2% accuracy and the validation (testing) sample with 94.4% accuracy.

Figure 4. Multi-layer perceptron neural network architecture.


Table 1. MLP neural network configuration

Input Layer       Covariates: 1 Demog; 2 Refugees; 3 Grp_Griev; 4 Hum_Flight;
                  5 Uneven_Dev; 6 Econ_Dec; 7 State_Deleg; 8 Pub_Serv;
                  9 Hum_Rights; 10 Sec_App; 11 Fact_Elites; 12 Exter_Interv
                  Number of Units (a): 12
                  Rescaling Method for Covariates: Standardized
Hidden Layer(s)   Number of Hidden Layers: 1
                  Number of Units in Hidden Layer 1 (a): 5
                  Activation Function: Hyperbolic tangent
Output Layer      Dependent Variables: 1 Failed_Status
                  Number of Units: 4
                  Activation Function: Softmax
                  Error Function: Cross-entropy
a. Excluding the bias unit.
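The architecture in Table 1 (12 standardized covariates, one hidden layer of 5 hyperbolic-tangent units, 4 softmax outputs scored with cross-entropy) amounts to the forward pass sketched below. The weights here are random placeholders, not the coefficients of the trained SPSS network:

```python
import math
import random

random.seed(0)
# Placeholder weights: 12 inputs -> 5 tanh hidden units -> 4 softmax outputs.
W1 = [[random.uniform(-0.5, 0.5) for _ in range(12)] for _ in range(5)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(4)]

def forward(x):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    m = max(logits)                       # stabilized softmax
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]          # pseudo-probabilities over 4 classes

def cross_entropy(probs, true_class):
    """Error function from Table 1, for a single case."""
    return -math.log(probs[true_class])

probs = forward([0.1] * 12)
print(len(probs), round(sum(probs), 6))   # 4 classes; probabilities sum to 1
```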

Table 2. MLP neural network classification

Predicted
Sample Observed bdrline critical in_dang stable Percent Correct
Training bdrline 27 0 2 0 93.1%
critical 0 34 0 0 100.0%
in_dang 0 1 65 0 98.5%
stable 1 0 0 11 91.7%
Overall Percent 19.9% 24.8% 47.5% 7.8% 97.2%
Testing bdrline 3 0 0 1 75.0%
critical 0 4 0 0 100.0%
in_dang 1 0 26 0 96.3%
stable 0 0 0 1 100.0%
Overall Percent 11.1% 11.1% 72.2% 5.6% 94.4%
Dependent Variable: Failed_Status.
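The percentages in Table 2 follow directly from the confusion matrix: the hit ratio is the diagonal count over the total number of cases. A quick check on the training block of Table 2:

```python
# Training confusion matrix from Table 2
# (rows: observed bdrline, critical, in_dang, stable; columns: predicted).
training = [
    [27, 0, 2, 0],
    [0, 34, 0, 0],
    [0, 1, 65, 0],
    [1, 0, 0, 11],
]

def hit_ratio(confusion):
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

print(round(hit_ratio(training), 1))  # -> 97.2, matching Table 2
```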


RBFNN was also implemented using the SPSS Neural Networks (SPSS,
2007) package (RBFNN design used in this study is shown in Figure 5). The
basic configuration of the RBFNN used is shown in Table 3. The learning rates for the RBFNN parameters were varied between 0.001 and 0.1, and those for the weights between 0.1 and 0.7. Training stops when either the error goal of 0.001 is reached or the misclassification rate falls below one percent. Table 4 reports the RBFNN classification results. From Table 4 we see that the hit ratio for the training sample is 97.9% and the hit ratio for the validation sample is 97%.

Figure 5. Radial basis function neural network architecture.


Table 3. RBF neural network configuration

Input Layer       Covariates: 1 Demog; 2 Refugees; 3 Grp_Griev; 4 Hum_Flight;
                  5 Uneven_Dev; 6 Econ_Dec; 7 State_Deleg; 8 Pub_Serv;
                  9 Hum_Rights; 10 Sec_App; 11 Fact_Elites; 12 Exter_Interv
                  Number of Units: 12
                  Rescaling Method for Covariates: Standardized
Hidden Layer      Number of Units (a): 4
                  Activation Function: Softmax
Output Layer      Dependent Variables: 1 Failed_Status
                  Number of Units: 4
                  Activation Function: Identity
                  Error Function: Sum of Squares
a. Determined by the testing data criterion: the "best" number of hidden units is the one that yields the smallest error in the testing data.
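The RBF hidden layer in Table 3 works differently from the MLP's: each of the 4 hidden units responds to the distance between the input and the unit's center, and (per the "Softmax" activation entry) the responses are normalized to sum to one. A sketch with illustrative centers and width, not the fitted SPSS values:

```python
import math

def rbf_hidden(x, centers, width=1.0):
    """Normalized Gaussian radial basis activations (softmax over units)."""
    raw = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                    / (2.0 * width ** 2))
           for c in centers]
    s = sum(raw)
    return [r / s for r in raw]

# Four illustrative 12-dimensional centers, one per hidden unit (Table 3).
centers = [[0.0] * 12, [1.0] * 12, [-1.0] * 12, [0.5] * 12]
acts = rbf_hidden([0.2] * 12, centers)
print(round(sum(acts), 6))      # activations sum to 1
print(acts.index(max(acts)))    # the nearest center responds most strongly
```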

Table 4. RBF neural network classification

Predicted
Sample Observed bdrline critical in_dang stable Percent Correct
Training bdrline 26 0 0 0 100.0%
critical 0 28 1 0 96.6%
in_dang 0 1 76 0 98.7%
stable 1 0 0 11 91.7%
Overall Percent 18.8% 20.1% 53.5% 7.6% 97.9%
Testing bdrline 7 0 0 0 100.0%
critical 0 9 0 0 100.0%
in_dang 1 0 15 0 93.8%
stable 0 0 0 1 100.0%
Overall Percent 24.2% 27.3% 45.5% 3.0% 97.0%
Dependent Variable: Failed_Status.


SVMs were implemented using the svm function in the R package e1071 (Dimitriadou et al., 2005). This function provides a user-friendly interface to the LIBSVM library developed by Chang and Lin (2001), along with visualization and parameter-tuning methods. It is currently one of the most widely used implementations of SVM algorithms, as it provides a robust and fast SVM implementation and produces state-of-the-art results on most classification and regression problems (Karatzoglou et al., 2005). Appendix 1 provides the R code used to conduct the SVM analysis. As seen in Figure 6, the correct classification rate for both the training and test samples was 100%. Figure 7 shows a contour plot of SVM performance across different settings of the cost parameter C and the kernel parameter gamma.
Figure 6. Support vector machine model complexity vs. error rate.


Figure 7. Contour plot of SVM performance (cost parameter C vs. log10(gamma)).

To study the effectiveness of MLP, RBFNN and SVM-based classification of failed states, the results of the MLP, RBFNN and SVM were compared with traditional multiple discriminant analysis (MDA). MDA is a frequently used supervised pattern recognition technique: a linear function of the variables is sought that maximizes the between-class variance relative to the within-class variance. MDA is an extremely simple and efficient method of classification; indeed, it cannot be outperformed if the class distributions are normal and share the same dispersion matrix (i.e., it attains the Bayes limit). Figure 8 shows the group centroids of the failed states on the canonical discriminant functions. A common measure of predictive models is the percentage of observations correctly classified, or the hit ratio. The MDA model had an accuracy rate of 93.2%, with a leave-one-out validation accuracy of 86.4%, as shown in Table 5.


Table 5. LDA classification resultsb,c

Predicted Group
Membership
Failed_Code 1.0 2.0 3.0 4.0 Total
Original Count 1.0 38 0 0 0 38
2.0 6 85 2 0 93
3.0 0 1 29 3 33
4.0 0 0 0 13 13
% 1.0 100.0 .0 .0 .0 100.0
2.0 6.5 91.4 2.2 .0 100.0
3.0 .0 3.0 87.9 9.1 100.0
4.0 .0 .0 .0 100.0 100.0
Cross-validated (a) Count 1.0 38 0 0 0 38
2.0 9 81 3 0 93
3.0 0 3 22 8 33
4.0 0 0 1 12 13
% 1.0 100.0 .0 .0 .0 100.0
2.0 9.7 87.1 3.2 .0 100.0
3.0 .0 9.1 66.7 24.2 100.0
4.0 .0 .0 7.7 92.3 100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each
case is classified by the functions derived from all cases other than that case.
b. 93.2% of original grouped cases correctly classified.
c. 86.4% of cross-validated grouped cases correctly classified.
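Footnote (a) describes leave-one-out cross-validation: each case is scored by a model fitted to all remaining cases. The scheme is easy to sketch with a toy nearest-centroid classifier standing in for the discriminant functions (illustrative data, not the FSI set):

```python
def loo_accuracy(X, y):
    """Leave-one-out: classify each case with centroids built from the rest."""
    correct = 0
    for i in range(len(X)):
        groups = {}
        for j in range(len(X)):
            if j != i:
                groups.setdefault(y[j], []).append(X[j])
        # Mean vector of each group, excluding case i.
        centroids = {g: [sum(col) / len(col) for col in zip(*pts)]
                     for g, pts in groups.items()}
        pred = min(centroids, key=lambda g: sum((a - b) ** 2
                   for a, b in zip(X[i], centroids[g])))
        correct += (pred == y[i])
    return correct / len(X)

X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
y = ["stable", "stable", "stable", "critical", "critical", "critical"]
print(loo_accuracy(X, y))  # -> 1.0 on this cleanly separated toy data
```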

Figure 9 displays the MLP cumulative gain chart (a similar figure was obtained for the RBFNN). This chart shows the percentage of the overall number of cases in a given category gained by targeting a percentage of the total number of cases. For example, the first point on the curve for the in danger category is at (10%, 20%): if a dataset is scored with the network and all the cases are sorted by predicted pseudo-probability of being in danger, we would expect the top 10% to contain approximately 20% of all of the cases that actually fall in the in danger category. Likewise, the top 20% would contain approximately 40% of the in danger states, and so on.


Figure 8. Failed states group centroids.

Figure 9. Multi-layer perceptron neural network gain chart.


The diagonal line is the baseline curve: if 10% of the cases are selected from the scored dataset at random, we would expect to gain approximately 10% of all of the cases that actually fall in the target category. The farther above the baseline a curve lies, the greater the gain.
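A point on the gain curve can be computed as follows: sort cases by the predicted pseudo-probability for the target category, take the top fraction, and ask what share of all target-category cases it captures (the baseline is the fraction itself). The scores and labels below are hypothetical illustrations:

```python
def cumulative_gain(scores, labels, target, top_fraction):
    """Share of `target` cases captured in the top `top_fraction` of cases,
    when cases are ranked by predicted pseudo-probability."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = int(len(order) * top_fraction)
    hits_in_top = sum(labels[i] == target for i in order[:k])
    total_hits = sum(lab == target for lab in labels)
    return hits_in_top / total_hits

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = ["in_dang", "in_dang", "other", "in_dang", "other",
          "other", "other", "other", "other", "other"]
# The top 20% of cases capture 2 of the 3 in_dang cases; baseline would be 20%.
print(round(cumulative_gain(scores, labels, "in_dang", 0.2), 3))  # -> 0.667
```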
Despite the satisfactory classification performance of the MLP, RBFNN
and SVM in this study, such models are often criticized as black boxes that do
not allow decision-makers to make inferences on how the input variables
affect the models’ results. One way to address this issue is to conduct a
variable impact analysis (VIA). The purpose of VIA is to measure the
sensitivity of net predictions to changes in independent variables. Figure 10
shows that the most important input variables for the MLP are refugees,
security apparatus and external intervention. Similar results were obtained
using the RBFNN. The lower the percent value for a given variable, the less
that variable affects the predictions. The results of the analysis can help in the
selection of a new set of independent variables, one that will allow more
accurate predictions. For example, a variable with a low impact value can be
eliminated in favor of some new variables.

Figure 10. Multi-layer perceptron neural network variable impact analysis.
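One common way to carry out such a variable impact analysis is by perturbation: hold the inputs at a baseline, nudge one variable at a time, and express each resulting output shift as a percentage of the total shift. The toy linear model below is a hypothetical stand-in for the trained network:

```python
def variable_impact(model, baseline, delta=0.1):
    """Percent impact of each input, via one-at-a-time perturbation."""
    base_out = model(baseline)
    shifts = []
    for i in range(len(baseline)):
        x = list(baseline)
        x[i] += delta
        shifts.append(abs(model(x) - base_out))
    total = sum(shifts)
    return [100.0 * s / total for s in shifts]

# Hypothetical sensitivities: the first input dominates.
weights = [3.0, 1.0, 0.5]
model = lambda x: sum(w * xi for w, xi in zip(weights, x))
impacts = variable_impact(model, [0.0, 0.0, 0.0])
print([round(p, 1) for p in impacts])  # largest share for the first input
```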


3.3. SOM-Based Clustering

There are many software packages available for fitting SOM models. We chose the SOMine package, version 5.0 (Viscovery Software, 2008), which applies artificial intelligence techniques to automatically find efficient SOM clusters. To visualize the cluster structure, some authors use the unified distance matrix (U-matrix) (e.g., Vijayakumar et al., 2007; Stavrou et
al., 2010). However, this method does not give crisp boundaries to the clusters
(Worner & Gevrey, 2006). In this study a hierarchical cluster analysis with a
Ward linkage method was applied to the SOM to clearly delineate the edges of
each cluster. The number of neurons was chosen to be 2000. There are two learning algorithms for the SOM (Kohonen, 2001): the sequential (stochastic) learning algorithm and the batch learning algorithm. In the former, the reference vectors are updated immediately after each input vector is presented; in the latter, the update is done using all input vectors at once. While the batch algorithm does not suffer from convergence problems, the sequential algorithm is stochastic in nature and is less likely to become trapped in a local minimum. Following Ding & Patra (2007), we chose the sequential learning algorithm to train the SOM.
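A single step of the sequential rule can be sketched as follows: present one input vector, find the winning (closest) unit, and move it toward the input. A full SOM would also update the winner's grid neighbors by a neighborhood-weighted amount; the learning rate and vectors here are illustrative:

```python
def sequential_step(units, x, alpha=0.5):
    """Update the winning reference vector after a single input (sequential SOM).
    Neighborhood updates are omitted for brevity."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in units]
    winner = dists.index(min(dists))
    units[winner] = [wi + alpha * (xi - wi)
                     for wi, xi in zip(units[winner], x)]
    return winner

units = [[0.0, 0.0], [1.0, 1.0]]
w = sequential_step(units, [0.9, 0.8])
print(w, units[w])  # the closer unit wins and moves toward the input
```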
The SOM cluster results are shown in Figure 11. This two-dimensional
hexagonal grid shows clear division of the input pattern into four clusters.
Since the order on the grid reflects the neighborhood within the data, features
of the data distribution can be read off from the emerging landscape on the
grid. Figure 11 shows four discernible clusters of failed states. This four-cluster solution meets the qualitative criteria that Siew et al. (2002) propose for selecting a representative SOM model: representability, explainability and level of sophistication. Representability refers to the requirement that the variables in each cluster be distinct and carry some information of their own, so that the resulting profile for each cluster is unique and meaningful. Explainability means that the clusters themselves are distinct. Level of sophistication means that the size of each cluster should be monitored, so that there are neither clusters so large that they might hide more distinct subgroups nor clusters so small that they might indicate artificial clusters.
When assessing the quality of clustering model for validation purposes,
quantitative criteria can also be used (Zhuang et al., 2009). We used the
Kohonen software package (Wehrens and Buydens, 2007) to validate the
cluster results.


Figure 11. SOM-Ward clusters (labels: in danger, stable, critical, border line).

Figure 12. SOM counts and quality plots (left panel: "Failed states data: counts"; right panel: "Failed states data: quality").

Figure 12 shows both the SOM counts and the mapping quality. In the left
plot, the background color of a unit corresponds to the number of samples
mapped to the unit. The figure shows that there is a reasonable spread over the map. One of the units is empty (depicted in grey), meaning that
no samples have been mapped to it. The right plot shows the quality of the
mapping. It represents the mean distance between objects mapped to a
particular unit and the input vector of that unit. A good mapping should show
small distances everywhere in the map. An alternative method, called the bi-
directional Kohonen mapping (Melssen et al., 2006) has also been

implemented; the results obtained are very similar to those reported above. Another way to check the validity of the SOM during the training phase is to see whether the reference vectors are becoming more and more similar to the closest objects in the dataset. From Figure 13 we see the effect of the neighborhood shrinking until it includes only the winning unit, which implies that no further iterations are needed to optimize the training parameters.

Figure 13. SOM neighborhood shrinkage plot (training progress: mean distance to closest unit vs. iteration).

Cluster 1 is called "critical." This green-colored cluster has a frequency of 21.47% and corresponds to the "alert" zone in the Fund for Peace classification. Cluster 2 is called "in danger." This blue-colored cluster has a frequency of 37.85%; it is the largest cluster and corresponds to the "warning" zone in the Fund for Peace classification. The third cluster is called "border line." This yellow-colored cluster has a frequency of 20.34% and corresponds to the "monitoring" zone in the Fund for Peace
classification. Finally, the fourth cluster is labeled "stable." This cluster corresponds to the "sustainable" zone in the Fund for Peace classification. Table 6 summarizes the basic information for each cluster. The SOM results validate the Fund for Peace and Foreign Policy classifications.
Based on the SOM-Ward clusters, feature or component maps can be
constructed (Vesanto, 1999). These maps are also known in the literature as 'temperature maps' (Churilov & Flitman, 2006). On these maps, the nodes
which share similar information are organized in close color proximity to each
other. Figure 14 shows the feature maps for every cluster and for all input
attributes. Feature maps show the distribution of values of the respective input
component over the map. Relationships between variables could be inspected
by visually comparing the pattern of shaded pixels for each map; similarity of
the patterns indicates strong monotonic relationships between the variables.
The name of the displayed input component appears on top of each map. The
color scale at the bottom of the component window shows that blue is used for
low values, green for mid-range values and red for high values. From the
feature maps we note, for example, that the "critical" cluster includes the highest constellation of red pixels: nations characterized by mounting demographic pressures, massive movement of refugees, group paranoia, chronic or sustained human flight, uneven economic development, sharp economic decline, delegitimization of the state, progressive deterioration of public services, widespread violation of human rights, a strong security apparatus, factionalized elites and intervention of other states or external political actors. This implies that these variables are positively related to state failure, a result previously confirmed by other researchers (e.g., Howard, 2008). In essence, these colorful maps can confirm previously theorized relationships and even suggest new ones. They also make it possible to find subgroups that do not follow the main theoretical assumptions. For example, red dots found in the middle of a yellow or blue area signal the presence of deviant subgroups, while blue or red nodes forming two clearly separated areas might be considered a sign of non-linear correlation (Thuneberg & Hotulainen, 2006).
Figure 15 shows the predictive ability of the SOM model for a randomly
chosen set of the nations included in the analysis. For example, Zimbabwe is
correctly classified as a critically failed nation, while Belgium is correctly
classified as a stable nation (SOM prediction accuracy was 97.74%).

Table 6. SOM cluster summary

Seg.*  Freq%  Dem   Ref   Gr_Gr  Hu_Fl  Un_Dev  Ec_Dec  S_De  P_Ser  H_Ri  Sec_Ap  F_El  Ex_In
1      21.47  8.18  7.99  8.50   7.17   8.45    7.13    8.69  7.72   8.31  8.18    8.69  7.74
2      37.85  7.51  5.59  6.37   6.30   7.48    6.69    7.53  7.05   6.88  6.65    7.09  6.74
3      20.34  5.94  3.60  4.92   6.14   6.79    5.51    5.83  5.19   4.88  4.65    4.82  5.46
4      20.34  3.45  2.48  3.69   2.53   4.10    3.41    2.94  2.44   2.98  2.21    2.83  3.03
Seg.*: 1 = critical; 2 = in danger; 3 = border line; 4 = stable.


Figure 14. SOM temperature maps.


Figure 15. SOM prediction maps.


4. IMPLICATIONS, LIMITATIONS AND FUTURE RESEARCH


Our results confirm the theoretical work of Hecht-Nielsen (1989), who showed that computational intelligence models can learn input-output relationships to the point of making perfect forecasts on the data on which the network is trained. However, perfect forecasts on the training data do not guarantee optimal forecasts on the testing data, owing to differences between the two data sets. The good performance of these models in predicting and classifying failed states can be traced to their inherent non-linearity. This makes such
techniques ideal for dealing with non-linear relations that may exist in the
data. Thus, computational intelligence models are needed to better understand
the inner dynamics of failed states at the global level. Our results are also in
line with the findings of other researchers who have investigated the
performance of neuro-computational models compared to other traditional
statistical techniques, such as regression analysis, discriminant analysis, and
logistic regression analysis. For example, in a study of clinical diagnosis of
cancers, Shan et al., (2002) found a hit ratio of 85% for the neural network
model compared to 80% for the LDA model. In a study of credit-scoring
models used in commercial and consumer lending decisions, Bensic et al.,
(2005) compared the performance of logistic regression, neural networks and
decision trees. The neural network model produced the highest hit rate and the
lowest type I error. Similar findings have been reported in a study examining
the performance of neural networks in predicting bankruptcy (Anandarajan et
al., 2001) and diagnosis of acute appendicitis (Sakai et al., 2007).
Based on variable impact analysis, our findings imply that refugees may
have a “billiard effect” on states failure. For example, the civil war in Liberia
hastened the collapse of Sierra Leone, and the flow of refugees from Sierra
Leone disrupted the unstable Guinean government. The Democratic Republic
of Congo collapsed rapidly in the aftermath of the Rwandan genocide. Our
analysis also highlights the fact that declining democracy as manifested by a
strong security apparatus is correlated with state failure. From both our SOM
and variable impact analysis it is clear that states ruled by a strong security
apparatus are far more likely to fail than stable democracies. We found that
uneven development has an important impact on states’ failure. This finding is
in line with van de Walle’s (2004) findings. Van de Walle argues that poorly
executed macro-economic policies can lead to the failure of the state until the
state ceases to provide virtually any public goods, and state agents become
entirely predatory through rent seeking and corruption.


Despite the significant contributions of this study, it suffers from a number of limitations. First, the study used a cross-sectional rather than a longitudinal approach, so the emphasis has been on observing failure across nations rather than on observing changes in states' failure rates. There would hence seem to be a need for much more longitudinal research focusing on changes in state failure over time. Second,
despite the satisfactory performance of the computational intelligence models
in this study, future research might improve the performance of the models
used in this study by integrating fuzzy discriminant analysis and genetic
algorithms (GA) with computational intelligence models. Mirmirani and Li
(2004) pointed out that traditional algorithms search for optimal weight
vectors for a neural network with a given architecture, while GA can yield an
efficient exploration of the search space when the modeler has little a priori
knowledge of the structure of problem domains. Finally, future research might
use other computational intelligence and evolutionary computation models’
architectures such as gene expression programming (GEP) to classify and
predict nations’ failure. GEP was first introduced to the genetic programming
(GP) community by Ferreira (2001). Thus, it is the most recent development in
the field of artificial evolutionary systems (Ferreira, 2004). Due to the
unsupervised character of their learning algorithm and the excellent
visualization ability, GEP models have been recently used in myriad fields.
Examples include particle physics data analysis (Teodorescu & Sherwood,
2008), food processing (Kahyaoglu, 2008), real parameter optimization (Xu et
al., 2009), and chaotic maps analysis (Hardy & Steeb, 2002).

APPENDIX 1. R CODE USED TO IMPLEMENT SVM


fsi<-read.table("c:\\FSI.txt", header=T)
library(e1071)
fsi<-fsi[sample(nrow(fsi)), ]  # shuffle rows before the hold-out split; sample() on a data frame would permute columns, not rows
nobs<-nrow(fsi)
n.test<-nobs %/% 10
test.df<-fsi[1:n.test,]
diagnosis.df <- subset(test.df,select=Class)
test.df<-subset(test.df, select=-Class)
train.df<-fsi[(n.test+1):nobs,]
cost.v<-c(0.01, 0.1, 1, 10, 100, 1000)
error.cv<-c()


error.ho<-c()
for (i in cost.v)
{m.cv<-svm(Class~.,
data=fsi,
type="C-classification",
kernel="linear",
cost=i,
cross=10
)
e<-100 - summary(m.cv)$tot.accuracy
error.cv<-c(error.cv,e)
m.ho<-svm(Class ~.,
data=train.df,
type="C-classification",
kernel="linear",
cost=i,
cross=0
)
p<-predict(m.ho, test.df)
correct<-sum(p == diagnosis.df[[1]])
e<-(nrow(test.df) - correct)/nrow(test.df)*100
error.ho<-c(error.ho,e)
}
y<-max(error.ho, error.cv)
plot.new()
plot.window(xlim=c(1, length(cost.v)), ylim=c(0,y))
box()
title(xlab="SVM model complexity",ylab="Error rate")
xticks<-seq(1, length(cost.v),1)
yticks<-seq(0,y,1)
xlabels<-seq(1, length(cost.v),1)
ylabels<-seq(0,y,1)
axis(1,at=xticks,labels=xlabels)
axis(2,at=yticks,labels=ylabels)
points(1:length(cost.v),error.cv,type="b",col="red")
points(1:length(cost.v),error.ho,type="b",col="blue")
x<-fsi[, 2:13]
y<-fsi[, 14]
model<-svm(x, y)


print(model)
summary(model)
pred<-predict(model, x)
table(pred, y)
tobj<-tune.svm(Class~.,data=fsi[1:100,], gamma=10^(-6:-3),
cost=10^(1:2))
summary(tobj)
plot(tobj, transform.x = log10, xlab=expression(log[10](gamma)),
ylab="C")

APPENDIX 2. R CODE USED TO IMPLEMENT SOM


library(kohonen)
fsi<-read.table("c:\\FSI09.txt", header=T)
Class<-fsi$Class  # class labels; assumed to be stored in a column named Class
kohmap<-xyf(scale(fsi), classvec2classmat(Class), grid = somgrid(6, 6,
"hexagonal"), rlen=100)
plot(kohmap, type="changes" )
plot(kohmap, type="counts", main="Failed states data: counts")
plot(kohmap, type="quality", main="Failed states data: quality")
xyfpredictions<-classmat2classvec(predict(kohmap)$unit.predictions)
bgcols<-c("gray", "pink", "lightgreen")
plot(kohmap, type="mapping", col=Class+1, pchs=Class,
bgcol=bgcols[as.integer(xyfpredictions)], main = "mapping plot")
training<-sample(nrow(fsi), 100)
Xtraining<-scale(fsi[training, ])
Xtest<-scale(fsi[-training, ], center=attr(Xtraining, "scaled:center"),
scale=attr(Xtraining, "scaled:scale"))
som.fsi<-som(Xtraining, grid = somgrid(6, 6, "hexagonal"))
som.prediction<-predict(som.fsi, newdata = Xtest, trainX = Xtraining,
trainY = factor(Class [training]))
table(Class[-training], som.prediction$prediction)

REFERENCES
Aiken, M. & Bsat, M. (1999). Forecasting market trends with neural networks,
Information Systems Management, 16: 42-49.


Aminian, F., Suarez, E., Aminian, M., & Walz, D. (2006). Forecasting
economic data with neural networks. Computational Economics, 28, 71-
88.
Anandarajan, M., Lee, P., & Anandarajan, A. (2001). Bankruptcy prediction of
financially stressed firms: an examination of the predictive accuracy of
artificial neural networks. International Journal of Intelligent Systems in
Accounting, Finance & Management, 10, 69-81.
Audrain-Pontevia, A. (2006). Kohonen self-organizing maps: A neural
approach for studying the links between attributes and overall satisfaction
in a services context. Journal of Consumer Satisfaction, Dissatisfaction
and Complaining Behavior, 19, 128-137.
Awad, M., & Motai, Y. (2008). Dynamic classification for video stream using
support vector machine. Applied Soft Computing, 8, 1314-1325.
Balasubramanian, M., Palanivel, S., & Ramalingam, V. (2009). Real time face
and mouth recognition using radial basis function neural networks. Expert
Systems with Applications, 36, 6879-6888.
Bensic, M., Sarlija, N., & Zekic-Susac, M. (2005). Modelling small-business
credit scoring by using logistic regression, neural networks and decision
trees. Intelligent Systems in Accounting, Finance and Management, 13,
133-150.
Berrueta, L., Alonso-Salces, R., & Heberger, K. (2007). Supervised pattern
recognition in food analysis. Journal of Chromatography A, 1158, 196-
214.
Bishop, C. (2006). Pattern recognition and machine learning. Springer,
New York.
Calderon, T., & Cheh, J. (2002). A roadmap for future neural networks
research in auditing and risk assessment. International Journal of
Accounting Information Systems, 3, 203-236.
Celikoglu, H., & Cigizoglu, H. (2007). Modeling public transport trips by
radial basis function neural networks. Mathematical and Computer
Modeling, 45, 480-489.
Chaplot, S., Patnaik, L., & Jagannathan, N. (2006). Classification of magnetic
resonance brain images using wavelets as input to support vector machines
and neural network. Biomedical Signal Processing and Control, 1, 86-92.
Chen, K., & Wang, C. (2007). A hybrid SARIMA and support vector
machines in forecasting the production values of the machinery industry in
Taiwan. Expert Systems with Applications, 32, 254-264.


Cheng, C., Chen, C. & Fu, C. (2006). Financial distress prediction by radial
basis function network with logit analysis learning. Computers and
Mathematics with Applications, 51, 579-588.
Child, D. (1990). The Essentials of Factor Analysis. Casell. London.
Churilov, L., & Flitman, A. (2006). Towards fair ranking of Olympics
achievements: The case of Sydney 2000. Computers & Operations
Research, 33, 2057-2082.
Coussement, K., & Van den Poel, D. (2008). Churn prediction in subscription
services: An application of support vector machines while comparing two
parameter-selection techniques. Expert Systems with Applications, 34,
313-327.
Darbellay, G. & Slama, M. (2000). Forecasting the short-term demand for
electricity: Do neural networks stand a better chance? International
Journal of Forecasting, 16: 71-83.
Dhanalakshmi, P., Palanivel, S., & Ramalingam, V. (2009). Classification of audio
signals using SVM and RBFNN. Expert Systems with Applications, 36,
6069-6075.
Dimitradou, E., Hornik, K., Leisch, F., Meyer, D., & Weingessel, A. (2005).
E1071: Misc. functions of the Department of Statistics (e1071), TU Wein,
Version 1.5-11. Available from http://cran.R-project.org.
Ding, C., & Patra, J. (2007). User modeling for personalized web search with
self-organizing map. Journal of the American Society for Information
Science and Technology, 58, 494-507.
Ferreira, C. (2001). Gene expression programming: a new adaptive algorithm
for solving problems. Complex Systems, 13, 87-129.
Ferreira, C. (2004). Gene expression programming and the evolution of
computer programs. In Leonardo de Castro and Fernando Von Zuben
(Eds.). Recent Developments in biologically inspired computing. Idea
Group Publishing, pp. 82-103.
Foreign Policy (2009). The failed states index, 80-83 (July/August).
Frota, R., Barreto, G., & Mota, J. (2007). Anomaly detection in mobile
communication network using the self-organizing map. Journal of
Intelligent and Fuzzy Systems, 18, 493-500.
Fund for Peace (FundForPeace.org).
Ghaziri, H. & Osman, I. (2006). Self-organizing feature maps for the vehicle
routing problem with backhauls. Journal of Scheduling, 9, 97-114.
Gorr, W., Nagin, D., & Szczypula, J. (1994). comparative study of artificial
neural network and statistical models for predicting student point
averages, International Journal of Forecasting, 10: 17-34.

EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
84 Mohamed M. Mostafa

Hair, J., Anderson, R., Tatham, R. & Black, W. (1998). Multivariate data
analysis with readings.
Hardy, Y., & Steeb, W. (2002). Gene expression programming and one-
dimensional chaotic maps. International Journal of Modern Physics C,
13-24.
Harvey, C., Travers, K., & Costa, M. (2000). Forecasting emerging market
returns using neural networks, Emerging Markets Quarterly, 4: 43-55.
Hecht-Nielson, R. (1989). Theory of the back-propagation neural network.
International Joint Conference on Neural Networks. Washington, DC,
593-605.
Hehir, A. (2007). The myth of the failed state and the war on terror: a
challenge to the conventional wisdom. Journal of Intervention and state
Building, 1, 307-332.
Helman, G., & Ratner, S. (1993). Saving failed states. Foreign Policy, 89, 3-
21.
Howard, T. (2008). Revisiting state failure: developing a causal model of state
failure based upon theoretical insight. Civil Wars, 10, 125-147.
Iyer, S. & Sharda, R. (2009). Prediction of athletes’ performance using neural
networks: an application in cricket team selection. Expert Systems with
Applications, 36, 5510-5522.
Jiang, E. (2007). Detecting spam email by radial basis function networks.
International Journal of Knowledge-based and Engineering Systems, 11,
409-418.
Kahyaoglu, T. (2008). Optimization of the pistachio nut roasting process using
response surface methodology and gene expression programming. LWT-
Food Science and Technology, 41, 26-33.
Kaiser, H. (1970). A second generation little jiffy. Psychometrika, 35, 401-
415.
Karatzoglou, A., Meyer, D., & Hornik, K. (2005). Support vector machines in
R. Available at http://statmath.wu-wien.ac.at.
Kecman, V. (2005). Support vector machines: An introduction. In L. Wang
(Ed.), Support vector machines: Theory and applications. Springer-Verlag,
Berlin, 1-48.
Kim, J., & Mueller, C. (1978). Introduction to Factor Analysis. Sage
Publications. Beverly Hills. CA.
Koetz, B., Morsdorf, F., Linden, S., Curt, T., & Allgower, B. (2008). Multi-
source land cover classification for forest management based on imaging
spectrometry and LIDAR data. Forest Ecology and Management, 256,
263-271.

EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
Modeling Nations’ Failure via Data Mining Techniques 85

Kohonen, T. (1982). Self-organized formation of topologically correct feature


maps. Biological Cybernetics, 43, 59-69.
Kohonen, T. (2001). Self-organizing maps. 3rd Ed., Springer, Berlin.
Kohzadi, N., Boyd, M., Kemlanshahi, B., & Kaastra, I. (1996). A Comparison
of artificial neural network and time series models for forecasting
commodity prices, Neurocomputing, 10: 169-181.
Kumar, K., & Bhattacharya, S. (2006). Artificial neural network vs. linear
discriminant analysis in credit ratings forecast. Review of Accounting and
Finance, 5, 216-227.
Kuo, R., Ho, L., & Hu, C. (2002). Integration of self-organizing feature map
and K-means algorithm for market segmentation. Computers &
Operations Research, 29, 1475-1493.
Lambach, D. (2004). The perils of weakness: failed states and perceptions of
threat in Europe and Australia. Paper presented at the New Security
Agendas Conference. Kings College, London, July 1-3.
Lapedes, A. & Farber, R. (1988). How neural nets work? In Anderson, D.
(Ed). Neural information processing systems, American Institute of
Physics, New York: 442-456.
Lek, S., & Guegan, J. (1999).Artificial neural networks as a tool in ecological
modelling: an introduction. Ecological Modeling, 120, 65-73.
Lemarchand, R. (2003). The Democratic Republic of Congo: from failure to
potential reconstruction. In Robert. I. Rotberg (ed). State failure and state
weakness in a time of terror. World Peace Foundation, Washington, DC,
29-70.
Li, X., Zhou, J., Yuan, S., Zhou, X., & Fu, Q. (2008). Using support vector
machines to predict eco-environmental burden. Biomedical and
Environmental Sciences, 21, 45-52.
Lim, C., & Kirikoshi, T. (2005). Predicting the effects of physician-directed
promotion on prescription yield and sales uptake using neural networks.
Journal of Targeting, Measurement and Analysis for Marketing, 13, 158-
167.
Liu, H., Wen, Y., & Gao, Y. (2009). Application of experimental design and
radial basis function neural network to the separation and determination of
active components in traditional Chinese medicines by capillary
electrophoresis. Analytica Chimica Acta, 638, 88-93.
Lo, Z., & Bavarian, B. (1993). Analysis of convergence properties of topology
preserving neural networks. IEEE Transactions on Neural Networks, 11,
207-220.

EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
86 Mohamed M. Mostafa

Luan, F., Liu, T., Wen, Y., & Zhang, X. (2008). Classification of the fragrance
properties of chemical compounds based on support vector machine and
linear discriminant analysis. Flavor and Fragrance Journal, 23, 232-238.
Mahdi, A. (2006). Perceptual non-intrusive speech quality assessment using a
self-organizing map. Journal of Enterprise Information Management, 19,
148-164.
McMenamin, J. & Monforte, F. (1998). Short term energy forecasting with
neural networks, Energy Journal, 19: 43-52.
Melssen, W., Wehrens, R., & Buydens, L. (2006). Supervised Kohonen
networks for classification problems. Chemometrics and Intelligent
Laboratory Systems, 83, 99-113.
Mirmirani, S., & Li, H. (2004). Gold price, neural networks and genetic
algorithm. Computational Economics, 23, 193-200.
Moreno, D., Marco, P., and Olmeda, I. (2006). Self-organizing maps could
improve the classification of Spanish mutual funds. European Journal of
Operational Research, 147, 1039-1054.
Mostafa, M. (2009). Shades of green: a psychographic segmentation of the
green consumer in Kuwait using self-organizing maps”, Expert Systems
with Applications, 36, 11030-11038.
Nam, K. & Yi, J. (1997). Predicting airline passenger volume, Journal of
Business Forecasting Methods & Systems, 16: 14-17.
Piazza, J. (2008). Incubators of terror: do failed and failing states promote
transnational terrorism? International Studies Quarterly, 52, 469-488.
Poh, H. Yao, J. & Jasic, T. (1998). Neural networks for the analysis and
forecasting of advertising impact, International Journal of Intelligent
Systems in Accounting, Finance & management, 7: 253-268.
Prentice Hall (Englewood Cliffs, NJ).
Reno, W. (2003). Sierra Leone: warfare in a post-state soociety. In Robert. I.
Rotberg (ed). State failure and state weakness in a time of terror. World
Peace Foundation, Washington, DC, 71-100.
Ruiz-Suarez, J., Mayora -Ibarra, O., Torres -Jimenez, J., & Ruiz-Suarez, L.
(1995). Short term ozone forecasting by artificial neural network,
Advances in Engineering Software, 23: 143-149.
Sakai, S., Kobayashi, K., Toyabe, S., Mandai, N., Kanda, T., & Akazawa, K.
(2007). Comparison of the levels of accuracy of an artificial neural
network model and a logistic regression model for the diagnosis of acute
appendicitis. Journal of Medical Systems. 31, 357-364.

EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
Modeling Nations’ Failure via Data Mining Techniques 87

Shan, Y., Zhao, R., Xu, G., Liebich., & Zhang, Y. (2002). Application of
probabilistic neural network in the clinical diagnosis of cancers based on
clinical chemistry data. Analytica Chimica Acta, 471, 77-86.
Siew, E., Smith, K., Churilov, L., & Ibrahim, M. (2002). A neural clustering
approach for Iso-Resource groupings for acute healthcare in Australia.
Proceedings of the 35th Annual Hawaii International Conference on
Systems Science (HICS35). IEEE Computer Society, Hawaii, USA.
Silven, O., Niskanen, M., & Kauppinen, H. (2003). Wood inspection with non
supervised clustering. Machine Vision and Applications, 13, 275-285.
Silver, H., & Shmoish, M. (2008). Analysis of cognitive performance in
schizophrenia patients and healthy individuals with unsupervised
clustering models. Psychiatry Research, 159, 167-179.
SPSS Neural Networks Version 17.0, SPSS Corporation (2007).
Stavrou, E., Spiliotis, S., & Charalambpuw, C. (2010). Flexible working
arrangements in context : an empirical investigation through self-
organising maps. European Journal of Operational Research, 202, 893-
902.
Swicegood, P., & Clark, J. (2001). Off-site monitoring systems for prediction
bank underperformance: a comparison of neural networks, discriminant
analysis, and professional human judgment. International Journal of
Intelligent Systems in Accounting, Finance & Management, 10, 169-186.
Teodorescu, L. & Sherwood, D. (2008). High energy physics event selection
with gene expression programming. Computer Physics Communications,
178, 409-419.
Thneberg, H., & Hotulainen, R. (2006). Contributions of data mining for
psycho-educational research: What self-organizing maps tell us about the
well-being of gifted learners. High Ability Studies, 17, 87-100.
Van de Walle, N. (2004). The economic correlates of state failure: taxes,
foreign aid and policies. In Robert I. Rotberg (ed). When states fail:
causes and consequences. Princeton University Press, 94-115.
Vapnik, V. (1995). The nature of statistical learning theory. Springer, Berlin.
Vesanto, J. & Alhoniemi, E. (2000). Clustering of the self-organizing map.
IEEE Transactions on Neural Networks, 11, 586-600.
Vesanto, J. (1999). SOM-based data visualisation methods. Intelligent Data
Analysis, 3, 111-126.
Videnova, I., Nedialkova, D., Dimitrova, M., & Popova, S. (2006). Neural
networks for air pollution forecasting. Applied Artificial Intelligence, 20,
493-506.

EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
88 Mohamed M. Mostafa

Vijayakumar, C., Damayanti, G., Pant, R. & Sreedhar, C. (2007).


Segmentation and grading of brain tumors on apparent diffusion
coefficient images using self-organizing maps. Computerized Medical
Imaging and Graphics, 31, 473-484.
Viscovery Software GmbH (2008), SOMine user’s manual version 5.0,
Vienna, Austria.
Wang, S. (1995). The unpredictability of standard back propagation neural
networks in classification applications. Management Science, 41, 555-559.
Wehrens, R., & Buydens, L. (2007). Self- and super-organizing maps in R: the
Kohonen package. Journal of Statistical Software, 21 (5).
Worner, S., & Gevrey, M. (2006). Modeling global insect pest species
assemblages to determine risk of invasion. Journal of Applied Ecology,
43, 858-867.
Xu, K., Liu, Y., Tang, R., Zuo, J., & Tang, C. (2009). A novel method for real
parameter optimization based on gene expression programming. Applied
Soft Computing, 9, 725-737.
Yan, A. (2006). Application of self-organizing maps in compounds pattern
recognition and combinatorial library design. Combinatorial Chemistry &
High Throughput Screening, 9, 473-480.
Yu, B., & Xu, Z. (2008). A comparative study for content-based dynamic
spam classification using four machine learning algorithms. Knowledge-
Based Systems, 21, 355-362.
Zartman, W. (1995). Collapsed States: The disintegration and restoration of
legitimate authority. Lynne Rienner, Boulder, CO.
Zhong, S., Khoshgoftaar, M., & Seliya, N. (2007). Clustering-based network
intrusion detection. International Journal of Reliability, Quality and
Safety Engineering, 14, 169-187.
Zhuang, Z., Churilov, L., Burstein, F., & Sikaris, K. (2009). Combining data
mining and case-based reasoning for intelligent decision support for
pathology ordering by general practitioners. European Journal of
Operational Research, 195, 662-675.
Zhuo, W., Li-Min, J., Yong, Q., & Yan-hui, W. (2007). Railway passenger
traffic volume prediction based on neural network. Applied Artificial
Intelligence, 21, 1-10.

EBSCOhost - printed on 1/21/2023 10:59 AM via UNIVERSITY OF THE CUMBERLANDS. All use subject to https://www.ebsco.com/terms-of-use
In: Data Mining ISBN: 978-1-63463-738-1
Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.

Chapter 4

AN EVOLUTIONARY SELF-ADAPTIVE ALGORITHM FOR MINING ASSOCIATION RULES
José María Luna¹,∗, Alberto Cano¹,† and Sebastián Ventura¹,²,‡
¹Dept. of Computer Science and Numerical Analysis, University of Cordoba
²Dept. of Computer Science, King Abdulaziz University, Kingdom of Saudi Arabia

Abstract
This paper presents a novel self-adaptive grammar-guided genetic
programming proposal for mining association rules. It generates
individuals through a context-free grammar, which allows rules to be
defined in an expressive and flexible way over different domains. Each
rule is represented as a derivation tree that encodes a solution
expressed in the language defined by the grammar. Unlike existing
evolutionary algorithms for mining association rules, the proposed
algorithm only requires a small number of parameters, making it easy
for non-expert users to discover association rules. More specifically,
this algorithm does not require any threshold, and uses a novel parent
selector based on a niche-crowding model to group rules. This approach keeps the

∗E-mail address: [email protected]
†E-mail address: [email protected]
‡E-mail address: [email protected] (Corresponding author)


best individuals in a pool and restricts the extraction of similar rules by
analysing the instances covered.
We compare our approach with the G3PARM algorithm, the first
grammar-guided genetic programming algorithm for the extraction of as-
sociation rules. G3PARM was described as a high-performance algo-
rithm, obtaining important results and overcoming the drawbacks of cur-
rent exhaustive search and evolutionary algorithms. Experimental results
show that our new proposal obtains very interesting and reliable rules with
higher support values.

Keywords: Association rules; grammar-guided genetic programming; evolutionary computation; self-adaptive operators

1. Introduction
Association rule mining (ARM), an important area of data mining, has received
enormous attention since its introduction by Agrawal et al. [1, 2] in the early
90s. ARM searches for strong relationships among items that are hidden in
datasets. An association rule (AR) is defined as an implication of the form
Antecedent → Consequent, both Antecedent and Consequent being sets
with no items in common. The meaning of an AR is that if the antecedent is
satisfied, then it is highly probable that the consequent will also be satisfied.
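For concreteness, an AR and its satisfaction check can be sketched in a few lines of Python; the item names and transactions are invented for illustration and are not taken from the chapter:

```python
# An association rule A -> C: two item sets with no items in common
# (hypothetical grocery items, for illustration only).
antecedent = frozenset({"bread", "butter"})
consequent = frozenset({"milk"})
assert antecedent.isdisjoint(consequent)  # sets must be disjoint

def rule_satisfied(transaction, antecedent, consequent):
    """Return None if the rule does not apply to the transaction,
    otherwise whether the consequent also holds."""
    items = set(transaction)
    if not antecedent <= items:      # antecedent not contained: rule silent
        return None
    return consequent <= items       # antecedent held: did the consequent?

print(rule_satisfied({"bread", "butter", "milk"}, antecedent, consequent))  # True
print(rule_satisfied({"bread", "milk"}, antecedent, consequent))            # None
```

A good rule is one for which, over many transactions, the first case dominates whenever the antecedent holds; Section 2.1 quantifies this intuition.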
Most existing proposals for the extraction of ARs follow an exhaustive
search methodology based on a support–confidence framework [2, 6, 35], where
the support calculates the proportion of transactions covered by the rule, and the
confidence specifies how reliable the rule is. In these proposals, frequent pat-
terns are mined first, and are next used to extract reliable ARs. The mining
process is hindered by these two steps, which require large amounts of
computational time and memory. Notice that with the growing
interest in the storage of information, real-world datasets containing numerical
values have turned out to be essential to the research community. Since numer-
ical domains typically contain many distinct values, the search space in these
domains is larger than the search space in categorical domains. In such situations,
exhaustive search algorithms in ARM cannot directly deal with numerical
domains, as the search becomes unmanageable.
Evolutionary algorithms (EA) have been widely used in data mining tasks,
where the process of searching for solutions requires optimization. Many
researchers have focused on the ARM problem from an evolutionary
perspective [22, 24], tackling the computational and memory requirements of the
exhaustive search approaches. Most of these existing evolutionary approaches are
based on genetic algorithms (GA) [4, 33], whose fixed-length chromosomes
are not very flexible. A related EA technique is genetic programming
(GP) [11, 18], the individual representation being its main peculiarity. In GP,
solutions are represented with variable-length hierarchical structures, usually in
a tree form, where the size, shape and structural complexity are not constrained
a priori [13].
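The contrast between the two representations can be illustrated with a minimal sketch; the encodings below are invented for the example and do not reproduce any particular algorithm:

```python
# A GA individual: a flat, fixed-length encoding
# (e.g. hypothetical presence/absence flags for five items).
ga_chromosome = [1, 0, 1, 1, 0]

# A GP individual: a variable-length hierarchical structure (a tree)
# whose size, shape and complexity are not fixed a priori.
gp_tree = ("AND",
           ("=", "age", "young"),
           ("OR",
            (">", "income", 50_000),
            ("=", "student", "yes")))

def tree_size(node):
    """Count the nodes of a nested-tuple tree: each tuple is one internal
    node (its first element is the operator label) plus its children."""
    if not isinstance(node, tuple):
        return 1                      # a leaf symbol
    return 1 + sum(tree_size(child) for child in node[1:])

print(len(ga_chromosome))  # always 5, regardless of the solution
print(tree_size(gp_tree))  # 11 nodes here; varies per individual
```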
Recently, an ARM algorithm was proposed [21] which was based on G3P
(Grammar Guided Genetic Programming) [25], an extension of GP. It was pre-
sented as the first G3P proposal for mining ARs, providing great results and
overcoming those drawbacks of current ARM algorithms in terms of its exe-
cution time, the mining of numerical domains, and solution complexity. This
algorithm, called G3PARM (Grammar Guided Genetic Programming Associa-
tion Rule Mining), makes use of a grammar to constrain the GP process, pro-
moting the creation of individuals with valid syntax forms. Each solution given
by G3PARM satisfies the language defined by the grammar, which allowed
solutions to be obtained over different domains. G3PARM was originally described
as a fully configurable algorithm, where a number of input parameters were
required, e.g., support and confidence thresholds, crossover and mutation
probabilities, the number of generations, the population size, etc. This kind of
algorithm is well suited to many situations, especially when used by qualified
data miners. Nevertheless, the possibility of mining ARs using a
high-performance algorithm without the need to specify a large number of
parameters is a requirement for non-expert users.
In this paper, the G3PARM+ algorithm, an interesting self-adaptive G3P al-
gorithm, is proposed for mining ARs. This proposal also uses a context-free
grammar (CFG) to define syntax constraints and extract rules in both numerical
and categorical domains. During its evolutionary process, the proposal discov-
ers ARs by applying two genetic operators over those rules selected as parents
by means of a niche-crowding [9, 10] model. Both genetic operators adjust their
probabilities along the mining process. The proposed algorithm allows ARs to be
obtained without any prior knowledge of the dataset, and does not require
support or confidence thresholds. Moreover, the algorithm does not rely purely
on a support–confidence framework, bringing the lift measure into play in order
to extract frequent, reliable, and also interesting rules. The resulting set of rules
comprises dissimilar rules, covering a high percentage of the dataset instances,


which is measured by the coverage. It is noteworthy that for the sake of avoid-
ing mismatched rules, the instances covered by each rule are analysed. The
results confirm the good behaviour and efficiency of this proposal compared to
G3PARM, which was previously contrasted with other exhaustive and genetic
algorithms in [21]. More specifically, the empirical comparison demonstrates that
the new proposal obtains interesting rules with high support, high confidence,
and high coverage of dataset instances.
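As a rough illustration of how a CFG can constrain the syntax of candidate rules, the toy grammar and derivation routine below expand nonterminals recursively, so every generated string is a syntactically valid rule. The grammar, attribute names and values are all invented for this sketch and do not reproduce the actual grammar used by G3PARM+:

```python
import random

# Toy CFG: keys are nonterminals; each maps to a list of alternative
# productions; any symbol not in the dict is a terminal.
GRAMMAR = {
    "RULE":       [["ANTECEDENT", "->", "CONSEQUENT"]],
    "ANTECEDENT": [["CONDITION"], ["CONDITION", "AND", "CONDITION"]],
    "CONSEQUENT": [["CONDITION"]],
    "CONDITION":  [["ATTRIBUTE", "OP", "VALUE"]],
    "ATTRIBUTE":  [["age"], ["income"]],
    "OP":         [["="], [">"], ["<"]],
    "VALUE":      [["low"], ["high"]],
}

def derive(symbol, rng):
    """Expand a symbol into terminal tokens by randomly choosing one
    production per nonterminal (a flattened derivation tree)."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    tokens = []
    for s in production:
        tokens.extend(derive(s, rng))
    return tokens

rule = " ".join(derive("RULE", random.Random(0)))
print(rule)  # e.g. "age = low AND income > high -> age < low"
```

Because individuals are built only by expanding the grammar, crossover and mutation can be defined over subtrees without ever producing a syntactically invalid rule.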
This paper is structured as follows: the main drawbacks of the support-
confidence framework are discussed and the most relevant related work is pre-
sented in Section 2; Section 3 describes the model proposed as well as its main
characteristics; Section 4 describes the experiments, including the datasets used,
the algorithm set-up, and discusses the results obtained; finally, in Section 5,
some concluding remarks are outlined.

2. Preliminaries
The support–confidence framework is the most often used combination of mea-
sures in the ARM field. Actually, most existing proposals use the support
measure for mining frequent patterns, whereas the confidence measure is used
for discovering reliable rules. As discussed next, new measures are required
for the extraction of interesting ARs, since the support–confidence
framework alone is insufficient. Finally, the most relevant ARM proposals are
described. The first proposals were designed following an exhaustive search
methodology. Currently, the use of exhaustive search methods is not a good
option and evolutionary ARM proposals are the most studied.

2.1. The Support–confidence Framework


Different measures have been proposed to evaluate the quality of the rules ex-
tracted in the ARM process [3, 12, 23, 27]. Support and confidence [19] are two
measures which are well-known to the research community. The former (sup)
is defined in Equation 1 as the proportion of the number of transactions T in-
cluding the antecedent A and the consequent C in a dataset record collection D.
The latter (conf) is defined in Equation 2 as the proportion of the number of
transactions which include A and C among all the transactions that include A.
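Both measures, together with the lift measure used later by the proposal, translate directly into code. The functions and toy transaction collection below are an illustrative sketch (not code from the chapter), and assume the antecedent occurs at least once:

```python
def support(itemset, transactions):
    """Proportion of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """sup(A ∪ C) / sup(A): how reliable the rule A -> C is."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """conf(A -> C) / sup(C): values above 1 mean A and C occur
    together more often than independence would predict."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Invented dataset D of four transactions.
D = [frozenset(t) for t in (
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
)]
A, C = frozenset({"bread", "butter"}), frozenset({"milk"})
print(support(A | C, D))    # 0.25: one of four transactions holds A and C
print(confidence(A, C, D))  # 0.5: A appears twice, A with C once
```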
