
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,

SECOND CYCLE, 30 CREDITS


STOCKHOLM, SWEDEN 2020

A graph representation of event intervals for efficient clustering and classification

ZED HEEJE LEE

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

A graph representation of event intervals for efficient clustering and classification

ZED LEE

Master’s Programme, ICT Innovation, 120 credits


Date: August 25, 2020
Supervisor: Šarūnas Girdzijauskas
Examiner: Henrik Boström
School of Electrical Engineering and Computer Science
Host company: Institutionen för data- och systemvetenskap (DSV),
Stockholms universitet
Swedish title: En grafrepresentation av händelsesintervall för
effektiv klustering och klassificering
A graph representation of event intervals for efficient clustering and classification / En grafrepresentation av händelsesintervall för effektiv klustering och klassificering

© 2020
Zed Lee

Abstract
Sequences of event intervals occur in several application domains, while their
inherent complexity hinders scalable solutions to tasks such as clustering
and classification. In this thesis, we propose a novel spectral embedding
representation of event interval sequences that relies on bipartite graphs. More
concretely, each event interval sequence is represented by a bipartite graph by
following three main steps: (1) creating a hash table that can quickly convert a
collection of event interval sequences into a bipartite graph representation, (2)
creating and regularizing a bi-adjacency matrix corresponding to the bipartite
graph, (3) defining a spectral embedding mapping on the bi-adjacency matrix.
In addition, we show that substantial improvements can be achieved with
regard to classification performance through pruning parameters that capture
the nature of the relations formed by the event intervals. We demonstrate
through extensive experimental evaluation on five real-world datasets that
our approach can obtain runtime speedups of up to two orders of magnitude
compared to other state-of-the-art methods and similar or better clustering and
classification performance.

Keywords
event intervals, bipartite graph, spectral embedding, clustering, classification.

Sammanfattning
Sekvenser av händelsesintervall förekommer i flera applikationsdomäner,
medan deras inneboende komplexitet hindrar skalbara lösningar på uppgifter
som kluster och klassificering. I den här avhandlingen föreslår vi en ny spektral
inbäddningsrepresentation av händelsens intervallsekvenser som förlitar sig
på bipartitgrafer. Mer konkret representeras varje händelsesintervalsekvens av
en bipartitgraf genom att följa tre huvudsteg: (1) skapa en hashtabell som
snabbt kan konvertera en samling händelsintervalsekvenser till en bipartig
grafrepresentation, (2) skapa och reglera en bi-adjacency-matris som motsvarar
bipartitgrafen, (3) definiera en spektral inbäddning på bi-adjacensmatrisen.
Dessutom visar vi att väsentliga förbättringar kan uppnås med avseende
på klassificeringsprestanda genom beskärningsparametrar som fångar arten
av relationerna som bildas av händelsesintervallen. Vi demonstrerar genom
omfattande experimentell utvärdering på fem verkliga datasätt att vår strategi
kan uppnå körtidsförbättringar på upp till två storleksordningar jämfört med andra
modernaste metoder och liknande eller bättre kluster- och klassificerings-
prestanda.

Nyckelord
händelsesintervall, bipartitgraf, spektral inbäddning, klustering, klassificering.

Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Goals
  1.5 Benefits, ethics and sustainability
  1.6 Research methodology
  1.7 Delimitations
  1.8 Structure of the thesis

2 Extended background
  2.1 Graph theory
    2.1.1 Graphs
    2.1.2 Graph matrices
    2.1.3 Spectral clustering
    2.1.4 PageRank
  2.2 Event sequences
    2.2.1 Elements
    2.2.2 Properties
  2.3 Clustering methods
    2.3.1 K-means algorithm
    2.3.2 K-means++ algorithm
    2.3.3 K-medoids algorithm
  2.4 Classification methods
    2.4.1 K-nearest neighbors
    2.4.2 Random forest
    2.4.3 Support vector machines
  2.5 Performance metrics
    2.5.1 Clustering purity
    2.5.2 Accuracy
  2.6 Summary

3 Research methods
  3.1 Choice of research methods
  3.2 Research process

4 Suggested algorithm
  4.1 Research paradigm
  4.2 Model design
    4.2.1 Construction of hash table
    4.2.2 Conversion to a bipartite graph
    4.2.3 Application of graph-based algorithms

5 Empirical experiments
  5.1 Data collection
  5.2 Data properties
  5.3 Experimental design
    5.3.1 Experiment 1: clustering
    5.3.2 Experiment 2: classification
    5.3.3 Runtime efficiency
    5.3.4 Software tools
  5.4 Experiment 1: clustering
  5.5 Experiment 2: classification

6 Conclusions and future work
  6.1 Conclusions
  6.2 Limitations
  6.3 Future work
    6.3.1 Multipartite graph
    6.3.2 Alternative graph-based algorithm

References

A Grid search results



List of Figures

1.1 Example of a sequence of eight event intervals. The total time duration of the sequence is 22 time points.

2.1 Example of a weighted graph (left) and a bipartite graph (right).
2.2 Example of an event sequence database.
2.3 Seven temporal pairwise relations created by the relative positions of two event intervals.
2.4 Example of the k-Nearest Neighbors (NN) algorithm when k = 1 and k = 3.
2.5 Example of the process of learning Random Forest (RF) consisting of three decision trees using bagging.
2.6 Example of the clustering purity with three clusters and three classes.

4.1 Example of the suggested method.
4.2 An instantiation of the hash table in the suggested algorithm.

5.1 An experimental process of the clustering and classification task using our suggested method.

List of Tables

5.1 A summary of the properties of the real-world datasets: sequences and intervals.
5.2 A summary of the properties of the real-world datasets: temporal relations.
5.3 Clustering results for all competitors in terms of clustering purity (%) and runtime (seconds).
5.4 Clustering results for our algorithm in terms of clustering purity (%) and runtime (seconds).
5.5 Classification results for all competitors in terms of classification accuracy (%) and runtime (seconds).
5.6 Classification results for our algorithm in terms of classification accuracy (%) and runtime (seconds) with 1-NN and RF. Parameters are selected by grid search based on the 1-NN classifier.
5.7 Classification results for our algorithm in terms of classification accuracy (%) and runtime (seconds) with Support Vector Machines (SVM) and personalized PageRank.

A.1 The results of grid search for three constraints on the suggested algorithm for classification. The constraint values were increased by 0.1 within the range of 0.1 to 1.0 for each parameter. In the case of a tie, the values are sorted in ascending order by minSup, maxSup, and gap (Ranks 1-5).
A.2 The results of grid search for three constraints on the suggested algorithm for classification. The constraint values were increased by 0.1 within the range of 0.1 to 1.0 for each parameter. In the case of a tie, the values are sorted in ascending order by minSup, maxSup, and gap (Ranks 6-10).

List of Algorithms

1 K-means
2 K-means++
3 K-medoids
4 Pseudocode of the suggested method
5 spectralEmbedding
6 personalizedPageRankClassification

List of acronyms and abbreviations


CRISP-DM CRoss-Industry Standard Process for Data Mining

IBSM Interval-Based Sequential Matching

NN Nearest Neighbors

RBF Radial Basis Function

RF Random Forest

STIFE Sequences of Temporal Intervals Feature Extraction

SVD Singular Value Decomposition

SVM Support Vector Machines

TN True Negative

TP True Positive

Chapter 1

Introduction

1.1 Background
Sequential pattern mining has been applied in various application areas.
The principal objective of sequential pattern mining is to extract frequent
patterns from sequences of events [1, 2] or to create clusters from data records
distinguished by similar patterns [3]. However, these methods rest on the strong assumption that events occur instantaneously, without duration. This assumption is their main weakness, because they cannot capture the temporal relations between events that last over time. Extensive research has therefore been conducted in which the concept of an event is extended to that of an event interval. The advantage of this representation is that it can model the duration of events. Therefore, we can analyze the relationships between different event intervals according to their relative time positions, providing additional information about the nature of the underlying events.
Analysis of event intervals is widely accepted in various areas including sign language [4], music informatics [5], cognitive science [6], or linguistics [7].

Figure 1.1: Example of a sequence of eight event intervals. The total time duration of the sequence is 22 time points.
Example. Fig. 1.1 shows a sequence composed of eight event intervals defined over three labels. Each label is represented by a letter, such as A, B, and C. Each event interval is characterized by a particular label, as well as its start and end times. The same event label can arise multiple times, and various types of temporal relations can occur between each combination of event labels.
A set of chronologically arranged event intervals can create an event interval
sequence. Relative relations between many events in the sequence can lead
to various forms of temporal configurations. A typical example would be
temporal relations between the events using Allen’s temporal logic [8]. One
challenging problem is finding common elements in such complex sequences
and clustering them, or classifying new sequences into sets of similar elements. How quickly and accurately such clustering and classification can be performed is an important concern.

1.2 Problem
Early studies on classification and clustering of sequences of event intervals have concentrated on establishing pairwise distance measures between the sequences. One such distance function is Artemis [9, 10], which
measures the pairwise distance by computing the ratio of temporal relations
of specific pairs that both sequences have in common. Artemis ignores the absolute timespan of each sequence; by comparing only relative positions, it is insensitive to the overall time range of the sequences. However, this method requires substantial computation time for checking all pairwise relations, taking cubic time in the worst case. Nonetheless, excellent predictive performance is achieved when Artemis is used together with the k-NN classification algorithm.
Interval-Based Sequential Matching (IBSM) [11], another distance metric,
computes the pairwise distance of sequences by converting each sequence into
a 0-1 matrix to observe the active event labels at each time point. This method
does not explicitly recognize temporal relations (e.g., Allen’s) from the event
pairs. In the matrix, its columns have each time point, and its rows have event
labels. The matrix has its cells set to 1 for active labels at a specific time
point, while all inactive event labels are set to 0. This measure differs from
Artemis in that each sequence is represented by an explicit data point (a matrix). The speed of IBSM mainly depends on the maximum time duration of the sequences in the database. Furthermore, since the absolute time spans of the sequences differ, IBSM needs additional processing time to interpolate the shorter sequences to match the longest one. As a result, the computation can be substantially slower when the database includes sequence pairs with hugely disproportional numbers of event labels and time durations, and even worse when the longest sequence is far longer than the others. Artemis, by contrast, only recognizes temporal relations and is thus affected mostly by the number of different pairwise relations in the database. Therefore, there is no clear winner between these two algorithms; depending on the nature of the data, one is slower than the other.
Recently, Sequences of Temporal Intervals Feature Extraction (STIFE)
[12] has been introduced for the classification problem of event intervals
by utilizing a mixture of static and temporal features extracted from the
sequences. The temporal features include pairs of event intervals that achieve high class separation, selected by information gain. Nevertheless,
STIFE’s feature extraction can be even slower than IBSM and Artemis. Also,
as it returns non-numeric feature vectors, we cannot directly apply a distance-
based algorithm, such as clustering, to the features.
Lastly, [13] presents the relationship between graphs and sequences of
event intervals. It shows a method to transform dynamic graphs to event
intervals. In this thesis, conversely, we show that a bipartite graph representation can be used to define a feature space at substantially lower computational cost by using spectral embeddings, a common embedding technique in graph clustering for capturing community structure [14, 15, 16]. For bipartite graphs, bi-spectral clustering has been introduced to speed up the process by using the bi-adjacency matrix, which excludes from the adjacency matrix the space for edges between vertices in the same set [17]. Variants of spectral clustering have been introduced for the stochastic block model [18]. Recently, the technique of regularizing the adjacency matrix has been studied and shown to work well [19, 20], with explanations in terms of eigenvector perturbation [21], conductance [22], and
sensitivity to outliers [23]. This embedding space of an affinity matrix can
also be used as a feature space for classification, showing better performance
than former pairwise distance measures [24].
A major problem with the earlier algorithms we discussed is that they do not run fast enough to be practically applicable to the clustering or classification of event interval sequences. For example, the latest clustering algorithms have cubic runtime in the worst case, depending on the length of the event interval sequences or the number of event intervals. Furthermore, clustering purity or classification accuracy can be limited because previous algorithms return only distances without explicit data points, or only non-numerical features, which makes it impossible to apply many existing machine learning-based clustering or classification algorithms to their results. Thus, this thesis aims to suggest a suitable graph representation of event interval sequences that achieves faster runtime and similar or better clustering/classification performance than state-of-the-art algorithms, and that makes various graph-based clustering and classification algorithms applicable.

1.3 Purpose
To the best of our knowledge, this thesis is the first attempt to solve the problem of clustering and classifying event sequence datasets by combining them with graph theory. Previous work dealing with event sequence datasets for classification and clustering raises a problem of practical applicability: finishing the algorithm in reasonable time while achieving high performance measures such as clustering purity or classification accuracy. Graph theory can be expected to improve this applicability, since event sequence datasets naturally form bipartite graphs without further processing if we regard the event sequences and their temporal relations as nodes, analogous to documents and terms in a text corpus or to customers and purchased items in market baskets [25]. We can then exploit bipartite graph properties that help make our solutions more scalable and more accurate than event sequence-specific ones. To address these problems, we pose the following research question, considering the bipartite graph as our suggested representation in this thesis:

To what extent can the efficiency and effectiveness of classifying and clustering event sequence datasets be improved by introducing the bipartite graph representation, compared to state-of-the-art methods that directly use event intervals?

The research question above covers both the efficiency and the effectiveness of the newly proposed model. Efficiency is interpreted as the total running time needed to run the model and obtain the results. Effectiveness is measured by performance indicators such as clustering purity and classification accuracy.
The main purpose of this thesis is ultimately to combine graph theory with event interval sequences, which are a special form of data.
Therefore, this thesis makes it possible to apply various actively researched graph-based algorithms to event interval sequences. This will be of great benefit to researchers in both fields, because they can expand and combine their research into larger areas.

1.4 Goals
The main goals to be achieved in answering the research question of this thesis are summarized as follows:

1. We propose a robust framework to convert event interval-based datasets into more suitable structures to increase efficiency and effectiveness.

2. We provide parameters specific to event interval sequence datasets that can contribute to speed or accuracy improvements, and we also propose how to deal with them.

3. We analyze and evaluate the performance of the suggested framework for classification and clustering problems and show that it finishes faster than state-of-the-art algorithms while preserving clustering and classification performance.

1.5 Benefits, ethics and sustainability


First, this thesis brings the benefits of improved efficiency and effectiveness to researchers who want to classify or cluster existing event sequence datasets.
Our novel method results in a new feature space constructed with considerably
shorter computation time compared to state-of-the-art methods for event
interval sequence clustering and classification. We show that our method
can achieve higher clustering purity values than earlier methods for clustering
event interval sequences on five real-world datasets.
Second, this thesis offers benefits regarding the ethical issues of general multivariate time series analysis. Existing time series classification techniques use each data point as it is for model training. However, using the raw data directly in machine learning models raises data privacy issues, which are seen as a potential drawback of machine learning models for commercial or moral reasons [26]. If, instead of using all the data points of the time series, we extract the temporal relations from the data and form a model using the method in this thesis, we can mitigate the privacy problem without sharing the actual data. Therefore, the method proposed in this thesis can be an alternative to such conventional time series analysis techniques. Temporal abstraction to discretize time series data is already widely known [27], and after applying it, the techniques in this thesis allow time series analysis to proceed in a privacy-preserving form. Moreover, when converting to interval data, the labels can be renamed arbitrarily, for example using random numeric labels such as '23' and '445' rather than meaningful labels such as a patient's 'high temperature' or 'low temperature'. This helps maintain an even higher level of data privacy.
Finally, in clustering and classification, the causes of the results can be more easily interpreted. For the state-of-the-art algorithms used for comparison in this thesis, it is not feasible to interpret the results of classification or clustering due to the characteristics of the algorithms. However, the bipartite graph structure used in this thesis is easy to interpret on its own. For a specific event sequence classified into a specific class, the cause can be inferred through the edges connected to features or temporal relations. This can mitigate the recently highlighted dangers of machine learning as a black box. Machine learning models whose decisions are difficult to trace are not sustainable, because errors are hard to identify when they occur, and such models therefore cannot be relied upon in critical cases [28]. The model proposed in this thesis makes it easy to identify the cause and will ultimately help create sustainable models.

1.6 Research methodology


There has been no previous attempt to solve the clustering and classification problem of event sequence datasets by combining it with graph theory. Our main research methodology is to demonstrate, through various performance metrics, that our graph-based structure is superior in terms of time and accuracy to the existing algorithms. Thus, we use performance measures commonly accepted in the data science community and several empirical datasets to verify this. We apply our method to five real-world event sequence datasets and obtain the performance measures needed to validate it. We then compare the results in terms of speed, which is the main aspect we aim to improve, and two performance metrics: purity (for clustering) and accuracy (for classification). A detailed description of our algorithm itself can be found in Chapter 4.

1.7 Delimitations
Here we describe aspects of the research question that we will not study in this thesis. The delimitations defined here mainly restrict the formal properties of the event sequence datasets.
• Some approaches add flexibility by allowing uncertainty in the time points of the intervals, in order to achieve more accurate performance. However, interval uncertainty is not included in our scope of research; we assume that the start and end times of intervals do not change.
• There are also data formats in which one interval carries several statuses rather than just one, but this thesis only deals with the case where one interval has exactly one event label. If two intervals have different event labels, they are treated as different intervals even if their start and end times are the same.
• Event interval sequences can have multiple classes (e.g., one patient can
have multiple diseases in medical datasets), but in this thesis, we limit
that one sequence can have only one class.
This is the first study to perform graph clustering and classification by
converting the event interval sequence into a graph. Therefore, we do not
attempt any graph forms other than the bipartite graph. Although there are various types of clustering and classification on graphs, we focus on showing the feasibility of the approach using basic spectral clustering and PageRank-based techniques. Finally, this thesis demonstrates the performance of the algorithm itself and does not perform statistical verification on the empirical datasets.

1.8 Structure of the thesis


• Chapter 2 presents relevant background information about graph theory,
event interval sequences, and clustering and classification methods.
• Chapter 3 explains the research methodology this thesis takes to proceed
the research to solve the research questions introduced in the thesis.
• Chapter 4 describes the suggested algorithm used to solve the clustering
and classification problem of event interval sequences using the bipartite
graph structure.
• Chapter 5 shows the benchmark results of our proposed method and state-of-the-art methods and explains in what aspects our algorithm outperforms the other algorithms.

• Chapter 6 summarizes the thesis and discusses possible future work regarding the clustering and classification problem of event sequence datasets.

Chapter 2

Extended background

This chapter introduces the technical background used in the algorithms of this thesis. It is largely divided into graph theory, event sequences, clustering methods, and classification methods. Section 2.1 describes the basic elements of graph theory in data mining and the various types of adjacency matrices. Section 2.2 describes events and intervals, the event sequences that these intervals make up, and two interestingness measures that characterize such data. Section 2.3 introduces three clustering algorithms of the k-means family. Section 2.4 describes the three types of classification methods used in this thesis. Finally, Section 2.5 describes the two evaluation metrics used to report our experimental results.

2.1 Graph theory


2.1.1 Graphs
Graph
Let a graph G = {V, E} consist of a set of vertices V and a set of edges
E between the vertices. A graph Gw = {V, E, W } defines a weighted
graph, where W = {w1 , . . . , w|E| } is a vector of edge weights, with each
wi ∈ R being the weight of edge ei ∈ E (Figure 2.1, left). In this thesis, the representation we propose uses a bipartite graph.

Bipartite graph
A bipartite graph GB = {U, V, E} is a special form of a graph whose vertices are divided into two disjoint sets U and V, meaning that U ∩ V = ∅, together with a set of edges E = {eu,v | u ∈ U, v ∈ V} (Figure 2.1, right).

Figure 2.1: Example of a weighted graph (left) and a bipartite graph (right).


In other words, in a bipartite graph, edges can only lead from one vertex
set U to the other vertex set V , while vertices belonging to the same set cannot
be connected. A bipartite weighted graph is trivially defined as GBW =
{U, V, E, W }.

2.1.2 Graph matrices


Adjacency matrix
An adjacency matrix A is a two dimensional square matrix of size |V | × |V |
representing a graph G = {V, E}. Each element in the matrix Aij indicates
whether an edge exists between the ith and jth vertices [29]. If there is no
edge, Aij = 0, and if an edge exists, it can have a real value depending on the
number (or weights) of edges between the vertices.

Biadjacency matrix
A bi-adjacency matrix B ∈ R|U |×|V | is a two-dimensional matrix representing
a bipartite graph GB = {U, V, E}, with each axis representing a set of
vertices, and edges between sets U and V are defined as elements of the matrix.
Moreover, Bij > 0 if and only if there is at least one edge between (i, j) in the
graph.
A bi-adjacency matrix can only be defined in a bipartite graph because the
matrix does not have elements between vertices in the same vertex set. On the
contrary, a regular adjacency matrix of a bipartite graph can be represented as follows:

\[
A(G_B) = \begin{bmatrix} 0^{|U| \times |U|} & B \\ B^{T} & 0^{|V| \times |V|} \end{bmatrix} \tag{2.1}
\]
Since vertices in the same set cannot be connected, the upper left and lower
right portions of the adjacency matrix are filled with zeros. A bi-adjacency
matrix is computationally very efficient, as it reduces the matrix size from (|U| + |V|) × (|U| + |V|) to |U| × |V| while storing the same information.
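To make the data structure concrete, the following is a minimal sketch (not the thesis implementation) of building a sparse bi-adjacency matrix B from a list of weighted edges between the two vertex sets; the toy edge list, the vertex names, and the use of scipy are assumptions for this example.

```python
from scipy.sparse import coo_matrix

# Hypothetical toy edges (u, v, weight) between the two vertex sets,
# e.g. event sequences (U) and temporal relations (V).
edges = [("s1", "r1", 2.0), ("s1", "r2", 1.0), ("s2", "r2", 3.0)]

# Index each vertex set separately; B only stores edges across the sets.
u_index = {u: i for i, u in enumerate(sorted({u for u, _, _ in edges}))}
v_index = {v: j for j, v in enumerate(sorted({v for _, v, _ in edges}))}

rows = [u_index[u] for u, _, _ in edges]
cols = [v_index[v] for _, v, _ in edges]
vals = [w for _, _, w in edges]

# Sparse |U| x |V| bi-adjacency matrix B.
B = coo_matrix((vals, (rows, cols)), shape=(len(u_index), len(v_index))).tocsr()
print(B.toarray())
```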

Laplacian matrix
The Laplacian matrix of a graph G is defined as L = D − A, where D ∈ R^{|V| \times |V|} is the diagonal degree matrix with $D_{ii} = \sum_{j=1}^{|V|} A_{ij}$, for i, j ∈ [1, |V|].
Unlike the standard adjacency matrix, we cannot obtain a bi-adjacency matrix carrying the same information as L, since the Laplacian of a bipartite graph does not have the zero-block structure of the adjacency matrix. With D1 = diag(B 1_{|V|}) and D2 = diag(B^T 1_{|U|}) [30], it takes the following form:

\[
L(G_B) = \begin{bmatrix} D_1 & -B \\ -B^{T} & D_2 \end{bmatrix} \tag{2.2}
\]

Normalized Laplacian matrix


A normalized Laplacian matrix N of a graph is defined as N = D^{-1/2} L D^{-1/2} = I − D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix with Dii = deg(vi). Note that we can work with the bi-adjacency matrix part alone (the top-right block), since N(GB) is expressed in terms of the bi-adjacency matrix as follows:

\[
N(G_B) = \begin{bmatrix} I_{|U|} & -D_1^{-1/2} B D_2^{-1/2} \\ -D_2^{-1/2} B^{T} D_1^{-1/2} & I_{|V|} \end{bmatrix} \tag{2.3}
\]

The normalized Laplacian matrix provides an approximate solution for finding


the normalized ratio cut of the graph, which contributes to providing a good
graph partition [31].

2.1.3 Spectral clustering


The spectral clustering refers to a technique that performs dimensionality
reduction using eigenvalues of the similarity matrix of data and then performs
clustering through existing clustering algorithms on fewer dimensions [32,
14]. If one problem is defined as a node of the graph, and an edge is
defined by connecting high-related problems, relationship diagrams between
the problems can be formed. Subsequently, by using clustering techniques
that group similar objects into one group and other objects into another
group, useful information can be extracted from the relationship diagram of
the problem. We can create the similarity matrix of the spectral clustering
algorithm by using the series of graph matrices generated by graph nodes and
edges (Section 2.1.2).
Here we describe the procedure of spectral clustering. First, we compute
the similarity matrix based on the distance between the data. After calculating
the similarity matrix, each node will move each group to a lower dimension
that is easier to separate. Based on the distance from the projected low-
dimensional space, each cluster is created through an algorithm such as k-
means, and the spectral clustering algorithm ends.
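As an illustration only, the following is a minimal sketch of this procedure for a plain (non-bipartite) similarity matrix, using the normalized Laplacian and k-means from scikit-learn; the RBF similarity, the toy data, and the choice of k are assumptions for the example, not the thesis's bipartite method.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, gamma=1.0):
    # 1) Similarity matrix from pairwise distances (RBF kernel, an assumption here).
    W = np.exp(-gamma * cdist(X, X, "sqeuclidean"))
    np.fill_diagonal(W, 0.0)

    # 2) Normalized Laplacian N = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    N = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # 3) Embed each node using the eigenvectors of the k smallest eigenvalues.
    _, vecs = eigh(N)
    embedding = vecs[:, :k]

    # 4) Cluster the embedded points with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
print(spectral_clustering(X, k=2))
```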
The performance of spectral clustering depends on the similarity matrix we choose. In the case of the minimum cut, the graph is split so that the total weight of the removed edges is as small as possible. However, in this case, the size of one cluster may become too small, as the criterion only depends on the edge weights [15]. The ratio cut, on the other hand, cuts the graph such that the sizes of the two resulting clusters (their numbers of nodes) are kept as equal as possible [33]. One special case of spectral clustering uses the normalized Laplacian matrix [31], which yields an approximately optimal normalized cut, defined as follows:

\[
w(A, B) = \sum_{i \in A,\, j \in B} w_{ij}
\]
\[
\text{normalized\_cut}(A, B) = \frac{w(A, B)}{w(A, V)} + \frac{w(A, B)}{w(B, V)}
\]

With graph matrices, spectral clustering can be regarded as a method to find the optimal partition of the given graph [34]. An optimal partition of the graph corresponds to optimal clusters, meaning that the edges within a cluster are dense while the edge weights between clusters are small. Since each graph matrix solves a different graph cut problem, it is essential to choose a suitable representation for a specific problem.

2.1.4 PageRank
PageRank is a method of weighting documents having a hyperlink structure
such as the World Wide Web according to their relative importance [35].
Hyperlinks can be expressed as out-links from one document to other ones, and
simultaneously, as in-links coming to one document from other documents. A
citation graph is another example, with papers as the nodes and citations as the edges. The PageRank algorithm can be applied to any graph of this shape.
The PageRank algorithm assumes that out-links from the same document
have the same importance. Moreover, the importance of the out-link is
determined proportionally to the importance of the document. Likewise, the
node’s importance is expressed recursively as the sum of the importance of the
in-links coming into the node.
Example. Let us say that the importance values of nodes A, B, and C are iA, iB, and iC, respectively. In the PageRank algorithm, the importance passed along the link from node B to A and along the link from node B to C is iB/2 in each case. The importance of each node can then be expressed as follows:

\[
i_A = \frac{i_A}{2} + \frac{i_B}{2}, \qquad
i_B = \frac{i_A}{2} + i_C, \qquad
i_C = \frac{i_B}{2}
\]

The above expressions can be written using a matrix and a vector as follows:

\[
\begin{bmatrix} i_A \\ i_B \\ i_C \end{bmatrix} =
\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & 1 \\ 0 & \tfrac{1}{2} & 0 \end{bmatrix}
\begin{bmatrix} i_A \\ i_B \\ i_C \end{bmatrix}
\]

If we initialize all importance values to 1/N and repeat the matrix multiplication, the values converge to the actual importance. This method is called power iteration. However, the PageRank algorithm described above does not work well for all graphs. In particular, it is known to fail in two cases: 1) when a node with no out-links exists (dead ends), and 2) when all out-links of a group of nodes smaller than the entire graph stay within the group (spider traps). Teleporting has been proposed as a way to solve these problems.
With probability β, a standard PageRank step is performed; with probability (1 − β), the walk moves to a completely random page (teleport). This teleport mechanism solves the spider trap problem, since there is now a chance of escaping the endless cycle that the standard PageRank algorithm could not leave. In the case of nodes without any out-links, we can likewise always teleport to a random node, so the dead end problem is also solved.

Personalized PageRank
Personalized PageRank is a special form of PageRank that restricts the set of nodes to which the walk can teleport [36]. Unlike normal PageRank with teleport, which can jump to any node in the graph with equal probability, personalized PageRank teleports only to a given set of nodes or, more generally, with different probability values per node. This can be represented by a unique power iteration solution as follows:

\[
PR(\beta, v) = (1 - \beta)\, v + \beta\, M^{T} PR(\beta, v)
\]

where v is the teleport probability vector, M is the transition (similarity) matrix, and β is the probability of following the graph rather than teleporting.
Since personalized PageRank can be used to get a similarity score between
the nodes, it has been used to get the similarity of words [37], to solve a
clustering problem [36], or to build a recommender system [38].
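As an illustration of the fixed-point equation above, here is a minimal power iteration sketch (not the thesis implementation); the transition matrix M, which is assumed to be row-stochastic, and the teleport vector v are toy assumptions. Setting v uniform recovers standard PageRank with teleport, while concentrating v on one node personalizes the scores to it.

```python
import numpy as np

def personalized_pagerank(M, v, beta=0.85, n_iter=100, tol=1e-10):
    """Iterate pr = (1 - beta) * v + beta * M^T pr until convergence.

    M is assumed to be row-stochastic (row i holds the out-link
    probabilities of node i), and v sums to 1.
    """
    pr = np.copy(v)
    for _ in range(n_iter):
        new_pr = (1.0 - beta) * v + beta * (M.T @ pr)
        if np.abs(new_pr - pr).sum() < tol:
            return new_pr
        pr = new_pr
    return pr

# Toy 3-node graph matching the earlier example: A links to A and B,
# B links to A and C, C links to B (each row is an out-link distribution).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# Teleport only to node A -> scores personalized to A.
print(personalized_pagerank(M, v=np.array([1.0, 0.0, 0.0])))
```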

2.2 Event sequences


2.2.1 Elements
Let Σ = {e1, . . . , em} be a set of m event labels. An event interval is a combination of an event label and a specific time duration. Event intervals for the
same instance (e.g., a medical record of a single patient) form an event-interval
sequence or event sequence. Each concept is described in Figure 2.2. We
provide more formal definitions for these concepts.

Event interval
An event interval s = (e, ts, te) is a triple consisting of an event label and two time elements: the event label s.e ∈ Σ, and s.ts and s.te, the start and end times of the interval. An event interval is formed when s.ts ≤ s.te; for a continuous event interval, s.ts < s.te always holds. In the special case where s.ts = s.te, the event interval is instantaneous.

Figure 2.2: Example of an event sequence database.

Event sequence
An event sequence S={s1 , . . . , sn } is a group of event intervals for the same
entity. Event intervals in an event sequence are sorted chronologically and can
contain the same event label multiple times. The intervals are first sorted in ascending order of start time. If the start times are equal, the end times are compared in ascending order. If the end times are also equal, the intervals follow
the lexicographic order of the event labels.

Temporal relation

Figure 2.3: Seven temporal pairwise relations created by the relative positions
of two event intervals.

In this thesis, we consider Allen’s seven temporal relations [8] (Figure 2.3),
defined in the following set, I = {follows, meets, overlaps, matches, contains,

left-matches, right-matches}. Formally, a temporal relation R between two


event intervals sa and sb , with sa .e = ei , sb .e = ej , is defined as a triplet
< ei , ej , r >, with r ∈ I.
In several applications based on event intervals, we may not be interested in the absolute time values of the event intervals but rather in the temporal relations between them. Hence, a simplified representation may be used.
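To make the seven relations concrete, the following is an illustrative sketch of how a pair of event intervals could be mapped to one of them. The exact boundary conditions are assumptions for this example (the thesis follows the definitions of [8] and Figure 2.3), and the first interval is assumed to start no later than the second, which matches the chronological sorting of event sequences.

```python
from collections import namedtuple

EventInterval = namedtuple("EventInterval", ["e", "ts", "te"])

def temporal_relation(a, b):
    """Return one of the seven pairwise relations for intervals a, b with a.ts <= b.ts.

    Boundary conventions here are illustrative assumptions, not the thesis's exact rules.
    """
    assert a.ts <= b.ts, "a is expected to start no later than b"
    if a.ts == b.ts and a.te == b.te:
        return "matches"
    if a.ts == b.ts:
        return "left-matches"
    if a.te == b.te:
        return "right-matches"
    if a.te < b.ts:
        return "follows"        # b starts after a ends, with a gap
    if a.te == b.ts:
        return "meets"          # b starts exactly when a ends
    if b.te < a.te:
        return "contains"       # b lies strictly inside a
    return "overlaps"           # partial overlap: a.ts < b.ts < a.te < b.te

# Toy example: the first interval overlaps the second.
print(temporal_relation(EventInterval("A", 1, 6), EventInterval("B", 4, 8)))
```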

2.2.2 Properties
Vertical support
Given an event sequence S = {s1 , . . . , sn } and a temporal relation R =<
ei , ej , r > defined by event labels ei , ej ∈ Σ, with r ∈ I, the vertical support
of R is defined as the number of event sequences where event labels ei , ej
occur with relation r.
While there can be multiple occurrences of R in the same event sequence,
the relation is counted only once.
Let function occV (·) indicate a single occurrence of a temporal relation in
an event sequence, such that occV (R, S) = 1 if R occurs in S, and 0 otherwise.
We define a frequency function F : [0, |D|] → [0, 1] that computes the relative
vertical support of a temporal relation R in an event sequence database D as
follows:
\[
\mathcal{F}(R) = \frac{1}{|D|} \sum_{S_i \in D} occ_V(R, S_i).
\]

Horizontal support
Given a single event sequence S and a specific temporal relation R, we can
define the horizontal support as a function of these two parameters. The
horizontal support is the number of occurrences of R in S, represented as
occH (R, S). This function counts the occurrences of R in a single event
sequence, not in the database.
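A small sketch of how these two supports could be computed over a toy database is shown below. For simplicity, each event sequence is represented directly by its list of temporal relation triples; this representation and the toy data are assumptions, as the thesis builds the relations from raw intervals via a hash table.

```python
from collections import Counter

# Toy database: each event sequence is given directly as a list of
# temporal relation triples (event label, event label, relation).
D = [
    [("A", "B", "overlaps"), ("A", "B", "overlaps"), ("B", "C", "meets")],
    [("A", "B", "overlaps"), ("A", "C", "follows")],
    [("B", "C", "meets")],
]

def vertical_support(R, database):
    """Relative vertical support F(R): fraction of sequences containing R at least once."""
    return sum(1 for seq in database if R in seq) / len(database)

def horizontal_support(R, sequence):
    """Horizontal support: number of occurrences of R within a single sequence."""
    return Counter(sequence)[R]

R = ("A", "B", "overlaps")
print(vertical_support(R, D))       # 2 of 3 sequences contain R -> 0.666...
print(horizontal_support(R, D[0]))  # R occurs twice in the first sequence -> 2
```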

2.3 Clustering methods


Here we introduce three clustering algorithms: k-means, k-means++, and k-
medoids. k-means++ and k-medoids are variants of k-means to solve some

problems that can occur in k-means. However, the overall algorithm execution
process is similar for all three algorithms.

2.3.1 K-means algorithm

Algorithm 1: K-means
Data: D: set containing n data objects
k: number of clusters
Result: k clusters
1 k data objects are randomly extracted from the data object set D, and
these data objects are set as the centroid of each cluster.
2 For each data object in set D, the distances to the k cluster centroids are computed, and the centroid closest to the data object is found. The data object is then assigned to that centroid's cluster.
3 Recalculate the center point of each cluster, i.e., each centroid is recalculated based on the clusters reassigned in step 2.
4 Repeat steps 2 and 3 until the cluster to which each data object
belongs does not change.
5 return k clusters

The k-means clustering algorithm belongs to the partitioning method among


clustering methods [39, 40]. Partitioning is a method of dividing a given data
into several partitions (or groups). For example, suppose that we have n data
objects as our input. The partitioning method divides the input data into k
groups, where k is smaller than or equal to the number of data objects n.
Each divided k groups form k clusters. The process of dividing the group
is performed in a manner that minimizes a cost function such as dissimilarity
between groups based on distance, and in this process, the similarity between
data objects in the same group increases, and the similarity values with the data
objects in other groups are reduced [41]. The k-means algorithm uses as its cost function the sum of squared distances between the centroid of each group and the data objects in that group, and performs clustering by updating the group membership of each data object so as to minimize this function value (Algorithm 1).
Given a set of n data objects (x1, x2, . . . , xn), the k-means algorithm classifies the n data points into k (≤ n) groups S = {S1, S2, . . . , Sk} so as to maximize the cohesion between objects inside each group. In other words, when µi is the center point of the set Si, the algorithm tries to achieve the following:

\[
\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
\]

This algorithm aims to find a partition S minimizing the overall distances between the center points and the other objects in the same set. Finding the global minimum of this objective function is NP-hard, so the algorithm reduces the error of the objective function in a hill-climbing manner and terminates when a local minimum is found.

2.3.2 K-means++ algorithm

Algorithm 2: K-means++
Data: D: set containing n data objects
k: number of clusters
Result: k clusters
1 Select a random data point from the data set and set it as the first
center c0 .
2 For each data point di in the dataset D, compute the distance dist(di) between di and the closest of the centers selected so far.
3 Select a new data point at random, using a weighted probability distribution in which di is chosen with probability proportional to dist(di)^2, and set it as the next center.
4 Steps 2 and 3 are repeated until k centers have been selected.
5 k-means clustering is performed using the selected k centers as initial
values.
6 return k clusters

The result of k-means clustering depends significantly on how the initial centers are selected. k-means++ was proposed to reduce the damage caused by this property [42]. The k-means++ algorithm is a variant of k-means that only changes how the initial centers of the k-means clustering algorithm are selected (Algorithm 2). It requires additional time to set the initial values, but the selected initial centers then guarantee that the expected clustering cost is within an O(log k) factor of the optimal k-means solution. In this thesis, we use k-means++ instead of k-means.
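For reference, a minimal usage sketch with scikit-learn, whose KMeans class uses the k-means++ initialization by default, might look as follows; the toy data and k = 2 are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature vectors (in the thesis these would be spectral embeddings).
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])

# init="k-means++" is the default initialization in scikit-learn.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # the two centroids
```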

2.3.3 K-medoids algorithm

Algorithm 3: K-medoids
Data: D: set containing n data objects
k: number of clusters
Result: k clusters
1 Select any k data objects from the data set and set them as the initial medoids.
2 Assign each unselected data object to the nearest medoid. Not only the Euclidean distance but also other distance functions can be used here.
3 For each unselected data object, calculate the distance cost that would result if that object became the new medoid of the cluster containing it.
4 Compare the distance cost of the existing medoids with the distance cost of the newly proposed medoids; if the cost is lowered, replace the medoid.
5 Repeat steps 2, 3, and 4 until the selected medoids no longer change.
6 return k clusters

The k-medoids algorithm uses an actual data object (the medoid) instead of the average of the data points as the center of each cluster [43]. Unlike the k-means algorithm, it works with arbitrary distance functions, not only the Euclidean distance; thus, it does not need the actual data points but only a pairwise distance function over data point ids (Algorithm 3). In this algorithm, most of the time is spent calculating distances between data points. Therefore, the distances can be computed and stored in advance, or heuristic techniques can be used to speed up the algorithm.
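A minimal sketch of this idea, operating only on a precomputed pairwise distance matrix (a simple alternating variant rather than the exact algorithm above), is shown below; the toy distances and k are assumptions.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Simple alternating k-medoids on a precomputed pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        # In each cluster, pick the member minimizing the total within-cluster distance.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) > 0:
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids

# Toy symmetric distance matrix for four points (two obvious pairs).
D = np.array([[0.0, 1.0, 9.0, 9.5],
              [1.0, 0.0, 8.5, 9.0],
              [9.0, 8.5, 0.0, 1.0],
              [9.5, 9.0, 1.0, 0.0]])
print(k_medoids(D, k=2))
```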

2.4 Classification methods


2.4.1 K-nearest neighbors
The k-NN algorithm is a methodology that predicts or classifies new data
using information from the nearest k neighbors among existing data when we
have new data points [44]. As shown in Figure 2.4, we can infer the category
information of the black dots from the neighbors. If k = 1, it is orange, and if
k = 3, the algorithm will classify it as green. If it is a regression problem, the
mean of the dependent variable (y) is the predicted value.
Figure 2.4: Example of the k-NN algorithm when k = 1 and k = 3.

k-NN has no explicit training procedure. When a new data point arrives, its distances to the existing data are measured, and the k nearest neighbors are derived directly from these distances. That is why k-NN is a lazy model (or instance-based learning): it does not build a model separately in a training step. This contrasts with model-based learning, which first creates a model from the data and then performs the task. The purpose is to perform tasks such as classification or regression using only the observations themselves, without a separate model generation process.
k-NN has two hyperparameters: the number of neighbors to search (k) and the distance measure. If k is small, the model will overfit to local characteristics of the data. Conversely, if it is large, the model tends to be over-smoothed (underfitting). Second, the results of the k-NN algorithm vary greatly depending on the distance measure. Most commonly, the Euclidean distance is used.
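For reference, a minimal scikit-learn sketch exposing exactly these two hyperparameters might look as follows; the toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two classes in 2-D feature space.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.8]])
y_train = np.array([0, 0, 1, 1])

# k (n_neighbors) and the distance measure (metric) are the two hyperparameters.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)
print(knn.predict(np.array([[4.9, 5.2]])))  # -> [1]
```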

2.4.2 Random forest


In general, a method using a single decision tree has the drawback of large fluctuations in results and performance. In particular, since the decision tree generated from the training data varies greatly with small random changes in that data, it is not easy to generalize and use it. Moreover, since the decision tree is a hierarchical approach, an error that occurs in the middle continues to propagate to the subsequent steps. A randomization technique such as bagging [45] can overcome these disadvantages of the decision tree and achieve good generalization performance.
Figure 2.5: Example of the process of learning RF consisting of three decision trees using bagging.

RF consists of many decision trees, as depicted in Figure 2.5 [46, 47]. If the opinions of the many decision trees in the forest do not agree, the majority rule is followed. This method of consolidating opinions or combining multiple results is called an ensemble technique. 'Random' in
RF means randomly selecting the elements used to build each decision tree.
In other words, RF refers to a technique that builds multiple decision trees for
the same data and synthesizes the results to improve prediction performance. Bagging is an abbreviation of bootstrap aggregating; it aggregates base learners trained on slightly different training sets obtained through the bootstrap. The bootstrap refers to creating a dataset of the same size as the original dataset by sampling with replacement (allowing repetitions) from the given training data.

1. The bootstrap method creates T training datasets.

2. Train T basic classifiers (trees).

3. Combine basic classifiers (trees) into one classifier (RF) (using average
or majority vote method).

Because a tree has small bias and large variance, a very deeply grown tree becomes overfitted. The bootstrap process improves the forest's performance because it preserves the trees' low bias while reducing the variance. That is, a single decision tree is susceptible to noise in the training data, but if the trees are not correlated with each other, the average of several trees is robust against noise. If all the trees that make up the forest are trained on the same dataset, their correlation will be high. Therefore, bagging is a process for decorrelating the trees by training them on different datasets.
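To illustrate the bootstrap-aggregating steps above, the following is a minimal sketch of bagging decision trees by hand with scikit-learn (in practice sklearn.ensemble.RandomForestClassifier would be used); the toy data and T = 10 are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy training data: two classes in 2-D feature space.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

T = 10
trees = []
for _ in range(T):
    # Bootstrap: sample n indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote over the T trees.
X_test = np.array([[0.2, -0.3], [4.1, 3.8]])
votes = np.stack([t.predict(X_test) for t in trees])  # shape (T, n_test)
pred = (votes.mean(axis=0) >= 0.5).astype(int)        # majority vote for 0/1 labels
print(pred)  # expected [0, 1]
```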

2.4.3 Support vector machines


In general, an SVM [48] consists of a hyperplane or a set of hyperplanes that can be used for classification or regression analysis. Intuitively, if the hyperplane lies far from the nearest training data points, the classification error is small; so, for good classification, it is necessary to find the hyperplane with the largest distance to the closest training data points of each class. In general, the initial problem is posed in a finite-dimensional space, but often the data are not linearly separable there. To solve this problem, a method was proposed to facilitate separation by mapping the original finite-dimensional space into a higher-dimensional one. To prevent the amount of computation from increasing in this process, the SVM is designed around a kernel function k(x, y), chosen appropriately for the problem, so that the inner products can be computed efficiently using the variables of the original problem [49]. A hyperplane in the high-dimensional space is defined as a set of points whose dot product with a constant vector is fixed. The vectors defining the hyperplane are chosen as linear combinations of the images of the feature vectors appearing in the database. With this choice of hyperplane, a point x lying on the hyperplane satisfies the following relationship:

\[
\sum_{i} \alpha_i\, k(x_i, x) = \text{constant}
\]

If k(x, y) becomes smaller as x and y move farther apart, each term of the sum measures the proximity of the test point x to the corresponding data point xi. In this way, the sum of the kernel terms above can be used to measure the relative proximity of a test point to the data points of the set we want to distinguish. This can be more complicated and difficult when a point x belonging to a non-convex set in the original space is mapped to a higher dimension.
Classifying data is a common task in machine learning. Assuming that each given data point belongs to one of two classes, the goal is to determine which class a new data point will belong to. In a support vector machine, a data point is viewed as a p-dimensional vector (a list of p numbers), and we check whether we can separate the points with a (p − 1)-dimensional hyperplane. This is called linear classification. Many different hyperplanes may separate the data. One logical way to select a hyperplane is to choose the one with the largest separation, or margin, between the two classes. We therefore choose the hyperplane that maximizes the distance to the data points of each class closest to it. If such a hyperplane exists, it is called the maximum-margin hyperplane, and the corresponding linear classifier is called the maximum margin classifier.
The optimization problem of SVM can be expressed as follows: using the Lagrange multiplier method, the above problem becomes the problem of finding the following saddle point:

\[
\underset{w,\,b}{\arg\min}\; \max_{\alpha \ge 0} \left\{ \frac{1}{2}\lVert w\rVert^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}
\]

Nonlinear classification is obtained by applying a kernel to the maximum-margin hyperplane [50, 51]. The nonlinear classification algorithm has a similar form to the existing linear classification algorithm, but the inner product operation is replaced by a nonlinear kernel function. Through this, it becomes possible to solve the maximum-margin hyperplane problem in the transformed feature space. The transformation can be nonlinear or dimension-increasing. In other words, the classifier is a linear hyperplane in the high-dimensional feature space, but a nonlinear surface in the original space. If the kernel function is a Gaussian radial basis function, the feature space is an infinite-dimensional Hilbert space.

• Polynomial kernel: $k(x_i, x_j) = (x_i \cdot x_j + 1)^d$

• Gaussian radial basis function: $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ for $\gamma > 0$
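For reference, a minimal scikit-learn sketch using these two kernels might look as follows; the toy data and the parameter values (d = 3, γ = 0.5, C = 1.0) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: two classes in 2-D feature space.
X = np.array([[0.0, 0.0], [0.3, 0.1], [3.0, 3.2], [2.8, 3.1]])
y = np.array([0, 0, 1, 1])

# Polynomial kernel (x_i . x_j + 1)^d with d = 3.
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)

# Gaussian RBF kernel exp(-gamma * ||x_i - x_j||^2) with gamma = 0.5.
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

print(poly_svm.predict([[2.9, 3.0]]), rbf_svm.predict([[0.1, 0.2]]))  # -> [1] [0]
```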

2.5 Performance metrics


2.5.1 Clustering purity

Figure 2.6: Example of the clustering purity with three clusters and three
classes.
Purity is one of the criteria for assessing the quality of a clustering and is classified as an external evaluation [52], meaning that we need to know the ground-truth class distribution of our dataset [53]. It is calculated by evaluating, for each cluster, how large a fraction of the cluster consists of a single (the most frequent) class. When calculating purity, the absolute count of each class is not important; it is always computed from the class that makes up the largest number of elements in each cluster. Expressed as a formula:

\[
\text{purity} = \frac{1}{|D|} \sum_{m \in M} \max_{c \in C} |m \cap c|
\]

where M is the set of clusters, C is the set of classes, and D is our dataset.


For example, in Figure 2.6, cluster 1 contains four circles, which are the majority in this cluster. In cluster 2, orange squares are the majority, and cluster 3 has mostly triangles (three triangles out of five elements). Thus, the purity is calculated as (4 + 4 + 3)/16 = 11/16.
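A tiny sketch of this computation from ground-truth labels and cluster assignments (not tied to any particular library) is shown below; the label vectors are toy assumptions loosely mirroring the figure.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Purity = (1/|D|) * sum over clusters of the count of the cluster's majority class."""
    clusters = {}
    for m, c in zip(cluster_ids, class_labels):
        clusters.setdefault(m, []).append(c)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(class_labels)

# Toy assignment: 16 objects, 3 clusters, 3 classes, majorities of 4, 4, and 3.
cluster_ids  = [1]*6 + [2]*5 + [3]*5
class_labels = (["circle"]*4 + ["square", "triangle"]      # cluster 1: 4 circles majority
                + ["square"]*4 + ["circle"]                 # cluster 2: 4 squares majority
                + ["triangle"]*3 + ["circle", "square"])    # cluster 3: 3 triangles majority
print(purity(cluster_ids, class_labels))  # (4 + 4 + 3) / 16 = 0.6875
```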

2.5.2 Accuracy
Accuracy is calculated as the number of True Positive (TP) and True Negative (TN) predictions over the size of the entire target dataset as follows:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Accuracy is regarded as one of the most representative evaluation metrics.


However, if the class distribution of data is unbalanced, accuracy may not
reflect the characteristics of the data [54]. For example, suppose that 90 out of
100 data points are class 1, and only 10 have class 2. In this case, an algorithm
that classifies all values as class 1 can achieve an accuracy of 0.9. In this thesis,
accuracy is used in evaluating the classification of the algorithm.
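The imbalance issue can be reproduced with a minimal scikit-learn sketch; the class sizes follow the example above.

    import numpy as np
    from sklearn.metrics import accuracy_score

    # 90 points of class 1 and 10 points of class 2
    y_true = np.array([1] * 90 + [2] * 10)
    y_pred = np.ones_like(y_true)          # a trivial classifier that always predicts class 1
    print(accuracy_score(y_true, y_pred))  # 0.9, despite learning nothing about class 2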

2.6 Summary
This section briefly introduced the various algorithms used in our thesis. The graph algorithms are used to transform our dataset, and clustering and classification are applied on top of the resulting graph structure. Event sequences are the default format of our data, and they are transformed through several representative values that characterize this format. We also introduced the clustering (k-means, k-means++, and k-medoids) and classification (k-NN, RF, SVM) algorithms applied to the data, as well as the performance metrics (purity, accuracy) used to evaluate the suggested method. All of the techniques presented in this section were used to construct and evaluate our algorithm. More details of our algorithm are introduced in the following chapters.

Chapter 3

Research methods

This chapter aims to provide an overview of the research method used in this
thesis to answer the following research questions:
• Representation and applicability: Can the graph structure serve as a proper representation for clustering and classification problems, i.e., one from which numerical feature vectors can be generated?

• Clustering purity: Can clustering of event intervals with a graph structure achieve purer clusters than previous methods?

• Classification accuracy: Can classification of event sequences with a graph structure achieve higher accuracy than previous methods?

• Runtime efficiency: Will the total runtime of creating the feature vectors and performing classification and clustering on them be lower than that of state-of-the-art algorithms?
Section 3.1 explains the choice of research methods in the thesis to answer
the research question. Section 3.2 describes the research process used in the
thesis based on the CRoss-Industry Standard Process for Data Mining (CRISP-
DM) process to derive the proper algorithm that can solve the proposed
research question.

3.1 Choice of research methods


In the previous sections, we introduced the research question as follows:

To what extent can the efficiency and effectiveness of classifying and clustering event sequence datasets be improved by introducing the bipartite graph representation, compared to state-of-the-art methods that directly use event intervals?

To answer the research question, we need to consider which parts of it have to be addressed. Our research question can be divided as follows:

• Efficiency: How can we make clustering of event intervals achieve purer clusters from datasets compared to previous methods?

• Effectiveness: How can we make classification of event sequences achieve higher accuracy from datasets compared to previous methods?

• Representation: Is there any alternative structure that can equivalently represent the event sequences and that supports achieving better efficiency and effectiveness?

We have found that the algorithms previously designed for event sequence datasets do not show a practical level of efficiency and effectiveness. Also, the number of algorithms dedicated to event sequence datasets is quite small compared to those based on more general data types. Based on these observations, we concluded that changing the structure of the event sequence datasets into a more general form could help enhance efficiency and effectiveness.

From a previous experiment converting a dynamic graph structure into a set of event intervals [13], we learned that both structures are compatible and can be converted into each other. Furthermore, from our comprehensive background analysis, we also gathered that the graph structure can be handled by a wide range of machine learning algorithms, including graph-specific ones, traditional classification and clustering algorithms such as k-NN and SVM, and dimensionality reduction techniques like spectral embedding, which increases the possibility of improving both efficiency and effectiveness.

Based on this extensive review of existing research areas, we introduce a novel method to convert event sequence datasets into a graph structure in order to answer our research question. After introducing the method, we answer each of the decomposed research questions by conducting empirical experiments on five real-world datasets. We validate that the proposed method can reliably answer our research questions through graph-based clustering and classification on the graph structure, as well as through traditional machine learning techniques applied after a graph-based dimensionality reduction. We report how much improvement we achieve in efficiency and effectiveness in Chapter 5.

3.2 Research process


The procedure of this thesis follows the CRISP-DM process [55], which is commonly accepted as a standard research process in the data science area [56]. CRISP-DM has the following six steps: 1) business understanding, 2) data understanding, 3) data preparation, 4) modeling, 5) evaluation, and 6) deployment. Business and data understanding of event intervals are included in the introduction and extended background chapters, covering the importance of event sequence datasets in various industries and a brief introduction of the datasets. From this understanding, our four research questions emerged, as described at the beginning of Chapter 3. To address them, the primary purpose of this study is to devise a modeling process that can quickly solve the clustering and classification problems of event sequence datasets. We describe our modeling strategy in detail to make it reproducible. Finally, to validate the suggested algorithm, the evaluation and deployment phases for testing purposes are included for selected real-world datasets in Chapter 5.

Chapter 4

Suggested algorithm

This chapter provides a detailed explanation of our proposed algorithm, derived from the research process explained in Chapter 3. Section 4.1 details the research paradigm by giving an overall picture of the suggested method. Section 4.2 provides a detailed explanation of the suggested framework.

4.1 Research paradigm


We suggest an efficient three-step framework for converting an event sequence
database into a bipartite graph representation, where structural information
regarding the temporal relations in the event sequences is preserved, facilitating
scalable clustering and classification. The first two steps of the framework
convert the original event sequence database into a bipartite graph, while at the
third step, the bipartite graph is converted into a spectral embedding space or
used directly to rooted PageRank. The final representation can then be readily
used by off-the-shelf clustering or classification algorithms. These steps, also
outlined in Figure 4.1 and Algorithm 4, are described below:

1. Construction of hash table: This is a data structure that efficiently


stores the information needed to create a bipartite graph after scanning
an event sequence database. We can apply various pruning processes
based on temporal relations to the table for better graph representation.

2. Conversion to a bipartite graph: The pruned table is converted to a


weighted bipartite graph with two vertex sets of event sequences and
temporal relations. The bipartite graph is represented as a form of a
bi-adjacency matrix. We represent the bipartite graph with the two
interestingness factors defined in Section 2, i.e., vertical support and

Algorithm 4: Pseudocode of the suggested method

Data: D: event sequence database, d: dimension factor,
      constraints: predefined constraints {minSup, maxSup, gap}
Result: U: row embedding of the regularized bi-adjacency matrix

 1  // Step 1: Construction of hash table
 2  HT = {}
 3  for Si ∈ D do
 4      for sa, sb [sa < sb] ∈ Si do
 5          r ← getRelation(sa, sb, constraints.gap)
 6          if r ≠ None then
 7              R ← (sa.e, sb.e, r)
 8              if R ∉ HT then
 9                  HT.index(R)
10              if Si.id ∉ HT[R] then
11                  HT[R].index(Si.id)
12              HT[R][Si.id].addHorizontalSupport()
13  for Rk ∈ HT do
14      if F(Rk) < constraints.minSup ∨ F(Rk) > constraints.maxSup then
15          remove HT[Rk]
16  // Step 2: Conversion to a bipartite graph
17  B = 0^(|D|×|HT|)
18  for Rk ∈ HT do
19      for Si.id ∈ HT[Rk] do
20          B[Si.id][hash(Rk, |HT|)] = HT[Rk][Si.id]
21  // Step 3: Application of graph-based algorithms
22  if mode = "spectral" then
23      features = spectralEmbedding(B, d)   // spectral embedding of the bipartite graph
24  else if mode = "pagerank" then
25      features = pageRank(B, d)            // similarity acquisition by personalized PageRank
26  return features

E-sequence database used in the running example:

id   Event intervals
1    (A, 1, 3), (B, 1, 3), (A, 14, 16)
2    (A, 1, 6), (B, 6, 8), (A, 10, 12), (C, 13, 17)
3    (A, 4, 7), (B, 11, 12)
4    (B, 1, 5), (A, 6, 14), (B, 6, 14), (A, 17, 18)

Figure 4.1: Example of the suggested method. Step 1 constructs the hash table from the e-sequence database above using getRelation(), Step 2 converts it into a regularized bi-adjacency matrix over the pruned temporal relations R1-R4, and Step 3 applies either spectral embedding or personalized PageRank.

horizontal support [57]. We use vertical support as a pruning factor


because it is a measure of how prevalent the temporal relation is across
the entire database, while horizontal support is used as a weight of the
edge of the graph since it represents the strength of a specific temporal
relation in different event sequences.

3. Application of graph-based algorithms: After generating the bi-


adjacency matrix, we can choose the way of getting feature vectors. We
support the following two ways:

(a) Spectral embedding of the bipartite graph: the feature vector


of each event sequence is generated through regularization and
singular value decomposition. These methods help with reducing
the complexity and dimensionality of the event sequences. Since
spectral embedding results in numerical feature vectors of event
sequences, we can use a wide range of classification and clustering
algorithms compared to previous distance-based (e.g., Artemis,
IBSM) and non-numerical-feature-based methods (e.g., STIFE).
(b) Similarity acquisition by personalized PageRank: the similarity
scores from one event sequence to all the others in the database

are generated in a random-walk manner. Then clustering and classification are performed based on these similarities. Clustering is performed hierarchically, and classification is performed by picking the k highest scores from the similarity vector. This method fully exploits the graph structure, which has not been tried before on event sequence databases.

4.2 Model design


4.2.1 Construction of hash table

Figure 4.2: An instantiation of the hash table in the suggested algorithm. The database is scanned with getRelation() to fill the temporal relation hash table HT (with gap pruning of follows relations based on the {gap} constraint); each relation Rk maps to its event sequence hash table HE, whose values become edge weights; frequency pruning keeps relations with minSup ≤ F(Rk) ≤ maxSup; the result is the weighted bi-adjacency matrix of size |D| × |HT|.

We construct a multi-layer hash table composed of three layers, where each layer is derived from the event sequence database. This hash table structure facilitates the conversion to a bipartite graph. The detailed form of the hash table is described in Figure 4.2. The hash table efficiently keeps all information needed for the conversion and for occurrence-based pruning by scanning the database only once to capture the temporal relations. There are two main steps to create the hash table: (1) the construction step (blue arrows in Fig. 4.2) and (2) the pruning step (orange arrows in Fig. 4.2).

Construction step
First, we traverse all event intervals in the event sequence database in pre-defined chronological order. For every iteration, we have a target event interval s_a. We then pair it with every event interval s_b that occurs after it (or at the same time, but with a lexicographically later event label). Thereafter, we check the temporal relation between the two event intervals s_a and s_b (lines 1-5, Algorithm 4). The temporal relation between them, R_k = <s_a.e, s_b.e, r> with r ∈ I, is formed and stored as a key in the first hash table H_T, which we call the temporal relation hash table (lines 6-9). Whenever the algorithm finds a temporal relation R_k, it identifies the event sequence id containing it and uses it as a key in the second hash table H_E, called the event sequence hash table (lines 10-11). For clarity, since each record H_T[R_k] ∈ H_T is mapped to its respective event sequence hash table, we call it H_E^k. The keys of the event sequence hash table are the event sequence ids where R_k occurs, while the values are the horizontal supports that quantify the occurrences of R_k in the event sequence, which will eventually be converted into edge weights of the bipartite graph. When a specific key is first created, we set its value in H_E^k to one, as the relation has only one horizontal support at that point. If the same temporal relation R_k happens more than once in the same event sequence S_i, we increment H_E^k[i] to update its horizontal support (line 12).
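To make the construction step concrete, below is a minimal Python sketch under simplifying assumptions: the database is a plain dictionary, and the relation check is reduced to a toy helper covering only a few relation types (the thesis uses the full set of temporal relations; all names below are illustrative, not the actual implementation).

    from collections import defaultdict

    def get_relation(s_a, s_b, gap):
        # Toy relation check covering only three relation types; "meets" and the
        # remaining types are omitted for brevity.
        (_, a_start, a_end), (_, b_start, b_end) = s_a, s_b
        if a_start == b_start and a_end == b_end:
            return "matches"
        if a_end < b_start:
            return "follows" if b_start - a_end <= gap else None  # gap pruning
        if a_start < b_start and b_start < a_end < b_end:
            return "overlaps"
        return None

    def build_hash_table(database, gap):
        # database: {sequence id: [(event label, start, end), ...]}
        # returns HT[R][sequence id] = horizontal support of temporal relation R
        HT = defaultdict(lambda: defaultdict(int))
        for seq_id, intervals in database.items():
            ordered = sorted(intervals, key=lambda s: (s[1], s[2], s[0]))  # chronological order
            for i, s_a in enumerate(ordered):
                for s_b in ordered[i + 1:]:
                    r = get_relation(s_a, s_b, gap)
                    if r is not None:
                        HT[(s_a[0], s_b[0], r)][seq_id] += 1
        return HT

    # The e-sequence database of Figure 4.1, with gap = 7 as computed in the example below
    db = {1: [("A", 1, 3), ("B", 1, 3), ("A", 14, 16)],
          2: [("A", 1, 6), ("B", 6, 8), ("A", 10, 12), ("C", 13, 17)],
          3: [("A", 4, 7), ("B", 11, 12)],
          4: [("B", 1, 5), ("A", 6, 14), ("B", 6, 14), ("A", 17, 18)]}
    HT = build_hash_table(db, gap=7)
    print(dict(HT[("A", "B", "matches")]))  # {1: 1, 4: 1}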

Pruning step
The pruning step helps limit the unnecessary creation of trivial temporal relations and keeps the graph holding only the information necessary for the learning problem. The pruning step consists of two sub-steps. They do not occur sequentially, but at different times during the construction step:

1. Gap pruning: A gap constraint limits the maximum distance of follows relations between intervals. Follows relations are easy to produce, since they occur between almost any two intervals that do not overlap in time at all. This pruning process drops temporal relations that arise only because two intervals are chronologically far apart rather than meaningfully related. The algorithm checks the gap when examining the temporal relation while scanning the database (line 5, Algorithm 4). For this, the algorithm receives the gap constraint as a value in the range [0, 1]. This value is a ratio of the average time duration of the event sequences in the database, and the algorithm prunes the follows relations whose distance exceeds that ratio.

2. Frequency pruning: Frequency pruning eliminates the temporal relations R_k whose relative vertical supports F(R_k) are below or above the pre-defined criteria after the multi-layer hash table has been entirely created (lines 13-15). To do this, we impose the following two constraints (a minimal sketch of this pruning is shown after this list):

   • Minimum support constraint: this constraint corresponds to the minimum occurrence frequency of each temporal relation. It helps increase cluster purity by removing rarely occurring temporal relations that cannot be consistent within the same cluster.

   • Maximum support constraint: this constraint corresponds to the maximum occurrence frequency of each temporal relation. Maximum support limits the temporal relations spanning almost all event sequences, allowing the embedding space to express the event sequence space more holistically.
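Assuming the hash table sketch above, frequency pruning can be written as follows; the function name and dictionary layout are illustrative.

    def frequency_prune(HT, num_sequences, min_sup, max_sup):
        # Drop temporal relations whose relative vertical support
        # F(R) = |{sequences containing R}| / |D| lies outside [min_sup, max_sup]
        # (a sketch of lines 13-15 of Algorithm 4).
        pruned = {}
        for R, seq_counts in HT.items():
            vertical_support = len(seq_counts) / num_sequences
            if min_sup <= vertical_support <= max_sup:
                pruned[R] = seq_counts
        return pruned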

Example. Let us consider an event sequence database of size 4, as depicted in Figure 4.1. We will use this example for the rest of the thesis, with the following parameter settings: minSup = 0.5, maxSup = 1, gap = 0.5. Consequently, we need to find temporal relations with absolute vertical supports from 2 to 4. Moreover, the gap constraint gap = 0.5 indicates that the longest distance a follows relation can have is at most half the average time length of all event sequences, which is ⌊(16 + 17 + 12 + 18)/4 × 0.5⌋ = 7. First, we scan the database to get all possible temporal relations by checking the temporal relations between all the event intervals in the database. In this example, we see that (A, 1, 3), (B, 1, 3) in the first event sequence and (A, 6, 14), (B, 6, 14) in the fourth event sequence form a temporal relation: <A, B, matches>. Then, we place the relation as the key in the first layer of the hash table H_T. After that, since the same temporal relation happens in both the first and fourth event sequences, we save their ids into the second layer along with their corresponding vertical supports (left square boxes in the event sequence hash table H_E). Finally, we compute the horizontal support by counting how often the same temporal relation has happened in each saved event sequence ({1, 4} in the example). We only have a single horizontal occurrence in both event sequences, hence we add ones to the values of the event sequence hash table (right square boxes). The gap constraint pruning is applied together with the hash table construction. Without the gap constraint, (A, 1, 3) and (A, 14, 16) in the first event sequence would have formed a follows relation. However, since the distance between the two event intervals is 11 (> 7), we skip creating a record in the hash table. The same pruning holds for the event intervals (B, 1, 5), (A, 17, 18) in the fourth event sequence, since the distance between them is 12. After constructing the hash table, the algorithm performs frequency pruning by applying the two support constraints {minSup, maxSup}. Since minSup = 0.5 (or a support count of 2), the temporal relations with vertical support equal to 1 are excluded from the first layer of the table (gray triplets in H_T in the example).

Time complexity
The benefit of this multi-layer hash table is that we can easily consider two
types of frequencies and apply pruning techniques by scanning the event
sequence database only once. Moreover, we can instantly transform the table
to its corresponding bi-adjacency matrix weighted by the support of the event
sequences’ temporal relations. All that is required is to scan the database once,
scan the first layer of the table for applying the pruning technique, and scan
the first and second layers of the hash table to convert it into the bi-adjacency
matrix.
With these benefits, we can explicitly calculate the time complexity of the first step of the suggested algorithm. Given an event sequence database D = {S_1, ..., S_{|D|}}, the set of possible relations I, and the alphabet of event labels Σ, the time complexity for creating the bi-adjacency matrix is quadratic in the worst case:

\sum_{S_i \in D} \left( |S_i|^2 \times |I| \right) + \left( |\Sigma|^2 \times |I| \right) + \left( |\Sigma|^2 \times |I| \times |D| \right).

4.2.2 Conversion to a bipartite graph


In this step, we use the idea of a weighted bipartite graph. A bipartite graph consists of edges that can only lead from the node set U to the node set V, while nodes in the same set cannot be connected. A weighted bipartite graph is an extension of the bipartite graph, where each e_{u,v} ∈ E equals the corresponding edge weight between u and v, or 0 if no edge exists between u and v. After the construction and pruning steps, we have the completed multi-layer hash table explained in Section 4.2.1. We can then easily create the corresponding weighted bipartite graph by directly using the values of each layer of the hash table without any further calculation. The temporal relations in the first layer of the hash table (the temporal relation hash table) are used as the right-hand side nodes of the bipartite graph, while the event sequence ids of the second layer (the event sequence hash table) are used as the left-hand side nodes. After that, edges are created to link each event sequence id (left-hand side node) to the corresponding temporal relations (right-hand side nodes) it contains. The horizontal supports stored as values of the second hash table are used as edge weights (after pruning by the {minSup, maxSup} thresholds). The final graph is a weighted bipartite graph G = {U, V, E}, with

U = \{ i \mid S_i \in D,\ i \in [1, |D|] \},
V = \{ H_T.keys \mid \forall R \in H_T : minSup < F(R) < maxSup \}, \text{ and}
E = \{ e_{i,j} = H_E^i[j] \mid i \in [1, |D|],\ j \in H_E^i \}.

Using the node and edge information of G, we construct its bi-adjacency matrix B (lines 17-20, Algorithm 4). The bi-adjacency matrix B ∈ R^{|U|×|V|} is a two-dimensional matrix representing G, whose dimensions correspond to the vertex sets U and V, and whose elements encode the edges between U and V:

B_{u,v} = \begin{cases} e_{u,v} > 0, & \text{if and only if } e_{u,v} \in E \\ 0, & \text{otherwise} \end{cases}

A bi-adjacency matrix is computationally much more efficient than a standard adjacency matrix. It reduces the matrix size from (|U| + |V|) × (|U| + |V|) to |U| × |V| while storing the same information, since the standard matrix reserves space for node pairs within the same node set, which is always filled with zeros. Figure 4.2 shows an example of the conversion from the set of temporal relations in the hash table to a bipartite graph G, together with the graph's corresponding weighted bi-adjacency matrix B.
Example. After the construction and pruning steps, we have four temporal relations {<A, A, follows>, <A, B, matches>, <A, C, follows>, <B, A, follows>} and four event sequence ids {1, 2, 3, 4} that meet all the constraints. We can then create a 4 × 4 bi-adjacency matrix and fill it with the values of the third layer of the hash table, using the key of the second layer (the event sequence id) as the row index and the key of the first layer (the temporal relation) as the column index, as shown in Step 2 of Figure 4.1. For example, since the horizontal support of the pair <B, A, follows> in the fourth event sequence is two, we insert the value two into the matrix at key {<B, A, follows>, 4}, which is the bottom-right value in the matrix. If no edge occurs between an event sequence and a relation pair, we set that value to zero, following the definition of the bi-adjacency matrix. The third event sequence will have all zeros in the matrix since all of its relations are pruned.
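A minimal numpy sketch of this conversion, assuming the hash table layout of the earlier sketches and sequence ids numbered 1..|D|, could look as follows (names are illustrative):

    import numpy as np

    def to_biadjacency(HT, num_sequences):
        # Rows are event sequences, columns are the surviving temporal relations,
        # and entries are horizontal supports (zero if a sequence lacks the relation).
        relations = sorted(HT.keys())                 # fix a column order
        B = np.zeros((num_sequences, len(relations)))
        for j, R in enumerate(relations):
            for seq_id, h_support in HT[R].items():
                B[seq_id - 1, j] = h_support          # 1-based ids -> 0-based rows
        return B, relations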

4.2.3 Application of graph-based algorithms


After creating the weighted bipartite graph, we can proceed by selecting one of two options as the third step before applying classification and clustering methods. The first option, spectral embedding, returns a numerical feature vector for each sequence by mapping the bi-adjacency matrix (or its Laplacian variant) into a low-dimensional space that can represent a semi-optimal partition of the graph. We can then use these feature vectors to apply existing machine learning algorithms for classification or clustering. Second, we can apply personalized PageRank-based classification and clustering techniques using the graph structure as it is. This method finds similarity scores for each node by traversing the nodes along the edges in a PageRank manner. We can then proceed with clustering and classification by collecting, for each node, the nodes with the highest similarity scores.

Spectral embedding of the bipartite graph

Algorithm 5: spectralEmbedding

Data: B: a bi-adjacency matrix of intervals where B ∈ R^{|U|×|V|},
      d: dimension factor
Result: U: embedding of the rows

1  B_R = B + α · 1^{|U|×|V|}
2  N_B^{UR} = D_1^{-1/2} B_R D_2^{-1/2}
3  calculate the singular vectors N_B^{UR} = M Σ W^T
4  pick the leading d singular values and the corresponding d columns of M
5  return M[:U, :d]

After constructing the bipartite graph and its bi-adjacency matrix, we


proceed with defining a reduced-rank spectral embedding [18]. First, we apply
regularization with the regularization factor α to ensure noise and outlier
robustness of the spectral embedding. The factor α is determined by prior
information based on the characteristics of the datasets. We used the most up-
to-date method introduced in [23], which is adding a constant factor α equally
to all elements of the bi-adjacency matrix (line 1, Algorithm 5). We can interpret this action in terms of the graph: adding small values to the bi-adjacency matrix means adding small-weight edges between every pair of nodes across the two sets (green edges in Fig. 4.1).
Next, using the weighted bi-adjacency matrix, we calculate the normalized Laplacian matrix N_B. However, we only calculate N_B^{UR}, which is the top-right part of N_B (line 2). Using this part is sufficient, since N can be expressed in terms of the bi-adjacency matrix as follows:

N = \begin{bmatrix} I_{|U|} & -D_U^{-1/2} B D_V^{-1/2} \\ -D_V^{-1/2} B^T D_U^{-1/2} & I_{|V|} \end{bmatrix} \qquad (4.1)

The normalized Laplacian matrix provides an approximate solution for finding the normalized cut of the graph, which contributes to providing a good graph partition [31]. The next step is to define the spectral embedding space of N_B^{UR}. Spectral embedding reduces the horizontal dimension of the matrix, which can have a maximum size of |Σ|² × |I| in a bi-adjacency matrix. Spectral embedding creates reduced-size feature vectors so that the clustering and classification algorithms can process them faster. The spectral embedding space is obtained by applying eigendecomposition to create a new embedding space and taking the leading eigenvectors of the adjacency matrix. Since the bi-adjacency matrix is not square, we apply Singular Value Decomposition (SVD) as an equivalent process to eigendecomposition [58]. We then sort the singular values and choose the d leading values, where d ≤ min(|U|, |V|), and the corresponding columns of M. The target dimension parameter d is set based on prior knowledge and the dataset properties. Finally, we return the selected d columns of M (of size |U| × d), defining the spectral embedding of each event sequence.
Example. In the example in Figure 4.1, we first add a regularization factor α = 0.01 to all the elements of the 4 × 4 bi-adjacency matrix. On the graph, this corresponds to tiny edges between every pair of vertices across the two sets, represented as dotted green edges. After that, we calculate the normalized bi-adjacency matrix and perform singular value decomposition. Since d = 2 in the example, the result is a reduced-size 4 × 2 matrix.
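A minimal numpy sketch of this third step, mirroring Algorithm 5 under the assumption of dense matrices (function and parameter names are illustrative), is:

    import numpy as np

    def spectral_embedding(B, d, alpha=0.01):
        # Regularize, normalize, and keep the d leading left singular vectors.
        B_reg = B + alpha                              # add alpha to every entry (line 1)
        row_deg = B_reg.sum(axis=1)                    # degrees of the sequence nodes
        col_deg = B_reg.sum(axis=0)                    # degrees of the relation nodes
        N = (B_reg / np.sqrt(row_deg)[:, None]) / np.sqrt(col_deg)[None, :]   # line 2
        M, sigma, Wt = np.linalg.svd(N, full_matrices=False)                  # line 3
        return M[:, :d]                                # one d-dimensional vector per sequence

For instance, calling spectral_embedding(B, d=2, alpha=0.01) on the bi-adjacency matrix of Figure 4.1 yields one two-dimensional feature vector per event sequence.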

Similarity acquisition by personalized PageRank


The personalized PageRank procedure used for classification and clustering is the same: it clusters the nodes with high similarity, or classifies an unknown node into the label of the most similar node in the training data. The reason why random-walk or PageRank-based techniques have recently gained popularity in many applications is that they are ideal for removing purely local decisions and reaching a global decision beyond the local neighborhood of the starting vertex [36]. Many algorithms exploit such similarity scores, and we chose state-of-the-art ones. However, we note that it is also possible to apply other types of algorithms that return similarity scores between the nodes in the same way, as explained in Section 2.1.4.
Clustering. Both top-down and bottom-up hierarchical algorithms can be applied with the similarity scores [59, 36], and here we use the bottom-up method suggested in [36]. This method first obtains the similarity scores of every node with respect to a randomly selected node. After getting the similarity scores, the algorithm uses agglomerative clustering [60] to divide the nodes into two parts, starting from the node having the highest similarity score to the randomly selected node. The graph modularity Q can be calculated as follows:

Q = \frac{1}{4m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),

where A_{ij} is an element of the adjacency matrix A, k_i = \sum_j A_{ij}, c_i is the cluster to which node i belongs, and δ(·) is a function returning 0 if the clusters of the input nodes differ. The algorithm calculates the modularity Q to check the benefit of moving the most similar node into the same cluster as the selected node. The algorithm does this for all nodes, and finally we obtain two separate parts. It continues to form clusters until we reach single-node clusters.
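The similarity scores themselves can be obtained by a standard power iteration over the bipartite graph. The following is a minimal sketch under simplifying assumptions (dense matrices, a fixed number of iterations, illustrative names); it is not the exact implementation used in the experiments.

    import numpy as np

    def personalized_pagerank(B, teleport, damping=0.85, n_iter=100):
        # B: |U| x |V| bi-adjacency matrix; teleport: probability vector over all |U| + |V| nodes.
        n_u, n_v = B.shape
        A = np.block([[np.zeros((n_u, n_u)), B],
                      [B.T, np.zeros((n_v, n_v))]])   # full adjacency of the bipartite graph
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0                            # guard against isolated nodes
        P = A / deg                                    # row-stochastic transition matrix
        r = np.full(n_u + n_v, 1.0 / (n_u + n_v))      # initial score vector
        for _ in range(n_iter):
            r = damping * (P.T @ r) + (1 - damping) * teleport
        return r                                       # similarity scores w.r.t. the teleport set

For clustering, the teleport vector is a one-hot vector on the randomly selected sequence node; the scores of the remaining nodes then serve as the similarities used by the hierarchical procedure above.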

Algorithm 6: personalizedPageRankClassification

Data: B: a bi-adjacency matrix of intervals where B ∈ R^{|U|×|V|},
      l: vector of training true labels,
      d: damping factor
Result: result: classification result (labels)

1  U ← unknown nodes from the graph B
2  for c ∈ unique(l) do
3      u_i ← 1 for every training node i with label c
4      normalize u such that ‖u‖_1 = 1
5      R_c ← PersonalizedPageRank(B, u, d)
6  for i ∈ U do
7      X_i ← argmax_c(R_{c,i})
8  return X[U]

Classification. This method performs classification by calculating the similarity scores per class, rather than per node, by using every node of each known class as the teleport set [61]. After calculating the similarity scores, we obtain the estimated label of a test instance (node) by looking at the similarity scores of each class and taking the class with the maximum score for the target node. This procedure is described in Algorithm 6. One difference from the classification using spectral embedding is that it directly uses the graph structure itself and does not transform the bi-adjacency matrix any further. Instead, it simply follows the edges to obtain the scores of the nodes most similar to the unknown ones.
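Reusing the personalized_pagerank sketch above, Algorithm 6 can be expressed roughly as follows; the indices are assumed to be 0-based positions of the sequence nodes, and the names are illustrative.

    import numpy as np

    def ppr_classify(B, train_ids, train_labels, test_ids, damping=0.85):
        # One personalized PageRank run per class, teleporting to the training sequences
        # of that class; each test sequence receives the class with the highest score.
        n_u, n_v = B.shape
        scores = {}
        for c in np.unique(train_labels):
            class_nodes = [i for i, y in zip(train_ids, train_labels) if y == c]
            teleport = np.zeros(n_u + n_v)
            teleport[class_nodes] = 1.0 / len(class_nodes)   # uniform over class-c training nodes
            scores[c] = personalized_pagerank(B, teleport, damping)
        return [max(scores, key=lambda c: scores[c][i]) for i in test_ids]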

Chapter 5

Empirical experiments

This chapter presents our empirical experiments. Section 5.1 describes how the datasets used in our experiments were collected, and Section 5.2 describes their properties. Section 5.3 describes the experimental design used to evaluate the suggested method, as well as the software tools, mainly libraries and languages, used for our experiments. Sections 5.4 and 5.5 give the results of our empirical experiments for clustering and classification on five real-world event sequence datasets.

5.1 Data collection


Since our suggested model involves various heuristic methods, the research questions regarding efficiency and effectiveness must be verified by empirical results on collected data. Therefore, the selection of data is essential for the reliability and validity of our research. The datasets should be varied enough to explore our experiment in different ways so that its efficiency and effectiveness can be generalized. Thus, we secured open event sequence datasets from as many fields as possible. Besides, the selected datasets have diverse characteristics not only in terms of application field but also in terms of data properties such as size and number of classes, which helps generalize and verify our algorithm. The data properties of each dataset can be found in Section 5.2. The characteristics of each dataset are as follows:

• BLOCKS: The event labels in each sequence represent the essential


visual elements of people stacking blocks (e.g., contacts blue or red,
attached hand red). In this dataset, an event sequence represents a
single large action composed of the visual elements (e.g., pickup) or

part of a scenario (e.g., assembly) that each basic element is intended to


represent.

• SKATING: The event labels show the muscle activity and leg position
of six professional inline speed skaters while performing a control test
at seven speeds on the treadmill. The event sequence represents one
element of the complete movement cycle.

• PIONEER: This dataset is based on the Pioneer-1 dataset available in


the UCI repository. It contains event intervals describing the discretized
input provided by the robot sensor. The event sequence represents one
of the robot’s three-movement scenarios.

• HEPATITIS: The dataset contains a sequence of patients suffering from


hepatitis B or C. The result of 63 regular tests corresponds to the event
labels. Each event sequence describes a series of tests performed on the
patient.

• CONTEXT: The event labels are from the categorical and numerical
data that describe the situation of mobile devices that people carry in
various situations. The event sequence represents a part of a possible
scenario, such as being in the street or having a meeting.

5.2 Data properties

Table 5.1: A summary of the properties of the real-world datasets: sequences and intervals.

Dataset      No. of         No. of         No. of           Average           Average
             e-sequences    event labels   event intervals  interval length   e-sequence length
BLOCKS       210            8              1,207            5.75              54.13
PIONEER      160            92             8,949            55.93             57.19
CONTEXT      240            54             19,355           80.65             191.85
SKATING      530            41             23,202           43.78             1,916.08
HEPATITIS    498            63             53,921           108.28            3,193.55

We used five public datasets collected from different application domains.


Tables 5.1 and 5.2 present the properties of the datasets that mainly affect the effectiveness and efficiency of interval analysis algorithms. The datasets are sorted

Table 5.2: A summary of the properties of the real-world datasets: temporal relations.

Dataset      No. of unique temporal relations    No. of total temporal relations
BLOCKS       174                                 3,245
PIONEER      26,429                              252,986
CONTEXT      6,723                               804,504
SKATING      4,844                               516,272
HEPATITIS    20,865                              3,785,167

from simple to complex. Firstly, the BLOCKS dataset is the lightest dataset we collected in terms of relation size. Since there are only eight event labels, the maximum number of label pairs the dataset can form is just 8 × 8 = 64. On the other hand, the PIONEER, CONTEXT, and SKATING datasets have similar medium-sized characteristics, but each highlights different aspects. The PIONEER dataset has even fewer sequences than BLOCKS, but it has the largest number of event labels (92), so its maximum number of possible temporal relations is also the highest (Table 5.2). On the other hand, CONTEXT and SKATING have fewer possible relations than PIONEER, but their numbers of event sequences and intervals increase the total searchable space. The most significant difference between CONTEXT and SKATING is that the event sequence length in time points of SKATING is about ten times larger than that of CONTEXT (Table 5.1), which makes algorithms based on time points, such as IBSM, very slow. Finally, HEPATITIS is a dataset with both many event labels and many sequences; all algorithms take the most time on it. HEPATITIS can thus show how our algorithm improves on complex data.

5.3 Experimental design


We demonstrated the applicability of our suggested algorithm representation
on five real-world datasets for clustering and classification and compared it
against two state-of-the-art competitors for the task of clustering and three for
classification. For repeatability purposes, our datasets and Python code can
be found in our GitHub repository∗.
The experiment process is mainly similar to the one shown in Figure 5.1

https://github.com/zedshape/zembedding

Figure 5.1: An experimental process of the clustering and classification task using our suggested method. The common process (construction of the hash table and conversion to a bipartite graph) is followed either by Task 1-1 (spectral embedding of the bipartite graph and application of clustering/classification methods) or by Task 1-2 (graph-based clustering/classification via similarity acquisition by personalized PageRank).

for both clustering and classification problems. Detailed experimental plans


for each item may be different, which will be explained in subsequent sections.

5.3.1 Experiment 1: clustering


Process
We applied the process described in Section 4.2. We divided it into two main parts, as we offer both traditional clustering methods and a graph-based clustering method built on the bipartite structure. Steps 1-2 are applied to both tasks, while spectral embedding is only applied for the traditional clustering methods. Our experiment process for the clustering task is as follows:

Common process

1. Construction of hash table: Form a hash table by traversing the given event sequence dataset and converting it into graph form.

2. Conversion to a bipartite graph: When converting to a bipartite graph, three parameters are available. However, since clustering is unsupervised learning, parameter tuning cannot be performed. We therefore ran this step without applying any constraints.

The tasks described below are run separately after finishing the common process.

Task 1-1. Spectral embedding


1. Spectral embedding of the bipartite graph: When applying spectral
embedding, we fixed the dimensions needed for embedding to 4 for
BLOCKS and 8 for the rest of the datasets.

2. Application of traditional clustering methods: For clustering, we


benchmarked k-means and k-medoids under the Euclidean distance in
the embedding space.
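For illustration, a minimal scikit-learn sketch of this step could look as follows; the function name is hypothetical, and features and labels are assumed to come from Step 3 and the dataset's ground truth, respectively.

    import numpy as np
    from sklearn.cluster import KMeans

    def average_kmeans_purity(features, labels, n_runs=100):
        # k-means in the embedding space; purity averaged over independent runs
        # (features: |D| x d spectral embedding, labels: ground-truth classes as integers).
        labels = np.asarray(labels)
        n_clusters = len(np.unique(labels))   # expected clusters = number of class labels
        purities = []
        for seed in range(n_runs):
            assignment = KMeans(n_clusters=n_clusters, n_init=10,
                                random_state=seed).fit_predict(features)
            purities.append(sum(np.bincount(labels[assignment == m]).max()
                                for m in np.unique(assignment)) / len(labels))
        return np.mean(purities)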

Task 1-2. Graph-based clustering


1. Similarity acquisition by personalized PageRank: The bottom-up
hierarchical method based on personalized PageRank in Section 4.2.3
is used.

Datasets
We used all five datasets we prepared (BLOCKS, PIONEER, CONTEXT,
SKATING, and HEPATITIS).

Competitor methods
We compared our model to two alternative state-of-the-art distance functions,
Artemis and IBSM. We developed those algorithms in the same programming
language for a fair comparison. We ran those on the same datasets in the same
working environment.

Evaluation metrics
We demonstrated the runtime efficiency of our suggested model and its
applicability to the clustering tasks by reporting clustering purity. For our proposed algorithm as well as the competitors, clustering purity values are derived by performing the methods 100 times independently and averaging the values. Runtime values are also averaged in the same manner.

5.3.2 Experiment 2: classification


Process
In the common process, the first step to make a hash table structure is the same
as the process for clustering problems. However, the next step is different as

we can find and apply optimal parameters for the datasets. A detailed process
is described as follows:

Common process

1. Construction of hash table: Form a hash table by traversing the given event sequence dataset and converting it into graph form.

2. Conversion to a bipartite graph: When converting to a bipartite graph, three parameters are available. Unlike clustering, classification is supervised learning, so we applied a grid search to find the optimal parameters on a part of our dataset. Since we need to use the same datasets for all competitors, we applied this grid search with the simplest 1-NN classifier to achieve generality, i.e., so that the parameters are not overfitted to a stronger classifier such as an SVM.

The tasks described below are run separately after finishing the common
process.

Task 1-1. Spectral embedding

1. Spectral embedding of the bipartite graph: When applying spectral


embedding, we fixed the dimensions needed for embedding to 4 for
BLOCKS and 8 for the rest of the datasets.

2. Application of traditional classification methods: For classification,


we benchmarked four different classifiers using the feature space derived
from our suggested algorithm, i.e., 1-NN, RF, SVM with the Radial
Basis Function (RBF) kernel (SVM_RBF), and SVM with the 3-degree
Polynomial kernel (SVM_Poly).
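As an illustration, a minimal scikit-learn sketch of this step with the 10-fold cross-validation used in the evaluation could look as follows; the function name is hypothetical, and features and labels are assumed to come from Step 3 and the ground truth, respectively.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    def benchmark_classifiers(features, labels):
        # 10-fold cross-validated accuracy for the four classifiers used in the experiments,
        # run on the spectral-embedding feature vectors.
        classifiers = {
            "1-NN":     KNeighborsClassifier(n_neighbors=1),
            "RF":       RandomForestClassifier(n_estimators=100),
            "SVM_RBF":  SVC(kernel="rbf"),
            "SVM_Poly": SVC(kernel="poly", degree=3),
        }
        return {name: cross_val_score(clf, features, labels, cv=10).mean()
                for name, clf in classifiers.items()}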

Task 1-2. Graph-based classification

1. Similarity acquisition by personalized PageRank: The personalized PageRank-based classification method described in Section 4.2.3 (Algorithm 6) is used.

Datasets
For this experiment, we used all five datasets we prepared.

Competitor methods
We here again compared our model to two alternative state-of-the-art distance
functions, Artemis and IBSM. For classification, for completeness, we also compared against the state-of-the-art algorithm called STIFE, an RF feature-based classifier for event sequences, which can only handle classification problems. We developed these three algorithms in the same programming language for a fair comparison and ran them on the same datasets in the same working environment.

Evaluation metrics
We demonstrated the runtime efficiency of our suggested model as well as its
applicability to the tasks of classification by reporting classification accuracy.
For our proposed algorithm and competitors, classification accuracy values
are derived by performing 10-fold cross-validation and averaging the values.
Runtime values are also averaged in the same manner.

5.3.3 Runtime efficiency


The runtime results are reported together with all the experimental cases of clustering purity (Section 5.3.1) and classification accuracy (Section 5.3.2) mentioned above. We measure the total time from data input until the classification or clustering result is produced. For repeated executions of the same experiment or k-fold cross-validation, we report the average time, calculated by dividing the total time by the number of runs.

5.3.4 Software tools


All algorithms were implemented in Python 3.7 and run on an Ubuntu 18.10
system with Intel i7-8700 CPU at 3.20GHz and 32GB main memory. The
dependency information for the library used in our experiment can be found
in the Github repository. Note that traditional clustering and classification
algorithms, which run after our process (Step 1-3 in Section 4.2), were
implemented using the scikit-learn library.
In this chapter, we present the results and discuss them. If competitor
algorithms required hyperparameters, we followed the parameter setup defined
by the authors of each paper, or the ones they used in their paper for a fair
comparison. Throughout the spectral embedding process, the resulting feature

Table 5.3: Clustering results for all competitors in terms of clustering purity (%) and runtime (seconds).

             Artemis                 IBSM
             K-medoids               K-means                K-medoids
Dataset      Purity     Time         Purity     Time        Purity     Time
BLOCKS       85.62      1.20         95.30      0.71        99.09      10.57
PIONEER      66.13      15.64        63.94      4.41        64.09      74.13
CONTEXT      65.13      122.23       75.22      5.19        82.66      204.82
SKATING      36.52      180.48       70.21      286.10      -          >1h
HEPATITIS    -          >1h          67.91      444.77      -          >1h

vectors provided a compressed version of the original space, reducing it by almost 99% for all datasets, which has contributed to the high computation speedups obtained. The graph-based approaches were not as fast as the spectral embedding process, but they were still much faster than the state-of-the-art algorithms. An additional benefit of graph-based clustering and classification is that we can visually inspect how the sequences are grouped in the graph structure, since the bi-adjacency matrix generated by our hash table process is not transformed further. Using our algorithm with various classifiers, we could achieve speedups of up to a factor of 292. This is even an underestimate: in the cases where competitors did not finish within the one-hour execution time limit, our approach is at least 300 times faster.

5.4 Experiment 1. clustering


We set the expected number of clusters to the actual number of class labels in
the datasets and computed total runtime and purity values for our suggested
algorithm and all algorithms we compare to. Since Artemis and IBSM are
only calculating distances, k-means was inapplicable as we could not get any
data points representing each sequence. On the other hand, k-medoids was
generally faster than k-means because it could be run after pre-calculating
the pairwise distances between the data event sequences. To construct the
embedding space, we needed to provide the regularization factor to improve
the algorithm’s effectiveness. Here we used the same regularization factor,
α = 0.001, for every dataset.
Tables 5.3 and 5.4 show the results in terms of clustering purity and runtime. For each method, we set a one-hour limit on its runtime. We first applied each algorithm to create the feature vectors, and then k-means and k-medoids

Table 5.4: Clustering results for our algorithm in terms of clustering purity (%) and runtime (seconds).

             K-medoids               K-means                PageRank
Dataset      Purity     Time         Purity     Time        Purity     Time
BLOCKS       93.81      0.02         99.82      0.04        96.19      0.41
PIONEER      74.75      0.89         83.12      0.91        95.63      2.13
CONTEXT      77.54      1.99         82.36      2.02        61.03      3.59
SKATING      62.45      1.48         74.40      1.52        64.01      3.98
HEPATITIS    71.70      9.60         70.08      9.63        60.01      14.30

were applied respectively, except for the personalized PageRank algorithm.


The PageRank algorithm was directly applied to the bi-adjacency matrix.

Runtime comparison
In terms of runtime, our two approaches (spectral embedding and personalized PageRank) were faster than both competitors (Artemis and IBSM). In particular, Artemis could not even complete the computation within an hour on the HEPATITIS dataset. IBSM showed poor runtime performance on the datasets with long event sequences, such as SKATING and HEPATITIS. Specifically, when k-medoids was used on SKATING, IBSM was even slower than Artemis. Moreover, k-means could not complete within an hour for SKATING and HEPATITIS, while our algorithm with k-means completed in 1.52 seconds on SKATING and in 9.63 seconds on HEPATITIS.

Purity comparison
In terms of purity, our algorithm also showed remarkable results. In the k-
medoids trials, Artemis had the lowest purity values for all data sets except
for PIONEER, while IBSM showed the highest purity only on SKATING for
k-medoids, but it was about 193 times slower than our algorithm. For the
rest of the datasets, our algorithm showed the fastest runtime performance
and achieved the highest purity. In the k-means experiment, our algorithm
showed the highest purity values, except for CONTEXT. In the case of
CONTEXT, IBSM led by a slight difference by about 0.3 percent but was
also about 101 times slower than our algorithm. The personalized PageRank
algorithm shows similar results but not actively better than competitors, and
even shows worse performance in terms of purity on the most datasets except
for PIONEER. However, it shows the highest purity value on the PIONEER

Table 5.5: Classification results for all competitors in terms of classification accuracy (%) and runtime (seconds).

             Artemis (1-NN)          IBSM (1-NN)            STIFE (RF)
Dataset      Acc        Time         Acc        Time        Acc        Time
BLOCKS       99.00      1.43         100        0.77        100        2.96
PIONEER      97.50      19.27        93.75      4.43        98.75      8.51
CONTEXT      90.00      130.22       97.08      5.32        98.33      12.1
SKATING      84.00      208.79       97.74      286.24      96.42      21.4
HEPATITIS    -          >1h          77.91      445.83      82.13      83.7

dataset, suggesting that there might be cases where the graph structure is more effective than spectral embedding. We leave a closer look at this question as future work.

5.5 Experiment 2. classification


For each competitor method, we used the classifiers suggested by the authors
in the corresponding papers. Since Artemis and IBSM are distance-based
algorithms, the number of applicable algorithms is highly limited. Therefore,
for these two competitors, only a 1-NN classifier was applied. On the other
hand, since STIFE generates non-numeric feature vectors, distance-based
algorithms cannot be applied, and in this case, only RF was applied. For
STIFE, we applied the recommended optimal parameters [12]. In order to
adjust the parameters of our algorithm for each dataset, we performed a grid
search on 1-NN classification accuracy within the range of [0, 1] for each
of the three parameters {maxSup, minSup, gap}, in increments of 0.1.
The top 10 parameter settings and the experimental results are available in
Appendix A.
Unlike the existing algorithms, our algorithm can be combined with a wide range of classifiers as it produces numerical feature vectors. In this experiment, we applied the classifiers that previous methods used, such as 1-NN and RF, and we also ran two SVMs, one with an RBF kernel and one with a polynomial kernel. Furthermore, we ran the personalized PageRank algorithm for classification. Table 5.5 shows the classification accuracy and runtime for each competitor method, while Tables 5.6 and 5.7 show the results for our algorithm.

Table 5.6: Classification results for our algorithm in terms of classification accuracy (%) and runtime (seconds) with 1-NN and RF. Parameters are selected by grid search based on the 1-NN classifier.

             Constraints                        1-NN                   RF
Dataset      minSup    maxSup    gap            Acc        Time        Acc        Time
BLOCKS       0.0       0.4       0.0            100        0.02        100        0.12
PIONEER      0.0       0.7       0.1            100        1.49        100        1.62
CONTEXT      0.4       0.5       0.2            95.00      1.35        96.25      1.46
SKATING      0.5       0.6       0.1            91.32      0.98        92.07      1.10
HEPATITIS    0.0       1.0       0.1            76.30      10.83       82.13      11.27

Table 5.7: Classification results for our algorithm in terms of classification accuracy (%) and runtime (seconds) with SVM and personalized PageRank.

             SVM_Poly                SVM_RBF                PageRank
Dataset      Acc        Time         Acc        Time        Acc        Time
BLOCKS       100        0.02         100        0.02        99.52      0.66
PIONEER      100        1.50         100        1.49        94.12      1.78
CONTEXT      97.50      1.36         97.08      1.36        87.83      1.91
SKATING      93.58      0.99         92.45      0.99        87.41      1.62
HEPATITIS    83.73      10.82        83.34      11.04       79.51      9.51

Runtime and accuracy comparison


1-NN under Artemis had the longest runtime and lowest accuracy for all datasets, and for HEPATITIS, it failed to complete within the 1-hour runtime limit. On the other hand, IBSM achieved the best performance on SKATING, but it is 13 times slower than STIFE and up to 292 times slower than our algorithm. Finally, STIFE was the competitor with the highest speed and accuracy (except for SKATING). It even achieved better accuracy than our algorithm on CONTEXT and SKATING, but it was up to about 9 times slower than our algorithm on CONTEXT and 21 times slower on SKATING. The personalized PageRank algorithm shows less accurate results than spectral embedding. It only shows better accuracy than Artemis, except for PIONEER, and compared to IBSM, it only wins on PIONEER. However, PageRank-based classification still shows better runtime efficiency than all competitors.

Chapter 6

Conclusions and future work

6.1 Conclusions
We proposed a novel representation of event sequences using a bipartite graph
for efficient and effective clustering and classification. We benchmarked our
representation on five real-world datasets against several competitors. Our
experimental benchmarks showed that both proposed spectral embedding
representation and graph-based analysis could achieve substantially lower
runtimes than competitors and even higher values of purity (for clustering)
and classification accuracy (for classification) than some of its competitors.
We now revisit the research question addressed in this thesis:

To what extent can the efficiency and effectiveness of classifying and clustering event sequence datasets be improved by introducing the bipartite graph representation, compared to state-of-the-art methods that directly use event intervals?

The key measures in our research question are efficiency and effectiveness, and we performed extensive experiments to derive both measures thoroughly. First, we measured efficiency through the total runtime until the algorithm terminates, as mentioned earlier, and we confirmed through experiments that the algorithm runs faster than all existing algorithms on every dataset we prepared. Second, effectiveness was measured through purity for clustering and accuracy for classification. Here, our algorithm showed slightly different behavior for clustering and classification. Clustering

showed excellent performance for almost all datasets. However, classification


resulted in slightly lower performance on some datasets (e.g., SKATING)
compared to state-of-the-art algorithms such as STIFE or IBSM. Nevertheless,
our algorithm shows an efficiency improvement of up to 300 times,
which can be a great benefit in cases where speed is essential.
Besides, our algorithm makes a substantial contribution in that it links data analysis of event intervals, a relatively small area, to graph analysis, one of the largest and most actively studied areas. Therefore, it has the advantage that research results from the graph area can be applied directly to analyze event sequence datasets, once they are converted into graphs using our proposed algorithm. To be clear, the thesis's main algorithm is a framework that transforms event intervals into graph structures; the clustering and classification algorithms used in the thesis come from the existing graph field without any modification. Therefore, it is also possible to use other graph algorithms suited to graph clustering and classification.

It is also remarkable that the result of our proposed algorithm can be used without limitation by various machine learning algorithms beyond graph-based ones. Previously, the applicable clustering and classification algorithms were limited because the data formats returned by the competitor algorithms were limited: for instance, no data points were returned, or only non-numeric features were derived. The method proposed in this thesis allows various algorithms to be applied by returning numerical feature vectors. This will be of great help not only for the algorithms presented in this thesis but also for examining other types of state-of-the-art algorithms in the future.

As a result, this thesis brings various benefits, but all of them are only the first parts of an early-stage line of research. This means that the results can open up more possibilities in the future than they deliver at present, since we now provide a link from event intervals to graphs, which allows the area to develop further.

6.2 Limitations
Although our algorithm brings many benefits to the event sequence field, there are obvious limitations to this attempt. The most significant limitation of our algorithm is that it requires three parameters. To get the best effect this algorithm can achieve with the added parameters, we must first optimize them. This means we have to perform additional actions, such as a grid search, to optimize the parameters, which incurs a speed penalty. Of course, if speed is essential and some loss of accuracy can be tolerated, we can run the algorithm without optimizing the parameters, in which case our algorithm is still substantially faster than its competitors.
Among our two options (spectral embedding and the graph-based algorithm), the number of reduced dimensions selected in the spectral approach can also act as a parameter. Choosing a suitable number of dimensions is a well-known problem in the spectral field, and we can choose it by inspecting quantities such as eigengaps directly from the graph, but this purely reflects the number of dimensions that is optimal for the graph structure. If the number of target classes in the clustering or classification problem is given, we may still need to treat the dimensionality as a hyperparameter that is not fully understood. In this thesis, the same dimensionality was applied across the board to avoid hyperparameter optimization and to show the merit of our algorithm without it.

Finally, the graph algorithms we used in this thesis were limited. The graph field is evolving rapidly, but this thesis only introduced the most basic PageRank-based clustering and classification algorithms to show their applicability. However, applying more recent graph algorithms on top of the graph structure proposed in this thesis could yield better efficiency and effectiveness.

6.3 Future work


Future work on this thesis includes 1) extending the current bipartite graph representation to a tripartite or higher multipartite structure by computing higher-order temporal relations (arrangements), and 2) exploring alternative graph-based algorithms such as motif-based clustering.

6.3.1 Multipartite graph


In this thesis, the constructed graph is only a bipartite graph connecting the set of e-sequences with the set of pairwise relations. In practice, however, temporal arrangements can extend beyond pairwise relations by considering the pairwise relations among more than two event intervals. Using such arrangements would let us build graphs that represent each e-sequence more faithfully, at the cost of increasing the graph's complexity. In the future, graphs containing larger arrangements beyond the bipartite case could be constructed by applying frequent arrangement mining algorithms within this framework. Studying the trade-off between efficiency and effectiveness as a function of graph complexity is another possible line of research.

6.3.2 Alternative graph-based algorithm


As mentioned earlier, the graph field is considerably more active than the event sequence area. In this thesis, however, we applied only two PageRank-based algorithms to demonstrate the potential of the graph structure. Since the graph structure is general, unlike event sequence datasets, any graph-based algorithm can easily be applied to event sequence datasets once they are transformed into graphs with our proposed algorithm. Testing more sophisticated and advanced graph algorithms is therefore another promising research direction for achieving better efficiency and effectiveness.

Appendix A

Grid search results

The two tables below (Tables A.1 and A.2) report the grid search results for the three constraints of the suggested algorithm for classification. The constraint values were increased in steps of 0.1 over the range 0.0 to 1.0. In the case of a tie, rows are sorted in ascending order by minSup, maxSup, and gap.

Table A.1: Grid search results for the three constraints of the suggested algorithm for classification. The constraint values were increased in steps of 0.1 over the range 0.0 to 1.0 for each parameter. In the case of a tie, rows are sorted in ascending order by minSup, maxSup, and gap (Ranks 1-5).
Dataset     Rank   minSup   maxSup   gap    1-NN accuracy (%)
BLOCKS        1     0.0      0.4     0.0      100.00
BLOCKS        2     0.0      0.4     0.6      100.00
BLOCKS        3     0.0      0.4     0.7      100.00
BLOCKS        4     0.0      0.4     0.8      100.00
BLOCKS        5     0.0      0.4     0.9      100.00
PIONEER       1     0.0      0.7     0.1      100.00
PIONEER       2     0.0      0.7     0.3      100.00
PIONEER       3     0.0      0.7     0.5      100.00
PIONEER       4     0.0      0.8     0.6      100.00
PIONEER       5     0.0      0.8     0.7      100.00
CONTEXT       1     0.4      0.5     0.2       95.00
CONTEXT       2     0.4      0.5     0.3       95.00
CONTEXT       3     0.4      0.7     0.2       93.75
CONTEXT       4     0.4      0.7     0.3       93.33
CONTEXT       5     0.1      1.0     0.6       92.92
SKATING       1     0.5      0.6     0.1       91.32
SKATING       2     0.5      0.9     0.3       90.75
SKATING       3     0.5      0.9     0.1       90.19
SKATING       4     0.0      0.7     0.1       90.19
SKATING       5     0.0      0.9     0.1       90.00
HEPATITIS     1     0.0      1.0     0.1       76.30
HEPATITIS     2     0.0      0.9     0.4       76.29
HEPATITIS     3     0.0      0.7     0.7       76.09
HEPATITIS     4     0.1      1.0     0.4       75.93
HEPATITIS     5     0.1      0.7     0.1       75.92

Table A.2: Grid search results for the three constraints of the suggested algorithm for classification. The constraint values were increased in steps of 0.1 over the range 0.0 to 1.0 for each parameter. In the case of a tie, rows are sorted in ascending order by minSup, maxSup, and gap (Ranks 6-10).
Dataset     Rank   minSup   maxSup   gap    1-NN accuracy (%)
BLOCKS        6     0.0      0.4     1.0      100.00
BLOCKS        7     0.0      0.5     0.0      100.00
BLOCKS        8     0.0      0.5     0.5      100.00
BLOCKS        9     0.0      0.5     0.6      100.00
BLOCKS       10     0.0      0.5     0.7      100.00
PIONEER       6     0.0      0.8     0.9      100.00
PIONEER       7     0.0      0.9     0.3      100.00
PIONEER       8     0.0      0.9     1.0      100.00
PIONEER       9     0.0      0.1     0.0       99.36
PIONEER      10     0.0      0.1     0.8       99.36
CONTEXT       6     0.4      0.8     0.2       92.92
CONTEXT       7     0.4      0.8     0.3       92.92
CONTEXT       8     0.4      0.6     0.3       92.50
CONTEXT       9     0.1      0.9     0.7       91.25
CONTEXT      10     0.1      0.9     0.9       91.25
SKATING       6     0.5      0.7     0.2       89.81
SKATING       7     0.0      0.9     0.3       89.62
SKATING       8     0.5      1.0     0.4       89.62
SKATING       9     0.0      1.0     0.1       89.72
SKATING      10     0.6      0.9     0.2       89.43
HEPATITIS     6     0.0      0.8     0.9       75.91
HEPATITIS     7     0.0      0.6     1.0       75.71
HEPATITIS     8     0.1      0.5     0.4       75.71
HEPATITIS     9     0.1      0.7     0.6       75.71
HEPATITIS    10     0.1      0.9     0.7       75.71
TRITA-EECS-EX-2020:679

www.kth.se
