
Benelearn 2017: Proceedings of the Twenty-Sixth Benelux

Conference on Machine Learning

Editors: Wouter Duivesteijn, Mykola Pechenizkiy, George Fletcher, Vlado Menkovski,


Eric Postma, Joaquin Vanschoren, and Peter van der Putten

Welcome!
Benelearn is the annual machine learning conference of the Benelux. It serves as a forum for researchers to
exchange ideas, present recent work, and foster collaboration in the broad field of Machine Learning and its
applications. These are the proceedings of the 26th edition, Benelearn 2017.
Benelearn 2017 takes place largely on the campus of the Technische Universiteit Eindhoven, De Zaale, Eindhoven. The Friday programme is located in De Zwarte Doos (see https://goo.gl/maps/XgKEo7JxyTC2),
and the Saturday programme in the Auditorium (see https://goo.gl/maps/B3PnpuCjgMJ2). The conference
dinner on Friday evening is the only off-campus event; it takes place in the DAF Museum, Tongelresestraat
27, 5613 DA Eindhoven (see https://goo.gl/maps/zNLrhpSqimk).
As part of the main conference programme, we organize three special tracks: one on Complex Networks, one
on Deep Learning, and one Industry Track. Distributed over all tracks, contributing researchers not only
span all three Benelux countries, but also include affiliations from ten additional countries.
We thank all members of all programme committees for their service, and all authors of all papers for their
contributions!
Kind regards,
The Benelearn 2017 organizers

Organization
Conference Chairs: Wouter Duivesteijn, Mykola Pechenizkiy
Complex Networks Track Chair: George Fletcher
Deep Learning Track Chairs: Vlado Menkovski, Eric Postma
Industry Track Chairs: Joaquin Vanschoren, Peter van der Putten
Local Organization: Riet van Buul

1
Programme Committee Conference Track

Hendrik Blockeel, K.U. Leuven
Sander Bohte, CWI
Gianluca Bontempi, Université Libre de Bruxelles
Walter Daelemans, University of Antwerp
Tijl De Bie, Ghent University, Data Science Lab
Kurt Driessens, Maastricht University
Ad Feelders, Universiteit Utrecht
Benoît Frénay, Université de Namur
Pierre Geurts, University of Liège
Bernard Gosselin, University of Mons
Tom Heskes, Radboud University Nijmegen
John Lee, Université catholique de Louvain
Jan Lemeire, Vrije Universiteit Brussel
Tom Lenaerts, Université Libre de Bruxelles
Marco Loog, Delft University of Technology
Martijn van Otterlo, Vrije Universiteit Amsterdam
Yvan Saeys, Ghent University
Johan Suykens, KU Leuven, ESAT-STADIUS
Celine Vens, KU Leuven Kulak
Willem Waegeman, Ghent University
Marco Wiering, University of Groningen
Jef Wijsen, University of Mons
Menno van Zaanen, Tilburg University

Programme Committee Complex Networks Track

Dick Epema, Delft University of Technology
Alexandru Iosup, Vrije Universiteit Amsterdam and TU Delft
Nelly Litvak, University of Twente
Taro Takaguchi, National Institute of Information and Communications Technology
Yinghui Wu, University of California Santa Barbara
Nikolay Yakovets, Eindhoven University of Technology

Programme Committee Deep Learning Track

Bart Bakker, Philips Research
Binyam Gebre, Philips
Ulf Grossekathofer, Holst Centre and IMEC
Mike Holenderski, TU Eindhoven
Dimitrios Mavroeidis, Philips Research
Decebal Constantin Mocanu, TU Eindhoven
Elena Mocanu, TU Eindhoven
Stojan Trajanovski, Philips Research

Programme Committee Industry Track

Hendrik Blockeel, K.U. Leuven
Kurt Driessens, Maastricht University
Murat Eken, Microsoft
M. Israel, Erasmus MC and Leiden University
Arno Knobbe, Leiden University
Arne Koopman, ASML
Hugo Koopmans, DIKW Consulting
Wannes Meert, KU Leuven
Dejan Radosavljevik, T-Mobile Netherlands
Ivar Siccama, Pega
Johan Suykens, KU Leuven, ESAT-STADIUS
Jan Van Haaren, KU Leuven
Cor Veenman, Netherlands Forensic Institute
Mathias Verbeke, Sirris
Lukas Vermeer, Booking.com
Willem Waegeman, Ghent University
Jef Wijsen, University of Mons
Jakub Zavrel, TextKernel
Michiel van Wezel, Dudok Wonen

Contents
Invited Talks
Toon Calders — Data mining, social networks and ethical implications . . . . . . . . . . . . . . . . 6
Max Welling — Generalizing Convolutions for Deep Learning . . . . . . . . . . . . . . . . . . . . . 7
Jean-Charles Delvenne — Dynamics and mining on large networks . . . . . . . . . . . . . . . . . . 8
Holger Hoos — The transformative impact of automated algorithm design: ML, AutoML and beyond 9

2
Conference Track
Research Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
L.F.J.M. Kanters — Extracting relevant discussion from Reddit Science AMAs . . . . . . . . 11
I.G. Veul — Locally versus Globally Trained Word Embeddings for Automatic Thesaurus Con-
struction in the Legal Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Rianne Conijn, Menno van Zaanen — Identifying writing tasks using sequences of keystrokes 28
Lars Lundberg, Håkan Lennerstad, Eva Garcia-Martin, Niklas Lavesson, Veselka Boeva —
Increasing the Margin in Support Vector Machines through Hyperplane Folding . . . . 36
Martijn van Otterlo, Martin Warnaar — Towards Optimizing the Public Library: Indoor
Localization in Semi-Open Spaces and Beyond . . . . . . . . . . . . . . . . . . . . . . 44
Antoine Adam, Hendrik Blockeel — Constraint-based measure for estimating overlap in clus-
tering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Extended Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Thijs van de Laar, Bert de Vries — A Probabilistic Modeling Approach to Hearing Loss Com-
pensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Anouk van Diepen, Marco Cox, Bert de Vries — An In-situ Trainable Gesture Classifier . . . 66
Marcia Fissette, Bernard Veldkamp, Theo de Vries — Text mining to detect indications of
fraud in annual reports worldwide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Veronika Cheplygina, Lauge Sørensen, David M.J. Tax, Marleen de Bruijne, Marco Loog —
Do you trust your multiple instance learning classifier? . . . . . . . . . . . . . . . . . 72
Marco Cox, Bert de Vries — A Gaussian process mixture prior for hearing loss modeling . . . 74
Piotr Antonik, Marc Haelterman, Serge Massar — Predicting chaotic time series using a
photonic reservoir computer with output feedback . . . . . . . . . . . . . . . . . . . . . 77
Piotr Antonik, Marc Haelterman, Serge Massar — Towards high-performance analogue readout
layers for photonic reservoir computers . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Niek Tax, Natalia Sidorova, Wil M.P. van der Aalst — Local Process Models: Pattern Mining
with Process Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Christina Papagiannopoulou, Stijn Decubber, Willem Waegeman, Matthias Demuzere, Niko
E.C. Verhoest, Diego G. Miralles — A non-linear Granger causality approach for un-
derstanding climate-vegetation dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Dounia Mulders, Michel Verleysen, Giulia Liberati, André Mouraux — Characterizing Resting
Brain Activity to Predict the Amplitude of Pain-Evoked Potentials in the Human Insula 89
Quan Nguyen, Bert de Vries, Tjalling J. Tjalkens — Probabilistic Inference-based Reinforce-
ment Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Veselka Boeva, Milena Angelova, Elena Tsiporkova — Identifying Subject Experts through
Clustering Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Michael Stock, Bernard De Baets, Willem Waegeman — An Exact Iterative Algorithm for
Transductive Pairwise Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Sergio Consoli, Jacek Kustra, Pieter Vos, Monique Hendriks, Dimitrios Mavroeidis — To-
wards an automated method based on Iterated Local Search optimization for tuning the
parameters of Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Jacopo De Stefani, Gianluca Bontempi, Olivier Caelen, Dalila Hattab — Multi-step-ahead
prediction of volatility proxies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Tom Viering, Jesse Krijthe, Marco Loog — Generalization Bound Minimization for Active
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Jesse H. Krijthe, Marco Loog — Projected Estimators for Robust Semi-supervised Classification 110
Dimitris Paraschakis — Towards an Ethical Recommendation Framework . . . . . . . . . . . 112
Björn Brodén, Mikael Hammar, Bengt J. Nilsson, Dimitris Paraschakis — An Ensemble Rec-
ommender System for e-Commerce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Sara Magliacane, Tom Claassen, Joris M. Mooij — Ancestral Causal Inference . . . . . . . . 118
Martin Atzmueller — Exceptional Model Mining in Ubiquitous and Social Environments . . . 121

3
Sibylle Hess, Katharina Morik, Nico Piatkowski — PRIMPing Boolean Matrix Factorization
through Proximal Alternating Linearized Minimization . . . . . . . . . . . . . . . . . . 124
Sebastijan Dumančić, Hendrik Blockeel — An expressive similarity measure for relational
clustering using neighbourhood trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Complex Networks Track


Extended Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leonardo Gutiérrez Gómez, Jean-Charles Delvenne — Dynamics Based Features for Graph
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Dounia Mulders, Cyril de Bodt, Michel Verleysen, Johannes Bjelland, Alex Pentland, Yves-
Alexandre de Montjoye — Improving Individual Predictions using Social Networks
Assortativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
W.X. Wilcke, V. de Boer, F.A.H. van Harmelen — User-Driven Pattern Mining on knowledge
graphs: an Archaeological Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Marijn ten Thij, Sandjai Bhulai — Harvesting the right tweets: Social media analytics for the
Horticulture Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Leto Peel — Graph-based semi-supervised learning for complex networks . . . . . . . . . . . . 143
Martin Atzmueller, Lisa Thiele, Gerd Stumme, Simone Kauffeld — Contact Patterns, Group
Interaction and Dynamics on Socio-Behavioral Multiplex Networks . . . . . . . . . . . 145

Deep Learning Track


Research Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Julia Berezutskaya, Zachary V. Freudenburg, Nick F. Ramsey, Umut Güçlü, Marcel A.J. van
Gerven — Modeling brain responses to perceived speech with LSTM networks . . . . . 149
Stefan Thaler, Vlado Menkovski, Milan Petković — Towards unsupervised signature extraction
of forensic logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Extended Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jakub M. Tomczak, Max Welling — Improving Variational Auto-Encoders using convex com-
bination linear Inverse Autoregressive Flow . . . . . . . . . . . . . . . . . . . . . . . . 162
Jim Clauwaert, Michiel Stock, Marjan De Mey, Willem Waegeman — The use of shallow
convolutional neural networks in predicting promotor strength in Escherichia coli . . . 165
Nanne van Noord — Normalisation for painting colourisation . . . . . . . . . . . . . . . . . . 168
Niek Tax, Ilya Verenich, Marcello La Rosa, Marlon Dumas — Predictive Business Process
Monitoring with LSTMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Decebal Constantin Mocanu, Elena Mocanu, Phuong H. Nguyen, Madeleine Gibescu, Antonio
Liotta — Big IoT data mining for real-time energy disaggregation in buildings . . . . 173

Industry Track
Research Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Maria Biryukov — Comparison of Syntactic Parsers on Biomedical Texts . . . . . . . . . . . 176
Extended Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lodewijk Nauta, Max Baak — Eskapade: a lightweight, python based, analysis framework . . 183
Dirk Meijer, Arno Knobbe — Unsupervised region of interest detection in sewer pipe images:
Outlier detection and dimensionality reduction methods . . . . . . . . . . . . . . . . . 184
Dejan Radosavljevik, Peter van der Putten — Service Revenue Forecasting in Telecommuni-
cations: A Data Science Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Michiel van Wezel — Predicting Termination of Housing Rental Agreements with Machine
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Martin Atzmueller, David Arnu, Andreas Schmidt — Anomaly Analytics and Structural As-
sessment in Process Industries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

4
Invited Talks

5
Data mining, social networks and ethical implications

Toon Calders [email protected]


Universiteit Antwerpen

Abstract
Recently we have seen a remarkable increase
of awareness of the value of data. Whereas
companies and governments mainly used to
gather data about their clients just to sup-
port their operations, nowadays they are ac-
tively exploring new applications. For in-
stance, a telecom operator may use call data
not only to bill its customers, but also to
derive social relations between its customers
which may help to improve churn models,
and governments use mobility data to chart
mobility patterns that help to assess the im-
pact of planned infrastructure works. I will
give an overview of my research in this fas-
cinating area, including pattern mining, the
analysis of influence propagation in social
networks, and ethical challenges such as mod-
els that discriminate.

Appearing in Proceedings of Benelearn 2017. Copyright


2017 by the author(s)/owner(s).

6
Generalizing Convolutions for Deep Learning

Max Welling [email protected]


Universiteit van Amsterdam
University of California Irvine
Canadian Institute for Advanced Research

Abstract
Arguably, most excitement about deep learn-
ing revolves around the performance of con-
volutional neural networks and their ability
to automatically extract useful features from
signals. In this talk I will present work from
AMLAB where we generalize these convolu-
tions. First we study convolutions on graphs
and propose a simple new method to learn
embeddings of graphs which are subsequently
used for semi-supervised learning and link
prediction. We discuss applications to recom-
mender systems and knowledge graphs. Sec-
ond we propose a new type of convolution
on regular grids based on group transforma-
tions. This generalizes normal convolutions
based on translations to larger groups includ-
ing the rotation group. Both methods often
result in significant improvements relative to
the current state of the art.
Joint work with Thomas Kipf, Rianne van
den Berg and Taco Cohen.

Appearing in Proceedings of Benelearn 2017. Copyright


2017 by the author(s)/owner(s).

7
Dynamics and mining on large networks

Jean-Charles Delvenne [email protected]


Université catholique de Louvain

Abstract
A network, i.e. the data of nodes connected
by edges, often comes as the support of dy-
namical interactions. For example a social
network is often measured as the trace of
an information flow (phone calls, messages),
energy and phase information flow through
power networks, biochemical networks are
the skeleton of complex reaction systems,
etc. It is therefore natural to mine network-
shaped data jointly with a real or modelled
dynamics taking place on it. In this talk
we review how dynamics can provide efficient
and accurate methods for community detec-
tion, classification, centrality and assortativ-
ity measures.

Appearing in Proceedings of Benelearn 2017. Copyright


2017 by the author(s)/owner(s).

8
The transformative impact of automated algorithm design: ML,
AutoML and beyond

Holger Hoos [email protected]


Universiteit Leiden

Abstract
Techniques from artificial intelligence — and
especially, machine learning — are funda-
mentally changing the way we solve challeng-
ing computational problems, and recently,
automated machine learning (AutoML) has
begun to take this to a new level. In this
talk, I will share my perspective on the suc-
cess of ML and AutoML, and discuss how
the fundamental concepts and tools that en-
able both have a much broader impact than
commonly perceived. In particular, I will
highlight the role of a fruitful interplay be-
tween machine learning and optimisation in
this context, comment on general approaches
to automated algorithm design, and share my
thoughts on the next big challenge.

Appearing in Proceedings of Benelearn 2017. Copyright


2017 by the author(s)/owner(s).

9
Conference Track
Research Papers

10
Extracting relevant discussion from Reddit Science AMAs

L.F.J.M. Kanters [email protected]


Radboud University, Nijmegen, the Netherlands

Keywords: text mining, spam detection, reddit, naive Bayes

Abstract

The social network and content aggregation website Reddit occasionally hosts Q&A sessions with scientists, called science AMAs (Ask Me Anything). These science AMAs are conducted through the comment system of Reddit, which has a tree structure, mark-up and community-driven feedback on both users and comments in the form of "karma" scores.

Most of the actual discussion in these science AMAs tends to be of high quality. However, a large number of the comments are superfluous and not really part of the conversation with the scientist. The goal of this project is to determine if text mining methods can be used to filter out the unwanted comments. A secondary goal is to determine the relative importance of Reddit meta-data (tree structure, karma scores, etc.) compared to the actual content of the comments.

The Python Reddit API was used to retrieve the AMAs. The CoreNLP tools were used to extract tokens, sentences, named entities and sentiment. These were combined with other information, like Reddit meta-data and WordNet, and used to extract features. The classification was done by a Gaussian naive Bayes classifier using the scikit-learn toolbox.

Classification using all features or only text-based features was effective, both yielding a precision/recall/f1-score of 0.84/0.99/0.91. Only using Reddit-based features was slightly less effective, yielding 0.89/0.63/0.74. Only using a single WordNet-based similarity feature still worked, yielding 0.81/0.99/0.89.

1. Introduction

On reddit there is the tradition of the Ask Me Anything, or AMA, threads. These are a kind of informal interview or online Q&A session with whomever started the thread (the OP or original poster); anybody can participate and ask questions. For about 3 years the /r/science subreddit, a subforum dedicated to science, has been doing AMAs with scientists as a kind of science outreach. As a result there are now well over 600 different online AMAs with scientists, covering a wide variety of subjects, with more being done each week. The strict moderation in /r/science has resulted in a subreddit culture that tends towards serious discussion which, combined with the enthusiasm of the scientists involved, yields AMAs of an exceptionally high quality. The informal nature of reddit allows lay-people easy access, while also allowing for more in-depth questions. The hierarchical structure of the comment section, as well as the lack of time constraints in an AMA (a particularly enthusiastic OP might still be answering questions days later), encourages follow-up discussion. And the /r/science community has a decent number of scientists among its members, who are recognizable due to flair next to their username, so experts other than the OP are likely to join in the discussion. Despite this, large parts of these AMAs are still superfluous, consisting of unanswered questions and tangential discussions; there is clearly a lot of knowledge to be found in these AMAs, but some manner of filtering might be required first.

In order to archive these AMAs, and to assign them a DOI so they can actually be referenced in scientific literature, the Winnower has been copying parts of these AMAs to their own website¹. Some of the larger AMAs can end up having many thousands of comments, with only a tiny fraction of them actually being worth archiving.

¹ https://www.thewinnower.com/topics/science-ama

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).
So the Winnower applies a filter to these AMAs, but it is rather crude: they take every comment that is at the 2nd level of the comment tree and is made by the scientist being interviewed, which tend to be answers to questions, together with their parent comment, which would contain the question. Everything else, including follow-up discussion, is discarded.

The primary research question of this paper is to what extent it is possible, using text-mining, to distinguish between informative & relevant comments and the rest. A secondary research question is to what extent reddit meta-data and comment hierarchy are necessary for this classification.

2. Background

2.1. Reddit

Reddit (Reddit, 2016) is a social aggregation website where users can submit content (either links to a webpage or a piece of text), rate this content, comment on it, and of course, consume it. It is quite a large site: in 2015² alone it received 73.15 billion submissions and 73.15 billion comments written by 8.7 million users.

The frontpage of reddit, the entry point to the website, consists of a list of submission titles ordered by popularity. Popularity of a submission is determined by user feedback in the form of upvotes and downvotes fed into an unknown stochastic function yielding a score. Official reddit etiquette³ states that one should vote based on whether or not "you think something contributes to the conversation", though in practice voting is also based on agreement or amusement. Further user feedback is possible by "gilding", which costs the gilder $4, confers benefits to the poster of the gilded content and places a little golden star medallion next to the submission. Each piece of content is submitted to a specific subreddit, which functions both as a category and a community of sorts; a user can subscribe to subreddits, and their personal frontpage is a composite of the subreddits they are subscribed to. Usually when mentioning a subreddit the name is preceded by '/r/', because Reddit automatically turns such a mention into a link to the subreddit.

Each submission has an associated comment section where users can have a conversation. The conversation in a reddit comment section is a tree, each comment being either a direct reply to the original submission or to another comment in the thread. Just like a submission, the comments themselves can also be voted upon or be gilded, and will usually be displayed in order of popularity. The user who started the thread, by submitting the piece of content the thread pertains to, is referred to as OP, short for original poster.

All the upvotes and downvotes for every submission and comment of a user are combined into a link and a comment karma, by subtracting the sum of downvotes from the sum of upvotes. Each user's karma is publicly visible and tends to be used, in combination with the length of time that user has been a redditor, as an informal indication of a user's reliability.

Moderation of reddit is generally handled by volunteer moderators, or mods, with responsibilities and permissions limited to the subreddits they moderate. Moderation policy varies from subreddit to subreddit. Among the tools for mods are deletion of content, adding flair, banning and shadow-banning. Flair is a short (64 characters) piece of text that can be used to customize submissions and users. User flair can be set by the user or a mod (depending on subreddit policy) and will be shown next to the user's username on submissions and comments. Submission flair can be set by the user who submitted the content or a mod (again depending on subreddit policy) and will be shown next to the submission title.

In /r/science submission flair is used to categorize submissions by field, which is a fairly typical use for submission flairs in reddit. User flair policy in /r/science is quite unique: the mods use user flair to indicate the academic qualifications of a user (for example it might say "BS | Artificial Intelligence"), and these qualifications are verified by the mods. The /r/science mods call this their "science verified user program"⁴, the intention of which is to allow readers to distinguish between "educated opinions and random comments"; verified users are also kept to a higher standard of conduct.

2.2. Related work

Weimer et al. (Weimer et al., 2007) did work on automatically assessing post quality in online forums; they attempt to assess the usefulness of different types of features, some of which are based on forum metadata. This is rather similar to the work in this paper. Their result was that classification based on feature sets lacking any forum-based features performs slightly worse than classification based on sets including those features. Their work uses annotation based on user feedback through built-in features of the forum software, and the general goal underlying the classification differs a bit from what is being done here.

² https://redditblog.com/2015/12/31/reddit-in-2015/
³ https://www.reddit.com/wiki/reddiquette
⁴ https://www.reddit.com/r/science/wiki/flair
Siersdorfer et al. (Siersdorfer et al., 2010) did a similar study based on youtube comments. It may be worth noting that this study is from before youtube switched to google+ comments, when it was still feasible to moderate comment sections. The interesting thing here is that they found a significant correlation between the sentiment of a comment, as analyzed by SentiWordNet, and the scores users attributed to comments. Though again, just as with Weimer et al., the point of the classification is a bit different from what is being done here.

On a slightly different note, Androutsopoulos et al. (Androutsopoulos et al., 2000) compare the performance of a naive Bayes classifier on spam detection. Their pipeline is fairly simple: the most complex configuration used in their work merely employs a lemmatizer and a stop-list, though despite this it manages to get good recall and precision on their email corpus. The use of a naive Bayes classifier is especially interesting since its transparent decision making process would allow one to easily assess the impact of each feature.

3. Methods

3.1. Data

The data was taken from the following two reddit AMAs:

• Hi reddit! Im Alice Jones, an expert on antisocial behaviour and psychopathy at Goldsmiths, University of London. I research emotion processing and empathy, focusing on childhood development and education. AMA!⁵ with 239 comments (172 used).

• Hi, my name is Paul Helquist, Professor and Associate Chair of Chemistry & Biochemistry, at the University of Notre Dame. Ask me anything about organic synthesis and my career.⁶ with 234 comments (121 used).

The data was annotated manually based on whether or not a given comment was informative & relevant and would therefore be worth keeping, as if the annotator was editing down an interview. The information available to the annotator was the text of the comments themselves as well as the comment hierarchy. Normally in reddit sibling comments would be ordered by popularity, but during annotation this ordering was randomized. Comments without any replies were not shown during annotation and were assumed to be not worth keeping, since they cannot be part of any discussion or question/answer pairs.

The data was annotated by two different annotators who each annotated all comments presented to them. The Cohen's kappa is κ = 0.45, indicating merely moderate agreement. In the interest of preserving as many relevant comments as possible, a comment is considered worth keeping if at least one of the annotators thought it was worth keeping.

The AMA by Alice Jones was used as the training set and the AMA by Paul Helquist as the test set.

3.2. Pipeline

3.2.1. Data gathering

The data was gathered using the Python Reddit API (Praw-dev, 2016), which allows one to do and see from a Python script essentially everything one would be able to do or read using the reddit website. This was used to gather the following from the AMAs:

• The text of the original submission, which contains an introduction of both the OP and the topic of the AMA.
• The text of each comment in the comment section.
• For the original submission and each comment:
  – The amount of times it was gilded.
  – The karma (upvotes minus downvotes) it received.
  – The flair of the user who wrote it.
• For each of the users:
  – Whether the user currently has gold.
  – The total amount of comment karma for all their comments all over reddit.
  – The total amount of link karma they received.
  – Whether the user is a mod of any subreddit.
  – Whether the user was shadowbanned.
  – A breakdown of link and comment karma by the subreddit it was gained on.

Note that the text retrieved was encoded in utf-8, contains xml character entity references and has markup in the markdown format.

All the data was stored in XML files while maintaining the hierarchy of the comment section.

⁵ https://www.reddit.com/r/science/comments/4twmrc/science_ama_series_hi_reddit_im_alice_jones_an/
⁶ https://www.reddit.com/r/science/comments/52k2gt/american_chemical_society_ama_hi_my_name_is_paul/
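To make the gathering step concrete, the sketch below pulls one AMA and a few of the fields listed above with PRAW and writes them to XML. It uses the current PRAW 4+ interface with hypothetical credentials, so the exact calls are an assumption rather than a reproduction of the author's script.

```python
import praw
import xml.etree.ElementTree as ET

# Hypothetical credentials; the URL is the Alice Jones AMA from footnote 5.
reddit = praw.Reddit(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     user_agent="benelearn-ama-scraper")
url = ("https://www.reddit.com/r/science/comments/4twmrc/"
       "science_ama_series_hi_reddit_im_alice_jones_an/")

submission = reddit.submission(url=url)
submission.comments.replace_more(limit=None)  # expand the full comment tree

root = ET.Element("ama", title=submission.title)
for comment in submission.comments.list():
    elem = ET.SubElement(
        root, "comment",
        id=comment.id,
        parent=comment.parent_id,                 # keeps the tree recoverable
        karma=str(comment.score),                 # upvotes minus downvotes
        gilded=str(comment.gilded),
        flair=comment.author_flair_text or "",
        author=comment.author.name if comment.author else "[deleted]")
    elem.text = comment.body                      # markdown, utf-8

ET.ElementTree(root).write("alice_jones_ama.xml", encoding="utf-8")
```

The per-user statistics (total karma, gold, mod status, karma per subreddit) would come from the corresponding Redditor objects and are omitted here for brevity.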
3.2.2. Preprocessing and normalization

Preprocessing and normalization was done using the Stanford CoreNLP (Manning et al., 2014) language processing toolkit; the following built-in annotators were used:

• Tokenization
• Sentence splitting
• Lemmatization
• Named entity recognition
• Syntactic analysis
• Part of speech tagging
• Sentiment analysis

Any token was dropped if it was a url, if it was a stopword, or if it was neither a verb, noun, adjective nor adverb. Some information, like the number of urls in the comment, was kept as a feature.

3.2.3. Feature extraction

The features were extracted using a Python script creating a series of feature vectors, one for each comment; see section 3.3 for details on the features. Most of the features simply consist of a piece of reddit meta-data or some quantity derived directly from the text. However, two of the features (t_ws and u_fws) make use of the WordNet (Fellbaum, 1998) implementation of the Natural Language Toolkit (Bird et al., 2009), based on the lemma and part-of-speech information determined by CoreNLP. Another three features (c_+, c_−, c_o) make use of the SentiWordNet (Baccianella et al., 2010) toolkit, which extends WordNet with negative, objective and positive sentiment values for words. The spelling-based features c_m and c_co are based on the enchant python library⁷. The curseword feature c_cu is based on the profanity python library⁸.

3.2.4. Classification and Evaluation

Both classification and evaluation of the classification based on the extracted feature vectors were done using the scikit-learn toolbox (Pedregosa et al., 2011). During classification and evaluation one AMA was used as a test set, while the other was used as a training set; see section 3.1. The classifier used was a Gaussian Naive Bayes (Bishop, 2006) classifier, because of its transparent nature. This transparency enabled a closer examination of the features, as described in section 3.4.

3.3. Features

Ultimately all features of a comment are combined into one large feature vector prior to being used for classification. The features used were split into two categories:

• Features that depend on reddit meta-data and comment structure, shown in Table 2.
• Features purely based on the text of the comment, independent of reddit meta-data. Besides the ones shown in Table 1 these also include token document frequencies.

The feature vector consists of the features shown in Tables 1 and 2, followed by the document frequencies of a number of tokens. Which document frequencies were included was determined by taking the top N tokens of the entire training set, ordered by document frequency. How many tokens were included, the hyperparameter N, is discussed in section 4.1.

3.3.1. Similarity features

Two different types of similarity features were used: features that indicate how similar two comments are to one another.

One is based solely on which exact tokens occurred in both comments, and how frequently they occurred. This similarity was defined as follows, where df_{t,x} is the document frequency of token t in comment x:

    similarity(x, y) = Σ_{t ∈ x,y} df_{t,x} · df_{t,y}        (1)

The other similarity feature was based on WordNet path similarity. WordNet can determine a similarity between two tokens by measuring the distance between the two tokens within the WordNet network; this is the path similarity sim(a, b). The WordNet similarity of two comments was determined by taking the average of the path similarity over all possible combinations of tokens in both comments:

    WordNetsimilarity(x, y) = (1 / (|x| |y|)) Σ_{a ∈ x} Σ_{b ∈ y} sim(a, b)        (2)

These similarities were used as elements in the feature vector. The similarity between a comment and the introductory comment made by the scientist was included, as well as the similarity between the user flair (the scientific credentials as shown on /r/science) of the commenter and that of the scientist doing the AMA.

⁷ https://github.com/rfk/pyenchant/
⁸ https://github.com/ben174/profanity
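The two similarity measures of Eqs. (1) and (2) can be sketched as follows with NLTK's WordNet interface, assuming comments have already been reduced to lists of lemmas; df_{t,x} in Eq. (1) is read here as the within-comment frequency of token t.

```python
from collections import Counter
from itertools import product
from nltk.corpus import wordnet as wn

def token_similarity(tokens_x, tokens_y):
    """Eq. (1): sum over shared tokens of the product of their frequencies."""
    fx, fy = Counter(tokens_x), Counter(tokens_y)
    return sum(fx[t] * fy[t] for t in fx.keys() & fy.keys())

def path_sim(a, b):
    """WordNet path similarity between two lemmas (0 if either is unknown)."""
    syn_a, syn_b = wn.synsets(a), wn.synsets(b)
    if not syn_a or not syn_b:
        return 0.0
    return syn_a[0].path_similarity(syn_b[0]) or 0.0

def wordnet_similarity(tokens_x, tokens_y):
    """Eq. (2): average path similarity over all token pairs."""
    if not tokens_x or not tokens_y:
        return 0.0
    total = sum(path_sim(a, b) for a, b in product(tokens_x, tokens_y))
    return total / (len(tokens_x) * len(tokens_y))

comment = ["research", "empathy", "child"]
intro = ["psychopathy", "empathy", "development", "education"]
print(token_similarity(comment, intro), wordnet_similarity(comment, intro))
```

The paper restricts the synset lookup with the part of speech provided by CoreNLP; taking the first synset above is only a shortcut for brevity.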
Table 1. Features independent of reddit meta-data and comment hierarchy. The last column, JS-divergence, shows an indication of how much that feature influences classification; see section 4.4 for details.

Feature   | Description                                                                          | JS-divergence
c_p/t_n   | The number of paragraphs divided by the number of tokens.                            | 3.12 · 10^1
c_url/t_n | The number of hyperlinks divided by the number of tokens.                            | 9.44
t_ne/t_n  | The number of named entities divided by the number of tokens in the comment.         | 1.33 · 10^-1
t_c/t_n   | The number of correctly spelled words divided by the number of tokens in the comment.| 9.31
c_?       | The number of sentences ending in a question mark in the comment.                    | 6.66 · 10^-1
c_m       | The number of misspelled words in the comment.                                       | 5.60 · 10^-1
c_p       | The number of paragraphs in the comment.                                             | 5.74 · 10^-1
c_+       | The average positivity of the words in the comment according to SentiWordNet.        | 2.46
c_−       | The average negativity of the words in the comment according to SentiWordNet.        | 1.56
c_co      | Fraction of correctly spelled words in the comment.                                  | 9.31
c_cp      | 1 if the user uses capitals and periods at the start and end of their sentences, otherwise 0. | 4.79 · 10^2
c_lc      | The number of full-caps words of more than 3 characters in the comment.              | 1.25 · 10^-1
c_o       | The average objectivity of the words in the comment according to SentiWordNet.       | 1.95 · 10^1
c_sen     | The average sentiment per sentence in the comment according to Stanford NLP.         | 2.10
c_url     | The number of hyperlinks in the comment.                                             | 2.02
t_n       | The number of tokens in the comment.                                                 | 5.88 · 10^-1
t_s       | The similarity between this comment and the initial comment made by OP.              | 1.48
t_cu      | The number of cursewords in the comment.                                             | 1.30
t_ne      | The number of named entities found by Stanford NLP.                                  | 1.02
t_ws      | The WordNet similarity between this comment and the initial comment made by OP.      | 7.70 · 10^2

Table 2. Features dependent on reddit meta-data and comment hierarchy. The last column, JS-divergence, shows an indication of how much that feature influences classification; see section 4.4 for details.

Feature   | Description                                                                          | JS-divergence
c_a       | The number of ancestral comments of the comment in the tree.                         | 9.80 · 10^-2
c_c       | The number of child comments of the comment in the tree.                             | 1.90
c_g       | 1 if the comment has been gilded, otherwise 0.                                       | 0.00
c_k       | The log amount of karma of the comment.                                              | 1.62
c_s       | The number of sibling comments of the comment in the tree.                           | 9.29 · 10^-1
c_opa     | 1 if a comment made by OP is among the ancestral comments, otherwise 0.              | 0.00
c_opc     | 1 if OP replied to this comment, otherwise 0.                                        | 1.08
c_opd     | 1 if a comment made by OP is among the descendant comments, otherwise 0.             | 1.20
c_opp     | 1 if the parent comment was made by OP, otherwise 0.                                 | 2.05 · 10^-2
u_b       | 1 if the user was shadowbanned, otherwise 0.                                         | 0.00
u_f       | 1 if the user has /r/science flair, otherwise 0.                                     | 1.04
u_g       | 1 if the user has gold, otherwise 0.                                                 | 1.52 · 10^-1
u_m       | 1 if the user is a mod of any subreddit, otherwise 0.                                | 1.06
u_d       | 1 if the user was deleted, otherwise 0.                                              | 8.81 · 10^1
u_fs      | The similarity between the comment's user flair and the flair of OP.                 | 2.19 · 10^1
u_fws     | The WordNet similarity between the comment's user flair and the flair of OP.         | 2.19 · 10^1
u_kc      | The log amount of comment karma of the user of the comment.                          | 3.66
u_kl      | The log amount of link karma of the user of the comment.                             | 9.12 · 10^-2
u_op      | 1 if the comment's user is also the OP, otherwise 0.                                 | 1.31 · 10^1
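As a rough illustration of the classification and evaluation step of section 3.2.4, the following sketch fits scikit-learn's Gaussian naive Bayes on stand-in feature matrices; the array shapes and random data are placeholders for the real feature vectors built from Tables 1 and 2 plus the token document frequencies.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_fscore_support

# X_train/y_train: Alice Jones AMA; X_test/y_test: Paul Helquist AMA.
# Each row is one comment's feature vector, label 1 = "keep", 0 = "discard".
rng = np.random.default_rng(0)                       # stand-in data only
X_train, y_train = rng.normal(size=(172, 40)), rng.integers(0, 2, 172)
X_test, y_test = rng.normal(size=(121, 40)), rng.integers(0, 2, 121)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", pos_label=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```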
3.4. Gaussian naive Bayes

Since the naive Bayes classifier is rather transparent, it is possible to look at the way a particular feature impacts the resulting classification from a mathematical perspective. This would be a secondary method for finding the relative importance of features, besides comparing classification performance.

Consider the following probabilistic model, which is used by naive Bayes for classification, where all distributions p are Gaussian:

    p(C_k | x_1, ..., x_n) ∝ p(C_k) ∏_{i=1}^{n} p(x_i | C_k)

The predicted class will be the one with the highest probability according to the model. So say only feature x_j is being varied and there are only two classes; all the other features can simply be dropped:

    p(C_1) ∏_{i=1}^{n} p(x_i | C_1) < p(C_2) ∏_{i=1}^{n} p(x_i | C_2)
    p(C_1) p(x_j | C_1) < p(C_2) p(x_j | C_2)

And unless the difference in prior probability is quite high, the difference between the probability distributions p(x_j | C_1) and p(x_j | C_2) should determine which class is predicted. If these distributions were similar, the value of x_j would influence the posterior probability only a little, since either distribution would assign it a similar probability. This means that the difference between p(x_j | C_1) and p(x_j | C_2) can be seen as a measure of the importance of feature x_j.

4. Results

4.1. Hyper parameter N

The optimal number N of token document frequencies to include in the feature vector was determined as follows. Figure 1 shows the recall, precision and f1 scores of the classification by N, where the document frequencies of the top N tokens by document frequency were included in the feature vector, as well as the proportion of unique sets of document frequencies. This figure shows a slight trade-off between precision and recall. The document frequency used to order the tokens is based on the comments in the training set only. For the purpose of these tests a random 20% of the training set was used as a hold-out test set.

An N of 575 was settled upon in order to maximize recall without needlessly increasing the number of document frequencies included in the feature vector. The preference for recall over precision corresponds to a preference for keeping interesting comments over discarding uninteresting ones.

4.2. All features

In order to determine if this classification works at all, a test run was performed where every feature was used, plus the document frequencies of the top N = 575 tokens.

All features:
                    Predicted: Discard   Predicted: Keep
Truth: Discard              13                  17
Truth: Keep                  1                  90
Precision 0.84, Recall 0.99, F1 0.91

4.3. Reddit vs Text features

In order to determine if the use of reddit metadata based features has any effect, the test has been repeated twice. Once with features solely derived from the comment text, plus the document frequencies of the top N = 575 tokens:

Text features only:
                    Predicted: Discard   Predicted: Keep
Truth: Discard              13                  17
Truth: Keep                  1                  90
Precision 0.84, Recall 0.99, F1 0.91

And once with features solely derived from reddit meta-data; no text based features were used:

Reddit features only:
                    Predicted: Discard   Predicted: Keep
Truth: Discard              23                   7
Truth: Keep                 34                  57
Precision 0.89, Recall 0.63, F1 0.74

4.4. Features

As discussed in section 3.4, the difference between the probability distributions underlying the Gaussian naive Bayes classifier can be seen as an indication of feature importance. The specific difference measure being used here is the Jensen-Shannon divergence (a symmetric version of the Kullback–Leibler divergence). This divergence basically shows how different the prototypical "keep" comment is from the prototypical "discard" comment with respect to a specific feature. Tables 1 and 2 show this divergence.
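One way to obtain such a per-feature score is to fit a Gaussian to the feature's values in each class and approximate the Jensen-Shannon divergence between the two densities numerically, as in the sketch below; this illustrates the measure, not necessarily the author's exact computation.

```python
import numpy as np
from scipy.stats import norm

def js_divergence(keep_values, discard_values, grid_points=2001):
    """Approximate JS divergence between the Gaussians fitted to one feature
    for the "keep" and "discard" classes, via numerical integration."""
    p = norm(np.mean(keep_values), np.std(keep_values) + 1e-9)
    q = norm(np.mean(discard_values), np.std(discard_values) + 1e-9)
    lo = min(p.mean(), q.mean()) - 6 * max(p.std(), q.std())
    hi = max(p.mean(), q.mean()) + 6 * max(p.std(), q.std())
    x = np.linspace(lo, hi, grid_points)
    px, qx = p.pdf(x), q.pdf(x)
    mx = 0.5 * (px + qx)

    def kl(a, b):
        return np.trapz(a * np.log((a + 1e-12) / (b + 1e-12)), x)

    return 0.5 * kl(px, mx) + 0.5 * kl(qx, mx)

# Toy example: values of one feature grouped by the annotation label.
keep = np.array([0.8, 0.9, 0.7, 0.95, 0.85])
discard = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
print(js_divergence(keep, discard))
```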
[Figure 1: line plot of precision, recall, f1 and the proportion of unique bags (y-axis, roughly 0.6 to 1.0) against N (x-axis, 0 to 2500).]

Figure 1. Performance of classification and proportion of unique bags in the dataset by N, the number of tokens whose document frequencies were included in the feature vector. The vertical line indicates the N used for further tests.

Because the WordNet similarity feature t_ws seems rather influential, it might be interesting to see how it would perform on its own, even discarding the document frequencies (N = 0).

WordNet similarity t_ws only:
                    Predicted: Discard   Predicted: Keep
Truth: Discard               9                  21
Truth: Keep                  1                  90
Precision 0.81, Recall 0.99, F1 0.89

5. Discussion

Regarding the primary research question, whether or not it is possible, using text-mining, to distinguish between informative & relevant comments and the rest: it certainly seems possible. As shown in sections 4.2 and 4.3, each of the classification methods (all features, reddit-only features and text-only features) has at least decent precision, recall and f1 scores, though the performance of reddit-only features is not as good as the rest.

Regarding the secondary research question, whether or not reddit meta-data and comment hierarchy are important to this classification, the answer seems to be no. As shown in sections 4.2 and 4.3, the difference between all-features and text-only-features classification is nonexistent, while there does exist a difference between reddit-only and text-only feature classification. Perhaps most interesting is that the WordNet similarity feature on its own performs nearly as well as all features combined.

The other way of figuring out the relative importance of the features, suggested in section 3.4, would be to look at the inner workings of the fitted Gaussian naive Bayes classifier. Consider Tables 1 and 2.

On the reddit-dependent side the flair similarity features u_fs and u_fws seem to be of import, probably because they end up identifying the scientist doing the AMA himself. The hierarchy-related features c_opc, c_opd and c_c are important as well, because they indicate that a comment was an integral part of the discussion. Admittedly, the interestingness of these features is dampened somewhat knowing that they do not really contribute to the classification. In hindsight the features c_g and u_b were useless, because no comment was gilded nor any user shadowbanned in our data.

On the text-only side the WordNet based similarity t_ws really stands out, even more so than the token-based similarity t_s, probably because comments that are semantically similar to the introductory text are likely to be relevant to the discussion. In section 4.4 it is even shown to be good enough to function on its own. It is rather nice to see that the extra effort of including semantics pays off. One of the other WordNet, well SentiWordNet, based features, c_o, also seems influential; it is supposed to indicate the objectivity of a comment, a quality one would expect from scientific explanations.
Other indicators of well formatted and well sourced comments, the features c_cp, c_p/t_n and c_url/t_n, seem also to be of import.

6. Conclusion

So it appears that it is quite possible to use text-mining to distinguish between informative & relevant comments and the rest, and that while reddit meta-data is quite useful, it is not at all necessary for classification; to the point where even a single text-based feature, the WordNet similarity measure, performs better than all the reddit meta-data features combined.

The one real issue with these conclusions is that the amount of data used is quite small, mostly because annotating the data manually is quite time consuming. Initially, using reddit comment karma as annotation was considered, but the distribution of said karma is horribly skewed, which led to issues. The vast majority of comments will never have had any feedback on them; they would have had no upvotes or downvotes, resulting in a karma of 1.

Also, a lot more data was gathered than was actually used for this paper. Not just in raw quantity (228 different AMAs were automatically scraped from reddit) but also in terms of quality. Neither the breakdown of karma by subreddit nor the markdown formatting was used, and the first would probably reveal a lot about the user, as a sort of fingerprint of their interests on reddit.

It might also be interesting to do this feature analysis using a classifier that does not make the independence assumption naive Bayes does.

Or it may be interesting to look into the usefulness of different semantics-based features, like word2vec or any of the other WordNet based distance measures, seeing as the one real stand-out is the WordNet based similarity measure, which used on its own yields a performance nearly as good as all features combined, and which could be interpreted as a kind of "on topic"-ness feature.

References

Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., & Spyropoulos, C. D. (2000). An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 24–28.

Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. Proceedings of LREC 2010.

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Praw-dev (2016). PRAW: The Python Reddit API Wrapper.

Reddit (2016). Reddit FAQ.

Siersdorfer, S., Chelaru, S., & Augusta, V. (2010). How useful are your comments? Analyzing and predicting YouTube comments and comment ratings. Proceedings of the 19th International Conference on World Wide Web, 891–900.

Weimer, M., Gurevych, I., & Mühlhäuser, M. (2007). Automatically assessing the post quality in online discussions on software. Proceedings of the ACL, 125–128.
Locally versus Globally Trained Word Embeddings
for Automatic Thesaurus Construction in the Legal Domain

I.G. Veul [email protected]


Institute for Computing and Information Sciences, Radboud University, the Netherlands

Keywords: word embeddings, word2vec, global, local, thesaurus construction

Abstract

In this paper two different word embedding methods are explored for the automatic construction of a thesaurus for legal texts. A word embedding maps every word to a relatively low dimensional vector, which is then used to compare similarities between words. We use Word2Vec for the word embedding, which is an unsupervised learning method that defines a word based on its context. Words with similar contexts will then be considered similar. The unsupervised nature of Word2Vec allows for the construction of the thesaurus without requiring relevance feedback. A downside of the standard Word2Vec approach, though, is that the resulting word embeddings tend to be too general when trained on an entire corpus. This paper studies whether training the word embeddings separately for different jurisdictions results in a better thesaurus. The thesauri are trained on the text of 300 000 Dutch legal rulings. To assess the performance of the globally and locally trained thesauri, they are compared to a manually constructed thesaurus, which is already being used for query expansion in the legal domain. The results show that there is a significant difference between the global and local thesauri, but that the global thesaurus actually outperforms the local thesaurus.

1. Introduction

Over the last 15 years the legal community has shifted from working with paper to working digitally, spawning specialized digital collections containing many legal documents. But with more and more information becoming available and the size of these collections growing rapidly, it has become increasingly difficult for legal experts to actually find relevant information in these collections. Documents might use synonyms or describe similar concepts in different terms. A legal expert who is not aware of these variations in the terminology can have a difficult time coming up with the right words to describe their information need and as a result might miss out on relevant documents.

This problem is amplified by the specialized nature of the documents (IJzereef et al., 2005): First of all, experts of different legal fields might use different terminologies. This means that an expert would require a detailed understanding of the field and the words used in order to effectively search for documents. Specialized documents also tend to contain abbreviations and acronyms, increasing their ambiguity. Finally, the vocabulary used in legal documents also varies over time, as new concepts and laws are introduced, making it difficult for experts to keep up with the terminology.

To combat this ambiguity problem, queries can be expanded with words that are closely related to the original words. This concept of query expansion has been an active research area since the 70s (e.g. Minker et al., 1972) and over the years a wide array of techniques has been tested (see related work for examples). One such technique is using words from a thesaurus to expand the query. A thesaurus is a collection in which words are grouped with other words that have a similar meaning or are otherwise related. Thesauri generally contain three types of word relations: synonyms, hierarchical and related (Hersh et al., 2000). Words that are synonyms can be used interchangeably and share the same meaning; words belonging to the hierarchical category share a broader/narrower relation; and the related category contains all other types of relationships between words that are considered important, for example two words being each other's opposites.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).
Local versus Global Word Embeddings in Automatic Thesaurus Construction

A downside of thesauri for query expansion is that highlighted by Bhogal et al. (2007) in their review pa-
their creation and maintenance takes a lot of time, is per. Hersh et al. (2000) saw for example a general
labor intensive and is prone to human error (Lauser decline in performance, in their study assessing query
et al., 2008). An alternative is to automatically construct a thesaurus, by extracting word similarities directly from the documents in the collection. A common approach for this is to use word embeddings. A word embedding is a mapping of a word to a low-dimensional vector of real numbers. These embeddings can be used to calculate the distance between two words (for example using cosine similarity), which serves as a quantitative representation of the similarity between the words (Roy et al., 2016).

The interest in word embeddings has been refueled by the introduction of a new word embedding technique by Mikolov et al. (2013), called Word2Vec. Word2Vec uses a neural network to calculate the vector representations of words, by predicting the surrounding words of a given word. The advantages of Word2Vec are that it is easily accessible through the Word2Vec software package1 and that it is less computationally expensive than other word embedding techniques (such as Latent Dirichlet Allocation), which can get computationally very expensive with large data sets (Mikolov et al., 2013).

Word2Vec struggles, though, with generalization (Diaz et al., 2016), because the word embeddings are trained on the whole vocabulary. This effect could be amplified in collections of legal documents, since different fields of law require different interpretations and as a result might use words differently. The goal of this paper is to study whether this effect can be mitigated by training Word2Vec separately for each legal field. This is done in a two-stage process: first, this study aims to confirm that a thesaurus trained on the entire collection differs from a thesaurus trained on separate legal fields. Then this paper tries to answer whether the locally trained Word2Vec embeddings create a better thesaurus than globally trained embeddings, in the context of legal documents. The contribution of this paper is limited to the detection of related words and does not address the assignment of thesaurus relationships to these words.

2. Related Work

Over the years a broad range of possible query expansion techniques has been studied, ranging from adding clustered terms (Minker et al., 1972) to state-of-the-art methods that use relevance models, such as RM3 (Abdul-Jaleel et al., 2004). The usage of thesauri for query expansion has shown mixed results: Hersh et al. (2000), for example, found no overall improvement when assessing thesaurus-based query expansion using the UMLS Metathesaurus, while in some specific cases their query expansion method actually showed improvement. IJzereef et al. (2005) observed consistent significant improvement when applying thesaurus query expansion to biomedical retrieval. A unique aspect of their approach was that they tried to take into account how terms from a thesaurus could benefit retrieval, by reverse engineering the role a thesaurus can play to improve retrieval. Tudhope et al. (2006) took a less technical approach and looked at possible difficulties for users when using thesauri for query expansion.

Studies related to word embeddings for query expansion can be divided into two categories. The first category are studies in which word embedding techniques were directly used for query expansion. Roy et al. (2016), for example, used similar terms based on Word2Vec embeddings directly for query expansion. Although their study showed increased performance on general purpose search tasks, the method failed to achieve a comparable performance with state-of-the-art query expansion methods. Diaz et al. (2016) used Word2Vec in a similar way, but showed the importance of locally training Word2Vec on relevant documents to overcome the generalization problem. In their study locally training the word embeddings significantly improved the performance for query expansion tasks, compared to globally trained embeddings.

The second category uses word embeddings to automatically construct thesauri. Navigli and Ponzetto (2012), for example, used a combination of word embeddings from WordNet and Wikipedia to construct a cross-lingual, general purpose thesaurus. Claveau and Kijak (2016b) also used WordNet (in combination with Moby) to construct a thesaurus, but used a different approach to find related terms for the thesaurus. Instead of using cosine similarity measures directly to link terms to relevant terms, they formed documents from clusters of similar terms based on their word embeddings. Building the thesaurus was then done by finding the most relevant document for every term.

3. Method

This section describes the preprocessing of the data and the construction of the globally and locally trained thesauri from the data. A visual summary of this process is given in Figure 1.

1 https://code.google.com/archive/p/word2vec/
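To make the embedding-based similarity described in the introduction concrete, the following is a minimal sketch of cosine similarity between two word vectors. It is an illustration only: the example vectors and terms are assumptions, not taken from the paper or its models.

import numpy as np

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their norms;
    # values close to 1.0 indicate very similar words.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings for two Dutch legal terms.
vec_aandeelhouder = np.array([0.12, -0.40, 0.33, 0.05])
vec_vergadering   = np.array([0.10, -0.35, 0.30, 0.11])

print(cosine_similarity(vec_aandeelhouder, vec_vergadering))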

Figure 1. A visual summary of the method used to construct the thesauri. 1. The process starts with 300 000 XML files of Dutch legal rulings. 2. Sentences are extracted from the XML files. 3. The sentence data is split into 5 groups: administrative law, civil law, criminal law, international public law and a group with all of the sentences. 4. Word2Vec models are constructed for every group of the data, except for international public law which did not have enough data. 5. A thesaurus is created from the global model, by taking the top ten most similar terms for each term in the vocabulary. The three locally trained models are joined together. 6. Two local thesauri are constructed from the joined local model: one that solves duplicate conflicts by taking the maximum similarity score and one that uses the average score.

Table 2. The number of sentences for each of the jurisdictions after preprocessing.

Jurisdiction                Sentence Count
Total                           40 523 303
Administrative Law              19 651 290
Civil Law                       12 643 019
Criminal Law                     8 223 217
International Public Law             5 777

3.1. Data

The data used to train the Word2Vec embeddings consisted of three hundred thousand court rulings, which were provided by Legal Intelligence2. The rulings were crawled from the Dutch governmental website that publishes court verdicts3 and were represented as semi-structured XML files. Each file contained, among other things, the ruling of the court, a short summary of the case (inhoudsindicatie) and the jurisdiction to which the ruling belonged. The ruling (arrest) and the summary were used as the text sources for the training of the word embeddings, whereas the jurisdiction was used to group the rulings for local training. The rulings belonged to one of four jurisdictions: administrative law, civil law, international public law or criminal law.

3.2. Reference Thesaurus

To evaluate the learned thesauri, the thesauri were compared to a ground truth in the form of a reference thesaurus. The reference thesaurus was the 2015 version of the justitiethesaurus4. The justitiethesaurus is a publicly available legal thesaurus aimed at query expansion for expert search in the legal domain (van Netburg & van der Weijde, 2015). It is created and maintained by the Wetenschappelijk Onderzoek- en Documentatiecentrum (WODC). The thesaurus was based on ten sources, which consisted of legal keyword lists and dictionaries, books on legal and immigration concepts, a book about country names and a book about the construction of thesauri (van Netburg & van der Weijde, 2015).

The reference thesaurus consisted of 5558 terms and each term had one or more related terms. In total there were 13 606 related terms (5877 unique). The justitiethesaurus used five types of relations to connect these terms, which are explained in Table 1. The thesaurus contained both single word terms as well as terms in the form of multi-word phrases. For example, 'aandeelhouders' (shareholders) had 'algemene vergadering van aandeelhouders' (general meeting of shareholders) as a related term.

3.3. Preprocessing

Before the text was used to train the Word2Vec embeddings, the text was split up into sentences using the sentence tokenizer from the nltk module for Python. For each sentence, all digits and punctuation symbols (except the '-') were removed, and the sentence was converted to lower case and then split on whitespace. The number of sentences for each jurisdiction, excluding empty ones, is shown in Table 2. The sentences belonging to the international public law jurisdiction were discarded for local training, because there were not enough sentences to effectively train the word embeddings, but they were included in the global model.

2 https://www.legalintelligence.com
3 https://www.rechtspraak.nl/
4 https://data.overheid.nl/OpenDataSets/justitiethesaurus_2015.xml
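As an illustration of the preprocessing described above, the sketch below shows one way to implement it with nltk. It is a minimal sketch under assumptions: the use of the Dutch Punkt sentence model, the cleaning regular expression and the example string are mine, not taken from the paper.

import re
from nltk.tokenize import sent_tokenize  # assumes nltk's 'punkt' models are available

def preprocess(document_text):
    """Split a ruling into cleaned, tokenized sentences."""
    sentences = []
    for sentence in sent_tokenize(document_text, language='dutch'):
        # Remove digits and punctuation except '-', then lower-case.
        cleaned = re.sub(r"[^\w\s-]|\d", "", sentence).lower()
        tokens = cleaned.split()          # split on whitespace
        if tokens:                        # drop empty sentences
            sentences.append(tokens)
    return sentences

print(preprocess("De Hoge Raad verwierp het beroep op 12 mei 2015."))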

Table 1. Description of the five types of relations of the justitiethesaurus and their number of occurrences.

Relation       Description                                                              Count
Narrow-term    Term lower in the hierarchy                                               1822
Broader-term   Term higher in the hierarchy                                              1822
Related-term   A term that is related, but not hierarchically                            6458
Use            Reference to the preferred (almost) synonymous term                       1752
Used-for       Reference to an (almost) synonymous term; used if the original term
               is the preferred term                                                     1752

3.4. Training

Since a significant number of terms in the reference thesaurus were phrases, the phrases model5 of the gensim module was used on all of the sentences in the collection, to learn common bigram and trigram phrases. The phrases were trained using the default settings and only taking into account phrases that occurred at least five times.

The Word2Vec embeddings were then trained on unigrams and the previously identified phrases, using the skip-gram implementation of gensim's Word2Vec model6. Training was done locally for the three remaining jurisdictions and globally on the entire text collection. The neural network consisted of 100 hidden nodes and used a maximum distance of five words. In other words, the context of a word was defined as the five words before and the five words after it. All terms that did not occur at least ten times were discarded.

After training the models, only the terms that occurred in both the reference thesaurus as well as the trained models were selected. This was required in order to compare the thesauri with the reference thesaurus. The term counts of the four models, before and after reduction, are shown in Table 3. Not all terms of the reference thesaurus were retrieved by the models. Many of the terms in the reference thesaurus that were not identified by the Word2Vec models were phrases of two or more words (despite extracting commonly used phrases from the texts). This means that these words and phrases did not actually occur (often enough) in the training data.

Finally, the thesauri were constructed from the Word2Vec models by taking the ten most similar terms for each term, based on the cosine similarity of the embeddings.
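The following sketch shows how the training set-up of Section 3.4 could look in code. It is an illustration under assumptions: the variable names, the data loading and the gensim 3.x parameter names (size, window, min_count, sg) are mine; the paper only specifies the hyperparameter values.

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# `sentences` is assumed to be the list of token lists from the preprocessing step.
bigram = Phraser(Phrases(sentences, min_count=5))            # learn bigram phrases
trigram = Phraser(Phrases(bigram[sentences], min_count=5))   # then trigram phrases
phrased_sentences = [trigram[bigram[s]] for s in sentences]

model = Word2Vec(
    phrased_sentences,
    size=100,        # 100-dimensional embeddings ("100 hidden nodes")
    window=5,        # five words before and five words after the target word
    min_count=10,    # discard terms occurring fewer than ten times
    sg=1,            # skip-gram implementation
)

# A thesaurus entry for one term: its ten nearest neighbours by cosine similarity.
entry = model.wv.most_similar('aandeelhouders', topn=10)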
5 https://radimrehurek.com/gensim/models/phrases.html
6 https://radimrehurek.com/gensim/models/word2vec.html

Table 3. The vocabulary size (the term count before reduction) and the term count in the reference thesaurus for all four models.

Model Name           Vocabulary Size    Count in Reference
Global                       559 032                  3585
Administrative Law           302 444                  3147
Civil Law                    267 466                  2989
Criminal Law                 168 664                  2627

3.5. Combining Jurisdiction Thesauri

The final step before comparing global and local thesauri was to combine the models for each jurisdiction into a single local thesaurus. This was done by concatenating the most similar terms for each term in the jurisdiction models, and then ordering the similar terms based on their cosine similarity scores. If the same term showed up as a similar term in multiple jurisdiction models, the conflict was solved using two methods: the maximum method used the highest similarity score as the score for the combined thesaurus; the second method used the average similarity score as the new score.

After combining the jurisdiction models, the local thesauri covered 3526 terms, compared to the 3585 terms covered by the global thesaurus. This discrepancy was due to terms being discarded in the jurisdiction models for not occurring more than ten times, while they did occur more than ten times globally. The terms that were only present in the global thesaurus were ignored for the comparison between the thesauri.
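A small sketch of this combination step: per-jurisdiction neighbour lists are merged, and duplicate neighbours are resolved with either the maximum or the average similarity score. The data structures and the helper function are assumptions for illustration; they are not the authors' implementation.

from collections import defaultdict

def combine_local_thesauri(jurisdiction_entries, method="max"):
    """jurisdiction_entries: list of [(term, score), ...] lists, one per jurisdiction."""
    scores = defaultdict(list)
    for entries in jurisdiction_entries:
        for term, score in entries:
            scores[term].append(score)
    resolve = max if method == "max" else (lambda xs: sum(xs) / len(xs))
    combined = [(term, resolve(s)) for term, s in scores.items()]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return combined[:10]   # keep the ten most similar terms

# Hypothetical neighbour lists for one term from two jurisdiction models.
admin = [("vennootschap", 0.71), ("aandeel", 0.65)]
civil = [("vennootschap", 0.62), ("bestuurder", 0.60)]
print(combine_local_thesauri([admin, civil], method="max"))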

4. Results

4.1. Similarity of Thesauri

Before analyzing the performance of the two types of thesauri, it was important to determine whether the globally and locally trained thesauri actually differed significantly. For the comparison, the trained thesauri were considered as a list of rankings of related terms: one ranking for each term in the thesaurus. Comparing the global and local thesauri then meant that all these individual rankings had to be compared. This was done using the rank correlation coefficients Kendall's τ and Spearman's ρ. Figure 2 shows the histograms of the correlation coefficients for both types of local thesauri, when compared to the globally trained thesaurus. The blue bars denote the cases where the p-value of the correlation was insignificant at a significance level of α = 0.05, whereas the green bars denote the significant correlations.

The histograms show that for both ranked correlation measures the coefficients are approximately normally distributed, with the majority of the coefficients close to zero. This means that for most of the terms in the thesauri, there was little dependence between the rankings of the global and local thesauri. Moreover, the large majority of the rankings had insignificant correlation coefficients, meaning that the rankings were not significantly dependent. The correlation coefficients and their p-values thus show that it is unlikely that the globally and locally trained thesauri were similar.
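To make the ranking comparison concrete, the sketch below compares one term's neighbour ranking in the global and local thesauri with scipy. Converting the two term lists to paired rank vectors over their shared terms is one possible simplification, chosen here only for illustration; the term lists are invented.

from scipy.stats import kendalltau, spearmanr

global_ranking = ["vennootschap", "aandeel", "bestuurder", "fusie", "dividend"]
local_ranking  = ["aandeel", "vennootschap", "dividend", "bestuurder", "fusie"]

# Restrict to shared terms and express each list as rank positions.
shared = [t for t in global_ranking if t in local_ranking]
ranks_global = [global_ranking.index(t) for t in shared]
ranks_local  = [local_ranking.index(t) for t in shared]

tau, p_tau = kendalltau(ranks_global, ranks_local)
rho, p_rho = spearmanr(ranks_global, ranks_local)
print(tau, p_tau, rho, p_rho)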
4.2. Performance of Thesauri

To test whether the differences between the local and global thesauri were actually an improvement, the three thesauri were compared to a ground truth thesaurus. The comparison was done by treating the related terms of the ground truth thesaurus as relevant terms that had to be retrieved by the rankings of most similar terms in the trained thesauri. This was done for each term in the ground truth thesaurus, which was then summarized for the entire thesaurus using r-precision and MAP at different values of k. The results are shown in Table 4.

Table 4. Comparison of the performances of the global and local thesauri, as evaluated on the ground truth thesaurus. The r-precision was calculated on the entire top ten similar terms for each term in the thesaurus.

Thesaurus    MAP@1    MAP@5    MAP@10    R-Precision
Global       0.072    0.064    0.065     0.055
Maximum      0.066    0.058    0.060     0.048
Average      0.060    0.051    0.054     0.044
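As an illustration of this retrieval-style evaluation, the sketch below computes average precision at k for one term, treating the reference thesaurus entries as the relevant set; MAP@k is then the mean of this value over all terms. The example lists are invented and the helper is a simplification, not the paper's evaluation code.

def average_precision_at_k(ranked_terms, relevant_terms, k):
    """Average precision over the top-k ranked terms."""
    hits, score = 0, 0.0
    for i, term in enumerate(ranked_terms[:k], start=1):
        if term in relevant_terms:
            hits += 1
            score += hits / i          # precision at this cut-off
    return score / min(len(relevant_terms), k) if relevant_terms else 0.0

ranked   = ["aandeel", "vennootschap", "fusie", "dividend", "bestuurder"]
relevant = {"vennootschap", "dividend"}
print(average_precision_at_k(ranked, relevant, k=5))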
The results show that the globally trained thesaurus performed consistently better than both locally trained thesauri. Of the local thesauri, the thesaurus which used the average concatenation method performed the worst across the board. For all three thesauri, taking into account only the single most similar term (i.e. k = 1) resulted in the highest MAP scores. In other words, taking into account more terms did not improve the recall enough to compensate for the decrease in precision.

In general, though, the MAP@k and r-precision scores are very poor. This means that a lot of irrelevant terms were retrieved by the trained thesauri, and thus none of the thesauri constructed from Word2Vec embeddings were very good.

The quality of the trained thesauri does not just depend on how many terms are retrieved correctly, but also on the types of correctly retrieved terms. These are shown in Table 5. The results show no significant differences between the global and local thesauri, but there are differences within the term types. Most notably, it follows from the table that approximately half of the relevant terms that were retrieved by the trained thesauri were synonyms (terms with the 'use' and 'used-for' relations). These two types of relations, though, only accounted for approximately one-fourth of the number of related terms in the ground truth thesaurus (see Table 1). The trained thesauri thus had a bias towards synonyms, compared to the ground truth. On the other hand, terms with the 'related-term' relationship were underrepresented, only accounting for 24%-30% of the terms instead of the 47% in Table 1. These two differences are most prevalent when only looking at the single most similar term in the trained thesaurus (k = 1). As k grows, and more terms with lower similarity scores are included, the relative number of synonyms retrieved decreases slightly as the relative number of terms with the 'related-term' relation increases slightly. This reinforces the suspicion that the trained thesauri tend to assign higher similarity scores to synonyms than to other types of related terms.

Table 5. An overview of the types of relations correctly retrieved by the three constructed thesauri, expressed in percentages of the total number of retrieved relations. Here k = 1 means that only the first element of the most similar terms is taken into account, k = 5 means only the first five elements, etcetera.

                      Global              Maximum             Average
Retrieved Term Type   k=1   k=5   k=10    k=1   k=5   k=10    k=1   k=5   k=10
Narrow-term            7%   11%   12%     10%   11%   12%      9%   11%   12%
Broader-term           6%   10%   11%      9%   10%   10%      8%    9%   10%
Related-term          24%   26%   28%     25%   28%   30%     28%   29%   30%
Use                   35%   28%   26%     28%   26%   24%     28%   25%   24%
Used-for              28%   25%   23%     29%   25%   24%     27%   25%   24%

5. Discussion

5.1. Direct versus Indirect Evaluation

The performances of the trained thesauri were evaluated by comparing them to a ground truth thesaurus. This evaluation method is based on the assumption that the ground truth thesaurus is a reflection of what a good thesaurus should look like. Although the thesaurus used for this experiment was constructed by experts and has been improved upon for more than twenty years (van Netburg & van der Weijde, 2015), that does not necessarily mean that other good thesauri have to look similar. As a result, it is difficult to infer the quality of the thesauri solely based on the reference thesaurus.

[Figure 2: two pairs of histograms (maximum method and average method) showing the frequency of Kendall's τ and Spearman's ρ correlation coefficients, split into insignificant and significant correlations.]

Figure 2. Similarities between the global thesaurus and both types of local thesauri expressed in Kendall's τ and Spearman's ρ. The correlation coefficients are binned in twenty bins of size 0.1. The blue bars denote the rankings for which there was no significant correlation, with a significance level α = 0.05. The green bars denote the significant correlations.

The globally trained thesaurus performing better could simply mean that it was most similar to the reference thesaurus, without actually being the better thesaurus. This also applies to the general performance of the trained thesauri. Even though they showed a big discrepancy with the ground truth thesaurus, that does not have to mean that the trained thesauri perform poorly when used for query expansion.

This is especially relevant for the comparisons in this paper, since the ground truth thesaurus and the trained thesauri will naturally consist of different terms. The trained thesauri are namely purely based on the terms in the document space, whereas the manually constructed thesaurus is based on terms from concept lists and dictionaries, which might not actually appear as such in the collection. This was also reflected by the fact that the vocabularies of the trained thesauri did not contain all of the terms of the ground truth thesaurus.

Table 6 illustrates the problem of this discrepancy when using a reference thesaurus for evaluation. Although the automatically constructed thesauri make some clear mistakes, often in the form of linking to words with very similar usage (e.g. nationalities) or linking to words from very specific cases (e.g. linking namaak to merk Colt), they also contain related terms that are not found in the reference thesaurus (e.g. groepsactie and collectieve actie). In the latter case, the constructed thesauri are thus unfairly punished in the evaluation.

Given these limitations, a better approach would be to evaluate the thesauri directly on a query expansion task. Direct evaluation does not only remove the ambiguity of the performance evaluation, it also allows the thesauri to be compared to other query expansion techniques. For a more complete overview of direct and indirect evaluation of thesauri, see the paper written by Claveau and Kijak (2016a). Direct evaluation was unfortunately not possible for this experiment, because no relevance data or query logs were available.

5.2. Data Imbalance

After splitting the training data into multiple jurisdictions, the data was not equally balanced between the different jurisdictions (see Tables 2 and 3). A jurisdiction with less data might result in lower cosine similarity scores for that jurisdiction, since there is less text available to reinforce the context patterns of the words. This way the imbalance in the data could cause jurisdictions with less data to be unfairly underrepresented in the local thesauri.

The possible correlation between similarity scores and the size of the training data is partially supported by Figure 3. The figure shows that the similarity scores for criminal law are on average the lowest for every position in the rankings of most similar terms, whereas the model based on all of the data consistently has the highest similarity scores. Surprisingly, the model trained on civil law actually has significantly higher scores than the other two local models, even though it only had the second most training data. Further research is required to confirm whether data imbalance affects the similarity scores and to explore methods to then compensate for the imbalance.

[Figure 3: a line plot of mean cosine similarity (roughly 0.60 to 0.80) against the similar-terms index (1 to 10) for the global model and the administrative, civil and criminal law models.]

Figure 3. The mean cosine similarity scores for each rank of the most similar term rankings of the trained models. The error bars denote the standard deviation.

5.3. Further Research

For this experiment, the performance of the trained thesauri was evaluated for the one, five and ten most similar terms. Although the results showed the highest MAP@k scores for k = 1, more focused research has to be done to gain insight into the ideal number of similar terms that should be used for the trained thesauri. This number will most likely differ between the contexts in which the thesauri are used and should as such be evaluated separately for specific contexts.

Since this paper strictly focused on comparing the performance of globally and locally trained thesauri using a ground truth thesaurus, it did not touch on the actual construction of thesauri from the models.

Table 6. Some examples of differences between the reference thesaurus and the automatically constructed thesauri. English translations of the Dutch terms are given between parentheses.

Term | Reference Thesaurus | Local Thesaurus | Global Thesaurus
Marokkanen (Moroccans) | allochtonen (immigrants) | Antillianen (Antilleans) | Turken (Turks)
Benelux | Comité van Ministers (Committee of Ministers) | BVIE (Benelux Convention Intellectual Property) | woord-/beeldmerk (word/figurative mark)
XTC | ecstasy, MDMA | heroïne (heroin) | GHB
nabootsing (imitation) | namaak (counterfeit), imitatie (imitation) | merk Colt (authentic Colt) | replica (replica)
groepsactie (group action) | class action | collectieve actie (collective action) | collectieve actie (collective action)
anonieme melding (anonymous report) | Meld Misdaad Anoniem | anonieme tip (anonymous tip) | anonieme tip (anonymous tip)
opzegging (notice) | duurovereenkomst (fixed-term agreement) | beëindiging (termination) | ontbinding (termination)
natuurbeheer (nature management) | jacht (hunt), milieubeheer (environmental management) | subsidieregeling agrarisch (agricultural subsidy) | landschapbeheer (landscape management)
pesten (bullying) | school en criminaliteit (school and crime) | uitschelden (calling names) | stalken (stalking)
knowhow | industriële knowhow (industrial knowhow) | know how | know-how

For the comparison in this paper, it sufficed to select only the terms that were shared by the trained thesauri and the reference thesaurus. In practice, though, selecting appropriate terms from the models is a crucial part of forming a thesaurus. A possible selection technique would be to use part-of-speech tagging to only select noun phrases. The models could also be compared to models trained on general text, as a way to only select terms that are specific to the legal domain.

Another challenging aspect of automatic thesaurus construction is automatically annotating the relations between terms. For this, more research is required to take the trained thesauri and identify these relations.

Finally, the techniques described in this paper could also be tested in different domains, to gain insight into whether the results carry over.

6. Conclusion

This study, first of all, set out to confirm that a thesaurus trained on the entire collection differs from a thesaurus trained on separate legal fields. The results showed a significant difference between the globally and locally trained Word2Vec embeddings in the automatic construction of thesauri. This difference can be attributed to the fact that relevant, but context-specific, uses of terms might not be captured by the neural network, because they get overshadowed in the grand scheme. This generalization effect was also mentioned in previous papers by Diaz et al. (2016) and Roy et al. (2016).

As a follow-up, this paper set out to answer whether local Word2Vec models created a better thesaurus than global models. This however proved not to be the case, unlike the initial expectation that the generalization effect would result in a poorer performance of the global thesaurus. The globally trained thesaurus actually outperformed both locally trained thesauri in the experiment.

Two methods to resolve conflicts in the concatenation of the local models were explored. The experiment showed that taking the maximum cosine similarity score consistently outperformed the average similarity score.

It is hard to draw definite conclusions from the experiments, though, since all three thesauri performed poorly. The thesauri showed very low MAP@k and r-precision scores when retrieving relevant terms from the ground truth thesaurus, which means that the trained thesauri retrieved a large number of irrelevant terms. In other words, there was a big discrepancy between the terms that were considered relevant by the reference thesaurus and the terms considered relevant by the trained thesauri.

This discrepancy was also reflected in the bias of the trained thesauri in favor of synonyms, when compared to the reference thesaurus. This bias stems from the assumption of Word2Vec that related terms are used in similar contexts. Synonyms namely have a natural tendency to occur more often in similar contexts than broader, narrower or otherwise related terms.

References

Abdul-Jaleel, N., Allan, J., Croft, W. B., Diaz, F., Larkey, L., Li, X., Smucker, M. D., & Wade, C. (2004). UMass at TREC 2004: Novelty and hard. Online Proceedings of the 2004 Text Retrieval Conference.

Bhogal, J., Macfarlane, A., & Smith, P. (2007). A review of ontology based query expansion. Information Processing & Management, 43, 866-886.

Claveau, V., & Kijak, E. (2016a). Direct vs. indirect evaluation of distributional thesauri. Proceedings of the International Conference on Computational Linguistics, COLING.

Claveau, V., & Kijak, E. (2016b). Distributional thesauri for information retrieval and vice versa. Language and Resource Conference, LREC.

Diaz, F., Mitra, B., & Craswell, N. (2016). Query expansion with locally-trained word embeddings. ACL '16.

Hersh, W., Price, S., & Donohoe, L. (2000). Assessing thesaurus-based query expansion using the UMLS Metathesaurus. Proceedings of the AMIA Symposium (pp. 344-348).

IJzereef, L., Kamps, J., & De Rijke, M. (2005). Biomedical retrieval: How can a thesaurus help? OTM Confederated International Conferences "On the Move to Meaningful Internet Systems" (pp. 1432-1448).

Lauser, B., Johannsen, G., Caracciolo, C., van Hage, W. R., Keizer, J., & Mayr, P. (2008). Comparing human and automatic thesaurus mapping approaches in the agricultural domain. Metadata for Semantic and Social Applications: Proceedings of the International Conference on Dublin Core and Metadata Applications (pp. 43-53).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (pp. 3111-3119).

Minker, J., Wilson, G. A., & Zimmerman, B. H. (1972). An evaluation of query expansion by the addition of clustered terms for a document retrieval system. Information Storage and Retrieval, 8, 329-348.

Navigli, R., & Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217-250.

Roy, D., Paul, D., Mitra, M., & Garain, U. (2016). Using word embeddings for automatic query expansion. Neu-IR '16 SIGIR Workshop on Neural Information Retrieval.

Tudhope, D., Binding, C., Blocks, D., & Cunliffe, D. (2006). Query expansion via conceptual distance in thesaurus indexed collections. Journal of Documentation, 62, 509-533.

van Netburg, C. J., & van der Weijde, S. Y. (2015). Justitiethesaurus 2015.

Identifying writing tasks using sequences of keystrokes

Rianne Conijn [email protected]


Menno van Zaanen [email protected]
Department of Communication and Information Sciences, Tilburg University, The Netherlands

Keywords: keystroke analysis, classification, writing processes, writer identification

Abstract

The sequences of keystrokes that are generated when writing texts contain information about the writer as well as the writing task and cognitive aspects of the writing process. Much research has been conducted in the area of writer identification. However, research on the analysis of writing processes based on sequences of keystrokes has received only a limited amount of attention. Therefore, in this study we try to identify properties of keystrokes that indicate cognitive load of the writing process. Moreover, we investigate the influence of these properties on the classification of texts written during two different writing tasks: copying a text and free-form generation of text. We show that we can identify properties that allow for the correct classification of writing tasks, which at the same time do not describe writer-specific characteristics. However, some properties are the result of an interaction between the typing characteristics of the writer and the writing task.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

Students' activities in online learning systems can provide useful information about their learning behavior. Educational data mining focuses on the use of data from learners and their context to better understand how students learn, to improve educational outcomes, and to gain insight into and explain educational phenomena (Romero & Ventura, 2013). Data can be collected from different sources, such as online learning systems, student administration, and questionnaires. This results in data from multiple contexts, over different time periods, ranging from low to high granularity. In this study, we analyze fine-grained data: keystroke data from a writing task.

In the literature, two different goals can be distinguished in the analyses of keystroke data: authentication or identification of writers, and determination of writing processes. Keystrokes have mainly been used for the former (Longi et al., 2015; Karnan et al., 2011). In the field of educational data mining, the authentication and identification of writers is used, for example, for authentication in online exams (Gunetti & Picardi, 2005) or for the identification of programmers (Longi et al., 2015). These studies mainly focus on the typing or motor processes, since these are considered unique per person. The majority of studies focus on statistical properties, such as mean, standard deviation, and Euclidean distance of three attributes: keystroke duration, keystroke latency, and digraph duration (Karnan et al., 2011). These features can be used to identify and authenticate writers to a large extent, with accuracies up to 99% (Tappert et al., 2009). These high accuracies show that keystroke logs contain much information that denotes writer-specific characteristics.

Yet, keystroke data also include other information, denoting the writing process itself. The determination of these writing processes has received less attention. This might be due to the fact that keystrokes are not clear measures of the underlying writing processes (Baaijen et al., 2012). The data need to be preprocessed and analyzed in a way such that they provide meaningful information to be used by students and teachers for improving learning and teaching. Therefore, this study explores the writing processes derived from students' keystrokes.

Some studies have already investigated the possibilities of determining writing processes using keystrokes. Baaijen et al. (2012) analyzed keystroke data from 80 participants during a 30-minute writing task. The relation between pauses, bursts, and revisions was analyzed. Using these features, text production could be distinguished from revisions. Revision bursts were shorter than new text production bursts. In another writing task, keystroke data from 44 students during a 10-minute essay was collected to determine emotional states (Bixler & D'Mello, 2013). Four feature sets were used: total time, keystroke verbosity (number of keys and backspaces), keystroke latency, and number of pauses (categorized by length). All feature sets combined could classify boredom versus engagement with an accuracy of 87%. Keystroke data have also been analyzed in programming tasks, to determine performance. Thomas et al. (2005) analyzed keystroke data from 38 experienced programmers and 141 novices in a programming task. Keystroke latencies and key types were found to be related to performance. Key latencies (within and before or after a word) were found to be negatively correlated with performance. Additionally, it was found that experienced programmers used more browsing keys and were faster in pressing those.

These studies show that keystrokes do not only differ due to writer-specific characteristics (which is used in authentication and identification), but also because of differences in revisions and text production, emotional states, and level of experience. Whereas the differences in writer-specific properties may be due to physical differences and differences in typing style, the differences in writing properties are expected to come from differences in the cognitive processes required. Indeed, keystroke duration and keystroke latencies are often seen as an indicator of cognitive load (Leijten & Van Waes, 2013). As different tasks lead to differences in cognitive load, we may find these differences using different writing tasks. However, existing studies do not compare differences in keystrokes between tasks. Therefore, in the current study, the writing processes in two different tasks are compared: writing a free-form text versus a fixed text (copying a text). Here we assume that writing a free-form text requires a different cognitive load than writing a fixed text, resulting in differences in the keystroke data.

Having knowledge of the cognitive load while producing a text may provide useful information, for example, for teachers. Currently, teachers often only have access to the final writing product for evaluation purposes. This does not provide insight into what students did during the writing process. Insight into students' writing behavior or cognitive load during an assignment may trigger the teacher to further investigate this behavior and adapt the task or provide personalized feedback on the writing process.

To identify properties of keystrokes that indicate the cognitive load of the writing process, an open dataset is used, which has been used for writer identification. In a previous study, it was already shown that keystroke data differed between free-form and fixed text (Tappert et al., 2009). However, these differences were not made explicit nor evaluated. Therefore, in the current study, we analyze which features differ within the keystrokes of free-form versus fixed text using three different feature sets. As an evaluation, the differences found between fixed and free-form text are used to classify a text as being either fixed or free-form text. This is done using all possible combinations of the different feature groups, to determine which feature group is most useful for the classification. At the same time, since we are not interested in the writer-specific information, the properties should not allow for an accurate identification of the actual writer.

2. Method

2.1. Data

Data used in the current experiments have been taken from the Villani keystroke dataset (Tappert et al., 2009; Monaco et al., 2012). The Villani keystroke dataset consists of keystroke data collected from 142 participants in an experimental setting. Participants were free to choose to copy a fable, a fixed text of 652 characters, or to type a free-form text, an email of at least 650 characters. Participants could copy the fable multiple times and could also type multiple free-form texts. Since typing the texts was not mandatory, not all participants typed both a free-form text and a fixed text. In total, 36 participants typed both at least one fixed text and one free-form text, resulting in keystroke data of 338 fixed texts and 416 free-form texts. The other 106 participants only wrote either free-form or fixed texts, resulting in a further 21 fixed texts and 808 free-form texts. The keystroke data consisted of timestamps for each key press and key release and the corresponding key code. More information about the dataset and the collection of the dataset can be found in Tappert et al. (2009). In this research, we only use the data of participants who created both text types.
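Before the derived features are introduced in Section 2.2, the sketch below illustrates how raw logs of this kind (a timestamp for each key press and release plus the key code) can be represented, and how keystroke latencies follow from them. The event structure and example values are assumptions for illustration, not the format of the Villani dataset.

from dataclasses import dataclass

@dataclass
class KeyEvent:
    key: str           # key code, e.g. 'T', 'SPACE', 'BACKSPACE'
    press_time: int    # milliseconds
    release_time: int  # milliseconds

def keystroke_latencies(events):
    """Time between a key release and the next key press (key pause time)."""
    return [nxt.press_time - cur.release_time
            for cur, nxt in zip(events, events[1:])]

session = [KeyEvent('T', 0, 80), KeyEvent('H', 200, 270), KeyEvent('E', 430, 500)]
print(keystroke_latencies(session))   # [120, 160]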

2.2. Data processing

First, for all keystrokes, the type of key was derived: letter, number, browse key (e.g., LEFT, HOME), punctuation key, correction key (BACKSPACE, DELETE), and other (e.g., F12). The time between a key release and the subsequent key press (keystroke latency or key pause time) was calculated. Thereafter, the type of pause between the keys was derived using the key types. Six pause types were identified: pause before word (after SPACE, before letter or number), within word (between letters or numbers), after word (after letter or number, before SPACE), before correction, within correction, and after correction.

Lastly, words were identified as the letters and numbers between two SPACE keys. For all words the word length (number of letters and numbers) and the word time were calculated (time from the key press of the first character until the key release of the last character). For further analyses on the word length and word time, only words without corrections were included. The use of corrections within a word would have a significant influence on the time of typing. Additionally, since a BACKSPACE or DELETE key can be used to remove multiple characters, it is hard to determine word length if corrections are made within the word.

[Figure 1: a timing diagram of the keystrokes for "the book" (keys t, h, e, SPACE, d, o, two BACKSPACEs, b, o, o, k), with rows marking the pauses before, in, and after words and corrections, and the word timing.]

Figure 1. Measurement of timing of pauses of "the book" with two corrections using the backspace key (bs). "<" means before, ">" means after, "corr." stands for correction. The last row indicates the time of the word "the" (of length three characters). The word "book" is not used, as it contains corrections.

Figure 1 shows the measurement of the timing of the different types of pauses. Given that the writer types the words "the book" with two incorrect letters after the SPACE key ("do"), which are corrected using two BACKSPACEs, the key presses and releases per key are illustrated in the second row. The following rows each depict which periods between key releases and key presses are measured for that type of pause. For instance, the pause before the word "the" and the pause between the SPACE key and the letter "d" are counted as the pause before word type (third row). The last row indicates the timing used to compute the word length. In this case, the word "the" is identified (which has a length of three characters). The word "book" is not used, as it contains corrections.

After data enrichment, the three groups of features were computed: pause times, corrections, and word length. For all six different types of pause times (see Figure 1), the normalized average pause times were calculated by dividing the average pause time of each type by the overall average pause time over all types. Additionally, the normalized standard deviations of the pause times were calculated. In total, this resulted in 12 different features related to pause time. For the corrections, two features were calculated: the total number of corrections and the percentage of words with corrections. Lastly, four features related to word length were computed: the average time and standard deviation for short words (having less than four characters) and the average time and standard deviation for long words (consisting of between 9 and 13 characters). All four features were normalized using the average time and standard deviation of all words.

Obviously, the keystroke sequences contain much more information than what we extracted here. The selection of these features was made in order to reveal as little as possible about the actual text being typed. Actual key code information, for instance, is not used, as that information should be quite consistent between the fixed texts.

For the statistical analyses and training of the models, only data from the 36 participants who typed both fixed and free-form texts were included. From these, four texts were excluded, because they consisted of less than five words and inspection showed that these texts were random key strokes. Thereafter, 750 texts (338 fixed and 412 free-form) remained for analyses.

2.3. Analyses

To identify the relationship between keystroke information and cognitive load in writing tasks, two types of analyses were used: statistical analyses and model evaluation. First, paired t-tests were conducted to determine whether differences were found between the features in the fixed and free-form texts of participants.

Thereafter, support vector machines were trained to classify texts as being fixed or free-form. Support vector machines were trained for all combinations of the three feature groups (pause times, corrections, and word length), resulting in a total of seven models. The radial basis function was used as kernel ('svmRadial' from the 'caret' package in R (Kuhn, 2016)). The data was trained using 10-fold cross-validation. Grid search was used during training (with a tuning part held aside from the testing) to optimize the parameters σ and cost. The average accuracy was calculated as a performance measure. Since the groups were not equally distributed, the average κ was also calculated.
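The model evaluation described above was carried out with 'svmRadial' from R's caret package. As a language-neutral illustration of the same set-up, the sketch below uses Python's scikit-learn; the parameter grid and the dummy feature matrix are assumptions, and only the overall scheme (an RBF-kernel SVM, grid search, 10-fold cross-validation, accuracy and κ) mirrors the description in the text.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, cohen_kappa_score

# X: one row of keystroke features per text; y: 1 = free-form, 0 = fixed (dummy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(750, 18))
y = rng.integers(0, 2, size=750)

grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [0.25, 1, 4], 'gamma': ['scale', 0.01, 0.1]},
                    cv=10)                       # inner grid search for C and gamma
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(grid, X, y, cv=cv).mean()
kappa = cross_val_score(grid, X, y, cv=cv,
                        scoring=make_scorer(cohen_kappa_score)).mean()
print(accuracy, kappa)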

Table 1. Descriptive statistics and paired t-tests of features in fixed and free-form text (N = 36). * = p < .05, ** = p < .01, *** = p < .001.

                                            Fixed text          Free-form text
Feature                                     M        S.D.       M        S.D.
Total time ***                              376      100        432      129
# Keys **                                   703      54.6       749      57.5
# Corrections ***                           21.7     13.6       35.3     16.4
% Words corrected ***                       0.08     0.05       0.13     0.06
Average pause time before word **           1.17     0.18       1.25     0.18
S.D. pause time before word *               1.04     0.25       1.16     0.28
Average pause time within word ***          0.86     0.09       0.78     0.07
S.D. pause time within word ***             0.60     0.21       0.40     0.15
Average pause time after word **            0.91     0.15       0.84     0.17
S.D. pause time after word **               0.79     0.29       0.61     0.30
Average pause time before correction        2.49     0.76       2.40     1.52
S.D. pause time before correction *         1.50     0.55       1.22     0.51
Average pause time within correction        0.95     0.61       0.82     0.16
S.D. pause time within correction           0.48     0.30       0.44     0.24
Average pause time after correction         1.75     1.60       2.16     3.29
S.D. pause time after correction            1.14     0.29       1.03     0.30
Average short word time **                  0.52     0.05       0.48     0.09
S.D. short word time                        0.33     0.09       0.32     0.11
Average long word time                      2.87     0.26       2.98     0.71
S.D. long word time                         1.14     0.23       1.16     0.28

The κ corrects for random guessing, by comparing the observed accuracy with the expected accuracy (chance):

    κ = (observed accuracy − expected accuracy) / (1 − expected accuracy)

Additionally, a one-way ANOVA with Tukey post-hoc test was used to determine whether the models differed significantly in accuracy.

Lastly, since we focus on the writing process, the learned models should preferably not be able to classify personal writer-specific characteristics. Thus, the learned model should perform really badly when classifying writers. Therefore, as an additional evaluation, support vector machines were trained to classify the writers. The best σ and cost values from the models classifying fixed versus free-form text were used. Again, the average accuracy and κs were calculated using the same folds in 10-fold cross-validation.

3. Features measuring cognitive load

Paired t-tests were used to determine which features differed significantly between fixed and free-form text created by the same writer. This is assumed to provide insight into which features are indicative of cognitive load. The results can be found in Table 1. Note that we use both the mean as well as the standard deviation (S.D.) within a text as features (and both types of features have their own standard deviations per document). In the table, the descriptive statistics of these features can also be found.

It was found that fixed texts consisted of significantly fewer keystrokes compared to free-form texts (703 versus 749). Although the fixed text consisted of 652 characters, the mean number of keystrokes was 703. This can partly be explained by the fact that sometimes multiple keys are needed to type one character (e.g., SHIFT + character to type a capital letter). Additionally, this can indicate that the participants made typos and fixed those, requiring BACKSPACE or DELETE keystrokes. Indeed, it was shown that corrections were made in 740 of the 750 sessions. The free-form texts contained more corrections and a higher percentage of words with corrections, compared to the fixed texts. Lastly, the participants were faster in typing the fixed text compared to the free-form text. All these findings provide some evidence that typing the free-form text requires a different cognitive load.

Several features were analyzed to determine where significant differences in pause duration between the text types were found. Specifically, we investigated the differences between the pauses before, after, and within words and corrections. Since the free-form and fixed texts differed in total length and time, the timing of key pauses was normalized based on the average time per key pause in that session. It was found that writers
spend more of their writing time on pauses before a


Table 2. Accuracies and κs of support vector machine mod-
word and less time on pauses within a word or after a
els on the different feature groups that classify fixed versus
word in free-form text, compared to fixed texts. Addi- free-form text.
tionally, the standard deviation of pauses within and
after a word was significantly lower for free-form texts
Fixed vs. free
compared to fixed texts. For the pauses before words, feature group Accuracy κ
the opposite was found: when typing free-form text,
a larger proportion of pause time was spent before a Pauses 0.739 0.465
word, compared to fixed texts. Moreover, the stan- Correction 0.689 0.368
dard deviations were larger. This may be because the Words 0.687 0.370
writer will need to think (longer) about which word to Pauses, Correction 0.763 0.513
Correction, Words 0.767 0.522
type in free-form text, which is not needed for fixed
Pauses, Words 0.739 0.465
texts. Pauses, Correction, Words 0.781 0.551
For the key pause times before, after, and within cor-
rections, no significant differences were found between
free-form and fixed texts. The only exception is the Table 3. Accuracies and κs of support vector machine mod-
standard deviation of key pause time before correc- els on the different feature groups that classify writers (36
tions: free-form texts lead to a larger standard devi- classes).
ation for key pause time before corrections compared
to fixed texts. Writer
Feature group Accuracy κ
When comparing the average word time between the
two types of text, participants were faster in typing Pauses 0.248 0.223
short words (consisting of less than four letters) in Correction 0.091 0.065
free-form text compared to fixed text. This indicates Words 0.073 0.046
that in free-form text, of all words, less time is de- Pauses, Correction 0.311 0.287
voted to short words, compared to fixed text. No sig- Correction, Words 0.121 0.095
Pauses, Words 0.239 0.215
nificant differences were found between fixed and free- Pauses, Correction, Words 0.291 0.267
form texts for the average word time for long words
(8–13 letters). Additionally, no significant differences
were found in the standard deviation of time on short
and long words between the text types. all feature groups yielded the overall highest accuracy:
78.1% with a κ of 0.551. A one-way ANOVA showed
that the seven models differed significantly in accuracy
4. Model evaluation (F (6, 63) = 4.728, p < .001). Using two feature groups
4.1. Classifying fixed versus free-form text was always better than using only correction features
or word length features. However, the combination of
To measure the effect of the different groups of fea- all feature groups did not lead to a significantly higher
tures, we trained support vector machines and mea- accuracy than the pause time features alone. Thus, the
sured how well they could distinguish between a fixed word length and the corrections features did not have
and a free-form text. The models were trained using all much additional value next to the pause time features
combinations of three different feature groups: average for classifying fixed versus free-form text.
and standard deviations of the pause times (Pauses);
correction information (Correction); and average and 4.2. Classifying writers
standard deviation of short and long word typing time
(Words). The accuracies and κs of all seven models To determine whether the learned models did not in-
for classifying fixed and free-form text can be found in clude any writer-specific characteristics, models with
Table 2. the same settings as the models that classify text types
were trained and tested to classify writers. The results
The results show that all feature groups are useful for of these experiments can be found in Table 3. Simi-
the classification of fixed versus free-form text. The larly to the models classifying fixed versus free-format
feature group of the key pause times led to best ac- text, the key pause time features led to a higher accu-
curacy (73.9% with a κ of 0.465, or approximately racy (24.8% and κ = 0.223) than the correction and
47% above chance), compared to the other individual word length features. The correction and word length
feature groups. Not surprisingly, the combination of features resulted in the lowest accuracies: 9.1% and

7.3%, respectively. The model with both correction tion task. Adding the corrections and word length
and pause time features led to the highest accuracy: features did not lead to significantly higher accuracies,
31.1% with a κ of 0.267. Although this is a reason- showing that the word length and corrections features
ably low accuracy, the model clearly outperforms the do not add much information in addition to the pause
model based on chance for the 36-class classification times. When all feature groups were included, 78.1%
(1/36 = 2.8%). This means that some writer-specific accuracy was reached. Although this accuracy is rea-
characteristics are encoded within the feature groups sonably high, it also shows there is still some room for
that are used in these models, which in this case is improvement. Especially considering that classifying
unwanted. writers, being a more complex classification problem
with more classes, have shown accuracies up to 99%
A one-way ANOVA showed that the seven models dif-
(Tappert et al., 2009).
fered significantly in accuracy (F (6, 63) = 37.43, p <
.001). Interestingly, the accuracies for the corrections Since we aimed to identify features related to the writ-
and words feature groups combined are significantly ing process, we wanted to exclude writer-specific in-
lower than the accuracy of all models that include formation. In other words, the models should perform
pause time features. This indicates that the correc- badly when classifying the writer. To test this, we
tions and word length features include fewer writer- tried to classify the writers with the same settings as
specific characteristics than the pause time features. used in the writing task classification for the support
vector machine models. The lowest accuracy, while
5. Discussion using information from the keystroke sequences were
found when using word length features only (7.3%).
Keystroke data include both writer-specific informa- This corresponds to an accuracy of 68.7% on the text
tion and information about the writing processes. In type classification task. The highest accuracy (31.1%)
this study, we focused on the writing processes and was obtained with both pause time and correction fea-
aimed to identify properties of keystrokes that indi- tures (corresponding to 76.3% accuracy on the text
cate the cognitive load of the writing process. In order type classification task, which is close to the highest
to do this, keystrokes of two different writing tasks accuracy on that task: 78.1%). Even though the accu-
were analyzed, which are assumed to differ in cogni- racies on writer classification are higher than chance,
tive load: copying a text (fixed text) and writing a it is much lower than the 90%–99% accuracy reached
free-form text. in other studies (Longi et al., 2015; Tappert et al.,
2009). Thus, the feature groups that have been ex-
Our first analysis showed that several features ex-
tracted, actually contain mostly information related
tracted from the keystroke data differed significantly
to the writing task and not to the writer-specific char-
between the fixed and free-form texts of a writer.
acteristics.
These findings support previous work which showed
that keystrokes differ for different (types of) text en- Interestingly, especially the corrections and word
tered (Gunetti & Picardi, 2005; Tappert et al., 2009). length features showed low accuracies on classifying
As an extension, we also identified which features dif- writers. Thus, these feature groups contained little
fered and how these differed. When typing free-form information about individual typing characteristics.
text, the pauses before a word were longer, while the Adding additional information to improve the qual-
pauses within or after a word were shorter, compared ity of the text type classification task, also increases
to typing fixed text. This might indicate that the par- accuracy of the writer classification task. For example,
ticipants were thinking about the next word to type in if we add the key pause features, the accuracy of the
the free-form text before they typed the word, while text type task increases, but the writer identification
writers in the fixed text situation could immediately accuracies also increase. In other words, key pause
copy it as it was provided for them. Thus, differences properties contain useful information for the text type
in cognitive load may be identified in the pauses before classification task, but also contain information that
words. allows for the identification of the writer, which is un-
wanted in this case.
As an evaluation, we showed that the differences in
keystroke information can be used to classify fixed and There are at least three directions for future work.
free-form text. Using a support vector machine, the Firstly, future work could try to improve the accu-
key pause time features (which measure time spent racy on task classification, while not improving the
between key releases and key presses) were found to accuracy on writer identification. Additional features
lead to the highest accuracies for the text identifica- or feature groups could be identified, such as bursts

ending in a pause (P-bursts) or ending in a revision References


(R-bursts) (Baaijen et al., 2012). In addition, the key
Baaijen, V. M., Galbraith, D., & de Glopper, K.
pause feature group seems to contain useful informa-
(2012). Keystroke analysis: Reflections on proce-
tion, but these features should be modified in order to
dures and measures. Written Communication, 29,
remove any writer-specific properties. Alternatively,
246–277.
other machine learning algorithms, such as neural net-
works, may be tried to achieve higher accuracies. Bixler, R., & D’Mello, S. (2013). Detecting boredom
Secondly, a wider range of writing tasks could be con- and engagement during writing with keystroke anal-
sidered. For example, semi-fixed tasks with specific ysis, task appraisals, and stable traits. Proceedings
task descriptions (e.g., writing a sorting algorithm) of the 2013 international conference on Intelligent
can be investigated to determine whether differences user interfaces (pp. 225–234).
between tasks that are more similar can also be dis-
Gunetti, D., & Picardi, C. (2005). Keystroke analysis
tinguished. In this way, we may identify which tasks
of free text. ACM Transactions on Information and
require more cognitive load and in which properties of
System Security (TISSEC), 8, 312–347.
the process of typing this effort can be found. This
information can be used to improve the writing task Karnan, M., Akila, M., & Krishnaraj, N. (2011). Bio-
instruction or to provide feedback on the writing pro- metric personal authentication using keystroke dy-
cess to the learner during the writing task (see also namics: A review. Applied Soft Computing, 11,
Poncin et al., 2011; Kiesmueller et al., 2010). 1565–1573.
Lastly, this study assumed that the differences in
keystrokes provide an indication of cognitive load. Kiesmueller, U., Sossalla, S., Brinda, T., & Riedham-
However, we did not actually measure the cognitive mer, K. (2010). Online identification of learner
load. Future work could explicitly measure the cog- problem solving strategies using pattern recognition
nitive load during the task, for example by using a methods. Proceedings of the fifteenth annual con-
secondary task, or a questionnaire (Paas et al., 2003). ference on Innovation and technology in computer
In this way, the problem could be approached as a science education (pp. 274–278).
regression problem rather than a classification task.
Kuhn, M. (2016). caret: Classification and regression
training. R package version 6.0-73.
6. Conclusion
Leijten, M., & Van Waes, L. (2013). Keystroke log-
To conclude, this research has shown that keystroke ging in writing research: Using inputlog to analyze
data can be used to identify differences in writing and visualize writing processes. Written Communi-
tasks, which we believe require different cognitive load. cation, 30, 358–392.
Additionally, we showed which feature groups (key
pause, correction, and word length) have an influence Longi, K., Leinonen, J., Nygren, H., Salmi, J., Klami,
on the performance. In particular, the word length and A., & Vihavainen, A. (2015). Identification of pro-
correction feature groups led to a good performance on grammers from typing patterns. Proceedings of the
the writing task classification task and a low perfor- 15th Koli Calling Conference on Computing Educa-
mance on the writer identification task. The key pause tion Research (pp. 60–67).
feature group increases the performance on the text
classification task, but the performance on the writer Monaco, J. V., Bakelman, N., Cha, S.-H., & Tap-
classification task also increases (which is unwanted, pert, C. C. (2012). Developing a keystroke biomet-
as the key pause features also include writer-specific ric system for continual authentication of computer
properties). users. Intelligence and Security Informatics Confer-
ence (EISIC), 2012 European (pp. 210–216).
Having insight in these features which identify writing
processes and cognitive load can be useful for improv- Paas, F., Tuovinen, J. E., Tabbers, H., & Van Ger-
ing learning and teaching. For example, in this way, ven, P. W. (2003). Cognitive load measurement as a
teachers can also get insight in the writing process, means to advance cognitive load theory. Educational
instead of only the product of writing. This can be psychologist, 38, 63–71.
useful for adapting the course materials or providing
(personalized) feedback. Poncin, W., Serebrenik, A., & van den Brand, M.
(2011). Mining student capstone projects with frasr
and prom. Proceedings of the ACM international

34
Identifying writing tasks using sequences of keystrokes

conference companion on Object oriented program-


ming systems languages and applications companion
(pp. 87–96).
Romero, C., & Ventura, S. (2013). Data mining in
education. Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, 3, 12–27.
Tappert, C. C., Villani, M., & Cha, S.-H. (2009).
Keystroke biometric identification and authentica-
tion on long-text input. Behavioral biometrics for
human identification: Intelligent applications, 342–
367.
Thomas, R. C., Karahasanovic, A., & Kennedy, G. E.
(2005). An investigation into keystroke latency met-
rics as an indicator of programming performance.
Proceedings of the 7th Australasian conference on
Computing education-Volume 42 (pp. 127–134).

35
Increasing the Margin in Support Vector Machines
through Hyperplane Folding

Lars Lundberg [email protected]


Håkan Lennerstad [email protected]
Eva Garcia-Martin [email protected]
Niklas Lavesson [email protected]
Veselka Boeva [email protected]
Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden

Keywords: support vector machines, geometric margin, hyperplane folding, hyperplane hinging, piecewise linear classification.

Abstract

We present a method, called hyperplane folding, that increases the margin in a linearly separable binary dataset by replacing the SVM hyperplane with a set of hinging hyperplanes. Based on the location of the support vectors, the method splits the dataset into two parts, rotates one part of the dataset and then merges the two parts again. This procedure increases the margin in each iteration as long as the margin is smaller than half of the shortest distance between any pair of data points from the two different classes. We provide an algorithm for the general case with n-dimensional data points. A small experiment with three folding iterations on 2-dimensional data points shows that the margin does indeed increase and that the accuracy improves with a larger margin, i.e., the number of misclassified data points decreases when we use hyperplane folding. The method can use any standard SVM implementation plus some additional basic manipulation of the data points, i.e., splitting, rotating and merging.

1. Introduction

Support Vector Machines (SVMs) find the separating hyperplane that maximizes the margin to the data points closest to the hyperplane. The main idea with SVM is that, compared to a small margin, a large margin reduces the risk of misclassification of unknown points. In this paper, we present a method that in many cases increases the SVM margin in a linearly separable dataset.

Consider a set of points S ⊂ Rn that can be separated linearly in two subsets S+ and S−. By choosing a hyperplane P that maximizes min_{v∈S} dist(v, P), i.e., the minimal distance to all points in S, we obtain a separating hyperplane that has the same distance to all points in a certain set V− ∪ V+, where V− ⊂ S− and V+ ⊂ S+. The set V− ∪ V+ is the set of support vectors, where both V− and V+ are non-empty. The margin of this hyperplane P is m(P) = dist(v, P) for any support vector v ∈ V− ∪ V+. The largest possible margin for any separating surface, linear or not, is trivially M(S) = min_{v∈S+, u∈S−} |v − u|/2. For any surface we define the margin as m(P) = min_{v∈S} dist(v, P)/2. A surface that fulfills m(P) = M(S) is called a maximal margin surface. If the set of support vectors V− ∪ V+ contains two elements only, we have m(P) = M(S), and the starting hyperplane is a maximal margin surface. This case can happen also for more than two support vectors, but it is not the generic case.

In the generic case we have 3 ≤ |V− ∪ V+| ≤ n+1, for which we construct a separating folded surface, composited by different hyperplanes, where the margin is normally larger than m(P).

The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 introduces a hyperplane folding algorithm for the 2-dimensional case. Section 4 discusses higher dimensions and proposes the algorithm for the general case. Section 5 presents the initial evaluation of the proposed algorithm for the 2-dimensional case and further discussion. Section 6 is devoted to conclusions and future work.
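To make the two quantities defined above concrete, the following small sketch (ours, not the authors' code) computes m(P) for a hyperplane fitted with scikit-learn's SVC, using a large C as a stand-in for the hard-margin SVM, and the upper bound M(S); the toy data and all names are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.2],
              [3.0, 3.0], [3.5, 2.4], [2.6, 3.8]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# m(P): distance from the separating hyperplane w.x + b = 0 to the closest point of S.
m_P = np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# M(S): half of the shortest distance between any pair of points from different
# classes -- an upper bound on the margin of any separating surface.
diffs = X[y == 1][:, None, :] - X[y == -1][None, :, :]
M_S = np.linalg.norm(diffs, axis=2).min() / 2.0

print(f"m(P) = {m_P:.3f} <= M(S) = {M_S:.3f}")

Hyperplane folding, described below, aims to push m(P) towards this upper bound M(S).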

2. Related Work

Support vector machines (SVMs) originate from research conducted in statistical learning theory (Vapnik, 1995). SVMs are used in supervised learning, where a training set consisting of n-dimensional data points with known class labels is used for predicting classes in a test set consisting of n-dimensional data points with unknown classes.

Rigorous statistical bounds are given for the generalisation of hard margin SVMs (Bartlett, 1998; Shawe-Taylor et al., 1998). Moreover, statistical bounds are also given for the generalisation of soft margin SVMs and for regression (Shawe-Taylor & Cristianini, 2000).

There has been work on different types of piecewise linear classifiers based on the SVM concept. These methods split the separating hyperplane into a set of hinging hyperplanes (Wang & Sun, 2005). In (Yujian et al., 2011), the authors define an algorithm that uses hinging hyperplanes to separate nonintersecting classes with a multiconlitron, which is a combination of a number of conlitrons, where each conlitron separates "convexly separable" classes. The multiconlitron method cannot benefit directly from experience and implementations of SVMs. Conlitrons and multiconlitrons need to be constructed with new and complex techniques (Li et al., 2014). The hyperplane folding approach presented here is a direct extension of the standard (hard margin) SVM (soft margin extensions are relatively direct and discussed later). As a consequence, hyperplane folding can benefit from existing SVM experience and implementations.

A piecewise linear SVM classifier is presented in (Huang et al., 2013). That method splits the feature space into a number of polyhedra and calculates one hinging hyperplane for each such polyhedron. Some divisions of the feature space will increase the margin in hard margin SVMs. However, unless one has detailed domain knowledge, there is no way to determine which polyhedra to select to improve the margin. The authors recommend basic approaches like equidistantly dividing the input space into hypercubes or using random sizes and shapes for the polyhedra. Based on the support vectors, the hyperplane folding method splits the feature space into two parts (i.e., into two polyhedra) in each iteration. Without any domain knowledge, the method guarantees that the split results in an increase of the margin (except for very special cases). As discussed above, hyperplane folding can directly benefit from existing SVM experience and implementations, which is not the case for the method presented by Huang et al.

As stated in (Cortes & Vapnik, 1995), SVMs combine three ideas: the idea of optimal separating hyperplanes, i.e., hyperplanes with the maximal margin, the idea of so-called kernel tricks that extend the solution space to include cases that are not linearly separable, and the notion of so-called soft margins to allow for errors in the training set.

If we assume that a data point in the test set can be at most a distance x from a data point belonging to the same class in the training set, then it is clear that we can only guarantee a correct classification as long as x is smaller than the margin. As a consequence, the optimality of the separating hyperplane guarantees that we will correctly classify any data point in the test set for a maximum x. This is very similar to error correcting codes that maximize the distance between any pair of code words (Lin & Costello, 1983). The hyperplane folding approach presented here increases the margin, thus guaranteeing correct classification of test set data points for larger x.

SVMs are also connected to data compression codes. In (von Luxburg et al., 2004), the authors suggest five data compression codes that use an SVM hyperplane approach when transferring information from a sender to a receiver. The authors show that a larger margin in the SVM leads to higher data compression, and that the data compression can be improved by exploring the geometric properties of the training set, e.g., if the data points in the training set are shaped as an ellipsoid rather than a sphere. The hyperplane folding approach also uses geometric properties of the training set to improve the margin.

The kernel trick maps a data point in n dimensions to a data point in a (much) higher dimension, thus increasing the possibility to linearly separate the data points (Hofmann et al., 2008). The hyperplane folding approach presented here does some remapping of data points but it does not change the dimension of the data points.

3. Hyperplane Folding

In this section, we introduce the hyperplane folding algorithm for the 2-dimensional case. Higher dimensions will be discussed in Section 4.

Let us consider a standard SVM for a fully separable binary classification set S with a separating hyperplane (the thick blue line in Figure 1) and a margin d. If we assume that each data point is represented with (very) high resolution, the probability of having more than three support vectors is arbitrarily close to zero in the 2-dimensional case. Therefore, in the current context we only need to consider the cases with two or three support vectors.

As was mentioned in the introduction, we consider only the case |V− ∪ V+| = 3 in the 2-dimensional scenario, because we already have a maximal margin if |V− ∪ V+| = 2. Without loss of generality we assume that |V+| = 2 and |V−| = 1, and we refer to the point in V− as the primary support vector.

As a first step in our method, we split the dataset into two parts by identifying a splitting hyperplane, which in two dimensions is the line that is normal to the hyperplane and that passes through the prime support vector (see Figure 2). When splitting into two datasets, the prime support vector is included in both parts of the dataset.

The two parts of the dataset define one SVM each (see Figure 3), producing one separating hyperplane for each part of the dataset, where both margins are normally larger than the initial margin. We assume that the two new hyperplanes intersect with an angle α.

Figure 1. A separable dataset with three support vectors.

Figure 2. Splitting the dataset into two parts.

The folding takes place in the second step of our method, where all points in one of the two data sets are rotated the angle α around the intersection point, aligning the two new separating hyperplanes. We rotate the part with the largest margin. In Figure 4 we show the case when we rotate the data points in the right part.

After rotating, the data points in the two parts are joined into a new dataset of the same size as the original dataset. The new margin is the smallest of the margins of the two separate sets, which is larger than the initial margin, and has in general other support vectors than before (see Figure 5). If there are three support vectors in the new dataset, the same procedure can be repeated.

Figure 3. The two SVMs after splitting the dataset.

A detailed explanation of the proposed hyperplane folding algorithm for the 2-dimensional case is given below.

Algorithm: Hyperplane Folding for the 2-dimensional Case

1: Run the standard SVM algorithm on the dataset S ⊂ R2. The output of the SVM algorithm is: the equation of a separating hyperplane, the support vectors V− ∪ V+, and the margin d.
2: if |V− ∪ V+| = 2 then terminate (1)
3: if the determined number of hyperplane folding iterations is reached then terminate
4: Select the primary support vector (one from the class with only one support vector).
5: Split S along a line (a splitting hyperplane) that is orthogonal to the separating hyperplane and that passes through the primary support vector.
6: Duplicate the primary support vector to both subsets of the data points.
7: Run the standard SVM algorithm separately on the two subsets, thus obtaining two separating hyperplanes.
8: Calculate the angle α between the two separating hyperplanes, and the intersection point between them, i.e. the folding point.
9: Remove the primary support vector from the subset with the largest margin.
10: Rotate the remaining data points in that subset an angle α around the folding point.
11: Merge the two subsets. The new splitting hyperplane has a larger margin than d.
12: goto step 1

(1) In the current context we assume that there can only be two or three support vectors. The case with more support vectors is discussed in Section 4.
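The following sketch illustrates one iteration of the 2-dimensional algorithm above. It is our own illustrative implementation, not the authors' code: scikit-learn's SVC with a large C approximates the hard-margin SVM, and the helper names (fit_linear_svm, fold_once) are assumptions of this sketch.

import numpy as np
from sklearn.svm import SVC

def fit_linear_svm(X, y):
    """Fit an (approximately) hard-margin linear SVM; return (clf, w, b, margin)."""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    return clf, w, b, 1.0 / np.linalg.norm(w)

def fold_once(X, y):
    """One folding iteration on a linearly separable 2-D dataset (steps 1-12)."""
    clf, w, b, margin = fit_linear_svm(X, y)
    sv_idx, sv_lab = clf.support_, y[clf.support_]
    classes, counts = np.unique(sv_lab, return_counts=True)
    primary_class = classes[np.argmin(counts)]      # the class with one support vector
    primary_idx = sv_idx[sv_lab == primary_class][0]
    primary = X[primary_idx]

    # Step 5: split along the line through the primary support vector that is
    # orthogonal to the separating hyperplane (sides via a direction parallel to it).
    along = np.array([-w[1], w[0]])
    side = (X - primary) @ along >= 0

    # Steps 6-7: duplicate the primary support vector and fit one SVM per subset.
    parts = []
    for mask, sign in ((side, 1), (~side, -1)):
        Xp = np.vstack([X[mask], primary])
        yp = np.append(y[mask], primary_class)
        _, wp, bp, mp = fit_linear_svm(Xp, yp)
        parts.append({"mask": mask, "sign": sign, "w": wp, "b": bp, "margin": mp})
    big, small = sorted(parts, key=lambda p: -p["margin"])

    # Step 8: folding point = intersection of the two new separating lines.
    fold_pt = np.linalg.solve(np.vstack([big["w"], small["w"]]),
                              -np.array([big["b"], small["b"]]))
    # Signed angle that rotates the larger-margin hyperplane onto the other one.
    wa, wb = big["w"], small["w"]
    alpha = np.arctan2(wa[0] * wb[1] - wa[1] * wb[0], wa @ wb)

    # Steps 9-11: rotate the larger-margin part (minus the primary support vector)
    # by alpha around the folding point and merge with the other part.
    rot = big["mask"].copy()
    rot[primary_idx] = False
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    X_new = X.copy()
    X_new[rot] = (X[rot] - fold_pt) @ R.T + fold_pt
    fold_record = {"primary": primary, "along": along, "rotated_side": big["sign"],
                   "alpha": alpha, "fold_pt": fold_pt}
    return X_new, y, fold_record

Repeating fold_once on its own output corresponds to further folding iterations (step 12); the returned fold_record keeps the split line, the rotated side, the angle and the folding point, which are needed later to classify unknown points.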
Figure 4. Rotating the data points in the right part of the dataset. Red data points in the original dataset are rotated with an angle α, counter clockwise, to new locations.

Figure 5. The dataset after rotation. The new dataset has |V+| = 1 and |V−| = 2 and a larger margin than the original dataset.

4. Higher Dimensions

In this section, we discuss higher dimensions. If we assume that each point is represented with (very) high resolution, there will in general be at most n+1 support vectors in an n-dimensional space, |V−| ≥ 1 and |V+| ≥ 1.

We start by considering the case with three dimensions, i.e., n = 3. Again, we cannot do anything if we only have two support vectors, because then the starting hyperplane has maximal margin. In the case of three or four support vectors we can, however, increase the margin except in special cases. In order to simplify the discussion we assume that the separating hyperplane is parallel to the xz-plane, which can be achieved by rotating the data set.

In the case |V− ∪ V+| = 4 there are three different cases: {|V+| = 3, |V−| = 1}, or {|V+| = 2, |V−| = 2}, or {|V+| = 1, |V−| = 3}. In either case we consider a line that passes through one pair of support vectors from the same class. This line is parallel to the separating hyperplane, since all support vectors have the same distance to this hyperplane. Then we rotate the data set around the y-axis so that this line is parallel to the z-axis.

Now we disregard the z-components of the points, i.e., consider the points as projected onto the x,y-plane. Obviously, the two support vectors from V+ will be projected on the same point, thus resulting in a situation with three support vectors in the x,y-plane; two from V− and one (merged) from V+. Having projected the support vectors on the x,y-plane, we use the same method of rotation as in the previous section, which does not affect the z-component of the points. Again we have produced a separating hyperplane with a larger margin.

For n > 3 dimensions we again cannot do anything if we only have two support vectors. In the case of 3, 4, ..., n+1 support vectors we can, however, increase the margin, except in special cases. In order to simplify the discussion we can, by rotation, assume that the separating hyperplane is orthogonal to the x,y-plane.

If we have only three support vectors, we can directly project all data points on the x,y-plane by disregarding the coordinates x3, ..., xn for all points, perform the algorithm for n = 2, and then resubstitute the coordinates x3, ..., xn for all points, similarly to the case n = 3.

Now consider |V− ∪ V+| = k for 4 ≤ k ≤ n + 1. We choose either V− or V+, whichever contains at least two points. We construct a line between two of the points in the set and rotate the data points so that this line becomes parallel with a base vector of dimension n, keeping the hyperplane orthogonal to the x,y-plane. We then disregard the n-th coordinate, thus projecting orthogonally the points from Rn to Rn−1 and the separating hyperplane to a hyperplane with n-2 dimensions in Rn−1. Since two support vectors in the n-dimensional space are mapped to the same point in the (n-1)-dimensional space, there are now n support vectors in the (n-1)-dimensional space. This procedure can be

repeated until we reach three support vectors, where we use the method described in the previous section.

This produces a new data set with a separating hyperplane that has a larger margin, in general. If this hyperplane is not a maximum margin surface, the procedure can be repeated on the new data set to increase the margin further until the margin is as close to M(S) as desired.

In the end we may perform the inverses of all transformations in order to regain the initial data set with a separating surface that consists of folded hyperplanes, which has a larger margin except in special cases.

Based on the discussion above, the algorithm for the general case with n dimensions is given below.

Algorithm: Hyperplane Folding for the General Case

1: Run the standard SVM algorithm on the dataset S ⊂ Rn (n > 2).
2: if k = 2 then terminate (2)
3: if the determined number of hyperplane folding iterations is reached then terminate
4: Rotate S so that all support vectors have value zero in dimensions higher than or equal to k.
5: Temporarily remove dimensions ≥ k from all points in S. If k = n + 1, no dimensions are removed. If k = n, one dimension is removed, and so on.
6: if k = 3 then goto step 12
7: Select two support vectors v1 and v2 from the same class (V+ or V−).
8: Rotate S so that the values in dimensions 1 to k − 1 are the same for v1 and v2.
9: Remove temporarily dimension k from all points in S. (3)
10: k = k − 1
11: goto step 6
12: Run the 2-dimensional algorithm presented in Section 3 for one iteration.
13: Expand the dimensions back one by one in reverse order and do the inverse rotations.
14: goto step 1

(2) k denotes the number of support vectors, i.e., k = |V− ∪ V+|.
(3) This means that v1 and v2 are now mapped to the same support vector.

Regarding the computational complexity of the general case hyperplane folding algorithm we have the following. Initially, we must run the standard SVM algorithm on the considered dataset, which implies O(max(m, n) min(m, n)^2) complexity according to (Chapelle, 2007), where m is the number of instances and n is the number of dimensions (attributes). Steps 6 to 11 in the algorithm form a loop with n − 2 iterations. In each iteration we rotate the dataset. One data point rotation requires n^2 multiplications. This means that the computational complexity of this part is O(mn^3). In step 12 we run the 2-dimensional algorithm, which has complexity O(m). In step 13 we rotate back in reverse order, which has complexity O(mn^3). This means that the computational complexity for one fold operation is O(max(m, n) min(m, n)^2) + O(mn^3) + O(m) + O(mn^3), i.e. it can be simplified to O(max(m, n) min(m, n)^2) + O(mn^3). If m > n, we have O(mn^2) + O(mn^3) = O(mn^3). If n > m, we have O(m^2 n) + O(mn^3) = O(mn^3), i.e., the total computational complexity for one fold operation is O(mn^3).

5. Initial Evaluation and Discussion

In order to get a better understanding of how the proposed hyperplane folding method works, we have conducted an experiment for the two-dimensional case. We implemented our algorithm in Python 2.7 using the Scikit-learn library (4), the Numpy library (5), and the Pandas library (6).

(4) http://scikit-learn.org/
(5) http://www.numpy.org/
(6) http://pandas.pydata.org/

5.1. Data

We have generated n circles with synthetic data points (n = 4, 5, 6): ⌈n/2⌉ circles from S+ and ⌊n/2⌋ circles from S−, respectively. The circles are numbered 1 to n: odd-numbered circles contain points from S+, and even-numbered circles contain points from S−.

5.2. Experiment

Initially, we have studied how the margin in the SVM is influenced by our hyperplane folding method. For this purpose we have generated the following data set. Each circle i (i = 1, 2, ..., 6) is centered at (100+100i, 100+101(i mod 2)), e.g., circle number 3 is centered at (400, 201). The radius of each circle is set to 50 and 100 data points at random locations within each circle are generated. The generated data set is linearly separable, since the distance in the y-dimension between S+-circles and S−-circles is 101 and, in addition, the radius of each circle is 50. This is dataset 0.

Table 1. Margin and accuracy for the different SVMs. Each value is the average of 9 executions.

SVM number | 4 circles           | 5 circles           | 6 circles
           | Margin   Accuracy   | Margin   Accuracy   | Margin   Accuracy
0          | 13.2     95.2%      | 7.8      94.9%      | 9.3      95.1%
1          | 15.9     96.8%      | 11.0     95.7%      | 11.3     95.5%
2          | 27.3     98.0%      | 12.3     96.4%      | 13.4     96.0%
3          | 33.8     98.4%      | 16.9     97.0%      | 14.5     96.2%

SVM 0 is obtained based on dataset 0. Then the hyperplane folding algorithm defined in Section 3 is used to create dataset i (i = 1, 2, 3) and the corresponding SVM i. Namely, dataset i (i = 1, 2, 3) is obtained by running our 2D algorithm on dataset i - 1. The corresponding SVM i (i = 1, 2, 3) is obtained based on dataset i.
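As a concrete illustration of the synthetic data described above, the following sketch (ours, not the original Python 2.7 code) generates such a dataset with NumPy; the function name and the random seed are assumptions of this illustration.

import numpy as np

def make_circle_dataset(n_circles=6, radius=50.0, points_per_circle=100, seed=0):
    """Circle i is centered at (100+100i, 100+101(i mod 2)); odd circles are S+ (+1),
    even circles are S- (-1); points are drawn uniformly within each circle."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(1, n_circles + 1):
        center = np.array([100.0 + 100.0 * i, 100.0 + 101.0 * (i % 2)])
        r = radius * np.sqrt(rng.random(points_per_circle))   # uniform in the disc
        phi = 2 * np.pi * rng.random(points_per_circle)
        pts = center + np.column_stack([r * np.cos(phi), r * np.sin(phi)])
        X.append(pts)
        y.append(np.full(points_per_circle, 1 if i % 2 == 1 else -1))
    return np.vstack(X), np.concatenate(y)

X0, y0 = make_circle_dataset()   # corresponds to "dataset 0" above

Test sets of the kind used in the second phase of the evaluation can be generated in the same way by changing the centers, the radius and the number of points per circle.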
We calculate the margin for each SVM i (i = 0, 1, 2, 3) (see Table 1).

The proposed method includes rotations of parts of the dataset. This means that we also need to rotate an unknown data point if we want to use the SVM for classification of unknown points. In the 2D case it is simply necessary to remember the line used for splitting the data points into two parts and the rotation angle α. In order to classify an unknown data point we first check if the data point belongs to the part that was rotated in the first iteration. In that case we rotate the unknown data point with the angle of the first rotation. We then check if the unknown data point (that now may have been rotated) is in the part that was rotated in the second iteration. In that case we rotate the unknown data point with the angle of the second rotation, and so on. When all rotations are done we use the final SVM, i.e., the SVM we get after the 2D algorithm has terminated, to classify the data point in the usual way.
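The classification procedure just described could be sketched as follows; the record structure for the stored splits and rotations (the same fields returned by the earlier fold_once sketch) and the function name are our own assumptions, not the paper's implementation.

import numpy as np

def replay_and_classify(x, folds, final_clf):
    """x: a 2-D point; folds: one record per folding iteration with keys
    'primary' (split point), 'along' (direction parallel to the hyperplane),
    'rotated_side' (+1/-1), 'alpha' (angle) and 'fold_pt' (rotation centre);
    final_clf: the SVM obtained after the last iteration."""
    x = np.asarray(x, dtype=float)
    for f in folds:
        side = 1 if (x - f["primary"]) @ f["along"] >= 0 else -1
        if side == f["rotated_side"]:
            c, s = np.cos(f["alpha"]), np.sin(f["alpha"])
            R = np.array([[c, -s], [s, c]])
            x = (x - f["fold_pt"]) @ R.T + f["fold_pt"]
    return final_clf.predict(x.reshape(1, -1))[0]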
At the second phase of our evaluation, we have studied how the hyperplane folding method affects the classification accuracy of the SVM. For this purpose we have generated another data set. Each circle i (i = 1, 2, ..., 6) is centered at (100i, 100+101(i mod 2)). Then we set the radius of each circle to 75 and generate 1000 data points at random locations within each circle. This is test set 0. It is clear that some data points in test set 0 will be misclassified by SVM 0, since the radius for each circle is now 75. Table 1 shows the accuracy when classifying test set 0 using SVM 0, e.g., it is 95.2% for the case with 4 circles.

Then the data points in test set 0 are rotated in the same way as was done when dataset 1 was obtained from dataset 0. This generates test set 1. The accuracy when classifying test set 1 using SVM 1 can be seen in Table 1, e.g., it is 96.8% for the case with 4 circles. We then continue rotating in the same fashion in order to obtain test set i (i = 2, 3) from test set i - 1. The corresponding accuracies when classifying test set 2 (or test set 3) using SVM 2 (or SVM 3) are given in Table 1. For instance, they are 98.0% and 98.4% for the case with 4 circles, respectively.

The obtained results (see Table 1) show that our method increases the margin significantly. The effect is most visible for the case with 4 circles. When we only have four circles we will quickly get close to the largest possible margin, since we only have two circles of each class and we will in 2-3 iterations "fold away" one circle on each side of the separating hyperplane. This effect is illustrated graphically in Figures 6 and 7. Figures 6(a) and 7(a) show SVM 0 and SVM 2 for one of the nine tests for the case with four circles. Figures 6(b) and 7(b) show the corresponding test sets, i.e., test set 0 and test set 2. The angle and place of the two rotations can be clearly seen in test set 2. These figures demonstrate that already after two iterations we have folded the hyperplane so that the blue circle on the left side and the red circle on the right side are moved away from the separating hyperplane. When we have 5 or 6 circles the separating hyperplane needs to be folded more times to reach the same effect. This is the reason why the margin increases more quickly for the 4 circles case compared to the cases with more circles.

Figure 6. Example of an SVM and test set for the case with four circles before any hyperplane folding iteration: (a) SVM 0, (b) test set 0.

Figure 7. Example of an SVM and test set for the case with four circles after two hyperplane folding iterations: (a) SVM 2, (b) test set 2.

Table 1 also shows that the accuracy, i.e., the number of correctly classified data points in the test set divided by the total number of data points in the test set, also increases with our method. The reason for this is that a larger margin reduces the risk for misclassifications. The improvement in accuracy is again highest for the case with 4 circles. As discussed above, the reason for this is that the margin increases fastest for that case.

5.3. Discussion

In the previous subsection we have discussed an experiment for the two-dimensional case performed for initial evaluation of the proposed hyperplane folding method. In the general case with n dimensions, we could do the same thing as in the 2D case, i.e., we start by deciding if the unknown data point is in the part of the n-dimensional space that was rotated in the first iteration. After that we

determine if the unknown data point (that now may have been rotated) is in the part of the n-dimensional space that was rotated in the second iteration, and so on. When all rotations are done we use the final SVM, i.e., the SVM we get after the algorithm in Section 3.3 has terminated, to classify the data point in the usual way.

The basic technique in hyperplane folding is to split the dataset into two parts and rotate one of the parts, and then repeat the same procedure again. It is clear that this technique could also be used for cases where the classes are not linearly separable, i.e., for soft margin SVMs. One way to do this is to move data points from one class in the direction of the normal of the separating hyperplane in the soft margin SVM until the dataset becomes linearly separable; follow our algorithm up to the point where we split the dataset; move the data points back to their original position; and then run the (soft margin) SVM on each part; rotate one of the parts; and finally merge the two parts (there could be other ways of implementing a soft margin version of hyperplane folding). Clearly, hyperplane folding could turn a dataset that is not linearly separable into a linearly separable dataset. Test set 2 in Figure 7(b) is almost linearly separable, and it is clear that if we had a radius of 60 instead of 75 the test set would be linearly separable after two iterations, whereas it is clearly not separable before we have done any iteration.

In the n-dimensional case we do n-2 dimension reductions before we reach the x,y-plane where the actual increase of the margin is obtained. Depending on the order in which we reduce the dimensions we will end up with different x,y-planes. Some reduction orders will probably lead to more significant increases of the margin than other reduction orders. Finding the optimal reduction order is still an open research question. Moreover, the statistical bounds associated with support vector machines need to be adjusted when doing multiple folds in higher dimensions. This is a topic for further investigations.

Folding the hyperplane many times will result in a larger margin.

However, excessive folding could probably lead to overfitting. It is clear that the time for obtaining the SVM grows with the number of folds. One could also expect that the time required to classify an unknown data point will increase with many folds (even if there could be techniques to limit this overhead). This means that there is a trade-off between a large margin on the one hand and the risk of overfitting and the execution time overhead on the other hand. The margin is increasing in the number of iterations of hyperplane folding, and the algorithm can be stopped at any point. This means that we can balance the advantages and disadvantages of hyperplane folding by selecting an appropriate number of iterations, e.g., we can simply stop when we do not want to spend more time on hyperplane folding or when the problem with overfitting becomes an issue.

We have assumed that for an n-dimensional dataset there can be at most n+1 support vectors. If we have limited resolution, there could be more than n+1 support vectors, and in such cases we need to do a small variation of the algorithm. The main idea in this case is to select the primary support vector, i.e., the support vector at which we split, so that the primary support vector is the only support vector from its class in one of the parts and so that that part also contains at least one support vector from the other class. It is clear that one can always do such a split when we have more than n+1 support vectors.

6. Conclusions

We have defined a method called hyperplane folding for support vector machines. The defined method increases the margin in the SVM. In addition, this method is easy to implement since it is based on simple mathematical operations, i.e., splitting and rotating a set of data points in n dimensions, and the method then uses existing SVM implementations on the different parts of the dataset.

We have proposed hyperplane folding algorithms for the 2-dimensional and general n-dimensional cases, respectively. An initial evaluation of the algorithm for the 2-dimensional case has been conducted. The obtained results have shown that the margin indeed increases and, in addition, this improves the classification accuracy of the SVM.

A similar approach can be defined for the soft margin case, i.e., the case when the dataset is not linearly separable. The hyperplane folding method is incremental in the sense that the margin is improved in each iteration. Therefore we can adapt the number of folds to balance the risk of overfitting and the execution time on the one hand, and the size of the margin on the other hand.

Additional studies are needed before the potential of hyperplane folding can be fully understood. For future work, we also aim to pursue further evaluation and validation of the proposed hyperplane folding method on richer data.

References

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44, 525-536.

Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155-1178.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 1171-1220.

Huang, X., Mehrkanoon, S., & Suykens, J. A. (2013). Support vector machines with piecewise linear feature mapping. Neurocomputing, 117, 118-127.

Li, Y., Leng, Q., Fu, Y., & Li, H. (2014). Growing construction of conlitron and multiconlitron. Knowledge-Based Systems, 65, 12-20.

Lin, S., & Costello, D. J. (1983). Error control coding: Fundamentals and applications. Prentice-Hall.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44, 1926-1940.

Shawe-Taylor, J., & Cristianini, N. (2000). Margin distribution and soft margin. In Advances in Large Margin Classifiers, 349-358. MIT Press.

Vapnik, V. (1995). The nature of statistical learning theory. Springer.

von Luxburg, U., Bousquet, O., & Schölkopf, B. (2004). A compression approach to support vector model selection. Journal of Machine Learning Research, 5, 293-323.

Wang, S., & Sun, X. (2005). Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51, 4425-4431.

Yujian, L., Bo, L., Xinwu, Y., Yaozong, F., & Houjun, L. (2011). Multiconlitron: A general piecewise linear classifier. IEEE Transactions on Neural Networks, 22, 276-289.
Towards Optimizing the Public Library:
Indoor Localization in Semi-Open Spaces and Beyond

Martijn van Otterlo [email protected]


Martin Warnaar [email protected]
Economics & Business Administration, Computer Science, Vrije Universiteit Amsterdam, The Netherlands

Abstract

We report on the BLIIPS project which aims at the digitalization and optimization of physical, public libraries through the use of artificial intelligence combined with sensor technology. As a first step we introduce FLib, a localization application, and additional developments for interaction with physical books. The contributions of this paper are the introduction of the public library as an interesting testbed for smart technologies, a novel localization application with an experimental evaluation, and a compact research agenda for smart libraries.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

Under names such as the extensible library and library 3.0, the public library is changing (Allison, 2013). In our digital society, a constant stream of innovations and artificially intelligent algorithms turns every possible (physical) interaction in society into data from which algorithms can create value in some way (Zheng et al., 2014). One could assume that public libraries, with their physical books, would become obsolete now that information and knowledge rapidly become digital, and huge tech-companies take over. For example, Wikipedia gives us encyclopedic knowledge, Google Books has many digitalized books, and Mendeley archives bibliographic information.

The future of the library is a much-debated topic, which indicates that libraries have always been changing because of new technology in writing, printing, archiving and distributing. More than fifty years ago the visionary J.C.R. Licklider ((1965):33) wrote: "By the year 2000, information and knowledge may be as important as mobility. We are assuming that the average man of that year may make a capital investment in an 'intermedium' or console his intellectual Ford or Cadillac comparable to the investment he now makes in an automobile, or that he will rent one from a public utility that handles information processing as Consolidated Edison handles electric power." We can see the modern smartphone or laptop being substituted in this quotation, and indeed much of our contemporary information consumption is done through these devices. But despite all electronic access to information, the public library is still the physical place to go to (Palfrey, 2015) for physical books (Baron, 2015) and access to internet, but also things like 21st-century skill building and group activities. Public libraries are innovating in the direction of building communities of interest more or less connected to information: from collection to connection (see also (Palfrey, 2015)).

Here we report on project BLIIPS in which smartphones (and other technologies) are used for an orthogonal purpose: to digitalize physical interactions in the physical library to obtain insight in how physical public libraries are used and how their services can be optimized. In particular we introduce FLib, a localization application which uses machine learning to capture the signal landscape of both WiFi and Bluetooth beacon sensors for localization and navigation in a physical, multi-floor library building. The overall goal of BLIIPS is to optimize the public library, which can be about the layout of the physical space, the content and distribution of the book collection over this space and the visiting patterns of users. The public library is an excellent, semantically rich, and much underexplored environment for potential artificial intelligence research such as activity recognition, internet-of-things, optimization (logistics, space, services), and recommender systems.

This paper and contributions are structured as follows. In Section 2 we describe a much underexplored problem domain for artificial intelligence: the physical public library.

Figure 1. a) Alkmaar library ground floor map with an example topology of typical patron walking paths between meaningful locations. The purple area surrounding location (1) is "culture and literature" and the area at (2) features "young adult fiction". Both areas feature rectangular bookshelves and book-tables. Other locations are the self-service machines at (C) and (B), a staircase to the second floor (H1), a seating area (7) and the main desk at (A). The space is an open environment but furniture does restrict possible moving patterns, and in addition there are some small rooms at the top (e.g. behind C and B). b) Fingerprint positioning. At the current location (of the mobile phone) signals of all three APs are received. At the location on the right this holds too, but AP2's signal is much stronger (−34) because it is closer.

In Section 3 we extensively introduce our FLib localization application based on WiFi and beacon technology. In Section 4 we additionally mention methods for book interaction and conclude with a research agenda for further research in the public library.

2. The BLIIPS project

Companies and governments are looking for ways to utilize their existing data and capture new opportunities to develop initiatives around data. In the Dutch municipality of Alkmaar, such activities are aggregated through so-called triple-helix cooperations in which (local) governments, companies and knowledge institutions collaborate (van Otterlo & Feldberg, 2016). The Alkmaar public library, partially funded by the local government, works together with the Vrije Universiteit Amsterdam on a data-oriented project called BLIIPS (van Otterlo, 2016b). Its goal is to utilize data to optimize the public library in various ways. However, whereas most data-oriented projects are about already digitalized aspects, BLIIPS targets the physical aspects of the public library and seeks to digitalize them with the use of new sensor technology. The overall goal is to gain insight into the physical behavior of patrons (i.e. library "customers") in the physical library and how to optimize services, for example book borrowing.

2.1. Public libraries: Physical vs. Digital

The main library of Alkmaar is part of the Kennemerwaard group (out of about 150 groups) with 14 locations. Nationwide (1), in 2014 more than 22 percent of all Dutch people were members of a library, more than 63 million visits were paid to a library, and almost 80 million books were borrowed. However, libraries know very little about their patrons' behavior. In fact, the only behavior visible in data are the books that were checked out. How they use the physical space, how they browse the book collection, which books are being looked at: for this no (real-time) data is available, but it could be highly relevant for managing the physical library building, its services and its collection. Libraries do have a long history of measuring, observing and evaluating, but typically through labor-intensive surveys and observational studies (see (Edwards, 2009; Allison, 2013; van Otterlo, 2016b) for pointers).

The BLIIPS project represents a first step towards the intelligent library in which this data is collected and analyzed in real time, but also in which the physical environment can provide to patrons the "Google-like" services we are accustomed (2) to in the digital world. For example, if all interactions are digitalized, a smartphone could provide location-based, personalized recommendations to physical books in the patron's surrounding area, based on a user query and data about the library, the patron, and additional sources.

(1) http://www.debibliotheken.nl/de-branche/stelsel/kengetallen-bibliotheken/
(2) Related, but used in an orthogonal way, van Otterlo (2016a) uses the concept of libraryness as a metaphor to understand modern profiling and experimentation algorithms in the context of privacy and surveillance.

Figure 2. a) Four key developments in BLIIPS. b) Testing ground VU: an open office space at the 5th floor of the main building of the Vrije Universiteit Amsterdam. The topological graph shown depicts the transition structure of the hallway connecting all surrounding spaces of the floor. Many of the room-like structures are part of the open space, others are separate rooms with a door.

2.2. Towards Library Experimentation

BLIIPS builds upon four interlocking developments, see Figure 2a (van Otterlo, 2016b). The first puzzle piece is digitalization: making physical interaction digital through sensors (and algorithms). The second piece connects to retail: the use of smart technology in physical stores, such as the recent Amazon Go Store (3). The Alkmaar library has adopted a retail strategy in which the layout and collection, unlike traditional libraries with many bookshelves, are more like a store: lots of visible book covers, tables with intuitive themes and easy categorizations that deviate much from traditional classification systems. A retail view on the problem appeals to customer relations programmes, marketing concepts and so-called customer journeys. The third piece concerns advances in data science, especially developments in machine learning and the availability of tools. The fourth piece of the puzzle, experimentation and optimization, is most important for our long-term goals. The BLIIPS acronym stands for Books and Libraries: Intelligence and Interaction through Puzzle- and Skinnerboxes, in which the latter two elements denote physical devices used by psychologists in the last century for behavioral engineering. The aim of BLIIPS is to influence, or even control, behaviors and processes in the library in order to optimize particular goals, such as increasing the number of book checkouts. Such digital Skinnerboxes are becoming a real possibility due to the combined effect of data, algorithms and the ubiquity of digital interactions (van Otterlo, 2014). The library, though, is a perfect environment for experimentation, unlike many other domains. As Palfrey (2015, p213) (quoting Kari Lamsa) writes: "Libraries are not so serious places. We should not be too afraid of mistakes. We are not hospitals. We cannot kill people here. We can make mistakes and nobody will die. We can try and test and try and test all the time."

3. An Indoor Localization Application

Mapping patron activity in the physical library requires at least knowing where they are. For this, we describe the design and implementation of the FLib application. Localization is a well-studied problem (Shang et al., 2015; He & Chan, 2015), but the practical details of the environment, hardware and algorithms used can deliver varying results and so far, localization is not solved (Lymberopoulos et al., 2015). First we outline requirements and then we describe an interactive localization application (Warnaar, 2017).

3.1. Indoor Localization

Whereas for outdoor localization GPS is successful, indoor localization remains a challenge. GPS works by maintaining line of sight to satellites, which is problematic inside concrete buildings. Several indoor positioning systems exist (Shang et al., 2015; He & Chan, 2015), none of which is currently considered as the standard. Sensor information such as magnetic field strength, received radio waves, and inertial information from gyroscopes and odometers can be used to determine location. Smartphones are equipped with an array of sensors; they are well suited as indoor positioning devices. Lymberopoulos et al. (2015) review the 2014 Microsoft indoor localization competition (featuring 24 academic teams): "all systems exhibited large accuracy (4) variations across different evaluation points which raises concerns about the stability/reliability of current indoor location technologies."

(3) https://www.amazon.com/b?node=16008589011
(4) Typically average errors of a couple of meters.
Towards Optimizing the Public Library

elements of a localisation solution: (1) Sensors and


hardware. Localization depends on the interplay be-
tween sensor technology and devices such as smart-
phones. Sensors include wireless modules (e.g. WiFi,
Bluetooth) and motion sensors (e.g. accelerometers,
gyroscopes). Other well-known hardware are Zigbee
and RFID chips. (2) Measurements. Localization
depends on what is being measured from sensors. Most
techniques use received signal strength (RSS) of a sen-
sor. Derived measures such as distances and angles to-
wards sensors are typically employed for triangulation
approaches similar to GPS. An ideal signal should have
two properties: recognizability and stability. (3) Spa-
tial contexts. To aid localization, techniques such as
map matching, spatial models (of behavior, but also
of topological room connection structures) and land-
Figure 3. Android fingerprinting application marks can be used. These can be used in the local-
ization process or as a top-level localization decision
by employing the spatial context as a constraint. (4)
gies.” Indoor spaces are often more complicated in Bayesian filters. Probabilistic approaches are ef-
terms of topology and spatial constraints: wireless sig- fective for dealing with uncertainty in measurements.
nals suffer from multipath effect, scattering, and a non- The most general formulation of Bayesian localization
line of sight propagation time, thereby reducing the
comes from Bayes’ rule: P (x|o) = P (o|x)P
p(o)
(x)
in which
localization accuracy. Due to the small scale, most
applications require better accuracy than outdoors. x is the location and o a set of measurements (the ob-
servation). A sequence of observations o1 , o2 , . . . , on is
used to infer the sequence of locations x1 , x2 , . . . , xn ,
3.2. Library: Requirements and Solutions
assuming there are (meaningful) (in)dependencies in
Our target is the library of Alkmaar5 , a medium-sized the observations and locations. Such assumptions give
city in the Netherlands. Its properties and the overar- rise to various probabilistic models that can be used
ching BLIIPS project induce requirements for local- for localization from noisy observations such as (ex-
ization. First, the library consists of two floors (see tended) Kalman filters and hidden Markov models. All
Figures 1a and 5a) with mainly semi-open space. Un- models have a bias wrt. choice of distributions used
like several room-based approaches, the library hardly for e.g. sensor models P (o|x) and how to do inference
contains constraints such as rooms/corridors. In terms and learning. In prior experiments we employed a par-
of localization accuracy, we require (topical) area- ticular Monte Carlo based probabilistic model for local-
based accuracy for effective navigation to library sec- ization, the particle filter, in our VU environment (see
tions, and more accurate when technically feasible. Figure 2b). A particle filter keeps multiple hypotheses
Second, we want to leverage existing infrastructure (particles) of the location. Each time a patron moves,
as much as possible. Third, our solution needs to be the particles are (probabilistically) moved based on
amenable to incremental accuracy improvement with- a motion model to predict the next location. Af-
out having to repeat all deployment steps. Fourth, ob- ter sensing particles are resampled based on recur-
tained data in the deployment step should be reusable. sive Bayesian estimation using a sensor model that
Fifth, computational complexity should be low enough correlates the sensed data with the predicted state.
on smartphones. Sixth, smartphones should aid in ob- Eventually the particles will converge on the true po-
taining required measurements for deployment. And sition. We concluded that obtaining accurate motion
last, the application needs to engage the user by visu- and sensor models was not feasible in this stage. A sec-
ally showing the location and the patrons surroundings ond bottleneck was the computational complexity on
(or provide navigation in a later stage). the phone when even a moderate number of particles
and iterations were used. (5) Hybrid localization.
A common base of many solutions is fingerprinting:
A combination of techniques can improve the accu-
measuring the signal landscape such that localization
racy. These include multimodal fingerprinting, trian-
amounts to matching currently sensed signals with the
gulation fusing multiple measurements, methods com-
landscape. Shang et al. (2015) distinguish five main
bining wireless positioning with pedestrian dead reck-
5
http://alkmaar.bibliotheekkennemerwaard.nl/ oning, and cooperative localization.

47
Towards Optimizing the Public Library
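The particle filter sketched in (4) could look as follows in outline. The Gaussian motion and sensor models and all names here are illustrative assumptions of this sketch, not the models used in the BLIIPS experiments, which, as noted above, proved hard to obtain accurately.

import numpy as np

def particle_filter_step(particles, weights, observation, signal_map, rng,
                         step_std=1.0, rss_std=6.0):
    """particles: (N, 2) candidate positions; weights: (N,) current weights;
    observation: vector of sensed RSS values; signal_map(pos): expected RSS
    vector at pos (the learned signal landscape). Returns updated particles/weights."""
    # 1) Motion model: probabilistically move every hypothesis.
    particles = particles + rng.normal(scale=step_std, size=particles.shape)
    # 2) Sensor model: weight each hypothesis by the likelihood of the observation.
    expected = np.array([signal_map(p) for p in particles])
    log_lik = -0.5 * np.sum(((expected - observation) / rss_std) ** 2, axis=1)
    weights = weights * np.exp(log_lik - log_lik.max())
    weights = weights / weights.sum()
    # 3) Resampling: concentrate the hypotheses on likely locations.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

One such step per sensing interval already illustrates the two bottlenecks mentioned above: the quality of signal_map and the motion model, and the per-particle cost on a smartphone.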

Figure 4. a) VU testing floor: 28 × 10 grid overlay and beacon positions. Coverage shown for beacons 3, 9, 14, and 16. b) FLib: Software components overview.

Each localization technique has drawbacks when considering accuracy, cost, coverage, and complexity. None of them is suitable for all scenarios. Based on our requirements formulated above we choose a (multimodal) fingerprinting solution in which we use smartphones both for measuring the signal space and for patron localization. Fingerprinting is accurate enough for our purpose, does not pose any assumptions on knowledge about where signals come from nor on the modeling of the domain (e.g. sensor models), and can be employed using the existing WiFi infrastructure, which we extend with Bluetooth beacons. Other requirements (like low computational complexity and local computation) are fulfilled by the choice of (simple) algorithms with few biases and interactive visualizations on the phone, and because fingerprinting supports reuse of data. We use simple topological graphs and grid-based decompositions of space tailored to the required localization precision.

3.3. Localization by Fingerprinting

Localization by fingerprinting is a widely employed technique (He & Chan, 2015). The general principle is depicted in Figure 1b. Each black dot is a reference point: a location in the space from which all received signals together form a fingerprint. In the picture two received signal sets are depicted for two different reference points. More formally, let R be the set of reference points and A be the set of APs. We denote a sensed signal with strength s from AP a ∈ A as the tuple (a, s). Now, let f = {(a1, s1), (a2, s2), ..., (an, sn)} be the set of all signals sensed at a particular location, called a fingerprint over the set A. A reference point can denote a point (x, y) in space (rendering R infinite), but usually is taken from a finite set of regions, grid locations (see Figure 5a) or nodes of an abstract topological graph such as in Figure 1a. A fingerprint database FDB is a set of pairs (r, f) where r ∈ R is a reference point and f a fingerprint over the set A.

In the Offline training phase we first collect data. Here a reference point r is (physically) visited to measure the signals available (f) at that location and to store (r, f) in the database. To increase the accuracy of FDB, multiple fingerprints can be taken at the same location. Systematically all reference points r ∈ R should be visited. When building prediction models, the fingerprint database FDB is used to obtain a generalizable mapping M : 2^(A×R) → R, i.e. a mapping from a set of signals (and their signal strengths) to a reference point in R. All samples (r, f) ∈ FDB represent a supervised learning problem from fingerprints (inputs) to reference points (outputs). In the Online localization phase, M is used for localization. Let the to-be-located patron be in some unknown location l in the space, and let the set of current signals be c = {(a1, s1), (a2, s2), ..., (an, sn)}. The predicted location of l is then r = M(c).

The choice for fingerprinting naturally induces a supervised machine learning setting in which the signal landscape over the space is the desired function to learn, and where the fingerprints are samples of that function. Intuitively, this determines the balance between |R| and sample complexity (Wen et al., 2015). Fingerprinting is not prone to error drift such as often seen when using inertial sensors to determine step count and direction. Modelling signal decay over distance and through objects is also not required, as is the case for multilateration positioning. Another advantage is that the positions of APs do not need to be mapped. Disadvantages are that collecting fingerprints of a site is a tedious task (Shang et al., 2015) and that changes to the environment may require that (some) fingerprints need to be collected again.
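A minimal sketch of the two phases is given below (our illustration, not the FLib code). The fingerprint representation as a dictionary and the handling of unheard access points with a weak default RSS value are assumptions of this sketch; the paper's own modified Euclidean distance is introduced in Section 3.5.3.

from collections import Counter

MISSING_RSS = -100  # assumed default for APs/beacons not heard at a location

def knn_localize(fdb, current, k=3):
    """fdb: list of (reference_point, fingerprint) pairs, where a fingerprint is a
    dict {ap_id: rss}; current: dict {ap_id: rss} sensed now.
    Returns the predicted reference point r = M(current)."""
    def distance(fp):
        aps = set(fp) | set(current)
        return sum((fp.get(a, MISSING_RSS) - current.get(a, MISSING_RSS)) ** 2
                   for a in aps) ** 0.5
    neighbours = sorted(fdb, key=lambda rf: distance(rf[1]))[:k]
    votes = Counter(r for r, _ in neighbours)
    return votes.most_common(1)[0][0]

In the offline phase fdb is filled by visiting reference points with the fingerprinting application; in the online phase knn_localize realizes the mapping M on the phone, which keeps the computation local and lets new fingerprints be added incrementally.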

3.4. Multimodal Fingerprinting: Beacons

One of the constraints is that the library building has only 8 different WiFi APs. Several other APs from surrounding buildings can be used but they are outside our control and less reliable. In contrast, our test VU environment (see Figure 2b) has many APs inside. To enrich the signal landscape, we employ so-called Bluetooth low energy (BLE) beacons. A beacon is a self-powered, small device that sends out a signal with an adjustable signal strength and frequency. Beacons are a recent addition to the internet-of-things landscape (Ng & Wakenshaw, 2016) and most modern smartphones can detect them. For example, a museum can place a beacon at an art piece and when the visitor gets near the beacon, his smartphone can detect this and provide information about the object. Most work employs beacons for areas such as rooms and corridors (i.e. region-based). For example, LoCo (Cooper et al., 2016) is a fingerprinting system based on WiFi APs and BLE beacons which are mostly aligned with the room-like structure of an office. Such beacons act as noisy indicators for rooms. Such constraints are somewhat present in our VU environment, but not at the library. In a sub-project (Bulgaru, 2016) we tested region-based interpretations in the library with varying success due to the noisy nature of beacons.

Here, in our semi-open library space, we opt for a more general approach: to employ beacons as extra signals for fingerprinting and to treat them similarly to WiFi signals. Beacons are configured such that signals will be received throughout large portions of the library, just like the (much stronger) WiFi APs. Using this approach roughly 10 beacons per floor are effective. Consequently, in our model, the set A of all APs is extended with all beacons {b1, ..., bn}.

3.5. The FLib Localization Application

In this section we describe FLib, a smartphone application.

which was replaced by a more uniform grid layout as depicted in Figures 4a and 5a. When the user fingerprints a grid location, it gets highlighted to keep track of visited areas. The Estimote Android SDK is used to collect Bluetooth signals. WiFi RSSIs are collected using Android's BroadcastReceiver. The Estimote software uses an adaptable scanning period and a pause between scanning periods. If the first is too short, few or no beacons are detected, but if it is too long location estimation lags behind (and: huge performance differences between smartphones exist).

3.5.2. Fingerprints Server

Measured fingerprints are uploaded to a server application (implemented in PHP using Symfony running on Apache, using a MySQL database). Fingerprints are first locally stored on the phone and then sent to the server. The server's only function is to store fingerprint data: localisation runs locally on the phone.

3.5.3. Model Training

The data on the server, FDB, is used for building a model mapping fingerprints to grid locations (reference points). We utilize two different machine learning algorithms: k-nearest-neighbors (k-NN) and multi-layer perceptrons (MLP), see (Flach, 2012).

The first model is a lazy learner; generalization and model building are not required, but instead FDB is loaded on the smartphone and the algorithm finds the k most similar fingerprints in FDB for the currently sensed signals. We use a modified Euclidean distance to compute a similarity metric between fingerprints. Given a fingerprint f = {(af1, sf1), ..., (afn, sfn)} ∈ FDB and the currently sensed signals, c = {(ac1, sc1), ..., (acm, scm)}, we compute the distance between c and f. Let Afc ⊆ A be the access points measured in both c and f. We compute distance d(c, f) as follows. For all sensed APs in Afc
cation for localization purposes in a real-world envi- we take the overall Euclidean distance between signal
ronment. Figure 4b shows an overview of main (soft- values. A penalty of 30 is added to the distance for
ware) components of FLib. In subsequent sections we each access point a ∈ A that is only in f and not in c,
will review all parts. FLib is targeted at our test- or only in c and not in f . This empirically estimated
ing ground at the university (see Figure 2b) and the value balances signals and missing values.
library in Alkmaar (see Figures 1a and 5a). Our second model is an MLP, a standard neural net-
work with one hidden layer of neurons, an input layer
3.5.1. Fingerprinting with |A| neurons and an output layer with |R| neu-
The fingerprint database FDB is filled by smartphone rons. Reference points are taken as classes and each
measurements, see Figure 3. In FLib the current po- class (r ∈ R) is represented by a separate neuron.
sition can be selected on an interactive map, after For the input layer we transform a sensed fingerprint
which the fingerprint is generated with a single but- {(a1 , s1 ), . . . , (am , sm )} with m ≤ |A| to a vector of
ton tap. Initial experiments in VU and Library em- length |A| where each a ∈ A has a fixed index in this
ployed a graph-like transition structure as in Figure 1a vector and each value at that index is the sensed signal

49
Towards Optimizing the Public Library

Figure 5. a) Alkmaar first floor (8 x 5 grid) with beacon positions (coverage shown for 6), b) FLib localization.

strength si (i ∈ 1 . . . m). All other components of the BeaconInside beacons were both used. Transmision
input vector are 0. To construct an output vector for a rate and power were (497ms, −16dBm) and (500ms,
fingerprint f (i.e. (r, f ) ∈ FDB ) we use a binary vector −3dBm) for Estimote and BeaconInside beacons re-
of length |R| with all zeros except at the index of the spectively. Fingerprinting was done with different
neuron representing r. Training an MLP amounts to smartphones: OnePlus A3003 (3), LG H320 (Leon),
computing the best set of weights in the network which and Huawei Y360-U61 (Y3). All access points in the
can be accomplished using gradient-descent learning. vicinity are used for fingerprints collection to increase
WiFi signal space. For vu, we have 396 unique AP ad-
3.5.4. Real-time Localisation dresses in the fingerprints collection, compared to 165
for library. A Bluetooth scanning period of 2500 ms
Both models can be used to generate a ranking
was used to balance delay and detection. RapidMiner
hr1 , . . . , rm i of all reference points. k-NN naturally
was used to train MLPs (learning rate 0.3, momentum
induces a ranking based on distances. MLPs however,
0.2, normalized inputs) and inference in models runs
yield a normalized distribution over the output nodes.
on the phone.
Instead of showing only the best prediction of loca-
tion, FLib shows a more gradual visualization which 3.6.1. Experimental Setup
highlights with shades of blue where the patron may
First, we determine whether unlabelled walked trajec-
be. To render the blue tiles as in Figure 5b, we calcu-
tories can successfully be classified at vu. We use the
late the transparency for the returned unique reference
graph model from Figure 2b and fill FDB with fin-
points. Let Rbest = hr1 , r2 , ..., rn i be the ranked loca-
gerprints taken at each node position. Next, we walk
tions where some r ∈ R can occur multiple times. The
several trajectories such as shown in Figure 6a, and
first element gets score |Rbest |, the second |Rbest | − 1
store unlabelled fingerprints of multiple locations. Us-
and so on, and scores for the same r are summed.
ing 1-NN with the modified Euclidean function, the
Scores are normalized and mapped onto 50 . . . 255, in-
predicted sequences of reference points are compared
ducing color value as a shade of blue.
to the truly walked paths.
3.6. Experiments and Outcomes Positioning performance over the grid at vu and li-
brary (Figures 4a and 5a) is calculated by taking the
Experiments were conducted at two separate locations: mean hamming distance (H) between n true (x, y) ∈ R
part of the 5th floor of the main building at the VU and predicted reference points (x0 , y 0 ) ∈ R:
university in the A-wing (vu) and the two publicly ac-
cessible floors at the Kennemerwaard Library at Alk- n
1X
maar (library). The library ground floor is 55 m H(M ) = |xi − x0i | + |yi − yi0 | (1)
n i=1
wide by 22 m in length, while the first floor is 54 m
wide by 40 m in length. The vu testing floor is 57,4 Fingerprints are collected with different phones, while
m wide and 20,5 m in length. Estimote Proximity and fingerprints of walks were collected with a OnePlus 3

50
Towards Optimizing the Public Library

Figure 6. a) Picture from (Bulgaru, 2016). Here we employ beacons in our testing environment with exactly known
beacon locations. An effective localization method is to score each grid location surrounding a detected beacon based on
the detected signal strength. For example, all 9 grid locations around a detected strong beacon signal (red area) get a
high value, whereas a much larger area around a detected low signal (blue) get a much lower value. This value scheme
reflects varying confidence in beacon detections based on signal strength. The final predicted location is computed from
a (weighted) combination of all grid position values, and forms a practical (and effective in the testing environment)
localization algorithm. b) A sample walk in the VU environment (Walk 2).

only. Differences in performance are compared using ure 7b. Figure 7c shows results for the first floor.
only fingerprints of OnePlus3, averaged fingerprints, Ground floor tiles cover 5.5 × 4 m. For the li-
and all fingerprints. Best performance is expected brary ground floor the best result (MLP, 50 hid-
when using fingerprints from the same phone as for den, 200 cycles) is a mean total hamming distance
the walks, since there are no sensor or configuration of 1.06: 0.65 for x andp0.41 for y and roughly (un-
differences. In all fingerprints for the ground floor at der)estimated error of (5.5 ∗ 0.65)2 + (4 ∗ 0.41)2 ≈
library, 745 records were collected, and 623 for the 3.92 m. For the library first floor, the same con-
first floor. Averaging fingerprints data per phone per figuration yields the best result: 0.80, with an er-
reference point was done to decrease computational ror x = 0.35 and y = 0.45. Each grid tile covers
complexity of k-NN, reducing |fDB | for the first floor p × 8 m, giving an (under)estimated mean error of
6.75
from 623 to just 72. Computational efficiency is impor- (6.75 ∗ 0.35)2 + (8 ∗ 0.45)2 ≈ 4.30 m. These levels of
tant because smartphones have limited battery time, indoor localisation performance suffice to detect a pa-
and positioning delay is reduced. tron’s region at library, and can be used for several
future applications. We have seen that using k > 3,
3.6.2. Results positioning performance starts degrading, so only re-
sults of {1, 2, 3}-NN are reported.
First, we look at 2 vu walk example results:

True W1 {5A-00b, 5A-55x, 5A-PA, 5A-71x, 5A-00d, 3.7. Related Work


5A-89, 5A-00e, 5A-88, 5A-00d, 5A-72, 5A-56, 5A-PA,
5A-00b} There is much related work in localization, e.g. (He &
Predicted W1 {5A-00b, 5A-55x, 5A-71x, 5A-00d, 5A-89,
5A-00e, 5A-00d, 5A-88, 5A-72, 5A-56, 5A-00b} Chan, 2015; Shang et al., 2015; Lymberopoulos et al.,
True W2 {5A-00b, 5A-55x, 5A-PA, 5A-71x, 5A-00d, 2015). Our major contribution is the library domain
5A-88, 5A-00e, 5A-89, 5A-00d, 5A-72, 5A-PA, 5A-56,
5A-00b} and its potential for library optimization; in terms of
Predicted W2 {5A-89, 5A-00b, 5A-55x, 5A-71x, 5A-00d, pure localization accuracy several systems may be bet-
5A-00e, 5A-00d, 5A-89, 5A-00d, 5A-72, 5A-56, 5A-00b}
ter. However, the BLIIPS project’s requirements on
We see that predicted and true sequences are very sim- accuracy are less strong for the tasks we aim at. A rel-
ilar, with the exception of some natural additional pre- atively novel aspect is that we aim at semi-open spaces
dicted neighboring locations. For vu, the positioning and do make different use of multimodal (i.e. with ad-
performance for our MLP and k-NN configurations are ditional beacons) fingerprinting than in other systems
displayed in Figure 7a. In the best case, 2-NN, we have such as (Cooper et al., 2016; Kriz et al., 2016). Direct
an average hamming distance error of 2.67 (1.65 in x comparison of empirical results is for this reason not
and 1.02 in y). Each grid tile is 2.05 m in width and feasible. In the past only very few such systems have
length, been considered for (public) library settings (see (van
p so we roughly (under)estimate the mean error Otterlo, 2016b) for pointers) and the results were very
with (1.65 ∗ 2.05)2 + (1.02 ∗ 2.05)2 ≈ 3.97 m.
limited; here our contribution lies in a successful mix
For library, the accuracy of different configurations of previously known techniques in a library setting.
over averaged fingerprints (ground floor) is in Fig-

51
Towards Optimizing the Public Library

Figure 7. Mean positioning hamming distance errors: a) at the vu (28 × 10 grid), b) using averaged fingerprints at the
LIBRARY ground floor c), using averaged fingerprints at the LIBRARY first floor.

4. Elements of a Research Agenda


The BLIIPS project has a main goal: to establish
a library experimentation facility to experiment with
ways to optimize the public library services by influ-
encing patrons, for example by interacting with pa-
trons through recommendations or by changing the
layout of the library or its collection. The FLib ap-
plication represents a large step towards turning pa-
tron behaviors into actionable data. We envision many Figure 8. Example alternative technology tested in BLI-
other challenges and opportunities for the intelligent IPS: processing the visual appearance of book spines to
library and briefly mention some directions. (1) Fin- recognize pictograms denoting book types.
gerprinting/localization. Current efforts go into
extending and upgrading FLib to increase accuracy
many interesting challenges await: can predictions be
and to incorporate more spatial context and other (se-
made who will do which activity, for how long, and
mantic) constraints coming from the library. Other
with what purpose? A general big data frame can con-
types of (deep) machine learning and especially (struc-
nect such data with demographics, geographical con-
tured) probabilistic models are appealing. In addition,
text, weather, trends on social media, and much more.
we want to investigate i) collaborative fingerprinting
(4) The personalized library In line with BLI-
(using many devices), and ii) the optimization of the
IPS’s philosophy of making the physical library more
choice for, and placement of, sensors in the environ-
Google-like, personalization will be a big issue in the
ment (e.g. (Shimosaka et al., 2016)). (2) Digital-
data-driven public library. Knowledge classification
ization. In addition to our use of WiFi and bea-
schemes, recommendations, advertisements and sug-
cons, a sub-project in BLIIPS targeted interaction
gestions for activities could all be personalized based
with physical books (Jica, 2016), with computer vi-
on data (e.g. books borrowed) combined with statisti-
sion and using RFID chips contained in each book
cal predictions derived from many patrons. The physi-
(see Figure 8 for an example). Books can be looked
cal setting enables location-based interventions such as
up based on i) cover, ii) bar code, iii) RFID chip,
personalized suggestions about interesting books in the
iv) textual information (ISBN). Combined with local-
local neighborhood. (5) Optimizing all services.
ization (movement) data, this further completes the
Prediction models can enable optimization of processes
digitalization of library activity. However, we envision
by means of experimentation. For example, one can
more ways to digitalize physical processes, using var-
systematically change aspects of the library services
ious new types of sensors, developments in networks
using expectations and actually see the results in the
(e.g. LoraWAN), existing technologies such as ”smart
data. Optimization requires goals. Potential library
shelves”, augmented/virtual reality, and more. (3)
optimization goals that are largely unexplored are i)
Activity recognition. More detailed data of digi-
the number of books people borrow, ii) distribution of
talized, physical activities can be distilled and used
(types of) books in the collection, iii) the most efficient
for predictive models. Traditional library research has
layout of the library, and iv) the conceptual arrange-
analyzed data before, but the scale of current and
ment of knowledge (classification schemes). One ad-
potential behavioral data is virtually unlimited and
vantage of data-oriented approaches is that monitoring

52
Towards Optimizing the Public Library

and intervening can be done in real time. The advan- Flach, P. (2012). Machine learning. Cambridge University
tage of sensor technology is that at some point one Press.
can relax the physical order of the library because, for He, S., & Chan, G. (2015). Wi-Fi fingerprint-based indoor
example, books can be located individually, escaping positioning: Recent advances and comparisons. IEEE
the standard order of the shelf. Coming up with the Comm. Surveys & Tutorials, 18.
right goals – together with the right hardware and al- Jica, R. (2016). Digital interactions with physical library
gorithmic technology – that are aligned with the many books. Bachelor thesis, Vrije Universiteit Amsterdam,
functions of the public library, is most challenging. (6) The Netherlands.
Privacy. More data means more risks for privacy in Kriz, P., Maly, F., & Kozel, T. (2016). Improving indoor
general. Libraries already collect data about their pa- localization using bluetooth low energy beacons. Mobile
trons, but this will increase quickly. Challenges are ba- Information Systems, 2016.
sic data privacy and security. However, a more hidden Licklider, J. (1965). Libraries of the future. Cambridge
form is intellectual privacy (see (van Otterlo, 2016a)). Massachusetts: MIT Press.
Personalized interventions in library services based on
Lymberopoulos, D., Liu, J., Yang, X., Choudhury, R. R.,
information about borrowing history can have trans- Handziski, V., & Sen, S. (2015). A realistic evaluation
formative effects on the autonomy of a patron in think- and comparison of indoor location technologies: Experi-
ing and deciding. Consequences of data-driven strate- ences and lessons learned. Proceedings of the 14th Inter-
gies in libraries are underexplored (but see (van Ot- national Conference on Information Processing in Sen-
terlo, 2016b)) and need more study. sor Networks (pp. 178–189). New York, NY, USA: ACM.
Ng, I. C., & Wakenshaw, S. Y. (2016). The internet-of-
things: Review and research directions. International
5. Conclusions Journal of Research in Marketing, 34, 3–21.
In this paper we have introduced the public library Palfrey, J. (2015). Bibliotech. Basic Books.
as an interesting domain for innovation with artifi-
Shang, J., Hu, X., Gu, F., Wang, D., & Yu, S. (2015).
cial intelligence. In the context of project BLIIPS Improvement schemes for indoor mobile location esti-
we have introduced the FLib localization application mation: A survey. Math. Probl. in Engineering.
as a first step towards patron activity monitoring, and
Shimosaka, M., Saisho, O., Sunakawa, T., Koyasu, H.,
have briefly touched upon additional results related to Maeda, K., & Kawajiri, R. (2016). ZigBee based wireless
book interaction. Many potential future work direc- indoor localization with sensor placement optimization
tions on BLIIPS and FLib exist and were outlined in towards practical home sensing. Advanced Robotics, 30,
the research agenda in the previous section. 315–325.
van Otterlo, M. (2014). Automated experimentation in
Acknowledgments walden 3.0. Surveillance & society, 12, 255–272.
van Otterlo, M. (2016a). The libraryness of calculative de-
The first author acknowledges support from the Amster-
vices. In L. Amoore and V. Piotukh (Eds.), Algorithmic
dam academic alliance (AAA) on data science, and we
life: Calculative devices in the age of big data, chapter 2,
thank Stichting Leenrecht for financial support. We thank
35–54. Routledge.
the people from the Alkmaar library for their kind support.
van Otterlo, M. (2016b). Project BLIIPS: Making the
physical public library more intelligent through artifi-
References cial intelligence. Qualitative and Quantitative Methods
Allison, D. A. (2013). The patron-driven library: A prac- in Libraries (QQML), 5, 287–300.
tical guide for managing collections and services in the van Otterlo, M., & Feldberg, F. (2016). Van kaas naar big
digital age. Chandos Inf. Prof. Series. data: Data Science Alkmaar, het living lab van Noord-
Baron, N. S. (2015). Words onscreen: The fate of reading Holland noord. Bestuurskunde, 29–34.
in a digital world. Oxford University Press. Warnaar, M. (2017). Indoor localisation on smartphones
using WiFi and bluetooth beacon signal strength. Mas-
Bulgaru, A. (2016). Indoor localisation using bluetooth
ter thesis, Vrije Universiteit Amsterdam.
low energy beacons. Bachelor thesis, Vrije Universiteit
Amsterdam, The Netherlands. Wen, Y., Tian, X., Wang, X., & Lu, S. (2015). Fundamen-
tal limits of RSS fingerprinting based indoor localiza-
Cooper, M., Biehl, J., Filby, G., & Kratz, S. (2016). tion. IEEE Conference on Computer Communications
LoCo: boosting for indoor location classification com- (INFOCOM) (pp. 2479–2487).
bining WiFi and BLE. Personal and Ubiquitous Com-
puting, 20, 83–96. Zheng, Y., Capra, L., Wolfson, O., & Yang, H. (2014). Ur-
ban computing: Concepts, methodologies, and applica-
Edwards, B. (2009). Libraries and learning resource cen- tions. ACM Trans. Intell. Syst. Technol., 5, 38:1–38:55.
tres. Architectural Press (Elsevier). 2nd edition.

53
Constraint-based measure for estimating overlap in clustering

Antoine ADAM [email protected]


Hendrik Blockeel [email protected]
KU Leuven, Department of Computer Science, Celestijnenlaan 200A, 3001 Leuven, Belgium

Keywords: meta-learning, constraints, clustering

Abstract assumptions about the input data and the target func-
tion to be approximated. For instance, some clustering
Different clustering algorithms have different algorithms implicitly assume that clusters are spheri-
strengths and weaknesses. Given a dataset cal; k-means is an example of that. Any clustering al-
and a clustering task, it is up to the user gorithm that tries to minimise the sum of squared Eu-
to choose the most suitable clustering algo- clidean distances inside the clusters, implicitly makes
rithm. In this paper, we study to what extent that assumption. The assumption can be relaxed by
this choice can be supported by a measure of rescaling the different dimensions or using a Maha-
overlap among clusters. We propose a con- lanobis distance; this can lead to elliptic clusters, but
crete, efficiently computable constraint-based such clusters are still convex.
measure. We show that the measure is indeed
informative: on the basis of this measure A different class of clustering algorithms does not as-
alone, one can make better decisions about sume convexity, but looks at local properties of the
which clustering algorithm to use. However, dataset, such as density of point or graph connectivity.
when combined with other features of the in- Such methods can identify, for instance, moon-shaped
put dataset, such as dimensionality, it seems clusters, which k-means cannot. Spectral clustering
that the proposed measure does not provide (von Luxburg, 2007) is an example of such approach.
useful additional information. Some clustering algorithms assume that the data have
been sampled from a population that consists of a mix
of different subpopulations, e.g., a mixture of Gaus-
1. Introduction sians. EM is an example of such an approach (Demp-
ster et al., 1977). A particular property of these ap-
For many types of machine learning tasks, such as su-
proaches is that clusters may overlap. That is, even
pervised learning, clustering, and so on, a variety of
though each individual instance still belongs to one
methods is available. It is often difficult to say in ad-
cluster, there are areas in the instance space where
vance which method will work best in a particular case;
two (or more) Gaussian density functions substantially
this depends on properties of the dataset, the target
differ from zero, so that instance of both clusters may
function, and the quality criteria one is interested in.
end up in this area.
The research field called meta-learning is concerned
with devising automatic ways of determining the most In this paper, we hypothesise that the amount to which
suitable algorithm and parameter settings, given a par- clusters may overlap is relevant for the choice of what
ticular dataset and possibly knowledge about the tar- clustering method to use. A measure, the Rvalue, has
get function. Traditionally, meta-learning has mostly been proposed before that, given the ground truth re-
been studied in a classification setting. In this paper, garding which instance belongs to which cluster, de-
however, we focus on clustering. scribes this overlap. Since clustering is unsupervised,
this measure cannot be used in practice for deciding
Clustering algorithms are no exception to the general
what clustering method to use. We therefore derive
rule that different learning algorithms make different
a new measure, CBO, which is based on must-link or
cannot-link constraints on instance pairs. We show
Preliminary work. Under review for Benelearn 2017. Do that the second measure correlates well with the first,
not distribute. making it a suitable proxy for selection the clustering

54
Constraint-based measure for estimating overlap in clustering

method. We show that this measure is indeed informa- vided to the clustering algorithm to guide the search
tive: on the basis of this measure alone, it is possible towards a more desirable solution. We then talk about
to select clustering algorithms such that, on average, constraint-based, constrained, or semi-supervised clus-
better clusterings are obtained. tering.
However, there are also negative results. Datasets can Constraints can be defined on different levels. On a
be described using other features than the measure cluster level, one can ask for clusters that are bal-
defined here. It turns out that, when a dataset is anced in size, or that have a maximum diameter in
described using a relatively small set of straightfor- space. On an instance level, one might know some
ward features (such as dimensionality), it is also pos- partial labelling of the data. A well-used type of con-
sible to make an informed choice about what clustering straints are must-link and cannot-link constraints, also
method to use. What’s more, if this set of straightfor- called equivalence constraints. These are pair-wise
ward features is extended with the overlap measure constraints which state that two instances must be or
described here, this does not significantly improve the cannot be in the same cluster.
informativeness of the dataset description, in terms of
Multiple methods have been developed to use these
which clustering method is optimal.
constraints, some of which are mentioned below. A
The conclusion from this is that, although the pro- metric can be learnt that complies with the constraints
posed measure is by itself an interesting feature, it (Bar-Hillel et al., 2005). The constraints can be used
seems to capture mostly information that is also con- in the algorithm for the cluster assignment in a hard
tained in other, simpler features. This is a somewhat (Wagstaff et al., 2001) or soft way (Pelleg & Baras,
surprising result for which we currently have no expla- 2007), (Ruiz et al., 2007), (Wang & Davidson, 2010).
nation; further research is warranted. Some hybrid algorithms use constraints for both met-
ric learning and clustering (Bilenko et al., 2004), (Hu
This paper is the continuation of a previously pub-
et al., 2013). Other approaches include constraints in
lished workshop paper (Adam & Blockeel, 2015).
general solver methods like constraint programming
While following the same ideas, the CBO has been
(Duong et al., 2015) or integer linear programming
completely redefined. In addition, the number of
(Babaki et al., 2014).
datasets considered was increased from 14 to 42. While
the correlation of the CBO with the overlapping has
improved considerably, the promising results for the 2.2. Algorithm selection for clustering
algorithm selection of that paper were somewhat re- Little research has been conducted on algorithm se-
duced by adding those datasets. lection for clustering. Existing methods usually pre-
The remainder of this paper is structured as follows. dict the ranking of clustering algorithms (De Souto
Section 2 discusses some related work on constraint- et al., 2008), (Soares et al., 2009), (Prudêncio et al.,
based clustering and meta-learning for clustering. Sec- 2011) (Ferrari & de Castro, 2015). The meta-features
tion 3 studies how the overlapping of clusters influ- used are unsupervised and/or domain-specific. None
ences the performance of algorithms. Section 4 in- of these approaches are using constraints which re-
troduces CBO, which is intended to approximate the moves the specificity that there is not only one single
amount of overlap from constraints. Section 5 presents clustering for one dataset. To the best of our knowl-
experimental results that compare algorithm selection edge, the only meta-learning method for clustering in-
based on CBO with algorithm selection using other volving constraints is (Van Craenendonck & Blockeel,
features of the dataset. Section 6 presents our conclu- 2016) which does not use features but simply selects
sions. the algorithm that satisfies the most constraints.

2. Related work 3. Rvalue


2.1. Constraint-based clustering As already mentioned, we assume some algorithms can
handle overlapping better than others. For example,
Clustering is the unsupervised learning task of iden- figure 1 shows a toy dataset (on the left) where two
tifying groups of similar instances in a dataset. Al- Gaussians overlap in there centre, forming a cross. In
though these groups are initially unknown, some infor- that case, EM (in the middle) is capable of retrieving
mation can be available as to what the desired solution the correct clustering while spectral clustering (SC, on
is. This information takes the form of constraints on the right) cannot. This shows the relevance of overlap-
the resulting clusters. These constraints can be pro- ping as a meta-feature to select a clustering algorithm.

55
Constraint-based measure for estimating overlap in clustering

when including bad performing datasets. This suggest


strongly that while overlapping does impact some al-
gorithms more than others, other factors also have a
significant influence of the performance of clustering
algorithms.

EM SC
all 0.31 0.32
Figure 1. Toy example of the cross dataset.
Rvalue < 0.2 0.48 0.50
Rvalue > 0.2 0.19 0.19
The Rvalue (Oh, 2011) has been used before as a mea-
Table 1. Average clustering performance measured with
sure of overlapping. Given a dataset of instances in dif-
ARI.
ferent classes, it quantifies the overlapping as a number
between 0 and 1. To compute the Rvalue of a dataset,
it considers each instance and its neighbourhood. An EM SC
instance is said in in overlapping if too many of it near- all 0.45 0.47
est neighbours are labelled differently than him. The Rvalue < 0.2 0.55 0.59
Rvalue of a dataset is then the proportion of instances Rvalue > 0.2 0.33 0.31
in overlapping. The Rvalue thus has 2 parameters:
Table 2. Same as table 1 for dataset where either EM or
the k-nearest neighbours to consider, and θ, the num-
SC scored an ARI of at least 0.2.
ber of nearest neighbours from a different class above
which an instance is in overlapping. Figure 2 shows the
Rvalue for some UCI datasets, which shows overlap- 4. Detecting overlap using constraints
ping occurs a lot in real-life datasets. For comparison,
the cross dataset just above has an Rvalue of 0.41 for While the Rvalue is a good indicator of the extent
the same parameters. to which clusters overlap, it is not useful in practice
because it requires knowledge of the clusters, which we
do not have. In this section, we present an alternative
measure: the Constraint-Based Overlap value (CBO).
The CBO is designed to correlate well with the Rvalue,
while not requiring full knowledge of the clusters.

4.1. Definition
The CBO makes use of must-link and cannot-link con-
straints. The idea is to identify specific configurations
of ML or CL constraints that indicate overlap. The
CBO uses two configurations, illustrated in figure 3:
Figure 2. Rvalue of some UCI datasets, k = 6, θ = 1.
• short CL constraints: when two points are close
To check our intuition that EM can handle overlap- together and yet belong to different clusters, this
ping better than SC, we look at the performance of is an indication that the two clusters overlap in
these algorithm w.r.t. the Rvalue. Table 1 shows the this area
average performance of these algorithm over some UCI
datasets presented in further sections. Table 2 shows • two parallel constraints, one of which is ML and
the same results but ignoring datasets were both al- the other CL, between points that are close. That
gorithm performed badly. We assume that if both is, if a and c are close to each other, and so are b
algorithm have an Adjusted Rand Index (Hubert & and d, and a and b must link while c and d cannot
Arabie, 1985) (ARI) of less than 0.2, the dataset is link, then this implies overlapping, either around
not very suitable for clustering to begin with and we a and c or around b and d (see figure).
can then ignore it. A complete list of used datasets
can be found in section 4.3. It can be seen that in The more frequent those patterns, the more the clus-
that second case, EM performs better than SC when ters overlap. A limit case of the second configura-
there is overlapping and vice versa when there is no tion is when the 2 constraints involves the same point
or little overlapping. This difference is much reduced (e.g. a = c on the figure) Then, by propagation of

56
Constraint-based measure for estimating overlap in clustering

computed as follows. Without loss of generality, as-


sume d(x1 , x2 ) + d(x01 , x02 ) ≤ d(x1 , x02 ) + d(x01 , x2 ) (this
can always be achieved by renaming x2 to x02 and vice
versa, see figure 4(b)). We then define:
score(c1 , c2 ) = s(x1 , x2 ) × s(x01 , x02 )
The multiplication ensures that if either x1 and x2 or
x01 and x02 are too far apart then the score is zero.

(a) Short cannot-link pattern

(a) Score of a single constraint

(b) Parallel and close must-link and cannot-link pattern

Figure 3. Overlapping patterns in constraints. The crosses


cluster and the squares cluster, both represented by a cir-
cle, overlap in the middle. A red line signifies a cannot-link,
while a blue line signifies a must-link constraint.

the constraints, there is a short cannot-link constraint


between the other 2 points. (b) Score of a pair of constraints

The question is how to define “short” or “close”. This Figure 4. Scoring of single constraint(a) and pair of con-
has to be relative to “typical” distances. To achieve straints(b) using the local similarity. The circles represent
this, we introduce a kind of relative similarity measure, the neighbourhoods of the points.
as follows. Let d(x, x0 ) be the distance between points
x and x0 , and  (0 ) be the distance between x (x0 ) and
In both cases, higher scores are more indicative of over-
its k’th nearest neighbour. Then
lap. To have a measure for the whole dataset, we
( aggregate these scores over the whole constraint set.
d(x,x0 )
0 1 − max(, 0) if d(x, x0 ) ≤ max(, 0 ) The idea is to compare the amount of short cannot-
s(x, x ) =
0 otherwise link constraints, direct (single pattern) or by propa-
gation(double pattern), to the total amount of short
That is: s(x, x0 ) is 1 when x and x0 coincide, and lin- constraints, both must-link and cannot-link. With CL
early goes to 0, reaching 0 when d(x, x0 ) = max(, 0 ), the set of cannot-link constraints and M L the set of
that is, x is no closer to x0 than its k’th nearest neigh- must-link constraints, we define
bour, and vice versa.
Using this relative similarity, we can assign scores to P P
score(c) + score(c1 , c2 )
both types of configurations mentioned above. c∈CL c1 ∈CL
c2 ∈M L
The score of a short constraint between two points CBO = P P
score(c) + score(c1 , c2 )
x and x0 is simply: c∈CL∪M L c1 ∈M L
c2 ∈CL∪M L
score(c) = s(x, x0 )
4.2. Stability
The score for a pair of parallel constraints, c1 be- As one can imagine, the CBO can be very noisy for
tween points x1 and x01 and c2 between x2 and x02 , is very small constraint sets. Several parameters influ-

57
Constraint-based measure for estimating overlap in clustering

Figure 5. Convergence of the CBO w.r.t. the size of the constraint set. Three datasets are considered with increasing
number of instances from left to right: iris(N=150), mammographic(N=830), yeast(N=1484). For each datasets, 80
constraint sets are sampled with various size (around 25,50,75,100,200,300,400,500). The CBO is computed for k=10 (top
row), k=20 (middle row), k=10+N/20 (bottom row). The blue points correspond to the total number of constraints. The
red points correspond to the number of constraints that actually participated in the measure. The Rvalue of the dataset
(k=10, θ = 1) is plotted as a black horizontal line.

58
Constraint-based measure for estimating overlap in clustering

ence that stability: the k-nearest neighbours to con-


sider, the size of the dataset and the size of the con-
straint set. If the k is too small and the dataset too
big, the measure would require too many constraints
not to be noisy. To solve this problem, we need k to in-
crease with the size of the dataset. For that reason, we
set k = 10 + N/20 where N is the number of instances
in the dataset. This has the desired effect while en-
suring a minimal number of neighbours are considered
for smaller datasets.
Figure 5 shows the variance of the CBO w.r.t. the size
of the constraint set for 3 datasets of different sizes.
For each dataset, several constraint sets of different
sizes were sampled from the true labels. This shows
that having a k increasing with the size of the dataset
makes the CBO more stable. Figure 6. CBO with k=10+N/20 vs Rvalue with k=6 and
θ=1.
4.3. Evaluation
The CBO is intended to serve as an alternative for the • CBO: The first system only uses the CBO as meta-
Rvalue, when the clusters are not known but some con- feature and choose EM if it is lower than 0.1, SC
straints are available. We therefore evaluate the CBO otherwise.
by comparing it to the Rvalue on a number of datasets
from the UCI repository and the OpenML repository, • Unsup: The second system uses unsupervised fea-
namely iris, glass, ionosphere, wine, vertebral, ecoli, tures that have been used by previous cluster-
seeds, students, robotnav4, yeast, zoo, breast cancer ing algorithm selection system, and that are pre-
wisconsin, mammographic, banknote, haberman, seg- sented in table 3. As in (Ferrari & de Castro,
mentation, landsat, sonar, libras, hillvalley, optdigits, 2015), we consider an attribute discrete if the
steel, leaf, spambase, parkinsons, occupancy, balance, number of distinct values observed for it is less
pageblocks, diabetes, vehicle, authorship, ailerons, jedit, than 30% of the number of instances. Using
kc1, megawatt, blood, climate, fertility, heart, robot- these meta-features, we learn a classifier to pre-
fail, volcanoes, engine. For each dataset, 20 constraint dict which of EM or SC will perform better.
sets of 200 random constraints were sampled. Figure • Full: The third system combines the unsupervised
6 visualises how the Rvalue and the CBO (averaged features and the CBO.
over the 20 constraint sets) correlate, over the different
datasets. This graph was produced for one particular
value of k and θ for Rvalue, but other values give very Unsupervised meta-feature description
Natural log of the number of instances
similar results. With a correlation of 0.93, it is clear Natural log of the number of attributes
that CBO is useful as a proxy for Rvalue. Percentage of outliers
Percentage of discrete attributes
Mean entropy of discrete attributes
5. Algorithm selection Mean absolute correlation between distrete attributes
Mean skewness of continuous attributes
Now that we have the CBO to estimate overlap using Mean kurtosis of continuous attributes
constraints, we can use it for meta-learning, and more Mean absolute correlation between numerical attributes
specifically algorithm selection. We picked 2 algo-
rithms to select from: Expectation Maximization(EM) Table 3. Unsupervised meta-features used for algorithm
and Spectral Clustering(SC). We chose these two be- selection.
cause among algorithms that build a global model like
EM and algorithms that use local properties of the These 3 methods were run on the datasets presented
data like SC, these are 2 algorithms that perform the in the previous section. For the constraint-based fea-
best on our datasets. To determine the performance tures, 20 constraint sets of about 200 constraints were
of EM and SC, we ran the algorithms with different sampled at random for each dataset. Table 5 shows the
parameters and kept the best run. Then, we build 3 ARI averaged over datasets and constraint sets, using
algorithm selection systems: a leave-one-out cross validation for the 2 methods that

59
Constraint-based measure for estimating overlap in clustering

involved a classifier (Unsup and Ful). For those two


methods, we used 3 classifiers: Support Vector Ma-
chine (SVM), Logistic Regression (LR) and Decision
Trees (DT). For all algorithms (clustering, classifier,
scores), we used the scikit-learn Python package (Pe-
dregosa et al., 2011).
Classif. EM SC CBO Unsup Full Oracle
0.31 0.32 0.33 0.37
SVM 0.33 0.33
LR 0.33 0.33
DT 0.33 0.31

Table 4. Average ARI of multiple approaches: consistently


EM or SC, selecting one of these using CBO, unsupervised
features, or both (“Full”); and using an oracle to predict
the best system.
Figure 7. Performance of the CBO algorithm selection
Classif. EM SC CBO Unsup Full Oracle (AS) when the threshold for choosing EM or SC varies
0.45 0.47 0.48 0.53 from 0 to 1 on the x axis.
SVM 0.46 0.46
LR 0.48 0.48
DT 0.47 0.47 a measure called CBO, which uses information from
must-link / cannot-link constraints to estimate the
Table 5. Same for datasets where either EM or SC scored amount of overlap. We have shown that the CBO cor-
an ARI of at least 0.2. relates well with the Rvalue, a previously proposed
measure for overlap in a completely known clustering.
On average, algorithm selection methods perform a bit
As such, the CBO can be a useful measure in itself, also
better than each algorithm separately. The improve-
outside the context of algorithm selection for cluster-
ment is quite modest, but relative to the maximum
ing.
improvement possible (by using an oracle), still sub-
stantial. Interestingly, the CBO on its own performs Third, we have empirically estimated the usefulness
as well as the whole set of features defined before. On of selecting the most appropriate clustering method,
the other hand, combining the CBO with those fea- among two methods with quite different properties:
tures does not further improve the results. EM, which is good at detecting overlapping clusters
but finds only elliptic clusters, and SC, which can
The choice of a threshold for the CBO method is rather
find clusters of any shape but cannot return overlap-
flexible. We set it to 0.1 as it is a good value without
ping clusters. The conclusion is that the CBO is in-
being over-fitting. Figure 7 shows the variation of the
deed informative for selecting the best among these
performance of that method when varying that thresh-
two; it yields a small but noticeable improvement, and
old for dataset with an ARI of at least 0.2 (which cor-
this improvement is comparable to the improvement
responds to the first line of table 5). It can be seen
obtained by using a set of 10 unsupervised features
that any value between 0.1 and 0.3 has about the same
previously proposed for clustering algorithm selection.
score.
When combined with those other features, however,
the CBO does not yield a further improvement. This
6. Conclusion suggests that the information contained in the CBO is
already contained in the other features.
Algorithm selection and meta-learning have been stud-
ied mostly in the classification setting. In this paper, Compared to choosing the best clustering method us-
we have studied them in the context of clustering. Our ing an oracle, CBO-based selection leaves room for
main contributions are as follows. further improvement. This is perhaps not surprising,
given that the amount of overlap among clusters is one
First, we have identified overlap between clusters as a
aspect that determines the effectiveness of clustering
relevant property of the true clustering, meaning the
methods, but certainly not the only one. An indica-
clustering according to the true labels.
tion of cluster shapes, for instance, is likely to give
Second, because such overlap is difficult to quantify additional information. The question remains open to
without knowing the cluster labels, we have proposed which extent this and other features can be derived

60
Constraint-based measure for estimating overlap in clustering

from constraints, and to what extent this can lead to Hubert, L., & Arabie, P. (1985). Comparing partitions.
better clustering algorithm selection. Journal of classification, 2, 193–218.
Oh, S. (2011). A new dataset evaluation method
Acknowledgments based on category overlap. Computers in Biology
Research financed by the KU Leuven Research Council and Medicine, 41, 115–122.
through project IDO/10/012. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel,
V., Thirion, B., Grisel, O., Blondel, M., Pretten-
References hofer, P., Weiss, R., Dubourg, V., Vanderplas, J.,
Passos, A., Cournapeau, D., Brucher, M., Perrot,
Adam, A., & Blockeel, H. (2015). Dealing with over-
M., & Duchesnay, E. (2011). Scikit-learn: Machine
lapping clustering: a constraint-based approach to
learning in Python. Journal of Machine Learning
algorithm selection. Meta-learning and Algorithm
Research, 12, 2825–2830.
Selection workshop-ECMLPKDD2015 (pp. 43–54).
Pelleg, D., & Baras, D. (2007). K-means with large and
Babaki, B., Guns, T., & Nijssen, S. (2014). Con-
noisy constraint sets. In Machine learning: Ecml
strained clustering using column generation. In In-
2007, 674–682. Springer.
tegration of ai and or techniques in constraint pro-
gramming, 438–454. Springer. Prudêncio, R. B., De Souto, M. C., & Ludermir,
T. B. (2011). Selecting machine learning algo-
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. rithms using the ranking meta-learning approach.
(2005). Learning a mahalanobis metric from equiva- In Meta-learning in computational intelligence, 225–
lence constraints. Journal of Machine Learning Re- 243. Springer.
search, 6, 937–965.
Ruiz, C., Spiliopoulou, M., & Menasalvas, E. (2007).
Bilenko, M., Basu, S., & Mooney, R. J. (2004). In- C-dbscan: Density-based clustering with con-
tegrating constraints and metric learning in semi- straints. In Rough sets, fuzzy sets, data mining and
supervised clustering. Proceedings of the twenty- granular computing, 216–223. Springer.
first international conference on Machine learning
(p. 11). Soares, R. G., Ludermir, T. B., & De Carvalho, F. A.
(2009). An analysis of meta-learning techniques for
De Souto, M. C., Prudencio, R. B., Soares, R. G., ranking clustering algorithms applied to artificial
De Araujo, R. G., Costa, I. G., Ludermir, T. B., data. In Artificial neural networks–icann 2009, 131–
Schliep, A., et al. (2008). Ranking and selecting clus- 140. Springer.
tering algorithms using a meta-learning approach.
Neural Networks, 2008. IJCNN 2008.(IEEE World Van Craenendonck, T., & Blockeel, H. (2016).
Congress on Computational Intelligence). IEEE In- Constraint-based clustering selection. arXiv
ternational Joint Conference on (pp. 3729–3735). preprint arXiv:1609.07272.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). von Luxburg, U. (2007). A tutorial on spectral clus-
Maximum likelihood from incomplete data via the tering. Statistics and computing, 17, 395–416.
em algorithm. Journal of the royal statistical society.
Series B (methodological), 1–38. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al.
(2001). Constrained k-means clustering with back-
Duong, K.-C., Vrain, C., et al. (2015). Constrained ground knowledge. ICML (pp. 577–584).
clustering by constraint programming. Artificial In-
telligence. Wang, X., & Davidson, I. (2010). Flexible con-
strained spectral clustering. Proceedings of the 16th
Ferrari, D. G., & de Castro, L. N. (2015). Cluster- ACM SIGKDD international conference on Knowl-
ing algorithm selection by meta-learning systems: edge discovery and data mining (pp. 563–572).
A new distance-based problem characterization and
ranking combination methods. Information Sci-
ences, 301, 181–194.

Hu, P., Vens, C., Verstrynge, B., & Blockeel, H.


(2013). Generalizing from example clusters. Dis-
covery Science (pp. 64–78).

61
Conference Track
Extended Abstracts

62
A Probabilistic Modeling Approach to Hearing Loss Compensation

Thijs van de Laar1 [email protected]


Bert de Vries1,2 [email protected]
1
Department of Electrical Engineering, Eindhoven University of Technology
2
GN Hearing, Eindhoven

Keywords: Hearing Aids, Hearing Loss Compensation, Probabilistic Modeling, Bayesian Inference

1. Introduction praisals (the data). The architecture of our design loop


is shown in Fig. 1.
Hearing loss is a serious and prevalent condition that
is characterized by a frequency-dependent loss of sen-
sitivity for acoustic stimuli. As a result, a tone that
is audible for a normal-hearing person might not be
audible for a hearing-impaired patient. The goal of a
hearing aid device is to restore audibility by amplifi-
cation and compressing the dynamic range of acoustic
inputs to the remaining audible range of the patient.
In practice, current hearing aids apply frequency- and
intensity-dependent gains that aim to restore normal
audibility levels for the impaired listener.
The hearing aid algorithm design problem is a diffi-
cult engineering issue with many trade-offs. Each pa-
tient has her own auditory loss profile and individual Figure 1. The iterative algorithm design loop, featuring the
preferences for processed audio signals. Yet, we can- interplay between signal processing (Eq.5) and parameter
not afford to spend intensive tuning sessions with each estimation (Eq.6). Tuning parameters are designated by θ.
patient. As a result, there is a need for automating Figure adapted from (van de Laar & de Vries, 2016).
algorithm design iterations based on in-situ collected
patient feedback.
2. Model Specification
This short paper summarizes ongoing work on a prob-
abilistic modeling approach to the design of personal- We describe the hearing loss compensation model for
ized hearing aid algorithms (van de Laar & de Vries, one frequency band. In practice, a hearing aid would
2016). In this framework, we first specify a prob- apply the derived algorithms to each band indepen-
abilistic generative model that includes an explicit dently. For a given patient wearing hearing aids, we
description of the hearing loss problem. Given the define the received sound level as
model, hearing aid signal processing relates to on-line
Bayesian state estimation (similar to Kalman filter- rt = L(st + gt ; φ) (1)
ing). Estimation of the tuning parameters (known as where st is the sound pressure level (in dB SPL) of the
the ‘fitting’ task in hearing aid parlance) corresponds input signal that enters the hearing aid, gt is the hear-
to Bayesian parameter estimation. The innovative as- ing aid gain and L is a function with tuning param-
pect of the framework is that both the signal process- eters φ that models the patient’s hearing impairment
ing and fitting tasks can be automatically inferred from in accordance with (Zurek & Desloge, 2007).
the probabilistic model in conjunction with patient ap-
Hearing loss compensation balances two simultaneous
constraints. First, we want restored sound levels to be
Appearing in Proceedings of Benelearn 2017. Copyright approximately experienced at normal hearing levels:
2017 by the authors.
st |gt ∼ N (rt , ϑ) = N (L(st + gt ; φ), ϑ) . (2)

63
A Probabilistic Modeling Approach to Hearing Loss Compensation

Secondly, in order to minimize acoustic signal distor-


tion, the compensation gain should remain as constant 1 ↓ ς
as possible, which we model as
N
gt |gt−1 ∼ N (gt−1 , ς) . (3)
↓ 2
The trade-off between conditions Eqs. 2 and 3 is con- g gt
... t−1 + = ...
trolled by the noise variances ϑ and ς. The full gener- → → →
ative model is specified by combining Eqs. 2 and 3: 3 13
↑ 12
p(g0:T , s1:T , ς, ϑ, φ) = (4)
+
T
Y
p(g0 ) p(ς) p(ϑ) p(φ) p(st |gt , φ, ϑ) p(gt |gt−1 , ς) .
11 ↑
t=1
φ
In this model, st is an observed input sequence, gt is → L
the hidden gain signal, and θ = {ς, ϑ, φ} are tuning 6
parameters. 10 ↑ ↑ 5

ϑ
N +
→ →
3. Signal Processing and Fitting as 7 9
Probabilistic Inference ↑ 8

The signal processing and parameter estimation algo- =


rithms follow by applying Bayesian inference to the
generative model. The hearing aid signal processing 4 ↑ st
algorithm is defined by estimating the current gain gt
from given past observations s1:t and given parameter
settings θ = θ̂. In a Bayesian framework, this amounts Figure 2. A Forney-style factor graph for one time step in
to computing the generative model. The small numbered arrows indicate
a recursive message passing schedule for executing the sig-
R R
··· p(g0:t , s1:t , θ̂) dg0 . . . dgt−1 nal processing task of Eq. 5. Figure adapted from (van de
p(gt |s1:t , θ̂) = R R . (5) Laar & de Vries, 2016).
··· p(g0:t , s1:t , θ̂) dg0 . . . dgt
A suitable personalized parameter setting is vital to
satisfactory signal processing. Bayesian parameter es- in a Forney-style Factor Graph (FFG) (Forney, 2001).
timation amounts to computing In an FFG, nodes correspond to factors and edges rep-
resent variables. The FFG for the generative model of
p(gk−1:n , sk:n , θ) Eq. 4 is depicted in Fig. 2. The arrows indicate the
p(θ|D) = R . (6)
p(gk−1:n , sk:n , θ) dθ message passing schedule that recursively executes the
signal processing inference problem of Eq. 5. Partic-
In this formula, we assume availability of a training ular message passing update rules were derived in ac-
set of pairs D = {(gk−1:n , sk:n )}, where k and n > k cordance with (Loeliger, 2007) and (Dauwels, 2007).
are positive indices. This training set can be obtained
from in-situ collected patient appraisals on the quality Simulations show that the inferred signal processing
of the currently selected hearing aid algorithm (Fig.1). algorithm exhibits compressive amplification behavior
After the user casts a positive appraisal, we collect a that is similar to the manually designed dynamic range
few seconds of both the hearing aid input signal and compression circuits in hearing aids. Simulations also
corresponding gain signals and add these signal pairs verify that the parameter estimation algorithm is able
to the training database. to recover preferred tuning parameters from a user-
selected training example.

4. Inference Execution through Crucially, our algorithms for signal processing and fit-
Message Passing ting can be automatically inferred from a given model
plus in-situ collected patient appraisals. Therefore, in
Equations (5) and (6) are very difficult to compute contrast to existing design methods, this approach al-
directly. We have developed a software toolbox to au- lows for hearing aid personalization by a patient with-
tomate these inference problems by message passing out need for human design experts in the loop.

64
A Probabilistic Modeling Approach to Hearing Loss Compensation

References
Dauwels, J. (2007). On Variational Message Passing
on Factor Graphs. IEEE International Symposium
on Information Theory (pp. 2546–2550).
Forney, G.D., J. (2001). Codes on graphs: normal re-
alizations. IEEE Transactions on Information The-
ory, 47, 520–548.

Loeliger, H.-A. (2007). Factor Graphs and Message


Passing Algorithms – Part 1: Introduction.
van de Laar, T., & de Vries, B. (2016). A proba-
bilistic modeling approach to hearing loss compen-
sation. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 24, 2200–2213.
Zurek, P. M., & Desloge, J. G. (2007). Hearing loss
and prosthesis simulation in audiology. The Hearing
Journal, 60, 32–33.

An In-situ Trainable Gesture Classifier

Anouk van Diepen1 [email protected]


Marco Cox1 [email protected]
Bert de Vries1,2 [email protected]
1 Department of Electrical Engineering, Eindhoven University of Technology, 2 GN Hearing BV, Eindhoven

Keywords: gesture recognition, probabilistic modeling, Bayesian inference, empirical Bayes.

1. Introduction

Gesture recognition, i.e., the recognition of pre-defined gestures by arm or hand movements, enables a natural extension of the way we currently interact with devices (Horsley, 2016). Commercially available gesture recognition systems are usually pre-trained: the developers specify a set of gestures, and the user is provided with an algorithm that can recognize just these gestures. To improve the user experience, it is often desirable to allow users to define their own gestures. In that case, the user needs to train the recognition system herself by a set of example gestures. Crucially, this scenario requires learning gestures from just a few training examples in order to avoid overburdening the user.

We present a new in-situ trainable gesture classifier based on a hierarchical probabilistic modeling approach. Casting both learning and recognition as probabilistic inference tasks yields a principled way to design and evaluate algorithm candidates. Moreover, the Bayesian approach facilitates learning of prior knowledge about gestures, which leads to fewer needed examples for training new gestures.

2. Probabilistic modeling approach

Under the probabilistic modeling approach, both learning and recognition are problems of probabilistic inference in the same generative model. This generative model is a joint probability distribution that specifies the relations among all (hidden and observed) variables in the model.

Let y = (y_1, ..., y_T) be a time series of measurements corresponding to a single gesture with underlying characteristics θ. The characteristics are unique to gestures of type (class) k. We can capture these dependencies by the probability distribution

p(y, \theta, k) = p(y \mid \theta) \cdot p(\theta \mid k) \cdot p(k) ,   (1)

where p(y|θ) is the dynamical model, p(θ|k) describes the gesture characteristics, and p(k) is the prior over the gesture class index.

Because the measurement sequence is temporally correlated, it is natural to choose p(y|θ) to be a hidden Markov model (HMM). HMMs have been successfully applied to gesture classification in the past (Mäntylä et al., 2000). Under this model, θ represents the set of parameters of the HMM.

During learning, the parameter values θ of gestures of class k need to be learned from data. We choose to learn this distribution using a two-step approach.

In the first step, a prior for θ is constructed. This prior distribution can be obtained in various ways. We have chosen to construct one that captures the common characteristics that are shared among all gestures. This is done by learning the distribution using dataset D, consisting of one measurement from each gesture class. This can be expressed as

p(\theta \mid D, k) = \frac{p(D, \theta, k)}{\int p(D, \theta, k) \, d\theta} .   (2)

In the second step, the parameter distribution is learned for a specific gesture class, using the previously learned p(θ|D, k) and a set of measurements D_k with the same class k:

p(\theta \mid D, D_k, k) = \frac{p(D_k \mid \theta) \, p(\theta \mid D, k) \, p(k)}{\int p(D_k, \theta, k \mid D) \, d\theta} .   (3)

In practice, exact evaluation of Eq. 2 and Eq. 3 is intractable for our model due to the integral in the denominator. We use variational Bayesian inference to approximate this distribution (MacKay, 1997), which results in a set of update equations that need to be iterated until convergence.


During recognition, the task of the algorithm is to identify the gesture class with the highest probability of having generated the measurement y. This is expressed by

p(k \mid y) = \frac{\int p(y, \theta, k) \, d\theta}{\sum_k \int p(y, \theta, k) \, d\theta} .   (4)

If we assume that each gesture is performed with the same a priori probability p(k), then p(y|k) ∝ p(k|y). To calculate p(y|k), the method as proposed in Chapter 3 of Beal (2003) is used: the obtained variational posterior distribution of the parameters is replaced by its mean, which allows exact evaluation of p(y|k).

3. Experimental validation

We built a gesture database using a Myo sensor bracelet (ThalmicLabs, 2016), which is worn just below the elbow (see Fig. 1). The Myo's inertial measurement unit measures the orientation of the bracelet. This orientation signal is sampled at 6.7 Hz, converted into the direction of the arm, and quantized using 6 quantization directions. The database contains 17 different gesture classes, each performed 20 times by the same user. The duration of the measurements was fixed to 3 seconds.

Figure 1. The Myo sensor bracelet used to measure gestures.

As a measure of performance, we use the recognition rate defined as:

Recognition rate = (# correctly classified) / (total # of samples) .   (5)

The gesture database is split in a training set containing 5 samples of every gesture class, and a test set containing the remaining (15 × 17 =) 255 samples. The recognition rate is evaluated on models trained on 1 through 5 examples. To minimize the influence of the training order, the results are averaged over 5 different permutations of the training set.

To compare our algorithm, we have also evaluated the recognition rate of the same algorithm with uninformative prior distributions and of a 1-Nearest Neighbor (1-NN) algorithm using the same protocol.

Figure 2. Recognition rates of the 1-NN algorithm, the proposed algorithm without prior information (HMM), and the proposed algorithm with informed prior distributions (HMM prior).

Figure 2 shows the recognition rates of the algorithms. Both hidden Markov based algorithms have a higher recognition rate than the 1-NN algorithm. For personalization of gesture recognition, we are especially interested in learning gesture classes using a low number of training examples. In particular for one-shot training (from one example only), the hidden Markov model using the learned prior distribution corresponds to the highest recognition rate.

The algorithm was also tested for gestures that are not used to learn the prior distribution. When the prior is constructed with similar gestures, the new gestures are also learned faster than when uninformative priors are used.

There are multiple ways to incorporate these results in a practical gesture recognition system. For example, the prior distribution can be constructed by the developers of the algorithm. Another possibility is to allow users to provide prior distributions themselves. This means that the system will take longer to set up, but when a user wants to learn a specific gesture under in-situ conditions, it will require fewer training examples.
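As a minimal sketch of the recognition rule of Eq. 4 and the recognition rate of Eq. 5, the snippet below scores a discrete (quantized) observation sequence under one already-learned HMM per gesture class with the standard forward algorithm and picks the most likely class, assuming equal class priors. All HMM parameters and test sequences are random placeholders, not the learned models from the experiments above.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log p(y|k) of a discrete observation sequence
    under an HMM with initial distribution pi, transitions A and emissions B."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

def classify(obs, class_hmms):
    # Eq. 4 with equal class priors p(k): pick the class maximizing p(y|k)
    return int(np.argmax([forward_loglik(obs, *hmm) for hmm in class_hmms]))

def random_hmm(n_states=3, n_symbols=6):
    # Hypothetical HMM over 6 quantized arm directions
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    return pi, A, B

class_hmms = [random_hmm() for _ in range(17)]          # one model per gesture class
test_set = [(rng.integers(0, 6, size=20), rng.integers(0, 17)) for _ in range(50)]
correct = sum(classify(y, class_hmms) == k for y, k in test_set)
print("Recognition rate:", correct / len(test_set))     # Eq. 5
```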


References
Beal, M. J. (2003). Variational algorithms for approx-
imate Bayesian inference. Doctoral dissertation,
University College London.
Horsley, D. (2016). Wave hello to the next interface.
IEEE Spectrum, 53, 46–51.

MacKay, D. J. C. (1997). Ensemble Learning for Hid-


den Markov Models (Technical Report).
Mäntylä, V.-M., Mäntyjärvi, J., Seppänen, T., & Tuu-
lari, E. (2000). Hand gesture recognition of a mobile
device user. 2000 IEEE International Conference on
Multimedia and Expo. (pp. 281–284).

ThalmicLabs (2016). Myo Gesture Control Armband.


Retrieved from https://www.myo.com/.

Text mining to detect indications of fraud in annual reports
worldwide

Marcia Fissette [email protected]


KPMG, Laan van Langerhuize 1, 1186 DS Amstelveen, The Netherlands
Bernard Veldkamp [email protected]
University of Twente, Drienerlolaan 5, 7522 NB Enschede, The Netherlands
Theo de Vries [email protected]
University of Twente, Drienerlolaan 5, 7522 NB Enschede, The Netherlands

Keywords: Text mining, Fraud, Annual reports

(The original paper is under review.)

1. Introduction

Fraud affects the financial results presented in the annual reports of companies worldwide. Analysis performed on annual reports focuses on the quantitative data in these reports. However, the amount of textual information in annual reports increased in the past decade, with companies using the reports to project themselves. The texts provide information that is complementary to the financial results. Therefore, the analysis of the textual information in annual reports may provide indications of the presence of fraud within a company. This research uses an extensive, realistic data set containing annual reports of companies worldwide to answer the research question:

Can a text mining model be developed that can detect indications of fraud in the management discussion and analysis section of annual reports of companies worldwide?

2. The data

We selected the fraud cases in the period from 1999 to 2013 from news messages and the Accounting and Auditing Enforcement Releases (AAERs) published by the Securities and Exchange Commission (SEC). The selection process results in 402 annual reports in the fraud set. For each annual report in the fraud set we collect annual reports of companies similar to the companies in the fraud set, but for which no fraudulent activities are known. The latter category is referred to as the no-fraud reports. We match the fraud and no-fraud reports on year, sector and number of employees. The matching process results in 1,325 annual reports which do not contain known fraud. The resulting data set contains annual reports in the period from 1999 to 2011. The formats of these annual reports differ depending on the stock exchange on which the organization is listed or the country of origin. Filings to the SEC are on forms 10-K or 20-F, while others have a freer format.

It is argued that the Management Discussion and Analysis (MD&A) section is the most read part of the annual report (Li, 2010). Previous research on 10-K reports showed promising results (Cecchini et al., 2010; Glancy & Yadav, 2011; Purda & Skillicorn, 2010; Purda & Skillicorn, 2012; Purda & Skillicorn, 2015). Therefore, the MD&A section is a good starting point to determine whether text mining is a suitable means for detecting indications of fraud in annual reports worldwide. The manual extraction of the MD&A sections is a labor-intensive task. Therefore, we developed an algorithm that is able to detect the start of the MD&A section based on the section headers. The MD&A section is not always explicitly present in the free format annual reports. From these reports we selected the sections that most closely correspond to the MD&A section.

3. The text mining model

Before extracting the features for the text mining model, graphs, figures and tables are excluded because their primary way to convey information is not based on text. We use the tokenizers from the Natural Language Toolkit (NLTK) for Python to identify sentences and words (Bird & Klein, 2009).


HTML-tags in forms 10-K and 20-F are removed using the Python package 'BeautifulSoup'.

We develop a baseline model comprising word unigrams. To obtain an informative set of word unigrams we exclude stop words and stem the words using the Porter stemmer in NLTK (Bird & Klein, 2009). Words that appear in only one MD&A section in the entire data set are not informative. Therefore these words will not be used as features. Furthermore, we apply 'term frequency-inverse document frequency' (TF-IDF) as a normalization step of the word counts to take into account the length of the text and the commonality of the word in the entire data set (Manning & Schütze, 1999). Finally, the chi-squared method is applied to select the most informative features. We start with the top 1,000 and increase the number of features in steps of 1,000 until 24,000 to find the optimal number of features.

The Naïve Bayes classifier (NB) and Support Vector Machine (SVM) have been proven successful in text classification tasks in several domains (Cecchini et al., 2010; Conway et al., 2009; Glancy & Yadav, 2011; Goel et al., 2010; He & Veldkamp, 2012; Joachims, 1998; Manning & Schütze, 1999; Metsis et al., 2006; Purda & Skillicorn, 2015). Therefore, this research uses these two types of machine learning approaches to develop a baseline text mining model. The models are developed with 10-fold stratified cross validation on 70% of the data; the remaining 30% is saved as a test set for the best performing model from the development phase.

A word unigrams approach is a limited way of looking at texts because it omits a part of the textual information, such as the grammar. Therefore, we extend the baseline model with categories of linguistic features to determine whether other types of textual information may improve the results of the baseline model. The first category consists of descriptive features, which include the number of words and the number of sentences in the text. The second category of features represents the complexity of a text; examples of these features are the average sentence length and the percentage of long words. The third group captures grammatical information, such as the percentage of verbs, nouns and several types of personal pronouns. The fourth category assesses the readability of the text by using readability scores, including the 'Flesch Reading Ease Score'. The fifth category measures psychological processes such as positive and negative sentiment words. Finally, we include word bigrams and grammatical relations between two words, extracted with the Stanford parser, as features to the model (De Marneffe et al., 2006).

4. Results

Figure 1 shows the accuracy of the NB and SVM baseline models. For both types of models the optimal number of features is around 10,000 unigrams. With an accuracy of 89% the NB model outperforms the SVM, which achieves an accuracy of 85%.

Figure 1. Performance of the Naïve Bayes and Support Vector Machine models.

The linguistic features of the descriptive, complexity, grammatical, readability and psychological process categories did not improve the result of the baseline models. The performance on the test set of the SVM model increased to 90% by adding the most informative bigrams. The addition of the relation features did not further increase the performance.

5. Discussion and conclusion

The results show that it is possible to use text mining techniques to detect indications of fraud in the management discussion and analysis section of annual reports of companies worldwide. The word unigrams capture the majority of the subtle information that differentiates fraudulent from non-fraudulent annual reports. The additional information that the linguistic features provide is very limited, and only attributable to the bigrams. Additional research may address the effects of the random 10-fold splitting process, the effects of multiple authors on the linguistic features of a text, and the possibilities of an ensemble of machine learning algorithms for detecting fraud in annual reports worldwide.
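The baseline model of Section 3 (unigram TF-IDF features, chi-squared feature selection, and a Naïve Bayes classifier) can be sketched with scikit-learn as follows. This is a simplified illustration rather than the authors' implementation; the texts are hypothetical stand-ins for MD&A sections, and the feature count is scaled down accordingly.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for MD&A sections (1 = known fraud, 0 = no known fraud)
texts = ["revenue grew strongly this year", "goodwill impairment was restated",
         "we expect stable cash flows", "certain transactions were not recorded"] * 25
labels = [0, 1, 0, 1] * 25

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),   # word unigrams + TF-IDF weighting
    ("chi2", SelectKBest(chi2, k=10)),                   # keep the most informative features
    ("nb", MultinomialNB()),                             # Naive Bayes baseline
])

# 10-fold stratified cross-validation accuracy (stratification is automatic for classifiers)
print(cross_val_score(model, texts, labels, cv=10, scoring="accuracy").mean())
```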


References

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. O'Reilly Media Inc.

Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010). Making words work: Using financial text as a predictor of financial events. Decision Support Systems, 50, 164–175.

Purda, L., & Skillicorn, D. (2015). Accounting variables, deception, and a bag of words: Assessing the tools of fraud detection. Contemporary Accounting Research, 32, 1193–1223.

Purda, L. D., & Skillicorn, D. (2010). Reading between the lines: Detecting fraud from the language of financial reports. Available at SSRN: http://ssrn.com/abstract=1670832.
Conway, M., Doan, S., Kawazoe, A., & Collier, N.
(2009). Classifying disease outbreak reports using n-
grams and semantic features. International journal
of medical informatics, 78, e47–e58.

De Marneffe, M.-C., MacCartney, B., Manning, C. D.,


et al. (2006). Generating typed dependency parses
from phrase structure parses. Proceedings of LREC
(pp. 449–454).

Glancy, F. H., & Yadav, S. B. (2011). A computa-


tional model for financial reporting fraud detection.
Decision Support Systems, 50, 595–601.

Goel, S., Gangolly, J., Faerman, S. R., & Uzuner, O.


(2010). Can linguistic predictors detect fraudulent
financial filings? Journal of Emerging Technologies
in Accounting, 7, 25–46.

He, Q., & Veldkamp, D. B. (2012). Classifying unstruc-


tured textual data using the product score model:
an alternative text mining algorithm. In T. Eggen
and B. Veldkamp (Eds.), Psychometrics in practice
at rcec, 47 – 62. Enschede: RCEC.

Joachims, T. (1998). Text categorization with support


vector machines: Learning with many relevant fea-
tures. Springer.

Li, F. (2010). The information content of forward-looking statements in corporate filings: A naïve Bayesian machine learning approach. Journal of Accounting Research, 48, 1049–1102.

Manning, C. D., & Schütze, H. (1999). Foundations of


statistical natural language processing. Cambridge,
MA, USA: MIT Press.

Metsis, V., Androutsopoulos, I., & Paliouras, G.


(2006). Spam filtering with naive bayes-which naive
bayes? CEAS (pp. 27–28).

Purda, L., & Skillicorn, D. (2012). Accounting vari-


ables, deception, and a bag of words: Assessing
the tools of fraud detection. Available at SSRN:
http://ssrn.com/abstract=1670832.

Do you trust your multiple instance learning classifier?

Veronika Cheplygina1,2 [email protected]


Lauge Sørensen3
David M. J. Tax4
Marleen de Bruijne2,3
Marco Loog4,3
1 Medical Image Analysis group, Eindhoven University of Technology, The Netherlands
2 Biomedical Imaging Group Rotterdam, Erasmus Medical Center, Rotterdam, The Netherlands
3 The Image Section, University of Copenhagen, Copenhagen, Denmark
4 Pattern Recognition Laboratory, Delft University of Technology, The Netherlands

Keywords: multiple instance learning

Abstract

Multiple instance learning (MIL) is a weakly supervised learning scenario where labels are given only for groups (bags) of examples (instances). Because some MIL classifiers can provide instance labels at test time, MIL is popular in applications where labels are difficult to acquire. However, MIL classifiers are frequently only evaluated on their bag-level, not instance-level performance. In this extended abstract, which covers previously published work, we demonstrate why this could be problematic and discuss this open problem.

(Preliminary work. Under review for Benelearn 2017. Do not distribute.)

1. Introduction

Consider the task of training a classifier to segment a medical image into abnormal and healthy image patches. Traditionally, we would need examples of both healthy and abnormal patches to train a supervised classifier. In medical imaging, obtaining such ground truth patch labels is often difficult, but image labels, such as the diagnosis of the patient, are more easily available. The lack of ground truth patch labels calls for weakly-supervised approaches, such as multiple instance learning (MIL).

In MIL, we are only given bags which are labeled positive or negative. Positive bags are often assumed to have a few positive instances and negative bags are assumed to have only negative instances. In our example, positive bags are images with an abnormal diagnosis (and hence abnormal patches), while negative bags are images of healthy subjects. We can then train a classifier to distinguish positive from negative bags as well as possible. In some cases, while the classifier is learning to classify bags, it also learns to classify instances, providing the sought-after patch labels. It's like music to our ears - we only need (easy) image labels as input, but we get (difficult) patch labels as output. It's therefore not surprising that MIL is gaining popularity in medical image analysis - see (Quellec et al., 2017) for a recent survey.

Now that we trained a MIL classifier with image labels, we would like to evaluate it. Since we trained our classifier to classify bags, we evaluate it on its ability to classify bags in the test set. We then choose the best bag classifier as the classifier that will give us those elusive patch labels. Where this goes wrong is that the best bag classifier is typically not the best instance classifier (Vanwinckelen et al., 2016). Consider a test image with patches {A, B, C}, of which only A is abnormal. Classifiers which classify any subset of {{A}, {B}, {C}, {A,B}, {A,C}, {B,C}, {A,B,C}} as abnormal are equally good classifiers for our test image, but have very different performances on patch level! Therefore, evaluating a weakly-supervised approach calls for ground-truth patch labels.

2. Methods and Results

If we are lucky, we can ask an expert to label a part of the patches in the test image, or perhaps, just to visually assess the results.


But what can we do if this isn't possible? The approach we proposed in our previous work (Cheplygina et al., 2015) is to invent unsupervised patch-level evaluation measures which do not need any patch labels. We reasoned that, if a classifier is finding the true patch labels, it should find similar patch labels, even if we change the classifier slightly. If the classifier is finding different patch labels every time, we probably don't want to trust it. By changing the classifier slightly and evaluating the stability of the patch labels, we get a different sense of how well the classifier is doing.

Let z_i and z'_i be the patch-level outputs of the same type of classifier, that has been slightly changed. Then we can define a very simple stability measure as

S_+(z, z') = n_{11} / (n_{01} + n_{10} + n_{11})   (1)

where n_{00} = |{i | z_i = 0 ∧ z'_i = 0}|, n_{01} = |{i | z_i = 0 ∧ z'_i = 1}|, n_{10} = |{i | z_i = 1 ∧ z'_i = 0}| and n_{11} = |{i | z_i = 1 ∧ z'_i = 1}|. In other words, we measure the agreement of the classifiers on patches that either of the two considered positive.

In our experiments we used six MIL datasets: Musk1, Musk2, Breast, Messidor, COPD validation and COPD test. Musk1 and Musk2 are MIL benchmark datasets, while the others are medical imaging datasets. We split each dataset into a training set and a test set.

We trained eight types of MIL classifiers: simpleNM, miNM, simple1NN, mi1NN, simpleSVM, miSVM, MILES and MILBoost. To change each classifier slightly, we trained it on 10 random samples of 80% of the training bags. We evaluated these 10 classifier versions on the fixed test set, and computed two measures: the bag-level performance, and the instance-level stability, averaged over all 1/2 · 10(10 − 1) = 45 pairs of the 10 slightly changed versions of the same classifier.

Figure 1. Bag performance (AUC) vs instance stability for six datasets (each plot) and eight types of MIL classifiers (each marker). The dotted lines show the Pareto frontiers, which indicate the "best" classifiers if both bag AUC and instance stability are considered. Figure from (Cheplygina et al., 2015), with permission.

In Fig. 1 we plot the bag performance against the instance-level stability. We see that the classifier with the best bag performance is not always the most stable classifier. For example, for the Musk1 dataset, the best bag classifier is MILES. But, taking into account the instance labels, we might want to choose mi1NN, sacrificing a little bit of bag performance, but gaining a lot of stability.

3. Conclusions

The take-home message is that if we use MIL to classify instances, we should be careful about how we evaluate the classifier - only bag performance might not be sufficient, and the instance labels given by the best bag classifier might not be reliable. A possible solution is to look at additional, unsupervised evaluation measures, such as instance label stability.

However, there is still room for improvement. While low stability makes us doubt the classifier, high stability doesn't inspire confidence - for example, a classifier that always outputs 0 is very stable using our measure. In future work, we want to find a stability measure that is informative about the instance performance. But to validate such a measure, we run into the same problem: getting enough data with ground truth instance labels.

References

Cheplygina, V., Sørensen, L., Tax, D. M. J., de Bruijne, M., & Loog, M. (2015). Label stability in multiple instance learning. International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 539–546).

Quellec, G., Cazuguel, G., Cochener, B., & Lamard, M. (2017). Multiple-instance learning for medical image and video analysis. IEEE Reviews in Biomedical Engineering, in press.

Vanwinckelen, G., Fierens, D., Blockeel, H., et al. (2016). Instance-level accuracy versus bag-level accuracy in multi-instance learning. Data Mining and Knowledge Discovery, 30, 313–341.
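For concreteness, the stability measure of Eq. 1 above can be computed directly from two binary instance-label vectors produced by two slightly perturbed versions of the same classifier; the toy labels below are our own illustration, not data from the experiments.

```python
import numpy as np

def instance_label_stability(z, z_prime):
    """S+ of Eq. 1: agreement on instances that either classifier labels positive."""
    z, z_prime = np.asarray(z, bool), np.asarray(z_prime, bool)
    n11 = np.sum(z & z_prime)
    n10 = np.sum(z & ~z_prime)
    n01 = np.sum(~z & z_prime)
    return n11 / (n01 + n10 + n11)

# Toy instance labels from two retrained versions of one classifier;
# in the experiments this value is averaged over all 45 pairs of the 10 versions.
print(instance_label_stability([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # -> 0.5
```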

A Gaussian process mixture prior for hearing loss modeling

Marco Cox1 [email protected]


Bert de Vries1,2 [email protected]
1 Dept. of Electrical Engineering, Eindhoven University of Technology, 2 GN Hearing, Eindhoven.

Keywords: hearing loss, probabilistic modeling, Bayesian machine learning.

1. Introduction

The most common way to quantify hearing loss is by means of the hearing threshold. This threshold corresponds to the lowest sound intensity that the person in question can still perceive, and it is a function of frequency. The typical process of measuring the hearing threshold is known as pure-tone audiometry (Yost, 1994), and it usually consists of incrementally estimating the threshold value at a set of standard frequencies ranging from 125 Hz to 8 kHz using a staircase "up 5 dB – down 10 dB" approach.

A recent line of work in the field of machine learning has focused on improving the efficiency of hearing loss estimation by taking a probabilistic modeling perspective (Gardner et al., 2015b; Song et al., 2015; Gardner et al., 2015a). This approach assumes that the hearing threshold of a person is drawn from some prior probability distribution. Under this assumption, the estimation problem reduces to a (Bayesian) inference task. Since the resulting posterior distribution describes both the estimated threshold and its uncertainty, it is possible to select the 'optimal' next test tone based on information-theoretic criteria. The so-called active learning loop (Cohn et al., 1996) of repeatedly selecting the best next experiment and updating the probabilistic estimate significantly reduces the total number of required test tones (Gardner et al., 2015b).

The success of the probabilistic approach hinges on the selection of a suitable hearing loss model. Presently, the Gaussian process (GP) model is the best-performing model of the hearing threshold as a function of frequency (Gardner et al., 2015b). A GP can be viewed as a probability distribution over the space of real-valued functions (Rasmussen & Williams, 2006).

In this abstract we introduce a prior distribution for hearing thresholds learned from a large database containing the hearing thresholds, ages and genders of around 85,000 people. Almost all existing work is based on very simple and/or uninformative GP priors; simply selecting a suitable type of kernel that assumes the threshold curve to be smooth is already sufficient to yield a well working system. However, by fitting a slightly more complex model to a vast database of measured thresholds, we obtain a prior that is more informative and empirically justified.

2. Probabilistic hearing loss model

The hearing threshold is a (continuous) function of frequency, denoted by t : R → R. The goal is to specify an appropriate prior distribution p(t|a, g) conditioned on age a ∈ N and gender g ∈ {female, male}. We choose p(t|a, g) to be a Gaussian process mixture model in which the mixing weights depend on age and gender:

p(t \mid a, g) = \sum_{k=1}^{K} \pi_k(a, g) \, \mathcal{GP}(t \mid \theta_k) .   (1)

All K GPs have independent mean functions and kernels, parametrized by hyperparameter vectors {θ_k}. In our experiments we use third-order polynomial mean functions and the squared exponential kernel, which enforces a certain degree of smoothness on the threshold function, depending on its length-scale parameter. We do not fix the mixing function π(·, ·) to a specific parametric form, but use a nearest neighbor regression model.

The main idea behind the choice for a mixture model is that it seems reasonable to assume that hearing thresholds can roughly be classified into several types. These types would correspond to different degrees of overall hearing loss severity, as well as hearing loss resulting from different causes, i.e. natural ageing versus extensive exposure to loud noises. The audiology literature indeed describes sets of "standard audiograms" to this end (Bisgaard et al., 2010).
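To make Eq. 1 concrete, the sketch below draws threshold curves from a GP mixture prior evaluated on a fixed frequency grid, where each GP reduces to a multivariate Gaussian. The component means, kernel parameters and age-dependent mixing weights are hypothetical placeholders, not the values fitted in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = np.array([0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0])  # kHz
x = np.log2(freqs)                                          # log-frequency axis

def se_kernel(x, variance=100.0, lengthscale=1.5):
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Two hypothetical mixture components: mild loss vs. sloping high-frequency loss
means = [10 + 2 * x, 20 + 8 * x + 1.5 * x ** 2]   # polynomial mean functions (dB-HL)
cov = se_kernel(x) + 1e-6 * np.eye(len(x))        # shared SE kernel for simplicity

def sample_thresholds(weights, n=5):
    """Draw n curves from p(t|a,g) = sum_k pi_k(a,g) GP(t|theta_k) on the grid."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[k], cov) for k in comps])

# Hypothetical age/gender-dependent mixing weights pi_k(a, g)
print(sample_thresholds(weights=[0.8, 0.2]))   # e.g. a younger listener
print(sample_thresholds(weights=[0.3, 0.7]))   # e.g. an older listener
```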


3. Model fitting and evaluation

We fit the model parameters – GP hyperparameters θ_1 through θ_K and mixing function π(a, g) – to a database containing roughly 85k anonymized records from the Nordic countries. Each record contains the age and gender of the person in question, together with the hearing thresholds of both ears measured at (a subset of) the standard audiometric frequencies. The total set of 170k threshold measurement vectors is randomly split into a training set (80%) and a test set (20%) for performance evaluation.

The inference algorithm consists of two parts. Since all threshold measurement vectors correspond to a fixed set of frequencies, the GP mixture reduces to a mixture of multivariate Gaussians. Therefore, in the first part we fit a Gaussian mixture model to the training set using the expectation maximization algorithm (Moon, 1996). In the second part, we find the optimal GP hyperparameter values by minimizing the Kullback-Leibler divergence between the GP mixture and the multivariate Gaussian mixture using gradient descent.

Figure 2 visualizes the fitted prior conditioned on different ages. The means of the mixture components indicate that different components indeed capture different types of threshold curves. Moreover, conditioning the prior on age has a clearly visible impact. This impact is quantified in Figure 1, which shows the average log-likelihood of hearing thresholds in the test set. It also shows that the GP mixture priors outperform the empirically optimized single GP prior in terms of predictive accuracy.

Figure 1. Predictive performance of the fitted priors on the test set. The one mixture component case corresponds to a standard GP prior with empirically optimized hyperparameters. Conditioning on age and/or gender consistently improves the predictive accuracy.

Figure 2. Visualization of the learned prior for K = 6 mixture components, conditioned on different ages (panels (a) p(t|a = 40) and (b) p(t|a = 80)). Blue dashed lines indicate the means of the mixture components. The gray lines are samples from the conditional priors. A value of 0 dB-HL corresponds to no hearing loss.

4. Conclusions

We obtained a prior for hearing loss by fitting a GP mixture model to a vast database. Evaluation on a test set shows that the mixture model outperforms the (empirically optimized) GP prior used in existing work (Gardner et al., 2015b), even without conditioning on age and gender. If age and gender are observed, the prior consistently becomes more informative. The benefit of adding more components to the mixture tapers off after about eight components.
References
Bisgaard, N., Vlaming, M. S. M. G., & Dahlquist, M.
(2010). Standard audiograms for the IEC 60118-15
measurement procedure. Trends in amplification,
14, 113–120.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996).
Active learning with statistical models. Journal of
artificial intelligence research, 4, 129–145.
Gardner, J., Malkomes, G., Garnett, R., Weinberger,
K. Q., Barbour, D., & Cunningham, J. P. (2015a).
Bayesian active model selection with an application
to automated audiometry. Advances in Neural In-
formation Processing Systems (pp. 2386–2394).
Gardner, J. R., Song, X., Weinberger, K. Q., Barbour,
D. L., & Cunningham, J. P. (2015b). Psychophysical
Detection Testing with Bayesian Active Learning.
UAI (pp. 286–295).

Moon, T. K. (1996). The expectation-maximization


algorithm. IEEE Signal processing magazine, 13,
47–60.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaus-
sian Processes for Machine Learning. MIT Press.

Song, X. D., Wallace, B. M., Gardner, J. R., Ledbet-


ter, N. M., Weinberger, K. Q., & Barbour, D. L.
(2015). Fast, continuous audiogram estimation us-
ing machine learning. Ear and hearing, 36, e326.

Yost, W. A. (1994). Fundamentals of hearing: An


introduction (3rd ed.), vol. xiii. San Diego, CA, US:
Academic Press.

Predicting chaotic time series using a photonic reservoir computer
with output feedback

Piotr Antonik [email protected]


Laboratoire d’Information Quantique, Université libre de Bruxelles, Av. F. D. Roosevelt 50, CP 224, Brussels,
Belgium
Marc Haelterman [email protected]
Service OPERA-Photonique, Université libre de Bruxelles, Avenue F. D. Roosevelt 50, CP 194/5, Brussels,
Belgium
Serge Massar [email protected]
Laboratoire d’Information Quantique, Université libre de Bruxelles, Av. F. D. Roosevelt 50, CP 224, Brussels,
Belgium

Keywords: Reservoir computing, opto-electronic systems, FPGA, chaotic time series prediction

1. Introduction

Reservoir Computing is a bio-inspired computing paradigm for processing time dependent signals (Jaeger & Haas, 2004; Maass et al., 2002). The performance of its hardware implementations matches digital algorithms on a series of benchmark tasks (see e.g. (Soriano et al., 2015) for a review). Their capacities could be extended by feeding the output signal back into the reservoir, which would allow them to be applied to various signal generation tasks (Antonik et al., 2016b). In practice, this requires a high-speed readout layer for real-time output computation. Here we achieve this by means of a field-programmable gate array (FPGA), and demonstrate the first photonic reservoir computer with output feedback. We test our setup on the Mackey-Glass chaotic time series generation task and obtain interesting prediction horizons, comparable to numerical simulations, with ample room for further improvement. Our work thus demonstrates the potential offered by the output feedback and opens a new area of novel applications for photonic reservoir computing. A more detailed description of this work can be found in (Antonik et al., 2017a; Antonik et al., 2017b).

2. Theory and methods

Reservoir computing. A general reservoir computer is described in (Lukoševičius & Jaeger, 2009). In our implementation we use a sine transfer function and a ring topology to simplify the interconnection matrix, so that only the first neighbour nodes are connected (Paquot et al., 2012). The system is trained offline, using the ridge regression algorithm.

Mackey-Glass chaotic series generation task. The Mackey-Glass delay differential equation is given by (Mackey & Glass, 1977)

\frac{dx}{dt} = \beta \, \frac{x(t - \tau)}{1 + x^{n}(t - \tau)} - \gamma x   (1)

with τ, γ, β, n > 0. To obtain chaotic dynamics, we set the parameters as in (Jaeger & Haas, 2004): β = 0.2, γ = 0.1, τ = 17 and n = 10. The equation was solved using the RK4 method with a stepsize of 1.0.

During the training phase, the reservoir computer receives the Mackey-Glass time series as input and is trained to predict the next value of the series from the current one. Then, the reservoir input is switched from the teacher sequence to the reservoir output signal, and the system is left running autonomously. To evaluate the system performance, we compute the number of correctly predicted steps.
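A minimal sketch of generating the Mackey-Glass series of Eq. 1 with a fixed-step RK4 integrator and the parameter values quoted above (β = 0.2, γ = 0.1, τ = 17, n = 10, stepsize 1.0). The constant initial history and the interpolation of the delayed term at half steps are our own assumptions.

```python
import numpy as np

def mackey_glass(n_samples=2000, beta=0.2, gamma=0.1, tau=17, n=10, h=1.0, x0=1.2):
    """Generate the Mackey-Glass series of Eq. 1 with a fixed-step RK4 integrator."""
    def f(x, x_tau):
        return beta * x_tau / (1.0 + x_tau ** n) - gamma * x

    hist_len = int(tau / h)
    x = np.zeros(n_samples + hist_len)
    x[:hist_len] = x0                                   # constant history (assumption)
    for i in range(hist_len, n_samples + hist_len - 1):
        x_tau = x[i - hist_len]                         # x(t - tau)
        x_tau_next = x[i - hist_len + 1]                # x(t + h - tau)
        x_tau_half = 0.5 * (x_tau + x_tau_next)         # linear interpolation at t + h/2 - tau
        k1 = f(x[i], x_tau)
        k2 = f(x[i] + 0.5 * h * k1, x_tau_half)
        k3 = f(x[i] + 0.5 * h * k2, x_tau_half)
        k4 = f(x[i] + h * k3, x_tau_next)
        x[i + 1] = x[i] + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x[hist_len:]

series = mackey_glass()
print(series[:5], series.min(), series.max())
```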


3. Experimental setup

Our experimental setup, schematised in figure 1, consists of two main components: the opto-electronic reservoir and the FPGA board. The former is based on previously published works (Paquot et al., 2012). The reservoir size N depends on the delay created by the fibre spool (Spool). We performed experiments with two spools of approximately 1.6 km and 10 km and, correspondingly, reservoirs of 100 and 600 neurons.

Figure 1. Schematic representation of the experimental setup. Optical and electronic components of the photonic reservoir are shown in grey and black, respectively. It contains an incoherent light source (SLD: Superluminescent Diode), a Mach-Zehnder intensity modulator (MZ), a 90/10 beam splitter, an optical attenuator (Att), a fibre spool (Spool), two photodiodes (Pr and Pf), a resistive combiner (Comb) and an amplifier (Amp). The FPGA board implements the readout layer and computes the output signal y(n) in real time. It also generates the analogue input signal u(n) and acquires the reservoir states x_i(n) (Antonik et al., 2016a). The computer, running Matlab, controls the devices, performs the offline training and uploads all the data – inputs u(n), readout weights w_i and input mask M_i – on the FPGA.

4. Results

Numerical simulations. While this work focuses on experimental results, we also developed three numerical models of the setup in order to have several points of comparison: (a) the idealised model incorporates the core characteristics of our reservoir computer, disregarding experimental considerations, and is used to define maximal achievable performance, (b) the noiseless experimental model emulates the most influential features of the experimental setup, but neglects the noise, which is taken into account by (c) the noisy experimental model.

Experimental results. The system was trained over 1000 input samples and was running autonomously for 600 timesteps. We discovered that the noise inside the opto-electronic reservoir makes the outcome of an experiment inconsistent. That is, several repetitions of the experiment with the same parameters may result in significantly different prediction lengths. While the system produced several very good predictions, most of the outcomes were rather poor. We obtained similar behaviour with the noisy experimental model, using the same level of noise as measured experimentally.

Numerical simulations have shown that reducing the noise does not always increase the maximum performance, but only makes the outcome more consistent. For this reason, we measured the performances of our experimental setup by repeating the autonomous run 50 times for each training, and reporting results for the best prediction length.

Table 1. Summary of experimental and numerical results.

                          Prediction length
                          N = 100       N = 600
  experimental            125 ± 14      344 ± 64
  numerical (noisy)       120 ± 32      361 ± 87
  numerical (noiseless)   121 ± 38      637 ± 252
  idealised model         217 ± 156     683 ± 264

Table 1 sums up the results obtained experimentally with both reservoir sizes, as well as numerical results obtained with all three models. The prediction lengths were averaged over 10 sequences of the MG series (generated from different starting points), and the uncertainty corresponds to deviations which occurred from one sequence to another. For a small reservoir N = 100, experimental results agree with both experimental models, but all three are much lower than the idealised model. We found that this is due to the 23 zeroed input mask elements, as well as the limited resolution of the analog-to-digital converter (see complementary material for details). Prediction lengths obtained with the large reservoir N = 600 match the noisy experimental model, but here the noise has a significant impact on the maximal performance achievable.

5. Perspectives

Our numerical simulations have shown that reducing the noise inside the opto-electronic reservoir would significantly improve its performance. This can be done by upgrading the components to low-noise, low-voltage models, thus reducing the effects of electrical noise. Despite these issues, our work experimentally demonstrates that photonic reservoir computers are capable of emulating chaotic attractors, which offers new potential applications to this computational paradigm.

Acknowledgments

We acknowledge financial support by the Interuniversity Attraction Poles program of the Belgian Science Policy Office under grant IAP P7-35 photonics@be, by the Fonds de la Recherche Scientifique F.R.S.-FNRS and by the Action de Recherche Concertée of the Académie Wallonie-Bruxelles under grant AUWB-2012-12/17-ULB9.


References
Antonik, P., Duport, F., Hermans, M., Smerieri, A.,
Haelterman, M., & Massar, S. (2016a). Online train-
ing of an opto-electronic reservoir computer applied
to real-time channel equalization. IEEE Transac-
tions on Neural Networks and Learning Systems,
PP, 1–13.

Antonik, P., Hermans, M., Duport, F., Haelterman,


M., & Massar, S. (2016b). Towards pattern gen-
eration and chaotic series prediction with photonic
reservoir computers. SPIE’s 2016 Laser Technology
and Industrial Laser Conference (p. 97320B).

Antonik, P., Hermans, M., Haelterman, M., & Massar,


S. (2017a). Chaotic time series prediction using a
photonic reservoir computer with output feedback.
AAAI Conference on Artificial Intelligence.
Antonik, P., Hermans, M., Haelterman, M., & Massar,
S. (2017b). Photonic reservoir computer with out-
put feedback for chaotic time series prediction. 2017
International Joint Conference on Neural Networks.
to appear.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinear-
ity: Predicting chaotic systems and saving energy in
wireless communication. Science, 304, 78–80.
Lukoševičius, M., & Jaeger, H. (2009). Reservoir
computing approaches to recurrent neural network
training. Comp. Sci. Rev., 3, 127–149.

Maass, W., Natschläger, T., & Markram, H. (2002).


Real-time computing without stable states: A new
framework for neural computation based on pertur-
bations. Neural comput., 14, 2531–2560.
Mackey, M. C., & Glass, L. (1977). Oscillation and
chaos in physiological control systems. Science, 197,
287–289.
Paquot, Y., Duport, F., Smerieri, A., Dambre, J.,
Schrauwen, B., Haelterman, M., & Massar, S.
(2012). Optoelectronic reservoir computing. Sci.
Rep., 2, 287.
Soriano, M. C., Brunner, D., Escalona-Morán, M., Mi-
rasso, C. R., & Fischer, I. (2015). Minimal approach
to neuro-inspired information processing. Frontiers
in computational neuroscience, 9.

Towards high-performance analogue readout layers for photonic
reservoir computers

Piotr Antonik [email protected]


Laboratoire d’Information Quantique, Université libre de Bruxelles, Av. F. D. Roosevelt 50, CP 224, Brussels,
Belgium
Marc Haelterman [email protected]
Service OPERA-Photonique, Université libre de Bruxelles, Avenue F. D. Roosevelt 50, CP 194/5, Brussels,
Belgium
Serge Massar [email protected]
Laboratoire d’Information Quantique, Université libre de Bruxelles, Av. F. D. Roosevelt 50, CP 224, Brussels,
Belgium

Keywords: Reservoir computing, opto-electronics, analogue readout, FPGA, online training

1. Introduction

Reservoir Computing is a bio-inspired computing paradigm for processing time-dependent signals (Jaeger & Haas, 2004; Maass et al., 2002). The performance of its hardware implementations (see e.g. (Soriano et al., 2015) for a review) is comparable to state-of-the-art digital algorithms on a series of benchmark tasks. The major bottleneck of these implementations is the readout layer, based on slow offline post-processing. Several analogue solutions have been proposed (Smerieri et al., 2012; Duport et al., 2016; Vinckier et al., 2016), but all suffered from a noticeable decrease in performance due to the added complexity of the setup. Here we propose the online learning approach to solve these issues. We present an experimental reservoir computer with a simple analogue readout layer, based on previous works, and show numerically that online learning allows one to disregard the added complexity of an analogue layer and obtain the same level of performance as with a digital layer. This work thus demonstrates that online training allows building high-performance fully-analogue reservoir computers, and represents an important step towards experimental validation of the proposed solution. A more detailed description of this work can be found in (Antonik et al., 2017a; Antonik et al., 2017b).

2. Theory and methods

Reservoir computing. A general reservoir computer is described in (Lukoševičius & Jaeger, 2009). In our implementation we use a sine transfer function and a ring topology to simplify the interconnection matrix, so that only the first neighbour nodes are connected (Paquot et al., 2012). The system is trained online, using the simple gradient descent algorithm, as in (Antonik et al., 2016a).

Benchmark tasks. We tested the performance of our system on two benchmark tasks, commonly used by the RC community: wireless channel equalisation and NARMA10. The first, introduced in (Jaeger & Haas, 2004), aims at recovering the transmitted message from the output of a noisy nonlinear wireless communication channel. The performance of the equaliser is measured in terms of Symbol Error Rate (SER), that is, the number of misclassified symbols. The NARMA10 task (Atiya & Parlos, 2000) consists in emulating a nonlinear system of order 10. The performance is measured in terms of Normalised Mean Square Error (NMSE).

3. Experimental setup

Our experimental setup, which we simulate numerically, is schematised in figure 1. It consists of the opto-electronic reservoir (a replica of (Paquot et al., 2012)), the analogue readout layer, based on previous works (Smerieri et al., 2012; Duport et al., 2016), and the FPGA board, performing the online training (Antonik et al., 2016a). The readout layer uses a dual-output Mach-Zehnder modulator in order to apply both positive and negative readout weights, and the integration (summation) of the weighted states is carried out by a low-pass RC filter.
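The online training mentioned in Section 2 can be illustrated with a simple stochastic gradient (LMS-style) update of the readout weights, one step per time step. The snippet below uses random toy reservoir states and targets; it is a generic sketch, not the Matlab model or the FPGA implementation used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def online_train_readout(states, targets, lr=0.01):
    """Gradient-descent update w <- w + lr * (d(n) - y(n)) * x(n), one step per sample."""
    n_steps, n_nodes = states.shape
    w = np.zeros(n_nodes)
    errors = np.empty(n_steps)
    for t in range(n_steps):
        y = w @ states[t]            # readout: weighted sum of reservoir states
        e = targets[t] - y
        w += lr * e * states[t]      # online update of the readout weights
        errors[t] = e
    return w, errors

# Toy reservoir states and a target that is a fixed linear combination of them
states = rng.standard_normal((5000, 50))
true_w = rng.standard_normal(50)
targets = states @ true_w
w, errors = online_train_readout(states, targets)
print("NMSE over last 1000 steps:", np.mean(errors[-1000:] ** 2) / np.var(targets))
```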


Figure 1. Scheme of the proposed experimental setup. The optical and electronic components are shown in black and grey, respectively. The reservoir layer consists of an incoherent light source (SLD), a Mach-Zehnder intensity modulator (MZ), a 50/50 beam splitter, an optical attenuator (Att), an approximately 1.6 km fibre spool, a feedback photodiode (Pf), a resistive combiner (Comb) and an amplifier (Amp). The analogue readout layer contains another 50/50 beam splitter, a readout photodiode (Pr), a dual-output intensity modulator (MZ), a balanced photodiode (Pb) and a capacitor (C). The FPGA board generates the inputs and the readout weights, samples the reservoir states and the output signal, and trains the system.

4. Results

All numerical experiments were performed in Matlab, using a custom model of a reservoir computer, based on previous investigations (Paquot et al., 2012; Antonik et al., 2016a).

The performance of our system on the channel equalisation task, with SERs between 10^{-4} and 10^{-3} depending on the input mask, is comparable to the same opto-electronic setup with a digital output layer (SER = 10^{-4} reported in (Paquot et al., 2012)), as well as the fully-analogue setup (Duport et al., 2016), also reporting a SER of 10^{-4}. However, it outperforms the first (and, conceptually, simpler) readout layer by an order of magnitude (Smerieri et al., 2012). As for the NARMA10 task, we obtain a NMSE of 0.18. This is slightly worse than what was reported with a digital readout layer (0.168 ± 0.015 in (Paquot et al., 2012)), but better than the fully analogue setup (0.230 ± 0.023 in (Duport et al., 2016)).

Another goal of the simulations was to check how the online learning approach would cope with experimental difficulties encountered in previous works (Smerieri et al., 2012; Duport et al., 2016). To that end, we considered several potential experimental imperfections and measured their impact on the performance.

• The time constant τ = RC of the RC filter determines its integration period. We've shown that both tasks work well in a wide range of values of τ, and knowledge of its precise value is not necessary for good performance (contrary to (Duport et al., 2016)).

• The sine transfer function of the readout Mach-Zehnder modulator can, in practice, be biased due to temperature or electronic drifts of the device. This could have a detrimental impact on the readout weights. We've shown that precompensation of the transfer function is not necessary, and that realistic drifts of the bias wouldn't decrease the performance of the system.

• The numerical precision of the readout weights, limited to 16 bits by the DAC, could be insufficient for correct output generation. We've shown that a resolution as low as 8 bits is enough for this application.

5. Perspectives

The present work shows that online learning allows one to efficiently train an analogue readout layer despite its inherent complexity and practical imperfections. The upcoming experimental validation of this idea would lead to a fully-analogue, high-performance reservoir computer. On top of a considerable speed increase, due to the removal of the slow digital post-processing, such a device could be applied to periodic or chaotic signal generation by feeding the output signal back into the reservoir (Antonik et al., 2016b). This work is therefore an important step towards a new area of research in the reservoir computing field.

6. Acknowledgments

We acknowledge financial support by the Interuniversity Attraction Poles program of the Belgian Science Policy Office under grant IAP P7-35 photonics@be, by the Fonds de la Recherche Scientifique FRS-FNRS and by the Action de Recherche Concertée of the Académie Wallonie-Bruxelles under grant AUWB-2012-12/17-ULB9.


References

Antonik, P., Duport, F., Hermans, M., Smerieri, A., Haelterman, M., & Massar, S. (2016a). Online training of an opto-electronic reservoir computer applied to real-time channel equalization. IEEE Transactions on Neural Networks and Learning Systems, PP, 1–13.

Antonik, P., Haelterman, M., & Massar, S. (2017a). Improving performance of analogue readout layers for photonic reservoir computers with online learning. AAAI Conference on Artificial Intelligence.

Antonik, P., Haelterman, M., & Massar, S. (2017b). Online training for high-performance analogue readout layers in photonic reservoir computers. Cognitive Computation, 1–10.

Antonik, P., Hermans, M., Duport, F., Haelterman, M., & Massar, S. (2016b). Towards pattern generation and chaotic series prediction with photonic reservoir computers. SPIE's 2016 Laser Technology and Industrial Laser Conference (p. 97320B).

Atiya, A., & Parlos, A. (2000). New results on recurrent network training: Unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks, 11, 697–709.

Duport, F., Smerieri, A., Akrout, A., Haelterman, M., & Massar, S. (2016). Fully analogue photonic reservoir computer. Sci. Rep., 6, 22381.

Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304, 78–80.

Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Comp. Sci. Rev., 3, 127–149.

Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural comput., 14, 2531–2560.

Paquot, Y., Duport, F., Smerieri, A., Dambre, J., Schrauwen, B., Haelterman, M., & Massar, S. (2012). Optoelectronic reservoir computing. Sci. Rep., 2, 287.

Smerieri, A., Duport, F., Paquot, Y., Schrauwen, B., Haelterman, M., & Massar, S. (2012). Analog readout for optical reservoir computers (pp. 944–952).

Soriano, M. C., Brunner, D., Escalona-Morán, M., Mirasso, C. R., & Fischer, I. (2015). Minimal approach to neuro-inspired information processing. Frontiers in computational neuroscience, 9.

Vinckier, Q., Bouwens, A., Haelterman, M., & Massar, S. (2016). Autonomous all-photonic processor based on reservoir computing paradigm (p. SF1F.1). Optical Society of America.

Local Process Models: Pattern Mining with Process Models

Niek Tax, Natalia Sidorova, Wil M.P. van der Aalst {n.tax,n.sidorova,w.m.p.v.d.aalst}@tue.nl
Eindhoven University of Technology, The Netherlands

Keywords: pattern mining, process mining, business process modeling, data mining

1. Introduction

Process mining aims to extract novel insights from event data (van der Aalst, 2016). Process discovery plays a prominent role in process mining. The goal is to discover a process model that is representative for the set of event sequences in terms of start-to-end behavior, i.e. from the start of a case till its termination. Many process discovery algorithms have been proposed and applied to a variety of real life cases. A more conventional perspective on discovering insights from event sequences can be found in the areas of sequential pattern mining (Agrawal & Srikant, 1995) and episode mining (Mannila et al., 1997), which focus on finding frequent patterns, not aiming for descriptions of the full event sequences from start to end.

Sequential pattern mining is limited to the discovery of sequential orderings of events, while process discovery methods aim to discover a larger set of event relations, including sequential orderings, (exclusive) choice relations, concurrency, and loops, represented in process models such as Petri nets (Reisig, 2012), BPMN (Object Management Group, 2011), or UML activity diagrams. Process models distinguish themselves from more traditional sequence mining approaches like Hidden Markov Models (Rabiner, 1989) and Recurrent Neural Networks with their visual representation, which allows them to be used for communication between process stakeholders. However, process discovery is normally limited to the discovery of a complete model that captures the full behavior of process instances, and not local patterns within instances. Local Process Models (LPMs) allow the mining of patterns positioned in-between simple patterns (e.g. subsequences) and end-to-end models, focusing on a subset of the process activities and describing frequent patterns of behavior.

2. Motivating Example

Imagine a sales department where multiple sales officers perform four types of activities: (A) register a call for bids, (B) investigate a call for bids from the business perspective, (C) investigate a call for bids from the legal perspective, and (D) decide on participation in the call for bid. The event sequences (Figure 1(a)) contain the activities performed by one sales officer throughout the day. The sales officer works on different calls for bids and does not necessarily perform all activities for a particular call himself. Applying discovery algorithms, like the Inductive Miner (Leemans et al., 2013), yields models allowing for any sequence of events (Figure 1(c)). Such "flower-like" models do not give any insight into typical behavioral patterns. When we apply any sequential pattern mining algorithm using a threshold of six occurrences, we obtain the seven length-three sequential patterns depicted in Figure 1(d) (results obtained using the SPMF (Fournier-Viger et al., 2014) implementation of the PrefixSpan algorithm (Pei et al., 2001)). However, the data contains a frequent non-sequential pattern where a sales officer first performs A, followed by B and C in arbitrary order (Figure 1(b)). This pattern cannot be found with existing process discovery or sequential pattern mining techniques. The two numbers shown in the transitions (i.e., rectangles) represent (1) the number of events of this type in the event log that fit this local process model and (2) the total number of events of this type in the event log. For example, 13 out of 19 events of type C in the event log fit transition C, which are indicated in bold in the log in Figure 1(a). Underlined sequences indicate non-continuous instances, i.e. instances with non-fitting events in-between the events forming the instance of the local process model.

3. LPM Discovery Approach

A technique for the discovery of Local Process Models (LPMs) is described in detail in (Tax et al., 2016a). LPM discovery uses the process tree (Buijs et al., 2012) process model notation, an example of which is SEQ(A, B), which is a sequential pattern that describes that activity B occurs after activity A. Process tree models are iteratively expanded into larger patterns using a fixed set of expansion rules, e.g., SEQ(A, B) can be grown into SEQ(A, AND(B, C)), which indicates that A is followed by both B and C in arbitrary order.


[Figure 1 content: (a) event log L = {<A,A,C,B,A,A,C,B,B,C>, <C,A,C,B,A,A,A,B,C,B>, <A,A,B,D,C,D,A,B,C,B>, <C,A,C,B,B,B,A,D,B,C>, <B,A,B,C,C>, <D,A,C,B,C,A,A,C,A,B>, <D,A,B,C,D,C,A,C,A,B,C>}; (b) local process model with transition counts A 13/21, B 13/20, C 13/19; (d) sequential patterns <A,B,A>, <A,B,C>, <A,C,A>, <A,C,B>, <B,A,B>, <B,A,C>, <C,A,C>.]

Figure 1. (a) A log L of event sequences executed by a sales officer with highlighted instances of the frequent pattern.
(b) The local process model showing frequent behavior in L. (c) The Petri net discovered on L with the Inductive Miner
algorithm (Leemans et al., 2013). (d) The sequential patterns discovered on L with PrefixSpan (Pei et al., 2001).
expanded into larger patterns using a fixed set of expansion rules, e.g., SEQ(A, B) can be grown into SEQ(A, AND(B, C)), which indicates that A is followed by both B and C in arbitrary order. Process trees can be converted into other process model notations, e.g., SEQ(A, AND(B, C)) can be converted into the Petri net of Figure 1(b). LPMs are discovered using the following steps (a schematic sketch of this loop is given below):

1) Generation: Generate the initial set CM_1 of candidate LPMs in the form of process trees.
2) Evaluation: Evaluate the LPMs in the current candidate set CM_i based on support and confidence.
3) Selection: A subset SCM_i ⊆ CM_i of candidate LPMs is selected. SM = SM ∪ SCM_i. If SCM_i = ∅ or i ≥ max_iterations: stop.
4) Expansion: Expand SCM_i into a set of larger, expanded, candidate process models, CM_{i+1}. Goto step 2 using the newly created candidate set CM_{i+1}.
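To make the control flow of this procedure concrete, the following minimal Python sketch mirrors the generate-evaluate-select-expand loop. It is an illustration only, not the authors' implementation: sequential patterns (tuples of activities) stand in for process trees, and a simple subsequence support stands in for the support/confidence evaluation described above.

    from itertools import product

    def occurs_as_subsequence(pattern, trace):
        """True if all activities of `pattern` appear in `trace` in order."""
        pos = 0
        for event in trace:
            if event == pattern[pos]:
                pos += 1
                if pos == len(pattern):
                    return True
        return False

    def support(pattern, log):
        """Fraction of traces in which the pattern occurs."""
        return sum(occurs_as_subsequence(pattern, t) for t in log) / len(log)

    def mine_patterns(log, activities, max_iterations=4, min_support=0.5, top_k=10):
        """Generation -> evaluation -> selection -> expansion, as in LPM discovery."""
        selected = []                                   # SM: all selected patterns
        candidates = [(a,) for a in activities]         # CM_1: single activities
        for _ in range(max_iterations):
            scored = sorted(((support(c, log), c) for c in candidates), reverse=True)
            chosen = [c for s, c in scored[:top_k] if s >= min_support]   # SCM_i
            if not chosen:                              # SCM_i is empty: stop
                break
            selected.extend(chosen)                     # SM = SM ∪ SCM_i
            # Expansion: grow each selected candidate with every activity (CM_{i+1}).
            candidates = [c + (a,) for c, a in product(chosen, activities)]
        return selected

    log = [("A", "A", "C", "B"), ("A", "B", "C"), ("C", "A", "B")]
    print(mine_patterns(log, activities=("A", "B", "C")))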
[Figure 2 plot: rows Markov, Entropy, MRIG; columns Recall, NDCG@5, NDCG@10, NDCG@20; legend distinguishes discovered from random projections; the horizontal axis lists the evaluated event logs.]

Figure 2. Performance of the three projection set discovery methods on the six data sets on the four metrics.

4. Faster LPM Discovery by Clustering Activities

The discovery of Local Process Models (LPMs) is computationally expensive for event logs with many unique activities (i.e. event types), as the number of ways to expand each candidate LPM is equal to the number of possible process model structures with which it can be expanded times the number of activities in the log. (Tax et al., 2016b) explores techniques to cluster the set of activities, such that LPM discovery can be applied per activity cluster instead of on the complete set of events, leading to considerable speedups. All clustering techniques operate on a directly-follows graph, which shows how frequently the activity types of the event log directly follow each other. Three clustering techniques have been compared: entropy-based clustering clusters the activities of the directly-follows graph using an information theoretic approach. Maximal relative information gain (MRIG) clustering is a variant on entropy-based clustering. The third clustering technique uses Markov clustering (van Dongen, 2008), an out-of-the-box graph clustering technique, to cluster the activities in the directly-follows graph.

We compare the quality of the obtained ranking of LPMs after clustering the activities with the ranking of LPMs obtained on the original data set. To compare the rankings we use NDCG, an evaluation measure for rankings frequently used in the information retrieval field. Figure 2 shows the results of the three clustering approaches on five data sets. All three produce better-than-random projections on a variety of data sets. Projection discovery based on Markov clustering leads to the highest speedup, while higher quality LPMs can be discovered using a projection discovery based on log statistics entropy. The Maximal Relative Information Gain based approach to projection discovery shows unstable performance, with the highest gain in LPM quality over random projections on some event logs, while not being able to discover any projection smaller than the complete set of activities on some other event logs.
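As an illustration of the data structure that all three clustering techniques operate on, the following minimal sketch (not the authors' implementation) counts how often each activity type is directly followed by another in an event log:

    from collections import Counter

    def directly_follows_graph(log):
        """Return a Counter mapping (a, b) to the number of times
        activity a is directly followed by activity b in the log."""
        dfg = Counter()
        for trace in log:
            for a, b in zip(trace, trace[1:]):
                dfg[(a, b)] += 1
        return dfg

    log = [("A", "A", "C", "B"), ("C", "A", "C", "B"), ("A", "B", "C")]
    for (a, b), count in sorted(directly_follows_graph(log).items()):
        print(f"{a} -> {b}: {count}")

Any graph clustering method, such as Markov clustering, can then be run on this weighted graph to partition the activities before LPM discovery is applied per cluster.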


References

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. Proceedings of the 11th International Conference on Data Engineering (ICDE) (pp. 3–14). IEEE.

van der Aalst, W. M. P. (2016). Process mining: Data science in action. Springer.

van Dongen, S. (2008). Graph clustering via a discrete uncoupling process. SIAM Journal on Matrix Analysis and Applications, 30, 121–141.
Buijs, J. C. A. M., van Dongen, B. F., & van der Aalst,
W. M. P. (2012). A genetic algorithm for discov-
ering process trees. Proceedings of the 2012 IEEE
Congress on Evolutionary Computation (CEC) (pp.
1–8).

Fournier-Viger, P., Gomariz, A., Gueniche, T.,


Soltani, A., Wu, C.-W., & Tseng, V. S. (2014).
SPMF: a java open-source pattern mining library.
The Journal of Machine Learning Research, 15,
3389–3393.

Leemans, S. J. J., Fahland, D., & van der Aalst, W.


M. P. (2013). Discovering block-structured process
models from event logs - a constructive approach.
In Application and theory of petri nets and concur-
rency, 311–329. Springer.

Mannila, H., Toivonen, H., & Verkamo, A. I. (1997).


Discovery of frequent episodes in event sequences.
Data Mining and Knowledge Discovery, 1, 259–289.

Object Management Group (2011). Notation (BPMN)


version 2.0. OMG Specification.

Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H.,


Chen, Q., Dayal, U., & Hsu, M.-C. (2001). Pre-
fixSpan: mining sequential patterns efficiently by
prefix-projected pattern growth. Proceedings of the
17th International Conference on Data Engineering
(ICDE) (pp. 215–224).

Rabiner, L. R. (1989). A tutorial on hidden markov


models and selected applications in speech recogni-
tion. Proceedings of the IEEE, 77, 257–286.

Reisig, W. (2012). Petri nets: an introduction, vol. 4.


Springer Science & Business Media.

Tax, N., Sidorova, N., Haakma, R., & van der Aalst,
W. M. P. (2016a). Mining local process models.
Journal of Innovation in Digital Ecosystems, 3, 183–
196.

Tax, N., Sidorova, N., van der Aalst, W. M. P., &


Haakma, R. (2016b). Heuristic approaches for gen-
erating local process models through log projec-
tions. Proceedings of the IEEE Symposium on Com-
putational Intelligence and Data Mining (pp. 1–8).
IEEE.

A non-linear Granger causality approach for understanding
climate-vegetation dynamics

Christina Papagiannopoulou [email protected]


Stijn Decubber [email protected]
Willem Waegeman [email protected]
Depart. of Mathematical modelling, Statistics and Bioinformatics, Ghent University, Belgium
Matthias Demuzere [email protected]
Niko E. C. Verhoest [email protected]
Diego G. Miralles [email protected]
Laboratory of Hydrology and Water Management, Ghent University, Belgium

Keywords: time series forecasting, random forests, non-linear Granger causality, climate change

Abstract

Satellite Earth observation provides new means to unravel the drivers of long-term changes in climate. Global historical records of crucial environmental and climatic variables, which have the form of multivariate time series, now span up to 30 years. In this abstract we present a non-linear Granger causality approach to detect causal relationships between climatic time series and vegetation. Our framework consists of several components, including data fusion from various databases, time series decomposition techniques, feature construction methods and Granger causality analysis by means of machine learning algorithms. Experimental results on large-scale entire-globe datasets indicate that, with this framework, it is possible to detect non-linear patterns that express the complex relationships between climate and vegetation.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

Satellites form the only practical means for a global and continuous observation of our planet. Independent sensors on different platforms monitor the dynamics of vegetation, soils, oceans and atmosphere, collecting optical, thermal, or gravimetry information (Su et al., 2011). Their records take the form of multivariate time series with different spatial and temporal resolutions and measure environmental and climatic variables.

Vegetation plays a crucial role in the global climate system. It affects the water, the energy and the carbon cycles through the transfer of water vapor from land to atmosphere, direct effects on the surface net radiation or through the exchange of carbon dioxide with the atmosphere (McPherson et al., 2007; Bonan, 2008). Given the impact of climate on vegetation dynamics, a better understanding of the response of vegetation to projected disturbances in climatic conditions is crucial to further improve our knowledge about the potential consequences of climate change. A first and necessary step in this direction, however, is to investigate the response of vegetation to past-time climate variability.

Simple correlation statistics and linear regression methods in general are insufficient when it comes to assessing causality, especially in high-dimensional datasets and given the non-linear nature of climate–vegetation dynamics. A commonly used approach consists of Granger causality modelling (Granger, 1969), which is typically expressed through linear vector autoregressive (VAR) models. In Granger causality, it is assumed that a time series x (in our case, a climatic time series) Granger-causes another time series y (i.e., vegetation) if the past of x is helpful in predicting the future of y, given the past of y. In practice, the forecasting accuracy for future values of y of two competing models is compared: a full model which includes both past-time y and x as predictors and a purely autoregressive baseline model which just has access to past-time y. If the full model produces significantly



Figure 1. Linear versus non-linear Granger causality of climate on vegetation. (a) Explained variance (R2 ) of vegetation
anomalies based on a full ridge regression model in which all climatic variables are included as predictors. (b) Improvement
in terms of R2 by the full ridge regression model with respect to the baseline ridge regression model that uses only past
values of vegetation anomalies as predictors; positive values indicate (linear) Granger causality. (c) Explained variance
(R2 ) of vegetation anomalies based on a full random forest. (d) Improvement in terms of R2 by the full random forest
model with respect to the baseline random forest model; positive values indicate (non-linear) Granger causality.

better forecasts, the null hypothesis of Granger non-causality can be rejected (Granger, 1969).

This abstract, based on (Papagiannopoulou et al., 2016), presents an extension of linear Granger causality analysis: a novel non-linear framework for finding climatic drivers that affect vegetation. Our framework consists of several steps. In a first step, data from different sources are collected and merged into a single, comprehensive dataset. Next, time series decomposition techniques are applied to the target vegetation time series and the various predictor climatic time series to isolate seasonal cycles, trends and anomalies. In a third step, we explore various techniques for constructing high-level features from climatic time series using techniques that are similar to shapelets (Ye & Keogh, 2009). In a final step, we run a Granger causality analysis on the vegetation anomalies, while replacing traditional linear vector autoregressive models with random forests.

Applying the above framework, we end up with 4,571 features generated on thirty-year time series, allowing us to analyze 13,097 land pixels independently. Predictive performance is assessed by means of five-fold cross-validation using the out-of-sample coefficient of determination (R²) as a performance measure.

Figure 1a shows the predictive performance of a ridge regression model which includes the 4,571 climate predictors on top of the history of vegetation (i.e., a full model). While the model explains more than 40% of the variability in vegetation in some regions (R² > 0.4), this is by itself not necessarily indicative of climate Granger-causing the vegetation anomalies. In order to test the latter, we compare the results of the full model (Fig. 1a) to a baseline model, i.e., an autoregressive ridge regression model that only uses previous values of vegetation to predict the vegetation at time t. Any increase in predictive performance provided by the full ridge regression model (Fig. 1a) over the corresponding baseline provides qualitative evidence of Granger causality (Fig. 1b). The results show that, when only linear relationships between vegetation and climate are considered, the areas in which Granger causality of climate towards vegetation is suggested are limited. The predictive power for vegetation anomalies increases dramatically when using random forests (Fig. 1c). In order to test whether the climatic and environmental controls Granger-cause the vegetation anomalies, we again compare the results of a full random forest model to a baseline random forest model. As seen in Fig. 1d, the improvement over the baseline is unambiguous. One can conclude that, while not taking into consideration all potential control variables in our analysis, climate dynamics indeed Granger-cause vegetation anomalies in most of the continental land surface. Moreover, the improved capacity of random forests over ridge regression to predict vegetation anomalies suggests that these relationships are non-linear.
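This full-versus-baseline comparison can be sketched as follows. The example below is a minimal illustration only, assuming scikit-learn is available and using raw lags as predictors; the actual framework uses 4,571 engineered features per pixel, and a positive R² gain of the full model is interpreted as evidence of (linear or non-linear) Granger causality.

    import numpy as np
    from sklearn.base import clone
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def lagged(series, n_lags):
        """Stack the n_lags past values of a 1-D series into a feature matrix."""
        cols = [series[lag:len(series) - n_lags + lag] for lag in range(n_lags)]
        return np.column_stack(cols)

    def granger_gain(y, x, model, n_lags=6):
        """R2 improvement of the full model (past y and past x) over the
        autoregressive baseline (past y only)."""
        target = y[n_lags:]
        baseline = lagged(y, n_lags)
        full = np.hstack([baseline, lagged(x, n_lags)])
        r2_base = cross_val_score(clone(model), baseline, target, cv=5, scoring="r2").mean()
        r2_full = cross_val_score(clone(model), full, target, cv=5, scoring="r2").mean()
        return r2_full - r2_base

    rng = np.random.default_rng(0)
    x = rng.normal(size=400)                               # synthetic "climate" series
    y = 0.8 * np.roll(x, 1) + 0.1 * rng.normal(size=400)   # synthetic "vegetation" anomaly
    print(granger_gain(y, x, Ridge()))
    print(granger_gain(y, x, RandomForestRegressor(n_estimators=100, random_state=0)))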


References
Bonan, G. (2008). Forests and climate change: forc-
ings, feedbacks, and the climate benefits of forests.
science, 320, 1444–1449.
Granger, C. W. (1969). Investigating causal relations
by econometric models and cross-spectral methods.
Econometrica: Journal of the Econometric Society,
424–438.
McPherson, R. A., Fiebrich, C. A., Crawford, K. C.,
Kilby, J. R., Grimsley, D. L., Martinez, J. E.,
Basara, J. B., Illston, B. G., Morris, D. A., Kloe-
sel, K. A., et al. (2007). Statewide monitoring of
the mesoscale environment: A technical update on
the Oklahoma Mesonet. 24, 301–321.
Papagiannopoulou, C., Miralles, D. G., Decubber, S.,
Demuzere, M., Verhoest, N. E. C., Dorigo, W. A., &
Waegeman, W. (2016). A non-linear Granger causal-
ity framework to investigate climate–vegetation dy-
namics. Geoscientific Model Development Discus-
sions, 2016, 1–24.
Su, L., Jia, W., Hou, C., & Lei, Y. (2011). Microbial
biosensors: a review. Biosensors and Bioelectronics,
26, 1788–1799.

Ye, L., & Keogh, E. (2009). Time series shapelets:


a new primitive for data mining. Proc. of the 15th
ACM SIGKDD international conference on Knowl-
edge discovery and data mining - KDD ’09 (p. 947).
New York, New York, USA: ACM Press.

Characterizing Resting Brain Activity to Predict the Amplitude of
Pain-Evoked Potentials in the Human Insula

Dounia Mulders, Michel Verleysen {name.surname}@uclouvain.be


ICTEAM institute, Université catholique de Louvain, Place du Levant 3, 1348 Louvain-la-Neuve, Belgium
Giulia Liberati, André Mouraux {name.surname}@uclouvain.be
IONS institute, Université catholique de Louvain, Avenue Mounier 53, 1200 Woluwe-Saint-Lambert, Belgium

Keywords: Pain, nociception, intracerebral recordings, feature extraction, time series prediction.
Abstract

How the perception of pain emerges from human brain activity remains largely unknown. Apart from inter-individual variations, this perception depends not only on the physical characteristics of the painful stimuli, but also on other psycho-physiological aspects. Indeed, a painful stimulus applied to an individual can sometimes evoke very distinct sensations from one trial to the other. Hence the state of a subject receiving such a stimulus should (at least partly) explain the intensity of pain elicited by that stimulus. Using intracranial electroencephalography (iEEG) from the insula to measure this cortical "state", our goal is to study to which extent ongoing brain activity in the human insula, an area thought to play a key role in pain perception, may predict the magnitude of pain-evoked potentials and, more importantly, whether it may predict the perception intensity. To this aim, we summarize the ongoing insular activity by defining frequency-dependent features, derived using continuous wavelet and Fourier transforms. We then take advantage of this description to predict the amplitude of the insular responses elicited by painful (heat) and non-painful (auditory, visual and vibrotactile) stimuli, as well as to predict the intensity of perception.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

The ability to perceive pain is crucial for survival, as exemplified by the injuries and reduced life expectancy of people with congenital insensitivity to pain. Furthermore, pain is a major healthcare issue and its treatment, especially in the context of pathological chronic pain, constitutes a very challenging problem for physicians. Characterizing the relationship between pain perception and brain activity could provide insights on how nociceptive inputs are processed in the human brain and, ultimately, how this leads to the perception of pain (Apkarian et al., 2005; Tracey & Mantyh, 2007). It is widely accepted that perception fluctuates along time, even in a resting state. These fluctuations might result from variations in neuronal activity (Sadaghiani et al., 2010) which can be, at least partly, recorded with neuroimaging or electrophysiological monitoring techniques (VanRullen et al., 2011). Hence spontaneous brain activity, which is often considered as noise, might be related to our perception capabilities. However, the potential links between perception fluctuations and the recorded brain activity are not yet fully elucidated. It has already been suggested that perception is a discrete process, namely that the cortex quickly oscillates between different, relatively short-lasting levels of excitability (VanRullen & Koch, 2003). This excitability can for instance be measured by electroencephalography (EEG).

Supporting the aforementioned hypothesis, several studies have already established links between ongoing brain activity measured before the presentation of a sensory stimulus using functional magnetic resonance imaging (fMRI) or EEG, and the subsequent stimulus-evoked response, assessed either in terms of subjective perception or brain response magnitude (Mayhew et al., 2013; Monto et al., 2008). For instance, Barry et al. and Busch et al. study the effect of pre-stimulus low-frequency (between 5 and 15 Hz) phase on auditory and visual perception respectively, showing that stimulus processing can be affected by such phase (i.e. position within a cycle) at the stimulus onset (2004; 2009). Tu et al. use linear regression to predict the


subjective pain intensity generated by nociceptive stimuli from the time-frequency coefficients obtained by a short-time Fourier transform (2016). They show that pain perception depends on some pre-stimulus time-frequency features.

In this setting, our work aims to study whether and to which extent pain perception capabilities vary along time. As a first step, our goal is to predict the amplitude of the responses recorded in the insula following a painful heat stimulus (generated by a CO2 laser) from the pre-stimulus insular activity. We focus on the insula because this region is thought to play an important role in pain perception (Garcia-Larrea & Peyron, 2013). Instead of analyzing the relationships between the elicited responses and only a few features (e.g. the phase or power in one frequency band) of the resting EEG prior to a stimulus onset, we propose to first characterize this ongoing activity in more detail. We summarize the ongoing EEG activity by defining frequency-dependent features, derived using continuous wavelet and Fourier transforms. This description is then exploited to predict the amplitude of the insular potentials elicited by painful (heat) and non-painful (auditory, visual and vibrotactile) stimuli (for comparison) using multilayer perceptron neural networks, linear regression and k-nearest neighbor regression.

2. Method

This study has been approved by the local ethics committee (CEBHF). Benefiting from the high spatiotemporal resolution of iEEG recordings from the insula performed in patients implanted for a presurgical evaluation of focal epilepsy, our first goal is to summarize the ongoing insular activity prior to a stimulus application. For this purpose, and as oscillations in different frequency bands have been associated with different functions, frequency-dependent features are defined. Because neural activity is non-stationary and since the brain functional state has been hypothesized to vary over small time scales (less than one second) (Britz et al., 2010), we propose to extract oscillation features as close as possible to the stimulus onset, while avoiding border effects specific to each frequency band. The extracted features consist of (1) the amplitude, (2) the phase and (3) the power in the five physiological frequency bands (Birjandtalab et al., 2016). These bands are denoted by δ, θ, α, β and γ and correspond to the frequency ranges [0.1, 4], [4, 8], [8, 12], [12, 30] and [30, 80] Hz. The first two kinds of features are defined using the Morlet (continuous) wavelet, allowing to describe the phase and amplitude of the oscillations at a particular time with a better temporal resolution at higher frequencies. The power features take a larger time window into account, starting 0.5 second before the stimulus onset, as these are obtained by Fourier transform. Figure 1 shows an example of the amplitude extraction for six trials, for which the stimulus onset is at t = 0.

Figure 1. Extraction of the amplitude features in five frequency bands for six presentations of a painful stimulus to a subject. The first box shows the recorded signals, with the red and green dots allowing to define the response amplitude. The second box gives the continuous wavelet transforms of these trials, the vertical dotted black lines indicating the feature extraction times. These are defined by taking the wavelet support into account, according to the frequency band considered.

Using the aforementioned features to describe the ongoing activity before the presentation of each stimulus (40 trials of each kind of stimulus were conducted on each patient), we then predict the response amplitude. So far, the achieved performances are not significant, but further investigations are carried out.

3. Conclusion

In line with previous works attempting to study how fluctuations of the ongoing oscillatory brain activity may modulate pain perception, our approach aims to establish a link between the combination of several features of the ongoing insular activity and the subsequent neural response, taking advantage of the high spatiotemporal resolution of intracerebral EEG. The same characterization of the spontaneous brain activity could be used to predict subject-evaluated pain perception directly, rather than the observed neural response.

Acknowledgments

DM is a Research Fellow of the Fonds de la Recherche Scientifique - FNRS.
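The following minimal sketch illustrates this kind of band-wise feature extraction; it is an assumption-laden toy example (NumPy only, an assumed sampling rate, a simple complex Morlet wavelet and the band limits listed above), not the authors' pipeline.

    import numpy as np

    BANDS = {"delta": (0.1, 4), "theta": (4, 8), "alpha": (8, 12),
             "beta": (12, 30), "gamma": (30, 80)}

    def morlet(freq, fs, n_cycles=5):
        """Complex Morlet wavelet centred at a given frequency."""
        sigma_t = n_cycles / (2 * np.pi * freq)
        t = np.arange(-3 * sigma_t, 3 * sigma_t, 1 / fs)
        return np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))

    def band_features(signal, fs, t_extract):
        """Amplitude and phase per band at sample t_extract (chosen per band in
        practice to avoid border effects), plus band power from the spectrum."""
        spectrum = np.abs(np.fft.rfft(signal))**2
        freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
        features = {}
        for name, (lo, hi) in BANDS.items():
            conv = np.convolve(signal, morlet((lo + hi) / 2, fs), mode="same")
            features[name + "_amplitude"] = np.abs(conv[t_extract])
            features[name + "_phase"] = np.angle(conv[t_extract])
            features[name + "_power"] = spectrum[(freqs >= lo) & (freqs < hi)].sum()
        return features

    fs = 512                                   # assumed sampling rate
    t = np.arange(-4.0, 0.0, 1 / fs)           # 4 s of pre-stimulus signal
    eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(len(t))
    print(band_features(eeg, fs, t_extract=len(t) - 1))

Such per-band features can then be fed to the regression models mentioned above (multilayer perceptron, linear regression, k-nearest neighbors) to predict the response amplitude.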


References VanRullen, R., Busch, N., Drewes, J., & Dubois, J.


(2011). Ongoing eeg phase as a trial-by-trial predic-
Apkarian, A. V., Bushnell, M. C., Treede, R.-D., &
tor of perceptual and attentional variability. Fron-
Zubieta, J.-K. (2005). Human brain mechanisms of
tiers in psychology, 2, 60.
pain perception and regulation in health and dis-
ease. European journal of pain, 9, 463–463. VanRullen, R., & Koch, C. (2003). Is perception dis-
crete or continuous? Trends in cognitive sciences,
Barry, R. J., Rushby, J. A., Johnstone, S. J., Clarke, 7, 207–213.
A. R., Croft, R. J., & Lawrence, C. A. (2004). Event-
related potentials in the auditory oddball as a func-
tion of eeg alpha phase at stimulus onset. Clinical
Neurophysiology, 115, 2593–2601.

Birjandtalab, J., Pouyan, M. B., & Nourani, M.


(2016). Nonlinear dimension reduction for eeg-based
epileptic seizure detection. Biomedical and Health
Informatics (BHI), 2016 IEEE-EMBS International
Conference on (pp. 595–598).

Britz, J., Van De Ville, D., & Michel, C. M. (2010).


Bold correlates of eeg topography reveal rapid
resting-state network dynamics. Neuroimage, 52,
1162–1170.

Busch, N. A., Dubois, J., & VanRullen, R. (2009).


The phase of ongoing eeg oscillations predicts visual
perception. Journal of Neuroscience, 29, 7869–7876.

Garcia-Larrea, L., & Peyron, R. (2013). Pain matrices


and neuropathic pain matrices: a review. PAIN , R
154, S29–S43.

Mayhew, S. D., Hylands-White, N., Porcaro, C., Der-


byshire, S. W., & Bagshaw, A. P. (2013). Intrinsic
variability in the human response to pain is assem-
bled from multiple, dynamic brain processes. Neu-
roimage, 75, 68–78.

Monto, S., Palva, S., Voipio, J., & Palva, J. M. (2008).


Very slow eeg fluctuations predict the dynamics of
stimulus detection and oscillation amplitudes in hu-
mans. Journal of Neuroscience, 28, 8268–8272.

Sadaghiani, S., Hesselmann, G., Friston, K. J., &


Kleinschmidt, A. (2010). The relation of ongoing
brain activity, evoked neural responses, and cogni-
tion. Frontiers in systems neuroscience, 4, 20.

Tracey, I., & Mantyh, P. W. (2007). The cerebral sig-


nature for pain perception and its modulation. Neu-
ron, 55, 377–391.

Tu, Y., Zhang, Z., Tan, A., Peng, W., Hung, Y. S.,
Moayedi, M., Iannetti, G. D., & Hu, L. (2016). Al-
pha and gamma oscillation amplitudes synergisti-
cally predict the perception of forthcoming nocicep-
tive stimuli. Human brain mapping, 37, 501–514.

Probabilistic Inference-based Reinforcement Learning

Quan Nguyen1 [email protected]


Bert de Vries1,2 [email protected]
Tjalling J. Tjalkens1 [email protected]
1 2
Department of Electrical Engineering, Eindhoven University of Technology, GN Hearing BV, Eindhoven

Keywords: reinforcement learning, reward functions, probabilistic modeling, Bayesian inference.

Abstract

We introduce probabilistic inference-based reinforcement learning (PIReL), an approach to solve decision making problems by treating them as probabilistic inference tasks. Unlike classical reinforcement learning, which requires explicit reward functions, in PIReL they are implied by probabilistic assumptions of the model. This would enable a fundamental way to design the reward function by model selection, as well as bring the potential to apply existing probabilistic modeling techniques to reinforcement learning problems.

1. Introduction

Reinforcement learning (RL) is a domain in machine learning concerned with how an agent makes decisions in an uncertain environment. In the traditional approach, the agent learns how to do a certain task by maximizing the expected total rewards. However, the reward functions are often handcrafted for specific problems rather than based on a general guideline.

In contrast to classical RL, probabilistic inference-based reinforcement learning (PIReL) treats the action as a hidden variable in a probabilistic model. Hence choosing actions that lead to the desired goal states can be treated in a straightforward manner as probabilistic inference.

This idea was in fact first proposed by (Attias, 2003). Our contribution is to extend the original framework so that it can take into account uncertainties about the goals. The extended framework shows its connection to classical RL. Particularly, the reward function and discount factor in classical RL can be seen as certain probabilistic assumptions in the model. This interpretation provides us with a way to design an appropriate reward function, e.g., by model selection.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

2. Problem Modeling

The model is based on the Markov Decision Process. The interaction between the agent and the environment occurs in a time sequence, so a subscript is used to indicate the time step. Under the Markov assumption, when the environment is in state s_t (which is supposed to be fully observed by the agent) and receives an action a_t from the agent, it will change to a new state s_{t+1}. The generative model is specified as:

    p(s_{1:T}, a_{1:T}) = π · p(s_1) ∏_{t=1}^{T−1} p(s_{t+1} | s_t, a_t),   (1)

where π ≜ ∏_{t=1}^{T−1} p(a_t) is the action prior (prior policy), and p(s_{t+1} | s_t, a_t) is the transition probability. Unlike the standard MDP, there is no explicit reward here. Next we will explain how to infer actions.

2.1. Reinforcement Learning by Goal-based Probabilistic Inference

For the simplest decision making problem (Attias, 2003), at the initial state s_1, given a fixed horizon T > 1 and action prior π, the agent decides which actions a_{1:T−1} should be performed in order to achieve the specified goal at the horizon, s_T = g. In other words, we are interested in the posterior:

    p(a_t | s_1, s_T = g), ∀t ∈ {1, . . . , T − 1}.   (2)

These probabilities have the form of a smoothing distribution, and the inference problem can be solved efficiently by a forward-backward-based algorithm.
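The following minimal sketch (an illustration under assumed toy dynamics, not the authors' implementation) computes the posterior p(a_1 | s_1, s_T = g) for a small finite MDP by combining a backward pass P(s_T = g | s_t = s) under the prior policy with the action prior:

    import numpy as np

    # Toy MDP: 3 states, 2 actions; P[a][s, s'] = p(s' | s, a) (assumed dynamics).
    P = np.array([[[0.9, 0.1, 0.0],
                   [0.0, 0.9, 0.1],
                   [0.0, 0.0, 1.0]],
                  [[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.5],
                   [0.0, 0.5, 0.5]]])
    prior = np.array([0.5, 0.5])              # uniform action prior p(a_t)

    def action_posterior(s1, goal, T):
        """p(a_1 | s_1, s_T = goal) via a backward (beta) recursion."""
        n_states = P.shape[1]
        beta = np.zeros(n_states)             # beta_t(s) = P(s_T = goal | s_t = s)
        beta[goal] = 1.0
        marginal = np.einsum("a,ast->st", prior, P)   # p(s' | s) under the prior policy
        for _ in range(T - 2):                # propagate back from t = T to t = 2
            beta = marginal @ beta
        # p(a_1 | s_1, s_T = goal) ∝ p(a_1) * Σ_{s_2} p(s_2 | s_1, a_1) beta_2(s_2)
        unnorm = prior * (P[:, s1, :] @ beta)
        return unnorm / unnorm.sum()

    print(action_posterior(s1=0, goal=2, T=4))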

92
Probabilistic Inference-based Reinforcement Learning

3. Bayesian Policy and Relation to Classical Reinforcement Learning

In practice, it could be tricky to specify a desired goal precisely on s_T. Thus we introduce an abstract random binary variable z that indicates whether s_T is a good (rewarding) or bad state. The goal is instead set as z = 1 (good state).

In the special case when the given goal on s_T is certain, we have p(z = 1 | s_T) ≜ δ(s_T − g). One could verify that

    p(z = 1 | s_1, T) = p(s_T = g | s_1, T),

while the updated policy (posterior) becomes

    p(a_t | s_1, z = 1, T) = p(a_t | s_1, s_T = g), ∀t.

For an uncertain goal on s_T, we have a generic form π_zT ≜ p(z = 1 | s_T), which is a probability function with input s_T (since z is always fixed at 1).

The current policy however still assumes that the horizon is known. Similarly, to accommodate the uncertainty about the horizon, we average over it. Without loss of generality, assume that the horizon T is upper bounded, 1 < T ≤ T̄ < ∞; thus we have the full Bayesian policy

    p(a_t | s_1, z = 1; π_T) = Σ_{T=2}^{T̄} π_T · p(a_t | s_1, z = 1, T), ∀t,   (3)

where π_T ≜ p(T) is the probability that the horizon is at time T. The marginal likelihood under π_ag ≜ {π, π_T, π_zT} (policy, horizon, and goal distribution, respectively) is defined as:

    p(z = 1 | s_1; π_ag) = Σ_{T=2}^{T̄} π_T · p(z = 1 | s_1, T; π_zT; π)
                         = Σ_{T=2}^{T̄} π_T ∫ π_zT ∏_{t=1}^{T−1} p(s_{t+1} | s_t; π) ds_{2:T}.   (4)

Let us consider the value function, the expected (discounted) total reward when the agent at initial state s_1 follows policy π (Sutton & Barto, 2017):

    V_π(s_1) = E[ Σ_{T=1}^{T̄} γ_T r_T | s_1; π ]
             = Σ_{T=1}^{T̄} γ_T E(r_T | s_1; π)
             = Σ_{T=1}^{T̄} γ_T ∫ R(s_T) ∏_{t=1}^{T−1} p(s_{t+1} | s_t; π) ds_{2:T},

where γ_T and r_T denote the discount factor and instant reward at time T respectively, while R(s_T) is the reward function that returns a corresponding reward for state s_T.

It is clear that the horizon distribution π_T behaves like the discount factor, while the goal distribution π_zT acts like the reward function in classical reinforcement learning. In classical RL, both the reward function and the discount factor are often given. In contrast, in our probabilistic framework, the optimal policy, horizon and goal distribution π̂_ag that maximize the (log) marginal likelihood in eq. (4) can be estimated by, e.g., the EM algorithm (Dempster et al., 1977).

4. Related Work

The basic idea of PIReL originates from (Attias, 2003), where the agent infers actions in order to reach a certain goal at a fixed horizon. (Toussaint & Storkey, 2006) define the goal as obtaining the highest valued reward at the horizon, and propose an EM-based algorithm to derive the MAP estimation of the action posterior with the horizon marginalized out. By averaging over the horizons, the inferred policy also maximizes the expected return.

In the neuroscience and cognitive sciences literature, similar ideas to PIReL have been suggested; e.g., (Friston, 2010) and (Botvinick & Toussaint, 2012) discuss agents that infer actions that lead to a predefined goal.

An alternative approach to improve the reward function is reward shaping, see e.g. (Ng et al., 1999), which however offers only a limited alteration to the predefined rewards.

5. Conclusions

We discussed a framework where classical RL is recast as goal-based probabilistic inference. In this approach, there are no explicit reward functions as in classical RL; instead, the agent infers what actions to perform in order to reach a set of goals with different priorities. The reward function and discount factor can be interpreted as the goal and horizon distribution in this probabilistic framework. This potentially brings fundamental ways to improve or design an appropriate reward function and discount factor.

Acknowledgments

This work is part of the research programme HearScan with project number 13925, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO). We thank the anonymous reviewers for their thoughtful comments and suggestions.

References
Attias, H. (2003). Planning by probabilistic inference.
AISTATS.
Botvinick, M., & Toussaint, M. (2012). Planning as
inference. Trends in Cognitive Sciences, 16, 485–
488.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the
EM algorithm. Journal of the royal statistical soci-
ety. Series B (methodological), 1–38.
Friston, K. (2010). The free-energy principle: a unified
brain theory? Nature Reviews Neuroscience, 11,
127–138.
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy in-
variance under reward transformations: Theory and
application to reward shaping. International Con-
ference on Machine Learning (pp. 278–287).

Sutton, R., & Barto, A. (2017). Reinforcement learn-


ing: An introduction. The MIT Press. second (in
progress) edition.
Toussaint, M., & Storkey, A. (2006). Probabilistic
inference for solving discrete and continuous state
Markov Decision Processes. Proceedings of the 23rd
International Conference on Machine Learning (pp.
945–952).

Identifying Subject Experts through Clustering Analysis

Veselka Boeva [email protected]


Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden
Milena Angelova [email protected]
Technical University of Sofia - branch Plovdiv, 4000 Plovdiv, Bulgaria
Elena Tsiporkova [email protected]
Sirris, The Collective Center for the Belgian technological industry, Brussels, Belgium

Keywords: data mining, expert finding, formal concept analysis, health science, knowledge management

Abstract

In this work, we discuss an approach for identifying subject experts via clustering analysis of the available online information. Initially, the domain of interest is partitioned into a number of subject areas. Next, each extracted expert is represented by a vector of contributions of the expert to the different areas. Finally, the set of all extracted experts is grouped into a set of disjoint expert areas by applying Formal Concept Analysis (FCA). The produced grouping is further shown to facilitate the location of experts with the required competence.

1. Introduction

Expertise retrieval has already gained significant interest in the area of information retrieval. A variety of practical scenarios of organizational situations that lead to expert seeking have been extensively presented in the literature, e.g., see (Cohen et al., 1998), (Kanfer et al., 1997), (Kautz et al., 1996), (Mattox et al., 1999), (McDonald & Ackerman, 1998), (Vivacqua, 1999).

In recent years, research on identifying experts from online data sources has been gradually gaining interest (Abramowicz, 2011), (Bozzon et al., 2013), (Balog, 2007), (Hristoskova et al., 2013), (Jung et al., 2007), (Stankovic et al., 2011), (Singh et al., 2013), (Tsiporkova & Tourwè, 2011), (Zhang et al., 2007). In (Boeva et al., 2016), an enhanced technique that identifies subject experts via clustering analysis of the available online information has been developed. The authors propose a formal concept analysis approach for grouping a given set of experts with respect to pre-defined subject areas. The proposed approach is especially useful for modelling two-phase expert identification processes. Such processes deal with two levels of complexity: initially it is necessary to identify a wide set of available experts who have expertise in a domain of interest given at the higher level of abstraction (e.g., health science); then, in the second phase, the expert domain can be broken up into more specific expert areas and the available experts can further be reduced to a smaller group of experts who all have expertise in a number of selected areas (e.g., information science and health care). This expert identification scenario could be useful, e.g., when the set of available experts is preliminarily identified and known; the expert finding system can then help in recruiting currently needed individuals by taking into account the specified topics.

2. Clustering of Experts

The data needed for constructing the expert profiles could be extracted from various Web sources, e.g., LinkedIn, the DBLP library, Microsoft Academic Search, Google Scholar Citation, PubMed etc. There exist several open tools for extracting data from public online sources. In addition, the Stanford part-of-speech tagger (Toutanova, 2000) can be used to annotate the different words in the text collected for each expert with their specific part of speech. The annotated text can be reduced to a set of keywords by removing all the words tagged as articles, prepositions, verbs, and adverbs.

Preliminary work. Under review for Benelearn 2017. Do not distribute.


In view of the above, we define an expert profile as a list of keywords, extracted from the available information about the expert in question, describing her/his subjects of expertise. Assume that n different expert profiles are created in total and each expert profile i (i = 1, 2, . . . , n) is represented by a list of p_i keywords.

A conceptual model of the domain of interest, such as a thesaurus, a taxonomy etc., can be available and used to attain accurate and topic-relevant expert profiles. In this case, usually a set of subject terms (topics) arranged in a hierarchical manner is used to represent concepts in the considered domain. Another possibility to represent the domain of interest at a higher level of abstraction is to partition the set of all different keywords used to define the expert profiles into k main subject areas. The latter idea has been proposed and applied in (Boeva et al., 2014).

As discussed above, the domain of interest can be presented by k main subject categories C_1, C_2, . . . , C_k. Let us denote by b_ij the number of keywords from the expert profile of expert i that belong to category C_j. Now each expert i can be represented by a vector e_i = (e_i1, e_i2, . . . , e_ik), where e_ij = b_ij / p_i and p_i is the total number of keywords in the expert profile representation. In this way, each expert i is represented by a k-length vector of membership degrees of the expert to the k different subject categories, i.e. the above procedure generates a fuzzy clustering. The resulting fuzzy partition can easily be turned into a crisp one by assigning to each pair (expert, area) a binary value (0 or 1), i.e. with each subject area we can associate those experts who have membership degrees greater than a preliminarily given threshold (e.g., 0.5). This partition is not guaranteed to be disjoint in terms of the different subject areas, since there will be experts who belong to more than one subject category. This overlapping partition is further analyzed and refined into a disjoint one by applying FCA.

Formal concept analysis (Ganter et al., 2005) is a mathematical formalism allowing to derive a concept lattice from a formal context constituted of a set of objects, a set of attributes, and a binary relation defined on the Cartesian product of these two sets. In our case, a (formal) context consists of the set of the n experts, the set of main categories {C_1, C_2, . . . , C_k} and an indication of which experts are associated with which subject category. Thus the context is described as a matrix, with the experts corresponding to the rows and the categories corresponding to the columns of the matrix, and a value 1 in cell (i, j) whenever expert i is associated with subject area C_j. Subsequently, a concept for this context is defined to be a pair (X, Y) such that X is a subset of experts and Y is a subset of subject areas, and every expert in X belongs to every area in Y; for every expert that is not in X, there is a subject area in Y that does not contain that expert; and for every subject area that is not in Y, there is an expert in X who is not associated with that area. The family of these concepts obeys the mathematical axioms defining a concept lattice. The built lattice consists of concepts where each one represents a subset of experts belonging to a number of subject areas. The set of all concepts partitions the experts into a set of disjoint expert areas. Notice that the above introduced grouping of experts can be performed with respect to any set of subject areas describing the domain of interest, e.g., the experts could be clustered on a lower level of abstraction by using more specific topics.

3. Initial Evaluation and Discussion

The proposed approach has initially been evaluated in (Boeva et al., 2016) by applying the algorithm to partition Bulgarian health science experts extracted from the PubMed repository of peer-reviewed biomedical articles. Medical Subject Headings (MeSH) is a controlled vocabulary developed by the US National Library of Medicine for indexing research publications, articles and books. Using the MeSH terms associated with peer-reviewed articles published by Bulgarian authors and indexed in PubMed, we extract all such authors and construct their expert profiles. The MeSH headings are grouped into 16 broad subject categories. We have produced a grouping of all the extracted authors with respect to these subject categories by applying the discussed formal concept analysis approach. The produced grouping of experts is shown to capture well the expertise distribution in the considered domain with respect to the main subjects. In addition, it facilitates the identification of individuals with the required competence. For instance, if we need to recruit researchers who have expertise simultaneously in 'Phenomena and Processes' and 'Health care' categories, we can directly locate those who belong to the concept that unites the corresponding categories.

4. Conclusion and Future Work

A formal concept analysis approach for clustering a group of experts with respect to given subject areas has been discussed. The initial evaluation has demonstrated that the proposed approach is a robust clustering technique that is suitable to deal with sparse data. Further evaluation and validation on richer data extracted from different online sources are planned.
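A minimal sketch of this construction is given below. It is illustrative only: the expert profiles and category keyword sets are invented, and the 0.5 threshold is taken from the description above. It computes the membership vectors e_i, binarizes them into a formal context and enumerates the formal concepts via the usual extent/intent closure.

    from itertools import combinations

    profiles = {"expert1": ["gene", "protein", "database"],
                "expert2": ["database", "ontology", "query"],
                "expert3": ["gene", "cell", "protein", "ontology"]}
    categories = {"health": {"gene", "protein", "cell"},
                  "information": {"database", "ontology", "query"}}

    # Membership degrees e_ij = b_ij / p_i and crisp context at threshold 0.5.
    context = {}
    for expert, keywords in profiles.items():
        p_i = len(keywords)
        degrees = {c: sum(k in terms for k in keywords) / p_i
                   for c, terms in categories.items()}
        context[expert] = {c for c, e_ij in degrees.items() if e_ij >= 0.5}

    def extent(attrs):
        """Experts associated with every category in attrs."""
        return {e for e, cats in context.items() if attrs <= cats}

    def intent(experts):
        """Categories shared by every expert in experts."""
        shared = set(categories)
        for e in experts:
            shared &= context[e]
        return shared

    # A pair (X, Y) is a formal concept iff X = extent(Y) and Y = intent(X).
    concepts = set()
    for r in range(len(categories) + 1):
        for attrs in combinations(sorted(categories), r):
            X = extent(set(attrs))
            concepts.add((frozenset(X), frozenset(intent(X))))
    for X, Y in concepts:
        print(sorted(X), "<->", sorted(Y))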


References Mattox, D., Maybury, M., & Morey, D. (1999). Enter-


prise expert and knowledge discovery. Proceedings of
Abramowicz, W. (2011). Semantically enabled experts
HCI International99, Special Session on: Computer
finding system - ontologies, reasoning approach and
Supported Communication and Cooperation Making
web interface design. Proceedings of ADBIS (pp.
Information Aware, Munich, Germany.
157–166).
McDonald, D., & Ackerman, M. (1998). Just talk to
Balog, K. (2007). Broad expertise retrieval in sparse
me: a field study of expertise location. Proceedings
data environments. Proceedings of 30th Annual In-
of CSCW98, ACM Press. Seattle, WA.
ternational ACM SIGIR Conference on Research
and Development in Information Retrieval ACM Singh, H., Singh, R., Malhotra, A., & Kaur, M. (2013).
Press, New York. Developing a biomedical expert finding system using
Boeva, V., Angelova, M., Boneva, L., & Tsiporkova, E. medical subject headings. Healthcare Informatics
(2016). Identifying a group of subject experts using Research, 19, 243–249.
formal concept analysis. Proceedings of the 8th IEEE Stankovic, M., Jovanovic, J., & Laublet, P. (2011).
International Conference on Intelligent Systems. Linked data metrics for flexible expert search on the
Boeva, V., Boneva, L., & Tsiporkova, E. (2014). open web. LNCS, Springer, Heidelberg, 108123.
Semantic-aware expert partitioning. Artificial In- Toutanova, K. (2000). Enriching the knowledge
telligence: Methodology, Systems, and Applications, sources used in a maximum entropy partofspeech
LNAI Springer International Publishing Switzer- tagger. Proceedings of the Joint SIGDAT Confer-
land, 8722, 13–24. ence on Empirical Methods in Natural Language
Bozzon, A., Brambilla, M., Ceri, S., Silvestri, M., Processing and Very Large Corpora EMNLP/VLC-
& Vesci, G. (2013). Choosing the right crowd: 2000.
Expert finding in social networks. Proceeding of
Tsiporkova, E., & Tourwè, T. (2011). Tool support
EDBT/ICDT13, Genoa, Italy.
for technology scouting using online sources. LNCS,
Cohen, A. L., Maglio, P. P., & Barrett, R. (1998). Springer, Heidelberg, 6999, 371376.
The expertise browser: how to leverage distributed
Vivacqua, A. (1999). Agents for expertise location.
organizational knowledge. Proceedings of CSCW98,
Proceedings of the AAAI Spring Symp. on Intelli-
Seatle, WA.
gent Agents in Cyberspace, Stanford, CA.
Ganter, B., Stumme, G., & Wille, R. (2005). For-
mal concept analysis: Foundations and applications. Zhang, J., Tang, J., & Li, J. (2007). Expert finding
LNAI no. 3626 Springer-Verlag. in a social network. LNCS, Springer, Heidelberg,
10661069.
Hristoskova, A., Tsiporkova, E., Tourwè, T., Buelens,
S., Putman, M., & Turck, F. D. (2013). A graph-
based disambiguation approach for construction of
an expert repository from public online sources. Pro-
ceeding of 5th IEEE International Conference on
Agents and Artificial Inteligence.
Jung, H., Lee, M., Kang, I.-S., Lee, S.-W., & Sung,
W.-K. (2007). Finding topic-centric identified ex-
perts based on full text analysis. Procing of 2nd
International ExpertFinder Workshop, ISWC.
Kanfer, A., Sweet, J., & Schlosser, A. (1997). Human-
izing the net: social navigation with a know-who
email agent. Proceedings of the Third Conference
on Human Factors and The web. Denver, Colorado.
Kautz, H., Selman, B., & Milewski, A. (1996). Agent
amplified communication. Proceedings of the Thir-
teenth National Conference on Artificial Intelli-
gence, Portland, OR.

An Exact Iterative Algorithm for Transductive Pairwise Prediction

Michiel Stock [email protected]


Bernard De Baets [email protected]
Willem Waegeman [email protected]
KERMIT, Coupure links 653, Ghent, Belgium

Keywords: pairwise learning, transductive learning, matrix imputation, optimization

Abstract

Imputing missing values of a matrix when side-features are available can be seen as a special case of pairwise learning. In this extended abstract we present an exact iterative algorithm to impute these missing values efficiently.

1. Problem statement

Consider the problem of pairwise learning where for a given dyad (u, v) ∈ U × V we want to learn a function f : U × V → R to make a prediction. For example, for a given protein and a given ligand, one wants to predict the binding affinity based on a set of examples. Pairwise prediction models are fitted using a set of labeled examples: S = {(u_h, v_h, y_h) | h = 1, . . . , n} is a set of labeled dyads. Let U = {u_i | i = 1, . . . , m} and V = {v_j | j = 1, . . . , q} denote, respectively, the sets of distinct objects of both types in the training set, with m = |U| and q = |V|. If a dataset contains exactly one labeled instance for every dyad (u, v) ∈ U × V, this dataset is denoted as being complete. For such datasets, the labels can be structured as a matrix Y ∈ R^{m×q}, so that its rows are indexed by the objects in U and the columns by the objects in V.

In the setting considered here, we assume to possess two positive definite kernel matrices, describing the objects of U and V respectively. Using Kronecker kernel ridge regression (Waegeman et al., 2012), a model can be fitted to make predictions for new dyads. Suppose we want to make predictions only for in-sample objects (u, v) ∈ U × V; we can directly obtain the matrix of predictions F ∈ R^{m×q} by solving

    min_F  (1/2) Σ_{i=1}^{m} Σ_{j=1}^{q} (F_ij − Y_ij)² + (C/2) vec(F)ᵀ [G ⊗ K]⁻¹ vec(F),   (1)

with K (resp. G) the Gram matrix for the objects U (resp. V) and C a regularization parameter which can be selected by cross-validation. This optimization problem has two parts. The first term contains the loss function and ensures that the predictions are close to the observed labels. The second term is a regularization term that ensures that similar pairs have a similar label. See (Johnson & Zhang, 2008; Liu & Yang, 2015) for a more in-depth discussion. This is an example of transductive learning, as the pairs for which we want to make a prediction are known before fitting the model (Chapelle et al., 2006). The solution of the above minimization problem is given by

    vec(F) = [I + C [G ⊗ K]⁻¹]⁻¹ vec(Y) = H vec(Y).

Here, H is typically denoted as the hat matrix, linking the labels to the predictions. Computing the hat matrix explicitly is often prohibitively expensive, even for modest m and q. Starting from the eigenvalue decompositions of K and G, one can compute F without needing the hat matrix in an intermediate step. Using some algebraic manipulations, F can be obtained with a time complexity of O(m³ + q³) and a space requirement of O(m² + q²), provided that the dataset is complete. For example, for a protein-ligand interaction dataset of hundreds of proteins and thousands of ligands, the complexity of solving (1) is dominated by the numbers of objects (thousands of proteins) rather than the number of labels (hundreds of thousands of pairwise interaction values).

Preliminary work. Under review for Benelearn 2017. Do not distribute.

Often, the training data is not complete. Hence, why
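A minimal NumPy sketch of this eigendecomposition shortcut is given below (an illustration only, not the authors' code; random matrices stand in for the kernels and labels). Writing K = U_k diag(λ) U_kᵀ and G = U_g diag(γ) U_gᵀ, the solution of (1) can be evaluated as F = U_k (W ⊙ (U_kᵀ Y U_g)) U_gᵀ with W_ij = λ_i γ_j / (λ_i γ_j + C), which avoids ever forming the mq × mq hat matrix:

    import numpy as np

    def kronecker_krr_fit(K, G, Y, C=1.0):
        """Predictions F of Kronecker kernel ridge regression for a complete
        label matrix Y, using only the eigendecompositions of K and G."""
        lam, U_k = np.linalg.eigh(K)          # K = U_k diag(lam) U_k^T
        gam, U_g = np.linalg.eigh(G)          # G = U_g diag(gam) U_g^T
        W = np.outer(lam, gam) / (np.outer(lam, gam) + C)   # filter factors
        return U_k @ (W * (U_k.T @ Y @ U_g)) @ U_g.T

    # Tiny example with random symmetric positive definite "kernels".
    rng = np.random.default_rng(0)
    m, q = 5, 4
    A, B = rng.normal(size=(m, m)), rng.normal(size=(q, q))
    K, G = A @ A.T + m * np.eye(m), B @ B.T + q * np.eye(q)
    Y = rng.normal(size=(m, q))
    F = kronecker_krr_fit(K, G, Y, C=0.1)

    # Check against the explicit hat-matrix formula (feasible only for tiny sizes).
    H = np.linalg.inv(np.eye(m * q) + 0.1 * np.linalg.inv(np.kron(G, K)))
    print(np.allclose(F, (H @ Y.flatten(order="F")).reshape(m, q, order="F")))

In the incomplete-data setting described in the abstract, the same routine can be called inside a simple loop that alternately refits the model and overwrites the missing entries of Y with the current predictions until convergence.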

98
An Exact Iterative Algorithm for Transductive Pairwise Learning

one needs a model to impute these missing values in Figure 1 shows the example image of which parts were
the first place. For incomplete data, we need to solve removed. We either randomly removed 1%, 10%, 50%,
a modification of (1): 90% or 99% of the pixels or removed a 100 × 400 pix-
els block from the image. Subsequently, the iterative
1 X imputation algorithm was used to impute the missing
min (Fij − Yij )2
F 2 part of the image. The missing pixels were initial-
(i,j)∈T
ized with the average value of the remaining pixels in
C
+ vec(F)> [G ⊗ K]−1 vec(F) . (2) the image. The bottom of Figure 1 shows the mean
2
squared error of the values of the imputed pixels as a
Solving the above problem is determined by the num- function of the number of iterations of the algorithm.
ber of pairwise observations n and might be huge even For reference purposes, the variance is also indicated,
for modest m and q. Ironically, having less data makes corresponding to the expected mean squared error of
it much harder to compute F compared to the com- using the mean as imputation. In all cases, the al-
plete case. gorithm could restore the image substantially better
than the baseline. If the image is relatively complete,
Since computing F can be done efficiently when the the imputation is quite fast; all imputations could be
dataset is complete, we suggest the following simple done in under a minute on a standard laptop. Figure 2
recipe to update the missing values: shows some of the image restorations. With 10% of the
pixels missing, the restoration is visually indistinguish-
1. Initialize the missing values of the unlabelled able of the original. Using only 10% of the pixels, a
dyads, making the label matrix complete. This blurry image of the original can be produced. In the
can be done by using the average of the observed case where a block of the image is missing, a ‘shadow’
labels or initalizing them to zero. of the coffee cup can be seen, showing that the model
can at least detect some high-level features of the im-
2. Fit a model using this label matrix. This step age.
has a very low computational cost if the eigen-
value decompositions of K and G were already
computed. 3. Conclusions
3. Update the missing values using the model. We presented a simple algorithm to impute miss-
ing values in a transductive pairwise learning setting.
4. Repeat steps 2 and 3 until convergence. It can be shown that the algorithm always rapidly
converges to the correct solution. This algorithm
Formally, we can show that the above steps always was illustrated on an example of inpainting an im-
converge to the unique minimizer of (1) and the error age. Given the importance of pairwise learning in
w.r.t. F decays as a geometric series. domains such as molecular network inference (Vert,
2. Illustration: inpainting an image

The pairwise methods can also be applied to images, naturally represented as a matrix. Using suitable kernels, the Kronecker-based methods can be used as a linear image filter; see (Gonzalez & Woods, 2007) for an introduction. A black-and-white image is merely a matrix with intensity values for each pixel. Here, the only features for the rows and columns are the x- and y-coordinates of the pixels. For the rows (resp. columns) a kernel can be constructed that quantifies the distance between pixels in the vertical (resp. horizontal) direction. In the experiments, we use a standard radial basis kernel on the pixel coordinates for the rows and columns, plugged into Kronecker kernel ridge regression with a regularization parameter λ = 0.1. We will illustrate the imputation algorithm on a benchmark image of a cup of coffee.

We randomly removed 10%, 90% or 99% of the pixels, or removed a 100 × 400 pixels block from the image. Subsequently, the iterative imputation algorithm was used to impute the missing part of the image. The missing pixels were initialized with the average value of the remaining pixels in the image. The bottom of Figure 1 shows the mean squared error of the values of the imputed pixels as a function of the number of iterations of the algorithm. For reference purposes, the variance is also indicated, corresponding to the expected mean squared error of using the mean as imputation. In all cases, the algorithm could restore the image substantially better than the baseline. If the image is relatively complete, the imputation is quite fast; all imputations could be done in under a minute on a standard laptop. Figure 2 shows some of the image restorations. With 10% of the pixels missing, the restoration is visually indistinguishable from the original. Using only 10% of the pixels, a blurry image of the original can be produced. In the case where a block of the image is missing, a 'shadow' of the coffee cup can be seen, showing that the model can at least detect some high-level features of the image.

Figure 1. (left) An image of a cup of coffee. (right) Mean squared error of the imputed pixels as a function of the number of iterations of the imputation algorithm. Missing pixels are initialized with the average value of the observed pixels. The dotted line indicates the variance of the pixels, i.e. the approximate mean squared error of imputing with the average value of the observed pixels.

3. Conclusions

We presented a simple algorithm to impute missing values in a transductive pairwise learning setting. It can be shown that the algorithm always rapidly converges to the correct solution. This algorithm was illustrated on an example of inpainting an image. Given the importance of pairwise learning in domains such as molecular network inference (Vert, 2008; Schrynemackers et al., 2013), recommender systems (Lü et al., 2012) and species interactions prediction (Poisot et al., 2016; et al., 2017), we believe this algorithm to be a useful tool in a variety of settings.

References

Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-Supervised Learning. MIT Press.

Gonzalez, R. C., & Woods, R. E. (2007). Digital Image Processing. Pearson.

Johnson, R., & Zhang, T. (2008). Graph-based semi-supervised learning and spectral kernel design. IEEE Transactions on Information Theory, 54, 275–288.

Liu, H., & Yang, Y. (2015). Bipartite edge prediction via transductive learning over product graphs. Proceedings of the 32nd International Conference on Machine Learning (pp. 1880–1888).
Lü, L., Medo, M., Yeung, C. H., Zhang, Y.-C., Zhang,
Z.-K., & Zhou, T. (2012). Recommender systems.
Physics Reports, 519, 1–49.

, Poisot, T., Waegeman, W., & De Baets, B. (2017).


Linear filtering reveals false negatives in species in-
teraction data. Scientific Reports, 7, 1–8.
Poisot, T., Stouffer, D. B., & Kéfi, S. (2016). Describe,
understand and predict: why do we need networks
in ecology? Functional Ecology, 30, 1878–1882.
Schrynemackers, M., Küffner, R., & Geurts, P. (2013).
On protocols and measures for the validation of su-
pervised methods for the inference of biological net-
works. Frontiers in Genetics, 4, 1–16.

Vert, J.-P. (2008). Reconstruction of biological net-


works by supervised machine learning approaches.
In Elements of computational systems biology, 165–
188.

Waegeman, W., Pahikkala, T., Airola, A., Salakoski,


T., & De Baets, B. (2012). A kernel-based frame-
work for learning graded relations from data. IEEE
Transactions on Fuzzy Systems, 20, 1090–1101.


Figure 2. Examples of missing pixel imputation on the coffee image. (left) Mask indicating which pixels were removed,
blue indicates available, red indicates missing. (middle) The coffee image with the corresponding missing pixels. (right)
The restored image. (from top to bottom) 10% of the pixels randomly removed, 90% of the pixels randomly removed, a
block of the image removed.

Towards an automated method based on Iterated Local Search
optimization for tuning the parameters of Support Vector Machines

Sergio Consoli, Jacek Kustra, Pieter Vos, Monique Hendriks, Dimitrios Mavroeidis
[name.surname]@philips.com
Philips Research, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands

Keywords: Support Vector Machines, Iterated Local Search, online tuning, parameters setting

Abstract

We provide preliminary details and formulation of an optimization strategy under current development that is able to automatically tune the parameters of a Support Vector Machine over new datasets. The optimization strategy is a heuristic based on Iterated Local Search, a modification of classic hill climbing which iterates calls to a local search routine.

1. Introduction

The performance of a Support Vector Machine (SVM) strongly relies on the initial setting of the model parameters [Vapnik, 2000]. The parameters are usually set by training the SVM on a specific dataset and are then fixed when applied to a certain application. The automatic configuration of algorithms faces the same problem as hyper-parameter tuning in machine learning: finding the optimal setting of those parameters is an art by itself, and as such much research on the topic has been explored in the literature [Ceylan & Taşkın, 2016, Lameski et al., 2015, Yang et al., 2012, Sherin & Supriya, 2015, Hutter et al., 2011]. Of the techniques used, grid search (or parameter sweep) is one of the most common methods to approximate optimal parameter values [Bergstra & Bengio, 2012]. Grid search involves an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm, guided by some performance metric (e.g. cross-validation). This traditional approach, however, has several limitations: (i) it is vulnerable to local optima; (ii) it does not provide any global optimality guarantee; (iii) setting an appropriate search interval is an ad-hoc approach; (iv) it is computationally expensive, especially when search intervals need to capture wide ranges.

In this short contribution, we describe our preliminary investigation into an optimization method which tackles the parameter setting problem in SVMs using Iterated Local Search (ILS) [Lourenço et al., 2010], a popular explorative local search method used for solving discrete optimization problems. It belongs to the class of trajectory optimization methods, i.e. at each iteration of the algorithm the search process designs a trajectory in the search space, starting from an initial state and dynamically adding a new, better solution to the curve in each discrete time step. Iterated Local Search mainly consists of two steps. In the first step, a local optimum is reached by applying an appropriate local search phase; in the second step, the search efficiently escapes from that local optimum by performing a perturbation, i.e. a walk in the search space [Lourenço, 1995]. The application of an acceptance criterion to decide which of two local candidate solutions has to be chosen to continue the search process is also an important aspect of the algorithm. In the next section we provide the preliminary details and formulation of our optimization strategy based on ILS for the parameter tuning task in SVMs.

2. ILS for SVM parameters tuning

Given the input vectors x ∈ X and their corresponding output labels y ∈ Y = {−1, 1}, the separation between classes in SVMs is achieved by fitting the hyperplane f(x) that has the optimal distance to the nearest data point used for training of any class: f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b, where n is the total number of training samples. The goal in SVMs is to find the hyperplane which maximizes the minimum distances of the samples on each side of the plane [Cortes & Vapnik, 1995].


A penalty is associated with the instances which are misclassified and added to the minimization function. This is done via the parameter C in the minimization formula:

\arg\min_{f(x) = \omega^{\top} x + b} \; \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} c(f, x_i, y_i) .

By varying C, a trade-off between the accuracy and stability of the function is defined. Larger values of C result in a smaller margin, leading to potentially more accurate classifications; however, overfitting can occur. A mapping of the data with appropriate kernel functions k(x, x′) into a richer feature space, including non-linear features, is applied prior to the hyperplane fitting. Among several kernels in the literature, we consider the Gaussian radial-basis function (RBF): K(x_i, x′) = exp(−γ‖x_i − x′‖²), γ > 0, where γ defines the variance of the RBF, practically defining the shape of the kernel function peaks: lower γ values set the bias low and, correspondingly, high γ values set a high bias.

The proposed ILS under current implementation for SVM tuning uses grid search [Bergstra & Bengio, 2012] as an inner local search routine, which is then iterated in order to make it fine-grained, finally producing the best parameters C and γ found to date. Given a training dataset D and an SVM model Θ, the procedure first generates an initial solution. We use an initial solution produced by grid search. The grid search exhaustively generates candidates from a grid of the parameter values, C and γ, specified in the arrays range_γ ∈ ℝ⁺ and range_C ∈ ℝ⁺. We choose arrays containing five different values for each parameter, so that the grid search method will evaluate 25 different parameter combinations. The range values are taken as different powers of 10 from −2 to 2. Solution quality is evaluated as the accuracy of the SVM by means of k-fold cross validation [McLachlan et al., 2004], and stored in the variable Acc.

Afterwards, the perturbation phase, which represents the core idea of ILS, is applied to the incumbent solution. The goal is to provide a good starting point (i.e. parameter ranges) for the next local search phase of ILS (i.e. the grid search in our case), based on the previous search experience of the algorithm, so as to obtain a better balance between exploration of the search space and wasting time in areas that are not giving good results. Ranges are set as: range_γ = [γ·10⁻², γ·10⁻¹, γ, γ·10, γ·10²] ≡ [γ_inf-down, γ_inf-up, γ, γ_sup-down, γ_sup-up], and range_C = [C·10⁻², C·10⁻¹, C, C·10, C·10²] ≡ [C_inf-down, C_inf-up, C, C_sup-down, C_sup-up].

Imagine that the grid search returns the set of parameters γ′, C′ as a new incumbent solution, whose evaluated accuracy is Acc′. Then the acceptance criterion of this new solution is that it produces a better quality, that is an increased accuracy, than the best solution to date. If this does not happen, the new incumbent solution is rejected and the ranges are updated automatically with the following values: γ_inf-down = rand(γ_inf-down·10⁻¹, γ_inf-down) and C_inf-down = rand(C_inf-down·10⁻¹, C_inf-down), γ_inf-up = rand((γ − γ_inf-up)/2, γ) and C_inf-up = rand((C − C_inf-up)/2, C), γ_sup-down = rand((γ_sup-down − γ)/2, γ) and C_sup-down = rand((C_sup-down − C)/2, C), and γ_sup-up = rand(γ_sup-up·10) and C_sup-up = rand(C_sup-up·10). That is, indifferently for γ and C, the values of the inf-down and sup-up components are random values always taken farther from the current parameter (γ or C), in order to increase the diversification capability of the metaheuristic; while the values of the inf-up and sup-down components are random values always taken closer to the current parameter, in order to increase the intensification strength around the current parameter. This perturbation setting should allow a good balance between the intensification and diversification factors.

Otherwise, if in the acceptance criterion the new incumbent solution, γ′ and C′, is better than the current one, γ and C, i.e. Acc′ > Acc, then this new solution becomes the best solution to date (γ ← γ′, C ← C′), and range_γ and range_C are updated as usual. This procedure continues iteratively until the termination conditions imposed by the user are satisfied, producing at the end the best combination of γ and C as output.
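The loop just described can be summarised in a few lines. The sketch below is a simplified rendering of the procedure using scikit-learn; the helper names, the fixed five-point ranges and the random perturbation factors are our own illustrative choices rather than the authors' exact update rules.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_accuracy(X, y, C_range, gamma_range, k=5):
    """Inner local search: exhaustive grid search, returns the best (C, gamma, accuracy)."""
    best = (None, None, -np.inf)
    for C in C_range:
        for gamma in gamma_range:
            acc = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"), X, y, cv=k).mean()
            if acc > best[2]:
                best = (C, gamma, acc)
    return best

def ils_svm_tuning(X, y, n_iterations=20, rng=np.random.default_rng(0)):
    # Initial solution: grid over powers of ten from 1e-2 to 1e2.
    powers = 10.0 ** np.arange(-2, 3)
    C, gamma, acc = grid_accuracy(X, y, powers, powers)
    for _ in range(n_iterations):
        # Perturbation: new 5-point ranges around the incumbent (simplified: random
        # multiplicative factors instead of the paper's full inf/sup update scheme).
        C_range = C * rng.uniform(0.1, 10.0, size=5)
        gamma_range = gamma * rng.uniform(0.1, 10.0, size=5)
        C_new, gamma_new, acc_new = grid_accuracy(X, y, C_range, gamma_range)
        if acc_new > acc:                      # acceptance criterion
            C, gamma, acc = C_new, gamma_new, acc_new
    return C, gamma, acc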
3. Summary and outlook

We considered the parameter setting task in SVMs by an automated ILS heuristic, which looks to be a promising approach. We are aware that a more detailed description of the algorithm is necessary, along with a thorough computational investigation. This is currently the object of ongoing research, including a statistical analysis and comparison of the proposed algorithm against the standard grid search, in order to quantify and qualify the improvements obtained. Further research will explore the application of this strategy to other SVM kernels, considering also a variety of big, heterogeneous datasets.

References

Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.


Ceylan, O., & Taşkn, G. (2016). SVM parameter selec-


tion based on harmony search with an application to
hyperspectral image classification. 24th Signal Pro-
cessing and Communication Application Conference
(SIU) (pp. 657–660).

Cortes, C., & Vapnik, V. N. (1995). Support-vector


networks. Machine Learning, 20, 273–297.
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011).
Sequential model-based optimization for general al-
gorithm configuration. Proceedings of the 5th Learn-
ing and Intelligent OptimizatioN Conference (LION
5) (pp. 507–523). Rome, Italy.
Lameski, P., Zdravevski, E., Mingov, R., & Kulakov,
A. (2015). Svm parameter tuning with grid search
and its impact on reduction of model over-fitting.
In Rough sets, fuzzy sets, data mining, and granular
computing, vol. 9437 of Lecture Notes in Computer
Science, 464–474. Heidelberg, Germany: Springer-
Verlag.
Lourenço, H. R. (1995). Job-shop scheduling: com-
putational study of local search and large-step opti-
mization methods. European Journal of Operational
Research, 83, 347–364.
Lourenço, H. R., Martin, O. C., & Stützle, T. (2010).
Iterated local search: Framework and applications,
363–397. Boston, MA: Springer US.
McLachlan, G. J., Do, K.-A., & Ambroise, C. (2004).
Analyzing microarray gene expression data. New
York: John Wiley & Sons.
Sherin, B. M., & Supriya, M. H. (2015). Selection and
parameter optimization of svm kernel function for
underwater target classification. Underwater Tech-
nology (UT), 2015 IEEE (pp. 1–5).
Vapnik, V. N. (2000). The nature of statistical learning
theory. New York: Springer-Verlag.

Yang, C., Ding, L., & Liao, S. (2012). Parameter tun-


ing via kernel matrix approximation for support vec-
tor machine. Journal of Computers, 7.

Multi-step-ahead prediction of volatility proxies

Jacopo De Stefani [email protected]


Gianluca Bontempi [email protected]
MLG, Département d'Informatique, Université Libre de Bruxelles, Boulevard du Triomphe CP212, 1050 Brussels,
Belgium
Olivier Caelen1 [email protected]
Dalila Hattab2 [email protected]
Worldline SA/NV, R&D, Bruxelles, Belgium1 /Equens Worldline, R&D, Lille (Seclin), France2

Keywords : financial time series, volatility forecasting, multiple-step-ahead forecast

Though machine learning techniques have been often used for stock prices forecasting, few results are available for market fluctuation prediction. Nevertheless, volatility forecasting is an essential tool for any trader wishing to assess the risk of a financial investment. The main challenge of volatility forecasting is that, since this quantity is not directly observable, we cannot predict its actual value but have to rely on some observers, known as volatility proxies (Poon & Granger, 2003), based either on intraday (Martens, 2002) or daily data. Once a proxy is chosen, the standard approach to volatility forecasting is the well-known GARCH-like model (Andersen & Bollerslev, 1998). In recent years several hybrid approaches have emerged (Kristjanpoller et al., 2014; Dash & Dash, 2016; Monfared & Enke, 2014) which combine GARCH with a non-linear computational approach. What is common to the state of the art is that volatility forecasting is addressed as a univariate and one-step-ahead auto-regressive (AR) time series problem.

The purpose of our work is twofold. First, we aim to perform a statistical assessment of the relationships among the most used proxies in the volatility literature. Second, we explore a NARX (Nonlinear Autoregressive with eXogenous input) approach to estimate multiple steps of the output given the past output and input measurements, where the output and the input are two different proxies. In particular, our preliminary results show that the statistical dependencies between proxies can be used to improve the forecasting accuracy.

1. Background

Three main types of proxies are available in the literature: the proxy σ^{SD,n}, the family of proxies σ^i and σ^G. The first proxy corresponds to the natural definition of volatility (Poon & Granger, 2003), as a rolling standard deviation over a past time window of size n:

\sigma_t^{SD,n} = \sqrt{\frac{1}{n-1} \sum_{i=0}^{n-1} (r_{t-i} - \bar{r}_n)^2}

where r_t = \ln(P_t^{(c)} / P_{t-1}^{(c)}) is the daily continuously compounded return, \bar{r}_n is the average over {t, ..., t−n+1} and P_t^{(c)} are the closing prices. The family of proxies σ_t^i is analytically derived in Garman and Klass (1980). The proxy

\sigma_t^{G} = \sqrt{\omega + \sum_{j=1}^{p} \beta_j (\sigma_{t-j}^{G})^2 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2}

is the volatility estimation returned by a GARCH(1,1) (Hansen & Lunde, 2005), where ε_{t−i} ∼ N(0, 1) and the coefficients ω, α_i, β_j are fitted according to the procedure in (Bollerslev, 1986).
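As an illustration of the first proxy, the short sketch below computes the continuously compounded returns and the rolling-standard-deviation proxy σ^{SD,n} from a series of closing prices; the function name and the pandas-based implementation are ours, not code from the paper.

import numpy as np
import pandas as pd

def rolling_sd_proxy(closing_prices: pd.Series, n: int = 21) -> pd.Series:
    """Volatility proxy sigma^{SD,n}: rolling standard deviation of log-returns over a window of size n."""
    # Daily continuously compounded returns r_t = ln(P_t / P_{t-1}).
    r = np.log(closing_prices / closing_prices.shift(1))
    # Sample standard deviation (ddof=1 matches the 1/(n-1) normalisation in the formula).
    return r.rolling(window=n).std(ddof=1)

# Example: proxies with different window sizes, as used in the correlation study.
# prices = pd.Series(...)  # closing prices of one CAC40 constituent
# sigma_sd_5, sigma_sd_21 = rolling_sd_proxy(prices, 5), rolling_sd_proxy(prices, 21)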
2. The relationship between proxies

The fact that several proxies have been defined for the same latent variable raises the issue of their statistical association. For this reason we computed the proxies, discussed above, on the 40 time series of the French stock market index CAC40 in the period ranging from 05-01-2009 to 22-10-2014 (approximately 6 years). This corresponds to 1489 OHLC (Opening, High, Low, Closing) samples for each time series. Moreover, we obtained the continuously compounded return and the volume variable (representing the number of trades in a given trading day).

Figure 1 shows the aggregated correlation (over all the 40 time series) between the proxies, obtained by meta-analysis (Field, 2001). The black rectangles indicate the results of a hierarchical clustering (Ward Jr, 1963) with k = 3.

As expected, we can observe a correlation clustering phenomenon between proxies belonging to the same family, i.e. σ_t^i and σ_t^{SD,n}. The presence of σ_t^0 in the σ_t^{SD,n} cluster can be explained by the fact that the former represents a degenerate case of the latter when n = 1. Moreover, we find a correlation between the volume and the σ_t^i family.

Figure 1. Summary of the correlations between different volatility proxies for the 40 CAC40 time series (correlation heatmap over r_t, Volume, σ^0–σ^6, σ^{SD,50}, σ^{SD,100}, σ^{SD,250} and σ^G). Note that the continuously compounded return r_t has a very low correlation with all the other variables.

3. NARX proxy forecasting

We focus here on the multi-step-ahead forecasting of the proxy σ^G by addressing the question whether a NARX approach can be beneficial in terms of accuracy. In particular we compare a univariate multi-step-ahead NAR model σ^G_{t+h} = f(σ^G_t, ..., σ^G_{t−m}) + ω with a multi-step-ahead NARX model σ^G_{t+h} = f(σ^G_t, ..., σ^G_{t−m}, σ^X_t, ..., σ^X_{t−m}) + ω, for a specific embedding order m = 5 and for different estimators of the dependency f.
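To make the setup concrete, the sketch below shows one way to build the lagged design matrix and fit a direct h-step-ahead NARX model with a k-nearest-neighbours estimator; the helper names are ours, and scikit-learn's KNeighborsRegressor is an illustrative stand-in for the R implementations used in the paper.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def lagged(series, m, end):
    """Return [x_end, x_{end-1}, ..., x_{end-m}] as one embedding vector."""
    return series[end - m:end + 1][::-1]

def fit_direct_narx(target, exo, m=5, h=10, n_neighbors=5):
    """Direct strategy: one model per horizon h, predicting target_{t+h} from
    (target_t..target_{t-m}, exo_t..exo_{t-m})."""
    X, y = [], []
    for t in range(m, len(target) - h):
        X.append(np.concatenate([lagged(target, m, t), lagged(exo, m, t)]))
        y.append(target[t + h])
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(np.array(X), np.array(y))
    return model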
In particular we compare a naive model (average of the past values), a GARCH(1,1), and two machine learning approaches: a feedforward Artificial Neural Network (single hidden layer, implemented with R nnet) and a k-Nearest Neighbors model (automatic leave-one-out selection of the number of neighbors). Multi-step-ahead prediction is returned by a direct forecasting strategy (Taieb, 2014). The MASE results (Hyndman & Koehler, 2006) from 10 out-of-sample evaluations (Tashman, 2000) in Table 1 show that both machine learning methods outperform the benchmark methods (naive and GARCH) and that the ANN model can take advantage of the additional information provided by the exogenous proxy. The results in Table 2 confirm that this conclusion remains consistent when moving from a single stock time series in a given market to an index time series (S&P500).

4. Conclusion and Future work

We studied the relationships between different proxies and we investigated the impact on the accuracy of volatility forecasting of three parameters: the choice of the exogenous proxy, the machine learning technique and the kind of autoregression. Results are preliminary for the moment. For the final version we expect to provide additional comparisons in terms of the number of series, forecasting horizons h and model orders m.

Table 1. MASE (normalized w.r.t. the accuracy of a naive method) for a 10-step volatility forecasting horizon on a single stock composing the CAC40 index, over the period from 05-01-2009 to 22-10-2014, for different proxy combinations (rows) and different forecasting techniques (columns). The subscript X stands for the NARX model where σ^X is exogenous.

σ^X        | ANN  | kNN  | ANN_X | kNN_X | GARCH(1,1)
σ^6        | 0.07 | 0.08 | 0.06  | 0.11  | 1.34
Volume     | 0.07 | 0.08 | 0.07  | 0.14  | 1.34
σ^{SD,5}   | 0.07 | 0.08 | 0.07  | 0.09  | 1.34
σ^{SD,15}  | 0.07 | 0.08 | 0.06  | 0.10  | 1.34
σ^{SD,21}  | 0.07 | 0.08 | 0.06  | 0.10  | 1.34

Table 2. MASE (normalized w.r.t. the accuracy of a naive method) for a 10-step volatility forecasting horizon on the S&P500 index, over the period from 01-04-2012 to 30-07-2013 as in the work of Dash & Dash (2016), for different proxy combinations (rows) and different forecasting techniques (columns). The subscript X stands for the NARX model where σ^X is exogenous.

σ^X        | ANN  | kNN  | ANN_X | kNN_X | GARCH(1,1)
σ^6        | 0.58 | 0.49 | 0.53  | 0.56  | 1.15
Volume     | 0.58 | 0.49 | 0.57  | 0.66  | 1.15
σ^{SD,5}   | 0.58 | 0.49 | 0.58  | 0.58  | 1.15
σ^{SD,15}  | 0.58 | 0.49 | 0.65  | 0.65  | 1.15
σ^{SD,21}  | 0.58 | 0.49 | 0.56  | 0.65  | 1.15


References
Andersen, T. G., & Bollerslev, T. (1998). Arch and
garch models. Encyclopedia of Statistical Sciences.
Bollerslev, T. (1986). Generalized autoregressive
conditional heteroskedasticity. Journal of econome-
trics, 31, 307–327.
Dash, R., & Dash, P. (2016). An evolutionary hybrid
fuzzy computationally efficient egarch model for vo-
latility prediction. Applied Soft Computing, 45, 40–
60.
Field, A. P. (2001). Meta-analysis of correlation co-
efficients : a monte carlo comparison of fixed-and
random-effects methods. Psychological methods, 6,
161.
Garman, M. B., & Klass, M. J. (1980). On the esti-
mation of security price volatilities from historical
data. Journal of business, 67–78.
Hansen, P. R., & Lunde, A. (2005). A forecast com-
parison of volatility models : does anything beat a
garch (1, 1) ? Journal of applied econometrics, 20,
873–889.
Hyndman, R. J., & Koehler, A. B. (2006). Another
look at measures of forecast accuracy. International
journal of forecasting, 22, 679–688.
Kristjanpoller, W., Fadic, A., & Minutolo, M. C.
(2014). Volatility forecast using hybrid neural net-
work models. Expert Systems with Applications, 41,
2437–2442.
Martens, M. (2002). Measuring and forecasting s&p
500 index-futures volatility using high-frequency
data. Journal of Futures Markets, 22, 497–518.
Monfared, S. A., & Enke, D. (2014). Volatility forecas-
ting using a hybrid gjr-garch neural network model.
Procedia Computer Science, 36, 246–253.
Poon, S.-H., & Granger, C. W. (2003). Forecasting
volatility in financial markets : A review. Journal of
economic literature, 41, 478–539.
Taieb, S. B. (2014). Machine learning strategies for
multi-step-ahead time series forecasting. Doctoral
dissertation, Ph. D. Thesis.
Tashman, L. J. (2000). Out-of-sample tests of forecas-
ting accuracy : an analysis and review. International
journal of forecasting, 16, 437–450.
Ward Jr, J. H. (1963). Hierarchical grouping to opti-
mize an objective function. Journal of the American
statistical association, 58, 236–244.

Generalization Bound Minimization for Active Learning

Tom Viering [email protected]


TU Delft, Mekelweg 4, 2628 CD, Delft, The Netherlands
Jesse Krijthe [email protected]
TU Delft, Mekelweg 4, 2628 CD, Delft, The Netherlands
Marco Loog [email protected]
TU Delft, Mekelweg 4, 2628 CD, Delft, The Netherlands

Keywords: active learning, learning theory, maximum mean discrepancy, generalization

Supervised machine learning models require enough labeled data to obtain good generalization performance. However, for many practical applications such as medical diagnosis or video classification it can be expensive or time consuming to label data (Settles, 2012). Often in practical settings unlabeled data is abundant, but due to high costs only a small fraction can be labeled. In active learning an algorithm chooses unlabeled samples for labeling (Cohn et al., 1994). The idea is that models can perform better with less labeled data if the labeled data is chosen carefully instead of randomly. This way active learning methods make the most of a small labeling budget or can be used to reduce labeling costs.

A generalization bound is an upper bound on the generalization error of the model that holds given certain assumptions. Several works have used generalization bounds to guide the active learning process (Gu & Han, 2012; Gu et al., 2012; Ganti & Gray, 2012; Gu et al., 2014). We have performed a theoretical and empirical study of active learners that choose queries which explicitly minimize generalization bounds, to investigate the relation between bounds and their active learning performance. We limited our study to the kernel regularized least squares model (Rifkin et al., 2003) and the squared loss.

We studied the state-of-the-art Maximum Mean Discrepancy (MMD) active learner that minimizes a generalization bound (Chattopadhyay et al., 2012; Wang & Ye, 2013). The MMD is a divergence measure (Gretton et al., 2012) which is closely related to the Discrepancy measure (Mansour et al., 2009).
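For reference, the (biased) empirical estimate of the squared MMD between a candidate labeled batch and the remaining pool can be computed directly from a kernel matrix; the snippet below is our own illustration of the quantity involved, not the authors' code.

import numpy as np

def mmd_squared(K, batch_idx, pool_idx):
    """Biased empirical MMD^2 between the selected batch and the pool,
    given a precomputed kernel matrix K over all samples."""
    Kbb = K[np.ix_(batch_idx, batch_idx)]
    Kpp = K[np.ix_(pool_idx, pool_idx)]
    Kbp = K[np.ix_(batch_idx, pool_idx)]
    return Kbb.mean() + Kpp.mean() - 2.0 * Kbp.mean()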
One of our novel theoretical results is a comparison of these bounds. We show that the Discrepancy bound on the generalization error is tighter than the MMD bound in the realizable setting (in this setting it is assumed there is no model mismatch). Tighter bounds are generally considered favorable as they estimate the generalization error more accurately. One might therefore also expect them to lead to better labeling choices in active learning when minimized, and therefore we evaluated an active learner that minimizes the Discrepancy.

However, we observed that active learning using the tighter Discrepancy bound performs worse than the MMD. The underlying reason is that these bounds assume worst-case scenarios in order to derive their guarantees, and therefore minimizing these bounds for active learning may result in suboptimal performance in non-worst-case scenarios. In particular, the worst-case scenario assumed by the Discrepancy is, probabilistically speaking, very unlikely to occur compared to the scenario considered by the MMD, and therefore the Discrepancy performs worse for active learning.

This insight led us to introduce the Nuclear Discrepancy, whose bound is looser. The Nuclear Discrepancy considers average-case scenarios which occur more often in practice. Therefore, minimizing the Nuclear Discrepancy leads to an active learning strategy that is more suited to non-worst-case scenarios. Our experiments show that active learning using the Nuclear Discrepancy improves significantly upon the MMD and Discrepancy, especially in the realizable setting.

Our study illustrates that tighter bounds do not guarantee improved active learning performance and that a probabilistic analysis is essential: active learners should optimize their strategy for scenarios that are likely to occur in order to perform well in practice.


References
Chattopadhyay, R., Wang, Z., Fan, W., Davidson, I.,
Panchanathan, S., & Ye, J. (2012). Batch Mode Ac-
tive Sampling Based on Marginal Probability Dis-
tribution Matching. Proceedings of the 18th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD) (pp. 741–749).
Cohn, D., Atlas, L., & Ladner, R. (1994). Improving
generalization with active learning. Machine Learn-
ing, 15, 201–221.
Ganti, R., & Gray, A. (2012). UPAL: Unbiased Pool
Based Active Learning. Proceedings of the 15th In-
ternational Conference on Artificial Intelligence and
Statistics (AISTATS) (pp. 422–431).
Gretton, A., Borgwardt, K. M., Rasch, M. J.,
Schölkopf, B., & Smola, A. (2012). A Kernel Two-
sample Test. Machine Learning Research, 13, 723–
773.
Gu, Q., & Han, J. (2012). Towards Active Learning on
Graphs: An Error Bound Minimization Approach.
Proceedings of the 12th IEEE International Confer-
ence on Data Mining (ICDM) (pp. 882–887).
Gu, Q., Zhang, T., & Han, J. (2014). Batch-Mode Ac-
tive Learning via Error Bound Minimization. Pro-
ceedings of the 30th Conference on Uncertainty in
Artificial Intelligence (UAI).
Gu, Q., Zhang, T., Han, J., & Ding, C. H. (2012).
Selective Labeling via Error Bound Minimization.
Proceedings of the 25th Conference on Advances in
Neural Information Processing Systems (NIPS) (pp.
323–331).
Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009).
Domain Adaptation: Learning Bounds and Algo-
rithms. Proceedings of the 22nd Annual Conference
on Learning Theory (COLT).
Rifkin, R., Yeo, G., & Poggio, T. (2003). Regular-
ized least-squares classification. Advances in Learn-
ing Theory: Methods, Model, and Applications, 190,
131–154.
Settles, B. (2012). Active Learning. Synthesis Lectures
on Artificial Intelligence and Machine Learning, 6,
1–114.
Wang, Z., & Ye, J. (2013). Querying Discriminative
and Representative Samples for Batch Mode Active
Learning. Proceedings of the 19th ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining (KDD) (pp. 158–166).

Projected Estimators for Robust Semi-supervised Classification

Jesse H. Krijthe [email protected]


Radboud University Nijmegen, Nijmegen, The Netherlands
Marco Loog [email protected]
Delft University of Technology, Delft, The Netherlands

Keywords: Semi-supervised learning, Least Squares Classification, Projection

Abstract

For semi-supervised techniques to be applied safely in practice we at least want methods to outperform their supervised counterparts. We study this question for classification using the well-known quadratic surrogate loss function. Unlike other approaches to semi-supervised learning, the procedure proposed in this work does not rely on assumptions that are not intrinsic to the classifier at hand. Using a projection of the supervised estimate onto a set of constraints imposed by the unlabeled data, we find that it is possible to safely improve over the supervised solution in terms of this quadratic loss.

This abstract concerns the work presented in (Krijthe & Loog, 2017).

1. Problem & Setting

We consider the problem of semi-supervised classification using the quadratic loss function, which is also known as least squares classification or Fisher's linear discriminant classification (Hastie et al., 2009; Poggio & Smale, 2003). Suppose we are given an N_l × d matrix with feature vectors X, labels y ∈ {0, 1}^{N_l} and an N_u × d matrix with unlabeled objects X_u from the same distribution as the labeled objects. The goal of semi-supervised learning is to improve the classification decision function f : R^d → R using the unlabeled information in X_u as compared to the case where we do not have these unlabeled objects. In this work, we focus on linear classifiers where f(x) = w^⊤ x.

Much work has been done on semi-supervised classification, in particular on what additional assumptions about the unlabeled data may help improve classification performance. These additional assumptions, while successful in some settings, are less successful in others where they do not hold. In effect they can greatly deteriorate performance when compared to a supervised alternative (Cozman & Cohen, 2006). Since, in semi-supervised applications, the number of labeled objects may be small, the effect of these assumptions is often untestable. In this work, we introduce a conservative approach to training a semi-supervised version of the least squares classifier that is guaranteed to improve over the supervised least squares classifier, in terms of the quadratic loss measured on the labeled and unlabeled examples. To our knowledge this is the first approach that offers such strong, albeit conservative, guarantees for improvement over the supervised solution.

2. Sketch of the Approach

In the supervised setting, using a quadratic surrogate loss (Hastie et al., 2009), the following objective is minimized for w:

L(w, X, y) = ‖Xw − y‖² .  (1)

The supervised solution w_sup is given by the minimization of (1) for w. The well-known closed form solution to this problem is given by

w_sup = (X^⊤ X)^{−1} X^⊤ y .

Our proposed semi-supervised approach is to project the supervised solution w_sup onto the set of all possible classifiers we would be able to get from some labeling of the unlabeled data:

Θ = { (X_e^⊤ X_e)^{−1} X_e^⊤ [y; y_u] | y_u ∈ [0, 1]^{N_u} } ,

where X_e denotes the design matrix of the labeled and unlabeled objects combined.


Figure 1. Ratio of the loss in terms of surrogate loss of semi-supervised and supervised solutions measured on the labeled and unlabeled instances (panels: Self-Learning, Projection, TSVM; one bar per benchmark dataset: BCI, COIL2, Diabetes, Digit1, g241c, g241d, Haberman, Ionosphere, Mammography, Parkinsons, Sonar, SPECT, SPECTF, Transfusion, USPS, WDBC). Values smaller than 1 indicate that the semi-supervised method gives a lower average surrogate loss than its supervised counterpart. Unlike the other semi-supervised procedures, the projection method, evaluated on labeled and unlabeled data, never has higher loss than the supervised procedure, as we prove in Theorem 1 of (Krijthe & Loog, 2017).

Note that this set, by construction, will also contain the solution w_oracle, corresponding to the true but unknown labeling y_e^∗. Typically, w_oracle is a better solution than w_sup and so we would like to find a solution more similar to w_oracle. This can be accomplished by projecting w_sup onto Θ:

w_semi = \min_{w \in \Theta} d(w, w_sup) ,

where d(w, w′) is a particular distance measure that measures the similarity between two classifiers. This is a quadratic programming problem with simple constraints that can be solved using, for instance, a simple gradient descent procedure.
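A minimal sketch of this projection step is given below, using the Euclidean distance as d(w, w′) purely for illustration (the abstract leaves the exact choice of distance measure open) and scipy's bounded least-squares solver to handle the constraint y_u ∈ [0, 1]^{N_u}; all names are ours.

import numpy as np
from scipy.optimize import lsq_linear

def project_supervised(X, y, X_u):
    """Project the supervised least-squares solution onto the set Theta of solutions
    reachable by some soft labeling y_u in [0, 1]^{N_u} (Euclidean-distance version)."""
    w_sup = np.linalg.lstsq(X, y, rcond=None)[0]
    X_e = np.vstack([X, X_u])
    A = np.linalg.pinv(X_e.T @ X_e) @ X_e.T          # maps a full labeling to a classifier
    A_l, A_u = A[:, :len(y)], A[:, len(y):]
    # Labeling of the unlabeled data whose classifier is closest to w_sup:
    # min_{0 <= y_u <= 1} || A_u y_u - (w_sup - A_l y) ||^2  (bounded linear least squares).
    res = lsq_linear(A_u, w_sup - A_l @ y, bounds=(0.0, 1.0))
    return A_l @ y + A_u @ res.x                      # w_semi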
3. Theoretical Guarantee

The main contribution of this work is a proof that the semi-supervised learner that we just described is guaranteed to never lead to worse performance than the supervised classifier, when performance is measured in terms of the quadratic loss on the labeled and unlabeled data. This property is shown empirically in Figure 1. This non-degradation property is important in practical applications, since one would like to be sure that the effort of the collection of, and computation with, unlabeled data does not have an adverse effect. Our work is a conceptual step towards methods with these types of guarantees.

4. Empirical Evidence

Aside from the theoretical guarantee that performance never degrades when measured on the labeled and unlabeled training set in terms of the surrogate loss, experimental results indicate that it not only never degrades, but often improves performance. Our experiments also indicate the results hold when performance is evaluated on objects in a test set that were not used as unlabeled objects during training.

References

Cozman, F., & Cohen, I. (2006). Risks of Semi-Supervised Learning. In O. Chapelle, B. Schölkopf and A. Zien (Eds.), Semi-supervised learning, chapter 4, 56–72. MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning. Springer. 2nd edition.

Krijthe, J. H., & Loog, M. (2017). Projected Estimators for Robust Semi-supervised Classification. Machine Learning, http://arxiv.org/abs/1602.07865.

Poggio, T., & Smale, S. (2003). The Mathematics of Learning: Dealing with Data. Notices of the AMS, 50, 537–544.

Towards an Ethical Recommendation Framework

Dimitris Paraschakis [email protected]


Dept. of Computer Science, Malmö University, SE-205 06 Malmö, Sweden

Keywords: recommender systems, ethics of data mining / machine learning, ethical recommendation framework

Abstract

This study provides an overview of various ethical challenges that complicate the design of recommender systems (RS). The articulated ethical recommendation framework maps RS design stages to the corresponding ethical concerns, and further down to known solutions and the proposed user-adjustable controls. This framework aims to aid RS practitioners in staying ethically alert while taking morally charged design decisions. At the same time, it would give users the desired control over the sensitive moral aspects of recommendations via the proposed "ethical toolbox". The idea is embraced by the participants of our feasibility study.

1. Introduction

The notion of recommendations is built on real-life experiences and therefore perceived by humans as something inherently positive. User studies indeed show that the mere fact of labelling items as "recommendations" increases their chances of being consumed (Cremonesi et al., 2012; Knijnenburg, 2015). Whenever this fact is exploited for reasons beyond serving user needs, an ethical problem arises. Neglecting ethics in recommender systems may lead to privacy violation, identity theft, behavior manipulation, discrimination, offensive / hazardous content, misleading information, etc. Formally, we can define recommendation ethics as the study of the moral system of norms for serving recommendations of products and services to end users in the cyberspace. This system must account for moral implications stemming from both the act of recommending per se, and the enabling technologies involved. According to the recent study by (Tang & Winoto, 2016), there exist only three publications that specifically address the problem of ethical recommendations. Still, they only focus on particular problems in particular applications. A holistic view on the problem of recommendation ethics is currently lacking in the field, despite the massive research that RS attract nowadays. This problem is multifaceted and relates to several interconnected topics that we broadly group into the ethics of data manipulation, algorithm design, and experimentation. An in-depth discussion of these topics with accompanying examples is provided in our original work (Paraschakis, 2017). Here, we briefly outline the related issues in Section 2. In Section 3, we present an ethical recommendation framework that serves two purposes: a) it provides a roadmap for an ethics-inspired design of RS; b) it proposes a toolbox for manual tuning of morally-sensitive components of RS. We evaluate our proposal in Section 4 by presenting the results of the conducted survey. Section 5 concludes our work.

2. An Outline for Discussion

This section pinpoints RS-related ethical issues that are discussed in our original work (Paraschakis, 2017).

2.1. Ethics of data manipulation

• Informed consent for data collection/user profiling
• Data publishing: moral bonds between stakeholders, the failure of anonymization
• Sources of privacy breaches, possible attacks on RS users, GUI-based privacy solutions
• Content filtering and censorship

2.2. Ethics of algorithm design

• Algorithmic opacity, biases, and behavior manipulation
• Explanation interfaces and their challenges
• Price discrimination
• Filter bubbles via news recommendations


Table 1. General user-centric ethical recommendation framework: each RS design stage with its ethical concerns, known countermeasures, and the proposed user-adjustable control.

Data collection. Ethical concerns: privacy breaches, lack of awareness/consent, fake profile injection. Known countermeasures: informed consent, privacy-preserving collaborative filtering, identity verification. User-adjustable control: "Do not track activity" tool. This setting disallows the creation and maintenance of a user profile; types of data can be manually defined and browsed items can be manually deleted (e.g. the "manage history" tool on Amazon).

Data publishing. Ethical concerns: privacy / security / anonymity breaches. Known countermeasures: privacy-preserving data publishing. User-adjustable control: "Do not share data" tool. This option allows local user profiling but forbids sharing data with third parties (even in the presence of anonymization); types of data or categories of allowed recipients can be manually defined.

Algorithm design. Ethical concerns: biases, discrimination, behavior manipulation. Known countermeasures: algorithm audits, reverse engineering, discrimination-aware data mining. User-adjustable control: "Marketing bias" filter. This filter is used to remove any business-driven bias introduced by RS providers, and to set the recommendation engine to the "best match" mode (or other user-selectable modes, such as "cheapest first").

User interface design. Ethical concerns: algorithmic opacity, content censorship. Known countermeasures: explanations, ethical rule set generation, content analysis. User-adjustable control: "Content censorship" filter. This tool can be used to set user-defined exclusion criteria for filtering out inappropriate items or categories; it also contains the option to turn the filter on and off (also with the possibility of scheduling).

A/B testing. Ethical concerns: fairness, side effects, lack of trust / awareness / consent. Known countermeasures: informed consent, possibility to opt out and delete data. User-adjustable control: "Opt out of experiments" tool. This option can be used to reset the recommendation engine to its default algorithm, exclude the user from any future experiments, enable the opt-in option, and delete data from previous experiments.

2.3. Ethics of experimentation

• Famous cases of unethical A/B testing
• Three ways of consent acquisition for A/B testing
• Fairness and possibilities of user control

3. Summary as a Framework

Table 1 summarizes our findings in the form of a user-centric ethical recommendation framework, which maps RS design stages to potential ethical concerns and the recommended countermeasures. As a practical contribution, we propose an "ethical toolbox" comprised of user-adjustable controls corresponding to each design stage. These controls enable users to tune a RS to their individual moral standards. The usability of the provided controls may depend on many factors, such as their layout, frequency of using the system, sensitivity of data, and so on. As a vital first step, however, it is necessary to establish the general stance of users towards the ethics of recommender systems and whether the proposed toolbox would stand as a viable solution. This is done in the next section.

4. Feasibility study

We conduct an online survey (available at http://recommendations.typeform.com/to/kgKNQ0) to find out people's opinions and their preferred course of action regarding five ethical issues of RS that are addressed by the proposed toolbox: user profiling, data publishing, online experiments, marketing bias, and content censorship. The survey was disseminated to Facebook groups of numerous European universities, yielding 214 responses from students and academic staff. The analysis of survey results immediately revealed participants' strong preference for taking morally sensitive issues under their control. In 4 out of 5 studied issues, the majority voted for having a user-adjustable setting within a recommendation engine among other alternative solutions. The survey questions, responses, and analysis can be found in (Paraschakis, 2017).

5. Conclusion

We conclude that multiple moral dilemmas emerge at every stage of RS design, while their solutions are not always evident or effective. In particular, there are many trade-offs to be resolved, such as user privacy vs. personalization, data anonymization vs. data utility, informed consent vs. experimentation bias, and algorithmic transparency vs. trade secrets. A careful risk assessment is crucial for deciding on the strategies of data anonymization or informed consent acquisition required for A/B testing or user profiling. We have found evidence that many big players on the RS market (Facebook, Amazon, Netflix, etc.) have faced loud ethics-related backlashes. Thus, it is important to ensure that a RS design is not only legally and algorithmically justified, but also ethically sound. The proposed framework suggests a new paradigm of ethics-awareness by design, which utilizes existing technologies where possible, and complements them with user-adjustable controls. This idea was embraced by the vast majority of our survey participants, and future work should further test its usability in a fully implemented RS prototype.
available at http://recommendations.typeform.com/
to/kgKNQ0 mented RS prototype.

113
Towards an Ethical Recommendation Framework

References
Cremonesi, P., Garzotto, F., & Turrin, R. (2012). In-
vestigating the persuasion potential of recommender
systems from a quality perspective: An empirical
study. ACM Trans. Interact. Intell. Syst., 2, 11:1–
11:41.
Knijnenburg, B. (2015). A user-tailored approach to
privacy decision support. Doctoral dissertation, Uni-
versity of California.
Paraschakis, D. (2017). Towards an ethical recom-
mendation framework. To appear in: Proceedings
of the 11th IEEE International Conference on Re-
search Challenges in Information Science.
Tang, T., & Winoto, P. (2016). I should not recom-
mend it to you even if you will like it: the ethics of
recommender systems. New Review of Hypermedia
and Multimedia, 22, 111–138.

An Ensemble Recommender System for e-Commerce

Björn Brodén [email protected]


Apptus Technologies, Trollebergsvägen 5, SE-222 29 Lund, Sweden
Mikael Hammar [email protected]
Apptus Technologies, Trollebergsvägen 5, SE-222 29 Lund, Sweden
Bengt J. Nilsson [email protected]
Dept. of Computer Science, Malmö University, SE-205 06 Malmö, Sweden
Dimitris Paraschakis [email protected]
Dept. of Computer Science, Malmö University, SE-205 06 Malmö, Sweden

Keywords: recommender systems, ensemble learning, thompson sampling, e-commerce, priming

Abstract

In our ongoing work we extend the Thompson Sampling (TS) bandit policy for orchestrating the collection of base recommendation algorithms for e-Commerce. We focus on the problem of item-to-item recommendations, for which multiple behavioral and content-based predictors are provided to an ensemble learner. The extended TS-based policy must be able to handle situations when bandit arms are non-stationary and non-answering. Furthermore, we investigate the effects of priming the sampler with pre-set parameters of reward distributions by analyzing the product catalog and/or event history, when such information is available. We report our preliminary results based on the analysis of two real-world e-Commerce datasets.

1. Introduction

A typical task in industrial e-Commerce applications is generating top-N item-to-item recommendations in a non-personalized fashion. Such recommendations are useful in "cold-start" situations when user profiles are very limited or non-existent, for example on landing pages. Even in this case, the cold-start problem manifests itself as a challenge of selecting items that are relevant to a given item. A natural way of tackling this issue is by following the "exploration-exploitation" paradigm of multi-arm bandits (MAB). These algorithms use reinforcement learning to optimize decision making in the face of uncertainty. Another established way of addressing the initial lack of data is to utilize content-based filtering, which recommends items based on their attributes. The efficacy of these two approaches, along with the fact that many real-world recommenders tend to favor simple algorithmic approaches (Paraschakis et al., 2015), motivates the creation of an ensemble learning scheme consisting of a collection of base recommendation components that are orchestrated by a MAB policy. The proposed model has a number of advantages for a prospective vendor: a) it allows one to easily plug recommendation components of any type in and out, without making changes to the main algorithm; b) it is scalable because bandit arms represent algorithms and not single items; c) handling context can be shifted to the level of components, thus eliminating the need for contextual MAB policies. Our approach is detailed in the next section.

2. Approach

The modelling part can be split into two sub-problems:

1. Constructing base recommendation components
2. Choosing a bandit policy for the ensemble learner

2.1. Base recommendation components

We consider two types of components:

1. A content-based component defines the set of items {y} that share the same attribute value with the premise item x:
   x ↦ {y : attribute_i(x) = attribute_i(y)}
   For example, return all items of the same color.


2. A collaborative filtering component defines the set of items {y} that are connected to the premise item x via a certain event type (click, purchase, addition to cart, etc.):
   x ↦ {y : event_i(x)_t → event_i(y)_{t′>t}}
   For example, return all items that were bought after the premise item (across all sessions).

We note that special-purpose components can also be added by a vendor to handle all sorts of contexts.

2.2. Ensemble learner

The goal of our ensemble learner is to recommend top-N items for the premise item by querying the empirically best component(s). We employ the well-known Thompson Sampling (TS) policy (Chapelle & Li, 2011) for several practical reasons: a) its strong theoretical guarantees and excellent empirical performance; b) absence of parameters to tune; c) robustness to observation delays; d) flexibility in re-shaping arm reward distributions (see Section 2.3).

For a K-armed Bernoulli bandit, Thompson Sampling models the expected reward θ_a of each arm a as a Beta distribution with prior parameters α and β: θ_a ∼ Beta(S_{a,t} + α, F_{a,t} + β). In each round t, the arm with the highest sample is played. Success and failure counts S_{a,t} and F_{a,t} are updated according to the observed reward r_{a,t}.

The blind application of this classical TS model would fail in our case because of its two assumptions:

1. One arm pull per round. Because the selected component may return only few (or even zero!) items for a given query, pulling one arm at a time may not be sufficient to fill in the top-N recommendation list.

2. Arms are stationary. Because collaborative filtering components improve their performance over time, they have non-stationary rewards.

Therefore, our ongoing work extends Thompson Sampling to adapt to the task at hand. To address the first problem, we allow multiple arms to be pulled in each round and adjust the reward system accordingly. The second problem can be solved by dividing each component into sub-components of relatively stable behavior.

2.3. Priming the sampler

Apart from the proposed modifications of the TS model, we examine the effects of priming the sampler by pre-setting the prior parameters α and β of the reward distributions. We consider two realistic scenarios where the estimation of these priors can be done:

1. Newly launched website. In this case, the estimation of the parameters relies solely on the analysis of the product catalog.

2. Pre-existing website. In this case, the estimation of the parameters can be done by utilizing the event history.

In both scenarios, we must be able to reason about the expected mean µ and variance σ² of the reward distributions based on the analysis of the available data. We can then compute α and β as follows:

\alpha = -\frac{\mu \lambda}{\sigma^2} , \qquad \beta = \frac{(\mu - 1)\lambda}{\sigma^2} , \qquad \lambda = \sigma^2 + \mu^2 - \mu . \qquad (1)
items for a given query, pulling one arm at a time
may not be sufficient to fill in the top-N recom-
mendation list.
2. Arms are stationary. Because collaborative filter-
ing components improve their performance over
time, they have non-stationary rewards.
Therefore, our ongoing work extends Thompson Sam-
pling to adapt to the task at hand. To address the first
problem, we allow multiple arms to be pulled in each Figure 1. TS vs. baselines (measured in hit rate)
round and adjust the reward system accordingly. The
second problem can be solved by dividing each compo- nificantly outperforms the baselines and consistently
nent in sub-components of relatively stable behavior. outperforms state-of-the-art MAB policies by a small
margin, which justifies our choice of method. Future
2.3. Priming the sampler work will demonstrate the predictive superiority of the
extended TS in relation to the standard TS policy.
Apart from the proposed modifications of the TS
Furthermore, we plan to examine what can be gained
model, we examine the effects of priming the sam-
by priming the sampler and how exactly it can be done.


Acknowledgments

This research is part of the research projects "Automated System for Objectives Driven Merchandising", funded by the VINNOVA innovation agency (http://www.vinnova.se/en/), and "Improved Search and Recommendation for e-Commerce", funded by the Knowledge foundation (http://www.kks.se).

We express our gratitude to Apptus Technologies (http://www.apptus.com) for the provided datasets and computational resources.

References
Chapelle, O., & Li, L. (2011). An empirical evaluation
of thompson sampling. Proceedings of the 24th In-
ternational Conference on Neural Information Pro-
cessing Systems (pp. 2249–2257).
Paraschakis, D., Holländer, J., & Nilsson, B. J. (2015).
Comparative evaluation of top-n recommenders in e-
commerce : an industrial perspective. Proceedings of
the 14th IEEE International Conference on Machine
Learning and Applications (pp. 1024–1031).

Ancestral Causal Inference (Extended Abstract)

Sara Magliacane [email protected]


VU Amsterdam, De Boelelaan 1083a, 1081 HV Amsterdam, The Netherlands
Tom Claassen [email protected]
Radboud University Nijmegen, Postbus 9010, 6500GL Nijmegen, The Netherlands
Joris M. Mooij [email protected]
University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands

Keywords: causal inference, constraint-based causal discovery, structure learning

Abstract

This is an extended abstract of the NIPS 2016 paper (Magliacane et al., 2016).

Discovering causal relations from data is at the foundation of the scientific method. Traditionally, cause-effect relations have been recovered from experimental data in which the variable of interest is perturbed, but seminal work like the do-calculus (Pearl, 2009) and the PC/FCI algorithms (Spirtes et al., 2000; Zhang, 2008) demonstrate that, under certain assumptions, it is already possible to obtain significant causal information by using only observational data.

Recently, there have been several proposals for combining observational and experimental data to discover causal relations. These causal discovery methods are usually divided into two categories: constraint-based and score-based methods. Score-based methods typically evaluate models using a penalized likelihood score, while constraint-based methods use statistical independences to express constraints over possible causal models. The advantages of constraint-based over score-based methods are the ability to handle latent confounders naturally, no need for parametric modeling assumptions and an easy integration of complex background knowledge, especially for logic-based methods.

Two major disadvantages of constraint-based methods are: (i) vulnerability to errors in statistical independence test results, which are quite common in real-world applications, and (ii) no ranking or estimation of the confidence in the causal predictions. Several approaches address the first issue and improve the reliability of constraint-based methods by exploiting redundancy in the independence information. Unfortunately, existing approaches have to choose to sacrifice either accuracy by using a greedy method (Claassen & Heskes, 2012; Triantafillou & Tsamardinos, 2015), or scalability by formulating a discrete optimization problem on a super-exponentially large search space (Hyttinen et al., 2014). Additionally, the second issue is addressed only in limited cases.

In (Magliacane et al., 2016) we propose Ancestral Causal Inference (ACI), a logic-based method that provides a comparable accuracy to the best state-of-the-art constraint-based methods such as HEJ (Hyttinen et al., 2014), but improves on the scalability by using a more coarse-grained representation. Instead of representing direct causal relations, in ACI we represent and reason only with ancestral relations ("indirect" causal relations), which define an ancestral structure:

Definition 1. An ancestral structure is any relation ⇢ ("causes") on the observed variables that satisfies the non-strict partial order axioms:

(reflexivity): X ⇢ X,
(transitivity): X ⇢ Y ∧ Y ⇢ Z ⟹ X ⇢ Z,
(antisymmetry): X ⇢ Y ∧ Y ⇢ X ⟹ X = Y.

Though still super-exponentially large, this representation drastically reduces computation time, as shown in the evaluation. Moreover, this representation turns out to be very convenient, because in real-world applications the distinction between direct causal relations and ancestral relations is not always clear or necessary.

To solve the vulnerability to statistical errors in independence tests, ACI reformulates causal discovery as an optimization problem. Given a list I of weighted input statements (i_j, w_j), where i_j is the input statement (e.g. an independence test result) and w_j is the associated weight (representing its confidence), we define the loss function as the sum of the weights of the input statements that are not satisfied in A ∈ 𝒜, where 𝒜 is the set of all possible ancestral structures:

L(A, I) := Σ_{(i_j, w_j) ∈ I : i_j is not satisfied in A} w_j.

We explore two simple weighting schemes:

• a frequentist approach, in which for any appropriate frequentist statistical test with independence as null hypothesis, we define the weight w = |log p − log α|, where p is the p-value of the test and α is the significance level (e.g., 5%);

• a Bayesian approach, in which the weight of each input i using data set D is

w = log [ p(i|D) / p(¬i|D) ] = log [ p(D|i) p(i) / ( p(D|¬i) p(¬i) ) ],

where the prior probability p(i) can be used as a tuning parameter.

Under the standard assumptions for causal discovery (i.e. the Causal Markov and Faithfulness assumption), ACI implements five rules that relate ancestral relations and input independences, defining which inputs are not satisfied in a given ancestral structure. For example, for X, Y, W disjoint (sets of) variables, one of the ACI rules is:

(X ⊥⊥ Y | W) ∧ ¬(X ⇢ W) =⇒ ¬(X ⇢ Y),

where X ⊥⊥ Y | W represents the conditional independence of X and Y conditioning on W, while ¬(X ⇢ Y) represents the fact that X is not a cause of Y.

Using this loss function, we propose a method to score predictions according to their confidence. This is very important for practical applications, as the low reliability of the predictions of constraint-based methods has been a major impediment to their widespread usage. We define the confidence score for a statement s as:

C(s) = min_{A ∈ 𝒜} L(A, I + (¬s, ∞)) − min_{A ∈ 𝒜} L(A, I + (s, ∞)).

It can be shown that this score is an approximation of the marginal probability of the statement and that it satisfies certain theoretical guarantees, like soundness and asymptotic consistency, given certain reasonable assumptions on the weights of all input statements.

Figure 1. Synthetic data: execution times (log-scale) of HEJ and ACI as a function of the number of variables.

Figure 2. Synthetic data: example precision-recall curve for non-causal predictions for 6 variables, maximum conditioning set c = 1, frequentist test with α = 0.05.

We evaluate on synthetic data and show that ACI can outperform the state-of-the-art (HEJ, equipped with our scoring method), achieving a speedup of several orders of magnitude (as summarised in Figure 1), while still providing a comparable accuracy, as we show in an example precision-recall curve in Figure 2. In the full paper, we also illustrate its practical feasibility by applying it on a challenging protein data set that so far had only been addressed with score-based methods.
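To make the shape of the loss and of the confidence score concrete, the following is a minimal brute-force Python sketch, not the optimized solver used by ACI or HEJ. It assumes a small, explicitly enumerated collection of candidate ancestral structures and a user-supplied predicate satisfies(statement, structure); both are hypothetical placeholders introduced only for this illustration.

```python
import math

def frequentist_weight(p_value, alpha=0.05):
    # frequentist weighting scheme: w = |log p - log alpha|
    return abs(math.log(p_value) - math.log(alpha))

def aci_loss(structure, weighted_inputs, satisfies):
    # L(A, I): total weight of the input statements that A violates
    return sum(w for stmt, w in weighted_inputs if not satisfies(stmt, structure))

def confidence(statement, structures, weighted_inputs, satisfies):
    # C(s): adding (not s, infinity) to I restricts the minimisation to
    # structures in which s does not hold, and (s, infinity) to those where it does
    loss_without_s = min((aci_loss(a, weighted_inputs, satisfies)
                          for a in structures if not satisfies(statement, a)),
                         default=math.inf)
    loss_with_s = min((aci_loss(a, weighted_inputs, satisfies)
                       for a in structures if satisfies(statement, a)),
                      default=math.inf)
    return loss_without_s - loss_with_s
```

In practice the set of ancestral structures is far too large to enumerate; the point of the sketch is only to show how the weighted inputs, the loss and the score fit together.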


References
Claassen, T., & Heskes, T. (2012). A Bayesian approach
to constraint-based causal inference. UAI (pp. 207–
216).
Hyttinen, A., Eberhardt, F., & Järvisalo, M. (2014).
Constraint-based causal discovery: Conflict resolu-
tion with Answer Set Programming. UAI (pp. 340–
349).
Magliacane, S., Claassen, T., & Mooij, J. M. (2016).
Ancestral causal inference. NIPS.
Pearl, J. (2009). Causality: models, reasoning and
inference. Cambridge University Press.

Spirtes, P., Glymour, C., & Scheines, R. (2000). Cau-


sation, prediction, and search. MIT press.
Triantafillou, S., & Tsamardinos, I. (2015). Constraint-
based causal discovery from multiple interventions
over overlapping variable sets. Journal of Machine
Learning Research, 16, 2147–2205.
Zhang, J. (2008). On the completeness of orientation
rules for causal discovery in the presence of latent
confounders and selection bias. Artificial Intelligence,
172, 1873–1896.

Exceptional Model Mining in Ubiquitous and Social Environments

Martin Atzmueller [email protected]


Tilburg University (TiCC), Warandelaan 2, 5037 AB Tilburg, The Netherlands

Keywords: exceptional model mining, subgroup discovery, community detection, social interaction networks

Abstract

Exceptional model mining in ubiquitous and social environments includes the analysis of resources created by humans (e. g., social media) as well as those generated by sensor devices in the context of (complex) interactions. This paper provides a structured overview on a line of work comprising a set of papers that focus on local exceptionality detection in ubiquitous and social environments and the corresponding complex social interaction networks.

1. Introduction

In ubiquitous and social environments, a variety of heterogeneous multi-relational data is generated by sensors and social media (Atzmueller, 2012a). From these, a set of complex social interaction networks (Atzmueller, 2014) can be constructed, capturing distinct facets of the interaction space (Mitzlaff et al., 2014). Here, local exceptionality detection – based on subgroup discovery (Klösgen, 1996; Wrobel, 1997; Atzmueller, 2015) and exceptional model mining – provides flexible approaches for data exploration, assessment, and the detection of unexpected and interesting phenomena.

Subgroup discovery is an approach for discovering interesting subgroups – as an instance of local pattern detection (Morik, 2002). The interestingness is usually defined by a certain property of interest formalized by a quality function. In the simplest case, a binary target variable is considered, where the share in a subgroup can be compared to the share in the dataset in order to detect (exceptional) deviations. More complex target concepts consider sets of target variables. In particular, exceptional model mining (Leman et al., 2008; Duivesteijn et al., 2012; Duivesteijn et al., 2016) focuses on more complex quality functions.

As a revision of (Atzmueller, 2016b), this paper summarizes formalizations and applications of subgroup discovery and exceptional model mining in the context of social interaction networks.

2. Methods

Social interaction networks (Atzmueller, 2014; Mitzlaff et al., 2011; Mitzlaff et al., 2013) focus on user-related social networks in social media, capturing social relations inherent in social interactions, social activities and other social phenomena which act as proxies for social user-relatedness.

Exploratory data analysis is an important approach, e. g., for getting first insights into the data. In particular, descriptive data mining aims to uncover certain patterns for characterization and description of the data and the captured relations. Typically, the goal of description-oriented methods is not only to find an actionable model, but also a human-interpretable set of patterns (Mannila, 2000).

Subgroup discovery and exceptional model mining are prominent methods for local exceptionality detection that can be configured and adapted to various analytical tasks. Local exceptionality detection especially supports the goal of explanation-aware data mining (Atzmueller & Roth-Berghofer, 2010), due to its more interpretable results, e. g., for characterizing a set of data, for concept description, for providing regularities and associations between elements in general, and for detecting and characterizing unexpected situations, e. g., events or episodes. In the following, we summarize approaches and methods for local exceptionality detection on attributed graphs, for behavioral characterization, and for spatio-temporal analysis. Furthermore, we address issues of scalability and large-scale data processing.
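As an illustration of the binary-target case sketched in Section 1, the following small Python example implements one common quality function, weighted relative accuracy, which compares the target share inside a subgroup with the share in the full dataset. It is only one possible instantiation, not the specific quality functions used in the papers summarized here.

```python
def wracc(subgroup_mask, target):
    """Weighted relative accuracy for a binary target: the subgroup's relative
    size times the difference between the target share inside the subgroup and
    the share in the whole dataset."""
    n = len(target)
    n_sub = sum(subgroup_mask)
    if n_sub == 0:
        return 0.0
    share_total = sum(target) / n
    share_sub = sum(t for m, t in zip(subgroup_mask, target) if m) / n_sub
    return (n_sub / n) * (share_sub - share_total)

# toy example: a subgroup covering half of six instances
print(wracc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```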


2.1. Descriptive Community Detection

Communities can intuitively be defined as subsets of nodes of a graph with a dense structure in the corresponding subgraph. However, for mining such communities usually only structural aspects are taken into account. Typically, no concise nor easily interpretable community description is provided.

In (Atzmueller et al., 2016a), we focus on description-oriented community detection using subgroup discovery. For providing both structurally valid and interpretable communities we utilize the graph structure as well as additional descriptive features of the graph's nodes. We aim at identifying communities according to standard community quality measures, while providing characteristic descriptions at the same time. We propose several optimistic estimates of standard community quality functions to be used for efficient pruning of the search space in an exhaustive branch-and-bound algorithm. We present examples of an evaluation using five real-world data sets, obtained from three different social media applications, showing runtime improvements of several orders of magnitude. The results also indicate significant semantic structures compared to the baselines. A further application of this method to the exploratory analysis of social media using geo-references is demonstrated in (Atzmueller, 2014; Atzmueller & Lemmerich, 2013). Furthermore, a scalable implementation of the described description-oriented community detection approach is given in (Atzmueller et al., 2016b), which is also suited for large-scale data processing utilizing the Map/Reduce framework (Dean & Ghemawat, 2008).

2.2. Characterization of Social Behavior

Important structures that emerge in social interaction networks are given by subgroups. As outlined above, we can apply community detection in order to mine both the graph structure and descriptive features in order to obtain description-oriented communities. However, we can also analyze subgroups in a social interaction network from a compositional perspective, i. e., neglecting the graph structure. Then, we focus on the attributes of subsets of nodes or on derived parameters of these, e. g., corresponding to roles, centrality scores, etc. In addition, we can also consider sequential data, e. g., for characterization of exceptional link trails, i. e., sequential transitions, as presented in (Atzmueller, 2016a).

In (Atzmueller, 2012b), we discuss a number of exemplary analysis results of social behavior in mobile social networks, focusing on the characterization of links and roles. For that, we describe the configuration, adaptation and extension of the subgroup discovery methodology in that context. In addition, we can analyze multiplex networks by considering the match between different networks, and deviations between the networks, respectively. Outlining these examples, we demonstrate that local exceptionality detection is a flexible approach for compositional analysis in social interaction networks.

2.3. Exceptional Model Mining for Spatio-Temporal Analysis

Exploratory analysis on ubiquitous data needs to handle different heterogeneous and complex data types. In (Atzmueller, 2014; Atzmueller et al., 2015), we present an adaptation of subgroup discovery using exceptional model mining formalizations on ubiquitous social interaction networks. Then, we can detect locally exceptional patterns, e. g., corresponding to bursts or special events in a dynamic network. Furthermore, we propose subgroup discovery and assessment approaches for obtaining interesting descriptive patterns and provide a novel graph-based analysis approach for assessing the relations within the obtained subgroup set. This exploratory visualization approach allows for the comparison of subgroups according to their relations to other subgroups and for the inclusion of further parameters, e. g., geo-spatial distribution indicators. We present and discuss analysis results utilizing a real-world ubiquitous social media dataset.

3. Conclusions and Outlook

Subgroup discovery and exceptional model mining provide powerful and comprehensive methods for knowledge discovery and exploratory analysis in the context of local exceptionality detection. In this paper, we presented the corresponding approaches and methods, specifically targeting social interaction networks, and showed how to implement local exceptionality detection on both a methodological and practical level.

Interesting future directions for local exceptionality detection in social contexts include extended postprocessing, presentation and assessment options, e. g., (Atzmueller et al., 2006; Atzmueller & Puppe, 2008; Atzmueller, 2015). In addition, extensions to predictive modeling, e. g., link prediction (Scholz et al., 2013; Atzmueller, 2014), are interesting options to explore. Furthermore, extending the analysis of sequential data, e. g., based on Markov chains as exceptional models (Atzmueller et al., 2016c; Atzmueller, 2016a; Atzmueller et al., 2017), as well as group and network dynamics (Atzmueller et al., 2014; Kibanov et al., 2014), are further interesting options for future work.


References

Atzmueller, M. (2012a). Mining Social Media. Informatik Spektrum, 35, 132–135.

Atzmueller, M. (2012b). Mining Social Media: Key Players, Sentiments, and Communities. WIREs Data Mining and Knowledge Discovery, 2, 411–419.

Atzmueller, M. (2014). Data Mining on Social Interaction Networks. JDMDH, 1.

Atzmueller, M. (2015). Subgroup Discovery. WIREs Data Mining and Knowledge Discovery, 5, 35–49.

Atzmueller, M. (2016a). Detecting Community Patterns Capturing Exceptional Link Trails. IEEE/ACM ASONAM. Boston, MA, USA: IEEE.

Atzmueller, M. (2016b). Local Exceptionality Detection on Social Interaction Networks. ECML-PKDD 2016 (pp. 485–488). Springer.

Atzmueller, M., Baumeister, J., & Puppe, F. (2006). Introspective Subgroup Analysis for Interactive Knowledge Refinement. AAAI FLAIRS (pp. 402–407). AAAI Press.

Atzmueller, M., Doerfel, S., & Mitzlaff, F. (2016a). Description-Oriented Community Detection using Exhaustive Subgroup Discovery. Information Sciences, 329, 965–984.

Atzmueller, M., Ernst, A., Krebs, F., Scholz, C., & Stumme, G. (2014). On the Evolution of Social Groups During Coffee Breaks. WWW 2014 (Companion). New York, NY, USA: ACM Press.

Atzmueller, M., & Lemmerich, F. (2013). Exploratory Pattern Mining on Social Media using Geo-References and Social Tagging Information. IJWS, 2.

Atzmueller, M., Mollenhauer, D., & Schmidt, A. (2016b). Big Data Analytics Using Local Exceptionality Detection. In Enterprise Big Data Engineering, Analytics, and Management. IGI Global.

Atzmueller, M., Mueller, J., & Becker, M. (2015). Exploratory Subgroup Analytics on Ubiquitous Data, vol. 8940 of LNAI. Heidelberg, Germany: Springer.

Atzmueller, M., & Puppe, F. (2008). A Case-Based Approach for Characterization and Analysis of Subgroup Patterns. Applied Intelligence, 28, 210–221.

Atzmueller, M., & Roth-Berghofer, T. (2010). The Mining and Analysis Continuum of Explaining Uncovered. AI-2010. London, UK: SGAI.

Atzmueller, M., Schmidt, A., & Kibanov, M. (2016c). DASHTrails: An Approach for Modeling and Analysis of Distribution-Adapted Sequential Hypotheses and Trails. WWW 2016 (Companion). ACM Press.

Atzmueller, M., Schmidt, A., Kloepper, B., & Arnu, D. (2017). HypGraphs: An Approach for Analysis and Assessment of Graph-Based and Sequential Hypotheses. New Frontiers in Mining Complex Patterns. Heidelberg, Germany: Springer.

Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51, 107–113.

Duivesteijn, W., Feelders, A., & Knobbe, A. J. (2012). Different Slopes for Different Folks: Mining for Exceptional Regression Models with Cook's Distance. ICDM (pp. 868–876). ACM Press, New York.

Duivesteijn, W., Feelders, A. J., & Knobbe, A. (2016). Exceptional Model Mining. DMKD, 30, 47–98.

Kibanov, M., Atzmueller, M., Scholz, C., & Stumme, G. (2014). Temporal Evolution of Contacts and Communities in Networks of Face-to-Face Human Interactions. Sci China Information Sciences, 57.

Klösgen, W. (1996). Explora: A Multipattern and Multistrategy Discovery Assistant. In Advances in Knowledge Discovery and Data Mining. AAAI.

Leman, D., Feelders, A., & Knobbe, A. (2008). Exceptional Model Mining. PKDD (pp. 1–16). Springer.

Mannila, H. (2000). Theoretical Frameworks for Data Mining. SIGKDD Explor., 1, 30–32.

Mitzlaff, F., Atzmueller, M., Benz, D., Hotho, A., & Stumme, G. (2011). Community Assessment using Evidence Networks, vol. 6904 of LNAI. Springer.

Mitzlaff, F., Atzmueller, M., Benz, D., Hotho, A., & Stumme, G. (2013). User-Relatedness and Community Structure in Social Interaction Networks. CoRR/abs, 1309.3888.

Mitzlaff, F., Atzmueller, M., Hotho, A., & Stumme, G. (2014). The Social Distributional Hypothesis. Journal of Social Network Analysis and Mining, 4.

Morik, K. (2002). Detecting Interesting Instances, vol. 2447 of LNCS, 13–23. Springer Berlin Heidelberg.

Scholz, C., Atzmueller, M., Barrat, A., Cattuto, C., & Stumme, G. (2013). New Insights and Methods For Predicting Face-To-Face Contacts. ICWSM. AAAI.

Wrobel, S. (1997). An Algorithm for Multi-Relational Discovery of Subgroups. Proc. PKDD-97 (pp. 78–87). Heidelberg, Germany: Springer.

PRIMPing Boolean Matrix Factorization through Proximal
Alternating Linearized Minimization

Sibylle Hess [email protected]


Katharina Morik [email protected]
Nico Piatkowski [email protected]
TU Dortmund University, Computer Science 8, Otto-Hahn-Str. 12, Dortmund, Germany

Keywords: Tiling, Boolean Matrix Factorization, Minimum Description Length principle, Proximal Alternating
Linearized Minimization, Nonconvex-Nonsmooth Minimization

Abstract

We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. We demonstrate the superior robustness of the new approach in the presence of several kinds of noise and types of underlying structure. Experimental results on image data show that the new method identifies interpretable patterns which explain the data almost always better than the competing algorithms.

1. Introduction

In a large range of data mining tasks such as Market Basket Analysis, Text Mining, Collaborative Filtering or DNA Expression Analysis, we are interested in the exploration of data which is represented by a binary matrix. Here, we seek sets of columns and rows whose intersecting positions frequently feature a one. This identifies, e.g., groups of users together with their shared preferences, genes that are often co-expressed among several tissue samples, or words that occur together in documents describing the same topic.

The identification of r such sets of columns and rows is formally stated by a factorization of rank r. Thereby, the m×n data matrix is approximated by the Boolean product of two matrices Y ∈ {0, 1}^{m×r} and X ∈ {0, 1}^{n×r} such that D ≈ Y X^T. Assuming that data matrices compound a structural component, which can be expressed by a suitable factorization, and a haphazard component, denoted as noise N = D − Y X^T, the objective is difficult to delineate: where to draw the line between structure and noise? Are there natural limitations on the rank of the factorization? To what extent may the groups of columns and rows, identified by the rank-one factorizations Y_{·s} X_{·s}^T, overlap? Miettinen and Vreeken (2014) successfully apply the Minimum Description Length (MDL) principle to reduce these considerations into one objective: exploit just as many regularities as serves the compression of the data. Identifying regularities with column-row interrelations, the description length counterbalances the complexity of the model (derived interrelations) and the fit to the data, measured by the size of the encoded data using the model. Thereby, an automatic determination of the factorization rank r is enabled.

Since the Boolean factorization of a given rank, yielding the smallest approximation error, cannot be approximated within any factor in polynomial time (unless NP=P), state-of-the-art algorithms rely on heuristics. Various greedy methods are proposed in order to filter the structure from the noise by minimization of a suitable description length (Miettinen & Vreeken, 2014; Lucchese et al., 2014; Karaev et al., 2015). However, the experiments indicate at large that the quality considerably varies depending on the distribution of noise and the characteristics of the dataset (Miettinen & Vreeken, 2014; Karaev et al., 2015).

For real-world datasets, it is difficult (if not impossible) to estimate these aspects, to choose the appropriate algorithm or to assess its quality on the given dataset. Believing that the unsteady performance is due to a lack of theoretical foundation, we introduce the method Primp (Hess et al., 2017) to numerically optimize a real-valued approximation of the cost measure as known from the algorithms Krimp (Siebes et al., 2006) and Slim (Smets & Vreeken, 2012).
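As a small illustration of the Boolean product and the noise component introduced above, the following NumPy snippet builds a toy rank-2 factorization, forms the reconstruction Y X^T under Boolean (OR/AND) semantics, and reads off N = D − Y X^T after flipping one entry. The matrices are made up purely for illustration and are not taken from the paper.

```python
import numpy as np

# toy rank-2 factorization: Y is 4x2, X is 5x2, both Boolean
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=bool)
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1],
              [1, 1]], dtype=bool)

# Boolean product: entry (i, j) is 1 iff at least one rank-one tile covers it
reconstruction = (Y.astype(int) @ X.T.astype(int)) > 0

D = reconstruction.copy()
D[0, 3] = True                                   # flip one entry: "noise"
N = D.astype(int) - reconstruction.astype(int)   # noise component N = D - Y X^T
print(reconstruction.astype(int))
print(N)
```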


We assess the algorithms' ability to filter the true underlying structure from the noise. Therefore, we compare various performance measures in a controlled setting of synthetically generated data as well as for real-world data. We show that Primp is capable of recovering the latent structure in spite of varying database characteristics and noise distributions. In addition, we visualize the derived categorization into tiles by means of two image datasets, demonstrating the interpretability of the groupings, as shown for one of the datasets in Fig. 1.

Figure 1. Reconstructions of the Space Invaders image on the top left by the three rank-one factorizations returned by Primp. Best viewed in color.

2. Primp

We sketch our method Primp in Algorithm 1. A binary data matrix D, rank increment ∆r ∈ ℕ and the maximum number of iterations K are the input of this algorithm. For every considered rank, Primp employs the Proximal Alternating Linearized Minimization (PALM) (Bolte et al., 2014) to numerically minimize the function

F(X, Y) + φ(X) + φ(Y),

where F is a smooth relaxation of the cost measure and φ penalizes non-binary values. Specifically, we choose φ(X) = Σ_{i,j} Λ(X_ij), which employs the one-dimensional function

Λ(x) = −|1 − 2x| + 1 if x ∈ [0, 1], and Λ(x) = ∞ otherwise.

Algorithm 1 Primp(D, ∆r, K)
1: (X_K, Y_K) ← (∅, ∅)
2: for r ∈ {∆r, 2∆r, 3∆r, . . .} do
3:   (X_0, Y_0) ← IncreaseRank(X_K, Y_K, ∆r)
4:   for k ∈ {0, . . . , K − 1} do
5:     α_k^{-1} ← M_{∇X F}(Y_k)¹
6:     X_{k+1} ← prox_{α_k φ}(X_k − α_k ∇_X F(X_k, Y_k))
7:     β_k^{-1} ← M_{∇Y F}(X_{k+1})
8:     Y_{k+1} ← prox_{β_k φ}(Y_k − β_k ∇_Y F(X_{k+1}, Y_k))
9:   end for
10:  (X, Y) ← RoundBinary(X_K, Y_K)
11:  if r − r(X, Y) > 1 then
12:    return (X, Y)
13:  end if
14: end for

¹ Step sizes α_k and β_k have to be smaller than the inverse Lipschitz modulus in order to guarantee monotonic convergence. Thus, step sizes are multiplied with a constant smaller but close to one in practice.

We show that the gradients ∇_X F and ∇_Y F are Lipschitz continuous with moduli M_{∇X F} and M_{∇Y F}, which guarantees that PALM, performed in lines 4-9, converges in a nonincreasing sequence of function values to a critical point.

The proximal mapping of φ, used in lines 6 and 8, is a function which returns a matrix satisfying the following minimization criterion:

prox_φ(X) ∈ arg min_{X*} { (1/2) ‖X − X*‖² + φ(X*) }.

Loosely speaking, the proximal mapping gives its argument a little push into a direction which minimizes φ. We see in Algorithm 1 that the evaluation of this operator is a base operation. Thus, we derive a closed form of the proximal mapping of φ.
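The closed form itself is derived in the full paper; as a rough illustration only, the following NumPy sketch evaluates the proximal mapping of a step-size-scaled φ entrywise by comparing the minimizers of the two linear branches of Λ on [0, 0.5] and [0.5, 1]. It is an assumption-laden stand-in written for this summary, not the expression used by Primp.

```python
import numpy as np

def Lambda(x):
    # one-dimensional penalty: 0 at 0 and 1, maximal (value 1) at x = 0.5
    return 1.0 - np.abs(1.0 - 2.0 * x)

def prox_phi(V, step):
    """Entrywise proximal mapping of step * phi with phi(X) = sum_ij Lambda(X_ij),
    restricted to [0, 1]: argmin_x 0.5 * (x - v)^2 + step * Lambda(x)."""
    # candidate from the branch Lambda(x) = 2x on [0, 0.5]
    c1 = np.clip(V - 2.0 * step, 0.0, 0.5)
    # candidate from the branch Lambda(x) = 2 - 2x on [0.5, 1]
    c2 = np.clip(V + 2.0 * step, 0.5, 1.0)
    f1 = 0.5 * (c1 - V) ** 2 + step * Lambda(c1)
    f2 = 0.5 * (c2 - V) ** 2 + step * Lambda(c2)
    return np.where(f1 <= f2, c1, c2)

# entries below 0.5 are pushed towards 0, entries above 0.5 towards 1
print(prox_phi(np.array([[0.1, 0.45], [0.55, 0.9]]), step=0.05))
```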
After the numerical minimization, the matrices X_K and Y_K, having entries between zero and one (ensured by the definition of φ), are rounded to binary matrices X and Y with respect to the minimization of the cost measure (line 10). If the rounding procedure returns binary matrices which use at least one (non-singleton) pattern less than possible (r − r(X, Y) > 1), the current factorization is returned. Otherwise, we increase the rank and add ∆r random columns with entries between zero and one to the relaxed solution of the former iteration (X_K, Y_K) and numerically optimize again.

Acknowledgments

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", projects A1 and C1, http://sfb876.tu-dortmund.de.


References
Bolte, J., Sabach, S., & Teboulle, M. (2014). Proxi-
mal alternating linearized minimization for noncon-
vex and nonsmooth problems. Mathematical Pro-
gramming, 146, 459–494.
Hess, S., Morik, K., & Piatkowski, N. (2017). The
primping routine – tiling through proximal alternat-
ing linearized minimization (under minor revision).
Data Mining and Knowledge Discovery.
Karaev, S., Miettinen, P., & Vreeken, J. (2015). Get-
ting to know the unknown unknowns: Destructive-
noise resistant boolean matrix factorization. SDM
(pp. 325–333).
Lucchese, C., Orlando, S., & Perego, R. (2014). A
unifying framework for mining approximate top-k
binary patterns. Transactions on Knowledge and
Data Engineering, 26, 2900–2913.

Miettinen, P., & Vreeken, J. (2014). Mdl4bmf: Min-


imum description length for boolean matrix factor-
ization. ACM Trans. Knowl. Discov. Data, 8, 18:1–
18:31.
Siebes, A., Vreeken, J., & van Leeuwen, M. (2006).
Item sets that compress. SDM (pp. 393–404).
Smets, K., & Vreeken, J. (2012). Slim: Directly mining
descriptive patterns. SDM (pp. 236–247). SIAM.

An expressive similarity measure for relational clustering using
neighbourhood trees

Sebastijan Dumančić [email protected]


Department of Computer Science, KU Leuven, Belgium
Hendrik Blockeel [email protected]
Department of Computer Science, KU Leuven, Belgium

Keywords: Relational learning, Clustering, Similarity of structured objects

Abstract

In this paper, we introduce a novel similarity measure for relational data. It is the first measure to incorporate a wide variety of types of similarity, including similarity of attributes, similarity of relational context, and proximity in a hypergraph. We experimentally evaluate how using this similarity affects the quality of clustering on very different types of datasets. The experiments demonstrate that (a) using this similarity in standard clustering methods consistently gives good results, whereas other measures work well only on datasets that match their bias; and (b) on most datasets, the novel similarity outperforms even the best among the existing ones. This is a summary of the paper accepted to the Machine Learning journal (Dumančić & Blockeel, 2017).

1. Introduction

In relational learning, the data set contains instances with relationships between them. Standard learning methods typically assume data are i.i.d. (drawn independently from the same population) and ignore the information in these relationships. Relational learning methods do exploit that information, and this often results in better performance. Much research in relational learning focuses on supervised learning (De Raedt, 2008) or probabilistic graphical models (Getoor & Taskar, 2007). Clustering, however, has received less attention in the relational context.

Clustering is an underspecified learning task: there is no universal criterion for what makes a good clustering, thus it is inherently subjective. This is known for i.i.d. data (Estivill-Castro, 2002), and even more true for relational data. Different methods for relational clustering have very different biases, which are often left implicit; for instance, some methods represent the relational information as a graph (which means they assume a single binary relation) and assume that similarity refers to proximity in the graph.

In this paper, we propose a very versatile framework for clustering relational data. It views a relational dataset as a hypergraph with typed vertices, typed hyperedges, and attributes associated to the vertices. The task we consider is: cluster the vertices of one particular type. What distinguishes our approach from other approaches is that the concept of similarity used here is very broad. It can take into account attribute similarity, similarity of the relations an object participates in (including roles and multiplicity), similarity of the neighbourhood (in terms of attributes, relationships, or vertex identity), and interconnectivity or graph proximity of the objects being compared. We experimentally show that this framework for clustering is highly expressive and that this expressiveness is relevant, in the sense that on a number of relational datasets, the clusters identified by this approach coincide better with predefined classes than those of existing approaches.

2. Clustering over neighbourhood trees

2.1. Hypergraph Representation

Relational learning encompasses multiple paradigms.
127
An expressive similarity measure for relational clustering using neighbourhood trees

Among the most common ones are the graph view, where the relationships among instances are represented by a graph, and the predicate logic or, equivalently, relational database view, which typically assumes the data to be stored in multiple relations, or in a knowledge base with multiple predicates. Though these are in principle equally expressive, in practice the bias of learning systems differs strongly depending on which view they take. For instance, shortest path distance as a similarity measure is much more common in the graph view than in the relational database view. In the purely logical representation, however, no distinction is made between the constants that identify a domain object, and constants that represent the value of one of its features. Identifiers have no inherent meaning, as opposed to feature values.

In this work, we introduce a new view that combines elements of both. This view essentially starts out from the predicate logic view, but changes the representation to a hypergraph representation. Formally, the data structure that we assume in this paper is a typed, labelled hypergraph H = (V, E, τ, λ) with V being a set of vertices, and E a set of hyperedges; each hyperedge is an ordered set of vertices. The type function τ assigns a type to each vertex and hyperedge. A set of attributes A(t) is associated with each t ∈ T_V. The labelling function λ assigns to each vertex a vector of values, one for each attribute of A(τ(v)).

The clustering task we consider is the following: given a vertex type t ∈ T_V, partition the vertices of this type into clusters such that vertices in the same cluster tend to be similar, and vertices in different clusters dissimilar, for some subjective notion of similarity. In practice, it is of course not possible to use a subjective notion; one uses a well-defined similarity function, which hopefully in practice approximates well the subjective notion that the user has in mind. To be able to capture several interpretations of relational similarity, such as attribute or neighbourhood similarity, we represent each vertex with a neighbourhood tree – a structure that effectively describes a vertex and its neighbourhood.

2.2. Neighbourhood tree

Consider a vertex v. A neighbourhood tree aims to compactly represent the neighbourhood of the vertex v and all relationships it forms with other vertices, and it is defined as follows. For every hyperedge E in which v participates, add a directed edge from v to each vertex v′ ∈ E. Label each vertex with its attribute vector. Label the edge with the hyperedge type and the position of v in the hyperedge (recall that hyperedges are ordered sets). The vertices thus added are said to be at depth 1. If there are multiple hyperedges connecting vertices v and v′, v′ is added each time it is encountered. Repeat this procedure for each v′ at depth 1. The vertices thus added are at depth 2. Continue this procedure up to some predefined depth d. The root element is never added to the subsequent levels.

2.3. Similarity measure

The main idea behind the proposed dissimilarity measure is to express a wide range of similarity biases that can emerge in relational data, such as attribute or structural similarity. The proposed dissimilarity measure compares two vertices by comparing their neighbourhood trees. It does this by comparing, for each level of the tree, the distribution of vertices, attribute values, and outgoing edge labels observed on that level. Earlier work in relational learning has shown that distributions are a good way of summarizing neighbourhoods (Perlich & Provost, 2006).

The final similarity measure consists of a linear combination of different interpretations of similarity. Concretely, the similarity measure is a composition of components reflecting:

1. attributes of the root vertices,
2. attributes of the neighbouring vertices,
3. proximity of the vertices,
4. identity of the neighbouring vertices,
5. distribution of hyperedge types in a neighbourhood.

Each component is weighted by the corresponding weight w_i. These weights allow one to formulate an interpretation of the similarity between relational objects.

2.4. Results

We compared the proposed similarity measure against a wide range of existing relational clustering approaches and graph kernels on five datasets. The proposed similarity measure was used in conjunction with spectral and hierarchical clustering algorithms. We found that, on each separate dataset, our approach performs at least as well as the best competitor, and it is the only approach that achieves good results on all datasets. Furthermore, the results suggest that decoupling different sources of similarity into a linear combination helps to identify relevant information and reduce the effect of noise.
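As a rough, much-simplified illustration of the ideas in Sections 2.2 and 2.3, the sketch below collects per-level label counts around a vertex in a plain (non-hypergraph, single-attribute) graph and combines level-wise distribution distances with user-chosen weights. The actual measure operates on typed hyperedges, several attribute types and more components, so every name and simplification here is the sketch's own, not the paper's definition.

```python
from collections import Counter

def level_profiles(adjacency, labels, root, depth):
    """Per-level label counts of the neighbourhood of `root` (simplified to a
    plain graph with one categorical label per vertex; the root is never
    re-added on later levels)."""
    profiles, frontier = [], [root]
    for _ in range(depth):
        nxt = [u for v in frontier for u in adjacency.get(v, []) if u != root]
        profiles.append(Counter(labels[u] for u in nxt))
        frontier = nxt
    return profiles

def dissimilarity(adjacency, labels, a, b, depth=2, weights=(0.5, 0.5)):
    """Weighted sum of per-level normalized L1 distances between the label
    distributions of two vertices; one component of the measure in the text."""
    pa = level_profiles(adjacency, labels, a, depth)
    pb = level_profiles(adjacency, labels, b, depth)
    total = 0.0
    for w, ca, cb in zip(weights, pa, pb):
        na, nb = sum(ca.values()) or 1, sum(cb.values()) or 1
        keys = set(ca) | set(cb)
        total += w * sum(abs(ca[k] / na - cb[k] / nb) for k in keys) / 2
    return total

# toy usage with a hand-made adjacency list and vertex labels
adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
labels = {1: "a", 2: "a", 3: "b", 4: "a"}
print(dissimilarity(adjacency, labels, 1, 4))
```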


Acknowledgements
Research supported by Research Fund KU Leuven
(GOA/13/010)

References
De Raedt, L. (2008). Logical and relational learning.
Cognitive Technologies. Springer.
Dumančić, S., & Blockeel, H. (2017). An expressive
dissimilarity measure for relational clustering using
neighbourhood trees. Machine Learning, To Appear.

Estivill-Castro, V. (2002). Why so many clustering


algorithms: A position paper. SIGKDD Explor.
Newsl., 4, 65–75.
Getoor, L., & Taskar, B. (2007). Introduction to statis-
tical relational learning (adaptive computation and
machine learning). The MIT Press.
Perlich, C., & Provost, F. (2006). Distribution-based
aggregation for relational learning with identifier at-
tributes. Mach. Learn., 62, 65–105.

Complex Networks Track

Extended Abstracts

Dynamics Based Features for Graph Classification

Leonardo Gutiérrez Gómez [email protected]


Jean-Charles Delvenne [email protected]
Université catholique de Louvain, 4, Avenue Lemaı̂tre, B-1348 Louvain-la-Neuve, Belgium

Keywords: graph classification, dynamics on networks, machine learning on networks

Abstract

In this paper we propose a new feature-based approach to network classification. We show how a dynamics on a network can be useful to reveal patterns about the organization of the components of the underlying graph where the process takes place. Measuring the autocovariance along a random path on the network of a suitable set of network attributes, including node labels, allows us to define generalized features across different time scales. These dynamic features turn out to be an appropriate discriminative signature of the network, suitable for classification and recognition purposes. The method is tested empirically on established network benchmarks. Results show that our dynamics-based features are competitive and often outperform state-of-the-art graph kernel based methods.

1. Introduction

A wide range of real-world problems involve network analysis and prediction tasks. The complexity of social, engineering and biological networks makes it necessary to develop methods that deal with the major difficulties of mining graph-based data: the intrinsic high complexity of its structure and the relations of its components, and the high dimensionality of the data.

In a typical graph classification task, we are interested in assigning the most likely label to a graph among a set of classes. For example, in chemoinformatics and bioinformatics, one is interested in predicting the toxicity or anti-cancer activity of molecules. Characterization of proteins and enzymes is crucial in drug research, in order to discover the emergence of diseases. Social network classification (Wang & Krim, 2012) is suitable for many social, marketing and targeting purposes, as well as for mobility and collaboration networks, and so on.

This problem has been treated before from the supervised and unsupervised machine learning perspective. In the first case, a set of discriminative hand-crafted features must be carefully selected in order to achieve high generalization capabilities. Typically this is done by manually choosing a set of structural, global and contextual features (Fei & Huan, 2008) from the underlying graph. Kernel-based methods (Hofmann et al., 2008) are very popular in this context. They consist of a two-step process in which a suitable kernel function is devised capturing a similarity property of interest, followed by a classification step using a kernelized version of a classification algorithm, such as logistic regression or support vector machines. Alternatively, unsupervised algorithms aim to learn those features from data, but at the cost of high training time and often blowing up the number of parameters, something clearly not suitable in the context of large social networks.

In that direction, understanding the structural decomposition of networks is crucial for our interest. Indeed, community detection or clustering algorithms on networks (Girvan & Newman, 2002) aim to disentangle meaningful, simplified patterns that are shared by groups of nodes along the network. In particular, dynamics-based approaches (Delvenne et al., 2013), as a general community detection framework, play a key role in our work. Certainly, when a dynamics takes place on a network, it is constrained by the network structure and thus could potentially reveal interesting features of the organization of the network. Given this interdependence between dynamics and structure, we are able to extract meaningful features of the network across time scales, which will be useful for prediction purposes.


Dataset GK Deep GK DF
COLLAB 72.84 ± 0.28 73.09 ± 0.25 73.77 ± 0.22
IMDB-BINARY 65.87 ± 0.98 66.96 ± 0.56 70.32 ± 0.88
IMDB-MULTI 43.89 ± 0.38 44.55 ± 0.52 45.85 ± 1.18
REDDIT-BINARY 77.34 ± 0.18 78.04 ± 0.39 86.09 ± 0.53
REDDIT-MULTI-5K 41.01 ± 0.17 41.27 ± 0.18 51.44 ± 0.55
REDDIT-MULTI-12K 31.82 ± 0.008 32.22 ± 0.10 39.67 ± 0.42

Table 1. Social networks: Mean and standard deviation of classification accuracy for the Graphlet Kernel (GK) (Shervashidze et al., 2011), Deep Graphlet Kernel (Deep GK) (Yanardag & Vishwanathan, 2015), and Dynamic Features (DF, our method).

2. Method

In this work we explore the use of dynamics-based graph features in the supervised graph classification setting. Having a well-defined dynamics on a network, i.e. a stationary random walk, we manually specify a candidate set of features based on our expertise and domain knowledge. That is, we choose a node feature such as degree, PageRank, local clustering coefficient, etc., and look at the autocovariance (Delvenne et al., 2013) of this feature at times t ∈ {0, 1, 2, 3} for a random walker jumping from node to node. These descriptors will be used as a global fingerprint of the network, describing generalized assortativities (the usual assortativity turns out to be the case t = 1), or clustering-coefficient-related quantities for t = 3. In addition, for categorical node labels, i.e. age, gender, atom type, we may use a binary association matrix H encoding node-by-class membership and then use the total autocovariance H^T Cov(X_τ, X_{τ+t}) H, yielding even more features of interest.
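A minimal NumPy sketch of the kind of quantity described above: the autocovariance of a scalar node attribute at lag t under the stationary random walk on an undirected graph. The exact normalization and feature set used in the paper may differ; this sketch is only meant to make the construction tangible, and the toy graph is invented.

```python
import numpy as np

def walk_autocovariance(A, v, t):
    """Autocovariance of the node attribute vector v at lag t under the
    stationary random walk on an undirected graph with adjacency matrix A:
    cov_t(v) = v^T (diag(pi) M^t - pi pi^T) v, with M the transition matrix
    and pi its stationary distribution."""
    d = A.sum(axis=1)
    pi = d / d.sum()            # stationary distribution of the walk
    M = A / d[:, None]          # row-stochastic transition matrix
    Mt = np.linalg.matrix_power(M, t)
    return v @ (np.diag(pi) @ Mt - np.outer(pi, pi)) @ v

# toy usage: the degree sequence itself as the node attribute
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print([walk_autocovariance(A, A.sum(axis=1), t) for t in range(4)])
```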

Figure 1. Dynamics-based features are able to discriminate between molecules.

3. Experiments and Results

These features are tested on many network benchmarks. For bioinformatics datasets (Figure 1) we run an automatic feature selection process by training an l1-regularized linear SVM classifier. On the other hand, for social network datasets we opt for a Random Forest model (Table 1). We compare experimentally its classification accuracy with respect to a wide range of graph kernel methods. These include the Graphlet kernel (Shervashidze et al., 2011), the Shortest path kernel (Borgwardt & Kriegel, 2005) and the Weisfeiler-Lehman subtree kernel (Shervashidze et al., 2011), as well as their respective Deep kernel versions (Yanardag & Vishwanathan, 2015). Random-walk-based kernels such as the p-step random walk kernel (Smola & Kondor, 2003) and the random walk kernel (Gärtner et al., 2003), as well as the Ramon & Gärtner kernel (Ramon & Gärtner, 2003), are also considered. Our results show that our method is capable of achieving, and in many cases outperforming, state-of-the-art accuracies in binary and multi-class graph classification tasks.

Acknowledgments

The authors acknowledge support from the grant "Actions de recherche concertées — Large Graphs and Networks" of the Communauté Française de Belgique. We also thank Marco Saerens and Roberto D'Ambrosio for helpful discussions and suggestions.

References

Barnett, I., Malik, N., Kuijjer, M. L., Mucha, P. J., & Onnela, J.-P. (2016). Feature-based classification of networks.

Borgwardt, K., & Kriegel, H. (2005). Shortest-path kernels on graphs. Fifth IEEE International Conference on Data Mining (ICDM'05).


Delvenne, J.-C., Schaub, M. T., Yaliraki, S. N., &


Barahona, M. (2013). The stability of a graph par-
tition: A dynamics-based framework for community
detection. Modeling and Simulation in Science, En-
gineering and Technology, 221–242.

Fei, H., & Huan, J. (2008). Structure feature se-


lection for graph classification. Proceedings of the
17th ACM Conference on Information and Knowl-
edge Management (pp. 991–1000). New York, NY,
USA: ACM.

Girvan, M., & Newman, M. E. J. (2002). Community


structure in social and biological networks. Pro-
ceedings of the National Academy of Sciences, 99,
7821–7826.
Gärtner, T., Flach, P., & Wrobel, S. (2003). On
graph kernels: Hardness results and efficient alterna-
tives. In: Conference on Learning Theory (pp. 129–143).
Hofmann, T., Schölkopf, B., & Smola, A. J. (2008).
Kernel methods in machine learning. The Annals of
Statistics, 36, 1171–1220.
Ramon, J., & Gärtner, T. (2003). Expressivity versus
efficiency of graph kernels. Proceedings of the First
International Workshop on Mining Graphs, Trees
and Sequences (pp. 65–74).

Shervashidze, N., Schweitzer, P., van Leeuwen,


E. J., Mehlhorn, K., & Borgwardt, K. M. (2011).
Weisfeiler-Lehman graph kernels. J. Mach. Learn.
Res., 12, 2539–2561.
Smola, A. J., & Kondor, R. (2003). Kernels and reg-
ularization on graphs, 144–158. Berlin, Heidelberg:
Springer Berlin Heidelberg.
Wang, T., & Krim, H. (2012). Statistical classification
of social networks. 2012 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP).
Yanardag, P., & Vishwanathan, S. (2015). Deep graph
kernels. Proceedings of the 21st ACM SIGKDD In-
ternational Conference on Knowledge Discovery and
Data Mining (pp. 1365–1374). New York, NY, USA:
ACM.

Improving Individual Predictions using Social Networks
Assortativity

Dounia Mulders, Cyril de Bodt, Michel Verleysen {name.surname}@uclouvain.be


ICTEAM institute, Université catholique de Louvain, Place du Levant 3, 1348 Louvain-la-Neuve, Belgium
Johannes Bjelland [email protected]
Telenor Research, Snarøyveien 30, N-1360 Fornebu, Norway
Alex (Sandy) Pentland [email protected]
MIT Media Lab, Massachusetts Institute of Technology, 77 Mass. Ave., E14/E15, Cambridge, MA 02139, USA
Yves-Alexandre de Montjoye [email protected]
Data Science Institute, Imperial College London, 180 Queen’s Gate, London SW7 2AZ, U.K.

Keywords: Belief propagation, assortativity, homophily, social networks, mobile phone metadata.
Abstract

Social networks are known to be assortative with respect to many attributes such as age, weight, wealth, ethnicity and gender. Independently of its origin, this assortativity gives us information about each node given its neighbors. It can thus be used to improve individual predictions in many situations, when data are missing or inaccurate. This work presents a general framework based on probabilistic graphical models to exploit social network structures for improving individual predictions of node attributes. We quantify the assortativity range leading to an accuracy gain. We also show how specific characteristics of the network can improve performances further. For instance, the gender assortativity in mobile phone data changes significantly according to some communication attributes.

1. Introduction

Social networks such as Facebook, Twitter, Google+ and mobile phone networks are nowadays largely studied for predicting and analyzing individual demographics (Traud et al., 2012; Al Zamal et al., 2012; Magno & Weber, 2014). Demographics are indeed a key input for the establishment of economic and social policies, health campaigns, market segmentation, etc. (Magno & Weber, 2014; Frias-Martinez et al., 2010; Sarraute et al., 2014). Especially in developing countries, such statistics are often scarce, as local censuses are costly, rough, time-consuming and hence rarely up-to-date (de Montjoye et al., 2014).

Social networks contain individual information about their users (e.g. generated tweets for Twitter), in addition to graph topology information. The assortativity of social networks is defined as the nodes' tendency to be linked to others which are similar in some sense (Aral et al., 2009). This assortativity with respect to various demographics of their individuals such as gender, age, weight, income level, etc. is well documented in the literature (McPherson et al., 2001; Madan et al., 2010; Wang et al., 2013; Smith et al., 2014; Newman, 2003). This property has been theorized to come either from influences or homophilies or a combination of both. Independently of its cause, this assortativity can be used for individual prediction purposes when some labels are missing or uncertain, e.g. for demographics prediction in large networks. Some methods are currently developed to exploit that assortativity (Al Zamal et al., 2012; Herrera-Yagüe & Zufiria, 2012). However, few studies take the global network structure into account (Sarraute et al., 2014; Dong et al., 2014). Also, to our best knowledge, no research quantifies how the performances are related to the assortativity strength. The goal of this work, already published, is to overcome these shortcomings (Mulders et al., 2017).

2. Method

We propose a framework based on probabilistic graphical models to exploit a social network structure, and especially its underlying assortativity, for individual prediction improvement in a general context.


The network assortativity is quantified by the assortativity coefficient of the attribute to predict, denoted by r (Newman, 2003). The method can be applied with only the knowledge of the labels of a limited number of pairs of connected users, in order to evaluate the assortativity, as well as class probability estimates for each user. These probabilities may for example be obtained by applying a machine learning algorithm exploiting the node-level information, after it has been trained on the individual data of the users with known labels. As described in (Mulders et al., 2017), a loopy belief propagation algorithm is applied on a Markov random field modeling the network to improve the accuracy of these prior class probability estimates. The model is able to benefit from the strength of the links, quantified for example by the number of contacts. The estimation of r allows one to optimally tune the model parameters, by defining synthetic graphs. These simulations permit (1) to prevent overfitting a given network structure, (2) to perform the parameter tuning off-line and (3) to avoid requiring the labeled users to form a connected graph. These simulations also allow us to quantify the assortativity range leading to an accuracy gain over an approach ignoring the network topology.

Figure 1. Gender assortativity coefficient in a mobile phone network when only the edges with a number of texts (sms) and a number of calls (calls) larger than some increasing thresholds are preserved. The top (resp. right) histogram gives the number of edges (n_e) with calls (resp. sms) larger than the corresponding value on the x-axis (resp. y-axis), on a log scale.
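For readers unfamiliar with the coefficient r used above, here is a small self-contained NumPy sketch of Newman's assortativity coefficient for a categorical attribute, computed from an undirected edge list (NetworkX provides attribute_assortativity_coefficient for the same quantity). The toy graph and labels below are invented for illustration only.

```python
import numpy as np

def attribute_assortativity(edges, labels):
    """Newman's assortativity coefficient r for a categorical node attribute:
    r = (tr E - sum_i a_i b_i) / (1 - sum_i a_i b_i), where E is the
    normalized mixing matrix and a, b its marginals."""
    cats = sorted({labels[u] for u, v in edges} | {labels[v] for u, v in edges})
    idx = {c: i for i, c in enumerate(cats)}
    E = np.zeros((len(cats), len(cats)))
    for u, v in edges:
        i, j = idx[labels[u]], idx[labels[v]]
        E[i, j] += 1
        E[j, i] += 1       # each undirected edge counted once in each direction
    E /= E.sum()
    a, b = E.sum(axis=1), E.sum(axis=0)
    return (np.trace(E) - a @ b) / (1 - a @ b)

# toy example: two gender groups connected by a single cross link
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (0, 3)]
labels = {0: "M", 1: "M", 2: "M", 3: "F", 4: "F", 5: "F"}
print(attribute_assortativity(edges, labels))
```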
3. Mobile phone network

The methodology is validated on mobile phone data to predict gender (M and F resp. for male and female). Since our model exploits the gender homophilies, its performances depend on r. In the worst case of a randomly mixed network, r = 0. Perfect (dis-)assortativity leads to r = (−)1. In our network, r ≈ 0.3, but Fig. 1 shows that it can change according to some communication attributes. The strongest edges (with many texts and/or calls) are more anti-homophilic, allowing us to partition the edges into strong and weak parts, respectively disassortative and assortative (r ≈ 0.3 in the weak part, whereas r can reach −0.3 in the strong one while retaining ≈ 1% of the edges). This partition is exploited to improve the predictions by adapting the model parameters in the different parts of the network. Fig. 2 shows the accuracy and recall gains of our method, over simulated initial predictions with varying initial accuracies resulting from sampled class probability estimates. The highest accuracy gains are obtained in the range [70, 85]% of initial accuracy, covering the accuracies reached by state-of-the-art techniques aiming to predict gender using individual-level features (Felbo et al., 2015; Sarraute et al., 2014; Frias-Martinez et al., 2010). These gains overcome the results obtained with Sarraute et al.'s reaction-diffusion algorithm (2014).

Figure 2. Accuracy and recall gains when varying the initial accuracy in a mobile phone network, averaged over 50 random simulations of the first predictions. The filled areas delimit intervals of one standard deviation around the mean gains.

4. Conclusion

This work shows how assortativity can be exploited to improve individual demographics prediction in social networks, using a probabilistic graphical model. The achieved performances are studied on simulated networks as a function of the assortativity and the quality of the initial predictions, both in terms of accuracy and distribution. Indeed, the relevance of the network information compared to individual features depends on (1) the assortativity amplitude and (2) the quality of the prior individual predictions. The graph simulations allow us to tune the model parameters. Our method is validated on a mobile phone network and the model is refined to predict gender, exploiting both weak, homophilic and strong, anti-homophilic links.

Acknowledgments

DM and CdB are Research Fellows of the Fonds de la Recherche Scientifique - FNRS.


References

Al Zamal, F., Liu, W., & Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. ICWSM, 270.

Aral, S., Muchnik, L., & Sundararajan, A. (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106, 21544–21549.

de Montjoye, Y.-A., Kendall, J., & Kerry, C. F. (2014). Enabling humanitarian use of mobile phone data. Brookings Center for Tech. Innovation.

Dong, Y., Yang, Y., Tang, J., Yang, Y., & Chawla, N. V. (2014). Inferring user demographics and social strategies in mobile social networks. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 15–24).

Felbo, B., Sundsøy, P., Pentland, A., Lehmann, S., & de Montjoye, Y.-A. (2015). Using deep learning to predict demographics from mobile phone metadata. arXiv preprint arXiv:1511.06660.

Frias-Martinez, V., Frias-Martinez, E., & Oliver, N. (2010). A gender-centric analysis of calling behavior in a developing economy using call detail records. AAAI Spring Symposium: Artificial Intelligence for Development.

Herrera-Yagüe, C., & Zufiria, P. J. (2012). Prediction of telephone user attributes based on network neighborhood information. International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 645–659).

Madan, A., Moturu, S. T., Lazer, D., & Pentland, A. S. (2010). Social sensing: obesity, unhealthy eating and exercise in face-to-face networks. Wireless Health 2010 (pp. 104–110).

Magno, G., & Weber, I. (2014). International gender differences and gaps in online social networks. International Conference on Social Informatics (pp. 121–138).

McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 415–444.

Mulders, D., de Bodt, C., Bjelland, J., Pentland, A., Verleysen, M., & de Montjoye, Y.-A. (2017). Improving individual predictions using social networks assortativity. Proceedings of the 12th International Workshop on Self-Organizing Maps and Learning Vector Quantization, Clustering and Data Visualization (WSOM+).

Newman, M. E. (2003). Mixing patterns in networks. Physical Review E, 67, 026126.

Sarraute, C., Blanc, P., & Burroni, J. (2014). A study of age and gender seen through mobile phone usage patterns in Mexico. Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on (pp. 836–843).

Smith, J. A., McPherson, M., & Smith-Lovin, L. (2014). Social distance in the United States: Sex, race, religion, age, and education homophily among confidants, 1985 to 2004. American Sociological Review, 79, 432–456.

Traud, A. L., Mucha, P. J., & Porter, M. A. (2012). Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 391, 4165–4180.

Wang, Y., Zang, H., & Faloutsos, M. (2013). Inferring cellular user demographic information using homophily on call graphs. INFOCOM, 2013 Proceedings IEEE (pp. 3363–3368).

User-Driven Pattern Mining on knowledge graphs: an
Archaeological Case Study

Wilcke, WX [email protected]
Department of Computer Science,
Department of Spatial Economics,
VU University Amsterdam, The Netherlands
de Boer, V [email protected]
van Harmelen, FAH [email protected]
Department of Computer Science,
VU University Amsterdam, The Netherlands

Keywords: Knowledge Graph, Pattern Mining, Hybrid Evaluation, Digital Humanities, Archaeology

Abstract

In this work, we investigate to what extent data mining can contribute to the understanding of archaeological knowledge, published as knowledge graph, and which form would best meet the communities' needs. A case study was held which involved the user-driven mining of generalized association rules. Experiments have shown that the approach yielded mostly plausible patterns, some of which were rated as highly relevant by domain experts.

1. Introduction

Digital Humanities communities have recently begun to show a growing interest in the knowledge graph as data modelling paradigm (Hallo et al., 2016). In this paradigm, knowledge is encoded as edges between vertices and is supported by semantic background knowledge. Already, many humanities data sets have been published as such, with large contributors being European archaeological projects such as CARARE and ARIADNE. These data have been made available in the Linked Open Data (LOD) cloud – an internationally distributed knowledge graph – bringing large amounts of structured data within arm's reach of archaeological researchers. This presents new opportunities for data mining (Rapti et al., 2015).

In this work¹, we have investigated to what extent data mining can contribute to the understanding of archaeological knowledge, published as knowledge graph, and which form would best meet the communities' needs. For this purpose, we have constructed a pipeline which implements a state-of-the-art method to mine generalized association rules directly from the LOD cloud in an overall user-driven process (Freitas, 1999). Produced rules take the form: ∀χ(Type(χ, t) → (P(χ, φ) → Q(χ, ψ))). Their interestingness has been evaluated by a group of raters.

¹ This research has been partially funded by the ARIADNE project through the European Commission under the Community's Seventh Framework Programme, contract no. FP7-INFRASTRUCTURES-2012-1-313193.

2. Approach

Our pipeline² facilitates the rule mining algorithm, various pre- and post-processing steps, and a simple rule browser. We will briefly touch on the most crucial components next:

² Available at github.com/wxwilcke/MINOS.

Data Retrieval: On start, users are asked to provide a target pattern which defines their specific interest, e.g., ceramic artefacts. Optionally, users may specify numerous parameters which, if left empty, are set to defaults. Together, these are translated into a query which is used to construct an in-memory graph from the data retrieved from the LOD cloud.
Available at github.com/wxwilcke/MINOS.


tended with other entities related to them: their context. Unless specified by the user, contexts are sampled breadth-first up to a depth of 3. This results in n subgraphs, with n equal to the total number of target entities in the in-memory graph. These subgraphs can be thought of as analogous to the instances in tabular data sets.

Pattern Mining: Our pipeline implements SWARM: a state-of-the-art generalized association rule mining algorithm (Barati et al., 2016). We motivate its selection by the algorithm's ability to exploit semantic background knowledge to generalize rules. In addition, the algorithm is transparent and yields interpretable results, thus fitting the domain requirements (Selhofer & Geser, 2014).

Dimension Reduction: A data-driven evaluation process is used to rate rules on their commonness. Hereto, we have extended the basic support and confidence measures with those tailored to graphs. Rules which are too rare or too common are omitted from the final result, as well as those with omnipresent relations (e.g., type and label). Remaining rules are shown in a simple faceted rule browser, which allows users to interactively customize templates (Klemettinen et al., 1994), for instance to set acceptable ranges for confidence and support scores, as well as to specify the types of entities allowed in either or both antecedent and consequent.

3. Experiments

Experiments were run on an archaeological subset (±425k facts) of the LOD cloud (available at pakbon-ld.spider.d2s.labs.vu.nl), which contains detailed summaries about archaeological excavation projects in the Netherlands. Each summary holds information on 1) the project's organisational structure, 2) people and companies involved, 3) reports made and media created, 4) artefacts discovered together with their context and their (geospatial and stratigraphic) relation, and 5) fine-grained information about various locations and geometries.

Four distinct experiments have been conducted, each one having focussed on a different granularity of the data: A) project level, B) artefact level, C) context level, and D) subcontextual level. These were chosen together with domain experts, who were asked to describe the aspects of the data most interesting to them.

Results and Evaluation

Each experiment yielded more than 35,000 candidate rules. This has been brought down to several thousands using the aforementioned data-driven evaluation process. The remaining rules were then ordered on confidence (first) and support (second).

For each experiment, we selected 10 example rules from the top-50 candidates to create an evaluation set of 40 rules in total. Three domain experts were then asked to evaluate these on both plausibility and relevancy to the archaeological domain. Each rule was accompanied by a transcription in natural language to further improve its interpretability. For instance, a typical rule might state: "For every artefact in the data set holds: if it consists of raw earthenware (Nimeguen), then it dates from early Roman to late Roman times".

Table 1. Normalized separate and averaged plausibility values (nominal scale) for experiments A through D as provided by three raters (κ = −1.28e−3).

               Rater 1   Rater 2   Rater 3   Mean
Experiment A     1.00      1.00      0.00    0.67
Experiment B     0.80      0.80      0.00    0.53
Experiment C     0.80      0.80      0.20    0.60
Experiment D     1.00      1.00      0.80    0.93
Mean             0.90      0.90      0.25    0.68

Table 2. Normalized separate and averaged relevancy values (ordinal scale) for experiments A through D as provided by three raters (κ = 0.31).

               Rater 1       Rater 2       Rater 3       Mean
Experiment A   0.13±0.18     0.13±0.18     0.00±0.00     0.09±0.12
Experiment B   0.53±0.30     0.53±0.30     0.33±0.47     0.47±0.36
Experiment C   0.53±0.30     0.33±0.24     0.67±0.41     0.51±0.32
Experiment D   0.60±0.28     0.47±0.18     0.80±0.45     0.62±0.30
Mean           0.45±0.31     0.37±0.26     0.45±0.48     0.42±0.35

The awarded plausibility scores (Table 1) indicate that roughly two-thirds of the rules (0.68) were rated plausible, with experiment D yielding the most by far. Rater 3 was far less positive than raters 1 and 2, and has a strong negative influence on the overall plausibility scores. In contrast, the relevancy scores (Table 2) are in fair agreement with an overall score of 0.42, implying a slight irrelevancy. This can largely be attributed to experiment A, which scored considerably lower than the other experiments.

4. Conclusion

Our raters were positively surprised by the range of patterns that we were able to discover. Most of these were rated plausible, and some even as highly relevant. Nevertheless, trivialities and tautologies were also frequently encountered. Future research should focus on this by improving the data-driven evaluation step.
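To make the rule format and the support/confidence filtering described above concrete, the following is a minimal Python sketch. It is not the MINOS pipeline itself: the entity contexts, predicate names, and threshold values are invented for illustration only.

```python
# Each target entity is represented by its sampled context: a set of (predicate, value) pairs.
contexts = {
    "artefact_1": {("material", "earthenware"), ("period", "roman")},
    "artefact_2": {("material", "earthenware"), ("period", "roman")},
    "artefact_3": {("material", "earthenware"), ("period", "medieval")},
    "artefact_4": {("material", "flint"), ("period", "mesolithic")},
}

def score_rule(contexts, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of target entities whose context contains the antecedent
    confidence = fraction of those entities whose context also contains the consequent
    """
    matching = [c for c in contexts.values() if antecedent in c]
    if not matching:
        return 0.0, 0.0
    support = len(matching) / len(contexts)
    confidence = sum(consequent in c for c in matching) / len(matching)
    return support, confidence

if __name__ == "__main__":
    sup, conf = score_rule(contexts,
                           antecedent=("material", "earthenware"),
                           consequent=("period", "roman"))
    print(f"support = {sup:.2f}, confidence = {conf:.2f}")
    # Rules that are too rare or too common can then be filtered out, e.g.:
    keep = 0.05 <= sup <= 0.95 and conf >= 0.5
    print("keep rule:", keep)
```

The bounds used in the final filter are placeholders; in a user-driven setting they would be exposed in the rule browser rather than fixed in code.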


References
Barati, M., Bai, Q., & Liu, Q. (2016). Swarm: An ap-
proach for mining semantic association rules from
semantic web data, 30–43. Cham: Springer Interna-
tional Publishing.
Freitas, A. A. (1999). On rule interestingness mea-
sures. Knowledge-Based Systems, 12, 309–315.
Hallo, M., Luján-Mora, S., Maté, A., & Trujillo, J.
(2016). Current state of linked data in digital li-
braries. Journal of Information Science, 42, 117–
127.

Klemettinen, M., Mannila, H., Ronkainen, P., Toivo-


nen, H., & Verkamo, A. I. (1994). Finding interest-
ing rules from large sets of discovered association
rules. Proceedings of the third international con-
ference on Information and knowledge management
(pp. 401–407).
Rapti, A., Tsolis, D., Sioutas, S., & Tsakalidis, A.
(2015). A survey: Mining linked cultural heritage
data. Proceedings of the 16th International Con-
ference on Engineering Applications of Neural Net-
works (INNS) (p. 24).
Selhofer, H., & Geser, G. (2014). D2.1: First report on users needs (Technical Report). ARIADNE. http://ariadne-infrastructure.eu/Resources/D2.1-First-report-on-users-needs.

139
Harvesting the right tweets:
Social media analytics for the Horticulture Industry

Marijn ten Thij [email protected]


Vrije Universiteit Amsterdam, Faculty of Sciences, Amsterdam, The Netherlands
Sandjai Bhulai [email protected]
Vrije Universiteit Amsterdam, Faculty of Sciences, Amsterdam, The Netherlands

Keywords: Twitter, horticulture, social media analytics

Abstract

In our current society, data has gone from scarce to superabundant: huge volumes of data are being generated every second. A big part of this flow is due to social media platforms, which provide a very volatile flow of information. Leveraging this information, which is buried in this fast stream of messages, poses a serious challenge. A vast amount of work is devoted to tackle this challenge in different business areas. In our work, we address this challenge for the horticulture sector, which has not received a lot of attention in the literature. Our aim is to extract information from the social data flow that can empower the horticulture sector. In this abstract, we present our first steps towards this goal.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

In recent years, there have been a lot of overwhelming changes in how people communicate and interact with each other, mostly due to social media. It has revolutionized the Internet into a more personal and participatory medium. Consequently, social networking is now the top online activity on the Internet. With this many subscriptions to social media, massive amounts of information, accumulating as a result of interactions, discussions, social signals, and other engagements, form a valuable source of data, which can be leveraged through social media analytics.

Social media analytics is the process of tracking conversations around specific phrases, words or brands [Fan & Gordon, 2014]. Through tracking, one can leverage these conversations to discover opportunities or to create content for those audiences. It requires advanced analytics that can detect patterns, track sentiment, and draw conclusions based on where and when conversations happen. Doing this is important for many business areas since actively listening to customers avoids missing out on the opportunity to collect valuable feedback to understand, react, and to provide value to customers.

The retail sector is probably the business area that utilizes social media analytics the most. More than 60% of marketeers use social media tools for campaign tracking [Järvinen et al., 2015], brand analysis [Hays et al., 2013], and for competitive intelligence [He et al., 2015] (see also http://www.netbase.com/blog/why-use-social-analytics). Moreover, they also use tools for customer care, product launches, and influencer ranking. Social media analytics is also heavily used in news and journalism for building and engaging a news audience, and measuring those efforts through data collection and analysis [Bhattacharya & Ram, 2012, Castillo et al., 2014, Lehmann et al., 2013]. A similar use is also adopted in sports to actively engage with fans. In many business areas one also uses analytics for event detection [Lanagan & Smeaton, 2011] or automatic reporting of matches [Van Oorschot et al., 2012, Nichols et al., 2012] and user profiling [Xu et al., 2015].

In our work, we address this challenge for the horticulture sector, which has not received a lot of attention in the literature. The horticulture industry is a traditional sector in which producers are focused on production, and in which many traders use their own transactions as the main source of information. This leads to reactive management with very little anticipa-

Figure 1. Example of details page for tweets mentioning bananas.

tion to events in the future. Growers and traders lack data about consumer trends and how the products are used and appreciated. This setting provides opportunities to enhance the market orientation of the horticulture industry, e.g., through the use of social media. Data on consumers' appreciation and applications of products are abundant on social media. Furthermore, growers' communication on social media might indicate future supply. This creates a need for analytic methods to analyze social media data and to interpret them. Here, we present our first steps towards this goal, as presented in [ten Thij et al., 2016].

The tweets that we use in this study are scraped using the filter stream of the Twitter Application Programming Interface (API) (https://dev.twitter.com/streaming/reference/post/statuses/filter). Since we do not have access to the full Twitter feed, we do not receive all tweets that we request due to rate limitations by Twitter (https://dev.twitter.com/rest/public/rate-limits). Therefore, we use a list of 400 common tokens that are frequently used in Dutch tweets. Then, we filter the tweets using two lists of product names, provided by our partners from GroentenFruitHuis and Floricode. One list contains fruits and vegetables, e.g., apple, orange, and mango, and the other contains flowers and plants, e.g., tulip, rose, and lily. After retrieving the tweets, the first step towards empowering the sector is knowing what kind of information can be retrieved from the social feed.

Using the data, we construct a weekly time series reflecting the number of mentions of the products. We compared these numbers to sales numbers of the most occurring product type of these products. Thus, we compared the number of Dutch tweets mentioning 'pears' or 'pear' to the number of Conference pears that are sold in the same time frame. Similarly, we compared the number of tweets mentioning 'apple' or 'apples' to the number of Elstar apples that are sold. We found that in both cases the time series for the tweets and the sales are comparable. Using an eight-week shift for the sales time series, we find a Pearson correlation coefficient [Pearson, 1895] of 0.44 for the apples series and a coefficient of 0.46 for the pears series. These results indicate that it could be possible to predict the sales of a product type eight weeks in advance; however, this will need to be confirmed using other product types.

The second approach to extract value from Twitter is to see what a continuous feed of messages contains. As a first step, we visualize the obtained tweets per product in a top-10 application. The main page of this application shows the top 10 most discussed products on Twitter in the last day for both fruits/vegetables and plants/flowers. By clicking on one of the products in these top lists, we are redirected to a page, shown in Figure 1 for bananas, which shows us both the current messages mentioning the product and a detailed analysis of these messages, e.g., in terms of the most occurring terms and a time series in which the terms are mentioned. Besides knowing what products are mentioned frequently, we also use the real-time data for the detection of stories and discussions that suddenly pop up. We do this by clustering incoming tweets by their tokens, using the Jaccard index [Jaccard, 1901]. If the tokens of two tweets are more similar than a predefined threshold, which we set at 0.6, then these two tweets will be represented by the same cluster. Therefore, if a topic is actively discussed on Twitter, it will be represented as a cluster in our story detection. Since these clusters are renewed every hour, we add the notion of stories, which clusters the clusters over time. By doing this, we can also track which clusters are prevalent for a longer period of time and therefore will be very likely to be of value for the industry.

In this paper, we described our first steps towards empowering the horticulture industry by analyzing topic-relevant tweets on Twitter. During our first exploration of the Twitter data, we found that there could be predictive power in the number of times a specific product is mentioned on Twitter for the future sales numbers of that particular product. Furthermore, we developed methods to visualize the current industry-specific content in real time and filter out interesting information in the process. These ideas can be fruitfully adopted in marketing analytics to directly measure the impact of marketing activities. These first results provide a good basis for further study.
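As an illustration of the token-based story detection described above, the following is a minimal Python sketch (not the authors' implementation). It assumes simple whitespace tokenisation and compares each incoming tweet against the token set of the first tweet in each cluster; the example tweets and the helper names are invented.

```python
def jaccard(a, b):
    """Jaccard index of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_tweets(tweets, threshold=0.6):
    """Assign each tweet to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"tokens": set of the first tweet, "tweets": [..]}
    for tweet in tweets:
        tokens = set(tweet.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(tokens, cluster["tokens"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["tweets"].append(tweet)
        else:
            clusters.append({"tokens": tokens, "tweets": [tweet]})
    return clusters

if __name__ == "__main__":
    stream = [
        "verse appels in de aanbieding bij de markt",
        "appels in de aanbieding bij de markt vandaag",
        "prachtige tulpen gekocht voor moederdag",
    ]
    for i, c in enumerate(cluster_tweets(stream)):
        print(i, c["tweets"])
```

In a streaming deployment the clusters would be rebuilt every hour and linked over time into stories, as described above.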


References Marketing: Tools for Improving Marketing Commu-


nication Measurement”, 477–486. Springer Interna-
Bhattacharya, D., & Ram, S. (2012). Sharing news
tional Publishing.
articles using 140 characters: A diffusion analysis
on Twitter. Advances in Social Networks Analysis
and Mining (ASONAM), 2012 IEEE/ACM Inter- Lanagan, J., & Smeaton, A. F. (2011). Using Twitter
national Conference on (pp. 966–971). to Detect and Tag Important Events in Live Sports.
Artificial Intelligence, 29, 542–545.
Castillo, C., El-Haddad, M., Pfeffer, J., & Stempeck,
M. (2014). Characterizing the life cycle of online Lehmann, J., Castillo, C., Lalmas, M., & Zuckerman,
news stories using social media reactions. Proceed- E. (2013). Transient news crowds in social media.
ings of the 17th ACM Conference on Computer Sup- Proceedings of the Conference on Weblogs and Social
ported Cooperative Work &#38; Social Computing Media (pp. 351–360).
(pp. 211–223). New York, NY, USA: ACM.
Nichols, J., Mahmud, J., & Drews, C. (2012). Summa-
Fan, W., & Gordon, M. D. (2014). The power of social rizing sporting events using twitter. IUI ’12: Pro-
media analytics. Commun. ACM, 57, 74–81. ceedings of the 2012 ACM international conference
Hays, S., Page, S. J., & Buhalis, D. (2013). Social on Intelligent User Interfaces (pp. 189–198).
media as a destination marketing tool: its use by
national tourism organisations. Current Issues in Pearson, K. (1895). Note on regression and inheritance
Tourism, 16, 211–239. in the case of two parents. Proceedings of the Royal
Society of London, 58, 240–242.
He, W., Wu, H., Yan, G., Akula, V., & Shen, J. (2015). ten Thij, M., Bhulai, S., van den Berg, W., & Zwinkels,
A novel social media competitive analytics frame- H. (2016). Twitter Analytics for the Horticulture
work with sentiment benchmarks. Information & Industry. International Conference on DATA AN-
Management, 52, 801–812. Novel applications of so- ALYTICS 2016 (pp. 75–79). Venice, Italy: IARIA.
cial media analytics.
Jaccard, P. (1901). Distribution de la flore alpine Van Oorschot, G., Van Erp, M., & Dijkshoorn, C.
dans le bassin des dranses et dans quelques régions (2012). Automatic extraction of soccer game events
voisines. Bull. Soc. Vaud. Sci. Nat., 37, 241–272. from Twitter. CEUR Workshop Proceedings (pp.
21–30).
Järvinen, J., Töllmen, A., & Karjaluoto, H. (2015).
”marketing dynamism & sustainability: Things Xu, C., Yu, Y., & Hoi, C.-K. (2015). Hidden in-
change, things stay the same. . . ”, chapter ”Web An- game intelligence in NBA players’ tweets. Commun.
alytics and Social Media Monitoring in Industrial ACM, 58, 80–89.

142
Graph-based semi-supervised learning for complex networks

Leto Peel [email protected]


ICTEAM, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
naXys, Université de Namur, Namur, Belgium

Keywords: semi-supervised learning, complex networks, classification

Abstract

We address the problem of semi-supervised learning in relational networks, networks in which nodes are entities and links are the relationships or interactions between them. Typically this problem is confounded with the problem of graph-based semi-supervised learning (GSSL), because both problems represent the data as a graph and predict the missing class labels of nodes. However, not all graphs are created equally. In GSSL a graph is constructed, often from independent data, based on similarity. As such, edges tend to connect instances with the same class label. Relational networks, however, can be more heterogeneous and edges do not always indicate similarity. In this work (Peel, 2017) we present two scalable approaches for graph-based semi-supervised learning for the more general case of relational networks. We demonstrate these approaches on synthetic and real-world networks that display different link patterns within and between classes. Compared to state-of-the-art baseline approaches, ours give better classification performance and do so without prior knowledge of how classes interact.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

Figure 1. Different patterns of links between class labels {red, black}: (a) nodes with the same label tend to be linked (assortative), (b) links connect nodes with different labels (link-heterogeneity), (c) some nodes are assortative and some are not (class-heterogeneity), (d) missing labels (white) obscure the pattern of links.

In most complex networks, nodes have attributes, or metadata, that describe a particular property of the node. In some cases these attributes are only partially observed for a variety of reasons, e.g. the data is expensive, time-consuming or difficult to accurately collect. In machine learning, classification algorithms are used to predict discrete node attributes (which we refer to as class labels) by learning from a training set of labelled data, i.e. data for which the target attribute values are known. Semi-supervised learning is a classification problem that aims to make use of both the unlabelled data and the labelled data typically used to train supervised models. A common approach is graph-based semi-supervised learning (GSSL) (Belkin & Niyogi, 2004; Joachims, 2003; Talukdar & Crammer, 2009; Zhou et al., 2003; Zhu et al., 2003), in which (often independent) data are represented as a similarity graph, such that a vertex is a data instance and an edge indicates similarity between two instances. By utilising the graph structure of labelled and unlabelled data, it is possible to accurately classify the unlabelled vertices using a relatively small set of labelled instances.

Here we consider the semi-supervised learning problem in the context of complex networks. These networks consist of nodes representing entities (e.g. people, user accounts, documents) and links representing pairwise dependencies or relationships (e.g. friendships, contacts, references). Here class labels are discrete-valued attributes (e.g. gender, location, topic) that describe the nodes and our task is to predict these labels based only on the network structure and a small subset of nodes already labelled. This problem of classifying nodes in networks is often treated as a GSSL prob-


lem because the objective, to predict missing node labels, and the input, a graph, are the same. Sometimes this approach works well due to assortative mixing, or homophily, a feature frequently observed in networks, particularly in social networks. Homophily is the effect that linked nodes share similar properties or attributes and occurs either through a process of selection or influence. However, not all node attributes in complex networks are assortative. For example, in a network of sexual interactions between people it is likely that some attributes will be common across links, e.g. similar demographic information or shared interests, but other attributes will be different, e.g. links between people of different genders. Furthermore, the pattern of similarity or dissimilarity of attributes across links may not be consistent across the whole network, e.g. in some parts of the network links will occur between people of the same gender.

In situations where we have a sparsely labelled network and do not know the pattern of interaction between nodes of different classes, the problem of predicting the class labels of the remaining nodes is hard. Figure 1 shows a toy example in which nodes are assigned red or black labels and Fig. 1(a)–(c) show possible arrangements of labels that become indistinguishable if certain labels are missing (Fig. 1(d)). Tasks such as fraud detection face this type of problem, where certain patterns of interaction are indicative of nefarious behaviour (e.g. in communication (Cortes et al., 2002) or online auction (Chau et al., 2006) networks) but only a sparse set of confirmed fraudulent or legitimate users are available and no knowledge of how fraudsters operate or if there are different types of fraudulent behaviour.

In this work (Peel, 2017), we present two novel methods to deal with the problem of semi-supervised learning in complex networks. Both methods approximate equivalence relations from social network theory to define a notion of similarity that is robust to different patterns of interaction. We use these measures of similarity to implicitly construct similarity graphs from complex networks upon which we can propagate class label information. We demonstrate on synthetic networks that our methods are capable of classifying nodes under a range of different interaction patterns in which standard GSSL methods fail. Finally, we demonstrate on real data that our two-step label propagation approach performs consistently well against baseline approaches and easily scales to large networks with O(10^6) nodes and O(10^7) edges.

Acknowledgments

The author was supported by IAP DYSCO of the Belgian Scientific Policy Office, and ARC Mining and Optimization of Big Data Models of the Federation Wallonia-Brussels.

References

Belkin, M., & Niyogi, P. (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning, 56, 209–239.

Chau, D. H., Pandit, S., & Faloutsos, C. (2006). Detecting fraudulent personalities in networks of online auctioneers. Lect. Notes Comput. Sc., 4213, 103–114.

Cortes, C., Pregibon, D., & Volinsky, C. (2002). Communities of interest. Intell. Data Anal., 6, 211–219.

Joachims, T. (2003). Transductive learning via spectral graph partitioning. Int. Conf. Machine Learning (pp. 290–297).

Peel, L. (2017). Graph-based semi-supervised learning for relational networks. SIAM International Conference on Data Mining (SDM). arXiv preprint arXiv:1612.05001.

Talukdar, P. P., & Crammer, K. (2009). New regularized algorithms for transductive learning. Lect. Notes Comput. Sc., 5782, 442–457.

Zhou, D., et al. (2003). Learning with local and global consistency. 321–328.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Int. Conf. on Machine Learning (pp. 912–919).

144
Contact Patterns, Group Interaction and Dynamics on
Socio-Behavioral Multiplex Networks

Martin Atzmueller [email protected]


Tilburg University (TiCC), Warandelaan 2, 5037 AB Tilburg, The Netherlands
Lisa Thiele [email protected]
TU Braunschweig, Institute of Psychology Braunschweig, Germany
Gerd Stumme [email protected]
University of Kassel (ITeG), Wilhelmshöher Allee 73, 34121 Kassel, Germany
Simone Kauffeld [email protected]
TU Braunschweig, Institute of Psychology Braunschweig, Germany

Keywords: social network analysis, temporal dynamics, offline social networks, behavioral networks

Abstract

The analysis of social interaction networks is essential for understanding and modeling network structures as well as the behavior of the involved actors. This paper summarizes an analysis at large scale using (sensor) data collected by RFID tags complemented by self-report data obtained using surveys. We focus on the social network of a students' freshman week, and investigate research questions concerning group behavior and structure, gender homophily, and interrelations of sensor-based (RFID) and self-report social networks. Such analyses are a first step for enhancing interactions and enabling proactive guidance.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

The analysis of group interaction and dynamics is an important task for providing insights into human behavior. Based on the social distributional hypothesis (Mitzlaff et al., 2014), stating that users with similar interaction characteristics tend to be semantically related, we investigate such interaction networks, and analyze the respective relations. Social media and mobile devices allow the collection of interaction data at large scale, e. g., Bluetooth-enabled mobile phone data (Atzmueller & Hilgenberg, 2013), or Radio Frequency Identification (RFID) devices (Barrat et al., 2008). However, the combination of both sources has been used rather seldom so far.

This paper summarizes an analysis of social interactions on networks of face-to-face proximity complemented by self-report data in the context of a students' freshman week, presented in (Atzmueller et al., 2016b). This freshman week, the first week of freshman students at a psychology degree program, is organized as a special course (five days) before the regular courses start. We collected two types of network data: person-to-person interaction using self-report questionnaires and active RFID (radio frequency identification) tags with proximity sensing, cf. (Barrat et al., 2008). We focus on structural and dynamic behavioral aspects as well as on properties of the participants, i. e., gender homophily. Furthermore, we investigate the relation of social interaction networks of face-to-face proximity and networks based on self-reports, extending the analysis in (Thiele et al., 2014).

Summarizing our results, we show that there are distinctive structural and behavioral patterns in the face-to-face proximity network corresponding to the activities of the freshman week. Specifically, we analyze the evolution of contacts, as well as the individual connectivity according to the phases of the event. Furthermore, we show the influence of gender homophily on the face-to-face proximity activity.


2. Related Work

The SocioPatterns collaboration developed an infrastructure that detects close-range and face-to-face proximity (1-1.5 meters) of individuals wearing proximity tags with a temporal resolution of 20 seconds (Cattuto et al., 2010). In contrast to, e. g., Bluetooth-based methods that allow the analysis based on co-location data (Atzmueller & Hilgenberg, 2013), here face-to-face proximity can be observed with a probability of over 99% using the interval of 20 seconds for a minimal contact duration. This infrastructure has been deployed in various environments for studying the dynamics of human contacts, e. g., conferences (Cattuto et al., 2010; Atzmueller et al., 2012; Macek et al., 2012), workplaces (Atzmueller et al., 2014a), or schools (Mastrandrea et al., 2015).

The analysis of interaction and groups, and their evolution, respectively, are prominent topics in social sciences, e. g., (Turner, 1981; Atzmueller et al., 2014b). The temporal evolution of contact networks and induced communities is analyzed, for example, in (Barrat & Cattuto, 2013; Kibanov et al., 2014). Also, the evolution of social groups has been investigated in a community-based analysis (Palla et al., 2007) using bibliographic and call-detail records. Furthermore, the analysis of link relations and their prediction is investigated in, e. g., (Liben-Nowell & Kleinberg, 2003; Christoph Scholz and Martin Atzmueller and Alain Barrat and Ciro Cattuto and Gerd Stumme, 2013). Overall, social interaction networks in online and offline contexts, important features, as well as methods for analysis are summarized in (Atzmueller, 2014).

In contrast to the approaches above, this paper focuses on networks of face-to-face proximity (F2F) at a students' freshman week, combining RFID-based networks of a newly composed group with networks obtained by self-reports (SRN). To the best of the authors' knowledge, this is the first time that such an analysis has been performed using real-world networks of face-to-face proximity of a newly composed group together with the corresponding questionnaire data.

3. Dataset

The dataset contains data from 77 students (60 females and 17 males) attending the freshman week. We asked each student to wear an active RFID tag while they were staying at the facility. The RFID deployment at the freshman week utilized a variant of the MyGroup (Atzmueller et al., 2014a) system for data collection. Participants volunteered to wear active RFID proximity tags, which can sense and log the close-range face-to-face proximity of individuals wearing them.

4. Results and Future Work

We analyze data of a students' freshman week and show that there are distinctive structural patterns in the F2F data corresponding to the activities of the freshman week. This concerns both the static structure as well as the dynamic evolution of contacts and the individual connectivity in the network according to the individual phases of the event. Furthermore, we show the effects of gender homophily on the contact activity. Finally, our results also indicate existing structural associations between the face-to-face proximity network and various self-report networks. In the context of introductory courses, this points out the importance of stronger ties (long conversations) between the students at the very beginning of their studies for fostering an easier start, better cooperativeness and support between the students. Our results especially show the positive effect of the freshman week for supporting the connectivity between students; the analysis also indicates the benefit of such a course of five days with respect to the interaction and contact patterns in contrast to shorter introductory courses. Such insights into contact patterns and their dynamics enable design and modeling decision support for organizing such events and for enhancing interaction of their participants, e. g., considering group organization, recommendations, notifications, and proactive guidance.

For future work, we aim to analyze structure and semantics (Mitzlaff et al., 2011; Mitzlaff et al., 2014) further, e. g., in order to investigate if different network data can be predicted, e. g., (Scholz et al., 2012; Christoph Scholz and Martin Atzmueller and Alain Barrat and Ciro Cattuto and Gerd Stumme, 2013). For that, also multiplex networks, e. g., based on co-location proximity information (Scholz et al., 2011), can be applied. Here, subgroup discovery and exceptional model mining, e. g., (Leman et al., 2008; Atzmueller, 2015), provide interesting approaches, especially when combining compositional and structural analysis, i. e., on attributed graphs (Atzmueller et al., 2016a; Atzmueller, 2016). Furthermore, we aim to integrate our results into smart approaches, e. g., as enabled by augmenting the Ubicon platform (Atzmueller et al., 2014a), also including explanation-aware methods (Atzmueller & Roth-Berghofer, 2010). Potential goals include enhancing interactions at such events, as well as supporting the organization of such events concerning group composition and the setup of activities both at the micro- and macro-level. Developing suitable recommendation, notification, and proactive guidance systems that are triggered according to the event's structure and dynamics are further directions for future work.
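The following is a minimal Python sketch of two steps discussed above: aggregating 20-second proximity pings into contact edges with a minimum duration, and comparing the observed share of same-gender contact time with the share expected under random mixing. The numbers, participant labels, and the minimum-duration threshold are invented toy values; this is not the MyGroup/Ubicon pipeline.

```python
from itertools import combinations

RESOLUTION = 20      # seconds per proximity ping
MIN_DURATION = 60    # keep contacts lasting at least one minute

# pings[(a, b)] = number of 20-second intervals in which a and b were in face-to-face proximity
pings = {("p1", "p2"): 12, ("p1", "p3"): 2, ("p2", "p3"): 7, ("p3", "p4"): 5}
gender = {"p1": "f", "p2": "f", "p3": "m", "p4": "f"}

# 1. Aggregate pings into weighted contact edges (weight = contact time in seconds).
contacts = {pair: n * RESOLUTION for pair, n in pings.items()
            if n * RESOLUTION >= MIN_DURATION}

# 2. Observed share of same-gender contact time.
total = sum(contacts.values())
same = sum(w for (a, b), w in contacts.items() if gender[a] == gender[b])
observed = same / total

# 3. Expected share if pairs formed uniformly at random among the same participants.
pairs = list(combinations(gender, 2))
expected = sum(gender[a] == gender[b] for a, b in pairs) / len(pairs)

print(f"observed same-gender share:   {observed:.2f}")
print(f"expected under random mixing: {expected:.2f}")
```

A positive gap between the observed and the expected share would be one simple indicator of gender homophily in the contact activity.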


References Christoph Scholz and Martin Atzmueller and Alain


Barrat and Ciro Cattuto and Gerd Stumme (2013).
Atzmueller, M. (2014). Data Mining on Social Inter-
New Insights and Methods For Predicting Face-To-
action Networks. JDMDH, 1.
Face Contacts. Proc. ICWSM. AAAI Press.
Atzmueller, M. (2015). Subgroup Discovery – Ad-
Kibanov, M., Atzmueller, M., Scholz, C., & Stumme,
vanced Review. WIREs DMKD, 5, 35–49.
G. (2014). Temporal Evolution of Contacts and
Atzmueller, M. (2016). Detecting Community Pat- Communities in Networks of Face-to-Face Human
terns Capturing Exceptional Link Trails. Proc. Interactions. Sci Chi Information Sciences, 57.
IEEE/ACM ASONAM. IEEE Press.
Leman, D., Feelders, A., & Knobbe, A. (2008). Ex-
Atzmueller, M., Becker, M., Kibanov, M., Scholz, C., ceptional Model Mining. Proc. ECML-PKDD (pp.
Doerfel, S., Hotho, A., Macek, B.-E., Mitzlaff, F., 1–16). Berlin: Springer.
Mueller, J., & Stumme, G. (2014a). Ubicon and its
Liben-Nowell, D., & Kleinberg, J. M. (2003). The
Applications for Ubiquitous Social Computing. New
Link Prediction Problem for Social Networks. Proc.
Review of Hypermedia and Multimedia, 20, 53–77.
CIKM (pp. 556–559). ACM.
Atzmueller, M., Doerfel, S., Hotho, A., Mitzlaff, F.,
Macek, B.-E., Scholz, C., Atzmueller, M., & Stumme,
& Stumme, G. (2012). Face-to-Face Contacts at a
G. (2012). Anatomy of a Conference. Proc. ACM
Conference: Dynamics of Communities and Roles,
Hypertext (pp. 245–254). ACM.
vol. 7472 of LNAI. Springer.
Mastrandrea, R., Fournet, J., & Barrat, A. (2015).
Atzmueller, M., Doerfel, S., & Mitzlaff, F. (2016a).
Contact Patterns in a High School: A Compari-
Description-Oriented Community Detection using
son between Data Collected Using Wearable Sen-
Exhaustive Subgroup Discovery. Information Sci-
sors, Contact Diaries and Friendship Surveys. PLoS
ences, 329, 965–984.
ONE, 10.
Atzmueller, M., Ernst, A., Krebs, F., Scholz, C., &
Mitzlaff, F., Atzmueller, M., Benz, D., Hotho, A., &
Stumme, G. (2014b). On the Evolution of Social
Stumme, G. (2011). Community Assessment using
Groups During Coffee Breaks. Proc. WWW 2014
Evidence Networks. In Analysis of Social Media and
(Companion) (pp. 631–636). ACM.
Ubiquitous Data, vol. 6904 of LNAI, 79–98. Springer.
Atzmueller, M., & Hilgenberg, K. (2013). Towards
Mitzlaff, F., Atzmueller, M., Hotho, A., & Stumme,
Capturing Social Interactions with SDCF: An Ex-
G. (2014). The Social Distributional Hypothesis.
tensible Framework for Mobile Sensing and Ubiqui-
Journal of Social Network Analysis and Mining, 4.
tous Data Collection. Proc. MSM 2013. ACM Press.
Palla, G., Barabasi, A.-L., & Vicsek, T. (2007). Quan-
Atzmueller, M., & Roth-Berghofer, T. (2010). The
tifying Social Group Evolution. Nature, 446, 664–
Mining and Analysis Continuum of Explaining Un-
667.
covered. Proc. AI-2010. London, UK: SGAI.
Scholz, C., Atzmueller, M., & Stumme, G. (2012).
Atzmueller, M., Thiele, L., Stumme, G., & Kauffeld, S.
On the Predictability of Human Contacts: Influence
(2016b). Analyzing Group Interaction and Dynam-
Factors and the Strength of Stronger Ties. Proc. So-
ics on Socio-Behavioral Networks of Face-to-Face
cialCom (pp. 312–321). IEEE.
Proximity. Proc. ACM Ubicomp Adjunct. ACM.
Scholz, C., Doerfel, S., Atzmueller, M., Hotho, A., &
Barrat, A., & Cattuto, C. (2013). Temporal Net-
Stumme, G. (2011). Resource-Aware On-Line RFID
works, chapter Temporal Networks of Face-to-Face
Localization Using Proximity Data. Proc. ECML-
Human Interactions. Understanding Complex Sys-
PKDD (pp. 129–144). Springer.
tems. Springer.
Thiele, L., Atzmueller, M., Kauffeld, S., & Stumme, G.
Barrat, A., Cattuto, C., Colizza, V., Pinton, J.-F., den
(2014). Subjective versus Objective Captured Social
Broeck, W. V., & Vespignani, A. (2008). High Res-
Networks: Comparing Standard Self-Report Ques-
olution Dynamical Mapping of Social Interactions
tionnaire Data with Observational RFID Technol-
with Active RFID. PLoS ONE, 5.
ogy Data. Proc. Measuring Behavior. Wageningen,
Cattuto, C., Van den Broeck, W., Barrat, A., Colizza, The Netherlands.
V., Pinton, J.-F., & Vespignani, A. (2010). Dy- Turner, J. C. (1981). Towards a Cognitive Redefinition
namics of Person-to-Person Interactions from Dis- of the Social Group. Cah Psychol Cogn, 1.
tributed RFID Sensor Networks. PLoS ONE, 5.

147
Deep Learning Track
Research Papers

148
Modeling brain responses to perceived speech with LSTM networks

Julia Berezutskaya [email protected]


Zachary V. Freudenburg [email protected]
Nick F. Ramsey [email protected]
Brain Center Rudolf Magnus, Department of Neurology and Neurosurgery, University Medical Center Utrecht,
Heidelberglaan 100, 3584 CX, Utrecht, The Netherlands
Umut Güçlü [email protected]
Marcel A.J. van Gerven [email protected]
Radboud University, Donders Institute for Brain, Cognition and Behaviour, Montessorilaan 3, 6525 HR, Ni-
jmegen, The Netherlands

Keywords: LSTM, RNN, brain responses, speech

Abstract

We used recurrent neural networks with long-short term memory units (LSTM) to model the brain responses to speech based on the speech audio features. We compared the performance of the LSTM models to the performance of the linear ridge regression model and found the LSTM models to be more robust for predicting brain responses across different feature sets.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

1. Introduction

One of the approaches to understanding how the human brain processes information is through modeling the observed neural activity evoked during an experimental task. Typically, the neural activation data are collected as a response to a set of stimuli, for example pictures, audio or video clips. Then, salient features are extracted from the stimulus set and used to model the neural responses. The learned mapping is called a neural encoding model (Kay et al., 2008; Naselaris et al., 2012).

A common approach is to use hand-engineered features, which can be complex transformations of the stimulus set, and learn a linear mapping between the stimulus features and the neural responses. In case of speech, the spectrogram and non-linear spectrotemporal modulation features have been used in linear encoding models (Santoro et al., 2014).

Non-linear models of neural encoding have recently started to gain popularity in the neuroscience community, since they allow learning a more complex mapping between the stimulus features and the neural responses. In a recent study, various models from the recurrent neural network family were trained to predict the neural responses to video clips (Güçlü & van Gerven, 2017).

In the present study, we apply LSTM models to predict the neural responses to continuous speech. We use various sets of stimulus features for model training and compare the performance of the LSTM models with the performance of a linear encoding model.

2. Methods

Brain data collection and preprocessing

Fifteen patients with medication-resistant epilepsy underwent implantation of subdural electrodes (electrocorticography, ECoG). All patients gave written consent to participate in research tasks alongside the clinical procedures to determine the source of the epileptic activity. During the research task, the patients watched a 6.5 min short movie with a coherent plot (fragments of Pippi Longstocking, 1969) while their neural activity was recorded through the ECoG electrodes. The ECoG recordings were acquired with a 128 channel recording system (Micromed, Treviso, Italy) at a sampling rate (SR) of 512 Hz filtered at 0.15-134.4 Hz. All patients had electrodes in temporal and frontal cortices, implicated in auditory and language

processing (Howard et al., 2000; Hickok & Poeppel, 2007; Friederici, 2012; Kubanek et al., 2013).

The collected ECoG data were preprocessed prior to model fitting. Per patient, based on visual inspection, electrodes with noisy or flat signal were excluded from the dataset. A notch filter at 50 and 100 Hz was used to remove line noise and common average re-referencing was applied. The Gabor wavelet decomposition was used to extract neural responses in the high frequency band (HFB, 60-120 Hz) from the time domain signal. The wavelet decomposition was applied in the HFB range in 1 Hz bins with decreasing window length (4 wavelength full-width at half max). The resulting signal was averaged over the whole range to produce a single HFB neural response per electrode. The resulting neural responses were downsampled to 125 Hz. The preprocessed data were concatenated across patients over the electrode dimension (total number of electrodes = 1283).

Audio features

The soundtrack of the movie contained speech and music fragments. From the soundtrack, we constructed three input feature sets for training the models. First, we extracted the waveform of the movie soundtrack and downsampled it to 16000 Hz. To create the first, time-domain, feature set (time), we reshaped the waveform to a matrix of size N × F1, where N is the number of time points at the SR of the neural responses (125 Hz), and F1 is 128 time features (16000/125). To make the second feature set, we extracted a sound spectrogram at 128 logarithmically spaced bins in range 180-7000 Hz. This resulted in a N × F2 matrix with F2 = 128 features (freq). Finally, the spectrogram was filtered with a bank of 2D Gabor filters to extract spectrotemporal modulation energy features (Chi et al., 2005). The filtering was done at 16 logarithmically spaced bins in range 0.25-40 Hz along the temporal dimension, and 8 logarithmically spaced bins in range 0.25-4 cyc/oct along the frequency dimension. The third feature matrix N × F3 was built by concatenating all spectrotemporal modulation features: 16 × 8, F3 = 128 features (smtm). The spectrogram and the spectrotemporal modulation energy features were obtained using the NSL toolbox (Chi et al., 2005).

Linear encoding model

For each input feature set, a separate kernel linear ridge regression (Murphy, 2012) was trained to predict the neural responses to speech fragments. The HFB neural response of each electrode y_e at time point t was modeled as a linear combination of the input audio features at this time point:

    y_e(t) = \beta_e^\top x(t) + e

where e \sim \mathcal{N}(0, \sigma^2).

The L2-penalized least squares loss function was analytically minimized to estimate the regression coefficients \beta_e. The kernel trick was used to avoid large matrix inversions in the input feature space:

    \beta_e = X^\top (X X^\top + \lambda_e I_n)^{-1} y_e

where n is the number of training time points.

A nested cross-validation was used to estimate the amount of regularization \lambda_e (Güçlü & van Gerven, 2014). First, a grid of the effective degrees of freedom of the model fit was specified. Then, Newton's method was used to solve the effective degrees of freedom for \lambda_e. Finally, the \lambda_e that resulted in the lowest nested cross-validation error was taken as the final estimate.

The model was tested on 5% of all data. A five-fold cross-validation was used to validate the model performance. In each cross-validation fold different speech fragments were selected for testing, so that no data points were shared in test sets across the five folds.

Model performance was measured as the Spearman correlation between predicted and observed neural responses in the test set. The correlation values were averaged across the five cross-validation folds and were transformed to t-values for determining significance (Kendall & Stuart, 1961).

LSTM encoding models

For each input feature set, six LSTM models (Hochreiter & Schmidhuber, 1997) with varying architectures were trained to predict the neural responses to speech fragments. The six LSTM models were specified using a varying number of hidden layers (one or two) and a varying number of units per hidden layer (20, 50 or 100). A fully-connected linear layer was specified as the output layer. The neural response of each electrode y_e at time point t was modeled as a linear combination of the hidden states h(t). For models with one hidden LSTM layer (1-lstm20, 1-lstm50, 1-lstm100):

    y_e(t) = \beta_e^\top h_1(t) + b_e + e

where b_e is a bias and e \sim \mathcal{N}(0, \sigma^2).

For models with two hidden LSTM layers (2-lstm20, 2-lstm50, 2-lstm100):

    y_e(t) = \beta_e^\top h_2(t) + b_e + e
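For concreteness, the following is a minimal NumPy sketch of the kernel-form ridge estimator given above, applied to random toy data. The data dimensions and the regularization value are placeholders; in the paper, \lambda_e is selected per electrode by nested cross-validation over effective degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 128                      # training time points, audio features
X = rng.standard_normal((n, d))      # stimulus features, one row per time point
y = rng.standard_normal(n)           # HFB response of one electrode
lam = 10.0                           # placeholder regularization strength

# Kernel-trick solution: solve an n x n system instead of inverting a d x d matrix.
K = X @ X.T
beta = X.T @ np.linalg.solve(K + lam * np.eye(n), y)

# Predictions for a feature matrix X_new are then simply X_new @ beta.
y_hat = X @ beta
print("training correlation:", np.corrcoef(y, y_hat)[0, 1].round(3))
```
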

The hidden states h_1(t) were computed in the following way:

    f(t) = \sigma(U_f h_1(t-1) + W_f x(t) + b_f)
    i(t) = \sigma(U_i h_1(t-1) + W_i x(t) + b_i)
    o(t) = \sigma(U_o h_1(t-1) + W_o x(t) + b_o)
    c(t) = i(t) \tanh(U_c h_1(t-1) + W_c x(t) + b_c) + f(t) c(t-1)
    h_1(t) = o(t) \tanh(c(t))

where \sigma is the logistic sigmoid function. Vectors f(t), i(t), o(t) and c(t) correspond to four LSTM gates: forget gate, input gate, output gate and cell state, respectively. Matrices U and W contain the gate-specific weights and vectors b are the gate-specific bias vectors.

For models with two hidden LSTM layers (2-lstm20, 2-lstm50, 2-lstm100), the hidden states h_2(t) were computed in the similar way, except that the input to the cells at the second layer was h_1(t).

The mean squared error function was minimized during training using the Adam optimizer (Kingma & Ba, 2014). The models were trained using backpropagation through time, with truncation after the i-th time point, corresponding to a 500 ms lag. Each model was optimized using a validation set (5% of all data) and early stopping: training was stopped if the loss on the validation set did not decrease for 30 epochs. The model with the least loss on the validation set was used as the final model. Each model was tested on 5% of all data. The Chainer package (Tokui et al., 2015) was used for implementing the LSTM models.

The model performance was computed in the same way as for the linear model. Similarly, a five-fold cross-validation was used to validate the model performance. The correlations were transformed to t-values for determining significance.

Model performance comparison

For each model, the model performance scores (Spearman correlations) were thresholded at p < .001, Bonferroni corrected for the number of electrodes. Per each feature set, we selected the electrodes with significant performance across all the models: 53 electrodes for freq, 125 electrodes for smtm and 0 for time. The performance across the models was compared using a one-way ANOVA test. Tukey's honest significant difference (HSD) test was used post-hoc to determine pairs of models with significantly different mean performance values. Separately, per each model, we calculated the number of electrodes for which the model achieved significant performance.

3. Results

When trained on time, the performance of the linear ridge regression model was not significant (p < .001, Bonferroni corrected). All the LSTM models performed significantly above chance. When trained on freq, there was a significant difference in performance between the linear ridge regression model and the LSTM models (F(1119) = 12.65, p = 5.69 × 10^-13; Tukey's HSD test: each pair of ridge-LSTM means was significantly different at p = 0.001). When trained on smtm, there was no significant difference between the linear ridge regression model and the LSTM models (F(874) = 1.7, p = .12). Overall, the LSTM models showed good performance with all three feature sets (Fig. 1).

[Figure 1 appears here: grouped bar chart of mean Spearman correlation per model (ridge, 1-lstm20, 1-lstm50, 1-lstm100, 2-lstm20, 2-lstm50, 2-lstm100) for the time, freq and smtm feature sets.]

Figure 1. Model performance comparison between the linear ridge regression model and the six LSTM models, trained on separate feature sets: time, freq and smtm. The bars show mean model performance scores over the electrodes (Spearman correlations). The scores were significant at p < .001, Bonferroni corrected for the number of electrodes. Error bars indicate standard error of the mean.

The performance of the linear ridge regression model depended strongly on the input feature set and improved as the input features became more complex. Despite varying the parameters of the LSTM architecture, there was almost no difference in performance among the six LSTM models. We observed a significant difference in LSTM model performance for the time feature set: F(275) = 9.37, p = 2.99 × 10^-8. For both one- and two-layer LSTMs, the models with 20 hidden units performed worse compared to the models with a larger number of hidden units (based on the HSD test).

All models trained on all feature sets performed significantly above chance only in a subset of all electrodes. For all feature sets, the LSTM models achieved significant performance in a larger amount of electrodes, compared to the linear ridge regression model (Table 1).

All models but the linear ridge regression model

trained on time, showed significant performance in electrodes located in the temporal cortex (superior temporal gyrus), implicated in auditory processing (Howard et al., 2000; Norman-Haignere et al., 2015). The LSTM models trained on time performed significantly well for the electrodes in the superior temporal gyrus. The LSTM models trained on freq and smtm showed involvement of the electrodes located in the frontal cortex, as well as the posterior middle temporal and parietal cortices (Fig. 2). These cortical regions are implicated in language and other higher-level cognitive processing (Hagoort, 2013; Friederici, 2012).

Table 1. Percentage of electrodes the models performed well for when trained on each feature set. Total number of electrodes (100%) is 1283. Highest values are in bold.

Model        Time   Freq   Smtm
ridge         0%     6%    16%
1-lstm20      8%    10%    19%
1-lstm50      9%    12%    25%
1-lstm100     7%    12%    28%
2-lstm20      7%    11%    17%
2-lstm50      5%    12%    23%
2-lstm100     8%    12%    26%

Figure 2. Cortical locations of the electrodes whose responses were modeled significantly above chance (at p < .001, Bonferroni corrected for the number of electrodes) by 1-LSTM50 trained on the smtm feature set.

4. Discussion

In the present study we trained several models to predict the neural responses to perceived speech. The neural responses were obtained using ECoG. We considered a linear ridge regression model and recurrent neural network models with LSTM units varying in architecture. Each model was trained on three separate sets of audio features. We found that the performance of the linear ridge regression model depended strongly on the set of the input features. Notably, the linear ridge regression model did not achieve significant performance using the time domain features. In contrast, the LSTM models showed comparable performance across different feature sets. Using more complex audio features allowed the LSTM models to make accurate predictions for a larger set of ECoG electrodes.

There are multiple reasons why the linear ridge regression model and the LSTM models might have shown different performance when trained on time and freq. For example, the linear ridge regression model was regularized as opposed to the LSTM models presented here. Additionally, we retrained the LSTM models using a weight decay parameter for regularizing the network weights. The amount of the weight decay was cross-validated, but in multiple cases its optimal value turned out to be zero, and the overall performance of the LSTM models did not change considerably.

Other factors contributing to the superior performance of the LSTM models include the presence of non-linear transformations within the LSTM cells (σ and tanh), as well as the cell states c which accumulate the information relevant for the predictions over time. Finally, the linear ridge regression model and the LSTM models differed considerably with respect to the number of free parameters. We found it challenging to match the linear regression and neural network models with respect to all mentioned issues. Further work is necessary to determine which concrete properties of the LSTM models allowed them to outperform the linear ridge regression model when trained on time and freq.

The present work has a number of limitations. Because the placement of the ECoG grids varies across patients (depending on the tentative source of epilepsy), it is usually challenging to generalize the model performance to new patients' data. Here we used data from all patients to train the models. The model performance was then cross-validated using a five-fold cross-validation. Increasing the amount of patients could provide a larger overlap in the location of the electrodes. Then, a generalization to the data of unseen patients could be attempted.

5. Conclusions and future work

We trained several LSTM models to predict neural responses to speech based on the speech audio features and compared them to the performance of the linear ridge regression model. In general, the performance of the LSTM models was superior to the performance of the

linear ridge regression model in terms of the prediction accuracy and the amount of electrodes the models were successfully fit for. Further work is planned to investigate in detail what factors contribute to the superior performance of the LSTM models, compared to the linear ridge regression model. Some work on exploring the internal representations learned by the LSTM models (cell states) is also planned. Finally, we intend to compare the performance of RNNs with the performance of a convolutional neural network, trained on the wavelet-decomposed audio signal to predict the brain responses.

Acknowledgments

The work was supported by the NWO Gravitation grant 024.001.006.

References

Chi, T., Ru, P., & Shamma, S. A. (2005). Multiresolution spectrotemporal analysis of complex sounds. The Journal of the Acoustical Society of America, 118, 887–906.

Friederici, A. D. (2012). The cortical language circuit: from auditory perception to sentence comprehension. Trends in Cognitive Sciences, 16, 262–268.

Güçlü, U., & van Gerven, M. A. (2014). Unsupervised feature learning improves prediction of human brain activity in response to natural images. PLoS Comput Biol, 10, e1003724.

Güçlü, U., & van Gerven, M. A. (2017). Modeling the dynamics of human brain activity with recurrent neural networks. Frontiers in Computational Neuroscience, 11, 10–3389.

Hagoort, P. (2013). MUC (memory, unification, control) and beyond. Frontiers in Psychology, 4, 416.

Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.

Howard, M. A., Volkov, I., Mirsky, R., Garell, P., Noh, M., Granner, M., Damasio, H., Steinschneider, M., Reale, R., Hind, J., et al. (2000). Auditory cortex on the human posterior superior temporal gyrus. Journal of Comparative Neurology, 416, 79–92.

Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452, 352–355.

Kendall, M. G., & Stuart, A. (1961). The advanced theory of statistics (vol. 2). London: Charles W. Griffin and Co., Ltd, 1959–1963.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Kubanek, J., Brunner, P., Gunduz, A., Poeppel, D., & Schalk, G. (2013). The tracking of speech envelope in the human cortex. PLoS One, 8, e53398.

Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.

Naselaris, T., Stansbury, D. E., & Gallant, J. L. (2012). Cortical representation of animate and inanimate objects in complex natural scenes. Journal of Physiology-Paris, 106, 239–249.

Norman-Haignere, S., Kanwisher, N. G., & McDermott, J. H. (2015). Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron, 88, 1281–1296.

Santoro, R., Moerel, M., De Martino, F., Goebel, R., Ugurbil, K., Yacoub, E., & Formisano, E. (2014). Encoding of natural sounds at multiple spectral and temporal resolutions in the human auditory cortex. PLoS Comput Biol, 10, e1003412.

Tokui, S., Oono, K., Hido, S., & Clayton, J. (2015). Chainer: a next-generation open source framework for deep learning. Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).
Towards unsupervised signature extraction of forensic logs.

Stefan Thaler [email protected]


TU Eindhoven, Den Dolech 12, 5600 MB Eindhoven, Netherlands
Vlado Menkovski [email protected]
TU Eindhoven, Den Dolech 12, 5600 MB Eindhoven, Netherlands
Milan Petković [email protected]
Philips Research, High Tech Campus 34, Eindhoven, Netherlands
TU Eindhoven, Den Dolech 12, 5600 MB Eindhoven, Netherlands

Keywords: RNN auto-encoder, log signature extraction, representation learning, clustering

Abstract

Log signature extraction is the process of finding the set of templates that generated a given set of log messages. This process is an important pre-processing step for log analysis in the context of information forensics because it enables the analysis of event sequences of the examined logs. In earlier work, we have shown that it is possible to extract signatures using recurrent neural networks (RNN) in a supervised manner (Thaler et al., 2017). Given enough labeled data, this supervised approach works well, but obtaining such labeled data is labor intensive.

In this paper, we present an approach to address the signature extraction problem in an unsupervised way. We use an RNN auto-encoder to create an embedding for the log lines and we apply clustering in the embedded space to obtain the signatures.

We experimentally demonstrate on a forensic log that we can assign log lines to their signature cluster with a V-Measure of 0.94 and a Silhouette score of 0.75.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

System- and application logs track activities of users and applications on computer systems. Log messages in such logs commonly consist of at least a time stamp and a free text message. The log message's time stamp indicates when an event has happened, and the text message describes what has happened. Log messages contain relevant information about the state of the software or actions that have been performed on a system, which makes them an invaluable source of information for a forensic investigator.

Ideally, forensic investigators should be able to extract information from such logs in an automated fashion. However, extracting information in an automated way is difficult for four reasons. First, the text contents of log messages are not uniformly structured. Second, there are different types of log messages in a system log. Third, the text content may consist of variable and constant parts; the variable parts may have arbitrary values and lengths. Finally, the types of log messages change with updates of software and operating systems.

One way to enable automated information extraction is to manually create a parser for these logs. However, writing a comprehensive parser is complex and labor intensive for the same reasons that make logs difficult to analyze automatically. A solution would be to use a learning approach to extract the log signatures automatically. A log signature is the "template" that has been used to create a log message, and extracting log signatures is the task of finding the log signatures given a set of log messages.


An example of a log signature is 'Initializing cgroup subsys %s', where '%s' acts as a placeholder for a mutable part. This signature can be used to create log lines such as 'Initializing cgroup subsys pid' or 'Initializing cgroup subsys io'.

Currently, log signatures are extracted in different ways. First, there are rule-based approaches, in which signatures are manually defined, for example by using regular expressions. Rule-based approaches tend to work well when applied to logs with a limited number of signatures. Second, there are algorithmic approaches, which use custom algorithms to extract signatures from logs. These algorithms are commonly tailored to specific types of logs. Finally, in previous work, we showed that supervised RNNs can also be used to derive log signatures from forensic logs (Thaler et al., 2017).

Our work is inspired by recent advances in modeling natural language using neural networks (Le & Mikolov, 2014; Cho et al., 2014; Johnson et al., 2016). Since log lines are partially natural language, we assume that neural language models will also capture the inherent structure of log lines well.

Figure 1. We first embed log lines using an RNN auto-encoder. We then cluster the embedded log lines to obtain the signatures.

Here, we propose an approach for addressing the signature extraction problem using attentive RNN auto-encoders. Figure 1 sketches our idea. The "encoder" transforms a log line into a dense vector representation, and the "decoder" attempts to reconstruct the input log line in reverse order. Log lines that belong to the same signature are embedded close to each other in the vector space. We then cluster the learned representations and use the cluster centroids as signature descriptions.

The main contributions of this paper are:

• We present an approach for addressing the problem of extracting signatures from forensic logs. We learn representations of log lines using an attentive recurrent auto-encoder. We detail this idea in Section 2.

• We provide first empirical evidence that this approach yields results competitive with state-of-the-art signature extraction approaches. We detail our experiments in Section 3 and discuss the results in Section 4.

2. Signature extraction using attentive RNN auto-encoders

The main idea of our approach consists of two phases. In the first phase, we train an RNN auto-encoder to learn a representation of our log lines. To achieve this, we treat each log line as a sequence of words. This sequence serves as input to an RNN encoder, which encodes the sequence into a fixed-size, multi-dimensional vector. Based on this vector, the RNN decoder tries to reconstruct the reverse sequence of words. We detail this model in Section 2.1. In the second phase, we cluster the encoded log lines based on their Euclidean distance to each other. We use the centroids of the clusters as signature descriptions. We base this approach on the assumption that similar log lines are encoded close together in the embedding space. Intuitively, this assumption can be explained as follows. We let the model learn to reconstruct a log sequence from a fixed-size vector, which it previously encoded. Encoding a log line to a fixed-size vector is a sparsity constraint, which encourages the model to encode the log lines in a distributed, dense manner. The more features such encoded log lines share, the closer to each other they will be in Euclidean space.

2.1. Model

Our model is based on the attentive RNN encoder-decoder architecture that was introduced by Bahdanau et al. (2014) to address neural machine translation. We depict the schematic architecture in Figure 2. This model consists of three parts: an RNN encoder, an alignment model, and an RNN decoder.

We feed our model a sequence of n word ids w0 . . . wn. To retrieve the input word vectors for the RNN encoder we map each word to a unique vector xi. This vector is represented by a row in a word embedding matrix W ∈ Rv×d, and the row is indexed by the position of the word in the vocabulary; v is the number of words in the vocabulary and d is a hyperparameter that represents the dimension of the embedding.

For a sequence of input word vectors x0 . . . xn, the RNN encoder outputs a sequence of output vectors y0 . . . yn, one for each input, and a vector h that represents the encoded log line. h is the last hidden state of the RNN network.


The alignment model, also called attention mechanism, learns to weight the importance of the encoder's outputs for each decoding step. The output of the attention mechanism is a context vector ci that represents the weighted sum of the encoder outputs. This context vector is calculated for each decoding step. The alignment model increases the reconstruction quality of the decoded sequences.

The decoder's task is to predict the reversed input word sequence. It predicts the words for each time step, using the information of the encoded vector h and the context vector ci.

Figure 2. Architecture of our model. We use an attentive RNN auto-encoder to encode log lines.

Instead of using our model for translation, we use it to predict the reverse sequence of our input sequence. We do so because we want the model to learn an embedding space for our log lines and not to translate sentences. Also, in contrast to Bahdanau et al. (2014), we use a single- instead of a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) for encoding our input sequences.

2.2. Learning objective

The learning objective of our problem is, given a sequence of words, to correctly predict the reverse sequence, word by word. We calculate the loss of a mispredicted word by using sampled softmax loss (Jean et al., 2014). Sampled softmax loss approximates the categorical cross-entropy loss between the embedded target word and the predicted word. We motivate our choice for using sampled softmax mainly because we assume potentially very large vocabularies in large log files, due to the variable parts of the logs.

The learning objective "forces" the model to learn which information is important for reconstructing a log line. In other words, we learn a lossy compression function of our log lines.

2.3. Optimization procedure

To train our model, we use Adam, which is a form of stochastic gradient descent (Kingma & Ba, 2014). During training, we use dropout to prevent overfitting (Srivastava et al., 2014), and gradient clipping to prevent exploding gradients (Pascanu et al., 2013).

3. Experiment setup

For our experiment, we trained the model that we introduced in Section 2. The RNN encoder, the RNN decoder, and the alignment model have 128 units each. The gradients are calculated in mini-batches of 10 log lines and gradients are clipped at 0.5. We trained each model with a learning rate of 0.001 and a dropout rate of 0.3. We drew 500 samples for our sampled softmax learning objective. We determined the hyperparameters empirically using a random search strategy.

We pad input and output sequences of different lengths with zeros at the end of the sequences. Additionally, we add a special token that marks the beginning and the end of a sequence of words.

We then hierarchically cluster the encoded log lines using the Farthest Point Algorithm. We use the Euclidean distance as a distance metric and a clustering threshold of 0.50, which we empirically determined.
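The second phase can be sketched as follows. The farthest-point criterion corresponds to complete linkage in SciPy; the function and variable names are ours and this is an illustration, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_embeddings(embeddings, threshold=0.50):
    """Hierarchically cluster encoded log lines (rows of `embeddings`) with
    complete (farthest-point) linkage and Euclidean distances, cutting the
    dendrogram at the given distance threshold."""
    Z = linkage(embeddings, method="complete", metric="euclidean")
    labels = fcluster(Z, t=threshold, criterion="distance")
    # Use the centroid of each cluster as its signature description.
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}
    return labels, centroids
```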


We compare our approach to two state-of-the-art signature extraction algorithms, LogCluster (Vaarandi & Pihelgas, 2015) and IPLoM (Makanju et al., 2012). We chose these two approaches for two reasons. First, they scored high in a side-by-side comparison (He et al., 2016). Second, both approaches are capable of finding log clusters when the number of log clusters is not specified upfront. We used the authors' implementation of LogCluster for our experiments1, and the implementation of IPLoM provided by He et al. (2016)2. IPLoM supports four hyperparameters: a cluster goodness threshold, a partition support threshold, and an upper and lower bound for clustering. The best performing hyperparameters of IPLoM were a cluster goodness threshold of 0.125, a partition support threshold of 0.0, a lower bound of 0.25 and an upper bound of 0.9. LogCluster has two hyperparameters: 'support', the minimum number of supported log lines to cluster two log lines, and 'wfreq', the word frequency required to substitute a word with a placeholder. We ran the LogCluster algorithm with a 'support' value of 2 and a 'wfreq' value of 0.9. For some log lines, LogCluster did not extract a matching signature. We assigned each of such log lines to their own cluster.

1 http://ristov.github.io/logcluster/logcluster-0.08.tar.gz
2 https://github.com/cuhk-cse/logparser/tree/20882dabb01aa6e1e7241d4ff239121ec978a2fe

3.1. Evaluation Metrics

To evaluate the quality of the extracted signatures, we evaluate the clusters that we have found. We use two metrics to evaluate the quality of our clusters, the Silhouette score and the V-Measure.

The Silhouette score measures the relative quality of clusters (Rousseeuw, 1987) by calculating and relating intra- and inter-cluster distances. The Silhouette score ranges from -1.0 to 1.0. A score of -1.0 indicates many overlapping clusters and a score of 1.0 means perfect clusters.

The V-Measure is a harmonized mean between the homogeneity and completeness of clusters and captures a clustering solution's success in including all and only data-points from a given class in a given cluster (Rosenberg & Hirschberg, 2007). The V-Measure addresses the clustering "matching" problem and is in this context more appropriate than the F1-score.

3.2. Datasets

To test our idea, we created a forensic log. We extracted this forensic log from a Linux system disk image using log2timeline3. Log2timeline is a forensic tool that extracts event information from storage media and combines this information in a single file. This log dataset consists of 11023 log messages and 856 signatures. We manually extracted the signatures of these logs and verified their correctness using the Linux source code. The vocabulary size, i.e. the number of unique words, of our dataset is 4282 and the maximum log message length is 135. Due to the nature of forensic logs, we assume that the number of unique words will grow in larger logs.

3 https://github.com/log2timeline/plaso/wiki/Using-log2timeline

4. Results and discussion

In this section, we present the results of our experiments and discuss our findings. Table 1 summarizes the results. The columns, from left to right, contain the approach, the Silhouette score of the found clusters, and the V-measure of the found clusters.

Table 1. Experiment results.

Approach                        Silh.   V-Meas.
(Vaarandi & Pihelgas, 2015)     N/A     0.881
(Makanju et al., 2012)          N/A     0.824
RNN-AE + Cluster (ours)         0.749   0.944

LogCluster creates log line clusters that have a homogeneity of 1.00 and a completeness of 0.777, which yields a V-measure of 0.881. IPLoM creates log line clusters that have a homogeneity of 0.761 and a completeness of 0.898, which yields a V-measure of 0.824. Our approach creates log line clusters that have a homogeneity of 0.990 and a completeness of 0.905, which yields a V-measure of 0.944. Additionally, the clusters formed by our approach have a Silhouette score of 0.749.

None of the tested approaches manages to find all signatures of our forensic log perfectly. In contrast to that, in He et al.'s evaluation (He et al., 2016), both LogCluster and IPLoM achieve a perfect F1 score. We explain this difference by the fact that our dataset is more difficult to analyze than the datasets presented in He et al.'s evaluation. The most difficult dataset there had 376 signatures, but 4.7 million log lines, whereas our dataset has only 11023 log lines and 856 signatures. Both IPLoM and LogCluster have been designed for the first case, with few signatures and many log lines per signature.

Our approach creates almost homogeneous clusters. We illustrate the clusters that are found in Figure 3. Figure 3 shows a sample of 15 log lines and how they are hierarchically clustered together. When two log lines are identical, they have a distance close to zero, which means they are embedded almost on the same spot. If they are related, they are closer to each other than other log lines. Our approach makes very few mistakes of grouping together log lines that do not belong together. Instead, the clusters are incomplete, which means that some clusters should be grouped together but are not.

One drawback of our approach is the increased computing requirements. IPLoM processes our forensic log on average in under a second, LogCluster needs about 23 seconds to process it, whereas training our model for the presented task needs 715 seconds. These increased performance requirements may become a problem on larger datasets. As with the state-of-the-art algorithms, the threshold to separate the clusters has to be manually determined. Our goal is to extract human-understandable log signatures. Currently, we only obtain the centroids of embedded clusters, which allow us to cluster log lines according to their signature. However, these centroids cannot effectively be interpreted by humans.
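The two quality scores from Section 3.1 reported in Table 1 can be computed directly with scikit-learn; a minimal sketch, assuming arrays of ground-truth signature labels, predicted cluster labels, and the encoded log lines (variable names are ours):

```python
from sklearn.metrics import silhouette_score, v_measure_score

# true_signatures: ground-truth signature id per log line
# labels:          cluster id assigned to each log line
# embeddings:      the encoded log lines used for clustering
v = v_measure_score(true_signatures, labels)
s = silhouette_score(embeddings, labels, metric="euclidean")
print(f"V-Measure: {v:.3f}, Silhouette: {s:.3f}")
```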


Figure 3. Cluster dendrogram of 15 log lines. The y-axis displays the log lines and the x-axis the distance to each other.

5. Related Work

Log signature extraction has been studied to achieve a variety of goals such as anomaly and fault detection in logs (Vaarandi, 2003; Fu et al., 2009), pattern detection (Vaarandi, 2003; Makanju et al., 2009; Aharon et al., 2009), profile building (Vaarandi & Pihelgas, 2015), forensic analysis (Thaler et al., 2017) or compression of logs (Tang et al., 2011; Makanju et al., 2009). Most of these approaches motivated their signature extraction approach by the large and rapidly increasing volume of log data that needed to be analyzed (Vaarandi, 2003; Makanju et al., 2009; Fu et al., 2009; Aharon et al., 2009; Tang et al., 2011; Vaarandi & Pihelgas, 2015).

Many NLP-related problems have been addressed using neural networks. Collobert et al. were one of the first to successfully apply neural models to a broad variety of NLP-related tasks (Weston & Karlen, 2011). Their approach has been followed by other neural models for similar tasks, e.g. (Dyer et al., 2015; Lample et al., 2016). Also, a variety of language modeling tasks have been tackled using neural architectures, e.g. (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014).

Auto-encoders have been successfully applied to clustering tasks. For example, auto-encoders have been used to cluster text and images (Xie et al., 2015), and variational recurrent auto-encoders have been used to cluster music snippets (Fabius & van Amersfoort, 2014).

6. Conclusions and future work

We have presented an approach that uses attentive RNN auto-encoder models to address the problem of extracting signatures from forensic logs. We use the auto-encoder to learn a representation of our forensic logs, and cluster the embedded logs to retrieve our signatures. This approach finds signature clusters in our forensic log dataset with a V-Measure of 0.94 and a Silhouette score of 0.75. These results are comparable to the state-of-the-art approaches.

We plan to extend our work in several ways. So far we have only clustered log lines. To complete our objective, we also need a method for extracting human-readable descriptions of these signature clusters. We plan to use the outputs of the attention network to aid the extraction of log signatures from the clustered log lines. Furthermore, we intend to explore regularization techniques that help improve the quality of the extracted signatures. Finally, we intend to demonstrate the feasibility and competitiveness of our approach on large datasets and datasets with fewer signatures.

Acknowledgments

This work has been partially funded by the Dutch national program COMMIT under the Big Data Veracity project.


References

Aharon, M., Barash, G., Cohen, I., & Mordechai, E. (2009). One graph is worth a thousand logs: Uncovering hidden structures in massive system event logs. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (pp. 227–243).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015, 1–15.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
Dyer, C., Ballesteros, M., Ling, W., Matthews, A., & Smith, N. A. (2015). Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075.
Fabius, O., & van Amersfoort, J. R. (2014). Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.
Fu, Q., Lou, J.-G., Wang, Y., & Li, J. (2009). Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis. ICDM (pp. 149–158).
He, P., Zhu, J., He, S., Li, J., & Lyu, M. R. (2016). An evaluation study on log parsing and its use in log mining. Proceedings - 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, 654–661.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9, 1735–1780.
Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2014). On Using Very Large Target Vocabulary for Neural Machine Translation. Cikm (pp. 785–794).
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., & others (2016). Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv preprint arXiv:1611.04558.
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.
Makanju, A., Zincir-Heywood, A. N., & Milios, E. E. (2012). A Lightweight Algorithm for Message Type Extraction in System Application Logs. IEEE Transactions on Knowledge and Data Engineering, 24, 1921–1936.
Makanju, A. A. O., Zincir-Heywood, A. N., & Milios, E. E. (2009). Clustering event logs using iterative partitioning. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09 (p. 1255).
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML (3), 28, 1310–1318.
Rosenberg, A., & Hirschberg, J. (2007). V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. EMNLP-CoNLL (pp. 410–420).
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (pp. 3104–3112).
Tang, L., Li, T., & Perng, C.-S. (2011). LogSig: Generating System Events from Raw Textual Logs.
Thaler, S., Menkovski, V., & Petković, M. (2017). Towards a neural language model for signature extraction from forensic logs. 2017 5th International Symposium on Digital Forensic and Security (ISDFS) (pp. 1–6). IEEE.
Vaarandi, R. (2003). A Data Clustering Algorithm for Mining Patterns From Event Logs. Computer Engineering (pp. 119–126).


Vaarandi, R., & Pihelgas, M. (2015). LogCluster - A Data Clustering and Pattern Mining Algorithm for Event Logs. 12th International Conference on Network and Service Management - CNSM 2015 (pp. 1–8).
Weston, J., & Karlen, M. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, 2493–2537.
Xie, J., Girshick, R., & Farhadi, A. (2015). Unsupervised Deep Embedding for Clustering Analysis. arXiv preprint arXiv:1511.06335.

Deep Learning Track
Extended Abstracts

Improving Variational Auto-Encoders using convex combination
linear Inverse Autoregressive Flow

Jakub M. Tomczak [email protected]


Max Welling [email protected]
University of Amsterdam, the Netherlands

Keywords: Variational Inference, Deep Learning, Normalizing Flow, Generative Modelling

Abstract

In this paper, we propose a new volume-preserving flow and show that it performs similarly to the linear general normalizing flow. The idea is to enrich a linear Inverse Autoregressive Flow by introducing multiple lower-triangular matrices with ones on the diagonal and combining them using a convex combination. In the experimental studies on MNIST and Histopathology data we show that the proposed approach outperforms other volume-preserving flows and is competitive with the current state-of-the-art linear normalizing flow.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Variational Auto-Encoders and Normalizing Flows

Let x be a vector of D observable variables, z ∈ RM a vector of stochastic latent variables, and let p(x, z) be a parametric model of the joint distribution. Given data X = {x1, . . . , xN} we typically aim at maximizing the marginal log-likelihood, ln p(X) = Σ_{i=1}^{N} ln p(xi), with respect to the parameters. However, when the model is parameterized by a neural network (NN), the optimization could be difficult due to the intractability of the marginal likelihood. A possible manner of overcoming this issue is to apply variational inference and optimize the following lower bound:

ln p(x) ≥ E_{q(z|x)}[ln p(x|z)] − KL(q(z|x) || p(z)),   (1)

where q(z|x) is the inference model (an encoder), p(x|z) is called a decoder and p(z) = N(z|0, I) is the prior. There are various ways of optimizing this lower bound, but for continuous z this could be done efficiently through a re-parameterization of q(z|x) (Kingma & Welling, 2013), (Rezende et al., 2014), which yields a variational auto-encoder architecture (VAE).

Typically, a diagonal covariance matrix of the encoder is assumed, i.e., q(z|x) = N(z|µ(x), diag(σ2(x))), where µ(x) and σ2(x) are parameterized by the NN. However, this assumption can be insufficient and not flexible enough to match the true posterior.

A manner of enriching the variational posterior is to apply a normalizing flow (Tabak & Turner, 2013), (Tabak & Vanden-Eijnden, 2010). A (finite) normalizing flow is a powerful framework for building a flexible posterior distribution by starting with an initial random variable with a simple distribution for generating z(0) and then applying a series of invertible transformations f(t), for t = 1, . . . , T. As a result, the last iteration gives a random variable z(T) that has a more flexible distribution. Once we choose transformations f(t) for which the Jacobian-determinant can be computed, we aim at optimizing the following lower bound (Rezende & Mohamed, 2015):

ln p(x) ≥ E_{q(z(0)|x)}[ ln p(x|z(T)) + Σ_{t=1}^{T} ln |det(∂f(t)/∂z(t−1))| ] − KL(q(z(0)|x) || p(z(T))).   (2)

The fashion in which the Jacobian-determinant is handled determines whether we deal with general normalizing flows or volume-preserving flows. The general normalizing flows aim at formulating the flow for which the Jacobian-determinant is relatively easy to compute. On the contrary, the volume-preserving flows design series of transformations such that the Jacobian-determinant equals 1 while still allowing flexible posterior distributions to be obtained.
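To make this concrete, a generic sketch of how a finite flow enters the bound in Eq. (2) is given below. The interface, where each transformation returns its log-Jacobian-determinant alongside the transformed sample, is our own assumption for illustration; for volume-preserving transformations every such term is zero.

```python
import torch

def apply_flow(z0, transforms):
    """Apply z_T = f_T(...f_1(z_0)) and accumulate the sum of
    ln|det(df^(t)/dz^(t-1))| terms from Eq. (2)."""
    z, log_det = z0, torch.zeros(z0.shape[0])
    for f in transforms:
        z, ld = f(z)              # each transform returns (new z, log|det Jacobian|)
        log_det = log_det + ld    # stays 0 for volume-preserving transforms
    return z, log_det
```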


In this paper, we propose a new volume-preserving flow and show that it performs similarly to the linear general normalizing flow.

2. New Volume-Preserving Flow

In general, we can obtain a more flexible variational posterior if we model a full-covariance matrix using a linear transformation, namely, z(1) = Lz(0). However, in order to take advantage of the volume-preserving flow, the Jacobian-determinant of L must be 1. This could be accomplished in different ways, e.g., L is an orthogonal matrix or it is a lower-triangular matrix with ones on the diagonal. The former idea was employed by the Householder flow (HF) (Tomczak & Welling, 2016) and the latter one by the linear Inverse Autoregressive Flow (LinIAF) (Kingma et al., 2016). In both cases, the encoder outputs an additional set of variables that are further used to calculate L. In the case of the LinIAF, the lower-triangular matrix with ones on the diagonal is given by the NN explicitly.

However, in the LinIAF a single matrix L could not fully represent variations in data. In order to alleviate this issue we propose to consider K such matrices, {L1(x), . . . , LK(x)}. Further, to obtain the volume-preserving flow, we propose to use a convex combination of these matrices, Σ_{k=1}^{K} yk(x)Lk(x), where y(x) = [y1(x), . . . , yK(x)]⊤ is calculated using the softmax function, namely, y(x) = softmax(NN(x)), where NN(x) is the neural network used in the encoder. Eventually, we have the following linear transformation with the convex combination of the lower-triangular matrices with ones on the diagonal:

z(1) = ( Σ_{k=1}^{K} yk(x) Lk(x) ) z(0).   (3)

The convex combination of lower-triangular matrices with ones on the diagonal results again in a lower-triangular matrix with ones on the diagonal, thus |det( Σ_{k=1}^{K} yk(x)Lk(x) )| = 1. This formulates the volume-preserving flow we refer to as convex combination linear IAF (ccLinIAF).

3. Experiments

Datasets  In the experiments we use two datasets: the MNIST dataset1 (LeCun et al., 1998) and the Histopathology dataset (Tomczak & Welling, 2016). The first dataset contains 28×28 images of handwritten digits (50,000 training images, 10,000 validation images and 10,000 test images) and the second one contains 28×28 gray-scaled image patches of histopathology scans (6,800 training images, 2,000 validation images and 2,000 test images). For both datasets we used a separate validation set for hyper-parameter tuning.

1 We used the dynamically binarized dataset as in (Salakhutdinov & Murray, 2008).

Table 1. Comparison of the lower bound of marginal log-likelihood measured in nats of the digits in the MNIST test set. Lower value is better. Some results are presented after: ♣ (Rezende & Mohamed, 2015), ♦ (Dinh et al., 2014), ♠ (Salimans et al., 2015), ♥ (Tomczak & Welling, 2016)

Method                  ≤ ln p(x)
VAE                     −93.9
VAE+NF (T=10) ♣         −87.5
VAE+NF (T=80) ♣         −85.1
VAE+NICE (T=10) ♦       −88.6
VAE+NICE (T=80) ♦       −87.2
VAE+HVI (T=1) ♠         −91.7
VAE+HVI (T=8) ♠         −88.3
VAE+HF (T=1) ♥          −87.8
VAE+HF (T=10) ♥         −87.7
VAE+LinIAF              −85.8
VAE+ccLinIAF (K=5)      −85.3

Set-up  In both experiments we trained the VAE with 40 stochastic hidden units, and the encoder and the decoder were parameterized with two-layered neural networks (300 hidden units per layer) and the gate activation function (van den Oord et al., 2016), (Tomczak & Welling, 2016). The number of combined matrices was determined using the validation set, and taking more than 5 matrices resulted in no performance improvement. For training we utilized ADAM (Kingma & Ba, 2014) with a mini-batch size of 100 and one example for estimating the expected value. The learning rate was set according to the validation set. The maximum number of epochs was 5000 and early-stopping with a look-ahead of 100 epochs was applied. We used warm-up (Bowman et al., 2015), (Sønderby et al., 2016) for the first 200 epochs. We initialized weights according to (Glorot & Bengio, 2010).

We compared our approach to the linear normalizing flow (VAE+NF) (Rezende & Mohamed, 2015), and to finite volume-preserving flows: NICE (VAE+NICE) (Dinh et al., 2014), HVI (VAE+HVI) (Salimans et al., 2015), HF (VAE+HF) (Tomczak & Welling, 2016), and linear IAF (VAE+LinIAF) (Kingma et al., 2016) on the MNIST data, and to VAE+HF on the Histopathology data. The methods were compared according to the lower bound of marginal log-likelihood measured on the test set.


Kingma, D., & Ba, J. (2014). ADAM: A


Table 2. Comparison of the lower bound of marginal log-
method for stochastic optimization. arXiv preprint
likelihood measured in nats of the image patches in the
Histopathology test set. Higher value is better. The experi- arXiv:1412.6980.
ment was repeated 3 times. The results for VAE+HF are Kingma, D. P., Salimans, T., Józefowicz, R., Chen,
taken from: ♥ (Tomczak & Welling, 2016).
X., Sutskever, I., & Welling, M. (2016). Improving
Method ≤ ln p(x) variational inference with inverse autoregressive flow.
VAE ♥ 1371.4 ± 32.1 NIPS.
VAE+HF (T =1) ♥ 1388.0 ± 22.1
VAE+HF (T =10) ♥ 1397.0 ± 15.2 Kingma, D. P., & Welling, M. (2013). Auto-encoding
VAE+HF (T =20) ♥ 1398.3 ± 8.1 variational bayes. arXiv preprint arXiv:1312.6114.
VAE+LinIAF 1388.6 ± 71
VAE+ccLinIAF(K=5) 1413.8 ± 22.9 LeCun, Y., Cortes, C., & Burges, C. J. (1998). The
MNIST database of handwritten digits.

reveal that the proposed flow outperforms all volume- Li, Y., & Turner, R. E. (2016). Rényi divergence varia-
preserving flows and performs similarly to the linear tional inference. arXiv preprint arXiv:1602.02311.
normalizing flow with large number of transformations. Oord, A. v. d., Kalchbrenner, N., & Kavukcuoglu, K.
The advantage of using several matrices instead of one (2016). Pixel recurrent neural networks. ICML, 1747–
is especially apparent on the Histopathology data where 1756.
the VAE+ccLinIAF performed better by about 15nats
than the VAE+LinIAF. Hence, the convex combina- Rezende, D., & Mohamed, S. (2015). Variational Infer-
tion of the lower-triangular matrices with ones on the ence with Normalizing Flows. ICML (pp. 1530–1538).
diagonal seems to allow to better reflect the data with
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014).
small additional computational burden.
Stochastic backpropagation and approximate infer-
ence in deep generative models. arXiv preprint
Acknowledgments arXiv:1401.4082.
The research conducted by Jakub M. Tomczak was Salakhutdinov, R., & Murray, I. (2008). On the quan-
funded by the European Commission within the Marie titative analysis of deep belief networks. ICML (pp.
Skłodowska-Curie Individual Fellowship (Grant No. 872–879).
702666, ”Deep learning and Bayesian inference for med-
ical imaging”). Salimans, T., Kingma, D. P., & Welling, M. (2015).
Markov chain Monte Carlo and Variational Inference:
Bridging the gap. ICML (pp. 1218–1226).
References
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby,
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M.,
S. K., & Winther, O. (2016). Ladder variational
Jozefowicz, R., & Bengio, S. (2015). Generating
autoencoders. arXiv preprint arXiv:1602.02282.
sentences from a continuous space. arXiv preprint
arXiv:1511.06349. Tabak, E., & Turner, C. V. (2013). A family of nonpara-
metric density estimation algorithms. Communica-
Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). tions on Pure and Applied Mathematics, 66, 145–164.
Importance weighted autoencoders. arXiv preprint
arXiv:1509.00519. Tabak, E. G., & Vanden-Eijnden, E. (2010). Density
estimation by dual ascent of the log-likelihood. Com-
Dinh, L., Krueger, D., & Bengio, Y. (2014). Nice: Non- munications in Mathematical Sciences, 8, 217–233.
linear independent components estimation. arXiv
Tomczak, J. M., & Welling, M. (2016). Improving
preprint arXiv:1410.8516.
Variational Auto-Encoders using Householder Flow.
arXiv preprint arXiv:1611.09630.
Glorot, X., & Bengio, Y. (2010). Understanding the dif-
ficulty of training deep feedforward neural networks. van den Oord, A., Kalchbrenner, N., Espeholt, L.,
AISTATS (pp. 249–256). Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016).
Conditional image generation with pixelcnn decoders.
Householder, A. S. (1958). Unitary triangularization Advances in Neural Information Processing Systems
of a nonsymmetric matrix. Journal of the ACM (pp. 4790–4798).
(JACM), 5, 339–342.

The use of shallow convolutional neural networks in predicting
promoter strength in Escherichia coli

Jim Clauwaert1 [email protected]


Michiel Stock1 [email protected]
Marjan De Mey2 [email protected]
Willem Waegeman1 [email protected]
1
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure
links 653, 9000, Ghent, Belgium
2
InBio, Centre for Industrial Biotechnology and Biocatalysis, Ghent University, Coupure links 653, 9000, Ghent,
Belgium

Keywords: artificial neural networks, promoter engineering, E. coli

Abstract

Gene expression is an important factor in many processes of synthetic biology. The use of well-characterized promoter libraries makes it possible to obtain reliable estimates on the transcription rates in genetic circuits. Yet, the relation between promoter sequence and transcription rate is largely undiscovered. Through the use of shallow convolutional neural networks, we were able to create models with good predictive power for promoter strength in E. coli.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

1. Introduction

The binding region of the transcription unit, called the promoter region, is known to play a key role in the transcription rate of downstream genes. In Eubacteria, the sigma factor (σ) binds with the RNA polymerase subunit (αββ′ω) to create RNA polymerase. Being part of the RNA polymerase holoenzyme, the sigma element acts as the connection between RNA polymerase and DNA (Gruber & Gross, 2003). The use of prokaryotic organisms such as E. coli is indispensable in research and the biotechnological industry. As such, multiple studies have investigated creating predictive models for promoter strength (De Mey, 2007; Meng et al., 2017). As of now, existing models are trained using small in-house promoter libraries, without evaluation on external public datasets.

Following the recent success of artificial neural networks (Alipanahi et al., 2015), inspiration was taken to create specialized models for promoter strength. Due to the low number of promoter libraries, models were instead trained to predict the presence of a promoter region within a DNA sequence, using the more abundant ChIP-chip data. To evaluate whether the model gives a good indication of promoter strength, the predicted score of the model was compared with the given promoter strength scores of existing promoter libraries. The use of several custom model architectures is considered.

Figure 1. Basic layout of the first model. The sequence, transformed into a 4 × 50 binary image, is evaluated by the first convolutional layer outputting a vector of scores for every 4 × l kernel (m). Rectified outputs are maximum pooled and fed into the third layer. The sigmoid transform of the softmax layer results in a probability score for each class.
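The 4 × 50 binary input encoding described in Figure 1 can be sketched as follows. The row order for the bases and the padding behaviour are our own illustrative choices, not prescribed by the paper, and the example sequence is arbitrary.

```python
import numpy as np

def one_hot_sequence(seq, length=50):
    """Encode a DNA sequence as a 4 x `length` binary image (rows: A, C, G, T)."""
    base_index = {"A": 0, "C": 1, "G": 2, "T": 3}
    image = np.zeros((4, length), dtype=np.float32)
    for j, base in enumerate(seq[:length].upper()):
        if base in base_index:              # unknown bases remain all-zero columns
            image[base_index[base], j] = 1.0
    return image

# Example: an arbitrary 50-bp sequence becomes a 4 x 50 binary image.
x = one_hot_sequence("TTGACAATTAATCATCGAACTAGTTAACTAGTACGCAAGTTCACGTAAAA")
```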


2. Results

ChIP-chip data of the σ70 transcription factor was used from cells in the exponential phase (Cho et al., 2014). ChIP-chip experiments give an affinity measure between the RNA polymerase holoenzyme and DNA, pinpointing possible promoter sites. Due to the high noise of ChIP-chip data, a classification approach was taken to build the model. The convolutional neural network is fed binary images (4 × 50) of the sequences, following the work of Alipanahi (2015). Promoter sequences within the dataset are determined using both a transcription start site mapping (Cho et al., 2014) and the selection of the highest peaks.

Multiple architectures have been considered, with small changes applied to the basic model (M1) given in Figure 1. The general model architecture is largely based upon the work of Alipanahi (2015). To retain positional data of high-scoring subsequences, a second model (M2) uses the pooled outputs of subregions of the sequence based upon the length (l) of the kernel (motif). Thus, a motif with length l = 10 creates 50/l = 5 outputs for every kernel in the convolutional layer. A further adjustment first splits the sequence into subsequences according to the motif length. This reduces the number of outputs created in the first layer of the previous model. When training 4 × 10 kernels, five subsequences are created. Motifs are trained uniquely on one of the parts. The third model (M3) retains positional information, while having the same complexity as M1, albeit at a cost of flexibility.

To get an insight into the performances of the models, the use of long (4 × 25) and short (4 × 10) motifs has been evaluated. Model training was repeated 50 times to account for model variety. In this study we trained models for binary classification, predicting the existence of a promoter region within a given DNA sequence. To verify whether the given models can furthermore give reliable predictions on promoter strength, following the idea that stronger promoters are more likely to be recognized, promoter libraries have been ranked on the probability scores of the model. The Spearman's rank correlation coefficient is used as a measure of similarity of ranking between the probability scores and the given scores within each promoter library. Table 1 gives an overview of the performances on the test set and external datasets.

Table 1. Performance measures of the models on the test set (AUC) and external datasets (Spearman's rank correlation). Given values are the averaged scores of the repeated experiments. The standard deviation is given within brackets.

     kernel   test set      Anderson      Brewster (2012)  Davis (2011)  Mutalik^a (2013)  Mutalik^b (2013)
     size     38984 seq.    19 prom.      18 prom.         10 prom.      118 prom.         137 prom.
M1   4 × 25   0.79 (0.02)   0.15 (0.19)   0.81 (0.09)      0.74 (0.12)   0.40 (0.07)       0.22 (0.07)
     4 × 10   0.79 (0.02)   0.25 (0.16)   0.81 (0.11)      0.77 (0.08)   0.45 (0.04)       0.23 (0.04)
M2   4 × 25   0.78 (0.02)   0.20 (0.12)   0.74 (0.11)      0.81 (0.14)   0.50 (0.08)       0.16 (0.05)
     4 × 10   0.77 (0.02)   -0.16 (0.14)  0.78 (0.07)      0.68 (0.10)   0.41 (0.06)       0.12 (0.07)
M3   4 × 25   0.79 (0.02)   0.38 (0.14)   0.82 (0.10)      0.84 (0.08)   0.53 (0.07)       0.25 (0.10)
     4 × 10   0.76 (0.01)   0.70 (0.14)   0.83 (0.06)      0.84 (0.08)   0.47 (0.08)       0.41 (0.15)

a: part of promoter library with variety within the -35 and -10 box
b: part of promoter library with variation over the whole sequence

3. Discussion

We found that the introduction of the proposed model architectures shows significant improvement in ranking known promoter libraries by promoter strength. The results furthermore show that retaining positional data can offer non-trivial boosts to smaller kernel sizes. Yet, the M2 results show that these advantages are outweighed for smaller kernels, where an increased model complexity reduces overall scores. For longer motifs, M2 still offers a boost in performance, as the increase in features compared to M1 is small. M3, gaining positional information of the motifs without the cost of any additional complexity, shows the best results for each dataset using both short and long motifs. The use of small kernels, with the exception of M2, generally offers better scores. The performance of the models in identifying promoter regions on the test set shows little variation, with AUC scores reaching 0.80. Further optimizations to both the architecture of the model and the selection of hyperparameters can prove to further increase model performance.

4. Conclusion

A comprehensive tool for promoter strength prediction, in line with the creation of the ribosome binding site calculator (Salis, 2011), is highly anticipated in the research community. This study shows the potential of using an indirect approach in creating predictive models for promoter strength.


References

Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnol, 33, 831–838.
Brewster, R. C., Jones, D. L., & Phillips, R. (2012). Tuning Promoter Strength through RNA Polymerase Binding Site Design in Escherichia coli. PLoS Computational Biology, 8.
Cho, B.-K., Kim, D., Knight, E. M., Zengler, K., & Palsson, B. O. (2014). Genome-scale reconstruction of the sigma factor network in Escherichia coli: topology and functional states. BMC Biology, 12, 4.
Davis, J. H., Rubin, A. J., & Sauer, R. T. (2011). Design, construction and characterization of a set of insulated bacterial promoters. Nucleic Acids Research, 39, 1131–1141.
De Mey, M. (2007). Construction and model-based analysis of a promoter library for E. coli: An indispensable tool for metabolic engineering. BMC Biotechnology, 7, 34.
Gruber, T. M., & Gross, C. A. (2003). Multiple sigma subunits and the partitioning of bacterial transcription space. Annual Review of Microbiology, 57, 441–66.
Meng, H., Ma, Y., Mai, G., Wang, Y., & Liu, C. (2017). Construction of precise support vector machine based models for predicting promoter strength. Quantitative Biology.
Mutalik, V. K., Guimaraes, J. C., Cambray, G., Lam, C., Christoffersen, M. J., Mai, Q.-A., Tran, A. B., Paull, M., Keasling, J. D., Arkin, A. P., & Endy, D. (2013). Precise and reliable gene expression via standard transcription and translation initiation elements. Nature Methods, 10, 354–60.
Salis, H. M. (2011). The ribosome binding site calculator. Methods in Enzymology, 498, 19–42.

Normalisation for painting colourisation

Nanne van Noord [email protected]


Cognitive Science and Artificial Intelligence group, Tilburg University

Keywords: image colourisation, convolutional neural network, normalisation

1. Introduction

Recent work on style transfer (mapping the style from one image onto the content of another) has shown that Instance Normalisation (InstanceNorm), when compared to Batch Normalisation (BatchNorm), speeds up neural network training and produces better looking results (Ulyanov et al., 2016; Dumoulin et al., 2016; Huang & Belongie, 2017). While the benefits of normalisation for neural network training are fairly well understood (LeCun et al., 2012; Ioffe & Szegedy, 2015), it is unclear why using instance over batch statistics gives such an improvement. Huang and Belongie (2017) propose the intuitive explanation that for style transfer BatchNorm centers all samples around a single style whereas InstanceNorm performs style normalisation.

Motivated by these developments in style transfer, I set out to explore whether similar benefits can be found in another style-dependent image-to-image translation task, namely painting colourisation. Here we consider painting colourisation a variant of image colourisation, focused on performing the colourisation in a manner that matches the painter's style.

In image colourisation we aim to hallucinate colours given a greyscale image. The main challenge with hallucinating colours is that the task is underconstrained; a pixel with a given greyscale value can be assigned a number of different colours. However, if we are able to recognise what is depicted in the image, we may be able to suggest a plausible colourisation. Moreover, for paintings, if we know who painted it, we can further narrow down what is plausible. Due to the importance of style for painting colourisation, we set out to compare whether using instance statistics over batch statistics offers similar improvements for painting colourisation as were observed for style transfer (i.e., better looking and more stylised results).

2. Normalisation techniques

In this section we discuss the three normalisation techniques we compare in this work: (1) Batch Normalisation, (2) Instance Normalisation, and (3) Conditional Instance Normalisation (CondInNorm).

(1) BatchNorm. Given a batch of size T, BatchNorm normalises each channel c of its input x ∈ RT×C×W×H such that it has zero mean and unit variance (Ioffe & Szegedy, 2015). Formally, BatchNorm is defined as:

ytijk = γi ((xtijk − µi) / σi) + βi,   (1)

where µi and σi describe the mean and standard deviation for channel Ci across the spatial axes W and H, and the batch of size T. Additionally, for each channel BatchNorm keeps track of a pair of learned parameters γi and βi, that scale and shift the normalised value such that they may potentially recover the original activations if needed (Ioffe & Szegedy, 2015).

(2) InstanceNorm. Ulyanov et al. (2016) modify BatchNorm such that µi and σi are calculated independently for all T instances in the batch. However, γ and β are still shared.

(3) CondInNorm. CondInNorm is an extension of InstanceNorm which conditions the γ and β parameters on the style (Dumoulin et al., 2016). In this case γ and β become N × C matrices, where N is equal to the number of styles being modelled. In this work we use the painter as a proxy for the style and instead condition on the painter.
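The three variants differ only in which statistics and which scale/shift parameters are used; the conditional case can be sketched as follows. This is a minimal PyTorch illustration of CondInNorm under our own naming and initialisation choices, not the implementation used in this work.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Per-instance normalisation with per-style (per-painter) gamma and beta."""

    def __init__(self, num_channels, num_styles, eps=1e-5):
        super().__init__()
        self.eps = eps
        # gamma and beta are N x C matrices, one row per style.
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))

    def forward(self, x, style_id):
        # x: (T, C, H, W); style_id: (T,) long tensor with the painter index.
        # Normalise every instance and channel over its spatial axes only.
        mu = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        g = self.gamma[style_id].unsqueeze(-1).unsqueeze(-1)   # (T, C, 1, 1)
        b = self.beta[style_id].unsqueeze(-1).unsqueeze(-1)
        return g * x_hat + b
```

Dropping the conditioning (a single gamma/beta row) recovers InstanceNorm, and computing mu and sigma over the batch axis as well recovers BatchNorm.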


3. Method

Recent work has shown that Convolutional Neural Networks (CNN) can obtain sufficient visual understanding to perform automatic image colourisation (Isola et al., 2016). In this work we extend this to painting colourisation by implementing1 a CNN based on the "U-Net" architecture used in (Isola et al., 2016). A visualisation of the network architecture can be found in Figure 1. The arrows between the layers in the encoder and decoder represent skip connections, which enable a direct mapping between layers at the same spatial scale.

Figure 1. Visualisation of the network architecture. Conv refers to a convolutional layer, and Up combines upsampling with a convolutional layer. All convolutional layers are followed by a normalisation layer.

Using a CNN we learn a mapping Y = F(X) from the luminance (L) channel of a CIE Lab image X ∈ RH×W to the quantised ab colour channels Y ∈ RH×W×2×Q, where H, W are the image height and width, and Q is the number of bins used to quantise the ab channels. The predicted histograms across colour bins are converted to colour values by taking the weighted sum of the bins.

4. Experiment

The painting colourisation performance of my CNN using different normalisation methods is evaluated on a subset of the "Painters by Numbers" dataset as published on Kaggle2. We select the painters who have at least 5 artworks in the dataset, which results in a dataset consisting of 101,580 photographic reproductions of artworks produced by a total of 1,678 painters. All images were rescaled such that the shortest side was 256 pixels, and subsequently a 224 × 224 crop was extracted for analysis. Table 1 shows the quantitative painting colourisation results. Example colourisations are shown in Figure 2.

1 https://github.com/nanne/conditional-colour
2 https://www.kaggle.com/c/painter-by-numbers

Table 1. Painting colourisation results measured using RMSE across all pixels, and PSNR in RGB space. "Greyscale" is the result of predicting 0 for the ab channels.

Method      RMSE    PSNR
Greyscale   0.172   24.88
BN          0.146   23.26
IN          0.149   23.31
CIN         0.145   23.34

Figure 2. Example painting colourisation results (columns: Original, Input, BatchNorm, InstanceNorm, CondInNorm).

5. Conclusion

In this work we used a painting colourisation model capable of producing visually appealing colourisations to compare three normalisation techniques. We conclude that using an instance-based normalisation technique is beneficial for painting colourisation and that conditioning the shifting and scaling parameters on the painter only leads to minimal improvements.

Acknowledgments

The research is supported by NWO (grant 323.54.004).

References

Dumoulin, V., Shlens, J., & Kudlur, M. (2016). A learned representation for artistic style. arXiv preprint.
Huang, X., & Belongie, S. (2017). Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. ICLR 2017 Workshop track.
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML (pp. 448–456). JMLR.
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2016). Image-to-Image Translation with Conditional Adversarial Networks. arXiv preprint.
LeCun, Y., Bottou, L., Orr, G., & Müller, K. (2012). Efficient backprop. Neural networks: Tricks of the trade, 9–48.
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint.

Predictive Business Process Monitoring with LSTMs

Niek Tax [email protected]


Eindhoven University of Technology, The Netherlands
Ilya Verenich, Marcello La Rosa {ilya.verenich,m.larosa}@qut.edu.au
Queensland University of Technology, Australia
Marlon Dumas [email protected]
University of Tartu, Estonia

Keywords: deep learning, recurrent neural networks, process mining, business process monitoring

1. Introduction

Predictive business process monitoring techniques are concerned with predicting the evolution of running cases of a business process based on models extracted from historical event logs. A range of such techniques have been proposed for a variety of business process prediction tasks, e.g. predicting the next activity (Becker et al., 2014), predicting the future path (continuation) of a running case (Polato et al., 2016), predicting the remaining cycle time (Rogge-Solti & Weske, 2013), and predicting deadline violations (Metzger et al., 2015). Existing predictive process monitoring approaches are tailor-made for specific prediction tasks and not readily generalizable. Moreover, their relative accuracy varies significantly depending on the input dataset and the point in time when the prediction is made.

Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997) have been shown to deliver consistently high accuracy in several sequence modeling application domains, e.g. natural language processing and speech recognition. Recently, (Evermann et al., 2016) applied LSTMs specifically to predict the next activity in a case. This paper explores the application of LSTMs for three predictive business process monitoring tasks: (i) the next activity in a running case and its timestamp; (ii) the continuation of a case up to completion; and (iii) the remaining cycle time. The outlined LSTM architectures are empirically compared against tailor-made approaches using four real-life event logs.

2. Next Activity and Time Prediction

We start by predicting the next activity in a case and its timestamp. A log of business process executions consists of sequences (i.e., traces) of events, where for each event the business task (i.e., activity) that was executed and the timestamp are known. Typically, the set of unique business tasks seen in a log is rather small, therefore learned representations (such as (Mikolov et al., 2013)) are unlikely to work well. We transform each event into a feature vector using a one-hot encoding of its activity.

If the last seen event occurred just before closing time of the company, it is likely that the next event of the trace will at earliest take place on the next business day. If this event occurred on a Friday and the company is closed during weekends, it is likely that the next event will take place at earliest on Monday. Therefore, the timestamp of the last seen activity is likely to be useful in predicting the timestamp of the next event. We extract two features representing the time domain: the time since the start of the business day, and the time since the start of the business week.

Figure 1 shows different setups that we explore. First, we explore predicting the next activity and its timestamp with two separate LSTM models. Secondly, we explore predicting them with one joint LSTM model. Thirdly, we explore an architecture with n shared LSTM layers, followed by m task-specific layers. We use cross entropy loss for the predicted activity and mean absolute error (MAE) loss for the predicted time, and train the neural network weights with Adam (Kingma & Ba, 2015).
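As an illustration of the shared multi-task setup (architecture (c) in Figure 1), such a model could be sketched in Keras as follows. The layer sizes, sequence length and feature dimensionality below are assumptions for illustration, not the configuration used in the paper.

```python
from tensorflow.keras import layers, models

num_activities = 10                  # assumed number of distinct activities
num_features = num_activities + 2    # one-hot activity plus two time features
seq_len = 20                         # assumed (padded) prefix length

inputs = layers.Input(shape=(seq_len, num_features))
shared = layers.LSTM(100, return_sequences=True)(inputs)   # shared layer(s)
act = layers.LSTM(100)(shared)                             # task-specific layer
tim = layers.LSTM(100)(shared)                             # task-specific layer
act_out = layers.Dense(num_activities, activation="softmax", name="activity")(act)
time_out = layers.Dense(1, name="time")(tim)

model = models.Model(inputs, [act_out, time_out])
model.compile(optimizer="adam",
              loss={"activity": "categorical_crossentropy", "time": "mae"})
```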
For time prediction we take as baseline the set, bag, and sequence approach described in (van der Aalst et al., 2011). For activity prediction we take as baselines the LSTM based approach of (Evermann et al., 2016) and the technique of (Breuker et al., 2016).


Figure 1. Neural network architectures with single-task layers (a), with a shared multi-task layer (b), and with n + m layers of which n are shared (c).

Table 1. Experimental results for the Helpdesk and BPI'12 W logs.

                          Helpdesk                       BPI'12 W
Layers  Shared   MAE (time)  Accuracy (act.)   MAE (time)  Accuracy (act.)
3       3        3.77        0.7116            1.58        0.7507
3       2        3.80        0.7118            1.57        0.7512
3       1        3.76        0.7123            1.59        0.7525
3       0        3.82        0.6924            1.66        0.7506
2       2        3.81        0.7117            1.58        0.7556
2       1        3.77        0.7119            1.56        0.7600
2       0        3.86        0.6985            1.60        0.7537
1       1        3.75        0.7072            1.57        0.7486
1       0        3.87        0.7110            1.59        0.7431
Time prediction baselines
Set              5.83        -                 1.97        -
Bag              5.74        -                 1.92        -
Sequence         5.67        -                 1.91        -
Activity prediction baselines
Evermann         -           -                 -           0.623
Breuker          -           -                 -           0.719

Table 1 shows the MAE of the predicted timestamp of the next event and the accuracy of the predicted activity on two data sets. It shows that LSTMs outperform the baseline techniques, and that architectures with shared layers outperform architectures without shared layers.
3. Suffix Prediction

By repeatedly predicting the next activity, using the method described in Section 2, the trace can be predicted completely until its end. The most recent method to predict an arbitrary number of events ahead is (Polato et al., 2016), which extracts a transition system from the log and then learns a machine learning model for each transition system state. Levenshtein similarity is a frequently used string similarity measure, which is based on the minimal number of insertions, deletions and substitutions needed to transform one string into another. In business processes, activities are frequently performed in parallel, leading to some events in the trace being arbitrarily ordered; therefore we consider it only a minor mistake when two events are predicted in the wrong order. We evaluate suffix predictions with Damerau-Levenshtein similarity, which adds a swapping operation to Levenshtein similarity. Table 2 shows the results of suffix prediction on three data sets. The LSTM outperforms the baseline on all logs.

Table 2. Suffix prediction results in terms of Damerau-Levenshtein similarity.

Method                  Helpdesk   BPI'12 W   Env. permit
(Polato et al., 2016)   0.2516     0.0458     0.0260
LSTM                    0.7669     0.3533     0.1522
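A minimal implementation of this similarity is sketched below, using the optimal string alignment variant of the Damerau-Levenshtein distance and normalising by the length of the longer sequence; this particular normalisation is an illustrative choice, not prescribed by the paper.

```python
def damerau_levenshtein_similarity(a, b):
    """Similarity in [0, 1] from the optimal string alignment distance,
    i.e. Levenshtein distance extended with adjacent transpositions."""
    if not a and not b:
        return 1.0
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return 1.0 - d[len(a)][len(b)] / max(len(a), len(b))

# Two activity sequences that differ only in the order of the last two events:
print(damerau_levenshtein_similarity("ABCD", "ABDC"))  # 0.75
```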

4. Remaining Cycle Time Prediction

By repeatedly predicting the next activity and its timestamp with the method described in Section 2, the timestamp of the last event of the trace can be predicted, which can be used to predict the remaining cycle time. Figure 2 shows the mean absolute error for each prefix size, for the four logs. As baseline we use the set, bag, and sequence approach described in (van der Aalst et al., 2011), and the approach described in (van Dongen et al., 2008). It can be seen that LSTM consistently outperforms the baselines for the Helpdesk log and the environmental permit log.

Figure 2. MAE values (in days) using prefixes of different lengths for the helpdesk (a), BPI'12 W (b), BPI'12 W (no duplicates) (c) and environmental permit (d) datasets.
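The iterative procedure behind Sections 3 and 4 can be sketched as follows. Here predict_next is a hypothetical helper standing in for the trained model of Section 2, not an actual function from our implementation.

```python
def predict_suffix(predict_next, prefix, end_activity, max_steps=100):
    """Extend a running case event by event until the (predicted) end.

    predict_next: callable mapping a partial trace to (next_activity, time_delta);
    a placeholder for the trained LSTM model described in Section 2.
    Returns the predicted continuation and the predicted remaining cycle time."""
    suffix, remaining_time = [], 0.0
    for _ in range(max_steps):
        activity, delta = predict_next(prefix + suffix)
        suffix.append(activity)
        remaining_time += delta
        if activity == end_activity:   # predicted completion of the case
            break
    return suffix, remaining_time
```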


5. Conclusions

The foremost contribution of this paper is a technique to predict the next activity of a running case and its timestamp using LSTM neural networks. We showed that this technique outperforms existing baselines on real-life data sets. Additionally, we found that predicting the next activity and its timestamp via a single model (multi-task learning) yields a higher accuracy than predicting them using separate models. We then showed that this basic technique can be generalized to address two other predictive process monitoring problems: predicting the entire continuation of a running case and predicting the remaining cycle time.

Acknowledgments

This work is accepted and to appear in the proceedings of the International Conference on Advanced Information Systems Engineering (Tax et al., 2017). The source code and supplementary material required to reproduce the experiments reported in this paper can be found at http://verenich.github.io/ProcessSequencePrediction. This research is funded by the Australian Research Council (grant DP150103356), the Estonian Research Council (grant IUT20-55) and the RISE BPM project (H2020 Marie Curie Program, grant 645751).

References

Becker, J., Breuker, D., Delfmann, P., & Matzner, M. (2014). Designing and implementing a framework for event-based predictive modelling of business processes. Proceedings of the 6th International Workshop on Enterprise Modelling and Information Systems Architectures (pp. 71-84). Springer.

Breuker, D., Matzner, M., Delfmann, P., & Becker, J. (2016). Comprehensible predictive models for business processes. MIS Quarterly, 40, 1009-1034.

Evermann, J., Rehse, J.-R., & Fettke, P. (2016). A deep learning approach for predicting process behaviour at runtime. Proceedings of the 1st International Workshop on Runtime Analysis of Process-Aware Information Systems. Springer.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780.

Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference for Learning Representations.

Metzger, A., Leitner, P., Ivanovic, D., Schmieders, E., Franklin, R., Carro, M., Dustdar, S., & Pohl, K. (2015). Comparing and combining predictive business process monitoring techniques. IEEE Trans. Systems, Man, and Cybernetics: Systems, 45, 276-290.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (pp. 3111-3119).

Polato, M., Sperduti, A., Burattin, A., & de Leoni, M. (2016). Time and activity sequence prediction of business process instances. arXiv preprint arXiv:1602.07566.

Rogge-Solti, A., & Weske, M. (2013). Prediction of remaining service execution time using stochastic Petri nets with arbitrary firing delays. Proceedings of the International Conference on Service Oriented Computing (pp. 389-403). Springer.

Tax, N., Verenich, I., La Rosa, M., & Dumas, M. (2017). Predictive business process monitoring with LSTM neural networks. Proceedings of the International Conference on Advanced Information Systems Engineering (to appear). Springer.

van der Aalst, W. M. P., Schonenberg, M. H., & Song, M. (2011). Time prediction based on process mining. Information Systems, 36, 450-475.

van Dongen, B. F., Crooy, R. A., & van der Aalst, W. M. P. (2008). Cycle time prediction: when will this case finally be finished? Proceedings of the International Conference on Cooperative Information Systems (pp. 319-336). Springer.

Big IoT data mining for real-time energy disaggregation in buildings
(extended abstract)

Decebal Constantin Mocanu? [email protected]


Elena Mocanu? [email protected]
Phuong H. Nguyen? [email protected]
Madeleine Gibescu? [email protected]
Antonio Liotta? [email protected]
?
Dep. of Electrical Engineering, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands

Keywords: deep learning, factored four-way conditional restricted Boltzmann machines, energy disaggregation,
energy prediction

Abstract

In the smart grid context, the identification and prediction of building energy flexibility is a challenging open question. In this paper, we propose a hybrid approach to address this problem. It combines sparse smart meters with deep learning methods, e.g. Factored Four-Way Conditional Restricted Boltzmann Machines (FFW-CRBMs), to accurately predict and identify the energy flexibility of buildings unequipped with smart meters, starting from their aggregated energy values. The proposed approach was validated on a real database, namely the Reference Energy Disaggregation Dataset.

1. Introduction

Unprecedented high volumes of data and information are available in the smart grid context, with the upward growth of the smart metering infrastructure. This recently developed network enables two-way communication between the smart grid and individual energy consumers (i.e., the customers), with emerging needs to monitor, predict, schedule, learn and make decisions regarding local energy consumption and production, all in real-time. One possible way to detect building energy flexibility in real-time is by performing energy disaggregation (Zeifman & Roth, 2011). In this paper (Mocanu et al., 2016), we propose a unified framework which incorporates two novel deep learning models, namely Factored Four-Way Conditional Restricted Boltzmann Machines (FFW-CRBM) (Mocanu et al., 2015) and Disjunctive Factored Four-Way Conditional Restricted Boltzmann Machines (DFFW-CRBM) (Mocanu et al., 2017), to perform energy disaggregation, flexibility identification and flexibility prediction simultaneously.

2. The proposed method

Recently, it has been proven that it is possible to perform both classification and prediction in a unified framework by using deep learning techniques, such as in (Mocanu et al., 2014; Mocanu et al., 2015; Mocanu et al., 2017). Consequently, in the context of flexibility detection and prediction, we explore the generalization capabilities of Factored Four-Way Conditional Restricted Boltzmann Machines (FFW-CRBM) (Mocanu et al., 2015) and Disjunctive Factored Four-Way Conditional Restricted Boltzmann Machines (DFFW-CRBM) (Mocanu et al., 2017). Both models, FFW-CRBM and DFFW-CRBM, have been shown to be successful in outperforming state-of-the-art techniques in both classification (e.g. Support Vector Machines) and prediction (e.g. Conditional Restricted Boltzmann Machines), on time series classification and prediction in the context of human activity recognition, 3D trajectory estimation and so on. In Figure 1 a high level schematic overview of FFW-CRBM and DFFW-CRBM functionalities is depicted, while for a comprehensive discussion on their mathematical details the interested reader is referred to (Mocanu et al., 2015; Mocanu et al., 2017). The full methodology to perform energy disaggregation can be found in (Mocanu et al., 2016).
Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

The full paper has been published in the proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), pages 003765-003769, DOI 10.1109/SMC.2016.7844820.


Figure 1. Classification and prediction schemes for FFW-CRBMs (DFFW-CRBMs function in a similar manner). To perform classification, the value of each neuron in the dotted blue area (i.e. the present and history layers) is fixed and the model is left to infer the values of the label neurons. To perform prediction, the value of each neuron in the dotted red area (i.e. the label and history layers) is fixed and the model is left to infer the values of the present neurons.

We assessed our proposed framework on the Reference Energy Disaggregation Dataset (REDD) (Kolter & Johnson, 2011). The results presented in Tables 1 and 2 show that both models performed very well, obtaining a minimum prediction error on the power consumption of 1.85% and a maximum error of 9.36%, while for the time-of-use prediction the minimum error reached was 1.77% in the case of the electric heater and the maximum error obtained was 8.79% for the refrigerator.

Table 1. Results showing accuracy [%] and balanced accuracy [%] for FFW-CRBM and DFFW-CRBM, when classifying an appliance versus all data.

Appliance        Method      Accuracy [%]   Balanced accuracy [%]
refrigerator     FFW-CRBM    86.23          90.05
                 DFFW-CRBM   83.10          91.27
dishwasher       FFW-CRBM    97.42          80.21
                 DFFW-CRBM   97.26          87.06
washer dryer     FFW-CRBM    98.83          99.03
                 DFFW-CRBM   99.06          92.16
electric heater  FFW-CRBM    99.10          90.58
                 DFFW-CRBM   99.03          92.05

Table 2. Results showing the NRMSE [%] obtained to estimate the electrical demand and the time-of-use for four building electrical sub-systems using FFW-CRBM and DFFW-CRBM.

Appliance        Method      Power NRMSE [%]   Time-of-use NRMSE [%]
refrigerator     FFW-CRBM    9.36              8.79
                 DFFW-CRBM   9.27              8.71
dishwasher       FFW-CRBM    5.49              5.89
                 DFFW-CRBM   5.41              5.87
washer dryer     FFW-CRBM    2.70              2.43
                 DFFW-CRBM   2.59              2.44
electric heater  FFW-CRBM    1.86              1.78
                 DFFW-CRBM   1.85              1.77
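For reference, one common way to compute the NRMSE reported in Table 2 is sketched below; normalising by the range of the observed values is an assumption for illustration, as the exact normalisation is not spelled out in this abstract.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root mean squared error normalised by the range of the observed
    values, expressed as a percentage (one common definition of NRMSE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmse / (y_true.max() - y_true.min())
```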
3. Conclusion

In this paper, we proposed a novel IoT framework to perform flexibility identification and prediction simultaneously and in real-time, by making use of Factored Four-Way Conditional Restricted Boltzmann Machines and their Disjunctive version. The experimental validation performed on a real-world database shows that both models perform very well, reaching a similar performance with state-of-the-art models on flexibility identification, while having the advantage of being capable of also performing flexibility prediction.

Acknowledgments

This research has been partly funded by the European Union's Horizon 2020 project INTER-IoT (grant number 687283), and by the NL Enterprise Agency under the TKI SG-BEMS project of Dutch Top Sector.

References

Kolter, J. Z., & Johnson, M. J. (2011). REDD: A public data set for energy disaggregation research. SustKDD Workshop on Data Mining Applications in Sustainability. San Diego, California, USA.

Mocanu, D. C., Ammar, H. B., Lowet, D., Driessens, K., Liotta, A., Weiss, G., & Tuyls, K. (2015). Factored four way conditional restricted Boltzmann machines for activity recognition. Pattern Recognition Letters, 66, 100-108.

Mocanu, D. C., Ammar, H. B., Puig, L., Eaton, E., & Liotta, A. (2017). Estimating 3D trajectories from 2D projections via disjunctive factored four-way conditional restricted Boltzmann machines. Pattern Recognition.

Mocanu, D. C., Mocanu, E., Nguyen, P. H., Gibescu, M., & Liotta, A. (2016). Big IoT data mining for real-time energy disaggregation in buildings. 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 003765-003769).

Mocanu, E., Mocanu, D. C., Ammar, H. B., Zivkovic, Z., Liotta, A., & Smirnov, E. (2014). Inexpensive user tracking using Boltzmann machines. 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 1-6).

Zeifman, M., & Roth, K. (2011). Nonintrusive appliance load monitoring: Review and outlook. IEEE Transactions on Consumer Electronics, 57, 76-84.

Industry Track
Research Papers

Comparison of Syntactic Parsers on Biomedical Texts
http://wwwen.uni.lu/lcsb

Maria Biryukov [email protected]


Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belval, Luxembourg

Keywords: syntactic parser, biomedical, text mining

Abstract

Syntactic parsing is an important step in automated text analysis which aims at information extraction. The quality of the syntactic parsing determines to a large extent the recall and precision of the text mining results. In this paper we evaluate the performance of several popular syntactic parsers in application to biomedical text mining.

1. Introduction

Biomedical information extraction from text is an active area of research with applications to the generation of disease maps, protein-protein interaction networks, drug-disease effects and more. Proper understanding of sentence syntactic structure is a precondition for correct interpretation of its meaning. We used five state-of-the-art syntactic parsers, Stanford, BLLIP and SyntaxNet, in two different experiments. One experiment aimed at evaluation of sentence syntactic complexity; another one tested the parsers' capability to correctly parse biomedical articles.

1.1. Goals

In this paper we describe ongoing work towards the following goals:

• Evaluate parser performance on real-life corpora.
• Identify the most frequent parser errors.
• Find good predictors for sentence complexity and parser errors.
• Identify working strategies on how to efficiently train parsers for the biomedical domain.

1.2. Parsers and models for the comparison

For this study we used five different machine-learning based parsers. Three of these are Stanford parsers based on different linguistic paradigms: PCFG, Factored, and RNN (Manning, 2015). All three parsers have been trained on English corpora, which are a mixture of newswire, biomedical texts, English translations of the Chinese and Arabic Tree Bank, and a set of sentences with technical vocabulary. The fourth parser is the BLLIP parser with two different training models: a) Wall Street Journal (WSJ) and GigaWord (McClossky, 2015), which is a collection of news and news-like style texts; and b) purely biomedical - Genia (Kim et al., 2003) + Medline. The fifth parser is called Parsey McParseface (Presta & et.al, 2017) and is part of the Google Labs' TensorFlow (Moshe et al., 2016) deep learning framework. (This parser is referred to as "Google" in Table 1.) Similar to our experiments with BLLIP, we used Parsey McParseface with various language models: the one provided with the distribution, trained on a multi-domain corpus combining data from OntoNotes, Web Treebank, and an updated and corrected Question Treebank. We also trained it on the biomedical gold standard corpora - Genia and CRAFT (see section 3.1 for details about the corpora).

The parsers are briefly introduced below:

• PCFG is an accurate unlexicalized parser based on a probabilistic context-free grammar (Klein & Manning, 2003).

• The Factored parser combines syntactic structures (PCFG) with semantic dependencies provided by using lexical information (Klein & Manning, 2003). Lexicalization is very helpful when dealing with language ambiguities and helps correctly identify dependencies between sentence constituents.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

• The Recursive neural networks (RNN) parser (Socher et al., 2011) works in two steps: first it uses the parses of the PCFG parser to train; then recursive neural networks are trained with semantic word vectors and used to score parse trees. In this way syntactic structure and lexical information are jointly exploited.

• BLLIP is a self-trained parser which is based on a probabilistic generative model with maximal entropy discriminative reranking (McClosky & Charniak, 2008). In the self-training approach, an existing parser parses unseen data and treats this newly labeled data in combination with the actual labeled data to train a second parser. BLLIP exploits this technique to self-train a parser, initially trained on a standard Penn Treebank corpus, using unlabeled biomedical abstracts.

• Parsey McParseface is an incremental transition-based parser with feature embeddings (Andor et al., 2016) introduced by Google Labs. The parser has appealing characteristics, such as accuracy from 89% on web texts up to about 94% on news and questions, and speed (600 words/second).

2. Biomedical corpus analysis

Biomedical language is known to be difficult for parsers because it is very different from "standard" English. We analyzed a development corpus from the BioNLP 2011 competition (Tsujii et al., 2011) from the point of view of its syntactic and lexical variability. Initially the corpus consisted of 2564 sentences. For this experiment we considered only the sentences with at least one mention of a protein and one predicate (noun, verb or adjective) which could trigger an event. In total we analyzed 875 sentences. Sentence length ranged from 2 to 146 tokens, with an average of 26 tokens.

The underlying assumption for the evaluation of syntactic complexity was that more complex sentences will result in a larger variability of their parses. We parsed the sentences with the BLLIP parser trained on the biomedical model (Genia+MEDLINE).

Suppose that the parser builds a syntactic parse graph G for a given sentence. We are interested in assigning such a graph a score which measures the complexity of the graph (sentence). The syntactic parse graphs very often have a tree-like or a directed acyclic (DAG) structure and thus it makes sense to talk about their depth, i.e. the number of levels in such a graph. Let us denote by Depth(G) the depth of the sentence parse graph, and by B_i the number of nodes (tokens) on the i-th level of this graph, i.e. the breadth. Then one possible sentence complexity score can be calculated as follows:

Score(G) = Σ_{i=1}^{Depth(G)} B_i · i.

The more tokens, and the lower their depth, the higher this metric. In Fig. 1 we show how sentence complexity scores are distributed in the corpus. Fig. 2 shows the relation between the sentence length, measured in tokens, and the sentence complexity in our score metric.

Figure 1. Sentence complexity scores distribution.

For the knowledge extraction pipeline which uses the parser as one of its components it is of interest to assess the quality of the parse and to have an estimate of the probability of a parsing error for a given sentence. For example, correct argument attachment in a sentence can be highly ambiguous and challenging for the parser. The BLLIP parser offers an option to output multiple parses for a given sentence. We have picked the top 50 parses per sentence and computed our metric for each parse. The goal was to get an idea of how stable the parse is; the more variation there is in the sentence parses (and thus in their scores), the higher is the chance that the parse may not be correct.

For each sentence we computed 50 parsing scores and then computed the coefficient of variation for each sentence by dividing the standard deviation of the score by the mean of the score: cv = σ / µ.
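A small sketch of the two quantities defined above (the complexity score and the coefficient of variation over the 50 candidate parses) is given below. Whether the population or the sample standard deviation is used is not specified in the text; the population version is shown as an example.

```python
import statistics

def complexity_score(level_breadths):
    """Score(G) = sum over levels i of B_i * i, where level_breadths[i-1]
    is B_i, the number of tokens on level i of the parse graph."""
    return sum(b * i for i, b in enumerate(level_breadths, start=1))

def coefficient_of_variation(scores):
    """cv = sigma / mu over the scores of the candidate parses."""
    return statistics.pstdev(scores) / statistics.mean(scores)

# Toy example: a parse graph with 1 node on level 1, 3 on level 2, 5 on level 3.
print(complexity_score([1, 3, 5]))             # 1*1 + 3*2 + 5*3 = 22
print(coefficient_of_variation([22, 24, 20]))  # small spread -> low cv
```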


Figure 2. Sentence complexity versus sentence length in tokens.

The distribution of the coefficients of variation cv of the sentence complexity Score(G) for the BLLIP parser is shown in Fig. 3. As can be seen in this figure, the coefficient was in the range 0 ≤ cv ≤ 0.18. Out of a total of 850 sentences, only 70 sentences had a coefficient of variation > 0.10; 341 were in the middle range 0.05 ≤ cv ≤ 0.10, while 461 were in the lowest variation category (cv < 0.05). Finally, for 30 sentences the variation was zero.

Figure 3. Histogram of the coefficient of variation of the sentence complexity.

These results show that for about 54% of the sentences the BLLIP parse is very stable and for only about 8-15% it is unstable. This is also reflected by its very good performance in the tests in the following section. The measure cv for a given sentence can be used as a parse error estimator in the event extraction pipeline.

2.1. Sentence content load

What makes biomedical texts so different from "standard" English are the in-domain words. These are, first of all, names of genes, proteins, molecules, tissues - the list is quite long. We were interested in assessing the distribution of complex multi-trigger protein-protein interactions inside the typical biomedical corpus. Given the gold standard annotation of the BioNLP competition data, we calculated the distribution of protein names and predicates which could trigger an event in the corpus. It turned out that sentences with only 2-3 mentions of some protein name cover 76% of all the corpus sentences which contained at least one trigger predicate. There were 11, 4 and 2% with four, five and six proteins respectively. This information shows that complex event extraction techniques might not be necessary for protein-protein interaction extraction from the majority of biomedical sentences. It should be noted, however, that text analysis is not limited to only protein-related events.

3. Parser Evaluation for Event Extraction

Semantic relations between individual words and phrases are encoded in the syntactic dependencies. This is the reason why syntactic parsing is an important step towards information extraction. In our applications we aim at finding so-called 'events': functional connections between various biomedical concepts such as proteins, chemicals, diseases and processes. Moreover, we are particularly interested in determining the event context: logic and temporal order, coordination, mutual dependency and/or exclusion. Such relations are expressed via abundant use of coordination, prepositions, adverbial and relative clauses. In this paper we evaluate the parsers' performance on biomedical texts.

3.1. Test corpus and Data sets

We used three test corpora. As mentioned in Section 2, for the sentence complexity evaluation we used the development corpus of the BioNLP competition (Tsujii et al., 2011). It consisted of 220 documents in which 875 sentences have been parsed by us with BLLIP. To evaluate the parsers' performance on biomedical texts we used two gold standard in-domain corpora.

178
Comparison of Syntactic Parsers on Biomedical Texts

One was released in the framework of the Genia project (Kim et al., 2003). It consists of 2000 article abstracts collected with the key-word search "Human, Blood Cells and Transcription Factors". Another corpus is known as the "Colorado Richly Annotated Full Text Corpus" (CRAFT) (Verspoor & et.al, 2012). It contains 67 full text articles in a wide variety of biomedical sub-domains. For the evaluation we used about 1000 sentences from each of these corpora. Our choice of test set size was based on the Genia corpus division into train, development and test sets distributed by D. McClosky (McClosky & Charniak, 2008). For the Genia corpus we used his division, and we created our own for the CRAFT corpus.

Figure 4. BLLIP parse.

Figure 5. Stanford Factored parse.

3.2. Evaluation and Discussion

Table 1 presents the most important results of each parser. For the overall performance assessment, we adopted evaluation criteria established by the Parser Evaluation challenge (Black et al., 1992), PARSEVAL. These include accuracy of the part-of-speech tagging, the unlabeled attachment score (UAS), which accounts for the correspondence between each node and its parent in the test and gold standard parses, and the labeled attachment score (LAS), which, in addition to the parent node correspondence, checks whether the syntactic relation between two nodes (the label on the edge) is the same in the test and gold sets. Given the nature of these two measurements, it is not surprising that LAS is systematically lower than UAS for all the parsers listed in Table 1. However, the accuracy of the labeled attachment predefines the extent to which semantic relations between the concepts represented by the nodes would be correctly interpreted.
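As an illustration, UAS and LAS can be computed from aligned head/label predictions as sketched below; the (head, label) list format is an assumption for illustration, not the output format of any particular parser.

```python
def attachment_scores(gold, predicted):
    """UAS and LAS over a sentence (or a concatenation of sentences).

    gold, predicted: lists of (head_index, dependency_label) per token,
    aligned token by token."""
    total = uas = las = 0
    for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
        total += 1
        if g_head == p_head:          # correct parent node
            uas += 1
            if g_label == p_label:    # correct parent and correct relation label
                las += 1
    return uas / total, las / total

gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
print(attachment_scores(gold, pred))  # (1.0, 0.666...)
```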
It can be seen from the table that the parsers' performance depends on how close the test domain is to the training domain. One of the important reasons for that is out-of-domain words. Being unknown to the parser, they present a difficulty for the part-of-speech tagging. The latter is responsible for the assignment of syntactic dependencies between the words. Our analysis shows that part-of-speech errors are responsible for 30% (for corresponding training and test domains) to 60% (for different training and test domains) of the errors in dependency assignment. Among all the parsers trained on English corpora, Stanford RNN shows the best result (LAS 0.78) on the Genia corpus. BLLIP trained on Genia + PubMed demonstrates the best performance, followed by Parsey McParseface trained on Genia and CRAFT.

With respect to the biomedical corpora it seems that CRAFT is more difficult than Genia for all but one of the parsers, whether trained on biomedical texts or not. A detailed corpora investigation is required to answer the question why this is so. However, we suppose that certain portions of full texts, such as detailed descriptions of experimental setups or explanations of the figures, which are not necessarily complete or well-formed sentences, contribute to the lower parser scores. Besides, full texts have a larger vocabulary than the abstracts. This effect can be even stronger in our specific training-test setup, due to the sub-domain coverage of the two biomedical corpora: Genia being narrowly focused, as opposed to the much more diverse CRAFT. Overall, based on the figures in Table 1, we think that abstracts are not sufficiently representative of the entire article context to provide an efficient training set for the parsers.


Table 1. Parsers' comparison on the test corpora.

Parser         Model           Test corpus   POS    UAS    LAS
Stanford RNN   English         Craft         0.72   0.68   0.63
Stanford RNN   English         Genia         0.83   0.83   0.78
BLLIP          Medline+Genia   Craft         0.90   0.76   0.73
BLLIP          Medline+Genia   Genia         0.98   0.89   0.88
BLLIP          WSJ+GigaWord    Craft         0.77   0.66   0.61
BLLIP          WSJ+GigaWord    Genia         0.83   0.74   0.68
Google         English         Genia         0.85   0.57   0.49
Google         Genia           Craft         0.90   0.73   0.64
Google         Genia           Genia         0.98   0.90   0.84
Google         Craft           Craft         0.97   0.86   0.80
Google         Craft           Genia         0.95   0.84   0.78

In addition to the evaluation presented above, one can take a closer look at the parsers' performance on specific syntactic patterns, such as prepositional attachment or coordination. These constructs carry information about event participants, conditions under which events take place, as well as the locations at which they happen. At the same time, both coordination and prepositional attachment are often difficult to parse and attach correctly. Just as an illustration, we compare the BLLIP and the Stanford Factored parsers on an example of the following sentence: "Thus, one major factor that regulates Cdk5 activity is degradation of p35 and p39 via proteasomal degradation." The graphs of the parses are given in Fig. 4 for BLLIP and Fig. 5 for Stanford Factored, respectively. The relevant information that we want to extract in this case is two facts: a) degradation of both p35 and p39 regulates the Cdk5 activity; b) how this degradation happens - via proteasomal degradation. The first fact was successfully captured by both parsers, but the mechanism was correctly captured only by the BLLIP parser. The Stanford parser failed at the prepositional attachment.

Our preliminary evaluation of the parsers' performance on specific syntactic patterns shows that the success rate for prepositional attachment is in the range between 82% and 95%, while coordination is worse, and lies between 66% and 79%.

4. Conclusions

In this paper we have studied five syntactic parsers from three families, Stanford, BLLIP, and Parsey McParseface, on biomedical texts. We have seen that the highest performance was reached by the BLLIP parser on the Genia test corpus. We have also studied the complexity of biomedical sentences in the context of event extraction. We have defined sentence complexity metrics and parse variability metrics which can help to assess parser performance when it is used as part of a knowledge extraction pipeline.

References

Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., & Collins, M. (2016). Globally normalized transition-based neural networks. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Black, E., Lafferty, J. D., & Roukos, S. (1992). Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. 30th Annual Meeting of the Association for Computational Linguistics, 28 June - 2 July 1992, University of Delaware, Newark, Delaware, USA, Proceedings (pp. 185-192).

Kim, J., Ohta, T., Tateisi, Y., & Tsujii, J. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, June 29 - July 3, 2003, Brisbane, Australia (pp. 180-182).

Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 7-12 July 2003, Sapporo Convention Center, Sapporo, Japan (pp. 423-430).

Manning, C. (2015). The Stanford parser: A statistical parser. https://nlp.stanford.edu/software/lex-parser.shtml.


McClosky, D., & Charniak, E. (2008). Self-training for


Biomedical Parsing. ACL 2008, Proceedings of the
46th Annual Meeting of the Association for Compu-
tational Linguistics, June 15-20, 2008, Columbus,
Ohio, USA, Short Papers (pp. 101–104).

McClossky, D. (2015). BLLIP and Models.


https://github.com/BLLIP/bllip-parser/
blob/master/MODELS.rst.
Moshe, L., Marcello, H., & DeLesley, H. (2016).
An open-source software library for machine intelli-
gence. https://www.tensorflow.org.
Presta, A., & et.al (2017). Syntaxnet: Neural mod-
els of syntax. https://github.com/tensorflow/
models/tree/master/syntaxnet.

Socher, R., Lin, C. C., Ng, A. Y., & Manning, C. D.


(2011). Parsing natural scenes and natural language
with recursive neural networks. Proceedings of the
28th International Conference on Machine Learn-
ing, ICML 2011, Bellevue, Washington, USA, June
28 - July 2, 2011 (pp. 129–136).

Tsujii, J., Kim, J.-D., & Pyysalo, S. (2011). Bionlp:


2011. Proceedings of BioNLP 2011 Workshop.
Verspoor, K., C. K., & et.al (2012). A corpus of full-
text journal articles is a robust evaluation tool for
revealing differences in performance of biomedical
natural language processing tools. BMC Bioinfor-
matics, 13.

Industry Track
Extended Abstracts

Eskapade: a lightweight, python based, analysis framework
http://eskapade.kave.io

Lodewijk Nauta [email protected]


KPMG Advisory N.V., Laan van Langerhuize 1, 1186DS, Amstelveen, Netherlands
Max Baak [email protected]
KPMG Advisory N.V., Laan van Langerhuize 1, 1186DS, Amstelveen, Netherlands

Keywords: python, data analysis, framework

Abstract

Eskapade is a python framework that accelerates development of advanced analytics workflows in big data environments. The modular set-up allows scalable designs of analysis chains based on commonly available open-source libraries and custom built algorithms using single configuration files with simple syntax: from data ingestion and transformations to ML models including feedback processing.

The framework

Eskapade is a python based framework for analytics. The framework employs a modular set-up in the entire workflow of Data Science: from data ingestion to transformation to trained model output. Every part of the analysis can be run independently with different parameters from one configuration file with a universal syntax. The modular way of working reduces the complexity of your analysis and makes the reproducibility of the steps undertaken a lot easier.

The framework also includes self-learning software functionality for typical machine learning problems. Combined with easy model evaluation and comparison of predictive algorithms, the life-cycle of an analysis is made simpler. Trained algorithms can predict in real-time or in batch mode and, if necessary, trigger the retraining of an algorithm.

Building an analysis

Eskapade lowers the threshold of making the step from experiments to production when building predictive analytics solutions. The framework can be used to build an analysis, using jupyter for interactive development. Once the analysis is finished you can also use the framework for production purposes by running it in dockers or on your cluster, shortening the time-consuming step of reworking your analysis to production standards.

Since everyone in your team uses the same framework, team members can easily exchange code. Moreover, this method of working allows for version control of the analysis.

Running an analysis

Analyses are run from a file called a macro. This macro runs chains and these chains contain links. The links are the fundamental building blocks that do simple operations such as reading data, transforming data, training models and plotting output. Chains can be controlled and rerun individually (for example for retraining) from the command line when you run your macro. In this way it becomes easier to control what is happening at certain points in the analysis, while it is being developed or when it runs in production.
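To make the macro/chain/link idea concrete, a purely schematic illustration is given below. The class and function names are invented for this sketch and are not Eskapade's actual API; see the project page for the real syntax.

```python
# Purely illustrative pseudo-structure of a macro; names are NOT Eskapade's API.
class ReadData:                      # a "link": one small, reusable operation
    def execute(self, ctx):
        ctx["data"] = [1.0, 2.0, 3.0, 4.0]          # stand-in for data ingestion

class ComputeMean:                   # another link, consuming the previous result
    def execute(self, ctx):
        ctx["mean"] = sum(ctx["data"]) / len(ctx["data"])

chains = {
    "ingest":  [ReadData()],         # a "chain" groups links
    "analyse": [ComputeMean()],      # chains can be rerun individually
}

def run_macro(selected=("ingest", "analyse")):
    ctx = {}                         # shared state passed between links
    for name in selected:            # e.g. rerun only "analyse" when retraining
        for link in chains[name]:
            link.execute(ctx)
    return ctx

print(run_macro()["mean"])           # 2.5
```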
Dependencies

The framework is built on top of a variety of python analytics modules including pandas, scikit-learn, matplotlib, pySpark and pyROOT, and can also be run in jupyter to make the step from experimentation to production easier.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

Unsupervised region of interest detection in sewer pipe images:
Outlier detection and dimensionality reduction methods
Extended Abstract

Dirk Meijer [email protected]


Arno Knobbe [email protected]
LIACS, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands

Keywords: Unsupervised learning, Outlier detection, Dimensionality reduction, Computer Vision

1. Introduction

Sewer pipes require regular inspection to determine the deterioration state and performance, before deciding whether repair or replacement is necessary. Inspections are still mostly performed manually, which leads to subjective and inconsistent ratings for deterioration and urgency, differing between inspectors and even within inspectors (Dirksen et al., 2013).

The SewerSense project aims to investigate the possibilities of automating the analysis of sensory data from sewer pipe inspections and the consequences this would have for the sewer asset management industry.

2. Approach

The currently available data consists mostly of image and video data, grouped by pipe stretch and municipality. Rather than identifying defects in these images directly, we have opted to detect regions of interest (ROIs) in the images and classify these at a later stage. We make the assumptions that (1) all images are from a forward-facing camera in a sewer pipe, (2) all images are similarly aligned, and (3) the surface of the pipe is similar in appearance for images in a single set. See Figure 1 for an example of what these images may look like. It should be noted with the assumption of "similar appearance" that the concrete and agglomerate often contain a lot of texture, and adjacent pixel values are not necessarily similar.

Figure 1. Forward-facing pictures of the same concrete sewer pipe at different locations along the street.

We define an ROI to be the bounding box of a portion of an image that contains something "unexpected". Note that not all unexpected elements in the sewer pipe will be defects; we also want to detect pipe joints for example. See Figure 2 for an example of the ROIs we hope to detect.

Broadly speaking, the pixels in an image can be seen as a feature vector, with the notion that there is some spatial ordering to these features and a high correlation between adjacent pixels. As such, we treat this as an extremely high-dimensional, unsupervised outlier detection problem.

3. Methods

Unsupervised outlier detection methods are analogous to clustering: objects are thought to form clusters in feature space and outliers are those objects that are far away from clusters, or part of small clusters. The model used to fit these clusters must be somewhat restrictive, otherwise the models will overfit on the outliers that are present in the training data.

Since the number of pixels in an image (≈ 10^6) is some orders of magnitude greater than the number of images in a set (≈ 10^3), some dimensionality reduction is in order before we try to find outliers in the dataset, to ensure our methods don't overfit on the training set. While outlier detection methods for higher dimensionalities exist (Aggarwal & Yu, 2001), these seem to be aimed mostly at sparse data, which our image data is not. Other approaches seem to focus on dimensionality reduction by feature selection and rejection (Zimek et al., 2012), which is not well suited for images.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).


Figure 2. Regions of Interest in a sewer pipe image.

3.1. Principal Component Analysis

Principal component analysis (PCA) projects the d-dimensional data onto d orthogonal vectors, in order of decreasing variance present in the data. If we now omit all but the first d1 vectors, we have a lower dimensionality, while retaining most of the variance (as all omitted vectors have less variance than the retained ones).

The effect PCA has on images is an interesting one: the projection vectors returned by the PCA are shaped like the input image data, and visual inspection shows the collinearity of pixels. We call these eigenimages (as the vectors are the eigenvectors of the covariance matrix of the input data), and these have proven to be well suited for image classification tasks such as facial recognition (Turk & Pentland, 1991).

In our application, we can find these eigenimages for an image set and smoothen the image by getting rid of the contribution of all but the first d1 eigenimages. Every regular occurrence in the image set should be contained in the first few eigenimages, thus present in the smoothed images. By thresholding the difference image of the smoothed image and the original, we should be able to find regions of interest.

There are a few issues with this approach. Firstly, the PCA relies on inverting the covariance matrix of the dataset to find the eigenvectors, which can very quickly become challenging in the case of large sets of large images. Secondly, because of the texture present in the images, we need to have some level of spatial invariance, which the PCA does not have.
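A minimal sketch of the eigenimage smoothing and residual thresholding described above, using scikit-learn's PCA, is given below; the number of retained components and the flattened-image format are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def residual_map(images, image, d1=16):
    """images: (n_images, height * width) array of flattened, aligned images.
    Returns the absolute difference between `image` and its reconstruction
    from the first d1 eigenimages; large values mark candidate ROI pixels."""
    pca = PCA(n_components=d1).fit(images)
    smoothed = pca.inverse_transform(pca.transform(image[None, :]))[0]
    return np.abs(image - smoothed)

# Thresholding the residual map (threshold chosen per image set) yields a
# binary mask from which ROI bounding boxes can be extracted.
```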
3.2. Convolutional Autoencoders

An autoencoder is a type of artificial neural network where the target output is identical to the input, and the intermediate layers have fewer neurons than the input and output layers. The result is that the network learns a generalization of the input in the compacted layers. When the autoencoder uses strictly linear transfer functions, it is very similar to PCA, estimating a linear mapping onto a lower dimensionality. The interesting application comes of course from using non-linear transfer functions to learn a non-linear mapping.

Convolutional neural networks have been very successful in image classification and bring two new elements to neural networks: convolutional layers and pooling layers. The convolutional layer acts like a filter bank, turning the image into a series of filter responses which contain information about the local structure in the image. Unlike a static filter bank though, these filters are learned from the input data. The pooling layer performs a dimensionality reduction over a neighborhood in the image. This is often max-pooling, taking the maximum value of the filter responses per filter in a specific region. This introduces some spatial invariance, which is well suited for images.

Putting these two approaches together, a convolutional autoencoder can learn a non-linear dimensionality reduction in an unsupervised way, with some spatial invariance that can handle the image texture in a better way than the PCA can. We believe this may prove successful for detecting the ROIs.
the projection vectors returned by the PCA are shaped
like the input image data, and visual inspection shows
the collinearity of pixels. We call these eigenimages
4. Anticipated Results
(as the vectors are the eigenvectors of the covariance We are working on assembling a labeled dataset. With
matrix of the input data), and these have proven to be such a dataset we will investigate and report on the
well suited for image classification tasks such as facial effectiveness of dimensionality reduction methods and
recognition (Turk & Pentland, 1991). outlier detection methods as ROI detection techniques
In our application, we can find these eigenimages for in image sets. We expect the convolutional autoen-
an image set and smoothen the image by getting rid coder to outperform PCA, and we will look into other
of the contribution of all but the first d1 eigenimages. possibilities to overcome the issues faced by the PCA
Every regular occurrence in the image set should be that initially make it unsuitable for this task.
contained in the first few eigenimages, thus present
in the smoothed images. By thresholding the differ- Acknowledgments
ence image of the smoothed image and the original,
we should be able to find regions of interest. SewerSense is funded by the NWO/TTW TISCA pro-
gramme and implemented by researcher at Leiden Uni-
There are a few issues with this approach. Firstly, versity and Delft University of Technology. The im-
the PCA relies on inverting the covariance matrix of ages used have been provided by vandervalk+degroot.
the dataset to find the eigenvectors, which can very
quickly become challenging in the case of large sets of
large images. Secondly, because of the texture present
in the images, we need to have some level of spatial
invariance, which the PCA does not have.

185
Unsupervised region of interest detection in sewer pipe images

References
Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection
for high dimensional data. ACM Sigmod Record (pp.
37–46).
Dirksen, J., Clemens, F., Korving, H., Cherqui, F.,
Le Gauffre, P., Ertl, T., Plihal, H., Müller, K., &
Snaterse, C. (2013). The consistency of visual sewer
inspection data. Structure and Infrastructure Engi-
neering, 9, 214–228.
Turk, M. A., & Pentland, A. P. (1991). Face recogni-
tion using eigenfaces. Computer Vision and Pattern
Recognition, 1991. Proceedings CVPR’91., IEEE
Computer Society Conference on (pp. 586–591).
Zimek, A., Schubert, E., & Kriegel, H.-P. (2012). A
survey on unsupervised outlier detection in high-
dimensional numerical data. Statistical Analysis and
Data Mining, 5, 363–387.

Service Revenue Forecasting in Telecommunications:
A Data Science Approach

Dejan Radosavljevik [email protected]


LIACS, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
Peter van der Putten [email protected]
LIACS, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands

Keywords: telecommunications, data science, revenue forecasting

Abstract

This paper discusses a real-world case of revenue forecasting in telecommunications. Apart from the method we developed, we will describe the implementation, which is the key factor of success for data science solutions in a business environment. We will also describe some of the challenges that occurred and our solutions to them. Furthermore, we will explain our unorthodox choice for the error measure. Last, but not least, we will present the results of this process.

1. Service revenue forecasting

Given the importance of the process of revenue forecasting, it is surprising there is not a lot of research on this topic. There is literature available regarding the government sector (Fullerton, 1989; Feenberg et al., 1988), the airline industry (Harris, 2006) and online firms (Trueman et al., 2001). However, none of them describe this process in mobile telecommunications.

We focus on the postpaid segment of mobile telecommunications, where service revenues can be split into fixed (a subscription fee which already contains certain services) and variable (based on additional usage of services not included in the subscription). Furthermore, operators charge differently for using services while abroad (roaming) and for making international calls. Operators also charge other operators interconnect fees for incoming calls (a customer from operator B calls a customer from operator A).

The operator where we deployed the research has about two million postpaid customers with more than 4000 combinations of differently priced rate plans and voice, SMS and Internet bundles. In order to forecast the revenue figure for one month, one has to account not only for the different usage patterns throughout the year, but also for the inflow of new customers, changes in contract of the existing ones and the loss of customers to competition. The systems that are used for actual customer billing are not built for simulation of revenues, as these are too embedded into operational processes. This makes the task of forecasting revenue across different scenarios far from trivial.

2. Data collection and understanding

Unlike in an academic research setting, where seeking the best algorithm is the key to solving the problem and getting data is as easy as downloading a data set, in a business setting this represents a large chunk of the total work. Typically, the data is not available from a single data source, let alone structured in a format suitable for machine learning. The first thing we needed to do is unify the data (invoice data, usage data, interconnect data, rate plans/bundles and inflow and outflow of customers) into a single sandbox which we can use to construct our flat table. We chose to use Teradata (2017) as we had the most administrative flexibility there. Any other database system would do. Also, we used the open source tools R (2008) for moving data between data sources and KNIME (Berthold et al., 2007) for automating the process.

Next, we needed to restructure the data, as it was recorded in a database the same way it is presented to customers on an invoice. However, not all invoice items are service revenues (e.g. the handset fee and application purchases are not service revenues).

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).


Non-service revenues are out of scope. Isolating the service revenues required a lot of expert help, as each revenue type has a different code. Importing and fully aligning the data with what is officially billed and reported took almost two months of work. Next to this, a distinction between usage that is billed versus usage that is already part of the subscription was necessary. Last, but not least, interconnect usage and usage abroad were added separately as they are billed in a different way.

3. Clustering the data and modeling the service revenues

It is not feasible to come up with a single service revenue model for all customers due to the many different rate plans that the operator offers. Furthermore, usage habits are quite different for the various rate plans. However, a unique formula for each price plan is not feasible either, as some of these only contain a few customers. Therefore, we divided the entire customer base into 120 clusters, combining rate plans and bundles of similar characteristics (similar numbers of minutes, SMS and MB in the subscription), with a lower limit of 500 customers per cluster. Clustering algorithms were considered, however we had enough of a clear structure in the data to avoid this.

The approach we use is similar to the network capacity model described in (Radosavljevik & van der Putten, 2014), where we discuss load pockets to simulate network service quality. Here, we treat each cluster as a revenue pocket. Therefore, we create a linear regression model for each cluster of customers, taking various types of usage as input and revenue as output, and save it in a PMML format for scoring. This process is automated using KNIME. A screenshot of the process is shown in Figure 1.

Figure 1. Modeling workflow in KNIME.
We experimented with algorithms other than linear and rationale are currently being verified by the end
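This monthly roll-forward can be sketched schematically as follows; all helper functions and data objects are hypothetical placeholders standing in for the KNIME/R steps, not the production flow itself.

# Schematic 12-month revenue simulation over the customer base (sketch only).
# forecast_usage(), apply_inflow(), apply_contract_changes(), remove_outflow()
# and score_clusters() are hypothetical helpers.
base <- read.csv("customer_base.csv")    # current base, one row per customer
revenue_forecast <- numeric(12)

for (m in 1:12) {
  usage <- forecast_usage(base, month = m)           # per-cluster usage forecast
  base  <- apply_inflow(base, month = m)             # add projected new customers
  base  <- apply_contract_changes(base, month = m)   # contract renewals and changes
  base  <- remove_outflow(base, month = m)           # drop customers marked for discontinuation
  # score every cluster with its PMML regression model and sum the revenues
  revenue_forecast[m] <- sum(score_clusters(base, usage))
}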
From a run time perspective, our first run was too slow: 52 hours for a 4-month prediction. We were able to reduce this to about 15 minutes per month by using R instead of KNIME for some data transformations, by keeping everything in memory and only writing to disk once the full month is calculated, and by optimizing the process flow of generating "new" random average customers for the inflow and contract changes and removing randomly selected customers from the base for the outflow. We display the forecasting results in Qlikview (2014), providing drill-down opportunities per rate plan, channel, bundle, etc. The business now has better insights into the forecast than into the actuals, so there is a request to update the operational reporting on the actuals in the same way. To validate our approach we use typical prediction-level measures such as RMSE and MAE. However, the key measure for the end users to accept the model was how close the sum of our predictions is to the actual revenues. On the total base, our model was only 0.3% off the target, while the error of the standard budgeting process was 8 times higher.

In conclusion, we used mostly open source tools and simple algorithms in a complex deployment flow to optimize a key business problem. Our approach allows for testing multiple scenarios for the inflow of customers, contract renewals and outflow, as well as for simulating different pricing of the products. Its usability and rationale are currently being verified by the end users. The approach generalizes to revenue simulation of any subscription-based service with usage-based pricing.

References
Berthold, M. R., Cebron, N., Dill, F., Gabriel, T. R.,
Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel,
K., & Wiswedel, B. (2007). KNIME: The Kon-
stanz Information Miner. Studies in Classification,
Data Analysis, and Knowledge Organization (GfKL
2007). Springer.

Feenberg, D. R., Gentry, W. M., Gilroy, D., & Rosen, H. S. (1988). Testing the rationality of state revenue forecasts. National Bureau of Economic Research, Cambridge, Mass., USA.
Fullerton, T. M. (1989). A composite approach to fore-
casting state government revenues: Case study of
the Idaho sales tax. International Journal of Fore-
casting, 5, 373–380.
Harris, F. H. d. (2006). Keynote paper: Large-scale
entry deterrence of a low-cost competitor: An early
success of airline revenue management. Interna-
tional Journal of Revenue Management, 1, 5–27.
QlikTech International AB (2014). Qlikview personal edition (version 11.2). http://us-d.demo.qlik.com/download.

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Radosavljevik, D., & van der Putten, P. (2014). Large
scale predictive modeling for micro-simulation of 3G
air interface load. Proceedings of the 20th ACM
SIGKDD international conference on Knowledge
discovery and data mining (pp. 1620–1629).
Teradata Corporation (2017). Teradata online library. http://info.teradata.com/HTMLPubs/DB_TTU_15_00/index.html.
Trueman, B., Wong, M. F., & Zhang, X.-J. (2001).
Back to basics: Forecasting the revenues of internet
firms. Review of Accounting Studies, 6(2–3), 305–
329.

Predicting Termination of Housing Rental Agreements with
Machine Learning

Michiel van Wezel [email protected]


Dudok Wonen, Larenseweg 32, 1221 BW Hilversum, The Netherlands

Keywords: housing, real estate, asset management, prediction.

Abstract

Terminations of rental agreements (tenancy endings) are important in the business processes of housing associations. I describe the results of a model that predicts tenancy endings using Machine Learning.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

In the world of housing associations, tenancy endings play an important role at all levels of business, e.g.:

• Rental Process: Ended tenancies lead to new rentals by new renters. This puts a burden on the staff responsible for renting out the houses.

• Maintenance: Dwellings must often be renovated before a new tenancy. This is costly, both in terms of effort and money.

• Sales: Often, part of a portfolio is labeled to be sold off to private owners upon tenancy ending. This takes effort from the responsible staff and brings in funds for new investments. Both must be planned in advance.

• Portfolio management: Knowing which dwellings become available and which renters have a (latent) desire to move helps in portfolio planning.

• Valuation: Dwellings are organized in clusters. The market value of each cluster is estimated yearly by a valuator. (This is a legal requirement for the financial statement.) To this end, the valuator uses a discounted cash flow (DCF) method. The rate of tenancy endings (abbreviated RTE) within a cluster plays an important role here: higher rates of tenancy endings lead to increased values, caused by the decreased average discount period in case of a sell-off.

In current practice, a moving average of the cluster-wise historical RTE is used to make prognoses. The parameter is of such importance that more accurate modeling is desirable.

2. Data & Model

Data from our Enterprise Resource Planning system and from the yearly valuation process were collected over a three-year period in a data vault. A data vault is basically a time-stamped representation of the underlying database of multiple information systems, enabling us to derive features from the evolution of these data. We extracted two data sets from it. The first one contained data on (anonymised) renters, dwellings and contracts on the reference date of January 1st, 2014, augmented with an indicator for tenancy ending. The second one contained the same data on renters etc., but excluded the indicator. The data are highly noisy, with many missing values.

An ensemble classifier (see, e.g., Hastie et al., 2001) consisting of 10000 decision trees was implemented in R (R Development Core Team, 2008).
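The abstract does not specify the ensemble method further; one plausible instantiation, given here as a rough sketch rather than the association's actual code, is a random forest fitted in R. The data frame, column names and imputation choice are hypothetical.

# Sketch only: an ensemble of 10000 trees for predicting tenancy endings.
# 'tenancies' and its columns are hypothetical placeholders.
library(randomForest)

tenancies <- read.csv("tenancies_2014.csv")   # snapshot on the reference date
tenancies$ended <- factor(tenancies$ended)    # indicator for tenancy ending

set.seed(1)
fit <- randomForest(ended ~ tenant_age + building_type + rent + household_size,
                    data = tenancies,
                    ntree = 10000,            # 10000 decision trees
                    importance = TRUE,        # for Figure 2-style importances
                    na.action = na.roughfix)  # crude imputation for the many missing values

varImpPlot(fit)                               # relative importances of the indicators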
3. Results

Based on a cross-validation estimate, the model substantially outperforms the current practice of averaged historical RTE estimation. Figure 1 shows a clear lift for all sensitivity levels.

The trained model also helps in understanding (or, at least, hypothesizing) why tenancies end. Figure 2 shows the relative importances of the indicators used. Age of tenant and building type are the most important ones.

Figure 1. Lift curve depicting the performance of the ensemble model (top colored curve) versus the historical RTE estimate (bottom colored curve); the bottom black line depicts the baseline of a portfolio-wide RTE. Axes: rate of positive predictions (x) versus true positive rate (y).

Figure 2. Relative importances of the various indicators in the dataset.
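A curve like Figure 1 can be reproduced from cross-validated predictions with a few lines of R; in this sketch, p_hat and y are hypothetical vectors of out-of-fold scores and observed tenancy-ending labels (1 = ended).

# Sketch: lift/gains curve of true positive rate against the rate of
# positive predictions, for hypothetical out-of-fold scores p_hat and labels y.
lift_curve <- function(p_hat, y) {
  ord <- order(p_hat, decreasing = TRUE)   # rank dwellings by predicted risk
  y   <- y[ord]
  tpr <- cumsum(y) / sum(y)                # recall among the top-k predictions
  rpp <- seq_along(y) / length(y)          # rate of positive predictions
  plot(rpp, tpr, type = "l",
       xlab = "Rate of positive predictions", ylab = "True positive rate")
  abline(0, 1, lty = 2)                    # baseline of a portfolio-wide RTE
}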

The functional dependencies between the most important indicators and the outcome are visualized by partial dependence plots. (These average over the other indicators and must be interpreted with caution.) An example, showing the dependency on tenant age, is given in Figure 3.

Figure 3. Partial dependency plot showing the probability of tenancy ending versus tenant age.
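A plot like Figure 3 can be obtained from a fitted randomForest object with the package's built-in helper; this sketch reuses the hypothetical names introduced above.

# Sketch: partial dependence of the predicted tenancy ending on tenant age.
# 'fit' and 'tenancies' are the hypothetical objects from the earlier sketch;
# the factor level "1" is assumed to denote an ended tenancy.
library(randomForest)
partialPlot(fit, pred.data = tenancies, x.var = "tenant_age",
            which.class = "1",
            xlab = "Tenant age", ylab = "Partial dependence")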

4. Conclusion
Application of machine learning techniques for predict-
ing tenancy endings is viable. It allows for better plan-
ning than the currently used methods. It mitigates
risks, lowers costs and potentially improves revenue.
In the near future, we plan to improve our model by
including more data and modeling a different outcome
variable, i.e., the period until tenancy ending.

References
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The
elements of statistical learning. Springer Series in
Statistics. New York, NY, USA: Springer New York
Inc.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Anomaly Analytics and Structural Assessment in Process Industries

Martin Atzmueller [email protected]


Tilburg University (TiCC), Warandelaan 2, 5037 AB Tilburg, The Netherlands
David Arnu [email protected]
RapidMiner GmbH, Stockumer Straße 475, 44227 Dortmund, Germany
Andreas Schmidt [email protected]
University of Kassel (ITeG), Wilhelmshöher Allee 73, 34121 Kassel, Germany

Keywords: anomaly detection, exceptional model mining, sequential patterns, industry 4.0

Abstract

Detecting anomalous behavior can be of critical importance in an industrial application context: while modern production sites feature sophisticated alarm management systems, these mostly react to single events. In the context of process industries and heterogeneous data sources, we model sequential alarm data for anomaly detection and analysis, based on first-order Markov chain models. We outline hypothesis-driven and description-oriented modeling and provide an interactive dashboard for exploration and visualization.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

In many industrial areas, production facilities have reached a high level of automation: sensor readings are constantly analyzed and may trigger various forms of alarms. The analysis of (exceptional) sequential patterns is then an important task for obtaining insights into the process and for modelling predictive applications. The research project Early detection and decision support for critical situations in production environments (FEE) aims at detecting critical situations in production environments as early as possible and at supporting the facility operator in handling these situations, e.g., (Atzmueller et al., 2016a). Here, appropriate abstractions and analytics methods are necessary to move from reactive to proactive behavior.

This paper summarizes the implementation of a comprehensive modeling and analytics approach for anomaly detection and analysis of heterogeneous data, as presented in (Atzmueller et al., 2017a).

2. Related Work

The investigation of sequential patterns and sequential trails is an interesting and challenging task in data mining and network science, in particular in graph mining and social network analysis, e.g., (Atzmueller, 2014; Atzmueller, 2016b). In previous work (Atzmueller et al., 2016b), we presented the DASHTrails approach, which incorporates probability distributions for deriving transitions utilizing HypTrails (Singer et al., 2015). Based on that, the HypGraphs framework (Atzmueller et al., 2017b) provides a more general modeling approach: using general weight-attributed network representations, we can infer transition matrices as graph interpretations.

Sequential pattern analysis has also been performed in the context of alarm management systems, where sequences are represented by the order of alarm notifications, e.g., (Folmer et al., 2014; Abele et al., 2013; Vogel-Heuser et al., 2015). In contrast to those approaches, we provide a systematic approach for the analysis of sequential transition matrices and their comparison relative to a set of hypotheses. Thus, similar to evidence networks in the context of social networks, e.g., (Mitzlaff et al., 2011), we model transitions assuming a certain interpretation of the data towards a sequential representation. We can then identify important influence factors.


3. Method

The detection and analysis of irregular or exceptional patterns, i.e., anomalies (Hawkins, 1980; Akoglu et al., 2015), in complex-structured heterogeneous data is a novel research area, e.g., for identifying new and/or emerging behavior, or for identifying detrimental or malicious activities. The former can be used for deriving new information and knowledge from the data, for identifying events in time or space, or for identifying interesting, important or exceptional groups.

In this paper, we focus on a combined detection and analysis approach utilizing heterogeneous data. That is, we include semi-structured as well as structured data to enhance the analysis. Furthermore, we also outline a description-oriented technique that allows not only the detection of anomalous patterns, but also their description using a given set of features. In particular, the concept of exceptional model mining (Leman et al., 2008; Atzmueller, 2015; Duivesteijn et al., 2016) suitably enables such description-oriented approaches, adapting methods for the detection of interesting subgroups (that is, subgroup discovery) with more advanced target concepts for identifying exceptional (anomalous) groups. In our application context of industrial production plants in an Industry 4.0 setting, cf. (Vogel-Heuser et al., 2015; Folmer et al., 2017), we based our anomaly detection system on the analysis of the plant topology and alarm logs, as well as on the similarity-based analysis of metric sensor readings. The combined approach integrates both.

For sequential data, we formulate the "reference behavior" by collecting episodes of normal situations, which are typically observed for long-running processes. Episodes of alarm sequences (formulated as hypotheses) can be compared to the normal situations in order to detect deviations, i.e., abnormal episodes. We map these sequences to transitions between the functional units of an industrial plant. The results can also be used for diagnostics, by inspecting the transitions in detail. In summary, we utilize Bayesian inference on a first-order Markov chain model, see Figure 1. As input, we provide a (data) matrix containing the transition frequencies between the respective states, according to the observed data. In addition, we provide a set of hypotheses given by (row-normalized) stochastic matrices that model the assumed behavior. The estimation method outputs an evidence value for each hypothesis, which can be used for ranking; using the evidence values, we can also compare the hypotheses in terms of their significance.

For modeling, we use the freely available RapidMiner (Mierswa et al., 2006) extension of HypGraphs, which calculates the evidence values for different belief weights k and compares them directly with the given hypothesis and a random transition as a lower bound.

Figure 1. Overview of the modeling and analysis process: the data (as a transition matrix) and a set of weighted network/graph hypotheses (Hypothesis 1, ..., Hypothesis n) enter an estimation step, which outputs an evidence value per hypothesis for analysis and presentation.
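To make the estimation step concrete, the following R sketch builds a first-order transition count matrix from an alarm sequence and scores two row-stochastic hypothesis matrices by a HypTrails-style log evidence for several belief weights k. It is a toy illustration, not the RapidMiner/HypGraphs implementation: the states, the sequence, the hypotheses and the prior elicitation alpha = k*H + 1 (one common choice, cf. Singer et al., 2015) are all assumptions made for this example.

# Toy sketch of the Markov-chain/evidence comparison described above.
states <- c("A", "B", "C")                                  # e.g., functional units of the plant
alarms <- c("A", "B", "B", "C", "A", "B", "C", "C", "A")    # toy alarm sequence

# First-order transition counts n[i, j]: alarms in unit i followed by unit j.
n <- table(factor(head(alarms, -1), levels = states),
           factor(tail(alarms, -1), levels = states))

# Log marginal likelihood of the counts under a Dirichlet prior elicited
# from a row-stochastic hypothesis H with belief weight k (alpha = k*H + 1).
log_evidence <- function(n, H, k) {
  alpha <- k * H + 1
  sum(lgamma(rowSums(alpha)) - lgamma(rowSums(alpha + n)) +
      rowSums(lgamma(alpha + n) - lgamma(alpha)))
}

H_random <- matrix(1/3, 3, 3, dimnames = list(states, states))   # uninformed baseline
H_chain  <- matrix(c(0, 1, 0,                                    # hypothesis: A -> B -> C -> A
                     0, 0, 1,
                     1, 0, 0), 3, 3, byrow = TRUE,
                   dimnames = list(states, states))

sapply(c(1, 10, 100), function(k)
  c(random = log_evidence(n, H_random, k),
    chain  = log_evidence(n, H_chain, k)))

A hypothesis whose log evidence stays above the random baseline as k grows is better supported by the observed transitions; this is the kind of ranking that the estimation step outputs.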
4. Process Model & Implementation

The first part of the analytical workflow is to build the transition network for training and testing the hypotheses. We build these hypotheses on real plant data and calculate the transition matrices for hourly time slots. In the same way, after further preprocessing (smoothing and down-sampling), we aggregate the corresponding raw sensor data. The calculated outlier score (Amer & Goldstein, 2012) is then presented together with the evidence scores: a high outlier score indicates possibly anomalous sensor readings, and a low evidence score indicates deviating transition patterns in the alarm sequences. For further inspection of the outlier scores, we provide an additional dashboard. It shows the k highest outlier scores for single sensor readings in a selected time segment, together with the associated sensor readings. Drilling down from a high level of abstraction for a whole processing unit to single sensor readings, a process engineer is then able to analyze possible critical situations in a convenient way.
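The outlier scores on the aggregated sensor data can be computed with nearest-neighbor methods in the spirit of Amer & Goldstein (2012); the following R sketch shows one simple variant (the mean distance to the k nearest neighbors), with a hypothetical sensor matrix X and an arbitrary choice of k.

# Sketch: a simple k-nearest-neighbor outlier score for aggregated sensor data.
# X would hold one row per (hourly) time slot and one column per smoothed,
# down-sampled sensor; both X and k = 10 are hypothetical.
knn_outlier_score <- function(X, k = 10) {
  D <- as.matrix(dist(X))                       # pairwise distances between time slots
  diag(D) <- Inf                                # ignore the distance of a slot to itself
  apply(D, 1, function(d) mean(sort(d)[1:k]))   # mean distance to the k nearest neighbors
}

# scores <- knn_outlier_score(scale(X))
# order(scores, decreasing = TRUE)[1:5]         # the five most anomalous time slots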
For future work, we aim at extending the proposed approach by integrating the knowledge gained from a conceptual plant knowledge graph (Atzmueller et al., 2016a). We also plan to integrate the system into the Big Data architecture proposed in (Klöpper et al., 2016), considering further extensions on Big Data frameworks, e.g., (Meng et al., 2016; Carbone et al., 2015), and advanced assessment, exploration and explanation options, e.g., (Atzmueller et al., 2006; Atzmueller & Roth-Berghofer, 2010; Seipel et al., 2013), using advanced descriptive data analysis and modeling techniques, e.g., (Atzmueller, 2016a).

Acknowledgments

This work was partially funded by the BMBF project FEE under grant number 01IS14006.


References

Abele, L., Anic, M., Gutmann, T., Folmer, J., Kleinsteuber, M., & Vogel-Heuser, B. (2013). Combining Knowledge Modeling and Machine Learning for Alarm Root Cause Analysis. MIM (pp. 1843–1848).
Akoglu, L., Tong, H., & Koutra, D. (2015). Graph Based Anomaly Detection and Description. DMKD, 29, 626–688.
Amer, M., & Goldstein, M. (2012). Nearest-Neighbor and Clustering-based Anomaly Detection Algorithms for RapidMiner. Proc. RCOMM (pp. 1–12).
Atzmueller, M. (2014). Analyzing and Grounding Social Interaction in Online and Offline Networks. Proc. ECML-PKDD (pp. 485–488). Springer.
Atzmueller, M. (2015). Subgroup Discovery. WIREs: Data Mining and Knowledge Discovery, 5, 35–49.
Atzmueller, M. (2016a). Detecting Community Patterns Capturing Exceptional Link Trails. Proc. IEEE/ACM ASONAM. IEEE Press.
Atzmueller, M. (2016b). Local Exceptionality Detection on Social Interaction Networks. Proc. ECML-PKDD (pp. 485–488). Springer.
Atzmueller, M., Arnu, D., & Schmidt, A. (2017a). Anomaly Detection and Structural Analysis in Industrial Production Environments. Proc. International Data Science Conference. Salzburg, Austria.
Atzmueller, M., Baumeister, J., & Puppe, F. (2006). Introspective Subgroup Analysis for Interactive Knowledge Refinement. Proc. AAAI FLAIRS (pp. 402–407). Palo Alto, CA, USA: AAAI Press.
Atzmueller, M., Kloepper, B., Mawla, H. A., Jäschke, B., Hollender, M., Graube, M., Arnu, D., Schmidt, A., Heinze, S., Schorer, L., Kroll, A., Stumme, G., & Urbas, L. (2016a). Big Data Analytics for Proactive Industrial Decision Support. atp edition, 58.
Atzmueller, M., & Roth-Berghofer, T. (2010). The Mining and Analysis Continuum of Explaining Uncovered. Proc. AI-2010. London, UK: SGAI.
Atzmueller, M., Schmidt, A., & Kibanov, M. (2016b). DASHTrails: An Approach for Modeling and Analysis of Distribution-Adapted Sequential Hypotheses and Trails. Proc. WWW 2016 (Companion). ACM.
Atzmueller, M., Schmidt, A., Kloepper, B., & Arnu, D. (2017b). HypGraphs: An Approach for Analysis and Assessment of Graph-Based and Sequential Hypotheses. In New Frontiers in Mining Complex Patterns, LNAI. Springer.
Carbone, P., Ewen, S., Haridi, S., Katsifodimos, A., Markl, V., & Tzoumas, K. (2015). Apache Flink: Stream and Batch Processing in a Single Engine. Data Engineering, 28.
Duivesteijn, W., Feelders, A. J., & Knobbe, A. (2016). Exceptional Model Mining. DMKD, 30, 47–98.
Folmer, J., Kirchen, I., Trunzer, E., Vogel-Heuser, B., Pötter, T., Graube, M., Heinze, S., Urbas, L., Atzmueller, M., & Arnu, D. (2017). Challenges for Big and Smart Data in Process Industries. atp edition, 01-02.
Folmer, J., Schuricht, F., & Vogel-Heuser, B. (2014). Detection of Temporal Dependencies in Alarm Time Series of Industrial Plants. Proc. 19th IFAC World Congress, 24–29.
Hawkins, D. (1980). Identification of Outliers. London, UK: Chapman and Hall.
Klöpper, B., Dix, M., Schorer, L., Ampofo, A., Atzmueller, M., Arnu, D., & Klinkenberg, R. (2016). Defining Software Architectures for Big Data Enabled Operator Support Systems. Proc. INDIN.
Leman, D., Feelders, A., & Knobbe, A. (2008). Exceptional Model Mining. Proc. ECML-PKDD (pp. 1–16). Heidelberg, Germany: Springer.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al. (2016). MLlib: Machine Learning in Apache Spark. JMLR, 17, 1–7.
Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T. (2006). YALE: Rapid prototyping for complex data mining tasks. Proc. KDD (pp. 935–940). New York, NY, USA: ACM.
Mitzlaff, F., Atzmueller, M., Benz, D., Hotho, A., & Stumme, G. (2011). Community Assessment using Evidence Networks. Analysis of Social Media and Ubiquitous Data. Heidelberg, Germany: Springer.
Seipel, D., Köhler, S., Neubeck, P., & Atzmueller, M. (2013). Mining Complex Event Patterns in Computer Networks. In New Frontiers in Mining Complex Patterns, LNAI. Springer.
Singer, P., Helic, D., Hotho, A., & Strohmaier, M. (2015). HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails. Proc. WWW. New York, NY, USA: ACM.
Vogel-Heuser, B., Schütz, D., & Folmer, J. (2015). Criteria-based Alarm Flood Pattern Recognition Using Historical Data from Automated Production Systems (aPS). Mechatronics, 31.
