Proceedings
Welcome!
Benelearn is the annual machine learning conference of the Benelux. It serves as a forum for researchers to
exchange ideas, present recent work, and foster collaboration in the broad field of Machine Learning and its
applications. These are the proceedings of the 26th edition, Benelearn 2017.
Benelearn 2017 takes place largely on the campus of the Technische Universiteit Eindhoven, De Zaale, Eind-
hoven. The Friday programme is located in De Zwarte Doos (see https://goo.gl/maps/XgKEo7JxyTC2),
and the Saturday programme in Auditorium (see https://goo.gl/maps/B3PnpuCjgMJ2). The conference
dinner on Friday evening is the only off-campus event; this takes place in the DAF Museum, Tongelresestraat
27, 5613 DA Eindhoven (see https://goo.gl/maps/zNLrhpSqimk).
As part of the main conference programme, we organize three special tracks: one on Complex Networks, one
on Deep Learning, and one Industry Track. Distributed over all tracks, contributing researchers not only
span all three Benelux countries, but also include affiliations from ten additional countries.
We thank all members of all programme committees for their service, and all authors of all papers for their
contributions!
Kind regards,
The Benelearn 2017 organizers
Organization
Conference Chairs: Wouter Duivesteijn, Mykola Pechenizkiy
Complex Networks Track Chair: George Fletcher
Deep Learning Track Chairs: Vlado Menkovski, Eric Postma
Industry Track Chairs: Joaquin Vanschoren, Peter van der Putten
Local Organization: Riet van Buul
Programme Committee Conference Track
Dick Epema, Delft University of Technology
Alexandru Iosup, Vrije Universiteit Amsterdam and TU Delft
Nelly Litvak, University of Twente
Taro Takaguchi, National Institute of Information and Communications Technology
Yinghui Wu, University of California Santa Barbara
Nikolay Yakovets, Eindhoven University of Technology
Contents
Invited Talks
Toon Calders — Data mining, social networks and ethical implications . . . . . . . . . . . . . . . . 6
Max Welling — Generalizing Convolutions for Deep Learning . . . . . . . . . . . . . . . . . . . . . 7
Jean-Charles Delvenne — Dynamics and mining on large networks . . . . . . . . . . . . . . . . . . 8
Holger Hoos — The transformative impact of automated algorithm design: ML, AutoML and beyond 9
Conference Track
Research Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
L.F.J.M. Kanters — Extracting relevant discussion from Reddit Science AMAs . . . . . . . . 11
I.G. Veul — Locally versus Globally Trained Word Embeddings for Automatic Thesaurus Con-
struction in the Legal Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Rianne Conijn, Menno van Zaanen — Identifying writing tasks using sequences of keystrokes 28
Lars Lundberg, Håkan Lennerstad, Eva Garcia-Martin, Niklas Lavesson, Veselka Boeva —
Increasing the Margin in Support Vector Machines through Hyperplane Folding . . . . 36
Martijn van Otterlo, Martin Warnaar — Towards Optimizing the Public Library: Indoor
Localization in Semi-Open Spaces and Beyond . . . . . . . . . . . . . . . . . . . . . . 44
Antoine Adam, Hendrik Blockeel — Constraint-based measure for estimating overlap in clus-
tering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Extended Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Thijs van de Laar, Bert de Vries — A Probabilistic Modeling Approach to Hearing Loss Com-
pensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Anouk van Diepen, Marco Cox, Bert de Vries — An In-situ Trainable Gesture Classifier . . . 66
Marcia Fissette, Bernard Veldkamp, Theo de Vries — Text mining to detect indications of
fraud in annual reports worldwide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Veronika Cheplygina, Lauge Sørensen, David M.J. Tax, Marleen de Bruijne, Marco Loog —
Do you trust your multiple instance learning classifier? . . . . . . . . . . . . . . . . . 72
Marco Cox, Bert de Vries — A Gaussian process mixture prior for hearing loss modeling . . . 74
Piotr Antonik, Marc Haelterman, Serge Massar — Predicting chaotic time series using a
photonic reservoir computer with output feedback . . . . . . . . . . . . . . . . . . . . . 77
Piotr Antonik, Marc Haelterman, Serge Massar — Towards high-performance analogue readout
layers for photonic reservoir computers . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Niek Tax, Natalia Sidorova, Wil M.P. van der Aalst — Local Process Models: Pattern Mining
with Process Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Christina Papagiannopoulou, Stijn Decubber, Willem Waegeman, Matthias Demuzere, Niko
E.C. Verhoest, Diego G. Miralles — A non-linear Granger causality approach for un-
derstanding climate-vegetation dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Dounia Mulders, Michel Verleysen, Giulia Liberati, André Mouraux — Characterizing Resting
Brain Activity to Predict the Amplitude of Pain-Evoked Potentials in the Human Insula 89
Quan Nguyen, Bert de Vries, Tjalling J. Tjalkens — Probabilistic Inference-based Reinforce-
ment Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Veselka Boeva, Milena Angelova, Elena Tsiporkova — Identifying Subject Experts through
Clustering Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Michael Stock, Bernard De Baets, Willem Waegeman — An Exact Iterative Algorithm for
Transductive Pairwise Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Sergio Consoli, Jacek Kustra, Pieter Vos, Monique Hendriks, Dimitrios Mavroeidis — To-
wards an automated method based on Iterated Local Search optimization for tuning the
parameters of Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Jacopo De Stefani, Gianluca Bontempi, Olivier Caelen, Dalila Hattab — Multi-step-ahead
prediction of volatility proxies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Tom Viering, Jesse Krijthe, Marco Loog — Generalization Bound Minimization for Active
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Jesse H. Krijthe, Marco Loog — Projected Estimators for Robust Semi-supervised Classification 110
Dimitris Paraschakis — Towards an Ethical Recommendation Framework . . . . . . . . . . . 112
Björn Brodén, Mikael Hammar, Bengt J. Nilsson, Dimitris Paraschakis — An Ensemble Rec-
ommender System for e-Commerce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Sara Magliacane, Tom Claassen, Joris M. Mooij — Ancestral Causal Inference . . . . . . . . 118
Martin Atzmueller — Exceptional Model Mining in Ubiquitous and Social Environments . . . 121
Sibylle Hess, Katharina Morik, Nico Piatkowski — PRIMPing Boolean Matrix Factorization
through Proximal Alternating Linearized Minimization . . . . . . . . . . . . . . . . . . 124
Sebastijan Dumančić, Hendrik Blockeel — An expressive similarity measure for relational
clustering using neighbourhood trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Industry Track
Research Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Maria Biryukov — Comparison of Syntactic Parsers on Biomedical Texts . . . . . . . . . . . 176
Extended Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lodewijk Nauta, Max Baak — Eskapade: a lightweight, python based, analysis framework . . 183
Dirk Meijer, Arno Knobbe — Unsupervised region of interest detection in sewer pipe images:
Outlier detection and dimensionality reduction methods . . . . . . . . . . . . . . . . . 184
Dejan Radosavljevik, Peter van der Putten — Service Revenue Forecasting in Telecommuni-
cations: A Data Science Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Michiel van Wezel — Predicting Termination of Housing Rental Agreements with Machine
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Martin Atzmueller, David Arnu, Andreas Schmidt — Anomaly Analytics and Structural As-
sessment in Process Industries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Invited Talks
Data mining, social networks and ethical implications
Abstract
Recently we have seen a remarkable increase of awareness of the value of data. Whereas companies and governments mainly used to gather data about their clients just to support their operations, nowadays they are actively exploring new applications. For instance, a telecom operator may use call data not only to bill its customers, but also to derive social relations between its customers which may help to improve churn models, and governments use mobility data to chart mobility patterns that help to assess the impact of planned infrastructure works. I will give an overview of my research in this fascinating area, including pattern mining, the analysis of influence propagation in social networks, and ethical challenges such as models that discriminate.
Generalizing Convolutions for Deep Learning
Abstract
Arguably, most excitement about deep learning revolves around the performance of convolutional neural networks and their ability to automatically extract useful features from signals. In this talk I will present work from AMLAB where we generalize these convolutions. First we study convolutions on graphs and propose a simple new method to learn embeddings of graphs which are subsequently used for semi-supervised learning and link prediction. We discuss applications to recommender systems and knowledge graphs. Second we propose a new type of convolution on regular grids based on group transformations. This generalizes normal convolutions based on translations to larger groups including the rotation group. Both methods often result in significant improvements relative to the current state of the art.

Joint work with Thomas Kipf, Rianne van den Berg and Taco Cohen.
Dynamics and mining on large networks
Abstract
A network, i.e. the data of nodes connected by edges, often comes as the support of dynamical interactions. For example a social network is often measured as the trace of an information flow (phone calls, messages), energy and phase information flow through power networks, biochemical networks are the skeleton of complex reaction systems, etc. It is therefore natural to mine network-shaped data jointly with a real or modelled dynamics taking place on it. In this talk we review how dynamics can provide efficient and accurate methods for community detection, classification, centrality and assortativity measures.
The transformative impact of automated algorithm design: ML,
AutoML and beyond
Abstract
Techniques from artificial intelligence, and especially machine learning, are fundamentally changing the way we solve challenging computational problems, and recently, automated machine learning (AutoML) has begun to take this to a new level. In this talk, I will share my perspective on the success of ML and AutoML, and discuss how the fundamental concepts and tools that enable both have a much broader impact than commonly perceived. In particular, I will highlight the role of a fruitful interplay between machine learning and optimisation in this context, comment on general approaches to automated algorithm design, and share my thoughts on the next big challenge.
Conference Track
Research Papers
Extracting relevant discussion from Reddit Science AMAs
Abstract

The social network and content aggregation website Reddit occasionally hosts Q&A sessions with scientists called science AMA (Ask Me Anything). These science AMAs are conducted through the comment system of Reddit, which has a tree structure, mark-up and community driven feedback on both users and comments in the form of "karma" scores.

Most of the actual discussion in these science AMAs tends to be of high quality. However, a large number of the comments are superfluous and not really part of the conversation with the scientist. The goal of this project is to determine if text mining methods can be used to filter out the unwanted comments. A secondary goal is to determine the relative importance of Reddit meta-data (tree structure, karma scores, etc.) compared to the actual content of the comments.

The Python Reddit API was used to retrieve the AMAs. The CoreNLP tools were used to extract tokens, sentences, named entities and sentiment. These were combined with other information, like Reddit meta-data and WordNet, and used to extract features. The classification was done by a Gaussian naive Bayes classifier using the scikit-learn toolbox.

Classification using all features or only text-based features was effective, both yielding a precision/recall/F1-score of 0.84/0.99/0.91. Only using Reddit based features was slightly less effective, yielding 0.89/0.63/0.74. Only using a single WordNet based similarity feature still worked, yielding 0.81/0.99/0.89.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

On reddit there is the tradition of the Ask Me Anything, or AMA, threads. These are a kind of informal interview or online Q&A session with whomever started the thread (the OP or original poster); anybody can participate and ask questions. For about 3 years the /r/science subreddit, a subforum dedicated to science, has been doing AMAs with scientists as a kind of science outreach. As a result there are now well over 600 different online AMAs with scientists covering a wide variety of subjects, with more being done each week. The strict moderation in /r/science has resulted in a subreddit culture that tends towards serious discussion which, combined with the enthusiasm of the scientists involved, yields AMAs of an exceptionally high quality. The informal nature of reddit allows lay-people easy access, while also allowing for more in-depth questions. The hierarchical structure of the comment section, as well as the lack of time constraints of an AMA (a particularly enthusiastic OP might still be answering questions days later), encourages follow-up discussion. And the /r/science community has a decent number of scientists among its members, who are recognizable due to flair next to their username, so experts other than the OP are likely to join in the discussion. Despite this, large parts of these AMAs are still superfluous, consisting of unanswered questions and tangential discussions; there is clearly a lot of knowledge to be found in these AMAs, but some manner of filtering might be required first.

In order to archive these AMAs, and to assign them their doi-number so they can actually be referenced in scientific literature, the Winnower has been copying parts of these AMAs to their own website¹. Some of the larger AMAs can end up having many thousands of comments, with only a tiny fraction of them actually being worth archiving. So the Winnower applies a filter to these AMAs, but it is rather crude.

¹ https://www.thewinnower.com/topics/science-ama
They take every comment that is at the 2nd level of the comment tree and is made by the scientist being interviewed, which tend to be answers to questions, and their parent comment, which would contain the question. Everything else, including follow-up discussion, is discarded.

The primary research question of this paper is to what extent it is possible, using text-mining, to distinguish between informative & relevant comments and the rest. A secondary research question is to what extent reddit meta-data and comment hierarchy is necessary for this classification.

2. Background

2.1. Reddit

Reddit (Reddit, 2016) is a social aggregation website where users can submit content (either links to a webpage or a piece of text), rate this content, comment on it, and of course, consume it. It is quite a large site; for example, in 2015² alone it received 73.15 billion submissions and 73.15 billion comments written by 8.7 million users.

The frontpage of reddit, the entry point to the website, consists of a list of submission titles ordered by popularity. Popularity of a submission is determined by user feedback in the form of upvotes and downvotes fed into an unknown stochastic function yielding a score. Official reddit etiquette³ states that one should vote based on whether or not "you think something contributes to the conversation", though in practice voting is also based on agreement or amusement. Further user feedback is possible by "gilding", which costs the gilder $4, confers benefits to the poster of the gilded content and places a little golden star medallion next to the submission. Each piece of content is submitted to a specific subreddit, which functions both as a category and community of sorts; a user can subscribe to subreddits and their personal frontpage is a composite of the subreddits they are subscribed to. Usually when mentioning a subreddit the name is preceded by '/r/' because Reddit automatically turns such a mention into a link to the subreddit.

Each submission has an associated comment section where users can have a conversation. The conversation in a reddit comment section is a tree, each comment being either a direct reply to the original submission or to another comment in the thread. Just as a submission, the comments themselves can also be voted upon or be gilded, and will usually be displayed in order of popularity. The user who started the thread, by submitting the piece of content the thread pertains to, is referred to as OP, short for original poster.

All the upvotes and downvotes for every submission and comment of a user are combined into a link and comment karma, by subtracting the sum of downvotes from the sum of upvotes. Each user's karma is publicly visible and tends to be used, in combination with the length of time that user has been a redditor, as an informal indication of a user's reliability.

Moderation of reddit is generally handled by volunteer moderators, or mods, with responsibilities and permissions limited to the subreddits they moderate. Moderation policy varies from subreddit to subreddit. Among the tools for mods are deletion of content, adding flair, banning and shadow-banning. Flair is a short (64 characters) piece of text that can be used to customize submissions and users. User flair can be set by the user or a mod (depending on subreddit policy) and will be shown next to the user's username on submissions and comments. Submission flair can be set by the user who submitted the content or a mod (again depending on subreddit policy) and will be shown next to the submission title.

In /r/science submission flair is used to categorize submissions by field, which is a fairly typical use for submission flairs in reddit. User flair policy in /r/science is quite unique: the mods use user flair to indicate the academic qualifications of a user (for example it might say "BS | Artificial Intelligence"), and these qualifications are verified by the mods. The /r/science mods call this their "science verified user program"⁴, the intention of which is to allow readers to distinguish between "educated opinions and random comments"; verified users are also kept to a higher standard of conduct.

2.2. Related work

Weimer et al. (Weimer et al., 2007) did work on automatically assessing post quality in online forums; they attempt to assess the usefulness of different types of features, some of which are based on forum metadata. This is rather similar to the work in this paper. Their result was that classification based on sets lacking any forum based features performs slightly worse than classification based on sets including those features. Their work uses annotation based on user feedback through built-in features of the forum software, and the general goal underlying the classification differs a bit from what is being done here.

² https://redditblog.com/2015/12/31/reddit-in-2015/
³ https://www.reddit.com/wiki/reddiquette
⁴ https://www.reddit.com/r/science/wiki/flair
Siersdorfer et al. (Siersdorfer et al., 2010) did a similar study based on youtube comments. It may be worth noting that this study is from before youtube switched to google+ comments, when it was still feasible to moderate comment sections. The interesting thing here is that they found a significant correlation between the sentiment of a comment, as analyzed by SentiWordNet, and the scores users attributed to comments. Though again, just as with Weimer et al., the point of the classification is a bit different from what is being done here.

On a slightly different note, Androutsopoulos et al. (Androutsopoulos et al., 2000) compare the performance of a naive Bayes classifier on spam detection. Their pipeline is fairly simple; the most complex configuration used in their work merely employs a lemmatizer and stop-list, though despite this it manages to get good recall and precision on their email corpus. The use of a naive Bayes classifier is especially interesting since its transparent decision making process would allow one to easily assess the impact of each feature.

3. Methods

3.1. Data

The data was taken from the following two reddit AMAs:

• Hi reddit! Im Alice Jones, an expert on antisocial behaviour and psychopathy at Goldsmiths, University of London. I research emotion processing and empathy, focusing on childhood development and education. AMA!⁵ with 239 comments (172 used).

• Hi, my name is Paul Helquist, Professor and Associate Chair of Chemistry & Biochemistry, at the University of Notre Dame. Ask me anything about organic synthesis and my career.⁶ with 234 comments (121 used).

⁵ https://www.reddit.com/r/science/comments/4twmrc/science_ama_series_hi_reddit_im_alice_jones_an/
⁶ https://www.reddit.com/r/science/comments/52k2gt/american_chemical_society_ama_hi_my_name_is_paul/

The data was annotated manually based on whether or not a given comment was informative & relevant and would therefore be worth keeping, as if the annotator was editing down an interview. The information available to the annotator was the text of the comments themselves as well as the comment hierarchy. Normally in reddit sibling comments would be ordered by popularity, but during annotation this ordering was randomized. Comments without any replies were not shown during annotation and were assumed to be not worth keeping, since they cannot be part of any discussion or question/answer pairs.

The data was annotated by two different annotators who each annotated all comments presented to them. The Cohen's Kappa is κ = 0.45, indicating a mere moderate agreement. In the interest of preserving as many relevant comments as possible, a comment will be considered worth keeping if at least one of the annotators thought it was worth keeping.

The AMA by Alice Jones was used as the training set and the AMA by Paul Helquist as the test set.

3.2. Pipeline

3.2.1. Data gathering

The data was gathered using the Python Reddit API (Praw-dev, 2016), which allows one to essentially do and see from a Python script everything one would be able to do or read using the reddit website. This was used to gather the following from the AMAs:

• The text of the original submission, which contains an introduction of both the OP and the topic of the AMA.
• The text of each comment in the comment section.
• For the original submission and comment:
  – The amount of times it was gilded.
  – The karma (upvotes minus downvotes) it received.
  – The flair of the user who wrote it.
• For each of the users:
  – Whether the user currently has gold.
  – The total amount of comment karma for all their comments all over reddit.
  – The total amount of link karma they received.
  – Whether the user is a mod of any subreddit.
  – Whether the user was shadowbanned.
  – A breakdown of link and comment karma by subreddit it was gained on.

Note that the text retrieved was encoded in utf-8, contains xml character entity references and has markup in the markdown format. All the data was stored in XML files while maintaining the hierarchy of the comment section.
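The data gathering step can be sketched with a few lines of Python. The snippet below is not the authors' script; it is a minimal illustration using the current PRAW interface (the paper used the 2016 release of the library), with placeholder credentials, that walks the comment tree of one AMA and stores text and meta-data in an XML file.

```python
import praw
import xml.etree.ElementTree as ET

# Placeholder credentials; PRAW requires registering a "script" application with reddit.
reddit = praw.Reddit(client_id="<id>", client_secret="<secret>",
                     user_agent="ama-scraper (illustrative sketch)")

url = ("https://www.reddit.com/r/science/comments/4twmrc/"
       "science_ama_series_hi_reddit_im_alice_jones_an/")
submission = reddit.submission(url=url)
submission.comments.replace_more(limit=None)   # expand all "load more comments" stubs

root = ET.Element("ama", title=submission.title)

def add_comment(parent_node, comment):
    """Store a comment with some of its meta-data, preserving the tree structure."""
    node = ET.SubElement(parent_node, "comment",
                         karma=str(comment.score),
                         gilded=str(comment.gilded),
                         flair=comment.author_flair_text or "")
    node.text = comment.body
    for reply in comment.replies:
        add_comment(node, reply)

for top_level in submission.comments:
    add_comment(root, top_level)

ET.ElementTree(root).write("alice_jones_ama.xml", encoding="utf-8")
```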
3.2.2. Preprocessing and normalization

Preprocessing and normalization was done using the Stanford CoreNLP (Manning et al., 2014) language processing toolkit; the following built-in annotators were used:

• Tokenization
• Sentence splitting
• Lemmatization
• Named entity recognition
• Syntactic analysis
• Part of speech tagging
• Sentiment analysis

Any token was dropped if it was an url, if it is a stopword, or if it is neither a verb, noun, adjective or adverb. Though some information, like the number of urls in the comment, was kept as a feature.

3.2.3. Feature extraction

The features were extracted using a Python script creating a series of feature vectors for each comment; see section 3.3 for details on the features. Most of the features simply consist of a piece of reddit meta-data or some quantity derived directly from the text. However two of the features (t_ws and u_fws) make use of the WordNet (Fellbaum, 1998) implementation of the Natural Language Toolkit (Bird et al., 2009), based on the lemma and part-of-speech information determined by CoreNLP. Another three features (c_+, c_-, c_o) make use of the SentiWordNet (Baccianella et al., 2010) toolkit, which expands the WordNet with negative, objective and positive sentiment values for words. The spelling based features c_m and c_co are based on the enchant python library⁷. The curseword feature c_cu is based on the profanity python library⁸.

⁷ https://github.com/rfk/pyenchant/
⁸ https://github.com/ben174/profanity

3.2.4. Classification and Evaluation

Both classification and evaluation of the classification based on the extracted feature vectors was done using the scikit-learn toolbox (Pedregosa et al., 2011). During classification and evaluation one AMA was used as a test set, while the other was used as a training set, see section 3.1. The classifier used was a Gaussian Naive Bayes (Bishop, 2006) classifier because of its transparent nature. This transparency enabled a closer examination of the features as described in section 3.4.

3.3. Features

Ultimately all features of a comment are combined into one large feature vector prior to being used for classification. The features used were split into two categories:

• Features that depend on reddit meta-data and comment structure, shown in table 2.
• Features purely based on the text of the comment, independent of reddit meta-data. Besides the ones shown in table 1 these also include token document frequencies.

The feature vector consists of the features shown in tables 2 and 1, followed by the document frequencies of a number of tokens. Which document frequencies would be included was determined by taking the top N tokens of the entire training set, ordered by document frequency. How many tokens were included, or hyperparameter N, is shown in section 4.1.

3.3.1. Similarity features

Two different types of similarity features were used: features that indicate how similar two comments are to one another.

One is based solely on which exact tokens occurred in both comments, and how frequently they occurred. This similarity was defined as follows, where df_{t,x} is the document frequency of token t in comment x:

    similarity(x, y) = Σ_{t ∈ x,y} df_{t,x} · df_{t,y}        (1)

The other similarity feature was based on WordNet path similarity. WordNet can determine a similarity between two tokens by measuring the distance between the two tokens within the WordNet network; this is the path similarity sim(a, b). The WordNet similarity of two comments was determined by taking the average of the path similarity over all possible combinations of tokens in both comments:

    WordNetsimilarity(x, y) = (1 / (|x| |y|)) Σ_{a ∈ x} Σ_{b ∈ y} sim(a, b)        (2)

These similarities were used as elements in the feature vector. The similarity between a comment and the introductory comment made by the scientist was included, as well as the similarity between the user flair (the scientific credentials as shown on /r/science) of the commenter and the scientist doing the AMA.
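A minimal sketch of the two similarity features of equations (1) and (2) is given below. It is not the original extraction script: it assumes each comment has already been reduced to a list of tokens (for equation 1) or of (lemma, WordNet part-of-speech) pairs (for equation 2), and it simplifies by using the first WordNet synset of each lemma.

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus to be downloaded

def token_similarity(tokens_x, tokens_y):
    """Equation (1): sum, over shared tokens, of the product of their frequencies."""
    df_x, df_y = Counter(tokens_x), Counter(tokens_y)
    return sum(df_x[t] * df_y[t] for t in set(df_x) & set(df_y))

def wordnet_similarity(lemmas_x, lemmas_y):
    """Equation (2): average WordNet path similarity over all pairs of lemmas."""
    if not lemmas_x or not lemmas_y:
        return 0.0
    total = 0.0
    for lemma_a, pos_a in lemmas_x:
        for lemma_b, pos_b in lemmas_y:
            syns_a = wn.synsets(lemma_a, pos=pos_a)  # pos is one of 'n', 'v', 'a', 'r'
            syns_b = wn.synsets(lemma_b, pos=pos_b)
            if syns_a and syns_b:
                sim = syns_a[0].path_similarity(syns_b[0])
                total += sim or 0.0                  # path_similarity may return None
    return total / (len(lemmas_x) * len(lemmas_y))
```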
Table 1. Features independent of reddit meta-data and comment hierarchy. The last column, JS-divergence, shows an indication of how much that feature influences classification, see section 4.4 for details.

Feature | Description | JS-divergence
c_p/t_n | The number of paragraphs divided by the number of tokens. | 3.12 · 10^1
c_url/t_n | The number of hyperlinks divided by the number of tokens. | 9.44
t_ne/t_n | The number of named entities divided by the number of tokens in the comment. | 1.33 · 10^-1
t_c/t_n | The number of correctly spelled words divided by the number of tokens in the comment. | 9.31
c_? | The number of sentences ending in a question mark in the comment. | 6.66 · 10^-1
c_m | The number of misspelled words in the comment. | 5.60 · 10^-1
c_p | The number of paragraphs in the comment. | 5.74 · 10^-1
c_+ | The average positivity of the words in the comment according to SentiWordNet. | 2.46
c_- | The average negativity of the words in the comment according to SentiWordNet. | 1.56
c_co | Fraction of correctly spelled words in the comment. | 9.31
c_cp | If the user uses capitals and periods at the start and end of their sentences 1, otherwise 0. | 4.79 · 10^2
c_lc | The number of full-caps words of more than 3 characters in the comment. | 1.25 · 10^-1
c_o | The average objectivity of the words in the comment according to SentiWordNet. | 1.95 · 10^1
c_sen | The average sentiment according to Stanford NLP by sentence in the comment. | 2.10
c_url | The number of hyperlinks in the comment. | 2.02
t_n | The number of tokens in a comment. | 5.88 · 10^-1
t_s | The similarity between this comment and the initial comment made by OP. | 1.48
t_cu | The number of cursewords in the comment. | 1.30
t_ne | The number of named entities found by Stanford NLP. | 1.02
t_ws | The WordNet similarity between this comment and the initial comment made by OP. | 7.70 · 10^2

Table 2. Features dependent on reddit meta-data and comment hierarchy. The last column, JS-divergence, shows an indication of how much that feature influences classification, see section 4.4 for details.

Feature | Description | JS-divergence
c_a | The number of ancestral comments of the comment in the tree. | 9.80 · 10^-2
c_c | The number of child comments of the comment in the tree. | 1.90
c_g | If the comment has been gilded 1 otherwise 0. | 0.00
c_k | The log amount of karma of the comment. | 1.62
c_s | The number of sibling comments of the comment in the tree. | 9.29 · 10^-1
c_opa | If a comment made by OP is among the ancestral comments 1 otherwise 0. | 0.00
c_opc | If OP replied to this comment 1 otherwise 0. | 1.08
c_opd | If a comment made by OP is among descendant comments 1 otherwise 0. | 1.20
c_opp | If the parent comment was made by OP 1 otherwise 0. | 2.05 · 10^-2
u_b | If the user was shadowbanned 1 otherwise 0. | 0.00
u_f | If the user has /r/science flair 1, otherwise 0. | 1.04
u_g | If the user has gold 1 otherwise 0. | 1.52 · 10^-1
u_m | If the user is a mod of any subreddit 1 otherwise 0. | 1.06
u_d | If the user was deleted 1 otherwise 0. | 8.81 · 10^1
u_fs | The similarity between the comment's user flair and the flair of OP. | 2.19 · 10^1
u_fws | The WordNet similarity between the comment's user flair and the flair of OP. | 2.19 · 10^1
u_kc | The log amount of comment karma of the user of the comment. | 3.66
u_kl | The log amount of link karma of the user of the comment. | 9.12 · 10^-2
u_op | If the comment's user is also the OP 1 otherwise 0. | 1.31 · 10^1
3.4. Gaussian naive Bayes

Since the naive Bayes classifier is rather transparent, it is possible to look at the way a particular feature impacts the resulting classification from a mathematical perspective. This would be a secondary method for finding the relative importance of features besides comparing classification performance.

Consider the following probabilistic model, which is used by Naive Bayes for classification, where all distributions p are Gaussian:

    p(C_k | x_1, ..., x_n) ∝ p(C_k) ∏_{i=1}^{n} p(x_i | C_k)

The predicted class will be the one with the highest probability according to the model. So say only feature x_j is being changed and one only has two classes; all the other features can simply be dropped:

    p(C_1) ∏_{i=1}^{n} p(x_i | C_1) < p(C_2) ∏_{i=1}^{n} p(x_i | C_2)
    p(C_1) p(x_j | C_1) < p(C_2) p(x_j | C_2)

And unless the difference in prior probability is quite high, the difference between the probability distributions p(x_j | C_1) and p(x_j | C_2) should determine which class is predicted. If these distributions were similar, the value of x_j would influence the posterior probability only

preference for recall over precision corresponds to a preference for keeping interesting comments over discarding uninteresting ones.

4.2. All features

In order to determine if this classification works at all, a test run was performed where every feature was used, as well as the document frequencies of the top N = 575 tokens.

All features
                    Prediction
                 Discard    Keep
Truth  Discard        13      17
       Keep            1      90
Precision 0.84
Recall    0.99
F1        0.91

4.3. Reddit vs Text features

In order to determine if the use of reddit metadata based features has any effect, the test has been repeated twice. Once with features solely derived from the comment text and the document frequencies of the top N = 575 tokens:

Text features only
                    Prediction
                 Discard    Keep
Truth
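The classification of section 3.2.4 and the per-feature analysis of section 3.4 can be sketched as follows. This is not the authors' code: `X_train`, `y_train`, `X_test` and `y_test` are assumed to hold the extracted feature vectors and the keep/discard annotations (with keep encoded as 1), and the divergence between the two fitted class-conditional Gaussians of each feature is computed numerically as a standard Jensen-Shannon divergence; the exact scaling behind the JS-divergence column of tables 1 and 2 is not stated in the text.

```python
import numpy as np
from scipy.stats import norm
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_fscore_support

clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(precision_recall_fscore_support(y_test, y_pred, average="binary"))

def js_divergence(mu1, var1, mu2, var2, grid_size=10000):
    """Jensen-Shannon divergence between two univariate Gaussians, by numerical integration."""
    spread = 6 * max(np.sqrt(var1), np.sqrt(var2))
    xs = np.linspace(min(mu1, mu2) - spread, max(mu1, mu2) + spread, grid_size)
    p = norm.pdf(xs, mu1, np.sqrt(var1))
    q = norm.pdf(xs, mu2, np.sqrt(var2))
    m = 0.5 * (p + q)
    kl_pm = np.trapz(p * np.log((p + 1e-12) / (m + 1e-12)), xs)
    kl_qm = np.trapz(q * np.log((q + 1e-12) / (m + 1e-12)), xs)
    return 0.5 * (kl_pm + kl_qm)

# Compare the fitted "discard" and "keep" Gaussians per feature
# (clf.var_ is named clf.sigma_ in older scikit-learn releases).
for j in range(clf.theta_.shape[1]):
    print(j, js_divergence(clf.theta_[0, j], clf.var_[0, j],
                           clf.theta_[1, j], clf.var_[1, j]))
```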
[Figure 1: classification scores (precision, recall, f1, roughly between 0.6 and 1.0) and the proportion of unique bags, plotted against N from 0 to 2500.]

Figure 1. Performance of classification and proportion of unique bags in the dataset by N, the number of tokens whose document frequencies were included in the feature vector. The vertical line indicates the N used for further tests.

divergence basically shows how different the prototypical "keep" comment is compared to the prototypical "discard" comment based on a specific feature. Tables 2 and 1 show this divergence.

Because the WordNet similarity feature t_ws seems rather influential, it might be interesting to see how it would perform on its own, even discarding the document frequencies (N = 0).

Wordnet Similarity t_ws only
                    Prediction
                 Discard    Keep
Truth

no. As shown in section ?? the difference between all features and text only features classification is nonexistent, while there does exist a difference between reddit only and text only feature classification. Perhaps most interesting is that the WordNet similarity feature on its own performs nearly as well as all features combined.

The other way of figuring out the relative importance of the features, suggested in section 3.4, would be to look at the inner workings of the fitted Gaussian naive Bayes classifier. Consider tables 2 and 1.

On the reddit-dependent side the flair similarity fea-
a comment a quality one would expect from scientific explanations. Other indicators of well formatted and well sourced comments, the features c_cp, c_p/t_n and c_url/t_n, seem also to be of import.

6. Conclusion

So it appears that it is quite possible to use text-mining to distinguish between informative & relevant comments and the rest. And that while reddit meta-data is quite useful, it is not at all necessary for classification. To the point where even a single text-based feature, the WordNet similarity measure, performs better than all the reddit meta-data features combined.

The one real issue with these conclusions is that the amount of data used is quite small, mostly because annotating the data manually is quite time consuming. Initially using reddit comment karma as annotation was considered, but the distribution of said karma is horribly skewed, which led to issues. The vast majority of comments will never have had any feedback on them; they would have had no upvotes or downvotes, resulting in a karma of 1.

Also a lot more data was gathered than was actually used for this paper. Not just in raw quantity (228 different AMAs were automatically scraped from reddit) but also in terms of quality. Neither the breakdown of karma by subreddit nor the markdown formatting was used. And the first would probably reveal a lot about the user, as a sort of fingerprint of their interests on reddit.

It might also be interesting to do this feature analysis using a classifier that does not make the independence assumption naive Bayes does.

Or it may be interesting to look into the usefulness of different semantics based features, like word2vec or any of the other WordNet based distance measures, seeing as the one real stand-out is the WordNet based similarity measure, which used on its own yields a performance nearly as good as all features combined, and could be interpreted as a kind of "on topic"-ness feature.

References

Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., & Spyropoulos, C. D. (2000). An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 24–28.

Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. Proceedings of LREC.

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60).

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Praw-dev (2016). PRAW: The Python Reddit API Wrapper.

Reddit (2016). Reddit FAQ.

Siersdorfer, S., Chelaru, S., & Augusta, V. (2010). How useful are your comments?: Analyzing and predicting YouTube comments and comment ratings. Proceedings of the 19th International Conference on World Wide Web, 891–900.

Weimer, M., Gurevych, I., & Mühlhäuser, M. (2007). Automatically assessing the post quality in online discussions on software. Proceedings of the ACL, 125–128.
Locally versus Globally Trained Word Embeddings
for Automatic Thesaurus Construction in the Legal Domain
A downside of thesauri for query expansion is that their creation and maintenance takes a lot of time, is labor intensive and is prone to human error (Lauser et al., 2008). An alternative is to automatically construct a thesaurus, by extracting word similarities directly from the documents in the collection. A common approach for this is to use word embeddings. A word embedding is a mapping of a word to a low dimensional vector of real numbers. These embeddings can be used to calculate the distance between two words (for example using cosine similarity), which serves as a quantitative representation of the similarity between the words (Roy et al., 2016).

The interest in word embeddings has been refueled by the introduction of a new word embedding technique by Mikolov et al. (2013), called Word2Vec. Word2Vec uses a neural network to calculate the vector representations of words, by predicting the surrounding words of a given word. The advantages of Word2Vec are that it is easily accessible through the Word2Vec software package¹ and less computationally expensive than other word embedding techniques (such as Latent Dirichlet Allocation), which can get computationally very expensive with large data sets (Mikolov et al., 2013).

Word2Vec struggles though with generalization (Diaz et al., 2016), because the word embeddings are trained on the whole vocabulary. This effect could be augmented in collections with legal documents, since different fields of law require different interpretations and as a result might use words differently. The goal of this paper is to study if this effect can be mitigated by training Word2Vec separately for each legal field. This is done in a two stage process: firstly this study aims to confirm that a thesaurus trained on the entire collection differs from a thesaurus trained on separate legal fields. Then this paper tries to answer whether the locally trained Word2Vec embeddings create a better thesaurus than globally trained embeddings, in the context of legal documents. The contribution of this paper is limited to the detection of related words and does not address the assignment of thesaurus relationships to these words.

¹ https://code.google.com/archive/p/word2vec/

2. Related Work

Over the years a broad range of possible query expansion techniques has been studied, ranging from adding clustered terms (Minker et al., 1972) to state-of-the-art methods that use relevance models, such as RM3 (Abdul-Jaleel et al., 2004). The usage of thesauri for query expansion has shown mixed results, which is highlighted by Bhogal et al. (2007) in their review paper. Hersh et al. (2000), for example, saw a general decline in performance in their study assessing query expansion using the UMLS Metathesaurus, while in some specific cases their query expansion method actually showed improvement. IJzereef et al. (2005) observed consistent significant improvement when applying thesaurus query expansion to biomedical retrieval. A unique aspect about their approach was that they tried to take into account how terms from a thesaurus could benefit retrieval, by reverse engineering the role a thesaurus can play to improve retrieval. Tudhope et al. (2006) took a less technical approach and looked at possible difficulties for users when using thesauri for query expansion.

Studies related to word embeddings for query expansion can be divided into two categories. The first category are studies in which word embedding techniques were directly used for query expansion. Roy et al. (2016) for example used similar terms based on Word2Vec embeddings directly for query expansion. Although their study showed increased performance on general purpose search tasks, the method failed to achieve a comparable performance with state-of-the-art query expansion methods. Diaz et al. (2016) used Word2Vec in a similar way, but showed the importance of locally training Word2Vec on relevant documents to overcome the generalization problem. In their study locally training the word embeddings significantly improved the performance for query expansion tasks, compared to globally trained embeddings. The second category uses word embeddings to automatically construct thesauri. Navigli and Ponzetto (2012) for example used a combination of word embeddings from WordNet and Wikipedia to construct a cross-lingual, general purpose thesaurus. Claveau and Kijak (2016b) also used WordNet (in combination with Moby) to construct a thesaurus, but used a different approach to find related terms for the thesaurus. Instead of using cosine similarity measures directly to link terms to relevant terms, they formed documents from clusters of similar terms based on their word embeddings. Building the thesaurus was then done by finding the most relevant document for every term.

3. Method

This section describes the preprocessing of the data and the construction of the globally and locally trained thesauri from the data. A visual summary of this process is given in Figure 1.
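Throughout the method, relatedness between two terms is measured by the cosine similarity of their embedding vectors, as introduced above. A minimal helper for reference (gensim computes this internally when ranking similar terms):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; 1.0 means the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```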
[Figure 1: visual summary of the thesaurus construction process, steps 1 to 6, from the XML document collection to the trained thesauri.]

Collection size per jurisdiction:
Total: 40 523 303
Administrative Law: 19 651 290
Civil Law: 12 643 019
Criminal Law: 8 223 217
International Public Law: 5 777
Table 1. Description of the five types of relations of the justitiethesaurus and their number of occurrences.

3.4. Training

Since a significant number of terms in the reference thesaurus were phrases, the phrases model⁵ of the gensim module was used on all of the sentences in the collection, to learn common bigram and trigram phrases. The phrases were trained using the default settings and only taking into account phrases that occurred at least five times.

The Word2Vec embeddings were then trained on unigrams and the previously identified phrases, using the skip-gram implementation of gensim's Word2Vec model⁶. Training was done locally for the three remaining jurisdictions and globally on the entire text collection. The neural network consisted of 100 hidden nodes and used a maximum distance of five words. In other words, the context of a word was defined as the five words before and the five words after it. All terms that did not occur at least ten times were discarded.

⁵ https://radimrehurek.com/gensim/models/phrases.html
⁶ https://radimrehurek.com/gensim/models/word2vec.html

After training the models, only the terms that occurred in both the reference thesaurus as well as the trained models were selected. This was required in order to compare the thesauri with the reference thesaurus. The term counts of the four models, before and after reduction, are shown in Table 3. Not all terms of the reference thesaurus were retrieved by the models. Many of the terms in the reference thesaurus that were not identified by the Word2Vec models were phrases of two or more words (despite extracting commonly used phrases from the texts). This means that these words and phrases did not actually occur (often enough) in the training data.

Table 3. The vocabulary size (the term count before reduction) and the term count in the reference thesaurus for all four models.

Model Name         | Vocabulary Size | Count in Reference
Global             | 559 032         | 3585
Administrative Law | 302 444         | 3147
Civil Law          | 267 466         | 2989
Criminal Law       | 168 664         | 2627

Finally, the thesauri were constructed from the Word2Vec models by taking the ten most similar terms for each term, based on the cosine similarity of the embeddings.

3.5. Combining Jurisdiction Thesauri

The final step before comparing global and local thesauri was to combine the models for each jurisdiction into a single local thesaurus. This was done by concatenating the most similar terms for each term in the jurisdiction models, and then ordering the similar terms based on their cosine similarity scores. If the same term showed up as a similar term in multiple jurisdiction models, the conflict was solved using two methods: the maximum method used the highest similarity score as the score for the combined thesaurus. The second method used the average similarity score as the new score.

After combining the jurisdiction models, the local thesauri covered 3526 terms, compared to the 3585 terms covered by the global thesaurus. This discrepancy was due to terms being discarded in the jurisdiction models for not occurring more than ten times, while they did occur more than ten times globally. The terms that were only present in the global thesaurus were ignored for the comparison between the thesauri.
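A condensed sketch of the training step of section 3.4 and the combination step of section 3.5 is given below. It assumes `sentences` holds the tokenized sentences of one jurisdiction (or of the whole collection) and follows the stated parameters (skip-gram, 100 dimensions, window of five words, minimum count of ten, phrases occurring at least five times). The keyword arguments follow the current gensim API, which differs slightly from the 2017 release (`vector_size` was then called `size`), and `combine_max` is an illustrative helper rather than the authors' code.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Learn frequent bigram and trigram phrases (at least five occurrences).
bigram = Phrases(sentences, min_count=5)
trigram = Phrases(bigram[sentences], min_count=5)
phrased = [trigram[bigram[s]] for s in sentences]

# Skip-gram Word2Vec: 100 dimensions, context window of 5, terms below 10 occurrences dropped.
model = Word2Vec(phrased, sg=1, vector_size=100, window=5, min_count=10)

# One thesaurus entry: the ten most similar terms by cosine similarity
# (the term must be present in the trained vocabulary).
entry = model.wv.most_similar("opzegging", topn=10)

def combine_max(entries_per_jurisdiction, topn=10):
    """Merge per-jurisdiction (term, score) lists, keeping the maximum score per term."""
    best = {}
    for entries in entries_per_jurisdiction:
        for term, score in entries:
            best[term] = max(score, best.get(term, score))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:topn]
```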
4. Results

4.1. Similarity of Thesauri

Before analyzing the performance of the two types of thesauri, it was important to determine whether the globally and locally trained thesauri actually differed

[Figure 2: two rows of histograms ("Maximum Method" and "Average Method"), each showing the frequency of Kendall's τ and Spearman's ρ values between -1.0 and 1.0, split into insignificant and significant correlations.]

Figure 2. Similarities between the global thesaurus and both types of local thesauri expressed in the Kendall's τ and Spearman's ρ. The correlation coefficients are binned in twenty bins of size 0.1. The blue bars denote the rankings for which there was no significant correlation, with a significance level α = 0.05. The green bars denote the significant correlations.
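The comparison summarized in Figure 2 can be sketched as follows: for each shared term, the ranking of its most similar terms in the global thesaurus is compared with the ranking in a combined local thesaurus using Kendall's τ and Spearman's ρ. The helper below is an illustration with SciPy rather than the authors' code, and the handling of terms that appear in only one of the two rankings (placed at the bottom) is an assumption.

```python
from scipy.stats import kendalltau, spearmanr

def ranking_correlation(global_terms, local_terms):
    """Rank correlations between two ordered lists of similar terms."""
    union = sorted(set(global_terms) | set(local_terms))
    def ranks(ordered):
        return [ordered.index(t) if t in ordered else len(ordered) for t in union]
    r_g, r_l = ranks(global_terms), ranks(local_terms)
    tau, tau_p = kendalltau(r_g, r_l)
    rho, rho_p = spearmanr(r_g, r_l)
    return tau, tau_p, rho, rho_p   # p-values tested against alpha = 0.05 in the paper
```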
reference thesaurus. The globally trained thesaurus performing better could simply mean that it was most similar to the reference thesaurus, without actually being the better thesaurus. This also applies to the general performance of the trained thesauri. Even though they showed a big discrepancy with the ground truth thesaurus, that does not have to mean that the trained thesauri perform poorly when used for query expansion.

This is especially relevant for the comparisons in this paper, since the ground truth thesaurus and the trained thesauri will naturally consist of different terms. The trained thesauri are namely purely based on the terms in the document space, whereas the manually constructed thesaurus is based on terms from concept lists and dictionaries, which might not actually appear as such in the collection. This was also reflected by the fact that the vocabularies of the trained thesauri did not contain all of the terms of the ground truth thesaurus.

Table 6 illustrates the problem of this discrepancy when using a reference thesaurus for evaluation. Although the automatically constructed thesauri make some clear mistakes, often in the form of linking to words with very similar usage (e.g. nationalities) or linking to words from very specific cases (e.g. linking namaak to merk Colt), they also contain related terms that are not found in the reference thesaurus (e.g. groepsactie and collectieve actie). In the latter case, the constructed thesauri are thus unfairly punished in the evaluation.
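The MAP@k and r-precision scores used to judge the trained thesauri against the reference thesaurus can be computed with a small helper such as the one below (an illustrative sketch, not the authors' evaluation script); `thesaurus` maps a term to its ranked list of similar terms and `reference` maps a term to the set of related terms in the justitiethesaurus.

```python
def average_precision_at_k(retrieved, relevant, k):
    """Average precision of the top-k retrieved terms against a set of relevant terms."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, term in enumerate(retrieved[:k], start=1):
        if term in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)

def map_at_k(thesaurus, reference, k):
    """Mean average precision at k over all terms shared with the reference thesaurus."""
    shared = [t for t in thesaurus if t in reference]
    return sum(average_precision_at_k(thesaurus[t], reference[t], k) for t in shared) / len(shared)
```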
Given these limitations, a better approach would be to evaluate the thesauri directly on a query expansion task. Direct evaluation does not only remove the ambiguity of the performance evaluation, it also allows the thesauri to be compared to other query expansion techniques. For a more complete overview of direct and indirect evaluation of thesauri, see the paper written by Claveau and Kijak (2016a). Direct evaluation was unfortunately not possible for this experiment, because no relevance data or query logs were available.

Table 5. An overview of the types of relations correctly retrieved by the three constructed thesauri, expressed in percentages of the total number of retrieved relations. Here k = 1 means that only the first element of the most similar terms is taken into account, k = 5 means only the first five elements, etcetera.

[Figure 3: mean cosine similarity (roughly 0.60 to 0.80) against the rank (1 to 10) of the most similar terms, for the Global, Administrative Law, Civil Law and Criminal Law models.]

Figure 3. The mean cosine similarity scores for each rank of the most similar term rankings of the trained models. The error bars denote the standard deviation.

5.2. Data Imbalance

After splitting the training data into multiple jurisdictions, the data was not equally balanced between the different jurisdictions (see Tables 2 and 3). A jurisdiction with less data might result in lower cosine similarity scores for that jurisdiction, since there is less text available to reinforce the context patterns of the words. This way the imbalance in the data could cause jurisdictions with less data to be unfairly underrepresented in the local thesauri.

The possible correlation between similarity scores and the size of the training data is partially supported by Figure 3. The figure shows that the similarity scores for criminal law are on average the lowest for every position in the rankings of most similar terms, whereas the model based on all of the data consistently has the highest similarity scores. Surprisingly, the model trained on civil law actually has significantly higher scores than the other two local models, even though it only had the second most training data. Further research is required to confirm whether data imbalance affects the similarity scores and to explore methods to then compensate for the imbalance.

5.3. Further Research

For this experiment, the performance of the trained thesauri was evaluated for the one, five and ten most similar terms. Although the results showed the highest MAP@k scores for k = 1, more focused research has to be done to gain insight into the ideal number of similar terms that have to be used for the trained thesauri. This number will most likely differ between contexts in which the thesauri are used and as such be evaluated separately for specific contexts.

Since this paper strictly focused on comparing the performance of globally and locally trained thesauri using a ground truth thesaurus, it did not touch on the actual construction of thesauri from the models. For the
Table 6. Some examples of differences between the reference thesaurus and the automatically constructed thesauri. English translations of the Dutch terms are given between the parentheses.

Marokkanen (Moroccans): allochtonen (immigrants) | Antillianen (Antilleans) | Turken (Turks)
Benelux: Comité van Ministers (Committee of Ministers) | BVIE (Benelux Convention Intellectual Property) | woord-beeldmerk (word-figurative mark)
XTC: ecstasy, MDMA | heroïne (heroin) | GHB
nabootsing (imitation): namaak (counterfeit), imitatie (imitation) | merk Colt (authentic Colt) | replica (replica)
groepsactie (group action): class action | collectieve actie (collective action) | collectieve actie (collective action)
anonieme melding (anonymous report): Meld Misdaad Anoniem | anonieme tip (anonymous tip) | anonieme tip (anonymous tip)
opzegging (notice): duurovereenkomst (fixed-term agreement) | beëindiging (termination) | ontbinding (termination)
natuurbeheer (nature management): jacht (hunt), milieubeheer (environmental management) | subsidieregeling agrarisch (agricultural subsidy) | landschapbeheer (landscape management)
pesten (bullying): school en criminaliteit (school and crime) | uitschelden (calling names) | stalken (stalking)
knowhow: industriële knowhow (industrial knowhow) | know how | know-how
comparison in this paper, it sufficed to select only the terms that were shared by the trained thesauri and the reference thesaurus. In practice though, selecting appropriate terms from the models is a crucial part of forming a thesaurus. A possible selection technique would be to use part-of-speech tagging to only select noun phrases. The models could also be compared to models trained on general text, as a way to only select terms that are specific to the legal domain.

Another challenging aspect of automatic thesaurus construction is automatically annotating the relations between terms. For this, more research is required to take the trained thesauri and identify these relations.

Finally, the techniques described in this paper could also be tested in different domains, to gain insight into whether the results carry over.

6. Conclusion

This study, first of all, set out to confirm that a thesaurus trained on the entire collection differs from a thesaurus trained on separate legal fields. The results showed a significant difference between the globally and locally trained Word2Vec embeddings in the automatic construction of thesauri. This difference can be attributed to the fact that relevant, but context specific, uses of terms might not be captured by the neural network, because they get overshadowed in the grand scheme. This generalization effect was also mentioned in previous papers by Diaz et al. (2016) and Roy et al. (2016).

As a follow up, this paper set out to answer whether local Word2Vec models created a better thesaurus than global models. This however proved not to be the case, unlike the initial expectation that the generalization effect would result in a poorer performance of the global thesaurus. The globally trained thesaurus actually outperformed both locally trained thesauri in the experiment.

Two methods to resolve conflicts in the concatenation of the local models were explored. The experiment showed that taking the maximum cosine similarity score consistently outperformed the average similarity score.

It is hard to draw definite conclusions from the experiments though, since all three thesauri performed poorly. The thesauri showed very low MAP@k and r-precision scores when retrieving relevant terms from
26
Local versus Global Word Embeddings in Automatic Thesaurus Construction
the ground truth thesaurus, which means that the Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S.,
trained thesauri retrieved a large number of irrelevant & Dean, J. (2013). Distributed representations of
terms. In other words, there was a big discrepancy be- words and phrases and their compositionality. Ad-
tween the terms that were considered relevant by the vances in Neural Information Processing Systems
reference thesaurus and the terms considered relevant (pp. 3111–3119).
by the trained thesauri.
This discrepancy was also reflected in the bias of the Minker, J., Wilson, G. A., & Zimmerman, B. H.
trained thesauri in favor of synonyms, when compared (1972). An evaluation of query expansion by the
to the reference thesaurus. This bias stems from the addition of clustered terms for a document retrieval
assumption of Word2Vec, that related terms are used system. Information Storage and Retrieval, 8, 329–
in similar contexts. Synonyms namely have a natural 348.
tendency to occur more often in similar contexts than Navigli, R., & Ponzetto, S. P. (2012). Babelnet: The
broader, narrower or otherwise related terms. automatic construction, evaluation and application
of a wide-coverage multilingual semantic network.
References Artificial Intelligence, 193, 217–250.
Abdul-Jaleel, N., Allan, J., Croft, W. B., Diaz, F., Roy, D., Paul, D., Mitra, M., & Garain, U. (2016).
Larkey, L., Li, X., Smucker, M. D., & Wade, C. Using word embeddings for automatic query expan-
(2004). Umass at trec 2004: Novelty and hard. On- sion. Neu-IR ’16 SIGIR Workshop on Neural Infor-
line Proceedings of the 2004 Text Retrieval Confer- mation Retrieval.
ence.
Tudhope, D., Binding, C., Blocks, D., & Cunliffe, D.
Bhogal, J., Macfarlane, A., & Smith, P. (2007). A (2006). Query expansion via conceptual distance in
review of ontology based query expansion. Informa- thesaurus indexed collections. Journal of Documen-
tion processing & management, 43, 866–886. tation, 62, 509–533.
Claveau, V., & Kijak, E. (2016a). Direct vs. indirect van Netburg, C. J., & van der Weijde, S. Y. (2015).
evaluation of distributional thesauri. Proceedings of Justitiethesaurus 2015.
the International Conference on Computational Lin-
guistics, COLING.
Claveau, V., & Kijak, E. (2016b). Distributional the-
sauri for information retrieval and vice versa. Lan-
guage and Resource Conference, LREC.
Diaz, F., Mitra, B., & Craswell, N. (2016). Query ex-
pansion with locally-trained word embeddings. ACL
’16.
Hersh, W., Price, S., & Donohoe, L. (2000). Assess-
ing thesaurus-based query expansion using the umls
metathesaurus. Proceedings of the AMIA Sympo-
sium (pp. 344–348).
IJzereef, L., Kamps, J., & De Rijke, M. (2005).
Biomedical retrieval: How can a thesaurus help?
OTM Confederated International Conferences” On
the Move to Meaningful Internet Systems” (pp.
1432–1448).
Lauser, B., Johannsen, G., Caracciolo, C., van Hage,
W. R., Keizer, J., & Mayr, P. (2008). Comparing hu-
man and automatic thesaurus mapping approaches
in the agricultural domain. Metadata for Semantic
and Social Applications: Proceedings of the Inter-
national Conference on Dublin Core and Metadata
Applications (pp. 43–53).
27
Identifying writing tasks using sequences of keystrokes
tion between pauses, bursts, and revisions were analyzed. Using these features, text production could be distinguished from revisions. Revision bursts were shorter than new text production bursts. In another writing task, keystroke data from 44 students during a 10-minute essay was collected to determine emotional states (Bixler & D'Mello, 2013). Four feature sets were used: total time, keystroke verbosity (number of keys and backspaces), keystroke latency, and number of pauses (categorized by length). All feature sets combined could classify boredom versus engagement with an accuracy of 87%. Keystroke data have also been analyzed in programming tasks, to determine performance. Thomas et al. (2005) analyzed keystroke data from 38 experienced programmers and 141 novices in a programming task. Keystroke latencies and key types were found to be related to performance. Key latencies (within and before or after a word) were found to be negatively correlated with performance. Additionally, it was found that experienced programmers used more browsing keys and were faster in pressing those.

These studies show that keystrokes do not only differ due to writer-specific characteristics (which is used in authentication and identification), but also because of differences in revisions and text production, emotional states, and level of experience. Whereas the differences in writer-specific properties may be due to physical differences and differences in typing style, the differences in writing properties are expected to come from differences in the cognitive processes required. Indeed, keystroke duration and keystroke latencies are often seen as an indicator of cognitive load (Leijten & Van Waes, 2013). As different tasks lead to differences in cognitive load, we may find these differences using different writing tasks. However, existing studies do not compare differences in keystrokes between tasks. Therefore, in the current study, the writing processes in two different tasks are compared: writing a free-form text versus a fixed text (copying a text). Here we assume that writing a free-form text requires a different cognitive load than writing a fixed text, resulting in differences in the keystroke data.

Having knowledge of the cognitive load while producing a text may provide useful information, for example, for teachers. Currently, teachers often only have access to the final writing product for evaluation purposes. This does not provide insight in what students did during the writing process. Insight in students' writing behavior or cognitive load during an assignment may trigger the teacher to further investigate this behavior and adapt the task or provide personalized feedback on the writing process.

To identify properties of keystrokes that indicate the cognitive load of the writing process, an open dataset is used, which has been used for writer identification. In a previous study, it was already shown that keystroke data differed between free-form and fixed text (Tappert et al., 2009). However, these differences were not made explicit nor evaluated. Therefore, in the current study, we analyze which features differ within the keystrokes of free-form versus fixed text using three different feature sets. As an evaluation, the differences found between fixed and free-form text are used to classify text as being either fixed or free-form text. This is done using all possible combinations of the different feature groups, to determine which feature group is most useful for the classification. At the same time, since we are not interested in the writer-specific information, the properties should not allow for an accurate identification of the actual writer.

2. Method

2.1. Data

Data used in the current experiments has been taken from the Villani keystroke dataset (Tappert et al., 2009; Monaco et al., 2012). The Villani keystroke dataset consists of keystroke data collected from 142 participants in an experimental setting. Participants were free to choose to copy a fable, a fixed text of 652 characters, or to type a free-form text, an email of at least 650 characters. Participants could copy the fable multiple times and could also type multiple free-form texts. Since typing the texts was not mandatory, not all participants typed both a free-form text and a fixed text. In total, 36 participants typed both at least one fixed text and one free-form text, resulting in keystroke data of 338 fixed texts and 416 free-form texts. The other 106 participants only wrote either free-form or fixed texts, resulting in a further 21 fixed texts and 808 free-form texts. The keystroke data consisted of timestamps for each key press and key release and the corresponding key code. More information about the dataset and the collection of the dataset can be found in Tappert et al. (2009). In this research, we only use the data of participants who created both text types.

2.2. Data processing

First, for all keystrokes, the type of key was derived: letter, number, browse key (e.g., LEFT, HOME), punctuation key, correction key (BACKSPACE, DELETE), and other (e.g., F12). The time between a key release and the subsequent key press (keystroke latency or key pause time) was calculated. Thereafter, the type of pause between the keys was derived using the key types.
Six pause types were identified: pause
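To make the preprocessing of Section 2.2 concrete, the sketch below derives key types and keystroke latencies from raw (key, press time, release time) events. It is only an illustration under assumed key names and event format, not the authors' implementation.

```python
# Illustrative sketch of key-type and latency derivation (assumed event format).
BROWSE_KEYS = {"LEFT", "RIGHT", "UP", "DOWN", "HOME", "END", "PAGE_UP", "PAGE_DOWN"}
CORRECTION_KEYS = {"BACKSPACE", "DELETE"}
PUNCTUATION = set(".,;:!?'\"-()")

def key_type(key):
    """Map a key name to one of the six key-type categories used in the paper."""
    if len(key) == 1 and key.isalpha():
        return "letter"
    if len(key) == 1 and key.isdigit():
        return "number"
    if key in BROWSE_KEYS:
        return "browse"
    if key in CORRECTION_KEYS:
        return "correction"
    if key in PUNCTUATION:
        return "punctuation"
    return "other"

def latencies(events):
    """Keystroke latency: time between a key release and the next key press.
    `events` is a list of (key, press_time, release_time) ordered by press time."""
    result = []
    for prev, nxt in zip(events, events[1:]):
        result.append((key_type(prev[0]), key_type(nxt[0]), nxt[1] - prev[2]))
    return result
```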
Table 1. Descriptive statistics and paired t-tests of features in fixed and free-form text (N = 36).
*=p < .05, **=p < .01, ***=p < .001.
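The comparison summarized in Table 1 is a per-feature paired t-test over the 36 writers who produced both text types. A minimal sketch, assuming the per-writer feature values have already been extracted into arrays (the dictionaries below are placeholders, not the authors' data structures):

```python
# Illustrative sketch: paired t-tests per feature between fixed and free-form texts.
from scipy.stats import ttest_rel

def compare_features(fixed, free):
    """`fixed` and `free` map feature names to arrays of per-writer values (same order)."""
    results = {}
    for name in fixed:
        t, p = ttest_rel(fixed[name], free[name])
        results[name] = (t, p)
    return results
```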
κ corrects for random guessing, by comparing the observed accuracy with the expected accuracy (chance):

κ = (observed accuracy − expected accuracy) / (1 − expected accuracy)

Additionally, a one-way ANOVA with a Tukey post-hoc test was used to determine whether the models differed significantly in accuracy.

Lastly, since we focus on the writing process, the learned models should preferably not be able to classify personal writer-specific characteristics. Thus, the learned model should perform really badly when classifying writers. Therefore, as an additional evaluation, support vector machines were trained to classify the writers. The best σ and cost values from the models classifying fixed versus free-form text were used. Again, the average accuracy and κs were calculated using the same folds in 10-fold cross-validation.
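As an illustration of this evaluation setup, the sketch below computes 10-fold cross-validated accuracy for an RBF-kernel SVM (scikit-learn's gamma and C playing the role of σ and cost) and derives κ from the observed accuracy and a chance level of 1/number-of-classes. The feature matrix and labels are placeholders; this is a sketch of the evaluation idea, not the authors' code.

```python
# Illustrative sketch: cross-validated accuracy and kappa against chance.
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def kappa(observed_accuracy, expected_accuracy):
    """kappa = (observed - expected) / (1 - expected)."""
    return (observed_accuracy - expected_accuracy) / (1.0 - expected_accuracy)

def evaluate(X, y, gamma=0.1, cost=1.0, n_classes=2):
    clf = SVC(kernel="rbf", gamma=gamma, C=cost)
    accs = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    observed = accs.mean()
    expected = 1.0 / n_classes        # e.g. 1/36 for the writer-classification task
    return observed, kappa(observed, expected)
```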
3. Features measuring cognitive load

Paired t-tests were used to determine which features differed significantly between fixed and free-form text created by the same writer. This is assumed to provide insight in which features are indicative of cognitive load. The results can be found in Table 1. Note that we use both the mean as well as the standard deviation (S.D.) within a text as features (and both types of features have their own standard deviations per document). In the table, the descriptive statistics of these features can also be found.

It was found that fixed texts consisted of significantly fewer keystrokes compared to free-form texts (703 versus 749). Although the fixed text consisted of 652 characters, the mean number of keystrokes was 703. This can partly be explained by the fact that sometimes multiple keys are needed to type one character (e.g., SHIFT + character to type a capital letter). Additionally, this can indicate that the participants made typos and fixed those, requiring BACKSPACE or DELETE keystrokes. Indeed, it was shown that corrections were made in 740 of the 750 sessions. The free-form texts contained more corrections and a higher percentage of words with at least one correction, compared to the fixed texts. Lastly, the participants were faster in typing the fixed text compared to the free-form text. All these findings provide some evidence that typing the free-form text requires a different cognitive load.

Several features were analyzed to determine where significant differences in pause duration between the text types were found. Specifically, we investigated the differences between the pauses before, after, and within words and corrections. Since the free-form and fixed texts differed in total length and time, the timing of key pauses was normalized based on the average time per key pause in that session. It was found that writers
7.3%, respectively. The model with both correction and pause time features led to the highest accuracy: 31.1% with a κ of 0.267. Although this is a reasonably low accuracy, the model clearly outperforms the model based on chance for the 36-class classification (1/36 = 2.8%). This means that some writer-specific characteristics are encoded within the feature groups that are used in these models, which in this case is unwanted.

A one-way ANOVA showed that the seven models differed significantly in accuracy (F(6, 63) = 37.43, p < .001). Interestingly, the accuracies for the corrections and words feature groups combined are significantly lower than the accuracy of all models that include pause time features. This indicates that the corrections and word length features include fewer writer-specific characteristics than the pause time features.

5. Discussion

Keystroke data include both writer-specific information and information about the writing processes. In this study, we focused on the writing processes and aimed to identify properties of keystrokes that indicate the cognitive load of the writing process. In order to do this, keystrokes of two different writing tasks were analyzed, which are assumed to differ in cognitive load: copying a text (fixed text) and writing a free-form text.

Our first analysis showed that several features extracted from the keystroke data differed significantly between the fixed and free-form texts of a writer. These findings support previous work which showed that keystrokes differ for different (types of) text entered (Gunetti & Picardi, 2005; Tappert et al., 2009). As an extension, we also identified which features differed and how these differed. When typing free-form text, the pauses before a word were longer, while the pauses within or after a word were shorter, compared to typing fixed text. This might indicate that the participants were thinking about the next word to type in the free-form text before they typed the word, while writers in the fixed text situation could immediately copy it as it was provided for them. Thus, differences in cognitive load may be identified in the pauses before words.

As an evaluation, we showed that the differences in keystroke information can be used to classify fixed and free-form text. Using a support vector machine, the key pause time features (which measure time spent between key releases and key presses) were found to lead to the highest accuracies for the text identification task. Adding the corrections and word length features did not lead to significantly higher accuracies, showing that the word length and corrections features do not add much information in addition to the pause times. When all feature groups were included, 78.1% accuracy was reached. Although this accuracy is reasonably high, it also shows there is still some room for improvement, especially considering that classifying writers, being a more complex classification problem with more classes, has shown accuracies up to 99% (Tappert et al., 2009).

Since we aimed to identify features related to the writing process, we wanted to exclude writer-specific information. In other words, the models should perform badly when classifying the writer. To test this, we tried to classify the writers with the same settings as used in the writing task classification for the support vector machine models. The lowest accuracy, while using information from the keystroke sequences, was found when using word length features only (7.3%). This corresponds to an accuracy of 68.7% on the text type classification task. The highest accuracy (31.1%) was obtained with both pause time and correction features (corresponding to 76.3% accuracy on the text type classification task, which is close to the highest accuracy on that task: 78.1%). Even though the accuracies on writer classification are higher than chance, they are much lower than the 90%–99% accuracy reached in other studies (Longi et al., 2015; Tappert et al., 2009). Thus, the feature groups that have been extracted actually contain mostly information related to the writing task and not to the writer-specific characteristics.

Interestingly, especially the corrections and word length features showed low accuracies on classifying writers. Thus, these feature groups contained little information about individual typing characteristics. Adding additional information to improve the quality of the text type classification task also increases the accuracy of the writer classification task. For example, if we add the key pause features, the accuracy of the text type task increases, but the writer identification accuracies also increase. In other words, key pause properties contain useful information for the text type classification task, but also contain information that allows for the identification of the writer, which is unwanted in this case.

There are at least three directions for future work. Firstly, future work could try to improve the accuracy on task classification, while not improving the accuracy on writer identification. Additional features or feature groups could be identified, such as bursts
Increasing the Margin in Support Vector Machines
through Hyperplane Folding
Keywords: support vector machines, geometric margin, hyperplane folding, hyperplane hinging, piecewise linear classification.
2. Related Work

Support vector machines (SVMs) originate from research conducted in statistical learning theory (Vapnik, 1995). SVMs are used in supervised learning, where a training set consisting of n-dimensional data points with known class labels is used for predicting classes in a test set consisting of n-dimensional data points with unknown classes.

Rigorous statistical bounds are given for the generalisation of hard margin SVMs (Bartlett, 1998; Shawe-Taylor et al., 1998). Moreover, statistical bounds are also given for the generalisation of soft margin SVMs and for regression (Shawe-Taylor & Cristianini, 2000).

There has been work on different types of piecewise linear classifiers based on the SVM concept. These methods split the separating hyperplane into a set of hinging hyperplanes (Wang & Sun, 2005). In (Yujian et al., 2011), the authors define an algorithm that uses hinging hyperplanes to separate nonintersecting classes with a multiconlitron, which is a combination of a number of conlitrons, where each conlitron separates "convexly separable" classes. The multiconlitron method cannot benefit directly from experience and implementations of SVMs. Conlitrons and multiconlitrons need to be constructed with new and complex techniques (Li et al., 2014). The hyperplane folding approach presented here is a direct extension of the standard (hard margin) SVM (soft margin extensions are relatively direct and discussed later). As a consequence, hyperplane folding can benefit from existing SVM experience and implementations.

A piecewise linear SVM classifier is presented in (Huang et al., 2013). That method splits the feature space into a number of polyhedra and calculates one hinging hyperplane for each such polyhedron. Some divisions of the feature space will increase the margin in hard margin SVMs. However, unless one has detailed domain knowledge, there is no way to determine which polyhedra to select to improve the margin. The authors recommend basic approaches like equidistantly dividing the input space into hypercubes or using random sizes and shapes for the polyhedra. Based on the support vectors, the hyperplane folding method splits the feature space into two parts (i.e., into two polyhedra) in each iteration. Without any domain knowledge, the method guarantees that the split results in an increase of the margin (except for very special cases). As discussed above, hyperplane folding can directly benefit from existing SVM experience and implementations, which is not the case for the method presented by Huang et al.

As stated in (Cortes & Vapnik, 1995), SVMs combine three ideas: the idea of optimal separating hyperplanes, i.e., hyperplanes with the maximal margin, the idea of so-called kernel tricks that extend the solution space to include cases that are not linearly separable, and the notion of so-called soft margins to allow for errors in the training set.

If we assume that a data point in the test set can be at most a distance x from a data point belonging to the same class in the training set, then it is clear that we can only guarantee a correct classification as long as x is smaller than the margin. As a consequence, the optimality of the separating hyperplane guarantees that we will correctly classify any data point in the test set for a maximum x. This is very similar to error correcting codes that maximize the distance between any pair of code words (Lin & Costello, 1983). The hyperplane folding approach presented here increases the margin, thus guaranteeing correct classification of test set data points for larger x.

SVMs are also connected to data compression codes. In (von Luxburg et al., 2004), the authors suggest five data compression codes that use an SVM hyperplane approach when transferring information from a sender to a receiver. The authors show that a larger margin in the SVM leads to higher data compression, and that the data compression can be improved by exploring the geometric properties of the training set, e.g., if the data points in the training set are shaped as an ellipsoid rather than a sphere. The hyperplane folding approach also uses geometric properties of the training set to improve the margin.

The kernel trick maps a data point in n dimensions to a data point in a (much) higher dimension, thus increasing the possibility to linearly separate the data points (Hofmann et al., 2008). The hyperplane folding approach presented here does some remapping of data points but it does not change the dimension of the data points.

3. Hyperplane Folding

In this section, we introduce the hyperplane folding algorithm for the 2-dimensional case. Higher dimensions will be discussed in Section 4.

Let us consider a standard SVM for a fully separable binary classification set S with a separating hyperplane (the thick blue line in Figure 1) and a margin d. If we assume that each data point is represented with (very) high resolution, the probability of having more than three support vectors is arbitrarily close to zero in the 2-dimensional case. Therefore, in the current context we only need to consider the cases with two or three support vectors.

As it was mentioned in the introduction, we consider only the case |V− ∪ V+| = 3 in the 2-dimensional scenario, because we already have a maximal margin if |V− ∪ V+| = 2. Without loss of generality we assume that |V+| = 2 and |V−| = 1, and we refer to the point in V− as the primary support vector.
As a first step in our method, we split the dataset into two parts by identifying a splitting hyperplane, which in two dimensions is the line that is normal to the hyperplane and that passes through the primary support vector (see Figure 2). When splitting into two datasets, the primary support vector is included in both parts of the dataset.

The two parts of the dataset define one SVM each (see Figure 3), producing one separating hyperplane for each part of the dataset, where both margins are normally larger than the initial margin. We assume that the two new hyperplanes intersect with an angle α.

Figure 3. The two SVMs after splitting the dataset.
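The core geometric operation in the fold (steps 9 and 10 of the algorithm listed next) is rotating the data points of one subset by the angle α around the folding point. A minimal numpy sketch of that operation, assuming the points are given as an (m, 2) array; this is an illustration, not the authors' implementation:

```python
import numpy as np

def rotate_around(points, angle, center):
    """Rotate 2-D points counter-clockwise by `angle` (radians) around `center`.
    In hyperplane folding this is applied to the subset with the largest margin
    after its primary support vector has been removed."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s],
                  [s,  c]])
    return (points - center) @ R.T + center
```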
8: Calculate the angle α between the two separating hyperplanes, and the intersection point between them, i.e. the folding point.

9: Remove the primary support vector from the subset with the largest margin.

10: Rotate the remaining data points in that subset an angle α around the folding point.

11: Merge the two subsets. The new splitting hyperplane has a larger margin than d.

12: goto step 1

Figure 4. Rotating the data points in the right part of the dataset. Red data points in the original dataset are rotated with an angle α, counter clockwise, to new locations.

Figure 5. The dataset after rotation. The new dataset has |V+| = 1 and |V−| = 2 and a larger margin than the original dataset.

4. Higher Dimensions

In this section, we discuss higher dimensions. If we assume that each point is represented with (very) high resolution, there will in general be at most n+1 support vectors in an n-dimensional space, |V−| ≥ 1 and |V+| ≥ 1.

We start by considering the case with three dimensions, i.e., n = 3. Again, we cannot do anything if we only have two support vectors, because then the starting hyperplane has maximal margin. In the case of three or four support vectors we can, however, increase the margin except in special cases. In order to simplify the discussion we assume that the separating hyperplane is parallel to the xz-plane, which can be achieved by rotating the data set.

In the case |V− ∪ V+| = 4 there are three different cases: {|V+| = 3, |V−| = 1}, or {|V+| = 2, |V−| = 2}, or {|V+| = 1, |V−| = 3}. In either case we consider a line that passes through one pair of support vectors from the same class. This line is parallel to the separating hyperplane, since all support vectors have the same distance to this hyperplane. Then we rotate the data set around the y-axis so that this line is parallel to the z-axis.

Now we disregard from the z-components of the points, i.e., consider the points as projected onto the x,y-plane. Obviously, the two support vectors from V+ will be projected on the same point, thus resulting in a situation with three support vectors in the x,y-plane; two from V− and one (merged) from V+. Having projected the support vectors on the x,y-plane, we use the same method of rotation as in the previous section, which does not affect the z-component of the points. Again we have produced a separating hyperplane with a larger margin.

For n > 3 dimensions we again cannot do anything if we only have two support vectors. In the case of 3, 4, . . . , n+1 support vectors we can, however, increase the margin, except in special cases. In order to simplify the discussion we can, by rotation, assume that the separating hyperplane is orthogonal to the x,y-plane.

If we have only three support vectors, we can directly project all data points on the x,y-plane by disregarding from the coordinates x3, . . . , xn for all points, perform the algorithm for n = 2, and then resubstitute the coordinates x3, . . . , xn for all points, similarly to the case n = 3.

Now consider |V− ∪ V+| = k for 4 ≤ k ≤ n + 1. We choose either V− or V+ which contains at least two points. We construct a line between two of the points in the set and rotate the data points so that this line becomes parallel with a base vector of dimension n, keeping the hyperplane orthogonal to the x,y-plane. We then disregard from the nth coordinate, thus projecting orthogonally the points from Rn to Rn−1 and the separating hyperplane to a hyperplane with n−2 dimensions in Rn−1. Since two support vectors in the n-dimensional space are mapped to the same point in the (n−1)-dimensional space, there are now n support vectors in the (n−1)-dimensional space. This procedure can be
repeated until we reach three support vectors, where we use the method described in the previous section.

This produces a new data set with a separating hyperplane that has a larger margin, in general. If this hyperplane is not a maximum margin surface, the procedure can be repeated on the new data set to increase the margin further until the margin is as close to M(S) as desired.

In the end we may perform the inverses of all transformations in order to regain the initial data set with a separating surface that consists of folded hyperplanes, which has a larger margin except in special cases.

Based on the discussion above, the algorithm for the general case with n dimensions is given below.

Algorithm: Hyperplane Folding for the General Case

1: Run the standard SVM algorithm on the dataset S ⊂ Rn (n > 2).
2: if k = 2 then terminate [2]
3: if the determined number of hyperplane folding iterations is reached then terminate
4: Rotate S so that all support vectors have value zero in dimensions higher than or equal to k.
5: Temporarily remove dimensions ≥ k from all points in S. If k = n + 1, no dimensions are removed. If k = n, one dimension is removed, and so on.
6: if k = 3 then goto step 12
7: Select two support vectors v1 and v2 from the same class (V+ or V−).
8: Rotate S so that the values in dimensions 1 to k − 1 are the same for v1 and v2.
9: Remove temporarily dimension k from all points in S. [3]
10: k = k − 1
11: goto step 6
12: Run the 2-dimensional algorithm presented in Section 3 for one iteration.
13: Expand the dimensions back one by one in reverse order and do the inverse rotations.
14: goto step 1

[2] k denotes the number of support vectors, i.e., k = |V− ∪ V+|.
[3] This means that v1 and v2 are now mapped to the same support vector.

Regarding the computational complexity of the general case hyperplane folding algorithm we have the following. Initially, we must run the standard SVM algorithm on the considered dataset, which implies O(max(m, n) min(m, n)²) complexity according to (Chapelle, 2007), where m is the number of instances and n is the number of dimensions (attributes). Steps 6 to 11 in the algorithm form a loop with n − 2 iterations. In each iteration we rotate the dataset. One data point rotation requires n² multiplications. This means that the computational complexity of this part is O(mn³). In step 12 we run the 2-dimensional algorithm, which has complexity O(m). In step 13 we rotate back in reverse order, which has complexity O(mn³). This means that the computational complexity for one fold operation is O(max(m, n) min(m, n)²) + O(mn³) + O(m) + O(mn³), i.e. it can be simplified to O(max(m, n) min(m, n)²) + O(mn³). If m > n, we have O(mn²) + O(mn³) = O(mn³). If n > m, we have O(m²n) + O(mn³) = O(mn³), i.e., the total computational complexity for one fold operation is O(mn³).

5. Initial Evaluation and Discussion

In order to get a better understanding of how the proposed hyperplane folding method works, we have conducted an experiment for the two-dimensional case. We implemented our algorithm in Python 2.7 using the Scikit-learn library [4], the Numpy library [5], and the Pandas library [6].

[4] http://scikit-learn.org/
[5] http://www.numpy.org/
[6] http://pandas.pydata.org/

5.1. Data

We have generated n circles with synthetic data points (n = 4, 5, 6): ⌈n/2⌉ circles from S+ and ⌊n/2⌋ circles from S−, respectively. The circles are numbered 1 to n: odd numbered circles contain points from S+, and even numbered circles contain points from S−.

5.2. Experiment

Initially, we have studied how the margin in the SVM is influenced by our hyperplane folding method. For this purpose we have generated the following data set. Each circle i (i = 1, 2, . . . , 6) is centered at (100+100i, 100+101(i mod 2)), e.g., circle number 3 is centered at (400, 201). The radius of each circle is set to 50 and 100 data points at random locations within each circle are generated. The generated data set is linearly separable, since the distance in the y-dimension between S+-circles and S−-circles is 101 and, in addition, the radius of each circle is 50.
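A sketch of the synthetic data generation described in Sections 5.1 and 5.2. Uniform sampling inside each disc is an assumption on our part; the paper only states that points are drawn at random locations within the circles.

```python
import numpy as np

def generate_circles(n_circles=6, radius=50, points_per_circle=100, seed=0):
    """Circle i (1-based) is centered at (100 + 100*i, 100 + 101*(i % 2));
    odd circles belong to S+ (label +1), even circles to S- (label -1)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(1, n_circles + 1):
        center = np.array([100 + 100 * i, 100 + 101 * (i % 2)])
        r = radius * np.sqrt(rng.uniform(size=points_per_circle))   # uniform over the disc
        theta = rng.uniform(0, 2 * np.pi, size=points_per_circle)
        pts = center + np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
        X.append(pts)
        y.append(np.full(points_per_circle, 1 if i % 2 == 1 else -1))
    return np.vstack(X), np.concatenate(y)
```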
Table 1. Margin and accuracy for the different SVMs. Each value is the average of 9 executions.
This is dataset 0. SVM 0 is obtained based on dataset 0. Then the hyperplane folding algorithm defined in Section 3 is used to create dataset i (i = 1, 2, 3) and the corresponding SVM i. Namely, dataset i (i = 1, 2, 3) is obtained by running our 2D algorithm on dataset i − 1. The corresponding SVM i (i = 1, 2, 3) is obtained based on dataset i.

We calculate the margin for each SVM i (i = 0, 1, 2, 3) (see Table 1).

The proposed method includes rotations of parts of the dataset. This means that we also need to rotate an unknown data point if we want to use the SVM for classification of unknown points. In the 2D case it is simply necessary to remember the line used for splitting the data points into two parts and the rotation angle α. In order to classify an unknown data point we first check if the data point belongs to the part that was rotated in the first iteration. In that case we rotate the unknown data point with the angle of the first rotation. We then check if the unknown data point (that now may have been rotated) is in the part that was rotated in the second iteration. In that case we rotate the unknown data point with the angle of the second rotation, and so on. When all rotations are done we use the final SVM, i.e., the SVM we get after the 2D algorithm has terminated, to classify the data point in the usual way.
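A sketch of this classification procedure, assuming each fold has been recorded as a splitting line (a point on the line plus its normal direction and which side was rotated), an angle, and a folding point. The field names and the sklearn-style classifier are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def classify(point, folds, final_svm):
    """Replay the recorded folds on an unknown 2-D point, then classify it.
    Each fold is a dict with 'split_point' and 'split_normal' (defining the
    splitting line), 'rotated_side' (+1 or -1), 'angle', and 'folding_point'."""
    p = np.asarray(point, dtype=float)
    for fold in folds:
        side = np.sign(np.dot(p - fold["split_point"], fold["split_normal"]))
        if side == fold["rotated_side"]:
            c, s = np.cos(fold["angle"]), np.sin(fold["angle"])
            R = np.array([[c, -s], [s, c]])
            p = R @ (p - fold["folding_point"]) + fold["folding_point"]
    return final_svm.predict(p.reshape(1, -1))[0]
```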
At the second phase of our evaluation, we have studied how the hyperplane folding method affects the classification accuracy of the SVM. For this purpose we have generated another data set. Each circle i (i = 1, 2, . . . , 6) is centered at (100i, 100+101(i mod 2)). Then we set the radius of each circle to 75 and generate 1000 data points at random locations within each circle. This is test set 0. It is clear that some data points in test set 0 will be misclassified by SVM 0, since the radius for each circle is now 75. Table 1 shows the accuracy when classifying test set 0 using SVM 0, e.g., it is 95.2% for the case with 4 circles.

Then the data points in test set 0 are rotated in the same way as it was done when dataset 1 was obtained from dataset 0. This generates test set 1. The accuracy when classifying test set 1 using SVM 1 can be seen in Table 1, e.g., it is 96.8% for the case with 4 circles. We then continue rotating in the same fashion in order to obtain test set i (i = 2, 3) from test set i − 1. The corresponding accuracies when classifying test set 2 (or test set 3) using SVM 2 (or SVM 3) are given in Table 1. For instance, they are 98.0% and 98.4% for the case with 4 circles, respectively.

The obtained results (see Table 1) show that our method increases the margin significantly. The effect is most visible for the case with 4 circles. When we only have four circles we will quickly get close to the largest possible margin, since we only have two circles of each class and we will in 2-3 iterations "fold away" one circle on each side of the separating hyperplane. This effect is illustrated graphically on Figures 6 and 7. Figures 6(a) and 7(a) show SVM 0 and SVM 2 for one of the nine tests for the case with four circles. Figures 6(b) and 7(b) show the corresponding test sets, i.e., test set 0 and test set 2. The angle and place of the two rotations can be clearly seen in test set 2. These figures demonstrate that already after two iterations we have folded the hyperplane so that the blue circle on the left side and the red circle on the right side are moved away from the separating hyperplane. When we have 5 or 6 circles the separating hyperplane needs to be folded more times to reach the same effect. This is the reason why the margin increases more quickly for the 4 circles case compared to the cases with more circles.

Table 1 also shows that the accuracy, i.e., the number of correctly classified data points in the test set divided by the total number of data points in the test set, also increases with our method. The reason for this is that a larger margin reduces the risk for misclassifications. The improvement in accuracy is again highest for the case with 4 circles. As it was discussed above, the reason for this is that the margin increases fastest for that case.

5.3. Discussion

In the previous subsection we have discussed an experiment for the two-dimensional case performed for initial evaluation of the proposed hyperplane folding method. In the general case with n dimensions, we could do the same thing as in the 2D case, i.e., we start by deciding if the unknown data point is in the part of the n-dimensional space that was rotated in the first iteration. After that we deter-
Figure 6. Example of an SVM and test set for the case with four circles before any hyperplane folding iteration.

Figure 7. Example of an SVM and test set for the case with four circles after two hyperplane folding iterations.
margin. However, excessive folding could probably lead to overfitting. It is clear that the time for obtaining the SVM grows with the number of folds. One could also expect that the time required to classify an unknown data point will increase with many folds (even if there could be techniques to limit this overhead). This means that there is a trade-off between a large margin on the one hand and the risk of overfitting and the execution time overhead on the other hand. The margin is increasing in the number of iterations of hyperplane folding, and the algorithm can be stopped at any point. This means that we can balance the advantages and disadvantages of hyperplane folding by selecting an appropriate number of iterations, e.g., we can simply stop when we do not want to spend more time on hyperplane folding or when the problem with overfitting becomes an issue.

We have assumed that for an n-dimensional dataset there can be at most n+1 support vectors. If we have limited resolution, there could be more than n+1 support vectors, and in such cases we need to do a small variation of the algorithm. The main idea in this case is to select the primary support vector, i.e., the support vector at which we split, so that the primary support vector is the only support vector from its class in one of the parts and so that that part also contains at least one support vector from the other class. It is clear that one can always do such a split when we have more than n+1 support vectors.

proposed hyperplane folding method on richer data.

References

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44, 525–536.

Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155–1178.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.

Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 1171–1220.

Huang, X., Mehrkanoon, S., & Suykens, J. A. (2013). Support vector machines with piecewise linear feature mapping. Neurocomputing, 117, 118–127.

Li, Y., Leng, Q., Fu, Y., & Li, H. (2014). Growing construction of conlitron and multiconlitron. Knowledge-Based Systems, 65, 12–20.

Lin, S., & Costello, D. J. (1983). Error control coding: Fundamentals and applications. Prentice-Hall.
Towards Optimizing the Public Library:
Indoor Localization in Semi-Open Spaces and Beyond
Figure 1. a) Alkmaar library ground floor map with an example topology of typical patron walking paths between mean-
ingful locations. The purple area surrounding location (1) is ”culture and literature” and the area at (2) features ”young
adult fiction”. Both areas feature rectangular bookshelves and book-tables. Other locations are the self-service machines
at (C) and (B), a staircase to the second floor (H1), a seating area (7) and the main desk at (A). The space is an open
environment but furniture does restrict possible moving patterns, and in addition there are some small rooms at the top
(e.g. behind C and B). b) Fingerprint positioning. At the current location (of the mobile phone) signals of all three APs
are received. At the location on the right this holds too, but AP2's signal is much stronger (−34) because it is closer.
lic library. In Section 3 we extensively introduce our FLib localization application based on WiFi and beacon technology. In Section 4 we additionally mention methods for book interaction and conclude with a research agenda for further research in the public library.

2. The BLIIPS project

Companies and governments are looking for ways to utilize their existing data and capture new opportunities to develop initiatives around data. In the Dutch municipality of Alkmaar, such activities are aggregated through so-called triple-helix cooperations in which (local) governments, companies and knowledge institutions collaborate (van Otterlo & Feldberg, 2016). The Alkmaar public library, partially funded by the local government, works together with the Vrije Universiteit Amsterdam on a data-oriented project called BLIIPS (van Otterlo, 2016b). Its goal is to utilize data to optimize the public library in various ways. However, whereas most data-oriented projects are about already digitalized aspects, BLIIPS targets the physical aspects of the public library and seeks to digitalize them with the use of new sensor technology. The overall goal is to gain insight into the physical behavior of patrons (i.e. library "customers") in the physical library and how to optimize services, for example book borrowing.

2.1. Public libraries: Physical vs. Digital

The main library of Alkmaar is part of the Kennemerwaard group (out of about 150 groups) with 14 locations. Nationwide [1], in 2014 more than 22 percent of all Dutch people were members of a library, more than 63 million visits were paid to a library, and almost 80 million books were borrowed. However, libraries know very little about their patrons' behavior. In fact, the only behavior visible in data are the books that were checked out. How they use the physical space, how they browse the book collection, which books are being looked at; for this no (real-time) data is available, but it could be highly relevant for managing the physical library building, its services and its collection. Libraries do have a long history of measuring, observing and evaluating, but typically through labor-intensive surveys and observational studies (see (Edwards, 2009; Allison, 2013; van Otterlo, 2016b) for pointers).

The BLIIPS project represents a first step towards the intelligent library in which this data is collected and analyzed in real time, but also in which the physical environment can provide to patrons "Google-like" services we are accustomed [2] to in the digital world. For example, if all interactions are digitalized, a smartphone could provide location-based, personalized recommendations to physical books in the patron's surrounding area, based on a user query and data about the library, the patron, and additional sources.

[1] http://www.debibliotheken.nl/de-branche/stelsel/kengetallen-bibliotheken/
[2] Related, but used in an orthogonal way, van Otterlo (2016a) uses the concept of libraryness as a metaphor to understand modern profiling and experimentation algorithms in the context of privacy and surveillance.
Figure 2. a) Four key developments in BLIIPS. b) Testing ground VU: an open office space at the 5th floor of the main
building of the Vrije Universiteit Amsterdam. The topological graph shown depicts the transition structure of the hallway
connecting all surrounding spaces of the floor. Many of the room-like structures are part of the open space, others are
separate rooms with a door.
2.2. Towards Library Experimentation

BLIIPS builds upon four interlocking developments, see Figure 2a (van Otterlo, 2016b). The first puzzle piece is digitalization: making physical interaction digital through sensors (and algorithms). The second piece connects to retail: the use of smart technology in physical stores, such as the recent Amazon Go Store [3]. The Alkmaar library has adopted a retail strategy in which the layout and collection, unlike traditional libraries with many bookshelves, are more like a store: lots of visible book covers, tables with intuitive themes and easy categorizations that deviate much from traditional classification systems. A retail view on the problem appeals to customer relations programmes, marketing concepts and so-called customer journeys. The third piece concerns advances in data science, especially developments in machine learning and the availability of tools. The fourth piece of the puzzle, experimentation and optimization, is most important for our long-term goals. The BLIIPS acronym stands for Books and Libraries: Intelligence and Interaction through Puzzle- and Skinnerboxes, in which the latter two elements denote physical devices used by psychologists in the last century for behavioral engineering. The aim of BLIIPS is to influence, or even control, behaviors and processes in the library in order to optimize particular goals, such as increasing the number of book checkouts. Such digital Skinnerboxes are becoming a real possibility due to the combined effect of data, algorithms and the ubiquity of digital interactions (van Otterlo, 2014). The library though, is a perfect environment for experimentation, unlike many other domains. As Palfrey (2015, p213) (quoting Kari Lamsa) writes: "Libraries are not so serious places. We should not be too afraid of mistakes. We are not hospitals. We cannot kill people here. We can make mistakes and nobody will die. We can try and test and try and test all the time."

3. An Indoor Localization Application

Mapping patron activity in the physical library requires at least knowing where they are. For this, we describe the design and implementation of the FLib application. Localization is a well-studied problem (Shang et al., 2015; He & Chan, 2015), but the practical details of the environment, hardware and algorithms used can deliver varying results and so far, localization is not solved (Lymberopoulos et al., 2015). First we outline requirements and then we describe an interactive localization application (Warnaar, 2017).

3.1. Indoor Localization

Whereas for outdoor localization GPS is successful, indoor localization remains a challenge. GPS works by maintaining line of sight to satellites, which is problematic inside concrete buildings. Several indoor positioning systems exist (Shang et al., 2015; He & Chan, 2015), none of which is currently considered as the standard. Sensor information such as magnetic field strength, received radio waves, and inertial information from gyroscopes and odometers can be used to determine location. Smartphones are equipped with an array of sensors; they are well suited as indoor positioning devices. Lymberopoulos et al. (2015) review the 2014 Microsoft indoor localization competition (featuring 24 academic teams): "all systems exhibited large accuracy [4] variations across different evaluation points which raises concerns about the stability/reliability of current indoor location technologies

[3] https://www.amazon.com/b?node=16008589011
[4] Typically average errors of a couple of meters.
Figure 4. a) VU testing floor: 28 × 10 grid overlay and beacon positions. Coverage shown for beacons 3, 9, 14, and 16. b) FLib: Software components overview.
Each localization technique has drawbacks when considering accuracy, cost, coverage, and complexity. None of them can be suitable to all scenarios. Based on our requirements formulated above we choose a (multi-modal) fingerprinting solution in which we use smartphones both for measuring the signal space and for patron localization. Fingerprinting is accurate enough for our purpose, does not pose any assumptions on knowledge about where signals come from nor on the modeling of the domain (e.g. sensor models), and can be employed using the existing WiFi infrastructure, which we extend with Bluetooth beacons. Other requirements (like low computational complexity and local computation) are fulfilled by the choice of (simple) algorithms with few biases and interactive visualizations on the phone, and because fingerprinting supports reuse of data. We use simple topological graphs and grid-based decompositions of space tailored to the required localization precision.

3.3. Localization by Fingerprinting

Localization by fingerprinting is a widely employed technique (He & Chan, 2015). The general principle is depicted in Figure 1b. Each black dot is a reference point: a location in the space from which all received signals together form a fingerprint. In the picture two received signal sets are depicted for two different reference points. More formally, let R be the set of reference points and A be the set of APs. We denote a sensed signal with strength s from AP a ∈ A as the tuple (a, s). Now, let f = {(a1, s1), (a2, s2), . . . , (an, sn)} be the set of all signals sensed at a particular location, called a fingerprint over the set A. A reference point can denote a point (x, y) in space (rendering R infinite), but usually is taken from a finite set of regions, grid locations (see Figure 5a) or nodes of an abstract topological graph such as in Figure 1a. A fingerprint database FDB is a set of pairs (r, f) where r ∈ R is a reference point and f a fingerprint over the set A.

In the Offline training phase we first collect data. Here a reference point r is (physically) visited to measure the signals available (f) at that location and to store (r, f) in the database. To increase the accuracy of FDB, multiple fingerprints can be taken at the same location. Systematically all reference points r ∈ R should be visited. When building prediction models the fingerprint database FDB is used to obtain a generalizable mapping M :: 2^(A×R) → R, i.e. a mapping from a set of signals (and their signal strengths) to a reference point in R. All samples (r, f) ∈ FDB represent a supervised learning problem from fingerprints (inputs) to reference points (outputs). In the Online localization phase, M is used for localization. Let the to-be-located patron be in some unknown location l in the space, and let the set of current signals be c = {(a1, s1), (a2, s2), . . . , (an, sn)}. The predicted location of l is then r = M(c).

The choice for fingerprinting naturally induces a supervised machine learning setting in which the signal landscape over the space is the desired function to learn, and where the fingerprints are samples of that function. Intuitively, this determines the balance between |R| and sample complexity (Wen et al., 2015). Fingerprinting is not prone to error drift such as often seen when using inertial sensors to determine step count and direction. Modelling signal decay over distance and through objects is also not required, as is the case for multilateration positioning. Another advantage is that the positions of APs do not need to be mapped. Disadvantages are that collecting fingerprints of the site is a tedious task (Shang et al., 2015) and that changes to the environment may require that (some) fingerprints need to be collected again.
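To make the offline/online distinction concrete, the sketch below stores fingerprints as dictionaries from AP (or beacon) identifiers to received signal strengths and instantiates M as a simple nearest-fingerprint lookup; the fixed penalty for APs missing on one side mirrors the modified Euclidean distance used later in Section 3.5.3. All identifiers, the data structures and the 1-NN choice are illustrative assumptions, not the FLib implementation.

```python
import math

# A fingerprint maps an AP/beacon identifier to a received signal strength (RSSI, dBm).
# FDB is a list of (reference point, fingerprint) pairs collected in the offline phase.
fdb = []

def add_fingerprint(r, f):
    """Offline training phase: store one (r, f) sample; r may occur multiple times."""
    fdb.append((r, f))

def distance(c, f, penalty=30.0):
    """Euclidean distance over APs seen in both fingerprints, plus a penalty
    for every AP that is present in only one of them."""
    shared = set(c) & set(f)
    d = math.sqrt(sum((c[a] - f[a]) ** 2 for a in shared))
    return d + penalty * len(set(c) ^ set(f))

def locate(c):
    """Online localization phase: M(c) as a 1-NN lookup in FDB."""
    r, _ = min(fdb, key=lambda rf: distance(c, rf[1]))
    return r

# Example: one offline sample at grid cell (3, 7), then an online query.
add_fingerprint((3, 7), {"ap:00:11:22": -52.0, "beacon:b4": -71.0})
print(locate({"ap:00:11:22": -55.0, "beacon:b4": -74.0}))
```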
3.4. Multimodal Fingerprinting: Beacons which was replaced by a more uniform grid layout as
depicted in Figures 4a and 5a. When the user fin-
One of the constraints is that the library building has
gerprints a grid location, it gets highlighted to keep
only 8 different WiFI APs. Several other APs from
track of visited areas. The Estimote Android SDK
surrounding buildings can be used but they are out-
is used to collect Bluetooth signals. WiFi RSSIs are
side our control and less reliable. In contrast, our test
collected using Android’s BroadcastReceiver. The
VU environment (see Figure 2b) has many APs in-
side. To enrich the signal landscape, we employ so-called Bluetooth low energy (BLE) beacons. A beacon is a self-powered, small device that sends out a signal with an adjustable signal strength and frequency. Beacons are a recent addition to the internet-of-things landscape (Ng & Wakenshaw, 2016) and most modern smartphones can detect them. For example, a museum can place a beacon at an art piece and when the visitor gets near the beacon, his smartphone can detect this and provide information about the object. Most work employs beacons for areas such as rooms and corridors (i.e., region-based). For example, LoCo (Cooper et al., 2016) is a fingerprinting system based on WiFi APs and BLE beacons which are mostly aligned with the room-like structure of an office. Such beacons act as noisy indicators for rooms. Such constraints are somewhat present in our VU environment, but not at the library. In a sub-project (Bulgaru, 2016) we tested region-based interpretations in the library with varying success due to the noisy nature of beacons.

Here, in our semi-open library space we opt for a more general approach: to employ beacons as extra signals for fingerprinting and to treat them similarly to WiFi signals. Beacons are configured such that signals will be received throughout large portions of the library, just like the (much stronger) WiFi APs. Using this approach roughly 10 beacons per floor are effective. Consequently, in our model, the set A of all APs is extended with all beacons {b_1, ..., b_n}.

3.5. The FLib Localization Application

In this section we describe FLib, a smartphone application for localization purposes in a real-world environment. Figure 4b shows an overview of the main (software) components of FLib. In subsequent sections we will review all parts. FLib is targeted at our testing ground at the university (see Figure 2b) and the library in Alkmaar (see Figures 1a and 5a).

3.5.1. Fingerprinting

The fingerprint database FDB is filled by smartphone measurements, see Figure 3. In FLib the current position can be selected on an interactive map, after which the fingerprint is generated with a single button tap. Initial experiments in VU and Library employed a graph-like transition structure as in Figure 1a. Estimote software uses an adaptable scanning period and a pause between scanning periods. If the first is too short, few or no beacons are detected, but if it is too long, location estimation lags behind (and: huge performance differences between smartphones exist).

3.5.2. Fingerprints Server

Measured fingerprints are uploaded to a server application (implemented in PHP using Symfony running on Apache, using a MySQL database). Fingerprints are first locally stored on the phone and then sent to the server. The server's only function is to store fingerprint data: localisation runs locally on the phone.

3.5.3. Model Training

The data on the server FDB is used for building a model mapping fingerprints to grid locations (reference points). We utilize two different machine learning algorithms: k-nearest-neighbors (k-NN) and multi-layer perceptrons (MLP), see (Flach, 2012).

The first model is a lazy learner; generalization and model building are not required, but instead FDB is loaded on the smartphone and the algorithm finds the k most similar fingerprints in FDB for the currently sensed signals. We use a modified Euclidean distance to compute a similarity metric between fingerprints. Given a fingerprint $f = \{(a^f_1, s^f_1), \ldots, (a^f_n, s^f_n)\} \in FDB$ and the currently sensed signals $c = \{(a^c_1, s^c_1), \ldots, (a^c_m, s^c_m)\}$, we compute the distance between c and f. Let $A_{fc} \subseteq A$ be the access points measured in both c and f. We compute the distance d(c, f) as follows. For all sensed APs in $A_{fc}$ we take the overall Euclidean distance between signal values. A penalty of 30 is added to the distance for each access point a ∈ A that is only in f and not in c, or only in c and not in f. This empirically estimated value balances signals and missing values.
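As an illustration, a minimal sketch of the modified Euclidean distance and the k-NN lookup described above, assuming fingerprints are stored as dictionaries mapping AP/beacon identifiers to signal strengths; the names and the data layout are ours, not the FLib implementation.

import math

PENALTY = 30  # empirically estimated penalty for an AP seen in only one of the two fingerprints

def fingerprint_distance(c, f):
    """Modified Euclidean distance between the currently sensed signals c
    and a stored fingerprint f, both given as {ap_id: signal_strength}."""
    shared = set(c) & set(f)
    missing = (set(c) | set(f)) - shared
    d_sq = sum((c[a] - f[a]) ** 2 for a in shared)
    return math.sqrt(d_sq) + PENALTY * len(missing)

def knn_reference_points(c, fdb, k=3):
    """Return the k reference points whose fingerprints are closest to c.
    fdb is a list of (reference_point, fingerprint) pairs."""
    ranked = sorted(fdb, key=lambda rf: fingerprint_distance(c, rf[1]))
    return [r for r, _ in ranked[:k]]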
Figure 5. a) Alkmaar first floor (8 x 5 grid) with beacon positions (coverage shown for 6), b) FLib localization.
Our second model is an MLP, a standard neural network with one hidden layer of neurons, an input layer with |A| neurons and an output layer with |R| neurons. Reference points are taken as classes and each class (r ∈ R) is represented by a separate neuron. For the input layer we transform a sensed fingerprint $\{(a_1, s_1), \ldots, (a_m, s_m)\}$ with m ≤ |A| to a vector of length |A| where each a ∈ A has a fixed index in this vector and each value at that index is the sensed signal strength $s_i$ (i ∈ 1...m). All other components of the input vector are 0. To construct an output vector for a fingerprint f (i.e., (r, f) ∈ FDB) we use a binary vector of length |R| with all zeros except at the index of the neuron representing r. Training an MLP amounts to computing the best set of weights in the network, which can be accomplished using gradient-descent learning.

3.5.4. Real-time Localisation

Both models can be used to generate a ranking ⟨r_1, ..., r_m⟩ of all reference points. k-NN naturally induces a ranking based on distances. MLPs, however, yield a normalized distribution over the output nodes. Instead of showing only the best prediction of location, FLib shows a more gradual visualization which highlights with shades of blue where the patron may be. To render the blue tiles as in Figure 5b, we calculate the transparency for the returned unique reference points. Let $R_{best} = ⟨r_1, r_2, ..., r_n⟩$ be the ranked locations, where some r ∈ R can occur multiple times. The first element gets score $|R_{best}|$, the second $|R_{best}| - 1$, and so on, and scores for the same r are summed. Scores are normalized and mapped onto 50...255, inducing the color value as a shade of blue.

3.6. Experiments and Outcomes

Experiments were conducted at two separate locations: part of the 5th floor of the main building at the VU university in the A-wing (vu) and the two publicly accessible floors at the Kennemerwaard Library at Alkmaar (library). The library ground floor is 55 m wide by 22 m in length, while the first floor is 54 m wide by 40 m in length. The vu testing floor is 57.4 m wide and 20.5 m in length. Estimote Proximity and BeaconInside beacons were both used. Transmission rate and power were (497 ms, −16 dBm) and (500 ms, −3 dBm) for Estimote and BeaconInside beacons respectively. Fingerprinting was done with different smartphones: OnePlus A3003 (3), LG H320 (Leon), and Huawei Y360-U61 (Y3). All access points in the vicinity are used for fingerprint collection to increase the WiFi signal space. For vu, we have 396 unique AP addresses in the fingerprint collection, compared to 165 for library. A Bluetooth scanning period of 2500 ms was used to balance delay and detection. RapidMiner was used to train MLPs (learning rate 0.3, momentum 0.2, normalized inputs) and inference in models runs on the phone.

3.6.1. Experimental Setup

First, we determine whether unlabelled walked trajectories can successfully be classified at vu. We use the graph model from Figure 2b and fill FDB with fingerprints taken at each node position. Next, we walk several trajectories such as shown in Figure 6a, and store unlabelled fingerprints of multiple locations. Using 1-NN with the modified Euclidean function, the predicted sequences of reference points are compared to the truly walked paths.

Positioning performance over the grid at vu and library (Figures 4a and 5a) is calculated by taking the mean hamming distance (H) between the n true reference points (x, y) ∈ R and predicted reference points (x', y') ∈ R:

$H(M) = \frac{1}{n} \sum_{i=1}^{n} \left( |x_i - x'_i| + |y_i - y'_i| \right)$    (1)
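A small sketch of the evaluation in Equation (1), assuming reference points are given as integer grid coordinates; the function and variable names are ours.

def mean_hamming_distance(true_points, predicted_points):
    """Mean grid distance of Equation (1): average of |x - x'| + |y - y'|
    over all n predicted reference points."""
    assert len(true_points) == len(predicted_points)
    n = len(true_points)
    return sum(abs(x - xp) + abs(y - yp)
               for (x, y), (xp, yp) in zip(true_points, predicted_points)) / n

# example: one prediction off by one tile in x, one exact
print(mean_hamming_distance([(3, 2), (5, 7)], [(4, 2), (5, 7)]))  # 0.5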
Figure 6. a) Picture from (Bulgaru, 2016). Here we employ beacons in our testing environment with exactly known beacon locations. An effective localization method is to score each grid location surrounding a detected beacon based on the detected signal strength. For example, all 9 grid locations around a detected strong beacon signal (red area) get a high value, whereas a much larger area around a detected low signal (blue) gets a much lower value. This value scheme reflects varying confidence in beacon detections based on signal strength. The final predicted location is computed from a (weighted) combination of all grid position values, and forms a practical (and effective in the testing environment) localization algorithm. b) A sample walk in the VU environment (Walk 2).
Fingerprints are collected with different phones, while fingerprints of walks were collected with a OnePlus 3 only. Differences in performance are compared using only fingerprints of the OnePlus 3, averaged fingerprints, and all fingerprints. Best performance is expected when using fingerprints from the same phone as for the walks, since there are no sensor or configuration differences. In all, 745 fingerprint records were collected for the ground floor at library, and 623 for the first floor. Averaging fingerprint data per phone per reference point was done to decrease the computational complexity of k-NN, reducing |FDB| for the first floor from 623 to just 72. Computational efficiency is important because smartphones have limited battery time, and positioning delay is reduced.

3.6.2. Results

First, we look at two vu walk example results: […]

[…] Figure 7b. Figure 7c shows results for the first floor. Ground floor tiles cover 5.5 × 4 m. For the library ground floor the best result (MLP, 50 hidden, 200 cycles) is a mean total hamming distance of 1.06: 0.65 for x and 0.41 for y, and a roughly (under)estimated error of $\sqrt{(5.5 \cdot 0.65)^2 + (4 \cdot 0.41)^2} \approx 3.92$ m. For the library first floor, the same configuration yields the best result: 0.80, with an error of x = 0.35 and y = 0.45. Each grid tile covers 6.75 × 8 m, giving an (under)estimated mean error of $\sqrt{(6.75 \cdot 0.35)^2 + (8 \cdot 0.45)^2} \approx 4.30$ m. These levels of indoor localisation performance suffice to detect a patron's region at library, and can be used for several future applications. We have seen that using k > 3, positioning performance starts degrading, so only results of {1, 2, 3}-NN are reported.
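The conversion from per-axis grid errors to an (under)estimated error in metres used above can be written as a one-liner; the tile sizes and error values below are the ones reported in the text.

import math

def grid_error_to_metres(err_x, err_y, tile_width, tile_height):
    """(Under)estimated mean positioning error in metres from per-axis
    grid errors and tile dimensions."""
    return math.sqrt((tile_width * err_x) ** 2 + (tile_height * err_y) ** 2)

# library ground floor: 5.5 x 4 m tiles, errors 0.65 (x) and 0.41 (y) -> ~3.9 m
print(round(grid_error_to_metres(0.65, 0.41, 5.5, 4.0), 2))
# library first floor: 6.75 x 8 m tiles, errors 0.35 (x) and 0.45 (y) -> ~4.3 m
print(round(grid_error_to_metres(0.35, 0.45, 6.75, 8.0), 2))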
Figure 7. Mean positioning hamming distance errors: a) at the vu (28 × 10 grid), b) using averaged fingerprints at the library ground floor, c) using averaged fingerprints at the library first floor.
and intervening can be done in real time. The advantage of sensor technology is that at some point one can relax the physical order of the library because, for example, books can be located individually, escaping the standard order of the shelf. Coming up with the right goals – together with the right hardware and algorithmic technology – that are aligned with the many functions of the public library, is most challenging. (6) Privacy. More data means more risks for privacy in general. Libraries already collect data about their patrons, but this will increase quickly. Challenges are basic data privacy and security. However, a more hidden form is intellectual privacy (see (van Otterlo, 2016a)). Personalized interventions in library services based on information about borrowing history can have transformative effects on the autonomy of a patron in thinking and deciding. Consequences of data-driven strategies in libraries are underexplored (but see (van Otterlo, 2016b)) and need more study.

5. Conclusions

In this paper we have introduced the public library as an interesting domain for innovation with artificial intelligence. In the context of project BLIIPS we have introduced the FLib localization application as a first step towards patron activity monitoring, and have briefly touched upon additional results related to book interaction. Many potential future work directions on BLIIPS and FLib exist and were outlined in the research agenda in the previous section.

Acknowledgments

The first author acknowledges support from the Amsterdam academic alliance (AAA) on data science, and we thank Stichting Leenrecht for financial support. We thank the people from the Alkmaar library for their kind support.

References

Allison, D. A. (2013). The patron-driven library: A practical guide for managing collections and services in the digital age. Chandos Inf. Prof. Series.

Baron, N. S. (2015). Words onscreen: The fate of reading in a digital world. Oxford University Press.

Bulgaru, A. (2016). Indoor localisation using bluetooth low energy beacons. Bachelor thesis, Vrije Universiteit Amsterdam, The Netherlands.

Cooper, M., Biehl, J., Filby, G., & Kratz, S. (2016). LoCo: boosting for indoor location classification combining WiFi and BLE. Personal and Ubiquitous Computing, 20, 83–96.

Edwards, B. (2009). Libraries and learning resource centres. Architectural Press (Elsevier). 2nd edition.

Flach, P. (2012). Machine learning. Cambridge University Press.

He, S., & Chan, G. (2015). Wi-Fi fingerprint-based indoor positioning: Recent advances and comparisons. IEEE Comm. Surveys & Tutorials, 18.

Jica, R. (2016). Digital interactions with physical library books. Bachelor thesis, Vrije Universiteit Amsterdam, The Netherlands.

Kriz, P., Maly, F., & Kozel, T. (2016). Improving indoor localization using bluetooth low energy beacons. Mobile Information Systems, 2016.

Licklider, J. (1965). Libraries of the future. Cambridge, Massachusetts: MIT Press.

Lymberopoulos, D., Liu, J., Yang, X., Choudhury, R. R., Handziski, V., & Sen, S. (2015). A realistic evaluation and comparison of indoor location technologies: Experiences and lessons learned. Proceedings of the 14th International Conference on Information Processing in Sensor Networks (pp. 178–189). New York, NY, USA: ACM.

Ng, I. C., & Wakenshaw, S. Y. (2016). The internet-of-things: Review and research directions. International Journal of Research in Marketing, 34, 3–21.

Palfrey, J. (2015). Bibliotech. Basic Books.

Shang, J., Hu, X., Gu, F., Wang, D., & Yu, S. (2015). Improvement schemes for indoor mobile location estimation: A survey. Math. Probl. in Engineering.

Shimosaka, M., Saisho, O., Sunakawa, T., Koyasu, H., Maeda, K., & Kawajiri, R. (2016). ZigBee based wireless indoor localization with sensor placement optimization towards practical home sensing. Advanced Robotics, 30, 315–325.

van Otterlo, M. (2014). Automated experimentation in walden 3.0. Surveillance & Society, 12, 255–272.

van Otterlo, M. (2016a). The libraryness of calculative devices. In L. Amoore and V. Piotukh (Eds.), Algorithmic life: Calculative devices in the age of big data, chapter 2, 35–54. Routledge.

van Otterlo, M. (2016b). Project BLIIPS: Making the physical public library more intelligent through artificial intelligence. Qualitative and Quantitative Methods in Libraries (QQML), 5, 287–300.

van Otterlo, M., & Feldberg, F. (2016). Van kaas naar big data: Data Science Alkmaar, het living lab van Noord-Holland noord. Bestuurskunde, 29–34.

Warnaar, M. (2017). Indoor localisation on smartphones using WiFi and bluetooth beacon signal strength. Master thesis, Vrije Universiteit Amsterdam.

Wen, Y., Tian, X., Wang, X., & Lu, S. (2015). Fundamental limits of RSS fingerprinting based indoor localization. IEEE Conference on Computer Communications (INFOCOM) (pp. 2479–2487).

Zheng, Y., Capra, L., Wolfson, O., & Yang, H. (2014). Urban computing: Concepts, methodologies, and applications. ACM Trans. Intell. Syst. Technol., 5, 38:1–38:55.
Constraint-based measure for estimating overlap in clustering
Abstract

Different clustering algorithms have different strengths and weaknesses. Given a dataset and a clustering task, it is up to the user to choose the most suitable clustering algorithm. In this paper, we study to what extent this choice can be supported by a measure of overlap among clusters. We propose a concrete, efficiently computable constraint-based measure. We show that the measure is indeed informative: on the basis of this measure alone, one can make better decisions about which clustering algorithm to use. However, when combined with other features of the input dataset, such as dimensionality, it seems that the proposed measure does not provide useful additional information.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

1. Introduction

For many types of machine learning tasks, such as supervised learning, clustering, and so on, a variety of methods is available. It is often difficult to say in advance which method will work best in a particular case; this depends on properties of the dataset, the target function, and the quality criteria one is interested in. The research field called meta-learning is concerned with devising automatic ways of determining the most suitable algorithm and parameter settings, given a particular dataset and possibly knowledge about the target function. Traditionally, meta-learning has mostly been studied in a classification setting. In this paper, however, we focus on clustering.

Clustering algorithms are no exception to the general rule that different learning algorithms make different assumptions about the input data and the target function to be approximated. For instance, some clustering algorithms implicitly assume that clusters are spherical; k-means is an example of that. Any clustering algorithm that tries to minimise the sum of squared Euclidean distances inside the clusters implicitly makes that assumption. The assumption can be relaxed by rescaling the different dimensions or using a Mahalanobis distance; this can lead to elliptic clusters, but such clusters are still convex.

A different class of clustering algorithms does not assume convexity, but looks at local properties of the dataset, such as density of points or graph connectivity. Such methods can identify, for instance, moon-shaped clusters, which k-means cannot. Spectral clustering (von Luxburg, 2007) is an example of such an approach. Some clustering algorithms assume that the data have been sampled from a population that consists of a mix of different subpopulations, e.g., a mixture of Gaussians. EM is an example of such an approach (Dempster et al., 1977). A particular property of these approaches is that clusters may overlap. That is, even though each individual instance still belongs to one cluster, there are areas in the instance space where two (or more) Gaussian density functions substantially differ from zero, so that instances of both clusters may end up in this area.

In this paper, we hypothesise that the amount to which clusters may overlap is relevant for the choice of what clustering method to use. A measure, the Rvalue, has been proposed before that, given the ground truth regarding which instance belongs to which cluster, describes this overlap. Since clustering is unsupervised, this measure cannot be used in practice for deciding what clustering method to use. We therefore derive a new measure, CBO, which is based on must-link or cannot-link constraints on instance pairs. We show that the second measure correlates well with the first, making it a suitable proxy for selecting the clustering
method. We show that this measure is indeed informative: on the basis of this measure alone, it is possible to select clustering algorithms such that, on average, better clusterings are obtained.

However, there are also negative results. Datasets can be described using other features than the measure defined here. It turns out that, when a dataset is described using a relatively small set of straightforward features (such as dimensionality), it is also possible to make an informed choice about what clustering method to use. What's more, if this set of straightforward features is extended with the overlap measure described here, this does not significantly improve the informativeness of the dataset description, in terms of which clustering method is optimal.

The conclusion from this is that, although the proposed measure is by itself an interesting feature, it seems to capture mostly information that is also contained in other, simpler features. This is a somewhat surprising result for which we currently have no explanation; further research is warranted.

This paper is the continuation of a previously published workshop paper (Adam & Blockeel, 2015). While following the same ideas, the CBO has been completely redefined. In addition, the number of datasets considered was increased from 14 to 42. While the correlation of the CBO with the overlapping has improved considerably, the promising results for the algorithm selection of that paper were somewhat reduced by adding those datasets.

The remainder of this paper is structured as follows. Section 2 discusses some related work on constraint-based clustering and meta-learning for clustering. Section 3 studies how the overlapping of clusters influences the performance of algorithms. Section 4 introduces CBO, which is intended to approximate the amount of overlap from constraints. Section 5 presents experimental results that compare algorithm selection based on CBO with algorithm selection using other features of the dataset. Section 6 presents our conclusions.

…provided to the clustering algorithm to guide the search towards a more desirable solution. We then talk about constraint-based, constrained, or semi-supervised clustering.

Constraints can be defined on different levels. On a cluster level, one can ask for clusters that are balanced in size, or that have a maximum diameter in space. On an instance level, one might know some partial labelling of the data. A well-used type of constraints are must-link and cannot-link constraints, also called equivalence constraints. These are pair-wise constraints which state that two instances must be or cannot be in the same cluster.

Multiple methods have been developed to use these constraints, some of which are mentioned below. A metric can be learnt that complies with the constraints (Bar-Hillel et al., 2005). The constraints can be used in the algorithm for the cluster assignment in a hard (Wagstaff et al., 2001) or soft way (Pelleg & Baras, 2007), (Ruiz et al., 2007), (Wang & Davidson, 2010). Some hybrid algorithms use constraints for both metric learning and clustering (Bilenko et al., 2004), (Hu et al., 2013). Other approaches include constraints in general solver methods like constraint programming (Duong et al., 2015) or integer linear programming (Babaki et al., 2014).

2.2. Algorithm selection for clustering

Little research has been conducted on algorithm selection for clustering. Existing methods usually predict the ranking of clustering algorithms (De Souto et al., 2008), (Soares et al., 2009), (Prudêncio et al., 2011), (Ferrari & de Castro, 2015). The meta-features used are unsupervised and/or domain-specific. None of these approaches use constraints, which removes the specificity that there is not only one single clustering for one dataset. To the best of our knowledge, the only meta-learning method for clustering involving constraints is (Van Craenendonck & Blockeel, 2016), which does not use features but simply selects the algorithm that satisfies the most constraints.
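For illustration, a minimal sketch of the selection rule used in (Van Craenendonck & Blockeel, 2016) as described above: count how many must-link and cannot-link constraints each candidate clustering satisfies and keep the best one. The representation of constraints as index pairs is our assumption.

def constraints_satisfied(labels, must_link, cannot_link):
    """Count satisfied pairwise constraints for a clustering given as a list
    of cluster labels; must_link and cannot_link are lists of index pairs."""
    ok = sum(labels[i] == labels[j] for i, j in must_link)
    ok += sum(labels[i] != labels[j] for i, j in cannot_link)
    return ok

def select_clustering(candidates, must_link, cannot_link):
    """Pick, among candidate clusterings (e.g. produced by different
    algorithms), the one that satisfies the most constraints."""
    return max(candidates,
               key=lambda lab: constraints_satisfied(lab, must_link, cannot_link))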
Figure 1. Toy example of the cross dataset.

The Rvalue (Oh, 2011) has been used before as a measure of overlapping. Given a dataset of instances in different classes, it quantifies the overlapping as a number between 0 and 1. To compute the Rvalue of a dataset, it considers each instance and its neighbourhood. An instance is said to be in overlap if too many of its nearest neighbours are labelled differently from it. The Rvalue of a dataset is then the proportion of instances in overlap. The Rvalue thus has two parameters: k, the number of nearest neighbours to consider, and θ, the number of nearest neighbours from a different class above which an instance is in overlap.

Table 1. Average clustering performance measured with ARI.

                 EM     SC
all              0.31   0.32
Rvalue < 0.2     0.48   0.50
Rvalue > 0.2     0.19   0.19

Table 2. Same as Table 1 for datasets where either EM or SC scored an ARI of at least 0.2.

                 EM     SC
all              0.45   0.47
Rvalue < 0.2     0.55   0.59
Rvalue > 0.2     0.33   0.31
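A small sketch of the Rvalue as described above, assuming the dataset is given as a NumPy array with one row per instance and an array of integer class labels; the parameter names follow the text (k neighbours, threshold θ).

import numpy as np

def rvalue(X, y, k=6, theta=1):
    """Fraction of instances in overlap: an instance is in overlap when more
    than theta of its k nearest neighbours carry a different class label."""
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(X)
    in_overlap = 0
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # exclude the instance itself
        neighbours = np.argsort(dists)[:k]
        if np.sum(y[neighbours] != y[i]) > theta:
            in_overlap += 1
    return in_overlap / n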
Figure 2 shows the Rvalue for some UCI datasets, which shows that overlapping occurs a lot in real-life datasets. For comparison, the cross dataset just above has an Rvalue of 0.41 for the same parameters.

Figure 2. Rvalue of some UCI datasets, k = 6, θ = 1.

To check our intuition that EM can handle overlapping better than SC, we look at the performance of these algorithms w.r.t. the Rvalue. Table 1 shows the average performance of these algorithms over some UCI datasets presented in further sections. Table 2 shows the same results but ignoring datasets where both algorithms performed badly. We assume that if both algorithms have an Adjusted Rand Index (Hubert & Arabie, 1985) (ARI) of less than 0.2, the dataset is not very suitable for clustering to begin with and we can then ignore it. A complete list of used datasets can be found in Section 4.3. It can be seen that in that second case, EM performs better than SC when there is overlapping, and vice versa when there is no or little overlapping. This difference is much reduced […]

4. Detecting overlap using constraints

While the Rvalue is a good indicator of the extent to which clusters overlap, it is not useful in practice because it requires knowledge of the clusters, which we do not have. In this section, we present an alternative measure: the Constraint-Based Overlap value (CBO). The CBO is designed to correlate well with the Rvalue, while not requiring full knowledge of the clusters.

4.1. Definition

The CBO makes use of must-link and cannot-link constraints. The idea is to identify specific configurations of ML or CL constraints that indicate overlap. The CBO uses two configurations, illustrated in figure 3:

• short CL constraints: when two points are close together and yet belong to different clusters, this is an indication that the two clusters overlap in this area;

• two parallel constraints, one of which is ML and the other CL, between points that are close. That is, if a and c are close to each other, and so are b and d, and a and b must link while c and d cannot link, then this implies overlapping, either around a and c or around b and d (see figure).

The more frequent those patterns, the more the clusters overlap. A limit case of the second configuration is when the two constraints involve the same point (e.g., a = c in the figure). Then, by propagation of […]
The question is how to define "short" or "close". This has to be relative to "typical" distances. To achieve this, we introduce a kind of relative similarity measure, as follows. Let d(x, x') be the distance between points x and x', and ε (ε') be the distance between x (x') and its k'th nearest neighbour. Then

$s(x, x') = \begin{cases} 1 - \frac{d(x, x')}{\max(\varepsilon, \varepsilon')} & \text{if } d(x, x') \le \max(\varepsilon, \varepsilon') \\ 0 & \text{otherwise} \end{cases}$

That is: s(x, x') is 1 when x and x' coincide, and linearly goes to 0, reaching 0 when d(x, x') = max(ε, ε'), that is, x is no closer to x' than its k'th nearest neighbour, and vice versa.
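A one-function sketch of the relative similarity s(x, x') defined above; the ε values (distances to the k'th nearest neighbour) are assumed to be computed elsewhere, and the function name is ours.

import numpy as np

def relative_similarity(x, xp, eps_x, eps_xp):
    """s(x, x'): 1 when the points coincide, decreasing linearly to 0 at
    d(x, x') = max(eps_x, eps_xp), and 0 for points further apart."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(xp, float))
    scale = max(eps_x, eps_xp)
    return 1.0 - d / scale if d <= scale else 0.0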
Using this relative similarity, we can assign scores to both types of configurations mentioned above.

The score of a short constraint between two points x and x' is simply

score(c) = s(x, x').

The score for a pair of parallel constraints, c1 between points x1 and x'1 and c2 between x2 and x'2, is […]

Figure 4. Scoring of a single constraint (a) and of a pair of constraints (b) using the local similarity. The circles represent the neighbourhoods of the points.

In both cases, higher scores are more indicative of overlap. To have a measure for the whole dataset, we aggregate these scores over the whole constraint set. The idea is to compare the amount of short cannot-link constraints, direct (single pattern) or by propagation (double pattern), to the total amount of short constraints, both must-link and cannot-link. With CL the set of cannot-link constraints and ML the set of must-link constraints, we define

$\mathrm{CBO} = \frac{\sum_{c \in CL} \mathrm{score}(c) + \sum_{c_1 \in CL,\, c_2 \in ML} \mathrm{score}(c_1, c_2)}{\sum_{c \in CL \cup ML} \mathrm{score}(c) + \sum_{c_1 \in ML,\, c_2 \in CL \cup ML} \mathrm{score}(c_1, c_2)}$

4.2. Stability

As one can imagine, the CBO can be very noisy for very small constraint sets. Several parameters influ[…]
Figure 5. Convergence of the CBO w.r.t. the size of the constraint set. Three datasets are considered with increasing number of instances from left to right: iris (N=150), mammographic (N=830), yeast (N=1484). For each dataset, 80 constraint sets are sampled with various sizes (around 25, 50, 75, 100, 200, 300, 400, 500). The CBO is computed for k=10 (top row), k=20 (middle row), k=10+N/20 (bottom row). The blue points correspond to the total number of constraints. The red points correspond to the number of constraints that actually participated in the measure. The Rvalue of the dataset (k=10, θ = 1) is plotted as a black horizontal line.
…from constraints, and to what extent this can lead to better clustering algorithm selection.

Acknowledgments

Research financed by the KU Leuven Research Council through project IDO/10/012.

References

Adam, A., & Blockeel, H. (2015). Dealing with overlapping clustering: a constraint-based approach to algorithm selection. Meta-learning and Algorithm Selection workshop - ECMLPKDD 2015 (pp. 43–54).

Babaki, B., Guns, T., & Nijssen, S. (2014). Constrained clustering using column generation. In Integration of AI and OR techniques in constraint programming, 438–454. Springer.

Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2005). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6, 937–965.

Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. Proceedings of the twenty-first international conference on Machine learning (p. 11).

De Souto, M. C., Prudencio, R. B., Soares, R. G., De Araujo, R. G., Costa, I. G., Ludermir, T. B., Schliep, A., et al. (2008). Ranking and selecting clustering algorithms using a meta-learning approach. Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on (pp. 3729–3735).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (methodological), 1–38.

Duong, K.-C., Vrain, C., et al. (2015). Constrained clustering by constraint programming. Artificial Intelligence.

Ferrari, D. G., & de Castro, L. N. (2015). Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences, 301, 181–194.

Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.

Oh, S. (2011). A new dataset evaluation method based on category overlap. Computers in Biology and Medicine, 41, 115–122.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Pelleg, D., & Baras, D. (2007). K-means with large and noisy constraint sets. In Machine learning: ECML 2007, 674–682. Springer.

Prudêncio, R. B., De Souto, M. C., & Ludermir, T. B. (2011). Selecting machine learning algorithms using the ranking meta-learning approach. In Meta-learning in computational intelligence, 225–243. Springer.

Ruiz, C., Spiliopoulou, M., & Menasalvas, E. (2007). C-DBSCAN: Density-based clustering with constraints. In Rough sets, fuzzy sets, data mining and granular computing, 216–223. Springer.

Soares, R. G., Ludermir, T. B., & De Carvalho, F. A. (2009). An analysis of meta-learning techniques for ranking clustering algorithms applied to artificial data. In Artificial neural networks – ICANN 2009, 131–140. Springer.

Van Craenendonck, T., & Blockeel, H. (2016). Constraint-based clustering selection. arXiv preprint arXiv:1609.07272.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17, 395–416.

Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. ICML (pp. 577–584).

Wang, X., & Davidson, I. (2010). Flexible constrained spectral clustering. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 563–572).
Conference Track
Extended Abstracts
A Probabilistic Modeling Approach to Hearing Loss Compensation
Keywords: Hearing Aids, Hearing Loss Compensation, Probabilistic Modeling, Bayesian Inference
3. Signal Processing and Fitting as Probabilistic Inference

4. Inference Execution through Message Passing

Equations (5) and (6) are very difficult to compute directly. We have developed a software toolbox to automate these inference problems by message passing. Crucially, our algorithms for signal processing and fitting can be automatically inferred from a given model plus in-situ collected patient appraisals. Therefore, in contrast to existing design methods, this approach allows for hearing aid personalization by a patient without need for human design experts in the loop.
References
Dauwels, J. (2007). On Variational Message Passing
on Factor Graphs. IEEE International Symposium
on Information Theory (pp. 2546–2550).
Forney, G.D., J. (2001). Codes on graphs: normal re-
alizations. IEEE Transactions on Information The-
ory, 47, 520–548.
An In-situ Trainable Gesture Classifier
2. Probabilistic modeling approach

Under the probabilistic modeling approach, both learning and recognition are problems of probabilistic inference in the same generative model. This generative model is a joint probability distribution that specifies the relations among all (hidden and observed) variables in the model.

Let y = (y_1, ..., y_T) be a time series of measurements corresponding to a single gesture with underlying characteristics θ. The characteristics are unique to gestures of type (class) k. We can capture these dependencies […]

$p(\theta \mid D, k) = \frac{p(D, \theta, k)}{\int p(D, \theta, k)\, d\theta}. \qquad (2)$

In the second step, the parameter distribution is learned for a specific gesture class, using the previously learned p(θ|D, k) and a set of measurements D_k with the same class k:

$p(\theta \mid D, D_k, k) = \frac{p(D_k \mid \theta)\, p(\theta \mid D, k)\, p(k)}{\int p(D_k, \theta, k \mid D)\, d\theta}. \qquad (3)$

In practice, exact evaluation of Eq. 2 and Eq. 3 is intractable for our model due to the integral in the denominator. We use variational Bayesian inference to approximate this distribution (MacKay, 1997), which
results in a set of update equations that need to be iterated until convergence.

During recognition, the task of the algorithm is to identify the gesture class with the highest probability of having generated the measurement y. This is expressed by

$p(k \mid y) = \frac{\int p(y, \theta, k)\, d\theta}{\sum_k \int p(y, \theta, k)\, d\theta}. \qquad (4)$

3. Experimental validation

We built a gesture database using a Myo sensor bracelet (ThalmicLabs, 2016), which is worn just below the elbow (see Fig. 1). The Myo's inertial measurement unit measures the orientation of the bracelet. This orientation signal is sampled at 6.7 Hz, converted into the direction of the arm, and quantized using 6 quantization directions. The database contains 17 different gesture classes, each performed 20 times by the same user. The duration of the measurements was fixed to 3 seconds.

As a measure of performance, we use the recognition rate defined as:

$\text{Recognition rate} = \frac{\#\ \text{correctly classified}}{\text{total}\ \#\ \text{of samples}}. \qquad (5)$

The gesture database is split in a training set containing 5 samples of every gesture class, and a test set containing the remaining (15 × 17 =) 255 samples. The recognition rate is evaluated on models trained on 1 through 5 examples. To minimize the influence of the training order, the results are averaged over 5 different permutations of the training set.

To compare our algorithm, we have also evaluated the recognition rate of the same algorithm with uninformative prior distributions and of a 1-Nearest Neighbor (1-NN) algorithm using the same protocol.

Figure 2. Recognition rates of the 1-NN algorithm, the proposed algorithm without prior information (HMM), and the proposed algorithm with informed prior distributions (HMM prior).

There are multiple ways to incorporate these results in a practical gesture recognition system. For example, the prior distribution can be constructed by the developers of the algorithm. Another possibility is to allow users to provide prior distributions themselves. This means that the system will take longer to set up, but when a user wants to learn a specific gesture under in-situ conditions, it will require fewer training examples.
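As an illustration of Equations (4) and (5), a small sketch that normalises per-class (approximate) log evidence into a posterior over gesture classes and computes the recognition rate; this is not the authors' toolbox, and the function names are ours.

import numpy as np

def class_posterior(log_evidence):
    """p(k | y) of Eq. (4) from per-class approximate log marginal likelihoods
    log ∫ p(y, θ, k) dθ, computed via a numerically stable softmax."""
    log_evidence = np.asarray(log_evidence, float)
    w = log_evidence - log_evidence.max()
    p = np.exp(w)
    return p / p.sum()

def recognition_rate(predictions, targets):
    """Eq. (5): fraction of correctly classified gestures."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)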
References
Beal, M. J. (2003). Variational algorithms for approx-
imate Bayesian inference. Doctoral dissertation,
University College London.
Horsley, D. (2016). Wave hello to the next interface.
IEEE Spectrum, 53, 46–51.
Text mining to detect indications of fraud in annual reports
worldwide
…Language Toolkit (NLTK) for Python to identify sentences and words (Bird & Klein, 2009). HTML tags in forms 10-K and 20-F are removed using the Python package 'BeautifulSoup'.

We develop a baseline model comprising word unigrams. To obtain an informative set of word unigrams we exclude stop words and stem the words using the Porter stemmer in NLTK (Bird & Klein, 2009). Words that appear only in one MD&A section in the entire data set are not informative. Therefore these words will not be used as features. Furthermore, we apply 'term frequency-inverse document frequency' (TF-IDF) as a normalization step of the word counts to take into account the length of the text and the commonality of the word in the entire data set (Manning & Schütze, 1999). Finally, the chi-squared method is applied to select the most informative features. We start with the top 1,000 and increase the number of features in steps of 1,000 until 24,000 to find the optimal number of features.

The Naïve Bayes classifier (NB) and Support Vector Machine (SVM) have been proven successful in text classification tasks in several domains (Cecchini et al., 2010; Conway et al., 2009; Glancy & Yadav, 2011; Goel et al., 2010; He & Veldkamp, 2012; Joachims, 1998; Manning & Schütze, 1999; Metsis et al., 2006; Purda & Skillicorn, 2015). Therefore, this research uses these two types of machine learning approaches to develop a baseline text mining model. Using 10-fold stratified cross validation on 70% of the data, the data is split into train and test sets. The remaining 30% is saved for the best performing model in the development phase.

A word unigrams approach is a limited way of looking at texts because it omits a part of the textual information, such as the grammar. Therefore, we extend the baseline model with linguistic feature categories to determine whether other types of textual information may improve the results of the baseline model. The first category consists of descriptive features; this includes the number of words and the number of sentences in the text. The second category of features represents the complexity of a text. Examples of these features are the average sentence length and the percentage of long words. The third group of features captures the grammatical information, such as the percentage of verbs, nouns and several types of personal pronouns. The fourth category assesses the readability of the text by using readability scores, including the 'Flesch Reading Ease Score'. The fifth category measures psychological processes such as positive and negative sentiment words. Finally, we include word bigrams and grammatical relations between two words, extracted with the Stanford parser, as features to the model (De Marneffe et al., 2006).

4. Results

Figure 1 shows the accuracy of the NB and SVM baseline models. For both types of models the optimal number of features is around 10,000 unigrams. With an accuracy of 89% the NB model outperforms the SVM that achieves an accuracy of 85%.

Figure 1. Performance of the Naïve Bayes and Support Vector Machine models.

The linguistic features of the descriptive, complexity, grammatical, readability and psychological process categories did not improve the result of the baseline models. The performance on the test set of the SVM model increased to 90% by adding the most informative bigrams. The addition of the relation features did not further increase the performance.

5. Discussion and conclusion

The results show that it is possible to use text mining techniques to detect indications of fraud in the management discussion and analysis section of annual reports of companies worldwide. The word unigrams capture the majority of the subtle information that differentiates fraudulent from non-fraudulent annual reports. The additional information that the linguistic information provides is very limited, and only attributable to the bigrams. Additional research may address the effects of the random 10-fold splitting process, the effects of multiple authors on linguistic features of a text, and the possibilities of an ensemble of machine learning algorithms for detecting fraud in annual reports worldwide.
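A minimal scikit-learn sketch of the unigram baseline described above (TF-IDF weighting, chi-squared feature selection, Naïve Bayes, 10-fold cross-validation); stemming and the exact preprocessing are omitted, and the parameter values are illustrative rather than the authors' settings.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def baseline_scores(texts, labels, k_features=10000):
    """texts: list of MD&A sections, labels: 1 = fraudulent, 0 = non-fraudulent."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),  # unigrams, TF-IDF weighted
        ("chi2", SelectKBest(chi2, k=k_features)),         # keep most informative unigrams
        ("nb", MultinomialNB()),
    ])
    # 10-fold stratified cross-validated accuracy of the Naive Bayes baseline
    return cross_val_score(pipeline, texts, labels, cv=10, scoring="accuracy")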
Do you trust your multiple instance learning classifier?
visually assess the results. But what can we do if this […] previous work (Cheplygina et al., 2015), is to invent unsupervised patch-level evaluation measures which do not need any patch labels. We reasoned that, if a classifier is finding the true patch labels, it should find similar patch labels, even if we change the classifier slightly. If the classifier is finding different patch labels every time, we probably don't want to trust it. By changing the classifier slightly and evaluating the stability of the patch labels, we get a different sense of how well the classifier is doing.

[Figure: AUC as a function of S+ for the Musk 1, Musk 2, Breast, Messidor, COPD validation and COPD test datasets, comparing simpleNM, simpleSVM, simple1NN, milBoost, mi1NN, MILES, miNM and miSVM.]
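A sketch of the stability idea described above: compare the patch (instance) labels produced by two slightly perturbed classifiers and report their agreement; this is a simplified stand-in for the measures studied in the abstract, not the exact definition.

import numpy as np

def label_stability(instance_labels_a, instance_labels_b):
    """Agreement between the patch labels assigned by two slightly different
    classifiers (e.g. trained on different 90% subsamples of the bags);
    1.0 means identical labelings."""
    a = np.asarray(instance_labels_a)
    b = np.asarray(instance_labels_b)
    return float(np.mean(a == b))

print(label_stability([1, 0, 0, 1, 1], [1, 0, 1, 1, 1]))  # 0.8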
A Gaussian process mixture prior for hearing loss modeling
References
Bisgaard, N., Vlaming, M. S. M. G., & Dahlquist, M.
(2010). Standard audiograms for the IEC 60118-15
measurement procedure. Trends in amplification,
14, 113–120.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996).
Active learning with statistical models. Journal of
artificial intelligence research, 4, 129–145.
Gardner, J., Malkomes, G., Garnett, R., Weinberger,
K. Q., Barbour, D., & Cunningham, J. P. (2015a).
Bayesian active model selection with an application
to automated audiometry. Advances in Neural In-
formation Processing Systems (pp. 2386–2394).
Gardner, J. R., Song, X., Weinberger, K. Q., Barbour,
D. L., & Cunningham, J. P. (2015b). Psychophysical
Detection Testing with Bayesian Active Learning.
UAI (pp. 286–295).
Predicting chaotic time series using a photonic reservoir computer
with output feedback
Keywords: Reservoir computing, opto-electronic systems, FPGA, chaotic time series prediction
[Figure: schematic of the experimental setup (SLD, MZ, Att, Spool, 90/10 splitter, Mask Mi, ADC/DAC, Amp, Comb, Matlab/Clock control), with input u(n), reservoir states xi(n), readout weights wi and output y(n).]

                        Prediction length
                       N = 100      N = 600
experimental           125 ± 14     344 ± 64
numerical (noisy)      120 ± 32     361 ± 87
numerical (noiseless)  121 ± 38     637 ± 252
References
Antonik, P., Duport, F., Hermans, M., Smerieri, A.,
Haelterman, M., & Massar, S. (2016a). Online train-
ing of an opto-electronic reservoir computer applied
to real-time channel equalization. IEEE Transac-
tions on Neural Networks and Learning Systems,
PP, 1–13.
Towards high-performance analogue readout layers for photonic
reservoir computers
[Figure: "Analogue readout" schematic — SLD, MZ modulators, Att, 1.6 km spool, 50/50 couplers, ADC, DAC, Amp, Comb, FPGA, capacitor C; signals Pr, Pb, Pf, reservoir states xi(n), readout weights wi(n) and output y(n).]

…et al., 2012; Duport et al., 2016). To that end, we considered several potential experimental imperfections and measured their impact on the performance.

• The time constant τ = RC of the RC filter determines its integration period. We've shown that both tasks work well in a wide range of values of τ, and knowledge of its precise value is not necessary for good performance (contrary to (Duport et al., 2016)).
Antonik, P., Duport, F., Hermans, M., Smerieri, A., Haelterman, M., & Massar, S. (2016). Online training of an opto-electronic reservoir computer applied to real-time channel equalization. IEEE Transactions on Neural Networks and Learning Systems, PP, 1–13.

Vinckier, Q., Bouwens, A., Haelterman, M., & Massar, S. (2016). Autonomous all-photonic processor based on reservoir computing paradigm (p. SF1F.1). Optical Society of America.
Local Process Models: Pattern Mining with Process Models
Niek Tax, Natalia Sidorova, Wil M.P. van der Aalst {n.tax,n.sidorova,w.m.p.v.d.aalst}@tue.nl
Eindhoven University of Technology, The Netherlands
Keywords: pattern mining, process mining, business process modeling, data mining
1. Introduction

Process mining aims to extract novel insights from event data (van der Aalst, 2016). Process discovery plays a prominent role in process mining. The goal is to discover a process model that is representative for the set of event sequences in terms of start-to-end behavior, i.e. from the start of a case till its termination. Many process discovery algorithms have been proposed and applied to a variety of real life cases. A more conventional perspective on discovering insights from event sequences can be found in the areas of sequential pattern mining (Agrawal & Srikant, 1995) and episode mining (Mannila et al., 1997), which focus on finding frequent patterns, not aiming for descriptions of the full event sequences from start to end.

Sequential pattern mining is limited to the discovery of sequential orderings of events, while process discovery methods aim to discover a larger set of event relations, including sequential orderings, (exclusive) choice relations, concurrency, and loops, represented in process models such as Petri nets (Reisig, 2012), BPMN (Object Management Group, 2011), or UML activity diagrams. Process models distinguish themselves from more traditional sequence mining approaches like Hidden Markov Models (Rabiner, 1989) and Recurrent Neural Networks with their visual representation, which allows them to be used for communication between process stakeholders. However, process discovery is normally limited to the discovery of a complete model that captures the full behavior of process instances, and not local patterns within instances. Local Process Models (LPMs) allow the mining of patterns positioned in-between simple patterns (e.g. subsequences) and end-to-end models, focusing on a subset of the process activities and describing frequent patterns of behavior.

2. Motivating Example

Imagine a sales department where multiple sales officers perform four types of activities: (A) register a call for bids, (B) investigate a call for bids from the business perspective, (C) investigate a call for bids from the legal perspective, and (D) decide on participation in the call for bid. The event sequences (Figure 1(a)) contain the activities performed by one sales officer throughout the day. The sales officer works on different calls for bids and does not necessarily perform all activities for a particular call himself. Applying discovery algorithms, like the Inductive Miner (Leemans et al., 2013), yields models allowing for any sequence of events (Figure 1(c)). Such "flower-like" models do not give any insight in typical behavioral patterns. When we apply any sequential pattern mining algorithm using a threshold of six occurrences, we obtain the seven length-three sequential patterns depicted in Figure 1(d) (results obtained using the SPMF (Fournier-Viger et al., 2014) implementation of the PrefixSpan algorithm (Pei et al., 2001)). However, the data contains a frequent non-sequential pattern where a sales officer first performs A, followed by B and C in arbitrary order (Figure 1(b)). This pattern cannot be found with existing process discovery or sequential pattern mining techniques. The two numbers shown in the transitions (i.e., rectangles) represent (1) the number of events of this type in the event log that fit this local process model and (2) the total number of events of this type in the event log. For example, 13 out of 19 events of type C in the event log fit transition C, which are indicated in bold in the log in Figure 1(a). Underlined sequences indicate non-continuous instances, i.e. instances with non-fitting events in-between the events forming the instance of the local process model.

3. LPM Discovery Approach

A technique for the discovery of Local Process Models (LPMs) is described in detail in (Tax et al., 2016a). LPM discovery uses the process tree (Buijs et al., 2012) process model notation, an example of which is SEQ(A, B), which is a sequential pattern that describes that activity B occurs after activity A.
Figure 1. (a) A log L of event sequences executed by a sales officer with highlighted instances of the frequent pattern.
(b) The local process model showing frequent behavior in L. (c) The Petri net discovered on L with the Inductive Miner
algorithm (Leemans et al., 2013). (d) The sequential patterns discovered on L with PrefixSpan (Pei et al., 2001).
Process tree models are iteratively expanded into larger patterns using a fixed set of ex[…] SEQ(A, AND(B, C)), which indicates that A is followed by […] the Petri net of Figure 1(b). LPMs are discovered using the following steps:

1) Generation Generate the initial set CM1 of can[…]

[…]panded, candidate process models, CMi+1. Go to step 2 using the newly created candidate set CMi+1.

4. Faster LPM Discovery by Clustering Activities

The discovery of Local Process Models (LPMs) is computationally expensive for event logs with many unique activities (i.e. event types), as the number of ways to expand each candidate LPM is equal to the number of possible process model structures with which it can be expanded times the number of activities in the log. (Tax et al., 2016b) explores techniques to cluster the set of activities, such that LPM discovery can be applied per activity cluster instead of on the complete set of events, leading to considerable speedups. All clustering techniques operate on a directly-follows graph, which shows how frequently the activity types of the event log directly follow each other. Three clustering techniques have been compared: entropy-based clustering clusters the activities of the directly-follows graph using an information theoretic approach. Maximal relative information gain clustering is a variant on entropy-based clustering. The third clustering technique uses Markov clustering (van Dongen, 2008), an out-of-the-box graph clustering technique, to cluster the activities in the directly-follows graph.

We compare the quality of the obtained ranking of LPMs after clustering the activities with the ranking of LPMs obtained on the original data set. To compare the rankings we use NDCG, an evaluation measure for rankings frequently used in the information retrieval field. Figure 2 shows the results of the three clustering approaches on five data sets. All three produce better than random projections on a variety of data sets. Projection discovery based on Markov clustering leads to the highest speedup, while higher quality LPMs can be discovered using a projection discovery based on log statistics entropy. The Maximal Relative Information Gain based approach to projection discovery shows unstable performance, with the highest gain in LPM quality over random projections on some event logs, while not being able to discover any projection smaller than the complete set of activities on some other event logs.

Figure 2. Performance of the three projection set discovery methods on the six data sets on the four metrics (Recall, NDCG@5, NDCG@10, NDCG@20).
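A small NDCG@k sketch of the ranking comparison described above, where each LPM in the obtained ranking is assigned a relevance score derived from the reference ranking; the scoring scheme here is illustrative, not the exact one used in (Tax et al., 2016b).

import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the obtained ranking normalised by the ideal DCG."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# relevance of the LPMs found on the clustered log, in the order they are ranked
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))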
Tax, N., Sidorova, N., Haakma, R., & van der Aalst,
W. M. P. (2016a). Mining local process models.
Journal of Innovation in Digital Ecosystems, 3, 183–
196.
A non-linear Granger causality approach for understanding
climate-vegetation dynamics
Keywords: time series forecasting, random forests, non-linear Granger causality, climate change
Figure 1. Linear versus non-linear Granger causality of climate on vegetation. (a) Explained variance (R2 ) of vegetation
anomalies based on a full ridge regression model in which all climatic variables are included as predictors. (b) Improvement
in terms of R2 by the full ridge regression model with respect to the baseline ridge regression model that uses only past
values of vegetation anomalies as predictors; positive values indicate (linear) Granger causality. (c) Explained variance
(R2 ) of vegetation anomalies based on a full random forest. (d) Improvement in terms of R2 by the full random forest
model with respect to the baseline random forest model; positive values indicate (non-linear) Granger causality.
better forecasts, the null hypothesis of Granger non-causality can be rejected (Granger, 1969).

This abstract, based on (Papagiannopoulou et al., 2016), presents an extension of linear Granger causality analysis, a novel non-linear framework for finding climatic drivers that affect vegetation. Our framework consists of several steps. In a first step, data from different sources are collected and merged into a single, comprehensive dataset. Next, time series decomposition techniques are applied to the target vegetation time series and the various predictor climatic time series to isolate seasonal cycles, trends and anomalies. In a third step, we explore various techniques for constructing high-level features from climatic time series using techniques that are similar to shapelets (Ye & Keogh, 2009). In a final step, we run a Granger causality analysis on the vegetation anomalies, while replacing traditional linear vector autoregressive models with random forests.

Applying the above framework, we end up with 4,571 features generated on thirty-year time series, allowing to analyze 13,097 land pixels independently. Predictive performance is assessed by means of five-fold cross-validation using the out-of-sample coefficient of determination (R2) as a performance measure.

Figure 1a shows the predictive performance of a ridge regression model which includes the 4,571 climate predictors on top of the history of vegetation (i.e., a full model). While the model explains more than 40% of the variability in vegetation in some regions (R2 > 0.4), this is by itself not necessarily indicative of climate Granger-causing the vegetation anomalies. In order to test the latter, we compare the results of the full model (Fig. 1a) to a baseline model, i.e., an autoregressive ridge regression model that only uses previous values of vegetation to predict the vegetation at time t. Any increase in predictive performance provided by the full ridge regression model (Fig. 1a) over the corresponding baseline provides qualitative evidence of Granger causality (Fig. 1b). The results show that, when only linear relationships between vegetation and climate are considered, the areas in which Granger causality of climate towards vegetation is suggested are limited. The predictive power for vegetation anomalies increases dramatically when using random forests (Fig. 1c). In order to test whether the climatic and environmental controls Granger-cause the vegetation anomalies, we again compare the results of a full random forest model to a baseline random forest model. As seen in Fig. 1d, the improvement over the baseline is unambiguous. One can conclude that, while not bearing into consideration all potential control variables in our analysis, climate dynamics indeed Granger-cause vegetation anomalies in most of the continental land surface. Moreover, the improved capacity of random forests over ridge regression to predict vegetation anomalies suggests that these relationships are non-linear.
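A per-pixel sketch of the non-linear Granger test described above: compare the five-fold out-of-sample R² of a baseline random forest (vegetation history only) with that of a full model (history plus climate features); a positive difference is read as qualitative evidence for Granger causality. The variable and function names are ours.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def granger_improvement(veg_history, climate_features, target, seed=0):
    """Return R2_full - R2_baseline for one pixel."""
    baseline = cross_val_score(RandomForestRegressor(random_state=seed),
                               veg_history, target, cv=5, scoring="r2").mean()
    full_X = np.hstack([veg_history, climate_features])
    full = cross_val_score(RandomForestRegressor(random_state=seed),
                           full_X, target, cv=5, scoring="r2").mean()
    return full - baseline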
References
Bonan, G. (2008). Forests and climate change: forc-
ings, feedbacks, and the climate benefits of forests.
science, 320, 1444–1449.
Granger, C. W. (1969). Investigating causal relations
by econometric models and cross-spectral methods.
Econometrica: Journal of the Econometric Society,
424–438.
McPherson, R. A., Fiebrich, C. A., Crawford, K. C.,
Kilby, J. R., Grimsley, D. L., Martinez, J. E.,
Basara, J. B., Illston, B. G., Morris, D. A., Kloe-
sel, K. A., et al. (2007). Statewide monitoring of
the mesoscale environment: A technical update on
the Oklahoma Mesonet. 24, 301–321.
Papagiannopoulou, C., Miralles, D. G., Decubber, S.,
Demuzere, M., Verhoest, N. E. C., Dorigo, W. A., &
Waegeman, W. (2016). A non-linear Granger causal-
ity framework to investigate climate–vegetation dy-
namics. Geoscientific Model Development Discus-
sions, 2016, 1–24.
Su, L., Jia, W., Hou, C., & Lei, Y. (2011). Microbial
biosensors: a review. Biosensors and Bioelectronics,
26, 1788–1799.
Characterizing Resting Brain Activity to Predict the Amplitude of
Pain-Evoked Potentials in the Human Insula
Keywords: Pain, nociception, intracerebral recordings, feature extraction, time series prediction.
Abstract

How the perception of pain emerges from human brain activity remains largely unknown. Apart from inter-individual variations, this perception depends not only on the physical characteristics of the painful stimuli, but also on other psycho-physiological aspects. Indeed a painful stimulus applied to an individual can sometimes evoke very distinct sensations from one trial to the other. Hence the state of a subject receiving such a stimulus should (at least partly) explain the intensity of pain elicited by that stimulus. Using intracranial electroencephalography (iEEG) from the insula to measure this cortical "state", our goal is to study to which extent ongoing brain activity in the human insula, an area thought to play a key role in pain perception, may predict the magnitude of pain-evoked potentials and, more importantly, whether it may predict the perception intensity. To this aim, we summarize the ongoing insular activity by defining frequency-dependent features, derived using continuous wavelet and Fourier transforms. We then take advantage of this description to predict the amplitude of the insular responses elicited by painful (heat) and non-painful (auditory, visual and vibrotactile) stimuli, as well as to predict the intensity of perception.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

The ability to perceive pain is crucial for survival, as exemplified by the injuries and reduced life expectancy of people with congenital insensitivity to pain. Furthermore, pain is a major healthcare issue and its treatment, especially in the context of pathological chronic pain, constitutes a very challenging problem for physicians. Characterizing the relationship between pain perception and brain activity could provide insights on how nociceptive inputs are processed in the human brain and, ultimately, how this leads to the perception of pain (Apkarian et al., 2005; Tracey & Mantyh, 2007). It is widely accepted that perception fluctuates along time, even in a resting state. These fluctuations might result from variations in neuronal activity (Sadaghiani et al., 2010) which can be, at least partly, recorded with neuroimaging or electrophysiological monitoring techniques (VanRullen et al., 2011). Hence spontaneous brain activity, which is often considered as noise, might be related to our perception capabilities. However the potential links between perception fluctuations and the recorded brain activity are not yet fully elucidated. It has already been suggested that perception is a discrete process, namely that the cortex quickly oscillates between different, relatively short-lasting levels of excitability (VanRullen & Koch, 2003). This excitability can for instance be measured by electroencephalography (EEG).

Supporting the aforementioned hypothesis, several studies have already established links between ongoing brain activity measured before the presentation of a sensory stimulus using functional magnetic resonance imaging (fMRI) or EEG, and the subsequent stimulus-evoked response, assessed either in terms of subjective perception or brain response magnitude (Mayhew et al., 2013; Monto et al., 2008). For instance, Barry et al. and Busch et al. study the effect of pre-stimulus low-frequency (between 5 and 15 Hz) phase on auditory and visual perception respectively, showing that stimulus processing can be affected by such phase (i.e. position within a cycle) at the stimulus onset (2004; 2009).
89
Characterizing Resting Brain Activity to Predict the Amplitude of Pain-Evoked Potentials
jective pain intensity generated by nociceptive stim- EEG recordings Feature extraction
150
uli from the time-frequency coefficients obtained by a
100
short-time Fourier transform (2016). They show that
Amplitude ( V)
[30, 80] Hz
50 [12, 30] Hz
pain perception depends on some pre-stimulus time- [8, 12] Hz
0
frequency features. [4, 8] Hz
[0.1, 4] Hz
-50
Original signals
In this setting, our work aims to study whether and to -100
-0.5 0 0.5 1 1.5 -0.8 -0.6 -0.4 -0.2
which extent pain perception capabilities vary along Time after stimulus (sec) Time after stimulus (sec)
References
Tu, Y., Zhang, Z., Tan, A., Peng, W., Hung, Y. S.,
Moayedi, M., Iannetti, G. D., & Hu, L. (2016). Al-
pha and gamma oscillation amplitudes synergisti-
cally predict the perception of forthcoming nocicep-
tive stimuli. Human brain mapping, 37, 501–514.
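As a rough illustration of the frequency-dependent features mentioned in the abstract, the sketch below computes band-limited power of a pre-stimulus signal window with a Fourier transform. It is not the authors' pipeline: the band edges follow the figure above, while the sampling rate and window length are assumptions made for the example.

```python
import numpy as np

BANDS = {"delta": (0.1, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 80)}

def band_power_features(signal, fs):
    """Power per frequency band of one pre-stimulus iEEG/EEG window."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return {name: power[(freqs >= lo) & (freqs < hi)].sum()
            for name, (lo, hi) in BANDS.items()}

# Example with an assumed 512 Hz sampling rate and a 0.8 s pre-stimulus window.
fs = 512
window = np.random.randn(int(0.8 * fs))
features = band_power_features(window, fs)
```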
Probabilistic Inference-based Reinforcement Learning
Reinforcement learning (RL) is a domain in machine learning concerned with how an agent makes decisions [...]

    p(s_{1:T}, a_{1:T}) = π · p(s_1) ∏_{t=1}^{T−1} p(s_{t+1} | s_t, a_t)    (1)
3. Bayesian Policy and Relation to Classical Reinforcement Learning

In practice, it could be tricky to specify a desired goal precisely on s_T. Thus we introduce an abstract random binary variable z that indicates whether s_T is a good (rewarding) or bad state. The goal is instead set as z = 1 (good state).

In the special case when the given goal on s_T is certain, we have p(z = 1 | s_T) ≜ δ(s_T − g). And one could verify that p(z = 1 | s_1, T) = p(s_T = g | s_1, T), while the updated policy (posterior) becomes [...]

[...] where γ_T and r_T denote the discount factor and instant reward at time T respectively, while R(s_T) is the reward function that returns a corresponding reward for state s_T.

It is clear that the horizon distribution π_T behaves like the discount factor, while the goal distribution π_{zT} acts like the reward function in classical reinforcement learning. In classical RL, both the reward function and the discount factor are often given. In contrast, in our probabilistic framework, the optimal policy, horizon and goal distribution π̂_ag that maximize the (log) marginal likelihood in eq. (4) can be estimated by, e.g., the EM algorithm (Dempster et al., 1977).
References
Attias, H. (2003). Planning by probabilistic inference.
AISTATS.
Botvinick, M., & Toussaint, M. (2012). Planning as
inference. Trends in Cognitive Sciences, 16, 485–
488.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the
EM algorithm. Journal of the royal statistical soci-
ety. Series B (methodological), 1–38.
Friston, K. (2010). The free-energy principle: a unified
brain theory? Nature Reviews Neuroscience, 11,
127–138.
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy in-
variance under reward transformations: Theory and
application to reward shaping. International Con-
ference on Machine Learning (pp. 278–287).
Identifying Subject Experts through Clustering Analysis
Keywords: data mining, expert finding, formal concept analysis, health science, knowledge management
verbs, and adverbs.

In view of the above, we define an expert profile as a list of keywords, extracted from the available information about the expert in question, describing her/his subjects of expertise. Assume that n different expert profiles are created in total and each expert profile i (i = 1, 2, ..., n) is represented by a list of p_i keywords.

A conceptual model of the domain of interest, such as a thesaurus, a taxonomy etc., can be available and used to attain accurate and topic-relevant expert profiles. In this case, usually a set of subject terms (topics) arranged in a hierarchical manner is used to represent concepts in the considered domain. Another possibility to represent the domain of interest at a higher level of abstraction is to partition the set of all different keywords used to define the expert profiles into k main subject areas. The latter idea has been proposed and applied in (Boeva et al., 2014).

As discussed above, the domain of interest can be presented by k main subject categories C_1, C_2, ..., C_k. Let us denote by b_ij the number of keywords from the expert profile of expert i that belong to category C_j. Now each expert i can be represented by a vector e_i = (e_i1, e_i2, ..., e_ik), where e_ij = b_ij / p_i and p_i is the total number of keywords in the expert profile representation. In this way, each expert i is represented by a k-length vector of membership degrees of the expert to k different subject categories, i.e. the above procedure generates a fuzzy clustering. The resulting fuzzy partition can easily be turned into a crisp one by assigning to each pair (expert, area) a binary value (0 or 1), i.e. for each subject area we can associate those experts who have membership degrees greater than a preliminarily given threshold (e.g., 0.5). This partition is not guaranteed to be disjoint in terms of the different subject areas, since there will be experts who will belong to more than one subject category. This overlapping partition is further analyzed and refined into a disjoint one by applying FCA.

Formal concept analysis (Ganter et al., 2005) is a mathematical formalism allowing to derive a concept lattice from a formal context constituted of a set of objects, a set of attributes, and a binary relation defined on the Cartesian product of these two sets. In our case, a (formal) context consists of the set of the n experts, the set of main categories {C_1, C_2, ..., C_k} and an indication of which experts are associated with which subject category. Thus the context is described as a matrix, with the experts corresponding to the rows and the categories corresponding to the columns of the matrix, and a value 1 in cell (i, j) whenever expert i is associated with subject area C_j. Subsequently, a concept for this context is defined to be a pair (X, Y) such that X is a subset of experts and Y is a subset of subject areas, and every expert in X belongs to every area in Y; for every expert that is not in X, there is a subject area in Y that does not contain that expert; for every subject area that is not in Y, there is an expert in X who is not associated with that area. The family of these concepts obeys the mathematical axioms defining a concept lattice. The built lattice consists of concepts where each one represents a subset of experts belonging to a number of subject areas. The set of all concepts partitions the experts into a set of disjoint expert areas. Notice that the above-introduced grouping of experts can be performed with respect to any set of subject areas describing the domain of interest, e.g., the experts could be clustered on a lower level of abstraction by using more specific topics.

3. Initial Evaluation and Discussion

The proposed approach has initially been evaluated in (Boeva et al., 2016) by applying the algorithm to partition Bulgarian health science experts extracted from the PubMed repository of peer-reviewed biomedical articles. Medical Subject Headings (MeSH) is a controlled vocabulary developed by the US National Library of Medicine for indexing research publications, articles and books. Using the MeSH terms associated with peer-reviewed articles published by Bulgarian authors and indexed in PubMed, we extract all such authors and construct their expert profiles. The MeSH headings are grouped into 16 broad subject categories. We have produced a grouping of all the extracted authors with respect to these subject categories by applying the discussed formal concept analysis approach. The produced grouping of experts is shown to capture well the expertise distribution in the considered domain with respect to the main subjects. In addition, it facilitates the identification of individuals with the required competence. For instance, if we need to recruit researchers who have expertise simultaneously in the 'Phenomena and Processes' and 'Health care' categories, we can directly locate those who belong to the concept that unites the corresponding categories.

4. Conclusion and Future Work

A formal concept analysis approach for clustering a group of experts with respect to given subject areas has been discussed. The initial evaluation has demonstrated that the proposed approach is a robust clustering technique that is suitable to deal with sparse data. Further evaluation and validation on richer data extracted from different online sources are planned.
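To make the profile construction concrete, the sketch below derives the fuzzy membership vectors e_i and the thresholded expert-by-category context matrix used as input to formal concept analysis. The counts are toy values, the 0.5 threshold follows the text, and p_i is approximated here by the per-expert row sum of the category keyword counts (an assumption of this sketch).

```python
import numpy as np

def expert_context(keyword_counts, threshold=0.5):
    """keyword_counts: (n experts x k categories) matrix of b_ij values."""
    B = np.asarray(keyword_counts, dtype=float)
    p = B.sum(axis=1, keepdims=True)          # p_i: keywords per expert profile (approximation)
    E = B / p                                  # e_ij = b_ij / p_i (fuzzy membership degrees)
    context = (E > threshold).astype(int)      # crisp expert/area incidence matrix for FCA
    return E, context

# Toy example: 3 experts, 2 subject categories.
E, context = expert_context([[8, 2], [1, 9], [5, 5]])
```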
An Exact Iterative Algorithm for Transductive Pairwise Prediction
one needs a model to impute these missing values in the first place. For incomplete data, we need to solve a modification of (1):

    min_F  (1/2) Σ_{(i,j)∈T} (F_ij − Y_ij)² + (C/2) vec(F)ᵀ [G ⊗ K]⁻¹ vec(F).    (2)

The cost of solving the above problem is determined by the number of pairwise observations n, which might be huge even for modest m and q. Ironically, having less data makes it much harder to compute F compared to the complete case.

Since computing F can be done efficiently when the dataset is complete, we suggest the following simple recipe to update the missing values:

1. Initialize the missing values of the unlabelled dyads, making the label matrix complete. This can be done by using the average of the observed labels or initializing them to zero.
2. Fit a model using this label matrix. This step has a very low computational cost if the eigenvalue decompositions of K and G were already computed.
3. Update the missing values using the model.
4. Repeat steps 2 and 3 until convergence.

Formally, we can show that the above steps always converge to the unique minimizer of (1) and the error w.r.t. F decays as a geometric series.

2. Illustration: inpainting an image

The pairwise methods can also be applied to images, naturally represented as a matrix. Using suitable kernels, the Kronecker-based methods can be used as a linear image filter — see (Gonzalez & Woods, 2007) for an introduction. A black-and-white image is merely a matrix with intensity values for each pixel. Here, the only features for the rows and columns are the x- and y-coordinates of the pixels. For the rows (resp. columns) a kernel can be constructed that quantifies the distance between pixels in the vertical (resp. horizontal) direction. In the experiments, we use a standard radial basis kernel on the pixel coordinates for the rows and columns plugged into Kronecker kernel ridge regression with a regularization parameter λ = 0.1. We will illustrate the imputation algorithm on a benchmark image of a cup of coffee.

Figure 1 shows the example image of which parts were removed. We either randomly removed 1%, 10%, 50%, 90% or 99% of the pixels or removed a 100 × 400 pixels block from the image. Subsequently, the iterative imputation algorithm was used to impute the missing part of the image. The missing pixels were initialized with the average value of the remaining pixels in the image. The bottom of Figure 1 shows the mean squared error of the values of the imputed pixels as a function of the number of iterations of the algorithm. For reference purposes, the variance is also indicated, corresponding to the expected mean squared error of using the mean as imputation. In all cases, the algorithm could restore the image substantially better than the baseline. If the image is relatively complete, the imputation is quite fast; all imputations could be done in under a minute on a standard laptop. Figure 2 shows some of the image restorations. With 10% of the pixels missing, the restoration is visually indistinguishable from the original. Using only 10% of the pixels, a blurry image of the original can be produced. In the case where a block of the image is missing, a ‘shadow’ of the coffee cup can be seen, showing that the model can at least detect some high-level features of the image.

3. Conclusions

We presented a simple algorithm to impute missing values in a transductive pairwise learning setting. It can be shown that the algorithm always rapidly converges to the correct solution. This algorithm was illustrated on an example of inpainting an image. Given the importance of pairwise learning in domains such as molecular network inference (Vert, 2008; Schrynemackers et al., 2013), recommender systems (Lü et al., 2012) and species interaction prediction (Poisot et al., 2016; et al., 2017), we believe this algorithm to be a useful tool in a variety of settings.

References

Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-Supervised Learning. MIT Press.

Gonzalez, R. C., & Woods, R. E. (2007). Digital Image Processing. Pearson.

Johnson, R., & Zhang, T. (2008). Graph-based semi-supervised learning and spectral kernel design. IEEE Transactions on Information Theory, 54, 275–288.

Liu, H., & Yang, Y. (2015). Bipartite edge prediction via transductive learning over product graphs.
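A minimal numpy sketch of the imputation recipe above (not the authors' implementation): the row and column kernel matrices K and G and the regularization constant are assumed given, the eigendecompositions are computed once, and the missing entries are repeatedly refilled with the Kronecker kernel ridge regression predictions.

```python
import numpy as np

def iterative_imputation(Y, observed, K, G, lam=0.1, n_iter=100):
    """Y: label/pixel matrix; observed: boolean mask of known entries."""
    # Eigendecompose the row and column kernels once (refits are then cheap).
    lk, U = np.linalg.eigh(K)
    lg, V = np.linalg.eigh(G)
    S = np.outer(lk, lg)
    W = S / (S + lam)                          # spectral filter of Kronecker KRR

    F = Y.astype(float).copy()
    F[~observed] = Y[observed].mean()          # step 1: initialize missing values
    for _ in range(n_iter):
        pred = U @ (W * (U.T @ F @ V)) @ V.T   # step 2: fit/predict on the completed matrix
        F[~observed] = pred[~observed]         # step 3: update the missing entries
    return F                                   # step 4: iterate until convergence
```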
Figure 1. (left) An image of a cup of coffee. (right) Mean squared error of the imputed pixels as a function of the number of iterations of the imputation algorithm. Missing pixels are initialized with the average value of the observed pixels. The dotted line indicates the variance of the pixels, i.e. the approximate mean squared error of imputing with the average value of the imputed pixels.
Figure 2. Examples of missing pixel imputation on the coffee image. (left) Mask indicating which pixels were removed,
blue indicates available, red indicates missing. (middle) The coffee image with the corresponding missing pixels. (right)
The restored image. (from top to bottom) 10% of the pixels randomly removed, 90% of the pixels randomly removed, a
block of the image removed.
Towards an automated method based on Iterated Local Search
optimization for tuning the parameters of Support Vector Machines
Sergio Consoli, Jacek Kustra, Pieter Vos, Monique Hendriks, Dimitrios Mavroeidis
[name.surname]@philips.com
Philips Research, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands
Keywords: Support Vector Machines, Iterated Local Search, online tuning, parameters setting
plane [Cortes & Vapnik, 1995]. A penalty is associated with the instances which are misclassified and added to the minimization function. This is done via the parameter C in the minimization formula:

    arg min_{f(x)=ωᵀx+b}  (1/2)‖ω‖² + C Σ_{i=1}^{n} c(f, x_i, y_i).

By varying C, a trade-off between the accuracy and stability of the function is defined. Larger values of C result in a smaller margin, leading to potentially more accurate classifications; however, overfitting can occur. A mapping of the data with appropriate kernel functions k(x, x′) into a richer feature space, including non-linear features, is applied prior to the hyperplane fitting. Among several kernels in the literature, we consider the Gaussian radial-basis function (RBF): K(x_i, x′) = exp(−γ‖x_i − x′‖²), γ > 0, where γ defines the variance of the RBF, practically defining the shape of the kernel function peaks: lower γ values set the bias to low and correspondingly high γ to high bias.

The proposed ILS under the current implementation for SVM tuning uses grid search [Bergstra & Bengio, 2012] as an inner local search routine, which is then iterated in order to make it fine-grained, finally producing the best parameters C and γ found to date. Given a training dataset D and an SVM model Θ, the procedure first generates an initial solution. We use an initial solution produced by grid search. The grid search exhaustively generates candidates from a grid of the parameter values, C and γ, specified in the arrays range_γ ∈ ℝ⁺ and range_C ∈ ℝ⁺. We choose arrays containing five different values for each parameter, so that the grid search method will look at 25 different parameter combinations. The range values are taken as different powers of 10 from −2 to 2. Solution quality is evaluated as the accuracy of the SVM by means of k-fold cross validation [McLachlan et al., 2004], and stored in the variable Acc.

Afterwards, the perturbation phase, which represents the core idea of ILS, is applied to the incumbent solution. The goal is to provide a good starting point (i.e. parameter ranges) for the next local search phase of ILS (i.e. the grid search in our case), based on the previous search experience of the algorithm, so as to obtain a better balance between exploration of the search space and wasting time in areas that are not giving good results. Ranges are set as: range_γ = [γ·10⁻², γ·10⁻¹, γ, γ·10, γ·10²] ≡ [γ_inf-down, γ_inf-up, γ, γ_sup-down, γ_sup-up], and range_C = [C·10⁻², C·10⁻¹, C, C·10, C·10²] ≡ [C_inf-down, C_inf-up, C, C_sup-down, C_sup-up].

Imagine that the grid search gets the set of parameters γ′, C′ as a new incumbent solution, whose evaluated accuracy is Acc′. Then the acceptance criterion of this new solution is that it produces a better quality, that is an increased accuracy, than the best solution to date. If that does not happen, the new incumbent solution is rejected and the ranges are updated automatically with the following values: γ_inf-down = rand(γ_inf-down · 10⁻¹, γ_inf-down) and C_inf-down = rand(C_inf-down · 10⁻¹, C_inf-down); γ_inf-up = rand((γ − γ_inf-up)/2, γ) and C_inf-up = rand((C − C_inf-up)/2, C); γ_sup-down = rand((γ_sup-down − γ)/2, γ) and C_sup-down = rand((C_sup-down − C)/2, C); and γ_sup-up = rand(γ_sup-up · 10) and C_sup-up = rand(C_sup-up · 10). That is, indifferently for γ and C, the values of the inf-down and sup-up components are random values always taken farther from the current parameter (γ or C), in order to increase the diversification capability of the metaheuristic; while the values of the inf-up and sup-down components are random values always taken closer to the current parameter, in order to increase the intensification strength around the current parameter. This perturbation setting should allow a good balance between the intensification and diversification factors.

Otherwise, if in the acceptance criterion the new incumbent solution, γ′ and C′, is better than the current one, γ and C, i.e. Acc′ > Acc, then this new solution becomes the best solution to date (γ ← γ′, C ← C′), and range_γ and range_C are updated as usual. This procedure continues iteratively until the termination conditions imposed by the user are satisfied, producing at the end the best combination of γ and C as output.

3. Summary and outlook

We considered the parameter setting task in SVMs by an automated ILS heuristic, which looks to be a promising approach. We are aware that a more detailed description of the algorithm is deemed necessary, along with a thorough computational investigation. This is currently the object of ongoing research, including a statistical analysis and comparison of the proposed algorithm against standard grid search, in order to quantify and qualify the improvements obtained. Further research will explore the application of this strategy to other SVM kernels, considering also a variety of big, heterogeneous datasets.

References

Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
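A compact sketch of the tuning loop described above, assuming scikit-learn's SVC and cross-validated accuracy as the solution quality. The perturbation rule is deliberately simplified here — a five-value range is resampled around the incumbent rather than applying the exact inf/sup update formulas of the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_accuracy(X, y, Cs, gammas, cv=5):
    """Inner local search: exhaustive grid over the two candidate ranges."""
    best_acc, best_C, best_g = -np.inf, None, None
    for C in Cs:
        for g in gammas:
            acc = cross_val_score(SVC(C=C, gamma=g, kernel="rbf"), X, y, cv=cv).mean()
            if acc > best_acc:
                best_acc, best_C, best_g = acc, C, g
    return best_acc, best_C, best_g

def ils_svm_tuning(X, y, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initial solution: grid of powers of 10 from -2 to 2 for both parameters.
    powers = 10.0 ** np.arange(-2, 3)
    acc, C, g = grid_accuracy(X, y, powers, powers)
    for _ in range(n_iter):
        # Perturbation (simplified): resample a 5-value range around the incumbent.
        Cs = C * 10.0 ** rng.uniform(-2, 2, size=5)
        gammas = g * 10.0 ** rng.uniform(-2, 2, size=5)
        new_acc, new_C, new_g = grid_accuracy(X, y, Cs, gammas)
        if new_acc > acc:                     # acceptance criterion
            acc, C, g = new_acc, new_C, new_g
    return C, g, acc
```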
Multi-step-ahead prediction of volatility proxies
Though machine learning techniques have often been used for stock price forecasting, few results are available for market fluctuation prediction. Nevertheless, volatility forecasting is an essential tool for any trader wishing to assess the risk of a financial investment. The main challenge of volatility forecasting is that, since this quantity is not directly observable, we cannot predict its actual value but have to rely on some observers, known as volatility proxies (Poon & Granger, 2003), based either on intraday (Martens, 2002) or daily data. Once a proxy is chosen, the standard approach to volatility forecasting is the well-known GARCH-like model (Andersen & Bollerslev, 1998). In recent years several hybrid approaches have emerged (Kristjanpoller et al., 2014; Dash & Dash, 2016; Monfared & Enke, 2014) which combine GARCH with a non-linear computational approach. What is common to the state of the art is that volatility forecasting is addressed as a univariate and one-step-ahead auto-regressive (AR) time series problem.

The purpose of our work is twofold. First, we aim to perform a statistical assessment of the relationships among the most used proxies in the volatility literature. Second, we explore a NARX (Nonlinear Autoregressive with eXogenous input) approach to estimate multiple steps of the output given the past output and input measurements, where the output and the input are two different proxies. In particular, our preliminary results show that the statistical dependencies between proxies can be used to improve the forecasting accuracy.

1. Background

Three main types of proxies are available in the literature: the proxy σ^{SD,n}, the family of proxies σ^i and σ^G. The first proxy corresponds to the natural definition of volatility (Poon & Granger, 2003), as a rolling standard deviation over a past time window of size n:

    σ_t^{SD,n} = sqrt( (1/(n−1)) Σ_{i=0}^{n−1} (r_{t−i} − r̄_n)² )

where r_t = ln(P_t^{(c)} / P_{t−1}^{(c)}) is the daily continuously compounded return, r̄_n is the average over {t, ..., t−n+1} and P_t^{(c)} are the closing prices. The family of proxies σ_t^i is analytically derived in Garman and Klass (1980). The proxy σ_t^G = sqrt( ω + Σ_{j=1}^{p} β_j (σ_{t−j}^G)² + Σ_{i=1}^{q} α_i ε_{t−i}² ) is the volatility estimation returned by a GARCH(1,1) (Hansen & Lunde, 2005), where ε_{t−i} ∼ N(0, 1) and the coefficients ω, α_i, β_j are fitted according to the procedure in (Bollerslev, 1986).

2. The relationship between proxies

The fact that several proxies have been defined for the same latent variable raises the issue of their statistical association. For this reason we computed the proxies discussed above on the 40 time series of the French stock market index CAC40 in the period ranging from 05-01-2009 to 22-10-2014 (approximately 6 years). This corresponds to 1489 OHLC (Opening, High, Low, Closing) samples for each time series. Moreover, we obtained the continuously compounded return and the volume variable (representing the number of trades in a given trading day).
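For concreteness, a small numpy sketch of the σ^{SD,n} proxy defined above; the closing-price series and the window size n are the only inputs, and the r̄_n average and the 1/(n−1) normalisation follow the formula. This is an illustration rather than the authors' code.

```python
import numpy as np

def sigma_sd(closing_prices, n):
    """Rolling standard-deviation proxy sigma^{SD,n} of daily log returns."""
    p = np.asarray(closing_prices, dtype=float)
    r = np.log(p[1:] / p[:-1])            # continuously compounded returns r_t
    proxy = np.full(r.shape, np.nan)
    for t in range(n - 1, len(r)):
        window = r[t - n + 1: t + 1]      # {r_t, ..., r_{t-n+1}}
        proxy[t] = window.std(ddof=1)     # 1/(n-1) normalisation as in the formula
    return proxy
```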
Figure 1 shows the aggregated correlation (over all the 40 time series) between the proxies, obtained by meta-analysis (Field, 2001). The black rectangles indicate the results of a hierarchical clustering using [...] proxies belonging to the same family, i.e. σ_t^i and σ_t^{SD,n}. The presence of σ_t^0 in the σ_t^{SD,n} cluster can be explained by the fact that the former represents a degenerate case of the latter when n = 1. Moreover, we find a correlation between the volume and the σ_t^i family.

[Figure 1: aggregated correlation matrix between the proxies (Volume, σ^{SD,50}, σ^{SD,100}, σ^{SD,250}, σ^G, σ^0, ..., σ^6, r_t), with black rectangles marking the hierarchical clusters.]

3. NARX proxy forecasting

[...] the proxy σ^G by addressing the question whether a NARX approach can be beneficial in terms of accuracy [...] ω with a multi-step-ahead NARX model σ^G_{t+h} = f(σ^G_t, ..., σ^G_{t−m}, σ^X_t, ..., σ^X_{t−m}) + ω, for a specific em[...]
Table 1. MASE (normalized w.r.t. the accuracy of a naive method) for a 10-step volatility forecasting horizon on a single stock composing the CAC40 index over the period from 05-01-2009 to 22-10-2014, for different proxy combinations (rows) and different forecasting techniques (columns). The subscript X stands for the NARX model where σ^X is exogenous.

σ^X        ANN    kNN    ANN_X   kNN_X   GARCH(1,1)
σ^6        0.07   0.08   0.06    0.11    1.34
Volume     0.07   0.08   0.07    0.14    1.34
σ^{SD,5}   0.07   0.08   0.07    0.09    1.34
σ^{SD,15}  0.07   0.08   0.06    0.10    1.34
σ^{SD,21}  0.07   0.08   0.06    0.10    1.34

Table 2. MASE (normalized w.r.t. the accuracy of a naive method) for a 10-step volatility forecasting horizon on the S&P500 index over the period from 01-04-2012 to 30-07-2013, as in the work of Dash & Dash (2016), for different proxy combinations (rows) and different forecasting techniques (columns). The subscript X stands for the NARX model where σ^X is exogenous.

σ^X        ANN    kNN    ANN_X   kNN_X   GARCH(1,1)
σ^6        0.58   0.49   0.53    0.56    1.15
Volume     0.58   0.49   0.57    0.66    1.15
σ^{SD,5}   0.58   0.49   0.58    0.58    1.15
σ^{SD,15}  0.58   0.49   0.65    0.65    1.15
σ^{SD,21}  0.58   0.49   0.56    0.65    1.15
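The NARX formulation above lends itself to a direct multi-step-ahead strategy. The sketch below is an illustration, not the authors' experimental protocol: the embedding order m, the horizon h and the kNN regressor are assumptions, and a single model is fitted for the h-step-ahead value from lagged values of the target proxy and one exogenous proxy.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def narx_direct_forecast(target, exog, m=5, h=10, k=5):
    """Direct h-step-ahead NARX forecast of one proxy with one exogenous proxy.

    target, exog: 1-D arrays of equal length (e.g. sigma^G and sigma^{SD,n}).
    Returns the forecast made at the last available time step.
    """
    target, exog = np.asarray(target, float), np.asarray(exog, float)
    X, Y = [], []
    T = len(target)
    for t in range(m - 1, T - h):
        X.append(np.r_[target[t - m + 1: t + 1], exog[t - m + 1: t + 1]])
        Y.append(target[t + h])
    model = KNeighborsRegressor(n_neighbors=k).fit(np.array(X), np.array(Y))
    x_last = np.r_[target[T - m:], exog[T - m:]]
    return float(model.predict(x_last.reshape(1, -1))[0])
```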
References
Andersen, T. G., & Bollerslev, T. (1998). Arch and
garch models. Encyclopedia of Statistical Sciences.
Bollerslev, T. (1986). Generalized autoregressive
conditional heteroskedasticity. Journal of econome-
trics, 31, 307–327.
Dash, R., & Dash, P. (2016). An evolutionary hybrid
fuzzy computationally efficient egarch model for vo-
latility prediction. Applied Soft Computing, 45, 40–
60.
Field, A. P. (2001). Meta-analysis of correlation co-
efficients : a monte carlo comparison of fixed-and
random-effects methods. Psychological methods, 6,
161.
Garman, M. B., & Klass, M. J. (1980). On the esti-
mation of security price volatilities from historical
data. Journal of business, 67–78.
Hansen, P. R., & Lunde, A. (2005). A forecast com-
parison of volatility models : does anything beat a
garch (1, 1) ? Journal of applied econometrics, 20,
873–889.
Hyndman, R. J., & Koehler, A. B. (2006). Another
look at measures of forecast accuracy. International
journal of forecasting, 22, 679–688.
Kristjanpoller, W., Fadic, A., & Minutolo, M. C.
(2014). Volatility forecast using hybrid neural net-
work models. Expert Systems with Applications, 41,
2437–2442.
Martens, M. (2002). Measuring and forecasting s&p
500 index-futures volatility using high-frequency
data. Journal of Futures Markets, 22, 497–518.
Monfared, S. A., & Enke, D. (2014). Volatility forecas-
ting using a hybrid gjr-garch neural network model.
Procedia Computer Science, 36, 246–253.
Poon, S.-H., & Granger, C. W. (2003). Forecasting
volatility in financial markets : A review. Journal of
economic literature, 41, 478–539.
Taieb, S. B. (2014). Machine learning strategies for
multi-step-ahead time series forecasting. Doctoral
dissertation, Ph. D. Thesis.
Tashman, L. J. (2000). Out-of-sample tests of forecas-
ting accuracy : an analysis and review. International
journal of forecasting, 16, 437–450.
Ward Jr, J. H. (1963). Hierarchical grouping to opti-
mize an objective function. Journal of the American
statistical association, 58, 236–244.
Generalization Bound Minimization for Active Learning
Supervised machine learning models require enough labeled data to obtain good generalization performance. However, for many practical applications such as medical diagnosis or video classification it can be expensive or time consuming to label data (Settles, 2012). Often in practical settings unlabeled data is abundant, but due to high costs only a small fraction can be labeled. In active learning an algorithm chooses unlabeled samples for labeling (Cohn et al., 1994). The idea is that models can perform better with less labeled data if the labeled data is chosen carefully instead of randomly. This way active learning methods make the most of a small labeling budget or can be used to reduce labeling costs.

A generalization bound is an upper bound on the generalization error of the model that holds given certain assumptions. Several works have used generalization bounds to guide the active learning process (Gu & Han, 2012; Gu et al., 2012; Ganti & Gray, 2012; Gu et al., 2014). We have performed a theoretical and empirical study of active learners that choose queries that explicitly minimize generalization bounds, to investigate the relation between bounds and their active learning performance. We limited our study to the kernel regularized least squares model (Rifkin et al., 2003) and the squared loss.

We studied the state-of-the-art Maximum Mean Discrepancy (MMD) active learner that minimizes a generalization bound (Chattopadhyay et al., 2012; Wang & Ye, 2013). The MMD is a divergence measure (Gretton et al., 2012) which is closely related to the Discrepancy measure (Mansour et al., 2009).

One of our novel theoretical results is a comparison of these bounds. We show that the Discrepancy bound on the generalization error is tighter than the MMD bound in the realizable setting — in this setting it is assumed there is no model mismatch. Tighter bounds are generally considered favorable as they estimate the generalization error more accurately. One might therefore also expect them to lead to better labeling choices in active learning when minimized, and therefore we evaluated an active learner that minimizes the Discrepancy.

However, we observed that active learning using the tighter Discrepancy bound performs worse than the MMD. The underlying reason is that these bounds assume worst-case scenarios in order to derive their guarantees, and therefore minimizing these bounds for active learning may result in suboptimal performance in non-worst-case scenarios. In particular, the worst-case scenario assumed by the Discrepancy is, probabilistically speaking, very unlikely to occur compared to the scenario considered by the MMD, and therefore the Discrepancy performs worse for active learning.

This insight led us to introduce the Nuclear Discrepancy, whose bound is looser. The Nuclear Discrepancy considers average-case scenarios which occur more often in practice. Therefore, minimizing the Nuclear Discrepancy leads to an active learning strategy that is more suited to non-worst-case scenarios. Our experiments show that active learning using the Nuclear Discrepancy improves significantly upon the MMD and Discrepancy, especially in the realizable setting.

Our study illustrates that tighter bounds do not guarantee improved active learning performance and that a probabilistic analysis is essential: active learners should optimize their strategy for scenarios that are likely to occur in order to perform well in practice.
References
Chattopadhyay, R., Wang, Z., Fan, W., Davidson, I.,
Panchanathan, S., & Ye, J. (2012). Batch Mode Ac-
tive Sampling Based on Marginal Probability Dis-
tribution Matching. Proceedings of the 18th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD) (pp. 741–749).
Cohn, D., Atlas, L., & Ladner, R. (1994). Improving
generalization with active learning. Machine Learn-
ing, 15, 201–221.
Ganti, R., & Gray, A. (2012). UPAL: Unbiased Pool
Based Active Learning. Proceedings of the 15th In-
ternational Conference on Artificial Intelligence and
Statistics (AISTATS) (pp. 422–431).
Gretton, A., Borgwardt, K. M., Rasch, M. J.,
Schölkopf, B., & Smola, A. (2012). A Kernel Two-
sample Test. Machine Learning Research, 13, 723–
773.
Gu, Q., & Han, J. (2012). Towards Active Learning on
Graphs: An Error Bound Minimization Approach.
Proceedings of the 12th IEEE International Confer-
ence on Data Mining (ICDM) (pp. 882–887).
Gu, Q., Zhang, T., & Han, J. (2014). Batch-Mode Ac-
tive Learning via Error Bound Minimization. Pro-
ceedings of the 30th Conference on Uncertainty in
Artificial Intelligence (UAI).
Gu, Q., Zhang, T., Han, J., & Ding, C. H. (2012).
Selective Labeling via Error Bound Minimization.
Proceedings of the 25th Conference on Advances in
Neural Information Processing Systems (NIPS) (pp.
323–331).
Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009).
Domain Adaptation: Learning Bounds and Algo-
rithms. Proceedings of the 22nd Annual Conference
on Learning Theory (COLT).
Rifkin, R., Yeo, G., & Poggio, T. (2003). Regular-
ized least-squares classification. Advances in Learn-
ing Theory: Methods, Model, and Applications, 190,
131–154.
Settles, B. (2012). Active Learning. Synthesis Lectures
on Artificial Intelligence and Machine Learning, 6,
1–114.
Wang, Z., & Ye, J. (2013). Querying Discriminative
and Representative Samples for Batch Mode Active
Learning. Proceedings of the 19th ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining (KDD) (pp. 158–166).
Projected Estimators for Robust Semi-supervised Classification
[Figure 1: relative increase in surrogate loss (labeled + unlabeled) per benchmark dataset (BCI, COIL2, Diabetes, Digit1, g241c, g241d, Haberman, Ionosphere, Mammography, Parkinsons, Sonar, SPECT, SPECTF, Transfusion, USPS, WDBC) for three semi-supervised procedures.]
Figure 1. Ratio of the loss in terms of surrogate loss of semi-supervised and supervised solutions measured on the labeled
and unlabeled instances. Values smaller than 1 indicate that the semi-supervised method gives a lower average surrogate
loss than its supervised counterpart. Unlike the other semi-supervised procedures, the projection method, evaluated on
labeled and unlabeled data, never has higher loss than the supervised procedure, as we prove in Theorem 1 of (Krijthe &
Loog, 2017)
Note that this set, by construction, will also contain the solution w_oracle, corresponding to the true but unknown labeling y_e^*. Typically, w_oracle is a better solution than w_sup and so we would like to find a solution more similar to w_oracle. This can be accomplished by projecting w_sup onto Θ:

    w_semi = min_{w ∈ Θ} d(w, w_sup),

where d(w, w′) is a particular distance measure that measures the similarity between two classifiers. This is a quadratic programming problem with simple constraints that can be solved using, for instance, a simple gradient descent procedure.

3. Theoretical Guarantee

The main contribution of this work is a proof that the semi-supervised learner that we just described is guaranteed to never lead to worse performance than the supervised classifier, when performance is measured in terms of the quadratic loss on the labeled and unlabeled data. This property is shown empirically in Figure 1. This non-degradation property is important in practical applications, since one would like to be sure that the effort of the collection of, and computation with, unlabeled data does not have an adverse effect. Our work is a conceptual step towards methods with these types of guarantees.

4. Empirical Evidence

Aside from the theoretical guarantee that performance never degrades when measured on the labeled and unlabeled training set in terms of the surrogate loss, experimental results indicate that it not only never degrades, but often improves performance. Our experiments also indicate the results hold when performance is evaluated on objects in a test set that were not used as unlabeled objects during training.

References

Cozman, F., & Cohen, I. (2006). Risks of Semi-Supervised Learning. In O. Chapelle, B. Schölkopf and A. Zien (Eds.), Semi-supervised learning, chapter 4, 56–72. MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning. Springer. 2nd edition.

Krijthe, J. H., & Loog, M. (2017). Projected Estimators for Robust Semi-supervised Classification. Machine Learning, http://arxiv.org/abs/1602.07865.

Poggio, T., & Smale, S. (2003). The Mathematics of Learning: Dealing with Data. Notices of the AMS, 50, 537–544.
Towards an Ethical Recommendation Framework
Keywords: recommender systems, ethics of data mining / machine learning, ethical recommendation framework
Table 1. The user-centric ethical recommendation framework: RS design stages, ethical concerns, known countermeasures, and the proposed user-adjustable controls.

Data collection
  Ethical concerns: privacy breaches, lack of awareness/consent, fake profile injection.
  Known countermeasures: informed consent, privacy-preserving collaborative filtering, identity verification.
  User-adjustable control: “Do not track activity” tool — this setting disallows the creation and maintenance of a user profile. Types of data can be manually defined and browsed items can be manually deleted (e.g. the “manage history” tool on Amazon).

Data publishing
  Ethical concerns: privacy / security / anonymity breaches.
  Known countermeasures: privacy-preserving data publishing.
  User-adjustable control: “Do not share data” tool — this option allows local user profiling but forbids sharing data with third parties (even in the presence of anonymization). Types of data or categories of allowed recipients can be manually defined.

Algorithm design
  Ethical concerns: biases, discrimination, behavior manipulation.
  Known countermeasures: algorithm audits, reverse engineering, discrimination-aware data mining.
  User-adjustable control: “Marketing bias” filter — this filter is used to remove any business-driven bias introduced by RS providers, and to set the recommendation engine to the “best match” mode (or other user-selectable modes, such as “cheapest first”).

User interface design
  Ethical concerns: algorithmic opacity, content censorship.
  Known countermeasures: explanations, ethical rule set generation, content analysis.
  User-adjustable control: “Content censorship” filter — this tool can be used to set user-defined exclusion criteria for filtering out inappropriate items or categories. It also contains the option to turn the filter on and off (also with the possibility of scheduling).

A/B testing
  Ethical concerns: fairness, side effects, lack of trust / awareness / consent.
  Known countermeasures: informed consent, possibility to opt out and delete data.
  User-adjustable control: “Opt out of experiments” tool — this option can be used to reset the recommendation engine to its default algorithm, exclude the user from any future experiments, enable the opt-in option, and delete data from previous experiments.
2.3. Ethics of experimentation

• Famous cases of unethical A/B testing
• Three ways of consent acquisition for A/B testing
• Fairness and possibilities of user control

3. Summary as a Framework

Table 1 summarizes our findings in the form of a user-centric ethical recommendation framework, which maps RS design stages to potential ethical concerns and the recommended countermeasures. As a practical contribution, we propose an “ethical toolbox” comprised of user-adjustable controls corresponding to each design stage. These controls enable users to tune a RS to their individual moral standards. The usability of the provided controls may depend on many factors, such as their layout, frequency of using the system, sensitivity of data, and so on. As a vital first step, however, it is necessary to establish the general stance of users towards the ethics of recommender systems and whether the proposed toolbox would stand as a viable solution. This is done in the next section.

4. Feasibility study

We conduct an online survey¹ to find out people's opinions and their preferred course of action regarding five ethical issues of RS that are addressed by the proposed toolbox: user profiling, data publishing, online experiments, marketing bias, and content censorship. The survey was disseminated to Facebook groups of numerous European universities, yielding 214 responses from students and academic staff. The analysis of survey results immediately revealed participants' strong preference for taking morally sensitive issues under their control. In 4 out of 5 studied issues, the majority voted for having a user-adjustable setting within a recommendation engine among other alternative solutions. The survey questions, responses, and analysis can be found in (Paraschakis, 2017).

5. Conclusion

We conclude that multiple moral dilemmas emerge at every stage of RS design, while their solutions are not always evident or effective. In particular, there are many trade-offs to be resolved, such as user privacy vs. personalization, data anonymization vs. data utility, informed consent vs. experimentation bias, and algorithmic transparency vs. trade secrets. A careful risk assessment is crucial for deciding on the strategies of data anonymization or informed consent acquisition required for A/B testing or user profiling. We have found evidence that many big players on the RS market (Facebook, Amazon, Netflix, etc.) have faced loud ethics-related backlashes. Thus, it is important to ensure that a RS design is not only legally and algorithmically justified, but also ethically sound. The proposed framework suggests a new paradigm of ethics-awareness by design, which utilizes existing technologies where possible, and complements them with user-adjustable controls. This idea was embraced by the vast majority of our survey participants, and future work should further test its usability in a fully implemented RS prototype.

¹ Available at http://recommendations.typeform.com/to/kgKNQ0
References
Cremonesi, P., Garzotto, F., & Turrin, R. (2012). In-
vestigating the persuasion potential of recommender
systems from a quality perspective: An empirical
study. ACM Trans. Interact. Intell. Syst., 2, 11:1–
11:41.
Knijnenburg, B. (2015). A user-tailored approach to
privacy decision support. Doctoral dissertation, Uni-
versity of California.
Paraschakis, D. (2017). Towards an ethical recom-
mendation framework. To appear in: Proceedings
of the 11th IEEE International Conference on Re-
search Challenges in Information Science.
Tang, T., & Winoto, P. (2016). I should not recom-
mend it to you even if you will like it: the ethics of
recommender systems. New Review of Hypermedia
and Multimedia, 22, 111–138.
An Ensemble Recommender System for e-Commerce
items {y} that share the same attribute value with the premise item x:

    x ↦ {y : attribute_i(x) = attribute_i(y)}

For example, return all items of the same color.

2. A collaborative filtering component defines the set of items {y} that are connected to the premise item x via a certain event type (click, purchase, addition to cart, etc.):

    x ↦ {y : event_i(x)_t → event_i(y)_{t′>t}}

For example, return all items that were bought after the premise item (across all sessions).

We note that special-purpose components can also be added by a vendor to handle all sorts of contexts.

2.2. Ensemble learner

The goal of our ensemble learner is to recommend top-N items for the premise item by querying the empirically best component(s). We employ the well-known Thompson Sampling (TS) policy (Chapelle & Li, 2011) for several practical reasons: a) its strong theoretical guarantees and excellent empirical performance; b) absence of parameters to tune; c) robustness to observation delays; d) flexibility in re-shaping arm reward distributions (see Section 2.3).

For a K-armed Bernoulli bandit, Thompson Sampling models the expected reward θ_a of each arm a as a Beta distribution with prior parameters α and β: θ_a ∼ Beta(S_{a,t} + α, F_{a,t} + β). In each round t, the arm with the highest sample is played. Success and failure counts S_{a,t} and F_{a,t} are updated according to the observed reward r_{a,t}.

The blind application of this classical TS model would fail in our case because of its two assumptions:

1. One arm pull per round. Because the selected component may return only few (or even zero!) items for a given query, pulling one arm at a time may not be sufficient to fill in the top-N recommendation list.
2. Arms are stationary. Because collaborative filtering components improve their performance over time, they have non-stationary rewards.

Therefore, our ongoing work extends Thompson Sampling to adapt to the task at hand. To address the first problem, we allow multiple arms to be pulled in each round and adjust the reward system accordingly. The second problem can be solved by dividing each component into sub-components of relatively stable behavior.

2.3. Priming the sampler

Apart from the proposed modifications of the TS model, we examine the effects of priming the sampler by pre-setting the prior parameters α and β of the reward distributions. We consider two realistic scenarios where the estimation of these priors can be done:

1. Newly launched website. In this case, the estimation of the parameters relies solely on the analysis of the product catalog.
2. Pre-existing website. In this case, the estimation of the parameters can be done by utilizing the event history.

In both scenarios, we must be able to reason about the expected mean µ and variance σ² of the reward distributions based on the analysis of the available data. We can then compute α and β as follows:

    α = −µλ/σ²,   β = (µ − 1)λ/σ²,   λ = σ² + µ² − µ    (1)

3. Preliminary results and future work

In our preliminary experiments we compare TS to other popular bandit policies for the top-5 recommendation task, after making the adjustments proposed in Section 2.2. Two stand-alone recommenders are used as strong baselines: best sellers and co-purchases (“Those-Who-Bought-Also-Bought”). We run the experiments on two proprietary e-Commerce datasets of 1 million events each: a book store and a fashion store. The results below show the hit rate of each method. We observe that Thompson Sampling significantly outperforms the baselines and consistently outperforms state-of-the-art MAB policies by a small margin, which justifies our choice of method. Future work will demonstrate the predictive superiority of the extended TS in relation to the standard TS policy. Furthermore, we plan to examine what can be gained by priming the sampler and how exactly it can be done.

Figure 1. TS vs. baselines (measured in hit rate)
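A small sketch of the priming step and of one Thompson Sampling round, following eq. (1). The number of components, the assumed mean/variance estimates and the reward bookkeeping are illustrative placeholders, not values from the paper.

```python
import numpy as np

def beta_prior_from_moments(mu, var):
    """Eq. (1): Beta prior parameters matching a desired reward mean and variance."""
    lam = var + mu ** 2 - mu
    return -mu * lam / var, (mu - 1.0) * lam / var   # alpha, beta

def thompson_choose(successes, failures, alpha, beta, rng):
    """One Thompson Sampling round over the recommendation components (arms)."""
    samples = rng.beta(successes + alpha, failures + beta)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
# Assumed catalog-based estimates of the expected reward (e.g. click rate) and its variance.
alpha, beta = beta_prior_from_moments(mu=0.05, var=0.001)
S, F = np.zeros(4), np.zeros(4)          # success / failure counts for 4 components
arm = thompson_choose(S, F, alpha, beta, rng)
# After observing a binary reward r for the played arm: S[arm] += r; F[arm] += 1 - r
```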
Acknowledgments
This research is part of the research projects “Au-
tomated System for Objectives Driven Merchandis-
ing”, funded by the VINNOVA innovation agency;
http://www.vinnova.se/en/, and “Improved Search
and Recommendation for e-Commerce”, funded by the
Knowledge foundation; http://www.kks.se.
We express our gratitude to Apptus Technologies
(http://www.apptus.com) for the provided datasets
and computational resources.
References
Chapelle, O., & Li, L. (2011). An empirical evaluation
of thompson sampling. Proceedings of the 24th In-
ternational Conference on Neural Information Pro-
cessing Systems (pp. 2249–2257).
Paraschakis, D., Holländer, J., & Nilsson, B. J. (2015).
Comparative evaluation of top-n recommenders in e-
commerce : an industrial perspective. Proceedings of
the 14th IEEE International Conference on Machine
Learning and Applications (pp. 1024–1031).
Ancestral Causal Inference (Extended Abstract)
Two major disadvantages of constraint-based methods are: (i) vulnerability to errors in statistical independence test results, which are quite common in real-world applications, (ii) no ranking or estimation of the con[...]

Though still super-exponentially large, this representation drastically reduces computation time, as shown in the evaluation. Moreover, this representation turns out to be very convenient, because in real-world applications the distinction between direct causal relations and ancestral relations is not always clear or necessary.

To solve the vulnerability to statistical errors in inde[...]
• a frequentist approach, in which for any appropriate frequentist statistical test with independence as null hypothesis, we define the weight: [...]

    (X ⊥⊥ Y | W) ∧ (X ↛ W) ⟹ X ↛ Y,

where X ⊥⊥ Y | W represents the conditional independence of X and Y conditioning on W, while X ↛ Y represents the fact that X is not a cause of Y.

Using this loss function, we propose a method to score predictions according to their confidence. This is very important for practical applications, as the low reliability of the predictions of constraint-based methods has been a major impediment to their widespread usage. We define the confidence score for a statement s as:

    C(s) = min_{A ∈ 𝒜} L(A, I + (¬s, ∞)) − min_{A ∈ 𝒜} L(A, I + (s, ∞))

It can be shown that this score is an approximation of the marginal probability of the statement and that it satisfies certain theoretical guarantees, like soundness and asymptotic consistency, given certain reasonable assumptions on the weights of all input statements.

We evaluate on synthetic data and show that ACI can outperform the state-of-the-art (HEJ, equipped with our scoring method), achieving a speedup of several orders of magnitude (as summarised in Figure 1), while still providing comparable accuracy, as we show in an example precision and recall curve in Figure 2. In the full paper, we also illustrate its practical feasibility by applying it on a challenging protein data set that so far had only been addressed with score-based methods.

Figure 1. Synthetic data: execution times (log-scale).
References
Claassen, T., & Heskes, T. (2012). A Bayesian approach
to constraint-based causal inference. UAI (pp. 207–
216).
Hyttinen, A., Eberhardt, F., & Järvisalo, M. (2014).
Constraint-based causal discovery: Conflict resolu-
tion with Answer Set Programming. UAI (pp. 340–
349).
Magliacane, S., Claassen, T., & Mooij, J. M. (2016).
Ancestral causal inference. NIPS.
Pearl, J. (2009). Causality: models, reasoning and
inference. Cambridge University Press.
Exceptional Model Mining in Ubiquitous and Social Environments
Keywords: exceptional model mining, subgroup discovery, community detection, social interaction networks
2.1. Descriptive Community Detection

Communities can intuitively be defined as subsets of nodes of a graph with a dense structure in the corresponding subgraph. However, for mining such communities usually only structural aspects are taken into account. Typically, no concise nor easily interpretable community description is provided.

In (Atzmueller et al., 2016a), we focus on description-oriented community detection using subgroup discovery. For providing both structurally valid and interpretable communities we utilize the graph structure as well as additional descriptive features of the graph's nodes. We aim at identifying communities according to standard community quality measures, while providing characteristic descriptions at the same time. We propose several optimistic estimates of standard community quality functions to be used for efficient pruning of the search space in an exhaustive branch-and-bound algorithm. We present examples of an evaluation using five real-world data sets, obtained from three different social media applications, showing runtime improvements of several orders of magnitude. The results also indicate significant semantic structures compared to the baselines. A further application of this method to the exploratory analysis of social media using geo-references is demonstrated in (Atzmueller, 2014; Atzmueller & Lemmerich, 2013). Furthermore, a scalable implementation of the described description-oriented community detection approach is given in (Atzmueller et al., 2016b), which is also suited for large-scale data processing utilizing the Map/Reduce framework (Dean & Ghemawat, 2008).

2.2. Characterization of Social Behavior

Important structures that emerge in social interaction networks are given by subgroups. As outlined above, we can apply community detection in order to mine both the graph structure and descriptive features in order to obtain description-oriented communities. However, we can also analyze subgroups in a social interaction network from a compositional perspective, i.e., neglecting the graph structure. Then, we focus on the attributes of subsets of nodes or on derived parameters of these, e.g., corresponding to roles, centrality scores, etc. In addition, we can also consider sequential data, e.g., for characterization of exceptional link trails, i.e., sequential transitions, as presented in (Atzmueller, 2016a).

In (Atzmueller, 2012b), we discuss a number of exemplary analysis results of social behavior in mobile social networks, focusing on the characterization of links and roles. For that, we describe the configuration, adaptation and extension of the subgroup discovery methodology in that context. In addition, we can analyze multiplex networks by considering the match between different networks, and deviations between the networks, respectively. Outlining these examples, we demonstrate that local exceptionality detection is a flexible approach for compositional analysis in social interaction networks.

2.3. Exceptional Model Mining for Spatio-Temporal Analysis

Exploratory analysis on ubiquitous data needs to handle different heterogeneous and complex data types. In (Atzmueller, 2014; Atzmueller et al., 2015), we present an adaptation of subgroup discovery using exceptional model mining formalizations on ubiquitous social interaction networks. Then, we can detect locally exceptional patterns, e.g., corresponding to bursts or special events in a dynamic network. Furthermore, we propose subgroup discovery and assessment approaches for obtaining interesting descriptive patterns and provide a novel graph-based analysis approach for assessing the relations between the obtained subgroup set. This exploratory visualization approach allows for the comparison of subgroups according to their relations to other subgroups and to include further parameters, e.g., geo-spatial distribution indicators. We present and discuss analysis results utilizing a real-world ubiquitous social media dataset.

3. Conclusions and Outlook

Subgroup discovery and exceptional model mining provide powerful and comprehensive methods for knowledge discovery and exploratory analysis in the context of local exceptionality detection. In this paper, we presented according approaches and methods, specifically targeting social interaction networks, and showed how to implement local exceptionality detection on both a methodological and practical level.

Interesting future directions for local exceptionality detection in social contexts include extended postprocessing, presentation and assessment options, e.g., (Atzmueller et al., 2006; Atzmueller & Puppe, 2008; Atzmueller, 2015). In addition, extensions to predictive modeling, e.g., link prediction (Scholz et al., 2013; Atzmueller, 2014), are interesting options to explore. Furthermore, extending the analysis of sequential data, e.g., based on Markov chains as exceptional models (Atzmueller et al., 2016c; Atzmueller, 2016a; Atzmueller et al., 2017), as well as group and network dynamics (Atzmueller et al., 2014; Kibanov et al., 2014), are further interesting options for future work.
122
Exceptional Model Mining in Ubiquitous and Social Environments
Atzmueller, M., Baumeister, J., & Puppe, F. (2006). Introspective Subgroup Analysis for Interactive Knowledge Refinement. AAAI FLAIRS (pp. 402–407). AAAI Press.

Atzmueller, M., Doerfel, S., & Mitzlaff, F. (2016a). Description-Oriented Community Detection using Exhaustive Subgroup Discovery. Information Sciences, 329, 965–984.

Atzmueller, M., Ernst, A., Krebs, F., Scholz, C., & Stumme, G. (2014). On the Evolution of Social Groups During Coffee Breaks. WWW 2014 (Companion). New York, NY, USA: ACM Press.

Atzmueller, M., & Lemmerich, F. (2013). Exploratory Pattern Mining on Social Media using Geo-References and Social Tagging Information. IJWS, 2.

Atzmueller, M., Mollenhauer, D., & Schmidt, A. (2016b). Big Data Analytics Using Local Exceptionality Detection. In Enterprise Big Data Engineering, Analytics, and Management. IGI Global.

Atzmueller, M., Mueller, J., & Becker, M. (2015). Exploratory Subgroup Analytics on Ubiquitous Data, vol. 8940 of LNAI. Heidelberg, Germany: Springer.

Atzmueller, M., & Puppe, F. (2008). A Case-Based Approach for Characterization and Analysis of Subgroup Patterns. Applied Intelligence, 28, 210–221.

Atzmueller, M., & Roth-Berghofer, T. (2010). The Mining and Analysis Continuum of Explaining Uncovered. AI-2010. London, UK: SGAI.

Kibanov, M., Atzmueller, M., Scholz, C., & Stumme, G. (2014). Temporal Evolution of Contacts and Communities in Networks of Face-to-Face Human Interactions. Sci China Information Sciences, 57.

Klösgen, W. (1996). Explora: A Multipattern and Multistrategy Discovery Assistant. In Advances in Knowledge Discovery and Data Mining. AAAI.

Leman, D., Feelders, A., & Knobbe, A. (2008). Exceptional Model Mining. PKDD (pp. 1–16). Springer.

Mannila, H. (2000). Theoretical Frameworks for Data Mining. SIGKDD Explor., 1, 30–32.

Mitzlaff, F., Atzmueller, M., Benz, D., Hotho, A., & Stumme, G. (2011). Community Assessment using Evidence Networks, vol. 6904 of LNAI. Springer.

Mitzlaff, F., Atzmueller, M., Benz, D., Hotho, A., & Stumme, G. (2013). User-Relatedness and Community Structure in Social Interaction Networks. CoRR/abs, 1309.3888.

Mitzlaff, F., Atzmueller, M., Hotho, A., & Stumme, G. (2014). The Social Distributional Hypothesis. Journal of Social Network Analysis and Mining, 4.

Morik, K. (2002). Detecting Interesting Instances, vol. 2447 of LNCS, 13–23. Springer Berlin Heidelberg.

Scholz, C., Atzmueller, M., Barrat, A., Cattuto, C., & Stumme, G. (2013). New Insights and Methods For Predicting Face-To-Face Contacts. ICWSM. AAAI.

Wrobel, S. (1997). An Algorithm for Multi-Relational Discovery of Subgroups. Proc. PKDD-97 (pp. 78–87). Heidelberg, Germany: Springer.
123
PRIMPing Boolean Matrix Factorization through Proximal
Alternating Linearized Minimization
Keywords: Tiling, Boolean Matrix Factorization, Minimum Description Length principle, Proximal Alternating
Linearized Minimization, Nonconvex-Nonsmooth Minimization
124
126
An expressive similarity measure for relational clustering using
neighbourhood trees
127
An expressive similarity measure for relational clustering using neighbourhood trees
sented by a graph, and the predicate logic or equivalently relational database view, which typically assumes the data to be stored in multiple relations, or in a knowledge base with multiple predicates. Though these are in principle equally expressive, in practice the bias of learning systems differs strongly depending on which view they take. For instance, shortest path distance as a similarity measure is much more common in the graph view than in the relational database view. In the purely logical representation, however, no distinction is made between the constants that identify a domain object, and constants that represent the value of one of its features. Identifiers have no inherent meaning, as opposed to feature values.

In this work, we introduce a new view that combines elements of both. This view essentially starts out from the predicate logic view, but changes the representation to a hypergraph representation. Formally, the data structure that we assume in this paper is a typed, labelled hypergraph H = (V, E, τ, λ) with V being a set of vertices, and E a set of hyperedges; each hyperedge is an ordered set of vertices. The type function τ assigns a type to each vertex and hyperedge. A set of attributes A(t) is associated with each t ∈ TV. The labelling function λ assigns to each vertex a vector of values, one for each attribute of A(τ(v)).

The clustering task we consider is the following: given a vertex type t ∈ TV, partition the vertices of this type into clusters such that vertices in the same cluster tend to be similar, and vertices in different clusters dissimilar, for some subjective notion of similarity. In practice, it is of course not possible to use a subjective notion; one uses a well-defined similarity function, which hopefully in practice approximates well the subjective notion that the user has in mind. To be able to capture several interpretations of relational similarity, such as attribute or neighbourhood similarity, we represent each vertex with a neighbourhood tree - a structure that effectively describes a vertex and its neighbourhood.

2.2. Neighbourhood tree

Consider a vertex v. A neighbourhood tree aims to compactly represent the neighbourhood of the vertex v and all relationships it forms with other vertices, and it is defined as follows. For every hyperedge E in which v participates, add a directed edge from v to each vertex v′ ∈ E. Label each vertex with its attribute vector. Label the edge with the hyperedge type and the position of v in the hyperedge (recall that hyperedges are ordered sets). The vertices thus added are said to be at depth 1. If there are multiple hyperedges connecting vertices v and v′, v′ is added each time it is encountered. Repeat this procedure for each v′ on depth 1. The vertices thus added are at depth 2. Continue this procedure up to some predefined depth d. The root element is never added to the subsequent levels.

2.3. Similarity measure

The main idea behind the proposed dissimilarity measure is to express a wide range of similarity biases that can emerge in relational data, such as attribute or structural similarity. The proposed dissimilarity measure compares two vertices by comparing their neighbourhood trees. It does this by comparing, for each level of the tree, the distribution of vertices, attribute values, and outgoing edge labels observed on that level. Earlier work in relational learning has shown that distributions are a good way of summarizing neighbourhoods (Perlich & Provost, 2006).

The final similarity measure consists of a linear combination of different interpretations of similarity. Concretely, the similarity measure is a composition of components reflecting:

1. attributes of the root vertices,
2. attributes of the neighbouring vertices,
3. proximity of the vertices,
4. identity of the neighbouring vertices,
5. distribution of hyperedge types in a neighbourhood.

Each component is weighted by the corresponding weight wi. These weights allow one to formulate an interpretation of the similarity between relational objects.

2.4. Results

We compared the proposed similarity measure against a wide range of existing relational clustering approaches and graph kernels on five datasets. The proposed similarity measure was used in conjunction with spectral and hierarchical clustering algorithms. We found that, on each separate dataset, our approach performs at least as well as the best competitor, and it is the only approach that achieves good results on all datasets. Furthermore, the results suggest that decoupling different sources of similarity into a linear combination helps to identify relevant information and reduce the effect of noise.
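To make the weighted combination of Section 2.3 concrete, here is a small Python sketch (not the authors' implementation) of how per-component dissimilarities between two vertices' neighbourhood-tree summaries can be combined linearly; the component names, the distance on distributions, and the toy values are illustrative assumptions.

    import numpy as np

    def distribution_distance(p, q):
        # Total variation distance between two (unnormalised) discrete distributions.
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        p, q = p / p.sum(), q / q.sum()
        return 0.5 * np.abs(p - q).sum()

    def dissimilarity(summary_v, summary_w, weights):
        # summary_v / summary_w map a component name (e.g. root attributes, neighbour
        # attributes, edge types per level) to a distribution; weights holds the w_i.
        return sum(w * distribution_distance(summary_v[name], summary_w[name])
                   for name, w in weights.items())

    weights = {"root_attrs": 0.2, "neigh_attrs": 0.3, "edge_types": 0.5}
    v = {"root_attrs": [1, 0], "neigh_attrs": [2, 1, 1], "edge_types": [3, 1]}
    w = {"root_attrs": [0, 1], "neigh_attrs": [1, 1, 2], "edge_types": [1, 3]}
    print(dissimilarity(v, w, weights))

Changing the weights corresponds to choosing a different interpretation of similarity, which is exactly the flexibility the measure is designed to offer.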
128
An expressive similarity measure for relational clustering using neighbourhood trees
Acknowledgements
Research supported by Research Fund KU Leuven
(GOA/13/010)
References
De Raedt, L. (2008). Logical and relational learning.
Cognitive Technologies. Springer.
Dumančić, S., & Blockeel, H. (2017). An expressive
dissimilarity measure for relational clustering using
neighbourhood trees. Machine Learning, To Appear.
129
Complex Networks Track
Extended Abstracts
130
Dynamics Based Features for Graph Classification
131
Dynamics Based Features for Graph Classification - work in progress
Dataset GK Deep GK DF
COLLAB 72.84 ± 0.28 73.09 ± 0.25 73.77 ± 0.22
IMDB-BINARY 65.87 ± 0.98 66.96 ± 0.56 70.32 ± 0.88
IMDB-MULTI 43.89 ± 0.38 44.55 ± 0.52 45.85 ± 1.18
REDDIT-BINARY 77.34 ± 0.18 78.04 ± 0.39 86.09 ± 0.53
REDDIT-MULTI-5K 41.01 ± 0.17 41.27 ± 0.18 51.44 ± 0.55
REDDIT-MULTI-12K 31.82 ± 0.008 32.22 ± 0.10 39.67 ± 0.42
Table 1. Social networks: mean and standard deviation of classification accuracy for the Graphlet Kernel (GK) (Shervashidze et al., 2011), the Deep Graphlet Kernel (Deep GK) (Yanardag & Vishwanathan, 2015), and Dynamic Features (DF, our method).
132
Dynamics Based Features for Graph Classification - work in progress
133
Improving Individual Predictions using Social Networks
Assortativity
Keywords: Belief propagation, assortativity, homophily, social networks, mobile phone metadata.
Abstract

Social networks are known to be assortative with respect to many attributes such as age, weight, wealth, ethnicity and gender. Independently of its origin, this assortativity gives us information about each node given its neighbors. It can thus be used to improve individual predictions in many situations, when data are missing or inaccurate. This work presents a general framework based on probabilistic graphical models to exploit social network structures for improving individual predictions of node attributes. We quantify the assortativity range leading to an accuracy gain. We also show how specific characteristics of the network can improve performances further. For instance, the gender assortativity in mobile phone data changes significantly according to some communication attributes.

1. Introduction

Social networks such as Facebook, Twitter, Google+ and mobile phone networks are nowadays largely studied for predicting and analyzing individual demographics (Traud et al., 2012; Al Zamal et al., 2012; Magno & Weber, 2014). Demographics are indeed a key input for the establishment of economic and social policies, health campaigns, market segmentation, etc. (Magno & Weber, 2014; Frias-Martinez et al., 2010; Sarraute et al., 2014). Especially in developing countries, such statistics are often scarce, as local censuses are costly, rough, time-consuming and hence rarely up-to-date (de Montjoye et al., 2014).

Social networks contain individual information about their users (e.g. generated tweets for Twitter), in addition to graph topology information. The assortativity of social networks is defined as the nodes' tendency to be linked to others which are similar in some sense (Aral et al., 2009). This assortativity with respect to various demographics of their individuals such as gender, age, weight, income level, etc. is well documented in the literature (McPherson et al., 2001; Madan et al., 2010; Wang et al., 2013; Smith et al., 2014; Newman, 2003). This property has been theorized to come either from influences or homophilies or a combination of both. Independently of its cause, this assortativity can be used for individual prediction purposes when some labels are missing or uncertain, e.g. for demographics prediction in large networks. Some methods are currently developed to exploit that assortativity (Al Zamal et al., 2012; Herrera-Yagüe & Zufiria, 2012). However, few studies take the global network structure into account (Sarraute et al., 2014; Dong et al., 2014). Also, to the best of our knowledge, no research quantifies how the performances are related to the assortativity strength. The goal of this work, already published, is to overcome these shortcomings (Mulders et al., 2017).

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

2. Method

We propose a framework based on probabilistic graphical models to exploit a social network structure, and
134
Improving Individual Predictions using Social Networks Assortativity
especially its underlying assortativity, for individual prediction improvement in a general context. The network assortativity is quantified by the assortativity coefficient of the attribute to predict, denoted by r (Newman, 2003). The method can be applied with only the knowledge of the labels of a limited number of pairs of connected users in order to evaluate the assortativity, as well as class probability estimates for each user. These probabilities may for example be obtained by defining synthetic graphs. These simulations permit (1) to prevent overfitting a given network structure, (2) to perform the parameter tuning off-line and (3) to avoid requiring the labeled users to form a connected graph. These simulations also allow to quantify the assortativity range leading to an accuracy gain over an approach ignoring the network topology.

[Figure 1: assortativity coefficient r as a function of thresholds on the number of calls and SMS per edge.]
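For reference, the assortativity coefficient r used throughout can be computed directly from a list of labelled edges; the numpy sketch below follows Newman's definition for a discrete attribute, with toy node labels standing in for gender.

    import numpy as np

    def assortativity_coefficient(edges, labels):
        # Newman (2003): r = (sum_i e_ii - sum_i a_i b_i) / (1 - sum_i a_i b_i),
        # where e is the normalised mixing matrix of edge endpoints by class.
        classes = sorted({labels[u] for u, _ in edges} | {labels[v] for _, v in edges})
        idx = {c: i for i, c in enumerate(classes)}
        e = np.zeros((len(classes), len(classes)))
        for u, v in edges:
            e[idx[labels[u]], idx[labels[v]]] += 1
            e[idx[labels[v]], idx[labels[u]]] += 1   # undirected: count both directions
        e /= e.sum()
        a, b = e.sum(axis=1), e.sum(axis=0)
        return (np.trace(e) - a @ b) / (1 - a @ b)

    edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (1, 4)]
    labels = {1: "M", 2: "M", 3: "M", 4: "F", 5: "F", 6: "F"}
    print(assortativity_coefficient(edges, labels))   # strongly positive for this homophilic toy graph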
3. Mobile phone network

The methodology is validated on mobile phone data to predict gender (M and F resp. for male and female). Since our model exploits the gender homophilies, its performances depend on r. In the worst case of a randomly mixed network, r = 0. Perfect (dis-)assortativity leads to r = (−)1. In our network, r ≈ 0.3, but Fig. 1 shows that it can change according to some communication attributes. The strongest edges (with many texts and/or calls) are more anti-homophilic, allowing to partition the edges into strong and weak parts, respectively disassortative and assortative (r ≈ 0.3 in the weak part, whereas r can reach −0.3 in the strong one while retaining ≈ 1% of the edges). This partition is exploited to improve the predictions by adapting the model parameters in the different parts of the network. Fig. 2 shows the accuracy and recall gains of our method, over simulated initial predictions with varying initial accuracies resulting from sampled class probability estimates. The highest accuracy gains are obtained in the range [70, 85]% of initial accuracy, covering the accuracies reached by state-of-the-art techniques aiming to predict gender using individual-level features (Felbo et al., 2015; Sarraute et al., 2014; Frias-Martinez et al., 2010). These gains overcome the results obtained with Sarraute et al.'s reaction-diffusion algorithm (2014).

Figure 2. Accuracy and recall gains when varying the initial accuracy in a mobile phone network, averaged over 50 random simulations of the first predictions. The filled areas delimit intervals of one standard deviation around the mean gains.

4. Conclusion

This work shows how assortativity can be exploited to improve individual demographics prediction in social networks, using a probabilistic graphical model. The achieved performances are studied on simulated networks as a function of the assortativity and the quality of the initial predictions, both in terms of accuracy and distribution. Indeed, the relevance of the network information compared to individual features depends on (1) the assortativity amplitude and (2) the quality of the prior individual predictions. The graph simulations allow to tune the model parameters. Our method is validated on a mobile phone network and the model is refined to predict gender, exploiting both weak, homophilic and strong, anti-homophilic links.

Acknowledgments

DM and CdB are Research Fellows of the Fonds de la Recherche Scientifique - FNRS.
135
Improving Individual Predictions using Social Networks Assortativity
136
User-Driven Pattern Mining on knowledge graphs: an
Archaeological Case Study
Wilcke, WX [email protected]
Department of Computer Science,
Department of Spatial Economics,
VU University Amsterdam, The Netherlands
de Boer, V [email protected]
van Harmelen, FAH [email protected]
Department of Computer Science,
VU University Amsterdam, The Netherlands
Keywords: Knowledge Graph, Pattern Mining, Hybrid Evaluation, Digital Humanities, Archaeology
137
User-Driven Pattern Mining on knowledge graphs
These subgraphs can be thought of as analogous to the instances in tabular data sets.

Pattern Mining: Our pipeline implements SWARM: a state-of-the-art generalized association rule mining algorithm (Barati et al., 2016). We motivate its selection by the algorithm's ability to exploit semantic background knowledge to generalize rules. In addition, the algorithm is transparent and yields interpretable results, thus fitting the domain requirements (Selhofer & Geser, 2014).

Dimension Reduction: A data-driven evaluation process is used to rate rules on their commonness. Hereto, we have extended the basic support and confidence measures with those tailored to graphs. Rules which are too rare or too common are omitted from the final result, as well as those with omnipresent relations (e.g., type and label). Remaining rules are shown in a simple faceted rule browser, which allows users to interactively customize templates (Klemettinen et al., 1994). For instance, to set acceptable ranges for confidence and support scores, as well as to specify the types of entities allowed in either or both antecedent and consequent.

3. Experiments

Experiments were run on an archaeological subset (±425k facts) of the LOD cloud3, which contains detailed summaries about archaeological excavation projects in the Netherlands. Each summary holds information on 1) the project's organisational structure, 2) people and companies involved, 3) reports made and media created, 4) artefacts discovered together with their context and their (geospatial and stratigraphic) relation, and 5) fine-grained information about various locations and geometries.

Four distinct experiments have been conducted, each one having focussed on a different granularity of the data: A) project level, B) artefact level, C) context level, and D) subcontextual level. These were chosen together with domain experts, who were asked to describe the aspects of the data most interesting to them.

Results and Evaluation

Each experiment yielded more than 35,000 candidate rules. This has been brought down to several thousands using the aforementioned data-driven evaluation process. The remaining rules were then ordered on confidence (first) and support (second).

For each experiment, we selected 10 example rules from the top-50 candidates to create an evaluation set of 40 rules in total. Three domain experts were then asked to evaluate these on both plausibility and relevancy to the archaeological domain. Each rule was accompanied by a transcription in natural language to further improve its interpretability. For instance, a typical rule might state: "For every artefact in the data set holds: if it consists of raw earthenware (Nimeguen), then it dates from early Roman to late Roman times".

The awarded plausibility scores (Table 1) indicate that roughly two-thirds of the rules (0.68) were rated plausible, with experiment D yielding the most by far. Rater 3 was far less positive than raters 1 and 2, and has a strong negative influence on the overall plausibility scores. In contrast, the relevancy scores (Table 2) are in fair agreement with an overall score of 0.42, implying a slight irrelevancy. This can largely be attributed to experiment A, which scored considerably lower than the other experiments.

Table 1. Plausibility values for experiments A through D as provided by three raters.

Experiment   Rater 1   Rater 2   Rater 3   Mean
A            1.00      1.00      0.00      0.67
B            0.80      0.80      0.00      0.53
C            0.80      0.80      0.20      0.60
D            1.00      1.00      0.80      0.93
Mean         0.90      0.90      0.25      0.68

Table 2. Normalized separate and averaged relevancy values (ordinal scale) for experiments A through D as provided by three raters (κ = 0.31).

Experiment   Rater 1        Rater 2        Rater 3        Mean
A            0.13 ± 0.18    0.13 ± 0.18    0.00 ± 0.00    0.09 ± 0.12
B            0.53 ± 0.30    0.53 ± 0.30    0.33 ± 0.47    0.47 ± 0.36
C            0.53 ± 0.30    0.33 ± 0.24    0.67 ± 0.41    0.51 ± 0.32
D            0.60 ± 0.28    0.47 ± 0.18    0.80 ± 0.45    0.62 ± 0.30
Mean         0.45 ± 0.31    0.37 ± 0.26    0.45 ± 0.48    0.42 ± 0.35

4. Conclusion

Our raters were positively surprised by the range of patterns that we were able to discover. Most of these were rated plausible, and some even as highly relevant. Nevertheless, trivialities and tautologies were also frequently encountered. Future research should focus on this by improving the data-driven evaluation step.

3 Available at pakbon-ld.spider.d2s.labs.vu.nl.
138
User-Driven Pattern Mining on knowledge graphs
References
Barati, M., Bai, Q., & Liu, Q. (2016). Swarm: An ap-
proach for mining semantic association rules from
semantic web data, 30–43. Cham: Springer Interna-
tional Publishing.
Freitas, A. A. (1999). On rule interestingness mea-
sures. Knowledge-Based Systems, 12, 309–315.
Hallo, M., Luján-Mora, S., Maté, A., & Trujillo, J.
(2016). Current state of linked data in digital li-
braries. Journal of Information Science, 42, 117–
127.
139
Harvesting the right tweets:
Social media analytics for the Horticulture Industry
140
142
Graph-based semi-supervised learning for complex networks
Abstract

We address the problem of semi-supervised learning in relational networks, networks in which nodes are entities and links are the relationships or interactions between them. Typically this problem is confounded with the problem of graph-based semi-supervised learning (GSSL), because both problems represent the data as a graph and predict the missing class labels of nodes. However, not all graphs are created equally. In GSSL a graph is constructed, often from independent data, based on similarity. As such, edges tend to connect instances with the same class label. Relational networks, however, can be more heterogeneous and edges do not always indicate similarity. In this work (Peel, 2017) we present two scalable approaches for graph-based semi-supervised learning for the more general case of relational networks. We demonstrate these approaches on synthetic and real-world networks that display different link patterns within and between classes. Compared to state-of-the-art baseline approaches, ours give better classification performance and do so without prior knowledge of how classes interact.

Preliminary work. Under review for Benelearn 2017. Do not distribute.

Figure 1. Different patterns of links between class labels {red, black}: (a) nodes with the same label tend to be linked (assortative), (b) links connect nodes with different labels (link-heterogeneity), (c) some nodes are assortative and some are not (class-heterogeneity), (d) missing labels (white) obscure the pattern of links.

In most complex networks, nodes have attributes, or metadata, that describe a particular property of the node. In some cases these attributes are only partially observed for a variety of reasons, e.g. the data is expensive, time-consuming or difficult to accurately collect. In machine learning, classification algorithms are used to predict discrete node attributes (which we refer to as class labels) by learning from a training set of labelled data, i.e. data for which the target attribute values are known. Semi-supervised learning is a classification problem that aims to make use of both the unlabelled data and the labelled data typically used to train supervised models. A common approach is graph-based semi-supervised learning (GSSL) (Belkin & Niyogi, 2004), (Joachims, 2003), (Talukdar & Crammer, 2009), (Zhou et al., 2003), (Zhu et al., 2003), in which (often independent) data are represented as a similarity graph, such that a vertex is a data instance and an edge indicates similarity between two instances. By utilising the graph structure of labelled and unlabelled data, it is possible to accurately classify the unlabelled vertices using a relatively small set of labelled instances.

Here we consider the semi-supervised learning problem in the context of complex networks. These networks consist of nodes representing entities (e.g. people, user accounts, documents) and links representing pairwise dependencies or relationships (e.g. friendships, contacts, references). Here class labels are discrete-valued attributes (e.g. gender, location, topic) that describe the nodes and our task is to predict these labels based only on the network structure and a small subset of nodes already labelled. This problem of classifying
nodes in networks is often treated as a GSSL prob-
143
Graph-based semi-supervised learning for complex networks
144
Contact Patterns, Group Interaction and Dynamics on
Socio-Behavioral Multiplex Networks
Keywords: social network analysis, temporal dynamics, offline social networks, behavioral networks
145
147
Deep Learning Track
Research Papers
148
Modeling brain responses to perceived speech with LSTM networks
149
Modeling brain responses to perceived speech with LSTM networks
processing (Howard et al., 2000; Hickok & Poeppel, 2007; Friederici, 2012; Kubanek et al., 2013).

The collected ECoG data were preprocessed prior to model fitting. Per patient, based on the visual inspection, electrodes with noisy or flat signal were excluded from the dataset. A notch filter at 50 and 100 Hz was used to remove line noise and common average re-referencing was applied. The Gabor wavelet decomposition was used to extract neural responses in the high frequency band (HFB, 60-120 Hz) from the time domain signal. The wavelet decomposition was applied in the HFB range in 1 Hz bins with decreasing window length (4 wavelength full-width at half max). The resulting signal was averaged over the whole range to produce a single HFB neural response per electrode. The resulting neural responses were downsampled to 125 Hz. The preprocessed data were concatenated across patients over the electrode dimension (total number of electrodes = 1283).

Audio features

The soundtrack of the movie contained speech and music fragments. From the soundtrack, we constructed three input feature sets for training the models. First, we extracted the waveform of the movie soundtrack and downsampled it to 16000 Hz. To create the first, time-domain, feature set (time), we reshaped the waveform to the matrix of size N × F1, where N is the number of time points at the SR of the neural responses (125 Hz), and F1 is 128 time features (16000/125). To make the second feature set, we extracted a sound spectrogram at 128 logarithmically spaced bins in range 180-7000 Hz. This resulted in a N × F2 matrix with F2 = 128 features (freq). Finally, the spectrogram was filtered with a bank of 2D Gabor filters to extract spectrotemporal modulation energy features (Chi et al., 2005). The filtering was done at 16 logarithmically spaced bins in range 0.25-40 Hz along the temporal dimension, and 8 logarithmically spaced bins in range 0.25-4 cyc/oct along the frequency dimension. The third feature matrix N × F3 was built by concatenating all spectrotemporal modulation features: 16 × 8, F3 = 128 features (smtm). The spectrogram and the spectrotemporal modulation energy features were obtained using the NSL toolbox (Chi et al., 2005).

Linear encoding model

For each input feature set, a separate kernel linear ridge regression (Murphy, 2012) was trained to predict the neural responses to speech fragments. The HFB neural response of each electrode y_e at time point t was modeled as a linear combination of the input audio features at this time point:

y_e(t) = β_e^⊤ x(t) + ε,  where ε ∼ N(0, σ²).

An L2-penalized least squares loss function was analytically minimized to estimate the regression coefficients β_e. The kernel trick was used to avoid large matrix inversions in the input feature space:

β_e = X^⊤ (X X^⊤ + λ_e I_n)^(−1) y_e,  where n is the number of training time points.

A nested cross-validation was used to estimate the amount of regularization λ_e (Güçlü & van Gerven, 2014). First, a grid of the effective degrees of freedom of the model fit was specified. Then, Newton's method was used to solve the effective degrees of freedom for λ_e. Finally, the λ_e that resulted in the lowest nested cross-validation error was taken as the final estimate.

The model was tested on 5% of all data. A five-fold cross-validation was used to validate the model performance. In each cross-validation fold different speech fragments were selected for testing, so that no data points were shared in test sets across five folds.

Model performance was measured as the Spearman correlation between predicted and observed neural responses in the test set. The correlation values were averaged across five cross-validation folds and were transformed to t-values for determining significance (Kendall & Stuart, 1961).

LSTM encoding models

For each input feature set, six LSTM models (Hochreiter & Schmidhuber, 1997) with varying architectures were trained to predict the neural responses to speech fragments. The six LSTM models were specified using a varying number of hidden layers (one or two) and a varying number of units per hidden layer (20, 50 or 100). A fully-connected linear layer was specified as the output layer. The neural response of each electrode y_e at time point t was modeled as a linear combination of the hidden states h(t). For models with one hidden LSTM layer (1-lstm20, 1-lstm50, 1-lstm100):

y_e(t) = β_e^⊤ h_1(t) + b_e + ε,  where b_e is a bias and ε ∼ N(0, σ²).

For models with two hidden LSTM layers (2-lstm20, 2-lstm50, 2-lstm100):

y_e(t) = β_e^⊤ h_2(t) + b_e + ε
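For concreteness, the closed-form kernelised ridge solution above can be written in a few lines of numpy; the nested cross-validation for λ_e is omitted here and the random arrays merely stand in for the audio feature matrices and HFB responses.

    import numpy as np

    def fit_kernel_ridge(X, y, lam):
        # beta = X^T (X X^T + lam I_n)^{-1} y : inverts an n x n matrix (n = time points)
        # instead of a p x p matrix in the input feature space.
        n = X.shape[0]
        alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
        return X.T @ alpha

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 128))    # placeholder for the time/freq/smtm features
    y = rng.normal(size=200)           # placeholder for one electrode's HFB response
    beta = fit_kernel_ridge(X, y, lam=10.0)
    y_hat = X[:5] @ beta               # predictions for the first five time points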
150
152
Modeling brain responses to perceived speech with LSTM networks
linear ridge regression model in terms of the prediction accuracy and the amount of electrodes the models were successfully fit for. Further work is planned to investigate in detail what factors contribute to the superior performance of the LSTM models, compared to the linear ridge regression model. Some work on exploring the internal representations learned by the LSTM models (cell states) is also planned. Finally, we intend to compare the performance of RNNs with the performance of a convolutional neural network, trained on the wavelet-decomposed audio signal to predict the brain responses.

Acknowledgments

The work was supported by the NWO Gravitation grant 024.001.006.

References

Chi, T., Ru, P., & Shamma, S. A. (2005). Multiresolution spectrotemporal analysis of complex sounds. The Journal of the Acoustical Society of America, 118, 887–906.

Friederici, A. D. (2012). The cortical language circuit: from auditory perception to sentence comprehension. Trends in Cognitive Sciences, 16, 262–268.

Güçlü, U., & van Gerven, M. A. (2014). Unsupervised feature learning improves prediction of human brain activity in response to natural images. PLoS Comput Biol, 10, e1003724.

Güçlü, U., & van Gerven, M. A. (2017). Modeling the dynamics of human brain activity with recurrent neural networks. Frontiers in Computational Neuroscience, 11, 10–3389.

Hagoort, P. (2013). MUC (memory, unification, control) and beyond. Frontiers in Psychology, 4, 416.

Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.

Howard, M. A., Volkov, I., Mirsky, R., Garell, P., Noh, M., Granner, M., Damasio, H., Steinschneider, M., Reale, R., Hind, J., et al. (2000). Auditory cortex on the human posterior superior temporal gyrus. Journal of Comparative Neurology, 416, 79–92.

Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452, 352–355.

Kendall, M. G., & Stuart, A. (1961). The advanced theory of statistics (vol. 2). London: Charles W. Griffin and Co., Ltd, 1959–1963.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Kubanek, J., Brunner, P., Gunduz, A., Poeppel, D., & Schalk, G. (2013). The tracking of speech envelope in the human cortex. PLoS ONE, 8, e53398.

Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.

Naselaris, T., Stansbury, D. E., & Gallant, J. L. (2012). Cortical representation of animate and inanimate objects in complex natural scenes. Journal of Physiology-Paris, 106, 239–249.

Norman-Haignere, S., Kanwisher, N. G., & McDermott, J. H. (2015). Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron, 88, 1281–1296.

Santoro, R., Moerel, M., De Martino, F., Goebel, R., Ugurbil, K., Yacoub, E., & Formisano, E. (2014). Encoding of natural sounds at multiple spectral and temporal resolutions in the human auditory cortex. PLoS Comput Biol, 10, e1003412.

Tokui, S., Oono, K., Hido, S., & Clayton, J. (2015). Chainer: a next-generation open source framework for deep learning. Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).
153
Towards unsupervised signature extraction of forensic logs.
http://wwwis.win.tue.nl/~benelearn2017
Abstract

Log signature extraction is the process of finding a set of templates that generated a set of log messages from the given log messages. This process is an important pre-processing step for log analysis in the context of information forensics because it enables the analysis of event sequences of the examined logs. In earlier work, we have shown that it is possible to extract signatures using recurrent neural networks (RNN) in a supervised manner (Thaler et al., 2017). Given enough labeled data, this supervised approach works well, but obtaining such labeled data is labor intensive.

In this paper, we present an approach to address the signature extraction problem in an unsupervised way. We use an RNN auto-encoder to create an embedding for the log lines and we apply clustering in the embedded space to obtain the signatures.

We experimentally demonstrate on a forensic log that we can assign log lines to their signature cluster with a V-Measure of 0.94 and a Silhouette score of 0.75.

Appearing in Proceedings of Benelearn 2017. Copyright 2017 by the author(s)/owner(s).

1. Introduction

System- and application logs track activities of users and applications on computer systems. Log messages in such logs commonly consist of at least a time stamp and a free text message. The log message's time stamp indicates when an event has happened, and the text message describes what has happened. Log messages contain relevant information about the state of the software or actions that have been performed on a system, which makes them an invaluable source of information for a forensic investigator.

Ideally, forensic investigators should be able to extract information from such logs in an automated fashion. However, extracting information in an automated way is difficult for four reasons. First, the text contents of log messages are not uniformly structured. Second, there are different types of log messages in a system log. Third, the text content may consist of variable and constant parts. The variable parts may have arbitrary values and length. Finally, the types of log messages change with updates of software and operating systems.

One way to enable automated information extraction is by manually creating a parser for these logs. However, writing a comprehensive parser is complex and labor intensive for the same reasons that it is difficult to analyze them automatically. A solution would be to use a learning approach to extract the log signatures automatically. A log signature is the "template" that has been used to create a log message, and extracting log signatures is the task of finding the log signatures given a set of log messages. An example of a log
154
Towards unsupervised signature extraction of forensic logs
signature is 'Initializing cgroup subsys %s', where '%s' acts as a placeholder for a mutable part. This signature can be used to create log lines such as 'Initializing cgroup subsys pid' or 'Initializing cgroup subsys io'.

Currently, log signatures are extracted in different ways. First, there are rule-based approaches. In rule-based approaches, signatures are manually defined, for example by using regular expressions. Rule-based approaches tend to work well when applied on logs with a limited amount of signatures. Second, there are algorithmic approaches, which use custom algorithms to extract signatures from logs. These algorithms are commonly tailored to specific types of logs. Finally, in previous work, we showed that supervised RNNs can also be used to derive log signatures from forensic logs (Thaler et al., 2017).

Our work is inspired by recent advances in modeling natural language using neural networks (Le & Mikolov, 2014; Cho et al., 2014; Johnson et al., 2016). Since log lines are partially natural language, we assume that neural language models will also capture the inherent structure of log lines well.

Figure 1. We first embed log lines using an RNN auto-encoder. We then cluster the embedded log lines to obtain the signatures.

Here, we propose an approach for addressing the signature extraction problem using attentive RNN auto-encoders. Figure 1 sketches our idea. The "encoder" transforms a log line into a dense vector representation, and the "decoder" attempts to reconstruct the input log line in the reverse order. Log lines that belong to the same signature are embedded close to each other in the vector space. We then cluster the learned representations and use cluster centroids as signature descriptions.

The main contributions of this paper are:

- We present an approach for addressing the problem of extracting signatures from forensic logs. We are learning representations of log lines using an attentive recurrent auto-encoder. We detail this idea in Section 2.

- We provide first empirical evidence that this approach yields results competitive with state-of-the-art signature extraction approaches. We detail our experiments in Section 3 and discuss the results in Section 4.

2. Signature extraction using attentive RNN auto-encoders

The main idea of our approach consists of two phases. In the first phase, we train an RNN auto-encoder to learn a representation of our log lines. To achieve this, we treat each log line as a sequence of words. This sequence serves as input to an RNN encoder, which encodes this sequence to a fixed, multi-dimensional vector. Based on this vector, the RNN decoder tries to reconstruct the reverse sequence of words. We detail this model in Section 2. In the second phase, we cluster the encoded log lines based on their Euclidean distance to each other. We use the centroids of the clusters as signature descriptions. We base this approach on the assumption that similar log lines are encoded close together in the embedding space. Intuitively, this assumption can be explained as follows. We let the model learn to reconstruct a log sequence from a fixed-size vector, which it previously encoded. Encoding a log line to a fixed-size vector is a sparsity constraint, which encourages the model to encode the log lines in a distributed, dense manner. The more features such encoded log lines share, the closer to each other they will be in Euclidean space.

2.1. Model

Our model is based on the attentive RNN encoder-decoder architecture that was introduced by Bahdanau et al. (Dzmitry Bahdana et al., 2014) to address neural machine translation. We depict the schematic architecture in Figure 2. This model consists of three parts: an RNN encoder, an alignment model, and an RNN decoder.

We feed our model a sequence of n word ids w0 . . . wn. To retrieve the input word vectors for the RNN encoder we map each word to a unique vector xi. This vector is represented by a row in a word embedding matrix W of size v × d, and the row is indexed by the position of the word in the vocabulary. v is the number of words in the vocabulary and d is a hyperparameter and represents the dimension of the embedding.

For a sequence of input word vectors x0 . . . xi the RNN encoder outputs a sequence of output vectors y0 . . . yn
155
Towards unsupervised signature extraction of forensic logs
for input and a vector h that represents the encoded log line. h is the last hidden state of the RNN network.

The alignment model, also called attention mechanism, learns to weight the importance of the encoder's outputs for each decoding step. The output of the attention mechanism is a context vector ci that represents the weighted sum of the encoding outputs. This context vector is calculated for each decoding step. The alignment model increases the reconstruction quality of the decoded sequences.

The decoder's task is to predict the reversed input word sequence. It predicts the words for each time step, using the information of the encoded vector h and the context vector ci.

Figure 2. Architecture of our model. We use an attentive RNN auto-encoder to encode log lines.

Instead of using our model for translation, we use it to predict the reverse sequence of our input sequence. We do so because we want the model to learn an embedding space for our log lines and not to translate sentences. Also, in contrast to (Dzmitry Bahdana et al., 2014), we use a single- instead of a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) for encoding our input sequences.

2.2. Learning objective

The learning objective of our problem is, given a sequence of words, to correctly predict the reverse sequence, word by word. We calculate the loss of a mispredicted word by using sampled softmax loss (Jean et al., 2014). Sampled softmax loss approximates the categorical cross-entropy loss between the embedded target word and the predicted word. We motivate our choice for using sampled softmax mainly because we assume potentially very large vocabularies in large log files.

The learning objective "forces" the model to learn which information is important for reconstructing a log line. In other words, we learn a lossy compression function of our log lines.

2.3. Optimization procedure

To train our model, we use Adam, which is a form of stochastic gradient descent (Kingma & Ba, 2014). During training, we use dropout to prevent overfitting (Srivastava et al., 2014), and gradient clipping to prevent exploding gradients (Pascanu et al., 2013).

3. Experiment setup

For our experiment, we trained the model that we introduced in Section 2. The RNN encoder, the RNN decoder, and the alignment model have 128 units each. The gradients are calculated in mini-batches of 10 log lines and gradients are clipped at 0.5. We trained each model with a learning rate of 0.001 and a dropout rate of 0.3. We drew 500 samples for our sampled softmax learning objective. We determined the hyperparameters empirically using a random search strategy.

We pad input- and output sequences that are of a different length with zeros at the end of the sequences. Additionally, we add a special token that marks the beginning and the end of a sequence of words.

We then hierarchically cluster the encoded log lines by using the Farthest Point Algorithm. We use the Euclidean distance as a distance metric and a clustering threshold of 0.50, which we empirically determined.
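The paper does not prescribe a particular implementation of this clustering step; the following SciPy sketch shows one way to realise farthest-point (complete-linkage) clustering of the embedded log lines with a fixed Euclidean distance threshold, using random vectors as placeholders for the encoder outputs.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    embeddings = np.random.default_rng(0).normal(size=(100, 128))  # one row per encoded log line

    # Complete linkage corresponds to the farthest-point criterion; the tree is cut
    # at a fixed distance threshold to obtain the signature clusters.
    Z = linkage(embeddings, method="complete", metric="euclidean")
    cluster_ids = fcluster(Z, t=0.50, criterion="distance")

    # One candidate signature descriptor per cluster: the centroid of its members.
    centroids = {c: embeddings[cluster_ids == c].mean(axis=0) for c in np.unique(cluster_ids)}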
We compare our approach to two state-of-the-art signature extraction algorithms, LogCluster (Vaarandi & Pihelgas, 2015) and IPLoM (Makanju et al., 2012). We chose these two approaches for two reasons. First, they scored high in a side-by-side comparison (He et al., 2016). Second, both approaches are capable of finding log clusters when the number of log clusters is not specified upfront. We used the author's implementation of LogCluster for our experiments1, and the implementation provided by He et al. (He et al., 2016)2. IPLoM supports four hyperparameters, a Cluster goodness threshold, a partition support threshold and an upper and lower bound for clustering. The best performing hyperparameters of IPLoM were a Cluster goodness

1 http://ristov.github.io/logcluster/logcluster-0.08.tar.gz
2 https://github.com/cuhk-cse/logparser/tree/20882dabb01aa6e1e7241d4ff239121ec978a2fe
Towards unsupervised signature extraction of forensic logs
To evaluate the quality of the extracted signatures, we evaluate the clusters that we have found. We use two metrics to evaluate the quality of our clusters, the Silhouette score and the V-Measure.

The Silhouette score measures the relative quality of clusters (Rousseeuw, 1987) by calculating and relating intra- and inter-cluster distances. The Silhouette score ranges from -1.0 to 1.0. A score of -1.0 indicates many overlapping clusters and a score of 1.0 means perfect clusters.

The V-Measure is a harmonized mean between the homogeneity and completeness of clusters and captures a clustering solution's success in including all and only data-points from a given class in a given cluster (Rosenberg & Hirschberg, 2007). The V-Measure addresses the clustering "matching" problem and is in this context more appropriate than the F1-score.

3.2. Datasets

To test our idea, we created a forensic log. We extracted this forensic log from a Linux system disk image using log2timeline3. Log2timeline is a forensic tool that extracts event information from storage media and combines this information in a single file. This log dataset consists of 11023 log messages and 856 signatures. We manually extracted the signatures of these logs and verified their correctness using the Linux source code. The vocabulary size, i.e. the number of unique words, of our dataset is 4282 and the maximum log message length is 135. Due to the nature of forensic logs, we assume that the number of unique words will grow in larger logs.

3 https://github.com/log2timeline/plaso/wiki/Using-log2timeline

4. Results and discussion

In this section, we present the results of our experiments and discuss our findings. Table 1 summarizes

LogCluster creates log line clusters that have a homogeneity of 1.00 and completeness of 0.777, which yields a V-measure of 0.881. IPLoM creates log line clusters that have a homogeneity of 0.761 and a completeness of 0.898, which yields a V-measure of 0.824. Our approach creates log line clusters that have a homogeneity of 0.990 and a completeness of 0.905, which yields a V-measure of 0.944. Additionally, the clusters formed by our approach have a Silhouette score of 0.749.

None of the tested approaches manages to find all signatures of our forensic log perfectly. In contrast to that, in He et al.'s evaluation (He et al., 2016), both LogCluster and IPLoM achieve a perfect F1 score. We explain this difference by the fact that our dataset is more difficult to analyze than the datasets presented in He et al.'s evaluation. The most difficult dataset had 376 signatures, but 4.7 million log lines, whereas our dataset has only 11023 log lines and 856 signatures. Both IPLoM and LogCluster have been designed for the first case with few signatures and many log lines per signature.

Our approach creates almost homogeneous clusters. We illustrate the clusters that are found in Figure 3. Figure 3 shows a sample of 15 log lines and how they are hierarchically clustered together. When two log lines are identical, they have a distance close to zero, which means they are embedded almost on the same spot. If they are related, they are closer to each other than other log lines. Our approach makes very few mistakes of grouping together log lines that do not belong together. Instead, the clusters are incomplete, which means that some clusters should be grouped together but are not.

One drawback of our approach is the increased computing requirements. IPLoM processes our forensic log on average in under a second, LogCluster needs about 23 seconds to process it, whereas training our model for the presented task needs 715 seconds. These increased performance requirements may become a problem on larger datasets. As with the state-of-the-art
157
Towards unsupervised signature extraction of forensic logs
Figure 3. Cluster dendrogram of 15 log lines. The y-axis displays the log lines and x-axis the distance to each other.
algorithms, the threshold to separate the clusters has to be manually determined. Our goal is to extract human-understandable log signatures. Currently, we only obtain the centroids of embedded clusters, which allow us to cluster log lines according to their signature. However, these centroids cannot effectively be interpreted by humans.

5. Related Work

Log signature extraction has been studied to achieve a variety of goals such as anomaly and fault detection in logs (Vaarandi, 2003; Fu et al., 2009), pattern detection (Vaarandi, 2003; Makanju et al., 2009; Aharon et al., 2009), profile building (Vaarandi & Pihelgas, 2015), forensic analysis (Thaler et al., 2017) or compression of logs (Tang et al., 2011; Makanju et al., 2009). Most of these approaches motivated their signature extraction approach by the large and rapidly increasing volume of log data that needed to be analyzed (Vaarandi, 2003; Makanju et al., 2009; Fu et al., 2009; Aharon et al., 2009; Tang et al., 2011; Vaarandi & Pihelgas, 2015).

Many NLP-related problems have been addressed using neural networks. Collobert et al. were one of the first to successfully apply neural models to a broad variety of NLP-related tasks (Weston & Karlen, 2011). Their approach has been followed by other neural models for similar tasks, e.g. (Dyer et al., 2015; Lample et al., 2016). Also, a variety of language modeling tasks have been tackled using neural architectures, e.g. (Dzmitry Bahdana et al., 2014; Cho et al., 2014; Sutskever et al., 2014).

Auto-encoders have been successfully applied to clustering tasks. For example, auto-encoders have been used to cluster text and images (Xie et al., 2015), and variational recurrent auto-encoders have been used to cluster music snippets (Fabius & van Amersfoort, 2014).

6. Conclusions and future work

We have presented an approach that uses attentive RNN auto-encoder models to address the problem of extracting signatures from forensic logs. We use the auto-encoder to learn a representation of our forensic logs, and cluster the embedded logs to retrieve our signatures. This approach finds signature clusters in our forensic log dataset with a V-Measure of 0.94 and a Silhouette score of 0.75. These results are comparable to the state-of-the-art approaches.

We plan to extend our work in several ways. So far we have only clustered log lines. To complete our objective, we also need a method for extracting human-readable descriptions of these signature clusters. We plan to use the outputs of the attention network to aid the extraction of log signatures from the clustered log lines. Furthermore, we intend to explore regularization techniques that help improve the quality of the extracted signatures. Finally, we intend to demonstrate the feasibility and competitiveness of our approach on large datasets and datasets with fewer signatures.

Acknowledgments

This work has been partially funded by the Dutch national program COMMIT under the Big Data Veracity project.
158
160
Deep Learning Track
Extended Abstracts
161
Improving Variational Auto-Encoders using convex combination
linear Inverse Autoregressive Flow
162
Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow
163
Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow
reveal that the proposed flow outperforms all volume-preserving flows and performs similarly to the linear normalizing flow with a large number of transformations. The advantage of using several matrices instead of one is especially apparent on the Histopathology data, where the VAE+ccLinIAF performed better by about 15 nats than the VAE+LinIAF. Hence, the convex combination of the lower-triangular matrices with ones on the diagonal seems to allow to better reflect the data with small additional computational burden.

Acknowledgments

The research conducted by Jakub M. Tomczak was funded by the European Commission within the Marie Skłodowska-Curie Individual Fellowship (Grant No. 702666, "Deep learning and Bayesian inference for medical imaging").
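For readers unfamiliar with the construction, the core transformation can be sketched in a few lines of numpy; in the actual model the matrices and the combination weights are produced by the encoder network, so everything below is an illustrative stand-in rather than the paper's implementation.

    import numpy as np

    def cc_lin_iaf_step(z, Ls, logits):
        # Ls: K lower-triangular matrices with ones on the diagonal; logits: K unnormalised weights.
        w = np.exp(logits - logits.max())
        w /= w.sum()                         # softmax -> convex combination weights
        L = np.tensordot(w, Ls, axes=1)      # sum_k w_k L_k is again unit lower-triangular
        # det(L) = 1, so this step contributes no log-det-Jacobian term to the bound.
        return L @ z

    d, K = 4, 3
    rng = np.random.default_rng(0)
    Ls = np.stack([np.tril(rng.normal(size=(d, d)), k=-1) + np.eye(d) for _ in range(K)])
    z_new = cc_lin_iaf_step(rng.normal(size=d), Ls, rng.normal(size=K))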
The use of shallow convolutional neural networks in predicting promoter strength in Escherichia coli
Abstract

Gene expression is an important factor in many processes of synthetic biology. The use of well-characterized promoter libraries makes it possible to obtain reliable estimates of the transcription rates in genetic circuits. Yet, the relation between promoter sequence and transcription rate is largely undiscovered. Through the use of shallow convolutional neural networks, we were able to create models with good predictive power for promoter strength in E. coli.

Figure 1. Basic layout of the first model. The sequence, transformed into a 4 × 50 binary image, is evaluated by the first convolutional layer, outputting a vector of scores for every 4 × l kernel (m). Rectified outputs are maximum pooled and fed into the third layer. The sigmoid transform of the softmax layer results in a probability score for each class.

1. Introduction

The binding region of the transcription unit, called the promoter region, is known to play a key role in the transcription rate of downstream genes. In Eubacteria, the sigma factor (σ) binds with the RNA polymerase subunit (αββ′ω) to create RNA polymerase. Being part of the RNA polymerase holoenzyme, the sigma element acts as the connection between RNA polymerase and DNA (Gruber & Gross, 2003). Prokaryotic organisms such as E. coli are indispensable in research and the biotechnological industry. As such, multiple studies have investigated creating predictive models for promoter strength (De Mey, 2007; Meng et al., 2017). As of now, existing models are trained using small in-house promoter libraries, without evaluation on external public datasets.

Following the recent success of artificial neural networks (Alipanahi et al., 2015), inspiration was taken to create specialized models for promoter strength. Due to the low number of available promoter libraries, models were instead trained to predict the presence of a promoter region within a DNA sequence, using the more abundant ChIP-chip data. To evaluate whether the model gives a good indication of promoter strength, the predicted score of the model was compared with the given promoter strength scores of existing promoter libraries. The use of several custom model architectures is considered.
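As a rough illustration of the basic layout in Figure 1, a shallow CNN of this form might look like the following sketch. The number of kernels, the classifier head, and the toy input are assumptions for illustration, not the authors' configuration.

    import torch
    import torch.nn as nn

    class ShallowPromoterCNN(nn.Module):
        """Sketch in the spirit of M1: a 4 x 50 one-hot sequence image, one
        convolutional layer with 4 x l kernels, global max pooling, and a
        small classifier producing a probability per class."""

        def __init__(self, kernel_length=25, n_kernels=16):
            super().__init__()
            self.conv = nn.Conv2d(1, n_kernels, kernel_size=(4, kernel_length))
            self.relu = nn.ReLU()
            self.pool = nn.AdaptiveMaxPool2d((1, 1))   # max over all positions
            self.head = nn.Linear(n_kernels, 2)        # promoter / non-promoter

        def forward(self, x):                          # x: (batch, 1, 4, 50)
            h = self.pool(self.relu(self.conv(x))).flatten(1)
            return torch.softmax(self.head(h), dim=1)  # class probabilities

    # One-hot encode a toy sequence of length 50 into a 4 x 50 "binary image"
    alphabet = "ACGT"
    seq = ("ACGT" * 13)[:50]
    onehot = torch.zeros(1, 1, 4, 50)
    for pos, base in enumerate(seq):
        onehot[0, 0, alphabet.index(base), pos] = 1.0

    model = ShallowPromoterCNN(kernel_length=25)
    print(model(onehot))   # two class probabilities summing to 1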
2. Results

ChIP-chip data of the σ70 transcription factor was used from cells in the exponential phase (Cho et al., 2014).
Table 1. Performance measures of the models on the test set (AUC) and external datasets (Spearman's rank correlation). Given values are the averaged scores of the repeated experiments; the standard deviation is given within brackets.

          kernel   test set       Anderson      Brewster (2012)  Davis (2011)  Mutalik^a (2013)  Mutalik^b (2013)
          size     38984 seq.     19 prom.      18 prom.         10 prom.      118 prom.         137 prom.
    M1    4 × 25   0.79 (0.02)    0.15 (0.19)   0.81 (0.09)      0.74 (0.12)   0.40 (0.07)       0.22 (0.07)
          4 × 10   0.79 (0.02)    0.25 (0.16)   0.81 (0.11)      0.77 (0.08)   0.45 (0.04)       0.23 (0.04)
    M2    4 × 25   0.78 (0.02)    0.20 (0.12)   0.74 (0.11)      0.81 (0.14)   0.50 (0.08)       0.16 (0.05)
          4 × 10   0.77 (0.02)   -0.16 (0.14)   0.78 (0.07)      0.68 (0.10)   0.41 (0.06)       0.12 (0.07)
    M3    4 × 25   0.79 (0.02)    0.38 (0.14)   0.82 (0.10)      0.84 (0.08)   0.53 (0.07)       0.25 (0.10)
          4 × 10   0.76 (0.01)    0.70 (0.14)   0.83 (0.06)      0.84 (0.08)   0.47 (0.08)       0.41 (0.15)

    ^a part of promoter library with variety within the -35 and -10 box
    ^b part of promoter library with variation over the whole sequence
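The external-dataset columns in Table 1 report Spearman rank correlations between the model's probability scores and the published promoter strengths of each library. Such a correlation can be computed as in the following sketch; the numbers are made up, for illustration only.

    from scipy.stats import spearmanr

    # Hypothetical example: model probability scores for a small promoter
    # library and the corresponding measured promoter strengths.
    model_scores       = [0.91, 0.40, 0.75, 0.12, 0.66]
    measured_strengths = [1500, 300, 900, 80, 1100]

    rho, p_value = spearmanr(model_scores, measured_strengths)
    print(rho)   # agreement between the predicted and measured rankings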
ChIP-chip experiments give an affinity measure between the RNA polymerase holoenzyme and DNA, pinpointing possible promoter sites. Due to the high noise of ChIP-chip data, a classification approach was taken to build the model. The convolutional neural network is fed binary images (4 × 50) of the sequences, following the work of Alipanahi (2015). Promoter sequences within the dataset are determined using both a transcription start site mapping (Cho et al., 2014) and a selection of the highest peaks.

Multiple architectures have been considered, with small changes applied to the basic model (M1) given in Figure 1. The general model architecture is largely based upon the work of Alipanahi (2015). To retain positional data of high-scoring subsequences, a second model (M2) uses the pooled outputs of subregions of the sequence, based upon the length (l) of the kernel (motif). Thus, a motif with length l = 10 creates 50/l = 5 outputs for every kernel in the convolutional layer. A further adjustment first splits the sequence into subsequences according to the motif length, which reduces the number of outputs created in the first layer of the previous model. When training 4 × 10 kernels, five subsequences are created, and motifs are trained uniquely on one of the parts. This third model (M3) retains positional information while having the same complexity as M1, albeit at a cost of flexibility.

To get insight into the performance of the models, the use of long (4 × 25) and short (4 × 10) motifs has been evaluated. Model training was repeated 50 times to account for model variety. In this study we trained models for binary classification, predicting the existence of a promoter region within a given DNA sequence. To verify whether these models can furthermore give reliable predictions on promoter strength, following the idea that stronger promoters are more likely to be recognized, promoter libraries have been ranked on the probability scores of the model. The Spearman's rank correlation coefficient is used as a measure of similarity of ranking between the probability scores and the given scores within each promoter library. Table 1 gives an overview of the performance on the test set and the external datasets.

3. Discussion

We found that the introduction of the proposed model architectures shows significant improvement in ranking known promoter libraries by promoter strength. The results furthermore show that retaining positional data can offer non-trivial boosts for smaller kernel sizes. Yet, the M2 results show that these advantages are outweighed for smaller kernels, where an increased model complexity reduces overall scores. For longer motifs, M2 still offers a boost in performance, as the increase in features compared to M1 is small. M3, gaining positional information of the motifs without the cost of any additional complexity, shows the best results for each dataset using both short and long motifs. The use of small kernels, with the exception of M2, generally offers better scores. The performance of the model in identifying promoter regions on the test set shows little variation, with AUC scores reaching 0.80. Further optimization of both the model architecture and the selection of hyperparameters may further increase model performance.

4. Conclusion

A comprehensive tool for promoter strength prediction, in line with the creation of the ribosome binding site calculator (Salis, 2011), is highly anticipated in the research community. This study shows the potential of using an indirect approach in creating predictive models for promoter strength.
References
Alipanahi, B., Delong, A., Weirauch, M. T., & Frey,
B. J. (2015). Predicting the sequence specificities of
DNA- and RNA-binding proteins by deep learning.
Nature Biotechnol, 33, 831–838.
Brewster, R. C., Jones, D. L., & Phillips, R. (2012).
Tuning Promoter Strength through RNA Poly-
merase Binding Site Design in Escherichia coli. PLoS
Computational Biology, 8.
Cho, B.-K., Kim, D., Knight, E. M., Zengler, K., &
Palsson, B. O. (2014). Genome-scale reconstruc-
tion of the sigma factor network in Escherichia coli:
topology and functional states. BMC biology, 12, 4.
Davis, J. H., Rubin, A. J., & Sauer, R. T. (2011).
Design, construction and characterization of a set
of insulated bacterial promoters. Nucleic Acids Re-
search, 39, 1131–1141.
Normalisation for painting colourisation
[Figure: the colourisation network, a stack of convolutional layers followed by upsampling layers, mapping a greyscale input to a colour output.]
The painting colourisation performance of my CNN using different normalisation methods is evaluated on a subset of the "Painters by Numbers" dataset as published on Kaggle². We select the painters who have at least 5 artworks in the dataset, which results in a dataset consisting of 101,580 photographic reproductions of artworks produced by a total of 1,678 painters. All images were rescaled such that the shortest side was 256 pixels, and subsequently a 224 × 224 crop was extracted for analysis. Table 1 shows the quantitative painting colourisation results. Example colourisations are shown in Figure 2.

¹ https://github.com/nanne/conditional-colour
² https://www.kaggle.com/c/painter-by-numbers

References

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML (pp. 448–456). JMLR.
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2016). Image-to-image translation with conditional adversarial networks. arXiv preprint.
LeCun, Y., Bottou, L., Orr, G., & Müller, K. (2012). Efficient backprop. Neural Networks: Tricks of the Trade, 9–48.
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint.
Predictive Business Process Monitoring with LSTMs
Keywords: deep learning, recurrent neural networks, process mining, business process monitoring
Table 1. Experimental results for the Helpdesk and BPI'12 W logs.
Activity prediction baselines: Evermann 0.623, Breuker 0.719.

Table 1 shows the MAE of the predicted timestamp of the next event and the accuracy of the predicted activity on two data sets. It shows that LSTMs outperform the baseline techniques, and that architectures with shared layers outperform architectures without shared layers.

3. Suffix Prediction

By repeatedly predicting the next activity, using the method described in Section 2, the trace can be predicted completely until its end. The most recent method to predict an arbitrary number of events ahead is (Polato et al., 2016), which extracts a transition system from the log and then learns a machine learning model for each transition system state. Levenshtein similarity is a frequently used measure based on the number of edit operations needed to transform one string into another. In business processes, activities are frequently performed in parallel, leading to some events in the trace being arbitrarily ordered; therefore, we consider it only a minor mistake when two events are predicted in the wrong order.

[Figure: MAE in days as a function of prefix size, comparing the LSTM approach with the van Dongen, set, sequence, and bag baselines.]

5. Conclusions

The foremost contribution of this paper is a technique to predict the next activity of a running case and its timestamp using LSTM neural networks. We showed that this technique outperforms existing baselines on real-life data sets. Additionally, we found that predicting the next activity and its timestamp via a single model (multi-task learning) yields a higher accuracy than predicting them using separate models. We then showed that this basic technique can be generalized to address two other predictive process monitoring problems: predicting the entire continuation of a running case and predicting the remaining cycle time.
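A multi-task architecture along the lines described above (shared recurrent layers feeding an activity head and a timestamp head) could be sketched as follows. The layer sizes, the feature encoding, and the loss weighting are assumptions, not the configuration used in the paper.

    import torch
    import torch.nn as nn

    class SharedLSTMPredictor(nn.Module):
        """Shared LSTM over the event prefix, with one head for the next
        activity and one head for the time until the next event."""

        def __init__(self, n_activities, n_features, hidden=100):
            super().__init__()
            self.shared = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
            self.activity_head = nn.Linear(hidden, n_activities)  # next-activity logits
            self.time_head = nn.Linear(hidden, 1)                 # time until next event

        def forward(self, prefix):               # prefix: (batch, prefix_len, n_features)
            out, _ = self.shared(prefix)
            last = out[:, -1, :]                 # representation of the running case
            return self.activity_head(last), self.time_head(last)

    model = SharedLSTMPredictor(n_activities=10, n_features=14)
    prefix = torch.randn(32, 5, 14)              # batch of 32 prefixes of length 5
    act_logits, time_pred = model(prefix)
    # Joint multi-task loss: cross-entropy for the activity, MAE for the timestamp
    loss = nn.CrossEntropyLoss()(act_logits, torch.randint(0, 10, (32,))) \
         + nn.L1Loss()(time_pred.squeeze(1), torch.rand(32))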
Big IoT data mining for real-time energy disaggregation in buildings
(extended abstract)
Keywords: deep learning, factored four-way conditional restricted Boltzmann machines, energy disaggregation,
energy prediction
dotted blue area has to be fixed (i.e., the present and history layers), and the model is left to infer the values of the label neurons. To perform prediction, the value of each neuron from the dotted red area has to be fixed (i.e., the label and history layers), and the model is left to infer the values of the present neurons.

We assessed our proposed framework on the Reference Energy Disaggregation Dataset (REDD) (Kolter & Johnson, 2011). The results presented in Tables 1 and 2 show that both models performed very well, obtaining a minimum prediction error on the power consumption of 1.85% and a maximum error of 9.36%, while for the time-of-use prediction the minimum error reached was 1.77% in the case of the electric heater and the maximum error obtained was 8.79% for the refrigerator.

3. Conclusion

In this paper, we proposed a novel IoT framework to perform flexibility identification and prediction simultaneously and in real time, by making use of Factored Four Way Conditional Restricted Boltzmann Machines and their Disjunctive version. The experimental validation performed on a real-world database shows that both models perform very well, reaching performance similar to state-of-the-art models on flexibility identification, while having the advantage of also being capable of performing flexibility prediction.

Acknowledgments

This research has been partly funded by the European Union's Horizon 2020 project INTER-IoT (grant number 687283), and by the NL Enterprise Agency under the TKI SG-BEMS project of the Dutch Top Sector.

References

Kolter, J. Z., & Johnson, M. J. (2011). REDD: A public data set for energy disaggregation research. SustKDD Workshop on Data Mining Applications in Sustainability. San Diego, California, USA.
Mocanu, D. C., Ammar, H. B., Lowet, D., Driessens, K., Liotta, A., Weiss, G., & Tuyls, K. (2015). Factored four way conditional restricted Boltzmann machines for activity recognition. Pattern Recognition Letters, 66, 100–108.
Mocanu, D. C., Ammar, H. B., Puig, L., Eaton, E., & Liotta, A. (2017). Estimating 3D trajectories from 2D projections via disjunctive factored four-way conditional restricted Boltzmann machines. Pattern Recognition.
Mocanu, D. C., Mocanu, E., Nguyen, P. H., Gibescu, M., & Liotta, A. (2016). Big IoT data mining for real-time energy disaggregation in buildings. 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 003765–003769).
Mocanu, E., Mocanu, D. C., Ammar, H. B., Zivkovic, Z., Liotta, A., & Smirnov, E. (2014). Inexpensive user tracking using Boltzmann machines. 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 1–6).
Zeifman, M., & Roth, K. (2011). Nonintrusive appliance load monitoring: Review and outlook. IEEE Transactions on Consumer Electronics, 57, 76–84.
Industry Track
Research Papers
Comparison of Syntactic Parsers on Biomedical Texts
http://wwwen.uni.lu/lcsb
rectly identify dependencies between sentence constituents.

• The Recursive Neural Networks (RNN) parser (Socher et al., 2011) works in two steps: first it uses the parses of the PCFG parser to train; then recursive neural networks are trained with semantic word vectors and used to score parse trees. In this way syntactic structure and lexical information are jointly exploited.

• BLLIP is a self-trained parser based on a probabilistic generative model with maximum-entropy discriminative reranking (McClosky & Charniak, 2008). In the self-training approach, an existing parser parses unseen data and uses this newly labeled data, in combination with the actual labeled data, to train a second parser. BLLIP exploits this technique to self-train a parser, initially trained on the standard Penn Treebank corpus, using unlabeled biomedical abstracts.

structure, and thus it makes sense to talk about their depth, i.e. the number of levels in such a graph. Let us denote by Depth(G) the depth of the sentence parse graph, and by B_i the number of nodes (tokens) on the i-th level of this graph, i.e. its breadth. Then one possible sentence complexity score can be calculated as follows:

    Score(G) = \sum_{i=1}^{Depth(G)} B_i \cdot i.

The more tokens a sentence has, and the lower (deeper) in the graph they appear, the higher this score. In Fig. 1 we show how sentence complexity scores are distributed in the corpus. Fig. 2
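This score can be computed directly from the per-level token counts of a parse graph. The sketch below assumes the tokens have already been grouped into levels by the parser output; the example levels are hypothetical.

    def sentence_complexity(levels):
        """Compute Score(G) = sum_i B_i * i, where levels[i-1] holds the
        tokens at depth i of the sentence parse graph."""
        return sum(len(tokens) * depth for depth, tokens in enumerate(levels, start=1))

    # Hypothetical parse graph: 1 token at level 1, 3 at level 2, 2 at level 3
    print(sentence_complexity([["is"], ["factor", "degradation", "."], ["one", "major"]]))
    # 1*1 + 3*2 + 2*3 = 13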
texts, we used two gold-standard in-domain corpora. One was released in the framework of the Genia project (Kim et al., 2003). It consists of 2000 article abstracts collected with the keyword search "Human, Blood Cells and Transcription Factors". The other corpus is known as the "Colorado Richly Annotated Full Text Corpus" (CRAFT) (Verspoor et al., 2012). It contains 67 full-text articles in a wide variety of biomedical sub-domains. For the evaluation we used about 1000 sentences from each of these corpora. Our choice of test set size was based on the Genia corpus division into train, development and test sets distributed by D. McClosky (McClosky & Charniak, 2008). For the Genia corpus we used his division, and we created our own for the CRAFT corpus.

Figure 4. BLLIP parse.

Figure 5. Stanford Factored parse.

3.2. Evaluation and Discussion

Table 1 presents the most important results of each parser. For the overall performance assessment, we adopted the evaluation criteria established by the Parser Evaluation challenge (Black et al., 1992), PARSEVAL. These include the accuracy of part-of-speech tagging, the unlabeled attachment score (UAS), which accounts for the correspondence between each node and its parent in the test and gold-standard parses, and the labeled attachment score (LAS), which, in addition to the parent node correspondence, checks whether the syntactic relation between two nodes (the label on the edge) is the same in the test and gold sets. Given the nature of these two measurements, it is not surprising that LAS is systematically lower than UAS for all the parsers listed in Table 1. However, the accuracy of the labeled attachment predefines the extent to which semantic relations between the concepts represented by the nodes would be correctly interpreted.

It can be seen from the table that parser performance depends on how close the test domain is to the training domain. One of the important reasons for this is out-of-domain words. Being unknown to the parser, they present difficulty for part-of-speech tagging, which in turn is responsible for the assignment of syntactic dependencies between the words. Our analysis shows that part-of-speech errors are responsible for 30% (for corresponding training and test domains) to 60% (for different training and test domains) of the errors in dependency assignment. Among all the parsers trained on English corpora, Stanford RNN shows the best result (LAS 0.78) on the Genia corpus. BLLIP trained on Genia + PubMed demonstrates the best performance, followed by Parsey McParseface trained on Genia and CRAFT.

With respect to the biomedical corpora, it seems that CRAFT is more difficult than Genia for all but one of the parsers, whether trained on biomedical texts or not. A detailed corpus investigation is required to answer the question why this is so. However, we suppose that certain portions of full texts, such as detailed descriptions of experimental setups or explanations of the figures, which are not necessarily complete or well-formed sentences, contribute to the lower parser scores. Besides, full texts have a larger vocabulary than the abstracts. This effect can be even stronger in our specific training-test setup, due to the sub-domain coverage of the two biomedical corpora: Genia is narrowly focused, as opposed to the much more diverse CRAFT. Overall, based on the figures in Table 1, we think that abstracts are not sufficiently representative of the entire article context to provide an efficient training set for the parsers.
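As a rough illustration, UAS and LAS as defined above can be computed per sentence from head and relation annotations. The input format below (pairs of head index and relation label per token) is an assumption, not the format used in this evaluation.

    def attachment_scores(gold, predicted):
        """Compute (UAS, LAS) for one sentence, where gold and predicted are
        lists of (head_index, relation_label) per token."""
        uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
        las_hits = sum(g == p for g, p in zip(gold, predicted))
        n = len(gold)
        return uas_hits / n, las_hits / n

    gold      = [(2, "nsubj"), (0, "root"), (2, "dobj")]
    predicted = [(2, "nsubj"), (0, "root"), (2, "nmod")]   # right head, wrong label
    print(attachment_scores(gold, predicted))              # (1.0, 0.666...): LAS <= UAS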
In addition to the evaluation presented above, one can have a closer look at parser performance on specific syntactic patterns, such as prepositional attachment or coordination. These constructs carry information about event participants, the conditions under which events take place, and the locations at which they happen. At the same time, both coordination and prepositional attachments are often difficult to parse and attach correctly. As an illustration we compare the BLLIP and the Stanford Factored parsers on the following example sentence: "Thus, one major factor that regulates Cdk5 activity is degradation of p35 and p39 via proteasomal degradation." The graphs of the parses are given in Fig. 4 for BLLIP and Fig. 5 for Stanford Factored, respectively. The relevant information that we want to extract in this case consists of two facts: a) degradation of both p35 and p39 regulates the Cdk5 activity; b) how this degradation happens, namely via proteasomal degradation. The first fact was successfully captured by both parsers, but the mechanism was correctly captured only by the BLLIP parser. The Stanford parser failed at the prepositional attachment.

Our preliminary evaluation of parser performance on specific syntactic patterns shows that the success rate for prepositional attachment is in the range between 82% and 95%, while coordination is worse and lies between 66% and 79%.

4. Conclusions

In this paper we have studied five syntactic parsers from three families, Stanford, BLLIP, and Parsey McParseface, on biomedical texts. We have seen that the highest performance was reached by the BLLIP parser on the Genia test corpus. We have also studied the complexity of biomedical sentences in the context of event extraction. We have defined sentence complexity metrics and parse variability metrics which can help to assess parser performance when the parser is used as part of a knowledge extraction pipeline.

References

Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., & Collins, M. (2016). Globally normalized transition-based neural networks. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
Black, E., Lafferty, J. D., & Roukos, S. (1992). Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. 30th Annual Meeting of the Association for Computational Linguistics, 28 June - 2 July 1992, University of Delaware, Newark, Delaware, USA, Proceedings (pp. 185–192).
Kim, J., Ohta, T., Tateisi, Y., & Tsujii, J. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, June 29 - July 3, 2003, Brisbane, Australia (pp. 180–182).
Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 7-12 July 2003, Sapporo Convention Center, Sapporo, Japan (pp. 423–430).
Manning, C. (2015). The Stanford parser: A statistical parser. https://nlp.stanford.edu/software/lex-parser.shtml.
Industry Track
Extended Abstracts
Eskapade: a lightweight, python based, analysis framework
http://eskapade.kave.io
Unsupervised region of interest detection in sewer pipe images:
Outlier detection and dimensionality reduction methods
Extended Abstract
2. Approach

The currently available data consists mostly of image and video data, grouped by pipe stretch and municipality. Rather than identifying defects in these images directly, we have opted to detect regions of interest (ROIs) in the images and classify these at a later stage. We make the assumptions that (1) all images are from a forward-facing camera in a sewer pipe, (2) all images are similarly aligned, and (3) the surface of the pipe is similar in appearance for images in a single set. See Figure 1 for an example of what these images may look like. It should be noted, with respect to the assumption of "similar appearance", that the concrete and agglomerate often contain a lot of texture, and adjacent pixel values are not necessarily similar.

Figure 1. Forward-facing pictures of the same concrete sewer pipe at different locations along the street.

We define an ROI to be the bounding box of a portion of an image that contains something "unexpected". Note that not all unexpected elements in the sewer pipe will be defects; we also want to detect pipe joints, for example. See Figure 2 for an example of the ROIs we hope to detect.

3. Methods

Unsupervised outlier detection methods are analogous to clustering: objects are thought to form clusters in feature space, and outliers are those objects that are far away from clusters or part of small clusters. The model used to fit these clusters must be somewhat restrictive, otherwise it will overfit on the outliers that are present in the training data.

Since the number of pixels in an image (≈ 10^6) is some orders of magnitude greater than the number of images in a set (≈ 10^3), some dimensionality reduction is in order before we try to find outliers in the dataset, to ensure our methods do not overfit on the training set. While outlier detection methods for higher dimensionalities exist (Aggarwal & Yu, 2001), these seem to be aimed mostly at sparse data, which our image data is not. Other approaches focus on dimensionality reduction by feature selection and rejection (Zimek et al., 2012), which is not well suited for images.
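One way to realize this dimensionality-reduction-plus-outlier-detection idea is to project flattened images onto a small number of principal components and score each image by its reconstruction error. This is a minimal sketch under assumed image sizes and component counts, not necessarily the pipeline used here.

    import numpy as np
    from sklearn.decomposition import PCA

    def outlier_scores(images, n_components=20):
        """Score images by PCA reconstruction error: images poorly described
        by the main modes of variation of the set are candidate outliers.
        `images` is an (n_images, height, width) array."""
        X = images.reshape(len(images), -1).astype(float)
        pca = PCA(n_components=n_components).fit(X)
        reconstructed = pca.inverse_transform(pca.transform(X))
        return np.linalg.norm(X - reconstructed, axis=1)   # higher = more unexpected

    # Toy usage on random "images"; in practice these would be the aligned
    # forward-facing frames of a single pipe stretch.
    rng = np.random.default_rng(0)
    images = rng.random((100, 64, 64))
    scores = outlier_scores(images, n_components=10)
    print(scores.argsort()[-5:])   # indices of the five most unusual images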
References
Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection
for high dimensional data. ACM Sigmod Record (pp.
37–46).
Dirksen, J., Clemens, F., Korving, H., Cherqui, F.,
Le Gauffre, P., Ertl, T., Plihal, H., Müller, K., &
Snaterse, C. (2013). The consistency of visual sewer
inspection data. Structure and Infrastructure Engi-
neering, 9, 214–228.
Turk, M. A., & Pentland, A. P. (1991). Face recogni-
tion using eigenfaces. Computer Vision and Pattern
Recognition, 1991. Proceedings CVPR’91., IEEE
Computer Society Conference on (pp. 586–591).
Zimek, A., Schubert, E., & Kriegel, H.-P. (2012). A
survey on unsupervised outlier detection in high-
dimensional numerical data. Statistical Analysis and
Data Mining, 5, 363–387.
Service Revenue Forecasting in Telecommunications:
A Data Science Approach
Predicting Termination of Housing Rental Agreements with
Machine Learning
Figure 1. Lift curve depicting model performance of the ensemble model (top colored curve) versus the historical RTE estimate (bottom colored curve). The bottom black line depicts the baseline of a portfolio-wide RTE.

Figure 2. Relative importances of various indicators in the dataset.
4. Conclusion
Application of machine learning techniques for predict-
ing tenancy endings is viable. It allows for better plan-
ning than the currently used methods. It mitigates
risks, lowers costs and potentially improves revenue.
In the near future, we plan to improve our model by
including more data and modeling a different outcome
variable, i.e. the period until tenancy end.
Figure 3. Partial dependency plot showing probability of tenancy ending versus tenant age.

References

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Anomaly Analytics and Structural Assessment in Process Industries
Keywords: anomaly detection, exceptional model mining, sequential patterns, industry 4.0
3. Method

[Figure: estimation on weighted network / graph data (transition matrix).]

The detection and analysis of irregular or exceptional patterns, i.e., anomalies (Hawkins, 1980; Akoglu et al.,
Atzmueller, M., & Roth-Berghofer, T. (2010). The Mining and Analysis Continuum of Explaining Uncovered. Proc. AI-2010. London, UK: SGAI.
Atzmueller, M., Schmidt, A., & Kibanov, M. (2016b). DASHTrails: An Approach for Modeling and Analysis of Distribution-Adapted Sequential Hypotheses and Trails. Proc. WWW 2016 (Companion). ACM.
Atzmueller, M., Schmidt, A., Kloepper, B., & Arnu, D. (2017b). HypGraphs: An Approach for Analysis and Assessment of Graph-Based and Sequential Hypotheses. In New Frontiers in Mining Complex Patterns, LNAI. Springer.
Seipel, D., Köhler, S., Neubeck, P., & Atzmueller, M. (2013). Mining Complex Event Patterns in Computer Networks. In New Frontiers in Mining Complex Patterns, LNAI. Springer.
Singer, P., Helic, D., Hotho, A., & Strohmaier, M. (2015). HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails. Proc. WWW. New York, NY, USA: ACM.
Vogel-Heuser, B., Schütz, D., & Folmer, J. (2015). Criteria-based Alarm Flood Pattern Recognition Using Historical Data from Automated Production Systems (aPS). Mechatronics, 31.