Active Learning for Named Entity Recognition with Swedish Language Models
Joey Öhman
Abstract
The recent advancements of Natural Language Processing have cleared the
path for many new applications. This is primarily a consequence of the
transformer model and the transfer-learning capabilities provided by models
like BERT. However, task-specific labeled data is required to fine-tune these
models. To alleviate the expensive process of labeling data, Active Learning
(AL) aims to maximize the information gained from each label. By including a
model in the annotation process, the informativeness of each unlabeled sample
can be estimated, allowing human annotators to focus on vital samples and
avoid redundancy.
This thesis investigates to what extent AL can accelerate model training
with respect to the number of labels required. In particular, the focus is on pre-
trained Swedish language models in the context of Named Entity Recognition.
The data annotation process is simulated using existing labeled datasets to
evaluate multiple AL strategies. Experiments are evaluated by analyzing the
F1 score achieved by models trained on the data selected by each strategy.
The results show that AL can significantly accelerate the model training
and hence reduce the manual annotation effort. The state-of-the-art strategy
for sentence classification, ALPS, shows no sign of accelerating the model
training. However, uncertainty-based strategies consistently outperform random
selection. Under certain conditions, these strategies can reduce the number of
labels required by more than a factor of two.
Keywords
Active learning, Named entity recognition, Language models, Natural language
processing, Bert, Swedish
Sammanfattning
The recent advances within natural language processing have enabled many
new applications. This is mostly a consequence of the transformer models
and the transfer-learning capabilities that come with models such as BERT.
However, task-specific annotated data is still required to fine-tune these
models. To alleviate the expensive process of annotating data, active learning
strives to maximize the information gained from each annotation. By including
the model in the annotation process, one can estimate how informative each
training sample is, and thus let human annotators focus on important data points.
This thesis explores how well active learning can accelerate model training
with respect to how many annotated training samples are needed. The focus is
on pre-trained Swedish language models and the task of named entity recognition.
The data annotation process is simulated using already annotated datasets in
order to evaluate several different active learning strategies. The experiments
are evaluated by analyzing the F1 score achieved by models trained on the data
points selected by each strategy.
The results show that active learning has a significant ability to accelerate
the model training and reduce the manual annotation costs. The state-of-the-art
strategy for sentence classification, ALPS, shows no sign of being able to
accelerate the model training. However, uncertainty-based strategies are
consistently better than selecting data points at random. Under certain conditions,
these strategies can reduce the number of annotations needed by more than a
factor of 2.
Keywords
Active learning, Named entity recognition, Language models, Natural language
processing, Bert, Swedish
Acknowledgments
First, I want to thank Arbetsförmedlingen and my supervisor, Felix Stollenwerk,
for making this project possible and for his many hours of assistance. Secondly,
I want to express my gratitude towards Iolanda Leite, my academic supervisor
who has given me invaluable feedback on the thesis.
This thesis, and my five years at KTH, have been a long journey with many
ups and downs. I want to thank my friends and family for their everlasting
support, be it a place to live, helping me through tough periods, or aiding
me with motivating discussions. I want to thank my girlfriend, Jem Hippe-
Runsten, for her undying support in stressful times. Finally, many thanks to
Fredrik Carlsson for hosting amazing digital pool parties that helped so many
of us through the tough times of Covid-19 isolation.
I could not have come this far without all your support.
Contents
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Question
  1.4 Research Methodology
  1.5 Delimitations
  1.6 Ethics and Sustainability
  1.7 Thesis Outline
2 Background
  2.1 Natural Language Processing
    2.1.1 Data Pre-processing
    2.1.2 Recurrent Neural Networks
    2.1.3 Word Embeddings
    2.1.4 Deep Contextualized Word Representations
    2.1.5 Transfer Learning Beyond Word Embeddings
    2.1.6 Transformers
    2.1.7 Bidirectional Encoder Representations from Transformers
    2.1.8 Swedish Language Models
    2.1.9 Named Entity Recognition
  2.2 Active Learning
    2.2.1 Bootstrapping
    2.2.2 Query by Uncertainty
    2.2.3 Query by Committee
    2.2.4 Batches & Diversity
    2.2.5 Sequential Data
    2.2.6 Seed Data
3 Related Work
  3.1 Active Learning & Named Entity Recognition
  3.2 Active Learning & BERT
4 Method
  4.1 Experiment Design
    4.1.1 Named Entity Recognition Datasets
    4.1.2 Pre-trained Models
    4.1.3 Acquisition Batch Sizes
    4.1.4 Active Learning Strategies
    4.1.5 Seed Model Training
    4.1.6 Stopping Criteria
    4.1.7 Bootstrapping Framework
    4.1.8 Nerblackbox
  4.2 Evaluation Technique
    4.2.1 One Dimensional Variation
    4.2.2 Experiment Score Matrix & Global Strategy Score
  4.3 Experiment & Data Validity
  4.4 Hardware Specification
References
A Hyperparameter Comparison
  A.1 ALPS Inspired vs Early Stopping
List of Figures
B.1 Samples from the Swedish NER Corpus test set. Ground truth samples compared with samples tagged by a fine-tuned KB-BERT model.

Abbreviations
BIO Beginning-Inside-Outside
SOTA State-Of-The-Art
Chapter 1
Introduction
An important result of this is that the same model performance can be achieved
with fewer labeled samples.
A large portion of the AL strategies was developed before the era of deep
learning. Two important categories are Query by Uncertainty and Query by
Committee. The former refers to the process of estimating the informativeness
of unlabeled samples by using the model’s probability predictions and the
associated uncertainties. Often, these do not consider the diversity of the
chosen samples and are thus prone to choosing near-identical samples in a
batch setting. Query by Committee refers to ensemble methods and can
significantly improve the performance but does not solve the diversity problem
either. There are, however, diversity-based strategies that aim for samples that
contribute to a good data representation instead of only uncertainty.
When working with AL in conjunction with certain fields, the best strategy
may vary. For example, in deep learning, ensemble methods might not boost
performance enough to be worth the additional computational cost. Also, in
particular, for NLP tasks it can simply be the case that there are not enough
sufficiently diverse pre-trained models available. The increased amount of data
often implies larger batches and can render simple uncertainty-based strategies
useless. Another prominent example is NER, where a sequence of predictions
should be made for each sample. Therefore, many of the classic strategies need
to be extended. The conjunction of AL and NER is thoroughly investigated in
[5], although in the era before deep learning and transformer models.
While different AL strategies have been shown to vary in performance
over different tasks, some strategies seem to consistently achieve near State-
Of-The-Art (SOTA) performance. For example, BatchBALD [6], building
upon BALD [7], selects informative and diverse samples using concepts from
information theory. Some strategies are specific to deep learning, such as
Learning Loss [8], which adds an auxiliary learning output and objective to
the architecture. Another deep learning strategy, with similar motivation, is
BADGE (Batch Active learning by Diverse Gradient Embeddings) [9], which
compares samples in a hallucinated gradient space. Furthermore, ALPS [10]
is a method specific to pre-trained BERT models that creates surprisal
embeddings through a pre-training objective to find diverse and informative
samples.
1.1 Motivation
The majority of results within the field of deep learning suggests that model
performance will improve with more data and larger models. Due to the vast
costs of creating and using the data, a prominent research direction is that
of transfer learning. Transfer learning allows practitioners to reuse a trained
model for their specific task with reduced effort. The amount of task-specific
data needed is only a fraction of the data used to pre-train the model. However,
the task-specific data needs to be labeled. This is often expensive or difficult but
can be remedied with AL. Instead of randomly selecting samples to label, AL
aims to select the most informative samples. Therefore, with a fixed labeling
budget, the model performance can be pushed further.
Countless remarkable applications of deep learning rely on having sufficient
labeled data. Relaxing this constraint can open up new use cases that were
previously infeasible. This is evident in many NLP tasks e.g. NER, where the
data annotation process can be particularly costly and time-consuming.
Minimizing the dependency on sizeable data is a central objective of
machine learning. It aims to solve a general problem that could aid both
commercial products and research projects. Furthermore, it could enable
the development of products that were previously infeasible and accelerate
research for projects that need manually annotated data.
1.2 Aim
The primary goal of this project is to estimate the potential of AL in conjunction
with Swedish language models and the NER task. More specifically, several
AL strategies are investigated and evaluated by varying multiple experiment
parameters e.g. dataset. This is described further in Chapter 4. The thesis
results and report can be used by anyone who wants to use pre-trained language
models for an NLP task and needs to create their own training data. In
particular, those interested in fine-tuning Swedish NER models might find this
work relevant.
1.5 Delimitations
Optimally, considering the research question, all interesting strategies would
be thoroughly investigated. This would indeed ensure that no optimal strategy
would be missed. However, this thesis is restricted to implementing a few
promising, yet feasible, strategies. Therefore, the results could be interpreted,
instead, as an estimated lower bound of the degree of acceleration that AL can
provide in this setting.
This project is limited to Swedish pre-trained language models; the equivalent
performance for other languages is not explored. Also, the only downstream
task investigated is that of NER. Furthermore, it could be relevant to see
whether the more informative samples require greater annotation effort. This,
like the other limitations, is left as future work.
A potential ethical concern is the effect on existing jobs in data annotation.
Some could be threatened if the need for annotated data is reduced. However,
data annotation will still be an important task, and the goal is merely to make
the process less tedious and redundant.
Moreover, as the process of annotating data and training performant models
becomes feasible for a larger number of practitioners, the rise of new projects
would also entail new jobs.
Chapter 2
Background
These word embeddings are then often used in conjunction with RNNs, as
word representations. This gives the model the advantage of having a smaller
input size (dense vectors) while incorporating the meaning of words in the
model input. Word embeddings take NLP a significant step towards transfer
learning, which is a major goal within the field.
2.1.6 Transformers
While the contextualized word embeddings provided by ELMo were a major
step beyond word-independent embeddings, long-term relationships, e.g. over
several sentences, are still difficult to model. Also, the sequential nature of
RNNs does not allow model training to fully utilize high-end computing
hardware. Graphics Processing Units (GPUs) can perform an immense number
of floating-point operations per second, given that the operations can be done
in parallel, i.e. there are no sequential dependencies. While RNNs do come
with matrix operations that can be done efficiently on the GPU, one word has
to be processed before processing the next. These shortcomings of the LSTM
are primarily what motivates the Transformer [1] architecture.
Transformer models build upon the encoder-decoder architecture. This
means that an encoder network encodes the input to an intermediate representation,
which the decoder is trained to decode. A common example is that of machine
translation, a sequence-to-sequence task where the input text is encoded into a
vector representation which is then decoded to the target language. Using an
encoder-decoder architecture allows the source and target sentences to differ
in length. Furthermore, this architecture is practical for multi-modal data. For
example, image caption generation [19], where an input image can be encoded
to a vector and then decoded into a text sequence.
The Transformer model consists of an encoder and a decoder. The encoder
in turn is a stack of encoder blocks, and the decoder a stack of decoder blocks.
These blocks are slightly different and will be covered one by one, starting
with the encoder.
Encoder
The encoder block is made of two layers, a self-attention layer and a feed-
forward network. Each element of the input sequence will have its own path
through the encoder, starting with a word embedding layer as explained above.
These paths are not independent and share information via the self-attention
mechanism.
Self-attention
The self-attention layer allows words to be associated with each other, e.g.
when a word refers to another or resolving word-level ambiguities with the
help of context. It does this by first calculating query, key and value vectors for
each embedding $x_i$ by multiplying it with the corresponding learned weight
matrices:

$$q_i = x_i W^Q, \quad k_i = x_i W^K, \quad v_i = x_i W^V \qquad (2.1)$$
To gain some helpful intuition, one can think of the queries, keys, and
values as an analogy with retrieval systems. The query matrix is used to
transform the input into queries that can be matched against the keys that,
after a few additional steps, are multiplied with the values to provide the
embeddings with context.
These vectors are of a fixed dimensionality $d_k = 64$, smaller than the
embedding vectors with $d_x = 512$, to keep the computational complexity of these
operations mostly constant. Scores are calculated by taking the dot product of
a word's query vector with all key vectors. The score represents how much
focus each word should dedicate to the other words. The score is then divided
by the square root of the key vector dimensionality $d_k$ and passed through the
softmax function ($\sigma$), resulting in the softmax score $s^{\sigma}$:

$$s^{\sigma}_{ij} = \sigma\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) \qquad (2.2)$$
At this point, there is a scalar score $s^{\sigma}_{ij}$ for each word embedding pair that
represents how relevant $x_j$ is for $x_i$. This scalar, which now contains a value
between 0 and 1, is multiplied with the corresponding value vector $v_j$. An
interpretation of this is that we keep the value vectors of relevant words but
filter out irrelevant words, i.e. those with softmax scores close to 0. Now, each
source word position $i$ has scaled value vectors corresponding to each target
word position $j$. To incorporate information from other word embeddings into
the current one, the target vectors are aggregated. The final output of the
self-attention layer, for word position $i$, is the sum of all these scaled value vectors:

$$z_i = \sum_{j} s^{\sigma}_{ij} v_j \qquad (2.3)$$
In practice, these operations are carried out with matrices, where the embedding
vectors are stacked as the rows of a matrix $X$, so that the queries, keys and values
for all positions are computed at once:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V \qquad (2.4)$$

The pairwise scores then become a single matrix product:

$$S = QK^T \qquad (2.5)$$

In this resulting matrix, each element corresponds to the score $s_{ij}$.
To get the normalized softmax scores, we perform an element-wise division
by $\sqrt{d_k}$ (here $\sqrt{64} = 8$) followed by a softmax:

$$S^{\sigma} = \sigma\!\left(\frac{S}{\sqrt{d_k}}\right) \qquad (2.6)$$
Here, the elements $S^{\sigma}_{ij}$ exactly correspond to the softmax scores $s^{\sigma}_{ij}$ achieved
in the vector equation 2.2 above. The next steps are to scale and sum the
value vectors, see Equation 2.3, which are both handled with just one matrix
multiplication:

$$Z = S^{\sigma} V \qquad (2.7)$$
This entire chain of operations that constitutes the self-attention layer can
be formulated with one compact formula:

$$Z = \sigma\!\left(\frac{XW^Q (XW^K)^T}{\sqrt{d_k}}\right) XW^V \qquad (2.8)$$
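To make the chain of operations concrete, the following is a minimal NumPy sketch of single-head self-attention corresponding to Equation 2.8. The random weight matrices stand in for learned parameters, and the dimensions follow the original Transformer ($d_x = 512$, $d_k = 64$).

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # queries, keys, values (Eq. 2.1 / 2.4)
    S = Q @ K.T                                   # pairwise scores (Eq. 2.5)
    S_sigma = softmax(S / np.sqrt(K.shape[-1]))   # scaled softmax scores (Eq. 2.6)
    return S_sigma @ V                            # weighted sum of value vectors (Eq. 2.7)

rng = np.random.default_rng(0)
seq_len, d_x, d_k = 10, 512, 64
X = rng.normal(size=(seq_len, d_x))                       # one embedding per token
W_Q, W_K, W_V = (rng.normal(size=(d_x, d_k)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)                      # output shape: (seq_len, d_k)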
Multi-headed Attention
The self-attention allows the word positions to focus and incorporate information
from other positions. However, one could imagine that the model would
benefit from being able to focus on word positions differently for different
aspects. This is partly the motivation for why the authors added multi-
headed attention. The entire self-attention function is duplicated into several
(8 in the original paper) instances, each having its own weight matrices W^Q,
W^K and W^V. Since each attention head yields one matrix Z, there are several
such matrices instead of just one. To aggregate the information into one matrix,
the matrices are concatenated and multiplied with another learned weight
matrix W^O. The output matrix then has the same size as the original Z matrix.
Positional Encodings
With the self-attention mechanism described above, the model has no way of
telling where in the sentence different words lie, which is naturally important
to understand sequences. To encode this in the word embeddings, a positional
encoding is added to the embedding vector. These vectors are originally
created using a combination of two functions, one using sine and the other
cosine. However, the positional encoding vectors can be generated in different
ways. This allows the model to see a systematic difference between word
embeddings at different positions and learn how to interpret them.
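As an illustration, a small sketch of the original sinusoidal positional encodings follows; even embedding dimensions use sine and odd dimensions cosine, and the resulting matrix is added to the word embeddings. The dimensions are assumed values for illustration.

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                         # token positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                           # embedding dimensions 0..d_model-1
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=128, d_model=512)            # added element-wise to the embeddings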
Decoder
The decoder block is formed by three layers: the self-attention layer, the
encoder-decoder attention layer, and the feed-forward layer, all of which have
residual connections with a sum and layer normalization step. The output
of the decoder is passed to a feed-forward network with a softmax output,
where each output element represents the probability of the corresponding
word.
The decoder works in time steps, generating one token at a time. There
are two inputs at each time step: the encoder output vectors K_mem and
V_mem, and the tokens decoded up until the current time step. So, when the
encoder output has been generated, the decoder starts decoding at the first
time step, getting only a special start token as input along with the encoder
vectors. The self-attention works as in the encoder with one exception: it is
only allowed to attend to previous positions of the decoder input, i.e. what
has been decoded so far. The succeeding positions are masked, by being
set to −∞. The next layer, the encoder-decoder attention, works like the
multi-headed self-attention in the encoder, but it generates the query matrix Q
from the previous layer and uses the key and value matrices from the encoder
output vectors. This is where the decoder processes information from the
encoder and input sequence. Finally, the feed-forward layer works as in the
encoder and creates the output of the decoder, which is then passed to the
feed-forward softmax classifier network that outputs token predictions. These
token predictions are represented as a vector of the same dimensionality as
the vocabulary, i.e. number of known words. Each element is a softmax
probability and the argmax token is the Transformer prediction. The predicted
word is then used in the input of the next decoding time step. This process
repeats until the model outputs a special end of sequence token, which marks
the decoded sequence complete.
Alternatively, with beam search, multiple candidate sequences are kept and
each is expanded at the next time step. When all candidates have reached an
end of sequence token, the best sequence can be selected as the final output,
e.g. the one with the highest total probability.
Pre-training
The pre-training of BERT consists of two self-supervised tasks, one on word-
level and one on sentence-level. The addition of a sentence-level task is
motivated by the fact that there are various downstream tasks that require the
model to handle the relationship between two text sequences.
The word-level pre-training objective is that of Masked Language Modelling
(MLM). It consists of masking 15% of the input tokens, replacing 80% of these
with the [MASK] token, 10% with a random token, and 10% with the original
token. The model is then trained to predict the correct tokens for the masked
positions, using a classification layer added on top of the output of each token
position. Part of the intuition that led the authors to these rules is that the model
must learn good representations for both masked and non-masked tokens.
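A rough sketch of this masking rule is given below; the tokenization and vocabulary are simplified stand-ins, and the exact implementation in BERT differs in detail.

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:        # select roughly 15% of the positions
            targets.append(tok)                # the original token is the prediction target
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)      # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)             # 10%: keep the original token
        else:
            masked.append(tok)
            targets.append(None)               # position not included in the MLM loss
    return masked, targets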
Fine-tuning
When the model is pre-trained, it can be shared with anyone who desires to
fine-tune it for their particular downstream task. Fine-tuning BERT models
is in many cases simple and does not require much data, as the model has
already acquired general language understanding during pre-training. This
makes it possible to reuse the results of the immense pre-training process
(16 Tensor Processing Units (TPUs) running for 4 days for BERT_LARGE),
saving time and resources,
allowing SOTA models to be trained on commodity hardware in a reasonable
time.
In many cases, fine-tuning is only a matter of adding a small layer to the
pre-trained model. Figure 2.1 briefly illustrates the composition of BERT and
a custom model on top. A common example is that of sentence classification,
e.g. spam detection or sentiment analysis, which only requires a classifier
added on top of the [CLS] token, similar to the sentence prediction pre-
training task. In the case of NER, each output token vector is simply passed
through a shallow classification network to predict the token entity. In
conclusion, BERT provides a way to reuse trained language models for a wide
variety of downstream tasks with little effort.
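As an illustration of how little is needed on top of the pre-trained model, the sketch below adds a token classification head to KB-BERT with the Hugging Face transformers library. The model identifier and label set are assumptions for illustration; the experiments in this thesis perform fine-tuning through nerblackbox instead.

from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed BIO tag set
model_name = "KB/bert-base-swedish-cased"                             # KB-BERT on the Hugging Face hub

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)  # adds a randomly initialized classification layer on top of each output token vector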
Evaluation
While NER models are usually trained as a token-level multi-class classification
task with sequence-to-sequence data, the evaluation is often done on entity
level. This means that for the plain tag format, consecutive tokens with the
same tag will be considered as one entity. In the case of the BIO format,
entities are well-defined. Since a multi-token entity can be partly correct, there
are multiple ways of evaluating a model. A common way to handle this is to
treat any entity with any incorrect tokens as an incorrect entity prediction.
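A small sketch of this strict entity-level matching for BIO-tagged sequences follows; an entity only counts as correct if both its span and its type match the ground truth exactly, and the tag sequences are toy examples.

def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel closes a trailing entity
        begins = tag.startswith("B-") or (tag.startswith("I-") and start is None)
        if (tag == "O" or begins) and start is not None:
            entities.append((start, i, etype))       # close the running entity
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]                # open a new entity
    return set(entities)

gold = ["B-PER", "I-PER", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC"]
correct = extract_entities(gold) & extract_entities(pred)    # strict matches only: {(0, 2, 'PER')}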
Furthermore, as with any machine learning problem, there are multiple
metrics for model performance. In NER, F1 scores, harmonic means of
precision and recall, are often used. However, there are multiple ways to
compute an overall F1 score for a multi-class problem. For each entity class,
precision and recall are combined into the class F1 score:

$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \qquad (2.9)$$
This results in an F1 score for each entity class. To aggregate these
F1 scores into one final F1 score, a micro- or macro average can be used.
The macro average simply calculates the mean of the class F1 scores, treating
every class as equally relevant. The micro average instead weights each class
by the fraction of the total number of samples that belong to it. So, macro
average gives equal importance to each class, whereas micro average gives
equal importance to each sample.
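The toy sketch below contrasts the two averages for hypothetical per-class counts. The micro average is computed here in its standard form, by pooling the per-class counts before computing F1, which gives every sample equal weight as described above.

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# hypothetical per-class counts: (true positives, false positives, false negatives)
counts = {"PER": (90, 10, 10), "ORG": (40, 20, 20), "LOC": (5, 15, 15)}

macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)       # every class weighted equally
micro_f1 = f1(*(sum(col) for col in zip(*counts.values())))         # every sample weighted equally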
Figure 2.2 – Dataset with two classes, green and red. Unlabeled data pool
(left) and corresponding labeled dataset (right).
2.2.1 Bootstrapping
Bootstrapping aims to assist the human annotators by pre-tagging the unlabeled
samples for them. This is done by iteratively annotating data, training a model,
and pre-tagging the unlabeled samples with the model. In the later iterations,
the model predictions are often adequate and the annotator's task transitions
from annotation to correction. This bootstrapping process is extended to use
AL by not only pre-tagging the samples that should be annotated, but also
selecting these samples. This process is briefly illustrated in Figure 2.4 and
formally described in Algorithm 1.
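Since Figure 2.4 and Algorithm 1 are referenced above, a schematic sketch of the loop is given here. The callables oracle, train, acquire and evaluate are placeholders supplied by the caller; in the simulations of this thesis, the oracle simply reveals labels from an existing dataset.

def active_learning_loop(pool, seed, oracle, train, acquire, evaluate, budget, batch_size):
    labeled = [(x, oracle(x)) for x in seed]              # annotate a small seed set
    pool = [x for x in pool if x not in seed]
    scores = []
    while len(labeled) < budget and pool:
        model = train(labeled)                            # fine-tune a model on the labeled data
        batch = acquire(model, pool, batch_size)          # pick the most informative samples
        labeled += [(x, oracle(x)) for x in batch]        # "annotate" them (reveal hidden labels)
        pool = [x for x in pool if x not in batch]
        scores.append(evaluate(model))                    # e.g. F1 on a held-out test set
    return scores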
How to quantify the informativeness of samples is an open research
question. There are many strategies for it, and the best choice may vary for
different tasks. This section will focus on classification tasks, but much of
what is covered here applies to regression tasks as well. This section briefly
covers the relevant areas of AL; for a more thorough overview of the field, the
literature survey
by B. Settles [37] is a fine place to start.
Figure 2.3 – Dataset with two classes, green and red. Decision boundaries for
classifier trained with: randomly selected samples (left) and actively selected
samples (right).
When uncertainty alone is used to select a batch of samples, the result could be
worse than random selection since the batch could be filled with near-identical
samples with high uncertainty.
The problem of diversity is illustrated in the BatchBALD paper [6]. The
authors propose an AL strategy, inspired by information theory, to get diverse
and informative samples in a batch setting. Another promising strategy is
that of BADGE [9], which uses loss gradients to find diverse and informative
samples. The authors show that their strategy selects samples with diverse
gradients of high magnitude, and argue that a high-magnitude gradient indicates
that a sample is informative. Since there are no labels present when selecting
samples, the gradients are hallucinated, i.e. computed with the label the model
currently favors.
Chapter 3
Related Work
There is little research done in the combination of AL, NER, and Swedish
language models. So, any work done within AL in conjunction with NLP
is considered related work, with particular focus on NER and pre-trained
language models.
The conjunction of AL and NLP has seen much research in the last few
decades. Before the era of deep learning, A. Cynthia et al. showed that AL can
accelerate the model training of Information Extraction and Semantic Parsing
[38]. G. Tur et al. demonstrate that AL can reduce the annotation effort required
in spoken language understanding by a factor of 2 [39, 40]. B. Settles et al. present
an AL framework for multiple-instance machine learning and show that AL
can significantly improve the performance in this sub-field [41].
A. Siddhant and Z. Lipton present a large-scale empirical study of deep Bayesian
AL for NLP tasks such as sentence classification, NER, and Semantic Role
Labeling [45]. Moreover, the authors observe that while
Bayesian approaches consistently perform best among their strategies, basic
uncertainty-based strategies significantly outperform random sample selection.
Y. Shen et al. explore the combination of deep AL and NER [46] using
a composite model made of two convolutional neural network encoders and
an LSTM decoder. AL has also been explored in the clinical context:
Y. Chen et al. [47] use a conditional random field [48] classifier for entity
tagging in medical records. Furthermore, cost-aware AL has been investigated
by Q. Wei et al. [49], where the annotation costs are thoroughly considered to
ensure that the annotation effort is minimized. All of these authors observe that
AL possesses the potential of accelerating the annotation process in the context
of NER.
Related Strategies
An optimal Active Learning strategy considers not only the uncertainty or
informativeness of individual samples, but also the diversity within a batch.
Since practical settings use batches, uncertainty-based strategies are prone to
selecting redundant samples and hence perform poorly in a realistic scenario.
An intuitive way of modeling diversity is to measure the distance of samples
in input space. Then, uncertainty-based strategies could be used to weight the
samples to also consider informativeness. Clustering these weighted samples
with e.g. K-Means clustering, and sampling data points close to the resulting
cluster centers, could generate diverse and informative batches.
However, in many fields, it does not make sense to measure the distance in
input space. Moreover, uncertainty-based strategies in deep learning rely on
the confidence estimates of the neural network softmax probabilities, which
are poorly calibrated and often over-confident [52]. Instead, some modern
strategies choose to rely on deep neural network embeddings. For example,
ALPS creates surprisal embeddings from BERT's masked language modelling
objective and has been evaluated on sentence classification tasks. In fact, ALPS
slightly overlaps with this thesis. However,
the contribution of this work is instead to explore AL for NER with Swedish
language models.
Chapter 4
Method
The potential of AL can be explored via simulations in any context that has
relevant datasets, as commonly done in the literature, e.g. [9, 10, 46, 47, 49]. In
this thesis, the process is simulated in the context of NER for Swedish language
models, using Swedish pre-trained BERT models and NER datasets. The
AL bootstrapping process of iteratively selecting the most informative
data for labeling and training, see Figure 2.4 and Algorithm 1, is simulated
by replacing the oracle with an existing labeled dataset. This is achieved by
simply hiding the labels from the model and returning them when the model
queries the oracle. Model performance is then measured after each iteration
and compared with passive learning for the same training data size.
The full names of the entities present in the datasets can be found in Table
4.1. The distributions of these classes with the O-tag omitted can be seen in
Figure 4.1. The Swe-NERC dataset contains a larger number of classes with
more complexity than Swedish NER Corpus. Examples from Swedish NER
Corpus can be found in Appendix B, where the ground truth is compared with
predictions from a fine-tuned model.
Figure 4.1 – Class distributions (O-tag omitted) for the Swedish Named Entity
Recognition datasets used in the experiments.
Choice of Measure
The acquisition batch size can be defined in multiple ways. The most natural
way is to use a sample-based measure, e.g. that B = 100 implies 100 samples
per batch. However, this leads to issues with many strategies, for example
strategies that measure the uncertainty of a sample as the total uncertainty
over all its tokens, since they will favor longer samples. With this sample-based
batch size definition, annotating fewer samples will likely not reduce the
annotation effort, since the selected samples are often longer instead.
Furthermore, using the sample-based definition, the strategy of simply choosing
the longest samples often outperforms random selection (passive learning) significantly.
Consequently, this thesis resolves to use word-based batches, commonly
used in recent literature [46, 47]. In practice, this means that samples selected
by the AL strategy are added to the batch one by one until the batch size B
has been reached. This definition should enhance the correlation between the
observed strategy performance and the reduction in human annotation effort.
If B is too low, the experiment becomes unrealistic and infeasible. On the
other hand, if it is too high the potential benefit of AL fades without increasing
the annotation efficiency significantly. Therefore, the acquisition batch sizes
explored in the experiments are limited to B = 500 and B = 1000.
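A sketch of how a word-based acquisition batch can be assembled is shown below; ranked_samples is assumed to be the unlabeled pool sorted by the strategy's informativeness score, and whitespace splitting stands in for proper tokenization.

def select_batch(ranked_samples, batch_size_words):
    batch, words = [], 0
    for sample in ranked_samples:              # most informative samples first
        batch.append(sample)
        words += len(sample.split())           # count the words in the sentence
        if words >= batch_size_words:          # stop once the word budget B is reached
            break
    return batch

ranked = ["Kalle bor i Stockholm .", "KTH ligger i Stockholm ."]   # toy ranking
batch = select_batch(ranked, batch_size_words=500)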
Strategy Definitions
To gain an understanding of the potential of AL, multiple strategies are
explored since their performance has been observed to be problem-dependent.
The random strategy, passive learning, is used as a baseline in the experiments.
A strategy must clearly outperform the random baseline for AL to be useful.
The operations for simple uncertainty-based strategies can be done on
either token-level or sample-level. For example, Ent-Max refers to taking the
maximum probability on token-level for all tokens, then aggregating these by
calculating their entropy (this strategy is named Logprob in [5]). Following
this naming convention, the strategies investigated and how they measure
informativeness are listed below (a short code sketch follows the list):
• Ent-Max: Omit all tokens which have not been predicted as an entity,
i.e. O-tags. Let $p_t^{max}$ be the maximum probability of token $t$,
$p_t^{max} = \max(p_t)$. The Ent-Max uncertainty is then the entropy of these
maximum probabilities.

$$I_{EntMax}(p) = -\sum_{t=1}^{T} p_t^{max} \log_2 p_t^{max} \qquad (4.1)$$

• Ent-Marg: Omit all tokens which have not been predicted as an entity,
i.e. O-tags. Let $p_t^{marg}$ be the absolute difference of the greatest two
probabilities of token $t$. The Ent-Marg uncertainty is then the entropy
of these margin probabilities.

$$I_{EntMarg}(p) = -\sum_{t=1}^{T} p_t^{marg} \log_2 p_t^{marg} \qquad (4.2)$$
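A sketch of these two measures follows. Here, probs is assumed to be an array of shape (num_tokens, num_classes) with the model's softmax probabilities for one sample, and entity_mask is a boolean array marking the tokens predicted as entities (non-O tags).

import numpy as np

def entropy(p):
    p = p[p > 0]                                     # avoid log2(0)
    return float(-np.sum(p * np.log2(p)))

def ent_max(probs, entity_mask):
    if not np.any(entity_mask):
        return 0.0                                   # no predicted entities, no uncertainty signal
    p_max = probs[entity_mask].max(axis=1)           # maximum probability per entity token
    return entropy(p_max)                            # Equation 4.1

def ent_marg(probs, entity_mask):
    if not np.any(entity_mask):
        return 0.0
    top2 = np.sort(probs[entity_mask], axis=1)[:, -2:]
    p_marg = np.abs(top2[:, 1] - top2[:, 0])         # margin between the two largest probabilities
    return entropy(p_marg)                           # Equation 4.2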
4.1.8 Nerblackbox
The experiments fine-tune the pre-trained models using the PyTorch-based
Python library nerblackbox [58]. Nerblackbox takes as input a pre-trained
Transformer-based model and a NER dataset and outputs performance metrics
and the fine-tuned model. The library supports useful features such as early
stopping, error estimation through multiple runs, and easy configuration of
hyperparameters. Moreover, the datasets used in the experiments are supported
as built-in datasets. These features are utilized heavily in the experiments, and in each
iteration when new data has been added, a model is fine-tuned from scratch on
the current subset of the dataset using nerblackbox.
Hyperparameters
The models are fine-tuned with a maximum sequence length of 128 and a batch
size of 32 (not to be confused with acquisition batch size) for a maximum of
50 epochs with early stopping. The AdamW optimizer [59, 60] is used with a
constant learning rate of 2e-5, β1 = 0.9 and β2 = 0.999. These hyperparameters
are inspired by the closely related work on ALPS [10]. However, the low
number of epochs used there does not allow models to converge for small
datasets, i.e. early bootstrapping iterations. Increasing the maximum number
of epochs
is justified by the conclusions drawn in [61, 62], and using early stopping
makes the training robust also for larger datasets, i.e. later bootstrapping
iterations. A brief comparison of these hyperparameter configurations for this
context is presented in Appendix A.
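For reference, the sketch below expresses roughly equivalent hyperparameters with the Hugging Face Trainer API. This is only an illustration under assumptions; it is not the configuration interface of nerblackbox, and the early-stopping patience is an assumed value.

from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=2e-5,                  # constant learning rate
    lr_scheduler_type="constant",
    adam_beta1=0.9,
    adam_beta2=0.999,
    per_device_train_batch_size=32,      # training batch size (not the acquisition batch size)
    num_train_epochs=50,                 # upper bound; early stopping usually ends training sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)   # assumed patience
# the maximum sequence length of 128 is applied at tokenization time, not here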
Each fine-tuning session is repeated 5 times, and the mean F1 score $\bar{f}_1$ and
the standard error of the mean $\sigma_{\bar{f}_1}$ are reported [58], see Equation 4.4.

$$\sigma_{\bar{f}_1} = \frac{\sigma_{f_1}}{\sqrt{n}} \qquad (4.4)$$

where $\sigma_{f_1}$ is the sample standard deviation and $n$ is the number of samples.
$$E := (e_{DMBS}) \qquad (4.6)$$
Figure 4.2 – Experiment scores are defined by the AUC of the F1 score
learning curve (left). The experiment matrix contains experiment scores
in four dimensions. Fixing two allows for a two-dimensional visualization
(right).
Figure 4.2 depicts the definition of the experiment score and experiment
score matrix. This experiment score matrix holds compact, partly aggregated,
information of the experiment results. Taking the mean of a particular axis,
or several, enables the analysis of the average performance in the other
dimensions. For instance, taking the mean of the M and B axes results in
a two dimensional matrix that depicts the average performance of strategies
for different datasets. These elements are defined as in Equation 4.7.
$$\bar{e}_{DS} = \frac{1}{N_M N_B} \sum_{M,B} e_{DMBS} \qquad (4.7)$$
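A small sketch of this aggregation follows, with assumed dimensions for the score matrix; the experiment score itself is the area under the F1 learning curve.

import numpy as np

def experiment_score(labeled_words, f1_scores):
    x, y = np.asarray(labeled_words, float), np.asarray(f1_scores, float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))   # trapezoidal AUC of the learning curve

# assumed dimensions: 2 datasets (D), 1 model (M), 3 batch sizes (B), 4 strategies (S)
E = np.zeros((2, 1, 3, 4))                          # E[d, m, b, s] = e_DMBS

e_DS = E.mean(axis=(1, 2))                          # Equation 4.7: average over the M and B axes
global_strategy_scores = E.mean(axis=(0, 1, 2))     # one overall score per strategy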
More importantly, taking the mean over all axes but the S-axis results
in a vector of global strategy scores that can be seen as overall strategy
performances.
The random strategy is included in the experiments and thus in this vector,
to enable a simple comparison of the strategies with passive learning. This
vector may act as the final verdict of AL’s potential in the context of the NER
task with Swedish language models.
For a deeper analysis the strategy performances can be examined for
different fractions of the learning curves. For example, a fraction of 50%
means that only the first half of the learning curve would be considered when
calculating the experiment score. This results in measures of the acceleration
potential of AL strategies for different labeling budgets L.
• The analysis examines the F1 score per labeled word. This does not
perfectly model annotation effort and could mean that we are essentially
only optimizing annotation effort indirectly.
       Machine 1                                Machine 2
CPU    Intel(R) Xeon(R) E5-2620 v4 @ 2.10GHz    E5-2690 v3 @ 2.60GHz
RAM    60 GiB                                   500 GiB
GPU    Nvidia Quadro P5000                      Nvidia Titan RTX
VRAM   16 GiB                                   24 GiB
Chapter 5
Results and Discussion
5.1.1 Dataset
The labeling budget L of the dataset Swedish NER Corpus was selected as
in Chapter 4. However, due to long experiment times and constraints in
resources and time, the labeling budget of Swe-NERC was chosen as 26,000
words to match the budget of Swedish NER Corpus. The Swe-NERC dataset
is inherently more difficult and, consequently, its original labeling budget is
much higher.
Figure 5.1 illustrates the performance differences between the datasets for
different settings. The gap between the curves is larger for the Random
strategy. This is arguably because the model approaches convergence on
Swedish NER Corpus and because AL accelerates the training more for
Swe-NERC.
5.1.2 Model
Due to the time and resource constraints of this thesis, the only pre-trained
model used consistently throughout the final AL experiments is KB-BERT.
Only a few experiments are carried out for AF-BERT to gain some intuition
on the difference in the behavior of different pre-trained models. Figure 5.2
illustrates the significant performance advantage of the Swedish SOTA model
KB-BERT. Since the curves do not seem to fundamentally differ, apart from
the additional advantage of AL for AF-BERT, the conclusions drawn from the
experiments with KB-BERT could hold for AF-BERT as well. The reason for
the extra acceleration observed with AF-BERT could be that the task is
more difficult for inferior models.
Figure 5.4 – Acquisition batch sizes B compared with Avg-Marg for both
datasets.
Figure 5.5 – Acquisition batch sizes B compared with Ent-Marg for both
datasets.
Figure 5.6 – Acquisition batch sizes B compared with Ent-Max for both
datasets.
5.1.4 Strategy
Figure 5.7 – Active Learning strategies S compared with acquisition batch size
500 for both datasets.
Figure 5.8 – Active Learning strategies S compared with acquisition batch size
1000 for both datasets.
Figure 5.9 – Active Learning strategies S compared with acquisition batch size
4000 for both datasets.
Figure 5.7, Figure 5.8, and Figure 5.9 depict performance comparisons of the
AL strategies for different contexts. The results indicate that the uncertainty-
based AL strategies consistently outperform random selection throughout the
majority of the iterations. The performance gain from AL is higher in the early
stages of the learning curve. This is expected since the model is still far from
convergence and hence has much to learn.
What is more surprising is the rather large improvement over random for
uncertainty-based strategies in a batch setting. For example, see the 4000-
word mark for Swedish NER Corpus in Figure 5.7 (left). At this point, all
these uncertainty-based strategies achieve significantly higher F1 test scores
of around 0.8 instead of Random’s 0.73. Random selection does not achieve
matching performance until it has queried 12 000 words. As a result, to achieve
an F1 test score of 0.8 in this context, uncertainty-based AL can reduce the
number of labeled words required by almost a factor of 3. For Swe-NERC, the
increase in absolute F1 score is greater but the corresponding reduction factor
seems to be around 2. In general, AL seems capable of reducing the number of
labeled words required by at least a factor of 2 in this context for sufficiently
small batch sizes, e.g. 500 words.
Figure 5.10 – Relative improvement matrix with rows D and columns B. The
bottom row shows the mean score for each batch size.
Figure 5.11 primarily presents three things. First, each strategy performs
better on Swe-NERC than on Swedish NER Corpus. Secondly, the strategy
scores are similar, but Avg-Marg slightly outperforms the other two. Lastly, all
strategies significantly outperform Random.
Figure 5.11 – Relative improvement matrix with rows D and columns S. The
bottom row shows the mean score for each strategy.
Figure 5.12 shows that the strategies have similar performance while Avg-
Marg slightly outperforms the other strategies consistently, for all tested batch
sizes.
Figure 5.12 – Relative improvement matrix with rows B and columns S. The
bottom row shows the mean score for each strategy.
Figure 5.13 – Average relative experiment score improvement over random for
increasing fractions of the learning curves.
Figure 5.15 – The number of words selected per sample by strategies through
the iterations.
Figure 5.16 – ALPS strategy compared with Ent-Max and Random, with KB-
BERT.
Figure 5.17 – ALPS strategy compared with Ent-Max and Random, with AF-
BERT.
The number of words selected per sample is illustrated in Figure 5.15. The
bias towards longer samples is present in ALPS, much like Ent-Max. Since
ALPS aims to select samples with surprising language, it could be the case that
intricate and diverse language is often found in longer samples. Figure 5.16
illustrates the performance of ALPS in comparison to Random and Ent-Max
with KB-BERT. Moreover, Figure 5.17 presents the equivalent for AF-BERT
and shows that no fundamental differences are observed for this model. These
figures indicate that, for this downstream task and these datasets, uncertainty-
based strategies outperform ALPS significantly. Furthermore, ALPS does not
show an evident pattern of even outperforming Random.
Chapter 6
Conclusions and Future work
6.1 Conclusions
In this thesis, the potential of AL strategies has been examined for SOTA
Swedish language models in the downstream task NER. Experiments for
several strategies have been conducted on two Swedish NER datasets with
the Swedish SOTA language model KB-BERT. A brief comparison was
made with AF-BERT to verify that KB-BERT performs better and that no
fundamental differences in strategy performance were present. The strategy
experiments were conducted with three different acquisition batch sizes, and
the results are presented both as fine-grained learning curves and as aggregated
strategy performances.
The results indicate that AL can indeed accelerate the model training significantly
and hence reduce the human annotation effort. Furthermore, the bootstrapping
process allows for pre-tagging the unlabeled samples with the model, potentially
reducing the annotation effort further.
To explicitly address the research question: the extent to which AL can
accelerate model training for SOTA Swedish language models in the context of
NER is significant. The experiment results illustrate that for certain conditions,
the number of labels required can be reduced by more than a factor of two. The
uncertainty-based strategies explored in this thesis all outperformed random
selection, with little variation in performance between them. Nevertheless,
one important difference in behavior was observed. Strategies that aggregate
sequence uncertainty by using the mean seem to be biased towards shorter
samples.

There is a trade-off between handling cold-starts and optimizing for the final
task, and the outcome of this trade-off could vary for different tasks.
Furthermore, the fact that it currently is a
pure trade-off could indicate that future strategies will revolutionize the field
of AL. An interesting start could be to combine the strategies by using ALPS
for the cold-start and switching to BADGE after some threshold. Furthermore,
instead of selecting samples with surprising language as ALPS does, the
pre-trained model could be further pre-trained on the unlabeled data pool
before starting the bootstrapping and fine-tuning. This idea has recently been
explored by
K. Margatina et al. [64].
Furthermore, a potential improvement of the successful uncertainty-based
strategies could be to enhance the uncertainty estimations of the model. This
can be done with MC Dropout, as done by Y. Gal et al. [7]. The authors
report significant improvements for strategies with Bayesian models over their
deterministic equivalents. This could be a straightforward path to better
performance.
The main metrics of this thesis were the F1 score per labeled word and the F1
score per labeled sample (sentence). These only act as a proxy for the reduction
of annotation effort, as they are based on the assumption that requiring fewer
labels entails a simpler annotation process. The assumption is likely
true to some extent in many scenarios but could overestimate the potential
of AL if the more informative samples are also more difficult to annotate.
Consequently, an interesting research direction is that of cost-aware AL, which
aims to explicitly model the annotation cost.
References
[33] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu,
"TinyBERT: Distilling BERT for natural language understanding," 2020.
[42] B. Settles and M. Craven, "An analysis of active learning strategies for
sequence labeling tasks," in Proceedings of EMNLP '08. Association for
Computational Linguistics, 2008, pp. 1070–1079.
[45] A. Siddhant and Z. Lipton, "Deep Bayesian active learning for natural
language processing: Results of a large-scale empirical study," 2018,
doi: 10.18653/v1/D18-1318, pp. 2904–2909.
[54] W.-N. Hsu and H.-T. Lin, "Active learning by learning," Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, Feb. 2015.
Appendix A
Hyperparameter Comparison
Appendix B
To demonstrate what the downstream task NER is and what samples can look
like, this appendix shows a couple of samples tagged by the model compared
with the ground truth. Figure B.1 shows predictions from a model trained on
Swedish NER Corpus. In the first example, the model mistakes a person for
an organization but gets the organization correct. In the second example, it
misses the location but correctly predicts the two persons.
Figure B.1 – Samples from the Swedish NER Corpus test set. Ground truth
samples compared with samples tagged by a fine-tuned KB-BERT model.