
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,

SECOND CYCLE, 30 CREDITS


STOCKHOLM, SWEDEN 2021

Active Learning for Named Entity Recognition with Swedish Language Models

JOEY ÖHMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Active Learning for Named Entity Recognition with Swedish Language Models

JOEY ÖHMAN

Master’s Programme, Machine Learning, 120 credits


Date: June 24, 2021

Supervisors: Iolanda Leite, Felix Stollenwerk


Examiner: Joakim Gustafsson
School of Electrical Engineering and Computer Science
Host company: Arbetsförmedlingen
Swedish title: Aktiv Inlärning för Namnigenkänning med Svenska
Språkmodeller
© 2021 Joey Öhman

Abstract
The recent advancements of Natural Language Processing have cleared the
path for many new applications. This is primarily a consequence of the
transformer model and the transfer-learning capabilities provided by models
like BERT. However, task-specific labeled data is required to fine-tune these
models. To alleviate the expensive process of labeling data, Active Learning
(AL) aims to maximize the information gained from each label. By including a
model in the annotation process, the informativeness of each unlabeled sample
can be estimated and hence allow human annotators to focus on vital samples
and avoid redundancy.
This thesis investigates to what extent AL can accelerate model training
with respect to the number of labels required. In particular, the focus is on pre-
trained Swedish language models in the context of Named Entity Recognition.
The data annotation process is simulated using existing labeled datasets to
evaluate multiple AL strategies. Experiments are evaluated by analyzing the
F1 score achieved by models trained on the data selected by each strategy.
The results show that AL can significantly accelerate the model training
and hence reduce the manual annotation effort. The state-of-the-art strategy
for sentence classification, ALPS, shows no sign of accelerating the model
training. However, uncertainty-based strategies consistently outperform random
selection. Under certain conditions, these strategies can reduce the number of
labels required by more than a factor of two.

Keywords
Active learning, Named entity recognition, Language models, Natural language
processing, Bert, Swedish

Sammanfattning
Framstegen som nyligen har gjorts inom naturlig språkbehandling har möjliggjort
många nya applikationer. Det är mestadels till följd av transformer-modellerna
och lärandeöverföringsmöjligheterna som kommer med modeller som BERT.
Däremot behövs det fortfarande uppgiftsspecifik annoterad data för att finjustera
dessa modeller. För att lindra den dyra processen att annotera data, strävar aktiv
inlärning efter att maximera informationen som utvinns i varje annotering.
Genom att inkludera modellen i annoteringsprocessen, kan man estimera
hur informationsrikt varje träningsexempel är, och på så sätt låta mänskliga
annoterare fokusera på viktiga datapunkter.
Detta examensarbete utforskar hur väl aktiv inlärning kan accelerera
modellträningen med avseende på hur många annoterade träningsexempel
som behövs. Fokus ligger på förtränade svenska språkmodeller och uppgiften
namnigenkänning. Dataannoteringsprocessen simuleras med färdigannoterade
dataset för att evaluera flera olika strategier för aktiv inlärning. Experimenten
evalueras genom att analysera den uppnådda F1-poängen av modeller som är
tränade på datapunkterna som varje strategi har valt.
Resultaten visar att aktiv inlärning har en signifikant förmåga att accelerera
modellträningen och reducera de manuella annoteringskostnaderna. Den toppmoderna
strategin för meningsklassificering, ALPS, visar inget tecken på att kunna
accelerera modellträningen. Däremot är osäkerhetsbaserade strategier konsekvent
bättre än att slumpmässigt välja datapunkter. Under vissa förhållanden kan dessa
strategier reducera antalet annoteringar med mer än en faktor 2.

Nyckelord
Aktiv inlärning, Namnigenkänning, Språkmodeller, Naturlig språkbehandling,
Bert, Svenska

Acknowledgments
First, I want to thank Arbetsförmedlingen and my supervisor, Felix Stollenwerk,
for making this project possible and for his many hours of assistance. Secondly,
I want to express my gratitude towards Iolanda Leite, my academic supervisor
who has given me invaluable feedback on the thesis.
This thesis, and my five years at KTH, have been a long journey with many
ups and downs. I want to thank my friends and family for their everlasting
support, be it a place to live, helping me through tough periods, or aiding
me with motivating discussions. I want to thank my girlfriend, Jem Hippe-
Runsten, for her undying support in stressful times. Finally, many thanks to
Fredrik Carlsson for hosting amazing digital pool parties that helped so many
of us through the tough times of Covid-19 isolation.
I could not have come this far without all your support.

Stockholm, June 2021


Joey Öhman

Contents

1 Introduction
    1.1 Motivation
    1.2 Aim
    1.3 Research Question
    1.4 Research Methodology
    1.5 Delimitations
    1.6 Ethics and Sustainability
    1.7 Thesis Outline

2 Background
    2.1 Natural Language Processing
        2.1.1 Data Pre-processing
        2.1.2 Recurrent Neural Networks
        2.1.3 Word Embeddings
        2.1.4 Deep Contextualized Word Representations
        2.1.5 Transfer Learning Beyond Word Embeddings
        2.1.6 Transformers
        2.1.7 Bidirectional Encoder Representations from Transformers
        2.1.8 Swedish Language Models
        2.1.9 Named Entity Recognition
    2.2 Active Learning
        2.2.1 Bootstrapping
        2.2.2 Query by Uncertainty
        2.2.3 Query by Committee
        2.2.4 Batches & Diversity
        2.2.5 Sequential Data
        2.2.6 Seed Data

3 Related Work
    3.1 Active Learning & Named Entity Recognition
    3.2 Active Learning & BERT

4 Method
    4.1 Experiment Design
        4.1.1 Named Entity Recognition Datasets
        4.1.2 Pre-trained Models
        4.1.3 Acquisition Batch Sizes
        4.1.4 Active Learning Strategies
        4.1.5 Seed Model Training
        4.1.6 Stopping Criteria
        4.1.7 Bootstrapping Framework
        4.1.8 Nerblackbox
    4.2 Evaluation Technique
        4.2.1 One Dimensional Variation
        4.2.2 Experiment Score Matrix & Global Strategy Score
    4.3 Experiment & Data Validity
    4.4 Hardware Specification

5 Results and Discussion
    5.1 One Dimensional Variations
        5.1.1 Dataset
        5.1.2 Model
        5.1.3 Acquisition Batch Size
        5.1.4 Strategy
    5.2 Experiment Score Matrices
    5.3 Learning Curve Fractions
    5.4 Global Strategy Scores
    5.5 Sample-based Comparison with ALPS

6 Conclusions and Future work
    6.1 Conclusions
    6.2 Limitations & Future Work

References

A Hyperparameter Comparison
    A.1 ALPS Inspired vs Early Stopping

B Named Entity Recognition Examples
List of Figures

2.1 The composition of BERT with a task-specific neural network head on top.
2.2 Dataset with two classes, green and red. Unlabeled data pool (left) and corresponding labeled dataset (right).
2.3 Dataset with two classes, green and red. Decision boundaries for classifier trained with: randomly selected samples (left) and actively selected samples (right).
2.4 The bootstrapping process with Active Learning.
2.5 The process of estimating uncertainty/informativeness for a single sample in binary classification.
2.6 The process of estimating the uncertainty/informativeness of the sample "A terrific example", in a sequential multi-classification context.
3.1 ALPS selects samples by creating and clustering surprisal embeddings. The surprisal embeddings are created by using the Cross-Entropy Loss of the MLM pre-training objective. The embeddings are then normalized using L2 Normalization before they are clustered. Finally, k samples are selected by finding the closest sample to each cluster.
4.1 Class distributions (O-tag omitted) for the Swedish Named Entity Recognition datasets used in the experiments.
4.2 Experiment scores are defined by the Area Under the Curve (AUC) of the F1 score learning curve (left). The experiment matrix contains experiment scores in four dimensions. Fixing two allows for a two-dimensional visualization (right).
5.1 Dataset comparison for different contexts.
5.2 Pre-trained model comparisons for different strategies.
5.3 The number of samples selected by strategies through the iterations.
5.4 Acquisition batch sizes B compared with Avg-Marg for both datasets.
5.5 Acquisition batch sizes B compared with Ent-Marg for both datasets.
5.6 Acquisition batch sizes B compared with Ent-Max for both datasets.
5.7 Active Learning strategies S compared with acquisition batch size 500 for both datasets.
5.8 Active Learning strategies S compared with acquisition batch size 1000 for both datasets.
5.9 Active Learning strategies S compared with acquisition batch size 4000 for both datasets.
5.10 Relative improvement matrix with rows D and columns B. The bottom row shows the mean score for each batch size.
5.11 Relative improvement matrix with rows D and columns S. The bottom row shows the mean score for each strategy.
5.12 Relative improvement matrix with rows B and columns S. The bottom row shows the mean score for each strategy.
5.13 Average relative experiment score improvement over random for increasing fractions of the learning curves.
5.14 Average strategy scores ēS over all experiments.
5.15 The number of words selected per sample by strategies through the iterations.
5.16 ALPS strategy compared with Ent-Max and Random, with KB-BERT.
5.17 ALPS strategy compared with Ent-Max and Random, with AF-BERT.
A.1 Model hyperparameter performance for an increasing number of data labels. The x-axis shows the number of labeled samples beyond the initial seed dataset of 50 samples.
B.1 Samples from the Swedish NER Corpus test set. Ground truth samples compared with samples tagged by a fine-tuned KB-BERT model.

List of Tables

2.1 The size configurations of BERT.
4.1 Entities present in each dataset. The O-entity is omitted.
4.2 Hardware specifications of the experiment machines.

List of acronyms and abbreviations


AL Active Learning

AUC Area Under the Curve

BERT Bidirectional Encoder Representations from Transformers

BIO Beginning-Inside-Outside

ELMo Embeddings from Language Models

GLUE General Language Understanding Evaluation

GPU Graphics Processing Unit

LSTM Long Short-Term Memory

MLM Masked Language Modelling

NER Named Entity Recognition

NLP Natural Language Processing

RNN Recurrent Neural Network

SOTA State-Of-The-Art

TPU Tensor Processing Unit



Chapter 1

Introduction

The field of Natural Language Processing (NLP) lies at the intersection of
linguistics, computer science, and artificial intelligence. A critical goal of
NLP is that of information extraction, the process of analyzing text in order to
extract relevant information. An example of this is Named Entity Recognition
(NER), where pre-defined classes should be assigned to words in a document.
This often acts as a foundation for further analysis and is, for example,
used by Arbetsförmedlingen to extract information from job ads. However,
this requires some understanding of the language, partly because words can
have entirely different meanings in different contexts. The recent advancements of
NLP have revolutionized the language understanding of computers. These
advancements are primarily a result of the transformer language models [1]
and the transfer-learning capabilities that come with models like BERT [2].
Despite using pre-trained BERT models, domain-specific labeled data is
required to fine-tune the model on a particular downstream task, e.g. NER.
Labeling this data can be difficult and expensive, and an important research
direction is that of alleviating the dependency on dataset size. Also, training on
small datasets seems to correlate strongly with hyperparameter sensitivity [3].
Moreover, training a model with cloud computing can cost thousands of dollars
and hundreds of kg of carbon emission [4]. This immense data dependency has
motivated the rise of the field Active Learning (AL), which aims to iteratively
build a labeled dataset by allowing human annotators and machine learning
models to cooperate. The model can be trained on the available labels and
then estimate how informative each unlabeled sample is. Using the model to
estimate the information gain of samples can be done in many ways, referred
to as AL strategies. So, one can make informed decisions on which samples
to label and thus accelerate the learning of models trained on this dataset. The

important result of this is that the same model performance can be achieved
with fewer labeled samples.
A large portion of the AL strategies was developed before the era of deep
learning. Two important categories are Query by Uncertainty and Query by
Committee. The former refers to the process of estimating the informativeness
of unlabeled samples by using the model’s probability predictions and the
associated uncertainties. Often, these do not consider the diversity of the
chosen samples and are thus prone to choosing near-identical samples in a
batch setting. Query by Committee refers to ensemble methods and can
significantly improve the performance but does not solve the diversity problem
either. There are, however, diversity-based strategies that aim for samples that
contribute to a good data representation instead of only uncertainty.
When working with AL in conjunction with certain fields, the best strategy
may vary. For example, in deep learning, ensemble methods might not boost
performance enough to be worth the additional computational cost. Also, in
particular, for NLP tasks it can simply be the case that there are not enough
sufficiently diverse pre-trained models available. The increased amount of data
often implies larger batches and can render simple uncertainty-based strategies
useless. Another prominent example is NER, where a sequence of predictions
should be made for each sample. Therefore, many of the classic strategies need
to be extended. The conjunction of AL and NER is thoroughly investigated in
[5], although in the era before deep learning and transformer models.
While different AL strategies have been shown to vary in performance
over different tasks, some strategies seem to consistently achieve near State-
Of-The-Art (SOTA) performance. For example, BatchBALD [6], building
upon BALD [7], selects informative and diverse samples using concepts from
information theory. Some strategies are specific to deep learning, such as
Learning Loss [8], which adds an auxiliary learning output and objective to
the architecture. Another deep learning strategy, with similar motivation, is
BADGE (Batch Active learning by Diverse Gradient Embeddings) [9], which
compares samples in a hallucinated gradient space. Furthermore, ALPS [10]
is a method specific to pre-trained BERT models that creates surprisal
embeddings through a pre-training objective to find diverse and informative
samples.

1.1 Motivation
The majority of results within the field of deep learning suggests that model
performance will improve with more data and larger models. Due to the vast

costs of creating and using the data, a prominent research direction is that
of transfer learning. Transfer learning allows practitioners to reuse a trained
model for their specific task with reduced effort. The amount of task-specific
data needed is only a fraction of the data used to pre-train the model. However,
the task-specific data need to be labeled. This is often expensive or difficult but
can be remedied with AL. Instead of randomly selecting samples to label, AL
aims to select the most informative samples. Therefore, with a fixed labeling
budget, the model performance can be pushed further.
Countless remarkable applications of deep learning rely on having sufficient
labeled data. Relaxing this constraint can open up new use cases that were
previously infeasible. This is evident in many NLP tasks e.g. NER, where the
data annotation process can be particularly costly and time-consuming.
Minimizing the dependency on sizeable data is a central objective of
machine learning. It aims to solve a general problem that could aid both
commercial products and research projects. Furthermore, it could enable
the development of products that were previously infeasible and accelerate
research for projects that need manually annotated data.

1.2 Aim
The primary goal of this project is to estimate the potential of AL in conjunction
with Swedish language models and the NER task. More specifically, several
AL strategies are investigated and evaluated by varying multiple experiment
parameters e.g. dataset. This is described further in Chapter 4. The thesis
results and report can be used by anyone who wants to use pre-trained language
models for an NLP task and needs to create their own training data. In
particular, those interested in fine-tuning Swedish NER models might find this
work relevant.

1.3 Research Question


Summarizing the motivation and aim of this thesis leads to the following
research question:
“To what extent can active learning techniques accelerate the model
training of state-of-the-art Swedish language models and hence minimize the
manual annotation effort in the context of Named Entity Recognition?”

1.4 Research Methodology


The research question is investigated by conducting experiments with a meticulous
structure. Due to the inherent variation of AL strategy performance for
different problem contexts, experiments need to be done with several configurations.
This is done by testing each strategy with varying datasets, pre-trained models,
and acquisition batch sizes. Furthermore, each model is trained multiple times
to measure statistical uncertainties.
Results are analyzed visually by a wide variety of plots that expose different
behavior aspects of the strategies. To conclude the comparisons, aggregations
of all experiments will be done to achieve a scalar score for each strategy.
This results in a simple way of comparing the strategies’ overall experiment
performances.

1.5 Delimitations
Optimally, considering the research question, all interesting strategies would
be thoroughly investigated. This would indeed ensure that no optimal strategy
would be missed. However, this thesis is restricted to implementing a few
promising, yet feasible, strategies. Therefore, the results could be interpreted,
instead, as an estimated lower bound of the degree of acceleration that AL can
provide in this setting.
This project is limited to Swedish pre-trained language models; the equivalent
performance for other languages is not explored. Also, the only downstream
task investigated is NER. Furthermore, it could be relevant to see whether
the more informative samples require greater annotation effort. This, like the
other limitations, is left as future work.

1.6 Ethics and Sustainability


As has been mentioned briefly in the introduction of this thesis, lowering
the need for large datasets can have an important impact on environmental
sustainability. Training models on large datasets can be costly both in terms
of money and, more importantly, in terms of climate change. Enabling a model
to learn more efficiently on fewer samples, could alleviate these effects.
Furthermore, as AL could reduce the dependency on large datasets, this
could enable people and organizations with limited resources to utilize SOTA
models for their specific problems. Another aspect is the impact on existing
jobs in data annotation. Some could be threatened if the need for
annotated data is reduced. However, data annotation will still be an important
task and the goal is merely to make the process less tedious and redundant.
Moreover, as the process of annotating data and training performant models
becomes feasible for a larger number of practitioners, the rise of new projects
would also entail new jobs.

1.7 Thesis Outline


Each chapter of the thesis is briefly described as follows:

• Chapter 2 provides background on the various sub-fields that this
thesis builds upon. The background aims to be sufficient for the reader
to understand the subsequent chapters.

• Chapter 3 outlines related work.

• Chapter 4 covers the process of bootstrapping and how simulations can
be made to measure the performance of different AL strategies. It will
describe how experiments are structured and how different aspects of
the experiments are controlled.

• Chapter 5 presents and discusses the results of the experiments.

• Chapter 6 concludes this thesis and describes future work.



Chapter 2

Background

2.1 Natural Language Processing


The field of NLP aims to have computers analyze, manipulate or generate
natural language. The recent rapid progress has evolved NLP from limited
statistical methods that require careful pre-processing to deep neural networks,
yielding near human-level understanding and generation of text. The language
model architecture we use today is the result of an iterative process, where one
architecture was often built with inspiration from its predecessor. To grasp the
technology used today, it is often helpful to analyze the path that led us here.
The following subsection explores the recent evolution of language models,
reaching the current SOTA.

2.1.1 Data Pre-processing


The data used by language models come in the form of text. Before tackling the
particular NLP tasks, such as sentiment analysis, NER, or question answering,
one often attempts to reduce the problem complexity by pre-processing the
data. This is not unique to NLP and is done in most sub-fields of machine
learning. However, pre-processing differs for textual and numerical data.
For statistical approaches, a common first step in pre-processing is that
of punctuation removal. Another important step is tokenization, a process
of turning a string of text into a sequence of tokens. This is often done by
splitting on spaces to get a sequence of words. Some methods also prune the
data of stopwords (e.g. "the", "is", "at") as they carry little information and
can dominate frequencies in statistical methods.
Furthermore, it is often desirable to reduce the size of the vocabulary that
a language model needs to understand. Two techniques for this are stemming
and lemmatization. Stemming is the process of breaking down words to their
word stem. As an example, fishing, fished and fisher would all be reduced
to their stem fish. The stem is not necessarily a proper word, e.g. stemming
of argued to argu. In contrast, lemmatization aims to find the root word
using a dictionary-based approach and by analyzing the context. The required
pre-processing, however, varies for different problems and architectures.
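
To make these pre-processing steps concrete, the following is a minimal Python sketch using the NLTK library; the example sentence, the English stopword list, and the choice of NLTK are illustrative assumptions and not part of this thesis.

    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("stopwords")  # one-time resource downloads
    nltk.download("wordnet")

    text = "The fishers argued about fishing at the lake."

    # Punctuation removal and tokenization by splitting on spaces.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()

    # Stopword removal: drop high-frequency, low-information words.
    stop_words = set(stopwords.words("english"))
    content_tokens = [t for t in tokens if t not in stop_words]

    # Stemming: crude suffix stripping, e.g. "argued" -> "argu".
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in content_tokens])

    # Lemmatization: dictionary-based reduction to a root word, e.g. "argued" -> "argue".
    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(t, pos="v") for t in content_tokens])
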
Before the deep learning era, the traditional NLP relied heavily on pre-
processing for frequency methods such as Bag of Words. Of course, there
were sophisticated probability-based methods as well, using e.g. Hidden
Markov Models. Nevertheless, the SOTA today is strongly coupled with
neural networks.

2.1.2 Recurrent Neural Networks


Recurrent Neural Network (RNN)s are neural networks made for processing
sequential data. They possess a hidden state, which can maintain a context
from previous parts of the sequence. The hidden state is very limited for
vanilla RNNs, especially for dependencies in long sequences. To remedy
the vanishing gradient problem [11] and long inter-sentence dependencies,
Long Short-Term Memory (LSTM)s [12] are usually used. These networks
drastically improve the ability to grasp the context by using a memory cell and
trainable gates. The memory cell is decoupled from the output, as opposed
to the hidden state, so the network can keep information in memory while
possibly outputting unrelated predictions. The gates control what information
is stored in the cell, and what to output, e.g. how to combine input and memory
for predictions.
The vanilla RNN processes a sequence from left to right. A common
variant is the bidirectional RNN [13], which processes the sequence in
both directions to use context from the entire sequence when making each
prediction. This is frequently used in LSTMs as well.

2.1.3 Word Embeddings


When processing text, traditionally, a vocabulary is often used. Since the
input of neural networks must be represented with numerical values, one must
transform the words into numerical representations. For example, using the
vocabulary word index as a one-hot encoding. However, this is inefficient
as the word indices have no numerical meaning. This means that the model

is forced to memorize these indices, instead of learning how to utilize the


meaning of words.
Instead of using vocabulary indices, word embeddings represent words
with dense vectors. Words can be transformed from their one-hot encoding to
the word embedding by a small feed-forward network. This network can either
be trained simultaneously as part of the model or be a pre-trained network. The
pre-trained networks are trained with a self-supervised task, e.g. Word2Vec
[14, 15], using large text corpora such as Wikipedia. Another pre-trained
variant of word embeddings, trained with a similar technique, is that of GloVe
[16]. Furthermore, the embeddings support vector arithmetic. A classic
example of this is:

Emb(king) − Emb(man) + Emb(woman) ≈ Emb(queen)

These word embeddings are then often used in conjunction with RNNs as
word representations. This gives the model the advantage of having a smaller
input size (dense vectors) while incorporating the meaning of words in the
model input. Word embeddings take NLP a significant step towards transfer
learning, which is a major goal within the field.
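As an illustration of this vector arithmetic, the short sketch below uses the gensim library with a small pre-trained GloVe model; the gensim downloader and the model name "glove-wiki-gigaword-50" are assumptions made for the example only.

    import gensim.downloader as api

    # Load a small pre-trained GloVe model (downloaded on first use).
    model = api.load("glove-wiki-gigaword-50")

    # Emb(king) - Emb(man) + Emb(woman) should land close to Emb(queen).
    result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
    print(result)  # "queen" is expected to appear among the nearest neighbours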

2.1.4 Deep Contextualized Word Representations


An obvious flaw of the word embeddings is that they are context-independent.
This means that a word has a fixed representation in every sentence. This is
especially problematic for words that have different meanings in different
contexts, e.g. "bat", "desert", and "lie".
To address the problem of word ambiguities, Embeddings from Language
Models (ELMo) [17] provides word representations that take the context of a
word into account. It accomplishes this by using a deep bidirectional LSTM
with sentences as input. Each word embedding is transformed into hidden
states while propagating through the layers. So, for each layer of the stacked
LSTM, each word gets a new representation. Finally, a linear combination
of these hidden states is used to create a contextualized word embedding for
each word. The ELMo network is trained with a self-supervised task on large
corpora of text and constitutes another important milestone in NLP transfer
learning.

2.1.5 Transfer Learning Beyond Word Embeddings


Contextualized word representations are useful for applications that need to
work with language. However, the sequence of word embeddings still needs
to be processed, e.g. by another LSTM, entailing that a new model needs
to be trained from scratch, although the problem complexity is reduced with
contextualized word embeddings.
Computer vision has been ahead of NLP in terms of transfer learning
for many years, with immense convolutional neural networks pre-trained
on datasets with tens of thousands of images that can be fine-tuned for specific
tasks. Motivated by this, Universal Language Model Fine-tuning (ULMFiT)
[18] was published, a language model and a process to fine-tune it for various
downstream tasks. This takes NLP close to computer vision in terms of
transfer learning.

2.1.6 Transformers
While the contextualized word embeddings provided by ELMo were a major
step from word-independent embeddings, long-term relationships, e.g. over
several sentences, are still difficult to model. Also, the sequential nature of
RNNs does not allow model training to fully utilize the high-end computing
hardware. Graphics Processing Unit (GPU)s can perform an immense number
of floating-point operations per second, given that the operations can be done
in parallel, i.e. there are no sequential dependencies. While RNNs do come
with matrix operations that can be done efficiently on the GPU, one word has
to be processed before processing the next. These shortcomings of the LSTM
are primarily what motivates the Transformer [1] architecture.
Transformer models build upon the encoder-decoder architecture. This
means that an encoder network encodes the input to an intermediate representation,
which the decoder is trained to decode. A common example is that of machine
translation, a sequence-to-sequence task where the input text is encoded into a
vector representation which is then decoded to the target language. Using an
encoder-decoder architecture allows the source and target sentences to differ
in length. Furthermore, this architecture is practical for multi-modal data. For
example, image caption generation [19], where an input image can be encoded
to a vector and then decoded into a text sequence.
The Transformer model consists of an encoder and a decoder. The encoder
in turn is a stack of encoder blocks, and the decoder a stack of decoder blocks.
These blocks are slightly different and will be covered one by one, starting
with the encoder.

Encoder
The encoder block is made of two layers, a self-attention layer and a feed-
forward network. Each element of the input sequence will have its own path
through the encoder, starting with a word embedding layer as explained above.
These paths are not independent and share information via the self-attention
mechanism.

Self-attention
The self-attention layer allows words to be associated with each other, e.g.
when a word refers to another or resolving word-level ambiguities with the
help of context. It does this by first calculating query, key and value vectors for
each embedding xi , by multiplying it with the corresponding learned weight
matrices:

$$ q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V \tag{2.1} $$
To gain some helpful intuition, one can think of the queries, keys, and
values as an analogy with retrieval systems. The query matrix is used to
transform the input into queries that can be matched against the keys that,
after a few additional steps, are multiplied with the values to provide the
embeddings with context.
These vectors are of a fixed dimensionality dk = 64, smaller than the
embedding vectors dx = 512, to keep the computational complexity of these
operations mostly constant. Scores are calculated by taking the dot product of
a word’s query vector with all key vectors. The score represents how much
focus each word should dedicate to the other words. The score is then divided
by the square root of the key vector dimensionality dk and passed through the
softmax function (σ), resulting in the softmax score sσ :

$$ s^{\sigma}_{ij} = \sigma\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) \tag{2.2} $$
At this point, there is a scalar score sσij for each word embedding pair that
represents how relevant xj is for xi . This scalar, which now contains a value
between 0 and 1, is multiplied with the corresponding value vector vj . An
interpretation of this is that we keep the value vectors of relevant words, but
filter out irrelevant words i.e. with softmax scores of 0. Now, each source word
position i has scaled value vectors corresponding to each target word position
j. To incorporate information from other word embeddings into the current,

the target vectors are aggregated. The final output of the self-attention layer,
for word position i, is the sum of all these scaled value vectors:
$$ z_i = \sum_j s^{\sigma}_{ij} v_j \tag{2.3} $$

This output zi is then propagated to the succeeding feed-forward network.


So far, the self-attention process has been explained in terms of vectors.
In practice, the self-attention layer is computed with matrix operations, making it
highly performant on GPUs.
The embedding vectors are packed as rows into an embedding matrix X.
Now to get the queries, keys, and values, we multiply the embedding matrix
with the corresponding learned weight matrices:

$$ Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V \tag{2.4} $$


Note that this is the matrix form of equation 2.1. Each row in the resulting
matrices corresponds to a word position, as with the embedding matrix. To
get the dot products q_i · k_j, one can simply multiply the query matrix Q with
the transposed key matrix K^T:

$$ S = QK^T \tag{2.5} $$
In the resulting matrix, each element corresponds to the score s_ij. To get
the normalized softmax scores, we do an element-wise division by √d_k (here 8,
since d_k = 64) followed by a softmax:

$$ S^{\sigma} = \sigma\!\left(\frac{S}{\sqrt{d_k}}\right) \tag{2.6} $$
Here, the elements Sσij exactly correspond to the softmax score sσij achieved
in the vector equation 2.2 above. The next steps are to scale and sum the
value vectors, see equation 2.3, which are both handled with just one matrix
multiplication:

$$ Z = S^{\sigma} V \tag{2.7} $$
This entire chain of operations, which constitutes the self-attention layer, can
be formulated as one compact formula:

$$ Z = \sigma\!\left(\frac{XW^Q (XW^K)^T}{\sqrt{d_k}}\right) XW^V \tag{2.8} $$
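
Equation 2.8 translates almost directly into code. The following NumPy sketch computes single-head self-attention for a toy input; the dimensions and the random weight matrices are illustrative assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_Q, W_K, W_V):
        """Single-head scaled dot-product self-attention (equation 2.8)."""
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # queries, keys, values (eq. 2.4)
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # scaled dot products (eq. 2.5-2.6)
        return softmax(scores) @ V            # weighted sum of values (eq. 2.7)

    rng = np.random.default_rng(0)
    seq_len, d_x, d_k = 5, 512, 64            # toy sequence of 5 embeddings
    X = rng.normal(size=(seq_len, d_x))
    W_Q, W_K, W_V = (rng.normal(size=(d_x, d_k)) for _ in range(3))
    Z = self_attention(X, W_Q, W_K, W_V)
    print(Z.shape)  # (5, 64): one context-mixed vector per input position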

Multi-headed Attention
Self-attention allows the word positions to attend to and incorporate information
from other positions. However, one could imagine that the model would
benefit from being able to focus on word positions differently for different
aspects. This is part of the motivation for multi-headed attention. The entire
self-attention function is duplicated into several (originally 8) instances, called
heads, each with its own weight matrices W^Q, W^K, and W^V.
Since each attention head yields one matrix Z, there are several matrices now,
instead of just one. To aggregate the information into one matrix, the matrices
are concatenated and multiplied with another learned weight matrix WO . The
output matrix then has the same size as the original Z matrix.

Positional Encodings
With the self-attention mechanism described above, the model has no way of
telling where in the sentence different words lie, which is naturally important
to understand sequences. To encode this in the word embeddings, a positional
encoding is added to the embedding vector. These vectors are originally
created using a combination of two functions, one using sine and the other
cosine. However, the positional encoding vectors can be generated in different
ways. This allows the model to see a systematic difference between word
embeddings at different positions and learn how to interpret them.
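For reference, the two functions used in the original Transformer [1] are the following sinusoids, where pos is the token position, i the dimension index, and d_model the embedding dimensionality:

$$ PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$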

Residual Connections and Layer Normalization


The input of the self-attention layer, e.g. the positionalized embeddings in
the first encoder block, is added to the output of the layer via a residual
connection. This sum is then passed through a layer normalization step [20],
which can be seen as an alternative to batch normalization [21], commonly
used to accelerate and stabilize the training. These residual connections are
used in the feed-forward layer, and the sub-layers of the decoder as well.
This describes in detail what happens in a single encoder block. In the
encoder, where these blocks are stacked, the same process is repeated
throughout the entire stack. However, the word embedding and positional
encoding only happen in the input of the first block. Lastly, the encoder output
is transformed into memory attention vectors Kmem and Vmem , which will be
used by the decoder.

Decoder
The decoder block is formed by three layers, the self-attention layer, the
encoder-decoder attention layer, and the feed-forward layer. All of which have
the residual connections with a sum and layer normalization step. The output
of the decoder is passed to a feed-forward network with a softmax output,
where each output element represents the probability of the corresponding
word.
The decoder works in time steps, generating one token at a time. There
are two inputs at each time step: the encoder output vectors Kmem and
Vmem, and the tokens decoded up until the current time step. So, when the
encoder output has been generated, the decoder starts decoding at the first
time step, getting only a special start token as input along with the encoder
vectors. The self-attention works as in the encoder with one exception: it is
only allowed to attend to previous positions of the decoder input, i.e. what
has been decoded so far. The succeeding positions are masked by being
set to −∞. The next layer, the encoder-decoder attention, works like the
multi-headed self-attention in the encoder, but it generates the query matrix Q
from the previous layer and uses the key and value matrices from the encoder
output vectors. This is where the decoder processes information from the
encoder and input sequence. Finally, the feed-forward layer works as in the
encoder and creates the output of the decoder, which is then passed to the
feed-forward softmax classifier network that outputs token predictions. These
token predictions are represented as a vector of the same dimensionality as
the vocabulary, i.e. number of known words. Each element is a softmax
probability and the argmax token is the Transformer prediction. The predicted
word is then used in the input of the next decoding time step. This process
repeats until the model outputs a special end of sequence token, which marks
the decoded sequence complete.

Training and Inference


The model is trained using backpropagation and a labeled training dataset.
The supervised learning objective is to predict the correct one-hot encodings
for each word in the sentence. During inference the output could be decoded
greedily, meaning that in each time step, the word with the highest probability
is used. Another common method, that is often used in sequence generation,
e.g. image captioning [19], is that of beam search. Where several possible
decoded sentences are kept as candidates while decoding. A decoding step is
done for each candidate and then the most promising ones are kept for the

next time step. When all candidates have reached an end of sequence token,
the best sequence can be selected as the final output, e.g. the one with the
highest total probability.
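The greedy variant can be sketched as a short Python loop; decode_step below is a hypothetical placeholder for a model-specific function that returns next-token probabilities, and is not part of any library or of this thesis.

    def greedy_decode(decode_step, encoder_memory, start_id, end_id, max_len=50):
        """Greedy decoding: at every step, keep only the most probable token.

        decode_step is a hypothetical callable mapping (encoder_memory,
        decoded_so_far) to a probability vector over the vocabulary."""
        decoded = [start_id]
        for _ in range(max_len):
            probs = decode_step(encoder_memory, decoded)
            next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
            decoded.append(next_id)
            if next_id == end_id:  # special end-of-sequence token
                break
        return decoded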

Summary and Extensions


This section on the Transformer was deeply inspired by J. Alammar’s blog
post "The Illustrated Transformer" [22]. For visualizations and more details,
this is a fine resource.
The Transformer was the first sequence to sequence model based entirely
on attention, replacing the recurrent layers that have long been the default
for encoder-decoder architectures. The authors showed that the Transformer
can be trained for translation tasks significantly faster than recurrent or
convolutional architectures while still achieving a new SOTA. However, while
the Transformer is well suited for machine translation, it is not evident how to
handle other NLP tasks with it, e.g. sentence classification, or how to pre-train
a Transformer for transfer learning so that it can be fine-tuned for other
downstream tasks.
One way to apply the Transformer architecture to other tasks, and make it
possible to transfer knowledge to different tasks, is to only use the decoder. The
OpenAI Transformer [23] stacks twelve decoder blocks, but with the encoder-
decoder attention layer omitted since there is no encoder in this model. The
decoder is pre-trained using language modelling, i.e. predicting the next word.
This is a fitting choice for the decoder, as it already masks future positions
when generating text. The pre-training is a self-supervised learning phase,
where large corpora of texts, e.g. Wikipedia and books, are used and the model
learns to predict the next word given the previous context. When the model
has been pre-trained it can be fine-tuned for different downstream tasks by
adding a classifier on top of it, e.g. a linear and softmax layer. The OpenAI
Transformer achieved SOTA performance on 9 out of the 12 datasets studied,
using this transfer learning approach.

2.1.7 Bidirectional Encoder Representations from Transformers


The OpenAI Transformer was pre-trained using forward language modelling.
However, a desirable modification would be to train the model with bidirectional
language modelling. This would allow the model to condition predictions
on both left and right contexts. To achieve this, Bidirectional Encoder
Representations from Transformers (BERT) [2] instead uses the encoder side
of the Transformer in conjunction with a masked language modelling task.

Table 2.1 – The size configurations of BERT.

                 #blocks    hidden_size    #heads    #parameters (millions)
    BERT_BASE       12          768           12            110
    BERT_LARGE      24         1024           16            340

Architecture and Interface


BERT is essentially the encoder side of the Transformer with some modifications.
The first input token position to the model is a special [CLS] (classification)
token. The succeeding token positions are reserved for the input text sequence.
All input tokens are propagated in their own path through the encoder stack
as usual. The last encoder block outputs a vector of size hidden_size for
each token position, including the [CLS] position, whose purpose is to serve
as input to a sentence classifier added on top. A common mistake is to use
this [CLS] output as a sentence representation, which is not recommended
by the authors. Creating good sentence representations [24, 25, 26] is a task
of its own. However, the authors propose ways to extract contextualized word
embeddings from BERT, by aggregating the encoder block outputs throughout
the encoder stack.
The model comes in two sizes, BERTBASE and BERTLARGE . They vary
in number of encoder blocks, hidden layer size in the feed-forward networks,
number of attention heads, and thus in total number of trainable parameters.
The configuration values of the two sizes can be seen in Table 2.1. There is a
clear pattern that indicates that larger models perform better, given sufficient
data.

Pre-training
The pre-training of BERT consists of two self-supervised tasks, one on word-
level and one on sentence-level. The addition of a sentence-level task is
motivated by the fact that there are various downstream tasks that require the
model to handle the relationship between two text sequences.
The word-level pre-training objective is that of Masked Language Modelling
(MLM). It consists of masking 15% of the input tokens, replacing 80% of these
with the [MASK] token, 10% with a random token, and 10% with the original
token. The model is then trained to predict the correct tokens for the masked
positions, using a classification layer added on top of the output of each token
position. Part of the intuition that led the authors to these rules is that the model
must learn good representations for both masked and non-masked tokens. The

bidirectional MLM task entails slower model convergence than unidirectional


equivalents, due to only training on 15% of the input tokens. However, the
authors showed that the bidirectional approach outperforms the unidirectional
model after only a few pre-training iterations.
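
The 80/10/10 masking rule can be sketched in a few lines of Python; the token ids, the mask id, and the use of -100 as an ignore index for the loss are illustrative assumptions following common practice, not code from this thesis.

    import random

    def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
        """BERT-style masking: of the 15% selected positions, 80% become
        [MASK], 10% a random token, and 10% keep the original token."""
        inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored in the loss
        for i, tok in enumerate(token_ids):
            if random.random() < mask_prob:
                labels[i] = tok                       # predict the original token here
                r = random.random()
                if r < 0.8:
                    inputs[i] = mask_id               # replace with [MASK]
                elif r < 0.9:
                    inputs[i] = random.randrange(vocab_size)  # random token
                # else: keep the original token unchanged
        return inputs, labels
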
The sentence-level pre-training objective is then, given two input sentences,
to predict the likelihood of the second one being the subsequent sentence in
the original data. This way, the model learns to handle a high-level view of
the input sequences. To distinguish between the sentences, since they are
concatenated with only a [SEP] token between them, a positional segment
encoding is added to the input along with the Transformer’s positional token
encoding and token word embedding. Whether the second sentence follows
the first is a binary classification problem. To allow the model to output
predictions, a simple classification layer is added on top of the [CLS] token
output. The model is pre-trained with these two tasks together, minimizing
their combined loss function.

Fine-tuning
When the model is pre-trained, it can be shared with anyone who desires to
fine-tune it for their particular downstream task. Fine-tuning BERT models
are in many cases simple and does not require much data as the model is
already pre-trained language understanding. This makes it possible to reuse
the results of the immense pre-training process (16 Tensor Processing Unit
(TPU)s running for 4 days for BERTLARGE ), saving time and resources,
allowing SOTA models to be trained on commodity hardware in a reasonable
time.
In many cases, fine-tuning is only a matter of adding a small layer to the
pre-trained model. Figure 2.1 briefly illustrates the composition of BERT and
a custom model on top. A common example is that of sentence classification,
e.g. spam detection or sentiment analysis, which only requires a classifier
added on top of the [CLS] token, similar to the sentence prediction pre-
training task. In the case of NER, each output token vector is simply passed
through a shallow classification network to predict the token entity. In
conclusion, BERT provides a way to reuse trained language models for a wide
variety of downstream tasks with little effort.

Summary and Extensions


With BERT, the authors achieved new SOTA on eleven NLP tasks. They
pushed the General Language Understanding Evaluation (GLUE) score [27] by 7.7 percentage points (absolute).

Figure 2.1 – The composition of BERT with a task-specific neural network head on top.

Furthermore, the authors explain


that their results can be replicated in an hour on a single Cloud TPU or in
a few hours on a GPU. With this, the field of NLP has arrived at its equivalent
of computer vision's ImageNet moment in transfer learning.
The BERT model has been widely used by practitioners since it was
published in 2018. Furthermore, a new research field of improving BERT
has emerged. Some have modified the architecture or pre-training objective,
to push the SOTA even further [28, 29, 30], while others have scaled down
the model significantly without sacrificing much language understanding
[31, 32, 33].

2.1.8 Swedish Language Models


When working with a language other than English, there are two common
language model approaches. One is to use a multilingual language model, such as M-
BERT. The other is that of language-specific models, pre-trained on corpora
in the particular language. For Swedish, there are two pre-trained language-
specific BERT models available: AF-BERT and KB-BERT [34]. The former
was trained by Arbetsförmedlingen (the Swedish Public Employment Service),
and the latter was trained by the National Library of Sweden. KB-BERT
outperforms both AF-BERT and M-BERT in Swedish NLP tasks and is,
therefore, the SOTA Swedish language model. The model is cased and has
been pre-trained on a total of 3500M words, whereas AF-BERT has been
trained on merely 300M words.
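Both Swedish models are distributed as pre-trained checkpoints that can be loaded with the Hugging Face transformers library. The sketch below assumes the published KB-BERT checkpoint name "KB/bert-base-swedish-cased"; the example sentence is arbitrary.

    from transformers import AutoModel, AutoTokenizer

    # Load the pre-trained Swedish KB-BERT encoder and its tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
    model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")

    inputs = tokenizer("Kalle bor i Stockholm.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)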

2.1.9 Named Entity Recognition


There are a wide variety of tasks within the field of NLP. One common task
is that of NER, a sub-task of information extraction about entity identification
and classification. For example, finding which tokens correspond to a person
or location in the following sentence: "Barack Obama lives in America". This
example also demonstrates that an entity can be composed of more than one
word or token.
There are many use cases of NER, ranging from general applications
of information extraction for analysis to specific tasks such as anonymizing
documents. It can act as a pre-processing step for other downstream tasks
such as question answering or search algorithms. Consequently, it could be
seen as one of the most important NLP tasks.
The NER task is often formulated as a token-level multi-class classification
problem, meaning that a model processes a sequence of text and assigns a pre-
defined entity class to each token. Previously, the standard entity-tag format
has been to represent an entity simply as the entity class, with no consideration
of whether it is a continuation of an entity or the start of a new entity. This can
cause ambiguities, especially in processed data where stop words are omitted.
To address this problem, the Beginning-Inside-Outside (BIO) format [35]
represents entity tags as B-entity or I-entity (or O if no entity is assigned to the
token). B-entity represents the beginning of an entity, the left-most token of the
entity. I-entity represents an entity token inside the entity, a continuation of the
current entity. So, for our previous example "Barrack Obama lives in America"
we would get different entity sequences for different tag representations. The
plain representation would result in "PER PER O O LOC" while BIO would
yield "B-PER I-PER O O B-LOC".

Evaluation
While NER models are usually trained as a token-level multi-class classification
task with sequence to sequence data, the evaluation is often done on entity-
level. This means that for the plain tag format, consecutive tokens with the
same tag will be considered as one entity. In the case of the BIO format,
entities are well-defined. Since a multi-token entity can be partly correct, there
are multiple ways of evaluating a model. A common way to handle this is to
treat any entity with any incorrect tokens as an incorrect entity prediction.
Furthermore, as with any machine learning problem, there are multiple
metrics for model performance. In NER, F1 scores, harmonic means of
precision and recall, are often used. However, there are multiple ways to

calculate the F1 score, precision, and recall. The Language-Independent NER


task [36], introduced at CoNLL-2003, defines precision as the percentage of
predicted named entities that are correct, and recall as the percentage of named
entities in the data that are found. This definition also treats partly
correct entities as errors. The F1 score is then defined as:

$$ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.9} $$
This results in an F1 score for each entity class. To aggregate these
F1 scores into one final F1 score, a micro- or macro average can be used.
The macro average will simply calculate the mean F1 score, considering
the relevance of each class equal. Micro average is instead defined using a
weighted average, where each class is weighted by the fraction of the total
number of samples belonging to the class. So, macro average gives equal
importance to each class, whereas micro average gives equal importance to
each sample.
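In practice, strict entity-level precision, recall, and F1 are often computed with the seqeval library; using seqeval here is an assumption for illustration and not the evaluation setup of the thesis.

    from seqeval.metrics import classification_report, f1_score

    # Gold and predicted BIO tag sequences, one list per sentence.
    y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
    y_pred = [["B-PER", "O",     "O", "O", "B-LOC"]]

    # Entity-level scores: the partly correct PER entity counts as an error.
    print(f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))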

Named Entity Recognition with BERT


To use BERT models for NER, each token position output vector is passed
into a classification network that outputs entity predictions. The prediction
loss is then used for backpropagation through the classification and pre-
trained BERT models, to train the classification network and fine-tune the
BERT network. This can be seen as one composite model that predicts entities
for each input token and uses shared weights in the entity classifier network.
The final entities can then be created from the token predictions, i.e. combine
consecutive token tags of the same entity to one entity.
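This composite model corresponds to a token-classification head on top of a pre-trained encoder. A minimal sketch with the Hugging Face transformers library follows; the checkpoint name and the BIO label set are assumptions made for the example, and the classification head is randomly initialized until fine-tuned.

    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # example BIO label set
    model_name = "KB/bert-base-swedish-cased"            # assumed KB-BERT checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Pre-trained BERT encoder plus a token classifier on top.
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(labels)
    )

    inputs = tokenizer("Barack Obama bor i Amerika.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                  # (1, seq_len, num_labels)
    predictions = logits.argmax(dim=-1)
    # The head is untrained here, so these tags are not meaningful yet.
    print([labels[i] for i in predictions[0].tolist()])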

2.2 Active Learning


In the standard supervised learning setting, a labeled dataset is used. When
there is no labeled data available, one must manually label data before training
a model. This process can be both expensive and time-consuming. In the
case of pool-based sampling, the common scenario of having a large pool of
unlabeled data, a subset of the data is selected for labeling (see Figure 2.2).
This is often done by selecting random samples, referred to as passive learning.
Instead, AL aims to select the most informative data instances with the help
of a model, to accelerate the learning process and require fewer labels. Figure
2.3 illustrates how active learning can affect the labeled dataset and the resulting classifier.

Figure 2.2 – Dataset with two classes, green and red. Unlabeled data pool
(left) and corresponding labeled dataset (right).

This is done in an iterative process referred to as Bootstrapping.

2.2.1 Bootstrapping
Bootstrapping aims to assist the human annotators by pre-tagging the unlabeled
samples for them. This is done by iteratively annotating data, training a model,
and pre-tagging the remaining unlabeled samples with the model. In the later
iterations, the model predictions are often adequate and the annotators' task
transitions from annotation to correction. This bootstrapping process is extended
to use AL by not only pre-tagging the samples that should be annotated, but also
selecting these samples. The process is briefly illustrated in Figure 2.4 and
formally described in Algorithm 1.
How to quantify the informativeness of samples is an open research
question. There are many strategies for it, and their suitability may vary between
tasks. This section focuses on classification tasks, but much of what is covered
here applies to regression tasks as well. It briefly covers the relevant
areas of AL; for a more thorough overview of the field, the literature survey
by B. Settles [37] is a fine place to start.

2.2.2 Query by Uncertainty


Query by uncertainty is the method of letting a model quantify its uncertainty
about unlabeled samples. An illustration of how this is done is shown in
Figure 2.5. When uncertainties are assigned to all samples, the most uncertain
instances can be selected for labeling. This can accelerate model training and
data annotation by avoiding redundant samples. There exist many strategies
for quantifying uncertainty.

Figure 2.3 – Dataset with two classes, green and red. Decision boundaries for
a classifier trained with: randomly selected samples (left) and actively selected
samples (right).


Least Confidence is a strategy where the uncertainty is defined as 1 − max(p),
where p is the categorical probability distribution over labels. This means that
the samples with the least confidence in their most likely label are selected.
Margin Sampling can be seen as an extension of Least Confidence. While
Least Confidence only considers the maximum probability of each sample,
Margin Sampling considers the probabilities of the two most likely labels, p_1
and p_2. Its uncertainty is defined as 1 − (p_1 − p_2), which means that the
samples with the smallest probability difference between their two most likely
labels are selected.
Another, more general but not necessarily better, strategy is Entropy
Sampling. It utilizes the entire probability distribution by using Shannon's
entropy. The uncertainty is defined as the entropy, −\sum_i p(x_i) \log p(x_i),
and the strategy thus aims to select the samples with the greatest entropy.
These strategies are explored in more detail in Chapter 4.
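
As a concrete illustration, the three measures can be computed from a single predicted class distribution as in the following NumPy sketch; the probability vector is made up for the example, and this is not the thesis implementation.

import numpy as np

p = np.array([0.5, 0.3, 0.2])                 # hypothetical class probabilities

least_confidence = 1.0 - p.max()              # 1 - max(p)
p_sorted = np.sort(p)[::-1]
margin = 1.0 - (p_sorted[0] - p_sorted[1])    # 1 - (p1 - p2)
entropy = -np.sum(p * np.log2(p))             # Shannon entropy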

2.2.3 Query by Committee


Query by Committee uses ensemble learning to utilize multiple base learners
to quantify uncertainties. By using diverse base models, the uncertainty can
be defined through the disagreement within the committee, since high
disagreement should imply high uncertainty.

Figure 2.4 – The bootstrapping process with Active Learning.

Algorithm 1: Bootstrapping with Active Learning

Input: Unlabeled dataset D_U, Base Model M_B, Acquisition Batch Size B, Strategy S, Labeling Budget L
Output: Labeled dataset D_L, Trained Model M_T

Query oracle for initial seed dataset D_S from D_U
Let D_L = D_S
M_T = Train(M_B, D_L)
while |D_L| < L do
    D_S = SelectInformativeSamples(D_U, M_T, S, B)
    D_S' = Pre-tag(D_S, M_T)
    D_S'' = Query oracle for labels to D_S'
    Move the newly labeled instances D_S'' from D_U to D_L
    M_T = Train(M_B, D_L)
return D_L, M_T
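
A minimal Python sketch of Algorithm 1 is given below. It is not the thesis framework code; all helper functions (train, select_informative_samples, pre_tag, query_oracle) are hypothetical stand-ins for the components described in the text.

def bootstrap_with_al(seed, unlabeled, base_model, batch_size, strategy, budget,
                      train, select_informative_samples, pre_tag, query_oracle):
    labeled = list(query_oracle(seed))                 # initial seed dataset D_S
    model = train(base_model, labeled)                 # M_T
    while len(labeled) < budget:
        batch = select_informative_samples(unlabeled, model, strategy, batch_size)
        suggestions = pre_tag(batch, model)            # model pre-tags the batch
        corrections = query_oracle(suggestions)        # annotator corrects the tags
        unlabeled = [s for s in unlabeled if s not in batch]
        labeled.extend(corrections)
        model = train(base_model, labeled)             # retrain on all labeled data
    return labeled, model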

Most of the standard ensemble machine learning concepts apply to query by
committee. However, as with query by uncertainty, there are different strategies
for quantifying disagreement, for example entropy-based disagreement and
the Kullback-Leibler divergence. Query by committee is not investigated in
this thesis, as it is less applicable to pre-trained language models and would
entail infeasibly expensive training sessions.

2.2.4 Batches & Diversity


In real use cases, data will rarely be labeled one sample at a time. Instead,
a batch of B instances will be queried for annotation each iteration (see
Algorithm 1). When working with batches, the diversity of the selected
samples can be crucial. If diversity is not taken into account when selecting
a batch of samples, the result could be worse than random selection, since the
batch could be filled with nearly identical samples of high uncertainty.

Figure 2.5 – The process of estimating uncertainty/informativeness for a single
sample in binary classification.
The problem of diversity is illustrated in the BatchBALD paper [6]. The
authors propose an AL strategy, inspired by information theory, that selects
diverse and informative samples in a batch setting. Another promising strategy
is BADGE [9], which uses loss gradients to find diverse and informative
samples. Its authors show that their strategy selects samples with diverse
gradients of high magnitude, and argue that a high-magnitude gradient indicates
that a sample is informative. Since no labels are available when selecting
samples, the gradients are hallucinated, i.e. computed by assuming the label
that the model favors.

2.2.5 Sequential Data


When working with NLP, data samples are often composed of a sequence
of tokens. In NER, there is a multi-class classification for each token in a
sample, see Figure 2.6. Therefore, it is not always trivial how to apply AL
strategies to quantify the informativeness of samples. For example, the query
by uncertainty strategies can be applied to each token individually, e.g. Least
Confidence would yield 1 − max(p) for each token. However, this results in a
sequence of uncertainty scores while the goal is to assign a scalar uncertainty
score to the entire sample. One natural way to do this is to take the sum of
the token uncertainty scores of each sample. An important side effect is that this
strategy strongly favors longer sentences. While this is easily solved by taking
the mean token uncertainty score instead, it hints at an underlying problem.

Figure 2.6 – The process of estimating the uncertainty/informativeness of the
sample "A terrific example", in a sequential multi-classification context.
An informative sample is one that provides much new information to the
model and dataset. However, the ultimate goals of AL are to accelerate the
model learning and to minimize the annotation effort, and it is not clear what
this means for selecting textual data samples. Nevertheless, it seems reasonable
to aim for high informativeness per word or entity, rather than per sample.
Otherwise, a strategy that simply selects the longest samples would perform
significantly better than random, but would not lighten the burden of the
human annotators.
So, sequential data opens up a wide variety of ways to quantify the
informativeness of a sample. Uncertainty strategies can be applied on either
token-level or sample-level, and different strategies can be combined. For
example, one can use the Least Confidence strategy to get the maximum
probabilities and then calculate the entropy of these maximum probabilities;
this strategy is referred to as logprob in [5]. Furthermore, the same strategy
could be used on both token-level and sample-level, e.g. taking the entropy of
the entropies. In conclusion, sequential data introduces much to explore in
terms of AL.

2.2.6 Seed Data


Before starting the iterative part of the bootstrapping process, an initial seed
dataset D_S must be labeled (see Algorithm 1). This seed dataset should be
representative of the data distribution of the unlabeled data pool and should
include all classes therein. If a class is omitted from the initial seed dataset, the
model could interpret instances of it as another class. This could potentially
lead to a model never considering such instances and hence this class will be
omitted from the final labeled dataset as well. However, F. Olsson investigates
the impact of seed dataset size and clustering methods for selecting the seed
set in the context of NER but observes no significant improvement with larger
seed sizes nor over random selection [5].

Chapter 3

Related Work

There is little research done in the combination of AL, NER, and Swedish
language models. So, any work done within AL in conjunction with NLP
is considered related work, with particular focus on NER and pre-trained
language models.
The conjunction of AL and NLP has seen much research in the last few
decades. C. A. Thompson et al., before deep learning, show that AL can accelerate
the model training of Information Extraction and Semantic Parsing [38]. G.
Tur et al. demonstrate that AL can reduce the annotation effort required in
spoken language understanding by a factor of 2 [39, 40]. B. Settles et al. present
an AL framework for multiple-instance machine learning and show that AL
can significantly improve the performance in this sub-field [41].

3.1 Active Learning & Named Entity Recognition


There has been much focus on the NER task in the AL field. A thorough
investigation of the conjunction of AL and NER was done in the Ph.D. thesis
by F. Olsson [5]. However, the recent advancements of deep learning and
NLP have called for a revisit of many methods. B. Settles et al. proposed
strategies that advanced the AL SOTA for the Sequence Labeling task [42]. S.
Peshterliev et al. conducted human-in-the-loop case studies and simulations
on datasets and found that AL can provide a statistically significant improvement
for both Intent Classification and NER [43]. D. Shen et al. show that AL
with multi-criteria (informativeness, representativeness, diversity) for NER
can reduce labeling costs by at least 80% [44].
A. Siddhant et al. explore deep Bayesian AL and demonstrate its potential
to consistently accelerate the performance of Sentence Classification, NER,
and Semantic Role Labeling [45]. Moreover, the authors observe that while
Bayesian approaches consistently perform best among their strategies, basic
uncertainty-based strategies significantly outperform random sample selection.
Y. Shen et al. explore the combination of deep AL and NER [46] using
a composite model made of two convolutional neural network encoders and
an LSTM decoder. AL has also been explored in the clinical context:
Y. Chen et al. [47] use a conditional random field [48] classifier for entity
tagging in medical records. Furthermore, cost-aware AL has been investigated
by Q. Wei et al. [49], where the annotation costs are thoroughly considered to
ensure that the annotation effort is minimized. All of these authors observe that
AL has the potential to accelerate the annotation process in the context
of NER.

3.2 Active Learning & BERT


Since the revolution of BERT, AL with these language models has not
yet received much attention. L. Ein-Dor et al. and D. Grießhaber et al.
investigate the potential of AL in conjunction with pre-trained BERT models
(but not for NER) [50, 51]. Their results suggest that AL is feasible for BERT
models. Furthermore, they observe that the performance of AL strategies varies
with the dataset and that freezing part of the model can improve performance in
a low-resource setting.

Related Strategies
An optimal Active Learning strategy considers not only the uncertainty or
informativeness of individual samples, but also the diversity within a batch.
Since practical settings use batches, uncertainty-based strategies are prone to
selecting redundant samples and hence perform poorly in a realistic scenario.
An intuitive way of modeling diversity is to measure the distance between
samples in input space. Uncertainty-based strategies could then be used to
weight the samples so that informativeness is also considered. Clustering these
weighted samples, e.g. with K-Means, and sampling data points close to the
cluster centers could generate diverse and informative batches.
However, in many fields it does not make sense to measure the distance in
input space. Moreover, uncertainty-based strategies in deep learning rely on
the confidence estimates of the neural network softmax probabilities, which
are poorly calibrated and often over-confident [52]. Instead, some modern
strategies choose to rely on deep neural network embeddings. For example,
a recent SOTA strategy, BADGE [9], uses hallucinated gradient embeddings
to encode samples, outperforming earlier top-performing strategies such as
the diversity-based Coreset [53] and the adaptive ALBL [54]. By representing
all unlabelled samples in an embedding space, the distance between samples in
this space can model diversity, whereas the magnitude of the embeddings
can act as a proxy for uncertainty. Finding clusters in this space implies
finding samples with large distances between them, which is a proxy for both
diverse and informative samples.
M. Yuan et al. [10] introduce a new AL strategy, ALPS, specific to BERT
models. Their strategy addresses the cold-start problem of AL, i.e. estimating
informativeness before knowing the domain, by using part of the pre-training
objective to create model surprisal embeddings. These embeddings are clustered,
similar to BADGE [9], to find diverse and informative samples. Figure 3.1
illustrates how ALPS selects samples by clustering surprisal embeddings.
The most prominent difference between ALPS and BADGE is that BADGE
uses hallucinated gradients instead of the MLM loss. ALPS seems to slightly
outperform BADGE on the Sentence Classification task, while also outperforming
it significantly in terms of computational cost.

Figure 3.1 – ALPS selects samples by creating and clustering surprisal
embeddings. The surprisal embeddings are created using the Cross-Entropy
Loss of the MLM pre-training objective. The embeddings are then normalized
using L2 Normalization before they are clustered. Finally, k samples are
selected by finding the closest sample to each cluster.
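
To make these steps concrete, the following is a rough sketch of an ALPS-like selection step, not the authors' official implementation. It computes a per-token MLM cross-entropy vector as the surprisal embedding (without the random masking used in the paper), L2-normalizes it, clusters the embeddings with K-Means, and picks the sample closest to each cluster center; the checkpoint name and helper functions are assumptions for illustration.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
mlm_model = AutoModelForMaskedLM.from_pretrained("KB/bert-base-swedish-cased")
mlm_model.eval()

def surprisal_embeddings(sentences, max_length=64):
    """Per-token MLM cross-entropy, used as a fixed-length surprisal vector."""
    enc = tokenizer(sentences, padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = mlm_model(**enc).logits                  # (N, T, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), enc["input_ids"],
                         reduction="none")                # (N, T)
    ce = ce * enc["attention_mask"]                       # ignore padding
    return F.normalize(ce, p=2, dim=1)                    # L2-normalize rows

def alps_like_select(sentences, k):
    emb = surprisal_embeddings(sentences).numpy()
    centers = KMeans(n_clusters=k, n_init=10).fit(emb).cluster_centers_
    dists = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return sorted(set(dists.argmin(axis=0).tolist()))     # nearest sample per cluster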

In conclusion, much research has been done on the conjunction of AL and
NLP. Recent studies have started to explore the potential of AL with pre-
trained BERT models, with ALPS advancing the AL SOTA in the sentence
classification task. In fact, ALPS slightly overlaps with this thesis. However,
the contribution of this work is instead to explore AL for NER with Swedish
language models.

Chapter 4

Method

The potential of AL can be explored via simulations in any context that has
relevant datasets, as is commonly done in the literature, e.g. [9, 10, 46, 47, 49]. In
this thesis, the process is simulated in the context of NER for Swedish language
models, using Swedish pre-trained BERT models and NER datasets. The
AL bootstrapping process of iteratively selecting the most informative
data for labeling and training, see Figure 2.4 and Algorithm 1, is simulated
by replacing the oracle with an existing labeled dataset. This is achieved by
simply hiding the labels from the model and returning them when the model
queries the oracle. Model performance is then measured after each iteration
and compared with passive learning for the same training data size.

4.1 Experiment Design


To thoroughly investigate the research question, experiments vary along
four dimensions: NER dataset D, pre-trained BERT model M, acquisition
batch size B, and AL strategy S. This is to remedy the high variance that the
strategies exhibit across problem contexts. Thus, an experiment is defined by the
tuple (D, M, B, S), similar to the experiments in [9]. As a consequence, a valid
view of the potential of AL in the context of NER for Swedish language models
can be obtained.

4.1.1 Named Entity Recognition Datasets


Datasets for the NER task can vary in size and complexity, and the experiments
are limited to the few existing Swedish NER datasets. Furthermore, these
must be of sufficient size, yet not infeasibly large. The datasets used in the
experiments are the following:

• Swedish NER Corpus∗, a sentence-level corpus of news from Swedish
newspapers' websites, with 4 entity classes and over 9000 samples, using the
plain tag format.

• Swe-NERC [55]†, a sentence-level corpus with mixed data sources: blog
posts, different social forums, news texts, medical journals, and Wikipedia
war history. It contains about 8000 samples with 8 entity classes. The BIO
tag format is supported for this dataset, but the plain format is used in the
experiments for simplicity.

The full names of the entities present in the datasets can be found in Table
4.1. The distributions of these classes with the O-tag omitted can be seen in
Figure 4.1. The Swe-NERC dataset contains a larger number of classes with
more complexity than Swedish NER Corpus. Examples from Swedish NER
Corpus can be found in Appendix B, where the ground truth is compared with
predictions from a fine-tuned model.

Table 4.1 – Entities present in each dataset. The O-entity is omitted.

Swedish NER Corpus       Swe-NERC
Location (LOC)           Organisation (GRO)
Person (PER)             Event (EVN)
Miscellaneous (MISC)     Time Entity (TME)
Organization (ORG)       Person (PRS)
                         Symptom (SMP)
                         Treatment (MNT)
                         WorkOfArt/Product (WRK)
                         Location (LOC)

∗ https://github.com/klintan/swedish-ner-corpus
† https://spraakbanken.gu.se/lb/resurser/swe-nerc/

Figure 4.1 – Class distributions (O-tag omitted) for the Swedish Named Entity
Recognition datasets used in the experiments.

4.1.2 Pre-trained Models


The pre-trained language models used are the following, both with 110M
parameters:

• AF-BERT [56], a Swedish BERT-base uncased model trained on 300M words

• KB-BERT [34], a Swedish BERT-base cased model trained on 3500M words

4.1.3 Acquisition Batch Sizes


Larger batch sizes can increase the efficiency of the bootstrapping process
but can cause problems for simple uncertainty-based strategies that do not
consider batch diversity. The scale of this problem also depends on the
dataset: many near-identical samples, e.g. in a large dataset, might fill an
entire batch if the strategy deems them uncertain. To gain an overall view
of the performance of AL and its strategies, multiple acquisition batch sizes
are investigated.

Choice of Measure
The acquisition batch size can be defined in multiple ways. The most natural
way is to use a sample-based measure, e.g. that B = 100 implies 100 samples
per batch. However, this leads to issues with many strategies. Examples are
strategies that measure the uncertainty of a sample with the total uncertainty
for all its tokens, as they will favor longer samples. Using this sample-based
batch size definition, the annotation effort will likely not be reduced with
the lower number of samples annotated, since the selected samples are often
longer instead. Furthermore, using the sample-based definition, the strategy
of simply choosing the longest samples often outperforms random (passive
learning) significantly.
Consequently, this thesis resolves to use word-based batches, commonly
used in recent literature [46, 47]. In practice, this means that samples selected
by the AL strategy are added to the batch one by one until the batch size B
has been reached. This definition should enhance the correlation between the
observed strategy performance and the reduction in human annotation effort.
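
A minimal sketch of this word-based batch construction is given below; it is an illustration under the assumption that samples are token lists and that a per-sample informativeness score has already been computed, not the framework's actual code.

def select_word_batch(samples, scores, word_budget):
    """Add the highest-scoring samples one by one until the word budget is reached."""
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    batch, words = [], 0
    for i in ranked:
        if words >= word_budget:
            break
        batch.append(i)
        words += len(samples[i])
    return batch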
If B is too low, the experiment becomes unrealistic and infeasible. On the
other hand, if it is too high, the potential benefit of AL fades without increasing
the annotation efficiency significantly. Therefore, the acquisition batch sizes
explored in the experiments are limited to B = 500, B = 1000, and B = 4000 words.

4.1.4 Active Learning Strategies


Sample Representation
Samples are defined as a tuple (x, p) containing the tokens and the tag
probabilities, each of length T. The tag probabilities correspond to the
predictions of the model and form a discrete probability distribution over K
classes for each token. So, the element p_{tj} is the predicted probability of
token t being of class j.

Strategy Definitions
To gain an understanding of the potential of AL, multiple strategies are
explored since their performance has been observed to be problem-dependent.
The random strategy, passive learning, is used as a baseline in the experiments.
A strategy must evidently outperform the random baseline to make AL useful.
The operations for simple uncertainty-based strategies can be performed on
either token-level or sample-level. For example, Ent-Max refers to taking the
maximum probability on token-level for all tokens and then aggregating these
on sample-level by calculating their entropy (this strategy is named Logprob
in [5]). Following this naming convention, the strategies investigated, and how
they measure informativeness, are listed below:

• Ent-Max: Omit all tokens which have not been predicted as an entity,
i.e. O-tags. Let p_t^{max} = max(p_t) be the maximum probability of token t.
The Ent-Max uncertainty is then the entropy of these maximum probabilities:

  I_{EntMax}(p) = -\sum_{t=1}^{T} p_t^{max} \log_2 p_t^{max} \qquad (4.1)

• Ent-Marg: Omit all tokens which have not been predicted as an entity,
i.e. O-tags. Let p_t^{marg} be the absolute difference between the two greatest
probabilities of token t. The Ent-Marg uncertainty is then the entropy of these
margin probabilities:

  I_{EntMarg}(p) = -\sum_{t=1}^{T} p_t^{marg} \log_2 p_t^{marg} \qquad (4.2)

• Avg-Marg: Let p_t^{marg} be the absolute difference between the two greatest
probabilities of token t. The uncertainty is then 1 minus the average of these
(a code sketch of these three strategies is given after this list):

  I_{AvgMarg}(p) = \frac{1}{T} \sum_{t=1}^{T} (1 - p_t^{marg}) = 1 - \frac{1}{T} \sum_{t=1}^{T} p_t^{marg} \qquad (4.3)

• ALPS [10]: Selects samples using K-Means clustering of surprisal
embeddings created from the MLM pre-training objective of BERT. By
clustering surprisal embeddings, ALPS can select diverse and uncertain
batches. This strategy is explained in greater detail in Chapter 3.
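
The following NumPy sketch shows the three uncertainty-based strategies for one sample represented as a (T, K) array of token-level class probabilities. It is an illustration rather than the framework code, and the index of the O-tag is an assumption.

import numpy as np

def ent_max(p, o_index=0):
    p = p[p.argmax(axis=1) != o_index]           # drop tokens predicted as O
    p_max = p.max(axis=1)
    return -np.sum(p_max * np.log2(p_max))       # Equation 4.1

def ent_marg(p, o_index=0):
    p = p[p.argmax(axis=1) != o_index]
    top2 = np.sort(p, axis=1)[:, -2:]            # two greatest probabilities
    p_marg = top2[:, 1] - top2[:, 0]
    return -np.sum(p_marg * np.log2(p_marg))     # Equation 4.2

def avg_marg(p):
    top2 = np.sort(p, axis=1)[:, -2:]
    p_marg = top2[:, 1] - top2[:, 0]
    return 1.0 - p_marg.mean()                   # Equation 4.3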

4.1.5 Seed Model Training


Due to statistical uncertainties and inherent stochasticity, an identical pre-
trained model M can vary in performance after training on the same data.
Therefore, for each combination (D, M ) a seed model is trained on the seed
dataset of D. Then, all experiments (D, M, B, S) with this dataset D and
model M start post-seed training with the same trained seed model. As a
result, all experiments with the same dataset and pre-trained model start with
the same conditions after seed training.
The initial seed dataset size is selected as small as possible while still
preserving a reasonable representation of the class distribution. Instances of
all classes are present in the seed dataset. The seed dataset size is set to 50
samples for each dataset.

4.1.6 Stopping Criteria


To focus the experiments where the potential of AL is prominent, experiments
are stopped just before passive learning fully converges. A labeling budget
L is chosen as the minimum number of labels required such that passive
learning reaches 98% of its final F1 score. So, for any combination (D, M )
a passive model is trained on the entire training set with a sufficiently small
acquisition batch size to identify how much data was required to converge.
Then, this labeling budget L can be determined for this combination (D, M ),
and experiments can be conducted with any batch size B and strategy S. This
procedure is highly similar to how the labeling budget is handled by the authors
of BADGE [9].

4.1.7 Bootstrapping Framework


The framework used for the bootstrapping experiments was developed by
Arbetsförmedlingen AI-Center. This work extends this framework, e.g. with
AL strategies and the word-based acquisition batch size. The ultimate purpose
of this framework is to be used for the annotation of Arbetsförmedlingen’s job
ads [57].

4.1.8 Nerblackbox
The experiments fine-tune the pre-trained models using the PyTorch-based
Python library nerblackbox [58]. Nerblackbox takes as input a pre-trained
Transformer-based model and a NER dataset and outputs performance metrics
and the fine-tuned model. The library supports useful features such as early
stopping, error estimation through multiple runs, and easy configuration of
hyperparameters. Moreover, it supports the datasets used in the experiments
as built-in. These features are utilized heavily in the experiments, and in each
iteration when new data has been added, a model is fine-tuned from scratch on
the current subset of the dataset using nerblackbox.

Hyperparameters
The models are fine-tuned with a maximum sequence length of 128 and a batch
size of 32 (not to be confused with acquisition batch size) for a maximum of
50 epochs with early stopping. The AdamW optimizer [59, 60] is used with a
constant learning rate of 2e-5, β1 = 0.9 and β2 = 0.999. These hyperparameters
are inspired by the closely related work ALPS [10]. However, the low number
of epochs used there does not allow models to converge on small datasets,
i.e. in early bootstrapping iterations. Increasing the maximum number of epochs
is justified by the conclusions drawn in [61, 62], and using early stopping
makes the training robust also for larger datasets, i.e. later bootstrapping
iterations. A brief comparison of these hyperparameter configurations in this
context is presented in Appendix A.
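
For reference, the optimizer configuration described above can be expressed in plain PyTorch as in the short sketch below; the thesis itself configures these values through nerblackbox, so this is only an illustrative assumption.

import torch

MAX_SEQ_LENGTH = 128
TRAIN_BATCH_SIZE = 32
MAX_EPOCHS = 50            # combined with early stopping

def make_optimizer(model):
    return torch.optim.AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999))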
Each fine-tuning session is repeated 5 times, and the mean F1 score \bar{f}_1
and the standard error of the mean \sigma_{\bar{f}_1} are reported [58], see Equation 4.4:

  \sigma_{\bar{f}_1} = \frac{\sigma_{f_1}}{\sqrt{n}} \qquad (4.4)

where \sigma_{f_1} is the sample standard deviation and n is the number of samples.
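
As a small illustration of Equation 4.4, the reported quantities can be computed as follows; the F1 values are made up for the example.

import numpy as np

f1_runs = np.array([0.78, 0.80, 0.79, 0.81, 0.77])     # hypothetical repeated runs
f1_mean = f1_runs.mean()
f1_sem = f1_runs.std(ddof=1) / np.sqrt(len(f1_runs))   # standard error of the mean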

4.2 Evaluation Technique


An experiment, (D, M, B, S), yields an F1 score (micro average over classes)
learning curve with mean F1 scores and uncertainties for each iteration in the
bootstrapping simulation. These F1 scores and other simulation data can be
evaluated in numerous ways. This section covers the evaluation methods used
in this report.

4.2.1 One Dimensional Variation


The experiments are defined through the tuple (D, M, B, S). By fixing three
of these, the last can be examined by plotting the learning curves. For instance,
one can fix D, M, and B and plot different strategies to compare them in that
context. This enables a visual overview of performance comparisons in a
particular setting.

4.2.2 Experiment Score Matrix & Global Strategy Score


The F1 score learning curve F_{1,DMBS} that results from an experiment
(D, M, B, S) is aggregated into a scalar by calculating the area under the
curve (AUC). The AUC is normalized such that the x-axis ranges from zero to
one. An experiment score is defined as shown in Equation 4.5:

  e_{DMBS} = \mathrm{AUC}_{norm}(F_{1,DMBS}) \qquad (4.5)
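
A small sketch of this normalized-AUC experiment score, assuming the number of labels and the mean F1 score at each bootstrapping iteration are available, could look as follows (an illustration, not the thesis code).

import numpy as np

def experiment_score(num_labels, f1_scores):
    x = np.asarray(num_labels, dtype=float)
    x = (x - x.min()) / (x.max() - x.min())      # normalize the x-axis to [0, 1]
    return np.trapz(np.asarray(f1_scores), x)    # area under the F1 learning curve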



These experiment scores are then stored in a four-dimensional matrix, with
one dimension for each variable (Equation 4.6):

  E := (e_{DMBS}) \qquad (4.6)

Figure 4.2 – Experiment scores are defined by the AUC of the F1 score
learning curve (left). The experiment matrix contains experiment scores
in four dimensions. Fixing two allows for a two-dimensional visualization
(right).

Figure 4.2 depicts the definition of the experiment score and the experiment
score matrix. This experiment score matrix holds compact, partly aggregated
information about the experiment results. Taking the mean over a particular
axis, or several, enables the analysis of the average performance in the remaining
dimensions. For instance, taking the mean over the M and B axes results in
a two-dimensional matrix that depicts the average performance of strategies
for different datasets. These elements are defined as in Equation 4.7:

  \bar{e}_{DS} = \frac{1}{N_M N_B} \sum_{M,B} e_{DMBS} \qquad (4.7)

Furthermore, the scores can be represented as relative improvement scores.
While this can reduce the interpretability of the numbers in each element, it
presents the advantage over random in a concise way. Relative improvement
scores are defined, for example, as in Equation 4.8, with R denoting the Random
strategy. The reason for subtracting 1 is simply to obtain a concise number.

  \bar{e}_{DS}^{rel} = \frac{1}{N_M N_B} \sum_{M,B} \frac{e_{DMBS}}{e_{DMBR}} - 1 \qquad (4.8)

More importantly, taking the mean over all axes but the S-axis results
in a vector of global strategy scores that can be seen as the overall strategy
performances throughout the experiments. The individual strategy scores are
defined in Equation 4.9:

  \bar{e}_S = \frac{1}{N_D N_M N_B} \sum_{D,M,B} e_{DMBS} \qquad (4.9)

The random strategy is included in the experiments and thus in this vector,
to enable a simple comparison of the strategies with passive learning. This
vector may act as the final verdict of AL’s potential in the context of the NER
task with Swedish language models.
For a deeper analysis, the strategy performances can be examined for
different fractions of the learning curves. For example, a fraction of 50%
means that only the first half of the learning curve is considered when
calculating the experiment score. This yields measures of the acceleration
potential of AL strategies for different labeling budgets L.

4.3 Experiment & Data Validity


The experiment design and the majority of the evaluation techniques used in
this thesis are common in the literature. Consequently, the experiments and
results should be valid. However, several aspects could potentially threaten
the confidence of conclusions drawn from these experiments:

• The datasets used are not necessarily representative of realistic use cases.

• The analysis examines the F1 score per labeled word. This does not
perfectly model annotation effort and could mean that we are essentially
only optimizing the annotation effort indirectly.

• The robustness of conclusions increases with the number of experiments.
Adding more datasets, models, acquisition batch sizes, and AL strategies
would provide better insight into the potential of AL in this context.
However, this work is limited in time and resources and, therefore,
conducts only the feasible and most promising experiments.

4.4 Hardware Specification


The experiments are executed on two GPU machines. Hardware specifications
for these are shown in Table 4.2.

Table 4.2 – Hardware specifications of the experiment machines.

         Machine 1                                  Machine 2
CPU      Intel(R) Xeon(R) E5-2620 v4 @ 2.10GHz      E5-2690 v3 @ 2.60GHz
RAM      60 GiB                                     500 GiB
GPU      Nvidia Quadro P5000                        Nvidia Titan RTX
VRAM     16 GiB                                     24 GiB

Chapter 5

Results and Discussion

This chapter covers the experiment results, accompanied by discussions. The
structure follows the experiment outline described in Section 4.2. First, one
dimension of D, M, B, and S is varied at a time to inspect the sensitivity of the
AL performance in different contexts. Secondly, the experiment score matrix
is analyzed from different points of view. To summarize the uncertainty-based
strategy analysis, the global strategy scores are presented and compared.
Lastly, a brief sample-based analysis is done to examine the performance of
ALPS.

5.1 One Dimensional Variations


In this section, the individual properties D, M, B, and S are varied, one at
a time, to identify promising AL conditions. All plots in this section have the
number of labels on the x-axis, which represents the number of labels beyond
the 50 samples (about 900 words) in the initial seed dataset. The uncertainty
intervals depicted in the figures show two standard errors in each direction,
with the standard error calculated as shown in Equation 4.4.

5.1.1 Dataset
The labeling budget L of the dataset Swedish NER Corpus was selected as
in Chapter 4. However, due to long experiment times and constraints in
resources and time, the labeling budget of Swe-NERC was chosen as 26,000
words to match the budget of Swedish NER Corpus. The Swe-NERC dataset
is inherently more difficult and, consequently, its original labeling budget is
much higher.

Figure 5.1 illustrates the performance differences between the datasets for
different settings. The gap between the curves is larger for the Random
strategy. This is arguably because the model approaches convergence on
Swedish NER Corpus and because AL accelerates the training more for Swe-
NERC.

Figure 5.1 – Dataset comparison for different contexts.

5.1.2 Model
Due to the time and resource constraints of this thesis, the only pre-trained
model used consistently throughout the final AL experiments is KB-BERT.
Only a few experiments are carried out for AF-BERT to gain some intuition
on the difference in the behavior of different pre-trained models. Figure 5.2
illustrates the significant performance advantage of the Swedish SOTA model
KB-BERT. Since the curves do not seem to fundamentally differ, apart from
the additional advantage of AL for AF-BERT, the conclusions drawn from the
experiments with KB-BERT could hold for AF-BERT as well. The reason for
the extra acceleration observed with AF-BERT could be that the task is more
difficult for inferior models.

Figure 5.2 – Pre-trained model comparisons for different strategies.



5.1.3 Acquisition Batch Size


Before examining the strategy performances for different batch sizes, the
definition of the acquisition batch size, see Chapter 4, is analyzed. Figure 5.3
depicts the sample lengths that the strategies favor. The x-axis represents the
number of labeled words, i.e. the iterations executed with word-based batches.
The y-axis shows the number of labeled samples that each strategy has queried.
The entropy-based strategies favor longer samples whereas the average-based
strategy prefers shorter samples.

Figure 5.3 – The number of samples selected by strategies through the


iterations.

Figure 5.4 – Acquisition batch sizes B compared with Avg-Marg for both
datasets.

Figure 5.5 – Acquisition batch sizes B compared with Ent-Marg for both
datasets.

Figure 5.6 – Acquisition batch sizes B compared with Ent-Max for both
datasets.

The effect of changing the acquisition batch size is presented in Figure


5.4, Figure 5.5, and Figure 5.6. In these figures, the linear interpolation
between the points can be particularly misleading and the focus should lie on
the markers. A lower acquisition batch size is inherently better for AL but even
more crucial for uncertainty-based strategies. First, a lower batch size means
that AL has more room for iterative acceleration. Secondly, this observation is
aligned with the literature in that uncertainty-based strategies perform worse
for larger batch sizes. This is particularly evident at the 4000-word mark for
the entropy-based strategies with Swedish NER Corpus, and for the Avg-Marg
strategy with Swe-NERC.

5.1.4 Strategy

Figure 5.7 – Active Learning strategies S compared with acquisition batch size
500 for both datasets.

Figure 5.8 – Active Learning strategies S compared with acquisition batch size
1000 for both datasets.

Figure 5.9 – Active Learning strategies S compared with acquisition batch size
4000 for both datasets.

Figure 5.7, Figure 5.8, and Figure 5.9 depict performance comparisons of the
AL strategies for different contexts. The results indicate that the uncertainty-
based AL strategies consistently outperform random throughout the majority
of the iterations. The performance gain from AL is higher in the early stages
of the learning curve. This is expected since the model is still far from
convergence and hence has much to learn.
What is more surprising is the rather large improvement over random for
uncertainty-based strategies in a batch setting. For example, see the 4000-
word mark for Swedish NER Corpus in Figure 5.7 (left). At this point, all
these uncertainty-based strategies achieve significantly higher F1 test scores
of around 0.8 instead of Random’s 0.73. Random selection does not achieve
matching performance until it has queried 12 000 words. As a result, to achieve
an F1 test score of 0.8 in this context, uncertainty-based AL can reduce the
number of labeled words required by almost a factor of 3. For Swe-NERC, the
increase in absolute F1 score is greater but the corresponding reduction factor
seems to be around 2. In general, AL seems capable of reducing the number of
labeled words required by at least a factor of 2 in this context for sufficiently
small batch sizes, e.g. 500 words.

5.2 Experiment Score Matrices


The experiment score matrices contain aggregated information about the
experiments with elements defined as relative improvement scores (Equation
4.8). These scores give an overview and easy comparison of the target
dimensions. However, the actual scores are difficult to interpret, as reducing
the labeling budget L will result in higher values. The reason for this is that
the area under the curve differs the most between active and passive learning
at the beginning of the experiments.
Nevertheless, Figure 5.10 illustrates that the acceleration provided by AL
is greater for Swe-NERC than Swedish NER Corpus. Furthermore, it indicates
that smaller acquisition batch sizes increase the performance significantly.

Figure 5.10 – Relative improvement matrix with rows D and columns B. The
bottom row shows the mean score for each batch size.

Figure 5.11 primarily shows three things. First, each strategy performs
better on Swe-NERC than on Swedish NER Corpus. Secondly, the strategy scores
are similar, but Avg-Marg slightly outperforms the other two. Lastly, all
strategies significantly outperform Random.

Figure 5.11 – Relative improvement matrix with rows D and columns S. The
bottom row shows the mean score for each strategy.

Figure 5.12 shows that the strategies have similar performance while Avg-
Marg slightly outperforms the other strategies consistently, for all tested batch
sizes.

Figure 5.12 – Relative improvement matrix with rows B and columns S. The
bottom row shows the mean score for each strategy.

5.3 Learning Curve Fractions


The average acceleration potential observed in the experiments for increasing
fractions is presented in Figure 5.13. By only considering early sections of the
learning curve (smaller fractions), a smaller labeling budget L is simulated.
With larger labeling budgets L (larger fractions), the final advantage of
AL decreases as the model is approaching convergence and the number of
overlapping training samples increases. The key take-away from this is that
the potential of AL is most prominent for truly low-resource scenarios.

Figure 5.13 – Average relative experiment score improvement over random for
increasing fractions of the learning curves.

5.4 Global Strategy Scores


Figure 5.14 depicts the overall experiment score performance of the strategies.
These bars represent another, absolute, view of the same information as
the mean row in Figure 5.12. It suggests that Avg-Marg is the overall
best-performing strategy in these experiments, with a slight advantage over
the other uncertainty-based AL strategies and a significant advantage over
Random.

Figure 5.14 – Average strategy scores ēS over all experiments.



5.5 Sample-based Comparison with ALPS


Some more sophisticated strategies, e.g. those that use clustering of samples
to achieve diversity, are not compatible with the word-based batch setting used
in the experiment framework. Therefore, the analysis of ALPS is shown with
sample-based batches and required new experiments. The entire experiment
framework is not repeated for the sample-based approach here, and only a few
plots are included to present the performance of ALPS.

Figure 5.15 – The number of words selected per sample by strategies through
the iterations.

Figure 5.16 – ALPS strategy compared with Ent-Max and Random, with KB-
BERT.

Figure 5.17 – ALPS strategy compared with Ent-Max and Random, with AF-
BERT.

The number of words selected per sample is illustrated in Figure 5.15. The
bias towards longer samples is present in ALPS, much like Ent-Max. Since
ALPS aims to select samples with surprising language, it could be the case that
intricate and diverse language is often found in longer samples. Figure 5.16
illustrates the performance of ALPS in comparison to Random and Ent-Max
with KB-BERT. Moreover, Figure 5.17 presents the equivalent for AF-BERT
and shows that no fundamental differences are observed for this model. These
figures indicate that, for this downstream task and these datasets, uncertainty-
based strategies outperform ALPS significantly. Furthermore, ALPS does not
show an evident pattern of even outperforming Random.

Chapter 6

Conclusions and Future work

This chapter starts by briefly summarizing the thesis. It then discusses the
conclusions drawn from the results and addresses the research question.
Lastly, the limitations of the thesis and future work are described.

6.1 Conclusions
In this thesis, the potential of AL strategies has been examined for SOTA
Swedish language models in the downstream task NER. Experiments for
several strategies have been conducted on two Swedish NER datasets with
the Swedish SOTA language model KB-BERT. A brief comparison was
made with AF-BERT to verify that KB-BERT performs better and that no
fundamental differences in strategy performance were present. The strategy
experiments were conducted with three different acquisition batch sizes, and
the results include both fine-grained learning curves and aggregated strategy performances.
The results indicate that AL can indeed accelerate the model training significantly
and hence reduce the human annotation effort. Furthermore, the bootstrapping
process allows for pre-tagging the unlabeled samples with the model, potentially
reducing the annotation effort further.
To explicitly address the research question: the extent to which AL can
accelerate model training for SOTA Swedish language models in the context of
NER is significant. The experiment results illustrate that for certain conditions,
the number of labels required can be reduced by more than a factor of two. The
uncertainty-based strategies explored in this thesis all outperformed random
selection, with little variation in performance between them. Nevertheless,
one important difference in behavior was observed. Strategies that aggregate
sequence uncertainty by using the mean, seem to be biased towards shorter

samples whereas the entropy aggregation inherently favors longer samples.


This could be particularly important for real use cases. The reason for
the insignificant performance variation between uncertainty-based strategies
could be that each maximizes uncertainty in roughly the same way while
ignoring batch diversity.
The reason for the limited performance of ALPS in this context is not
evident. Since the original ALPS source code was used in the experiments,
with only minor modifications to fit the experiment framework, the implementation
should be correct. Furthermore, the ALPS algorithm, like Ent-Max, seems
biased towards longer sentences, which could indicate that it is indeed working
as expected. The poor performance could be due to the nature of the
downstream task. Perhaps language understanding is not the major issue here,
but learning the task and meaning of tags. The authors of ALPS did not
consider other tasks than sentence classification.

6.2 Limitations & Future Work


While AL has been explored for decades, the recent research effort directed
towards the field does not seem to match its importance and potential. This
thesis contributes to remedying this shortage. However, several things are left
for future work.
To achieve statistically significant results and eliminate statistical uncertainties,
experiments should be repeated. The experiments in this thesis include
multiple training sessions and evaluations in each iteration. This alleviates a
vital dimension of statistical uncertainty. However, another source of statistical
uncertainty that is important for AL is that of data selection. While the
uncertainty-based strategies are deterministic given the model predictions,
random selection and ALPS can yield different data batches for different runs.
Therefore, an optimal design would also repeat the data acquisition at each
iteration for such strategies. While the observed potential of AL in this context
remains valid, this would have further increased the accuracy of the results.
Also, new pre-trained models should be used to increase the robustness of
the experiment results. In particular, SOTA multilingual pre-trained language
models such as XLM-RoBERTa [63] would provide potential variations in
strategy behavior and interesting performance comparisons. More importantly,
other SOTA strategies could enhance the acceleration potential of AL further.
An interesting future candidate is BADGE, since it is similar to ALPS in many
ways, but optimizes the downstream task at hand instead of utilizing cold-starts
with the pre-training objective. This could be treated as a trade-off between

cold-starts and optimizing for the final task and the outcome of this trade-
off could vary for different tasks. Furthermore, the fact that it currently is a
pure trade-off could indicate that future strategies will revolutionize the field
of AL. An interesting start could be to combine the strategies by using cold-
starts with ALPS and after some threshold switching to BADGE. Furthermore,
instead of selecting samples with surprising language like ALPS, the pre-
trained model could be further pre-trained on the unlabeled data pool, before
starting the bootstrapping and fine-tuning. This idea has just been explored by
K. Margatina et al. [64].
Furthermore, a potential improvement of the successful uncertainty-based
strategies could be to enhance the uncertainty estimations of the model. This
can be done with MC Dropout as done by Y. Gal et al. [7]. The authors
report significant improvement for strategies with bayesian models over the
deterministic equivalents. This could be a straightforward path to better
performance.
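
A minimal PyTorch sketch of MC Dropout, keeping the dropout layers active at inference time and averaging the softmax outputs of several stochastic forward passes, is shown below. It assumes a Hugging Face-style model whose output exposes logits, and it is an illustration rather than a tested recipe.

import torch

def mc_dropout_probs(model, inputs, n_passes=10):
    model.eval()
    for module in model.modules():               # re-enable only the dropout layers
        if isinstance(module, torch.nn.Dropout):
            module.train()
    with torch.no_grad():
        probs = [torch.softmax(model(**inputs).logits, dim=-1)
                 for _ in range(n_passes)]
    return torch.stack(probs).mean(dim=0)        # averaged predictive distribution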
The main metrics of this thesis were F1 score per labeled word or F1 score
per labeled sample (sentence). This only acts as a proxy for the reduction of
annotation effort, as it is based on the assumption that the lower number of
labels required entails a simpler annotation process. The assumption is likely
true to some extent in many scenarios but could overestimate the potential
of AL if the more informative samples are also more difficult to annotate.
Consequently, an interesting research direction is that of cost-aware AL, which
aims to explicitly model the annotation cost.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,


L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017.

[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training


of deep bidirectional transformers for language understanding,” in
Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). Minneapolis,
Minnesota: Association for Computational Linguistics, Jun. 2019.
doi: 10.18653/v1/N19-1423 pp. 4171–4186. [Online]. Available:
https://www.aclweb.org/anthology/N19-1423

[3] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi,


and N. Smith, “Fine-tuning pretrained language models: Weight
initializations, data orders, and early stopping,” 2020.

[4] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy


considerations for deep learning in NLP,” in Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics.
Florence, Italy: Association for Computational Linguistics, Jul.
2019. doi: 10.18653/v1/P19-1355 pp. 3645–3650. [Online]. Available:
https://www.aclweb.org/anthology/P19-1355

[5] F. Olsson, “Bootstrapping named entity annotation by means of active


machine learning: A method for creating corpora,” Ph.D. dissertation,
SICS, 2008. [Online]. Available: http://spraakdata.gu.se/publikationer/
datalinguistica/DL21.pdf

[6] A. Kirsch, J. van Amersfoort, and Y. Gal, “Batchbald: Efficient


and diverse batch acquisition for deep bayesian active learning,” in
Advances in Neural Information Processing Systems, H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and

R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019.


[Online]. Available: https://proceedings.neurips.cc/paper/2019/file/
95323660ed2124450caaac2c46b5ed90-Paper.pdf

[7] Y. Gal, R. Islam, and Z. Ghahramani, “Deep Bayesian active learning


with image data,” in Proceedings of the 34th International Conference
on Machine Learning, ser. Proceedings of Machine Learning Research,
D. Precup and Y. W. Teh, Eds., vol. 70. International Convention
Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 1183–1192.
[Online]. Available: http://proceedings.mlr.press/v70/gal17a.html

[8] D. Yoo and I. S. Kweon, “Learning loss for active learning,” in


Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), June 2019.

[9] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and


A. Agarwal, “Deep batch active learning by diverse, uncertain
gradient lower bounds,” in Eighth International Conference
on Learning Representations (ICLR), April 2020. [Online].
Available: https://www.microsoft.com/en-us/research/publication/
deep-batch-active-learning-by-diverse-uncertain-gradient-lower-bounds/

[10] M. Yuan, H.-T. Lin, and J. Boyd-Graber, “Cold-start active learning


through self-supervised language modeling,” in Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing
(EMNLP). Online: Association for Computational Linguistics, Nov.
2020. doi: 10.18653/v1/2020.emnlp-main.637 pp. 7935–7948. [Online].
Available: https://www.aclweb.org/anthology/2020.emnlp-main.637

[11] S. Hochreiter, “The vanishing gradient problem during learning


recurrent neural nets and problem solutions,” International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, pp. 107–
116, 04 1998. doi: 10.1142/S0218488598000094

[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”


Neural computation, vol. 9, pp. 1735–80, 12 1997. doi:
10.1162/neco.1997.9.8.1735

[13] M. Schuster and K. Paliwal, “Bidirectional recurrent neural networks,”


IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681,
1997. doi: 10.1109/78.650093

[14] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed


representations of words and phrases and their compositionality,” in
Proceedings of the 26th International Conference on Neural Information
Processing Systems - Volume 2, ser. NIPS’13. Red Hook, NY, USA:
Curran Associates Inc., 2013, p. 3111–3119.
[15] T. Mikolov, G. Corrado, K. Chen, and J. Dean, “Efficient estimation of
word representations in vector space,” 01 2013, pp. 1–12.
[16] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors
for word representation,” in Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar: Association for Computational Linguistics, Oct. 2014.
doi: 10.3115/v1/D14-1162 pp. 1532–1543. [Online]. Available: https:
//www.aclweb.org/anthology/D14-1162
[17] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and
L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of
NAACL, 2018.
[18] J. Howard and S. Ruder, “Universal language model fine-tuning for
text classification,” in Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers).
Melbourne, Australia: Association for Computational Linguistics, Jul.
2018. doi: 10.18653/v1/P18-1031 pp. 328–339. [Online]. Available:
https://www.aclweb.org/anthology/P18-1031
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and
tell: A neural image caption generator,” in 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2015. doi:
10.1109/CVPR.2015.7298935 pp. 3156–3164.
[20] J. Ba, J. Kiros, and G. Hinton, “Layer normalization,” 07 2016.
[21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in Proceedings
of the 32nd International Conference on Machine Learning, ser.
Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds.,
vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 448–456. [Online].
Available: http://proceedings.mlr.press/v37/ioffe15.html
[22] J. Alammar, “The Illustrated Transformer,” 2018. [Online]. Available:
https://jalammar.github.io/illustrated-transformer/

[23] A. Radford and K. Narasimhan, “Improving language understanding by


generative pre-training,” 2018.

[24] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes,


“Supervised learning of universal sentence representations from natural
language inference data,” 09 2017. doi: 10.18653/v1/D17-1070 pp. 670–
680.

[25] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings


using siamese bert-networks,” 01 2019. doi: 10.18653/v1/D19-1410 pp.
3973–3983.

[26] F. Carlsson, A. C. Gyllensten, E. Gogoulou, E. Y. Hellqvist,


and M. Sahlgren, “Semantic re-tuning with contrastive tension,” in
International Conference on Learning Representations, 2021. [Online].
Available: https://openreview.net/forum?id=Ov_sMNau-PF

[27] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman,


“GLUE: A multi-task benchmark and analysis platform for natural
language understanding,” in Proceedings of the 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.
Brussels, Belgium: Association for Computational Linguistics, Nov.
2018. doi: 10.18653/v1/W18-5446 pp. 353–355. [Online]. Available:
https://www.aclweb.org/anthology/W18-5446

[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,


L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert
pretraining approach,” 2019.

[29] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov,


and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for
language understanding,” in Advances in Neural Information Processing
Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc.,
2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/
file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf

[30] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-


training text encoders as discriminators rather than generators,” ArXiv,
vol. abs/2003.10555, 2020.

[31] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=H1eA7AEtvS

[32] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” 2020.

[33] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” 2020.

[34] M. Malmsten, L. Börjeson, and C. Haffenden, “Playing with words at the national library of sweden – making a swedish bert,” 2020.

[35] L. Ramshaw and M. Marcus, “Text chunking using transformation-based learning,” in Third Workshop on Very Large Corpora, 1995. [Online]. Available: https://www.aclweb.org/anthology/W95-0107

[36] E. F. Tjong Kim Sang and F. De Meulder, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. [Online]. Available: https://www.aclweb.org/anthology/W03-0419

[37] B. Settles, “Active learning literature survey,” University of Wisconsin–Madison, Computer Sciences Technical Report 1648, 2009.

[38] C. A. Thompson, M. E. Califf, and R. J. Mooney, “Active learning for natural language parsing and information extraction,” in Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), Bled, Slovenia, June 1999, pp. 406–414. [Online]. Available: http://www.cs.utexas.edu/users/ai-lab?thompson:ml99

[39] G. Tur, R. Schapire, and D. Hakkani-Tur, “Active learning for spoken language understanding,” vol. 1, 05 2003. doi: 10.1109/ICASSP.2003.1198771. ISBN 0-7803-7663-3 pp. I–276.

[40] G. Tur, D. Hakkani-Tür, and R. E. Schapire, “Combining active and semi-supervised learning for spoken language understanding,” Speech Communication, vol. 45, no. 2, pp. 171–186, 2005. doi: 10.1016/j.specom.2004.08.002. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639304000962

[41] B. Settles, M. Craven, and S. Ray, “Multiple-instance active learning,” in Advances in Neural Information Processing Systems, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds., vol. 20. Curran Associates, Inc., 2008. [Online]. Available: https://proceedings.neurips.cc/paper/2007/file/a1519de5b5d44b31a01de013b9b51a80-Paper.pdf

[42] B. Settles and M. Craven, “An analysis of active learning strategies for sequence labeling tasks,” ser. EMNLP ’08. USA: Association for Computational Linguistics, 2008, pp. 1070–1079.

[43] S. Peshterliev, J. Kearney, A. Jagannatha, I. Kiss, and S. Matsoukas, “Active learning for new domains in natural language understanding,” 01 2019. doi: 10.18653/v1/N19-2012 pp. 90–96.

[44] D. Shen, J. Zhang, J. Su, G. Zhou, and C.-L. Tan, “Multi-criteria-based active learning for named entity recognition,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, Jul. 2004. doi: 10.3115/1218955.1219030 pp. 589–596. [Online]. Available: https://www.aclweb.org/anthology/P04-1075

[45] A. Siddhant and Z. Lipton, “Deep bayesian active learning for natural language processing: Results of a large-scale empirical study,” 01 2018. doi: 10.18653/v1/D18-1318 pp. 2904–2909.

[46] Y. Shen, H. Yun, Z. Lipton, Y. Kronrod, and A. Anandkumar, “Deep active learning for named entity recognition,” in Proceedings of the 2nd Workshop on Representation Learning for NLP. Vancouver, Canada: Association for Computational Linguistics, Aug. 2017. doi: 10.18653/v1/W17-2630 pp. 252–256. [Online]. Available: https://www.aclweb.org/anthology/W17-2630

[47] Y. Chen, T. A. Lasko, Q. Mei, J. C. Denny, and H. Xu, “A study of active learning methods for named entity recognition in clinical text,” Journal of Biomedical Informatics, vol. 58, pp. 11–18, 2015. doi: 10.1016/j.jbi.2015.09.010. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1532046415002038

[48] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML ’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001. ISBN 1558607781 pp. 282–289.

[49] Q. Wei, Y. Chen, M. Salimi, J. Denny, Q. Mei, T. Lasko, Q. Chen, S. Wu, A. Franklin, T. Cohen, and H. Xu, “Cost-aware active learning for named entity recognition in clinical text,” Journal of the American Medical Informatics Association (JAMIA), 2019.

[50] L. Ein-Dor, A. Halfon, A. Gera, E. Shnarch, L. Dankin, L. Choshen, M. Danilevsky, R. Aharonov, Y. Katz, and N. Slonim, “Active Learning for BERT: An Empirical Study,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020. doi: 10.18653/v1/2020.emnlp-main.638 pp. 7949–7962. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-main.638

[51] D. Grießhaber, J. Maucher, and N. T. Vu, “Fine-tuning BERT for low-resource natural language understanding via active learning,” in Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020. doi: 10.18653/v1/2020.coling-main.100 pp. 1158–1171. [Online]. Available: https://www.aclweb.org/anthology/2020.coling-main.100

[52] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 06–11 Aug 2017, pp. 1321–1330. [Online]. Available: http://proceedings.mlr.press/v70/guo17a.html

[53] O. Sener and S. Savarese, “Active learning for convolutional neural networks: A core-set approach,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=H1aIuk-RW

[54] W.-N. Hsu and H.-T. Lin, “Active learning by learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, Feb. 2015. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/9597

[55] L. Ahrenberg, J. Frid, and L.-J. Olsson, “A new resource for swedish named-entity recognition,” 2020. [Online]. Available: https://gubox.app.box.com/v/SLTC-2020-paper-17

[56] Arbetsförmedlingen AI-Center. AF-BERT. [Online]. Available: https://github.com/af-ai-center/SweBERT

[57] F. Stollenwerk, N. Fastlund, A. Nyqvist, and J. Öhman, “Annotated job ads using swedish language models and named entity recognition,” in preparation.

[58] F. Stollenwerk. nerblackbox: a python package to fine-tune transformer-based language models for named entity recognition. [Online]. Available: https://af-ai-center.github.io/nerblackbox/

[59] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 12 2014.

[60] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7

[61] M. Mosbach, M. Andriushchenko, and D. Klakow, “On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=nzpLWnVAyah

[62] T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi, “Revisiting few-sample BERT fine-tuning,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=cO1IH43yUF

[63] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzman, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” 01 2020. doi: 10.18653/v1/2020.acl-main.747 pp. 8440–8451.

[64] K. Margatina, L. Barrault, and N. Aletras, “Bayesian active learning with pretrained language models,” 2021.

Appendix A

Hyperparameter Comparison

A.1 ALPS-Inspired vs. Early Stopping


While the hyperparameters from ALPS, combined with learning rate decay, may push performance slightly higher on full-sized datasets, their low number of epochs does not allow the model to train to convergence. Training with early stopping and a maximum of 50 epochs instead lets the model converge even in early iterations, when the labeled dataset is small. This is a trade-off; nevertheless, stable and comparable training across all iterations is deemed more important in this thesis than perfectly optimized fine-tuning for the AL experiments. Figure A.1 compares the hyperparameters from ALPS with the hyperparameters used in the experiments. Indeed, models in early iterations perform poorly with a low number of epochs. Furthermore, early stopping reduces the variance.
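
For concreteness, the sketch below shows one way to implement this training scheme with the Hugging Face Transformers Trainer. It is a minimal sketch, not the exact configuration used in the thesis; the model name, patience, learning rate, and batch size are illustrative assumptions.

# Minimal sketch: fine-tune a Swedish BERT for NER until convergence with early
# stopping, capped at 50 epochs. Hyperparameter values here are assumptions.
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments, EarlyStoppingCallback)

def fine_tune_ner(train_dataset, eval_dataset, num_labels):
    # train_dataset and eval_dataset are assumed to be tokenized,
    # label-aligned token-classification datasets.
    model = AutoModelForTokenClassification.from_pretrained(
        "KB/bert-base-swedish-cased", num_labels=num_labels)

    args = TrainingArguments(
        output_dir="ner_checkpoints",
        num_train_epochs=50,            # upper bound; early stopping usually ends training sooner
        evaluation_strategy="epoch",    # evaluate on the validation set after every epoch
        save_strategy="epoch",
        load_best_model_at_end=True,    # restore the best checkpoint when training stops
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # stop if the validation loss has not improved for 3 consecutive epochs
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    return trainer

Because the same stopping criterion is applied in every AL iteration, models trained on 50 samples and on the full dataset are treated consistently, which is the stability argument made above.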

Figure A.1 – Model hyperparameter performance for an increasing number of data labels. The x-axis shows the number of labeled samples beyond the initial seed dataset of 50 samples.

Appendix B

Named Entity Recognition Examples

To demonstrate what the downstream NER task is and what samples can look like, this appendix shows a couple of samples tagged by a fine-tuned model, compared with the ground truth. Figure B.1 shows predictions from a model trained on the Swedish NER Corpus. In the first example, the model mistakes a person for an organization but identifies the organization correctly. In the second example, it misses the location but correctly predicts the two persons.
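
To make the tag format concrete, the minimal Python sketch below shows how a BIO-tagged ground-truth sequence can be compared with a model prediction. The sentence, entity names, and prediction errors are invented for illustration and are not taken from the Swedish NER Corpus test set.

# Invented example: "Anna Lindqvist arbetar på Volvo i Göteborg."
# ("Anna Lindqvist works at Volvo in Gothenburg.")
# BIO tags: B-X begins an entity of type X, I-X continues it, O is outside any entity.
tokens       = ["Anna",  "Lindqvist", "arbetar", "på", "Volvo", "i", "Göteborg", "."]
ground_truth = ["B-PER", "I-PER",     "O",       "O",  "B-ORG", "O", "B-LOC",    "O"]
prediction   = ["B-ORG", "I-ORG",     "O",       "O",  "B-ORG", "O", "O",        "O"]

# Print tokens side by side with gold and predicted tags, flagging disagreements.
for token, gold, pred in zip(tokens, ground_truth, prediction):
    flag = "" if gold == pred else "  <-- error"
    print(f"{token:10s} gold={gold:6s} pred={pred:6s}{flag}")

In this invented example the model makes the same kinds of mistakes as in Figure B.1: the person is tagged as an organization and the location is missed, while the organization is predicted correctly.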

Figure B.1 – Samples from the Swedish NER Corpus test set. Ground truth
samples compared with samples tagged by a fine-tuned KB-BERT model.
For DIVA
{
  "Author1": {
    "Last name": "Öhman",
    "First name": "Joey",
    "Local User Id": "joeyoh",
    "E-mail": "[email protected]",
    "ORCiD": "0000-0002-00001-1234",
    "organisation": {"L1": "School of Electrical Engineering and Computer Science"}
  },
  "Degree": {"Educational program": "Master’s Programme, Machine Learning, 120 credits"},
  "Title": {
    "Main title": "Active Learning for Named Entity Recognition with Swedish Language Models",
    "Language": "eng"
  },
  "Alternative title": {
    "Main title": "Aktiv Inlärning för Namnigenkänning med Svenska Språkmodeller",
    "Language": "swe"
  },
  "Supervisor1": {
    "Last name": "Leite",
    "First name": "Iolanda",
    "Local User Id": "iolanda",
    "E-mail": "[email protected]",
    "organisation": {"L1": "School of Electrical Engineering and Computer Science", "L2": "Intelligent Systems"}
  },
  "Supervisor2": {
    "Last name": "Stollenwerk",
    "First name": "Felix",
    "E-mail": "[email protected]"
  },
  "Examiner1": {
    "Last name": "Gustafsson",
    "First name": "Joakim",
    "Local User Id": "jkgu",
    "E-mail": "[email protected]",
    "organisation": {"L1": "School of Electrical Engineering and Computer Science", "L2": "Intelligent Systems"}
  },
  "Cooperation": {"Partner_name": "Arbetsförmedlingen"},
  "Other information": {"Year": "2021", "Number of pages": "xv,66"}
}
TRITA-EECS-EX-2021:583

www.kth.se
