Association For Computational Linguistics
August 5, 2021
Bangkok, Thailand (online)
©2021 The Association for Computational Linguistics
and The Asian Federation of Natural Language Processing
ISBN 978-1-954085-62-6
Preface
The workshop is privileged to present four invited talks this year, all from highly respected members of the SIGMORPHON community: Reut Tsarfaty, Kenny Smith, Kristine Yu, and Ekaterina Vylomova.
This year also marks the sixth iteration of the SIGMORPHON Shared Task. Building on the success
of last year's multiple tasks, we again hosted three shared tasks:
Task 0:
SIGMORPHON's sixth installment of its inflection generation shared task is divided into two parts:
generalization and cognitive plausibility.
In the first part, participants designed a model that learned to generate morphological inflections from a
lemma and a set of morphosyntactic features of the target form, similar to previous years' tasks. This year,
participants learned morphological tendencies on a set of development languages, and then generalized
these findings to new languages, without much time to adapt their models to new phenomena.
The second part asked participants to inflect nonce words in the past tense, which were then judged for
plausibility by native speakers. This task aims to investigate whether state-of-the-art inflectors are
learning in a way that mimics human learners.
Task 1:
The second SIGMORPHON shared task on grapheme-to-phoneme conversion expands on the task from
last year, recategorizing data as belonging to one of three classes: low-resource, medium-resource, and
high-resource.
Task 2:
Task 2 continues the effort from the 2020 shared task in unsupervised morphology. Unlike last year’s
task, which asked participants to implement a complete unsupervised morphology induction pipeline,
this year’s task concentrates on a single aspect of morphology discovery: paradigm induction. This task
asks participants to cluster words into inflectional paradigms, given no more than raw text.
We are grateful to the program committee for their careful and thoughtful reviews of the papers submitted
this year. Likewise, we are thankful to the shared task organizers for their hard work in preparing the
shared tasks. We are looking forward to a workshop covering a wide range of topics, and we hope for
lively discussions.
Garrett Nicolai
Kyle Gorman
Ryan Cotterell
Organizing Committee
Garrett Nicolai (University of British Columbia, Canada)
Kyle Gorman (City University of New York, USA)
Ryan Cotterell (ETH Zürich, Switzerland)
Program Committee
Damián Blasi (Harvard University)
Grzegorz Chrupała (Tilburg University)
Jane Chandlee (Haverford College)
Çağrı Çöltekin (University of Tübingen)
Daniel Dakota (Indiana University)
Colin de la Higuera (University of Nantes)
Micha Elsner (The Ohio State University)
Nizar Habash (NYU Abu Dhabi)
Jeffrey Heinz (University of Delaware)
Mans Hulden (University of Colorado)
Adam Jardine (Rutgers University)
Christo Kirov (Google AI)
Greg Kobele (Universität Leipzig)
Grzegorz Kondrak (University of Alberta)
Sandra Kübler (Indiana University)
Adam Lamont (University of Massachusetts Amherst)
Kevin McMullin (University of Ottawa)
Kemal Oflazer (CMU Qatar)
Jeff Parker (Brigham Young University)
Gerald Penn (University of Toronto)
Jelena Prokic (Universiteit Leiden)
Miikka Silfverberg (University of British Columbia)
Kairit Sirts (University of Tartu)
Kenneth Steimel (Indiana University)
Reut Tsarfaty (Bar-Ilan University)
Francis Tyers (Indiana University)
Ekaterina Vylomova (University of Melbourne)
Adina Williams (Facebook AI Research)
Anssi Yli-Jyrä (University of Helsinki)
Kristine Yu (University of Massachusetts)
Table of Contents
Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal,
Miikka Silfverberg, Mans Hulden and Katharina Kann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Variations on a Baseline
Simon Clematide and Peter Makarov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
BME Submission for SIGMORPHON 2021 Shared Task 0. A Three Step Training Approach with Data
Augmentation for Morphological Inflection
Gábor Szolnok, Botond Barta, Dorina Lakatos and Judit Ács . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Not quite there yet: Combining analogical patterns and encoder-decoder networks for cognitively plausible inflection
Basilio Calderone, Nabil Hathout and Pierre Bonami . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gradient descent
Huteng Dai and Richard Futrell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
Maria Ryskina, Eduard Hovy, Taylor Berg-Kirkpatrick and Matthew R. Gormley . . . . . . . . . . . . 258
Leveraging Paradigmatic Information in Inflection Acceptability Prediction: the JHU-SFU Submission
to SIGMORPHON Shared Task 0.2
Jane S. Y. Li and Colin Wilson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Workshop Program
Due to the ongoing pandemic and the virtual nature of the workshop, the papers will be presented
asynchronously, with designated question periods.
Towards Detection and Remediation of Phonemic Confusion
Abstract
Reducing communication breakdown is critical to success in interactive NLP applications, such as dialogue systems. To this end, we propose a confusion-mitigation framework for the detection and remediation of communication breakdown. In this work, as a first step towards implementing this framework, we focus on detecting phonemic sources of confusion. As a proof-of-concept, we evaluate two neural architectures in predicting the probability that a listener will misunderstand phonemes in an utterance. We show that both neural models outperform a weighted n-gram baseline, showing early promise for the broader framework.

Figure 1: A simplified variant of our proposed confusion-mitigation framework, which enables generative NLP systems to detect and remediate confusion-related communication breakdown. The confusion prediction component predicts the confusion probability of candidate utterances, which are rejected if this probability is above a decision threshold, φ.
1 Introduction
Ensuring that interactive NLP applications, such as dialogue systems, communicate clearly and effectively is critical to their long-term success and viability, especially in high-stakes domains, such as healthcare. Successful systems should thus seek to reduce communication breakdown. One aspect of successful communication is the degree to which each party understands the other. For example, properly diagnosing a patient may necessitate asking logically complex questions, but these questions should be phrased as clearly as possible to promote understanding and mitigate confusion.

To reduce confusion-related communication breakdown, we propose that generative NLP systems integrate a novel confusion-mitigation framework into their natural language generation (NLG) processes. In brief, this framework ensures that such systems avoid transmitting utterances with high predicted probabilities of confusion. In the simplest and most decoupled formulation, an existing NLG component simply produces alternatives to any rejected utterances without additional guiding information. In more advanced and coupled formulations, the NLG and confusion prediction components can be closely integrated to better determine precisely how to avoid confusion. This process can also be conditioned on models of the current listener or task to achieve personalized or context-dependent results. Figure 1 shows the simplest variant of the framework.

As a first step towards implementing this framework, we work towards developing its central confusion prediction component, which predicts the confusion probability of an utterance. In this work, we specifically target phonemic confusion, that is, the misidentification of heard phonemes by a listener. We consider two potential neural architectures for this purpose: a fixed-context, feed-forward network and a residual, bidirectional LSTM network. We train these models using a novel proxy data set derived from audiobook recordings, and compare their performance to that of a weighted n-gram baseline.

* Equal contribution.
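To illustrate the decoupled formulation described above, here is a minimal sketch of the rejection loop from Figure 1. The function names and the threshold value are hypothetical stand-ins (the paper itself only develops the confusion prediction component), so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the decoupled rejection loop in
# Figure 1. `generate_candidate` and `confusion_probability` are hypothetical
# stand-ins for an NLG component and the confusion prediction component.
PHI = 0.5  # decision threshold; an assumed value, not taken from the paper

def respond(context, generate_candidate, confusion_probability,
            threshold=PHI, max_attempts=10):
    """Return the first candidate utterance whose predicted confusion
    probability falls below the decision threshold."""
    best = None
    for _ in range(max_attempts):
        utterance = generate_candidate(context)
        p_confusion = confusion_probability(utterance)
        if p_confusion <= threshold:
            return utterance                 # accept: low predicted confusion
        if best is None or p_confusion < best[1]:
            best = (utterance, p_confusion)  # remember least-confusing fallback
    return best[0]                           # fall back if nothing clears the bar
```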
2 Background and Related Work

Prior work focused on identifying confusion in natural language, rather than proactively altering it to help reduce communication breakdown, as our framework proposes. For example, Batliner et al. (2003) showed that certain features of recorded speech (e.g., repetition, hyperarticulation, strong emphasis) can be used to identify communication breakdown. The authors relied primarily on prosodic properties of recorded phrases, rather than the underlying phonemes, words, or semantics, for identifying communication breakdown. On the other hand, conversational repair is a turn-level process in which conversational partners first identify and then remediate communication breakdown as part of a trouble-source repair (TSR) sequence (Sacks et al., 1974). Using this approach, Orange et al. (1996) identified differences in TSR patterns amongst people with no, early-stage, and middle-stage Alzheimer's, highlighting the usefulness of communication breakdown detection. However, such work does not directly address the issue of proactive confusion mitigation and remediation, which the more advanced formulation of our framework aims to address through listener and task conditioning. Our focus is on the simpler formulation in this preliminary work.

Rothwell (2010) identified four types of noise that may cause confusion: physical noise (e.g., a loud highway), physiological noise (e.g., hearing impairment), psychological noise (e.g., attentiveness of listener), and semantic noise (e.g., word choice). We postulate that mitigating confusion resulting from each type of noise may be possible, at least to some extent, given sufficient context to make an informed compensatory decision. For example, given a particularly physically noisy environment, speaking loudly would seem appropriate. Unfortunately, such contextual information is often lacking from existing data sets. In particular, the physiological and psychological states of listeners are rarely recorded. Even when such information is recorded (e.g., in Alzheimer's speech studies; Orange et al., 1996), the information is very coarse (e.g., broad Alzheimer's categories such as none, early-stage, and middle-stage).

We leave these non-trivial data gathering challenges as future work, instead focusing on phonemic confusion, which is significantly easier to operationalize. In practice, confusion at the phoneme level may arise from any category of Rothwell noise. It may also arise from the natural similarities between phonemes (discussed next). While many of these will not be represented in the text-based phonemic transcription data set used in this preliminary work, our approach can be extended to include them.

Researchers in speech processing have studied the prediction of phonemic confusion but, to our knowledge, this work has not been adapted to utterance generation. Instead, tasks such as preventing sound-alike medication errors (i.e., naming medications so that two medications do not sound identical) are common (Lambert, 1997). Zgank and Kacic (2012) showed that the potential confusability of a word can be estimated by calculating the Levenshtein distance (Levenshtein, 1966) of its phonemic transcription to that of all others in the vocabulary. We take inspiration from Zgank and Kacic (2012) and employ a phoneme-level Levenshtein distance approach in this work.

In the basic definition of the Levenshtein distance, all errors are equally weighted. In practice, however, words that share many similar or identical phonemes are more likely to be confused for one another. Given this, Sabourin and Fabiani (2000) developed a weighted phoneme-level Levenshtein distance, where weights are determined by a human expert or a learned model, such as a hidden Markov model. Unfortunately, while these weights are meant to represent phonemic similarity, selecting an appropriate distance metric in phoneme space is non-trivial. The classical results of Miller (1954) and Miller and Nicely (1955) group phonemes experimentally based on the noise level at which they become indiscernible. The authors identify voicing, nasality, affrication, duration, and place of articulation as sub-phoneme features that predict a phoneme's sensitivity to distortion, and therefore measure its proximity to others. Unfortunately, later work showed that these controlled conditions do not map cleanly to the real world (Batliner et al., 2003). In addition, Wickelgren (1965) found alternative phonemic distance features that could be adapted into a distance metric.

While this prior research sought to directly define a distance metric between phonemes based on sub-phoneme features, since no method has emerged as clearly superior, researchers now favour direct, empirical measures of confusability (Bailey and Hahn, 2005). Likewise, our work assumes that these classical feature-engineering approaches to predicting phoneme confusability can be improved upon with neural approaches, just as automatic speech recognition (ASR) systems have been improved through the use of similar methods (e.g., Seide et al., 2011; Zeyer et al., 2019; Kumar et al., 2020). In addition, these classical approaches do not account for context (i.e., other phonemes surrounding the phoneme of interest), whereas our approach conditions on such context to refine the confusion estimate.
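As a concrete reference point for the Levenshtein-based confusability measures discussed above, here is a minimal phoneme-level weighted Levenshtein distance in Python. The uniform default costs are our own placeholder, not the learned or expert weights of Sabourin and Fabiani (2000), and the function name is ours.

```python
def phoneme_levenshtein(ref, hyp, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Weighted Levenshtein distance between two phoneme sequences.

    `ref` and `hyp` are lists of phoneme symbols (e.g. ARPAbet strings).
    `sub_cost(a, b)` returns the substitution cost between two phonemes;
    by default all mismatches cost 1, which recovers the unweighted case.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    n, m = len(ref), len(hyp)
    # dp[i][j] = cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + del_cost,        # deletion
                           dp[i][j - 1] + ins_cost,        # insertion
                           dp[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    return dp[n][m]

# Example: two ARPAbet pronunciations differing in one vowel.
print(phoneme_levenshtein(["T", "UW", "N", "AY", "T"],
                          ["T", "AH", "N", "AY", "T"]))  # 1.0
```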
3 Data

3.1 Data Gathering Process

To predict the phonemic confusability of utterances, we would ideally use a data set in which each utterance is annotated with the speaker's phonemic transcription (the reference transcription), as well as the listener's perceived phonemic transcription (the hypothesis transcription). We could then compare these transcriptions to identify phonemic confusion.

To the best of our knowledge, a data set of this type does not exist. The English Consistent Confusion Corpus contains a collection of individual words spoken against a noisy background, with human listener transcriptions (Marxer et al., 2016). This is similar to our ideal data set; however, the words are spoken in isolation, and thus without any utterance context. This same issue arises in the Diagnostic Rhyme Test and its derivative data sets (Voiers et al., 1975; Greenspan et al., 1998). Other corpora, such as the BioScope Corpus (Vincze et al., 2008) and the AMI Corpus (Carletta et al., 2005), contain annotations of dialogue acts, which represent the intention of the speaker in producing each utterance (e.g., asking a question is labeled with the dialogue act elicit information). However, dialogue acts relating to confusion only appear when a listener explicitly requests clarification from the speaker. This does not provide fine-grained information regarding which phonemes caused the confusion, nor does it capture any instances of confusion in which the listener does not explicitly vocalize their confusion.

We thus create a new data set for this work (Figure 2). The Parallel Audiobook Corpus contains 121 hours of recorded speech data across 59 speakers (Ribeiro, 2018). We use four of its audiobooks: Adventures of Huckleberry Finn, Emma, Treasure Island, and The Adventures of Sherlock Holmes. Crucially, the audio recordings in this corpus are aligned with the text being read, which allows us to create aligned reference and hypothesis transcriptions. For each text-audio pair, the text simply becomes the reference transcription, while a transcriber converts the audio into a hypothesis transcription. Given the preliminary nature of this work, we create a proxy data set in which we use Google Cloud's publicly-available ASR system as a proxy for human transcribers (Cloud, 2019). We then process these transcriptions to identify phonemic confusion events (as described in Section 3.2). The final data set contains 84,253 parallel transcriptions. We split these into 63,189 training, 10,532 validation, and 10,532 test transcriptions (a 75%-12.5%-12.5% split). The average reference and hypothesis transcription lengths are 65.2 and 62.3 phonemes, respectively. The transcription error rate (i.e., the proportion of phonemes that are mis-transcribed) is only 8%, so there is significant imbalance in the data set.

Figure 2: We create a new data set with parallel reference and hypothesis transcriptions from audiobook data with parallel text and audio recordings. The text simply becomes the reference transcriptions. A transcriber converts the audio recordings into hypothesis transcriptions. In this preliminary work, we use an ASR system as a proxy for human transcribers.

For the purposes of this preliminary work, the Google Cloud ASR system (Cloud, 2019) is an acceptable proxy for human transcription ability under the reasonable assumption that, for any particular transcriber, the distribution of error rates across different phoneme sequences is nonuniform (i.e., within-transcriber variation is present). This assumption holds in all practical cases, and is reasonable since the confusion-mitigation framework we propose can be conditioned on different transcribers to control for inter-transcriber variation as future work.

3.2 Transcription Error Labeling

We post-process our aligned reference-hypothesis transcription data set in two steps. First, each transcription must be converted from the word level to the phoneme level. For this, we use the CMU Pronouncing Dictionary (Weide, 1998), which is based on the ARPAbet symbol set. For any words with multiple phonemic conversions, we simply default to the first conversion returned by the API.

Second, we label each resulting phoneme in each reference transcription as either correctly or incorrectly transcribed. This is nontrivial, because the number of phonemes in the reference and hypothesis transcriptions is rarely equal, thus requiring phoneme-level alignment. For this purpose, we use a variant of the phoneme-level Levenshtein distance that returns the actual alignment, rather than the final distance score (Figure 3).

Figure 3: Illustration of our transcription error labeling process (using letters instead of phonemes for readability). Given aligned reference (x) and hypothesis (y) vectors, we use the Levenshtein algorithm to ensure they have the same length. Because y is not available at test time, we then "collapse" consecutive insertion tokens to force the vectors to have the original length of x. Finally, we replace ỹ with the binary vector d̃, which has 1's wherever x and ỹ don't match.

Formally, let x ∈ K^a be a vector of reference phonemes and y ∈ K^b be a vector of hypothesis phonemes from the data set. K refers to the set {1, 2, 3, ..., k, <INS>, <DEL>, <SOS>, <EOS>}, where k is the number of unique phonemes in the language being considered (e.g., in English, k ≈ 40 depending on the dialect). In general, a ≠ b, but we can manipulate the vectors by incorporating insertion, deletion, and substitution tokens (as done in the Levenshtein distance algorithm). In general, this yields two vectors of the same length, x̃, ỹ ∈ K^c, c = max(a, b). While this manipulation can be performed at training time because y and b are known, such information is unavailable at test time. Therefore, we modify the alignment at training time to ensure x̃ ≡ x and c ≡ a. To achieve this, we "collapse" consecutive insertion tokens into a single instance of the insertion token, which ensures that |ỹ| = a.

Additionally, we assume that each hypothesis phoneme, ỹ_i ∈ ỹ, is conditionally independent of the others. That is, P(ỹ_i = x_i | x, ỹ_≠i) = P(ỹ_i = x_i | x).¹ We hypothesize that this assumption, similar to the conditional independence assumption of Naïve Bayes (Zhang, 2004), will still yield directionally-correct results, while drastically increasing the tractability of the computation.

¹ ỹ_≠i is every element in ỹ except the one at position i.

This assumption also allows us to simplify the output space of the problem. Specifically, since we only care to predict P(ỹ ≠ x), with this assumption, we now only need to consider, for each i, whether ỹ_i = x_i, rather than dealing with the much harder problem of predicting the exact value of ỹ_i. To achieve this, we use an element-wise Kronecker delta function to replace ỹ with a binary vector, d̃, such that d̃_i ← (ỹ_i ≠ x_i). Thus, the binary vector d̃ records the position of each transcription error, that is, the position of each phoneme in x that was confused.

With the x's as inputs and the d̃'s as ground truth labels, we can train models to predict P(d̃_i | x) for each i. As a post-processing step, we can then combine these individual probabilities to estimate the utterance-level probability of phonemic confusion, P(ỹ ≠ x), which is the output of the central confusion prediction component in Figure 1.

This formulation is general in the sense that any x_i can affect the predicted probability of any d̃_i. In practice, however, and especially for long utterances, this is overly conservative, as only nearby phonemes are likely to have a significant effect. In Section 4, we describe any additional conditional independence assumptions that each architecture makes to further simplify its probability estimate.
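The labeling procedure of Section 3.2 can be sketched as follows. The alignment and collapsing convention is one reasonable reading of Figure 3 (we charge collapsed insertions to the following reference phoneme), and the product-based utterance-level combination at the end is our own choice; the paper does not fix a specific formula at this point. All names are ours.

```python
INS, DEL = "<INS>", "<DEL>"

def align(ref, hyp):
    """Levenshtein alignment of reference and hypothesis phoneme lists,
    returned as parallel lists padded with <INS>/<DEL> tokens."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    x, y = [], []          # aligned reference / hypothesis (same length)
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            x.append(ref[i - 1]); y.append(hyp[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            x.append(ref[i - 1]); y.append(DEL); i -= 1      # hypothesis lost a phoneme
        else:
            x.append(INS); y.append(hyp[j - 1]); j -= 1      # hypothesis added a phoneme
    return x[::-1], y[::-1]

def error_labels(ref, hyp):
    """Binary vector d over the reference phonemes: 1 where the aligned
    hypothesis differs, with runs of insertions collapsed so that
    len(d) == len(ref)."""
    x, y = align(ref, hyp)
    d, pending_ins = [], False
    for xi, yi in zip(x, y):
        if xi == INS:
            pending_ins = True           # collapse: charge it to the next ref phoneme
            continue
        d.append(1 if (yi != xi or pending_ins) else 0)
        pending_ins = False
    return d

def utterance_confusion(per_phoneme_probs):
    """One natural way to combine P(d_i | x) into P(y != x) under the
    independence assumption above (our choice, not fixed by the paper)."""
    p_all_correct = 1.0
    for p in per_phoneme_probs:
        p_all_correct *= (1.0 - p)
    return 1.0 - p_all_correct

print(error_labels(list("ABCD"), list("ABXD")))   # [0, 0, 1, 0]
```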
4 Model Architectures and Baseline

With recent advances, various neural architectures have been applied to NLP tasks. Early work includes n-gram-based, fully-connected architectures for language modeling tasks (Bengio et al., 2003; Mikolov et al., 2013). Recurrent neural network (RNN) architectures were then shown to be successful for applications such as language modeling, speech recognition, and phoneme recognition (Graves and Schmidhuber, 2005; Mikolov et al., 2011). RNN architectures such as the LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2015) variants have been successful in many NLP applications, such as machine translation and phoneme classification (Sundermeyer et al., 2012; Graves et al., 2013; Graves and Schmidhuber, 2005). Recently, the transformer architecture (Vaswani et al., 2017), which uses attention instead of recurrence to form dependencies between inputs, has shown state-of-the-art results in many areas of NLP, including syllable-based tasks (e.g., Zhou et al., 2018).

In this work, we propose a fixed-context-window architecture and a residual bi-LSTM architecture for the central component of our confusion-mitigation framework. While similar architectures have already been applied to phoneme-based applications, such as phoneme recognition and classification (Graves and Schmidhuber, 2005; Weninger et al., 2015; Graves et al., 2013; Li et al., 2017), to our knowledge, our study is the first to apply these architectures to identify phonemes related to confusion for listeners. In our opinion, these architectures strike an acceptable balance between compute and capability for this current work, unlike the more advanced transformer architectures, which require significantly more resources to train.²

² Link to code: https://github.com/francois-rd/phonemic-confusion

Since the data set is imbalanced (see Section 3.1), without sample weighting, early experiments showed that both architectures never identified any phonemes as likely to be mis-transcribed (i.e., high specificity, low sensitivity). Accordingly, since the imbalance ratio is approximately 1:10, transcription errors are given 10 times more weight than properly-transcribed phonemes in our binary cross-entropy loss function.

4.1 Fixed-Context Network

The fixed-context network takes as input the current phoneme, x_i, and the 4 phonemes before and after it as a fixed window of context (Figure 4a). This results in the additional conditional independence assumption that P(d̃_i | x) = P(d̃_i | x_{i−4:i+4}). That is, only phonemes within the fixed context window of size 4 can affect the predicted probability of d̃_i.

These 9 phonemes are first embedded in a 15-dimensional embedding space. The embedding layer is followed by a sequence of seven fully-connected hidden layers with 512, 256, 256, 128, 128, 64, and 64 neurons, respectively. Each layer is separated by Rectified Linear Unit (ReLU) non-linearities (Nair and Hinton, 2010; He et al., 2016). Finally, an output with a sigmoid activation function predicts the probability of a transcription error. We train with minibatches of size 32, using the Adam optimizer with parameters α = 0.001, β1 = 0.9, and β2 = 0.999 (Kingma and Ba, 2014) to optimize a 1:10 weighted binary cross-entropy loss function. We explored alternative parameter settings, and in particular a larger number of neurons, but found this architecture to be the most stable and highest performing of all variants tested, given the nature and relatively small size of the data set.

4.2 LSTM Network

The LSTM network receives the entire reference transcription, x, as input and predicts the entire binarized hypothesis transcription, d̃, as output (Figure 4b). Since the LSTM is bidirectional, we do not introduce any additional conditional independence assumptions. Each input phoneme is passed through an embedding layer of dimension 42 (equal to |K|) followed by a bidirectional LSTM layer and two residual linear blocks with ReLU activations (He et al., 2016). An output residual linear block with a sigmoid activation predicts the probability of a transcription error. These skip connections are added since residual layers tend to outperform simpler alternatives (He et al., 2016). Passing the embedded input via skip connections ensures that the original input is accessible at all depths of the network, and also helps mitigate against any vanishing gradients that may arise in the LSTM.

We use the following output dimensions for each layer: 50 for LSTM hidden and cell states, 40 for the first residual linear block, and 10 for the second. We train with minibatches of size 256, using the Adam optimizer with parameters α = 0.00005, β1 = 0.9, and β2 = 0.999 (Kingma and Ba, 2014) to optimize a 1:10 weighted binary cross-entropy loss function.

4.3 Weighted n-Gram Baseline

We compare our neural models to a weighted n-gram baseline model. That is, d̃_i depends only on the n previous phonemes in x (an order-n Markov assumption). Formally, we make the conditional independence assumption that P(d̃_i | x) = P(d̃_i | x_{i−n+1:i}). Extending this baseline model to include future phonemes would violate the order-n Markov assumption that is standard in n-gram approaches. In this preliminary work, we opt to keep the baseline as standard as possible.

A weighted n-gram model is computed using an algorithm similar to the standard maximum likelihood estimation (MLE) n-gram counting algo-
(a) The fixed-context network uses a fixed window of context of size 4. These 9 phonemes are embedded using a shared embedding layer, concatenated, and then passed through 7 linear layers with ReLU activations, followed by an output layer with a sigmoid activation.
(b) Unrolled architecture of the LSTM network. The architecture consists of one bidirectional LSTM layer, two residual linear blocks with ReLU activations, and an output residual linear block with a sigmoid activation. Additional skip connections are added throughout.
Figure 4: Architectural variants of the confusion prediction component of our confusion-mitigation framework.
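For concreteness, the following PyTorch sketch reconstructs the fixed-context variant (Figure 4a) from the layer sizes reported in Section 4.1. It is our approximation based on the prose, not the authors' released implementation (see the repository linked in footnote 2); the class and helper names, and the vocabulary-size default, are our own assumptions.

```python
import torch
import torch.nn as nn

class FixedContextNet(nn.Module):
    """Sketch of the fixed-context confusion predictor (Figure 4a): the
    current phoneme plus 4 phonemes of context on each side are embedded,
    concatenated, and passed through 7 ReLU layers and a sigmoid output
    giving P(transcription error)."""

    def __init__(self, vocab_size=42, embed_dim=15, window=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        sizes = [embed_dim * (2 * window + 1), 512, 256, 256, 128, 128, 64, 64]
        layers = []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers += [nn.Linear(sizes[-1], 1)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, window_ids):
        # window_ids: (batch, 9) integer phoneme ids
        emb = self.embed(window_ids).flatten(start_dim=1)
        return torch.sigmoid(self.mlp(emb)).squeeze(-1)

model = FixedContextNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
bce = nn.BCELoss(reduction="none")

def weighted_bce(pred, target, pos_weight=10.0):
    # 1:10 class weighting on the positive (error) class, as in the paper.
    w = torch.where(target > 0.5, torch.full_like(target, pos_weight),
                    torch.ones_like(target))
    return (w * bce(pred, target)).mean()
```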
Ground Truth Phrase | Transcription of Audio Recording
... for they say every body is in love once ... | ... for they say everybody is in love once ...
... his grave looks shewed that she was not ... | ... his grave look showed that she was not ...
... shall use the carriage to night ... | ... shall use the carriage tonight ...
... making him understand I warn't dead ... | ... making him understand I warrant dead ...
... shore at that place so we warn't afraid ... | ... sure at that place so we weren't afraid ...
... read Elton's letter as I was shewn in ... | ... read Elton's letters I was shown in ...
... sacrifice my poor hair to night and ... | ... sacrifice my poor head tonight and ...
... we warn't feeling just right ... | ... we weren't feeling just right ...
... that there was no want of taste ... | ... that there was no on toothpaste ...
... knew that Arthur had discovered ... | ... knew was it also have discovered ...
Table 1: Randomly selected phrases from amongst the top 100 phonemes predicted to be incorrectly transcribed
by the fixed-context model (transcription error probability > 0.999). Bold text denotes ASR transcription errors.
Ground Truth Phrase | Transcription of Audio Recording
... the exquisite feelings of delight and ... | ... the exquisite feelings of delight and ...
... gone Mister Knightley called ... | ... gone Mister Knightley called ...
... has been exceptionally ... | ... has been exceptionally ...
... not afraid of your seeing ... | ... not afraid if you're saying ...
... the sale of Randalls was long ... | ... the sale of Randalls was long ...
... her very kind reception of himself ... | ... her very kind reception to himself ...
... for the purpose of preparatory inspection ... | ... for the purpose of preparatory inspection ...
... you would not be happy until you ... | ... you would not be happy until you ...
... with the exception of this little blot ... | ... with the exception of this little blot ...
... night we were in a great bustle getting ... | ... night we were in a great bustle getting ...
Table 2: Randomly selected phrases from amongst the top 100 phonemes predicted to be correctly transcribed by
the fixed-context model (transcription error probability < 0.03). Bold text denotes ASR transcription errors.
ples have errors in Table 2. It therefore seems as though, when the fixed-context model is very certain about the presence or absence of errors, it is usually correct.

Second, many of the transcription errors in Table 1 are seemingly caused by the archaic or idiosyncratic writing present in the books used to create the data set. While this can be seen as a source of unwanted noise (we used an ASR system trained on standard modern English), we argue that, as per Rothwell's model of communication (Section 2), familiarity with the vocabulary is, in fact, a very legitimate source of semantic noise. Indeed, phrases using more modern and standard vernacular are seemingly less likely to be confusing, according to the fixed-context model.

Third, many of the errors not related to archaism involve stop words, homonyms, or near-homophones, which intuitively makes sense. Additionally, hard consonant sounds between words (and stress at the beginning rather than at the end of words) appear more common in the set of correctly-transcribed phrases as compared to the set of incorrectly-transcribed ones. These findings suggest the fixed-context model has picked up on some underlying patterns governing phonemic confusion, which is promising for our confusion-mitigation framework as a whole.

5.3 Future Work

This work uses a relatively small data set. Creating and using a significantly larger corpus using human subjects rather than an ASR proxy would likely yield more directly relevant results. We postulate that, with a larger and higher quality data set, a deeper and more advanced neural network architecture, such as the transformer, may produce stronger results. Future work can also investigate the differences in human phonemic confusability on 'natural' versus semantically-unpredictable sentences.

A major aspect of our confusion-mitigation framework, which we have not explored in this work, is the generation of alternative, clearer utterances that retain the initial meaning. Constructively enumerating these alternatives is non-trivial, as is identifying the neighbourhood beyond which their meaning differs too significantly from the original. Conditioning on a specific listener's priors as an additional mechanism to reduce communication breakdown is another major aspect we leave to future work.

Perhaps most significantly, we have limited the scope of our confusion assessment drastically in this preliminary work, primarily to simplify the data gathering process. While our results are promising, communication breakdown is a nuanced and multi-faceted phenomenon of which phonemic confusion is but one small component. Modeling these larger and more complex processes remains an important open challenge.

6 Conclusion

Reducing communication breakdown is critical to successful interaction in dialogue systems and other generative NLP systems. In this work, we proposed a novel confusion-mitigation framework that such systems could employ to help minimize the probability of human confusion during an interaction. As a first step towards implementing this framework, we evaluated two potential neural architectures—a fixed-context network and an LSTM network—for its central component, which predicts the confusion probability of a candidate utterance. These neural architectures outperformed a weighted n-gram baseline (with the fixed-context network performing best overall) when trained using a proxy data set derived from audiobook recordings. In addition, qualitative analyses suggest that the fixed-context model has uncovered some of the more intuitive causes of phonemic confusion, including stop words, homonyms, near-homophones, and familiarity with the vocabulary. These preliminary results show the promise of our confusion-mitigation framework. Given this early success, further investigation and refinement is warranted.

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/partners).

References

Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI Meeting Corpus: A Pre-Announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. In International Conference on Machine Learning, pages 2067–2075.

Google Cloud. 2019. Speech-to-Text Client Libraries.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Steven L Greenspan, Raymond W Bennett, and Ann K Syrdal. 1998. An evaluation of the diagnostic rhyme test. International Journal of Speech Technology, 2(3):201–214.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. ISSN: 1063-6919.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Ricard Marxer, Jon Barker, Martin Cooke, and Maria Luisa Garcia Lecumberri. 2016. A corpus of noise-induced word misperceptions for English. The Journal of the Acoustical Society of America, 140(5):EL458–EL463.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531. IEEE.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

George A. Miller. 1954. An Analysis of the Confusion among English Consonants Heard in the Presence of Random Noise. Journal of the Acoustical Society of America, 26.

George A. Miller and Patricia E. Nicely. 1955. An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27(2):338–352.

Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.

John B. Orange, Rosemary B. Lubinski, and D. Jeffery Higginbotham. 1996. Conversational Repair by Individuals with Dementia of the Alzheimer's Type. Journal of Speech, Language, and Hearing Research, 39(4):881–895.

Manuel Sam Ribeiro. 2018. Parallel Audiobook Corpus. University of Edinburgh School of Informatics.

J. Dan Rothwell. 2010. In the Company of Others: An Introduction to Communication. New York: Oxford University Press.

Michael Sabourin and Marc Fabiani. 2000. Predicting auditory confusions using a weighted Levinstein distance. US Patent 6,073,099.

Harvey Sacks, Emanuel Schegloff, and Gail Jefferson. 1974. A Simple Systematic for the Organisation of Turn-Taking in Conversation. Language, 50:696–735.

William D Voiers, Alan D Sharpley, and Carl J Hehmsoth. 1975. Research on Diagnostic Evaluation of Speech Intelligibility. Research Report AFCRL-72-0694, Air Force Cambridge Research Laboratories, Bedford, Massachusetts.

Robert L. Weide. 1998. The CMU pronouncing dictionary. The Speech Group.

Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller. 2015. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pages 91–99. Springer.

Wayne A. Wickelgren. 1965. Acoustic similarity and intrusion errors in short-term memory. Journal of Experimental Psychology, 70(1):102.

Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2019. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 8–15. IEEE.

Andrej Zgank and Zdravko Kacic. 2012. Predicting the Acoustic Confusability between Words for a Speech Recognition System using Levenshtein Distance. Elektronika ir Elektrotechnika, 18(8):81–84.

Harry Zhang. 2004. The optimality of naive Bayes. American Association for Artificial Intelligence, 1(2):3.

Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. 2018. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese. arXiv preprint arXiv:1804.10752.
Recursive prosody is not finite-state
least context-free (Chomsky, 1956; Chomsky and Schützenberger, 1959). Because sentential prosody interacts with the syntactic level in non-trivial ways, it might seem sensible to assume that 1) the transformation from syntax to prosody is not finite-state definable (= definable with finite-state transducers), and that 2) the string language of prosodic representations is a supra-regular language, not a regular language. Importantly though, this assumption is not trivially true. In fact, early work has shown that even if syntax is context-free, the corresponding prosodic structures can be a regular string language. For instance, Reich (1969) argued that the prosodic structures in SPE can be generated via finite-state devices (see also Langendoen, 1975), while Pierrehumbert (1980) modeled English intonation using a simple finite-state acceptor.

When analyzed over string languages, this mismatch between supra-regular syntax and regular prosody was not explored much in the subsequent literature. In fact, it seems that current research on computational prosody uses the premise that prosodic structures are at most regular (Gibbon, 2001). Crucially, this premise is confounded by the general lack of explicit mathematical formalizations of prosodic systems. For example, there are algorithms for Dutch intonation that capture surface intonational contours and other acoustic cues (t'Hart and Cohen, 1973; t'Hart and Collier, 1975). These algorithms however do not themselves provide sufficient mathematical detail to show that the prosodic phenomenon in question is a regular string language. Instead, one has to deduce that Dutch intonation is regular because the algorithm does not utilize counting or unbounded look-ahead (t'Hart et al., 2006, pg. 114).

As a reflection of this mismatch, early work in prosodic phonology assumed something known as the strict layer hypothesis (SLH; Nespor and Vogel, 1986; Selkirk, 1986). The SLH assumed that prosodic trees cannot be recursive — i.e. a prosodic phrase cannot dominate another prosodic phrase — thus ensuring that a prosodic tree will have fixed depth. Subsequent work in prosodic phonology weakened the SLH: prosodic recursion at the phrase or sentence level is now accepted as empirically robust (Ladd 1986, 2008, ch8; Selkirk 2011; Ito and Mester 2012, 2013). But empirically, it is difficult to find cases of unbounded prosodic recursion (Van der Hulst, 2010). Consider a language that uses only bounded prosodic recursion — e.g. there can be at most two recursive levels of prosodic phrases. The prosodic tree will have fixed depth; and the computation of the corresponding string language is regular. It is then possible to create a computational network that uses a supra-regular grammar for the syntax which interacts with a finite-state grammar for the prosody (Yu and Stabler, 2017; Yu, 2019). To summarize, it seems that the implicit consensus in computational prosody is that 1) syntax can be supra-regular, but the corresponding prosody is regular; 2) prosodic recursion is bounded.

However, as we elaborate in the next section, coordination data from Wagner (2005) is a case where syntactic recursion generates potentially unbounded recursive prosodic structure. The rest of the paper is then dedicated to exploring the consequences of this construction for the expressivity of sentential prosody.

3 Prosodic recursion in coordination

To our knowledge, Wagner (2005, 2010) is the clearest case where syntactic recursion gets mapped to recursive prosody, such that the recursion is unboundedly deep for the prosody. In this section, we go over the data and generalizations (§3.1), we sketch Wagner's cyclic analysis (§3.2), and we discuss issues with finiteness (§3.3). Finally, we show that this construction does not correspond to a regular string language (§3.4).

3.1 Unbounded recursive prosody

Wagner documents unbounded prosodic recursion in the coordination of nouns, in contrast to earlier results which reported flat non-recursive prosody (Langendoen, 1987, 1998). Based on experimental and acoustic studies, Wagner reports that recursive coordination creates recursively strong prosodic boundaries. Syntactic edges have a prosodic strength that incrementally depends on their distance from the bottom-most constituents.

When three items are coordinated with two non-identical operators, then two syntactic parses are possible. Each syntactic parse has an analogous prosodic parse. The prosodic parse is based on the relative strength of a prosodic boundary, with | being weaker than ||. The boundary is placed before the operator.

Table 1: Prosody of three items with non-identical operators

Syntactic grouping → Prosodic grouping
[A and [B or C]] → A || and B | or C
[[A and B] or C] → A | and B || or C

When the two operators are identical, then three syntactic and prosodic parses are possible. The
difference between the parses is determined by semantic associativity. For example, a sentence like I saw [[A and B] and C] means that I saw A and B together, and I saw C separately.

Table 2: Prosody of three items with identical operators

Syntactic grouping → Prosodic grouping
[A and [B and C]] → A || and B | and C
[[A and B] and C] → A | and B || and C
[A and B and C] → A | and B | and C

When four items are coordinated, then at most 11 parses are possible. The maximum is reached when the three operators are identical. We can have three levels of prosodic boundaries, ranging from the weakest | to the strongest |||.

Table 3: Prosody of four items with identical operators

Syntactic grouping → Prosodic grouping
[A and B and C and D] → A | and B | and C | and D
[A and B and [C and D]] → A || and B || and C | and D
[A and [B and C] and D] → A || and B | and C || and D
[[A and B] and C and D] → A | and B || and C || and D
[A and [B and C and D]] → A || and B | and C | and D
[[A and B and C] and D] → A | and B | and C || and D
[[A and B] and [C and D]] → A | and B || and C | and D
[A and [B and [C and D]]] → A ||| and B || and C | and D
[A and [[B and C] and D]] → A ||| and B | and C || and D
[[A and [B and C]] and D] → A || and B | and C ||| and D
[[[A and B] and C] and D] → A | and B || and C ||| and D

3.2 Wagner's cyclic analysis

In order to generate the above forms, Wagner devised a cyclic procedure which we summarize with the algorithm below.

2. Wagner's cyclic algorithm
(a) Base case: Let X be a constituent that contains a set of unprosodified nouns (terminal nodes) that are in an associative coordination. Place a boundary of strength | between each noun.
(b) Recursive case: Consider a constituent Y. Let S be a set of constituents (terminals or non-terminals) that is properly contained in Y, such that at least one constituent in S is prosodified. Let |^k be the strongest prosodic boundary inside Y. Place the boundary |^{k+1} between each constituent in Y.

The algorithm is generalized to coordination of any depth. It takes as input a syntactic tree, and the output is prosodically marked strings. We illustrate this below, with the input tree represented as a bracketed string.

3. Illustrating Wagner's algorithm
Input: [A and B and [C and D]]
Base case: C | and D
Recursive case: A || and B || and C | and D
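To make the procedure in (2) concrete, here is a small Python sketch that maps bracketed coordinations like those in (3) and Tables 1–3 to prosodified strings. The parsing helper, function names, and list encoding are our own; this is an illustration of the cyclic procedure as summarized above, not code from the paper.

```python
def parse(s):
    """Parse a bracketed coordination such as '[A and B and [C and D]]'
    into a nested list of nouns, operators, and sub-coordinations."""
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()
    def walk(i):
        items = []
        while i < len(tokens):
            if tokens[i] == "[":
                sub, i = walk(i + 1)
                items.append(sub)
            elif tokens[i] == "]":
                return items, i + 1
            else:
                items.append(tokens[i])
                i += 1
        return items, i
    tree, _ = walk(0)
    return tree[0] if len(tree) == 1 else tree

def prosodify(node):
    """Return (output tokens, boundary strength used at this level).
    Base case: unembedded nouns get '|'; recursive case: one more '|'
    than the strongest boundary inside the constituent."""
    if isinstance(node, str):
        return [node], 0
    parts, inner = [], 0
    for child in node:
        if child in ("and", "or"):
            parts.append(child)
        else:
            toks, k = prosodify(child)
            parts.append(toks)
            inner = max(inner, k)
    level = inner + 1
    out = []
    for p in parts:
        if p in ("and", "or"):
            out.extend(["|" * level, p])   # boundary is placed before the operator
        else:
            out.extend(p)
    return out, level

for s in ("[A and B and [C and D]]", "[[[A and B] and C] and D]"):
    print(" ".join(prosodify(parse(s))[0]))
# A || and B || and C | and D
# A | and B || and C ||| and D
```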
3.4 Computing recursive prosody over strings

The choice of representation plays an important role in determining the generative capacity of the prosodic mapping. We first start by treating the mapping as a string-to-string function. We show that the mapping is not regular.

Let the input language be a bracketed string language, such that the input alphabet is a set of nouns {A, ..., Z}, coordinators, and brackets. The output language replaces the brackets with substrings of |*. For illustration, assume that the input language is guaranteed to be a well-bracketed string. At a syntactic boundary, we have to calculate the number of intervening boundaries between it and the deepest node. But this requires unbounded memory. For instance, to parse the example below, we incrementally increase the prosodic strength of each boundary as we read the input left-to-right.

5. Linearly parsing the prosody:
[[[A and B] and C] and D] is mapped to A | and B || and C ||| and D, where
Input alphabet Σ = {A, ..., Z, and, or, [, ]}
Output alphabet ∆ = {A, ..., Z, and, or, |}
Input language is Σ* and well-bracketed

Given the above string with only left-branching syntax, the leftmost prosodic boundary will have a juncture of strength |. Every subsequent prosodic boundary will have incrementally larger strength. Over a string, this means we have to memorize the number x of prosodic junctures that were generated at any point in order to then generate x+1 junctures at the next point. A 1-way FST cannot memorize an unbounded amount of information. Thus, this function is not a rational function and cannot be defined by a 1-way FST. To prove this, we can look at this function in terms of the size of the input and output strings.

6. Illustrating growth size of recursive prosody:
[^n A0 and A1] and A2] and ... and An] is mapped to
A0 | and A1 || and A2 ||| and ... |^n and An

Abstractly, for a left-branching input string with n left-brackets [, the output string has a monotonically increasing number of prosodic junctures: | ··· || ··· ||| ··· |^n. The total number of prosodic junctures is the triangular number n(n+1)/2. We thus derive the following lemma.

Lemma 1. For generating coordination prosody as a string-to-string function, the size of the output string grows at a rate of at least O(n²), where n is the size of the input string.

Such a function is neither rational nor regular. Rational functions are computed by 1-way FSTs, and regular functions by 2-way FSTs (Engelfriet and Hoogeboom, 2001).² They share the following property in terms of growth rates (Lhote, 2020).

Theorem 1. Given an input string of size n, the size of the output string of a regular function grows at most linearly as c·n, where c is a constant.

Thus, this string-to-string function is not regular. It could be a more expressive polyregular function (Engelfriet and Maneth, 2002; Engelfriet, 2015; Bojańczyk, 2018; Bojańczyk et al., 2019), a question that we leave for future work.

The discussion in this section focused on generating the output prosodic string when the input syntax is a bracketed string. Importantly though, Lemma 1 entails that no matter how one chooses their string encoding of syntactic structure, prosody cannot be modeled as a rational transduction unless there is an upper bound on the minimum number of output symbols that a single syntactic boundary must be rewritten as. To the best of our knowledge, there is no syntactic string encoding that guarantees such a bound. In the next section, we will discuss how to compute prosodic strength starting from a tree.

² This equivalence only holds for functions and deterministic FSTs. Non-deterministic FSTs can also compute relations.
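As a quick numeric check of the growth rate behind Lemma 1, the following snippet (our own illustration, not from the paper) builds the prosodified output for the left-branching coordination in (6) and confirms that the number of junctures is the triangular number n(n+1)/2, while the input grows only linearly in n.

```python
def left_branching_output(n):
    """Prosodified string for the left-branching coordination
    [[...[A0 and A1] and A2] ...] and An] with n left brackets."""
    out = ["A0"]
    for k in range(1, n + 1):
        out.append("|" * k)      # boundary strength grows by one per level
        out.append("and")
        out.append(f"A{k}")
    return " ".join(out)

for n in (1, 2, 3, 8):
    s = left_branching_output(n)
    assert s.count("|") == n * (n + 1) // 2   # quadratic total, linear input

print(left_branching_output(3))   # A0 | and A1 || and A2 ||| and A3
```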
4 Computing recursive prosody over trees

Wagner (2010)'s treatment of recursive prosody assumes an algorithm that maps a syntactic tree to a prosodic string. It is thus valuable to understand the complexity of processes at the syntax-prosody interface starting from the tree representation of a sentence. Assuming we start from trees, there is one more choice to be made, namely whether the prosodic information (in the output) is present within a string or a tree. Notably, every tree-to-string transduction can be regarded as a tree-to-tree transduction plus a string yield mapping. As the tree-to-tree case subsumes the tree-to-string one, it makes sense to consider only the former. For a tree-to-tree mapping, the goal is to obtain a tree representation that already contains the correct prosodic information (Ladd, 1986; Selkirk, 2011). This is the focus of the rest of this paper.

4.1 Dependency trees

When working over syntactic structures explicitly, it is important to commit to a specific tree representation. In what follows, we adopt a type of dependency trees, where the head of a phrase is treated as the mother of the subtree that contains its arguments. For example, the coordinated noun phrase Pearl and Garnet is represented as the following dependency tree.

      and
     /   \
  Pearl  Garnet

Dependency trees have a rich tradition in descriptive, theoretical, and computational approaches to language, and their properties have been defined across a variety of grammar formalisms (Tesnière, 1965; Nivre, 2005; Boston et al., 2009; Kuhlmann, 2013; Debusmann and Kuhlmann, 2010; De Marneffe and Nivre, 2019; Graf and De Santo, 2019; Shafiei and Graf, 2020, a.o.). Dependency trees keep the relation between heads and arguments local, and they maximally simplify the readability of our mapping rules. Hence, they allow us to focus our discussion on issues that are directly related to the connection of coordinated embeddings and prosodic strength, without having to commit to a particular analysis of coordinate structure. Importantly, this choice does not impact the generalizability of the solution. It is fairly straightforward to convert basic dependency trees into phrase structure trees. Similarly, although it is possible to adopt n-ary branching structures, we chose to limit ourselves to binary trees (in the input). This turns out to be the most conservative assumption, as it forces us to explicitly deal with associativity and flat prosody.

4.2 Encoding prosodic strength over trees

We are interested in the complexity of mapping a "plain" syntactic tree to a tree representation which contains the correct prosodic information. Because of this, we encode prosodic strength over trees in the form of strength boundaries at each level of embedding. Each embedding level in our final tree representation will thus have a prosodic strength branch. The tree below shows how the syntactic tree for Pearl and Garnet is enriched with prosodic information, according to our encoding choices. For readability, we use $ to mark prosodic boundaries in trees instead of |, since the latter could be confused with a unary tree branch.

      and
    /  |  \
 Pearl $  Garnet

As the tree below shows, the depth of the prosody branch at each embedding level corresponds to the number of prosodic boundaries needed at that level.

            and4
        /    |    \
   Pearl1   $2    and7
             |   /  |  \
            $3 Garnet5 $6 Rose8

Finally, the prosodic tree is fed to a yield function to generate an output prosodified string. In particular, the correct tree-to-string mapping can be obtained by a modified version of a recursive-descent yield, which enumerates nodes left-to-right, depth first, and only enumerates the mother node of each level after the boundary branch. This strategy is depicted by the numerical subscripts in the tree above, which reconstruct how the yield of the prosodically annotated tree produces the string: Pearl || and Garnet | and Rose. The rest of this section will focus on how to obtain the correct tree encoding of prosodic information, starting from a plain dependency tree.
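The modified recursive-descent yield described above can be sketched as follows. The (label, children) tuple encoding and the function names are our own assumptions, used only to illustrate how the annotated tree above yields Pearl || and Garnet | and Rose.

```python
def boundary_depth(node):
    """Depth of a $-chain, i.e. the prosodic strength it encodes."""
    _, children = node
    return 1 + (boundary_depth(children[0]) if children else 0)

def prosodic_yield(node):
    """Modified recursive-descent yield: daughters left-to-right, but the
    mother label is emitted only after the boundary ($) branch."""
    label, children = node
    if not children:
        return [label]
    left, boundary, right = children
    out = prosodic_yield(left)
    out.append("|" * boundary_depth(boundary))  # strength = depth of the $ branch
    out.append(label)                           # mother comes after the boundary
    out.extend(prosodic_yield(right))
    return out

# The annotated tree for "Pearl and Garnet and Rose" shown above,
# encoded as (label, [children]) tuples.
tree = ("and", [("Pearl", []),
                ("$", [("$", [])]),
                ("and", [("Garnet", []), ("$", []), ("Rose", [])])])
print(" ".join(prosodic_yield(tree)))   # Pearl || and Garnet | and Rose
```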
As is standard in defining meta-rules, we introduce X as a countably infinite set of variable symbols (X ∩ Σ = ∅) to be used as place-holders in the definitions of transduction rules over trees.

4.4 Multi bottom-up tree transducers

We assume that the starting point of the prosodic process is a plain syntactic tree. Thus, in order to derive the correct prosodic encoding, we need to propagate information about levels of coordination embedding and about associativity. We adopt a bottom-up approach, and characterize this process in terms of multi bottom-up tree transducers (MBOT; Engelfriet et al., 1980; Lilin, 1981; Maletti, 2011). Essentially, MBOTs generalize traditional bottom-up tree transducers in that they allow states to pass more than one output subtree up to subsequent transducer operations (Gildea, 2012). In other words, each MBOT rule potentially specifies several parts of the output tree. This is highlighted by the fact that the transducer states (q ∈ Q) have rank greater than one — i.e. they can have more than one daughter, where the additional daughters are used to hold subtrees in memory. We follow Fülöp et al. (2004) in presenting the semantics of MBOTs.

Definition 1 (MBOT). A multi bottom-up tree transducer (MBOT) is a tuple M = (Q, Σ, ∆, root, qf, R), where Q, Σ ∪ ∆, {root}, {qf} are pairwise disjoint, such that:

• Q is a ranked alphabet with Q(0) = ∅, called the set of states
• Σ and ∆ are ranked input and output alphabets, respectively
• root is a unary symbol, called the root symbol
• qf is a unary symbol, called the final state
• R is a finite set of rules of two forms:

  • σ(q1(x1,1, ..., x1,n1), ..., qk(xk,1, ..., xk,nk)) → q0(t1, ..., tn0)
    where k ≥ 0, σ ∈ Σ(k), for every i ∈ [k] ∪ {0}, qi ∈ Q(ni) for some ni ≥ 1, and for every j ∈ [n0], tj ∈ T∆({xi,j | i ∈ [k], j ∈ [ni]});

  • root(q(x1, ..., xn)) → qf(t)
    where n ≥ 1, q ∈ Q(n), and t ∈ T∆(Xn).

The derivational relation induced by M is a binary relation ⇒M over the set TΣ∪∆∪Q∪{root,qf} defined as follows. For every ϕ, ψ ∈ TΣ∪∆∪Q∪{root,qf}, ϕ ⇒M ψ iff there is a tree β ∈ TΣ∪∆∪Q∪{root,qf}(X1) s.t. x1 occurs exactly once in β and either there is a rule

• σ(q1(x1,1, ..., x1,n1), ..., qk(xk,1, ..., xk,nk)) → r in R

and there are trees ti,j ∈ TΣ for every i ∈ [k] and j ∈ [ni], s.t. ϕ = β[σ(q1(t1,1, ..., t1,n1), ..., qk(tk,1, ..., tk,nk))], and ψ = β[r[xi,j ← ti,j | i ∈ [k], j ∈ [ni]]]; or there is a rule

• root(q(x1, ..., xn)) → qf(t) in R

and there are trees ti ∈ T∆ for every i ∈ [n] s.t. ϕ = β[root(q(t1, ..., tn))], and ψ = β[qf(t[t1, ..., tn])]. The tree transformation computed by M is the relation:

τM = {(s, t) ∈ TΣ × T∆ | root(s) ⇒∗M qf(t)}

Intuitively, tree transductions are performed by rewriting a local tree fragment as specified by one of the rules in R. For instance, a rule can replace a subtree, or copy it to a different position. Rules apply bottom-up from the leaves of the input tree, and terminate in an accepting state qf.

4.5 MBOT for recursive prosody

We want a transducer which captures Wagner (2010)'s bottom-up cyclic procedure. Consider now the MBOT Mpros = (Q, Σ, ∆, root, qf, R), with Q = {q∗, qc}, σc ∈ {and, or} ⊆ Σ, σ ∈ Σ − {and, or}, and Σ = ∆. We use qc to indicate that Mpros has verified that a branch contains a coordination (so σc), with q∗ assigned to any other branch. As mentioned, we use $ to mark prosodic boundaries in the trees instead of |. The set of rules R is as follows.

Rule 1 rewrites a terminal symbol σ as itself. The MBOT for that branch transitions to q∗(σ).

σ → q∗(σ)   (1)

Rule 2 applies to a subtree headed by σc ∈ {and, or}, with only terminal symbols as daughters: σc(q∗(x), q∗(y)). It inserts a prosodic boundary $ between the daughters x, y. The boundary $ is also copied as a daughter of the mother qc, as a record of the fact that we have seen one coordination level.

σc(q∗(x), q∗(y)) → qc(σc(x, $, y), $)   (2)

We illustrate this in Figure 1 with a coordination of two items, representing the mapping: [B and A] → B | and A. We also assume that sentence-initial boundaries are vacuously interpreted.
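Rules (1) and (2) can each be read as one bottom-up rewriting step; below is a minimal Python sketch (my own illustration, not part of the paper) of their effect on the flat coordination [B and A] of Figure 1.

```python
# Sketch of rules (1) and (2) as bottom-up rewrites over nested tuples.
# A state is a tuple whose first element names it; "$" marks a prosodic boundary.

def rule1(leaf):
    return ("q*", leaf)                      # rule (1): sigma -> q*(sigma)

def rule2(op, left_state, right_state):
    x, y = left_state[1], right_state[1]
    return ("qc", (op, x, "$", y), "$")      # rule (2): insert $ between the daughters and copy it up

print(rule2("and", rule1("B"), rule1("A")))
# ('qc', ('and', 'B', '$', 'A'), '$')  -- the enriched tree for [B and A], i.e. B | and A
```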
We now consider cases where a coordination is the mother not just of terminal nodes, but of other coordinated phrases. Rule 3 handles the case in which the embedded coordination is the right daughter of the higher coordination node; its application is illustrated in Figure 2.

Figure 1: Example of the application of rules (1) and (2). The numerical label on the arrow indicates which rule was applied in order to rewrite the tree on the left as the tree on the right. [Tree diagrams not reproduced.]

Figure 2: Example of the application of rule (3). For ease of readability, we omit q∗ states over terminal nodes. [Tree diagrams not reproduced.]

Rule 4 applies once all coordinate phrases up to the root have been rewritten. It simply rewrites the root as the final accepting state. It gets rid of the daughter of qc that contains the strength markers, since there is no need to propagate them any further.

root(qc(x, y)) → qf(x)   (4)

As the examples so far should have clarified, Mpros as currently defined readily handles cases where the embedding of the coordination is strictly right branching, with the bulk of the work done via rule 3. However, while these rules work well for instances in which a coordination is always the right daughter of a node, they cannot deal with cases in which the coordination branches left, or alternates between the two. This is easily fixed by introducing variants to rule 3, which consider the position of the coordination as marked by qc. Importantly, the position of the copy of the boundary branch is not altered, and it is always kept as the rightmost sibling of qc. What changes is the relative position of the w and x subbranches in the output (see Figure 3).

σc(qc(w, y), q∗(x)) → qc(σc(w, $(y), x), $(y))   (5)

Figure 4: Example of the application of rule (6). [Tree diagrams not reproduced.]
Figure 5: Walk-through of the transduction defined by Mpros (panels: Input, Apply rule (2), Apply rule (3), Apply rule (3), Apply rule (4)). For ease of readability, and to highlight how qc propagates embedding information about the coordination, q∗ and qf states are omitted. [Tree diagrams not reproduced.]
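The walk-through in Figure 5 can also be mimicked in a few lines of Python. The sketch below is my own abstraction over the rule set, not the MBOT itself: it assumes that, as in the walk-through, every additional level of coordination embedding adds one boundary symbol, and it ignores the flat-prosody case introduced below.

```python
# Sketch: boundary depth mirrors coordination embedding depth (assumed increment per level).

def prosodify(tree):
    """Return (enriched_tree, depth) for a binary tree given as nested tuples.

    Leaves are strings; internal nodes are ('and'|'or', left, right).
    depth = number of coordination levels seen so far (the $-branch depth).
    """
    if isinstance(tree, str):                     # terminals map to themselves, as in rule (1)
        return tree, 0
    op, left, right = tree
    l_tree, l_depth = prosodify(left)
    r_tree, r_depth = prosodify(right)
    depth = max(l_depth, r_depth) + 1             # one more coordination level
    boundary = "$" * depth                        # deeper embedding = stronger boundary
    return (op, l_tree, boundary, r_tree), depth

def yield_string(tree):
    """Flatten the enriched tree into a boundary-annotated string."""
    if isinstance(tree, str):
        return tree
    op, l, boundary, r = tree
    return f"{yield_string(l)} {boundary} {op} {yield_string(r)}"

if __name__ == "__main__":
    t = ("and", "Pearl", ("and", "Garnet", "Rose"))
    enriched, _ = prosodify(t)
    print(yield_string(enriched))   # Pearl $$ and Garnet $ and Rose
```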
Finally, we need to take care of the flat prosody or associativity issue. The MBOT Mpros as outlined so far increases the depth of the boundary branch at each level of embedding. Because we are adopting binary branching trees, the current set of rules is trivially unable to encode cases like [A and B and C]. We follow Wagner's assumption that semantic information on the syntactic tree guides the prosody cycles. Representationally, we mark this by using specific labels on the internal nodes of the tree. We assume that the flat constituent interpretation is obtained by marking internal nodes as non-cyclic, introducing the alphabet symbol σn:

σn(q∗(x), qc(w, y)) → qc(σc(x, y, w), y)   (7)

Essentially, rule 7 tells us that when a coordination node is marked as σn, Mpros just propagates the level of prosodic strength that it currently has registered (in y), without increments (see Figure 6). This rule can be trivially adjusted to deal with branching differences, as done for rules 3 and 5.

Figure 6: Application of rule (7) for flat prosody. [Tree diagrams not reproduced.]

A full, step by step Mpros transduction is shown in Figure 5. Taken together, the recursive prosodic patterns are fully characterized by Mpros when it is adjusted with a set of rules to deal with alternating branching and flat associativity. The tree transducer generates tree representations where each level of embedding is marked by a branch, which carries information about the prosodic strength for that level. As outlined in Section 4.2, this final representation may then be fed to a modified string yield function for dependency tree languages.
Dependency trees allowed us to present a transducer with rules that are relatively easy to read. But, as mentioned before, this choice does not affect our general result. Under the standard assumption that the distance between the head of a phrase and its maximal projection is bounded, Mpros can be extended to phrase structure trees, by virtue of the bottom-up strategy being intrinsically equipped with finite look-ahead. A switch to phrase structure trees may prove useful for future work on the interaction of prosody and movement.

5 Generating recursive prosody

The previous section characterized recursive prosody over trees with a non-linear, deterministic MBOT. This is a nice result, as MBOTs are generally well-understood in terms of their algorithmic properties. Moreover, this result is in line with past work exploring the connections of MBOTs, tree languages, and the complexity of movement and copying operations in syntax (Kobele, 2006; Kobele et al., 2007, a.o.).
We can now ask what the complexity of this approach is. MBOTs generate output string languages that are potentially parallel multiple context-free languages (PMCFL; Seki et al., 1991, 1993; Gildea, 2012; Maletti, 2014; Fülöp et al., 2005). Since this class of string languages is more powerful than context-free, the corresponding tree language is not a regular tree language (Gécseg and Steinby, 1997). This is not surprising, as MBOTs can be understood as an extension of synchronous tree substitution grammars (Maletti, 2014).
Notably, independently of our specific MBOT solution, prosody as defined in this paper generates at least some output string languages that lack the constant growth property — hence, that are PMCFLs. Consider as input a regular tree language of left-branching coordinate phrases, where each level is simply of the form and(X, Mary). The n-th level of embedding from the top extends the string yield by n+2 symbols. This immediately implies no constant growth, and thus no semi-linearity (Weir, 1988; Joshi et al., 1990).
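The counting behind this argument can be checked in a few lines of Python (an illustration only, assuming one word per conjunct and one boundary symbol per level of prosodic strength):

```python
# Yield length of a left-branching and(X, Mary) coordination with a boundary
# branch per embedding level: growth is quadratic, not linear.

def yield_length(levels):
    words = levels + 1                       # one word per conjunct
    ands = levels                            # one "and" per coordination
    boundaries = sum(range(1, levels + 1))   # n boundary symbols at level n
    return words + ands + boundaries

for n in (1, 2, 4, 8, 16):
    print(n, yield_length(n))
# the increase from level n-1 to level n is n + 2 symbols, as stated above
```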
Interestingly though, the prosody MBOT developed here is fairly limited in its expressivity, as the transducer states themselves do almost no work, and most of the transduction rules in Mpros rely on the ability to store the prosody strength branch. Hence, the specific MBOT in this paper might turn out to belong to a relatively weak subclass of tree transductions with copying, perhaps a variant of input strictly local tree transductions (cf. Ikawa et al., 2020; Ji and Heinz, 2020), or a transducer variant of sensing tree automata (cf. Fülöp et al., 2004; Kobele et al., 2007; Maletti, 2011, 2014; Graf and De Santo, 2019). Since all of those have recently been used in the formal study of syntax, they are natural candidates for a computational model of prosody, and their sensitivity to minor representational differences might also illuminate what aspects of syntactic representation affect the complexity of prosodic processes.
Finally, one might worry that the mathematical complexity is a confound of the representation we use, rather than a genuine property of the phenomenon. However, a representation of prosodic strength is necessary and cannot be reduced further for two reasons. First, strength cannot be reduced to syntactic boundaries, because a single prosodic edge ( may correspond to |^k for any k ≥ 1. As discussed in depth by Wagner (2005, 2010), one cannot simply convert a syntactic tree into a prosodic tree by replacing the labels of nonterminal nodes. Second, strength also cannot be reduced to different categories of prosodic constituents — e.g. assuming that | is a prosodic phrase while || is an intonational phrase. As argued in depth by Wagner (2005, 2010), these different constituent types do not map neatly to prosodic strength. Instead, these boundaries all encode relative strengths of prosodic phrase boundaries.

6 Conclusion

This paper formalizes the computation of unbounded recursive prosodic structures in coordination. Their computation cannot be done by string-based finite-state transducers; they instead need more expressive grammars. To our knowledge, this paper is one of the few (if not the only) formal results on how prosodic phonology at the sentence level is computationally more expressive than phonology at the word level.
As discussed above, recent work in prosodic phonology relies on the assumption that prosodic structure can be recursive. However, because such work usually assumes bounded recursion, such phenomena are computationally regular. Departing from this stance, this paper focused on the prosodic phenomena reported in Wagner (2005) as a core case study, because of the following fundamental properties:

• The syntax has unbounded recursion.
• The prosody has unbounded recursion.
• All recursive prosodic constituents have the same prosodic label (= a prosodic phrase).
• The recursive prosodic constituents have acoustic cues marking different strengths.
• There is an algorithm which explicitly assigns the recursive prosodic constituents to these different strengths.

In this paper, we focused on explicitly generating the prosodic strengths at each recursive prosodic level, putting aside the mathematically simpler task of converting a recursive syntactic tree into a recursive prosodic tree (Elfner, 2015; Bennett and Elfner, 2019) — which is a process essentially analogous to a relabeling of the nonterminal nodes of the syntactic tree, without care for the prosodic strength. The mapping studied in this paper has been conjectured in the past to be computationally more expressive than regular languages or functions (Yu and Stabler, 2017). Here, we formally verified that hypothesis.
An open question then is to find other empirical phenomena which also have the above properties. One potential area of investigation is the assignment of relative prominence relations in English compound prosody (Chomsky and Halle, 1968). However, English compound prosody is a highly controversial area. It is unclear what the current consensus is on an exact algorithm for these compounds, especially one that utilizes recursion and is not based on impressionistic judgments (Liberman and Prince, 1977; Gussenhoven, 2011). In this sense, the mathematical results in this paper highlight the importance of representational commitments and of explicit assumptions in the study of prosodic expressivity. Our paper might then help identify crucial issues in future theoretical and empirical investigations of the syntax-prosody interface.

Acknowledgements

We are grateful to our anonymous reviewers, Jon Rawski, and Kristine Yu. Thomas Graf is supported by the National Science Foundation under Grant No. BCS-1845344.

References

Ryan Bennett and Emily Elfner. 2019. The syntax–prosody interface. Annual Review of Linguistics, 5:151–171.
Mikołaj Bojańczyk. 2018. Polyregular functions. arXiv John S Coleman. 1991. Prosodic structure, parameter-
preprint arXiv:1810.08760. setting and ID/LP grammar. In Steven Bird, editor,
Declarative perspectives on phonology, pages 65–78.
Mikołaj Bojańczyk, Sandra Kiefer, and Nathan Lhote. Centre for Cognitive Science, University of Edinburgh.
2019. String-to-string interpretations with polynomial-
size output. In 46th International Colloquium on Marie-Catherine De Marneffe and Joakim Nivre. 2019.
Automata, Languages, and Programming, ICALP Dependency grammar. Annual Review of Linguistics,
2019, July 9-12, Patras, Greece. (LIPIcs), volume 5:197–218.
132, page 106:1–106:14, Schloss Dagstuhl - Leibniz-
Zentrum fuer Informatik. Ralph Debusmann and Marco Kuhlmann. 2010. Depen-
Marisa Ferrara Boston, John T. Hale, and Marco dency grammar: Classification and exploration. In
Kuhlmann. 2009. Dependency structures derived Resource-adaptive cognitive processes, pages 365–388.
from minimalist grammars. In The Mathematics of Springer.
Language, pages 1–12. Springer.
Arthur Dirksen. 1993. Phrase structure phonology. In
Peter Chew. 2003. A computational phonology of Russian. T. Mark Ellison and James Scobbie, editors, Compu-
Universal-Publishers, Parkland, FL. tational Phonology, page 81–96. Centre for Cognitive
Science, University of Edinburgh.
Noam Chomsky. 1956. Three models for the description
of language. IRE Transactions on information theory, Hossep Dolatian. 2020. Computational locality of cyclic
2(3):113–124. phonology in Armenian. Ph.D. thesis, Stony Brook
University.
Noam Chomsky and Morris Halle. 1968. The sound
pattern of English. MIT Press, Cambridge, MA. Hossep Dolatian, Nate Koser, Kristina Strother-Garcia,
Noam Chomsky and Marcel P Schützenberger. 1959. and Jonathan Rawski. 2021. Computational restric-
The algebraic theory of context-free languages. In tions on iterative prosodic processes. In Proceedings
Studies in Logic and the Foundations of Mathematics, of the 2019 Annual Meeting on Phonology. Linguistic
volume 26, pages 118–161. Elsevier. Society of America.
Kenneth Ward Church. 1983. Phrase-structure parsing: A Emily Elfner. 2015. Recursion in prosodic phrasing:
method for taking advantage of allophonic constraints. Evidence from Connemara Irish. Natural Language &
Ph.D. thesis, Massachusetts Institute of Technology. Linguistic Theory, 33(4):1169–1208.
John Coleman. 1992. The phonetic interpretation of Joost Engelfriet. 2015. Two-way pebble transducers
headed phonological structures containing overlapping for partial functions and their composition. Acta
constituents. Phonology, 9(1):1–44. Informatica, 52(7-8):559–571.
John Coleman. 1993. English word-stress in unification-
based grammar. In T. Mark Ellison and James Scobbie, Joost Engelfriet and Hendrik Jan Hoogeboom. 2001.
editors, Computational Phonology, page 97–106. MSO definable string transductions and two-way finite-
Centre for Cognitive Science, University of Edinburgh. state transducers. Transactions of the Association for
Computational Linguistics, 2(2):216–254.
John Coleman. 1995. Declarative lexical phonology.
In Jacques Durand and Francsis Katamba, editors, Joost Engelfriet and Sebastian Maneth. 2002. Two-way
Frontiers of phonology: Atoms, structures, derivations, finite state transducers with nested pebbles. In Inter-
pages 333–383. Longman, London. national Symposium on Mathematical Foundations of
Computer Science, pages 234–244. Springer.
John Coleman. 1996. Declarative syllabification in
Tashlhit Berber. In Jacques Durand and Bernard Laks, Joost Engelfriet, Grzegorz Rozenberg, and Giora Slutzki.
editors, Current trends in phonology: Models and 1980. Tree transducers, l systems, and two-way
methods, volume 1, pages 175–216. European Studies machines. Journal of Computer and System Sciences,
Research Institute, University of Salford, Salford. 20(2):150–202.
John Coleman. 1998. Phonological representations: Zoltán Fülöp, Armin Kühnemann, and Heiko Vogler.
Their names, forms and powers. Cambridge University 2004. A bottom-up characterization of deterministic
Press, Cambridge. top-down tree transducers with regular look-ahead.
John Coleman. 2000. Candidate selection. The Linguistic Information Processing Letters, 91(2):57–67.
Review, 17(2-4):167–180.
Zoltán Fülöp, Armin Kühnemann, and Heiko Vogler. 2005.
John Coleman and Janet Pierrehumbert. 1997. Stochastic Linear deterministic multi bottom-up tree transducers.
phonological grammars and acceptability. In Third Theoretical computer science, 347(1-2):276–287.
meeting of the ACL special interest group in com-
putational phonology: Proceedings of the workshop, Ferenc Gécseg and Magnus Steinby. 1997. Tree lan-
pages 49–56, East Stroudsburg, PA. Association for guages. In Handbook of formal languages, pages 1–68.
computational linguistics. Springer.
Dafydd Gibbon. 2001. Finite state prosodic analysis of C Douglas Johnson. 1972. Formal aspects of phonologi-
African corpus resources. In EUROSPEECH 2001 cal description. Mouton, The Hague.
Scandinavia, 7th European Conference on Speech
Communication and Technology, 2nd INTERSPEECH Aravind K Joshi, K Vijay Shanker, and David Weir. 1990.
Event, Aalborg, Denmark, September 3-7, 2001, pages The convergence of mildly context-sensitive grammar
83–86. ISCA. formalisms. Technical Reports (CIS), page 539.
Daniel Gildea. 2012. On the string translations produced Ronald M. Kaplan and Martin Kay. 1994. Regular
by multi bottom–up tree transducers. Computational models of phonological rule systems. Computational
Linguistics, 38(3):673–693. linguistics, 20(3):331–378.
Thomas Graf and Aniello De Santo. 2019. Sensing tree
George Anton Kiraz and Bernd Möbius. 1998. Mul-
automata as a model of syntactic dependencies. In
tilingual syllabification using weighted finite-state
Proceedings of the 16th Meeting on the Mathematics of
transducers. In The third ESCA/COCOSDA workshop
Language, pages 12–26, Toronto, Canada. Association
(ETRW) on speech synthesis.
for Computational Linguistics.
Carlos Gussenhoven. 2011. Sentential prominence in Ewan Klein. 1991. Phonological data types. In Steven
English. In Marc van Oostendorp, Colin Ewen, Eliz- Bird, editor, Declarative perspectives on phonol-
abeth Hume, and Keren Rice, editors, The Blackwell ogy, pages 127–138. Centre for Cognitive Science,
companion to phonology, volume 5, pages 1–29. University of Edinburgh.
Wiley-Blackwell, Malden, MA.
Gregory M. Kobele, Christian Retoré, and Sylvain Salvati.
Yiding Hao. 2020. Metrical grids and generalized tier pro- 2007. An automata-theoretic approach to minimalism.
jection. In Proceedings of the Society for Computation Model theoretic syntax at 10, pages 71–80.
in Linguistics, volume 3.
Gregory Michael Kobele. 2006. Generating Copies: An
Jeffrey Heinz. 2018. The computational nature of investigation into structural identity in language and
phonological generalizations. In Larry Hyman and grammar. Ph.D. thesis, University of California, Los
Frans Plank, editors, Phonological Typology, Phonetics Angeles.
and Phonology, chapter 5, pages 126–195. Mouton de
Gruyter, Berlin. Nate Koser. in prep. The computational nature of stress
assignment. Ph.D. thesis, Rutgers University.
Mans Hulden. 2006. Finite-state syllabification. In Anssi
Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki,
Marco Kuhlmann. 2013. Mildly non-projective de-
editors, Finite-State Methods and Natural Language
pendency grammar. Computational Linguistics,
Processing. FSMNLP 2005. Lecture Notes in Computer
39(2):355–387.
Science, volume 4002. Springer, Berlin/Heidelberg.
Harry Van der Hulst. 2010. A note on recursion in D. Robert Ladd. 1986. Intonational phrasing: The case for
phonology recursion. In Harry Van der Hulst, editor, recursive prosodic structure. Phonology, 3:311–340.
Recursion and human language, pages 301–342.
Mouton de Gruyter, Berlin & New York. D. Robert Ladd. 2008. Intonational phonology. Cam-
bridge University Press, Cambridge.
William J Idsardi. 2009. Calculating metrical structure.
In Eric Raimy and Charles E. Cairns, editors, Contem- D. Terence Langendoen. 1975. Finite-state parsing
porary views on architecture and representations in of phrase-structure languages and the status of
phonology, number 48 in Current Studies in Linguistics, readjustment rules in grammar. Linguistic Inquiry,
pages 191–211. MIT Press, Cambridge, MA. 6(4):533–554.
Shiori Ikawa, Akane Ohtaka, and Adam Jardine. 2020. D. Terence Langendoen. 1987. On the phrasing of
Quantifier-free tree transductions. Proceedings of the coordinate compound structures. In Brian Joseph and
Society for Computation in Linguistics, 3(1):455–458. Arnold Zwicky, editors, A festschrift for Ilse Lehiste,
page 186–196. Ohio State University, Ohio.
Junko Ito and Armin Mester. 2012. Recursive prosodic
phrasing in Japanese. In Toni Borowsky, Shigeto
D. Terence Langendoen. 1998. Limitations on embedding
Kawahara, Shinya Takahito, and Mariko Sugahara,
in coordinate structures. Journal of Psycholinguistic
editors, Prosody matters: Essays in honor of Elisabeth
Research, 27(2):235–259.
Selkirk, pages 280–303. Equinox Publishing, London.
Junko Ito and Armin Mester. 2013. Prosodic subcate- Nathan Lhote. 2020. Pebble minimization of polyreg-
gories in Japanese. Lingua, 124:20–40. ular functions. In Proceedings of the 35th Annual
ACM/IEEE Symposium on Logic in Computer Science,
Jing Ji and Jeffrey Heinz. 2020. Input strictly local pages 703–712.
tree transducers. In International Conference on
Language and Automata Theory and Applications, Mark Liberman and Alan Prince. 1977. On stress and
pages 369–381. Springer. linguistic rhythm. Linguistic inquiry, 8(2):249–336.
Eric Lilin. 1981. Propriétés de clôture d’une extension de Kristina Strother-Garcia. 2018. Imdlawn Tashlhiyt Berber
transducteurs d’arbres déterministes. In Colloquium syllabification is quantifier-free. In Proceedings of
on Trees in Algebra and Programming, pages 280–289. the Society for Computation in Linguistics, volume 1,
Springer. pages 145–153.
Andreas Maletti. 2011. How to train your multi bottom- Kristina Strother-Garcia. 2019. Using model theory
up tree transducer. In Proceedings of the 49th Annual in phonology: a novel characterization of syllable
Meeting of the Association for Computational Linguis- structure and syllabification. Ph.D. thesis, University
tics: Human Language Technologies, pages 825–834. of Delaware.
Andreas Maletti. 2014. The power of regularity- Lucien Tesnière. 1965. Eléments de syntaxe structurale,
preserving multi bottom-up tree transducers. In 1959. Paris, Klincksieck.
International Conference on Implementation and
Application of Automata, pages 278–289. Springer. Johan t’Hart and Antonie Cohen. 1973. Intonation by rule:
a perceptual quest. Journal of Phonetics, 1(4):309–327.
Marina Nespor and Irene Vogel. 1986. Prosodic
phonology. Foris, Dordrecht. Johan t’Hart and René Collier. 1975. Integrating different
levels of intonation analysis. Journal of Phonetics,
Joakim Nivre. 2005. Dependency grammar and depen- 3(4):235–255.
dency parsing. MSI report, 5133.1959:1–32.
Johan t’Hart, René Collier, and Antonie Cohen. 2006.
Marc van Oostendorp. 1993. Formal properties of A perceptual study of intonation: An experimental-
metrical structure. In Sixth Conference of the Euro- phonetic approach to speech melody. Cambridge
pean Chapter of the Association for Computational University Press.
Linguistics, pages 322–331, Utrecht. ACL.
Michael Wagner. 2005. Prosody and recursion. Ph.D.
Janet Breckenridge Pierrehumbert. 1980. The phonology thesis, Massachusetts Institute of Technology.
and phonetics of English intonation. Ph.D. thesis,
Massachusetts Institute of Technology. Michael Wagner. 2010. Prosody and recursion in coor-
dinate structures and beyond. Natural Language &
Peter. A. Reich. 1969. The finiteness of natural language. Linguistic Theory, 28(1):183–237.
Language, 45:831–843.
Markus Walther. 1993. Declarative syllabification with
Walter J Savitch. 1993. Why it might pay to assume that applications to German. In T. Mark Ellison and James
languages are infinite. Annals of Mathematics and Scobbie, editors, Computational Phonology, pages
Artificial Intelligence, 8(1-2):17–25. 55–79. Centre for Cognitive Science, University of
Edinburgh.
James M. Scobbie, John S. Coleman, and Steven Bird.
1996. Key aspects of declarative phonology. In Jacques Markus Walther. 1995. A strictly lexicalized approach
Durand and Bernard Laks, editors, Current Trends in to phonology. In Proceedings of DGfS/CL’95, page
Phonology: Models and Methods, volume 2. European 108–113, Düsseldorf. Deutsche Gesellschaft für
Studies Research Institute, Salford, Manchester. Sprachwissenschaft, Sektion Computerlinguistik.
Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and David Jeremy Weir. 1988. Characterizing mildly context-
Tadao Kasami. 1991. On multiple context-free gram- sensitive grammar formalisms. Ph.D. thesis, University
mars. Theoretical Computer Science, 88(2):191–229. of Pennsylvania.
Hiroyuki Seki, Ryuichi Nakanishi, Yuichi Kaji, Sachiko Ngee Thai Yap. 2006. Modeling syllable theory with finite-
Ando, and Tadao Kasami. 1993. Parallel multiple state transducers. Ph.D. thesis, University of Delaware.
context-free grammars, finite-state translation systems,
and polynomial-time recognizable subclasses of Kristine M. Yu. 2017. Advantages of constituency:
lexical-functional grammars. In Proceedings of the Computational perspectives on Samoan word prosody.
31st annual meeting on Association for Computa- In International Conference on Formal Grammar 2017,
tional Linguistics, pages 130–139. Association for page 105–124, Berlin. Spring.
Computational Linguistics.
Kristine M. Yu. 2019. Parsing with minimalist grammars
Elisabeth Selkirk. 1986. On derived domains in sentence and prosodic trees. In Robert C. Berwick and Edward P.
phonology. Phonology Yearbook, 3(1):371–405. Stabler, editors, Minimalist Parsing, pages 69–109.
Oxford University Press, London.
Elisabeth Selkirk. 2011. The syntax-phonology interface.
In John Goldsmith, Jason Riggle, and Alan C. L. Yu, Kristine M. Yu and Edward P. Stabler. 2017. (In)
editors, The Handbook of Phonological Theory, 2 variability in the Samoan syntax/prosody interface
edition, pages 435–483. Blackwell, Oxford. and consequences for syntactic parsing. Laboratory
Nazila Shafiei and Thomas Graf. 2020. The subregular Phonology: Journal of the Association for Laboratory
complexity of syntactic islands. In Proceedings of the Phonology, 8(1):1–44.
Society for Computation in Linguistics, volume 3.
The Match-Extend Serialization Algorithm in Multiprecedence
Maxime Papillon
Classic, Modern Languages and Linguistics, Concordia University
[email protected]
structure when assuming strings. This assumption is what Multiprecedence abandons. Multiprecedence proposes that asymmetry and irreflexivity are not relevant to phonology. A segment can precede or follow multiple segments, two segments can transitively precede each other, and a segment can precede itself. A valid multiprecedence graph is not restricted by topology, a term I will use for the pattern of the graph independent from the content of the nodes.
Using this view of precedence, affixation is the process of combining the graph representations of different morphemes. A word is a graph consisting of the edges and vertices (precedence relations and segments) of one or more morphemes. An example of the suffixation of the English plural is shown in Fig. 2a, and the infixation of the Atayal animate focus morpheme is given in Fig. 2b. Full root reduplication, which expresses the plural of nouns in Indonesian, is shown in Fig. 2c. There are two things to notice in Fig. 2c. First, that a precedence arrow is added without any segmental material: the reduplicative morpheme consists of just that arrow. Second, although Fig. 2a and Fig. 2b each offer two paths from the START to the END of the graph, Fig. 2c contains a loop that offers an infinite number of paths from START to END. The representation itself does not enforce how many times the arrow added by the plural morpheme should be traversed. All three of these structures have to be handled by a serialization algorithm in order to be actualized by the phonetic motor system, which selects a path through the graph to be sent to the articulators. A correct serialization algorithm must be able to select the correct of the two paths in Fig. 2a and Fig. 2b and the path going through the back loop only once in Fig. 2c.

a. # k æ t %  =⇒  # k æ t z %
b. # h N u P %  =⇒  # h m N u P %
c. # k @ r a %  =⇒  # k @ r a k @ r a %

Figure 2: Affixation in Multiprecedence. Suffixation (a), infixation (b), reduplication (c). [Graph arrows not reproduced; each panel shows the input graph's segments and the serialized output.]

I will assume here that these forms are constructed by the attachment of an affix morpheme onto a stem as in Fig. 3. English speakers have a graph as a lexical item for the plural as in Fig. 3a and a lexical item for CAT as in Fig. 3b, which combine as in Fig. 3c. The moniker "last segment" is an informal way to refer to that part of the affix that is responsible for attaching it to the stem in the right location. This piece of the plural affix will attach onto the last segment, the one preceding the end of the word %, of what it combined with, and onto %, yielding Fig. 3c. Similarly the Atayal form in Fig. 2 is built from a root #hNuP% 'soak' and an infix -m- marking the animate actor focus and attaching between the first and the second segment. For details on the mechanics of attachment see Raimy (2000a, §3.2), Samuels (2009, p.177-87), and Papillon (2020, §2.2). It suffices here to say that at vocabulary insertion an affix can target certain segments of the stem for attachment. Raimy (2000a) shows how this representation can generate the reduplicative patterns from numerous languages as well as account for such phenomena as over- and under-application of phonological processes in reduplication.

a. [last segment] → z → %
b. # → k → æ → t → %
c. # → k → æ → t → %, plus t → z → %

Figure 3: Affix (a) and root (b) combined in (c).
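The graphs of Fig. 3 can be rendered as plain sets of precedence pairs; the following toy Python snippet (mine, for illustration only) treats affixation as set union, with the "[last segment]" slot resolved by hand to the stem-final t.

```python
# Morphemes as sets of precedence pairs; "#" and "%" are the start and end anchors.
cat = {("#", "k"), ("k", "æ"), ("æ", "t"), ("t", "%")}   # Fig. 3b
plural = {("t", "z"), ("z", "%")}                         # Fig. 3a, attached to the last segment t
katz = cat | plural                                       # Fig. 3c: one graph with two paths out of t
print(sorted(katz))
```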
Given the assumption that a non-linear graph cannot be pronounced, phonology requires an algorithm capable of converting graph representations into strings like in Fig. 2. Two main families of algorithms have been proposed. Raimy (1999) proposed a stack-based algorithm which was expanded upon by Idsardi and Shorey (2007) and McClory and Raimy (2007). This algorithm traverses the graph from # to % by accessing the stack. This idea suffers the problem of requiring parameters on individual arcs. Every morphologically-added precedence link must be parametrized as to its priority, determining whether it goes to the top or the bottom of the stack. This is necessary in this system because when a given arc is traversed is not predictable on the basis of when it is encountered in a traversal. This parametrization radically explodes the range of patterns predicted to be possible, much beyond what is attested. Fitzpatrick and Nevins (2002; 2004) proposed a different constraint-based algorithm which globally compares paths through the graph for completeness and economy, but suffers the problem of requiring ad hoc constraints targeting individual types of graphs, lacking generality. In the rest of this article I will present a new algorithm which lacks any parameter and whose two operations are generic and not geared towards any specific configuration.

3 The Match-Extend algorithm

This section will present the Match-Extend algorithm and follow up with a demonstration of its operation on various attested Multiprecedence topologies.
The input to the algorithm is the set of pairs of segments corresponding to the pairs of segments in immediate precedence relation without the affix, e.g. {#k, kæ, æt, t%} for the English stem kæt, and the set of pairs of segments corresponding to the precedence links added by the affix, e.g. {tz, z%} when the plural is added.
Intuitively the algorithm starts from the morphologically added links and extends outwards by following the precedence links in the StemSet, the set of all precedence links in the stem to which the morpheme is being added. If there is more than one morphologically added link, they all extend in parallel and collapse together if one string ends in one or more segments and the other begins with the same segment or segments. A working version of this algorithm coded in Python will be included as supplementary material.

1. The precedence links of the stem begin in a set StemSet.
2. The morphologically added links begin in a set WorkSpace.
3. Whenever two strings in the WorkSpace match such that the end of one string is identical to the beginning of the other, the operation Match collapses the two into one string such that the shared part appears once, e.g. abcd and cdef to abcdef. A Match along multiple characters is done first.
4. When there is no match within the WorkSpace, the operation Extend simultaneously lengthens all strings in the WorkSpace to the right and left using matching precedence links of the stem. StemSet remains unchanged.
5. Steps 3 and 4 are repeated until # and % have been reached by Extend and there is a single string in the WorkSpace.

Algorithm 1: The Match-Extend Algorithm (informal version).

StemSet: {#k, k@, @r, ra, a%}
WorkSpace: ak
Extend: rak@
Extend: @rak@r
Extend: k@rak@ra
Extend: #k@rak@ra%

Figure 4: Match-Extend derivation of k@ra-k@ra.

[...] reduplication in Tohono O'odham involving reduplicated patterns such as babaD to ba-b-baD, and čipkan to či-čpkan, requiring graphs as in Raimy (2000a, p.114). Although there are multiple plausible paths through this graph, only one is attested, and this path requires traversing the graph by following the backlink before the front-link, even though the front-link would be encountered first in a traversal.
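The paper's own Python implementation is included as supplementary material; the sketch below is a re-implementation of the informal statement in Algorithm 1 and makes assumptions where the prose is silent (each precedence link is a two-character string, ties between equally long matches and between competing stem links are broken arbitrarily, and no guard against non-termination is included).

```python
# Sketch of Match-Extend over two-character precedence links.

def match_extend(stem_links, affix_links):
    """stem_links/affix_links: iterables of 2-character strings (precedence pairs)."""
    stemset = list(stem_links)
    workspace = list(affix_links)

    def try_match():
        # Match: collapse two strings when a suffix of one equals a prefix of the other;
        # prefer the longest overlap (the "best" match).
        best = None
        for i, s in enumerate(workspace):
            for j, t in enumerate(workspace):
                if i == j:
                    continue
                for k in range(min(len(s), len(t)), 0, -1):
                    if s.endswith(t[:k]):
                        if best is None or k > best[0]:
                            best = (k, i, j)
                        break
        if best is None:
            return False
        k, i, j = best
        merged = workspace[i] + workspace[j][k:]
        workspace[:] = [w for n, w in enumerate(workspace) if n not in (i, j)] + [merged]
        return True

    def extend():
        # Extend: lengthen every string one segment to the left and to the right
        # using the stem's precedence links; StemSet itself is unchanged.
        for i, w in enumerate(workspace):
            left = next((l for l in stemset if l[1] == w[0]), None)
            right = next((l for l in stemset if l[0] == w[-1]), None)
            if left:
                w = left[0] + w
            if right:
                w = w + right[1]
            workspace[i] = w

    while not (len(workspace) == 1 and workspace[0].startswith("#") and workspace[0].endswith("%")):
        if not try_match():
            extend()
    return workspace[0]

print(match_extend({"#k", "kæ", "æt", "t%"}, {"tz", "z%"}))     # -> #kætz%
print(match_extend({"#k", "k@", "@r", "ra", "a%"}, {"ak"}))     # -> #k@rak@ra% (Fig. 4)
```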
[...] the node č and the other ends with the same node. The two are collapsed as ičp and then keep extending.

StemSet: {#č, či, ip, pk, ka, an, n%}
WorkSpace: čp, ič
Match: ičp
Extend: čičpk
Extend: #čičpka
Extend: #čičpkan
Extend: #čičpkan%

Figure 7: Nancowry Pit-sut. [Graph # s u t % with its added i link not reproduced.]

Again Match-Extend will serialize Fig. 7 without any further parameter, as in Fig. 8. The three strings #i, it, and ts can match right away into a single string #its which will keep extending.

StemSet: {#s, su, ut, t%}
WorkSpace: #i, it, ts
Match: #it, ts
Match: #its
Extend: #itsu
Extend: #itsut
Extend: #itsut%

Figure 8: Derivation of Nancowry Pitsut.

As these examples illustrate, Match-Extend does not need to be specified with look-ahead, global considerations, or graph-by-graph specifications of serialization to derive the attested serialization of graphs like Fig. 5 or Fig. 7. The serialization starts in parallel from two added links that extend until they reach each other in the middle, and this will work regardless of the order in which 'backward' and 'forward' arcs are located. They will meet in one direction and serialize in this order.

Figure 9: Lushotseed gw ad-ad-gw ad. [Graph # gw a d % with its added links not reproduced.]

Another interesting topology is found in the analysis of Lushotseed. Fitzpatrick & Nevins (2002; 2004) proposed an ad hoc constraint to handle this type of scenario, the constraint SHORTEST, enforcing serializations that follow the shorter arrow first. But Match-Extend derives the attested pattern without any further assumptions. Consider the derivation of the Lushotseed form in Fig. 9. After one Extend step, the two strings adgw a and adad match along the nodes ad. You might notice that the two strings also match in the other order with the node a, so we must assume the reasonable principle that in case of multiple matches, the best match, meaning the match along more nodes, is chosen. From that point on adadgw a extends into the desired form.
It is somewhat intuitive to see why this works: because Match-Extend applies one step of Extend
StemSet { #gw , gw a, ad, d%} The graph in Fig. 12 is simply the transpose
dgw graph of a graph where S HORTEST would apply
WorkSpace like Fig. 9, but it does not actually fit the pat-
da
adgw a tern of S HORTEST as its two ‘backward’ arrows
Extend do not start from the same node. In fact if any-
adad
Match adadgw a thing S HORTEST would predict the wrong surface
Extend g adadgw ad
w form, as *si-sil-sil would be the form if the shorter
Extend #gw adadgw ad% path were taken first. In Match-Extend and Clos-
est Attachment Fig. 11 the prediction is clear: it is
Figure 10: Derivation of Lushotseed gw adadgw ad. predicted to serialize as sil-si-sil because the path
from l→s to i→s is shorter than the path from i→s
to l→s, thus deriving the correct string.
strings from the WorkSpace immediately when a
Fitzpatrick and Nevins (2002) report some
Match is found, two arcs added by the morphology
forms with graphs like Fig. 12 which must be
will necessarily match in the direction in which
linearized in ways that would contradict Match-
they are the closest. The end of the d→a arc is
Extend, such as saxw to sa-saxw -saxw in Lusot-
closer to the beginning of the d→gw one than vice-
sheed Diminutive+Distributive forms. But con-
versa, and hence the two will join in this direction
trary to the Distributive+OOC forms discussed
and therefore surface in this order. This can be
earlier there is no independent evidence here for
generalized as Fig. 11.
the two reduplications being serialized together.
• If the graph contains two morphologically I therefore assume that those instances consist
added links α → β and γ → δ, and of two separate cycles, serialized one at a time:
saxw to saxw -saxw to sa-saxw -saxw . Match-Extend
I There is a unique path X from β to γ not
therefore relies on cyclicity, with the graph built
going through α → β or γ → δ, and
up through affixation and serialized multiple times
I There is a unique path Y from δ to α not over the course of the derivation.
going through α → β or γ → δ,
3.2 Non-Edge Fixed Segmentism
• Then the Match-Extend algorithm will output
a string containing: Fixed segmentism refers to cases of reduplication
where a segment of one copy is overridden by one
I ...αβ...γδ... if X is shorter than Y or more fixed segments. A well known English
I ...γδ...αβ... if Y is shorter than X example is schm-reduplication like table to table-
schmable where schm-replaces the initial onset. I
Figure 11: Closest Attachment in Match-Extend. will call Non-Edge Fixed Segmentism (NEFS) the
special case of fixed segmentism where the fixed
Note that this is not a new assumption: this is segment is not at the edge of one of the copies.
a theorem of the model derivable from the way These are the examples where the graph needed is
Match and Extend interact with multiple morpho- like Fig. 13 or Fig. 14.
logically added arcs. This can allow us to work
out some serializations without having to do the
# a b c d e %
whole derivation.
Consider for instance the Nlaka’pamuctsin dis- x
tributive+diminutive double reduplication, e.g. sil,
‘calico’, to sil-si-sil, (Broselow, 1983). This pat- Figure 13: NEFS ‘early’ in the copy.
tern requires the Multiprecedence graph to look as
in Fig. 12.
# a b c d e %
# s i l %
x
Figure 12: Nlaka’pamuctsin sil-si-sil. Figure 14: NEFS ‘late’ in the copy.
Closest Attachment in Match-Extend predicts parametrized as to whether they are added on top
that if a fixed-segment is added towards the be- or at the bottom of the stack upon affixation, thus
ginning of the form, it should surface in the sec- deriving elaNeliN from the /a/ allomorph being on
ond copy, and if it is added toward the end of the top of the stack and traversed early and udanuden
form, it should surface in the first copy. Or in other from the /e/ allomorph being at the bottom of the
words the fixed segment will always occur in the stack and traversed late. This freedom of lexical
copy such that the fixed segment is closer to the specification grants their system the power to en-
juncture of the two copies. The graph in Fig. 13 force any order needed, including the capacity to
will serialize as abcde-axcde and the graph in handle the ‘look-ahead’ and ‘shortest’ cases above
Fig. 14 will serialize as abcxe-abcde. This in terms of full lexical specification. They could
follows from the properties of Match and Extend: also easily handle languages with the equivalent
as the precedence pairs of the overwriting segment of a L ONGEST constraint. This model is less pre-
and the precedence pair of the backward link ex- dictive while also being more complex.
tend outward, it will either reach the left or right
side first and this will determine the order in which
they appear in the final serialized form. # u d a n %
This prediction is borne out by many exam- e
ples of productive patterns of reduplication with
NEFS such as Marathi saman-suman (Alderete et Figure 16: Javanese udan-uden according to Id-
al., 1999, citing Apte 1968), Bengali sajôa-sujôa sardi and Shorey (2007) and McClory and Raimy
(Khan, 2006, p.14), Kinnauri migo-m@go (Chang, (2007).
2007).
Apparent counterexamples exist, but have other But this complexity is unneeded if we instead
plausible analyses. A major one worth discussing adopt dissimilation analysis closer in spirit to
briefly is the previous multiprecedence analysis of Yip’s original Optimality-Theory analysis. We
the Javanese Habitual-Repetitive as described by can say that the /a/ of the first copy is an over-
Yip (1995; 1998). Most forms surface with a fixed written /a/ in both elaN-eliN and in udan-uden
/a/ in the first copy as in elaN-eliN ‘remember’. and a phonological process causes dissimilation of
This requires a graph such as Fig. 15 which se- the root /a/ in the presence of the added /a/. In
rializes in comformity with Match-Extend. Optimality-Theory this requires an appeal to the
Obligatory Contour Principle operating between
the two copies, but in Multiprecedence the dissim-
# e l i N % ilation is even simpler to state because the two /a/’s
are very local in the graph. We simply need a rule
a to the effect of raising a stem /a/ in the context of a
morphologically-added /a/ that precedes the same
Figure 15: Javanese elaN-eliN. segment as in Fig.17.
Extend on the basis of Javanese. We have therefore seen that Match-Extend can
Consider another apparent counterexample to straightforwardly account for a number of attested
the prediction: the Palauan root /rEb@th / forms complex reduplicative patterns without any special
its distributive with CVCV reduplication and the stipulations. More interestingly Match-Extend
verbal prefix m@- forming m@-r@b@-rEb@th (Zuraw, makes strong novel predictions about the loca-
2003). At first blush, one may be tempted to see tion of fixed segments. I have not been able to
the first schwa of the first copy as overwriting the locate many examples of NEFS in the literature.
root’s /E/. But the presence of this schwa actually For example the typology of fixed segmentism in
follows from the independently-motivated phonol- Alderete et al. (1999) does not contain any exam-
ogy of Palauan in which all non-stressed vowels ple of NEFS. This will require further empirical
go to [@]. This thus is the result of a phonolog- research.
ical rule applying after serialization about which
Match-Extend has nothing to say. 4 One limitation of Match-Extend:
Relatedly, other apparent issues may be caused overly symmetrical graphs
by interactions with phonology. D’souza (1991,
There is a gap in the predictions of Fig. 11:
p.294) describes how echo-formation in some
Closest Attachment predicts that morphologically-
Munda languages is accomplished by replacing all
added edges will attach in the order they are the
the vowels in the second copy with a fixed vowel,
closest, which relies on an asymmetry in the form
e.g. Gorum bubuP ‘snake’ > bubuP-bibiP. Fixed
such that morphologically-added links are closer
segmentism of each vowel individually may not
in one order than the other. This leaves the prob-
be the best analysis of these forms, there may in-
lem of symmetrical forms like Fig. 19. The former
stead be a single fixed segment and a separate pro-
of there was posited in the analysis of Semai con-
cess of vowel harmony or something along those
tinuative reduplication by Raimy (2000a, p.146-
lines. This type of complex interaction of non-
47) for forms like dNOh ‘appearance of nodding’
local phonology with reduplication has been in-
to dh-dNOh; the latter would be needed in various
vestigated before in Multiprecedence, e.g. the
languages reduplicating CVC forms with vowel
analyses of Tuvan vowel harmony in reduplicated
changes such as the Takelma aorist described
forms in Harrison and Raimy (2004) and Papillon
in Sapir (1922, p.58) like t’eu ‘play shinny’ to
(2020, §7.1), but these analyses make extra as-
t’eut’au.
sumptions about possible Multiprecedence struc-
tures that go far beyond the basics explored here.
The subject requires further exploration, but ap- # a b c % # a b c %
pears to be more of an issue of phonology and rep-
x
resentation than of serialization per se.
Apparent counterexamples will have to be ap- Figure 19: Two structures overly symmetrical for
proached on a case-by case basis, but I have not Match-Extend.
identified many problematic examples so far that
did not turn out to be errors of analysis.1 These are the forms which, in the course of
1
Match-Extend, will come to a point where Match
One such apparent counter-example is worth briefly
commenting on here due to its being mentioned in well-
is indeterminate because two strings could match
known surveys of reduplication. This alleged reduplication equally well in either direction. For example the
is from in Macdonald and Darjowidjojo (1967, p.54) and WorkSpace of the first of these structures will start
repeated in Rubino (2005, p.16): Indonesian belat ‘screen’
to belat-belit ‘underhanded’. If correct this example would with ac and ca, which can match either as aca
be a counterexample to Match-Extend, as a fixed /i/ must or cac. The former would extend into #acabc%
surface in the second copy. However this pair seems to be and the latter into #abcac%. Match-Extend as
misidentified. The English-Indonesian bilingual dictionary
by (Stevens and Schmidgall-Tellings, 2004) lists a word be- stated so far is therefore indeterminate with regard
lit meaning ‘crooked, cunning, deceitful, dishonest, under- to these symmetrical forms.
handed’, which semantically seems like a more plausible
source for the reduplicated form belat-belit and fits the pre-
This is not an insurmountable problem for
dictions of Match-Extend. The same dictionary’s entry under Match-Extend. To the contrary this is a problem
belat lists some screen-related entries and then belat-belit as
meaning ‘crooked, devious, artful, cunning, insincere’ cross- was misidentified by previous authors and is unproblematic
referencing to belit as the base. I conclude that this example for Match-Extend.
of having too many solutions without a way to de- allomorphy (Papillon, 2020). A serialization
cide between them, none of which require adding algorithm capable of handling these structures is
parametrization to Match-Extend. Maybe sym- crucial for the completeness of the theory.
metrical forms crash the derivation and all appar- As pointed out by a reviewer, it is crucial to de-
ent instances in the literature must contain some velop a a typology of the possible attested graph-
hidden asymmetry. It is worth noting that the pat- ical input structures to the algorithm so as to
tern in Fig. 19 attested in Semai has a close cog- properly characterize and formalize the algorithm
nate in Temiar, but in this language the symmet- needed. In every form discussed here the roots is
rical structure is only obtained for simple onsets, implicitly assumed to be underlyingly linear and
kOw ‘call’ to kw-kOw, but slOg ‘sleep with’ to s- affixes alone add some topological variety to the
g-lOg (Raimy, 2000a, p.146). This asymmetry re- graphs, as is mostly the case in all the forms from
solves the Match-Extend derivation. It may simply (Raimy, 1999; Raimy, 2000a). Elsewhere I have
be the case that the forms that look symmetrical challenged this idea by positing parallel structures
have a hidden asymmetry in the form of silent seg- both underlyingly and in the output of phonol-
ments. For example if the root has an X at the start ogy (Papillon, 2020). If these structures are al-
as in Fig. 21. This is obviously very ad hoc and lowed in Multiprecedence Phonology then Match-
powerful so minimally we should seek language- Extend will need to be amended or enhanced to
internal evidence for such a segment before jump- handle more varied structures.
ing to conclusions. In this paper I proposed a model that departs
from the previous ones in being framed as patch-
ing a path from the morphology-added links to-
# s l O g %
wards # and % from the inside-out, as opposed to
the existing models seeking to give a set of instruc-
Figure 20: Temiar sglog. tions to correctly traverse the graph from # to %
from beginning to the end.
# X k O w % References
John Alderete, Jill Beckman, Laura Benua, Amalia
Gnanadesikan, John McCarthy, and Suzanne Ur-
Figure 21: Semai kw-kOw with hidden asymme- banczyk. 1999. Reduplication with fixed segmen-
try in the form of a segment X without a phonetic tism. Linguistic inquiry, 30(3):327–364.
correlate, which breaks the symmetry.
Ellen I Broselow. 1983. Salish double reduplications:
subjacency in morphology. Natural Language &
Alternatively it could be that symmetrical forms Linguistic Theory, 1(3):317–346.
lead to both options being constructed and this op-
Charles B Chang. 2007. Accounting for the phonology
tionality is resolved in extra-grammatical ways. I
of reduplication. Presentation at LASSO XXXVI.
will leave this hole in the theory open, as a prob- Available on https://cbchang.com/
lem to be resolved through further research. curriculum-vitae/presentations/.
K. David Harrison and Eric Raimy. 2004. Reduplica-
tion in Tuvan: Exponence, readjustment and phonol-
ogy. In Proceedings of Workshop in Altaic Formal
Linguistics, volume 1. Citeseer.
William Idsardi and Rachel Shorey. 2007. Unwinding
morphology. In CUNY Phonology Forum Workshop
on Precedence Relations.
SD Khan. 2006. Similarity Avoidance in Bengali
Fixed-Segment Reduplication. Ph.D. thesis, Univer-
sity of California.
Roderick Ross Macdonald and Soenjono Darjowidjojo.
1967. A student’s reference grammar of modern for-
mal Indonesian. Georgetown University Press.
Daniel McClory and Eric Raimy. 2007. Enhanced
edges: morphological influence on linearization. In
Poster presented at The 81st Annual Meeting of the
Linguistics Society of America. Anaheim, CA.
Maxime Papillon. 2020. Precedence and the Lack
Thereof: Precedence-Relation-Oriented Phonology.
Ph.D. thesis. https://drum.lib.umd.edu/
handle/1903/26391.
Eric Raimy. 1999. Representing reduplication. Ph.D.
thesis, University of Delaware.
Eric Raimy. 2000a. The phonology and morphology of
reduplication, volume 52. Walter de Gruyter.
Eric Raimy. 2000b. Remarks on backcopying. Lin-
guistic Inquiry, 31(3):541–552.
Eric Raimy. 2007. Precedence theory, root and tem-
plate morphology, priming effects and the struc-
ture of the lexicon. CUNY Phonology Symposium
Precedence Conference, January.
Carl Rubino. 2005. Reduplication: Form, function and
distribution. Studies on reduplication, pages 11–29.
Bridget Samuels. 2009. The structure of phonological
theory. Harvard University.
Edward Sapir. 1922. Takelma. In Franz Boas, editor,
Handbook of American Indian Languages,. Smith-
sonian Institution. Bureau of American Ethnology.
Alan M. Stevens and A. Ed. Schmidgall-Tellings.
2004. A comprehensive Indonesian-English dictio-
nary. PT Mizan Publika.
Moira Yip. 1995. Repetition and its avoidance: The
case in Javanese.
Moira Yip. 1998. Identity avoidance in phonology
and morphology. In Morphology and its Relation to
Phonology and Syntax, pages 216–246. CSLI Publi-
cations.
Kie Zuraw. 2003. Vowel reduction in Palauan redu-
plicants. In Proceedings of the 8th Annual Meeting
of the Austronesian Formal Linguistics Association
[AFLA 8], pages 385–398.
Incorporating tone in the calculation of phonotactic probability
James P. Kirby
University of Edinburgh
School of Philosophy, Psychology, and Language Sciences
[email protected]
et al., 1953; Goldsmith, 2002; Pimentel et al., 2020). By treating tone as another phone in the segmental string, we can see whether and to what degree this choice has an effect on the overall entropy of the lexicon.

Intuitively, any model that can take into account phonotactic constraints will result in a reduction in entropy. Thus, even an n-gram model with a sufficiently large context window should in principle be able to model segment-tone co-occurrences at the syllable level. However, tone languages differ with respect to tone-segment co-occurrence restrictions (see Sec. 2). If a relevant constraint primarily targets syllable onsets, for instance, placing the tonal "segment" in immediate proximity to the onset will increase the probability of the string, even relative to a model capable of capturing the dependency at a longer distance.

2 Languages

Four syllable-tone languages were selected for this study: Mandarin Chinese, Cantonese, Vietnamese and Thai. They are partially a convenience sample in that the necessary lexical resources were readily available, but they also have some useful similarities: all share a similar syllable structure template and have five or six tones. However, the four languages vary in terms of their segment-tone co-occurrence restrictions, as detailed below.

In all cases, the lexicon was defined as the set of unique syllable shapes in each language. For consistency, the syllable template in all four languages is considered to be (C1)(C2)V(C)T, with variable positioning of T. Offglides were treated as codas in all languages. The syllable lexicons for all four languages are provided in the supplementary materials (http://doi.org/10.17605/OSF.IO/NA5FB).

Mandarin (cmn) The Mandarin syllabary consists of 1,226 syllables based on a list of attested readings of the 13,060 BIG5 characters from Tsai (2000), phonetized using the phonological system of Duanmu (2007). This representation encodes 22 onsets, 3 medials (/j ɥ w/), 6 nuclei, 4 codas and 5 tones (including the neutral tone). In Mandarin, unaspirated obstruent onsets rarely appear with mid-rising tone (MC yang ping), and sonorant onsets rarely occur with the high-level tone (MC yin ping). Obstruents never occur as codas.

Thai (tha) A Thai lexicon of 4,133 unique syllables was created based on the dictionary of Haas (1964), which contains around 19,000 entries and 47,000 syllables. The phonemic representation encodes 20 onsets, 3 medials /w l r/, 21 nuclei (vowel length being contrastive in Thai), 8 codas and 5 tones. In Thai, high tone is rare/unattested following unaspirated and voiced onsets, but there is also statistical evidence for a restriction on rising tones with these onsets (Perkins, 2013). In syllables with an obstruent coda (/p t k/), only high, low, or falling tones occur, depending on the length of the nuclear vowel (Morén and Zsiga, 2006).

Vietnamese (vie) The Vietnamese lexicon of 8,128 syllables was derived from a freely available dictionary of around 74,000 words (Đức, 2004), phonetized using a spelling pronunciation (Kirby, 2008). The resulting representation encodes 24 onsets, 1 medial (/w/), 14 nuclei, 8 codas and 6 tones. Vietnamese syllables ending in obstruents /p t k/ are restricted to just one of two tones.

Cantonese (yue) The Cantonese syllabary consists of the 1,884 unique syllables in the Chinese Character Database (Kwan et al., 2003), encoded using the jyutping system. This representation distinguishes 22 onsets, 1 medial (/w/), 11 nuclei, 5 codas and 6 tones. In Cantonese, unaspirated initials do not occur in syllables with low-falling tones, and aspirated initials do not occur with the low tone. Syllables ending with /p t k/ are restricted to one of the three "entering" tones (Yue-Hashimoto, 1972).

3 Methods

Two classes of character-level language models (LMs) were considered: simple n-gram models and recurrent neural networks (Mikolov et al., 2010). In an n-gram model, the probability of a string is proportional to the conditional probabilities of the component n-grams:

P(x_i | x_1^{i-1}) ≈ P(x_i | x_{i-n+1}^{i-1})    (1)

The degree of context taken into account is thus determined by the value chosen for n.
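[Editor's illustration] As a concrete picture of how a syllable is scored once tone is treated as just another phone, the following minimal Python sketch counts trigrams over padded syllable strings and scores one syllable. The add-one smoothing and the toy lexicon are purely illustrative stand-ins; the paper's actual models use SRILM with interpolated Witten-Bell discounting.

from collections import Counter
from math import log2

def train_trigram(lexicon):
    """Count trigrams and bigram contexts over padded syllable strings."""
    tri, bi, vocab = Counter(), Counter(), set()
    for syll in lexicon:                          # e.g. ["m", "a", "5"]
        seq = ["<s>", "<s>"] + syll + ["</s>"]
        vocab.update(seq)
        for i in range(2, len(seq)):
            tri[tuple(seq[i - 2:i + 1])] += 1
            bi[tuple(seq[i - 2:i])] += 1
    return tri, bi, len(vocab)

def logprob(syll, tri, bi, v):
    """Add-one smoothed log2 probability of one syllable string."""
    seq = ["<s>", "<s>"] + syll + ["</s>"]
    lp = 0.0
    for i in range(2, len(seq)):
        num = tri[tuple(seq[i - 2:i + 1])] + 1
        den = bi[tuple(seq[i - 2:i])] + v
        lp += log2(num / den)
    return lp

# Toy lexicon with tone ordered after the coda (T|C); symbols are illustrative.
lex_tc = [["m", "a", "5"], ["m", "a", "1"], ["p", "a", "1"]]
tri, bi, v = train_trigram(lex_tc)
print(logprob(["m", "a", "5"], tri, bi, v))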
In a recurrent neural network (RNN), the next character in a sequence is predicted using the current character and the previous hidden state. At each step t, the network retrieves an embedding for the current input x_t and combines it with the hidden layer from the previous step to compute a new hidden layer h_t:

h_t = g(U h_{t-1} + W x_t)    (2)

where W is the weight matrix for the current time step, U the weight matrix for the previous time step, and g is an appropriate non-linear activation function. This hidden layer h_t is then used to generate an output layer y_t, which is passed through a softmax layer to generate a probability distribution over the entire vocabulary. The probability of a sequence x_1, x_2 . . . x_z is then just the product of the probabilities of each character in the sequence:

P(x_1, x_2 . . . x_z) = ∏_{i=1}^{z} y_i    (3)

The incorporation of the recurrent connection as part of the hidden layer allows RNNs to avoid the problem of limited context inherent in n-gram models, because the hidden state embodies (some type of) information about all of the preceding characters in the string. Although RNNs cannot capture arbitrarily long-distance dependencies, this is unlikely to make a difference for the relatively short distances involved in phonotactic modeling.

Trigram models were built using the SRILM toolkit (Stolcke, 2002), with maximum likelihood estimates smoothed using interpolated Witten-Bell discounting (Witten and Bell, 1991). RNN LMs were built using PyTorch (Paszke et al., 2019), based on an implementation by Mayer and Nelson (2020). The results reported here make use of simple recurrent networks (Elman, 1990), but similar results were obtained using an LSTM layer (Hochreiter and Schmidhuber, 1997).

3.1 Procedure

The syllables in each lexicon were arranged in 5 distinct permutations: tone following the coda (T|C), nucleus (T|N), medial (T|M), onset (T|O), and with tone as the initial segment in the syllable (T|#). As many syllables in these languages lack onsets, medials, and/or codas, a sizable number of the resulting strings were identical across permutations. Both smoothed trigram and simple RNN LMs were then fit to each permuted lexicon 10 times, with random 80/20 train/dev splits (other splits produced similar results). For each run, the perplexity of the language model on the dev set D = x_1 x_2 . . . x_N (i.e., the exponentiated cross-entropy¹) was recorded:

PPL(D) = b^{H(D)}    (4)
       = b^{-(1/N) log_b P(x_1 x_2 ... x_N)}    (5)

¹ Equivalently, we may think of PPL(D) as the inverse probability of the set of syllables D, normalized for the number of phonemes.

4 Results

For brevity, only the main findings are summarized here; the full results are available as part of the online supplementary materials (http://doi.org/10.17605/OSF.IO/NA5FB). Table 1 shows the orderings which minimized perplexity for each method and language, averaged over 10 runs. Table 2 shows the average perplexity over all permutations for a given language and method.

method   lexicon   order   PPL
3-gram   cmn       T|C     4.91 (0.06)
         tha       T|M     7.34 (0.12)
         vie       T|C     7.35 (0.03)
         yue       T|M     5.84 (0.09)
RNN      cmn       T|M     4.01 (0.08)
         tha       T|M     5.20 (0.04)
         vie       T|M     5.16 (0.02)
         yue       T|#     4.37 (0.05)

Table 1: Orders which produced the lowest perplexities averaged over 10 runs (means and standard deviations).

Differences between orderings were then assessed visually, aided by simple analyses of variance. For the trigram LMs, perplexity was lowest in Mandarin when tones followed codas, while differences in perplexity between other orderings were negligible. For Thai, Vietnamese, and Cantonese, all orderings were roughly comparable except for when tone was ordered as the first segment in the syllable (T|#), which increased perplexity by up to 1 over the mean of the other orderings. For Thai, the ordering T|M resulted in significantly lower perplexities compared to all other permutations.
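[Editor's illustration] The perplexities reported in Tables 1 and 2 are of the form given in Eqs. (4)–(5): the exponentiated average negative log probability per symbol. A minimal sketch of that computation, assuming logprob2 is any function returning the log2 probability of one syllable (for instance a closure over the trigram counts in the earlier sketch); how end markers are counted in the normalization is our assumption, not a detail stated in the paper:

def perplexity(dev_syllables, logprob2):
    """PPL(D) with base b = 2, normalized by the number of symbols."""
    total_logprob = sum(logprob2(s) for s in dev_syllables)
    n_symbols = sum(len(s) + 1 for s in dev_syllables)   # +1 for the end marker (assumption)
    return 2 ** (-total_logprob / n_symbols)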
cmn tha vie yue
3-gram 5.15 (0.17) 7.76 (0.4) 7.49 (0.27) 5.98 (0.18)
RNN 4.01 (0.07) 5.28 (0.05) 5.18 (0.03) 4.42 (0.07)
Table 2: Mean and standard deviation of perplexity across all permutations by lexicon and language
model.
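[Editor's illustration] For concreteness, the five orderings compared in Tables 1 and 2 can be generated from a parsed syllable roughly as follows. The function name, field layout, and example syllable are ours, not the released lexicon format; note how empty slots make several permutations coincide, as discussed in Section 3.1.

def permutations(onset, medial, nucleus, coda, tone):
    """Return the five tone-position permutations of one syllable."""
    segs = lambda *parts: [s for s in parts if s]          # drop empty slots
    return {
        "T|#": segs(tone, onset, medial, nucleus, coda),   # tone first
        "T|O": segs(onset, tone, medial, nucleus, coda),   # after the onset
        "T|M": segs(onset, medial, tone, nucleus, coda),   # after the medial
        "T|N": segs(onset, medial, nucleus, tone, coda),   # after the nucleus
        "T|C": segs(onset, medial, nucleus, coda, tone),   # after the coda
    }

# A Thai-like syllable with onset k, medial w, nucleus aa, coda n, tone 3.
print(permutations("k", "w", "aa", "n", "3"))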
For the RNN LMs, although T|M was the numerically optimal ordering for three out of the four languages, in practical terms permutation had no effect on perplexity, with numerical differences of no greater than 0.1 (see Table 2).

5 Discussion

Consistent with other recent work in computational phonotactics (e.g. Mayer and Nelson, 2020; Mirea and Bicknell, 2019; Pimentel et al., 2020), the neural network models outperformed the trigram baselines by a considerable margin (a reduction in average perplexity of up to 2.5, depending on language). Neural network models were also much less sensitive to the linear position of tone relative to other elements in the segmental string (cf. Do and Lai, 2021b), no doubt due to the fact that the ability of the RNNs to model co-occurrence tendencies within the syllable is not constrained by context in the way that n-gram models are.

Perhaps as a result, however, the RNN models reveal little about the nature of segment-tone co-occurrence restrictions in any of the languages investigated. In this regard, the trigram models, while clearly less optimal in a global sense, are still informative. The fact that the ordering T|# was significantly worse under the trigram model for Cantonese, Vietnamese and Thai but not Mandarin can be explained (or predicted) by the fact that of the four languages, only Mandarin does not permit obstruent codas, and consequently has no coda-tone co-occurrence restrictions (indeed, the four primary tones of Mandarin occur with more or less equal type frequency). In the other three languages, syllables with obstruent codas can only bear a restricted set of tones, and in a trigram model, this dependency is not modeled when tone is prepended to the syllable, since this means it will frequently, though not always, fall outside the window visible to the language model. Even a model with a large enough context window to capture such dependencies will assign the lexicon a higher perplexity when structured in this way.

The finding that the T|M ordering is always optimal in Thai (and by a larger margin than in the other languages) is presumably due to the fact that the distribution of the medials /w l r/ is severely restricted in this language, occurring only after /p pʰ t tʰ k kʰ f/. The distribution of tones after onset-medial clusters is inherently more constrained and therefore more predictable. A similar restriction holds in Cantonese, albeit to a lesser degree (the medial /w/ only occurs with onsets /k/ and /kʰ/).

5.1 Shortcomings and extensions

This work did not explore representations based on phonological features, given that their incorporation has failed to provide evaluative improvements in other studies of computational phonotactics (Mayer and Nelson, 2020; Mirea and Bicknell, 2019; Pimentel et al., 2020). However, feature-based approaches can be both theoretically insightful and may even prove necessary for other quantifications, such as the measure of phonological distance where tone is involved (Do and Lai, 2021a).

The present study has focused on a small sample of structurally and typologically similar languages. All have relatively simple syllable structures in which one and only one tone is associated with each syllable. Not all tone languages share these properties, however. In so-called "word-tone" languages, such as Japanese or Shanghainese, the surface tone with which a given syllable is realized is frequently not lexically specified. In other languages, such as Yoloxóchitl Mixtec (DiCanio et al., 2014), tonal specification may be tied to sub-syllabic units, such as the mora. Finally, data from many other languages, such as Kukuya (Hyman, 1987), make it clear that
in at least some cases tones can only be treated in terms of abstract melodies, which do not have a consistent association to syllables, moras, or vowels (Goldsmith, 1976). In these and many other cases, careful consideration of the theoretical motivations justifying a particular representation is required before it makes sense to consider ordering effects.

However, to the extent that it is possible to generate a segmental representation of a tone language in which surface tones are indicated, what the present work suggests is that the precise ordering of the tonal symbols with respect to other symbols in the string is unlikely to have a significant impact on phonotactic probability. This follows from two assumptions (or constraints): first, that the set of symbols used to indicate tones is distinct from those used to indicate the vowels and consonants; and second, that one and only one such tone symbol appears per string domain (here, the syllable). If these two constraints hold, the complexity of the syllable template should in general have a greater impact on the entropy of the string set than the position of the tone symbol, although the number of unique tone symbols relative to the number of segmental symbols may also have an effect. According to Maddieson (2013) and Easterday (2019), languages with complex syllable structures (defined as those permitting fairly free combinations of two or more consonants in the position before a vowel, and/or two or more consonants in the position after the vowel) rarely have complex tone systems, or indeed tone systems at all, so this is unlikely to be an issue for most tone languages.

One possibility the present work did not address is whether it is even necessary, or desirable, to include tone in phonotactic probability calculations in the first place. The probability of the lexicon of a tonal language would surely change if tone is ignored, but whether listeners' judgments of a sequence as well- or ill-formed are better predicted by a model that takes tone into account vs. one that does not is an empirical question (but see Kirby and Yu, 2007; Do and Lai, 2021b for some evidence that it may not). Similarly, for research questions focused on tone sandhis, or on the distributions of the tonal sequences themselves (tonotactics), the relevant computations will be restricted to the tonal tier in the first instance, and ordering with respect to segments may simply not be relevant (but see Goldsmith and Riggle, 2012). Finally, the present study has focused on the lexical representation of tone, but in many languages tone primarily serves a morphological function. The SIGMORPHON 2020 Task 0 shared challenge (Vylomova et al., 2020) included inflection data from several tonal Oto-Manguean languages in which tone was orthographically encoded in different ways via string diacritics. While the authors noted the existence of these differences, it is unclear whether and to what extent the different representations of tones affected system performance. Similarly, the potential impact of tone ordering relative to other elements in the string has yet to be systematically investigated in this setting.

6 Conclusion

This paper has assessed how different permutations of tone and segments affect the perplexity of the lexicon in four syllable-tone languages using two types of phonotactic language models, an interpolated trigram model and a simple recurrent neural network. The perplexities assigned by the neural network models were essentially unaffected by different choices of ordering; while the trigram model was more sensitive to permutations of tone and segments, the effects on perplexity remained minimal. In addition to providing a baseline for future evaluation, these results suggest that the phonotactic probability of a syllable is relatively robust to the choice of how tone is ordered with respect to other elements in the string, especially when using a model capable of encoding dependencies across the entire syllable.

Acknowledgments

This work was supported in part by the ERC EVOTONE project (grant no. 758605).

References

Todd Bailey and Ulrike Hahn. 2001. Determinants of wordlikeness: phonotactics or lexical neighborhoods? Journal of Memory and Language, 44:568–591.

E. Colin Cherry, Morris Halle, and Roman Jakobson. 1953. Toward the logical description of
languages in their phonemic aspect. Language, 29(1):34–46.

Robert Daland and Janet B. Pierrehumbert. 2011. Learning diphone-based segmentation. Cognitive Science, 35(1):119–155.

Christian DiCanio, Jonathan D. Amith, and Rey Castillo García. 2014. The phonetics of moraic alignment in Yoloxóchitl Mixtec. In Proceedings of the 4th International Symposium on Tonal Aspects of Languages (TAL-2014), pages 203–210.

Youngah Do and Ryan Ka Yau Lai. 2021a. Accounting for lexical tones when modeling phonological distance. Language, 97(1):e39–e67.

Youngah Do and Ryan Ka Yau Lai. 2021b. Incorporating tone in the modelling of wordlikeness judgements. Phonology, 37:577–615.

San Duanmu. 2007. The phonology of standard Chinese, 2nd edition. Oxford University Press, Oxford.

Shelece Easterday. 2019. Highly complex syllable structure: A typological and diachronic study. Studies in Laboratory Phonology. Language Science Press.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

John Goldsmith. 1976. Autosegmental Phonology. Ph.D. thesis, MIT. [Published by Garland Press, New York, 1979.]

John Goldsmith. 2002. Phonology as information minimization. Phonological Studies, 5:21–46.

John Goldsmith and Jason Riggle. 2012. Information theoretic approaches to phonological structure: the case of Finnish vowel harmony. Natural Language and Linguistic Theory, 30:859–896.

Donald Shuxiao Gong. 2017. Grammaticality and lexical statistics in Chinese unnatural phonotactics. UCL Working Papers in Linguistics, 17:1–23.

Mary R. Haas. 1964. Thai-English student's dictionary. Stanford University Press, Stanford.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Larry M. Hyman. 1987. Prosodic domains in Kukuya. Natural Language & Linguistic Theory, 5(3):311–333.

James P. Kirby. 2008. vPhon: a Vietnamese phonetizer (version 2.1.1) [computer program]. https://github.com/kirbyj/vPhon.

James P. Kirby and Alan C. L. Yu. 2007. Lexical and phonotactic effects on wordlikeness judgments in Cantonese. In Proceedings of the 16th International Congress of Phonetic Sciences, pages 1389–1392, Saarbrücken.

Tze-Wan Kwan, Wai-Sang Tang, Tze-Ming Chiu, Lei-Yin Wong, Denise Wong, and Li Zhong. 2003. Chinese character database with word-formations phonologically disambiguated according to the Cantonese dialect. http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/. Accessed 9 February 2007.

Ian Maddieson. 2013. Tone. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology.

Connor Mayer and Max Nelson. 2020. Phonotactic learning with neural language models. Proceedings of the Society for Computation in Linguistics, 3:16.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. INTERSPEECH 2010, pages 1045–1048.

Nicole Mirea and Klinton Bicknell. 2019. Using LSTMs to assess the obligatoriness of phonological distinctive features for phonotactic learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1595–1605. Association for Computational Linguistics.

Bruce Morén and Elizabeth Zsiga. 2006. The lexical and post-lexical phonology of Thai tones. Natural Language and Linguistic Theory, 24(1):113–178.

James Myers and Jane Tsay. 2005. The processing of phonological acceptability judgements. In Proc. Symposium on 90-92 NSC Projects, Taipei.

Ellen Hamilton Newman, Twila Tardif, Jingyuan Huang, and Hua Shu. 2011. Phonemes matter: The role of phoneme-level awareness in emergent Chinese readers. Journal of Experimental Child Psychology, 108(2):242–259.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Jeremy Perkins. 2013. Consonant-tone interaction
in Thai. Ph.D. thesis, Rutgers, The State Uni-
versity of New Jersey.
Tiago Pimentel, Brian Roark, and Ryan Cotterell.
2020. Phonotactic complexity and its trade-offs.
Transactions of the Association for Computa-
tional Linguistics, 8:1–18.
Shabnam Shademan. 2006. Is phonotactic knowl-
edge grammatical knowledge? In Proceedings of
the 25th West Coast Conference on Formal Lin-
guistics, pages 371–379. Cascadilla Proceedings
Project.
Andreas Stolcke. 2002. SRILM – an extensible lan-
guage modeling toolkit. In Proc. Intl. Conf. on
Spoken Language Processing Vol. 2, pages 901–
904, Denver.
Holly L. Storkel and Su-Yeon Lee. 2011. The inde-
pendent effects of phonotactic probability and
neighbourhood density on lexical acquisition by
preschool children. Language and Cognitive Pro-
cesses, 26(2):191–211.
Chih-Hao Tsai. 2000. Mandarin syllable
frequency counts for Chinese characters.
http://technology.chtsai.org/syllable/. Ac-
cessed 10 March 2021.
Michael S. Vitevitch and Paul A. Luce. 1999. Prob-
abilistic phonotactics and neighborhood activa-
tion in spoken word recognition. Journal of
Memory and Language, 40:374–408.
Ekaterina Vylomova, Jennifer White, Eliza-
beth Salesky, Sabrina J. Mielke, Shijie Wu,
Edoardo Maria Ponti, Rowan Hall Maudslay,
Ran Zmigrod, Josef Valvoda, Svetlana Toldova,
et al. 2020. SIGMORPHON 2020 shared task
0: Typologically diverse morphological inflec-
tion. In Proceedings of the 17th SIGMORPHON
Workshop on Computational Research in Pho-
netics, Phonology, and Morphology, page 1–39.
Association for Computational Linguistics.
Ian H. Witten and Timothy C. Bell. 1991. The
zero-frequency problem: estimating the proba-
bilities of novel events in adaptive text compres-
sion. IEEE Transactions on Information The-
ory, 37(4):1085–1094.
Shiying Yang, Chelsea Sanker, and Uriel Co-
hen Priva. 2018. The organization of lexicons:
a cross-linguistic analysis of monosyllabic words.
In Proceedings of the Society for Computation
in Linguistics (SCiL) 2018, page 164–173.
Anne O. Yue-Hashimoto. 1972. Studies in Yue Di-
alects 1: Phonology of Cantonese. Cambridge
University Press.
Hồ Ngọc Đức. 2004. Vietnamese
word list. http://www.informatik.uni-
leipzig.de/∼duc/software/misc/wordlist.html.
Accessed 24 February 2021.
MorphyNet: a Large Multilingual Database
of Derivational and Inflectional Morphology
Figure 1: The MorphyNet generation process and the input datasets used.
ing MorphyNet data. Section 4 presents the resulting resource, and Section 5 evaluates it. Section 6 concludes the paper.

2 State of the Art

Ever since the early days of computational linguistics, morphological analysis and its related tasks—such as stemming and lemmatization—have been part of NLP systems. Earlier grammar-based systems used finite-state transducers or affix stripping techniques, and certain of them were already multilingual and were capable of tackling morphologically complex languages (Beesley and Karttunen, 2003; Trón et al., 2005; Inxight, 2005). However, due to the costliness of producing the grammar rules that drove them, many of these systems were only commercially available.

More recently, several projects have followed the approach of formalizing and/or integrating existing morphological data for multiple languages. UDer (Universal Derivations) (Kyjánek et al., 2020) integrates 27 derivational morphology resources in 20 languages. UniMorph (Kirov et al., 2016, 2018) and the Wikinflection Corpus (Metheniti and Neumann, 2020) rely mostly on Wiktionary, from which they extract inflectional information. Beyond the data source, however, the two last projects have little in common: UniMorph is by far more precise and complete and is used as a gold standard by the NLP community (Cotterell et al., 2017, 2018), recently covering 133 languages (McCarthy et al., 2020), while Wikinflection follows a naïve, linguistically uninformed approach of merely concatenating affixes, generating an abundance of ungrammatical word forms (e.g. for Hungarian or Finnish).

MorphyNet is also based on extracting morphological information from Wiktionary, extending the scope of UniMorph by new extraction rules and logic. The first version of MorphyNet covers 15 languages, and it is distinct from other resources in three aspects: (1) it includes both inflectional and derivational data; (2) it extracts a significantly higher number of inflections from Wiktionary; and (3) it provides a wider range of morphological information. While for the languages it covers MorphyNet can be considered a superset of UniMorph, the latter supports more languages. With UDer, as we show in Section 4, the overlap is minor on all languages. For these reasons, we consider MorphyNet as complementary to these databases, considerably enriching their coverage on the 15 supported languages but not replacing them.

3 MorphyNet Generation

MorphyNet is generated mainly from Wiktionary, through the following steps.

1. Filtering returns XML-based Wiktionary content from specific sections of relevant lexical entries: headword lines, etymology sections, and inflectional tables are returned for nouns, verbs, and adjectives.

2. Extraction obtains raw morphological data by parsing the sections above.

3. Enrichment algorithmically extends the coverage of derivations and inflections obtained from Wiktionary, through entirely distinct methods for inflection and derivation.

4. Resource generation, finally, outputs MorphyNet data.

Below we explain the non-trivial Wiktionary extraction and enrichment steps, while Section 4 provides details on the generated resource itself.
3.1 Wiktionary Extraction

We extract inflectional and derivational data through hand-crafted extraction rules that target recurrent patterns in Wiktionary content, both in source markdown and in HTML-rendered form. With respect to UniMorph, which takes a similar approach and scrapes tables that provide inflectional paradigms, the scope of extraction is considerably extended, also including headword lines and etymology sections. This allows us to obtain new derivations, inflections, and features not covered by UniMorph, such as gender information or noun and adjective declensions for Catalan, French, Italian, Spanish, Russian, English, or Serbo-Croatian. Our rules target nouns, adjectives, and verbs in all languages covered.

Inflection extraction rules target two types of Wiktionary content: inflectional tables and headword lines. Inflectional tables provide conjugation and declension paradigms for a subset of verbs, nouns, and adjectives in Wiktionary. On tables, our extraction method was similar to that of UniMorph as described in (Kirov et al., 2016, 2018), with one major difference. UniMorph also extracted a large number of separate entries with modifier and auxiliary words, such as Spanish negative imperatives (no comas, no coma, no comamos etc.) or Finnish negative indicatives (en puhu, et puhu, eivät puhu etc.). MorphyNet, on the other hand, has a single entry for each distinct word form, regardless of the modifier word used. This policy had a particular impact on the size of the Finnish vocabulary.

As inflectional tables are only provided by Wiktionary for 62.5%³ of nouns, verbs, and adjectives, we extended the scope of extraction to headword lines, such as

    banca f (plural banche)

From this headword line, we extract two entries: one for banca as feminine singular and a second for banche as feminine plural. We created specific parsing rules for nouns, verbs, and adjectives because each part of speech is described through a different set of morphological features. For example, valency (transitive or reflexive) and aspect (perfective or imperfective) are essential for verbs, while gender (masculine or feminine) and number (singular or plural) pertain to nouns and adjectives.

³ Computed over the 15 languages covered by MorphyNet.

Derivation extraction rules were applied to the etymology sections of Wiktionary entries to collect the Morphology template usages, such as for the English accusation:

    Equivalent to accuse + -ation.

where we have a morphology entry {{suffix|en|accuse|-ation}} from the Wiktionary XML dump. After collecting all morphology entries, we applied the enrichment method to increase its coverage.

Figure 2: Derivation enrichment example: inference of the derivation of the Portuguese word acusação.

3.2 Derivation Enrichment

Derivation enrichment is based on a linguistically informed cross-lingual generalization of derivational patterns observed in Wiktionary data, in order to extend the coverage of derivational data.

In the example shown in Figure 2, Wiktionary contains the Portuguese derivation competir (to compete) → competição (competition) but not acusar (to accuse) → acusação (accusation). An indiscriminate application of the suffix -ção to all verbs would, of course, generate lots of false positives, such as chegar (to arrive) ↛ *chegação. Even when the target word does exist, the inferred derivation is often false, as in the case of corar (to blush) ↛ coração (heart). A counter-example from English could be jewel + -ery → jewellery but gal + -ery ↛ gallery.

For this reason, we use stronger cross-lingual derivational evidence to induce the applicability of the affix. In the example above, the existence of the English derivation accuse → accusation, where the meanings of the English and the corresponding Portuguese words are the same, serves as a strong hint for the applicability of the Portuguese pattern.
Table 1: Structure of MorphyNet inflectional data and its comparison to UniMorph. Data provided only by MorphyNet is highlighted in bold. The rest is provided by both resources in a nearly identical format.

Table 2: Structure of MorphyNet derivational data and its comparison to UDer. Data only provided by MorphyNet is highlighted in bold. The rest is provided by both resources in a nearly identical format.
This intuition is formalized in MorphyNet as follows: if in language A a derivation from source word w_A^s to target word w_A^t through the affix a_A is not explicitly asserted (e.g. by Wiktionary) but it is asserted for the corresponding cognates in at least one language B, then we infer its existence:

cog(w_A^s, w_B^s) ∧ cog(w_A^t, w_B^t) ∧ cog(a_A, a_B) ∧ der(w_B^s, a_B) = w_B^t ⇒ der(w_A^s, a_A) = w_A^t

where cog(x, y) means that the words x and y are cognates and der(b, a) = d that word d is derived from base word b and affix a. In our example, A = Portuguese, B = English, w_A^s = acusar, w_B^s = accuse, w_A^t = acusação, w_B^t = accusation, a_A = -ção, and a_B = -tion.

As shown in Figure 1, we exploited a cognate database, CogNet⁴ (Batsuren et al., 2019, 2021), that has 8.1M cognate pairs, for evidence on cognacy: cog(w_A, w_B) = True is asserted by the presence of the word pair in CogNet.

⁴ http://github.com/kbatsuren/CogNet

The result of enrichment was a total increase of 25.6% in the number of derivations in MorphyNet. Efficiency varies among languages, essentially depending on the completeness of the Wiktionary coverage: it was the lowest for English with 3% and the highest for Spanish with 57%.

3.3 Inflection Enrichment

The enrichment of inflectional data is based on the simple observation that Wiktionary does not provide the root word for all inflected word forms. For example, for the Hungarian múltjával (with his/her/its past), Wiktionary provides the inflection múltja → múltjával (his/her/its past + instrumental). For múltja, in turn, it provides múlt → múltja (past + possessive). It does not, however, directly provide the combination of the two inflections: múlt → múltjával (past + possessive + instrumental). Inflection enrichment consists of inferring such missing rules from the existing data.

The case above is formalized as follows: if, after the Wiktionary extraction phase, the MorphyNet data contains the inflections w_r → w_1 (with feature set F1) as well as w_1 → w_2 (with feature set F2), then we create the new inflection w_r → w_2 with feature set F1 ∪ F2.

The application of this logic increased the inflectional coverage of MorphyNet by 10.8% and its recall (with respect to ground truth data presented in Section 5) by 8.2% on average.

4 The MorphyNet Resource

MorphyNet is freely available for download, both as text files containing the data and as the source code of the Wiktionary extractor.⁵ Two text files are provided per language: one for inflections and one for derivations. The structure of the two types of files is illustrated in Tables 1 and 2, respectively. As shown, MorphyNet covers all data fields provided by UniMorph for inflections and by UDer for derivations. In addition, it extends UniMorph by indicating the affix and the immediate source word that produced the inflection. Such information is useful, for example, to NLP applications that rely on subword information for understanding out-of-vocabulary words.

⁵ http://github.com/kbatsuren/WiktConv
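[Editor's illustration] The cross-lingual derivation-enrichment rule of Section 3.2 can be read as a simple join over derivation and cognate tables. A minimal Python sketch of that reading (the data structures and names below are ours, not the released extractor code, and the toy data only mirrors the acusar/accuse example):

# Known derivations in language B and cognate links; illustrative stand-ins
# for the Wiktionary and CogNet data used by MorphyNet.
derivations_B = {("accuse", "-tion"): "accusation"}        # der(w_B^s, a_B) = w_B^t
cognates = {("acusar", "accuse"), ("acusação", "accusation")}
cognate_affixes = {("-ção", "-tion")}

def infer_derivation(src_A, tgt_A, affix_A):
    """True if src_A -(affix_A)-> tgt_A is supported by a cognate derivation."""
    for (src_B, affix_B), tgt_B in derivations_B.items():
        if ((src_A, src_B) in cognates
                and (tgt_A, tgt_B) in cognates
                and (affix_A, affix_B) in cognate_affixes):
            return True
    return False

print(infer_derivation("acusar", "acusação", "-ção"))      # True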
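[Editor's illustration] The inflection-enrichment rule of Section 3.3 is likewise a simple composition: whenever w_r → w_1 and w_1 → w_2 are known, add w_r → w_2 with the union of the two feature sets. A sketch under those assumptions (the feature labels are illustrative, not exact UniMorph tags):

inflections = {
    ("múlt", "múltja"): {"POSS"},          # past -> his/her/its past
    ("múltja", "múltjával"): {"INS"},      # ... -> with his/her/its past
}

def enrich(inflections):
    """Add w_r -> w_2 with features F1 ∪ F2 whenever w_r -> w_1 and w_1 -> w_2 exist."""
    new = {}
    for (root, mid), f1 in inflections.items():
        for (mid2, target), f2 in inflections.items():
            if mid == mid2 and (root, target) not in inflections:
                new[(root, target)] = f1 | f2
    return new

print(enrich(inflections))   # {('múlt', 'múltjával'): {'POSS', 'INS'}}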
Table 3: MorphyNet dataset statistics
MorphyNet also extends the UDer structure by indicating the affix and the semantic category for the target word when it can be inferred from the morpheme. Such information is again useful for subword regularization of derivationally rich languages, such as English.

Table 4 provides per-language statistics on MorphyNet data. The present version of the resource contains 10.6 million entries, of which 95% are inflections. Highly inflecting and agglutinative languages dominate the resource, as 55% of all entries belong to Finnish, Hungarian, Russian, and Serbo-Croatian. Language coverage above all depends on the completeness of Wiktionary, the main source of our data.

Table 4: UniMorph and MorphyNet data sizes compared to Universal Dependencies content.

Language          UniMorph    MorphyNet    Univ. Dep.
Catalan               81,576      168,462       25,443
Czech                134,528      298,888      151,838
English              115,523      652,487       17,296
French               367,733      453,229       28,921
Finnish            2,490,377    1,617,751       47,813
Hungarian            552,950    1,034,317        3,685
Italian              509,575      748,321       24,002
Serbo-Croatian       840,799    1,760,095       35,936
Spanish              382,955      677,423       32,571
Swedish               78,411      131,693       15,030
Russian              473,482    1,343,760       18,774
Total              5,893,381    8,886,426      401,309

5 Evaluation

We evaluated MorphyNet through two different methods: (1) through comparison to ground truth and (2) through manual validation by experts.

Comparison to ground truth. The quality evaluation of a morphology database is a challenging task, due to the many weird morphological aspects of the languages evaluated (Gorman et al., 2019). As ground truth on inflections we used the Universal Dependencies⁶ dataset (Nivre et al., 2016, 2017), which (among others) provides morphological analysis of inflected words over a multilingual corpus of hand-annotated sentences. McCarthy et al. (2018) built a Python tool⁷ to convert these treebanks into the UniMorph schema (Sylak-Glassman, 2016). We evaluated both UniMorph 2.0 and MorphyNet against this data (performing the necessary mapping of feature tags beforehand) over the 11 languages in the intersection of the two resources: Hungarian (Vincze et al., 2010), Catalan, Spanish (Taulé et al., 2008), Czech (Bejček et al., 2013), Finnish (Pyysalo et al., 2015), Russian (Lyashevskaya et al., 2016), Serbo-Croatian (De Melo, 2014), French (Guillaume et al., 2019), Italian (Bosco et al., 2013), Swedish (Nivre and Megyesi, 2007), and English (Silveira et al., 2014). Table 5 contains evaluation results over nouns, verbs, and adjectives separately, as well as totals per language. Missing data points (e.g. for Catalan nouns) indicate that UniMorph did not have any corresponding inflections. For languages and parts of speech where both resources provide data, MorphyNet always provides higher recall. The exception is Finnish, because of our policy of not extracting conjugations with auxiliary and modifier words as separate entries (see Section 3.1).

⁶ https://universaldependencies.org/
⁷ https://github.com/unimorph/ud-compatibility
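[Editor's illustration] The comparison against ground truth reduces to scoring one set of (lemma, form, features) triples against another. A minimal sketch of that scoring (the triple format and the feature strings are illustrative assumptions, not the exact evaluation script):

def precision_recall(resource, gold):
    """Precision and recall of a resource against gold (lemma, form, features) triples."""
    resource, gold = set(resource), set(gold)
    tp = len(resource & gold)
    precision = tp / len(resource) if resource else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("múlt", "múltjával", "N;INS;SG")}
ours = {("múlt", "múltjával", "N;INS;SG"), ("múlt", "múltnak", "N;DAT;SG")}
print(precision_recall(ours, gold))   # (0.5, 1.0)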
Table 5: Inflectional morphology evaluation of MorphyNet against UniMorph on Universal Dependencies
Overall, as seen from Table 4, MorphyNet contains about 47% more entries over the 11 languages where it overlaps with UniMorph. In terms of precision, the two resources are comparable, except for Finnish (adjectives) and Swedish (adjectives and verbs), where MorphyNet appears to be significantly more precise.

UDer (Kyjánek et al., 2020) is a collection of individual monolingual resources of derivational morphology. Most of them have been carefully evaluated against their own datasets and offer high quality. We evaluated MorphyNet derivational data against UDer over the nine languages covered by both resources: French (Hathout and Namer, 2014), Portuguese (de Paiva et al., 2014), Czech (Vidra et al., 2019), German (Zeller et al., 2013), Russian (Vodolazsky, 2020), Italian (Talamo et al., 2016), Finnish (Lindén and Carlson, 2010; Lindén et al., 2012), Latin (Litta et al., 2016), and English (Habash and Dorr, 2003). Statistics and results are shown in Table 6. First of all, the overlap between MorphyNet and UDer is small, which is visible from our recall values relative to UDer, which vary between 0.6% (Czech) and 59.5% (Italian). Among the languages evaluated, six were better covered by MorphyNet and the remaining three (Czech, German, and Russian) by UDer. The agreement between the two resources, computed as Cohen's Kappa, was 0.85 overall, varying between 0.74 (Finnish) and 0.97 (Portuguese). If we consider UDer as the gold standard, we obtain precision figures between 87% and 99%.

Manual evaluation was carried out by language experts over sample data from five languages: English, Italian, French, Hungarian, and Mongolian. The sample consisted of 1,000 randomly selected entries per language, half of them inflectional and the other half derivational. The experts were asked to validate the correctness of source–target word pairs and of morphemes, as well as of inflectional features and parts of speech (the latter for derivations). Table 7 shows detailed results. The overall precision is 98.9%, with per-language values varying between 98.2% (Hungarian) and 99.5% (English). The good results are proof both of the high quality of Wiktionary data and of the general correctness of the data extraction and enrichment logic of MorphyNet. A manual check of the incorrect entries revealed that most of them were due to failures of the extraction rules caused by occasional deviations in Wiktionary from its own conventions.

6 Conclusions and Future Work

We consider the resource released and described here as an initial work-in-progress version that we plan to extend and improve. We are currently
Table 6: Derivational morphology evaluation of MorphyNet against Universal Derivations (UDer)
#  Language  MorphyNet  UDer resource  UDer size  UDer ∩ MorphyNet  Recall  Precision  Kappa
1 French 37,203 Démonette 13,272 2,558 18.5 95.5 0.91
2 Portuguese 15,974 NomLex-PT 3,420 1,235 35.8 98.9 0.97
3 Czech 9,660 Derinet 804,011 5,347 0.6 94.1 0.88
4 German 23,867 DerivBase 35,528 5,878 15.6 93.5 0.87
5 Russian 36,922 DerivBase.RU 118,374 6,370 12.3 88.1 0.76
6 Italian 42,149 DerIvaTario 1,548 958 59.5 90.7 0.81
7 Finnish 37,199 FinnWordnet 8,337 2,664 30.6 87.0 0.74
8 Latin 9,191 WFL 2,792 4,037 14.0 93.7 0.87
9 English 200,365 CatVar 16,185 7,397 45.7 91.9 0.83
Total 412,530 1,003,467 36,444 25.8 92.6 0.85
the Association for Computational Linguistics, 5:135–146.

Cristina Bosco, Montemagni Simonetta, and Simi Maria. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69. The Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Gerard De Melo. 2014. Etymological wordnet: Tracing the history of words. In LREC, pages 1148–1154. Citeseer.

Valeria de Paiva, Livy Real, Alexandre Rademaker, and Gerard de Melo. 2014. NomLex-PT: A lexicon of Portuguese nominalizations. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2851–2858.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, and Anna Korhonen. 2018. Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics, 6:451–465.

Fausto Giunchiglia, Khuyagbaatar Batsuren, and Gabor Bella. 2017. Understanding and exploiting language diversity. In IJCAI, pages 4009–4017.

Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, and Magdalena Markowska. 2019. Weird inflects but OK: Making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 140–151.

Bruno Guillaume, Marie-Catherine de Marneffe, and Guy Perrier. 2019. Conversion et améliorations de corpus du français annotés en Universal Dependencies. Traitement Automatique des Langues, 60(2):71–95.

Nizar Habash and Bonnie Dorr. 2003. CatVar: A database of categorial variations for English. In Proceedings of the MT Summit, pages 471–474. Citeseer.

Nabil Hathout and Fiammetta Namer. 2014. Démonette, a French derivational morpho-semantic network. In Linguistic Issues in Language Technology, Volume 11, 2014 – Theoretical and Computational Morphology: New Trends and Synergies.

Inxight. 2005. LinguistX natural language processing platform.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, et al. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Christo Kirov, John Sylak-Glassman, Roger Que, and David Yarowsky. 2016. Very-large scale parsing and normalization of Wiktionary morphological paradigms. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3121–3126.
Lukáš Kyjánek, Zdeněk Žabokrtský, Magda Ševčíková, and Jonáš Vidra. 2020. Universal Derivations 1.0, a growing collection of harmonised word-formation resources. The Prague Bulletin of Mathematical Linguistics, (115):5–30.

Krister Lindén and Lauri Carlson. 2010. FinnWordNet – Finnish WordNet by translation. LexicoNordica – Nordic Journal of Lexicography, 17:119–140.

Krister Lindén, Jyrki Niemi, and Mirka Hyvärinen. 2012. Extending and updating the Finnish WordNet. In Shall We Play the Festschrift Game?, pages 67–98. Springer.

Eleonora Litta, Marco Passarotti, and Chris Culy. 2016. Formatio formosa est. Building a word formation lexicon for Latin. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), pages 185–189.

Olga Lyashevskaya, Kira Droganova, Daniel Zeman, Maria Alexeeva, Tatiana Gavrilova, Nina Mustafina, Elena Shakurova, et al. 2016. Universal Dependencies for Russian: A new syntactic dependencies tagset. Higher School of Economics Research Paper No. WP BRP 44.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2020. UniMorph 3.0: Universal morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931.

Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2018. Marrying universal dependencies and universal morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 91–101.

Eleni Metheniti and Günter Neumann. 2020. Wikinflection corpus: A (better) multilingual, morpheme-annotated inflectional corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3905–3912.

George A. Miller. 1998. WordNet: An electronic lexical database. MIT Press.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, et al. 2017. Universal Dependencies 2.1.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Joakim Nivre and Beata Megyesi. 2007. Bootstrapping a Swedish treebank using cross-corpus harmonization and annotation projection. In Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories, pages 97–102. Citeseer.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, and Toms Miks. 2017. Neural machine translation for morphologically rich languages with improved sub-word units and synthetic data. In International Conference on Text, Speech, and Dialogue, pages 237–245. Springer.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892.

Sampo Pyysalo, Jenna Kanerva, Anna Missilä, Veronika Laippala, and Filip Ginter. 2015. Universal Dependencies for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (Nodalida 2015), pages 163–172.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014.
A gold standard dependency corpus for English.
In Proceedings of the Ninth International Con-
ference on Language Resources and Evaluation
(LREC-2014).
A Study of Morphological Robustness of Neural Machine Translation
we aim to identify adversarial examples that lead to maximum degradation in the translation quality. We build upon the recently proposed MORPHEUS toolkit (Tan et al., 2020), which evaluated the robustness of NMT systems translating from English→X. For a given source English text, MORPHEUS works by greedily looking for inflectional perturbations, sequentially iterating through the tokens in the input text. For each token, it identifies inflectional edits that lead to the maximum drop in BLEU score.

We extend this approach to test X→English translation systems. Since their toolkit² is limited to perturbations in English only, in this work we develop our own inflectional methodology that relies on UniMorph (McCarthy et al., 2020).

² https://github.com/salesforce/morpheus

2.1 Reinflection

The UniMorph project³ provides morphological data for numerous languages under a universal schema. The project supports over 100 languages and provides morphological inflection dictionaries for up to three part-of-speech tags: nouns (N), adjectives (ADJ) and verbs (V). While some UniMorph dictionaries include a large number of types (or paradigms) (German (≈15k), Russian (≈28k)), many dictionaries are relatively small (Turkish (≈3.5k), Estonian (<1k)). This puts a limit on the number of tokens we can perturb via UniMorph dictionary look-up. To overcome this limitation, we use the unimorph_inflect toolkit⁴ that takes as input the lemma and the morphosyntactic description (MSD) and returns a reinflected word form. This tool was trained using UniMorph dictionaries and generalizes to unseen types. An illustration of our inflectional perturbation methodology is described in Table 1.

³ https://unimorph.github.io/
⁴ https://github.com/antonisa/unimorph_inflect

2.2 MORPHEUS-MULTILINGUAL

Given an input sentence, our proposed method, MORPHEUS-MULTILINGUAL, identifies adversarial inflectional perturbations to the input tokens that lead to maximum degradation in the performance of the machine translation system. We first iterate through the sentence to extract all possible inflectional forms for each of the constituent tokens. Since we are relying on UniMorph dictionaries, we are limited to perturbing only nouns, adjectives and verbs.⁵ Now, to construct a perturbed sentence, we iterate through each token and uniformly sample one inflectional form from the candidate inflections. We repeat this process N (=50) times and compile our pool of perturbed sentences.⁶

⁵ Some dictionaries might contain fewer POS tags; for example, in German we are restricted to just nouns and verbs.
⁶ N is a hyperparameter, and in our preliminary experiments, we find N = 50 to be sufficiently high to generate many uniquely perturbed sentences and also keep the process computationally tractable.

To identify the adversarial sentence, we compute the chrF score (Popović, 2017) using the sacrebleu toolkit (Post, 2018) and select the sentence that results in the maximum drop in chrF score (if any). In our preliminary experiments, we found chrF to be more reliable than BLEU (Papineni et al., 2002) for identifying adversarial candidates. While BLEU uses word n-grams to compare the translation output with the reference, chrF uses character n-grams instead, which helps with matching morphological variants of words.

The original MORPHEUS toolkit follows a slightly different algorithm to identify adversaries. Similar to our approach, they first extract all possible inflectional forms for each of the constituent tokens. Then, they sequentially iterate through the tokens in the sentence, and for each token, they select an inflectional form that results in the worst BLEU score. Once an adversarial form is identified, they directly replace the form in the original sentence and continue to the next token. While a similar approach is possible in our setup, we found their algorithm to be computationally expensive as it prevents efficient batching.

It is important to note that neither MORPHEUS-MULTILINGUAL nor the original MORPHEUS exhaustively searches over all possible sentences, due to memory and time constraints. However, our approach in MORPHEUS-MULTILINGUAL can be efficiently implemented and reduces the inference time by almost a factor of 20. We experiment on 11 different language pairs; therefore, the run time and computational costs are critical to our experiments.

3 Experiments

In this section, we present a comprehensive evaluation of the robustness of X→English machine translation systems. Since it is natural for NMT models to be more robust when trained on large amounts of parallel data, we experiment with two
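[Editor's illustration] The selection loop described in Section 2.2 (sample N perturbed sentences, keep the one with the largest chrF drop) can be sketched as follows. The helpers inflect_candidates and translate are placeholders for the UniMorph-based reinflector and the MT system under test; only the sacrebleu call reflects an actual library API.

import random
import sacrebleu

def find_adversary(tokens, reference, inflect_candidates, translate, n_samples=50):
    """Return the sampled perturbation with the lowest chrF against the reference."""
    baseline = sacrebleu.sentence_chrf(translate(" ".join(tokens)), [reference]).score
    worst_sent, worst_score = None, baseline
    for _ in range(n_samples):
        perturbed = [random.choice(inflect_candidates(tok) or [tok]) for tok in tokens]
        score = sacrebleu.sentence_chrf(translate(" ".join(perturbed)), [reference]).score
        if score < worst_score:                    # maximum drop in chrF
            worst_sent, worst_score = perturbed, score
    return (worst_sent or tokens), worst_score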
Table 1 (fragment): German example sentence with part-of-speech tags and morphological glosses, from the illustration of the inflectional perturbation methodology.

POS:    PRON         VERB                PART      PUNCT   ADV    NOUN             VERB        AUX
Text:   Sie          wissen              nicht     ,       wann   Räuber           kommen      können
Gloss:  you-NOM.3PL  knowledge-PRS.3PL   not-NEG   ,       when   robber-NOM.PL    come-NFIN   can-PRS.3PL
lg    Family       Resource   TTR
heb   Semitic      High       (0.044, 0.191)
rus   Slavic       High       (0.080, 0.107)
tur   Turkic       High       (0.016, 0.048)
deu   Germanic     High       (0.210, 0.321)
ukr   Slavic       High       (0.103, 0.143)
ces   Slavic       High       (0.071, 0.082)
swe   Germanic     Medium     (0.156, 0.281)
lit   Baltic       Medium     (0.051, 0.084)
slv   Slavic       Low        (0.109, 0.087)
kat   Kartvelian   Low        (0.057, ——)
est   Uralic       Low        (0.026, 0.056)

Table 3: List of languages chosen from the multilingual TED corpus. For each language, the table presents the language family, resource level, and Type-Token Ratio (TTRlg). We measure the ratio using the types and tokens present in the reinflection dictionaries (UniMorph, lexicon from TED dev).

Figure 1: Perturbation statistics for selected TED languages.

In preparing our adversarial set, we retain the original source sentence if we fail to create any perturbation or if none of the identified perturbations lead to a drop in chrF score. This is to make sure the adversarial set has the same number of sentences as the original validation set. In Table 4, we present the baseline and adversarial MT results. We notice a considerable drop in performance for Hebrew, Russian, Turkish and Georgian. As expected, the % drops are correlated with the perturbation statistics from Figure 1.

3.3 Translating Learners' Text

In the previous sections (§3.1, §3.2), we have seen the impact of noisy inputs on MT systems. While these results indicate a need for improving the robustness of MT systems, the above-constructed adversarial sets are however synthetic. In this section, we evaluate the impact of morphological inflection related errors directly on learners' text.

Figure 2: Schematic for preliminary evaluation on learners' language text. This is similar to the methodology used in Anastasopoulos (2019).

To this end, we utilize two grammatical error correction (GEC) datasets: German Falko-MERLIN-GEC (Boyd, 2018) and Russian RULEC-GEC (Rozovskaya and Roth, 2019). Both of these datasets contain labeled error types relating to word morphology. Evaluating the robustness on these datasets will give us a better understanding of the performance on actual text produced by second language (L2) speakers.

Unfortunately, we don't have gold English translations for the grammatically incorrect (or corrected) text from GEC datasets. While there is related prior work (Anastasopoulos et al., 2019) that annotated Spanish translations for English GEC data, we are not aware of any prior work that provides gold English translations for grammatically incorrect data in non-English languages. Therefore, we propose a pseudo-evaluation methodology that allows for measuring the robustness of MT systems. A schematic overview of our methodology is presented in Figure 2. We take the ungrammatical text and use the gold GEC annotations to correct all errors except for the morphology related errors. We now have ungrammatical text that only contains morphology related errors, and it is similar to the perturbed outputs from MORPHEUS-MULTILINGUAL. Since we don't have gold translations for the input Russian/German sentences, we use the machine translation output of the fully grammatical text as reference and the translation output of the partially-corrected text as hypothesis. In Table 5, we present the results on both Russian and German learners' text.

Overall, we find that the pre-trained MT models from fairseq are quite robust to noise in learners' text. We manually inspected some examples, and found the MT systems to be sufficiently robust to morphological perturbations; changes in the output translation (if any) are mostly warranted.
X→English    Code   # train   Baseline BLEU   Baseline chrF   Adversarial BLEU   Adversarial chrF   NR
Hebrew heb 211K 40.06 0.5898 33.94 (-15%) 0.5354 (-9%) 1.56
Russian rus 208K 25.64 0.4784 11.70 (-54%) 0.3475 (-27%) 1.03
Turkish tur 182K 27.77 0.5006 18.90 (-32%) 0.4087 (-18%) 1.43
German deu 168K 34.15 0.5606 31.29 (-8%) 0.5373 (-4%) 1.82
Ukrainian ukr 108K 25.83 0.4726 25.66 (-1%) 0.4702 (-1%) 2.96
Czech ces 103K 29.35 0.5147 26.58 (-9%) 0.4889 (-5%) 2.11
Swedish swe 56K 36.93 0.5654 36.84 (-0%) 0.5646 (-0%) 3.48
Lithuanian lit 41K 18.88 0.3959 18.82 (-0%) 0.3948 (-0%) 3.42
Slovenian slv 19K 11.53 0.3259 10.48 (-9%) 0.3100 (-5%) 3.23
Georgian kat 13K 5.83 0.2462 4.92 (-16%) 0.2146 (-13%) 2.49
Estonian est 10K 6.68 0.2606 6.53 (-2%) 0.2546 (-2%) 4.72
Dataset        f-BLEU   f-chrF
Russian GEC     85.77    91.56
German GEC      89.60    93.95

Table 5: Translation results on Russian and German GEC corpora. An oracle (aka fully robust) MT system would give a perfect score. We adopt the faux-BLEU terminology from Anastasopoulos (2019). f-BLEU is identical to BLEU, except that it is computed against a pseudo-reference instead of the true reference.

Viewing these results in combination with the results on the TED corpus, we believe that X→English systems are robust to morphological perturbations at the source as long as they are trained on a sufficiently large parallel corpus.

4 Analysis

To better understand what makes a given MT system robust to morphology-related grammatical perturbations in the source, we present a thorough analysis of our results and also highlight a few limitations of our adversarial methodology.

Adversarial Dimensions: To quantify the impact of each inflectional perturbation, we perform a fine-grained analysis on the adversarial sentences obtained from the multilingual TED corpus. For each perturbed token in the adversarial sentence, we identify the part-of-speech (POS) and the feature dimension(s) (dim) perturbed in the token. We uniformly distribute the % drop in sentence-level chrF score over each (POS, dim) perturbation in the adversarial sentence. This allows us to quantitatively compare the impact of each perturbation type (POS, dim) on the overall performance of the MT model. Additionally, as seen in Figure 1, not all inflectional perturbations cause a drop in chrF (or BLEU) scores. The adversarial sentences only capture the worst-case drop in chrF. Therefore, to analyze the overall impact of each perturbation (POS, dim), we also compute the impact score on the entire set of perturbed sentences explored by MORPHEUS-MULTILINGUAL.

Table 8 (in the Appendix) presents the results for all the TED languages. First, the trends for adversarial perturbations are quite similar to those for all explored perturbations. This indicates that the adversarial impact of a perturbation is not determined by just the perturbation type (POS, dim) but is lexically dependent.

Evaluation Metrics: In the results presented in §3, we reported the performance using the BLEU and chrF metrics (following prior work (Tan et al., 2020)). We noticed significant drops on these metrics, even for high-resource languages like Russian, Turkish and Hebrew, including the state-of-the-art fairseq models. To better understand these drops, we inspected the output translations of adversarial source sentences. We found a number of cases where the new translation is semantically valid but both metrics incorrectly score it low (see S2 in Table 6). This is a limitation of using surface-level metrics like BLEU/chrF.

Additionally, we require the new translation to be as close as possible to the original translation, but this can be a strict requirement on many occasions.
source from its singular to plural form, it is natural to expect a robust translation system to reflect that change in the output translation. To account for this behavior, we compute the Target-Source Noise Ratio (NR) metric from Anastasopoulos (2019). NR is computed as follows:

    NR(s, t, s̃, t̃) = (100 − BLEU(t, t̃)) / (100 − BLEU(s, s̃))    (2)

The ideal NR is ∼1, where a change in the source (s → s̃) results in a proportional change in the target (t → t̃). For the adversarial experiments on the TED corpus, we compute the NR metric for each language pair; the results are presented in Table 4. Interestingly, while Russian sees a major drop in BLEU/chrF score, its noise ratio is close to 1. This indicates that the Russian MT model is actually quite robust to morphological perturbations. Furthermore, in Figure 3, we present a correlation analysis between the size of the parallel corpus available for training and the noise ratio metric. We see a very strong negative correlation, indicating that high-resource MT systems (e.g., heb, rus, tur) are quite robust to inflectional perturbations, in spite of the large drops in BLEU/chrF scores. Additionally, we noticed that the morphological richness of the source language (measured via TTR in Table 3) doesn't play any significant role in MT performance under adversarial settings (e.g., rus, tur vs. deu). The scatter plot between TTR and NR for the TED translation task is presented in Figure 4.

Figure 3: Correlation between Noise Ratio (NR) and # train. The results indicate that the larger the training data, the more robust the models are towards source perturbations (NR ≈ 1).

Figure 4: Correlation between Target-Source Noise Ratio (NR) on TED machine translation and Type-Token Ratio (TTRlg) of the source language (from UniMorph). The results indicate that the morphological richness of the source language doesn't necessarily correlate with NMT robustness.

Morphological Richness: To analyze the impact of the morphological richness of the source, we look deeper into the Slavic language family. We experimented with four languages within the Slavic family: Czech, Ukrainian, Russian, and Slovenian. All except Slovenian are high-resource. These languages differ significantly in their morphological richness (TTR), with TTRces < TTRslv << TTRrus << TTRukr.9 As we have already seen in the above analysis (see Figure 4), morphological richness isn't indicative of the noise ratio (NR), and this behavior also holds for the Slavic languages. We then check whether morphological richness determines the drop in BLEU/chrF scores. In fact, we find that this is also not the case: we see a larger % drop for rus as compared to slv or ukr. We instead notice that the % drop in BLEU/chrF depends on the % edits we make to the validation set. The % edits we were able to make follows the order δrus >> δces > δslv >> δukr (see Figure 1). Thus, NR is driven by the size of the training set, and the % drop in BLEU is driven by the % edits to the validation set. The % edits in turn depend on the size of the UniMorph dictionaries and not on the morphological richness of the language. Therefore, we conclude that both metrics, the % drop in BLEU/chrF and NR, depend on resource size (parallel data and UniMorph dictionaries) and not on the morphological richness of the language.

Semantic Change: In our adversarial attacks, we aim to create an ungrammatical source via inflectional edits and evaluate the robustness of systems to these edits. While these adversarial attacks can help us discover any significant biases in the transla-

9 TTRlg measured on lexicons from TED dev splits.
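For concreteness, the NR metric in Equation 2 above can be computed directly from corpus-level BLEU scores. The following is a rough sketch, not the authors' implementation, assuming sacrebleu and caller-supplied lists of original and perturbed sources and their translations.

    # Sketch of the Target-Source Noise Ratio (NR) from Equation 2
    # (assumption: sacrebleu is available; the four lists are aligned sentence lists).
    import sacrebleu

    def noise_ratio(src, src_perturbed, tgt, tgt_perturbed):
        """NR = (100 - BLEU(t, t~)) / (100 - BLEU(s, s~)); values near 1 indicate
        that source-side edits lead to proportional target-side changes."""
        bleu_t = sacrebleu.corpus_bleu(tgt_perturbed, [tgt]).score
        bleu_s = sacrebleu.corpus_bleu(src_perturbed, [src]).score
        return (100.0 - bleu_t) / (100.0 - bleu_s)

Here BLEU(s, s̃) is obtained by treating the original sources as references for the perturbed sources, mirroring how the metric is defined above.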
Figure 5: Elasticity score for TED languages.

Figure 6: Boxplots for the distribution of # edits per sentence in the adversarial TED validation set.
S1 Source (s) Тренер полностью поддержал игрока.
T1 Target (t) The coach fully supported the player.
rus
A-S1 Source (s̃) Тренера полностью поддержал игрок.
A-T1 Target (t̃) The coach was fully supported by the player.
S2 Source (s) Dinosaurier benutzte Tarnung, um seinen Feinden auszuweichen
T2 Target (t) Dinosaur used camouflage to evade its enemies (1.000)
deu
A-S2 Source (s̃) Dinosaurier benutze Tarnung, um seinen Feindes auszuweichen
A-T2 Target (t̃) Dinosaur Use camouflage to dodge his enemy (0.512)
S3 Source (s) У нас вообще телесные наказания не редкость.
T3 Target (t) In general, corporal punishment is not uncommon. (0.885)
rus
A-S3 Source (s̃) У нас вообще телесных наказании не редкостях.
A-T3 Target (t̃) We don’t have corporal punishment at all. (0.405)
S4 Source (s) Вот телесные наказания - спасибо, не надо.
T4 Target (t) That’s corporal punishment - thank you, you don’t have to. (0.458)
rus
A-S4 Source (s̃) Вот телесных наказаний - спасибах, не надо.
A-T4 Target (t̃) That’s why I’m here. (0.047)
S5 Source (s) Die Schießereien haben nicht aufgehört.
T5 Target (t) The shootings have not stopped. (0.852)
deu
A-S5 Source (s̃) Die Schießereien habe nicht aufgehört.
A-T5 Target (t̃) The shootings did not stop, he said. (0.513)
S6 Source (s) Всякое бывает.
T6 Target (t) Anything happens. (0.587)
rus
A-S6 Source (s̃) Всякое будете бывать.
A-T6 Target (t̃) You’ll be everywhere. (0.037)
S7 Source (s)
T7 Target (t) It ’s a real school. (0.821)
kat
A-S7 Source (s̃)
A-T7 Target (t̃) There ’s a man who ’s friend. (0.107)
S8 Source (s) Ning meie laste tuleviku varastamine saab ühel päeval kuriteoks.
T8 Target (t) And our children ’s going to be the future of our own day. (0.446)
est
A-S8 Source (s̃) Ning meie laptegs tuleviku varastamine saab ühel päeval kuriteoks.
A-T8 Target (t̃) And our future is about the future of the future. (0.227)
S9 Source (s) Nad pagevad üle piiride nagu see.
T9 Target (t) They like that overdights like this. (0.318)
est
A-S9 Source (s̃) Nad pagevad üle piirete nagu see.
A-T9 Target (t̃) They dress it as well as well. (0.141)
S10 Source (s) Мой дедушка был необычайным человеком того времени.
T10 Target (t) My grandfather was an extraordinary man at that time. (0.802)
rus
A-S10 Source (s̃) Мой дедушка будё необычайна человеков того времи.
A-T10 Target (t̃) My grandfather is incredibly harmful. (0.335)
Table 6: Qualitative analysis. (1) semantic change, (2) issues with evaluation metrics, (3,4,5,6,7,10) good examples
for attacks, (8) poor attacks, (9) poor translation quality (s → t)
We evaluate NMT models trained on the TED corpus as well as pretrained models readily available as part of the fairseq library. We observe a wide range of 0–50% drops in performance under the adversarial setting. We further supplement our experiments with an analysis on GEC learner corpora for Russian and German. We qualitatively and quantitatively analyze the perturbations created by our methodology and present its strengths as well as limitations, outlining some avenues for future research towards building more robust NMT systems.

References

Md Mahfuz Ibn Alam and Antonios Anastasopoulos. 2020. Fine-tuning MT systems for robustness to second-language speaker variations. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 149–158, Online. Association for Computational Linguistics.

Antonios Anastasopoulos. 2019. An analysis of source-side grammatical errors in NMT. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 213–223, Florence, Italy. Association for Computational Linguistics.

Antonios Anastasopoulos, Alison Lui, Toan Q. Nguyen, and David Chiang. 2019. Neural machine translation of text from non-native speakers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3070–3080, Minneapolis, Minnesota. Association for Computational Linguistics.

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 984–996, Hong Kong, China. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.

Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.

A Appendix

A.1 UniMorph Example

An example from the German UniMorph dictionary is presented in Table 7.

Paradigm             Form                      MSD
abspielen ('play')   abgespielt ('played')     V.PTCP;PST
abspielen ('play')   abspielend ('playing')    V.PTCP;PRS
abspielen ('play')   abspielen ('play')        V;NFIN

Table 7: Example inflections for the German verb abspielen ('play') from the UniMorph dictionary.

A.2 MT training

For all the languages in the TED corpus, we train Any→English models using the fairseq toolkit. Specifically, we use the 'transformer_iwslt_de_en' architecture and train the model using the Adam optimizer. We use an inverse square root learning rate scheduler with 4000 warm-up update steps. In the linear warm-up phase, we use an initial learning rate of 1e-7 until a configured rate of 2e-4. We use the cross entropy criterion with label smoothing of 0.1.
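The hyperparameters in A.2 map onto standard fairseq-train options. The following is a hypothetical invocation consistent with the description above (not necessarily the authors' exact command); the data-bin path is a placeholder.

    # Sketch of a fairseq-train call matching the settings in A.2
    # (assumptions: fairseq is installed; "data-bin/ted-x-en" is a hypothetical
    # path to a preprocessed/binarized TED bitext).
    import subprocess

    subprocess.run([
        "fairseq-train", "data-bin/ted-x-en",      # hypothetical preprocessed data
        "--arch", "transformer_iwslt_de_en",
        "--optimizer", "adam",
        "--lr", "2e-4",                            # configured peak learning rate
        "--lr-scheduler", "inverse_sqrt",
        "--warmup-updates", "4000",
        "--warmup-init-lr", "1e-7",
        "--criterion", "label_smoothed_cross_entropy",
        "--label-smoothing", "0.1",
    ], check=True)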
Dimension ces deu est heb kat lit rus slv swe tur ukr
ADJ.Animacy - - - - - - 3.51(0.89) - - - -
ADJ.Case 4.31(0.81) - - - 10.67(2.59) - 4.78(0.91) - 5.05(5.05) - 6.04(1.10)
ADJ.Comparison - - - - - 7.99(0.46) - - - - -
ADJ.Gender 3.83(0.78) - - - - 6.81(-1.35) 5.30(1.00) - - - -
ADJ.Number 4.07(0.78) - - - 13.90(1.52) 6.31(-2.26) 4.67(0.94) - 5.05(5.05) 7.92(2.23) 6.25(1.29)
ADJ.Person - - - - - - - - - 8.89(2.43) -
N.Animacy - - - - - - 6.53(1.19) - - - -
N.Case 6.94(0.81) 6.39(1.26) 12.35(1.50) - 15.38(0.98) - 6.65(1.20) - 4.29(1.05) 14.39(2.37) 10.28(7.66)
N.Definiteness - - - - - - - - 8.36(1.61) - -
N.Number 5.44(0.77) 5.70(1.27) 8.10(1.33) 16.22(5.92) 14.46(0.66) - 6.12(1.22) - 4.30(1.52) 13.08(2.31) 21.20(15.96)
N.Possession - - - 12.63(4.31) - - - - - - -
V.Aspect - - - - 14.17(-0.38) - - - - - -
V.Gender - - - - - - 6.52(1.51) - - - -
V.Mood 13.17(2.78) 15.89(2.77) - - 11.11(0.58) - - 21.49(3.73) - - -
V.Number 8.23(2.72) 32.86(8.12) - 13.78(4.60) 9.02(1.33) - 6.23(1.44) 21.47(-9.47) - - -
V.Person 6.58(2.69) 6.22(1.50) - 10.86(4.99) 12.37(1.33) - 6.10(1.29) - - - -
V.Tense - - - 17.52(7.13) 13.09(1.05) - 6.59(1.61) - - - -
V.CVB.Tense - - - - - - 6.70(0.87) - - - 9.09(2.62)
V.MSDR.Aspect - - - - 14.39(4.68) - - - - - -
V.PTCP.Gender 10.28(2.75) - - - - - - - - - -
V.PTCP.Number 9.31(2.51) - - - - - - - - - -
Table 8: Fine-grained analysis of X→English translation performance w.r.t. the perturbation type (POS, morphological feature dimension). The numbers reported in this table indicate the average % drop in sentence-level chrF for an adversarial perturbation on a token with the given POS on the dimension (dim). The numbers in parentheses indicate the average % drop for all the tested perturbations, including the adversarial perturbations.
Sample-efficient Linguistic Generalizations through Program Synthesis:
Experiments with Phonology Problems

Saujas Vaduguru1   Aalok Sathe2   Monojit Choudhury3   Dipti Misra Sharma1
1 IIIT Hyderabad   2 MIT BCS∗   3 Microsoft Research India
1 {saujas.vaduguru@research.,dipti@}iiit.ac.in
2 aalok.sathe@{mit.edu, richmond.edu}
3 [email protected]

∗ Work done while at the University of Richmond
Abstract

Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.

1 Introduction

In the last few years, the application of deep neural models has allowed rapid progress in NLP. Tasks in phonology and morphology have been no exception to this, with neural encoder-decoder models achieving strong results in recent shared tasks in phonology (Gorman et al., 2020) and morphology (Vylomova et al., 2020). However, the neural models that perform well on these tasks make use of hundreds, if not thousands, of training examples for each language. Additionally, the patterns that neural models identify are not interpretable. In this paper, we explore the problem of learning interpretable phonological and morphological rules from only a small number of examples, a task that humans are able to perform.

to V        to be Ved
mappasuN    dipasuN
mattunu     ditunu
?           ditimbe
?           dipande

Table 1: Verb forms in Mandar (McCoy, 2018)

Consider the example of verb forms in the language Mandar presented in Table 1. How would a neural model tasked with filling the two blank cells do? The data comes from a language that is not represented in large-scale text datasets that could allow the model to harness pretraining, and the number of samples presented here is likely not sufficient for the neural model to learn the task.

However, a human would fare much better at this task even if they didn't know Mandar. Identifying rules and patterns in a different language is a principal concern of a descriptive linguist (Brown and Ogilvie, 2010). Even people who aren't trained in linguistics would be able to solve such a task, as evidenced by contestants in the Linguistics Olympiads1, and general-audience puzzle books (Bellos, 2020). In addition to being able to solve the task, humans would be able to express their solution explicitly in terms of rules, that is to say, a program that maps inputs to outputs.

1 https://www.ioling.org/

Program synthesis (Gulwani et al., 2017) is a method that can be used to learn programs that map an input to an output in a domain-specific language (DSL). It has been shown to be a highly sample-efficient technique to learn interpretable rules by specifying the assumptions of the task in the DSL (Gulwani, 2011).

This raises the questions (i) Can program synthesis be used to learn linguistic rules from only a few examples? (ii) If so, what kind of rules can be learnt? (iii) What kind of operations need to explicitly be defined in the DSL to allow it to model linguistic rules? (iv) What knowledge must be
implicitly provided with these operations to allow the model to choose rules that generalize well?

In this work, we use program synthesis to learn phonological rules for solving Linguistics Olympiad problems, where only the minimal number of examples necessary to generalize are given (Şahin et al., 2020). We present a program synthesis model and a DSL for learning phonological rules, and curate a set of Linguistics Olympiad problems for evaluation.

We perform experiments and comparisons to baselines, and find that program synthesis does significantly better than our baseline approaches. We also present some observations about the ability of our system to find rules that generalize well, and discuss examples of where it fails.

2 Program synthesis

Program synthesis is "the task of automatically finding programs from the underlying programming language that satisfy (user) intent expressed in some form of constraints" (Gulwani et al., 2017). This method allows us to specify domain-specific assumptions as a language, and use generic synthesis approaches like FlashMeta (Polozov and Gulwani, 2015) to synthesize programs.

The ability to explicitly encode domain-specific assumptions gives program synthesis broad applicability to various tasks. In this paper, we explore applying it to the task of learning phonological rules. Whereas previous work on rule-learning has focused on learning rules of a specific type (Brill, 1992; Johnson, 1984), the DSL in program synthesis allows learning rules of different types, and in different rule formalisms.

In this work, we explore learning rules similar to rewrite rules (Chomsky and Halle, 1968) that are used extensively to describe phonology. Sequences of rules are learnt using a noisy disjunctive synthesis algorithm, NDSyn (Iyer et al., 2019), extended to learn stateful multi-pass rules (Sarthi et al., 2021).

2.1 Phonological rules as programs

The synthesis task we solve is to learn a program in a domain-specific language (DSL) for string transduction, that is, to transform a given sequence of input tokens i ∈ I∗ to a sequence of output tokens o ∈ O∗, where I is the set of input tokens, and O is the set of output tokens. Each token is a symbol accompanied by a feature set, a set of key-value pairs that maps feature names to boolean values.

We learn programs for token-level examples, which transform an input token in its context to output tokens. The program is a sequence of rules which are applied to each token in an input string to produce the output string. The rules learnt are similar to rewrite rules, of the form

    φ−l · · · φ−2 φ−1 X φ1 φ2 · · · φr → T

where (i) X : I → B is a boolean predicate that determines the input tokens to which the rule is applied, (ii) φi : I → B is a boolean predicate applied to the ith character relative to X, and the predicates φ collectively determine the context in which the rule is applied, and (iii) T : I → O∗ is a function that maps an input token to a sequence of output tokens. X and φ belong to a set of predicates P, and T is a function belonging to a set of transformation functions T. P and T are specified by the DSL.

We allow the model to synthesize programs that apply multiple rules to a single token by synthesizing rules in passes and maintaining state from one pass to the next. This allows the system to learn stateful multi-pass rules (Sarthi et al., 2021).

2.2 Domain-specific language

The domain-specific language (DSL) is the declarative language which defines the allowable string transformation operations. The DSL is defined by a set of operators, a grammar which determines how they can be combined, and a semantics which determines what each operator does. By defining operators to capture domain-specific phenomena, we can reduce the space of programs to be searched to include those programs that capture distinctions relevant to the domain. This allows us to explicitly encode knowledge of the domain into the system.

Operators in the DSL also have a score associated with them that allows for setting domain-specific preferences for certain kinds of programs. We can combine the scores for each operator in a program to compute a ranking score that we can use to identify the most preferred program among candidates. The ranking score can capture implicit preferences like shorter programs, more/less general programs, certain classes of transformations, etc.

The DSL defines the predicates P and the set of transformations T that can be applied to a particular token. The predicates and transformations in the DSL we use, along with the description of their semantics, can be found in Tables 2 and 3.
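As a concrete, hypothetical reading of the rule format in §2.1, a context-sensitive rule can be represented as a predicate on the focus token, offset predicates for the context, and a transformation. The sketch below is an assumed encoding for illustration, not the paper's DSL implementation; the devoicing example and the token representation are invented for the demonstration.

    # Sketch: a rewrite rule  phi_{-l} ... phi_{-1}  X  phi_1 ... phi_r  ->  T
    # encoded as (X, {offset: phi}, T). Tokens are (symbol, feature-dict) pairs.
    def applies(rule, word, i):
        X, context, _ = rule
        if not X(word[i]):
            return False
        for offset, phi in context.items():
            j = i + offset
            if not (0 <= j < len(word)) or not phi(word[j]):
                return False
        return True

    def apply_rule(rule, word):
        _, _, T = rule
        out = []
        for i, token in enumerate(word):
            out.extend(T(token) if applies(rule, word, i) else [token[0]])
        return out

    # Toy example: word-final devoicing, roughly  d -> t / __ #
    is_d = lambda tok: tok[0] == "d"
    is_boundary = lambda tok: tok[0] == "#"
    devoice = lambda tok: ["t"]
    rule = (is_d, {1: is_boundary}, devoice)
    word = [("h", {}), ("u", {}), ("n", {}), ("d", {}), ("#", {})]
    print("".join(apply_rule(rule, word)))  # prints "hunt#"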
Predicate
IsToken(w, s, i) Is x equal to the token s? This allows us to evaluate matches with specific
tokens.
Is(w, f, i) Is f true for x? This allows us to generalize beyond single tokens and use
features that apply to multiple tokens.
TransformationApplied(w, t, i) Has the transformation t has been applied to x in a previous pass? This
allows us to reference previous passes in learning rules for the current pass.
Not(p) Negates the predicate p.
Table 2: Predicates that are used for synthesis. The predicates are applied to a token x that is at an offset i from
the current token in the word w. The offset may be positive to refer to tokens after the current token, zero to refer
to the current token, or negative to refer to tokens before the current token.
Transformation
ReplaceBy(x, s1 , s2 ) If x is s1 , it is replaced with s2 . This allows the system to learn conditional
substitutions.
ReplaceAnyBy(x, s) x is replaced with s. This allows the system to learn unconditional substitutions.
Insert(x, S) This inserts a sequence of tokens S after x at the end of the pass. It allows for the
insertion of variable-length affixes.
Delete(x) This deletes x from the word at the end of the pass.
CopyReplace(x, i) These are analogues of the ReplaceBy and Insert transformations where the
CopyInsert(x, i) token which is added is the same as the token at an offset i from x. They allow
the system to learn phonological transformations such as assimilation and
gemination.
Identity(x) This returns x unchanged. It allows the system where a transformation applies
under certain conditions, but does not under others.
Table 3: Transformations that are used for synthesis. The transformations are applied to a token x in the word w.
The offset i for the Copy transformations may be positive to refer to tokens after the current token, zero to refer to
the current token, or negative to refer to tokens before the current token.
tures based only on the symbols in the input, more complex features based on meaning and linguistic categories can be provided to a system that works on learning rules for more complex domains like morphology or syntax. We leave this investigation for future work.

output := Map(disjunction, input_tokens)
disjunction := Else(rule, disjunction)
rule := transformation
      | IfThen(predicate, rule);

Figure 1: IfThen-Else statements in the DSL
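To illustrate the grammar in Figure 1, the sketch below mimics its semantics in Python: a program is a disjunction of rules tried in order, and Map applies it to every token. The predicate and transformation names echo Tables 2 and 3, but this mini-interpreter is an illustrative assumption, not the system's actual code.

    # Illustrative interpreter for the Figure 1 grammar (assumption: a token is a
    # plain string; the helper predicates/transformations below are toy versions).
    def if_then(predicate, transformation):
        # IfThen: fire the transformation only when the predicate holds, else None.
        return lambda word, i: transformation(word, i) if predicate(word, i) else None

    def else_(rule, fallback):
        # Else: try `rule` first; fall back to the rest of the disjunction otherwise.
        def run(word, i):
            out = rule(word, i)
            return out if out is not None else fallback(word, i)
        return run

    def map_program(program, word):
        # Map: apply the disjunction to every input token and concatenate the outputs.
        return [t for i in range(len(word)) for t in program(word, i)]

    def is_token(s, offset):
        return lambda word, i: 0 <= i + offset < len(word) and word[i + offset] == s

    def replace_by(s1, s2):
        return lambda word, i: [s2] if word[i] == s1 else [word[i]]

    identity = lambda word, i: [word[i]]

    # Rule in the spirit of Figure 3: IfThen(IsToken(w, "a", -1), ReplaceBy(x, "b", "d")).
    program = else_(if_then(is_token("a", -1), replace_by("b", "d")), identity)
    print("".join(map_program(program, list("abc"))))  # prints "adc"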
Figure 2: An illustration of the synthesis algorithm. FM is FlashMeta, which synthesizes rules which are com-
bined into a disjunction of rules by NDSyn. Here, rule #1 is chosen over #4 since it uses the more general concept
of the voice feature as opposed to a specific token, and thus has a higher ranking score.
transformations that are language-agnostic. Figure 2 sketches the working of the algorithm.

The NDSyn algorithm is an algorithm for learning disjunctions of rules, of the form shown in Figure 1. Given a set of examples, it first generates a set of candidate rules using the FlashMeta synthesis algorithm (Polozov and Gulwani, 2015). This algorithm searches for a program in the DSL that satisfies a set of examples by recursively breaking down the search problem into smaller sub-problems. Given an operator, and the input-output constraints it must satisfy, it infers constraints on each of the arguments to the operator, allowing it to recursively search for programs that satisfy these constraints on each of the arguments. For example, given the Is predicate and a set of examples where the predicate is true or false, the algorithm infers constraints on the arguments (the token s and offset i) such that the set of examples is satisfied. The working of FlashMeta is illustrated with an example in Figure 3. We use the implementation of the FlashMeta algorithm available as part of the PROSE2 framework.

From the set of candidate rules, NDSyn selects a subset of rules with a high ranking score that correctly answers the most examples as well as incorrectly answers the least.3 Additional details about the algorithm are provided in Appendix A.

The synthesis of multi-pass rules proceeds in passes. In each pass, a set of token-aligned examples is provided as input to the NDSyn algorithm. The resulting rules are then applied to all the examples, and those that are not solved are passed as the set of examples to NDSyn in the next pass. This proceeds until all the examples are solved, or for a maximum number of passes.

3 Dataset

To test the ability of our program synthesis system to learn linguistic rules from only a few examples, we require a task with a small number of training examples, and a number of test examples which measure how well the model generalises to unseen data. Additionally, to ensure a fair evaluation, the test examples should be chosen such that the samples in the training data provide sufficient evidence to correctly solve the test examples.

To this end, we use problems from the Linguistics Olympiad. The Linguistics Olympiad is an umbrella term describing contests for high school students across the globe. Students are tasked with solving linguistics problems—a genre of composition that presents linguistic facts and phenomena in enigmatic form (Derzhanski and Payne, 2010). These problems typically have 2 parts: the data and the assignments.

The data consists of examples where the solver is presented with the application of rules to some linguistic forms (words, phrases, sentences) and the forms derived by applying the rules to these forms. The data typically consists of 20-50 forms; the minimal number of examples required to infer the correct rules is presented (Şahin et al., 2020).

The assignments provide other linguistic forms, and the solver is tasked with applying the rules inferred from the data to these forms. The forms in the assignments are carefully selected by the

2 https://www.microsoft.com/en-us/research/group/prose/
3 A rule will not produce any answer to examples that don't satisfy the context constraints of the rule.
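The rule-selection step described above (choose rules that correctly answer the most examples while incorrectly answering the fewest) behaves like a greedy set cover. The following simplified sketch is an assumption about that step, not the actual NDSyn implementation; evaluate(rule, example) is a hypothetical helper returning "correct", "incorrect", or None when the rule does not fire, and ranking_score is the DSL-derived score used to break ties.

    # Greedy selection of a disjunction of rules from FlashMeta candidates
    # (simplified assumption about NDSyn's selection step).
    def select_rules(candidates, examples, evaluate, ranking_score):
        selected = []
        uncovered = set(range(len(examples)))
        while uncovered:
            best, best_key, best_covered = None, None, set()
            for rule in candidates:
                covered = {i for i in uncovered if evaluate(rule, examples[i]) == "correct"}
                wrong = sum(evaluate(rule, examples[i]) == "incorrect" for i in uncovered)
                key = (len(covered) - wrong, ranking_score(rule))
                if covered and (best_key is None or key > best_key):
                    best, best_key, best_covered = rule, key, covered
            if best is None:
                break  # remaining examples are left for the next synthesis pass
            selected.append(best)
            uncovered -= best_covered
        return selected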
Specification abc → d, decomposed by inverse semantics into candidate rules such as IfThen(IsToken(w, "a", -1), ReplaceBy(x, "b", "d")), IfThen(IsToken(w, "b", 0), ReplaceAnyBy(x, "d")), and IfThen(IsToken(w, "c", 1), ReplaceAnyBy(x, "d")), built from sub-specifications for the predicate, the rule (ReplaceBy or ReplaceAnyBy), and the token/offset arguments of IsToken.
Figure 3: An illustration of the search performed by the FlashMeta algorithm. The blue boxes show the spec-
ification that an operator must satisfy in terms of input-output examples, with the input token underlined in the
context of the word. The Inverse Semantics of an operator is a function that is used to infer the specification for
each argument of the operator based on the semantics of the operator. This may be a single specification (as for
predicate) or a disjunction of specifications (as for token and offset). The algorithm then recursively searches for
programs to satisfy the specification for each argument, and combines the results of the search to obtain a program.
The search for the rule in an IfThen statement proceeds similarly to the search for a predicate. Examples of pro-
grams that are inferred from a specification are indicated with =⇒ . A dashed line between inferred specifications
indicates that the specifications are inferred jointly.
designer to test whether the solver has correctly form of a language and the corresponding phono-
inferred the rules, including making generalizations logical form (Table 4c) (4) marking the phonolog-
to unseen data. This allows us to see how much of ical stress on a given word (Table 4d). We refer
the intended solution has been learnt by the solver to each of these categories of problems as mor-
by examining responses to the assignments. phophonology, multilingual, transliteration, and
stress respectively. We further describe the dataset
The small number of training examples (data)
in Appendix B4 .
tests the generalization ability and sample effi-
ciency of the system, and presents a challenging 3.1 Structure of the problems
learning problem for the system. The careful se-
lection of test examples (assignment) lets us use Each problem is presented in the form of a matrix
them to measure how well the model learns these M . Each row of the matrix contains data pertaining
generalizations. to a single word/linguistic form, and each column
contains the same form of different words, i.e.,
We present a dataset of 34 linguistics problems, an inflectional or derivational paradigm, the word
collected from various publicly accessible sources. form in a particular language, the word in a partic-
These problems are based on phonology, and some ular script, or the stress values for each phoneme in
aspects of the morphology of languages, as well a word. A test sample in this case is presented as a
as the orthographic properties of languages. These particular cell Mij in the table that has to be filled.
problems are chosen such that the underlying rules The model has to use the data from other words in
depend only on the given word forms, and not the same row (Mi: ) and the words in the column
on inherent properties of the word like grammat- (M:j ) to predict the form of the word in Mij .
ical gender or animacy. The problems involve In addition to the data in the table, each prob-
(1) inferring phonological rules in morphological lem contains some additional information about the
inflection (Table 4a) (2) inferring phonological symbols used to represent the words. This addi-
changes between multiple related languages (Ta-
ble 4b) (3) converting between the orthographic 4
The dataset is available here.
(a) Movima negation            (b) Turkish and Tatar    (c) Micmac orthography          (d) Aleut stress
base form     negative form     Turkish     Tatar        Listuguj     Pronunciation      Aleut        Stress
joy           kas joya:ya'      bandIr      mandIr       g'p'ta'q     g@b@da:x           tatul        01000
bi:law        kas bika'law      yelken      cilkän       epsaqtejg    epsaxteck          n@tG@lqin    000010000
tipoysu:da    ?                 ?           osta         emtoqwatg    ?                  sawat        ?
?             kas wurula:la'    bilezik     ?            ?            @mtesk@m           qalpuqal     00001000

Table 4: A few examples from different types of Linguistics Olympiad problems. '?' represents a cell in the table that is part of the test set.
tional information is meant to aid the solver under- train a model for each pair of columns in a problem.
stand the meaning of a symbol they may not have For each test example Mij , we find the column with
seen before. We manually encode this information the smallest index j 0 such that Mij 0 is non-empty
in the feature set associated with each token for and use Mij 0 as the source string to infer Mij .
synthesis. Where applicable, we also add conso- Additional details of baselines are provided in
nant/vowel distinctions in the given features, since Appendix C.
this is a basic distinction assumed in the solutions
to many Olympiad problems. 4.2 Program synthesis experiments
We use the assignments that accompany every As discussed in Section 3.1, the examples in a prob-
problem as the test set, ensuring that the correct lem are in a matrix, and we synthesize programs
answer can be inferred based on the given data. to transform entries in one column to entries in
another. Given a problem matrix M , we refer to
3.2 Dataset statistics a program to transform an entry in column i to
The dataset we present is highly multilingual. The an entry in column j as M:i → M:j . To obtain
34 problems contain samples from 38 languages, token-level examples, we use the Smith-Waterman
drawn from across 19 language families. There alignment algorithm (Smith et al., 1981), which
are 15 morphophonology problems, 7 multilingual favours contiguous sequences in aligned strings.
problems, 6 stress, and 6 transliteration problems. We train three variants of our synthesis system
The set contains 1452 training words with an aver- with different scores for the Is and IsToken op-
age of 43 words per problem, and 319 test words erators. The first one, N O F EATURE, does not use
with an average of 9 per problem. Each problem features, or the Is predicate. The second one, T O -
has a matrix that has between 7 and 43 rows, with KEN, assigns a higher score to IsToken and prefers
an average of 23. The number of columns ranges more specific rules that reference tokens. The third
from 2 to 6, with most problems having 2. one, F EATURE, assigns a higher score to Is and
prefers more general rules that reference features
4 Experiments instead of tokens. All other aspects of the model
remain the same across variants.
4.1 Baselines
Morphophonology and multilingual problems:
Given that we model our task as string transduc- For every pair of columns (s, t) in the problem
tion, we compare with the following transduction matrix M , we synthesize the program M:s → M:t .
models used as baselines in shared tasks on G2P To predict the form of a test sample Mij , we find
conversion (Gorman et al., 2020) and morphologi- a column k such that the program M:k → M:j has
cal reinflection (Vylomova et al., 2020). the best ranking score, and evaluate it on Mik .
Neural: We use LSTM-based sequence-to- Transliteration problems: Given a problem ma-
sequence models with attention as well as Trans- trix M , we construct a new matrix M 0 for each pair
former models as implemented by Wu (2020). For of columns (s, t) such that all entries in M 0 are in
each problem, we train a single neural model that the same script. We align word pairs (Mis , Mit )
takes the source and target column numbers, and using the Phonetisaurus many-to-many alignment
the source word, and predicts the target word. tool (Jiampojamarn et al., 2007), and build a sim-
WFST: We use models similar to the pair n-gram ple mapping f for each source token to the target
models (Novak et al., 2016), with the implementa- token with which it is most frequently aligned. We
tion similar to that used by Lee et al. (2020). We 0 by applying f to each token of M and
fill in Mis is
Model    All    Morphophonology    Multilingual    Transliteration    Stress

Table 5: Metrics for all problems, and for problems of each type. The CHRF score for stress problems is not calculated, and not used to determine the overall CHRF score.
M'it = Mit. We then find a program M':s → M':t.

Stress problems: For these problems, we do not perform any alignment, since the training pairs are already token aligned. The synthesis system learns to transform the source string to the sequence of stress values.

4.3 Metrics

We calculate two metrics: exact match accuracy, and CHRF score (Popović, 2015). The exact match accuracy measures the fraction of examples the synthesis system gets fully correct.

    EXACT = #{correctly predicted test samples} / #{test samples}

The CHRF score is calculated only at the token level, and measures the n-gram overlaps between the predicted answer and the true answer, allowing us to measure partially correct answers. We do not calculate the CHRF score for stress problems, as n-gram overlap is not a meaningful measure of performance for these problems.

4.4 Results

Table 5 summarizes the results of our experiments. We report the average of each metric across problems, for all problems and by category.

We find that neural models that don't have specific inductive biases for the kind of tasks we present here are not able to perform well with this amount of data. The synthesis models do better than the WFST baseline overall, and on all types of problems except transliteration. This could be due to the simple map computed from alignments before program synthesis causing errors that the rule learning process cannot correct.

5 Analysis

We examine two aspects of the program synthesis models we propose. The first is the way it uses the explicit knowledge in the DSL and implicit knowledge provided as the ranking score to generalize. We then consider specific examples of problems, and show examples of where our models succeed and fail in learning different types of patterns.

Model         100%    ≥ 75%    ≥ 50%
NO FEATURE    3       5        7
TOKEN         3       6        10
FEATURE       3       6        11
WFST          1       2        7

Table 6: Number of problems where the model achieves different thresholds of the EXACT score.

5.1 Features aid generalization

Since the test examples are chosen to test specific rules, solving more test examples correctly is indicative of the number of rules inferred correctly. In Table 6, we see that providing the model with features allows it to infer more general rules, solving a greater fraction of more problems. We see that allowing the model to use features increases its performance, and having it prefer more general rules involving features lets it do even better.

5.2 Correct programs are short

In Figure 4 we see that the number of rules in a problem5 tends to be higher when the model gets the problem wrong than when it gets it right. This indicates that when the model finds many specific rules, it overfits to the training data and fails to generalize well. This holds true for all the variants, as seen in the downward slope of the lines.

We also find that allowing and encouraging a model to use features leads to shorter programs. The average length of a program synthesized by

5 To account for some problems having more columns than others (and hence more rules), we find the average number of rules for each pair of columns.
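A small sketch of the two metrics defined in §4.3, assuming sacrebleu for CHRF; the space-joining of tokens below is an assumption about how token sequences are turned into strings for scoring, not the paper's exact evaluation script.

    # Sketch of the Section 4.3 metrics (assumption: predictions and references
    # are aligned lists of token sequences; sacrebleu provides chrF).
    import sacrebleu

    def exact_match(predictions, references):
        # EXACT = #{correctly predicted test samples} / #{test samples}
        correct = sum(p == r for p, r in zip(predictions, references))
        return correct / len(references)

    def mean_chrf(predictions, references):
        scores = [
            sacrebleu.sentence_chrf(" ".join(p), [" ".join(r)]).score
            for p, r in zip(predictions, references)
        ]
        return sum(scores) / len(scores)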
rules based on features, and instead chooses rules
specific to each initial character in the root.
Since the DSL allows for substituting one token
with one other, or inserting multiple tokens, the
system has to use multiple rules to substitute one
token with multiple tokens. In the case of Mandar,
we see one way it does this, by performing multiple
substitutions (to transform di- to mas- it replaces d
and i with a and s respectively, and then inserts m).
the rule for long vowels applies, and one where the we hope to apply it to learning more complex types
rule for words without long vowels applies. of lingusitic rules in the future.
In addition to being a way to learn rules from
6 Related work data, the ability to explicity control the general-
ization behaviour of the model allows for the use
Gildea and Jurafsky (1996) also study the problem of program synthesis to understand the kinds of
of learning phonological rules from data, and ex- learning biases and operations that are required to
plicitly controlling generalization behaviour. We model various linguistic processes. We leave this
pursue a similar goal, but in a few-shot setting. exploration to future work.
Barke et al. (2019) and Ellis et al. (2015) study
program synthesis applied to linguistic rule learn- Acknowledgements
ing. They make much stronger assumptions about
the data (the existence of an underlying form, and We would like to thank Partho Sarthi for invaluable
the availability of additional information like IPA help with PROSE and NDSyn. We would also like
features). We take a different approach, and study to thank the authors of the ProLinguist paper for
program synthesis models that can work only on their assistance. Finally, we would like to thank the
the tokens in the word (like N O F EATURE), and also anonymous reviewers for their feedback.
explore the effect of providing features in these
cases. We also test our approach on a more varied References
set of problems that involves aspects of morphol-
ogy, transliteration, multilinguality, and stress. Shraddha Barke, Rose Kunkel, Nadia Polikarpova, Eric
Meinhardt, Eric Bakovic, and Leon Bergen. 2019.
Şahin et al. (2020) also present a set of Linguis- Constraint-based learning of phonological processes.
tics Olympiad problems as a test of the metalin- In Proceedings of the 2019 Conference on Empirical
guistic reasoning abilities of NLP models. While Methods in Natural Language Processing and the
problems in their set involve finding phonological 9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 6176–
rules, they also require the knowledge of syntax 6186, Hong Kong, China. Association for Computa-
and semantics that are out of the scope of our study. tional Linguistics.
We present a set of problems that only requires
reasoning about surface word forms, and without A. Bellos. 2020. The Language Lover’s Puzzle Book:
Lexical perplexities and cracking conundrums from
requiring the meanings. across the globe. Guardian Faber Publishing.
Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Josef Robert Novak, Nobuaki Minematsu, and Keikichi
Arya McCarthy, Shijie Wu, and Daniel You. 2020. Hirose. 2016. Phonetisaurus: Exploring grapheme-
The SIGMORPHON 2020 shared task on multilin- to-phoneme conversion with joint n-gram models in
gual grapheme-to-phoneme conversion. In Proceed- the wfst framework. Natural Language Engineer-
ings of the 17th SIGMORPHON Workshop on Com- ing, 22(6):907–938.
putational Research in Phonetics, Phonology, and
Morphology, pages 40–50, Online. Association for Oleksandr Polozov and Sumit Gulwani. 2015. Flash-
Computational Linguistics. meta: A framework for inductive program synthesis.
SIGPLAN Not., 50(10):107–126.
Sumit Gulwani. 2011. Automating string processing
in spreadsheets using input-output examples. SIG- Maja Popović. 2015. chrF: character n-gram F-score
PLAN Not., 46(1):317–330. for automatic MT evaluation. In Proceedings of the
Tenth Workshop on Statistical Machine Translation,
Sumit Gulwani, Oleksandr Polozov, and Rishabh pages 392–395, Lisbon, Portugal. Association for
Singh. 2017. Program synthesis. Foundations and Computational Linguistics.
Trends® in Programming Languages, 4(1-2):1–119.
Arun Iyer, Manohar Jonnalagedda, Suresh Gözde Gül Şahin, Yova Kementchedjhieva, Phillip
Parthasarathy, Arjun Radhakrishna, and Sriram K Rust, and Iryna Gurevych. 2020. PuzzLing Ma-
Rajamani. 2019. Synthesis and machine learning chines: A Challenge on Learning From Small Data.
for heterogeneous extraction. In Proceedings of the In Proceedings of the 58th Annual Meeting of the
40th ACM SIGPLAN Conference on Programming Association for Computational Linguistics, pages
Language Design and Implementation, pages 1241–1254, Online. Association for Computational
301–315. Linguistics.
Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Partho Sarthi, Monojit Choudhury, Arun Iyer, Suresh
Sherif. 2007. Applying many-to-many alignments Parthasarathy, Arjun Radhakrishna, and Sriram Ra-
and hidden Markov models to letter-to-phoneme jamani. 2021. ProLinguist: Program Synthesis for
conversion. In Human Language Technologies Linguistics and NLP. IJCAI Workshop on Neuro-
2007: The Conference of the North American Chap- Symbolic Natural Language Inference.
ter of the Association for Computational Linguistics;
Proceedings of the Main Conference, pages 372– Temple F Smith, Michael S Waterman, et al. 1981.
379, Rochester, New York. Association for Compu- Identification of common molecular subsequences.
tational Linguistics. Journal of molecular biology, 147(1):195–197.
Mark Johnson. 1984. A discovery procedure for cer- Harold Somers. 2016. Changing the subject.
tain phonological rules. In 10th International Con- In Andrew Lamont and Dragomir Radev, edi-
ference on Computational Linguistics and 22nd An- tors, North American Computational Linguistics
nual Meeting of the Association for Computational Olympiad 2016: Invitational Round. North Ameri-
Linguistics, pages 344–347, Stanford, California, can Computational Linguistics Olympiad.
USA. Association for Computational Linguistics.
Saujas Vaduguru. 2019. Chickasaw stress. In
Mary Laughren. 2011. Stopping and flapping in Shardul Chiplunkar and Saujas Vaduguru, editors,
warlpiri. In Dragomir Radev and Patrick Littell, Panini Linguistics Olympiad 2019. Panini Linguis-
editors, North American Computational Linguistics tics Olympiad.
Olympiad 2011: Invitational Round. North Ameri-
can Computational Linguistics Olympiad. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Kaiser, and Illia Polosukhin. 2017. Attention is all
Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. you need. In Advances in Neural Information Pro-
McCarthy, and Kyle Gorman. 2020. Massively cessing Systems, volume 30. Curran Associates, Inc.
multilingual pronunciation modeling with WikiPron.
In Proceedings of the 12th Language Resources Ekaterina Vylomova, Jennifer White, Eliza-
and Evaluation Conference, pages 4223–4228, Mar- beth Salesky, Sabrina J. Mielke, Shijie Wu,
seille, France. European Language Resources Asso- Edoardo Maria Ponti, Rowan Hall Maudslay, Ran
ciation. Zmigrod, Josef Valvoda, Svetlana Toldova, Francis
Minh-Thang Luong, Hieu Pham, and Christopher D Tyers, Elena Klyachko, Ilya Yegorov, Natalia
Manning. 2015. Effective approaches to attention- Krizhanovsky, Paula Czarnowska, Irene Nikkarinen,
based neural machine translation. arXiv preprint Andrew Krizhanovsky, Tiago Pimentel, Lucas
arXiv:1508.04025. Torroba Hennigen, Christo Kirov, Garrett Nicolai,
Adina Williams, Antonios Anastasopoulos, Hilaria
Tom McCoy. 2018. Better left unsaid. In Patrick Lit- Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka
tell, Tom McCoy, Dragomir Radev, and Ali Shar- Silfverberg, and Mans Hulden. 2020. SIGMOR-
man, editors, North American Computational Lin- PHON 2020 shared task 0: Typologically diverse
guistics Olympiad 2018: Invitational Round. North morphological inflection. In Proceedings of the
American Computational Linguistics Olympiad. 17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology, its arguments. The arguments may be other op-
pages 1–39, Online. Association for Computational erators, offsets, or other constants (like tokens or
Linguistics.
features). The score for an operator in the argu-
Elysia Warner. 2019. Tarangan. In Samuel Ahmed, ment is computed recursively. The score for an
Bozhidar Bozhanov, Ivan Derzhanski (technical offset favours smaller numbers and local rules by
editor), Hugh Dobbs, Dmitry Gerasimov, Shin-
jini Ghosh, Ksenia Gilyarova, Stanislav Gurevich,
decreasing the score for larger offsets. The score
Gabrijela Hladnik, Boris Iomdin, Bruno L’Astorina, for other constants is chosen to be a small negative
Tae Hun Lee (editor-in chief), Tom McCoy, André constant. The scores for the arguments are added
Nikulin, Miina Norvik, Tung-Le Pan, Aleksejs up, along with a small negative penalty to favour
Peguševs, Alexander Piperski, Maria Rubinstein,
shorter programs, to obtain the final score for the
Daniel Rucki, Artūrs Semeņuks, Nathan Somers,
Milena Veneva, and Elysia Warner, editors, Interna- operator.
tional Linguistics Olympiad 2019. International Lin- This ranking score selects for programs that are
guistics Olympiad. shorter, and favours either choosing more gen-
Shijie Wu. 2020. Neural transducer. https://github. eral by giving the Is predicate a higher score
com/shijie-wu/neural-transducer/. (F EATURE) or more specific rules by giving the
A NDSyn algorithm IsToken predicate a higher score (T OKEN). The
top k programs according to the ranking function
We use the NDSyn algorithm to learn disjunctions are chosen as candidates for the next step.
of rules. We apply NDSyn in multiple passes to To choose the final set of rules from the candi-
allow the model to learn multi-pass rules. dates generated using the FlashMeta algorithm,
At each pass, the algorithm learns rules to per- we use a set covering algorithm that chooses the
form token-level transformations that are applied rules that correctly answer the most number of ex-
to each element of the input sequence. The token- amples while also incorrectly answering the least.
level examples are passed to NDSyn, which learns These rules are applied to each example, and the
the if-then-else statements that constitute a set of output tokens are tagged with the transformation
rules. This is done by first generating a set of can- that is applied. These outputs are then the input to
didate rules by randomly sampling a token-level the next pass of the algorithm.
example and synthesizing a set of rules that satisfy
the example. Then, rules are selected to cover the B Dataset
token-level examples.
Rules that satisfy a randomly sampled example We select problems from various Linguistics
are learnt using the FlashMeta program synthesis Olympiads to create our dataset. We include pub-
algorithm (Polozov and Gulwani, 2015). The syn- licly available problems that have appeared in
thesis task is given by the DSL operator P and the Olympiads before. We choose problems that only
specification of constraints X that the synthesized involve rules based on the symbols in the data, and
program must satisfy. In our application, this speci- not based on knowledge of notions such as gender,
fication is in the form of token-level examples, and tense, case, or semantic role. These problems are
the DSL operators are the predicates and transfor- based on the phonology of a particular language,
mations defined in the paper. The algorithm recur- and include aspects of morphology and orthogra-
sively decomposes the synthesis problem (P, X ) phy, and maybe also the phonology of a different
into smaller tasks (Pi , Xi ) for each argument Pi language. In some cases where a single Olympiad
to the operator. Xi is inferred using the inverse problem involves multiple components that can be
semantics of the operator Pi , which is encoded as solved independent of each other, we include them
a witness function. The inverse semantics provides as separate problems in our dataset.
the possible values for the arguments of an opera- We put the data and assignments in a matrix, as
tor, given the output of the operator. We refer the described in Section 3.1 . We separate tokens in a
reader to the paper by Polozov and Gulwani (2015) word by a space while transcribing the problems
for a full description of the synthesis algorithm. from their source PDFs. We do not separate diacrit-
After the candidates are generated, they are ics as different tokens, and include them as part of
ranked according to a ranking score of each pro- the same token. For each token in the Roman script,
gram. The ranking score for an operator in a pro- we add the boolean features vowel and consonant,
gram is computed as a function of the scores of and manually tag the tokens according to whether
they are a vowel or consonant.
We store the problems in JSON files with details
about the languages, the families to which the lan-
guages belong, the data matrix, the notes used to
create the features, and the feature sets for each
token.
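For concreteness, the following is a minimal sketch of what one such problem file might look like. The field names and example values are illustrative only and are not taken from the released dataset.

```python
import json

# Hypothetical problem-file layout mirroring the description above; the real
# dataset may use different field names and structure.
problem = {
    "language": "ExampleLanguage",
    "family": "ExampleFamily",
    "notes": "Tokens are space-separated; diacritics stay attached to their token.",
    "matrix": [
        ["w a l k", "w a l k s"],   # data cells, tokens separated by spaces
        ["j u m p", "?"],           # "?" marks an assignment to be solved
    ],
    "features": {
        "a": {"vowel": True, "consonant": False},
        "w": {"vowel": False, "consonant": True},
    },
}

with open("example_problem.json", "w", encoding="utf-8") as f:
    json.dump(problem, f, ensure_ascii=False, indent=2)
```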
C Baselines
C.1 Neural
Following Şahin et al. (2020), we use small neural
models for sequence-to-sequence tasks. We train a
single neural model for each task, and provide the
column numbers as tags in addition to the source
sequence. We find that the single model approach
works better than training a model for each pair of
columns.
LSTM: We use LSTM models with soft attention
(Luong et al., 2015), with embeddings of size 64,
hidden layers of size 128, a 2-layer encoder and a
single layer decoder. We apply a dropout of 0.3 for
all layers. We train the model for 100 epochs using
the Adam optimizer with a learning rate of 10^-3,
learning rate reduction on plateau, and a batch size
of 2. We clip the gradient norm to 5.
Transformer: We use Transformer models
(Vaswani et al., 2017) with embeddings of size
128, hidden layers of size 256, a 2-layer encoder
and a 2-layer decoder. We apply a dropout of 0.3
for all layers. We train the model for 2000 steps
using the Adam optimizer with a learning rate of
10^-3, warmup of 400 steps, learning rate reduction
on plateau, and a batch size of 2. We use a label
smoothing value of 0.1, and clip the gradient norm
to 1.
We use the implementations provided at https://github.com/shijie-wu/neural-transducer/ for all neural models.
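The hyperparameters reported above can be summarized as follows. These dictionaries simply restate the reported settings; they are not the literal configuration flags of the neural-transducer toolkit.

```python
# Reported baseline hyperparameters, restated as plain dictionaries
# (not the toolkit's actual command-line flags).
LSTM_BASELINE = {
    "attention": "soft (Luong et al., 2015)",
    "embedding_size": 64, "hidden_size": 128,
    "encoder_layers": 2, "decoder_layers": 1,
    "dropout": 0.3, "epochs": 100,
    "optimizer": "Adam", "learning_rate": 1e-3,
    "lr_schedule": "reduce on plateau",
    "batch_size": 2, "gradient_clip_norm": 5.0,
}

TRANSFORMER_BASELINE = {
    "embedding_size": 128, "hidden_size": 256,
    "encoder_layers": 2, "decoder_layers": 2,
    "dropout": 0.3, "training_steps": 2000,
    "optimizer": "Adam", "learning_rate": 1e-3, "warmup_steps": 400,
    "lr_schedule": "reduce on plateau",
    "batch_size": 2, "label_smoothing": 0.1, "gradient_clip_norm": 1.0,
}
```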
C.2 WFST
We use the implementation of the WFST models available at https://github.com/sigmorphon/2020/tree/master/task1/baselines/fst.
We train a model for each pair of columns. We
report the results for models of order 5, which were
found to perform the best on the test data (highest
EXACT score) among models of order 3 to 9.
Findings of the SIGMORPHON 2021 Shared Task on
Unsupervised Morphological Paradigm Clustering

Adam Wiemerslage†, Arya McCarthy‡, Alexander Erdmann∇, Garrett Nicolai ψ, Manex Agirrezabal φ, Miikka Silfverberg ψ, Mans Hulden†, Katharina Kann†
† University of Colorado Boulder, ‡ Johns Hopkins University, ∇ [...], φ University of Copenhagen, ψ University of British Columbia
{adam.wiemerslage,katharina.kann}@colorado.edu

[Figure: a partial inflection paradigm of the Spanish verb disfrutar, showing slots such as V;IND;PRS;1;SG, V;IND;PRS;1;PL, V;IND;PRS;2;SG, V;IND;PRS;2;PL, V;IND;PRS;3;SG, and V;IND;PRS;3;PL, and forms disfruto, disfrutamos, disfrutáis, disfruta, and disfrutan.]

Abstract

We describe the second SIGMORPHON [...]
[...] all teams also submit a system description paper.

The shared task systems can be grouped into two broad categories: similarity-based systems experiment with different combinations of orthographic and embedding-based similarity metrics for word forms combined with clustering methods like k-means or agglomerative clustering. Grammar-based methods instead learn grammars or rules from the data and either apply these to clustering directly, or first segment words into stems and affixes and then cluster forms which share a stem into paradigms. Our official baseline, described in Section 2.3, is based on grouping together word forms sharing a common substring of length ≥ k, where k is a hyperparameter. Grammar-based systems obtain higher average F1 scores (see Section 2.2 for details on evaluation) across the nine test languages than the baseline. The Edinburgh system has the best overall performance: it outperforms the baseline by 34.61% F1 and the second best system by 1.84% F1.

The rest of the paper is organized as follows: Section 2 describes the task of unsupervised morphological paradigm clustering in detail, including the official baseline and all provided datasets. Section 3 gives an overview of the participating systems. Section 4 describes the official results, and Section 5 presents an analysis. Finally, Section 6 contains a discussion of where the task can move in future iterations and concludes the paper.

2 Task Description

Unsupervised morphological paradigm clustering consists of, given a raw text corpus, grouping words from that corpus into their paradigms without any additional information. Recent work in unsupervised morphology has attempted to induce full paradigms from corpora with only a subset of all types. Kann et al. (2020) and Erdmann et al. (2020) explore initial approaches to this task, which is called unsupervised morphological paradigm completion, but find it to be challenging. Building upon the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion (Kann et al., 2020), our shared task is focused on a subset of the overall problem: sorting words into paradigms. This can be seen as an initial step to paradigm completion, as unobserved types do not need to be induced, and the inflectional categories of paradigm slots do not need to be considered.

2.1 Data

Languages The SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering features 5 development languages: Maltese, Persian, Portuguese, Russian, and Swedish. The final evaluation is done on 9 test languages: Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.

Our languages span 4 writing systems, and represent fusional, agglutinative, templatic, and polysynthetic morphologies. The languages in the development set are mostly suffixing, except for Maltese, which is a templatic language. And while most of the test languages are also predominantly suffixing, Navajo employs prefixes and Basque uses both prefixes and suffixes.

Text Corpora We provide corpora from the Johns Hopkins University Bible Corpus (JHUBC) (McCarthy et al., 2020b) for all development and test languages. This is the only resource that systems are allowed to use.

Gold Partial Paradigms Along with the Bibles, we also release a set of gold partial paradigms for the development languages to be used for system development. Gold data sets are also compiled for the test languages, but these test sets are withheld until the completion of the shared task.

In order to produce gold partial paradigms, we first take the set of all paradigms Π for each language from UniMorph (McCarthy et al., 2020a). We then obtain gold partial paradigms Π_Ĝ = Π ∩ Σ, where Σ is the set of types attested in the Bible corpus. Finally, we sample up to 1000 of the resulting gold partial paradigms for each language, resulting in the set Π_G, according to the following steps (a short code sketch follows at the end of this subsection):

1. Group gold paradigms in Π_Ĝ by size, resulting in the set G, where g_k ∈ G is the group of paradigms with k forms in it.

2. Continually loop over all g_k ∈ G and randomly sample one paradigm from g_k until we have 1000 paradigms.

Because not every token in the Bible corpora is in UniMorph, we can only evaluate on the subset of paradigms that exist in the UniMorph database. In practice, this means that for several languages, we are not able to sample 1000 paradigms, cf. Tables 1 and 2. Notably, for Basque, we can only provide 12 paradigms.
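The following is a minimal sketch of the sampling procedure above, assuming paradigms are given as a mapping from lemma to a set of UniMorph forms. It is an illustration, not the organizers' script.

```python
import random
from collections import defaultdict

def sample_gold_partial_paradigms(unimorph_paradigms, bible_types, budget=1000, seed=0):
    """Sketch of the two-step sampling described above (not the official script).

    `unimorph_paradigms` maps a lemma to its set of forms; `bible_types` is the
    set of word types attested in the Bible corpus.
    """
    rng = random.Random(seed)

    # Keep only forms attested in the corpus (the partial paradigms).
    partial = {lemma: forms & bible_types
               for lemma, forms in unimorph_paradigms.items()}
    partial = {lemma: forms for lemma, forms in partial.items() if forms}

    # Step 1: group partial paradigms by their size k.
    by_size = defaultdict(list)
    for lemma, forms in partial.items():
        by_size[len(forms)].append((lemma, forms))

    # Step 2: cycle over the size groups, drawing one paradigm at a time,
    # until the budget is reached or no paradigms remain.
    sampled = []
    while len(sampled) < budget and any(by_size.values()):
        for k in sorted(by_size):
            group = by_size[k]
            if group and len(sampled) < budget:
                sampled.append(group.pop(rng.randrange(len(group))))
    return sampled
```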
Maltese Persian Portuguese Russian Swedish
# Lines 7761 7931 31167 31102 31168
# Tokens 193257 227584 828861 727630 871707
# Types 16017 11877 31446 46202 25913
TTR .083 .052 .038 .063 .03
# Paradigms 76 64 1000 1000 1000
# Forms in paradigms 572 446 11430 6216 3596
Largest paradigm size 14 20 47 17 9
Table 1: Statistics for the development Bible corpora and the dev gold partial paradigms. TTR is the type-token
ratio in the corpus. The statistics for the paradigms reflect only those words in our partial paradigms, not the full
paradigms from Unimorph.
Table 2: Statistics for the test Bible corpora and the test gold partial paradigms.
[Figure 2: An example matching of predicted paradigms in blue, and a gold paradigm in green. Words in red do not exist in the gold set, and thus cannot be evaluated. The predicted paradigms shown contain forms such as depende, dependem, dependesse, dependiam, dependerá, desfrutarem, and desonrares.]

2.2 Evaluation

As our task is entirely unsupervised, evaluation is not straightforward: as in Kann et al. (2020), our evaluation requires a mapping from predicted paradigms to gold paradigms. Because our set of gold partial paradigms does not cover all words in the corpus, in practice we only evaluate against a subset of the clusters predicted by systems.

For these reasons, we want an evaluation that assesses the best matching paradigms, ignoring predicted forms that do not occur in the gold set, but still punishing for spurious predictions that are in the gold set. For example, Figure 2 shows two candidate matches for a gold partial paradigm. Each [...] The evaluation punishes systems for predicting a second paradigm, P2, with words from G1, reducing the overall precision score of this submission.

Building upon BMAcc (Jin et al., 2020), we use best-match F1 score for evaluation. We define a paradigm as a set of word forms f ∈ π. Duplicate forms within π (syncretism) are discarded. Given a set of gold partial paradigms π^g ∈ Π_G, a set of predicted paradigms π^p ∈ Π_P, a gold vocabulary Σ^g = ∪ π^g, and a predicted vocabulary Σ^p = ∪ π^p, it works according to the following steps (a short code sketch follows the list):

1. Redefine each predicted paradigm, removing the words that we cannot evaluate, π^p' = π^p ∩ Σ^g, to form a set of pruned paradigms Π'_P.

2. Build a complete bipartite graph over Π'_P and Π_G, where the edge weight between π^g_i and π^p'_j is the number of true positives |π^g_i ∩ π^p'_j|.

3. Compute the maximum-weight full matching using Karp (1980), in order to find the optimal alignment between Π'_P and Π_G.

4. Assign all predicted words Σ^p' and all gold words Σ^g a label corresponding to the gold paradigm, according to the matching found in step 3. Any unmatched w_i^p' ∈ Σ^p' is assigned a label corresponding to a spurious paradigm.

5. Compute the F1 score between the sets of labeled words in Σ^p' and Σ^g.
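The following is a minimal sketch of the best-match F1 computation above. It uses a generic assignment solver rather than the Karp (1980) algorithm cited, reduces the treatment of unmatched and spurious paradigms to a micro-averaged count, and is not the official evaluation script.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_match_f1(predicted, gold):
    """Sketch of best-match F1; `predicted` and `gold` are lists of sets of forms."""
    gold_vocab = set().union(*gold)

    # Step 1: prune predicted paradigms to the gold vocabulary.
    pruned = [p & gold_vocab for p in predicted]
    pruned = [p for p in pruned if p]
    if not pruned:
        return 0.0

    # Step 2: edge weights are the true-positive counts between paradigm pairs.
    weights = np.array([[len(p & g) for g in gold] for p in pruned])

    # Step 3: maximum-weight matching between pruned and gold paradigms.
    rows, cols = linear_sum_assignment(weights, maximize=True)

    # Steps 4-5: words in unmatched or mismatched paradigms count against
    # precision and recall, giving a micro-averaged F1 over word labels.
    tp = weights[rows, cols].sum()
    precision = tp / sum(len(p) for p in pruned)
    recall = tp / sum(len(g) for g in gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```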
2.3 Baseline System

We provide a straightforward baseline that constructs paradigms based on substring overlap between words. We construct paradigms out of words that share a substring of length ≥ k. Since words can share multiple substrings, it is possible that multiple identical, redundant paradigms are created. We reduce these to a single paradigm. Words that do not belong to a cluster are assigned a singleton paradigm, that is, a paradigm that consists of only that word.

We tune k on the development sets and find that k = 5 works best on average. This means that a word of less than 5 characters can only ever be in one, singleton, paradigm. (A rough code sketch of this baseline appears at the end of this section.)

3 Submitted Systems

The Boulder-Perkoff-Daniels-Palmer team (Boulder-PDP; Perkoff et al., 2021) participates with four submissions, resulting from experiments with two different systems. Both systems apply k-means clustering on vector representations of input words. They differ in the type of vector representations used: either orthographic or semantic representations. Semantic skip-gram representations are generated using word2vec (Mikolov et al., 2013). For the orthographic representations, each word is encoded into a vector of fixed dimensionality equaling the word length |w_max| for the longest word w_max in the input corpus. They associate each character c ∈ Σ in the alphabet of the input corpus with a real number r ∈ [0, 1] and assign v_i := r if the ith character of the input word w is c. If |w| < |w_max|, the remaining entries are assigned to 0.

The number of clusters is a hyperparameter of the k-means clustering algorithm. In order to set this hyperparameter, Perkoff et al. (2021) experiment with a graph-based method. The word types in the corpus form the nodes of a graph, where the neighborhood of a word w consists of all words sharing a maximal substring with w. The graph is split into highly connected subgraphs (HCS) containing n nodes, where the number of edges that need to be cut in order to split the graph into two disconnected components is > n/2 (Hartuv and Shamir, 2000). The number of HCSs is then taken to be the cluster number. In practice, however, the graph-clustering step proves to be prohibitively slow and results for test languages are submitted using fixed numbers of clusters of size 500, 1000, 1500 and 1900. In experiments on the dev languages, they find that the orthographic representations outperform the semantic representations for all languages, and thus submit four systems utilizing orthographic representations.

The Boulder-Gerlach-Wiemerslage-Kann team (Boulder-GWK; Gerlach et al., 2021) submits two systems based on an unsupervised lemmatization system originally proposed by Rosa and Zabokrtský (2019). Their approach is based on agglomerative hierarchical clustering of word types, where the distance between word types is computed as a combination of a string distance metric and the cosine distance of fastText embeddings (Bojanowski et al., 2017). Their choice of fastText embeddings is due to the limited size of the shared task datasets. Two variants of edit distance are compared to quantify string distance: (1) Jaro-Winkler edit distance (Winkler, 1990) resembles regular edit distance of strings but emphasizes similarity at the start of strings, which is likely to bias the system toward languages expressing inflection via suffixation. (2) A weighted variant of edit distance, where costs for insertions, deletions and substitutions are derived from a character-based language model trained on the shared task data.

The CU–UBC team (Yang et al., 2021) provides systems that build upon the official shared task baseline – given the pseudo-paradigms found by the baseline, they extract inflection rules of multiple types. Comparing pairs of words in each paradigm, they learn both continuous and discontinuous character sequences that transform the first word into the second, following work on supervised inflectional morphology, such as Durrett and DeNero (2013); Hulden et al. (2014). Rules are sorted by frequency to separate genuine inflectional patterns from noise. Starting from a random seed word, paradigms are constructed by iteratively applying the most frequent rules. Generated paradigms are further tested for paradigm coherence using metrics such as graph degree calculation and fastText embedding similarity.

The Edinburgh team (McCurdy et al., 2021) submits a system based on adaptor grammars (Johnson et al., 2007) modeling word structure.
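Returning to the official baseline of Section 2.3, the following is a rough sketch of the substring grouping it describes; the grouping and deduplication details of the released baseline script may differ.

```python
def substring_baseline(vocabulary, k=5):
    """Group words sharing any substring of length >= k; deduplicate identical
    groups and give every unclustered word a singleton paradigm (rough sketch)."""
    index = {}  # substring -> words containing it
    for word in vocabulary:
        for i in range(len(word) - k + 1):
            index.setdefault(word[i:i + k], set()).add(word)

    paradigms = {frozenset(words) for words in index.values() if len(words) > 1}
    clustered = set().union(*paradigms) if paradigms else set()
    paradigms |= {frozenset([w]) for w in vocabulary if w not in clustered}
    return paradigms
```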
English Navajo Spanish Finnish Bulgarian Basque Kannada German Turkish Average
Rec 28.93 32.71 23.90 18.43 20.55 28.57 25.19 25.50 15.70 24.39
Boulder-PDP-1 Prec 29.27 34.15 24.68 18.81 20.75 29.51 35.18 25.64 15.90 25.99
F1 29.10 33.41 24.29 18.62 20.65 29.03 29.36 25.57 15.80 25.09
Rec 36.57 36.92 28.52 23.38 26.37 30.16 25.83 33.21 19.53 28.94
Boulder-PDP-2 Prec 37.00 38.54 29.45 23.86 26.63 31.15 36.08 33.40 19.79 30.65
F1 36.78 37.71 28.98 23.62 26.50 30.65 30.11 33.31 19.66 29.70
Rec 42.79 37.85 29.41 26.01 28.73 26.98 25.94 38.18 21.38 30.81
Boulder-PDP-3 Prec 43.30 39.51 30.37 26.55 29.01 27.87 36.23 38.39 21.66 32.54
F1 43.04 38.66 29.88 26.27 28.87 27.42 30.23 38.28 21.52 31.58
Rec 45.45 40.19 30.64 26.60 29.79 28.57 24.54 39.86 21.65 31.92
Boulder-PDP-4 Prec 45.99 41.95 31.63 27.15 30.08 29.51 34.28 40.08 21.93 33.62
F1 45.72 41.05 31.13 26.87 29.93 29.03 28.61 39.97 21.79 32.68
Rec 28.81 10.75 19.27 22.02 30.02 19.05 18.54 31.92 20.63 22.33
Boulder-GWK-2 Prec 66.33 65.71 69.93 67.36 71.69 35.29 62.45 78.56 64.09 64.60
F1 40.17 18.47 30.21 33.19 42.32 24.74 28.60 45.39 31.22 32.70
Rec 24.53 11.21 18.30 22.69 31.18 25.40 16.93 30.98 21.16 22.49
Boulder-GWK-1 Prec 56.47 68.57 66.41 69.41 74.46 47.06 57.04 76.26 65.74 64.60
F1 34.20 19.28 28.69 34.20 43.96 32.99 26.12 44.06 32.02 32.83
Rec 76.69 59.81 72.18 76.73 73.02 25.40 38.48 77.62 65.82 62.86
Baseline Prec 38.76 23.02 26.56 17.86 26.50 18.60 17.22 25.35 15.60 23.28
F1 51.49 33.25 38.83 28.97 38.89 21.48 23.79 38.22 25.23 33.35
Rec 66.95 50.93 60.52 45.96 65.08 17.46 30.33 66.57 43.25 49.67
CU–UBC-5 Prec 90.40 68.55 72.70 56.47 76.85 52.38 61.26 74.40 54.05 67.45
F1 76.93 58.45 66.05 50.68 70.48 26.19 40.57 70.26 48.05 56.41
Rec 63.76 51.867 63.62 48.75 63.84 17.46 33.12 65.05 45.81 50.36
CU–UBC-6 Prec 85.99 69.375 76.49 59.67 75.99 52.38 64.24 72.39 57.52 68.23
F1 73.23 59.36 69.46 53.66 69.39 26.19 43.71 68.52 51.00 57.17
Rec 60.36 53.74 64.05 51.51 58.18 22.22 35.37 59.32 47.74 50.28
CU–UBC-7 Prec 81.42 72.33 76.98 62.58 69.23 66.67 69.77 66.13 60.17 69.47
F1 69.33 61.66 69.92 56.51 63.23 33.33 46.94 62.54 53.24 57.41
Rec 83.39 47.66 76.48 52.06 73.14 25.40 36.33 74.28 46.50 57.25
CU–UBC-3 Prec 84.38 49.76 78.97 53.14 73.87 26.23 50.75 74.70 47.10 59.88
F1 83.89 48.69 77.71 52.60 73.50 25.81 42.35 74.49 46.80 58.42
Rec 80.69 47.66 78.35 57.29 73.77 28.57 40.73 74.06 50.93 59.12
CU–UBC-4 Prec 81.64 49.76 80.89 58.48 74.50 29.51 56.89 74.47 51.59 61.97
F1 81.16 48.69 79.60 57.88 74.14 29.03 47.47 74.27 51.26 60.39
Rec 75.96 47.66 75.73 65.35 69.07 28.57 49.52 65.08 60.58 59.73
CU–UBC-1 Prec 76.86 49.76 78.19 66.71 69.92 29.51 69.16 65.44 61.36 62.99
F1 76.41 48.69 76.94 66.03 69.50 29.03 57.71 65.26 60.97 61.17
Rec 88.16 41.59 81.90 72.68 76.58 28.57 50.91 73.98 67.37 64.64
CU–UBC-2 Prec 89.21 43.41 84.56 74.18 77.34 29.51 71.11 74.39 68.24 67.99
F1 88.68 42.48 83.21 73.42 76.96 29.03 59.34 74.18 67.80 66.12
Rec 89.54 41.59 82.38 59.58 80.22 31.75 58.95 78.97 72.82 66.20
Edinburgh Prec 90.75 43.41 85.06 60.84 83.30 32.79 82.34 79.41 73.75 70.18
F1 90.14 42.48 83.70 60.20 81.73 32.26 68.71 79.19 73.28 67.96
Rec 95.31 - 85.49 86.21 84.74 65.08 - 79.19 86.80 83.26
stanza Prec 93.87 - 85.84 85.91 82.79 50.62 - 71.57 86.87 79.64
F1 94.59 - 85.66 86.06 83.75 56.94 - 75.19 86.84 81.29
Table 3: Results on all test languages for all systems in %; the official shared task metric is best-match F1. To
provide a more complete picture, we also show precision and recall. stanza is a supervised system.
Their work draws on parallels between the unsupervised paradigm clustering task and unsupervised morphological segmentation. Their grammars segment word forms in the shared task corpora into a sequence of zero or more prefixes and a single stem followed by zero or more suffixes.

Based on the segmented words from the raw text data, they then determine whether the language uses prefixes or suffixes for inflection. The final stem for words in a predominantly suffixing language then consists of the prefixes and stem identified by the adaptor grammar. For a predominantly prefixing language, the final stem instead contains all suffixes of the word form. The team notes that this approach is unsuitable for languages which extensively make use of both prefixes and suffixes, such as Basque.

Finally, they group all words which share the same stem into paradigms. However, because sampling from an adaptor grammar is a non-deterministic process – i.e., the system may return multiple possible segmentations for a single word form –
they construct preliminary clusters by including all forms which might share a given stem. Then they select the cluster that maximizes a score based on frequency of occurrence of the induced segment in all segmentations.

Table 4: A paradigm from our gold set for Navajo: naaghá, neiikai, naahkai, naashá, nijighá, nideeshaał, naayá, ninádaah, naniná, ninájı́daah, nizhdoogaał.

4 Results and Discussion

The official results obtained by all submitted systems on the test sets are shown in Table 3.

The Edinburgh system performs best overall with an average best-match F1 of 67.96%. In general, grammar-based systems attain the best results, with all of the CU–UBC systems and the Edinburgh system outperforming the baseline by at least 23.06% F1. The Boulder-GWK and Boulder-PDP systems, both of which perform clustering over word representations, approach but do not exceed baseline performance. Perkoff et al. (2021) found that clustering over word2vec embeddings performs poorly on the development languages, and their scores on the test set reflect clusters found with vectors based purely on orthography. The Boulder-GWK systems contain incomplete results, and partial evidence suggests that their clustering method, which combines both fastText embeddings trained on the provided bible corpora and edit distance, can indeed outperform the baseline. However, it likely cannot outperform the grammar-based submissions.

For comparison, we also evaluate a supervised lemmatizer from the Stanza toolkit (Qi et al., 2020). The Stanza lemmatizer is a neural network model trained on Universal Dependencies (UD) treebanks (Nivre et al., 2020), which first tags for parts of speech, and then uses these tags to generate lemmas for a given word. Because there is no UD corpus in the current version for Navajo nor Kannada, we do not have scores for those languages. Stanza's accuracy on our task is far lower than that reported for lemmatization on UD data. We note, however, that 1) our data is from a different domain, 2) Biblical language in particular can differ strongly from contemporary text, and 3) we evaluate on only a partial set of types in the corpus, which could represent a particularly challenging set of paradigms for some languages. The Stanza lemmatizer outperforms all systems for all languages, except for German. This is unsurprising as it is a supervised system, though it is interesting that the German score falls short of that of the Edinburgh system.

Overgeneralization/Underspecification When acquiring language, children often overgeneralize morphological analogies to new, ungrammatical forms. For example, the past tense of the English verb to know might be expressed as knowed, rather than the irregular knew. The same behavior can also be observed in learning algorithms at some point during the learning process (Kirov and Cotterell, 2018). This is reflected to some extent in Table 3 by trade-offs between precision and recall. A low precision, but high recall indicates that a system is overgeneralizing: some surface forms are erroneously assigned to too many paradigms. In effect, these systems are hypothesizing that a substring is productive, and thus proposing a paradigmatic relationship between two words. For example, the English words approach and approve share the stem appro- with unproductive segments as suffixes. The baseline tends to overgeneralize due to its creation of large paradigms via a naive grouping of words by shared n-grams.

On the other hand, several systems seem to underspecify, indicated by their low recall. A low recall, but high precision indicates that a system does not attribute inflected forms to a paradigm that the form does in fact belong to. This can be caused by suppletion in systems based purely on orthography, for example, generating the paradigm with go and goes, but attributing went to a separate paradigm. Underspecification is apparent in the CU–UBC submissions that relied on discontinuous rules (CU–UBC 5, 6, and 7). This is likely because they filtered these systems down to far fewer rules than their prefix/suffix systems, in order to avoid severe overgeneralization that can result from spurious morphemes based on discontinuous substrings. Similarly, the Boulder-GWK systems both have reasonable precision, but very low recalls. They report that this is due to the fact that they ignore any words with less than a certain frequency in the corpus due to time constraints, thus creating small paradigms and ignoring many words completely.

Language and Typology In general, we find that Basque and Navajo are the two most difficult test languages. Both languages have relatively small corpora, and are typologically agglutinative – that is, they express inflection via the concatenation of potentially many morpheme segments, which can result in a large number of unique surface forms.
Figure 3: Singleton paradigm counts for the best performing system on all test languages. Languages for which
we have more than 100 paradigms on the left, and those for which we have less than 100 paradigms on the right.
Predicted singleton paradigms are in red and blue, gold singleton paradigms are in grey.
Figure 4: The F1 score across paradigm sizes for the best performing system on all test languages. From left to
right, the graphs represent the groups of languages in increasing order of how well systems typically performed on
them. F1 scores are interpolated for paradigm sizes that do not exist in a given language.
Both languages thus have relatively high type-token ratios (TTR) – especially Navajo, which has the highest TTR, cf. Table 2. It is also important to note that both Basque and Navajo have comparatively small sets of paradigms against which we evaluate. This leaves the possibility that the subset of paradigms in the gold set are particularly challenging. However, the differences between system scores indicate that these two languages do offer challenges related to their morphology.

Navajo is a predominantly prefixing language – the only one in the development and test sets – and Basque also inflects using prefixes, though to a lesser extent. The top two performing systems both obtain low scores for Navajo. The CU–UBC-2 system considers only suffix rules, which results in it being the lowest performing CU–UBC system on Navajo. The Edinburgh submission should be able to identify prefixes and consider the suffix to be part of the stem in Navajo. However, the large number of types, for a relatively small Navajo corpus, may cause difficulties for their algorithm that builds clusters based on affix frequency. Notably, the CU-UBC-7 system, which learns discontinuous rules rather than rules that model strictly concatenative morphology, performs best on Navajo by a large margin when compared to the best performing system, which relies on strictly concatenative grammars. It also performs best on Basque, though by a smaller margin. Another difficulty in Navajo morphology is that it exhibits verbal stem alternation for expressing mood, tense, and aspect, which creates challenges for systems that rely on rewrite rules or string similarity based on continuous substrings. For instance, our evaluation algorithm aligns a singleton predicted paradigm to the gold paradigm in Table 4 for nearly all systems.

On Basque, most systems perform poorly. McCurdy et al. (2021), the best performing system overall, obtains a low score for Basque, which may be due to their system assuming that a language inflects either via prefixation or suffixation, but not both, as Basque does. Other systems, however, attain similarly low scores for Basque.

The next tier of difficulty seems to comprise Finnish, Kannada, and Turkish, on which most systems obtain low scores.
All of those languages are suffixing, but also have an agglutinative morphology. The largest paradigm of each of these 3 languages is among the top 4 largest paradigms in Table 2. This implies that large paradigm sizes and large numbers of distinct inflectional morphemes – two properties often assumed to correlate with agglutinative morphology – coupled with sparse corpora to learn from, offer challenges for paradigm clustering. Though agglutinative morphology, having relatively unchanged morphemes across words, might be simpler for automatic segmentation systems than morphology characterized as fusional, our sparse data sets are likely to complicate this.

Finally, systems obtain the best results for English, followed by Spanish, and then Bulgarian. These three languages are also strongly suffixing, but typically express inflection with a single morpheme. German appears to be a bit of an outlier, generally exhibiting scores that lie somewhere between the highest scoring languages and the more difficult agglutinative languages. McCurdy et al. (2021) hypothesize that this may be due to non-concatenative morphology from German verbal circumfixes. This hypothesis could explain why the Boulder-GWK system performs better on German than other languages: it incorporates semantic information. However, the CU–UBC systems that use discontinuous rules (systems 5, 6, and 7), and thus should better model circumfixation, do not produce higher German scores than the continuous rules, including the suffix-only system.

5 Analysis: Partial Paradigm Sizes

The effect of the size of the gold partial paradigms on F1 score for the best system is illustrated in Figure 4. For Basque and Navajo, the F1 score tends to drop as paradigm size increases. We see the same trend for Finnish, Kannada, and German, with a few exceptions, but this trend does not exist for all languages. English resembles something like a bell shape, other than the low scoring outlier for the largest paradigms of size 7. Interestingly, Spanish and Turkish attain both very high and very low scores for larger paradigms.

An artifact of a sparse corpus is that many singleton paradigms arise. For theoretically larger paradigms, only a single inflected form might occur in such a small corpus. Of course, this also happens naturally for certain word classes. However, nouns, verbs, and occasionally adjectives typically form paradigms comprising several inflected forms. Figure 3 demonstrates that the best system tends to overgenerate singleton paradigms. We see this to some extent for all agglutinative languages, which may be due to the high number of typically long, unique forms. This is especially true for Navajo, which has a small corpus and an extremely high type–token ratio. On the other hand, for the languages for which the highest scores are obtained, Spanish and English, the system does not overgenerate singleton paradigms. Of the large number of singleton paradigms predicted for both languages, the vast majority are correct. For other systems not pictured in the figure, singleton paradigms are typically undergenerated for Spanish and English. In the case of English, this could be due to words that share a derivational relationship. For example, the word accomplishment might be assigned to the paradigm for the verb accomplish, when, in fact, their relationship is not inflectional.

6 Conclusion and Future Shared Tasks

We presented the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. Submissions roughly fell into two categories: similarity-based methods and grammar-based methods, with the latter proving more successful at the task of clustering inflectional paradigms. The best systems significantly improved over the provided n-gram baseline, roughly doubling the F1 score – mostly through much improved precision. A comparison against a supervised lemmatizer demonstrated that we have not yet reached the ceiling for paradigm clustering: many words are still either incorrectly left in singleton paradigms or incorrectly clustered with circumstantially (and often derivationally) related words. Regardless of the ground still to be covered, the submitted results were a successful first step in automatically inducing the morphology of a language without access to expert-annotated data.

Unsupervised morphological paradigm clustering is only the first step in a morphological learning process that more closely models human L1 acquisition. We envision future tasks expanding on this task to include other important aspects of morphological acquisition. Paradigm slot categorization is a natural next step. To correctly categorize paradigm slots, cross-paradigmatic similarities must be considered; for example, the German words liest and schreibt are both 3rd person singular present indicative inflections of two different verbs.
This can occasionally be identified via string similarity, but more often requires syntactic information. Syncretism (the collapsing of multiple paradigm slots into a single representation) further complicates the task. A similar subtask involves lemma identification, where a canonical form (Cotterell et al., 2016b) is identified within the paradigm.

Likewise, another important task involves filling unrealized slots in paradigms by generating the correct surface form, which can be approached similarly to previous SIGMORPHON shared tasks on inflection (Cotterell et al., 2016a, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020), but will likely be based on noisy information from the slot categorization – all previous tasks have assumed that the morphosyntactic information provided to an inflector is correct. Currently, investigations into the robustness of these systems to noise are sparse.

Another direction for this task is the expansion to more under-resourced languages. The submitted results demonstrate that the task becomes particularly difficult when the provided raw text is small, but under-documented languages are often the ones most in need of morphological corpora. The JHUBC contains Bible data for more than 1500 languages, which can potentially be augmented by other raw text corpora because morphology is relatively stable across domains. Future tasks may enable the construction of inflectional paradigms in languages that require them to construct further computational tools.

Acknowledgments

We would like to thank all of our shared task participants for their hard work on this difficult task!

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. arXiv preprint arXiv:1810.07125.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016a. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.

Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016b. A joint model of orthography and morphological segmentation. Association for Computational Linguistics.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Andrew Gerlach, Adam Wiemerslage, and Katharina Kann. 2021. Paradigm clustering with weighted edit distance. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Erez Hartuv and Ron Shamir. 2000. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175–181.

Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569–578.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya McCarthy, and Katharina Kann. 2020. Unsupervised morphological paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6696–6707, Online. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, Sharon Goldwater, et al. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. Advances in Neural Information Processing Systems, 19:641.
Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Richard M. Karp. 1980. An algorithm to solve the m×n assignment problem in expected time O(mn log n). Networks, 10(2):143–152.

Christo Kirov and Ryan Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics, 6:651–665.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020a. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020b. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

E. Margaret Perkoff, Josh Daniels, and Alexis Palmer. 2021. Orthographic vs. semantic representations for unsupervised morphological paradigm clustering. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Rudolf Rosa and Zdenek Zabokrtský. 2019. Unsupervised lemmatization as embeddings-based word clustering. CoRR, abs/1908.08528.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. arXiv preprint arXiv:2006.11572.

William E. Winkler. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359.

Changbing Yang, Garrett Nicolai, and Miikka Silfverberg. 2021. Unsupervised paradigm clustering using transformation rules. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.
Adaptor Grammars for Unsupervised Paradigm Clustering

Kate McCurdy, Sharon Goldwater, Adam Lopez
School of Informatics, University of Edinburgh
[email protected], {sgwater, alopez}@inf.ed.ac.uk

[...]
(a) Example grammar (adapted nonterminals underlined in the original):
Word → Stem Suffix
Word → Stem
Stem → Chars
Suffix → Chars
Chars → Char
Chars → Char Chars

(b) Toy corpus with target segmentations: walked → walk-ed; jumping → jump-ing; walking → walk-ing; jump → jump

(c) Example target morphological analyses, showing only the top 2 levels of structure: walk-ed and jump-ing each parsed as Word → Stem Suffix

Figure 1: A possible morphological analysis (1c) learned by the grammar in (1a) over the corpus shown in (1b) (from Johnson et al., 2007b).

[...] frequently sampled subtrees gain higher probability within the conditional adapted distribution. Given an AG specification, MCMC sampling can be used to infer values for the PCFG rule probabilities θ (Johnson et al., 2007a) and PYP hyperparameters (Johnson and Goldwater, 2009).

[...] we need a method to group words together based on their AG segmentations. The example segmentations shown in Figure 1 suggest a very simple approach to paradigm clustering: assign all forms with the same stem to the same cluster. For example, "walked" and "walking" would correctly cluster together with the shared stem "walk". Our system builds upon this intuition.

As a preliminary step, we select grammars to sample from, looking only at the development languages. We build simple clusters and heuristically select grammars which show relatively high performance, as described in Section 3.2. In this case we select two grammars. Once the grammars have been selected, we discard the simple clusters in favor of a more sophisticated strategy.

We implement a procedure to generate clusters for both development and test languages. First, we sample 3 separate AG parses for each corpus and each grammar, resulting in 6 segmentations for each word. We then use frequency-based metrics over the segmentations to identify the language's adfix direction, i.e. whether it is predominantly prefixing or suffixing, as described in Section 3.3. Finally, we iterate over the entire vocabulary and apply frequency-based scores to generate paradigm clusters, as described in Section 3.4.
To evaluate grammar performance, we follow the intuition in Section 3.1 and group by AG-segmented stems. Grouping by stem can be more difficult for complex words. For example, an AG with a more complex grammar might segment "action" as the stem (see also the example in Figure 2a); however, the target paradigm for clustering [...] not "action" and "actions". [...] For our clustering task, we make the further simplifying (but linguistically motivated; e.g. Stump, 2005, 56) assumption that inflectional morphology is generally realized on a word's periphery, so a segmentation like "action-able-s" implies the stem "actionable" (in a suffixing language like English, where the prefix is included in the stem). As all of the development languages were predominantly suffixing (with the partial exception of Maltese, which includes root-and-pattern morphology), we simply grouped together words with the same AG-segmented Prefix + Stem.

We selected two grammars with the following desirable attributes: 1) they reliably showed good performance on the development set, relative to other grammars; and 2) they specified very similar structures, making it easier to combine their outputs in later steps. Both grammars model words as a tripartite Prefix-Stem-Suffix sequence. Both grammars also use a SubMorph level of representation, which has been shown to aid word segmentation (Sirts and Goldwater, 2013), although we only consider segments from the level directly above SubMorphs in clustering. The full grammar specifications are included in Appendix A.

• Simple+SM: Each word comprises one optional prefix, one stem, and one optional suffix. Each of these levels can comprise one or more SubMorphs.

• PrStSu+SM: Each word comprises zero or more prefixes, one stem, and zero or more suffixes. Each of these levels can comprise one or more SubMorphs. Eskander et al. (2020) found that this grammar showed the highest performance in unsupervised segmentation across the languages they evaluated.

[Figure 2: Two example parses of the word "apportioned" from our two distinct grammar specifications, learned on the English test data. (a) PrStSu+SM; (b) Simple+SM.]

Sampling from an adaptor grammar is a non-deterministic process, so the same set of initial parameters applied to the same data can predict different segmentation outputs. Given this variability, we run the AG sampler three times for each of our two selected grammars, yielding 6 parses of the lexicon for each language. The number of grammar runs was heuristically selected and not tuned in any way, so adding more runs for each grammar might improve performance (for example, Sirts and Goldwater, 2013, use 5 samples per grammar). We then combine the resulting segmentations using the following procedure. [Footnote (fragment): "...performance on this task, we did not explore this option."]

3.3 Adfix direction

The first step is to determine the adfix direction for each language, i.e. whether the language uses predominantly prefixing or suffixing inflection. We heuristically select the adfix direction using the following automatic procedure.

First, we count the frequency of each peripheral segment across all 6 parses of the lexicon. A peripheral segment is a substring at the start or end of a word, which has been parsed as a segment above the SubMorph level in some AG sample. For instance, in the parse shown in Figure 2a, "app-" would be the initial peripheral segment, and "-ed" would be the final peripheral segment. By contrast, for the parse shown in Figure 2b, "ap-" would be the initial peripheral segment, and "-ioned" would be the final peripheral segment.

Next, we rank the segmented adfixes by their frequency, and select the top N for consideration, where N is some heuristically chosen quantity. In light of the generally Zipfian properties of linguistic distributions, we chose to scale N logarithmically with the vocabulary size, so N = log(|V|).

Finally, we select the majority label (i.e. "prefix" or "suffix") of the N most frequent segments as the adfix direction. This simple approach has obvious limitations — to name just one, it neglects the reality of nonconcatenative morphology, such as the root-and-pattern inflection of many Maltese verbs. Nonetheless, it appears to capture some key distinctions: this method correctly identified Navajo as a prefixing language, and all other development and test languages as predominantly suffixing. (A short code sketch of this heuristic follows.)
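The following is a minimal sketch of this heuristic, under the simplifying assumption that every parse is available as a word-to-segments mapping and that any non-stem segment at a word edge counts as peripheral. It is an illustration rather than the authors' code.

```python
import math
from collections import Counter

def infer_adfix_direction(parses, vocab_size):
    """Majority vote over the top-N most frequent peripheral segments.

    `parses` is a list of dicts mapping each word to its list of segments
    (e.g. ["app", "ort", "ion", "ed"]); only multi-segment parses contribute.
    """
    counts = Counter()
    for parse in parses:                       # e.g. the 6 AG samples
        for word, segments in parse.items():
            if len(segments) < 2:
                continue
            counts[("prefix", segments[0])] += 1   # initial peripheral segment
            counts[("suffix", segments[-1])] += 1  # final peripheral segment

    n = max(1, round(math.log(max(vocab_size, 2))))   # N = log(|V|)
    labels = [side for (side, _), _ in counts.most_common(n)]
    if not labels:
        return "suffix"                        # arbitrary default for the sketch
    return max(("prefix", "suffix"), key=labels.count)
```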
3.4 Creating paradigm clusters

Once we have inferred the adfix direction for a language, we use a greedy iterative procedure over words to identify and score potential clusters. Our scoring metric is frequency-based, motivated by the observation that inflectional morphology (such as the "-s" in "actionables") tends to be more frequent across word types relative to derivational morphology (such as the "-able" in "actionables"). Yarowsky and Wicentowski (2000) have demonstrated the value of frequency metrics in aligning inflected forms from the same lemma.

We start with no assigned clusters and iterate through the vocabulary in alphabetical order. (The method is relatively insensitive to order, except reversed alphabetical order, which is worse for most languages.) For each word w which has not yet been assigned to a cluster, we identify the most likely cluster using the following procedure; a short code sketch of the scoring and filtering steps follows this subsection.

Find possible stems Identify each possible stem s from all of the segmentations for w, where the "stem" comprises the entire substring up to a peripheral adfix. For example, based on the two parses shown in Figure 2, "apportion" and "apport" would constitute possible stems for the word "apportioned". The word w in its entirety is also considered as a possible stem.

Find possible cluster members For each stem s, identify other words in the corpus which might share that stem, forming a potential cluster c_s. A word potentially shares a stem if it shares the same substring from the non-adfixing direction — so a stem is a shared prefix substring in a suffixing language like English, and vice-versa for a prefixing language like Navajo. For each word w_i that is identified this way, the rest of the string outside of the possible stem s is a possible adfix a_i. In the example from Figure 2, if "apportions" were also in the corpus, it would be added to the cluster for the stem "apportion", with "-s" as the adfix a_i. Similarly, it would also be considered in the cluster for the stem "apport", with adfix "-ions".

Score cluster members For each word w_i in c_s, calculate a score x_i:

x_i = √(A_i^w) · log(A_i)    (1)

where A_i is the normalized overall frequency of the ith adfix a_i (suffix or prefix) per 10,000 types in the corpus of 6 segmentations, and A_i^w is the proportion of segmentations of the ith word w_i which contain the adfix a_i. For example, if "apportioned" were in consideration for a hypothetical cluster based on the stem "apportion", A_i would be the normalized corpus frequency of "-ed", and A_i^w would be .5 (assuming only the two segmentations shown in Figure 2). For a cluster with the stem "apport", A_i would be the normalized frequency of "-ioned", and A_i^w would still be .5.

Intuitively, when evaluating a single word, Eq. 1 assumes that adfixes which appear frequently in the segmented corpus overall are more likely to be inflectional, so words with more frequent adfixes are more likely paradigm members (the log(A_i) term). For instance, the high frequency of the "-s" suffix in English will increase the score of any word with an "-s" suffix in its segmentation (e.g. "apportion-s"). Eq. 1 also assumes that, for all segmentations of this particular word w_i, adfixes which appear in a higher proportion of segmentations are more reliable (the √(A_i^w) term), so the more times some AG samples the "apportion-s" segmentation, the higher the score for "apportions" membership in the "apportion"-stem paradigm. The square root transform was selected based on development set performance, and has not been tuned extensively.

Filter and score clusters For each possible stem cluster c_s, filter out words whose score x_i is below the score threshold hyperparameter t, to create a new cluster c'_s. Calculate the cluster score x_s by taking the average of x_i for only those words in c'_s, i.e. only words with score x_i ≥ t. The value for t is selected via grid search on the development set. We found that setting t = 2 maximized F1 across the development languages as a whole.

Select cluster Select the potential cluster c'_s with the highest score, and assign w to that cluster, along with each word w_i in c'_s.
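The following is a small sketch of the scoring and filtering steps, directly implementing Eq. 1. Variable names are illustrative, and normalization details and tie handling are simplified.

```python
import math

def eq1_score(adfix_word_proportion, adfix_corpus_freq):
    """Eq. 1: x_i = sqrt(A_i^w) * log(A_i).

    adfix_word_proportion: share of this word's segmentations containing the adfix.
    adfix_corpus_freq: adfix frequency per 10,000 types over the 6 segmentations.
    """
    return math.sqrt(adfix_word_proportion) * math.log(adfix_corpus_freq)

def filter_and_score_cluster(member_scores, t=2.0):
    """Drop members scoring below the threshold t and return the pruned cluster
    together with its mean member score (the cluster score x_s)."""
    kept = {word: x for word, x in member_scores.items() if x >= t}
    if not kept:
        return {}, float("-inf")
    return kept, sum(kept.values()) / len(kept)
```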
Language    Precision  Recall  F1
Maltese     .30        .30     .30
Persian     .54        .52     .53
Portuguese  .92        .91     .91
Russian     .83        .82     .82
Swedish     .85        .81     .83
Mean        .69        .67     .68

Table 1: Performance on development languages.

Language    Precision  Recall  F1
Basque      .33        .32     .32
Bulgarian   .83        .80     .82
English     .91        .90     .90
Finnish     .61        .60     .60
German      .79        .79     .79
Kannada     .82        .59     .69
Navajo      .43        .42     .42
Spanish     .85        .82     .84
Turkish     .74        .73     .73
Mean        .70        .66     .68

Table 2: Performance on test languages.

4 Results and Discussion

Performance was evaluated using the script provided by the shared task organizers. Table 1 shows the results for the development languages, and Table 2 shows the results for the test languages. While the average F1 score ends up being quite similar for both development and test languages, it's clear within both groups that there are large differences in performance across different languages.

4.1 Error analysis and ways to improve

Noncontiguous stems The clustering method described in Section 3.4 makes an unjustifiably strong assumption that stems are contiguous substrings, which effectively eliminates its ability to represent nonconcatenative morphology. This limitation contributes to the low score on Maltese, a Semitic language which includes root-and-pattern morphology for certain verbs. The model further assumes that the left or right edge of a word — the side opposite from the adfix direction — is contiguous with the stem. This leads to errors on German, as most verbs have a circumfixing past participle form "ge- + -t" or "ge- + -en". For example, the model correctly assigns "ändern", "änderten", and "ändert" to the same cluster, but incorrectly assigns "geändert" to a separate cluster. We estimate that roughly 30% of the model's incorrect German predictions stem from this issue. This limitation also contributed to our model's poor performance on Basque, which, like Maltese, uses both prefixing and suffixing inflection to express polypersonal agreement. (We thank an anonymous reviewer for bringing this to our attention.)

One obvious way to improve this issue would be to use an extension of the AG framework which can represent nonconcatenative morphology. Botha and Blunsom (2013) present such an extension, replacing the PCFG with a Probabilistic Simple Range Concatenating Grammar. They report successful results for unsupervised segmentation on Hebrew and Arabic. On the other hand, it's unclear whether such a nonconcatenative-focused approach could also adequately represent concatenative morphology. Fullwood and O'Donnell (2013) explore a similar framework, using Pitman-Yor processes to sample separate lexica of roots, templates, and "residue" segments; they find that their model works well for Arabic, but much less well for English. In addition, Eskander et al. (2020) report state-of-the-art morphological segmentation for Arabic using the PrStSu+SM grammar which we also use here. Their findings suggest that, rather than changing the AG framework, we might attempt a more intelligent clustering method based on noncontiguous segmented subsequences rather than contiguous substrings.

Irregular morphology The strong assumption of contiguous substrings as stems also hinders accurate clustering of irregular forms of any kind, from predictable stem alternations (such as umlaut in German and Swedish, or theme vowels in Portuguese and Spanish) to more challenging suppletive forms such as English "go"-"went". The latter likely requires additional input from semantic representations, but semiregular alternations in forms could also be handled in principle by a more intelligent clustering process. On this point, we note that some small but significant fraction of AG parses of Portuguese verbs grouped verbal theme vowels and inflections together (e.g. parsing "apresentada" as "apresent-ada" rather than "apresenta-da", "apresentarem" as "apresent-arem" rather than "apresenta-rem", and so on), and these parses were crucial to our model's relatively high performance on Portuguese.
Derivation vs. inflection Another issue is that the parses sampled by AGs do not distinguish between inflectional and derivational morphology. This is apparent in Figure 2, where both grammars parse “apportioned” with “-ioned” as the suffix. We seek to address this issue with frequency-based metrics in our clustering method, but frequent derivational adfixes often score high enough to be assigned a wrong paradigm cluster. For example, in English our model correctly clusters “allow”, “allows”, and “allowed” together, but it also incorrectly assigns “allowance” to the same cluster.

A straightforward way to handle this within our existing approach would be to allow language-specific variation of the score threshold t. As we had no method for unsupervised estimation of t for unfamiliar languages, we did not pursue this; however, a researcher who had minimal familiarity with the language in question might be able to select a more sensible value for t based on inspecting the clusters. Beyond that, the distinction between inflectional and derivational morphology is an intriguing and contested issue within linguistics (e.g. Stump, 2005), and the question of how to model it computationally requires much more attention.

4.2 Things that didn't work

We attempted a number of unsupervised approaches beyond AG segmentations, with the goal of incorporating them during the clustering process; however, we could not consistently improve performance with any of them. It seems likely to us that these methods could still be used to improve AG-segmentation-based clusters, but we could not find immediately obvious ways to do this.

FastText As the AG framework only models word structure based on form, we hoped to use the distributional representations learned by FastText (Bojanowski et al., 2017) to incorporate semantic and syntactic information into our model's clusters. We tried several different approaches without success. 1) We trained a skipgram model with a context window of 5 words, a setting often used for semantic applications, in hopes that words from the same paradigm might have similar semantic representations. Agglomerative clustering on these representations alone yielded much worse clusters than the AG method, and we could not find a way to combine them successfully with the AG clusters. 2) Erdmann et al. (2020) trained a skipgram model with a context window of 1 word and a minimum subword length of 2 characters, and used it to cluster words from the same cell rather than the same paradigm (e.g. clustering together English verbs in the third person singular such as “walks” and “jumps”). We attempted to follow this procedure, but it proved too difficult, as paradigm cell information was not explicitly included in the development data for this shared task. 3) We used the method described by Bojanowski et al. (2017) to identify important subwords within a word, in hopes of combining them with AG segmentations. However, the identified subwords did not consistently align with stem-adfix segmentations as we had hoped, and did not seem to provide any additional benefit.
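To make the first of these attempts concrete, a minimal sketch is given below (not the authors' code), assuming gensim's FastText implementation and scikit-learn's agglomerative clustering; only the 5-word context window comes from the description above, and the toy corpus, vector size, and cluster count are illustrative.

# Sketch of attempt 1): FastText skipgram vectors with a 5-word context window,
# followed by agglomerative clustering of the word vectors alone.
# Only window=5 comes from the text; other settings are illustrative.
from gensim.models import FastText
from sklearn.cluster import AgglomerativeClustering

sentences = [["the", "cat", "meowed"], ["the", "cats", "are", "meowing"]]  # tokenized corpus

model = FastText(sentences, sg=1, window=5, vector_size=100, min_count=1)
words = list(model.wv.index_to_key)
vectors = model.wv[words]

labels = AgglomerativeClustering(n_clusters=3).fit_predict(vectors)
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)
print(clusters)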
Brown clustering Part of speech tags could provide latent structure as a higher-order grouping for paradigm clusters — for example, verbs would be expected to have paradigms more similar to other verbs than to nouns. Brown clusters (Brown et al., 1992) have been used for unsupervised induction of word classes approximating part of speech tags. We used a spectral clustering algorithm (Stratos et al., 2014) to learn Brown clusters, but they did not reliably correspond to part of speech categories on our development language data.

5 Conclusion

The Adaptor Grammar framework has previously been applied to unsupervised morphological segmentation. In this paper, we demonstrate that AG segmentations can be used for the related task of unsupervised paradigm clustering with successful results, as shown by our system's performance in the 2021 SIGMORPHON Shared Task.

We note that there is still considerable room for improvement in our clustering procedure. Two key directions for future development are more sophisticated treatment of nonconcatenative morphology, and incorporation of additional sources of information beyond the word form alone.

Acknowledgments

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and a James S McDonnell Foundation Scholar Award (#220020374) to the second author.
References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jan A. Botha and Phil Blunsom. 2013. Adaptor grammars for learning non-concatenative morphology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 345–356.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 Shared Task—Morphological Reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The Paradigm Discovery Problem. arXiv:2005.01630 [cs].

Ramy Eskander, Francesca Callejas, Elizabeth Nichols, Judith Klavans, and Smaranda Muresan. 2020. MorphAGram, Evaluation and Framework for Unsupervised Morphological Segmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7112–7122, Marseille, France. European Language Resources Association.

Ramy Eskander, Owen Rambow, and Tianchun Yang. 2016. Extending the Use of Adaptor Grammars for Unsupervised Morphological Segmentation of Unseen Languages. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 900–910, Osaka, Japan. The COLING 2016 Organizing Committee.

Michelle Fullwood and Tim O'Donnell. 2013. Learning non-concatenative morphology. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), pages 21–27, Sofia, Bulgaria. Association for Computational Linguistics.

Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, and Katharina Kann. 2020. Unsupervised Morphological Paradigm Completion. arXiv:2005.00970 [cs].

Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 317–325, Boulder, Colorado. Association for Computational Linguistics.

Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007a. Bayesian Inference for PCFGs via Markov Chain Monte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139–146, Rochester, New York. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007b. Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Bernhard Schölkopf, John Platt, and Thomas Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Jim Pitman and Marc Yor. 1997. The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. The Annals of Probability, 25(2):855–900.
Kairit Sirts and Sharon Goldwater. 2013. Minimally-Supervised Morphological Segmentation using Adaptor Grammars. Transactions of the Association for Computational Linguistics, 1:255–266.

Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu. 2014. A spectral algorithm for learning class-based n-gram models of natural language. In Proceedings of the Association for Uncertainty in Artificial Intelligence.

Gregory T. Stump. 2005. Word-formation and inflectional morphology. In Handbook of Word-Formation, volume 64, pages 49–71. Springer, Dordrecht, The Netherlands.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online. Association for Computational Linguistics.

Adam Wiemerslage, Arya McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, and Katharina Kann. 2021. The SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally Supervised Morphological Analysis by Multimodal Alignment. pages 207–216.

A PCFGs

Our system uses the following two grammar specifications, developed by Eskander et al. (2016, 2020). Nonterminals are adapted by default. Non-adapted nonterminals are preceded by 1, indicating an expansion probability of 1, i.e. the PCFG always expands this rule and never caches it.

A.1 Simple+SM

1 Word --> Prefix Stem Suffix
Prefix --> ˆˆˆ SubMorphs
Prefix --> ˆˆˆ
Stem --> SubMorphs
Suffix --> SubMorphs $$$
Suffix --> $$$
1 SubMorphs --> SubMorph SubMorphs
1 SubMorphs --> SubMorph
SubMorph --> Chars
1 Chars --> Char
1 Chars --> Char Chars

A.2 PrStSu+SM

1 Word --> Prefix Stem Suffix
Prefix --> ˆˆˆ
Prefix --> ˆˆˆ PrefMorphs
1 PrefMorphs --> PrefMorph PrefMorphs
1 PrefMorphs --> PrefMorph
PrefMorph --> SubMorphs
Stem --> SubMorphs
Suffix --> $$$
Suffix --> SuffMorphs $$$
1 SuffMorphs --> SuffMorph SuffMorphs
1 SuffMorphs --> SuffMorph
SuffMorph --> SubMorphs
1 SubMorphs --> SubMorph SubMorphs
1 SubMorphs --> SubMorph
SubMorph --> Chars
1 Chars --> Char
1 Chars --> Char Chars
Orthographic vs. Semantic Representations for Unsupervised
Morphological Paradigm Clustering
that the words share a common core meaning. They also - usually - show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Intuitively, we would expect the cluster of forms for walk to be recognizable largely based on orthographic similarity. The partially irregular cluster for bring shows greater orthographical variability in the past-tense form brought and so might be expected to require information beyond orthographic similarity.

System Overview. The core of our approach is to cluster unlabelled surface word forms using KMeans clustering; a complete architecture diagram can be seen in Figure 1. After reading the input file for a particular language to identify the lexicon and alphabet, we transform each word into two different types of vector representations. To capture semantic information, we train Word2Vec embeddings from the input data. The orthography-based representations we learn are character embeddings, again trained from the input data. Details for both representations appear in section 4.1. For the experiments in this paper, we test each type of representation separately, using randomly initialized centers for the clustering. In later work, we plan to explore the integration of both types of representations. We would also like to explore the use of pre-defined centers for clustering. These pre-defined centers could be provided using either a longest common subsequence method or a graph-based algorithm such as that described in section 4.3. The final output of the system is a set of clusters, each one representing a morphological paradigm.

2 Previous Work

The SIGMORPHON 2020 shared task set included an open problem calling for unsupervised systems to complete morphological paradigms. For the 2020 task, participants were provided with the set of lemmas available for each language (Kann, 2020). In contrast, the 2021 SIGMORPHON task 2 outlines that submissions are unsupervised systems that cluster input tokens into the appropriate morphological paradigm (Nicolai et al., 2020). Given the novelty of the task, there is a lack of previous work done to cluster morphological paradigms in an unsupervised manner. However, we have identified key methods from previous work in computational morphology and unsupervised learning that could be combined to approach this problem.

Previous work has identified the benefit of combining rules based on linguistic characteristics with machine learning techniques. Erdmann et al. (2020) established a baseline for the Paradigm Discovery Problem that clusters the unannotated sentences first by a combination of string similarity and lexical semantics and then uses this clustering as input for a neural transducer. Erdmann and Habash (2018) investigated the benefits of different similarity models as they apply to Arabic dialects. Their findings demonstrated that Word2Vec embeddings significantly underperformed in comparison to the Levenshtein distance baseline. The highest performing representation was a combination of FastText and a de-lexicalized morphological analyzer. The FastText embeddings (Bojanowski et al., 2016) have the benefit of including sub-word information by representing words as character n-grams. The de-lexicalized analyzer relies on linguistic expert knowledge of Arabic to identify the morphological closeness of two words. In the context of the paper, it is used to prune out word relations that do not conform to Arabic morphological rules. The approach mentioned greatly benefits from the use of a morphological analyzer, something that is not readily available for low-resource languages. Soricut and Och (2015) focused on the use of morphological transformations as the basis for word representations. Their representation can be quite accurate for affix-based morphology.

Our representations are based entirely off of unlabelled data and do not require linguistic experts to provide morphological transformation rules for the language. Additionally, we hoped to create a system that would be robust for languages that include non-affix based morphology. In this work we compare Word2Vec representations to character-based representations to represent orthography. We have not yet evaluated additional representations or combinations of the two.

3 Task overview

The 2021 SIGMORPHON Shared Task 2 created a call for unsupervised systems that would create morphological paradigm clusters. This was intended to build upon the shared task from 2020 that focused on morphological paradigm completion. Participants were provided with tokenized Bible data from the JHU bible corpus (McCarthy
et al., 2020) and gold standard paradigm clusters for five development languages: Maltese, Persian, Portuguese, Russian and Swedish. Teams could use this data to train their systems and evaluate against the gold standard files as well as a baseline. The baseline provided groups together words that share a substring of length n and then removes any duplicate clusters. The resulting systems were then used to cluster tokenized data from a set of test languages including: Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.

4 System Architecture

The overall architecture of our system includes several distinct pieces as demonstrated in Figure 1. For a given language, we read the corpus text provided and generate a lexicon of unique words. The lexicon is then fed to an embedding layer and an optional lemma identification layer. The embedding layer generates a vector representation of each word based on either a character level embedding or a Word2Vec embedding. When used, the lemma identification layer generates a set of predefined lemmas from the lexicon based on either the standard longest common substring or a connected graph formed from the longest common substring. Result word embeddings along with the optional set of predefined lemmas are used as input to a KMeans clustering algorithm. In the event predefined lemmas are not provided, the system defaults to using a randomly initialized set of centroids. Otherwise, the initial centroids for the clusters are the result of finding the appropriate word embedding for the lemmas identified. Once a cluster has been created, the output cluster predictions are formatted into a paradigm dictionary which can be written to a file for evaluation.

Figure 1: Overall Statistical Clustering Architecture diagram. There are two possible word embedding algorithms represented in the diagram (left side of split). The optional lemma identification layer also includes two possible methods (right side).

4.1 Word Representations

We create two different types of word representations, aiming to capture information that may reflect the relatedness of words within a paradigm.

Character Based Embeddings. To capture orthographic information, we generate a character-based word embedding for the language. For each language we do the following:

1. Generate a lexicon of all the words in the development corpus and an alphabet of unique characters in the language.
2. Identify the maximum word length of the lexicon.
3. Create a dictionary of the alphabet where each character corresponds to a float value between 0 (non-inclusive) and 1 (inclusive).
4. For each word:
   (a) Initialize an array of zeros the same size as the maximum length word.
   (b) Map each character in the word, in order, to its respective float value based on our alphabet dictionary. Leave the remaining values as zero.

This representation focuses purely on the characters of the language. For the time being, it does not take into account the relationship between orthographic characters in any of the languages but future work could attempt to create smarter numerical representations based on these relationships.
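A minimal sketch of these steps is shown below (not the authors' released code); assigning evenly spaced float values to the sorted alphabet is one plausible reading of step 3, since the description only requires a value per character in (0, 1].

# Sketch of the character-based word representation described above.
# The char-to-float assignment (evenly spaced values in (0, 1]) is one plausible choice.
import numpy as np

def character_embeddings(lexicon):
    alphabet = sorted({ch for word in lexicon for ch in word})
    max_len = max(len(word) for word in lexicon)
    # step 3: each character maps to a float in (0, 1]
    char_to_float = {ch: (i + 1) / len(alphabet) for i, ch in enumerate(alphabet)}
    vectors = {}
    for word in lexicon:
        vec = np.zeros(max_len)              # step 4a: zero vector of max word length
        for pos, ch in enumerate(word):      # step 4b: fill positions with character values
            vec[pos] = char_to_float[ch]
        vectors[word] = vec
    return vectors

vectors = character_embeddings(["walk", "walks", "walked", "bring", "brought"])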
Word Embeddings with Word2Vec. To incorporate semantic and syntactic information, we use the Word2Vec embeddings. Specifically, we train a Word2Vec model for each language with the Gensim skip-gram representations (Řehůřek and Sojka, 2010).
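A minimal sketch of this training step, assuming Gensim's Word2Vec API; the vector size and other hyperparameters are illustrative, as the text only specifies the skip-gram setting.

# Sketch of the Word2Vec training step with Gensim's skip-gram model (sg=1).
from gensim.models import Word2Vec

tokenized_sentences = [["in", "the", "beginning"], ["the", "cats", "are", "meowing"]]  # tokenized corpus

w2v = Word2Vec(tokenized_sentences, sg=1, vector_size=100, min_count=1)
semantic_vectors = {word: w2v.wv[word] for word in w2v.wv.index_to_key}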
4.2 (Optional) Lemma Identification

LCS Graph Formation One of the challenges of using clustering-based methods on this problem is determining the number of morphological paradigms expected to be present and then finding suitable lemmas for each to serve as centers for clustering. One potential approach to find lemmas is to first arrange the words into a network graph based on the longest common substring relationships between them. Specifically, for each attested word W in a language's data, the longest common substring (LCS) is calculated between W and every other attested word in the language. Graph edges are then constructed between W and the word (or words if there are multiple with the same length LCS) that have the longest LCS with W. This process is repeated for every word in the given language's corpus. This results in a large graph that appears to capture many of the morphological dependencies within the language.

Next, we split the graph into highly connected subgraphs (HCSs). HCS are defined as graphs in which the number of edges that would need to be cut to split the graph into two disconnected subgraphs is greater than one half of the number of nodes. This is helpful because in the LCS graphs generated, morphologically related forms tend to be connected relatively densely to each other and only weakly connect to forms from other paradigms. Additionally, the use of a threshold based algorithm like HCS, unlike other clustering methods, would allow lemmas to be extracted without having to prespecify the expected number of lemmas beforehand. Unfortunately, during testing the HCS graph analysis proved computationally taxing and was unable to be completed in time for evaluation, though qualitative analysis of the generated LCS graphs suggests the technique may still be useful with better computational power. We will explore this method further in future work.
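A sketch of the graph construction, assuming networkx for the graph and a standard dynamic-programming routine for the longest common substring; the HCS split itself is omitted here, since the text reports it was too computationally taxing to complete in time.

# Sketch of LCS graph formation: connect each word to the word(s) sharing its longest
# common substring. Quadratic in vocabulary size, which is part of the cost noted above.
from itertools import product
import networkx as nx

def longest_common_substring(a, b):
    best = 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, j in product(range(len(a)), range(len(b))):
        if a[i] == b[j]:
            table[i + 1][j + 1] = table[i][j] + 1
            best = max(best, table[i + 1][j + 1])
    return best

def lcs_graph(lexicon):
    graph = nx.Graph()
    graph.add_nodes_from(lexicon)
    for w in lexicon:
        scores = {v: longest_common_substring(w, v) for v in lexicon if v != w}
        best = max(scores.values())
        # connect w to every word sharing the longest LCS with it
        graph.add_edges_from((w, v) for v, s in scores.items() if s == best)
    return graph

g = lcs_graph(["walk", "walks", "walked", "bring", "brings"])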
4.3 Clustering

The word representations described in section 4.1 are used as input to a clustering algorithm. We use the KMeans algorithm as defined by the sklearn implementation (Pedregosa et al., 2011). The KMeans approach is one of the pioneering algorithms in unsupervised learning (MacQueen et al., 1967). Input values are grouped by continuously shifting clusters and their centers while attempting to minimize the variance of each cluster. This indicates that the cluster that a particular word is assigned to should be as close (as defined by Euclidean distance) to the cluster's center, or the lemma word, as possible.
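A minimal sketch of this step, assuming scikit-learn's KMeans; the toy vectors and cluster count are only for illustration (the submitted runs used hundreds to thousands of clusters), and passing an array of lemma embeddings via init would give the pre-defined centers discussed below.

# Sketch of KMeans clustering over word vectors; each cluster is read off as a paradigm.
import numpy as np
from sklearn.cluster import KMeans

word_vectors = {"walk": [0.1, 0.2], "walks": [0.1, 0.25], "bring": [0.9, 0.8], "brings": [0.9, 0.85]}
words = list(word_vectors)
X = np.array([word_vectors[w] for w in words])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
paradigms = {}
for word, label in zip(words, kmeans.labels_):
    paradigms.setdefault(int(label), []).append(word)
print(paradigms)  # words grouped with the center closest in Euclidean distance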
Clustering with Randomly Initiated Centers. For comparison, we evaluate the effectiveness of using randomly initialized centers for our clusters. In the context of this task, this means that the first set of centers fed to the algorithm do not necessarily correspond to any valid word in the given language, or perhaps any language. Another obstacle for this approach in an unsupervised setting is defining the number of clusters to use. Identifying this requires human interference with hyper-parameters that are not going to be cross-linguistically relevant. The size of the input bible corpus and the inflectional morphology of the language both directly impact the number of clusters, or the number of lemmas, that are relevant. We used a range of cluster sizes for the development languages from 100 to 6000 to evaluate which ones provided the highest accuracy. For the test languages, we chose to submit results for clusters of size 500, 1000, 1500, and 1900 to assess performance variability based on number of lemmas.

Extension: Initializing with Non-Random Centers. The use of non-random centers would have multiple benefits in the context of this task. This approach would incorporate linguistic information
to inform the initial set of centers. This could lead to quicker convergence of a model due to more intelligently picked centers. It could also prevent the model from being skewed towards less than ideal center values. Additionally, with pre-defined centers we can remove the need to arbitrarily define the number of clusters.

In the scope of this task, we were unable to experiment with pre-defined center values but we have proposed two potential methods for doing so: using longest common substrings and picking highly connected nodes from an LCS graph formation. The longest common substring approach would mimic the lemma identification approach described above (4.2). Both of these systems are represented as an optional lemma identification layer on the right hand side of Figure 1. The output of each one would be a set of words to use as centers. Each word would be converted to the appropriate word representation and then fed as an input to the KMeans clustering.

Language    BL    KMW2V  KMCE
Maltese     0.29  0.19   0.25
Persian     0.30  0.18   0.36
Portuguese  0.34  0.06   0.24
Russian     0.36  0.11   0.34
Swedish     0.44  0.18   0.45

Table 2: F1 Scores for each of the model types on all development languages. The best F1 scores are in bold. BL is Baseline, KMW2V is KMeans with Word2Vec embeddings, and KMCE is KMeans with Character Embeddings.

Language   BL    500   1000  1500  1900
Basque     0.21  0.29  0.31  0.27  0.29
Bulgarian  0.39  0.21  0.27  0.29  0.30
English    0.52  0.29  0.37  0.43  0.45
Finnish    0.29  0.19  0.24  0.26  0.27
German     0.38  0.26  0.33  0.38  0.40
Kannada    0.24  0.29  0.30  0.30  0.29
Navajo     0.33  0.33  0.38  0.39  0.41
Spanish    0.39  0.24  0.29  0.30  0.31
Turkish    0.25  0.16  0.20  0.22  0.22

Table 3: F1 Scores for the baseline (BL) and the KMCE models on the test languages. The best F1 scores are in bold. Test languages were evaluated on KMCE models with clusters of size 500, 1000, 1500, and 1900.

5 Results

Table 2 shows results to date. We compare the two representation methods on the development languages. The KMeans clusterings for the development languages were generated based on optimal cluster values starting with size 100 and increasing to a cluster size of 6000, or until the accuracy no longer improved from an increase in cluster size. For the Word2Vec embeddings we used clusterings of size 110 for Maltese, 130 for Persian, 1490 for Portuguese, 1490 for Russian, and 1490 for Swedish. With the character embeddings, we had 540 clusters for Maltese, 110 clusters for Persian, 2200 clusters for Portuguese, 4000 clusters for Russian, and 5400 clusters for Swedish. The F1 scores provided are based on comparing the appropriate model's predictions to the gold paradigms for this task using the evaluation function defined in the SIGMORPHON 2021 Task 2 github repository. The KMCE models clearly and consistently outperform the KMW2V models, for all development languages.

For test languages, we run clustering only with the better-performing character-based representations. The performance on test languages was evaluated with clusters of size 500, 1000, 1500, and 1900. These results are in Table 3. We found that our algorithm outperformed the baseline for Basque, German, Kannada, and Navajo. For both Basque and Kannada, the largest clustering did not have the highest result, suggesting that the corpora provided for these languages contain a smaller number of morphological paradigms. In the case of Bulgarian, English, Spanish, and Finnish, we note that the KMCE model performance increases with each increase in cluster size. This suggests that the model accuracy would continue increasing if we ran the model for these languages with a higher number of clusters. Additional discussion of the error analysis appears in section 6, and of the results in section 7.

6 Error Analysis

We have evaluated the results from the Word2Vec representations and our character-based embedding and compared them to the gold standard paradigms provided by the task organizers. We have found that, overall, the character-based version is more robust on regular verb forms than the Word2Vec version, and that neither is effective on irregular forms. Additionally, we explore some of the nuanced errors with the character based embeddings and how they could be addressed for future work.

[...] handle irregular verb forms.
6.5 Word Length

Another type of cluster error has to do with word length. The word representation vectors were sized based on the largest word present in a given language's corpus. If a word is under the maximum length, the remaining vector gets filled in with zeros. This means that words that are similar in length are more likely to be paired together for a cluster. The gold data created the morphological paradigm {crowd, crowds, crowding}, while ours created two separate clusterings: {crowd, crowds} and {brawling, proposal, crowding}. This is also present in the clustering of certain words in Navajo. Our algorithm grouped nizaad and bizaad together, but some of the longer forms in this paradigm were excluded, such as danihizaad and nihizaad. In future work, we would attempt to mitigate this by using subword distances or cosine similarity as the basis for distance metrics in a clustering algorithm. This could prevent inaccurate groupings due to large affix lengths.

7 Discussion, Conclusions, Future Work

Overall, these results demonstrate an improvement over the baseline in several languages, namely Persian, Swedish, Basque, German, Kannada, and Navajo, when using KMeans clustering over character embeddings. This suggests that embedding-based clustering systems merit further exploration as a potential approach to unsupervised problems in morphology. The fact that the character embedding system outperformed the W2V one and the fact that performance was strongest on words with regular inflectional paradigms suggests that this approach might be best suited to synthetic and agglutinating languages in which morphology is encoded fairly simply within the orthography of the word. Languages that rely heavily on more complex morphological processes, particularly non-concatenative morphology, would likely require an extension of this system that integrates more sources of non-orthographic information, or a different approach altogether.

One obvious avenue for building on this research is to find more efficient and more effective methods for the initial process of lemma identification. Developing a set of lemmas would allow a pre-defined set of centers to be fed into the clustering algorithm rather than using randomly defined centers, which would likely improve performance. This could be done by leveraging an initial rule based analysis or through the threshold-based graph clustering technique discussed above. Other potential variations on that approach, once the problem of computational limits has been solved, include using longest common sequences rather than longest common substrings, and weighting graph edges by the length of the LCS between the two words. The former would potentially help accommodate forms of non-concatenative morphology, while the latter would potentially include more information about morphological relationships than an unweighted graph does. Future research should also explore how other sources of linguistic information could be leveraged for this task. This could include other forms of semantic information outside of the context-based semantics used by W2V, as well as things like the orthographic-phonetic correspondences in a given language.

Finally, we would like to explore filtering of the output clusters according to language-specific properties in order to improve the overall results. This would involve adding additional layers to our system architecture that take place after a distance-based clustering. One such layer could prune unlikely clusters based off of morphological transformations, such as the method used by Soricut and Och (2015). Future unsupervised systems for clustering morphological paradigms should consider the benefits of hierarchical models that leverage different algorithm types to gain the most information possible.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Alexander Erdmann and Nizar Habash. 2018. Complementary strategies for low resourced morphological modeling. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 54–65, Brussels, Belgium. Association for Computational Linguistics.

Katharina Kann. 2020. Acquisition of inflectional morphology in artificial neural networks with prior knowledge. In Proceedings of the Society for Computation in Linguistics 2020, pages 144–154, New York, New York. Association for Computational Linguistics.
J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297. University of California Press.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Garrett Nicolai, Kyle Gorman, and Ryan Cotterell, editors. 2020. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, Online.
Unsupervised Paradigm Clustering Using Transformation Rules
Data: "the cat meowed", "the cats are meowing". Paradigms: {cat, cats}, {meowed, meowing}, {the}, {are}.

Figure 1: The unsupervised paradigm clustering task.

demonstrate that prefix and suffix rules deliver stronger performance for most languages in the shared task dataset but our more general transformation rules are beneficial for templatic languages like Maltese and languages with a high morpheme-to-word ratio like Basque.

2 Related Work

The unsupervised paradigm clustering task is closely related to the 2020 SIGMORPHON shared task on unsupervised morphological paradigm completion (Kann et al., 2020). However, paradigm clustering systems do not infer missing forms in paradigms. Our system resembles the baseline system for the paradigm completion task (Jin et al., 2020) which also extracts transformation rules, however, in the form of edit trees (Chrupala et al., 2008).

Several approaches to unsupervised or minimally supervised morphology learning, which share characteristics with our system, have been proposed. Our rules are essentially identical to the FST rules used by Beemer et al. (2020) for the task of supervised morphological inflection. Likewise, Durrett and DeNero (2013) and Ahlberg et al. (2015) both extract inflectional rules after aligning forms from known paradigms. Yarowsky and Wicentowski (2000) also generate rules for morphological transformations but their system for minimally supervised morphological analysis requires additional information in the form of a list of morphemes as input.

Erdmann et al. (2020) present a task called the paradigm discovery problem which is quite similar to the unsupervised paradigm clustering task. In their formulation of the task, inflected forms are clustered into paradigms and corresponding forms in distinct paradigms (like all plural forms of English nouns) are clustered into cells. Their benchmark system is based on splitting every form into a (potentially discontinuous) base and exponent, where the base is the longest common subsequence of the forms in a paradigm and the exponent is the residual of the form. They then maximize the base in each paradigm while minimizing the exponents of individual forms.

3 Methods

This section describes how we extract rules from the dataset and apply them to paradigm clustering. We also describe methods for filtering out extraneous forms from generated paradigms.

3.1 Baseline

As a baseline, we use the character n-gram clustering method provided by the shared task organizers (Wiemerslage et al., 2021). Here all forms sharing a given substring of length n are clustered into a paradigm. Duplicate paradigms are removed. The hyperparameter n can be tuned on validation data if such data is available (we use n = 5 in all our experiments).
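A minimal reimplementation of that baseline as described (not the organizers' script) is sketched below; the example call uses a smaller n than the n = 5 used in the experiments so that the toy forms actually cluster.

# Sketch of the baseline: every substring of length n defines a cluster of all forms
# containing it, and duplicate clusters are removed.
def baseline_clusters(forms, n=5):
    clusters = {}
    for form in forms:
        for i in range(len(form) - n + 1):
            clusters.setdefault(form[i:i + n], set()).add(form)
    unique = {frozenset(c) for c in clusters.values()}   # remove duplicate clusters
    return [sorted(c) for c in unique]

print(baseline_clusters(["meowed", "meowing", "walked", "walking"], n=4))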
3.2 Transformation rules

Our approach builds on the baseline paradigms discovered in the previous step. We start by extracting transformation rules between all word forms in a single baseline paradigm. For each pair of strings like dog and dogs belonging to a paradigm, we generate a rule like ?+ 0:s which translates the first form into the second one. From a paradigm of size n, we can therefore extract n² − n rules—one for each ordered pair of distinct word forms. Preliminary experiments showed that large baseline paradigms tended to generate many incorrect rules which did not represent genuine morphological transformations. We, therefore, limited rule-discovery to paradigms spanning maximally 20 forms.

After generating transformation rules, we compute rule-frequency over all baseline paradigms and discard rare rules which are unlikely to represent genuine morphological transformations (the minimum threshold for rule frequency is a hyperparameter). The remaining rules are then applied iteratively to
our datasets to construct paradigms. We experiment with two rule types which are described below.

[Figure: the rule extraction pipeline — from the baseline paradigm {dog, dogs, hotdog, cat, cats}, extract rules with their frequencies (e.g. 2: ?+ 0:s, 2: ?+ s:0, 1: h:0 o:0 t:0 ?+, 1: 0:h 0:o 0:t ?+), discard rare rules, and rebuild paradigms.]

3.2.1 Prefix and Suffix Rules

Our first approach to rule-discovery is based on identifying a contiguous word stem shared by both forms. The stem is defined as the longest common substring of the forms. We split both forms into a prefix, stem and suffix. The morphological transformation is then defined as a joint substitution of a prefix and suffix. For example, given the German forms acker+n and ge+acker+t (German ‘to plow’), we would generate a rule:

0:g 0:e ?+ n:t

As mentioned above, these rules are extracted from paradigms generated by the baseline system.

We also experiment with a more restricted form of these rules in which only suffix transformations are allowed. While this limits the possible transformations, it will also result in fewer incorrect rules and may, therefore, deliver better performance for languages which are predominantly suffixing.
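A sketch of this rule extraction, using Python's difflib to find the longest common substring; the character-by-character pairing of unequal-length affixes is a simplifying assumption, and 0 stands for the empty string as in the rule above.

# Sketch of prefix-and-suffix rule extraction: the stem is the longest common substring,
# and the rule jointly rewrites the material on either side of the ?+ placeholder.
from difflib import SequenceMatcher
from itertools import zip_longest

def prefix_suffix_rule(src, tgt):
    m = SequenceMatcher(None, src, tgt).find_longest_match(0, len(src), 0, len(tgt))
    src_pre, src_suf = src[:m.a], src[m.a + m.size:]
    tgt_pre, tgt_suf = tgt[:m.b], tgt[m.b + m.size:]
    pre = [f"{a}:{b}" for a, b in zip_longest(src_pre, tgt_pre, fillvalue="0")]
    suf = [f"{a}:{b}" for a, b in zip_longest(src_suf, tgt_suf, fillvalue="0")]
    return " ".join(pre + ["?+"] + suf)

print(prefix_suffix_rule("ackern", "geackert"))  # -> 0:g 0:e ?+ n:t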
3.2.2 Discontinuous rules

Even though prefix and suffix transformations are adequate for representing morphological transformations in many languages, they fail to derive the appropriate generalizations for languages with templatic morphology like Maltese (which was included among the development languages). For example, it is impossible to identify a contiguous stem-like unit spanning more than a single character for the Maltese forms gidem ‘to bite’ and gdimt. We need a rule which can apply transformations inside the input string:

?+ i:0 ?+ e:i ?+ 0:t

Like prefix and suffix rules, discontinuous rules are generated from baseline paradigms. Unlike prefix and suffix rules, however, discontinuous rules require a character-level alignment between the input and output string. To this end, we start by generating a dataset consisting of all string pairs like (dog, dogs) and (hotdog, dog), where both strings belong to the same paradigm. We then apply a character-level aligner based on the iterative Markov chain Monte Carlo method to this dataset.[2] Using this method, we can jointly align all string pairs in the baseline paradigms. This is beneficial because the MCMC aligner will prefer common substitutions, deletions and insertions over rare ones,[3] which enforces consistency of the alignment over the entire dataset. This in turn can help us find linguistically motivated transformation rules.

[2] This aligner was initially used for the baseline system in the 2016 iteration of the SIGMORPHON shared task (Cotterell et al., 2016).
[3] This is a consequence of the fact that the algorithm iteratively maximizes the likelihood of the alignment for each example given all other examples in the dataset.

Character-level alignment results in pairs:

INPUT:  d o g 0
OUTPUT: d o g s

INPUT:  h o t d o g
OUTPUT: 0 0 0 d o g

Each symbol pair in the alignment represents one of the following types: (1) an identity pair x:x, (2) an insertion 0:x, (3) a deletion x:0, or (4) a substitution x:y. In order to convert a pair of aligned strings into a transformation
rule, we simply replace all contiguous sequences of identity pairs with ?+. For the alignments above, we get the rules: ?+ 0:s and h:0 o:0 t:0 ?+.
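The conversion itself can be sketched as follows (the MCMC alignment step is not shown); aligned strings are represented here as lists of symbols, with 0 for the empty symbol.

# Sketch of turning a character-level alignment into a rule: runs of identity pairs
# collapse to ?+, all other pairs are kept as x:y edits.
def alignment_to_rule(aligned_input, aligned_output):
    rule, in_identity_run = [], False
    for x, y in zip(aligned_input, aligned_output):
        if x == y:
            if not in_identity_run:          # collapse contiguous identity pairs
                rule.append("?+")
                in_identity_run = True
        else:
            rule.append(f"{x}:{y}")
            in_identity_run = False
    return " ".join(rule)

print(alignment_to_rule(list("dog") + ["0"], list("dogs")))              # -> ?+ 0:s
print(alignment_to_rule(list("hotdog"), ["0", "0", "0"] + list("dog")))  # -> h:0 o:0 t:0 ?+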
3.3 Iterative Application of Rules

After extracting a set of rules from baseline paradigms, we discard the baseline paradigms. We then construct new paradigms using our rules. We start by picking a random word form w from the dataset. We then form the paradigm P for w as the set of all forms in our dataset which can be derived from w by applying our rules iteratively. For example, given the form eats and the rules:

?+ s:0 and ?+ 0:i 0:n 0:g

the paradigm of eats would contain both eat (generated by the first rule) and eating (generated by the second rule from eats) provided that both of these forms are present in our original dataset. All forms in P are removed from the dataset and we then repeat the process for another randomly sampled form in the remaining dataset. This continues until the dataset is exhausted. The procedure is sensitive to the order in which we sample forms from the dataset but exploring the optimal way to sample forms falls beyond the scope of the present work.

For prefix and suffix rules, we limit rule application to a single iteration because this delivered better results in practice. Applying rules iteratively tended to result in very large paradigms. For discontinuous rules, we do apply rules iteratively.
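A sketch of this paradigm-building loop follows; the two rules from the example are modelled as plain string rewrites for illustration, whereas the real system applies the extracted ?+ rules.

# Sketch of iterative paradigm construction: grow a paradigm from a random seed by
# rule application over forms still attested in the remaining dataset.
import random

# toy stand-ins for the two example rules: ?+ s:0 and ?+ 0:i 0:n 0:g
rules = [
    lambda w: w[:-1] if w.endswith("s") else None,
    lambda w: w + "ing",
]

def build_paradigms(dataset, iterative=True):
    remaining, paradigms = set(dataset), []
    while remaining:
        seed = random.choice(sorted(remaining))
        paradigm, frontier = {seed}, [seed]
        while frontier:
            form = frontier.pop()
            derived = {rule(form) for rule in rules} & remaining - paradigm
            paradigm |= derived
            if iterative:                    # prefix/suffix rules use a single pass only
                frontier.extend(derived)
        paradigms.append(paradigm)
        remaining -= paradigm
    return paradigms

print(build_paradigms({"eats", "eat", "eating", "walk"}))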
3.4 Filtering Paradigms

According to our preliminary experiments, many large paradigms generated by transformation rules contained word forms which were morphologically unrelated to the other forms in the paradigm. To counteract this, we experimented with three strategies for filtering out individual extraneous forms from generated paradigms: the degree test, the rule-frequency test and the embedding-similarity test. Forms which fail all of our three tests are removed from the paradigm.[4]

[4] These filtering strategies are applied to paradigms containing > 20 forms. This threshold was determined based on examining the output clusters for the development languages.

If we first generate all paradigms and then filter out extraneous forms, we will be left with a number of forms which have not been assigned to a paradigm. In order to circumvent this problem, we apply filtering immediately after generating each individual paradigm. Forms which are filtered out from the paradigm are placed back into the original dataset. They can then be included in paradigms which are generated later in the process.

Figure 3: Given the candidate paradigm {walk, wall, walking, walked, walks}, we can form a graph where two word forms are connected if a rule like ?+ 0:e 0:d derives one of the forms like walked from the other one walk. We experiment with filtering out forms which have low degree in this graph since those are more likely to be spurious additions resulting from rules like ?+ l:k in the example, which do not capture genuine morphological regularities. In this example, wall might be filtered out because it has low degree one compared to all other forms which have degree three.

Degree test Our morphological transformation rules induce dependencies and therefore a graph structure between the forms in a paradigm as demonstrated in Figure 3. Within each paradigm, we calculate the degree of a word in the following way: For each attested word w in the generated paradigm, its degree is the number of forms w′ in the paradigm for which we can find a transformation rule mapping w → w′. We increment the degree if there is at least one edge between words w and w′ in the paradigm (the number of distinct rules mapping form w to w′ is irrelevant here as long as there is at least one). If the degree of a word is less than a third of the paradigm size, the word fails the degree test.
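A sketch of the degree test; the rule_applies predicate here is a stand-in for the rule machinery above, and the toy edges mirror the Figure 3 example.

# Sketch of the degree test: count paradigm members reachable from a word by some rule,
# and fail the word if that count is below a third of the paradigm size.
def degree(word, paradigm, rule_applies):
    return sum(1 for other in paradigm if other != word and rule_applies(word, other))

def fails_degree_test(word, paradigm, rule_applies):
    return degree(word, paradigm, rule_applies) < len(paradigm) / 3

paradigm = {"walk", "wall", "walking", "walked", "walks"}
edges = {frozenset(p) for p in [("walk", "wall"), ("walk", "walked"), ("walk", "walking"),
                                ("walk", "walks"), ("walked", "walking"), ("walked", "walks"),
                                ("walking", "walks")]}
connected = lambda a, b: frozenset((a, b)) in edges
print(fails_degree_test("wall", paradigm, connected))    # True: degree 1 < 5/3
print(fails_degree_test("walked", paradigm, connected))  # False: degree 3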
Rule-Frequency test Some rules like ?+ e:i d:n 0:g for English represent genuine inflectional transformations and will therefore occur often in our datasets. Others, like the rule ?+ l:k in Figure 3, instead result from coincidence, and will usually have low frequency. We can, therefore, use rule frequency as a criterion when identifying extraneous forms in generated paradigms. We examine the cumulative frequency of all rules applying to the form in our paradigm. If this frequency is lower than the median cumulative frequency in the paradigm, the form fails the rule-frequency test.

Embedding-similarity test If a word fails to pass the degree and the rule frequency tests, we will measure the semantic similarity of the given form with other forms in the paradigm. To this end, we trained FastText embeddings (Bojanowski et al., 2017) and calculated cosine similarity between embedding vectors as a measure of semantic relatedness.[5] We start by selecting two reference words in the paradigm which have high degree (at least 50% of the maximal degree) and whose cumulative rule frequency is above the paradigm's median value. We then compute their cosine similarity as a reference point r. For all other words in the paradigm, we then compare their cosine similarity r′ to one of the reference forms. Forms fail the embedding-similarity test if r′ < 0.5 and r − r′ > 0.3.

[5] We train 300-dimensional embeddings with context window 3 and use character n-grams of size 3-6.

4 Experiments and Results

In this section, we describe experiments on the shared task development and test languages.

4.1 Data and Resources

The shared task uses two data resources. Corpus data for the five development languages (Maltese, Persian, Portuguese, Russian and Swedish) and nine test languages (Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish and Turkish) are sourced from the Johns Hopkins Bible Corpus (McCarthy et al., 2020b). For most of the languages, complete Bibles were provided but for some of them, we only had access to a subset (see Wiemerslage et al. (2021) for details). Gold standard paradigms were automatically generated using the Unimorph 3.0 database (McCarthy et al., 2020a).

4.2 Experiments on validation languages

Since our transformation rules are generated from paradigms discovered by the baseline system, which contain incorrect items, it is to be expected that some incorrect rules are generated. We filter out infrequent rules, as they are less likely to represent genuine morphological transformations. For prefix and suffix rules (i.e., PS), we experimented with including the top 2000 (PS-2000), 5000 (PS-5000), and all rules (PS-all), as measured by rule-frequency. Additionally, we present experiments using a system which relies exclusively on suffix transformations, including all of them regardless of frequency (S-all). For discontinuous rules (D), we used lower thresholds because our preliminary experiments indicated that incorrect generalizations were a more severe problem for this rule type. We selected the 200 (D-200), 300 (D-300), and 500 (D-500) most frequent rules, respectively. Results with regard to best-match F1 score (see Wiemerslage et al. (2021) for details) are shown in Table 1.

According to the results, all of our systems outperform the baseline system by at least 25.53% as measured using the mean best match F1 score. Plain suffix rules (S-all) provide the best performance with a mean F1 score of 65.41%, followed by other affixal systems (PS-2000, PS-5000 and PS-all). On average, discontinuous rules (D-200, D-300 and D-500) are slightly less successful, but they deliver the best performance for Maltese. Table 1 demonstrates that simply increasing the number of rules does not always contribute to better performance—the optimal threshold varies between languages.

As explained in Section 3.4, we aim to filter out extraneous forms from overly-large paradigms. We applied this approach to discontinuous rules with a 500 threshold. Results are shown in Table 2. As the table shows, a filtering strategy can offer very limited improvements. Most of the languages do not benefit from this approach and even for languages which do, the gain is minuscule. Due to their very limited effect, we did not apply filtering strategies to test languages.
Maltese Persian Portuguese Russian Swedish Mean
Baseline 29.07 30.04 34.15 36.30 43.62 34.64
PS-2000 35.41 50.17 65.53 81.20 81.14 62.69
PS-5000 36.81 50.40 71.33 81.96 79.82 64.06
PS-all 40.67 53.15 76.63 75.39 72.46 63.66
S-all 30.32 52.69 82.67 80.65 80.74 65.41
D-200 42.99 54.65 66.86 70.38 68.76 60.73
D-300 42.99 53.64 69.38 72.33 67.14 61.10
D-500 45.05 51.82 66.37 75.26 62.30 60.16
Table 1: F1 Scores for each of the model types on all development languages. The best F1 scores are in
bold.
Table 2: F1 score for Discontinuous rules systems and Filtering systems across five validation languages.
4.3 Experiments on Test Languages

Results for the test languages are presented in Table 3. We find that all of our systems surpassed the baseline results by at least 23.06% in F1 score. The prefix and suffix system using all of the suffix rules displays the best performance with an F1 score of 66.12%. Among the discontinuous systems, the system with a threshold of 500 has the best results. On average, the affixal systems outperform the discontinuous ones. In particular, these methods perform best on languages which are known to be predominantly suffixing, such as English, Spanish, and Finnish. Contrarily, discontinuous rules deliver the best performance for Navajo—a strongly prefixing language. Discontinuous rules also result in the best performance for Basque, which has a very high morpheme-to-word ratio.

In order to better understand the behavior of our systems, we analyzed the distribution of the size of generated paradigms for prefix and suffix systems as well as discontinuous systems. Results for selected systems are shown in Figure 4. We conducted this experiment for the overall best system (S-all), as well as the best discontinuous system (D-500). Both systems follow the same overall pattern: large paradigms are rarer than smaller ones and the frequency drops very rapidly with increasing paradigm size. The majority of generated paradigms have sizes in the range 1-5. Although the tendency is similar for suffix rules and discontinuous rules, discontinuous rules tend to generate more paradigms of size 1. In contrast to the paradigms generated by our systems, the frequency of gold standard paradigms drops far slower as the paradigms grow. For example, for Finnish and Kannada, paradigms containing 10 forms are still very common. The only language where the distribution generated by our systems very closely parallels the gold standard is Spanish. For all other languages, our systems very clearly over-generate small paradigms.

5 Discussion and Conclusions

Paradigm construction can suffer from two main difficulties: overgeneralization, and underspecification. In the former, paradigms are too generous when adding new members. Consider, for example, a paradigm headed by “sea”. We would want to include the plural “seas”, but not the unrelated words “seal”, “seals”, “undersea”, etc. Contrarily, a paradigm selection algorithm that is overly selective will result in a large number of small paradigms - less than ideal in a morphologically-dense language.

Considering the results described in the previous section, we note that our two best models skew towards conservatism - they prefer smaller paradigms. This is likely an artifact of our development cycle - we found that the baseline preferred large paradigms, often capturing derivational features, or even circumstantial string similarities, when clustering paradigms.
English Navajo Spanish Finnish Bulgarian Basque Kannada German Turkish Mean
Baseline 51.49 33.25 38.83 28.97 38.89 21.48 23.79 38.22 25.23 33.35
PS-2000 83.89 48.69 77.71 52.60 73.50 25.81 42.35 74.49 46.80 58.42
PS-5000 81.16 48.69 79.60 57.88 74.14 29.03 47.47 74.27 51.26 60.39
PS-all 76.41 48.69 76.94 66.03 69.50 29.03 57.71 65.26 60.97 61.17
S-all 88.68 42.48 83.21 73.42 76.96 29.03 59.34 74.18 67.80 66.12
D-200 76.93 58.45 66.05 50.68 70.48 26.19 40.57 70.26 48.05 56.41
D-300 73.23 59.36 69.46 53.66 69.39 26.19 43.71 68.52 51.00 57.17
D-500 69.33 61.66 69.92 56.51 63.23 33.33 46.94 62.54 53.24 57.41
Table 3: F1 Scores for each of the model types on all test languages. The best F1 scores are in bold.
[Figure 4 panels: Basque, Bulgarian, and English paradigm size distributions, each comparing Suffix-all, Discontinuous-500, and Gold.]

Figure 4: Paradigm size distribution across nine test languages. The x axis stands for paradigm size, ranging from 1 to 20. The y axis shows the percentage that each paradigm size accounts for among all paradigms the system generates.
string similarities, when clustering paradigms. Much of our focus was thus on limiting rule application only to those rules we could be certain were genuine. Unfortunately, this means that many words are excluded, residing in singleton paradigms.

Our methods were also affected by the choice of development languages. Of these languages, only one (Persian) is agglutinating, and none of the authors can read the script, so it had a smaller impact on the evolution of our methods. We believe that several languages—namely, Finnish, Turkish, and Basque—could have benefited from iterative rule application; however, the iterative process was not selected after seeing a degradation (due to overgeneralization) on the development languages.

It is also worth discussing two outliers in our system selection. Our suffix-first model performed very well on all of the development languages except Maltese. This is not surprising, given its templatic morphology. Maltese inspired the creation of our discontinuous rule set, and indeed, these rules outperformed the suffixes for Maltese. Switching to the test languages, we see that this model has higher performance for Navajo and Basque, two languages that are rarely described as templatic. We observe, however, that both languages make heavy use of prefixing. Note in Table 2 that including prefixes (PS-All) significantly improves Navajo: the only language to see such a benefit. Likewise, Navajo also has significant stem alternation, which may be benefiting from discontinuous rule sets. Basque is trickier: it does not improve simply from including prefixal rules. Upon closer inspection, we observe that much Basque prefixation more closely resembles circumfixation: the stem has a prefixal vowel to indicate tense, which is jointly applied with inflectional suffixes. One round of rule application, even if it includes both suffixes and prefixes, appears to be insufficient.

There is still plenty of ground to be covered, with the mean F1 score below 70%. We believe that the next step lies in re-establishing a bottom-up construction for those paradigms that our methods currently separate into small sub-paradigms. Our methods predict roughly two to three times as many singleton paradigms as exist in the gold data, and there is not significant rule support to combine them. Possible areas for exploration include iterative rule extraction on successively more correct paradigms, or the incorporation of a machine learning element that can predict missing forms.

In this paper, we have presented a method for automatically building inflectional paradigms from raw data. Starting with an n-gram baseline, we extract intra-paradigmatic rewrite rules. These rules are then re-applied to the corpus in a discovery process that re-establishes known paradigms. Our methods prove very competitive, with our best model finishing within 2% of the best submitted system.

References

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1024–1029.

Sarah Beemer, Zak Boston, April Bukoski, Daniel Chen, Princess Dickens, Andrew Gerlach, Torin Hopkins, Parth Anand Jawale, Chris Koski, Akanksha Malhotra, Piyush Mishra, Saliha Muradoglu, Lan Sang, Tyler Short, Sagarika Shreevastava, Elizabeth Spaulding, Testumichi Umada, Beilei Xiang, Changbing Yang, and Mans Hulden. 2020. Linguist vs. machine: Rapid development of finite-state morphological grammars. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 162–170, Online. Association for Computational Linguistics.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL–SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya McCarthy, and Katharina Kann. 2020. Unsupervised morphological paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6696–6707, Online. Association for Computational Linguistics.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020a. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, et al. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020b. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado. Association for Computational Linguistics.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. SIGMORPHON 2020.

Adam Wiemerslage, Arya McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, and Katharina Kann. 2021. The SIGMORPHON 2021 shared task on unsupervised morphological paradigm clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 207–216, Hong Kong. Association for Computational Linguistics.
Paradigm Clustering with Weighted Edit Distance
Lang Baseline LMC-B LMC-F JWC
prec. rec. F1 prec. rec. F1 prec. rec. F1 prec. rec. F1
Maltese 0.250 0.348 0.291 0.465 0.229 0.307 0.411 0.202 0.272 0.489 0.241 0.323
Persian 0.265 0.348 0.300 0.321 0.307 0.314 0.494 0.197 0.282 0.579 0.231 0.330
Portuguese 0.218 0.794 0.341 0.771 0.248 0.376 0.494 0.159 0.241 0.742 0.239 0.362
Russian 0.234 0.807 0.363 0.802 0.282 0.417 0.726 0.255 0.378 0.792 0.278 0.412
Swedish 0.303 0.776 0.436 0.818 0.378 0.517 0.695 0.321 0.439 0.838 0.388 0.530
Average 0.254 0.615 0.346 0.635 0.289 0.386 0.482 0.186 0.268 0.688 0.275 0.391
Table 2: Precision, recall, and F1 for all development languages. LMC-B is the LM-clustering system for language models trained right-to-left (backward); LMC-F models are trained left-to-right (forward), and JWC is the JW-clustering system. The highest F1 for each language is in bold.
Table 3: Precision, recall, and F1 for all test languages. LMC is the LM-clustering system, JWC is the JW-
clustering system. The highest F1 for each language is in bold.
7 Appendix
Here we present new results which include
the entire data set for selected languages. We
see an improvement in F1 for each language.
This is due to the increased recall scores from
the paradigms being more complete. Precision
scores decrease across the board. This may
be due to the languages being sensitive to the
threshold value.
Lang Subset Full
prec. rec. F1 prec. rec. F1
Basque 0.471 0.254 0.330 0.443 0.429 0.435
Bulgarian 0.745 0.312 0.440 0.638 0.631 0.634
English 0.565 0.245 0.342 0.430 0.425 0.428
German 0.763 0.310 0.441 0.703 0.699 0.701
Maltese 0.465 0.229 0.307 0.402 0.400 0.401
Navajo 0.686 0.112 0.193 0.449 0.430 0.435
Spanish 0.664 0.183 0.287 0.579 0.560 0.569
Swedish 0.818 0.378 0.517 0.783 0.737 0.759
Average 0.659 0.252 0.357 0.553 0.539 0.545
Results of the Second SIGMORPHON Shared Task
on Multilingual Grapheme-to-Phoneme Conversion
Lucas F.E. Ashby∗ , Travis M. Bartley∗ , Simon Clematide† , Luca Del Signore∗ ,
Cameron Gibson∗ , Kyle Gorman∗ , Yeonju Lee-Sikka∗ , Peter Makarov† ,
Aidan Malanoski∗ , Sean Miller∗ , Omar Ortiz∗ , Reuben Raff∗ ,
Arundhati Sengupta∗ , Bora Seo∗ , Yulia Spektor∗ , Winnie Yan∗
∗ Graduate Program in Linguistics, Graduate Center, City University of New York
† Department of Computational Linguistics, University of Zurich
Table 1: The ten languages in the medium-resource subtask with language codes and example training data pairs.
Table 2: The ten languages in the low-resource subtask with language codes and example training data pairs.
per-language grid search, the best baseline was handily outperformed by nearly all submissions. This led us to seek a simpler, stronger, and less computationally-demanding baseline for this year's shared task.

The baseline for the 2021 shared task is a neural transducer system using an imitation learning paradigm (Makarov and Clematide 2018). A variant of this system (Makarov and Clematide 2020) was the second-best system in the 2020 shared task.5 Alignments are computed using ten iterations of expectation maximization, and the imitation learning policy is trained for up to sixty epochs (with a patience of twelve) using the Adadelta optimizer. A beam of size four is used for prediction. Final predictions are produced by a majority-vote ten-component ensemble. Internal processing is performed using the decomposed Unicode normalization form (NFD), but predictions are converted back to the composed form (NFC). An implementation of the baseline was provided during the task and participating teams were encouraged to adapt it for their submissions.

5 The baseline was implemented using the DyNet neural network toolkit (Neubig et al. 2017). In contrast to the previous year's baseline, the imitation learning system does not require a GPU for efficient training; it runs effectively on CPU and can exploit multiple CPU cores if present. Training, ensembling, and evaluation for all three subtasks took roughly 72 hours of wall-clock time on a commodity desktop PC.
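The two conventions just described (NFD-internal processing with NFC output, and majority-vote ensembling) can be illustrated with the following minimal Python sketch, which uses only the standard library; the function names and example strings are our own and are not taken from the baseline code.

import unicodedata
from collections import Counter

def to_nfd(s):
    # Internal processing uses decomposed Unicode (NFD).
    return unicodedata.normalize("NFD", s)

def to_nfc(s):
    # Final predictions are re-composed (NFC) before scoring.
    return unicodedata.normalize("NFC", s)

def majority_vote(predictions):
    # Pick the pronunciation predicted by the most ensemble components.
    return Counter(predictions).most_common(1)[0][0]

# Example: ten hypothetical component outputs for one word.
components = ["t ɛ s t"] * 6 + ["t e s t"] * 4
print(to_nfc(majority_vote([to_nfd(p) for p in components])))  # "t ɛ s t"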
7 Submissions

Below we provide brief descriptions of submissions to the shared task; more detailed descriptions—as well as various exploratory analyses and post-submission experiments—can be found in the system papers later in this volume.

AZ Hammond (2021) produced a single submission to the low-resource subtask. The model is inspired by the previous year's bidirectional LSTM baseline but also employs several data augmentation strategies. First, much of the development data is used for training rather than for validation. Secondly, new training examples are generated using substrings of other training examples. Finally, the AZ model is trained simultaneously on all languages, a method used in some of the previous year's shared task submissions (e.g., Peters and Martins 2020, Vesik et al. 2020).

CLUZH Clematide and Makarov (2021) produced four submissions to the medium-resource subtask and three to the low-resource subtask. All seven submissions are variations on the imitation learning baseline model (section 6). They experiment with processing individual IPA Unicode characters instead of entire IPA "segments" (e.g., CLUZH-1, CLUZH-5, and CLUZH-6), and larger ensembles (e.g., CLUZH-3). They also experiment with input dropout, mogrifier LSTMs, and adaptive batch sizes, among other features.

Dialpad Gautam et al. (2021) produced three systems for the high-resource subtask. The Dialpad-1 submission is a large ensemble of seven different sequence models. Dialpad-2 is a smaller ensemble of three models. Dialpad-3 is a single transformer model implemented as part of CMU Sphinx. Gautam et al. also experiment with subword modeling techniques.

UBC Lo and Nicolai (2021) submitted two systems for the low-resource subtask, both variations on the baseline model. The UBC-1 submission hypothesizes that, as previously reported by van Esch et al. (2016), inserting explicit syllable boundaries into the phone sequences enhances grapheme-to-phoneme performance. They generate syllable boundaries using an automated onset maximization heuristic. The UBC-2 submission takes a different approach: it assigns additional language-specific penalties for mis-predicted vowels and diacritic characters such as the length mark /ː/.

8 Results

Multiple submissions to the high- and low-resource subtasks outperformed the baseline; however, no submission to the medium-resource subtask exceeded the baseline. The best results for each language are shown in Table 3.

8.1 Subtasks

High-resource subtask The Dialpad team submitted three systems for the high-resource subtask, all of which outperformed the baseline. Results for this subtask are shown in Table 4. The best submission overall, Dialpad-1, a seven-component ensemble, achieved an impressive 4.5% absolute (11% relative) reduction in WER over the baseline.

Medium-resource subtask The CLUZH team submitted four systems for the medium-resource subtask. All of these systems are variants of the baseline model. The results are shown in Table 5; note that the individual language results are expressed as three-digit percentages since there are 1,000 test examples each. While several of the CLUZH systems outperform the baseline on individual languages, including Armenian, French, Hungarian, Japanese, Korean, and Vietnamese, the baseline achieves the best macro-accuracy.

Low-resource subtask Three teams—AZ, CLUZH, and UBC—submitted a total of six systems to the low-resource subtask. Results for this subtask are shown in Table 6; note that the results are expressed as two-digit percentages since there are 100 test examples for each language. Three submissions outperformed the baseline. The best-performing submission was UBC-2, an adaptation of the baseline which assigns higher penalties for mis-predicted vowels and diacritic characters. It achieved a 1.0% absolute (4% relative) reduction in WER over the baseline.

8.2 Error analysis

Error analysis can help identify strengths and weaknesses of existing models, suggesting future improvements and guiding the construction of ensemble models. Prior experience using gold crowd-sourced data extracted from Wiktionary suggests that a non-trivial portion of errors made by top systems are due to errors in the gold data itself. For example, Gorman et al. (2019) report that a substantial portion of the prediction errors made by the top two systems in the 2017 CoNLL–SIGMORPHON Shared Task on Morphological Reinflection (Cotterell et al. 2017) are due to target errors, i.e., errors in the gold data. Therefore we conducted an automatic error analysis for four target languages. It was hoped that this analysis would also help identify (and quantify) target errors in the test data.
Baseline WER Best submission(s) WER
eng_us 41.91 Dialpad-1 37.43
arm_e 7.0 CLUZH-7 6.4
bul 18.3 CLUZH-6 18.8
dut 14.7 CLUZH-7 14.7
fre 8.5 CLUZH-4, CLUZH-5, CLUZH-6 7.5
geo 0.0 CLUZH-4, CLUZH-5, CLUZH-6, CLUZH-7 0.0
hbs_latn 32.1 CLUZH-7 35.3
hun 1.8 CLUZH-6, CLUZH-7 1.0
jpn_hira 5.2 CLUZH-7 5.0
kor 16.3 CLUZH-4 16.2
vie_hanoi 2.5 CLUZH-5, CLUZH-7 2.0
ady 22 CLUZH-2, CLUZH-3, UBC-2 22
gre 21 CLUZH-1, CLUZH-3 20
ice 12 CLUZH-1, CLUZH-3 10
ita 19 UBC-1 20
khm 34 UBC-2 28
lav 55 CLUZH-2, CLUZH-3, UBC-2 49
mlt_latn 19 CLUZH-1 12
rum 10 UBC-2 10
slv 49 UBC-2 47
wel_sw 10 CLUZH-1 10
Table 3: Baseline WER, and the best submission(s) and their WER, for each language.
represented by the same graphemes.

Many errors may also be attributable to problems with the target data. For example, the two most frequent errors for English are predicting [ɪ] instead of [ə], and predicting [ɑ] instead of [ɔ]. Impressionistically, the former is due in part to inconsistent transcription of the -ed and -es suffixes, whereas the latter may reflect inconsistent transcription of the low back merger.

The second error analysis technique used here is an adaptation of a quality assurance technique proposed by Jansche (2014). For each language targeted by the error analysis, a finite-state covering grammar is constructed by manually listing all pairs of permissible grapheme-phone mappings for that language. Let C be the set of all such ⟨g, p⟩ pairs. Then, the covering grammar γ is the rational relation given by the closure over C, thus γ = C*. Covering grammars were constructed for three medium-resource languages and four of the low-resource languages. A fragment of the Bulgarian covering grammar, showing readings of the characters б, ф, and ю, is presented in Table 8.6

6 Error analysis software was implemented using the Pynini finite-state toolkit (Gorman 2016). See Gorman and Sproat 2021, ch. 3, for definitions of the various finite-state operations used here.

Let G be the graphemic form of a word and let P and P̂ be the corresponding gold and hypothesis pronunciations for that word. For error analysis we are naturally interested in cases where P ≠ P̂, i.e., those cases where the gold and hypothesis pronunciations do not match, since these are exactly the cases which contribute to word error rate. Then, 𝒫 = π_o(G ◦ γ) is a finite-state lattice representing the set of all "possible" pronunciations of G admitted by the covering grammar.
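Since the error analysis software was implemented with Pynini (see footnote 6), the construction can be sketched as follows. This is only our own minimal illustration: the grapheme-phone pairs are invented stand-ins, and the exact calls may differ from the released code.

import pynini

# Hypothetical covering-grammar fragment: each pair is one permissible
# grapheme-to-phone mapping (real grammars list all such pairs per language).
pairs = [("b", "b"), ("b", "p"), ("o", "o"), ("o", "ɔ")]
C = pynini.union(*(pynini.cross(g, p) for g, p in pairs))
gamma = C.closure()  # γ = C*

def grammar_admits(grapheme_form, pronunciation):
    # Composing the word with γ yields the relation whose output side is the
    # lattice 𝒫 = π_o(G ◦ γ); composing that with a candidate pronunciation
    # is non-empty exactly when the grammar admits that reading.
    lattice = pynini.compose(pynini.accep(grapheme_form), gamma)
    return pynini.compose(lattice, pynini.accep(pronunciation)).num_states() > 0

def is_model_deficiency(gold, hypothesis, grapheme_form):
    # P ≠ P̂ but P ∈ 𝒫: the system missed one of several admissible readings.
    return gold != hypothesis and grammar_admits(grapheme_form, gold)

def is_coverage_deficiency(gold, hypothesis, grapheme_form):
    # P ≠ P̂ and P ∉ 𝒫: the gold pronunciation falls outside the grammar.
    return gold != hypothesis and not grammar_admits(grapheme_form, gold)

print(is_model_deficiency("bo", "bɔ", "bo"))  # True: "bo" is also admitted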
Baseline CLUZH-4 CLUZH-5 CLUZH-6 CLUZH-7
arm_e 7.0 7.1 6.6 6.6 6.4
bul 18.3 20.1 19.2 18.8 19.7
dut 14.7 15.0 14.9 15.6 14.7
fre 8.5 7.5 7.5 7.5 7.6
geo 0.0 0.0 0.0 0.0 0.0
hbs_latn 32.1 38.4 35.6 37.0 35.3
hun 1.8 1.5 1.2 1.0 1.0
jpn_hira 5.2 5.9 5.3 5.5 5.0
kor 16.3 16.2 16.9 17.2 16.3
vie_hanoi 2.5 2.3 2.0 2.1 2.0
Macro-average 10.6 11.4 10.9 11.1 10.8
When P ≠ P̂ but P ∈ 𝒫—that is, when the gold pronunciation is one of the possible pronunciations—we refer to such errors as model deficiencies, since this condition suggests that the system in question has failed to guess one of several possible pronunciations of the current word. In many cases this reflects genuine ambiguities in the orthography itself. For example, in Italian, e is used to write both the phonemes /e, ɛ/ and o is similarly read as /o, ɔ/ (Rogers and d'Arcangeli 2004). There are few if any orthographic clues to which mid-vowel phoneme is intended, and all submissions incorrectly predicted that the o in nome 'name' is read as [ɔ] rather than [o]. Similar issues arise in Icelandic and French. The preceding examples both represent global ambiguities, but model deficiencies may also occur when the system has failed to disambiguate a local ambiguity. One example of this can be found in French: the verbal third-person plural suffix -ent is silent whereas the non-suffixal word-final ent is normally read as [ɑ̃]. Morphological information was not provided to the covering grammar, but it could easily be exploited by grapheme-to-phoneme models.

Another condition of interest is when P ≠ P̂ but P ∉ 𝒫. We refer to such errors as coverage deficiencies, since they arise when the gold pronunciation is not one permitted by the covering grammar. While coverage deficiencies may result from actual deficiencies in the covering grammar itself, they more often arise when a word does not follow the normal orthographic principles of its language. For instance, Italian has borrowed the English loanword weekend [wikɛnd] 'id.' but has not yet adapted it to Italian orthographic principles. Finally, coverage deficiencies may indicate target errors, inconsistencies in the gold data itself. For example, in the Italian data, the tie bars used to indi-
eng_us ɪ ә 113 ɑ ɔ 112 _ ʊ• 96 _ ɪ• 85 ɪ i 76
arm_e _ ә• 16 ә• _ 10 tʰ d 6 d tʰ 6 j• _ 3
bul ɛ• d͡ 32 a ә 31 ә ɤ 30 _ ◌̝ 27 ә a 25
dut ә eː 10 _ ː 10 ә ɛ 9 eː ә 8 z s 8
fre a ɑ 6 _ •s 5 ɔ o 5 e ɛ•ʁ 3 _ •t 3
geo
hbs_latn _ ː 85 ː _ 76 _ ◌̌ 55 ◌̌ ◌̂ 53 ◌̌ _ 52
hun _ ː 6 h ɦ 3 ʃ s 2 ː _ 2
jpn_hira _ ◌̥ 20 _ ◌̊ 11 _ d͡ 4 ː •ɯ̟ᵝ 3 h ɰᵝ 3
kor _ ː 73 ː _ 28 ʌ̹ ɘː 23 ʰ ◌͈ 9 ɘː ʌ̹ 6
vie_hanoi _ w• 3 _ ˧ 3 _ w•ŋ͡m• 2 ◌͡ɕ •ɹ 2 _ ʔ• 2
ady ʼ _ 3 ː _ 3 ʃ ʂ 3 ə• _ 2 a ә 2
gre ɾ r 8 r ɾ 3 i ʝ 3 m• _ 2 ɣ ɡ 2
ice ː _ 2 ◌̥ _ 2 _ ː 2
ita o ɔ 6 e ɛ 5 j i 3 ◌͡ • 2 ɔ o 2
khm aː i•ә 3 _ ʰ 3 _ •ɑː 2 ĕ ɔ 2 ɑ a 2
lav ◌̄ ◌̂ 11 _ ◌̂ 10 ◌̀ _ 9 ◌̄ _ 7 _ ◌̀ 4
mlt_latn _ ː 5 _ ɪ• 2 ɐ a 2 b p 2 a ɐ 2
rum ◌͡ • 2
slv ◌́ ◌̀ 7 ◌̀ː _ 6 ◌́ː _ 6 _ ◌́ː 5 ɛ éː 4
wel_sw ɪ iː 3 ɪ i̯ 2 _ ɛ• 2
Table 7: The five most frequent error types, represented by the hypothesis string, gold string, and count, for each
language; • indicates whitespace and _ the empty string.
Table 9: WER and model deficiency rate (MDR) for three languages from the medium-resource subtask.
Table 10: WER and model deficiency rate (MDR) for four languages from the low-resource subtask.
2020:47). This enormous improvement likely reflects quality assurance work on this language,8 but we did not anticipate reaching ceiling performance. Insofar as the above quality assurance and error analysis techniques prove effective and generalizable, we may soon be able to ask what makes a language hard to pronounce (cf. Gorman et al. 2020:45f.).

As mentioned above, the data here are a mixture of broad and narrow transcriptions. At first glance, this might explain some of the variation in language difficulty; for example, it is easy to imagine that the additional details in narrow transcriptions make them more difficult to predict. However, for many languages, only one of the two levels of transcription is available at scale, and for other languages, the divergence between broad and narrow transcriptions is impressionistically quite minor. This impression, however, ought to be quantified.

While we responded to community demand for lower- and higher-resource subtasks, only one team submitted to the high- and medium-resource subtasks, respectively. It was surprising that none of the medium-resource submissions were able to consistently outperform the baseline model across the ten target languages. Clearly, this year's baseline is much stronger than the previous year's.

Participants in the high- and medium-resource subtasks were permitted to make use of lemmas and morphological tags from UniMorph as additional features. However, no team made use of these resources. Some prior work (e.g., Demberg et al. 2007) has found morphological tags highly useful, and error analysis (§8.2) suggests this information would make an impact in French.

There is a large performance gap between the medium-resource and low-resource subtasks. For instance, the baseline achieves a WER of 10.6 in the medium-resource scenario and a WER of 25.1 in the low-resource scenario. It seems that current models are unable to reach peak performance with the 800 training examples provided in the low-resource subtask. Further work is needed to develop more efficient models and data augmentation strategies for low-resource scenarios. In our opinion, this scenario is the most important one for speech technology, since speech resources—including pronunciation data—are scarce for the vast majority of the world's written languages.

10 Conclusions

The second iteration of the shared task on multilingual grapheme-to-phoneme conversion features many improvements on the previous year's task, most of all data quality. Four teams submitted thirteen systems, achieving substantial reductions in both absolute and relative error over the baseline in two of three subtasks. We hope the code and data, released under permissive licenses,9 will be used to benchmark grapheme-to-phoneme conversion and sequence-to-sequence modeling techniques more generally.

8 https://github.com/CUNY-CL/wikipron/issues/138
9 https://github.com/sigmorphon/2021-task1/
Acknowledgements

We thank the Wiktionary contributors, particularly Aryaman Arora, without whom this shared task would be impossible. We also thank contributors to WikiPron, especially Sasha Gutkin, Jackson Lee, and the participants of Hacktoberfest 2020. Finally, thank you to Andrew Kirby for last-minute copy editing assistance.

References

Simon Clematide and Peter Makarov. 2021. CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: variations on a baseline. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL–SIGMORPHON 2017 shared task: universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Vera Demberg, Helmut Schmid, and Gregor Möhler. 2007. Phonological constraints and morphological preprocessing for grapheme-to-phoneme conversion. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 96–103.

Daan van Esch, Mason Chua, and Kanishka Rao. 2016. Predicting pronunciations with syllabification and stress with recurrent neural networks. In INTERSPEECH 2016: 17th Annual Conference of the International Speech Communication Association, pages 2841–2845.

Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Frederic Mailhot, Shreekantha Nadig, Riqiang Wang, and Nathan Zhang. 2021. Avengers, ensemble! Benefits of ensembling in grapheme-to-phoneme prediction. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: from 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 759–765.

Kyle Gorman. 2016. Pynini: a Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 75–80.

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50.

Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, and Magdalena Markowska. 2019. Weird inflects but OK: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140–151.

Kyle Gorman and Richard Sproat. 2021. Finite-State Text Processing. Morgan & Claypool.

Michael Hammond. 2021. Data augmentation for low-resource grapheme-to-phoneme mapping. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

S. J. Hannahs. 2013. The Phonology of Welsh. Oxford University Press.

Bradley Hauer, Amir Ahmad Habibi, Yixing Luan, Arnob Mallik, and Grzegorz Kondrak. 2020. Low-resource G2P and P2G conversion with synthetic training data. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 117–122.

Robert Hoberman. 2007. Maltese morphology. In Alan S. Kaye, editor, Morphologies of Asia and Africa, volume 1, pages 257–282. Eisenbrauns.

Martin Jansche. 2014. Computer-aided quality assurance of an Icelandic pronunciation dictionary. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 2111–2114.

Paul Kingsbury, Stephanie Strassel, Cynthia McLemore, and Robert MacIntyre. 1997. CALLHOME American English Lexicon (PRONLEX). LDC97L20.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: universal morphology. In Proceedings of the 11th Language Resources and Evaluation Conference, pages 1868–1873.

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4216–4221.

Roger Yu-Hsiang Lo and Garrett Nicolai. 2021. Linguistic knowledge in multilingual grapheme-to-phoneme conversion. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Peter Makarov and Simon Clematide. 2018. Imitation learning for neural morphological string transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2877–2882.

Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2020. Ensemble self-training for low-resource languages: grapheme-to-phoneme conversion and morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 70–78.
Data Augmentation for Low-Resource Grapheme-to-Phoneme Mapping

Michael Hammond
Dept. of Linguistics
U. of Arizona
Tucson, AZ, USA
[email protected]
encoder and decoder. Each LSTM layer has 300 nodes. The systems are connected by a 5-head attention mechanism (Luong et al., 2015). Training proceeds in 24,000 steps and the learning rate starts at 1.0 and decays at a rate of 0.8 every 1,000 steps starting at step 10,000. Optimization is stochastic gradient descent, the batch size is 64, and the dropout rate is 0.5.

We spent a fair amount of time tuning the system to these settings for optimal performance with this general architecture on these data. We do not detail these efforts as this is just a normal part of working with neural nets and not our focus here.

Precise instructions for building the docker image, full configuration files, and auxiliary code files are available at https://github.com/hammondm/g2p2021.

3 General results

In this section, we give the general results of the full system with all strategies in place and then in the next sections we strip away each of our augmentation techniques to see what kind of effect each has. In building our system, we did not have access to the correct transcriptions for the test data provided, so we report performance on the development data here.

The system was subject to a certain amount of randomness because of randomization of training data and random initial weights in the network. We therefore report mean final accuracy scores over multiple runs.

Our system provides accuracy scores for development data in terms of character-level accuracy. The general task was scored in terms of word-level error rate, but we keep this measure for several reasons. First, it was simply easier as this is what the system provided as a default. Second, this is a more granular measure that enabled us to adjust the system more carefully. Finally, we were able to simulate word-level accuracy in addition as described below.

We use a Monte Carlo simulation to calculate expected word-level accuracy based on character-level accuracy and average transcription length for the training data for the different languages. The simulation works by generating 100,000 words with a random distribution of a specific character-level accuracy rate and then calculating word-level accuracy from that. Running the full system ten times, we get the results in Table 2. Keep in mind that we are reporting accuracy rather than error rate, so the goal is to maximize these values.

Character  Word
94.84      75.6
94.78      75.3
94.46      74.0
94.84      75.5
94.71      75.0
94.59      74.5
94.90      75.8
94.53      74.2
94.53      74.2
94.71      75.0
Mean 94.69 74.91

Table 2: Development accuracy for 10 runs of the full system with all languages grouped together with estimated word-level accuracy

Character  Word
94.60      74.6
94.71      75.0
94.35      73.8
94.48      74.0
94.48      74.0
94.50      74.2
94.59      74.5
94.71      75.0
94.80      75.4
94.59      74.5
Mean 94.58 74.5

Table 3: Development accuracy for 10 runs of the reduced system with all languages grouped together with 100 development pairs with estimated word-level accuracy
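The word-level figures in Tables 2 and 3 come from the Monte Carlo procedure described above. The following minimal sketch is our own illustration, not the authors' code; it assumes a fixed per-character accuracy and an assumed average transcription length.

import random

def simulate_word_accuracy(char_accuracy, avg_length, n_words=100_000):
    # A word counts as correct only if every one of its characters is
    # predicted correctly, each with independent probability char_accuracy.
    correct = 0
    for _ in range(n_words):
        if all(random.random() < char_accuracy for _ in range(avg_length)):
            correct += 1
    return correct / n_words

# Example: roughly the setting of Table 2, with a hypothetical length of 5.
print(simulate_word_accuracy(0.9469, avg_length=5))  # about 0.76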
4 Using development data

The default partition for each language is 800 pairs for training and 100 pairs for development. We shifted this to 880 pairs for training and 20 pairs for development. The logic of this choice was to retain what seemed like the minimum number of development items. Running the system ten times without this repartitioning gives the results in Table 3.

There is a small difference in the right direction, but it is not significant for characters (t = −1.65, p = 0.11, unpaired) or words (t = −1.56, p = 0.13, unpaired). It may be that with a larger sample of runs, the difference becomes more stable.

Code      Items added
ady       4
gre       223
ice       58
ita       194
khm       39
lav       100
mlt_latn  62
rum       119
slv       127
wel_sw    7

Table 4: Number of substrings added for each language

Character  Word
95.15      76.9
94.40      73.7
95.15      76.8
94.59      74.5
94.65      74.8
95.27      77.4
94.53      74.2
94.78      75.2
95.09      76.6
94.59      74.5
Mean 94.82 75.46

Table 5: 10 runs with all languages grouped together without substrings added for each language
5 Using substrings

This method involves finding peripheral letters that map unambiguously to some symbol and then finding plausible splitting points within words to create partial words that can be added to the training data.

Let's exemplify this with Welsh. First, we identify all word-final letters that always correspond to the same symbol in the transcription. For example, the letter c always corresponds to a word-final [k]. Similarly, we identify word-initial characters with the same property. For example, in these data, the word-initial letter t always corresponds to [t].2 We then search for any word in training that has the medial sequence ct where the transcription has [kt]. We take each half of the relevant item and add them to the training data if that pair is not already there. For example, the word actor [aktɔr] fits the pattern, so we can add the pairs ac-ak and tor-tɔr to the training data. Table 4 gives the number of items added for each language. This strategy is a more limited version of the "slice-and-shuffle" approach used by Ryan and Hulden (2020) in last year's challenge.

Note that this procedure can make errors. If there are generalizations about the pronunciation of letters that are not local, that involve elements at a distance, this procedure can obscure those. Another example from Welsh makes the point. There are exceptions, but the letter y in Welsh is pronounced two ways. In a word-final syllable, it is pronounced [ɨ], e.g. gwyn [gwɨn] 'white'. In a non-final syllable, it is pronounced [ə], e.g. gwynion [gwənjɔn] 'white ones'. Though it doesn't happen in the training data here, the procedure above could easily result in a y in a non-final syllable ending up in a final syllable in a substring generated as above.

Table 5 shows the results of 10 runs without these additions and simulated word error rates for each run. Strikingly, adding the substrings lowered performance, but the difference with the full model is not significant for either characters (t = 1.18, p = 0.25, unpaired) or for words (t = 1.17, p = 0.25, unpaired). This model without substrings is the best-performing of all the models we tried, so this is what was submitted.

2 This is actually incorrect for the language as a whole. Word-initial t in the digraph th corresponds to a different sound [θ].
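A minimal sketch of this substring augmentation follows. It is our own illustration under simplifying assumptions (one symbol per letter, a single split per word, no duplicate filtering), and the helper names are hypothetical rather than taken from the released code.

def unambiguous_final_and_initial(pairs):
    # Letters that always map to the same symbol word-finally / word-initially.
    # pairs is a list of (orthography, transcription) tuples of equal length.
    final, initial = {}, {}
    for orth, trans in pairs:
        final.setdefault(orth[-1], set()).add(trans[-1])
        initial.setdefault(orth[0], set()).add(trans[0])
    final = {k: v.pop() for k, v in final.items() if len(v) == 1}
    initial = {k: v.pop() for k, v in initial.items() if len(v) == 1}
    return final, initial

def substring_pairs(pairs):
    # Split a word at a medial letter bigram whose left letter is an
    # unambiguous word-final letter and whose right letter is an unambiguous
    # word-initial letter; keep at least two letters on each side.
    final, initial = unambiguous_final_and_initial(pairs)
    new = []
    for orth, trans in pairs:
        for i in range(2, len(orth) - 1):
            left, right = orth[i - 1], orth[i]
            if final.get(left) == trans[i - 1] and initial.get(right) == trans[i]:
                new.append((orth[:i], trans[:i]))   # e.g. ac - ak
                new.append((orth[i:], trans[i:]))   # e.g. tor - tɔr
                break
    return new

# Toy Welsh-like data, one symbol per letter for simplicity.
data = [("actor", "aktɔr"), ("ac", "ak"), ("tad", "tad")]
print(substring_pairs(data))  # [('ac', 'ak'), ('tor', 'tɔr')]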
6 Training together

The basic idea here was to leverage the entire set of languages. Thus all languages were trained together. To distinguish them, each pair was prefixed by its language code. Thus if we had orthography O = ⟨o1, o2, ..., on⟩ and transcription T = ⟨t1, t2, ..., tm⟩ from language x, the net would be trained on the pair O′ = ⟨x, o1, o2, ..., on⟩ and T′ = ⟨x, t1, t2, ..., tm⟩. The idea is that, while the mappings and orthographies are distinct, there are similarities in what letters encode what sounds and in the possible sequences of sounds that can occur in the transcriptions. This approach is very similar to that of Peters et al. (2017), except that we tag the orthography and the transcription with the same language tag. Peters and Martins (2020) took a similar approach in last year's challenge, but use embeddings prefixed at each time step.

In Table 6 we give the results for running each language separately 5 times. Since there was a lot less training data for each run, these models settled faster, but we ran them the same number of steps as the full models for comparison purposes.

There's a lot of variation across runs and the means for each language are quite different, presumably based on different levels of orthographic transparency. The general pattern is clear that, overall, training together does better than training separately. Comparing run means with our baseline system is significant (t = −6.06, p < .001, unpaired).

This is not true in all cases, however. For some languages, individual training seems to be better than training together. Our hypothesis is that languages that share similar orthographic systems did better with joint training and that languages with diverging systems suffered.

The final results show that our best system (no substrings included, all languages together, moving development data to training) performed reasonably for some languages, but did quite poorly for others. This suggests a hybrid strategy that would have been more successful. In addition to training the full system here, train individual systems for each language. For test, compare final development results for individual languages for the jointly trained system and the individually trained systems, and use whichever does better for each language in testing.

7 Conclusion

To conclude, we have augmented a basic sequence-to-sequence LSTM model with several data augmentation moves. Some of these were successful: redistributing data from development to training and training all the languages together. Some techniques were not successful though: the substring strategy resulted in diminished performance.

Acknowledgments

Thanks to Diane Ohala for useful discussion. Thanks to several anonymous reviewers for very helpful feedback. All errors are my own.

References

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50. Association for Computational Linguistics.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A.M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. ArXiv e-prints. 1701.02810.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Ben Peters, Jon Dehdari, and Josef van Genabith. 2017. Massively multilingual neural grapheme-to-phoneme conversion. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 19–26, Copenhagen. Association for Computational Linguistics.

Ben Peters and André F. T. Martins. 2020. DeepSPIN at SIGMORPHON 2020: One-size-fits-all multilingual models. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 63–69. Association for Computational Linguistics.

Zach Ryan and Mans Hulden. 2020. Data augmentation for transformer-based G2P. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 184–188. Association for Computational Linguistics.
language   5 separate runs                          Mean
ady        95.27  91.12  93.49  94.67  93.49        93.61
gre        97.25  98.35  98.35  98.90  98.90        98.35
ice        91.16  94.56  93.88  90.48  94.56        92.93
ita        93.51  94.59  94.59  94.59  94.59        94.38
khm        94.19  90.32  90.97  90.97  90.97        91.48
lav        94.00  90.67  89.33  92.00  90.67        91.33
mlt_latn   91.89  94.59  91.89  92.57  93.24        92.84
rum        95.29  96.47  94.71  95.88  95.29        95.51
slv        94.01  94.61  94.61  94.61  94.01        94.37
wel_sw     96.30  97.04  96.30  97.04  96.30        96.59
Mean       94.29  94.23  93.81  94.17  94.20        94.14
Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion
models has since been broken (Yao and Zweig, 2015). Attention further improved the performance, as attentional encoder-decoders (Toshniwal and Livescu, 2016) learned to focus on specific input sequences. As attention became "all that was needed" (Vaswani et al., 2017), transformer-based architectures have begun looming large (e.g., Yolchuyeva et al., 2019).

Recent years have also seen works that capitalize on multilingual data to train a single model with grapheme-phoneme pairs from multiple languages. For example, various systems from last year's shared task submissions learned from a multilingual signal (e.g., ElSaadany and Suter, 2020; Peters and Martins, 2020; Vesik et al., 2020).

3 The Low-resource Subtask

3.3 Baselines

The official baselines for individual languages are based on an ensembled neural transducer trained with the imitation learning (IL) paradigm (Makarov and Clematide, 2018a). The baseline WERs are tabulated in Table 3. In what follows, we overview this baseline neural-transducer system, as our models are built on top of this system. The detailed formal description of the baseline system can be found in Makarov and Clematide (2018a,b,c, 2020).

The neural transducer in question defines a conditional distribution over edit actions, such as copy, deletion, insertion, and substitution:

    p_θ(y, a | x) = ∏_{j=1}^{|a|} p_θ(a_j | a_{<j}, x),
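As a plain illustration of this factorization (our own sketch, not the baseline implementation), the score of a full edit-action sequence is simply the sum of per-step log-probabilities:

import math

def action_sequence_log_prob(step_probs):
    # log p_θ(y, a | x) = Σ_j log p_θ(a_j | a_<j, x).
    # step_probs[j] is the model's probability of action a_j given the action
    # history a_<j and the input x; the values below are hypothetical.
    return sum(math.log(p) for p in step_probs)

# Example: COPY, COPY, SUBSTITUTE, INSERT with made-up probabilities.
print(action_sequence_log_prob([0.9, 0.8, 0.6, 0.7]))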
voting majority. Early efforts to modify the ensemble to incorporate system confidence showed that a majority ensemble was sufficient.

This model has proved to be competitive, judging from its performance on the previous year's G2P shared task. We therefore decided to use it as the foundation to construct our systems.

4 Our Approaches

This section lays out our attempted approaches. We investigate two alternatives, both linguistic in nature. The first is inspired by a universal linguistic structure—the syllable—and the other by the error patterns discerned from the baseline predictions on the development data.

4.1 System 1: Augmenting Data with Unsupervised Syllable Boundaries

Our first approach originates from the observation that, in natural languages, a sequence of sounds does not just assume a flat structure. Neighboring sounds group to form units, such as the onset, nucleus, and coda. In turn, these units can further project to a syllable (see Figure 1 for an example of such projection). Syllables are useful structural units in describing various linguistic phenomena and indeed in predicting the pronunciation of a word in some languages (e.g., Treiman, 1994). For instance, in Dutch, the vowel quality of the nucleus can be reliably inferred from the spelling after proper syllabification: .dag. [dɑx] 'day' but .da.gen. [daːɣən] 'days', where . marks syllable boundaries. Note that a in a closed syllable is pronounced as the short vowel [ɑ], but as the long vowel [aː] in an open syllable. In applying syllabification to G2P conversion, van Esch et al. (2016) find that training RNNs to jointly predict phoneme sequences, syllabification, and stress leads to further performance gains in some languages, compared to models trained without syllabification and stress information.

[Figure 1 shows the syllable tree for twelfth: the Onset t w and the Rhyme, branching into the Nucleus ɛ and the Coda ł f θ, project to a single Syllable node.]

Figure 1: The syllable structure of twelfth [twɛłfθ]

To identify syllable boundaries in the input sequence, we adopted a simple heuristic, the specific steps of which are listed below (a code sketch of the cluster-splitting step appears at the end of this subsection):3

1. Find vowels in the output: We first identify the vowels in the phoneme sequence by comparing each segment with the vowel symbols from the IPA chart. For instance, the symbols [ø] and [y] in [tʰrøyst] for Icelandic traust are vowels because they match the vowel symbols [ø] and [y] on the IPA chart.

2. Find vowels in the input: Next we align the grapheme sequence with the phoneme sequence using an unsupervised many-to-many aligner (Jiampojamarn et al., 2007; Jiampojamarn and Kondrak, 2010). By identifying graphemes that are aligned to phonemic vowels, we can identify vowels in the input. Using the Icelandic example again, the aligner produces a one-to-one mapping: t → tʰ, r → r, a → ø, u → y, s → s, and t → t. We therefore assume that the input characters a and u represent two vowels. Note that this step is often redundant for input sequences based on the Latin script but is useful in identifying vowel symbols in other scripts.

3. Find valid onsets and codas: A key step in syllabification is to identify which sequences of consonants can form an onset or a coda. Without resorting to linguistic knowledge, one way to identify valid onsets and codas is to look at the two ends of a word—consonant sequences appearing word-initially before the first vowel are valid onsets, and consonant sequences after the final vowel are valid codas. Looping through each input sequence in the training data gives us a list of valid onsets and codas. In the Icelandic example traust, the initial tr sequence must be a valid onset, and the final st sequence a valid coda.

3 We are aware that different languages permit distinct syllable constituents (e.g., some languages allow syllabic consonants while others do not), but given the restriction that we are not allowed to use external resources in the low-resource subtask, we simply assume that all syllables must contain a vowel.
4. Break word-medial consonant sequences into an onset and a coda: Unfortunately, identifying onsets and codas among word-medial consonant sequences is not as straightforward. For example, how do we know whether the sequence in the input VngstrV (V for a vowel character) should be parsed as Vng.strV, as Vn.gstrV, or even as V.ngstrV? To tackle this problem, we use the valid onset and coda lists gathered from the previous step: we split the consonant sequence into two parts, and we choose the split where the first part is a valid coda and the second part a valid onset. For instance, suppose we have an onset list {str, tr} and a coda list {ng, st}. This implies that we only have a single valid split—Vng.strV—so ng is treated as the coda for the previous syllable and str as the onset for the following syllable. In the case where more than one split is acceptable, we favor the split that produces a more complex onset, based on the linguistic heuristic that natural languages tend to tolerate more complex onsets than codas. For example, Vng.strV > Vngs.trV. In the situation where none of the splits produces a concatenation of a valid coda and onset, we adopt the following heuristic:

• If there is only one medial consonant (such as in the case where the consonant can only occur word-internally but not in the onset or coda position), this consonant is classified as the onset for the following syllable.

• If there is more than one consonant, the first consonant is classified as the coda and attached to the previous syllable while the rest as the onset of the following syllable.

Of course, this procedure is not free of errors (e.g., some languages have onsets that are only allowed word-medially, so word-initial onsets will naturally not include them), but overall it gives reasonable results.

5. Form syllables: The last step is to put together consonant and vowel characters to form syllables. The simplest approach is to allow each vowel character to be projected as a nucleus and distribute onsets and codas around these nuclei to build syllables. If there are four vowels in the input, there are likewise four syllables. There is one important caveat, however. When there are two or more consecutive vowel characters, some languages prefer to merge them into a single vowel/nucleus in their pronunciation (e.g., Greek και → [ce]) while other languages simply default to vowel hiatuses/two side-by-side nuclei (e.g., Italian badia → [badia])—indeed, both are common cross-linguistically. We again rely on the alignment results in the second step to select the vowel segmentation strategy for individual languages.

After we have identified the syllables that compose each word, we augmented the input sequences with syllable boundaries. We identify four labels to distinguish different types of syllable boundaries: <cc>, <cv>, <vc>, and <vv>, depending on the classes of sound the segments straddling the syllable boundary belong to. For instance, the input sequence b í l a v e r k s t æ ð i in Icelandic will be augmented to be b í <vc> l a <vc> v e r k <cc> s t æ <vc> ð i. We applied the same syllabification algorithm to all languages to generate new input sequences, with the exception of Khmer, as the Khmer script does not permit a straightforward linear mapping between input and output sequences, which is crucial for the vowel identification step. We then used these syllabified input sequences, along with their target transcriptions, as the training data for the baseline model.4

4 The hyperparameters used are the default values provided in the baseline model code: character and action embedding = 100, encoder LSTM state dimension = decoder LSTM state dimension = 200, encoder layer = decoder layer = 1, beam width = 4, roll-in hyperparameter = 1, epochs = 60, patience = 12, batch size = 5, EM iterations = 10, ensemble size = 10.
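The cluster-splitting heuristic in step 4 can be sketched as follows. This is our own illustration under simplified assumptions (onset and coda inventories given as sets of strings, ties broken by onset length), not the submitted implementation.

def split_medial_cluster(cluster, onsets, codas):
    # Split a word-medial consonant cluster into (coda, onset). Prefer splits
    # where the left part is a valid coda and the right part a valid onset;
    # among those, favor the longest (most complex) onset.
    candidates = [
        (cluster[:i], cluster[i:])
        for i in range(len(cluster) + 1)
        if cluster[:i] in codas and cluster[i:] in onsets
    ]
    if candidates:
        return max(candidates, key=lambda split: len(split[1]))
    # Fallback heuristics when no valid coda+onset split exists.
    if len(cluster) == 1:
        return "", cluster                 # lone consonant becomes an onset
    return cluster[:1], cluster[1:]        # first consonant is the coda

# Example from the paper: ngstr with onsets {str, tr} and codas {ng, st}.
print(split_medial_cluster("ngstr", {"str", "tr"}, {"ng", "st"}))  # ('ng', 'str')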
4.2 System 2: Penalizing Vowel and Diacritic Errors

Our second approach focuses on the training objective of the baseline model, and is driven by the errors we observed in the baseline predictions. Specifically, we noticed that the majority of errors for the languages with a high WER—Khmer, Latvian, and Slovene—concerned vowels, some examples of which are given in Table 1. Note the nature of these mistakes: the mismatch can be in the vowel quality (e.g., [ɔ] for [o]), in the vowel length (e.g., [áː] for [á]), in the pitch accent (e.g., [íː] for [ìː]), or a combination thereof.

Based on the above observation, we modified the baseline model to explicitly address this vowel-mismatching issue. We modified the objective such that erroneous vowel or diacritic (e.g., the lengthening marker [ː]) predictions during training incur additional penalties. Each incorrectly-predicted vowel incurs this penalty. The penalty acts as a regularizer that forces the model to expend more effort on learning vowels. This modification is in the same spirit as the softmax-margin objective of Gimpel and Smith (2010), which penalizes high-cost outputs more heavily, but our approach is even simpler—we merely supplement the loss with additional penalties for vowels and diacritics. We fine-tuned the vowel and diacritic penalties using a grid search on the development data, incrementing each by 0.1, from 0 to 0.5. In the cases of ties, we skewed higher as the penalties generally worked better at higher values. The final values used to generate predictions for the test data are listed in Table 2. We also note that the vowel penalty had significantly more impact than the diacritic penalty.

Language  Target          Baseline prediction
khm       n u h           n ŭ ə h
          r ɔː j          r ĕ ə j
          s p ŏ ə n       s p a n
lat       t s eː l s      t s êː l s
          j u ō k s       j ù o k s
          v æ̂ l s         v ǣː l s
slv       j óː g u r t    j ɔ g úː r t
          k r ìː ʃ        k r íː ʃ
          z d á j         z d áː j

Table 1: Typical errors in the development set that involve vowels from Khmer (khm), Latvian (lat), and Slovene (slv)

          Penalty
Language  Vowel  Diacritic
ady       0.5    0.3
gre       0.3    0.2
ice       0.3    0.3
ita       0.5    0.5
khm       0.2    0.4
lav       0.5    0.5
mlt_latn  0.2    0.2
rum       0.5    0.2
slv       0.4    0.4
wel_sw    0.4    0.5

Table 2: Vowel penalty and diacritic penalty values in the final models

5 Results

The performances of our systems, measured in WER, are juxtaposed with the official baseline results in Table 3. We first note that the baseline was particularly strong—gains were difficult to achieve for most languages. Our first system (Syl), which is based on syllabic information, unfortunately does not outperform the baseline. Our second system (VP), which includes additional penalties for vowels and diacritics, however, does outperform the baselines in several languages. Furthermore, the macro WER average not only outperforms the baseline, but all other submitted systems.

          WER
Language  Baseline  Syl   VP
ady       22        25    22
gre       21        22    22
ice       12        13    11
ita       19        20    22
khm       34        31    28
lav       55        58    49
mlt_latn  19        19    18
rum       10        14    10
slv       49        56    47
wel_sw    10        13    12
Average   25.1      27.1  24.1

Table 3: Comparison of test-set results based on the word error rates (WERs)

It seems that extra syllable information does not help with predictions in this particular setting. It might be the case that additional syllable boundaries increase input variability without providing much useful information with the current neural-transducer architecture. Alternatively, information about syllable boundary locations might be redundant for this set of languages. Finally, it is possible that the unsupervised nature of our syllable annotation was too noisy to aid the model. We leave these speculations as research questions for future endeavors and restrict the subsequent error analyses and discussion to the results from our vowel-penalty system.5

5 One reviewer raised a question of why only syllable boundaries, as opposed to smaller constituents, such as onsets or codas, are marked. Our hunch is that many phonological alternations happen at syllable boundaries, and that vowel length in some languages depends on whether the nucleus vowel is in a closed or open syllable. Also, given that adding syllable
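A minimal sketch of the kind of penalized objective described above (our own illustration with hypothetical names and inventories, not the actual modification to the baseline code): the per-step loss is an ordinary negative log-likelihood plus a fixed penalty whenever the mis-predicted target symbol is a vowel or a diacritic.

import math

VOWELS = {"a", "e", "i", "o", "u", "ɔ", "ɛ", "ə"}   # stand-in vowel inventory
DIACRITICS = {"ː"}                                    # e.g., the length mark

def penalized_step_loss(prob_of_gold, predicted, gold,
                        vowel_penalty, diacritic_penalty):
    # Negative log-likelihood of the gold symbol, plus extra cost when the
    # prediction is wrong and the gold symbol is a vowel or a diacritic.
    loss = -math.log(prob_of_gold)
    if predicted != gold:
        if gold in VOWELS:
            loss += vowel_penalty
        if gold in DIACRITICS:
            loss += diacritic_penalty
    return loss

# Example: the model put probability 0.4 on the gold vowel but predicted [ɔ].
print(penalized_step_loss(0.4, predicted="ɔ", gold="o",
                          vowel_penalty=0.5, diacritic_penalty=0.3))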
ady gre ice ita khm lav mlt_latn rum slv wel_sw
80
60
Count
40
20
0
base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP
Systems
Error types C-V, V-C C-C, C-ϵ, ϵ-C V-V, V-ϵ, ϵ-V
Figure 2: Distributions of error types in test-set predictions across languages. Error types are distinguished based on whether an error involves only consonants, only vowels, or both. For example, C-V means that the error is caused by a ground-truth consonant being replaced by a vowel in the prediction. C-ϵ means that it is a deletion error where the ground-truth consonant is missing in the prediction, while ϵ-C represents an insertion error where a consonant is wrongly added.
[Figure 3: three panels (Khmer vowels, Latvian vowels, Slovene vowels) plotting the vowels output by the systems baseline, ground-truth, and ours (VP); colors distinguish cases where the baseline is wrong from cases where our system is wrong.]
Figure 3: Comparison of vowels predicted by the baseline model and our best system (VP) with the ground-truth
vowels. Here we only visualize the cases where either the baseline model gives the right vowel but our system does
not, or vice versa. We do not include cases where both the baseline model and our system predict the correct vowel,
or both predict an incorrect vowel, to avoid cluttering the view. Each baseline—ground-truth—ours line represents
a set of aligned vowels in the same word; the horizontal line segment between a system and the ground-truth means
that the prediction from the system agrees with the ground-truth. Color hues are used to distinguish cases where
the prediction from the baseline is correct versus those where the prediction from our second system is correct.
Shaded areas on the plots enclose vowels of similar vowel quality.
[Figure 4: confusion matrices of predicted vs. ground-truth consonants and vowels for each language, with cell shading giving proportions from 0.00 to 1.00.]
Figure 4: Confusion matrices of vowel and consonant predictions by our second system (VP) for languages with the
test WER > 20%. Each row represents a predicted segment, with colors across columns indicating the proportion
of times the predicted segment matches individual ground-truth segments. A gray row means the segment in
question is absent in any predicted phoneme sequences but is present in at least one ground-truth sequence. The
diagonal elements represent the number of times for which the predicted segment matches the target segment,
while off-diagonal elements are those that are mis-predicted by the system. White squares are added to highlight
segment groups where mismatches are common.
ered from careful error analyses can inform the directions for potential improvements.

References

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50:434–451.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

Omnia ElSaadany and Benjamin Suter. 2020. Grapheme-to-phoneme conversion with a multilingual transformer model. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 85–89.

Daan van Esch, Mason Chua, and Kanishka Rao. 2016. Predicting pronunciations with syllabification and stress with recurrent neural networks. In Proceedings of Interspeech 2016, pages 2841–2845.

Kevin Gimpel and Noah A. Smith. 2010. Softmax-margin CRFs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 733–736.

Sittichai Jiampojamarn and Grzegorz Kondrak. 2010. Letter-phoneme alignment: An exploration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 780–788.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and Hidden Markov Models to letter-to-phoneme conversion. In Proceedings of NAACL HLT 2007, pages 372–379.

Jackson L. Lee, Lucas F. E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4223–4228.

Peter Makarov and Simon Clematide. 2018a. Imitation learning for neural morphological string transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2877–2882.

Peter Makarov and Simon Clematide. 2018b. Neural transition-based string transduction for limited-resource setting in morphology. In Proceedings of the 27th International Conference on Computational Linguistics, pages 83–93.

Peter Makarov and Simon Clematide. 2018c. UZH at CoNLL-SIGMORPHON 2018 shared task on universal morphological reinflection. In Proceedings of the CoNLL-SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 69–75.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 171–176.

Josef R. Novak, Nobuaki Minematsu, and Keikichi Hirose. 2012. WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49.

Josef Robert Novak, Nobuaki Minematsu, and Keikichi Hirose. 2015. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 22(6):907–938.

Ben Peters and André F. T. Martins. 2020. DeepSPIN at SIGMORPHON 2020: One-size-fits-all multilingual models. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 63–69.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4225–4229.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532.

Shubham Toshniwal and Karen Livescu. 2016. Jointly learning to align and convert graphemes to phonemes with neural attention models. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 76–82.

Rebecca Treiman. 1994. To what extent do orthographic units in print mirror phonological units in speech? Journal of Psycholinguistic Research, 23(1):91–110.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), pages 1–11.

Kaili Vesik, Muhammad Abdul-Mageed, and Miikka Silfverberg. 2020. One model to pronounce them all: Multilingual grapheme-to-phoneme conversion with a Transformer ensemble. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 146–152.

Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In Proceedings of Interspeech 2015, pages 3330–3334.

Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-Tóth. 2019. Transformer based grapheme-to-phoneme conversion. In Proceedings of Interspeech 2019, pages 2095–2099.
Avengers, Ensemble! Benefits of ensembling in
grapheme-to-phoneme prediction
Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Fred Mailhot∗,
Shreekantha Nadig† , Riqiang Wang, Nathan Zhang
2.1 Data-related challenges

Wiktionary is an open, collaborative, public effort to create a free dictionary in multiple languages. Anyone can create an account and add or amend words, pronunciations, etymological information, etc. As with most user-generated content, this is a noisy method of data creation and annotation.

Even setting aside the theory-laden question of when or whether a given word should be counted as English,⁶ the open nature of Wiktionary means that speakers of different variants or dialects of English may submit varying or conflicting pronunciations for sets of words. For example, some transcriptions indicate that the users who input them had the cot/caught merger while others do not; in the training data "cot" is transcribed /k ɑ t/ and "caught" is transcribed /k ɔ t/, indicating a split, but "aughts" is transcribed as /ɑ t s/, indicating a merger. There is also variation in the narrowness of transcription. For example, some transcriptions include aspiration on stressed-syllable-initial stops while others do not, cf. "kill" /kʰ ɪ l/ and "killer" /k ɪ l ɚ/.

Typically, the set of English phonemes is taken to number somewhere between 38 and 45, depending on variant/dialect (McMahon, 2002). In exploring the training data, we found a total of 124 symbols in the training set transcriptions, many of which only appeared in a small set (1–5) of transcriptions. To reduce the effect of this long tail of infrequent symbols, we normalized the training set.

The main source of symbols in the long tail was variation in the broadness of transcription: vowels were sometimes but not always transcribed with nasalization before a nasal consonant, aspiration on word-initial voiceless stops was inconsistently indicated, phonetic length was occasionally indicated, etc. There were also some cases of erroneous transcription that we uncovered by looking at the lowest-frequency phones and the word-pronunciation pairs where they appeared. For instance, the IPA /j/ was transcribed as /y/ twice, the voiced alveolar approximant /ɹ/ was mistranscribed as the trill /r/ over 200 times, and we found a handful of issues where a phone was transcribed with a Unicode symbol not used in the IPA at all.

Most of these were cases where the rare variant was at least two orders of magnitude less frequent than the common variant of the symbol. There was, however, one class of sounds where the variation was less dramatically skewed; the consonants /m/, /n/, and /l/ appeared in unstressed syllables following schwa (/əm/, /ən/, /əl/) roughly one order of magnitude more frequently than their syllabic counterparts (/m̩/, /n̩/, /l̩/), and we opted not to normalize these. If we had normalized the syllabic variants, it would have resulted in more consistent g2p output, but it would likely also have penalized our performance on the uncleaned test set.⁷ In the end, our training data contained 47 phones (plus end-of-sequence and UNK symbols for some models).

3 Models

We trained and evaluated several models for this task, both publicly available, in-house, and custom developed, along with various ensembling permutations. In the end, we submitted three sets of baseline-beating results. The organizers assigned sequential identifiers to multiple submissions (e.g. Dialpad-N); we include these in the discussion of our entries below, for ease of subsequent reference.

3.1 The Dialpad model (Dialpad-2)

Dialpad uses a g2p system internally for scalable generation of novel lexicon additions. We were motivated to enter this shared task as a means of assessing potential areas of improvement for our system; in order to do so we needed to assess our own performance as a baseline.

This model is a simple majority-vote ensemble of 3 existing publicly available g2p systems: Phonetisaurus (Novak et al., 2012), a WFST-based model, Sequitur (Bisani and Ney, 2008), a joint-sequence model trained via EM, and a neural sequence-to-sequence model developed at CMU as part of the CMUSphinx⁸

⁶ E.g., the training data included the arguably French word-pronunciation pair: embonpoint /ɑ̃ b ɔ̃ p w ɛ̃/
⁷ Although the possibility also exists that one or more of our models would have found and exploited contextual cues that weren't obvious to us by inspection.
⁸ https://cmusphinx.github.io
toolkit (see subsection 3.2). As Dialpad uses a proprietary lexicon and phoneset internally, we retrained all three models on the cleaned version of the shared task training data, retaining default hyperparameters and architectures.

In the end, this ensemble achieved a test set WER of 41.72, narrowly beating the baseline (results are discussed in more depth in Section 4).

3.2 A strong standalone model: CMUSphinx g2p-seq2seq (Dialpad-3)

CMUSphinx is a set of open systems and tools for speech science developed at Carnegie Mellon University, including a g2p system.⁹ It is a neural sequence-to-sequence model (Sutskever et al., 2014) that is Transformer-based (Vaswani et al., 2017), written in Tensorflow (Abadi et al., 2015). A pre-trained 3-layer model is available for download, but it is trained on a dictionary that uses ARPABET, a substantially different phoneset from the IPA used in this challenge. For this reason we retrained a model from scratch on the cleaned version of the training data.

This model achieved a test set WER of 41.58, again narrowly beating the baseline. Interestingly, this outperformed the Dialpad model which incorporates it, suggesting that Phonetisaurus and Sequitur add more noise than signal to predicted outputs, to say nothing of increased computational resources and training time. More generally, this points to the CMUSphinx seq2seq model as a simple and strong baseline against which future g2p research should be assessed.

3.3 A large ensemble (Dialpad-1)

In the interest of seeing what results could be achieved via further naive ensembling, our final submission was a large ensemble, comprising two variations on the baseline model, the Dialpad-2 ensemble discussed above, and two additional seq2seq models, one using LSTMs and the other Transformer-based. The latter additionally incorporated a sub-word extraction method designed to bias a model's input-output mapping toward "good" grapheme-phoneme correspondences.

The method of ensembling for this model is word-level majority-vote ensembling. We select the most common prediction when there is a majority prediction (i.e. one prediction has more votes than all of the others). If there is a tie, we pick the prediction that was generated by the best standalone model with respect to each model's performance on the development set.

This collection of models achieved a test set WER of 37.43, a 10.75% relative reduction in WER over the baseline model. As shown in Table 1, although a majority of the component models did not outperform the baseline, there was sufficient agreement across different examples that a simple majority voting scheme was able to leverage the models' varying strengths effectively. We discuss the components and their individual performance below and in Section 4.

3.3.1 Baseline variations

The "foundation" of our ensemble was the default baseline model (Makarov and Clematide, 2018), which we trained using the raw data and default settings in order to reflect the baseline performance published by the organization. We included this in order to individually assess the effect of additional models on overall performance.

In addition to this default base, we added a larger version of the same model, for which we increased the number of encoder and decoder layers from 1 to 3, and increased the hidden dimensions from 200 to 400.

3.3.2 biLSTM+attention seq2seq

We conducted experiments with an RNN seq2seq model, comprising a biLSTM encoder, LSTM decoder, and dot-product attention.¹⁰ We conducted several rounds of hyperparameter optimization over layer sizes, optimizer, and learning rate. Although none of these models outperformed the baseline, a small network (16-d embeddings, 128-d LSTM layers) proved to be efficiently trainable (2 CPU-hours) and improved the ensemble results, so it was included.

⁹ https://github.com/cmusphinx/g2p-seq2seq
¹⁰ We used the DyNet toolkit (Neubig et al., 2017) for these experiments.
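The word-level majority-vote scheme with dev-set tie-breaking described in Section 3.3 can be pictured with a short sketch. This is not the authors' code; the model names, outputs, and dev-set ranking below are invented for illustration.

```python
from collections import Counter

def majority_vote(predictions, dev_ranking):
    """predictions: dict mapping model name -> predicted pronunciation (string).
    dev_ranking: model names sorted from best to worst dev-set WER.
    Returns the ensemble prediction for a single word."""
    counts = Counter(predictions.values())
    best, best_count = counts.most_common(1)[0]
    # Strict majority: exactly one prediction has the highest vote count.
    if sum(1 for c in counts.values() if c == best_count) == 1:
        return best
    # Tie: fall back to the output of the best standalone model on the dev set.
    return predictions[dev_ranking[0]]

# Hypothetical usage with made-up outputs:
preds = {"phonetisaurus": "k æ t", "sequitur": "k ɑ t", "g2p_seq2seq": "k æ t"}
print(majority_vote(preds, ["g2p_seq2seq", "phonetisaurus", "sequitur"]))  # "k æ t"
```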
3.3.3 PAS2P: Pronunciation-assisted sub-words to phonemes

Sub-word segmentation is widely used in ASR and neural machine translation tasks, as it reduces the cardinality of the search space over word-based models, and mitigates the issue of OOVs. Use of sub-words for g2p tasks has been explored; e.g., Reddy and Goldsmith (2010) develop an MDL-based approach to extracting sub-word units for the task of g2p. Recently, a pronunciation-assisted sub-word model (PASM) (Xu et al., 2019) was shown to improve the performance of ASR models. We experimented with pronunciation-assisted sub-words to phonemes (PAS2P), leveraging the training data and a reparameterization of the IBM Model 2 aligner (Brown et al., 1993) dubbed fast_align (Dyer et al., 2013).¹¹

The alignment model is used to find an alignment of sequences of graphemes to their corresponding phonemes. We follow a similar process as Xu et al. (2019) to find the consistent grapheme-phoneme pairs and refine those pairs for the PASM model. We also collect grapheme sequence statistics and marginalize them by summing up the counts of each type of grapheme sequence over all possible types of phoneme sequences. These counts are the weights of each sub-word sequence.

Given a word and the weights for each sub-word, the segmentation process is a search problem over all possible sub-word segmentations of that word. We solve this search problem by building weighted FSTs¹² of a given word and the sub-word vocabulary, and finding the best path through this lattice. For example, the word "thoughtfulness" would be segmented by PASM as "th_ough_t_f_u_l_n_e_ss", and this would be used as the input in the PAS2P model rather than the full sequence of individual graphemes.

Finally, the PAS2P transducer is a Transformer-based sequence-to-sequence model trained using the ESPnet end-to-end speech processing toolkit (Watanabe et al., 2018), with pronunciation-assisted sub-words as inputs and phones as outputs. The model has 6 layers of encoder and decoder with 2048 units, and 4 attention heads with 256 units. We use dropout with a probability of 0.1 and label smoothing with a weight of 0.1 to regularize the model. This model achieved WERs of 44.84 and 43.40 on the development and test sets, respectively.

4 Results

Our main results are shown in Table 1, where we show both dev and test set WER for each individual model in addition to the submitted ensembles. In particular, we can see that many of the ensemble components do not beat the baseline WER, but nonetheless serve to improve the ensembled models.

Model                   dev     test
Dialpad-3               43.30   41.58
PAS2P                   44.84   43.40
Baseline (large)        44.99   41.65
Baseline (organizer)    45.13   41.94
Phonetisaurus           45.44   43.88
Baseline (raw data)     45.92   41.70
Sequitur                46.69   43.86
biLSTM seq2seq          47.89   44.05
Dialpad-2               43.83   41.72
Dialpad-1               40.12   37.43

Table 1: Results for components of ensembles, and submitted models/ensembles (bolded).

5 Additional experiments

We experimented with different ensembles and found that incorporating models with different architectures generally improves overall performance. In the standalone results, only the top three models beat the baseline WER, but adding additional models with higher WER than the baseline continues to reduce overall WER. Table 2 shows the effect of this progressive ensembling, from our top-3 models to our top-7 (i.e. the ensemble for the Dialpad-1 model).

5.1 Edit distance-based voting

In addition to varying our ensemble sizes and components, we investigated a different ensemble voting scheme, in which ties are broken using edit distance when there is no 1-best majority option. That is, in the event of

¹¹ https://github.com/clab/fast_align
¹² We use Pynini (Gorman, 2016) for this.
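The segmentation search described in Section 3.3.3 is solved in the paper with weighted FSTs built with Pynini; as a rough stand-in, the sketch below finds the highest-weight segmentation with simple dynamic programming. The sub-word vocabulary and weights are made up, and this is not the authors' implementation.

```python
import math

def best_segmentation(word, subword_weights):
    """Return the highest-scoring segmentation of `word`, where each candidate
    sub-word contributes log(weight). A stand-in for the weighted-FST best-path
    search used in the paper."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)   # best[i] = (score, back-pointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in subword_weights and best[j][0] > -math.inf:
                score = best[j][0] + math.log(subword_weights[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    pieces, i = [], n
    while i > 0:                            # recover the segmentation
        j = best[i][1]
        pieces.append(word[j:i])
        i = j
    return "_".join(reversed(pieces))

# Toy vocabulary with made-up counts (single letters guarantee a path exists).
vocab = {c: 1 for c in "abcdefghijklmnopqrstuvwxyz"}
vocab.update({"th": 50, "ough": 80, "ss": 30})
print(best_segmentation("thoughtfulness", vocab))  # th_ough_t_f_u_l_n_e_ss
```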
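The description of the edit-distance-based voting scheme in Section 5.1 is cut off in this excerpt, so the following sketch is only one plausible reading of it, not the method confirmed by the paper: when no prediction wins a strict majority, pick the tied candidate with the smallest total edit distance to the full set of system outputs.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences (strings or lists)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n]

def break_tie_by_edit_distance(tied_candidates, all_predictions):
    """Assumed tie-break rule: choose the tied candidate that is closest, by
    total edit distance, to the predictions of all component systems."""
    return min(tied_candidates,
               key=lambda c: sum(edit_distance(c, p) for p in all_predictions))
```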
Model                        dev     test
Ensemble-top3                41.10   39.71
Ensemble-top4                40.74   38.89
Ensemble-top5                40.50   38.12
Ensemble-top6                40.31   37.69
Ensemble-top7 (Dialpad-1)    40.12   37.43

Table 2: Dev and test WER for progressively larger ensembles, from top-3 to top-7 (Dialpad-1).

These represent massive performance improvements (approx. 15% absolute, or 37% relative, WER reduction), and suggest refinement of our output selection/voting method (perhaps via some kind of confidence weighting) could lead to much-improved results.
"acres" (/e ɪ k ɚ z/) rhymes with "degrees", and that "beret" has a /t/ sound in it. In each of these cases, there were either not enough samples in the training set to reliably learn the relevant grapheme-phoneme correspondence, or else a conflicting (but correct) correspondence was over-represented in the training data.

7 Conclusion

We presented and discussed three g2p systems submitted for the SIGMORPHON 2021 English-only shared sub-task. In addition to finding a strong off-the-shelf contender, we show that naive ensembling remains a strong strategy in supervised learning, of which g2p is a sub-domain, and that simple majority-voting schemes in classification can often leverage the respective strengths of sub-optimal component models, especially when diverse architectures are combined. Additionally, we provided more evidence for the usefulness of linguistically-informed sub-word modeling as an input transformation on speech-related tasks.

We also discussed additional experiments whose results were not submitted, indicating the benefit of exploring top-N model vs. ensemble trade-offs, and demonstrating the potential benefit of an edit-distance based tie-breaking method for ensemble voting.

Future work includes further search for the optimal trade-off between ensemble size and performance, as well as additional exploration of the edit-distance voting scheme, and more sophisticated ensembling/voting methods, e.g. majority voting at the phone level on aligned outputs.

References

Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Kyle Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 75–80, Berlin, Germany. Association for Computational Linguistics.

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50, Online. Association for Computational Linguistics.
Peter Makarov and Simon Clematide. 2018. Imitation learning for neural morphological string transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2877–2882, Brussels, Belgium. Association for Computational Linguistics.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 171–176, Online. Association for Computational Linguistics.

April McMahon. 2002. An Introduction to English Phonology. Edinburgh University Press, Edinburgh, U.K.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech 2018, pages 2207–2211.

Hainan Xu, Shuoyang Ding, and Shinji Watanabe. 2019. Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7110–7114.
CLUZH at SIGMORPHON 2021 Shared Task on Multilingual
Grapheme-to-Phoneme Conversion: Variations on a Baseline
[Figure 2: a single-state probabilistic FST with self-loop transitions Σ:ϵ / p(DEL(Σ)), ϵ:Ω / p(INS(Ω)), and Σ:Ω / p(SUB(Σ, Ω)), and final weight p(#).]

Figure 2: Stochastic edit distance (Ristad and Yianilos, 1998): A memoryless probabilistic FST. Σ and Ω stand for any input and output symbol, respectively. Transition weights are to the right of the slash and p(#) is the final weight.

applying character-by-character substitutions. An extreme case is Georgian, which features an almost deterministic one-to-one mapping between graphemes and IPA segments that can be learned almost perfectly from little training data.²

The main goal of our submission was to test whether our last year's system, which is the baseline for this year's G2P challenge, already exhausts the potential of its architecture, or whether changes to the output representation (IPA segments vs. IPA Unicode codepoints; input character dropout), to the LSTM decoder (the mogrifier steps and the additional input of the attended character), to the BiLSTM encoder (parallel encoders), or to other hyper-parameter settings (adaptive batch size) can improve the results without replacing the LSTM-based encoder/decoder setup by a Transformer-based architecture (see e.g. Wu et al. (2021) for Transformer-based state-of-the-art results).

2 Model description

The model defines a conditional distribution over substitution, insertion and deletion edits p_θ(a | x) = ∏_{j=1}^{|a|} p_θ(a_j | a_{<j}, x), where x = x_1 ... x_{|x|} is an input sequence of graphemes and a = a_1 ... a_{|a|} is an edit action sequence. The output sequence of IPA symbols y is deterministically computed from x and a. The model is equipped with an LSTM decoder and a bidirectional LSTM encoder (Graves and Schmidhuber, 2005). At each decoding step j, the model attends to a single grapheme x_i. The attention steps monotonically through the input sequence, steered by the edits that consume input (e.g. a deletion shifts the attention to the next grapheme x_{i+1}).

The imitation learning algorithm relies on an expert policy for suggesting intuitive and appropriate character substitution, insertion and deletion actions. For instance, for the data sample кит ↦ /kʲit/ (Russian: "whale"), we would like the following most natural edit sequence to attain the lowest cost: SUBS[k], INS[ʲ], SUBS[i], SUBS[t]. The cost function for these actions is estimated by fitting a Stochastic Edit Distance (SED) model (Ristad and Yianilos, 1998) on the training data, which is a memoryless weighted finite-state transducer shown in Figure 2. The resulting SED model is integrated into the expert policy, the SED policy, which uses Viterbi decoding to compute optimal edit action sequences for any point in the action search space: given a transducer configuration of partially processed input, find the best edit actions to generate the remaining target sequence suffix. During training, an aggressive exploration schedule p_sampling(i) = 1 / (1 + exp(i)), where i is the training epoch number, exposes the model to configurations sampled by executing edit actions from the model. For an extended description of the SED policy and IL training, we refer the reader to last year's system description paper (Makarov and Clematide, 2020).

2.1 Changes to the baseline model

This section describes the changes that we implemented in our submissions.

IPA segments vs. IPA Unicode characters: Emitting IPA segments in one action (including their whitespace delimiter), e.g. SUBS[kʲ•] for the Russian example from above,³ instead of producing the same output by three actions SUBS[k], INS[ʲ], INS[•], reduces the number of action predictions (and potential errors) considerably, which is beneficial. On the other hand, this might lead to larger action vocabularies and sparse training distributions. Therefore, we experimented with character (CHAR) and IPA segment (SEG) edit actions in our submission. Table 1 shows statistics on the resulting vocabulary sizes if CHAR or SEG actions are used. Some caution is needed, though, because some segments might only appear once in the training data; e.g. English has an IPA segment s:: that only appears in the word "psst".

Input character dropout: To prevent the model from memorizing the training set and to force it to learn about syllable contexts, we randomly replace an input character with the UNK symbol according to a linearly decaying schedule.⁴

² Even a reduced training set of only 100 items allows a single model to achieve over 90% accuracy on the Georgian test set.
³ • denotes the whitespace symbol.
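To make the factorization p_θ(a | x) = ∏_j p_θ(a_j | a_<j, x) and the memoryless SED expert more concrete, the sketch below scores a toy edit-action sequence for the кит ↦ /kʲit/ example and implements the exploration schedule p_sampling(i) = 1/(1 + exp(i)). All probabilities are invented for illustration; in the real model the action probabilities are produced by the neural transducer (conditioned on history and input), not a lookup table.

```python
import math

# Toy scoring in the spirit of the memoryless SED model of Figure 2: the score
# of an edit-action sequence is the product of per-action probabilities and the
# final weight p(#). All numbers below are made up.
ACTION_PROB = {
    ("SUB", "к", "k"): 0.30,
    ("INS", "ʲ"):      0.05,
    ("SUB", "и", "i"): 0.25,
    ("SUB", "т", "t"): 0.30,
}
P_FINAL = 0.60  # p(#), the final weight

def sequence_log_prob(actions):
    """log p(a) = sum_j log p(a_j) + log p(#) for a memoryless model."""
    return sum(math.log(ACTION_PROB[a]) for a in actions) + math.log(P_FINAL)

# The "natural" edit sequence for кит -> /kʲit/: SUB[k], INS[ʲ], SUB[i], SUB[t].
edits = [("SUB", "к", "k"), ("INS", "ʲ"), ("SUB", "и", "i"), ("SUB", "т", "t")]
print(sequence_log_prob(edits))

def p_sampling(epoch):
    """Exploration schedule used during imitation learning: 1 / (1 + exp(i))."""
    return 1.0 / (1.0 + math.exp(epoch))
```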
S   Language    NFD<    SEG    C_NFC   C_NFD
L   ady         0.5%    67     37      37
L   gre         4.3%    33     33      33
L   ice         30.3%   60     36      36
L   ita         0.8%    32     29      29
L   khm         0.5%    47     36      34
L   lav         12.4%   73     51      36
L   mlt_latn    9.0%    41     29      29
L   rum         0.3%    45     31      31
L   slv         4.3%    48     38      30
L   wel_sw      2.4%    43     37      37
M   arm_e       0.0%    54     31      31
M   bul         3.5%    46     34      34
M   dut         0.8%    49     39      39
M   fre         0.1%    39     36      36
M   geo         0.0%    33     27      27
M   hbs_latn    3.7%    63     43      33
M   hun         42.5%   66     37      37
M   jpn_hira    36.1%   64     42      39
M   kor         99.8%   60     46      46
M   vie_hanoi   88.2%   49     44      44
H   eng_us      0.0%    124    83      80
    Average     16.2%   54.1   39.0    37.0

Table 1: Statistics on Unicode normalization for low (L), medium (M), and high (H) settings (column S). Column NFD< specifies the percentage of training items where NFD-normalized graphemes had a smaller length difference to phonemes than in NFC normalization. Column SEG gives the vocabulary size of IPA segments (the counts are the same for NFC and NFD). Column C_NFC reports the phoneme vocabulary size in NFC Unicode characters (CHAR) and C_NFD in NFD.

Mogrifier LSTM decoder: Mogrifier LSTMs (Melis et al., 2019) iteratively and mutually update the hidden state of the previous time step with the current input before feeding the modified hidden state and input into a standard LSTM cell. On language modeling tasks with smaller corpora, this technique closed the gap between LSTM and Transformer models. We apply a standard mogrifier with 5 rounds of updates in our experiments. We expect the mogrifier decoder to profit from IPA segmentation because in this setup the decoder mogrifies neighboring IPA phoneme segments and not space and IPA characters.

Enriching the decoder input with the currently attended input character: The auto-regressive decoder of the baseline system uses the LSTM decoder output of the previous time step and the BiLSTM-encoded representation of the currently attended input character as input. Intuitively, by feeding the input character embedding directly into the decoder (as a kind of skip connection), we want to liberate the BiLSTM encoder from transporting the hard attention information to the decoder, thereby motivating the sequence encoder to focus more on the contextualization of the input character.

Multiple parallel BiLSTM encoders: Convolutional encoders typically use many convolutional filters for representation learning, and Transformer encoders similarly feature multi-head attention. Using several LSTM encoders in parallel has been proposed by Zhu et al. (2017) for language modeling and translation and was, e.g., also successfully used for named entity recognition (Žukov-Gregorič et al., 2018). Technically, the same input is fed through several smaller LSTMs, each with its own parameter set, and then their output is concatenated for each time step. The idea behind parallel LSTM encoders is to provide a more robust, ensemble-style encoding with lower variance between models. For our submission, there was not enough time to systematically tune the input and hidden state sizes as well as the number of parallel LSTMs.

Adaptive batch size scheduler: We combine the ideas of "Don't Decay the Learning Rate, Increase the Batch Size" (Smith et al., 2017) and cyclical learning schedules by dynamically enlarging or reducing the batch size according to development set accuracy: starting with a defined minimal batch size threshold m, the batch size for the next epoch is set to ⌊m − 0.5⌋ if the development set performance improved, or ⌊m + 0.5⌋ otherwise.⁵ If a predefined maximum batch size is reached, the batch size is reset in one step to the minimum threshold. The motivation for the reset comes from empirical observations that going back to a small batch size can help overcome local optima. With larger training sets, we subsample the training sets per epoch randomly in order to have a more dynamic behavior.⁶

⁴ For all experiments, we start with a probability of 50% for UNKing a character in a word and reduce this rate over 10 epochs to a minimal probability of 1%. Light experimentation on a few languages led to this cautious setting, which might leave room for further improvement.
⁵ See also the recent discussion on learning rates and batch sizes by Wu et al. (2021).
⁶ The subsample size is set to 3,000 items per epoch in all our experiments.
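The mogrifier update mentioned above can be summarized in a few lines. The sketch below follows the alternating, sigmoid-gated updates of Melis et al. (2019); the dimensions and random weights are purely illustrative, and this is not the submitted system's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h_prev, Q, R, rounds=5):
    """Mutually gate the LSTM input x and previous hidden state h_prev for a
    number of rounds before the ordinary LSTM cell is applied (Melis et al., 2019).
    Q[i] maps h -> gate over x; R[i] maps x -> gate over h."""
    for i in range(rounds):
        if i % 2 == 0:
            x = 2.0 * sigmoid(Q[i // 2] @ h_prev) * x
        else:
            h_prev = 2.0 * sigmoid(R[i // 2] @ x) * h_prev
    return x, h_prev

# Toy dimensions: 4-d input embedding, 6-d hidden state, 5 mogrifier rounds.
rng = np.random.default_rng(0)
Q = [rng.normal(size=(4, 6)) for _ in range(3)]   # used on rounds 0, 2, 4
R = [rng.normal(size=(6, 4)) for _ in range(2)]   # used on rounds 1, 3
x, h = mogrify(rng.normal(size=4), rng.normal(size=6), Q, R)
```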
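The adaptive batch-size schedule is small enough to sketch directly. The snippet below is one possible reading of the update rule quoted above (a float state nudged by 0.5 per epoch, floored to obtain the batch size, with a reset once the maximum is reached); the minimum of 3 and maximum of 10 are the low-setting values reported in Section 3, and everything else is assumed.

```python
import math

class BatchSizeScheduler:
    """One possible reading of the adaptive batch-size schedule described above;
    not the authors' implementation."""

    def __init__(self, min_size=3, max_size=10):
        self.min_size, self.max_size = min_size, max_size
        self.m = float(min_size)

    def step(self, dev_improved):
        # Shrink slightly after a dev-set improvement, grow slightly otherwise.
        self.m = self.m - 0.5 if dev_improved else self.m + 0.5
        if self.m >= self.max_size:          # cyclical restart at the maximum
            self.m = float(self.min_size)
        self.m = max(self.m, float(self.min_size))
        return math.floor(self.m)            # batch size used in the next epoch
```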
2.2 Unicode normalization

For some writing systems, e.g. for Korean or Vietnamese, applying Unicode NFD normalization to the input has a great impact on the input sequence length and consequently on the G2P character correspondences. The decomposition of diacritics and other composing characters for all languages, as performed in the baseline, has the disadvantage of longer input sequences. We apply a simple heuristic to decide on NFD normalization based on a criterion for the minimum length distance between graphemes and phonemes: if more than 50% of the training grapheme sequences in NFD normalization have a smaller length difference compared to the phoneme sequence than their corresponding NFC variants, then NFD normalization is applied. See Table 1 for statistics, which indicate a preference for NFD for only 2 languages.

3 Submission details

Modifications such as mogrifier LSTMs, additional input character skip connections, or parallel encoders increase the number of model parameters and make it difficult to compare the baseline system directly with its variants. Additionally, we did not have enough time before the submission to systematically explore and fine-tune for the best combination of model modifications and hyper-parameters. In the end, after some light experimentation we had to stick to settings that might not be optimal.

We train separate models for each language on the official training data and use the development set exclusively for model selection. As beam decoding for mogrifier models sometimes suffered compared to greedy decoding, we built all ensembles from greedy model predictions. Like the baseline system (B), we train the SED model for 10 epochs, use one-layer LSTMs, hidden state dimension 200 for the decoder LSTMs, and action embedding dimension 100. For the low (L) and medium (M) settings, we have the following specific hyper-parameters:

• patience: 12 (B), 24 (L), 18 (M)
• maximal epochs: 60 (B), 80 (L/M)
• minimal batch size:⁷ 3 (L), 5 (M)
• maximal batch size: 10 (L/M)
• character embedding dimension:⁸ 100 (B), 50 (L/M)
• LSTM encoder hidden state dimension: 200 (B), 300 (L/M) divided by 6 parallel encoders.

We submit 3 ensemble runs for the low setting:
CLUZH-1: 15 models with CHAR input,
CLUZH-2: 15 models with SEG input,
CLUZH-3: 30 models with CHAR or SEG input.

We submit 4 ensemble runs for the medium setting:
CLUZH-4: 5 models with CHAR input,
CLUZH-5: 10 models with SEG input,
CLUZH-6: 5 models with SEG input,
CLUZH-7: 15 models with CHAR or SEG input.

Due to a configuration error, medium results were actually computed without two add-ons: mogrifier LSTMs and the additional input character. In post-submission experiments, we computed runs that enabled these features and report their results as well (CLUZH-4m/5m).

4 Results and discussion

Table 2 shows a comparison of results for the low setting. We report the development and test set average word error rate (WER) performance to illustrate the sometimes dramatic differences between these sets (e.g. Greek). Both runs containing CHAR-action-emitting models (CLUZH-1, CLUZH-3) have the second-best results (the best system reaches 24.1). The SEG models with IPA segmentation actions excel on some languages (Adyghe, Latvian), but fail badly on Slovene and Maltese. Only for Romanian and Italian do we see an improvement for the 30-strong mixed ensemble. The expectation that the size difference between the SEG and CHAR vocabulary correlates with language-specific performance differences cannot be confirmed given the numbers in Table 1. E.g., Latvian features 73 different IPA segments but only 51 IPA characters; still, the SEG variant shows only 49% WER.

Table 3 shows a comparison of results for the medium setting. We report selected development and test set average performance to illustrate that also in this larger setting, the expectation of a slightly higher development set performance does not always hold (e.g. Korean or Japanese). On the other hand, Bulgarian and Dutch have a sharp increase in errors on the test set compared to the development set. The comparison between runs with the mogrifier LSTM decoder and the attended character input (CLUZH-Nm) or without (C-N) suggests that these changes are not beneficial. In the medium setting, C-4 (CHAR) and C-6 (SEG) can be directly

⁷ The baseline system's batch size is 5.
⁸ The motivation for lowering the character embedding size comes from adding the input character to the mogrifier decoder LSTM, which increases the parameter size for each of the 5 update weight matrices.
LNG | CLUZH-1 (CHAR): dev, test (avg), sd, test (E) | CLUZH-2 (SEG): dev, test (avg), sd, test (E) | C-3: test (E) | OUR BASELINE: dev, test (avg), sd, test (E) | BSL: test | Other: test
ady 25.0 27.8 3.3 24 25.6 26.2 1.8 22 22 26 25.2 2.8 21 22 22
gre 6.5 22.2 2.3 20 5.1 22.8 2.8 22 20 5 26.0 3.3 25 21 21
ice 14.8 12.4 2.4 10 16.1 14.5 2.2 12 10 21 15.8 2.1 12 12 11
ita 24.5 27.0 2.2 23 24.4 26.3 3.2 24 21 25 22.7 3.5 19 19 20
khm 39.8 38.2 3.4 32 40.3 36.9 2.2 33 32 39 40.4 2.5 34 34 28
lav 47.2 53.7 2.8 53 46.9 55.3 3.7 49 49 44 56.5 2.2 54 55 49
mlt 17.0 18.0 2.4 12 19.7 21.2 2.9 16 14 23 21.8 5.1 17 19 18
rum 11.1 13.7 1.8 13 10.3 14.1 1.0 13 12 11 12.5 2.1 10 10 10
slv 46.4 56.4 2.7 50 48 60.2 3.4 59 55 44 54.2 2.1 51 49 47
wel 18.0 14.9 3.5 10 15.6 15.7 1.8 13 12 19 14.8 2.0 12 10 12
AVG 25.0 28.4 2.7 24.7 25.2 29.3 2.5 26.3 24.7 25.7 29.0 2.8 25.5 25.1 23.8
Table 2: Overview of the dev and test results in the low setting. C-3 is CLUZH-3 ensemble. OUR BASELINE
shows the results for our own run of the baseline configuration. They are different from the official baseline results
(BSL) due to different random seeds. Column sd always reports the test set standard deviation. E means ensemble
results.
LNG | C-4: test (E5) | CLUZH-4m (CHAR): dev, test (avg), sd, test (E5) | C-5: test (E10) | CLUZH-5m (SEG): dev, test (avg), sd, test (E10) | C-5l: test (E10) | C-6: test (E5) | C-7: test (E15) | OUR BASELINE: dev, test (avg), sd, test (E10) | BSL: test (E10)
arm 7.1 5.4 7.9 0.7 6.4 6.6 5.1 7.2 0.5 6.2 7.1 6.6 6.4 5.8 7.8 0.7 6.5 7.0
bul 20.1 12.2 20.4 2.0 19.9 19.2 11.9 23.3 2.1 22.4 16.2 18.8 19.7 12.5 19.7 1.7 19.3 18.3
dut 15.0 13.1 18.3 1.2 14.8 14.9 12.4 16.8 0.6 14.6 14.5 15.6 14.7 13.1 17.7 1.3 14.3 14.7
fre 7.5 8.4 9.7 0.6 8.2 7.5 8.5 9.5 0.7 8.1 8.1 7.5 7.6 8.9 9.1 0.5 7.8 8.5
geo .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0
hbs 38.4 43.2 44.5 1.1 39.1 35.6 42.4 44.3 1.5 36.8 35.7 37.0 35.3 39.1 38.9 1.2 33.6 32.1
hun 1.5 1.8 1.8 0.1 1.6 1.2 1.7 1.5 0.3 1.0 0.9 1.0 1.0 1.7 2.0 0.3 1.8 1.8
jpn 5.9 6.9 6.8 0.2 5.5 5.3 6.8 6.5 0.3 5.4 5.2 5.5 5.0 6.8 6.4 0.5 5.5 5.2
kor 16.2 21.3 18.6 0.7 17.4 16.9 19.6 18.3 0.8 16.2 16.1 17.2 16.3 20.4 18.9 0.8 16.5 16.3
vie 2.3 1.2 2.4 0.1 2.3 2.0 1.2 2.1 0.1 2.1 2.2 2.1 2.0 1.4 2.5 0.2 2.4 2.5
AVG 11.4 11.4 13.0 0.7 11.5 10.9 11.0 12.9 0.7 11.3 10.6 11.1 10.8 11.0 12.3 0.7 10.8 10.6
Table 3: Overview of the development and test results in the medium setting. C-N is CLUZH-N ensemble. CLUZH-
Nm runs use the mogrifier decoder and additional input character in the decoder (these are post-submission runs). C-5l
uses larger parameterization and reaches WER 10.60 (BSL: 10.64). OUR BASELINE shows the results for our
own run of the baseline configuration. Boldface indicates best performance in official shared task runs; underline
marks the best performance in post-submission configurations. Column sd always reports the test set standard
deviation. En means n-strong ensemble results.
compared because they feature the same ensemble size: the results suggest that IPA segmentation (SEG) for higher-resource settings (and the specific medium languages) seems to be slightly better than CHAR. C-5l is a post-submission run with a larger parametrization.⁹ This post-submission ensemble outperforms the baseline system by a small margin, but still struggles with Serbo-Croatian (hbs) compared to the official baseline results.

In a post-submission experiment on the high setting, we built a large¹⁰ 5-strong SEG-based ensemble. It achieves an impressively low word error rate of 38.7 compared to the official baseline (41.94) and the best other submission (37.43).

Future work: Performance variance between different runs of our LSTM-based architecture makes it difficult to reliably assess the actual usefulness of the small architectural changes; extensive experimentation, e.g. in the spirit of Reimers and Gurevych (2017), is needed for that. One should also investigate the impact of the official data set splits: the observed differences between the development set and test set performance in the low

⁹ Three parallel encoders with 200 hidden units each; character embedding dimension of 200; no mogrifier; no input character added to the decoder.
¹⁰ Character embedding dimension: 200; action embedding dimension: 100; 10 parallel encoders with hidden state dimension 100; decoder hidden state dimension: 500; minimal batch size: 5; maximal batch size: 20; epochs: 200 (subsampled to 3,000 items); patience: 24; no mogrifier; no input character added to the decoder.
setting for Slovene or Greek are extreme. Cross-validation experiments might help assess the true difficulty of the WikiPron datasets.

5 Conclusion

This paper presents the approach taken by the CLUZH team to solving the SIGMORPHON 2021 Multilingual Grapheme-to-Phoneme Conversion challenge. Our submission for the low and medium settings is based on our successful SIGMORPHON 2020 system, which is a majority-vote ensemble of neural transducers trained with imitation learning. We add several modifications to the existing LSTM architecture and experiment with IPA segment vs. IPA character action predictions. For the low-setting languages, our IPA character-based run outperforms the baseline and ranks second overall. The average performance of segment-based action edits suffers from performance outliers for certain languages. For the medium-setting languages, we note small improvements on some languages, but the overall performance is lower than the baseline. Using a mogrifier LSTM decoder and enriching the encoder input with the currently attended input character did not improve performance in the medium setting. Post-submission experiments suggest that network capacity for the submitted systems was too small. A post-submission run for the high setting shows considerable improvement over the baseline.

References

Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In ACL.

Lucas F.E. Ashby, Travis M. Bartley, Simon Clematide, Luca Del Signore, Cameron Gibson, Kyle Gorman, Yeonju Lee-Sikka, Peter Makarov, Aidan Malanoski, Sean Miller, Omar Ortiz, Reuben Raff, Arundhati Sengupta, Bora Seo, Yulia Spektor, and Winnie Yan. 2021. Results of the Second SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5).

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation modeling with WikiPron. In LREC.

Peter Makarov and Simon Clematide. 2018. Imitation learning for neural morphological string transduction. In EMNLP.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Gábor Melis, Tomáš Kočiský, and Phil Blunsom. 2019. Mogrifier LSTM. CoRR, abs/1909.01792.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In EMNLP.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5).

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of Machine Learning Research, volume 15.

Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. 2017. Don't decay the learning rate, increase the batch size. CoRR, abs/1711.00489.

Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021. Applying the transformer to character-level transduction. In EACL.

Danhao Zhu, Si Shen, Xin-Yu Dai, and Jiajun Chen. 2017. Going wider: Recurrent neural network with parallel cells. CoRR, abs/1705.01346.

Andrej Žukov-Gregorič, Yoram Bachrach, and Sam Coope. 2018. Named entity recognition with parallel recurrent neural networks. In ACL.
SIGMORPHON 2021 Shared Task on Morphological Reinflection:
Generalization Across Languages
Tiago Pimentel*, Maria Ryskina*, Sabrina Mielke, Shijie Wu, Eleanor Chodroff, Brian Leonard, Garrett Nicolai, Yustinus Ghanggo Ate, Salam Khalifa, Nizar Habash, Charbel El-Khaissi, Omer Goldman, Michael Gasser, William Lane, Matt Coler, Arturo Oncevay, Jaime Rafael Montoya Samame, Gema Celeste Silva Villegas, Adam Ek, Jean-Philipe Bernardy, Andrey Shcherbakov, Aziyana Bayyr-ool, Karina Sheifer, Sofya Ganieva, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Andrew Krizhanovsky, Natalia Krizhanovsky, Clara Vania, Sardana Ivanova, Aelita Salchak, Christopher Straughn, Zoey Liu, Jonathan North Washington, Duygu Ataman, Witold Kieraś, Marcin Woliński, Totok Suhardijanto, Niklas Stoehr, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Richard J. Hatcher, Emily Prud'hommeaux, Ritesh Kumar, Mans Hulden, Botond Barta, Dorina Lakatos, Gábor Szolnok, Judit Ács, Mohit Raj, David Yarowsky, Ryan Cotterell, Ben Ambridge, Ekaterina Vylomova

University of Cambridge; Carnegie Mellon University; Johns Hopkins University; University of York; Brian Leonard Consulting; University of British Columbia; STKIP Weetebula; New York University Abu Dhabi; Australian National University; Bar-Ilan University; Charles Darwin University; University of Groningen; Indiana University; University of Edinburgh; Pontificia Universidad Católica del Perú; University of Gothenburg; University of Melbourne; Higher School of Economics; Institute of Philology of the Siberian Branch of the Russian Academy of Sciences; Institute of Linguistics, Russian Academy of Sciences; Moscow State University; Institute for System Programming, Russian Academy of Sciences; University at Buffalo; Karelian Research Centre of the Russian Academy of Sciences; ESRC International Centre for Language and Communicative Development (LuCiD); Northeastern Illinois University; University of Helsinki; Tuvan State University; New York University; Boston College; Swarthmore College; University of Zürich; Institute of Computer Science, Polish Academy of Sciences; Universitas Indonesia; Dr. Bhimrao Ambedkar University; Mila / McGill University Montreal; ETH Zürich; University of Colorado Boulder; University of Liverpool; Budapest University of Technology and Economics

[email protected]   [email protected]   [email protected]
Abstract

This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.¹

* The authors contributed equally.
¹ The data, systems, and their predictions are available: https://github.com/sigmorphon/2021Task0
1 Introduction

Chomsky (1995) noted that if a Martian anthropologist were to visit our planet, all of our world's languages would appear as a dialect of a single language, more specifically instances of what he calls a "universal grammar". This idea—that all languages have a large inventory of shared sounds, vocabulary, syntactic structures with minor variations—was especially common among cognitive scientists. It was based on highly biased, ethnocentric empirical observations, resulting from the fact that a vast majority of cognitive scientists, including linguists, focused only on the familiar European languages. Moreover, as Daniel (2011) notes, many linguistic descriptive traditions of individual languages, even isolated ones such as Russian or German, heavily rely on cross-linguistic assumptions about the structure of human language that are often projected from Latin grammars. Similarly, despite making universalistic claims, generative linguists, for a very long time, have focused on a small number of the world's major languages, typically using English as their departure point. This could be partly attributed to the fact that generative grammar follows a deductive approach where the observed data is conditioned on a general model.

However, as linguists explored more languages, descriptions and comparisons of more diverse kinds of languages began to come up, both within the framework of generative syntax as well as that of linguistic typology. Greenberg (1963) presents one of the earliest typologically informed descriptions of "language universals" based on an analysis of a relatively larger set of 30 languages, which included a substantial proportion of data from non-European languages. Subsequently, typologists have claimed that it is essential to describe the limits of cross-linguistic variation (Croft, 2002; Comrie, 1989) rather than focus only on cross-linguistic similarities. This is especially evident from Evans and Levinson (2009), where the authors question the notion of "language universals", i.e. the existence of a common pattern, or basis, shared across human languages. By looking at cross-linguistic work done by typologists and descriptive linguists, they demonstrate that "diversity can be found at almost every level of linguistic organization": languages vary greatly on phonological, morphological, semantic, and syntactic levels. This leads us to p-linguistics (Haspelmath, 2020), a study of particular languages, including the whole variety of idiosyncratic properties present in them, which makes cross-linguistic comparison challenging.

Haspelmath (2010) suggested a distinction between descriptive categories (specific to languages) and comparative concepts. The idea was then refined and further developed with respect to morphology and realized in the UniMorph schema (Sylak-Glassman et al., 2015b). Morphosyntactic features (such as "the dative case" or "the past tense") in the UniMorph schema occupy an intermediate position between the descriptive categories and comparative concepts. The set of features was initially established on the basis of an analysis of typological literature, and refined with the addition of new languages to the UniMorph database (Kirov et al., 2018; McCarthy et al., 2020). Since 2016, SIGMORPHON has organized shared tasks on morphological reinflection (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020) that aimed at evaluating contemporary systems. Parallel to that, they also served as a platform for enriching the UniMorph database with new languages. For instance, the 2020 shared task (Vylomova et al., 2020) featured 90 typologically diverse languages derived from various linguistic resources.

This year, we are bringing many under-resourced languages (languages of Peru, Russia, India, Australia, Papua New Guinea) and dialects (e.g., for Arabic and Kurdish). The sample is highly diverse: it contains languages with templatic and concatenative (fusional and agglutinative) morphology. In addition, we bring more polysynthetic languages such as Kunwinjku, Chukchi, and Asháninka. Unlike previous years, we pay more attention to the conversion of the morphosyntactic features of these languages into the UniMorph schema. In addition, for most languages we conduct an extensive error analysis.

2 Task Description

In this shared task, the participants were told to design a model that learns to generate morphological inflections from both a lemma and a set of morphosyntactic features of the target form. Specifically, each language in the task had its own training, development, and test splits. The training and development splits contained triples, with a lemma, a set of morphological features, and the target inflected form, while test splits only provided lemmas and morphological tags: the participants' models needed to predict the missing target form—making this a standard supervised learning task.
155
The target of the task, however, was to analyse suffixing and would struggle to learn prefixing
how well the current state-of-the-art reinflection or circumfixation, and the degree of the bias only
models could generalise across a typologically di- becomes apparent during experimentation on other
verse set of languages. These models should, in languages whose inflectional morphology patterns
theory, be general enough to work for natural lan- differ. Further, the model architecture itself could
guages of any typological patterning.2 As such, we also explicitly or implicitly favor certain word
designed the task in three phases: a Development formation types (suffixing, prefixing, etc.).
Phase, a Generalization Phase, and an Evaluation
Phase. As the phases advanced, more data and 3 Description of the Languages
more languages were released. 3.1 Gunwinyguan
In the Development Phase, we provided train-
The Gunwinyguan language family consists of Aus-
ing and development splits that should be used by
tralian Aboriginal languages spoken in the Arnhem
participants to develop their systems. Model de-
Land region of Australia’s Northern Territory.
velopment, evaluation, and hyper-parameter tuning
were, thus, mainly performed on these sets of lan- 3.1.1 Gunwinggic: Kunwinjku
guages. We will refer to these as the development This data set contains one member of this fam-
languages. ily: a dialect of Bininj Kunwok called Kunwin-
In the Generalization Phase, we provided train- jku. Kunwinjku is a polysynthetic language with
ing and development splits for new languages mostly agglutinating verbal morphology. A typical
where approximately half were genetically related verb there might look like Aban-yawoith-warrgah-
(belonged to the same family) and half were ge- marne-ganj-ginje-ng ‘1/3PL-again-wrong-BEN-
netically unrelated (either isolates or belonging to meat-cook-PP’ (“I cooked the wrong meat for them
different families) to the development languages. again”). As shown, the form has several prefixes
These languages (and their families) were kept as a and suffixes attached to the stem. As in other Aus-
surprise throughout the first (development) phase tralian languages, long vowels are typically repre-
and were only announced later on. As the partic- sented by double characters, and trills with “rr”.3
ipants were only given a few days with access to According to Evans’ (2003) analysis, the verb tem-
these languages before the submission deadline, we plate contains 12 affix slots which include two in-
expected that the systems couldn’t be radically im- corporated noun classes, and derivational affixes
proved to work on them—as such, these languages such as the benefactive and comitative. The data
allowed us to evaluate the generalization capacity included in this set are verbs extracted from the
of the re-inflection models, and how well they per- Kunwinjku translation of the Bible using the mor-
formed on new typologically unrelated languages. phological analyzer from Lane and Bird (2019) and
Finally, in the Evaluation Phase, the partic- manually verified by human annotators.
ipants’ models were evaluated on held-out test
forms from all of the languages of the previous 3.2 Afro-Asiatic
phases. The languages from the Development The Afro-Asiatic language family is represented by
Phase and the Generalization Phase were evaluated the Semitic subgroup.
simultaneously. The only difference between the
development and generalization languages was that 3.2.1 Semitic: Classical Syriac
participants had more time to construct their mod- Classical Syriac is a dialect of the Aramaic lan-
els for the languages released in the Development guage and is attested as early as the 1st century
Phase. It follows that a model could easily favor CE. As with most Semitic languages, it displays
or overfit to the phenomena that are more frequent non-concatenative morphology involving primarily
in the languages presented in the Development tri-consonantal roots. Syriac nouns and adjectives
Phase, especially if the parameters were shared are conventionally classified into three ‘states’—
across languages. For instance, a model based on Emphatic, Absolute, Construct—which loosely cor-
the morphological patterning of Indo-European relate with the syntactic features of definiteness,
languages may end up with a bias towards indeterminacy and the genitive. There are over 10
2 3
For example, Tagalog verbs exhibit circumfixation; thus, More details: https://en.wikipedia.
a model with a strong inductive bias towards suffixing would org/wiki/Transcription_of_Australian_
likely not work well for Tagalog. Aboriginal_languages.
156
Family Genus ISO 639-3 Language Source of Data Annotators
Development
Afro-Asiatic Semitic afb Gulf Arabic Khalifa et al. (2018) Salam Khalifa, Nizar Habash
Semitic amh Amharic Gasser (2011) Michael Gasser
Semitic ara Modern Standard Arabic Taji et al. (2018) Salam Khalifa, Nizar Habash
Semitic arz Egyptian Arabic Habash et al. (2012) Salam Khalifa, Nizar Habash
Semitic heb Hebrew (Vocalized) Wiktionary Omer Goldman
Semitic syc Classic Syriac SEDRA Charbel El-Khaissi
Arawakan Southern ame Yanesha Duff-Trip (1998) Arturo Oncevay, Gema Celeste Silva Vil-
Arawakan legas
Southern cni Asháninka Zumaeta Rojas and Zerdin Arturo Oncevay, Jaime Rafael Montoya
Arawakan (2018); Kindberg (1980) Samame
Austronesian Malayo- ind Indonesian KBBI, Wikipedia Clara Vania, Totok Suhardijanto, Zahroh
Polynesian Nuriah
Malayo- kod Kodi Ghanggo Ate (2021) Yustinus Ghanggo Ate, Garrett Nicolai
Polynesian
Aymaran Aymaran aym Aymara Coler (2014) Matt Coler, Eleanor Chodroff
Chukotko- Northern ckt Chukchi Chuklang; Tyers and Karina Sheifer, Maria Ryskina
Kamchatkan Chukotko- Mishchenkova (2020)
Kamchatkan
Southern itl Itelmen Karina Sheifer, Sofya Ganieva, Matvey
Chukotko- Plugaryov
Kamchatkan
Gunwinyguan Gunwinggic gup Kunwinjku Lane and Bird (2019) William Lane
Indo- Indic bra Braj Raw data from Kumar et al. Shyam Ratan, Ritesh Kumar
European (2018)
Slavic bul Bulgarian UniMorph (Kirov et al., 2018, Christo Kirov
Wiktionary)
Slavic ces Czech UniMorph (Kirov et al., 2018,
Wiktionary)
Iranian ckb Central Kurdish (Sorani) Alexina project Ali Salehi
Germanic deu German UniMorph (Kirov et al., 2018,
Wiktionary)
Iranian kmr Northern Kurdish (Kur- Alexina project
manji)
Indic mag Magahi Raw data from (Kumar et al., Mohit Raj, Ritesh Kumar
2014, 2018)
Germanic nld Dutch UniMorph (Kirov et al., 2018,
Wiktionary)
Slavic pol Polish Woliński et al. (2020); Witold Kieraś, Marcin Woliński
Woliński and Kieraś (2016)
Romance por Portuguese UniMorph (Kirov et al., 2018,
Wiktionary)
Slavic rus Russian UniMorph (Kirov et al., 2018, Ekaterina Vylomova
Wiktionary)
Romance spa Spanish UniMorph (Kirov et al., 2018,
Wiktionary)
Iranian sdh Southern Kurdish Fattah (2000, native speakers) Ali Salehi
Iroquoian Northern Iro- see Seneca Bardeau (2007) Richard J. Hatcher, Emily
quoian Prud’hommeaux, Zoey Liu
Trans–New Bosavi ail Eibela Aiton (2016b) Grant Aiton, Edoardo Maria Ponti, Eka-
Guinea terina Vylomova
Tungusic Tungusic evn Evenki Kazakevich and Klyachko Elena Klyachko
(2013)
Turkic Turkic sah Sakha Forcada et al. (2011, Apertium: Francis M. Tyers, Jonathan North Wash-
apertium-sah) ington, Sardana Ivanova, Christopher
Straughn, Maria Ryskina
Turkic tyv Tuvan Forcada et al. (2011, Apertium: Francis M. Tyers, Jonathan North
apertium-tyv) Washington, Aziyana Bayyr-ool, Aelita
Salchak, Maria Ryskina
Uralic Finnic krl Karelian Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Finnic lud Ludic Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Finnic olo Livvi Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Finnic vep Veps Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Generalization (Surprise)
Tungusic Tungusic sjo Xibe Zhou et al. (2020) Elena Klyachko
Turkic Turkic tur Turkish UniMorph (Kirov et al., 2018, Omer Goldman and Duygu Ataman
Wiktionary)
Uralic Finnic vro Võro Wiktionary Ekaterina Vylomova
157
verbal paradigms that combine affixation slots with For MSA, the paradigms inflect for all the above-
inflectional templates to reflect tense (past, present, mentioned features, while for EGY and GLF they
future), person (first, second, third), number (sin- inflect for the above-mentioned features except for
gular, plural), gender (masculine, feminine, com- voice, mood, case, and state. We use the func-
mon), mood (imperative, infinitive), voice (active, tional (grammatical) gender and number for MSA
passive), and derivational form (i.e., participles). and GLF, but the form-based gender and number
Paradigmatic rules are determined by a range of for EGY, since the resources we used did not have
linguistic factors, such as root type or phonolog- EGY functional gender and number (Alkuhlani and
ical properties. The data included in this set was Habash, 2011).
relatively small and consisted of 1,217 attested lex- We generated all the inflection tables from the
emes in the New Testament, which were extracted morphological analysis databases using the gener-
from Beth Mardutho: The Syriac Institute’s lexical ation component provided by CamelTools (Obeid
database, SEDRA. et al., 2020). We extracted all the verb, noun, and
adjective lemmas from a number of annotated cor-
3.2.2 Semitic: Arabic
pora and selected those that are already in the mor-
Modern Standard Arabic (MSA, ara) is the pri- phological analysis databases. For MSA, we used
marily written form of Arabic which is used in all the CALIMA-STAR database (Taji et al., 2018),
official communication means. In contrast, Ara- based on the SAMA database (Maamouri et al.,
bic dialects are the primarily spoken varieties of 2010), and the PATB (Maamouri et al., 2004) as
Arabic, and the increasingly written varieties on the sources of lemmas. For EGY, we used the
unofficial social media platforms. Dialects have CALIMA-EGY database (Habash et al., 2012) and
no official status despite being widely used. Both the ARZTB (Maamouri et al., 2012) as the sources
MSA and the dialects coexist in a state of diglos- of lemmas. For GLF, we used the Gulf verb ana-
sia (Ferguson, 1959) whether in spoken or written lyzer (Khalifa et al., 2017) for verbs, and for both
form. Arabic dialects vary amongst themselves nouns and adjectives we extracted all the annota-
and are different from MSA in most linguistic as- tions from the Annotated Gumar Corpus (Khalifa
pects (phonology, morphology, syntax, and lexical et al., 2018).
choice). In this work we provide inflection tables
for MSA (ara), Egyptian Arabic (EGY, arz), and 3.2.3 Semitic: Hebrew
Gulf Arabic (GLF, afb). Egyptian Arabic is the
variety of Arabic spoken in Egypt. Gulf Arabic As Syriac, Hebrew is a member of the Northwest
refers to the dialects spoken by the indigenous pop- Semitic branch, and, like Syriac and Arabic, it
ulations of the members of the Gulf Cooperation is written using an abjad where the vowels are
Council, especially regions on the Arabian Gulf. sparsely marked in unvocalized text. This fact en-
Similar to other Semitic languages, Arabic is a tails that in unvocalized data the complex ablaut-
templatic language. A word consists of a templatic extensive non-concatenative Semitic morphology
stem (root and pattern) and a number of affixes and is somewhat watered down as the consonants of the
clitics. Verb lemmas in Arabic inflect for person, root frequently appear consecutively with the alter-
gender, number, voice, mood, and aspect. Nomi- nating vowel unwritten. In this work we present
nal lemmas inflect for gender, number, case, and data in vocalized Hebrew, in order to examine the
state. Those features are realized through both the models’ ability to handle Hebrew’s full-fledged
templatic patterns and the concatenative affixations. Semitic morphological system.
Arabic words also take on a number of clitics: at- Hebrew verbs belong to 7 major classes
tachable prepositions, conjunctions, determiners, (Binyanim) with many subclasses depending on
and pronominal objects and possessives. In this the phonological features of the root’s consonants.
work, we do not include clitics as a part of the Verbs inflect for number, gender, and tense-mood,
paradigms, as they heavily increase the size of the while the nominal inflection tables include definite-
paradigms. We made the exception to add the Al de- ness and possessor.
terminer particle in order to be consistent with com- The provided inflection tables are largely identi-
monly used tokenizations for Arabic treebanks— cal to those of the past years’ shared tasks, scraped
Penn Arabic Treebank (Maamouri et al., 2004) and from Wiktionary, with the addition of the verbal
Arabic Universal Dependencies (Taji et al., 2017). nouns and all forms being automatically vocalized.
158
3.2.4 Semitic: Amharic 3.3.1 Aymaran: Aymara
Amharic is the most spoken and best-resourced Aymara is spoken mainly in Andean communities
among the roughly 15 languages in the Ethio- in the region encompassing Bolivia and Peru from
Semitic branch of South Semitic. Unlike most other the north of Lake Titicaca to the south of Lake
Semitic languages, but like other Ethio-Semitic lan- Poopó, extending westward to the valleys of the Pa-
guages, it is written in the Ge’ez (Ethiopic) script, cific coast and eastward to the Yunga valleys. It has
an abugida in which each character represents ei- roughly two million speakers, over half of whom
ther a consonant-vowel sequence or a consonant in are Bolivian. The rest reside mainly in Peru, with
the syllable coda position. small communities in Chile and Argentina. Ay-
mara is a highly agglutinative, suffix-only language.
Like other Semitic languages, Amharic displays
Nouns are inflected for grammatical number, case,
both affixation and non-concatenative template
and possessiveness. As Coler (2010) notes, Ay-
morphology. Verbs inflect for subject person, gen-
mara has 11–12 grammatical cases, depending on
der, and number and tense/aspect/mood. Voice and
the variety (as in some varieties the locative and
valence are also marked, both by templates and
genitive suffixes have merged and in others they
affixes, but these are treated as separate lemmas in
have not). The case suffix is attached to the last
the data. Other verb affixes (or clitics, depending
element of a noun phrase. Verbs present relatively
on the analysis) indicate object person, gender, and
complex paradigms, with dimensions such as gram-
number; negation; relativization; conjunctions; and,
matical person (marking both subject and direct
on relativized forms, prepositions and definiteness.
object), number, tense (simple, future, recent past,
None of these are included in the data.
distal past), mood (evidentials, two counterfactual
Nouns and adjectives share most of their mor- paradigms, and an imperative paradigm). More-
phology and are often not clearly distinguished. over, Aymara has a variety of suffixes which change
Nouns and adjectives inflect for definiteness, num- the grammatical category of the word. Words can
ber, and possession. Gender is only explicit when change grammatical category multiple times.5
the masculine or feminine singular definite suffixes
are present; most nouns have no inherent gender. 3.4 Indo-European
Nouns and adjectives also have prepositional pre- The Indo-European language family is the parent
fixes (or clitics) and accusative suffixes, which are family of most of the European and Asian lan-
not included in the data. guages. In this iteration of the shared task, we
The data for the shared task were generated by enrich the data with languages from Indo-Aryan,
the HornMorpho generator (Gasser, 2011), an FST Iranian, and Slavic groups. Iranian and Indo-
weighted with feature structures. Common ortho- Aryan are recognised as distinct subgroups of Indo-
graphic variants of the lemmas and common variant European. Characteristic retentions and innova-
plural forms of nouns are included. In these cases, tions make Iranian and Indo-Aryan language fami-
the variants are distinguished with the LGSPEC1 lies diverged and distinct from each other (Jain and
and LGSPEC2 features. Predictable orthographic Cardona, 2007).
variants are not included.
3.4.1 Indo-Aryan, or Indic: Magahi, Braj
3.3 Aymaran The Indian subcontinent is the heartland of where
the Indo-Aryan languages are spoken. This area
The Aymaran family has two branches: Southern is also referred to as South Asia and encompasses
Aymaran (which is the branch described in this con- India, Pakistan, Bangladesh, Nepal, Bhutan and
tribution, as represented by Mulyaq’ Aymara) and the islands of Sri Lanka and Maldives (Jain and
Central Aymaran (Jaqaru).4 Aymaran has no ex- Cardona, 2007). Magahi and Braj, which belong
ternal relatives. The neighboring and overlapping to the Indo-Aryan language family, are under our
Quechuan family is often erroneously believed to observation.
be related. Magahi comes under the Magadhi group of the
middle Indo-Aryan which includes Bhojpuri and
4
Sometimes Cauqui (also spelled “Kawki”), a language
5
spoken by less than ten elders in Cachuy, Canchán, Caipán, Tags’ conversion into UniMorph: https:
and Chavín, is considered to be a third Aymaran language but //github.com/unimorph/aym/blob/main/
it may be more accurate to consider it a Jaqaru dialect. Muylaq’AymaraUnimorphConversion.tsv
159
Maithili. While the exact classification within this lish and popularise the literary tradition of Braj, the
subgroup is still debatable, most accepted analyses local state government of Rajasthan has set up the
put it under one branch of the Eastern group of lan- Braj Bhasha Akademi (Braj Bhasha Academy) in
guages which includes Bangla, Asamiya, and Oriya Jaipur. Along with this, some individuals, local lit-
(Grierson and Konow, 1903). Magahi speech area erary and cultural groups, and language enthusiasts
is mainly concentrated in the Eastern Indian states at the local level also bring out publications in Braj
of Bihar and Jharkhand, but it also extends to the (Kumar et al., 2018). In all of the above sources,
adjoining regions of Bengal and Odisha (Grierson, bhakti poetry6 constitutes a large proportion of the
1903). traditional literature of Braj (Pankaj, 2020).
There is no grammatical gender and number As in the case of other Indo-Aryan languages,
agreement in Magahi, though sex-related gender Braj is also rich in morphological inflections. The
derivation commonly occurs for animate nouns dataset released for the present task contains two
like /laika/ (boy) and /laiki/ (girl). Number is sets of inflectional paradigms with morphologi-
also marked on nouns, and it affects the form of cal features for nouns and verbs. Nominal lem-
case markers and postpositions in certain instances mas in Braj are inflected for gender (masculine
(Lahiri, 2021). Moreover, it has a rich system of and feminine) and number (singular and plural);
verbal morphology to show the tense, aspect, per- verb lemmas take gender (masculine and feminine),
son, and honorific agreement with the subject as number (singular and plural), person (first, second
well as the addressee. and third), politeness/honorificity (formal and in-
formal), tense (present, past and future), and aspect
In the present dataset, the inflectional paradigms
(perfective, progressive, habitual and prospective)
for verbs show the honorificity level of both the
markings. Among these, the politeness feature
subjects and the addressees, and also the person
is marked for showing honorificity and formality.
of the subject, the tense and aspect markers. The
More generally, a formal/polite marker is used for
inflectional paradigms for nouns and adjectives are
strangers and the elite class, while informal/neutral
generated on the basis of the inflectional marker
markers are used for family and friends.
used for expressing case, familiarity, plurality, and
In order to generate the morphological
(sometimes) gender within animate nouns. Pro-
paradigms, we have used the data from the
nouns are marked for different cases and honori-
literary domain, annotated at the token level
ficity levels. These paradigms are generated on the
in the CoNLL-U editor (Heinecke, 2019). The
basis of a manually annotated corpus of Magahi
dataset was initially annotated using the Universal
folktales.
Dependencies morphological feature set and
We used a raw dataset from the literary domain. then automatically converted to the UniMorph
First, we annotated the dataset with the Universal schema using the script provided by McCarthy
Dependency morphological feature tags at token et al. (2018). Finally, the converted dataset was
level using the CoNLL-U editor (Heinecke, 2019). manually validated and edited to conform to the
We then converted the annotated dataset into the constraints and conventions of UniMorph to arrive
UniMorph schema using the script available for at the final labels.
converting UD data into the UniMorph tagset (Mc-
Carthy et al., 2018). To finalize the data, we man- 3.4.2 Iranian: Kurdish
ually validated the dataset against the UniMorph The Iranian branch is represented by Kurdish.
schema (Sylak-Glassman et al., 2015a). Among Western Iranian languages, Kurdish is the
Brajbhasha, or Braj is one of the Indo-Aryan term covering the largest group of related dialects.
languages spoken in the Western Indian states of Kurdish comprises three main subgroup dialects,
Uttar Pradesh, Madhya Pradesh, and Rajasthan. namely Northern Kurdish (including Kurmanji),
Grierson (1908) groups Brajbhasha under West- Central Kurdish (including Sorani), and Southern
ern Hindi of the Central Group in the Indo-Aryan Kurdish. Sorani Kurdish, spoken in Iran and Iraq,
family, along with other languages like Hindustani, is known for its morphological split ergative sys-
Bangaru, Kannauji, and Bundeli. Braj is not gener- tem. There are two sets of morphemes traditionally
ally used in education or for any official purposes in described as agreement markers: clitic markers and
any Braj spoken state, but it has a very rich literary 6
This is dedicated to Indian spiritual and mythological
tradition. Also in order to preserve, promote, pub- imagination as being associated with Lord Krishna.
160
verbal affixes, which are verbal agreement markers, three are represented in this dataset, with Polish
or the copula. The distribution of these formatives and Czech being the typical West Slavic languages,
can be described as ergative alignment, although Russian being the most prominent East Slavic lan-
mandatory agent indexing has led some scholars guage, and Bulgarian representing the Eastern part
to refer to the Sorani system as post- or remnant- of the South Slavic group. Slavic languages are
ergative (Jügel, 2009). Note that Sorani nominals characterized by a rich verbal and nominal inflec-
do not feature case marking. The single argument tion system. Typically, verbs mark tense, person,
of an intransitive verb is an affix while the transi- gender, aspect, and mood. Nouns mark gender,
tive verbs have a tense-sensitive alignment. With number, and case, although in Bulgarian and Mace-
transitive verbs, agents are indexed by affixes in the donian cases are reduced to only nominative and
present tense and with clitics in the past tense. On vocative. Masculine nouns additionally mark ani-
the other hand, the object is indexed with a clitic in macy.
the present tense and an affix in the past tense. In Polish data was obtained via a conversion
addition, Sorani also has the so-called experiencer- from the largest Polish morphological dictionary
subject verbs, with which both the agent and the (Woliński et al., 2020) which is also used as the
object are marked with clitic markers. Like other main data source in the morphological analysis.
Iranian languages, Sorani also features a series of Table 10 presents a simplified mapping from the
light-verb constructions which are composed us- original flexemic tagset of Polish (Przepiórkowski
ing the verbs kirdin ‘to do’ or bun ‘to be’. In the and Woliński, 2003) to the UniMorph schema. The
light verb constructions, the agent is marked with data for the remaining three Slavic languages were
an affix in the present tense, while a clitic marks obtained from Wiktionary.
the subject in the past tense. Southern Kurdish fea-
tures all the same verbs types, clitics and affixes, 3.5 Uralic: Karelian, Livvi, Ludic, Veps,
while the alignment pattern can be completely dif- Võro
ferent due to a nominative-accusative alignment
system. The usage of agreement markers with af- The Uralic languages are spoken from the north
fixes is widely predominant in Southern Kurdish of Siberia in Russia to Scandinavia and Hungary.
and clitics can be used to mark the possessives. They are agglutinating with some subgroups dis-
Both dialects of Kurdish allow for clitic and af- playing fusional characteristics (e.g., the Sámi lan-
fix stacking marking the agent and the object of guages). Many of the languages have vowel har-
a verb. In Sorani, for instance, dit=yan-im ‘They mony. Many of the larger case paradigms are made
saw me’ uses a clitic and an affix to mark the agent up of spatial cases, sometimes with distinctions for
and the object, and wist=yan=im ‘I wanted them’ direction and position. Further, most of the lan-
marks both the agent and the object with clitics. guages have possessive suffixes, which can express
Ditransitive verbs can be formed by a transitive possession or agreement in non-finite clauses.
verb and an applicative marker. For instance, a di- We use Karelian, Ludic, Livvi, Veps, and Võro
transitive three-participant verb da-m-în=î-yê ‘He in the shared task. All the data except Võro were
gave them to me’ marks the recipient and the object exported from the Open corpus of Veps and Kare-
with affixes, and the agent is marked with a clitic lian languages (VepKar). Veps and Karelian are
in the presence of an applicative (yê). A separate agglutinative languages with rich suffixal morphol-
set of morphological features is needed to account ogy. All inflectional categories in these languages
for such structures, in which the verb dictates the are formed by attaching one or more affixes corre-
person marker index as subject, agent, object or sponding to different grammatical categories to the
recipient. stem.
The presence of one or two stems in the nom-
3.4.3 Slavic: Polish inal parts of speech and verbs is essential when
The Slavic genus comprises a group of fusional constructing word forms in the Veps and Karelian
languages evolved from Proto-Slavic and spoken languages (Novak, 2019, 57). In these languages,
in Central and Eastern Europe, the Balkans and the to build the inflected forms of nouns and verbs, one
Asian parts of Russia from Siberia to the Far East. needs to identify one or two word stems. There
Slavic languages are most commonly divided into are formalized (algorithmic) ways to determine the
three major subgroups: East, West, and South. All stem, although not for all words (Novak et al., 2020,
161
684). 3.8 Turkic
Note that in the Ludic and Livvi dialects of the 3.8.1 Siberian Turkic: Sakha and Tuvan
Karelian language and in the Veps language, re- The Turkic languages of Siberia, spoken mostly
flexive forms of verbs have their own paradigm. within the Russian Federation, range from vulnera-
Thus, one set of morphological rules is needed for ble to severely endangered (Eberhard et al., 2021)
reflexive verbs and another set for non-reflexive and represent several branches of Turkic with vary-
verbs. ing degrees of relatedness (Баскаков, 1969; Tekin,
Võro represents the South Estonian dialect 1990; Schönig, 1999). They have rich agglutinat-
group. Similar to other Uralic languages, it has ag- ing morphology, like other Turkic languages, and
glutinative, primarily suffixal, morphology. Nouns share many grammatical properties (Washington
inflect for grammatical case and number. The cur- and Tyers, 2019).
rent shared task sample contains noun paradigm In this shared task, the Turkic languages of this
tables derived from Wiktionary.7 area are represented by Tuvan (Sayan Turkic) and
Sakha (Lena Turkic). For both languages, we
3.6 Tungusic make use of the lexicons of the morphological trans-
ducers built as part of the Apertium open-source
The Tungusic genus comprises a group of aggluti-
project (Khanna et al., to appear in 2021; Washing-
native languages spoken from Central and Eastern
ton et al., to appear in 2021). We use the transduc-
Siberia to the Far East over the territories of Russia
ers for Tuvan8 (Tyers et al., 2016; Washington et al.,
and China. The genus is considered to be a member
2016) and Sakha9 (Ivanova et al., 2019, to appear in
of the Altaic (or Transeurasian) language family by
2022) as morphological generators, extracting the
some researchers, although this is disputed. Tun-
paradigms for all the verbs and nouns in the lexicon.
gusic languages are commonly divided into two or
We manually design a mapping between the Aper-
three branches (see Oskolskaya et al. (2021) for
tium tagset and the UniMorph schema (Table 8),
discussion).
based on the system descriptions and additional
grammar resources (Убрятова et al. (1982) for
3.7 Tungusic: Evenki and Xibe
Sakha and Исхаков and Пальмбах (1961); An-
The dataset presents two Tungusic languages, derson and Harrison (1999); Harrison (2000) for
namely Evenki and Xibe, belonging to different Tuvan). Besides the tag mapping, we also include a
branches in any approach, with Xibe being quite few conditional rules, such as marking definiteness
aberrant from other Tungusic languages. Tungu- for nouns in the accusative and genitive cases.
sic languages are characterized by rich verbal and Since the UniMorph schema in its current ver-
nominal inflection and demonstrate vowel harmony. sion is not well-suited to capture the richness of
Typically verbs mark tense, person, aspect, voice Turkic morphology, we exclude many forms with
and mood. Nouns mark number, case and posses- morphological attributes that do not have a close
sion. equivalent in UniMorph. We also omit forms with
Inflection is achieved through suffixes. Evenki affixes that are considered quasi-derivational rather
is a typical agglutinative language with almost no than inflectional, such as the desiderative /-ksA/
fusion whereas Xibe is more fusional. in Tuvan (Washington et al., 2016), with the ex-
The Evenki data was obtained by conversion ception of the negative marker. These constraints
from a corpus of oral Evenki texts (Kazakevich and greatly reduce the sizes of the verbal paradigms:
Klyachko, 2013), which uses IPA. The Xibe data the median number of forms per lemma is 234 and
was obtained by conversion from a Universal De- 87 for Tuvan and Sakha respectively, compared to
pendency treebank compiled by Zhou et al. (2020), roughly 5,700 forms per lemma produced by ei-
which contains textbook and newspaper texts. Xibe ther generator. Our tag conversion and paradigm
texts use the traditional script. filtering code is publicly released.10
8
https://github.com/apertium/
7 apertium-tyv/
The tag conversion schema for Uralic lan-
9
guages is provided here: https://docs. https://github.com/apertium/
google.com/spreadsheets/d/1RjO_ apertium-sah/
10
J22yDB5FH5C24ej7sGGbeFAjcIadJA6ML55tsOI/ https://github.com/ryskina/
edit. apertium2unimorph
162
3.8.2 Turkic: Turkish 3.9.2 Malayo-Polynesian: Kodi/Kodhi
One of the further west Turkic languages, Turkish Kodi or Kodhi [koâi] is spoken in Sumba Island,
is part of the Oghuz branch, and, like the other eastern Indonesia (Ghanggo Ate, 2020). Regard-
languages of this family, it is highly agglutinative. ing its linguistic classification, Kodi belongs to the
In this work, we vastly expanded the existing Central-Eastern subgroup of Austronesian, related
UniMorph inflection tables. As with the Siberian to Sumba-Hawu languages. Based on the linguis-
Turkic languages, it was necessary to omit many tic fieldwork observations done by Ghanggo Ate
forms from the paradigm as the UniMorph schema (2020), it may be tentatively concluded that there
is not well-suited for Turkic languages. For this are only two Kodi dialects: Kodi Bhokolo and
reason, we only included the forms that may ap- Mbangedho-Mbalaghar. Even though some work
pear in main clauses. Other than this limitation, has been done on Kodi (Ghanggo Ate, to appear
we tried to include all possible tense-aspect-mood in 2021), it remains a largely under-documented
combinations, resulting in 30 series of forms, each language. Further, Kodi is vulnerable or threat-
including 3 persons and 2 numbers. The nominal ened because Indonesian, the prestigious national
coverage is less comprehensive and includes forms language, is used in most sociolinguistic domains
with case and possessive suffixes. outside the domestic sphere.
A prominent linguistic feature of Kodi is its
3.9 Austronesian clitic system, which is pervasive in various syntac-
tic categories—verbs, nouns, and adjectives—and
3.9.1 Malayo-Polynesian: Indonesian marks person (1, 2, 3) and number (SG vs. PL). In
Indonesian or Bahasa Indonesia is the official lan- addition, Kodi contains four sets of pronominal cli-
guage of Indonesia. It belongs to the Austronesian tics that agree with their antecedent: NOM(inative)
language family and it is written with the Latin proclitics, ACC(usative) enclitics, DAT(ive) en-
script. clitics and GEN(initive) enclitics. Interestingly,
these clitic sets are not markers of NOM, ACC,
Indonesian does not mark grammatical case, gen-
DAT, or GEN grammatical case—as in Malayalam
der, or tense. Words are composed from their roots
or Latin—but rather identify the head for TERM
through affixation, compounding, or reduplication.
relations (subject and object). Thus, by default,
The four types of Indonesian affixes are prefixes,
pronominal clitics are core grammatical arguments
suffixes, circumfixes (combination of prefixes and
reflecting subject and object.
suffixes), and infixes (inside the base form). In-
donesian uses both full and partial reduplication For the analyses of the features of Kodi clitics,
processes to form words. Full reduplication is often the data freshly collected in the fieldwork funded by
used to express the plural forms of nouns, while par- the Endangered Language Fund is annotated. Then,
tial reduplication is typically used to derive forms the collected data is converted to the UniMorph
that might have a different category than their base task format, which has the lemmas, the word forms,
forms. Unlike English, the distinction between in- and the morphosyntactic features of Kodi.
flectional and derivational morphological processes
3.10 Iroquoian
in Indonesian is not always clear (Pisceldo et al.,
2008). 3.10.1 Northern Iroquoian: Seneca
In this shared task, the Indonesian data is cre- The Seneca language is an indigenous Native Amer-
ated by bootstrapping the data from an Indone- ican language from the Iroquoian (Hodinöhšöni)
sian Wikipedia dump. Using a list of possible In- language family. Seneca is considered critically en-
donesian affixes, we collect unique word forms dangered and is currently estimated to have fewer
from Wikipedia and analyze them using MorphInd than 50 first-language speakers left, most of whom
(Larasati et al., 2011), a morphological analyzer are elders. The language is spoken mainly in three
tool for Indonesian based on an FST. We manu- reservations located in Western New York: Alle-
ally create a mapping between the MorphInd tagset gany, Cattaraugus, and Tonawanda.
and the UniMorph schema. We then use this map- Seneca possesses highly complex morphological
ping and apply some additional rule-based formu- features, with a combination of both agglutinative
las created by Indonesian linguists to build the final and fusional properties. The data presented here
dataset (Table 9). consists of inflectional paradigms for Seneca verbs,
163
the basic structure of which is composed of a verb 3.11.2 Southern Arawakan: Yanesha
base that describes an event or state of action. In Yanesha is an Amazonian language from the Pre-
virtually all cases, the verb base would be preceded Andine subgroup of Arawakan family (Adelaar
by a pronominal prefix which indicates the agent, and Muysken, 2004), spoken in Central Peru by
the patient, or both for the event or state, and fol- between 3 and 5 thousand people. It has two lin-
lowed by an aspect suffix which usually marks a guistic variants that correspond to the upriver and
habitual or a stative state. downriver areas, both mutually intelligible.
Yanesha is an agglutinating, polysynthetic lan-
(1) ha skatkwë s
guage with a VSO word order. Nouns and verbs
it he laugh HAB
are the two major word classes while the adjective
He laughs. word class is questionable due to the absence of
In some other scenarios, for instance, when the non-derived forms. The verb is the most morpho-
verb is describing a factual, future or hypothetical logically complex word class and the only obliga-
event, a modal prefix is attached before the pronom- tory constituent of a clause (Dixon and Aikhenvald,
inal prefix and the aspect suffix marks a punctual 1999).
state instead. The structures and orders of the pre- Among other typologically remarkable features,
fixes can be more complicated depending on, e.g., we find that the language lacks the distinction in
whether the action denoted by the verb is repetitive grammatical gender, the subject cross-referencing
or negative; these details are realized by adding a morphemes and one of the causatives are prefixes;
prepronominal prefix before the modal prefix. all other verbal affixes are suffixes, and nouns and
classifiers may be incorporated in the verb (Wise,
3.11 Arawakan 2002).
3.11.1 Southern Arawakan: Asháninka The corpus consists of inflected nouns and verbs
from both dialectal varieties. The annotated nouns
Asháninka is an Arawak language with more than take possessor prefixes, plural marking, and loca-
70,000 speakers in Central and Eastern Peru and in tive case, while the annotated verbs take subject
the state of Acre in Eastern Brazil, in a geographi- prefixes.
cal region located between the eastern foothills of
the Andes and the western fringe of the Amazon 3.12 Chukotko-Kamchatkan
basin (Mihas, 2017; Mayor Aparicio and Bodmer, The Chukotko-Kamchatkan languages, spoken in
2009). Although it is the most widely spoken Ama- the far east of the Russian Federation, are repre-
zonian language in Peru, certain varieties, such as sented in this dataset by two endangered languages,
Alto Perené, are highly endangered. Chukchi and Itelmen (Eberhard et al., 2021).
It is an agglutinating, polysynthetic, verb-initial
language. The verb is the most morphologically 3.12.1 Chukotko-Kamchatkan: Chukchi
complex word class, with a rich repertoire of aspec- Chukchi is a polysynthetic language that exhibits
tual and modal categories. The language lacks case polypersonal agreement, ergative–absolutive align-
marking, except for one locative suffix; grammat- ment, and a subject–object–verb basic word or-
ical relations of subject and object are indexed as der in transitive clauses (Tyers and Mishchenkova,
affixes on the verb itself. Other notable linguistic 2020). We use the data of the Amguema corpus,
features of the language include a distinction be- available through the Chuklang website,11 com-
tween alienably and inalienably possessed nouns, prised of transcriptions of spoken Chukchi in the
obligatory marking of reality status (realis/irrealis) Amguema variant. The Amguema data had been
on the verb, a rich system of applicative suffixes, annotated in the CoNLL-U format by Tyers and
serial verb constructions, and pragmatically condi- Mishchenkova (2020), and we convert it to the
tioned split intransitivity. UniMorph format using the conversion system of
The corpus consists of inflected nouns and verbs McCarthy et al. (2018).
from the variety spoken in the Tambo river of Cen-
3.12.2 Chukotko-Kamchatkan: Itelmen
tral Peru. The annotated nouns take possessor pre-
fixes, locative case and/or plural marking, while Itelmen is a language spoken on the western coast
the annotated verbs take subject prefixes, reality of the Kamchatka Peninsula. The language is con-
status (realis/irrealis), and/or perfective aspect. 11
https://chuklang.ru/
164
sidered to be highly endangered since it stopped Conversion into the UniMorph schema. After
been transferred from elders to youth ∼50 years the data collection was finalised for the above
ago (most are Russian-speaking monolinguals). languages, we converted them to the UniMorph
The language is agglutinative and primarily uses schema—canonicalising them in the process.14
suffixes. For instance, the plural form of a noun This process consisted mainly of typo corrections
is expressed by the suffix -Pn. We note that the (e.g. removing an incorrectly placed space in a tag,
plural form only exists in four grammatical cases “PRIV ” → “PRIV”), removing redundant tags
(NOM, DAT, LOC, VOC).12 The same plural suffix (e.g. duplicated verb annotation, “V;V.PTCP” →
transforms a noun into an adjective. Verbs mark “V.PTCP”), and fixing tags to conform to the Uni-
both subjects (with prefixes and suffixes) and ob- Morph schema (e.g. “2;INCL” → “2+INCL”).
jects (with suffixes). For instance, the first person These changes were implemented via language-
subject is marked by attaching the prefix t- and the specific Bash scripts. Given this freshly con-
suffix -čen (Volodin, 1976).13 The Itelmen data pre- verted data, we canonicalised its tag annotations,
sented in the task was collected through fieldwork making use of https://github.com/unimorph/
and manually annotated according to the UniMorph um-canonicalize. This process sorts the inflec-
schema. tion tags into their canonical order and verifies that
all the used tags are present in the ground truth
3.13 Trans-New Guinea UniMorph schema, flagging potential data issues
in the process.
3.13.1 Bosavi: Eibela
Data splitting. Given the canonicalised data as
Eibela, or Aimele, is an endangered language spo-
described above, we removed all instances with
ken by a small (∼300 speakers) community in Lake
duplicated <lemma; tags> pair—as these in-
Campbell, Western Province, Papua New Guinea.
stances were ambiguous with respect to their target
Eibela morphology is exclusively suffixing. Verbs
inflected form—and removed all forms other than
conjugate for tense, aspect, mood, evidentiality and
verbs, nouns, or adjectives. We then capped the
exhibit complex paradigms with a high degree of
dataset sizes to a maximum of 100,000 instances
irregularity. Generally, verbs cab be grouped into
per language, subsampling when necessary. Fi-
three classes based on their stems. Verbal inflec-
nally, we create a 70–10–20 train–dev–test split per
tional classes present various kinds of stem alter-
language, splitting the data across these sets at the
nations and suppletion. As Aiton (2016b) notes,
instance level (as opposed to, e.g., the lemma one).
the present and past forms are produced either
As such, the information about a lemma’s declen-
through stem changes or by a concatenative suffix.
sion or inflection class is spread out across these
In some cases, the forms can be quite similar (such
train, dev and test sets, making this task much sim-
as na:gla: ‘be sick.PST’ and na:glE ‘be sick.PRS’).
pler than if one had to predict the entire class from
The future tense forms are typically inflected using
the lemma’s form alone, as done by, e.g., Williams
suffixes. The current sample has been derived from
et al. (2020) and Liu and Hulden (2021).
interlinear texts from Aiton (2016a) and contains
mostly partial paradigms. 5 Baseline Systems
165
L V N ADJ V.CVB V.PTCP V.MSDR
afb 19,861 2,184 7,595 2,996 4,208 1,510 – – – – – –
amh 20,254 670 20,280 1,599 829 195 4,096 668 – – 668 668
ara 31,002 635 53,365 1,703 58,187 742 – – – – – –
arz 8,178 1,320 10,533 3,205 6,551 1,771 – – – – – –
heb 28,635 1,041 3,666 142 – – – – – – 847 847
syc 596 187 724 329 158 86 – – 261 77 – –
ame 1,246 184 2,359 143 – – – – – – – –
cni 5,478 150 14,448 258 – – – – – – – –
ind 8,805 2,570 5,699 2,759 1,313 731 – – – – – –
kod 315 44 91 14 56 8 – – – – – –
aym 50,050 910 91,840 656 – – – – – – 910 910
ckt 67 62 113 95 8 8 – – – – – –
itl 718 424 567 419 63 59 412 352 – – 20 19
Development
gup 305 73 – – – – – – – – – –
bra 564 286 808 757 174 157 – – – – – –
bul 13,978 699 8,725 1,334 13,050 435 423 423 17,862 699 1,692 423
ces 33,989 500 44,275 3,167 48,370 1,458 2,518 360 5,375 360 – –
ckb 14,368 112 1,882 142 – – – – 289 112 – –
deu 64,438 2,390 73,620 9,543 – – – – 4,777 2,390 – –
kmr 6,092 301 135,604 14,193 – – – – 397 150 783 301
mag 442 145 692 664 77 76 – – 6 6 3 3
nld 32,235 2,149 – – 21,084 2,844 – – 2,148 2,148 – –
pol 40,396 636 12,313 894 23,042 424 625 614 50,772 446 15,456 633
por 133,499 1,884 – – – – – – 9,420 1,884 – –
rus 33,961 2,115 54,153 4,747 46,268 1,650 3,188 2,107 5,486 2,138 – –
spa 132,702 2,042 – – – – 2,042 2,042 8,184 2,046 – –
see 5,430 140 – – – – – – – – – –
ail 940 365 339 249 32 24 – – – – –
evn 2,106 961 3,661 2,249 446 393 612 390 716 517 – –
sah 20,466 237 122,510 1,189 – – 2,832 236 – – – –
tyv 61,208 314 81,448 970 – – 9,336 314 – – – –
krl 108,016 1,042 1,118 107 213 24 – – 3,043 1,021 – –
lud 57 31 125 77 1 1 – – – – – –
olo 72,860 649 55,281 2,331 12,852 538 – – 1,762 575 – –
vep 55,066 712 69,041 2,804 16,317 560 – – 2,543 705 – –
sjo 135 99 49 41 16 16 86 69 78 65 51 44
Surprise
Table 2: Number of samples and unique lemmata (the second number in each column) in each word class in the
shared task data, aggregated over all splits. Here: “V” – verbs, “N” – nouns, “ADJ” – adjectives, “V.CVB” –
converbs, “V.PTCP” – participles, “V.MSDR” – masdars.
ing on a simple character-level alignment between los and Neubig (2019). More specifically, the team
the lemma and the form, this technique replaces implemented an encoder–decoder model with an
shared substrings of length > 3 with random char- attention mechanism. The encoder processes a
acters from the language’s alphabet, producing hal- character sequence using an LSTM-based RNN
lucinated lemma–tag–form triples. Data augmen- with attention. Tags are encoded with a self-
tation (+AUG) is applied to languages with fewer attention (Vaswani et al., 2017) position-invariant
than 10K training instances, and 10K examples are module. The decoder is an LSTM with separate
generated for each language. attention mechanisms for the lemma and the
tags. GUClasp focus their efforts on exploring
6 Submitted Systems strategies for training a multilingual model, in
particular, they implement the following strategies:
GUClasp The system submitted by Team
curriculum learning with competence (Platanios
GUClasp is based on the architecture and data
et al., 2019) based on character frequency and
augmentation technique presented by Anastasopou-
166
L BME GUClasp TRM TRM+AUG CHR-TRM CHR-TRM+AUG
Development
afb 92.39 81.71 94.88 94.88 94.89 94.89
amh 98.16 93.81 99.37 99.37 99.45 99.45
ara 99.76 94.86 99.74 99.74 99.79 99.79
arz 95.27 87.12 96.71 96.71 96.46 96.46
heb 97.46 89.93 99.10 99.10 99.23 99.23
syc 21.71 10.57 35.14 34.29 36.29 34.57
ame 82.46 55.94 87.43 87.85 87.15 86.19
cni 99.5 93.36 99.90 99.90 99.88 99.88
ind 81.31 55.68 83.61 83.61 83.30 83.30
kod 94.62 87.1 96.77 95.70 95.70 96.77
aym 99.98 99.97 99.98 99.98 99.98 99.98
ckt 44.74 52.63 26.32 55.26 28.95 57.89
itl 32.4 31.28 38.83 39.66 38.55 39.11
gup 14.75 21.31 59.02 63.93 55.74 60.66
bra 58.52 56.91 53.38 59.81 59.49 58.20
bul 98.9 96.46 99.63 99.63 99.56 99.56
ces 98.03 94.00 98.24 98.24 98.21 98.21
ckb 99.46 96.60 99.94 99.94 99.97 99.97
deu 97.98 91.94 97.43 97.43 97.46 97.46
kmr 98.21 98.09 98.02 98.02 98.01 98.01
mag 70.2 72.24 66.94 73.47 70.61 72.65
nld 98.28 94.91 98.89 98.89 98.92 98.92
pol 99.54 98.52 99.67 99.67 99.70 99.70
por 99.85 99.11 99.90 99.90 99.86 99.86
rus 98.07 94.32 97.55 97.55 97.58 97.58
spa 99.82 97.65 99.86 99.86 99.90 99.90
see 78.28 40.97 90.65 89.64 90.01 88.63
ail 6.84 6.46 12.17 11.79 10.65 12.93
evn 51.9 51.5 57.65 58.05 57.85 59.12
sah 99.95 99.69 99.93 99.93 99.97 99.97
tyv 99.97 99.78 99.95 99.95 99.97 99.97
krl 99.88 98.50 99.90 99.90 99.90 99.90
lud 59.46 59.46 16.22 45.95 27.03 45.95
olo 99.72 98.2 99.67 99.67 99.66 99.66
vep 99.72 97.05 99.65 99.65 99.70 99.70
Surprise
sjo 35.71 15.48 35.71 47.62 45.24 42.86
tur 99.90 99.49 99.36 99.36 99.35 99.35
vro 94.78 87.39 97.83 98.26 97.83 97.39
model loss, predicting Levenshtein operations often help but sometimes have a negative effect.
(copy, delete, replace and add) as a multi-
task objective going from lemma to inflected form, 7 Evaluation
label smoothing based on other characters in the
Following the evaluation procedure established in
same language (language-wise label smoothing),
the previous shared task iterations, we compare
and scheduled sampling (Bengio et al., 2015).
all systems in terms of their test set accuracy. In
addition, we perform an extensive error analysis
BME Team BME’s system is an LSTM encoder-
for most languages.
decoder model based on the work of Faruqui et al.
(2016), with three-step training where the model 8 Results
is first trained on all languages, then fine-tuned
on each language family, and finally fine-tuned on As Table 3 demonstrates, most systems achieve
individual languages. A different type of data aug- over 90% accuracy on most languages, with
mentation technique inspired by Neuvel and Fulop transformer-based baseline models demonstrating
(2002) is also used in the first two steps. Team BME superior performance on all language families ex-
also perform ablation studies and show that the aug- cept Uralic. Two Turkic languages, Sakha and Tu-
mentation techniques and the three training steps van, achieve particularly high accuracy of 99.97%.
167
This is likely due to the data being derived from
morphological transducers where certain parts of
verbal paradigms were excluded (see Section 3.8.1).
On the other hand, the accuracy on Classical Syriac, L BME GUClasp TRM
TRM+
AUG
CHR-TRM
CHR-TRM
+AUG
Chukchi, Itelmen, Kunwinjku, Braj, Ludic, Eibela, afb 94.24 82.35 96.31 96.31 96.47 96.47
75.04 75.69 81.40 81.40 80.09 80.09
Evenki, Xibe is low overall. Most of them are amh 98.15 93.77 99.38 99.38 99.44 99.44
100.00 100.00 98.07 98.07 100.00 100.00
under-resourced and have very limited amounts of ara 99.78 94.90 99.78 99.78 99.82 99.82 *
50.00 27.77 33.33 33.33 33.33 33.33
data—indeed, the Spearman correlation between arz 95.66 86.70 97.08 97.08 96.80 96.80
the transformer model’s performance and a lan- heb
89.91
97.46
92.79
89.92
91.64
99.09
91.64
99.09
91.64
99.23
91.64
99.23
guage’s training set size is of roughly 77%. syc
-
28.89
-
14.06
-
46.38
-
43.72
-
46.76
-
44.10
0 0 1.14 5.74 4.59 5.74
Analysis for each POS ame 82.45
-
55.93
-
87.43
-
87.84
-
87.15
-
86.18
-
cni 99.50 93.35 99.90 99.90 99.87 99.87
Tables 13 to 18 in the Appendix provide the accu- 100.00 100.00 100.00 100.00 100.00 100.00
ind 82.29 55.18 84.90 84.90 84.49 84.49
racy numbers for each word class. Verbs and nouns 68.83 61.90 67.09 67.09 67.96 67.96
kod 94.62 90.32 100.00 98.92 98.92 100.00
are the most represented classes in the dataset. For - - - - - -
aym 99.97 99.96 99.97 99.97 99.97 99.97
under-resourced languages such as Classical Syr- - - - - - -
iac, Itelmen, Chukchi, Braj, Magahi, Evenki, Ludic, ckt 50.00
44.11
75.00
50.00
50.00
23.52
75.00
52.94
50.00
26.47
75.00
55.88
*
lemmas, on the other hand, is mostly driven by data Table 4: Accuracy comparison for the lemmas known
augmentation. The models without data augmenta- from the training set (black numbers) vs. unknown lem-
tion have an accuracy around 60% on these lemmas, mas (red numbers). Groups having <20 unique lemmas
while all other models achieve around 65% on pre- are marked with asterisks.
viously unseen lemmas. This is in line with the
findings of Liu and Hulden (2021), who show that
the transducer’s performance on previously seen
168
words can be greatly improved by simply training L BME GUClasp
the models to perform the trivial task of copying afb 19.61 14.23
amh 6.81 15.90
random lemmas during training—a method some- ara 49.05 5.66
what related to data augmentation. arz 20.28 18.11
heb 15.90 13.63
syc 0 .50
Analysis for the most challenging inflections ame 6.45 4.83
cni 33.33 0
Table 5 shows the accuracy of the submitted sys- ind 19.94 15.60
tems on the “most challenging” test instances, aym 33.33 0
ckt 0 0
where all four baselines failed to predict the tar- itl 3.50 2.33
get form correctly. gup 0 0
bra 9.27 9.27
Frequently observed types of such cases include: bul 18.42 34.21
ces 35.59 16.57
• Unusual alternations of some letters in partic- ckb 100.00 0
deu 55.59 13.14
ular lexemes which are hard to generalize; kmr 39.36 10.10
mag 4.00 6.00
• Ambiguity of the target forms. Certain lem- nld 11.30 14.78
pol 47.61 30.15
mas allow some variation in forms, while the por 27.27 0
test set only lists a single exclusive golden rus 46.18 20.93
spa 64.00 12.00
form for each (lemma, tags) combination. In see 7.89 3.94
ail .99 .99
most cases, multiple acceptable forms may evn 5.55 4.78
be hardly distinguishable in spoken language. sah 25.00 0
tyv 80.00 20.00
For instance, they may only differ by an un- krl 78.94 47.36
stressed vowel or be orthographic variants of lud 17.64 23.52
olo 64.61 23.07
the same form. vep 58.46 9.23
sjo 9.37 6.25
• Multiword expressions are challenging when tur 93.37 88.39
vro 33.33 0
agreement is required. UniMorph does not
provide dependency information, however, Table 5: Test accuracy for each language for the sam-
the information can be inferred from simi- ples where none of baseline systems succeeds to pro-
lar samples or other parts of the same lemma duce correct prediction.
paradigm. The system’s ability to make gen-
eralizations from a sequence down to its sub-
sequences essentially depends on its architec- several cases, the result is practically correct but
ture. only in case of a different dialect, such as abull@n
instead of abuld@n. The performance is better for
• Errors in the test sets. Still, a small percentage nominal wordforms (74.27 accuracy for nouns only
of errors come from the data itself. vs. 30.55 for verbs only). This is perhaps due to
the higher regularity of nominal forms. BME is
9 Error Analysis performing slightly better for the Evenki language,
with errors in vowel harmony (such as ahatkanmo
As Elsner et al. (2019) note, accuracy-level evalua-
instead of ahatkanm@). In contrast with GUClasp,
tion might be sufficient to compare model variants
it tends to generate longer forms, adding unneces-
but does not provide much insight into the under-
sary suffixes. The problems with dialectal forms
standing of morphological systems and their learn-
can be found as well. The performance for Xibe
ability. Therefore, we now turn to a more detailed
is worse for both systems, though BME is better,
analysis of mispredictions made by the systems.
despite the simpler morphology—perhaps it is due
For the purpose of this study, we will rely on the
to the complexity of the Xibe script. At least in one
error type taxonomy proposed by Gorman et al.
instance, one of the systems generated a form with
(2019) and Muradoglu et al. (2020).
a Latin letter n instead of Xibe letters.
9.1 Evenki and Xibe
9.2 Syriac
For the Evenki language, GUClasp tends to
shorten the resulting words, sometimes generat- Both GUClasp and BME generated 350 nominal,
ing theoretically impossible forms. However, in verbal and adjectival forms with less than 50% ac-
169
Both GUClasp and BME generated 350 nominal, verbal and adjectival forms with less than 50% accuracy. This includes forms that are hypothetically correct despite being historically unattested (e.g., abydwtkwn 'your (M.PL) loss'). Both systems performed better on nominal and adjectival forms than on verbal forms. This may be explained by the higher morphological regularity of nominal forms relative to verbal forms; nominal and adjectival inflections typically follow linear affixation rules (e.g., suffixation), while verbal forms follow the same rules in addition to non-concatenative processes. Further, both systems handle lexemes with two or three letters (e.g., dn 'to judge') poorly compared to longer lexemes (e.g., bt.nwt' 'conception'). Where both systems generate multiple verbal forms for the same lexeme, the consonantal root is inconsistent. Finally, as expected, lexicalised phrases (e.g., klnš 'everyone', derived from the high-frequency contraction kl 'every' and nš 'person') and homomorphs (e.g., ql' 'an expression (n.)' or 'to fry (v.)') are handled poorly. Comparatively, the BME system performed worse than GUClasp, especially in terms of vowel diacritic placement and consonant doubling, which are consistently hypercorrected in both cases (e.g., h.byb' > h.abbbbay; h.yltn' > h.aallto').

9.3 Amharic

Both submitted systems performed well on the Amharic data, BME (98.16% accuracy) somewhat better than GUClasp (93.81% accuracy), though neither outperformed the baseline models.

For both systems, target errors represented a significant proportion of the errors: 32.35% for BME and 24.08% for GUClasp. Many of these involved alternative plural forms of nouns. The data included only the most frequent plural forms when there were alternatives, sometimes excluding uncommon but still possible forms. In some cases only an irregular form appeared in the data, and the system "erroneously" predicted the regular form with the suffix -(w)oč, which is also correct. For example, BME produced hawaryawoč, the regular plural of hawarya 'apostle', instead of the expected irregular plural hawaryat. Another source of target errors was the confusion resulting from multiple representations for the phonemes /h,P,s,s'/ in the Amharic orthography. Again, the data included only the common spelling for a given lemma or inflected form, but alternative spellings are usually also attested. Many of the "errors" consisted of predicting correct forms with one of these phonemes spelled differently than in the expected form.

The largest category of errors for both systems (unlike the baseline systems) were allomorphy errors: 51.76% for BME and 62.65% for GUClasp. Most of these resulted from confusion between vowels in verbal templates. Particularly common were vowel errors in jussive-imperative (IMP) forms. Most Amharic verb lemmas belong to one of two inflection classes, each based on roots consisting of three consonants. The vowels in the templates for these categories are identical in the perfective (PRF), imperfective (IPFV), and converb (V.CVB) forms, but differ in the infinitive (V.MSDR) and jussive-imperative, where class A has the vowels .1.@ and class B has the vowels [email protected].

Both systems also produced a significant number of silly errors, i.e., incorrect forms that could not be explained otherwise. Most of these consisted of consonant deletion, replacing one consonant with another, or repeating a consonant–vowel sequence.

9.4 Polish

Polish is among the languages for which both systems and all the baselines achieved the highest accuracy results. BME, with 99.54% accuracy, does slightly better than GUClasp (98.52%). However, neither system exceeds the baseline results (99.67–99.70%).

Most of the errors made by both systems were already noted and classified by Gorman et al. (2019) and follow from typical irregularities in Polish. For example, masculine nouns have two GEN.SG suffixes: -a and -u. The latter is typical for inanimate nouns, but the former can be used with both animate and inanimate nouns, which makes it highly unpredictable and causes the production of incorrect forms such as negatywa, rabunka instead of negatywu 'negative', rabunku 'robbery'. Both systems are vulnerable to such mistakes. Another example is the GEN.PL forms of plurale tantum nouns, which can take -ów or a zero suffix, leading to errors such as tekstyli, wiktuał instead of tekstyliów 'textiles', wiktuałów 'victuals'. Some loan words in Polish have fully (mango, marines, monsieur) or partially (millenium, in the singular only) syncretic inflectional paradigms. This phenomenon is hard to predict, as the vast majority of Polish nouns inflect regularly. Both systems tend to produce inflected forms of those nouns according to their regular endings, which would be otherwise correct if not for their syncretic paradigms.

One area in which BME returns significantly better results than GUClasp is imperative forms.
Polish imperative forms follow a few different patterns involving some vowel alternations, but in general they are fairly regular. For the 364 imperative forms in the test data set, BME produced only 12 errors, mostly excusable and concerning existing phonetic alternations which could cause some problems even for native or L2 speakers. GUClasp, however, produced 61 erroneous imperative forms, some of them being examples of overgeneralization of the zero-suffix pattern for first person singular imperatives (wyjaśn instead of wyjaśnij for the verb WYJAŚNIĆ 'explain').

Interestingly, both systems sometimes produce forms that are considered incorrect in standard Polish, but are quite often used colloquially by native speakers. Both BME and GUClasp generated the form podniesą się (instead of podniosą się 'they will rise'). Moreover, GUClasp generated the form podeszłeś (instead of podszedłeś 'you came up').

9.5 Russian

Similar to Polish and many other high-resource languages, the accuracy of all systems on Russian is high, with BME being the best-performing model (98.07%). The majority of errors consistently made by all systems (including the baseline ones) are related to the different inflections for animate and inanimate nouns in the accusative case. In particular, UniMorph does not provide the corresponding animacy feature for nouns, an issue that has also been reported previously by Gorman et al. (2019).

The formation of the short forms of adjectives and participles with -ен- and -енн- is another source of misprediction. The systems either generate an incorrect number of н, as in *умерена (should be умеренна 'moderate'), or fail to attach the suffix in cases that require some repetition in the production, as in *жертвен (should be жертвенен 'sacrificial'), i.e., generation stops after the first ен is produced. In addition to that, the systems often mispredict alternations of е and ё, as in *ошеломлённы instead of ошеломлены 'overwhelmed'. The same error also occurs in the formation of past participle forms such as *покормлённый (should be покормленный 'fed'). Further, we also observe it in noun declension, more specifically in the prediction of the singular instrumental form: *слесарём (should be слесарем 'locksmith'), *гостьёй (should be гостьей 'female guest'). Additionally, we observe more errors in the prediction of the instrumental case forms, mainly due to allomorphy. In many cases, the systems would have benefited from observing stress patterns or grammatical gender. For instance, consider the feminine акварель 'aquarelle' and the masculine пароль 'password'. In order to make a correct prediction, a model should either be explicitly provided with the grammatical gender, or a partial paradigm (e.g., the dative and genitive singular slots) for the corresponding lemma should be observed in the training set. Indeed, the latter is often the case, but the systems still fail to make a correct inference.

Finally, multiword expressions present themselves as an extra challenge to the models. In most cases, the test lemmas also appeared in the training set, therefore the systems could infer the dependency information from other parts of the same lexeme. Still, Russian multiword expressions appeared to be harder to inflect, probably because they show richer combinatory diversity. For instance, электромагнитное взаимодействие 'electromagnetic interaction' for the plural instrumental case slot is mispredicted as *электромагнитными взаимодействия, i.e., the adjective part is correct while the noun form is not. As Table 7 illustrates, the accuracy gap in predicting multiword expressions with lemmas in- or out-of-vocabulary is quite large.

9.6 Ludic

The Ludic language, in comparison with the Karelian, Livvi and Veps languages, has the smallest number of lemmas (31 verbs, 77 nouns and 1 adjective) and has the lowest accuracy (16–60%). Therefore, the incomplete data is the main cause of errors in the morphological analyzers working with the Ludic dialect ('target errors' in the error taxonomy proposed by Gorman et al.).

9.7 Kurdish

Testing on the Sorani data yields high accuracy values, although there are errors in some cases. More than 200 lemmas and 16K samples were generated. Both BME and GUClasp generate regular nominal and verbal forms with a high accuracy of 99.46% and 96.6% respectively, although neither system exceeds the baseline results of 99.94% and 99.97%. Kurdish has a complex morphological system with defective paradigms and second-position person markers.
Double clitic and affix-clitic constructions can mark subjects or objects in a verbal construction, and ditransitives are formed with applicative markers. Such morphological structures can be the reason for the few issues that still occur.

9.8 Tuvan and Sakha

Both BME and GUClasp predict the majority of the inflected forms correctly, achieving test accuracies of over 99.6% on both Tuvan and Sakha, with BME performing slightly better on both languages. The remaining errors are generally caused by misapplications of morphophonology, either by the prediction system or by the data generator itself. Since the forms treated as ground truth were automatically generated by morphological transducers (§3.8.1), the mismatch between the prediction and the reference might be due to 'target errors' where the reference itself is wrong (Gorman et al., 2019). For the BME system, target errors account for 1/8 disagreement cases for Tuvan and 3/13 for Sakha, although for all of them the system's prediction is indeed incorrect as well. For GUClasp, the reference is wrong in 19/62 cases for Tuvan (four of them also have an incorrect lemma, which makes it impossible to judge the correctness of any inflection) and 43/90 for Sakha. Interestingly, GUClasp actually predicts the correct inflected form for 27/43 and 3/15 target error cases for Sakha and Tuvan, respectively (for Tuvan, counting the 19 target errors minus the 4 unjudgeable cases).

The actual failure cases for both BME and GUClasp are largely allomorphy errors, per Gorman et al.'s classification. Common problems include consonant alternation (Sakha *охсусуҥ instead of охсуһуҥ), vowel harmony (Tuvan *ижиарлер instead of ижигерлер) and vowel/null alternation (Tuvan *шымынар силер instead of шымныр силер). Unadapted loanwords that entered the languages through Russian (e.g. Sakha педагог 'pedagogue', принц 'prince', наследие 'heritage') are also frequent among the errors for both systems.

9.9 Ashaninka and Yanesha

For Ashaninka, the high baseline scores (over 99.8%) could be attributed to the relatively high regularity of the (morpho)phonological rules in the language. In this language, the BME system achieved comparable performance with 99.5%, whereas GUClasp still achieved a robust accuracy of 93.36%.

The case of Yanesha is different, as the baseline only peaked at 87.43%, whereas the BME and GUClasp systems underperformed with 82.46% and 55.94%, respectively. The task for Yanesha is harder, as the writing tradition is not transparent enough to predict some rules. For instance, long and short vowels are written in a similar way, always with a single vowel, and the aspirated vowels are optionally marked with a diacritic. These distinctions are essential at the linguistic level, as they allow one to explain the morphophonological processes, such as the syncope of the weak vowels in the inflected forms (po'kochllet instead of po'kchellet). We also observe allomorphy errors, for instance, predicting phomchocheñ instead of pomchocheñ (from mochocheñets and V;NO2;FIN;REAL). The singular second person prefix has ph-, pe- and p- as allomorphs, each with different rules of application. Moreover, there are some spelling issues as well, as the diacritic and the apostrophe are usually optional. For instance, the spellings wapa or wápa ('to come to where someone is located') are both correct. It is important to note that the orthographic standards are going to be revised by the Peruvian government to reduce the ambiguous elements.

9.10 Magahi

The transformer baseline with data augmentation (TRM+AUG) achieved the highest score, with GUClasp taking second place (with 72.24%) and the base transformer yielding the lowest score of 66.94%. For Magahi, the results do not vary too much between systems, and the clearest performance boost seems to arise from the use of data augmentation. The low score of the TRM baseline is caused by the scarcity of data and the diversity in the morphophonological structure. Prediction errors on Magahi include incorrect honorificity, mispredicting plural markers, and spelling errors.

Honorificity: the systems predict forms lacking the honorific marker /-ak/. For example, /puchhal/ ('asked') is predicted instead of /puchhlak/ ('asked'), or /bital/ ('passed time') instead of /bitlak/ ('passed time').

Plural marker: the systems' predictions omit the plural markers /-van/ and /-yan/, similarly to the case of the honorific markers discussed above. For example, /thag/ ('con') is produced instead of /thagwan/ ('con').
Spelling errors: the predicted words do not occur in the Magahi language. The predictions also do not show any specific error pattern.

We thus conclude that the performance of the baseline systems is greatly affected by the morphological structure of Magahi. Also, some language-specific properties of Magahi are not covered by the scope of the UniMorph tagset. For example, consider the following pair:

(/dekh/, /dekhlai/, 'V;3;PRF;INFM;LGSPEC1', 'see')
(/dekh/, /dekhlak/, 'V;3;PRF;INFM;LGSPEC2', 'see')

Here both forms exhibit morphological features that are not defined in the default annotation schema. Morphologically, the first form indicates that the speaker knows the addressee but not intimately (or there is a low level of intimacy), while the second one signals a comparatively higher level of intimacy. Such aspects of the Magahi morphology are challenging for the systems.

9.11 Braj

For the low-resource language Braj, both submitted systems performed worse than the baseline systems. BME achieved 58.52% prediction accuracy, slightly outperforming GUClasp with 56.91%. As for the baseline systems, CHR-TRM scored highest with 59.49% accuracy and TRM scored lowest with 53.38%. Among Indo-European languages, the performance of the BME, GUClasp, and the baseline systems is lowest for Braj. The low accuracy and the larger number of errors are broadly due to misprediction and misrepresentation of the morphological units and the smaller data size.

BME, GUClasp, and the baseline systems generated 311 nominal, verbal, and adjectival inflected forms from existing lemmas. In these outputs, the misprediction and misrepresentation errors are morphemic errors, already included and classified by Gorman et al. (2019). The findings of our analysis of both the gold data and the predictions of all systems highlight several common problems for nominal, verbal, and adjectival inflected forms. Common errors, mispredicted by all models, include morphemes of gender (masculine and feminine: for the noun akhabaaree instead of akhabaar 'newspaper', for the verb arraa instead of arraaii 'shout', and for the adjective mithak instead of mithakeey 'ancient'); morphemes of number (singular and plural: for the noun kahaanee instead of kahaaneen 'story', for the verb utaran instead of utare 'get off', for the adjective achchhe instead of achchhau 'good'); and morphemes of honorificity (formal and informal: suni instead of sunikain 'listen', rahee instead of raheen 'be', and karibau instead of kari 'do', etc.). A portion of these errors is also caused by inflection errors in predicting and generating multiword expressions (MWEs) (e.g. aannd instead of aannd-daayak 'comfortable'). Apart from the mentioned error types, the systems also made silly errors (e.g. uthi instead of uthaay 'get up', kathan instead of kathanopakathan 'conversation', karaah instead of karaahatau 'groan', keeee instead of keenee 'do', and grahanave instead of grahan 'accept', etc.) and spelling errors (e.g. dhamaky instead of dhamakii 'threat', laau or laauy instead of liyau 'take', saii instead of saanchii 'correct', and samajhat instead of samajhaayau 'explain', etc.), as classified by Gorman et al. (2019). Under all of the models, the majority of errors were silly or spelling errors.

9.12 Aymara

All systems achieved high accuracy (99.98%) on this language. The few errors are mainly due to inconsistency in the initial data annotation. For instance, the form uraqiw is listed as a lemma while it can only be understood as being comprised of two morphemes: uraqi-w(a) 'it (is) land'. The root, or the nominative unmarked form, is uraqi 'land'. The -wa is presumably the declarative suffix. The nucleus of this suffix can be lost owing to the complex morphophonemic rules which operate at the edges of phrases. In addition, the accusative form uraqi is incorrect since the accusative is marked by subtractive disfixation; therefore, uraq is the accusative inflected form.

9.13 Eibela

Eibela seems to be one of the most challenging languages, probably due to its small data size and sparsity. Since it has been extracted from interlinear texts, a vast majority of its paradigms are partial, and this certainly makes the task more difficult. A closer look at system outputs reveals that many errors are related to misprediction of vowel length. For instance, to:mulu is inflected in N;ERG as tomulE instead of to:mu:lE:.

9.14 Kunwinjku

The data for Kunwinjku is relatively small and contains verbal paradigms only. Test accuracies range from 14.75% (BME) to 63.93% (TRM+AUG).
In this language, many errors were due to incorrect spelling and missing parts of the transcription. For instance, for the second person plural non-past of the lemma borlbme, TRM predicts *ngurriborlbme instead of ngurriborle. Interestingly, BME mispredicts most forms due to the looping effect described by Shcherbakov et al. (2020). In particular, it starts producing sequences such as *ngarrrrrrrrrrrrrrmbbbijj (should be karribelbmerrinj) or *ngadjarridarrkddrrdddrrmerri (should be karriyawoyhdjarrkbidyikarrmerrimeninj).

Table 6: Accuracy comparison for fragment substitutions that could be observed in the training set vs. more complex transformations, per language (afb, amh, ara, arz, heb, syc, ame, cni, ind, kod, aym, ckt, itl) and per system (BME, GUClasp, TRM, TRM+AUG, CHR-TRM, CHR-TRM+AUG). Groups having <20 unique lemmas are marked with asterisks.

10 Discussion

Reusing transformation patterns

In most cases, morphological transformations may be properly carried out by matching a lemma [...]. Otherwise, the prediction accuracy significantly degraded. Table 7 shows the multi-word lemma transformation accuracies. From these results we further notice that while all systems' performance degrades on the previously unseen multi-word inflection patterns, this degradation is considerably smaller for the transformer-based baselines (except for Turkish), implying that these models can better generalise to previously unseen patterns.
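As a rough illustration of the pattern-matching view sketched above, the snippet below represents each (lemma, form) pair by the suffix substitution it applies and checks whether that substitution was attested in training. The suffix-rule representation and the toy Polish examples are our own illustrative assumptions; the shared task does not prescribe a particular pattern representation.

def suffix_rule(lemma, form):
    # Represent a (lemma, form) pair by the suffix substitution it applies,
    # e.g. ("kot", "kotem") -> ("", "em").
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1                                  # length of the longest common prefix
    return lemma[i:], form[i:]

def seen_patterns(train_pairs):
    # All substitution patterns attested in the training data.
    return {suffix_rule(lemma, form) for lemma, form in train_pairs}

def is_reusable(lemma, gold_form, patterns):
    # True if producing gold_form from lemma only needs a pattern seen in training.
    return suffix_rule(lemma, gold_form) in patterns

patterns = seen_patterns([("kot", "kotem"), ("dom", "domem")])   # toy training data
print(is_reusable("stół", "stołem", patterns))   # False: the ó~o alternation is a new pattern

Read this way, the first row for each language in Table 7 roughly corresponds to test pairs for which such a check succeeds, and the second row to pairs requiring a previously unobserved, more complex transformation.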
L     BME     GUClasp  TRM     TRM+AUG  CHR-TRM  CHR-TRM+AUG
aym   100.00  99.90    100.00  100.00   100.00   100.00
      -       -        -       -        -        -
bul   95.00   93.21    100.00  100.00   100.00   100.00
      81.81   63.63    100.00  100.00   100.00   100.00
ces   75.00   55.00    80.00   80.00    80.00    80.00
      71.42   57.14    100.00  100.00   100.00   100.00
ckb   100.00  100.00   100.00  100.00   100.00   100.00
      -       -        -       -        -        -
deu   88.57   74.28    97.14   97.14    100.00   100.00
      71.87   0        93.75   93.75    87.50    87.50
kmr   98.76   98.71    95.34   95.34    95.51    95.51
      -       -        -       -        -        -
nld   50.00   50.00    50.00   50.00    50.00    50.00
      -       -        -       -        -        -
pol   99.85   99.60    99.97   99.97    99.91    99.91
      98.28   92.12    98.28   98.28    98.28    98.28
rus   90.93   87.91    96.45   96.45    96.05    96.05
      56.55   27.04    72.13   72.13    72.13    72.13
tur   99.77   98.63    95.67   95.67    95.58    95.58
      90.90   59.09    77.27   77.27    86.36    86.36

Table 7: Accuracy for MWE lemmata in each language on the test data. For each language, the first row corresponds to fragment substitutions that could be observed in the training set, while the second row corresponds to more complex transformations.

Allomorphy

The application of wrong (albeit valid) inflectional transformations by the models (allomorphy) is present in most analysed languages. These allomorphy errors can be further divided into two groups: (1) when an inflectional tag itself allows for multiple inflection patterns which must be distinguished by the declension/inflection class to which the word belongs, and (2) when the model applies an inflectional rule that is simply invalid for that specific tag. These errors are hard to analyse, however. The first is potentially unavoidable without extra information, as declension/inflection classes are not always fully predictable from word forms alone (Williams et al., 2020). The second type of allomorphic error, on the other hand, is potentially fixable. In our analysis, however, we did not find any concrete patterns as to when the models make this second (somewhat arbitrary) type of mistake.
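To make the two groups concrete, here is a toy sketch; the rule inventories below (VALID_RULES, CLASS_RULE) are illustrative stand-ins built around the Polish GEN.SG example from Section 9.4, not shared task data.

VALID_RULES = {"N;GEN;SG": {"-a", "-u"}}        # both suffixes are valid realisations of this tag
CLASS_RULE = {("N;GEN;SG", "class_u"): "-u"}    # but this inflection class requires -u

def classify_allomorphy_error(tag, lexeme_class, predicted_rule):
    # Group (1): a rule that is valid for the tag but wrong for this lexeme's class.
    # Group (2): a rule that is not valid for the tag at all.
    if predicted_rule not in VALID_RULES.get(tag, set()):
        return "group (2): rule invalid for this tag"
    if predicted_rule != CLASS_RULE.get((tag, lexeme_class)):
        return "group (1): valid rule, wrong declension/inflection class"
    return "not an allomorphy error"

# e.g. predicting *rabunka (suffix -a) for a lemma whose class takes -u:
print(classify_allomorphy_error("N;GEN;SG", "class_u", "-a"))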
Spelling Errors

Spelling errors are pervasive in most analysed languages, even high-resource ones. These are challenging, as they require a deep understanding of the modelled language in order to be avoided. Spelling errors are especially common in languages with vowel harmony (e.g. Tungusic), as the models have some difficulty in correctly modelling it. Another source of spelling errors is the diacritics. In Portuguese, for instance, most of the errors produced by the BME system arise due to missing acute accents, which mark stress; their use is determined by specific (and somewhat idiosyncratic) orthographic rules.

11 Conclusion

In the development of this shared task we added new data for 32 languages (13 language families) to UniMorph, most of which are under-resourced. Further, we evaluated the performance of morphological reinflection systems on a typologically diverse set of languages and performed a fine-grained analysis of their error patterns in most of these languages. The main challenge for the morphological reinflection systems is still (as expected) handling low-resource scenarios (where there is little training data). We further identified a large gap in these systems' performance between the test lemmas present in the training set and the previously unseen lemmas; the latter are naturally hard test cases, but work on reinflection models could focus on improving these results going forward, following, for instance, the work of Liu and Hulden (2021). Further, allomorphy, honorificity and multiword lemmas also pose challenges for the current models. We hope that the analysis presented here, together with the new expansion of the UniMorph resources, will help drive further improvements in morphological reinflection. Following Malouf et al. (2020), we would like to emphasize that linguistic analyses using UniMorph should be performed with some degree of caution, since for many languages it might not provide an exhaustive list of paradigms and variants.

Acknowledgements

We would like to thank Dr George Kiraz and Beth Mardutho: The Syriac Institute for their help with Classical Syriac data.

References

Willem F. H. Adelaar and Pieter C. Muysken. 2004. The Languages of the Andes. Cambridge Language Surveys. Cambridge University Press.

Grant Aiton. 2016a. The documentation of Eibela: An archive of Eibela language materials from the Bosavi region (Western Province, Papua New Guinea).

Grant William Aiton. 2016b. A grammar of Eibela: a language of the Western Province, Papua New Guinea. Ph.D. thesis, James Cook University.
Sarah Alkuhlani and Nizar Habash. 2011. A corpus pages 1–30, Vancouver. Association for Computa-
for modeling morpho-syntactic agreement in Arabic: tional Linguistics.
Gender, number and rationality. In Proceedings of
the 49th Annual Meeting of the Association for Com- Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
putational Linguistics: Human Language Technolo- David Yarowsky, Jason Eisner, and Mans Hulden.
gies, pages 357–362, Portland, Oregon, USA. Asso- 2016. The SIGMORPHON 2016 shared Task—
ciation for Computational Linguistics. Morphological reinflection. In Proceedings of the
14th SIGMORPHON Workshop on Computational
Antonios Anastasopoulos and Graham Neubig. 2019. Research in Phonetics, Phonology, and Morphol-
Pushing the limits of low-resource morphological in- ogy, pages 10–22, Berlin, Germany. Association for
flection. In Proceedings of the 2019 Conference on Computational Linguistics.
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu- William Croft. 2002. Typology and universals. Cam-
ral Language Processing (EMNLP-IJCNLP), pages bridge University Press.
984–996, Hong Kong, China. Association for Com-
putational Linguistics. Michael Daniel. 2011. Linguistic typology and the
study of language. In The Oxford handbook of lin-
Gregory David Anderson and K David Harrison. 1999. guistic typology. Oxford University Press.
Tyvan (Languages of the World/Materials 257).
München: LINCOM Europa. R. M. W. Dixon and Alexandra Y. Aikhenvald, editors.
Phyllis E. Wms. Bardeau. 2007. The Seneca Verb: La- 1999. The Amazonian languages (Cambridge Lan-
beling the Ancient Voice. Seneca Nation Education guage Surveys). Cambridge University Press.
Department, Cattaraugus Territory.
M Duff-Trip. 1998. Diccionario Yanesha’ (Amuesha)-
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Castellano. Lima: Instituto Lingüístico de Verano.
Noam Shazeer. 2015. Scheduled sampling for se-
quence prediction with recurrent neural networks. David M. Eberhard, Gary F. Simons, and Charles
In Advances in Neural Information Processing Sys- D. Fennig (eds.). 2021. Ethnologue: Languages of
tems 28: Annual Conference on Neural Informa- the world. Twenty-fourth edition. Online version:
tion Processing Systems 2015, December 7-12, 2015, http://www.ethnologue.com.
Montreal, Quebec, Canada, pages 1171–1179.
Micha Elsner, Andrea D Sims, Alexander Erdmann,
Noam Chomsky. 1995. Language and nature. Mind, Antonio Hernandez, Evan Jaffe, Lifeng Jin, Martha
104(413):1–61. Booker Johnson, Shuan Karim, David L King, Lu-
ana Lamberti Nunes, et al. 2019. Modeling morpho-
Matt Coler. 2010. A grammatical description of Muy- logical learning, typology, and change: What can the
laq’Aymara. Ph.D. thesis, Vrije Universiteit Amster- neural sequence-to-sequence framework contribute?
dam. Journal of Language Modelling, 7.
Matt Coler. 2014. A grammar of Muylaq’Aymara: Ay- Nicholas Evans. 2003. Bininj Gun-wok: A Pan-
mara as spoken in Southern Peru. Brill. dialectal Grammar of Mayali, Kunwinjku and Kune.
Bernard Comrie. 1989. Language universals and lin- Pacific Linguistics. Australian National University.
guistic typology: Syntax and morphology. Univer-
sity of Chicago press. Nicholas Evans and Stephen C Levinson. 2009. The
myth of language universals: Language diversity
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, and its importance for cognitive science. Behavioral
Géraldine Walther, Ekaterina Vylomova, Arya D. and brain sciences, 32(5):429–448.
McCarthy, Katharina Kann, Sabrina J. Mielke, Gar-
rett Nicolai, Miikka Silfverberg, David Yarowsky, Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and
Jason Eisner, and Mans Hulden. 2018. The CoNLL– Chris Dyer. 2016. Morphological inflection genera-
SIGMORPHON 2018 shared task: Universal mor- tion using character sequence to sequence learning.
phological reinflection. In Proceedings of the In Proceedings of the 2016 Conference of the North
CoNLL–SIGMORPHON 2018 Shared Task: Univer- American Chapter of the Association for Computa-
sal Morphological Reinflection, pages 1–27, Brus- tional Linguistics: Human Language Technologies,
sels. Association for Computational Linguistics. pages 634–643, San Diego, California. Association
for Computational Linguistics.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
Géraldine Walther, Ekaterina Vylomova, Patrick I.K. Fattah. 2000. Les dialectes kurdes méridionaux:
Xia, Manaal Faruqui, Sandra Kübler, David étude linguistique et dialectologique. Acta Iranica
Yarowsky, Jason Eisner, and Mans Hulden. 2017. : Encyclopédie permanente des études iraniennes.
CoNLL-SIGMORPHON 2017 shared task: Univer- Peeters.
sal morphological reinflection in 52 languages. In
Proceedings of the CoNLL SIGMORPHON 2017 Charles F Ferguson. 1959. Diglossia. Word,
Shared Task: Universal Morphological Reinflection, 15(2):325–340.
Mikel L Forcada, Mireia Ginestí-Rosell, Jacob Nord- Martin Haspelmath. 2020. Human linguisticality and
falk, Jim O’Regan, Sergio Ortiz-Rojas, Juan An- the building blocks of languages. Frontiers in psy-
tonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema chology, 10:3056.
Ramírez-Sánchez, and Francis M Tyers. 2011. Aper-
tium: a free/open-source platform for rule-based ma- Johannes Heinecke. 2019. ConlluEditor: a fully graph-
chine translation. Machine translation, 25(2):127– ical editor for universal dependencies treebank files.
144. In Proceedings of the Third Workshop on Univer-
sal Dependencies (UDW, SyntaxFest 2019), pages
Michael Gasser. 2011. HornMorpho: a system 87–93, Paris, France. Association for Computational
for morphological processing of Amharic, Oromo, Linguistics.
and Tigrinya. In Proceedings of the Conference
on Human Language Technology for Development, Sardana Ivanova, Anisia Katinskaia, and Roman Yan-
Alexandria, Egypt. garber. 2019. Tools for supporting language learn-
ing for Sakha. In Proceedings of the 22nd Nordic
Yustinus Ghanggo Ate. 2020. Kodi (Indonesia) - Lan- Conference on Computational Linguistics, pages
guage Snapshot. Language Documentation and De- 155–163, Turku, Finland. Linköping University
scription 19, pages 171–180. Electronic Press.
Yustinus Ghanggo Ate. 2021. Documentation of Kodi. Sardana Ivanova, Francis M. Tyers, and Jonathan N.
New Haven: Endangered Language Fund. Washington. to appear in 2022. A free/open-source
morphological analyser and generator for Sakha. In
Yustinus Ghanggo Ate. to appear in 2021. Reduplica- preparation.
tion in Kodi: A paradigm function account. Word
Structure 14(3). Danesh Jain and George Cardona. 2007. The Indo-
Aryan Languages. Routledge.
Kyle Gorman, Arya D. McCarthy, Ryan Cotterell,
Thomas Jügel. 2009. Ergative Remnants in Sorani Kur-
Ekaterina Vylomova, Miikka Silfverberg, and Mag-
dish? Orientalia Suecana, 58:142–158.
dalena Markowska. 2019. Weird inflects but OK:
Making sense of morphological generation errors. Olga Kazakevich and Elena Klyachko. 2013. Creat-
In Proceedings of the 23rd Conference on Computa- ing a multimedia annotated text corpus: a research
tional Natural Language Learning (CoNLL), pages task (Sozdaniye multimediynogo annotirovannogo
140–151, Hong Kong, China. Association for Com- korpusa tekstov kak issledovatelskaya protsedura).
putational Linguistics. In Proceedings of International Conference Compu-
tational linguistics 2013, pages 292–300.
Joseph Greenberg. 1963. Some universals of grammar
with particular reference to the order of meaningful Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama
elements. In Joseph Greenberg, editor, Universals Obeid, Dana Abdulrahim, and Meera Al Kaabi.
of Language, pages 73–113. MIT Press. 2018. A morphologically annotated corpus of Emi-
rati Arabic. In Proceedings of the Eleventh Interna-
George A. Grierson. 1908. Indo-Aryan Family: Cen- tional Conference on Language Resources and Eval-
tral Group: Specimens of the Rājasthānı̄ and Gu- uation (LREC 2018), Miyazaki, Japan. European
jarātı̄, volume IX(II) of Linguistic Survey of India. Language Resources Association (ELRA).
Office of the Superintendent of Government Print-
ing, Calcutta. Salam Khalifa, Sara Hassan, and Nizar Habash. 2017.
A morphological analyzer for Gulf Arabic verbs. In
George Abraham Grierson. 1903. Linguistic Survey of Proceedings of the Third Arabic Natural Language
India, Vol-III. Calcutta: Office of the Superinten- Processing Workshop, pages 35–45, Valencia, Spain.
dent, Government of PRI. Association for Computational Linguistics.
George Abraham Grierson and Sten Konow. 1903. Lin- Tanmai Khanna, Jonathan N. Washington, Fran-
guistic Survey of India. Calcutta Supt., Govt. Print- cis M. Tyers, Sevilay Bayatlı, Daniel G. Swanson,
ing. Tommi A. Pirinen, Irene Tang, and Hèctor Alòs
i Font. to appear in 2021. Recent advances in Aper-
Nizar Habash, Ramy Eskander, and Abdelati Hawwari. tium, a free/open-source rule-based machine transla-
2012. A morphological analyzer for Egyptian Ara- tion platform for low-resource languages. Machine
bic. In Proceedings of the Twelfth Meeting of the Translation.
Special Interest Group on Computational Morphol-
ogy and Phonology, pages 1–9, Montréal, Canada. Lee Kindberg. 1980. Diccionario asháninca. Lima:
Association for Computational Linguistics. Instituto Lingüístico de Verano.
K. David Harrison. 2000. Topics in the Phonology and Christo Kirov, Ryan Cotterell, John Sylak-Glassman,
Morphology of Tuvan. Ph.D. thesis, Yale University. Géraldine Walther, Ekaterina Vylomova, Patrick
Xia, Manaal Faruqui, Sabrina J. Mielke, Arya Mc-
Martin Haspelmath. 2010. Comparative concepts Carthy, Sandra Kübler, David Yarowsky, Jason Eis-
and descriptive categories in crosslinguistic studies. ner, and Mans Hulden. 2018. UniMorph 2.0: Uni-
Language, 86(3):663–687. versal Morphology. In Proceedings of the Eleventh
International Conference on Language Resources Robert Malouf, Farrell Ackerman, and Arturs Se-
and Evaluation (LREC 2018), Miyazaki, Japan. Eu- menuks. 2020. Lexical databases for computational
ropean Language Resources Association (ELRA). analyses: A linguistic perspective. In Proceedings
of the Society for Computation in Linguistics 2020,
Ritesh Kumar, Bornini Lahiri, and Deepak Alok. 2014. pages 446–456, New York, New York. Association
Developing LRs for Non-scheduled Indian Lan- for Computational Linguistics.
guages: A Case of Magahi. In Human Language
Technology Challenges for Computer Science and Pedro Mayor Aparicio and Richard E Bodmer. 2009.
Linguistics, Lecture Notes in Computer Science, Pueblos indígenas de la Amazonía peruana. Iquitos:
pages 491–501. Springer International Publishing, Centro de Estudios Teológicos de la Amazonía.
Switzerland. Original-date: 2014.
Arya D. McCarthy, Christo Kirov, Matteo Grella,
Ritesh Kumar, Bornini Lahiri, Deepak Alok Atul Kr. Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekate-
Ojha, Mayank Jain, Abdul Basit, and Yogesh Dawar. rina Vylomova, Sabrina J. Mielke, Garrett Nico-
2018. Automatic identification of closely-related In- lai, Miikka Silfverberg, Timofey Arkhangelskiy, Na-
dian languages: Resources and experiments. In Pro- taly Krizhanovsky, Andrew Krizhanovsky, Elena
ceedings of the 4th Workshop on Indian Language Klyachko, Alexey Sorokin, John Mansfield, Valts
Data Resource and Evaluation (WILDRE-4), Paris, Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan
France. European Language Resources Association Cotterell, Mans Hulden, and David Yarowsky. 2020.
(ELRA). UniMorph 3.0: Universal Morphology. In Proceed-
ings of the 12th Language Resources and Evaluation
Bornini Lahiri. 2021. The Case System of Eastern Indo- Conference, pages 3922–3931, Marseille, France.
Aryan Languages: A Typological Overview. Rout- European Language Resources Association.
ledge.
Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell,
William Lane and Steven Bird. 2019. Towards a robust Mans Hulden, and David Yarowsky. 2018. Marrying
morphological analyzer for Kunwinjku. In Proceed- Universal Dependencies and Universal Morphology.
ings of the The 17th Annual Workshop of the Aus- In Proceedings of the Second Workshop on Univer-
tralasian Language Technology Association, pages sal Dependencies (UDW 2018), pages 91–101, Brus-
1–9, Sydney, Australia. Australasian Language Tech- sels, Belgium. Association for Computational Lin-
nology Association. guistics.
Septina Dian Larasati, Vladislav Kubon, and Daniel
Zeman. 2011. Indonesian morphology tool (Mor- Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu,
phInd): Towards an Indonesian corpus. In Sys- Chaitanya Malaviya, Lawrence Wolf-Sonkin, Gar-
tems and Frameworks for Computational Morphol- rett Nicolai, Christo Kirov, Miikka Silfverberg, Sab-
ogy - Second International Workshop, SFCM 2011, rina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and
Zurich, Switzerland, August 26, 2011. Proceedings, Mans Hulden. 2019. The SIGMORPHON 2019
volume 100 of Communications in Computer and In- shared task: Morphological analysis in context and
formation Science, pages 119–129. Springer. cross-lingual transfer for inflection. In Proceedings
of the 16th Workshop on Computational Research in
Ling Liu and Mans Hulden. 2021. Can a transformer Phonetics, Phonology, and Morphology, pages 229–
pass the wug test? Tuning copying bias in neu- 244, Florence, Italy. Association for Computational
ral morphological inflection models. arXiv preprint Linguistics.
arXiv:2104.06483.
Elena Mihas. 2017. The Kampa subgroup of the
Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Arawak language family. In Alexandra Y. Aikhen-
Wigdan Mekki. 2004. The Penn Arabic Treebank: vald and R. M. W. Dixon, editors, The Cambridge
Building a Large-Scale Annotated Arabic Corpus. Handbook of Linguistic Typology, Cambridge Hand-
In Proceedings of the International Conference on books in Language and Linguistics, page 782–814.
Arabic Language Resources and Tools, pages 102– Cambridge University Press.
109, Cairo, Egypt.
Saliha Muradoglu, Nicholas Evans, and Ekaterina Vy-
Mohamed Maamouri, Ann Bies, Seth Kulick, Dalila lomova. 2020. Modelling verbal morphology in
Tabessi, and Sondos Krouna. 2012. Egyptian Arabic Nen. In Proceedings of the The 18th Annual Work-
Treebank DF Parts 1-8 V2.0 - LDC catalog num- shop of the Australasian Language Technology As-
bers LDC2012E93, LDC2012E98, LDC2012E89, sociation, pages 43–53, Virtual Workshop. Aus-
LDC2012E99, LDC2012E107, LDC2012E125, tralasian Language Technology Association.
LDC2013E12, LDC2013E21.
Sylvain Neuvel and Sean A. Fulop. 2002. Unsuper-
Mohamed Maamouri, Dave Graff, Basma Bouziri, Son- vised learning of morphology without morphemes.
dos Krouna, Ann Bies, and Seth Kulick. 2010. LDC In Proceedings of the ACL-02 Workshop on Mor-
standard Arabic morphological analyzer (SAMA) phological and Phonological Learning, pages 31–
version 3.1. 40. Association for Computational Linguistics.
I. P. Novak, N. B. Krizhanovskaya, T. P. Boiko, and Andrei Shcherbakov, Saliha Muradoglu, and Ekate-
N. A. Pellinen. 2020. Development of rules of gen- rina Vylomova. 2020. Exploring looping effects
eration of nominal word forms for new-written vari- in RNN-based architectures. In Proceedings of the
ants of the Karelian language. Vestnik ugrovedenia The 18th Annual Workshop of the Australasian Lan-
= Bulletin of Ugric Studies, 10(4):679–691. guage Technology Association, pages 115–120, Vir-
tual Workshop. Australasian Language Technology
Irina Novak. 2019. Karelian language and its dialects. Association.
In I. Vinokurova, editor, Peoples of Karelia: His-
torical and Ethnographic Essays, pages 56–65. Pe- John Sylak-Glassman, Christo Kirov, Matt Post, Roger
riodika. Que, and David Yarowsky. 2015a. A universal
feature schema for rich morphological annotation
Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima and fine-grained cross-lingual part-of-speech tag-
Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl ging. In International Workshop on Systems and
Eryani, Alexander Erdmann, and Nizar Habash. Frameworks for Computational Morphology, pages
2020. CAMeL tools: An open source python toolkit 72–93. Springer.
for Arabic natural language processing. In Proceed-
ings of the 12th Language Resources and Evaluation John Sylak-Glassman, Christo Kirov, David Yarowsky,
Conference, pages 7022–7032, Marseille, France. and Roger Que. 2015b. A language-independent
European Language Resources Association. feature schema for inflectional morphology. In Pro-
ceedings of the 53rd Annual Meeting of the Associ-
Sofia Oskolskaya, Ezequiel Koile, and Martine ation for Computational Linguistics and the 7th In-
Robbeets. 2021. A Bayesian approach to the clas- ternational Joint Conference on Natural Language
sification of Tungusic languages. Diachronica. Processing (Volume 2: Short Papers), pages 674–
680, Beijing, China. Association for Computational
Prateek Pankaj. 2020. Reconciling Surdas and Keshav- Linguistics.
das: A study of commonalities and differences in
Brajbhasha literature. IOSR Journal of Humanities Dima Taji, Nizar Habash, and Daniel Zeman. 2017.
and Social Sciences, 25. Universal Dependencies for Arabic. In Proceedings
of the Third Arabic Natural Language Processing
Femphy Pisceldo, Rahmad Mahendra, Ruli Manurung, Workshop, pages 166–176, Valencia, Spain. Associ-
and I Wayan Arka. 2008. A two-level morpholog- ation for Computational Linguistics.
ical analyser for the Indonesian language. In Pro-
ceedings of the Australasian Language Technology Dima Taji, Salam Khalifa, Ossama Obeid, Fadhl
Association Workshop 2008, pages 142–150, Hobart, Eryani, and Nizar Habash. 2018. An Arabic mor-
Australia. phological analyzer and generator with copious fea-
tures. In Proceedings of the Fifteenth Workshop
Emmanouil Antonios Platanios, Otilia Stretcu, Graham on Computational Research in Phonetics, Phonol-
Neubig, Barnabás Póczos, and Tom M. Mitchell. ogy, and Morphology, pages 140–150, Brussels, Bel-
2019. Competence-based curriculum learning for gium. Association for Computational Linguistics.
neural machine translation. In Proceedings of the
2019 Conference of the North American Chapter Talat Tekin. 1990. A new classification of the Turkic
of the Association for Computational Linguistics: languages. Türk dilleri araştırmaları, 1:5–18.
Human Language Technologies, NAACL-HLT 2019, Francis Tyers, Aziyana Bayyr-ool, Aelita Salchak, and
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 Jonathan Washington. 2016. A finite-state mor-
(Long and Short Papers), pages 1162–1172. Associ- phological analyser for Tuvan. In Proceedings of
ation for Computational Linguistics. the Tenth International Conference on Language Re-
sources and Evaluation (LREC’16), pages 2562–
Adam Przepiórkowski and Marcin Woliński. 2003. A
2567, Portorož, Slovenia. European Language Re-
flexemic tagset for Polish. In Proceedings of the
sources Association (ELRA).
2003 EACL Workshop on Morphological Processing
of Slavic Languages, pages 33–40, Budapest, Hun- Francis Tyers and Karina Mishchenkova. 2020. Depen-
gary. Association for Computational Linguistics. dency annotation of noun incorporation in polysyn-
thetic languages. In Proceedings of the Fourth
Andreas Scherbakov. 2020. The UniMelb submission Workshop on Universal Dependencies (UDW 2020),
to the SIGMORPHON 2020 shared task 0: Typo- pages 195–204, Barcelona, Spain (Online). Associa-
logically diverse morphological inflection. In Pro- tion for Computational Linguistics.
ceedings of the 17th SIGMORPHON Workshop on
Computational Research in Phonetics, Phonology, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
and Morphology, pages 177–183, Online. Associa- Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
tion for Computational Linguistics. Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
Claus Schönig. 1999. The internal division of modern cessing Systems, volume 30.
Turkic and its historical implications. Acta Orien-
talia Academiae Scientiarum Hungaricae, pages 63– A. P. Volodin. 1976. The Itelmen language.
95. Prosveshchenie, Leningrad.
Ekaterina Vylomova, Jennifer White, Eliza- European Chapter of the Association for Computa-
beth Salesky, Sabrina J. Mielke, Shijie Wu, tional Linguistics: Main Volume, pages 1901–1907,
Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Online. Association for Computational Linguistics.
Zmigrod, Josef Valvoda, Svetlana Toldova, Francis
Tyers, Elena Klyachko, Ilya Yegorov, Natalia Nina Zaytseva, Andrew Krizhanovsky, Natalia
Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Krizhanovsky, Natalia Pellinen, and Aleksndra Ro-
Andrew Krizhanovsky, Tiago Pimentel, Lucas dionova. 2017. Open corpus of Veps and Karelian
Torroba Hennigen, Christo Kirov, Garrett Nicolai, languages (VepKar): preliminary data collection
Adina Williams, Antonios Anastasopoulos, Hilaria and dictionaries. In Corpus Linguistics-2017, pages
Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka 172–177.
Silfverberg, and Mans Hulden. 2020. SIGMOR-
PHON 2020 shared task 0: Typologically diverse He Zhou, Juyeon Chung, Sandra Kübler, and Francis
morphological inflection. In Proceedings of the Tyers. 2020. Universal Dependency treebank for
17th SIGMORPHON Workshop on Computational Xibe. In Proceedings of the Fourth Workshop on
Research in Phonetics, Phonology, and Morphology, Universal Dependencies (UDW 2020), pages 205–
pages 1–39, Online. Association for Computational 215, Barcelona, Spain (Online). Association for
Linguistics. Computational Linguistics.
Jonathan North Washington, Aziyana Bayyr-ool, Esaú Zumaeta Rojas and Gerardo Anton Zerdin. 2018.
Aelita Salchak, and Francis M Tyers. 2016. De- Guía teórica del idioma asháninka. Lima: Universi-
velopment of a finite-state model for morphological dad Católica Sedes Sapientiae.
processing of Tuvan. Rodnoy Yazyk, 1:156–187.
Н. А. Баскаков. 1969. Введение в изучение
Jonathan North Washington, Ilnar Salimzianov, Fran- тюркских языков [N. A. Baskakov. An intro-
cis M. Tyers, Memduh Gökırmak, Sardana Ivanova, duction to Turkic language studies]. Москва:
and Oğuzhan Kuyrukçu. to appear in 2021. Высшая школа.
Free/open-source technologies for Turkic languages
developed in the Apertium project. In Proceedings Ф. Г. Исхаков and А. А. Пальмбах. 1961.
of the International Conference on Turkic Language Грамматика тувинского языка: Фонетика и
Processing (TURKLANG 2019). морфология [F. G. Iskhakov and A. A. Pal’mbakh.
A grammar of Tuvan: Phonetics and morphology].
Jonathan North Washington and Francis Morton Tyers. Москва: Наука.
2019. Delineating Turkic non-finite verb forms by
syntactic function. In Proceedings of the Workshop Е. И. Убрятова, Е. И. Коркина, Л. Н. Харитонов,
on Turkic and Languages in Contact with Turkic, and Н. Е. Петров, editors. 1982. Грамматика
volume 4, pages 115–129. современного якутского литературного
языка: Фонетика и морфология [E. I. Ubrya-
Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. tova et al. Grammar of the modern Yakut literary
McCarthy, Eleanor Chodroff, and Ryan Cotterell. language: Phonetics and morphology]. Москва:
2020. Predicting declension class from form and Наука.
meaning. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 6682–6695, Online. Association for Computa-
tional Linguistics.
Mary Ruth Wise. 2002. Applicative affixes in Pe-
ruvian Amazonian languages. Current Studies on
South American Languages [Indigenous Languages
of Latin America, 3], pages 329–344.
Marcin Woliński and Witold Kieraś. 2016. The on-
line version of grammatical dictionary of Polish. In
Proceedings of the Tenth International Conference
on Language Resources and Evaluation (LREC’16),
pages 2589–2594, Portorož, Slovenia. European
Language Resources Association (ELRA).
Marcin Woliński, Zygmunt Saloni, Robert Wołosz,
Włodzimierz Gruszczyński, Danuta Skowrońska,
and Zbigniew Bronk. 2020. Słownik gramatyczny
j˛ezyka polskiego, 4th edition. Warsaw. http://
sgjp.pl.
A Data conversion into UniMorph
Apertium tag UniMorph tag Apertium tag UniMorph tag Apertium tag UniMorph tag
<p1> 1 <imp> IMP <px1sg> PSS1S
<p2> 2 <ins> INS <px2pl> PSS2P
<p3> 3 <iter> ITER <px2sg> PSS2S
<abl> ABL <loc> LOC <px3pl> PSS3P
<acc> ACC <n> N <px3sg> PSS3S
<all> ALL <neg> NEG <px3sp> PSS3S/PSS3P
<com> COM <nom> NOM <pii> PST;IPFV
<comp> COMPV <aor> NPST <ifi> PST;LGSPEC1
<dat> DAT <nec> OBLIG <past> PST;LGSPEC2
<ded> DED <pl> PL <sg> SG
<du> DU <perf> PRF <v> V
<fut> FUT <resu> PRF;LGSPEC3 <gna_cond> V.CVB;COND
<gen> GEN <par> PRT <prc_cond> V.CVB;COND
<hab> HAB <px1pl> PSS1P
Table 8: Apertium tag mapping to the UniMorph schema for Sakha and Tuvan. For the definitions of the Apertium tags, see Washington et al. (2016). This mapping alone is not sufficient to reconstruct the UniMorph annotation, since some conditional rules are applied on top of this conversion (see §3.8.1).
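To make the shape of this conversion concrete, the sketch below applies a handful of entries from Table 8 and one conditional post-processing step. The dictionary entries are taken from the table, but the conditional rule shown is a hypothetical placeholder for the kind of rules described in §3.8.1, not one of the actual rules.

# A few entries from Table 8 (Apertium tag -> UniMorph tag)
APERTIUM_TO_UNIMORPH = {
    "p1": "1", "p2": "2", "p3": "3",
    "acc": "ACC", "dat": "DAT", "gen": "GEN", "loc": "LOC",
    "pl": "PL", "sg": "SG", "n": "N", "v": "V",
    "aor": "NPST", "fut": "FUT", "pii": "PST;IPFV",
    "px3sp": "PSS3S/PSS3P",
}

def convert(apertium_tags):
    # Map an Apertium analysis (e.g. ["n", "pl", "acc"]) to a UniMorph tag string.
    unimorph = [APERTIUM_TO_UNIMORPH[t] for t in apertium_tags if t in APERTIUM_TO_UNIMORPH]
    # Hypothetical conditional rule, included only to illustrate why the table alone
    # does not fully determine the annotation: resolve the ambiguous possessive tag.
    unimorph = [("PSS3S" if t == "PSS3S/PSS3P" else t) for t in unimorph]
    return ";".join(unimorph)

print(convert(["n", "pl", "acc"]))   # N;PL;ACC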
Table 9: A simplified mapping from MorphInd tags to the UniMorph schema for Indonesian data. We follow MorphInd's three-level annotations for the mapping.

Table 10: A simplified mapping from the original flexemic tagset of Polish used in Polish morphological analysers and corpora annotations (Przepiórkowski and Woliński, 2003) to the UniMorph schema. The mapping contains most of the POS and feature labels but does not allow the full conversion of the original data to be reconstructed, as some mappings are conditional.
Xibe Universal Dependencies feature / word transliteration    UniMorph    Additional rules
ADJ ADJ
ADP ADP
ADV ADV
AUX AUX
CCONJ CONJ
DET DET
INTJ INTJ
NOUN N
NUM NUM
PART PART
PRON PRO
PROPN PROPN
PUNCT _ excluding punctuation marks
SCONJ CONJ
SYM _ excluding symbols
VERB depends on other properties
X
_
Abbr=Yes _
Aspect=Imp IPFV
Aspect=Perf PFV seems to be closer to PFV than to PRF
Aspect=Prog PROG
Case=Abl ABL not for adpositions
Case=Acc ACC not for adpositions
Case=Cmp COMPV not for adpositions
Case=Com COM not for adpositions
Case=Dat DAT not for adpositions
Case=Gen GEN not for adpositions
Case=Ins INSTR not for adpositions
Case=Lat ALL not for adpositions
Case=Loc IN not for adpositions
Clusivity=Ex EXCL
Clusivity=In INCL
Degree=Cmp CMPR
Degree=Pos _
Foreign=Yes _
Mood=Cnd CMD=COND for finite forms only
Mood=Imp IMP
Mood=Ind IND
Mood=Sub SBJV
NumType=Card _
NumType=Mult POS=ADV
NumType=Ord POS=ADJ
NumType=Sets POS=ADJ
Number=Plur PL
Number=Sing SG
Person=1 1
Person=2 2
Person=3 3
Polarity=Neg NEG not for the negative auxiliary
Polite=Elev _
Poss=Yes CMD=PSS
PronType=Dem CMD=DEIXIS
PronType=Ind _
PronType=Int _
PronType=Prs _
PronType=Tot _
Reflex=Yes _
Tense=Fut FUT
Tense=Past PST
Tense=Pres PRS
Typo=Yes _ not including typos into the resulting table
Table 11: Simplified mapping for the Xibe Universal Dependencies corpus (Pt. 1)
Xibe Universal Dependencies feature / word transliteration    UniMorph    Additional rules
VerbForm=Conv POS=V.CVB
VerbForm=Fin FIN
VerbForm=Inf NFIN
VerbForm=Part POS=V.PTCP
VerbForm=Vnoun POS=V.MSDR
Voice=Act ACT
Voice=Cau CAUS
Voice=Pass PASS
Voice=Rcp RECP
ateke _
dari _ means ‘each, every’
eiten _ means ‘each, every’
enteke _ means ‘like this’
ere PROX
erebe PROX
ereci PROX
eremu PROX
geren _ means ‘all’
harangga _
tenteke _ means ‘like that’
terali _ means ‘like that’
teralingge _ means ‘like that’
tere REMT
terebe REMT
terei REMT
tesu REMT
tuba _ means ‘there’
tuttu _ means ‘like that’
uba _ means ‘here’
ubaci _ means ‘here’
ubai _ means ‘here’
udu _ means ‘some’
uttu _ means ‘like this’
Table 12: Simplified mapping for the Xibe Universal Dependencies corpus (Pt. 2)
B Accuracy trends TRM+ CHR-TRM
L BME GUClasp TRM CHR-TRM
AUG +AUG
afb 92.42 79.83 95.13 95.13 95.43 95.43
TRM+ CHR-TRM amh 98.36 91.56 99.72 99.72 99.72 99.72
L BME GUClasp TRM CHR-TRM
AUG +AUG ara 99.79 88.63 99.88 99.88 99.87 99.87
afb 94.77 90.26 95.24 95.24 95.84 95.84 arz 93.31 78.08 95.16 95.16 94.98 94.98
amh 89.67 87.09 94.83 94.83 94.83 94.83 heb 98.41 92.95 99.65 99.65 99.75 99.75
ara 99.87 98.34 99.93 99.93 99.93 99.93 syc 11.02 4.41 25.73 21.32 27.94 25.00
arz 95.65 91.39 97.31 97.31 97.07 97.07 ame 81.30 57.82 84.78 86.52 86.08 83.47
syc 10.71 7.14 10.71 17.85 14.28 14.28 cni 98.73 79.51 99.72 99.72 99.63 99.63
ind 80.15 69.26 85.60 85.60 84.43 84.43 ind 83.01 52.41 84.47 84.47 84.36 84.36
kod 100.00 90.90 100.00 100.00 90.90 100.00 kod 93.65 92.06 100.00 98.41 100.00 100.00
ckt 50.00 50.00 50.00 50.00 50.00 50.00 aym 99.98 99.99 100.00 100.00 100.00 100.00
itl 50.00 58.33 66.66 58.33 66.66 58.33 ckt 25.00 31.25 18.75 37.50 18.75 43.75
L    BME     GUClasp  TRM      TRM+AUG  CHR-TRM  CHR-TRM+AUG
bra  68.75   65.62    50.00    71.87    65.62    56.25
bul  99.73   96.85    100.00   100.00   99.96    99.96
ces  99.49   97.74    99.50    99.50    99.52    99.52
mag  69.23   84.61    76.92    92.30    92.30    84.61
nld  97.29   96.38    97.85    97.85    97.80    97.80
pol  99.91   99.67    99.95    99.95    100.00   100.00
rus  99.81   98.88    99.44    99.44    99.44    99.44
ail  12.50   12.50    0        12.50    12.50    12.50
evn  73.52   74.50    78.43    76.47    72.54    77.45
krl  100.00  90.69    93.02    93.02    93.02    93.02
olo  99.80   98.05    99.92    99.92    99.76    99.76
vep  99.85   97.86    99.82    99.82    99.88    99.88
sjo  66.66   66.66    66.66    100.00   100.00   100.00
tur  97.78   97.41    100.00   100.00   100.00   100.00

Table 13: Accuracy for “Adjective” on the test data.

L    BME     GUClasp  TRM      TRM+AUG  CHR-TRM  CHR-TRM+AUG
syc  65.21   6.52     84.78    82.60    86.95    80.43
bul  99.60   97.90    100.00   100.00   100.00   100.00
ces  100.00  98.40    99.90    99.90    100.00   100.00
ckb  97.91   95.83    100.00   100.00   100.00   100.00
deu  94.89   91.72    95.91    95.91    96.52    96.52
kmr  100.00  100.00   100.00   100.00   100.00   100.00
nld  89.17   78.58    94.58    94.58    96.00    96.00
pol  100.00  99.92    100.00   100.00   100.00   100.00
por  99.83   98.96    99.67    99.67    99.78    99.78
rus  97.04   92.06    96.58    96.58    96.67    96.67
spa  99.93   99.03    99.35    99.35    99.29    99.29
evn  12.76   7.09     17.73    18.43    14.89    19.14
krl  100.00  98.18    100.00   100.00   100.00   100.00
olo  99.69   97.83    99.38    99.38    99.69    99.69
vep  99.02   96.67    99.21    99.21    99.21    99.21
sjo  22.22   22.22    27.77    55.55    55.55    50.00

Table 14: Accuracy for “Participle” on the test data.

L    BME     GUClasp  TRM      TRM+AUG  CHR-TRM  CHR-TRM+AUG
amh  98.67   95.78    99.75    99.75    99.87    99.87
itl  19.04   13.09    25.00    20.23    21.42    21.42
aym  99.97   99.95    99.96    99.96    99.96    99.96
bul  100.00  98.57    100.00   100.00   100.00   100.00
ces  98.97   95.47    100.00   100.00   99.38    99.38
pol  99.22   99.22    100.00   100.00   100.00   100.00
rus  99.21   97.49    97.96    97.96    99.68    99.68
spa  98.74   98.23    99.24    99.24    100.00   100.00
evn  23.38   16.12    25.00    32.25    27.41    33.06
sah  100.00  100.00   100.00   100.00   100.00   100.00
tyv  100.00  100.00   100.00   100.00   100.00   100.00
sjo  54.54   9.09     54.54    54.54    72.72    45.45

Table 15: Accuracy for “Converb” on the test data.

L    BME     GUClasp  TRM      TRM+AUG  CHR-TRM  CHR-TRM+AUG
amh  88.61   79.67    95.12    95.12    97.56    97.56
heb  79.53   73.68    83.62    83.62    83.04    83.04
aym  100.00  100.00   100.00   100.00   100.00   100.00
itl  33.33   33.33    33.33    50.00    33.33    33.33
bul  98.85   98.00    100.00   100.00   100.00   100.00
kmr  99.37   100.00   98.74    98.74    100.00   100.00
pol  99.96   99.96    100.00   100.00   100.00   100.00
sjo  46.15   0        46.15    38.46    46.15    30.76

Table 16: Accuracy for “Masdar” on the test data.

L    BME     GUClasp  TRM      TRM+AUG  CHR-TRM  CHR-TRM+AUG
itl  14.96   12.92    20.40    25.17    21.08    23.12
gup  14.75   21.31    59.01    63.93    55.73    60.65
bra  31.30   29.56    24.34    27.82    27.82    30.43
bul  99.51   98.36    99.86    99.86    99.89    99.89
ces  98.88   94.97    99.54    99.54    99.40    99.40
ckb  99.72   96.44    99.96    99.96    99.96    99.96
deu  99.39   94.55    99.75    99.75    99.73    99.73
kmr  98.20   96.97    100.00   100.00   99.67    99.67
mag  36.36   38.63    42.04    42.04    44.31    39.77
nld  99.53   94.99    99.86    99.86    99.86    99.86
pol  99.57   98.22    99.76    99.76    99.74    99.74
por  99.84   99.12    99.91    99.91    99.86    99.86
rus  99.25   90.31    97.09    97.09    97.38    97.38
spa  99.82   97.55    99.90    99.90    99.92    99.92
see  78.27   40.97    90.65    89.64    90.00    88.63
ail  5.69    6.73     10.88    8.80     9.32     10.36
evn  34.70   32.03    44.90    44.66    44.90    46.35
sah  99.83   98.98    99.61    99.61    99.83    99.83
tyv  99.94   99.50    99.91    99.91    99.95    99.95
krl  99.94   98.82    99.94    99.94    99.94    99.94
lud  56.25   56.25    0        50.00    6.25     50.00
olo  99.84   99.14    99.71    99.71    99.70    99.70
vep  99.71   97.50    99.60    99.60    99.65    99.65
sjo  18.51   3.70     29.62    33.33    29.62    25.92
tur  99.96   99.98    99.17    99.17    99.17    99.17

Table 17: Accuracy for “Verb” in each language on the test data.

L    BME     GUClasp  TRM      TRM+AUG  CHR-TRM  CHR-TRM+AUG
afb  91.00   81.87    94.00    94.00    92.95    92.95
amh  98.46   96.30    99.24    99.24    99.31    99.31
ara  99.60   94.76    99.44    99.44    99.56    99.56
arz  96.58   91.66    97.56    97.56    97.23    97.23
heb  94.23   70.37    98.39    98.39    98.92    98.92
syc  20.00   18.57    32.85    34.28    32.14    32.85
ame  82.99   55.06    88.66    88.46    87.65    87.44
cni  99.79   98.62    99.96    99.96    99.96    99.96
ind  78.93   57.69    81.81    81.81    81.38    81.38
kod  94.73   84.21    100.00   100.00   100.00   100.00
ckt  60.00   70.00    30.00    70.00    35.00    70.00
itl  64.22   66.97    71.55    71.55    72.47    72.47
bra  75.60   74.39    74.39    79.87    80.48    78.04
bul  95.27   89.54    97.92    97.92    97.47    97.47
ces  95.49   88.60    95.56    95.56    95.60    95.60
ckb  97.44   98.01    99.71    99.71    100.00   100.00
deu  96.93   89.64    95.48    95.48    95.51    95.51
kmr  98.20   98.12    97.91    97.91    97.91    97.91
mag  91.60   92.30    81.81    91.60    85.31    92.30
pol  96.30   89.71    97.07    97.07    97.35    97.35
rus  95.80   92.91    96.24    96.24    96.03    96.03
ail  9.67    4.83     17.74    20.96    14.51    20.96
evn  71.30   74.23    75.48    75.34    76.88    76.18
sah  99.97   99.78    99.97    99.97    99.99    99.99
tyv  99.98   99.93    99.96    99.96    99.97    99.97
krl  93.48   68.83    95.81    95.81    95.81    95.81
lud  61.90   61.90    28.57    42.85    42.85    42.85
olo  99.54   96.99    99.57    99.57    99.58    99.58
vep  99.72   96.50    99.66    99.66    99.70    99.70
sjo  58.33   41.66    25.00    58.33    25.00    66.66
tur  99.83   98.49    99.73    99.73    99.71    99.71
vro  94.78   87.39    97.82    98.26    97.82    97.39

Table 18: Accuracy for “Noun” in each language on the test data.
184
Training Strategies for Neural Multilingual Morphological Inflection
185
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 185–192
August 5, 2021. ©2021 Association for Computational Linguistics
previous output or the gold as input when decoding.

2 Data

The data released cover 38 languages of varying typology, genealogy, grammatical features, scripts, and morphological processes. The data for the different languages vary greatly in size, from 138 examples (Ludic) to 100310 (Turkish). For the low-resourced languages1 we extend the original dataset with hallucinated data (Anastasopoulos and Neubig, 2019) to train on.

With respect to the work of Anastasopoulos and Neubig (2019), we make the following changes. We identify all subsequences of length 3 or more that overlap in the lemma and inflection. We then randomly sample one of them, denoted R, as the sequence to be replaced. For each language, we compile a set C_L containing all (1,2,3,4)-grams in the language. We construct a string G to replace R with by uniformly sampling n-grams from C_L and concatenating them, G = cat(g_0, ..., g_m), until we have a sequence whose length satisfies |R| − 2 ≤ |G| ≤ |R| + 2.

Additionally, we do not consider subsequences which include a phonological symbol.2 A schematic of the hallucination process is shown in Figure 1.

[Figure 1 panels: Inflection: h e ː k i ː ŋ i t i n — Sample random (1,2,3)-grams — New lemma: r a ː t k ː — New inflection: r a ː t k ː ŋ i t i n]
Figure 1: An example of the data hallucination process. The sequence R = ki is replaced by G = tk.

Sampling n-grams instead of individual characters allows us to retain some of the orthographic information present in the language. We generate a set of 10 000 hallucinated examples for each of the low-resource languages.

1 We consider languages with fewer than 10 000 training examples as low-resource in this paper.
2 Thus in Figure 1 a subsequence of length 2 is selected as the sequence to be replaced, since the larger subsequences would include the phonological symbol ː.
t=1
Sampling n-grams instead of individual charac- att = W[attcb ; attadd ]
ters allow us to retain some of the orthographical
In addition to combining content-based attention
information present in the language. We generate a
and additive attention we also employ regulariza-
set of 10 000 hallucinated examples for each of the
tion on the attention modules such that for each
low-resource languages.
decoding step, the attention is encouraged to dis-
1
We consider languages with less than 10 000 training tribute the attention weights a uniformly across
examples as low-resource in this paper.
2 3
Thus in Figure 1 a subsequence of length 2 is selected Our code is available here:
as the sequence to be replaced, since the larger subsequences https://github.com/adamlek/
would include the phonological symbol : multilingual-morphological-inflection/
186
the lemma and tag hidden states (Anastasopoulos and Neubig, 2019; Cohn et al., 2016). We employ additive attention for the tags.

In each decoding step, we pass the gold or predicted character embedding to the decoding LSTM. We then take the output as the key and calculate attention over the lemma and tags. This representation is then passed to a two-layer perceptron with ReLU activations.

3.2 Multi-task learning

Instead of predicting the characters in the inflected form, one can also predict the Levenshtein operations needed to transform the lemma into the inflected form, as shown by Makarov et al. (2017). A benefit of considering operations instead of characters is that the script used is less of a factor: by considering only the operations, we abstract away from the script. We find that making both predictions, as a multi-task setup, improves the performance of the system.

The multi-task setup operates on the character level: for each contextual representation of a character we want to predict an operation among deletion (del), addition/insertion (add), substitution (sub) and copy (cp). Because add and del change the length, we predict two sets of operations, the lemma-reductions and the lemma-additions. To illustrate, the Levenshtein operations for the word pair (valatas, ei valate) in Veps (a Uralic language related to Finnish) are shown in Figure 2.

Inflection:  e i v a l a t e
Operations:  add add add cp cp cp cp cp del sub
Lemma:       v a l a t a s

Figure 2: Levenshtein operations mapped to characters in the lemma and inflection.

In our setup, the task of lemma-reductions is performed by predicting the cp, del, and sub operations based on the encoded hidden states of the lemma. The task of lemma-additions is then performed by predicting the cp, add, and sub operations on the characters generated by the decoder. We use a single two-layer perceptron with ReLU activation to predict both lemma-reductions and lemma-additions.4

4 In the future, we would like to experiment with including the representations of tags in the input to the operation classifier.
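The two operation sequences can be derived from a character alignment of the lemma and the inflected form. The sketch below uses Python's difflib for that alignment; the choice of difflib and the label names as strings are assumptions for illustration (the paper does not state which alignment tool is used).

```python
from difflib import SequenceMatcher

def levenshtein_labels(lemma, inflection):
    """Two label sequences for the multi-task heads:
    one op per lemma character (cp/del/sub, "lemma-reductions") and
    one op per inflection character (cp/add/sub, "lemma-additions")."""
    reductions, additions = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, lemma, inflection).get_opcodes():
        if tag == "equal":
            reductions += ["cp"] * (i2 - i1)
            additions += ["cp"] * (j2 - j1)
        elif tag == "delete":
            reductions += ["del"] * (i2 - i1)
        elif tag == "insert":
            additions += ["add"] * (j2 - j1)
        elif tag == "replace":
            reductions += ["sub"] * (i2 - i1)
            additions += ["sub"] * (j2 - j1)
    return reductions, additions

# e.g. levenshtein_labels("valatas", "ei valate") gives one label per character
```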
3.3 Curriculum Learning

We employ a competence-based curriculum learning strategy (Liu et al., 2020; Platanios et al., 2019). A competence-based curriculum constructs a learning schedule based on the competence of the model, presenting examples which the model is deemed able to handle. The goal of this strategy is for the model to transfer or apply the information it acquires from the easy examples to the hard examples.

To estimate an initial difficulty for an example we consider the character unigram log probability of the lemma and inflection. For a word (either the lemma or inflection) w = c_0, ..., c_K, the unigram log probability is given by:

log(P_U(w)) = Σ_{k=0}^{K} log(p(c_k))    (1)

To get a score for a lemma and inflection pair (henceforth (x, y)), we calculate it as the sum of the log probabilities of x and y:

score(x, y) = P_U(x) + P_U(y)    (2)

Note that here we do not normalize by the length of the inflection and lemma. This is because an additional factor in how difficult an example should be considered is its length, i.e. longer words are harder to model. We then sort the examples and use a cumulative density function (CDF) to map the unigram probabilities to a score in the range (0, 1]. We denote the training set of pairs and their scores, ((x, y), s)_0, ..., ((x, y), s)_m, where m indicates the number of examples in the dataset, as D.

To select appropriate training examples from D we must estimate the competence c of our model. The competence of the model is estimated as a function of the number of training steps t taken, where T is the number of steps after which the model is considered fully competent:

c(t) = min(1, sqrt(t · (1 − c(1)²) / T + c(1)²))    (3)

During training, we employ a probabilistic approach to constructing batches from our corpus: we uniformly draw samples ((x, y), s) from the training set D such that the score s is lower than the model competence c(t). This ensures that for each training step, we only consider examples that the model can handle according to our curriculum schedule.
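A sketch of the initial difficulty scoring of Equations 1–2 and its CDF mapping follows. The function names, and the convention that pairs made of frequent characters receive low (easy) scores, are assumptions made for illustration.

```python
import math
from collections import Counter

def difficulty_scores(pairs):
    """pairs: list of (lemma, inflection) tuples.
    Maps each pair's summed character-unigram log probability (Eqs. 1-2)
    to a score in (0, 1] via the empirical CDF; the assumed convention is
    that pairs of frequent characters (easy) receive low scores."""
    counts = Counter(c for x, y in pairs for c in x + y)
    total = sum(counts.values())
    logp = {c: math.log(n / total) for c, n in counts.items()}

    def pair_logp(x, y):                      # log P_U(x) + log P_U(y), Eq. 2
        return sum(logp[c] for c in x + y)

    ranked = sorted(pairs, key=lambda p: pair_logp(*p), reverse=True)   # easy first
    m = len(pairs)
    return [(p, (i + 1) / m) for i, p in enumerate(ranked)]             # scores in (0, 1]
```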
187
However, the unigram probability alone does not ensure that an example is easy: an example may contain frequent characters but also involve rare morphological processes (or rare combinations of Levenshtein operations). To account for this, we recompute the example scores at each training step. We sort the examples in each training step according to the decoding loss, then assign a new score to the examples in the range (0, 1] using a CDF.

We also have to take into account that as the model competence grows, “easy” (low loss or high unigram probability) examples will be included more often in the batches. To ensure that the model learns more from examples whose difficulty is close to its competence, we compute a weight for each example in the batch by dividing its score s by the model competence at the current time-step and scaling the loss accordingly:

weighted_loss(x, y) = loss(x, y) × score(x, y) / c(t)    (4)
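Putting Equations 3 and 4 together, batch construction can be sketched as follows. The initial competence value c(1) = 0.01 and the function names are placeholders, and the sketch assumes that T in Equation 3 is the 60 000 "steps for full competence" listed in Table 1.

```python
import math
import random

def competence(t, c0=0.01, full_steps=60000):
    """Eq. 3: square-root competence schedule that reaches 1 at full_steps."""
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / full_steps + c0 ** 2))

def sample_batch(scored_data, t, batch_size=256):
    """Uniformly draw examples whose current score is below the competence
    and weight each example's loss by score / competence (Eq. 4).
    scored_data: list of ((lemma, inflection), score) pairs."""
    c = competence(t)
    eligible = [(pair, s) for pair, s in scored_data if s <= c]
    batch = random.sample(eligible, min(batch_size, len(eligible)))
    # return each example with its loss weight s / c(t)
    return [(pair, s / c) for pair, s in batch]
```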
Because the value of our model competence is tied to a specific number of training steps, we develop a probabilistic strategy for sampling batches once the model has reached full competence. When the model reaches full competence, we construct language weights by dividing the number of examples in a language by the total number of examples in the dataset and taking the inverse distribution as the language weights. Thus for each language we get a value in the range (0, 1], where low-resource languages receive a higher weight. To construct a batch we continue sampling examples, but now we only add an example if r ∼ ρ, where ρ is a uniform Bernoulli distribution, is less than the language weight of the example. This strategy allows us to continue training our model after reaching full competence without neglecting the low-resource languages.

In total we train the model for 240 000 training steps, and consider the model to be fully competent after 60 000 training steps.
3.4 Scheduled Sampling

Commonly, when training an encoder-decoder RNN model, the input at time-step t is not the output from the decoder at t − 1, but rather the gold data. It has been shown that models trained with this strategy may suffer at inference time: they have never been exposed to a partially incorrect input during the training phase. We address this issue using scheduled sampling (Bengio et al., 2015).

We implement a simple schedule for calculating the probability of using the gold characters or the model's prediction, using a global sample probability ρ which is updated at each training step. We start with a probability ρ of 100% of taking the gold. At each training step, we decrease ρ by 1/total_steps. For each character, we take a sample from the Bernoulli distribution with parameter ρ to determine which input to use.
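A sketch of this schedule (the function name and the clamping at zero are illustrative assumptions):

```python
import random

def take_gold(step, total_steps):
    """Scheduled sampling: the probability rho of feeding the gold character
    starts at 1.0 and decreases by 1/total_steps at every training step;
    each character is then decided by a Bernoulli draw with parameter rho."""
    rho = max(0.0, 1.0 - step / total_steps)
    return random.random() < rho
```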
3.5 Training

We use cross-entropy loss for the character generation task and for the operation prediction tasks. Our final loss function is the sum of the character generation loss, the lemma-reduction loss, and the lemma-addition loss. We use a cosine annealing learning rate scheduler (Loshchilov and Hutter, 2017), gradually decreasing the learning rate. The hyperparameters used for training are presented in Table 1.

HYPERPARAMETER              VALUE
Batch size                  256
Embedding dim               128
Hidden dim                  256
Training steps              240000
Steps for full competence   60000
Initial LR                  0.001
Min LR                      0.0000001
Smoothing-α                 2.5%

Table 1: Hyperparameters used. As we use a probabilistic approach to training, we report the number of training steps rather than epochs. In total, the number of training steps we take corresponds to about 35 epochs.

Language-wise label smoothing. We use language-wise label smoothing to calculate the loss. This means that we remove a constant α from the probability of the correct character and distribute the same α uniformly across the probabilities of the characters belonging to the language of the word. The motivation for doing label smoothing this way is that not all incorrect character predictions are equally incorrect. For example, when predicting the inflected form of a Modern Standard Arabic (ara) word, it is more correct to select any character from the Arabic alphabet than a character from
188
the Latin or Cyrillic alphabet. A difficulty is that each language potentially uses a different set of characters. We calculate this set using the training set only, so it is important that α is not too large, so that the difference between characters seen in the training set and those not seen does not become too big. Indeed, if it did, the model might completely exclude unseen characters from its test-time predictions. (We found that α = 2.5% is a good value.)
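A sketch of how such smoothed targets could be assembled is given below; the function name, the per-example list of language character indices, and the way the distribution is built are assumptions for illustration. The loss is then the cross-entropy between these soft targets and the model's log-probabilities.

```python
import torch

def language_smoothed_targets(gold_ids, lang_char_ids, vocab_size, alpha=0.025):
    """Soft targets for language-wise label smoothing.
    gold_ids:      (batch,) gold character indices
    lang_char_ids: per-example list of character indices seen for that language"""
    batch = gold_ids.size(0)
    targets = torch.zeros(batch, vocab_size)
    for i in range(batch):
        chars = lang_char_ids[i]
        targets[i, chars] = alpha / len(chars)      # spread alpha over the language's characters
        targets[i, gold_ids[i]] += 1.0 - alpha      # remaining mass on the gold character
    return targets
```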
4 Results

The results of our system using the four training strategies presented earlier are shown in Table 2. Each language is evaluated with two metrics, exact match and average Levenshtein distance. The average Levenshtein distance is the average number of operations required to transform the system's guess into the gold inflected form. One challenging aspect of this dataset for our model is balancing the information the model learns about low- and high-resource languages. We plot the accuracy the model achieved against the data available for each language in Figure 3.
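For concreteness, the Levenshtein distance used in this metric is the standard edit distance, which can be computed as in the generic sketch below (averaging it over all prediction/gold pairs gives the LEV figures); the implementation is not taken from the authors' code.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / copy
        prev = curr
    return prev[-1]

# averaging over (prediction, gold) pairs gives the LEV column in Table 2
```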
We note that for all languages with roughly more than 30 000 examples our model performs well, achieving around 98% accuracy. When we consider languages that have around 10 000 natural examples and no hallucinated data, the accuracy drops closer to around 50%. For the languages with hallucinated data we would expect this trend to continue, as the data is synthetic and does not take orthographic information into account the way natural language examples do. That is, when constructing hallucinated examples, orthography is taken into account only indirectly, because we consider n-grams instead of characters when finding the replacement sequence. However, we find that for many of the languages with hallucinated data the exact match accuracy is above 50%, though it varies a lot depending on the language.

[Figure 3: per-language plot; left axis 0 to 1·10^5 examples, right axis 0 to 100 accuracy, one entry per language code.]
Figure 3: Number of examples (green indicates natural and blue hallucinated examples, left y-axis) plotted against the exact match accuracy (right y-axis) of our system on the development data (blue) and the test data (red).

LANG   Test EM   Test LEV   Dev EM   Dev LEV
afb    90.29     0.17       91.29    0.15
ail    6.84      3.6        7.69     3.62
ame    70.72     0.75       73.67    0.64
amh    97.44     0.04       96.87    0.04
ara    98.69     0.03       98.59    0.04
arz    91.65     0.14       92.48    0.14
aym    99.8      0.0        99.75    0.01
bra    62.38     0.71       64.05    0.59
bul    98.16     0.03       98.02    0.03
ces    93.41     0.12       94.01    0.12
ckb    68.27     0.77       68.91    0.73
ckt    60.53     1.37       55.56    1.72
cni    91.99     0.11       91.38    0.12
deu    93.95     0.09       93.28    0.1
evn    51.7      1.47       50.41    1.5
gup    22.95     3.92       26.67    3.1
heb    95.86     0.09       94.97    0.1
ind    62.32     1.3        60.28    1.33
itl    22.63     2.89       22.16    3.11
kmr    98.19     0.02       98.32    0.02
kod    79.57     0.58       80.43    0.37
krl    97.62     0.05       97.83    0.04
lud    62.16     0.73       66.67    0.44
mag    70.2      0.53       63.64    0.71
nld    92.51     0.12       92.31    0.12
olo    99.39     0.01       99.36    0.01
pol    97.34     0.04       97.51    0.04
por    98.41     0.03       98.3     0.03
rus    97.02     0.05       96.8     0.05
sah    99.86     0.0        99.86    0.0
see    49.77     1.7        50.37    1.51
sjo    29.76     1.73       32.43    1.83
spa    98.66     0.02       98.65    0.02
syc    9.43      5.25       15.7     5.41
tur    98.69     0.03       98.86    0.03
tyv    98.61     0.03       98.51    0.03
vep    99.26     0.01       99.3     0.01
vro    82.17     0.31       80.7     0.42

Table 2: Results on the test and development data.
189
Two of the worst-performing languages for our model are Classical Syriac (syc) and Xibe (sjo). An issue with Classical Syriac is that the language uses a unique script, the Syriac abjad, which makes it difficult for the model to transfer information about operations and common character combinations/transformations into Classical Syriac from related languages such as Modern Standard Arabic (spoken in the region). For Xibe there is a similar story: it uses the Sibe alphabet, a variant of the Manchu script, which does not occur elsewhere in our dataset.

5 Language similarities

Our model processes many languages simultaneously, so it would be encouraging if the model were also able to find similarities between languages. To explore this, we investigate whether the language embeddings learned by the model produce clusters of language families. A t-SNE plot of the language embeddings is shown in Figure 4.

[Figure 4: two-dimensional t-SNE scatter of the 38 language codes.]
Figure 4: t-SNE plot of the language embeddings. Different colors indicate different language families.

The plot shows that the model can find some family resemblances between languages. For example, we have a Uralic cluster consisting of the languages Veps (vep), Olonets (olo), and Karelian (krl), which are all spoken in a region around Russia and Finland. However, Ludic (lud) and Võro (vro) are not captured in this cluster, even though they are spoken in the same region.

We can also see that the model seems to separate language families somewhat depending on the script used. The Afro-Asiatic languages are split into two smaller clusters, one containing the languages that use the Arabic script (ara, afb and arz) and one containing those that use the Amharic and Hebrew scripts (amh, heb). As mentioned earlier, Classical Syriac uses yet another script and consequently appears in another part of the map. In general, our model's language embeddings appear to learn some relationships between languages, but certainly not all of them. However, the fact that we find some patterns is encouraging for future work.

6 Scheduled Sampling

We note that during development all of our training strategies improved performance on the task, except one: scheduled sampling. We hypothesize that this is because the low-resource languages benefit from using the gold as input when predicting the next character, while high-resource languages do not need this as much. The model has seen more examples from high-resource languages and thus can model them better, which makes using the previous hidden state more reliable as input when predicting the next token. Indeed, scheduled sampling degrades the overall performance by 3.04 percentage points; without it, our total average accuracy increases to 83.3%, with low-resource languages primarily affected.

7 Conclusions and Future Work

We have presented a single multilingual model for morphological inflection in 38 languages, enhanced with different training strategies: curriculum learning, multi-task learning, scheduled sampling and language-wise label smoothing. The results indicate that our model to some extent captures similarities between the input languages; however, languages that use different scripts appear problematic. A solution to this would be to employ transliteration (Murikinati et al., 2020).

In future work, we plan to explore curriculum learning in more detail and to move away from estimating the competence of our model linearly, instead estimating the competence using the accuracy on the batches. Another interesting line of work is to not score the examples by model loss alone, but to combine it with insights from language acquisition and teaching, such as sorting lemmas based on their frequency in a corpus (Ionin and Wexler, 2002; Slabakova, 2010).

We also plan to investigate language-wise label smoothing more closely, specifically how the value of α should be fine-tuned with respect to the number of characters and languages.
190
Acknowledgments

The research reported in this paper was supported by grant 2014-39 from the Swedish Research Council, which funds the Centre for Linguistic Theory and Studies in Probability (CLASP) in the Department of Philosophy, Linguistics, and Theory of Science at the University of Gothenburg.

References

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 984–996. Association for Computational Linguistics.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1171–1179.

Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 876–885. The Association for Computational Linguistics.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR, abs/1410.5401.

Martin Haspelmath and Andrea Sims. 2013. Understanding Morphology. Routledge.

Tania Ionin and Kenneth Wexler. 2002. Why is 'is' easier than '-s'?: acquisition of tense/agreement morphology by child second language learners of English. Second Language Research, 18(2):95–136.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70.

Geethan Karunaratne, Manuel Schmuck, Manuel Le Gallo, Giovanni Cherubini, Luca Benini, Abu Sebastian, and Abbas Rahimi. 2021. Robust high-dimensional memory-augmented neural networks. Nature Communications, 12(1):1–12.

Xuebo Liu, Houtim Lai, Derek F. Wong, and Lidia S. Chao. 2020. Norm-based curriculum learning for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 427–436. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Peter Makarov, Tatiana Ruzsics, and Simon Clematide. 2017. Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, BC, Canada, August 3-4, 2017, pages 49–57. Association for Computational Linguistics.

Nikitha Murikinati, Antonios Anastasopoulos, and Graham Neubig. 2020. Transliteration for cross-lingual morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 189–197, Online. Association for Computational Linguistics.

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom M. Mitchell. 2019. Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Abhishek Sharma, Ganesh Katrapati, and Dipti Misra Sharma. 2018. IIT(BHU)–IIITH at CoNLL–SIGMORPHON 2018 shared task on universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 105–111, Brussels. Association for Computational Linguistics.

Roumyana Slabakova. 2010. What is easy and what is hard to acquire in a second language?

Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden, pages 21–24. The Association for Computer Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
192
BME Submission for SIGMORPHON 2021 Shared Task 0. A Three Step
Training Approach with Data Augmentation for Morphological Inflection
Gábor Szolnok∗
Budapest University of Technology and Economics
[email protected]

Botond Barta∗
Budapest University of Technology and Economics
[email protected]
193
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 193–195
August 5, 2021. ©2021 Association for Computational Linguistics
the dominant approach in the field. The first major success of seq2seq models in morphological inflection was the submission by Kann and Schütze (2016) to the 2016 edition of the SIGMORPHON shared task. This was followed by an extensive study by Faruqui et al. (2015) on LSTM-based encoder-decoder models for morphological inflection.

We used the augmentation techniques introduced by Neuvel and Fulop (2002). Inspired by Bergmanis et al. (2017), we attempted to extract different morphological properties of the languages and used them to generate data. Anastasopoulos and Neubig (2019) used a two-step training method that first trains on the language family and then on the individual languages. We use a similar procedure, but we augment the data with a different technique in each step.

3 Data

The shared task covered 38 languages from 12 language families. 35 of these languages were available from the beginning, while 3 surprise languages, Turkish, Vibe and Võro, were released one week before the submission deadline. Each language had a train and a development split of varying size. Each sample consists of a lemma, an inflected form and a list of morphosyntactic tags in the following format:

vaguear vaguearás V;IND;SG;2;FUT
emunger emunjamos V;IMP;PL;1;POS
desenchufar desenchufo V;IND;SG;1;PRS
delirar deliraren V;SBJV;PL;3;FUT

The amount of data varies widely among the languages: while the language Veps has more than 100000 examples, Ludic, the most underresourced language, has only 128 train samples. Table 1 lists the 12 language families and the number of languages from each family. We consider a language a low-resource language if it has fewer than 1300 samples. 8 language families had at least one low-resource language and 3 families were represented only by low-resource languages. One goal of our data augmentation techniques is to offset this imbalance (see Section 5).

Family               Langs  Low  Samples
Afro-Asiatic         12     1    196550
Arnhem               1      1    214
Aymaran              1      0    100000
Arawakan             2      0    16472
Iroquoian            1      0    3801
Turkic               3      0    300371
Chukotko-Kamchatkan  2      2    1378
Tungusic             2      1    5500
Austronesian         2      1    11395
Trans-New-Guinean    1      1    918
Indo-European        12     2    685567
Uralic               5      2    279720

Table 1: List of language families and the number of languages from each family. The third column is the number of low resource (<1300 samples) languages in a particular family. The fourth column is the overall sample count in each family.

4 Model architecture

4.1 LSTM based seq2seq model

Our model is largely based on the encoder-decoder model of Faruqui et al. (2015). We use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) as our encoder and a unidirectional LSTM with attention as our decoder.

Recall that the input for the inflection task is a pair consisting of a lemma and a list of morphosyntactic tags. We represent these pairs as a single sequence as the LSTM's input. For the input lemma-tags pair izar, (V, COND, PL, 2), we serialize it as

<SOS> i z a r <SEP> V COND PL 2 <EOS>

Similarly, we convert the target form into a sequence of characters:

<SOS> i z a r <EOS>

The output of our model looks like this when the inflected word is izaríais:

<SOS> i z a r í a i s <EOS>
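The serialization can be illustrated with a small helper (the function name is ours; the token inventory is exactly the one shown above):

```python
def serialize(lemma, tags):
    """Flatten a (lemma, tag list) pair into the single input sequence."""
    return ["<SOS>"] + list(lemma) + ["<SEP>"] + list(tags) + ["<EOS>"]

# serialize("izar", ["V", "COND", "PL", "2"])
# -> ['<SOS>', 'i', 'z', 'a', 'r', '<SEP>', 'V', 'COND', 'PL', '2', '<EOS>']
```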
The input sequence is first projected to an embedding space, which then provides the input for the encoder LSTM. The decoder is a standard unidirectional LSTM with attention. We decode the output in a greedy fashion and do not use beam search. We project the final output to the output vocabulary's dimension and use the softmax function to generate a probability distribution. The input and the output embeddings use shared weights and are trained from scratch along with the rest of the model.

4.2 Hyperparameter selection

We selected 16 languages from diverse families for hyperparameter tuning. Most of them were fusional or low resource, because early experiments showed that these are the harder ones to learn for
194
Family         Lang  Result  Copy   Stem-mod  Step 1  Step 2  Step 3  IIT+DA  OL
Turkic         tur   99.90   99.90  99.94     99.92   97.38   99.90   99.35   97.10
Uralic         vep   99.72   54.10  99.55     99.80   99.05   99.67   99.70   91.13
Uralic         lud   59.46   56.76  70.27     56.76   67.57   62.16   45.95   0.00
Uralic         olo   99.72   91.15  99.84     99.78   98.26   99.72   99.66   99.48
Indo-European  rus   98.07   94.84  98.00     97.86   95.56   97.34   97.58   70.72
Indo-European  kmr   98.21   86.02  98.74     98.41   97.50   98.21   98.01   5.14
Indo-European  deu   97.98   91.19  98.23     97.91   89.91   97.98   97.46   91.86

Table 2: The different results we achieved on the test dataset with different models, with different augmentation techniques excluded and with different training steps excluded. For comparison the table shows the result of our submission (Result), the given basemodel IIT+DA (Input Invariant Transformer + Data Augment) and the models that were trained on only one language (OL).
195
Not quite there yet: Combining analogical patterns and
encoder-decoder networks for cognitively plausible inflection
196
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 196–204
August 5, 2021. ©2021 Association for Computational Linguistics
(DEU) and Dutch (NLD). For each language, four
datasets are released: (a) a training dataset, (b) a
development dataset of attested verb forms, (c) a (1) WP1 WP2
development dataset of wug forms which includes anSpi:l@n anSpi:l@nt ← TAP
human judgements, and (d) a test dataset of wug XnSpi:l@n XnSpi:l@nt
forms without the human judgements. The goal of aXSpi:l@n aXSpi:l@nt
the Shared Task is to assign a score to each wug anXpi:l@nt anXpi:l@n
form in (d) that correlates as closely as possible aXSY i:Z@T aXSY i:Z@T t
with the human judgements (see Section 4.2 for XnY pZlT n XnY pZlT nt
more details on the evaluation process). X@n X@nt ← FAP
The entries of the datasets include lemma/form aX aXt
pair and a UniMorph (Kirov et al., 2018) tag (UT) XnY XnY t
specifying the part of speech and the paradigm cell Xn Xnt
of the form. The pairs are provided as written forms X Xt ← BAP
(orthographically) in datasets (a) and (b) and in IPA
(phonologically) in all four datasets. Henceforth we will note patterns using the
Table 1 summarizes the size of the training data ‘+’ symbol as a general notation for vari-
(a), its phonological make-up, the number of mor- ables, and rely implicitly on order to match
phosyntactic tags and the proportion (%) of syn- variables in alternations. Hence e.g. the AP
cretism, i.e. of forms that fill two or more paradigm (XnY pZlT n, XnY pZlT nt) will be noted
cells of the same lexeme. Although the number +n+p+l+n/+n+p+l+nt.
of phonemes is substantially similar in the three Most of the patterns in (1) are of little morpho-
languages, the datasets differ in the number of en- logical interest. This is in particular the case of the
tries (twice as many entries in DEU as in ENG) trivial alternation pattern (TAP) which just records
and number of cells. The three datasets have a the two strings without making any generalization
comparable amount of syncretism. over common elements. A broad alternation pat-
tern (BAP) is an optimal pattern that can be inferred
ENG DEU NLD by pairwise alignment of the two forms, without
entries 41 658 100 011 74 176 taking into account the situation in the rest of the
phonemes 43 44 39 paradigm. In principle there can be more than one
UTs 11 29 7 BAP for a pair of forms, although that rarely hap-
syncretism % 53 50 42 pens in practice. This type of pattern is of crucial
interest to the study of the implicative structure
Table 1: Training data
of paradigms (Ackerman et al., 2009; Ackerman
and Malouf, 2013) and the induction of inflection
3 Analogical Patterns classes (Beniamine et al., 2017), but does not lead
to the identification of affixes familiar from typical
Three of our model architectures rely on alternation grammatical descriptions. For that purpose, multi-
patterns (APs) describing the formal relationship ple alignments across the paradigms are necessary
between two word-forms. An AP is a pairing of two (Beniamine and Guzmán Naranjo, 2021), and lead
word-forms patterns (WPs) with shared variables to what we call a fine alternation pattern (FAP):
over substrings which represent word parts that here @n is the infinitive suffix and @nt is the present
are common between the two forms. For example, participle suffix.
the two German word-forms anSpi:l@n ‘to allude The crucial intuition behind the determination of
to’ and anSpi:l@nt ‘alluded to’ are related by the FAPs is that they identify recurrent partials (Hock-
pattern (anX@n, anX@nt) where the variable X ett, 1947) across both paradigms and lexemes. For
represents Spi:l. instance, the FAP in (1) is motivated by the fact
The number of different APs satisfied by a pair that the substrings an and Spi:l are shared across
of forms is typically large. For instance, there are all pairs of paradigm cells of anspielen (2), while
256 (2^8) distinct patterns relating anSpi:l@n and an-
Spi:l@nt, some of which are shown in (1), where present participle) pairs across lexemes (3).
italic capital letters represent variables over strings.
197
(2) anSpi:l@n Spi:l@t+an V;IND;PST;2;PL ‘to switch over’ is reordered as y:b@rvEks@l@. Sec-
anSpi:l@n Spi:lt@+an V;SBJV;PST;1;SG ond, all phonemes represented by digraphs and
anSpi:l@n Spi:lt+an V;IMP;2;PL trigraphs in IPA were replaced with arbitrary uni-
anSpi:l@n Spi:l@+an V;IMP;2;SG graphs (capital letters; eg. S is substituted for y:).
anSpi:l@n anSpi:l@st V;SBJV;PST;2;SG
anSpi:l@n anSpi:lst V;IND;PRS;2;SG Broad alternation patterns. Each entry in the
datasets consists of the infinitive and another form
(3) tari:fi:r@n tari:fi:r@nt V.PTCP;PRS of some lexeme, accompanied by the UT of the
‘to tariff’ second form. The BAP of a pair of forms is com-
tari:fi:r@n tari:fi:rt@n V;SBJV;PST;3;PL puted through an alignment of the two word-forms
ast@n ast@nt V.PTCP;PRS and the identification of their common parts and
‘to lug’ their differences. The alignment is computed by
ast@n ast@t@t V;SBJV;PST;2;PL means of the SequenceMatcher method of the
vain@n vain@nt V.PTCP;PRS python difflib library; we then go through the
“
‘to cry’ “
sequences provided by the method and create the
vain@n vaint V;IND;PRS;3;SG word-form patterns by replacing the common parts
“ “
tsErStrait@n tsErStrait@nt V.PTCP;PRS by + and copying the differences in their respec-
“
‘to disagree’ “
tive patterns. For example, SequenceMatcher
tsErStrait@n tsErStrait@st V;SBJV;PST;2;SG aligns the forms aptail@n and apg@tailt as in (5)
“ “ “ “ BAPs are
In this paper, we rely on an algorithm for in- which yields the ++en/+ge+t BAP.
ferring BAPs and FAPs initially developed to cre- therefore calculated separately for each entry con-
ate Glawinette (Hathout et al., 2020). Glawinette sidering only the two forms.
is a French derivational lexicon created from the
(5) word-form1 ap tail @n
definitions of the GLAWI machine readable dic-
word-form2 ap g@ tai“l t
tionary (Sajous and Hathout, 2015; Hathout and “
BAP1 + + @n
Sajous, 2016). Glawinette provides a description
BAP2 + g@ + t
of derivational morphology by means of morpho-
logical families and derivational series; it is part Note that a BAP can also be seen as a charac-
of an effort aiming at the design of derivational terization of an analogical series. For instance,
paradigms. BAPs and FAPs have been adapted to the pairs of forms in (4) can all be aligned in ex-
the datasets of the current task, analogizing inflec- actly the same way as in (5), they all have the same
tional paradigms to derivational families and pairs BAP ++@n/+g@+t and they form formal analogies
of inflectional paradigm cells to derivational series. (Lepage, 1998, 2004b,a; Stroppa and Yvon, 2005;
For instance, (4) presents an excerpt of an inflec- Langlais and Yvon, 2008). More specifically, if two
tional series in the inflectional paradigm of the verb pairs of forms (F1 , F2 ) and (F3 , F4 ) have the same
anspielen that realizes the features V;NFIN and BAP, then F1 : F2 :: F3 : F4 . BAPs could also
V.PTCP;PST. In turn, this series yields two series be computed for entire inflectional paradigms as
of word-forms, the ones in the left column and the proposed by Hulden (2014). Also note that BAPs
ones in the right column. are not specific to an inflection class, as two classes
(4) apzu:x@n apg@zu:xt ‘to search’ may exhibit common behavior in one part of their
aplOx@n apg@lOxt ‘to punch off’ paradigm but not in another. For instance, the BAP
aprYk@n apg@rYkt ‘to disengage’ +/+s describes the formal relation that connects
apgUk@n apg@gUkt ‘to peek’ the infinitive and the V;PRS;3;SG form of both
regular (work) and irregular English verbs (eat).
Basic preprocessing. The forms in the test set
(d) being in IPA, we only computed phonological Fine alternation patterns. Unlike BAPs which
BAPs and FAPs. BAPs and FAPs have been com- are derived solely from the examination of pair-
puted for all the entries of all four datasets. In wise alternations, FAPs rely on the place of the two
addition, two basic modifications were performed. word-forms in the overall morphological system to
First, particles were reordered so as to appear in the
same position in the infinitival and inflected word- ing to traditional exponents. For instance, the BAP
forms. For instance, wechsele über, vEks@l@ y:b@r relating the German weak verbs like anspielen to
198
its present participle anspielend, relying on the op- terns characterize a large enough subset of the pairs
timal alignment between the two forms, does not in Φ. These APs are obtained as follows.
identify the infinitive and present participle exponents We first collect the WPs that possibly charac-
-en and -end. These cannot be deduced from an terize the word-forms in Φ1 by computing a pat-
isolate pair of word-forms, and require considering, tern of word-forms for each pair of word-forms
across the paradigm, all the pairs of word-forms made up of two word-forms from Φ1 . These pat-
that include infinitives or present participles and terns are dual of the ones illustrated in (5) as we
finding out the pair of endings that best character- need to represent what the word-forms have in
izes, across lexemes, the infinitives “similar” to common and not their differences. For instance,
anspielen and the present participles “similar” to in the second column of (1), the pattern that de-
anspielend. The main challenges in the identifi- scribes the common part of apg@zu:xt and apg@lOxt
cation of the FAPs are then (i) that they involve is apg@+xt and the one for the common part of
the entire dataset and cannot be computed locally apg@lOxt and apg@rYkt is apg@+t. If the num-
for a single pair of word-forms; (ii) that we need ber of word-forms in the column is large and var-
to formally define what “similar” means; (iii) that ied enough, all the relevant WPs that characterize
we potentially need to consider all the APs of all a part of the word-forms will be collected. We
the pairs of words included in the dataset; (iv) that then align the patterns obtained for the two col-
we need a reasonable operational approximation of umn. This is done by considering the WPs as if
what could be considered as linguistically relevant. logical signature, i.e. their BAP. For instance, we
logical signature, i.e. their BAP. For instance, we
(i) The regularities that determine the FAPs are
have aplOx@n : aprYk@n :: apg@lOxt : apg@rYkt. The
holistic properties of the dataset, i.e. of the union of
BAP for the first pair is ++@n/+g@+t and the
the datasets (a), (b), (c) and (d). The consequence
BAP for the second is identical; the pattern that
is that each FAP depends on the entire dataset, and
characterizes aplOx@n:aprYk@n is ap+@n and the
FAPs have to be recomputed each time any of the
one that characterizes the second is apge+t.
datasets (a), (b), (c) or (d) is modified.
These two WPs are aligned because their BAP is
(ii) The pairs in (4) are good examples of what ++@n/+g@+t, i.e. the same as the BAP of the two
similar may mean, from an inflectional point of pairs of word-forms.
view. This type of similarity can be defined in terms By doing the same computation for all pairs of
of analogy. We first assume that pairs of forms word-forms and matching them with respect to their
that satisfy the same BAP constitute an analogical BAP, we end up with a number of FAP candidates
series (as they satisfy the formal analogy encoded that we first screen in order to exclude those that
by the BAP). Word-forms belonging to the same describe only a small part of Φ, or that are made
column in an analogical series are then considered up of WPs that describe a small part of Φ1 or Φ2 .
as similar. In our example, the word-forms in each Another feature that helps select valuable FAPs is
column in (4) count as similar. the number of variable parts (+) it contains. For our
models, we only used FAPs that contain exactly one
(iii) We limit the number of patterns to be con- variable part, but this number could be increased
sidered by looking only at the ones that are involved for languages with templatic morphology or that
in the characterization of similar word-forms. In make use of infixes.
other words, once the sets of similar word-forms
are created, we only consider the similarities that (iv) We assume that optimal FAPs are pairs of
exist between the word-forms that belong to each WPs that recur both within the paradigm and across
set, since only these ones may be part of an FAP. the lexicon, as we illustrated in (2) and (3). For
Let Φ be the set of pairs of word-forms satisfy- instance, the FAP of a pair of word-forms anSpi:l@n
ing some BAP, and Φ1 (resp. Φ2 ) the set of word- and anSpi:l@nt consists of the aligned patterns de-
forms that are the first (resp. second) element of scribing the largest number of word-forms similar
a pair in Φ. What we are looking for are the pat- to anSpi:l@n on the one hand, and the largest num-
terns that characterize a large enough subset of the ber of word-forms similar to anSpi:l@nt on the other
word-forms in Φ1 that are in correspondence with hand. It turns out that this is the pattern +@n/+@nt.
patterns that characterize a large enough subset of More precisely, let (F1 , F2 ) be a pair of word-
the word-forms in Φ2 , i.e. such that the pair of pat- forms and let {(P1 , Q1 ), (P2 , Q2 ), ..., (Pn , Qn )}
199
be the FAP candidates connecting F1 and F2 (i.e. in Section 3 provide a description complementary
the set of the aligned WPs of F1 and F2 ). Let to the lemma-form mapping in which analogical
|X| be the number of form pairs that satisfy an regularities may be locally to a single pair of forms
alternation that contain the WP X. The FAP of (BAPs) or globally from the entire lexicon (FAPs).
(F1 , F2 ) is then the WP pair (Pi , Qi ) such that These paradigmatic analogies emerge when the
|P_i| + |Q_i| = max_{j=1..n}(|P_j| + |Q_j|). FAPs are
therefore selected separately for each pair of word- lexeme and the other forms that occupy the same
forms. cell in the paradigm (Bonami and Beniamine, 2016;
The models M1 and M2 presented in Section 4 Ahlberg et al., 2015; Albright, 2002).
use FAPs computed from the union of the datasets The models we designed for the task combine
(a), (b), (c) and (d) for each of the three languages the capacity of the sequence-to-sequence models to
of the task. learn the regularities present in strings of phonemes
with the alternation patterns acquired from the
Discussion. BAPs and FAPs give different types
paradigms, in order to predict native speaker re-
of information: BAPs capture relations between
sponses in a wug test.
pairs of forms independently of the rest of the sys-
tem, and are hence crucial to addressing the im- 4.1 Models
plicative structure of paradigms (Wurzel, 1989).
We designed four models for the shared task.
FAPs on the other hand characterize a pair of forms
taking into account their place in the rest of the sys- 4.1.1 Model 1
tem; this typically leads to more specific patterns In the first model, M1, we consider morphologi-
that are satisfied by a smaller set of pairs. cal inflection as a mapping over sequences. The
mapping is implemented by bidirectional LSTMs
4 Combining analogical patterns and
with dropouts (Hochreiter and Schmidhuber, 1997;
encoder-decoder networks Gal and Ghahramani, 2016). The hidden states
Early work on connectionist models of the ac- of the encoder are used to initialize the decoder
quisition of morphology involved pattern associ- states. The model adopts a teacher forcing strategy
ators that learn relationships between a base lex- to compute the decoder’s state in the next time-
ical form (i.e. the lemma) and a target form (i.e. step. M1 takes as input four sequences: the lemma
the inflected form). For example, Rumelhart and (IPA-encoded), the UT, the BAP and the FAP pat-
McClelland (1986) propose a simulation of how terns. The output sequence is the inflected form
English past tense is learned. They focus on pairs (IPA-encoded). The output layer uses a sigmoid to
of verb forms like go-went and walk-walked and produce a probability distribution over the output
consider that morphological learning is a gradual phonemes.
process which includes an intermediate phases of
M1 Input: {lemma + UT + BAP + FAP}
“over-regularization” (where the past form of go
Output: {inflected form}
is goed instead of went). This yields the well-
known “U-shape” curves observed in the develop-
mental phases of morphological competence in is the joint probability of the its phonemes. The
children. model M1 addresses the task directly. We expected
More recently, models based on deep learning the prediction of a model that uses all the available
architectures have been used (Malouf, 2017) and information including BAPs and FAPs would be
in particular sequence-to-sequence models able to accurate and highly correlated with the judgments
predict one form of a lexeme from another (Faruqui of the speakers.
et al., 2016; Kirov and Cotterell, 2018).
These approaches are based on the assumption 4.1.2 Model 2
that the morphological learning can reduce to a The second model, M2, relies on FAPs to identify
simple mapping between a base form and an in- the crucial thing to be predicted in a wug task,
flected one. Generalizations over similar mappings namely the inflectional pattern of the output form.
(e.g. love-loved,walk-walked vs. sing-sang, ring- Hence the model is trained to predict, instead of the
rang) are learned from the dependences between raw output form, the word pattern that constitutes
the phonemes in sequences. The APs presented the second part of the FAP (FAP2 ) and identifies
200
its place in the inflection system while abstracting This is meant as a very simple baseline, cap-
away from what is common between the input and turing in a very crude fashion the intuition that
output forms. speakers judge as more natural wugs that fit into a
more frequent pattern.
M2 Input: {lemma + UT}
Output: {FAP2 } 4.2 Results
Computationally M2 is similar to M1 except for The submissions to the task are evaluated using
the input/output sequences involved. In particular, the AIC scores from a mixed-effects beta regres-
the probability score of a wug form is the joint sion model (Magnusson et al., 2017) where the
probability of the output symbols. scaled human ratings (DV) were predicted from
the submitted model’s ratings (IV). The regression
4.1.3 Model 3 implements a random intercept for lemma type. Ta-
Our third model, M3, estimates a possible word- ble 2 reports the AIC scores of the test data for the
likeness effects due to phonological similarity of three languages.
the inflected forms that have the same UT. Word-
likeness is the extent to which a sound sequence Models ENG NLD DEU
composing a form is phonologically acceptable in M1 −33.4 −60.0 −16.1
a given language. It mostly depends on the phono- M2 −43.0 −66.0 −98.8
tactic knowledge of the speakers (Vitevitch and M3 −37.5 −64.9 −12.9
Luce, 2005; Hahn and Bailey, 2005) and on the ex- M4 −40.7 −36.8 −72.9
istence of phonologically similar words in the men-
Table 2: AIC scores calculated on the basis of the final
tal lexicon (Albright, 2007). For example, a wug test data. Lower scores are better.
past form like saIndId included in the English test
dataset could trigger wordlikeness effect because
it is similar to an attested past form saIdId (sided). 5 Discussion
For each of the three languages, we designed a
classifier which predicts whether an inflected form The performance of our four models suggest the
is assigned to a specific UT in the train set. The following observations. First, M2 outperforms our
target UTs are the ones of the inflected forms in the three other models for all three languages, and
three test sets (d), namely V;PST;1;SG for ENG,
V;PST;PL for NLD and V.PTCP;PST for DEU. shared task. Second, there is a striking difference
Technically, for each language, the M3 model is in performance between M2 and M1, which had an
an LSTM-based binary classifier which takes the similar architecture, but performed very poorly—
inflected form as input and outputs whether it is worse than our baseline M4 model, and second to
assigned to the target UT (value 1) or not (value 0). last of all systems submitted to the shared task. Al-
At training time, the forms which are assigned to though more experiments are needed to conclude
the target UT and to another UT, are only kept with on this point, we conjecture that the better per-
their target UT. formance of M2 might be due to the fact that it
abstracts away from the question of predicting the
M3 For inflected UT in the test set (d), shape of the stem in the output, but focuses instead
Input: {inflected form} on that part of the inflected form that is not to be
Output: {0,1} found in the input. This seems to match intuitions
about human behavior: when dealing with inflec-
The score assigned to the wug form is simply the
tions, speakers may have a hard time applying the
probability outputted by the system.
right pattern, but they never have a hard time re-
4.1.4 Model 4 membering what the stem looks like, even if it is
Our fourth model, M4, simply uses the type fre- phonotactically unusual (see Virpioja et al. 2018
quency of the BAP relating the wug lemma and the for a psycho-computational study).
wug form as a score for the test dataset. The other surprising result is that M4, which
was intended as a crude baseline, did surprisingly
M4 Raw type frequency of the BAP relating the well on the English and German data, although it
wug lemma and wug form performed very poorly on Dutch. This is interest-
201
ingly complementary to the performance of M3, objective of our participation was to test different
which did surprisingly well on Dutch but poorly on hypotheses. The main one is the relatively low
German. As M3 is entirely focused on phonotactic importance of stems when predicting the accept-
similarity while M4 is focused on the frequency ability of wug forms, as evidenced by the good
of alternations, this suggests that the three inflec- performance of the M2 model, which only
tion systems (to the extent that they are faithfully predicts the FAP of the inflected form. Therefore,
represented by the datasets) raise different kinds of M2 is output-oriented in the sense that the proper-
challenges to speakers. ties that characterize the input, i.e. the lemma, are
To better assess the quality of M2, we exam- not used during training.
ined how well it statistically correlates with human M1 and M2 models are able to predict inflected
performance in Albright and Hayes’s (2003) exper- forms and FAP2 patterns for any UT in the training
iments on islands of reliability (IOR) in regular and set while M3 models are specialized on a single
irregular inflection in English. Albright and Hayes UT. In future work, we plan to develop specialized
are trying to establish that speakers rely on struc- versions of M1 and M2 in order to estimate the im-
tured linguistic knowledge as encoded in their Min- portance of the tested inflectional series (i.e. of the
imal Generalization Learner (MGL, Albright and set of form pairs with the same UTs as the entries in
Hayes, 2006) rather than pure analogy when inflect- test set) with respect to the entire training set. We
ing novel words. To establish this, they collected further plan to test our models on more complete
both productions of human participants asked to datasets in which the inflected forms could be pre-
inflect a novel word, and judgments on pairs of dicted from other forms than the lemma, but also
word-forms. They show that the MGL leads to a jointly from several forms of the lexeme (Bonami
better correlation with human results than a purely and Beniamine, 2016).
analogical system based on Nosofsky (1990) (NOS
in the table below). As Table 3 shows, our M2 per- Acknowledgement
forms at a level comparable to the MGL. More pre- Experiments presented in this paper were carried
cisely, it clearly outperforms it on irregular verbs
while trailing on regulars. Importantly, M2 does tered by IRIT and supported by CNRS, the Region
that without relying on any structured knowledge Occitanie, the French Government and ERDF.
of the kind found in the MGL, although it does rely
on a more complete view of the morphological sys-
tem. This suggests that the conclusions of Albright References
and Hayes should be reconsidered. Farrell Ackerman, James P. Blevins, and Robert Mal-
ouf. 2009. Parts and wholes: implicative patterns in
Ratings Production inflectional paradigms. In James P. Blevins and Juli-
Models probabilities ette Blevins, editors, Analogy in Grammar, pages
54–82. Oxford University Press, Oxford.
reg. irr. reg. irr.
MGL 0.745 0.570 0.678 0.333 Farrell Ackerman and Robert Malouf. 2013. Morpho-
NOS 0.448 0.488 0.446 0.517 logical organization: the low conditional entropy
M2 0.583 0.595 0.611 0.560 conjecture. Language, 89:429–464.
Malin Ahlberg, Markus Forsberg, and Mans Hulden.
Table 3: Pearson correlations (r) of participant re- 2015. Paradigm classification in supervised learn-
sponses to models. Core IOR verbs (n = 41). See ing of morphology. In Proceedings of the 2015 Con-
Albright and Hayes (2003) for the list of nonce verbs ference of the North American Chapter of the Asso-
exploited in the experiment ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 1024–1029, Denver, Col-
orado. Association for Computational Linguistics.
6 Conclusion Adam Albright. 2002. Islands of reliability for regu-
lar morphology: Evidence from italian. Language,
At the time of writing, we do not have the descrip- 78:684–709.
tions of the other systems that were submitted to
Adam Albright. 2007. Gradient phonologi-
the shared task. As a result, we are not able to cal acceptability as a grammatical effect.
identify the reasons for the good and not so good https://www.mit.edu/˜albright/papers/
performance of the four systems we submitted. The Albright-GrammaticalGradience.pdf.
202
Adam Albright. 2009. Feature-based generalisation as a source of gradient acceptability. Phonology, 8:9–41.

Adam Albright and Bruce Hayes. 2003. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2):119–161.

Adam Albright and Bruce Hayes. 2006. Modeling productivity with the gradual learning algorithm: The problem of accidentally exceptionless generalizations. In Gisbert Fanselow, Caroline Féry, Ralf Vogel, and Matthias Schlesewsky, editors, Gradience in Grammar: Generative Perspectives, pages 185–204. Oxford University Press, Oxford.

Sacha Beniamine, Olivier Bonami, and Benoît Sagot. 2017. Inferring inflection classes with description length. Journal of Language Modelling, 5(3):465–525.

Sacha Beniamine and Matías Guzmán Naranjo. 2021. Multiple alignments of inflectional paradigms. In Proceedings of the Society for Computation in Linguistics, volume 4.

Olivier Bonami and Sacha Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure, 9:156–182.

Maria Corkery, Yevgen Matusevych, and Sharon Goldwater. 2019. Are we there yet? Encoder-decoder neural networks as cognitive models of English past tense inflection.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Ulrike Hahn and Todd M. Bailey. 2005. What makes words sound similar? Cognition, 97:227–267.

Nabil Hathout and Franck Sajous. 2016. Wiktionnaire's Wikicode GLAWIfied: a workable French machine-readable dictionary. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Nabil Hathout, Franck Sajous, Basilio Calderone, and Fiammetta Namer. 2020. Glawinette: a linguistically motivated derivational description of French acquired from GLAWI. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pages 3870–3878, Marseille.

Jennifer Hay, Janet Pierrehumbert, and Mary E. Beckman. 2004. Speech perception, well-formedness and the statistics of the lexicon. In John Local, Richard Ogden, and Rosalind Temple, editors, Phonetic Interpretation: Papers in Laboratory Phonology VI, Papers in Laboratory Phonology, pages 58–74. Cambridge University Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Charles F. Hockett. 1947. Problems of morphemic analysis. Language, 23:321–343.

Mans Hulden. 2014. Generalizing inflection tables into paradigms with finite state operations. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, pages 29–36, Baltimore, Maryland.

Christo Kirov and Ryan Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics, 6:651–665.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Philippe Langlais and François Yvon. 2008. Scaling up analogical learning. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 51–54, Manchester.

Yves Lepage. 1998. Solving analogies on words: An algorithm. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics, volume 2, pages 728–735, Montréal.

Yves Lepage. 2004a. Analogy and formal languages. Electronic Notes in Theoretical Computer Science, 53:180–191.

Yves Lepage. 2004b. Lower and higher estimates of the number of true analogies between sentences contained in a large multilingual corpus. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), pages 736–742, Genève.

Arni Magnusson, Hans J. Skaug, Anders Nielsen, Casper W. Berg, Kasper Kristensen, Martin Maechler, Koen J. van Bentham, Benjamin M. Bolker, and Mollie E. Brooks. 2017. glmmTMB: Generalized Linear Mixed Models using Template Model Builder.
Robert Malouf. 2017. Abstractive morphological learn-
ing with a recurrent neural network. Morphology,
27:431–458.
Robert M. Nosofsky. 1990. Relations between
exemplar-similarity and likelihood models of clas-
sification. Journal of Mathematical Psychology,
34:393–418.
David. E. Rumelhart and James. L. McClelland. 1986.
On learning the past tense of English verbs. In D. E.
Rumelhart and J. L. McClelland, editors, Parallel
Distributed Processing, volume 2, pages 216–271.
MIT Press.
Franck Sajous and Nabil Hathout. 2015. GLAWI,
a free XML-encoded Machine-Readable Dictionary
built from the French Wiktionary. In Proceedings
of the of the eLex 2015 conference, pages 405–426,
Herstmonceux, England.
Nicolas Stroppa and François Yvon. 2005. An analog-
ical learner for morphological analysis. In Proceed-
ings of the 9th Conference on Computational Natu-
ral Language Learning (CoNLL-2005), pages 120–
127, Ann Arbor, MI. ACL.
Sami Petteri Virpioja, Minna Lehtonen, Annika Hultén,
Henna Kivikari, Riitta Salmelin, and Krista Lagus.
2018. Using statistical models of morphology in the
search for optimal units of representation in the hu-
man mental lexicon. Cognitive Science, 42(3):939–
973.
Michael S. Vitevitch and Paul A. Luce. 2005. In-
creases in phonotactic probability facilitate spoken-
nonword repetition. Journal of Memory and Lan-
guage, 52:93–204.
Wolfgang Ulrich Wurzel. 1989. Inflectional Morphol-
ogy and Naturalness. Kluwer, Dordrecht.
Were We There Already? Applying Minimal Generalization to the
SIGMORPHON-UniMorph Shared Task on Cognitively Plausible
Morphological Inflection
Table 1: Number of lexemes (wordform pairs) used for training, number of rules learned by minimal generalization
(before and after pruning), and evaluation on average human wug-test ratings for each language. Lower AIC values
indicate a better match between model predictions and behavioral results.
Micha Elsner
Department of Linguistics
The Ohio State University
[email protected]
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 214–226
August 5, 2021. ©2021 Association for Computational Linguistics
Section 5 shows that this analogical framework for inflection can predict inflections across a variety of languages, demonstrating reasonable performance on the Sigmorphon 2020 multilingual benchmark (Vylomova et al., 2020). Section 6 describes one-shot learning experiments, performing language transfer without fine-tuning, and shows that for languages with concatenative affixes, one-shot transfer can be more effective than previously thought. Section 7 studies the system's ability to apply different types of morphological processes using constructed stimuli, showing that some configurations are capable of learning generic and transferable representations of processes including prefixing, suffixing and reduplication.

2 Related work

The overall positive effect of transfer learning is well established (McCarthy et al., 2019). Previous research has also evaluated how the choice of source language affects performance in the target. While there is a robust trend for related languages to perform better, there are also many reports of exceptions. Kann (2020) finds that Hungarian is a better source for English than German, and a better source for Spanish than Italian. She concludes that matching the target language's default affix placement (prefixing/suffixing) is important, and that agglutinative languages might be beneficial to transfer learning in general, but that genetic relatedness is not always necessary or sufficient for effective transfer. Lin et al. (2019) also find that Hungarian and Turkish are good source languages for a surprising variety of unrelated targets. Rather than attribute this to agglutination, they propose that these languages lead to good transfer because of their large datasets and difficulty as tasks. Further puzzling results come from Anastasopoulos and Neubig (2019), who find that Italian data does not improve performance in closely related Ladin or Neapolitan3 once monolingual hallucinated data is available, and that Latvian is as good a source for Scots Gaelic as its relative Irish.

3 Regional Romance languages spoken in Northern and Southern Italy respectively.

Previous analyses of transfer learning have attempted to differentiate the contributions of various parts of the model through factored vocabularies or ciphering (Kann et al., 2017b; Jin and Kann, 2017). These methods give disjoint representations to characters and tags in the source and target languages, or disrupt the mapping between them. Low-level correspondence between character sets is the most important factor for successful transfer in very low-resource settings, but models with disjoint character representations still succeed at transfer once at least 200 target examples are available, indicating that higher-level information is also transferred and contributes to performance.

Kann et al. (2017b) also represents a prior one-shot morphological learning experiment. Their setting is not quite the same as the one here; they assume access to a single inflected form in half the paradigm cells in their target language (Spanish), which are used to fine-tune a pretrained system. Because their system uses the conventional tag-based framework, they are capable of filling cells for which no example is available (zero-shot learning), while the memory-based system presented here is not. On the other hand, the current work does not use fine-tuning or require target-language data at training time. They evaluate inflection on both seen and unseen cells as a function of five source languages, four of which are in the Romance family. The best one-shot transfer within Romance scores 44% exact match, the worst 13%. Transfer from unrelated Arabic scores 0%. One-shot learning experiments in this work use a much larger set of languages, and although performance in the typical case is similar, the best results are substantially better.

The memory-based design of the current work is rooted in cognitive theories of morphological processing. The widely accepted dual-route model of morphological processing postulates that the mind retrieves familiar inflected forms from memory as well as synthesizing forms from scratch (Milin et al., 2017; Alegre and Gordon, 1998; Butterworth, 1983). It has often been claimed that memorized forms of specific words are central to the structure of inflection classes (Bybee and Moder, 1983; Bybee, 2006; Jackendoff and Audring, 2020). In such a theory, production of a form of a rare lemma is guided by the memory of the appropriate forms of common ones. Additional evidence for this view comes from historical changes in which one word's paradigm is analogically remodeled on another's (Krott et al., 2001; Hock and Joseph, 1996, ch. 5). Liu and Hulden (2020) evaluate a model very similar to this one (a transformer in which target forms of other words, which they term "cross-table" examples, are provided as part of the input).
                                Lemma     Target specification      -> Target
Standard inflection generation  waiata    V;PASS                    waiatatia
Memory-based                    waiata    karanga : karangatia      waiatatia
                                waiata    kaukau : kaukauria        waiatatia

Figure 1: Differing inputs for inflection models, eliciting the passive of the Maori verb waiata "sing". The memory-based system relies on an exemplar verb as the target specifier; shown here are karanga "call", which takes a matching suffix, and kaukau "swim", which mismatches.
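As a concrete illustration of the input format in Figure 1, the sketch below assembles a memory-based instance string for the character-level model. The ':' and '::' separators mirror the synthetic example given later in Section 4; the tag spellings (LANG_MAO, FAM_AUSTRONESIAN) and the helper name are illustrative assumptions rather than the paper's exact format.

```python
def make_instance(lemma, exemplar_lemma, exemplar_form, target,
                  lang="MAO", family="AUSTRONESIAN"):
    """Return (source, target) strings for a character-level seq2seq model.
    The source carries language/family tags plus lemma:exemplar::exemplar_form."""
    tags = [f"LANG_{lang}", f"FAM_{family}"]
    source = " ".join(tags + list(f"{lemma}:{exemplar_lemma}::{exemplar_form}"))
    return source, " ".join(list(target))

# Eliciting the passive of Maori waiata "sing" with a matching exemplar (Figure 1):
src, tgt = make_instance("waiata", "karanga", "karangatia", "waiatatia")
print(src)   # LANG_MAO FAM_AUSTRONESIAN w a i a t a : k a r a n g a ...
print(tgt)   # w a i a t a t i a
```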
They find that such examples are complementary to data hallucination and yield improved results in data-sparse settings. Some earlier non-neural models also rely on stored word forms (Skousen, 1989; Albright and Hayes, 2002).

3 Exemplar selection

The system uses instances generated as described in Figure 1, separating the lemma, exemplar lemma and exemplar form with punctuation characters. Each instance also contains two features indicating the language and language family of the example (e.g. LANG MAO, FAM AUSTRONESIAN).

The selection of the exemplar is critical to the model's performance. Ideally, the lemma and the exemplar inflect in the same way, reducing the inflection task to copying. But this is not always the case. For example, Maori verbs fall into inflection classes, as shown in Figure 1; when the exemplar comes from a different class than the lemma, copying will yield an invalid output, so the model has to guess which class the input belongs to.4

This paper presents experiments using two settings. In random selection, the exemplar lemma is chosen arbitrarily from the set of training lemma/form pairs for the appropriate language and cell. This makes the task difficult, but allows the model to learn to cope with the distribution of inputs it will face at test time. In similarity-based selection, each source lemma is paired with an exemplar for which the transductions are highly similar. This makes the task easy, but since it relies on access to the true target form, it can be used only for model training, not for testing.5 All models are evaluated using instances generated using random selection.

To perform similarity-based selection, each lemma is aligned with its target form in the training data in order to extract an edit rule (Durrett and DeNero, 2013; Nicolai et al., 2016). (For the first memory-based example in Figure 1, both words have the same edit rule -+tia.) The selected exemplar/form pair uses the same edit rule, if possible. During training, a lemma is allowed to act as its own exemplar, so that there is always at least one candidate. However, words in the test set must be given exemplars from the training set. If a cell in the test set does not appear in the training set, no prediction can be made; in this case, the system outputs the lemma. Extending the model to cover this case is discussed below as future work.6

4 Model design

The system uses the character-based transformer (Wu et al., 2020) as its learning model; this is a sequence-to-sequence transformer (Vaswani et al., 2017) tuned for morphological tasks, and serves as a strong official baseline for the Sigmorphon 2020 task. Moreover, transformers are known to perform well in the few-shot setting (Brown et al., 2020). All default hyperparameters7 match those of Wu et al. (2020).

As discussed in prior work (Anastasopoulos and Neubig, 2019; Kann and Schütze, 2017), it is important to pretrain the model to predispose it to copy strings. To ensure this, the system is trained on a synthetic dataset.

4 In cases of class-dependent syncretism, the model must also guess which cell is being specified. For instance, German feminine nouns do not inflect for case, but some masculine nouns do, so the combination of a masculine lemma and a feminine exemplar can yield an unsolvable prediction problem.
5 Within the training set, the same lemma/inflected form pair can appear as both an exemplar and a target instance; a reviewer speculates that this might allow the model to memorize lexically-specific outputs within the training set even when using random selection. To avoid this issue, no training scores are reported in this paper.
6 In the SigMorphon 2020 datasets, this rarely occurs in practice. ≥ 99% of target cells are covered in all languages except Ingrian (98%), Evenki (96%), and notably Ludic (61%).
7 Including 4 layers, batches of 64, and the learning rate schedule.
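The following sketch illustrates the edit-rule idea behind the similarity-based selection described in Section 3. It uses a simple longest-common-prefix/suffix alignment in place of the Durrett and DeNero (2013) aligner the paper relies on, so the function names and the rule format are assumptions for illustration only.

```python
import random

def edit_rule(lemma, form):
    """Toy edit rule: strip the longest common prefix and suffix and record what
    the lemma loses and the form gains (waiata -> waiatatia gives ('-', '+tia')).
    This approximates, but is not, the alignment-based rules cited above."""
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    j = 0
    while j < min(len(lemma), len(form)) - i and lemma[-1 - j] == form[-1 - j]:
        j += 1
    return ("-" + lemma[i:len(lemma) - j], "+" + form[i:len(form) - j])

def choose_exemplar(lemma, form, training_pairs, similar=True):
    """Pick an exemplar (lemma, form) pair for the same cell.
    similar=True  -> prefer a pair sharing the target's edit rule (training only);
    similar=False -> random selection, as used at test time."""
    if similar:
        wanted = edit_rule(lemma, form)
        matching = [p for p in training_pairs if edit_rule(*p) == wanted]
        if matching:
            return random.choice(matching)
    return random.choice(training_pairs)

# Toy Maori passive cell (Figure 1): karanga -> karangatia shares waiata's rule.
pairs = [("karanga", "karangatia"), ("kaukau", "kaukauria"), ("waiata", "waiatatia")]
print(edit_rule("waiata", "waiatatia"))                       # ('-', '+tia')
print(choose_exemplar("waiata", "waiatatia", pairs, similar=True))
```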
Each synthetic instance is generated within a random character set. The instance consists of a random pseudo-lemma and pseudo-exemplar created by sampling word lengths from the training word length distribution and then filling each one with random characters. With probability 50% the example is given a prefix; independently with probability 50% a suffix; independently with probability 10% an infix at a random character position. Prefixes and suffixes are random strings between 2-5 characters long and infixes are 1-2 characters long. (This means that, in some cases, no affix is added and the transformation is the identity, as occurs in cases of morphological syncretism.) An example such instance is mpieňjmel:rbeaikkea::zlürbeaikkeaüe with output zlümpieňjmelüe. The language tags for these examples indicate the kinds of affixation operations which were performed, for example LANG PREFIX SUFFIX; the family tag identifies them as SYNTHETIC. While this synthetic dataset is inspired by hallucination techniques (Anastasopoulos and Neubig, 2019; Silfverberg et al., 2017), note that these synthetic instances are not presented to the model as part of any natural language.

The Sigmorphon 2020 data is divided into "development languages" (45 languages in 5 families: Austronesian, Germanic, Niger-Congo, Oto-Manguean and Uralic) and "surprise languages" (45 more languages, including some members of development families as well as unseen families). Data from all the "development languages", plus the synthetic examples from the previous stage, is used to train a multilingual model, which is then fine-tuned by family. Finally the family models are fine-tuned by language. During multilingual training and per-family tuning, the dataset is balanced to contain 20,000 instances per language; languages with more training instances than this are subsampled, while languages with fewer are upsampled by sampling multiple exemplars (with replacement) for each lemma/target pair. For the final language-specific fine-tuning stage, all instances from the specific language are used.

5 Fine-tuned results

This section shows the test results for fully fine-tuned models on the development languages. Table 1 shows the average exact match and standard deviation by language family. Full results are given in Appendix A. Tables also show the results of the official competition baseline which is closest to the current work, the character transformer (Wu et al., 2020) fine-tuned by language, TRM-SINGLE.

Family              Random     Similarity   Base
Austronesian (4)    83 (13)    67 (21)      81 (18)
Germanic (10)       87 (10)    51 (16)      90 (9)
Niger-Congo (9)     98 (4)     94 (9)       97 (3)
Oto-Manguean (10)   82 (16)    39 (23)      86 (12)
Uralic (11)         92 (6)     46 (14)      93 (0.05)
Overall             89 (12)    57 (26)      90 (11)

Table 1: Fine-tuned accuracy scores for models trained with random and similarity-based selection, compared to the baseline. Number of languages in family and score standard deviation across languages in parentheses.

Because the results of exemplar-based models can vary based on the choice of exemplar, the system applies a simple post-process to compensate for unlucky choices: it runs each lemma with five randomly-selected exemplars and chooses the majority output.

Neither model achieves the same performance as the baseline (90%), although the random-exemplar model (89%) comes quite close. The similar-exemplar model (57%) is clearly inferior due to its severe mismatch between training and test settings. Performance varies across language families. All models perform well in Niger-Congo, although the conference organizers state that data from these languages may have been biased toward regular forms in an unrepresentative way.8 The random-exemplar model is at or near baseline performance in Austronesian and Uralic, but falls further below baseline in Germanic and Oto-Manguean. Both of these families are characterized by complex inflection class structure in which randomly chosen exemplars are less likely to resemble the target for a given word.

The similar-exemplar model also performs poorly in Uralic. While some Uralic languages have inflection classes (Baerman, 2014), many (like Finnish) do not, but have complex systems of phonological alternations (Koskenniemi and Church, 1988). While the random-exemplar model can learn to compensate for these, the similar-exemplar model does not.

8 A Swahili speaker confirms that some forms in the data appear artificially over-regularized (Martha Johnson p.c.).
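A rough sketch of how the synthetic copy-pretraining instances described above could be generated. The affixation probabilities and affix lengths follow the description in Section 4; the character-set size, stem lengths, tag spellings and function names are assumptions made for illustration.

```python
import random
import string

def synthetic_instance(rng=random):
    """Generate one copy-biased pretraining instance: pseudo-lemma and
    pseudo-exemplar over a random character set, with a prefix (p=0.5),
    a suffix (p=0.5) and an infix (p=0.1) applied identically to both."""
    charset = rng.sample(string.ascii_lowercase, 12)
    word = lambda lo, hi: "".join(rng.choice(charset) for _ in range(rng.randint(lo, hi)))

    lemma, exemplar = word(4, 9), word(4, 9)
    prefix = word(2, 5) if rng.random() < 0.5 else ""
    suffix = word(2, 5) if rng.random() < 0.5 else ""
    infix = word(1, 2) if rng.random() < 0.1 else ""

    def inflect(stem):
        out = stem
        if infix:
            pos = rng.randrange(1, len(out))
            out = out[:pos] + infix + out[pos:]
        return prefix + out + suffix

    ops = [name for name, a in (("PREFIX", prefix), ("SUFFIX", suffix), ("INFIX", infix)) if a]
    lang_tag = "LANG_" + ("_".join(ops) if ops else "NONE")
    source = f"{lang_tag} FAM_SYNTHETIC {lemma}:{exemplar}::{inflect(exemplar)}"
    return source, inflect(lemma)

random.seed(0)
src, tgt = synthetic_instance()
print(src)
print(tgt)
```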
6 One-shot results

This section shows the results of one-shot learning. These experiments apply the multilingual and family models from the development languages to the surprise languages, without fine-tuning. For languages within development families, they use the appropriate family model; otherwise they use the multilingual model. Thus, the model's only access to information about the target language is via the provided exemplar.

Each experiment evaluates the results across five random exemplars per test instance (with replacement), but averages the results rather than applying majority selection. This computes the expected performance in the one-shot setting where only a single exemplar is available.

Results are shown in Table 2. One-shot learning is not competitive with the baseline fine-tuned system in any language family, but has some capacity to predict inflections in all families. Performance is generally better in families for which related languages were present in development.

Family             Random     Similarity   Base
Germanic (3)       29 (13)    38 (22)      80 (13)
Niger-Congo (1)    75 (0)     88 (0)       100 (0)
Uralic (5)         21 (9)     28 (12)      76 (26)

Afro-Asiatic (3)   7 (3)      26 (18)      96 (3)
Algic (1)          2 (0)      14 (0)       68 (0)
Dravidian (2)      7 (7)      13 (3)       85 (9)
Indic (4)          4 (5)      4 (2)        98 (3)
Iranian (3)        35 (39)    34 (32)      82 (19)
Romance (8)        6 (4)      53 (19)      99 (1)
Sino-Tibetan (1)   21 (0)     9 (0)        84 (0)
Siouan (1)         13 (0)     13 (0)       96 (0)
Songhay (1)        21 (0)     82 (0)       88 (0)
Southern Daly      4 (0)      6 (0)        90 (0)
Tungusic (1)       28 (0)     27 (0)       57 (0)
Turkic (9)         7 (8)      19 (11)      96 (7)
Uto-Aztecan (1)    33 (0)     30 (0)       81 (0)
Overall            14 (18)    30 (25)      90 (15)

Table 2: One-shot accuracy scores for models trained with random and similarity-based selection, compared to the baseline. Number of languages in family and score standard deviation across languages in parentheses. Families represented in development above the line, surprise families below.

The system trained with random exemplars achieves its best results on Tajik (Iranian: tgk, score 89%), Shona (Niger-Congo: sna, score 75%)9, and Norwegian Nynorsk (Germanic: nno, score 42%). The system trained with similar exemplars achieves its best results on Shona (88%), Zarma (Songhay: dje, score 82%) and Tajik (79%). Notably, some of these high scores are achieved on languages that were difficult for the baseline systems; the score for Tajik beats the transformer baseline (56%), perhaps due to data sparsity, since baselines regularized using data hallucination perform better (93%).

Training with similar exemplars leads to clearly better results than random exemplars, a reversal of the trend observed with fine-tuning. This difference is particularly marked in Romance (53% average vs 5%). While the random-exemplar system is better at guessing what to do when the exemplar and target forms are divergent, this causes errors with unfamiliar languages. The system attempts to guess the correct inflection, rather than simply copying.

As an example, Table 3 shows an analysis of performance in Catalan (cat), selected because its results are fairly typical of the Romance family; the similar-exemplar system scores 53% while the random-exemplar system scores 12%. The table shows selected instances with different levels of exemplar match and mismatch. The first two, arrissar "curl" and disputar "discuss", match their exemplars well and are good cases for copying. The random-exemplar model gets these both wrong, segmenting incorrectly in the first and adding a spurious character in the second. The next two, repetir "repeat" and engolir "ingest", are mismatched with exemplars from a different inflection class; both systems make incorrect predictions, but the similar-exemplar system preserves the suffixes while the random-exemplar system does not. Finally, in the last example llevar-se "get up", the similar-exemplar model misinterprets the reflexive suffix -se as part of the verb stem, while the random-exemplar model fails to make any edit.

A more systematic analysis computes an alignment-based edit rule for each system prediction (King et al., 2020) and counts the unique rules used to form one-shot predictions in the Catalan development set. Over 37105 instances, the random-exemplar model applies 626 unique edit rules, 20 of which appear in correct predictions. The similar-exemplar model applies 3137 unique rules, 154 of them correctly. The greater variety of both correct and incorrect outputs from the similar-exemplar model demonstrates its preference for faithfulness to the exemplar rather than remodeling the output to fit language-specific constraints.

9 As stated above, the Niger-Congo datasets are artificialized and probably do not represent the real difficulty of the inflection task.
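The rule-census analysis described above can be sketched as follows; the (lemma, prediction, gold) triple format and the toy alignment-based edit rule are illustrative stand-ins for the King et al. (2020) procedure actually used.

```python
def edit_rule(lemma, form):
    # Toy longest-common-prefix/suffix rule, as in the sketch after Section 3.
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    j = 0
    while j < min(len(lemma), len(form)) - i and lemma[-1 - j] == form[-1 - j]:
        j += 1
    return ("-" + lemma[i:len(lemma) - j], "+" + form[i:len(form) - j])

def rule_census(predictions):
    """Count the unique edit rules used in predictions, and how many of those
    rules ever appear in a correct prediction."""
    used, correct = set(), set()
    for lemma, pred, gold in predictions:
        rule = edit_rule(lemma, pred)
        used.add(rule)
        if pred == gold:
            correct.add(rule)
    return len(used), len(correct)

toy = [("arrissar", "arrissarien", "arrissarien"),
       ("repetir", "repetio", "repeteixo"),
       ("llevar-se", "llevor-se", "llevo")]
print(rule_census(toy))   # (3, 1): three distinct rules, one used correctly
```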
Lemma       Exemplar                  Rand. Sel.    Sim. Sel.     Target
arrissar    posar : posarien          arrissaren    arrissarien   arrissarien
disputar    descriure : descriuria    disputarta    disputaria    disputaria
repetir     cremar : cremo            repetirer     repetio       repeteixo
engolir     forjar : forjava          engolire      engoliva      engolia
llevar-se   terminar : termino        llevar-se     llevor-se     llevo

Table 3: Development data from Catalan (Romance: cat) showing the outputs of two one-shot systems.
7 Synthetic transfer experiments

When transfer learning fails, it can be difficult to tell whether the system has failed to represent a general morphological process, or whether it misapplies what it has learned due to mismatched lexical/phonological triggers. Experiments on artificial data can probe what abstract processes the model has learned to apply, the links between these processes and language families, and the environments in which they can operate.

A probing dataset is synthesized to model several morphological operations (Figure 2), including prefix/suffix affixation, reduplication and gemination. Affixation is typologically widespread (Bickel and Nichols, 2013) and appears in every development language on which the model was trained. Suffixation is more common in Germanic and Uralic; Oto-Manguean tonal morphology is also often represented via word-final diacritics.10 Prefixing is more common in the Niger-Congo family.

Reduplication appears in three of the four Austronesian development languages, Tagalog, Hiligaynon and Cebuano (WAL, 2013), but not in the Maori dataset provided. The probe language has partial reduplication of the first syllable, as found in Tagalog and Hiligaynon. Previous work with artificial data demonstrates that sequence-to-sequence learners can learn fully abstract representations of reduplication (Prickett et al., 2018; Nelson et al., 2020; Haley and Wilson, 2021), but it has not been previously shown that networks trained on real data do this in a transferable way. In one-shot language transfer, reduplication instances are actually ambiguous. Given an instance modi:gobu::gogobu, there are two plausible interpretations, reduplicative momodi and affixal gomodi. Thus, analysis of reduplicative instances can be informative about the model's learned linkage between language family and typology.

Gemination is an inflectional process whereby a segment is lengthened to mark some morphological feature (Samek-Lodovici, 1992). The probe language geminates the last non-final consonant. None of the development languages have morphological gemination.

The probe languages use two alphabets: the first is a common subset of characters which appear in at least half the languages of every development family.11 The second is a subset of Cyrillic characters intended to test transfer to a less-familiar orthography; a few Uralic development languages are written in Cyrillic. Each language has 90 random lemmas, sampled with the frames CVCV, CVCVC, CVCVCVC; affixal languages have 30 affixes of types VCV, CV, CVCV, plus 7 single-letter affixes. No probe lemma coincides with any real lemma, and no probe affix has frequency > 5% as a string prefix or suffix in any real language. Affixal languages contain an instance for every lemma/affix pair. Reduplication and gemination languages have one instance per lemma.

The model is prompted to inflect the probes as if they are members of each language family, and as members of a comparatively well-resourced language selected from those families, specifically Tagalog (tgl), German (deu), Mezquital Otomi (ote), Swahili (swa) and Finnish (fin), as well as the synthetic suffixing language used in pretraining (suff). In addition to checking whether the output matches, the table shows whether reduplicated instances have been correctly reduplicated (using a regular expression).

Table 4 shows the results. A comparison between the random-exemplar and similar-exemplar models confirms the hypothesis from above that random-exemplar models have less generalizable representations of morphological processes, especially prefixation and suffixation. While both models are capable of attaching affixes in the synthetic language, the random-exemplar model learns very language- and suffix-specific rules for applying these operations, leading to very low accuracy for copying generic affixes. Both models show less language-specific remodeling of affixes in the family-only setting than when the probes are labeled as part of a particular language; this effect is again more pronounced for the random-exemplar model.

10 No Unicode normalization was performed; Oto-Manguean tone diacritics are treated as characters (as are parts of the complex characters of the Indic scripts). The placement of these diacritics within the word varies from language to language.
11 Consonants mpbntdrlskgh, vowels aeiou.
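A sketch of how probe items of the four types in Figure 2 could be built, and how reduplication might be checked with a regular expression. The segment inventories follow footnote 11; everything else (function names, the exact regex, the seeding) is an assumption, not the paper's generation code.

```python
import random
import re

CONSONANTS = list("mpbntdrlskgh")   # common-alphabet consonants (footnote 11)
VOWELS = list("aeiou")

def make_lemma(frame, rng):
    """Fill a CV frame such as 'CVCVC' with random segments."""
    return "".join(rng.choice(CONSONANTS if c == "C" else VOWELS) for c in frame)

def apply_probe(lemma, probe, affix=""):
    """Apply one of the probe operations illustrated in Figure 2."""
    if probe == "prefix":
        return affix + lemma
    if probe == "suffix":
        return lemma + affix
    if probe == "redup":
        # partial reduplication of the first CV syllable, e.g. semet -> sesemet
        return lemma[:2] + lemma
    if probe == "gem":
        # geminate the last non-final consonant, e.g. semet -> semmet
        for i in range(len(lemma) - 2, -1, -1):
            if lemma[i] in CONSONANTS:
                return lemma[:i + 1] + lemma[i] + lemma[i + 1:]
    return lemma

def correctly_reduplicated(output):
    """Check that the output begins with a CV syllable that is immediately
    repeated (a guess at what the paper's regular expression verifies)."""
    c, v = "".join(CONSONANTS), "".join(VOWELS)
    return re.match(rf"([{c}][{v}])\1", output) is not None

rng = random.Random(1)
lemma = make_lemma("CVCVC", rng)
print(lemma, apply_probe(lemma, "redup"), apply_probe(lemma, "gem"))
print(correctly_reduplicated("sesemet"), correctly_reduplicated("gomodi"))  # True False
```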
Lemma: semet

Probe type      Exemplar          Target
Prefixing       kigu : igokigu    igosemet
Suffixing       kigu : kiguigo    semetigo
Reduplication   modi : momodi     sesemet
Gemination      bogu : boggu      semmet

Figure 2: Probe tasks illustrated for a single lemma.

Both models learn to reduplicate arbitrary CV syllables, but this process is mostly restricted to Tagalog,12 with some generalization to Austronesian. Most other languages interpret reduplication instances as affixes.

Only the similar-exemplar model gets any gemination instances correct, and these primarily in Uralic.13 This is unsurprising, since the model was never trained with morphological gemination. It demonstrates that the model's representations of morphological processes represent the input typology and are not simply artifacts of the transformer architecture. While Uralic does not have gemination as an independent morphological process, alternations involving geminates do occur in some paradigms; the NOM.PL of tikka "dart" is tikat.14 The model seems to have learned a little about gemination from this morphophonological process, but not a fully generalized representation.

Affixation remains relatively successful when using Cyrillic characters (suffixes more than prefixes), but for the most part, less so than with Latin characters, although in the random-exemplar model, Cyrillic suffixes are somewhat more accurate, probably due to less interference from language-specific knowledge. This substantiates the general finding (Murikinati et al., 2020) that transfer across scripts is more difficult than within-script. Cyrillic reduplication sees a much larger drop in accuracy. The difference is probably that simple affixation is phonologically uncomplicated, while reduplication requires phonological information about vowels and consonants.

Why is Hungarian so successful as a source language for unrelated targets? Kann (2020) suggests that it is its agglutinative nature. The results shown here offer some speculative support for this view—perhaps the relative segmentability of prototypically agglutinative languages (Plank, 1999) acts like the similar-exemplar setting in the memory-based model, giving the source model a general bias for concatenative affixation, unpolluted by too many lexical and phonological alternations. As reported here, such a model is a promising starting point for inflection in many non-agglutinative systems, such as Romance verbs, which nevertheless are strongly concatenative.

Where transfer between related languages fails, it is conjecturally possible that the source model representations of edit operations are too closely linked to particular phonological and lexical properties of the source. This is clearly shown in the synthetic transfer experiments, where generic suffixation fails in Germanic and Uralic despite these families being strongly suffixing, because the system has learned to remodel its outputs to conform too closely to source-language templates.

More broadly, the synthetic experiments show a link between language typology and learning of morphological processes, suggesting that language structure, not only language relatedness, is key to successful transfer—transfer of structural principles can lead to improvements even without cognate words or affixes. For instance, successful reduplication appears only in Austronesian and successful gemination only in Uralic. A promising direction for future work would be to replace the language family feature with a set of typological feature indicators such as WALS properties (WAL, 2013), which might help the model to learn faster in low-resource target languages.
Model Fam/Lg. Pref Pref (Cyrl) Suff Suff (Cyrl) Redup. Redup. (Cyrl) Gem.
austro 62 36 26 38 0 (10) 0 0
austro/tgl 0 1 0 0 28 (90) 3 (7) 0
ger 1 0 25 36 0 (3) 0 0
ger/deu 0 0 8 10 0 (3) 0 0
n-congo 92 55 40 41 0 (3) 0 0
n-congo/swa 100 76 36 25 0 (3) 0 0
Rand.
oto 20 15 21 33 0 (3) 0 0
oto/ote 35 30 1 9 0 (3) 0 0
uralic 3 0 23 34 0 (3) 0 0
uralic/fin 0 0 7 22 0 (3) 0 0
synth 84 62 97 91 0 (3) 0 0
synth/suff 28 1 100 97 0 (3) 0 0
austro 86 75 94 85 30 (30) 0 0
austro/tgl 30 35 75 63 88 (88) 8 (8) 0
ger 85 55 99 96 3 (3) 0 8
ger/deu 86 55 99 98 0 0 5
n-congo 99 96 98 93 0 (3) 0 3
n-congo/swa 99 98 88 57 0 0 0
Sim.
oto 88 76 95 87 18 (18) 0 0
oto/ote 96 84 59 17 5 (5) 0 0
uralic 59 10 97 95 0 0 17
uralic/fin 52 4 98 98 0 0 12
synth 94 84 99 95 8 (10) 0 2
synth/suff 86 42 100 99 0 0 2
Table 4: Accuracy of synthetic probe tasks presented as different language and language family. (Cyrl) indicates
Cyrillic alphabet. Parentheses in reduplication columns show frequency of correct CV reduplication.
pervised setting by train-test mismatch. Selecting training exemplars using a classifier which could also be used at inference time would reduce this mismatch. These experiments are left for future work.

Finally, since the memory-based architecture is cognitively inspired, it might be adapted as a cognitive model of language learning in contact situations. Work on this learning process suggests that speakers find it much easier to learn new exponents than to learn new morphological processes (Dorian, 1978; Mithun, 2020). Thus, the impact of source-language transfer may indeed be most significant in cases where the L1 and L2 (source and target) languages differ in the abstract mechanisms of inflection rather than the specifics. Historical contact-induced change provides evidence for this viewpoint in the form of systems which have changed to employ the same processes as a contact language. For example, Cappadocian Greek has become agglutinative through its extensive contact with Turkish (Janse, 2004). For other examples, see Green (1995); Thomason (2001).

9 Conclusion

The results of this paper demonstrate that the proposed cognitive mechanism of memory-based analogy provides a relatively strong basis for inflection prediction. Performance in a supervised setting is strongest in languages without large numbers of inflection classes, and requires training exemplars to be selected in the same way as test exemplars. Memory-based analogy also provides a foundation for one-shot transfer; in this case, training exemplars should closely match the elicited inflections, so that the model learns to copy rather than reconstruct the output form. One-shot transfer using this mechanism can achieve higher accuracy than previously thought, even when no genetically related languages are available in training. Scores vary widely, but can be over 80% for some languages.

Finally, this paper provides new evidence about what kinds of abstract information (beyond character correspondences) is transferred between languages when learning to inflect. The model learns general processes for prefixation and suffixation which apply (to some extent) across character sets, but its application of these can be disrupted by language-specific morpho-phonological rules. It also learns to reduplicate arbitrary CV sequences, but applies this process only when targeting a language with reduplication. Learning of morphological processes in general appears to be driven by the input typology. The discussion argues that the usefulness of general representations for prefixation and suffixation accounts for the puzzling effectiveness of agglutinative languages as transfer sources reported in previous research.
Acknowledgments Brian Butterworth. 1983. Lexical representation. In
Brian Butterworth, editor, Language production, vol.
This research is deeply indebted to ideas con- 2: Development, writing and other language pro-
tributed by Andrea Sims. I am also grateful to cesses, pages 257–294. Academic Press.
members of LING 5802 in autumn 2020 at Ohio Joan Bybee. 2006. From usage to grammar: The
State, and to the three anonymous reviewers for mind’s response to repetition. Language, 82(4):711–
their comments and suggestions. Parts of this work 733.
were run on the Ohio Supercomputer (OSC, 1987).
Joan Bybee and Carol Moder. 1983. Morphological
classes as natural categories. Language, 59(2):251–
270.
References
Nancy C. Dorian. 1978. The fate of morphological
2013. World atlas of language structures online. Avail- complexity in language death: Evidence from East
able online at https://wals.info/, accessed 3 Sutherland Gaelic. Language, 54(3):590–609.
June 2020.
Greg Durrett and John DeNero. 2013. Supervised
Adam Albright and Bruce Hayes. 2002. Modeling learning of complete morphological paradigms. In
English past tense intuitions with minimal gener- Proceedings of the 2013 Conference of the North
alization. In Proceedings of the Sixth Meeting of American Chapter of the Association for Computa-
the Association for Computational Linguistics Spe- tional Linguistics: Human Language Technologies,
cial Interest Group in Computational Phonology in pages 1185–1195, Atlanta, Georgia. Association for
Philadelphia, July 2002, pages 58–69. Computational Linguistics.
Maria Alegre and Peter Gordon. 1998. Frequency ef- Alexander Erdmann, Tom Kenter, Markus Becker, and
fects and the representational status of regular inflec- Christian Schallhart. 2020. Frugal paradigm com-
tions. Journal of Memory and Language, 40:41–61. pletion. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
Antonios Anastasopoulos and Graham Neubig. 2019.
pages 8248–8273, Online. Association for Computa-
Pushing the limits of low-resource morphological in-
tional Linguistics.
flection. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing Alex Graves, Greg Wayne, Malcolm Reynolds,
and the 9th International Joint Conference on Natu- Tim Harley, Ivo Danihelka, Agnieszka Grabska-
ral Language Processing (EMNLP-IJCNLP), pages Barwińska, Sergio Gómez Colmenarejo, Edward
984–996, Hong Kong, China. Association for Com- Grefenstette, Tiago Ramalho, John Agapiou, et al.
putational Linguistics. 2016. Hybrid computing using a neural net-
work with dynamic external memory. Nature,
Matthew Baerman. 2014. Covert systematicity in a dis- 538(7626):471–476.
tributionally complex system. Journal of Linguis-
tics, pages 1–47. Ian Green. 1995. The death of ‘prefixing’: contact in-
duced typological change in northern australia. In
Balthasar Bickel and Johanna Nichols. 2013. Fusion Annual Meeting of the Berkeley Linguistics Society,
of selected inflectional formatives. In Matthew S. volume 21, pages 414–425.
Dryer and Martin Haspelmath, editors, The World
Atlas of Language Structures Online. Max Planck In- Coleman Haley and Colin Wilson. 2021. Deep neural
stitute for Evolutionary Anthropology, Leipzig. networks easily learn unnatural infixation and redu-
plication patterns. Proceedings of the Society for
James P. Blevins, Petar Milin, and Michael Ramscar. Computation in Linguistics, 4(1):427–433.
2017. The Zipfian paradigm cell filling problem.
In Ferenc Kiefer, James P. Blevins, and Huba Bar- Hans Henrich Hock and Brian D. Joseph. 1996. Lan-
tos, editors, Perspectives on morphological organi- guage history, language change and language rela-
zation: Data and analyses, pages 141–158. Brill. tionship: An introduction to historical and compar-
ative linguistics. Mouton de Gruyter.
Antal van den Bosch and Walter Daelemans. 1999.
Memory-based morphological analysis. In Proceed- Ray Jackendoff and Jenny Audring. 2020. The texture
ings of the 37th Annual Meeting of the Association of the lexicon: Relational Morphology and the Par-
for Computational Linguistics, pages 285–292, Col- allel Architecture. Oxford University Press.
lege Park, Maryland, USA. Association for Compu-
tational Linguistics. Mark Janse. 2004. Animacy, definiteness, and case in
Cappadocian and other Asia Minor Greek dialects.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Journal of Greek linguistics, 5(1):3–26.
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Huiming Jin and Katharina Kann. 2017. Exploring
Askell, et al. 2020. Language models are few-shot cross-lingual transfer of morphological knowledge
learners. arXiv preprint arXiv:2005.14165. in sequence-to-sequence models. In Proceedings of
the First Workshop on Subword and Character Level Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li,
Models in NLP, pages 70–75, Copenhagen, Den- Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani,
mark. Association for Computational Linguistics. Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios
Anastasopoulos, Patrick Littell, and Graham Neubig.
Katharina Kann. 2020. Acquisition of inflectional 2019. Choosing transfer languages for cross-lingual
morphology in artificial neural networks with prior learning. In Proceedings of the 57th Annual Meet-
knowledge. In Proceedings of the Society for Com- ing of the Association for Computational Linguis-
putation in Linguistics 2020, pages 144–154, New tics, pages 3125–3135, Florence, Italy. Association
York, New York. Association for Computational Lin- for Computational Linguistics.
guistics.
Ling Liu and Mans Hulden. 2020. Analogy models for
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. neural word inflection. In Proceedings of the 28th
2017a. Neural multi-source morphological reinflec- International Conference on Computational Linguis-
tion. In Proceedings of the 15th Conference of the tics, pages 2861–2878, Barcelona, Spain (Online).
European Chapter of the Association for Computa- International Committee on Computational Linguis-
tional Linguistics: Volume 1, Long Papers, pages tics.
514–524, Valencia, Spain. Association for Compu-
tational Linguistics. Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu,
Chaitanya Malaviya, Lawrence Wolf-Sonkin, Gar-
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. rett Nicolai, Christo Kirov, Miikka Silfverberg, Sab-
2017b. One-shot neural cross-lingual transfer for rina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and
paradigm completion. In Proceedings of the 55th Mans Hulden. 2019. The SIGMORPHON 2019
Annual Meeting of the Association for Computa- shared task: Morphological analysis in context and
tional Linguistics (Volume 1: Long Papers), pages cross-lingual transfer for inflection. In Proceedings
1993–2003, Vancouver, Canada. Association for of the 16th Workshop on Computational Research in
Computational Linguistics. Phonetics, Phonology, and Morphology, pages 229–
244, Florence, Italy. Association for Computational
Katharina Kann and Hinrich Schütze. 2016. MED: The Linguistics.
LMU system for the SIGMORPHON 2016 shared
task on morphological reinflection. In Proceedings Petar Milin, Laurie Beth Feldman, Michael Ramscar,
of the 14th SIGMORPHON Workshop on Computa- Roberta A. Hendrick, and R. Harald Baayen. 2017.
tional Research in Phonetics, Phonology, and Mor- Discrimination in lexical decision. PLoS ONE,
phology, pages 62–70, Berlin, Germany. Associa- 12(2):e0171935.
tion for Computational Linguistics.
Marianne Mithun. 2020. Where is morphological com-
Katharina Kann and Hinrich Schütze. 2017. Unlabeled plexity? In Peter Arkadiev and Francesco Gardani,
data for morphological generation with character- editors, The complexities of morphology, pages 306–
based sequence-to-sequence models. In Proceed- 327. Oxford University Press.
ings of the First Workshop on Subword and Charac-
ter Level Models in NLP, pages 76–81, Copenhagen, Nikitha Murikinati, Antonios Anastasopoulos, and Gra-
Denmark. Association for Computational Linguis- ham Neubig. 2020. Transliteration for cross-lingual
tics. morphological inflection. In Proceedings of the
17th SIGMORPHON Workshop on Computational
David King, Andrea Sims, and Micha Elsner. 2020. In- Research in Phonetics, Phonology, and Morphology,
terpreting sequence-to-sequence models for Russian pages 189–197, Online. Association for Computa-
inflectional morphology. In Proceedings of the Soci- tional Linguistics.
ety for Computation in Linguistics 2020, pages 481–
490, New York, New York. Association for Compu- Max Nelson, Hossep Dolatian, Jonathan Rawski, and
tational Linguistics. Brandon Prickett. 2020. Probing RNN encoder-
decoder generalization of subregular functions us-
Kimmo Koskenniemi and Kenneth Ward Church. 1988. ing reduplication. In Proceedings of the Society
Complexity, two-level morphology and Finnish. In for Computation in Linguistics 2020, pages 167–
Coling Budapest 1988 Volume 1: International Con- 178, New York, New York. Association for Compu-
ference on Computational Linguistics. tational Linguistics.
Andrea Krott, R Harald Baayen, and Robert Schreuder. Garrett Nicolai, Bradley Hauer, Adam St Arnaud, and
2001. Analogy in morphology: modeling the choice Grzegorz Kondrak. 2016. Morphological reinflec-
of linking morphemes in dutch. tion via discriminative string transduction. In Pro-
ceedings of the 14th SIGMORPHON Workshop on
Constantine Lignos and Charles Yang. 2018. Morphol- Computational Research in Phonetics, Phonology,
ogy and language acquisition. In Andrew Hippis- and Morphology, pages 31–35, Berlin, Germany. As-
ley and Gregory T. Stump, editors, Cambridge hand- sociation for Computational Linguistics.
book of morphology, pages 765–791. Cambridge
University Press. OSC. 1987. Ohio supercomputer center.
Frans Plank. 1999. Split morphology: How agglutina-
tion and flexion mix. Linguistic Typology, 3:279–
340.
Brandon Prickett, Aaron Traylor, and Joe Pater. 2018.
Seq2Seq models with dropout can learn generaliz-
able reduplication. In Proceedings of the Fifteenth
Workshop on Computational Research in Phonetics,
Phonology, and Morphology, pages 93–100, Brus-
sels, Belgium. Association for Computational Lin-
guistics.
Vieri Samek-Lodovici. 1992. A unified analysis of
crosslinguistic morphological gemination. In Pro-
ceedings of CONSOLE, volume 1, pages 265–283.
Citeseer.
Miikka Silfverberg, Francis Tyers, Garrett Nicolai, and
Mans Hulden. 2021. Do RNN states encode abstract
phonological alternations? In Proceedings of the
2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, pages 5501–5513. As-
sociation for Computational Linguistics.
Miikka Silfverberg, Adam Wiemerslage, Ling Liu, and
Lingshuang Jack Mao. 2017. Data augmentation for
morphological reinflection. In Proceedings of the
CoNLL SIGMORPHON 2017 Shared Task: Univer-
sal Morphological Reinflection, pages 90–99, Van-
couver. Association for Computational Linguistics.
Royal Skousen. 1989. Analogical modeling of lan-
guage. Springer Science & Business Media.
Sarah Grey Thomason. 2001. Contact-induced typo-
logical change. In Language typology and language
universals: An international handbook, volume 2,
pages 1640–1648.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In NIPS.
Ekaterina Vylomova, Jennifer White, Eliza-
beth Salesky, Sabrina J. Mielke, Shijie Wu,
Edoardo Maria Ponti, Rowan Hall Maudslay, Ran
Zmigrod, Josef Valvoda, Svetlana Toldova, Francis
Tyers, Elena Klyachko, Ilya Yegorov, Natalia
Krizhanovsky, Paula Czarnowska, Irene Nikkarinen,
Andrew Krizhanovsky, Tiago Pimentel, Lucas
Torroba Hennigen, Christo Kirov, Garrett Nicolai,
Adina Williams, Antonios Anastasopoulos, Hilaria
Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka
Silfverberg, and Mans Hulden. 2020. SIGMOR-
PHON 2020 shared task 0: Typologically diverse
morphological inflection. In Proceedings of the
17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 1–39, Online. Association for Computational
Linguistics.
Shijie Wu, Ryan Cotterell, and Mans Hulden. 2020.
Applying the transformer to character-level transduc-
tion. arXiv preprint arXiv:2005.10213.
A Full results

For replicability, this appendix provides full results for all languages, as 0-1 accuracy on the official test datasets. The reported baseline is TRM-SINGLE, copied from https://docs.google.com/spreadsheets/d/1ODFRnHuwN-mvGtzXA1sNdCi-jNqZjiE-i9jRxZCK0kg. Scores for supervised systems on the development languages are shown in Table 5 and scores for one-shot systems on surprise languages are shown in Table 6. See Vylomova et al. (2020) for language abbreviation definitions.

Lang Fam Rand Sim Base
ang Indo-Eur: Germanic 72 19 78
azg Oto-Manguean 94 22 95
ceb Austronesian 79 69 84
cly Oto-Manguean 82 19 91
cpa Oto-Manguean 74 33 91
ctp Oto-Manguean 43 15 60
czn Oto-Manguean 83 32 80
dan Indo-Eur: Germanic 75 42 75
deu Indo-Eur: Germanic 93 62 98
eng Indo-Eur: Germanic 97 67 97
est Uralic 94 47 95
fin Uralic 100 39 100
frr Indo-Eur: Germanic 81 39 87
gaa Niger-Congo 100 100 98
gmh Indo-Eur: Germanic 94 75 91
hil Austronesian 97 74 98
isl Indo-Eur: Germanic 88 37 97
izh Uralic 85 33 87
kon Niger-Congo 99 99 98
krl Uralic 99 36 99
lin Niger-Congo 100 100 100
liv Uralic 93 54 96
lug Niger-Congo 90 74 91
mao Austronesian 71 57 52
mdf Uralic 92 67 94
mhr Uralic 91 67 93
mlg Austronesian 100 100 100
myv Uralic 93 61 94
nld Indo-Eur: Germanic 99 61 99
nob Indo-Eur: Germanic 75 47 76
nya Niger-Congo 100 100 100
ote Oto-Manguean 99 80 99
otm Oto-Manguean 98 46 98
pei Oto-Manguean 65 17 72
sme Uralic 99 31 100
sot Niger-Congo 100 100 98
swa Niger-Congo 100 100 100
swe Indo-Eur: Germanic 97 59 99
tgl Austronesian 69 35 72
vep Uralic 83 28 84
vot Uralic 81 41 86
xty Oto-Manguean 90 79 91
zpv Oto-Manguean 87 46 85
zul Niger-Congo 92 83 92
Overall 89 57 90
Stdev 12 26 11
Lang Fam Rand Sim Base
ast Indo-Eur: Romance 2 64 100
aze Turkic 9 17 81
bak Turkic 15 14 100
ben Indo-Aryan 1 4 99
bod Sino-Tibetan 21 9 84
cat Indo-Eur: Romance 12 53 100
cre Algic 2 14 68
crh Turkic 24 45 99
dak Siouan 13 13 96
dje Nilo-Saharan 21 82 88
evn Tungusic 28 27 57
fas Indo-Eur: Iranian 2 13 100
frm Indo-Eur: Romance 7 73 100
fur Indo-Eur: Romance 11 19 100
glg Indo-Eur: Romance 9 59 100
gml Indo-Eur: Germanic 11 11 62
gsw Indo-Eur: Germanic 33 64 93
hin Indo-Aryan 0 1 100
kan Dravidian 13 16 76
kaz Turkic 0 7 98
kir Turkic 2 6 98
kjh Turkic 11 11 100
kpv Uralic 17 47 97
lld Indo-Eur: Romance 3 68 99
lud Uralic 22 14 32
mlt Afro-Asiatic 10 13 97
mwf Australian 4 6 90
nno Indo-Eur: Germanic 42 40 86
olo Uralic 37 33 94
ood Uto-Aztecan 33 30 81
orm Afro-Asiatic 2 52 99
pus Indo-Eur: Iranian 13 9 90
san Indo-Aryan 13 5 93
sna Niger-Congo 75 88 100
syc Afro-Asiatic 8 13 91
tel Dravidian 0 10 95
tgk Indo-Eur: Iranian 89 79 56
tuk Turkic 0 21 86
udm Uralic 11 30 98
uig Turkic 0 26 99
urd Indo-Aryan 2 7 99
uzb Turkic 0 21 100
vec Indo-Eur: Romance 2 62 100
vro Uralic 17 17 61
xno Indo-Eur: Romance 2 22 96
Overall 14 30 90
Stdev 18 25 15
Simple induction of (deterministic) probabilistic finite-state automata for
phonotactics by stochastic gradient descent
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 227–236
August 5, 2021. ©2021 Association for Computational Linguistics
shape |Q| × |Σ|, gives the probability of emitting a symbol x given a state. Each row in the matrix represents a state, and each column represents an output symbol. Given a distribution on states represented as a stochastic vector q, the probability mass function over symbols is:

    p(· | q) = qᵀE.    (1)

Each symbol x ∈ Σ is associated with a right-stochastic transition matrix T_x of shape |Q| × |Q|, so that the probability distribution on following states given that the symbol x was emitted from the distribution on states q is

    p(· | q, x) = qᵀT_x.    (2)

Generation of a particular sequence x ∈ Σ∗ works by starting in a distinguished initial state q_0, generating a symbol x, transitioning into the next state q′, and so on recursively until reaching a distinguished final state q_f. Given a PFA parameterized by matrices E and T, the probability of a sequence x_{t=1}^N marginalizing over all trajectories through states can be calculated according to the Forward algorithm (Baum et al., 1970; Vidal et al., 2005a, §3) as follows:

    p(x_{t=1}^N | E, T) = f(x_{t=1}^N | δ_{q_0}),

where δ_q is a one-hot coordinate vector on state q and

    f(∅ | q) = δ_{q_f}ᵀ q
    f(x_{t=1}^n | q) = p(x_1 | q) · f(x_{t=2}^n | qᵀT_{x_1}).

The important aspect of this formulation is that the probability of a sequence is a differentiable function of the matrices E and T that define the PFA. Because the probability function is differentiable, we can induce a PFA from a set of training sequences by using gradient descent to search for matrices that maximize the probability of the training sequences.

2.2 Learning by gradient descent

We describe a simple and highly general method for inducing a PFA from data by stochastic gradient descent. Although more specialized learning algorithms and heuristics exist for special cases (see for example Vidal et al., 2005b, §3), ours has the advantage of generality. Our goal is to see how effective this simple approach can be in practice.

Given a data distribution X with support over Σ∗, we wish to learn a PFA by finding parameter matrices E and T to minimize an objective function of the form

    J(E, T) = ⟨− log p(x | E, T)⟩_{x∼X} + C(E, T),    (3)

where ⟨·⟩_{x∼X} indicates an average over values x drawn from the data distribution X, and − log p(x | E, T) is the negative log likelihood (NLL) of a sample x under the model; the average negative log likelihood is equivalent to the cross entropy of the data distribution X and the model. By minimizing cross-entropy, we maximize likelihood and thus fit to the data. The term C(E, T) represents additional complexity constraints on the E and T matrices, discussed in Section 2.4. When C is interpreted as a log prior probability on automata, then minimizing Eq. 3 is equivalent to fitting the model by maximum a posteriori.

Given the formulation in Eq. 3, because the objective function is differentiable, we can search for the optimal matrices E and T by performing (stochastic) descent on the gradients of the objective. That is, for a parameter matrix X, we can search for a minimum by performing updates of the form

    X′ = X − η∇J(X),    (4)

where the scalar η is the learning rate. In stochastic gradient descent, each update is performed using a random finite sample from the data distribution, called a minibatch, to approximate the average over the data distribution in Eq. 3.

However, we cannot apply these updates directly to the matrices E and T because they must be right-stochastic, meaning that the entries in each row must be positive and sum to 1. There is no guarantee that the output of Eq. 4 would satisfy these constraints. This issue was addressed by Dai (2021) by clipping the values of the matrix E to be between 0 and 1. A more general solution is that, instead of doing optimization on the E and T matrices directly, we instead do optimization over underlying real-valued matrices Ẽ and T̃ such that

    E_ij = exp Ẽ_ij / Σ_k exp Ẽ_ik,    T_ij = exp T̃_ij / Σ_k exp T̃_ik,

in other words we derive the matrices E and T by applying the softmax function to underlying matrices Ẽ and T̃, whose entries are called logits.
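To make the parameterization concrete, the following minimal sketch (our illustration, not the authors' released PFA-learner code) assumes PyTorch, symbols encoded as integers with symbol 0 standing in for the boundary #, and every word ending in that boundary symbol; it scores a sequence with the Forward recursion above and applies the update of Eq. 4 to the logit matrices:

    import torch

    Q, V = 4, 3   # illustrative sizes: |Q| states, |Sigma| symbols (symbol 0 = boundary '#')
    E_logits = torch.randn(Q, V, requires_grad=True)      # emission logits; E = row-softmax
    T_logits = torch.randn(V, Q, Q, requires_grad=True)   # one transition matrix T_x per symbol
    opt = torch.optim.SGD([E_logits, T_logits], lr=0.001)

    def neg_log_prob(x):
        """NLL of a symbol sequence x (a list of ints ending in the boundary symbol 0)."""
        E = torch.softmax(E_logits, dim=1)   # right-stochastic emission matrix
        T = torch.softmax(T_logits, dim=2)   # right-stochastic transition matrices
        q = torch.zeros(Q)
        q[0] = 1.0                           # start in the distinguished initial state q0
        logp = 0.0
        for sym in x:
            logp = logp + torch.log((q @ E[:, sym]).clamp_min(1e-12))  # p(x_t | q) = q^T E   (Eq. 1)
            q = q @ T[sym]                                             # next state dist q^T T_x (Eq. 2)
        return -logp

    def sgd_step(minibatch):
        """One update of Eq. 4 on a minibatch of sequences."""
        loss = torch.stack([neg_log_prob(x) for x in minibatch]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

The sketch uses plain SGD only to mirror Eq. 4 literally; the experiments reported below use the Adam optimizer, and termination is handled by the boundary-symbol convention of Section 2.3 rather than an explicit final-state term.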
Gradient descent is then done on the objective as a function of the logit matrices Ẽ and T̃. This approach to parameterizing probability distributions is standard in machine learning. Applied to induce a PFA with states Q and symbol inventory Σ, our formulation yields a total of |Q| × (|Q| × |Σ| − 1) meaningful trainable parameters.

We note that this procedure is not guaranteed to find an automaton that globally minimizes the objective when optimizing T (see Vidal et al., 2005b, §3). But in practice, stochastic gradient descent in high-dimensional spaces can avoid local minima, functioning as a kind of annealing (Bottou, 1991, §4); using these simple optimization techniques on non-convex objectives is now standard practice in machine learning.

2.3 Sequence representation and word boundaries

In order to model phonotactics, a PFA must be sensitive to the boundaries of words, because there are often constraints that apply only at word beginnings or endings (Hayes and Wilson, 2008; Chomsky and Halle, 1968). In order to account for this, we include in the symbol inventory Σ a special word boundary delimiter #, which occurs as the final symbol of each word, and which only occurs in that position. Furthermore, we constrain all matrices T to transition deterministically back into the initial state following the symbol #, effectively reusing the initial state q_0 as the final state q_f.

By constructing the automata in this way, we ensure that their long-run behavior is well-behaved. If an automaton of this form is allowed to keep generating past the symbol #, it will generate successive concatenated independent and identically distributed samples from its distribution over words, with boundary symbols # delineating them. This construction makes it possible to calculate stationary distributions over states and complexity measures related to them.

2.4 Regularization

The objective in Eq. 3 includes a regularization term C representing complexity constraints. Any differentiable complexity measure could be used here. This regularization term can be viewed from a Bayesian perspective as defining a prior over automata, and providing an inductive bias. We propose to use this term to constrain the PFA induction process to produce deterministic automata.

Most formal work on probabilistic finite-state automata for phonology has focused on deterministic PFAs because of their nice theoretical properties (Heinz, 2010). A deterministic PFA is distinguished by having fully deterministic transition matrices T. This condition can be expressed information-theoretically. Assuming 0 log 0 = 0, letting the entropy of a stochastic vector p be:

    H[p] = − Σ_i p_i log p_i,

a PFA is deterministic when it satisfies the condition H[qᵀT_x] = 0 for all symbols x and state distributions q.

We can use this expression to monitor the degree of nondeterminism of a PFA during optimization, or to add a determinism constraint to the objective in Section 2.2. The average nondeterminism N of a PFA is given by

    N(E, T) = Σ_{ij} q̂_i E_ij H[δ_{q_i}ᵀ T_j],

where q̂ is the stationary distribution over states, representing the long-run average occupancy distribution over states. The stationary distribution q̂ is calculated by finding the left eigenvector of the matrix S satisfying

    q̂ᵀS = q̂,

where S is a right stochastic matrix giving the probability that a PFA transitions from state i to state j marginalizing over symbols emitted:

    S_ij = Σ_{x∈Σ} p(x | q_i) p(q_j | q_i, x).

For the Strictly Local and Strictly Piecewise automata, N = 0 by construction. For an automaton parameterized by T = softmax(T̃), it is not possible to attain N = 0, but nonetheless N can be made arbitrarily small. There are alternative parameterizations where N = 0 is achievable, for example using the sparsemax function instead of softmax (Martins and Astudillo, 2016; Peters et al., 2019).

In order to constrain automata to be deterministic, we set the regularization term in Eq. 3 to be

    C = αN,

where α is a non-negative scalar determining the strength of the trade-off of cross entropy and nondeterminism in the optimization. With α = 0 there is no constraint on the nondeterminism of the automaton, and minimizing the objective in Eq. 3 reduces to maximum likelihood estimation.
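As a concrete illustration of this regularizer (again a sketch of ours under the same PyTorch assumptions as the earlier snippet, not the released implementation), N(E, T) can be computed by forming S, approximating the stationary distribution q̂ (here by power iteration, a simple stand-in for an explicit left-eigenvector computation, assuming the chain mixes), and averaging the row entropies of the transition matrices:

    import torch

    def nondeterminism(E, T, iters=100):
        """Average nondeterminism N(E, T) of a PFA.
        E: (Q, V) right-stochastic emissions; T: (V, Q, Q) right-stochastic transitions."""
        Q = E.shape[0]
        # S_ij = sum_x p(x | q_i) p(q_j | q_i, x): symbol-marginalized state transitions
        S = torch.einsum('iv,vij->ij', E, T)
        # stationary distribution with q_hat^T S = q_hat^T, approximated by power iteration
        q_hat = torch.full((Q,), 1.0 / Q)
        for _ in range(iters):
            q_hat = q_hat @ S
        # H[v, i] = entropy of the state distribution delta_{q_i}^T T_v
        H = -(T * torch.log(T.clamp_min(1e-12))).sum(dim=2)
        # N = sum_{i,v} q_hat_i * E_iv * H[v, i]
        return torch.einsum('i,iv,vi->', q_hat, E, H)

    def objective(avg_nll, E, T, alpha=1.0):
        """Eq. 3 with the determinism regularizer C = alpha * N."""
        return avg_nll + alpha * nondeterminism(E, T)

With alpha = 0 this reduces to the plain maximum-likelihood objective, matching the remark above.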
2.5 Implementing restricted automata

We define Strictly Local and Strictly Piecewise automata as automata that generate the respective languages. We implement Strictly Local and Strictly Piecewise automata by hard-coding the transition matrices T. For these automata, we only do optimization over the emission matrices E.

Strictly Local   In a Strictly k-Local (k-SL) language, each symbol is conditioned only on the immediately preceding k − 1 symbol(s) (Heinz, 2018; Rogers and Pullum, 2011). We implement a 2-SL automaton by associating each state q ∈ Q with a unique element x in the symbol inventory Σ. Upon emitting symbol x, the automaton deterministically transitions into the corresponding state, denoted q_x. Thus the transition matrices have the form

    T_x = [ ⋯ 0 ⋯ 1 ⋯ 0 ⋯ ]   (every row identical, with a 1 in column q_x and 0 in every column q_{≠x}).

This construction can be straightforwardly extended to k-SL, yielding |Σ|^{k−1} × (|Σ| − 1) trainable parameters for a k-SL automaton.

Strictly Piecewise   In a Strictly k-Piecewise (k-SP) language, each symbol depends on the presence of any preceding k − 1 symbols at arbitrary distance (Heinz, 2007, 2018; Shibata and Heinz, 2019). For example, in a 2-SP language, in a string abc, c would be conditional on the presence of a and the presence of b, without regard to distance or the relative order of a and b.

The implementation of an SP automaton is slightly more complex than the SL automaton, as the number of states required in a naïve implementation is exponential in the symbol inventory size, resulting in intractably large matrices. We circumvent this complexity by parameterizing a 2-SP automaton as a product of simpler automata. We associate each symbol x ∈ Σ with a sub-automaton A_x which has two states q_0^x and q_1^x, with state q_0^x indicating that the symbol x has not been seen, and q_1^x indicating that it has been seen. Each sub-automaton A_x has an emission matrix E^(x) of size 2 × |Σ| corresponding to the two states q_0^x and q_1^x; the emission matrix for all states q_0^x is constrained to be the uniform distribution over symbols. The transition matrices T^(x) are

    T^(x)_x = [[0, 1], [0, 1]],    T^(x)_{y≠x} = [[1, 0], [0, 1]].

Then the probability of the t'th symbol x_t in a sequence, given a context of previous symbols x_{i=1}^{t−1}, is the geometric mixture of the probability of x_t under each sub-automaton, also called the co-emission probability

    p(x_t | x_{i=1}^{t−1}) ∝ Π_{y=1}^{|Σ|} p_{A_y}(x_t | x_{i=1}^{t−1}).

Because each sub-automaton A_y is deterministic, its state after seeing the context x_{i=1}^{t−1} is known, and the conditional probability p_{A_y}(x_t | x_{i=1}^{t−1}) can be computed using Eq. 1. For calculating the probability of a sequence, we assume an initial state of having seen the boundary symbol #; that is, the sub-automaton A_# starts in state q_1^#.

Using this parameterization, we can do optimization over the collection of emission matrices {E^(x)}_{x∈Σ}. This construction yields |Σ| × (|Σ| − 1) trainable parameters for the 2-SP automaton, the same number of parameters as the 2-SL automaton.

SP + SL   It is also possible to create and train an automaton with the ability to condition on both 2-SL and 2-SP factors by taking the product of 2-SL and 2-SP automata, as proposed by Heinz and Rogers (2013). We refer to the language generated by such an automaton as 2-SL + 2-SP. We experiment with such product machines below.

2.6 Related work

PFA induction from data is a well-studied task which has been the subject of multiple competitions over the years (see Verwer et al., 2012, for a review). The most common approaches are variants of Baum-Welch and heuristic state-merging algorithms (see for example de la Higuera, 2010). Gibbs samplers and spectral methods have also been proposed (Gao and Johnson, 2008; Bailly, 2011; Shibata and Yoshinaka, 2012). Induction of restricted PDFAs, especially for SL and SP languages, is explored in Heinz and Rogers (2013, 2010).

Our work differs from previous approaches in its simplicity. Inspired by Shibata and Heinz (2019), we optimize the training objective directly via gradient descent, without approximations or heuristics other than the use of minibatches. The same algorithm is applied to learn both transition and emission structure, for learning of both general PFAs and restricted PDFAs.
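The hard-coded structures of Section 2.5 can be written down directly. The sketch below is ours (not the paper's code); it assumes symbols encoded as integers 0..|Σ|−1, so that state q_x of the 2-SL machine is identified with x, and each sub-automaton A_x has states 0 = 'x unseen' and 1 = 'x seen':

    import torch

    V = 4   # illustrative |Sigma|, including the boundary symbol

    # 2-SL: state q_x remembers the last emitted symbol; T_x sends every state to q_x.
    T_sl = torch.zeros(V, V, V)        # indexed (symbol x, from-state, to-state)
    for x in range(V):
        T_sl[x, :, x] = 1.0            # every row of T_x is one-hot on column q_x

    # 2-SP sub-automata: T^(x)_x = [[0, 1], [0, 1]] (move to 'seen'); T^(x)_{y != x} = identity.
    T_sp = torch.eye(2).repeat(V, V, 1, 1)   # indexed (sub-automaton x, emitted symbol y, 2, 2)
    for x in range(V):
        T_sp[x, x] = torch.tensor([[0., 1.], [0., 1.]])

    def sp_coemission(E_sub, states):
        """Next-symbol distribution of the 2-SP product automaton.
        E_sub: (V, 2, V) emissions for the sub-automata (the 'unseen' rows would be fixed
        to the uniform distribution, per the text); states: list of 0/1 flags, one per A_x."""
        probs = torch.stack([E_sub[x, states[x]] for x in range(V)])   # (V, V): one row per A_x
        joint = probs.log().sum(dim=0)        # product over sub-automata, in log space
        return torch.softmax(joint, dim=0)    # renormalize, i.e. the proportionality above

Because each sub-automaton is deterministic, tracking `states` while reading a word just means flipping the flag of every symbol seen so far, starting with the boundary sub-automaton A_# already in its 'seen' state.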
One of our contributions is to show that this very simple approach gives reasonable results for learning phonotactics.

3 Inducing toy languages

First, we test the ability of the model to recover automata for simple examples of subregular languages. We do so for the two subregular classes 2-SL and 2-SP described in Section 2.5. For each of these language classes, we implement a reference PFA which generates strings from a simple example language in that class, then generate 10,000 sample sequences from the reference PFA. We then use these samples as training data, and study whether our learners can recover the relevant constraints from the data.

3.1 Evaluation

We evaluate the ability to induce appropriate automata in two ways. First, since we are studying very simple languages and automata, it is possible to directly inspect the E and T matrices and check that they implement the correct automaton by observing the transition and emission probabilities.

Second, we study the probabilities assigned to carefully selected strings which exemplify the constraints that define the languages. For each language, we define an illegal test string which violates the constraints of the language, and a minimally-different legal test string. Given an automaton, we can measure the legal–illegal difference: the log probability of the legal test string minus the log probability of the illegal test string. A larger legal–illegal difference indicates that the model is assigning a higher probability to the legal form compared to the illegal one and therefore is successfully learning the constraints represented by the testing data.

3.2 Languages

All languages are defined over the symbol inventory {a, b, c} plus the boundary symbol #.

As an exemplar of 2-SL languages, we use the language characterized by the forbidden factor *ab. A deterministic PFA for the language is given in Figure 1 (top). The language contains all strings that do not have an a followed immediately by a b. Our legal test string for this language is bacccb# and the illegal test string is babccc#.

As an exemplar of 2-SP languages, we use the language characterized by a forbidden factor *a...b. This language contains all strings that do not have an a followed by a b at any distance. The reference automaton is given in Figure 1 (bottom). The legal test string is baccca# and the illegal test string is bacccb#.

3.3 Training parameters

The logit matrices Ẽ and T̃ are initialized with random draws from a standard Normal distribution (Derrida, 1981). We perform stochastic gradient descent using the Adam algorithm, which adaptively sets the learning rate (Kingma and Ba, 2015). We perform 10,000 update steps with starting learning rate η = 0.001 and minibatch size 5.

3.4 Results

Unrestricted PFA induction succeeds in recovering the reference automata for both toy languages. Learners restricted to the appropriate classes, as well as the automaton combining SL and SP factors, also succeed in inducing the appropriate automata, while learners restricted to the ‘wrong’ class fail.

Figure 1 shows the legal–illegal differences for test strings over the course of training. We can see that, when the learner is unrestricted or when the learner is in the appropriate class, it eventually picks up on the relevant constraint, with the legal–illegal difference increasing apparently without bound over training. Unrestricted learners take longer to reach this point, but they reach it reliably. On the other hand, looking at the legal–illegal differences for learners in the wrong class, we see that they asymptote to a small number and stop improving.

These results demonstrate that our simple method for PFA induction does succeed in inducing certain simple structures relevant for modeling phonotactics in a small, controlled setting. Next, we turn to induction of phonotactics from corpus data.

4 Corpus experiments

We evaluate our learner by training it on dictionary forms from Quechua and Navajo and then studying its ability to predict attested forms that were held out in training, in addition to artificially constructed nonce forms which probe the ability of the model to represent nonlocal constraints.

4.1 Training parameters

All training parameters are as in Section 3.3, except that we train for 100,000 steps, and control the
[Figure 1 (plots): legal–illegal difference over training for the target languages *a...b (2-SP; legal test string baccca#, illegal bacccb#) and *ab (2-SL; legal test string bacccb#, illegal babccc#), for learner classes 2-SL, 2-SP, 2-SP + 2-SL, and an unrestricted PFA with |Q| = 2.]

Figure 1: Difference in log probabilities for legal and illegal forms over the course of PFA induction for toy languages. A large positive value indicates that the relevant constraint has been learned.
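The legal–illegal difference plotted in Figure 1 is simply a difference of two sequence log probabilities; in terms of the hypothetical neg_log_prob helper from the sketch in Section 2.2 (ours, not part of the paper), it is:

    def legal_illegal_difference(legal, illegal):
        """log p(legal) - log p(illegal); large positive values indicate that the
        constraint separating the two test strings has been learned."""
        return neg_log_prob(illegal) - neg_log_prob(legal)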
[Figure 3 (plots): panels for Navajo and Quechua showing held-out NLL, ‘overfitting’, and nondeterminism N (for alpha = 0 and alpha = 1), all in bits, as a function of the number of states |Q| ∈ {32, 64, 128, 256, 512, 1024}.]

Figure 3: Accuracy and complexity metrics for unrestricted PFA induction. ‘Overfitting’ is the difference between held-out NLL and training set NLL. N is nondeterminism and alpha is the regularization parameter α (see Section 2.4). Runs with |Q| = 128, 256, 512 and α = 1 on Navajo data terminated early due to numerical underflow in the calculation of the stationary distribution.
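The ‘Heldout NLL’ and ‘Overfitting’ quantities reported in Figures 3 and 4 follow directly from per-form NLLs; a minimal sketch, again using the hypothetical neg_log_prob helper rather than the released code:

    def mean_nll(forms):
        """Average negative log likelihood (in nats here; the figures report bits)."""
        return sum(neg_log_prob(x).item() for x in forms) / len(forms)

    def overfitting(train_forms, heldout_forms):
        """Held-out NLL minus training-set NLL, the 'Overfitting' measure of Figure 3."""
        return mean_nll(heldout_forms) - mean_nll(train_forms)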
[Figure 4 (plots): panels for Navajo and Quechua showing held-out NLL and the legal–illegal difference, in bits, over 100,000 training steps.]

Figure 4: Performance of a 2-SP automaton, a 2-SL automaton, a 2-SP + 2-SL product automaton, and an unrestricted PFA with 1,024 states and α = 0. ‘Heldout NLL’ is the average NLL of a form in the set of attested forms never seen during training. ‘Legal–illegal difference’ is the difference in log likelihood between ‘legal’ and ‘illegal’ forms in the nonce test set.
difference described in Section 3.1, but now as an average over many legal–illegal nonce pairs instead of a difference for one pair.

4.4 Results

Unrestricted PFA induction   Figure 3 shows results from induction of unrestricted PFAs with various numbers of states. We show the average NLL of forms in the heldout data, as well as ‘overfitting’, defined as the average held-out NLL minus the average training set NLL. This number shows the extent to which the model assigns higher probabilities to forms in the training set as opposed to the held-out set, an index of overfitting. We find that automata with more states fit the data better, but are also more prone to overfitting to the training set.

In Figure 3 (bottom two rows) we also show the measured nondeterminism N of the induced automata throughout training, for different values of the regularization parameter α (see Section 2.4). We find that, even without an explicit constraint for determinism, the induced PFAs tend towards determinism over time, with N reaching around 1.5 bits by the final training step. Explicit regularization (with α = 1) makes this process faster, with N reaching around 0.5 bits. Regularization for determinism has only a minimal effect on the NLL values.

Linguistic performance and restricted models   Figure 4 shows held-out NLL and the legal–illegal difference for both languages, comparing the SL automaton, the SP automaton, the product SP + SL automaton, and a PFA with 1,024 states and α = 0.

In terms of the ability to predict attested held-out forms, the best model is consistently the unrestricted PFA, with the SP automaton performing the worst. However, in terms of predicting the ill-formedness of artificial forms violating nonlocal phonotactic constraints, the best model is either the SP automaton or the SP + SL product automaton. Both of these automata successfully induce the nonlocal constraint.

On the other hand, the unrestricted PFA learner shows no evidence at all of having learned the difference between legal and illegal forms in the artificial data, despite having the capacity to do so in theory, and despite succeeding in inducing a 2-SP language in Section 3.

4.5 Discussion

We find that an unrestricted PFA learner performs most accurately when predicting real held-out forms, while an SP learner is most effective in learning certain nonlocal constraints. In fact, in terms of its ability to model the nonlocal constraints, the PFA learner ends up comparable to an SL learner, which cannot learn the constraints at all. Meanwhile, the SP learner, which is unable to model local constraints, fares much worse than even the SL learner on predicting held-out forms. The product SP + SL learner combines the strengths of both restricted learners, but still does not assign as high probability to the real held-out forms as the unrestricted PFA learner.

This pattern of performance suggests that the PFA learner is using most of its states to model local constraints beyond those captured in a 2-SL language. These constraints are important for predicting real held-out forms. The SP automaton is unable to achieve strong performance on held-out forms without the ability to model these local constraints. On the other hand, the unrestricted PFA tends to overfit to its training data, perhaps explaining its failure to detect nonlocal constraints which are picked up by the appropriate restricted automata.

5 Conclusion

We introduced a framework for phonotactic learning based on simple induction of probabilistic finite-state automata by stochastic gradient descent. We showed how this framework can be used to learn unrestricted PFAs, in addition to PFAs restricted to certain formal language classes such as Strictly Local and Strictly Piecewise, via constraints on the transition matrices that define the automata. Furthermore, we showed that the framework is successful in learning some phonotactic phenomena, with unrestricted automata performing best in a wide-coverage evaluation on attested but held-out forms, and Strictly Piecewise automata performing best in a targeted evaluation using nonce forms focusing on nonlocal constraints.

Our results leave open the question of whether the unrestricted learner or one of the restricted learners is ‘best’ for learning phonotactics, since they perform differently on different metrics.
A key question for future work is whether there might be some model that could do well in inducing both local and nonlocal constraints simultaneously, performing well on both the held-out evaluation and the nonce form evaluation. Such a model could come in the form of another restricted language class such as Tier-Based Strictly Local languages (Heinz et al., 2011; Jardine and Heinz, 2016; McMullin, 2016; Jardine and McMullin, 2017), or perhaps in the form of a regularization term in the training objective which enforces an inductive bias that favors certain nonlocal interactions.

The code for this project is available at http://github.com/hutengdai/PFA-learner.

Acknowledgments

This work was supported by a GPU Grant from the NVIDIA corporation. We thank the three anonymous reviewers and Adam Jardine, Jeff Heinz, and Dakotah Lambert for their comments.

References

Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 344–352, Honolulu, Hawaii. Association for Computational Linguistics.

Maria Gouskova and Gillian Gallagher. 2020. Inducing nonlocal constraints from baseline phonotactics. Natural Language & Linguistic Theory, 38(1):77–116.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Jeffrey Heinz. 2007. The inductive learning of phonotactic patterns. Ph.D. thesis, University of California, Los Angeles.

Jeffrey Heinz. 2010. Learning long-distance phonotactics. Linguistic Inquiry, 41(4):623–661.

Jeffrey Heinz. 2018. The computational nature of phonological generalizations. Phonological Typology, Phonetics and Phonology, pages 126–195.

Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623. PMLR.

Kevin James McMullin. 2016. Tier-based locality in long-distance phonotactics: learnability and typology. Ph.D. thesis, University of British Columbia.
Recognizing Reduplicated Forms: Finite-State Buffered Machines
Yang Wang
Department of Linguistics
University of California, Los Angeles
Los Angeles, CA, USA
[email protected]
237
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 237–247
August 5, 2021. ©2021 Association for Computational Linguistics
Total reduplication: Dyirbal plurals (Dixon, 1972, 242)
Singular Gloss Plural Gloss
midi ‘little, small’ midi-midi ‘lots of little ones’
gulgiói ‘prettily painted men’ gulgiói-gulgiói ‘lots of prettily painted men’
Table 1: Total reduplication: Dyirbal plurals (top); partial reduplication: Agta plurals (bottom).
[Diagram residue: crossing versus nesting dependency arcs for ‘midi-midi’ and ‘midi-idim’, and a language-class chart placing ww, ww^R, a^i b^j c^i d^j, and a^i b^j within the regular, context-free, and context-sensitive classes.]

Figure 1: Crossing dependencies in Dyirbal total reduplication ‘midi-midi’ (top) versus nesting dependencies in unattested string reversal ‘midi-idim’ (bottom).
Various attempts followed this vein:2 one ex- are two-taped finite state automata, sensitive to
ample is finite state registered machine in Cohen- copying activities within strings, hence able to de-
Sygal and Wintner (2006) (FSRAs) with finitely tect identity between sub-strings. This paper is
many registers as its memory, limited in the way organized as follows: Section 2 provides a defi-
that it only models bounded copying. The state-of- nition of FSBMs with examples. Then, to better
art finite state machinery that computes unbounded understand the copying mechanism, complete-path
copying elegantly and adequately is 2-way finite FSBMs, which recognize exactly the same set of
state transducers (2-way FSTs), capturing redupli- languages as general FSBMs, are highlighted. Sec-
cation as a string-to-string mapping (w → ww) tion 3 examines the computational and mathemat-
(Dolatian and Heinz, 2018a,b, 2019, 2020). To ical properties of the set of languages recognized
avoid the mirror image function (w → wwR ), complete-path FSBMs. Section 4 concludes with
Dolatian and Heinz (2020) further developed sub- discussion and directions for future research.
classes of 2-way FSTs which cannot output any-
thing during right-to-left passes over the input (cf. 2 Finite State Buffered Machine
rotating transducers: Baschenis et al., 2017).
2.1 Definitions
It should be noted that the issue addressed by 2-
way FSTs is a different one: reduplication is mod- FSBMs are two-taped automata with finite-state
eled as a function (w → ww), while this paper fo- core control. One tape stores the input, as in normal
cuses on a set of languages containing identical sub- FSAs; the other serves as an unbounded memory
strings (ww). The stringset question is non-trivial buffer, storing reduplicants temporarily for future
and well-motivated for reasons of both formal as- identity checking. Intuitively, FSBMs are an ex-
pects and its theoretical relevance. Firstly, since tension to FSRAs but equipped with unbounded
the studied 2-way FSTs are not readily invertible, memory. In theory, FSBMs with a bounded buffer
how to get the inverse relation ww → w remains would be as expressive as an FSRA and therefore
an open question, as acknowledged in Dolatian and can be converted to an FSA.
Heinz (2020). Although this paper does not directly The buffer interacts with the input in restricted
address this morphological analysis problem, rec- ways: 1) the buffer is queue-like; 2) the buffer
ognizing which strings are reduplicated and belong needs to work on the same alphabet as the input,
to Lww or any other copying languages may be an unlike the stack in a pushdown automaton (PDA),
important first step.3 for example; 3) once one symbol is removed from
As for the theoretical aspects, there are some the buffer, everything else must also be wiped off
attested forms of meaning-free reduplication in before the buffer is available for other symbol ad-
natural languages. Zuraw (2002) proposes aggres- dition. These restrictions together ensure the ma-
sive reduplication in phonology: speakers are chine does not generate string reversals or other
sensitive to phonological similarity between sub- non-reduplicative non-regular patterns.
strings within words and reduplication-like struc- There are three possible modes for an FSBM M
tures are attributed to those words. It is still ar- when processing an input: 1) in normal (N) mode,
guable whether those meaning-free reduplicative M reads symbols and transits between states, func-
patterns of unbounded strings are generated via a tioning as a normal FSA; 2) in buffering (B) mode,
morphological function or not. Overall, it is de- besides consuming symbols from the input and tak-
sirable to have models that help to detect the sub- ing transitions among states, it adds a copy of just-
string identity within surface strings when those read symbols to the queue-like buffer, until it exits
sub-strings are in the regular set. buffering (B) mode; 3) after exiting buffering (B)
This paper introduces a new computational de- mode, M enters emptying (E) mode, in which M
vice: finite state buffered machine (FSBMs). They matches the stored symbols in the buffer against in-
put symbols. When all buffered symbols have been
2
Some other examples, pursuing more linguistically sound matched, M switches back to normal (N) mode for
and computationally efficient finite state techniques, are
Walther (2000), Beesley and Karttunen (2000) and Hulden another round of computation. Under the current
(2009). However, they fail to model unbounded copying. augmentation, FSBMs can only capture local redu-
Roark and Sproat (2007), Cohen-Sygal and Wintner (2006) plication with two adjacent, completely identical
and Dolatian and Heinz (2020) provide more comprehensive
reviews. copies. It cannot handle non-local reduplication,
3
Thanks to the reviewer for bringing this point up. nor multiple reduplication.
Definition 1. A Finite-State Buffered Machine
symbols until it proceeds to q4 , an H state. State mode. Hence, to go through full cycles of mode
q4 requires M2 to stop buffering and switch to E changes, once M reaches a G state and switches
mode in order to check for string identity. Using the to B mode, it has to encounter some H states later
special transitions between H states (in this case, to be put in E mode. To allow us to only reason
a and b loops on State q4 ), M2 checks whether the about only the “useful” arrangements of G and H
stored symbols in the buffer matches the remaining states, we impose an ordering requirement on G
input. If so, after emitting out all symbols in the and H states along a path in a machine and define
buffer, M2 with a blank buffer can switch to N a complete path.
mode. It eventually ends at State q4 , a legal final
Definition 5. A path s from an initial state to a
state. Figure 6 gives a complete run of M2 on the
final state in a machine is said to be complete if
string “abbabb”. Figure 7 shows M2 rejects the
non-total reduplicated string “ababb” since a final 1. for one H state in s, there is always a preced-
configuration cannot be reached. ing G state;
Example 3. Partial reduplication Assume Σ = 2. once one G state is in s, s must contain must
{b, t, k, ng, l, i, a}, the FSBM M3 in Figure 8 contain at least one H following that G state
serves as a model of two Agta CVC reduplicated
plurals in Table 1. 3. in between G and the first H are only plain
Given the initial state q1 is in G, M3 has to enter states.
B mode before it takes any transitions. In B mode,
M3 transits to a plain state q2 , consuming an input Schematically, with P representing those non-G,
consonant and keeping it in the buffer. Similarly, non-H plain states and I, F representing initial,
M3 transits to a plain state q3 and then to q4 . When final states respectively, the regular expression de-
M3 first reaches q4 , the buffer would contain a noting the state information in a path s should be
CVC sequence. q4 , an H state, urges M3 to stop of the form: I(P ∗ GP ∗ HH ∗ P ∗ | P ∗ )∗ F .
buffering and enter E mode. Using the special Definition 6. A complete-path finite state
transitions between H states (in this case, loops buffered machine is an FSBM in which all possible
on q4 ), M3 matches the CVC in the buffer with the paths are complete.
remaining input. Then, M3 with a blank buffer can
switch to N mode at q4 . M3 in N mode loses the Example FSBMs we provide so far (Figure 3,
access to loops on q4 , as they are available only Figure 4 and in Figure 8) are complete-path FSBMs.
in E mode. It transits to q5 to process the rest of For the rest of this section, we describe several
the input by the normal transitions between q5 . A cases of an incomplete path in a machine M .
successful run should end at q5 , the only final state.
Figure 9 gives a complete run of M3 on the string No H states When a G state does not have any
“taktakki”. reachable H state following it, there is no complete
run, since M always stays in B mode.
2.3 Complete-path FSBMs
No H states in between two G states When a G
As shown in the definitions and the examples above,
state q0 has to transit to another G state q00 before
an FSBM is supposed to end in N mode to process
any H states, M cannot go to q00 , for M would
an input. There are two possible scenarios for a run
enter B mode at q0 while transiting to another G
to meet this requirement: either never entering B
state in B mode is ill-defined.
mode or undergoing full cycles of N , B , E , N mode
changes. The corresponding languages reflect ei- H states first When M has to follow a path con-
ther no copying (functioning as plain FSAs) or full taining two consecutive H states before any G state,
copying, respectively. it would clash in the end, because the transitions
In any specific run, it is the states that inform an among two H states can only be used in E mode.
FSBM M of its modality. The first time M reaches However, it is impossible to enter E mode without
a G state, it has to enter B mode and keeps buffering entering B mode enforced by some G states.
when it transits between plain states. The first time It should be emphasized that M in N mode can
when it reaches an H state, M is supposed to enter pass through one (and only one) H state to another
E mode and transit only between H states in E plain state. For instance, the language of the FSBM
241
    Used Arc         State Info   Configuration
 1. N/A              q1 ∈ I       (abbabb, q1, ε, N)
 2. N/A              q1 ∈ G       (abbabb, q1, ε, B)   Buffering triggered by q1 and empty buffer
 3. (q1, a, q2)      q2 ∉ G       (bbabb, q2, a, B)
 4. (q2, b, q3)                   (babb, q3, ab, B)
 5. (q3, b, q3)                   (abb, q3, abb, B)
 6. (q3, ε, q4)                   (abb, q4, abb, B)    Emptying triggered by q4
 7. N/A                           (abb, q4, abb, E)
 8. (q4, a, q4)                   (bb, q4, bb, E)
 9. (q4, b, q4)                   (b, q4, b, E)
10. (q4, b, q4)      q4 ∈ H       (ε, q4, ε, E)        Normal triggered by q4 and empty buffer
11. N/A              q4 ∈ F       (ε, q4, ε, N)
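The run in this table can be reproduced with a small search-based simulator of the N/B/E mode discipline. The sketch below is our illustration (not the paper's formal definition): it hard-codes M2 as reconstructed from the run, with q1 the initial state in G, q4 the final state in H, ordinary arcs (q1,a,q2), (q2,b,q3), (q3,b,q3), (q3,ε,q4), and H-to-H arcs (q4,a,q4), (q4,b,q4) usable only in emptying mode:

    # States in G trigger buffering; states in H trigger emptying; '' stands for epsilon.
    ARCS = {('q1', 'a'): 'q2', ('q2', 'b'): 'q3', ('q3', 'b'): 'q3', ('q3', ''): 'q4'}
    H_ARCS = {('q4', 'a'): 'q4', ('q4', 'b'): 'q4'}
    G, H, INIT, FINAL = {'q1'}, {'q4'}, 'q1', {'q4'}

    def accepts(s):
        def step(inp, state, buf, mode):
            if mode == 'N':
                if state in G:                       # reaching a G state switches to buffering
                    return step(inp, state, buf, 'B')
                if inp == '' and buf == '' and state in FINAL:
                    return True
                return any(step(inp[1:], q, buf, 'N')
                           for (p, a), q in ARCS.items()
                           if p == state and a != '' and inp.startswith(a))
            if mode == 'B':                          # consume input and copy it into the buffer
                if state in H:                       # reaching an H state switches to emptying
                    return step(inp, state, buf, 'E')
                return any(step(inp[len(a):], q, buf + a, 'B')
                           for (p, a), q in ARCS.items()
                           if p == state and inp.startswith(a))
            if mode == 'E':                          # match buffered symbols against the input
                if buf == '':
                    return step(inp, state, '', 'N')
                return any(step(inp[1:], q, buf[1:], 'E')
                           for (p, a), q in H_ARCS.items()
                           if p == state and a == buf[0] and inp.startswith(a))
            return False
        return step(s, INIT, '', 'N')

    print(accepts('abbabb'), accepts('ababb'))   # True False, matching the run above and Figure 7

The only nondeterminism in this small machine is the choice of when to take the epsilon-arc into the H state; everything else follows the configuration sequence in the table.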
[Automaton diagram: states q1–q5 (Start at q1, Accept at q5) with arcs labeled from {b, t, k, ng, l, i, a}.]
3.1 Intersection with FSAs

[Automaton diagram: states q1–q5 (Start at q1, Accept at q5) with arcs and loops labeled a and b.]
V C, V
recognized by complete-path FSBMs is not
Start q1 C q2 V q3 C q4 q5 Accept
closed under inverse alphabetic homomorphisms
and thus inverse homomorphism. Consider a
C complete-path FSBM-recognizable language L =
{ai bj ai bj | i, j ≥ 1} (cf. Figure 4). Consider an
Figure 12: An FSBM M5 on the alphabet {C, V } such
alphabetic homomorphism h : {0, 1, 2} → {a, b}∗
that L(M5 ) = h(L(M3 )) with M3 in Figure 8
such that h(0) = a, h(1) = a and h(2) = b. Then,
h−1 (L) = {(0|1)i 2j (0|1)i 2j | i, j ≥ 1} seems to
3.2 Homomorphism and inverse alphabetic be challenging for FSBMs. Finite state machines
homomorphism cannot handle the incurred crossing dependencies
while the augmented copying mechanism only con-
Definition 7. A (string) homomorphism is a func-
tributes to recognizing identical copies, but not
tion mapping one alphabet to strings of another
general cases of symbol correspondence. 5
alphabet, written h : Σ → ∆∗ . We can extend h
to operate on strings over Σ∗ such that 1) h(Σ ) 3.3 Other closure properties
= ∆ ; 2) ∀a ∈ Σ, h(a) ∈ ∆∗ ; 3) for w =
Union Assume there are complete-path FSBMs
a1 a2 . . . an ∈ Σ∗ , h(w) = h(a1 )h(a2 ) . . . h(an )
M1 and M2 such that L(M1 ) = L1 and L(M2 ) =
where each ai ∈ Σ. An alphabetic homomorphism
L2 , then L1 ∪ L2 is a complete-path FSBM-
h0 is a special homomorphism with h0 : Σ → ∆.
recognizable language. One can construct a new
Definition 8. Given a homomorphism h: Σ → machine M that accepts an input w if either M1
∆∗ and L1 ⊆ Σ∗ , L2 ⊆ ∆∗ , define h(L1 ) or M2 accepts w. The construction of M keeps
= {h(w) | w ∈ L1 } ⊆ ∆∗ and h−1 (L2 ) = M1 and M2 unchanged, but adds a new plain state
{w | h(w) = v ∈ L2 } ⊆ Σ∗ . q0 . Now, q0 becomes the only initial state, branch-
ing into those previous initial states in M1 and M2
Theorem 2. The set of complete-path FSBM- with -arcs. In this way, the new machine would
recognizable languages is closed under homomor- guess on either M1 or M2 accepts the input. If one
phisms. accepts w, M will accept w, too.
Concatenation Assume there are complete-path
Theorem 2. can be proved by constructing a
FSBMs M1 and M2 such that L(M1 ) = L1 and
new machine Mh based on M . The informal in-
L(M2 ) = L2 , then there is a complete-path FSBM
tuition goes as follows: relabel the odd arcs to
M that can recognize L1 ◦ L2 by normal concate-
mapped strings and add states to split the arcs so
nation of two automata. The new machine adds
that there is only one symbol or on each arc in Mh .
a new plain state q0 and makes q0 the only initial
When there are multiple symbols on normal arcs,
state, branching into those previous initial states
the newly added states can only be plain non-G,
in M1 with -arcs. All final states in M2 are the
non-H states. For multiple symbols on the special
only final states in M . Besides, the new machine
arcs between two H states, the newly added states
adds -arcs from any old final states in M1 to any
must be H states. Again, under this construction,
possible initial states in M2 . A path in the resulting
complete paths in M lead to newly constructed
machine is guaranteed to be complete because it
complete paths in Mh .
is essentially the concatenation of two complete
The fact that complete-path FSBMs guarantee
paths.
the closure under homomoprhism allows theorists
to perform analyses at certain levels of abstraction Kleene Star Assume there is a complete-path
of certain symbol representations. Consider two al- FSBM M1 such that L(M1 ) = L1 , L∗1 is a
phabets Σ = {b, t, k, ng, l, i, a} and ∆ = {C, V } complete-path FSBM-recognizable language. A
with a homomorphism h mapping every consonant new automaton M is similar to M1 with a new ini-
(b, t, k, ng, l) to C and mapping every vowel (i, a) tial state q0 . q0 is also a final state, branching into
to V . As illustrated by M3 on alphabet Σ (Fig- 5
The statement on the inverse homomorphism closure is
ure 8) and M5 on alphabet ∆ (Figure 12), FSBM- left as a conjecture. We admit that a more formal and rigor-
definable patterns on Σ would be another FSBM- ous mathematical proof proving h−1 (L) is not complete-path
FSBM-recognizable should be conducted. To achieve this
definable patterns on ∆. goal, a more formal tool, such as a developed pumping lemma
We conjecture that the set of languages for the corresponding set of languages, is important.
244
old initial states in M1 . In this way, M accepts the tive identity requirement by complete-path FSBMs
empty string . q0 is never a G state nor an H state. without using the full power of mildly context sen-
Moreover, to make sure M can jump back to an sitive formalisms. To achieve this goal, future work
initial state after it hits a final state, -arcs from any should consider developing an efficient algorithm
final state to any old initial states are added. that intersects complete-path FSBMs with weighted
FSAs.
4 Discussion and conclusion The present paper is the first step to recognize
reduplicated forms in adequate yet more restric-
In summary, this paper provides a new computa- tive models and techniques compared to MCS
tional device to compute unrestricted total redu- formalisms. There are some limitations of the
plication on any regular languages, including the current approach on the whole typology of redu-
simplest copying language Lww where w can be plication. Complete-path FSBMs can only cap-
any arbitrary string of an alphabet. As a result, it ture local reduplication with two adjacent identical
introduces a new class of languages incomparable copies. As for non-local reduplication, the modi-
to CFLs. This class of languages allows unbounded fication should be straightforward: the machines
copying without generating non-reduplicative non- need to allow the filled buffer in N mode (or in
regular patterns: we hypothesize context-free string another newly-defined memory holding mode) and
reversals are excluded since the buffer is queue-like. match strings only when needed. As for multi-
Meanwhile, the MCS Swiss-German cross-serial ple reduplication, complete-path FSBMs can eas-
dependencies, abstracted as {ai bj ci dj |i, j ≥ 1}, ily be modified to include multiple copies of the
is also excluded, since the buffer works on the same same base form ({wn | w ∈ Σ∗ , n ∈ N}) but
alphabet as the input tape and only matches identi- cannot be easily modified to recognize the non-
cal sub-strings. semilinear language containing copies of the copy
Following the sub-classes of 2-way FSTs in n
({w2 | w ∈ Σ∗ , n ∈ N}). It remains to be an open
Dolatian and Heinz (2018a,b, 2019, 2020), which question on the computational nature of multiple
successfully capture unbounded copying as func- reduplication. Last but not the least, as a reviewer
tions while exclude the mirror image mapping, points out, recognizing non-identical copies can
complete-path FSBMs successfully capture the be achieved by either storing or emptying not ex-
total-reduplicated stringsets while exclude string actly the same input symbols, but mapped sym-
reversals. Comparison between the characterized bols according to some function f . Under this
languages in this paper and the image of functions modification, the new automata would recognize
in Dolatian and Heinz (2020) should be further car- {an bn | n ∈ N} with f (a) = b but still exclude
ried out to build the connection. Moreover, one string reversals. In all, detailed investigations on
natural next step is to extend FSBMs as acceptors how to modify complete-path FSBMs should be
to finite state buffered transducers (FSBT). Our the next step to complete the typology.
intuition is FSBTs would be helpful in handling
the morphological analysis question (ww → w), Acknowledgments
a not-yet solved problem in the 2-way FSTs that
The author would like to thank Tim Hunter, Bruce
Dolatian and Heinz (2020) study. After reading the
Hayes, Dylan Bumford, Kie Zuraw, and the mem-
first w in input and buffering this chunk of string
bers of the UCLA Phonology Seminar for their
in the memory, the transducer can output for each
feedback and suggestions. Special thanks to the
matched symbol when transiting among H states.
anonymous reviewers for their constructive com-
Another potential area of research is applying ments and discussions. All errors remain my own.
this new machinery to Primitive Optimality Theory
(Eisner, 1997; Albro, 1998). Albro (2000, 2005)
used weighted finite state machine to model con- References
straints while represented the set of candidates by Daniel M Albro. 1998. Evaluation, implementation,
Multiple Context Free Grammars to enforce base- and extension of primitive optimality theory. Mas-
reduplicant correspondence (McCarthy and Prince, ter’s thesis, UCLA.
1995). Parallel to Albro’s way, given complete- Daniel M. Albro. 2000. Taking primitive Optimality
path FSBMs are intersectable with FSAs, it is pos- Theory beyond the finite state. In Proceedings of the
sible to computationally implement the reduplica- Fifth Workshop of the ACL Special Interest Group
245
in Computational Phonology, pages 57–67, Centre Hossep Dolatian and Jeffrey Heinz. 2019. RedTyp: A
Universitaire, Luxembourg. International Commit- database of reduplication with computational mod-
tee on Computational Linguistics. els. In Proceedings of the Society for Computation
in Linguistics (SCiL) 2019, pages 8–18.
Daniel M Albro. 2005. Studies in computational op-
timality theory, with special reference to the phono- Hossep Dolatian and Jeffrey Heinz. 2020. Comput-
logical system of Malagasy. Ph.D. thesis, University ing and classifying reduplication with 2-way finite-
of California, Los Angeles, Los Angeles. state transducers. Journal of Language Modelling,
8(1):179–250.
Bruce Bagemihl. 1989. The crossing constraint and
‘backwards languages’. Natural language & linguis- Jason Eisner. 1997. Efficient generation in primitive
tic Theory, 7(4):481–549. Optimality Theory. In 35th Annual Meeting of the
Association for Computational Linguistics and 8th
Félix Baschenis, Olivier Gauwin, Anca Muscholl, and Conference of the European Chapter of the Associa-
Gabriele Puppis. 2017. Untwisting two-way trans- tion for Computational Linguistics, pages 313–320,
ducers in elementary time. In 2017 32nd Annual Madrid, Spain. Association for Computational Lin-
ACM/IEEE Symposium on Logic in Computer Sci- guistics.
ence (LICS), pages 1–12.
Gerald Gazdar and Geoffrey K Pullum. 1985. Com-
Kenneth R. Beesley and Lauri Karttunen. 2000. Finite- putationally relevant properties of natural languages
state non-concatenative morphotactics. In Proceed- and their grammars. New generation computing,
ings of the 38th Annual Meeting of the Associa- 3(3):273–306.
tion for Computational Linguistics, pages 191–198,
Hong Kong. Association for Computational Linguis- Thomas Graf. 2017. The power of locality domains in
tics. phonology. Phonology, 34(2):385–405.
Jane Chandlee. 2014. Strictly local phonological pro- Phyllis M. Healey. 1960. An Agta Grammar. Bureau
cesses. Ph.D. thesis, University of Delaware. of Printing, Manila.
246
Aravind K. Joshi. 1985. Tree adjoining grammars:
How much context-sensitivity is required to provide
reasonable structural descriptions?, Studies in Natu-
ral Language Processing, page 206–250. Cambridge
University Press.
Ronald M. Kaplan and Martin Kay. 1994. Regular
models of phonological rule systems. Comput. Lin-
guist., 20(3):331–378.
Alec Marantz. 1982. Re reduplication. Linguistic in-
quiry, 13(3):435–482.
John J. McCarthy and Alan S. Prince. 1995. Faithful-
ness and reduplicative identity. In Jill N. Beckman,
Laura Walsh Dickey, and Suzanne Urbanczyk, edi-
tors, Papers in Optimality Theory. GLSA (Graduate
Linguistic Student Association), Dept. of Linguis-
tics, University of Massachusetts, Amherst, MA.
Robert McNaughton and Seymour A Papert. 1971.
Counter-Free Automata (MIT research monograph
no. 65). The MIT Press.
Brian Roark and Richard Sproat. 2007. Computational
approaches to morphology and syntax, volume 4.
Oxford University Press.
Carl Rubino. 2013. Reduplication. In Matthew S.
Dryer and Martin Haspelmath, editors, The World
Atlas of Language Structures Online. Max Planck In-
stitute for Evolutionary Anthropology, Leipzig.
Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii,
and Tadao Kasami. 1991. On multiple context-
free grammars. Theoretical Computer Science,
88(2):191–229.
Stuart M Shieber. 1985. Evidence against the context-
freeness of natural language. In Philosophy, Lan-
guage, and Artificial Intelligence, pages 79–89.
Springer.
Imre Simon. 1975. Piecewise testable events. In Au-
tomata Theory and Formal Languages, pages 214–
222, Berlin, Heidelberg. Springer Berlin Heidelberg.
Michael Sipser. 2013. Introduction to the Theory of
Computation, third edition. Course Technology,
Boston, MA.
Edward Stabler. 1997. Derivational minimalism. In
Logical Aspects of Computational Linguistics, pages
68–95, Berlin, Heidelberg. Springer Berlin Heidel-
berg.
Markus Walther. 2000. Finite-state reduplication in
one-level prosodic morphology. In 1st Meeting of
the North American Chapter of the Association for
Computational Linguistics.
Kie Zuraw. 2002. Aggressive reduplication. Phonol-
ogy, 19(3):395–439.
247
An FST morphological analyzer for the Gitksan language
248
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 248–257
August 5, 2021. ©2021 Association for Computational Linguistics
ern dialects even after applying dialect rules. Sec- external morphology for the most complex word
ondly, we extend our FST morphological analyzer type, a transitive verb, is schematized in the tem-
by adding a data-driven neural guesser which fur- plate in Figure 2; an example word with all these
ther improves coverage both for the Eastern and slots filled would be ’naagask’otsdiitgathl ‘appar-
Western varieties. ently they cut.PL open (common noun)’
On the left edge of the stem can appear any num-
2 The Gitksan Language ber of modifying ‘proclitics’. These contribute
locative, adjectival, and manner-related informa-
The Gitxsan are one of the indigenous peoples of
tion to a noun or verb, often producing semi- or non-
British Columbia, Canada. Their traditional territo-
compositional idioms in a similar fashion to Ger-
ries consist of upwards of 50,000 square kilometers
manic particle verbs.1 It is often unclear whether
of land along the Skeena River in the BC northern
these proclitics constitute part of the root or stem,
interior. The Gitksan language is the easternmost
or if they are distinct words entirely. The ortho-
member of the Tsimshianic family, which spans the
graphic boundaries on this edge are consequently
entirety of the Skeena and Nass River watersheds
sometimes fuzzy. Sometimes clear contrasts are
to the Pacific Coast. Today, Gitksan is the most
presented, as with the sequence lax-yip ‘on-earth’:
vital Tsimshianic language, but is still critically
we see compositional lax yip ‘on the ground’ ver-
endangered with an estimated 300-850 speakers
sus lexicalized laxyip ‘land, territory’. However,
(Dunlop et al., 2018).
the boundary between compositional and idiomatic
The Tsimshianic family can be broadly under-
is not always so obvious, as in examples like (1).
stood as a dialect continuum, with each village
along these rivers speaking somewhat differently (1) a. saa-’witxw (away-come, ‘come from’)
from its neighbors up- or downstream, and the two b. k’ali-aks (upstream-water, ‘upriver’)
endpoints being mutually unintelligible. The six c. xsi-ga’a (out-see, ‘choose’)
Gitxsan villages are commonly divided into two d. luu-no’o (in-hole, ‘annihilate’)
dialects: East/Upriver and West/Downriver. The
dialects have some lexical and phonological dif- Inflectional morphology largely appears on the
ferences, with the most prominent being a vowel right edge of the stem. The main complexity of
shift. Consider the name of the Skeena River: Xsan, Gitksan inflection involves homophony and opac-
Ksan (Eastern) vs Ksen (Western). ity: a similar or identical wordform often has mul-
tiple possible analyses. For example, a word like
2.1 Morphological description gubin transparently involves a stem gup ‘eat’ and
The Gitksan language has strict VSO word order a 2SG suffix -n, but the intervening vowel i might
and multifunctional, fusional morphology (Rigsby, be analyzed as epenthetic, as transitive inflection
1986). It utilizes prefixation, suffixation, and both (TR), or as a specially-induced transitivizing suffix
pro- and en-cliticization. Category derivation and (T), resulting in three possible analyses in (2). Sim-
number marking are prefixal, while markers of ar- ilarly, a word gupdiit involves the same stem gup
gument structure, transitivity, and person/number ‘eat’ and a 3PL suffix -diit, but this suffix is able
agreement are suffixal. to delete preceding transitive suffixes, resulting in
The Tsimshianic languages have been described four possible analyses as in (3).
as having word-complexity similar to German (Tar-
(2) gubin
pent, 1987). The general structure of a noun or
verb stem is presented in the template in Figure a. gup-2SG
1. A stem consists of minimally a root (typically b. gup-TR-2SG
CVC); an example is monomorphemic gup ‘eat’. c. gup-T-2SG
Stems may also include derivational prefixes or (3) gupdiit
transitivity-related suffixes; compare gupxw ‘be a. gup-3PL
eaten; be edible’. b. gup-TR-3PL
In sentential context, stems are inflected for fea- c. gup-T-3PL
tures like transitivity and person/number. Our an- d. gup-T-TR-3PL
alyzer is concerned primarily with stem-external
inflection and cliticization. The structure of stem- 1
E.g. nachschlagen ’look up’ in German.
249
Derivation– Proclitics– Plural– Root –Argument Structure
Figure 2: Morphological template of modification, inflection, and cliticization for a transitive verbal predicate
Running speech in Gitksan is additionally rife alects, as well as neighboring Nisga’a, with some
with clitics, which pose a more complex problem variations. Given the relatively short period that
for morphological modeling. First, there are a set this orthography has been in use, orthographic con-
of ergative ‘flexiclitics’, which are able to either ventions can vary widely across dialects and writ-
procliticize or encliticize onto a subordinator or ers. In producing this initial analyzer, we attempt to
auxiliary, or stand independently. The same combi- mitigate the issue by working with a small number
nation of host and clitic might result in sequences of more-standardized sources: the original H&R
like n=ii (1SG=and), ii=n (and=1SG), or ii na and an annotated, multidialectal text collection.
(and 1SG) (Stebbins, 2003; Forbes, 2018). We worked with a digitized version of the H&R
Second, all nouns are introduced with a noun- wordlist (Mother Tongues Dictionaries, 2020). The
class clitic that attaches to the preceding word, as original wordlist documents only the Git-an’maaxs
illustrated by the VSO sentence in (4). Here, the Eastern dialect; our version adds a small number
proper noun clitic =s attaches to the verb but is syn- of additional dialect variants, and fifteen common
tactically associated with Mary, and the common verbs and subordinators. In total, the list contains
noun clitic =hl attaches to Mary but is associated approximately 1250 lexemes and phrases, plus
with gayt ‘hat’. noted variants and plural forms.
(4) Giigwis Maryhl gayt. The analyzer was informed by descriptive work
giikw-i-t =s Mary =hl gayt on both Gitksan and its mutually intelligible neigh-
buy-TR-3. II =PN Mary =CN hat bor Nisga’a. This work details many aspects of
‘Mary bought a hat.’ Gitksan inflection, including morphological opac-
ity and the complex interactions of certain suffixes
Any word able to precede a noun phrase is a possi-
and clitics (Rigsby, 1986; Tarpent, 1987; Hunt,
ble host for one of these clitics (hence their appear-
1993; Davis, 2018; Brown et al., 2020).
ance on transitive verbs in Figure 2).
Finally, there are several sentence-final and A text collection of approximately 18,000 words
second-position clitics. whose distribution is based was also used in the development and evaluation
on prosodic rather than strictly categorial proper- of the analyzer. This collection consists of oral
ties; these attach on the right edge of subordina- narratives given by three speakers from different
tors/auxiliaries, predicates, and argument phrases, villages: Ansbayaxw (Eastern), Gijigyukwhla’a
depending on the structure of the sentence. (Western), and Git-anyaaw (Western) (cf. Forbes
A large part of Gitksan’s unique morphological et al., 2017). It includes multiple genres: personal
complexity therefore arises not in nominal or verbal anecdotes, traditional tales (ant’imahlasxw), histo-
inflection, but in the flexibility of multiple types of ries of ownership (adaawk), recipes, and explana-
clitics used in connected speech, and the logic of tions of cultural practice. The collection is fully
which possible sequences can appear with which annotated in the ‘interlinear gloss’ format with free
wordforms. translation, exemplified in (5).
250
3 Related Work input word into morphemes and label each mor-
pheme with one or more grammatical tags. Very
While considering different approaches to compu-
similarly to the approach that we adopt, Schwartz
tational modeling of Gitksan morphology, finite-
et al. (2019) and Moeller et al. (2018) use atten-
state morphology arose as a natural choice. At the
tional LSTM encoder-decoder models to augment
present time, finite-state methods are quite widely
morphological analyzers for extending morpholog-
applied for Indigenous languages of the Americas.
ical analyzers for St. Lawrence Island / Central
Chen and Schwartz (2018) present a morpholog-
Siberian Yupik and Arapaho, respectively.
ical analyzer for St. Lawrence Island / Central
Siberian Yupik for aid in language preservation and 4 The Model
revitalization work. Strunk (2020) present another
analyzer for Central Alaskan Yupik. Snoek et al. Our morphological analyzer was designed with sev-
(2014) present a morphological analyzer for Plains eral considerations in mind. First, given the small
Cree nouns and Harrigan et al. (2017) present one amount of data at our disposal, we chose to con-
for Plains Cree verbs. Littell (2018) build a finite- struct a rule-based finite state transducer, built from
state analyzer for Kwak’wala. All of the above a predefined lexicon and morphological description.
are languages which present similar challenges to The dependence of this type of analyzer on a lexi-
the ones encountered in the case of Gitksan: word con supports one of the major goals of this project:
forms consisting of a large number of morphemes, lexical discovery from texts. Words which cannot
both prefixing and suffixing morphology and mor- be analyzed will likely be novel lemmas that have
phophonological alternations. Finite-state morphol- yet to be documented. Furthermore, the process
ogy is well-suited for dealing with these challenges. of constructing a morphological description allows
It is noteworthy that similarly to Gitksan, a number for the refinement of our understanding of Gitksan
of the aforementioned languages are also undergo- morphology and orthographic standards. For exam-
ing active documentation efforts. ple, there is a common post-stem rounding effect
While we present the first morphological ana- that generates variants such as jogat, jogot ‘those
lyzer for Gitksan which is capable of productive who live’; the project helps us identify where this
inflection, this is not the first electronic lexical re- effect occurs. Our analyzer can also later serve as a
source for the Gitksan language. Littell et al. (2017) tool to explore of the behavior of less-documented
present an electronic dictionary interface Waldayu constructions (e.g. distributive, partitive), as gram-
for endangered languages and apply it to Gitksan. matical and pedagogical resources continue to be
The model is capable of performing fuzzy dictio- developed.
nary search which is an important extension in the Our general philosophy was to take a maximal-
presence of orthographic variation which widely segmentation approach to inflection and cliticiza-
occurs in Gitksan. While this represents an impor- tion: morphemes were added individually, and in-
tant development for computational lexicography teractions between morphemes (e.g. deletion) were
for Gitksan, the method cannot model productive derived through transformational rules based on
inflection which is important particularly for lan- morphological and phonological context. Most
guage learners who might not be able to easily interactions of this kind are strictly local; there
deduce the base-form of an inflected word (Hunt are few long-distance dependencies between mor-
et al., 2019). As mentioned earlier, our model can phemes. The only exception to the minimal chunk-
analyze inflected forms of lexemes. ing rule is a specific interaction between noun-class
We extend the coverage of our finite-state an- clitics and verbal agreement: when these clitics
alyzers by incorporating a neural morphological append to verbal agreement suffixes, they either
guesser which can be used to analyze word forms agglutinate with (6-a) or delete them (6-b) depend-
which are rejected by the finite-state analyzer. Simi- ing on whether the agreement and noun-class mor-
lar mechanisms have been explored for other Amer- pheme are associated with the same noun (Tarpent,
ican Indigenous languages. Micher (2017) use 1987; Davis, 2018). That is, the conditioning factor
segmental recurrent neural networks (Kong et al., for this alternation is syntactic, not morphophono-
2015) to augment a finite-state morphological an- logical.
alyzer for Inuktitut.2 These jointly segment the
http://www.inuktitutcomputing.ca/
2
The Uquailaut morphological analyzer: Uqailaut
251
(6) Realizations of gup-i-t=hl (eat-TR-3=CN) which was listed directly in the H&R wordlist (the
a. gubithl ‘he/she ate (common noun)’ symbol ˆ marks morpheme boundaries).3 After
b. gubihl ‘(common noun) ate’ the verb, we find two inflectional suffixes and one
clitic. Ultimately, rewrite rules are used to delete
The available set of resources further constrained the transitive suffix and segmentation boundaries
our options for the analyzer’s design and our means (8).
of evaluating it. The H&R wordlist is quite small,
and of only a single dialect, while the corpus for (7) saaˆbisbisˆiˆdiitˆhl
testing was multidialectal. We therefore aimed saa+PVB-bisbis+VT-TR-3PL=CN
to produce a flexible analyzer able to recognize (8) saabisbisdiithl
orthographic variation, to maximize the value of its
small lexicon. 4.2 Analyzer iterations
4.1 FST implementation We built and evaluated four iterations of the Gitk-
san morphological analyzer based upon the foun-
Our finite-state analyzer was written in lexc dation presented in Section 4.1: the v1. Lexical
and xfst format and compiled using foma (Hulden, FST, v2. Complete FST, v3. Dialectal FST and
2009b). Finite-state analyzers like this one are v4. FST+Neural. Each iteration cumulatively ex-
constructed from a dictionary of stems, with af- pands the previous one by incorporating additional
fixes added left-to-right, and morpho-phonological vocabulary items, rules or modeling components.
rewrite rules applied to produce allomorphs and The first analyzer (v1: Lexical FST) included
contextual variation. The necessary components of only the open-class categories of verbs, nouns,
the analyzer are therefore a lexicon, a morphotac- modifiers, and adverbs which made up the bulk
tic description, and a set of morphophonological of the H&R wordlist. The main focus of the
transformations, as illustrated in Figure 3. morphotactic description was transitive inflection,
Our analyzer’s lexicon is drawn from the H&R person/number-agreement, and cliticization for
wordlist. As a first step, each stem from that list these categories. Some semi-productive argument
was assigned a lexical category to determine its structural morphemes (e.g. the passive -xw or an-
inflectional possibilities. The resulting 1506 word tipassive -asxw) were also included.
+ category pairs were imported to category-specific
The second analyzer (v2: Complete FST) in-
groups in the morphotactic description.
corporated functional and closed-class morphemes
Any of the major stem categories could be used
such as subordinators, pronouns, prepositions, quo-
to start a word; modifiers, preverbs, and prenouns
tatives, demonstratives, and aspectual particles, in-
could also be used as verb/noun prefixes. Each
cluding additional types of clitics.
categorized group flowed to a series of category-
The third analyzer (v3: Dialectal FST) further
specific sections which appended the appropriate
incorporated predictable stem-internal variation,
part of speech, and then listed various derivational
such as the vowel shift and dorsal stop lenition/-
or inflectional affixes that could be appended. A
fortition seen across dialects. In order to apply the
morphological group would terminate either with a
vowel shift in a targeted way, all items in the lex-
hard stop (#) or by flowing to a final group ‘Word’,
icon were marked for stress using the notation $.
where clitics were appended.
Parses prior to rule application now appear as in
Finally, forms were subject to a sequence
(9) (compare to (7)).
of orthographic transformations reflecting mor-
phophonological rules. Some examples included (9) s$aaˆbisb$isˆiˆdiitˆhl
the deletion of adjacent morphemes which could
not co-occur, processes of vowel epenthesis or dele- Finally, we seek to expand the coverage of the
tion, vowel coloring by rounded and back conso- analyzer through machine learning, namely neu-
nants, and prevocalic stop voicing. ral architectures (v4: FST+Neural). Our FST ar-
A sample form produced by the FST for the chitecture allows for the automatic extraction of
word saabisbisdiithl ‘they tore off (pl. common surface-analysis pairs; this enables us to create
noun)’ is in example (7). This form involves a 3
The FST has no component to productively handle redu-
preverb saa being affixed directly to a transitive plication but this would be possible to implement given a
verb bisbis, a reduplicated plural form of the verb closed lexicon Hulden (2009a, Ch. 4).
252
LEXICON N
LEXICON RootN +N: NInfl ; Deletion before -3PL:
maa’y N ; LEXICON NInfl ˆi → 0 / _ ˆdiit
smax N ; -ATTR:^m # ;
LEXICON RootVI -SX:^it Word ; Vowel insertion:
yee VI ; Agr_II ; 0 → i / C ˆ _ Sonorant #
t’aa VI ; Word ;
LEXICON RootPrenoun LEXICON Prenoun Prevocalic voicing:
lax_ Prenoun ; +PNN: # ; p,t,ts,k,k → b,d,j,g,g / _ V
(a) Lexicon +PNN: RootN ; (c) Rewrite rules
(b) Morphotactic description
a training set for the neural models. We experi- dataset (2 speakers and dialects). Token and type
ment with two alternative neural architectures - the coverage for the three FSTs is provided in Table
Hard-Attentional model over edit actions (HA) de- 1, representing the percentage of wordforms for
scribed by Makarov and Clematide (2018), and the which each analyzer was able to provide one or
transformer model (Vaswani et al., 2017), as imple- more possible parses.
mented in Fairseq (Fairseq) (Ott et al., 2019). Un-
like the FST, the neural models can extend morpho- Types Tokens
logical patterns beyond a defined series of stems, East Lexical 63.12% 54.17%
analyzing forms that the FST cannot recognize. Complete 71.10% 81.48%
For both models, we extract 10,000 random anal- Dialectal 71.10% 81.48%
ysis pairs, with replacement; early stopping for West Lexical 45.49% 38.09%
both models uses a 10% validation set extracted Complete 53.20% 70.12%
from the training, with no overlap between train- Dialectal 62.35% 75.98%
ing and validation sets (although stem overlap is Table 1: Analyzer coverage on 2000-token datasets
allowed). The best checkpoint is chosen based on
validation accuracy. The HA model uses a Chi- The effect of adding function-word coverage to
nese Restaurant Process alignment model, and is the second ‘Complete’ analyzer was broadly sim-
trained for 60 epochs, with 10 epochs patience; the ilar across dialects, increasing type coverage by
encoder and decoder both have hidden dimension about 8% and token coverage by 27-32%, demon-
200, and are trained with 50% dropout on recurrent strating the relative importance of function words
connections. The Transformer model is a 3-layer, to lexical coverage.
4-head transformer trained for 50 epochs. The en- The first two analyzers performed substantially
coders and decoders each have an embedding size better on the Eastern dataset which more closely
of 512, and feed-forward size of 1024, with 50% matched the dialect of the wordlist/lexicon. The
dropout and 30% attentional dropout. We optimize third ‘Dialectal’ analyzer incorporated four types
using Adam (0.9, 0.98), and cross-entropy with of predictable stem-internal allomorphy to generate
20% label-smoothing as our objective. Western-style variants. These transformations had
Any wordform which received no analysis from no effect on coverage for the Eastern dataset, but
the FST was provided a set of five possible analyses increased type and token coverage for the Western
each from the HA and Fairseq models. dataset by 9% and 6% respectively.
253
racy evaluation therefore had to be done manually To further understand the analyzer’s limitations,
by comparing the annotated analysis in the corpus we categorized the reasons for erroneous and miss-
to the parse options produced by the FST (10). ing analyses, listed in Table 3. In addition to the
small datasets, for which all words were checked,
(10) japhl we also evaluated the 100 most-frequent word/anal-
a. make[-3.II]=CN (Corpus) ysis pairs in the larger datasets.
b. j$ap+N=CN The majority of erroneous and absent analyses
j$ap+N-3=CN were due to the use of new lemmas not in the lexi-
j$ap+VT-3=CN (FSTv3) con, or novel variants not captured by productive
stem-alternation rules. Novel lemmas made up
We evaluated the accuracy of the Dialectal FST on
about 18% each of the small datasets, and 4-8%
two smaller datasets: 150 tokens Eastern, and 250
of the top-100 most frequent types. Some func-
tokens Western. These datasets included 85 and
tional items had specific dialectal realizations; for
180 unique wordform/annotation pairs respectively.
example, all three speakers used a different locative
The same wordform might have multiple attested
preposition (goo-, go’o-, ga’a-), only one of which
analyses, depending on its usage. The performance
was recognized.
of the Dialectal analyzer on each dataset is sum-
marized in Table 2. Precision is calculated as the There were also a few errors attributable to
percentage of word/annotation pairs for which the the morphotactic rules encoded in the parser.
analyzer produced a parse matching the context- For example, there were several instances in the
sensitive annotation in the corpus.4 Other analyses dataset of supposed ‘preverb’ modifiers combin-
produced by the FST were ignored. For example in ing with nouns (e.g. t’ip-no’o=si, sharply.down-
(10), the token would be evaluated as correct given hole=PROX, ‘this steep-sided hole’), which the
the final parse, which uses the appropriate stem parser could not recognize. This category combi-
(jap ‘make’) and matching morphology; the other nation flags the need for further documentation of
parses using a different stem (jap ‘box trap’) and/or certain ‘preverbs’. As a second example, numbers
different morphology could not qualify the token attested without agreement were not recognized
as correctly parsed. Only parsable wordforms were because the analyzer expected that they would al-
considered (i.e. English words and names are ex- ways agree. This could be fixed by updating the
cluded). morphotactic description for numbers (e.g. to more
closely match intransitive verbs).
East West
Coverage 71.76% 68.89% 5.3 FST + Neural performance
(61/85) (124/180)
The addition of the neural component signifi-
Correct parse 71.76% (61) 64.44% (116)
cantly increased the analyzer’s coverage (mean HA:
Incorrect parse 0.00% (0) 2.78% (5)
+21%, Fairseq: +17%), but at the expense of pre-
Name, English 2.5% (2) 3.33% (6)
cision (mean -15% for both). The results of the
No parse 27.5% (22) 29.44% (53)
manual accuracy evaluation are presented in Fig-
Precision 100.00% 95.87%
ure 4. There remained several forms for which the
(61/61) (116/121)
neural analyzers produced no analyses.
Table 2: Accuracy evaluation for dialectal analyzer (v3) Both analyzers performed better on the 100-
on small datasets most-frequent types datasets, where they tended
to accurately identify dialectal variants of com-
The Western dataset was larger, and consisted of mon words (e.g. t’ihlxw from tk’ihlxw ‘child’, diye
two distinct dialects, in contrast to the smaller and from diya ‘3=QUOT (third person quotative)’). In
more homogeneous Eastern dataset. Regardless, the small datasets of running text, these models
analyzer coverage between the two datasets was were occasionally able to correctly identify un-
comparable (68-72%) and precision was very high known noun and verb stems that had minimal in-
(95-100%). When this analyzer was able to provide flection. However, they struggled with identify-
a possible parse, one was almost always correct. ing categories, and often failed to identify correct
4
Note that precision is computed only on word forms inflection. These difficulties stem from category-
which received at least one analysis from the FST. flexibility and homophony in Gitksan. Nouns and
254
East West
150 tokens (22) Top-100 (17) 250 tokens (58) Top-100 (23)
New lemma 15 2 30 2
New function word 1 2 4 6
Lexical variant 3 8 6 5
Functional variant 2 3 9 9
Morphotactic error 1 2 9 1
Table 3: Categorization of erroneous and absent analyses for dialectal analyzer (FSTv3)
East 150 tok East top-100 tok West 250 tokens West top-100 tok
1 1 1 1
0 0 0 0
FST FST+HA FST+FairSeq FST FST+HA FST+FairSeq FST FST+HA FST+FairSeq FST FST+HA FST+FairSeq
Figure 4: Proportion of forms which receive the correct analysis from each of our models (indicated in blue) and
the number of forms which receive only incorrect analyses from our models (indicated in red). The remaining
forms received no analyses.
verbs use the exact same inflection and clitics, mak- xsim$as+N-3PL
ing the category itself difficult to infer. Short in-
flectional sequences have a large number of ho-
mophonous parses, and even more differ only by a Further work can be done to improve the per-
character or two. formance of the neural addition, such as training
the model on attested tokens instead of, or in addi-
Qualitatively, the HA model tended to produce
tion to, tokens randomly generated from the FST
more plausible predictions, often producing the cor-
analyzer.
rect stem or else a mostly-plausible analysis that
could map to the same surface form, but with incor-
6 Discussion and Conclusions
rect categories or inflection. In contrast, the Fairseq
model often introduced stem changes or inflec- The grammatically-informed FST is able to han-
tional sequences which could not ultimately map dle many of Gitksan’s morphological complexi-
to the surface form. Example (11) provides a sam- ties with a high degree of precision, including ho-
ple set of incorrect predictions (surface-plausible mophony, contextual deletion, and position-flexible
analyses are starred). clitics. The FST analyzer’s patchy coverage can
be attributed to its small lexicon. Unknown lexi-
(11) ksimaasdiit ksi+PVB-m$aas+VT-TR-3PL cal items and variants comprised roughly 18% of
a. HA model each small dataset. Notably, errors and unidenti-
xsim$aas+N-3PL (*) fied forms in the FST analyzer signal the current
xsim$aas+N-T-3PL (*) limits of morphotactic descriptions and lexical doc-
xsim$aas+NUM-3PL (*?) umentation. The analyzer can therefore serve as a
xsim$aast+N-T-3PL useful part of a documentary linguistic workflow
to quickly and systematically identify novel lexical
b. Fairseq model items and grammatical rules from texts, facilitating
xsim$aast+N-3PL the expansion of lexical resources. It can also be
xsim$aast+N=RESEM used as a pedagogical tool to identify word stems
xsim$aast+N-SX=PN in running text, or to generate morphological exer-
255
cises for language learners. Atticus G Harrigan, Katherine Schmirler, Antti Arppe,
The neural system, with its expanded coverage, Lene Antonsen, Trond Trosterud, and Arok Wolven-
grey. 2017. Learning from the computational mod-
can serve as part of a feedback system with a hu-
elling of plains cree verbs. Morphology, 27(4):565–
man in the loop, informing future iterations of the 598.
annotation process. While its precision is lower
Lonnie Hindle and Bruce Rigsby. 1973. A short prac-
than the FST, it can still inform annotators on words
tical dictionary of the Gitksan language. Northwest
that the FST does not analyze. Newly-annotated Anthropological Research Notes, 7(1).
data can then be used to enlarge the FST coverage.
Mans Hulden. 2009a. Finite-state machine construc-
Acknowledgments tion methods and algorithms for phonology and mor-
phology. Ph.D. thesis, The University of Arizona.
’Wii t’isim ha’miyaa nuu’m aloohl the fluent speak- Mans Hulden. 2009b. Foma: a finite-state compiler
ers who continue to share their knowledge with and library. In Proceedings of the 12th Confer-
me (Barbara Sennott, Vincent Gogag, Hector Hill, ence of the European Chapter of the Association
Jeanne Harris), as well as the UBC Gitksan Re- for Computational Linguistics: Demonstrations Ses-
sion, pages 29–32. Association for Computational
search Lab. This research was supported by fund- Linguistics.
ing from the National Endowment for the Humani-
ties (Documenting Endangered Languages Fellow- Benjamin Hunt, Emily Chen, Sylvia L.R. Schreiner,
and Lane Schwartz. 2019. Community lexical ac-
ship) and the Social Sciences and Humanities Re- cess for an endangered polysynthetic language: An
search Council of Canada (Grant 430-2020-00793). electronic dictionary for St. Lawrence Island Yupik.
Any views/findings/conclusions expressed in this In Proceedings of the 2019 Conference of the North
publication do not necessarily reflect those of the American Chapter of the Association for Computa-
tional Linguistics (Demonstrations), pages 122–126,
NEH, NSF or SSHRC. Minneapolis, Minnesota. Association for Computa-
tional Linguistics.
256
Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cow-
ell, and Mans Hulden. 2018. A neural morphologi-
cal analyzer for arapaho verbs learned from a finite
state transducer. In Proceedings of the Workshop
on Computational Modeling of Polysynthetic Lan-
guages, pages 12–20.
Mother Tongues Dictionaries. 2020. Gitk-
san. Edited by the UBC Gitksan Research
Lab. Accessed June 4, 2020. (https:
//mothertongues.org/gitksan).
Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Fan, Sam Gross, Nathan Ng, David Grangier, and
Michael Auli. 2019. fairseq: A fast, extensible
toolkit for sequence modeling. In Proceedings of
the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics
(Demonstrations), pages 48–53, Minneapolis, Min-
nesota. Association for Computational Linguistics.
257
Comparative Error Analysis in Neural and Finite-state Models for
Unsupervised Character-level Transduction
Maria Ryskina1 Eduard Hovy1 Taylor Berg-Kirkpatrick2 Matthew R. Gormley3
1
Language Technologies Institute, Carnegie Mellon University
2
Computer Science and Engineering, University of California, San Diego
3
Machine Learning Department, Carnegie Mellon University
[email protected] [email protected]
[email protected] [email protected]
Abstract
3to to4no mana belagitu
Traditionally, character-level transduction
problems have been solved with finite-state
models designed to encode structural and это точно ಮನ #$ಳ&$ತು
linguistic knowledge of the underlying pro-
cess, whereas recent approaches rely on the
power and flexibility of sequence-to-sequence tehničko i stručno obrazovanje
models with attention. Focusing on the less
explored unsupervised learning scenario, we техничка и стручна настава
compare the two model classes side by side
and find that they tend to make different types
of errors even when achieving comparable Figure 1: Parallel examples from our test sets
performance. We analyze the distributions of
for two character-level transduction tasks: con-
different error classes using two unsupervised
tasks as testbeds: converting informally
verting informally romanized text to its original
romanized text into the native script of its lan- script (top; examples in Russian and Kannada)
guage (for Russian, Arabic, and Kannada) and and translating between closely related languages
translating between a pair of closely related (bottom; Bosnian–Serbian). Informal romaniza-
languages (Serbian and Bosnian). Finally, we tion is idiosyncratic and relies on both visual (q
investigate how combining finite-state and → 4) and phonetic (t → t) character similarity,
sequence-to-sequence models at decoding while translation is more standardized but not fully
time affects the output quantitatively and
character-level due to grammatical and lexical dif-
qualitatively.1
ferences (‘nastava’ → ‘obrazovanje’) between
1 Introduction and prior work the languages. The lines show character alignment
between the source and target side where possible.
Many natural language sequence transduction tasks,
such as transliteration or grapheme-to-phoneme
conversion, call for a character-level parameteriza- by the underlying linguistic process (e.g. mono-
tion that reflects the linguistic knowledge of the un- tonic character alignment) or by the probabilis-
derlying generative process. Character-level trans- tic generative model (Markov assumption; Eisner,
duction approaches have even been shown to per- 2002). Their interpretability also facilitates the in-
form well for tasks that are not entirely character- troduction of useful inductive bias, which is crucial
level in nature, such as translating between related for unsupervised training (Ravi and Knight, 2009;
languages (Pourdamghani and Knight, 2017). Ryskina et al., 2020).
Weighted finite-state transducers (WFSTs) have Unsupervised neural sequence-to-sequence
traditionally been used for such character-level (seq2seq) architectures have also shown impressive
tasks (Knight and Graehl, 1998; Knight et al., performance on tasks like machine transla-
2006). Their structured formalization makes it eas- tion (Lample et al., 2018) and style transfer (Yang
ier to encode additional constraints, imposed either et al., 2018; He et al., 2020). These models are
1
Code will be published at https://github.com/ substantially more powerful than WFSTs, and they
ryskina/error-analysis-sigmorphon2021 successfully learn the underlying patterns from
258
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 258–271
August 5, 2021. ©2021 Association for Computational Linguistics
monolingual data without any explicit information ization (Ryskina et al., 2020) and related language
about the underlying generative process. translation (Pourdamghani and Knight, 2017).
As the strengths of the two model classes dif- While there has been much error analysis for
fer, so do their weaknesses: the WFSTs and the the WFST and seq2seq approaches separately, it
seq2seq models are prone to different kinds of largely focuses on the more common supervised
errors. On a higher level, it is explained by the case. We perform detailed side-by-side error analy-
structure–power trade-off: while the seq2seq mod- sis to draw high-level comparisons between finite-
els are better at recovering long-range dependen- state and seq2seq models and investigate if the
cies and their outputs look less noisy, they also intuitions from prior work would transfer to the
tend to insert and delete words arbitrarily because unsupervised transduction scenario.
their alignments are unconstrained. We attribute
the errors to the following aspects of the trade-off: 2 Tasks
Language modeling capacity: the statistical We compare the errors made by the finite-state and
character-level n-gram language models (LMs) uti- the seq2seq approaches by analyzing their perfor-
lized by finite-state approaches are much weaker mance on two unsupervised character-level trans-
than the RNN language models with unlimited left duction tasks: translating between closely related
context. While a word-level LM can improve the languages written in different alphabets and con-
performance of a WFST, it would also restrict the verting informally romanized text into its native
model’s ability to handle out-of-vocabulary words. script. Both tasks are illustrated in Figure 1.
259
manized and native script sequences (Figure 1, top in §3.4.2
left). Abjads and abugidas, where graphemes corre-
spond to consonants or consonant-vowel syllables, 3.1 Informal romanization
increasingly use many-to-one alignment in their
romanization (Figure 1, top right), which makes Source: de el menu:)
Filtered: de el menu<...>
learning the latent alignments, and therefore decod-
Target: <...> éJÖ Ï @ ø X
ing, more challenging. In this work, we experiment
with three languages spanning over three major Gloss: ‘This is the menu’
types of writing systems—Russian (alphabetic),
Arabic (abjad), and Kannada (abugida)—and com- Figure 2: A parallel example from the LDC BOLT
pare how well-suited character-level models are for Arabizi dataset, written in Latin script (source) and
learning these varying alignment patterns. converted to Arabic (target) semi-manually. Some
source-side segments (in red) are removed by an-
2.2 Related language translation notators; we use the version without such segments
(filtered) for our task. The annotators also stan-
As shown by Pourdamghani and Knight (2017)
dardize spacing on the target side, which results in
and Hauer et al. (2014), character-level models can
difference with the source (in blue).
be used effectively to translate between languages
that are closely enough related to have only small
lexical and grammatical differences, such as Ser- Arabic We use the LDC BOLT Phase 2 cor-
bian and Bosnian (Ljubešić and Klubička, 2014). pus (Bies et al., 2014; Song et al., 2014) for training
We focus on this specific language pair and tie the and testing the Arabic transliteration models (Fig-
languages to specific orthographies (Cyrillic for ure 2). The corpus consists of short SMS and chat
Serbian and Latin for Bosnian), approaching the in Egyptian Arabic represented using Latin script
task as an unsupervised orthography conversion (Arabizi). The corpus is fully parallel: each mes-
problem. However, the transliteration framing of sage is automatically converted into the standard-
the translation problem is inherently limited since ized dialectal Arabic orthography (CODA; Habash
the task is not truly character-level in nature, as et al., 2012) and then manually corrected by human
shown by the alignment lines in Figure 1 (bottom). annotators. We split and preprocess the data accord-
Even the most accurate transliteration model will ing to Ryskina et al. (2020), discarding the target
not be able to capture non-cognate word transla- (native script) and source (romanized) parallel sen-
tions (Serbian ‘nastava’ [nastava, ‘education, teach- tences to create the source and target monolingual
ing’] → Bosnian ‘obrazovanje’ [‘education’]) and the training splits respectively.
resulting discrepancies in morphological inflection
Russian We use the romanized Russian dataset
(Serbian -a endings in adjectives agreeing with
collected by Ryskina et al. (2020), augmented with
feminine ‘nastava’ map to Bosnian -o represent-
the monolingual Cyrillic data from the Taiga cor-
ing agreement with neuter ‘obrazovanje’).
pus of Shavrina and Shapovalova (2017) (Figure 3).
One major difference with the informal roman-
The romanized data is split into training, validation,
ization task is the lack of the idiosyncratic orthogra-
and test portions, and all validation and test sen-
phy: the word spellings are now consistent across
tences are converted to Cyrillic by native speaker
the data. However, since the character-level ap-
annotators. Both the romanized and the native-
proach does not fully reflect the nature of the trans-
script sequences are collected from public posts and
formation, the model will still have to learn a many-
comments on a Russian social network vk.com,
to-many cipher with highly context-dependent char-
and they are on average 3 times longer than the
acter substitutions.
messages in the Arabic dataset (Table 1). However,
although both sides were scraped from the same
3 Data
online platform, the relevant Taiga data is collected
Table 1 details the statistics of the splits used for primarily from political discussion groups, so there
all languages and tasks. Below we describe each is still a substantial domain mismatch between the
dataset in detail, explaining the differences in data source and target sides of the data.
split sizes between languages. Additional prepro- 2
Links to download the corpora and other data sources
cessing steps applied to all datasets are described discussed in this section can be found in Appendix A.
260
Train (source) Train (target) Validation Test
Sent. Char. Sent. Char. Sent. Char. Sent. Char.
Romanized Arabic 5K 104K 49K 935K 301 8K 1K 20K
Romanized Russian 5K 319K 307K 111M 227 15K 1K 72K
Romanized Kannada 10K 1M 679K 64M 100 11K 100 10K
Serbian→Bosnian 160K 9M 136K 9M 16K 923K 100 9K
Bosnian→Serbian 136K 9M 160K 9M 16K 908K 100 10K
Table 1: Dataset splits for each task and language. The source and target train data are monolingual, and
the validation and test sentences are parallel. For the informal romanization task, the source and target
sides correspond to the Latin and the original script respectively. For the translation task, the source and
target sides correspond to source and target languages. The validation and test character statistics are
reported for the source side.
261
Cyrillic and Bosnian–Latin declaration texts and critics and non-printing characters like ZWJ are
follow the preprocessing guidelines of Pour- also treated as separate vocabulary items. To filter
damghani and Knight (2017). Although we strive out foreign or archaic characters and rare diacritics,
to approximate the training and evaluation setup we restrict the alphabets to characters that cover
of their work for fair comparison, there are some 99% of the monolingual training data. After that,
discrepancies: for example, our manual alignment we add any standard alphabetical characters and
of UDHR yields 100 sentence pairs compared to numerals that have been filtered out back into the
104 of Pourdamghani and Knight (2017). We use source and target alphabets. All remaining filtered
the data to train the translation models in both di- characters are replaced with a special UNK symbol
rections, simply switching the source and target in all splits except for the target-side test.
sides from Serbian to Bosnian and vice versa.
4 Methods
3.3 Inductive bias
We perform our analysis using the finite-state and
As discussed in §1, the WFST models are less pow-
seq2seq models from prior work and experiment
erful than the seq2seq models; however, they are
with two joint decoding strategies, reranking and
also more structured, which we can use to introduce
product of experts. Implementation details and
inductive bias to aid unsupervised training. Follow-
hyperparameters are described in Appendix B.
ing Ryskina et al. (2020), we introduce informative
priors on character substitution operations (for a de- 4.1 Base models
scription of the WFST parameterization, see §4.1).
The priors reflect the visual and phonetic similar- Our finite-state model is the WFST cascade in-
ity between characters in different alphabets and troduced by Ryskina et al. (2020). The model is
are sourced from human-curated resources built composed of a character-level n-gram language
with the same concepts of similarity in mind. For model and a script conversion transducer (emis-
all tasks and languages, we collect phonetically sion model), which supports one-to-one character
similar character pairs from the phonetic keyboard substitutions, insertions, and deletions. Charac-
layouts (or, in case of the translation task, from the ter operation weights in the emission model are
default Serbian keyboard layout, which is phonetic parameterized with multinomial distributions, and
in nature due to the dual orthography standard of similar character mappings (§3.3) are used to cre-
the language). We also add some visually similar ate Dirichlet priors on the emission parameters.
character pairs by automatically pairing all sym- To avoid marginalizing over sequences of infinite
bols that occur in both source and target alphabets length, a fixed limit is set on the delay of any path
(same Unicode codepoints). For Russian, which (the difference between the cumulative number of
exhibits a greater degree of visual similarity than insertions and deletions at any timestep). Ryskina
Arabic or Kannada, we also make use of the Uni- et al. (2020) train the WFST using stochastic step-
code confusables list (different Unicode codepoints wise EM (Liang and Klein, 2009), marginalizing
but same or similar glyphs).3 over all possible target sequences and their align-
It should be noted that these automatically gen- ments with the given source sequence. To speed
erated informative priors also contain noise: key- up training, we modify their training procedure
board layouts have spurious mappings because towards ‘hard EM’: given a source sequence, we
each symbol must be assigned to exactly one key in predict the most probable target sequence under
the QWERTY layout, and Unicode-constrained vi- the model, marginalize over alignments and then
sual mappings might prevent the model from learn- update the parameters. Although the unsupervised
ing correspondences between punctuation symbols WFST training is still slow, the stepwise training
(e.g. Arabic question mark ? → ?). procedure is designed to converge using fewer data
points, so we choose to train the WFST model
3.4 Preprocessing only on the 1,000 shortest source-side training se-
quences (500 for Kannada).
We lowercase and segment all sequences into char- Our default seq2seq model is the unsupervised
acters as defined by Unicode codepoints, so dia- neural machine translation (UNMT) model of Lam-
3
Links to the keyboard layouts and the confusables list can ple et al. (2018, 2019) in the parameterization
be found in Appendix A. of He et al. (2020). The model consists of an
262
Arabic Russian Kannada
CER WER BLEU CER WER BLEU CER WER BLEU
WFST .405 .86 2.3 .202 .58 14.8 .359 .71 12.5
Seq2Seq .571 .85 4.0 .229 .38 48.3 .559 .79 11.3
Reranked WFST .398 .85 2.8 .195 .57 16.1 .358 .71 12.5
Reranked Seq2Seq .538 .82 4.6 .216 .39 45.6 .545 .78 12.6
Product of experts .470 .88 2.5 .178 .50 22.9 .543 .93 7.0
Table 2: Character and word error rates (lower is better) and BLEU scores (higher is better) for the
romanization decipherment task. Bold indicates best per column. Model combinations mostly interpolate
between the base models’ scores, although reranking yields minor improvements in character-level and
word-level metrics for the WFST and seq2seq respectively. Note: base model results are not intended as a
direct comparison between the WFST and seq2seq, since they are trained on different amounts of data.
srp→bos bos→srp
CER WER BLEU CER WER BLEU
WFST .314 .50 25.3 .319 .52 25.5
Seq2Seq .375 .49 34.5 .395 .49 36.3
Reranked WFST .314 .49 26.3 .317 .50 28.1
Reranked Seq2Seq .376 .48 35.1 .401 .47 37.0
Product of experts .329 .54 24.4 .352 .66 20.6
(Pourdamghani and Knight, 2017) — — 42.3 — — 39.2
(He et al., 2020) .657 .81 5.6 .693 .83 4.7
Table 3: Character and word error rates (lower is better) and BLEU scores (higher is better) for the related
language translation task. Bold indicates best per column. The WFST and the seq2seq have comparable
CER and WER despite the WFST being trained on up to 160x less source-side data (§4.1). While none
of our models achieve the scores reported by Pourdamghani and Knight (2017), they all substantially
outperform the subword-level model of He et al. (2020). Note: base model results are not intended as a
direct comparison between the WFST and seq2seq, since they are trained on different amounts of data.
LSTM (Hochreiter and Schmidhuber, 1997) en- trained to translate in both directions simultane-
coder and decoder with attention, trained to map ously. Therefore, we reuse the same seq2seq model
sentences from each domain into a shared latent for both directions of the translation task, but train
space. Using a combined objective, the UNMT a separate finite-state model for each direction.
model is trained to denoise, translate in both direc-
tions, and discriminate between the latent represen- 4.2 Model combinations
tation of sequences from different domains. Since
the sufficient amount of balanced data is crucial The simplest way to combine two independently
for the UNMT performance, we train the seq2seq trained models is reranking: using one model to
model on all available data on both source and tar- produce a list of candidates and rescoring them ac-
get sides. Additionally, the seq2seq model decides cording to another model. To generate candidates
on early stopping by evaluating on a small parallel with a WFST, we apply the n–shortest paths algo-
validation set, which our WFST model does not rithm (Mohri and Riley, 2002). It should be noted
have access to. that the n–best list might contain duplicates since
each path represents a specific source–target char-
The WFST model treats the target and source acter alignment. The length constraints encoded in
training data differently, using the former to train the WFST also restrict its capacity as a reranker:
the language model and the latter for learning the beam search in the UNMT model may produce
emission parameters, while the UNMT model is hypotheses too short or long to have a non-zero
263
Input svako ima pravo da slobodno uqestvuje u kulturnom ivotu zajednice, da uiva
u umetnosti i da uqestvuje u nauqnom napretku i u dobrobiti koja otuda
proistiqe.
Ground truth svako ima pravo da slobodno sudjeluje u kulturnom životu zajednice, da uživa u umjetnosti i da
učestvuje u znanstvenom napretku i u njegovim koristima.
WFST svako ima pravo da slobodno učestvuje u kulturnom životu s jednice , da uživa u m etnosti i da
učestvuje u naučnom napretku i u dobrobiti koja otuda pr ističe .
Reranked WFST svako ima pravo da slobodno učestvuje u kulturnom životu s jednice , da uživa u m etnosti i da
učestvuje u naučnom napretku i u dobrobiti koja otuda pr ističe .
Seq2Seq svako ima pravo da slobodno učestvuje u kulturnom životu zajednice , da
učestvuje u naučnom napretku i u dobrobiti koja otuda proističe .
Reranked Seq2Seq svako ima pravo da slobodno učestvuje u kulturnom životu zajednice , da uživa u umjetnosti i da
učestvuje u naučnom napretku i u dobrobiti koja otuda proističe
Product of experts svako ima pravo da slobodno učestvuje u kulturnom za u s ajednice , da živa u umjetnosti i da
učestvuje u naučnom napretku i u dobro j i koja otuda proisti
Subword Seq2Seq s ami ima pravo da slobodno u tiče na srpskom nivou vlasti da razgovaraju u bosne i da djeluje u
med̄unarodnom turizmu i na buducnosti koja muža decisno .
Table 4: Different model outputs for a srp→bos translation example. Prediction errors are highlighted
in red. Correctly transliterated segments that do not match the ground truth (e.g. due to paraphrasing)
are shown in yellow. Here the WFST errors are substitutions or deletions of individual characters, while
the seq2seq drops entire words from the input (§5 #4). The latter problem is solved by reranking with a
WFST for this example. The seq2seq model with subword tokenization (He et al., 2020) produces mostly
hallucinated output (§5 #2). Example outputs for all other datasets can be found in the Appendix.
probability under the WFST. of Pourdamghani and Knight (2017) also use the
Our second approach is a product-of-experts- same respective sources, we cannot account for
style joint decoding strategy (Hinton, 2002): tokenization differences that could affect the scores
we perform beam search on the WFST lattice, reported by the authors.
reweighting the arcs with the output distribution of
the seq2seq decoder at the corresponding timestep. 5 Results and analysis
For each partial hypothesis, we keep track of the
Tables 2 and 3 present our evaluation of the two
WFST state s and the partial input and output se-
base models and three decoding-time model com-
quences x1:k and y1:t .4 When traversing an arc
binations on the romanization decipherment and
with input label i ∈ {xk+1 , } and output label o,
related language translation tasks respectively. For
we multiply the arc weight by the probability of
each experiment, we report character error rate,
the neural model outputting o as the next character:
word error rate, and BLEU (see Appendix C). The
pseq2seq (yt+1 = o|x, y1:t ). Transitions with o =
results for the base models support what we show
(i.e. deletions) are not rescored by the seq2seq. We
later in this section: the seq2seq model is more
group hypotheses by their consumed input length
likely to recover words correctly (higher BLEU,
k and select n best extensions at each timestep.
lower WER), while the WFST is more faithful on
4.3 Additional baselines character level and avoids word-level substitution
errors (lower CER). Example predictions can be
For the translation task, we also compare to prior found in Table 4 and in the Appendix.
unsupervised approaches of different granularity: Our further qualitative and quantitative findings
the deep generative style transfer model of He et al. are summarized in the following high-level take-
(2020) and the character- and word-level WFST aways:
decipherment model of Pourdamghani and Knight
(2017). The former is trained on the same training #1: Model combinations still suffer from search
set tokenized into subword units (Sennrich et al., issues. We would expect the combined decod-
2016), and we evaluate it on our UDHR test set ing to discourage all errors common under one
for fair comparison. While the train and test data model but not the other, improving the performance
4
Due to insertions and deletions in the emission model, k by leveraging the strengths of both model classes.
and t might differ; epsilon symbols are not counted. However, as Tables 2 and 3 show, they instead
264
WFST Seq2Seq
Figure 6: Highest-density sub-
matrices of the two base mod-
els’ character confusion matrices,
computed in the Russian roman-
ization task. White cells repre-
sent zero elements. The WFST
confusion matrix (left) is notice-
ably sparser than the seq2seq one
(right), indicating more repetitive
errors. # symbol stands for UNK.
mostly interpolate between the scores of the two This observation also aligns with the findings of
base models. In the reranking experiments, we find the recent work on language modeling complex-
that this is often due to the same base model er- ity (Park et al., 2021; Mielke et al., 2019). For
ror (e.g. the seq2seq model hallucinating a word many languages, including several Slavic ones re-
mid-sentence) repeating across all the hypotheses lated to the Serbian–Bosnian pair, a character-level
in the final beam. This suggests that successful language model yields lower surprisal than the one
reranking would require a much larger beam size trained on BPE units, suggesting that the effect
or a diversity-promoting search mechanism. might also be explained by the character tokeniza-
Interestingly, we observe that although adding tion making the language easier to language-model.
a reranker on top of a decoder does improve per-
formance slightly, the gain is only in terms of the #3: WFST model makes more repetitive errors.
metrics that the base decoder is already strong at— Although two of our evaluation metrics, CER and
character-level for reranked WFST and word-level WER, are based on edit distance, they do not dis-
for reranked seq2seq—at the expense of the other tinguish between the different types of edits (sub-
scores. Overall, none of our decoding strategies stitutions, insertions and deletions). Breaking them
achieves best results across the board, and no model down by the edit operation, we find that while both
combination substantially outperforms both base models favor substitutions on both word and char-
models in any metric. acter levels, insertions and deletions are more fre-
quent under the neural model (43% vs. 30% of all
#2: Character tokenization boosts performance edits on the Russian romanization task). We also
of the neural model. In the past, UNMT-style find that the character substitution choices of the
models have been applied to various unsupervised neural model are more context-dependent: while
sequence transduction problems. However, since the total counts of substitution errors for the two
these models were designed to operate on word or models are comparable, the WFST is more likely
subword level, prior work assumes the same tok- to repeat the same few substitutions per character
enization is necessary. We show that for the tasks type. This is illustrated by Figure 6, which visual-
allowing character-level framing, such models in izes the most populated submatrices of the confu-
fact respond extremely well to character input. sion matrices for the same task as heatmaps. The
Table 3 compares the UNMT model trained on WFST confusion matrix is noticeably more sparse,
characters with the seq2seq style transfer model with the same few substitutions occurring much
of He et al. (2020) trained on subword units. The more frequently than others: for example, WFST
original paper shows improvement over the UNMT often mistakes for a and rarely for other char-
baseline in the same setting, but simply switching acters, while the neural model’s substitutions of
to character-level tokenization without any other are distributed closer to uniform. This suggests
changes results in a 30 BLEU points gain for ei- that the WFST errors might be easier to correct
ther direction. This suggests that the tokenization with rule-based postprocessing. Interestingly, we
choice could act as an inductive bias for seq2seq did not observe the same effect for the translation
models, and character-level framing could be use- task, likely due to a more constrained nature of the
ful even for tasks that are not truly character-level. orthography conversion.
265
WFST Seq2Seq
Figure 7: Character error rate per
word for the WFST (left) and seq2seq
800 800
(right) bos→srp translation outputs.
Number of words
266
References Kyle Gorman. 2016. Pynini: A Python library for
weighted finite-state grammar compilation. In Pro-
Roee Aharoni and Yoav Goldberg. 2017. Morphologi- ceedings of the SIGFSM Workshop on Statistical
cal inflection generation with hard monotonic atten- NLP and Weighted Automata, pages 75–80, Berlin,
tion. In Proceedings of the 55th Annual Meeting of Germany. Association for Computational Linguis-
the Association for Computational Linguistics (Vol- tics.
ume 1: Long Papers), pages 2004–2015, Vancouver,
Canada. Association for Computational Linguistics. Nizar Habash, Mona Diab, and Owen Rambow. 2012.
Conventional orthography for dialectal Arabic. In
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Woj- Proceedings of the Eighth International Conference
ciech Skut, and Mehryar Mohri. 2007. OpenFst: A on Language Resources and Evaluation (LREC’12),
general and efficient weighted finite-state transducer pages 711–718, Istanbul, Turkey. European Lan-
library. In Proceedings of the Ninth International guage Resources Association (ELRA).
Conference on Implementation and Application of
Automata, (CIAA 2007), volume 4783 of Lecture Bradley Hauer, Ryan Hayward, and Grzegorz Kon-
Notes in Computer Science, pages 11–23. Springer. drak. 2014. Solving substitution ciphers with com-
http://www.openfst.org. bined language models. In Proceedings of COLING
2014, the 25th International Conference on Compu-
Sowmya V. B., Monojit Choudhury, Kalika Bali, tational Linguistics: Technical Papers, pages 2314–
Tirthankar Dasgupta, and Anupam Basu. 2010. Re- 2325, Dublin, Ireland. Dublin City University and
source creation for training and testing of translit- Association for Computational Linguistics.
eration systems for Indian languages. In Proceed-
ings of the Seventh International Conference on Lan- Junxian He, Xinyi Wang, Graham Neubig, and Taylor
guage Resources and Evaluation (LREC’10), Val- Berg-Kirkpatrick. 2020. A probabilistic formulation
letta, Malta. European Language Resources Associ- of unsupervised text style transfer. In International
ation (ELRA). Conference on Learning Representations.
Felix Hieber and Stefan Riezler. 2015. Bag-of-words
Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen forced decoding for cross-lingual information re-
Grimes, Haejoong Lee, Jonathan Wright, Stephanie trieval. In Proceedings of the 2015 Conference of
Strassel, Nizar Habash, Ramy Eskander, and Owen the North American Chapter of the Association for
Rambow. 2014. Transliteration of Arabizi into Ara- Computational Linguistics: Human Language Tech-
bic orthography: Developing a parallel annotated nologies, pages 1172–1182, Denver, Colorado. As-
Arabizi-Arabic script SMS/chat corpus. In Proceed- sociation for Computational Linguistics.
ings of the EMNLP 2014 Workshop on Arabic Nat-
ural Language Processing (ANLP), pages 93–103, G. E. Hinton. 2002. Training products of experts by
Doha, Qatar. Association for Computational Lin- minimizing contrastive divergence. Neural Compu-
guistics. tation, 14(8):1771–1800.
Eugene Charniak and Mark Johnson. 2005. Coarse- Sepp Hochreiter and Jürgen Schmidhuber. 1997.
to-fine n-best parsing and MaxEnt discriminative Long short-term memory. Neural computation,
reranking. In Proceedings of the 43rd Annual Meet- 9(8):1735–1780.
ing of the Association for Computational Linguis-
tics (ACL’05), pages 173–180, Ann Arbor, Michi- Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
gan. Association for Computational Linguistics. Yejin Choi. 2020. The curious case of neural text de-
generation. In International Conference on Learn-
Kareem Darwish. 2014. Arabizi detection and conver- ing Representations.
sion to Arabic. In Proceedings of the EMNLP 2014 Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin,
Workshop on Arabic Natural Language Processing and Brian Roark. 2021. Finite-state script normal-
(ANLP), pages 217–224, Doha, Qatar. Association ization and processing utilities: The Nisaba Brahmic
for Computational Linguistics. library. In Proceedings of the 16th Conference of
the European Chapter of the Association for Compu-
Jason Eisner. 2002. Parameter estimation for prob-
tational Linguistics: System Demonstrations, pages
abilistic finite-state transducers. In Proceedings
14–23, Online. Association for Computational Lin-
of the 40th Annual Meeting of the Association for
guistics.
Computational Linguistics, pages 1–8, Philadelphia,
Pennsylvania, USA. Association for Computational Kevin Knight and Jonathan Graehl. 1998. Ma-
Linguistics. chine transliteration. Computational Linguistics,
24(4):599–612.
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff.
2012. Building large monolingual dictionaries at Kevin Knight, Anish Nair, Nishit Rathod, and Kenji
the Leipzig corpora collection: From 100 to 200 lan- Yamada. 2006. Unsupervised analysis for deci-
guages. In Proceedings of the Eighth International pherment problems. In Proceedings of the COL-
Conference on Language Resources and Evaluation ING/ACL 2006 Main Conference Poster Sessions,
(LREC’12), pages 759–765, Istanbul, Turkey. Euro- pages 499–506, Sydney, Australia. Association for
pean Language Resources Association (ELRA). Computational Linguistics.
267
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Mehryar Mohri and Michael Riley. 2002. An efficient
Callison-Burch, Marcello Federico, Nicola Bertoldi, algorithm for the n-best-strings problem. In Seventh
Brooke Cowan, Wade Shen, Christine Moran, International Conference on Spoken Language Pro-
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra cessing.
Constantin, and Evan Herbst. 2007. Moses: Open
source toolkit for statistical machine translation. In Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Proceedings of the 45th Annual Meeting of the As- Jing Zhu. 2002. BLEU: A method for automatic
sociation for Computational Linguistics Companion evaluation of machine translation. In Proceedings of
Volume Proceedings of the Demo and Poster Ses- the 40th Annual Meeting of the Association for Com-
sions, pages 177–180, Prague, Czech Republic. As- putational Linguistics, pages 311–318, Philadelphia,
sociation for Computational Linguistics. Pennsylvania, USA. Association for Computational
Linguistics.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer,
and Marc’Aurelio Ranzato. 2018. Unsupervised ma- Hyunji Hayley Park, Katherine J. Zhang, Coleman Ha-
chine translation using monolingual corpora only. ley, Kenneth Steimel, Han Liu, and Lane Schwartz.
In International Conference on Learning Represen- 2021. Morphology matters: A multilingual lan-
tations. guage modeling analysis. Transactions of the Asso-
ciation for Computational Linguistics, 9:261–276.
Guillaume Lample, Sandeep Subramanian, Eric Smith,
Ludovic Denoyer, Marc’Aurelio Ranzato, and Y- Martin Paulsen. 2014. Translit: Computer-mediated
Lan Boureau. 2019. Multiple-attribute text rewrit- digraphia on the Runet. Digital Russia: The Lan-
ing. In International Conference on Learning Rep- guage, Culture and Politics of New Media Commu-
resentations. nication.
Percy Liang and Dan Klein. 2009. Online EM for Nima Pourdamghani and Kevin Knight. 2017. Deci-
unsupervised models. In Proceedings of Human phering related languages. In Proceedings of the
Language Technologies: The 2009 Annual Confer- 2017 Conference on Empirical Methods in Natu-
ence of the North American Chapter of the Associa- ral Language Processing, pages 2513–2518, Copen-
tion for Computational Linguistics, pages 611–619, hagen, Denmark. Association for Computational
Boulder, Colorado. Association for Computational Linguistics.
Linguistics.
Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner.
Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley, and 2016. Weighting finite-state transductions with neu-
Jason Eisner. 2019. Neural finite-state transducers: ral context. In Proceedings of the 2016 Conference
Beyond rational relations. In Proceedings of the of the North American Chapter of the Association
2019 Conference of the North American Chapter of for Computational Linguistics: Human Language
the Association for Computational Linguistics: Hu- Technologies, pages 623–633, San Diego, California.
man Language Technologies, Volume 1 (Long and Association for Computational Linguistics.
Short Papers), pages 272–283, Minneapolis, Min-
nesota. Association for Computational Linguistics. Sujith Ravi and Kevin Knight. 2009. Learning
Nikola Ljubešić and Filip Klubička. 2014. phoneme mappings for transliteration without paral-
{bs,hr,sr}WaC - web corpora of Bosnian, Croa- lel data. In Proceedings of Human Language Tech-
tian and Serbian. In Proceedings of the 9th Web as nologies: The 2009 Annual Conference of the North
Corpus Workshop (WaC-9), pages 29–35, Gothen- American Chapter of the Association for Computa-
burg, Sweden. Association for Computational tional Linguistics, pages 37–45, Boulder, Colorado.
Linguistics. Association for Computational Linguistics.
Peter Makarov, Tatiana Ruzsics, and Simon Clematide. Brian Roark, Richard Sproat, Cyril Allauzen, Michael
2017. Align and copy: UZH at SIGMORPHON Riley, Jeffrey Sorensen, and Terry Tai. 2012. The
2017 shared task for morphological reinflection. In OpenGrm open-source finite-state grammar soft-
Proceedings of the CoNLL SIGMORPHON 2017 ware libraries. In Proceedings of the ACL 2012 Sys-
Shared Task: Universal Morphological Reinflection, tem Demonstrations, pages 61–66, Jeju Island, Ko-
pages 49–57, Vancouver. Association for Computa- rea. Association for Computational Linguistics.
tional Linguistics.
Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov,
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, and
Roark, and Jason Eisner. 2019. What kind of lan- Keith Hall. 2020. Processing South Asian languages
guage is hard to language-model? In Proceedings of written in the Latin script: The Dakshina dataset.
the 57th Annual Meeting of the Association for Com- In Proceedings of the 12th Language Resources
putational Linguistics, pages 4975–4989, Florence, and Evaluation Conference, pages 2413–2423, Mar-
Italy. Association for Computational Linguistics. seille, France. European Language Resources Asso-
ciation.
Mehryar Mohri. 2009. Weighted automata algorithms.
In Handbook of weighted automata, pages 213–254. Maria Ryskina, Matthew R. Gormley, and Taylor Berg-
Springer. Kirkpatrick. 2020. Phonetic and visual priors for
268
decipherment of informal Romanization. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8308–
8319, Online. Association for Computational Lin-
guistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words
with subword units. In Proceedings of the 54th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1715–
1725, Berlin, Germany. Association for Computa-
tional Linguistics.
Tatiana Shavrina and Olga Shapovalova. 2017. To
the methodology of corpus construction for machine
learning: Taiga syntax tree corpus and parser. In
Proc. CORPORA 2017 International Conference,
pages 78–84, St. Petersburg.
Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin
Walker, Jonathan Wright, Jennifer Garland, Dana
Fore, Brian Gainor, Preston Cabe, Thomas Thomas,
Brendan Callahan, and Ann Sawyer. 2014. Collect-
ing natural SMS and chat conversations in multiple
languages: The BOLT phase 2 corpus. In Proceed-
ings of the Ninth International Conference on Lan-
guage Resources and Evaluation (LREC’14), pages
1699–1704, Reykjavik, Iceland. European Language
Resources Association (ELRA).
Felix Stahlberg and Bill Byrne. 2019. On NMT search
errors and model errors: Cat got your tongue? In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 3356–
3362, Hong Kong, China. Association for Computa-
tional Linguistics.
Shijie Wu and Ryan Cotterell. 2019. Exact hard mono-
tonic attention for character-level transduction. In
Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 1530–
1537, Florence, Italy. Association for Computational
Linguistics.
Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021.
Applying the transformer to character-level trans-
duction. In Proceedings of the 16th Conference of
the European Chapter of the Association for Com-
putational Linguistics: Main Volume, pages 1901–
1907, Online. Association for Computational Lin-
guistics.
Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018.
Hard non-monotonic attention for character-level
transduction. In Proceedings of the 2018 Confer-
ence on Empirical Methods in Natural Language
Processing, pages 4425–4438, Brussels, Belgium.
Association for Computational Linguistics.
Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and
Taylor Berg-Kirkpatrick. 2018. Unsupervised text
style transfer using language models as discrimina-
tors. In NeurIPS, pages 7298–7309.
269
A Data download links beam search and n–shortest path algorithm for the
UNMT and WFST respectively. Product of experts
The romanized Russian and Arabic data and pre-
decoding is also performed with beam size 5.
processing scripts can be downloaded here. This
repository also contains the relevant portion of the C Metrics
Taiga dataset, which can be downloaded in full at
this link. The romanized Kannada data was down- The character error rate (CER) and word error rate
loaded from the Dakshina dataset. (WER) as measured as the Levenshtein distance
The scripts to download the Serbian and Bosnian between the hypothesis and reference divided by
Leipzig corpora data can be found here. The reference length:
UDHR texts were collected from the corresponding dist(h, r)
pages: Serbian, Bosnian. ER(h, r) =
len(r)
The keyboard layouts used to construct the
phonetic priors are collected from the following with both the numerator and the denominator mea-
sources: Arabic 1, Arabic 2, Russian, Kannada, sured in characters and words respectively.
Serbian. The Unicode confusables list used for the We report BLEU-4 score (Papineni et al., 2002),
Russian visual prior can be found here. measured using the Moses toolkit script.7 For both
BLEU and WER, we split sentences into words
B Implementation using the Moses tokenizer (Koehn et al., 2007).
WFST We reuse the unsupervised WFST imple-
mentation of Ryskina et al. (2020),5 which utilizes
the OpenFst (Allauzen et al., 2007) and Open-
Grm (Roark et al., 2012) libraries. We use the
default hyperparameter settings described by the
authors (see Appendix B in the original paper). We
keep the hyperparameters unchanged for the trans-
lation experiment and set the maximum delay value
to 2 for both translation directions.
UNMT We use the PyTorch UNMT implementa-
tion of He et al. (2020)6 which incorporates im-
provements introduced by Lample et al. (2019)
such as the addition of a max-pooling layer. We
use a single-layer LSTM (Hochreiter and Schmid-
huber, 1997) with hidden state size 512 for both
the encoder and the decoder and embedding dimen-
sion 128. For the denoising autoencoding loss, we
adopt the default noise model and hyperparameters
as described by Lample et al. (2018). The autoen-
coding loss is annealed over the first 3 epochs. We
predict the output using greedy decoding and set
the maximum output length equal to the length of
the input sequence. Patience for early stopping is
set to 10.
Model combinations Our joint decoding imple-
mentations rely on PyTorch and the Pynini finite-
state library (Gorman, 2016). In reranking, we
rescore n = 5 best hypotheses produced using
5
https://github.com/ryskina/
romanization-decipherment 7
https://github.com/moses-smt/
6
https://github.com/cindyxinyiwang/ mosesdecoder/blob/master/scripts/
deep-latent-sequence-model generic/multi-bleu.perl
270
Input kongress ne odobril biudjet dlya osuchestvleniye
"bor’bi s kommunizmom" v yuzhniy amerike.
Ground truth kongress ne odobril bdet dl kongress ne odobril bjudžet dlja osuščestvlenija
osuwestvleni "bor~by s kommunizmom" "bor’by s kommunizmom" v južnoj amerike.
v noĭ amerike.
WFST kongress ne odobril viu d et dl a kongress ne odobril viu d et dl a osu sč estvleni y e
osu sq estvleni y e "bor # b i s "bor # b i s kommunizmom" v uuz n ani amerike.
kommunizmom" v uuz n ani amerike.
Reranked WFST kongress ne odobril vi d et d e l a kongress ne odobril vi d et d e l a osu sč estvleni y e
osu sq estvleni y e "bor # b i s "bor # b i s kommunizmom" v uuz n ani amerike.
kommunizmom" v uuz n ani amerike.
Seq2Seq kongress ne odobril kongress ne odobril b y udivitel’no
b y udivitel~no s s kommunizmom" v južn y j amerike.
kommunizmom" v n y ĭ amerike.
Reranked Seq2Seq kongress ne odobril bdet dl kongress ne odobril bjudžet dlja osuščestvleni e
osuwestvleni e "bor~by s kommunizmom" "bor’by s kommunizmom" v južn y j amerike.
v n y ĭ amerike.
Product of experts kongress ne odobril b i d et dl a kongress ne odobril b i d et dlja a osuščestvleni y e
osuwestvleni y e "bor~by s "bor’by s kommunizmom" v uuz n nik ameri
kommunizmom" v uuz n nik ameri
Table 5: Different model outputs for a Russian transliteration example (left column—Cyrillic, right—
scientific transliteration). Prediction errors are shown in red. Correctly transliterated segments that do not
match the ground truth because of spelling standardization in annotation are in yellow. # stands for UNK.
Table 6: Different model outputs for an Arabizi transliteration example (left column—Arabic, right—
Buckwalter transliteration).
ಮೂಲ +,-$Prediction errors are highlighted
.ನ/$0 DDR3ಯನು2 ಬಳಸಲು in red in the romanized versions. Correctly
ಮೂಲ +,-$.ನ/$0 DDR3ಯನು2 ಬಳಸಲು
transliterated segments that do not match the
ಮೂಲ +,-$.ನ/$0 DDR3ಯನು2 ಬಳಸಲು ground truth because of spelling standardization during
ಮೂಲ +,-$ . ನ/$ 0 DDR3ಯನು2 ಬಳಸಲು
annotation are highlighted
ಮೂಲ
ಮೂಲ+,-$in .
+,-$yellow.
.ನ/$
ನ/$0 0 DDR3ಯನು2
DDR3ಯನು2 ಬಳಸಲು
Table 7: Different model outputs for a Kannada transliteration example (left column—Kannada, right—
ISO 15919 transliterations). The ISO romanization is generated using the Nisaba library (Johny et al.,
2021). Prediction errors are highlighted in red in the romanized versions.
271
Finite-state Model of Shupamem Reduplication
272
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 272–281
August 5, 2021. ©2021 Association for Computational Linguistics
alternative, in which we use the MT FST to han- Transl. Lemma Red form
dle both reduplication and tones. It is important H HL L
to emphasize that all of the machines we discuss ‘crab’ kám kâm kàm
Nouns
are deterministic, which serves as another piece of L LH L
evidence that even such complex processes like full ‘game’ kàm kǎm kàm
reduplication can be modelled with deterministic H HŤH
finite-state technology (Chandlee and Heinz, 2012; ‘fry’ ká ká kŤá
Verbs
Heinz, 2018). LH LHŤH
This paper is structured as follows. First, we ‘peel’ kǎ kǎ kŤá
will briefly summarize the linguistic phenomena
observed in Shupamem reduplication (Section 2). Table 1: Nominal and verbal reduplication in Shu-
pamem
We then provide a formal description of the 2-way
(Section 3) and MT FSTs (Section 4). We propose
a synthesis of the 1-way, 2-way and MT FSTs in the purpose of this paper, we provide only a sum-
Section 5 and further illustrate them using rele- mary of those tonal alternations in Table 2.
vant examples from Shupamem in Section 6. In
Section 7 we discuss a possible alternative to the Red. tones Output
model which uses only MT FSTs. Finally, in Sec- Nouns HL L HL H
tion 8 we show that the proposed model works for LH L LH H
other tonal languages as well, and we conclude our Verbs HŤH HŤH
contributions. LHŤH HL LH
273
(á,á,+1) (á,λ,−1) (á,á,+1)
Figure 1: 2-way FST for total reduplication of ká ‘fry.IMP → ká ká ‘fry.IMP as opposed to boiling’
those factors. In the next two sections, we will pro- string, then the FST faithfully ‘copies’ the input
vide a brief formal introduction to 2-way FSTs and string while scanning it from left to right. In con-
MT-FSTs, and explain how they correctly model trast, while scanning the string from right to left,
full reduplication and autosegmental representa- it outputs nothing (λ), and it then copies the string
tion, respectively. again from left to right.
In this paper, we use an orthographic represen- Figure 1 illustrates a deterministic 2-way FST
tation for Shupamem which uses diacritics to in- that reduplicates ká ‘fry.IMP’; readers are referred
dicate tone. Shupamem does not actually use this to Dolatian and Heinz (2020) for formal defini-
orthography; however, we are interested in model- tions. The key difference between deterministic
ing the entire morpho-phonology of the language, 1-way FSTs and deterministic 2-way FSTs are the
independently of choices made for the orthography. addition of the ‘direction’ parameters {+1, 0, −1}
Furthermore, many languages do use diacritics to on the transitions which tell the FST to advance to
indicate tone, including the Volta-Niger languages the next symbol on the input tape (+1), stay on the
Yoruba and Igbo, as well as Dschang, a Grassfields same symbol (0), or return to the previous symbol
language closely related to Shupamem. (For a dis- (-1). Deterministic 1-way FSTs can be thought of
cussion of the orthography of Cameroonian lan- as deterministic 2-way FSTs where transitions are
guages, with a special consideration of tone, see all (+1).
(Bird, 2001).) Diacritics are also used to write The input to this machine is okán. The o and
tone in non-African languages, such as Vietnamese. n symbols mark beginning and end of a string, and
Therefore, this paper is also relevant to NLP ap- ∼ indicates the boundary between the first and the
plications for morphological analysis and genera- second copy. None of those symbols are essen-
tion in languages whose orthography marks tones tial for the model, nevertheless they facilitate the
with diacritics: the automata we propose could be transitions. For example, when the machine reads
used to directly model morpho-phonological com- n, it transitions from state q1 to q2 and reverses
putational problems for the orthography of such the direction of the read head. After outputting
languages. the first copy (state q1 ) and rewinding (state q2 ),
the machine changes to state q3 when it scans the
3 2-way FSTs left boundary symbol o and outputs ∼ to indicate
that another copy will be created. In this partic-
As Roark and Sproat (2007) point out, almost all
ular example, not marking morpheme boundary
morpho-pholnological processes can be modelled
would not affect the outcome. However, in Section
with 1-way FSTs with the exception of full redu-
5, where we propose the fully-fledged model, it
plication, whose output is not a regular language.
will be crucial to somehow separate the first from
One way to increase the expressivity of 1-way FST
second copy.
is to allow the read head of the machine to move
back and forth on the input tape. This is exactly 4 Multitape FSTs
what 2-way FST does (Rabin and Scott, 1959), and
Dolatian and Heinz (2020) explain how these trans- Multiple-tape FSTs are machines which operate in
ducers model full reduplication not only effectively, the exact same was as 1-way FST, with one key
but more faithfully to linguistic generalizations. difference: they can read the input and/or write the
Similarly to a 1-way FST, when a 2-way FST output on multiple tapes. Such a transducer can
reads an input, it writes something on the output operate in an either synchronous or asynchronous
tape. If the desired output is a fully reduplicated manner, such that the input will be read on all tapes
274
and an output produced simultaneously, or the ma- ple, (L, V) → V̀) and transitions to state q2 (if the
chine will operate on the input tapes one by one. symbol on the T-tape is H) or q3 (for L, as in our ex-
MT-FST can take a single (‘linear’) string as an ample). In states q2 and q3 , consonants are simply
input and output multiple strings on multiple tapes output as in state q1 , but for vowels, one of three
or it can do the reverse (Rabin and Scott, 1959; conditions may occur: the read head on the Tonal
Fischer, 1965). tape may be H or L, in which case the automaton
To illustrate this idea, let us look at Shupamem transitions (if not already there) to q2 (for H) or q3
noun màpàm ‘coat’. It has been argued that Shu- (for L), and outputs the appropriate orthographic
pamem nouns with only L surface tones will have symbol. But if on the Tonal tape the read head is
the L tone present in the UR (Markowska, 2019, on the right boundary marker n, we are in a case
2020). Moreover, in order to avoid violating the where there are more vowels in the Segmental tape
Obligatory Contour Principle (OCP) (Leben, 1973), than tones in the Tonal tape. This is when the OCP
which prohibits two identical consecutive elements determines the interpretation: all vowels get the
(tones) in the UR of a morpheme, we will assume tone of the last symbol on the Tone tier (which we
that only one L tone is present in the input. Con- remember as states q2 and q3 ). In our example,
sequently, the derivation will look as shown in Ta- this is an L. Finally, when the Segmental tape also
ble 3. reaches the right boundary marker n, the machine
transitions to the final state q4 . This (‘linearizing’)
Input: T-tape L MT-FST consists of 4 states and shows how OCP
Input: S-tape mapam effects can be handled with asynchronous multi-
Output: Single tape màpàm tape FSTs. Note that when there are more tones
on the Tonal tier than vowels on the Segmental tier,
Table 3: Representation of MT-FST for màpàm ‘coat’
they are simply ignored. We refer readers to Dola-
tian and Rawski (2020b) for formal definitions of
Separating tones from segments in this manner, these MT transducers.
i.e. by representing tones on the T(one)-tape and
We are also interested in the inverse process –
segments on the S(segmental)-tape, faithfully re-
that is, a finite-state machine that in the example
sembles linguistic understanding of the UR of a
above would take a single input string [màpàm]
word. The surface form màpàm has only one L
and produce two output strings [L] and [mapam].
tone present in the UR, which then spreads to all
While multitape FSTs are generally conceived as
TBUs, which happen to be vowels in Shupamem,
relations over n-ary relations over strings, Dolatian
if no other tone is present.
and Rawski (2020b) define their machines deter-
An example of a multi-tape machine is pre-
ministically with n input tapes and a single output
sented in Figure 2. For better readability, we
tape. We generalize their definition below.
introduce a generalized symbols for vowels (V)
and consonants (C), so that the input alphabet is Similarly to spreading processes described
Σo = {(C, V ), (L, H)} ∪ {o, n}, and the output above, separating tones from segments give us a lot
alphabet is Γ = {C, V́ , V̀ }. The machine operates of benefits while accounting for tonal alternations
on 2 input tapes and writes the output on a single taking place in nominal and verbal reduplication in
tape. Therefore, we could think of such machine Shupamem. First of all, functions such as Opposite
as a linearizer. The two input tapes represent the Tone Insertion (OTI) will apply solely at the tonal
Tonal and Segmental tiers and so we label them T level, while segments can be undergoing other op-
and S, respectively. We illustrate the functioning erations at the same time (recall that MT-FSTs can
of the machine using the example (mapam, L) → operate on some or all tapes simultaneously). Sec-
màpàm ‘coat’. While transitioning from state q0 to ondly, representing tones separately from segments
q1 , the output is an empty string since the left edge make tonal processes local, and therefore all the
marker is being read on both tapes simultaneously. alternations can be expresses with less powerful
In state q1 , when a consonant is being read, the functions (Chandlee, 2017).
machine outputs the exact same consonant on the Now that we presented the advantages of MT-
output tape. However, when the machine reaches a FSTs, and the need for utilizing 2-way FSTs to
vowel, it outputs a vowel with a tone that is being model full reduplication, we combine those ma-
read at the same time on the T-tape (in our exam- chines to account for all morphophonological pro-
275
T:(H,+1) T:(H,0)
S:(V,+1) S:(C,+1)
O:V́ O:C
T:(n,0) T:(n,0)
S:(V,+1) S:(C,+1)
O:V́ O:C
S:(o,+1)
T:(H,+1) T:(L,+1)
O:λ
start q0 q1 S:(V,+1) S:(V,+1) q4
O:V́ O:V̀
T:(L,+1) T:(L,0)
S:(V,+1) S:(C,+1)
O:V̀ O:C
T:(n,0) T:(n,0)
S:(V,+1) S:(C,+1)
O:V̀ O:C
cesses described in Section 2. erally use the index i to range from 1 to n and the
index j to range from 1 to m.
5 Deterministic 2-Way Multi-tape FST We define Deterministic 2-Way n, m Multitape
FST (2-way (n, m) MT FST for short) for n, m ∈
Before we define Deterministic 2-Way Multi-tape
N by synthesizing the definitions of Dolatian and
FST (or 2-way MT FST for short) we introduce
Heinz (2020) and Dolatian and Rawski (2020b);
some notation. An alphabet Σ is a finite set of
n, m refer to the number of input tapes and output
symbols and Σ∗ denotes the set of all strings of
tapes, respectively. A Deterministic 2-Way n, m
finite length whose elements belong to Σ. We use #» #»
Multitape FST is a six-tuple (Q, Σ, Γ , q0 , F, δ),
λ to denote the empty string. For each n ∈ N, an
where
n-string is a tuple hw1 , . . . wn i where each wi is
a string belonging to Σ∗i (1 ≤ i ≤ n). These n #»
• Σ = hΣ1 . . . Σn i is a tuple of n input alpha-
alphabets may contain distinct symbols or not. We bets that include the boundary symbols, i.e.,
write w#» to indicate a n-string and #» λ to indicate the
#» {o, n} ⊂ Σi , 1 ≤ i ≤ n,
n-string where each wi = λ. We also write Σ to
#»
denote a tuple of n alphabets: Σ = Σ1 × · · · Σn . #»
• Γ is a tuple of m output alphabets Γj (1 ≤
#»
Elements of Σ are denoted #» σ. j ≤ m),
#»∗
If w and v belong to Σ then the pointwise con-
#» #»
#» #» #»
catenation of w #» and #» v is denoted w v and equals
#» #» • δ : Q × Σ → Q × Γ ∗ × D is the transition
hw1 , . . . wn ihv1 , . . . vn i = hw1 v1 , . . . wn vn i. We function. D is an alphabet of directions equal
#» #»
are interested in functions that map n-strings to to {−1, 0, +1} and D is an n-tuple. Γ ∗ is a
m-strings with n, m ∈ N. In what follows we gen- m-tuple of strings written to each output tape.
276
Figure 3: Synthesis of 2-way FST and MT-FST
#» #»0 , r, #»
We understand δ(q, #» v , d ) as follows. It
σ ) = (r, #» We write ( w, #» q, #»
x , #»
u ) → (w x 0 , #» v ). Ob-
u #»
means that if the transducer is in state q and the serve that since δ is a function, there is at most one
n read heads on the input tapes are on symbols next configuration (i.e., the system is deterministic).
hσ1 , . . . σn i = #» σ , then several actions ensue. The Note there are some circumstances where there is
transducer changes to state r and pointwise concate- no next configuration. For instance if di = +1 and
nates #» v to the m output tapes. The n read heads xi = λ then there is no place for the read head to
#» #»
then move according to the instructions d ∈ D. advance. In such cases, the computation halts.
For each read head on input tape i, it moves back The transitive closure of → is denoted with →+ .
one symbol iff di = −1, stays where it is iff di = 0, Thus, if c →+ c0 then there exists a finite sequence
and advances one symbol iff di = +1. (If the read of configurations c1 , c2 . . . cn with n > 1 such that
head on an input tape “falls off” the beginning or c = c1 → c2 → . . . → cn = c0 .
end of the string, the computation halts.) At last we define the function that a 2-way (n, m)
The function recognized by a 2-way (n, m) MT MT FST T computes. The input strings are aug-
FST is defined as follows. A configuration of a mented with word boundaries on each tape. Let
#» #» #»
n, m-MT-FST T is a 4-tuple h Σ ∗ , q, Σ ∗ , Γ ∗ i. The # »
own = how1 n, . . . o wn ni. For each n-string
meaning of the configuration ( w, x , u) is that
#» q, #» #»∗ #»
w#» ∈ Σ , fT ( w)
#» = #» u ∈ Γ ∗ provided there
#» # » #»
the input to T is w x and the machine is currently
#» #»
exists qf ∈ F such that ( λ , q0 , own, λ ) →+
in state q with the n read heads on the first symbol # » #»
(own, qf , λ , #» u ).
of each xi (or has fallen off the right edge of the If fT ( w)
#» = #» u then #» u is unique because the
i-th input tape if xi = λ) and that #» u is currently sequence of configurations is determined determin-
written on the m output tapes. istically. If the computation of a 2-way MT-FST T
If the current configuration is ( w, #» q, #» u ) and
x , #» halts on some input w #» (perhaps because a subse-
#»
δ(q, #»σ ) = (r, #» v , d ) then the next configuration is quent configuration does not exist), then we say T
#»0 , r, #»
(w x 0 , #» v ), where for each i, 1 ≤ i ≤ n:
u #» is undefined on w. #»
#»0 = hw0 . . . w0 i and #»
• w n x 0 = hx01 . . . x0n i (1 ≤ The 2-way FSTs studied by Dolatian and Heinz
1
i ≤ n); (2020) are 2-way 1,1 MT FST. The n-MT-FSTs
studied by Dolatian and Rawski (2020b) are 2-way
• wi0 = wi and x0i = xi iff di = 0; n,1 MT FST where none of the transitions con-
tain the −1 direction. In this way, the definition
• wi0 = wi σ and x0i = x00i iff di = +1 and there
presented here properly subsumes both.
exists σ ∈ Σi , x00i ∈ Σ∗i such that xi = σx00i ;
Figure 4 shows an example of a 1,2 MT FST that
• wi0 = wi00 and x0i = σxi iff di = −1 and there “splits” a phonetic (or orthographic) transcription of
exists σ ∈ Σi , wi00 ∈ Σ∗i such that wi = σwi00 . a Shupamem word into a linguistic representation
277
with a tonal and segmental tier by outputting two As was discussed in Section 2, the tone on the
output strings, one for each tier. second copy is dependent on whether there was an
H tone preceding the reduplicated phrase. If there
I:(C,+1) I:(á,+1) I:(à,+1) was one, the tone on the reduplicant will be H.
T:λ T:H T:L Otherwise, the L-Default Insertion rule will insert
S:C S:a S:a L tone onto the toneless TBU of the second copy.
Because those tonal changes are not part of the
I:(â,+1) I:(ǎ,+1)
reduplicative process per se, we do not represent
T:HL T:LH
them either in our model in Figure 3, or in the
start q0 S:a
q1 oS:a q2 derivation in Table 4. Those alternations could be
I:(o,+1) I:(n,+1) accounted for with 1-way FST by simply inserting
T:λ T:λ H or L tone to the output of the composed machine
S:λ S:λ represented in Figure 3.
Modelling verbal reduplication and the tonal pro-
Figure 4: MT-FST: split cesses revolving around it (see Table 2) works in
ndáp → (ndap, H) ‘house’, C and V are notational the exact same way as described above for nominal
meta-symbols for consonants and vowels, resp.; T reduplication. The only difference are the functions
indicates the output tone tape, S – the segmental applied to the T-tape.
output tape, and I – the input.
7 An Alternative to 2-Way Automata
6 Proposed model 2-way n,m MT FST generalize regular functions
(Filiot and Reynier, 2016) to functions from n-
As presented in Figure 3, our proposed model, i.e.
strings to m-strings. It is worth asking however,
2-way 2,2 MT FST, consists of 1-way and 2-way
what each of these mechanisms brings, especially
deterministic transducers, which together operate
in light of fundamental operations such as func-
on two tapes. Both input and output are represented
tional composition.
on two separate tapes: Tonal and Segmental Tape.
Such representation is desired as it correctly mim-
ics linguistic representations of tonal languages,
where segments and tones act independently from
each other. On the T-tape, a 1-way FST takes the
H tone associated with the lexical representation of
ndáp ‘house’ and outputs HL∼ by implementing
the Opposite Tone Insertion function. On the S-
tape, a 2-way FST takes ndap as an input, and out-
puts a faithful copy of that string (ndap ndap). The
∼ symbol significantly indicates the morpheme
boundary and facilitates further output lineariza-
tion. A detailed derivation of ndáp 7→ ndâp ndap Figure 5: An alternative model for Shupamem redupli-
is shown in Table 4. cation
Figure 3 also represents two additional ‘trans-
formations’: splitting and linearizing. First, the For instance, it is clear that 2-way 1, 1 MT FSTs
phonetic transcription of a string (ndáp) is split can handle full reduplication in contrast to 1-way
into tones and segments with a 1,2 MT FST. The 1, 1 MT FSTs which cannot. However, full redupli-
output (H, ndap) serves as an input to the 2-way 2,2 cation can also be obtained via the composition of
MT FST. After the two processes discussed above a 1-way 1,2 MT FST with a 1-way 2, 1 MT FST.
apply, together acting on both tapes, the output is To illustrate, the former takes a single string as an
then linearized with an 2,1 MT FST. The composi- input, e.g. ndap, and ‘splits’ it into two identical
tion of those three machines, i.e. 1,2 MT, 2-way 2,2 copies represented on two separate output tapes.
MT FST, and 2,1 MT FSTs is particularly useful Then the 2-string output by this machines becomes
in applications where a phonetic or orthographic the input to the next 1-way 2,1 MT FST. Since
representations needs to be processed. this machine is asynchronous, it can linearize the
278
State Segment-tape Tone-tape S-output T-output
q0 ondapn oHn λ λ
q1 ondapn S:o:+1 oHn T:o:+1 n HL
q1 ondapn S:n:+1 oHn T:H:+1 nd HL∼
q1 ondapn S:d:+1 oHn T:n:0 nda HL∼
q1 ondapn S:a:+1 oHn T:n:0 ndap HL∼
q1 ondapn S:p:+1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:n:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:p:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:a:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:d:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:n:-1 oHn T:n:0 ndap∼ HL∼
q3 ondapn S:o:+1 oHn T:n:0 ndap∼n HL∼
q3 ondapn S:n:+1 oHn T:n:0 ndap∼nd HL∼
q3 ondapn S:d:+1 oHn T:n:0 ndap∼nda HL∼
q3 ondapn S:a:+1 oHn T:n:0 ndap∼ndap HL∼
q3 ondapn S:p:+1 oHn T:n:0 ndap∼ndap HL∼
2-string (e.g. (ndap, ndap)) to a 1-string (ndap other languages could also be accounted for.
ndap) by reading along one of these input tapes All three languages undergo full reduplication
(and writing it) and reading the other one (and writ- at the segmental level. What differs is the tonal
ing it) only when the read head on the first input pattern governing this process. In Adhola (Ka-
tape reaches the end. Consequently, an alternative plan, 2006), the second copy is always represented
way to model full reduplication is to write the ex- with a fixed tonal pattern H.HL, where ‘.’ indi-
act same output on two separate tapes, and then cates syllable boundary, irregardless of the lexi-
linearize it. Therefore, instead of implementing cal tone on the non-reduplicated form. In the fol-
Shupamem tonal reduplication with 2-way 2,2 MT lowing examples, unaccented vowels indicate low
FST, we could use the composition of two 1-way tone. For instance, tiju ‘work’ 7→ tija tíjâ ‘work
MT-FST: 1,3 MT-FST and 3,1 MT-FST as shown too much’, tSemó ‘eat’ 7→ tSemá tSŤémâ ‘eat too
in Figure 5. (We need three tapes, two Segmental much’. In Kikerewe (Odden, 1996), if the first
tapes to allow reduplication as just explained, and (two) syllable(s) of the verb are marked with an
one Tonal tape as discussed before.) H tone, the H tone would also be present in the
This example shows that additional tapes can first two syllables of the reduplicated phrase. On
be understood as serving a similar role to regis- the other hand, if the last two syllables of the non-
ters in register automata (Alur and Černý, 2011; reduplicated verb are marked with an H tone, the
Alur et al., 2014). Alur and his colleagues have H tone will be present on the last two syllables of
shown that deterministic 1-way transducers with the reduplicated phrase. For instance, bíba ‘plant’
registers are equivalent in expressivity to 2-way 7→ bíba biba ‘plant carelessly, here and there’, bib-
deterministic transducers (without registers). ílé ‘planted (yesterday)’ 7→ bibile bibílé ‘planted
(yesterday) carelessly, here and there’. Finally, in
8 Beyond Shupamem KiHehe (Odden and Odden, 1985), if an H tone ap-
The proposed model is not limited to modeling pears in the first syllable of the verb, the H tone will
full reduplication in Shupamem. It can be used for also be present in the first syllable of the second
other tonal languages exhibiting this morphological copy, for example dóongoleesa ‘roll’ 7→ dongolesa
process. We provide examples of the applicability dóongoleesa ‘roll a bit’.
of this model for the three following languages: The above discussed examples can be modelled
Adhola, Kikerewe, and Shona. And we predict that in a similar to Shupamem way, such that, first,
279
the input will be output on two tapes: Tonal and clude them in the RedTyp database (Dolatian and
Segmental, then some (morpho-)phonological pro- Heinz, 2019) so a broader empirical typology can
cesses will apply on both level. The final step is be studied with respect to the formal properties of
the ‘linearization’, which will be independent of these machines.
the case. For example, in Kikerewe, if the first tone
that is read on the Tonal tape is H, and a vowel
is read on the Segmental tape, the output will be
a vowel with an acute accent. If the second tone References
is L, as in bíba, this L tone will be ‘attached’ to Rajeev Alur, Adam Freilich, and Mukund
every remaining vowel in the reduplicated phrase. Raghothaman. 2014. Regular combinators for
While Kikerewe provides an example where there string transformations. In Proceedings of the
Joint Meeting of the Twenty-Third EACSL Annual
are more TBUs than tones, Adhola presents the
Conference on Computer Science Logic (CSL) and
reverse situation, where there are more tones than the Twenty-Ninth Annual ACM/IEEE Symposium on
TBU (contour tones). Consequently, it is crucial to Logic in Computer Science (LICS), CSL-LICS ’14,
mark syllable boundaries, such that only when ‘.’ pages 9:1–9:10, New York, NY, USA. ACM.
or the right edge marker (o) is read, the FST will
Rajeev Alur and Pavol Černý. 2011. Streaming trans-
output the ‘linearized’ element. ducers for algorithmic verification of single-pass list-
processing programs. In Proceedings of the 38th An-
9 Conclusion nual ACM SIGPLAN-SIGACT Symposium on Princi-
ples of Programming Languages, POPL ’11, page
In this paper we proposed a deterministic finite- 599–610, New York, NY, USA. Association for
Computing Machinery.
state model of total reduplication in Shupamem.
As it is typical for Bantu languages, Shupamem is Steven Bird. 2001. Orthography and identity in
a tonal language in which phonological processes Cameroon. Written Language & Literacy, 4(2):131–
operating on a segmental level differ from those on 162.
suprasegmental (tonal) level. Consequently, Shu- Jane Chandlee. 2017. Computational locality in mor-
pamem introduces two challenges for 1-way FSTs: phological maps. Morphology, pages 1–43.
language copying and autosegmental representa-
tion. We addressed those challenges by proposing Jane Chandlee and Jeffrey Heinz. 2012. Bounded copy-
ing is subsequential: Implications for metathesis and
a synthesis of a deterministic 2-way FST, which reduplication. In Proceedings of the Twelfth Meet-
correctly models total reduplication, and a MT FST, ing of the Special Interest Group on Computational
which enables autosegmental representation. Such Morphology and Phonology, pages 42–51, Montréal,
a machine operates on two tapes (Tonal and Seg- Canada. Association for Computational Linguistics.
mental), which faithfully replicate the linguistic
Hossep Dolatian and Jeffrey Heinz. 2019. Redtyp: A
analysis of Shupamem reduplication discussed in database of reduplication with computational mod-
Markowska (2020). Finally, the outputs of the 2- els. In Proceedings of the Society for Computation
way 2,2 MT FST is linearized with a separate 2,1 in Linguistics, volume 2. Article 3.
MT FST outputting the desired surface representa-
Hossep Dolatian and Jeffrey Heinz. 2020. Comput-
tion of a reduplicated word. The proposed model ing and classifying reduplication with 2-way finite-
is based on previously studied finite-state models state transducers. Journal of Language Modelling,
for reduplication (Dolatian and Heinz, 2020) and 8(1):179–250.
tonal processes (Dolatian and Rawski, 2020b,a).
Hossep Dolatian and Jonathan Rawski. 2020a. Com-
There are some areas of future research that we putational locality in nonlinear morphophonology.
plan to pursue. First, we have suggested that we Ms., Stony Brook University.
can handle reduplication using the composition
Hossep Dolatian and Jonathan Rawski. 2020b. Multi
of 1-way deterministic MT FSTs, dispensing with input strictly local functions for templatic morphol-
the need for 2-way automata altogether. Further ogy. In In Proceedings of the Society for Computa-
formal comparison of these two approaches is war- tion in Linguistics, volume 3.
ranted. More generally, we plan to investigate the
David M. Eberhard, Gary F. Simmons, and Charles D.
closure properties of classes of 2-way MT FSTs. A Fenning. 2021. Enthologue: Languages of the
third line of research is to collect more examples World. 24th edition. Dallas, Texas: SIL Interna-
of full reduplication in tonal languages and to in- tional.
280
Emmanuel Filiot and Pierre-Alain Reynier. 2016. Jonathan Rawski and Hossep Dolatian. 2020. Multi-
Transducers, logic and algebra for functions of finite input strict local functions for tonal phonology. Pro-
words. ACM SIGLOG News, 3(3):4–19. ceedings of the Society for Computation in Linguis-
tics, 3(1):245–260.
Patric C. Fischer. 1965. Multi-tape and infinite-state
automata-a survey. Communications of the ACM, Brian Roark and Richard Sproat. 2007. Computational
pages 799–805. Approaches to Morphology and Syntax. Oxford Uni-
versity Press, Oxford.
John Goldsmith. 1976. Autosegmental Phonology.
Ph.D. thesis, Massachusetts Institute of Technology. Carl Rubino. 2013. Reduplication. Max Planck Insti-
tute for Evolutionary Anthropology, Leipzig.
Nizar Habash and Owen Rambow. 2006. Magead: A
morphological analyzer for Arabic and its dialects. Bruce Wiebe. 1992. Modelling autosegmental phonol-
In Proceedings of the 21st International Confer- ogy with multi-tape finite state transducers.
ence on Computational Linguistics and 44th Annual
Meeting of the Association for Computational Lin- Edwin S. Williams. 1976. Underlying tone in Margi
guistics (Coling-ACL’06), Sydney, Australia. and Igbo. Linguistic Inquiry, 7:463–484.
281
Improved pronunciation prediction accuracy using morphology
282
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 282–288
August 5, 2021. ©2021 Association for Computational Linguistics
morphology. G2P fits nicely in the well-studied se- and Taylor and Richmond (2020) show the reverse.
quence to sequence learning paradigms (Sutskever The present work aligns with the latter, but instead
et al., 2014), here we use extensions that can handle of requiring full morphological segmentation of
supplementary inputs in order to inject the morpho- words we work with weaker and more easily anno-
logical information. Our techniques are similar to tated morphological information like word lemmas
Sharma et al. (2019), although the goal there is to and morphological categories.
lemmatize or inflect more accurately using pronun-
ciations. Taylor and Richmond (2020) consider 3 Improved pronunciation prediction
improving neural G2P quality using morphology,
We consider the G2P problem, i.e. prediction of
our work differs in two respects. First, we use
the sequence of phonemes (pronunciation) from
morphology class and lemma entries instead of
the sequence of graphemes in a single word. The
morpheme boundaries for which annotations may
G2P problem forms a clean, simple application of
not be as readily available. Secondly, they con-
seq2seq learning, which can also be used to cre-
sider BiLSTMs and Transformer models, but we
ate models that achieve state-of-the-art accuracies
additionally consider architectures which combine
in pronunciation prediction. Morphology can aid
BiLSTMs with attention and outperform both. We
this prediction in several ways. One, we could
also show significant gains by morphology injec-
use morphological category as a non-sequential
tion in the context of transfer learning for low re-
side input. Two, we could use the knowledge of
source languages where sufficient annotations are
the morphemes of the words and their pronuncia-
unavailable.
tions which may be possible with lower amounts
of annotation. For example, the lemma (and its
2 Background and related work
pronunciation) may already be annotated for an
Pronunciation prediction is often studied in settings out-of-vocabulary word. Often standard lexicons
of speech recognition and synthesis. Some recent list the lemmata of derived/inflected words, lemma-
work explores new representations (Livescu et al., tizer models can be used as a fallback. Learning
2016; Sofroniev and Çöltekin, 2018; Jacobs and from the exact morphological segmentation (Tay-
Mailhot, 2019), but in this work, a pronunciation lor and Richmond, 2020) would need more precise
is a sequence of phonemes, syllable boundaries models and annotation (Demberg et al., 2007).
and stress symbols (van Esch et al., 2016). A lot of Given the spelling, language specific models
work has been devoted to the G2P problem (e.g. see can predict the pronunciation by using knowledge
Nicolai et al. (2020)), ranging from those focused of typical grapheme to phoneme mappings in the
on accuracy and model size to those discussing ap- language. Some errors of these models may be
proaches for data-efficient scaling to low resource fixed with help from morphological information as
languages or multilingual modeling (Rao et al., argued above. For instance, homograph pronun-
2015; Sharma, 2018; Gorman et al., 2020). ciations can be predicted using morphology but
Morphology prediction is of independent interest it is impossible to deduce correctly using just or-
and has applications in natural language generation thography.1 The pronunciation of ‘read’ (/ôi:d/ for
as well as understanding. The problems of lemma- present tense and noun, /ôEd/ for past and partici-
tization and morphological inflection have been ple) can be determined by the part of speech and
studied in both contextual (in a sentence, which tense; the stress shifts from first to second syllable
involves morphosyntactics) and isolated settings between ‘project’ noun and verb.
(Cohen and Smith, 2007; Faruqui et al., 2015; Cot-
3.1 Dataset
terell et al., 2016; Sharma et al., 2019).
Morphophonological prediction, by which we We train and evaluate our models for five lan-
mean viewing morphology and pronunciation pre- guages to cover some morphophonological diver-
diction as a single task with several related inputs sity: (American) English, French, Russian, Span-
and outputs, has received relatively less attention as ish and Hungarian. For training our models, we
a language-independent computational task, even use pronunciation lexicons (word-pronunciation
though the significance for G2P has been argued pairs) and morphological lexicons (containing lex-
(Coker et al., 1991). Sharma et al. (2019) show 1
Homographs are words which are spelt identically but
improved morphology prediction using phonology, have different meanings and pronunciations.
283
ical form, i.e. lemma and morphology class) of 3.2.1 Bidirectional LSTM networks
only inflected words of size of the order of 104 LSTM (Hochreiter and Schmidhuber, 1997) allows
for each language (see Table 5 in Appendix A). learning of fixed length sequences, which is not a
For the languages discussed, these lexicons are ob- major problem for pronunciation prediction since
tained by scraping2 Wiktionary data and filtering grapheme and phoneme sequences (represented as
for words that have annotations (including pronun- one-hot vectors) are often of comparable length,
ciations available in the IPA format) for both the and in fact state-of-the-art accuracies can be ob-
surface form and the lexical form. While this or- tained using bidirectional LSTM (Rao et al., 2015).
der of data is often available for high-resource lan- We use single layer BiLSTM encoder - decoder
guages, in Section 3.3 we discuss extension of our with 256 units and 0.2 dropout to build a charac-
work to low-resource settings using Finnish and ter level RNN. Each character is represented by a
Portuguese for illustration where the Wiktionary trainable embedding of dimension 30.
data is about an order of magnitude smaller.
3.2.2 LSTM based encoder-decoder networks
Word (language) Morph. Class Pron. LS LP with attention (BiLSTM+Attn)
masseuses (fr) n-f-pl /ma.søz/ masseur /ma.sœK/
fagylaltozom (hu) v-fp-s-in-pr-id /"f6Íl6ltozom/ fagylaltozik /"f6Íl6ltozik/ Attention-based models (Vaswani et al., 2017;
Chan et al., 2016; Luong et al., 2015; Xu et al.,
Table 1: Example annotated entries. (v-fp-s-in-pr-id: 2015) are capable of taking a weighted sample of
Verb, first-person singular indicative present indefinite)
input, allowing the network to focus on different
possibly distant relevant segments of the input ef-
We keep 20% of the pronunciation lexicons fectively to predict the output. We use the model
aside for evaluation using word error rate (WER) defined in Section 3.2.1 with Luong attention (Lu-
metric. WER measures an output as correct if the ong et al., 2015).
entire output pronunciation sequence matches the
ground truth annotation for the test example. 3.2.3 Transformer networks
Transformer (Vaswani et al., 2017) uses self-
3.1.1 Morphological category attention in both encoder and decoder to learn
The morphological category of the word is ap- rich text representaions. We use a similar architec-
pended as an ordinal encoding to the spelling, sepa- ture but with fewer parameters, by using 3 layers,
rated by a special character. That is, the categories 256 hidden units, 4 attention heads and 1024 di-
of a given language are appended as unique inte- mensional feed forward layers with relu activation.
gers, as opposed to one-hot vectors which may be Both the attention and feedforward dropout is 0.1.
too large in morphologically rich languages. The input character embedding dimension is 30.
284
Model Inputs en fr ru es hu
BiLSTM (b/+c/+l) (39.7/39.4/37.1) (8.69/8.94/7.94) (5.26/4.87/5.60) (1.13/1.44/1.30) (6.96/5.85/7.21)
BiLSTM+Attn (b/+c/+l) (36.9/36.1/31.0) (4.45/4.20/4.12) (5.06/3.80/4.04) (0.32/0.32/0.29) (1.78/1.31/1.12)
Transformer (b/+c/+l) (40.2/39.3/37.7) (8.19/7.11/10.6) (6.57/6.38/5.36) (2.29/1.62/2.20) (8.20/4.93/8.11)
Table 2: Models and their Word Error Rates (WERs). ‘b’ corresponds to baseline (vanilla G2P), ‘+c’ refers to
morphology class injection (Sec. 3.1.1) and ‘+l’ to addition of lemma spelling and pronunciation (Sec. 3.1.2).
evaluate our model for two language pairs — hu We also look at how adding lexical form infor-
(high) - fi (low) and es (high) and pt (low) (results mation, i.e. morphological class and lemma, helps
in Table 3). We perform morphology injection us- with pronunciation prediction. We notice that the
ing lemma spelling and pronunciation (Sec. 3.1.2) improvements are particularly prominent when the
since it can be easier to annotate and potentially G2P task itself is more complex, for example in
more effective (per Table 2). fi and pt are not really English. In particular, ambiguous or exceptional
low-resource, but have relatively fewer Wiktionary grapheme subsequence (e.g. ough in English)
annotations for the lexical forms (Table 5). to phoneme subsequence mappings, may be re-
solved with help from lemma pronunciations. Also
Model fi fi+hu pt pt+es
morphological category seems to help for example
BiLSTM+Attn (base) 18.53 9.81 62.65 58.87
BiLSTM+Attn (+lem) 9.27 8.45 59.63 55.48 in Russian where it can contain a lot of informa-
tion due to the inherent morphological complexity
Table 3: Transfer learning for vanilla G2P (base) and (about 25% relative error reduction). See Appendix
morphology augmented G2P (+lem, Sec. 3.1.2). B for more detailed comparison and error analysis
for the models.
4 Discussion Our transfer learning experiments indicate that
We discuss our results under two themes — the morphology injection gives even more gains in low
efficacy of the different neural models we have resource setting. In fact for both the languages
implemented, and the effect of the different ways considered, adding morphology gives almost as
of injecting morphology that were considered. much gain as adding a high resource language to
We consider three neural models as described the BiLSTM+Attn model. This could be useful for
above. To compare the neural models, we first low resource languages like Georgian where a high
note the approximate number of parameters of each resource language from the same language family
model that we trained: is unavailable. Even with the high resource aug-
mentation, using morphology can give a significant
• BiLSTM: ∼1.7M parameters,
further boost to the prediction accuracy.
• BiLSTM+Attn: ∼3.5M parameters,
• Transformer: ∼5.2M parameters. 5 Conclusion
For BiLSTM and BiLSTM+Attn, the parameter We note that combining BiLSTM with attention
size is based on neural architecture search i.e. we seems to be the most attractive alternative in get-
estimated sizes at which accuracies (nearly) peaked. ting improvements in pronunciation prediction by
For transformer, we believe even larger models can leveraging morphology, and hence correspond to
be more effective and the current size was chosen the most appropriate ‘model bias’ for the prob-
due to computational restrictions and for “fairer” lem from among the alternatives considered. We
comparison of model effectiveness. Under this set- also note that all the neural network paradigms
ting, BiLSTM+Attn models seem to clearly outper- discussed are capable of improving the G2P predic-
form both the other models, even without morphol- tion quality when augmented with morphological
ogy injection (cf. Gorman et al. (2020), albeit it is information. Since our approach can potentially
in the multilingual modeling context). Transformer support partial/incomplete data (using appropriate
can beat BiLSTM in some cases even with the sub- hMISSINGi or hN/Ai tokens), one can use a sin-
optimal model size restriction, but is consistently gle model which injects morphology class and/or
worse when the sequence lengths are larger which lemma pronunciation as available. For languages
is the case when we inject lemma spellings and where neither is available, our results suggest build-
pronunciations. ing word-lemma lists or utilizing effective lemma-
285
tizers (Faruqui et al., 2015; Cotterell et al., 2016). recognition. In Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2016 IEEE International Confer-
6 Future work ence on, pages 4960–4964. IEEE.
Our work only leverages the inflectional morphol- Noam Chomsky and Morris Halle. 1968. The sound
pattern of English.
ogy paradigms for better pronunciation prediction.
However in addition to inflection, morphology also Shay B Cohen and Noah A Smith. 2007. Joint mor-
results in word formation via derivation and com- phological and syntactic disambiguation. In Pro-
ceedings of the 2007 Joint Conference on Empirical
pounding. Unlike inflection, derivation and com- Methods in Natural Language Processing and Com-
pounding could involve multiple root words, so putational Natural Language Learning (EMNLP-
an extension would need a generalization of the CoNLL).
above approach along with appropriate data. An
Cecil H Coker, Kenneth W Church, and Maik Y Liber-
alternative would be to learn these in an unsuper- man. 1991. Morphology and rhyming: Two pow-
vised way using a dictionary augmented neural net- erful alternatives to letter-to-sound rules for speech
work which can efficiently refer to pronunciations synthesis. In The ESCA Workshop on Speech Syn-
in a dictionary and use them to predict pronunci- thesis.
ations of polymorphemic words using pronuncia- Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
tions of the base words (Bruguier et al., 2018). It David Yarowsky, Jason Eisner, and Mans Hulden.
would be interesting to see if using a combination 2016. The SIGMORPHON 2016 shared
task—morphological reinflection. In Proceed-
of morphological side information and dictionary-
ings of the 14th SIGMORPHON Workshop on
augmentation results in a further accuracy boost. Computational Research in Phonetics, Phonology,
Developing non-neural approaches for the mor- and Morphology, pages 10–22.
phology injection could be interesting, although
Marelie Davel and Olga Martirosian. 2009. Pronuncia-
as noted before, the neural approaches are the state- tion dictionary development in resource-scarce envi-
of-the-art (Rao et al., 2015; Gorman et al., 2020). ronments.
One interesting application of the present work
Vera Demberg, Helmut Schmid, and Gregor Möhler.
would be to use the more accurate pronunciation 2007. Phonological constraints and morphological
prediction for morphologically related forms for ef- preprocessing for grapheme-to-phoneme conversion.
ficient pronunciation lexicon development (useful In Proceedings of the 45th Annual Meeting of the
for low resource languages where high-coverage Association of Computational Linguistics, pages 96–
103.
lexicons currently don’t exist), for example anno-
tating the lemma pronunciation should be enough Eric Engelhart, Mahsa Elyasi, and Gaurav Bharaj.
and the pronunciation of all the related forms can 2021. Grapheme-to-Phoneme Transformer Model
be predicted with high accuracy. This is hugely for Transfer Learning Dialects. arXiv preprint
arXiv:2104.04091.
beneficial for languages where there are hundreds
or even thousands of surface forms associated with Marina Ermolaeva. 2018. Extracting morphophonol-
the same lemma. Another concern for reliably us- ogy from small corpora. In Proceedings of the Fif-
teenth Workshop on Computational Research in Pho-
ing the neural approaches is explainability (Molnar, netics, Phonology, and Morphology, pages 167–175.
2019). Some recent research looks at explaining
neural models with orthographic and phonological Daan van Esch, Mason Chua, and Kanishka Rao. 2016.
features (Sahai and Sharma, 2021), an extension Predicting Pronunciations with Syllabification and
Stress with Recurrent Neural Networks. In INTER-
for morphological features should be useful. SPEECH, pages 2841–2845.
Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and
References Chris Dyer. 2015. Morphological inflection genera-
tion using character sequence to sequence learning.
Antoine Bruguier, Anton Bakhtin, and Dravyansh arXiv preprint arXiv:1512.06110.
Sharma. 2018. Dictionary Augmented Sequence-
to-Sequence Neural Network for Grapheme to Kyle Gorman, Lucas FE Ashby, Aaron Goyzueta,
Phoneme prediction. Proc. Interspeech 2018, pages Arya D McCarthy, Shijie Wu, and Daniel You. 2020.
3733–3737. The SIGMORPHON 2020 shared task on multilin-
gual grapheme-to-phoneme conversion. In Proceed-
William Chan, Navdeep Jaitly, Quoc Le, and Oriol ings of the 17th SIGMORPHON Workshop on Com-
Vinyals. 2016. Listen, attend and spell: A neural putational Research in Phonetics, Phonology, and
network for large vocabulary conversational speech Morphology, pages 40–50.
286
Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Cassandra L Jacobs and Fred Mailhot. 2019. Encoder-decoder models for latent phonological representations of words. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 206–217.
Preethi Jyothi and Mark Hasegawa-Johnson. 2017. Low-resource grapheme-to-phoneme conversion using recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5030–5034. IEEE.
Ronald M Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational linguistics, 20(3):331–378.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.
Kimmo Koskenniemi. 1983. Two-Level Model for Morphological Analysis. In IJCAI, volume 83, pages 683–685.
Karen Livescu, Preethi Jyothi, and Eric Fosler-Lussier. 2016. Articulatory feature-based pronunciation modeling. Computer Speech & Language, 36:212–232.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
Sameer Maskey, Alan Black, and Laura Tomokiya. 2004. Boostrapping phonetic lexicons for new languages. In Eighth International Conference on Spoken Language Processing.
Christoph Molnar. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.
Garrett Nicolai, Kyle Gorman, and Ryan Cotterell. 2020. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.
Ben Peters, Jon Dehdari, and Josef van Genabith. 2017. Massively Multilingual Neural Grapheme-to-Phoneme Conversion. EMNLP 2017, page 19.
Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4225–4229. IEEE.
Zach Ryan and Mans Hulden. 2020. Data augmentation for transformer-based G2P. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 184–188.
Saumya Sahai and Dravyansh Sharma. 2021. Predicting and explaining french grammatical gender. In Proceedings of the Third Workshop on Computational Typology and Multilingual NLP, pages 90–96.
Dravyansh Sharma. 2018. On Training and Evaluation of Grapheme-to-Phoneme Mappings with Limited Data. Proc. Interspeech 2018, pages 2858–2862.
Dravyansh Sharma, Melissa Wilson, and Antoine Bruguier. 2019. Better Morphology Prediction for Better Speech Systems. In INTERSPEECH, pages 3535–3539.
Pavel Sofroniev and Çağrı Çöltekin. 2018. Phonetic vector representations for sound sequence alignment. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 111–116.
Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. 2017. Char2wav: End-to-end speech synthesis.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
Jason Taylor and Korin Richmond. 2020. Enhancing Sequence-to-Sequence Text-to-Speech with Morphology. Submitted to IEEE ICASSP.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010.
Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big data, 3(1):1–40.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.
287
Model Inputs en de es ru avg. rel. gain
BiLSTM (b/+c/+l) (31.0/30.5/25.2) (17.7/15.5/12.3) (8.1/7.9/6.7) (18.4/15.6/15.9) (-/+7.9%/+20.0%)
BiLSTM+Attn (b/+c/+l) (29.0/27.1/21.3) (12.0/11.6/11.6) (4.9/2.6/2.4) (14.1/13.6/13.1) (-/+15.1%/+22.0%)
Table 4: Number of total Wiktionary entries, and inflected entries with pronunciation and morphology annotations,
for the languages considered.
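Reading the b/+c/+l triples as per-language error rates for a baseline and two enriched input settings (this reading, like the variable names below, is an assumption made for illustration), the avg. rel. gain column is reproduced by the relative error reduction with respect to b, averaged over the four languages:

# (b, +c, +l) values per language, read off the table above
bilstm      = {"en": (31.0, 30.5, 25.2), "de": (17.7, 15.5, 12.3),
               "es": (8.1, 7.9, 6.7),    "ru": (18.4, 15.6, 15.9)}
bilstm_attn = {"en": (29.0, 27.1, 21.3), "de": (12.0, 11.6, 11.6),
               "es": (4.9, 2.6, 2.4),    "ru": (14.1, 13.6, 13.1)}

def avg_rel_gain(rows, col):
    """Mean over languages of (b - enriched) / b, as a percentage (col 1 = +c, col 2 = +l)."""
    gains = [(r[0] - r[col]) / r[0] for r in rows.values()]
    return 100 * sum(gains) / len(gains)

print(avg_rel_gain(bilstm, 1), avg_rel_gain(bilstm, 2))            # ≈ 7.9, 20.0
print(avg_rel_gain(bilstm_attn, 1), avg_rel_gain(bilstm_attn, 2))  # ≈ 15.1, 22.0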
B Error analysis
Neural sequence-to-sequence models, while highly accurate on average, make “silly” mistakes, such as omitting or inserting a phoneme, that are hard to explain. With that caveat in place, there are still reasonable patterns to be gleaned when comparing the outputs of the various neural models discussed here. The BiLSTM+Attn model seems not only to make fewer of these “silly” mistakes, but also appears to be better at learning the genuinely more challenging predictions. For example, the French word pédagogiques (‘pedagogical’,
288
Leveraging Paradigmatic Information in Inflection Acceptability
Prediction: the JHU-SFU Submission to SIGMORPHON Shared Task 0.2
289
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 289–294
August 5, 2021. ©2021 Association for Computational Linguistics
           Training   Development   Testing
English     47550      158           138
Dutch       84666      122           166
German      114185     150           266

Table 1: The number of entries in each language’s training, development, and testing datasets.

and mechanisms employed to predict acceptability ratings. We review the results of the shared task in §4 and suggest ways of improving it. Lastly, we give our concluding remarks in §5.

2 Task Description

The goal of this year’s task is to predict acceptability scores of inflections of nonce lemmas in English, Dutch, and German. For instance, given the English nonce lemma fink /fɪŋk/, the submitted system will predict the acceptability of the past tense candidates finked /fɪŋkt/, fank /fæŋk/, and funk /fʌŋk/. This section reports on the datasets provided for training and testing and on the evaluation criteria for the submitted predictions.

We were provided with three datasets for each language (see Table 1), with all lemma and inflection strings in IPA form (e.g. worked: wɔɹkt). The training set contains real lemma-inflection entries with the structure lemma, inflection, and morphosyntactic description (MSD). This dataset is relatively large, with an average of 82,134 entries per language. For the purposes of our model, this dataset is used to infer paradigms and the possible (but not necessarily plausible) inflections of any real or nonce lemma. The judgement-development (hereafter development set) and testing sets are smaller (167 entries on average), with the structure lemma, inflection, MSD, and judgement score.¹ Both the development and testing sets contain lemmas that have exactly two potential inflections occurring in a regular-irregular pair, such as ⟨snɛl, snɛld⟩ (regular) and ⟨snɛl, snɛlt⟩ (irregular, similar to dealt).

¹ Details on judgement score elicitation and nonce lemma generation are documented in Ambridge et al. (2021).

The submitted systems were evaluated (separately for each language) with a mixed-effects beta regression model, with lemma type as a random intercept. The Akaike information criterion (AIC) of the model was used to rank submitted systems relative to each other: the lower the AIC, the closer the system is to the actual acceptability scores. The AIC values across languages are summed to obtain the overall ranking.

3 System Description

The submitted system takes a lemma-inflection pair and its MSD and returns an acceptability score in the range [0, 7]. The model can be broken down into three main modules. First, the model extracts abstract paradigms from the training set based on Hulden’s (2014) algorithm. Then, based on these paradigms, we create a probability distribution for each lemma-inflection pair that indicates the likelihood of an inflection belonging to a paradigm. Lastly, we extract a weighted average of various measures of similarity and frequency for each lemma-inflection pair in the development set and the test set, and fit a linear model to the development data. These processes are described in further detail in the following subsections.

3.1 Paradigm Extraction

The goal of this component is to transform inflection tables derived from the training data into abstract paradigms. We do so by first transforming the training dataset into a compatible matrix. Then, we apply Hulden’s (2014) finite-state abstraction algorithm.

As described in §2, the training data were entries of the form:

    lemma   inflection   MSD

We reorganized the data by having MSDs as columns and lemmas as rows, such that each entry is a complete or incomplete inflection table (Figure X). An inflection table is considered complete if all MSD slots are filled, whereas an incomplete table has one or more MSD slots empty. Due to the large number of MSD tags in German (31 tags), we had to manually prune some tags/columns to ensure that there was a sizeable number of complete tables for paradigm learning.² Additionally, lemmas with multiple inflections in the same MSD (usually due to a pronunciation difference) were removed from the dataset to ensure that each lemma corresponds to one data entry.

² The MSDs that were retained were tags that contain PST (past), since our goal is to predict past participle acceptability. A possible extension that automates this pruning process would be simulations that maximize (1) the number of tags related to the target morphological relationship and (2) the number of complete tables.
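The reorganization just described is straightforward to implement. The sketch below is not the authors’ code: it pivots (lemma, inflection, MSD) triples into per-lemma inflection tables, drops lemmas with conflicting entries for the same MSD, and separates complete from incomplete tables. The function name, the duplicate-handling policy, and the toy MSD tags and transcriptions are assumptions made for illustration.

from collections import defaultdict

def build_inflection_tables(triples, msd_columns):
    """Pivot (lemma, inflection, MSD) triples into per-lemma tables.

    triples     -- iterable of (lemma, inflection, msd) strings
    msd_columns -- the (possibly pruned) set of MSD tags kept as columns
    Returns (complete, incomplete) dicts mapping lemma -> {msd: inflection}.
    """
    tables = defaultdict(dict)
    ambiguous = set()                      # lemmas with >1 inflection in one MSD
    for lemma, inflection, msd in triples:
        if msd not in msd_columns:
            continue                       # pruned column (e.g. non-PST tags)
        if msd in tables[lemma] and tables[lemma][msd] != inflection:
            ambiguous.add(lemma)           # e.g. pronunciation variants
        tables[lemma][msd] = inflection

    complete, incomplete = {}, {}
    for lemma, row in tables.items():
        if lemma in ambiguous:
            continue                       # removed so each lemma is one entry
        target = complete if all(m in row for m in msd_columns) else incomplete
        target[lemma] = row
    return complete, incomplete

# toy usage with hypothetical MSD tags and transcriptions
triples = [("sɪŋ", "sæŋ", "V;PST"), ("sɪŋ", "sɪŋɪŋ", "V;V.PTCP;PRS"),
           ("snɛl", "snɛld", "V;PST")]
complete, incomplete = build_inflection_tables(triples, {"V;PST", "V;V.PTCP;PRS"})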
290
We then induced abstract paradigms from the set of complete inflection tables. Hulden’s (2014) algorithm³ relies on the notion of a longest common subsequence (LCS), which is defined rigorously as follows (Hirschberg, 1977):

• String L is a subsequence of X iff L can be obtained by deleting any (0 or more) symbols in X, e.g. course is a subsequence of computer science.

• L is a common subsequence of X1 and X2 iff L is a subsequence of X1 and X2.

• L is the LCS of X1 and X2 iff there does not exist a string K such that K is a common subsequence of X1 and X2 and len(K) > len(L).

The goal for each table-to-paradigm process is to: (1) extract the LCS between the entries of a table and (2) assign substrings of the LCS as variables. Then, paradigms collapse to a smaller set of distinct, abstract paradigms, with multiple lemmas belonging to one paradigm. For example, the inflection tables ring: rɪŋ#ræŋ#rʌŋ, sing: sɪŋ#sæŋ#sʌŋ, and walk: wɔlk#wɔlkt#wɔlkt reduce to the paradigms x1ɪx2#x1æx2#x1ʌx2 and x1#x1t#x1t. For the specific implementation procedures, please consult Hulden (2014) (see also Ahlberg et al., 2014, 2015).

³ The code for this algorithm is openly available through the pextract toolbox: https://code.google.com/archive/p/pextract/

Finally, we define the set of abstract paradigms as C (for ‘classes’ hereafter) and the set of MSD tags as M. While C encapsulates the inflectional patterns found in the complete inflection tables, we anticipate that there will be unattested paradigms in the incomplete training tables as well as in the testing and development datasets. So, we extend C to C# = C ∪ {c_O}, where c_O represents the inflection tables unaccounted for by C.

3.2 Class Probability

Following the definition of the set of classes/paradigms C#, we then generate a function CP_i : C# → [0, 1] for each incomplete inflection table i (later extended to complete tables), where CP_i(c) indicates the probability that i belongs to class c. The sum of all outputs of this function equals 1, as we assume that the classes in C# are all the possible outcomes (recall that C# accounts for unattested paradigms). This function has two applications in this model. For the training set, incomplete tables are soft-classified into potential classes, providing more exemplars for the development and test set lemmas. For the development and test entries, the probability distribution allows the feature extraction process in §3.3 to be weighted.

Our protocol for generating CP_i is as follows. First, we define the terms compatible and incompatible with respect to a class (except c_O) and an inflection table. An inflection table and a class are incompatible if they meet any one of the following criteria:

• The characters of the specified substrings of the abstract inflection (e.g. ed in x1+ed) do not exist in the inflection string, e.g. ræn and x1+d.

• The placement of the variables makes it impossible to fit the inflection string into the abstract inflection. For example, with dɛlt and x1+d, ɛlt cannot be accommodated in this configuration.

• Although a configuration is possible in the various dimensions individually, the variable assignments are conflicting. For example, feɪdɪd ‘faded’ fits the voiced regular past tense template x1+d with x1=feɪdɪ, and feɪdɪŋ ‘fading’ will also fit the progressive x1+ɪŋ with x1=feɪd, but the two x1 variables are different, resulting in incompatibility.

Otherwise, if all instances within and across the inflection table and the abstract class do not violate the criteria above, they are considered compatible. Some inflection tables may have multiple compatible classes – this may arise from the table having only a few inflectional dimensions, which in turn causes it to satisfy the compatibility requirements easily.

Any incompatible class is assigned a probability of 0: CP_i(c) = 0. In the case where no classes are compatible with i, the null paradigm c_O is assigned a probability of 1: CP_i(c_O) = 1. In cases where an inflection table has exactly one compatible class, that class is assigned a probability of 1. Likewise, complete tables are automatically assigned a probability of 1 for their respective classes generated from §3.1. In all other cases, where a table has multiple compatible classes, we run a naive Bayes classifier (implemented in the nltk package) on the lemmas of the competing classes.
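The compatibility criteria and the class-probability assignment above can be made concrete with a small sketch. This is not the authors’ implementation: abstract slots such as x1+d are matched with greedy regular expressions (a full implementation would search over alternative variable splits), the dict layout of classes and tables is assumed, and a uniform split over compatible classes stands in for the naive Bayes tie-breaker; the abstract paradigms themselves are taken as given, e.g. from the pextract toolbox in footnote 3.

import re

def slot_regex(slot):
    """Compile an abstract slot such as 'x1+d' or 'x1ɪx2' into a regex whose
    named groups capture the variable parts (variables are written x1, x2, ...;
    literal material is assumed not to contain the letter x in this sketch)."""
    parts, seen = [], set()
    for piece in re.findall(r"x\d+|[^x]+", slot):
        if re.fullmatch(r"x\d+", piece):
            # reuse of a variable inside the same slot becomes a back-reference
            parts.append(f"(?P={piece})" if piece in seen else f"(?P<{piece}>.+)")
            seen.add(piece)
        else:
            parts.append(re.escape(piece.replace("+", "")))  # drop '+' separators
    return re.compile("".join(parts) + "$")

def compatible(table, abstract):
    """True if every filled slot of `table` fits `abstract` with one consistent
    variable assignment (the three criteria above, under greedy matching)."""
    assignment = {}
    for msd, form in table.items():
        if msd not in abstract:
            return False
        m = slot_regex(abstract[msd]).match(form)
        if m is None:
            return False                   # characters/placement cannot be fit
        for var, value in m.groupdict().items():
            if assignment.setdefault(var, value) != value:
                return False               # conflicting x1 across slots
    return True

def class_probabilities(table, classes):
    """CP_i over C#: 0 for incompatible classes, 1 for a unique compatible class,
    the null class c_O when nothing fits; ties are split uniformly here."""
    compat = [name for name, abstract in classes.items() if compatible(table, abstract)]
    if not compat:
        return {"c_O": 1.0}
    return {name: 1.0 / len(compat) for name in compat}

# toy classes and tables (hypothetical labels and transcriptions)
classes = {"regular_d": {"LEMMA": "x1", "PST": "x1+d"},
           "ablaut":    {"LEMMA": "x1ɪx2", "PST": "x1æx2"}}
class_probabilities({"LEMMA": "snɛl", "PST": "snɛld"}, classes)  # {'regular_d': 1.0}
class_probabilities({"LEMMA": "sɪŋ", "PST": "sæŋ"}, classes)     # {'ablaut': 1.0}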
291
For each class member (lemma) and the lemma of the table in question, we obtain the values of three parameters: its syllable structure (in a CV string), its first phoneme, and its last phoneme. The class probabilities obtained from the classifiers are assigned to the respective classes.

Similarly, we treat the development and test lemma-inflection pairs as incomplete inflection tables with two dimensions. They undergo the same process described above to obtain their class probabilities. We now have a rich stock of lemmas and inflections soft-categorized by paradigmatic information, to which we will refer during the predictor abstraction process, as detailed in the next subsection.

3.3 Paradigm Information & Model Building

Lastly, we extract three predictor values that reflect paradigmatic information: syllable structure Levenshtein distance, weighted phonological feature distance, and class size. All three predictors are weighted by the probability distribution generated in §3.2, as seen in the formulas that follow. Let n_c = ∑_{w∈c} CP_w(c) (the frequency of a class c).

• Syllable Levenshtein: average Levenshtein distance between the test/development lemma and the lemmas in the class:

    ∑_{c∈C} (CP_i(c) / n_c) · ( ∑_{w∈c} dist(i, w) · CP_w(c) )

• Weighted phonological features: average phonological distance between the test/development lemma and the lemmas in the class, derived from the panphon package (Mortensen et al., 2016):

    ∑_{c∈C} (CP_i(c) / n_c) · ( ∑_{w∈c} phondist(i, w) · CP_w(c) )

• Class frequency: the weighted frequency of each class:

    ∑_{c∈C} CP_i(c) · n_c

The development data was subsequently fitted to a linear model with the acceptability scores as the response variable. We then applied the values from the test set to the linear model to yield the predictions for this shared task. Values below 0 were adjusted to 0 and those above 7 were adjusted to 7 (a short code sketch of these steps is given below).

Figure 1: Judgement scores from the English development set mapped against the predicted scores. Red datapoints represent irregular items, whereas blue points are regular items. The closer a point is to the line, the closer the prediction is to the actual judgement score.

4 Results & Discussion

We now turn to the results of the model and review qualitatively some of its shortcomings. Then, we propose some potential extensions and fixes that may improve the performance of this model.

4.1 General Observations

The AIC values for the English, Dutch, and German test sets were −46, −30.3, and −14.8, respectively. With regard to ranking, our system ranked last for Dutch and German but ranked second for English. Figure 1 shows the relationship between the actual scores and the predictions for a simulation on the development set. We notice that all predictions had a much narrower range (2.55–4.71) than that of the actual scores (0.29–6.19). The Dutch data also showed a similar pattern (1.60–4.68 predicted vs. 0.62–6.42 actual), which may imply that the variables chosen were not able to yield a distinctive difference. A linear model may also not be sufficient to fit the data.

In a similar vein, we notice that irregular inflectional candidates were often overrated by our predictions, whereas regular inflections were underrated. These modelling issues may be remedied in a few ways. First, it is possible to define a regular class among C# (e.g. the most frequent class) and add a reward factor to all test lemma-inflection pairs of that class. Similarly, a penalty score can be added for pairs in irregular inflection classes.
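Putting §3.3 together, the sketch below is one way the weighted predictors and the clipped linear fit could look. It is a simplification, not the submitted system: a plain Levenshtein distance over symbol strings stands in for both the CV-skeleton distance and panphon’s feature-based distance (dist and phondist above), the dict layout of class probabilities and class members is assumed, and numpy least squares replaces whatever linear-model implementation was actually used.

import numpy as np

def edit_distance(a, b):
    """Plain Levenshtein distance between two symbol strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def weighted_distance(item, class_probs, class_members, dist):
    """sum over c of (CP_i(c)/n_c) * sum over w in c of dist(item, w) * CP_w(c)."""
    total = 0.0
    for c, cp_i in class_probs.items():
        members = class_members.get(c, {})        # {lemma w: CP_w(c)}
        n_c = sum(members.values())               # n_c = sum over w in c of CP_w(c)
        if n_c > 0:
            total += (cp_i / n_c) * sum(dist(item, w) * cp_w
                                        for w, cp_w in members.items())
    return total

def class_frequency(class_probs, class_members):
    """sum over c of CP_i(c) * n_c."""
    return sum(cp_i * sum(class_members.get(c, {}).values())
               for c, cp_i in class_probs.items())

def fit_and_predict(dev_X, dev_y, test_X):
    """Fit a linear model on the development predictors, then clip to [0, 7]."""
    X = np.column_stack([np.ones(len(dev_X)), np.asarray(dev_X, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(dev_y, dtype=float), rcond=None)
    T = np.column_stack([np.ones(len(test_X)), np.asarray(test_X, dtype=float)])
    return np.clip(T @ coef, 0.0, 7.0)

The three predictors for one item would then be assembled as, e.g., [weighted_distance(cv_skeleton, CP, members_cv, edit_distance), weighted_distance(transcription, CP, members_ipa, edit_distance), class_frequency(CP, members_ipa)], where the CV-skeleton and transcription inputs are hypothetical stand-ins for the paper’s actual feature extraction.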
292
Another tweak to our system is to consider other regression models that may fit the data better, such as using K-nearest neighbours given the current variable extraction methods (§3.3).

Future iterations of this model may seek to improve accuracy scores by implementing a further abstraction of the paradigm extraction process described in §3.1. Silfverberg et al. (2018) describe an extension of Hulden (2014) that abstracts paradigms to the featural level. For instance, our current system has separate abstract paradigms for the class of lemmas with voiced regular past tenses (e.g. bribe → bribe+/d/) and unvoiced ones (e.g. jump → jump+/t/). By merging the two classes, we are able to represent, to a further extent, the class of regular inflections and set them apart from other variants. This may lead to better accuracy scores for this task because it gives a better representation of class frequency and more exemplars for the test lemma to refer to. We should note, however, that this may be a detriment to prediction tasks that ask to determine acceptability scores within a phonological rule (e.g. is work+/d/ or work+/t/ acceptable?).

5 Concluding Remarks

This paper proposes a framework that utilizes paradigmatic information to predict acceptability ratings of nonce lemma inflections. This system is highly modular – the variables that pass between the modules can be easily tweaked and reviewed, which increases the interpretability of the system. Additionally, we foresee that a more developed version of this model may provide insight into deeper questions in cognitive modelling, such as: how does phonological neighbourhood interact with paradigm size in these systems, and does it conform with findings in linguistic and psycholinguistic studies? We look forward to future extensions of this model and to contributing to the ongoing cognitive modelling work in morphology.

Acknowledgments

We would like to thank the organizers for their hard work on the shared task.
293
Mans Hulden. 2014. Generalizing inflection tables into paradigms with finite state operations. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, pages 29–36.
Nivja H de Jong, Robert Schreuder, and R Harald Baayen. 2000. The morphological family size effect and morphology. Language and cognitive processes, 15(4-5):329–365.
Kaidi Lõo, Juhani Järvikivi, Fabian Tomaschek, Benjamin V Tucker, and R Harald Baayen. 2018. Production of Estonian case-inflected nouns shows whole-word frequency and paradigmatic effects. Morphology, 28(1):71–97.
David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, and Lori S. Levin. 2016. Panphon: A resource for mapping IPA segments to articulatory feature vectors. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3475–3484. ACL.
Liina Pylkkänen, Sophie Feintuch, Emily Hopkins, and Alec Marantz. 2004. Neural correlates of the effects of morphological family frequency and family size: an MEG study. Cognition, 91(3):B35–B45.
Miikka Silfverberg, Ling Liu, and Mans Hulden. 2018. A computational model for the linguistic notion of morphological paradigm. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1615–1626.
294
Author Index
295
Mielke, Sabrina J., 154
Miller, Sean, 115
Palmer, Alexis, 90
Papillon, Maxime, 23
Perkoff, E. Margaret, 90
Pimentel, Tiago, 154
Ponti, Edoardo Maria, 154
Pratapa, Adithya, 49
Prud’hommeaux, Emily, 154
Vaduguru, Saujas, 60
Vania, Clara, 154
Villegas, Gema Celeste Silva, 154
Vylomova, Ekaterina, 154
Woliński, Marcin, 154
Wu, Shijie, 154