
SIGMORPHON 2021

18th SIGMORPHON Workshop on


Computational Research in Phonetics,
Phonology, and Morphology

Proceedings of the Workshop

August 5, 2021
Bangkok, Thailand (online)
©2021 The Association for Computational Linguistics
and The Asian Federation of Natural Language Processing

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)


209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-954085-62-6

Preface

Welcome to the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology,


and Morphology, to be held on August 5, 2021 as part of a virtual ACL. The workshop aims to
bring together researchers interested in applying computational techniques to problems in morphology,
phonology, and phonetics. Our program this year highlights the ongoing investigations into how neural
models process phonology and morphology, as well as the development of finite-state models for low-
resource languages with complex morphology.

We received 25 submissions, and after a competitive reviewing process, we accepted 14.

The workshop is privileged to present four invited talks this year, all from highly respected members of
the SIGMORPHON community: Reut Tsarfaty, Kenny Smith, Kristine Yu, and Ekaterina Vylomova.

This year also marks the sixth iteration of the SIGMORPHON Shared Task. Following upon the success
of last year’s multiple tasks, we again hosted 3 shared tasks:

Task 0:

SIGMORPHON's sixth installment of its inflection generation shared task is divided into two parts:
generalization and cognitive plausibility.

In the first part, participants designed a model that learned to generate morphological inflections from a
lemma and a set of morphosyntactic features of the target form, similar to previous years' tasks. This year,
participants learned morphological tendencies on a set of development languages, and then generalized
these findings to new languages, without much time to adapt their models to new phenomena.

The second part asks participants to inflect nonce words in the past tense, which are then judged for
plausibility by native speakers. This task aims to investigate whether state-of-the-art inflectors are
learning in a way that mimics human learners.

Task 1:

The second SIGMORPHON shared task on grapheme-to-phoneme conversion expands on the task from
last year, recategorizing data as belonging to one of three different classes: low-resource, medium-
resource, and high-resource.

The task saw 23 submissions from 9 participants.

Task 2:

Task 2 continues the effort from the 2020 shared task in unsupervised morphology. Unlike last year’s
task, which asked participants to implement a complete unsupervised morphology induction pipeline,
this year’s task concentrates on a single aspect of morphology discovery: paradigm induction. This task
asks participants to cluster words into inflectional paradigms, given no more than raw text.

The task saw 14 submissions from 4 teams.

We are grateful to the program committee for their careful and thoughtful reviews of the papers submitted
this year. Likewise, we are thankful to the shared task organizers for their hard work in preparing the
shared tasks. We are looking forward to a workshop covering a wide range of topics, and we hope for
lively discussions.

Garrett Nicolai
Kyle Gorman

Ryan Cotterell

Organizing Committee
Garrett Nicolai (University of British Columbia, Canada)
Kyle Gorman (City University of New York, USA)
Ryan Cotterell (ETH Zürich, Switzerland)

Program Committee
Damián Blasi (Harvard University)
Grzegorz Chrupała (Tilburg University)
Jane Chandlee (Haverford College)
Çağrı Çöltekin (University of Tübingen)
Daniel Dakota (Indiana University)
Colin de la Higuera (University of Nantes)
Micha Elsner (The Ohio State University)
Nizar Habash (NYU Abu Dhabi)
Jeffrey Heinz (University of Delaware)
Mans Hulden (University of Colorado)
Adam Jardine (Rutgers University)
Christo Kirov (Google AI)
Greg Kobele (Universität Leipzig)
Grzegorz Kondrak (University of Alberta)
Sandra Kübler (Indiana University)
Adam Lamont (University of Massachusetts Amherst)
Kevin McMullin (University of Ottawa)
Kemal Oflazer (CMU Qatar)
Jeff Parker (Brigham Young University)
Gerald Penn (University of Toronto)
Jelena Prokic (Universiteit Leiden)
Miikka Silfverberg (University of British Columbia)
Kairit Sirts (University of Tartu)
Kenneth Steimel (Indiana University)
Reut Tsarfaty (Bar-Ilan University)
Francis Tyers (Indiana University)
Ekaterina Vylomova (University of Melbourne)
Adina Williams (Facebook AI Research)
Anssi Yli-Jyrä (University of Helsinki)
Kristine Yu (University of Massachusetts)

Task 0 Organizing Committee


Tiago Pimentel (University of Cambridge)
Brian Leonard (Brian Leonard Consulting)
Maria Ryskina (Carnegie Mellon University)
Sabrina Mielke (Johns Hopkins University)
Coleman Haley (Johns Hopkins University)
Eleanor Chodroff (University of York)
Johann-Mattis List (Max Planck Institute)
Adina Williams (Facebook AI Research)
Ryan Cotterell (ETH Zürich)
Ekaterina Vylomova (University of Melbourne)
Ben Ambridge (University of Liverpool)

Task 1 Organizing Committee


To come

Task 2 Organizing Committee


Adam Wiemerslage (University of Colorado Boulder)
Arya McCarthy (Johns Hopkins University)
Alexander Erdmann (Ohio State University)
Manex Agirrezabal (University of Copenhagen)
Garrett Nicolai (University of British Columbia)
Miikka Silfverberg (University of British Columbia)
Mans Hulden (University of Colorado Boulder)
Katharina Kann (University of Colorado Boulder)

Table of Contents

Towards Detection and Remediation of Phonemic Confusion


Francois Roewer-Despres, Arnold Yeung and Ilan Kogan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Recursive prosody is not finite-state


Hossep Dolatian, Aniello De Santo and Thomas Graf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

The Match-Extend serialization algorithm in Multiprecedence


Maxime Papillon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Incorporating tone in the calculation of phonotactic probability


James Kirby . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology


Khuyagbaatar Batsuren, Gábor Bella and Fausto Giunchiglia . . . . . . . . . . . . . . . . . . . . . . . . . 39

A Study of Morphological Robustness of Neural Machine Translation


Sai Muralidhar Jayanthi and Adithya Pratapa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems
Saujas Vaduguru, Aalok Sathe, Monojit Choudhury and Dipti Sharma . . . . . . . . . . . . . . . . . . . . . . . . 60

Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal,
Miikka Silfverberg, Mans Hulden and Katharina Kann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Adaptor Grammars for Unsupervised Paradigm Clustering


Kate McCurdy, Sharon Goldwater and Adam Lopez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering


E. Margaret Perkoff, Josh Daniels and Alexis Palmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Unsupervised Paradigm Clustering Using Transformation Rules


Changbing Yang, Garrett Nicolai and Miikka Silfverberg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Paradigm Clustering with Weighted Edit Distance


Andrew Gerlach, Adam Wiemerslage and Katharina Kann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion


Lucas F.E. Ashby, Travis M. Bartley, Simon Clematide, Luca Del Signore, Cameron Gibson, Kyle
Gorman, Yeonju Lee-Sikka, Peter Makarov, Aidan Malanoski, Sean Miller, Omar Ortiz, Reuben Raff,
Arundhati Sengupta, Bora Seo, Yulia Spektor and Winnie Yan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

Data augmentation for low-resource grapheme-to-phoneme mapping


Michael Hammond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion


Roger Yu-Hsiang Lo and Garrett Nicolai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction


Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Frederic Mailhot, Shreekantha Nadig,
Riqiang WANG and Nathan Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Vari-
ations on a Baseline
Simon Clematide and Peter Makarov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages


Tiago Pimentel, Maria Ryskina, Sabrina J. Mielke, Shijie Wu, Eleanor Chodroff, Brian Leonard,
Garrett Nicolai, Yustinus Ganggo Ate, Salam Khalifa, Nizar Habash, Charbel El-Khaissi, Omer Gold-
man, Michael Gasser, William Lane, Matt Coler, Arturo Oncevay, Jaime Rafael Montoya Samame, Gema
Celeste Silva Villegas, Adam Ek, Jean-Philippe Bernardy, Andrey Shcherbakov, Aziyana Bayyr-Ool,
Karina Sheifer, Elena Klyachko, Ali Salehi, Andrew Krizhanovsky, Natalia Krizhanovsky, Clara Vania,
Sardana Ivanova, Aelita Salchak, Christopher Straughn, Zoey Liu, Jonathan N. Washington, Duygu Ata-
man, Witold Kieraś, Marcin Woliński, Totok Suhardijanto, Niklas Stoehr, Zahroh Nuriah, Shyam Ratan,
Francis Tyers, Edoardo Maria Ponti, Grant Aiton, Richard J. Hatcher, Emily Prud’hommeaux, Ritesh
Kumar, Mans Hulden, Botond Barta, Dorina Lakatos, Gábor Szolnok, Judit Ács, David Yarowsky, Ryan
Cotterell, Ben Ambridge and Ekaterina Vylomova . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

Training Strategies for Neural Multilingual Morphological Inflection


Adam Ek and Jean-Philippe Bernardy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

BME Submission for SIGMORPHON 2021 Shared Task 0. A Three Step Training Approach with Data
Augmentation for Morphological Inflection
Gábor Szolnok, Botond Barta, Dorina Lakatos and Judit Ács . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

Not quite there yet: Combining analogical patterns and encoder-decoder networks for cognitively plau-
sible inflection
Basilio Calderone, Nabil Hathout and Pierre Bonami . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

Were We There Already? Applying Minimal Generalization to the SIGMORPHON-UniMorph Shared Task on Cognitively Plausible Morphological Inflection
Colin Wilson and Jane S. Y. Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

What transfers in morphological inflection? Experiments with analogical models


Micha Elsner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gra-
dient descent
Huteng Dai and Richard Futrell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

Recognizing Reduplicated Forms: Finite-State Buffered Machines


Yang Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

An FST morphological analyzer for the Gitksan language


Clarissa Forbes, Garrett Nicolai and Miikka Silfverberg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Trans-
duction
Maria Ryskina, Eduard Hovy, Taylor Berg-Kirkpatrick and Matthew R. Gormley . . . . . . . . . . . . 258

Finite-state Model of Shupamem Reduplication


Magdalena Markowska, Jeffrey Heinz and Owen Rambow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

Improved pronunciation prediction accuracy using morphology


Dravyansh Sharma, Saumya Sahai, Neha Chaudhari and Antoine Bruguier . . . . . . . . . . . . . . . . . . 282

Leveraging Paradigmatic Information in Inflection Acceptability Prediction: the JHU-SFU Submission
to SIGMORPHON Shared Task 0.2
Jane S. Y. Li and Colin Wilson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

Workshop Program
Due to the ongoing pandemic, and the virtual nature of the workshop, the papers will be presented
asynchronously, with designated question periods.

Towards Detection and Remediation of Phonemic Confusion


Francois Roewer-Despres, Arnold Yeung and Ilan Kogan

Recursive prosody is not finite-state


Hossep Dolatian, Aniello De Santo and Thomas Graf

The Match-Extend serialization algorithm in Multiprecedence


Maxime Papillon

What transfers in morphological inflection? Experiments with analogical models


Micha Elsner

Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gradient descent
Huteng Dai and Richard Futrell

Incorporating tone in the calculation of phonotactic probability


James Kirby

Recognizing Reduplicated Forms: Finite-State Buffered Machines


Yang Wang

An FST morphological analyzer for the Gitksan language


Clarissa Forbes, Garrett Nicolai and Miikka Silfverberg

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology
Khuyagbaatar Batsuren, Gábor Bella and Fausto Giunchiglia

Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
Maria Ryskina, Eduard Hovy, Taylor Berg-Kirkpatrick and Matthew R. Gormley

Finite-state Model of Shupamem Reduplication


Magdalena Markowska, Jeffrey Heinz and Owen Rambow

A Study of Morphological Robustness of Neural Machine Translation


Sai Muralidhar Jayanthi and Adithya Pratapa

No Day Set (continued)

Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems
Saujas Vaduguru, Aalok Sathe, Monojit Choudhury and Dipti Sharma

Improved pronunciation prediction accuracy using morphology


Dravyansh Sharma, Saumya Sahai, Neha Chaudhari and Antoine Bruguier

Data augmentation for low-resource grapheme-to-phoneme mapping


Michael Hammond

Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion


Roger Yu-Hsiang Lo and Garrett Nicolai

Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction


Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Frederic Mailhot, Shree-
kantha Nadig, Riqiang WANG and Nathan Zhang

CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Variations on a Baseline
Simon Clematide and Peter Makarov

Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden and Katharina Kann

Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering
E. Margaret Perkoff, Josh Daniels and Alexis Palmer

Unsupervised Paradigm Clustering Using Transformation Rules


Changbing Yang, Garrett Nicolai and Miikka Silfverberg

Paradigm Clustering with Weighted Edit Distance


Andrew Gerlach, Adam Wiemerslage and Katharina Kann

Adaptor Grammars for Unsupervised Paradigm Clustering


Kate McCurdy, Sharon Goldwater and Adam Lopez

Towards Detection and Remediation of Phonemic Confusion

Francois Roewer-Despres1∗ Arnold YS Yeung1∗ Ilan Kogan2∗


1 Department of Computer Science, 2 Department of Statistics
University of Toronto
{francoisrd, arnoldyeung}@cs.toronto.edu
[email protected]

Abstract
Reducing communication breakdown is criti-
cal to success in interactive NLP applications,
such as dialogue systems. To this end, we pro-
pose a confusion-mitigation framework for the
detection and remediation of communication
breakdown. In this work, as a first step towards
implementing this framework, we focus on de-
tecting phonemic sources of confusion. As a
proof-of-concept, we evaluate two neural architectures in predicting the probability that a listener will misunderstand phonemes in an utterance. We show that both neural models outperform a weighted n-gram baseline, showing early promise for the broader framework.

Figure 1: A simplified variant of our proposed confusion-mitigation framework, which enables generative NLP systems to detect and remediate confusion-related communication breakdown. The confusion prediction component predicts the confusion probability of candidate utterances, which are rejected if this probability is above a decision threshold, φ.
1 Introduction
Ensuring that interactive NLP applications, such as dialogue systems, communicate clearly and effectively is critical to their long-term success and viability, especially in high-stakes domains, such as healthcare. Successful systems should thus seek to reduce communication breakdown. One aspect of successful communication is the degree to which each party understands the other. For example, properly diagnosing a patient may necessitate asking logically complex questions, but these questions should be phrased as clearly as possible to promote understanding and mitigate confusion.

To reduce confusion-related communication breakdown, we propose that generative NLP systems integrate a novel confusion-mitigation framework into their natural language generation (NLG) processes. In brief, this framework ensures that such systems avoid transmitting utterances with high predicted probabilities of confusion. In the simplest and most decoupled formulation, an existing NLG component simply produces alternatives to any rejected utterances without additional guiding information. In more advanced and coupled formulations, the NLG and confusion prediction components can be closely integrated to better determine precisely how to avoid confusion. This process can also be conditioned on models of the current listener or task to achieve personalized or context-dependent results. Figure 1 shows the simplest variant of the framework.

As a first step towards implementing this framework, we work towards developing its central confusion prediction component, which predicts the confusion probability of an utterance. In this work, we specifically target phonemic confusion, that is, the misidentification of heard phonemes by a listener. We consider two potential neural architectures for this purpose: a fixed-context, feed-forward network and a residual, bidirectional LSTM network. We train these models using a novel proxy data set derived from audiobook recordings, and compare their performance to that of a weighted n-gram baseline.

* Equal contribution.

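The framework in Figure 1 reduces to a simple accept/reject loop around an existing NLG component. The sketch below is a minimal illustration of that loop, not the authors' implementation: generate_candidate, confusion_probability, and the default threshold value are hypothetical stand-ins for the NLG component, the confusion prediction component, and a designer-chosen φ.

# Minimal sketch of the decision loop in Figure 1 (illustrative, not the authors' code).
def select_utterance(generate_candidate, confusion_probability, phi=0.5, max_tries=10):
    """Return the first candidate whose predicted confusion probability falls
    below the threshold phi; otherwise return the least confusing candidate seen."""
    best, best_p = None, 1.0
    for _ in range(max_tries):
        candidate = generate_candidate()          # existing NLG component
        p = confusion_probability(candidate)      # central confusion prediction component
        if p <= phi:
            return candidate                      # accept: unlikely to confuse the listener
        if p < best_p:                            # otherwise remember the best candidate so far
            best, best_p = candidate, p
    return best

In the more coupled formulations described above, the rejection signal (or the per-phoneme probabilities behind it) would instead be fed back into generation rather than used as a simple filter.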
2 Background and Related Work

Prior work focused on identifying confusion in natural language, rather than proactively altering it to help reduce communication breakdown, as our framework proposes. For example, Batliner et al. (2003) showed that certain features of recorded speech (e.g., repetition, hyperarticulation, strong emphasis) can be used to identify communication breakdown. The authors relied primarily on prosodic properties of recorded phrases, rather than the underlying phonemes, words, or semantics, for identifying communication breakdown. On the other hand, conversational repair is a turn-level process in which conversational partners first identify and then remediate communication breakdown as part of a trouble-source repair (TSR) sequence (Sacks et al., 1974). Using this approach, Orange et al. (1996) identified differences in TSR patterns amongst people with no, early-stage, and middle-stage Alzheimer's, highlighting the usefulness of communication breakdown detection. However, such work does not directly address the issue of proactive confusion mitigation and remediation, which the more advanced formulation of our framework aims to address through listener and task conditioning. Our focus is on the simpler formulation in this preliminary work.

Rothwell (2010) identified four types of noise that may cause confusion: physical noise (e.g., a loud highway), physiological noise (e.g., hearing impairment), psychological noise (e.g., attentiveness of listener), and semantic noise (e.g., word choice). We postulate that mitigating confusion resulting from each type of noise may be possible, at least to some extent, given sufficient context to make an informed compensatory decision. For example, given a particularly physically noisy environment, speaking loudly would seem appropriate. Unfortunately, such contextual information is often lacking from existing data sets. In particular, the physiological and psychological states of listeners are rarely recorded. Even when such information is recorded (e.g., in Alzheimer's speech studies; Orange et al., 1996), the information is very coarse (e.g., broad Alzheimer's categories such as none, early-stage, and middle-stage).

We leave these non-trivial data gathering challenges as future work, instead focusing on phonemic confusion, which is significantly easier to operationalize. In practice, confusion at the phoneme level may arise from any category of Rothwell noise. It may also arise from the natural similarities between phonemes (discussed next). While many of these will not be represented in the text-based phonemic transcription data set used in this preliminary work, our approach can be extended to include them.

Researchers in speech processing have studied the prediction of phonemic confusion but, to our knowledge, this work has not been adapted to utterance generation. Instead, tasks such as preventing sound-alike medication errors (i.e., naming medications so that two medications do not sound identical) are common (Lambert, 1997). Zgank and Kacic (2012) showed that the potential confusability of a word can be estimated by calculating the Levenshtein distance (Levenshtein, 1966) of its phonemic transcription to that of all others in the vocabulary. We take inspiration from Zgank and Kacic (2012) and employ a phoneme-level Levenshtein distance approach in this work.

In the basic definition of the Levenshtein distance, all errors are equally weighted. In practice, however, words that share many similar or identical phonemes are more likely to be confused for one another. Given this, Sabourin and Fabiani (2000) developed a weighted phoneme-level Levenshtein distance, where weights are determined by a human expert or a learned model, such as a hidden Markov model. Unfortunately, while these weights are meant to represent phonemic similarity, selecting an appropriate distance metric in phoneme space is non-trivial. The classical results of Miller (1954) and Miller and Nicely (1955) group phonemes experimentally based on the noise level at which they become indiscernible. The authors identify voicing, nasality, affrication, duration, and place of articulation as sub-phoneme features that predict a phoneme's sensitivity to distortion, and therefore measure its proximity to others. Unfortunately, later work showed that these controlled conditions do not map cleanly to the real world (Batliner et al., 2003). In addition, Wickelgren (1965) found alternative phonemic distance features that could be adapted into a distance metric.

While this prior research sought to directly define a distance metric between phonemes based on sub-phoneme features, since no method has emerged as clearly superior, researchers now favour direct, empirical measures of confusability (Bailey and Hahn, 2005). Likewise, our work assumes that these classical feature-engineering approaches to
predicting phoneme confusability can be improved upon with neural approaches, just as automatic speech recognition (ASR) systems have been improved through the use of similar methods (e.g., Seide et al., 2011; Zeyer et al., 2019; Kumar et al., 2020). In addition, these classical approaches do not account for context (i.e., other phonemes surrounding the phoneme of interest), whereas our approach conditions on such context to refine the confusion estimate.

3 Data

3.1 Data Gathering Process

To predict the phonemic confusability of utterances, we would ideally use a data set in which each utterance is annotated with the speaker's phonemic transcription (the reference transcription), as well as the listener's perceived phonemic transcription (the hypothesis transcription). We could then compare these transcriptions to identify phonemic confusion.

To the best of our knowledge, a data set of this type does not exist. The English Consistent Confusion Corpus contains a collection of individual words spoken against a noisy background, with human listener transcriptions (Marxer et al., 2016). This is similar to our ideal data set; however, the words are spoken in isolation, and thus without any utterance context. This same issue arises in the Diagnostic Rhyme Test and its derivative data sets (Voiers et al., 1975; Greenspan et al., 1998). Other corpora, such as the BioScope Corpus (Vincze et al., 2008) and the AMI Corpus (Carletta et al., 2005), contain annotations of dialogue acts, which represent the intention of the speaker in producing each utterance (e.g., asking a question is labeled with the dialogue act elicit information). However, dialogue acts relating to confusion only appear when a listener explicitly requests clarification from the speaker. This does not provide fine-grained information regarding which phonemes caused the confusion, nor does it capture any instances of confusion in which the listener does not explicitly vocalize their confusion.

We thus create a new data set for this work (Figure 2). The Parallel Audiobook Corpus contains 121 hours of recorded speech data across 59 speakers (Ribeiro, 2018). We use four of its audiobooks: Adventures of Huckleberry Finn, Emma, Treasure Island, and The Adventures of Sherlock Holmes. Crucially, the audio recordings in this corpus are aligned with the text being read, which allows us to create aligned reference and hypothesis transcriptions. For each text-audio pair, the text simply becomes the reference transcription, while a transcriber converts the audio into hypothesis transcriptions. Given the preliminary nature of this work, we create a proxy data set in which we use Google Cloud's publicly-available ASR system as a proxy for human transcribers (Cloud, 2019). We then process these transcriptions to identify phonemic confusion events (as described in Section 3.2). The final data set contains 84,253 parallel transcriptions. We split these into 63,189 training, 10,532 validation, and 10,532 test transcriptions (a 75%-12.5%-12.5% split). The average reference and hypothesis transcription lengths are 65.2 and 62.3 phonemes, respectively. The transcription error rate (i.e., the proportion of phonemes that are mis-transcribed) is only 8%, so there is significant imbalance in the data set.

Figure 2: We create a new data set with parallel reference and hypothesis transcriptions from audiobook data with parallel text and audio recordings. The text simply becomes the reference transcriptions. A transcriber converts the audio recordings into hypothesis transcriptions. In this preliminary work, we use an ASR system as a proxy for human transcribers.

For the purposes of this preliminary work, the Google Cloud ASR system (Cloud, 2019) is an acceptable proxy for human transcription ability under the reasonable assumption that, for any particular transcriber, the distribution of error rates across different phoneme sequences is nonuniform (i.e., within-transcriber variation is present). This assumption holds in all practical cases, and is reasonable since the confusion-mitigation framework we propose can be conditioned on different transcribers to control for inter-transcriber variation as future work.

3.2 Transcription Error Labeling

We post-process our aligned reference-hypothesis transcription data set in two steps. First, each transcription must be converted from the word level to the phoneme level. For this, we use the CMU Pronouncing Dictionary (Weide, 1998), which is based on the ARPAbet symbol set. For any words with multiple phonemic conversions, we simply

default to the first conversion returned by the API. Second, we label each resulting phoneme in each reference transcription as either correctly or incorrectly transcribed. This is nontrivial, because the numbers of phonemes in the reference and hypothesis transcriptions are rarely equal, and thus require phoneme-level alignment. For this purpose, we use a variant of the phoneme-level Levenshtein distance that returns the actual alignment, rather than the final distance score (Figure 3).

Figure 3: Illustration of our transcription error labeling process (using letters instead of phonemes for readability). Given aligned reference (x) and hypothesis (y) vectors, we use the Levenshtein algorithm to ensure they have the same length. Because y is not available at test time, we then "collapse" consecutive insertion tokens to force the vectors to have the original length of x. Finally, we replace ỹ with the binary vector d̃, which has 1's wherever x and ỹ don't match.

Formally, let x ∈ K^a be a vector of reference phonemes and y ∈ K^b be a vector of hypothesis phonemes from the data set. K refers to the set {1, 2, 3, ..., k, <INS>, <DEL>, <SOS>, <EOS>}, where k is the number of unique phonemes in the language being considered (e.g., in English, k ≈ 40 depending on the dialect). In general, a ≠ b, but we can manipulate the vectors by incorporating insertion, deletion, and substitution tokens (as done in the Levenshtein distance algorithm). In general, this yields two vectors of the same length, x̃, ỹ ∈ K^c, c = max(a, b). While this manipulation can be performed at training time because y and b are known, such information is unavailable at test time. Therefore, we modify the alignment at training time to ensure x̃ ≡ x and c ≡ a. To achieve this, we "collapse" consecutive insertion tokens into a single instance of the insertion token, which ensures that |ỹ| = a.

Additionally, we assume that each hypothesis phoneme, ỹi ∈ ỹ, is conditionally independent of the others. That is, P(ỹi = xi | x, ỹ≠i) = P(ỹi = xi | x).¹ We hypothesize that this assumption, similar to the conditional independence assumption of Naïve Bayes (Zhang, 2004), will still yield directionally-correct results, while drastically increasing the tractability of the computation.

This assumption also allows us to simplify the output space of the problem. Specifically, since we only care to predict P(ỹ ≠ x), with this assumption, we now only need to consider, for each i, whether ỹi = xi, rather than dealing with the much harder problem of predicting the exact value of ỹi. To achieve this, we use an element-wise Kronecker delta function to replace ỹ with a binary vector, d̃, such that d̃i ← (ỹi ≠ xi). Thus, the binary vector d̃ records the position of each transcription error, that is, the position of each phoneme in x that was confused.

With the x's as inputs and the d̃'s as ground truth labels, we can train models to predict P(d̃i | x) for each i. As a post-processing step, we can then combine these individual probabilities to estimate the utterance-level probability of phonemic confusion, P(ỹ ≠ x), which is the output of the central confusion prediction component in Figure 1.

This formulation is general in the sense that any xi can affect the predicted probability of any d̃i. In practice, however, and especially for long utterances, this is overly conservative, as only nearby phonemes are likely to have a significant effect. In Section 4, we describe any additional conditional independence assumptions that each architecture makes to further simplify its probability estimate.

¹ ỹ≠i is every element in ỹ except the one at position i.
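The labelling procedure can be made concrete with a short sketch. The code below is illustrative rather than the authors' implementation: it aligns a reference and a hypothesis ARPAbet sequence with a standard Levenshtein backtrace, treats a run of consecutive insertions as a single error on the following reference position (one reading of the "collapse" step described above), and emits the binary vector d̃ over the reference positions. Reference phonemes would come from the CMU Pronouncing Dictionary, taking the first pronunciation for ambiguous words.

# Illustrative sketch of the Section 3.2 labelling procedure (not the authors' code).
def levenshtein_alignment(ref, hyp):
    """Return a list of (ref_symbol_or_<INS>, hyp_symbol_or_<DEL>) pairs."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # hypothesis dropped a phoneme
                             dist[i][j - 1] + 1,          # hypothesis inserted a phoneme
                             dist[i - 1][j - 1] + cost)   # match or substitution
    align, i, j = [], n, m                                # backtrace one optimal alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            align.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            align.append((ref[i - 1], "<DEL>")); i -= 1
        else:
            align.append(("<INS>", hyp[j - 1])); j -= 1
    align.reverse()
    return align

def error_labels(ref, hyp):
    """Binary vector d over the reference: d[i] = 1 iff reference phoneme i was confused."""
    labels, pending_insertion = [], False
    for r, h in levenshtein_alignment(ref, hyp):
        if r == "<INS>":
            pending_insertion = True                      # collapse runs of insertions
            continue
        labels.append(1 if (r != h or pending_insertion) else 0)
        pending_insertion = False
    assert len(labels) == len(ref)                        # |d| = a, as in the text
    return labels

ref = "EH V ER IY B AA D IY".split()   # "every body"
hyp = "EH V R IY B AA D IY".split()    # "everybody" (ASR hypothesis)
print(error_labels(ref, hyp))          # -> [0, 0, 1, 0, 0, 0, 0, 0]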
4 Model Architectures and Baseline

With recent advances, various neural architectures have been applied to NLP tasks. Early work includes n-gram-based, fully-connected architectures for language modeling tasks (Bengio et al., 2003; Mikolov et al., 2013). Recurrent neural network (RNN) architectures were then shown to be successful for applications such as language modeling, speech recognition, and phoneme recognition (Graves and Schmidhuber, 2005; Mikolov et al., 2011). RNN architectures such as the LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2015) variants have been successful in many NLP applications, such as machine language translation and phoneme classification (Sundermeyer et al., 2012; Graves et al., 2013; Graves

and Schmidhuber, 2005). Recently, the transformer architecture (Vaswani et al., 2017), which uses attention instead of recurrence to form dependencies between inputs, has shown state-of-the-art results in many areas of NLP, including syllable-based tasks (e.g., Zhou et al., 2018).

In this work, we propose a fixed-context-window architecture and a residual bi-LSTM architecture for the central component of our confusion-mitigation framework. While similar architectures have already been applied to phoneme-based applications, such as phoneme recognition and classification (Graves and Schmidhuber, 2005; Weninger et al., 2015; Graves et al., 2013; Li et al., 2017), to our knowledge, our study is the first to apply these architectures to identify phonemes related to confusion for listeners. In our opinion, these architectures strike an acceptable balance between compute and capability for this current work, unlike the more advanced transformer architectures, which require significantly more resources to train.²

Since the data set is imbalanced (see Section 3.1), without sample weighting, early experiments showed that both architectures never identified any phonemes as likely to be mis-transcribed (i.e., high specificity, low sensitivity). Accordingly, since the imbalance ratio is approximately 1:10, transcription errors are given 10 times more weight than properly-transcribed phonemes in our binary cross-entropy loss function.

4.1 Fixed-Context Network

The fixed-context network takes as input the current phoneme, xi, and the 4 phonemes before and after it as a fixed window of context (Figure 4a). This results in the additional conditional independence assumption that P(d̃i | x) = P(d̃i | xi−4:i+4). That is, only phonemes within the fixed context window of size 4 can affect the predicted probability of d̃i.

These 9 phonemes are first embedded in a 15-dimensional embedding space. The embedding layer is followed by a sequence of seven fully-connected hidden layers with 512, 256, 256, 128, 128, 64, and 64 neurons, respectively. Each layer is separated by Rectified Linear Unit (ReLU) non-linearities (Nair and Hinton, 2010; He et al., 2016). Finally, an output with a sigmoid activation function predicts the probability of a transcription error. We train with minibatches of size 32, using the Adam optimizer with parameters α = 0.001, β1 = 0.9, and β2 = 0.999 (Kingma and Ba, 2014) to optimize a 1:10 weighted binary cross-entropy loss function. We explored alternative parameter settings, and in particular a larger number of neurons, but found this architecture to be the most stable and highest performing of all variants tested, given the nature and relatively small size of the data set.

² Link to code: https://github.com/francois-rd/phonemic-confusion
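For concreteness, the following sketch mirrors the layer sizes reported for the fixed-context network. It assumes PyTorch and is not the released code: the class and variable names are ours, the vocabulary size of 42 is borrowed from the LSTM description, and the sigmoid is folded into BCEWithLogitsLoss rather than applied explicitly.

# Sketch (assumed PyTorch) of the fixed-context architecture described above.
import torch
import torch.nn as nn

class FixedContextNet(nn.Module):
    def __init__(self, vocab_size=42, embed_dim=15, window=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        widths = [embed_dim * (2 * window + 1), 512, 256, 256, 128, 128, 64, 64]
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(w_in, w_out), nn.ReLU()]   # seven hidden layers with ReLU
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(64, 1)                         # sigmoid handled by the loss below

    def forward(self, window_ids):                          # (batch, 9) phoneme indices
        e = self.embed(window_ids).flatten(1)               # (batch, 9 * 15)
        return self.out(self.hidden(e)).squeeze(-1)         # logit for P(d_i = 1)

model = FixedContextNet()
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(10.0))   # 1:10 error weighting
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

windows = torch.randint(0, 42, (32, 9))        # a dummy minibatch of size 32
targets = torch.randint(0, 2, (32,)).float()   # binary error labels d_i
loss = criterion(model(windows), targets)
loss.backward(); optimizer.step()

Keeping the output as a logit keeps the weighted loss numerically stable; applying torch.sigmoid at inference recovers the transcription error probability described in the text.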
4.2 LSTM Network

The LSTM network receives the entire reference transcription, x, as input and predicts the entire binarized hypothesis transcription, d̃, as output (Figure 4b). Since the LSTM is bidirectional, we do not introduce any additional conditional independence assumptions. Each input phoneme is passed through an embedding layer of dimension 42 (equal to |K|) followed by a bidirectional LSTM layer and two residual linear blocks with ReLU activations (He et al., 2016). An output residual linear block with a sigmoid activation predicts the probability of a transcription error. These skip connections are added since residual layers tend to outperform simpler alternatives (He et al., 2016). Passing the embedded input via skip connections ensures that the original input is accessible at all depths of the network, and also helps mitigate against any vanishing gradients that may arise in the LSTM.

We use the following output dimensions for each layer: 50 for the LSTM hidden and cell states, 40 for the first residual linear block, and 10 for the second. We train with minibatches of size 256, using the Adam optimizer with parameters α = 0.00005, β1 = 0.9, and β2 = 0.999 (Kingma and Ba, 2014) to optimize a 1:10 weighted binary cross-entropy loss function.
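A corresponding sketch of the residual bi-LSTM follows, again with our own naming and under the assumption of PyTorch. The paper does not fully specify how the skip connections are wired, so ResidualBlock below is only one plausible reading; the layer sizes (embedding 42, bidirectional LSTM hidden size 50, residual blocks of 40 and 10, one output per phoneme, minibatches of 256) follow the text.

# One plausible reading (assumed PyTorch) of the residual bi-LSTM described above.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.skip = nn.Linear(d_in, d_out, bias=False)   # projects the shortcut path
    def forward(self, x):
        return torch.relu(self.lin(x)) + self.skip(x)

class ConfusionLSTM(nn.Module):
    def __init__(self, vocab_size=42):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 42)        # embedding dimension 42 = |K|
        self.lstm = nn.LSTM(42, 50, batch_first=True, bidirectional=True)
        self.block1 = ResidualBlock(100, 40)             # 2 x 50 from the bidirectional LSTM
        self.block2 = ResidualBlock(40, 10)
        self.out = nn.Linear(10, 1)

    def forward(self, phoneme_ids):                      # (batch, seq_len) phoneme indices
        h, _ = self.lstm(self.embed(phoneme_ids))        # (batch, seq_len, 100)
        h = self.block2(self.block1(h))
        return self.out(h).squeeze(-1)                   # one logit per reference phoneme

model = ConfusionLSTM()
x = torch.randint(0, 42, (256, 65))                      # dummy minibatch of size 256
probs = torch.sigmoid(model(x))                          # P(d_i = 1 | x) at every position

At inference, these per-phoneme probabilities can be thresholded, or combined into an utterance-level score, and fed into the decision loop of Figure 1.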
4.3 Weighted n-Gram Baseline

We compare our neural models to a weighted n-gram baseline model. That is, d̃i depends only on the n previous phonemes in x (an order-n Markov assumption). Formally, we make the conditional independence assumption that P(d̃i | x) = P(d̃i | xi−n+1:i). Extending this baseline model to include future phonemes would violate the order-n Markov assumption that is standard in n-gram approaches. In this preliminary work, we opt to keep the baseline as standard as possible.

A weighted n-gram model is computed using an algorithm similar to the standard maximum likelihood estimation (MLE) n-gram counting algo-

rithm, but with the introduction of a weighting scheme to deal with the class imbalance issue. The weighting is necessary for a fair comparison to the weighted loss function used in the neural network models. This approach generalizes the standard MLE n-gram counting algorithm, which implicitly uses a weight of 1.

Figure 4: Architectural variants of the confusion prediction component of our confusion-mitigation framework. (a) The fixed-context network uses a fixed window of context of size 4. These 9 phonemes are embedded using a shared embedding layer, concatenated, and then passed through 7 linear layers with ReLU activations, followed by an output layer with a sigmoid activation. (b) Unrolled architecture of the LSTM network. The architecture consists of one bidirectional LSTM layer, two residual linear blocks with ReLU activations, and an output residual linear block with a sigmoid activation. Additional skip connections are added throughout.

Formally, let W > 0 be the selected weight, and define ci ≡ xi−n+1:i to simplify the notation. Also, let C(d̃i | ci) be the count of all incorrect phoneme transcriptions in the context ci in the entire data set, and similarly, C(1 − d̃i | ci) for correct transcriptions.³ The weighted n-gram is then computed as follows:

P(d̃i | ci) = [W × C(d̃i | ci)] / [C(1 − d̃i | ci) + W × C(d̃i | ci)]

Empirically, we find that a weighted 3-gram model works best; larger contexts are too sparse given the size of the data set and smaller contexts lack expressive capacity. We do not use any n-gram smoothing methods. Instead, any missing contexts encountered at test time are simply marked as incorrect predictions. For this particular data set, such missing contexts are vanishingly rare (occurring only 0.003% of the time), which justifies our approach.

³ We slightly abuse the notation here. Recall that d̃i ← (ỹi ≠ xi), so we notate ỹi = xi as 1 − d̃i.

5 Results and Discussion

5.1 Quantitative Analysis

We report receiver operating characteristic (ROC) curves for all models (Figure 5). To facilitate fair comparison, all models are trained with the same random ordering of training data in each epoch. Both neural network architectures outperform the weighted n-gram baseline by a small margin, with the fixed-context network appearing to perform slightly better overall. While no individual model exhibits any significant performance gain over the others, all models perform significantly better than random chance. This shows the promise of our framework, which is precisely the objective of this work. We next speculate as to the causes of the slight gaps that are observed.

The neural network models likely outperform the weighted n-gram baseline for multiple reasons. First, both neural network models condition on a context that includes both past and future phonemes (i.e., bidirectional), whereas the baseline only conditions on past phonemes (i.e., unidirectional). Utilizing future phonemes as context is useful since both humans and most state-of-the-art ASR systems use this information to revise their predictions. Second, the neural networks can learn sub-contextual patterns that the baseline cannot. For example, the contexts A B C and A B D have the sub-context A B in common. Whereas the weighted n-gram treats these as completely different contexts, the neural networks may be able to exploit the similarity between them. This kind of parameter sharing is more data efficient, which can lead to lower variance estimates (less overfitting) in the small data set setting we are considering.

The simpler fixed-context network slightly outperforms the more complex LSTM alternative. While RNN architectures have been shown to outperform feed-forward networks in language processing tasks (Sundermeyer et al., 2012), other research has shown that simpler architectures are still able to process phonemic data effectively (Ba and Caruana, 2014). The lack of an additional conditional independence assumption for the LSTM model may have resulted in worse data efficiency, since the model needs to expend parameters on all reference phonemes, even those very far away that may have little impact on the current one. In addition, the smaller number of parameters to estimate may have led to lower variance in the fixed-context model. Given this, our avoidance of more advanced or deeper models, such as transformers, seems justified for this preliminary work. We hypothesize that such models could outperform all the models considered here given a significantly larger data set.

Figure 5: ROC curves for our model variants (true positive rate vs. false positive rate, in %, for the Fixed-Context, LSTM, Baseline, and Random variants).

5.2 Qualitative Analysis

5.2.1 Description

We perform qualitative error analysis on randomly selected phonemes from amongst those that are most (Table 1) and least (Table 2) likely to contain transcription errors according to the fixed-context model. This offers some qualitative insights regarding phonemic confusion. We sample from the fixed-context model due to its slightly superior performance, and show small phrases centered around the phoneme most (least) likely to cause confusion, rather than full transcriptions, for clarity.

In addition, to improve readability, we show words rather than the underlying phonemes. As a result, some of the errors appear to be orthographic in nature even if they are not. For example, "every body" becomes "everybody" in the first example of Table 1. However, the phonemes that constitute "every body" and "everybody" are indeed different: "EH V ER IY B AA D IY" versus "EH V R IY B AA D IY". As per our definitions in Section 3.2, these cases do represent transcription errors. However, it may be argued that such errors introduce unwanted noise in the data set, which we hope to correct in future work.

Ground Truth Phrase | Transcription of Audio Recording
... for they say every body is in love once ... | ... for they say everybody is in love once ...
... his grave looks shewed that she was not ... | ... his grave look showed that she was not ...
... shall use the carriage to night ... | ... shall use the carriage tonight ...
... making him understand I warn't dead ... | ... making him understand I warrant dead ...
... shore at that place so we warn't afraid ... | ... sure at that place so we weren't afraid ...
... read Elton's letter as I was shewn in ... | ... read Elton's letters I was shown in ...
... sacrifice my poor hair to night and ... | ... sacrifice my poor head tonight and ...
... we warn't feeling just right ... | ... we weren't feeling just right ...
... that there was no want of taste ... | ... that there was no on toothpaste ...
... knew that Arthur had discovered ... | ... knew was it also have discovered ...

Table 1: Randomly selected phrases from amongst the top 100 phonemes predicted to be incorrectly transcribed by the fixed-context model (transcription error probability > 0.999). Bold text denotes ASR transcription errors.

Ground Truth Phrase | Transcription of Audio Recording
... the exquisite feelings of delight and ... | ... the exquisite feelings of delight and ...
... gone Mister Knightley called ... | ... gone Mister Knightley called ...
... has been exceptionally ... | ... has been exceptionally ...
... not afraid of your seeing ... | ... not afraid if you're saying ...
... the sale of Randalls was long ... | ... the sale of Randalls was long ...
... her very kind reception of himself ... | ... her very kind reception to himself ...
... for the purpose of preparatory inspection ... | ... for the purpose of preparatory inspection ...
... you would not be happy until you ... | ... you would not be happy until you ...
... with the exception of this little blot ... | ... with the exception of this little blot ...
... night we were in a great bustle getting ... | ... night we were in a great bustle getting ...

Table 2: Randomly selected phrases from amongst the top 100 phonemes predicted to be correctly transcribed by the fixed-context model (transcription error probability < 0.03). Bold text denotes ASR transcription errors.

5.2.2 Analysis

First, we note that every sample in Table 1 does indeed have a transcription error, while few samples have errors in Table 2. It therefore seems as though, when the fixed-context model is very certain about the presence or absence of errors, it is usually correct.

Second, many of the transcription errors in Table 1 are seemingly caused by the archaic or idiosyncratic writing present in the books used to create the data set. While this can be seen as a source of unwanted noise (we used an ASR system trained on standard modern English), we argue that, as per Rothwell's model of communication (Section 2), familiarity with the vocabulary is, in fact, a very legitimate source of semantic noise. Indeed, phrases using more modern and standard vernacular are seemingly less likely to be confusing, according to the fixed-context model.

Third, many of the errors not related to archaism involve stop words, homonyms, or near-homophones, which intuitively makes sense. Additionally, hard consonant sounds between words (and stress at the beginning rather than at the end of words) appear more common in the set of correctly-transcribed phrases as compared to the set of incorrectly-transcribed ones. These findings suggest the fixed-context model has picked up on some underlying patterns governing phonemic confusion, which is promising for our confusion-mitigation framework as a whole.

5.3 Future Work

This work uses a relatively small data set. Creating and using a significantly larger corpus using human subjects rather than an ASR proxy would likely yield more directly relevant results. We postulate that, with a larger and higher quality data set, a deeper and more advanced neural network architecture, such as the transformer, may produce stronger results. Future work can also investigate the differences in human phonemic confusability on 'natural' versus semantically-unpredictable sentences.

A major aspect of our confusion-mitigation framework, which we have not explored in this work, is the generation of alternative, clearer utterances that retain the initial meaning. Constructively enumerating these alternatives is non-trivial, as is identifying the neighbourhood beyond which their meaning differs too significantly from the original. Conditioning on a specific listener's priors as an additional mechanism to reduce communication breakdown is another major aspect we leave to future work.

Perhaps most significantly, we have limited the scope of our confusion assessment drastically in this preliminary work, primarily to simplify the data gathering process. While our results are promising, communication breakdown is a nuanced and multi-faceted phenomenon of which phonemic confusion is but one small component. Modeling these larger and more complex processes remains an important open challenge.

6 Conclusion

Reducing communication breakdown is critical to successful interaction in dialogue systems and other generative NLP systems. In this work, we proposed a novel confusion-mitigation framework that such systems could employ to help minimize the probability of human confusion during an interaction. As a first step towards implementing this framework, we evaluated two potential neu-

ral architectures—a fixed-context network and an LSTM network—for its central component, which predicts the confusion probability of a candidate utterance. These neural architectures outperformed a weighted n-gram baseline (with the fixed-context network performing best overall) when trained using a proxy data set derived from audiobook recordings. In addition, qualitative analyses suggest that the fixed-context model has uncovered some of the more intuitive causes of phonemic confusion, including stop words, homonyms, near-homophones, and familiarity with the vocabulary. These preliminary results show the promise of our confusion-mitigation framework. Given this early success, further investigation and refinement is warranted.

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/partners).

References

Jimmy Ba and Rich Caruana. 2014. Do Deep Nets Really Need to be Deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.
Todd M. Bailey and Ulrike Hahn. 2005. Phoneme similarity and confusability. Journal of Memory and Language, 52(3):339–362.
Anton Batliner, Kerstin Fischer, Richard Huber, Jörg Spilker, and Elmar Nöth. 2003. How to find trouble in communication. Speech Communication, 40(1-2):117–143.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI Meeting Corpus: A Pre-Announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. In International Conference on Machine Learning, pages 2067–2075.
Google Cloud. 2019. Speech-to-Text Client Libraries.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.
Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.
Steven L Greenspan, Raymond W Bennett, and Ann K Syrdal. 1998. An evaluation of the diagnostic rhyme test. International Journal of Speech Technology, 2(3):201–214.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. ISSN: 1063-6919.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kshitiz Kumar, Chaojun Liu, Yifan Gong, and Jian Wu. 2020. 1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM. Proc. Interspeech 2020, pages 2107–2111.
Bruce L. Lambert. 1997. Predicting look-alike and sound-alike medication errors. American Journal of Health-System Pharmacy, 54(10):1161–1171.
Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.
Kun Li, Xiaojun Qian, and Helen Meng. 2017. Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1):193–207.
Ricard Marxer, Jon Barker, Martin Cooke, and Maria Luisa Garcia Lecumberri. 2016. A corpus of noise-induced word misperceptions for English. The Journal of the Acoustical Society of America, 140(5):EL458–EL463.
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531. IEEE.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
George A. Miller. 1954. An Analysis of the Confusion among English Consonants Heard in the Presence of Random Noise. Journal of The Acoustical Society of America, 26.
George A. Miller and Patricia E. Nicely. 1955. An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27(2):338–352.
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.
John B. Orange, Rosemary B. Lubinski, and D. Jeffery Higginbotham. 1996. Conversational Repair by Individuals with Dementia of the Alzheimer's Type. Journal of Speech, Language, and Hearing Research, 39(4):881–895.
Manuel Sam Ribeiro. 2018. Parallel Audiobook Corpus. University of Edinburgh School of Informatics.
J. Dan Rothwell. 2010. In the Company of Others: An Introduction to Communication. New York: Oxford University Press.
Michael Sabourin and Marc Fabiani. 2000. Predicting auditory confusions using a weighted Levinstein distance. US Patent 6,073,099.
Harvey Sacks, Emanuel Schegloff, and Gail Jefferson. 1974. A Simple Systematic for the Organisation of Turn-Taking in Conversation. Language, 50:696–735.
Frank Seide, Gang Li, and Dong Yu. 2011. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. In Twelfth Annual Conference of the International Speech Communication Association.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Veronika Vincze, György Szarvas, Richárd Farkas, György Móra, and János Csirik. 2008. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics, 9(11):S9.
William D Voiers, Alan D Sharpley, and Carl J Hehmsoth. 1975. Research on Diagnostic Evaluation of Speech Intelligibility. Research Report AFCRL-72-0694, Air Force Cambridge Research Laboratories, Bedford, Massachusetts.
Robert L. Weide. 1998. The CMU pronouncing dictionary. The Speech Group.
Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller. 2015. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pages 91–99. Springer.
Wayne A. Wickelgren. 1965. Acoustic similarity and intrusion errors in short-term memory. Journal of Experimental Psychology, 70(1):102.
Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2019. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 8–15. IEEE.
Andrej Zgank and Zdravko Kacic. 2012. Predicting the Acoustic Confusability between Words for a Speech Recognition System using Levenshtein Distance. Elektronika ir Elektrotechnika, 18(8):81–84.
Harry Zhang. 2004. The optimality of naive Bayes. American Association for Artificial Intelligence, 1(2):3.
Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. 2018. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese. arXiv preprint arXiv:1804.10752.

10
Recursive prosody is not finite-state

Hossep Dolatian Aniello De Santo Thomas Graf


Department of Linguistics Department of Linguistics Department of Linguistics
Stony Brook University University of Utah Stony Brook University
Stony Brook, NY, USA Salt Lake City, Utah, USA Stony Brook, NY, USA
[email protected] [email protected] [email protected]

Abstract

This paper investigates bounds on the generative capacity of prosodic processes, by focusing on the complexity of recursive prosody in coordination contexts in English (Wagner, 2010). Although all phonological processes and most prosodic processes are computationally regular string languages, we show that recursive prosody is not. The output string language is instead parallel multiple context-free (Seki et al., 1991). We evaluate the complexity of the pattern over strings, and then move on to a characterization over trees that requires the expressivity of multi bottom-up tree transducers. In doing so, we provide a foundation for future mathematically grounded investigations of the syntax-prosody interface.

1 Introduction

At the level of words, all attested processes in phonology form regular string languages and can be generated via finite-state acceptors (FSAs) and transducers (FSTs) (Johnson, 1972; Kaplan and Kay, 1994; Heinz, 2018). However, not much attention has been given to the generative capacity of prosodic processes at the phrasal or sentential level (but see Yu, 2019). The little work that exists in this respect has shown that many attested intonational processes are finite-state and regular (Pierrehumbert, 1980). It is thus a common hypothesis in the literature that the cross-linguistic typology of prosodic phonology should also be regular.

In this paper, we falsify this hypothesis by providing a mathematically grounded characterization of a pattern of recursive prosody in English coordination, as empirically documented by Wagner (2010). Specifically, we show that when converting a syntactic representation into a prosodic representation, the string language that is generated by this prosodic process is neither a regular nor context-free language, and thus cannot be generated by string-based FSAs. As a tree-to-tree function, the pattern can be captured by a class of bottom-up tree transducers whose outputs correspond to parallel multiple context-free string languages.

This paper is organized as follows. In §2, we provide a literature review of phonology and prosodic phonology, with emphasis on the general tendency for regular computation. In §3, we describe the recursive prosody of coordination structures, and why it cannot be generated with an FST over string inputs. In §4, we show how a multi bottom-up tree transducer can generate the prosodic patterns. We discuss our results in §5, and conclude in §6.

2 Computation of prosody

Within computational prosody, there are two strands of work. One focuses on the generation of prosodic structure at or below the word level. The other operates above the word-level.

At the word level, there is a plethora of work on generating prosodic constituents, all of which require finite-state or regular computation, whether for syllables (Kiraz and Möbius, 1998; Yap, 2006; Hulden, 2006; Idsardi, 2009), feet (van Oostendorp, 1993; Idsardi, 2009; Yu, 2017), or prosodic words (Coleman, 1995; Chew, 2003).1 In fact, most word-level prosody seems to require at most subregular computation (Strother-Garcia, 2018, 2019; Hao, 2020; Dolatian, 2020; Dolatian et al., 2021; Koser, in prep).

1 For syllables and feet, there is a large literature of formalization within Declarative Phonology (Scobbie et al., 1996). This work tends to employ formal representations that are similar to context-free grammars (Klein, 1991; Walther, 1993, 1995; Dirksen, 1993; Coleman, 1991, 1992, 1993, 1996, 2000, 1998; Coleman and Pierrehumbert, 1997; Chew, 2003). But these representations can be restricted enough to be equivalent to regular languages (see earlier such restrictions in Church, 1983).

However, there is a dearth of formal results for phrasal or intonational prosody. Early work in generative phonology treated the prosodic representations as directly generated from the syntax, with any deviations caused by readjustment rules (Chomsky and Halle, 1968).

Notoriously, syntactic representations are at least context-free (Chomsky, 1956; Chomsky and Schützenberger, 1959). Because sentential prosody interacts with the syntactic level in non-trivial ways, it might seem sensible to assume that 1) the transformation from syntax to prosody is not finite-state definable (= definable with finite-state transducers), and that 2) the string language of prosodic representations is a supra-regular language, not a regular language. Importantly though, this assumption is not trivially true. In fact, early work has shown that even if syntax is context-free, the corresponding prosodic structures can be a regular string language. For instance, Reich (1969) argued that the prosodic structures in SPE can be generated via finite-state devices (see also Langendoen, 1975), while Pierrehumbert (1980) modeled English intonation using a simple finite-state acceptor.

When analyzed over string languages, this mismatch between supra-regular syntax and regular prosody was not explored much in the subsequent literature. In fact, it seems that current research on computational prosody uses the premise that prosodic structures are at most regular (Gibbon, 2001). Crucially, this premise is confounded by the general lack of explicit mathematical formalizations of prosodic systems. For example, there are algorithms for Dutch intonation that capture surface intonational contours and other acoustic cues (t'Hart and Cohen, 1973; t'Hart and Collier, 1975). These algorithms however do not themselves provide sufficient mathematical detail to show that the prosodic phenomenon in question is a regular string language. Instead, one has to deduce that Dutch intonation is regular because the algorithm does not utilize counting or unbounded look-ahead (t'Hart et al., 2006, pg. 114).

As a reflection of this mismatch, early work in prosodic phonology assumed something known as the strict layer hypothesis (SLH; Nespor and Vogel, 1986; Selkirk, 1986). The SLH assumed that prosodic trees cannot be recursive — i.e. a prosodic phrase cannot dominate another prosodic phrase — thus ensuring that a prosodic tree will have fixed depth. Subsequent work in prosodic phonology weakened the SLH: prosodic recursion at the phrase or sentence level is now accepted as empirically robust (Ladd 1986, 2008, ch. 8; Selkirk 2011; Ito and Mester 2012, 2013). But empirically, it is difficult to find cases of unbounded prosodic recursion (Van der Hulst, 2010). Consider a language that uses only bounded prosodic recursion — e.g. there can be at most two recursive levels of prosodic phrases. The prosodic tree will have fixed depth; and the computation of the corresponding string language is regular. It is then possible to create a computational network that uses a supra-regular grammar for the syntax which interacts with a finite-state grammar for the prosody (Yu and Stabler, 2017; Yu, 2019). To summarize, it seems that the implicit consensus in computational prosody is that 1) syntax can be supra-regular, but the corresponding prosody is regular; 2) prosodic recursion is bounded.

However, as we elaborate in the next section, coordination data from Wagner (2005) is a case where syntactic recursion generates potentially unbounded-recursive prosodic structure. The rest of the paper is then dedicated to exploring the consequences of this construction for the expressivity of sentential prosody.

3 Prosodic recursion in coordination

To our knowledge, Wagner (2005, 2010) is the clearest case where syntactic recursion gets mapped to recursive prosody, such that the recursion is unboundedly deep for the prosody. In this section, we go over the data and generalizations (§3.1), we sketch Wagner's cyclic analysis (§3.2), and we discuss issues with finiteness (§3.3). Finally, we show that this construction does not correspond to a regular string language (§3.4).

3.1 Unbounded recursive prosody

Wagner documents unbounded prosodic recursion in the coordination of nouns, in contrast to earlier results which reported flat non-recursive prosody (Langendoen, 1987, 1998). Based on experimental and acoustic studies, Wagner reports that recursive coordination creates recursively strong prosodic boundaries. Syntactic edges have a prosodic strength that incrementally depends on their distance from the bottom-most constituents.

When three items are coordinated with two non-identical operators, then two syntactic parses are possible. Each syntactic parse has an analogous prosodic parse. The prosodic parse is based on the relative strength of a prosodic boundary, with | being weaker than ||. The boundary is placed before the operator.

Table 1: Prosody of three items with non-identical operators

Syntactic grouping     Prosodic grouping
[A and [B or C]]       A || and B | or C
[[A and B] or C]       A | and B || or C

When the two operators are identical, then three syntactic and prosodic parses are possible. The difference between the parses is determined by semantic associativity. For example, a sentence like I saw [[A and B] and C] means that I saw A and B together, and I saw C separately.

Table 2: Prosody of three items with identical operators

Syntactic grouping     Prosodic grouping
[A and [B and C]]      A || and B | and C
[[A and B] and C]      A | and B || and C
[A and B and C]        A | and B | and C

When four items are coordinated, then at most 11 parses are possible. The maximum is reached when the three operators are identical. We can have three levels of prosodic boundaries, ranging from the weakest | to the strongest |||.

Table 3: Prosody of four items with identical operators

Syntactic grouping              Prosodic grouping
[A and B and C and D]           A | and B | and C | and D
[A and B and [C and D]]         A || and B || and C | and D
[A and [B and C] and D]         A || and B | and C || and D
[[A and B] and C and D]         A | and B || and C || and D
[A and [B and C and D]]         A || and B | and C | and D
[[A and B and C] and D]         A | and B | and C || and D
[[A and B] and [C and D]]       A | and B || and C | and D
[A and [B and [C and D]]]       A ||| and B || and C | and D
[A and [[B and C] and D]]       A ||| and B | and C || and D
[[A and [B and C]] and D]       A || and B | and C ||| and D
[[[A and B] and C] and D]       A | and B || and C ||| and D

We can extract the following generalizations from the data above. First, the depth of a constituent directly affects the prosodic strength of its edges. At a syntactic edge, the strength of the prosodic boundary depends on the distance between that edge and the most embedded element: for instance, in (1a) the left-bracket between A-B is mapped to a prosodic boundary of strength three |||, because A is above two layers of coordination. The deepest constituent C-D gets the weakest boundary |. Second, when there is associativity, the prosodic strength percolates to other positions within this associative span. For example, in (1b) the boundary of strength || is percolated to A-B from B-C.

1. Generalizations on coordination
(a) Strength is long-distantly calculated
[A and [B and [C and D]]] is mapped to
A ||| and B || and C | and D
(b) Strength percolates when associative
[A and B and [C and D]] is mapped to
A || and B || and C | and D

3.2 Wagner's cyclic analysis

In order to generate the above forms, Wagner devised a cyclic procedure which we summarize with the algorithm below.

2. Wagner's cyclic algorithm
(a) Base case: Let X be a constituent that contains a set of unprosodified nouns (terminal nodes) that are in an associative coordination. Place a boundary of strength | between each noun.
(b) Recursive case: Consider a constituent Y. Let S be a set of constituents (terminals or non-terminals) that is properly contained in Y, such that at least one constituent in S is prosodified. Let |k be the strongest prosodic boundary inside Y. Place the boundary |k+1 between each constituent in Y.

The algorithm is generalized to coordination of any depth. It takes as input a syntactic tree, and the output is prosodically marked strings. We illustrate this below, with the input tree represented as a bracketed string.

3. Illustrating Wagner's algorithm
Input: [A and B and [C and D]]
Base case: C | and D
Recursive case: A || and B || and C | and D

3.3 Issues of finiteness

Because Wagner's study used noun phrases with at most three or four items, the resulting language of prosodic parses is a finite language. Thus, the relevant syntax-to-prosody function is bounded. It is difficult to elicit coordination of 5 items, likely due to processing reasons (Wagner, 2010, 194).

If the primary culprit is performance, though, then syntactic competence may in fact allow for coordination constructions of unbounded depth with any number of items. Wagner's algorithm generates a prosodic structure for any such sentence, such as for (4). For the rest of this paper, we abstract away the finite bounds on coordination size in order to analyze the generative capacity of the underlying system (see Savitch, 1993, for mathematical arguments in support of factoring out finite bounds).

4. Hypothetical prosody for large coordination
[A and B and [C and [D and E]]] is mapped to
A ||| and B ||| and C || and D | and E
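As an illustration (not the authors' implementation), the cyclic procedure in (2) and its outputs in (3)-(4) can be sketched in a few lines of Python. The encoding here is an assumption made for readability: a coordination is a nested Python list of its conjuncts, the coordinator is fixed to "and", and boundary strength is an integer.

```python
# Sketch of the cyclic procedure in (2).  ["A", "B", ["C", "D"]] stands for
# the syntactic parse [A and B and [C and D]]; a flat (associative)
# coordination is simply a list with three or more conjuncts.

def prosodify(constituent):
    """Return (list of (noun, strength-of-boundary-before-noun) pairs, strongest strength inside)."""
    if isinstance(constituent, str):          # a bare noun: no boundaries yet
        return [(constituent, 0)], 0
    parts, deepest = [], 0
    for daughter in constituent:
        p, k = prosodify(daughter)
        parts.append(p)
        deepest = max(deepest, k)
    strength = deepest + 1                    # place |k+1 between the daughters of this level
    out = list(parts[0])
    for p in parts[1:]:
        head = p[0][0]                        # first noun of the next daughter
        out += [(head, strength)] + p[1:]
    return out, strength

def spell_out(constituent, conj="and"):
    pairs, _ = prosodify(constituent)
    pieces = [pairs[0][0]]
    for noun, k in pairs[1:]:
        pieces.append("|" * k + " " + conj + " " + noun)
    return " ".join(pieces)

# Reproduces (3) and (4):
print(spell_out(["A", "B", ["C", "D"]]))           # A || and B || and C | and D
print(spell_out(["A", "B", ["C", ["D", "E"]]]))    # A ||| and B ||| and C || and D | and E
```

Running the same function on the bracketings in Tables 2-3 (e.g. [["A", "B"], "C", "D"]) reproduces the corresponding prosodic groupings.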

3.4 Computing recursive prosody over strings

The choice of representation plays an important role in determining the generative capacity of the prosodic mapping. We first start by treating the mapping as a string-to-string function. We show that the mapping is not regular.

Let the input language be a bracketed string language, such that the input alphabet is a set of nouns {A, ..., Z}, coordinators, and brackets. The output language replaces the brackets with substrings of |*. For illustration, assume that the input language is guaranteed to be a well-bracketed string. At a syntactic boundary, we have to calculate the number of intervening boundaries between it and the deepest node. But this requires unbounded memory. For instance, to parse the example below, we incrementally increase the prosodic strength of each boundary as we read the input left-to-right.

5. Linearly parsing the prosody:
[[[A and B] and C] and D] is mapped to A | and B || and C ||| and D, where
Input alphabet Σ = {A, ..., Z, and, or, [, ]}
Output alphabet ∆ = {A, ..., Z, and, or, |}
Input language is Σ* and well-bracketed

Given the above string with only left-branching syntax, the leftmost prosodic boundary will have a juncture of strength |. Every subsequent prosodic boundary will have incrementally larger strength. Over a string, this means we have to memorize the number x of prosodic junctures that were generated at any point in order to then generate x+1 junctures at the next point. A 1-way FST cannot memorize an unbounded amount of information. Thus, this function is not a rational function and cannot be defined by a 1-way FST. To prove this, we can look at this function in terms of the size of the input and output strings.

6. Illustrating growth size of recursive prosody
[ ... [[A0 and A1] and A2] and ... and An] (with n left brackets) is mapped to
A0 | and A1 || and A2 ||| and ... |n and An

Abstractly, for a left-branching input string with n left-brackets [, the output string has a monotonically increasing number of prosodic junctures: |, ||, |||, ..., |n. The total number of prosodic junctures is the triangular number n(n+1)/2. We thus derive the following lemma.

Lemma 1. For generating coordination prosody as a string-to-string function, the size of the output string grows at a rate of at least O(n²), where n is the size of the input string.
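The growth behaviour behind (6) and Lemma 1 can be checked numerically. The following small script is not from the paper; it simply builds the left-branching input and its prosodified output and confirms that the juncture count is the triangular number n(n+1)/2, i.e. quadratic in n while the input only grows linearly.

```python
# Numerical check of the growth claim: a left-branching coordination with
# n left brackets yields 1 + 2 + ... + n = n(n+1)/2 prosodic junctures "|",
# which eventually exceeds c * (input length) for any fixed constant c.

def left_branching_input(n):
    """[[[A0 and A1] and A2] ... and An] as a plain bracketed string."""
    s = "A0"
    for i in range(1, n + 1):
        s = "[" + s + " and A" + str(i) + "]"
    return s

def wagner_output(n):
    """A0 | and A1 || and A2 ||| ... |^n and An (one more juncture each step)."""
    pieces = ["A0"]
    for i in range(1, n + 1):
        pieces.append("|" * i + " and A" + str(i))
    return " ".join(pieces)

for n in (2, 4, 8, 16):
    junctures = wagner_output(n).count("|")
    assert junctures == n * (n + 1) // 2      # triangular number
    print(n, len(left_branching_input(n)), junctures)
# Input length grows linearly in n, juncture count quadratically, so no
# linear bound c*n on output size can hold.
```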
Such a function is neither rational nor regular. Rational functions are computed by 1-way FSTs, and regular functions by 2-way FSTs (Engelfriet and Hoogeboom, 2001).2 They share the following property in terms of growth rates (Lhote, 2020).

2 This equivalence only holds for functions and deterministic FSTs. Non-deterministic FSTs can also compute relations.

Theorem 1. Given an input string of size n, the size of the output string of a regular function grows at most linearly as c·n, where c is a constant.

Thus, this string-to-string function is not regular. It could be a more expressive polyregular function (Engelfriet and Maneth, 2002; Engelfriet, 2015; Bojańczyk, 2018; Bojańczyk et al., 2019), a question that we leave for future work.

The discussion in this section focused on generating the output prosodic string when the input syntax is a bracketed string. Importantly though, Lemma 1 entails that no matter how one chooses their string encoding of syntactic structure, prosody cannot be modeled as a rational transduction unless there is an upper bound on the minimum number of output symbols that a single syntactic boundary must be rewritten as. To the best of our knowledge, there is no syntactic string encoding that guarantees such a bound. In the next section, we will discuss how to compute prosodic strength starting from a tree.

4 Computing recursive prosody over trees

Wagner (2010)'s treatment of recursive prosody assumes an algorithm that maps a syntactic tree to a prosodic string. It is thus valuable to understand the complexity of processes at the syntax-prosody interface starting from the tree representation of a sentence. Assuming we start from trees, there is one more choice to be made, namely whether the prosodic information (in the output) is present within a string or a tree. Notably, every tree-to-string transduction can be regarded as a tree-to-tree transduction plus a string yield mapping. As the tree-to-tree case subsumes the tree-to-string one, it makes sense to consider only the former. For a tree-to-tree mapping, the goal is to obtain a tree representation that already contains the correct prosodic information (Ladd, 1986; Selkirk, 2011). This is the focus of the rest of this paper.

4.1 Dependency trees

When working over syntactic structures explicitly, it is important to commit to a specific tree representation.
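As a preview of the encoding developed in the next subsections, the sketch below shows one possible way to write down a dependency tree in which the coordinator heads its conjuncts (§4.1), enriched with the prosodic strength branch of §4.2, together with a much-simplified yield for the binary case. The tuple encoding and function names are assumptions for illustration, not the paper's formal definitions.

```python
# Hypothetical encoding of the dependency trees of Section 4.1: the
# coordinator heads its conjuncts, so "Pearl and Garnet" is and(Pearl, Garnet).
# Nested tuples are used: (label, daughter1, daughter2, ...).

pearl_and_garnet = ("and", "Pearl", "Garnet")

# Section 4.2 adds a prosodic "strength branch" (written $ in the paper) at
# each coordination level; here the strength is an integer daughter:
annotated = ("and", "Pearl", ("$", 1), "Garnet")   # target: Pearl | and Garnet

def yield_string(tree):
    """Simplified reading of an annotated binary coordination as a string
    (a toy version of the modified recursive-descent yield sketched in 4.2)."""
    if isinstance(tree, str):
        return tree
    label, *daughters = tree
    if label == "$":
        return "|" * daughters[0]
    left, strength, right = daughters
    return yield_string(left) + " " + yield_string(strength) + " and " + yield_string(right)

print(yield_string(annotated))   # Pearl | and Garnet
```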

In what follows, we adopt a type of dependency trees, number of prosodic boundaries needed at that level.
where the head of a phrase is treated as the mother of and4
the subtree that contains its arguments. For example,
the coordinated noun phrase Pearl and Garnet is Pearl1 $2 and7
represented as the following dependency tree.
$3 Garnet5 $6 Rose8
and
Finally, the prosodic tree is fed to a yield function
Pearl Garnet to generate an output prosodified string. In particular,
the correct tree-to-string mapping can be obtained
Dependency trees have a rich tradition in descrip- by a modified version of a recursive-descent yield,
tive, theoretical, and computational approaches to lan- which enumerates nodes left-to-right, depth first,
guage, and their properties have been defined across a and only enumerates the mother node of each
variety of grammar formalisms (Tesnière, 1965; Nivre, level after the boundary branch. This strategy is
2005; Boston et al., 2009; Kuhlmann, 2013; Debus- depicted by the numerical subscripts in the tree above,
mann and Kuhlmann, 2010; De Marneffe and Nivre, which reconstruct how the yield of the prosodically
2019; Graf and De Santo, 2019; Shafiei and Graf, annotated tree produces the string: Pearl || and
2020, a.o.). Dependency trees keep the relation be- Garnet | and Rose. The rest of this section will focus
tween heads and arguments local, and they maximally on how to obtain the correct tree encoding of prosodic
simplify the readability of our mapping rules. Hence, information, starting from a plain dependency tree.
they allow us to focus our discussion on issues that
are directly related to the connection of coordinated 4.3 Mathematical preliminaries
embeddings and prosodic strength, without having to For a natural number n, we let [n] = {1,...,n}. A
commit to a particular analysis of coordinate structure. ranked alphabet Σ is a finite set of symbols, each one
Importantly, this choice does not impact the gener- of which has a rank assigned by the function r :Σ→N.
alizability of the solution. It is fairly straightforward to We write Σ(n) to denote {σ ∈Σ|r(σ)=n}, and σ(n)
convert basic dependency trees into phrase structure indicates that σ has rank n.
trees. Similarly, although it is possible to adopt n-ary Given a ranked alphabet Σ and a set A, TΣ(A) is
branching structures, we chose to limit ourselves the set of all trees over Σ indexed by A. The symbols
to binary trees (in the input). This turns out to be in Σ are possible labels for nodes in the tree, indexed
the most conservative assumption, as it forces us to by elements in A. The set TΣ of Σ-trees contains
explicitly deal with associativity and flat prosody. all σ ∈Σ(0) and all terms σ(n)(t1,...,tn) (n≥0) such
that t1, ... , tn ∈ TΣ. Given a term m(n)(s1, ... , sn)
4.2 Encoding prosodic strength over trees where each si is a subtree with root di, we call m the
We are interested in the complexity of mapping a mother of the daughters d1,...,dn (1 ≤ i ≤ n). If two
“plain” syntactic tree to a tree representation which con- distinct nodes have the same mother, they are siblings.
tains the correct prosodic information. Because of this, Essentially, the rank of a symbol denotes the finite
we encode prosodic strength over trees in the form of number of daughters that it can take. Elements of A
strength boundaries at each level of embedding. Each are considered as additional symbols of rank 0.

embedding level in our final tree representation will Example 1. Given Σ := a(0),b(0),c(2),d(2) , TΣ is
thus have a prosodic strength branch. The tree below an infinite set. The symbol a(0) means that a is
shows how the syntactic tree for Pearl and Garnet a terminal node without daughters, while c(2) is a
is enriched with prosodic information, according to non-terminal node with two daughters. For example,
our encoding choices. For readability, we use $ to consider the tree below.
mark prosodic boundaries in trees instead of |, since
d
the latter could be confused with a unary tree branch.
and c d

Pearl $ Garnet b b b a
As the tree below shows, the depth of the prosody This tree corresponds to the term d(c(b,b),d(b,a)),
branch at each embedding level corresponds to the contained in TΣ. y

As is standard in defining meta-rules, we introduce • σ(q1(x1,1,...,x1,n1 ),...,qk (xk,1,...,xk,nk )) → r
X as a countably infinite set of variable symbols in R
(X ∩ Σ = X) to be used as place-holders in the
definitions of transduction rules over trees. and there are trees Ti,j ∈ TΣ for every
i ∈ [k] and j ∈ [ni], s.t. ϕ =
4.4 Multi bottom-up tree transducers β[σ(q1(t1,1, ... , t1,n1 ), ... , qk (tk,1, ... , tk,nk ))], and
We assume that the starting point of the prosodic pro- ψ =β[r[xi,j ←ti,j |i∈[k],j ∈[ni]]]; or there is a rule
cess is a plain syntactic tree. Thus, in order to derive
• root(q(x1,...,xn))→qf (t) in R
the correct prosodic encoding, we need to propagate
information about levels of coordination embedding and there are trees ti ∈ T∆ for every i ∈ [n] s.t. ϕ =
and about associativity. We adopt a bottom-up ap- β[root(q(t1,...,tn))], and ψ = β[qf (t[t1,...,tn])]. The
proach, and characterize this process in terms of multi tree transformation computed by M is the relation:
bottom-up tree transducers (MBOT; Engelfriet et al.,
1980; Lilin, 1981; Maletti, 2011). Essentially, MBOTs τM ={(s,t)∈TΣ ×T∆ | root(s)⇒∗M qf (t)}
generalize traditional bottom–up tree transducers in
that they allow states to pass more than one output sub- Intuitively, tree transductions are performed by
tree up to subsequent transducer operations (Gildea, rewriting a local tree fragment as specified by one
2012). In other words, each MBOT rule potentially of the rules in R. For instance, a rule can replace
specifies several parts of the output tree. This is high- a subtree, or copy it to a different position. Rules
lighted by the fact that the transducer states (q ∈Q) can apply bottom–up from the leaves of the input tree,
have rank greater than one — i.e. they can have more and terminate in an accepting state qf .
than one daughter, where the additional daughters are
4.5 MBOT for recursive prosody
used to hold subtrees in memory. We follow Fülöp
et al. (2004) in presenting the semantics of MBOTs. We want a transducer which captures Wagner
(2010)’s bottom-up cyclic procedure. Consider now
Definition 1 (MBOT). A multi bottom-up tree trans-
the MBOT Mpros = (Q, Σ, ∆, root, qf , R), with
ducer (MBOT) is a tuple M = (Q,Σ,∆,root,qf ,R),
Q = {q∗,qc}, σc ∈ {and,or} ( Σ, σ ∈ Σ−{and,or},
where Q, Σ ∪ ∆, {root}, {qf } are pairwise disjoint,
and Σ = ∆. We use qc to indicate that Mpros has
such that:
verified that a branch contains a coordination (so σc),
• Q is a ranked alphabet with Q(0) =∅, called the with q∗ assigned to any other branch. As mentioned,
set of states we use $ to mark prosodic boundaries in the trees
• Σ and ∆ are ranked input and output alphabets, instead of |. The set of rules R is as follows.
respectively Rule 1 rewrites a terminal symbol σ as itself. The
• root is a unary symbol, called the root symbol MBOT for that branch transitions to q∗(σ).
• qf is a unary symbol called the final state
σ →q∗(σ) (1)
R is a finite set of rules of two forms:
Rule 2 applies to a subtree headed by
• σ(q1(x1,1,...,x1,n1 ),...,qk (xk,1,...,xk,nk )) σc ∈{and,or}, with only terminal symbols as daugh-
→q0(t1,...,tn0 ) ters: σc(q∗(x),q∗(y)). It inserts a prosodic boundary
$ between the daughters x,y. The boundary $ is also
where k ≥ 0, σ ∈ Σ(k), for every copied as a daughter of the mother qc, as record of
i ∈ [k] ∪ {0}, qi ∈ Q(ni) for some ni ≥ 1, for the fact that we have seen one coordination level.
every j ∈[n0],tj ∈T∆({xi,j |i∈[k],j ∈[ni]}).
σc(q∗(x),q∗(y))→qc(σc(x,$,y),$) (2)
• root(q(x1,...,xn))→qf (t)
We illustrate this in Figure 1 with a coordination
where n≥1,q ∈Q(n), and t∈T∆(Xn). y of two items, representing the mapping: [B and A]
The derivational relation induced by M is a binary re- → B | and A. We also assume that sentence-initial
lation ⇒M over the set TΣ∪∆∪Q∪{root,qf } defined as boundaries are vacuously interpreted.
follows. For every ϕ,ψ ∈TΣ∪∆∪Q∪{root,qf }, ϕ⇒M ψ We now consider cases where a coordination is
iff there is a tree β ∈ TΣ∪∆∪Q∪{root,qf }(X1) s.t. x1 the mother not just of terminal nodes, but of other
occurs exactly once in β and either there is a rule coordinated phrases. Rule 3 handles the case in which

(1) where the embedding of the coordination is strictly
and and
right branching, with the bulk of the work done via
B A q∗ q∗ rule 3. However, while these rules work well for
instances in which a coordination is always the right
B A
daughter of a node, they cannot deal with cases in
which the coordination branches left, or alternates
and qc
between the two. This is easily fixed by introducing
q∗ q∗
(2)
and $
variants to rule 3, which consider the position of
the coordination as marked by qc. Importantly, the
B A B $ A position of the copy of the boundary branch is not
altered, and it is always kept as the rightmost sibling
Figure 1: Example of the application of rules (1) and (2). of qc. What changes is the relative position of the w
The numerical label on the arrow indicates which rule and x subbranches in the output (see Figure 3).
was applied in order to rewrite the tree on the left as the
tree on the right. σc(qc(w,y),q∗(x))→qc(σc(w,$(y),x),$(y)) (5)

the right sibling of the mother was also headed by


and qc
a coordination (as encoded by σc having qc as one
of its daughters). Here, qc is the result of a previous qc C and $
rule application (e.g. rule 2) and it has two subtrees
itself: qc(w,y). Although we do not have access to (5)
and $ and $ C $
the internal labels of x, y, and w, by the format of the
previous rules we know that the right daughter of qc B $ A B $ A $
(i.e. y) is the one that contains the strength informa-
tion. Then, rule 3 has three things to do. It increments Figure 3: Left branching example as in rule (5).
y by one boundary: $(y). It places $(y) in between
the two subtrees x and w. And, it copies $(y) as the Following the same logic, rule 6 handles cases like
daughter of the new qc state in order to propagate [[A and B] and [C and D]], in which both daughters
$(y) to the next embedding level (see Figure 2). of a coordination are headed by a coordination
themselves (see Figure 4).
σc(q∗(x),qc(w,y))→qc(σc(x,$(y),w),$(y)) (3)
σc(qc(x,z),qc(y,w))→qc($(x),σc(z,$(x),w)) (6)
and qc
and qc

C qc and $
qc qc and $

(3) (6)
and $ C $ and $ and $ and $ and $ and $

A $ B $ C $ D $ A $ B $ C $ D $
B $ A $ B $ A
Figure 4: Example of the application of rule (6).
Figure 2: Example of the application of rule (3). For ease
of readability, we omit q∗ states over terminal nodes. Finally, we need to take care of the flat prosody
or associativity issue. The MBOT Mpros as outlined
Rule 4 applies once all coordinate phrases up to the so far increases the depth of the boundary branch at
root have been rewritten. It simply rewrites the root each level of embedding. Because we are adopting
as the final accepting state. It gets rid of the daughter binary branching trees, the current set of rules is
of qc that contains the strength markers, since there trivially unable to encode cases like [A and B and
is no need to propagate them any further. C]. We follow Wagner’s assumption that semantic
information on the syntactic tree guides the prosody
root(qc(x,y))→qf (x) (4)
cycles. Representationally, we mark this by using
As the examples so far should have clarified, specific labels on the internal nodes of the tree. We
Mpros as currently defined readily handles cases assume that the flat constituent interpretation is

Input Apply rule (2) Apply rule (3) Apply rule (3) Apply rule (4)
and and qc and

and D and D qc and $ D $ and

D and C qc and $ D $ and $ $ C $ and

C and and $ C $ and $ $ C $ and $ $ $ B $ A

B A B $ A $ B $ A $ $ B $ A

Figure 5: Walk-through of the transduction defined by Mpros . For ease of readability, and to highlight how qc propagates
embedding information about the coordination, q∗ and qf states are omitted.

obtained by marking internal nodes as non-cyclic, ture trees, by virtue of the bottom-up strategy being
introducing the alphabet symbol σn: intrinsically equipped with finite look-ahead. A switch
to phrase structure trees may prove useful for future
σn(q∗(x),qc(w,y)→qc(σc(x,y,w),y) (7) work on the interaction of prosody and movement.
Essentially, rule 7 tells us that when a coordination
node is marked as σn, Mpros just propagates the level 5 Generating recursive prosody
of prosodic strength that it currently has registered (in The previous section characterized recursive prosody
y), without increments (see Figure 6). This rule can be over trees with a non-linear, deterministic MBOT.
trivially adjusted to deal with branching differences, This is a nice result, as MBOTs are generally well-
as done for rules 3 and 5. understood in terms of their algorithmic properties.
andn qc Moreover, this result is in line with past work explor-
ing the connections of MBOTs, tree languages, and
C qc andn $ the complexity of movement and copying operations
in syntax (Kobele, 2006; Kobele et al., 2007, a.o.).
(7) We can now ask what the complexity of this
and $ C $ and
approach is. MBOTs generate output string languages
B $ A B $ A that are potentially parallel multiple context-free
languages (PMCFL; Seki et al., 1991, 1993; Gildea,
Figure 6: Application of rule (7) for flat prosody. 2012; Maletti, 2014; Fülöp et al., 2005). Since this
class of string languages is more powerful than
A full, step by step Mpros transduction is shown context-free, the corresponding tree language is not
in Figure 5. Taken together, the recursive prosodic a regular tree language (Gécseg and Steinby, 1997).
patterns are fully characterized by Mpros when it is This is not surprising, as MBOTs can be understood
adjusted with a set of rules to deal with alternating as an extension of synchronous tree substitution
branching and flat associativity. The tree transducer grammars (Maletti, 2014).
generates tree representations where each level of Notably, independently of our specific MBOT solu-
embedding is marked by a branch, which carries tion, prosody as defined in this paper generates at least
information about the prosodic strength for that level. some output string languages that lack the constant
As outlined in Section 4.2, this final representation growth property — hence, that are PMCFLs. Consider
may then be fed to a modified string yield function as input a regular tree language of left-branching
for dependency tree languages. coordinationate phrases, where each level is simply of
Dependency trees allowed us to present a transducer the form and(X, Mary). The n−th level of embedding
with rules that are relatively easy to read. But, as men- from the top extends the string yield by n+2 symbols.
tioned before, this choice does not affect our general This immediately implies no constant growth, and
result. Under the standard assumption that the distance thus no semi-linearity (Weir, 1988; Joshi et al., 1990).
between the head of a phrase and its maximal projec- Interestingly though, the prosody MBOT devel-
tion is bounded, Mpros can be extended to phrase struc- oped here is fairly limited in its expressivity as the

transducer states themselves do almost no work, because of the following fundamental properties:
and most of the transduction rules in Mpros rely
on the ability to store the prosody strength branch. • The syntax has unbounded recursion.
Hence, the specific MBOT in this paper might turn • The prosody has unbounded recursion.
out to belong to a relatively weak subclass of tree • All recursive prosodic constituents have the
transductions with copying, perhaps a variant of input same prosodic label (= a prosodic phrase).
strictly local tree transductions (cf. Ikawa et al., 2020; • The recursive prosodic constituents have
Ji and Heinz, 2020), or a transducer variant of sensing acoustic cues marking different strengths.
tree automata (cf. Fülöp et al., 2004; Kobele et al., • There is an algorithm which explicitly assigns
2007; Maletti, 2011, 2014; Graf and De Santo, 2019). the recursive prosodic constituents to these
Since all of those have recently been used in the different strengths.
formal study of syntax, they are natural candidates
for a computational model of prosody, and their sensi- In this paper, we focused on explicitly generating
tivity to minor representational difference might also the prosodic strengths at each recursive prosodic
illuminate what aspects of syntactic representation levels, putting aside the mathematically simpler task
affect the complexity of prosodic processes. of converting a recursive syntactic tree into a recursive
Finally, one might worry that the mathematical prosodic tree (Elfner, 2015; Bennett and Elfner,
complexity is a confound of the representation we use, 2019) — which is a process essentially analogous to
rather than a genuine property of the phenomenon. a relabeling of the nonterminal nodes of the syntactic
However, a representation of prosodic strength is tree, without care for the prosodic strength. The
necessary and cannot be reduced further for two mapping studied in this paper has been conjectured in
reasons. First, strength cannot be reduced to syntactic the past to be computationally more expressive than
boundaries because a single prosodic edge ( may regular languages or functions (Yu and Stabler, 2017).
correspond to |k for any k ≥1. As discussed in depth Here, we formally verified that hypothesis.
by Wagner (2005, 2010), one cannot simply convert An open question then is to find other empirical
a syntactic tree into a prosodic tree by replacing the phenomena which also have the above properties.
labels of nonterminal nodes. Second, strength also One potential area of investigation is the assignment
cannot be reduced to different categories of prosodic of relative prominence relations in English compound
constituents — e.g. assuming that | is a prosodic prosody (Chomsky and Halle, 1968). However, En-
phrase while || is an intonational phrase. As argued glish compound prosody is a highly controversial area.
in depth in (Wagner, 2005, 2010), these different It is unclear what is the current consensus on an exact
constituent types do not map neatly to prosodic algorithm for these compounds, especially one that
strength. Instead, these boundaries all encode relative utilizes recursion and is not based on impressionistic
strengths of prosodic phrase boundaries. judgments (Liberman and Prince, 1977; Gussenhoven,
2011). In this sense, the mathematical results in this
6 Conclusion paper highlight the importance of representational
commitments and of explicit assumptions in the study
This paper formalizes the computation of unbounded of prosodic expressivity. Our paper might then help
recursive prosodic structures in coordination. Their identify crucial issues in future theoretical and em-
computation cannot be done by string-based finite- pirical investigations of the syntax-prosody interface.
state transducers. They instead need more expressive
grammars. To our knowledge, this paper is one of Acknowledgements
the few (if only) formal results on how prosodic We are grateful to our anonymous reviewers, Jon
phonology at the sentence-level is computationally Rawski, and Kristine Yu. Thomas Graf is supported
more expressive than phonology at the word-level. by the National Science Foundation under Grant No.
As discussed above, recent work in prosodic BCS-1845344.
phonology relies on the assumption that prosodic
structure can be recursive. However, because such
work usually uses bounded-recursion, such phenom- References
ena are computationally regular. Departing from this Ryan Bennett and Emily Elfner. 2019. The syntax–
stance, this paper focused on the prosodic phenomena prosody interface. Annual Review of Linguistics,
reported in Wagner (2005) as a core case study, 5:151–171.

Mikołaj Bojańczyk. 2018. Polyregular functions. arXiv John S Coleman. 1991. Prosodic structure, parameter-
preprint arXiv:1810.08760. setting and ID/LP grammar. In Steven Bird, editor,
Declarative perspectives on phonology, pages 65–78.
Mikołaj Bojańczyk, Sandra Kiefer, and Nathan Lhote. Centre for Cognitive Science, University of Edinburgh.
2019. String-to-string interpretations with polynomial-
size output. In 46th International Colloquium on Marie-Catherine De Marneffe and Joakim Nivre. 2019.
Automata, Languages, and Programming, ICALP Dependency grammar. Annual Review of Linguistics,
2019, July 9-12, Patras, Greece. (LIPIcs), volume 5:197–218.
132, page 106:1–106:14, Schloss Dagstuhl - Leibniz-
Zentrum fuer Informatik. Ralph Debusmann and Marco Kuhlmann. 2010. Depen-
Marisa Ferrara Boston, John T. Hale, and Marco dency grammar: Classification and exploration. In
Kuhlmann. 2009. Dependency structures derived Resource-adaptive cognitive processes, pages 365–388.
from minimalist grammars. In The Mathematics of Springer.
Language, pages 1–12. Springer.
Arthur Dirksen. 1993. Phrase structure phonology. In
Peter Chew. 2003. A computational phonology of Russian. T. Mark Ellison and James Scobbie, editors, Compu-
Universal-Publishers, Parkland, FL. tational Phonology, page 81–96. Centre for Cognitive
Science, University of Edinburgh.
Noam Chomsky. 1956. Three models for the description
of language. IRE Transactions on information theory, Hossep Dolatian. 2020. Computational locality of cyclic
2(3):113–124. phonology in Armenian. Ph.D. thesis, Stony Brook
University.
Noam Chomsky and Morris Halle. 1968. The sound
pattern of English. MIT Press, Cambridge, MA. Hossep Dolatian, Nate Koser, Kristina Strother-Garcia,
Noam Chomsky and Marcel P Schützenberger. 1959. and Jonathan Rawski. 2021. Computational restric-
The algebraic theory of context-free languages. In tions on iterative prosodic processes. In Proceedings
Studies in Logic and the Foundations of Mathematics, of the 2019 Annual Meeting on Phonology. Linguistic
volume 26, pages 118–161. Elsevier. Society of America.

Kenneth Ward Church. 1983. Phrase-structure parsing: A Emily Elfner. 2015. Recursion in prosodic phrasing:
method for taking advantage of allophonic constraints. Evidence from Connemara Irish. Natural Language &
Ph.D. thesis, Massachusetts Institute of Technology. Linguistic Theory, 33(4):1169–1208.
John Coleman. 1992. The phonetic interpretation of Joost Engelfriet. 2015. Two-way pebble transducers
headed phonological structures containing overlapping for partial functions and their composition. Acta
constituents. Phonology, 9(1):1–44. Informatica, 52(7-8):559–571.
John Coleman. 1993. English word-stress in unification-
based grammar. In T. Mark Ellison and James Scobbie, Joost Engelfriet and Hendrik Jan Hoogeboom. 2001.
editors, Computational Phonology, page 97–106. MSO definable string transductions and two-way finite-
Centre for Cognitive Science, University of Edinburgh. state transducers. Transactions of the Association for
Computational Linguistics, 2(2):216–254.
John Coleman. 1995. Declarative lexical phonology.
In Jacques Durand and Francsis Katamba, editors, Joost Engelfriet and Sebastian Maneth. 2002. Two-way
Frontiers of phonology: Atoms, structures, derivations, finite state transducers with nested pebbles. In Inter-
pages 333–383. Longman, London. national Symposium on Mathematical Foundations of
Computer Science, pages 234–244. Springer.
John Coleman. 1996. Declarative syllabification in
Tashlhit Berber. In Jacques Durand and Bernard Laks, Joost Engelfriet, Grzegorz Rozenberg, and Giora Slutzki.
editors, Current trends in phonology: Models and 1980. Tree transducers, l systems, and two-way
methods, volume 1, pages 175–216. European Studies machines. Journal of Computer and System Sciences,
Research Institute, University of Salford, Salford. 20(2):150–202.
John Coleman. 1998. Phonological representations: Zoltán Fülöp, Armin Kühnemann, and Heiko Vogler.
Their names, forms and powers. Cambridge University 2004. A bottom-up characterization of deterministic
Press, Cambridge. top-down tree transducers with regular look-ahead.
John Coleman. 2000. Candidate selection. The Linguistic Information Processing Letters, 91(2):57–67.
Review, 17(2-4):167–180.
Zoltán Fülöp, Armin Kühnemann, and Heiko Vogler. 2005.
John Coleman and Janet Pierrehumbert. 1997. Stochastic Linear deterministic multi bottom-up tree transducers.
phonological grammars and acceptability. In Third Theoretical computer science, 347(1-2):276–287.
meeting of the ACL special interest group in com-
putational phonology: Proceedings of the workshop, Ferenc Gécseg and Magnus Steinby. 1997. Tree lan-
pages 49–56, East Stroudsburg, PA. Association for guages. In Handbook of formal languages, pages 1–68.
computational linguistics. Springer.

Dafydd Gibbon. 2001. Finite state prosodic analysis of C Douglas Johnson. 1972. Formal aspects of phonologi-
African corpus resources. In EUROSPEECH 2001 cal description. Mouton, The Hague.
Scandinavia, 7th European Conference on Speech
Communication and Technology, 2nd INTERSPEECH Aravind K Joshi, K Vijay Shanker, and David Weir. 1990.
Event, Aalborg, Denmark, September 3-7, 2001, pages The convergence of mildly context-sensitive grammar
83–86. ISCA. formalisms. Technical Reports (CIS), page 539.
Daniel Gildea. 2012. On the string translations produced Ronald M. Kaplan and Martin Kay. 1994. Regular
by multi bottom–up tree transducers. Computational models of phonological rule systems. Computational
Linguistics, 38(3):673–693. linguistics, 20(3):331–378.
Thomas Graf and Aniello De Santo. 2019. Sensing tree
George Anton Kiraz and Bernd Möbius. 1998. Mul-
automata as a model of syntactic dependencies. In
tilingual syllabification using weighted finite-state
Proceedings of the 16th Meeting on the Mathematics of
transducers. In The third ESCA/COCOSDA workshop
Language, pages 12–26, Toronto, Canada. Association
(ETRW) on speech synthesis.
for Computational Linguistics.
Carlos Gussenhoven. 2011. Sentential prominence in Ewan Klein. 1991. Phonological data types. In Steven
English. In Marc van Oostendorp, Colin Ewen, Eliz- Bird, editor, Declarative perspectives on phonol-
abeth Hume, and Keren Rice, editors, The Blackwell ogy, pages 127–138. Centre for Cognitive Science,
companion to phonology, volume 5, pages 1–29. University of Edinburgh.
Wiley-Blackwell, Malden, MA.
Gregory M. Kobele, Christian Retoré, and Sylvain Salvati.
Yiding Hao. 2020. Metrical grids and generalized tier pro- 2007. An automata-theoretic approach to minimalism.
jection. In Proceedings of the Society for Computation Model theoretic syntax at 10, pages 71–80.
in Linguistics, volume 3.
Gregory Michael Kobele. 2006. Generating Copies: An
Jeffrey Heinz. 2018. The computational nature of investigation into structural identity in language and
phonological generalizations. In Larry Hyman and grammar. Ph.D. thesis, University of California, Los
Frans Plank, editors, Phonological Typology, Phonetics Angeles.
and Phonology, chapter 5, pages 126–195. Mouton de
Gruyter, Berlin. Nate Koser. in prep. The computational nature of stress
assignment. Ph.D. thesis, Rutgers University.
Mans Hulden. 2006. Finite-state syllabification. In Anssi
Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki,
Marco Kuhlmann. 2013. Mildly non-projective de-
editors, Finite-State Methods and Natural Language
pendency grammar. Computational Linguistics,
Processing. FSMNLP 2005. Lecture Notes in Computer
39(2):355–387.
Science, volume 4002. Springer, Berlin/Heidelberg.
Harry Van der Hulst. 2010. A note on recursion in D. Robert Ladd. 1986. Intonational phrasing: The case for
phonology recursion. In Harry Van der Hulst, editor, recursive prosodic structure. Phonology, 3:311–340.
Recursion and human language, pages 301–342.
Mouton de Gruyter, Berlin & New York. D. Robert Ladd. 2008. Intonational phonology. Cam-
bridge University Press, Cambridge.
William J Idsardi. 2009. Calculating metrical structure.
In Eric Raimy and Charles E. Cairns, editors, Contem- D. Terence Langendoen. 1975. Finite-state parsing
porary views on architecture and representations in of phrase-structure languages and the status of
phonology, number 48 in Current Studies in Linguistics, readjustment rules in grammar. Linguistic Inquiry,
pages 191–211. MIT Press, Cambridge, MA. 6(4):533–554.
Shiori Ikawa, Akane Ohtaka, and Adam Jardine. 2020. D. Terence Langendoen. 1987. On the phrasing of
Quantifier-free tree transductions. Proceedings of the coordinate compound structures. In Brian Joseph and
Society for Computation in Linguistics, 3(1):455–458. Arnold Zwicky, editors, A festschrift for Ilse Lehiste,
page 186–196. Ohio State University, Ohio.
Junko Ito and Armin Mester. 2012. Recursive prosodic
phrasing in Japanese. In Toni Borowsky, Shigeto
D. Terence Langendoen. 1998. Limitations on embedding
Kawahara, Shinya Takahito, and Mariko Sugahara,
in coordinate structures. Journal of Psycholinguistic
editors, Prosody matters: Essays in honor of Elisabeth
Research, 27(2):235–259.
Selkirk, pages 280–303. Equinox Publishing, London.
Junko Ito and Armin Mester. 2013. Prosodic subcate- Nathan Lhote. 2020. Pebble minimization of polyreg-
gories in Japanese. Lingua, 124:20–40. ular functions. In Proceedings of the 35th Annual
ACM/IEEE Symposium on Logic in Computer Science,
Jing Ji and Jeffrey Heinz. 2020. Input strictly local pages 703–712.
tree transducers. In International Conference on
Language and Automata Theory and Applications, Mark Liberman and Alan Prince. 1977. On stress and
pages 369–381. Springer. linguistic rhythm. Linguistic inquiry, 8(2):249–336.

Eric Lilin. 1981. Propriétés de clôture d’une extension de Kristina Strother-Garcia. 2018. Imdlawn Tashlhiyt Berber
transducteurs d’arbres déterministes. In Colloquium syllabification is quantifier-free. In Proceedings of
on Trees in Algebra and Programming, pages 280–289. the Society for Computation in Linguistics, volume 1,
Springer. pages 145–153.
Andreas Maletti. 2011. How to train your multi bottom- Kristina Strother-Garcia. 2019. Using model theory
up tree transducer. In Proceedings of the 49th Annual in phonology: a novel characterization of syllable
Meeting of the Association for Computational Linguis- structure and syllabification. Ph.D. thesis, University
tics: Human Language Technologies, pages 825–834. of Delaware.
Andreas Maletti. 2014. The power of regularity- Lucien Tesnière. 1965. Eléments de syntaxe structurale,
preserving multi bottom-up tree transducers. In 1959. Paris, Klincksieck.
International Conference on Implementation and
Application of Automata, pages 278–289. Springer. Johan t’Hart and Antonie Cohen. 1973. Intonation by rule:
a perceptual quest. Journal of Phonetics, 1(4):309–327.
Marina Nespor and Irene Vogel. 1986. Prosodic
phonology. Foris, Dordrecht. Johan t’Hart and René Collier. 1975. Integrating different
levels of intonation analysis. Journal of Phonetics,
Joakim Nivre. 2005. Dependency grammar and depen- 3(4):235–255.
dency parsing. MSI report, 5133.1959:1–32.
Johan t’Hart, René Collier, and Antonie Cohen. 2006.
Marc van Oostendorp. 1993. Formal properties of A perceptual study of intonation: An experimental-
metrical structure. In Sixth Conference of the Euro- phonetic approach to speech melody. Cambridge
pean Chapter of the Association for Computational University Press.
Linguistics, pages 322–331, Utrecht. ACL.
Michael Wagner. 2005. Prosody and recursion. Ph.D.
Janet Breckenridge Pierrehumbert. 1980. The phonology thesis, Massachusetts Institute of Technology.
and phonetics of English intonation. Ph.D. thesis,
Massachusetts Institute of Technology. Michael Wagner. 2010. Prosody and recursion in coor-
dinate structures and beyond. Natural Language &
Peter. A. Reich. 1969. The finiteness of natural language. Linguistic Theory, 28(1):183–237.
Language, 45:831–843.
Markus Walther. 1993. Declarative syllabification with
Walter J Savitch. 1993. Why it might pay to assume that applications to German. In T. Mark Ellison and James
languages are infinite. Annals of Mathematics and Scobbie, editors, Computational Phonology, pages
Artificial Intelligence, 8(1-2):17–25. 55–79. Centre for Cognitive Science, University of
Edinburgh.
James M. Scobbie, John S. Coleman, and Steven Bird.
1996. Key aspects of declarative phonology. In Jacques Markus Walther. 1995. A strictly lexicalized approach
Durand and Bernard Laks, editors, Current Trends in to phonology. In Proceedings of DGfS/CL’95, page
Phonology: Models and Methods, volume 2. European 108–113, Düsseldorf. Deutsche Gesellschaft für
Studies Research Institute, Salford, Manchester. Sprachwissenschaft, Sektion Computerlinguistik.
Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and David Jeremy Weir. 1988. Characterizing mildly context-
Tadao Kasami. 1991. On multiple context-free gram- sensitive grammar formalisms. Ph.D. thesis, University
mars. Theoretical Computer Science, 88(2):191–229. of Pennsylvania.
Hiroyuki Seki, Ryuichi Nakanishi, Yuichi Kaji, Sachiko Ngee Thai Yap. 2006. Modeling syllable theory with finite-
Ando, and Tadao Kasami. 1993. Parallel multiple state transducers. Ph.D. thesis, University of Delaware.
context-free grammars, finite-state translation systems,
and polynomial-time recognizable subclasses of Kristine M. Yu. 2017. Advantages of constituency:
lexical-functional grammars. In Proceedings of the Computational perspectives on Samoan word prosody.
31st annual meeting on Association for Computa- In International Conference on Formal Grammar 2017,
tional Linguistics, pages 130–139. Association for page 105–124, Berlin. Spring.
Computational Linguistics.
Kristine M. Yu. 2019. Parsing with minimalist grammars
Elisabeth Selkirk. 1986. On derived domains in sentence and prosodic trees. In Robert C. Berwick and Edward P.
phonology. Phonology Yearbook, 3(1):371–405. Stabler, editors, Minimalist Parsing, pages 69–109.
Oxford University Press, London.
Elisabeth Selkirk. 2011. The syntax-phonology interface.
In John Goldsmith, Jason Riggle, and Alan C. L. Yu, Kristine M. Yu and Edward P. Stabler. 2017. (In)
editors, The Handbook of Phonological Theory, 2 variability in the Samoan syntax/prosody interface
edition, pages 435–483. Blackwell, Oxford. and consequences for syntactic parsing. Laboratory
Nazila Shafiei and Thomas Graf. 2020. The subregular Phonology: Journal of the Association for Laboratory
complexity of syntactic islands. In Proceedings of the Phonology, 8(1):1–44.
Society for Computation in Linguistics, volume 3.

The Match-Extend Serialization Algorithm in Multiprecedence

Maxime Papillon
Classic, Modern Languages and Linguistics, Concordia University
[email protected]

Abstract

Raimy (1999; 2000a; 2000b) proposed a graphical formalism for modeling reduplication, originally mostly focused on phonological overapplication in a derivational framework. This framework is now known as Precedence-based phonology or Multiprecedence phonology. Raimy's idea is that the segments at the input to the phonology are not totally ordered by precedence. This paper tackles a challenge that arose with Raimy's work, the development of a deterministic serialization algorithm as part of the derivation of surface forms. The Match-Extend algorithm introduced here requires fewer assumptions and sticks tighter to the attested typology. The algorithm also contains no parameter or constraint specific to individual graphs or topologies, unlike previous proposals. Match-Extend requires nothing except knowing the last added set of links.

1 Introduction

This paper provides a general serialization algorithm for all morphological structures in all languages. The challenge of converting non-linear structures of linguistic representation into a format ready to be handled in production is one that matters to both morphosyntax and morphophonology. Reduplication is a phenomenon at the frontier of morphology and phonology that has drawn a lot of attention in the last few decades. Reduplication's non-concatenative nature and the fact that it manifests long-distance dependencies among segments set it apart from the 'standard' word-formation that most theories are designed to handle. These properties have often pushed theoreticians to propose expansive systems such as copying procedures on top of traditional linear segmental phonology to make the system powerful enough to handle these dependencies. The multiprecedence model expanded upon here builds on properties that are already implicit in all approaches to phonological representation, and actually gets rid of some standard assumptions. It accounts for attested patterns and predicts an unattested reduplication pattern to be impossible.

2 Multiprecedence

The theory of Multiprecedence seeks to account for reduplication representationally via loops in a graph. Eschewing correspondence statements and copying procedures, Multiprecedence treats reduplication as fundamentally a structural property created by the addition of an affix, whose serialization has the effect of pronouncing all or part of the form twice.

Consider a string like Fig. 1a, the standard way of representing the segments that constitute a phonological representation. An alternative way to encode that same information is in the form of a set of immediate precedence statements like Fig. 1b. For legibility the set of pairs in Fig. 1b can be represented in the form of a graph. Adding the convention of using # and % for the START and END symbols respectively we get the picture in Fig. 1c. In general I will refer to this as the graph representation.

a. kæt
b. { ⟨START, k⟩, ⟨k, æ⟩, ⟨æ, t⟩, ⟨t, END⟩ }
c. # → k → æ → t → %

Figure 1: Phonological representations of the word cat as a string (a), ordered pairs (b), and a graph (c).

23
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 23–31
August 5, 2021. ©2021 Association for Computational Linguistics
a.
structure when assuming strings. This assumption
z
is what Multiprecedence abandons. Multiprece-
dence proposes that asymmetry and irreflexivity # k æ t %
are not relevant to phonology. A segment can pre-
cede or follow multiple segment, two segments =⇒ # k æ t z %
can transitively precede each other, and a segment b.
can precede itself. A valid multiprecedence graph m
is not restricted by topology, a term a will use for
the pattern of the graph independent from the con- # h N u P %
tent of the nodes.
=⇒ # h m N u P %
Using this view of precedence, affixation is the
process of combining the graph representations of c.
different morphemes. A word is a graph consist-
ing of the edges and vertices (precedence relations # k @ r a %
and segments) of one or more morphemes. An ex- =⇒ # k @ r a k @ r a %
ample of the suffixation of the English plural is
shown in Fig. 2a, and the infixation of the Atayal
Figure 2: Affixation in Multiprecedence. Suffixa-
animate focus morpheme is given in Fig. 2b. Full
tion (a), infixation (b), reduplication (c).
root reduplication, which expresses the plural of
nouns in Indonesian is shown in Fig. 2c. There are
two things to notice in Fig. 2c. First, that a prece- tor focus and attaching between the first and the
dence arrow is added, without any segmental ma- second segment. For details on the mechanics
terial: the reduplicative morpheme consists of just of attachment see Raimy (2000a, §3.2), Samuels
that arrow. Second, although Fig. 2a and Fig. 2b (2009, p.177-87), and Papillon (2020, §2.2). It
each offer two paths from the START to the END suffices here to say that at vocabulary insertion an
of the graph, Fig. 2c contains a loop that offers an affix can target certain segments of the stem for
infinite number of paths from START to End. The attachment. Raimy (2000a) shows how this rep-
representation itself does not enforce how many resentation can generate the reduplicative patterns
times the arrow added by the plural morpheme from numerous languages as well as account for
should be traversed. All three of these structures such phenomena as over- and under-application of
have to be handled by a serialization algorithm in phonological processes in reduplication.
order to be actualized by the phonetic motor sys-
tem, which selects a path through the graph to be a. [last segment] → z → %
sent to the articulators. A correct serialization al- b. # → k → æ→ t → %
gorithm must be able to select the correct of the c.
two paths in Fig. 2a and Fig. 2b and the path go- z
ing through the back loop only once in Fig. 2c. # k æ t %
I will assume here that these forms are con-
structed by the attachment of an affix morpheme Figure 3: Affix (a) and root (b) combined in (c).
onto a stem as in Fig. 3. English speakers have a
graph as a lexical item for the plural as in Fig. 3a Given the assumption that a non-linear graph
and a lexical item for CAT as in Fig. 3b, which cannot be pronounced, phonology requires an al-
combine as in Fig. 3c. The moniker “last seg- gorithm capable of converting graph representa-
ment” is an informal way to refer to that part of tions into strings like in Fig. 2. Two main fam-
the affix that is responsible for attaching it to the ilies of algorithms have been proposed. Raimy
stem in the right location. This piece of the plural (1999) proposed a stack-based algorithm which
affix will attach onto the last segment, the one pre- was expanded upon by Idsardi and Shorey (2007)
ceding the end of the word %, of what it combined and McClory and Raimy (2007). This algo-
with, and onto %, yielding Fig 3c. Similarly the rithm traverses the graph from # to % by access-
Atayal form in Fig. 2 is built from a root #hNuP% ing the stack. This idea suffers the problem of
‘soak’ and an infix -m- marking the animate ac- requiring parameters on individual arcs. Every

24
morphologically-added precedence link must be 1 .The precedence links of the stem begin in a
parametrized as to its priority determining whether set StemSet.
it goes to the top or the bottom of the stack. This is 2. The morphologically added links begin in
necessary in this system because when a given arc a set WorkSpace.
is traversed is not predictable on the basis of when 3. Whenever two strings in the WorkSpace
it is encountered in a traversal. This parametriza- match such that the end of one string is iden-
tion radically explodes the range of patterns pre- tical to the end of the other, the operation
dicted to be possible much beyond what is at- Match collapses the two into one string such
tested. Fitzpatrick and Nevins (2002; 2004) pro- that the shared part appears once. E.g. abcd
posed a different constraint-base algorithm which and cdef to abcdef. A Match along multi-
globally compares paths through the graph for ple characters is done first.
completeness and economy but suffers the prob- 4. When there is no match within the
lem of requiring ad hoc constraints targeting indi- WorkSpace, the operation Extend simultane-
vidual types of graphs, lacking generality. In the ously lengthens all strings in the WorkSpace
rest of this article I will present a new algorithm to the right and left using matching prece-
which lacks any parameter and whose two opera- dence links of the stem. StemSet remains un-
tions are generic and not geared towards any spe- changed.
cific configuration. 5.Steps 3 and 4 are repeated until # and %
have been reached by Extend and there is a
3 The Match-Extend algorithm single string in the WorkSpace.
This section will present the Match-Extend al- Algorithm 1: The Match-Extend Algorithm
gorithm and follow up with a demonstration of (informal version).
its operation on various attested Multiprecedence
topologies. StemSet { #k , k@ , @r , ra , a%}
The input to the algorithm is the set of pairs WorkSpace ak
of segments corresponding to the pairs of seg- Extend rak@
ments in immediate precedence relation without Extend @rak@r
the affix, e.g. {#k,kæ,æt,t%} for the English stem Extend k@rak@ra
kæt, and the set of pair of segments correspond- Extend #k@rak@ra%
ing to the precedence links added by the affix, e.g.
{tz,z%} when the plural is added. Figure 4: Match-Extend derivation of k@ra-k@ra.
Intuitively the algorithm starts from the mor-
phologically added links and extends outwards by reduplication in Tohono O’odham involving redu-
following the precedence links in the StemSet, the plicated pattern such as babaD to ba-b-baD, and
set of all precedence links in the stem to which the čipkan to či-čpkan requiring graphs as in Raimy
morpheme is being added. If there is more than (2000a, p.114). Although there are multiple plau-
one morphologically added link, they all extend in sible paths through this graph, only one is attested
parallel and collapse together if one string ends in and this path requires traversing the graph by fol-
one or more segment and the other begins with the lowing the backlink before the front-link, even
same segment or segments. A working version of though the front-link would be encountered first
this algorithm coded in Python will be included as in a traversal.
supplementary material.

3.1 Match-Extend in action # č i p k a n %


Consider first total reduplication as in Fig. 2c
above. Fig. 3.1 shows the full derivation of k@ra-
k@ra with total reduplication. As there is only one Figure 5: Tohono O’odham či-čpkan.
morphologically-added link, no Match step will
happen. The match-Extend algorithm will correctly de-
Let us turn to more complex graphs discussed rive the correct form as shown in . Right away
in the literature. Raimy discusses a process of CV the strings ič and čp match, as one starts with

25
the node c and the other ends with the same node. StemSet {#s, su, ut, t%}
The two are collapsed as ičp and then keep ex- #i
tending. WorkSpace it
ts
StemSet {#č, či, ip, pk, ka, an, n%}
#it
čp Match
WorkSpace ts
ič Match #its
Match ičp Extend #itsu
Extend čičpk Extend #itsut
Extend #čičpka Extend #itsut%
Extend #čičpkan
Extend #čičpkan% Figure 8: Derivation of Nancowry Pitsut.

Figure 6: Match-extend derivation of Tohono


O’odham čičpkan. (2002; 2004) observed that in cases where mul-
tiple reduplication processes of different size hap-
A similarly complex graph is needed in Nan- pen to the same form, with multiple morpholog-
cowry. Raimy (2000a, p.81) discusses examples ically added arrows forking away from the same
like Nancowry reduplication of the last consonant segment, these graphs are seemingly universally
toward the beginning of the word, e.g. sut ‘to rub’ serialized such that they follow the shorter arc
to Pit-sut which requires a graph as in Fig. 7. first. They discuss Lushotseed forms with both
However here the opposite order of traversal must distributive and Out-Of-Control (OOC) reduplica-
be followed, not skipping the first forward link. I tion. They argue on the basis of the fact that in ei-
assume here, like Raimy, that the glottal stop is ther scope order the form is serialized in the same
epenthetic and added after serialization. Here, not way, suggesting that they are serialized simultane-
taking the first link would also result in the wrong ously. This implies forms like gw ad, ‘talk’, surfac-
output [*sutsut]. So this form requires the first ing in the distributive OOC or the OOC distribu-
morphologically-added link to be taken to produce tive as gw ad-ad-gw ad, requiring a graphs like Fig.
the correct form. 9.

i
# gw a d %
# s u t %
Figure 9: Lushotseed gw ad-ad-gw ad.
Figure 7: Nancowry Pit-sut.
Fitzpatrick & Nevins (2002; 2004) proposed
Again Match-Extend will serialize Fig. 7 with- an ad hoc constraint to handle this type of sce-
out any further parameter as in Fig. 8. The three nario, the constraint S HORTEST, enforcing seri-
strings #i, it, and ts can match right away into alizations that follow the shorter arrow first. But
a single string #its which will keep extending. Match-Extend derives the attested pattern without
As these examples illustrate, Match-Extend any further assumptions. Consider the derivation
does not need to be specified with look-ahead, of the Lushotseed form in Fig. 9. After one Ex-
global considerations, or graph-by-graph specifi- tend step, the two strings adgw a and adad match
cations of serialization to derive the attested seri- along the nodes ad. You might notice that the two
alization of graphs like Fig. 5 or Fig. 7. The se- strings also match in the other order with the node
rialization starts in parallel from two added links a, so we must assume the reasonable principle that
that extend until they reach each other in the mid- in case of multiple matches, the best match, mean-
dle, and this will work regardless of the order in ing the match along more nodes, is chosen. From
which ‘backward’ and ‘forward’ arcs are located. that point on adadgw a extends into the desired
They will meet in one direction and serialize in form.
this order. It is somewhat intuitive to see why this works:
Another interesting topology is found in the because Match-Extend applies one step of Extend
analysis of Lushotseed. Fitzpatrick & Nevins at a time and must Match and collapse separate

26
StemSet { #gw , gw a, ad, d%} The graph in Fig. 12 is simply the transpose
dgw graph of a graph where S HORTEST would apply
WorkSpace like Fig. 9, but it does not actually fit the pat-
da
adgw a tern of S HORTEST as its two ‘backward’ arrows
Extend do not start from the same node. In fact if any-
adad
Match adadgw a thing S HORTEST would predict the wrong surface
Extend g adadgw ad
w form, as *si-sil-sil would be the form if the shorter
Extend #gw adadgw ad% path were taken first. In Match-Extend and Clos-
est Attachment Fig. 11 the prediction is clear: it is
Figure 10: Derivation of Lushotseed gw adadgw ad. predicted to serialize as sil-si-sil because the path
from l→s to i→s is shorter than the path from i→s
to l→s, thus deriving the correct string.
strings from the WorkSpace immediately when a
Fitzpatrick and Nevins (2002) report some
Match is found, two arcs added by the morphology
forms with graphs like Fig. 12 which must be
will necessarily match in the direction in which
linearized in ways that would contradict Match-
they are the closest. The end of the d→a arc is
Extend, such as saxw to sa-saxw -saxw in Lusot-
closer to the beginning of the d→gw one than vice-
sheed Diminutive+Distributive forms. But con-
versa, and hence the two will join in this direction
trary to the Distributive+OOC forms discussed
and therefore surface in this order. This can be
earlier there is no independent evidence here for
generalized as Fig. 11.
the two reduplications being serialized together.
• If the graph contains two morphologically I therefore assume that those instances consist
added links α → β and γ → δ, and of two separate cycles, serialized one at a time:
saxw to saxw -saxw to sa-saxw -saxw . Match-Extend
I There is a unique path X from β to γ not
therefore relies on cyclicity, with the graph built
going through α → β or γ → δ, and
up through affixation and serialized multiple times
I There is a unique path Y from δ to α not over the course of the derivation.
going through α → β or γ → δ,
3.2 Non-Edge Fixed Segmentism
• Then the Match-Extend algorithm will output
a string containing: Fixed segmentism refers to cases of reduplication
where a segment of one copy is overridden by one
I ...αβ...γδ... if X is shorter than Y or more fixed segments. A well known English
I ...γδ...αβ... if Y is shorter than X example is schm-reduplication like table to table-
schmable where schm-replaces the initial onset. I
Figure 11: Closest Attachment in Match-Extend. will call Non-Edge Fixed Segmentism (NEFS) the
special case of fixed segmentism where the fixed
Note that this is not a new assumption: this is segment is not at the edge of one of the copies.
a theorem of the model derivable from the way These are the examples where the graph needed is
Match and Extend interact with multiple morpho- like Fig. 13 or Fig. 14.
logically added arcs. This can allow us to work
out some serializations without having to do the
# a b c d e %
whole derivation.
Consider for instance the Nlaka’pamuctsin dis- x
tributive+diminutive double reduplication, e.g. sil,
‘calico’, to sil-si-sil, (Broselow, 1983). This pat- Figure 13: NEFS ‘early’ in the copy.
tern requires the Multiprecedence graph to look as
in Fig. 12.

# a b c d e %
# s i l %
x

Figure 12: Nlaka’pamuctsin sil-si-sil. Figure 14: NEFS ‘late’ in the copy.

27
Closest Attachment in Match-Extend predicts parametrized as to whether they are added on top
that if a fixed-segment is added towards the be- or at the bottom of the stack upon affixation, thus
ginning of the form, it should surface in the sec- deriving elaNeliN from the /a/ allomorph being on
ond copy, and if it is added toward the end of the top of the stack and traversed early and udanuden
form, it should surface in the first copy. Or in other from the /e/ allomorph being at the bottom of the
words the fixed segment will always occur in the stack and traversed late. This freedom of lexical
copy such that the fixed segment is closer to the specification grants their system the power to en-
juncture of the two copies. The graph in Fig. 13 force any order needed, including the capacity to
will serialize as abcde-axcde and the graph in handle the ‘look-ahead’ and ‘shortest’ cases above
Fig. 14 will serialize as abcxe-abcde. This in terms of full lexical specification. They could
follows from the properties of Match and Extend: also easily handle languages with the equivalent
as the precedence pairs of the overwriting segment of a L ONGEST constraint. This model is less pre-
and the precedence pair of the backward link ex- dictive while also being more complex.
tend outward, it will either reach the left or right
side first and this will determine the order in which
they appear in the final serialized form. # u d a n %
This prediction is borne out by many exam- e
ples of productive patterns of reduplication with
NEFS such as Marathi saman-suman (Alderete et Figure 16: Javanese udan-uden according to Id-
al., 1999, citing Apte 1968), Bengali sajôa-sujôa sardi and Shorey (2007) and McClory and Raimy
(Khan, 2006, p.14), Kinnauri migo-m@go (Chang, (2007).
2007).
Apparent counterexamples exist, but have other But this complexity is unneeded if we instead
plausible analyses. A major one worth discussing adopt dissimilation analysis closer in spirit to
briefly is the previous multiprecedence analysis of Yip’s original Optimality-Theory analysis. We
the Javanese Habitual-Repetitive as described by can say that the /a/ of the first copy is an over-
Yip (1995; 1998). Most forms surface with a fixed written /a/ in both elaN-eliN and in udan-uden
/a/ in the first copy as in elaN-eliN ‘remember’. and a phonological process causes dissimilation of
This requires a graph such as Fig. 15 which se- the root /a/ in the presence of the added /a/. In
rializes in comformity with Match-Extend. Optimality-Theory this requires an appeal to the
Obligatory Contour Principle operating between
the two copies, but in Multiprecedence the dissim-
# e l i N % ilation is even simpler to state because the two /a/’s
are very local in the graph. We simply need a rule
a to the effect of raising a stem /a/ in the context of a
morphologically-added /a/ that precedes the same
Figure 15: Javanese elaN-eliN. segment as in Fig.17.

However when the first copy already contains a e


/a/ as the second vowel the form is realized with
X =⇒ X
/e/ in the second copy as udan-uden ‘rain’. Id-
sardi and Shorey (2007) and McClory and Raimy a a
(2007) have analyzed this as a phonologically-
Figure 17: Dissimilation Rule
conditioned allomorph with fixed segment /e/ that
must be serialized differently from the /a/ allo-
morph, with the overwriting vowel in the second
copy, i.e. a graph such as Fig. 16 that does not # u d a n % =⇒ # u d e n %
serialize in comformity with Match-Extend. Id- a a
sardi and Shorey (2007) and McClory and Raimy
(2007) use this example to argue for a system of Figure 18: Derivation of udan-uden.
stacks that serialization must read from the top-
down. Precedence arcs in turn can be lexically There is therefore no need to abandon Match-

28
Extend on the basis of Javanese. We have therefore seen that Match-Extend can
Consider another apparent counterexample to straightforwardly account for a number of attested
the prediction: the Palauan root /rEb@th / forms complex reduplicative patterns without any special
its distributive with CVCV reduplication and the stipulations. More interestingly Match-Extend
verbal prefix m@- forming m@-r@b@-rEb@th (Zuraw, makes strong novel predictions about the loca-
2003). At first blush, one may be tempted to see tion of fixed segments. I have not been able to
the first schwa of the first copy as overwriting the locate many examples of NEFS in the literature.
root’s /E/. But the presence of this schwa actually For example the typology of fixed segmentism in
follows from the independently-motivated phonol- Alderete et al. (1999) does not contain any exam-
ogy of Palauan in which all non-stressed vowels ple of NEFS. This will require further empirical
go to [@]. This thus is the result of a phonolog- research.
ical rule applying after serialization about which
Match-Extend has nothing to say. 4 One limitation of Match-Extend:
Relatedly, other apparent issues may be caused overly symmetrical graphs
by interactions with phonology. D’souza (1991,
There is a gap in the predictions of Fig. 11:
p.294) describes how echo-formation in some
Closest Attachment predicts that morphologically-
Munda languages is accomplished by replacing all
added edges will attach in the order they are the
the vowels in the second copy with a fixed vowel,
closest, which relies on an asymmetry in the form
e.g. Gorum bubuP ‘snake’ > bubuP-bibiP. Fixed
such that morphologically-added links are closer
segmentism of each vowel individually may not
in one order than the other. This leaves the prob-
be the best analysis of these forms, there may in-
lem of symmetrical forms like Fig. 19. The former
stead be a single fixed segment and a separate pro-
of there was posited in the analysis of Semai con-
cess of vowel harmony or something along those
tinuative reduplication by Raimy (2000a, p.146-
lines. This type of complex interaction of non-
47) for forms like dNOh ‘appearance of nodding’
local phonology with reduplication has been in-
to dh-dNOh; the latter would be needed in various
vestigated before in Multiprecedence, e.g. the
languages reduplicating CVC forms with vowel
analyses of Tuvan vowel harmony in reduplicated
changes such as the Takelma aorist described
forms in Harrison and Raimy (2004) and Papillon
in Sapir (1922, p.58) like t’eu ‘play shinny’ to
(2020, §7.1), but these analyses make extra as-
t’eut’au.
sumptions about possible Multiprecedence struc-
tures that go far beyond the basics explored here.
The subject requires further exploration, but ap- # a b c % # a b c %
pears to be more of an issue of phonology and rep-
x
resentation than of serialization per se.
Apparent counterexamples will have to be ap- Figure 19: Two structures overly symmetrical for
proached on a case-by case basis, but I have not Match-Extend.
identified many problematic examples so far that
did not turn out to be errors of analysis.1 These are the forms which, in the course of
1
Match-Extend, will come to a point where Match
One such apparent counter-example is worth briefly
commenting on here due to its being mentioned in well-
is indeterminate because two strings could match
known surveys of reduplication. This alleged reduplication equally well in either direction. For example the
is from in Macdonald and Darjowidjojo (1967, p.54) and WorkSpace of the first of these structures will start
repeated in Rubino (2005, p.16): Indonesian belat ‘screen’
to belat-belit ‘underhanded’. If correct this example would with ac and ca, which can match either as aca
be a counterexample to Match-Extend, as a fixed /i/ must or cac. The former would extend into #acabc%
surface in the second copy. However this pair seems to be and the latter into #abcac%. Match-Extend as
misidentified. The English-Indonesian bilingual dictionary
by (Stevens and Schmidgall-Tellings, 2004) lists a word be- stated so far is therefore indeterminate with regard
lit meaning ‘crooked, cunning, deceitful, dishonest, under- to these symmetrical forms.
handed’, which semantically seems like a more plausible
source for the reduplicated form belat-belit and fits the pre-
This is not an insurmountable problem for
dictions of Match-Extend. The same dictionary’s entry under Match-Extend. To the contrary this is a problem
belat lists some screen-related entries and then belat-belit as
meaning ‘crooked, devious, artful, cunning, insincere’ cross- was misidentified by previous authors and is unproblematic
referencing to belit as the base. I conclude that this example for Match-Extend.

29
of having too many solutions without a way to de- allomorphy (Papillon, 2020). A serialization
cide between them, none of which require adding algorithm capable of handling these structures is
parametrization to Match-Extend. Maybe sym- crucial for the completeness of the theory.
metrical forms crash the derivation and all appar- As pointed out by a reviewer, it is crucial to de-
ent instances in the literature must contain some velop a a typology of the possible attested graph-
hidden asymmetry. It is worth noting that the pat- ical input structures to the algorithm so as to
tern in Fig. 19 attested in Semai has a close cog- properly characterize and formalize the algorithm
nate in Temiar, but in this language the symmet- needed. In every form discussed here the roots is
rical structure is only obtained for simple onsets, implicitly assumed to be underlyingly linear and
kOw ‘call’ to kw-kOw, but slOg ‘sleep with’ to s- affixes alone add some topological variety to the
g-lOg (Raimy, 2000a, p.146). This asymmetry re- graphs, as is mostly the case in all the forms from
solves the Match-Extend derivation. It may simply (Raimy, 1999; Raimy, 2000a). Elsewhere I have
be the case that the forms that look symmetrical challenged this idea by positing parallel structures
have a hidden asymmetry in the form of silent seg- both underlyingly and in the output of phonol-
ments. For example if the root has an X at the start ogy (Papillon, 2020). If these structures are al-
as in Fig. 21. This is obviously very ad hoc and lowed in Multiprecedence Phonology then Match-
powerful so minimally we should seek language- Extend will need to be amended or enhanced to
internal evidence for such a segment before jump- handle more varied structures.
ing to conclusions. In this paper I proposed a model that departs
from the previous ones in being framed as patch-
ing a path from the morphology-added links to-
# s l O g %
wards # and % from the inside-out, as opposed to
the existing models seeking to give a set of instruc-
Figure 20: Temiar sglog. tions to correctly traverse the graph from # to %
from beginning to the end.

# X k O w % References
John Alderete, Jill Beckman, Laura Benua, Amalia
Gnanadesikan, John McCarthy, and Suzanne Ur-
Figure 21: Semai kw-kOw with hidden asymme- banczyk. 1999. Reduplication with fixed segmen-
try in the form of a segment X without a phonetic tism. Linguistic inquiry, 30(3):327–364.
correlate, which breaks the symmetry.
Ellen I Broselow. 1983. Salish double reduplications:
subjacency in morphology. Natural Language &
Alternatively it could be that symmetrical forms Linguistic Theory, 1(3):317–346.
lead to both options being constructed and this op-
Charles B Chang. 2007. Accounting for the phonology
tionality is resolved in extra-grammatical ways. I
of reduplication. Presentation at LASSO XXXVI.
will leave this hole in the theory open, as a prob- Available on https://cbchang.com/
lem to be resolved through further research. curriculum-vitae/presentations/.

5 Conclusion Jean D’souza. 1991. Echos in a sociolinguistic area.


Language sciences, 13(2):289–299.
This article presents an invariant serialization al-
Justin Fitzpatrick and Andrew Nevins. 2002. Phono-
gorithm for all morphological patterns in Multi- logical occurrences: Relations and copying. In Sec-
precedence. ond North American Phonology Conference, Mon-
The Multiprecedence research program treal.
has been fruitful in bringing various non-
Justin Fitzpatrick and Andrew Nevins. 2004. Lin-
concatenative phenomena other than reduplication earization of nested and overlapping precedence in
within the scope of a derivational item-and- multiple reduplication.
arrangement model of morphology, including
Michaël Gagnon and Maxime Piché. 2007. Principles
e.g. subtractive morphology (Gagnon and Piché, of linearization & subtractive morphology. In CUNY
2007), Semitic templatic morphology (Raimy, Phonology Forum Conference on Precedence Rela-
2007), and vowel harmony, word tone, and tions.

30
K. David Harrison and Eric Raimy. 2004. Reduplica-
tion in Tuvan: Exponence, readjustment and phonol-
ogy. In Proceedings of Workshop in Altaic Formal
Linguistics, volume 1. Citeseer.
William Idsardi and Rachel Shorey. 2007. Unwinding
morphology. In CUNY Phonology Forum Workshop
on Precedence Relations.
SD Khan. 2006. Similarity Avoidance in Bengali
Fixed-Segment Reduplication. Ph.D. thesis, Univer-
sity of California.
Roderick Ross Macdonald and Soenjono Darjowidjojo.
1967. A student’s reference grammar of modern for-
mal Indonesian. Georgetown University Press.
Daniel McClory and Eric Raimy. 2007. Enhanced
edges: morphological influence on linearization. In
Poster presented at The 81st Annual Meeting of the
Linguistics Society of America. Anaheim, CA.
Maxime Papillon. 2020. Precedence and the Lack
Thereof: Precedence-Relation-Oriented Phonology.
Ph.D. thesis. https://drum.lib.umd.edu/
handle/1903/26391.
Eric Raimy. 1999. Representing reduplication. Ph.D.
thesis, University of Delaware.
Eric Raimy. 2000a. The phonology and morphology of
reduplication, volume 52. Walter de Gruyter.
Eric Raimy. 2000b. Remarks on backcopying. Lin-
guistic Inquiry, 31(3):541–552.
Eric Raimy. 2007. Precedence theory, root and tem-
plate morphology, priming effects and the struc-
ture of the lexicon. CUNY Phonology Symposium
Precedence Conference, January.
Carl Rubino. 2005. Reduplication: Form, function and
distribution. Studies on reduplication, pages 11–29.
Bridget Samuels. 2009. The structure of phonological
theory. Harvard University.
Edward Sapir. 1922. Takelma. In Franz Boas, editor,
Handbook of American Indian Languages,. Smith-
sonian Institution. Bureau of American Ethnology.
Alan M. Stevens and A. Ed. Schmidgall-Tellings.
2004. A comprehensive Indonesian-English dictio-
nary. PT Mizan Publika.
Moira Yip. 1995. Repetition and its avoidance: The
case in Javanese.
Moira Yip. 1998. Identity avoidance in phonology
and morphology. In Morphology and its Relation to
Phonology and Syntax, pages 216–246. CSLI Publi-
cations.
Kie Zuraw. 2003. Vowel reduction in Palauan redu-
plicants. In Proceedings of the 8th Annual Meeting
of the Austronesian Formal Linguistics Association
[AFLA 8], pages 385–398.

31
Incorporating tone in the calculation of phonotactic probability

James P. Kirby
University of Edinburgh
School of Philosophy, Psychology, and Language Sciences
[email protected]

Abstract While this issue is occasionally remarked on


(e.g. Newman et al., 2011: 246), there remains
This paper investigates how the ordering no widespread consensus in practice. Choice
of tone relative to the segmental string
of ordering is sometimes justified based on
influences the calculation of phonotactic
probability. Trigram and recurrent neu- segment-tone co-occurrence restrictions in the
ral network models were trained on sylla- language under study (Myers and Tsay, 2005),
ble lexicons of four Asian syllable-tone lan- but is often presented without justification
guages (Mandarin, Thai, Vietnamese, and (Kirby and Yu, 2007; Yang et al., 2018), and
Cantonese) in which tone was treated as in some cases tone is simply ignored (Gong,
a segment occurring in different positions 2017). When the space of possibilities is con-
in the string. For trigram models, the
sidered, researchers generally select the permu-
optimal permutation interacted with lan-
guage, while neural network models were
tation which maximizes model fit to some ex-
relatively unaffected by tone position in all ternal data, such as participant judgments of
languages. In addition to providing a base- phonological distance (Do and Lai, 2021a) or
line for future evaluation, these results sug- wordlikeness (Do and Lai, 2021b).
gest that phonotactic probability is robust Although extrinsic evaluation is in some
to choices of how tone is ordered with re- sense a gold standard, intrinsic metrics of
spect to other elements in the syllable. model fit can also be informative, in part be-
cause extrinsic metrics are not always robust
1 Introduction
across data sets. For instance, participant
The phonotactic probability of a string is an wordlikeness judgments can vary considerably
important quantity in several areas of lin- based on the particulars of the experimen-
guistic research, including language acquisi- tal design (Myers and Tsay, 2005; Shademan,
tion, wordlikeness, word segmentation, and 2006; Vitevitch and Luce, 1999), so the treat-
speech production and perception (Bailey and ment of tone that produces a best-fit model for
Hahn, 2001; Daland and Pierrehumbert, 2011; one dataset may not do so for another. The
Storkel and Lee, 2011; Vitevitch and Luce, lexicon of a given language is much more in-
1999). When the language of interest is a ternally stable in terms of how segments and
tone language, the question arises of how tone tones are distributed, so intrinsic evaluation
should be incorporated into the probability cal- may provide a useful baseline for reasoning
culation. As phonotactic probability is fre- about the treatment of tone relative to seg-
quently computed based on some type of n- ments both within and across languages.
gram model, this means deciding on which This short paper considers a simple
segment(s) the probability of a tone should information-theoretic motivation for selecting
be conditioned. For instance, using a bigram a permutation: all else being equal, we should
model, one might compute the probability of prefer a model that maximizes the probability
the Mandarin syllable fāng as P (a|f) × P (N|a) of the lexicon (i.e., minimizes the cross-entropy
× P (tone 1|N), but could just as well consider loss), because this will be the model that by
P (tone 1|f) × P (a|1) × P (N|a), or any other definition does the best job of capturing the
conceivable permutation of tone and segments. phonotactic regularities of the lexicon (Cherry

32
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 32–38
August 5, 2021. ©2021 Association for Computational Linguistics
et al., 1953; Goldsmith, 2002; Pimentel et al., yin ping). Obstruents never occur as codas.
2020). By treating tone as another phone in
Thai (tha) A Thai lexicon of 4,133 unique
the segmental string, we can see whether and
syllables was created based on the dictionary
to what degree this choice has an effect on the
of Haas (1964) which contains around 19,000
overall entropy of the lexicon.
entries and 47,000 syllables. The phonemic
Intuitively, any model that can take into ac-
representation encodes 20 onsets, 3 medials /w
count phonotactic constraints will result in a
l r/, 21 nuclei (vowel length being contrastive
reduction in entropy. Thus, even an n-gram
in Thai), 8 codas and 5 tones. In Thai, high
model with a sufficiently large context window
tone is rare/unattested following unaspirated
should in principle be able model segment-tone
and voiced onsets, but there is also statistical
co-occurrences at the syllable level. However,
evidence for a restriction on rising tones with
tone languages differ with respect to tone-
these onsets (Perkins, 2013). In syllables with
segment co-occurrence restrictions (see Sec. 2).
an obstruent coda (/p t k/), only high, low, or
If a relevant constraint primarily targets syl-
falling tones occur, depending on length of the
lable onsets, for instance, placing the tonal
nuclear vowel (Morén and Zsiga, 2006).
“segment” in immediate proximity to the on-
set will increase the probability of the string, Vietnamese (vie) The Vietnamese lexicon
even relative to a model capable of capturing of 8,128 syllables was derived from a freely
the dependency at a longer distance. available dictionary of around 74,000 words
(Đức, 2004), phonetized using a spelling pro-
2 Languages nunciation (Kirby, 2008). The resulting repre-
Four syllable-tone languages were selected for sentation encodes 24 onsets, 1 medial (/w/),
this study: Mandarin Chinese, Cantonese, 14 nuclei, 8 codas and 6 tones. Vietnamese
Vietnamese and Thai. They are partially a syllables ending in obstruents /p t k/ are re-
convenience sample in that the necessary lex- stricted to just one of two tones.
ical resources were readily available, but also Cantonese (yue) The Cantonese syllabary
have some useful similarities: all share a sim- consists of the 1,884 unique syllables in the
ilar syllable structure template and have five Chinese Character Database (Kwan et al.,
or six tones. However, the four languages vary 2003), encoded using the jyutping system.
in terms of their segment-tone co-occurrence This representation distinguishes 22 onsets, 1
restrictions, as detailed below. medial (/w/), 11 nuclei, 5 codas and 6 tones.
In all cases, the lexicon was defined as In Cantonese, unaspirated initials do not oc-
the set of unique syllable shapes in each lan- cur in syllables with low-falling tones, and
guage. For consistency, the syllable tem- aspirated initials do not occur with the low
plate in all four languages is considered to be tone. Syllables ending with /p t k/ are re-
(C1 )(C2 )V(C)T, with variable positioning of stricted to one of the three “entering” tones
T. Offglides were treated as codas in all lan- (Yue-Hashimoto, 1972).
guages. The syllable lexicons for all four lan-
guages are provided in the supplementary ma- 3 Methods
terials (http://doi.org/10.17605/OSF.IO/NA5FB). Two classes of character-level language models
Mandarin (cmn) The Mandarin syllabary (LMs) were considered: simple n-gram models
consists of 1,226 syllables based on list of at- and recurrent neural networks (Mikolov et al.,
tested readings of the 13,060 BIG5 characters 2010). In an n-gram model, the probability
from Tsai (2000), phonetized using the phono- of a string is proportional to the conditional
logical system of Duanmu (2007). This rep- probabilities of the component n-grams:
resentation encodes 22 onsets, 3 medials (/j P (xi |xi−1 i−1
(1)
1 ) ≈ P (xi |xi−n+1 )
4 w/), 6 nuclei, 4 codas and 5 tones (includ-
ing the neutral tone). In Mandarin, unaspi- The degree of context taken into account is
rated obstruent onsets rarely appear with mid- thus determined by the value chosen for n.
rising tone (MC yang ping), and sonorant on- In a recurrent neural network (RNN), the
sets rarely occur with the high-level tone (MC next character in a sequence is predicting using

33
the current character and the previous hidden resulting strings were identical across permu-
state. At each step t, the network retrieves an tations. Both smoothed trigram and simple
embedding for the current input xt and com- RNN LMs were then fit to each permuted lex-
bines it with the hidden layer from the previ- icon 10 times, with random 80/20 train/dev
ous step to compute a new hidden layer ht : splits (other splits produced similar results).
For each run, the perplexity of the language
ht = g(U ht−1 + W xt ) (2) model on the dev set D = x1 x2 . . . xN (i.e., the
exponentiated cross-entropy1 ) was recorded:
where W is the weight matrix for the current
time step, U the weight matrix for the previ- P P L(D) = bH(D) (4)
1
ous time step, and g is an appropriate non- −N logb P (x1 x2 ...xN )
= b (5)
linear activation function. This hidden layer
ht is then used to generate an output layer 4 Results
yt , which is passed through a softmax layer to For brevity, only the main findings are sum-
generate a probability distribution over the en- marized here; the full results are available as
tire vocabulary. The probability of a sequence part of the online supplementary materials
x1 , x2 . . . xz is then just the product of the (http://doi.org/10.17605/OSF.IO/NA5FB).
probabilities of each character in the sequence: Table 1 show the orderings which minimized
z perplexity for each method and language, aver-

P (x1 , x2 . . . xz ) = yi (3) aged over 10 runs. Table 2 shows the average
i=1 perplexity over all permutations for a given
language and method.
The incorporation of the recurrent connec-
tion as part of the hidden layer allows RNNs to method lexicon order PPL
avoid the problem of limited context inherent cmn T|C 4.91 (0.06)
in n-gram models, because the hidden state tha T|M 7.34 (0.12)
3-gram
embodies (some type of) information about all vie T|C 7.35 (0.03)
of the preceding characters in the string. Al- yue T|M 5.84 (0.09)
though RNNs cannot capture arbitrarily long- cmn T|M 4.01 (0.08)
distance dependencies, this is unlikely to make tha T|M 5.20 (0.04)
RNN
a difference for the relatively short distances vie T|M 5.16 (0.02)
involved in phonotactic modeling. yue T|# 4.37 (0.05)
Trigram models were built using the SRILM
Table 1: Orders which produced the lowest per-
toolkit (Stolcke, 2002), with maximum likeli-
plexities averaged over 10 runs (means and stan-
hood estimates smoothed using interpolated dard deviations).
Witten-Bell discounting (Witten and Bell,
1991). RNN LMs were built using PyTorch Differences between orderings were then as-
(Paszke et al., 2019), based on an implementa- sessed visually, aided by simple analyses of
tion by Mayer and Nelson (2020). The results variance. For the trigram LMs, perplexity was
reported here make use of simple recurrent net- lowest in Mandarin when tones followed co-
works (Elman, 1990), but similar results were das, while differences in perplexity between
obtained using an LSTM layer (Hochreiter and other orderings were negligible. For Thai,
Schmidhuber, 1997). Vietnamese, and Cantonese, all orderings were
roughly comparable except for when tone was
3.1 Procedure
ordered as the first segment in the syllable
The syllables in each lexicon were arranged (T|#), which increased perplexity by up to
in 5 distinct permutations: tone following the 1 over the mean of the other orderings. For
coda (T|C), nucleus (T|N), medial (T|M), on- Thai, the ordering T|M resulted in signifi-
set (T|O) and with tone as the initial seg- cantly lower perplexities compared to all other
ment in the syllable (T|#). As many syl- 1
Equivalently, we may think of P P L(D) as the in-
lables in these languages lack onsets, medi- verse probability of the set of syllables D, normalized
als, and/or codas, a sizable number of the for the number of phonemes.

34
cmn tha vie yue
3-gram 5.15 (0.17) 7.76 (0.4) 7.49 (0.27) 5.98 (0.18)
RNN 4.01 (0.07) 5.28 (0.05) 5.18 (0.03) 4.42 (0.07)

Table 2: Mean and standard deviation of perplexity across all permutations by lexicon and language
model.

permutations. For the RNN LMs, although the language model. Even a model with a large
T|M was the numerically optimal ordering for enough context window to capture such depen-
three out of the four languages, in practical dencies will assign the lexicon a higher perplex-
terms permutation had no effect on perplex- ity when structured in this way.
ity, with numerical differences of no greater The finding that the T|M ordering is always
than 0.1 (see Table 2). optimal in Thai (and by a larger margin than
in the other languages) is presumably due to
5 Discussion the fact that the distribution of the medials
/w l r/ is severely restricted in this language,
Consistent with other recent work in compu-
occurring only after /p ph t th k kh f/. The
tational phonotactics (e.g. Mayer and Nel-
distribution of tones after onset-medial clus-
son, 2020; Mirea and Bicknell, 2019; Pimentel
ters is inherently more constrained and there-
et al., 2020), the neural network models out-
fore more predictable. A similar restriction
performed the trigram baselines by a consider-
holds in Cantonese, albeit to a lesser degree
able margin (a reduction in average perplexity
(the medial /w/ only occurs with onsets /k/
of up to 2.5, depending on language). Neu-
and /kh /).
ral network models were also much less sen-
sitive to the linear position of tone relative
5.1 Shortcomings and extensions
to other elements in the segmental string (cf.
Do and Lai, 2021b), no doubt due to the fact This work did not explore representations
that the ability of the RNNs to model co- based on phonological features, given that
occurrence tendencies within the syllable is not their incorporation has failed to provide evalu-
constrained by context in the way that n-gram ative improvements in other studies of com-
models are. putational phonotactics (Mayer and Nelson,
Perhaps as a result, however, the RNN mod- 2020; Mirea and Bicknell, 2019; Pimentel et al.,
els reveal little about the nature of segment- 2020). However, feature-based approaches can
tone co-occurrence restrictions in any of the be both theoretically insightful and may even
languages investigated. In this regard, the tri- prove necessary for other quantifications, such
gram models, while clearly less optimal in a as the measure of phonological distance where
global sense, are still informative. The fact tone is involved (Do and Lai, 2021a).
that the ordering T|# was significantly worse The present study has focused on a small
under the trigram model for Cantonese, Viet- sample of structurally and typologically simi-
namese and Thai but not Mandarin can be ex- lar languages. All have relatively simple syl-
plained (or predicted) by the fact that of the lable structures in which one and only one
four languages, only Mandarin does not per- tone is associated with each syllable. Not all
mit obstruent codas, and consequently has no tone languages share these properties, how-
coda-tone co-occurrence restrictions (indeed, ever. In so-called “word-tone” languages, such
the four primary tones of Mandarin occur with as Japanese or Shanghainese, the surface tone
more or less equal type frequency). In the with which a given syllable is realized is fre-
other three languages, syllables with obstruent quently not lexically specified. In other lan-
codas can only bear a restricted set of tones, guages, such as Yoloxóchitl Mixtec (DiCanio
and in a trigram model, this dependency is not et al., 2014), tonal specification may be tied
modeled when tone is prepended to the sylla- to sub-syllabic units, such as the mora. Fi-
ble, since this means it will frequently, though nally, data from many other languages, such
not always, fall outside the window visible to as Kukuya (Hyman, 1987), make it clear that

35
in at least in some cases tones can only be tonal tier in the first instance, and ordering
treated in terms of abstract melodies, which with respect to segments may simply not be
do not have a consistent association to sylla- relevant (but see Goldsmith and Riggle, 2012).
bles, moras, or vowels (Goldsmith, 1976). In Finally, the present study has focused on the
these and many other cases, careful consider- lexical representation of tone, but in many lan-
ation of the theoretical motivations justifying guages tone primarily serves a morphological
a particular representation are required before function. The SIGMORPHON 2020 Task 0
it makes sense to consider ordering effects. shared challenge (Vylomova et al., 2020) in-
cluded inflection data from several tonal Oto-
However, to the extent that it is possible to
Manguean languages in which tone was or-
generate a segmental representation of a tone
thographically encoded in different ways via
language in which surface tones are indicated,
string diacritics. While the authors noted
what the present work suggests is that the pre-
the existence these differences, it is unclear
cise ordering of the tonal symbols with respect
whether and to what extent the different rep-
to other symbols in the string is unlikely to
resentations of tones affected system perfor-
have a significant impact on phonotactic prob-
mance. Similarly, the potential impact of
ability. This follows from two assumption (or
tone ordering relative to other elements in the
constraints): first, that the set of symbols used
string has yet to be systematically investigated
to indicate tones is distinct from those used to
in this setting.
indicate the vowels and consonants; and sec-
ond, that one and only one such tone symbol 6 Conclusion
appears per string domain (here, the syllable).
If these two constraints hold, the complexity This paper has assessed how different permu-
of the syllable template should in general have tations of tone and segments affects the per-
a greater impact on the entropy of the string plexity of the lexicon in four syllable-tone lan-
set than the position of the tone symbol, al- guages using two types of phonotactic lan-
though the number of unique tone symbols rel- guage models, an interpolated trigram model
ative to the number of segmental symbols may and a simple recurrent neural network. The
also have an effect. According to Maddieson perplexities assigned by the neural network
(2013) and Easterday (2019), languages with models were essentially unaffected by different
complex syllable structures (defined as those choices of ordering; while the trigram model
permitting fairly free combinations of two or was more sensitive to permutations of tone and
more consonants in the position before a vowel, segments, the effects on perplexity remained
and/or two or more consonants in the position minimal. In addition to providing a baseline
after the vowel) rarely have complex tone sys- for future evaluation, these results suggest that
tems, or indeed tone systems at all, so this is the phonotactic probability of a syllable is rel-
unlikely to be an issue for most tone languages. atively robust to choice of how tone is ordered
with respect to other elements in the string,
One possibility the present work did not ad- especially when using a model capable of en-
dress is whether it is even necessary, or desir- coding dependencies across the entire syllable.
able, to include tone in phonotactic probability
calculations in the first place. The probability Acknowledgments
of the lexicon of a tonal language would surely
This work was supported in part by the ERC
change if tone is ignored, but whether listeners’
EVOTONE project (grant no. 758605).
judgments of a sequence as well- or ill-formed
is better predicted by a model that takes tone
into account vs. one that does not is an empir- References
ical question (but see Kirby and Yu, 2007; Do Todd Bailey and Ulrike Hahn. 2001. Determinants
and Lai, 2021b for some evidence that it may of wordlikeness: phonotactics or lexical neigh-
not). Similarly, for research questions focused borhoods? Journal of Memory and Language,
on tone sandhis, or on the distributions of the 44:568–591.
tonal sequences themselves (tonotactics), the E. Colin Cherry, Morris Halle, and Roman Jakob-
relevant computations will be restricted to the son. 1953. Toward the logical description of

36
languages in their phonemic aspect. Language, James P. Kirby and Alan C. L. Yu. 2007. Lexi-
29(1):34–46. cal and phonotactic effects on wordlikeness judg-
ments in Cantonese. In Proceedings of the 16th
Robert Daland and Janet B. Pierrehumbert. 2011. International Conference of the Phonetic Sci-
MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology

Khuyagbaatar Batsuren (1), Gábor Bella (2), and Fausto Giunchiglia (2,3)
(1) National University of Mongolia, Mongolia
(2) DISI, University of Trento, Italy
(3) College of Computer Science and Technology, Jilin University, China
{gabor.bella,fausto.giunchiglia}@unitn.it

Abstract

Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g., to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilingual morphological databases, we announce the release of MorphyNet, a high-quality resource covering 15 languages, with 519k derivational and 10.1M inflectional entries and a rich set of morphological features. MorphyNet was extracted from Wiktionary using both hand-crafted and automated methods, and was manually evaluated to have a precision higher than 98%. Both the resource generation logic and the resulting database are made freely available (footnotes 1, 2) and are reusable as stand-alone tools or in combination with existing resources.

1 Introduction

Despite repeated paradigm shifts in computational linguistics and natural language processing, morphological analysis and its related tasks, such as lemmatization, stemming, or compound splitting, have always remained essential components of language processing systems. Recently, in the context of language models based on subword embeddings, a morphologically meaningful splitting of words has been shown to improve the efficiency of downstream tasks (Devlin et al., 2019; Sennrich et al., 2016; Bojanowski et al., 2017; Provilkov et al., 2020). In particular, the reintroduction of linguistically motivated approaches and high-quality linguistic resources into deep learning architectures has been crucial for dealing with morphologically rich (highly inflecting, agglutinative) languages more efficiently (Pinnis et al., 2017; Vylomova et al., 2017; Ataman and Federico, 2018; Gerz et al., 2018).

In response to such needs, and as simple and convenient substitutes for monolingual morphological analyzers, multilingual morphological databases have been developed, indicating for each word form entry one or more corresponding root or dictionary entries, as well as an analysis (features) (Kirov et al., 2018; Metheniti and Neumann, 2020; Vidra et al., 2019). The precision and recall of these resources vary wildly, and there is still a lot of ground to cover with respect to the support of new languages, the modelling of the inflectional and derivational complexity of each language, and the richness of the information (features, affixes, parts of speech, etc.) provided.

As a further step towards extending online morphological data, we introduce MorphyNet, a new database that addresses both derivational and inflectional morphology. Its current version covers 15 languages and has 519k derivational and 10.1M inflectional entries, as well as a rich set of features (lemma, parts of speech, morphological tags, affixes, etc.). Similarly to certain existing databases, MorphyNet was built from Wiktionary data; however, our extraction logic allows for a more exhaustive coverage of both derivational and inflectional cases.

The contributions of this paper are the freely available MorphyNet resource, the description of the data extraction logic and tool (also made freely accessible), as well as its evaluation and comparison to state-of-the-art multilingual morphological databases. Due to the limited overlap between the contents of these resources and MorphyNet, we consider it complementary to them and therefore usable in combination with them.

Section 2 of the paper presents the state of the art. Section 3 gives details on our method for generating MorphyNet data. Section 4 presents the resulting resource, and Section 5 evaluates it. Section 6 concludes the paper.

1 http://ukc.disi.unitn.it/index.php/MorphyNet
2 http://github.com/kbatsuren/MorphyNet
Figure 1: The MorphyNet generation process and the input datasets used.

2 State of the Art

Ever since the early days of computational linguistics, morphological analysis and its related tasks, such as stemming and lemmatization, have been part of NLP systems. Earlier grammar-based systems used finite-state transducers or affix stripping techniques, and some of them were already multilingual and capable of tackling morphologically complex languages (Beesley and Karttunen, 2003; Trón et al., 2005; Inxight, 2005). However, due to the costliness of producing the grammar rules that drove them, many of these systems were only commercially available.

More recently, several projects have followed the approach of formalizing and/or integrating existing morphological data for multiple languages. UDer (Universal Derivations) (Kyjánek et al., 2020) integrates 27 derivational morphology resources in 20 languages. UniMorph (Kirov et al., 2016, 2018) and the Wikinflection Corpus (Metheniti and Neumann, 2020) rely mostly on Wiktionary, from which they extract inflectional information. Beyond the data source, however, the two latter projects have little in common: UniMorph is by far more precise and complete and is used as a gold standard by the NLP community (Cotterell et al., 2017, 2018), recently covering 133 languages (McCarthy et al., 2020), while Wikinflection follows a naïve, linguistically uninformed approach of merely concatenating affixes, generating an abundance of ungrammatical word forms (e.g. for Hungarian or Finnish).

MorphyNet is also based on extracting morphological information from Wiktionary, extending the scope of UniMorph by new extraction rules and logic. The first version of MorphyNet covers 15 languages, and it is distinct from other resources in three aspects: (1) it includes both inflectional and derivational data; (2) it extracts a significantly higher number of inflections from Wiktionary; and (3) it provides a wider range of morphological information. While for the languages it covers MorphyNet can be considered a superset of UniMorph, the latter supports more languages. With UDer, as we show in Section 4, the overlap is minor on all languages. For these reasons, we consider MorphyNet complementary to these databases, considerably enriching their coverage on the 15 supported languages but not replacing them.

3 MorphyNet Generation

MorphyNet is generated mainly from Wiktionary, through the following steps.

1. Filtering returns XML-based Wiktionary content from specific sections of relevant lexical entries: headword lines, etymology sections, and inflectional tables are returned for nouns, verbs, and adjectives.

2. Extraction obtains raw morphological data by parsing the sections above.

3. Enrichment algorithmically extends the coverage of derivations and inflections obtained from Wiktionary, through entirely distinct methods for inflection and derivation.

4. Resource generation, finally, outputs MorphyNet data.

Below we explain the non-trivial Wiktionary extraction and enrichment steps, while Section 4 provides details on the generated resource itself.
3.1 Wiktionary Extraction

We extract inflectional and derivational data through hand-crafted extraction rules that target recurrent patterns in Wiktionary content, both in source markdown and in HTML-rendered form. With respect to UniMorph, which takes a similar approach and scrapes tables that provide inflectional paradigms, the scope of extraction is considerably extended, also including headword lines and etymology sections. This allows us to obtain new derivations, inflections, and features not covered by UniMorph, such as gender information or noun and adjective declensions for Catalan, French, Italian, Spanish, Russian, English, or Serbo-Croatian. Our rules target nouns, adjectives, and verbs in all languages covered.

Inflection extraction rules target two types of Wiktionary content: inflectional tables and headword lines. Inflectional tables provide conjugation and declension paradigms for a subset of verbs, nouns, and adjectives in Wiktionary. On tables, our extraction method was similar to that of UniMorph as described in (Kirov et al., 2016, 2018), with one major difference. UniMorph also extracted a large number of separate entries with modifier and auxiliary words, such as Spanish negative imperatives (no comas, no coma, no comamos, etc.) or Finnish negative indicatives (en puhu, et puhu, eivät puhu, etc.). MorphyNet, on the other hand, has a single entry for each distinct word form, regardless of the modifier word used. This policy had a particular impact on the size of the Finnish vocabulary.

As inflectional tables are only provided by Wiktionary for 62.5% (footnote 3) of nouns, verbs, and adjectives, we extended the scope of extraction to headword lines, such as

banca f (plural banche)

From this headword line, we extract two entries: one stating that banca is feminine singular and a second stating that banche is feminine plural. We created specific parsing rules for nouns, verbs, and adjectives because each part of speech is described through a different set of morphological features. For example, valency (transitive or reflexive) and aspect (perfective or imperfective) are essential for verbs, while gender (masculine or feminine) and number (singular or plural) pertain to nouns and adjectives.

Derivation extraction rules were applied to the etymology sections of Wiktionary entries to collect the Morphology template usages, such as for the English accusation:

Equivalent to accuse + -ation.

for which we have a morphology entry {{suffix|en|accuse|-ation}} in the Wiktionary XML dump. After collecting all morphology entries, we applied the enrichment method to increase coverage.

3 Computed over the 15 languages covered by MorphyNet.

Figure 2: Derivation enrichment example: inference of the derivation of the Portuguese word acusação.

3.2 Derivation Enrichment

Derivation enrichment is based on a linguistically informed cross-lingual generalization of derivational patterns observed in Wiktionary data, in order to extend the coverage of derivational data.

In the example shown in Figure 2, Wiktionary contains the Portuguese derivation competir (to compete) → competição (competition) but not acusar (to accuse) → acusação (accusation). An indiscriminate application of the suffix -ção to all verbs would, of course, generate lots of false positives, such as chegar (to arrive) ↛ *chegação. Even when the target word does exist, the inferred derivation is often false, as in the case of corar (to blush) ↛ coração (heart). A counter-example from English could be jewel + -ery → jewellery but gal + -ery ↛ gallery.

For this reason, we use stronger cross-lingual derivational evidence to induce the applicability of the affix. In the example above, the existence of the English derivation accuse → accusation, where the meanings of the English and the corresponding Portuguese words are the same, serves as a strong hint for the applicability of the Portuguese pattern.
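For illustration, the cross-lingual evidence check just described can be sketched in a few lines of Python. The data structures below (a set of cognate pairs, a per-language dictionary of asserted derivations, and the treatment of affix correspondences as cognate pairs) are assumptions of this sketch, not the actual MorphyNet implementation.

# Hypothetical data layout, for illustration only:
# cognates: set of (lang1, word1, lang2, word2) tuples from a CogNet-like resource
# derivations[lang]: dict mapping (source_word, affix) -> target_word

def infer_derivation(lang_a, src_a, tgt_a, affix_a, lang_b, cognates, derivations):
    """Accept der(src_a, affix_a) = tgt_a in language A if the corresponding
    derivation is asserted for cognate words (and a cognate affix) in language B."""
    def cog(l1, w1, l2, w2):
        return (l1, w1, l2, w2) in cognates or (l2, w2, l1, w1) in cognates

    for (src_b, affix_b), tgt_b in derivations.get(lang_b, {}).items():
        if (cog(lang_a, src_a, lang_b, src_b)
                and cog(lang_a, tgt_a, lang_b, tgt_b)
                and cog(lang_a, affix_a, lang_b, affix_b)):
            return True  # strong cross-lingual evidence found
    return False

# Example from the paper: accept acusar -> acusação because accuse -> accusation exists.
cognates = {
    ("pt", "acusar", "en", "accuse"),
    ("pt", "acusação", "en", "accusation"),
    ("pt", "-ção", "en", "-ation"),
}
derivations = {"en": {("accuse", "-ation"): "accusation"}}
print(infer_derivation("pt", "acusar", "acusação", "-ção", "en", cognates, derivations))  # True

The formal statement of this rule follows below; the sketch simply spells out the same conjunction of cognacy and derivation conditions.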

Table 1: Structure of MorphyNet inflectional data and its comparison to UniMorph. Data provided only by MorphyNet is highlighted in bold. The rest is provided by both resources in a nearly identical format.

Language base_word trg_word features src_word morpheme


Hungarian ház házak N;NOM;PL ház -ak
Hungarian ház házat N;ACC;SG ház -at
Hungarian ház házakat N;ACC;PL házak -at
Russian играть играть V;NFIN;IPFV;ACT играть
Russian играть играют V;IND;PRS;3;PL;IPFV;ACT;FIN играть -ают
Russian играть играющий V;V.PTCP;ACT;PRS играют -щий

Table 2: Structure of MorphyNet derivational data and its comparison to UDer. Data provided only by MorphyNet is highlighted in bold. The rest is provided by both resources in a nearly identical format.

Language src_word trg_word src_pos trg_pos morpheme


English time timeless noun adjective -less
English soda sodium noun substance.noun -ium
English zoo zoophobia noun state.noun -phobia
Finnish kirjoittaa kirjoittaminen verb noun -minen
Finnish lyödä lyöjä verb person.noun -jä

This intuition is formalized in MorphyNet as follows: if in language A a derivation from source word w^s_A to target word w^t_A through the affix a_A is not explicitly asserted (e.g. by Wiktionary) but it is asserted for the corresponding cognates in at least one language B, then we infer its existence:

cog(w^s_A, w^s_B) ∧ cog(w^t_A, w^t_B) ∧ cog(a_A, a_B) ∧ der(w^s_B, a_B) = w^t_B ⇒ der(w^s_A, a_A) = w^t_A

where cog(x, y) means that the words x and y are cognates, and der(b, a) = d means that word d is derived from base word b and affix a. In our example, A = Portuguese, B = English, w^s_A = acusar, w^s_B = accuse, w^t_A = acusação, w^t_B = accusation, a_A = -ção, and a_B = -tion.

As shown in Figure 1, we exploited a cognate database, CogNet (footnote 4) (Batsuren et al., 2019, 2021), which has 8.1M cognate pairs, for evidence of cognacy: cog(w_A, w_B) = True is asserted by the presence of the word pair in CogNet.

The result of enrichment was a total increase of 25.6% in the number of derivations in MorphyNet. Efficiency varies among languages, essentially depending on the completeness of the Wiktionary coverage: it was lowest for English with 3% and highest for Spanish with 57%.

3.3 Inflection Enrichment

The enrichment of inflectional data is based on the simple observation that Wiktionary does not provide the root word for all inflected word forms. For example, for the Hungarian múltjával (with his/her/its past), Wiktionary provides the inflection múltja → múltjával (his/her/its past + instrumental). For múltja, in turn, it provides múlt → múltja (past + possessive). It does not, however, directly provide the combination of the two inflections: múlt → múltjával (past + possessive + instrumental). Inflection enrichment consists of inferring such missing rules from the existing data.

The case above is formalized as follows: if, after the Wiktionary extraction phase, the MorphyNet data contains the inflections w_r → w_1 (with feature set F_1) as well as w_1 → w_2 (with feature set F_2), then we create the new inflection w_r → w_2 with feature set F_1 ∪ F_2.

The application of this logic increased the inflectional coverage of MorphyNet by 10.8% and its recall (with respect to the ground truth data presented in Section 5) by 8.2% on average.

4 The MorphyNet Resource

MorphyNet is freely available for download, both as text files containing the data and as the source code of the Wiktionary extractor (footnote 5). Two text files are provided per language: one for inflections and one for derivations. The structure of the two types of files is illustrated in Tables 1 and 2, respectively. As shown, MorphyNet covers all data fields provided by UniMorph for inflections and by UDer for derivations. In addition, it extends UniMorph by indicating the affix and the immediate source word that produced the inflection. Such information is useful, for example, to NLP applications that rely on subword information for understanding out-of-vocabulary words.

4 http://github.com/kbatsuren/CogNet
5 http://github.com/kbatsuren/WiktConv
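The inflection enrichment step of Section 3.3 amounts to a one-step composition of extracted inflection edges. The following is a minimal illustration, assuming entries are held as (source form, target form, feature set) triples; that representation, and the feature tags in the toy example, are assumptions of this sketch rather than the actual MorphyNet data format.

# Sketch of inflection enrichment (Section 3.3): compose w_r -> w_1 (F1) with
# w_1 -> w_2 (F2) into w_r -> w_2 with features F1 | F2.

def enrich_inflections(inflections):
    by_source = {}
    for src, tgt, feats in inflections:
        by_source.setdefault(src, []).append((tgt, feats))

    existing = {(src, tgt) for src, tgt, _ in inflections}
    new_entries = []
    for src, tgt, feats in inflections:
        for tgt2, feats2 in by_source.get(tgt, []):
            if (src, tgt2) not in existing and tgt2 != src:
                new_entries.append((src, tgt2, feats | feats2))
                existing.add((src, tgt2))
    return new_entries

# Hungarian example from the text: múlt -> múltja -> múltjával
# (feature tags below are illustrative only).
inflections = [
    ("múlt", "múltja", frozenset({"N", "PSS3S"})),
    ("múltja", "múltjával", frozenset({"N", "INS"})),
]
# Produces the composed entry múlt -> múltjával with the union of the two feature sets.
print(enrich_inflections(inflections))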

Table 3: MorphyNet dataset statistics

Inflectional morphology Derivational morphology


# Languages words entries morphemes words entries morphemes Total
1 Finnish 65,402 1,617,751 1,139 18,142 37,199 446 1,654,950
2 Serbo-Croatian 68,757 1,760,095 263 8,553 20,008 429 1,780,103
3 Italian 75,089 748,321 104 22,650 42,149 749 790,470
4 Hungarian 38,067 1,034,317 428 14,566 37,940 832 1,072,257
5 Russian 67,695 1,343,760 252 21,922 36,922 575 1,380,682
6 Spanish 67,796 677,423 145 16,268 27,633 490 705,056
7 French 44,729 453,229 98 15,473 37,203 636 490,432
8 Portuguese 30,969 329,861 161 10,504 15,974 387 345,835
9 Polish 36,940 663,545 251 9,518 18,404 405 681,949
10 German 35,086 214,401 243 13,070 23,867 465 238,268
11 Czech 9,781 298,888 112 4,875 9,660 318 307,935
12 English 149,265 652,487 8 67,412 200,365 2,445 852,852
13 Catalan 16,404 168,462 91 3,244 4,083 220 172,545
14 Swedish 14,485 131,693 32 3,190 5,810 217 137,503
15 Mongolian 2,085 14,592 35 1,410 1,940 229 16,532
Total 722,550 10,108,825 3,362 230,797 519,157 8,843 10,627,369

Table 4: UniMorph and MorphyNet data sizes compared to Universal Dependencies content.

Language UniMorph MorphyNet Univ. Dep.
Catalan 81,576 168,462 25,443
Czech 134,528 298,888 151,838
English 115,523 652,487 17,296
French 367,733 453,229 28,921
Finnish 2,490,377 1,617,751 47,813
Hungarian 552,950 1,034,317 3,685
Italian 509,575 748,321 24,002
Serbo-Croatian 840,799 1,760,095 35,936
Spanish 382,955 677,423 32,571
Swedish 78,411 131,693 15,030
Russian 473,482 1,343,760 18,774
Total 5,893,381 8,886,426 401,309

MorphyNet also extends the UDer structure by indicating the affix and the semantic category of the target word when it can be inferred from the morpheme. Such information is again useful for the subword regularization of derivationally rich languages, such as English.

Table 3 provides per-language statistics on MorphyNet data. The present version of the resource contains 10.6 million entries, of which 95% are inflections. Highly inflecting and agglutinative languages dominate the resource, as 55% of all entries belong to Finnish, Hungarian, Russian, and Serbo-Croatian. Language coverage above all depends on the completeness of Wiktionary, the main source of our data.

5 Evaluation

We evaluated MorphyNet through two different methods: (1) through comparison to ground truth and (2) through manual validation by experts.

Comparison to ground truth. The quality evaluation of a morphology database is a challenging task due to the many unusual morphological phenomena of the languages evaluated (Gorman et al., 2019). As ground truth on inflections we used the Universal Dependencies dataset (footnote 6) (Nivre et al., 2016, 2017), which (among others) provides morphological analyses of inflected words over a multilingual corpus of hand-annotated sentences. McCarthy et al. (2018) built a Python tool (footnote 7) to convert these treebanks into the UniMorph schema (Sylak-Glassman, 2016). We evaluated both UniMorph 2.0 and MorphyNet against this data (performing the necessary mapping of feature tags beforehand) over the 11 languages in the intersection of the two resources: Hungarian (Vincze et al., 2010), Catalan, Spanish (Taulé et al., 2008), Czech (Bejček et al., 2013), Finnish (Pyysalo et al., 2015), Russian (Lyashevskaya et al., 2016), Serbo-Croatian (De Melo, 2014), French (Guillaume et al., 2019), Italian (Bosco et al., 2013), Swedish (Nivre and Megyesi, 2007), and English (Silveira et al., 2014). Table 5 contains evaluation results over nouns, verbs, and adjectives separately, as well as totals per language. Missing data points (e.g. for Catalan nouns) indicate that UniMorph did not have any corresponding inflections. For languages and parts of speech where both resources provide data, MorphyNet always provides higher recall. The exception is Finnish, because of our policy of not extracting conjugations with auxiliary and modifier words as separate entries (see Section 3.1).

6 https://universaldependencies.org/
7 https://github.com/unimorph/ud-compatibility
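One way to picture the ground-truth comparison is to reduce both the gold (UD-converted) analyses and the database entries to (lemma, inflected form, feature-tag set) triples and score them with set precision and recall. The sketch below is a simplified illustration under that assumption, not the authors' evaluation script; the exact normalization and tag mapping used in the paper may differ.

# Simplified precision/recall/F1 against UD-derived gold analyses, where both
# resources are reduced to (lemma, form, frozenset_of_tags) triples.

def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("ház", "házak", frozenset({"N", "NOM", "PL"})),
        ("ház", "házat", frozenset({"N", "ACC", "SG"}))}
predicted = {("ház", "házak", frozenset({"N", "NOM", "PL"}))}
print(precision_recall(predicted, gold))  # (1.0, 0.5, 0.666...)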

Table 5: Inflectional morphology evaluation of MorphyNet against UniMorph on Universal Dependencies

Noun Verb Adjective Total


Language Resource
R P F1 R P F1 R P F1 R P F1
UniMorph - - - 71.9 99.3 83.4 - - - 21.3 99.3 35.1
Catalan
MorphyNet 66.0 98.4 79.0 73.8 99.1 84.6 48.2 99.6 65.0 64.3 98.8 77.9
UniMorph 28.2 99.1 43.9 9.5 18.1 12.5 17.6 44.8 25.3 21.0 72.7 32.6
Czech
MorphyNet 33.2 98.9 49.7 28.2 93.8 43.4 36.1 98.1 52.8 34.2 98.0 50.7
UniMorph - - - 96.1 90.9 93.4 - - - 28.3 90.9 43.2
English
MorphyNet 81.5 99.1 89.4 97.1 96.8 96.9 85.3 99.7 91.9 83.2 98.8 90.3
UniMorph - - - 70.6 98.5 82.2 - - - 20.6 98.5 34.1
French
MorphyNet 80.2 98.6 88.5 94.4 98.5 96.4 60.1 94.6 73.5 79.7 97.9 87.9
UniMorph 45.5 99.5 62.4 50.5 88.4 64.3 61.4 81.7 70.1 49.1 93.5 64.4
Finnish
MorphyNet 49.8 99.4 66.4 53.8 89.5 67.2 67.2 98.1 79.8 54.5 96.7 69.7
UniMorph 45.3 99.0 62.2 31.9 97.8 48.1 - - - 30.8 98.8 47.0
Hungarian
MorphyNet 55.2 99.1 70.9 77.2 96.9 85.9 43.1 95.9 59.5 56.3 97.9 71.5
UniMorph - - - 66.1 91.6 76.8 - - - 22.8 91.6 36.5
Italian
MorphyNet 86.7 99.0 92.4 88.8 96.9 92.7 84.9 98.9 91.4 87.0 98.2 92.3
UniMorph 0.0 0.0 0.0 0.0 0.0 0.0 49.4 47.4 48.4 18.5 47.4 26.6
Serbo-Croatian
MorphyNet 69.5 88.4 77.8 69.1 98.1 81.1 54.9 98.6 70.5 63.9 93.3 75.9
UniMorph - - - 93.0 99.8 96.3 - - - 32.1 99.8 48.6
Spanish
MorphyNet 88.3 99.2 93.4 97.0 99.5 98.2 81.9 99.2 89.7 89.7 99.3 94.3
UniMorph 15.1 98.4 26.2 59.7 84.8 70.1 34.1 94.8 50.2 27.1 92.0 41.9
Swedish
MorphyNet 36.8 99.4 53.7 78.0 98.1 86.9 38.1 99.6 55.1 44.6 99.1 61.5
UniMorph 0.0 0.0 0.0 0.0 0.0 0.0 52.8 97.4 68.5 10.8 97.4 19.4
Russian
MorphyNet 56.5 95.1 70.9 67.7 92.9 78.3 64.5 99.0 78.1 61.5 95.2 74.7

Overall, as seen from Table 4, MorphyNet contains about 47% more entries over the 11 languages where it overlaps with UniMorph. In terms of precision, the two resources are comparable, except for Finnish (adjectives) and Swedish (adjectives and verbs), where MorphyNet appears to be significantly more precise.

UDer (Kyjánek et al., 2020) is a collection of individual monolingual resources of derivational morphology. Most of them have been carefully evaluated against their own datasets and offer high quality. We evaluated MorphyNet derivational data against UDer over the nine languages covered by both resources: French (Hathout and Namer, 2014), Portuguese (de Paiva et al., 2014), Czech (Vidra et al., 2019), German (Zeller et al., 2013), Russian (Vodolazsky, 2020), Italian (Talamo et al., 2016), Finnish (Lindén and Carlson, 2010; Lindén et al., 2012), Latin (Litta et al., 2016), and English (Habash and Dorr, 2003). Statistics and results are shown in Table 6. First of all, the overlap between MorphyNet and UDer is small, which is visible from our recall values relative to UDer that vary between 0.6% (Czech) and 59.5% (Italian). Among the languages evaluated, six were better covered by MorphyNet and the remaining three (Czech, German, and Russian) by UDer. The agreement between the two resources, computed as Cohen's Kappa, was 0.85 overall, varying between 0.74 (Finnish) and 0.97 (Portuguese). If we consider UDer as gold standard, we obtain precision figures between 87% and 99%.

Manual evaluation was carried out by language experts over sample data from five languages: English, Italian, French, Hungarian, and Mongolian. The sample consisted of 1,000 randomly selected entries per language, half of them inflectional and the other half derivational. The experts were asked to validate the correctness of source–target word pairs and of morphemes, as well as of inflectional features and parts of speech (the latter for derivations). Table 7 shows detailed results. The overall precision is 98.9%, with per-language values varying between 98.2% (Hungarian) and 99.5% (English). The good results are proof both of the high quality of Wiktionary data and of the general correctness of the data extraction and enrichment logic of MorphyNet. A manual check of the incorrect entries revealed that most of them were caused by extraction rules failing on occasional deviations in Wiktionary from its own conventions.

6 Conclusions and Future Work

We consider the resource released and described here as an initial work-in-progress version that we plan to extend and improve.

Table 6: Derivational morphology evaluation of MorphyNet against Universal Derivations (UDer)

# Language MorphyNet Universal Derivations (UDer) UDer ∩ MorphyNet Recall Precision Kappa
1 French 37,203 Démonette 13,272 2,558 18.5 95.5 0.91
2 Portuguese 15,974 NomLex-PT 3,420 1,235 35.8 98.9 0.97
3 Czech 9,660 Derinet 804,011 5,347 0.6 94.1 0.88
4 German 23,867 DerivBase 35,528 5,878 15.6 93.5 0.87
5 Russian 36,922 DerivBase.RU 118,374 6,370 12.3 88.1 0.76
6 Italian 42,149 DerIvaTario 1,548 958 59.5 90.7 0.81
7 Finnish 37,199 FinnWordnet 8,337 2,664 30.6 87.0 0.74
8 Latin 9,191 WFL 2,792 4,037 14.0 93.7 0.87
9 English 200,365 CatVar 16,185 7,397 45.7 91.9 0.83
Total 412,530 1,003,467 36,444 25.8 92.6 0.85

Table 7: Manual validation by language experts on MorphyNet

Inflectional morphology Derivational morphology


# Language word pair features morphemes trg words POS morphemes Total
1 English 99.2 100.0 99.0 100.0 99.0 100.0 99.5
2 French 99.8 98.0 100.0 100.0 96.8 100.0 99.1
3 Hungarian 97.0 95.0 100.0 98.6 99.1 99.2 98.2
4 Italian 100.0 100.0 99.4 98.0 97.4 99.0 99.0
5 Mongolian 98.2 100.0 99.2 98.4 98.1 98.6 98.8
Average. 98.8 98.6 99.5 99.0 98.1 99.4 98.9

We are currently working on increasing the coverage to 20 languages. We also plan to extend the MorphyNet data with additional features and with the semantic categories of words (e.g. animate or inanimate object, action) inferred from derivations. We are planning to conduct a more in-depth study of our evaluation results, especially with respect to UDer, where it is not yet clear whether the occasional lower precision figures (87% for Finnish, 88% for Russian) are due to mistakes in MorphyNet or in the UDer resources, or are caused by other factors.

A major piece of ongoing work concerns the representation of MorphyNet derivational data as a lexico-semantic graph, as is done in wordnets (Miller, 1998; Giunchiglia et al., 2017), where derivationally related word senses are interconnected by associative relationships. This effort, which justifies the -Net in the name of our resource, will allow us to address completeness issues in existing wordnets by extending them with morphological relations and derived words.

We are happy to offer the MorphyNet extraction logic for reuse on a community basis. As extending the tool with new Wiktionary extraction rules is straightforward, we hope that the availability of the tool will allow language coverage to grow even further. We also hope that the MorphyNet data and the extraction logic can serve existing high-quality projects such as UniMorph and UDer.

References

Duygu Ataman and Marcello Federico. 2018. Compositional representation of morphologically-rich input for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 305–311.

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. 2021. A large and evolving cognate database. Language Resources and Evaluation.

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. 2019. CogNet: a large-scale cognate database. In Proceedings of ACL 2019, Florence, Italy.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.

Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová, Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, et al. 2013. Prague Dependency Treebank 3.0.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cristina Bosco, Montemagni Simonetta, and Simi Maria. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69. The Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Gerard De Melo. 2014. Etymological Wordnet: Tracing the history of words. In LREC, pages 1148–1154. Citeseer.

Valeria de Paiva, Livy Real, Alexandre Rademaker, and Gerard de Melo. 2014. NomLex-PT: A lexicon of Portuguese nominalizations. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2851–2858.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, and Anna Korhonen. 2018. Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics, 6:451–465.

Fausto Giunchiglia, Khuyagbaatar Batsuren, and Gabor Bella. 2017. Understanding and exploiting language diversity. In IJCAI, pages 4009–4017.

Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, and Magdalena Markowska. 2019. Weird inflects but OK: Making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 140–151.

Bruno Guillaume, Marie-Catherine de Marneffe, and Guy Perrier. 2019. Conversion et améliorations de corpus du français annotés en Universal Dependencies. Traitement Automatique des Langues, 60(2):71–95.

Nizar Habash and Bonnie Dorr. 2003. CatVar: A database of categorial variations for English. In Proceedings of the MT Summit, pages 471–474. Citeseer.

Nabil Hathout and Fiammetta Namer. 2014. Démonette, a French derivational morpho-semantic network. In Linguistic Issues in Language Technology, Volume 11, 2014 - Theoretical and Computational Morphology: New Trends and Synergies.

Inxight. 2005. LinguistX natural language processing platform.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, et al. 2018. UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Christo Kirov, John Sylak-Glassman, Roger Que, and David Yarowsky. 2016. Very-large scale parsing and normalization of Wiktionary morphological paradigms. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3121–3126.

Lukáš Kyjánek, Zdeněk Žabokrtský, Magda Ševčíková, and Jonáš Vidra. 2020. Universal Derivations 1.0, a growing collection of harmonised word-formation resources. The Prague Bulletin of Mathematical Linguistics, (115):5–30.

Krister Lindén and Lauri Carlson. 2010. FinnWordNet – Finnish WordNet by translation. LexicoNordica – Nordic Journal of Lexicography, 17:119–140.

Krister Lindén, Jyrki Niemi, and Mirka Hyvärinen. 2012. Extending and updating the Finnish WordNet. In Shall We Play the Festschrift Game?, pages 67–98. Springer.

Eleonora Litta, Marco Passarotti, and Chris Culy. 2016. Formatio formosa est. Building a Word Formation Lexicon for Latin. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), pages 185–189.

Olga Lyashevskaya, Kira Droganova, Daniel Zeman, Maria Alexeeva, Tatiana Gavrilova, Nina Mustafina, Elena Shakurova, et al. 2016. Universal Dependencies for Russian: A new syntactic dependencies tagset. Higher School of Economics Research Paper No. WP BRP 44.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2020. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931.

Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2018. Marrying Universal Dependencies and Universal Morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 91–101.

Eleni Metheniti and Günter Neumann. 2020. Wikinflection Corpus: A (better) multilingual, morpheme-annotated inflectional corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3905–3912.

George A. Miller. 1998. WordNet: An electronic lexical database. MIT Press.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, et al. 2017. Universal Dependencies 2.1.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Joakim Nivre and Beata Megyesi. 2007. Bootstrapping a Swedish treebank using cross-corpus harmonization and annotation projection. In Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories, pages 97–102. Citeseer.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, and Toms Miks. 2017. Neural machine translation for morphologically rich languages with improved sub-word units and synthetic data. In International Conference on Text, Speech, and Dialogue, pages 237–245. Springer.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892.

Sampo Pyysalo, Jenna Kanerva, Anna Missilä, Veronika Laippala, and Filip Ginter. 2015. Universal Dependencies for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NoDaLiDa 2015), pages 163–172.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014.

A gold standard dependency corpus for English.
In Proceedings of the Ninth International Con-
ference on Language Resources and Evaluation
(LREC-2014).

John Sylak-Glassman. 2016. The composition


and use of the universal morphological feature
schema (unimorph schema). Johns Hopkins Uni-
versity.

Luigi Talamo, Chiara Celata, and Pier Marco


Bertinetto. 2016. Derivatario: An annotated
lexicon of italian derivatives. Word Structure,
9(1):72–102.

Mariona Taulé, Maria Antònia Martí, and Marta


Recasens. 2008. Ancora: Multilevel annotated
corpora for catalan and spanish. In Lrec.

Viktor Trón, Gyögy Gyepesi, Péter Halácsy, An-


drás Kornai, László Németh, and Dániel Varga.
2005. Hunmorph: open source word analysis.
In Proceedings of Workshop on Software, pages
77–85.

Jonáš Vidra, Zdeněk Žabokrtskỳ, Magda


Ševčíková, and Lukáš Kyjánek. 2019. De-
rinet 2.0: towards an all-in-one word-formation
resource. In Proceedings of the Second Inter-
national Workshop on Resources and Tools for
Derivational Morphology, pages 81–89.

Veronika Vincze, Dóra Szauter, Attila Almási,


György Móra, Zoltán Alexin, and János Csirik.
2010. Hungarian dependency treebank.

Daniil Vodolazsky. 2020. Derivbase. ru: A deriva-


tional morphology resource for russian. In Pro-
ceedings of The 12th Language Resources and
Evaluation Conference, pages 3937–3943.

Ekaterina Vylomova, Trevor Cohn, Xuanli He, and


Gholamreza Haffari. 2017. Word representation
models for morphologically rich languages in
neural machine translation. In Proceedings of
the First Workshop on Subword and Character
Level Models in NLP, pages 103–108.

Britta Zeller, Jan Šnajder, and Sebastian Padó.


2013. Derivbase: Inducing and evaluating a
derivational morphology resource for german.
In Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 1201–1211.

A Study of Morphological Robustness of Neural Machine Translation

Sai Muralidhar Jayanthi,* Adithya Pratapa*


Language Technologies Institute
Carnegie Mellon University
{sjayanth,vpratapa}@cs.cmu.edu

Abstract

In this work, we analyze the robustness of neural machine translation systems towards grammatical perturbations in the source. In particular, we focus on morphological inflection related perturbations. While this has been recently studied for English→French translation (MORPHEUS) (Tan et al., 2020), it is unclear how this extends to Any→English translation systems. We propose MORPHEUS-MULTILINGUAL, which utilizes UniMorph dictionaries to identify morphological perturbations to the source that adversely affect the translation models. Along with an analysis of state-of-the-art pretrained MT systems, we train and analyze systems for 11 language pairs using the multilingual TED corpus (Qi et al., 2018). We also compare this to actual errors of non-native speakers using Grammatical Error Correction datasets. Finally, we present a qualitative and quantitative analysis of the robustness of Any→English translation systems. Code for our work is publicly available (footnote 1).

1 Introduction

Multilingual machine translation is commonplace, with high-quality commercial systems available in over 100 languages (Johnson et al., 2017). However, translation from and into low-resource languages remains a challenge (Arivazhagan et al., 2019). Additionally, translation from morphologically-rich languages to English (and vice-versa) presents new challenges due to the wide differences in the morphosyntactic phenomena of the source and target languages. In this work, we study the effect of noisy inputs to neural machine translation (NMT) systems. A concrete practical application for this is the translation of text from non-native speakers. While the brittleness of NMT systems to input noise is well-studied (Belinkov and Bisk, 2018), most prior work has focused on translation from English (English→X) (Anastasopoulos et al., 2019; Alam and Anastasopoulos, 2020).

With over 800 million second-language (L2) speakers of English, it is imperative that translation models be robust to potential errors in the source English text. A recent work (Tan et al., 2020) has shown that English→X translation systems are not robust to inflectional perturbations in the source. Inspired by this work, we aim to quantify the impact of inflectional perturbations for X→English translation systems. We hypothesize that inflectional perturbations to source tokens should not adversely affect the translation quality. However, morphologically-rich languages tend to have freer word order than English, and small perturbations in the word inflections can lead to significant changes to the overall meaning of the sentence. This is a challenge to our analysis.

We build upon Tan et al. (2020) to induce inflectional perturbations to source tokens using the unimorph_inflect tool (Anastasopoulos and Neubig, 2019) along with UniMorph dictionaries (McCarthy et al., 2020) (§2). We then present a comprehensive evaluation of the robustness of MT systems for languages from different language families (§3). To understand the impact of the size of the parallel corpora available for training, we experiment on a spectrum of high-, medium-, and low-resource languages. Furthermore, to understand the impact in real settings, we run our adversarial perturbation algorithm on learners' text from Grammatical Error Correction datasets for German and Russian (§3.3).

* Equal contribution.
1 https://github.com/murali1996/morpheus_multilingual

2 Methodology

To evaluate the robustness of X→English NMT systems, we generate inflectional perturbations to the tokens in the source language text. In our methodology,
we aim to identify adversarial examples that lead to maximum degradation in translation quality. We build upon the recently proposed MORPHEUS toolkit (Tan et al., 2020), which evaluated the robustness of NMT systems translating from English→X. For a given source English text, MORPHEUS works by greedily looking for inflectional perturbations, sequentially iterating through the tokens in the input text. For each token, it identifies inflectional edits that lead to the maximum drop in BLEU score.

We extend this approach to test X→English translation systems. Since their toolkit (footnote 2) is limited to perturbations in English only, in this work we develop our own inflectional methodology that relies on UniMorph (McCarthy et al., 2020).

2.1 Reinflection

The UniMorph project (footnote 3) provides morphological data for numerous languages under a universal schema. The project supports over 100 languages and provides morphological inflection dictionaries for up to three part-of-speech tags: nouns (N), adjectives (ADJ), and verbs (V). While some UniMorph dictionaries include a large number of types (or paradigms) (German (≈15k), Russian (≈28k)), many dictionaries are relatively small (Turkish (≈3.5k), Estonian (<1k)). This puts a limit on the number of tokens we can perturb via UniMorph dictionary look-up. To overcome this limitation, we use the unimorph_inflect toolkit (footnote 4), which takes as input the lemma and the morphosyntactic description (MSD) and returns a reinflected word form. This tool was trained using UniMorph dictionaries and generalizes to unseen types. An illustration of our inflectional perturbation methodology is given in Table 1.

2.2 MORPHEUS-MULTILINGUAL

Given an input sentence, our proposed method, MORPHEUS-MULTILINGUAL, identifies adversarial inflectional perturbations to the input tokens that lead to maximum degradation in the performance of the machine translation system. We first iterate through the sentence to extract all possible inflectional forms for each of the constituent tokens. Since we are relying on UniMorph dictionaries, we are limited to perturbing only nouns, adjectives, and verbs (footnote 5). Now, to construct a perturbed sentence, we iterate through each token and uniformly sample one inflectional form from the candidate inflections. We repeat this process N (=50) times and compile our pool of perturbed sentences (footnote 6).

To identify the adversarial sentence, we compute the chrF score (Popović, 2017) using the sacrebleu toolkit (Post, 2018) and select the sentence that results in the maximum drop in chrF score (if any). In our preliminary experiments, we found chrF to be more reliable than BLEU (Papineni et al., 2002) for identifying adversarial candidates. While BLEU uses word n-grams to compare the translation output with the reference, chrF uses character n-grams instead, which helps with matching morphological variants of words.

The original MORPHEUS toolkit follows a slightly different algorithm to identify adversaries. Similar to our approach, they first extract all possible inflectional forms for each of the constituent tokens. Then, they sequentially iterate through the tokens in the sentence, and for each token, they select an inflectional form that results in the worst BLEU score. Once an adversarial form is identified, they directly replace the form in the original sentence and continue to the next token. While a similar approach is possible in our setup, we found their algorithm to be computationally expensive, as it prevents efficient batching.

It is important to note that neither MORPHEUS-MULTILINGUAL nor the original MORPHEUS exhaustively searches over all possible sentences, due to memory and time constraints. However, our approach in MORPHEUS-MULTILINGUAL can be efficiently implemented and reduces the inference time by almost a factor of 20. We experiment on 11 different language pairs; therefore, the run time and computational costs are critical to our experiments.

2 https://github.com/salesforce/morpheus
3 https://unimorph.github.io/
4 https://github.com/antonisa/unimorph_inflect
5 Some dictionaries contain fewer POS tags; for example, in German we are restricted to just nouns and verbs.
6 N is a hyperparameter, and in our preliminary experiments we find N = 50 to be sufficiently high to generate many uniquely perturbed sentences while keeping the process computationally tractable.
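The sample-and-select loop just described can be condensed into a short sketch. The candidate-generation interface and the translate callable below are stand-ins introduced for illustration rather than the toolkit's actual API; only the chrF call is a real library function (sacrebleu.sentence_chrf, assuming a recent sacrebleu release).

import random
import sacrebleu  # chrF implementation; assumes a recent sacrebleu release

def select_adversarial(source_tokens, candidate_inflections, reference, translate, n_samples=50):
    """Sketch of the selection step: sample N perturbed source sentences,
    translate each, and keep the one with the largest chrF drop (if any).

    candidate_inflections: dict token_index -> list of alternative inflected forms
        (e.g. obtained from UniMorph / unimorph_inflect); an assumption of this sketch.
    translate: callable str -> str wrapping the MT system under test.
    """
    base_chrf = sacrebleu.sentence_chrf(translate(" ".join(source_tokens)), [reference]).score

    worst_sentence, worst_chrf = None, base_chrf
    for _ in range(n_samples):
        tokens = list(source_tokens)
        for i, alternatives in candidate_inflections.items():
            # uniformly pick one candidate form (or keep the original token)
            tokens[i] = random.choice(alternatives + [source_tokens[i]])
        perturbed = " ".join(tokens)
        chrf = sacrebleu.sentence_chrf(translate(perturbed), [reference]).score
        if chrf < worst_chrf:
            worst_sentence, worst_chrf = perturbed, chrf

    # If no sampled perturbation lowered chrF, fall back to the original sentence.
    return worst_sentence or " ".join(source_tokens), base_chrf - worst_chrf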

PRON VERB PART PUNCT ADV NOUN VERB AUX
Sie wissen nicht , wann Räuber kommen können
you-NOM.3PL knowledge-PRS.3PL not-NEG , when robber-NOM.PL come-NFIN can-PRS.3PL

(*) Sie wissten nicht , wann Räuber kommen können


(*) Sie wissen nicht , wann Räuber kommen könne
(*) Sie wisse nicht , wann Räuber kommen könnte

Table 1: Example inflectional perturbations on a German text.

3 Experiments

In this section, we present a comprehensive evaluation of the robustness of X→English machine translation systems. Since it is natural for NMT models to be more robust when trained on large amounts of parallel data, we experiment with two sets of translation systems. First, we use state-of-the-art pre-trained models for Russian→English and German→English from fairseq (Ott et al., 2019) (footnote 7). Secondly, we use the multilingual TED corpus (Qi et al., 2018) to train transformer-based translation systems from scratch (footnote 8). Using the TED corpus allows us to expand our evaluation to a larger pool of language pairs.

3.1 WMT19 Pretrained Models

We evaluate the robustness of the best-performing systems from the WMT19 news translation shared task (Barrault et al., 2019), specifically for Russian→English and German→English (Ott et al., 2019). We follow the original work and use newstest2018 as our test set for adversarial evaluation. Using the procedure described in §2.2, we create adversarial versions of newstest2018 for both language pairs. In Table 2, we present the baseline and adversarial results using the BLEU and chrF metrics. For both language pairs, we notice significant drops on both metrics. Before diving further into the qualitative analysis of these MT systems, we first present a broader evaluation of MT systems trained on the multilingual TED corpus.

lg Baseline (BLEU, chrF) Adversarial (BLEU, chrF) NR
rus 38.33 0.63 18.50 0.47 0.81
deu 48.40 0.70 33.43 0.59 1.00

Table 2: Baseline and adversarial results on newstest2018 using fairseq's pre-trained models. NR denotes the Target-Source Noise Ratio (2).

3.2 TED Corpus

The multilingual TED corpus (Qi et al., 2018) provides parallel data for over 50 language pairs, but in our experiments we only use a subset of these language pairs. We selected our test language pairs (X→English) to maximize the diversity in language families, as well as the resources available for training MT systems. Since we rely on UniMorph and unimorph_inflect for generating perturbations, we only select languages that have reasonably high accuracy in unimorph_inflect (>80%). Table 3 presents an overview of the chosen source languages, along with information on language family and training resources.

We also quantify the morphological richness of the languages listed in Table 3. As we are not aware of any standard metric to gauge the morphological richness of a language, we use the reinflection dictionaries to define this metric. We compute the morphological richness using the Type-Token Ratio (TTR) as follows,

TTR_lg = N_types(lg) / N_tokens(lg) = N_paradigms(lg) / N_forms(lg)    (1)

In Table 3, we report the TTR_lg scores measured on the UniMorph dictionaries as well as on the UniMorph-style dictionaries constructed from the TED dev splits using the unimorph_inflect tool. Note that TTR_lg, as defined here, differs slightly from the widely known type-token ratio used for measuring the lexical diversity (or richness) of a corpus.

We run MORPHEUS-MULTILINGUAL to generate adversarial sentences for the validation splits of the TED corpus. We term a sentence adversarial if it leads to the maximum drop in chrF score. Note that it is possible to have perturbed sentences that do not lead to any drop in chrF score. In Figure 1, we plot the fraction of perturbed sentences along with the adversarial fraction for each of the source languages. We see considerable perturbations for most languages, with the exception of Swedish, Lithuanian, Ukrainian, and Estonian.

7 Due to resource constraints, we only experiment with the single models and leave the evaluation of ensemble models for future work.
8 For the selected languages, we train an MT model with the 'transformer_iwslt_de_en' architecture from fairseq. We use a sentence-piece vocab size of 8000, and train up to 80 epochs with the Adam optimizer (see A.2 in the Appendix for more details).
epochs with Adam optimizer (see A.2 in Appendix for more languages, with the exception of Swedish, Lithua-
details) nian, Ukrainian, and Estonian.

51
lg     Family      Resource  TTR
heb    Semitic     High      (0.044, 0.191)
rus    Slavic      High      (0.080, 0.107)
tur    Turkic      High      (0.016, 0.048)
deu    Germanic    High      (0.210, 0.321)
ukr    Slavic      High      (0.103, 0.143)
ces    Slavic      High      (0.071, 0.082)
swe    Germanic    Medium    (0.156, 0.281)
lit    Baltic      Medium    (0.051, 0.084)
slv    Slavic      Low       (0.109, 0.087)
kat    Kartvelian  Low       (0.057, ——)
est    Uralic      Low       (0.026, 0.056)

Table 3: List of languages chosen from the multilingual TED corpus. For each language, the table presents the language family, the resource level, and the Type-Token Ratio (TTRlg). We measure the ratio using the types and tokens present in the reinflection dictionaries (UniMorph, lexicon from TED dev).

Figure 1: Perturbation statistics for selected TED languages.

In preparing our adversarial set, we retain the original source sentence if we fail to create any perturbation or if none of the identified perturbations leads to a drop in chrF score. This ensures that the adversarial set has the same number of sentences as the original validation set. In Table 4, we present the baseline and adversarial MT results. We notice a considerable drop in performance for Hebrew, Russian, Turkish and Georgian. As expected, the % drops are correlated with the perturbation statistics from Figure 1.

3.3 Translating Learner's Text

In the previous sections (§3.1, §3.2), we have seen the impact of noisy inputs on MT systems. While these results indicate a need for improving the robustness of MT systems, the adversarial sets constructed above are synthetic. In this section, we evaluate the impact of morphological inflection errors directly on learners' text.

Figure 2: Schematic for preliminary evaluation on learners' language text. This is similar to the methodology used in Anastasopoulos (2019).

To this end, we utilize two grammatical error correction (GEC) datasets: German Falko-MERLIN-GEC (Boyd, 2018) and Russian RULEC-GEC (Rozovskaya and Roth, 2019). Both of these datasets contain labeled error types relating to word morphology. Evaluating robustness on these datasets gives us a better understanding of performance on actual text produced by second language (L2) speakers.

Unfortunately, we do not have gold English translations for the grammatically incorrect (or corrected) text from the GEC datasets. While related prior work (Anastasopoulos et al., 2019) annotated Spanish translations for English GEC data, we are not aware of any prior work that provides gold English translations for grammatically incorrect data in non-English languages. Therefore, we propose a pseudo-evaluation methodology that allows for measuring the robustness of MT systems; a schematic overview is presented in Figure 2. We take the ungrammatical text and use the gold GEC annotations to correct all errors except the morphology-related ones. This yields ungrammatical text that contains only morphology-related errors, similar to the perturbed outputs from MORPHEUS-MULTILINGUAL. Since we do not have gold translations for the input Russian/German sentences, we use the machine translation output of the fully grammatical text as the reference and the translation output of the partially corrected text as the hypothesis. In Table 5, we present the results on both Russian and German learners' text.

Overall, we find that the pre-trained MT models from fairseq are quite robust to noise in learners' text. We manually inspected some examples and found the MT systems to be sufficiently robust to morphological perturbations; the changes in the output translation (if any) are mostly warranted.
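To make the pseudo-evaluation above concrete, the following is a minimal sketch of how f-BLEU and f-chrF could be computed. It assumes sacrebleu as the scorer and a generic translate() callable standing in for the MT system; neither is prescribed here, and the actual evaluation pipeline may differ.

    import sacrebleu  # assumed scorer; any BLEU/chrF implementation would do

    def pseudo_evaluate(fully_corrected, morph_only, translate):
        # fully_corrected: GEC sentences with every error fixed
        # morph_only: the same sentences with only morphology-related errors left in
        # translate: hypothetical callable mapping a list of source sentences to English
        references = translate(fully_corrected)   # stands in for the missing gold translations
        hypotheses = translate(morph_only)         # translations of the noisy learner text
        f_bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
        f_chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
        return f_bleu, f_chrf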

X→English     Code   # train   Baseline BLEU   Baseline chrF   Adversarial BLEU   Adversarial chrF   NR
Hebrew heb 211K 40.06 0.5898 33.94 (-15%) 0.5354 (-9%) 1.56
Russian rus 208K 25.64 0.4784 11.70 (-54%) 0.3475 (-27%) 1.03
Turkish tur 182K 27.77 0.5006 18.90 (-32%) 0.4087 (-18%) 1.43
German deu 168K 34.15 0.5606 31.29 (-8%) 0.5373 (-4%) 1.82
Ukrainian ukr 108K 25.83 0.4726 25.66 (-1%) 0.4702 (-1%) 2.96
Czech ces 103K 29.35 0.5147 26.58 (-9%) 0.4889 (-5%) 2.11
Swedish swe 56K 36.93 0.5654 36.84 (-0%) 0.5646 (-0%) 3.48
Lithuanian lit 41K 18.88 0.3959 18.82 (-0%) 0.3948 (-0%) 3.42
Slovenian slv 19K 11.53 0.3259 10.48 (-9%) 0.3100 (-5%) 3.23
Georgian kat 13K 5.83 0.2462 4.92 (-16%) 0.2146 (-13%) 2.49
Estonian est 10K 6.68 0.2606 6.53 (-2%) 0.2546 (-2%) 4.72

Table 4: Results on multilingual TED corpus (Qi et al., 2018)
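The NR column in Table 4 is the Target-Source Noise Ratio defined in Equation 2 of Section 4. A rough sketch of how such a ratio could be computed (again assuming sacrebleu, which is not mandated here) is:

    import sacrebleu  # assumed scorer

    def noise_ratio(src, src_perturbed, tgt, tgt_perturbed):
        # All arguments are lists of sentences; NR close to 1 means the change in
        # the target is proportional to the change made in the source.
        bleu_t = sacrebleu.corpus_bleu(tgt_perturbed, [tgt]).score  # BLEU(t, t~)
        bleu_s = sacrebleu.corpus_bleu(src_perturbed, [src]).score  # BLEU(s, s~)
        return (100.0 - bleu_t) / (100.0 - bleu_s)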

Dataset        f-BLEU   f-chrF
Russian GEC    85.77    91.56
German GEC     89.60    93.95

Table 5: Translation results on the Russian and German GEC corpora. An oracle (i.e., fully robust) MT system would give a perfect score. We adopt the faux-BLEU terminology from Anastasopoulos (2019): f-BLEU is identical to BLEU, except that it is computed against a pseudo-reference instead of the true reference.

Viewing these results in combination with the results on the TED corpus, we believe that X→English models are robust to morphological perturbations at the source as long as they are trained on a sufficiently large parallel corpus.

4 Analysis

To better understand what makes a given MT system robust to morphology-related grammatical perturbations in the source, we present a thorough analysis of our results and also highlight a few limitations of our adversarial methodology.

Adversarial Dimensions: To quantify the impact of each inflectional perturbation, we perform a fine-grained analysis on the adversarial sentences obtained from the multilingual TED corpus. For each perturbed token in the adversarial sentence, we identify the part-of-speech (POS) and the feature dimension(s) (dim) perturbed in the token. We uniformly distribute the % drop in sentence-level chrF score over each (POS, dim) perturbation in the adversarial sentence. This allows us to quantitatively compare the impact of each perturbation type (POS, dim) on the overall performance of the MT model. Additionally, as seen in Figure 1, not every inflectional perturbation causes a drop in chrF (or BLEU) scores; the adversarial sentences only capture the worst-case drop in chrF. Therefore, to analyze the overall impact of each perturbation type (POS, dim), we also compute the impact score on the entire set of perturbed sentences explored by MORPHEUS-MULTILINGUAL.

Table 8 (in the Appendix) presents the results for all the TED languages. First, the trends for adversarial perturbations are quite similar to those for all explored perturbations. This indicates that the adversarial impact of a perturbation is not determined by the perturbation type (POS, dim) alone but is also lexically dependent.

Evaluation Metrics: In the results presented in §3, we reported performance using the BLEU and chrF metrics (following prior work (Tan et al., 2020)). We noticed significant drops on these metrics, even for high-resource languages like Russian, Turkish and Hebrew, and even for the state-of-the-art fairseq models. To better understand these drops, we inspected the output translations of adversarial source sentences. We found a number of cases where the new translation is semantically valid but both metrics incorrectly score it low (see S2 in Table 6). This is a limitation of using surface-level metrics like BLEU/chrF.

Figure 3: Correlation between Noise Ratio (NR) and # train. The results indicate that the larger the training data, the more robust the models are towards source perturbations (NR≈1).

Figure 4: Correlation between Target-Source Noise Ratio (NR) on TED machine translation and Type-Token Ratio (TTRlg) of the source language (from UniMorph). The results indicate that the morphological richness of the source language doesn't necessarily correlate with NMT robustness.

Additionally, we require the new translation to be as close as possible to the original translation, but this can be a strict requirement on many occasions. For instance, if we changed a noun in the source from its singular to its plural form, it is natural to expect a robust translation system to reflect that change in the output translation. To account for this behavior, we compute the Target-Source Noise Ratio (NR) metric from Anastasopoulos (2019). NR is computed as follows:

NR(s, t, s̃, t̃) = (100 − BLEU(t, t̃)) / (100 − BLEU(s, s̃))        (2)

The ideal NR is ∼1, where a change in the source (s → s̃) results in a proportional change in the target (t → t̃). For the adversarial experiments on the TED corpus, we compute the NR metric for each language pair; the results are presented in Table 4. Interestingly, while Russian sees a major drop in BLEU/chrF score, its noise ratio is close to 1. This indicates that the Russian MT model is actually quite robust to morphological perturbations. Furthermore, in Figure 3, we present a correlation analysis between the size of the parallel corpus available for training and the noise ratio metric. We see a very strong negative correlation, indicating that high-resource MT systems (e.g., heb, rus, tur) are quite robust to inflectional perturbations, in spite of the large drops in BLEU/chrF scores. Additionally, we noticed that the morphological richness of the source language (measured via TTR in Table 3) doesn't play any significant role in MT performance under adversarial settings (e.g., rus, tur vs deu). The scatter plot between TTR and NR for the TED translation task is presented in Figure 4.

Morphological Richness: To analyze the impact of the morphological richness of the source, we look deeper into the Slavic language family. We experimented with four languages within the Slavic family: Czech, Ukrainian, Russian and Slovenian. All except Slovenian are high-resource. These languages differ significantly in their morphological richness (TTR), with TTRces < TTRslv << TTRrus << TTRukr.9 As we have already seen in the above analysis (see Figure 4), morphological richness isn't indicative of the noise ratio (NR), and this also holds for the Slavic languages. We next check whether morphological richness determines the drop in BLEU/chrF scores. In fact, we find that this is also not the case: we see a larger % drop for rus than for slv or ukr. We instead notice that the % drop in BLEU/chrF depends on the % of edits we make to the validation set. The % of edits we were able to make follows the order δrus >> δces > δslv >> δukr (see Figure 1). Thus NR is driven by the size of the training set, while the % drop in BLEU is driven by the % of edits to the validation set. The % of edits in turn depends on the size of the UniMorph dictionaries and not on the morphological richness of the language. Therefore, we conclude that both metrics, the % drop in BLEU/chrF and NR, depend on the resource size (parallel data and UniMorph dictionaries) and not on the morphological richness of the language.

9 TTRlg measured on lexicons from TED dev splits.

Figure 5: Elasticity score for TED languages.

Figure 6: Boxplots for the distribution of # edits per sentence in the adversarial TED validation set.

Semantic Change: In our adversarial attacks, we aim to create an ungrammatical source via inflectional edits and evaluate the robustness of systems to these edits. While these adversarial attacks can help us discover significant biases in the translation systems, they can often lead to unintended consequences. Consider the example Russian sentence S1 (s) from Table 6. The sentence is grammatically correct, with the subject Тренер ('Coach') and the object игрока ('player') in NOM and ACC cases respectively. If we perturb this sentence to A-S1 (s̃), the new words Тренера ('Coach') and игрок ('player') are now in ACC and NOM cases respectively. Due to the case assignment phenomenon in Russian, this perturbation (s → s̃) has essentially swapped the subject and object roles in the Russian sentence. As we can see in the example, the English translation t̃ (A-T1) does in fact correctly capture this change. This indicates that our attacks can sometimes lead to a significant change in the semantics of the source sentence. Handling such cases would require a deeper understanding of each language's grammar, and we leave this for future work.

Elasticity: As we have seen in the discussion on noise ratio, it is natural for MT systems to transfer changes in the source to the target. However, inspired by Anastasopoulos (2019), we wanted to understand how this behavior changes as we increase the number of edits in the source sentence. For this purpose, we first bucket all the explored perturbed sentences based on the number of edits (or perturbations) from the original source. Within each bucket, we compute the fraction of perturbed source sentences that result in the same translation as the original source. We define this fraction as the elasticity score, i.e., whether the translation remains the same under changes in the source. Figure 5 presents the results, and we find the elasticity score dropping quickly to zero as the # of edits increases. Notably, ukr drops quickly to zero, while rus retains a reasonable elasticity score for higher numbers of edits.

Aggressive edits: Our algorithm doesn't put any restrictions on the number of tokens that can be perturbed in a given sentence. This can lead to aggressive edits, especially in languages like Russian that are morphologically rich and for which the reinflection lexicons are sufficiently large. As we illustrate in Figure 6, the median number of edits per sentence for rus is 5, significantly higher than for the next language (tur, at 1). Such aggressive edits in Russian can lead to unrealistic sentences, far from our intended simulation of learners' text. We leave the idea of thresholding the # of edits to future work.

Adversarial Training: In an attempt to improve the robustness of NMT systems against morphological perturbations, we propose training NMT models by augmenting the training data with adversarially perturbed sentences. Due to computational constraints, we evaluate this setting only for slv. We follow the strategy outlined in Section 2 to obtain adversarial perturbations for the TED corpus training data. We observe that the adversarially trained model performs marginally poorer (BLEU 10.30, down from 10.48 when trained without data augmentation). We hypothesize that this could possibly be due to the small training data, and believe that this training setting can better benefit models with already high BLEU scores. We leave extensive evaluation and further analysis of adversarial training to future work.

5 Conclusion

In this work, we propose MORPHEUS-MULTILINGUAL, a tool to analyze the robustness of X→English NMT systems under morphological perturbations. Using this tool, we experiment with 11 different languages selected from diverse language families with varied training resources.

S1 Source (s) Тренер полностью поддержал игрока.
T1 Target (t) The coach fully supported the player.
rus
A-S1 Source (s̃) Тренера полностью поддержал игрок.
A-T1 Target (t̃) The coach was fully supported by the player.
S2 Source (s) Dinosaurier benutzte Tarnung, um seinen Feinden auszuweichen
T2 Target (t) Dinosaur used camouflage to evade its enemies (1.000)
deu
A-S2 Source (s̃) Dinosaurier benutze Tarnung, um seinen Feindes auszuweichen
A-T2 Target (t̃) Dinosaur Use camouflage to dodge his enemy (0.512)
S3 Source (s) У нас вообще телесные наказания не редкость.
T3 Target (t) In general, corporal punishment is not uncommon. (0.885)
rus
A-S3 Source (s̃) У нас вообще телесных наказании не редкостях.
A-T3 Target (t̃) We don’t have corporal punishment at all. (0.405)
S4 Source (s) Вот телесные наказания - спасибо, не надо.
T4 Target (t) That’s corporal punishment - thank you, you don’t have to. (0.458)
rus
A-S4 Source (s̃) Вот телесных наказаний - спасибах, не надо.
A-T4 Target (t̃) That’s why I’m here. (0.047)
S5 Source (s) Die Schießereien haben nicht aufgehört.
T5 Target (t) The shootings have not stopped. (0.852)
deu
A-S5 Source (s̃) Die Schießereien habe nicht aufgehört.
A-T5 Target (t̃) The shootings did not stop, he said. (0.513)
S6 Source (s) Всякое бывает.
T6 Target (t) Anything happens. (0.587)
rus
A-S6 Source (s̃) Всякое будете бывать.
A-T6 Target (t̃) You’ll be everywhere. (0.037)
S7 Source (s)
T7 Target (t) It ’s a real school. (0.821)
kat
A-S7 Source (s̃)
A-T7 Target (t̃) There ’s a man who ’s friend. (0.107)
S8 Source (s) Ning meie laste tuleviku varastamine saab ühel päeval kuriteoks.
T8 Target (t) And our children ’s going to be the future of our own day. (0.446)
est
A-S8 Source (s̃) Ning meie laptegs tuleviku varastamine saab ühel päeval kuriteoks.
A-T8 Target (t̃) And our future is about the future of the future. (0.227)
S9 Source (s) Nad pagevad üle piiride nagu see.
T9 Target (t) They like that overdights like this. (0.318)
est
A-S9 Source (s̃) Nad pagevad üle piirete nagu see.
A-T9 Target (t̃) They dress it as well as well. (0.141)
S10 Source (s) Мой дедушка был необычайным человеком того времени.
T10 Target (t) My grandfather was an extraordinary man at that time. (0.802)
rus
A-S10 Source (s̃) Мой дедушка будё необычайна человеков того времи.
A-T10 Target (t̃) My grandfather is incredibly harmful. (0.335)

Table 6: Qualitative analysis. (1) semantic change, (2) issues with evaluation metrics, (3,4,5,6,7,10) good examples
for attacks, (8) poor attacks, (9) poor translation quality (s → t)

We evaluate NMT models trained on the TED corpus as well as pretrained models readily available as part of the fairseq library. We observe a wide range of 0–50% drops in performance under the adversarial setting. We further supplement our experiments with an analysis on the GEC learners' corpora for Russian and German. We qualitatively and quantitatively analyze the perturbations created by our methodology and present its strengths as well as limitations, outlining some avenues for future research towards building more robust NMT systems.

References

Md Mahfuz Ibn Alam and Antonios Anastasopoulos. 2020. Fine-tuning MT systems for robustness to second-language speaker variations. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 149–158, Online. Association for Computational Linguistics.

Antonios Anastasopoulos. 2019. An analysis of source-side grammatical errors in NMT. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 213–223, Florence, Italy. Association for Computational Linguistics.

Antonios Anastasopoulos, Alison Lui, Toan Q. Nguyen, and David Chiang. 2019. Neural machine translation of text from non-native speakers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3070–3080, Minneapolis, Minnesota. Association for Computational Linguistics.

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 984–996, Hong Kong, China. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.

Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.

A Appendix

A.1 UniMorph Example

An example from the German UniMorph dictionary is presented in Table 7.

Paradigm             Form                     MSD
abspielen ('play')   abgespielt ('played')    V.PTCP;PST
abspielen ('play')   abspielend ('playing')   V.PTCP;PRS
abspielen ('play')   abspielen ('play')       V;NFIN

Table 7: Example inflections for the German verb abspielen ('play') from the UniMorph dictionary.

A.2 MT training

For all the languages in the TED corpus, we train Any→English models using the fairseq toolkit. Specifically, we use the 'transformer_iwslt_de_en' architecture, and train the model using the Adam optimizer. We use an inverse square root learning rate scheduler with 4000 warm-up update steps. In the linear warm-up phase, we use an initial learning rate of 1e-7 until reaching a configured rate of 2e-4. We use the cross entropy criterion with label smoothing of 0.1.
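As a rough illustration of the schedule just described (linear warm-up from 1e-7 to 2e-4 over 4000 updates, then inverse-square-root decay), the following Python sketch mirrors the behaviour of fairseq's inverse_sqrt scheduler; it is not the toolkit's actual implementation.

    def inverse_sqrt_lr(step, warmup_updates=4000, warmup_init_lr=1e-7, peak_lr=2e-4):
        # Linear warm-up from warmup_init_lr to peak_lr over warmup_updates steps,
        # then decay proportional to the inverse square root of the update number.
        if step < warmup_updates:
            return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
        return peak_lr * (warmup_updates ** 0.5) / (step ** 0.5)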

A.3 Dimension Analysis

Dimension ces deu est heb kat lit rus slv swe tur ukr
ADJ.Animacy - - - - - - 3.51(0.89) - - - -
ADJ.Case 4.31(0.81) - - - 10.67(2.59) - 4.78(0.91) - 5.05(5.05) - 6.04(1.10)
ADJ.Comparison - - - - - 7.99(0.46) - - - - -
ADJ.Gender 3.83(0.78) - - - - 6.81(-1.35) 5.30(1.00) - - - -
ADJ.Number 4.07(0.78) - - - 13.90(1.52) 6.31(-2.26) 4.67(0.94) - 5.05(5.05) 7.92(2.23) 6.25(1.29)
ADJ.Person - - - - - - - - - 8.89(2.43) -
N.Animacy - - - - - - 6.53(1.19) - - - -
N.Case 6.94(0.81) 6.39(1.26) 12.35(1.50) - 15.38(0.98) - 6.65(1.20) - 4.29(1.05) 14.39(2.37) 10.28(7.66)
N.Definiteness - - - - - - - - 8.36(1.61) - -
N.Number 5.44(0.77) 5.70(1.27) 8.10(1.33) 16.22(5.92) 14.46(0.66) - 6.12(1.22) - 4.30(1.52) 13.08(2.31) 21.20(15.96)
N.Possession - - - 12.63(4.31) - - - - - - -
V.Aspect - - - - 14.17(-0.38) - - - - - -
V.Gender - - - - - - 6.52(1.51) - - - -
V.Mood 13.17(2.78) 15.89(2.77) - - 11.11(0.58) - - 21.49(3.73) - - -
V.Number 8.23(2.72) 32.86(8.12) - 13.78(4.60) 9.02(1.33) - 6.23(1.44) 21.47(-9.47) - - -
V.Person 6.58(2.69) 6.22(1.50) - 10.86(4.99) 12.37(1.33) - 6.10(1.29) - - - -
V.Tense - - - 17.52(7.13) 13.09(1.05) - 6.59(1.61) - - - -
V.CVB.Tense - - - - - - 6.70(0.87) - - - 9.09(2.62)
V.MSDR.Aspect - - - - 14.39(4.68) - - - - - -
V.PTCP.Gender 10.28(2.75) - - - - - - - - - -
V.PTCP.Number 9.31(2.51) - - - - - - - - - -

Table 8: Fine-grained analysis of X→English translation performance w.r.t. the perturbation type (POS, morphological feature dimension). The numbers reported in this table indicate the average % drop in sentence-level chrF for an adversarial perturbation on a token with the given POS along the given dimension (dim). The numbers in parentheses indicate the average % drop over all tested perturbations, including the adversarial ones.
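One way to realize the aggregation behind Table 8 (uniformly distributing each sentence's % chrF drop across its perturbed (POS, dim) pairs and then averaging per pair) is sketched below; the data layout is an assumption for illustration, not the paper's actual code.

    from collections import defaultdict

    def dimension_impact(sentences):
        # sentences: iterable of (pct_chrf_drop, [(pos, dim), ...]) pairs, one per
        # perturbed sentence; the sentence-level % drop is split uniformly over the
        # (POS, dim) perturbations it contains, then averaged per (POS, dim) pair.
        totals, counts = defaultdict(float), defaultdict(int)
        for pct_drop, perturbations in sentences:
            if not perturbations:
                continue
            share = pct_drop / len(perturbations)
            for key in perturbations:
                totals[key] += share
                counts[key] += 1
        return {key: totals[key] / counts[key] for key in totals}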

Sample-efficient Linguistic Generalizations through Program Synthesis:
Experiments with Phonology Problems
Saujas Vaduguru1   Aalok Sathe2   Monojit Choudhury3   Dipti Misra Sharma1
1 IIIT Hyderabad   2 MIT BCS∗   3 Microsoft Research India
1 {saujas.vaduguru@research.,dipti@}iiit.ac.in
2 aalok.sathe@{mit.edu, richmond.edu}
3 [email protected]

Abstract

Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.

to V         to be Ved
mappasuN     dipasuN
mattunu      ditunu
?            ditimbe
?            dipande

Table 1: Verb forms in Mandar (McCoy, 2018)

1 Introduction

In the last few years, the application of deep neural models has allowed rapid progress in NLP. Tasks in phonology and morphology have been no exception to this, with neural encoder-decoder models achieving strong results in recent shared tasks in phonology (Gorman et al., 2020) and morphology (Vylomova et al., 2020). However, the neural models that perform well on these tasks make use of hundreds, if not thousands, of training examples for each language. Additionally, the patterns that neural models identify are not interpretable. In this paper, we explore the problem of learning interpretable phonological and morphological rules from only a small number of examples, a task that humans are able to perform.

Consider the example of verb forms in the language Mandar presented in Table 1. How would a neural model tasked with filling the two blank cells do? The data comes from a language that is not represented in large-scale text datasets that could allow the model to harness pretraining, and the number of samples presented here is likely not sufficient for the neural model to learn the task.

However, a human would fare much better at this task even if they didn't know Mandar. Identifying rules and patterns in a different language is a principal concern of a descriptive linguist (Brown and Ogilvie, 2010). Even people who aren't trained in linguistics would be able to solve such a task, as evidenced by contestants in the Linguistics Olympiads1, and general-audience puzzle books (Bellos, 2020). In addition to being able to solve the task, humans would be able to express their solution explicitly in terms of rules, that is to say, a program that maps inputs to outputs.

Program synthesis (Gulwani et al., 2017) is a method that can be used to learn programs that map an input to an output in a domain-specific language (DSL). It has been shown to be a highly sample-efficient technique to learn interpretable rules by specifying the assumptions of the task in the DSL (Gulwani, 2011).

This raises the questions: (i) Can program synthesis be used to learn linguistic rules from only a few examples? (ii) If so, what kind of rules can be learnt? (iii) What kind of operations need to be explicitly defined in the DSL to allow it to model linguistic rules? (iv) What knowledge must be implicitly provided with these operations to allow the model to choose rules that generalize well?

∗ Work done while at the University of Richmond
1 https://www.ioling.org/

Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 60–71
August 5, 2021. ©2021 Association for Computational Linguistics
In this work, we use program synthesis to learn phonological rules for solving Linguistics Olympiad problems, where only the minimal number of examples necessary to generalize is given (Şahin et al., 2020). We present a program synthesis model and a DSL for learning phonological rules, and curate a set of Linguistics Olympiad problems for evaluation.

We perform experiments and comparisons to baselines, and find that program synthesis does significantly better than our baseline approaches. We also present some observations about the ability of our system to find rules that generalize well, and discuss examples of where it fails.

2 Program synthesis

Program synthesis is "the task of automatically finding programs from the underlying programming language that satisfy (user) intent expressed in some form of constraints" (Gulwani et al., 2017). This method allows us to specify domain-specific assumptions as a language, and use generic synthesis approaches like FlashMeta (Polozov and Gulwani, 2015) to synthesize programs.

The ability to explicitly encode domain-specific assumptions gives program synthesis broad applicability to various tasks. In this paper, we explore applying it to the task of learning phonological rules. Whereas previous work on rule-learning has focused on learning rules of a specific type (Brill, 1992; Johnson, 1984), the DSL in program synthesis allows learning rules of different types, and in different rule formalisms.

In this work, we explore learning rules similar to rewrite rules (Chomsky and Halle, 1968) that are used extensively to describe phonology. Sequences of rules are learnt using a noisy disjunctive synthesis algorithm, NDSyn (Iyer et al., 2019), extended to learn stateful multi-pass rules (Sarthi et al., 2021).

2.1 Phonological rules as programs

The synthesis task we solve is to learn a program in a domain-specific language (DSL) for string transduction, that is, to transform a given sequence of input tokens i ∈ I∗ into a sequence of output tokens o ∈ O∗, where I is the set of input tokens and O is the set of output tokens. Each token is a symbol accompanied by a feature set, a set of key-value pairs that maps feature names to boolean values.

We learn programs for token-level examples, which transform an input token in its context to output tokens. The program is a sequence of rules which are applied to each token in an input string to produce the output string. The rules learnt are similar to rewrite rules, of the form

φ−l · · · φ−2 φ−1 X φ1 φ2 · · · φr → T

where (i) X : I → B is a boolean predicate that determines the input tokens to which the rule is applied, (ii) φi : I → B is a boolean predicate applied to the ith character relative to X, and the predicates φ collectively determine the context in which the rule is applied, and (iii) T : I → O∗ is a function that maps an input token to a sequence of output tokens.

X and φ belong to a set of predicates P, and T is a function belonging to a set of transformation functions T. P and T are specified by the DSL. We allow the model to synthesize programs that apply multiple rules to a single token by synthesizing rules in passes and maintaining state from one pass to the next. This allows the system to learn stateful multi-pass rules (Sarthi et al., 2021).

2.2 Domain-specific language

The domain-specific language (DSL) is the declarative language which defines the allowable string transformation operations. The DSL is defined by a set of operators, a grammar which determines how they can be combined, and a semantics which determines what each operator does. By defining operators to capture domain-specific phenomena, we can reduce the space of programs to be searched to include those programs that capture distinctions relevant to the domain. This allows us to explicitly encode knowledge of the domain into the system.

Operators in the DSL also have an associated score that allows for setting domain-specific preferences for certain kinds of programs. We can combine the scores of the operators in a program to compute a ranking score that we can use to identify the most preferred program among candidates. The ranking score can capture implicit preferences like shorter programs, more/less general programs, certain classes of transformations, etc.

The DSL defines the predicates P and the set of transformations T that can be applied to a particular token. The predicates and transformations in the DSL we use, along with a description of their semantics, can be found in Tables 2 and 3.
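To make the rule formalism of Section 2.1 concrete, here is a minimal Python sketch (not the PROSE/NDSyn implementation used in this work) of how a single context-conditioned rule φ−1 X φ1 → T can be applied over a token sequence. The helper names and the voiced-final-insertion example are illustrative assumptions that only loosely mirror the operators in Tables 2 and 3.

    def is_token(symbol):
        return lambda tok: tok is not None and tok[0] == symbol

    def has_feature(name):
        return lambda tok: tok is not None and tok[1].get(name, False)

    def apply_rule(tokens, context, transform):
        # tokens: list of (symbol, feature-dict) pairs
        # context: dict mapping offset -> predicate; offset 0 is the focus token X
        # transform: maps the focus token to a list of output tokens (T)
        out = []
        for i, tok in enumerate(tokens):
            def at(offset):
                j = i + offset
                return tokens[j] if 0 <= j < len(tokens) else None
            if all(pred(at(offset)) for offset, pred in context.items()):
                out.extend(transform(tok))
            else:
                out.append(tok)
        return out

    # Example: word-finally ("$" marks the boundary), insert "z" after a voiced
    # token, roughly the English-plural-style rule sketched in Figure 2.
    word = [("d", {"voice": True}), ("O", {"voice": True}),
            ("g", {"voice": True}), ("$", {})]
    rule_context = {0: has_feature("voice"), 1: is_token("$")}
    insert_z = lambda tok: [tok, ("z", {"voice": True})]
    print(apply_rule(word, rule_context, insert_z))  # token symbols: d O g z $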

Predicate
IsToken(w, s, i) Is x equal to the token s? This allows us to evaluate matches with specific
tokens.
Is(w, f, i) Is f true for x? This allows us to generalize beyond single tokens and use
features that apply to multiple tokens.
TransformationApplied(w, t, i) Has the transformation t been applied to x in a previous pass? This
allows us to reference previous passes in learning rules for the current pass.
Not(p) Negates the predicate p.

Table 2: Predicates that are used for synthesis. The predicates are applied to a token x that is at an offset i from
the current token in the word w. The offset may be positive to refer to tokens after the current token, zero to refer
to the current token, or negative to refer to tokens before the current token.

Transformation
ReplaceBy(x, s1 , s2 ) If x is s1 , it is replaced with s2 . This allows the system to learn conditional
substitutions.
ReplaceAnyBy(x, s) x is replaced with s. This allows the system to learn unconditional substitutions.
Insert(x, S) This inserts a sequence of tokens S after x at the end of the pass. It allows for the
insertion of variable-length affixes.
Delete(x) This deletes x from the word at the end of the pass.
CopyReplace(x, i) These are analogues of the ReplaceBy and Insert transformations where the
CopyInsert(x, i) token which is added is the same as the token at an offset i from x. They allow
the system to learn phonological transformations such as assimilation and
gemination.
Identity(x) This returns x unchanged. It allows the system to handle cases where a transformation applies
under certain conditions but not under others.

Table 3: Transformations that are used for synthesis. The transformations are applied to a token x in the word w.
The offset i for the Copy transformations may be positive to refer to tokens after the current token, zero to refer to
the current token, or negative to refer to tokens before the current token.

output      := Map ( disjunction , input_tokens )
disjunction := Else ( rule , disjunction )
rule        := transformation
             | IfThen ( predicate , rule );

Figure 1: IfThen-Else statements in the DSL

Sequences of rules are learnt as disjunctions of IfThen operators, and are applied to each token of the input using a Map operator (Figure 1). The conjunction of predicates X and φ that define the context is learnt by nesting IfThen operators.

A transformation produces a token that is tagged with the transformation that is applied. This allows for maintaining state across passes.

The operators in our DSL are quite generic and can be applied to other string transformations as well. In addition to designing our DSL for string transformation tasks, we allow for phonological information to be specified as features, which are a set of key-value pairs that map attributes to boolean values. While we restrict our investigation to features based only on the symbols in the input, more complex features based on meaning and linguistic categories can be provided to a system that works on learning rules for more complex domains like morphology or syntax. We leave this investigation for future work.

2.3 Synthesis algorithm

We use an extension (Sarthi et al., 2021) of the NDSyn algorithm (Iyer et al., 2019) that can synthesize stateful multi-pass rules. Iyer et al. (2019) describe an algorithm for selecting disjunctions of rules, and use the FlashMeta algorithm as the rule synthesis component. Sarthi et al. (2021) extend the approach proposed by Iyer et al. (2019) for disjunctive synthesis to the task of grapheme-to-phoneme (G2P) conversion in Hindi and Tamil. They propose the idea of learning transformations on token-aligned examples, and use language-specific predicates and transformations to learn rules for G2P conversion. We use a similar approach, and use a different set of predicates and

multi-pass

Candidates

Token-level #1. IfThen ( IsToken (w ," $ " ,1) ,


examples IfThen ( Is (w ," voice " ,0) ,
Insert (x ," z ")))
k→k #2. IfThen ( IsToken (w ," t " ,0) , Rules
æ→æ IfThen ( IsToken (w ," $ " ,1) ,
Words align Insert (x ," s "))) NDSyn Else (#1 ,
Input t→t FM #3. IfThen ( IsToken (w ," $ " ,1) , Else (#3 ,
kæt → kæts
examples →s Insert (x ," s ")) , Else (#5)
dOg → dOgz #4. IfThen ( IsToken (w ," g " ,0) , )
d→d
IfThen ( IsToken (w ," $ " ,1) , )
O→O Insert (x ," z ")))
g→g #5. IfThen (
→z Not ( IsToken (w ," $ " ,1)) ,
Identity ( x )) Program

Figure 2: An illustration of the synthesis algorithm. FM is FlashMeta, which synthesizes rules which are com-
bined into a disjunction of rules by NDSyn. Here, rule #1 is chosen over #4 since it uses the more general concept
of the voice feature as opposed to a specific token, and thus has a higher ranking score.

transformations that are language-agnostic. Fig- ples, and those that are not solved are passed as the
ure 2 sketches the working of the algorithm. set of examples to NDSyn in the next pass. This
The NDSyn algorithm is an algorithm for learn- proceeds until all the examples are solved, or for a
ing disjunctions of rules, of the form shown in maximum number of passes.
Figure 1. Given a set of examples, it first gen-
erates a set of candidate rules using the Flash- 3 Dataset
Meta synthesis algorithm (Polozov and Gulwani,
To test the ability of our program synthesis system
2015). This algorithm searches for a program in the
to learn linguistic rules from only a few examples,
DSL that satisfies a set of examples by recursively
we require a task with a small number of training
breaking down the search problem into smaller sub-
examples, and a number of test examples which
problems. Given an operator, and the input-output
measure how well the model generalises to unseen
constraints it must satisfy, it infers constraints on
data. Additionally, to ensure a fair evaluation, the
each of the arguments to the operator, allowing it to
test examples should be chosen such that the sam-
recursively search for programs that satisfy these
ples in the training data provide sufficient evidence
constraints on each of the arguments. For exam-
to correctly solve the test examples.
ple, given the Is predicate and a set of examples
To this end, we use problems from the Linguis-
where the predicate is true or false, the algorithm
tics Olympiad. The Linguistics Olympiad is an
infers constraints on the arguments the token s and
umbrella term describing contests for high school
offset i such that the set of examples is satisfied.
students across the globe. Students are tasked with
The working of FlashMeta is illustrated with an
solving linguistics problems—a genre of composi-
example in Figure 3. We use the implementation of
tion that presents linguistic facts and phenomena
the FlashMeta algorithm available as part of the
in enigmatic form (Derzhanski and Payne, 2010).
PROSE 2 framework.
These problems typically have 2 parts: the data
From the set of candidate rules, NDSyn selects
and the assignments.
a subset of rules with a high ranking score that
The data consists of examples where the solver is
correctly answers the most examples as well incor-
presented with the application rules to some linguis-
rectly answers the least3 . Additional details about
tic forms (words, phrases, sentences) and the forms
the algorithm are provided in Appendix A.
derived by applying the rules to these forms. The
The synthesis of multi-pass rules proceeds in
data typically consists of 20-50 forms, the minimal
passes. In each pass, a set of token-aligned exam-
number of examples required to infer the correct
ples is provided as input to the NDSyn algorithm.
rules is presented (Şahin et al., 2020).
The resulting rules are then applied to all the exam-
The assignments provide other linguistic forms,
2
https://www.microsoft.com/en-us/research/ and the solver is tasked with applying the rules
group/prose/
3
A rule will not produce any answer to examples that don’t inferred from the data to these forms. The forms
satisfy the context constraints of the rule. in the assignments are carefully selected by the

IfThen ( IsToken (w ," a " , -1) ,
ReplaceBy (x ," b " ," d "))
abc → d IfThen ( IsToken (w ," b " ,0) ,
ReplaceAnyBy (x ," d "))
IfThen ( IsToken (w ," c " ,1) ,
ReplaceAnyBy (x ," d "))
Inverse SemanticsIfThen
predicate
rule
abc → True
Search for rule ReplaceBy (x ," b " ," d ")
ReplaceAnyBy (x ," d ")
Inverse SemanticsIsToken
token offset
abc → a abc → −1
IsToken (w ," a " , -1)
abc → b abc → 0 IsToken (w ," b " ,0)
IsToken (w ," c " ,1)
abc → c abc → 1

Figure 3: An illustration of the search performed by the FlashMeta algorithm. The blue boxes show the spec-
ification that an operator must satisfy in terms of input-output examples, with the input token underlined in the
context of the word. The Inverse Semantics of an operator is a function that is used to infer the specification for
each argument of the operator based on the semantics of the operator. This may be a single specification (as for
predicate) or a disjunction of specifications (as for token and offset). The algorithm then recursively searches for
programs to satisfy the specification for each argument, and combines the results of the search to obtain a program.
The search for the rule in an IfThen statement proceeds similarly to the search for a predicate. Examples of pro-
grams that are inferred from a specification are indicated with =⇒ . A dashed line between inferred specifications
indicates that the specifications are inferred jointly.

designer to test whether the solver has correctly form of a language and the corresponding phono-
inferred the rules, including making generalizations logical form (Table 4c) (4) marking the phonolog-
to unseen data. This allows us to see how much of ical stress on a given word (Table 4d). We refer
the intended solution has been learnt by the solver to each of these categories of problems as mor-
by examining responses to the assignments. phophonology, multilingual, transliteration, and
stress respectively. We further describe the dataset
The small number of training examples (data)
in Appendix B4 .
tests the generalization ability and sample effi-
ciency of the system, and presents a challenging 3.1 Structure of the problems
learning problem for the system. The careful se-
lection of test examples (assignment) lets us use Each problem is presented in the form of a matrix
them to measure how well the model learns these M . Each row of the matrix contains data pertaining
generalizations. to a single word/linguistic form, and each column
contains the same form of different words, i.e.,
We present a dataset of 34 linguistics problems, an inflectional or derivational paradigm, the word
collected from various publicly accessible sources. form in a particular language, the word in a partic-
These problems are based on phonology, and some ular script, or the stress values for each phoneme in
aspects of the morphology of languages, as well a word. A test sample in this case is presented as a
as the orthographic properties of languages. These particular cell Mij in the table that has to be filled.
problems are chosen such that the underlying rules The model has to use the data from other words in
depend only on the given word forms, and not the same row (Mi: ) and the words in the column
on inherent properties of the word like grammat- (M:j ) to predict the form of the word in Mij .
ical gender or animacy. The problems involve In addition to the data in the table, each prob-
(1) inferring phonological rules in morphological lem contains some additional information about the
inflection (Table 4a) (2) inferring phonological symbols used to represent the words. This addi-
changes between multiple related languages (Ta-
ble 4b) (3) converting between the orthographic 4
The dataset is available here.

base form negative form Turkish Tatar Listuguj Pronunciation Aleut Stress
joy kas joya:ya’ bandIr mandIr g’p’ta’q g@b@da:x tatul 01000
bi:law kas bika’law yelken cilkän epsaqtejg epsaxteck n@tG@lqin 000010000
tipoysu:da ? ? osta emtoqwatg ? sawat ?
? kas wurula:la’ bilezik ? ? @mtesk@m qalpuqal 00001000

(a) Movima negation (b) Turkish and Tatar (c) Micmac orthography (d) Aleut stress

Table 4: A few examples from different types of Linguistics Olympiad problems. ‘?’ represents a cell in the table
that is part of the test set.

tional information is meant to aid the solver under- train a model for each pair of columns in a problem.
stand the meaning of a symbol they may not have For each test example Mij , we find the column with
seen before. We manually encode this information the smallest index j 0 such that Mij 0 is non-empty
in the feature set associated with each token for and use Mij 0 as the source string to infer Mij .
synthesis. Where applicable, we also add conso- Additional details of baselines are provided in
nant/vowel distinctions in the given features, since Appendix C.
this is a basic distinction assumed in the solutions
to many Olympiad problems. 4.2 Program synthesis experiments
We use the assignments that accompany every As discussed in Section 3.1, the examples in a prob-
problem as the test set, ensuring that the correct lem are in a matrix, and we synthesize programs
answer can be inferred based on the given data. to transform entries in one column to entries in
another. Given a problem matrix M , we refer to
3.2 Dataset statistics a program to transform an entry in column i to
The dataset we present is highly multilingual. The an entry in column j as M:i → M:j . To obtain
34 problems contain samples from 38 languages, token-level examples, we use the Smith-Waterman
drawn from across 19 language families. There alignment algorithm (Smith et al., 1981), which
are 15 morphophonology problems, 7 multilingual favours contiguous sequences in aligned strings.
problems, 6 stress, and 6 transliteration problems. We train three variants of our synthesis system
The set contains 1452 training words with an aver- with different scores for the Is and IsToken op-
age of 43 words per problem, and 319 test words erators. The first one, N O F EATURE, does not use
with an average of 9 per problem. Each problem features, or the Is predicate. The second one, T O -
has a matrix that has between 7 and 43 rows, with KEN, assigns a higher score to IsToken and prefers
an average of 23. The number of columns ranges more specific rules that reference tokens. The third
from 2 to 6, with most problems having 2. one, F EATURE, assigns a higher score to Is and
prefers more general rules that reference features
4 Experiments instead of tokens. All other aspects of the model
remain the same across variants.
4.1 Baselines
Morphophonology and multilingual problems:
Given that we model our task as string transduc- For every pair of columns (s, t) in the problem
tion, we compare with the following transduction matrix M , we synthesize the program M:s → M:t .
models used as baselines in shared tasks on G2P To predict the form of a test sample Mij , we find
conversion (Gorman et al., 2020) and morphologi- a column k such that the program M:k → M:j has
cal reinflection (Vylomova et al., 2020). the best ranking score, and evaluate it on Mik .
Neural: We use LSTM-based sequence-to- Transliteration problems: Given a problem ma-
sequence models with attention as well as Trans- trix M , we construct a new matrix M 0 for each pair
former models as implemented by Wu (2020). For of columns (s, t) such that all entries in M 0 are in
each problem, we train a single neural model that the same script. We align word pairs (Mis , Mit )
takes the source and target column numbers, and using the Phonetisaurus many-to-many alignment
the source word, and predicts the target word. tool (Jiampojamarn et al., 2007), and build a sim-
WFST: We use models similar to the pair n-gram ple mapping f for each source token to the target
models (Novak et al., 2016), with the implementa- token with which it is most frequently aligned. We
tion similar to that used by Lee et al. (2020). We 0 by applying f to each token of M and
fill in Mis is

All Morphophonology Multilingual Transliteration Stress
Model

E XACT CHR F E XACT CHR F E XACT CHR F E XACT CHR F E XACT


N O F EATURE 26.8% 0.64 30.1% 0.72 42.1% 0.59 12.0% 0.51 15.4%
T OKEN 32.7% 0.63 37.5% 0.68 45.3% 0.60 16.4% 0.52 22.2%
F EATURE 30.9% 0.51 38.6% 0.56 39.9% 0.42 9.5% 0.49 23.0%
LSTM 8.2% 0.44 9.2% 0.49 5.7% 0.45 2.1% 0.31 15.0%
Transformer 5.4% 0.42 2.3% 0.39 9.2% 0.50 1.7% 0.42 12.6%
WFST 20.9% 0.56 16.3% 0.47 38.7% 0.63 29.7% 0.71 2.8%

Table 5: Metrics for all problems, and for problems of each type. The CHR F score for stress problems is not
calculated, and not used to determine the overall CHR F score.

Mit0 = Mit . We then find a program M:s0 → M:t0 . explicit knowledge in the DSL and implicit knowl-
Stress problems: For these problems, we do not edge provided as the ranking score to generalize.
perform any alignment, since the training pairs are We then consider specific examples of problems,
already token aligned. The synthesis system learns and show examples of where our models succeed
to transform the source string to the sequence of and fail in learning different types of patterns.
stress values.
Model 100% ≥ 75% ≥ 50%
4.3 Metrics
N O F EATURE 3 5 7
We calculate two metrics: exact match accuracy, T OKEN 3 6 10
F EATURE 3 6 11
and CHR F score (Popović, 2015). The exact match WFST 1 2 7
accuracy measures the fraction of examples the
synthesis system gets fully correct. Table 6: Number of problems where the model
achieves different thresholds of the E XACT score.
E XACT = #{correctly predicted test samples} / #{test samples}
The CHR F score is calculated only at the token 5.1 Features aid generalization
level, and measures the n-gram overlaps between Since the test examples are chosen to test specific
the predicted answer and the true answer, and al- rules, solving more test examples correctly is in-
lows us to measure partially correct answers. We dicative of the number of rules inferred correctly.
do not calculate the CHR F score for stress problems In Table 6, we see that providing the model with
as n-gram overlap is not a meaningful measure of features allows it to infer more general rules, solv-
performance for these problems. ing a greater fraction of more problems. We see
4.4 Results that allowing the model to use features increases
its performance, and having it prefer more general
Table 5 summarizes the results of our experiments. rules involving features lets it do even better.
We report the average of each metric across prob-
lems for all problems and by category. 5.2 Correct programs are short
We find that neural models that don’t have spe- In Figure 4 we see that the number of rules in a
cific inductive biases for the kind of tasks we problem5 tends to be higher when the model gets
present here are not able to perform well with this the problem wrong, than when it gets it right. This
amount of data. The synthesis models do better indicates that when the model finds many specific
than the WFST baseline overall, and on all types rules, it overfits to the training data, and fails to
of problems except transliteration. This could be generalize well. This holds true for all the variants,
due to the simple map computed from alignments as seen in the downward slope of the lines.
before program synthesis causing errors that the We also find that allowing and encouraging a
rule learning process cannot correct. model to use features leads to shorter programs.
5 Analysis The average length of a program synthesized by
5
To account for some problems having more columns than
We examine two aspects of the program synthesis others (and hence more rules), we find the average number of
models we propose. The first is the way it uses the rules for each pair of columns.

66
rules based on features, and instead chooses rules
specific to each initial character in the root.
Since the DSL allows for substituting one token
with one other, or inserting multiple tokens, the
system has to use multiple rules to substitute one
token with multiple tokens. In the case of Mandar,
we see one way it does this, by performing multiple
substitutions (to transform di- to mas- it replaces d
and i with a and s respectively, and then inserts m).

5.4 Multi-pass rules


In a problem on Somali verb forms (Somers, 2016),
Figure 4: Number of rules plotted against E XACT score we see a different way of handling multi-token
substitutions by using multi-pass rules to create a
complex rule using simpler elements. The problem
N O F EATURES is 30.5 rules, while it is 25.8 for requires being able to convert verbs from 1st person
T OKEN, and 20.7 for F EATURE. This suggests that to 3rd person singular. The solution includes a rule
explicit access to features, and implicit preference where a single token (l) is replaced with (sh). The
for them leads to fewer, more general rules. learned program uses two passes to capture this
rule through sequential application of two rules:
5.3 Using features
first ReplaceBy(x, "l", "h"), followed by
Some problems provide additional information
IfThen ( TransformationApplied (w ,
about certain sounds. For example, a prob- "{ ReplaceBy , h }" , 1) ,
lem based on the alternation of retroflexes in Insert (x , " s "))
Warlpiri words (Laughren, 2011) explicitly identi-
fies retroflex sounds in the problem statement. In 5.5 Selecting spans of the input
this case, a program produced by our F EATURE
In a problem involving reduplication in Tarangan
system is able to use these features, and isolate the
(Warner, 2019), all variants fail to capture any syn-
focus of the problem by learning rules such as
thesis rules. Reduplication in Tarangan involves
IfThen ( Not ( Is (w , " retroflex ", 0)) , copying one or two syllables in the source word
Identity (x ))
to produce the target word. However, the DSL we
The system learns a concise solution, and is able use does not have any predicates or transformations
to generalize using features rather than learning that allow the system to reference a span of mul-
separate rules for individual sounds. tiple tokens (which would form a syllable) in the
In the case of inflecting a Mandar verb (McCoy, input. Therefore, it fails to model reduplication.
2018), the F EATURE system uses a feature to find
a more general rule than is the case. To capture the 5.6 Global constraints
rule that the prefix di- changes to mas- when the Since we provide the synthesis model with token-
root starts with s, the model synthesizes level examples, it does not have access to word-
IfThen ( Is (w , " fricative ", 1) , level information. This results in poor performance
ReplaceBy (x , "i", "s ")) on stress problems, as stress depends on the entire
However, since s is the only fricative in the data, word. Consider the example of Chickasaw stress
this rule is equivalent to a rule specific to s. This (Vaduguru, 2019). It correctly learns the rule
rule also covers examples where the root starts with IfThen ( Is (w , " long ", 0) ,
s, and causes the model to miss the more general ReplaceAnyBy (x , "1"))
rule of a voiceless sound at the beginning of the root that stresses any long vowel in the word. How-
to be copied to the end of the prefix. It identifies ever, since it cannot check if the word has a long
this rule only for roots starting with p as vowel that has already been stressed, it is not able
IfThen ( IsToken (w , "p", 1) , to correctly model the case when the word doesn’t
CopyReplace (x , w , 1)) have a long vowel. This results in some samples be-
The T OKEN system does not synthesize these ing marked with stress at two locations, one where

67
the rule for long vowels applies, and one where the we hope to apply it to learning more complex types
rule for words without long vowels applies. of linguistic rules in the future.
In addition to being a way to learn rules from
6 Related work data, the ability to explicity control the general-
ization behaviour of the model allows for the use
Gildea and Jurafsky (1996) also study the problem of program synthesis to understand the kinds of
of learning phonological rules from data, and ex- learning biases and operations that are required to
plicitly controlling generalization behaviour. We model various linguistic processes. We leave this
pursue a similar goal, but in a few-shot setting. exploration to future work.
Barke et al. (2019) and Ellis et al. (2015) study
program synthesis applied to linguistic rule learn- Acknowledgements
ing. They make much stronger assumptions about
the data (the existence of an underlying form, and We would like to thank Partho Sarthi for invaluable
the availability of additional information like IPA help with PROSE and NDSyn. We would also like
features). We take a different approach, and study to thank the authors of the ProLinguist paper for
program synthesis models that can work only on their assistance. Finally, we would like to thank the
the tokens in the word (like N O F EATURE), and also anonymous reviewers for their feedback.
explore the effect of providing features in these
cases. We also test our approach on a more varied References
set of problems that involves aspects of morphol-
ogy, transliteration, multilinguality, and stress. Shraddha Barke, Rose Kunkel, Nadia Polikarpova, Eric
Meinhardt, Eric Bakovic, and Leon Bergen. 2019.
Şahin et al. (2020) also present a set of Linguis- Constraint-based learning of phonological processes.
tics Olympiad problems as a test of the metalin- In Proceedings of the 2019 Conference on Empirical
guistic reasoning abilities of NLP models. While Methods in Natural Language Processing and the
problems in their set involve finding phonological 9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 6176–
rules, they also require the knowledge of syntax 6186, Hong Kong, China. Association for Computa-
and semantics that are out of the scope of our study. tional Linguistics.
We present a set of problems that only requires
reasoning about surface word forms, and without A. Bellos. 2020. The Language Lover’s Puzzle Book:
Lexical perplexities and cracking conundrums from
requiring the meanings. across the globe. Guardian Faber Publishing.

7 Conclusion Eric Brill. 1992. A simple rule-based part of speech


tagger. In Speech and Natural Language: Proceed-
In this paper, we explore the problem of learning ings of a Workshop Held at Harriman, New York,
February 23-26, 1992.
linguistic rules from only a few training examples.
We approach this using program synthesis, and Keith Brown and Sarah Ogilvie. 2010. Concise ency-
demonstrate that it is a powerful and flexible tech- clopedia of languages of the world. Elsevier.
nique for learning phonology rules in Olympiad
Noam Chomsky and Morris Halle. 1968. The sound
problems. These problems are designed to be chal- pattern of english.
lenging tasks that require learning rules from a
minimal number of examples. These problems also Ivan Derzhanski and Thomas Payne. 2010. The
allow us to specifically test for generalization. Linguistic Olympiads: academic competitions in
linguistics for secondary school students, page
We compare our approach to various baselines, 213–226. Cambridge University Press.
and find that it is capable of learning phonologi-
cal rules that generalize much better than existing Kevin Ellis, Armando Solar-Lezama, and Josh Tenen-
approaches. We show that using the DSL, we can baum. 2015. Unsupervised learning by program syn-
thesis. In C. Cortes, N. D. Lawrence, D. D. Lee,
explicitly control the structure of rules, and using M. Sugiyama, and R. Garnett, editors, Advances in
the ranking score, we can provide the model with Neural Information Processing Systems 28, pages
implicit preferences for certain kinds of rules. 973–981. Curran Associates, Inc.
Having demonstrated the potential of program Daniel Gildea and Daniel Jurafsky. 1996. Learning
synthesis as a learning technique that can work with bias and phonological-rule induction. Computa-
very little data and provide human-readable models, tional Linguistics, 22(4):497–530.

68
Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Josef Robert Novak, Nobuaki Minematsu, and Keikichi
Arya McCarthy, Shijie Wu, and Daniel You. 2020. Hirose. 2016. Phonetisaurus: Exploring grapheme-
The SIGMORPHON 2020 shared task on multilin- to-phoneme conversion with joint n-gram models in
gual grapheme-to-phoneme conversion. In Proceed- the wfst framework. Natural Language Engineer-
ings of the 17th SIGMORPHON Workshop on Com- ing, 22(6):907–938.
putational Research in Phonetics, Phonology, and
Morphology, pages 40–50, Online. Association for Oleksandr Polozov and Sumit Gulwani. 2015. Flash-
Computational Linguistics. meta: A framework for inductive program synthesis.
SIGPLAN Not., 50(10):107–126.
Sumit Gulwani. 2011. Automating string processing
in spreadsheets using input-output examples. SIG- Maja Popović. 2015. chrF: character n-gram F-score
PLAN Not., 46(1):317–330. for automatic MT evaluation. In Proceedings of the
Tenth Workshop on Statistical Machine Translation,
Sumit Gulwani, Oleksandr Polozov, and Rishabh pages 392–395, Lisbon, Portugal. Association for
Singh. 2017. Program synthesis. Foundations and Computational Linguistics.
Trends® in Programming Languages, 4(1-2):1–119.
Arun Iyer, Manohar Jonnalagedda, Suresh Gözde Gül Şahin, Yova Kementchedjhieva, Phillip
Parthasarathy, Arjun Radhakrishna, and Sriram K Rust, and Iryna Gurevych. 2020. PuzzLing Ma-
Rajamani. 2019. Synthesis and machine learning chines: A Challenge on Learning From Small Data.
for heterogeneous extraction. In Proceedings of the In Proceedings of the 58th Annual Meeting of the
40th ACM SIGPLAN Conference on Programming Association for Computational Linguistics, pages
Language Design and Implementation, pages 1241–1254, Online. Association for Computational
301–315. Linguistics.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Partho Sarthi, Monojit Choudhury, Arun Iyer, Suresh
Sherif. 2007. Applying many-to-many alignments Parthasarathy, Arjun Radhakrishna, and Sriram Ra-
and hidden Markov models to letter-to-phoneme jamani. 2021. ProLinguist: Program Synthesis for
conversion. In Human Language Technologies Linguistics and NLP. IJCAI Workshop on Neuro-
2007: The Conference of the North American Chap- Symbolic Natural Language Inference.
ter of the Association for Computational Linguistics;
Proceedings of the Main Conference, pages 372– Temple F Smith, Michael S Waterman, et al. 1981.
379, Rochester, New York. Association for Compu- Identification of common molecular subsequences.
tational Linguistics. Journal of molecular biology, 147(1):195–197.

Mark Johnson. 1984. A discovery procedure for cer- Harold Somers. 2016. Changing the subject.
tain phonological rules. In 10th International Con- In Andrew Lamont and Dragomir Radev, edi-
ference on Computational Linguistics and 22nd An- tors, North American Computational Linguistics
nual Meeting of the Association for Computational Olympiad 2016: Invitational Round. North Ameri-
Linguistics, pages 344–347, Stanford, California, can Computational Linguistics Olympiad.
USA. Association for Computational Linguistics.
Saujas Vaduguru. 2019. Chickasaw stress. In
Mary Laughren. 2011. Stopping and flapping in Shardul Chiplunkar and Saujas Vaduguru, editors,
warlpiri. In Dragomir Radev and Patrick Littell, Panini Linguistics Olympiad 2019. Panini Linguis-
editors, North American Computational Linguistics tics Olympiad.
Olympiad 2011: Invitational Round. North Ameri-
can Computational Linguistics Olympiad. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Kaiser, and Illia Polosukhin. 2017. Attention is all
Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. you need. In Advances in Neural Information Pro-
McCarthy, and Kyle Gorman. 2020. Massively cessing Systems, volume 30. Curran Associates, Inc.
multilingual pronunciation modeling with WikiPron.
In Proceedings of the 12th Language Resources Ekaterina Vylomova, Jennifer White, Eliza-
and Evaluation Conference, pages 4223–4228, Mar- beth Salesky, Sabrina J. Mielke, Shijie Wu,
seille, France. European Language Resources Asso- Edoardo Maria Ponti, Rowan Hall Maudslay, Ran
ciation. Zmigrod, Josef Valvoda, Svetlana Toldova, Francis
Minh-Thang Luong, Hieu Pham, and Christopher D Tyers, Elena Klyachko, Ilya Yegorov, Natalia
Manning. 2015. Effective approaches to attention- Krizhanovsky, Paula Czarnowska, Irene Nikkarinen,
based neural machine translation. arXiv preprint Andrew Krizhanovsky, Tiago Pimentel, Lucas
arXiv:1508.04025. Torroba Hennigen, Christo Kirov, Garrett Nicolai,
Adina Williams, Antonios Anastasopoulos, Hilaria
Tom McCoy. 2018. Better left unsaid. In Patrick Lit- Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka
tell, Tom McCoy, Dragomir Radev, and Ali Shar- Silfverberg, and Mans Hulden. 2020. SIGMOR-
man, editors, North American Computational Lin- PHON 2020 shared task 0: Typologically diverse
guistics Olympiad 2018: Invitational Round. North morphological inflection. In Proceedings of the
American Computational Linguistics Olympiad. 17th SIGMORPHON Workshop on Computational

69
Research in Phonetics, Phonology, and Morphology, its arguments. The arguments may be other op-
pages 1–39, Online. Association for Computational erators, offsets, or other constants (like tokens or
Linguistics.
features). The score for an operator in the argu-
Elysia Warner. 2019. Tarangan. In Samuel Ahmed, ment is computed recursively. The score for an
Bozhidar Bozhanov, Ivan Derzhanski (technical offset favours smaller numbers and local rules by
editor), Hugh Dobbs, Dmitry Gerasimov, Shin-
jini Ghosh, Ksenia Gilyarova, Stanislav Gurevich,
decreasing the score for larger offsets. The score
Gabrijela Hladnik, Boris Iomdin, Bruno L’Astorina, for other constants is chosen to be a small negative
Tae Hun Lee (editor-in chief), Tom McCoy, André constant. The scores for the arguments are added
Nikulin, Miina Norvik, Tung-Le Pan, Aleksejs up, along with a small negative penalty to favour
Peguševs, Alexander Piperski, Maria Rubinstein,
shorter programs, to obtain the final score for the
Daniel Rucki, Artūrs Semeņuks, Nathan Somers,
Milena Veneva, and Elysia Warner, editors, Interna- operator.
tional Linguistics Olympiad 2019. International Lin- This ranking score selects for programs that are
guistics Olympiad. shorter, and favours either choosing more gen-
Shijie Wu. 2020. Neural transducer. https://github. eral by giving the Is predicate a higher score
com/shijie-wu/neural-transducer/. (F EATURE) or more specific rules by giving the
A NDSyn algorithm IsToken predicate a higher score (T OKEN). The
top k programs according to the ranking function
We use the NDSyn algorithm to learn disjunctions are chosen as candidates for the next step.
of rules. We apply NDSyn in multiple passes to To choose the final set of rules from the candi-
allow the model to learn multi-pass rules. dates generated using the FlashMeta algorithm,
At each pass, the algorithm learns rules to per- we use a set covering algorithm that chooses the
form token-level transformations that are applied rules that correctly answer the most number of ex-
to each element of the input sequence. The token- amples while also incorrectly answering the least.
level examples are passed to NDSyn, which learns These rules are applied to each example, and the
the if-then-else statements that constitute a set of output tokens are tagged with the transformation
rules. This is done by first generating a set of can- that is applied. These outputs are then the input to
didate rules by randomly sampling a token-level the next pass of the algorithm.
example and synthesizing a set of rules that satisfy
the example. Then, rules are selected to cover the B Dataset
token-level examples.
Rules that satisfy a randomly sampled example We select problems from various Linguistics
are learnt using the FlashMeta program synthesis Olympiads to create our dataset. We include pub-
algorithm (Polozov and Gulwani, 2015). The syn- licly available problems that have appeared in
thesis task is given by the DSL operator P and the Olympiads before. We choose problems that only
specification of constraints X that the synthesized involve rules based on the symbols in the data, and
program must satisfy. In our application, this speci- not based on knowledge of notions such as gender,
fication is in the form of token-level examples, and tense, case, or semantic role. These problems are
the DSL operators are the predicates and transfor- based on the phonology of a particular language,
mations defined in the paper. The algorithm recur- and include aspects of morphology and orthogra-
sively decomposes the synthesis problem (P, X ) phy, and maybe also the phonology of a different
into smaller tasks (Pi , Xi ) for each argument Pi language. In some cases where a single Olympiad
to the operator. Xi is inferred using the inverse problem involves multiple components that can be
semantics of the operator Pi , which is encoded as solved independent of each other, we include them
a witness function. The inverse semantics provides as separate problems in our dataset.
the possible values for the arguments of an opera- We put the data and assignments in a matrix, as
tor, given the output of the operator. We refer the described in Section 3.1 . We separate tokens in a
reader to the paper by Polozov and Gulwani (2015) word by a space while transcribing the problems
for a full description of the synthesis algorithm. from their source PDFs. We do not separate diacrit-
After the candidates are generated, they are ics as different tokens, and include them as part of
ranked according to a ranking score of each pro- the same token. For each token in the Roman script,
gram. The ranking score for an operator in a pro- we add the boolean features vowel and consonant,
gram is computed as a function of the scores of and manually tag the tokens according to whether

70
they are a vowel or consonant.
We store the problems in JSON files with details
about the languages, the families to which the lan-
guages belong, the data matrix, the notes used to
create the features, and the feature sets for each
token.

C Baselines
C.1 Neural
Following Şahin et al. (2020), we use small neural
models for sequence-to-sequence tasks. We train a
single neural model for each task, and provide the
column numbers as tags in addition to the source
sequence. We find that the single model approach
works better than training a model for each pair of
columns.
LSTM: We use LSTM models with soft attention
(Luong et al., 2015), with embeddings of size 64,
hidden layers of size 128, a 2-layer encoder and a
single layer decoder. We apply a dropout of 0.3 for
all layers. We train the model for 100 epochs using
the Adam optimizer with a learning rate of 10−3 ,
learning rate reduction on plateau, and a batch size
of 2. We clip the gradient norm to 5.
Transformer: We use Transformer models
(Vaswani et al., 2017) with embeddings of size
128, hidden layers of size 256, a 2-layer encoder
and a 2-layer decoder. We apply a dropout of 0.3
for all layers. We train the model for 2000 steps
using the Adam optimizer with a learning rate of
10−3 , warmup of 400 steps, learning rate reduction
on plateau, and a batch size of 2. We use a label
smoothing value of 0.1, and clip the gradient norm
to 1.
We use the implementations provided at https:
//github.com/shijie-wu/neural-transducer/ for all
neural models.

C.2 WFST
We use the implementation the WFST models avail-
able at https://github.com/sigmorphon/2020/tree/
master/task1/baselines/fst for the WFST models.
We train a model for each pair of columns. We
report the results for models of order 5, which were
found to perform the best on the test data (highest
E XACT score) among models of order 3 to 9.

71
Findings of the SIGMORPHON 2021 Shared Task on
Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage† Arya McCarthy‡ Disfrutar Alexander Erdmann∇ Disfrutar
Garrett Nicolai ψ
Manex Agirrezabal φ
Miikka Silfverbergψ
Mans Hulden† Katharina Kann†
V;IND;PRS;1;SG
V;IND;PRS;1;PL
disfruto
disfrutáis


University of Colorado Boulder ‡
Johns Hopkins University ∇
V;IND;PRS;2;SG
V;IND;PRS;2;PL

Ohio State University


disfrutas
disfrutamos

φ
University of Copenhagen ψ
University of British Columbia
{adam.wiemerslage,katharina.kann}@colorado.edu
V;IND;PRS;3;SG V;IND;PRS;3;PL disfruta disfrutan

Abstract
We describe the second SIGMORPHON disfrutamos

...disfruta de la vida lo más que puedas!

shared task on unsupervised morphology: the Si disfrutamos de la vida, ...


disfruta
goal of the SIGMORPHON 2021 Shared Task
on Unsupervised Morphological Paradigm
Clustering is to cluster word types from a raw
text corpus into paradigms. To this end, we re- Figure 1: Unsupervised morphological paradigm clus-
lease corpora for 5 development and 9 test lan- tering consists of clustering word forms from raw text
guages, as well as gold partial paradigms for into paradigms.
evaluation. We receive 14 submissions from 4
teams that follow different strategies, and the
best performing system is based on adaptor neither does it know (a) the features for which a
grammars. Results vary significantly across
lemma typically inflects, nor (b) the number of dis-
languages. However, all systems are outper-
formed by a supervised lemmatizer, implying tinct inflected forms which constitute the paradigm.
that there is still room for improvement. A successful unsupervised paradigm cluster-
ing system leverages common patterns in the lan-
1 Introduction guage’s inflectional morphology while simultane-
In recent years, most research in the area of compu- ously ignoring regular circumstantial similarities
tational morphology has focused on the application along with derivational patterns. For example, an
of supervised machine learning methods to word accurate unsupervised system must recognize that
inflection: generating the inflected forms of a word, disfrutamos (English: we enjoy) and disfruta (En-
often a lemma, in order to express certain grammat- glish: he/she/it enjoys) are inflected variants of the
ical properties. For example, a supervised inflec- same paradigm, but that the orthographically sim-
tion system for Spanish might be provided with a ilar disparamos (English: we shoot), belongs to a
lemma disfrutar (English: to enjoy) and morpho- separate paradigm. Likewise, a successful system
logical features such as indicative, present tense & for English will recognize that walk and walked
1st person singular, and generate the corresponding belong to the same verbal paradigm but walker
inflected form disfruto as output. is a derived form belonging to a distinct nominal
However, a supervised machine learning setup paradigm. Such fine-grained distinctions are diffi-
is quite different from a human first language (L1) cult to learn in an unsupervised manner.
acquisition setting. Young children must learn to This paper describes the SIGMORPHON 2021
segment a continuous speech signal into discrete Shared Task on Unsupervised Morphological
words and perform unsupervised classification, de- Paradigm Clustering. Participants are asked to sub-
coding, and eventually, inference with incomplete mit systems which cluster words from the Bible
feedback on this noisy input. The task of unsu- into inflectional paradigms.1 Participants are not
pervised paradigm clustering aims to replicate one allowed to use any external resources. Four teams
of the steps in this process—namely, the grouping submit at least one system for the shared task and
of word forms belonging to the same lexeme into 1
Bible translations for five development and nine test lan-
inflectional paradigms. In this unsupervised task, a guages were obtained from the Johns Hopkins University
system does not know about lemmas. Furthermore, Bible Corpus introduced by McCarthy et al. (2020b).

72
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 72–81
August 5, 2021. ©2021 Association for Computational Linguistics
all teams also submit a system description paper. 2.1 Data
The shared task systems can be grouped into Languages The SIGMORPHON 2021 Shared
two broad categories: similarity-based systems Task on Unsupervised Morphological Paradigm
experiment with different combinations of ortho- Clustering features 5 development languages: Mal-
graphic and embedding-based similarity metrics for tese, Persian, Portuguese, Russian, and Swedish.
word forms combined with clustering methods like The final evaluation is done on 9 test languages:
k-means or agglomerative clustering. Grammar- Basque, Bulgarian, English, Finnish, German, Kan-
based methods instead learn grammars or rules nada, Navajo, Spanish, and Turkish.
from the data and either apply these to clustering Our languages span 4 writing systems, and repre-
directly, or first segment words into stems and af- sent fusional, agglutinative, templatic, and polysyn-
fixes and then cluster forms which share a stem into thetic morphologies. The languages in the develop-
paradigms. Our official baseline, described in Sec- ment set are mostly suffixing, except for Maltese,
tion 2.3, is based on grouping together word forms which is a templatic language. And while most of
sharing a common substring of length ≥ k, where the test languages are also predominantly suffix-
k is a hyperparameter. Grammar-based systems ob- ing, Navajo employs prefixes and Basque uses both
tain higher average F1 scores (see Section 2.2 for prefixes and suffixes.
details on evaluation) across the nine test languages
than the baseline. The Edinburgh system has the Text Corpora We provide corpora from the
best overall performance: it outperforms the base- Johns Hopkins University Bible Corpus (JHUBC)
line by 34.61% F1 and the second best system by (McCarthy et al., 2020b) for all development and
1.84% F1. test languages. This is the only resource that sys-
tems are allowed to use.
The rest of the paper is organized as follows:
Section 2 describes the task of unsupervised mor- Gold Partial Paradigms Along with the Bibles,
phological paradigm clustering in detail, including we also release a set of gold partial paradigms for
the official baseline and all provided datasets. Sec- the development languages to be used for system
tion 3 gives an overview of the participating sys- development. Gold data sets are also compiled for
tems. Section 4 describes the official results, and the test languages, but these test sets are withheld
5 presents an analysis. Finally, Section 6 contains until the completion of the shared task.
a discussion of where the task can move in future In order to produce gold partial paradigms, we
iterations and concludes the paper. first take the set of all paradigms Π for each lan-
guage from UniMorph (McCarthy et al., 2020a).
We then obtain gold partial paradigms ΠĜ =
2 Task Description T
Π Σ, where Σ is the set of types attested in the
Bible corpus. Finally, we sample up to 1000 of the
Unsupervised morphological paradigm clustering
resulting gold partial paradigms for each language,
consists of, given a raw text corpus, grouping words
resulting in the set ΠG according to the following
from that corpus into their paradigms without any
steps:
additional information. Recent work in unsuper-
vised morphology has attempted to induce full 1. Group gold paradigms in ΠĜ by size, result-
paradigms from corpora with only a subset of all ing in the set G, where gk ∈ G is the group of
types. Kann et al. (2020) and Erdmann et al. (2020) paradigms with k forms in it.
explore initial approaches to this task, which is
2. Continually loop over all gk ∈ G and ran-
called unsupervised morphological paradigm com-
domly sample one paradigm from gk until we
pletion, but find it to be challenging. Building
have 1000 paradigms.
upon the SIGMORPHON 2020 Shared Task on Un-
supervised Morphological Paradigm Completion Because not every token in the Bible corpora is in
(Kann et al., 2020), our shared task is focused on a UniMorph, we can only evaluate on the subset of
subset of the overall problem: sorting words into paradigms that exist in the UniMorph database. In
paradigms. This can be seen as an initial step to practice, this means that for several languages, we
paradigm completion, as unobserved types do not are not able to sample 1000 paradigms, cf. Tables
need to be induced, and the inflectional categories 1 and 2. Notably, for Basque, we can only provide
of paradigm slots do not need to be considered. 12 paradigms.

73
Maltese Persian Portuguese Russian Swedish
# Lines 7761 7931 31167 31102 31168
# Tokens 193257 227584 828861 727630 871707
# Types 16017 11877 31446 46202 25913
TTR .083 .052 .038 .063 .03
# Paradigms 76 64 1000 1000 1000
# Forms in paradigms 572 446 11430 6216 3596
Largest paradigm size 14 20 47 17 9

Table 1: Statistics for the development Bible corpora and the dev gold partial paradigms. TTR is the type-token
ratio in the corpus. The statistics for the paradigms reflect only those words in our partial paradigms, not the full
paradigms from Unimorph.

English Navajo Spanish Finnish Bulgarian Basque Kannada German Turkish


# Lines 7728 5058 7337 31087 31101 7958 7863 31102 30182
# Tokens 236465 104631 251581 685699 801657 195459 193213 826119 616418
# Types 7144 18799 9755 54635 37048 18376 28561 22584 59458
TTR .03 .18 .039 .08 .046 .094 .148 .027 .096
# Paradigms 1000 88 990 1000 1000 12 92 1000 1000
# Forms in paradigms 2475 214 5154 8509 5086 63 933 3628 9204
Largest paradigm size 7 13 34 31 27 25 44 15 49

Table 2: Statistics for the test Bible corpora and the test gold partial paradigms.

P1 one contains a word that does not exist in the set of


dependemos
gold paradigms, and thus cannot be judged – these
dependem
words are ignored and do not affect evaluation. In
dependerá

dependesse

this example, the predicted P1 is a better match,


G1
depende
resulting in a perfect F1 score. However, our eval-
dependemos
dependiam

dependem
1.0 uation punishes systems for predicting a second
desfrutarem
dependerá
paradigm, P2, with words from G1, reducing the
dependesse
overall precision score of this submission.
depende
0.5
dependiam P2
Building upon BMAcc (Jin et al., 2020), we
use best-match F1 score for evaluation. We define
dependesse

depende
a paradigm as a set of word forms f ∈ π. Du-
desonrares plicate forms within π (syncretism) are discarded.
Given a set of gold partial paradigms π g ∈ ΠG , a
set of predicted paradigms π p ∈ ΠP , a gold vo-
Figure 2: An example matching of predicted paradigms S
cabulary Σg = π g , and a predicted vocabulary
in blue, and a gold paradigm in green. Words in red do S
Σp = π p , it works according to the following
not exist in the gold set, and thus cannot be evaluated.
steps:
1. Redefine each predicted paradigm, remov-
2.2 Evaluation 0
ing the words that we cannot evaluate π p =
T
As our task is entirely unsupervised, evaluation π p Σg , to form a set of pruned paradigms
is not straightforward: as in Kann et al. (2020), Π0P .
our evaluation requires a mapping from predicted
paradigms to gold paradigms. Because our set of 2. Build a complete Bipartite graph over Π0P and
gold partial paradigms does not cover all words in ΠG , where the edge weight between πig and
0 T 0
the corpus, in practice we only evaluate against a πjp is the number of true positives |πig πjp |.
subset of the clusters predicted by systems.
3. Compute the maximum-weight full matching
For these reasons, we want an evaluation that using Karp (1980), in order to find the optimal
assesses the best matching paradigms, ignoring pre- alignment between Π0P and ΠG
dicted forms that do not occur in the gold set, but
0
still punishing for spurious predictions that are in 4. Assign all predicted words Σp and all gold
the gold set. For example, Figure 2 shows two can- words Σg a label corresponding to the gold
didate matches for a gold partial paradigm. Each paradigm, according to the matching found in

74
0 0
3. Any unmatched wip ∈ Σp is assigned a disconnected components is > n/2 (Hartuv and
label corresponding to a spurious paradigm. Shamir, 2000). The number of HCSs is then taken
to be the cluster number. In practice, however,
5. Compute the F1 score between the sets of the graph-clustering step proves to be prohibitively
0
labeled words in Σp and Σg slow and results for test languages are submitted
using fixed numbers of clusters of size 500, 1000,
2.3 Baseline System 1500 and 1900. In experiments on the dev lan-
guages, they find that the orthographic representa-
We provide a straightforward baseline that con-
tions outperform the semantic representations for
structs paradigms based on substring overlap be-
all languages, and thus submit four systems utiliz-
tween words. We construct paradigms out of words
ing orthographic representations.
that share a substring of length ≥ k. Since words
can share multiple substrings, it is possible that The Boulder-Gerlach-Wiemerslage-Kann team
multiple identical, redundant paradigms are cre- (Boulder-GWK; Gerlach et al., 2021) submits
ated. We reduce these to a single paradigm. Words two systems based on an unsupervised lemmati-
that do not belong to a cluster are assigned a sin- zation system originally proposed by Rosa and
gleton paradigm, that is, a paradigm that consists Zabokrtský (2019). Their approach is based on ag-
of only that word. glomerative hierarchical clustering of word types,
We tune k on the development sets and find that where the distance between word types is computed
k = 5 works best on average. This means that a as a combination of a string distance metric and
word of less than 5 characters can only ever be in the cosine distance of fastText embeddings (Bo-
one, singleton, paradigm. janowski et al., 2017). Their choice of fastText
embeddings is due to the limited size of the shared
3 Submitted Systems task datasets. Two variants of edit distance are com-
pared to quantify string distance: (1) Jaro-Winkler
The Boulder-Perkoff-Daniels-Palmer team
edit distance (Winkler, 1990) resembles regular
(Boulder-PDP; Perkoff et al., 2021) participates
edit distance of strings but emphasizes similarity
with four submissions, resulting from experiments
at the start of strings which is likely to bias the
with two different systems. Both systems apply
system toward languages expressing inflection via
k-means clustering on vector representations of
suffixation. (2) A weighted variant of edit distance,
input words. They differ in the type of vector
where costs for insertions, deletions and substitu-
representations used: either orthographic or
tions are derived from a character-based language
semantic representations. Semantic skip-gram
model trained on the shared task data.
representations are generated using word2vec
(Mikolov et al., 2013). For the orthographic The CU–UBC (Yang et al., 2021) team provides
representations, each word is encoded into a vector systems that built upon the official shared task base-
of fixed dimensionality equaling the word length line – given the pseudo-paradigms found by the
|wmax | for the longest word wmax in the input baseline, they extract inflection rules of multiple
corpus. They associate each character c ∈ Σ in the types. Comparing pairs of words in each paradigm,
alphabet of the input corpus with a real number they learn both continuous and discontinuous char-
r ∈ [0, 1] and assign vi := r if the ith character acter sequences that transform the first word into
of the input word w is c. If |w| < |wmax |, the the second, following work on supervised inflec-
remaining entries are assigned to 0. tional morphology, such as Durrett and DeNero
The number of clusters is a hyperparameter of (2013); Hulden et al. (2014). Rules are sorted by
the k-means clustering algorithm. In order to set frequency to separate genuine inflectional patterns
this hyperparameter, Perkoff et al. (2021) experi- from noise. Starting from a random seed word,
ment with a graph-based method. The word types paradigms are constructed by iteratively applying
in the corpus form the nodes of a graph, where the the most frequent rules. Generated paradigms are
neighborhood of a word w consists of all words further tested for paradigm coherence using met-
sharing a maximal substring with w. The graph is rics such as graph degree calculation and fastText
split into highly connected subgraphs (HCS) con- embedding similarity.
taining n nodes, where the number of edges that The Edinburgh team (McCurdy et al., 2021)
need to be cut in order to split the graph into two submits a system based on adaptor grammars (John-

75
English Navajo Spanish Finnish Bulgarian Basque Kannada German Turkish Average
Rec 28.93 32.71 23.90 18.43 20.55 28.57 25.19 25.50 15.70 24.39
Boulder-PDP-1 Prec 29.27 34.15 24.68 18.81 20.75 29.51 35.18 25.64 15.90 25.99
F1 29.10 33.41 24.29 18.62 20.65 29.03 29.36 25.57 15.80 25.09
Rec 36.57 36.92 28.52 23.38 26.37 30.16 25.83 33.21 19.53 28.94
Boulder-PDP-2 Prec 37.00 38.54 29.45 23.86 26.63 31.15 36.08 33.40 19.79 30.65
F1 36.78 37.71 28.98 23.62 26.50 30.65 30.11 33.31 19.66 29.70
Rec 42.79 37.85 29.41 26.01 28.73 26.98 25.94 38.18 21.38 30.81
Boulder-PDP-3 Prec 43.30 39.51 30.37 26.55 29.01 27.87 36.23 38.39 21.66 32.54
F1 43.04 38.66 29.88 26.27 28.87 27.42 30.23 38.28 21.52 31.58
Rec 45.45 40.19 30.64 26.60 29.79 28.57 24.54 39.86 21.65 31.92
Boulder-PDP-4 Prec 45.99 41.95 31.63 27.15 30.08 29.51 34.28 40.08 21.93 33.62
F1 45.72 41.05 31.13 26.87 29.93 29.03 28.61 39.97 21.79 32.68
Rec 28.81 10.75 19.27 22.02 30.02 19.05 18.54 31.92 20.63 22.33
Boulder-GWK-2 Prec 66.33 65.71 69.93 67.36 71.69 35.29 62.45 78.56 64.09 64.60
F1 40.17 18.47 30.21 33.19 42.32 24.74 28.60 45.39 31.22 32.70
Rec 24.53 11.21 18.30 22.69 31.18 25.40 16.93 30.98 21.16 22.49
Boulder-GWK-1 Prec 56.47 68.57 66.41 69.41 74.46 47.06 57.04 76.26 65.74 64.60
F1 34.20 19.28 28.69 34.20 43.96 32.99 26.12 44.06 32.02 32.83
Rec 76.69 59.81 72.18 76.73 73.02 25.40 38.48 77.62 65.82 62.86
Baseline Prec 38.76 23.02 26.56 17.86 26.50 18.60 17.22 25.35 15.60 23.28
F1 51.49 33.25 38.83 28.97 38.89 21.48 23.79 38.22 25.23 33.35
Rec 66.95 50.93 60.52 45.96 65.08 17.46 30.33 66.57 43.25 49.67
CU–UBC-5 Prec 90.40 68.55 72.70 56.47 76.85 52.38 61.26 74.40 54.05 67.45
F1 76.93 58.45 66.05 50.68 70.48 26.19 40.57 70.26 48.05 56.41
Rec 63.76 51.867 63.62 48.75 63.84 17.46 33.12 65.05 45.81 50.36
CU–UBC-6 Prec 85.99 69.375 76.49 59.67 75.99 52.38 64.24 72.39 57.52 68.23
F1 73.23 59.36 69.46 53.66 69.39 26.19 43.71 68.52 51.00 57.17
Rec 60.36 53.74 64.05 51.51 58.18 22.22 35.37 59.32 47.74 50.28
CU–UBC-7 Prec 81.42 72.33 76.98 62.58 69.23 66.67 69.77 66.13 60.17 69.47
F1 69.33 61.66 69.92 56.51 63.23 33.33 46.94 62.54 53.24 57.41
Rec 83.39 47.66 76.48 52.06 73.14 25.40 36.33 74.28 46.50 57.25
CU–UBC-3 Prec 84.38 49.76 78.97 53.14 73.87 26.23 50.75 74.70 47.10 59.88
F1 83.89 48.69 77.71 52.60 73.50 25.81 42.35 74.49 46.80 58.42
Rec 80.69 47.66 78.35 57.29 73.77 28.57 40.73 74.06 50.93 59.12
CU–UBC-4 Rec 81.64 49.76 80.89 58.48 74.50 29.51 56.89 74.47 51.59 61.97
F1 81.16 48.69 79.60 57.88 74.14 29.03 47.47 74.27 51.26 60.39
Rec 75.96 47.66 75.73 65.35 69.07 28.57 49.52 65.08 60.58 59.73
CU–UBC-1 Prec 76.86 49.76 78.19 66.71 69.92 29.51 69.16 65.44 61.36 62.99
F1 76.41 48.69 76.94 66.03 69.50 29.03 57.71 65.26 60.97 61.17
Rec 88.16 41.59 81.90 72.68 76.58 28.57 50.91 73.98 67.37 64.64
CU–UBC-2 Prec 89.21 43.41 84.56 74.18 77.34 29.51 71.11 74.39 68.24 67.99
F1 88.68 42.48 83.21 73.42 76.96 29.03 59.34 74.18 67.80 66.12
Rec 89.54 41.59 82.38 59.58 80.22 31.75 58.95 78.97 72.82 66.20
Edinburgh Prec 90.75 43.41 85.06 60.84 83.30 32.79 82.34 79.41 73.75 70.18
F1 90.14 42.48 83.70 60.20 81.73 32.26 68.71 79.19 73.28 67.96
Rec 95.31 - 85.49 86.21 84.74 65.08 - 79.19 86.80 83.26
stanza Prec 93.87 - 85.84 85.91 82.79 50.62 - 71.57 86.87 79.64
F1 94.59 - 85.66 86.06 83.75 56.94 - 75.19 86.84 81.29

Table 3: Results on all test languages for all systems in %; the official shared task metric is best-match F1. To
provide a more complete picture, we also show precision and recall. stanza is a supervised system.

son et al., 2007) modeling word structure. Their guage then consists of the prefixes and stem identi-
work draws on parallels between the unsupervised fied by the adaptor grammar. For a predominantly
paradigm clustering task and unsupervised mor- prefixing language, the final stem instead contains
phological segmentation. Their grammars segment all suffixes of the word form. The team notes that
word forms in the shared task corpora into a se- this approach is unsuitable for languages which
quence of zero or more prefixes and a single stem extensively make use of both prefixes and suffixes,
followed by zero or more suffixes. such as Basque.
Based on the segmented words from the raw text Finally, they group all words which share the
data, they then determine whether the language same stem into paradigms. However, because
uses prefixes or suffixes for inflection. The final sampling from an adaptor grammar is a non-
stem for words in a predominantly suffixing lan- deterministic process – i.e., the system may return

76
multiple possible segmentations for a single word naaghá neiikai naahkai
naashá nijighá nideeshaał
form – they construct preliminary clusters by in- naayá ninádaah naniná
cluding all forms which might share a given stem. ninájı́daah nizhdoogaał
Then they select the cluster that maximizes a score
Table 4: A paradigm from our gold set for Navajo.
based on frequency of occurrence of the induced
segment in all segmentations.
Overgeneralization/Underspecification When
4 Results and Discussion acquiring language, children often overgeneralize
morphological analogies to new, ungrammatical
The official results obtained by all submitted sys-
forms. For example, the past tense of the English
tems on the test sets are shown in Table 3.
verb to know might be expressed as knowed, rather
The Edinburgh system performs best overall than the irregular knew. The same behavior can
with an average best-match F1 of 67.96%. In also be observed in learning algorithms at some
general, grammar-based systems attain the best re- point during the learning process (Kirov and
sults, with all of the CU–UBC systems and the Cotterell, 2018). This is reflected to some extent in
Edinburgh system outperforming the baseline by at Table 3 by trade-offs between precision and recall.
least 23.06% F1. The Boulder-GWK and Boulder- A low precision, but high recall indicates that a
PDP systems, both of which perform clustering system is overgeneralizing: some surface forms
over word representations, approach but do not ex- are erroneously assigned to too many paradigms.
ceed baseline performance. Perkoff et al. (2021) In effect, these systems are hypothesizing that
found that clustering over word2vec embeddings a substring is productive, and thus proposing a
performs poorly on the development languages, paradigmatic relationship between two words. For
and their scores on the test set reflect clusters found example, the English words approach and approve
with vectors based purely on orthography. The share the stem appro- with unproductive segments
Boulder-GWK systems contain incomplete results, as suffixes. The baseline tends to overgeneralize
and partial evidence suggests that their cluster- due to its creation of large paradigms via a naive
ing method, which combines both fastText embed- grouping of words by shared n-grams.
dings trained on the provided bible corpora, and On the other hand, several systems seem to un-
edit distance, can indeed outperform the baseline. derspecify, indicated by their low recall. A low
However, it likely cannot outperform the grammar- recall, but high precision indicates that a system
based submissions. does not attribute inflected forms to a paradigm
For comparison, we also evaluate a supervised that the form does in fact belong to. This can be
lemmatizer from the Stanza toolkit (Qi et al., 2020). caused by suppletion in systems based purely on
The Stanza lemmatizer is a neural network model orthography, for example, generating the paradigm
trained on Universal Dependencies (UD) treebanks with go and goes, but attributing went to a separate
(Nivre et al., 2020), which first tags for parts of paradigm. Underspecification is apparent in the
speech, and then uses these tags to generate lemmas CU–UBC submissions that relied on discontinuous
for a given word. Because there is no UD corpus in rules (CU–UBC 5, 6, and 7). This is likely because
the current version for Navajo nor Kannada, we do they filtered these systems down to far fewer rules
not have scores for those languages. Stanza’s accu- than their prefix/suffix systems, in order to avoid
racy on our task is far lower than that reported for severe overgeneralization that can result from spuri-
lemmatization on UD data. We note, however, that ous morphemes based on discontinuous substrings.
1) our data is from a different domain, 2) Biblical Similarly, the Boulder-GWK systems both have
language in particular can differ strongly from con- reasonable precision, but very low recalls. They
temporary text, and 3) we evaluate on only a partial report that this is due to the fact that they ignore
set of types in the corpus, which could represent a any words with less than a certain frequency in the
particularly challenging set of paradigms for some corpus due to time constraints, thus creating small
languages. The Stanza lemmatizer outperforms all paradigms and ignoring many words completely.
systems for all languages, except for German. This
is unsurprising as it is a supervised system, though Language and Typology In general, we find that
it is interesting that the German score falls short of Basque and Navajo are the two most difficult test
that of the Edinburgh system. languages. Both languages have relatively small

77
Figure 3: Singleton paradigm counts for the best performing system on all test languages. Languages for which
we have more than 100 paradigms on the left, and those for which we have less than 100 paradigms on the right.
Predicted singleton paradigms are in red and blue, gold singleton paradigms are in grey.

Figure 4: The F1 score across paradigm sizes for the best performing system on all test languages. From left to
right, the graphs represent the groups of languages in increasing order of how well systems typically performed on
them. F1 scores are interpolated for paradigm sizes that do not exist in a given language.

corpora, and are typlogically agglutinative – that pus may cause difficulties for their algorithm that
is, they express inflection via the concatenation of builds clusters based on affix frequency. Notably,
potentially many morpheme segments, which can the CU-UBC-7 system, which learns discontinu-
result in a large number of unique surface forms. ous rules rather than rules that model strictly con-
Both languages thus have relatively high type-token catenative morphology, performs best on Navajo
ratios (TTR) – especially Navajo, which has the by a large margin when compared to the best per-
highest TTR, cf. Table 2. It is also important to forming system, which relies on strictly concate-
note that both Basque and Navajo have compara- native grammars. It also performs best on Basque,
tively small sets of paradigms against which we though by a smaller margin. Another difficulty in
evaluate. This leaves the possibility that the subset Navajo morphology is that it exhibits verbal stem
of paradigms in the gold set are particularly chal- alternation for expressing mood, tense, and aspect,
lenging. However, the differences between system which creates challenges for systems that rely on
scores indicates that these two languages do offer rewrite rules or string similarity, based on continu-
challenges related to their morphology. ous substrings. For instance, our evaluation algo-
Navajo is a predominantly prefixing language rithm aligns a singleton predicted paradigm to the
– the only one in the development and test sets – gold paradigm in Table 4 for nearly all systems.
and Basque also inflects using prefixes, though to On Basque, most systems perform poorly. Mc-
a lesser extent. The top two performing systems Curdy et al. (2021), the best performing system
both obtain low scores for Navajo. The CU–UBC-2 overall, obtains a low score for Basque, which may
system considers only suffix rules, which results be due to their system assuming that a language
in it being the lowest performing CU–UBC system inflects either via prefixation or suffixation, but not
on Navajo. The Edinburgh submission should be both, as Basque does. Other systems, however,
able to identify prefixes and consider the suffix to attain similarly low scores for Basque.
be part of the stem in Navajo. However, the large The next tier of difficulty seems to comprise
number of types, for a relatively small Navajo cor- Finnish, Kannada, and Turkish, on which most sys-

78
tems obtain low scores. All of those languages form paradigms comprising several inflected forms.
are suffixing, but also have an agglutinative mor- Figure 3 demonstrates that the best system tends to
phology. The largest paradigm of each of these 3 overgenerate singleton paradigms. We see this to
languages are all in the top 4 largest paradigms in some extent for all agglutinative languages, which
Table 2. This implies that large paradigm sizes and may be due to the high number of typically long,
large numbers of distinct inflectional morphemes – unique forms. This is especially true for Navajo,
two properties often assumed to correlate with ag- which has a small corpus and extremely high type–
glutinative morphology –, coupled with sparse cor- token ratio. On the other hand, for the languages
pora to learn from, offer challenges for paradigm for which the highest scores are obtained, Span-
clustering. Though agglutinative morphology, hav- ish and English, the system does not overgenerate
ing relatively unchanged morphemes across words, singleton paradigms. Of the large number of sin-
might be simpler for automatic segmentation sys- gleton paradigms predicted for both languages, the
tems than morphology characterized as fusional, vast majority are correct. For other systems not
our sparse data sets are likely to complicate this. pictured in the figure, singleton paradigms are typi-
Finally, systems obtain the best results for En- cally undergenerated for Spanish and English. In
glish, followed by Spanish, and then Bulgarian. the case of English, this could be due to words
These three languages are also strongly suffixing, that share a derivational relationship. For example,
but typically express inflection with a single mor- the word accomplishment might be assigned to the
pheme. German appears to be a bit of an outlier, paradigm for the verb accomplish, when, in fact,
generally exhibiting scores that lie somewhere be- their relationship is not inflectional.
tween the highest scoring languages, and the more
difficult agglutinative languages. McCurdy et al. 6 Conclusion and Future Shared Tasks
(2021) hypothesize that this may be due to non-
concatenative morphology from German verbal cir- We presented the SIGMORPHON 2021 Shared
cumfixes. This hypothesis could explain why the Task on Unsupervised Morphological Paradigm
Boulder-GWK system performs better on German Clustering. Submissions roughly fell into two cat-
than other languages: it incorporates semantic in- egories: similarity-based methods and grammar-
formation. However, the CU–UBC systems that based methods, with the latter proving more
use discontinuous rules (systems 5, 6, and 7), and successful at the task of clustering inflectional
thus should better model circumfixation, do not paradigms. The best systems significantly im-
produce higher German scores than the continuous proved over the provided n-gram baseline, roughly
rules, including the suffix-only system. doubling the F1 score – mostly through much im-
proved precision. A comparison against a super-
5 Analysis: Partial Paradigm Sizes vised lemmatizer demonstrated that we have not yet
reached the ceiling for paradigm clustering: many
The effect of the size of the gold partial paradigms words are still either incorrectly left in singleton
on F1 score for the best system is illustrated in paradigms or incorrectly clustered with circum-
Figure 4. For Basque and Navajo, the F1 score stantially (and often derivationally) related words.
tends to drop as paradigm size increases. We see Regardless of the ground still to be covered, the
the same trend for Finnish, Kannada, and German, submitted results were a successful first step in au-
with a few exceptions, but this trend does not exist tomatically inducing the morphology of a language
for all languages. English resembles something without access to expert-annotated data.
like a bell shape, other than the low scoring outlier Unsupervised morphological paradigm cluster-
for the largest paradigms of size 7. Interestingly, ing is only the first step in a morphological learn-
Spanish and Turkish attain both very high and very ing process that more closely models human L1
low scores for larger paradigms. acquisition. We envision future tasks expanding
An artifact of a sparse corpus is that many sin- on this task to include other important aspects of
gleton paradigms arise. For theoretically larger morphological acquisition. Paradigm slot catego-
paradigms, only a single inflected form might oc- rization is a natural next step. To correctly cate-
cur in such a small corpus. Of course, this also hap- gorize paradigm slots, cross-paradigmatic similari-
pens naturally for certain word classes. However, ties must be considered, for example, the German
nouns, verbs, and occasionally adjectives typically words liest and schreibt are both 3rd person singular

79
present indicative inflections of two different verbs. Xia, Manaal Faruqui, Sandra Kübler, David
This can occasionally be identified via string simi- Yarowsky, Jason Eisner, et al. 2017. Conll-
sigmorphon 2017 shared task: Universal morpholog-
larity, but more often requires syntactic information.
ical reinflection in 52 languages. In Proceedings of
Syncretism (the collapsing of multiple paradigm the CoNLL SIGMORPHON 2017 Shared Task: Uni-
slots into a single representation) further compli- versal Morphological Reinflection, pages 1–30.
cates the task. A similar subtask involves lemma
identification, where a canonical form (Cotterell Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
David Yarowsky, Jason Eisner, and Mans
et al., 2016b) is identified within the paradigm. Hulden. 2016a. The sigmorphon 2016 shared
Likewise, another important task involves fill- task—morphological reinflection. In Proceedings
ing unrealized slots in paradigms by generating the of the 14th SIGMORPHON Workshop on Compu-
correct surface form, which can be approached sim- tational Research in Phonetics, Phonology, and
Morphology, pages 10–22.
ilarly to previous SIGMORPHON shared tasks on
inflection (Cotterell et al., 2016a, 2017, 2018; Mc- Ryan Cotterell, Tim Vieira, and Hinrich Schütze.
Carthy et al., 2019; Vylomova et al., 2020), but will 2016b. A joint model of orthography and mor-
likely be based on noisy information from the slot phological segmentation. Association for Computa-
tional Linguistics.
categorization – all previous tasks have assumed
that the morphosyntactic information provided to Greg Durrett and John DeNero. 2013. Supervised
an inflector is correct. Currently, investigations into learning of complete morphological paradigms. In
the robustness of these systems to noise are sparse. Proceedings of the 2013 Conference of the North
American Chapter of the Association for Computa-
Another direction for this task is the expansion tional Linguistics: Human Language Technologies,
to more under-resourced languages. The submit- pages 1185–1195.
ted results demonstrate that the task becomes par-
ticularly difficult when the provided raw text is Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan
Cotterell, and Nizar Habash. 2020. The paradigm
small, but under-documented languages are often discovery problem. In Proceedings of the 58th An-
the ones most in need of morphological corpora. nual Meeting of the Association for Computational
The JHUBC contains Bible data for more than 1500 Linguistics, pages 7778–7790, Online. Association
languages, which can potentially be augmented by for Computational Linguistics.
other raw text corpora because morphology is rel- Andrew Gerlach, Adam Wiemerslage, and Katharina
atively stable across domains. Future tasks may Kann. 2021. Paradigm clustering with weighted
enable the construction of inflectional paradigms edit distance. In Proceedings of the 18th Workshop
in languages that require them to construct further on Computational Research in Phonetics, Phonol-
computational tools. ogy, and Morphology. Association for Computa-
tional Linguistics.
Acknowledgments Erez Hartuv and Ron Shamir. 2000. A clustering al-
gorithm based on graph connectivity. Information
We would like to thank all of our shared task par- processing letters, 76(4-6):175–181.
ticipants for their hard work on this difficult task!
Mans Hulden, Markus Forsberg, and Malin Ahlberg.
2014. Semi-supervised learning of morphological
References paradigms and lexicons. In Proceedings of the 14th
Conference of the European Chapter of the Associa-
Piotr Bojanowski, Edouard Grave, Armand Joulin, and tion for Computational Linguistics, pages 569–578.
Tomas Mikolov. 2017. Enriching word vectors with
subword information. Transactions of the Associa- Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya
tion for Computational Linguistics, 5:135–146. McCarthy, and Katharina Kann. 2020. Unsuper-
vised morphological paradigm completion. In Pro-
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, ceedings of the 58th Annual Meeting of the Asso-
Géraldine Walther, Ekaterina Vylomova, Arya D ciation for Computational Linguistics, pages 6696–
McCarthy, Katharina Kann, Sabrina J Mielke, 6707, Online. Association for Computational Lin-
Garrett Nicolai, Miikka Silfverberg, et al. 2018. guistics.
The conll–sigmorphon 2018 shared task: Univer-
sal morphological reinflection. arXiv preprint Mark Johnson, Thomas L Griffiths, Sharon Goldwa-
arXiv:1810.07125. ter, et al. 2007. Adaptor grammars: A frame-
work for specifying compositional nonparametric
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, bayesian models. Advances in neural information
Géraldine Walther, Ekaterina Vylomova, Patrick processing systems, 19:641.

80
Katharina Kann, Arya D. McCarthy, Garrett Nico- Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-
lai, and Mans Hulden. 2020. The SIGMORPHON ter, Jan Hajič, Christopher D. Manning, Sampo
Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Richard M. Karp. 1980. An algorithm to solve the m×n assignment problem in expected time O(mn log n). Networks, 10(2):143–152.

Christo Kirov and Ryan Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics, 6:651–665.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020a. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020b. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Kate McCurdy, Sharon Goldwater, and Adam Lopez. 2021. Adaptor grammars for unsupervised paradigm clustering. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

E. Margaret Perkoff, Josh Daniels, and Alexis Palmer. 2021. Orthographic vs. semantic representations for unsupervised morphological paradigm clustering. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Rudolf Rosa and Zdeněk Žabokrtský. 2019. Unsupervised lemmatization as embeddings-based word clustering. CoRR, abs/1908.08528.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. arXiv preprint arXiv:2006.11572.

William E. Winkler. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359.

Changbing Yang, Garrett Nicolai, and Miikka Silfverberg. 2021. Unsupervised paradigm clustering using transformation rules. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.
Adaptor Grammars for Unsupervised Paradigm Clustering

Kate McCurdy
Sharon Goldwater Adam Lopez
School of Informatics
University of Edinburgh
[email protected], {sgwater, alopez}@inf.ed.ac.uk

Abstract

This work describes the Edinburgh submission to the SIGMORPHON 2021 Shared Task 2 on unsupervised morphological paradigm clustering. Given raw text input, the task was to assign each token to a cluster with other tokens from the same paradigm. We use Adaptor Grammar segmentations combined with frequency-based heuristics to predict paradigm clusters. Our system achieved the highest average F1 score across 9 test languages, placing first out of 15 submissions.

1 Introduction

While the task of supervised morphological inflection has seen dramatic gains in accuracy over recent years (e.g. Cotterell et al., 2016, 2017, 2018; Vylomova et al., 2020), unsupervised morphological analysis remains an open challenge. This is evident in the results of the 2020 SIGMORPHON Shared Task 2 on Unsupervised Morphological Paradigm Completion, in which no submission consistently outperformed the baseline (Kann et al., 2020; Jin et al., 2020).

The 2021 Shared Task 2 (Wiemerslage et al., 2021) focuses on a subproblem from the 2020 task: given raw text input, cluster tokens together based on membership in the same morphological paradigm. For example, given the sentence “My dog met some other dogs”, a successful system would assign “dog” and “dogs” to the same paradigm because they are two inflected forms of the same lemma “dog”, while each other word would occupy its own cluster. Furthermore, a successful system needs to cluster typologically diverse, morphologically rich languages such as Finnish and Navajo, with inflectional paradigms which are much larger than English paradigms.

2 Adaptor Grammars

Our approach is based upon Adaptor Grammars, a framework which achieves state-of-the-art results on the related task of unsupervised morphological segmentation (Eskander et al., 2020).

2.1 Model

Adaptor Grammars (AGs; Johnson et al., 2007b) are a class of nonparametric Bayesian probabilistic models which learn structured representations, or parses, of natural language input strings. An AG has two components: a Probabilistic Context-Free Grammar (PCFG) and one or more adaptors. The PCFG is a 5-tuple (N, W, R, S, θ) which specifies a base distribution over parse trees. Parse trees are generated top-down by expanding nonterminals N (including the start symbol S ∈ N) to nonterminals N (excluding S) and terminals W, using the set of allowed expansion rules R with expansion probability θ_r for each rule r ∈ R. PCFGs have very strong independence assumptions; the adaptor component relaxes these assumptions by allowing certain nonterminals to adapt to a particular corpus, meaning they can cache and re-use subtrees with probabilities conditioned on that corpus.

An AG extends a PCFG by specifying a set of adapted nonterminals A ⊆ N and a vector of adaptors C. For each adapted nonterminal X ∈ A, the adaptor C_X stores all subtrees previously emitted with the root node X. When a new tree rooted in X is sampled, the adaptor C_X either generates a new tree from the PCFG base distribution or returns a previously emitted subtree from its cache. The adaptor distribution is generally based on a Pitman-Yor Process (PYP; Pitman and Yor, 1997), under which the probability of C_X returning a particular subtree σ is roughly proportional to the number of times X has previously expanded to σ. This leads to a “rich-get-richer” effect as more frequently sampled subtrees gain higher probability within the conditional adapted distribution. Given an AG specification, MCMC sampling can be used to infer values for the PCFG rule probabilities θ (Johnson et al., 2007a) and PYP hyperparameters (Johnson and Goldwater, 2009).
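To make the caching behaviour concrete, the following is a minimal sketch, in the spirit of the description above, of how an adapted nonterminal's Pitman-Yor cache chooses between reusing a stored subtree and falling back to the base PCFG. This is not the MorphAGram implementation; the class name, parameters, and one-cache-entry-per-subtree simplification are illustrative assumptions only.

    import random
    from collections import Counter

    class PYPAdaptor:
        """Toy Pitman-Yor cache for one adapted nonterminal X (illustrative only)."""

        def __init__(self, base_sampler, discount=0.5, concentration=1.0):
            self.base_sampler = base_sampler   # draws a fresh subtree from the PCFG
            self.a = discount                  # PYP discount parameter
            self.b = concentration             # PYP concentration parameter
            self.cache = Counter()             # subtree -> number of times emitted
            self.total = 0

        def sample(self):
            # Probability of a brand-new base draw grows with the number of cached
            # types; otherwise reuse a cached subtree with probability roughly
            # proportional to how often it was emitted before ("rich get richer").
            p_new = (self.b + self.a * len(self.cache)) / (self.b + self.total)
            if not self.cache or random.random() < p_new:
                tree = self.base_sampler()
            else:
                r = random.uniform(0, self.total - self.a * len(self.cache))
                for tree, count in self.cache.items():
                    r -= count - self.a
                    if r <= 0:
                        break
            self.cache[tree] += 1
            self.total += 1
            return tree

In the actual samplers cited above, the discount and concentration values are themselves inferred during MCMC rather than fixed as in this sketch.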
Figure 1: A possible morphological analysis (1c) learned by the grammar in (1a) over the corpus shown in (1b) (from Johnson et al., 2007b).
(a) Example grammar (adapted nonterminals are underlined in the original figure):
    Word → Stem Suffix
    Word → Stem
    Stem → Chars
    Suffix → Chars
    Chars → Char
    Chars → Char Chars
(b) Toy corpus with target segmentations: walked → walk-ed, jumping → jump-ing, walking → walk-ing, jump → jump.
(c) Example target morphological analyses, showing only the top 2 levels of structure: Word → (Stem walk)(Suffix ed); Word → (Stem jump)(Suffix ing).

2.2 AGs for Morphological Analysis

The probabilistic parses generated by adaptor grammars can be used to segment sequences. In cases where the grammar specifies word structures, the segmentations may reflect morphological analyses. For example, an AG trained with the simple grammar shown in Figure 1a may learn to cache “jump” and “walk” as Stem subtrees, and “ing” and “ed” as Suffix subtrees, ideally producing the target segmentations shown in Figure 1c. In practice, researchers have successfully applied AGs to the task of unsupervised morphological segmentation (Sirts and Goldwater, 2013; Eskander et al., 2016). Eskander et al. (2020) found that a language-independent AG framework achieved state-of-the-art results on 12 typologically distinct languages.

3 System description

3.1 Overview

The task of unsupervised paradigm clustering is closely related to morphological segmentation, but we are not aware of previous applications of AGs to the current task. To use AGs for paradigm clustering, we need a method to group words together based on their AG segmentations. The example segmentations shown in Figure 1 suggest a very simple approach to paradigm clustering: assign all forms with the same stem to the same cluster. For example, “walked” and “walking” would correctly cluster together with the shared stem “walk”. Our system builds upon this intuition.

As a preliminary step, we select grammars to sample from, looking only at the development languages. We build simple clusters and heuristically select grammars which show relatively high performance, as described in Section 3.2. In this case we select two grammars. Once the grammars have been selected, we discard the simple clusters in favor of a more sophisticated strategy.

We implement¹ a procedure to generate clusters for both development and test languages. First, we sample 3 separate AG parses for each corpus and each grammar, resulting in 6 segmentations for each word. We then use frequency-based metrics over the segmentations to identify the language’s adfix direction, i.e. whether it is predominantly prefixing or suffixing, as described in Section 3.3. Finally, we iterate over the entire vocabulary and apply frequency-based scores to generate paradigm clusters, as described in Section 3.4.

3.2 Grammar selection

An adaptor grammar builds upon an initial PCFG specification, and many such grammars can be applied to model word structure. As a first step, we evaluate various grammar specifications on the development languages and select the grammars for our final model.

To train the adaptor grammar representations, we use MorphAGram (Eskander et al., 2020), a framework which extends the adaptor grammar implementation of Johnson et al. (2007b). Eskander et al. (2020) evaluated nine different PCFG grammar specifications on the task of unsupervised word segmentation. Each grammar specifies the range of possible word structures which can be learned under that model. We evaluated six of their nine proposed grammars on the development languages (Maltese, Persian, Portuguese, Russian, and Swedish). Following their procedure, we extracted a vocabulary V of word types as AG inputs.²

¹ https://github.com/kmccurdy/paradigm-clusters
² Although AGs can also model token frequencies (Goldwater et al., 2006), which could conceivably improve performance on this task, we did not explore this option.
To evaluate grammar performance, we follow the intuition in Section 3.1 and group by AG-segmented stems. Grouping by stem can be more difficult for complex words. For example, an AG with a more complex grammar might segment the plural noun “actionables” into “action-able-s”, with “action” as the stem (see also the example in Figure 2a); however, the target paradigm for clustering includes only “actionable” and “actionables”, not “action” and “actions”. To address this issue for our clustering task, we make the further simplifying (but linguistically motivated; e.g. Stump, 2005, 56) assumption that inflectional morphology is generally realized on a word’s periphery, so a segmentation like “action-able-s” implies the stem “actionable” (in a suffixing language like English, where the prefix is included in the stem). As all of the development languages were predominantly suffixing (with the partial exception of Maltese, which includes root-and-pattern morphology), we simply grouped together words with the same AG-segmented Prefix + Stem.

We selected two grammars with the following desirable attributes: 1) they reliably showed good performance on the development set, relative to other grammars; and 2) they specified very similar structures, making it easier to combine their outputs in later steps. Both grammars model words as a tripartite Prefix-Stem-Suffix sequence. Both grammars also use a SubMorph level of representation, which has been shown to aid word segmentation (Sirts and Goldwater, 2013), although we only consider segments from the level directly above SubMorphs in clustering. The full grammar specifications are included in Appendix A.

• Simple+SM: Each word comprises one optional prefix, one stem, and one optional suffix. Each of these levels can comprise one or more SubMorphs.

• PrStSu+SM: Each word comprises zero or more prefixes, one stem, and zero or more suffixes. Each of these levels can comprise one or more SubMorphs. Eskander et al. (2020) found that this grammar showed the highest performance in unsupervised segmentation across the languages they evaluated.

Figure 2: Two example parses of the word “apportioned” from our two distinct grammar specifications, learned on the English test data. (a) PrStSu+SM segments it as app-ort-ion-ed (Prefix “app”, Stem “ort”, Suffixes “ion” and “ed”); (b) Simple+SM segments it as ap-port-ioned (Prefix “ap”, Stem “port”, Suffix “ioned”).

Sampling from an adaptor grammar is a non-deterministic process, so the same set of initial parameters applied to the same data can predict different segmentation outputs. Given this variability, we run the AG sampler three times for each of our two selected grammars, yielding 6 parses of the lexicon for each language. The number of grammar runs was heuristically selected and not tuned in any way, so adding more runs for each grammar might improve performance (for example, Sirts and Goldwater, 2013, use 5 samples per grammar). We then combine the resulting segmentations using the following procedure.

3.3 Adfix direction

The first step is to determine the adfix direction for each language, i.e. whether the language uses predominantly prefixing or suffixing inflection. We heuristically select the adfix direction using the following automatic procedure.

First, we count the frequency of each peripheral segment across all 6 parses of the lexicon. A peripheral segment is a substring at the start or end of a word, which has been parsed as a segment above the SubMorph level in some AG sample. For instance, in the parse shown in Figure 2a, “app-” would be the initial peripheral segment, and “-ed” would be the final peripheral segment. By contrast, for the parse shown in Figure 2b, “ap-” would be the initial peripheral segment, and “-ioned” would be the final peripheral segment.

Next, we rank the segmented adfixes by their frequency, and select the top N for consideration, where N is some heuristically chosen quantity. In light of the generally Zipfian properties of linguistic distributions, we chose to scale N logarithmically with the vocabulary size, so N = log(|V|).

Finally, we select the majority label (i.e. “prefix” or “suffix”) of the N most frequent segments as the adfix direction. This simple approach has obvious limitations — to name just one, it neglects the reality of nonconcatenative morphology, such as the root-and-pattern inflection of many Maltese verbs. Nonetheless, it appears to capture some key distinctions: this method correctly identified Navajo as a prefixing language, and all other development and test languages as predominantly suffixing.
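As a rough illustration of this Section 3.3 heuristic, the sketch below counts peripheral segments over a set of segmented words and takes a log-scaled majority vote. The data structures and function names are our own invention, not the released code, and the tie-breaking is an assumption.

    import math
    from collections import Counter

    def adfix_direction(segmented_words, vocab_size):
        """Guess 'prefix' or 'suffix' from peripheral segments of AG parses.

        `segmented_words` is assumed to be a list of segmented words, each a list
        of segments above the SubMorph level, e.g. ["app", "ort", "ion", "ed"].
        """
        counts = Counter()
        for segments in segmented_words:
            if len(segments) < 2:
                continue  # an unsegmented word contributes no peripheral adfix
            counts[("prefix", segments[0])] += 1
            counts[("suffix", segments[-1])] += 1

        # Consider only the top N peripheral segments, with N scaled
        # logarithmically to the vocabulary size (N = log|V|), then take the
        # majority label among them.
        n = max(1, int(math.log(vocab_size)))
        top = counts.most_common(n)
        suffix_votes = sum(1 for (label, _), _ in top if label == "suffix")
        return "suffix" if suffix_votes >= n / 2 else "prefix"

With the Figure 2 parses of “apportioned”, for example, “app” and “ap” would contribute prefix votes while “ed” and “ioned” would contribute suffix votes.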
3.4 Creating paradigm clusters

Once we have inferred the adfix direction for a language, we use a greedy iterative procedure over words to identify and score potential clusters. Our scoring metric is frequency-based, motivated by the observation that inflectional morphology (such as the “-s” in “actionables”) tends to be more frequent across word types relative to derivational morphology (such as the “-able” in “actionables”). Yarowsky and Wicentowski (2000) have demonstrated the value of frequency metrics in aligning inflected forms from the same lemma.

We start with no assigned clusters and iterate through the vocabulary in alphabetical order.³ For each word w which has not yet been assigned to a cluster, we identify the most likely cluster using the following procedure.

Find possible stems  Identify each possible stem s from all of the segmentations for w, where the “stem” comprises the entire substring up to a peripheral adfix. For example, based on the two parses shown in Figure 2, “apportion” and “apport” would constitute possible stems for the word “apportioned”. The word w in its entirety is also considered as a possible stem.

Find possible cluster members  For each stem s, identify other words in the corpus which might share that stem, forming a potential cluster c_s. A word potentially shares a stem if it shares the same substring from the non-adfixing direction — so a stem is a shared prefix substring in a suffixing language like English, and vice-versa for a prefixing language like Navajo. For each word w_i that is identified this way, the rest of the string outside of the possible stem s is a possible adfix a_i. In the example from Figure 2, if “apportions” were also in the corpus, it would be added to the cluster for the stem “apportion”, with “-s” as the adfix a_i. Similarly, it would also be considered in the cluster for the stem “apport”, with adfix “-ions”.

Score cluster members  For each word w_i in c_s, calculate a score x_i:

    x_i = √(A^w_i) · log(A_i)    (1)

where A_i is the normalized overall frequency of the ith adfix a_i (suffix or prefix) per 10,000 types in the corpus of 6 segmentations, and A^w_i is the proportion of segmentations of the ith word w_i which contain the adfix a_i. For example, if “apportioned” were in consideration for a hypothetical cluster based on the stem “apportion”, A_i would be the normalized corpus frequency of “-ed”, and A^w_i would be .5 (assuming only the two segmentations shown in Figure 2). For a cluster with the stem “apport”, A_i would be the normalized frequency of “-ioned”, and A^w_i would still be .5.

Intuitively, when evaluating a single word, Eq. 1 assumes that adfixes which appear frequently in the segmented corpus overall are more likely to be inflectional, so words with more frequent adfixes are more likely paradigm members (the log(A_i) term). For instance, the high frequency of the “-s” suffix in English will increase the score of any word with an “-s” suffix in its segmentation (e.g. “apportion-s”). Eq. 1 also assumes that, for all segmentations of this particular word w_i, adfixes which appear in a higher proportion of segmentations are more reliable (the √(A^w_i) term), so the more times some AG samples the “apportion-s” segmentation, the higher the score for “apportions” membership in the “apportion”-stem paradigm. The square root transform was selected based on development set performance, and has not been tuned extensively.

Filter and score clusters  For each possible stem cluster c_s, filter out words whose score x_i is below the score threshold hyperparameter t, to create a new cluster c′_s. Calculate the cluster score x_s by taking the average of x_i for only those words in c′_s, i.e. only words with score x_i ≥ t. The value for t is selected via grid search on the development set. We found that setting t = 2 maximized F1 across the development languages as a whole.

Select cluster  Select the potential cluster c′_s with the highest score, and assign w to that cluster, along with each word w_i in c′_s.

³ The method is relatively insensitive to order, except reversed alphabetical order, which is worse for most languages.
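Read end-to-end, Section 3.4 amounts to a greedy loop over the vocabulary. The following condensed sketch is our own paraphrase of that loop for a suffixing language, with invented helper names such as possible_stems and words_with_stem and a simplified adfix lookup; it is not the released implementation.

    import math

    def eq1_score(adfix, word, adfix_freq, word_segmentations):
        # Eq. 1: x_i = sqrt(A_i^w) * log(A_i)
        a_i = adfix_freq.get(adfix, 1e-3)       # adfix frequency per 10k types
        segs = word_segmentations[word]         # the 6 AG segmentations of w_i
        a_w = sum(1 for s in segs if s[-1] == adfix) / len(segs)
        return math.sqrt(a_w) * math.log(a_i)

    def build_clusters(vocab, possible_stems, words_with_stem,
                       adfix_freq, word_segmentations, t=2.0):
        clusters, assigned = [], set()
        for w in sorted(vocab):                 # alphabetical iteration order
            if w in assigned:
                continue
            best = ([w], float("-inf"))
            for stem in possible_stems(w):      # stems from w's segmentations
                members, scores = [w], []
                for w_i in words_with_stem(stem):   # shares the stem substring
                    if w_i == w:
                        continue
                    x_i = eq1_score(w_i[len(stem):], w_i,
                                    adfix_freq, word_segmentations)
                    if x_i >= t:                # filter below-threshold members
                        members.append(w_i)
                        scores.append(x_i)
                if scores and sum(scores) / len(scores) > best[1]:
                    best = (members, sum(scores) / len(scores))
            clusters.append(best[0])            # select the highest-scoring cluster
            assigned.update(best[0])
        return clusters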
Language    Precision  Recall  F1
Maltese     .30        .30     .30
Persian     .54        .52     .53
Portuguese  .92        .91     .91
Russian     .83        .82     .82
Swedish     .85        .81     .83
Mean        .69        .67     .68

Table 1: Performance on development languages

Language    Precision  Recall  F1
Basque      .33        .32     .32
Bulgarian   .83        .80     .82
English     .91        .90     .90
Finnish     .61        .60     .60
German      .79        .79     .79
Kannada     .82        .59     .69
Navajo      .43        .42     .42
Spanish     .85        .82     .84
Turkish     .74        .73     .73
Mean        .70        .66     .68

Table 2: Performance on test languages

4 Results and Discussion

Performance was evaluated using the script provided by the shared task organizers. Table 1 shows the results for the development languages, and Table 2 shows the results for the test languages. While the average F1 score ends up being quite similar for both development and test languages, it’s clear within both groups that there are large differences in performance across different languages.

4.1 Error analysis and ways to improve

Noncontiguous stems  The clustering method described in Section 3.4 makes an unjustifiably strong assumption that stems are contiguous substrings, which effectively eliminates its ability to represent nonconcatenative morphology. This limitation contributes to the low score on Maltese, a Semitic language which includes root-and-pattern morphology for certain verbs. The model further assumes that the left or right edge of a word — the side opposite from the adfix direction — is contiguous with the stem. This leads to errors on German, as most verbs have a circumfixing past participle form “ge- + -t” or “ge- + -en”. For example, the model correctly assigns “ändern”, “änderten”, and “ändert” to the same cluster, but incorrectly assigns “geändert” to a separate cluster. We estimate that roughly 30% of the model’s incorrect German predictions stem from this issue. This limitation also contributed to our model’s poor performance on Basque, which, like Maltese, uses both prefixing and suffixing inflection to express polypersonal agreement.⁴

One obvious way to improve this issue would be to use an extension of the AG framework which can represent nonconcatenative morphology. Botha and Blunsom (2013) present such an extension, replacing the PCFG with a Probabilistic Simple Range Concatenating Grammar. They report successful results for unsupervised segmentation on Hebrew and Arabic. On the other hand, it’s unclear whether such a nonconcatenative-focused approach could also adequately represent concatenative morphology. Fullwood and O’Donnell (2013) explore a similar framework, using Pitman-Yor processes to sample separate lexica of roots, templates, and “residue” segments; they find that their model works well for Arabic, but much less well for English. In addition, Eskander et al. (2020) report state-of-the-art morphological segmentation for Arabic using the PrStSu+SM grammar which we also use here. Their findings suggest that, rather than changing the AG framework, we might attempt a more intelligent clustering method based on noncontiguous segmented subsequences rather than contiguous substrings.

Irregular morphology  The strong assumption of contiguous substrings as stems also hinders accurate clustering of irregular forms of any kind, from predictable stem alternations (such as umlaut in German and Swedish, or theme vowels in Portuguese and Spanish) to more challenging suppletive forms such as English “go”-“went”. The latter likely requires additional input from semantic representations, but semiregular alternations in forms could also be handled in principle by a more intelligent clustering process. On this point, we note that some small but significant fraction of AG parses of Portuguese verbs grouped verbal theme vowels and inflections together (e.g. parsing “apresentada” as “apresent-ada” rather than “apresenta-da”, “apresentarem” as “apresent-arem” rather than “apresenta-rem”, and so on), and these parses were crucial to our model’s relatively high performance on Portuguese.

⁴ We thank an anonymous reviewer for bringing this to our attention.
Derivation vs. inflection  Another issue is that the parses sampled by AGs do not distinguish between inflectional and derivational morphology. This is apparent in Figure 2, where both grammars parse “apportioned” with “-ioned” as the suffix. We seek to address this issue with frequency-based metrics in our clustering method, but frequent derivational adfixes often score high enough to be assigned a wrong paradigm cluster. For example, in English our model correctly clusters “allow”, “allows”, and “allowed” together, but it also incorrectly assigns “allowance” to the same cluster.

A straightforward way to handle this within our existing approach would be to allow language-specific variation of the score threshold t. As we had no method for unsupervised estimation of t for unfamiliar languages, we did not pursue this; however, a researcher who had minimal familiarity with the language in question might be able to select a more sensible value for t based on inspecting the clusters. Beyond that, the distinction between inflectional and derivational morphology is an intriguing and contested issue within linguistics (e.g. Stump, 2005), and the question of how to model it computationally requires much more attention.

4.2 Things that didn’t work

We attempted a number of unsupervised approaches beyond AG segmentations, with the goal of incorporating them during the clustering process; however, we could not consistently improve performance with any of them. It seems likely to us that these methods could still be used to improve AG-segmentation-based clusters, but we could not find immediately obvious ways to do this.

FastText  As the AG framework only models word structure based on form, we hoped to use the distributional representations learned by FastText (Bojanowski et al., 2017) to incorporate semantic and syntactic information into our model’s clusters. We tried several different approaches without success. 1) We trained a skipgram model with a context window of 5 words, a setting often used for semantic applications, in hopes that words from the same paradigm might have similar semantic representations. Agglomerative clustering on these representations alone yielded much worse clusters than the AG method, and we could not find a way to combine them successfully with the AG clusters. 2) Erdmann et al. (2020) trained a skipgram model with a context window of 1 word and a minimum subword length of 2 characters, and used it to cluster words from the same cell rather than the same paradigm (e.g. clustering together English verbs in the third person singular such as “walks” and “jumps”). We attempted to follow this procedure, but it proved too difficult, as paradigm cell information was not explicitly included in the development data for this shared task. 3) We used the method described by Bojanowski et al. (2017) to identify important subwords within a word, in hopes of combining them with AG segmentations. However, the identified subwords did not consistently align with stem-adfix segmentations as we had hoped, and did not seem to provide any additional benefit.

Brown clustering  Part of speech tags could provide latent structure as a higher-order grouping for paradigm clusters — for example, verbs would be expected to have paradigms more similar to other verbs than to nouns. Brown clusters (Brown et al., 1992) have been used for unsupervised induction of word classes approximating part of speech tags. We used a spectral clustering algorithm (Stratos et al., 2014) to learn Brown clusters, but they did not reliably correspond to part of speech categories on our development language data.

5 Conclusion

The Adaptor Grammar framework has previously been applied to unsupervised morphological segmentation. In this paper, we demonstrate that AG segmentations can be used for the related task of unsupervised paradigm clustering with successful results, as shown by our system’s performance in the 2021 SIGMORPHON Shared Task.

We note that there is still considerable room for improvement in our clustering procedure. Two key directions for future development are more sophisticated treatment of nonconcatenative morphology, and incorporation of additional sources of information beyond the word form alone.

Acknowledgments

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and a James S McDonnell Foundation Scholar Award (#220020374) to the second author.
References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jan A. Botha and Phil Blunsom. 2013. Adaptor grammars for learning non-concatenative morphology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 345–356.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 Shared Task—Morphological Reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The Paradigm Discovery Problem. arXiv:2005.01630 [cs].

Ramy Eskander, Francesca Callejas, Elizabeth Nichols, Judith Klavans, and Smaranda Muresan. 2020. MorphAGram, Evaluation and Framework for Unsupervised Morphological Segmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7112–7122, Marseille, France. European Language Resources Association.

Ramy Eskander, Owen Rambow, and Tianchun Yang. 2016. Extending the Use of Adaptor Grammars for Unsupervised Morphological Segmentation of Unseen Languages. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 900–910, Osaka, Japan. The COLING 2016 Organizing Committee.

Michelle Fullwood and Tim O’Donnell. 2013. Learning non-concatenative morphology. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), pages 21–27, Sofia, Bulgaria. Association for Computational Linguistics.

Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, and Katharina Kann. 2020. Unsupervised Morphological Paradigm Completion. arXiv:2005.00970 [cs].

Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 317–325, Boulder, Colorado. Association for Computational Linguistics.

Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007a. Bayesian Inference for PCFGs via Markov Chain Monte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139–146, Rochester, New York. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007b. Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Bernhard Schölkopf, John Platt, and Thomas Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Jim Pitman and Marc Yor. 1997. The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. The Annals of Probability, 25(2):855–900.
Kairit Sirts and Sharon Goldwater. 2013. Minimally-Supervised Morphological Segmentation using Adaptor Grammars. Transactions of the Association for Computational Linguistics, 1:255–266.

Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu. 2014. A spectral algorithm for learning class-based n-gram models of natural language. Proceedings of the Association for Uncertainty in Artificial Intelligence.

Gregory T. Stump. 2005. Word-formation and inflectional morphology. In Handbook of Word-Formation, volume 64, pages 49–71. Springer, Dordrecht, The Netherlands.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online. Association for Computational Linguistics.

Adam Wiemerslage, Arya McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, and Katharina Kann. 2021. The SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally Supervised Morphological Analysis by Multimodal Alignment. pages 207–216.

A PCFGs

Our system uses the following two grammar specifications, developed by Eskander et al. (2016, 2020). Nonterminals are adapted by default. Non-adapted nonterminals are preceded by 1, indicating an expansion probability of 1, i.e. the PCFG always expands this rule and never caches it.

A.1 Simple+SM

1 Word --> Prefix Stem Suffix
Prefix --> ˆˆˆ SubMorphs
Prefix --> ˆˆˆ
Stem --> SubMorphs
Suffix --> SubMorphs $$$
Suffix --> $$$
1 SubMorphs --> SubMorph SubMorphs
1 SubMorphs --> SubMorph
SubMorph --> Chars
1 Chars --> Char
1 Chars --> Char Chars

A.2 PrStSu+SM

1 Word --> Prefix Stem Suffix
Prefix --> ˆˆˆ
Prefix --> ˆˆˆ PrefMorphs
1 PrefMorphs --> PrefMorph PrefMorphs
1 PrefMorphs --> PrefMorph
PrefMorph --> SubMorphs
Stem --> SubMorphs
Suffix --> $$$
Suffix --> SuffMorphs $$$
1 SuffMorphs --> SuffMorph SuffMorphs
1 SuffMorphs --> SuffMorph
SuffMorph --> SubMorphs
1 SubMorphs --> SubMorph SubMorphs
1 SubMorphs --> SubMorph
SubMorph --> Chars
1 Chars --> Char
1 Chars --> Char Chars
Orthographic vs. Semantic Representations for Unsupervised
Morphological Paradigm Clustering

E. Margaret Perkoff*, Josh Daniels†, Alexis Palmer†
* Dept. of Computer Science, † Dept. of Linguistics
University of Colorado Boulder
{margaret.perkoff, joda4370, alexis.palmer}@colorado.edu

Abstract

This paper presents two different systems for unsupervised clustering of morphological paradigms, in the context of the SIGMORPHON 2021 Shared Task 2. The goal of this task is to correctly cluster words in a given language by their inflectional paradigm, without any previous knowledge of the language and without supervision from labeled data of any sort. The words in a single morphological paradigm are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also - usually - show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Additionally, we discuss the merits of randomly initialized centroids versus pre-defined centroids for clustering. Pre-defined centroids are identified based on either a standard longest common substring algorithm or a connected graph method built off of longest common substring. For all development languages, the character-based embeddings perform similarly to the baseline, and the semantic embeddings perform well below the baseline. Analysis of the systems’ errors suggests that clustering based on orthographic representations is suitable for a wide range of morphological mechanisms, particularly as part of a larger system.

Surface Forms        Morphological Features
walk      bring      V; 1SG; 2SG; 3PL; 1PL
walks     brings     V; 3SG
walking   bringing   PRES; PART
walked    brought    PAST

Table 1: Morphological paradigms for the English verbs walk and bring.

1 Introduction

One significant barrier to progress in morphological analysis is the lack of available data for most of the world’s languages. As a result, there is a dramatic divide between high and low resource languages when it comes to performance on automated morphological analysis (as well as many other language-related tasks). Even for languages with resources suitable for computational morphological analysis, there is no guarantee that the available data in fact covers all important aspects of the language, leading to significant error rates on unseen data. This uncertainty regarding training data makes unsupervised learning a natural modeling choice for the field of computational morphology. The unsupervised setting takes away the need for large quantities of labeled text in order to detect linguistic phenomena. The SIGMORPHON 2021 shared task aims to leverage the unsupervised setting in order to identify morphological paradigms, at the same time including languages with a wide range of morphological properties.

For a given language, the morphological paradigms are the models that relate root forms (or lemmas) of words to their surface forms. The task we tackle is to cluster surface word forms into groups that reflect the application of a morphological paradigm to a single lemma. The lemma of the paradigm is typically the dictionary citation form, and the corresponding surface forms are inflected variations of that lemma, conveying grammatical properties such as tense, gender, or plurality. For example, Table 1 displays partial clusters for two English verbs: walk and bring.

In developing our system, we consider two types of information that could reasonably play a role in unsupervised paradigm induction. First, the words in a single paradigm cluster are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning.
They also - usually - show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Intuitively, we would expect the cluster of forms for walk to be recognizable largely based on orthographic similarity. The partially irregular cluster for bring shows greater orthographical variability in the past-tense form brought and so might be expected to require information beyond orthographic similarity.

System Overview.  The core of our approach is to cluster unlabelled surface word forms using KMeans clustering; a complete architecture diagram can be seen in Figure 1. After reading the input file for a particular language to identify the lexicon and alphabet, we transform each word into two different types of vector representations. To capture semantic information, we train Word2Vec embeddings from the input data. The orthography-based representations we learn are character embeddings, again trained from the input data. Details for both representations appear in section 4.1. For the experiments in this paper, we test each type of representation separately, using randomly initialized centers for the clustering. In later work, we plan to explore the integration of both types of representations. We would also like to explore the use of pre-defined centers for clustering. These pre-defined centers could be provided using either a longest common substring method or a graph-based algorithm such as that described in section 4.3. The final output of the system is a set of clusters, each one representing a morphological paradigm.

2 Previous Work

The SIGMORPHON 2020 shared task set included an open problem calling for unsupervised systems to complete morphological paradigms. For the 2020 task, participants were provided with the set of lemmas available for each language (Kann, 2020). In contrast, the 2021 SIGMORPHON task 2 outlines that submissions are unsupervised systems that cluster input tokens into the appropriate morphological paradigm (Nicolai et al., 2020). Given the novelty of the task, there is a lack of previous work done to cluster morphological paradigms in an unsupervised manner. However, we have identified key methods from previous work in computational morphology and unsupervised learning that could be combined to approach this problem.

Previous work has identified the benefit of combining rules based on linguistic characteristics with machine learning techniques. Erdmann et al. (2020) established a baseline for the Paradigm Discovery Problem that clusters the unannotated sentences first by a combination of string similarity and lexical semantics and then uses this clustering as input for a neural transducer. Erdmann and Habash (2018) investigated the benefits of different similarity models as they apply to Arabic dialects. Their findings demonstrated that Word2Vec embeddings significantly underperformed in comparison to the Levenshtein distance baseline. The highest performing representation was a combination of FastText and a de-lexicalized morphological analyzer. The FastText embeddings (Bojanowski et al., 2016) have the benefit of including sub-word information by representing words as character n-grams. The de-lexicalized analyzer relies on linguistic expert knowledge of Arabic to identify the morphological closeness of two words. In the context of the paper, it is used to prune out word relations that do not conform to Arabic morphological rules. The approach mentioned greatly benefits from the use of a morphological analyzer, something that is not readily available for low-resource languages. Soricut and Och (2015) focused on the use of morphological transformations as the basis for word representations. Their representation can be quite accurate for affix-based morphology.

Our representations are based entirely off of unlabelled data and do not require linguistic experts to provide morphological transformation rules for the language. Additionally, we hoped to create a system that would be robust for languages that include non-affix based morphology. In this work we compare Word2Vec representations to character-based representations to represent orthography. We have not yet evaluated additional representations or combinations of the two.

3 Task overview

The 2021 SIGMORPHON Shared Task 2 created a call for unsupervised systems that would create morphological paradigm clusters. This was intended to build upon the shared task from 2020 that focused on morphological paradigm completion. Participants were provided with tokenized Bible data from the JHU bible corpus (McCarthy et al., 2020) and gold standard paradigm clusters for five development languages: Maltese, Persian, Portuguese, Russian and Swedish. Teams could use this data to train their systems and evaluate against the gold standard files as well as a baseline. The baseline provided groups together words that share a substring of length n and then removes any duplicate clusters. The resulting systems were then used to cluster tokenized data from a set of test languages including: Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.
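As we understand the provided baseline, it clusters purely on shared character substrings. The sketch below is our reading of that idea, not the organizers' actual script; the substring length n is an assumption.

    def baseline_clusters(vocabulary, n=5):
        """Group words that share any character substring of length n,
        then drop duplicate clusters (rough re-reading of the shared-task
        baseline described above; n is an assumed value)."""
        clusters = []
        for word in vocabulary:
            substrings = {word[i:i + n] for i in range(len(word) - n + 1)}
            cluster = sorted(
                w for w in vocabulary
                if any(w[i:i + n] in substrings for i in range(len(w) - n + 1))
            ) or [word]
            clusters.append(tuple(cluster))
        return [list(c) for c in set(clusters)]  # remove duplicate clusters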
4 System Architecture

The overall architecture of our system includes several distinct pieces as demonstrated in Figure 1. For a given language, we read the corpus text provided and generate a lexicon of unique words. The lexicon is then fed to an embedding layer and an optional lemma identification layer. The embedding layer generates a vector representation of each word based on either a character level embedding or a Word2Vec embedding. When used, the lemma identification layer generates a set of predefined lemmas from the lexicon based on either the standard longest common substring or a connected graph formed from the longest common substring. Resulting word embeddings along with the optional set of predefined lemmas are used as input to a KMeans clustering algorithm. In the event predefined lemmas are not provided, the system defaults to using a randomly initialized set of centroids. Otherwise, the initial centroids for the clusters are the result of finding the appropriate word embedding for the lemmas identified. Once a cluster has been created, the output cluster predictions are formatted into a paradigm dictionary which can be written to a file for evaluation.

4.1 Word Representations

We create two different types of word representations, aiming to capture information that may reflect the relatedness of words within a paradigm.

Character Based Embeddings.  To capture orthographic information, we generate a character-based word embedding for the language. For each language we do the following:

1. Generate a lexicon of all the words in the development corpus and an alphabet of unique characters in the language.

2. Identify the maximum word length of the lexicon.

3. Create a dictionary of the alphabet where each character corresponds to a float value between 0 (non-inclusive) and 1 (inclusive).

4. For each word:
   (a) Initialize an array of zeros the same size as the maximum length word.
   (b) Map each character in the word, in order, to its respective float value based on our alphabet dictionary. Leave the remaining values as zero.

This representation focuses purely on the characters of the language. For the time being, it does not take into account the relationship between orthographic characters in any of the languages but future work could attempt to create smarter numerical representations based on these relationships.

Word Embeddings with Word2Vec.  To incorporate semantic and syntactic information, we use the Word2Vec embeddings. Specifically, we train a Word2Vec model for each language with the Gensim skip-gram representations (Řehůřek and Sojka, 2010).
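To make the two representations concrete, here is a minimal sketch of how they could be built. The character-value scaling and the Word2Vec hyperparameters shown are illustrative assumptions rather than the exact settings used in our experiments.

    import numpy as np
    from gensim.models import Word2Vec

    def character_embeddings(lexicon):
        """Fixed-length vectors: each character maps to a float in (0, 1],
        and positions beyond the word's length stay zero (Section 4.1)."""
        alphabet = sorted({ch for word in lexicon for ch in word})
        char_value = {ch: (i + 1) / len(alphabet) for i, ch in enumerate(alphabet)}
        max_len = max(len(word) for word in lexicon)
        vectors = np.zeros((len(lexicon), max_len))
        for row, word in enumerate(lexicon):
            for col, ch in enumerate(word):
                vectors[row, col] = char_value[ch]
        return vectors

    def word2vec_embeddings(tokenized_sentences):
        """Skip-gram Word2Vec trained on the raw corpus with Gensim
        (window size and dimensionality here are placeholder values)."""
        model = Word2Vec(sentences=tokenized_sentences, sg=1,
                         vector_size=100, window=5, min_count=1)
        return model.wv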
Figure 1: Overall Statistical Clustering Architecture diagram. There are two possible word embedding algorithms represented in the diagram (left side of split). The optional lemma identification layer also includes two possible methods (right side).

4.2 (Optional) Lemma Identification

LCS Graph Formation  One of the challenges of using clustering-based methods on this problem is determining the number of morphological paradigms expected to be present and then finding suitable lemmas for each to serve as centers for clustering. One potential approach to find lemmas is to first arrange the words into a network graph based on the longest common substring relationships between them. Specifically, for each attested word W in a language’s data, the longest common substring (LCS) is calculated between W and every other attested word in the language. Graph edges are then constructed between W and the word (or words if there are multiple with the same length LCS) that have the longest LCS with W. This process is repeated for every word in the given language’s corpus. This results in a large graph that appears to capture many of the morphological dependencies within the language.

Next, we split the graph into highly connected subgraphs (HCSs). HCS are defined as graphs in which the number of edges that would need to be cut to split the graph into two disconnected subgraphs is greater than one half of the number of nodes. This is helpful because in the LCS graphs generated, morphologically related forms tend to be connected relatively densely to each other and only weakly connect to forms from other paradigms. Additionally, the use of a threshold based algorithm like HCS, unlike other clustering methods, would allow lemmas to be extracted without having to prespecify the expected number of lemmas beforehand. Unfortunately, during testing the HCS graph analysis proved computationally taxing and was unable to be completed in time for evaluation, though qualitative analysis of the generated LCS graphs suggests the technique may still be useful with better computational power. We will explore this method further in future work.

4.3 Clustering

The word representations described in section 4.1 are used as input to a clustering algorithm. We use the KMeans algorithm as defined by the sklearn implementation (Pedregosa et al., 2011). The KMeans approach is one of the pioneering algorithms in unsupervised learning (MacQueen et al., 1967). Input values are grouped by continuously shifting clusters and their centers while attempting to minimize the variance of each cluster. This indicates that the cluster that a particular word is assigned to should be as close (as defined by Euclidean distance) to the cluster’s center, or the lemma word, as possible.

Clustering with Randomly Initiated Centers.  For comparison, we evaluate the effectiveness of using randomly initialized centers for our clusters. In the context of this task, this means that the first set of centers fed to the algorithm do not necessarily correspond to any valid word in the given language, or perhaps any language. Another obstacle for this approach in an unsupervised setting is defining the number of clusters to use. Identifying this requires human interference with hyper-parameters that are not going to be cross-linguistically relevant. The size of the input bible corpus and the inflectional morphology of the language both directly impact the number of clusters, or the number of lemmas, that are relevant. We used a range of cluster sizes for the development languages from 100 to 6000 to evaluate which ones provided the highest accuracy. For the test languages, we chose to submit results for clusters of size 500, 1000, 1500, and 1900 to assess performance variability based on number of lemmas.

Extension: Initializing with Non-Random Centers.  The use of non-random centers would have multiple benefits in the context of this task. This approach would incorporate linguistic information to inform the initial set of centers. This could lead to quicker convergence of a model due to more intelligently picked centers. It could also prevent the model from being skewed towards less than ideal center values. Additionally, with pre-defined centers we can remove the need to arbitrarily define the number of clusters.
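A minimal sketch of this clustering step with scikit-learn follows, assuming the character-based vectors from Section 4.1. Passing pre-defined centroids through the init argument corresponds to the optional extension discussed above; the cluster count and function names are arbitrary examples, not our exact configuration.

    from sklearn.cluster import KMeans

    def cluster_words(lexicon, vectors, n_clusters=1500, lemma_vectors=None):
        """Cluster word vectors with KMeans; optionally seed the centroids
        with the embeddings of pre-identified lemmas (Section 4.3)."""
        if lemma_vectors is not None:
            km = KMeans(n_clusters=len(lemma_vectors), init=lemma_vectors, n_init=1)
        else:
            km = KMeans(n_clusters=n_clusters, init="random")  # random centers
        labels = km.fit_predict(vectors)

        paradigms = {}
        for word, label in zip(lexicon, labels):
            paradigms.setdefault(label, []).append(word)
        return list(paradigms.values())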
In the scope of this task, we were unable to experiment with pre-defined center values but we have proposed two potential methods for doing so: using longest common substrings and picking highly connected nodes from an LCS graph formation. The longest common substring approach would mimic the lemma identification approach described above (4.2). Both of these systems are represented as an optional lemma identification layer on the right hand side of Figure 1. The output of each one would be a set of words to use as centers. Each word would be converted to the appropriate word representation and then fed as an input to the KMeans clustering.

Language    BL    KMW2V  KMCE
Maltese     0.29  0.19   0.25
Persian     0.30  0.18   0.36
Portuguese  0.34  0.06   0.24
Russian     0.36  0.11   0.34
Swedish     0.44  0.18   0.45

Table 2: F1 Scores for each of the model types on all development languages. The best F1 scores are in bold. BL is Baseline, KMW2V is KMeans with Word2Vec embeddings, and KMCE is KMeans with Character Embeddings.

Language    BL    500   1000  1500  1900
Basque      0.21  0.29  0.31  0.27  0.29
Bulgarian   0.39  0.21  0.27  0.29  0.30
English     0.52  0.29  0.37  0.43  0.45
Finnish     0.29  0.19  0.24  0.26  0.27
German      0.38  0.26  0.33  0.38  0.40
Kannada     0.24  0.29  0.30  0.30  0.29
Navajo      0.33  0.33  0.38  0.39  0.41
Spanish     0.39  0.24  0.29  0.30  0.31
Turkish     0.25  0.16  0.20  0.22  0.22

Table 3: F1 Scores for the baseline (BL) and the KMCE models on the test languages. The best F1 scores are in bold. Test languages were evaluated on KMCE models with clusters of size 500, 1000, 1500, and 1900.

5 Results

Table 2 shows results to date. We compare the two representation methods on the development languages. The KMeans clusterings for the development languages were generated based on optimal cluster values starting with size 100 and increasing to a cluster size of 6000, or until the accuracy no longer improved from an increase in cluster size. For the Word2Vec embeddings we used clusterings of size 110 for Maltese, 130 for Persian, 1490 for Portuguese, 1490 for Russian, and 1490 for Swedish. With the character embeddings, we had 540 clusters for Maltese, 110 clusters for Persian, 2200 clusters for Portuguese, 4000 clusters for Russian, and 5400 clusters for Swedish. The F1 scores provided are based on comparing the appropriate model’s predictions to the gold paradigms for this task using the evaluation function defined in the SIGMORPHON 2021 Task 2 github repository. The KMCE models clearly and consistently outperform the KMW2V models, for all development languages.

For test languages, we run clustering only with the better-performing character-based representations. The performance on test languages was evaluated with clusters of size 500, 1000, 1500, and 1900. These results are in Table 3. We found that our algorithm outperformed the baseline for Basque, German, Kannada, and Navajo. For both Basque and Kannada, the largest clustering did not have the highest result, suggesting that the corpora provided for these languages contain a smaller number of morphological paradigms. In the case of Bulgarian, English, Spanish, and Finnish, we note that the KMCE model performance increases with each increase in cluster size. This suggests that the model accuracy would continue increasing if we ran the model for these languages with a higher number of clusters. Additional discussion of the error analysis appears in section 6, and of the results in section 7.

6 Error Analysis

We have evaluated the results from the Word2Vec representations and our character-based embedding and compared them to the gold standard paradigms provided by the task organizers. We have found that, overall, the character-based version is more robust on regular verb forms than the Word2Vec version, and that neither is effective on irregular forms. Additionally, we explore some of the nuanced errors with the character-based embeddings and how they could be addressed for future work.
anced errors with the character based embeddings handle irregular verb forms.
and how they could be addressed for future work.

6.1 Regular Verb Forms

Our results are consistent with our initial expectation that an orthographic word representation would perform better on regular verb forms than the Word2Vec representation, since it weights closeness based on the characters of the word. The character embedding correctly groups many surface forms together based on regular English morphological paradigms, or those that follow the pattern of -ed for past tense, -s for third person singular present, -∅ for first, second and third person plural present. However, there were sometimes words missing from the paradigm. For example, the system generated the paradigm {stumble, stumbles, stumbling}. This should have included stumbled, but instead that is in a paradigm with thaddaeus. In contrast, the Word2Vec representation separates all four surface forms into different morphological paradigms. For Spanish paradigms, we see that the character embeddings perform well for matching some of the regular surface forms together, but cannot handle longer suffixes. For example, aprendas, aprendan and aprenden are grouped together while leaving out longer surface forms like aprenderemos. Similarly, hablaste, hablaras, hablaran and hablara are grouped in the same morphological paradigm, but hablar, hablan, hablar, hables and hablen are part of a separate grouping. We discuss the issue of errors related to word length in detail below.

6.2 Irregular Verb Forms

Because it focuses on semantic relatedness, we expected the Word2Vec representation to be more accurate in grouping together irregular surface forms from the same paradigm. For example, we have the paradigm for go: {go, goes, going, went}. In fact, Word2Vec created a morphological paradigm for go, one for {upstairs, carry, goes, favour}, {going, reply, robbed, dared, gaius, failed, godless}, and one for went. The orthographical representation also produced some undesired results, with a paradigm for {eyes, goat, goes, gone, gong, else, eloi, noah, none, sons, long} and {gains, noisy, lasea, lysias, gates, lying, fatal, often, notes, loses, latin, latest, going}. The other surface forms of went and go also ended up in separate morphological paradigms. These results suggest that neither representation is currently robust enough to handle irregular verb forms.

6.3 Character Distance Errors

In some cases, the character representations result in strange cluster formations due to the usage of Euclidean distance in the sklearn KMeans library. Since each character in the language's alphabet was mapped arbitrarily to a numeric value, the closeness of a pair of characters does not reflect a morphological relationship between those symbols. However, characters that are assigned to numerical values that are closer to one another will be classified as closer by the Euclidean distance algorithm. It would be possible to learn more about the language specific character relations by training a recurrent network with a focus on the character sequence alignments. This network could then be used as an encoder to generate character level embeddings.

6.4 Non-Affix Based Morphology

For verb forms in English that do not use a regular affixation paradigm, we find that some surface forms are paired together in the correct clusters, but those clusters often contain additional unrelated words. Consider the following group: {drank, drinks, drink, drunk, breaks, break, branch}. In this cluster, we see that drank, drinks, drink and drunk were all correctly identified as being related. The algorithm also matched break and breaks together. This suggests that character representations have the potential to identify morphological transformations that occur at different points in the word, as opposed to just prefixes or suffixes. However, the result is a combination of what should be three distinct morphological paradigms, including a unique paradigm for branch. In Navajo, jidooleeł and dadooleeł are correctly put in the same paradigm. However, this paradigm also includes jidooleełgo and dadooleełgo, which are not morphologically related. We also see the tendency of over-grouping in Basque, where bitzate, nitzan, and ditzan are all grouped together along with over ten other unrelated forms. This could potentially be addressed by increasing the number of clusters to favor smaller clusters. Adding semantic features to the word embeddings such as part of speech or limited context windows may also help filter out words that are not relevant to a particular paradigm.
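To make the character-distance issue described in section 6.3 concrete, the following minimal sketch (with a hypothetical three-character alphabet and an arbitrary integer mapping; this is not the system's actual encoding) shows how Euclidean distance over arbitrarily assigned character codes makes some single-character substitutions look far "closer" than others, even though neither reflects a morphological relationship:

```python
import numpy as np

# Hypothetical alphabet with an arbitrary character-to-integer mapping.
char_to_int = {"a": 1, "b": 2, "z": 26}

def encode(word, max_len=5):
    """Fixed-length vector of character codes, zero-padded on the right."""
    vec = np.zeros(max_len)
    for i, ch in enumerate(word):
        vec[i] = char_to_int[ch]
    return vec

# Both pairs differ in exactly one substituted character, yet the arbitrary
# codes make one substitution look 25 times "closer" than the other.
print(np.linalg.norm(encode("ab") - encode("aa")))  # 1.0
print(np.linalg.norm(encode("az") - encode("aa")))  # 25.0
```

Under such an encoding, cluster assignments can hinge on which integers happen to be adjacent rather than on any property of the language, which is why learned character embeddings are suggested above as a remedy.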

6.5 Word Length

Another type of cluster error has to do with word length. The word representation vectors were sized based on the largest word present in a given language's corpus. If a word is under the maximum length, the remaining vector gets filled in with zeros. This means that words that are similar in length are more likely to be paired together for a cluster. The gold data created the morphological paradigm {crowd, crowds, crowding}, while ours created two separate clusterings: {crowd, crowds} and {brawling, proposal, crowding}. This is also present in the clustering of certain words in Navajo. Our algorithm grouped nizaad and bizaad together, but some of the longer forms in this paradigm were excluded such as danihizaad and nihizaad. In future work, we would attempt to mitigate this by using subword distances or cosine similarity as the basis for distance metrics in a clustering algorithm. This could prevent inaccurate groupings due to large affix lengths.

7 Discussion, Conclusions, Future Work

Overall, these results demonstrate an improvement over the baseline in several languages, namely Persian, Swedish, Basque, German, Kannada, and Navajo, when using KMeans clustering over character embeddings. This suggests that embedding-based clustering systems merit further exploration as a potential approach to unsupervised problems in morphology. The fact that the character embedding system outperformed the W2V one and the fact that performance was strongest on words with regular inflectional paradigms suggests that this approach might be best suited to synthetic and agglutinating languages in which morphology is encoded fairly simply within the orthography of the word. Languages that rely heavily on more complex morphological processes, particularly non-concatenative morphology, would likely require an extension of this system that integrates more sources of non-orthographic information, or a different approach altogether.

One obvious avenue for building on this research is to find more efficient and more effective methods for the initial process of lemma identification. Developing a set of lemmas would allow a pre-defined set of centers to be fed into the clustering algorithm rather than using randomly defined centers, which would likely improve performance. This could be done by leveraging an initial rule based analysis or through the threshold-based graph clustering technique discussed above. Other potential variations on that approach, once the problem of computational limits has been solved, include using longest common sequences rather than longest common substrings, and weighting graph edges by the length of the LCS between the two words. The former would potentially help accommodate forms of non-concatenative morphology, while the latter would potentially include more information about morphological relationships than an unweighted graph does. Future research should also explore how other sources of linguistic information could be leveraged for this task. This could include other forms of semantic information outside of the context-based semantics used by W2V, as well as things like the orthographic-phonetic correspondences in a given language.

Finally, we would like to explore filtering of the output clusters according to language-specific properties in order to improve the overall results. This would involve adding additional layers to our system architecture that take place after a distance-based clustering. One such layer could prune unlikely clusters based on morphological transformations, such as the method used by Soricut and Och (2015). Future unsupervised systems for clustering morphological paradigms should consider the benefits of hierarchical models that leverage different algorithm types to gain the most information possible.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Alexander Erdmann and Nizar Habash. 2018. Complementary strategies for low resourced morphological modeling. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 54–65, Brussels, Belgium. Association for Computational Linguistics.

Katharina Kann. 2020. Acquisition of inflectional morphology in artificial neural networks with prior knowledge. In Proceedings of the Society for Computation in Linguistics 2020, pages 144–154, New York, New York. Association for Computational Linguistics.
J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297. University of California Press.
Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron
Mueller, Winston Wu, Oliver Adams, Garrett Nico-
lai, Matt Post, and David Yarowsky. 2020. The
Johns Hopkins University Bible corpus: 1600+
tongues for typological exploration. In Proceed-
ings of the 12th Language Resources and Evaluation
Conference, pages 2884–2892, Marseille, France.
European Language Resources Association.
Garrett Nicolai, Kyle Gorman, and Ryan Cotterell, edi-
tors. 2020. Proceedings of the 17th SIGMORPHON
Workshop on Computational Research in Phonetics,
Phonology, and Morphology. Association for Com-
putational Linguistics, Online.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Radu Soricut and Franz Och. 2015. Unsupervised mor-
phology induction using word embeddings. In Pro-
ceedings of the 2015 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
1627–1637, Denver, Colorado. Association for Com-
putational Linguistics.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks, pages 46–50, Valletta, Malta. University of Malta.

Unsupervised Paradigm Clustering Using Transformation Rules

Changbing Yang, Garrett Nicolai, Miikka Silfverberg

University of Colorado Boulder and University of British Columbia

[email protected] [email protected]

Abstract

This paper describes the submission of the CU-UBC team for the SIGMORPHON 2021 Shared Task 2: Unsupervised morphological paradigm clustering. Our system generates paradigms using morphological transformation rules which are discovered from raw data. We experiment with two methods for discovering rules. Our first approach generates prefix and suffix transformations between similar strings. Secondly, we experiment with more general rules which can apply transformations inside the input strings in addition to prefix and suffix transformations. We find that the best overall performance is delivered by prefix and suffix rules but more general transformation rules perform better for languages with templatic morphology and very high morpheme-to-word ratios.

1 Introduction

Supervised sequence-to-sequence models for word inflection have delivered impressive results in the past few years and a number of shared tasks on supervised learning of morphology have helped to raise the state of the art of this task (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020). In contrast, unsupervised approaches to morphology have received far less attention in recent years. Nevertheless, the question of whether the morphological system of a language can be discovered from raw text data alone is certainly an interesting one.

This paper describes the submission of the CU-UBC team for the SIGMORPHON 2021 Shared Task 2: Unsupervised morphological paradigm clustering (Wiemerslage et al., 2021).[1] The objective of this task is to group the distinct inflected forms of lexemes occurring in a corpus into morphological paradigms. Figure 1 illustrates the task.

[1] github.com/changbingY/Sigmorph-2021-task2

Our system generates paradigms using morphological transformation rules which are discovered from raw data. As an example, consider the rule ed → ing, which maps an English past tense verb form like walked into the present participle walking. In this paper, we use regular expressions of symbol-pairs (that is, regular relations) in the well-known Xerox formalism (Beesley and Karttunen, 2003) to denote rules: for example, ?+ e:i d:n 0:g. These rules can be applied using composition of regular relations:

[w a l k e d] .o. [?+ e:i d:n 0:g]

will result in an output form w a l k i n g. We cluster forms into the same paradigm if we can find morphological transformation rules which map one of the forms into the other. Our approach is illustrated in Figure 2.

We experiment with two methods for discovering rules, described in Section 3.3. Our first approach is inspired by work on morphology discovery by Soricut and Och (2015), who generate prefix and suffix transformations between similar strings. This idea closely parallels our approach for extracting rules. Unlike Soricut and Och (2015), however, we do not utilize word embeddings when extracting rules due to the very small size of the shared task datasets. In addition to prefix and suffix rules, we also experiment with more general discontinuous transformation rules which can apply transformations to infixes as well as prefixes and suffixes. For example, the rule

?+ i:0 ?+ e:i ?+ 0:t

would transform the input form gidem ('to bite' in Maltese) to gdimt. Our results
demonstrate that prefix and suffix rules deliver stronger performance for most languages in the shared task dataset but our more general transformation rules are beneficial for templatic languages like Maltese and languages with a high morpheme-to-word ratio like Basque.

Data: the cat meowed / the cats are meowing
Paradigms: {cat, cats} {meowed, meowing} {the} {are}
Figure 1: The unsupervised paradigm clustering task.

2 Related Work

The unsupervised paradigm clustering task is closely related to the 2020 SIGMORPHON shared task on unsupervised morphological paradigm completion (Kann et al., 2020). However, paradigm clustering systems do not infer missing forms in paradigms. Our system resembles the baseline system for the paradigm completion task (Jin et al., 2020) which also extracts transformation rules, however, in the form of edit trees (Chrupala et al., 2008).

Several approaches to unsupervised or minimally supervised morphology learning, which share characteristics with our system, have been proposed. Our rules are essentially identical to the FST rules used by Beemer et al. (2020) for the task of supervised morphological inflection. Likewise, Durrett and DeNero (2013) and Ahlberg et al. (2015) both extract inflectional rules after aligning forms from known paradigms. Yarowsky and Wicentowski (2000) also generate rules for morphological transformations but their system for minimally supervised morphological analysis requires additional information in the form of a list of morphemes as input.

Erdmann et al. (2020) present a task called the paradigm discovery problem which is quite similar to the unsupervised paradigm clustering task. In their formulation of the task, inflected forms are clustered into paradigms and corresponding forms in distinct paradigms (like all plural forms of English nouns) are clustered into cells. Their benchmark system is based on splitting every form into a (potentially discontinuous) base and exponent, where the base is the longest common subsequence of the forms in a paradigm and the exponent is the residual of the form. They then maximize the base in each paradigm while minimizing the exponents of individual forms.

3 Methods

This section describes how we extract rules from the dataset and apply them to paradigm clustering. We also describe methods for filtering out extraneous forms from generated paradigms.

3.1 Baseline

As a baseline, we use the character n-gram clustering method provided by the shared task organizers (Wiemerslage et al., 2021). Here all forms sharing a given substring of length n are clustered into a paradigm. Duplicate paradigms are removed. The hyperparameter n can be tuned on validation data if such data is available (we use n = 5 in all our experiments).

3.2 Transformation rules

Our approach builds on the baseline paradigms discovered in the previous step. We start by extracting transformation rules between all word forms in a single baseline paradigm. For each pair of strings like dog and dogs belonging to a paradigm, we generate a rule like ?+ 0:s which translates the first form into the second one. From a paradigm of size n, we can therefore extract n² − n rules—one for each ordered pair of distinct word forms. Preliminary experiments showed that large baseline paradigms tended to generate many incorrect rules which did not represent genuine morphological transformations. We, therefore, limited rule-discovery to paradigms spanning maximally 20 forms.

After generating transformation rules, we compute rule-frequency over all baseline paradigms and discard rare rules which are unlikely to represent genuine morphological transformations (the minimum threshold for rule frequency is a hyperparameter). The remaining rules are then applied iteratively to

[Figure 2: pipeline schematic — extract rules from preliminary paradigms, discard rare rules, rebuild paradigms (example words: dog, dogs, hotdog, cat, cats).]

Figure 2: A schematic representation of our approach. We start by generating preliminary paradigms using the baseline method. We then extract transformation rules for each word pair in our paradigms noting how many times each unique rule occurred. For example, here both (dog, dogs) and (cat, cats) result in a rule ?* 0:s which therefore has count 2. Subsequently, we discard rare rules like h:0 o:0 t:0 ?* which are unlikely to represent genuine morphological transformations. We then use the remaining rules to reconstruct our morphological paradigms as explained in Section 3.3.
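As an illustration of the pipeline in Figure 2, the sketch below extracts a prefix/suffix rule from a word pair by splitting around the longest common substring and counts rule frequencies over preliminary paradigms. It is a simplified Python rendering of the idea rather than the authors' implementation: rules are plain tuples instead of the Xerox-style regular relations used in the paper, and the helper names are ours.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_rule(src, tgt):
    """Split both forms around their longest common substring and return a
    prefix/suffix substitution rule, e.g. ('', '', 'ed', 'ing') for walked -> walking."""
    m = SequenceMatcher(None, src, tgt).find_longest_match(0, len(src), 0, len(tgt))
    if m.size == 0:
        return None
    return (src[:m.a], tgt[:m.b], src[m.a + m.size:], tgt[m.b + m.size:])

def apply_rule(rule, word):
    """Apply a prefix/suffix rule to a word if it matches; return None otherwise."""
    old_pre, new_pre, old_suf, new_suf = rule
    stem_len = len(word) - len(old_pre) - len(old_suf)
    if stem_len <= 0 or not word.startswith(old_pre) or not word.endswith(old_suf):
        return None
    return new_pre + word[len(old_pre):len(word) - len(old_suf)] + new_suf

# Count rules over preliminary paradigms, as in Figure 2.
paradigms = [["dog", "dogs", "hotdog"], ["cat", "cats"]]
counts = Counter()
for paradigm in paradigms:
    for a in paradigm:
        for b in paradigm:
            if a != b:
                counts[extract_rule(a, b)] += 1

print(counts.most_common(2))                  # the dog->dogs / cat->cats rule has count 2
print(apply_rule(("", "", "", "s"), "walk"))  # 'walks'
```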

our datasets to construct paradigms. We experiment with two rule types which are described below.

3.2.1 Prefix and Suffix Rules

Our first approach to rule-discovery is based on identifying a contiguous word stem shared by both forms. The stem is defined as the longest common substring of the forms. We split both forms into a prefix, stem and suffix. The morphological transformation is then defined as a joint substitution of a prefix and suffix. For example, given the German forms acker+n and ge+acker+t (German 'to plow'), we would generate a rule:

0:g 0:e ?+ n:t

As mentioned above, these rules are extracted from paradigms generated by the baseline system.

We also experiment with a more restricted form of these rules in which only suffix transformations are allowed. While this limits the possible transformations, it will also result in fewer incorrect rules and may, therefore, deliver better performance for languages which are predominantly suffixing.

3.2.2 Discontinuous rules

Even though prefix and suffix transformations are adequate for representing morphological transformations in many languages, they fail to derive the appropriate generalizations for languages with templatic morphology like Maltese (which was included among the development languages). For example, it is impossible to identify a contiguous stem-like unit spanning more than a single character for the Maltese forms gidem 'to bite' and gdimt. We need a rule which can apply transformations inside the input string:

?+ i:0 ?+ e:i ?+ 0:t

Like prefix and suffix rules, discontinuous rules are generated from baseline paradigms. Unlike prefix and suffix rules, however, discontinuous rules require a character-level alignment between the input and output string. To this end, we start by generating a dataset consisting of all string pairs like (dog, dogs) and (hotdog, dog), where both strings belong to the same paradigm. We then apply a character-level aligner based on the iterative Markov chain Monte Carlo method to this dataset.[2] Using this method, we can jointly align all string pairs in the baseline paradigms. This is beneficial because the MCMC aligner will prefer common substitutions, deletions and insertions over rare ones,[3] which enforces consistency of the alignment over the entire dataset. This in turn can help us find linguistically motivated transformation rules.

[2] This aligner was initially used for the baseline system in the 2016 iteration of the SIGMORPHON shared task (Cotterell et al., 2016).
[3] This is a consequence of the fact that the algorithm iteratively maximizes the likelihood of the alignment for each example given all other examples in the dataset.

Character-level alignment results in pairs:

INPUT: d o g 0
OUTPUT: d o g s
INPUT: h o t d o g
OUTPUT: 0 0 0 d o g

Each symbol pair in the alignment represents one of the following types: (1) an identity pair x:x, (2) an insertion 0:x, (3) a deletion x:0, or (4) a substitution x:y. In order to convert a pair of aligned strings into a transformation

rule, we simply replace all contiguous sequences of identity pairs with ?+. For the alignments above, we get the rules: ?+ 0:s and h:0 o:0 t:0 ?+.

3.3 Iterative Application of Rules

After extracting a set of rules from baseline paradigms, we discard the baseline paradigms. We then construct new paradigms using our rules. We start by picking a random word form w from the dataset. We then form the paradigm P for w as the set of all forms in our dataset which can be derived from w by applying our rules iteratively. For example, given the form eats and the rules:

?+ s:0 and ?+ 0:i 0:n 0:g

the paradigm of eats would contain both eat (generated by the first rule) and eating (generated by the second rule from eats) provided that both of these forms are present in our original dataset. All forms in P are removed from the dataset and we then repeat the process for another randomly sampled form in the remaining dataset. This continues until the dataset is exhausted. The procedure is sensitive to the order in which we sample forms from the dataset but exploring the optimal way to sample forms falls beyond the scope of the present work.

For prefix and suffix rules, we limit rule application to a single iteration because this delivered better results in practice. Applying rules iteratively tended to result in very large paradigms. For discontinuous rules, we do apply rules iteratively.

3.4 Filtering Paradigms

According to our preliminary experiments, many large paradigms generated by transformation rules contained word forms which were morphologically unrelated to the other forms in the paradigm. To counteract this, we experimented with three strategies for filtering out individual extraneous forms from generated paradigms: the degree test, the rule-frequency test and the embedding-similarity test. Forms which fail all of our three tests are removed from the paradigm.[4]

[4] These filtering strategies are applied to paradigms containing > 20 forms. This threshold was determined based on examining the output clusters for the development languages.

Figure 3: Given the candidate paradigm {walk, wall, walking, walked, walks}, we can form a graph where two word forms are connected if a rule like ?+ 0:e 0:d derives one of the forms like walked from the other one walk. We experiment with filtering out forms which have low degree in this graph since those are more likely to be spurious additions resulting from rules like ?+ l:k in the example, which do not capture genuine morphological regularities. In this example, wall might be filtered out because it has low degree one compared to all other forms which have degree three.

If we first generate all paradigms and then filter out extraneous forms, we will be left with a number of forms which have not been assigned to a paradigm. In order to circumvent this problem, we apply filtering immediately after generating each individual paradigm. Forms which are filtered out from the paradigm are placed back into the original dataset. They can then be included in paradigms which are generated later in the process.

Degree test Our morphological transformation rules induce dependencies and therefore a graph structure between the forms in a paradigm as demonstrated in Figure 3. Within each paradigm, we calculate the degree of a word in the following way: For each attested word w in the generated paradigm, its degree is the number of forms w′ in the paradigm for which we can find a transformation rule mapping w → w′. We increment the degree if there is at least one edge between words w and w′ in the paradigm (the number of distinct rules mapping form w to w′ is irrelevant here as long as there is at least one). If the degree of a word is less than a third of the paradigm size, the word fails the degree test.

Rule-Frequency test Some rules like ?+ e:i d:n 0:g for English represent genuine inflectional transformations and will therefore

occur often in our datasets. Others, like the rule ?* l:k in Figure 3, instead result from coincidence, and will usually have low frequency. We can, therefore, use rule frequency as a criterion when identifying extraneous forms in generated paradigms. We examine the cumulative frequency of all rules applying to the form in our paradigm. If this frequency is lower than the median cumulative frequency in the paradigm, the form fails the rule-frequency test.

Embedding-similarity test If a word fails to pass the degree and the rule frequency tests, we will measure the semantic similarity of the given form with other forms in the paradigm. To this end, we trained FastText embeddings (Bojanowski et al., 2017) and calculated cosine similarity between embedding vectors as a measure of semantic relatedness.[5] We start by selecting two reference words in the paradigm which have high degree (at least 50% of the maximal degree) and whose cumulative rule frequency is above the paradigm's median value. We then compute their cosine similarity as a reference point r. For all other words in the paradigm, we then compare their cosine similarity r′ to one of the reference forms. Forms fail the embedding-similarity test if r′ < 0.5 and r − r′ > 0.3.

[5] We train 300-dimensional embeddings with context window 3 and use character n-grams of size 3-6.

4 Experiments and Results

In this section, we describe experiments on the shared task development and test languages.

4.1 Data and Resources

The shared task uses two data resources. Corpus data for the five development languages (Maltese, Persian, Portuguese, Russian and Swedish) and nine test languages (Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish and Turkish) are sourced from the Johns Hopkins Bible Corpus (McCarthy et al., 2020b). For most of the languages, complete Bibles were provided but for some of them, we only had access to a subset (see Wiemerslage et al. (2021) for details). Gold standard paradigms were automatically generated using the Unimorph 3.0 database (McCarthy et al., 2020a).

4.2 Experiments on validation languages

Since our transformation rules are generated from paradigms discovered by the baseline system, which contain incorrect items, it is to be expected that some incorrect rules are generated. We filter out infrequent rules, as they are less likely to represent genuine morphological transformations. For prefix and suffix rules (i.e., PS), we experimented with including the top 2000 (PS-2000), 5000 (PS-5000), and all rules (PS-all), as measured by rule-frequency. Additionally, we present experiments using a system which relies exclusively on suffix transformations including all of them regardless of frequency (S-all). For discontinuous rules (D), we used lower thresholds because our preliminary experiments indicated that incorrect generalizations were a more severe problem for this rule type. We selected the 200 (D-200), 300 (D-300), and 500 (D-500) most frequent rules, respectively. Results with regard to best-match F1 score (see Wiemerslage et al. (2021) for details) are shown in Table 1.

According to the results, all of our systems outperform the baseline system by at least 25.53% as measured using the mean best match F1 score. Plain suffix rules (S-all) provide the best performance with a mean F1 score of 65.41%, followed by other affixal systems (PS-2000, PS-5000 and PS-all). On average, discontinuous rules (D-200, D-300 and D-500) are slightly less successful, but they deliver the best performance for Maltese. Table 1 demonstrates that simply increasing the number of rules does not always contribute to better performance—the optimal threshold varies between languages.

As explained in Section 3.4, we aim to filter out extraneous forms from overly-large paradigms. We applied this approach to discontinuous rules with a 500 threshold. Results are shown in Table 2. As the table shows, a filtering strategy can offer very limited improvements. Most of the languages do not benefit from this approach and even for languages which do, the gain is minuscule. Due to their very limited effect, we did not apply filtering strategies to test languages.
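For illustration, the single-iteration rule application of Section 3.3, combined with a frequency cut-off of the kind used for the PS-2000/PS-5000 and D-200/D-300/D-500 systems above, can be sketched as follows. This is a minimal suffix-only Python sketch with made-up rule counts, not the submitted system:

```python
import random
from collections import Counter

def apply_suffix_rule(word, old_suf, new_suf):
    """Apply a suffix substitution if the word ends with old_suf; else return None."""
    if old_suf and not word.endswith(old_suf):
        return None
    stem = word[:len(word) - len(old_suf)] if old_suf else word
    return stem + new_suf

def build_paradigms(vocab, rules):
    """One pass of rule application: pick a remaining form at random and add every
    attested form that some rule derives from it (cf. Section 3.3)."""
    remaining, paradigms = set(vocab), []
    while remaining:
        seed = random.choice(sorted(remaining))
        paradigm = {seed}
        for old_suf, new_suf in rules:
            derived = apply_suffix_rule(seed, old_suf, new_suf)
            if derived in remaining:
                paradigm.add(derived)
        remaining -= paradigm
        paradigms.append(sorted(paradigm))
    return paradigms

# Hypothetical rule counts; keeping only the most frequent rules mirrors the
# frequency thresholds used in the experiments.
rule_counts = Counter({("", "s"): 120, ("ed", "ing"): 85, ("q", "x"): 1})
top_rules = [rule for rule, _ in rule_counts.most_common(2)]

print(build_paradigms(["walk", "walks", "walked", "walking", "eat", "eats"], top_rules))
```

Because the seed form is sampled at random, the exact grouping of forms can vary between runs, which matches the order sensitivity noted in Section 3.3.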

Maltese Persian Portuguese Russian Swedish Mean
Baseline 29.07 30.04 34.15 36.30 43.62 34.64
PS-2000 35.41 50.17 65.53 81.20 81.14 62.69
PS-5000 36.81 50.40 71.33 81.96 79.82 64.06
PS-all 40.67 53.15 76.63 75.39 72.46 63.66
S-all 30.32 52.69 82.67 80.65 80.74 65.41
D-200 42.99 54.65 66.86 70.38 68.76 60.73
D-300 42.99 53.64 69.38 72.33 67.14 61.10
D-500 45.05 51.82 66.37 75.26 62.30 60.16

Table 1: F1 Scores for each of the model types on all development languages. The best F1 scores are in
bold.

Maltese Persian Portuguese Russian Swedish Mean


Baseline 29.07 30.04 34.15 36.30 43.62 34.64
D-500 45.05 51.82 66.37 75.26 62.30 60.16
Filter 45.05 51.82 66.45 75.26 62.30 60.18

Table 2: F1 score for Discontinuous rules systems and Filtering systems across five validation languages.
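The degree test that underlies the Filter rows in Table 2 (Section 3.4) can be sketched as follows, assuming the rule-induced links within a paradigm have already been collected into an edge set as in Figure 3; the example data mirrors that figure, and the function name is ours:

```python
def degree_test(paradigm, edges):
    """Degree test from Section 3.4: a form fails if it is linked to fewer than a
    third of the paradigm's size. `edges` holds unordered pairs of forms that are
    connected by at least one transformation rule (the graph of Figure 3)."""
    failing = []
    for w in paradigm:
        degree = sum(1 for v in paradigm if v != w and frozenset((w, v)) in edges)
        if degree < len(paradigm) / 3:
            failing.append(w)
    return failing

# Figure 3 example: the verb forms are all linked to each other, wall only to walk.
forms = ["walk", "walks", "walked", "walking", "wall"]
edges = {frozenset(p) for p in [
    ("walk", "walks"), ("walk", "walked"), ("walk", "walking"),
    ("walks", "walked"), ("walks", "walking"), ("walked", "walking"),
    ("walk", "wall")]}
print(degree_test(forms, edges))   # ['wall']  (degree 1 < 5/3)
```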

4.3 Experiments on Test Languages

Results for the test languages are presented in Table 3. We find that all of our systems surpassed the baseline results by at least 23.06% in F1 score. The prefix and suffix system using all of the suffix rules displays the best performance with an F1 score of 66.12%. Among the discontinuous systems, the system with a threshold of 500 has the best results. On average, the affixal systems outperform the discontinuous ones. In particular, these methods perform best on languages which are known to be predominantly suffixing, such as English, Spanish, and Finnish. Contrarily, discontinuous rules deliver the best performance for Navajo—a strongly prefixing language. Discontinuous rules also result in the best performance for Basque, which has a very high morpheme-to-word ratio.

In order to better understand the behavior of our systems, we analyzed the distribution of the size of generated paradigms for prefix and suffix systems as well as discontinuous systems. Results for selected systems are shown in Figure 4. We conducted this experiment for the overall best system (S-all), as well as the best discontinuous system (D-500). Both systems follow the same overall pattern: large paradigms are rarer than smaller ones and the frequency drops very rapidly with increasing paradigm size. The majority of generated paradigms have sizes in the range 1-5. Although the tendency is similar for suffix rules and discontinuous rules, discontinuous rules tend to generate more paradigms of size 1. In contrast to the paradigms generated by our systems, the frequency of gold standard paradigms drops far slower as the paradigms grow. For example, for Finnish and Kannada, paradigms containing 10 forms are still very common. The only language where the distribution generated by our systems very closely parallels the gold standard is Spanish. For all other languages, our systems very clearly over-generate small paradigms.

5 Discussion and Conclusions

Paradigm construction can suffer from two main difficulties: overgeneralization, and underspecification. In the former, paradigms are too generous when adding new members. Consider, for example, a paradigm headed by "sea". We would want to include the plural "seas", but not the unrelated words "seal", "seals", "undersea", etc. Contrarily, a paradigm selection algorithm that is overly selective will result in a large number of small paradigms - less than ideal in a morphologically-dense language.

Considering the results described in the previous section, we note that our two best models skew towards conservatism - they prefer smaller paradigms. This is likely an artifact of our development cycle - we found that the baseline preferred large paradigms, often capturing derivational features, or even circumstantial

English Navajo Spanish Finnish Bulgarian Basque Kannada German Turkish Mean
Baseline 51.49 33.25 38.83 28.97 38.89 21.48 23.79 38.22 25.23 33.35
PS-2000 83.89 48.69 77.71 52.60 73.50 25.81 42.35 74.49 46.80 58.42
PS-5000 81.16 48.69 79.60 57.88 74.14 29.03 47.47 74.27 51.26 60.39
PS-all 76.41 48.69 76.94 66.03 69.50 29.03 57.71 65.26 60.97 61.17
S-all 88.68 42.48 83.21 73.42 76.96 29.03 59.34 74.18 67.80 66.12
D-200 76.93 58.45 66.05 50.68 70.48 26.19 40.57 70.26 48.05 56.41
D-300 73.23 59.36 69.46 53.66 69.39 26.19 43.71 68.52 51.00 57.17
D-500 69.33 61.66 69.92 56.51 63.23 33.33 46.94 62.54 53.24 57.41

Table 3: F1 Scores for each of the model types on all test languages. The best F1 scores are in bold.

[Figure 4: nine line plots of paradigm size distribution, one per test language — (a) Basque, (b) Bulgarian, (c) English, (d) Finnish, (e) German, (f) Kannada, (g) Navajo, (h) Spanish, (i) Turkish — each comparing the Suffix-all system, the Discontinuous-500 system, and the gold standard.]

Figure 4: Paradigm size distribution across nine test languages. The x axis stands for paradigm size ranging from 1 to 20. The y axis shows the percentage that each paradigm size accounts for among all paradigms the system generates.
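For reference, the percentages plotted in Figure 4 can be computed from a list of predicted paradigms with a few lines of Python; this is a generic sketch rather than the evaluation script used to produce the figure:

```python
from collections import Counter

def size_distribution(paradigms, max_size=20):
    """Percentage of paradigms at each size from 1 to max_size (larger paradigms
    are counted in the max_size bucket), as plotted in Figure 4."""
    sizes = Counter(min(len(p), max_size) for p in paradigms)
    total = sum(sizes.values())
    return {s: 100.0 * sizes.get(s, 0) / total for s in range(1, max_size + 1)}

predicted = [["walk", "walks"], ["eat", "eats", "eating"], ["sea"]]
print(size_distribution(predicted)[1])   # about 33.3% of these paradigms are singletons
```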

string similarities, when clustering paradigms. Much of our focus was thus on limiting rule application only to those rules we could be certain were genuine. Unfortunately, this means that many words are excluded, residing in singleton paradigms.

Our methods were also affected by the choice of development languages. Of these languages, only one (Persian) is agglutinating, and none of the authors can read the script, so it had a smaller impact on the evolution of our methods. We believe that several languages —namely, Finnish, Turkish, and Basque— could have benefited from iterative rule application; however, the iterative process was not selected after seeing a degradation (due to overgeneralization) on the development languages.

It is also worth discussing two outliers in our system selection. Our suffix-first model performed very well on all of the development languages except Maltese. This is not surprising, given its templatic morphology. Maltese inspired the creation of our discontinuous rule set, and indeed, these rules outperformed the suffixes for Maltese. Switching to the test languages, we see that this model has higher performance for Navajo and Basque –two languages that are rarely described as templatic. We observe, however, that both languages make heavy use of prefixing. Note in Table 3 that including prefixes (PS-All) significantly improves Navajo: the only language to see such a benefit. Likewise, Navajo also has significant stem alternation, which may be benefiting from discontinuous rule sets. Basque is trickier - it does not improve simply from including prefixal rules. Upon closer inspection, we observe that much Basque prefixation more closely resembles circumfixation: the stem has a prefixal vowel to indicate tense, which is jointly applied with inflectional suffixes. One round of rule application - even if it includes both suffixes and prefixes, appears to be insufficient.

There is still plenty of ground to be covered, with the mean F1 score below 70%. We believe that the next step lies in re-establishing a bottom-up construction for those paradigms that our methods currently separate into small sub-paradigms. Our methods predict roughly twice to 3 times as many singleton paradigms as exist in the gold data, and there is not significant rule support to combine them. Possible areas for exploration include iterative rule extraction on successively more correct paradigms, or the incorporation of a machine learning element that can predict missing forms.

In this paper, we have presented a method for automatically building inflectional paradigms from raw data. Starting with an n-gram baseline, we extract intra-paradigmatic rewrite rules. These rules are then re-applied to the corpus in a discovery process that re-establishes known paradigms. Our methods prove very competitive, with our best model finishing within 2% of the best submitted system.

References

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1024–1029.

Sarah Beemer, Zak Boston, April Bukoski, Daniel Chen, Princess Dickens, Andrew Gerlach, Torin Hopkins, Parth Anand Jawale, Chris Koski, Akanksha Malhotra, Piyush Mishra, Saliha Muradoglu, Lan Sang, Tyler Short, Sagarika Shreevastava, Elizabeth Spaulding, Testumichi Umada, Beilei Xiang, Changbing Yang, and Mans Hulden. 2020. Linguist vs. machine: Rapid development of finite-state morphological grammars. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 162–170, Online. Association for Computational Linguistics.

Kenneth R Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Grzegorz Chrupala, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D McCarthy, Katharina Kann, Se-

bastian J Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya McCarthy, and Katharina Kann. 2020. Unsupervised morphological paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6696–6707, Online. Association for Computational Linguistics.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020a. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Arya D McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J Mielke, Jeffrey Heinz, et al. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020b. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado. Association for Computational Linguistics.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. SIGMORPHON 2020.

Adam Wiemerslage, Arya McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, and Katharina Kann. 2021. The SIGMORPHON 2021 shared task on unsupervised morphological paradigm clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 207–216, Hong Kong. Association for Computational Linguistics.
Paradigm Clustering with Weighted Edit Distance

Andrew Gerlach, Adam Wiemerslage and Katharina Kann


University of Colorado Boulder
[email protected]

Abstract

This paper describes our system for the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering, which asks participants to group inflected forms together according to their underlying lemma without the aid of annotated training data. We employ agglomerative clustering to group word forms together using a metric that combines an orthographic distance and a semantic distance from word embeddings. We experiment with two variations of an edit distance-based model for quantifying orthographic distance, but, due to time constraints, our systems do not outperform the baseline. However, we also show that, with more time, our results improve strongly.

1 Introduction

Most of the world's languages express grammatical properties, such as tense or case, via small changes to a word's surface form. This process is called morphological inflection, and the canonical form of a word is known as its lemma. A search of the WALS database of linguistic typology shows that 80% of the database's languages mark verb tense and 65% mark grammatical case through morphology (Dryer and Haspelmath, 2013).

The English lemma do, for instance, has an inflected form did that expresses past tense. Though English verbs inflect to express tense, there are generally only 4 to 5 surface variations for a given English lemma. In contrast, a Russian verb can have up to 30 morphological inflections per lemma, and other languages – such as Basque – have hundreds of forms per lemma, cf. Table 1.

Basque Lemma: egin
begi begiate begidate
begie begiete begigu
begigute begik begin
beginate begio begiote
begit begite begitza
... ... ...
zenegizkigukeen zenegizkigukete zenegizkiguketen
zenegizkigun zenegizkigute zenegizkiguten
zenegizkio zenegizkioke zenegizkiokeen
zenegizkiokete zenegizkioketen zenegizkion
zenegizkiote zenegizkioten zenegizkit
Table 1: The paradigm of the Basque verb egin consists of 674 inflected forms. In contrast, the paradigm of the English verb do only consists of 5 inflected forms: do, does, doing, did, and done.

Inflected forms are systematically related to each other: in English, most noun plurals are obtained from the lemma by adding -s or -es to the end of the noun, e.g., list/lists or kiss/kisses. However, irregular plurals also exist, such as ox/oxen or mouse/mice. Although irregular forms are less frequent, they cause challenges for the automatic generation or analysis of the surface forms of English plural nouns.

In this work, we address the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering ("Task 2") (Wiemerslage et al., 2021). The goal of this shared task is to group words encountered in naturally occurring text into morphological paradigms. Unsupervised paradigm clustering can be helpful for state-of-the-art natural language processing (NLP) systems, which typically require large amounts of training data. The ability to group words together into paradigms is a useful first step for training a system to induce full paradigms from a limited number of examples, a task known as (supervised) morphological paradigm completion. Building paradigms can help an NLP system
to induce representations for rare words or to generate words that have not been observed in a given corpus. Lastly, unsupervised systems have the advantage of not needing annotated data, which can be costly in terms of time and money, or, in the case of extinct or endangered languages, entirely impossible.

Since 2016, the Association for Computational Linguistics' Special Interest Group on Computational Morphology and Phonology (SIGMORPHON) has created shared tasks to help spur the development of state-of-the-art systems to explicitly handle morphological processes in a language. These tasks have involved morphological inflection (Cotterell et al., 2016), lemmatization (McCarthy et al., 2019), as well as other, related tasks. SIGMORPHON has increased the level of difficulty of the shared tasks, largely along two dimensions. The first dimension is the amount of data available for models to learn, reflecting the difficulties of analyzing low-resource languages. The second dimension is the amount of structure provided in the input data. Initially, SIGMORPHON shared tasks provided predefined tables of lemmas, morphological tags, and inflected forms. For the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering, only raw text is provided as input.

We propose a system that combines orthographic and semantic similarity measures to cluster surface forms found in raw text. We experiment with a character-level language model for weighing substring differences between words. Due to time constraints we are only able to cluster over a subset of each language's vocabulary. Despite this, our system's performance is comparable to the baseline.

2 Related Work

Unsupervised morphology has attracted a great deal of interest historically, including a large body of work focused on segmentation (Xu et al., 2018; Creutz and Lagus, 2007; Poon et al., 2009; Narasimhan et al., 2015). Recently, the task of unsupervised morphological paradigm completion has been proposed (Kann et al., 2020; Jin et al., 2020; Erdmann et al., 2020), wherein the goal is to induce full paradigms from raw text corpora.

In this year's SIGMORPHON shared task, we are asked to only address part of the unsupervised paradigm completion task: paradigm clustering. Intuitively, the task of segmentation is related to paradigm clustering, but the outputs are different. Goldsmith (2001) produces morphological signatures, which are similar to approximate paradigms, based on an algorithm that uses minimum description length. However, this type of algorithm relies heavily on purely orthographic features of the vocabulary. Schone and Jurafsky (2001) hypothesize that approximating semantic information can help differentiate between hypothesized morphemes, revealing those that are productive. They propose an algorithm that combines orthography, semantics, and syntactic distributions to induce morphological relationships. They used semantic relatedness, quantified by latent semantic analysis, combined with the frequencies of affixes and syntactic context (Schone and Jurafsky, 2000).

More recently, Soricut and Och (2015) have used SkipGram word embeddings (Mikolov et al., 2013) to find meaningful morphemes based on analogies: regularities exhibited by embedding spaces allow for inferences of certain types (e.g., king is to man what queen is to woman). Hypothesizing that these regularities also hold for morphological relations, they represent morphemes by vector differences between semantically similar forms, e.g., the vector for the suffix s may be represented by the difference between the vectors for cats and cat.

Drawing upon these intuitions, we follow Rosa and Zabokrtský (2019), which combines semantic distance using fastText embeddings (Bojanowski et al., 2017) with an orthographic distance between word pairs. Words are then clustered into paradigms using agglomerative clustering.
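As a rough sketch of this clustering step, agglomerative clustering over a precomputed pairwise distance matrix can be run with scikit-learn as shown below. The distance values and the threshold here are placeholders, not the tuned settings of the submitted systems (the distance definition and the development-set threshold are given in Section 4):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

words = ["walk", "walks", "walked", "sea", "seas"]

# Placeholder pairwise distances in [0, 1]; in the system these combine an
# orthographic distance with an embedding-based semantic distance.
dist = np.array([
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.9, 0.1, 0.0],
])

# Note: older scikit-learn versions name the `metric` parameter `affinity`.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                    metric="precomputed", linkage="average")
labels = clusterer.fit_predict(dist)
for label in set(labels):
    print([w for w, l in zip(words, labels) if l == label])
```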
3 Task Description

Given a raw text corpus, the task is to sort words into clusters that correspond to paradigms. More formally, for the vocabulary Σ of all types attested in the corpus and the set of morphological paradigms Π for which at least one word is in Σ, the goal is to output clusters corresponding to πk ∩ Σ for all πk ∈ Π.

Data As the raw text data for this task, JHU Bible corpora (McCarthy et al., 2020b) are provided by the organizers. This is the only data that systems can use. The organizers further provide development and test sets consisting of gold clusters for a subset of words in the Bible corpora. Each cluster is a list of words representing πk ∩ Σ for πk ∈ Πdev or πk ∈ Πtest, respectively, and Πdev, Πtest ⊊ Π.

The partial morphological paradigms in Πdev and Πtest are taken from the UniMorph database (McCarthy et al., 2020a). Development sets are only available for the development languages, while test sets are only provided for the test languages. All test sets are hidden from the participants until the conclusion of the shared task.

Languages The development languages featured in the shared task are Maltese, Persian, Portuguese, Russian, and Swedish. The test languages are Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.

4 System Descriptions

We submit two systems based on Rosa and Zabokrtský (2019). The first, referred to below as JW-based clustering, follows their work very closely. The second, LM-based clustering, contains the same main components, but approximates orthographic distances with the help of a language model.

4.1 JW-based Clustering

We describe the system of Rosa and Zabokrtský (2019) in more detail here. This system clusters over words whose distance is computed as a combination of orthographic and semantic distances.

Orthographic Distance The orthographic distance of two words is computed as their Jaro-Winkler (JW) edit distance (Winkler, 1990). JW distance differs from the more common Levenshtein distance (Levenshtein, 1966) in that JW distance gives more importance to the beginnings of strings than to their ends, which is where characters belonging to the stem are likely to be in suffixing languages. The JW distance is averaged with the JW distance of a simplified variant of the string. The simplified variant is a string that has been lower cased, transliterated to ASCII, and had the non-initial vowels deleted. This is done to soften the impact of characters that are likely to correspond with affixes. Crucially, we believe that this biases the system towards languages that express inflection via suffixation.

Semantic Distance We represent words in the corpus by fastText embeddings, similar to Erdmann and Habash (2018), who cluster fastText embeddings for the same task in various Arabic dialects. We expect fastText embeddings to provide better representations than, e.g., Word2Vec (Mikolov et al., 2013), due to the limited size of the Bible corpora. Unfortunately, using fastText may also inadvertently result in higher similarity between words belonging to different lemmas that contain overlapping subwords corresponding to affixes.

Overall Distance We compute a pairwise distance matrix for all words in the corpus. The distance between two words w1 and w2 is computed as:

d(w1, w2) = 1 − δ(w1, w2) · (cos(ŵ1, ŵ2) + 1) / 2,    (1)

where ŵ1 and ŵ2 are the embeddings of w1 and w2, cos is the cosine distance, and δ is the JW edit distance. The cosine distance is mapped to [0, 1] to avoid negative distances.

Finally, agglomerative clustering is performed by first assigning each word form to a unique cluster. At each step, the two clusters
with the lowest average distance are merged together. The merging continues while the distance between clusters stays below a threshold. We tune this hyperparameter on the development set, and our final threshold is 0.3.

4.2 LM-based Clustering

The JW-based clustering described above relies on heuristics to obtain a good measure of orthographic similarity. These heuristics help to quantify orthographic similarity between two words by relying more on the shared characters in the stem than in the affix: The plural past participles gravados and louvados in Portuguese have longer substrings in common than the substrings by which they differ. This is due to the affix -ados, which indicates that the two words express the same inflectional information, even though their lemmas are different. Similarly, the Portuguese verbs abafa and abafávamos differ in many characters, though they belong to the same paradigm, as can be observed by the shared stem abaf.

However, not all languages express inflection exclusively via suffixation, nor via concatenation. We thus experiment with removing the edit distance heuristics and, instead, utilizing probabilities from a character-level language model (LM) to distinguish between stems and affixes. In doing so, we hope to achieve better results for templatic languages, such as Maltese. We hypothesize that the LM will have a higher confidence for characters that are part of an affix than for those that are part of the stem. We then draw upon this hypothesis and weigh edit operations between two strings based on these confidences.

LM-weighted Edit Distance Similar to the intuition behind Silfverberg and Hulden (2018), we train a character-level LM on the entire vocabulary for each Bible corpus. Unlike their work, we do not have inflectional tags for each word. Despite this, we hypothesize that the highly regular and frequent nature of inflectional affixes will lead to higher likelihoods for characters that occur in affixes than for those in stems. We train a two-layer LSTM (Hochreiter and Schmidhuber, 1997) with an embedding size of 128 and a hidden layer size of 128. We train the model until the training loss stops decreasing, for up to 100 epochs, using Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of 16.

When calculating the edit distance between two words, the insertion, deletion, or substitution costs are computed as a function of the LM probabilities. We expect this to give more weight to differences in the stem than to those in other parts of the word. Each character is then associated with a cost given by

cost(wi) = 1 − p(wi) / Σ_{j∈|w|} p(wj),    (2)

where p(wi) is the probability of the ith character in word w as given by the LM. We then compute the cost of an insertion or deletion as the cost of the character being inserted or deleted. The cost of a substitution is the average of the costs of the two involved characters. The sum over these operations is the weighted edit distance between the two words. Finally, we compute pairwise distances using Equation 1, replacing δ(w1, w2) with the weighted edit distance normalized by max(|w1|, |w2|).

Forward vs. Backward LM We hypothesize that the direction in which the LM is trained affects the probabilities for affixes. Intuitively, an LM is likely to assign higher confidence to characters at the beginning of a word than at the end. Thus, an LM trained on data in the forward direction (LM-F) should be more likely to assign higher probabilities to characters at the beginning of a word, such as prefixes, while a model trained on reversed words (LM-B) should assign higher probabilities to suffixes. In practice, LM-B outperforms LM-F on all development languages, cf. Table 2. Because of that, we employ LM-B to weigh edit operations for all test languages.[1]

[1] This might be caused by none of the development languages being prefixing. However, in order to make a more informed choice, a method to automatically distinguish between prefixing and suffixing languages from raw text alone would be necessary.
(Hochreiter and Schmidhuber, 1997) with an would be necessary.

110
Lang Baseline LMC-B LMC-F JWC
prec. rec. F1 prec. rec. f1 prec. rec. F1 prec. rec. F1
Maltese 0.250 0.348 0.291 0.465 0.229 0.307 0.411 0.202 0.272 0.489 0.241 0.323
Persian 0.265 0.348 0.300 0.321 0.307 0.314 0.494 0.197 0.282 0.579 0.231 0.330
Portuguese 0.218 0.794 0.341 0.771 0.248 0.376 0.494 0.159 0.241 0.742 0.239 0.362
Russian 0.234 0.807 0.363 0.802 0.282 0.417 0.726 0.255 0.378 0.792 0.278 0.412
Swedish 0.303 0.776 0.436 0.818 0.378 0.517 0.695 0.321 0.439 0.838 0.388 0.530
Average 0.254 0.615 0.346 0.635 0.289 0.386 0.482 0.186 0.268 0.688 0.275 0.391

Table 2: Precision, recall, and F1 for all development languages. LMC-R is the LM-clustering system for language
models trained from left-to-right (reverse). LMC-F are trained from left-to-right, and JWC is the JW-clustering
system. The highest F1 for each language is in bold.

Lang Baseline LMC JWC


prec. rec. F1 prec. rec. F1 prec. rec. F1
English 0.388 0.767 0.515 0.565 0.245 0.3420 0.663 0.288 0.402
Navajo 0.230 0.598 0.333 0.686 0.112 0.1928 0.657 0.108 0.185
Spanish 0.266 0.722 0.388 0.664 0.183 0.2869 0.699 0.193 0.302
Finnish 0.179 0.767 0.290 0.694 0.227 0.342 0.674 0.220 0.332
Bulgarian 0.265 0.730 0.390 0.745 0.312 0.440 0.717 0.300 0.423
Basque 0.186 0.254 0.215 0.471 0.254 0.330 0.353 0.191 0.247
Kannada 0.172 0.385 0.238 0.570 0.169 0.261 0.625 0.185 0.286
German 0.254 0.776 0.382 0.7626 0.310 0.441 0.787 0.319 0.454
Turkish 0.156 0.658 0.252 0.6574 0.212 0.320 0.641 0.206 0.312
Average 0.233 0.629 0.334 0.646 0.225 0.328 0.646 0.223 0.327

Table 3: Precision, recall, and F1 for all test languages. LMC is the LM-clustering system, JWC is the JW-
clustering system. The highest F1 for each language is in bold.

5 Results and Discussion is likely due to it simply clustering words with


shared substrings, such that a given word is
The official scores obtained by our systems as likely to appear in many predicted clusters.
well as the baseline are shown in Table 3.
Both of our systems perform minimally Interestingly, both of our submissions have
worse than the baseline if we consider F1 av- the same average precision on the test set, de-
eraged over languages (0.334 vs. 0.328 and spite varying across languages. Notably, the
0.327). However, we believe this to be largely LM-based clustering system strongly outper-
due to our submissions only generating clus- forms the JW-based system on Basque with
ters for a subset of the full vocabularies: due to respect to precision. However, the JW-based
time constraints, we only consider words that system outperforms the LM-based one by a
appear at least 5 times in the corpus. No other large margin on English. One hypothesis for
words are included in the predicted clusters. the difference in results is that agglutinating in-
The large gap between precision and recall re- flection in Basque causes very long affixes,
flects this constraint: our submissions have which our LM-based system should down-
a high average precision (0.646 for both sys- weigh in its measurement of orthographic simi-
tems), indicating that the limited set of words larity. Basque is also not a strictly suffixing lan-
we consider are being clustered more accu- guage, which we expect the JW-based model to
rately than the F1 scores would suggest. The be biased towards. On the other hand, English
low recall scores (0.225 and 0.223) are likely has relatively little inflectional morphology,
at least partially caused by the missing words and is strictly suffixing (in terms of inflection).
in our predictions.2 The assumptions behind the JW-based system
Conversely, the baseline system has a high are more ideal for a language like English. The
recall (0.629) and a low precision (0.233). This JW system performs best on Maltese, which
2
suggests that the heuristics of that system are
We confirm this hypothesis with additional experiments
after the shared task’s completion. Those results can be found sufficient for a templatic language, compared
in the appendix. to the LM-based system.
111
6 Conclusion Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya
McCarthy, and Katharina Kann. 2020. Unsuper-
We present two systems for the SIGMOR- vised morphological paradigm completion. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
PHON 2021 Shared Task on Unsupervised ciation for Computational Linguistics, pages 6696–
Morphological Paradigm Clustering. Both of 6707, Online. Association for Computational Lin-
our systems perform slighly worse than the of- guistics.
ficial baseline. However, we also show that this Katharina Kann, Arya D. McCarthy, Garrett Nicolai,
is due to our official submissions only making and Mans Hulden. 2020. The SIGMORPHON
predictions for a subset of the corpus’ vocab- 2020 shared task on unsupervised morphological
paradigm completion. In Proceedings of the 17th
ulary, due to time constraints and that at least SIGMORPHON Workshop on Computational Re-
one of our systems improves strongly if the search in Phonetics, Phonology, and Morphology,
time constraints are removed. pages 51–62, Online. Association for Computa-
tional Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A


References method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and
Tomas Mikolov. 2017. Enriching word vectors with Vladimir Iosifovich Levenshtein. 1966. Binary codes
subword information. Transactions of the Associa- capable of correcting deletions, insertions and re-
tion for Computational Linguistics, 5:135–146. versals. Soviet Physics Doklady, 10(8):707–710.
Doklady Akademii Nauk SSSR, V163 No4 845-848
1965.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
David Yarowsky, Jason Eisner, and Mans Hulden. Arya D. McCarthy, Christo Kirov, Matteo Grella,
2016. The SIGMORPHON 2016 shared task— Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekate-
morphological reinflection. In Proceedings of the rina Vylomova, Sabrina J. Mielke, Garrett Nico-
2016 Meeting of SIGMORPHON, Berlin, Germany. lai, Miikka Silfverberg, Timofey Arkhangelskiy, Na-
Association for Computational Linguistics. taly Krizhanovsky, Andrew Krizhanovsky, Elena
Klyachko, Alexey Sorokin, John Mansfield, Valts
Mathias Creutz and Krista Lagus. 2007. Unsupervised Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan
models for morpheme segmentation and morphol- Cotterell, Mans Hulden, and David Yarowsky.
ogy learning. 4(1). 2020a. UniMorph 3.0: Universal Morphology.
In Proceedings of the 12th Language Resources
and Evaluation Conference, pages 3922–3931, Mar-
Matthew S. Dryer and Martin Haspelmath, editors. seille, France. European Language Resources Asso-
2013. WALS Online. Max Planck Institute for Evo- ciation.
lutionary Anthropology, Leipzig.
Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu,
Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Chaitanya Malaviya, Lawrence Wolf-Sonkin, Gar-
Cotterell, and Nizar Habash. 2020. The paradigm rett Nicolai, Christo Kirov, Miikka Silfverberg, Sab-
discovery problem. In Proceedings of the 58th An- rina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and
nual Meeting of the Association for Computational Mans Hulden. 2019. The SIGMORPHON 2019
Linguistics, pages 7778–7790, Online. Association shared task: Morphological analysis in context and
for Computational Linguistics. cross-lingual transfer for inflection. In Proceedings
of the 16th Workshop on Computational Research in
Phonetics, Phonology, and Morphology, pages 229–
Alexander Erdmann and Nizar Habash. 2018. Comple- 244, Florence, Italy. Association for Computational
mentary strategies for low resourced morphological Linguistics.
modeling. In Proceedings of the Fifteenth Workshop
on Computational Research in Phonetics, Phonol- Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron
ogy, and Morphology, pages 54–65, Brussels, Bel- Mueller, Winston Wu, Oliver Adams, Garrett Nico-
gium. Association for Computational Linguistics. lai, Matt Post, and David Yarowsky. 2020b. The
Johns Hopkins University Bible corpus: 1600+
John Goldsmith. 2001. Unsupervised learning of the tongues for typological exploration. In Proceed-
morphology of a natural language. Computational ings of the 12th Language Resources and Evaluation
linguistics, 27(2):153–198. Conference, pages 2884–2892, Marseille, France.
European Language Resources Association.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Long short-term memory. Neural computation, Dean. 2013. Efficient estimation of word represen-
9(8):1735–1780. tations in vector space.

112
Karthik Narasimhan, Regina Barzilay, and Tommi Conference on Empirical Methods in Natural Lan-
Jaakkola. 2015. An unsupervised method for un- guage Processing, pages 2465–2474, Brussels, Bel-
covering morphological chains. Transactions of the gium. Association for Computational Linguistics.
Association for Computational Linguistics, 3:157–
167.

Hoifung Poon, Colin Cherry, and Kristina Toutanova.


2009. Unsupervised morphological segmentation
with log-linear models. In Proceedings of Human
Language Technologies: The 2009 Annual Confer-
ence of the North American Chapter of the Associa-
tion for Computational Linguistics, pages 209–217,
Boulder, Colorado. Association for Computational
Linguistics.

Rudolf Rosa and Zdenek Zabokrtský. 2019. Unsu-


pervised lemmatization as embeddings-based word
clustering. CoRR, abs/1908.08528.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge-


free induction of morphology using latent semantic
analysis. In Fourth Conference on Computational
Natural Language Learning and the Second Learn-
ing Language in Logic Workshop.

Patrick Schone and Daniel Jurafsky. 2001. Knowledge-


free induction of inflectional morphologies. In Sec-
ond Meeting of the North American Chapter of the
Association for Computational Linguistics.

Miikka Silfverberg and Mans Hulden. 2018. An


encoder-decoder approach to the paradigm cell fill-
ing problem. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Process-
ing, pages 2883–2889, Brussels, Belgium. Associa-
tion for Computational Linguistics.

Radu Soricut and Franz Och. 2015. Unsupervised mor-


phology induction using word embeddings. In Pro-
ceedings of the 2015 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
1627–1637, Denver, Colorado. Association for Com-
putational Linguistics.

Adam Wiemerslage, Arya McCarthy, Alexander Erd-


mann, Garrett Nicolai, Manex Agirrezabal, Miikka
Silfverberg, Mans Hulden, and Katharina Kann.
2021. The SIGMORPHON 2021 shared task on un-
supervised morphological paradigm clustering. In
Proceedings of the 18th SIGMORPHON Workshop
on Computational Research in Phonetics, Phonol-
ogy, and Morphology. Association for Computa-
tional Linguistics.

William E. Winkler. 1990. String comparator met-


rics and enhanced decision rules in the fellegi-sunter
model of record linkage. In Proceedings of the Sec-
tion on Survey Research, pages 354–359.

Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin


Wu. 2018. Unsupervised cross-lingual transfer of
word embedding spaces. In Proceedings of the 2018

113
7 Appendix
Here we present new results which include
the entire data set for selected languages. We
see an improvement in F1 for each language.
This due to the increased recall scores from
the paradigms being more complete. Precision
scores decrease across the board. This may
be due to the languages being sensitive to the
threshold value.
Lang Subset Full
prec. rec. F1 prec. rec. F1
Basque 0.471 0.254 0.330 0.443 0.429 0.435
Bulgarian 0.745 0.312 0.440 0.638 0.631 0.634
English 0.565 0.245 0.342 0.430 0.425 0.428
German 0.763 0.310 0.441 0.703 0.699 0.701
Maltese 0.465 0.229 0.307 0.402 0.400 0.401
Navajo 0.686 0.112 0.193 0.449 0.430 0.435
Spanish 0.664 0.183 0.287 0.579 0.560 0.569
Swedish 0.818 0.378 0.517 0.783 0.737 0.759
Average 0.659 0.252 0.357 0.553 0.539 0.545

Table 4: Post-shared task results using the full data set


for selected languages. These results use LM-B with a
threshold value of 0.3.

114
Results of the Second SIGMORPHON Shared Task
on Multilingual Grapheme-to-Phoneme Conversion
Lucas F.E. Ashby∗ , Travis M. Bartley∗ , Simon Clematide† , Luca Del Signore∗ ,
Cameron Gibson∗ , Kyle Gorman∗ , Yeonju Lee-Sikka∗ , Peter Makarov† ,
Aidan Malanoski∗ , Sean Miller∗ , Omar Ortiz∗ , Reuben Raff∗ ,
Arundhati Sengupta∗ , Bora Seo∗ , Yulia Spektor∗ , Winnie Yan∗

Graduate Program in Linguistics, Graduate Center, City University of New York

Department of Computational Linguistics, University of Zurich

Abstract ral sequence-to-sequence models (e.g., Rao et al.


2015, Yao and Zweig 2015, van Esch et al. 2016).
Grapheme-to-phoneme conversion is an im- With the possible exception of van Esch
portant component in many speech technolo-
gies, but until recently there were no multi-
et al. (2016), who evaluate against a proprietary
lingual benchmarks for this task. The second database of 20 languages and dialects, virtually
iteration of the SIGMORPHON shared task all of the prior published research on grapheme-
on multilingual grapheme-to-phoneme conver- to-phoneme conversion evaluates only on English,
sion features many improvements from the for which several free and low-cost pronunciation
previous year’s task (Gorman et al. 2020), in- dictionaries are available. The 2020 SIGMOR-
cluding additional languages, a stronger base- PHON Shared Task on Multilingual Grapheme-to-
line, three subtasks varying the amount of
Phoneme Conversion (Gorman et al. 2020) repre-
available resources, extensive quality assur-
ance procedures, and automated error analy- sented a first attempt to construct a multilingual
ses. Four teams submitted a total of thirteen benchmark for grapheme-to-phoneme conversion.
systems, at best achieving relative reductions The 2020 shared task targeted fifteen languages
of word error rate of 11% in the high-resource and received 23 submissions from nine teams. The
subtask and 4% in the low-resource subtask. second iteration of this shared task attempts to
further refine this benchmark by introducing addi-
1 Introduction tional languages, a much stronger baseline model,
Many speech technologies demand mappings be- new quality assurance procedures for the data, and
tween written words and their pronunciations. automated error analysis techniques. Furthermore,
In open-vocabulary systems—as well as certain in response to suggestions from participants in the
resource-constrained embedded systems—it is in- 2020 shared task, the task has been divided into
sufficient to simply list all possible pronunciations; high-, medium-, and low-resource subtasks.
these mappings must generalize to rare or unseen
words as well. Therefore, the mapping must be 2 Data
expressed as a mapping from a sequence of ortho-
As in the previous year’s shared task, all data
graphic characters—graphemes— to a sequence
was drawn from WikiPron (Lee et al. 2020), a
of sounds—phones or phonemes.1
massively multilingual pronunciation database ex-
The earliest work on grapheme-to-phoneme
tracted from the online dictionary Wiktionary. De-
conversion (G2P), as this task is known, used or-
pending on the language and script, Wiktionary
dered rewrite rules. However, such systems are
pronunciations are either manually entered by hu-
often brittle and the linguistic expertise needed
man volunteers working from language-specific
to build, test, and maintain rule-based systems
pronunciation guidelines and/or generated from
is often in short supply. Furthermore, rule-
the graphemic form via language-specific server-
based systems are outperformed by modern neu-
side scripting. WikiPron scrapes these pro-
1
We note that referring to elements of transcriptions as nunciatons from Wiktionary, optionally applying
phonemes implies an ontological commitment which may or case-folding to the graphemic form, removing
may not be justified; see Lee et al. 2020 (fn. 4) for discussion.
Therefore, we use the term phone to refer to symbols used to any stress and syllable boundaries, and segment-
transcribe pronunciations. ing the pronunciation—encoded in the Interna-
115
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 115–125
August 5, 2021. ©2021 Association for Computational Linguistics
tional Phonetic Alphabet—using the Python li- simple human error. Therefore, we wished to
brary segments (Moran and Cysouw 2018). In exclude pronunciations which include any non-
all, 21 WikiPron languages were selected for the native segments. This was accomplished by creat-
three subtasks, including seven new languages and ing phonelists which enumerate native phones for
fourteen of the fifteen languages used in the 2020 a given language. Separate phonelists may be pro-
shared task.2 vided for broad and narrow transcriptions of the
In several cases, multiple scripts or dialects same language. During data ingestion, if a pro-
are available for a given language. For instance, nunciation contains any segment not present on the
WikiPron has both Latin and Cyrillic entries for phonelist, the entry was discarded. Phonelist filtra-
Serbo-Croatian, and three different dialects of tion was used for all languages in the medium- and
Vietnamese. In such case, the largest data set of the low-resource subtasks, described below.
available scripts and/or dialects is chosen. Further-
more, WikiPron distinguishes between “broad” 4 Task definition
transcriptions delimited by forward slash (/) and
In this task, participants were provided with a col-
“narrow” transcriptions delimited by square brack-
lection of words and their pronunciations, and then
ets ([ and ]).3 Once again, the larger of the two
scored on their ability to predict the pronunciation
data sets is the one used for this task.
of a set of unseen words.
3 Quality assurance 4.1 Subtasks
During the previous year’s shared task we be- In the previous year’s shared task, each language’s
came aware of several consistency issues with the data consisted of 4,500 examples, sampled from
shared task data. This lead us to develop quality WikiPron, split randomly into 80% training exam-
assurance procedures for WikiPron and the “up- ples, 10% development examples, and 10% test
stream” Wiktionary data. For a few languages, examples. As part of their system development,
we worked with Wiktionary editors who automat- two teams in the 2020 shared task (Hauer et al.
ically enforced upstream consistency via “bots”, 2020, Yu et al. 2020) down-sampled these data to
i.e., scripts which automatically edit Wiktionary simulate a lower-resource setting, and one partici-
entries. We also improved WikiPron’s routines for pant expressed concern whether the methods used
extracting pronunciation data from Wiktionary. In in the shared task would generalize effectively
some cases (e.g., Vietnamese), this required the to high-resource scenarios like the large English
creation of language-specific extraction routines. data sets traditionally used to evaluate grapheme-
In early versions of WikiPron, users had limited to-phoneme systems. This motivated a division of
means to separate out entries for languages written the data into three subtasks, varying the amount of
in multiple scripts. We therefore added an auto- data provided, as described below.4
mated script detection system which ensures that
High-resource subtask The first subtask con-
entries for the many languages written with multi-
sists of a roughly 41,000-word sample of Main-
ple scripts—including shared task languages Mal-
stream American English (eng_us). Participating
tese, Japanese, and Serbo-Croatian—are sorted ac-
teams were permitted to use any and all external
cording to script.
resources to develop their systems except for Wik-
We noticed that the WikiPron data includes
tionary or WikiPron. It was anticipated partici-
many hyper-foreign pronunciations with non-
pants would exploit other freely available Amer-
native phones. For example, the English data in-
ican English pronunciation dictionaries.
cludes a broad pronunciation of Bach (the sur-
name of a family of composers) as /bɑːx/ with Medium-resource subtask The second subtask
a velar fricative /x/, a segment which is com- represents a medium-resource task. For each of
mon in German but absent in modern English. the ten target languages, a sample of 10,000 words
Furthermore, unexpected phones may represent was used. Teams participating in this subtask were
2 4
The fifteenth language, Lithuanian, was omitted due to Languages were sorted into medium- vs. low-resource
unresolved quality assurance issues. subtasks according to data availability. For example, Ice-
3
Sorting by script, dialect, and broad vs. narrow transcrip- landic was placed in the low-resource shared task simply be-
tion is performed automatically during data ingestion. cause it has less than 10,000 pronunciations available.
116
permitted to use UniMorph paradigms (Kirov et al. example, ц is transcribed as both /ts, ͡ts/. Further-
2018) to lemmatize or to look up morphological more, the broad transcriptions sometimes contain
features, but were not permitted to use any other allophones of the consonants /t, d, l/ (Ternes and
external resources. The languages for this subtask Vladimirova-Buhtz 1990); e.g., л is transcribed as
are listed and exemplified in Table 1. both /l, ɫ/. A script was used to enforce a consistent
broad transcription.
Low-resource subtask The third subtask is de-
signed to simulate a low-resource setting and con- Maltese In the Latin-script Maltese data, Wik-
sists of 1,000 words from ten languages. Teams tionary has multiple transcriptions of digraph
were were not permitted to use any external re- għ, which in the contemporary language indi-
sources for this subtask. The languages for this cates lengthening of an adjacent vowel, except
subtask are shown in Table 2. word-finally where it is read as [ħ] (Hoberman
2007:278f.). Rather than excluding multiple pro-
4.2 Data preparation nunciations, a script was used to eliminate pronun-
ciations which contain archaic readings of this di-
The procedures for sampling and splitting the data
graph, e.g., as pharyngealization or as [ɣ].
are similar to those used in the previous year’s
shared task; see Gorman et al. 2020, §3. For Welsh WikiPron’s transcriptions of the South-
each of the three subtasks, the data for each lan- ern dialect of Welsh include the effects of vari-
guage are first randomly downsampled according able processes of monophthongization and dele-
to their frequencies in the Wortschatz (Goldhahn tion (Hannahs 2013:18–25). Once again, rather
et al. 2012) norms. Words containing less than than excluding multiple pronunciations, a script
two Unicode characters or less than two phone seg- was used to select the “longer” pronunciation—
ments are excluded, as are words with multiple naturally, the pronunciation without variable
pronunciations. The resulting data are randomly monophthongization or deletion—of Welsh words
split into 80% training data, 10% development with multiple pronunciations.
data, and 10% test data. As in the previous year’s
shared task, these splits are constrained so that in- 5 Evaluation
flectional variants of any given lemma—according The primary metric for this task was word error
to the UniMorph (Kirov et al. 2018) paradigms— rate (WER), the percentage of words for which the
can occur in at most one of the three shards. Train- hypothesized transcription sequence is not iden-
ing and development data was made available at tical to the gold reference transcription. As the
the start of the task. The test words were also medium- and low-resource subtasks involve multi-
made available at the start of the task; test pro- ple languages, macro-averaged WER was used for
nunciations were withheld until the end of the task. system ranking. Participants were provided with
Some additional processing is required for certain two evaluation scripts: one which computes WER
languages, as described below. for a single language, and one which also com-
putes macro-averaged WER across two or more
English The Wiktionary American English pro-
languages. The 2020 shared task also reported an-
nunciations exhibit a large number of inconsisten-
other metric, phone error rate (PER), but this was
cies. These pronunciations were validated by au-
found to be highly correlated with WER and there-
tomatically comparing them with entries in the
fore has been omitted here.
CALLHOME American English Lexicon (Kings-
bury et al. 1997), which provides broad ARPAbet 6 Baseline
transcriptions of Mainstream American English.
Furthermore, a script was used to standardize use The 2020 shared task included three baselines: a
of vowel length and enforce consistent use of tie WFST-based pair n-gram model, a bidirectional
bars with affricates (e.g., /tʃ/ → /t͡ʃ/). However, we LSTM encoder-decoder network, and a trans-
note that Gautam et al. (2021:§2.1) report several former. All models were tuned to minimize per-
residual quality issues with this data. language development-set WER using a limited-
budget grid search. Best results overall were ob-
Bulgarian Bulgarian Wiktionary transcriptions tained by the bidirectional LSTM. Despite the
make inconsistent use of tie bars on affricates; for extensive GPU resources required to execute a
117
Armenian (Eastern) arm_e համադրություն h ɑ m ɑ d ә ɾ u tʰ j u n
Bulgarian bul обоснованият o̝ b o̝ s n o v a n i j ә t̪
Dutch dut konijn k oː n ɛ i̯ n
French fre joindre ʒ w ɛ̃ d ʁ̃
Georgian geo მოუქნელად m ɔ u kʰ n ɛ l ɑ d
Serbo-Croatian (Latin) hbs_latn opadati o p ǎː d a t i
Hungarian hun lobog loboɡ
Japanese (Hiragana) jpn_hira ぜんたいしゅぎ d͡z ẽ̞ n t a̠ i ɕ ɨᵝ ɡʲ i
Korean kor 쇠가마우지 sʰ w e̞ ɡ a̠ m a̠ u d͡ʑ i
Vietnamese (Hanoi) vie_hanoi ngừng bắn ŋ ɨ ŋ ˨˩ ʔ ɓ a n ˧˦

Table 1: The ten languages in the medium-resource subtask with language codes and example training data pairs.

Adyghe ady кӏэшӏыхьан ͡tʃʼ a ʃʼ ә ħ aː n


Greek gre λέγεται leʝete
Icelandic ice maður m aː ð ʏ r
Italian ita marito marito
Khmer khm ្របហារ p r ɑ h aː
Latvian lav mīksts m îː k s t s
Maltese (Latin) mlt_latn minna mɪnna
Romanian rum ierburi j e r b u rʲ
Slovenian slv oprostite ɔ p r ɔ s t íː t ɛ
Welsh (Southwest) wel_sw gorff ɡɔrf

Table 2: The ten languages in the low-resource subtask with language codes and example training data pairs.

per-language grid search, the best baseline was dictions are converted back to the composed form
handily outperformed by nearly all submissions. (NFC). An implementation of the baseline was pro-
This led us to seek a simpler, stronger, and vided during the task and participating teams were
less computationally-demanding baseline for this encouraged to adapt it for their submissions.
year’s shared task.
The baseline for the 2021 shared task is a neu- 7 Submissions
ral transducer system using an imitation learn-
ing paradigm (Makarov and Clematide 2018). A Below we provide brief descriptions of sub-
variant of this system (Makarov and Clematide missions to the shared task; more detailed
2020) was the second-best system in the 2020 descriptions—as well as various exploratory anal-
shared task.5 Alignments are computed using yses and post-submission experiments—can be
ten iterations of expectation maximization, and found in the system papers later in this volume.
the imitation learning policy is trained for up to
sixty epochs (with a patience of twelve) using the AZ Hammond (2021) produced a single submis-
Adadelta optimizer. A beam of size of four is sion to the low-resource subtask. The model is in-
used for prediction. Final predictions are produced spired by the previous year’s bidirectional LSTM
by a majority-vote ten-component ensemble. In- baseline but also employs several data augmenta-
ternal processing is performed using the decom- tion strategies. First, much of the development
posed Unicode normalization form (NFD), but pre- data is used for training rather than for validation.
5
Secondly, new training examples are generated us-
The baseline was implemented using the DyNet neural
network toolkit (Neubig et al. 2017). In contrast to the previ- ing substrings of other training examples. Finally,
ous year’s baseline, the imitation learning system does not re- the AZ model is trained simultaneously on all lan-
quire a GPU for efficient training; it runs effectively on CPU guages, a method used in some of the previous
and can exploit multiple CPU cores if present. Training, en-
sembling, and evaluation for all three subtasks took roughly year’s shared task submissions (e.g., Peters and
72 hours of wall-clock time on a commodity desktop PC. Martins 2020, Vesik et al. 2020).
118
CLUZH Clematide and Makarov (2021) pro- baseline model. The results are shown in Table 5;
duced four submissions to the medium-resource note that the individual language results are ex-
subtask and three to the low-resource subtask. All pressed as three-digit percentages since there are
seven submissions are variations on the imitation 1,000 test examples each. While several of the
learning baseline model (section 6). They ex- CLUZH systems outperform the baseline on in-
periment with processing individual IPA Unicode dividual languages, including Armenian, French,
characters instead of entire IPA “segments” (e.g., Hungarian, Japanese, Korean, and Vietnamese,
CLUZH-1, CLUZH-5, and CLUZH-6), and larger the baseline achieves the best macro-accuracy.
ensembles (e.g., CLUZH-3). They also experi-
ment with input dropout, mogrifier LSTMs, and Low-resource subtask Three teams—AZ,
adaptive batch sizes, among other features. CLUZH, and UBC—submitted a total of six
systems to the low-resource subtask. Results for
Dialpad Gautam et al. (2021) produced three this subtask are shown in Table 6; note that the re-
systems to the high-resource subtask. The sults are expressed as two-digit percentages since
Dialpad-1 submission is a large ensemble of seven there are 100 test examples for each language.
different sequence models. Dialpad-2 is a smaller Three submissions outperformed the baseline.
ensemble of three models. Dialpad-3 is a single The best-performing submission was UBC-2, an
transformer model implemented as part of CMU adaptation of the baseline which assigns higher
Sphinx. Gautam et al. also experiment with sub- penalties for mis-predicted vowels and diacritic
word modeling techniques. characters. It achieved a 1.0% absolute (4%
relative) reduction in WER over the baseline.
UBC Lo and Nicolai (2021) submitted two sys-
tems for the low-resource subtask, both variations 8.2 Error analysis
on the baseline model. The UBC-1 submission hy-
pothesizes that, as previously reported by van Esch Error analysis can help identify strengths and
et al. (2016), inserting explicit syllable boundaries weaknesses of existing models, suggesting future
into the phone sequences enhances grapheme-to- improvements and guiding the construction of
phoneme performance. They generate syllable ensemble models. Prior experience using gold
boundaries using an automated onset maximiza- crowd-sourced data extracted from Wiktionary
tion heuristic. The UBC-2 submission takes a dif- suggests that a non-trivial portion of errors made
ferent approach: it assigns additional language- by top systems are due to errors in the gold data
specific penalties for mis-predicted vowels and di- itself. For example, Gorman et al. (2019) report
acritic characters such as the length mark /ː/. that a substantial portion of the prediction errors
made by the top two systems in the 2017 CoNLL–
8 Results SIGMORPHON Shared Task on Morphological
Reinflection (Cotterell et al. 2017) are due to tar-
Multiple submissions to the high- and low-
get errors, i.e., errors in the gold data. Therefore
resource subtasks outperformed the baseline; how-
we conducted an automatic error analysis for four
ever, no submission to the medium-resource sub-
target languages. It was hoped that this analysis
task exceeded the baseline. The best results for
would also help identify (and quantify) target er-
each language are shown in Table 3.
rors in the test data.
8.1 Subtasks Two forms of error analysis were employed
here. First, after Makarov and Clematide (2020),
High-resource subtask The Dialpad team sub-
the most frequent error types in each language are
mitted three systems for the high-resource subtask,
shown in Table 7. From this table it is clear that
all of which outperformed the baseline. Results for
many errors can be attributed either to the ambigu-
this subtask are shown in Table 4. The best sub-
ity of a language’s writing system. For example, in
mission overall, Dialpad-1, a seven-component
both Serbo-Croatian and Slovenian the most com-
ensemble, achieved an impressive 4.5% absolute
mon errors involve the confusion or omission of
(11% relative) reduction in WER over the baseline.
suprasegmental information such as pitch accent
Medium-resource subtask The CLUZH team and vowel length, neither of which are represented
submitted four systems for the medium-resource in the orthography. Likewise, in French and Ital-
subtask. All of of these systems are variants of the ian the most frequent errors confuse vowel sounds
119
Baseline WER Best submission(s) WER
eng_us 41.91 Dialpad-1 37.43
arm_e 7.0 CLUZH-7 6.4
bul 18.3 CLUZH-6 18.8
dut 14.7 CLUZH-7 14.7
fre 8.5 CLUZH-4, CLUZH-5, CLUHZ-6 7.5
geo 0.0 CLUZH-4, CLUHZ-5, CLUZH-6, CLUZH-7 0.0
hbs_latn 32.1 CLUZH-7 35.3
hun 1.8 CLUZH-6, CLUZH-7 1.0
jpn_hira 5.2 CLUZH-7 5.0
kor 16.3 CLUZH-4 16.2
vie_hanoi 2.5 CLUZH-5, CLUZH-7 2.0
ady 22 CLUZH-2, CLUZH-3, UBC-2 22
gre 21 CLUZH-1, CLUZH-3 20
ice 12 CLUZH-1, CLUZH-3 10
ita 19 UBC-1 20
khm 34 UBC-2 28
lav 55 CLUZH-2, CLUZH-3, UBC-2 49
mlt_latn 19 CLUZH-1 12
rum 10 UBC-2 10
slv 49 UBC-2 47
wel_sw 10 CLUZH-1 10

Table 3: Baseline WER, and the best submission(s) and their WER, for each language.

Baseline Dialpad-1 Dialpad-2 Dialpad-3


eng_us 41.94 37.43 41.72 41.58

Table 4: Results for the high-resource (US English) subtask.

represented by the same graphemes. three medium-resource languages and four of the
Many errors may also be attributable to prob- low-resource languages. A fragment of the Bul-
lems with the target data. For example, the two garian covering grammar, showing readings of the
most frequent errors for English are predicting [ɪ] characters б, ф, and ю, is presented in Table 8.6
instead of [ә], and predicting [ɑ] instead of [ɔ]. Let G be the graphemic form of a word and let
Impressionistically, the former is due in part to P and P̂ be the corresponding gold and hypothe-
inconsistent transcription of the -ed and -es suf- sis pronunciations for that word. For error analysis
fixes, whereas the latter may reflect inconsistent we are naturally interested in cases where P ̸= P̂,
transcription of the low back merger. i.e., those cases where the gold and hypothesis
The second error analysis technique used here pronunciations do not match, since these are ex-
is an adaptation of a quality assurance technique actly the cases which contribute to word error rate.
proposed by Jansche (2014). For each language Then, P = πo (G ◦ γ) is a finite-state lattice repre-
targeted by the error analysis, a finite-state cov- senting the set of all “possible” pronunciations of
ering grammar is constructed by manually listing G admitted by the covering grammar.
all pairs of permissible grapheme-phone mappings When P ̸= P̂ but P ∈ P—that is, when
for that language. Let C be the set of all such g, p
6
pairs. Then, the covering grammar γ is the ra- Error analysis software was implemented using the
Pynini finite-state toolkit (Gorman 2016). See Gorman and
tional relation given by the closure over C, thus Sproat 2021, ch. 3, for definitions of the various finite-state
γ = C∗ . Covering grammars were constructed for operations used here.
120
Baseline CLUZH-4 CLUZH-5 CLUZH-6 CLUZH-7
arm_e 7.0 7.1 6.6 6.6 6.4
bul 18.3 20.1 19.2 18.8 19.7
dut 14.7 15.0 14.9 15.6 14.7
fre 8.5 7.5 7.5 7.5 7.6
geo 0.0 0.0 0.0 0.0 0.0
hbs_latn 32.1 38.4 35.6 37.0 35.3
hun 1.8 1.5 1.2 1.0 1.0
jpn_hira 5.2 5.9 5.3 5.5 5.0
kor 16.3 16.2 16.9 17.2 16.3
vie_hanoi 2.5 2.3 2.0 2.1 2.0
Macro-average 10.6 11.4 10.9 11.1 10.8

Table 5: Results for the medium-resource subtask.

Baseline AZ CLUZH-1 CLUZH-2 CLUZH-3 UBC-1 UBC-2


ady 22 30 24 22 22 25 22
gre 21 23 20 22 20 22 22
ice 12 22 10 12 10 13 11
ita 19 25 23 24 21 20 22
khm 34 42 32 33 32 31 28
lav 55 53 53 49 49 58 49
mlt_latn 19 19 12 16 14 19 18
rum 10 13 13 13 12 14 10
slv 49 90 50 59 55 56 47
wel_sw 10 40 10 13 12 13 12
Macro-average 25.1 35.7 24.7 26.3 24.7 27.1 24.1

Table 6: Results for the low-resource subtask.

the gold pronunciation is one of the possible is silent whereas the non-suffixal word-final ent
pronunciations—we refer to such errors as model is normally read as [ɑ̃]. Morphological informa-
deficiencies, since this condition suggests that the tion was not provided to the covering grammar,
system in question has failed to guess one of sev- but it could easily be exploited by grapheme-to-
eral possible pronunciations of the current word. phoneme models.
In many cases this reflects genuine ambiguities in
Another condition of interest is when P ̸= P̂
the orthography itself. For example, in Italian, e
but P ∈ / P. We refer to such errors as coverage de-
is used to write both the phonemes /e, ɛ/ and o is
ficiencies, since they arise when the gold pronun-
similarly read as /o, ɔ/ (Rogers and d’Arcangeli
ciation is not one permitted by the covering gram-
2004). There are few if any orthographic clues
mar. While coverage deficiencies may result from
to which mid-vowel phoneme is intended, and
actual deficiencies in the covering grammar itself,
all submissions incorrectly predicted that the o in
they more often arise when a word does not fol-
nome ‘name’ is read as [ɔ] rather than [o]. Simi-
low the normal orthographic principles of its lan-
lar issues arise in Icelandic and French. The pre-
guage. For instance, Italian has borrowed the En-
ceding examples both represent global ambigui-
glish loanword weekend [wikɛnd] ‘id.’ but has not
ties, but model deficiencies may also occur when
yet adapted it to Italian orthographic principles. Fi-
the system has failed to disambiguate a local am-
nally, coverage deficiencies may indicate target er-
biguity. One example of this can be found in
rors, inconsistencies in the gold data itself. For ex-
French: the verbal third-person plural suffix -ent
ample, in the Italian data, the tie bars used to indi-
121
eng_us ɪ ә 113 ɑ ɔ 112 _ ʊ• 96 _ ɪ• 85 ɪ i 76
arm_e _ ә• 16 ә• _ 10 tʰ d 6 d tʰ 6 j• _ 3
bul ɛ• d͡ 32 a ә 31 ә ɤ 30 _ ◌̝ 27 ә a 25
dut ә eː 10 _ ː 10 ә ɛ 9 eː ә 8 z s 8
fre a ɑ 6 _ •s 5 ɔ o 5 e ɛ•ʁ 3 _ •t 3
geo
hbs_latn _ ː 85 ː _ 76 _ ◌̌ 55 ◌̌ ◌̂ 53 ◌̌ _ 52
hun _ ː 6 h ɦ 3 ʃ s 2 ː _ 2
jpn_hira _ ◌̥ 20 _ ◌̊ 11 _ d͡ 4 ː •ɯ̟ᵝ 3 h ɰᵝ 3
kor _ ː 73 ː _ 28 ʌ̹ ɘː 23 ʰ ◌͈ 9 ɘː ʌ̹ 6
vie_hanoi _ w• 3 _ ˧ 3 _ w•ŋ͡m• 2 ◌͡ɕ •ɹ 2 _ ʔ• 2
ady ʼ _ 3 ː _ 3 ʃ ʂ 3 ə• _ 2 a ә 2
gre ɾ r 8 r ɾ 3 i ʝ 3 m• _ 2 ɣ ɡ 2
ice ː _ 2 ◌̥ _ 2 _ ː 2
ita o ɔ 6 e ɛ 5 j i 3 ◌͡ • 2 ɔ o 2
khm aː i•ә 3 _ ʰ 3 _ •ɑː 2 ĕ ɔ 2 ɑ a 2
lav ◌̄ ◌̂ 11 _ ◌̂ 10 ◌̀ _ 9 ◌̄ _ 7 _ ◌̀ 4
mlt_latn _ ː 5 _ ɪ• 2 ɐ a 2 b p 2 a ɐ 2
rum ◌͡ • 2
slv ◌́ ◌̀ 7 ◌̀ː _ 6 ◌́ː _ 6 _ ◌́ː 5 ɛ éː 4
wel_sw ɪ iː 3 ɪ i̯ 2 _ ɛ• 2

Table 7: The five most frequent error types, represented by the hypothesis string, gold string, and count, for each
language; • indicates whitespace and _ the empty string.

… can obtain the coverage deficiency rate simply by


б b subtracting MDR from WER. By comparing WER
б bj and MDR one can see the overwhelming majority
б p of errors in these seven languages are model defi-
… ciencies, most naturally arising from genuine am-
ф f biguities in orthography rather than target errors
ф fj (i.e., data inconsistencies).
… To facilitate ensemble construction and further
ю ju error analysis, we release all submissions’ test set
ю u predictions to the research community.7

9 Discussion
Table 8: Fragment of a covering grammar for Bul-
garian; the left column contains graphemes and corre- We once again see an enormous difference in lan-
sponding phones are given in the right column. guage difficulty. One of the languages with the
highest amount of data, English, also has one of
the highest WERs. In contrast, the baseline and all
cate affricates are not always present, and many ap- four submissions to the medium-resource subtask
parent errors are the result of gold pronunciations achieve perfect performance on Georgian. This
which omit a tie bar. is a substantial change from the previous year’s
WER and model deficiency rate (MDR) is shared task: with a sample roughly half the size of
shown for select systems and three languages this year’s task, the best system (Yu et al. 2020) ob-
from the medium-resource subtask in Table 9, and tained a WER of 24.89 on Georgian (Gorman et al.
Table 10 shows similar statistics for four low- 7
https://drive.google.com/drive/folders/
resource languages. Note that by construction, one 1Fer7UfHBnt5k-WFHsVXQO8ac3BvREAyC
122
Baseline CLUZH-5
WER MDR WER MDR
bul 18.3 17.6 19.2 19.0
fre 8.5 7.5 7.5 6.8
jpn_hira 5.2 4.4 5.3 4.5

Table 9: WER and model deficiency rate (MDR) for three languages from the medium-resource subtask.

Baseline AZ CLUZH-1 UBC-2


WER MDR WER MDR WER MDR WER MDR
ady 22 22 30 23 24 21 22 22
gre 21 18 23 19 20 17 22 21
ice 12 9 22 17 10 7 11 5
ita 19 15 25 19 23 16 22 19

Table 10: WER and model deficiency rate (MDR) for four languages from the low-resource subtask.

2020:47). This enormous improvement likely re- resources. Some prior work (e.g., Demberg et al.
flects quality assurance work on this language,8 2007) has found morphological tags highly useful,
but we did not anticipate reaching ceiling perfor- and error analysis (§8.2) suggests this information
mance. Insofar as the above quality assurance and would make an impact in French.
error analysis techniques prove effective and gen- There is a large performance gap between the
eralizable, we may soon be able to ask what makes medium-resource and low-resource subtasks. For
a language hard to pronounce (cf. Gorman et al. instance, the baseline achieves a WER of 10.6 in
2020:45f.). the medium-resource scenario and a WER of 25.1
As mentioned above, the data here are a mixture in the low-resource scenario. It seems that cur-
of broad and narrow transcriptions. At first glance, rent models are unable to reach peak performance
this might explain some of the variation in lan- with the 800 training examples provided in the low-
guage difficulty; for example, it is easy to imagine resource subtask. Further work is needed to de-
that the additional details in narrow transcriptions velop more efficient models and data augmenta-
make them more difficult to predict. However, for tion strategies for low-resource scenarios. In our
many languages, only one of the two levels of tran- opinion, this scenario is the most important one
scription is available at scale, and other languages, for speech technology, since speech resources—
divergence between broad and narrow transcrip- including pronunciation data—are scarce for the
tions is impressionistically quite minor. However, vast majority of the world’s written languages.
this impression ought to be quantified.
While we responded to community demand for 10 Conclusions
lower- and higher-resource subtasks, only one
team submitted to the high- and medium-resource The second iteration of the shared task on multi-
subtasks, respectively. It was surprising that none lingual grapheme-to-phoneme conversion features
of the medium-resource submissions were able to many improvements on the previous year’s task,
consistently outperform the baseline model across most of all data quality. Four teams submitted
the ten target languages. Clearly, this year’s base- thirteen systems, achieving substantial reductions
line is much stronger than the previous year’s. in both absolute and relative error over the base-
line in two of three subtasks. We hope the code
Participants in the high- and medium-resource
and data, released under permissive licenses,9 will
subtasks were permitted to make use of lemmas
be used to benchmark grapheme-to-phoneme con-
and morphological tags from UniMorph as addi-
version and sequence-to-sequence modeling tech-
tional features. However, no team made use of
niques more generally.
8
https://github.com/CUNY-CL/wikipron/
9
issues/138 https://github.com/sigmorphon/2021-task1/
123
Acknowledgements Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta,
Shijie Wu, and Daniel You. 2020. The SIGMOR-
We thank the Wiktionary contributors, particularly PHON 2020 shared task on multilingual grapheme-
Aryaman Arora, without whom this shared task to-phoneme conversion. In Proceedings of the 17th
would be impossible. We also thank contribu- SIGMORPHON Workshop on Computational Re-
search in Phonetics, Phonology, and Morphology,
tors to WikiPron, especially Sasha Gutkin, Jack- pages 40–50.
son Lee, and the participants of Hacktoberfest
2020. Finally, thank you to Andrew Kirby for last- Kyle Gorman, Arya D. McCarthy, Ryan Cotterell,
minute copy editing assistance. Ekaterina Vylomova, Miikka Silfverberg, and Mag-
dalena Markowska. 2019. Weird inflects but OK:
making sense of morphological generation errors. In
Proceedings of the 23rd Conference on Computa-
References tional Natural Language Learning, pages 140–151.
Simon Clematide and Peter Makarov. 2021. CLUZH
Kyle Gorman and Richard Sproat. 2021. Finite-State
at SIGMORPHON 2021 Shared Task on Multilin-
Text Processing. Morgan & Claypool.
gual Grapheme-to-Phoneme Conversion: variations
on a baseline. In Proceedings of the 18th SIG-
Michael Hammond. 2021. Data augmentation for low-
MORPHON Workshop on Computational Research
resource grapheme-to-phoneme mapping. In Pro-
in Phonetics, Phonology, and Morphology.
ceedings of the 18th SIGMORPHON Workshop on
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Computational Research in Phonetics, Phonology,
Géraldine Walther, Ekaterina Vylomova, Patrick and Morphology.
Xia, Manaal Faruqui, Sandra Kübler, David
Yarowsky, Jason Eisner, and Mans Hulden. 2017. S. J. Hannahs. 2013. The Phonology of Welsh. Oxford
CoNLL–SIGMORPHON 2017 shared task: univer- University Press.
sal morphological reinflection in 52 languages. In
Proceedings of the CoNLL SIGMORPHON 2017 Bradley Hauer, Amir Ahmad Habibi, Yixing Luan,
Shared Task: Universal Morphological Reinflection, Arnob Mallik, and Grzegorz Kondrak. 2020. Low-
pages 1–30. resource G2P and P2G conversion with synthetic
training data. In Proceedings of the 17th SIGMOR-
Vera Demberg, Helmut Schmid, and Gregor Möhler. PHON Workshop on Computational Research in
2007. Phonological constraints and morphologi- Phonetics, Phonology, and Morphology, pages 117–
cal preprocessing for grapheme-to-phoneme conver- 122.
sion. In Proceedings of the 45th Annual Meeting of
the Association of Computational Linguistics, pages Robert Hoberman. 2007. Maltese morphology. In
96–103. Alan S. Kaye, editor, Morphologies of Asia and
Africa, volume 1, pages 257–282. Eisenbrauns.
Daan van Esch, Mason Chua, and Kanishka Rao. 2016.
Predicting pronunciations with syllabification and Martin Jansche. 2014. Computer-aided quality as-
stress with recurrent neural networks. In INTER- surance of an Icelandic pronunciation dictionary.
SPEECH 2016: 17th Annual Conference of the In Proceedings of the Ninth International Confer-
International Speech Communication Association, ence on Language Resources and Evaluation, pages
pages 2841–2845. 2111–2114.

Vasundhara Gautam, Wang Yau Li, Zafarullah Paul Kingsbury, Stephanie Strassel, Cynthia
Mahmood, Frederic Mailhot, Shreekantha Nadig, McLemore, and Robert MacIntyre. 1997. CALL-
Riqiang Wang, and Nathan Zhang. 2021. Avengers, HOME American English Lexicon (PRONLEX).
ensemble! Benefits of ensembling in grapheme-to- LDC97L20.
phoneme prediction. In Proceedings of the 18th
SIGMORPHON Workshop on Computational Re- Christo Kirov, Ryan Cotterell, John Sylak-Glassman,
search in Phonetics, Phonology, and Morphology. Géraldine Walther, Ekaterina Vylomova, Patrick
Xia, Manaal Faruqui, Arya D. McCarthy, Sandra
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Kübler, David Yarowsky, Jason Eisner, and Mans
2012. Building large monolingual dictionaries at the Hulden. 2018. UniMorph 2.0: universal morphol-
Leipzig Corpora Collection: from 100 to 200 lan- ogy. In Proceedings of the 11th Language Resources
guages. In Proceedings of the Eighth International and Evaluation Conference, pages 1868–1873.
Conference on Language Resources and Evaluation,
pages 759–765. Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza,
Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D.
Kyle Gorman. 2016. Pynini: a Python library for McCarthy, and Kyle Gorman. 2020. Massively mul-
weighted finite-state grammar compilation. In Pro- tilingual pronunciation mining with WikiPron. In
ceedings of the SIGFSM Workshop on Statistical Proceedings of the 12th Language Resources and
NLP and Weighted Automata, pages 75–80. Evaluation Conference, pages 4216–4221.
124
Roger Yu-Hsiang Lo and Garrett Nicolai. 2021. Lin- Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2020.
guistic knowledge in multilingual grapheme-to- Ensemble self-training for low-resource languages:
phoneme conversion. In Proceedings of the 18th grapheme-to-phoneme conversion and morpholog-
SIGMORPHON Workshop on Computational Re- ical inflection. In Proceedings of the 17th SIG-
search in Phonetics, Phonology, and Morphology. MORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages
Peter Makarov and Simon Clematide. 2018. Imita- 70–78.
tion learning for neural morphological string trans-
duction. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
pages 2877–2882.

Peter Makarov and Simon Clematide. 2020. CLUZH


at SIGMORPHON 2020 shared task on multilingual
grapheme-to-phoneme conversion. In Proceedings
of the 17th SIGMORPHON Workshop on Computa-
tional Research in Phonetics, Phonology, and Mor-
phology, pages 171–176.

Steven Moran and Michael Cysouw. 2018. The Uni-


code Cookbook for Linguists: Managing Writing
Systems using Orthography Profiles. Language Sci-
ence Press.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin


Matthews, Waleed Ammar, Antonios Anastasopou-
los, and Pengcheng Yin. 2017. DyNet: the dynamic
neural network toolkit. arXiv:1701.03980.

Ben Peters and André F.T. Martins. 2020. One-size-


fits-all multilingual models. In Proceedings of the
17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 63–69.

Kanishka Rao, Fuchun Peng, Haşim Sak, and


Françoise Beaufays. 2015. Grapheme-to-phoneme
conversion using long short-term memory recurrent
neural networks. In 2015 IEEE International Con-
ference on Acoustics, Speech and Signal Processing,
pages 4225–4229.

Derek Rogers and Luciana d’Arcangeli. 2004. Italian.


Journal of the International Phonetic Association,
34(1):117–121.

Elmar Ternes and Tatjana Vladimirova-Buhtz. 1990.


Bulgarian. Journal of the International Phonetic As-
sociation, 20(1):45–47.

Kaili Vesik, Muhammad Abdul-Mageed, and Miikka


Silfverberg. 2020. One model to pronounce them
all: multilingual grapheme-to-phoneme conversion
with a transformer ensemble. In Proceedings of the
17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 146–152.

Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-


to-sequence neural net models for grapheme-to-
phoneme conversion. In INTERSPEECH 2015:
16th Annual Conference of the International Speech
Communication Association, pages 3330–3334.
125
Data augmentation for low-resource grapheme-to-phoneme mapping

Michael Hammond
Dept. of Linguistics
U. of Arizona
Tucson, AZ, USA
[email protected]

Abstract Code Language


ady Adyghe
In this paper we explore a very simple neural
approach to mapping orthography to phonetic gre Modern Greek
transcription in a low-resource context. The ice Icelandic
basic idea is to start from a baseline system ita Italian
and focus all efforts on data augmentation. We khm Khmer
will see that some techniques work, but others lav Latvian
do not. mlt( latn) Maltese (Latin script)
1 Introduction rum Romanian
slv Slovene
This paper describes a submission by our team to wel( sw) Welsh (South Wales dialect)
the 2021 edition of the SIGMORPHON Grapheme-
to-Phoneme conversion challenge. Here we demon- Table 1: Languages and codes
strate our efforts to improve grapheme-to-phoneme
mapping for low-resource languages in a neural each of the moves above separately. We will see
context using only data augmentation techniques. that some work and some do not.
The basic problem in the low-resource condition
We acknowledge at the outset that we do not ex-
was to build a system that maps from graphemes
pect a system of this sort to “win”. Rather, we were
to phonemes with very limited data. Specifically,
interested in seeing how successful a minimalist
there were 10 languages with 800 training pairs
approach might be, one that did not require major
and 100 development pairs. Each pair was a word
changes in system architecture or training. This
in its orthographic representation and a phonetic
minimalist approach entailed that the system not
transcription of that word (though some multi-word
require a lot of detailed manipulation and so we
sequences were also included). Systems were then
started with a “canned” system. This approach also
tested on 100 additional pairs for each language.
entailed that training be something that could be
The 10 languages are given in Table 1.
accomplished with modest resources and time. All
To focus our efforts, we kept to a single system,
configurations below were run on a Lambda Labs
intentionally similar to one of the simple baseline
Tensorbook with a single GPU.1 No training run
systems from the previous year’s challenge.
took more than 10 minutes.
We undertook and tested three data augmenta-
tion techniques. 2 General architecture
1. move as much development data to training The general architecture of the model is inspired by
data as possible one of the 2020 baseline systems (Gorman et al.,
2. extract substring pairs from the training data 2020): a sequence-to-sequence neural net with a
to use as additional training data two-level LSTM encoder and a two-level LSTM
decoder. The system we used is adapted from the
3. train all the languages together OpenNMT base (Klein et al., 2017).
There is a 200-element embedding layer in both
In the following, we first provide additional de-
tails on our base system and then outline and test 1
RTX 3080 Max-Q.

126
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 126–130
August 5, 2021. ©2021 Association for Computational Linguistics
encoder and decoder. Each LSTM layer has 300 Character Word
nodes. The systems are connected by a 5-head 94.84 75.6
attention mechanism (Luong et al., 2015). Training 94.78 75.3
proceeds in 24,000 steps and the learning rate starts 94.46 74.0
at 1.0 and decays at a rate of 0.8 every 1,000 steps 94.84 75.5
starting at step 10,000. Optimization is stochastic 94.71 75.0
gradient descent, the batch size is 64, the dropout 94.59 74.5
rate is 0.5. 94.90 75.8
We spent a fair amount of time tuning the system 94.53 74.2
to these settings for optimal performance with this 94.53 74.2
general architecture on these data. We do not detail 94.71 75.0
these efforts as this is just a normal part of working Mean 94.69 74.91
with neural nets and not our focus here.
Table 2: Development accuracy for 10 runs of the full
Precise instructions for building the docker im-
system with all languages grouped together with esti-
age, full configuration files, and auxiliary code files mated word-level accuracy
are available at https://github.com/hammondm/
g2p2021. Character Word
94.60 74.6
3 General results 94.71 75.0
In this section, we give the general results of the full 94.35 73.8
system with all strategies in place and then in the 94.48 74.0
next sections we strip away each of our augmenta- 94.48 74.0
tion techniques to see what kind of effect each has. 94.50 74.2
In building our system, we did not have access to 94.59 74.5
the correct transcriptions for the test data provided, 94.71 75.0
so we report performance on the development data 94.80 75.4
here. 94.59 74.5
The system was subject to certain amount of Mean 94.58 74.5
randomness because of randomization of training Table 3: Development accuracy for 10 runs of the re-
data and random initial weights in the network. We duced system with all languages grouped together with
therefore report mean final accuracy scores over 100 development pairs with estimated word-level accu-
multiple runs. racy
Our system provides accuracy scores for devel-
opment data in terms of character-level accuracy.
that we are reporting accuracy rather than error rate,
The general task was scored in terms of word-level
so the goal is to maximize these values.
error rate, but we keep this measure for several
reasons. First, it was simply easier as this is what 4 Using development data
the system provided as a default. Second, this is
a more granular measure that enabled us to adjust The default partition for each language is 800 pairs
the system more carefully. Finally, we were able for training and 100 pairs for development. We
to simulate word-level accuracy in addition as de- shifted this to 880 pairs for training and 20 pairs
scribed below. for development. The logic of this choice was
We use a Monte Carlo simulation to calculate to retain what seemed like the minimum number
expected word-level accuracy based on character- of development items. Running the system ten
level accuracy and average transcription length for times without this repartitioning gives the results
the training data for the different languages. The in Table 3.
simulation works by generating 100, 000 words There is a small difference in the right direction,
with a random distribution of a specific character- but it is not significant for characters (t = −1.65,
level accuracy rate and then calculating word-level p = 0.11, unpaired) or words (t = −1.56, p =
accuracy from that. Running the full system ten 0.13, unpaired). It may be that with a larger sample
times, we get the results in Table 2. Keep in mind of runs, the difference becomes more stable.

127
Code Items added Character Word
ady 4 95.15 76.9
gre 223 94.40 73.7
ice 58 95.15 76.8
ita 194 94.59 74.5
khm 39 94.65 74.8
lav 100 95.27 77.4
mlt latn 62 94.53 74.2
rum 119 94.78 75.2
slv 127 95.09 76.6
wel sw 7 94.59 74.5
Mean 94.82 75.46
Table 4: Number of substrings added for each language
Table 5: 10 runs with all languages grouped together
without substrings added for each language
5 Using substrings
This method involves finding peripheral letters that result in a y in a non-final syllable ending up in a
map unambiguously to some symbol and then find- final syllable in a substring generated as above.
ing plausible splitting points within words to create Table 5 shows the results of 10 runs without
partial words that can be added to the training data. these additions and simulated word error rates for
Let’s exemplify this with Welsh. First, we iden- each run.
tify all word-final letters that always correspond to
Strikingly, adding the substrings lowered per-
the same symbol in the transcription. For exam-
formance, but the difference with the full model
ple, the letter c always corresponds to a word-final
is not significant for either characters (t = 1.18,
[k]. Similarly, we identify word-initial characters
p = 0.25, unpaired) or for words (t = 1.17,
with the same property. For example, in these data,
p = 0.25, unpaired). This model without sub-
the word-initial letter t always corresponds to [t].2
strings is the best-performing of all the models we
We then search for any word in training that has
tried, so this is what was submitted.
the medial sequence ct where the transcription has
[kt]. We take each half of the relevant item and add 6 Training together
them to the training data if that pair is not already
there. For example, the word actor [aktOr] fits the The basic idea here was to leverage the entire set
pattern, so we can add the pairs ac-ak and tor-tOr. of languages. Thus all languages were trained to-
to the training data. Table 4 gives the number of gether. To distinguish them, each pair was prefixed
items added for each language. This strategy is by its language code. Thus if we had orthogra-
a more limited version of the “slice-and-shuffle” phy O = ho1 , o2 , . . . , on i and transcription T =
approach used by Ryan and Hulden (2020) in last ht1 , t2 , . . . , tm i from language x, the net would be
year’s challenge. trained on the pair O0 = hx, o1 , o2 , . . . , on i and
Note that this procedure can make errors. If there T 0 = hx, t1 , t2 , . . . , tm i. The idea is that, while
are generalizations about the pronunciation of let- the mappings and orthographies are distinct, there
ters that are not local, that involve elements at a are similarities in what letters encode what sounds
distance, this procedure can obscure those. Another and in the possible sequences of sounds that can
example from Welsh makes the point. There are occur in the transcriptions. This approach is very
exceptions, but the letter y in Welsh is pronounced similar to that of Peters et al. (2017), except that
two ways. In a word-final syllable, it is pronounced we tag the orthography and the transcription with
[1], e.g. gwyn [gw1n] ‘white’. In a non-final sylla- the same langauge tag. Peters and Martins (2020)
ble, it is pronounced [@], e.g. gwynion [gw@njOn] took a similar approach in last years challenge, but
‘white ones’. Though it doesn’t happen in the train- use embeddings prefixed at each time step.
ing data here, the procedure above could easily In Table 6 we give the results for running each
2
language separately 5 times. Since there was a lot
This is actually incorrect for the language as a whole.
Word-initial t in the digraph th corresponds to a different less training data for each run, these models settled
sound [T]. faster, but we ran them the same number of steps

128
as the full models for comparison purposes. G. Klein, Y. Kim, Y. Y. Deng, J. Senellart, and
There’s a lot of variation across runs and the A.M. Rush. 2017. OpenNMT: Open-source toolkit
for neural machine translation. ArXiv e-prints.
means for each language are quite different, pre- 1701.02810.
sumably based on different levels of orthographic
transparency. The general pattern is clear that, over- Minh-Thang Luong, Hieu Pham, and Christopher D.
all, training together does better than training sep- Manning. 2015. Effective approaches to attention-
based neural machine translation. In Proceedings of
arately. Comparing run means with our baseline the 2015 Conference on Empirical Methods in Natu-
system is significant (t = −6.06, p < .001, un- ral Language Processing, pages 1412–1421.
paired).
Ben Peters, Jon Dehdari, and Josef van Genabith.
This is not true in all cases however. For some 2017. Massively multilingual neural grapheme-to-
languages, individual training seems to be better phoneme conversion. In Proceedings of the First
than training together. Our hypothesis is that lan- Workshop on Building Linguistically Generalizable
guages that share similar orthographic systems did NLP Systems, pages 19–26, Copenhagen. Associa-
tion for Computational Linguistics.
better with joint training and that languages with
diverging systems suffered. Ben Peters and André F. T. Martins. 2020. DeepSPIN
The final results show that our best system (no at SIGMORPHON 2020: One-size-fits-all multilin-
substrings included, all languages together, moving gual models. In Proceedings of the 17th SIGMOR-
PHON Workshop on Computational Research in
development data to training) performed reason- Phonetics, Phonology, and Morphology, pages 63–
ably for some languages, but did quite poorly for 69. Association for Computational Linguistics.
others. This suggests a hybrid strategy that would
Zach Ryan and Mans Hulden. 2020. Data augmen-
have been more successful. In addition to training tation for transformer-based G2P. In Proceedings
the full system here, train individual systems for of the 17th SIGMORPHON Workshop on Computa-
each language. For test, compare final develop- tional Research in Phonetics, Phonology, and Mor-
ment results for individual languages for the jointly phology, pages 184–188. Association for Computa-
tional Linguistics.
trained system and the individually trained systems
and use whichever does better for each language in
testing.

7 Conclusion
To conclude, we have augmented a basic sequence-
to-sequence LSTM model with several data aug-
mentation moves. Some of these were successful:
redistributing data from development to training
and training all the languages together. Some tech-
niques were not successful though: the substring
strategy resulted in diminished performance.

Acknowledgments
Thanks to Diane Ohala for useful discussion.
Thanks to several anonymous reviewers for very
helpful feedback. All errors are my own.

References
Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta,
Arya McCarthy, Shijie Wu, and Daniel You. 2020.
The SIGMORPHON 2020 shared task on multilin-
gual grapheme-to-phoneme conversion. In Proceed-
ings of the 17th SIGMORPHON Workshop on Com-
putational Research in Phonetics, Phonology, and
Morphology, pages 40–50. Association for Compu-
tational Linguistics.

129
language 5 separate runs Mean
ady 95.27 91.12 93.49 94.67 93.49 93.61
gre 97.25 98.35 98.35 98.90 98.90 98.35
ice 91.16 94.56 93.88 90.48 94.56 92.93
ita 93.51 94.59 94.59 94.59 94.59 94.38
khm 94.19 90.32 90.97 90.97 90.97 91.48
lav 94.00 90.67 89.33 92.00 90.67 91.33
mlt latn 91.89 94.59 91.89 92.57 93.24 92.84
rum 95.29 96.47 94.71 95.88 95.29 95.51
slv 94.01 94.61 04.61 94.61 94.01 94.37
wel sw 96.30 97.04 96.30 97.04 96.30 96.59
Mean 94.29 94.23 93.81 94.17 94.2 94.14

Table 6: 5 separate runs for each language

130
Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion

Roger Yu-Hsiang Lo Garrett Nicolai


Department of Linguistics Department of Linguistics
The University of British Columbia The University of British Columbia
[email protected] [email protected]

Abstract In this paper, we describe our methodology and


approaches to the low-resource setting, including
This paper documents the UBC Linguistics
insights that informed our methods. We conclude
team’s approach to the SIGMORPHON 2021
Grapheme-to-Phoneme Shared Task, concen- with an extensive error analysis of the effectiveness
trating on the low-resource setting. Our sys- of our approach.
tems expand the baseline model with simple This paper is structured as follows: Section 2
modifications informed by syllable structure overviews previous work on G2P conversion. Sec-
and error analysis. In-depth investigation of tion 3 gives a description of the data in the low-
test-set predictions shows that our best model resource subtask, evaluation metric, and baseline
rectifies a significant number of mistakes com-
results, along with the baseline model architecture.
pared to the baseline prediction, besting all
other submissions. Our results validate the Section 4 introduces our approaches as well as the
view that careful error analysis in conjunction motivation behind them. We present our results in
with linguistic knowledge can lead to more ef- Section 5 and associated error analyses in Section 6.
fective computational modeling. Finally, Section 7 concludes our paper.

1 Introduction 2 Previous Work on G2P conversion


With speech technologies becoming ever more
prevalent, grapheme-to-phoneme (G2P) conversion The techniques for performing G2P conversion
is an important part of the pipeline. G2P conver- have long been coupled with contemporary ma-
sion refers to mapping a sequence of orthographic chine learning advances. Early paradigms utilize
representations in some language to a sequence of joint sequence models that rely on the alignment
phonetic symbols, often transcribed in the Interna- between grapheme and phoneme, usually with
tional Phonetic Alphabet (IPA). This is often an variants of the Expectation-Maximization (EM)
early step in tasks such as text-to-speech, where algorithm (Dempster et al., 1977). The result-
the pronunciation must be determined before any ing sequences of graphones (i.e., joint grapheme-
speech is produced. An example of such a G2P phoneme tokens) are then modeled with n-gram
conversion, in Amharic, is illustrated below: models or Hidden Markov Models (e.g., Jiampoja-
marn et al., 2007; Bisani and Ney, 2008; Jiampo-
€≈r{ 7→ [amar1ï:a] ‘Amharic’ jamarn and Kondrak, 2010). A variant of this
For the second year, one of SIGMORPHON paradigm includes weighted finite-state transducers
shared tasks concentrates on G2P. This year, the trained on such graphone sequences (Novak et al.,
task is further broken into three subtasks of varying 2012, 2015).
data levels: high-resource ( 33K training instances), With the rise of various neural network tech-
medium-resource (8K training instances), and low- niques, neural-based methods have dominated the
resource (800 training instances). Our focus is on scene ever since. For example, bidirectional long
the low-resource subtask. The language data and short-term memory-based (LSTM) networks using
associated constraints in the low-resource setting a connectionist temporal classification layer pro-
will be summarized in Section 3.1; the reader inter- duce comparable results to earlier n-gram models
ested in the other two subtasks is referred to Ashby (Rao et al., 2015). By incorporating alignment in-
et al. (this volume) for an overview. formation into the model, the ceiling set by n-gram

131
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 131–140
August 5, 2021. ©2021 Association for Computational Linguistics
models has since been broken (Yao and Zweig, 3.3 Baselines
2015). Attention further improved the performance, The official baselines for individual languages are
as attentional encoder-decoders (Toshniwal and based on an ensembled neural transducer trained
Livescu, 2016) learned to focus on specific input se- with the imitation learning (IL) paradigm (Makarov
quences. As attention became “all that was needed” and Clematide, 2018a). The baseline WERs are tab-
(Vaswani et al., 2017), transformer-based architec- ulated in Table 3. In what follows, we overview this
tures have begun looming large (e.g., Yolchuyeva baseline neural-transducer system, as our models
et al., 2019). are built on top of this system. The detailed formal
Recent years have also seen works that capital- description of the baseline system can be found in
ize on multilingual data to train a single model Makarov and Clematide (2018a,b,c, 2020).
with grapheme-phoneme pairs from multiple lan- The neural transducer in question defines a con-
guages. For example, various systems from last ditional distribution over edit actions, such as copy,
year’s shared task submissions learned from a mul- deletion, insertion, and substitution:
tilingual signal (e.g., ElSaadany and Suter, 2020;
Peters and Martins, 2020; Vesik et al., 2020). |a|
Y
pθ (y, a|x) = pθ (aj |a<j , x),
3 The Low-resource Subtask j=1

where x denotes an input sequence of graphemes,


This section provides relevant information concern-
and a = a1 . . . a|a| stands for a sequence of edit
ing the low-resource subtask.
actions. Note that the ouput sequence y is missing
from the conditional probability on the right-hand
3.1 Task Data
side as it can be deterministically computed from x
The provided data in the low-resource subtask and a. The model is implemented with an LSTM
come from ten languages1 : Adyghe (ady; in the decoder, coupled with a bidirectional LSTM en-
Cyrillic script), Modern Greek (gre; in the Greek coder.
alphabet), Icelandic (ice), Italian (ita), Khmer The model is trained with IL and therefore de-
(khm; in the Khmer script, which is an alphasyl- mands an expert policy, which contains demon-
labary system), Latvian (lat), Maltese transliter- strations of how the task can be optimally solved
ated into the Latin script (mlt_latn), Romanian given any configuration. Cast as IL, the mapping
(rum), Slovene (slv), and the South Wales dialect from graphemes to phonemes can be understood
of Welsh (wel_sw). The data are extracted from as following an optimal path dictated by the expert
Wikitionary2 using WikiPron (Lee et al., 2020), and policy that gradually turns input orthographic sym-
filtered and downsampled with proprietary tech- bols to output IPA characters. To acquire the expert
niques, resulting in each language having 1,000 policy, a Stochastic Edit Distance (Ristad and Yian-
labeled grapheme-phoneme pairs, split into a train- ilos, 1998) model trained with the EM algorithm
ing set of 800 pairs, a development set of 100 pairs, is employed to find an edit sequence consisting of
and a blind test set of 100 pairs. four types of edits: copy, deletion, insertion, and
substitution. During training time, the expert policy
3.2 The Evaluation Metric is queried to identify the next optimal edit that min-
imizes the following objective expressed in terms
This year, the evaluation metric is the word er-
of Levenshtein distance and edit sequence cost:
ror rate (WER), which is simply the percentage
of words for which the predicted transcription se-
quence differs from the ground-truth transcription. βED(ŷ, y) + ED(x, ŷ), β ≥ 1,
Different systems are ranked based on the macro- where the first term is the Levenshtein distance
average over all languages, with lower scores indi- between the target sequence y and the predicted
cating better systems. We also adopted this metric sequence ŷ, and the second term measures the cost
when evaluating our models on the development of editing x to ŷ.
sets. The baseline is run with default hyperparameter
1 values, which include ten different initial seeds and
All output is represented in IPA; unless specified other-
wise, the input is written in the Latin alphabet. a beam of size 4 during inference. The predictions
2
https://www.wiktionary.org/ of these individual models are ensembled using a

132
voting majority. Early efforts to modify the ensem- To identify syllable boundaries in the input se-
ble to incorporate system confidence showed that a quence, we adopted a simple heuristic, the specific
majority ensemble was sufficient. steps of which are listed below:3
This model has proved to be competitive, judg-
ing from its performance on the previous year’s 1. Find vowels in the output: We first identify
G2P shared task. We therefore decided to use it as the vowels in the phoneme sequence by com-
the foundation to construct our systems. paring each segment with the vowel symbols
from the IPA chart. For instance, the symbols
4 Our Approaches [ø] and [y] in [th røyst] for Icelandic traust are
vowels because they match the vowel symbols
This section lays out our attempted approaches. [ø] and [y] on the IPA chart.
We investigate two alternatives, both linguistic in
nature. The first is inspired by a universal linguistic 2. Find vowels in the input: Next we align
structure—the syllable—and the other by the error the grapheme sequence with the phoneme se-
patterns discerned from the baseline predictions on quence using an unsupervised many-to-many
the development data. aligner (Jiampojamarn et al., 2007; Jiampo-
jamarn and Kondrak, 2010). By identifying
4.1 System 1: Augmenting Data with graphemes that are aligned to phonemic vow-
Unsupervised Syllable Boundaries els, we can identify vowels in the input. Using
Our first approach originates from the observation the Icelandic example again, the aligner pro-
that, in natural languages, a sequence of sounds duces a one-to-one mapping: t 7→ th , r 7→ r, a
does not just assume a flat structure. Neighboring 7→ ø, u 7→ y, s 7→ s, and t 7→ t. We therefore
sounds group to form units, such as the onset, nu- assume that the input characters a and u rep-
cleus, and coda. In turn, these units can further resent two vowels. Note that this step is often
project to a syllable (see Figure 1 for an example redundant for input sequences based on the
of such projection). Syllables are useful structural Latin script but is useful in identifying vowel
units in describing various linguistic phenomena symbols in other scripts.
and indeed in predicting the pronunciation of a
3. Find valid onsets and codas: A key step in
word in some languages (e.g., Treiman, 1994). For
syllabification is to identify which sequences
instance, in Dutch, the vowel quality of the nu-
of consonants can form an onset or a coda.
cleus can be reliably inferred from the spelling
Without resorting to linguistic knowledge, one
after proper syllabification: .dag. [dAx] ‘day’ but
way to identify valid onsets and codas is to
.da.gen. [da:G@n] ‘days’, where . marks syllable
look at the two ends of a word—consonant
boundaries. Note that a in a closed syllable is pro-
sequences appearing word-initially before the
nounced as the short vowel [A], but as the long
first vowel are valid onsets, and consonant
vowel [a:] in an open syllable. In applying syllabi-
sequences after the final vowel are valid codas.
fication to G2P conversion, van Esch et al. (2016)
Looping through each input sequence in the
find that training RNNs to jointly predict phoneme
training data gives us a list of valid onsets and
sequences, syllabification, and stress leads to fur-
codas. In the Icelandic example traust, the
ther performance gains in some languages, com-
initial tr sequence must be a valid onset, and
pared to models trained without syllabification and
the final st sequence a valid coda.
stress information.
4. Break word-medial consonant sequences
Syllable
into an onset and a coda: Unfortunately
identifying onsets and codas among word-
Onset Rhyme medial consonant sequences is not as straight-
forward. For example, how do we know the
Nucleus Coda 3
We are aware that different languages permit distinct
syllable constituents (e.g., some languages allow syllabic con-
t w E ł f T sonants while others do not), but given the restriction that we
are not allowed to use external resources in the low-resource
subtask, we simply assume that all syllables must contain a
Figure 1: The syllable structure of twelfth [twEłfT] vowel.

133
sequence in the input VngstrV (V for a vowel while other languages simply default to vowel
character) should be parsed as Vng.strV, as hiatuses/two side-by-side nuclei (e.g., Italian
Vn.gstrV, or even as V.ngstrV? To tackle this badia 7→ [badia])—indeed, both are common
problem, we use the valid onset and coda lists cross-linguistically. We again rely on the
gathered from the previous step: we split the alignment results in the second step to select
consonant sequence into two parts, and we the vowel segmentation strategy for individual
choose the split where the first part is a valid languages.
coda and the second part a valid onset. For
instance, suppose we have an onset list {str, After we have identified the syllables that com-
tr} and a coda list {ng, st}. This implies that pose each word, we augmented the input se-
we only have a single valid split—Vng.strV— quences with syllable boundaries. We identify
so ng is treated as the coda for the previous four labels to distinguish different types of sylla-
syllable and str as the onset for the follow- ble boundaries: <cc>, <cv>. <vc>, and <vv>,
ing syllable. In the case where more than one depending on the classes of sound the segments
split is acceptable, we favor the split that pro- straddling the syllable boundary belong to. For
duces a more complex onset, based on the instance, the input sequence b í l a v e r
linguistic heuristic that natural languages tend k s t æ ð i in Icelandic will be augmented
to tolerate more complex onsets than codas. to be b í <vc> l a <vc> v e r k <cc>
For example, Vng.strV > Vngs.trV. In the s t æ <vc> ð i. We applied the same syl-
situation where none of the splits produces a labification algorithm to all languages to generate
concatenation of a valid coda and onset, we new input sequences, with the exception of Khmer,
adopt the following heuristic: as the Khmer script does not permit a straightfor-
ward linear mapping between input and output se-
• If there is only one medial consonant quences, which is crucial for the vowel identifi-
(such as in the case where the consonant cation step. We then used these syllabified input
can only occur word-internally but not sequences, along with their target transcriptions, as
in the onset or coda position), this con- the training data for the baseline model.4
sonant is classified as the onset for the
following syllable. 4.2 System 2: Penalizing Vowel and Diacritic
• If there is more than one consonant, the Errors
first consonant is classified as the coda Our second approach focuses on the training ob-
and attached to the previous syllable jective of the baseline model, and is driven by
while the rest as the onset of the follow- the errors we observed in the baseline predictions.
ing syllable. Specifically, we noticed that the majority of er-
Of course, this procedure is not free of errors rors for the languages with a high WER—Khmer,
(e.g., some languages have onsets that are only Latvian, and Slovene—concerned vowels, some
allowed word-medially, so word-initial onsets examples of which are given in Table 1. Note the
will naturally not include them), but overall it nature of these mistakes: the mismatch can be in
gives reasonable results. the vowel quality (e.g., [O] for [o]), in the vowel
length (e.g., [á:] for [á]), in the pitch accent (e.g.,
5. Form syllables: The last step is to put to- [ı́:] for [ı̀:]), or a combination thereof.
gether consonant and vowel characters to form Based on the above observation, we modified
syllables. The simplest approach is to allow the baseline model to explicitly address this vowel-
each vowel character to be projected as a nu- mismatching issue. We modified the objective such
cleus and distribute onsets and codas around that erroneous vowel or diacritic (e.g., the length-
these nuclei to build syllables. If there are ening marker [:]) predictions during training incur
four vowels in the input, there are likewise 4
The hyperparameters used are the default values provided
four syllables. There is one important caveat, in the baseline model code: character and action embedding =
however. When there are two or more consec- 100, encoder LSTM state dimension = decoder LSTM state
utive vowel characters, some languages prefer dimension = 200, encoder layer = decoder layer = 1, beam
width = 4, roll-in hyperparameter = 1, epochs = 60, patience
to merge them into a single vowel/nucleus in = 12, batch size = 5, EM iterations = 10, ensemble size =
their pronunciation (e.g., Greek και 7→ [ce]) 10.

134
Language Target Baseline prediction 5 Results
khm nuh n ŭ @ h The performances of our systems, measured in
r O: j r ĕ @ j WER, are juxtaposed with the official baseline re-
s p ŏ @ n span sults in Table 3. We first note that the baseline was
particularly strong—gains were difficult to achieve
lat t s e: l s t s Ê: l s
for most languages. Our first system (Syl), which is
j u ō k s j ù o k s
based on syllabic information, unfortunately does
v æ̂ l s v ǣ: l s
not outperform the baseline. Our second system
slv j ó: g u r t j O g ú: r t (VP), which includes additional penalties for vow-
k r ı̀: S k r ı́: S els and diacritics, however, does outperform the
z d á j z d á: j baselines in several languages. Furthermore, the
macro WER average not only outperforms the base-
Table 1: Typical errors in the development set that in- line, but all other submitted systems.
volve vowels from Khmer (khm), Latvian (lat), and
Slovene (slv)
WER
Language Baseline Syl VP
additional penalties. Each incorrectly-predicted
vowel incurs this penalty. The penalty acts as a ady 22 25 22
regularizer that forces the model to expend more gre 21 22 22
effort on learning vowels. This modification is in ice 12 13 11
the same spirit as the softmax-margin objective of ita 19 20 22
Gimpel and Smith (2010), which penalizes high- khm 34 31 28
cost outputs more heavily, but our approach is even lav 55 58 49
simpler—we merely supplement the loss with ad- mlt_latn 19 19 18
ditional penalties for vowels and diacritics. We rum 10 14 10
fine-tuned the vowel and diacritic penalties using a slv 49 56 47
grid search on the development data, incrementing wel_sw 10 13 12
each by 0.1, from 0 to 0.5. In the cases of ties, we Average 25.1 27.1 24.1
skewed higher as the penalties generally worked
better at higher values. The final values used to Table 3: Comparison of test-set results based on the
generate predictions for the test data are listed in word error rates (WERs)
Table 2. We also note that the vowel penalty had
significantly more impact than the diacritic penalty. It seems that extra syllable information does not
help with predictions in this particular setting. It
Penalty might be the case that additional syllable bound-
aries increase input variability without providing
Language Vowel Diacritic much useful information with the current neural-
ady 0.5 0.3 transducer architecture. Alternatively, information
gre 0.3 0.2 about syllable boundary locations might be redun-
ice 0.3 0.3 dant for this set of languages. Finally, it is possible
ita 0.5 0.5 that the unsupervised nature of our syllable anno-
khm 0.2 0.4 tation was too noisy to aid the model. We leave
lav 0.5 0.5 these speculations as research questions for future
mlt_latn 0.2 0.2 endeavors and restrict the subsequent error analy-
rum 0.5 0.2 ses and discussion to the results from our vowel-
slv 0.4 0.4 penalty system.5
wel_sw 0.4 0.5 5
One reviewer raised a question of why only syllable
boundaries, as opposed to smaller constituents, such as onsets
Table 2: Vowel penalty and diacritic penalty values in or codas, are marked. Our hunch is that many phonological al-
the final models ternations happen at syllable boundaries, and that vowel length
in some languages depends on whether the nucleus vowel is
in a closed or open syllable. Also, given that adding syllable

135
ady gre ice ita khm lav mlt_latn rum slv wel_sw

80

60
Count

40

20

0
base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP base Syl VP
Systems

Error types C-V, V-C C-C, C-ϵ, ϵ-C V-V, V-ϵ, ϵ-V

Figure 2: Distributions of error types in test-set predictions across languages. Error types are distinguished based
on whether an error involves only consonants, only vowels, or both. For example, C-V means that the error is
caused by a ground-truth consonant being replaced by a vowel in the prediction. C-ǫ means that it is a deletion
error where the ground-truth consonant is missing in the prediction while ǫ-C represents an insertion error where a
consonant is wrongly added.

6 Error Analyses (e.g., [P] → ǫ), and substitutions (e.g., [d] →


[t]).
In this section, we provide detailed error analyses
on the test-set predictions from our best system. • Those involving exchanges of a vowel and a
The goals of these analyses are twofold: (i) to ex- consonant (e.g., [w] → [u]) or vice versa.
amine the aspects in which this model outperforms
the baseline and to what extent, and (ii) to get a The frequency of each error type made by the
better understanding of the nature of errors made baseline model and our systems for each individ-
by the system—we believe that insights and im- ual language is plotted in Figure 2. Some patterns
provements can be derived from a good grasp of are immediately clear. First, both systems have a
error patterns. similar pattern in terms of the distribution of error
We analyzed the mismatches between predicted types across language, albeit that ours makes fewer
sequences and ground-truth sequences at the seg- errors on average. Second, both systems err on
mental level. For this purpose, we again utilized different elements, depending on the language. For
many-to-many alignment (Jiampojamarn et al., instance, while Adyghe (ady) and Khmer (khm)
2007; Jiampojamarn and Kondrak, 2010), but this have a more balanced distribution between conso-
time between a predicted sequence and the corre- nant and vowel errors, Slovene (slv) and Welsh
sponding ground-truth sequence.6 For each error (wel_sw) are dominated by vowel errors. Third,
along the aligned sequence, we classified it into the improvements gained in our system seem to
one of the three kinds: come mostly from reduction in vowel errors, as is
evident in the case of Khmer, Latvian (lav), and,
• Those involving erroneous vowel insertions
to a lesser extent, Slovene.
(e.g., ǫ → [@]), deletions (e.g., [@] → ǫ), or
The final observation is backed up if we zoom
substitutions (e.g., [@] → [a]).
in on the errors in these three languages, which
• In the same vein, those involving erroneous we visualize in Figure 3. Many incorrect vowels
consonant insertions (e.g., ǫ → [P]), deletions generated by the baseline model are now correctly
predicted. We note that there are also cases, though
boundaries does not improve the results, it is unlikely that less common, where the baseline model gives the
marking constituent boundaries, which adds more variability
to the input, will result in better performance, though we did right prediction, but ours does not. It should be
not test this hypothesis. pointed out that, although our system shows im-
6
The parameters used are: allowing deletion of input provement over the baseline, there is still plenty
grapheme strings, maximum length of aligned grapheme and
phoneme substring being one, and a training threshold of of room for improvement in many languages, and
0.001. our system still produces incorrect vowels in many

136
Khmer vowels Latvian vowels Slovene vowels
ϵ
ϵ ū ɔ́ː
ù
i
u ɔ
ô
o óː
ɛː j
īː
îː o
ə
ī
Error types
Vowels

Vowels

Vowels
î ɛ̀ː
eː i base wrong
ɛ̄ː ə
ɛ̀ː ours wrong
e
ɛ̄ èː
ɛ
ɑː æ
āː éː
âː
ɑ
ā àː
â
aː à áː
a
baseline ground-truth ours (VP) baseline ground-truth ours (VP) baseline ground-truth ours (VP)
Systems Systems Systems

Figure 3: Comparison of vowels predicted by the baseline model and our best system (VP) with the ground-truth
vowels. Here we only visualize the cases where either the baseline model gives the right vowel but our system does
not, or vice versa. We do not include cases where both the baseline model and our system predict the correct vowel,
or both predict an incorrect vowel, to avoid cluttering the view. Each baseline—ground-truth—ours line represents
a set of aligned vowels in the same word; the horizontal line segment between a system and the ground-truth means
that the prediction from the system agrees with the ground-truth. Color hues are used to distinguish cases where
the prediction from the baseline is correct versus those where the prediction from our second system is correct.
Shaded areas on the plots enclose vowels of similar vowel quality.

instances. els are therefore largely suprasegmental—vowel


Finally, we look at several languages which length and pitch accent, both of which are lexical-
still resulted in high WER on the test set—ady, ized and not explicitly marked in the orthography.
gre, ita, khm, lav, and slv. We analyze For the other three languages, their errors also show
the confusion matrix analysis to identify clusters distinct patterns: for Adyghe, consonants differing
of commonly-confused phonemes. This analysis only in secondary features can get confused; in
again relies on the alignment between the ground- Greek, many errors can be attributed to the mixing
truth sequence and the corresponding predicted of [r] and [R]; in Italian, front and back mid vowels
sequence to characterize error distributions. The can trick our model.
results from this analysis are shown in Figure 4, We hope that our detailed error analyses show
and some interesting patterns are discussed below. not only that these errors “make linguistic sense”—
Figure 2 suggests that Khmer has an equal share of and therefore attest to the power of the model—
consonant and vowel errors, and the heat maps in but also point out a pathway along which future
Figure 4 reveal that these errors do not seem to fol- modeling can be improved.
low a certain pattern. However, a different picture
emerges with Latvian and Slovene. For both lan- 7 Conclusion
guages, Figure 2 indicates the dominance of errors
tied to vowels; consonant errors account for a rela- This paper presented the approaches adopted by
tively small proportion of errors. This observation the UBC Linguistics team to tackle the SIGMOR-
is borne out in Figure 4, with the consonant heat PHON 2021 Grapheme-to-Phoneme Conversion
maps for the two languages displaying a clear diag- challenge in the low-resource setting. Our submis-
onal stripe, and the vowel heat maps showing much sions build upon the baseline model with modifi-
more off-diagonal signals. What is more interest- cations inspired by syllable structure and vowel
ing is that the vowel errors in fact form clusters, error patterns. While the first modification does
as highlighted by white squares on the heat maps. not result in more accurate predictions, the second
The general pattern is that confusion only arises modification does lead to sizable improvements
within a cluster where vowels are of similar quality over the baseline results. Subsequent error analy-
but differ in terms of length or pitch accent. For ses reveal that the modified model indeed reduces
example, while [i:] might be incorrectly-predicted erroneous vowel predictions for languages whose
as [i], our model does not confuse it with, say, [u]. errors are dominated by vowel mismatches. Our
The challenges these languages present to the mod- approaches also demonstrate that patterns uncov-

137
Proportion
0.00 0.25 0.50 0.75 1.00

Khmer consonants Latvian consonants Slovene consonants


ɓ b b
c c
d
cʰ d
f
ɗ f
ɡ
f ɡ
h j j
j ɟ k
k k
l
kʰ l
Predicted consonants

Predicted consonants

Predicted consonants
m
l ʎ
n
m m
n n p
ɲ ɲ r
ŋ ŋ
s
p p
ʃ
pʰ r
t
r ɾ
s s t͡s
t ʃ t͡ʃ
tʰ t
ʋ
ʋ v
x
w w
z
z z
ʔ ʒ ʒ

ɓ c cʰ ɗ f h j k kʰ l m n ɲ ŋ p pʰ r s t tʰ ʋ w z ʔ b c d f ɡ j ɟ k l ʎ m n ɲ ŋ p r ɾ s ʃ t v w z ʒ b d f ɡ j k l m n p r s ʃ t t͡s t͡ʃ ʋ x z ʒ
Ground-truth consonants Ground-truth consonants Ground-truth consonants

Khmer vowels Latvian vowels Slovene vowels


a a
a
à
aː â á
ā áː
ɑ

ɑː

àː àː a
e
âː
āː a éː
æ èː
ĕ ǣ
ə
æ̀ː
eː ǣː ə́
e
ə ē ɛ
ɛ ɛ́
əː
Predicted vowels

Predicted vowels

Predicted vowels
ɛ̀

e, ə, ɛ
ɛ̂ ɛ́ː
ɛː
ɛ̄ ɛ̀ː
ɛ̀ː

æ, e, ɛ
i
ɛ̂ː i
iː ɛ̄ː íː
ɨ
i
ì ìː i
o ī óː
ŏ
îː
īː i òː
o
oː ɔ
ô
ō ɔ́
ɔ o
o, ɔ

ɔ́ː
ɔː u
ù ɔ̀ː
u û
ū u
ŭ ùː úː

ûː
ūː u ùː u
a aː ɑ ɑː e ĕ eː ə əː ɛː i iː ɨ o ŏ oː ɔ ɔː u ŭ uː a à â ā aː àː âː āː æ ǣ æ̀ːǣː e ē ɛ ɛ̀ ɛ̂ ɛ̄ ɛ̀ː ɛ̂ː ɛ̄ː i ì ī îː īː o ô ō oː u ù û ū ùː ûː ūː a á áː àː éː èː ə ə́ ɛ ɛ́ ɛ́ː ɛ̀ː i íː ìː óː òː ɔ ɔ́ ɔ́ː ɔ̀ː u úː ùː
Ground-truth vowels Ground-truth vowels Ground-truth vowels

Adyghe consonants Greek consonants Italian vowels


b
ɕ b
d
d͡z c
d͡ʒ a
f ç
ɡʲ
ɡʷ

e, ɛ
d
ɣ
ħ ð
j

kʲʼ f e

kʼ ɡ
l
ɬ ɣ
ɬʼ
ɮ ʝ
m ɛ
Predicted consonants

Predicted consonants

n k
p
Predicted vowels

pʷʼ l

q
qʷ ʎ
r
ʁ m i
ʁʷ
s n
ʂ

o, ɔ
ʂʷ ɲ
ʃ
ʃʷ
r, ɾ
ʃʷʼ ŋ
ʃʼ o
t p
t͡s
t͡sʼ r
t͡ʂ
t͡ʃ ɾ
t͡ʃʼ
tʼ s ɔ
w
x t
z
ʐ
ʐʷ v
ʑ
ʒ x
ʒʷ u
ʔ z
ʔʷ
χ θ
χʷ
b ɕ dd͡zd͡ʒ f ɡʲɡʷɣ ħ j kʲkʲʼkʷkʼ l ɬ ɬʼ ɮmn ppʷʼpʼ qqʷr ʁʁʷs ʂʂʷ ʃ ʃʷʃʷʼʃʼ t t͡st͡sʼt͡ʂ t͡ʃ t͡ʃʼ tʼ w x z ʐʐʷʑ ʒʒʷʔʔʷχχʷ b c ç d ð f ɡ ɣ ʝ k l ʎ m n ɲ ŋ p r ɾ s t v x z θ a e ɛ i o ɔ u
Ground-truth consonants Ground-truth consonants Ground-truth vowels

Figure 4: Confusion matrices of vowel and consonant predictions by our second system (VP) for languages with the
test WER > 20%. Each row represents a predicted segment, with colors across columns indicating the proportion
of times the predicted segment matches individual ground-truth segments. A gray row means the segment in
question is absent in any predicted phoneme sequences but is present in at least one ground-truth sequence. The
diagonal elements represent the number of times for which the predicted segment matches the target segment,
while off-diagonal elements are those that are mis-predicted by the system. White squares are added to highlight
segment groups where mismatches are common.

138
ered from careful error analyses can inform the Peter Makarov and Simon Clematide. 2018c. UZH at
directions for potential improvements. CoNLL-SIGMORPHON 2018 shared task on uni-
versal morphological reinflection. In Proceedings of
the CoNLL-SIGMORPHON 2018 Shared Task: Uni-
versal Morphological Reinflection, pages 69–75.
References
Peter Makarov and Simon Clematide. 2020. CLUZH
Maximilian Bisani and Hermann Ney. 2008. Joint- at SIGMORPHON 2020 shared task on multilin-
sequence models for grapheme-to-phoneme conver- gual grapheme-to-phoneme conversion. In Proceed-
sion. Speech Communication, 50:434–451. ings of the Seventeenth SIGMORPHON Workshop
on Computational Research in Phonetics, Phonol-
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. ogy, and Morphology, pages 171–176.
Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Soci- Josef R. Novak, Nobuaki Minematsu, and Keikichi
ety. Series B (Methodological), 39(1):1–38. Hirose. 2012. WFST-based grapheme-to-phoneme
conversion: Open source tools for alignment, model-
Omnia ElSaadany and Benjamin Suter. 2020. building and decoding. In Proceedings of the 10th
Grapheme-to-phoneme conversion with a mul- International Workshop on Finite State Methods and
tilingual transformer model. In Proceedings of Natural Language Processing, pages 45–49.
the Seventeenth SIGMORPHON Workshop on
Computational Research in Phonetics, Phonology, Josef Robert Novak, Nobuaki Minematsu, and Keikichi
and Morphology, pages 85–89. Hirose. 2015. Phonetisaurus: Exploring garpheme-
to-phoneme conversion with joint n-gram models in
Daan van Esch, Mason Chua, and Kanishka Rao. 2016. the WFST framework. Natural Language Engineer-
Predicting pronunciations with syllabification and ing, 22(6):907–938.
stress with recurrent neural networks. In Proceed-
Ben Peters and André F. T. Martins. 2020. DeepSPIN
ings of Interspeech 2016, pages 2841–2845.
at SIGMORPHON 2020: One-size-fits-all multilin-
gual models. In Proceedings of the Seventeenth SIG-
Kevin Gimpel and Noah A. Smith. 2010. Softmax-
MORPHON Workshop on Computational Research
margin CRFs: Training log-linear models with cost
in Phonetics, Phonology, and Morphology, pages
functions. In Human Language Technologies: The
63–69.
2010 Annual Conference of the North American
Chapter of the ACL, pages 733–736. Kanishka Rao, Fuchun Peng, Haşim Sak, and
Françoise Beaufays. 2015. Grapheme-to-phoneme
Sittichai Jiampojamarn and Grzegorz Kondrak. 2010. conversion using long short-term memory recurrent
Letter-phoneme alignment: An exploration. In Pro- neural networks. In IEEE International Confer-
ceedings of the 48th Annual Meeting of the Associa- ence on Acoustics, Speech and Signal Processing
tion for Computational Linguistics, pages 780–788. (ICASSP), pages 4225–4229.
Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Eric Sven Ristad and Peter N. Yianilos. 1998. Learning
Sherif. 2007. Applying many-to-many alignments string-edit distance. IEEE Transactions on Pattern
and Hidden Markov Models to letter-to-phoneme Analysis and Machine Intelligence, 20(5):522–532.
conversion. In Proceedings of NAACL HLT 2007,
pages 372–379. Shubham Toshniwal and Karen Livescu. 2016. Jointly
learning to align and convert graphemes to
Jackson L. Lee, Lucas F. E. Ashby, M. Elizabeth Garza, phonemes with neural attention models. In
Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. 2016 IEEE Spoken Language Technology Workshop
McCarthy, and Kyle Gorman. 2020. Massively mul- (SLT), pages 76–82.
tilingual pronunciation mining with WikiPron. In
Rebecca Treiman. 1994. To what extent do ortho-
Proceedings of the 12th Conference on Language Re-
graphic units in print mirror phonological units in
sources and Evaluation (LREC 2020), pages 4223–
speech? Journal of Psycholinguistic Research,
4228.
23(1):91–110.
Peter Makarov and Simon Clematide. 2018a. Imita- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
tion learning for neural morphological string trans- Uszkoreit, Llion Jones, Aidan N. Gomez, Łukaaz
duction. In Proceedings of the 2018 Conference on Kaiser, and Illia Polosukhin. 2017. Attention is all
Empirical Methods in Natural Language Processing, you need. In Proceedings of the 31st Conference
pages 2877–2882. on Neural Information Processing Systems (NIPS
2017), pages 1–11.
Peter Makarov and Simon Clematide. 2018b. Neu-
ral transition-based string transduction for limited- Kaili Vesik, Muhammad Abdul-Mageed, and Miikka
resource setting in morphology. In Proceedings of Silfverberg. 2020. One model to pronounce them
the 27th International Conference on Computational all: Multilingual grapheme-to-phoneme conversion
Linguistics, pages 83–93. with a Transformer ensemble. In Proceedings of the

139
Seventeenth SIGMORPHON Workshop on Computa-
tional Research in Phonetics, Phonology, and Mor-
phology, pages 146–152.
Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-
to-sequence neural net models for grapheme-to-
phoneme conversion. In Proceedings of Interspeech
2015, pages 3330–3334.
Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-
Tóth. 2019. Transformer based grapheme-to-
phoneme conversion. In Proceedings of Interspeech
2019, pages 2095–2099.

140
Avengers, Ensemble! Benefits of ensembling in
grapheme-to-phoneme prediction
Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Fred Mailhot∗,
Shreekantha Nadig† , Riqiang Wang, Nathan Zhang

Dialpad Canada † Dialpad India


vasundhara,wangyau.li,zafar,fred.mailhot
shree,riqiang.wang,[email protected]

Abstract along with discussion of the challenges posed


We describe three baseline beating sys- by the data that was provided.
tems for the high-resource English-only
sub-task of the SIGMORPHON 2021 2 Sub-task 1: high-resource,
Shared Task 1: a small ensemble that English-only
Dialpad’s1 speech recognition team uses
internally, a well-known off-the-shelf The organizers provided 41,680 lines of data
model, and a larger ensemble model in total; 33,344 for training, and 4,168 each
comprising these and others. We addition- for development and test. The data consists
ally discuss the challenges related to the
of word/pronunciation pairs (word-pron pairs,
provided data, along with the processing
steps we took. henceforth), where words are sequences of
graphemes and pronunciations are sequences
1 Introduction of characters from the International Phonetic
Alphabet (International Phonetic Association,
The transduction of sequences of graphemes 1999). The data was derived from the English
to phones or phonemes,2 that is from charac- portion of the WikiPron database (Lee et al.,
ters used in orthographic representations to 2020), a massively multilingual resource of
characters used to represent minimal units of word-pron pairs extracted from Wiktionary4
speech, is a core component of many tasks in and subject to some manual QA and post-
speech science & technology. This grapheme- processing.5
to-phoneme conversion (or g2p) may be used,
The baseline model provided was the 2nd
e.g., to automate or scale the creation of
place finisher from the 2020 g2p shared task
digital lexicons or pronunciation dictionaries,
(Gorman et al., 2020). It is an ensembled neu-
which are crucial to FST-based approaches to
ral transition model that operates over edit
automatic speech recognition (ASR) and syn-
actions and is trained via imitation learning
thesis (Mohri et al., 2002).
(Makarov and Clematide, 2020).
The SIGMORPHON 2021 Workshop in-
Evaluation scripts were provided to com-
cluded a Shared Task on g2p conversion, com-
pute word error rate (WER), the percentage of
prising 3 sub-tasks.3 The low- and medium-
words for which the output sequence does not
resource tasks were multilingual, while the
match the gold label.
high-resource task was English-only. This
Notwithstanding the baseline’s strong prior
paper provides an overview of the three
performance and the amount of data avail-
baseline-beating systems submitted by the Di-
able, the task proved to be challenging; the
alpad team for the high-resource sub-task,
baseline system achieved development and
Corresponding

author. Contributing authors are test set WERs of 45.13 and 41.94, respec-
listed alphabetically.
1
https://www.dialpad.com/ tively. We discuss possible reasons for this
2
We use these terms interchangeably here to refer below.
to graphical representations of minimal speech sounds,
4
remaining agnostic as to their theoretical or ontological https://en.wiktionary.org/
status. 5
See https://github.com/sigmorphon/2021-task1
3
https://github.com/sigmorphon/2021-task1 for fuller details on data formatting and processing.

141
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 141–147
August 5, 2021. ©2021 Association for Computational Linguistics
2.1 Data-related challenges of issues where a phone was transcribed with
a Unicode symbol not used in the IPA at all.
Wiktionary is an open, collaborative, public
effort to create a free dictionary in multiple Most of these were cases where the rare
languages. Anyone can create an account and variant was at least two orders of magnitude
add or amend words, pronunciations, etymo- less frequent than the common variant of the
logical information, etc. As with most user- symbol. There was, however, one class of
generated content, this is a noisy method of sounds where the variation was less dramat-
data creation and annotation. ically skewed; the consonants /m/, /n/, and
/l/ appeared in unstressed syllables follow-
Even setting aside the theory-laden ques-
ing schwa (/əm/, /ən/, /əl/) roughly one or-
tion of when or whether a given word should
der of magnitude more frequently than their
be counted as English,6 the open nature of
syllabic counterparts (/m̩ /, /n̩/, /l ̩/), and we
Wiktionary means that speakers of different
opted not to normalize these. If we had nor-
variants or dialects of English may submit
malized the syllabic variants, it would have
varying or conflicting pronunciations for sets
resulted in more consistent g2p output but it
of words. For example, some transcriptions
would likely also have penalized our perfor-
indicate that the users who input them had
mance on the uncleaned test set.7 In the end,
the cot/caught merger while others do not; in
our training data contained 47 phones (plus
the training data “cot” is transcribed /k ɑ t/
end-of-sequence and UNK symbols for some
and “caught” is transcribed /k ɔ t/, indicat-
models).
ing a split, but “aughts” is transcribed as /ɑ t
s/, indicating merger. There is also variation
3 Models
in the narrowness of transcription. For exam-
ple, some transcriptions include aspiration on We trained and evaluated several models for
stressed-syllable-initial stops while others do this task, both publicly available, in-house,
not c.f. “kill” /kʰ ɪ l/ and “killer” /k ɪ l ɚ/. and custom developed, along with various en-
Typically the set of English phonemes is sembling permutations. In the end, we sub-
taken to be somewhere between 38-45 de- mitted three sets of baseline beating results.
pending on variant/dialect (McMahon, 2002). The organizers assigned sequential identifiers
In exploring the training data, we found a to- to multiple submissions (e.g. Dialpad-N); we
tal of 124 symbols in the training set transcrip- include these in the discussion of our entries
tions, many of which only appeared in a small below, for ease of subsequent reference.
set (1–5) of transcriptions. To reduce the ef-
fect of this long tail of infrequent symbols, we 3.1 The Dialpad model (Dialpad-2)
normalized the training set.
Dialpad uses a g2p system internally for scal-
The main source of symbols in the long
able generation of novel lexicon additions.
tail was the variation in the broadness of
We were motivated to enter this shared task
transcription—vowels were sometimes but
as a means of assessing potential areas of im-
not always transcribed with nasalization be-
provement for our system; in order to do so
fore a nasal consonant, aspiration on word-
we needed to assess our own performance as
initial voiceless stops was inconsistently indi-
a baseline.
cated, phonetic length was occasionally indi-
This model is a simple majority-vote ensem-
cated, etc. There were also some cases of er-
ble of 3 existing publicly available g2p sys-
roneous transcription that we uncovered by
tems: Phonetisaurus (Novak et al., 2012), a
looking at the lowest frequency phones and
WFST-based model, Sequitur (Bisani and Ney,
the word-pronunciation pairs where they ap-
2008), a joint-sequence model trained via EM,
peared. For instance, the IPA /j/ was tran-
and a neural sequence-to-sequence model de-
scribed as /y/ twice, the voiced alveolar ap-
veloped at CMU as part of the CMUSphinx8
proximant /ɹ/ was mistranscribed as the trill
/r/ over 200 times, and we found a handful 7
Although the possibility also exists that one or more
of our models would have found and exploited contex-
6
E.g., the training data included the arguably French tual cues that weren’t obvious to us by inspection.
8
word-pronunciation pair: embonpoint /ɑ̃ b ɔ̃ p w ɛ̃/ https://cmusphinx.github.io

142
toolkit (see subsection 3.2). As Dialpad uses The method of ensembling for this model is
a proprietary lexicon and phoneset internally, word level majority-vote ensembling. We se-
we retrained all three models on the cleaned lect the most common prediction when there
version of the shared task training data, re- is a majority prediction (i.e. one prediction
taining default hyperparameters and architec- has more votes than all of the others). If there
tures. is a tie, we pick the prediction that was gen-
In the end, this ensemble achieved a test set erated by the best standalone model with re-
WER of 41.72, narrowly beating the baseline spect to each model’s performance on the de-
(results are discussed in more depth in Section velopment set.
4). This collection of models achieved a test set
WER of 37.43, a 10.75% relative reduction in
3.2 A strong standalone model: WER over the baseline model. As shown in
CMUSphinx g2p-seq2seq (Dialpad-3) Table 1, although a majority of the compo-
CMUSphinx is a set of open systems and nent models did not outperform the baseline,
tools for speech science developed at Carnegie there was sufficient agreement across differ-
Mellon University, including a g2p system.9 ent examples that a simple majority voting
It is a neural sequence-to-sequence model scheme was able to leverage the models’ vary-
(Sutskever et al., 2014) that is Transformer- ing strengths effectively. We discuss the com-
based (Vaswani et al., 2017), written in Ten- ponents and their individual performance be-
sorflow (Abadi et al., 2015). A pre-trained 3- low and in Section 4.
layer model is available for download, but it is
trained on a dictionary that uses ARPABET, a 3.3.1 Baseline variations
substantially different phoneset from the IPA
The “foundation” of our ensemble was the de-
used in this challenge. For this reason we re-
fault baseline model (Makarov and Clematide,
trained a model from scratch on the cleaned
2018), which we trained using the raw data
version of the training data.
and default settings in order to reflect the
This model achieved a test set WER of
baseline performance published by the orga-
41.58, again narrowly beating the baseline.
nization. We included this in order to individ-
Interestingly, this outperformed the Dialpad
ually assess the effect of additional models on
model which incorporates it, suggesting that
overall performance.
Phonetisaurus and Sequitur add more noise
than signal to predicted outputs, to say noth- In addition to this default base, we added
ing of increased computational resources and a larger version of the same model, for which
training time. More generally, this points to we increased the number of encoder and de-
the CMUSphinx seq2seq model as a simple coder layers from 1 to 3, and increased the
and strong baseline against which future g2p hidden dimensions 200 to 400.
research should be assessed.
3.3.2 biLSTM+attention seq2seq
3.3 A large ensemble (Dialpad-1) We conducted experiments with a RNN
In the interest of seeing what results could be seq2seq model, comprising a biLSTM encoder,
achieved via further naive ensembling, our fi- LSTM decoder, and dot-product attention.10
nal submission was a large ensemble, compris- We conducted several rounds of hyperparam-
ing two variations on the baseline model, the eter optimization over layer sizes, optimizer,
Dialpad-2 ensemble discussed above, and two and learning rate. Although none of these
additional seq2seq models, one using LSTMs models outperformed the baseline, a small
and the other Transformer-based. The latter network (16-d embeddings, 128-d LSTM lay-
additionally incorporated a sub-word extrac- ers) proved to be efficiently trainable (2 CPU-
tion method designed to bias a model’s input- hours) and improved the ensemble results, so
output mapping toward “good” grapheme- it was included.
phoneme correspondences.
10
We used the DyNet toolkit (Neubig et al., 2017) for
9
https://github.com/cmusphinx/g2p-seq2seq these experiments.

143
3.3.3 PAS2P: Pronunciation-assisted model has 6 layers of encoder and decoder
sub-words to phonemes with 2048 units, and 4 attention heads with
Sub-word segmentation is widely used in ASR 256 units. We use dropout with a probability
and neural machine translation tasks, as it of 0.1 and label smoothing with a weight
reduces the cardinality of the search space of 0.1 to regularize the model. This model
over word-based models, and mitigates the is- achieved WERs of 44.84 and 43.40 on the
sue of OOVs. Use of sub-words for g2p tasks development and test sets, respectively.
has been explore, e.g. Reddy and Goldsmith
4 Results
(2010) develop an MDL-based approach to ex-
tracting sub-word units for the task of g2p. Our main results are shown in Table 1, where
Recently, a pronunciation-assisted sub-word we show both dev and test set WER for each
model (PASM) (Xu et al., 2019) was shown individual model in addition to the submit-
to improve the performance of ASR models. ted ensembles. In particular, we can see that
We experimented with pronunciation-assisted many of the ensemble components do not beat
sub-words to phonemes (PAS2P), leveraging the baseline WER, but nonetheless serve to im-
the training data and a reparameterization of prove the ensembled models.
the IBM Model 2 aligner (Brown et al., 1993)
dubbed fast_align (Dyer et al., 2013).11 Model dev test
The alignment model is used to find an Dialpad-3 43.30 41.58
alignment of sequences of graphemeres to PAS2P 44.84 43.40
their corresponding phonemes. We follow a Baseline (large) 44.99 41.65
similar process as Xu et al. (2019) to find Baseline (organizer) 45.13 41.94
the consistent grapheme-phoneme pairs and Phonetisaurus 45.44 43.88
refinement of the pairs for the PASM model. Baseline (raw data) 45.92 41.70
We also collect grapheme sequence statistics Sequitur 46.69 43.86
and marginalize it by summing up the counts biLSTM seq2seq 47.89 44.05
of each type of grapheme sequence over all Dialpad-2 43.83 41.72
possible types of phoneme sequences. These Dialpad-1 40.12 37.43
counts are the weights of each sub-word se-
Table 1: Results for components of ensembles,
quence. and submitted models/ensembles (bolded).
Given a word and the weights for each
sub-word, the segmentation process is a
search problem over all possible sub-word 5 Additional experiments
segmentation of that word. We solve this
We experimented with different ensembles
search problem by building weighted FSTs12
and found that incorporating models with dif-
of a given word and the sub-word vocabu-
ferent architectures generally improves over-
lary, and finding the best path through this
all performance. In the standalone results,
lattice. For example, the word “thought-
only the top three models beat the base-
fulness” would be segmented by PASM as
line WER, but adding additional models with
“th_ough_t_f_u_l_n_e_ss”, and this would be
higher WER than the baseline continues to re-
used as the input in the PAS2P model
duce overall WER. Table 2 shows the effect
rather than the full sequence of individual
of this progressive ensembling, from our top-
graphemes.
3 models to our top-7 (i.e. the ensemble for
Finally, the PAS2P transducer is a
the Dialpad-1 model).
Transformer-based sequence-to-sequence
model trained using the ESPnet end-to-end 5.1 Edit distance-based voting
speech processing toolkit (Watanabe et al.,
In addition to varying our ensemble sizes and
2018), with pronunciation-assisted sub-
components, we investigated a different en-
words as inputs and phones as outputs. The
semble voting scheme, in which ties are bro-
11
https://github.com/clab/fast_align ken using edit distance when there is no 1-
12
We use Pynini (Gorman, 2016) for this. best majority option. That is, in the event of

144
Model dev test These represent massive performance im-
Ensemble-top3 41.10 39.71 provements (approx. 15% absolute, or 37%
Ensemble-top4 40.74 38.89 relative, WER reduction), and suggest refine-
Ensemble-top5 40.50 38.12 ment of our output selection/voting method
Ensemble-top6 40.31 37.69 (perhaps via some kind of confidence weight-
Ensemble-top7 (Dialpad-1) 40.12 37.43 ing) could lead to much-improved results.

Table 2: Progressive ensembling results, with top- 6.2 Data-related errors


performing components
We also investigated outputs for which none
of our component models predicted the cor-
a tie, instead of selecting the prediction made rect pronunciation, in hopes of finding some
by the best standalone model (our usual tie- patterns of interest.
breaking method), we select the prediction Many of the training data-related issues
that minimizes edit distance to all other pre- raised in section 2.1 appeared in the dev and
dictions that have the same number of votes. test labels as well. In some cases this led to
The idea of this method is to maximize sub- high cross-component agreement, even on in-
word level agreement. Although this method correct predictions. Our hope that subtle con-
did not show clear improvements on the de- textual cues might reveal patterns in the distri-
velopment set, we found after submission that bution of syllabic versus schwa-following liq-
it narrowly but consistently outperformed the uids and nasals was not borne out, e.g. our en-
top-N ensembles on the test set (see Table 3). semble was led astray on words like “warble”,
which had a labelled pronunciation of /w ɔ ɹ
Model dev test b l ̩/, while all 7 of our models predicted /w ɔ
ED-Dialpad-3 43.76 41.70 ɹ b ə l/, a functionally non-distinct pronuncia-
ED-top3 41.24 39.40 tion. In addition, the previously mentioned is-
ED-top4 40.62 38.48 sue of /ɹ/ being mistranscribed as /r/ affected
ED-top5 40.50 37.69 our performance, e.g. with the word “unilat-
ED-top6 40.28 37.50 eral”, whose labelled pronunciation was /j u
ED-top7 40.21 37.31 n ɪ l æ t ə r ə l/, instead of /j u n ɪ l æ t ə ɹ ə l/,
which was again the pronunciation predicted
Table 3: Results for ensembling with edit-distance
tie-breaking by all 7 models. Finally, narrowness of tran-
scription was also an issue that affected our
performance on the dev and test sets, e.g., for
6 Error analysis words like “cloudy” /k ɫ a ʊ d i/ and “cry” /k
ɹ a ɪ ̯/, for which we predicted /k l a ʊ d i/ and
We conducted some basic analyses of the /k ɹ a ɪ/, respectively. In the end, it seems
Dialpad-1 submission’s patterns of errors, to that noisiness in the data was a major source
better understand its performance and iden- of errors for our submissions.14
tify potential areas of improvement.13 Aside from issues arising due label noise,
6.1 Oracle WER our systems also made some genuine errors
that are typical of g2p models, mostly related
We began by calculating the oracle WER, i.e. to data distribution or sparsity. For example,
the theoretical best WER that the ensemble our component models overwhelmingly pre-
could have achieved if it had selected the cor- dicted that “irreparate” (/ɪ ɹ ɛ p ə ɹ ə t/) should
rect/gold prediction every time it was present rhyme instead with “rate” (this “-ate-” /e ɪ t/
in the pool of component model predictions correspondence was overwhelmingly present
for a given input. The Dialpad-1 system’s ora- in the training data), that “backache” (/b æ
cle WERs on the dev and test sets were 25.12 k e ɪ k/) must contain the affricate /t͡ʃ/, that
and 23.27, respectively (c.f. 40.12 and 37.43
actual). 14
We nonetheless acknowledge the magnitude and
challenge of the task of cleaning/normalizing a large
13
We are grateful to an anonymous reviewer for sug- quantity of user-generated data, and thank the organiz-
gesting that this would strengthen the paper. ers for the work that they did in this area.

145
“acres” (e ɪ k ɚ z/) rhymes with “degrees”, and Manjunath Kudlur, Josh Levenberg, Dan Mané,
that “beret” has a /t/ sound in it. In each of Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Mike Schuster, Jonathon Shlens,
these cases, there was either not enough sam-
Benoit Steiner, Ilya Sutskever, Kunal Talwar,
ples in the training set to reliably learn the Paul Tucker, Vincent Vanhoucke, Vijay Vasude-
relevant grapheme-phoneme correspondence, van, Fernanda Viégas, Oriol Vinyals, Pete War-
or else a conflicting (but correct) correspon- den, Martin Wattenberg, Martin Wicke, Yuan
dence was over-represented in the training Yu, and Xiaoqiang Zheng. 2015. Tensor-
Flow: Large-scale machine learning on hetero-
data. geneous systems. Software available from ten-
sorflow.org.
7 Conclusion
Maximilian Bisani and Hermann Ney. 2008. Joint-
We presented and discussed three g2p sys- sequence models for grapheme-to-phoneme
tems submitted for the SIGMORPHON2021 conversion. Speech Communication, 50(5):434–
English-only shared sub-task. In addition 451.
to finding a strong off-the-shelf contender, Peter F. Brown, Stephen A. Della Pietra, Vincent J.
we show that naive ensembling remains a Della Pietra, and Robert L. Mercer. 1993. The
strong strategy in supervised learning, of mathematics of statistical machine translation:
Parameter estimation. Computational Linguistics,
which g2p is a sub-domain, and that sim- 19(2):263–311.
ple majority-voting schemes in classification
can often leverage the respective strengths Chris Dyer, Victor Chahuneau, and Noah A. Smith.
of sub-optimal component models, especially 2013. A simple, fast, and effective reparam-
eterization of IBM model 2. In Proceedings of
when diverse architectures are combined. Ad- the 2013 Conference of the North American Chap-
ditionally, we provided more evidence for ter of the Association for Computational Linguis-
the usefulness of linguistically-informed sub- tics: Human Language Technologies, pages 644–
word modeling as an input transformation on 648, Atlanta, Georgia. Association for Compu-
tational Linguistics.
speech-related tasks.
We also discussed additional experiments Kyle Gorman. 2016. Pynini: A python library
whose results were not submitted, indicating for weighted finite-state grammar compilation.
the benefit of exploring top-N model vs en- In Proceedings of the SIGFSM Workshop on Sta-
tistical NLP and Weighted Automata, pages 75–
semble trade-offs, and demonstrating the po- 80, Berlin, Germany. Association for Computa-
tential benefit of an edit-distance based tie- tional Linguistics.
breaking method for ensemble voting.
Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta,
Future work includes further search for
Arya McCarthy, Shijie Wu, and Daniel You.
the optimal trade-off between ensemble size 2020. The SIGMORPHON 2020 shared task
and performance, as well as additional explo- on multilingual grapheme-to-phoneme conver-
ration of the edit-distance voting scheme, and sion. In Proceedings of the 17th SIGMORPHON
more sophisticated ensembling/voting meth- Workshop on Computational Research in Phonet-
ics, Phonology, and Morphology, pages 40–50,
ods, e.g. majority voting at the phone level Online. Association for Computational Linguis-
on aligned outputs. tics.

Acknowledgments International Phonetic Association. 1999. Hand-


book of the International Phonetic Association:
We are grateful to Dialpad Inc. for provid- A guide to the use of the International Phonetic
Alphabet. Cambridge University Press, Cam-
ing the resources, both temporal and compu-
bridge, U.K.
tational, to work on this project.
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth
Garza, Yeonju Lee-Sikka, Sean Miller, Arya
References D. McCarthy Alan Wong, and Kyle Gorman.
2020. Massively multilingual pronunciation
Martín Abadi, Ashish Agarwal, Paul Barham, Eu- mining with wikipron. In Proceedings of the
gene Brevdo, Zhifeng Chen, Craig Citro, Greg S. 12th Language Resources and Evaluation Confer-
Corrado, Andy Davis, Jeffrey Dean, Matthieu ence, pages 4216––4221, Marseille.
Devin, Sanjay Ghemawat, Ian Goodfellow, An-
drew Harp, Geoffrey Irving, Michael Isard, Peter Makarov and Simon Clematide. 2018. Imi-
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, tation learning for neural morphological string

146
transduction. In Proceedings of the 2018 Confer- Shinji Watanabe, Takaaki Hori, Shigeki Karita,
ence on Empirical Methods in Natural Language Tomoki Hayashi, Jiro Nishitoba, Yuya Unno,
Processing, pages 2877–2882, Brussels, Belgium. Nelson Enrique Yalta Soplin, Jahn Heymann,
Association for Computational Linguistics. Matthew Wiesner, Nanxin Chen, Adithya Ren-
duchintala, and Tsubasa Ochiai. 2018. Espnet:
Peter Makarov and Simon Clematide. 2020. End-to-end speech processing toolkit. In Proc.
CLUZH at SIGMORPHON 2020 shared task on Interspeech 2018, pages 2207–2211.
multilingual grapheme-to-phoneme conversion.
In Proceedings of the 17th SIGMORPHON Work- Hainan Xu, Shuoyang Ding, and Shinji Watanabe.
shop on Computational Research in Phonetics, 2019. Improving end-to-end speech recogni-
Phonology, and Morphology, pages 171–176, On- tion with pronunciation-assisted sub-word mod-
line. Association for Computational Linguistics. eling. ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Pro-
April McMahon. 2002. An Introduction to English cessing (ICASSP), pages 7110–7114.
Phonology. Edinburgh University Press, Edin-
burgh, U.K.

Mehryar Mohri, Fernando Pereira, and Michael


Riley. 2002. Weighted finite-state transducers
in speech recognition. Computer Speech & Lan-
guage, 16(1):69–88.

Graham Neubig, Chris Dyer, Yoav Goldberg,


Austin Matthews, Waleed Ammar, Antonios
Anastasopoulos, Miguel Ballesteros, David Chi-
ang, Daniel Clothiaux, Trevor Cohn, Kevin
Duh, Manaal Faruqui, Cynthia Gan, Dan Gar-
rette, Yangfeng Ji, Lingpeng Kong, Adhiguna
Kuncoro, Gaurav Kumar, Chaitanya Malaviya,
Paul Michel, Yusuke Oda, Matthew Richard-
son, Naomi Saphra, Swabha Swayamdipta, and
Pengcheng Yin. 2017. Dynet: The dynamic neu-
ral network toolkit.

Josef R. Novak, Nobuaki Minematsu, and Kei-


kichi Hirose. 2012. WFST-based grapheme-to-
phoneme conversion: Open source tools for
alignment, model-building and decoding. In
Proceedings of the 10th International Workshop on
Finite State Methods and Natural Language Pro-
cessing, pages 45–49, Donostia–San Sebastián.
Association for Computational Linguistics.

Sravana Reddy and John Goldsmith. 2010. An


MDL-based approach to extracting subword
units for grapheme-to-phoneme conversion. In
Human Language Technologies: The 2010 Annual
Conference of the North American Chapter of the
Association for Computational Linguistics, pages
713–716, Los Angeles, California. Association
for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le.


2014. Sequence to sequence learning with neu-
ral networks. In Advances in Neural Informa-
tion Processing Systems, volume 27. Curran As-
sociates, Inc.

Ashish Vaswani, Noam Shazeer, Niki Parmar,


Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Ł ukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in Neural
Information Processing Systems, volume 30. Cur-
ran Associates, Inc.

147
CLUZH at SIGMORPHON 2021 Shared Task on Multilingual
Grapheme-to-Phoneme Conversion: Variations on a Baseline

Simon Clematide and Peter Makarov


Department of Computational Linguistics
University of Zurich, Switzerland
[email protected] [email protected]

Abstract Lang. Grapheme Phoneme Wiktionary


ice persóna pʰ ɛ r̥ s o uː n a /ˈpʰɛr̥. souːna/
This paper describes the submission by the fra williams wiljamz /wi.ljamz/
team from the Department of Computational bul засичайки z ɐ s t͡ʃ ʃ ə j kʲ i /zɐˈsitʃəjkʲi/
Linguistics, Zurich University, to the Mul- kor 검출 k ɘː m t͡ɕʰ u ɭ [ˈkʌ̹(ː)mt͡ɕʰuɭ]
tilingual Grapheme-to-Phoneme Conversion
(G2P) Task 1 of the SIGMORPHON 2021 Figure 1: Examples of the original G2P shared task data
challenge in the low and medium settings. The from four different languages and their pronunciation
submission is a variation of our 2020 G2P entries in Wiktionary.
system, which serves as the baseline for this
year’s challenge. The system is a neural trans-
ducer that operates over explicit edit actions well as contour information. See Figure 1 for the
and is trained with imitation learning. For this post-processed shared task entries and the original
challenge, we experimented with the follow-
entries from the Wiktionary pronunciation section.
ing changes: a) emitting phoneme segments in-
stead of single character phonemes, b) input
For more information, we refer the reader to the
character dropout, c) a mogrifier LSTM de- shared task overview paper (Ashby et al., 2021).
coder (Melis et al., 2019), d) enriching the de- In the low and medium data setting, the 2021
coder input with the currently attended input SIGMORPHON multilingual G2P challenge fea-
character, e) parallel BiLSTM encoders, and tures ten different languages from various phyloge-
f) an adaptive batch size scheduler. In the low netic families and written in different scripts. The
setting, our best ensemble improved over the
low setting comes with 800 training, 100 develop-
baseline, however, in the medium setting, the
baseline was stronger on average, although for ment and 100 test examples. In the medium setting,
certain languages improvements could be ob- the data splits are 10 times larger. Although it is
served. permitted to use external resources for the medium
setting, all our models used exclusively the official
1 Introduction training material.
Our system is a neural transducer with pointer
The SIGMORPHON Grapheme-to-Phoneme Con-
network-like monotonic hard attention (Aharoni
version task consists of mapping a sequence of
and Goldberg, 2017) that operates over explicit
characters in some language into a sequence of
character edit actions and is trained with imita-
whitespace delimited International Phonetic Alpha-
tion learning (Daumé III et al., 2009; Ross et al.,
bet (IPA) symbols, which represent the pronunci-
2011; Chang et al., 2015). It is an adaptation of
ation of this input character sequence (not neces-
our type-level morphological inflection generation
sarily a phonemic transcription, despite the name
system that proved its data efficiency and perfor-
of the task) according to the language-specific con-
mance in the SIGMORPHON 2018 shared task
ventions used in the English Wiktionary.1 The data
(Makarov and Clematide, 2018). G2P shares many
was collected and post-processed by the WikiPron
similarities with traditional morphological string
project (Lee et al., 2020). Post-processing removes
transduction: The changes are mostly local and of-
stress and syllable markers and applies IPA seg-
ten simple depending on how close the spelling
mentation for combining and modifier diacritics as
of a language reflects pronunciation. For most lan-
1
https://en.wiktionary.org/ guages, a substantial part of the work is actually

148
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 148–153
August 5, 2021. ©2021 Association for Computational Linguistics
Σ :  / p(DEL(Σ))  : Ω / p(INS(Ω)) The imitation learning algorithm relies on an
expert policy for suggesting intuitive and appro-
p(#) priate character substitution, insertion and deletion
actions. For instance, for the data sample кит 7→
/kj it/ (Russian: “whale”), we would like the fol-
Σ : Ω / p(SUB(Σ, Ω)) lowing most natural edit sequence to attain the low-
est cost: SUBS[k], INS[j ], SUBS[i], SUBS[t]. The cost
Figure 2: Stochastic edit distance (Ristad and Yianilos, function for these actions is estimated by fitting
1998): A memoryless probabilistic FST. Σ and Ω stand a Stochastic Edit Distance (SED) model (Ristad
for any input and output symbol, respectively. Transi- and Yianilos, 1998) on the training data, which
tion weights are to the right of the slash and p(#) is
is a memoryless weighted finite-state transducer
the final weight.
shown in Figure 2. The resulting SED model is
integrated into the expert policy, the SED policy,
applying character-by-character substitutions. An that uses Viterbi decoding to compute optimal edit
extreme case is Georgian, which features an al- action sequences for any point in the action search
most deterministic one-to-one mapping between space: Given a transducer configuration of partially
graphemes and IPA segments that can be learned processed input, find the best edit actions to gen-
almost perfectly from little training data.2 erate the remaining target sequence suffix. Dur-
The main goal of our submission was to test ing training, an aggressive exploration schedule
1
whether our last year’s system, which is the base- psampling (i) = 1+exp(i) where i is the training
line for this year’s G2P challenge, already exhausts epoch number, exposes the model to configurations
the potential of its architecture, or whether changes sampled by executing edit actions from the model.
to the output representation (IPA segments vs. IPA For an extended description of the SED policy and
Unicode codepoints; input character dropout), to IL training, we refer the reader to the last year’s
the LSTM decoder (the mogrifier steps and the system description paper (Makarov and Clematide,
additional input of the attended character), to the 2020).
BiLSTM encoder (parallel encoders), or to other
hyper-parameter settings (adaptive batch size) can 2.1 Changes to the baseline model
improve the results without replacing the LSTM- This section describes the changes that we imple-
based encoder/decoder setup by a Transformer- mented in our submissions.
based architecture (see e.g. Wu et al. (2021) for IPA segments vs. IPA Unicode characters:
Transformer-based state-of-the-art results). Emitting IPA segments in one action (includ-
ing its whitespace delimiter), e.g., for the Rus-
2 Model description sian example from above SUBS[kj •],3 instead
of producing the same output by three actions
The model defines a conditional distribution
SUBS [k], INS[j ], INS[•] reduces the number of ac-
over substitution, insertion and deletion edits
Q|a| tion predictions (and potential errors) considerably,
pθ (a | x) = j=1 pθ (aj | a<j , x), where x =
which is beneficial. On the other hand, this might
x1 . . . x|x| is an input sequence of graphemes and
lead to larger action vocabularies and sparse train-
a = a1 . . . a|a| is an edit action sequence. The
ing distributions. Therefore, we experimented with
output sequence of IPA symbols y is determin-
character (CHAR) and IPA segment (SEG) edit ac-
istically computed from x and a. The model is
tions in our submission. Table 1 shows statistics
equipped with an LSTM decoder and a bidirec-
on the resulting vocabulary sizes if CHAR or SEG
tional LSTM encoder (Graves and Schmidhuber,
actions are used. Some caution is needed though
2005). At each decoding step j, the model attends
because some segments might only appear once in
to a single grapheme xi . The attention steps mono-
the training data, e.g. English has an IPA segment
tonically through the input sequence, steered by the
s:: that only appears in the word “psst”.
edits that consume input (e.g. a deletion shifts the
Input character dropout: To prevent the model
attention to the next grapheme xi+1 ).
from memorizing the training set and to force it to
2
Even a reduced training set of only 100 items allows a learn about syllable contexts, we randomly replace
single model to achieve over 90% accuracy on the Georgian
3
test set. • denotes whitespace symbol.

149
S Language NFD< SEG C NFC C NFD and IPA characters.
L ady 0.5% 67 37 37 Enriching the decoder input with the cur-
L gre 4.3% 33 33 33 rently attended input character: The auto-
L ice 30.3% 60 36 36 regressive decoder of the baseline system uses the
L ita 0.8% 32 29 29 LSTM decoder output of the previous time step
L khm 0.5% 47 36 34 and the BiLSTM encoded representation of the
L lav 12.4% 73 51 36 currently attended input character as input. Intu-
L mlt latn 9.0% 41 29 29 itively, by feeding the input character embedding
L rum 0.3% 45 31 31 directly into the decoder (as a kind of skip con-
L slv 4.3% 48 38 30 nection), we want to liberate the BiLSTM encoder
L wel sw 2.4% 43 37 37 from transporting the hard attention information
M arm e 0.0% 54 31 31 to the decoder, thereby motivating the sequence
M bul 3.5% 46 34 34 encoder to focus more on the contextualization of
M dut 0.8% 49 39 39 the input character.
M fre 0.1% 39 36 36 Multiple parallel BiLSTM encoders: Convo-
M geo 0.0% 33 27 27 lutional encoders typically use many convolutional
M hbs latn 3.7% 63 43 33 filters for representation learning and Transformer
M hun 42.5% 66 37 37 encoders similarly feature multi-head attention. Us-
M jpn hira 36.1% 64 42 39 ing several LSTM encoders in parallel has been
M kor 99.8% 60 46 46 proposed by Zhu et al. (2017) for language model-
M vie hanoi 88.2% 49 44 44 ing and translation and was e.g. also successfully
H eng us 0.0% 124 83 80 used for named entity recognition (Žukov-Gregorič
Average 16.2% 54.1 39.0 37.0 et al., 2018). Technically, the same input is fed
though several smaller LSTMs, each with its own
Table 1: Statistics on Unicode normalization for low
parameter set, and then their output is concatenated
(L), medium (M), and high (H) settings (column S).
Column NFD< specifies the percentage of training for each time step. The idea behind parallel LSTM
items where NFD normalized graphemes had smaller encoders is to provide a more robust ensemble-style
length difference to phonemes than in NFC normal- encoding with lower variance between models. For
ization. Column SEG gives the vocabulary size of IPA our submission, there was not enough time to sys-
segments (the counts are the same for NFC and NFD). tematically tune the input and hidden state sizes as
Column CNFC reports the phoneme vocabulary size in well as the number of parallel LSTMs.
NFC Unicode characters (CHAR) and CNFD in NFD.
Adaptive batch size scheduler: We combine
the ideas of “Don’t Decay the Learning Rate, In-
an input character with the UNK symbol according crease the Batch Size” (Smith et al., 2017) and
to a linearly decaying schedule.4 cyclical learning schedules by dynamically enlarg-
Mogrifier LSTM decoder: Mogrifier LSTMs ing or reducing the batch size according to develop-
(Melis et al., 2019) iteratively and mutually up- ment set accuracy: Starting with a defined minimal
date the hidden state of a previous time step with batch size m threshold, the batch size for the next
the current input before feeding the modified hid- epoch is set to bm − 0.5c if the development set
den state and input into a standard LSTM cell. On performance improved, or bm + 0.5c otherwise.5
language modeling tasks with smaller corpora, this If a predefined maximum batch size is reached,
technique closed the gap between LSTM and Trans- the batch size is reset in one step to the minimum
former models. We apply a standard mogrifier with threshold. The motivation for the reset comes from
5 rounds of updates in our experiments. We expect empirical observations that going back to a small
the mogrifier decoder to profit from IPA segmen- batch size can help overcome local optima. With
tation because in this setup the decoder mogrifies larger training sets, we subsample the training sets
neighboring IPA phoneme segments and not space per epoch randomly in order to have a more dy-
namic behavior.6
4
For all experiments, we start with a probability of 50%
5
for UNKing a character in a word and reduce this rate over 10 See also the recent discussion on learning rates and batch
epochs to a minimal probability of 1%. Light experimentation sizes by Wu et al. (2021).
6
on a few languages led to this cautious setting, which might The subsample size is set to 3,000 items per epoch in all
leave room for further improvement. our experiments.

150
2.2 Unicode normalization 50(L/M)
For some writing systems, e.g. for Korean or Viet- • LSTM encoder hidden state dimension: 200
namese, applying Unicode NFD normalization to (B), 300 (L/M) divided by 6 parallel encoders.
the input has a great impact on the input sequence We submit 3 ensemble runs for the low setting:
length and consequently on the G2P character cor- CLUZH-1: 15 models with CHAR input,
respondences. The decomposition of diacritics and CLUZH-2: 15 models with SEG input,
other composing characters for all languages, as CLUZH-3: 30 models with CHAR or SEG input.
performed in the baseline, has the disadvantage of We submit 4 ensemble runs for the medium setting:
longer input sequences. We apply a simple heuris- CLUZH-4: 5 models with CHAR input,
tic to decide on NFD normalization based on a CLUZH-5: 10 models with SEG input,
criterion for the minimum length distance between CLUZH-6: 5 models with SEG input,
graphemes and phonemes: If more than 50% of the CLUZH-7: 15 models with CHAR or SEG input.
training grapheme sequences in NFD normalization Due to a configuration error, medium results were
have a smaller length difference compared to the actually computed without two add-ons: mogrifier
phoneme sequence than their corresponding NFC LSTMs and the additional input character. In post-
variants, then NFD normalization is applied. See submission experiments, we computed runs that
Table 1 for statistics, which indicate a preference enabled these features and report their results as
for NFD for only 2 languages. well (CLUZH-4m/5m).
3 Submission details 4 Results and discussion
Modifications such as mogrifier LSTMs, additional Table 2 shows a comparison of results for the low
input character skip connections, or parallel en- setting. We report the development and test set
coders increase the number of model parameters average word error rate (WER) performance to
and make it difficult to compare the baseline system illustrate the sometimes dramatic differences be-
directly with its variants. Additionally, we did not tween these sets (e.g. Greek). Both runs containing
have enough time before the submission to system- CHAR action emitting models (CLUZH-1, CLUZH-
atically explore and fine-tune for the best combina- 3) have second best results (the best system reaches
tion of model modifications and hyper-parameters. 24.1). The SEG models with IPA segmentation ac-
In the end, after some light experimentation we had tions excel on some languages (Adyghe, Latvian),
to stick to settings that might not be optimal. but fail badly on Slovene and Maltese. Only for
We train separate models for each language on Romanian and Italian, we see an improvement for
the official training data and use the development the 30-strong mixed ensemble. The expectation
set exclusively for model selection. As beam de- that the size difference between the SEG and CHAR
coding for mogrifier models sometimes suffered vocabulary correlates with language-specific per-
compared to greedy decoding, we built all ensem- formance differences cannot be confirmed given
bles from greedy model prediction. Like the base- the numbers in Table 1. E.g. Latvian features 73
line system (B), we train the SED model for 10 different IPA segments but only 51 IPA characters,
epochs, use one-layer LSTMs, hidden state dimen- still, the SEG variant shows only 49% WER.
sion 200 for the decoder LSTMs and action embed- Table 3 shows a comparison of results for the
ding dimension 100. For the low (L) and medium medium setting. We report selected development
(M) setting, we have the following specific hyper- and test set average performance to illustrate that
parameters: also in this larger setting, the expectation of a
• patience: 12 (B), 24 (L), 18 (M) slightly higher development set performance does
• maximal epochs: 60 (B), 80 (L/M) not always hold (e.g. Korean or Japanese). On the
• minimal batch size:7 3 (L), 5 (M) other hand, Bulgarian and Dutch have a sharp in-
• maximal batch size: 10 (L/M) crease in errors on the test set compared to the de-
• character embedding dimension:8 100 (B), velopment set. The comparison between runs with
7
The baseline system’s batch size is 5. the mogrifier LSTM decoder and the attended char-
8
The motivation for lowering the character embedding acter input (CLUZH-Nm) or without (C-N) suggest
size comes from adding the input character to the mogrifier
decoder LSTM, which increases the parameter size for each that these changes are not beneficial. In the medium
of the 5 update weight matrices. setting, C-4 (CHAR) and C-6 (SEG) can be directly

151
CLUZH-1 (CHAR) CLUZH-2 (SEG) C-3 OUR BASELINE BSL Other
AVERAGE E AVERAGE E E AVERAGE E E
LNG dev test sd test dev test sd test test dev test sd test test test
ady 25.0 27.8 3.3 24 25.6 26.2 1.8 22 22 26 25.2 2.8 21 22 22
gre 6.5 22.2 2.3 20 5.1 22.8 2.8 22 20 5 26.0 3.3 25 21 21
ice 14.8 12.4 2.4 10 16.1 14.5 2.2 12 10 21 15.8 2.1 12 12 11
ita 24.5 27.0 2.2 23 24.4 26.3 3.2 24 21 25 22.7 3.5 19 19 20
khm 39.8 38.2 3.4 32 40.3 36.9 2.2 33 32 39 40.4 2.5 34 34 28
lav 47.2 53.7 2.8 53 46.9 55.3 3.7 49 49 44 56.5 2.2 54 55 49
mlt 17.0 18.0 2.4 12 19.7 21.2 2.9 16 14 23 21.8 5.1 17 19 18
rum 11.1 13.7 1.8 13 10.3 14.1 1.0 13 12 11 12.5 2.1 10 10 10
slv 46.4 56.4 2.7 50 48 60.2 3.4 59 55 44 54.2 2.1 51 49 47
wel 18.0 14.9 3.5 10 15.6 15.7 1.8 13 12 19 14.8 2.0 12 10 12
AVG 25.0 28.4 2.7 24.7 25.2 29.3 2.5 26.3 24.7 25.7 29.0 2.8 25.5 25.1 23.8

Table 2: Overview of the dev and test results in the low setting. C-3 is CLUZH-3 ensemble. OUR BASELINE
shows the results for our own run of the baseline configuration. They are different from the official baseline results
(BSL) due to different random seeds. Column sd always reports the test set standard deviation. E means ensemble
results.

C-4 CLUZH-4m (CHAR) C-5 CLUZH-5m (SEG) C-5l C-6 C-7 OUR BASELINE BSL
E5 AVERAGE E5 E10 AVERAGE E10 E10 E5 E15 AVERAGE E10 E10
LNG test dev test sd test test dev test sd test test test test dev test sd test test
arm 7.1 5.4 7.9 0.7 6.4 6.6 5.1 7.2 0.5 6.2 7.1 6.6 6.4 5.8 7.8 0.7 6.5 7.0
bul 20.1 12.2 20.4 2.0 19.9 19.2 11.9 23.3 2.1 22.4 16.2 18.8 19.7 12.5 19.7 1.7 19.3 18.3
dut 15.0 13.1 18.3 1.2 14.8 14.9 12.4 16.8 0.6 14.6 14.5 15.6 14.7 13.1 17.7 1.3 14.3 14.7
fre 7.5 8.4 9.7 0.6 8.2 7.5 8.5 9.5 0.7 8.1 8.1 7.5 7.6 8.9 9.1 0.5 7.8 8.5
geo .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0 .0
hbs 38.4 43.2 44.5 1.1 39.1 35.6 42.4 44.3 1.5 36.8 35.7 37.0 35.3 39.1 38.9 1.2 33.6 32.1
hun 1.5 1.8 1.8 0.1 1.6 1.2 1.7 1.5 0.3 1.0 0.9 1.0 1.0 1.7 2.0 0.3 1.8 1.8
jpn 5.9 6.9 6.8 0.2 5.5 5.3 6.8 6.5 0.3 5.4 5.2 5.5 5.0 6.8 6.4 0.5 5.5 5.2
kor 16.2 21.3 18.6 0.7 17.4 16.9 19.6 18.3 0.8 16.2 16.1 17.2 16.3 20.4 18.9 0.8 16.5 16.3
vie 2.3 1.2 2.4 0.1 2.3 2.0 1.2 2.1 0.1 2.1 2.2 2.1 2.0 1.4 2.5 0.2 2.4 2.5
AVG 11.4 11.4 13.0 0.7 11.5 10.9 11.0 12.9 0.7 11.3 10.6 11.1 10.8 11.0 12.3 0.7 10.8 10.6

Table 3: Overview of the development and test results in the medium setting. C-N is CLUZH-N ensemble. CLUZH-
Nm runs use the mogrifier decoder and additional input character in decoder (these are post-submisson runs). C-5l
uses larger parameterization and reaches WER 10.60 (BSL: 10.64). OUR BASELINE shows the results for our
own run of the baseline configuration. Boldface indicates best performance in official shared task runs; underline
marks the best performance in post-submission configurations. Column sd always reports the test set standard
deviation. En means n-strong ensemble results.

compared because they feature the same ensem- ble. It achieves an impressive low word error rate
ble size: The results suggest that IPA segmentation of 38.7 compared to the official baseline (41.94)
(SEG) for higher resource settings (and the specific and the best other submission (37.43).
medium languages) seems to be slightly better than Future work: Performance variance between
CHAR . C-5l is a post-submission run with a larger different runs of our LSTM-based architecture
parametrization.9 This post-submission ensemble makes it difficult to reliably assess the actual useful-
outperforms the baseline system by a small mar- ness of the small architectural changes; extensive
gin, but still struggles with Serbo-Croatian (hbs) experimentation, e.g. in the spirit of Reimers and
compared to the official baseline results. Gurevych (2017), is needed for that. One should
In a post-submission experiment on the high set- also investigate the impact of the official data set
ting, we built a large10 5-strong SEG-based ensem- splits: The observed differences between the de-
9 velopment set and test set performance in the low
Three parallel encoders with 200 hidden units each; char-
acter embedding dimension of 200; no mogrifier; no input sion 100; decoder hidden state dimension: 500; minimal batch
character added to the decoder. size: 5; maximal batch size: 20; epochs: 200 (subsampled to
10
Character embedding dimension: 200; action embedding 3,000 items); patience: 24; no mogrifier; no input character
dimension: 100; 10 parallel encoders with hidden state dimen- added to the decoder.

152
setting for Slovene or Greek are extreme. Cross- Alex Graves and Jürgen Schmidhuber. 2005. Frame-
validation experiments might help assess the true wise phoneme classification with bidirectional
LSTM and other neural network architectures. Neu-
difficulty of the WikiPron datasets.
ral Networks, 18(5).
5 Conclusion Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza,
Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D.
This paper presents the approach taken by the McCarthy, and Kyle Gorman. 2020. Massively mul-
CLUZH team to solving the SIGMORPHON 2021 tilingual pronunciation modeling with WikiPron. In
Multilingual Grapheme-to-Phoneme Conversion LREC.
challenge. Our submission for the low and medium Peter Makarov and Simon Clematide. 2018. Imitation
settings is based on our successful SIGMORPHON learning for neural morphological string transduc-
2020 system, which is a majority-vote ensemble tion. In EMNLP.
of neural transducers trained with imitation learn- Peter Makarov and Simon Clematide. 2020. CLUZH
ing. We add several modifications to the existing at SIGMORPHON 2020 shared task on multilingual
LSTM architecture and experiment with IPA seg- grapheme-to-phoneme conversion. In Proceedings
ment vs. IPA character action predictions. For the of the 17th SIGMORPHON Workshop on Computa-
tional Research in Phonetics, Phonology, and Mor-
low setting languages, our IPA character-based run phology.
outperforms the baseline and ranks second overall.
The average performance of segment-based action Gábor Melis, Tomás Kociský, and Phil Blunsom. 2019.
edits suffers from performance outliers for certain Mogrifier LSTM. CoRR, abs/1909.01792.
languages. For the medium setting languages, we Nils Reimers and Iryna Gurevych. 2017. Reporting
note small improvements on some languages, but score distributions makes a difference: Performance
the overall performance is lower than the baseline. study of LSTM-networks for sequence tagging. In
EMNLP.
Using a mogrifier LSTM decoder and enriching
the encoder input with the currently attended in- Eric Sven Ristad and Peter N Yianilos. 1998. Learning
put character did not improve performance in the string-edit distance. IEEE Transactions on Pattern
medium setting. Post-submission experiments sug- Analysis and Machine Intelligence, 20(5).
gest that network capacity for the submitted sys- Stephane Ross, Geoffrey Gordon, and Drew Bagnell.
tems was too small. A post-submission run for the 2011. A reduction of imitation learning and struc-
high-setting shows considerable improvement over tured prediction to no-regret online learning. In
PMLR, volume 15 of Proceedings of Machine Learn-
the baseline. ing Research.
Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V.
References Le. 2017. Don’t decay the learning rate, increase the
batch size. CoRR, abs/1711.00489.
Roee Aharoni and Yoav Goldberg. 2017. Morphologi-
cal inflection generation with hard monotonic atten- Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021.
tion. In ACL. Applying the transformer to character-level transduc-
tion. In EACL.
Lucas F.E Ashby, Travis M. Bartley, Simon Clematide,
Luca Del Signore, Cameron Gibson, Kyle Gorman, Danhao Zhu, Si Shen, Xin-Yu Dai, and Jiajun Chen.
Yeonju Lee-Sikka, Peter Makarov, Aidan Malanoski, 2017. Going wider: Recurrent neural network with
Sean Miller, Omar Ortiz, Reuben Raff, Arundhati parallel cells. CoRR, abs/1705.01346.
Sengupta, Bora Seo, Yulia Spektor, and Winnie Yan.
2021. Results of the Second SIGMORPHON 2021 Andrej Žukov-Gregorič, Yoram Bachrach, and Sam
Shared Task on Multilingual Grapheme-to-Phoneme Coope. 2018. Named entity recognition with par-
Conversion. In Proceedings of 18th SIGMORPHON allel recurrent neural networks. In ACL.
Workshop on Computational Research in Phonetics,
Phonology, and Morphology.

Kai-Wei Chang, Akshay Krishnamurthy, Hal Daumé,


III, and John Langford. 2015. Learning to search
better than your teacher. In PMLR, volume 37 of
Proceedings of Machine Learning Research.

Hal Daumé III, John Langford, and Daniel Marcu.


2009. Search-based structured prediction. Machine
learning, 75(3).

153
SIGMORPHON 2021 Shared Task on Morphological Reinflection:
Generalization Across Languages
Tiago PimentelQ∗ Maria Ryskinaì∗ Sabrina MielkeZ Shijie WuZ Eleanor Chodroff7
Brian LeonardB Garrett Nicolaiá Yustinus Ghanggo AteÆ Salam Khalifaè Nizar Habashè
Charbel El-KhaissiS Omer Goldmanń Michael GasserI William LaneR Matt Colerå
Arturo Oncevayď Jaime Rafael Montoya SamameH Gema Celeste Silva VillegasH
Adam Ekä Jean-Philipe Bernardyä Andrey Shcherbakov@ Aziyana Bayyr-oolü
Karina SheiferE,ż,Œ Sofya GanievaM,ż Matvey PlugaryovM,ż Elena KlyachkoE,ż Ali Salehiř
Andrew KrizhanovskyK Natalia KrizhanovskyK Clara Vania5 Sardana Ivanova1
Aelita Salchakù Christopher Straughnñ Zoey Liuť Jonathan North WashingtonF
Duygu Atamanæ Witold KieraśT Marcin WolińskiT Totok Suhardijantoþ Niklas StoehrD
Zahroh Nuriahþ Shyam RatanU Francis M. TyersI,E Edoardo M. Pontiø
Richard J. Hatcherř Emily Prud’hommeauxť Ritesh KumarU Mans HuldenX
Botond BartaA Dorina LakatosA Gábor SzolnokA Judit ÁcsA Mohit RajU
David YarowskyZ Ryan CotterellD Ben AmbridgeL,C Ekaterina Vylomova@
Q
University of Cambridge ì Carnegie Mellon University Z Johns Hopkins University
7
University of York B Brian Leonard Consulting á University of British Columbia
Æ
STKIP Weetebula è New York University Abu Dhabi S Australian National University
ń
Bar-Ilan University R Charles Darwin University å University of Groningen
I
Indiana University ď University of Edinburgh H Pontificia Universidad Católica del Perú
ä
University of Gothenburg @ University of Melbourne E Higher School of Economics
ü
Institute of Philology of the Siberian Branch of the Russian Academy of Sciences
ż
Institute of Linguistics, Russian Academy of Sciences M Moscow State University
Œ
Institute for System Programming, Russian Academy of Sciences
ř
University at Buffalo K Karelian Research Centre of the Russian Academy of Sciences
C
ESRC International Centre for Language and Communicative Development (LuCiD)
ñ
Northeastern Illinois University 1 University of Helsinki ù Tuvan State University
5
New York University ť Boston College F Swarthmore College æ University of Zürich
T
Institute of Computer Science, Polish Academy of Sciences þ Universitas Indonesia
U
Dr. Bhimrao Ambedkar University ø Mila/McGill University Montreal
D
ETH Zürich X University of Colorado Boulder L University of Liverpool
A
Budapest University of Technology and Economics
[email protected] [email protected] [email protected]
Abstract systems on the new data and conduct an ex-
tensive error analysis of the systems’ predic-
This year’s iteration of the SIGMORPHON tions. Transformer-based models generally
Shared Task on morphological reinflection demonstrate superior performance on the ma-
focuses on typological diversity and cross- jority of languages, achieving >90% accuracy
lingual variation of morphosyntactic features. on 65% of them. The languages on which sys-
In terms of the task, we enrich UniMorph tems yielded low accuracy are mainly under-
with new data for 32 languages from 13 resourced, with a limited amount of data. Most
language families, with most of them be- errors made by the systems are due to allo-
ing under-resourced: Kunwinjku, Classical morphy, honorificity, and form variation. In
Syriac, Arabic (Modern Standard, Egyptian, addition, we observe that systems especially
Gulf), Hebrew, Amharic, Aymara, Magahi, struggle to inflect multiword lemmas. The sys-
Braj, Kurdish (Central, Northern, Southern), tems also produce misspelled forms or end up
Polish, Karelian, Livvi, Ludic, Veps, Võro, in repetitive loops (e.g., RNN-based models).
Evenki, Xibe, Tuvan, Sakha, Turkish, In- Finally, we report a large drop in systems’ per-
donesian, Kodi, Seneca, Asháninka, Yanesha, formance on previously unseen lemmas.1
Chukchi, Itelmen, Eibela. We evaluate six
1
The data, systems, and their predictions are available:

The authors contributed equally https://github.com/sigmorphon/2021Task0

154
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 154–184
August 5, 2021. ©2021 Association for Computational Linguistics
1 Introduction of idiosyncratic properties present in them, which
makes cross-linguistic comparison challenging.
Chomsky (1995) noted that if a Martian anthropol- Haspelmath (2010) suggested a distinction be-
ogist were to visit our planet, all of our world’s tween descriptive categories (specific to languages)
languages would appear as a dialect of a sin- and comparative concepts. The idea was then re-
gle language, more specifically instances of what fined and further developed with respect to mor-
he calls a “universal grammar”. This idea—that phology and realized in the UniMorph schema
all languages have a large inventory of shared (Sylak-Glassman et al., 2015b). Morphosyntac-
sounds, vocabulary, syntactic structures with minor tic features (such as “the dative case” or “the past
variations—was especially common among cog- tense”) in the UniMorph occupy an intermediate po-
nitive scientists. It was based on highly biased sition between the descriptive categories and com-
ethnocentric empirical observations, resulting from parative concepts. The set of features was initially
the fact that a vast majority of cognitive scientists, established on the basis of analysis of typologi-
including linguists, focused only on the familiar cal literature, and refined with the addition of new
European languages. Moreover, as Daniel (2011) languages to the UniMorph database (Kirov et al.,
notes, many linguistic descriptive traditions of in- 2018; McCarthy et al., 2020). Since 2016, SIG-
dividual languages, even isolated ones such as Rus- MORPHON organized shared tasks on morpholog-
sian or German, heavily rely on cross-linguistic ical reinflection (Cotterell et al., 2016, 2017, 2018;
assumptions about the structure of human language McCarthy et al., 2019; Vylomova et al., 2020) that
that are often projected from Latin grammars. Sim- aimed at evaluating contemporary systems. Parallel
ilarly, despite making universalistic claims, genera- to that, they also served as a platform for enriching
tive linguists, for a very long time, have focused on the UniMorph database with new languages. For
a small number of the world’s major languages, typ- instance, the 2020 shared task (Vylomova et al.,
ically using English as their departure point. This 2020) featured 90 typologically diverse languages
could be partly attributed to the fact that generative derived from various linguistic resources.
grammar follows a deductive approach where the This year, we are bringing many under-resourced
observed data is conditioned on a general model. languages (languages of Peru, Russia, India, Aus-
However, as linguists explored more languages, tralia, Papua New Guinea) and dialects (e.g., for
descriptions and comparisons of more diverse kinds Arabic and Kurdish). The sample is highly diverse:
of languages began to come up, both within the it contains languages with templatic, concatenative
framework of generative syntax as well as that of (fusional and agglutinative) morphology. In addi-
linguistic typology. Greenberg (1963) presents one tion, we bring more polysynthetic languages such
of the earliest typologically informed description as Kunwinjku, Chukchi, Asháninka. Unlike previ-
of “language universals” based on an analysis of ous years, we pay more attention to the conversion
a relatively larger set of 30 languages, which in- of the morphosynactic features of these languages
cluded a substantial proportion of data from non- into the UniMorph schema. In addition, for most
European languages. Subsequently, typologists languages we conduct an extensive error analysis.
have claimed that it is essential to describe the lim-
its of cross-linguistic variation (Croft, 2002; Com- 2 Task Description
rie, 1989) rather than focus only on cross-linguistic
similarities. This is especially evident from Evans In this shared task, the participants were told to de-
and Levinson (2009), where the authors question sign a model that learns to generate morphological
the notion of “language universals”, i.e. the exis- inflections from both a lemma and a set of mor-
tence of a common pattern, or basis, shared across phosyntactic features of the target form. Specifi-
human languages. By looking at cross-linguistic cally, each language in the task had its own training,
work done by typologists and descriptive linguists, development, and test splits. The training and de-
they demonstrate that “diversity can be found at velopment splits contained triples, with a lemma,
almost every level of linguistic organization”: lan- a set of morphological features, and the target in-
guages vary greatly on phonological, morpholog- flected form, while test splits only provided lemmas
ical, semantic, and syntactic levels. This leads and morphological tags: the participants’ models
us to p-linguistics (Haspelmath, 2020), a study of needed to predict the missing target form—making
particular languages, including the whole variety this a standard supervised learning task.

155
The target of the task, however, was to analyse suffixing and would struggle to learn prefixing
how well the current state-of-the-art reinflection or circumfixation, and the degree of the bias only
models could generalise across a typologically di- becomes apparent during experimentation on other
verse set of languages. These models should, in languages whose inflectional morphology patterns
theory, be general enough to work for natural lan- differ. Further, the model architecture itself could
guages of any typological patterning.2 As such, we designed the task in three phases: a Development Phase, a Generalization Phase, and an Evaluation Phase. As the phases advanced, more data and more languages were released.

In the Development Phase, we provided training and development splits that should be used by participants to develop their systems. Model development, evaluation, and hyper-parameter tuning were, thus, mainly performed on these sets of languages. We will refer to these as the development languages.

In the Generalization Phase, we provided training and development splits for new languages where approximately half were genetically related (belonged to the same family) and half were genetically unrelated (either isolates or belonging to different families) to the development languages. These languages (and their families) were kept as a surprise throughout the first (development) phase and were only announced later on. As the participants were only given a few days with access to these languages before the submission deadline, we expected that the systems could not be radically improved to work on them—as such, these languages allowed us to evaluate the generalization capacity of the re-inflection models, and how well they performed on new, typologically unrelated languages.

Finally, in the Evaluation Phase, the participants' models were evaluated on held-out test forms from all of the languages of the previous phases. The languages from the Development Phase and the Generalization Phase were evaluated simultaneously. The only difference between the development and generalization languages was that participants had more time to construct their models for the languages released in the Development Phase. It follows that a model could easily favor or overfit to the phenomena that are more frequent in the languages presented in the Development Phase, especially if the parameters were shared across languages. For instance, a model based on the morphological patterning of Indo-European languages may end up with a bias towards, and may also explicitly or implicitly favor, certain word formation types (suffixing, prefixing, etc.).

2 For example, Tagalog verbs exhibit circumfixation; thus, a model with a strong inductive bias towards suffixing would likely not work well for Tagalog.

3 Description of the Languages

3.1 Gunwinyguan

The Gunwinyguan language family consists of Australian Aboriginal languages spoken in the Arnhem Land region of Australia's Northern Territory.

3.1.1 Gunwinggic: Kunwinjku

This data set contains one member of this family: a dialect of Bininj Kunwok called Kunwinjku. Kunwinjku is a polysynthetic language with mostly agglutinating verbal morphology. A typical verb might look like Aban-yawoith-warrgah-marne-ganj-ginje-ng '1/3PL-again-wrong-BEN-meat-cook-PP' ("I cooked the wrong meat for them again"). As shown, the form has several prefixes and suffixes attached to the stem. As in other Australian languages, long vowels are typically represented by double characters, and trills with "rr".3 According to Evans' (2003) analysis, the verb template contains 12 affix slots, which include two incorporated noun classes and derivational affixes such as the benefactive and comitative. The data included in this set are verbs extracted from the Kunwinjku translation of the Bible using the morphological analyzer from Lane and Bird (2019) and manually verified by human annotators.

3 More details: https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages.

3.2 Afro-Asiatic

The Afro-Asiatic language family is represented by the Semitic subgroup.

3.2.1 Semitic: Classical Syriac

Classical Syriac is a dialect of the Aramaic language and is attested as early as the 1st century CE. As with most Semitic languages, it displays non-concatenative morphology involving primarily tri-consonantal roots. Syriac nouns and adjectives are conventionally classified into three 'states'—Emphatic, Absolute, Construct—which loosely correlate with the syntactic features of definiteness, indeterminacy and the genitive.
Family Genus ISO 639-3 Language Source of Data Annotators
Development
Afro-Asiatic Semitic afb Gulf Arabic Khalifa et al. (2018) Salam Khalifa, Nizar Habash
Semitic amh Amharic Gasser (2011) Michael Gasser
Semitic ara Modern Standard Arabic Taji et al. (2018) Salam Khalifa, Nizar Habash
Semitic arz Egyptian Arabic Habash et al. (2012) Salam Khalifa, Nizar Habash
Semitic heb Hebrew (Vocalized) Wiktionary Omer Goldman
Semitic syc Classic Syriac SEDRA Charbel El-Khaissi
Arawakan Southern ame Yanesha Duff-Trip (1998) Arturo Oncevay, Gema Celeste Silva Vil-
Arawakan legas
Southern cni Asháninka Zumaeta Rojas and Zerdin Arturo Oncevay, Jaime Rafael Montoya
Arawakan (2018); Kindberg (1980) Samame
Austronesian Malayo- ind Indonesian KBBI, Wikipedia Clara Vania, Totok Suhardijanto, Zahroh
Polynesian Nuriah
Malayo- kod Kodi Ghanggo Ate (2021) Yustinus Ghanggo Ate, Garrett Nicolai
Polynesian
Aymaran Aymaran aym Aymara Coler (2014) Matt Coler, Eleanor Chodroff
Chukotko- Northern ckt Chukchi Chuklang; Tyers and Karina Sheifer, Maria Ryskina
Kamchatkan Chukotko- Mishchenkova (2020)
Kamchatkan
Southern itl Itelmen Karina Sheifer, Sofya Ganieva, Matvey
Chukotko- Plugaryov
Kamchatkan
Gunwinyguan Gunwinggic gup Kunwinjku Lane and Bird (2019) William Lane
Indo- Indic bra Braj Raw data from Kumar et al. Shyam Ratan, Ritesh Kumar
European (2018)
Slavic bul Bulgarian UniMorph (Kirov et al., 2018, Christo Kirov
Wiktionary)
Slavic ces Czech UniMorph (Kirov et al., 2018,
Wiktionary)
Iranian ckb Central Kurdish (Sorani) Alexina project Ali Salehi
Germanic deu German UniMorph (Kirov et al., 2018,
Wiktionary)
Iranian kmr Northern Kurdish (Kur- Alexina project
manji)
Indic mag Magahi Raw data from (Kumar et al., Mohit Raj, Ritesh Kumar
2014, 2018)
Germanic nld Dutch UniMorph (Kirov et al., 2018,
Wiktionary)
Slavic pol Polish Woliński et al. (2020); Witold Kieraś, Marcin Woliński
Woliński and Kieraś (2016)
Romance por Portuguese UniMorph (Kirov et al., 2018,
Wiktionary)
Slavic rus Russian UniMorph (Kirov et al., 2018, Ekaterina Vylomova
Wiktionary)
Romance spa Spanish UniMorph (Kirov et al., 2018,
Wiktionary)
Iranian sdh Southern Kurdish Fattah (2000, native speakers) Ali Salehi
Iroquoian Northern Iro- see Seneca Bardeau (2007) Richard J. Hatcher, Emily
quoian Prud’hommeaux, Zoey Liu
Trans–New Bosavi ail Eibela Aiton (2016b) Grant Aiton, Edoardo Maria Ponti, Eka-
Guinea terina Vylomova
Tungusic Tungusic evn Evenki Kazakevich and Klyachko Elena Klyachko
(2013)
Turkic Turkic sah Sakha Forcada et al. (2011, Apertium: Francis M. Tyers, Jonathan North Wash-
apertium-sah) ington, Sardana Ivanova, Christopher
Straughn, Maria Ryskina
Turkic tyv Tuvan Forcada et al. (2011, Apertium: Francis M. Tyers, Jonathan North
apertium-tyv) Washington, Aziyana Bayyr-ool, Aelita
Salchak, Maria Ryskina
Uralic Finnic krl Karelian Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Finnic lud Ludic Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Finnic olo Livvi Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Finnic vep Veps Zaytseva et al. (2017, VepKar) Andrew and Natalia Krizhanovsky
Generalization (Surprise)
Tungusic Tungusic sjo Xibe Zhou et al. (2020) Elena Klyachko
Turkic Turkic tur Turkish UniMorph (Kirov et al., 2018, Omer Goldman and Duygu Ataman
Wiktionary)
Uralic Finnic vro Võro Wiktionary Ekaterina Vylomova

Table 1: Development and surprise languages used in the shared task.

There are over 10 verbal paradigms that combine affixation slots with inflectional templates to reflect tense (past, present, future), person (first, second, third), number (singular, plural), gender (masculine, feminine, common), mood (imperative, infinitive), voice (active, passive), and derivational form (i.e., participles). Paradigmatic rules are determined by a range of linguistic factors, such as root type or phonological properties. The data included in this set was relatively small and consisted of 1,217 attested lexemes in the New Testament, which were extracted from Beth Mardutho: The Syriac Institute's lexical database, SEDRA.

3.2.2 Semitic: Arabic

Modern Standard Arabic (MSA, ara) is the primarily written form of Arabic, used in all official communication. In contrast, Arabic dialects are the primarily spoken varieties of Arabic, and the increasingly written varieties on unofficial social media platforms. Dialects have no official status despite being widely used. Both MSA and the dialects coexist in a state of diglossia (Ferguson, 1959), whether in spoken or written form. Arabic dialects vary amongst themselves and differ from MSA in most linguistic aspects (phonology, morphology, syntax, and lexical choice). In this work we provide inflection tables for MSA (ara), Egyptian Arabic (EGY, arz), and Gulf Arabic (GLF, afb). Egyptian Arabic is the variety of Arabic spoken in Egypt. Gulf Arabic refers to the dialects spoken by the indigenous populations of the member states of the Gulf Cooperation Council, especially the regions on the Arabian Gulf.

Similar to other Semitic languages, Arabic is a templatic language. A word consists of a templatic stem (root and pattern) and a number of affixes and clitics. Verb lemmas in Arabic inflect for person, gender, number, voice, mood, and aspect. Nominal lemmas inflect for gender, number, case, and state. Those features are realized through both the templatic patterns and the concatenative affixations. Arabic words also take on a number of clitics: attachable prepositions, conjunctions, determiners, and pronominal objects and possessives. In this work, we do not include clitics as a part of the paradigms, as they heavily increase the size of the paradigms. We made an exception for the Al determiner particle in order to be consistent with commonly used tokenizations for Arabic treebanks—the Penn Arabic Treebank (Maamouri et al., 2004) and Arabic Universal Dependencies (Taji et al., 2017). For MSA, the paradigms inflect for all the above-mentioned features, while for EGY and GLF they inflect for the above-mentioned features except for voice, mood, case, and state. We use the functional (grammatical) gender and number for MSA and GLF, but the form-based gender and number for EGY, since the resources we used did not have EGY functional gender and number (Alkuhlani and Habash, 2011).

We generated all the inflection tables from the morphological analysis databases using the generation component provided by CamelTools (Obeid et al., 2020). We extracted all the verb, noun, and adjective lemmas from a number of annotated corpora and selected those that are already in the morphological analysis databases. For MSA, we used the CALIMA-STAR database (Taji et al., 2018), based on the SAMA database (Maamouri et al., 2010), and the PATB (Maamouri et al., 2004) as the sources of lemmas. For EGY, we used the CALIMA-EGY database (Habash et al., 2012) and the ARZTB (Maamouri et al., 2012) as the sources of lemmas. For GLF, we used the Gulf verb analyzer (Khalifa et al., 2017) for verbs, and for both nouns and adjectives we extracted all the annotations from the Annotated Gumar Corpus (Khalifa et al., 2018).

3.2.3 Semitic: Hebrew

Like Syriac, Hebrew is a member of the Northwest Semitic branch, and, like Syriac and Arabic, it is written using an abjad in which vowels are only sparsely marked in unvocalized text. This entails that in unvocalized data the complex, ablaut-extensive, non-concatenative Semitic morphology is somewhat watered down, as the consonants of the root frequently appear consecutively with the alternating vowel unwritten. In this work we present data in vocalized Hebrew, in order to examine the models' ability to handle Hebrew's full-fledged Semitic morphological system.

Hebrew verbs belong to 7 major classes (Binyanim) with many subclasses depending on the phonological features of the root's consonants. Verbs inflect for number, gender, and tense-mood, while the nominal inflection tables include definiteness and possessor.

The provided inflection tables are largely identical to those of the past years' shared tasks, scraped from Wiktionary, with the addition of the verbal nouns and with all forms being automatically vocalized.
3.2.4 Semitic: Amharic

Amharic is the most spoken and best-resourced among the roughly 15 languages in the Ethio-Semitic branch of South Semitic. Unlike most other Semitic languages, but like other Ethio-Semitic languages, it is written in the Ge'ez (Ethiopic) script, an abugida in which each character represents either a consonant-vowel sequence or a consonant in the syllable coda position.

Like other Semitic languages, Amharic displays both affixation and non-concatenative template morphology. Verbs inflect for subject person, gender, and number, and for tense/aspect/mood. Voice and valence are also marked, both by templates and affixes, but these are treated as separate lemmas in the data. Other verb affixes (or clitics, depending on the analysis) indicate object person, gender, and number; negation; relativization; conjunctions; and, on relativized forms, prepositions and definiteness. None of these are included in the data.

Nouns and adjectives share most of their morphology and are often not clearly distinguished. Nouns and adjectives inflect for definiteness, number, and possession. Gender is only explicit when the masculine or feminine singular definite suffixes are present; most nouns have no inherent gender. Nouns and adjectives also have prepositional prefixes (or clitics) and accusative suffixes, which are not included in the data.

The data for the shared task were generated by the HornMorpho generator (Gasser, 2011), an FST weighted with feature structures. Common orthographic variants of the lemmas and common variant plural forms of nouns are included. In these cases, the variants are distinguished with the LGSPEC1 and LGSPEC2 features. Predictable orthographic variants are not included.

3.3 Aymaran

The Aymaran family has two branches: Southern Aymaran (the branch described in this contribution, as represented by Mulyaq' Aymara) and Central Aymaran (Jaqaru).4 Aymaran has no external relatives. The neighboring and overlapping Quechuan family is often erroneously believed to be related.

4 Sometimes Cauqui (also spelled "Kawki"), a language spoken by less than ten elders in Cachuy, Canchán, Caipán, and Chavín, is considered to be a third Aymaran language, but it may be more accurate to consider it a Jaqaru dialect.

3.3.1 Aymaran: Aymara

Aymara is spoken mainly in Andean communities in the region encompassing Bolivia and Peru from the north of Lake Titicaca to the south of Lake Poopó, extending westward to the valleys of the Pacific coast and eastward to the Yunga valleys. It has roughly two million speakers, over half of whom are Bolivian. The rest reside mainly in Peru, with small communities in Chile and Argentina. Aymara is a highly agglutinative, suffix-only language. Nouns are inflected for grammatical number, case, and possessiveness. As Coler (2010) notes, Aymara has 11–12 grammatical cases, depending on the variety (as in some varieties the locative and genitive suffixes have merged and in others they have not). The case suffix is attached to the last element of a noun phrase. Verbs present relatively complex paradigms, with dimensions such as grammatical person (marking both subject and direct object), number, tense (simple, future, recent past, distal past), and mood (evidentials, two counterfactual paradigms, and an imperative paradigm). Moreover, Aymara has a variety of suffixes which change the grammatical category of the word. Words can change grammatical category multiple times.5

5 Tags' conversion into UniMorph: https://github.com/unimorph/aym/blob/main/Muylaq'AymaraUnimorphConversion.tsv

3.4 Indo-European

The Indo-European language family is the parent family of most of the European and Asian languages. In this iteration of the shared task, we enrich the data with languages from the Indo-Aryan, Iranian, and Slavic groups. Iranian and Indo-Aryan are recognised as distinct subgroups of Indo-European. Characteristic retentions and innovations have made the Iranian and Indo-Aryan language families divergent and distinct from each other (Jain and Cardona, 2007).

3.4.1 Indo-Aryan, or Indic: Magahi, Braj

The Indian subcontinent is the heartland where the Indo-Aryan languages are spoken. This area is also referred to as South Asia and encompasses India, Pakistan, Bangladesh, Nepal, Bhutan and the islands of Sri Lanka and Maldives (Jain and Cardona, 2007). Magahi and Braj, both Indo-Aryan languages, are the two considered here.

Magahi comes under the Magadhi group of middle Indo-Aryan, which includes Bhojpuri and
Maithili. While the exact classification within this subgroup is still debatable, most accepted analyses put it under one branch of the Eastern group of languages which includes Bangla, Asamiya, and Oriya (Grierson and Konow, 1903). The Magahi speech area is mainly concentrated in the Eastern Indian states of Bihar and Jharkhand, but it also extends to the adjoining regions of Bengal and Odisha (Grierson, 1903).

There is no grammatical gender and number agreement in Magahi, though sex-related gender derivation commonly occurs for animate nouns like /laika/ (boy) and /laiki/ (girl). Number is also marked on nouns, and it affects the form of case markers and postpositions in certain instances (Lahiri, 2021). Moreover, Magahi has a rich system of verbal morphology to show tense, aspect, person, and honorific agreement with the subject as well as the addressee.

In the present dataset, the inflectional paradigms for verbs show the honorificity level of both the subjects and the addressees, as well as the person of the subject and the tense and aspect markers. The inflectional paradigms for nouns and adjectives are generated on the basis of the inflectional markers used for expressing case, familiarity, plurality, and (sometimes) gender within animate nouns. Pronouns are marked for different cases and honorificity levels. These paradigms are generated on the basis of a manually annotated corpus of Magahi folktales.

We used a raw dataset from the literary domain. First, we annotated the dataset with the Universal Dependencies morphological feature tags at the token level using the CoNLL-U editor (Heinecke, 2019). We then converted the annotated dataset into the UniMorph schema using the script available for converting UD data into the UniMorph tagset (McCarthy et al., 2018). To finalize the data, we manually validated the dataset against the UniMorph schema (Sylak-Glassman et al., 2015a).

Brajbhasha, or Braj, is one of the Indo-Aryan languages spoken in the Western Indian states of Uttar Pradesh, Madhya Pradesh, and Rajasthan. Grierson (1908) groups Brajbhasha under Western Hindi of the Central Group in the Indo-Aryan family, along with other languages like Hindustani, Bangaru, Kannauji, and Bundeli. Braj is not generally used in education or for any official purposes in any Braj-speaking state, but it has a very rich literary tradition. In order to preserve, promote, publish and popularise the literary tradition of Braj, the local state government of Rajasthan has set up the Braj Bhasha Akademi (Braj Bhasha Academy) in Jaipur. Along with this, some individuals, local literary and cultural groups, and language enthusiasts at the local level also bring out publications in Braj (Kumar et al., 2018). In all of the above sources, bhakti poetry6 constitutes a large proportion of the traditional literature of Braj (Pankaj, 2020).

6 This poetry is dedicated to Indian spiritual and mythological imagination associated with Lord Krishna.

As in the case of other Indo-Aryan languages, Braj is also rich in morphological inflections. The dataset released for the present task contains two sets of inflectional paradigms with morphological features, for nouns and for verbs. Nominal lemmas in Braj are inflected for gender (masculine and feminine) and number (singular and plural); verb lemmas take gender (masculine and feminine), number (singular and plural), person (first, second and third), politeness/honorificity (formal and informal), tense (present, past and future), and aspect (perfective, progressive, habitual and prospective) markings. Among these, the politeness feature is marked for showing honorificity and formality. More generally, a formal/polite marker is used for strangers and the elite class, while informal/neutral markers are used for family and friends.

In order to generate the morphological paradigms, we used data from the literary domain, annotated at the token level in the CoNLL-U editor (Heinecke, 2019). The dataset was initially annotated using the Universal Dependencies morphological feature set and then automatically converted to the UniMorph schema using the script provided by McCarthy et al. (2018). Finally, the converted dataset was manually validated and edited to conform to the constraints and conventions of UniMorph to arrive at the final labels.
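For both Magahi and Braj, the features were thus first annotated in the Universal Dependencies scheme and then mapped to UniMorph with the script of McCarthy et al. (2018). The snippet below is only a minimal illustrative sketch of this kind of conversion; the feature mapping shown is a small, hypothetical subset and not the actual table used by that script.

```python
# Illustrative sketch of mapping UD morphological features to UniMorph-style tags.
# FEATURE_MAP is a small hypothetical subset, not the mapping used by the
# McCarthy et al. (2018) conversion script.

FEATURE_MAP = {
    ("POS", "NOUN"): "N",
    ("POS", "VERB"): "V",
    ("Gender", "Masc"): "MASC",
    ("Gender", "Fem"): "FEM",
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Tense", "Past"): "PST",
    ("Tense", "Fut"): "FUT",
    ("Person", "1"): "1",
    ("Person", "2"): "2",
    ("Person", "3"): "3",
}

def ud_to_unimorph(upos: str, feats: str) -> str:
    """Convert a UD UPOS tag plus a 'Key=Val|Key=Val' feature string into a
    semicolon-separated UniMorph-style tag bundle."""
    tags = [FEATURE_MAP[("POS", upos)]]
    if feats and feats != "_":
        for pair in feats.split("|"):
            key, value = pair.split("=")
            mapped = FEATURE_MAP.get((key, value))
            if mapped is not None:
                tags.append(mapped)
    return ";".join(tags)

# e.g. ud_to_unimorph("VERB", "Number=Sing|Person=3|Tense=Past") -> "V;SG;3;PST"
```

A real conversion additionally covers language-specific categories (e.g., the honorificity features of Magahi and Braj) and is followed by the manual validation against the UniMorph schema described above.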
3.4.2 Iranian: Kurdish

The Iranian branch is represented by Kurdish. Among Western Iranian languages, Kurdish is the term covering the largest group of related dialects. Kurdish comprises three main subgroup dialects, namely Northern Kurdish (including Kurmanji), Central Kurdish (including Sorani), and Southern Kurdish. Sorani Kurdish, spoken in Iran and Iraq, is known for its morphological split ergative system. There are two sets of morphemes traditionally described as agreement markers: clitic markers and verbal affixes, which are verbal agreement markers, or the copula. The distribution of these formatives can be described as ergative alignment, although mandatory agent indexing has led some scholars to refer to the Sorani system as post- or remnant-ergative (Jügel, 2009). Note that Sorani nominals do not feature case marking. The single argument of an intransitive verb is an affix, while transitive verbs have a tense-sensitive alignment. With transitive verbs, agents are indexed by affixes in the present tense and by clitics in the past tense. On the other hand, the object is indexed with a clitic in the present tense and an affix in the past tense. In addition, Sorani also has the so-called experiencer-subject verbs, with which both the agent and the object are marked with clitic markers. Like other Iranian languages, Sorani also features a series of light-verb constructions which are composed using the verbs kirdin 'to do' or bun 'to be'. In the light verb constructions, the agent is marked with an affix in the present tense, while a clitic marks the subject in the past tense. Southern Kurdish features all the same verb types, clitics and affixes, while the alignment pattern can be completely different due to a nominative-accusative alignment system. The usage of agreement markers with affixes is widely predominant in Southern Kurdish, and clitics can be used to mark possessives. Both dialects of Kurdish allow for clitic and affix stacking marking the agent and the object of a verb. In Sorani, for instance, dit=yan-im 'They saw me' uses a clitic and an affix to mark the agent and the object, and wist=yan=im 'I wanted them' marks both the agent and the object with clitics. Ditransitive verbs can be formed by a transitive verb and an applicative marker. For instance, a ditransitive three-participant verb da-m-în=î-yê 'He gave them to me' marks the recipient and the object with affixes, and the agent is marked with a clitic in the presence of an applicative (yê). A separate set of morphological features is needed to account for such structures, in which the verb dictates the person marker index as subject, agent, object or recipient.

3.4.3 Slavic: Polish

The Slavic genus comprises a group of fusional languages evolved from Proto-Slavic and spoken in Central and Eastern Europe, the Balkans and the Asian parts of Russia from Siberia to the Far East. Slavic languages are most commonly divided into three major subgroups: East, West, and South. All three are represented in this dataset, with Polish and Czech being typical West Slavic languages, Russian being the most prominent East Slavic language, and Bulgarian representing the Eastern part of the South Slavic group. Slavic languages are characterized by a rich verbal and nominal inflection system. Typically, verbs mark tense, person, gender, aspect, and mood. Nouns mark gender, number, and case, although in Bulgarian and Macedonian cases are reduced to only nominative and vocative. Masculine nouns additionally mark animacy.

Polish data was obtained via a conversion from the largest Polish morphological dictionary (Woliński et al., 2020), which is also used as the main data source in morphological analysis. Table 10 presents a simplified mapping from the original flexemic tagset of Polish (Przepiórkowski and Woliński, 2003) to the UniMorph schema. The data for the remaining three Slavic languages were obtained from Wiktionary.

3.5 Uralic: Karelian, Livvi, Ludic, Veps, Võro

The Uralic languages are spoken from the north of Siberia in Russia to Scandinavia and Hungary. They are agglutinating, with some subgroups displaying fusional characteristics (e.g., the Sámi languages). Many of the languages have vowel harmony. Many of the larger case paradigms are made up of spatial cases, sometimes with distinctions for direction and position. Further, most of the languages have possessive suffixes, which can express possession or agreement in non-finite clauses.

We use Karelian, Ludic, Livvi, Veps, and Võro in the shared task. All the data except Võro were exported from the Open corpus of Veps and Karelian languages (VepKar). Veps and Karelian are agglutinative languages with rich suffixal morphology. All inflectional categories in these languages are formed by attaching one or more affixes corresponding to different grammatical categories to the stem.

The presence of one or two stems in the nominal parts of speech and verbs is essential when constructing word forms in the Veps and Karelian languages (Novak, 2019, 57). In these languages, to build the inflected forms of nouns and verbs, one needs to identify one or two word stems. There are formalized (algorithmic) ways to determine the stem, although not for all words (Novak et al., 2020,
684).

Note that in the Ludic and Livvi dialects of the Karelian language and in the Veps language, reflexive forms of verbs have their own paradigm. Thus, one set of morphological rules is needed for reflexive verbs and another set for non-reflexive verbs.

Võro represents the South Estonian dialect group. Similar to other Uralic languages, it has agglutinative, primarily suffixal, morphology. Nouns inflect for grammatical case and number. The current shared task sample contains noun paradigm tables derived from Wiktionary.7

7 The tag conversion schema for Uralic languages is provided here: https://docs.google.com/spreadsheets/d/1RjO_J22yDB5FH5C24ej7sGGbeFAjcIadJA6ML55tsOI/edit.

3.6 Tungusic

The Tungusic genus comprises a group of agglutinative languages spoken from Central and Eastern Siberia to the Far East over the territories of Russia and China. The genus is considered to be a member of the Altaic (or Transeurasian) language family by some researchers, although this is disputed. Tungusic languages are commonly divided into two or three branches (see Oskolskaya et al. (2021) for discussion).

3.7 Tungusic: Evenki and Xibe

The dataset presents two Tungusic languages, namely Evenki and Xibe, belonging to different branches in any approach, with Xibe being quite aberrant from other Tungusic languages. Tungusic languages are characterized by rich verbal and nominal inflection and demonstrate vowel harmony. Typically, verbs mark tense, person, aspect, voice and mood. Nouns mark number, case and possession.

Inflection is achieved through suffixes. Evenki is a typical agglutinative language with almost no fusion, whereas Xibe is more fusional.

The Evenki data was obtained by conversion from a corpus of oral Evenki texts (Kazakevich and Klyachko, 2013), which uses IPA. The Xibe data was obtained by conversion from a Universal Dependencies treebank compiled by Zhou et al. (2020), which contains textbook and newspaper texts. Xibe texts use the traditional script.

3.8 Turkic

3.8.1 Siberian Turkic: Sakha and Tuvan

The Turkic languages of Siberia, spoken mostly within the Russian Federation, range from vulnerable to severely endangered (Eberhard et al., 2021) and represent several branches of Turkic with varying degrees of relatedness (Баскаков, 1969; Tekin, 1990; Schönig, 1999). They have rich agglutinating morphology, like other Turkic languages, and share many grammatical properties (Washington and Tyers, 2019).

In this shared task, the Turkic languages of this area are represented by Tuvan (Sayan Turkic) and Sakha (Lena Turkic). For both languages, we make use of the lexicons of the morphological transducers built as part of the Apertium open-source project (Khanna et al., to appear in 2021; Washington et al., to appear in 2021). We use the transducers for Tuvan8 (Tyers et al., 2016; Washington et al., 2016) and Sakha9 (Ivanova et al., 2019, to appear in 2022) as morphological generators, extracting the paradigms for all the verbs and nouns in the lexicon. We manually design a mapping between the Apertium tagset and the UniMorph schema (Table 8), based on the system descriptions and additional grammar resources (Убрятова et al. (1982) for Sakha and Исхаков and Пальмбах (1961); Anderson and Harrison (1999); Harrison (2000) for Tuvan). Besides the tag mapping, we also include a few conditional rules, such as marking definiteness for nouns in the accusative and genitive cases.

Since the UniMorph schema in its current version is not well-suited to capture the richness of Turkic morphology, we exclude many forms with morphological attributes that do not have a close equivalent in UniMorph. We also omit forms with affixes that are considered quasi-derivational rather than inflectional, such as the desiderative /-ksA/ in Tuvan (Washington et al., 2016), with the exception of the negative marker. These constraints greatly reduce the sizes of the verbal paradigms: the median number of forms per lemma is 234 and 87 for Tuvan and Sakha respectively, compared to roughly 5,700 forms per lemma produced by either generator. Our tag conversion and paradigm filtering code is publicly released.10

8 https://github.com/apertium/apertium-tyv/
9 https://github.com/apertium/apertium-sah/
10 https://github.com/ryskina/apertium2unimorph
3.8.2 Turkic: Turkish

One of the further west Turkic languages, Turkish is part of the Oghuz branch and, like the other languages of this family, it is highly agglutinative. In this work, we vastly expanded the existing UniMorph inflection tables. As with the Siberian Turkic languages, it was necessary to omit many forms from the paradigm, as the UniMorph schema is not well-suited for Turkic languages. For this reason, we only included the forms that may appear in main clauses. Other than this limitation, we tried to include all possible tense-aspect-mood combinations, resulting in 30 series of forms, each including 3 persons and 2 numbers. The nominal coverage is less comprehensive and includes forms with case and possessive suffixes.

3.9 Austronesian

3.9.1 Malayo-Polynesian: Indonesian

Indonesian, or Bahasa Indonesia, is the official language of Indonesia. It belongs to the Austronesian language family and is written with the Latin script.

Indonesian does not mark grammatical case, gender, or tense. Words are composed from their roots through affixation, compounding, or reduplication. The four types of Indonesian affixes are prefixes, suffixes, circumfixes (a combination of prefixes and suffixes), and infixes (inside the base form). Indonesian uses both full and partial reduplication processes to form words. Full reduplication is often used to express the plural forms of nouns, while partial reduplication is typically used to derive forms that might have a different category than their base forms. Unlike English, the distinction between inflectional and derivational morphological processes in Indonesian is not always clear (Pisceldo et al., 2008).

In this shared task, the Indonesian data is created by bootstrapping from an Indonesian Wikipedia dump. Using a list of possible Indonesian affixes, we collect unique word forms from Wikipedia and analyze them using MorphInd (Larasati et al., 2011), a morphological analyzer tool for Indonesian based on an FST. We manually create a mapping between the MorphInd tagset and the UniMorph schema. We then use this mapping and apply some additional rule-based formulas created by Indonesian linguists to build the final dataset (Table 9).

3.9.2 Malayo-Polynesian: Kodi/Kodhi

Kodi or Kodhi [koâi] is spoken on Sumba Island, eastern Indonesia (Ghanggo Ate, 2020). Regarding its linguistic classification, Kodi belongs to the Central-Eastern subgroup of Austronesian, related to the Sumba-Hawu languages. Based on the linguistic fieldwork observations of Ghanggo Ate (2020), it may be tentatively concluded that there are only two Kodi dialects: Kodi Bhokolo and Mbangedho-Mbalaghar. Even though some work has been done on Kodi (Ghanggo Ate, to appear in 2021), it remains a largely under-documented language. Further, Kodi is vulnerable or threatened because Indonesian, the prestigious national language, is used in most sociolinguistic domains outside the domestic sphere.

A prominent linguistic feature of Kodi is its clitic system, which is pervasive in various syntactic categories—verbs, nouns, and adjectives—and marks person (1, 2, 3) and number (SG vs. PL). In addition, Kodi contains four sets of pronominal clitics that agree with their antecedent: NOM(inative) proclitics, ACC(usative) enclitics, DAT(ive) enclitics and GEN(itive) enclitics. Interestingly, these clitic sets are not markers of NOM, ACC, DAT, or GEN grammatical case—as in Malayalam or Latin—but rather identify the head for TERM relations (subject and object). Thus, by default, pronominal clitics are core grammatical arguments reflecting subject and object.

For the analyses of the features of Kodi clitics, data freshly collected in fieldwork funded by the Endangered Language Fund was annotated. The collected data was then converted to the UniMorph task format, which lists the lemmas, the word forms, and the morphosyntactic features of Kodi.

3.10 Iroquoian

3.10.1 Northern Iroquoian: Seneca

The Seneca language is an indigenous Native American language from the Iroquoian (Hodinöhšöni) language family. Seneca is considered critically endangered and is currently estimated to have fewer than 50 first-language speakers left, most of whom are elders. The language is spoken mainly in three reservations located in Western New York: Allegany, Cattaraugus, and Tonawanda.

Seneca possesses highly complex morphological features, with a combination of both agglutinative and fusional properties. The data presented here consists of inflectional paradigms for Seneca verbs,
the basic structure of which is composed of a verb base that describes an event or state of action. In virtually all cases, the verb base is preceded by a pronominal prefix which indicates the agent, the patient, or both for the event or state, and followed by an aspect suffix which usually marks a habitual or a stative state.

(1) ha skatkwë s
    it he laugh HAB
    'He laughs.'

In some other scenarios, for instance, when the verb is describing a factual, future or hypothetical event, a modal prefix is attached before the pronominal prefix and the aspect suffix marks a punctual state instead. The structures and orders of the prefixes can be more complicated depending on, e.g., whether the action denoted by the verb is repetitive or negative; these details are realized by adding a prepronominal prefix before the modal prefix.

3.11 Arawakan

3.11.1 Southern Arawakan: Asháninka

Asháninka is an Arawak language with more than 70,000 speakers in Central and Eastern Peru and in the state of Acre in Eastern Brazil, in a geographical region located between the eastern foothills of the Andes and the western fringe of the Amazon basin (Mihas, 2017; Mayor Aparicio and Bodmer, 2009). Although it is the most widely spoken Amazonian language in Peru, certain varieties, such as Alto Perené, are highly endangered.

It is an agglutinating, polysynthetic, verb-initial language. The verb is the most morphologically complex word class, with a rich repertoire of aspectual and modal categories. The language lacks case marking, except for one locative suffix; grammatical relations of subject and object are indexed as affixes on the verb itself. Other notable linguistic features of the language include a distinction between alienably and inalienably possessed nouns, obligatory marking of reality status (realis/irrealis) on the verb, a rich system of applicative suffixes, serial verb constructions, and pragmatically conditioned split intransitivity.

The corpus consists of inflected nouns and verbs from the variety spoken along the Tambo river of Central Peru. The annotated nouns take possessor prefixes, locative case and/or plural marking, while the annotated verbs take subject prefixes, reality status (realis/irrealis), and/or perfective aspect.

3.11.2 Southern Arawakan: Yanesha

Yanesha is an Amazonian language from the Pre-Andine subgroup of the Arawakan family (Adelaar and Muysken, 2004), spoken in Central Peru by between 3 and 5 thousand people. It has two linguistic variants that correspond to the upriver and downriver areas, both mutually intelligible.

Yanesha is an agglutinating, polysynthetic language with a VSO word order. Nouns and verbs are the two major word classes, while the adjective word class is questionable due to the absence of non-derived forms. The verb is the most morphologically complex word class and the only obligatory constituent of a clause (Dixon and Aikhenvald, 1999).

Among other typologically remarkable features, the language lacks a grammatical gender distinction; the subject cross-referencing morphemes and one of the causatives are prefixes, while all other verbal affixes are suffixes; and nouns and classifiers may be incorporated into the verb (Wise, 2002).

The corpus consists of inflected nouns and verbs from both dialectal varieties. The annotated nouns take possessor prefixes, plural marking, and locative case, while the annotated verbs take subject prefixes.

3.12 Chukotko-Kamchatkan

The Chukotko-Kamchatkan languages, spoken in the far east of the Russian Federation, are represented in this dataset by two endangered languages, Chukchi and Itelmen (Eberhard et al., 2021).

3.12.1 Chukotko-Kamchatkan: Chukchi

Chukchi is a polysynthetic language that exhibits polypersonal agreement, ergative–absolutive alignment, and a subject–object–verb basic word order in transitive clauses (Tyers and Mishchenkova, 2020). We use the data of the Amguema corpus, available through the Chuklang website,11 comprised of transcriptions of spoken Chukchi in the Amguema variant. The Amguema data had been annotated in the CoNLL-U format by Tyers and Mishchenkova (2020), and we convert it to the UniMorph format using the conversion system of McCarthy et al. (2018).

11 https://chuklang.ru/

3.12.2 Chukotko-Kamchatkan: Itelmen

Itelmen is a language spoken on the western coast of the Kamchatka Peninsula. The language is con-
sidered to be highly endangered since it stopped being transmitted from elders to youth roughly 50 years ago (most speakers are Russian-speaking monolinguals). The language is agglutinative and primarily uses suffixes. For instance, the plural form of a noun is expressed by the suffix -Pn. We note that the plural form only exists in four grammatical cases (NOM, DAT, LOC, VOC).12 The same plural suffix transforms a noun into an adjective. Verbs mark both subjects (with prefixes and suffixes) and objects (with suffixes). For instance, the first person subject is marked by attaching the prefix t- and the suffix -čen (Volodin, 1976).13 The Itelmen data presented in the task was collected through fieldwork and manually annotated according to the UniMorph schema.

12 https://postnauka.ru/longreads/156195
13 http://148.202.18.157/sitios/publicacionesite/pperiod/funcion/pdf/11-12/289.pdf

3.13 Trans-New Guinea

3.13.1 Bosavi: Eibela

Eibela, or Aimele, is an endangered language spoken by a small (∼300 speakers) community in Lake Campbell, Western Province, Papua New Guinea. Eibela morphology is exclusively suffixing. Verbs conjugate for tense, aspect, mood, and evidentiality, and exhibit complex paradigms with a high degree of irregularity. Generally, verbs can be grouped into three classes based on their stems. Verbal inflectional classes present various kinds of stem alternations and suppletion. As Aiton (2016b) notes, the present and past forms are produced either through stem changes or by a concatenative suffix. In some cases, the forms can be quite similar (such as na:gla: 'be sick.PST' and na:glE 'be sick.PRS'). The future tense forms are typically inflected using suffixes. The current sample has been derived from interlinear texts from Aiton (2016a) and contains mostly partial paradigms.

4 Data Preparation

As in the previous editions, each instance in the provided training and development data is a triple (lemma, tag, inflected form). The test set, on the other hand, was released with only lemmas and tags (i.e. without the target inflections). Producing these data sets required a few extra steps, which we discuss in this section.

Conversion into the UniMorph schema. After the data collection was finalised for the above languages, we converted them to the UniMorph schema—canonicalising them in the process.14 This process consisted mainly of typo corrections (e.g. removing an incorrectly placed space in a tag, "PRIV " → "PRIV"), removing redundant tags (e.g. duplicated verb annotation, "V;V.PTCP" → "V.PTCP"), and fixing tags to conform to the UniMorph schema (e.g. "2;INCL" → "2+INCL"). These changes were implemented via language-specific Bash scripts. Given this freshly converted data, we canonicalised its tag annotations, making use of https://github.com/unimorph/um-canonicalize. This process sorts the inflection tags into their canonical order and verifies that all the used tags are present in the ground truth UniMorph schema, flagging potential data issues in the process.

14 The new languages are included in the UniMorph data: https://unimorph.github.io/

Data splitting. Given the canonicalised data as described above, we removed all instances with duplicated <lemma; tags> pairs—as these instances were ambiguous with respect to their target inflected form—and removed all forms other than verbs, nouns, or adjectives. We then capped the dataset sizes at a maximum of 100,000 instances per language, subsampling when necessary. Finally, we created a 70–10–20 train–dev–test split per language, splitting the data across these sets at the instance level (as opposed to, e.g., the lemma level). As such, the information about a lemma's declension or inflection class is spread out across the train, dev and test sets, making this task much simpler than if one had to predict the entire class from the lemma's form alone, as done by, e.g., Williams et al. (2020) and Liu and Hulden (2021).
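The splitting procedure is straightforward to reproduce; the sketch below (with illustrative names, not the organizers' released scripts) removes ambiguous (lemma, tags) pairs, keeps only verbs, nouns, and adjectives, caps each language at 100,000 instances, and draws an instance-level 70–10–20 split.

```python
import random
from collections import Counter

def prepare_splits(triples, cap=100_000, seed=0):
    """triples: list of (lemma, tags, form). Returns (train, dev, test).
    A sketch of the procedure described above, not the official script."""
    rng = random.Random(seed)

    # Drop instances whose (lemma, tags) pair occurs more than once,
    # i.e. pairs that are ambiguous with respect to their target form.
    counts = Counter((lemma, tags) for lemma, tags, _ in triples)
    unambiguous = [t for t in triples if counts[(t[0], t[1])] == 1]

    # Keep only verbal, nominal, and adjectival forms (the POS is the
    # first UniMorph tag; "V.PTCP" etc. are treated as verbal here).
    kept = [t for t in unambiguous
            if t[1].split(";")[0].split(".")[0] in {"V", "N", "ADJ"}]

    # Cap the dataset size, subsampling when necessary.
    if len(kept) > cap:
        kept = rng.sample(kept, cap)

    # Instance-level 70-10-20 train-dev-test split.
    rng.shuffle(kept)
    n_train = int(0.7 * len(kept))
    n_dev = int(0.1 * len(kept))
    return (kept[:n_train],
            kept[n_train:n_train + n_dev],
            kept[n_train + n_dev:])
```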
L V N ADJ V.CVB V.PTCP V.MSDR
afb 19,861 2,184 7,595 2,996 4,208 1,510 – – – – – –
amh 20,254 670 20,280 1,599 829 195 4,096 668 – – 668 668
ara 31,002 635 53,365 1,703 58,187 742 – – – – – –
arz 8,178 1,320 10,533 3,205 6,551 1,771 – – – – – –
heb 28,635 1,041 3,666 142 – – – – – – 847 847
syc 596 187 724 329 158 86 – – 261 77 – –
ame 1,246 184 2,359 143 – – – – – – – –
cni 5,478 150 14,448 258 – – – – – – – –
ind 8,805 2,570 5,699 2,759 1,313 731 – – – – – –
kod 315 44 91 14 56 8 – – – – – –
aym 50,050 910 91,840 656 – – – – – – 910 910
ckt 67 62 113 95 8 8 – – – – – –
itl 718 424 567 419 63 59 412 352 – – 20 19
Development

gup 305 73 – – – – – – – – – –
bra 564 286 808 757 174 157 – – – – – –
bul 13,978 699 8,725 1,334 13,050 435 423 423 17,862 699 1,692 423
ces 33,989 500 44,275 3,167 48,370 1,458 2,518 360 5,375 360 – –
ckb 14,368 112 1,882 142 – – – – 289 112 – –
deu 64,438 2,390 73,620 9,543 – – – – 4,777 2,390 – –
kmr 6,092 301 135,604 14,193 – – – – 397 150 783 301
mag 442 145 692 664 77 76 – – 6 6 3 3
nld 32,235 2,149 – – 21,084 2,844 – – 2,148 2,148 – –
pol 40,396 636 12,313 894 23,042 424 625 614 50,772 446 15,456 633
por 133,499 1,884 – – – – – – 9,420 1,884 – –
rus 33,961 2,115 54,153 4,747 46,268 1,650 3,188 2,107 5,486 2,138 – –
spa 132,702 2,042 – – – – 2,042 2,042 8,184 2,046 – –
see 5,430 140 – – – – – – – – – –
ail 940 365 339 249 32 24 – – – – –
evn 2,106 961 3,661 2,249 446 393 612 390 716 517 – –
sah 20,466 237 122,510 1,189 – – 2,832 236 – – – –
tyv 61,208 314 81,448 970 – – 9,336 314 – – – –
krl 108,016 1,042 1,118 107 213 24 – – 3,043 1,021 – –
lud 57 31 125 77 1 1 – – – – – –
olo 72,860 649 55,281 2,331 12,852 538 – – 1,762 575 – –
vep 55,066 712 69,041 2,804 16,317 560 – – 2,543 705 – –
sjo 135 99 49 41 16 16 86 69 78 65 51 44
Surprise

tur 97,090 190 44,892 992 1,440 20 – – – – – –


vro – – 1,148 41 – – – – – – – –

Table 2: Number of samples and unique lemmata (the second number in each column) in each word class in the
shared task data, aggregated over all splits. Here: “V” – verbs, “N” – nouns, “ADJ” – adjectives, “V.CVB” –
converbs, “V.PTCP” – participles, “V.MSDR” – masdars.

5 Baseline Systems

The organizers provide four neural systems as baselines, a product of two models and optional data augmentation. The first model is a transformer (Vaswani et al., 2017, TRM), and the second model is an adaptation of the transformer to character-level transduction tasks (Wu et al., 2021, CHR-TRM), which holds the state-of-the-art on the 2017 SIGMORPHON shared task data. Both models follow the hyperparameters of Wu et al. (2021). The optional data augmentation follows the technique proposed by Anastasopoulos and Neubig (2019). Relying on a simple character-level alignment between the lemma and the form, this technique replaces shared substrings of length > 3 with random characters from the language's alphabet, producing hallucinated lemma–tag–form triples. Data augmentation (+AUG) is applied to languages with fewer than 10K training instances, and 10K examples are generated for each language.
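The hallucination step can be sketched as follows. This is a simplified illustration of the Anastasopoulos and Neubig (2019) technique, not the baseline's actual implementation: it uses the longest common substring as a stand-in for a full character-level alignment, and the alphabet handling is illustrative.

```python
import random
from difflib import SequenceMatcher

def hallucinate(lemma, form, tags, alphabet, min_len=4, rng=random):
    """Replace a shared substring (length > 3) of lemma and form with random
    characters, yielding a synthetic (lemma, tags, form) triple.
    A simplified sketch of Anastasopoulos and Neubig (2019), not their code."""
    match = SequenceMatcher(None, lemma, form).find_longest_match(
        0, len(lemma), 0, len(form))
    if match.size < min_len:
        return None  # nothing long enough to swap out
    replacement = "".join(rng.choice(alphabet) for _ in range(match.size))
    new_lemma = lemma[:match.a] + replacement + lemma[match.a + match.size:]
    new_form = form[:match.b] + replacement + form[match.b + match.size:]
    return new_lemma, tags, new_form

# e.g. hallucinate("walk", "walked", "V;PST", "abcdefghijklmnopqrstuvwxyz")
# might return ("qzrv", "V;PST", "qzrved").
```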
6 Submitted Systems

GUClasp The system submitted by Team GUClasp is based on the architecture and data augmentation technique presented by Anastasopoulos and Neubig (2019). More specifically, the team implemented an encoder–decoder model with an attention mechanism. The encoder processes a character sequence using an LSTM-based RNN with attention. Tags are encoded with a self-attention (Vaswani et al., 2017) position-invariant module. The decoder is an LSTM with separate attention mechanisms for the lemma and the tags. GUClasp focus their efforts on exploring strategies for training a multilingual model; in particular, they implement the following strategies:
L BME GUClasp TRM TRM+AUG CHR-TRM CHR-TRM+AUG
Development
afb 92.39 81.71 94.88 94.88 94.89 94.89
amh 98.16 93.81 99.37 99.37 99.45 99.45
ara 99.76 94.86 99.74 99.74 99.79 99.79
arz 95.27 87.12 96.71 96.71 96.46 96.46
heb 97.46 89.93 99.10 99.10 99.23 99.23
syc 21.71 10.57 35.14 34.29 36.29 34.57
ame 82.46 55.94 87.43 87.85 87.15 86.19
cni 99.5 93.36 99.90 99.90 99.88 99.88
ind 81.31 55.68 83.61 83.61 83.30 83.30
kod 94.62 87.1 96.77 95.70 95.70 96.77
aym 99.98 99.97 99.98 99.98 99.98 99.98
ckt 44.74 52.63 26.32 55.26 28.95 57.89
itl 32.4 31.28 38.83 39.66 38.55 39.11
gup 14.75 21.31 59.02 63.93 55.74 60.66
bra 58.52 56.91 53.38 59.81 59.49 58.20
bul 98.9 96.46 99.63 99.63 99.56 99.56
ces 98.03 94.00 98.24 98.24 98.21 98.21
ckb 99.46 96.60 99.94 99.94 99.97 99.97
deu 97.98 91.94 97.43 97.43 97.46 97.46
kmr 98.21 98.09 98.02 98.02 98.01 98.01
mag 70.2 72.24 66.94 73.47 70.61 72.65
nld 98.28 94.91 98.89 98.89 98.92 98.92
pol 99.54 98.52 99.67 99.67 99.70 99.70
por 99.85 99.11 99.90 99.90 99.86 99.86
rus 98.07 94.32 97.55 97.55 97.58 97.58
spa 99.82 97.65 99.86 99.86 99.90 99.90
see 78.28 40.97 90.65 89.64 90.01 88.63
ail 6.84 6.46 12.17 11.79 10.65 12.93
evn 51.9 51.5 57.65 58.05 57.85 59.12
sah 99.95 99.69 99.93 99.93 99.97 99.97
tyv 99.97 99.78 99.95 99.95 99.97 99.97
krl 99.88 98.50 99.90 99.90 99.90 99.90
lud 59.46 59.46 16.22 45.95 27.03 45.95
olo 99.72 98.2 99.67 99.67 99.66 99.66
vep 99.72 97.05 99.65 99.65 99.70 99.70
Surprise
sjo 35.71 15.48 35.71 47.62 45.24 42.86
tur 99.90 99.49 99.36 99.36 99.35 99.35
vro 94.78 87.39 97.83 98.26 97.83 97.39

Table 3: Accuracy for each language on the test data.

curriculum learning with competence (Platanios et al., 2019) based on character frequency and model loss; predicting Levenshtein operations (copy, delete, replace and add) as a multi-task objective going from lemma to inflected form; label smoothing based on other characters in the same language (language-wise label smoothing); and scheduled sampling (Bengio et al., 2015).

BME Team BME's system is an LSTM encoder-decoder model based on the work of Faruqui et al. (2016), with three-step training where the model is first trained on all languages, then fine-tuned on each language family, and finally fine-tuned on individual languages. A different type of data augmentation technique, inspired by Neuvel and Fulop (2002), is also used in the first two steps. Team BME also perform ablation studies and show that the augmentation techniques and the three training steps often help but sometimes have a negative effect.

7 Evaluation

Following the evaluation procedure established in the previous shared task iterations, we compare all systems in terms of their test set accuracy. In addition, we perform an extensive error analysis for most languages.
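The metric is exact-match accuracy over the held-out test triples; a minimal sketch, assuming predictions and gold forms are parallel lists of strings, is:

```python
def exact_match_accuracy(predictions, references):
    """Percentage of test items whose predicted inflected form matches the
    gold form exactly (the metric used to compare systems here)."""
    assert len(predictions) == len(references)
    correct = sum(p == g for p, g in zip(predictions, references))
    return 100.0 * correct / len(references)
```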
8 Results

As Table 3 demonstrates, most systems achieve over 90% accuracy on most languages, with transformer-based baseline models demonstrating superior performance on all language families except Uralic. Two Turkic languages, Sakha and Tuvan, achieve particularly high accuracy of 99.97%. This is likely due to the data being derived from morphological transducers where certain parts of the verbal paradigms were excluded (see Section 3.8.1). On the other hand, the accuracy on Classical Syriac, Chukchi, Itelmen, Kunwinjku, Braj, Ludic, Eibela, Evenki, and Xibe is low overall. Most of these languages are under-resourced and have very limited amounts of data—indeed, the Spearman correlation between the transformer model's performance and a language's training set size is roughly 77%.

Analysis for each POS

Tables 13 to 18 in the Appendix provide the accuracy numbers for each word class. Verbs and nouns are the most represented classes in the dataset. For under-resourced languages such as Classical Syriac, Itelmen, Chukchi, Braj, Magahi, Evenki, and Ludic, nouns are predicted more accurately, most likely due to the larger number of samples and smaller paradigms. Still, the models' performance is relatively stable across POS tags: the Pearson correlation between all models' performance on the verb and the noun data is 86%, while the noun–adjective performance correlation is 89% and the verb–adjective one is 84%. The most stable model across POS tags, at least according to these correlations, is BME, with 87%, 96% and 91% Pearson correlations for verb–noun, noun–adjective, and verb–adjective performance accuracies respectively.

Analysis for out-of-vocabulary lemmas

Table 4 shows the differences in performance between the lemmas present in both the training and test sets and the "unknown" lemmas. A closer inspection of this table shows that while the BME and GUClasp models have an accuracy gap of 5.5% and 3% respectively between previously known and unknown lemmas, the transformer-based architectures show an accuracy gap of 9% to 16%. This larger gap, however, is partly explained by the better performance of the transformer-based models on previously seen lemmas (around 75%, while BME's performance is 71% and GUClasp's is 66%). The performance on previously unseen lemmas, on the other hand, is mostly driven by data augmentation. The models without data augmentation have an accuracy of around 60% on these lemmas, while all other models achieve around 65% on previously unseen lemmas. This is in line with the findings of Liu and Hulden (2021), who show that the transducer's performance on previously seen words can be greatly improved by simply training the models to perform the trivial task of copying random lemmas during training—a method somewhat related to data augmentation.

Table 4: Accuracy comparison, per language and per system (BME, GUClasp, TRM, TRM+AUG, CHR-TRM, CHR-TRM+AUG), for the lemmas known from the training set vs. unknown lemmas. Groups having <20 unique lemmas are marked with asterisks.
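The comparison in Table 4 can be reproduced by partitioning test items according to whether their lemma occurs in the training data; the following is an illustrative sketch, not the organizers' evaluation code:

```python
def accuracy_by_lemma_novelty(train_triples, test_triples, predictions):
    """Split test accuracy by whether the test lemma was seen in training.
    train_triples/test_triples: lists of (lemma, tags, form);
    predictions: predicted forms aligned with test_triples."""
    train_lemmas = {lemma for lemma, _, _ in train_triples}
    stats = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (lemma, _, gold), pred in zip(test_triples, predictions):
        bucket = "known" if lemma in train_lemmas else "unknown"
        stats[bucket][0] += int(pred == gold)
        stats[bucket][1] += 1
    return {k: (100.0 * c / t if t else float("nan"))
            for k, (c, t) in stats.items()}
```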
Analysis for the most challenging inflections

Table 5 shows the accuracy of the submitted systems on the "most challenging" test instances, where all four baselines failed to predict the target form correctly.

Frequently observed types of such cases include:

• Unusual alternations of some letters in particular lexemes, which are hard to generalize;

• Ambiguity of the target forms. Certain lemmas allow some variation in forms, while the test set only lists a single exclusive gold form for each (lemma, tags) combination. In most cases, multiple acceptable forms may be hardly distinguishable in spoken language. For instance, they may only differ by an unstressed vowel or be orthographic variants of the same form.

• Multiword expressions, which are challenging when agreement is required. UniMorph does not provide dependency information; however, the information can be inferred from similar samples or other parts of the same lemma's paradigm. The system's ability to make generalizations from a sequence down to its subsequences essentially depends on its architecture.

• Errors in the test sets. Still, a small percentage of errors come from the data itself.

L     BME     GUClasp
afb   19.61   14.23
amh   6.81    15.90
ara   49.05   5.66
arz   20.28   18.11
heb   15.90   13.63
syc   0       .50
ame   6.45    4.83
cni   33.33   0
ind   19.94   15.60
aym   33.33   0
ckt   0       0
itl   3.50    2.33
gup   0       0
bra   9.27    9.27
bul   18.42   34.21
ces   35.59   16.57
ckb   100.00  0
deu   55.59   13.14
kmr   39.36   10.10
mag   4.00    6.00
nld   11.30   14.78
pol   47.61   30.15
por   27.27   0
rus   46.18   20.93
spa   64.00   12.00
see   7.89    3.94
ail   .99     .99
evn   5.55    4.78
sah   25.00   0
tyv   80.00   20.00
krl   78.94   47.36
lud   17.64   23.52
olo   64.61   23.07
vep   58.46   9.23
sjo   9.37    6.25
tur   93.37   88.39
vro   33.33   0

Table 5: Test accuracy for each language on the samples where none of the baseline systems produces the correct prediction.

9 Error Analysis

As Elsner et al. (2019) note, accuracy-level evaluation might be sufficient to compare model variants but does not provide much insight into the understanding of morphological systems and their learnability. Therefore, we now turn to a more detailed analysis of the mispredictions made by the systems. For the purpose of this study, we rely on the error type taxonomy proposed by Gorman et al. (2019) and Muradoglu et al. (2020).

9.1 Evenki and Xibe

For the Evenki language, GUClasp tends to shorten the resulting words, sometimes generating theoretically impossible forms. However, in several cases the result is practically correct, but only for a different dialect, such as abull@n instead of abuld@n. The performance is better for nominal wordforms (74.27 accuracy for nouns only vs. 30.55 for verbs only). This is perhaps due to the higher regularity of nominal forms. BME performs slightly better for the Evenki language, with errors in vowel harmony (such as ahatkanmo instead of ahatkanm@). In contrast with GUClasp, it tends to generate longer forms, adding unnecessary suffixes. Problems with dialectal forms can be found as well. The performance for Xibe is worse for both systems, though BME is better, despite the simpler morphology—perhaps due to the complexity of the Xibe script. In at least one instance, one of the systems generated a form with a Latin letter n instead of Xibe letters.

9.2 Syriac

Both GUClasp and BME generated 350 nominal, verbal and adjectival forms with less than 50% accuracy.

This includes forms that are hypothetically correct despite being historically unattested (e.g., abydwtkwn 'your (M.PL) loss'). Both systems performed better on nominal and adjectival forms than verbal forms. This may be explained by the higher morphological regularity of nominal forms relative to verbal forms; nominal/adjectival inflections typically follow linear affixation rules (e.g., suffixation), while verbal forms follow the same rules in addition to non-concatenative processes. Further, both systems handle lexemes with two or three letters (e.g., dn 'to judge') poorly compared to longer lexemes (e.g., bt.nwt' 'conception'). Where both systems generate multiple verbal forms for the same lexeme, the consonantal root is inconsistent. Finally, as expected, lexicalised phrases (e.g., klnš 'everyone', derived from the high-frequency contraction kl 'every' and nš 'person') and homomorphs (e.g., ql' 'an expression (n.)' or 'to fry (v.)') are handled poorly. Comparatively, the BME system performed worse than GUClasp, especially in terms of vowel diacritic placement and consonant doubling, which are consistently hypercorrected in both cases (e.g., h.byb' > h.abbbbay; h.yltn' > h.aallto').

9.3 Amharic

Both submitted systems performed well on the Amharic data, BME (98.16% accuracy) somewhat better than GUClasp (93.81% accuracy), though neither outperformed the baseline models.

For both systems, target errors represented a significant proportion of the errors: 32.35% for BME and 24.08% for GUClasp. Many of these involved alternative plural forms of nouns. The data included only the most frequent plural forms when there were alternatives, sometimes excluding uncommon but still possible forms. In some cases only an irregular form appeared in the data, and the system "erroneously" predicted the regular form with the suffix -(w)oč, which is also correct. For example, BME produced hawaryawoč, the regular plural of hawarya 'apostle', instead of the expected irregular plural hawaryat. Another source of target errors was the confusion resulting from multiple representations for the phonemes /h,P,s,s'/ in the Amharic orthography. Again, the data included only the common spelling for a given lemma or inflected form, but alternative spellings are usually also attested. Many of the "errors" consisted of predicting correct forms with one of these phonemes spelled differently than in the expected form.

The largest category of errors for both systems (unlike the baseline systems) were allomorphy errors: 51.76% for BME and 62.65% for GUClasp. Most of these resulted from the confusion between vowels in verbal templates. Particularly common were vowel errors in jussive-imperative (IMP) forms. Most Amharic verb lemmas belong to one of two inflection classes, each based on roots consisting of three consonants. The vowels in the templates for these categories are identical in the perfective (PRF), imperfective (IPFV), and converb (V.CVB) forms, but differ in the infinitive (V.MSDR) and jussive-imperative, where class A has the vowels .1.@ and class B has the vowels [email protected]. Both systems also produced a significant number of silly errors: incorrect forms that could not be explained otherwise. Most of these consisted of consonant deletion, replacing one consonant with another, or repeating a consonant–vowel sequence.

9.4 Polish

Polish is among the languages for which both systems and all the baselines achieved the highest accuracy results. BME, with 99.54% accuracy, does slightly better than GUClasp (98.52%). However, neither system exceeds the baseline results (99.67–99.70%).

Most of the errors made by both systems were already noted and classified by Gorman et al. (2019) and follow from typical irregularities in Polish. For example, masculine nouns have two GEN.SG suffixes: -a and -u. The latter is typical for inanimate nouns, but the former can be used with both animate and inanimate nouns, which makes it highly unpredictable and causes the production of incorrect forms such as negatywa, rabunka instead of negatywu 'negative', rabunku 'robbery'. Both systems are vulnerable to such mistakes. Another example would be the GEN.PL forms of plurale tantum nouns, which can have -ów or a zero suffix, leading to errors such as tekstyli, wiktuał instead of tekstyliów 'textiles', wiktuałów 'victuals'. Some loan words in Polish have fully (mango, marines, monsieur) or partially (millenium, in the singular only) syncretic inflectional paradigms. This phenomenon is hard to predict, as the vast majority of Polish nouns inflect regularly. Both systems tend to produce inflected forms of those nouns according to their regular endings, which would otherwise be correct if not for their syncretic paradigms. One area in which BME returns significantly better results than GUClasp is imperative forms.
Polish imperative forms follow a few different patterns involving some vowel alternations, but in general are fairly regular. For the 364 imperative forms in the test data set, BME produced only 12 errors, mostly excusable and concerning existing phonetic alternations which could cause some problems even for native or L2 speakers. GUClasp, however, produced 61 erroneous imperative forms, some of them being examples of overgeneralization of the zero-suffix pattern for first person singular imperatives (wyjaśn instead of wyjaśnij for the verb WYJAŚNIĆ 'explain').

Interestingly, both systems sometimes produce forms that are considered incorrect in standard Polish but are quite often used colloquially by native speakers. Both BME and GUClasp generated the form podniesą się (instead of podniosą się 'they will rise'). Moreover, GUClasp generated the form podeszłeś (instead of podszedłeś 'you came up').

9.5 Russian

Similar to Polish and many other high-resource languages, the accuracy of all systems on Russian is high, with BME being the best-performing model (98.07%). The majority of errors consistently made by all systems (including the baseline ones) are related to the different inflections for animate and inanimate nouns in the accusative case. In particular, UniMorph does not provide the corresponding animacy feature for nouns, an issue that has also been reported previously by Gorman et al. (2019).

The formation of the short forms of adjectives and participles with -ен- and -енн- is another source of misprediction. The systems either generate an incorrect number of н, as in *умерена (should be умеренна 'moderate'), or fail to attach the suffix in cases that require some repetition in the production, as in *жертвен (should be жертвенен 'sacrificial'), i.e. generation stops after the first ен is produced. In addition to that, the systems often mispredict alternations of е and ё, as in *ошеломлённы instead of ошеломлены 'overwhelmed'. The same error also occurs in the formation of past participle forms such as *покормлённый (should be покормленный 'fed'). Further, we also observe it in noun declension, more specifically in the prediction of the singular instrumental form: *слесарём (should be слесарем 'locksmith'), *гостьёй (should be гостьей 'female guest'). Additionally, we observe more errors in the prediction of the instrumental case forms, mainly due to allomorphy. In many cases, the systems would have benefited from observing stress patterns or grammatical gender. For instance, consider the feminine акварель 'aquarelle' and the masculine пароль 'password'. In order to make a correct prediction, a model should either be explicitly provided with the grammatical gender, or a partial paradigm (e.g., the dative and genitive singular slots) for the corresponding lemma should be observed in the training set. Indeed, the latter is often the case, but the systems still fail to make a correct inference.

Finally, multiword expressions present themselves as an extra challenge to the models. In most cases, the test lemmas also appeared in the training set, therefore the systems could infer the dependency information from other parts of the same lexeme. Still, Russian multiword expressions appeared to be harder to inflect, probably as they show richer combinatory diversity. For instance, электромагнитное взаимодействие 'electromagnetic interaction' for the plural instrumental case slot is mispredicted as *электромагнитными взаимодействия, i.e. the adjective part is correct while the noun form is not. As Table 7 illustrates, the accuracy gap in predicting multiword expressions with lemmas in- or out-of-vocabulary is quite large.

9.6 Ludic

The Ludic language, in comparison with the Karelian, Livvi and Veps languages, has the smallest number of lemmas (31 verbs, 77 nouns and 1 adjective) and has the lowest accuracy (16–60%). Therefore, the incomplete data is the main cause of errors in the morphological analyzers working with the Ludic dialect ('target errors' in the error taxonomy proposed by Gorman et al.).

9.7 Kurdish

Testing on the Sorani data yields high accuracy values, although there are errors in some cases. More than 200 lemmas and 16K samples were generated. Both BME and GUClasp generate regular nominal and verbal forms with a high accuracy of 99.46% and 96.6% respectively, although neither system exceeds the baseline results of 99.94% and 99.97%. Kurdish has a complex morphological system with defective paradigms and second-position person markers. Double clitic and affix-clitic
constructions can mark subjects or objects in a verbal construction, and ditransitives are made with applicative markers. Such morphological structures can be the reason for the few issues that still occur.

9.8 Tuvan and Sakha

Both BME and GUClasp predict the majority of the inflected forms correctly, achieving test accuracies of over 99.6% on both Tuvan and Sakha, with BME performing slightly better on both languages. The remaining errors are generally caused by misapplications of morphophonology, either by the prediction system or by the data generator itself. Since the forms treated as ground truth were automatically generated by morphological transducers (§3.8.1), the mismatch between the prediction and the reference might be due to 'target errors' where the reference itself is wrong (Gorman et al., 2019). For the BME system, target errors account for 1/8 disagreement cases for Tuvan and 3/13 for Sakha, although for all of them the system's prediction is indeed incorrect as well. For GUClasp, the reference is wrong in 19/62 cases for Tuvan (four of them also have an incorrect lemma, which makes it impossible to judge the correctness of any inflection) and 43/90 for Sakha. Interestingly, GUClasp actually predicts the correct inflected form for 27/43 and 3/15 target error cases for Sakha and Tuvan, respectively (for Tuvan, the 19 target errors excluding the 4 unjudgeable cases).

The actual failure cases for both BME and GUClasp are largely allomorphy errors, per Gorman et al.'s classification. Common problems include consonant alternation (Sakha *охсусуҥ instead of охсуһуҥ), vowel harmony (Tuvan *ижиарлер instead of ижигерлер) and vowel/null alternation (Tuvan *шымынар силер instead of шымныр силер). Unadapted loanwords that entered the languages through Russian (e.g. Sakha педагог 'pedagogue', принц 'prince', наследие 'heritage') are also frequent among the errors for both systems.

9.9 Ashaninka and Yanesha

For Ashaninka, the high baseline scores (over 99.8%) could be attributed to the relatively high regularity of the (morpho)phonological rules in the language. In this language, the BME system achieved comparable performance with 99.5%, whereas GUClasp still achieved a robust accuracy of 93.36%.

The case of Yanesha is different, as the baseline only peaked at 87.43%, whereas the BME and GUClasp systems underperformed with 82.46% and 55.94%, respectively. The task for Yanesha is harder, as the writing tradition is not transparent enough to predict some rules. For instance, long and short vowels are written in a similar way, always with a single vowel, and the aspirated vowels are optionally marked with a diacritic. These distinctions are essential at the linguistic level, as they allow one to explain the morphophonological processes, such as the syncope of the weak vowels in the inflected forms (po'kochllet instead of po'kchellet). We also observe allomorphy errors, for instance, predicting phomchocheñ instead of pomchocheñ (from mochocheñets and V;NO2;FIN;REAL). The singular second person prefix has ph-, pe- and p- as allomorphs, each with different rules to apply. Moreover, there are some spelling issues as well, as the diacritic and the apostrophe are usually optional. For instance, the spellings wapa or wápa ('to come where someone is located') are both correct. It is important to note that the orthographic standards are going to be revised by the Peruvian government to reduce the ambiguous elements.

9.10 Magahi

The transformer baseline with data augmentation (TRM+AUG) achieved the highest score, with GUClasp taking second place (with 72.24%) and the base transformer yielding the lowest score of 66.94%. For Magahi, the results do not vary too much between systems, and the clearest performance boost seems to arise from the use of data augmentation. The low score of the TRM baseline is caused by the scarcity of data and the diversity in the morphophonological structure. Prediction errors on Magahi include incorrect honorificity, mispredicting plural markers, and spelling errors.

Honorificity: the systems predict forms lacking the honorific marker /-ak/. For example, /puchhal/ ('asked') is predicted instead of /puchhlak/ ('asked'), or /bital/ ('passed time') instead of /bitlak/ ('passed time').

Plural marker: the systems' predictions omit the plural markers /-van/ and /-yan/, similarly to the case of the honorific markers discussed above. For example, /thag/ ('con') is produced instead of /thagwan/ ('con').

Spelling errors: the predicted words do not
occur in the Magahi language. The predictions also do not show any specific error pattern.

We thus conclude that the performance of the baseline systems is greatly affected by the morphological structure of Magahi. Also, some language-specific properties of Magahi are not covered by the scope of the UniMorph tagset. For example, consider the following pair:

(/dekh/, /dekhlai/, 'V;3;PRF;INFM;LGSPEC1', 'see')
(/dekh/, /dekhlak/, 'V;3;PRF;INFM;LGSPEC2', 'see')

Here both forms exhibit morphological features that are not defined in the default annotation schema. Morphologically, the first form indicates that the speaker knows the addressee but not intimately (or there is a low level of intimacy), while the second one signals a comparatively higher level of intimacy. Such aspects of the Magahi morphology are challenging for the systems.

9.11 Braj

For the low-resource language Braj, both submitted systems performed worse than the baseline systems. BME achieved 58.52% prediction accuracy, slightly outperforming GUClasp with 56.91%. As for the baseline systems, CHR-TRM scored highest with 59.49% accuracy and TRM scored lowest with 53.38%. Among Indo-European languages, the performance of the BME, GUClasp, and the baseline systems is lowest for Braj. The low accuracy and the larger number of errors are broadly due to misprediction and misrepresentation of the morphological units and the smaller data size.

BME, GUClasp, and the baseline systems generated 311 nominal, verbal, and adjectival inflected forms from existing lemmas. In these outputs, the misprediction and misrepresentation errors are morphemic errors, already included/classified by Gorman et al. (2019). The findings of our analysis of both the gold data and the predictions of all systems highlight several common problems for nominal, verbal, and adjectival inflected forms. Common errors, mispredicted by all models, include morphemes of gender (masculine and feminine: for the noun akhabaaree instead of akhabaar 'newspaper', for the verb arraa instead of arraaii 'shout', and for the adjective mithak instead of mithakeey 'ancient'); morphemes of number (singular and plural: for the noun kahaanee instead of kahaaneen 'story', for the verb utaran instead of utare 'get off', for the adjective achchhe instead of achchhau 'good'); and morphemes of honorificity (formal and informal: suni instead of sunikain 'listen', rahee instead of raheen 'be', and karibau instead of kari 'do', etc.). A portion of these errors is also caused by inflection errors in predicting and generating multiword expressions (MWEs) (e.g. aannd instead of aannd-daayak 'comfortable'). Apart from the mentioned error types, the systems also made silly errors (e.g. uthi instead of uthaay 'get up', kathan instead of kathanopakathan 'conversation', karaah instead of karaahatau 'groan', keeee instead of keenee 'do', and grahanave instead of grahan 'accept', etc.) and spelling errors (e.g. dhamaky instead of dhamakii 'threat', laau or laauy instead of liyau 'take', saii instead of saanchii 'correct', and samajhat instead of samajhaayau 'explain', etc.), as classified by Gorman et al. (2019). Under all of the models, the majority of errors were silly or spelling errors.

9.12 Aymara

All systems achieved high accuracy (99.98%) on this language. The few errors are mainly due to inconsistency in the initial data annotation. For instance, the form uraqiw is listed as a lemma while it can only be understood as being comprised of two morphemes: uraqi-w(a) 'it (is) land'. The root, or the nominative unmarked form, is uraqi 'land'. The -wa is presumably the declarative suffix. The nucleus of this suffix can be lost owing to the complex morphophonemic rules which operate at the edges of phrases. In addition, the accusative form uraqi is incorrect, since the accusative is marked by subtractive disfixation; therefore, uraq is the accusative inflected form.

9.13 Eibela

Eibela seems to be one of the most challenging languages, probably due to its small data size and sparsity. Since it has been extracted from interlinear texts, a vast majority of its paradigms are partial, and this certainly makes the task more difficult. A closer look at system outputs reveals that many errors are related to misprediction of vowel length. For instance, to:mulu is inflected in N;ERG as tomulE instead of to:mu:lE:.

9.14 Kunwinjku

The data for Kunwinjku is relatively small and contains verbal paradigms only. Test accuracies range from 14.75% (BME) to 63.93% (TRM+AUG).
In this language, many errors were due to incorrect spelling and missing parts of transcription. For instance, for the second person plural non-past of the lemma borlbme, TRM predicts *ngurriborlbme instead of ngurriborle. Interestingly, BME mispredicts most forms due to the looping effect described by Shcherbakov et al. (2020). In particular, it starts producing sequences such as *ngarrrrrrrrrrrrrrmbbbijj (should be karribelbmerrinj) or *ngadjarridarrkddrrdddrrmerri (should be karriyawoyhdjarrkbidyikarrmerrimeninj).

10 Discussion

Reusing transformation patterns

In most cases, morphological transformations may be properly carried out by matching a lemma against a pattern containing fixed characters and variable (wildcard) character sequences. A morphological inflection may be described in terms of inserting, deleting, and/or replacing fixed characters. When a test sample follows such a regular transformation pattern observed in the training set, it usually becomes significantly easier to track down a correct target form. Table 6 demonstrates the difference in performance with respect to whether the required transformation pattern was immediately witnessed in any training sample. To enumerate the possible patterns, we used the technique described by Scherbakov (2020). It is worth emphasizing that the presence of a matching pattern by itself does not guarantee that achieving high accuracy would be straightforward, because in a vast majority of cases there are multiple alternative patterns. Choosing the correct one is a challenging task.

[Table 6: Accuracy comparison, per language and system (BME, GUClasp, TRM, TRM+AUG, CHR-TRM, CHR-TRM+AUG), between fragment substitutions that could be observed in the training set and more complex transformations. Groups having <20 unique lemmas are marked with asterisks.]
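To make the idea concrete, the sketch below shows one minimal way such a pattern can be extracted from a training (lemma, form) pair and replayed on a new lemma. It is an illustration only: it reduces a pattern to a single suffix substitution after the longest common prefix, which is far cruder than the pattern enumeration technique of Scherbakov (2020) used in our analysis, and all function names and example words are ours.

```python
# A deliberately simple notion of a "transformation pattern": the part of
# the lemma after the longest common prefix with the target form is the
# fixed material to replace; everything before it acts as the wildcard.

def extract_pattern(lemma, form):
    """Return (lemma_suffix, form_suffix), e.g. ("talk", "talked") -> ("", "ed")."""
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    return lemma[i:], form[i:]

def apply_pattern(lemma, pattern):
    """Replay a suffix substitution on a new lemma; None if it does not match."""
    old, new = pattern
    if old and not lemma.endswith(old):
        return None
    stem = lemma[: len(lemma) - len(old)] if old else lemma
    return stem + new

# Toy usage: a pattern observed for one training pair is reused for a
# previously unseen lemma that carries the same morphosyntactic tags.
pattern = extract_pattern("talk", "talked")   # ("", "ed")
print(apply_pattern("walk", pattern))         # walked
```

In practice several competing patterns usually match the same lemma, which is exactly the ambiguity discussed above.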
Multi-word forms

Inflecting multi-word expressions is one of the most challenging tasks. However, in the shared task dataset, almost all multi-word lemmas found in the test set are also present in the training set, which made the task easier to solve.

The systems were quite successful at predicting the multi-word forms if the required transformation was directly observed in a training example. Otherwise, the prediction accuracy significantly degraded. Table 7 shows the multi-word lemma transformation accuracies. From these results we further notice that while all systems' performance degrades on the previously unseen multi-word inflection patterns, this degradation is considerably smaller for the transformer-based baselines (except for Turkish), implying that these models can better generalise to previously unseen patterns.
L      BME      GUClasp   TRM      TRM+AUG   CHR-TRM   CHR-TRM+AUG
aym    100.00    99.90    100.00   100.00    100.00    100.00
        -         -         -        -         -         -
bul     95.00    93.21    100.00   100.00    100.00    100.00
        81.81    63.63    100.00   100.00    100.00    100.00
ces     75.00    55.00     80.00    80.00     80.00     80.00
        71.42    57.14    100.00   100.00    100.00    100.00
ckb    100.00   100.00    100.00   100.00    100.00    100.00
        -         -         -        -         -         -
deu     88.57    74.28     97.14    97.14    100.00    100.00
        71.87     0        93.75    93.75     87.50     87.50
kmr     98.76    98.71     95.34    95.34     95.51     95.51
        -         -         -        -         -         -
nld     50.00    50.00     50.00    50.00     50.00     50.00
        -         -         -        -         -         -
pol     99.85    99.60     99.97    99.97     99.91     99.91
        98.28    92.12     98.28    98.28     98.28     98.28
rus     90.93    87.91     96.45    96.45     96.05     96.05
        56.55    27.04     72.13    72.13     72.13     72.13
tur     99.77    98.63     95.67    95.67     95.58     95.58
        90.90    59.09     77.27    77.27     86.36     86.36

Table 7: Accuracy for MWE lemmata in each language on the test data. For each language, the upper row corresponds to fragment substitutions that could be observed in the training set, while the lower row corresponds to more complex transformations.
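The bookkeeping behind this kind of seen-vs.-unseen comparison (Tables 6 and 7) can be sketched as follows. This is an illustrative reconstruction only: it reuses the simplistic suffix-substitution notion of a pattern from the previous listing rather than the richer pattern inventory used in the actual analysis, and the data in the usage example is invented.

```python
# A test item counts as "seen" if the suffix substitution it requires also
# occurs for some training pair; otherwise it is a "more complex" case.
from collections import Counter

def suffix_pattern(lemma, form):
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    return lemma[i:], form[i:]

def accuracy_by_pattern_coverage(train_pairs, test_pairs, predictions):
    """train_pairs/test_pairs: lists of (lemma, gold form); predictions:
    predicted forms aligned with test_pairs. Returns accuracy for items
    whose required pattern was observed in training vs. the rest."""
    seen_patterns = {suffix_pattern(l, f) for l, f in train_pairs}
    correct, total = Counter(), Counter()
    for (lemma, gold), predicted in zip(test_pairs, predictions):
        group = "seen" if suffix_pattern(lemma, gold) in seen_patterns else "unseen"
        total[group] += 1
        correct[group] += int(predicted == gold)
    return {group: correct[group] / total[group] for group in total}

# Invented toy data, for illustration only.
train = [("talk", "talked"), ("play", "played")]
test = [("walk", "walked"), ("go", "went")]
preds = ["walked", "goed"]
print(accuracy_by_pattern_coverage(train, test, preds))  # {'seen': 1.0, 'unseen': 0.0}
```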
Allomorphy

The application of wrong (albeit valid) inflectional transformations by the models (allomorphy) is present in most analysed languages. These allomorphy errors can be further divided into two groups: (1) when an inflectional tag itself allows for multiple inflection patterns which must be distinguished by the declension/inflection class to which the word belongs, and (2) when the model applies an inflectional rule that is simply invalid for that specific tag. These errors are hard to analyse, however. The first is potentially unavoidable without extra information, as declension/inflection classes are not always fully predictable from word forms alone (Williams et al., 2020). The second type of allomorphic error, on the other hand, is potentially fixable. In our analysis, however, we did not find any concrete patterns as to when the models make this second (somewhat arbitrary) type of mistake.

Spelling Errors

Spelling errors are pervasive in most analysed languages, even high-resource ones. These are challenging, as they require a deep understanding of the modelled language in order to be avoided. Spelling errors are especially common in languages with vowel harmony (e.g. Tungusic), as the models have some difficulty in correctly modelling it. Another source of spelling errors is the diacritics. In Portuguese, for instance, most of the errors produced by the BME system arise due to missing acute accents, which mark stress; their use is determined by specific (and somewhat idiosyncratic) orthographic rules.

11 Conclusion

In the development of this shared task we added new data for 32 languages (13 language families) to UniMorph, most of which are under-resourced. Further, we evaluated the performance of morphological reinflection systems on a typologically diverse set of languages and performed a fine-grained analysis of their error patterns in most of these languages. The main challenge for the morphological reinflection systems is still (as expected) handling low-resource scenarios (where there is little training data). We further identified a large gap in these systems' performance between the test lemmas present in the training set and the previously unseen lemmas; the latter are naturally hard test cases, but the work on reinflection models could focus on improving these results going forward, following, for instance, the work of Liu and Hulden (2021). Further, allomorphy, honorificity and multiword lemmas also pose challenges for the current models. We hope that the analysis presented here, together with the new expansion of the UniMorph resources, will help drive further improvements in morphological reinflection. Following Malouf et al. (2020), we would like to emphasize that linguistic analyses using UniMorph should be performed with some degree of caution, since for many languages it might not provide an exhaustive list of paradigms and variants.

Acknowledgements

We would like to thank Dr George Kiraz and Beth Mardutho: The Syriac Institute for their help with Classical Syriac data.

References

Willem F. H. Adelaar and Pieter C. Muysken. 2004. The Languages of the Andes. Cambridge Language Surveys. Cambridge University Press.

Grant Aiton. 2016a. The documentation of Eibela: An archive of Eibela language materials from the Bosavi region (Western Province, Papua New Guinea).

Grant William Aiton. 2016b. A grammar of Eibela: a language of the Western Province, Papua New Guinea. Ph.D. thesis, James Cook University.
Sarah Alkuhlani and Nizar Habash. 2011. A corpus pages 1–30, Vancouver. Association for Computa-
for modeling morpho-syntactic agreement in Arabic: tional Linguistics.
Gender, number and rationality. In Proceedings of
the 49th Annual Meeting of the Association for Com- Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
putational Linguistics: Human Language Technolo- David Yarowsky, Jason Eisner, and Mans Hulden.
gies, pages 357–362, Portland, Oregon, USA. Asso- 2016. The SIGMORPHON 2016 shared Task—
ciation for Computational Linguistics. Morphological reinflection. In Proceedings of the
14th SIGMORPHON Workshop on Computational
Antonios Anastasopoulos and Graham Neubig. 2019. Research in Phonetics, Phonology, and Morphol-
Pushing the limits of low-resource morphological in- ogy, pages 10–22, Berlin, Germany. Association for
flection. In Proceedings of the 2019 Conference on Computational Linguistics.
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu- William Croft. 2002. Typology and universals. Cam-
ral Language Processing (EMNLP-IJCNLP), pages bridge University Press.
984–996, Hong Kong, China. Association for Com-
putational Linguistics. Michael Daniel. 2011. Linguistic typology and the
study of language. In The Oxford handbook of lin-
Gregory David Anderson and K David Harrison. 1999. guistic typology. Oxford University Press.
Tyvan (Languages of the World/Materials 257).
München: LINCOM Europa. R. M. W. Dixon and Alexandra Y. Aikhenvald, editors.
Phyllis E. Wms. Bardeau. 2007. The Seneca Verb: La- 1999. The Amazonian languages (Cambridge Lan-
beling the Ancient Voice. Seneca Nation Education guage Surveys). Cambridge University Press.
Department, Cattaraugus Territory.
M Duff-Trip. 1998. Diccionario Yanesha’ (Amuesha)-
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Castellano. Lima: Instituto Lingüístico de Verano.
Noam Shazeer. 2015. Scheduled sampling for se-
quence prediction with recurrent neural networks. David M. Eberhard, Gary F. Simons, and Charles
In Advances in Neural Information Processing Sys- D. Fennig (eds.). 2021. Ethnologue: Languages of
tems 28: Annual Conference on Neural Informa- the world. Twenty-fourth edition. Online version:
tion Processing Systems 2015, December 7-12, 2015, http://www.ethnologue.com.
Montreal, Quebec, Canada, pages 1171–1179.
Micha Elsner, Andrea D Sims, Alexander Erdmann,
Noam Chomsky. 1995. Language and nature. Mind, Antonio Hernandez, Evan Jaffe, Lifeng Jin, Martha
104(413):1–61. Booker Johnson, Shuan Karim, David L King, Lu-
ana Lamberti Nunes, et al. 2019. Modeling morpho-
Matt Coler. 2010. A grammatical description of Muy- logical learning, typology, and change: What can the
laq’Aymara. Ph.D. thesis, Vrije Universiteit Amster- neural sequence-to-sequence framework contribute?
dam. Journal of Language Modelling, 7.
Matt Coler. 2014. A grammar of Muylaq’Aymara: Ay- Nicholas Evans. 2003. Bininj Gun-wok: A Pan-
mara as spoken in Southern Peru. Brill. dialectal Grammar of Mayali, Kunwinjku and Kune.
Bernard Comrie. 1989. Language universals and lin- Pacific Linguistics. Australian National University.
guistic typology: Syntax and morphology. Univer-
sity of Chicago press. Nicholas Evans and Stephen C Levinson. 2009. The
myth of language universals: Language diversity
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, and its importance for cognitive science. Behavioral
Géraldine Walther, Ekaterina Vylomova, Arya D. and brain sciences, 32(5):429–448.
McCarthy, Katharina Kann, Sabrina J. Mielke, Gar-
rett Nicolai, Miikka Silfverberg, David Yarowsky, Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and
Jason Eisner, and Mans Hulden. 2018. The CoNLL– Chris Dyer. 2016. Morphological inflection genera-
SIGMORPHON 2018 shared task: Universal mor- tion using character sequence to sequence learning.
phological reinflection. In Proceedings of the In Proceedings of the 2016 Conference of the North
CoNLL–SIGMORPHON 2018 Shared Task: Univer- American Chapter of the Association for Computa-
sal Morphological Reinflection, pages 1–27, Brus- tional Linguistics: Human Language Technologies,
sels. Association for Computational Linguistics. pages 634–643, San Diego, California. Association
for Computational Linguistics.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
Géraldine Walther, Ekaterina Vylomova, Patrick I.K. Fattah. 2000. Les dialectes kurdes méridionaux:
Xia, Manaal Faruqui, Sandra Kübler, David étude linguistique et dialectologique. Acta Iranica
Yarowsky, Jason Eisner, and Mans Hulden. 2017. : Encyclopédie permanente des études iraniennes.
CoNLL-SIGMORPHON 2017 shared task: Univer- Peeters.
sal morphological reinflection in 52 languages. In
Proceedings of the CoNLL SIGMORPHON 2017 Charles F Ferguson. 1959. Diglossia. Word,
Shared Task: Universal Morphological Reinflection, 15(2):325–340.

Mikel L Forcada, Mireia Ginestí-Rosell, Jacob Nord- Martin Haspelmath. 2020. Human linguisticality and
falk, Jim O’Regan, Sergio Ortiz-Rojas, Juan An- the building blocks of languages. Frontiers in psy-
tonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema chology, 10:3056.
Ramírez-Sánchez, and Francis M Tyers. 2011. Aper-
tium: a free/open-source platform for rule-based ma- Johannes Heinecke. 2019. ConlluEditor: a fully graph-
chine translation. Machine translation, 25(2):127– ical editor for universal dependencies treebank files.
144. In Proceedings of the Third Workshop on Univer-
sal Dependencies (UDW, SyntaxFest 2019), pages
Michael Gasser. 2011. HornMorpho: a system 87–93, Paris, France. Association for Computational
for morphological processing of Amharic, Oromo, Linguistics.
and Tigrinya. In Proceedings of the Conference
on Human Language Technology for Development, Sardana Ivanova, Anisia Katinskaia, and Roman Yan-
Alexandria, Egypt. garber. 2019. Tools for supporting language learn-
ing for Sakha. In Proceedings of the 22nd Nordic
Yustinus Ghanggo Ate. 2020. Kodi (Indonesia) - Lan- Conference on Computational Linguistics, pages
guage Snapshot. Language Documentation and De- 155–163, Turku, Finland. Linköping University
scription 19, pages 171–180. Electronic Press.

Yustinus Ghanggo Ate. 2021. Documentation of Kodi. Sardana Ivanova, Francis M. Tyers, and Jonathan N.
New Haven: Endangered Language Fund. Washington. to appear in 2022. A free/open-source
morphological analyser and generator for Sakha. In
Yustinus Ghanggo Ate. to appear in 2021. Reduplica- preparation.
tion in Kodi: A paradigm function account. Word
Structure 14(3). Danesh Jain and George Cardona. 2007. The Indo-
Aryan Languages. Routledge.
Kyle Gorman, Arya D. McCarthy, Ryan Cotterell,
Thomas Jügel. 2009. Ergative Remnants in Sorani Kur-
Ekaterina Vylomova, Miikka Silfverberg, and Mag-
dish? Orientalia Suecana, 58:142–158.
dalena Markowska. 2019. Weird inflects but OK:
Making sense of morphological generation errors. Olga Kazakevich and Elena Klyachko. 2013. Creat-
In Proceedings of the 23rd Conference on Computa- ing a multimedia annotated text corpus: a research
tional Natural Language Learning (CoNLL), pages task (Sozdaniye multimediynogo annotirovannogo
140–151, Hong Kong, China. Association for Com- korpusa tekstov kak issledovatelskaya protsedura).
putational Linguistics. In Proceedings of International Conference Compu-
tational linguistics 2013, pages 292–300.
Joseph Greenberg. 1963. Some universals of grammar
with particular reference to the order of meaningful Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama
elements. In Joseph Greenberg, editor, Universals Obeid, Dana Abdulrahim, and Meera Al Kaabi.
of Language, pages 73–113. MIT Press. 2018. A morphologically annotated corpus of Emi-
rati Arabic. In Proceedings of the Eleventh Interna-
George A. Grierson. 1908. Indo-Aryan Family: Cen- tional Conference on Language Resources and Eval-
tral Group: Specimens of the Rājasthānı̄ and Gu- uation (LREC 2018), Miyazaki, Japan. European
jarātı̄, volume IX(II) of Linguistic Survey of India. Language Resources Association (ELRA).
Office of the Superintendent of Government Print-
ing, Calcutta. Salam Khalifa, Sara Hassan, and Nizar Habash. 2017.
A morphological analyzer for Gulf Arabic verbs. In
George Abraham Grierson. 1903. Linguistic Survey of Proceedings of the Third Arabic Natural Language
India, Vol-III. Calcutta: Office of the Superinten- Processing Workshop, pages 35–45, Valencia, Spain.
dent, Government of PRI. Association for Computational Linguistics.
George Abraham Grierson and Sten Konow. 1903. Lin- Tanmai Khanna, Jonathan N. Washington, Fran-
guistic Survey of India. Calcutta Supt., Govt. Print- cis M. Tyers, Sevilay Bayatlı, Daniel G. Swanson,
ing. Tommi A. Pirinen, Irene Tang, and Hèctor Alòs
i Font. to appear in 2021. Recent advances in Aper-
Nizar Habash, Ramy Eskander, and Abdelati Hawwari. tium, a free/open-source rule-based machine transla-
2012. A morphological analyzer for Egyptian Ara- tion platform for low-resource languages. Machine
bic. In Proceedings of the Twelfth Meeting of the Translation.
Special Interest Group on Computational Morphol-
ogy and Phonology, pages 1–9, Montréal, Canada. Lee Kindberg. 1980. Diccionario asháninca. Lima:
Association for Computational Linguistics. Instituto Lingüístico de Verano.

K. David Harrison. 2000. Topics in the Phonology and Christo Kirov, Ryan Cotterell, John Sylak-Glassman,
Morphology of Tuvan. Ph.D. thesis, Yale University. Géraldine Walther, Ekaterina Vylomova, Patrick
Xia, Manaal Faruqui, Sabrina J. Mielke, Arya Mc-
Martin Haspelmath. 2010. Comparative concepts Carthy, Sandra Kübler, David Yarowsky, Jason Eis-
and descriptive categories in crosslinguistic studies. ner, and Mans Hulden. 2018. UniMorph 2.0: Uni-
Language, 86(3):663–687. versal Morphology. In Proceedings of the Eleventh

International Conference on Language Resources Robert Malouf, Farrell Ackerman, and Arturs Se-
and Evaluation (LREC 2018), Miyazaki, Japan. Eu- menuks. 2020. Lexical databases for computational
ropean Language Resources Association (ELRA). analyses: A linguistic perspective. In Proceedings
of the Society for Computation in Linguistics 2020,
Ritesh Kumar, Bornini Lahiri, and Deepak Alok. 2014. pages 446–456, New York, New York. Association
Developing LRs for Non-scheduled Indian Lan- for Computational Linguistics.
guages: A Case of Magahi. In Human Language
Technology Challenges for Computer Science and Pedro Mayor Aparicio and Richard E Bodmer. 2009.
Linguistics, Lecture Notes in Computer Science, Pueblos indígenas de la Amazonía peruana. Iquitos:
pages 491–501. Springer International Publishing, Centro de Estudios Teológicos de la Amazonía.
Switzerland. Original-date: 2014.
Arya D. McCarthy, Christo Kirov, Matteo Grella,
Ritesh Kumar, Bornini Lahiri, Deepak Alok Atul Kr. Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekate-
Ojha, Mayank Jain, Abdul Basit, and Yogesh Dawar. rina Vylomova, Sabrina J. Mielke, Garrett Nico-
2018. Automatic identification of closely-related In- lai, Miikka Silfverberg, Timofey Arkhangelskiy, Na-
dian languages: Resources and experiments. In Pro- taly Krizhanovsky, Andrew Krizhanovsky, Elena
ceedings of the 4th Workshop on Indian Language Klyachko, Alexey Sorokin, John Mansfield, Valts
Data Resource and Evaluation (WILDRE-4), Paris, Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan
France. European Language Resources Association Cotterell, Mans Hulden, and David Yarowsky. 2020.
(ELRA). UniMorph 3.0: Universal Morphology. In Proceed-
ings of the 12th Language Resources and Evaluation
Bornini Lahiri. 2021. The Case System of Eastern Indo- Conference, pages 3922–3931, Marseille, France.
Aryan Languages: A Typological Overview. Rout- European Language Resources Association.
ledge.
Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell,
William Lane and Steven Bird. 2019. Towards a robust Mans Hulden, and David Yarowsky. 2018. Marrying
morphological analyzer for Kunwinjku. In Proceed- Universal Dependencies and Universal Morphology.
ings of the The 17th Annual Workshop of the Aus- In Proceedings of the Second Workshop on Univer-
tralasian Language Technology Association, pages sal Dependencies (UDW 2018), pages 91–101, Brus-
1–9, Sydney, Australia. Australasian Language Tech- sels, Belgium. Association for Computational Lin-
nology Association. guistics.
Septina Dian Larasati, Vladislav Kubon, and Daniel
Zeman. 2011. Indonesian morphology tool (Mor- Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu,
phInd): Towards an Indonesian corpus. In Sys- Chaitanya Malaviya, Lawrence Wolf-Sonkin, Gar-
tems and Frameworks for Computational Morphol- rett Nicolai, Christo Kirov, Miikka Silfverberg, Sab-
ogy - Second International Workshop, SFCM 2011, rina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and
Zurich, Switzerland, August 26, 2011. Proceedings, Mans Hulden. 2019. The SIGMORPHON 2019
volume 100 of Communications in Computer and In- shared task: Morphological analysis in context and
formation Science, pages 119–129. Springer. cross-lingual transfer for inflection. In Proceedings
of the 16th Workshop on Computational Research in
Ling Liu and Mans Hulden. 2021. Can a transformer Phonetics, Phonology, and Morphology, pages 229–
pass the wug test? Tuning copying bias in neu- 244, Florence, Italy. Association for Computational
ral morphological inflection models. arXiv preprint Linguistics.
arXiv:2104.06483.
Elena Mihas. 2017. The Kampa subgroup of the
Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Arawak language family. In Alexandra Y. Aikhen-
Wigdan Mekki. 2004. The Penn Arabic Treebank: vald and R. M. W. Dixon, editors, The Cambridge
Building a Large-Scale Annotated Arabic Corpus. Handbook of Linguistic Typology, Cambridge Hand-
In Proceedings of the International Conference on books in Language and Linguistics, page 782–814.
Arabic Language Resources and Tools, pages 102– Cambridge University Press.
109, Cairo, Egypt.
Saliha Muradoglu, Nicholas Evans, and Ekaterina Vy-
Mohamed Maamouri, Ann Bies, Seth Kulick, Dalila lomova. 2020. Modelling verbal morphology in
Tabessi, and Sondos Krouna. 2012. Egyptian Arabic Nen. In Proceedings of the The 18th Annual Work-
Treebank DF Parts 1-8 V2.0 - LDC catalog num- shop of the Australasian Language Technology As-
bers LDC2012E93, LDC2012E98, LDC2012E89, sociation, pages 43–53, Virtual Workshop. Aus-
LDC2012E99, LDC2012E107, LDC2012E125, tralasian Language Technology Association.
LDC2013E12, LDC2013E21.
Sylvain Neuvel and Sean A. Fulop. 2002. Unsuper-
Mohamed Maamouri, Dave Graff, Basma Bouziri, Son- vised learning of morphology without morphemes.
dos Krouna, Ann Bies, and Seth Kulick. 2010. LDC In Proceedings of the ACL-02 Workshop on Mor-
standard Arabic morphological analyzer (SAMA) phological and Phonological Learning, pages 31–
version 3.1. 40. Association for Computational Linguistics.

I. P. Novak, N. B. Krizhanovskaya, T. P. Boiko, and Andrei Shcherbakov, Saliha Muradoglu, and Ekate-
N. A. Pellinen. 2020. Development of rules of gen- rina Vylomova. 2020. Exploring looping effects
eration of nominal word forms for new-written vari- in RNN-based architectures. In Proceedings of the
ants of the Karelian language. Vestnik ugrovedenia The 18th Annual Workshop of the Australasian Lan-
= Bulletin of Ugric Studies, 10(4):679–691. guage Technology Association, pages 115–120, Vir-
tual Workshop. Australasian Language Technology
Irina Novak. 2019. Karelian language and its dialects. Association.
In I. Vinokurova, editor, Peoples of Karelia: His-
torical and Ethnographic Essays, pages 56–65. Pe- John Sylak-Glassman, Christo Kirov, Matt Post, Roger
riodika. Que, and David Yarowsky. 2015a. A universal
feature schema for rich morphological annotation
Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima and fine-grained cross-lingual part-of-speech tag-
Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl ging. In International Workshop on Systems and
Eryani, Alexander Erdmann, and Nizar Habash. Frameworks for Computational Morphology, pages
2020. CAMeL tools: An open source python toolkit 72–93. Springer.
for Arabic natural language processing. In Proceed-
ings of the 12th Language Resources and Evaluation John Sylak-Glassman, Christo Kirov, David Yarowsky,
Conference, pages 7022–7032, Marseille, France. and Roger Que. 2015b. A language-independent
European Language Resources Association. feature schema for inflectional morphology. In Pro-
ceedings of the 53rd Annual Meeting of the Associ-
Sofia Oskolskaya, Ezequiel Koile, and Martine ation for Computational Linguistics and the 7th In-
Robbeets. 2021. A Bayesian approach to the clas- ternational Joint Conference on Natural Language
sification of Tungusic languages. Diachronica. Processing (Volume 2: Short Papers), pages 674–
680, Beijing, China. Association for Computational
Prateek Pankaj. 2020. Reconciling Surdas and Keshav- Linguistics.
das: A study of commonalities and differences in
Brajbhasha literature. IOSR Journal of Humanities Dima Taji, Nizar Habash, and Daniel Zeman. 2017.
and Social Sciences, 25. Universal Dependencies for Arabic. In Proceedings
of the Third Arabic Natural Language Processing
Femphy Pisceldo, Rahmad Mahendra, Ruli Manurung, Workshop, pages 166–176, Valencia, Spain. Associ-
and I Wayan Arka. 2008. A two-level morpholog- ation for Computational Linguistics.
ical analyser for the Indonesian language. In Pro-
ceedings of the Australasian Language Technology Dima Taji, Salam Khalifa, Ossama Obeid, Fadhl
Association Workshop 2008, pages 142–150, Hobart, Eryani, and Nizar Habash. 2018. An Arabic mor-
Australia. phological analyzer and generator with copious fea-
tures. In Proceedings of the Fifteenth Workshop
Emmanouil Antonios Platanios, Otilia Stretcu, Graham on Computational Research in Phonetics, Phonol-
Neubig, Barnabás Póczos, and Tom M. Mitchell. ogy, and Morphology, pages 140–150, Brussels, Bel-
2019. Competence-based curriculum learning for gium. Association for Computational Linguistics.
neural machine translation. In Proceedings of the
2019 Conference of the North American Chapter Talat Tekin. 1990. A new classification of the Turkic
of the Association for Computational Linguistics: languages. Türk dilleri araştırmaları, 1:5–18.
Human Language Technologies, NAACL-HLT 2019, Francis Tyers, Aziyana Bayyr-ool, Aelita Salchak, and
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 Jonathan Washington. 2016. A finite-state mor-
(Long and Short Papers), pages 1162–1172. Associ- phological analyser for Tuvan. In Proceedings of
ation for Computational Linguistics. the Tenth International Conference on Language Re-
sources and Evaluation (LREC’16), pages 2562–
Adam Przepiórkowski and Marcin Woliński. 2003. A
2567, Portorož, Slovenia. European Language Re-
flexemic tagset for Polish. In Proceedings of the
sources Association (ELRA).
2003 EACL Workshop on Morphological Processing
of Slavic Languages, pages 33–40, Budapest, Hun- Francis Tyers and Karina Mishchenkova. 2020. Depen-
gary. Association for Computational Linguistics. dency annotation of noun incorporation in polysyn-
thetic languages. In Proceedings of the Fourth
Andreas Scherbakov. 2020. The UniMelb submission Workshop on Universal Dependencies (UDW 2020),
to the SIGMORPHON 2020 shared task 0: Typo- pages 195–204, Barcelona, Spain (Online). Associa-
logically diverse morphological inflection. In Pro- tion for Computational Linguistics.
ceedings of the 17th SIGMORPHON Workshop on
Computational Research in Phonetics, Phonology, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
and Morphology, pages 177–183, Online. Associa- Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
tion for Computational Linguistics. Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
Claus Schönig. 1999. The internal division of modern cessing Systems, volume 30.
Turkic and its historical implications. Acta Orien-
talia Academiae Scientiarum Hungaricae, pages 63– A. P. Volodin. 1976. The Itelmen language.
95. Prosveshchenie, Leningrad.

Ekaterina Vylomova, Jennifer White, Eliza- European Chapter of the Association for Computa-
beth Salesky, Sabrina J. Mielke, Shijie Wu, tional Linguistics: Main Volume, pages 1901–1907,
Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Online. Association for Computational Linguistics.
Zmigrod, Josef Valvoda, Svetlana Toldova, Francis
Tyers, Elena Klyachko, Ilya Yegorov, Natalia Nina Zaytseva, Andrew Krizhanovsky, Natalia
Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Krizhanovsky, Natalia Pellinen, and Aleksndra Ro-
Andrew Krizhanovsky, Tiago Pimentel, Lucas dionova. 2017. Open corpus of Veps and Karelian
Torroba Hennigen, Christo Kirov, Garrett Nicolai, languages (VepKar): preliminary data collection
Adina Williams, Antonios Anastasopoulos, Hilaria and dictionaries. In Corpus Linguistics-2017, pages
Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka 172–177.
Silfverberg, and Mans Hulden. 2020. SIGMOR-
PHON 2020 shared task 0: Typologically diverse He Zhou, Juyeon Chung, Sandra Kübler, and Francis
morphological inflection. In Proceedings of the Tyers. 2020. Universal Dependency treebank for
17th SIGMORPHON Workshop on Computational Xibe. In Proceedings of the Fourth Workshop on
Research in Phonetics, Phonology, and Morphology, Universal Dependencies (UDW 2020), pages 205–
pages 1–39, Online. Association for Computational 215, Barcelona, Spain (Online). Association for
Linguistics. Computational Linguistics.

Jonathan North Washington, Aziyana Bayyr-ool, Esaú Zumaeta Rojas and Gerardo Anton Zerdin. 2018.
Aelita Salchak, and Francis M Tyers. 2016. De- Guía teórica del idioma asháninka. Lima: Universi-
velopment of a finite-state model for morphological dad Católica Sedes Sapientiae.
processing of Tuvan. Rodnoy Yazyk, 1:156–187.
Н. А. Баскаков. 1969. Введение в изучение
Jonathan North Washington, Ilnar Salimzianov, Fran- тюркских языков [N. A. Baskakov. An intro-
cis M. Tyers, Memduh Gökırmak, Sardana Ivanova, duction to Turkic language studies]. Москва:
and Oğuzhan Kuyrukçu. to appear in 2021. Высшая школа.
Free/open-source technologies for Turkic languages
developed in the Apertium project. In Proceedings Ф. Г. Исхаков and А. А. Пальмбах. 1961.
of the International Conference on Turkic Language Грамматика тувинского языка: Фонетика и
Processing (TURKLANG 2019). морфология [F. G. Iskhakov and A. A. Pal’mbakh.
A grammar of Tuvan: Phonetics and morphology].
Jonathan North Washington and Francis Morton Tyers. Москва: Наука.
2019. Delineating Turkic non-finite verb forms by
syntactic function. In Proceedings of the Workshop Е. И. Убрятова, Е. И. Коркина, Л. Н. Харитонов,
on Turkic and Languages in Contact with Turkic, and Н. Е. Петров, editors. 1982. Грамматика
volume 4, pages 115–129. современного якутского литературного
языка: Фонетика и морфология [E. I. Ubrya-
Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. tova et al. Grammar of the modern Yakut literary
McCarthy, Eleanor Chodroff, and Ryan Cotterell. language: Phonetics and morphology]. Москва:
2020. Predicting declension class from form and Наука.
meaning. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 6682–6695, Online. Association for Computa-
tional Linguistics.
Mary Ruth Wise. 2002. Applicative affixes in Pe-
ruvian Amazonian languages. Current Studies on
South American Languages [Indigenous Languages
of Latin America, 3], pages 329–344.
Marcin Woliński and Witold Kieraś. 2016. The on-
line version of grammatical dictionary of Polish. In
Proceedings of the Tenth International Conference
on Language Resources and Evaluation (LREC’16),
pages 2589–2594, Portorož, Slovenia. European
Language Resources Association (ELRA).
Marcin Woliński, Zygmunt Saloni, Robert Wołosz,
Włodzimierz Gruszczyński, Danuta Skowrońska,
and Zbigniew Bronk. 2020. Słownik gramatyczny
j˛ezyka polskiego, 4th edition. Warsaw. http://
sgjp.pl.

Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021.


Applying the transformer to character-level transduc-
tion. In Proceedings of the 16th Conference of the

A Data conversion into UniMorph

Apertium tag UniMorph tag Apertium tag UniMorph tag Apertium tag UniMorph tag
<p1> 1 <imp> IMP <px1sg> PSS1S
<p2> 2 <ins> INS <px2pl> PSS2P
<p3> 3 <iter> ITER <px2sg> PSS2S
<abl> ABL <loc> LOC <px3pl> PSS3P
<acc> ACC <n> N <px3sg> PSS3S
<all> ALL <neg> NEG <px3sp> PSS3S/PSS3P
<com> COM <nom> NOM <pii> PST;IPFV
<comp> COMPV <aor> NPST <ifi> PST;LGSPEC1
<dat> DAT <nec> OBLIG <past> PST;LGSPEC2
<ded> DED <pl> PL <sg> SG
<du> DU <perf> PRF <v> V
<fut> FUT <resu> PRF;LGSPEC3 <gna_cond> V.CVB;COND
<gen> GEN <par> PRT <prc_cond> V.CVB;COND
<hab> HAB <px1pl> PSS1P

Table 8: Apertium tag mapping to the UniMorph schema for Sakha and Tuvan. For the definitions of the Apertium
tags, see Washington et al. (2016). This mapping alone is not sufficient to reconstruct the UniMorph annotation,
since some conditional rules are applied on top of this conversion (see §3.8.1)
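As a rough illustration of how a mapping such as Table 8 is applied, the sketch below converts an Apertium-style analysis string into a UniMorph tag bundle by plain dictionary lookup. Only a small excerpt of the mapping is shown, the example analysis string is invented, and the conditional rules mentioned in the caption are deliberately omitted.

```python
# Illustrative application of a tag mapping in the spirit of Table 8.
# The mapping excerpt is partial and the input string is a made-up example;
# the real conversion additionally applies conditional rules (see Section 3.8.1).
import re

APERTIUM_TO_UNIMORPH = {
    "n": "N", "v": "V", "sg": "SG", "pl": "PL",
    "nom": "NOM", "acc": "ACC", "dat": "DAT", "loc": "LOC",
    "p1": "1", "p2": "2", "p3": "3", "fut": "FUT", "px1sg": "PSS1S",
}

def convert_analysis(analysis):
    """Map an Apertium-style analysis such as 'xxx<n><pl><dat>' to a
    UniMorph-style tag string such as 'N;PL;DAT'."""
    apertium_tags = re.findall(r"<([^>]+)>", analysis)
    unimorph_tags = [APERTIUM_TO_UNIMORPH[t] for t in apertium_tags
                     if t in APERTIUM_TO_UNIMORPH]
    return ";".join(unimorph_tags)

print(convert_analysis("xxx<n><pl><dat>"))  # N;PL;DAT
```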

Level-1 Level-2 Level-3


MorphInd UniMorph MorphInd UniMorph MorphInd UniMorph
N N P PL F FEM
S SG M MASC
D NEUT
P PROPN P PL 1 1
S SG 2 2
3 3
V V P - A ACT
S - P PASS
C NUM C - - -
O - - -
D - - -
A ADJ P PL P -
S SG S -

Table 9: A simplified mapping from MorphInd tags to the UniMorph schema for Indonesian data. We follow
MorphInd’s three-level annotations for the mapping.

PL tag UniMorph tag PL tag UniMorph tag PL tag UniMorph tag


pri 1 impt IMP perf PFV
sec 2 imps IMPRS pl PL
ter 3 inst INS fin PRS/FUT
acc ACC imperf IPFV praet PRT
adj ADJ m2 MASC;ANIM sg SG
adv ADV m1 MASC;HUM sup SPRL
com COMPR m3 MASC;INAN pcon V.CVB;PRS
dat DAT subst N pant V.CVB;PST
pos EQT neg NEG ger V.MSDR
loc ESS n NEUT voc VOC
f FEM inf NFIN pact V.PTCP;ACT
gen GEN nom NOM ppas V.PTCP;PASS

Table 10: Simplified mapping from the original flexemic tagset of Polish used in Polish morphological analysers
and corpora annotations (Przepiórkowski and Woliński, 2003) to the UniMorph schema. The mapping contains
most of the POS and feature labels and does not allow to reconstruct the full conversion of the original data, as
some mappings are conditional.

Xibe Universal Dependencies feature / UniMorph Additional rules
word transliteration
ADJ ADJ
ADP ADP
ADV ADV
AUX AUX
CCONJ CONJ
DET DET
INTJ INTJ
NOUN N
NUM NUM
PART PART
PRON PRO
PROPN PROPN
PUNCT _ excluding punctuation marks
SCONJ CONJ
SYM _ excluding symbols
VERB depends on other properties
X
_
Abbr=Yes _
Aspect=Imp IPFV
Aspect=Perf PFV seems to be closer to PFV than to PRF
Aspect=Prog PROG
Case=Abl ABL not for adpositions
Case=Acc ACC not for adpositions
Case=Cmp COMPV not for adpositions
Case=Com COM not for adpositions
Case=Dat DAT not for adpositions
Case=Gen GEN not for adpositions
Case=Ins INSTR not for adpositions
Case=Lat ALL not for adpositions
Case=Loc IN not for adpositions
Clusivity=Ex EXCL
Clusivity=In INCL
Degree=Cmp CMPR
Degree=Pos _
Foreign=Yes _
Mood=Cnd CMD=COND for finite forms only
Mood=Imp IMP
Mood=Ind IND
Mood=Sub SBJV
NumType=Card _
NumType=Mult POS=ADV
NumType=Ord POS=ADJ
NumType=Sets POS=ADJ
Number=Plur PL
Number=Sing SG
Person=1 1
Person=2 2
Person=3 3
Polarity=Neg NEG not for the negative auxiliary
Polite=Elev _
Poss=Yes CMD=PSS
PronType=Dem CMD=DEIXIS
PronType=Ind _
PronType=Int _
PronType=Prs _
PronType=Tot _
Reflex=Yes _
Tense=Fut FUT
Tense=Past PST
Tense=Pres PRS
Typo=Yes _ not including typos into the resulting table

Table 11: Simplified mapping for the Xibe Universal Dependencies corpus (Pt. 1)

Xibe Universal Dependencies feature / UniMorph Additional rules
word transliteration
VerbForm=Conv POS=V.CVB
VerbForm=Fin FIN
VerbForm=Inf NFIN
VerbForm=Part POS=V.PTCP
VerbForm=Vnoun POS=V.MSDR
Voice=Act ACT
Voice=Cau CAUS
Voice=Pass PASS
Voice=Rcp RECP
ateke _
dari _ means ‘each, every’
eiten _ means ‘each, every’
enteke _ means ‘like this’
ere PROX
erebe PROX
ereci PROX
eremu PROX
geren _ means ‘all’
harangga _
tenteke _ means ‘like that’
terali _ means ‘like that’
teralingge _ means ‘like that’
tere REMT
terebe REMT
terei REMT
tesu REMT
tuba _ means ‘there’
tuttu _ means ‘like that’
uba _ means ‘here’
ubaci _ means ‘here’
ubai _ means ‘here’
udu _ means ‘some’
uttu _ means ‘like this’

Table 12: Simplified mapping for the Xibe Universal Dependencies corpus (Pt. 2)
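To make the intended use of Tables 11 and 12 concrete, the following sketch converts a Universal Dependencies POS tag and FEATS string into a UniMorph-style tag bundle, including one example of the conditional rules noted in the table ("not for adpositions"). The mapping excerpt and the example input are illustrative only and do not cover the full conversion.

```python
# Illustrative conversion of UD annotations to UniMorph features, in the
# spirit of Tables 11 and 12. Feature coverage here is a small excerpt.
UPOS_MAP = {"NOUN": "N", "VERB": "V", "ADJ": "ADJ", "ADP": "ADP", "PRON": "PRO"}
FEATS_MAP = {
    ("Case", "Abl"): "ABL", ("Case", "Acc"): "ACC", ("Case", "Gen"): "GEN",
    ("Number", "Plur"): "PL", ("Number", "Sing"): "SG",
    ("Tense", "Past"): "PST", ("Aspect", "Perf"): "PFV",
}

def ud_to_unimorph(upos, feats):
    """Convert e.g. ('NOUN', 'Case=Acc|Number=Plur') to 'N;ACC;PL'.
    Case features are skipped for adpositions, mirroring the
    'not for adpositions' condition in Table 11."""
    tags = [UPOS_MAP.get(upos, upos)]
    if feats and feats != "_":
        for pair in feats.split("|"):
            key, value = pair.split("=")
            if key == "Case" and upos == "ADP":
                continue  # conditional rule: no case features on adpositions
            mapped = FEATS_MAP.get((key, value))
            if mapped:
                tags.append(mapped)
    return ";".join(tags)

print(ud_to_unimorph("NOUN", "Case=Acc|Number=Plur"))  # N;ACC;PL
```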

B Accuracy trends TRM+ CHR-TRM
L BME GUClasp TRM CHR-TRM
AUG +AUG
afb 92.42 79.83 95.13 95.13 95.43 95.43
TRM+ CHR-TRM amh 98.36 91.56 99.72 99.72 99.72 99.72
L BME GUClasp TRM CHR-TRM
AUG +AUG ara 99.79 88.63 99.88 99.88 99.87 99.87
afb 94.77 90.26 95.24 95.24 95.84 95.84 arz 93.31 78.08 95.16 95.16 94.98 94.98
amh 89.67 87.09 94.83 94.83 94.83 94.83 heb 98.41 92.95 99.65 99.65 99.75 99.75
ara 99.87 98.34 99.93 99.93 99.93 99.93 syc 11.02 4.41 25.73 21.32 27.94 25.00
arz 95.65 91.39 97.31 97.31 97.07 97.07 ame 81.30 57.82 84.78 86.52 86.08 83.47
syc 10.71 7.14 10.71 17.85 14.28 14.28 cni 98.73 79.51 99.72 99.72 99.63 99.63
ind 80.15 69.26 85.60 85.60 84.43 84.43 ind 83.01 52.41 84.47 84.47 84.36 84.36
kod 100.00 90.90 100.00 100.00 90.90 100.00 kod 93.65 92.06 100.00 98.41 100.00 100.00
ckt 50.00 50.00 50.00 50.00 50.00 50.00 aym 99.98 99.99 100.00 100.00 100.00 100.00
itl 50.00 58.33 66.66 58.33 66.66 58.33 ckt 25.00 31.25 18.75 37.50 18.75 43.75
bra 68.75 65.62 50.00 71.87 65.62 56.25 itl 14.96 12.92 20.40 25.17 21.08 23.12
bul 99.73 96.85 100.00 100.00 99.96 99.96 gup 14.75 21.31 59.01 63.93 55.73 60.65
ces 99.49 97.74 99.50 99.50 99.52 99.52 bra 31.30 29.56 24.34 27.82 27.82 30.43
mag 69.23 84.61 76.92 92.30 92.30 84.61 bul 99.51 98.36 99.86 99.86 99.89 99.89
nld 97.29 96.38 97.85 97.85 97.80 97.80 ces 98.88 94.97 99.54 99.54 99.40 99.40
pol 99.91 99.67 99.95 99.95 100.00 100.00 ckb 99.72 96.44 99.96 99.96 99.96 99.96
rus 99.81 98.88 99.44 99.44 99.44 99.44 deu 99.39 94.55 99.75 99.75 99.73 99.73
ail 12.50 12.50 0 12.50 12.50 12.50 kmr 98.20 96.97 100.00 100.00 99.67 99.67
evn 73.52 74.50 78.43 76.47 72.54 77.45 mag 36.36 38.63 42.04 42.04 44.31 39.77
krl 100.00 90.69 93.02 93.02 93.02 93.02 nld 99.53 94.99 99.86 99.86 99.86 99.86
olo 99.80 98.05 99.92 99.92 99.76 99.76 pol 99.57 98.22 99.76 99.76 99.74 99.74
vep 99.85 97.86 99.82 99.82 99.88 99.88 por 99.84 99.12 99.91 99.91 99.86 99.86
sjo 66.66 66.66 66.66 100.00 100.00 100.00 rus 99.25 90.31 97.09 97.09 97.38 97.38
tur 97.78 97.41 100.00 100.00 100.00 100.00 spa 99.82 97.55 99.90 99.90 99.92 99.92
see 78.27 40.97 90.65 89.64 90.00 88.63
Table 13: Accuracy for “Adjective” on the test data. ail 5.69 6.73 10.88 8.80 9.32 10.36
evn 34.70 32.03 44.90 44.66 44.90 46.35
sah 99.83 98.98 99.61 99.61 99.83 99.83
tyv 99.94 99.50 99.91 99.91 99.95 99.95
TRM+ CHR-TRM
L BME GUClasp TRM CHR-TRM krl 99.94 98.82 99.94 99.94 99.94 99.94
AUG +AUG
lud 56.25 56.25 0 50.00 6.25 50.00
syc 65.21 6.52 84.78 82.60 86.95 80.43 olo 99.84 99.14 99.71 99.71 99.70 99.70
bul 99.60 97.90 100.00 100.00 100.00 100.00 vep 99.71 97.50 99.60 99.60 99.65 99.65
ces 100.00 98.40 99.90 99.90 100.00 100.00 sjo 18.51 3.70 29.62 33.33 29.62 25.92
ckb 97.91 95.83 100.00 100.00 100.00 100.00 tur 99.96 99.98 99.17 99.17 99.17 99.17
deu 94.89 91.72 95.91 95.91 96.52 96.52
kmr 100.00 100.00 100.00 100.00 100.00 100.00
nld 89.17 78.58 94.58 94.58 96.00 96.00 Table 17: Accuracy for “Verb” in each language on the
pol 100.00 99.92 100.00 100.00 100.00 100.00
por 99.83 98.96 99.67 99.67 99.78 99.78
test data.
rus 97.04 92.06 96.58 96.58 96.67 96.67
spa 99.93 99.03 99.35 99.35 99.29 99.29
evn 12.76 7.09 17.73 18.43 14.89 19.14
krl 100.00 98.18 100.00 100.00 100.00 100.00 TRM+ CHR-TRM
L BME GUClasp TRM CHR-TRM
olo 99.69 97.83 99.38 99.38 99.69 99.69 AUG +AUG
vep 99.02 96.67 99.21 99.21 99.21 99.21
afb 91.00 81.87 94.00 94.00 92.95 92.95
sjo 22.22 22.22 27.77 55.55 55.55 50.00 amh 98.46 96.30 99.24 99.24 99.31 99.31
ara 99.60 94.76 99.44 99.44 99.56 99.56
Table 14: Accuracy for “Participle” on the test data. arz 96.58 91.66 97.56 97.56 97.23 97.23
heb 94.23 70.37 98.39 98.39 98.92 98.92
syc 20.00 18.57 32.85 34.28 32.14 32.85
TRM+ CHR-TRM ame 82.99 55.06 88.66 88.46 87.65 87.44
L BME GUClasp TRM CHR-TRM
AUG +AUG cni 99.79 98.62 99.96 99.96 99.96 99.96
ind 78.93 57.69 81.81 81.81 81.38 81.38
amh 98.67 95.78 99.75 99.75 99.87 99.87
kod 94.73 84.21 100.00 100.00 100.00 100.00
itl 19.04 13.09 25.00 20.23 21.42 21.42
aym 99.97 99.95 99.96 99.96 99.96 99.96
bul 100.00 98.57 100.00 100.00 100.00 100.00
ces 98.97 95.47 100.00 100.00 99.38 99.38 ckt 60.00 70.00 30.00 70.00 35.00 70.00
pol 99.22 99.22 100.00 100.00 100.00 100.00 itl 64.22 66.97 71.55 71.55 72.47 72.47
rus 99.21 97.49 97.96 97.96 99.68 99.68 bra 75.60 74.39 74.39 79.87 80.48 78.04
spa 98.74 98.23 99.24 99.24 100.00 100.00 bul 95.27 89.54 97.92 97.92 97.47 97.47
evn 23.38 16.12 25.00 32.25 27.41 33.06 ces 95.49 88.60 95.56 95.56 95.60 95.60
sah 100.00 100.00 100.00 100.00 100.00 100.00 ckb 97.44 98.01 99.71 99.71 100.00 100.00
tyv 100.00 100.00 100.00 100.00 100.00 100.00 deu 96.93 89.64 95.48 95.48 95.51 95.51
kmr 98.20 98.12 97.91 97.91 97.91 97.91
sjo 54.54 9.09 54.54 54.54 72.72 45.45 mag 91.60 92.30 81.81 91.60 85.31 92.30
pol 96.30 89.71 97.07 97.07 97.35 97.35
Table 15: Accuracy for “Converb” on the test data. rus 95.80 92.91 96.24 96.24 96.03 96.03
ail 9.67 4.83 17.74 20.96 14.51 20.96
evn 71.30 74.23 75.48 75.34 76.88 76.18
L BME GUClasp TRM
TRM+
CHR-TRM
CHR-TRM sah 99.97 99.78 99.97 99.97 99.99 99.99
AUG +AUG tyv 99.98 99.93 99.96 99.96 99.97 99.97
amh 88.61 79.67 95.12 95.12 97.56 97.56 krl 93.48 68.83 95.81 95.81 95.81 95.81
heb 79.53 73.68 83.62 83.62 83.04 83.04 lud 61.90 61.90 28.57 42.85 42.85 42.85
aym 100.00 100.00 100.00 100.00 100.00 100.00 olo 99.54 96.99 99.57 99.57 99.58 99.58
itl 33.33 33.33 33.33 50.00 33.33 33.33 vep 99.72 96.50 99.66 99.66 99.70 99.70
bul 98.85 98.00 100.00 100.00 100.00 100.00 sjo 58.33 41.66 25.00 58.33 25.00 66.66
kmr 99.37 100.00 98.74 98.74 100.00 100.00 tur 99.83 98.49 99.73 99.73 99.71 99.71
pol 99.96 99.96 100.00 100.00 100.00 100.00 vro 94.78 87.39 97.82 98.26 97.82 97.39
sjo 46.15 0 46.15 38.46 46.15 30.76

Table 16: Accuracy for “Masdar” on the test data. Table 18: Accuracy for “Noun” in each language on the
test data.

Training Strategies for Neural Multilingual Morphological Inflection

Adam Ek and Jean-Philippe Bernardy


Centre for Linguistic Theory and Studies in Probability
Department of Philosophy, Linguistics and Theory of Science
University of Gothenburg
{adam.ek,jean-philippe.bernardy}@gu.se

Abstract

This paper presents the submission of team GUCLASP to the SIGMORPHON 2021 Shared Task on Generalization in Morphological Inflection Generation. We develop a multilingual model for morphological inflection and primarily focus on improving the model by using various training strategies to improve accuracy and generalization across languages.

1 Introduction

Morphological inflection is the task of transforming a lemma to its inflected form given a set of grammatical features (such as tense or person). Different languages have different strategies, or morphological processes, such as affixation, circumfixation, or reduplication, among others (Haspelmath and Sims, 2013). These are all ways to make a lemma express some grammatical features. One way to characterize how languages express grammatical features is a spectrum of morphological typology ranging from agglutinative to isolating. In agglutinative languages, grammatical features are encoded as bound morphemes attached to the lemma, while in isolating languages each grammatical feature is represented as a lemma. Thus, languages in different parts of this spectrum will have different strategies for expressing the information given by grammatical features.

In recent years, statistical and neural models have been performing well on the task of morphological inflection (Smit et al., 2014; Kann and Schütze, 2016; Makarov et al., 2017; Sharma et al., 2018). We follow this tradition and implement a neural multilingual model for morphological inflection. In a multilingual system, a single model is developed to handle input from several different languages: we can give the model a word in either Evenk or Russian and it performs inflection.

This is a challenging problem for several reasons. For many languages resources are scarce, and a multilingual system must balance the training signals from both high-resource and low-resource languages such that the model learns something about both. Additionally, different languages employ different morphological processes to inflect words, and they use different scripts (for example Arabic, Latin, or Cyrillic), which can make it hard to transfer knowledge about one language to another. To account for these factors a model must learn to recognize the different morphological processes, the associated grammatical features, and the script used, and map them to languages.

In this paper, we investigate how far these issues can be tackled using different training strategies, as opposed to focusing on model design. Of course, in the end, an optimal system will be a combination of a good model design and good training strategies. We employ an LSTM encoder-decoder architecture with attention, based on the architecture of Anastasopoulos and Neubig (2019), as our base model and consider the following training strategies:

• Curriculum learning: We tune the order in which the examples are presented to the model based on the loss.

• Multi-task learning: We predict the formal operations required to transform a lemma into its inflected form.

• Language-wise label smoothing: We smooth the loss function so that the model is penalized less when it predicts a character from the correct language.

• Scheduled sampling: We use a probability distribution to determine whether to use the
previous output or the gold as input when decoding.

2 Data

The data released cover 38 languages of varying typology, genealogy, grammatical features, scripts, and morphological processes. The data for the different languages vary greatly in size, from 138 examples (Ludic) to 100310 (Turkish). For the low-resourced languages¹ we extend the original dataset with hallucinated data (Anastasopoulos and Neubig, 2019) to train on.

¹ We consider languages with less than 10 000 training examples as low-resource in this paper.

With respect to the work of Anastasopoulos and Neubig (2019), we make the following changes. We identify all subsequences of length 3 or more that overlap in the lemma and inflection. We then randomly sample one of them, denoted R, as the sequence to be replaced. For each language, we compile a set C_L containing all (1,2,3,4)-grams in the language. We construct a string G to replace R with by uniformly sampling n-grams from C_L and concatenating them, G = cat(g_0, ..., g_m), until we have a sequence whose length satisfies |R| − 2 ≤ |G| ≤ |R| + 2.

Additionally, we do not consider subsequences which include a phonological symbol.² A schematic of the hallucination process is shown in Figure 1.

² Thus in Figure 1 a subsequence of length 2 is selected as the sequence to be replaced, since the larger subsequences would include the phonological symbol ː.

[Figure 1: schematic of the hallucination process. A random subsequence of the lemma heːkiː / inflection heːkiːŋitin is replaced by randomly sampled (1,2,3)-grams, yielding the new lemma raːtkː and the new inflection raːtkːŋitin.]

Figure 1: An example of the data hallucination process. The sequence R = ki is replaced by G = tk.

Sampling n-grams instead of individual characters allows us to retain some of the orthographical information present in the language. We generate a set of 10 000 hallucinated examples for each of the low-resource languages.

3 Method

In this section, the multilingual model and training strategies used are presented.³ We employ a single model with shared parameters across all languages.

³ Our code is available at https://github.com/adamlek/multilingual-morphological-inflection/

3.1 Model

To account for different languages in our model we prepend a language embedding to the input (similarly to Johnson et al. (2017); Raffel et al. (2019)). To model inflection, we employ an encoder-decoder architecture with attention. The first layer in the model is comprised of an LSTM, which produces a contextual representation for each character in the lemma. We encode the tags using a self-attention module (equivalent to a 1-head transformer layer) (Vaswani et al., 2017). This layer does not use any positional data: indeed the order of the tags does not matter (Anastasopoulos and Neubig, 2019).

To generate inflections, we use an LSTM decoder with two attention modules, one attending to the lemma and one to the tags. For the lemma attention, we use a content-based attention module (Graves et al., 2014; Karunaratne et al., 2021) which uses cosine similarity as its scoring method. However, we found that only using content-based attention causes attention to be too focused on a single character, and mostly ignores contextual cues relevant for the generation.

To remedy this, we combine the content-based attention with additive attention as follows, where the superscript cb indicates content-based attention, add additive attention, and k the key:

    a^add  = softmax(w^T tanh(W_a k + W_b h))
    att^add = Σ_{t=1}^{T} a^add_t h_t
    a^cb   = softmax(cos(k, h))
    att^cb  = Σ_{t=1}^{T} a^cb_t h_t
    att    = W [att^cb ; att^add]

In addition to combining content-based attention and additive attention, we also employ regularization on the attention modules such that for each decoding step, the attention is encouraged to distribute the attention weights a uniformly across the lemma and tag hidden states (Anastasopoulos and Neubig, 2019; Cohn et al., 2016). We employ additive attention for the tags.
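To make the combination of the two scoring functions concrete, the following is a minimal PyTorch sketch of the equations above; the module and variable names are ours and are not taken from the authors' released code.

```python
# Sketch of the combined content-based (cosine) + additive attention of Section 3.1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, 1, bias=False)      # w^T in the additive score
        self.W_a = nn.Linear(dim, dim, bias=False)  # transforms the key
        self.W_b = nn.Linear(dim, dim, bias=False)  # transforms the encoder states
        self.W_out = nn.Linear(2 * dim, dim)        # merges both context vectors

    def forward(self, key, states):
        # key: (batch, dim) decoder output; states: (batch, T, dim) encoder states.
        add_scores = self.w(torch.tanh(self.W_a(key).unsqueeze(1) + self.W_b(states))).squeeze(-1)
        a_add = F.softmax(add_scores, dim=-1)                        # additive weights
        att_add = torch.bmm(a_add.unsqueeze(1), states).squeeze(1)   # additive context
        cb_scores = F.cosine_similarity(key.unsqueeze(1), states, dim=-1)
        a_cb = F.softmax(cb_scores, dim=-1)                          # content-based weights
        att_cb = torch.bmm(a_cb.unsqueeze(1), states).squeeze(1)     # content-based context
        return self.W_out(torch.cat([att_cb, att_add], dim=-1))

# Tiny usage example with random tensors.
attn = CombinedAttention(dim=8)
context = attn(torch.randn(2, 8), torch.randn(2, 5, 8))
print(context.shape)  # torch.Size([2, 8])
```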
In each decoding step, we pass the gold or predicted character embedding to the decoding LSTM. We then take the output as the key and calculate attention over the lemma and tags. This representation is then passed to a two-layer perceptron with ReLU activations.

3.2 Multi-task learning

Instead of predicting the characters in the inflected form, one can also predict the Levenshtein operations needed to transform the lemma into the inflected form, as shown by Makarov et al. (2017). A benefit of considering operations instead of characters is that the script used is less of a factor, since by considering the operations only we abstract away from the script used. We find that making both predictions, as a multi-task setup, improves the performance of the system.

The multi-task setup operates on the character level: for each contextual representation of a character we want to predict an operation among deletion (del), addition/insertion (add), substitution (sub) and copy (cp). Because add and del change the length, we predict two sets of operations, the lemma-reductions and the lemma-additions. To illustrate, the Levenshtein operations for the word pair (valatas, ei valate) in Veps (a Uralic language related to Finnish) are shown in Figure 2.

    Inflection:   e   i       v   a   l   a   t   e
    Operations:  add add add  cp  cp  cp  cp  cp  del sub
    Lemma:                    v   a   l   a   t   a   s

Figure 2: Levenshtein operations mapped to characters in the lemma and inflection.

In our setup, the task of lemma-reductions is performed by predicting the cp, del, and sub operations based on the encoded hidden states in the lemma. The task of lemma-additions is then performed by predicting the cp, add, and sub operations on the characters generated by the decoder. We use a single two-layer perceptron with ReLU activation to predict both lemma-reductions and lemma-additions.⁴

⁴ In the future, we'd like to experiment with including the representations of tags in the input to the operation classifier.

3.3 Curriculum Learning

We employ a competence-based curriculum learning strategy (Liu et al., 2020; Platanios et al., 2019). A competence curriculum learning strategy constructs a learning curriculum based on the competence of a model, and presents examples which the model is deemed to be able to handle. The goal of this strategy is for the model to transfer or apply the information it acquires from the easy examples to the hard examples.

To estimate an initial difficulty for an example we consider the character unigram log probability of the lemma and inflection. For a word (either the lemma or inflection) w = c_0, ..., c_K, the unigram log probability is given by:

    log(P_U(w)) = Σ_{k=0}^{K} log(p(c_k))    (1)

To get a score for a lemma and inflection pair (henceforth (x, y)), we calculate it as the sum of the log probabilities of x and y:

    score(x, y) = P_U(x) + P_U(y)    (2)

Note that here we do not normalize by the length of the inflection and lemma. This is because an additional factor in how difficult an example should be considered is its length, i.e. longer words are harder to model. We then sort the examples and use a cumulative density function (CDF) to map the unigram probabilities to a score in the range (0, 1]; we denote the training set of pairs and their scores ((x, y), s)_0, ..., ((x, y), s)_m, where m indicates the number of examples in the dataset, as D.

To select appropriate training examples from D we must estimate the competence c of our model. The competence of the model is estimated as a function of the number of training steps t taken:

    c(t) = min(1, √(t · (1 − c(1)²) / c(1)² + c(1)²))    (3)

During training, we employ a probabilistic approach to constructing batches from our corpus: we uniformly draw samples ((x, y), s) from the training set D such that the score s is lower than the model competence c(t). This ensures that for each training step, we only consider examples that the model can handle according to our curriculum schedule.
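The following is a small Python sketch of the difficulty scoring and competence schedule described in this section, on a toy corpus of (lemma, inflection) pairs. The empirical-CDF mapping and the normalisation constant in the competence function are our own choices (the growth law follows Platanios et al. (2019)); they are not the authors' exact implementation.

```python
import math
import random
from collections import Counter

def unigram_logprob(word, char_probs):
    # Eq. (1): sum of character log-probabilities.
    return sum(math.log(char_probs[c]) for c in word)

def difficulty_scores(pairs):
    counts = Counter(c for x, y in pairs for c in x + y)
    total = sum(counts.values())
    probs = {c: n / total for c, n in counts.items()}
    # Eq. (2): score of a pair is the sum of both words' log-probabilities.
    raw = [unigram_logprob(x, probs) + unigram_logprob(y, probs) for x, y in pairs]
    # Empirical CDF: rank / number of examples, so harder (less probable) pairs
    # get scores closer to 1.
    order = sorted(range(len(raw)), key=lambda i: raw[i], reverse=True)
    scores = [0.0] * len(raw)
    for rank, i in enumerate(order, start=1):
        scores[i] = rank / len(raw)
    return scores

def competence(t, c0=0.1, full_steps=60000):
    # Competence grows with the training step t and saturates at 1 after
    # full_steps steps (one common parameterisation of Eq. (3)).
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / full_steps + c0 ** 2))

def sample_batch(pairs, scores, t, batch_size=4):
    c = competence(t)
    eligible = [p for p, s in zip(pairs, scores) if s <= c]
    return random.sample(eligible, min(batch_size, len(eligible)))

pairs = [("valatas", "ei valate"), ("izar", "izariais"), ("katt", "kattid"), ("olla", "olen")]
print(sample_batch(pairs, difficulty_scores(pairs), t=30000))
```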
However, just because an example has low unigram probability doesn't ensure that the example is easy, as the example may contain frequent characters but also include rare morphological processes (or rare combinations of Levenshtein operations); to account for this we recompute the example scores at each training step. We sort the examples in each training step according to the decoding loss, then assign a new score to the examples in the range (0, 1] using a CDF function.

We also have to take into account that as the model competence grows, "easy" (low loss or high unigram probability) examples will be included more often in the batches. To ensure that the model learns more from examples whose difficulty is close to its competence, we compute a weight for each example in the batch. We then scale the loss by dividing the score s by the model competence at the current time-step:

    weighted_loss(x, y) = loss(x, y) × score(x, y) / c(t)    (4)

Because the value of our model competence is tied to a specific number of training steps, we develop a probabilistic strategy for sampling batches when the model has reached full competence. When the model reaches full competence we construct language weights by dividing the number of examples in a language by the total number of examples in the dataset and taking the inverse distribution as the language weights. Thus for each language, we get a value in the range (0, 1] where low-resource languages receive a higher weight. To construct a batch we continue by sampling examples, but now we only add an example if r ∼ ρ, where ρ is a uniform Bernoulli distribution, is less than the language weight of the example. This strategy allows us to continue training our model after reaching full competence without neglecting the low-resource languages.

In total we train the model for 240 000 training steps, and consider the model to be fully competent after 60 000 training steps.

3.4 Scheduled Sampling

Commonly, when training an encoder-decoder RNN model, the input at time-step t is not the output from the decoder at t − 1, but rather the gold data. It has been shown that models trained with this strategy may suffer at inference time. Indeed, they have never been exposed to a partially incorrect input in the training phase. We address this issue using scheduled sampling (Bengio et al., 2015).

We implement a simple schedule for calculating the probability of using the gold characters or the model's prediction by using a global sample probability variable which is updated at each training step. We start with a probability ρ of 100% to take the gold. At each training step, we decrease ρ by 1/total_steps. For each character, we take a sample from the Bernoulli distribution of parameter ρ to determine the decision to make.
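A minimal sketch of such a schedule is given below, assuming a linear decay of the gold-sampling probability; the class and variable names are illustrative, not the authors' implementation.

```python
import random

class TeacherForcingSchedule:
    def __init__(self, total_steps):
        self.rho = 1.0                    # probability of feeding the gold character
        self.decay = 1.0 / total_steps

    def step(self):
        # Called once per training step.
        self.rho = max(0.0, self.rho - self.decay)

    def use_gold(self):
        # One Bernoulli(rho) draw per character during decoding.
        return random.random() < self.rho

schedule = TeacherForcingSchedule(total_steps=240000)
for _ in range(120000):
    schedule.step()
print(round(schedule.rho, 2))   # 0.5 halfway through training
print(schedule.use_gold())      # True roughly half of the time at this point
```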
3.5 Training

We use cross-entropy loss for the character generation loss and for the operation prediction tasks. Our final loss function consists of the character generation loss, the lemma-reduction loss, and the lemma-addition loss summed. We use a cosine annealing learning rate scheduler (Loshchilov and Hutter, 2017), gradually decreasing the learning rate. The hyperparameters used for training are presented in Table 1.

    Hyperparameter              Value
    Batch size                  256
    Embedding dim               128
    Hidden dim                  256
    Training steps              240000
    Steps for full competence   60000
    Initial LR                  0.001
    Min LR                      0.0000001
    Smoothing-α                 2.5%

Table 1: Hyperparameters used. As we use a probabilistic approach to training we report the number of training steps rather than epochs. In total, the number of training steps we take corresponds to about 35 epochs.

Language-wise label smoothing. We use language-wise label smoothing to calculate the loss. This means that we remove a constant α from the probability of the correct character and distribute the same α uniformly across the probabilities of the characters belonging to the language of the word. The motivation for doing label smoothing this way is that we know that all incorrect character predictions are not equally incorrect. For example, when predicting the inflected form of a Modern Standard Arabic (ara) word, it is more correct to select any character from the Arabic alphabet than a character from the Latin or Cyrillic alphabet. A difficulty is that each language potentially uses a different set of characters. We calculate this set using the training set only, so it is important to make α not too large, so that there is not a too big difference between characters seen in the training set and those not seen. Indeed, if there were, the model might completely exclude unseen characters from its test-time predictions. (We found that α = 2.5% is a good value.)
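As an illustration of the language-wise label smoothing described in Section 3.5, the sketch below builds a smoothed target distribution over a toy character vocabulary. Whether the gold character itself also receives part of the redistributed mass is our assumption, not a detail stated in the paper.

```python
import torch

def language_smoothed_targets(gold_idx, lang_chars, vocab_size, alpha=0.025):
    # gold_idx: index of the correct character; lang_chars: indices of all
    # characters observed for this language in the training data.
    target = torch.zeros(vocab_size)
    target[gold_idx] = 1.0 - alpha
    # Spread alpha only over the characters of the word's own language.
    target[lang_chars] += alpha / len(lang_chars)
    return target

vocab_size = 10
arabic_chars = torch.tensor([3, 4, 5, 6])      # toy "Arabic" character indices
target = language_smoothed_targets(gold_idx=4, lang_chars=arabic_chars, vocab_size=vocab_size)
log_probs = torch.log_softmax(torch.randn(vocab_size), dim=-1)
loss = -(target * log_probs).sum()             # cross-entropy against the smoothed target
print(target.sum(), loss.item())
```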
4 Results

The results from our system using the four training strategies presented earlier are given in Table 2. Each language is evaluated by two metrics, exact match and average Levenshtein distance. The average Levenshtein distance is, on average, how many operations are required to transform the system's guess into the gold inflected form. One challenging aspect of this dataset for our model is balancing the information the model learns about low- and high-resource languages. We plot the accuracy the model achieved against the data available for that language in Figure 3.

[Figure 3: bar chart of the number of natural and hallucinated training examples per language, overlaid with the exact match accuracy per language on the development and test data.]

Figure 3: Number of examples (green indicates natural and blue hallucinated examples, left x-axis) plotted against the exact match accuracy (right x-axis) of our system on the development data (blue) and the test data (red).

    Lang   Test EM   Test Lev   Dev EM   Dev Lev
    afb    90.29     0.17       91.29    0.15
    ail    6.84      3.6        7.69     3.62
    ame    70.72     0.75       73.67    0.64
    amh    97.44     0.04       96.87    0.04
    ara    98.69     0.03       98.59    0.04
    arz    91.65     0.14       92.48    0.14
    aym    99.8      0.0        99.75    0.01
    bra    62.38     0.71       64.05    0.59
    bul    98.16     0.03       98.02    0.03
    ces    93.41     0.12       94.01    0.12
    ckb    68.27     0.77       68.91    0.73
    ckt    60.53     1.37       55.56    1.72
    cni    91.99     0.11       91.38    0.12
    deu    93.95     0.09       93.28    0.1
    evn    51.7      1.47       50.41    1.5
    gup    22.95     3.92       26.67    3.1
    heb    95.86     0.09       94.97    0.1
    ind    62.32     1.3        60.28    1.33
    itl    22.63     2.89       22.16    3.11
    kmr    98.19     0.02       98.32    0.02
    kod    79.57     0.58       80.43    0.37
    krl    97.62     0.05       97.83    0.04
    lud    62.16     0.73       66.67    0.44
    mag    70.2      0.53       63.64    0.71
    nld    92.51     0.12       92.31    0.12
    olo    99.39     0.01       99.36    0.01
    pol    97.34     0.04       97.51    0.04
    por    98.41     0.03       98.3     0.03
    rus    97.02     0.05       96.8     0.05
    sah    99.86     0.0        99.86    0.0
    see    49.77     1.7        50.37    1.51
    sjo    29.76     1.73       32.43    1.83
    spa    98.66     0.02       98.65    0.02
    syc    9.43      5.25       15.7     5.41
    tur    98.69     0.03       98.86    0.03
    tyv    98.61     0.03       98.51    0.03
    vep    99.26     0.01       99.3     0.01
    vro    82.17     0.31       80.7     0.42

Table 2: Results on the development data.

We note that for all languages with roughly more than 30 000 examples our model performs well, achieving around 98% accuracy. When we consider languages that have around 10 000 natural examples and no hallucinated data, the accuracy drops closer to around 50%. For the languages with hallucinated data, we would expect this trend to continue as the data is synthetic and does not take into account orthographic information as natural language examples do. That is, when constructing hallucinated examples, orthography is taken into account only indirectly because we consider n-grams instead of characters when finding the replacement sequence. However, we find that for many of the languages with hallucinated data the exact match accuracy is above 50%, but it varies a lot depending on the language.

Two of the worst languages for our model are Classical Syriac (syc) and Xibe (sjo). An issue with Classical Syriac is that the language uses a unique script, the Syriac abjad, which makes it difficult for the model to transfer information about operations and common character combinations/transformations into Classical Syriac from related languages such as Modern Standard Arabic (spoken in the region). For Xibe there is a similar story: it uses the Sibe alphabet, a variant of the Manchu script, which does not occur elsewhere in our dataset.

5 Language similarities

Our model processes many languages simultaneously, so it would be encouraging if the model were also able to find similarities between languages. To explore this we investigate whether the language embeddings learned by the model produce clusters of language families. A t-SNE plot of the language embeddings is shown in Figure 4.

[Figure 4: two-dimensional t-SNE projection of the learned language embeddings, with languages coloured by family.]

Figure 4: t-SNE plot of the language embeddings. Different colors indicate different language families.

The plot shows that the model can find some family resemblances between languages. For example, we have a Uralic cluster consisting of the languages Veps (vep), Olonets (olo), and Karelian (krl), which are all spoken in a region around Russia and Finland. However, Ludic (lud) and Võro (vro) are not captured in this cluster, yet they are spoken in the same region.

We can see that the model seems to separate language families somewhat depending on the script used. The Afro-Asiatic languages are split into two smaller clusters, one containing the languages that use the Standard Arabic script (ara, afb and arz) and one containing those that use the Amharic and Hebrew scripts (amh, heb). As mentioned earlier, Classical Syriac uses yet another script and consequently seems to appear in another part of the map. In general, our model's language embeddings appear to learn some relationships between languages, but certainly not all of them. However, that we find some patterns is encouraging for future work.

6 Scheduled Sampling

We note that during development all of our training strategies led to stronger performance on the task, except one: scheduled sampling. We hypothesize this is because the low-resource languages benefit from using the gold as input when predicting the next character, while high-resource languages do not need this as much. The model has seen more examples from high-resource languages and thus can model them better, which makes using the previous hidden state more reliable as input when predicting the next token. Indeed, scheduled sampling degrades the overall performance by 3.04 percentage points; removing it increases our total average accuracy to 83.3 percentage points, primarily affecting the low-resource languages.

7 Conclusions and future work

We have presented a single multilingual model for morphological inflection in 38 languages, enhanced with different training strategies: curriculum learning, multi-task learning, scheduled sampling and language-wise label smoothing. The results indicate that our model to some extent captures similarities between the input languages; however, languages that use different scripts appear problematic. A solution to this would be to employ transliteration (Murikinati et al., 2020).

In future work, we plan on exploring curriculum learning in more detail and moving away from estimating the competence of our model linearly, instead estimating the competence using the accuracy on the batches. Another interesting line of work is to not score the examples by model loss alone, but to combine it with insights from language acquisition and teaching, such as sorting lemmas based on their frequency in a corpus (Ionin and Wexler, 2002; Slabakova, 2010).

We also plan to investigate language-wise label smoothing more closely, specifically how the value of α should be fine-tuned with respect to the number of characters and languages.
Acknowledgments

The research reported in this paper was supported by grant 2014-39 from the Swedish Research Council, which funds the Centre for Linguistic Theory and Studies in Probability (CLASP) in the Department of Philosophy, Linguistics, and Theory of Science at the University of Gothenburg.

References

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, pages 984–996. Association for Computational Linguistics.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, Quebec, Canada, pages 1171–1179.

Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vylomova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, pages 876–885. The Association for Computational Linguistics.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR, abs/1410.5401.

Martin Haspelmath and Andrea Sims. 2013. Understanding Morphology. Routledge.

Tania Ionin and Kenneth Wexler. 2002. Why is 'is' easier than '-s'?: Acquisition of tense/agreement morphology by child second language learners of English. Second Language Research, 18(2):95–136.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70.

Geethan Karunaratne, Manuel Schmuck, Manuel Le Gallo, Giovanni Cherubini, Luca Benini, Abu Sebastian, and Abbas Rahimi. 2021. Robust high-dimensional memory-augmented neural networks. Nature Communications, 12(1):1–12.

Xuebo Liu, Houtim Lai, Derek F. Wong, and Lidia S. Chao. 2020. Norm-based curriculum learning for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, pages 427–436. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, Conference Track Proceedings. OpenReview.net.

Peter Makarov, Tatiana Ruzsics, and Simon Clematide. 2017. Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, BC, Canada, pages 49–57. Association for Computational Linguistics.

Nikitha Murikinati, Antonios Anastasopoulos, and Graham Neubig. 2020. Transliteration for cross-lingual morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 189–197, Online. Association for Computational Linguistics.

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom M. Mitchell. 2019. Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Abhishek Sharma, Ganesh Katrapati, and Dipti Misra Sharma. 2018. IIT(BHU)–IIITH at CoNLL–SIGMORPHON 2018 shared task on universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 105–111, Brussels. Association for Computational Linguistics.

Roumyana Slabakova. 2010. What is easy and what is hard to acquire in a second language?

Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden, pages 21–24. The Association for Computer Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, pages 5998–6008.
BME Submission for SIGMORPHON 2021 Shared Task 0. A Three Step
Training Approach with Data Augmentation for Morphological Inflection
Gábor Szolnok∗   Botond Barta∗
Budapest University of Technology and Economics
[email protected]   [email protected]

Dorina Lakatos∗   Judit Ács
Budapest University of Technology and Economics
[email protected]   [email protected]

∗ The first three authors contributed equally.

Abstract

We present the BME submission for the SIGMORPHON 2021 Task 0 Part 1, Generalization Across Typologically Diverse Languages, shared task. We use an LSTM encoder-decoder model with three-step training that is first trained on all languages, then fine-tuned on each language family and finally fine-tuned on individual languages. We use a different type of data augmentation technique in the first two steps. Our system outperformed the only other submission. Although it remains worse than the Transformer baseline released by the organizers, our model is simpler and our data augmentation techniques are easily applicable to new languages. We perform ablation studies and show that the augmentation techniques and the three training steps often help but sometimes have a negative effect. Our code is publicly available.¹

¹ https://github.com/szogabaha/SIGMORPHON2021-task0

1 Introduction

Morphological inflection is the task of generating inflected word forms given a lemma and a set of morphosyntactic tags. Inflection plays a key role in natural language generation, particularly in languages with rich morphology. The SIGMORPHON Shared Tasks are yearly competitions for inflection tasks (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Nicolai et al., 2020).

This paper describes the BME team's submission for Part 1 of the 2021 SIGMORPHON–UniMorph Shared Task on Generalization in Morphological Inflection Generation. There were only two submissions to this subtask and our team outperformed the other submission by a large margin. The task was about type-level morphological inflection in 38 typologically diverse languages from 12 language families.

Our model builds on the work of Faruqui et al. (2015). We use a sequence-to-sequence (seq2seq) model with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) encoder and a unidirectional LSTM decoder with attention. We perform a small hyperparameter search and find that the most important parameters are the choice of the embedding size and the hidden size.

We use two data augmentation techniques: a simple copy mechanism and a stem modification method. We add these methods at the first two steps of our three-step training regime. We describe a third data augmentation technique, a template induction method, that did not improve the overall results in the end, so we did not use it in our submission. We first train a single model on all languages, then fine-tune the model on each language family, and then on each language.

Our main contributions are:

• We present the highest performing submission for the SIGMORPHON 2021 Task 0 Part 1 shared task.

• We try three data augmentation techniques.

• We use a three-step training regime mixed with two data augmentation techniques applied at the first two steps.

• We perform ablation studies for the augmentation techniques and the training steps. We highlight the negative results as well.

2 Related Work

Seq2seq neural networks were first popularized in machine translation (Sutskever et al., 2014), and since the addition of the attention mechanism (Bahdanau et al., 2014; Luong et al., 2015) and the Transformer (Vaswani et al., 2017), it has been
the dominant approach in the field. The first major success of seq2seq models in morphological inflection was the submission by Kann and Schütze (2016) to the 2016 edition of the SIGMORPHON shared task. This was followed by an extensive study by Faruqui et al. (2015) on LSTM-based encoder-decoder models for morphological inflection.

We used the augmentation techniques introduced by Neuvel and Fulop (2002). Inspired by Bergmanis et al. (2017) we attempted to extract different morphological properties of the languages and used them to generate data. Anastasopoulos and Neubig (2019) used a two-step training method that first trains on the language family and then on the individual languages. We use a similar procedure but we augment the data with a different technique in each step.

3 Data

The shared task covered 38 languages from 12 language families. 35 of these languages were available from the beginning, while 3 surprise languages, Turkish, Xibe and Võro, were released one week before the submission deadline. Each language had a train and a development split of varying size. Each sample consists of a lemma, an inflected form and a list of morphosyntactic tags in the following format:

    vaguear      vaguearás     V;IND;SG;2;FUT
    emunger      emunjamos     V;IMP;PL;1;POS
    desenchufar  desenchufo    V;IND;SG;1;PRS
    delirar      deliraren     V;SBJV;PL;3;FUT

The amount of data varies widely among the languages: while the language Veps has more than 100000 examples, Ludic, the most underresourced language, has only 128 train samples. Table 1 lists the 12 language families and the number of languages from each family. We consider languages low resource if they have fewer than 1300 samples. 8 language families had at least one low resource language and 3 families were represented only by low resource languages. One goal of our data augmentation techniques is to offset this imbalance (see Section 5).

    Family                 Langs   Low   Samples
    Afro-Asiatic           12      1     196550
    Arnhem                 1       1     214
    Aymaran                1       0     100000
    Arawakan               2       0     16472
    Iroquoian              1       0     3801
    Turkic                 3       0     300371
    Chukotko-Kamchatkan    2       2     1378
    Tungusic               2       1     5500
    Austronesian           2       1     11395
    Trans-New-Guinean      1       1     918
    Indo-European          12      2     685567
    Uralic                 5       2     279720

Table 1: List of language families and the number of languages from each family. The third column is the number of low resource (<1300 samples) languages in a particular family. The fourth column is the overall sample count in each family.

4 Model architecture

4.1 LSTM based seq2seq model

Our model is largely based on the encoder-decoder model of Faruqui et al. (2015). We use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) as our encoder and a unidirectional LSTM with attention as our decoder.

Recall that the input for the inflection task is a pair consisting of a lemma and a list of morphosyntactic tags. We represent these pairs as a single sequence as the LSTM's input. For the input lemma–tags pair izar, (V, COND, PL, 2), we serialize it as

    <SOS> i z a r <SEP> V COND PL 2 <EOS>

Similarly, we convert the target form into a sequence of characters:

    <SOS> i z a r <EOS>

The output of our model looks like this when the inflected word is izaríais:

    <SOS> i z a r í a i s <EOS>

The input sequence is first projected to an embedding space, which then provides the input for the encoder LSTM. The decoder is a standard unidirectional LSTM with attention. We decode the output in a greedy fashion and do not use beam search. We project the final output to the output vocabulary's dimension and use the softmax function to generate a probability distribution. The input and the output embeddings use shared weights and they are trained from scratch along with the rest of the model.
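The serialization described in Section 4.1 can be sketched as follows; the helper names are ours, not taken from the released code.

```python
# Sketch of the source/target serialization for the seq2seq inflection model.
def serialize_source(lemma, tags):
    # Lemma characters, a separator, then the morphosyntactic tags as single tokens.
    return ["<SOS>"] + list(lemma) + ["<SEP>"] + list(tags) + ["<EOS>"]

def serialize_target(form):
    # The target inflected form as a plain character sequence.
    return ["<SOS>"] + list(form) + ["<EOS>"]

print(serialize_source("izar", ["V", "COND", "PL", "2"]))
# ['<SOS>', 'i', 'z', 'a', 'r', '<SEP>', 'V', 'COND', 'PL', '2', '<EOS>']
print(serialize_target("izaríais"))
# ['<SOS>', 'i', 'z', 'a', 'r', 'í', 'a', 'i', 's', '<EOS>']
```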
    Family          Lang   Result   Copy    Stem-mod   Step 1   Step 2   Step 3   IIT+DA   OL
    Turkic          tur    99.90    99.90   99.94      99.92    97.38    99.90    99.35    97.10
    Uralic          vep    99.72    54.10   99.55      99.80    99.05    99.67    99.70    91.13
                    lud    59.46    56.76   70.27      56.76    67.57    62.16    45.95    0.00
                    olo    99.72    91.15   99.84      99.78    98.26    99.72    99.66    99.48
    Indo-European   rus    98.07    94.84   98.00      97.86    95.56    97.34    97.58    70.72
                    kmr    98.21    86.02   98.74      98.41    97.50    98.21    98.01    5.14
                    deu    97.98    91.19   98.23      97.91    89.91    97.98    97.46    91.86

Table 2: The different results we achieved on the test dataset with different models, with different augmentation techniques excluded (Copy, Stem-mod) and with different training steps excluded (Step 1–3). For comparison the table shows the result of our submission (Result), the provided baseline IIT+DA (Input Invariant Transformer + Data Augmentation) and the models that were trained on only one language (OL).

4.2 Hyperparameter selection

We selected 16 languages from diverse families for hyperparameter tuning. Most of them were fusional or low resource, because early experiments showed that these are the harder ones for the model to learn. We downsampled the larger languages and merged the train sets. We trained 100 models with random parameters sampled uniformly from the parameter ranges listed in Table 3.

    Parameter     Type    Min value   Max value
    num. layers   int     1           3
    dropout       float   0.1         0.6
    embedding     int     32          256
    hidden size   int     64          1024

Table 3: Parameter ranges used for hyperparameter tuning.

It turns out that only two of these hyperparameters make a significant difference: the number of layers and the hidden size. One LSTM layer was clearly inferior and so were LSTMs with fewer than 400 neurons per layer. The embedding dimension and the dropout rate made less difference. We decided to go with two configurations, a small one with a 200-dimensional embedding and the hidden size set to 256, and a large one with the embedding set to 150 dimensions and 900 hidden size. We report the better one for each language based on the development set.

4.3 Training details

We train the models end-to-end with gradient descent using the Adam optimizer with 0.001 learning rate. We apply teacher forcing to the decoder with 0.5 probability, which means that we feed the ground truth character instead of the output of the previous step half of the time.

5 Augmentation techniques

In this section we describe the data augmentation techniques we used. We applied the same steps for each language with varying effect. We performed ablation studies (Section 7.1) on some languages to investigate the individual effect of these techniques.

5.1 Stem modification

Neural networks tend to have difficulties with low amounts of training data, as is the case with low resource languages. For example, a model trained on a language with 50–150 examples will learn to output the training character sequences. In order to avoid this we used the data hallucination technique introduced in Anastasopoulos and Neubig (2019), who identified the "stem" based on common substrings in the inflected forms of the same lemma. Then they replace some characters of the stem with random characters. We use a similar method, but instead of using random characters, we sample them according to the unigram distribution of each language. This way we created 10000 additional examples for each language in the training set.

5.2 Copy

Another attempt was to help the model learn to copy, because without it the model can output wrong characters for the stem instead of copying it. We added a maximum number of 10000 examples to the training set, where the additional data for each language looks like:

    izar    izar    Lang-family;Lang-code;COPY

Copy is a new tag we added just for this specific task.
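A minimal sketch of the stem-modification augmentation of Section 5.1 is given below. The stem is approximated here by the longest common substring of the lemma and the inflected form, which is a simplification of the hallucination procedure the authors adapt; all names are illustrative.

```python
import random
from collections import Counter
from difflib import SequenceMatcher

def unigram_sampler(words):
    # Sample characters according to the language's unigram distribution.
    counts = Counter(c for w in words for c in w)
    chars, weights = zip(*counts.items())
    return lambda: random.choices(chars, weights=weights)[0]

def hallucinate(lemma, form, sample_char, n_changes=2):
    # Approximate the "stem" as the longest substring shared by lemma and form.
    match = SequenceMatcher(None, lemma, form).find_longest_match(0, len(lemma), 0, len(form))
    stem = lemma[match.a:match.a + match.size]
    if not stem:
        return lemma, form
    new_stem = list(stem)
    for i in random.sample(range(len(stem)), min(n_changes, len(stem))):
        new_stem[i] = sample_char()          # replace stem characters with sampled ones
    new_stem = "".join(new_stem)
    return lemma.replace(stem, new_stem, 1), form.replace(stem, new_stem, 1)

words = ["vaguear", "vaguearás", "emunger", "emunjamos", "delirar", "deliraren"]
sample_char = unigram_sampler(words)
print(hallucinate("delirar", "deliraren", sample_char))
```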
Not quite there yet: Combining analogical patterns and
encoder-decoder networks for cognitively plausible inflection

Basilio Calderone^a   Nabil Hathout^a   Olivier Bonami^b
^a CNRS, CLLE, Université de Toulouse
^b Université de Paris, LLF, CNRS
[email protected]
[email protected]
[email protected]

Abstract

The paper presents four models submitted to Part 2 of the SIGMORPHON 2021 Shared Task 0, which aims at replicating human judgements on the inflection of nonce lexemes. Our goal is to explore the usefulness of combining pre-compiled analogical patterns with an encoder-decoder architecture. Two models are designed using such patterns either in the input or the output of the network. Two extra models controlled for the role of raw similarity of nonce inflected forms to existing inflected forms in the same paradigm cell, and the role of the type frequency of analogical patterns. Our strategy is entirely endogenous in the sense that the models appeal solely to the data provided by the SIGMORPHON organisers, without using external resources. Our model 2 ranks second among all submitted systems, suggesting that the inclusion of analogical patterns in the network architecture is useful in mimicking speakers' predictions.

1 Introduction

Psycho-computational experiments deal with the capability of computational models of language to capture and mimic the behavioural responses of speakers exposed to the same data stimuli. In phonology and in morphology, a large number of computational models (both symbolic and sub-symbolic) have been tested on various experimental data in order to evaluate their capacity to simulate linguistic behavior (Rumelhart and McClelland, 1986; Nosofsky, 1990; Albright and Hayes, 2003; Hahn and Bailey, 2005; Hay et al., 2004; Albright, 2009).

More recently, Kirov and Cotterell (2018) argue that a neural encoder-decoder network (ED) can perform morphological inflection tasks in a cognitively valid manner. In particular, the authors claim that, in a wug test protocol, the ED's outputs significantly correlate with human judgements for nonce verbs, supporting the assumption that the model learns representations of specific knowledge and involves cognitive mechanisms that are known to underlie language processing in speakers. However, whether and how such models are able to mimic the human behaviour of subjects exposed to the same stimuli is still an open question (Corkery et al., 2019). Part 2 of Shared Task 0 addresses this same issue. More specifically, it adopts the experimental approach of Albright and Hayes (2003). The task is to design models which predict the inflected forms of a set of nonce verbs in a given language and whose output scores for the predictions correlate with human judgements.

In this paper we report on a series of experiments that address the shared task by exploring whether pre-learned formal analogies between strings can be usefully combined with an ED architecture to alleviate some limitations of the application of an ED architecture to raw forms. We present two types of models using analogical patterns in the input (M1) vs. in the output (M2) of an ED network, and compare their performance to that of two baselines that focus respectively on the phonotactic typicality of outputs (M3) and the type frequency of alternations (M4). We report good performance for the M2 architecture and quite poor performance for M1.

In Section 2, we shortly present the data provided by the organisers. Section 3 focuses on the analogical patterns we use in our models. The models' architecture, the training parameters and the results are reported in Section 4. We then discuss the results in Section 5 and draw conclusions (Section 6).

2 Data and goal

The linguistic data provided by the organisers are inflected verb forms in English (ENG), German
(DEU) and Dutch (NLD). For each language, four datasets are released: (a) a training dataset, (b) a development dataset of attested verb forms, (c) a development dataset of wug forms which includes human judgements, and (d) a test dataset of wug forms without the human judgements. The goal of the Shared Task is to assign a score to each wug form in (d) that correlates as closely as possible with the human judgements (see Section 4.2 for more details on the evaluation process).

The entries of the datasets include a lemma/form pair and a UniMorph (Kirov et al., 2018) tag (UT) specifying the part of speech and the paradigm cell of the form. The pairs are provided as written forms (orthographically) in datasets (a) and (b) and in IPA (phonologically) in all four datasets.

Table 1 summarizes the size of the training data (a), its phonological make-up, the number of morphosyntactic tags and the proportion (%) of syncretism, i.e. of forms that fill two or more paradigm cells of the same lexeme. Although the number of phonemes is substantially similar in the three languages, the datasets differ in the number of entries (twice as many entries in DEU as in ENG) and the number of cells. The three datasets have a comparable amount of syncretism.

                  ENG      DEU       NLD
    entries       41 658   100 011   74 176
    phonemes      43       44        39
    UTs           11       29        7
    syncretism %  53       50        42

Table 1: Training data

3 Analogical Patterns

Three of our model architectures rely on alternation patterns (APs) describing the formal relationship between two word-forms. An AP is a pairing of two word-form patterns (WPs) with shared variables over substrings which represent word parts that are common between the two forms. For example, the two German word-forms anSpi:l@n 'to allude to' and anSpi:l@nt 'alluded to' are related by the pattern (anX@n, anX@nt), where the variable X represents Spi:l.

The number of different APs satisfied by a pair of forms is typically large. For instance, there are 256 (2⁸) distinct patterns relating anSpi:l@n and anSpi:l@nt, some of which are shown in (1), where italic capital letters represent variables over strings.

    (1)  WP1            WP2
         anSpi:l@n      anSpi:l@nt     ← TAP
         XnSpi:l@n      XnSpi:l@nt
         aXSpi:l@n      aXSpi:l@nt
         anXpi:l@nt     anXpi:l@n
         aXSYi:Z@T      aXSYi:Z@Tt
         XnYpZlTn       XnYpZlTnt
         X@n            X@nt           ← FAP
         aX             aXt
         XnY            XnYt
         Xn             Xnt
         X              Xt             ← BAP

Henceforth we will note patterns using the '+' symbol as a general notation for variables, and rely implicitly on order to match variables in alternations. Hence e.g. the AP (XnYpZlTn, XnYpZlTnt) will be noted +n+p+l+n/+n+p+l+nt.

Most of the patterns in (1) are of little morphological interest. This is in particular the case of the trivial alternation pattern (TAP), which just records the two strings without making any generalization over common elements. A broad alternation pattern (BAP) is an optimal pattern that can be inferred by pairwise alignment of the two forms, without taking into account the situation in the rest of the paradigm. In principle there can be more than one BAP for a pair of forms, although that rarely happens in practice. This type of pattern is of crucial interest to the study of the implicative structure of paradigms (Ackerman et al., 2009; Ackerman and Malouf, 2013) and the induction of inflection classes (Beniamine et al., 2017), but does not lead to the identification of affixes familiar from typical grammatical descriptions. For that purpose, multiple alignments across the paradigms are necessary (Beniamine and Guzmán Naranjo, 2021), and lead to what we call a fine alternation pattern (FAP): here @n is the infinitive suffix and @nt is the present participle suffix.

The crucial intuition behind the determination of FAPs is that they identify recurrent partials (Hockett, 1947) across both paradigms and lexemes. For instance, the FAP in (1) is motivated by the fact that the substrings an and Spi:l are shared across all pairs of paradigm cells of anspielen (2), while the substrings @n and @nt recur in many (infinitive, present participle) pairs across lexemes (3).
(2) anSpi:l@n Spi:l@t+an V;IND;PST;2;PL ‘to switch over’ is reordered as y:b@rvEks@l@. Sec-
anSpi:l@n Spi:lt@+an V;SBJV;PST;1;SG ond, all phonemes represented by digraphs and
anSpi:l@n Spi:lt+an V;IMP;2;PL trigraphs in IPA were replaced with arbitrary uni-
anSpi:l@n Spi:l@+an V;IMP;2;SG graphs (capital letters; eg. S is substituted for y:).
anSpi:l@n anSpi:l@st V;SBJV;PST;2;SG
anSpi:l@n anSpi:lst V;IND;PRS;2;SG Broad alternation patterns. Each entry in the
datasets consists of the infinitive and another form
(3) tari:fi:r@n tari:fi:r@nt V.PTCP;PRS of some lexeme, accompanied by the UT of the
‘to tariff’ second form. The BAP of a pair of forms is com-
tari:fi:r@n tari:fi:rt@n V;SBJV;PST;3;PL puted through an alignment of the two word-forms
ast@n ast@nt V.PTCP;PRS and the identification of their common parts and
‘to lug’ their differences. The alignment is computed by
ast@n ast@t@t V;SBJV;PST;2;PL means of the SequenceMatcher method of the
vain@n vain@nt V.PTCP;PRS python difflib library; we then go through the

‘to cry’ “
sequences provided by the method and create the
vain@n vaint V;IND;PRS;3;SG word-form patterns by replacing the common parts
“ “
tsErStrait@n tsErStrait@nt V.PTCP;PRS by + and copying the differences in their respec-

‘to disagree’ “
tive patterns. For example, SequenceMatcher
tsErStrait@n tsErStrait@st V;SBJV;PST;2;SG aligns the forms aptail@n and apg@tailt as in (5)
“ “ “ “ BAPs are
In this paper, we rely on an algorithm for in- which yields the ++en/+ge+t BAP.
ferring BAPs and FAPs initially developed to cre- therefore calculated separately for each entry con-
ate Glawinette (Hathout et al., 2020). Glawinette sidering only the two forms.
is a French derivational lexicon created from the
(5) word-form1 ap tail @n
definitions of the GLAWI machine readable dic-
word-form2 ap g@ tai“l t
tionary (Sajous and Hathout, 2015; Hathout and “
BAP1 + + @n
Sajous, 2016). Glawinette provides a description
BAP2 + g@ + t
of derivational morphology by means of morpho-
logical families and derivational series; it is part Note that a BAP can also be seen as a charac-
of an effort aiming at the design of derivational terization of an analogical series. For instance,
paradigms. BAPs and FAPs have been adapted to the pairs of forms in (4) can all be aligned in ex-
the datasets of the current task, analogizing inflec- actly the same way as in (5), they all have the same
tional paradigms to derivational families and pairs BAP ++@n/+g@+t and they form formal analogies
of inflectional paradigm cells to derivational series. (Lepage, 1998, 2004b,a; Stroppa and Yvon, 2005;
For instance, (4) presents an excerpt of an inflec- Langlais and Yvon, 2008). More specifically, if two
tional series in the inflectional paradigm of the verb pairs of forms (F1 , F2 ) and (F3 , F4 ) have the same
anspielen that realizes the features V;NFIN and BAP, then F1 : F2 :: F3 : F4 . BAPs could also
V.PTCP;PST. In turn, this series yields two series be computed for entire inflectional paradigms as
of word-forms, the ones in the left column and the proposed by Hulden (2014). Also note that BAPs
ones in the right column. are not specific to an inflection class, as two classes
(4) apzu:x@n apg@zu:xt ‘to search’ may exhibit common behavior in one part of their
aplOx@n apg@lOxt ‘to punch off’ paradigm but not in another. For instance, the BAP
aprYk@n apg@rYkt ‘to disengage’ +/+s describes the formal relation that connects
apgUk@n apg@gUkt ‘to peek’ the infinitive and the V;PRS;3;SG form of both
regular (work) and irregular English verbs (eat).
Basic preprocessing. The forms in the test set
(d) being in IPA, we only computed phonological Fine alternation patterns. Unlike BAPs which
BAPs and FAPs. BAPs and FAPs have been com- are derived solely from the examination of pair-
puted for all the entries of all four datasets. In wise alternations, FAPs rely on the place of the two
addition, two basic modifications were performed. word-forms in the overall morphological system to
First, particles were reordered so as to appear in the identify more stable recurrent partials correspond-
same position in the infinitival and inflected word- ing to traditional exponents. For instance, the BAP
forms. For instance, wechsele über, vEks@l@ y:b@r relating the German weak verbs like anspielen to

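To make the BAP computation described above concrete, the following minimal sketch (our own illustrative code, not the system submitted to the task) derives the pattern pair of example (5) from the opcodes returned by difflib.SequenceMatcher; for simplicity it treats each word-form as a plain string of one-character symbols, whereas the task data use space-separated IPA symbols.

```python
import difflib

def bap(form1, form2):
    """Broad alternation pattern of two word-forms: parts that the
    SequenceMatcher alignment marks as equal are replaced by '+', and the
    differing substrings are copied into their respective patterns."""
    matcher = difflib.SequenceMatcher(None, form1, form2, autojunk=False)
    pattern1, pattern2 = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pattern1.append("+")
            pattern2.append("+")
        else:  # 'replace', 'delete' or 'insert': keep the differences
            pattern1.append(form1[i1:i2])
            pattern2.append(form2[j1:j2])
    return "".join(pattern1), "".join(pattern2)

print(bap("aptail@n", "apg@tailt"))  # ('++@n', '+g@+t'), as in (5)
```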
its present participle anspielend, relying on the op- terns characterize a large enough subset of the pairs
timal alignment between the two forms, does not in Φ. These APs are obtained as follows.
identify the infinitive and past participle exponents We first collect the WPs that possibly charac-
-en and -end. These cannot be deduced from an terize the word-forms in Φ1 by computing a pat-
isolate pair of word-forms, and require considering, tern of word-forms for each pair of word-forms
across the paradigm, all the pairs of word-forms made up of two word-forms from Φ1 . These pat-
that include infinitives or present participles and terns are dual of the ones illustrated in (5) as we
finding out the pair of endings that best character- need to represent what the word-forms have in
izes, across lexemes, the infinitives “similar” to common and not their differences. For instance,
anspielen and the present participles “similar” to in the second column of (1), the pattern that de-
anspielend. The main challenges in the identifi- scribes the common part of apg@zu:xt and apg@lOxt
cation of the FAPs are then (i) that they involve is apg@+xt and the one for the common part of
the entire dataset and cannot be computed locally apg@lOxt and apg@rYkt is apg@+t. If the num-
for a single pair of word-forms; (ii) that we need ber of word-forms in the column is large and var-
to formally define what “similar” means; (iii) that ied enough, all the relevant WPs that characterize
we potentially need to consider all the APs of all a part of the word-forms will be collected. We
the pairs of words included in the dataset; (iv) that then align the patterns obtained for the two col-
we need a reasonable operational approximation of umn. This is done by considering the WPs as if
what could considered as linguistically relevant. they were word-forms and computing their ana-
logical signature, i.e. their BAP. For instance, we
(i) The regularities that determine the FAPs are
have aplOx@n : aprYk@n :: apg@lOxt : apg@rYkt. The
holistic properties of the dataset, i.e. of the union of
BAP for the first pair is ++@n/+g@+t and the
the datasets (a), (b), (c) and (d). The consequence
BAP for the second is identical; the pattern that
is that each FAP depends on the entire dataset, and
characterizes aplOx@n:aprYk@n is ap+@n and the
FAPs have to be recomputed each time any of the
one that characterizes the second is apge+t.
datasets (a), (b), (c) or (d) is modified.
These two WPs are aligned because their BAP is
(ii) The pairs in (4) are good examples of what ++@n/+g@+t, i.e. the same as the BAP of the two
similar may mean, from an inflectional point of pairs of word-forms.
view. This type of similarity can be defined in terms By doing the same computation for all pairs of
of analogy. We first assume that pairs of forms word-forms and matching them with respect to their
that satisfy the same BAP constitute an analogical BAP, we end up with a number of FAP candidates
series (as they satisfy the formal analogy encoded that we first screen in order to exclude those that
by the BAP). Word-forms belonging to the same describe only a small part of Φ, or that are made
column in an analogical series are then considered up of WPs that describe a small part of Φ1 or Φ2 .
as similar. In our example, the word-forms in each Another feature that helps select valuable FAPs is
column in (4) count as similar. the number of variable parts (+) it contains. For our
models, we only used FAPs that contain exactly one
(iii) We limit the number of patterns to be con- variable part, but this number could be increased
sidered by looking only at the ones that are involved for languages with templatic morphology or that
in the characterization of similar word-forms. In make use of infixes.
other words, once the sets of similar word-forms
are created, we only consider the similarities that (iv) We assume that optimal FAPs are pairs of
exist between the word-forms that belong to each WPs that recur both within the paradigm and across
set, since only these ones may be part of an FAP. the lexicon, as we illustrated in (2) and (3). For
Let Φ be the set of pairs of word-forms satisfy- instance, the FAP of a pair of word-forms anSpi:l@n
ing some BAP, and Φ1 (resp. Φ2 ) the set of word- and anSpi:l@nt consists of the aligned patterns de-
forms that are the first (resp. second) element of scribing the largest number of word-forms similar
a pair in Φ. What we are looking for are the pat- to anSpi:l@n on the one hand, and the largest num-
terns that characterize a large enough subset of the ber of word-forms similar to anSpi:l@nt on the other
word-forms in Φ1 that are in correspondence with hand. It turns out that this is the pattern +@n/+@nt.
patterns that characterize a large enough subset of More precisely, let (F1 , F2 ) be a pair of word-
the word-forms in Φ2 , i.e. such that the pair of pat- forms and let {(P1 , Q1 ), (P2 , Q2 ), ..., (Pn , Qn )}

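The word patterns collected for FAP candidates can be sketched in the same way (again our own illustrative code): instead of keeping the differences, the shared material of two word-forms from the same column is kept and each aligned difference is replaced by +.

```python
import difflib

def word_pattern(form1, form2):
    """Word pattern (WP) of two word-forms from the same paradigm column:
    the dual of the BAP, keeping the shared material and replacing each
    aligned difference with '+'."""
    matcher = difflib.SequenceMatcher(None, form1, form2, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        parts.append(form1[i1:i2] if tag == "equal" else "+")
    return "".join(parts)

print(word_pattern("apg@zu:xt", "apg@lOxt"))  # 'apg@+xt'
print(word_pattern("apg@lOxt", "apg@rYkt"))   # 'apg@+t'
```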
be the FAP candidates connecting F1 and F2 (i.e. in Section 3 provide a description complementary
the set of the aligned WPs of F1 and F2 ). Let to the lemma-form mapping in which analogical
|X| be the number of form pairs that satisfy an regularities may be computed locally from a single pair of forms
alternation that contains the WP X. The FAP of (BAPs) or globally from the entire lexicon (FAPs).
(F1, F2) is then the WP pair (Pi, Qi) such that These paradigmatic analogies emerge when the
|Pi| + |Qi| = max_{j=1..n}(|Pj| + |Qj|). FAPs are forms are contrasted with all other forms of their
therefore selected separately for each pair of word- lexeme and the other forms that occupy the same
forms. cell in the paradigm (Bonami and Beniamine, 2016;
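Under the notation just introduced, selecting the FAP for one pair of word-forms reduces to a maximization over the candidate pattern pairs. A minimal sketch follows, in which candidates and support are illustrative names (support[X] is assumed to hold the number of form pairs whose alternation contains the word pattern X):

```python
def select_fap(candidates, support):
    """Pick the FAP of a pair of word-forms: the aligned word-pattern pair
    (Pi, Qi) with maximal combined support |Pi| + |Qi|."""
    return max(candidates, key=lambda pq: support[pq[0]] + support[pq[1]])

support = {"+@n": 980, "+@nt": 975, "l@n": 12, "l@nt": 11}       # toy counts
print(select_fap([("+@n", "+@nt"), ("l@n", "l@nt")], support))   # ('+@n', '+@nt')
```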
The models M1 and M2 presented in Section 4 Ahlberg et al., 2015; Albright, 2002).
use FAPs computed from the union of the datasets The models we designed for the task combine
(a), (b), (c) and (d) for each of the three languages the capacity of the sequence-to-sequence models to
of the task. learn the regularities present in strings of phonemes
with the alternation patterns acquired from the
Discussion. BAPs and FAPs give different types
paradigms, in order to predict native speaker re-
of information: BAPs capture relations between
sponses in a wug test.
pairs of forms independently of the rest of the sys-
tem, and are hence crucial to addressing the im- 4.1 Models
plicative structure of paradigms (Wurzel, 1989).
We designed four models for the shared task.
FAPs on the other hand characterize a pair of forms
taking into account their place in the rest of the sys- 4.1.1 Model 1
tem; this typically leads to more specific patterns In the first model, M1, we consider morphologi-
that are satisfied by a smaller set of pairs. cal inflection as a mapping over sequences. The
mapping is implemented by bidirectional LSTMs
4 Combining analogical patterns and
with dropouts (Hochreiter and Schmidhuber, 1997;
encoder-decoder networks Gal and Ghahramani, 2016). The hidden states
Early work on connectionist models of the ac- of the encoder are used to initialize the decoder
quisition of morphology involved pattern associ- states. The model adopts a teacher forcing strategy
ators that learn relationships between a base lex- to compute the decoder’s state in the next time-
ical form (i.e. the lemma) and a target form (i.e. step. M1 takes as input four sequences: the lemma
the inflected form). For example, Rumelhart and (IPA-encoded), the UT, the BAP and the FAP pat-
McClelland (1986) propose a simulation of how terns. The output sequence is the inflected form
English past tense is learned. They focus on pairs (IPA-encoded). The output layer uses a sigmoid to
of verb forms like go-went and walk-walked and produce a probability distribution over the output
consider that morphological learning is a gradual phonemes.
process which includes an intermediate phases of
M1 Input: {lemma + UT + BAP + FAP}
“over-regularization” (where the past form of go
Output: {inflected form}
is goed instead of went). This yields the well-
known“U-shape” curves observed in the develop- The probability score assigned to the wug forms
mental phases of morphological competence in is the joint probability of the its phonemes. The
children. model M1 addresses the task directly. We expected
More recently, models based on deep learning the prediction of a model that uses all the available
architectures have been used (Malouf, 2017) and information including BAPs and FAPs would be
in particular sequence-to-sequence models able to accurate and highly correlated with the judgments
predict one form of a lexeme from another (Faruqui of the speakers.
et al., 2016; Kirov and Cotterell, 2018).
These approaches are based on the assumption 4.1.2 Model 2
that the morphological learning can reduce to a The second model, M2, relies on FAPs to identify
simple mapping between a base form and an in- the crucial thing to be predicted in a wug task,
flected one. Generalizations over similar mappings namely the inflectional pattern of the output form.
(e.g. love-loved,walk-walked vs. sing-sang, ring- Hence the model is trained to predict, instead of the
rang) are learned from the dependences between raw output form, the word pattern that constitutes
the phonemes in sequences. The APs presented the second part of the FAP (FAP2 ) and identifies

its place in the inflection system while abstracting This is meant as a very simple baseline, cap-
away from what is common between the input and turing in a very crude fashion the intuition that
output forms. speakers judge as more natural wugs that fit into a
more frequent pattern.
M2 Input: {lemma + UT}
Output: {FAP2 } 4.2 Results
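One illustrative way of building M2 training pairs from the task data (the exact encoding used in the submitted system is not shown here): the source sequence concatenates the lemma's IPA symbols with the UT feature tokens, and the target is the output-side word pattern FAP2.

```python
def m2_example(lemma_ipa, ut, fap):
    """Build one (source, target) pair for M2 under an assumed encoding:
    source = lemma symbols + UT features, target = FAP2 characters."""
    source = lemma_ipa.split() + ut.split(";")
    target = list(fap[1])            # fap = (FAP1, FAP2)
    return source, target

src, tgt = m2_example("a n S p i: l @ n", "V.PTCP;PRS", ("+@n", "+@nt"))
# src = ['a', 'n', 'S', 'p', 'i:', 'l', '@', 'n', 'V.PTCP', 'PRS']
# tgt = ['+', '@', 'n', 't']
```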

Computationally M2 is similar to M1 except for The submissions to the task are evaluated using
the input/output sequences involved. In particular, the AIC scores from a mixed-effects beta regres-
the probability score of a wug form is the joint sion model (Magnusson et al., 2017) where the
probability of the output symbols. scaled human ratings (DV) were predicted from
the submitted model’s ratings (IV). The regression
4.1.3 Model 3 implements a random intercept for lemma type. Ta-
Our third model, M3, estimates a possible word- ble 2 reports the AIC scores of the test data for the
likeness effects due to phonological similarity of three languages.
the inflected forms that have the same UT. Word-
likeness is the extent to which a sound sequence Models ENG NLD DEU
composing a form is phonologically acceptable in M1 −33.4 −60.0 −16.1
a given language. It mostly depends on the phono- M2 −43.0 −66.0 −98.8
tactic knowledge of the speakers (Vitevitch and M3 −37.5 −64.9 −12.9
Luce, 2005; Hahn and Bailey, 2005) and on the ex- M4 −40.7 −36.8 −72.9
istence of phonologically similar words in the men-
Table 2: AIC scores calculated on the basis of the final
tal lexicon (Albright, 2007). For example, a wug test data. Lower scores are better.
past form like saIndId included in the English test
dataset could trigger wordlikeness effect because
it is similar to an attested past form saIdId (sided). 5 Discussion
For each of the three languages, we designed a
classifier which predicts whether an inflected form The performance of our four models suggest the
is assigned to a specific UT in the train set. The following observations. First, M2 outperforms our
target UTs are the ones of the inflected forms in the three other models for all three languages, and
three test sets (d), namelyV;PST;1;SG for ENG, also ranked second of all systems submitted to the
V;PST;PL for NLD and V.PTCP;PST for DEU. shared task. Second, there is a striking difference
Technically, for each language, the M3 model is in performance between M2 and M1, which had an
an LSTM-based binary classifier which takes the similar architecture, but performed very poorly—
inflected form as input and outputs whether it is worse than our baseline M4 model, and second to
assigned to the target UT (value 1) or not (value 0). last of all systems submitted to the shared task. Al-
At training time, the forms which are assigned to though more experiments are needed to conclude
the target UT and to another UT, are only kept with on this point, we conjecture that the better per-
their target UT. formance of M2 might be due to the fact that it
abstracts away from the question of predicting the
M3 For inflected UT in the test set (d), shape of the stem in the output, but focuses instead
Input: {inflected form} on that part of the inflected form that is not to be
Output: {0,1} found in the input. This seems to match intuitions
about human behavior: when dealing with inflec-
The score assigned to the wug form is simply the
tions, speakers may have a hard time applying the
probability outputted by the system.
right pattern, but they never have a hard time re-
4.1.4 Model 4 membering what the stem looks like, even if it is
Our fourth model, M4, simply uses the type fre- phonotactically unusual (see Virpioja et al. 2018
quency of the BAP relating the wug lemma and the for a psycho-computational study).
wug form as a score for the test dataset. The other surprising result is that M4, which
was intended as a crude baseline, did surprisingly
M4 Raw type frequency of the BAP relating the well on the English and German data, although it
wug lemma and wug form performed very poorly on Dutch. This is interest-

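The M4 baseline can be sketched in a few lines (our own illustrative code, reusing a BAP function such as the one sketched earlier):

```python
from collections import Counter

def train_m4(training_pairs, bap):
    """M4: record the type frequency of every BAP observed in training."""
    return Counter(bap(lemma, form) for lemma, form in training_pairs)

def score_m4(bap_counts, bap, wug_lemma, wug_form):
    """Score a wug entry by the raw type frequency of the BAP relating
    the wug lemma to the wug form (0 if the BAP is unattested)."""
    return bap_counts[bap(wug_lemma, wug_form)]
```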
ingly complementary to the performance of M3, objective of our participation was to test different
which did surprisingly well on Dutch but poorly on hypotheses. The main one is the relatively low
German. As M3 is entirely focused on phonotactic importance of stems when predicting the accept-
similarity while M4 is focused on the frequency ability of wug forms, as evidenced by the good
of alternations, this suggests that the three inflec- performance of the M2 model, which only
tion systems (to the extent that they are faithfully predicts the FAP of the inflected form. Therefore,
represented by the datasets) raise different kinds of M2 is output-oriented in the sense that the proper-
challenges to speakers. ties that characterize the input, i.e. the lemma, are
To better assess the quality of M2, we exam- not used during training.
ined how well it statistically correlates with human M1 and M2 models are able to predict inflected
performance in Albright and Hayes’s (2003) exper- forms and FAP2 patterns for any UT in the training
iments on islands of reliability (IOR) in regular and set while M3 models are specialized on a single
irregular inflection in English. Albright and Hayes UT. In future work, we plan to develop specialized
are trying to establish that speakers rely on struc- versions of M1 and M2 in order to estimate the im-
tured linguistic knowledge as encoded in their Min- portance of the tested inflectional series (i.e. of the
imal Generalization Learner (MGL, Albright and set of form pairs with the same UTs as the entries in
Hayes, 2006) rather than pure analogy when inflect- test set) with respect to the entire training set. We
ing novel words. To establish this, they collected further plan to test our models on more complete
both productions of human participants asked to datasets in which the inflected forms could be pre-
inflect a novel word, and judgments on pairs of dicted from other forms than the lemma, but also
word-forms. They show that the MGL leads to a jointly from several forms of the lexeme (Bonami
better correlation with human results than a purely and Beniamine, 2016).
analogical system based on Nosofsky (1990) (NOS
in the table below). As Table 3 shows, our M2 per- Acknowledgement
forms at a level comparable to the MGL. More pre- Experiments presented in this paper were carried
cisley, it clearly outperforms it on irregular verbs out using the OSIRIM platform, that is adminis-
while trailing on regulars. Importantly, M2 does tered by IRIT and supported by CNRS, the Region
that without relying on any structured knowledge Occitanie, the French Government and ERDF.
of the kind found in the MGL, although it does rely
on a more complete view of the morphological sys-
tem. This suggests that the conclusions of Albright References
and Hayes should be reconsidered. Farrell Ackerman, James P. Blevins, and Robert Mal-
ouf. 2009. Parts and wholes: implicative patterns in
Ratings Production inflectional paradigms. In James P. Blevins and Juli-
Models probabilities ette Blevins, editors, Analogy in Grammar, pages
54–82. Oxford University Press, Oxford.
reg. irr. reg. irr.
MGL 0.745 0.570 0.678 0.333 Farrell Ackerman and Robert Malouf. 2013. Morpho-
NOS 0.448 0.488 0.446 0.517 logical organization: the low conditional entropy
M2 0.583 0.595 0.611 0.560 conjecture. Language, 89:429–464.
Malin Ahlberg, Markus Forsberg, and Mans Hulden.
Table 3: Pearson correlations (r) of participant re- 2015. Paradigm classification in supervised learn-
sponses to models. Core IOR verbs (n = 41). See ing of morphology. In Proceedings of the 2015 Con-
Albright and Hayes (2003) for the list of nonce verbs ference of the North American Chapter of the Asso-
exploited in the experiment ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 1024–1029, Denver, Col-
orado. Association for Computational Linguistics.
6 Conclusion Adam Albright. 2002. Islands of reliability for regu-
lar morphology: Evidence from italian. Language,
At the time of writing, we do not have the descrip- 78:684–709.
tions of the other systems that were submitted to
Adam Albright. 2007. Gradient phonologi-
the shared task. As a result, we are not able to cal acceptability as a grammatical effect.
identify the reasons for the good and not so good https://www.mit.edu/˜albright/papers/
performance of the four systems we submitted. The Albright-GrammaticalGradience.pdf.

Adam Albright. 2009. Feature-based generalisation as a source of gradient acceptability. Phonology, 8:9–41.

Adam Albright and Bruce Hayes. 2003. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2):119–161.

Adam Albright and Bruce Hayes. 2006. Modeling productivity with the gradual learning algorithm: The problem of accidentally exceptionless generalizations. In Gisbert Fanselow, Caroline Féry, Ralf Vogel, and Matthias Schlesewsky, editors, Gradience in Grammar: Generative Perspectives, pages 185–204. Oxford University Press, Oxford.

Sacha Beniamine, Olivier Bonami, and Benoît Sagot. 2017. Inferring inflection classes with description length. Journal of Language Modelling, 5(3):465–525.

Sacha Beniamine and Matías Guzmán Naranjo. 2021. Multiple alignments of inflectional paradigms. In Proceedings of the Society for Computation in Linguistics, volume 4.

Olivier Bonami and Sacha Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure, 9:156–182.

Maria Corkery, Yevgen Matusevych, and Sharon Goldwater. 2019. Are we there yet? Encoder-decoder neural networks as cognitive models of English past tense inflection.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Ulrike Hahn and Todd M. Bailey. 2005. What makes words sound similar? Cognition, 97:227–267.

Nabil Hathout and Franck Sajous. 2016. Wiktionnaire's Wikicode GLAWIfied: a workable French machine-readable dictionary. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Nabil Hathout, Franck Sajous, Basilio Calderone, and Fiammetta Namer. 2020. Glawinette: a linguistically motivated derivational description of French acquired from GLAWI. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pages 3870–3878, Marseille.

Jennifer Hay, Janet Pierrehumbert, and Mary E. Beckman. 2004. Speech perception, well-formedness and the statistics of the lexicon. In John Local, Richard Ogden, and Rosalind Temple, editors, Phonetic Interpretation: Papers in Laboratory Phonology VI, pages 58–74. Cambridge University Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Charles F. Hockett. 1947. Problems of morphemic analysis. Language, 23:321–343.

Mans Hulden. 2014. Generalizing inflection tables into paradigms with finite state operations. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, pages 29–36, Baltimore, Maryland.

Christo Kirov and Ryan Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics, 6:651–665.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Philippe Langlais and François Yvon. 2008. Scaling up analogical learning. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 51–54, Manchester.

Yves Lepage. 1998. Solving analogies on words: An algorithm. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics, volume 2, pages 728–735, Montréal.

Yves Lepage. 2004a. Analogy and formal languages. Electronic Notes in Theoretical Computer Science, 53:180–191.

Yves Lepage. 2004b. Lower and higher estimates of the number of true analogies between sentences contained in a large multilingual corpus. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), pages 736–742, Genève.

Arni Magnusson, Hans J. Skaug, Anders Nielsen, Casper W. Berg, Kasper Kristensen, Martin Maechler, Koen J. van Bentham, Benjamin M. Bolker,

and Mollie E. Brooks. 2017. glmmTMB: Gener-
alized Linear Mixed Models using Template Model
Builder.
Robert Malouf. 2017. Abstractive morphological learn-
ing with a recurrent neural network. Morphology,
27:431–458.
Robert M. Nosofsky. 1990. Relations between
exemplar-similarity and likelihood models of clas-
sification. Journal of Mathematical Psychology,
34:393–418.
David. E. Rumelhart and James. L. McClelland. 1986.
On learning the past tense of English verbs. In D. E.
Rumelhart and J. L. McClelland, editors, Parallel
Distributed Processing, volume 2, pages 216–271.
MIT Press.
Franck Sajous and Nabil Hathout. 2015. GLAWI,
a free XML-encoded Machine-Readable Dictionary
built from the French Wiktionary. In Proceedings
of the of the eLex 2015 conference, pages 405–426,
Herstmonceux, England.
Nicolas Stroppa and François Yvon. 2005. An analog-
ical learner for morphological analysis. In Proceed-
ings of the 9th Conference on Computational Natu-
ral Language Learning (CoNLL-2005), pages 120–
127, Ann Arbor, MI. ACL.
Sami Petteri Virpioja, Minna Lehtonen, Annika Hultén,
Henna Kivikari, Riitta Salmelin, and Krista Lagus.
2018. Using statistical models of morphology in the
search for optimal units of representation in the hu-
man mental lexicon. Cognitive Science, 42(3):939–
973.
Michael S. Vitevitch and Paul A. Luce. 2005. In-
creases in phonotactic probability facilitate spoken-
nonword repetition. Journal of Memory and Lan-
guage, 52:93–204.
Wolfgang Ulrich Wurzel. 1989. Inflectional Morphol-
ogy and Naturalness. Kluwer, Dordrecht.

Were We There Already? Applying Minimal Generalization to the
SIGMORPHON-UniMorph Shared Task on Cognitively Plausible
Morphological Inflection

Colin Wilson1 Jane S.Y. Li1,2


1 Johns Hopkins University    2 Simon Fraser University
[email protected] [email protected]

Abstract by research on psychological categories (Nakisa


et al., 2001).
Morphological rules with various levels of Along with Albright (2002), which presents a
specificity can be learned from example lex- parallel treatment of Italian inflection, Albright &
emes by recursive application of minimal gen-
Hayes’s study of the English past tense is an ideal
eralization (Albright and Hayes, 2002, 2003).
A model that learns rules solely through min- example of theory-driven, multiple-methodology,
imal generalization was used to predict av- open and reproducible research in cognitive sci-
erage human wug-test ratings from German, ence.2 Their model has enduring significance for
English, and Dutch in the SIGMORPHON- the study of morphological learning and productiv-
UniMorph 2021 Shared Task, with compet- ity in English (e.g., Rácz et al., 2014, 2020; Corkery
itive results. Some formal properties of et al., 2019) and many other languages (e.g., Hi-
the minimal generalization operation were
jazi Arabic: Ahyad 2019; Japanese: Oseki et al.
proved,experimentalntially pruned. An auto-
matic method was developed to create wug-
2019; Korean: Albright and Kang 2009; Navajo:
test stimuli for future experiments that inves- Albright and Hayes 2006; Portuguese: Veríssimo
tigate whether the model’s morphological gen- and Clahsen 2014; Russian: Kapatsinski 2010; Tg-
eralizations are too minimal. daya Seediq: Kuo 2020; Spanish: Albright and
Hayes 2003; Swedish: Strik 2014).
1 Introduction In this study, we applied a partial reimplemen-
tation of the Albright and Hayes (2002, 2003)
In a landmark paper, Albright and Hayes (2003) model to wug-test rating data from three lan-
proposed a model that learns morphological rules guages (German, English, and Dutch) collected
by recursive minimal generalization from lexeme- for the SIGMORPHON-UniMorph 2021 Shared
specific examples (e.g., I → 2 / st N for sting ∼ Task. Our version of the model is based purely
stung and I → 2 / fl N for fling ∼ flung general- on minimal generalization of morphological rules,
ized to I → 2 / X [−syllabic, +coronal, +anterior, as described in §3.1 of Albright and Hayes (2002)
...] N).1 The model was presented more for- and reviewed below. It does not include additional
mally in Albright and Hayes (2002), along with mechanisms for learning phonological rules, and
evidence that the rules it learns for the English further expanding or reigning in morphological
past tense give a good account of native speakers’ rules, that were part of the original model (see Al-
productions and ratings in wug-test experiments bright and Hayes, 2002, §3.3 - §3.7). We think it is
(e.g., judgments that splung is quite acceptable as
the past tense of the novel verb spling). In addi- 2
Albright & Hayes released both the results of their
tion to providing further analysis of the behavioral wug-test experiments and an implementation of their
model (see http://www.mit.edu/~albright/mgl/
data, Albright and Hayes (2003) compared their and https://linguistics.ucla.edu/people/
proposal with early connectionist models of mor- hayes/RulesVsAnalogy/index.html). An im-
phology (e.g., Plunkett and Juola, 1999) and an pediment to large-scale simulation with the model is
that it runs from a GUI interface only. As part of the
analogical or ‘family resemblance’ model inspired present project, we have added a command line interface
to the original source code and converted the English
1
The square brackets contain all of the the shared phono- input files to a more user-friendly format (available on
logical feature specifications of /t/ and /l/, which in the request). We are aware of one other implementation of
system used here are [−syllabic, +consonantal, −nasal, the minimal generalization model, due to João Verís-
−spread.gl, −labial, −round, −labiodental, +coronal, simo, but this was unavailable at the time of our study
+anterior, −distributed, −strident, −dorsal]. (https://www.jverissimo.net/resources).
worthwhile to consider minimal generalization on 2.2 Base rules
its own, with other mechanisms ablated, as borne
For each wordform pair, the model constructs a
out by our competitive results on the shared task.
lexeme-specific morphological rule by first identi-
1.1 Outline fying the longest common prefix (lcp) of the word-
forms excluding n (i.e., the left-hand rule context
In §2 we review the definition of minimal general- C), then the longest common suffix from the re-
ization proposed by Albright & Hayes and prove mainder (the right-hand context D), and finally
a number of original results about the operation identifying the remaining symbols in the first (A)
and its recursive application in learning rules. We and second (B) wordform. The resulting rule is
also define a generality relation that can be used to A → B/C D. The symbol ∅ ∈ / Σ# denotes
prune insufficiently broad rules without affecting the empty string in A or B.4 To illustrate, the rule
the model’s predictions. In §3 we describe how formed from howOkn, owOktni has the compo-
we preprocessed the shared task training data and nents C = owOk, D = n, A = ∅ and B = t (i.e., ∅
generated predicted wug-test ratings, and report → t / owOk n). The rule for hok2tn, ok2tni
our results on the task. We briefly summarize our is ∅ → ∅ / ok2t n.
findings in §4 and conclude by discussing a novel
method for constructing wug items that can be used 2.3 Minimal Generalization
in future empirical tests of minimal generalization
Given any two base rules R1 and R2 that make the
and other approaches to morphological learning.
same change (A → B), the model forms a possi-
2 Minimal Generalization bly more general rule by aligning and comparing
their contexts. The minimal generalization oper-
2.1 Inputs ation, R = R1 u R2 , carries over the common
change of the two base rules and applies indepen-
The model takes as input a set of wordform pairs,
dently to their left-hand (C1 , C2 ) and right-hand
one per lexeme, that instantiate the same morpho-
(D1 , D2 ) contexts. For convenience, we define
logical relationship in a language. In simulations of
minimal generalization of the right-hand contexts.
English past tense formation, these are pairs of bare
Minimal generalization of the left-hand contexts
verb stems and past tense forms such as howOkn,
can be performed by reversing C1 and C2 , applying
owOktni, hotOkn, otOktni, hostINn, ost2Nni,
the definition for right-hand contexts, and reversing
hoflINn, ofl2Nni, and hok2tn, ok2tni for the
the result.
lexemes walk, talk, sting, fling, and cut. Word-
forms consist of phonological segments (here, in The minimal generalization D = D1 u D2 is de-
broad IPA transcription) delimited by special be- fined precedurally by first extracting the lcp σ1∧2
ginning and end of string symbols. The set Σ of of the two contexts and then operating on the re-
phonological segments for the language, and the mainders (D10 , D20 ). If both D10 and D20 are empty
set Σ# = Σ ∪ {o, n}, are provided to the model. then D = σ1∧2 . If one but not both of them are
empty then D = σ1∧2 X, where X ∈ / Σ# is a vari-
The model also requires a phonological feature
able over symbol sequences (i.e., X stands for Σ∗# ).
specification for each of the symbols that appears
If neither remainder is empty, then the operation
in wordforms. We adopted a well-known feature
determines whether their initial symbols have any
system, augmenting it with orthogonal and distinct
shared features; for this purpose it is useful to con-
feature specifications for the delimiters o and n.3
sider φ(x) as a function from symbols to sets of
The set Φ contains all possible (partial) specifica-
feature-value pairs, so that common features are
tions of the features and φ(x) gives the specifica-
found by set intersection.
tions of x ∈ Σ# .
If there are no common features, φ1∩2 = ∅, then
3
The phonological features are available from Bruce as before D = σ1∧2 X. Otherwise, the set of com-
Hayes’s website (https://linguistics.ucla.edu/ mon features φ1∩2 6= ∅ is appended to σ1∧2 , the
people/hayes/120a/Index.htm#features).
These features are all binary, with the possibility of under- first symbol is removed from D10 and D20 , and the
specification, while Albright & Hayes’s original simulations
4
made use of some multi-valued scalar features. Alternative We instead use λ ∈ / Σ# to stand for the empty string
sources of binary feature systems that are compatible with our in left- and right- hand contexts. Our notation for strings of
implementation include PHOIBLE (Moran et al., 2014) and phonological segments generally follows Chandlee (2017) and
PanPhon (Mortensen et al., 2016). research cited there.
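The base-rule construction can be sketched as follows (our own illustrative code; square brackets stand in for the beginning- and end-of-string symbols, and word-forms are plain strings of one-character symbols):

```python
def base_rule(w1, w2):
    """Lexeme-specific rule A -> B / C __ D for one wordform pair:
    C is the longest common prefix excluding the end symbol, D the longest
    common suffix of the remainders, and A/B are the residues ('' = zero)."""
    limit = min(len(w1), len(w2)) - 1        # keep the end symbol out of C
    i = 0
    while i < limit and w1[i] == w2[i]:
        i += 1
    j = 0
    while j < min(len(w1), len(w2)) - i and w1[len(w1) - 1 - j] == w2[len(w2) - 1 - j]:
        j += 1
    A, B = w1[i:len(w1) - j], w2[i:len(w2) - j]
    C, D = w1[:i], w1[len(w1) - j:]
    return A, B, C, D

print(base_rule("[wOk]", "[wOkt]"))   # ('', 't', '[wOk', ']')
print(base_rule("[stIN]", "[st2N]"))  # ('I', '2', '[st', 'N]')
```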
operation processes the remainders. If both remain- identify the lcp of symbols from Σ# in the two
ders are empty then D = σ1∧2 φ1∩2 , otherwise contexts (σ1∧2 ) and then operate on the remainders
D = σ1∧2 φ1∩2 X. (D10 , D20 ). If both D10 and D20 are empty then D =
To summarize, the generalized right-hand con- σ1∧2 . If one but not both of them are empty then
text D consists of the longest common prefix D = σ1∧2 X. If both are non-empty then their
shared by D1 and D2 , followed by a single set initial elements are either symbols in Σ# , feature
of shared features (if any), followed by X in case sets in Φ, or X. Replace any initial symbol x ∈ Σ#
there are no shared features or one context is longer with its feature set φ(x), extend the function φ so
than the other. With the change and generalized that φ(X) = ∅, and compute the union φ1∩2 of
left-hand context C determined as noted above, the the initial elements. The rest of the definition is
result of applying minimal generalization to the unchanged (see end of §2.3).
two base rules is R = A → B/C D.5 By construction, the contexts that result from this
operation are also in Σ∗# (Φ)(X) (i.e., no ordinary
2.4 Recursive Minimal Generalization
symbol can occur after a feature set, there is at
Let R1 be the set of base rules (one per wordform most one feature set, X can only be a terminal
pair in the input data) and R2 be the set contain- element, etc.). Therefore, the revised definition
ing all of the base rules and the result of apply- supports the application of minimal generalization
ing minimal generalization to each eligible pair to its own products. Let Rk be the set of rules
of base rules. While the rules of R2 have greater containing every member of Rk−1 and the result
collective scope than those of R1 , they are nev- of applying minimal generalization to each eligible
ertheless unlikely to account for the level of mor- pair of rules in Rk−1 (for k > 1). In principle,
phological productivity shown by native speakers. there is an infinite sequence of rules set related by
For example, English speakers can systematically inclusion R1 ⊆ R2 ⊆ R3 · · · . In practice, the
rate and produce past tense forms of novel verbs equality becomes strict after a small number of
that contain unusual segment sequences, such as iterations of minimal generalization (typically 6-7),
ploamf /ploUmf/ (e.g., Prasada and Pinker, 1993). at which point there are no more rules to be found.
Albright & Hayes propose to apply minimal gen-
eralization recursively and demonstrate that this 2.5 Completeness
can yield rules that are highly general (e.g., in our
notation, ∅ → t / X [-voice] n). Having defined minimal generalization for arbitrary
In the original proposal, recursive minimal gen- contexts (as allowed by the model), we can revisit
eralization was defined only for pairs that include the conjecture that nothing is lost by restricting the
one base rule; it was conjectured that no additional operation to pairs at least one of which is a base
generalizations could result from dropping this re- rule. This is a practical concern, as the number of
striction. Here we define the operation for any base rules is a constant determined by the input data
two right-hand contexts D1 , D2 ∈ Σ∗# (Φ)(X). As while the number of generalized rules can increase
before, only rules that make the same change are exponentially.
eligible for generalization and the operation applies Conceptually, each rule learned by unrestricted
to left-hand contexts under reversal. minimal generalization has a (possibly non-unique)
The definition of D = D1 u D2 needed for ‘history’ of base rules from which it originated. A
recursive application is identical to the one given base rule R ∈ R1 has the history {R}. A rule in
above except that we must consider input contexts R ∈ R2 has the history {R1 , R2 } consisting of the
that contain feature sets and X (which previously two base rules from which it derived. In general,
could occur only in outputs). As before, we first the history of each rule in Rk is the union of the
5 histories of two rules in Rk−1 (k > 1).
There could be a small difference between our defini-
tion of context generalization and that in Albright and Hayes Because all rules are learned ‘bottom-up’ in this
(2002), hinging on whether the empty feature set is allowed in sense, the conjecture can be proved by showing that
rules. In our definition, φ1∩2 = ∅ is replaced by the variable
X. It is possible that the original proposal intended for empty the minimal generalization operation is associative;
and non-empty feature sets to be treated alike. The definitions we also show that it is commutative — both prop-
can diverge when applied to right contexts that are of identical erties inherited from equality, lcp, set intersection,
length and share all but the last segment (resp. left contexts
that share all but the last segment), in which case our version and other more primitive ingredients. As before,
would result in a broader rule. we explicitly consider right-hand contexts, from
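For the base-rule case, in which both contexts consist only of ordinary segments, the operation just defined can be sketched as follows (our own illustrative code; phi is an assumed table mapping each segment to its set of feature-value pairs, and left-hand contexts are handled by reversing, generalizing, and reversing back):

```python
def generalize_contexts(d1, d2, phi):
    """Minimal generalization of two right-hand contexts made of ordinary
    segments: the shared prefix, then at most one set of shared features
    of the next segments, then the variable 'X' if anything remains."""
    k = 0
    while k < len(d1) and k < len(d2) and d1[k] == d2[k]:
        k += 1
    lcp, r1, r2 = list(d1[:k]), d1[k:], d2[k:]
    if not r1 and not r2:
        return lcp
    if not r1 or not r2:
        return lcp + ["X"]
    shared = phi[r1[0]] & phi[r2[0]]
    if not shared:
        return lcp + ["X"]
    if len(r1) == 1 and len(r2) == 1:
        return lcp + [shared]
    return lcp + [shared, "X"]

phi = {"N": {"+nasal", "+voice"}, "m": {"+nasal", "+voice", "+labial"}}  # toy features
print(generalize_contexts("N", "m", phi))   # [{'+nasal', '+voice'}] (set order may vary)
```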
which parallel results for left-hand contexts and end of D and E; this is independent of grouping
entire rules follow immediately. It follows that any along the same lines shown previously.
rule R can be replaced, for the purposes of minimal Complete. We now prove by induction
generalization, with the base rules in its history (in that, for any R ∈ Rk and R1 , R2 ∈ Rk−1 (k > 1)
any order). such that R = R1 u R2 , rule R can also be de-
Commutative. Let D = D1 u D2 for any rived by applying minimal generalization to R1
D1 , D2 ∈ Σ∗# (Φ)(X). We prove by construc- and one or more base rules (i.e., the rules in the
tion that D is also equal to D2 u D1 . The lcp history of R2 ).7 For R ∈ R2 this is true by def-
of elements from Σ# is the same regardless of the inition. For R ∈ R3 , we have R = R1 u R2 =
order of the contexts (σ1∧2 = σ2∧1 ) as are the R1 u (R21 u R22 ) = (R1 u R21 ) u R22 , where
remainders (D10 and D20 ). If both remainders are R21 and R22 are base rules whose minimal gener-
empty, then the result of minimal generalization alization results in R2 . In general, suppose that the
is σ1∧2 = σ2∧1 . If one but not both of them are statement is true for k − 1 > 0. Then it is also true
empty then the result is σ1∧2 X = σ2∧1 X; note for k because R ∈ Rk can be derived by R1 uR2 =
that X appears regardless of which input context R1 u (uni=1 R2i ) = (((R1 u R21 ) u R22 ) · · · u R2n )
is longer. If both are non-empty then we ensure where R1 , R2 ∈ Rk−1 and each R2i is a base rule
that their initial elements are (possibly empty) fea- in the history of R2 .
ture sets and take their intersection, which is order These results validate the rule learning algorithm
independent: φ1∩2 = φ2∩1 . If φ1∩2 = ∅ then the proposed by Albright and Hayes (2002) and used in
result is σ1∧2 X = σ2∧1 X. Otherwise, the initial our implementation. Any minimal generalization
elements are removed and the operation continues of two rules R1 and R2 allowed by the model can
to the remainders. If both remainders are empty be derived from R1 (or R2 ) by recursive application
the result is σ1∧2 φ1∩2 = σ2∧1 φ2∩1 , otherwise it is of minimal generalization with one or more base
the same expressions terminated by X. rules.
Associative. Let D = (D1 u D2 ) u D3 for any
D1 , D2 , D3 ∈ Σ∗# (Φ)(X). We prove by construc- 2.6 Relative generality
tion that D is equal to E = D1 u (D2 u D3 ). Let While not required for the minimal generalization
σ be the longest prefix of symbols from Σ# in D. operation itself, we define here a (partial) generality
Because σ occurs in D iff it is the lcp of this type relation on rules. The definition uses the same
in (D1 u D2 ) and D3 , it must be a prefix of each notation as above and is employed in pruning rules
of D1 , D2 , D3 and the longest such prefix that ap- after recursive minimal generalization has applied
pears in all of them. It follows that σ is also the (see §3.4 below).
lcp of symbols from Σ# in D1 and (D2 u D3 ). Relative generality is defined only for rules R1
Therefore, D and E both begin with σ. We now and R2 that make the same change. As usual, it
remove the prefix σ from all of the input contexts is sufficient to consider the right-hand contexts D1
and consider the remainders D10 , D20 , D30 . and D2 and then apply the same definition to the
If all of the remainders are empty, then D = reversed left-hand contexts. Conceptually, context
E = σ. If all but one of them are empty, then D2 is at least as general as context D1 , D1 v D2 ,
D = E = σX.6 If none of the remainders is empty, iff the set of strings represented by D2 is a superset
let φ1 , φ2 , φ3 be their (featurized) initial elements. of that represented by D1 when both contexts are
The intersection of those elements is independent considered as regular expressions over Σ∗# . The
of grouping, φ = (φ1 ∩φ2 )∩φ3 = φ1 ∩(φ2 ∩φ3 ). If procedural definition is complicated somewhat by
the intersection is empty then again D = E = σX. X, which can appear at the end of either context.
If the intersection is non-empty then D and E both Replace each symbol x ∈ Σ# in D1 or D2 with
begin σφ. Finally, remove the initial elements of its feature set φ(x), treat X as equivalent to ∅, and
each of D10 , D20 , D30 and compare the lengths of the let |D| be the length of context D. Then D1 v D2
remainders to determine whether X appears at the iff (i) |D1 | ≥ |D2 | and D1 [k] ⊆ D2 [k] for all
6
1 ≤ k ≤ |D1 |, except when |D1 | = |D2 | + 1
If D10 or D20 is the longest context, assume by com-
mutativity that it is D10 . The minimal generalizations are
and the last element of D1 but not D2 is X, or (ii)
(D10 u D20 ) = X and X u D30 = X, which gives the same |D1 | = |D2 | − 1, D1 [k] ⊆ D2 [k] for all 1 ≤ k ≤
result as (D20 u D30 ) = λ and D10 u λ = X. Similar reasoning
7
applies if D30 is the longest context. We ignore rules that are carried over from Rk−1 to Rk .
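These results license the following simple learning loop (our own illustrative sketch; generalize is assumed to return the minimal generalization of two rules making the same change, or None for an ineligible pair, and rules are assumed hashable):

```python
def learn_rules(base_rules, generalize):
    """Recursive minimal generalization restricted to pairs that contain a
    base rule, which by the completeness result yields the same closure as
    unrestricted pairwise generalization."""
    rules = set(base_rules)
    frontier = set(base_rules)
    while frontier:
        new_rules = set()
        for rule in frontier:
            for base in base_rules:
                g = generalize(rule, base)
                if g is not None and g not in rules:
                    new_rules.add(g)
        rules |= new_rules
        frontier = new_rules     # typically empty after 6-7 iterations
    return rules
```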
|D1 |, and the last element of D2 is X. Context targeted by the shared task: formation of past par-
D2 is strictly more general than D1 , D1 @ D2 , ticiples in German (Clahsen, 1999) and past tenses
iff D1 v D2 and D2 6v D1 . Rule R2 is at least in English and Dutch (Booij, 2019).
as general as R1 , R1 v R2 , iff C1 v C2 and Second, the model cannot learn sensible rules
D1 v D2 ; it is a strictly more general rule iff for circumfixes (Albright and Hayes, 2002, §5.2).
either of the context relations is strict. This could be remedied by allowing the model to
form rules that simultaneously make changes at
3 System Description and Results both wordform edges, or by allowing it to apply
multiple rules when mapping inputs to outputs. As
Our system for the shared task preprocessed the
a workaround, we simply removed the prefix /g@-/
input wordforms, learned rules with recursive min-
whenever it occurred at the beginning of a German
imal generalization, scored the rules in two alter-
past participle (training or wug wordform).
native ways, pruned rules that have no effect on
the model’s predictions, and applied the remaining
3.2 Rules
rules to wug forms to yield predicted ratings.
Given the preprocessed and filtered input data, a
3.1 Preprocessing base rule was learned for each lexeme and then
The shared task provided space-separated broad minimal generalization was applied recursively as
IPA transcriptions of the training and wug word- in §2. This resulted in tens of thousands of morpho-
forms (e.g., /w O k/, /w O k t/, /s t I N/, /s t 2 N/). logical rules for each of the three languages (see
As already mentioned, we added explicit beginning Table 1).
and end of string symbols. Because minimal gener- A major goal of Albright & Hayes was to learn
alization requires each wordform symbol to have a rules that can construct outputs from inputs (as
phonological feature specificiation, but some seg- opposed to merely rating or selecting outputs that
ments in the data lack entries in our feature chart, are generated by some other source). Their model
we further simplified or split the symbols as fol- achieved this goal, and a substantial portion of its
lows. original implementation was dedicated to rule ap-
For German, we split the diphthongs /ai au oi plication. We instead delegated the application
i:@ e:@ E:@/ into their component vowels and“ addi-
“ “ of rules to a general purpose finite-state library
tionally regularized /i u/ to /i u/. For English, we (Pynini; Gorman, 2016; Gorman and Sproat, 2021),
split the diphthongs “/aI“ aU OI u:I/ into their com- as follows.
ponents and /3~/ into /E ô/, simplified /eI @U/ to Each component of a rule A → B/C D was
/e o/, and regularized /m n r l Õ/ to /m n ô l O/. first converted to a regular expression over symbols
We also deleted all length" marks
" " /:/ and instances in Σ# by mapping any feature set φ ∈ Φ to the
of / /. For Dutch, we split /EI AU UI/ into their
G disjunction of symbols that bear all of the specified
components. features and deleting instances of X. Segments
Checking that all wordform symbols appear in a were then encoded as integers using a symbol table.
phonological feature chart is useful for data clean- Pynini provides a function cdrewrite that com-
ing. It helped us to identify a few thousand Dutch piles rules in this format to finite-state transducers,
wordforms containing ‘+’ (indicating a Verb + a function accep for converting input strings to
Preposition juncture), which we removed. And it linear finite-state acceptors encoded with the same
caught an encoding error in which two distinct but symbol table, a composition function @ that applies
perceptually similar Unicode symbols were used rules to inputs yielding output acceptors, and the
for the voiced velar stop /g/. means to decode the output back to strings.8
Two acknowledged limitations of the original 8
The technique of mapping feature matrices to disjunctions
version of the minimal generalization model, and (i.e., natural classes) of segments and beginning/end symbols,
our version, are relevant here. First, the model and ultimately to disjunctions of integer ids, was also used in
the finite-state implementation of Hayes and Wilson (2008).
learns rules for individual morphological relations X was deleted here because it occurs only at the beginning of
(e.g., mapping a bare stem to a past tense form), not left-hand contexts and at the end of right-hand contexts, both

for entire morphological systems jointly. Therefore, positions where Pynini’s rule compiler implicitly adds Σ# .
Pynini’s implementation of finite-state automata wraps and
we retained from the preprocessed input data only extends OpenFst (Riley et al., 2009) and its rule compilation
the wordform pairs that instantiate the relations algorithm is due to Mohri and Sproat (1996).
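A minimal sketch of this style of rule application with Pynini is given below; it uses a toy one-byte segment inventory and a single hand-written rule in place of the learned rules and the full symbol table (none of which are reproduced here), so it only illustrates the compile-and-compose workflow.

```python
import pynini

# Toy segment inventory; the actual system uses the full IPA set encoded
# with a symbol table.
segments = ["p", "t", "k", "s", "b", "d", "g", "a", "e", "i", "o", "u",
            "w", "O", "l", "m", "n"]
sigma_star = pynini.union(*segments).closure()

# One illustrative rule: insert -t after a voiceless segment at the end of
# the word. The feature set [-voice] is compiled to a disjunction of segments.
voiceless = pynini.union("p", "t", "k", "s")
rule = pynini.cdrewrite(
    pynini.cross("", "t"),   # the change A -> B (here an insertion)
    voiceless,               # left-hand context C
    "[EOS]",                 # right-hand context D: end of string
    sigma_star,
)

# Score an input/output mapping by composing the input acceptor, the rule
# transducer, and the candidate output, then checking for a non-empty lattice.
lattice = pynini.accep("wOk") @ rule @ pynini.accep("wOkt")
print(lattice.num_states() > 0)   # True: the rule maps 'wOk' to 'wOkt'
```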
3.3 Scoring mine possible inputs for a given output wordform
The score of a rule is related to its accuracy on the (by composition with the inverted transducer), and
training data. The simplest notion of score would to assign scores to input/output mappings. Follow-
be just accuracy: the number of training outputs ing Albright & Hayes, we assume that the score
that are correctly predicted by the rule (hits), di- of a mapping is taken from the highest-scoring
vided by the number of training inputs that meet the rule(s) that could produce it. Rules neither ‘gang
structural description of the rule (scope). Albright up’ — multiple rules cannot contribute to the score
& Hayes propose instead to discount the scores of of a mapping — nor do they compete — rules that
rules with smaller scopes, using a formula previ- prefer different outputs for the same input do not
ously applied to linguistic rules by Mikheev (1997). detract from the score. When no rule produces a
Our implementation also includes this way of scor- mapping, we assigned it the minimal score of zero.
ing rules, which Albright & Hayes call confidence.9 As for the scoring function itself, many other
Because confidence imposes only a modest possibilities could be considered. For example,
penalty on rules with small scopes, we also con- rule scores could be normalized within or across
sidered a score function of the form scoreβ = changes, a type of competition that is inherent
hits/(scope + β), where β is a non-negative dis- to probabilistic models. See Albright and Hayes
count factor (here, β = 10). A rules that is per- (2006) for a different kind of competition model in
fectly accurate and applies to just 5 cases has high which rules learned by minimal generalization are
confidence (.90) but much lower score10 (.33); one weighted as conflicting constraints.
that applies perfectly to 1000 cases has a near-
maximal value (> .99) regardless of how the score
3.6 Results
is calculated. Clearly, these are only two of a wide
range of score functions that could be explored. Table 1 provides quantitative details of our simu-
lations for the three morphological relations in the
3.4 Pruning
shared task. The AIC values were calculated with
When applied to training data consisting of thou- an evaluation script provided by the organizers,
sands of lexemes, recursive minimal generalization which compares average human ratings of output
can produce tens of thousands of distinct rules. Al- wordforms with ratings predicted by the model.
bright & Hayes mention but do not implement the (Values are not directly comparable across the lan-
possibility of pruning the rules on the basis of their guages because the number of wug forms differed.)
generality and scores. We pursued this suggestion We used whichever scoring method, confidence
by first partitioning the set of all learned rules ac- or score10 , achieved a better AIC value on the de-
cording to their change and imposing a partial order velopment wug data. For German and English, this
on each of the resulting subsets. was confidence; for Dutch it was score10 . Upon
We ordered rules by generality (§2.6), score, and close inspection of the development data for En-
length when expressed with features (Chomsky and glish, we found it plausible that human participants
Halle, 1968). Rule R2 dominates rule R1 in the had down-rated regular past tense forms of bare
order, R1 ≺ R2 iff R2 is at least as general as R1 forms ending in coronal stops /t d/ because these
(R1 v R2 ) and (i) R2 has a higher score or (ii) might appear to be ‘double past’ inflections (e.g.,
the rules tie on score and R2 is either strictly more /vaInd@d/ for the stem /vaInd/, which has a rime
general (R1 @ R2 ) or shorter. Dominated rules /aInd/ that is rare outside of past tense forms).
were pruned without affecting the predictions of Therefore, in generating predictions for the English
the model, as we discuss next. wug test we added a penalty to the model score
3.5 Prediction for such outputs. The magnitude of the penalty
was fit by linear regression to the development data.
Once rules have been learned by minimal general- As the development and test wugs were generated
ization and scored, they can be used for multiple by different methods, addition of this factor could
purposes: to generate potential outputs for input have had a detrimental effect on the model’s per-
wordforms (by finite-state composition), to deter- formance. On the contrary, our model had the best
9
The confidence formula has one free parameter, which we AIC for the German and English test data and the
set to α = .55 following Albright and Hayes (2003, p. 127). best overall AIC (summed over the languages).
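The simpler of the two scoring functions described in Section 3.3 can be written out directly as a sketch (the confidence statistic, which follows Mikheev (1997), is not shown):

```python
def score_beta(hits, scope, beta=10):
    """Accuracy discounted for small scope: hits / (scope + beta).
    With beta = 10, a perfectly accurate rule of scope 5 scores about .33,
    while one of scope 1000 scores about .99."""
    return hits / (scope + beta)

print(round(score_beta(5, 5), 2), round(score_beta(1000, 1000), 2))  # 0.33 0.99
```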
Language Lexemes Rules (all) Rules (pruned) AIC (dev wugs) AIC (test wugs)
German (past part.) 3,417 31,562 3,629 -127.6 -135.0
English (past) 5,803 30,728 263 -112.0 -62.2
Dutch (past) 7,823 55,114 1,862 -58.5 -76.5

Table 1: Number of lexemes (wordform pairs) used for training, number of rules learned by minimal generalization
(before and after pruning), and evaluation on average human wug-test ratings for each language. Lower AIC values
indicate a better match between model predictions and behavioral results.

4 Summary and Future Directions account the downstream effects of phonology. We


have also not explored impugnment (Albright and
We have described the minimal generalization op- Hayes, 2002, §3.7), which unlike the other compo-
eration for morphological rules as proposed by Al- nents of the model seeks to limit rather than expand
bright & Hayes and presented some new formal upon minimal generalization.
results on this operation. We have also described
our partial implementation of their model — a pure 4.2 Near misses
minimal generalization learner — and applied it As the organizers of the shared task have empha-
to wug-test data from three related languages. We sized, implemented models can be used not only
conclude with some remarks on how our imple- to predict the results of behavioral experiments but
mentation could be extended and how the central also to generate stimuli. Ideally, stimulus items
concept of minimal generalization could be empiri- would be designed to test the core tenets of a single
cally tested in future behavioral experiments. model or to probe systematic differences in predic-
tion among models. As part of our implementation,
4.1 Extensions
we have developed an automatic method of select-
The most obvious extension of the present study ing wug items to investigate a main concern about
would be to compare our stripped-down model with minimal generalization: namely, that by learning
the original one. For some of the additional mech- rules in a strictly bottom-up way it will undergen-
anisms proposed by Albright & Hayes this would eralize, predicting sharp contrasts in inflectional
be straightforward and we have alreay begun to behavior on the basis of slight differences in form.
do so; other modifications would require larger We illustrate our method with the English irreg-
changes to the model and enhancements to the train- ular pattern I → 2, which attracted new members
ing data. For example, Albright and Hayes (2002, in the history of English and has elicited relatively
§3.4) motivate a second generalization mechanism high production rates and acceptability ratings in
that creates cross-context (or more jocularly ‘Dop- previous wug tests (e.g., Bybee and Moder, 1983;
pelgänger’) rules: for each pair of rules A → Albright and Hayes, 2003). We extracted all of the
B/C D and A → B 0 /C 0 D0 , their model onsets and rimes that appear in the bare forms of
adds A → B /C D and A → B/C 0 D0 . This
0
monosyllabic English verbs and freely combined
is a simple change to our implementation that, us- them to create a large pool of possible stimulus
ing the results of §2, need only apply to base rules. items. We eliminated items that are real verbs, then
Learning phonological rules along with morphol- shrunk the pool to those items that are one (segmen-
ogy, as in Albright and Hayes (2002, §3.3), would tal) edit away from some existing irregular verb that
require the training data to contain lexeme frequen- undergoes I → 2. We further required each item
cies. This is because the original implementation to share its rime with at least one such irregular
processes the training lexemes in order of descend- verb.10 All of the wugs in the final pool are highly
ing frequency, ensuring that a phonological rule similar, in this sense, to existing irregulars.
learned on the basis of one lexeme is consistent We then divided the pool into two sets: items
with all previous (i.e., higher frequency) training that are within the scope of at least one I → 2 rule
examples. We have not yet begun to explore this or learned by minimal generalization (potential hits),
alternative means of incorporating phonology into and items that are outside the scope of all such
the model; this is an important extension because, 10
Studies of English irregular verbs have focused primarily
as Albright & Hayes demonstrate, learning fully on vowels and codas of monosyllables, though see Bybee and
general morphological rules requires taking into Moder (1983) on the potential role of onsets.
211
rules (near misses). For the former, we recorded References
the highest-scoring applicable rule. We wanted to
Honaida Yousuf Ahyad. 2019. Vowel Distribution in
provide the model with the opportunity to form the Hijazi Arabic Root. Ph.D. dissertation, State
rules that were as broad as possible — making it University of New York at Stony Brook.
more difficult for us to find near misses — and
therefore implemented cross-context base rules as Adam Albright. 2002. Islands of reliability for regu-
lar morphology: Evidence from Italian. Language,
described earlier.11 78(4):684–709.
Some of the potential hits and near misses are
minimal pairs. For example, /lIN/ (.67) and /SIN/ Adam Albright and Bruce Hayes. 2002. Modeling En-
(.61) could potentially undergo I → 2 rules with glish past tense intuitions with minimal generaliza-
the indicated confidence values. But /fIN/ and tion. In Proceedings of the ACL-02 Workshop on
Morphological and Phonological Learning, pages
/vIN/ are ineligible for the change according to 58–69. Association for Computational Linguistics.
the model (because no existing irregular verb of
this type has a non-coronal fricative immediately Adam Albright and Bruce Hayes. 2003. Rules
before the vowel). Other differences in the onset vs. analogy in English past tenses: A computa-
tional/experimental study. Cognition, 90(2):119–
can also dramatically affect the model’s predictions: 161.
/TôINk/ (.88) and /glIN/ (.67) are potential hits but
/smINk/ and /smIN/ are near misses. The second Adam Albright and Bruce Hayes. 2006. Modeling pro-
two are phonotactically challenged (Davis, 1989), ductivity with the Gradual Learning Algorithm: The
but are /Tô2Nk/ and /gl2N/ far superior to /sm2Nk/ problem of accidentally exceptionless generaliza-
tions. In Gisbert Fanselow, Caroline Fery, Matthias
and /sm2N/ when the phonotactic acceptability of Schlesewsky, and Ralf Vogel, editors, Gradience in
their bare forms is factored out? Grammar: Generative Perspectives, pages 185–204.
The same procedure can be applied to any irreg- Oxford University Press, Oxford.
ular (or indeed regular) change. For i → Ept (as
Adam Albright and Yoonjung Kang. 2009. Predicting
in sleep ∼ slept), we find that the potential hits in- innovative alternations in Korean verb paradigms.
clude /gip/ (.85) and /flip/ (.73, one of Albright & Current issues in unity and diversity of languages:
Hayes’s wug items) while /fip/, /vip/, /nip/, and Collection of the papers selected from the 18th Inter-
/snip/ are among the near misses. Would native national Conference of Linguistics, pages 1–20.
English speakers rate the novel past form /gEpt/
Geert Booij. 2019. The Morphology of Dutch, second
much higher than /fEpt/, as the model predicts?12 edition. Oxford University Press, New York.
We look forward to future empirical tests of min-
imal generalization, along these lines and others, as Joan L. Bybee and Carol Lynn Moder. 1983. Mor-
part of the collective effort to find out where we are phological classes as natural categories. Language,
59(2):251–270.
and how much further we have to go in cognitive
modeling of inflection. Jane Chandlee. 2017. Computational locality in mor-
phological maps. Morphology, 27(4):599–641.
Acknowledgements
Noam Chomsky and Morris Halle. 1968. The Sound
We would like to thank the organizers for all of Pattern of English. MIT Press, Cambridge, MA.
their work on the shared task. Special thanks to
Adam Albright and Bruce Hayes for inspiring our Harald Clahsen. 1999. Lexical entries and rules of
language: A multidisciplinary study of German in-
study and for stimulating conversations on related flection. Behavioral and Brain Sciences, 22(6):991–
topics over many years. This research was partially 1013.
supported by NSF grant BCS-1844780 to CW.
Maria Corkery, Yevgen Matusevych, and Sharon Gold-
11
With this modification to the implementation, which was water. 2019. Are we there yet? encoder-decoder
not used in previous sections, the total number of rules for the neural networks as cognitive models of English past
English past tense ballooned to 191,874. Even after pruning tense inflection. In Proceedings of the 57th Annual
there were tens of thousands of rules (69,747) and 128 for just
Meeting of the Association for Computational Lin-
I → 2. The majority of the rules have very low scores.
12 guistics, pages 3868–3877, Florence, Italy. Associa-
If irregular pasts of these potential hits and near misses
are all rated low, this could instead suggest that the model tion for Computational Linguistics.
has overgeneralized, perhaps supporting an alternative that
uses something more like detailed exemplars to represent and Stuart Davis. 1989. Cross-vowel phonotactic con-
extend irregular patterns. straints. Computational Linguistics, 15(2):109–110.
212
Kyle Gorman. 2016. Pynini: A Python library for Sandeep Prasada and Steven Pinker. 1993. Generalisa-
weighted finite-state grammar compilation. In Pro- tion of regular and irregular morphological patterns.
ceedings of the SIGFSM Workshop on Statistical Language and Cognitive Processes, 8(1):1–56.
NLP and Weighted Automata, pages 75–80, Berlin,
Germany. Association for Computational Linguis- Péter Rácz, Clay Beckner, Jennifer B. Hay, and Janet B.
tics. Pierrehumbert. 2020. Morphological convergence
as on-line lexical analogy. Language, 96(4):735–
Kyle Gorman and Richard Sproat. 2021. Finite-State 770.
Text Processing. Synthesis Lectures on Human Lan-
guage Technologies, 14(2):1–158. Péter Rácz, Clayton Beckner, Jennifer B. Hay, and
Janet B. Pierrehumbert. 2014. Rules, analogy, and
Bruce Hayes and Colin Wilson. 2008. A maximum en- social factors codetermine past-tense formation pat-
tropy model of phonotactics and phonotactic learn- terns in English. In Proceedings of the 2014 Joint
ing. Linguistic Inquiry, 39(3):379–440. Meeting of SIGMORPHON and SIGFSM, pages 55–
63, Baltimore, Maryland. Association for Computa-
Vsevolod Kapatsinski. 2010. Velar palatalization in tional Linguistics.
Russian and artificial grammar: Constraints on mod-
Michael Riley, Cyril Allauzen, and Martin Jansche.
els of morphophonology. Laboratory Phonology,
2009. OpenFst: An open-source, weighted finite-
1(2):361–393.
state transducer library and its applications to speech
and language. In Proceedings of Human Language
Jennifer Kuo. 2020. Evidence for Base-Driven Alterna-
Technologies: The 2009 Annual Conference of the
tion in Tgdaya Seediq. Master’s Thesis, UCLA.
North American Chapter of the Association for Com-
putational Linguistics, Companion Volume: Tutorial
Andrei Mikheev. 1997. Automatic rule induction for
Abstracts, pages 9–10, Boulder, Colorado. Associa-
unknown-word guessing. Computational Linguis-
tion for Computational Linguistics.
tics, 23(3):405–423.
Oscar Strik. 2014. Explaining tense marking changes
Mehryar Mohri and Richard Sproat. 1996. An efficient in Swedish verbs: An application of two analogical
compiler for weighted rewrite rules. In 34th An- computer models. Journal of Historical Linguistics,
nual Meeting of the Association for Computational 4(2):192–231.
Linguistics, pages 231–238, Santa Cruz, California,
USA. Association for Computational Linguistics. João Veríssimo and Harald Clahsen. 2014. Variables
and similarity in linguistic generalization: Evidence
Steven Moran, Daniel McCloy, and Richard Wright, ed- from inflectional classes in Portuguese. Journal of
itors. 2014. PHOIBLE Online. Max Planck Institute Memory and Language, 76:61–79.
for Evolutionary Anthropology, Leipzig.

David R. Mortensen, Patrick Littell, Akash Bharad-


waj, Kartik Goyal, Chris Dyer, and Lori Levin. 2016.
PanPhon: A resource for mapping IPA segments to
articulatory feature vectors. In Proceedings of COL-
ING 2016, the 26th International Conference on
Computational Linguistics: Technical Papers, pages
3475–3484, Osaka, Japan. The COLING 2016 Orga-
nizing Committee.

Ramin Charles Nakisa, Kim Plunkett, and Ulrika Hahn.


2001. A cross-linguistic comparison of single and
dual-route models of inflectional morphology. In
Peter Broeder and Jaap Murre, editors, Models of
Language Acquisition: Inductive and Deductive Ap-
proaches, pages 201–222. MIT Press, Cambridge,
MA.

Yohei Oseki, Yasutada Sudo, Hiromu Sakai, and Alec


Marantz. 2019. Inverting and modeling morphologi-
cal inflection. In Proceedings of the 16th Workshop
on Computational Research in Phonetics, Phonol-
ogy, and Morphology, pages 170–177, Florence,
Italy. Association for Computational Linguistics.

Kim Plunkett and Patrick Juola. 1999. A connectionist


model of English past tense and plural morphology.
Cognitive Science, 23(4):463–490.
213
What transfers in morphological inflection? Experiments with analogical
models

Micha Elsner
Department of Linguistics
The Ohio State University
[email protected]

Abstract abstract representational concepts do inflection net-


works acquire and how are these shared across lan-
This paper investigates how abstract processes
like suffixation can be learned from morpho-
guages?
logical inflection task data using an analogi- This is a difficult question to address in the stan-
cal memory-based framework. In this frame- dard framework for inflection (Kann and Schütze,
work, the inflection target form is specified 2016), in which morphosyntactic properties are
by providing an example inflection of another closely tied to their specific exponents in a par-
word in the language. This model is capa- ticular language as well as to the more abstract
ble of near-baseline performance on the Sig- processes by which these exponents are applied.
Morphon 2020 inflection challenge. Such a
In such a network, it is difficult to test whether
model can make predictions for unseen lan-
guages, allowing one-shot inflection for natu- a generic suffixation operation has been learned,
ral languages and the investigation of morpho- without reference to a particular form/feature map-
logical transfer with synthetic probes. Accu- ping, for instance between the Maori passive fea-
racy for one-shot transfer can be unexpectedly ture PASS and the spelling of a particular passive
high for some target languages (88% in Shona) suffix -tia. Suffixing as a generic operation is much
and language families (53% across Romance). more likely to be useful in another language than
Probe experiments show that the model learns
the individual suffix. This work decouples these
partially generalizable representations of pre-
fixation, suffixation and reduplication, aiding representational pieces by performing inflection in
its ability to transfer. The paper argues that an analogical, memory-based framework.1 In this
the degree of generality of these process repre- framework, inflection instances do not have tags;
sentations also helps to explain transfer results rather, they include an instance of the desired map-
from previous research. ping with respect to a different lemma (Figure 1).
For example, to produce a passive Maori verb, the
1 Introduction
system takes an example verb with its passive and
Morphological transfer learning has proven to be completes the four-part analogy: lemma : target ::
a powerful and effective technique for improving exemplar lemma : exemplar target. The advantage
the performance of inflection models on under- of this redefinition of the task is that, in principle,
resourced languages. The beneficial effects of the system does not need to learn anything about
transfer between source and target languages are the individual affixes of a particular language, since
known to be higher when the two are closely re- these can be copied from the exemplar. Thus, it is
lated (Anastasopoulos and Neubig, 2019) or typo- possible to investigate how well such a system has
logically similar (Lin et al., 2019), mediated by learned a particular morphological process such as
the effect of script (Murikinati et al., 2020). But suffixation, which is expected to be present in a
these effects are not always consistent; a variety variety of languages.2
of researchers report failure of transfer between 1
“Memory-based” has been used in the literature to refer
closely related languages, or surprising successes to models with dynamic read-write memory (Graves et al.,
with rather dissimilar ones (Sec 2). Pushing for- 2016), as well as KNN-like exemplar models which store a
ward our understanding of these cases requires a large number of examples in a static memory (van den Bosch
and Daelemans, 1999). The current work is of the latter type.
more nuanced understanding of what is transferred 2
Code available at: https://github.com/
by morphological transfer learning— that is, what melsner/transformerbyexample.

214
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 214–226
August 5, 2021. ©2021 Association for Computational Linguistics
Section 5 shows that this analogical framework or disrupt the mapping between them. Low-level
for inflection can predict inflections across a va- correspondence between character sets is the most
riety of languages, demonstrating reasonable per- important factor for successful transfer in very low-
formance on the Sigmorphon 2020 multilingual resource settings, but models with disjoint charac-
benchmark (Vylomova et al., 2020). Section 6 de- ter representations still succeed at transfer once at
scribes one-shot learning experiments, performing least 200 target examples are available, indicating
language transfer without fine-tuning, and shows that higher-level information is also transferred and
that for languages with concatenative affixes, one- contributes to performance.
shot transfer can be more effective than previously
Kann et al. (2017b) also represents a prior one-
thought. Section 7 studies the system’s ability to ap-
shot morphological learning experiment. Their set-
ply different types of morphological processes us-
ting is not quite the same as the one here; they
ing constructed stimuli, showing that some config-
assume access to a single inflected form in half the
urations are capable of learning generic and trans-
paradigm cells in their target language (Spanish)
ferable representations of processes including pre-
which are used to fine-tune a pretrained system.
fixing, suffixing and reduplication.
Because their system uses the conventional tag-
based framework, they are capable of filling cells
2 Related work
for which no example is available (zero-shot learn-
The overall positive effect of transfer learning is ing), while the memory-based system presented
well established (McCarthy et al., 2019). Previ- here is not. On the other hand, the current work
ous research has also evaluated how the choice does not use fine-tuning or require target-language
of source language affects the performance in the data at training time. They evaluate inflection on
target. While there is a robust trend for related both seen and unseen cells as a function of five
languages to perform better, there are also many re- source languages, four of which are in the Romance
ports of exceptions. Kann (2020) finds that Hungar- family. The best one-shot transfer within Romance
ian is a better source for English than German and scores 44% exact match, the worst 13%. Transfer
a better source for Spanish than Italian. She con- from unrelated Arabic scores 0%. One-shot learn-
cludes that matching the target language’s default ing experiments in this work use a much larger set
affix placement (prefixing/suffixing) is important, of languages, and although performance in the typ-
and that agglutinative languages might be benefi- ical case is similar, the best results are substantially
cial to transfer learning in general, but that genetic better.
relatedness is not always a necessary or sufficient The memory-based design of the current work is
for effective transfer. Lin et al. (2019) also find that rooted in cognitive theories of morphological pro-
Hungarian and Turkish are good source languages cessing. The widely accepted dual route model of
for a surprising variety of unrelated targets. Rather morphological processing postulates that the mind
than attribute this to agglutination, they propose retrieves familiar inflected forms from memory as
that these languages lead to good transfer because well as synthesizing forms from scratch (Milin
of their large datasets and difficulty as tasks. Fur- et al., 2017; Alegre and Gordon, 1998; Butterworth,
ther puzzling results come from Anastasopoulos 1983). It has often been claimed that memorized
and Neubig (2019), who find that Italian data does forms of specific words are central to the structure
not improve performance in closely related Ladin of inflection classes (Bybee and Moder, 1983; By-
or Neapolitan3 once monolingual hallucinated data bee, 2006; Jackendoff and Audring, 2020). In such
is available, and that Latvian is as good a source a theory, production of a form of a rare lemma is
for Scots Gaelic as its relative Irish. guided by the memory of the appropriate forms of
Previous analyses of transfer learning have at- common ones. Additional evidence for this view
tempted to differentiate the contributions of various comes from historical changes in which one word’s
parts of the model through factored vocabularies or paradigm is analogically remodeled on another’s
ciphering (Kann et al., 2017b; Jin and Kann, 2017). (Krott et al., 2001; Hock and Joseph, 1996, ch.5).
These methods give disjoint representations to char- Liu and Hulden (2020) evaluate a model very simi-
acters and tags in the source and target languages, lar to this one (a transformer in which target forms
3
Regional Romance languages spoken in Northern and of other words, which they term “cross-table” ex-
Southern Italy respectively. amples, are provided as part of the input). They

215
Lemma Target specification → Target
Standard inflection generation waiata V; PASS waiatatia
waiata karanga : karangatia waiatatia
Memory-based
waiata kaukau : kaukauria waiatatia

Figure 1: Differing inputs for inflection models, eliciting the passive of the Maori verb waiata “sing”. The memory-
based system relies on an exemplar verb as the target specifier; shown here are karanga “call”, which takes a
matching suffix, and kaukau “swim”, which mismatches.

find that such examples are complementary to data evaluated using instances generated using random
hallucination and yield improved results in data- selection.
sparse settings. Some earlier non-neural models To perform similarity-based selection, each
also rely on stored word forms (Skousen, 1989; lemma is aligned with its target form in the training
Albright and Hayes, 2002). data in order to extract an edit rule (Durrett and
DeNero, 2013; Nicolai et al., 2016). (For the first
3 Exemplar selection memory-based example in Figure 1, both words
have the same edit rule -+tia.) The selected exem-
The system uses instances generated as described in
plar/form pair uses the same edit rule, if possible.
Figure 1, separating the lemma, exemplar lemma
During training, a lemma is allowed to act as its
and exemplar form with punctuation characters.
own exemplar, so that there is always at least one
Each instance also contains two features indicating
candidate. However, words in the test set must be
the language and language family of the example
given exemplars from the training set. If a cell in
(e.g. LANG MAO, FAM AUSTRONESIAN).
the test set does not appear in the training set, no
The selection of the exemplar is critical to the
prediction can be made; in this case, the system
model’s performance. Ideally, the lemma and the
outputs the lemma. Extending the model to cover
exemplar inflect in the same way, reducing the in-
this case is discussed below as future work.6
flection task to copying. But this is not always the
case. For example, Maori verbs fall into inflection
4 Model design
classes, as shown in Figure 1; when the exemplar
comes from a different class than the lemma, copy- The system uses the character-based transformer
ing will yield an invalid output, so the model has (Wu et al., 2020) as its learning model; this is a
to guess which class the input belongs to.4 sequence-to-sequence transformer (Vaswani et al.,
This paper presents experiments using two set- 2017) tuned for morphological tasks, and serves as
tings: In random selection, the exemplar lemma a strong official baseline for the Sigmorphon 2020
is chosen arbitrarily from the set of training task. Moreover, transformers are known to perform
lemma/form pairs for the appropriate language and well in the few-shot setting (Brown et al., 2020).
cell. This makes the task difficult, but allows the All default hyperparameters7 match those of Wu
model to learn to cope with the distribution of in- et al. (2020).
puts it will face at test time. In similarity-based As discussed in prior work (Anastasopoulos
selection, each source lemma is paired with an and Neubig, 2019; Kann and Schütze, 2017), it
exemplar for which the transductions are highly is important to pretrain the model to predispose
similar. This makes the task easy, but since it relies it to copy strings. To ensure this, the system is
on access to the true target form, it can be used only trained on a synthetic dataset. Each synthetic in-
for model training, not for testing.5 All models are stance is generated within a random character set.
4
In cases of class-dependent syncretism, the model must
The instance consists of a random pseudo-lemma
also guess which cell is being specified. For instance, German and pseudo-exemplar created by sampling word
feminine nouns do not inflect for case, but some masculine
nouns do, so the combination of a masculine lemma and a using random selection. To avoid this issue, no training scores
feminine exemplar can yield an unsolvable prediction prob- are reported in this paper.
6
lem. In the SigMorphon 2020 datasets, this rarely occurs in
5 practice. ≥ 99% of target cells are covered in all languages ex-
Within the training set, the same lemma/inflected form
pair can appear as both an exemplar and a target instance; a re- cept Ingrian (98%), Evenki (96%), and notably Ludic (61%).
7
viewer speculates that this might allow the model to memorize Including 4 layers, batches of 64, and the learning rate
lexically-specific outputs within the training set even when schedule.

216
lengths from the training word length distribution Family Random Similarity Base
Austronesian (4) 83 (13) 67 (21) 81 (18)
and then filling each one with random characters. Germanic (10) 87 (10) 51 (16) 90 (9)
With probability 50% the example is given a pre- Niger-Congo (9) 98 (4) 94 (9) 97 (3)
fix; independently with probability 50% a suffix; Oto-Manguean (10) 82 (16) 39 (23) 86 (12)3
Uralic (11) 92 (6) 46 (14) 93 (0.05)
independently with probability 10% an infix at Overall 89 (12) 57 (26) 90 (11)
a random character position. Prefixes and suf-
fixes are random strings between 2-5 characters Table 1: Fine-tuned accuracy scores for models trained
long and infixes are 1-2 characters long. (This with random and similarity-based selection, compared
means that, in some cases, no affix is added and to the baseline. Num languages in family and score
standard deviation across languages in parentheses.
the transformation is the identity, as occurs in cases
of morphological syncretism.) An example such
instance is mpieňjmel:rbeaikkea::zlürbeaikkeaüe can vary based on the choice of exemplar, the sys-
with output zlümpieňjmelüe. The language tags tem applies a simple post-process to compensate
for these examples indicate the kinds of affixa- for unlucky choices: it runs each lemma with five
tion operations which were performed, for exam- randomly-selected exemplars and chooses the ma-
ple LANG PREFIX SUFFIX; the family tag identifies jority output.
them as SYNTHETIC. While this synthetic dataset is Neither model achieves the same performance as
inspired by hallucination techniques (Anastasopou- the baseline (90%), although the random-exemplar
los and Neubig, 2019; Silfverberg et al., 2017), note model (89%) comes quite close. The similar-
that these synthetic instances are not presented to exemplar model (57%) is clearly inferior due to
the model as part of any natural language. its severe mismatch between training and test set-
The Sigmorphon 2020 data is divided into “de- tings. Performance varies across language families.
velopment languages” (45 languages in 5 fami- All models perform well in Niger-Congo, although
lies: Austronesian, Germanic, Niger-Congo, Oto- the conference organizers state that data from these
Manguean and Uralic) and “surprise languages” languages may have been biased toward regular
(45 more languages, including some members of forms in an unrepresentative way.8 The random-
development families as well as unseen families). exemplar model is at or near baseline performance
Data from all the “development languages”, plus in Austronesian and Uralic, but falls further below
the synthetic examples from the previous stage, is baseline in Germanic and Oto-Manguean. Both
used to train a multilingual model, which is fine- of these families are characterized by complex in-
tuned family. Finally the family models are fine- flection class structure in which randomly chosen
tuned by language. During multilingual training exemplars are less likely to resemble the target for
and per-family tuning, the dataset is balanced to a given word.
contain 20,000 instances per language; languages The similar-exemplar model also performs
with more training instances than this are subsam- poorly in Uralic. While some Uralic languages
pled, while languages with fewer are upsampled by have inflection classes (Baerman, 2014), many
sampling multiple exemplars (with replacement) (like Finnish) do not, but have complex systems
for each lemma/target pair. For the final language- of phonological alternations (Koskenniemi and
specific fine-tuning stage, all instances from the Church, 1988). While the random-exemplar model
specific language are used. can learn to compensate for these, the similar-
exemplar model does not.
5 Fine-tuned results
6 One-shot results
This section shows the test results for fully fine-
tuned models on the development languages. Table This section shows the results of one-shot learning.
1 shows the average exact match and standard de- These experiments apply the multilingual and fam-
viation by language family. Full results are given ily models from the development languages to the
in Appendix A. Tables also show the results of the surprise languages, without fine-tuning. For lan-
official competition baseline which is closest to the guages within development families, they use the
current work, the character transformer (Wu et al., appropriate family model; otherwise they use the
2020) fine-tuned by language, TRM - SINGLE. 8
A Swahili speaker confirms that some forms in the data
Because the results of exemplar-based models appear artificially over-regularized (Martha Johnson p.c.).

217
multilingual model. Thus, the model’s only access Family Random Similarity Base
Germanic (3) 29 (13) 38 (22) 80 (13)
to information about the target language is via the Niger-Congo (1) 75 (0) 88 (0) 100 (0)
provided exemplar. Uralic (5) 21 (9) 28 (12) 76 (26)
Each experiment evaluates the results across five Afro-Asiatic (3) 7 (3) 26 (18) 96 (3)
Algic (1) 2 (0) 14 (0) 68 (0)
random exemplars per test instance (with replace- Dravidian (2) 7 (7) 13 (3) 85 (9)
ment), but averages the results rather than applying Indic (4) 4 (5) 4 (2) 98 (3)
majority selection. This computes the expected Iranian (3) 35 (39) 34 (32) 82 (19)
Romance (8) 6 (4) 53 (19) 99 (1)
performance in the one-shot setting where only a Sino-Tibetan (1) 21 (0) 9 (0) 84 (0)
single exemplar is available. Siouan (1) 13 (0) 13 (0) 96 (0)
Results are shown in Table 2. One-shot learning Songhay (1) 21 (0) 82 (0) 88 (0)
Southern Daly 4 (0) 6 (0) 90 (0)
is not competitive with the baseline fine-tuned sys- Tungusic (1) 28 (0) 27 (0) 57 (0)
tem in any language family, but has some capacity Turkic (9) 7 (8) 19 (11) 96 (7)
to predict inflections in all families. Performance Uto-Aztecan (1) 33 (0) 30 (0) 81 (0)
Overall 14 (18) 30 (25) 90 (15)
is generally better in families for which related
languages were present in development. Table 2: One-shot accuracy scores for models trained
The system trained with random exemplars with random and similarity-based selection, compared
achieves its best results on Tajik (Iranian: tgk, score to the baseline. Num. languages in family and
89%), Shona (Niger-Congo: sna, score 75%)9 , and score standard deviation across languages in parenthe-
ses. Families represented in development above the
Norwegian Nynorsk (Germanic: nno, score 42%).
line, surprise families below.
The system trained with similar exemplars achieves
its best results on Shona (88%), Zarma (Songhay:
dje, score 82%) and Tajik (79%). Notably, some “repeat” and engolir “ingest”, are mismatched with
of these high scores are achieved on languages that exemplars from a different inflection class; both
were difficult for the baseline systems; the score for systems make incorrect predictions, but the similar-
Tajik beats the transformer baseline (56%), perhaps exemplar system preserves the suffixes while the
due to data sparsity, since baselines regularized us- random-exemplar system does not. Finally, in
ing data hallucination perform better (93%). the last example llevar-se “get up”, the similar-
Training with similar exemplars leads to clearly exemplar model misinterprets the reflexive suffix
better results than random exemplars, a reversal of -se as part of the verb stem, while the random-
the trend observed with fine-tuning. This difference exemplar model fails to make any edit.
is particularly marked in Romance (53% average A more systematic analysis computes an
vs 5%). While the random-exemplar system is alignment-based edit rule for each system predic-
better at guessing what to do when the exemplar tion (King et al., 2020) and counts the unique rules
and target forms are divergent, this causes errors used to form one-shot predictions in the Catalan de-
with unfamiliar languages. The system attempts velopment set. Over 37105 instances, the random-
to guess the correct inflection, rather than simply exemplar model applies 626 unique edit rules, 20
copying. of which appear in correct predictions. The similar-
As an example, Table 3 shows an analysis of exemplar model applies 3137 unique rules, 154 of
performance in Catalan (cat), selected because its them correctly. The greater variety of both correct
results are fairly typical of the Romance family; and incorrect outputs from the similar-exemplar
the similar-exemplar system scores 53% while the model demonstrates its preference for faithfulness
random-exemplar system scores 12%. The table to the exemplar rather than remodeling the output
shows selected instances with different levels of to fit language-specific constraints.
exemplar match and mismatch. The first two, ar-
rissar “curl” and disputar “discuss”, match their 7 Synthetic transfer experiments
exemplars well and are good cases for copying.
The random-exemplar model gets these both wrong, When transfer learning fails, it can be difficult to
segmenting incorrectly in the first and adding a spu- tell whether the system has failed to represent a
rious character in the second. The next two, repetir general morphological process, or whether it mis-
9
applies what it has learned due to mismatched lexi-
As stated above, the Niger-Congo datasets are artificial-
ized and probably does not represent the real difficulty of the cal/phonological triggers. Experiments on artificial
inflection task. data can probe what abstract processes the model

218
Lemma Exemplar Rand. Sel. Sim. Sel Target
arrissar posar : posarien arrissaren arrissarien arrissarien
disputar descriure : descriuria disputarta disputaria disputaria
repetir cremar : cremo repetirer repetio repeteixo
engolir forjar : forjava engolire engoliva engolia
llevar-se terminar : termino llevar-se llevor-se llevo

Table 3: Development data from Catalan (Romance: cat) showing the outputs of two one-shot systems.

has learned to apply, the links between these pro- at least half the languages of every development
cesses and language families, and the environments family.11 The second is a subset of Cyrillic char-
in which they can operate. acters intended to test transfer to a less-familiar or-
A probing dataset is synthesized to model several thography; a few Uralic development languages are
morphological operations (Figure 2), including pre- written in Cyrillic. Each language has 90 random
fix/suffix affixation, reduplication and gemination. lemmas, sampled with the frames CVCV, CVCVC,
Affixation is typologically widespread (Bickel and CVCVCVC; affixal languages have 30 affixes of
Nichols, 2013) and appears in every development types VCV, CV, CVCV, plus 7 single-letter affixes.
language on which the model was trained. Suf- No probe lemma coincides with any real lemma,
fixation is more common in Germanic and Uralic; and no probe affix has frequency > 5% as a string
Oto-Manguean tonal morphology is also often rep- prefix or suffix in any real language. Affixal lan-
resented via word-final diacritics.10 Prefixing is guages contain an instance for every lemma/affix
more common in the Niger-Congo family. pair. Reduplication and gemination languages have
Reduplication appears in three of the four Aus- one instance per lemma.
tronesian development languages, Tagalog, Hili- The model is prompted to inflect the probes as
gaynon and Cebuano (WAL, 2013), but not in the if they are members of each language family, and
Maori dataset provided. The probe language has as members of a comparatively well-resourced lan-
partial reduplication of the first syllable, as found in guage selected from those families, specifically
Tagalog and Hiligaynon. Previous work with artifi- Tagalog (tgl), German (deu), Mezquital Otomi
cial data demonstrates that sequence-to-sequence (ote), Swahili (swa) and Finnish (fin), as well as
learners can learn fully abstract representations of the synthetic suffixing language used in pretraining
reduplication (Prickett et al., 2018; Nelson et al., (suff). In addition to checking whether the output
2020; Haley and Wilson, 2021), but it has not been matches, the table shows whether reduplicated in-
previously shown that networks trained on real stances have been correctly reduplicated (using a
data do this in a transferable way. In one-shot regular expression).
language transfer, reduplication instances are actu- Table 4 shows the results. A comparison be-
ally ambiguous. Given an instance modi : :: gobu tween the random-exemplar and similar-exemplar
: gogobu, there are two plausible interpretations, models confirms the hypothesis from above that
reduplicative momodi and affixal gomodi. Thus, random-exemplar models have less generalizable
analysis of reduplicative instances can be infor- representations of morphological processes, es-
mative about the model’s learned linkage between pecially prefixation and suffixation. While both
language family and typology. models are capable of attaching affixes in the syn-
Gemination is a inflectional process whereby a thetic language, the random-exemplar model learns
segment is lengthened to mark some morpholog- very language- and suffix-specific rules for apply-
ical feature (Samek-Lodovici, 1992). The probe ing these operations, leading to very low accuracy
language geminates the last non-final consonant. for copying generic affixes. Both models show
None of the development languages have morpho- less language-specific remodeling of affixes in the
logical gemination. family-only setting than when the probes are la-
The probe languages use two alphabets: the first beled as part of a particular language; this effect is
is a common subset of characters which appear in again more pronounced for the random-exemplar
10
No Unicode normalization was performed; Oto-
model.
Manguean tone diacritics are treated as characters (as are parts Both models learn to reduplicate arbitrary CV
of the complex characters of the Indic scripts). The placement syllables, but this process is mostly restricted to
of these diacritics within the word varies from language to
11
language. Consonants mpbntdrlskgh, vowels aeiou.

219
Lemma semet is Hungarian so successful as a source language
Probe type Exemplar Target
Prefixing kigu : igokigu igosemet for unrelated targets? Kann (2020) suggests that
Suffixing kigu : kiguigo semetigo it is its agglutinative nature. The results shown
Reduplication modi : momodi sesemet here offer some speculative support for this view—
Gemination bogu : boggu semmet
perhaps the relative segmentability of prototypi-
Figure 2: Probe tasks illustrated for a single lemma. cally agglutinative languages (Plank, 1999) acts
like the similar-exemplar setting in the memory-
based model, giving the source model a general
Tagalog,12 , with some generalization to Austrone-
bias for concatenative affixation, unpolluted by too
sian. Most other languages interpret reduplication
many lexical and phonological alternations. As re-
instances as affixes.
ported here, such a model is a promising starting
Only the similar-exemplar model gets any gem-
point for inflection in many non-agglutinative sys-
ination instances correct, and these primarily in
tems, such as Romance verbs, which nevertheless
Uralic.13 This is unsurprising, since the model was
are strongly concatenative.
never trained with morphological gemination. It
demonstrates that the model’s representations of Where transfer between related languages fails,
morphological processes represent the input typol- it is conjecturally possible that the source model
ogy and are not simply artifacts of the transformer representations of edit operations are too closely
architecture. While Uralic does not have gemi- linked to particular phonological and lexical prop-
nation as an independent morphological process, erties of the source. This is clearly shown in the
alternations involving geminates do occur in some synthetic transfer experiments, where generic suf-
paradigms; the NOM . PL of tikka “dart” is tikat.14 fixation fails in Germanic and Uralic despite these
The model seems to have learned a little about gem- families being strongly suffixing, because the sys-
ination from this morphophonological process, but tem has learned to remodel its outputs to conform
not a fully generalized representation. too closely to source-language templates.
Affixation remains relatively successful when us- More broadly, the synthetic experiments show
ing Cyrillic characters (suffixes more than prefixes), a link between language typology and learning
but for the most part, less so than with Latin char- of morphological processes, suggesting that lan-
acters, although in the random-exemplar model, guage structure, not only language relatedness, is
Cyrillic suffixes are somewhat more accurate, prob- key to successful transfer— transfer of structural
ably due to less interference from language-specific principles can lead to improvements even without
knowledge. This substantiates the general find- cognate words or affixes. For instance, success-
ing (Murikinati et al., 2020) that transfer across ful reduplication appears only in Austronesian and
scripts is more difficult than within-script. Cyrillic successful gemination only in Uralic. A promising
reduplication sees a much larger drop in accuracy. direction for future work would be to replace the
The difference is probably that simple affixation is language family feature with a set of typological
phonologically uncomplicated, while reduplication feature indicators such as WALs properties (WAL,
requires phonological information about vowels 2013), which might help the model to learn faster
and consonants. in low-resource target languages.

8 Discussion Two other extensions might bring the memory-


based model closer to the state of the art in super-
These experiments with real and synthetic trans- vised inflection prediction. First, although the Sig-
fer suggest some useful insights into the problem- Morphon 2020 datasets are balanced by paradigm
atic findings of earlier transfer experiments. Why cell, real datasets are Zipfian, with sparse cover-
12
The random-exemplar model has low accuracy for redu- age of cells (Blevins et al., 2017; Lignos and Yang,
plication in Tagalog because it appends spurious Tagalog pre- 2018). For languages with large paradigms, the
fixes to the output, another example of a language-specific model thus requires the capacity to fill cells for
rule. However, the regular expression check confirms that
reduplication is performed correctly. which no exemplar can be retrieved, perhaps using
13
Because of this poor performance, Cyrillic gemination a variant of adaptive source selection (Erdmann
was not tested. et al., 2020; Kann et al., 2017a). Second, the
14
See Silfverberg et al. (2021) for a fuller investigation of
generalizable representations of gradation processes in Finnish similar-exemplar model performs better in one-shot
noun paradigms. transfer experiments, but is hampered in the su-

220
Model Fam/Lg. Pref Pref (Cyrl) Suff Suff (Cyrl) Redup. Redup. (Cyrl) Gem.
austro 62 36 26 38 0 (10) 0 0
austro/tgl 0 1 0 0 28 (90) 3 (7) 0
ger 1 0 25 36 0 (3) 0 0
ger/deu 0 0 8 10 0 (3) 0 0
n-congo 92 55 40 41 0 (3) 0 0
n-congo/swa 100 76 36 25 0 (3) 0 0
Rand.
oto 20 15 21 33 0 (3) 0 0
oto/ote 35 30 1 9 0 (3) 0 0
uralic 3 0 23 34 0 (3) 0 0
uralic/fin 0 0 7 22 0 (3) 0 0
synth 84 62 97 91 0 (3) 0 0
synth/suff 28 1 100 97 0 (3) 0 0
austro 86 75 94 85 30 (30) 0 0
austro/tgl 30 35 75 63 88 (88) 8 (8) 0
ger 85 55 99 96 3 (3) 0 8
ger/deu 86 55 99 98 0 0 5
n-congo 99 96 98 93 0 (3) 0 3
n-congo/swa 99 98 88 57 0 0 0
Sim.
oto 88 76 95 87 18 (18) 0 0
oto/ote 96 84 59 17 5 (5) 0 0
uralic 59 10 97 95 0 0 17
uralic/fin 52 4 98 98 0 0 12
synth 94 84 99 95 8 (10) 0 2
synth/suff 86 42 100 99 0 0 2

Table 4: Accuracy of synthetic probe tasks presented as different language and language family. (Cyrl) indicates
Cyrillic alphabet. Parentheses in reduplication columns show frequency of correct CV reduplication.

pervised setting by train-test mismatch. Selecting strongest in languages without large numbers of
training exemplars using a classifier which could inflection classes, and requires training exemplars
also be used at inference time would reduce this to be selected in the same way as test exemplars.
mismatch. These experiments are left for future Memory-based analogy also provides a foundation
work. for one-shot transfer; in this case, training exem-
Finally, since the memory-based architecture is plars should closely match the elicited inflections,
cognitively inspired, it might be adapted as a cog- so that the model learns to copy rather than recon-
nitive model of language learning in contact sit- struct the output form. One-shot transfer using this
uations. Work on this learning process suggests mechanism can achieve higher accuracy than pre-
that speakers find it much easier to learn new ex- viously thought, even when no genetically related
ponents than to learn new morphological processes languages are available in training. Scores vary
(Dorian, 1978; Mithun, 2020). Thus, the impact widely, but can be over 80% for some languages.
of source-language transfer may indeed be most
Finally, this paper provides new evidence about
significant in cases where the L1 and L2 (source
what kinds of abstract information (beyond char-
and target) languages differ in the abstract mecha-
acter correspondences) is transferred between lan-
nisms of inflection rather than the specifics. Histor-
guages when learning to inflect. The model learns
ical contact-induced change provides evidence for
general processes for prefixation and suffixation
this viewpoint in the form of systems which have
which apply (to some extent) across character sets,
changed to employ the same processes as a contact
but its application of these can be disrupted by
language. For example, Cappadocian Greek has
language-specific morpho-phonological rules. It
become agglutinative through its extensive contact
also learns to reduplicate arbitrary CV sequences,
with Turkish (Janse, 2004). For other examples,
but applies this process only when targeting a lan-
see Green (1995); Thomason (2001).
guage with reduplication. Learning of morphologi-
9 Conclusion cal processes in general appears to be driven by the
input typology. The discussion argues that the use-
The results of this paper demonstrate that the pro- fulness of general representations for prefixation
posed cognitive mechanism of memory-based anal- and suffixation accounts for the puzzling effective-
ogy provides a relatively strong basis for inflection ness of agglutinative languages as transfer sources
prediction. Performance in a supervised setting is reported in previous research.

221
Acknowledgments Brian Butterworth. 1983. Lexical representation. In
Brian Butterworth, editor, Language production, vol.
This research is deeply indebted to ideas con- 2: Development, writing and other language pro-
tributed by Andrea Sims. I am also grateful to cesses, pages 257–294. Academic Press.
members of LING 5802 in autumn 2020 at Ohio Joan Bybee. 2006. From usage to grammar: The
State, and to the three anonymous reviewers for mind’s response to repetition. Language, 82(4):711–
their comments and suggestions. Parts of this work 733.
were run on the Ohio Supercomputer (OSC, 1987).
Joan Bybee and Carol Moder. 1983. Morphological
classes as natural categories. Language, 59(2):251–
270.
References
Nancy C. Dorian. 1978. The fate of morphological
2013. World atlas of language structures online. Avail- complexity in language death: Evidence from East
able online at https://wals.info/, accessed 3 Sutherland Gaelic. Language, 54(3):590–609.
June 2020.
Greg Durrett and John DeNero. 2013. Supervised
Adam Albright and Bruce Hayes. 2002. Modeling learning of complete morphological paradigms. In
English past tense intuitions with minimal gener- Proceedings of the 2013 Conference of the North
alization. In Proceedings of the Sixth Meeting of American Chapter of the Association for Computa-
the Association for Computational Linguistics Spe- tional Linguistics: Human Language Technologies,
cial Interest Group in Computational Phonology in pages 1185–1195, Atlanta, Georgia. Association for
Philadelphia, July 2002, pages 58–69. Computational Linguistics.
Maria Alegre and Peter Gordon. 1998. Frequency ef- Alexander Erdmann, Tom Kenter, Markus Becker, and
fects and the representational status of regular inflec- Christian Schallhart. 2020. Frugal paradigm com-
tions. Journal of Memory and Language, 40:41–61. pletion. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
Antonios Anastasopoulos and Graham Neubig. 2019.
pages 8248–8273, Online. Association for Computa-
Pushing the limits of low-resource morphological in-
tional Linguistics.
flection. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing Alex Graves, Greg Wayne, Malcolm Reynolds,
and the 9th International Joint Conference on Natu- Tim Harley, Ivo Danihelka, Agnieszka Grabska-
ral Language Processing (EMNLP-IJCNLP), pages Barwińska, Sergio Gómez Colmenarejo, Edward
984–996, Hong Kong, China. Association for Com- Grefenstette, Tiago Ramalho, John Agapiou, et al.
putational Linguistics. 2016. Hybrid computing using a neural net-
work with dynamic external memory. Nature,
Matthew Baerman. 2014. Covert systematicity in a dis- 538(7626):471–476.
tributionally complex system. Journal of Linguis-
tics, pages 1–47. Ian Green. 1995. The death of ‘prefixing’: contact in-
duced typological change in northern australia. In
Balthasar Bickel and Johanna Nichols. 2013. Fusion Annual Meeting of the Berkeley Linguistics Society,
of selected inflectional formatives. In Matthew S. volume 21, pages 414–425.
Dryer and Martin Haspelmath, editors, The World
Atlas of Language Structures Online. Max Planck In- Coleman Haley and Colin Wilson. 2021. Deep neural
stitute for Evolutionary Anthropology, Leipzig. networks easily learn unnatural infixation and redu-
plication patterns. Proceedings of the Society for
James P. Blevins, Petar Milin, and Michael Ramscar. Computation in Linguistics, 4(1):427–433.
2017. The Zipfian paradigm cell filling problem.
In Ferenc Kiefer, James P. Blevins, and Huba Bar- Hans Henrich Hock and Brian D. Joseph. 1996. Lan-
tos, editors, Perspectives on morphological organi- guage history, language change and language rela-
zation: Data and analyses, pages 141–158. Brill. tionship: An introduction to historical and compar-
ative linguistics. Mouton de Gruyter.
Antal van den Bosch and Walter Daelemans. 1999.
Memory-based morphological analysis. In Proceed- Ray Jackendoff and Jenny Audring. 2020. The texture
ings of the 37th Annual Meeting of the Association of the lexicon: Relational Morphology and the Par-
for Computational Linguistics, pages 285–292, Col- allel Architecture. Oxford University Press.
lege Park, Maryland, USA. Association for Compu-
tational Linguistics. Mark Janse. 2004. Animacy, definiteness, and case in
Cappadocian and other Asia Minor Greek dialects.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Journal of Greek linguistics, 5(1):3–26.
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Huiming Jin and Katharina Kann. 2017. Exploring
Askell, et al. 2020. Language models are few-shot cross-lingual transfer of morphological knowledge
learners. arXiv preprint arXiv:2005.14165. in sequence-to-sequence models. In Proceedings of

222
the First Workshop on Subword and Character Level Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li,
Models in NLP, pages 70–75, Copenhagen, Den- Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani,
mark. Association for Computational Linguistics. Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios
Anastasopoulos, Patrick Littell, and Graham Neubig.
Katharina Kann. 2020. Acquisition of inflectional 2019. Choosing transfer languages for cross-lingual
morphology in artificial neural networks with prior learning. In Proceedings of the 57th Annual Meet-
knowledge. In Proceedings of the Society for Com- ing of the Association for Computational Linguis-
putation in Linguistics 2020, pages 144–154, New tics, pages 3125–3135, Florence, Italy. Association
York, New York. Association for Computational Lin- for Computational Linguistics.
guistics.
Ling Liu and Mans Hulden. 2020. Analogy models for
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. neural word inflection. In Proceedings of the 28th
2017a. Neural multi-source morphological reinflec- International Conference on Computational Linguis-
tion. In Proceedings of the 15th Conference of the tics, pages 2861–2878, Barcelona, Spain (Online).
European Chapter of the Association for Computa- International Committee on Computational Linguis-
tional Linguistics: Volume 1, Long Papers, pages tics.
514–524, Valencia, Spain. Association for Compu-
tational Linguistics. Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu,
Chaitanya Malaviya, Lawrence Wolf-Sonkin, Gar-
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. rett Nicolai, Christo Kirov, Miikka Silfverberg, Sab-
2017b. One-shot neural cross-lingual transfer for rina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and
paradigm completion. In Proceedings of the 55th Mans Hulden. 2019. The SIGMORPHON 2019
Annual Meeting of the Association for Computa- shared task: Morphological analysis in context and
tional Linguistics (Volume 1: Long Papers), pages cross-lingual transfer for inflection. In Proceedings
1993–2003, Vancouver, Canada. Association for of the 16th Workshop on Computational Research in
Computational Linguistics. Phonetics, Phonology, and Morphology, pages 229–
244, Florence, Italy. Association for Computational
Katharina Kann and Hinrich Schütze. 2016. MED: The Linguistics.
LMU system for the SIGMORPHON 2016 shared
task on morphological reinflection. In Proceedings Petar Milin, Laurie Beth Feldman, Michael Ramscar,
of the 14th SIGMORPHON Workshop on Computa- Roberta A. Hendrick, and R. Harald Baayen. 2017.
tional Research in Phonetics, Phonology, and Mor- Discrimination in lexical decision. PLoS ONE,
phology, pages 62–70, Berlin, Germany. Associa- 12(2):e0171935.
tion for Computational Linguistics.
Marianne Mithun. 2020. Where is morphological com-
Katharina Kann and Hinrich Schütze. 2017. Unlabeled plexity? In Peter Arkadiev and Francesco Gardani,
data for morphological generation with character- editors, The complexities of morphology, pages 306–
based sequence-to-sequence models. In Proceed- 327. Oxford University Press.
ings of the First Workshop on Subword and Charac-
ter Level Models in NLP, pages 76–81, Copenhagen, Nikitha Murikinati, Antonios Anastasopoulos, and Gra-
Denmark. Association for Computational Linguis- ham Neubig. 2020. Transliteration for cross-lingual
tics. morphological inflection. In Proceedings of the
17th SIGMORPHON Workshop on Computational
David King, Andrea Sims, and Micha Elsner. 2020. In- Research in Phonetics, Phonology, and Morphology,
terpreting sequence-to-sequence models for Russian pages 189–197, Online. Association for Computa-
inflectional morphology. In Proceedings of the Soci- tional Linguistics.
ety for Computation in Linguistics 2020, pages 481–
490, New York, New York. Association for Compu- Max Nelson, Hossep Dolatian, Jonathan Rawski, and
tational Linguistics. Brandon Prickett. 2020. Probing RNN encoder-
decoder generalization of subregular functions us-
Kimmo Koskenniemi and Kenneth Ward Church. 1988. ing reduplication. In Proceedings of the Society
Complexity, two-level morphology and Finnish. In for Computation in Linguistics 2020, pages 167–
Coling Budapest 1988 Volume 1: International Con- 178, New York, New York. Association for Compu-
ference on Computational Linguistics. tational Linguistics.

Andrea Krott, R Harald Baayen, and Robert Schreuder. Garrett Nicolai, Bradley Hauer, Adam St Arnaud, and
2001. Analogy in morphology: modeling the choice Grzegorz Kondrak. 2016. Morphological reinflec-
of linking morphemes in dutch. tion via discriminative string transduction. In Pro-
ceedings of the 14th SIGMORPHON Workshop on
Constantine Lignos and Charles Yang. 2018. Morphol- Computational Research in Phonetics, Phonology,
ogy and language acquisition. In Andrew Hippis- and Morphology, pages 31–35, Berlin, Germany. As-
ley and Gregory T. Stump, editors, Cambridge hand- sociation for Computational Linguistics.
book of morphology, pages 765–791. Cambridge
University Press. OSC. 1987. Ohio supercomputer center.

223
Frans Plank. 1999. Split morphology: How agglutina-
tion and flexion mix. Linguistic Typology, 3:279–
340.
Brandon Prickett, Aaron Traylor, and Joe Pater. 2018.
Seq2Seq models with dropout can learn generaliz-
able reduplication. In Proceedings of the Fifteenth
Workshop on Computational Research in Phonetics,
Phonology, and Morphology, pages 93–100, Brus-
sels, Belgium. Association for Computational Lin-
guistics.
Vieri Samek-Lodovici. 1992. A unified analysis of
crosslinguistic morphological gemination. In Pro-
ceedings of CONSOLE, volume 1, pages 265–283.
Citeseer.
Miikka Silfverberg, Francis Tyers, Garrett Nicolai, and
Mans Hulden. 2021. Do RNN states encode abstract
phonological alternations? In Proceedings of the
2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, pages 5501–5513. As-
sociation for Computational Linguistics.
Miikka Silfverberg, Adam Wiemerslage, Ling Liu, and
Lingshuang Jack Mao. 2017. Data augmentation for
morphological reinflection. In Proceedings of the
CoNLL SIGMORPHON 2017 Shared Task: Univer-
sal Morphological Reinflection, pages 90–99, Van-
couver. Association for Computational Linguistics.
Royal Skousen. 1989. Analogical modeling of lan-
guage. Springer Science & Business Media.
Sarah Grey Thomason. 2001. Contact-induced typo-
logical change. In Language typology and language
universals: An international handbook, volume 2,
pages 1640–1648.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In NIPS.
Ekaterina Vylomova, Jennifer White, Eliza-
beth Salesky, Sabrina J. Mielke, Shijie Wu,
Edoardo Maria Ponti, Rowan Hall Maudslay, Ran
Zmigrod, Josef Valvoda, Svetlana Toldova, Francis
Tyers, Elena Klyachko, Ilya Yegorov, Natalia
Krizhanovsky, Paula Czarnowska, Irene Nikkarinen,
Andrew Krizhanovsky, Tiago Pimentel, Lucas
Torroba Hennigen, Christo Kirov, Garrett Nicolai,
Adina Williams, Antonios Anastasopoulos, Hilaria
Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka
Silfverberg, and Mans Hulden. 2020. SIGMOR-
PHON 2020 shared task 0: Typologically diverse
morphological inflection. In Proceedings of the
17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 1–39, Online. Association for Computational
Linguistics.
Shijie Wu, Ryan Cotterell, and Mans Hulden. 2020.
Applying the transformer to character-level transduc-
tion. arXiv preprint arXiv:2005.10213.

A Full results
For replicability, this appendix provides full results for all languages, as 0-1 accuracy on the official test datasets. The reported baseline is TRM-SINGLE, copied from https://docs.google.com/spreadsheets/d/1ODFRnHuwN-mvGtzXA1sNdCi-jNqZjiE-i9jRxZCK0kg. Scores for supervised systems on the development languages are shown in Table 5 and scores for one-shot systems on surprise languages are shown in Table 6. See Vylomova et al. (2020) for language abbreviation definitions.

Lang Fam Rand Sim Base
ang Indo-Eur: Germanic 72 19 78
azg Oto-Manguean 94 22 95
ceb Austronesian 79 69 84
cly Oto-Manguean 82 19 91
cpa Oto-Manguean 74 33 91
ctp Oto-Manguean 43 15 60
czn Oto-Manguean 83 32 80
dan Indo-Eur: Germanic 75 42 75
deu Indo-Eur: Germanic 93 62 98
eng Indo-Eur: Germanic 97 67 97
est Uralic 94 47 95
fin Uralic 100 39 100
frr Indo-Eur: Germanic 81 39 87
gaa Niger-Congo 100 100 98
gmh Indo-Eur: Germanic 94 75 91
hil Austronesian 97 74 98
isl Indo-Eur: Germanic 88 37 97
izh Uralic 85 33 87
kon Niger-Congo 99 99 98
krl Uralic 99 36 99
lin Niger-Congo 100 100 100
liv Uralic 93 54 96
lug Niger-Congo 90 74 91
mao Austronesian 71 57 52
mdf Uralic 92 67 94
mhr Uralic 91 67 93
mlg Austronesian 100 100 100
myv Uralic 93 61 94
nld Indo-Eur: Germanic 99 61 99
nob Indo-Eur: Germanic 75 47 76
nya Niger-Congo 100 100 100
ote Oto-Manguean 99 80 99
otm Oto-Manguean 98 46 98
pei Oto-Manguean 65 17 72
sme Uralic 99 31 100
sot Niger-Congo 100 100 98
swa Niger-Congo 100 100 100
swe Indo-Eur: Germanic 97 59 99
tgl Austronesian 69 35 72
vep Uralic 83 28 84
vot Uralic 81 41 86
xty Oto-Manguean 90 79 91
zpv Oto-Manguean 87 46 85
zul Niger-Congo 92 83 92
Overall 89 57 90
Stdev 12 26 11

Table 5: Zero-one test-set accuracy scores by language for SigMorphon 2020 development languages (supervised).

Lang Fam Rand Sim Base
ast Indo-Eur: Romance 2 64 100
aze Turkic 9 17 81
bak Turkic 15 14 100
ben Indo-Aryan 1 4 99
bod Sino-Tibetan 21 9 84
cat Indo-Eur: Romance 12 53 100
cre Algic 2 14 68
crh Turkic 24 45 99
dak Siouan 13 13 96
dje Nilo-Saharan 21 82 88
evn Tungusic 28 27 57
fas Indo-Eur: Iranian 2 13 100
frm Indo-Eur: Romance 7 73 100
fur Indo-Eur: Romance 11 19 100
glg Indo-Eur: Romance 9 59 100
gml Indo-Eur: Germanic 11 11 62
gsw Indo-Eur: Germanic 33 64 93
hin Indo-Aryan 0 1 100
kan Dravidian 13 16 76
kaz Turkic 0 7 98
kir Turkic 2 6 98
kjh Turkic 11 11 100
kpv Uralic 17 47 97
lld Indo-Eur: Romance 3 68 99
lud Uralic 22 14 32
mlt Afro-Asiatic 10 13 97
mwf Australian 4 6 90
nno Indo-Eur: Germanic 42 40 86
olo Uralic 37 33 94
ood Uto-Aztecan 33 30 81
orm Afro-Asiatic 2 52 99
pus Indo-Eur: Iranian 13 9 90
san Indo-Aryan 13 5 93
sna Niger-Congo 75 88 100
syc Afro-Asiatic 8 13 91
tel Dravidian 0 10 95
tgk Indo-Eur: Iranian 89 79 56
tuk Turkic 0 21 86
udm Uralic 11 30 98
uig Turkic 0 26 99
urd Indo-Aryan 2 7 99
uzb Turkic 0 21 100
vec Indo-Eur: Romance 2 62 100
vro Uralic 17 17 61
xno Indo-Eur: Romance 2 22 96
Overall 14 30 90
Stdev 18 25 15

Table 6: Zero-one test-set accuracy scores by language for SigMorphon 2020 surprise languages (one-shot).
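Both tables report zero-one accuracy, i.e., the percentage of test items whose predicted inflected form matches the gold form exactly. For reference, a minimal sketch of how such a score can be computed from parallel lists of predictions and gold forms is shown below; the function name and the commented file format are illustrative assumptions, not part of the shared task tooling.

```python
def zero_one_accuracy(predictions, references):
    """Percentage of items where the predicted form equals the gold form exactly."""
    assert len(predictions) == len(references)
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return 100.0 * correct / len(references)

# Illustrative usage, assuming tab-separated lines of (lemma, form, tags):
# preds = [line.rstrip("\n").split("\t")[1] for line in open("ang.pred")]
# golds = [line.rstrip("\n").split("\t")[1] for line in open("ang.gold")]
# print(round(zero_one_accuracy(preds, golds)))
```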

Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gradient descent

Huteng Dai, Rutgers University, [email protected]
Richard Futrell, University of California, Irvine, [email protected]

Abstract

We introduce a simple and highly general phonotactic learner which induces a probabilistic finite-state automaton from word-form data. We describe the learner and show how to parameterize it to induce unrestricted regular languages, as well as how to restrict it to certain subregular classes such as Strictly k-Local and Strictly k-Piecewise languages. We evaluate the learner on its ability to learn phonotactic constraints in toy examples and in datasets of Quechua and Navajo. We find that an unrestricted learner is the most accurate overall when modeling attested forms not seen in training; however, only the learner restricted to the Strictly Piecewise language class successfully captures certain nonlocal phonotactic constraints. Our learner serves as a baseline for more sophisticated methods.

1 Introduction

Natural language phonotactics is argued to fall in the class of regular languages, or even in a smaller class of subregular languages (Rogers et al., 2013). This observation has motivated the study of probabilistic finite-state automata (PFAs) that generate these languages as models of phonotactics. Here we implement a simple method for the induction of PFAs for phonotactics from data, which can induce general regular languages in addition to languages in certain more restricted subclasses, for example, Strictly k-Local and Strictly k-Piecewise languages (Heinz, 2018; Heinz and Rogers, 2010). We evaluate our learner on corpus data from Quechua and Navajo, with a particular emphasis on the ability to learn nonlocal constraints.

We make both theoretical and empirical contributions. Theoretically, we present the differentiable linear-algebraic formulation of PFAs which enables learning of the structure of the automaton by gradient descent. In our framework, it is possible to induce an unrestricted automaton with a given number of states, or an automaton with hard-coded constraints representing various subregular languages. This work fills a gap in the formal linguistics literature, where learners have been developed within certain subregular classes (Shibata and Heinz, 2019; Heinz, 2010; Heinz and Rogers, 2010; Futrell et al., 2017), whereas our learner can in principle induce any (sub)regular language. In addition, we demonstrate how Strictly Local and Strictly Piecewise constraints can be encoded within our framework, and show how information-theoretic regularization can be applied to produce deterministic automata.

Empirically, our main result is to show that our approach gives reasonable and linguistically accurate results. We find that inducing an unrestricted PFA produces the best fit to held-out attested forms, while inducing an automaton for a Strictly 2-Piecewise language yields a model that successfully captures nonlocal constraints. We also analyze the nondeterminism of induced automata, and the extent to which induced automata overfit to their training data.

2 Model specification

2.1 Probabilistic Finite-state Automata

A probabilistic finite-state automaton (PFA) for generating sequences consists of a finite set of states Q, an inventory of symbols Σ, an emission distribution with probability mass function p(x|q) which gives the probability of generating a symbol x ∈ Σ given state q ∈ Q, and a transition distribution with probability mass function p(q′|q, x) which gives the probability of transitioning into new state q′ from state q after emission of symbol x.

We parameterize a PFA using a family of right-stochastic matrices. The emission matrix E, of
shape |Q| × |Σ|, gives the probability of emitting a symbol x given a state. Each row in the matrix represents a state, and each column represents an output symbol. Given a distribution on states represented as a stochastic vector q, the probability mass function over symbols is:

$$p(\cdot \mid q) = q^\top E. \quad (1)$$

Each symbol x ∈ Σ is associated with a right-stochastic transition matrix T_x of shape |Q| × |Q|, so that the probability distribution on following states, given that the symbol x was emitted from the distribution on states q, is

$$p(\cdot \mid q, x) = q^\top T_x. \quad (2)$$

Generation of a particular sequence x ∈ Σ* works by starting in a distinguished initial state q_0, generating a symbol x, transitioning into the next state q′, and so on recursively until reaching a distinguished final state q_f. Given a PFA parameterized by matrices E and T, the probability of a sequence x_{t=1}^{N}, marginalizing over all trajectories through states, can be calculated according to the Forward algorithm (Baum et al., 1970; Vidal et al., 2005a, §3) as follows:

$$p(x_{t=1}^{N} \mid E, T) = f(x_{t=1}^{N} \mid \delta_{q_0}),$$

where δ_q is a one-hot coordinate vector on state q and

$$f(\emptyset \mid q) = \delta_{q_f}^\top q$$
$$f(x_{t=1}^{n} \mid q) = p(x_1 \mid q) \cdot f(x_{t=2}^{n} \mid q^\top T_{x_1}).$$

The important aspect of this formulation is that the probability of a sequence is a differentiable function of the matrices E and T that define the PFA. Because the probability function is differentiable, we can induce a PFA from a set of training sequences by using gradient descent to search for matrices that maximize the probability of the training sequences.

2.2 Learning by gradient descent

We describe a simple and highly general method for inducing a PFA from data by stochastic gradient descent. Although more specialized learning algorithms and heuristics exist for special cases (see for example Vidal et al., 2005b, §3), ours has the advantage of generality. Our goal is to see how effective this simple approach can be in practice.

Given a data distribution X with support over Σ*, we wish to learn a PFA by finding parameter matrices E and T to minimize an objective function of the form

$$J(E, T) = \langle -\log p(x \mid E, T) \rangle_{x \sim X} + C(E, T), \quad (3)$$

where ⟨·⟩_{x∼X} indicates an average over values x drawn from the data distribution X, and −log p(x|E, T) is the negative log likelihood (NLL) of a sample x under the model; the average negative log likelihood is equivalent to the cross entropy of the data distribution X and the model. By minimizing cross-entropy, we maximize likelihood and thus fit to the data. The term C(E, T) represents additional complexity constraints on the E and T matrices, discussed in Section 2.4. When C is interpreted as a log prior probability on automata, then minimizing Eq. 3 is equivalent to fitting the model by maximum a posteriori.

Given the formulation in Eq. 3, because the objective function is differentiable, we can search for the optimal matrices E and T by performing (stochastic) descent on the gradients of the objective. That is, for a parameter matrix X, we can search for a minimum by performing updates of the form

$$X' = X - \eta \nabla J(X), \quad (4)$$

where the scalar η is the learning rate. In stochastic gradient descent, each update is performed using a random finite sample from the data distribution, called a minibatch, to approximate the average over the data distribution in Eq. 3.

However, we cannot apply these updates directly to the matrices E and T because they must be right-stochastic, meaning that the entries in each row must be positive and sum to 1. There is no guarantee that the output of Eq. 4 would satisfy these constraints. This issue was addressed by Dai (2021) by clipping the values of the matrix E to be between 0 and 1. A more general solution is that, instead of doing optimization on the E and T matrices directly, we instead do optimization over underlying real-valued matrices Ẽ and T̃ such that

$$E_{ij} = \frac{\exp \tilde{E}_{ij}}{\sum_k \exp \tilde{E}_{ik}}, \qquad T_{ij} = \frac{\exp \tilde{T}_{ij}}{\sum_k \exp \tilde{T}_{ik}},$$

in other words we derive the matrices E and T by applying the softmax function to underlying matrices Ẽ and T̃, whose entries are called logits. Gradient descent is then done on the objective as
a function of the logit matrices Ẽ and T̃. This approach to parameterizing probability distributions is standard in machine learning. Applied to induce a PFA with states Q and symbol inventory Σ, our formulation yields a total of |Q| × (|Q| × |Σ| − 1) meaningful trainable parameters.

We note that this procedure is not guaranteed to find an automaton that globally minimizes the objective when optimizing T (see Vidal et al., 2005b, §3). But in practice, stochastic gradient descent in high-dimensional spaces can avoid local minima, functioning as a kind of annealing (Bottou, 1991, §4); using these simple optimization techniques on non-convex objectives is now standard practice in machine learning.

2.3 Sequence representation and word boundaries

In order to model phonotactics, a PFA must be sensitive to the boundaries of words, because there are often constraints that apply only at word beginnings or endings (Hayes and Wilson, 2008; Chomsky and Halle, 1968). In order to account for this, we include in the symbol inventory Σ a special word boundary delimiter #, which occurs as the final symbol of each word, and which only occurs in that position. Furthermore, we constrain all matrices T to transition deterministically back into the initial state following the symbol #, effectively reusing the initial state q_0 as the final state q_f.

By constructing the automata in this way, we ensure that their long-run behavior is well-behaved. If an automaton of this form is allowed to keep generating past the symbol #, it will generate successive concatenated independent and identically distributed samples from its distribution over words, with boundary symbols # delineating them. This construction makes it possible to calculate stationary distributions over states and complexity measures related to them.

2.4 Regularization

The objective in Eq. 3 includes a regularization term C representing complexity constraints. Any differentiable complexity measure could be used here. This regularization term can be viewed from a Bayesian perspective as defining a prior over automata, and providing an inductive bias. We propose to use this term to constrain the PFA induction process to produce deterministic automata.

Most formal work on probabilistic finite-state automata for phonology has focused on deterministic PFAs because of their nice theoretical properties (Heinz, 2010). A deterministic PFA is distinguished by having fully deterministic transition matrices T. This condition can be expressed information-theoretically. Assuming 0 log 0 = 0, letting the entropy of a stochastic vector p be

$$H[p] = -\sum_i p_i \log p_i,$$

a PFA is deterministic when it satisfies the condition H[q^⊤ T_x] = 0 for all symbols x and state distributions q.

We can use this expression to monitor the degree of nondeterminism of a PFA during optimization, or to add a determinism constraint to the objective in Section 2.2. The average nondeterminism N of a PFA is given by

$$N(E, T) = \sum_{ij} \hat{q}_i E_{ij} H[\delta_{q_i}^\top T_j],$$

where q̂ is the stationary distribution over states, representing the long-run average occupancy distribution over states. The stationary distribution q̂ is calculated by finding the left eigenvector of the matrix S satisfying

$$\hat{q}^\top S = \hat{q},$$

where S is a right stochastic matrix giving the probability that a PFA transitions from state i to state j, marginalizing over symbols emitted:

$$S_{ij} = \sum_{x \in \Sigma} p(x \mid q_i)\, p(q_j \mid q_i, x).$$

For the Strictly Local and Strictly Piecewise automata, N = 0 by construction. For an automaton parameterized by T = softmax(T̃), it is not possible to attain N = 0, but nonetheless N can be made arbitrarily small. There are alternative parameterizations where N = 0 is achievable, for example using the sparsemax function instead of softmax (Martins and Astudillo, 2016; Peters et al., 2019).

In order to constrain automata to be deterministic, we set the regularization term in Eq. 3 to be

$$C = \alpha N,$$

where α is a non-negative scalar determining the strength of the trade-off of cross entropy and nondeterminism in the optimization. With α = 0 there is no constraint on the nondeterminism of the automaton, and minimizing the objective in Eq. 3 reduces to maximum likelihood estimation.
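To make the preceding subsections concrete, the following is a minimal PyTorch sketch of the learner as described in Sections 2.1–2.4: logit matrices are pushed through a softmax to obtain the emission matrix E and per-symbol transition matrices T_x, word probabilities follow the recursion printed in Section 2.1, the boundary symbol # is hard-wired to return to the initial state (Section 2.3), and the loss is the average NLL plus α times the nondeterminism N, with the stationary distribution approximated by power iteration. The class and function names, the symbol indexing (0 = #), and the power-iteration approximation are illustrative assumptions; this is not the authors' released implementation (linked in Section 5).

```python
# Illustrative reimplementation sketch of Sections 2.1-2.4 (assumptions noted above).
import torch
import torch.nn.functional as F

class PFALearner(torch.nn.Module):
    def __init__(self, num_states, num_symbols, alpha=0.0):
        super().__init__()
        self.nq, self.ns, self.alpha = num_states, num_symbols, alpha
        self.E_logits = torch.nn.Parameter(torch.randn(num_states, num_symbols))
        self.T_logits = torch.nn.Parameter(torch.randn(num_symbols, num_states, num_states))

    def matrices(self):
        E = F.softmax(self.E_logits, dim=-1)   # emission matrix, |Q| x |Sigma|
        T = F.softmax(self.T_logits, dim=-1)   # transition matrices, |Sigma| x |Q| x |Q|
        # The boundary symbol '#' (index 0) deterministically returns to q0 (Section 2.3).
        T_boundary = torch.zeros_like(T[0])
        T_boundary[:, 0] = 1.0
        T = torch.cat([T_boundary.unsqueeze(0), T[1:]], dim=0)
        return E, T

    def word_nll(self, word):
        """Negative log probability of a word (list of symbol ids ending in 0 for '#'),
        following f(x_1..n | q) = p(x_1 | q) * f(x_2..n | q^T T_{x_1})."""
        E, T = self.matrices()
        q = torch.zeros(self.nq)
        q[0] = 1.0                              # start in the initial state q0
        logp = torch.tensor(0.0)
        for x in word:
            logp = logp + torch.log(q @ E[:, x] + 1e-12)
            q = q @ T[x]
        # Final-state factor f(empty | q) = delta_{qf}^T q; here q_f = q_0.
        return -(logp + torch.log(q[0] + 1e-12))

    def nondeterminism(self, n_power_iters=50):
        """Average nondeterminism N(E, T), using a power-iteration estimate of q_hat."""
        E, T = self.matrices()
        S = torch.einsum("qx,xqr->qr", E, T)    # S_ij = sum_x p(x|q_i) p(q_j|q_i, x)
        q_hat = torch.full((self.nq,), 1.0 / self.nq)
        for _ in range(n_power_iters):          # approximates q_hat^T S = q_hat^T
            q_hat = q_hat @ S
            q_hat = q_hat / q_hat.sum()
        row_entropy = -(T * torch.log(T + 1e-12)).sum(dim=-1)  # H of each row of each T_x
        return torch.einsum("q,qx,xq->", q_hat, E, row_entropy)

    def loss(self, minibatch):
        nll = torch.stack([self.word_nll(w) for w in minibatch]).mean()
        return nll + self.alpha * self.nondeterminism()

# Toy usage with alphabet {'#': 0, 'a': 1, 'b': 2}; every training word ends in '#'.
if __name__ == "__main__":
    data = [[1, 2, 1, 2, 0], [1, 1, 2, 0], [2, 1, 2, 0]]
    model = PFALearner(num_states=4, num_symbols=3, alpha=1.0)
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    for step in range(200):
        opt.zero_grad()
        loss = model.loss(data)
        loss.backward()
        opt.step()
```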

2.5 Implementing restricted automata Then the probability of the t’th symbol in a se-
We define Strictly Local and Strictly Piecewise au- quence xt given a context of previous symbols
tomata as automata that generate the respective lan- xt−1
i=1 is the geometric mixture of the probability
guages. We implement Strictly Local and Strictly of xt under each sub-automaton, also called the
Piecewise automata by hard-coding the transition co-emission probability
matrices T. For these automata, we only do opti- |Σ|
mization over the emission matrices E. Y
p(xt |xt−1
i=1 ) ∝ pAy (xt |xt−1
i=1 ).
Strictly Local In a Strictly k-Local (k-SL) lan- y=1
guage, each symbol is conditioned only on imme-
Because each sub-automaton Ay is deterministic,
diately preceding k − 1 symbol(s) (Heinz, 2018;
its state after seeing the context xt−1
i=1 is known,
Rogers and Pullum, 2011). We implement a 2-SL
and the conditional probability pAy (xt |xt−1
i=1 ) can
automaton by associating each state q ∈ Q with a
be computed using Eq. 1. For calculating the prob-
unique element x in the symbol inventory Σ. Upon
ability of a sequence, we assume an initial state of
emitting symbol x, the automaton deterministically
having seen the boundary symbol #; that is, the
transitions into the corresponding state, denoted qx .
sub-automaton A# starts in state q1# .
Thus the transition matrices have the form
Using this parameterization, we can do opti-
 ...q6=x ... qx ...q6=x ...
 mization over the collection of emission matri-
.. .. .. ces {E(x) }x∈Σ . This construction yields |Σ| ×
 . . . 
Tx =  . . . 0 ... 1 ...0...  (|Σ| − 1) trainable parameters for the 2-SP automa-
 .
.. .. .. ton, the same number of parameters as the 2-SL
. . . automaton.
This construction can be straightforwardly ex-
tended to k-SL, yielding |Σ|k−1 × (|Σ| − 1) train- SP + SL It is also possible to create and train
able parameters for a k-SL automaton. an automaton with the ability to condition on both
2-SL and 2-SP factors by taking the product of 2-
Strictly Piecewise A Strictly k-Piecewise k-SP) SL and 2-SP automata, as proposed by Heinz and
language, each symbol depends on the presence of Rogers (2013). We refer to the language gener-
any preceding k − 1 symbols at arbitrary distance ated by such an automaton as 2-SL + 2-SP. We
(Heinz, 2007, 2018; Shibata and Heinz, 2019). For experiment with such product machines below.
example, in a 2-SP language, in a string abc, c
would be conditional on the presence of a and the 2.6 Related work
presence of b, without regard to distance nor the PFA induction from data is a well-studied task
relative order of a and b. which has been the subject of multiple competi-
The implementation of an SP automaton is tions over the years (see Verwer et al., 2012, for a
slightly more complex than the SL automaton, as review). The most common approaches are vari-
the number of states required in a naïve imple- ants of Baum-Welch and heuristic state-merging
mentation is exponential in the symbol inventory algorithms (see for example de la Higuera, 2010).
size, resulting in intractably large matrices. We cir- Gibbs samplers and spectral methods have also
cumvent this complexity by parameterizing a 2-SP been proposed (Gao and Johnson, 2008; Bailly,
automaton as a product of simpler automata. We 2011; Shibata and Yoshinaka, 2012). Induction of
associate each symbol x ∈ Σ with a sub-automaton restricted PDFAs, especially for SL and SP lan-
Ax which has two states q0x and q1x , with state q0x guages, is explored in Heinz and Rogers (2013,
indicating that the symbol x has not been seen, 2010)
and q1x indicating that it has been seen. Each sub- Our work differs from previous approaches in its
automaton Ax has an emission matrix E(x) of size simplicity. Inspired by Shibata and Heinz (2019),
2 × |Σ| corresponding to the two states q0x and q1x ; we optimize the training objective directly via gra-
the emission matrix for all states q0x is constrained dient descent, without approximations or heuristics
to be the uniform distribution over symbols. The other than the use of minibatches. The same algo-
transition matrices T(x) are rithm is applied to learn both transition and emis-
   
(x) 0 1 (x) 1 0 sion structure, for learning of both general PFAs
Tx = , Ty6=x = . and restricted PDFAs. One of our contributions
0 1 0 1
is to show that this very simple approach gives not have an a followed by a b at any distance. The
reasonable results for learning phonotactics. reference automaton is given in Figure 1 (bottom).
The legal test string is baccca# and the illegal test
3 Inducing toy languages string is bacccb#.
First, we test the ability of the model to recover 3.3 Training parameters
automata for simple examples of subregular lan-
guages. We do so for the two subregular classes The logit matrices Ẽ and T̃ are initialized with
2-SL and 2-SP described in Section 2.5. For each random draws from a standard Normal distribution
of these language classes, we implement a ref- (Derrida, 1981). We perform stochastic gradient de-
erence PFA which generates strings from a sim- scent using the Adam algorithm, which adaptively
ple example language in that class, then generate sets the learning rate (Kingma and Ba, 2015). We
10, 000 sample sequences from the reference PFA. perform 10, 000 update steps with starting learning
We then use these samples as training data, and rate η = 0.001 and minibatch size 5.
study whether our learners can recover the relevant 3.4 Results
constraints from the data.
Unrestricted PFA induction succeeds in recover-
3.1 Evaluation ing the reference automata for both toy languages.
Learners restricted to the appropriate classes, as
We evaluate the ability to induce appropriate au-
well as the automaton combining SL and SP factors,
tomata in two ways. First, since we are studying
also succeed in inducing the appropriate automata,
very simple languages and automata, it is possible
while learners restricted to the ‘wrong’ class fail.
to directly inspect the E and T matrices and check
that they implement the correct automaton by ob- Figure 1 shows the legal–illegal differences for
serving the transition and emission probabilities. test strings over the course of training. We can
see that, when the learner is unrestricted or when
Second, we study the probabilities assigned
the learner is in the appropriate class, it eventu-
to carefully selected strings which exemplify the
ally picks up on the relevant constraint, with the
constraints that define the languages. For each
legal–illegal difference increasing apparently with-
language, we define an illegal test string which
out bound over training. Unrestricted learners take
violates the constraints of the language, and a
longer to reach this point, but they reach it reliably.
minimally-different legal test string. Given an
On the other hand, looking at the legal–illegal dif-
automaton, we can measure the legal–illegal dif-
ferences for learners in the wrong class, we see
ference: the log probability of the legal test string
that they asymptote to a small number and stop
minus the log probability of the illegal test string.
improving.
A larger legal–illegal difference indicates that the
model is assigning a higher probability to the legal These results demonstrate that our simple
form compared to the illegal one and therefore is method for PFA induction does succeed in induc-
successfully learning the constraints represented by ing certain simple structures relevant for modeling
the testing data. phonotactics in a small, controlled setting. Next,
we turn to induction of phonotactics from corpus
3.2 Languages data.
All languages are defined over the symbol inven- 4 Corpus experiments
tory {a, b, c} plus the boundary symbol #.
As an exemplar of 2-SL languages, we use the We evaluate our learner by training it on dictionary
language characterized by the forbidden factor *ab. forms from Quechua and Navajo and then studying
A deterministic PFA for the language is given in its ability to predict attested forms that were held
Figure 1 (top). The language contains all strings out in training in addition to artificially constructed
that do not have an a followed immediately by a b. nonce forms which probe the ability of the model
Our legal test string for this language is bacccb# to represent nonlocal constraints.
and the illegal test string is babccc#.
As an exemplar of 2-SP languages, we use 4.1 Training parameters
the language characterized by a forbidden factor All training parameters are as in Section 3.3, except
*a. . . b. This language contains all strings that do that we train for 100, 000 steps, and control the

Target language: *a...b (2−SP) Target language: *ab (2−SL)
Legal test string: baccca# Legal test string: bacccb#
Illegal test string: bacccb# Illegal test string: babccc#
8
Legal−illegal difference

−4

0 2500 5000 7500 10000 0 2500 5000 7500 10000


Training step

Learner class 2−SL 2−SP 2−SP + 2−SL Unrestricted PFA with |Q|=2

Figure 1: Difference in log probabilities for legal and illegal forms over the course of PFA induction for toy
languages. A large positive value indicates that the relevant constraint has been learned.

b : 1/4 palatal strident is illegal. The learning data of


c : 1/4 Navajo includes 6, 279 Navajo phonological words;
# : 1/4 a : 3/8 we divide this data into a training set of 5, 023
a : 1/4 forms and a held-out set of 1, 256 forms. The
nonce testing data of Navajo consists of 5, 000 gen-
q0 q1
erated nonce words, which were labelled as illegal
c : 3/8 (N = 3, 271) and legal (N = 1, 729) based on
# : 1/4 whether the nonlocal phonotactics are satisfied.
In Quechua, any stop cannot be followed by an
b : 1/4 ejective or aspirated stop at any distance. The learn-
c : 1/4 a : 3/8
ing data of Quechua includes 10, 804 phonolog-
# : 1/4 c : 3/8
ical words, which we separate into 8, 643 train-
a : 1/4
ing forms and 2, 160 held-out forms. The testing
q0 q1 data of Quechua (Gouskova and Gallagher, 2020)
consists of 24, 352 nonce forms which were man-
# : 1/4 ually classified as legal (N = 18, 502) and ille-
gal (N = 5, 810, including stop-aspirate and stop-
Figure 2: Reference automata for the 2-SL language ejective pairs).
characterized by the constraint *ab (top) and the 2-SP
language characterized by the constraint *a. . . b (bot-
4.3 Dependent Variables
tom). Arcs are annotated with symbols emitted and
their corresponding emission probabilities. For the linguistic performance of the classifier, we
study two main dependent variables. First, the
average held-out negative log likelihood (NLL)
succession of minibatches to be the same across indicates the ability of the model to assign high
models within the same language. probabilities to unseen but attested forms—low
NLL indicates higher probabilities. Second, us-
4.2 Dataset
ing our nonce forms dataset, we measure the ex-
The proposed learner is applied to the datasets of tent to which the model can differentiate the legal
Navajo and Quechua (Gouskova and Gallagher, forms from the illegal forms using the difference
2020), in which nonlocal phonotactics are attested. in log likelihood for the legal forms minus the il-
In Navajo, the co-occurrence of alveolar and legal forms. This is the same as the legal–illegal

Navajo Quechua
45

Heldout NLL
40
35
30
25
20

Overfitting
4

0
bits

N (alpha = 0)
6

N (alpha = 1)
6

0 25000 50000 75000 100000 0 25000 50000 75000 100000


Training step

32 128 512
Number of states |Q|
64 256 1024

Figure 3: Accuracy and complexity metrics for unrestricted PFA induction. ‘Overfitting’ is the difference between
held-out NLL and training set NLL. N is nondeterminism and alpha is the regularization parameter α (see Sec-
tion 2.4). Runs with |Q| = 128, 256, 512 and α = 1 on Navajo data terminated early due to numerical underflow
in the calculation of the stationary distribution.

Navajo Quechua
60

Heldout NLL
50

40

30
bits

20
Legal−Illegal Difference

20
15
10
5
0
0 25000 50000 75000 100000 0 25000 50000 75000 100000
Training step

2−SL 2−SP 2−SP + 2−SL PFA |Q|=1024

Figure 4: Performance of a 2-SP automaton, a 2-SL automaton, a 2-SP + 2-SL product automaton, and an un-
restricted PFA with 1, 024 states and α = 0. ‘Heldout NLL’ is the average NLL of a form in the set of attested
forms never seen during training. ‘Legal–illegal difference’ is the difference in log likelihood between ‘legal’ and
‘illegal’ forms in the nonce test set.

difference described in Section 3.1, but now as an 4.5 Discussion
average over many legal–illegal nonce pairs instead We find that an unrestricted PFA learner performs
of a difference for one pair. most accurately when predicting real held-out
forms, while an SP learner is most effective in learn-
4.4 Results ing certain nonlocal constraints. In fact, in terms
Unrestricted PFA induction Figure 3 shows re- of its ability to model the nonlocal constraints, the
sults from induction of unrestricted PFAs with var- PFA learner ends up comparable to an SL learner,
ious numbers of states. We find that show the av- which cannot learn the constraints at all. Mean-
erage NLL of forms in the heldout data, as well as while, the SP learner, which is unable to model
‘overfitting’, defined as the average held-out NLL local constraints, fares much worse than even the
minus the average training set NLL. This number SL learner on predicting held-out forms. The prod-
shows the extent to which the model assigns higher uct SP + SL learner combines the strengths of both
probabilities to forms in the training set as opposed restricted learners, but still does not assign as high
to the held-out set, an index of overfitting. We find probability to the real held-out forms as the unre-
that automata with more states fit the data better, stricted PFA learner.
but are also more prone to overfitting to the training This pattern of performance suggests that the
set. PFA learner is using most of its states to model
In Figure 3 (bottom two rows) we also show the local constraints beyond those captured in a 2-SL
measured nondeterminism N of the induced au- language. These constraints are important for pre-
tomata throughout training, for different values of dicting real held-out forms. The SP automaton
the regularization parameter α (see Section 2.4). is unable to achieve strong performance on held-
We find that, even without an explicit constraint out forms without the ability to model these local
for determinism, the induced PFAs tend towards constraints. On the other hand, the unrestricted
determinism over time, with N reaching around PFA tends to overfit to its training data, perhaps
1.5 bits by the final training step. Explicit regu- explaining its failure to detect nonlocal constraints
larization (with α = 1) makes this process faster, which are picked up by the appropriate restricted
with N reaching around 0.5 bits. Regularization automata.
for determinism has only a minimal effect on the
5 Conclusion
NLL values.
We introduced a framework for phonotactic learn-
Linguistic performance and restricted models ing based on simple induction of probabilistic finite-
Figure 4 shows held-out NLL and the legal–illegal state automata by stochastic gradient descent. We
difference for both languages, comparing the SL showed how this framework can be used to learn
automaton, the SP automaton, the product SP + unrestricted PFAs, in addition to PFAs restricted
SL automaton, and a PFA with 1, 024 states and to certain formal language classes such as Strictly
α = 0. Local and Strictly Piecewise, via constraints on
In terms of the ability to predict attested held- the transition matrices that define the automata.
out forms, the best model is consistently the unre- Furthermore, we showed that the framework is suc-
stricted PFA, with the SP automaton performing cessful in learning some phonotactic phenomena,
the worst. However, in terms of predicting the ill- with unrestricted automata performing best in a
formedness of artificial forms violating nonlocal wide-coverage evaluation on attested but held-out
phonotactic constraints, the best model is either forms, and Strictly Piecewise automata perform-
the SP automaton or the SP + SL product automa- ing best in a targeted evaluation using nonce forms
ton. Both of these automata successfully induce focusing on nonlocal constraints.
the nonlocal constraint. Our results leave open the question of whether
On the other hand, the unrestricted PFA learner the unrestricted learner or one of the restricted
shows no evidence at all of having learned the dif- learners is ‘best’ for learning phonotactics, since
ference between legal and illegal forms in the arti- they perform differently on different metrics. A key
ficial data, despite having the capacity to do so in question for future work is whether there might be
theory, and despite succeeding in inducing a 2-SP some model that could do well in inducing both
language in Section 3. local and nonlocal constraints simultaneously, and

performing well on both the held-out evaluation the 2008 Conference on Empirical Methods in Natu-
and the nonce form evaluation. Such a model could ral Language Processing, pages 344–352, Honolulu,
Hawaii. Association for Computational Linguistics.
come in the form of another restricted language
class such as Tier-Based Strictly Local languages Maria Gouskova and Gillian Gallagher. 2020. Induc-
(Heinz et al., 2011; Jardine and Heinz, 2016; Mc- ing nonlocal constraints from baseline phonotactics.
Mullin, 2016; Jardine and McMullin, 2017), or Natural Language & Linguistic Theory, 38(1):77–
116.
perhaps in the form of a regularization term in the
training objective which enforces an inductive bias Bruce Hayes and Colin Wilson. 2008. A maximum en-
that favors certain nonlocal interactions. tropy model of phonotactics and phonotactic learn-
ing. Linguistic Inquiry, 39(3):379–440.
The code for this project is available
at http://github.com/hutengdai/ Jeffrey Heinz. 2007. The inductive learning of phono-
PFA-learner. tactic patterns. Ph.D. thesis, PhD dissertation, Uni-
versity of California, Los Angeles.
Acknowledgments Jeffrey Heinz. 2010. Learning long-distance phonotac-
tics. Linguistic Inquiry, 41(4):623–661.
This work was supported by a GPU Grant from the
NVIDIA corporation. We thank the three anony- Jeffrey Heinz. 2018. The computational nature of
mous reviewers and Adam Jardine, Jeff Heinz, and phonological generalizations. Phonological Typol-
Dakotah Lambert for their comments. ogy, Phonetics and Phonology, pages 126–195.

Jeffrey Heinz, Chetan Rawal, and Herbert G. Tan-


ner. 2011. Tier-based Strictly Local constraints for
References phonology. In Proceedings of the 49th Annual Meet-
ing of the Association for Computational Linguistics:
Raphael Bailly. 2011. Quadratic weighted automata: Human Language Technologies, pages 58–64, Port-
Spectral algorithm and likelihood maximization. In land, Oregon, USA. Association for Computational
Asian Conference on Machine Learning, pages 147– Linguistics.
163. PMLR.
Jeffrey Heinz and James Rogers. 2010. Estimating
Leonard E. Baum, Ted Petrie, George Soules, and Nor- Strictly Piecewise distributions. In Proceedings
mal Weiss. 1970. A maximization technique occur- of the 48th Annual Meeting of the Association for
ring in the statistical analysis of probabilistic func- Computational Linguistics, pages 886–896, Upp-
tions of Markov chains. Annals of Mathematical sala, Sweden. Association for Computational Lin-
Statistics, 41:164–171. guistics.
Léon Bottou. 1991. Stochastic gradient learning in Jeffrey Heinz and James Rogers. 2013. Learning sub-
neural networks. Proceedings of Neuro-Nımes, regular classes of languages with factored determin-
91(8):12. istic automata. In Proceedings of the 13th Meeting
on the Mathematics of Language (MoL 13), pages
Noam Chomsky and Morris Halle. 1968. The Sound 64–71, Sofia, Bulgaria. Association for Computa-
Pattern of English. Harper & Row. tional Linguistics.
Huteng Dai. 2021. Learning nonlocal phonotactics in Adam Jardine and Jeffrey Heinz. 2016. Learning Tier-
Strictly Piecewise phonotactic model. In Proceed- based Strictly 2-Local languages. Transactions of
ings of the 2020 Annual Meeting on Phonology. the Association for Computational Linguistics, 4:87–
98.
Colin de la Higuera. 2010. Grammatical Inference:
Learning Automata and Grammars. Cambridge Uni- Adam Jardine and Kevin McMullin. 2017. Effi-
versity Press. cient learning of Tier-based Strictly k-Local lan-
guages. In International Conference on Language
Bernard Derrida. 1981. Random-energy model: An ex- and Automata Theory and Applications, pages 64–
actly solvable model of disordered systems. Physi- 76. Springer.
cal Review B, 24(5):2613.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
Richard Futrell, Adam Albright, Peter Graff, and Tim- method for stochastic optimization. In 3rd Inter-
othy J. O’Donnell. 2017. A generative model of national Conference on Learning Representations,
phonotactics. Transactions of the Association for ICLR 2015, Conference Track Proceedings, San
Computational Linguistics, 5:73–86. Diego, CA.

Jianfeng Gao and Mark Johnson. 2008. A compar- Andre Martins and Ramon Astudillo. 2016. From
ison of Bayesian estimators for unsupervised Hid- softmax to sparsemax: A sparse model of atten-
den Markov Model POS taggers. In Proceedings of tion and multi-label classification. In International

Conference on Machine Learning, pages 1614–1623.
PMLR.
Kevin James McMullin. 2016. Tier-based locality in
long-distance phonotactics: learnability and typol-
ogy. Ph.D. thesis, University of British Columbia.

Ben Peters, Vlad Niculae, and André F. T. Martins.


2019. Sparse sequence-to-sequence models. In Pro-
ceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 1504–
1519, Florence, Italy. Association for Computational
Linguistics.
James Rogers, Jeffrey Heinz, Margaret Fero, Jeremy
Hurst, Dakotah Lambert, and Sean Wibel. 2013.
Cognitive and sub-regular complexity. In Formal
Grammar, pages 90–108. Springer.
James Rogers and Geoffrey K. Pullum. 2011. Aural
pattern recognition experiments and the subregular
hierarchy. Journal of Logic, Language and Informa-
tion, 20(3):329–342.
Chihiro Shibata and Jeffrey Heinz. 2019. Maximum
likelihood estimation of factored regular determinis-
tic stochastic languages. In Proceedings of the 16th
Meeting on the Mathematics of Language, pages
102–113, Toronto, Canada. Association for Compu-
tational Linguistics.
Chihiro Shibata and Ryo Yoshinaka. 2012. Marginaliz-
ing out transition probabilities for several subclasses
of PFAs. Journal of Machine Learning Research
- Workshops and Conference Proceedings, 21:259–
263.
Sicco Verwer, Rémi Eyraud, and Colin de la Higuera.
2012. PAutomaC: A PFA/HMM learning competi-
tion. Journal of Machine Learning Research - Work-
shops and Conference Proceedings, 21.
Enrique Vidal, Franck Thollard, Colin de la Higuera,
Francisco Casacuberta, and Rafael C. Carrasco.
2005a. Probabilistic finite-state machines – Part I.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 27(7):1013–1025.
Enrique Vidal, Franck Thollard, Colin de la Higuera,
Francisco Casacuberta, and Rafael C. Carrasco.
2005b. Probabilistic finite-state machines – Part II.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 27(7):1026–1039.

Recognizing Reduplicated Forms: Finite-State Buffered Machines

Yang Wang
Department of Linguistics
University of California, Los Angeles
Los Angeles, CA, USA
[email protected]

Abstract

Total reduplication is common in natural language phonology and morphology. However, formally as copying on reduplicants of unbounded size, unrestricted total reduplication requires computational power beyond context-free, while other phonological and morphological patterns are regular, or even sub-regular. Thus, existing language classes characterizing reduplicated strings inevitably include typologically unattested context-free patterns, such as reversals. This paper extends regular languages to incorporate reduplication by introducing a new computational device: finite state buffered machines (FSBMs). We give its mathematical definitions and discuss some closure properties of the corresponding set of languages. As a result, the class of regular languages and languages derived from them through a copying mechanism is characterized. Suggested by previous literature (Gazdar and Pullum, 1985), this class of languages should approach the characterization of natural language word sets.

1 The Puzzle of (Total) Reduplication

Formal language theory (FLT) provides computational mechanisms characterizing different classes of abstract languages based on their inherent structures. Following FLT in the study of human languages, in principle, researchers would expect a hierarchy of grammar formalisms that matches empirical findings: more complex languages in such a hierarchy are supposed to be 1) less common in natural language typology; and 2) harder for learners to learn.

The classical Chomsky Hierarchy (CH) puts formal languages into four levels with increasing complexity: regular, context-free, context-sensitive, recursively enumerable (Chomsky, 1956; Jäger and Rogers, 2012). Does the CH notion of formal complexity have the desired empirical correlates? Several findings suggest that those four levels do not align with natural languages precisely, some leading to major refinements on the CH. First, the unbounded crossing dependencies in Swiss-German case marking (Shieber, 1985) facilitated attempts to characterize mildly context-sensitive languages (MCS), which extend context-free languages (CFLs) but still preserve some useful properties of CFLs (e.g., Joshi, 1985; Seki et al., 1991; Stabler, 1997). Secondly, it is generally accepted that phonology is regular (e.g. Johnson, 1972; Kaplan and Kay, 1994). However, being regular is argued to be an unrestrictive property for phonological well-formed strings: for example, a language whose words are sensitive to an even or odd number of certain sounds is unattested (Heinz, 2018). With strong typological evidence, the sub-regular hierarchy was further developed, which continues to be an active area of research (e.g., McNaughton and Papert, 1971; Simon, 1975; Heinz, 2007; Heinz et al., 2011; Chandlee, 2014; Graf, 2017).

In this paper, we analyze another mismatch between existing well-known language classes and empirical findings: reduplication, which involves copying operations on certain base forms (Inkelas and Zoll, 2005). The reduplicated phonological strings are either of total identity (total reduplication) or of partial identity (partial reduplication) to the base forms. Table 1 provides examples showing the difference between total reduplication and partial reduplication: in Dyirbal, the pluralization of nominals is realized by fully copying the singular stems, while in the Agta examples, plural forms only copy the first CVC sequence of the corresponding singular forms (Healey, 1960; Marantz, 1982).

Reduplication is common cross-linguistically. According to Rubino (2013) and Dolatian and Heinz (2020), 313 out of 368 natural languages exhibit productive reduplication, in which 35 languages only have total reduplication, but not partial

Total reduplication: Dyirbal plurals (Dixon, 1972, 242)
Singular Gloss Plural Gloss
midi ‘little, small’ midi-midi ‘lots of little ones’
gulgiói ‘prettily painted men’ gulgiói-gulgiói ‘lots of prettily painted men’

Partial reduplication: Agta plurals (Healey, 1960,7)


Singular Gloss Plural Gloss
labáng ‘patch’ lab-labáng ‘patches’
takki ‘leg’ tak-takki ‘legs’

Table 1: Total reduplication: Dyirbal plurals (top); partial reduplication: Agta plurals (bottom).

m i d i m i d i context sensitive
dly context
m il sen
s
o n
c tex

i ti
m i d i i d i m w w R t -f

ve
ree
regular
ai bj
Figure 1: Crossing dependencies in Dyirbal total redu-
plication ‘midi-midi’ (top) versus nesting dependencies ww
a ib j i
in unattested string reversal ‘midi-idim’ (bottom) ai bj ci dj a bj

reduplication. As a comparison, it is widely rec-


ognized that context-free string reversals are rare
in phonology and morphology (Marantz, 1982) Figure 2: The class of regular with copying languages
and appear to be confined to language games in CH
(Bagemihl, 1989).
Unrestricted total reduplication, or unbounded
copying, can be abstracted as Lww = {ww | w ∈ We do not know whether there exists an
Σ∗ }, a well-known non-context free language independent characterization of the class
(Culy, 1985; Hopcroft and Ullman, 1979).1 Its non- of languages that includes the regular
context-freeness comes from the incurred crossing sets and languages derivable from them
dependencies among symbols, similar to Swiss- through reduplication, or what the time
German case marking constructions. However, complexity of that class might be, but
the typologically-rare string reversals wwR demon- it currently looks as if this class might
strate nesting dependencies, which are context-free be relevant to the characterization of NL
(see Fig. 1 as an illustration). word-sets.
Given most phonological and morphological pat-
terns are regular, how can one fit in reduplicated
strings without including reversals? Gazdar and Motivated by Gazdar and Pullum (1985), this
Pullum (1985, 278) made the remark that article aims to give a formal characterization of
1 regular with copying languages. Specifically, it ex-
Total reduplication does not immediately guarantee un-
boundedness. When the set of bases is finite, i.e, {ww | w ∈ amines what minimal changes can be brought to
L} when L is finite, total reduplication can be squeezed in regular languages to include stringsets with two ad-
languages described by 1 way finite state machines (Chandlee,
2017), though doing so eventually leads to state explosion
jacent copies, while excluding some typologically
(Roark and Sproat, 2007; Dolatian and Heinz, 2020). Com- unattested context-free patterns, such as reversals,
putationally, only total reduplication with infinite number of shown in Fig. 2. One possible way to probe such
potential reduplicants is true unbounded copying. With care-
ful treatment, unbounded copying, externalizing a primitive a language class is by adding copying to the set of
copying operation, can be justified as a model of reduplication operations whose closure defines regular languages.
in natural languages. More in-depth discussion of 1): bounded Instead, the approach we take in this paper is to
versus unbounded and 2): copying as a primitive operation
can be found in Clark and Yoshinaka (2014); Chandlee (2017); add reduplication to finite state automata (FSAs),
Dolatian and Heinz (2020). which compute regular languages.

Various attempts followed this vein:2 one ex- are two-taped finite state automata, sensitive to
ample is finite state registered machine in Cohen- copying activities within strings, hence able to de-
Sygal and Wintner (2006) (FSRAs) with finitely tect identity between sub-strings. This paper is
many registers as its memory, limited in the way organized as follows: Section 2 provides a defi-
that it only models bounded copying. The state-of- nition of FSBMs with examples. Then, to better
art finite state machinery that computes unbounded understand the copying mechanism, complete-path
copying elegantly and adequately is 2-way finite FSBMs, which recognize exactly the same set of
state transducers (2-way FSTs), capturing redupli- languages as general FSBMs, are highlighted. Sec-
cation as a string-to-string mapping (w → ww) tion 3 examines the computational and mathemat-
(Dolatian and Heinz, 2018a,b, 2019, 2020). To ical properties of the set of languages recognized
avoid the mirror image function (w → wwR ), complete-path FSBMs. Section 4 concludes with
Dolatian and Heinz (2020) further developed sub- discussion and directions for future research.
classes of 2-way FSTs which cannot output any-
thing during right-to-left passes over the input (cf. 2 Finite State Buffered Machine
rotating transducers: Baschenis et al., 2017).
2.1 Definitions
It should be noted that the issue addressed by 2-
way FSTs is a different one: reduplication is mod- FSBMs are two-taped automata with finite-state
eled as a function (w → ww), while this paper fo- core control. One tape stores the input, as in normal
cuses on a set of languages containing identical sub- FSAs; the other serves as an unbounded memory
strings (ww). The stringset question is non-trivial buffer, storing reduplicants temporarily for future
and well-motivated for reasons of both formal as- identity checking. Intuitively, FSBMs is an ex-
pects and its theoretical relevance. Firstly, since tension to FSRAs but equipped with unbounded
the studied 2-way FSTs are not readily invertible, memory. In theory, FSBMs with a bounded buffer
how to get the inverse relation ww → w remains would be as expressive as an FSRA and therefore
an open question, as acknowledged in Dolatian and can be converted to an FSA.
Heinz (2020). Although this paper does not directly The buffer interacts with the input in restricted
address this morphological analysis problem, rec- ways: 1) the buffer is queue-like; 2) the buffer
ognizing which strings are reduplicated and belong needs to work on the same alphabet as the input,
to Lww or any other copying languages may be an unlike the stack in a pushdown automata (PDA),
important first step.3 for example; 3) once one symbol is removed from
As for the theoretical aspects, there are some the buffer, everything else must also be wiped off
attested forms of meaning-free reduplication in before the buffer is available for other symbol ad-
natural languages.Zuraw (2002) proposes aggres- dition. These restrictions together ensure the ma-
sive reduplication in phonology: speakers are chine does not generate string reversals or other
sensitive to phonological similarity between sub- non-reduplicative non-regular patterns.
strings within words and reduplication-like struc- There are three possible modes for an FSBM M
tures are attributed to those words. It is still ar- when processing an input: 1) in normal (N) mode,
guable whether those meaning-free reduplicative M reads symbols and transits between states, func-
patterns of unbounded strings are generated via a tioning as a normal FSA; 2) in buffering (B) mode,
morphological function or not. Overall, it is de- besides consuming symbols from the input and tak-
sirable to have models that help to detect the sub- ing transitions among states, it adds a copy of just-
string identity within surface strings when those read symbols to the queue-like buffer, until it exits
sub-strings are in the regular set. buffering (B) mode; 3) after exiting buffering (B)
This paper introduces a new computational de- mode, M enters emptying (E) mode, in which M
vice: finite state buffered machine (FSBMs). They matches the stored symbols in the buffer against in-
put symbols. When all buffered symbols have been
2
Some other examples, pursuing more linguistically sound matched, M switches back to normal (N) mode for
and computationally efficient finite state techniques, are
Walther (2000), Beesley and Karttunen (2000) and Hulden another round of computation. Under the current
(2009). However, they fail to model unbounded copying. augmentation, FSBMs can only capture local redu-
Roark and Sproat (2007), Cohen-Sygal and Wintner (2006) plication with two adjacent, completely identical
and Dolatian and Heinz (2020) provide more comprehensive
reviews. copies. It cannot handle non-local reduplication,
3
Thanks to the reviewer for bringing this point up. nor multiple reduplication.

a
Definition 1. A Finite-State Buffered Machine a

(FSBM) is a 7-tuple hΣ, Q, I, F, G, H, δi where  


Start q1 q2 q4 Accept
• Q: a finite set of states
b
b
• I ⊆ Q: initial states
Figure 3: An FSBM M1 with G = {q1 } (diamond) and
• F ⊆ Q: final states H = {q3 } (square); dashed arcs are used only for the
emptying process. L(M1 ) = {ww|w ∈ {a, b}∗ }
• G ⊆ Q: states where the machine must enter
buffering (B) mode a, b
a b
• H ⊆ Q: states visited while the machine is
a b 
emptying the buffer Start q1 q2 q3 q4 Accept

• G∩H =∅ Figure 4: An FSBM M2 with G = {q1 } and H = {q4 }.


• δ: Q × (Σ ∪ {}) × Q: the state transitions L(M2 ) = {ai bj ai bj |i, j ≥ 1}
according to a specific symbol
Specifying G and H states allows an FSBM to Definition 4. A run of FSBM M on w is a se-
control what portions of a string are copied. To quence of configurations D0 , D1 , D2 . . . Dm such
avoid complications, G and H are defined to be that 1): ∃ q0 ∈ I, D0 = (w, q0 , , N); 2): ∃ qf ∈ F ,
disjoint. In addition, states in H identify certain Dm = (, qf , , N); 3): ∀ 0 ≤ i < m, Di `M Di+1 .
special transitions. Transitions between two H The language recognized by an FSBM M is de-
states check input-memory identity and consume noted by L(M ). w ∈ L(M ) iff there’s a run of M
symbols in both the input and the buffer. By con- on w.
trast, transitions with at least one state not in H can
2.2 Examples
be viewed as normal FSA transitions. In all, there
are effectively two types of transitions in δ. In all illustrations, G states are drawn with dia-
monds and H states are drawn with squares. The
Definition 2. A configuration of an FSBM D =
special transitions between H states are dashed.
(u, q, v, t) ∈ Σ∗ × Q × Σ∗ × {N, B, E}, where u is
the input string; v is the string in the buffer; q is the Example 1. Total reduplication Figure 3 offers
current state and t is the current mode the machine an FSBM M1 for Lww , with any arbitrary strings
is in. made out of an alphabet Σ = {a, b} as candidates
Definition 3. Given an FSBM M and x ∈ (Σ ∪ of bases.
{}), u, w, v ∈ Σ∗ , we define that a configuration Lww is the simplest representation of unbounded
D1 yields a configuration D2 in M (D1 `M D2 ) copying, but this language is somewhat structurally
as the smallest relation such that: 4 dull. For the rest of the illustration, we focus on
the FSBM M2 in Figure 4. M2 recognizes the non-
• For every transition (q1 , x, q2 ) with at least context free {ai bj ai bj |i, j ≥ 1}. This language
one state of q1 , q2 ∈
/H can be viewed as total reduplication added to the
(xu, q1 , , N) `M (u, q2 , , N) with q1 ∈
/G regular language {ai bj |i, j ≥ 1} (recognized by
(xu, q1 , v, B) `M (u, q2 , vx, B) with q2 ∈/G the FSA M0 in Figure 5).
State q1 is an initial state and more importantly a
• For every transition (q1 , x, q2 ) and q1 , q2 ∈ H
G state, forcing M2 to enter B mode before it takes
(xu, q1 , xv, E) `M (u, q2 , v, E)
any arcs and transits to other states. Then, M2 in
• For every q ∈ G B mode always keeps a copy of consumed input
(u, q, , N) `M (u, q, , B)
a b
• For every q ∈ H
(u, q, v, B) `M (u, q, v, E) a b
Start q1 q2 q3 Accept
(u, q, , E) `M (u, q, , N)
4
Note that a machine cannot do both symbol consumption
and mode changing at the same time. Figure 5: An FSA M0 with L(M0 )= {ai bj |i, j ≥ 1}

240
symbols until it proceeds to q4 , an H state. State mode. Hence, to go through full cycles of mode
q4 requires M2 to stop buffering and switch to E changes, once M reaches a G state and switches
mode in order to check for string identity. Using the to B mode, it has to encounter some H states later
special transitions between H states (in this case, to be put in E mode. To allow us to only reason
a and b loops on State q4 ), M2 checks whether the about only the “useful” arrangements of G and H
stored symbols in the buffer matches the remaining states, we impose an ordering requirement on G
input. If so, after emitting out all symbols in the and H states along a path in a machine and define
buffer, M2 with a blank buffer can switch to N a complete path.
mode. It eventually ends at State q4 , a legal final
Definition 5. A path s from an initial state to a
state. Figure 6 gives a complete run of M2 on the
final state in a machine is said to be complete if
string “abbabb”. Figure 7 shows M2 rejects the
non-total reduplicated string “ababb” since a final 1. for one H state in s, there is always a preced-
configuration cannot be reached. ing G state;
Example 3. Partial reduplication Assume Σ = 2. once one G state is in s, s must contain must
{b, t, k, ng, l, i, a}, the FSBM M3 in Figure 8 contain at least one H following that G state
serves as a model of two Agta CVC reduplicated
plurals in Table 1. 3. in between G and the first H are only plain
Given the initial state q1 is in G, M3 has to enter states.
B mode before it takes any transitions. In B mode,
M3 transits to a plain state q2 , consuming an input Schematically, with P representing those non-G,
consonant and keeping it in the buffer. Similarly, non-H plain states and I, F representing initial,
M3 transits to a plain state q3 and then to q4 . When final states respectively, the regular expression de-
M3 first reaches q4 , the buffer would contain a noting the state information in a path s should be
CVC sequence. q4 , an H state, urges M3 to stop of the form: I(P ∗ GP ∗ HH ∗ P ∗ | P ∗ )∗ F .
buffering and enter E mode. Using the special Definition 6. A complete-path finite state
transitions between H states (in this case, loops buffered machine is an FSBM in which all possible
on q4 ), M3 matches the CVC in the buffer with the paths are complete.
remaining input. Then, M3 with a blank buffer can
switch to N mode at q4 . M3 in N mode loses the Example FSBMs we provide so far (Figure 3,
access to loops on q4 , as they are available only Figure 4 and in Figure 8) are complete-path FSBMs.
in E mode. It transits to q5 to process the rest of For the rest of this section, we describe several
the input by the normal transitions between q5 . A cases of an incomplete path in a machine M .
successful run should end at q5 , the only final state.
Figure 9 gives a complete run of M3 on the string No H states When a G state does not have any
“taktakki”. reachable H state following it, there is no complete
run, since M always stays in B mode.
2.3 Complete-path FSBMs
No H states in between two G states When a G
As shown in the definitions and the examples above,
state q0 has to transit to another G state q00 before
an FSBM is supposed to end in N mode to process
any H states, M cannot go to q00 , for M would
an input. There are two possible scenarios for a run
enter B mode at q0 while transiting to another G
to meet this requirement: either never entering B
state in B mode is ill-defined.
mode or undergoing full cycles of N , B , E , N mode
changes. The corresponding languages reflect ei- H states first When M has to follow a path con-
ther no copying (functioning as plain FSAs) or full taining two consecutive H states before any G state,
copying, respectively. it would clash in the end, because the transitions
In any specific run, it is the states that inform an among two H states can only be used in E mode.
FSBM M of its modality. The first time M reaches However, it is impossible to enter E mode without
a G state, it has to enter B mode and keeps buffering entering B mode enforced by some G states.
when it transits between plain states. The first time It should be emphasized that M in N mode can
when it reaches an H state, M is supposed to enter pass through one (and only one) H state to another
E mode and transit only between H states in E plain state. For instance, the language of the FSBM

241
Used Arc State Info Configuration
1. N/A q1 ∈ I (abbabb, q1 , , N)
2. N/A q1 ∈ G (abbabb, q1 , , B) Buffering triggered by q1 and empty buffer
3. (q1 , a, q2 ) q2 ∈
/G (bbabb, q2 , a, B)
4. (q2 , b, q3 ) (babb, q3 , ab, B)
5. (q3 , b, q3 ) (abb, q3 , abb, B)
6. (q3 , , q4 ) (abb, q4 , abb, B) Emptying triggered by q4
7. N/A (abb, q4 , abb, E)
8. (q4 , a, q4 ) (bb, q4 , bb, E)
9. (q4 , b, q4 ) (b, q4 , b, E)
10. (q4 , b, q4 ) q4 ∈ H (, q4 , , E) Normal triggered by q4 and empty buffer
11. N/A q4 ∈ F (, q4 , , N)

Figure 6: M2 in Figure 4 accepts abbabb

Used Arc State Info Configuration


1. N/A q1 ∈ I (ababb, q1 , , N)
2. N/A q1 ∈ G (ababb, q1 , , B) Buffering triggered by q1 and empty buffer
3. (q1 , a, q2 ) q2 ∈
/G (babb, q2 , a, B)
4. (q2 , b, q3 ) q3 ∈ H (abb, q3 , ab, B)
6. (q3 , , q4 ) (abb, q4 , ab, B) Emptying triggered by q4
5. N/A (abb, q4 , ab, E)
6. (q4 , a, q4 ) (bb, q4 , b, E)
7. (q4 , b, q4 ) q4 ∈ H (b, q4 , , E) Normal triggered by q4 and empty buffer
8. N/A (b, q4 , , N)
Clash

Figure 7: M2 in Figure 4 rejects ababb

i, a b, t, k, ng, l, i, a

b, t, k, ng, l i, a b, t, k, ng, l 
Start q1 q2 q3 q4 q5 Accept

b, t, k, ng, l

Figure 8: An FSBM M3 for Agta CVC-reduplicated plurals: G = {q1 } and H = {q4 }

Used Arc State Info Configuration


1. N/A q1 ∈ G (taktakki, q1 , , N) Buffering triggered by q1 and empty buffer
2. N/A (taktakki, q1 , , B)
3. (q1 , t, q2 ) q2 ∈
/G (aktakki, q2 , t, B)
4. (q2 , a, q3 ) (ktakki, q3 , ta, B)
5. (q3 , k, q4 ) q4 ∈ H (takki, q4 , tak, B) Emptying triggered by q4
6. N/A (takki, q4 , tak, E)
7. (q4 , t, q4 ) (akki, q4 , ak, E)
8. (q4 , a, q4 ) (kki, q4 , k, E)
9. (q4 , k, q4 ) q4 ∈ H (ki, q4 , , E) Normal triggered by q4 and empty buffer
10. N/A (ki, q4 , , N)
11. (q4 , , q5 ) (ki, q5 , , N)
12. (q5 , k, q5 ) (i, q5 , , N)
13. (q5 , i, q5 ) q5 ∈ F (, q5 , , N)

Figure 9: M3 in Figure 8 accepts taktakki

242
a b
3.1 Intersection with FSAs
Start q1 a q2 b q3 b q4 a q5 Accept

Figure 10: An incomplete FSBM M4 with G = ∅ and Theorem 1. If L1 is a complete-path FSBM-


H = {q2 , q4 }; L(M4 ) = {abba} recognizable language and L2 is a regular
language, then L1 ∩ L2 is a complete-path
Start q1 a q2 b q3 b q4 a q5 Accept FSBM-recognizable language.

Figure 11: An FSA (or an FSBM with G = ∅ and H =


∅) whose language is equivalent as M4 in Figure 10 In other words, if L1 is a language rec-
ognized by a complete-path FSBM M1 =
hQ1 , Σ, I1 , F1 , G1 , H1 , δ1 i, and L2 is a language
M4 in Figure 10 is equivalent to the language rec-
recognized by an FSA M2 = hQ2 , Σ, I2 , F2 , δ2 i,
ognized by the FSA in Figure 11. M4 remains to
then L1 ∩ L2 is a language recognizable by an-
be an incomplete FSBM because it doesn’t have
other complete-path FSBM. It is easy to con-
any G state preceding the H states q2 and q4 .
struct an intersection machine M where M =
The languages recognized by complete-path FS-
hQ, Σ, I, F, G, H, δi with 1) Q = Q1 × Q2 ; 2)
BMs are precisely the languages recognized by
I = I1 × I2 ; 3) F = F1 × F2 ; 4) G = G1 × Q2 ;
general FSBMs. One key observation is the lan-
5) H = H1 × Q2 ; 6) ((q1 , q10 ), x, (q2 , q20 )) ∈ δ iff
guage recognized by the new machine is the union
(q1 , x, q2 ) ∈ δ1 and (q10 , x, q20 ) ∈ δ2 . Paths in M
of the languages along all possible paths. Then, the
would inherit the completeness from M1 given the
validity of such a statement builds on different in-
current construction. Then, L(M ) = L1 ∩ L2 , as
complete cases of G and H states along a path: they
M simulates L1 ∩ L2 by running M1 and M2 si-
either recognize the empty-set language or show
multaneously. M accepts w if and only if both M1
equivalence to finite state machines. Therefore, the
and M2 accept w.
language along an incomplete path of the machine
is still in the regular set. Only a complete path
In nature, FSAs can be viewed as FSBMs with-
containing at least one well-arranged G . . . HH ∗
out copying: they can be converted to an FSBM
sequence uses the copying power and extends the
with an empty G set, an empty H set and trivially
regular languages. Therefore, in the next section,
no special transitions between H states.
we focus on complete-path FSBMs.
That FSBM-recognizable languages are closed
3 Some closure properties of FSBMs
under intersection with regular languages is of great
In this section, we show some closure properties of relevance to phonological theory: assume a natu-
complete-path FSBM-recognizable languages and ral language X imposes backness vowel harmony,
their linguistic relevance. Section 3.1 discusses its which can be modeled by an FSA MV H . In ad-
closure under intersection with regular languages; dition, this language also requires phonological
Section 3.2 shows it is closed under homomor- strings of certain forms to be reduplicated, which
phism; Section 3.3 briefly mentions union, con- can be modeled by an FSBM MRED . One hereby
catenation, Kleene star. These operations are of can construct another FSBM MRED+V H to en-
special interests because they are regular opera- force both backness vowel harmony and the total
tions defining regular expressions (Sipser, 2013, identity of sub-strings in those forms. Not lim-
64). That complete-path FSBMs are closed under ited to harmony systems, phonotactics other than
regular operations leads to a conjecture that the identity of sub-strings are regular (Heinz, 2018),
set of languages recognized by the new automata indicating almost all phonological markedness con-
is equivalent to the set of languages denoted by a straints can be modeled by FSAs. When FSBMs in-
version of regular expression with copying added. tersect with FSAs computing those phonotactic re-
Noticeably, given FSBMs are FSAs with a copy- strictions, the resulting formalism is still an FSBM
ing mechanism, the proof ideas in this section but not other grammar with higher computational
are similar to the corresponding proofs for FSAs, power. Thus, FSBMs can model natural language
which can be found in Hopcroft and Ullman (1979) phonotactics once including recognizing surface
and Sipser (2013). sub-string identity.

243
V C, V
recognized by complete-path FSBMs is not
Start q1 C q2 V q3 C q4  q5 Accept
closed under inverse alphabetic homomorphisms
and thus inverse homomorphism. Consider a
C complete-path FSBM-recognizable language L =
{ai bj ai bj | i, j ≥ 1} (cf. Figure 4). Consider an
Figure 12: An FSBM M5 on the alphabet {C, V } such
alphabetic homomorphism h : {0, 1, 2} → {a, b}∗
that L(M5 ) = h(L(M3 )) with M3 in Figure 8
such that h(0) = a, h(1) = a and h(2) = b. Then,
h−1 (L) = {(0|1)i 2j (0|1)i 2j | i, j ≥ 1} seems to
3.2 Homomorphism and inverse alphabetic be challenging for FSBMs. Finite state machines
homomorphism cannot handle the incurred crossing dependencies
while the augmented copying mechanism only con-
Definition 7. A (string) homomorphism is a func-
tributes to recognizing identical copies, but not
tion mapping one alphabet to strings of another
general cases of symbol correspondence. 5
alphabet, written h : Σ → ∆∗ . We can extend h
to operate on strings over Σ∗ such that 1) h(Σ ) 3.3 Other closure properties
= ∆ ; 2) ∀a ∈ Σ, h(a) ∈ ∆∗ ; 3) for w =
Union Assume there are complete-path FSBMs
a1 a2 . . . an ∈ Σ∗ , h(w) = h(a1 )h(a2 ) . . . h(an )
M1 and M2 such that L(M1 ) = L1 and L(M2 ) =
where each ai ∈ Σ. An alphabetic homomorphism
L2 , then L1 ∪ L2 is a complete-path FSBM-
h0 is a special homomorphism with h0 : Σ → ∆.
recognizable language. One can construct a new
Definition 8. Given a homomorphism h: Σ → machine M that accepts an input w if either M1
∆∗ and L1 ⊆ Σ∗ , L2 ⊆ ∆∗ , define h(L1 ) or M2 accepts w. The construction of M keeps
= {h(w) | w ∈ L1 } ⊆ ∆∗ and h−1 (L2 ) = M1 and M2 unchanged, but adds a new plain state
{w | h(w) = v ∈ L2 } ⊆ Σ∗ . q0 . Now, q0 becomes the only initial state, branch-
ing into those previous initial states in M1 and M2
Theorem 2. The set of complete-path FSBM- with -arcs. In this way, the new machine would
recognizable languages is closed under homomor- guess on either M1 or M2 accepts the input. If one
phisms. accepts w, M will accept w, too.
Concatenation Assume there are complete-path
Theorem 2. can be proved by constructing a
FSBMs M1 and M2 such that L(M1 ) = L1 and
new machine Mh based on M . The informal in-
L(M2 ) = L2 , then there is a complete-path FSBM
tuition goes as follows: relabel the odd arcs to
M that can recognize L1 ◦ L2 by normal concate-
mapped strings and add states to split the arcs so
nation of two automata. The new machine adds
that there is only one symbol or  on each arc in Mh .
a new plain state q0 and makes q0 the only initial
When there are multiple symbols on normal arcs,
state, branching into those previous initial states
the newly added states can only be plain non-G,
in M1 with -arcs. All final states in M2 are the
non-H states. For multiple symbols on the special
only final states in M . Besides, the new machine
arcs between two H states, the newly added states
adds -arcs from any old final states in M1 to any
must be H states. Again, under this construction,
possible initial states in M2 . A path in the resulting
complete paths in M lead to newly constructed
machine is guaranteed to be complete because it
complete paths in Mh .
is essentially the concatenation of two complete
The fact that complete-path FSBMs guarantee
paths.
the closure under homomoprhism allows theorists
to perform analyses at certain levels of abstraction Kleene Star Assume there is a complete-path
of certain symbol representations. Consider two al- FSBM M1 such that L(M1 ) = L1 , L∗1 is a
phabets Σ = {b, t, k, ng, l, i, a} and ∆ = {C, V } complete-path FSBM-recognizable language. A
with a homomorphism h mapping every consonant new automaton M is similar to M1 with a new ini-
(b, t, k, ng, l) to C and mapping every vowel (i, a) tial state q0 . q0 is also a final state, branching into
to V . As illustrated by M3 on alphabet Σ (Fig- 5
The statement on the inverse homomorphism closure is
ure 8) and M5 on alphabet ∆ (Figure 12), FSBM- left as a conjecture. We admit that a more formal and rigor-
definable patterns on Σ would be another FSBM- ous mathematical proof proving h−1 (L) is not complete-path
FSBM-recognizable should be conducted. To achieve this
definable patterns on ∆. goal, a more formal tool, such as a developed pumping lemma
We conjecture that the set of languages for the corresponding set of languages, is important.

244
old initial states in M1 . In this way, M accepts the tive identity requirement by complete-path FSBMs
empty string . q0 is never a G state nor an H state. without using the full power of mildly context sen-
Moreover, to make sure M can jump back to an sitive formalisms. To achieve this goal, future work
initial state after it hits a final state, -arcs from any should consider developing an efficient algorithm
final state to any old initial states are added. that intersects complete-path FSBMs with weighted
FSAs.
4 Discussion and conclusion The present paper is the first step to recognize
reduplicated forms in adequate yet more restric-
In summary, this paper provides a new computa- tive models and techniques compared to MCS
tional device to compute unrestricted total redu- formalisms. There are some limitations of the
plication on any regular languages, including the current approach on the whole typology of redu-
simplest copying language Lww where w can be plication. Complete-path FSBMs can only cap-
any arbitrary string of an alphabet. As a result, it ture local reduplication with two adjacent identical
introduces a new class of languages incomparable copies. As for non-local reduplication, the modi-
to CFLs. This class of languages allows unbounded fication should be straightforward: the machines
copying without generating non-reduplicative non- need to allow the filled buffer in N mode (or in
regular patterns: we hypothesize context-free string another newly-defined memory holding mode) and
reversals are excluded since the buffer is queue-like. match strings only when needed. As for multi-
Meanwhile, the MCS Swiss-German cross-serial ple reduplication, complete-path FSBMs can eas-
dependencies, abstracted as {ai bj ci dj |i, j ≥ 1}, ily be modified to include multiple copies of the
is also excluded, since the buffer works on the same same base form ({wn | w ∈ Σ∗ , n ∈ N}) but
alphabet as the input tape and only matches identi- cannot be easily modified to recognize the non-
cal sub-strings. semilinear language containing copies of the copy
Following the sub-classes of 2-way FSTs in n
({w2 | w ∈ Σ∗ , n ∈ N}). It remains to be an open
Dolatian and Heinz (2018a,b, 2019, 2020), which question on the computational nature of multiple
successfully capture unbounded copying as func- reduplication. Last but not the least, as a reviewer
tions while exclude the mirror image mapping, points out, recognizing non-identical copies can
complete-path FSBMs successfully capture the be achieved by either storing or emptying not ex-
total-reduplicated stringsets while exclude string actly the same input symbols, but mapped sym-
reversals. Comparison between the characterized bols according to some function f . Under this
languages in this paper and the image of functions modification, the new automata would recognize
in Dolatian and Heinz (2020) should be further car- {an bn | n ∈ N} with f (a) = b but still exclude
ried out to build the connection. Moreover, one string reversals. In all, detailed investigations on
natural next step is to extend FSBMs as acceptors how to modify complete-path FSBMs should be
to finite state buffered transducers (FSBT). Our the next step to complete the typology.
intuition is FSBTs would be helpful in handling
the morphological analysis question (ww → w), Acknowledgments
a not-yet solved problem in the 2-way FSTs that
The author would like to thank Tim Hunter, Bruce
Dolatian and Heinz (2020) study. After reading the
Hayes, Dylan Bumford, Kie Zuraw, and the mem-
first w in input and buffering this chunk of string
bers of the UCLA Phonology Seminar for their
in the memory, the transducer can output  for each
feedback and suggestions. Special thanks to the
matched symbol when transiting among H states.
anonymous reviewers for their constructive com-
Another potential area of research is applying ments and discussions. All errors remain my own.
this new machinery to Primitive Optimality Theory
(Eisner, 1997; Albro, 1998). Albro (2000, 2005)
used weighted finite state machine to model con- References
straints while represented the set of candidates by Daniel M Albro. 1998. Evaluation, implementation,
Multiple Context Free Grammars to enforce base- and extension of primitive optimality theory. Mas-
reduplicant correspondence (McCarthy and Prince, ter’s thesis, UCLA.
1995). Parallel to Albro’s way, given complete- Daniel M. Albro. 2000. Taking primitive Optimality
path FSBMs are intersectable with FSAs, it is pos- Theory beyond the finite state. In Proceedings of the
sible to computationally implement the reduplica- Fifth Workshop of the ACL Special Interest Group

245
in Computational Phonology, pages 57–67, Centre Hossep Dolatian and Jeffrey Heinz. 2019. RedTyp: A
Universitaire, Luxembourg. International Commit- database of reduplication with computational mod-
tee on Computational Linguistics. els. In Proceedings of the Society for Computation
in Linguistics (SCiL) 2019, pages 8–18.
Daniel M Albro. 2005. Studies in computational op-
timality theory, with special reference to the phono- Hossep Dolatian and Jeffrey Heinz. 2020. Comput-
logical system of Malagasy. Ph.D. thesis, University ing and classifying reduplication with 2-way finite-
of California, Los Angeles, Los Angeles. state transducers. Journal of Language Modelling,
8(1):179–250.
Bruce Bagemihl. 1989. The crossing constraint and
‘backwards languages’. Natural language & linguis- Jason Eisner. 1997. Efficient generation in primitive
tic Theory, 7(4):481–549. Optimality Theory. In 35th Annual Meeting of the
Association for Computational Linguistics and 8th
Félix Baschenis, Olivier Gauwin, Anca Muscholl, and Conference of the European Chapter of the Associa-
Gabriele Puppis. 2017. Untwisting two-way trans- tion for Computational Linguistics, pages 313–320,
ducers in elementary time. In 2017 32nd Annual Madrid, Spain. Association for Computational Lin-
ACM/IEEE Symposium on Logic in Computer Sci- guistics.
ence (LICS), pages 1–12.
Gerald Gazdar and Geoffrey K Pullum. 1985. Com-
Kenneth R. Beesley and Lauri Karttunen. 2000. Finite- putationally relevant properties of natural languages
state non-concatenative morphotactics. In Proceed- and their grammars. New generation computing,
ings of the 38th Annual Meeting of the Associa- 3(3):273–306.
tion for Computational Linguistics, pages 191–198,
Hong Kong. Association for Computational Linguis- Thomas Graf. 2017. The power of locality domains in
tics. phonology. Phonology, 34(2):385–405.

Jane Chandlee. 2014. Strictly local phonological pro- Phyllis M. Healey. 1960. An Agta Grammar. Bureau
cesses. Ph.D. thesis, University of Delaware. of Printing, Manila.

Jeffrey Heinz. 2007. The Inductive Learning of Phono-


Jane Chandlee. 2017. Computational locality in mor-
tactic Patterns. Ph.D. thesis, University of Califor-
phological maps. Morphology, 27:599–641.
nia, Los Angeles.
Noam Chomsky. 1956. Three models for the descrip-
Jeffrey Heinz. 2018. The computational nature of
tion of language. IRE Trans. Inf. Theory, 2:113–124.
phonological generalizations. In Larry Hyman and
Alexander Clark and Ryo Yoshinaka. 2014. Distri- Frans Plank, editors, Phonological Typology, Pho-
butional learning of parallel multiple context-free netics and Phonology, chapter 5, pages 126–195. De
grammars. Mach. Learn., 96(1–2):5–31. Gruyter Mouton.

Jeffrey Heinz, Chetan Rawal, and Herbert G Tan-


Yael Cohen-Sygal and Shuly Wintner. 2006. Finite-
ner. 2011. Tier-based strictly local constraints for
state registered automata for non-concatenative mor-
phonology. In Proceedings of the 49th Annual Meet-
phology. Computational Linguistics, 32(1):49–82.
ing of the Association for Computational Linguistics:
Christopher Culy. 1985. The complexity of the vo- Human language technologies, pages 58–64.
cabulary of bambara. Linguistics and philosophy, John E Hopcroft and Jeffrey D Ullman. 1979. Introduc-
8(3):345–351. tion to automata theory, languages, and computation.
Addison-Welsey, NY.
Robert M. W. Dixon. 1972. The Dyirbal Language of
North Queensland, volume 9 of Cambridge Studies Mans Hulden. 2009. Finite-state Machine Construc-
in Linguistics. Cambridge University Press, Cam- tion Methods and Algorithms for Phonology and
bridge. Morphology. Ph.D. thesis, University of Arizona,
Tucson, USA.
Hossep Dolatian and Jeffrey Heinz. 2018a. Learning
reduplication with 2-way finite-state transducers. In Sharon Inkelas and Cheryl Zoll. 2005. Reduplication:
Proceedings of the 14th International Conference Doubling in morphology, volume 106. Cambridge
on Grammatical Inference, volume 93 of Proceed- University Press.
ings of Machine Learning Research, pages 67–80.
PMLR. Gerhard Jäger and James Rogers. 2012. Formal
language theory: refining the chomsky hierarchy.
Hossep Dolatian and Jeffrey Heinz. 2018b. Modeling Philosophical Transactions of the Royal Society B:
reduplication with 2-way finite-state transducers. In Biological Sciences, 367(1598):1956–1970.
Proceedings of the Fifteenth Workshop on Computa-
tional Research in Phonetics, Phonology, and Mor- C. Douglas Johnson. 1972. Formal Aspects of Phono-
phology, pages 66–77, Brussels, Belgium. Associa- logical Description. Monographs on linguistic anal-
tion for Computational Linguistics. ysis. Mouton, The Hague.

246
Aravind K. Joshi. 1985. Tree adjoining grammars:
How much context-sensitivity is required to provide
reasonable structural descriptions?, Studies in Natu-
ral Language Processing, page 206–250. Cambridge
University Press.
Ronald M. Kaplan and Martin Kay. 1994. Regular
models of phonological rule systems. Comput. Lin-
guist., 20(3):331–378.
Alec Marantz. 1982. Re reduplication. Linguistic in-
quiry, 13(3):435–482.
John J. McCarthy and Alan S. Prince. 1995. Faithful-
ness and reduplicative identity. In Jill N. Beckman,
Laura Walsh Dickey, and Suzanne Urbanczyk, edi-
tors, Papers in Optimality Theory. GLSA (Graduate
Linguistic Student Association), Dept. of Linguis-
tics, University of Massachusetts, Amherst, MA.
Robert McNaughton and Seymour A Papert. 1971.
Counter-Free Automata (MIT research monograph
no. 65). The MIT Press.
Brian Roark and Richard Sproat. 2007. Computational
approaches to morphology and syntax, volume 4.
Oxford University Press.
Carl Rubino. 2013. Reduplication. In Matthew S.
Dryer and Martin Haspelmath, editors, The World
Atlas of Language Structures Online. Max Planck In-
stitute for Evolutionary Anthropology, Leipzig.
Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii,
and Tadao Kasami. 1991. On multiple context-
free grammars. Theoretical Computer Science,
88(2):191–229.
Stuart M Shieber. 1985. Evidence against the context-
freeness of natural language. In Philosophy, Lan-
guage, and Artificial Intelligence, pages 79–89.
Springer.
Imre Simon. 1975. Piecewise testable events. In Au-
tomata Theory and Formal Languages, pages 214–
222, Berlin, Heidelberg. Springer Berlin Heidelberg.
Michael Sipser. 2013. Introduction to the Theory of
Computation, third edition. Course Technology,
Boston, MA.
Edward Stabler. 1997. Derivational minimalism. In
Logical Aspects of Computational Linguistics, pages
68–95, Berlin, Heidelberg. Springer Berlin Heidel-
berg.
Markus Walther. 2000. Finite-state reduplication in
one-level prosodic morphology. In 1st Meeting of
the North American Chapter of the Association for
Computational Linguistics.
Kie Zuraw. 2002. Aggressive reduplication. Phonol-
ogy, 19(3):395–439.

247
An FST morphological analyzer for the Gitksan language

Clarissa Forbesα Garrett Nicolaiβ Miikka Silfverbergβ


α
University of Arizona β
University of British Columbia
[email protected] [email protected]

Abstract Our work has three central goals: (1) We want


to build a flexible morphological analyzer to sup-
This paper presents a finite-state morpholog- plement lexical and textual resources in support of
ical analyzer for the Gitksan language. The language learning. Such an analyzer can support
analyzer draws from a 1250-token Eastern di-
learners in identifying the base-form of inflected
alect wordlist. It is based on finite-state tech-
nology and additionally includes two exten-
words where the morpheme-to-word ratio might be
sions which can provide analyses for out-of- particularly high, in a way not addressed by a tradi-
vocabulary words: rules for generating pre- tional dictionary. It may also productively generate
dictable dialect variants, and a neural guesser inflected forms of words. (2) We want to facilitate
component. The pre-neural analyzer, tested ongoing efforts to expand the aforementioned 1250
against interlinear-annotated texts from multi- token wordlist into a broad-coverage dictionary of
ple dialects, achieves coverage of (75-81%), the Gitksan language. Running our analyzer on
and maintains high precision (95-100%). The
Gitksan texts, we can rapidly identify word forms
neural extension improves coverage at the cost
of lowered precision. whose base-form has not yet been documented. An
analyzer can also help automate the process of iden-
1 Introduction tifying sample sentences for dictionary words, the
addition of which substantially increases the value
Endangered languages of the Americas are typi- of the dictionary. (3) We want to use the model to
cally underdocumented and underresourced. Com- further our understanding of Gitksan morphology.
putational tools like morphological analyzers Unanalyzeable and erroneously analyzed forms can
present the opportunity to speed up ongoing docu- help us identify shortcomings in our description of
mentation efforts by enabling automatic and semi- the morphological system and can thus feed back
automatic data analysis. This paper describes the into the documentation effort of the language.
development of a morphological analyzer for Gitk- The Gitksan-speaking community recognizes
san, an endangered Indigenous language of West- two dialects: Eastern (Upriver) and Western
ern Canada. The analyzer is capable of providing (Downriver). Our analyzer is based on resources
the base form and morphosyntactic description of which mainly represent the Eastern dialect. Con-
inflected word forms: a word gupdiit ‘they ate’ is sequently, our base analyzer achieves higher cov-
annotated gup-TR-3PL. erage of 71% for the Eastern dialect as measured
Our Gitksan analyzer is based on two core on a manually annotated test set. For the West-
documentary resources: a wordlist spanning ap- ern dialect, coverage is lower at 53%. In order
proximately 1250 tokens, and an 18,000 token to improve coverage on the Western variety, we
interlinear-annotated text collection. Due to the explore two extensions to our analyzer. First, we
scarcity of available lexical and corpus resources, implement a number of dialectal relaxation rules
we take a rule-based approach to modeling of mor- which model the orthographic variation between
phology which is less dependent on large datasets Eastern and Western dialects. This leads to siz-
than machine learning methods. Our analyzer is able improvements in coverage for the Western
based on finite-state technology (Beesley and Kart- dialect (around 9%-points on types and 6%-points
tunen, 2003) using the foma finite-state toolkit on tokens). Moreover, the precision of our ana-
(Hulden, 2009b). lyzer remains high both for the Eastern and West-

248
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 248–257
August 5, 2021. ©2021 Association for Computational Linguistics
ern dialects even after applying dialect rules. Sec- external morphology for the most complex word
ondly, we extend our FST morphological analyzer type, a transitive verb, is schematized in the tem-
by adding a data-driven neural guesser which fur- plate in Figure 2; an example word with all these
ther improves coverage both for the Eastern and slots filled would be ’naagask’otsdiitgathl ‘appar-
Western varieties. ently they cut.PL open (common noun)’
On the left edge of the stem can appear any num-
2 The Gitksan Language ber of modifying ‘proclitics’. These contribute
locative, adjectival, and manner-related informa-
The Gitxsan are one of the indigenous peoples of
tion to a noun or verb, often producing semi- or non-
British Columbia, Canada. Their traditional territo-
compositional idioms in a similar fashion to Ger-
ries consist of upwards of 50,000 square kilometers
manic particle verbs.1 It is often unclear whether
of land along the Skeena River in the BC northern
these proclitics constitute part of the root or stem,
interior. The Gitksan language is the easternmost
or if they are distinct words entirely. The ortho-
member of the Tsimshianic family, which spans the
graphic boundaries on this edge are consequently
entirety of the Skeena and Nass River watersheds
sometimes fuzzy. Sometimes clear contrasts are
to the Pacific Coast. Today, Gitksan is the most
presented, as with the sequence lax-yip ‘on-earth’:
vital Tsimshianic language, but is still critically
we see compositional lax yip ‘on the ground’ ver-
endangered with an estimated 300-850 speakers
sus lexicalized laxyip ‘land, territory’. However,
(Dunlop et al., 2018).
the boundary between compositional and idiomatic
The Tsimshianic family can be broadly under-
is not always so obvious, as in examples like (1).
stood as a dialect continuum, with each village
along these rivers speaking somewhat differently (1) a. saa-’witxw (away-come, ‘come from’)
from its neighbors up- or downstream, and the two b. k’ali-aks (upstream-water, ‘upriver’)
endpoints being mutually unintelligible. The six c. xsi-ga’a (out-see, ‘choose’)
Gitxsan villages are commonly divided into two d. luu-no’o (in-hole, ‘annihilate’)
dialects: East/Upriver and West/Downriver. The
dialects have some lexical and phonological dif- Inflectional morphology largely appears on the
ferences, with the most prominent being a vowel right edge of the stem. The main complexity of
shift. Consider the name of the Skeena River: Xsan, Gitksan inflection involves homophony and opac-
Ksan (Eastern) vs Ksen (Western). ity: a similar or identical wordform often has mul-
tiple possible analyses. For example, a word like
2.1 Morphological description gubin transparently involves a stem gup ‘eat’ and
The Gitksan language has strict VSO word order a 2SG suffix -n, but the intervening vowel i might
and multifunctional, fusional morphology (Rigsby, be analyzed as epenthetic, as transitive inflection
1986). It utilizes prefixation, suffixation, and both (TR), or as a specially-induced transitivizing suffix
pro- and en-cliticization. Category derivation and (T), resulting in three possible analyses in (2). Sim-
number marking are prefixal, while markers of ar- ilarly, a word gupdiit involves the same stem gup
gument structure, transitivity, and person/number ‘eat’ and a 3PL suffix -diit, but this suffix is able
agreement are suffixal. to delete preceding transitive suffixes, resulting in
The Tsimshianic languages have been described four possible analyses as in (3).
as having word-complexity similar to German (Tar-
(2) gubin
pent, 1987). The general structure of a noun or
verb stem is presented in the template in Figure a. gup-2SG
1. A stem consists of minimally a root (typically b. gup-TR-2SG
CVC); an example is monomorphemic gup ‘eat’. c. gup-T-2SG
Stems may also include derivational prefixes or (3) gupdiit
transitivity-related suffixes; compare gupxw ‘be a. gup-3PL
eaten; be edible’. b. gup-TR-3PL
In sentential context, stems are inflected for fea- c. gup-T-3PL
tures like transitivity and person/number. Our an- d. gup-T-TR-3PL
alyzer is concerned primarily with stem-external
inflection and cliticization. The structure of stem- 1
E.g. nachslagen ’look up’ in German.

249
Derivation– Proclitics– Plural– Root –Argument Structure

Figure 1: Morphological template of a complex nominal or verbal stem

Proclitics– Stem –Transitive –Person/Number =Epistemic =Next Noun Class

Figure 2: Morphological template of modification, inflection, and cliticization for a transitive verbal predicate

Running speech in Gitksan is additionally rife alects, as well as neighboring Nisga’a, with some
with clitics, which pose a more complex problem variations. Given the relatively short period that
for morphological modeling. First, there are a set this orthography has been in use, orthographic con-
of ergative ‘flexiclitics’, which are able to either ventions can vary widely across dialects and writ-
procliticize or encliticize onto a subordinator or ers. In producing this initial analyzer, we attempt to
auxiliary, or stand independently. The same combi- mitigate the issue by working with a small number
nation of host and clitic might result in sequences of more-standardized sources: the original H&R
like n=ii (1SG=and), ii=n (and=1SG), or ii na and an annotated, multidialectal text collection.
(and 1SG) (Stebbins, 2003; Forbes, 2018). We worked with a digitized version of the H&R
Second, all nouns are introduced with a noun- wordlist (Mother Tongues Dictionaries, 2020). The
class clitic that attaches to the preceding word, as original wordlist documents only the Git-an’maaxs
illustrated by the VSO sentence in (4). Here, the Eastern dialect; our version adds a small number
proper noun clitic =s attaches to the verb but is syn- of additional dialect variants, and fifteen common
tactically associated with Mary, and the common verbs and subordinators. In total, the list contains
noun clitic =hl attaches to Mary but is associated approximately 1250 lexemes and phrases, plus
with gayt ‘hat’. noted variants and plural forms.
(4) Giigwis Maryhl gayt. The analyzer was informed by descriptive work
giikw-i-t =s Mary =hl gayt on both Gitksan and its mutually intelligible neigh-
buy-TR-3. II =PN Mary =CN hat bor Nisga’a. This work details many aspects of
‘Mary bought a hat.’ Gitksan inflection, including morphological opac-
ity and the complex interactions of certain suffixes
Any word able to precede a noun phrase is a possi-
and clitics (Rigsby, 1986; Tarpent, 1987; Hunt,
ble host for one of these clitics (hence their appear-
1993; Davis, 2018; Brown et al., 2020).
ance on transitive verbs in Figure 2).
Finally, there are several sentence-final and A text collection of approximately 18,000 words
second-position clitics. whose distribution is based was also used in the development and evaluation
on prosodic rather than strictly categorial proper- of the analyzer. This collection consists of oral
ties; these attach on the right edge of subordina- narratives given by three speakers from different
tors/auxiliaries, predicates, and argument phrases, villages: Ansbayaxw (Eastern), Gijigyukwhla’a
depending on the structure of the sentence. (Western), and Git-anyaaw (Western) (cf. Forbes
A large part of Gitksan’s unique morphological et al., 2017). It includes multiple genres: personal
complexity therefore arises not in nominal or verbal anecdotes, traditional tales (ant’imahlasxw), histo-
inflection, but in the flexibility of multiple types of ries of ownership (adaawk), recipes, and explana-
clitics used in connected speech, and the logic of tions of cultural practice. The collection is fully
which possible sequences can appear with which annotated in the ‘interlinear gloss’ format with free
wordforms. translation, exemplified in (5).

2.2 Resources (5) Ii al’algaltgathl get,


ii CVC-algal-t=gat=hl get
The Gitksan community orthography was designed CCNJ PL -watch-3. II = REPORT = CN people
and documented in the Hindle and Rigsby (1973) ‘And they stood by and watched,’
wordlist (H&R). Though it originally reflected
only the single dialect of one of the authors (Git- The analyzed corpus provides insight into the use of
an’maaxs, Eastern), this orthography is in broad clitics in running speech, and is the dataset against
use today across the Gitxsan community for all di- which we test the results of the analyzer.

250
3 Related Work input word into morphemes and label each mor-
pheme with one or more grammatical tags. Very
While considering different approaches to compu-
silmilarly to the approach that we adopt, Schwartz
tational modeling of Gitksan morphology, finite-
et al. (2019) and Moeller et al. (2018) use atten-
state morphology arose as a natural choice. At the
tional LSTM encoder-decoder models to augment
present time, finite-state methods are quite widely
morphological analyzers for extending morpholog-
applied for Indigenous languages of the Americas.
ical analyzers for St. Lawrence Island / Central
Chen and Schwartz (2018) present a morpholog-
Siberian Yupik and Arapaho, respectively.
ical analyzer for St. Lawrence Island / Central
Siberian Yupik for aid in language preservation and 4 The Model
revitalization work. Strunk (2020) present another
analyzer for Central Alaskan Yupik. Snoek et al. Our morphological analyzer was designed with sev-
(2014) present a morphological analyzer for Plains eral considerations in mind. First, given the small
Cree nouns and Harrigan et al. (2017) present one amount of data at our disposal, we chose to con-
for Plains Cree verbs. Littell (2018) build a finite- struct a rule-based finite state transducer, built from
state analyzer for Kwak’wala. All of the above a predefined lexicon and morphological description.
are languages which present similar challenges to The dependence of this type of analyzer on a lexi-
the ones encountered in the case of Gitksan: word con supports one of the major goals of this project:
forms consisting of a large number of morphemes, lexical discovery from texts. Words which cannot
both prefixing and suffixing morphology and mor- be analyzed will likely be novel lemmas that have
phophonological alternations. Finite-state morphol- yet to be documented. Furthermore, the process
ogy is well-suited for dealing with these challenges. of constructing a morphological description allows
It is noteworthy that similarly to Gitksan, a number for the refinement of our understanding of Gitksan
of the aforementioned languages are also undergo- morphology and orthographic standards. For exam-
ing active documentation efforts. ple, there is a common post-stem rounding effect
While we present the first morphological ana- that generates variants such as jogat, jogot ‘those
lyzer for Gitksan which is capable of productive who live’; the project helps us identify where this
inflection, this is not the first electronic lexical re- effect occurs. Our analyzer can also later serve as a
source for the Gitksan language. Littell et al. (2017) tool to explore of the behavior of less-documented
present an electronic dictionary interface Waldayu constructions (e.g. distributive, partitive), as gram-
for endangered languages and apply it to Gitksan. matical and pedagogical resources continue to be
The model is capable of performing fuzzy dictio- developed.
nary search which is an important extension in the Our general philosophy was to take a maximal-
presence of orthographic variation which widely segmentation approach to inflection and cliticiza-
occurs in Gitksan. While this represents an impor- tion: morphemes were added individually, and in-
tant development for computational lexicography teractions between morphemes (e.g. deletion) were
for Gitksan, the method cannot model productive derived through transformational rules based on
inflection which is important particularly for lan- morphological and phonological context. Most
guage learners who might not be able to easily interactions of this kind are strictly local; there
deduce the base-form of an inflected word (Hunt are few long-distance dependencies between mor-
et al., 2019). As mentioned earlier, our model can phemes. The only exception to the minimal chunk-
analyze inflected forms of lexemes. ing rule is a specific interaction between noun-class
We extend the coverage of our finite-state an- clitics and verbal agreement: when these clitics
alyzers by incorporating a neural morphological append to verbal agreement suffixes, they either
guesser which can be used to analyze word forms agglutinate with (6-a) or delete them (6-b) depend-
which are rejected by the finite-state analyzer. Simi- ing on whether the agreement and noun-class mor-
lar mechanisms have been explored for other Amer- pheme are associated with the same noun (Tarpent,
ican Indigenous languages. Micher (2017) use 1987; Davis, 2018). That is, the conditioning factor
segmental recurrent neural networks (Kong et al., for this alternation is syntactic, not morphophono-
2015) to augment a finite-state morphological an- logical.
alyzer for Inuktitut.2 These jointly segment the
http://www.inuktitutcomputing.ca/
2
The Uquailaut morphological analyzer: Uqailaut

251
(6) Realizations of gup-i-t=hl (eat-TR-3=CN) which was listed directly in the H&R wordlist (the
a. gubithl ‘he/she ate (common noun)’ symbol ˆ marks morpheme boundaries).3 After
b. gubihl ‘(common noun) ate’ the verb, we find two inflectional suffixes and one
clitic. Ultimately, rewrite rules are used to delete
The available set of resources further constrained the transitive suffix and segmentation boundaries
our options for the analyzer’s design and our means (8).
of evaluating it. The H&R wordlist is quite small,
and of only a single dialect, while the corpus for (7) saaˆbisbisˆiˆdiitˆhl
testing was multidialectal. We therefore aimed saa+PVB-bisbis+VT-TR-3PL=CN
to produce a flexible analyzer able to recognize (8) saabisbisdiithl
orthographic variation, to maximize the value of its
small lexicon. 4.2 Analyzer iterations

4.1 FST implementation We built and evaluated four iterations of the Gitk-
san morphological analyzer based upon the foun-
Our finite-state analyzer was written in lexc dation presented in Section 4.1: the v1. Lexical
and xfst format and compiled using foma (Hulden, FST, v2. Complete FST, v3. Dialectal FST and
2009b). Finite-state analyzers like this one are v4. FST+Neural. Each iteration cumulatively ex-
constructed from a dictionary of stems, with af- pands the previous one by incorporating additional
fixes added left-to-right, and morpho-phonological vocabulary items, rules or modeling components.
rewrite rules applied to produce allomorphs and The first analyzer (v1: Lexical FST) included
contextual variation. The necessary components of only the open-class categories of verbs, nouns,
the analyzer are therefore a lexicon, a morphotac- modifiers, and adverbs which made up the bulk
tic description, and a set of morphophonological of the H&R wordlist. The main focus of the
transformations, as illustrated in Figure 3. morphotactic description was transitive inflection,
Our analyzer’s lexicon is drawn from the H&R person/number-agreement, and cliticization for
wordlist. As a first step, each stem from that list these categories. Some semi-productive argument
was assigned a lexical category to determine its structural morphemes (e.g. the passive -xw or an-
inflectional possibilities. The resulting 1506 word tipassive -asxw) were also included.
+ category pairs were imported to category-specific
The second analyzer (v2: Complete FST) in-
groups in the morphotactic description.
corporated functional and closed-class morphemes
Any of the major stem categories could be used
such as subordinators, pronouns, prepositions, quo-
to start a word; modifiers, preverbs, and prenouns
tatives, demonstratives, and aspectual particles, in-
could also be used as verb/noun prefixes. Each
cluding additional types of clitics.
categorized group flowed to a series of category-
The third analyzer (v3: Dialectal FST) further
specific sections which appended the appropriate
incorporated predictable stem-internal variation,
part of speech, and then listed various derivational
such as the vowel shift and dorsal stop lenition/-
or inflectional affixes that could be appended. A
fortition seen across dialects. In order to apply the
morphological group would terminate either with a
vowel shift in a targeted way, all items in the lex-
hard stop (#) or by flowing to a final group ‘Word’,
icon were marked for stress using the notation $.
where clitics were appended.
Parses prior to rule application now appear as in
Finally, forms were subject to a sequence
(9) (compare to (7)).
of orthographic transformations reflecting mor-
phophonological rules. Some examples included (9) s$aaˆbisb$isˆiˆdiitˆhl
the deletion of adjacent morphemes which could
not co-occur, processes of vowel epenthesis or dele- Finally, we seek to expand the coverage of the
tion, vowel coloring by rounded and back conso- analyzer through machine learning, namely neu-
nants, and prevocalic stop voicing. ral architectures (v4: FST+Neural). Our FST ar-
A sample form produced by the FST for the chitecture allows for the automatic extraction of
word saabisbisdiithl ‘they tore off (pl. common surface-analysis pairs; this enables us to create
noun)’ is in example (7). This form involves a 3
The FST has no component to productively handle redu-
preverb saa being affixed directly to a transitive plication but this would be possible to implement given a
verb bisbis, a reduplicated plural form of the verb closed lexicon Hulden (2009a, Ch. 4).

252
LEXICON N
LEXICON RootN +N: NInfl ; Deletion before -3PL:
maa’y N ; LEXICON NInfl ˆi → 0 / _ ˆdiit
smax N ; -ATTR:^m # ;
LEXICON RootVI -SX:^it Word ; Vowel insertion:
yee VI ; Agr_II ; 0 → i / C ˆ _ Sonorant #
t’aa VI ; Word ;
LEXICON RootPrenoun LEXICON Prenoun Prevocalic voicing:
lax_ Prenoun ; +PNN: # ; p,t,ts,k,k → b,d,j,g,g / _ V
(a) Lexicon +PNN: RootN ; (c) Rewrite rules
(b) Morphotactic description

Figure 3: Three main components of the FST (simplified)

a training set for the neural models. We experi- dataset (2 speakers and dialects). Token and type
ment with two alternative neural architectures - the coverage for the three FSTs is provided in Table
Hard-Attentional model over edit actions (HA) de- 1, representing the percentage of wordforms for
scribed by Makarov and Clematide (2018), and the which each analyzer was able to provide one or
transformer model (Vaswani et al., 2017), as imple- more possible parses.
mented in Fairseq (Fairseq) (Ott et al., 2019). Un-
like the FST, the neural models can extend morpho- Types Tokens
logical patterns beyond a defined series of stems, East Lexical 63.12% 54.17%
analyzing forms that the FST cannot recognize. Complete 71.10% 81.48%
For both models, we extract 10,000 random anal- Dialectal 71.10% 81.48%
ysis pairs, with replacement; early stopping for West Lexical 45.49% 38.09%
both models uses a 10% validation set extracted Complete 53.20% 70.12%
from the training, with no overlap between train- Dialectal 62.35% 75.98%
ing and validation sets (although stem overlap is Table 1: Analyzer coverage on 2000-token datasets
allowed). The best checkpoint is chosen based on
validation accuracy. The HA model uses a Chi- The effect of adding function-word coverage to
nese Restaurant Process alignment model, and is the second ‘Complete’ analyzer was broadly sim-
trained for 60 epochs, with 10 epochs patience; the ilar across dialects, increasing type coverage by
encoder and decoder both have hidden dimension about 8% and token coverage by 27-32%, demon-
200, and are trained with 50% dropout on recurrent strating the relative importance of function words
connections. The Transformer model is a 3-layer, to lexical coverage.
4-head transformer trained for 50 epochs. The en- The first two analyzers performed substantially
coders and decoders each have an embedding size better on the Eastern dataset which more closely
of 512, and feed-forward size of 1024, with 50% matched the dialect of the wordlist/lexicon. The
dropout and 30% attentional dropout. We optimize third ‘Dialectal’ analyzer incorporated four types
using Adam (0.9, 0.98), and cross-entropy with of predictable stem-internal allomorphy to generate
20% label-smoothing as our objective. Western-style variants. These transformations had
Any wordform which received no analysis from no effect on coverage for the Eastern dataset, but
the FST was provided a set of five possible analyses increased type and token coverage for the Western
each from the HA and Fairseq models. dataset by 9% and 6% respectively.

5 Evaluation 5.2 FST precision


While our analyzer manipulates vocabulary items
5.1 FST Coverage
at the level of the stem seen in the lexicon, the cor-
The analyzers were run on two 2000-token datasets pus used for evaluation is annotated to the level of
drawn from the multidialectal corpus: an Eastern the root and was not always comparable (e.g. ih-
Gitksan dataset (1 speaker), and a Western Gitksan lee’etxw ‘red’ vs ihlee’e-xw ‘blood-VAL’). Accu-

253
racy evaluation therefore had to be done manually To further understand the analyzer’s limitations,
by comparing the annotated analysis in the corpus we categorized the reasons for erroneous and miss-
to the parse options produced by the FST (10). ing analyses, listed in Table 3. In addition to the
small datasets, for which all words were checked,
(10) japhl we also evaluated the 100 most-frequent word/anal-
a. make[-3.II]=CN (Corpus) ysis pairs in the larger datasets.
b. j$ap+N=CN The majority of erroneous and absent analyses
j$ap+N-3=CN were due to the use of new lemmas not in the lexi-
j$ap+VT-3=CN (FSTv3) con, or novel variants not captured by productive
stem-alternation rules. Novel lemmas made up
We evaluated the accuracy of the Dialectal FST on
about 18% each of the small datasets, and 4-8%
two smaller datasets: 150 tokens Eastern, and 250
of the top-100 most frequent types. Some func-
tokens Western. These datasets included 85 and
tional items had specific dialectal realizations; for
180 unique wordform/annotation pairs respectively.
example, all three speakers used a different locative
The same wordform might have multiple attested
preposition (goo-, go’o-, ga’a-), only one of which
analyses, depending on its usage. The performance
was recognized.
of the Dialectal analyzer on each dataset is sum-
marized in Table 2. Precision is calculated as the There were also a few errors attributable to
percentage of word/annotation pairs for which the the morphotactic rules encoded in the parser.
analyzer produced a parse matching the context- For example, there were several instances in the
sensitive annotation in the corpus.4 Other analyses dataset of supposed ‘preverb’ modifiers combin-
produced by the FST were ignored. For example in ing with nouns (e.g. t’ip-no’o=si, sharply.down-
(10), the token would be evaluated as correct given hole=PROX, ‘this steep-sided hole’), which the
the final parse, which uses the appropriate stem parser could not recognize. This category combi-
(jap ‘make’) and matching morphology; the other nation flags the need for further documentation of
parses using a different stem (jap ‘box trap’) and/or certain ‘preverbs’. As a second example, numbers
different morphology could not qualify the token attested without agreement were not recognized
as correctly parsed. Only parsable wordforms were because the analyzer expected that they would al-
considered (i.e. English words and names are ex- ways agree. This could be fixed by updating the
cluded). morphotactic description for numbers (e.g. to more
closely match intransitive verbs).
East West
Coverage 71.76% 68.89% 5.3 FST + Neural performance
(61/85) (124/180)
The addition of the neural component signifi-
Correct parse 71.76% (61) 64.44% (116)
cantly increased the analyzer’s coverage (mean HA:
Incorrect parse 0.00% (0) 2.78% (5)
+21%, Fairseq: +17%), but at the expense of pre-
Name, English 2.5% (2) 3.33% (6)
cision (mean -15% for both). The results of the
No parse 27.5% (22) 29.44% (53)
manual accuracy evaluation are presented in Fig-
Precision 100.00% 95.87%
ure 4. There remained several forms for which the
(61/61) (116/121)
neural analyzers produced no analyses.
Table 2: Accuracy evaluation for dialectal analyzer (v3) Both analyzers performed better on the 100-
on small datasets most-frequent types datasets, where they tended
to accurately identify dialectal variants of com-
The Western dataset was larger, and consisted of mon words (e.g. t’ihlxw from tk’ihlxw ‘child’, diye
two distinct dialects, in contrast to the smaller and from diya ‘3=QUOT (third person quotative)’). In
more homogeneous Eastern dataset. Regardless, the small datasets of running text, these models
analyzer coverage between the two datasets was were occasionally able to correctly identify un-
comparable (68-72%) and precision was very high known noun and verb stems that had minimal in-
(95-100%). When this analyzer was able to provide flection. However, they struggled with identify-
a possible parse, one was almost always correct. ing categories, and often failed to identify correct
4
Note that precision is computed only on word forms inflection. These difficulties stem from category-
which received at least one analysis from the FST. flexibility and homophony in Gitksan. Nouns and

254
East West
150 tokens (22) Top-100 (17) 250 tokens (58) Top-100 (23)
New lemma 15 2 30 2
New function word 1 2 4 6
Lexical variant 3 8 6 5
Functional variant 2 3 9 9
Morphotactic error 1 2 9 1

Table 3: Categorization of erroneous and absent analyses for dialectal analyzer (FSTv3)

[Figure 4: four bar charts (East 150 tok, East top-100 tok, West 250 tokens, West top-100 tok), each showing the proportion of forms (0-1) for FST, FST+HA, and FST+FairSeq.]

Figure 4: Proportion of forms which receive the correct analysis from each of our models (indicated in blue) and the number of forms which receive only incorrect analyses from our models (indicated in red). The remaining forms received no analyses.

Nouns and verbs use the exact same inflection and clitics, making the category itself difficult to infer. Short inflectional sequences have a large number of homophonous parses, and even more differ only by a character or two.

Qualitatively, the HA model tended to produce more plausible predictions, often producing the correct stem or else a mostly-plausible analysis that could map to the same surface form, but with incorrect categories or inflection. In contrast, the Fairseq model often introduced stem changes or inflectional sequences which could not ultimately map to the surface form. Example (11) provides a sample set of incorrect predictions (surface-plausible analyses are starred).

(11) ksimaasdiit ksi+PVB-m$aas+VT-TR-3PL

a. HA model
xsim$aas+N-3PL (*)
xsim$aas+N-T-3PL (*)
xsim$aas+NUM-3PL (*?)
xsim$aast+N-T-3PL

b. Fairseq model
xsim$aast+N-3PL
xsim$aast+N=RESEM
xsim$aast+N-SX=PN
xsim$as+N-3PL

Further work can be done to improve the performance of the neural addition, such as training the model on attested tokens instead of, or in addition to, tokens randomly generated from the FST analyzer.

6 Discussion and Conclusions

The grammatically-informed FST is able to handle many of Gitksan's morphological complexities with a high degree of precision, including homophony, contextual deletion, and position-flexible clitics. The FST analyzer's patchy coverage can be attributed to its small lexicon. Unknown lexical items and variants comprised roughly 18% of each small dataset. Notably, errors and unidentified forms in the FST analyzer signal the current limits of morphotactic descriptions and lexical documentation. The analyzer can therefore serve as a useful part of a documentary linguistic workflow to quickly and systematically identify novel lexical items and grammatical rules from texts, facilitating the expansion of lexical resources. It can also be used as a pedagogical tool to identify word stems in running text, or to generate morphological exercises for language learners.
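In a documentation workflow of that kind, the two components can be chained so that the neural guesser is only consulted where the FST is silent, and every unanalyzed form is queued for the annotators. The sketch below is only an illustration of that loop under assumed interfaces (fst_analyze, neural_analyze); it is not the implementation used here.

def analyze_corpus(tokens, fst_analyze, neural_analyze):
    """Analyze running text with the FST, backing off to the neural guesser.

    tokens: list of wordforms from a text
    fst_analyze / neural_analyze: return a (possibly empty) list of analyses
    Returns (token, analyses, source) triples plus the forms the FST could
    not parse, queued for lexicon and morphotactics review.
    """
    results, needs_review = [], []
    for tok in tokens:
        analyses = fst_analyze(tok)
        if analyses:
            results.append((tok, analyses, "fst"))
            continue
        needs_review.append(tok)            # candidate new lemma or variant
        guesses = neural_analyze(tok)       # wider coverage, lower precision
        results.append((tok, guesses, "neural" if guesses else "none"))
    return results, needs_review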

The neural system, with its expanded coverage, can serve as part of a feedback system with a human in the loop, informing future iterations of the annotation process. While its precision is lower than the FST, it can still inform annotators on words that the FST does not analyze. Newly-annotated data can then be used to enlarge the FST coverage.

Acknowledgments

'Wii t'isim ha'miyaa nuu'm aloohl the fluent speakers who continue to share their knowledge with me (Barbara Sennott, Vincent Gogag, Hector Hill, Jeanne Harris), as well as the UBC Gitksan Research Lab. This research was supported by funding from the National Endowment for the Humanities (Documenting Endangered Languages Fellowship) and the Social Sciences and Humanities Research Council of Canada (Grant 430-2020-00793). Any views/findings/conclusions expressed in this publication do not necessarily reflect those of the NEH, NSF or SSHRC.

References

Kenneth R Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.

Colin Brown, Clarissa Forbes, and Michael David Schwan. 2020. Clause-type, transitivity, and the transitive vowel in Tsimshianic. In Papers of the International Conference on Salish and Neighbouring Languages 55. UBCWPL.

Emily Chen and Lane Schwartz. 2018. A morphological analyzer for St. Lawrence Island/Central Siberian Yupik. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Henry Davis. 2018. Only connect! A unified analysis of the Tsimshianic connective system. International Journal of American Linguistics, 84(4):471–511.

Britt Dunlop, Suzanne Gessner, Tracey Herbert, and Aliana Parker. 2018. Report on the status of BC First Nations languages. Report of the First People's Cultural Council. Retrieved March 24, 2019.

Clarissa Forbes. 2018. Persistent ergativity: Agreement and splits in Tsimshianic. Ph.D. thesis, University of Toronto.

Clarissa Forbes, Henry Davis, Michael Schwan, and the UBC Gitksan Research Laboratory. 2017. Three Gitksan texts. In Papers for the 52nd International Conference on Salish and Neighbouring Languages, pages 47–89. UBC Working Papers in Linguistics.

Atticus G Harrigan, Katherine Schmirler, Antti Arppe, Lene Antonsen, Trond Trosterud, and Arok Wolvengrey. 2017. Learning from the computational modelling of Plains Cree verbs. Morphology, 27(4):565–598.

Lonnie Hindle and Bruce Rigsby. 1973. A short practical dictionary of the Gitksan language. Northwest Anthropological Research Notes, 7(1).

Mans Hulden. 2009a. Finite-state machine construction methods and algorithms for phonology and morphology. Ph.D. thesis, The University of Arizona.

Mans Hulden. 2009b. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, pages 29–32. Association for Computational Linguistics.

Benjamin Hunt, Emily Chen, Sylvia L.R. Schreiner, and Lane Schwartz. 2019. Community lexical access for an endangered polysynthetic language: An electronic dictionary for St. Lawrence Island Yupik. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 122–126, Minneapolis, Minnesota. Association for Computational Linguistics.

Katharine Hunt. 1993. Clause Structure, Agreement and Case in Gitksan. Ph.D. thesis, University of British Columbia.

Lingpeng Kong, Chris Dyer, and Noah A Smith. 2015. Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018.

Patrick Littell. 2018. Finite-state morphology for Kwak'wala: A phonological approach. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 21–30, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Patrick Littell, Aidan Pine, and Henry Davis. 2017. Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 141–150.

Peter Makarov and Simon Clematide. 2018. Neural transition-based string transduction for limited-resource setting in morphology. In Proceedings of the 27th International Conference on Computational Linguistics, pages 83–93, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jeffrey Micher. 2017. Improving coverage of an Inuktitut morphological analyzer using a segmental recurrent neural network. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 101–106, Honolulu. Association for Computational Linguistics.

Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cow-
ell, and Mans Hulden. 2018. A neural morphologi-
cal analyzer for arapaho verbs learned from a finite
state transducer. In Proceedings of the Workshop
on Computational Modeling of Polysynthetic Lan-
guages, pages 12–20.
Mother Tongues Dictionaries. 2020. Gitk-
san. Edited by the UBC Gitksan Research
Lab. Accessed June 4, 2020. (https:
//mothertongues.org/gitksan).
Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Fan, Sam Gross, Nathan Ng, David Grangier, and
Michael Auli. 2019. fairseq: A fast, extensible
toolkit for sequence modeling. In Proceedings of
the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics
(Demonstrations), pages 48–53, Minneapolis, Min-
nesota. Association for Computational Linguistics.

Bruce Rigsby. 1986. Gitxsan grammar. Master’s the-


sis, University of Queensland, Australia.
Lane Schwartz, Emily Chen, Benjamin Hunt, and
Sylvia LR Schreiner. 2019. Bootstrapping a neu-
ral morphological analyzer for st. lawrence island
yupik from a finite-state transducer. In Proceedings
of the Workshop on Computational Methods for En-
dangered Languages, volume 1.

Conor Snoek, Dorothy Thunder, Kaidi Loo, Antti


Arppe, Jordan Lachler, Sjur Moshagen, and Trond
Trosterud. 2014. Modeling the noun morphology of
plains cree. In Proceedings of the 2014 Workshop
on the Use of Computational Methods in the Study
of Endangered Languages, pages 34–42.

Tonya Stebbins. 2003. On the status of interme-


diate form-classes: Words, clitics, and affixes in
Coast Tsimshian (Sm’algyax). Linguistic Typology,
7(3):383–415.
Lonny Strunk. 2020. A Finite-State Morphological An-
alyzer for Central Alaskan Yup’Ik. Ph.D. thesis, Uni-
versity of Washington.

Marie-Lucie Tarpent. 1987. A Grammar of the Nisgha


Language. Ph.D. thesis, University of Victoria.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob


Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. arXiv preprint arXiv:1706.03762.

Comparative Error Analysis in Neural and Finite-state Models for
Unsupervised Character-level Transduction
Maria Ryskina1  Eduard Hovy1  Taylor Berg-Kirkpatrick2  Matthew R. Gormley3
1 Language Technologies Institute, Carnegie Mellon University
2 Computer Science and Engineering, University of California, San Diego
3 Machine Learning Department, Carnegie Mellon University
[email protected]  [email protected]
[email protected]  [email protected]

Abstract

Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.1

1 Code will be published at https://github.com/ryskina/error-analysis-sigmorphon2021

[Figure 1 examples:
3to to4no → это точно (Russian)
mana belagitu → ಮನ #$ಳ&$ತು (Kannada)
tehničko i stručno obrazovanje — техничка и стручна настава (Bosnian–Serbian)]

Figure 1: Parallel examples from our test sets for two character-level transduction tasks: converting informally romanized text to its original script (top; examples in Russian and Kannada) and translating between closely related languages (bottom; Bosnian–Serbian). Informal romanization is idiosyncratic and relies on both visual (q → 4) and phonetic (t → t) character similarity, while translation is more standardized but not fully character-level due to grammatical and lexical differences ('nastava' → 'obrazovanje') between the languages. The lines show character alignment between the source and target side where possible.
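As a purely illustrative toy example of this idiosyncrasy (hand-picked substitutions, not data or code from this work), a single Cyrillic character can surface as several Latin renderings chosen by phonetic or visual similarity, which is what makes the inverse mapping many-to-many:

# Toy substitution options for a few Russian characters; real users mix
# such choices freely, so recovering the original script is ambiguous.
ROMANIZATIONS = {
    "ч": ["ch", "4"],   # phonetic "ch" vs. visually similar digit "4"
    "т": ["t"],         # phonetic
    "о": ["o", "0"],    # phonetic vs. visual
    "э": ["e", "3"],    # phonetic vs. visual (mirror image)
}

def romanize_variants(word, table=ROMANIZATIONS):
    """Enumerate possible informal romanizations of a word (toy example)."""
    variants = {""}
    for ch in word:
        options = table.get(ch, [ch])
        variants = {prefix + o for prefix in variants for o in options}
    return variants

print(sorted(romanize_variants("это")))  # ['3t0', '3to', 'et0', 'eto']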
1 Introduction and prior work

Many natural language sequence transduction tasks, such as transliteration or grapheme-to-phoneme conversion, call for a character-level parameterization that reflects the linguistic knowledge of the underlying generative process. Character-level transduction approaches have even been shown to perform well for tasks that are not entirely character-level in nature, such as translating between related languages (Pourdamghani and Knight, 2017).

Weighted finite-state transducers (WFSTs) have traditionally been used for such character-level tasks (Knight and Graehl, 1998; Knight et al., 2006). Their structured formalization makes it easier to encode additional constraints, imposed either by the underlying linguistic process (e.g. monotonic character alignment) or by the probabilistic generative model (Markov assumption; Eisner, 2002). Their interpretability also facilitates the introduction of useful inductive bias, which is crucial for unsupervised training (Ravi and Knight, 2009; Ryskina et al., 2020).

Unsupervised neural sequence-to-sequence (seq2seq) architectures have also shown impressive performance on tasks like machine translation (Lample et al., 2018) and style transfer (Yang et al., 2018; He et al., 2020). These models are substantially more powerful than WFSTs, and they successfully learn the underlying patterns from

monolingual data without any explicit information about the underlying generative process.

As the strengths of the two model classes differ, so do their weaknesses: the WFSTs and the seq2seq models are prone to different kinds of errors. On a higher level, it is explained by the structure–power trade-off: while the seq2seq models are better at recovering long-range dependencies and their outputs look less noisy, they also tend to insert and delete words arbitrarily because their alignments are unconstrained. We attribute the errors to the following aspects of the trade-off:

Language modeling capacity: the statistical character-level n-gram language models (LMs) utilized by finite-state approaches are much weaker than the RNN language models with unlimited left context. While a word-level LM can improve the performance of a WFST, it would also restrict the model's ability to handle out-of-vocabulary words.

Controllability of learning: more structured models allow us to ensure that the model does not attempt to learn patterns orthogonal to the underlying process. For example, domain imbalance between the monolingual corpora can cause the seq2seq models to exhibit unwanted style transfer effects like inserting frequent target side words arbitrarily.

Search procedure: WFSTs make it easy to perform exact maximum likelihood decoding via shortest-distance algorithm (Mohri, 2009). For the neural models trained using conventional methods, decoding strategies that optimize for the output likelihood (e.g. beam search with a large beam size) have been shown to be susceptible to favoring empty outputs (Stahlberg and Byrne, 2019) and generating repetitions (Holtzman et al., 2020).

Prior work on leveraging the strength of the two approaches proposes complex joint parameterizations, such as neural weighting of WFST arcs or paths (Rastogi et al., 2016; Lin et al., 2019) or encoding alignment constraints into the attention layer of seq2seq models (Aharoni and Goldberg, 2017; Wu et al., 2018; Wu and Cotterell, 2019; Makarov et al., 2017). We study whether performance can be improved with simpler decoding-time model combinations, reranking and product of experts, which have been used effectively for other model classes (Charniak and Johnson, 2005; Hieber and Riezler, 2015), evaluating on two unsupervised tasks: decipherment of informal romanization (Ryskina et al., 2020) and related language translation (Pourdamghani and Knight, 2017).

While there has been much error analysis for the WFST and seq2seq approaches separately, it largely focuses on the more common supervised case. We perform detailed side-by-side error analysis to draw high-level comparisons between finite-state and seq2seq models and investigate if the intuitions from prior work would transfer to the unsupervised transduction scenario.

2 Tasks

We compare the errors made by the finite-state and the seq2seq approaches by analyzing their performance on two unsupervised character-level transduction tasks: translating between closely related languages written in different alphabets and converting informally romanized text into its native script. Both tasks are illustrated in Figure 1.

2.1 Informal romanization

Informal romanization is an idiosyncratic transformation that renders a non-Latin-script language in Latin alphabet, extensively used online by speakers of Arabic (Darwish, 2014), Russian (Paulsen, 2014), and many Indic languages (Sowmya et al., 2010). Figure 1 shows examples of romanized Russian (top left) and Kannada (top right) sentences along with their "canonicalized" representations in Cyrillic and Kannada scripts respectively. Unlike official romanization systems such as pinyin, this type of transliteration is not standardized: character substitution choices vary between users and are based on the specific user's perception of how similar characters in different scripts are. Although the substitutions are primarily phonetic (e.g. Russian n /n/ → n), i.e. based on the pronunciation of a specific character in or out of context, users might also rely on visual similarity between glyphs (e.g. Russian q /tSj/ → 4), especially when the associated phoneme cannot be easily mapped to a Latin-script grapheme (e.g. Arabic ¨ /Q/ → 3). To capture this variation, we view the task of decoding informal romanization as a many-to-many character-level decipherment problem.

The difficulty of deciphering romanization also depends on the type of the writing system the language traditionally uses. In alphabetic scripts, where grapheme-to-phoneme correspondence is mostly one-to-one, there tends to be a one-to-one monotonic alignment between characters in the ro-

manized and native script sequences (Figure 1, top in §3.4.2
left). Abjads and abugidas, where graphemes corre-
spond to consonants or consonant-vowel syllables, 3.1 Informal romanization
increasingly use many-to-one alignment in their
romanization (Figure 1, top right), which makes Source: de el menu:)
Filtered: de el menu<...>
learning the latent alignments, and therefore decod-
Target: <...> éJÖ Ï @ ø X
ing, more challenging. In this work, we experiment

with three languages spanning over three major Gloss: ‘This is the menu’
types of writing systems—Russian (alphabetic),
Arabic (abjad), and Kannada (abugida)—and com- Figure 2: A parallel example from the LDC BOLT
pare how well-suited character-level models are for Arabizi dataset, written in Latin script (source) and
learning these varying alignment patterns. converted to Arabic (target) semi-manually. Some
source-side segments (in red) are removed by an-
2.2 Related language translation notators; we use the version without such segments
(filtered) for our task. The annotators also stan-
As shown by Pourdamghani and Knight (2017)
dardize spacing on the target side, which results in
and Hauer et al. (2014), character-level models can
difference with the source (in blue).
be used effectively to translate between languages
that are closely enough related to have only small
lexical and grammatical differences, such as Ser- Arabic We use the LDC BOLT Phase 2 cor-
bian and Bosnian (Ljubešić and Klubička, 2014). pus (Bies et al., 2014; Song et al., 2014) for training
We focus on this specific language pair and tie the and testing the Arabic transliteration models (Fig-
languages to specific orthographies (Cyrillic for ure 2). The corpus consists of short SMS and chat
Serbian and Latin for Bosnian), approaching the in Egyptian Arabic represented using Latin script
task as an unsupervised orthography conversion (Arabizi). The corpus is fully parallel: each mes-
problem. However, the transliteration framing of sage is automatically converted into the standard-
the translation problem is inherently limited since ized dialectal Arabic orthography (CODA; Habash
the task is not truly character-level in nature, as et al., 2012) and then manually corrected by human
shown by the alignment lines in Figure 1 (bottom). annotators. We split and preprocess the data accord-
Even the most accurate transliteration model will ing to Ryskina et al. (2020), discarding the target
not be able to capture non-cognate word transla- (native script) and source (romanized) parallel sen-
tions (Serbian ‘nastava’ [nastava, ‘education, teach- tences to create the source and target monolingual
ing’] → Bosnian ‘obrazovanje’ [‘education’]) and the training splits respectively.
resulting discrepancies in morphological inflection
Russian We use the romanized Russian dataset
(Serbian -a endings in adjectives agreeing with
collected by Ryskina et al. (2020), augmented with
feminine ‘nastava’ map to Bosnian -o represent-
the monolingual Cyrillic data from the Taiga cor-
ing agreement with neuter ‘obrazovanje’).
pus of Shavrina and Shapovalova (2017) (Figure 3).
One major difference with the informal roman-
The romanized data is split into training, validation,
ization task is the lack of the idiosyncratic orthogra-
and test portions, and all validation and test sen-
phy: the word spellings are now consistent across
tences are converted to Cyrillic by native speaker
the data. However, since the character-level ap-
annotators. Both the romanized and the native-
proach does not fully reflect the nature of the trans-
script sequences are collected from public posts and
formation, the model will still have to learn a many-
comments on a Russian social network vk.com,
to-many cipher with highly context-dependent char-
and they are on average 3 times longer than the
acter substitutions.
messages in the Arabic dataset (Table 1). However,
although both sides were scraped from the same
3 Data
online platform, the relevant Taiga data is collected
Table 1 details the statistics of the splits used for primarily from political discussion groups, so there
all languages and tasks. Below we describe each is still a substantial domain mismatch between the
dataset in detail, explaining the differences in data source and target sides of the data.
split sizes between languages. Additional prepro- 2
Links to download the corpora and other data sources
cessing steps applied to all datasets are described discussed in this section can be found in Appendix A.

260
Train (source) Train (target) Validation Test
Sent. Char. Sent. Char. Sent. Char. Sent. Char.
Romanized Arabic 5K 104K 49K 935K 301 8K 1K 20K
Romanized Russian 5K 319K 307K 111M 227 15K 1K 72K
Romanized Kannada 10K 1M 679K 64M 100 11K 100 10K
Serbian→Bosnian 160K 9M 136K 9M 16K 923K 100 9K
Bosnian→Serbian 136K 9M 160K 9M 16K 908K 100 10K

Table 1: Dataset splits for each task and language. The source and target train data are monolingual, and
the validation and test sentences are parallel. For the informal romanization task, the source and target
sides correspond to the Latin and the original script respectively. For the translation task, the source and
target sides correspond to source and target languages. The validation and test character statistics are
reported for the source side.

Annotated Target: ಮೂಲ +,-$.ನ/$0 DDR3ಯನು2 ಬಳಸಲು


Source: proishodit s prirodoy 4to to very very bad Source: moola saaketnalli ddr3yannu balasalu
Filtered: proishodit s prirodoy 4to to <...> Gloss: ‘to use DDR3 in the source circuit’
Target: proishodit s prirodoĭ qto-to <...>
Gloss: ‘Something very very bad is happening to
the environment’ Figure 4: A parallel example from the Kannada por- Kagbi mozno ri
Monolingual tion of the Dakshina dataset. The Kannada script ‘[One] could kinda risk
Source: — data (target) is scraped from Wikipedia and man-
Target: to videoroliki so sezda par- ually converted to Latin (source) by human anno-
tii“Edina Rossi”
Gloss: ‘These are the videos from the “United Rus- tators. Foreign target-side characters (in red) get Kagbi mozno r
‘Not at all. Just like Žirinovskij, [they] often make
sia” party congress’ preserved in thesensible annotation but our preprocessing
suggestions.’
replaces them with UNK on the target side.
Отнюдь. Так же, как
Figure 3: Top: A parallel example from the roman- Жириновский, часто предлагает
Как бы можно р
ized Russian dataset. We use the filtered version Serbian: svako ima вещи.
здравые pravo na ivot, slobodu
of the romanized (source) sequences, removing the i bezbednost liqnosti.
Bosnian: svako ima pravo na život, slobodu i osobnu
segments the annotators were unable to convert to sigurnost.
Cyrillic, e.g. code-switched phrases (in red). The Gloss: ‘Everyone has the right to life, liberty and
annotators also standardize minor spelling varia- security of person.’
tion such as hyphenation (in blue). Bottom: a
monolingual Cyrillic example from the vk.com Figure 5: A parallel example from the Serbian–
portion of the Taiga corpus, which mostly consists Cyrillic and Bosnian–Latin UDHR. The sequences
of comments in political discussion groups. are not entirely parallel on character level due to
paraphrases and non-cognate translations (in blue).

3.2 Related language translation


Kannada Our Kannada data (Figure 4) is taken
from the Dakshina dataset (Roark et al., 2020), Following prior work (Pourdamghani and Knight,
a large collection of native-script text from 2017; Yang et al., 2018; He et al., 2020), we train
Wikipedia for 12 South Asian languages. Unlike our unsupervised models on the monolingual data
the Russian and Arabic data, the romanized portion from the Leipzig corpora (Goldhahn et al., 2012).
of Dakshina is not scraped directly from the users’ We reuse the non-parallel training and synthetic par-
online communication, but instead elicited from allel validation splits of Yang et al. (2018), who gen-
native speakers given the native-script sequences. erated their parallel data using the Google Trans-
Because of this, all romanized sentences in the lation API. Rather than using their synthetic test
data are parallel: we allocate most of them to the set, we opt to test on natural parallel data from the
source side training data, discarding their original Universal Declaration of Human Rights (UDHR),
script counterparts, and split the remaining anno- following Pourdamghani and Knight (2017).
tated ones between validation and test. We manually sentence-align the Serbian–

261
Cyrillic and Bosnian–Latin declaration texts and critics and non-printing characters like ZWJ are
follow the preprocessing guidelines of Pour- also treated as separate vocabulary items. To filter
damghani and Knight (2017). Although we strive out foreign or archaic characters and rare diacritics,
to approximate the training and evaluation setup we restrict the alphabets to characters that cover
of their work for fair comparison, there are some 99% of the monolingual training data. After that,
discrepancies: for example, our manual alignment we add any standard alphabetical characters and
of UDHR yields 100 sentence pairs compared to numerals that have been filtered out back into the
104 of Pourdamghani and Knight (2017). We use source and target alphabets. All remaining filtered
the data to train the translation models in both di- characters are replaced with a special UNK symbol
rections, simply switching the source and target in all splits except for the target-side test.
sides from Serbian to Bosnian and vice versa.
4 Methods
3.3 Inductive bias
We perform our analysis using the finite-state and
As discussed in §1, the WFST models are less pow-
seq2seq models from prior work and experiment
erful than the seq2seq models; however, they are
with two joint decoding strategies, reranking and
also more structured, which we can use to introduce
product of experts. Implementation details and
inductive bias to aid unsupervised training. Follow-
hyperparameters are described in Appendix B.
ing Ryskina et al. (2020), we introduce informative
priors on character substitution operations (for a de- 4.1 Base models
scription of the WFST parameterization, see §4.1).
The priors reflect the visual and phonetic similar- Our finite-state model is the WFST cascade in-
ity between characters in different alphabets and troduced by Ryskina et al. (2020). The model is
are sourced from human-curated resources built composed of a character-level n-gram language
with the same concepts of similarity in mind. For model and a script conversion transducer (emis-
all tasks and languages, we collect phonetically sion model), which supports one-to-one character
similar character pairs from the phonetic keyboard substitutions, insertions, and deletions. Charac-
layouts (or, in case of the translation task, from the ter operation weights in the emission model are
default Serbian keyboard layout, which is phonetic parameterized with multinomial distributions, and
in nature due to the dual orthography standard of similar character mappings (§3.3) are used to cre-
the language). We also add some visually similar ate Dirichlet priors on the emission parameters.
character pairs by automatically pairing all sym- To avoid marginalizing over sequences of infinite
bols that occur in both source and target alphabets length, a fixed limit is set on the delay of any path
(same Unicode codepoints). For Russian, which (the difference between the cumulative number of
exhibits a greater degree of visual similarity than insertions and deletions at any timestep). Ryskina
Arabic or Kannada, we also make use of the Uni- et al. (2020) train the WFST using stochastic step-
code confusables list (different Unicode codepoints wise EM (Liang and Klein, 2009), marginalizing
but same or similar glyphs).3 over all possible target sequences and their align-
It should be noted that these automatically gen- ments with the given source sequence. To speed
erated informative priors also contain noise: key- up training, we modify their training procedure
board layouts have spurious mappings because towards ‘hard EM’: given a source sequence, we
each symbol must be assigned to exactly one key in predict the most probable target sequence under
the QWERTY layout, and Unicode-constrained vi- the model, marginalize over alignments and then
sual mappings might prevent the model from learn- update the parameters. Although the unsupervised
ing correspondences between punctuation symbols WFST training is still slow, the stepwise training
(e.g. Arabic question mark ? → ?). procedure is designed to converge using fewer data
points, so we choose to train the WFST model
3.4 Preprocessing only on the 1,000 shortest source-side training se-
quences (500 for Kannada).
We lowercase and segment all sequences into char- Our default seq2seq model is the unsupervised
acters as defined by Unicode codepoints, so dia- neural machine translation (UNMT) model of Lam-
3
Links to the keyboard layouts and the confusables list can ple et al. (2018, 2019) in the parameterization
be found in Appendix A. of He et al. (2020). The model consists of an

262
Arabic Russian Kannada
CER WER BLEU CER WER BLEU CER WER BLEU
WFST .405 .86 2.3 .202 .58 14.8 .359 .71 12.5
Seq2Seq .571 .85 4.0 .229 .38 48.3 .559 .79 11.3
Reranked WFST .398 .85 2.8 .195 .57 16.1 .358 .71 12.5
Reranked Seq2Seq .538 .82 4.6 .216 .39 45.6 .545 .78 12.6
Product of experts .470 .88 2.5 .178 .50 22.9 .543 .93 7.0

Table 2: Character and word error rates (lower is better) and BLEU scores (higher is better) for the
romanization decipherment task. Bold indicates best per column. Model combinations mostly interpolate
between the base models’ scores, although reranking yields minor improvements in character-level and
word-level metrics for the WFST and seq2seq respectively. Note: base model results are not intended as a
direct comparison between the WFST and seq2seq, since they are trained on different amounts of data.
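The character and word error rates reported in Tables 2 and 3 are standard edit-distance-based metrics. A minimal reference implementation (not the evaluation code used to produce these tables) is:

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: edit distance over whitespace tokens."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)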

srp→bos bos→srp
CER WER BLEU CER WER BLEU
WFST .314 .50 25.3 .319 .52 25.5
Seq2Seq .375 .49 34.5 .395 .49 36.3
Reranked WFST .314 .49 26.3 .317 .50 28.1
Reranked Seq2Seq .376 .48 35.1 .401 .47 37.0
Product of experts .329 .54 24.4 .352 .66 20.6
(Pourdamghani and Knight, 2017) — — 42.3 — — 39.2
(He et al., 2020) .657 .81 5.6 .693 .83 4.7

Table 3: Character and word error rates (lower is better) and BLEU scores (higher is better) for the related
language translation task. Bold indicates best per column. The WFST and the seq2seq have comparable
CER and WER despite the WFST being trained on up to 160x less source-side data (§4.1). While none
of our models achieve the scores reported by Pourdamghani and Knight (2017), they all substantially
outperform the subword-level model of He et al. (2020). Note: base model results are not intended as a
direct comparison between the WFST and seq2seq, since they are trained on different amounts of data.
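The reranking combinations evaluated in these tables amount to letting one model propose an n-best list and the other rescore it. A minimal sketch of that procedure, with hypothetical nbest() and score() interfaces standing in for the WFST n-shortest-paths output and the seq2seq (or WFST) scoring, is:

def rerank(source, nbest, score, n=5, alpha=1.0):
    """Rescore one model's n-best hypotheses with another model.

    nbest(source, n): list of (hypothesis, base_log_prob) from the base decoder
    score(source, hypothesis): reranker log-probability
    alpha: interpolation weight on the reranker score
    """
    merged = {}
    for hyp, logp in nbest(source, n):
        # n-best paths may repeat a surface string (different alignments);
        # keep the best base score per string
        merged[hyp] = max(merged.get(hyp, float("-inf")), logp)
    return max(merged, key=lambda h: merged[h] + alpha * score(source, h))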

LSTM (Hochreiter and Schmidhuber, 1997) en- trained to translate in both directions simultane-
coder and decoder with attention, trained to map ously. Therefore, we reuse the same seq2seq model
sentences from each domain into a shared latent for both directions of the translation task, but train
space. Using a combined objective, the UNMT a separate finite-state model for each direction.
model is trained to denoise, translate in both direc-
tions, and discriminate between the latent represen- 4.2 Model combinations
tation of sequences from different domains. Since
the sufficient amount of balanced data is crucial The simplest way to combine two independently
for the UNMT performance, we train the seq2seq trained models is reranking: using one model to
model on all available data on both source and tar- produce a list of candidates and rescoring them ac-
get sides. Additionally, the seq2seq model decides cording to another model. To generate candidates
on early stopping by evaluating on a small parallel with a WFST, we apply the n–shortest paths algo-
validation set, which our WFST model does not rithm (Mohri and Riley, 2002). It should be noted
have access to. that the n–best list might contain duplicates since
each path represents a specific source–target char-
The WFST model treats the target and source acter alignment. The length constraints encoded in
training data differently, using the former to train the WFST also restrict its capacity as a reranker:
the language model and the latter for learning the beam search in the UNMT model may produce
emission parameters, while the UNMT model is hypotheses too short or long to have a non-zero

263
Input svako ima pravo da slobodno uqestvuje u kulturnom ivotu zajednice, da uiva
u umetnosti i da uqestvuje u nauqnom napretku i u dobrobiti koja otuda
proistiqe.
Ground truth svako ima pravo da slobodno sudjeluje u kulturnom životu zajednice, da uživa u umjetnosti i da
učestvuje u znanstvenom napretku i u njegovim koristima.
WFST svako ima pravo da slobodno učestvuje u kulturnom životu s jednice , da uživa u m etnosti i da
učestvuje u naučnom napretku i u dobrobiti koja otuda pr ističe .
Reranked WFST svako ima pravo da slobodno učestvuje u kulturnom životu s jednice , da uživa u m etnosti i da
učestvuje u naučnom napretku i u dobrobiti koja otuda pr ističe .
Seq2Seq svako ima pravo da slobodno učestvuje u kulturnom životu zajednice , da
učestvuje u naučnom napretku i u dobrobiti koja otuda proističe .
Reranked Seq2Seq svako ima pravo da slobodno učestvuje u kulturnom životu zajednice , da uživa u umjetnosti i da
učestvuje u naučnom napretku i u dobrobiti koja otuda proističe
Product of experts svako ima pravo da slobodno učestvuje u kulturnom za u s ajednice , da živa u umjetnosti i da
učestvuje u naučnom napretku i u dobro j i koja otuda proisti
Subword Seq2Seq s ami ima pravo da slobodno u tiče na srpskom nivou vlasti da razgovaraju u bosne i da djeluje u
med̄unarodnom turizmu i na buducnosti koja muža decisno .

Table 4: Different model outputs for a srp→bos translation example. Prediction errors are highlighted
in red. Correctly transliterated segments that do not match the ground truth (e.g. due to paraphrasing)
are shown in yellow. Here the WFST errors are substitutions or deletions of individual characters, while
the seq2seq drops entire words from the input (§5 #4). The latter problem is solved by reranking with a
WFST for this example. The seq2seq model with subword tokenization (He et al., 2020) produces mostly
hallucinated output (§5 #2). Example outputs for all other datasets can be found in the Appendix.
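The product-of-experts decoding compared above scores every candidate extension under both models during search. Stripped of the WFST-specific bookkeeping (lattice states, epsilon arcs, delay limits), the core loop looks roughly like the sketch below; wfst_next and seq2seq_next are hypothetical interfaces returning next-character log-probabilities given the source and the output prefix, not APIs from this work.

import heapq

def product_beam_search(source, wfst_next, seq2seq_next,
                        beam=5, max_len=200, eos="</s>"):
    """Beam search where each extension is scored by both experts.

    Both callables map (source, output_prefix) to {next_symbol: log_prob}.
    A full implementation would walk the WFST lattice directly and handle
    insertions/deletions, which are not rescored here.
    """
    hyps = [(0.0, "")]          # (cumulative log-prob, output prefix)
    finished = []
    for _ in range(max_len):
        expansions = []
        for logp, prefix in hyps:
            p = wfst_next(source, prefix)
            q = seq2seq_next(source, prefix)
            for sym in set(p) & set(q):   # extensions allowed by both experts
                new_logp = logp + p[sym] + q[sym]
                if sym == eos:
                    finished.append((new_logp, prefix))
                else:
                    expansions.append((new_logp, prefix + sym))
        if not expansions:
            break
        hyps = heapq.nlargest(beam, expansions)
    return max(finished + hyps)[1]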

probability under the WFST. of Pourdamghani and Knight (2017) also use the
Our second approach is a product-of-experts- same respective sources, we cannot account for
style joint decoding strategy (Hinton, 2002): tokenization differences that could affect the scores
we perform beam search on the WFST lattice, reported by the authors.
reweighting the arcs with the output distribution of
the seq2seq decoder at the corresponding timestep. 5 Results and analysis
For each partial hypothesis, we keep track of the
Tables 2 and 3 present our evaluation of the two
WFST state s and the partial input and output se-
base models and three decoding-time model com-
quences x1:k and y1:t .4 When traversing an arc
binations on the romanization decipherment and
with input label i ∈ {xk+1 , } and output label o,
related language translation tasks respectively. For
we multiply the arc weight by the probability of
each experiment, we report character error rate,
the neural model outputting o as the next character:
word error rate, and BLEU (see Appendix C). The
pseq2seq (yt+1 = o|x, y1:t ). Transitions with o = 
results for the base models support what we show
(i.e. deletions) are not rescored by the seq2seq. We
later in this section: the seq2seq model is more
group hypotheses by their consumed input length
likely to recover words correctly (higher BLEU,
k and select n best extensions at each timestep.
lower WER), while the WFST is more faithful on
4.3 Additional baselines character level and avoids word-level substitution
errors (lower CER). Example predictions can be
For the translation task, we also compare to prior found in Table 4 and in the Appendix.
unsupervised approaches of different granularity: Our further qualitative and quantitative findings
the deep generative style transfer model of He et al. are summarized in the following high-level take-
(2020) and the character- and word-level WFST aways:
decipherment model of Pourdamghani and Knight
(2017). The former is trained on the same training #1: Model combinations still suffer from search
set tokenized into subword units (Sennrich et al., issues. We would expect the combined decod-
2016), and we evaluate it on our UDHR test set ing to discourage all errors common under one
for fair comparison. While the train and test data model but not the other, improving the performance
4
Due to insertions and deletions in the emission model, k by leveraging the strengths of both model classes.
and t might differ; epsilon symbols are not counted. However, as Tables 2 and 3 show, they instead

264
WFST Seq2Seq
Figure 6: Highest-density sub-
matrices of the two base mod-
els’ character confusion matrices,
computed in the Russian roman-
ization task. White cells repre-
sent zero elements. The WFST
confusion matrix (left) is notice-
ably sparser than the seq2seq one
(right), indicating more repetitive
errors. # symbol stands for UNK.
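The confusion counts underlying Figure 6 can be gathered from character-level alignments between references and system outputs. The sketch below uses a plain Levenshtein alignment for this purpose; it is an illustration under that assumption, not the exact procedure used for the figure.

from collections import Counter

def align_chars(ref, hyp):
    """Backtrace a Levenshtein alignment into (ref_char, hyp_char) pairs,
    using the empty string for insertions and deletions."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], ""))   # deletion in the hypothesis
            i -= 1
        else:
            pairs.append(("", hyp[j - 1]))   # insertion in the hypothesis
            j -= 1
    return pairs[::-1]

def confusion_matrix(examples):
    """examples: iterable of (reference, prediction) string pairs."""
    counts = Counter()
    for ref, hyp in examples:
        counts.update(align_chars(ref, hyp))
    return counts   # counts[(gold_char, predicted_char)]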

mostly interpolate between the scores of the two This observation also aligns with the findings of
base models. In the reranking experiments, we find the recent work on language modeling complex-
that this is often due to the same base model er- ity (Park et al., 2021; Mielke et al., 2019). For
ror (e.g. the seq2seq model hallucinating a word many languages, including several Slavic ones re-
mid-sentence) repeating across all the hypotheses lated to the Serbian–Bosnian pair, a character-level
in the final beam. This suggests that successful language model yields lower surprisal than the one
reranking would require a much larger beam size trained on BPE units, suggesting that the effect
or a diversity-promoting search mechanism. might also be explained by the character tokeniza-
Interestingly, we observe that although adding tion making the language easier to language-model.
a reranker on top of a decoder does improve per-
formance slightly, the gain is only in terms of the #3: WFST model makes more repetitive errors.
metrics that the base decoder is already strong at— Although two of our evaluation metrics, CER and
character-level for reranked WFST and word-level WER, are based on edit distance, they do not dis-
for reranked seq2seq—at the expense of the other tinguish between the different types of edits (sub-
scores. Overall, none of our decoding strategies stitutions, insertions and deletions). Breaking them
achieves best results across the board, and no model down by the edit operation, we find that while both
combination substantially outperforms both base models favor substitutions on both word and char-
models in any metric. acter levels, insertions and deletions are more fre-
quent under the neural model (43% vs. 30% of all
#2: Character tokenization boosts performance edits on the Russian romanization task). We also
of the neural model. In the past, UNMT-style find that the character substitution choices of the
models have been applied to various unsupervised neural model are more context-dependent: while
sequence transduction problems. However, since the total counts of substitution errors for the two
these models were designed to operate on word or models are comparable, the WFST is more likely
subword level, prior work assumes the same tok- to repeat the same few substitutions per character
enization is necessary. We show that for the tasks type. This is illustrated by Figure 6, which visual-
allowing character-level framing, such models in izes the most populated submatrices of the confu-
fact respond extremely well to character input. sion matrices for the same task as heatmaps. The
Table 3 compares the UNMT model trained on WFST confusion matrix is noticeably more sparse,
characters with the seq2seq style transfer model with the same few substitutions occurring much
of He et al. (2020) trained on subword units. The more frequently than others: for example, WFST
original paper shows improvement over the UNMT often mistakes  for a and rarely for other char-
baseline in the same setting, but simply switching acters, while the neural model’s substitutions of
to character-level tokenization without any other  are distributed closer to uniform. This suggests
changes results in a 30 BLEU points gain for ei- that the WFST errors might be easier to correct
ther direction. This suggests that the tokenization with rule-based postprocessing. Interestingly, we
choice could act as an inductive bias for seq2seq did not observe the same effect for the translation
models, and character-level framing could be use- task, likely due to a more constrained nature of the
ful even for tasks that are not truly character-level. orthography conversion.

265
WFST Seq2Seq
Figure 7: Character error rate per
word for the WFST (left) and seq2seq
800 800
(right) bos→srp translation outputs.
Number of words

600 600 The predictions are segmented us-


ing Moses tokenizer (Koehn et al.,
400 400 2007) and aligned to ground truth
with word-level edit distance. The in-
200 200
creased frequency of CER=1 for the
0 0
seq2seq model as compared to the
0 0.6 1.2 1.8 2.4 0 0.6 1.2 1.8 2.4 WFST indicates that it replaces entire
CER per word CER per word words more often.

#4: Neural model is more sensitive to data dis- 6 Conclusion


tribution shifts. The language model aiming to
replicate its training data distribution could cause We perform comparative error analysis in finite-
the output to deviate from the input significantly. state and seq2seq models and their combinations
This could be an artifact of a domain shift, such as for two unsupervised character-level tasks, infor-
in Russian, where the LM training data came from mal romanization decipherment and related lan-
a political discussion forum: the seq2seq model fre- guage translation. We find that the two model types
quently predicts unrelated domain-specific proper tend towards different errors: seq2seq models are
names in place of very common Russian words, e.g. more prone to word-level errors caused by distribu-
izn~ [žizn, ‘life’] → Zganov [Zjuganov, ‘Zyuganov tional shifts while WFSTs produce more character-
(politician’s last name)’] or to [èto, ‘this’] → Edina level noise despite the hard alignment constraints.
Rossi [Edinaja Rossija, ‘United Russia (political party)’], Despite none of our simple decoding-time com-
presumably distracted by the shared first character binations substantially outperforming the base mod-
in the romanized version. To quantify the effect of els, we believe that combining neural and finite-
a mismatch between the train and test data distri- state models to harness their complementary ad-
butions in this case, we inspect the most common vantages is a promising research direction. Such
word-level substitutions under each decoding strat- combinations might involve biasing seq2seq mod-
egy, looking at all substitution errors covered by the els towards WFST-like behavior via pretraining or
1,000 most frequent substitution ‘types’ (ground directly encoding constraints such as hard align-
truth–prediction word pairs) under the respective ment or monotonicity into their parameterization
decoder. We find that 25% of the seq2seq substitu- (Wu et al., 2018; Wu and Cotterell, 2019). Al-
tion errors fall into this category, as compared to though recent work has shown that the Transformer
merely 3% for the WFST—notable given the rela- can learn to perform character-level transduction
tive proportion of in-vocabulary words in the mod- without such biases in a supervised setting (Wu
els’ outputs (89% for UNMT vs. 65% for WFST). et al., 2021), exploiting the structured nature of the
task could be crucial for making up for the lack of
large parallel corpora in low-data and/or unsuper-
vised scenarios. We hope that our analysis provides
Comparing the error rate distribution across out- insight into leveraging the strengths of the two ap-
put words for the translation task also supports this proaches for modeling character-level phenomena
observation. As can be seen from Figure 7, the in the absence of parallel data.
seq2seq model is likely to either predict the word
correctly (CER of 0) or entirely wrong (CER of Acknowledgments
1), while the WFST more often predicts the
word partially correctly—examples in Table 4 illus- The authors thank Badr Abdullah, Deepak
trate this as well. We also see this in the Kannada Gopinath, Junxian He, Shruti Rijhwani, and Stas
outputs: WFST typically gets all the consonants Kashepava for helpful discussion, and the anony-
right but makes mistakes in the vowels, while the mous reviewers for their valuable feedback.
seq2seq tends to replace the entire word.

266
References Kyle Gorman. 2016. Pynini: A Python library for
weighted finite-state grammar compilation. In Pro-
Roee Aharoni and Yoav Goldberg. 2017. Morphologi- ceedings of the SIGFSM Workshop on Statistical
cal inflection generation with hard monotonic atten- NLP and Weighted Automata, pages 75–80, Berlin,
tion. In Proceedings of the 55th Annual Meeting of Germany. Association for Computational Linguis-
the Association for Computational Linguistics (Vol- tics.
ume 1: Long Papers), pages 2004–2015, Vancouver,
Canada. Association for Computational Linguistics. Nizar Habash, Mona Diab, and Owen Rambow. 2012.
Conventional orthography for dialectal Arabic. In
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Woj- Proceedings of the Eighth International Conference
ciech Skut, and Mehryar Mohri. 2007. OpenFst: A on Language Resources and Evaluation (LREC’12),
general and efficient weighted finite-state transducer pages 711–718, Istanbul, Turkey. European Lan-
library. In Proceedings of the Ninth International guage Resources Association (ELRA).
Conference on Implementation and Application of
Automata, (CIAA 2007), volume 4783 of Lecture Bradley Hauer, Ryan Hayward, and Grzegorz Kon-
Notes in Computer Science, pages 11–23. Springer. drak. 2014. Solving substitution ciphers with com-
http://www.openfst.org. bined language models. In Proceedings of COLING
2014, the 25th International Conference on Compu-
Sowmya V. B., Monojit Choudhury, Kalika Bali, tational Linguistics: Technical Papers, pages 2314–
Tirthankar Dasgupta, and Anupam Basu. 2010. Re- 2325, Dublin, Ireland. Dublin City University and
source creation for training and testing of translit- Association for Computational Linguistics.
eration systems for Indian languages. In Proceed-
ings of the Seventh International Conference on Lan- Junxian He, Xinyi Wang, Graham Neubig, and Taylor
guage Resources and Evaluation (LREC’10), Val- Berg-Kirkpatrick. 2020. A probabilistic formulation
letta, Malta. European Language Resources Associ- of unsupervised text style transfer. In International
ation (ELRA). Conference on Learning Representations.
Felix Hieber and Stefan Riezler. 2015. Bag-of-words
Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen forced decoding for cross-lingual information re-
Grimes, Haejoong Lee, Jonathan Wright, Stephanie trieval. In Proceedings of the 2015 Conference of
Strassel, Nizar Habash, Ramy Eskander, and Owen the North American Chapter of the Association for
Rambow. 2014. Transliteration of Arabizi into Ara- Computational Linguistics: Human Language Tech-
bic orthography: Developing a parallel annotated nologies, pages 1172–1182, Denver, Colorado. As-
Arabizi-Arabic script SMS/chat corpus. In Proceed- sociation for Computational Linguistics.
ings of the EMNLP 2014 Workshop on Arabic Nat-
ural Language Processing (ANLP), pages 93–103, G. E. Hinton. 2002. Training products of experts by
Doha, Qatar. Association for Computational Lin- minimizing contrastive divergence. Neural Compu-
guistics. tation, 14(8):1771–1800.
Eugene Charniak and Mark Johnson. 2005. Coarse- Sepp Hochreiter and Jürgen Schmidhuber. 1997.
to-fine n-best parsing and MaxEnt discriminative Long short-term memory. Neural computation,
reranking. In Proceedings of the 43rd Annual Meet- 9(8):1735–1780.
ing of the Association for Computational Linguis-
tics (ACL’05), pages 173–180, Ann Arbor, Michi- Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
gan. Association for Computational Linguistics. Yejin Choi. 2020. The curious case of neural text de-
generation. In International Conference on Learn-
Kareem Darwish. 2014. Arabizi detection and conver- ing Representations.
sion to Arabic. In Proceedings of the EMNLP 2014 Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin,
Workshop on Arabic Natural Language Processing and Brian Roark. 2021. Finite-state script normal-
(ANLP), pages 217–224, Doha, Qatar. Association ization and processing utilities: The Nisaba Brahmic
for Computational Linguistics. library. In Proceedings of the 16th Conference of
the European Chapter of the Association for Compu-
Jason Eisner. 2002. Parameter estimation for prob-
tational Linguistics: System Demonstrations, pages
abilistic finite-state transducers. In Proceedings
14–23, Online. Association for Computational Lin-
of the 40th Annual Meeting of the Association for
guistics.
Computational Linguistics, pages 1–8, Philadelphia,
Pennsylvania, USA. Association for Computational Kevin Knight and Jonathan Graehl. 1998. Ma-
Linguistics. chine transliteration. Computational Linguistics,
24(4):599–612.
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff.
2012. Building large monolingual dictionaries at Kevin Knight, Anish Nair, Nishit Rathod, and Kenji
the Leipzig corpora collection: From 100 to 200 lan- Yamada. 2006. Unsupervised analysis for deci-
guages. In Proceedings of the Eighth International pherment problems. In Proceedings of the COL-
Conference on Language Resources and Evaluation ING/ACL 2006 Main Conference Poster Sessions,
(LREC’12), pages 759–765, Istanbul, Turkey. Euro- pages 499–506, Sydney, Australia. Association for
pean Language Resources Association (ELRA). Computational Linguistics.

267
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Mehryar Mohri and Michael Riley. 2002. An efficient
Callison-Burch, Marcello Federico, Nicola Bertoldi, algorithm for the n-best-strings problem. In Seventh
Brooke Cowan, Wade Shen, Christine Moran, International Conference on Spoken Language Pro-
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra cessing.
Constantin, and Evan Herbst. 2007. Moses: Open
source toolkit for statistical machine translation. In Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Proceedings of the 45th Annual Meeting of the As- Jing Zhu. 2002. BLEU: A method for automatic
sociation for Computational Linguistics Companion evaluation of machine translation. In Proceedings of
Volume Proceedings of the Demo and Poster Ses- the 40th Annual Meeting of the Association for Com-
sions, pages 177–180, Prague, Czech Republic. As- putational Linguistics, pages 311–318, Philadelphia,
sociation for Computational Linguistics. Pennsylvania, USA. Association for Computational
Linguistics.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer,
and Marc’Aurelio Ranzato. 2018. Unsupervised ma- Hyunji Hayley Park, Katherine J. Zhang, Coleman Ha-
chine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611–619, Boulder, Colorado. Association for Computational Linguistics.

Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley, and Jason Eisner. 2019. Neural finite-state transducers: Beyond rational relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 272–283, Minneapolis, Minnesota. Association for Computational Linguistics.

Nikola Ljubešić and Filip Klubička. 2014. {bs,hr,sr}WaC – web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29–35, Gothenburg, Sweden. Association for Computational Linguistics.

Peter Makarov, Tatiana Ruzsics, and Simon Clematide. 2017. Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 49–57, Vancouver. Association for Computational Linguistics.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Mehryar Mohri. 2009. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer.

ley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology matters: A multilingual language modeling analysis. Transactions of the Association for Computational Linguistics, 9:261–276.

Martin Paulsen. 2014. Translit: Computer-mediated digraphia on the Runet. Digital Russia: The Language, Culture and Politics of New Media Communication.

Nima Pourdamghani and Kevin Knight. 2017. Deciphering related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2513–2518, Copenhagen, Denmark. Association for Computational Linguistics.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 623–633, San Diego, California. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45, Boulder, Colorado. Association for Computational Linguistics.

Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea. Association for Computational Linguistics.

Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, and Keith Hall. 2020. Processing South Asian languages written in the Latin script: The Dakshina dataset. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2413–2423, Marseille, France. European Language Resources Association.

Maria Ryskina, Matthew R. Gormley, and Taylor Berg-Kirkpatrick. 2020. Phonetic and visual priors for decipherment of informal Romanization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8308–8319, Online. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Tatiana Shavrina and Olga Shapovalova. 2017. To the methodology of corpus construction for machine learning: Taiga syntax tree corpus and parser. In Proc. CORPORA 2017 International Conference, pages 78–84, St. Petersburg.

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan, and Ann Sawyer. 2014. Collecting natural SMS and chat conversations in multiple languages: The BOLT phase 2 corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1699–1704, Reykjavik, Iceland. European Language Resources Association (ELRA).

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.

Shijie Wu and Ryan Cotterell. 2019. Exact hard monotonic attention for character-level transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1530–1537, Florence, Italy. Association for Computational Linguistics.

Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021. Applying the transformer to character-level transduction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1901–1907, Online. Association for Computational Linguistics.

Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4425–4438, Brussels, Belgium. Association for Computational Linguistics.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In NeurIPS, pages 7298–7309.

A Data download links

The romanized Russian and Arabic data and preprocessing scripts can be downloaded here. This repository also contains the relevant portion of the Taiga dataset, which can be downloaded in full at this link. The romanized Kannada data was downloaded from the Dakshina dataset.

The scripts to download the Serbian and Bosnian Leipzig corpora data can be found here. The UDHR texts were collected from the corresponding pages: Serbian, Bosnian.

The keyboard layouts used to construct the phonetic priors are collected from the following sources: Arabic 1, Arabic 2, Russian, Kannada, Serbian. The Unicode confusables list used for the Russian visual prior can be found here.

B Implementation

WFST: We reuse the unsupervised WFST implementation of Ryskina et al. (2020),[5] which utilizes the OpenFst (Allauzen et al., 2007) and OpenGrm (Roark et al., 2012) libraries. We use the default hyperparameter settings described by the authors (see Appendix B in the original paper). We keep the hyperparameters unchanged for the translation experiment and set the maximum delay value to 2 for both translation directions.

UNMT: We use the PyTorch UNMT implementation of He et al. (2020)[6] which incorporates improvements introduced by Lample et al. (2019), such as the addition of a max-pooling layer. We use a single-layer LSTM (Hochreiter and Schmidhuber, 1997) with hidden state size 512 for both the encoder and the decoder, and embedding dimension 128. For the denoising autoencoding loss, we adopt the default noise model and hyperparameters as described by Lample et al. (2018). The autoencoding loss is annealed over the first 3 epochs. We predict the output using greedy decoding and set the maximum output length equal to the length of the input sequence. Patience for early stopping is set to 10.

Model combinations: Our joint decoding implementations rely on PyTorch and the Pynini finite-state library (Gorman, 2016). In reranking, we rescore the n = 5 best hypotheses produced using beam search and the n-shortest path algorithm for the UNMT and WFST respectively. Product of experts decoding is also performed with beam size 5.

C Metrics

The character error rate (CER) and word error rate (WER) are measured as the Levenshtein distance between the hypothesis and reference divided by the reference length:

    ER(h, r) = dist(h, r) / len(r)

with both the numerator and the denominator measured in characters and words respectively.

We report BLEU-4 score (Papineni et al., 2002), measured using the Moses toolkit script.[7] For both BLEU and WER, we split sentences into words using the Moses tokenizer (Koehn et al., 2007).

[5] https://github.com/ryskina/romanization-decipherment
[6] https://github.com/cindyxinyiwang/deep-latent-sequence-model
[7] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
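For concreteness, the error rate defined above can be computed with a standard dynamic-programming edit distance. The following Python sketch is ours, for illustration only (it is not the evaluation script used for the reported numbers); it yields CER when called on character strings and WER when called on token lists.

    # Minimal sketch of ER(h, r) = dist(h, r) / len(r), where dist is the
    # Levenshtein distance over characters (CER) or words (WER).
    def levenshtein(hyp, ref):
        """Edit distance between two sequences via dynamic programming."""
        d = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, r in enumerate(ref, 1):
                prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                       d[j - 1] + 1,     # insertion
                                       prev + (h != r))  # substitution
        return d[len(ref)]

    def error_rate(hyp, ref):
        return levenshtein(hyp, ref) / len(ref)

    cer = error_rate("kot", "kto")                        # character error rate
    wer = error_rate("the cat".split(), "a cat".split())  # word error rate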

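The reranking and product-of-experts combinations described in Appendix B reduce to scoring every candidate under both models and combining the scores. The sketch below only illustrates that idea; wfst_logprob, unmt_logprob and the interpolation weight alpha are hypothetical stand-ins, not the actual scoring interface of the OpenFst/Pynini or PyTorch implementations.

    # Illustrative sketch only: combine two candidate-level log-scores, as in
    # reranking / product-of-experts decoding. `wfst_logprob` and `unmt_logprob`
    # are hypothetical placeholders for the real model scoring functions.
    def rerank(candidates, wfst_logprob, unmt_logprob, alpha=0.5):
        """Return the candidate maximizing a weighted sum of the two log-scores."""
        def combined(c):
            return alpha * wfst_logprob(c) + (1 - alpha) * unmt_logprob(c)
        return max(candidates, key=combined)

    # Toy usage with stand-in scoring functions:
    hyps = ["kongress ne odobril", "kongres ne odobril"]
    toy_wfst = lambda s: -len(s)        # pretend shorter is better
    toy_unmt = lambda s: -s.count("s")  # arbitrary toy score
    print(rerank(hyps, toy_wfst, toy_unmt))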
Table 5: Different model outputs for a Russian transliteration example (left column—Cyrillic, right—scientific transliteration). Prediction errors are shown in red. Correctly transliterated segments that do not match the ground truth because of spelling standardization in annotation are in yellow. # stands for UNK.

Table 6: Different model outputs for an Arabizi transliteration example (left column—Arabic, right—Buckwalter transliteration). Prediction errors are highlighted in red in the romanized versions. Correctly transliterated segments that do not match the ground truth because of spelling standardization during annotation are highlighted in yellow.

Table 7: Different model outputs for a Kannada transliteration example (left column—Kannada, right—ISO 15919 transliterations). The ISO romanization is generated using the Nisaba library (Johny et al., 2021). Prediction errors are highlighted in red in the romanized versions.

Finite-state Model of Shupamem Reduplication

Magdalena Markowska, Jeffrey Heinz, and Owen Rambow


Stony Brook University
Department of Linguistics &
Institute for Advanced Computational Science
{magdalena.markowska,jeffrey.heinz,owen.rambow}@stonybrook.edu

Abstract

Shupamem, a language of Western Cameroon, is a tonal language which also exhibits the morpho-phonological process of full reduplication. This creates two challenges for a finite-state model of its morpho-syntax and morpho-phonology: how to manage the full reduplication, as well as the autosegmental nature of lexical tone. Dolatian and Heinz (2020) explain how 2-way finite-state transducers can model full reduplication without an exponential increase in states, and finite-state transducers with multiple tapes have been used to model autosegmental tiers, including tone (Wiebe, 1992; Dolatian and Rawski, 2020a; Rawski and Dolatian, 2020). Here we synthesize 2-way finite-state transducers and multi-tape transducers, resulting in a finite-state formalism that subsumes both, to account for the full reduplicative processes in Shupamem which also affect tone.

1 Introduction

Reduplication is a very common morphological process cross-linguistically. Approximately 75% of world languages exhibit partial or total reduplication (Rubino, 2013). This morphological process is particularly interesting from the computational point of view because it introduces challenges for 1-way finite-state transducers (FSTs). Even though partial reduplication can be modelled with 1-way FSTs (Roark and Sproat, 2007; Chandlee and Heinz, 2012), there is typically an explosion in the number of states. Total reduplication, on the other hand, is the only known morpho-phonological process that cannot be modelled with 1-way FSTs because the number of copied elements, in principle, has no upper bound. Dolatian and Heinz (2020) address this challenge with 2-way FSTs, which can move back and forth on the input tape, producing a faithful copy of a string. Deterministic 2-way FSTs can model both partial and full segmental reduplication in a compact way. However, many languages that exhibit reduplicative processes are also tonal, which often means that tones and segments act independently from one another in their morpho-phonology. For instance, in Shupamem, a tonal language of Western Cameroon, ndáp 'house' → ndâp ndàp 'houses' (Markowska, 2020).

    tones       H        HL L
    segments    ndap     ndap ndap

Pioneering work in autosegmental phonology (Leben, 1973; Williams, 1976; Goldsmith, 1976) shows tones may act independently from their tone-bearing units (TBUs). Moreover, tones may exhibit behavior that is not typical for segments (Hyman, 2014; Jardine, 2016), which brings yet another strong argument for separating them from segments in their linguistic representations. Such autosegmental representations can be mimicked using finite-state machines, in particular, Multi-Tape Finite-State Transducers (MT FSTs) (Wiebe, 1992; Dolatian and Rawski, 2020a; Rawski and Dolatian, 2020). We note that McCarthy (1981) uses the same autosegmental representations in the linguistic representation to model templatic morphology, and this approach has been modeled for Semitic morphological processing using multi-tape automata (Kiraz, 2000; Habash and Rambow, 2006).

This paper investigates what finite-state machinery is needed for languages which have both reduplication and tones. We first argue that we need a synthesis of the aforementioned transducers, i.e. 1-way, 2-way and MT FSTs, to model morphology in the general case. The necessity for such a formal device will be supported by the morpho-phonological processes present in Shupamem nominal and verbal reduplication. We then discuss an
alternative, in which we use the MT FST to handle both reduplication and tones. It is important to emphasize that all of the machines we discuss are deterministic, which serves as another piece of evidence that even such complex processes as full reduplication can be modelled with deterministic finite-state technology (Chandlee and Heinz, 2012; Heinz, 2018).

This paper is structured as follows. First, we will briefly summarize the linguistic phenomena observed in Shupamem reduplication (Section 2). We then provide a formal description of the 2-way (Section 3) and MT FSTs (Section 4). We propose a synthesis of the 1-way, 2-way and MT FSTs in Section 5 and further illustrate them using relevant examples from Shupamem in Section 6. In Section 7 we discuss a possible alternative to the model which uses only MT FSTs. Finally, in Section 8 we show that the proposed model works for other tonal languages as well, and we conclude our contributions.

2 Shupamem nominal and verbal reduplication

Shupamem is an understudied Grassfields Bantu language of Cameroon spoken by approximately 420,000 speakers (Eberhard et al., 2021). It exhibits four contrastive surface tones (Nchare, 2012): high (H; we use diacritic V́ on a vowel as an orthographic representation), low (L; diacritic V̀), rising (LH; diacritic V̌), and falling (HL; diacritic V̂). Nouns and verbs in the language reduplicate to create plurals and introduce semantic contrast, respectively. Out of 13 noun classes, only one exhibits reduplication. Nouns that belong to that class are monosyllabic and carry either H or L lexical tones. Shupamem verbs are underlyingly H or rising (LH). Table 1 summarizes the data adapted from Markowska (2020).

            Transl.    Lemma        Red. form
    Nouns   'crab'     kám  (H)     kâm kàm  (HL L)
            'game'     kàm  (L)     kǎm kàm  (LH L)
    Verbs   'fry'      ká   (H)     ká kŤá   (HŤH)
            'peel'     kǎ   (LH)    kǎ kŤá   (LHŤH)

Table 1: Nominal and verbal reduplication in Shupamem

In both nouns and verbs, the first item of the reduplicated phrase is the base, while the reduplicant is the suffix. We follow the analysis in Markowska (2020) and summarize it here. The nominal reduplicant is toneless underlyingly, while the verbal reduplicant has an H tone. Furthermore, the rule of Opposite Tone Insertion explains the tonal alternation in the base of reduplicated nouns, and Default L-Insertion accounts for the L tone on the suffix. Interestingly, more tonal alternations are observed when the tones present in the reduplicated phrase interact with other phrasal/grammatical tones. For the purpose of this paper, we provide only a summary of those tonal alternations in Table 2.

            Red. tones    Output
    Nouns   HL L          HL H
            LH L          LH H
    Verbs   HŤH           HŤH
            LHŤH          HL LH

Table 2: Tonal alternations: interaction of tones for reduplicated forms ("Red. tones") with grammatical H tones

The underlined tones indicate changes triggered by the H grammatical/phrasal tone. In the observance of H tone associated with the right edge of the subject position in Shupamem, the L tone that is present on the surface in the suffix of reduplicated nouns (recall Table 1) is now represented with an H tone. Now it should be clear that the noun reduplicant should, in fact, be toneless in the underlying representation (UR). While the presence of H tone directly preceding the reduplicated verb does not affect H-tone verbs, such as ká 'fry', it causes major tonal alternations in rising reduplicated verbs. Let us look at a particular example representing the final row in Table 2:

    pə́ 'PASTIII' + kǎ ká 'peel.CONTR' → pə́ kâ kǎ

The H tone associated with the tense marker introduces two tonal changes to the reduplicated verb: it causes tonal reversal on the morphological base, and it triggers L-tone insertion to the left edge of the reduplicant.

The data in both Table 1 and 2 show that 1) verbs and nouns in Shupamem reduplicate fully at the segmental level, and 2) tones are affected by phonological rules that function solely at the suprasegmental level. Consequently, a finite-state model of the language must be able to account for

Figure 1: 2-way FST for total reduplication of ká 'fry.IMP' → ká ká 'fry.IMP, as opposed to boiling'
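The machine in Figure 1 is walked through in Section 3 below. As an illustration of how few ingredients such a device needs, the following Python sketch simulates a deterministic 2-way FST from a transition table whose entries carry a direction in {+1, 0, −1}; the state names follow Figure 1, while '<', '>' and '~' are stand-ins for the boundary and copy-separator symbols. This is our own minimal sketch, not the authors' implementation.

    # Illustrative sketch (not the authors' code): a deterministic 2-way FST
    # simulator. Transitions map (state, symbol) -> (next state, output, direction),
    # with direction +1 (advance), 0 (stay), or -1 (move back) on the input tape.
    def run_2way_fst(transitions, start, finals, tape):
        state, pos, out = start, 0, []
        while 0 <= pos < len(tape):
            key = (state, tape[pos])
            if key not in transitions:   # no transition: computation halts
                break
            state, symbol, move = transitions[key]
            out.append(symbol)
            pos += move
        return "".join(out) if state in finals else None

    # Transition table of Figure 1 for total reduplication of 'ká'
    # ('<' and '>' stand in for the boundary markers, '~' for the copy boundary).
    T = {
        ("q0", "<"): ("q1", "", +1),
        ("q1", "k"): ("q1", "k", +1), ("q1", "á"): ("q1", "á", +1),
        ("q1", ">"): ("q2", "", -1),
        ("q2", "k"): ("q2", "", -1), ("q2", "á"): ("q2", "", -1),
        ("q2", "<"): ("q3", "~", +1),
        ("q3", "k"): ("q3", "k", +1), ("q3", "á"): ("q3", "á", +1),
        ("q3", ">"): ("q4", "", +1),
    }

    print(run_2way_fst(T, "q0", {"q4"}, ["<", "k", "á", ">"]))  # -> ká~ká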

those factors. In the next two sections, we will provide a brief formal introduction to 2-way FSTs and MT-FSTs, and explain how they correctly model full reduplication and autosegmental representation, respectively.

In this paper, we use an orthographic representation for Shupamem which uses diacritics to indicate tone. Shupamem does not actually use this orthography; however, we are interested in modeling the entire morpho-phonology of the language, independently of choices made for the orthography. Furthermore, many languages do use diacritics to indicate tone, including the Volta-Niger languages Yoruba and Igbo, as well as Dschang, a Grassfields language closely related to Shupamem. (For a discussion of the orthography of Cameroonian languages, with a special consideration of tone, see (Bird, 2001).) Diacritics are also used to write tone in non-African languages, such as Vietnamese. Therefore, this paper is also relevant to NLP applications for morphological analysis and generation in languages whose orthography marks tones with diacritics: the automata we propose could be used to directly model morpho-phonological computational problems for the orthography of such languages.

3 2-way FSTs

As Roark and Sproat (2007) point out, almost all morpho-phonological processes can be modelled with 1-way FSTs, with the exception of full reduplication, whose output is not a regular language. One way to increase the expressivity of 1-way FSTs is to allow the read head of the machine to move back and forth on the input tape. This is exactly what a 2-way FST does (Rabin and Scott, 1959), and Dolatian and Heinz (2020) explain how these transducers model full reduplication not only effectively, but more faithfully to linguistic generalizations.

Similarly to a 1-way FST, when a 2-way FST reads an input, it writes something on the output tape. If the desired output is a fully reduplicated string, then the FST faithfully 'copies' the input string while scanning it from left to right. In contrast, while scanning the string from right to left, it outputs nothing (λ), and it then copies the string again from left to right.

Figure 1 illustrates a deterministic 2-way FST that reduplicates ká 'fry.IMP'; readers are referred to Dolatian and Heinz (2020) for formal definitions. The key difference between deterministic 1-way FSTs and deterministic 2-way FSTs is the addition of the 'direction' parameters {+1, 0, −1} on the transitions, which tell the FST to advance to the next symbol on the input tape (+1), stay on the same symbol (0), or return to the previous symbol (−1). Deterministic 1-way FSTs can be thought of as deterministic 2-way FSTs where transitions are all (+1).

The input to this machine is ⋊ká⋉. The ⋊ and ⋉ symbols mark the beginning and end of a string, and ∼ indicates the boundary between the first and the second copy. None of those symbols are essential for the model; nevertheless they facilitate the transitions. For example, when the machine reads ⋉, it transitions from state q1 to q2 and reverses the direction of the read head. After outputting the first copy (state q1) and rewinding (state q2), the machine changes to state q3 when it scans the left boundary symbol ⋊ and outputs ∼ to indicate that another copy will be created. In this particular example, not marking the morpheme boundary would not affect the outcome. However, in Section 5, where we propose the fully-fledged model, it will be crucial to somehow separate the first from the second copy.

4 Multitape FSTs

Multiple-tape FSTs are machines which operate in the exact same way as 1-way FSTs, with one key difference: they can read the input and/or write the output on multiple tapes. Such a transducer can operate in an either synchronous or asynchronous manner, such that the input will be read on all tapes
and an output produced simultaneously, or the machine will operate on the input tapes one by one. An MT-FST can take a single ('linear') string as an input and output multiple strings on multiple tapes, or it can do the reverse (Rabin and Scott, 1959; Fischer, 1965).

To illustrate this idea, let us look at the Shupamem noun màpàm 'coat'. It has been argued that Shupamem nouns with only L surface tones will have the L tone present in the UR (Markowska, 2019, 2020). Moreover, in order to avoid violating the Obligatory Contour Principle (OCP) (Leben, 1973), which prohibits two identical consecutive elements (tones) in the UR of a morpheme, we will assume that only one L tone is present in the input. Consequently, the derivation will look as shown in Table 3.

    Input: T-tape           L
    Input: S-tape           mapam
    Output: Single tape     màpàm

Table 3: Representation of MT-FST for màpàm 'coat'

Separating tones from segments in this manner, i.e. by representing tones on the T(one)-tape and segments on the S(egmental)-tape, faithfully resembles the linguistic understanding of the UR of a word. The surface form màpàm has only one L tone present in the UR, which then spreads to all TBUs, which happen to be vowels in Shupamem, if no other tone is present.

An example of a multi-tape machine is presented in Figure 2. For better readability, we introduce generalized symbols for vowels (V) and consonants (C), so that the input alphabet is Σ = {(C, V), (L, H)} ∪ {⋊, ⋉}, and the output alphabet is Γ = {C, V́, V̀}. The machine operates on 2 input tapes and writes the output on a single tape. Therefore, we could think of such a machine as a linearizer. The two input tapes represent the Tonal and Segmental tiers, and so we label them T and S, respectively. We illustrate the functioning of the machine using the example (mapam, L) → màpàm 'coat'. While transitioning from state q0 to q1, the output is an empty string since the left edge marker is being read on both tapes simultaneously. In state q1, when a consonant is being read, the machine outputs the exact same consonant on the output tape. However, when the machine reaches a vowel, it outputs a vowel with a tone that is being read at the same time on the T-tape (in our example, (L, V) → V̀) and transitions to state q2 (if the symbol on the T-tape is H) or q3 (for L, as in our example). In states q2 and q3, consonants are simply output as in state q1, but for vowels, one of three conditions may occur: the read head on the Tonal tape may be H or L, in which case the automaton transitions (if not already there) to q2 (for H) or q3 (for L), and outputs the appropriate orthographic symbol. But if on the Tonal tape the read head is on the right boundary marker ⋉, we are in a case where there are more vowels on the Segmental tape than tones on the Tonal tape. This is when the OCP determines the interpretation: all vowels get the tone of the last symbol on the Tone tier (which we remember as states q2 and q3). In our example, this is an L. Finally, when the Segmental tape also reaches the right boundary marker ⋉, the machine transitions to the final state q4. This ('linearizing') MT-FST consists of 4 states and shows how OCP effects can be handled with asynchronous multi-tape FSTs. Note that when there are more tones on the Tonal tier than vowels on the Segmental tier, they are simply ignored. We refer readers to Dolatian and Rawski (2020b) for formal definitions of these MT transducers.

We are also interested in the inverse process – that is, a finite-state machine that in the example above would take a single input string [màpàm] and produce two output strings [L] and [mapam]. While multitape FSTs are generally conceived as n-ary relations over strings, Dolatian and Rawski (2020b) define their machines deterministically with n input tapes and a single output tape. We generalize their definition below.

Similarly to the spreading processes described above, separating tones from segments gives us a lot of benefits while accounting for the tonal alternations taking place in nominal and verbal reduplication in Shupamem. First of all, functions such as Opposite Tone Insertion (OTI) will apply solely at the tonal level, while segments can be undergoing other operations at the same time (recall that MT-FSTs can operate on some or all tapes simultaneously). Secondly, representing tones separately from segments makes tonal processes local, and therefore all the alternations can be expressed with less powerful functions (Chandlee, 2017).

Now that we have presented the advantages of MT-FSTs, and the need for utilizing 2-way FSTs to model full reduplication, we combine those machines to account for all morphophonological pro-
Figure 2: MT-FST: linearize. C and V are notational meta-symbols for consonants and vowels, resp.; T indicates the tone tape, S, the segmental tape, and O the output tape.
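As a rough illustration of the linearizing behaviour just described, the sketch below walks the segmental string once, attaches the current tone to each vowel, and lets the last tone spread once the tonal tape is exhausted, so that (mapam, L) surfaces as màpàm. It is a simplification under our own assumptions (tones realized as combining accents, acute = H, grave = L), not the authors' transducer.

    # Illustrative sketch of the linearizing MT-FST of Figure 2 (assumption: tones
    # are written with combining accents; acute = H, grave = L). Not the paper's code.
    ACCENT = {"H": "\u0301", "L": "\u0300"}
    VOWELS = set("aeiou")

    def linearize(segments, tones):
        """Merge a segmental tape and a tonal tape into a single surface string."""
        out, t = [], 0
        for ch in segments:
            if ch in VOWELS:
                tone = tones[min(t, len(tones) - 1)]  # OCP: the last tone spreads
                out.append(ch + ACCENT[tone])
                t += 1                                # advance the T-tape head
            else:
                out.append(ch)                        # consonants pass through
        return "".join(out)

    print(linearize("mapam", "L"))  # -> màpàm (both vowels receive the single L)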

cesses described in Section 2. erally use the index i to range from 1 to n and the
index j to range from 1 to m.
5 Deterministic 2-Way Multi-tape FST We define Deterministic 2-Way n, m Multitape
FST (2-way (n, m) MT FST for short) for n, m ∈
Before we define Deterministic 2-Way Multi-tape
N by synthesizing the definitions of Dolatian and
FST (or 2-way MT FST for short) we introduce
Heinz (2020) and Dolatian and Rawski (2020b);
some notation. An alphabet Σ is a finite set of
n, m refer to the number of input tapes and output
symbols and Σ∗ denotes the set of all strings of
tapes, respectively. A Deterministic 2-Way n, m
finite length whose elements belong to Σ. We use #» #»
Multitape FST is a six-tuple (Q, Σ, Γ , q0 , F, δ),
λ to denote the empty string. For each n ∈ N, an
where
n-string is a tuple hw1 , . . . wn i where each wi is
a string belonging to Σ∗i (1 ≤ i ≤ n). These n #»
• Σ = hΣ1 . . . Σn i is a tuple of n input alpha-
alphabets may contain distinct symbols or not. We bets that include the boundary symbols, i.e.,
write w#» to indicate a n-string and #» λ to indicate the
#» {o, n} ⊂ Σi , 1 ≤ i ≤ n,
n-string where each wi = λ. We also write Σ to

denote a tuple of n alphabets: Σ = Σ1 × · · · Σn . #»
• Γ is a tuple of m output alphabets Γj (1 ≤

Elements of Σ are denoted #» σ. j ≤ m),
#»∗
If w and v belong to Σ then the pointwise con-
#» #»
#» #» #»
catenation of w #» and #» v is denoted w v and equals
#» #» • δ : Q × Σ → Q × Γ ∗ × D is the transition
hw1 , . . . wn ihv1 , . . . vn i = hw1 v1 , . . . wn vn i. We function. D is an alphabet of directions equal
#» #»
are interested in functions that map n-strings to to {−1, 0, +1} and D is an n-tuple. Γ ∗ is a
m-strings with n, m ∈ N. In what follows we gen- m-tuple of strings written to each output tape.

Figure 3: Synthesis of 2-way FST and MT-FST

#» #»0 , r, #»
We understand δ(q, #» v , d ) as follows. It
σ ) = (r, #» We write ( w, #» q, #»
x , #»
u ) → (w x 0 , #» v ). Ob-
u #»
means that if the transducer is in state q and the serve that since δ is a function, there is at most one
n read heads on the input tapes are on symbols next configuration (i.e., the system is deterministic).
hσ1 , . . . σn i = #» σ , then several actions ensue. The Note there are some circumstances where there is
transducer changes to state r and pointwise concate- no next configuration. For instance if di = +1 and
nates #» v to the m output tapes. The n read heads xi = λ then there is no place for the read head to
#» #»
then move according to the instructions d ∈ D. advance. In such cases, the computation halts.
For each read head on input tape i, it moves back The transitive closure of → is denoted with →+ .
one symbol iff di = −1, stays where it is iff di = 0, Thus, if c →+ c0 then there exists a finite sequence
and advances one symbol iff di = +1. (If the read of configurations c1 , c2 . . . cn with n > 1 such that
head on an input tape “falls off” the beginning or c = c1 → c2 → . . . → cn = c0 .
end of the string, the computation halts.) At last we define the function that a 2-way (n, m)
The function recognized by a 2-way (n, m) MT MT FST T computes. The input strings are aug-
FST is defined as follows. A configuration of a mented with word boundaries on each tape. Let
#» #» #»
n, m-MT-FST T is a 4-tuple h Σ ∗ , q, Σ ∗ , Γ ∗ i. The # »
own = how1 n, . . . o wn ni. For each n-string
meaning of the configuration ( w, x , u) is that
#» q, #» #»∗ #»
w#» ∈ Σ , fT ( w)
#» = #» u ∈ Γ ∗ provided there
#» # » #»
the input to T is w x and the machine is currently
#» #»
exists qf ∈ F such that ( λ , q0 , own, λ ) →+
in state q with the n read heads on the first symbol # » #»
(own, qf , λ , #» u ).
of each xi (or has fallen off the right edge of the If fT ( w)
#» = #» u then #» u is unique because the
i-th input tape if xi = λ) and that #» u is currently sequence of configurations is determined determin-
written on the m output tapes. istically. If the computation of a 2-way MT-FST T
If the current configuration is ( w, #» q, #» u ) and
x , #» halts on some input w #» (perhaps because a subse-

δ(q, #»σ ) = (r, #» v , d ) then the next configuration is quent configuration does not exist), then we say T
#»0 , r, #»
(w x 0 , #» v ), where for each i, 1 ≤ i ≤ n:
u #» is undefined on w. #»
#»0 = hw0 . . . w0 i and #»
• w n x 0 = hx01 . . . x0n i (1 ≤ The 2-way FSTs studied by Dolatian and Heinz
1
i ≤ n); (2020) are 2-way 1,1 MT FST. The n-MT-FSTs
studied by Dolatian and Rawski (2020b) are 2-way
• wi0 = wi and x0i = xi iff di = 0; n,1 MT FST where none of the transitions con-
tain the −1 direction. In this way, the definition
• wi0 = wi σ and x0i = x00i iff di = +1 and there
presented here properly subsumes both.
exists σ ∈ Σi , x00i ∈ Σ∗i such that xi = σx00i ;
Figure 4 shows an example of a 1,2 MT FST that
• wi0 = wi00 and x0i = σxi iff di = −1 and there “splits” a phonetic (or orthographic) transcription of
exists σ ∈ Σi , wi00 ∈ Σ∗i such that wi = σwi00 . a Shupamem word into a linguistic representation

with a tonal and segmental tier by outputting two As was discussed in Section 2, the tone on the
output strings, one for each tier. second copy is dependent on whether there was an
H tone preceding the reduplicated phrase. If there
I:(C,+1) I:(á,+1) I:(à,+1) was one, the tone on the reduplicant will be H.
T:λ T:H T:L Otherwise, the L-Default Insertion rule will insert
S:C S:a S:a L tone onto the toneless TBU of the second copy.
Because those tonal changes are not part of the
I:(â,+1) I:(ǎ,+1)
reduplicative process per se, we do not represent
T:HL T:LH
them either in our model in Figure 3, or in the
start q0 S:a
q1 oS:a q2 derivation in Table 4. Those alternations could be
I:(o,+1) I:(n,+1) accounted for with 1-way FST by simply inserting
T:λ T:λ H or L tone to the output of the composed machine
S:λ S:λ represented in Figure 3.
Modelling verbal reduplication and the tonal pro-
Figure 4: MT-FST: split cesses revolving around it (see Table 2) works in
ndáp → (ndap, H) ‘house’, C and V are notational the exact same way as described above for nominal
meta-symbols for consonants and vowels, resp.; T reduplication. The only difference are the functions
indicates the output tone tape, S – the segmental applied to the T-tape.
output tape, and I – the input.
7 An Alternative to 2-Way Automata
6 Proposed model 2-way n,m MT FST generalize regular functions
(Filiot and Reynier, 2016) to functions from n-
As presented in Figure 3, our proposed model, i.e.
strings to m-strings. It is worth asking however,
2-way 2,2 MT FST, consists of 1-way and 2-way
what each of these mechanisms brings, especially
deterministic transducers, which together operate
in light of fundamental operations such as func-
on two tapes. Both input and output are represented
tional composition.
on two separate tapes: Tonal and Segmental Tape.
Such representation is desired as it correctly mim-
ics linguistic representations of tonal languages,
where segments and tones act independently from
each other. On the T-tape, a 1-way FST takes the
H tone associated with the lexical representation of
ndáp ‘house’ and outputs HL∼ by implementing
the Opposite Tone Insertion function. On the S-
tape, a 2-way FST takes ndap as an input, and out-
puts a faithful copy of that string (ndap ndap). The
∼ symbol significantly indicates the morpheme
boundary and facilitates further output lineariza-
tion. A detailed derivation of ndáp 7→ ndâp ndap Figure 5: An alternative model for Shupamem redupli-
is shown in Table 4. cation
Figure 3 also represents two additional ‘trans-
formations’: splitting and linearizing. First, the For instance, it is clear that 2-way 1, 1 MT FSTs
phonetic transcription of a string (ndáp) is split can handle full reduplication in contrast to 1-way
into tones and segments with a 1,2 MT FST. The 1, 1 MT FSTs which cannot. However, full redupli-
output (H, ndap) serves as an input to the 2-way 2,2 cation can also be obtained via the composition of
MT FST. After the two processes discussed above a 1-way 1,2 MT FST with a 1-way 2, 1 MT FST.
apply, together acting on both tapes, the output is To illustrate, the former takes a single string as an
then linearized with an 2,1 MT FST. The composi- input, e.g. ndap, and ‘splits’ it into two identical
tion of those three machines, i.e. 1,2 MT, 2-way 2,2 copies represented on two separate output tapes.
MT FST, and 2,1 MT FSTs is particularly useful Then the 2-string output by this machines becomes
in applications where a phonetic or orthographic the input to the next 1-way 2,1 MT FST. Since
representations needs to be processed. this machine is asynchronous, it can linearize the

State Segment-tape Tone-tape S-output T-output
q0 ondapn oHn λ λ
q1 ondapn S:o:+1 oHn T:o:+1 n HL
q1 ondapn S:n:+1 oHn T:H:+1 nd HL∼
q1 ondapn S:d:+1 oHn T:n:0 nda HL∼
q1 ondapn S:a:+1 oHn T:n:0 ndap HL∼
q1 ondapn S:p:+1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:n:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:p:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:a:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:d:-1 oHn T:n:0 ndap∼ HL∼
q2 ondapn S:n:-1 oHn T:n:0 ndap∼ HL∼
q3 ondapn S:o:+1 oHn T:n:0 ndap∼n HL∼
q3 ondapn S:n:+1 oHn T:n:0 ndap∼nd HL∼
q3 ondapn S:d:+1 oHn T:n:0 ndap∼nda HL∼
q3 ondapn S:a:+1 oHn T:n:0 ndap∼ndap HL∼
q3 ondapn S:p:+1 oHn T:n:0 ndap∼ndap HL∼

Table 4: Derivation of ndáp 7→ ndâp ndap ‘houses’


This derivation happens in the two automata in the center of Figure 3. The FST for the Segmental tier is
the one shown in Figure 1, and the states in the table above refer to this FST.

2-string (e.g. (ndap, ndap)) to a 1-string (ndap other languages could also be accounted for.
ndap) by reading along one of these input tapes All three languages undergo full reduplication
(and writing it) and reading the other one (and writ- at the segmental level. What differs is the tonal
ing it) only when the read head on the first input pattern governing this process. In Adhola (Ka-
tape reaches the end. Consequently, an alternative plan, 2006), the second copy is always represented
way to model full reduplication is to write the ex- with a fixed tonal pattern H.HL, where ‘.’ indi-
act same output on two separate tapes, and then cates syllable boundary, irregardless of the lexi-
linearize it. Therefore, instead of implementing cal tone on the non-reduplicated form. In the fol-
Shupamem tonal reduplication with 2-way 2,2 MT lowing examples, unaccented vowels indicate low
FST, we could use the composition of two 1-way tone. For instance, tiju ‘work’ 7→ tija tíjâ ‘work
MT-FST: 1,3 MT-FST and 3,1 MT-FST as shown too much’, tSemó ‘eat’ 7→ tSemá tSŤémâ ‘eat too
in Figure 5. (We need three tapes, two Segmental much’. In Kikerewe (Odden, 1996), if the first
tapes to allow reduplication as just explained, and (two) syllable(s) of the verb are marked with an
one Tonal tape as discussed before.) H tone, the H tone would also be present in the
This example shows that additional tapes can first two syllables of the reduplicated phrase. On
be understood as serving a similar role to regis- the other hand, if the last two syllables of the non-
ters in register automata (Alur and Černý, 2011; reduplicated verb are marked with an H tone, the
Alur et al., 2014). Alur and his colleagues have H tone will be present on the last two syllables of
shown that deterministic 1-way transducers with the reduplicated phrase. For instance, bíba ‘plant’
registers are equivalent in expressivity to 2-way 7→ bíba biba ‘plant carelessly, here and there’, bib-
deterministic transducers (without registers). ílé ‘planted (yesterday)’ 7→ bibile bibílé ‘planted
(yesterday) carelessly, here and there’. Finally, in
8 Beyond Shupamem KiHehe (Odden and Odden, 1985), if an H tone ap-
The proposed model is not limited to modeling pears in the first syllable of the verb, the H tone will
full reduplication in Shupamem. It can be used for also be present in the first syllable of the second
other tonal languages exhibiting this morphological copy, for example dóongoleesa ‘roll’ 7→ dongolesa
process. We provide examples of the applicability dóongoleesa ‘roll a bit’.
of this model for the three following languages: The above discussed examples can be modelled
Adhola, Kikerewe, and Shona. And we predict that in a similar to Shupamem way, such that, first,

the input will be output on two tapes: Tonal and clude them in the RedTyp database (Dolatian and
Segmental, then some (morpho-)phonological pro- Heinz, 2019) so a broader empirical typology can
cesses will apply on both level. The final step is be studied with respect to the formal properties of
the ‘linearization’, which will be independent of these machines.
the case. For example, in Kikerewe, if the first tone
that is read on the Tonal tape is H, and a vowel
is read on the Segmental tape, the output will be
a vowel with an acute accent. If the second tone References
is L, as in bíba, this L tone will be ‘attached’ to Rajeev Alur, Adam Freilich, and Mukund
every remaining vowel in the reduplicated phrase. Raghothaman. 2014. Regular combinators for
While Kikerewe provides an example where there string transformations. In Proceedings of the
Joint Meeting of the Twenty-Third EACSL Annual
are more TBUs than tones, Adhola presents the
Conference on Computer Science Logic (CSL) and
reverse situation, where there are more tones than the Twenty-Ninth Annual ACM/IEEE Symposium on
TBU (contour tones). Consequently, it is crucial to Logic in Computer Science (LICS), CSL-LICS ’14,
mark syllable boundaries, such that only when ‘.’ pages 9:1–9:10, New York, NY, USA. ACM.
or the right edge marker (o) is read, the FST will
Rajeev Alur and Pavol Černý. 2011. Streaming trans-
output the ‘linearized’ element. ducers for algorithmic verification of single-pass list-
processing programs. In Proceedings of the 38th An-
9 Conclusion nual ACM SIGPLAN-SIGACT Symposium on Princi-
ples of Programming Languages, POPL ’11, page
In this paper we proposed a deterministic finite- 599–610, New York, NY, USA. Association for
Computing Machinery.
state model of total reduplication in Shupamem.
As it is typical for Bantu languages, Shupamem is Steven Bird. 2001. Orthography and identity in
a tonal language in which phonological processes Cameroon. Written Language & Literacy, 4(2):131–
operating on a segmental level differ from those on 162.
suprasegmental (tonal) level. Consequently, Shu- Jane Chandlee. 2017. Computational locality in mor-
pamem introduces two challenges for 1-way FSTs: phological maps. Morphology, pages 1–43.
language copying and autosegmental representa-
tion. We addressed those challenges by proposing Jane Chandlee and Jeffrey Heinz. 2012. Bounded copy-
ing is subsequential: Implications for metathesis and
a synthesis of a deterministic 2-way FST, which reduplication. In Proceedings of the Twelfth Meet-
correctly models total reduplication, and a MT FST, ing of the Special Interest Group on Computational
which enables autosegmental representation. Such Morphology and Phonology, pages 42–51, Montréal,
a machine operates on two tapes (Tonal and Seg- Canada. Association for Computational Linguistics.
mental), which faithfully replicate the linguistic
Hossep Dolatian and Jeffrey Heinz. 2019. Redtyp: A
analysis of Shupamem reduplication discussed in database of reduplication with computational mod-
Markowska (2020). Finally, the outputs of the 2- els. In Proceedings of the Society for Computation
way 2,2 MT FST is linearized with a separate 2,1 in Linguistics, volume 2. Article 3.
MT FST outputting the desired surface representa-
Hossep Dolatian and Jeffrey Heinz. 2020. Comput-
tion of a reduplicated word. The proposed model ing and classifying reduplication with 2-way finite-
is based on previously studied finite-state models state transducers. Journal of Language Modelling,
for reduplication (Dolatian and Heinz, 2020) and 8(1):179–250.
tonal processes (Dolatian and Rawski, 2020b,a).
Hossep Dolatian and Jonathan Rawski. 2020a. Com-
There are some areas of future research that we putational locality in nonlinear morphophonology.
plan to pursue. First, we have suggested that we Ms., Stony Brook University.
can handle reduplication using the composition
Hossep Dolatian and Jonathan Rawski. 2020b. Multi
of 1-way deterministic MT FSTs, dispensing with input strictly local functions for templatic morphol-
the need for 2-way automata altogether. Further ogy. In In Proceedings of the Society for Computa-
formal comparison of these two approaches is war- tion in Linguistics, volume 3.
ranted. More generally, we plan to investigate the
David M. Eberhard, Gary F. Simmons, and Charles D.
closure properties of classes of 2-way MT FSTs. A Fenning. 2021. Enthologue: Languages of the
third line of research is to collect more examples World. 24th edition. Dallas, Texas: SIL Interna-
of full reduplication in tonal languages and to in- tional.

Emmanuel Filiot and Pierre-Alain Reynier. 2016. Jonathan Rawski and Hossep Dolatian. 2020. Multi-
Transducers, logic and algebra for functions of finite input strict local functions for tonal phonology. Pro-
words. ACM SIGLOG News, 3(3):4–19. ceedings of the Society for Computation in Linguis-
tics, 3(1):245–260.
Patric C. Fischer. 1965. Multi-tape and infinite-state
automata-a survey. Communications of the ACM, Brian Roark and Richard Sproat. 2007. Computational
pages 799–805. Approaches to Morphology and Syntax. Oxford Uni-
versity Press, Oxford.
John Goldsmith. 1976. Autosegmental Phonology.
Ph.D. thesis, Massachusetts Institute of Technology. Carl Rubino. 2013. Reduplication. Max Planck Insti-
tute for Evolutionary Anthropology, Leipzig.
Nizar Habash and Owen Rambow. 2006. Magead: A
morphological analyzer for Arabic and its dialects. Bruce Wiebe. 1992. Modelling autosegmental phonol-
In Proceedings of the 21st International Confer- ogy with multi-tape finite state transducers.
ence on Computational Linguistics and 44th Annual
Meeting of the Association for Computational Lin- Edwin S. Williams. 1976. Underlying tone in Margi
guistics (Coling-ACL’06), Sydney, Australia. and Igbo. Linguistic Inquiry, 7:463–484.

Jeffrey Heinz. 2018. The computational nature of


phonological generalizations. In Larry Hyman and
Frans Plank, editors, Phonological Typology, Pho-
netics and Phonology, chapter 5, pages 126–195. De
Gruyter Mouton.

Larry Hyman. 2014. How autosegmental is phonol-


ogy? The Linguistic Review, 31(2):363–400.

Adam Jardine. 2016. Computationally, tone is differ-


ent. Phonology, 33:247–283.

Aaron F. Kaplan. 2006. Tonal and morphological iden-


tity in reduplication. In Proceedings of the An-
nual Meeting of the Berkeley Linguistics Society, vol-
ume 31.

George Anton Kiraz. 2000. Multi-tiered nonlinear mor-


phology using multi-tape finite automata: A case
study on Syriac and Arabic. Computational Linguis-
tics, 26(1):77–105.

William R. Leben. 1973. Suprasegmental Phonology.


Ph.D. thesis, Massachusetts Institute of Technology.

Magdalena Markowska. 2019. Tones in Shuapmem


possessives. Ms., Graduate Center, City University
of New York.

Magdalena Markowska. 2020. Tones in Shupamem


reduplication. CUNY Academic Works.

John McCarthy. 1981. A prosodic theory of noncon-


catenative morphology. Linguistic Inquiry, 12:373–
418.

Abdoulaye L. Nchare. 2012. The Grammar of Shu-


pamem. Ph.D. thesis, New York University.

David Odden. 1996. Patterns of reduplication in kik-


erewe. OSU WPL, 48:111–148.

David Odden and Mary Odden. 1985. Ordered redupli-


cation in KiHehe. Linguistic Inquiry, 16:497–503.

Michael O Rabin and Dana Scott. 1959. Finite au-


tomata and their decision problems. IBM journal
of research and development, 3:114–125.

Improved pronunciation prediction accuracy using morphology

Dravyansh Sharma, Saumya Yashmohini Sahai, Neha Chaudhari† , Antoine Bruguier



Google LLC?
[email protected], [email protected],
[email protected], [email protected]

Abstract in phonology, e.g. in cops, cogs and courses. The


different forms can be thought to be derived from a
Pronunciation lexicons and prediction models common plural morphophoneme which undergoes
are a key component in several speech synthe-
context dependent transformations to produce the
sis and recognition systems. We know that
morphologically related words typically fol-
correct phones.
low a fixed pattern of pronunciation which can A pronunciation model, also known as a
be described by language-specific paradigms. grapheme to phoneme (G2P) converter, is a sys-
In this work we explore how deep recurrent tem that produces a phonemic representation of a
neural networks can be used to automatically word from its written form. The word is converted
learn and exploit this pattern to improve the from the sequence of letters in the orthographic
pronunciation prediction quality of words re- script to a sequence of phonemes (sound symbols)
lated by morphological inflection. We propose
in a pre-determined transcription, such as IPA or
two novel approaches for supplying morpho-
logical information, using the word’s morpho- X-SAMPA. It is expensive and possibly, say in
logical class and its lemma, which are typi- morphologically rich languages with productive
cally annotated in standard lexicons. We re- compounding, infeasible to list the pronunciations
port improvements across a number of Euro- for all the words. So one uses rules or learned mod-
pean languages with varying degrees of phono- els for this task. Pronunciation models are impor-
logical and morphological complexity, and tant components of both speech recognition (ASR)
two language families, with greater improve-
and synthesis (text-to-speech, TTS) systems. Even
ments for languages where the pronunciation
prediction task is inherently more challenging.
though end-to-end models have been gathering re-
We also observe that combining bidirectional cent attention (Graves and Jaitly, 2014; Sotelo et al.,
LSTM networks with attention mechanisms is 2017), often state-of-the-art models in industrial
an effective neural approach for the computa- production systems involve conversion to and from
tional problem considered, across languages. an intermediate phoneme layer.
Our approach seems particularly beneficial in A single system of morphophonological rules
the low resource setting, both by itself and in which connects morphology with phonology is
conjunction with transfer learning.
well-known (Chomsky and Halle, 1968). In fact
computational models for morphology such as the
two-level morphology of Koskenniemi (1983); Ka-
1 Introduction plan and Kay (1994) have the bulk of the machinery
designed to handle phonological rules. However,
Morphophonology is the study of interaction be- the approach involves encoding language-specific
tween morphological and phonological processes rules as a finite-state transducer, a tedious and ex-
and mostly involves description of sound changes pensive process requiring linguistic expertise. Lin-
that take place in morphemes (minimal meaningful guistic rules are augmented computationally for
units) when they combine to form words. For ex- small corpora in Ermolaeva (2018), although scala-
ample, the plural morpheme in English appears as bility and applicability of the approach across lan-
‘-s’ or ‘-es’ in orthography and as [s], [z], and [Iz] guages is not tested.
?
Part of the work was done when D.S., N.C. and A.B. We focus on using deep neural models to im-
were at Google. prove the quality of pronunciation prediction using

morphology. G2P fits nicely in the well-studied se- and Taylor and Richmond (2020) show the reverse.
quence to sequence learning paradigms (Sutskever The present work aligns with the latter, but instead
et al., 2014), here we use extensions that can handle of requiring full morphological segmentation of
supplementary inputs in order to inject the morpho- words we work with weaker and more easily anno-
logical information. Our techniques are similar to tated morphological information like word lemmas
Sharma et al. (2019), although the goal there is to and morphological categories.
lemmatize or inflect more accurately using pronun-
ciations. Taylor and Richmond (2020) consider 3 Improved pronunciation prediction
improving neural G2P quality using morphology,
We consider the G2P problem, i.e. prediction of
our work differs in two respects. First, we use
the sequence of phonemes (pronunciation) from
morphology class and lemma entries instead of
the sequence of graphemes in a single word. The
morpheme boundaries for which annotations may
G2P problem forms a clean, simple application of
not be as readily available. Secondly, they con-
seq2seq learning, which can also be used to cre-
sider BiLSTMs and Transformer models, but we
ate models that achieve state-of-the-art accuracies
additionally consider architectures which combine
in pronunciation prediction. Morphology can aid
BiLSTMs with attention and outperform both. We
this prediction in several ways. One, we could
also show significant gains by morphology injec-
use morphological category as a non-sequential
tion in the context of transfer learning for low re-
side input. Two, we could use the knowledge of
source languages where sufficient annotations are
the morphemes of the words and their pronuncia-
unavailable.
tions which may be possible with lower amounts
of annotation. For example, the lemma (and its
2 Background and related work
pronunciation) may already be annotated for an
Pronunciation prediction is often studied in settings out-of-vocabulary word. Often standard lexicons
of speech recognition and synthesis. Some recent list the lemmata of derived/inflected words, lemma-
work explores new representations (Livescu et al., tizer models can be used as a fallback. Learning
2016; Sofroniev and Çöltekin, 2018; Jacobs and from the exact morphological segmentation (Tay-
Mailhot, 2019), but in this work, a pronunciation lor and Richmond, 2020) would need more precise
is a sequence of phonemes, syllable boundaries models and annotation (Demberg et al., 2007).
and stress symbols (van Esch et al., 2016). A lot of Given the spelling, language specific models
work has been devoted to the G2P problem (e.g. see can predict the pronunciation by using knowledge
Nicolai et al. (2020)), ranging from those focused of typical grapheme to phoneme mappings in the
on accuracy and model size to those discussing ap- language. Some errors of these models may be
proaches for data-efficient scaling to low resource fixed with help from morphological information as
languages or multilingual modeling (Rao et al., argued above. For instance, homograph pronun-
2015; Sharma, 2018; Gorman et al., 2020). ciations can be predicted using morphology but
Morphology prediction is of independent interest it is impossible to deduce correctly using just or-
and has applications in natural language generation thography.1 The pronunciation of ‘read’ (/ôi:d/ for
as well as understanding. The problems of lemma- present tense and noun, /ôEd/ for past and partici-
tization and morphological inflection have been ple) can be determined by the part of speech and
studied in both contextual (in a sentence, which tense; the stress shifts from first to second syllable
involves morphosyntactics) and isolated settings between ‘project’ noun and verb.
(Cohen and Smith, 2007; Faruqui et al., 2015; Cot-
3.1 Dataset
terell et al., 2016; Sharma et al., 2019).
Morphophonological prediction, by which we We train and evaluate our models for five lan-
mean viewing morphology and pronunciation pre- guages to cover some morphophonological diver-
diction as a single task with several related inputs sity: (American) English, French, Russian, Span-
and outputs, has received relatively less attention as ish and Hungarian. For training our models, we
a language-independent computational task, even use pronunciation lexicons (word-pronunciation
though the significance for G2P has been argued pairs) and morphological lexicons (containing lex-
(Coker et al., 1991). Sharma et al. (2019) show 1
Homographs are words which are spelt identically but
improved morphology prediction using phonology, have different meanings and pronunciations.

283
ical form, i.e. lemma and morphology class) of 3.2.1 Bidirectional LSTM networks
only inflected words of size of the order of 104 LSTM (Hochreiter and Schmidhuber, 1997) allows
for each language (see Table 5 in Appendix A). learning of fixed length sequences, which is not a
For the languages discussed, these lexicons are ob- major problem for pronunciation prediction since
tained by scraping2 Wiktionary data and filtering grapheme and phoneme sequences (represented as
for words that have annotations (including pronun- one-hot vectors) are often of comparable length,
ciations available in the IPA format) for both the and in fact state-of-the-art accuracies can be ob-
surface form and the lexical form. While this or- tained using bidirectional LSTM (Rao et al., 2015).
der of data is often available for high-resource lan- We use single layer BiLSTM encoder - decoder
guages, in Section 3.3 we discuss extension of our with 256 units and 0.2 dropout to build a charac-
work to low-resource settings using Finnish and ter level RNN. Each character is represented by a
Portuguese for illustration where the Wiktionary trainable embedding of dimension 30.
data is about an order of magnitude smaller.
3.2.2 LSTM based encoder-decoder networks
Word (language) Morph. Class Pron. LS LP with attention (BiLSTM+Attn)
masseuses (fr) n-f-pl /ma.søz/ masseur /ma.sœK/
fagylaltozom (hu) v-fp-s-in-pr-id /"f6Íl6ltozom/ fagylaltozik /"f6Íl6ltozik/ Attention-based models (Vaswani et al., 2017;
Chan et al., 2016; Luong et al., 2015; Xu et al.,
Table 1: Example annotated entries. (v-fp-s-in-pr-id: 2015) are capable of taking a weighted sample of
Verb, first-person singular indicative present indefinite)
input, allowing the network to focus on different
possibly distant relevant segments of the input ef-
We keep 20% of the pronunciation lexicons fectively to predict the output. We use the model
aside for evaluation using word error rate (WER) defined in Section 3.2.1 with Luong attention (Lu-
metric. WER measures an output as correct if the ong et al., 2015).
entire output pronunciation sequence matches the
ground truth annotation for the test example. 3.2.3 Transformer networks
Transformer (Vaswani et al., 2017) uses self-
3.1.1 Morphological category attention in both encoder and decoder to learn
The morphological category of the word is ap- rich text representaions. We use a similar architec-
pended as an ordinal encoding to the spelling, sepa- ture but with fewer parameters, by using 3 layers,
rated by a special character. That is, the categories 256 hidden units, 4 attention heads and 1024 di-
of a given language are appended as unique inte- mensional feed forward layers with relu activation.
gers, as opposed to one-hot vectors which may be Both the attention and feedforward dropout is 0.1.
too large in morphologically rich languages. The input character embedding dimension is 30.

3.3 Transfer learning for low resource G2P


3.1.2 Lemma spelling and pronounciation
Both non-neural and neural approaches have been
Information about the lemma is given to the mod- studied for transfer learning (Weiss et al., 2016)
els by appending both, the lemma pronouncia- from a high-resource language for low resource
tion hLPi and lemma spelling hLSi to the word language G2P setting using a variety of strategies
spelling hWSi, all separated by special characters, including semi-automated bootstrapping, using
like, hWSi§hLPi¶hLSi. Lemma spelling can po- acoustic data, designing representations suitable
tentially help in irregular cases, for example ‘be’ for neural learning, active learning, data augmen-
has past forms ‘gone’ and ‘were’, so the model tation and multilingual modeling (Maskey et al.,
can reject the lemma pronunciation in this case by 2004; Davel and Martirosian, 2009; Jyothi and
noting that the lemma spellings are different (but Hasegawa-Johnson, 2017; Sharma, 2018; Ryan and
potentially still use it for ‘been’). Hulden, 2020; Peters et al., 2017; Gorman et al.,
2020). Recently, transformer-based architectures
3.2 Model details have also been used for this task (Engelhart et al.,
The models described below are implemented in 2021). Here we apply a similar approach of us-
OpenNMT (Klein et al., 2017). ing representations learned from the high-resource
languages as an additional input for low-resource
2
kaikki.org/dictionary/ models but for our BiLSTM+Attn architecture. We

284
Model Inputs en fr ru es hu
BiLSTM (b/+c/+l) (39.7/39.4/37.1) (8.69/8.94/7.94) (5.26/4.87/5.60) (1.13/1.44/1.30) (6.96/5.85/7.21)
BiLSTM+Attn (b/+c/+l) (36.9/36.1/31.0) (4.45/4.20/4.12) (5.06/3.80/4.04) (0.32/0.32/0.29) (1.78/1.31/1.12)
Transformer (b/+c/+l) (40.2/39.3/37.7) (8.19/7.11/10.6) (6.57/6.38/5.36) (2.29/1.62/2.20) (8.20/4.93/8.11)

Table 2: Models and their Word Error Rates (WERs). ‘b’ corresponds to baseline (vanilla G2P), ‘+c’ refers to
morphology class injection (Sec. 3.1.1) and ‘+l’ to addition of lemma spelling and pronunciation (Sec. 3.1.2).

evaluate our model for two language pairs — hu We also look at how adding lexical form infor-
(high) - fi (low) and es (high) and pt (low) (results mation, i.e. morphological class and lemma, helps
in Table 3). We perform morphology injection us- with pronunciation prediction. We notice that the
ing lemma spelling and pronunciation (Sec. 3.1.2) improvements are particularly prominent when the
since it can be easier to annotate and potentially G2P task itself is more complex, for example in
more effective (per Table 2). fi and pt are not really English. In particular, ambiguous or exceptional
low-resource, but have relatively fewer Wiktionary grapheme subsequence (e.g. ough in English)
annotations for the lexical forms (Table 5). to phoneme subsequence mappings, may be re-
solved with help from lemma pronunciations. Also
Model fi fi+hu pt pt+es
morphological category seems to help for example
BiLSTM+Attn (base) 18.53 9.81 62.65 58.87
BiLSTM+Attn (+lem) 9.27 8.45 59.63 55.48 in Russian where it can contain a lot of informa-
tion due to the inherent morphological complexity
Table 3: Transfer learning for vanilla G2P (base) and (about 25% relative error reduction). See Appendix
morphology augmented G2P (+lem, Sec. 3.1.2). B for more detailed comparison and error analysis
for the models.
4 Discussion Our transfer learning experiments indicate that
We discuss our results under two themes — the morphology injection gives even more gains in low
efficacy of the different neural models we have resource setting. In fact for both the languages
implemented, and the effect of the different ways considered, adding morphology gives almost as
of injecting morphology that were considered. much gain as adding a high resource language to
We consider three neural models as described the BiLSTM+Attn model. This could be useful for
above. To compare the neural models, we first low resource languages like Georgian where a high
note the approximate number of parameters of each resource language from the same language family
model that we trained: is unavailable. Even with the high resource aug-
mentation, using morphology can give a significant
• BiLSTM: ∼1.7M parameters,
further boost to the prediction accuracy.
• BiLSTM+Attn: ∼3.5M parameters,
• Transformer: ∼5.2M parameters. 5 Conclusion
For BiLSTM and BiLSTM+Attn, the parameter We note that combining BiLSTM with attention
size is based on neural architecture search i.e. we seems to be the most attractive alternative in get-
estimated sizes at which accuracies (nearly) peaked. ting improvements in pronunciation prediction by
For transformer, we believe even larger models can leveraging morphology, and hence correspond to
be more effective and the current size was chosen the most appropriate ‘model bias’ for the prob-
due to computational restrictions and for “fairer” lem from among the alternatives considered. We
comparison of model effectiveness. Under this set- also note that all the neural network paradigms
ting, BiLSTM+Attn models seem to clearly outper- discussed are capable of improving the G2P predic-
form both the other models, even without morphol- tion quality when augmented with morphological
ogy injection (cf. Gorman et al. (2020), albeit it is information. Since our approach can potentially
in the multilingual modeling context). Transformer support partial/incomplete data (using appropriate
can beat BiLSTM in some cases even with the sub- hMISSINGi or hN/Ai tokens), one can use a sin-
optimal model size restriction, but is consistently gle model which injects morphology class and/or
worse when the sequence lengths are larger which lemma pronunciation as available. For languages
is the case when we inject lemma spellings and where neither is available, our results suggest build-
pronunciations. ing word-lemma lists or utilizing effective lemma-

285
tizers (Faruqui et al., 2015; Cotterell et al., 2016). recognition. In Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2016 IEEE International Confer-
6 Future work ence on, pages 4960–4964. IEEE.

Our work only leverages the inflectional morphol- Noam Chomsky and Morris Halle. 1968. The sound
pattern of English.
ogy paradigms for better pronunciation prediction.
However in addition to inflection, morphology also Shay B Cohen and Noah A Smith. 2007. Joint mor-
results in word formation via derivation and com- phological and syntactic disambiguation. In Pro-
ceedings of the 2007 Joint Conference on Empirical
pounding. Unlike inflection, derivation and com- Methods in Natural Language Processing and Com-
pounding could involve multiple root words, so putational Natural Language Learning (EMNLP-
an extension would need a generalization of the CoNLL).
above approach along with appropriate data. An
Cecil H Coker, Kenneth W Church, and Maik Y Liber-
alternative would be to learn these in an unsuper- man. 1991. Morphology and rhyming: Two pow-
vised way using a dictionary augmented neural net- erful alternatives to letter-to-sound rules for speech
work which can efficiently refer to pronunciations synthesis. In The ESCA Workshop on Speech Syn-
in a dictionary and use them to predict pronunci- thesis.
ations of polymorphemic words using pronuncia- Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
tions of the base words (Bruguier et al., 2018). It David Yarowsky, Jason Eisner, and Mans Hulden.
would be interesting to see if using a combination 2016. The SIGMORPHON 2016 shared
task—morphological reinflection. In Proceed-
of morphological side information and dictionary-
ings of the 14th SIGMORPHON Workshop on
augmentation results in a further accuracy boost. Computational Research in Phonetics, Phonology,
Developing non-neural approaches for the mor- and Morphology, pages 10–22.
phology injection could be interesting, although
Marelie Davel and Olga Martirosian. 2009. Pronuncia-
as noted before, the neural approaches are the state- tion dictionary development in resource-scarce envi-
of-the-art (Rao et al., 2015; Gorman et al., 2020). ronments.
One interesting application of the present work
Vera Demberg, Helmut Schmid, and Gregor Möhler.
would be to use the more accurate pronunciation 2007. Phonological constraints and morphological
prediction for morphologically related forms for ef- preprocessing for grapheme-to-phoneme conversion.
ficient pronunciation lexicon development (useful In Proceedings of the 45th Annual Meeting of the
for low resource languages where high-coverage Association of Computational Linguistics, pages 96–
103.
lexicons currently don’t exist), for example anno-
tating the lemma pronunciation should be enough Eric Engelhart, Mahsa Elyasi, and Gaurav Bharaj.
and the pronunciation of all the related forms can 2021. Grapheme-to-Phoneme Transformer Model
be predicted with high accuracy. This is hugely for Transfer Learning Dialects. arXiv preprint
arXiv:2104.04091.
beneficial for languages where there are hundreds
or even thousands of surface forms associated with Marina Ermolaeva. 2018. Extracting morphophonol-
the same lemma. Another concern for reliably us- ogy from small corpora. In Proceedings of the Fif-
teenth Workshop on Computational Research in Pho-
ing the neural approaches is explainability (Molnar, netics, Phonology, and Morphology, pages 167–175.
2019). Some recent research looks at explaining
neural models with orthographic and phonological Daan van Esch, Mason Chua, and Kanishka Rao. 2016.
features (Sahai and Sharma, 2021), an extension Predicting Pronunciations with Syllabification and
Stress with Recurrent Neural Networks. In INTER-
for morphological features should be useful. SPEECH, pages 2841–2845.
Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and
References Chris Dyer. 2015. Morphological inflection genera-
tion using character sequence to sequence learning.
Antoine Bruguier, Anton Bakhtin, and Dravyansh arXiv preprint arXiv:1512.06110.
Sharma. 2018. Dictionary Augmented Sequence-
to-Sequence Neural Network for Grapheme to Kyle Gorman, Lucas FE Ashby, Aaron Goyzueta,
Phoneme prediction. Proc. Interspeech 2018, pages Arya D McCarthy, Shijie Wu, and Daniel You. 2020.
3733–3737. The SIGMORPHON 2020 shared task on multilin-
gual grapheme-to-phoneme conversion. In Proceed-
William Chan, Navdeep Jaitly, Quoc Le, and Oriol ings of the 17th SIGMORPHON Workshop on Com-
Vinyals. 2016. Listen, attend and spell: A neural putational Research in Phonetics, Phonology, and
network for large vocabulary conversational speech Morphology, pages 40–50.

286
Alex Graves and Navdeep Jaitly. 2014. Towards end- Kanishka Rao, Fuchun Peng, Haşim Sak, and
to-end speech recognition with recurrent neural net- Françoise Beaufays. 2015. Grapheme-to-phoneme
works. In International Conference on Machine conversion using long short-term memory recurrent
Learning, pages 1764–1772. neural networks. In Acoustics, Speech and Signal
Processing (ICASSP), 2015 IEEE International Con-
Sepp Hochreiter and Jürgen Schmidhuber. 1997. ference on, pages 4225–4229. IEEE.
Long short-term memory. Neural computation,
9(8):1735–1780. Zach Ryan and Mans Hulden. 2020. Data augmen-
tation for transformer-based G2P. In Proceedings
Cassandra L Jacobs and Fred Mailhot. 2019. Encoder- of the 17th SIGMORPHON Workshop on Computa-
decoder models for latent phonological representa- tional Research in Phonetics, Phonology, and Mor-
tions of words. In Proceedings of the 16th Workshop phology, pages 184–188.
on Computational Research in Phonetics, Phonol-
ogy, and Morphology, pages 206–217.
Saumya Sahai and Dravyansh Sharma. 2021. Predict-
Preethi Jyothi and Mark Hasegawa-Johnson. 2017. ing and explaining french grammatical gender. In
Low-resource grapheme-to-phoneme conversion us- Proceedings of the Third Workshop on Computa-
ing recurrent neural networks. In 2017 IEEE Inter- tional Typology and Multilingual NLP, pages 90–96.
national Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 5030–5034. IEEE. Dravyansh Sharma. 2018. On Training and Evaluation
of Grapheme-to-Phoneme Mappings with Limited
Ronald M Kaplan and Martin Kay. 1994. Regular mod- Data. Proc. Interspeech 2018, pages 2858–2862.
els of phonological rule systems. Computational lin-
guistics, 20(3):331–378. Dravyansh Sharma, Melissa Wilson, and Antoine
Bruguier. 2019. Better Morphology Prediction for
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel- Better Speech Systems. In INTERSPEECH, pages
lart, and Alexander Rush. 2017. OpenNMT: Open- 3535–3539.
source toolkit for neural machine translation. In
Proceedings of ACL 2017, System Demonstrations, Pavel Sofroniev and Çağrı Çöltekin. 2018. Phonetic
pages 67–72, Vancouver, Canada. Association for vector representations for sound sequence alignment.
Computational Linguistics. In Proceedings of the Fifteenth Workshop on Com-
putational Research in Phonetics, Phonology, and
Kimmo Koskenniemi. 1983. Two-Level Model for
Morphology, pages 111–116.
Morphological Analysis. In IJCAI, volume 83,
pages 683–685.
Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Fe-
Karen Livescu, Preethi Jyothi, and Eric Fosler-Lussier. lipe Santos, Kyle Kastner, Aaron Courville, and
2016. Articulatory feature-based pronunciation Yoshua Bengio. 2017. Char2wav: End-to-end
modeling. Computer Speech & Language, 36:212– speech synthesis.
232.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.
Minh-Thang Luong, Hieu Pham, and Christopher D Sequence to sequence learning with neural networks.
Manning. 2015. Effective Approaches to Attention- arXiv preprint arXiv:1409.3215.
based Neural Machine Translation. In Proceedings
of the 2015 Conference on Empirical Methods in Jason Taylor and Korin Richmond. 2020. Enhancing
Natural Language Processing, pages 1412–1421. Sequence-to-Sequence Text-to-Speech with Mor-
phology. Submitted to IEEE ICASSP.
Sameer Maskey, Alan Black, and Laura Tomokiya.
2004. Boostrapping phonetic lexicons for new lan- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
guages. In Eighth International Conference on Spo- Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
ken Language Processing. Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Proceedings of the 31st International
Christoph Molnar. 2019. Interpretable Machine
Conference on Neural Information Processing Sys-
Learning. https://christophm.github.io/
tems, pages 6000–6010.
interpretable-ml-book/.

Garrett Nicolai, Kyle Gorman, and Ryan Cotterell. Karl Weiss, Taghi M Khoshgoftaar, and DingDing
2020. Proceedings of the 17th SIGMORPHON Wang. 2016. A survey of transfer learning. Journal
Workshop on Computational Research in Phonetics, of Big data, 3(1):1–40.
Phonology, and Morphology. In Proceedings of the
17th SIGMORPHON Workshop on Computational Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho,
Research in Phonetics, Phonology, and Morphology. Aaron Courville, Ruslan Salakhudinov, Rich Zemel,
and Yoshua Bengio. 2015. Show, attend and tell:
Ben Peters, Jon Dehdari, and Josef van Genabith. Neural image caption generation with visual atten-
2017. Massively Multilingual Neural Grapheme-to- tion. In International conference on machine learn-
Phoneme Conversion. EMNLP 2017, page 19. ing, pages 2048–2057.

287
Model Inputs en de es ru avg. rel. gain
BiLSTM (b/+c/+l) (31.0/30.5/25.2) (17.7/15.5/12.3) (8.1/7.9/6.7) (18.4/15.6/15.9) (-/+7.9%/+20.0%)
BiLSTM+Attn (b/+c/+l) (29.0/27.1/21.3) (12.0/11.6/11.6) (4.9/2.6/2.4) (14.1/13.6/13.1) (-/+15.1%/+22.0%)

Table 4: Number of total Wiktionary entries, and inflected entries with pronunciation and morphology annotations,
for the languages considered.

Appendix plural) /pe.da.gO.Zik/ is pronounced correctly by


BiLSTM+Attn, but as /pe.da.ZO.Zik/ by BiLSTM.
A On size of data Similarly BiLSTM+Attn predicts /"dZæmIN/, while
We record the size of data scraped from Wiktionary Transformer network says /"dZamIN/ for jamming
in Table 5. There is marked inconsistency in the (en). We note that errors for Spanish often involve
number of annotated inflected words where the pro- incorrect stress assignment since the grapheme-to-
nunciation transcription is available, as a fraction of phoneme mapping is highly consistent.
the total vocabulary, for the languages considered. Adding morphological class information seems
In the main paper, we have discussed results to reduce the error in endings for morphologically
on the publicly available Wiktionary dataset. We rich languages, which can be an important source
perform more experiments on a larger dataset (105 - of error if there is relative scarcity of transcrip-
106 examples of annotated inflections per language) tions available for the inflected words. For exam-
using the same data format and methodology for ple, for our BiLSTM+Attn model, the pronunci-
(American) English, German, Spanish and Russian ation for фуррем (ru, ‘furry’ instrumental singu-
(Table 4). We get very similar observations in this lar noun) is fixed from /"furj :em/ to /"furj :Im/, and
regime as well in terms of relative gains in model koronavı́rusról (hu, ‘coronavirus’ delative singu-
performances using our techniques, but these re- lar) gets corrected from /"koron6vi:ruSo:l/ to /"ko-
sults are likely more representative of word error ron6vi:ruSro:l/. On the other hand, adding lemma
rates for the whole languages. pronunciation usually helps with pronouncing the
root morpheme correctly. Without the lemma in-
Language Total senses Annotated inflections jection, our BiLSTM+Attn model mispronounces
en 1.25M 7543 debriefing (en) as /dI"bôi:fIN/ and sentences (en)
es 0.93M 28495 as /sEn"tEnsIz/. Based on these observations, it
fi 0.24M 3663 sounds interesting to try to inject both categorical
fr 0.46M 24062 and lemma information simultaneously.
hu 77.7K 31486
pt 0.39M 2647
ru 0.47M 20558

Table 5: Number of total Wiktionary entries, and in-


flected entries with pronunciation and morphology an-
notations, for the languages considered.

B Error analysis
Neural sequence to sequence models, while highly
accurate on average, make “silly” mistakes like
omitting or inserting a phoneme which are hard
to explain. With that caveat in place, there are
still reasonable patterns to be gleaned when com-
paring the outputs of the various neural models
discussed here. BiLSTM+Attn model seems to not
only be making fewer of these “silly” mistakes,
but also appears to be better at learning the gen-
uinely more challenging predictions. For exam-
ple, the French word pédagogiques (‘pedagogical’,

288
Leveraging Paradigmatic Information in Inflection Acceptability
Prediction: the JHU-SFU Submission to SIGMORPHON Shared Task 0.2

Jane S.Y. Li1,2 Colin Wilson2


1
Simon Fraser University 2
Johns Hopkins University
[email protected] [email protected]

Abstract domain, is paradigmatic information helpful for


tasks in computational morphology? To this end,
Given the prevalence of paradigmatic ef- several recent studies have successfully applied
fects in psycholinguistic processing, we pro-
paradigmatic information to different morphologi-
pose a system that utilizes information from
paradigms (paradigm size, lemma similarity cal generation tasks. (Ahlberg et al., 2014) applied
within paradigms, etc.) to predict acceptabil- a finite-state paradigm generalization technique
ity scores of nonce word inflections in En- (Hulden, 2014) and tested the system on inflecting
glish, Dutch, and German. The proposed unseen Spanish, German, and Finnish words with
model combines a finite-state paradigm gener- competitive accuracy. Similarly, (Ahlberg et al.,
ator (Hulden, 2014) with a naive Bayes clas- 2015) applied the same technique to an inflectional
sifier to soft-classify test lemmas and subse- table generation task in 11 typologically diverse
quently extract paradigm-related features for
languages. In this paper, we examine whether the
acceptability predictions. Although the model
ranked last for its German and Dutch predic- same successes can be achieved in analysis tasks.
tions, it placed second for the English predic- This year’s SIGMORPHON-UniMorph shared
tions. We conjecture that these performance task on cognitively plausible morphological inflec-
differences arise from a lack of language- tion is a prime opportunity to apply these paradigm
specific features for German and Dutch during abstraction techniques. The task requires partici-
classification, thus the system still has signifi- pants to predict the acceptability scores of nonce
cant room for improvement.
inflections in one dimension (English & Dutch:
past tense; German: past participle) while provid-
1 Introduction
ing a large set of real word inflections beyond the
The notion of a paradigm, a set of word instantia- target morphological relationship.
tions connected by a lexeme (Haspelmath and Sims, Our system utilizes paradigmatic information in
2013), has been fundamental to morphological lit- two ways. First, for every inflectional table encoun-
erature from both psycholinguistic and linguistic tered in the model, we build a probability distribu-
perspectives. Paradigmatic influences in morpho- tion (likelihood that the lemma belongs to a class)
phonological processing are well-documented in on a set of abstract paradigms. Second, the pre-
psycholinguistic literature. On the production front, dictor variables that were extracted for the model
for example, we know that higher paradigm sizes were based on paradigmatic information, such as
and morphological family sizes facilitate naming class frequency (de Jong et al., 2000; Pylkkänen
latencies in morphologically-complex languages et al., 2004) and phonological similarity to mem-
(Lõo et al., 2018), and paradigms with higher en- bers of the class (Ernestus and Baayen, 2003), all
tropy are inhibited during naming tasks (Baayen of which are supported by experimental literature.
et al., 2007). At the level of linguistic analysis, While the system scored last overall, we learned
the abstraction of paradigms to inflectional classes that extracting the correct, language-specific fea-
(Haspelmath and Sims, 2013) aid us in understand- tures for inflection table-to-paradigm classification
ing synchronic variation and diachronic changes in is important for the model to succeed.
a language. The rest of the paper is organized as follows. We
Given the encouraging evidence in other do- first introduce the shared task and the evaluation
mains, it is fair for us to ask: in the computational criteria in §2. Then, in §3, we reported the steps

289
Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research
in Phonetics, Phonology, and Morphology, pages 289–294
August 5, 2021. ©2021 Association for Computational Linguistics
Training Development Testing guages are summed to obtain the overall ranking.
English 47550 158 138
Dutch 84666 122 166 3 System Description
German 114185 150 266 The submitted system takes a lemma-inflection pair
Table 1: The number of entries of each language’s and its MSD and returns an acceptability score
training, development, and testing dataset. in the range [0, 7]. The model can be broken
down into three main modules. First, the model
extracts abstract paradigms from the training set
and mechanisms employed to predict acceptability based on Hulden’s (2014) algorithm. Then, based
ratings. We review the results of the shared task in on these paradigms, we create a probability dis-
§4 and suggest ways of improvement. Lastly, we tribution for each lemma-inflection pair that indi-
give our concluding remarks in §5. cates the likelihood of an inflection belonging to a
paradigm. Lastly, we extract a weighted average
2 Task Description of various measures of similarity and frequency
The goal of this year’s task is to predict acceptabil- for each lemma-inflection pair in the development
ity scores of inflections of nonce lemmas in English, set and the test set, and fit a linear model to the
Dutch, and German. For instance, given the En- development data. These processes are described
glish nonce lemma fink /fINk/, the submitted system in further detail in the following subsections.
will predict the acceptability of the past tense can- 3.1 Paradigm Extraction
didates finked /fINkt/, fank /fæNk/, and funk /f2Nk/.
This section reports on the datasets provided for The goal of this component is to transform inflec-
training and testing and the evaluation criteria for tion tables derived from the training data into ab-
the submitted predictions. stract paradigms. We do so by first transforming
We were provided with three datasets for each into the training dataset into a compatible matrix.
language (see Table 1), with all lemma and inflec- Then, we apply Hulden’s (2014) finite-state abstrac-
tion strings in IPA form (e.g. worked: wORkt). The tion algorithm.
training set contains real lemma-inflection entries As described in §2, the training data were entries
with the structure lemma, inflection, and of the form:
morphosyntactic description (MSD).
lemma inflection MSD
This dataset is relatively large with an average of
82,134 entries per language. For the purposes of We reorganized the data by having MSDs as
our model, this dataset is used to infer paradigms columns and lemmas as rows, such that each entry
and the possible (but not necessarily plausible) is a complete or incomplete inflection table (Figure
inflections of any real or nonce lemma. The X). An inflection table is considered complete if all
judgement-development (hereafter development MSDs slots are filled, whereas an incomplete table
set) and testing sets are smaller (167 entries on av- has at least one or more MSD slots empty. Due to
erage) with the structure lemma, inflection, the large number of MSD tags in German (31 tags),
MSD, and judgement score1 . Both the de- we had to manually prune some tags/columns to
velopment and testing sets contain lemmas that ensure that there were a sizeable amount of com-
have exactly two potential inflections occurring in plete tables for paradigm learning2 . Additionally,
a regular-irregular pair, such as hsnEl, snEldi (regu- lemmas with multiple inflections in the same MSD
lar) and hsnEl, snElti (irregular, similar to dealt). (usually due to a pronunciation difference) were
The submitted systems were evaluated (by each removed from the dataset to ensure each lemma
language) with a mixed-effects beta regression corresponds to one data entry.
model, with lemma type as a random intercept. The We then induced abstract paradigms from the set
Akaike information criterion (AIC) of the model of complete inflection tables. Hulden’s (2014) al-
was used to rank submitted systems relative to each 2
The MSDs that were retained were tags that contain PST
other: the lower the AIC, the closer the system is (past), since our goal is to predict past participle acceptability.
to the actual acceptability scores. AIC across lan- A possible extension that automates this pruning process are
simulations that maximize (1) the amount of tags related to
1
Details on judgement score elicitation and nonce lemma the target morphological relationship and (2) the amount of
generation are documented in Ambridge et al. (2021). complete tables.

290
gorithm3 relies on the notion of a longest common training set, incomplete tables are soft-classified
subsequnce (LCS), which is defined rigorously as into potential classes, providing more exemplars
follows (Hirschberg, 1977): for the development and test set lemmas. For
the development and test entries, the probability
• String L is a subsequence of X iff L can be ob- distribution allows for the feature extraction
tained by deleting any (0 or more) symbols in process in §3.3 to be weighted.
X, e.g. course is a subsequence of computer Our protocol for generating CPi is as follows.
science. First, we define the terms compatible and incom-
patible with respect to a class (except cO ) and an
• L is a common subsequence of X1 and X2 iff
inflection table. An inflection table and a class is
L is a subsequence of X1 and X2 .
incompatible if it meets any one of the following
• L is the LCS of X1 and X2 iff there does not criteria:
exist a string K such that K is a common
• The characters of the specified substrings of
subsequence of X1 and X2 and len(K) >
the abstract inflection (e.g. ed in x1 +ed) do
len(L).
not exist in the inflection string, e.g. ræn and
The goal for each table-to-paradigm process is to: x1 +d.
(1) extract the LCS between entries of a table and
• The placement for the variables make it impos-
(2) assign substrings of the LCS as variables. Then,
sible to fit the inflection string in the abstract
paradigms collapse to a smaller set of distinct, ab-
inflection. For example: dElt and x1 +d, Elt
stract paradigms with multiple lemmas belonging
cannot be accommodated in this configura-
to one paradigm. For example, the inflection ta-
tion.
bles ring: rIN#ræN#r2N, sing: sIN#sæN#s2N, and
walk: wOlk#wOlkt#wOlkt reduces to the paradigms: • Although a configuration is possible in vari-
x1 Ix2 #x1 æx2 #x1 2x2 and x1 #x1 t#x1 t. For the spe- ous dimensions individually, the variable as-
cific implementation procedures, please consult signments are conflicting. For example: feI-
(Hulden, 2014) (see also Ahlberg et al., 2014, dId ’faded’ fits into the voiced regular past
2015). tense template x1 +d by x1 =feIdI and feIdIN
At last, we define the set of abstract paradigms ’fading’ will also fit the progressive x1 +IN by
as C (for ‘classes’ hereafter) and set of MSD tags x1 =feId, but the two x1 variables are differ-
M . While C encapsulates the inflectional patterns ent, resulting in incompatibility.
found in the complete inflection tables, we antici-
pate that there will be unattested paradigms in the Otherwise, if all instances within and across the
incomplete training tables as well as the testing and inflection table and the abstract class do not violate
development datasets. So, we will be extending the criteria above, then they are considered com-
C to C# = C ∪ {cO }, where cO represents the patible. Some inflection tables may have multiple
inflectional tables unaccounted for by C. compatible classes – this may arise from the table
having a few inflectional dimensions, which in turn
3.2 Class Probability causes it to satisfy the compatibility requirements
Following the definition of the set of easily.
classes/paradigms C# , we then generate a Any incompatible classes are assigned a 0 prob-
function CPi : C# 7→ [0, 1] for each incomplete ability: CPi (c) = 0. In the case where no classes
inflection table i (later extended to complete are compatible with i, the null paradigm cO is as-
tables), where CPi (c) indicates the probability signed a probability of 1: CPi (cO ) = 1. Cases
that i belongs to class c. The sum of all outputs where an inflectional table has exactly one com-
of this function equals to 1, as we assume that the patible class, that class is assigned a probability of
classes in C# are all the possible outcomes (recall 1. Likewise, complete tables are automatically as-
C# accounts for unattested paradigms). This signed a probability of 1 in their respective classes
function has two applications in this model. For the generated from §3.1. In all other cases where a
3
table has multiple compatible classes, we run a
The code for this algorithm is openly available through
the pextract toolbox: https://code.google.com/ naive Bayes classifier (implemented in the nltk
archive/p/pextract/ package) on the lemmas of the competing classes.

291
For each class member (lemma) and the lemma
of the table in question, we obtain the values of
three parameters: its syllable structure (in a CV
string), its first phoneme, and last phoneme. The
class probabilities obtained from the classifiers are
assigned to the respective classes.
Similarly, we treat the development and test
lemma-inflections as incomplete inflection tables
with two dimensions. They undergo the same pro-
cess described above to obtain their class proba-
bilities. We now have a rich stock of lemmas and
inflection soft-categorized by paradigmatic infor-
mation, to which we will refer to during the pre- Figure 1: Judgement scores from the English develop-
dictor abstraction process, as detailed in the next ment set mapped against the predicted scores. Red dat-
subsection. apoints represent irregular items, whereas blue points
are regular items. The closer a point is to the line, the
3.3 Paradigm Information & Model Building closer the prediction is to the actual judgement score.
Lastly, we extract three predictor values that reflect
paradigmatic information: syllable structure Lev- 4 Results & Discussion
enshtein distance, weighted phonological feature
distance, and class size. All three predictors are We now turn to the results of the model and re-
weighted by the probability distribution generated view qualitatively some shortcomings of the model.
from §3.2, Then, we propose some potential extensions and
P as seen in the formulas that follow. Let
nc = w∈c CPw (c) (the frequency of a class c). fixes that may improve the performance of this
model.
• Syllable Levenshtein: average Levenshtein
distance of the test/development lemma and 4.1 General Observations
lemmas in the class.
! The AIC values for the English, Dutch, and German
X CPi (c) X test sets were −46, −30.3, and −14.8 respectively.
· (dist(i, w) · CPw (c))
nc With regard to ranking, our system ranked last for
c∈C w∈c
Dutch and German but ranked second for English.
• Weighted phonological features: average Figure 1 shows the relationship between the actual
phonological distance of the test/development scores and the predictions for a simulation in the
lemma and lemmas in the class, derived from development set. We notice that the all predictions
the panphon package (Mortensen et al., had a much narrower range (2.55-4.71) than that
2016). of the actual scores (0.29-6.19). The Dutch data
! also showed a similar pattern (1.60-4.68 predicted
X CPi (c) X
· (phondist(i, w) · CPw (c)) vs. 0.62-6.42 actual), which may imply that the
c∈C
nc w∈c variables chosen were not able to yield a distinc-
tive difference. A linear model may also not be
• Class frequency: the weighted frequency of sufficient to fit the data.
each class. On a similar vein, we notice that irregular in-
X flectional candidates were often overrated by our
CPi (c) · n
predictions, whereas regular inflections were un-
c∈C
derrated. These modelling issues may be remedied
The development data was subsequently fitted in a few ways. First, it is possible to define a reg-
to a linear model with acceptability scores as the ular class among C# (e.g. the most frequent class
response variables. We then applied the values and add a reward factor to all test lemma-inflection
from the test set to the linear model to yield the pairs of the class. Similarly, a penalty score can
predictions for this shared task. Values that were be added for pairs in irregular inflection classes.
below 0 were adjusted to 0 and those above 7 were Another tweak to our system is to consider other
adjusted to 7. regression models that may fit the data better, such

292
as using K-nearest neighbours given the current 5 Concluding Remarks
variable extraction methods (§3.3).
This paper proposes a framework that utilizes
Future iterations of this model may seek to
paradigmatic information to predict acceptability
improve accuracy scores by implementing a fur-
ratings of nonce lemma inflections. This system
ther abstraction of the paradigm extraction pro-
is highly modular – the variables that go between
cess described in §3.1. (Silfverberg et al., 2018)
each module can be easily tweaked and reviewed,
describes an extension of (Hulden, 2014) that ab-
which increases the interpretability of the system.
stracts paradigms to the featural level. For instance,
Additionally, we foresee that a more developed ver-
our current system has separate abstract paradigms
sion of this model may provide insight to deeper
for the class of lemmas with voiced regular past
questions in cognitive modelling, such as: how
tenses (e.g. bribe → bribe+/d/) and unvoiced (e.g.
does phonological neighbourhood interact with
jump → jump+/t/). By merging the two classes, we
paradigm size in these system, and does it con-
are able to represent, to a further extent, the class
form with findings in linguistic and psycholinguis-
of regular inflections and set them apart from other
tic studies? We look forward to future extensions
variants. This may lead to better accuracy scores
of this model and contributing to the ongoing cog-
for this task because it gives a better representation
nitive modelling work in morphology.
of frequency class and more exemplars for the test
lemma to refer to. We should note, however, this Acknowledgments
may be a detriment to prediction tasks that ask to
determine acceptability scores within a phonologi- We would like to thank the organizers for their hard
cal rule (e.g. is work+/d/ or work+/t/ acceptable). work on the shared task.

4.2 Language-specific Observations


References
We note that there was a clear performance dif-
Malin Ahlberg, Markus Forsberg, and Mans Hulden.
ference between the English predictions and the 2014. Semi-supervised learning of morphological
Dutch/German predictions4 . We conjecture that paradigms and lexicons. In Proceedings of the 14th
these differences arise from the lack of language- Conference of the European Chapter of the Associa-
specific features during the naive Bayes classifica- tion for Computational Linguistics, pages 569–578.
tion process (§3.2). Recall that our naive Bayes Malin Ahlberg, Markus Forsberg, and Mans Hulden.
classification had first and last phoneme and CV 2015. Paradigm classification in supervised learn-
string as input, when those factors may not be im- ing of morphology. In Proceedings of the 2015 Con-
portant to determining class assignment in Dutch ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
or German. guage Technologies, pages 1024–1029.
(Ernestus and Baayen, 2001) describe Dutch past
tense selection (between -te and -de) is a complex R Baayen, W Levelt, Robert Schreuder, and Mirjam
Ernestus. 2007. Paradigmatic structure in speech
interaction between lemma frequency, analogy, and
production. In Proceedings from the annual meeting
the type of stem final obstruent. For example, they of the Chicago linguistic society, volume 43, pages
claim that when a lemma frequency is low, analogy 1–29. Chicago Linguistic Society.
is weighed stronger. While this information is not
applicable to nonce lemmas and the current training Mirjam Ernestus and R Harald Baayen. 2003. Predict-
ing the unpredictable: Interpreting neutralized seg-
set (no frequency information is provided), future ments in dutch. Language, pages 5–38.
models that reference other sources (e.g. CELEX
database) may seek to incorporate these variables. Mirjam TC Ernestus and R Harald Baayen. 2001.
Additionally, a naive Bayes classifier may not cap- Choosing between the dutch past-tense suffixes-te
and-de. Linguistics in the Netherlands, 18(1):77–
ture these interactions, thus other classifers such as 88.
decision trees may be considered.
Martin Haspelmath and Andrea Sims. 2013. Under-
4
We note that the German and Dutch predictions may have standing morphology. Routledge.
also under-performed due to a character error problem, where
the Unicode character in the training and development sets Daniel S Hirschberg. 1977. Algorithms for the longest
use the Unicode character U+0067 ‘g’ and U+0261 ‘ g’ for common subsequence problem. Journal of the ACM
the test set. (JACM), 24(4):664–675.

293
Mans Hulden. 2014. Generalizing inflection tables into
paradigms with finite state operations. In Proceed-
ings of the 2014 Joint Meeting of SIGMORPHON
and SIGFSM, pages 29–36.
Nivja H de Jong, Robert Schreuder, and R Har-
ald Baayen. 2000. The morphological family size
effect and morphology. Language and cognitive pro-
cesses, 15(4-5):329–365.
Kaidi Lõo, Juhani Järvikivi, Fabian Tomaschek, Ben-
jamin V Tucker, and R Harald Baayen. 2018. Pro-
duction of Estonian case-inflected nouns shows
whole-word frequency and paradigmatic effects.
Morphology, 28(1):71–97.
David R. Mortensen, Patrick Littell, Akash Bharad-
waj, Kartik Goyal, Chris Dyer, and Lori S. Levin.
2016. Panphon: A resource for mapping IPA seg-
ments to articulatory feature vectors. In Proceed-
ings of COLING 2016, the 26th International Con-
ference on Computational Linguistics: Technical Pa-
pers, pages 3475–3484. ACL.
Liina Pylkkänen, Sophie Feintuch, Emily Hopkins, and
Alec Marantz. 2004. Neural correlates of the effects
of morphological family frequency and family size:
an meg study. Cognition, 91(3):B35–B45.
Miikka Silfverberg, Ling Liu, and Mans Hulden. 2018.
A computational model for the linguistic notion of
morphological paradigm. In Proceedings of the 27th
International Conference on Computational Linguis-
tics, pages 1615–1626.

294
Author Index

Ács, Judit, 154, 193 Goldman, Omer, 154


Agirrezabal, Manex, 72 Goldwater, Sharon, 82
Aiton, Grant, 154 Gorman, Kyle, 115
Ambridge, Ben, 154 Gormley, Matthew R., 258
Ashby, Lucas F.E., 115 Graf, Thomas, 11
Ataman, Duygu, 154
Habash, Nizar, 154
Barta, Botond, 154, 193 Hammond, Michael, 126
Bartley, Travis M., 115 Hatcher, Richard J., 154
Batsuren, Khuyagbaatar, 39 Hathout, Nabil, 196
Bayyr-Ool, Aziyana, 154 Heinz, Jeffrey, 272
Bella, Gábor, 39 Hovy, Eduard, 258
Berg-Kirkpatrick, Taylor, 258 Hulden, Mans, 72, 154
Bernardy, Jean-Philippe, 154, 185
Bonami, Pierre, 196 Ivanova, Sardana, 154
Bruguier, Antoine, 282
Jayanthi, Sai Muralidhar, 49
Calderone, Basilio, 196
Kann, Katharina, 72, 107
Chaudhari, Neha, 282
Khalifa, Salam, 154
Chodroff, Eleanor, 154
Kieraś, Witold, 154
Choudhury, Monojit, 60
Kirby, James, 32
Clematide, Simon, 115, 148
Klyachko, Elena, 154
Coler, Matt, 154
Kogan, Ilan, 1
Cotterell, Ryan, 154
Krizhanovsky, Andrew, 154
Dai, Huteng, 227 Krizhanovsky, Natalia, 154
Daniels, Josh, 90 Kumar, Ritesh, 154
De Santo, Aniello, 11
Lakatos, Dorina, 154, 193
Del Signore, Luca, 115
Lane, William, 154
Dolatian, Hossep, 11
Lee-Sikka, Yeonju, 115
Ek, Adam, 154, 185 Leonard, Brian, 154
El-Khaissi, Charbel, 154 Li, Jane S. Y., 205, 289
Elsner, Micha, 214 Li, Wang Yau, 141
Erdmann, Alexander, 72 Liu, Zoey, 154
Lo, Roger Yu-Hsiang, 131
Forbes, Clarissa, 248 Lopez, Adam, 82
Futrell, Richard, 227
Mahmood, Zafarullah, 141
Ganggo Ate, Yustinus, 154 Mailhot, Frederic, 141
Gasser, Michael, 154 Makarov, Peter, 115, 148
Gautam, Vasundhara, 141 Malanoski, Aidan, 115
Gerlach, Andrew, 107 Markowska, Magdalena, 272
Gibson, Cameron, 115 McCarthy, Arya D., 72
giunchiglia, fausto, 39 McCurdy, Kate, 82

295
Mielke, Sabrina J., 154 Woliński, Marcin, 154
Miller, Sean, 115 Wu, Shijie, 154

Nadig, Shreekantha, 141 Yan, Winnie, 115


Nicolai, Garrett, 72, 98, 131, 154, 248 Yang, Changbing, 98
Nuriah, Zahroh, 154 Yarowsky, David, 154
Yeung, Arnold, 1
Oncevay, Arturo, 154
Ortiz, Omar, 115 Zhang, Nathan, 141

Palmer, Alexis, 90
Papillon, Maxime, 23
Perkoff, E. Margaret, 90
Pimentel, Tiago, 154
Ponti, Edoardo Maria, 154
Pratapa, Adithya, 49
Prud’hommeaux, Emily, 154

Raff, Reuben, 115


Rambow, Owen, 272
Ratan, Shyam, 154
Roewer-Despres, Francois, 1
Ryskina, Maria, 154, 258

Sahai, Saumya, 282


Salchak, Aelita, 154
Salehi, Ali, 154
Samame, Jaime Rafael Montoya, 154
Sathe, Aalok, 60
Sengupta, Arundhati, 115
Seo, Bora, 115
Sharma, Dipti, 60
Sharma, Dravyansh, 282
Shcherbakov, Andrey, 154
Sheifer, Karina, 154
Silfverberg, Miikka, 72, 98, 248
Spektor, Yulia, 115
Stoehr, Niklas, 154
Straughn, Christopher, 154
Suhardijanto, Totok, 154
Szolnok, Gábor, 154, 193

Tyers, Francis, 154

Vaduguru, Saujas, 60
Vania, Clara, 154
Villegas, Gema Celeste Silva, 154
Vylomova, Ekaterina, 154

WANG, Riqiang, 141


Wang, Yang, 237
Washington, Jonathan N., 154
Wiemerslage, Adam, 72, 107
Wilson, Colin, 205, 289

You might also like