Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi* Yuhao Zhang* Yuhui Zhang
Jason Bolton Christopher D. Manning
Stanford University
Stanford, CA 94305
{pengqi, yuhaozhang, yuhuiz}@stanford.edu
{jebolton, manning}@stanford.edu

arXiv:2003.07082v2 [cs.CL] 23 Apr 2020

Abstract

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

[Figure 1: pipeline diagram. RAW TEXT in many languages passes through the processors TOKENIZE (Tokenization & Sentence Split), MWT (Multi-word Token Expansion), LEMMA (Lemmatization), POS (POS & Morphological Tagging), DEPPARSE (Dependency Parsing), and NER (Named Entity Recognition), producing native Python objects: DOCUMENT, SENTENCE, TOKEN, WORD.]
Figure 1: Overview of Stanza's neural NLP pipeline. Stanza takes multilingual text as input, and produces annotations accessible as native Python objects. Besides this neural pipeline, Stanza also features a Python client interface to the Java CoreNLP software.

1 Introduction

The growing availability of open-source natural language processing (NLP) toolkits has made it easier for users to build tools with sophisticated linguistic processing. While existing NLP toolkits such as CoreNLP (Manning et al., 2014), FLAIR (Akbik et al., 2019), spaCy¹, and UDPipe (Straka, 2018) have had wide usage, they also suffer from several limitations. First, existing toolkits often support only a few major languages. This has significantly limited the community's ability to process multilingual text. Second, widely used tools are sometimes under-optimized for accuracy either due to a focus on efficiency (e.g., spaCy) or use of less powerful models (e.g., CoreNLP), potentially misleading downstream applications and insights obtained from them. Third, some tools assume input text has been tokenized or annotated with other tools, lacking the ability to process raw text within a unified framework. This has limited their wide applicability to text from diverse sources.

We introduce Stanza², a Python natural language processing toolkit supporting many human languages. As shown in Table 1, compared to existing widely-used NLP toolkits, Stanza has the following advantages:

• From raw text to annotations. Stanza features a fully neural pipeline which takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.

• Multilinguality. Stanza's architectural design is language-agnostic and data-driven, which allows us to release models supporting 66 languages, by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.

• State-of-the-art performance. We evaluate Stanza on a total of 112 datasets, and find its neural pipeline adapts well to text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.

Additionally, Stanza features a Python interface to the widely used Java CoreNLP package, allowing access to additional tools such as coreference resolution and relation extraction.

Stanza is fully open source and we make pretrained models for all supported languages and datasets available for public download. We hope Stanza can facilitate multilingual NLP research and applications, and drive future research that produces insights from human languages.

* Equal contribution. Order decided by a tossed coin.
¹ https://spacy.io/
² The toolkit was called StanfordNLP prior to v1.0.0.

System    # Human     Programming   Raw Text     Fully    Pretrained   State-of-the-art
          Languages   Language      Processing   Neural   Models       Performance
CoreNLP   6           Java          ✓            -        ✓            -
FLAIR     12          Python        -            ✓        ✓            ✓
spaCy     10          Python        ✓            -        ✓            -
UDPipe    61          C++           ✓            -        ✓            ✓
Stanza    66          Python        ✓            ✓        ✓            ✓

Table 1: Feature comparisons of Stanza against other popular natural language processing toolkits.

2 System Design and Architecture

At the top level, Stanza consists of two individual components: (1) a fully neural multilingual NLP pipeline; (2) a Python client interface to the Java Stanford CoreNLP software. In this section we introduce their designs.

2.1 Neural Multilingual NLP Pipeline

Stanza's neural pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences (see Figure 1). All components are designed with processing many human languages in mind, with high-level design choices capturing common phenomena in many languages and data-driven models that learn the difference between these languages from data. Moreover, the implementation of Stanza components is highly modular, and reuses basic model architectures when possible for compactness. We highlight the important design choices here, and refer the reader to Qi et al. (2018) for modeling details.

Tokenization and Sentence Splitting. When presented with raw text, Stanza tokenizes it and groups tokens into sentences as the first step of processing. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. This is modeled as a tagging problem over character sequences, where the model predicts whether a given character is the end of a token, end of a sentence, or end of a multi-word token (MWT, see Figure 2).³ We choose to predict MWTs jointly with tokenization because this task is context-sensitive in some languages.

³ Following Universal Dependencies (Nivre et al., 2020), we make a distinction between tokens (contiguous spans of characters in the input text) and syntactic words. These are interchangeable aside from the cases of MWTs, where one token can correspond to multiple words.

(fr) L'Association des Hôtels
(en) The Association of Hotels
(fr) Il y a des hôtels en bas de la rue
(en) There are hotels down the street

Figure 2: An example of multi-word tokens in French. The des in the first sentence corresponds to two syntactic words, de and les; the second des is a single word.

Multi-word Token Expansion. Once MWTs are identified by the tokenizer, they are expanded into the underlying syntactic words as the basis of downstream processing. This is achieved with an ensemble of a frequency lexicon and a neural sequence-to-sequence (seq2seq) model, to ensure that frequently observed expansions in the training set are always robustly expanded while maintaining flexibility to model unseen words statistically.

POS and Morphological Feature Tagging. For each word in a sentence, Stanza assigns it a part-of-speech (POS), and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, etc.). To predict POS and UFeats, we adopt a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture. For consistency among universal POS (UPOS),
treebank-specific POS (XPOS), and UFeats, we adopt the biaffine scoring mechanism from Dozat and Manning (2017) to condition XPOS and UFeats prediction on that of UPOS.

Lemmatization. Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Similar to the multi-word token expander, Stanza's lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier is built on the encoder output of the seq2seq model, to predict shortcuts such as lowercasing and identity copy for robustness on long input sequences such as URLs.

Dependency Parsing. Stanza parses each sentence for its syntactic structure, where each word in the sentence is assigned a syntactic head that is either another word in the sentence, or in the case of the root word, an artificial root symbol. We implement a Bi-LSTM-based deep biaffine neural dependency parser (Dozat and Manning, 2017). We further augment this model with two linguistically motivated features: one that predicts the linearization order of two words in a given language, and the other that predicts the typical distance in linear order between them. We have previously shown that these features significantly improve parsing accuracy (Qi et al., 2018).

Named Entity Recognition. For each input sentence, Stanza also recognizes named entities in it (e.g., person names, organizations, etc.). For NER we adopt the contextualized string representation-based sequence tagger from Akbik et al. (2018). We first train a forward and a backward character-level LSTM language model, and at tagging time we concatenate the representations at the end of each word position from both language models with word embeddings, and feed the result into a standard one-layer Bi-LSTM sequence tagger with a conditional random field (CRF)-based decoder.

2.2 CoreNLP Client

Stanford's Java CoreNLP software provides a comprehensive set of NLP tools especially for the English language. However, these tools are not easily accessible with Python, the programming language of choice for many NLP practitioners, due to the lack of official support. To facilitate the use of CoreNLP from Python, we take advantage of the existing server interface in CoreNLP, and implement a robust client as its Python interface.

When the CoreNLP client is instantiated, Stanza will automatically start the CoreNLP server as a local process. The client then communicates with the server through its RESTful APIs, after which annotations are transmitted in Protocol Buffers, and converted back to native Python objects. Users can also specify JSON or XML as annotation format. To ensure robustness, while the client is being used, Stanza periodically checks the health of the server, and restarts it if necessary.

3 System Usage

Stanza's user interface is designed to allow quick out-of-the-box processing of multilingual text. To achieve this, Stanza supports automated model download via Python code and pipeline customization with processors of choice. Annotation results can be accessed as native Python objects to allow for flexible post-processing.

3.1 Neural Pipeline Interface

Stanza's neural NLP pipeline can be initialized with the Pipeline class, taking a language name as an argument. By default, all processors will be loaded and run over the input text; however, users can also specify the processors to load and run with a list of processor names as an argument. Users can additionally specify other processor-level properties, such as batch sizes used by processors, at initialization time.

The following code snippet shows a minimal usage of Stanza for downloading the Chinese model, annotating a sentence with customized processors, and printing out all annotations:

    import stanza
    # download Chinese model
    stanza.download('zh')
    # initialize Chinese neural pipeline
    nlp = stanza.Pipeline('zh', processors='tokenize,pos,ner')
    # run annotation over a sentence
    doc = nlp('斯坦福是一所私立研究型大学。')
    print(doc)

After all processors are run, a Document instance will be returned, which stores all annotation results. Within a Document, annotations are further stored in Sentences, Tokens and Words in a top-down fashion (Figure 1). The following code snippet demonstrates how to access the text and POS tag of each word in a document and all named entities in the document:
    # print the text and POS of all words
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.pos)

    # print all entities in the document
    print(doc.entities)
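
The containment hierarchy described above (a Document holding Sentences, which hold Tokens, which in turn hold Words) can be illustrated with a small stand-alone sketch. The dataclasses below are hypothetical stand-ins written only for this illustration and are not Stanza's own classes; the German sentence and its POS tags are likewise made up by hand. The sketch mirrors the token/word distinction drawn in the paper, using the contraction am (an + dem) as a multi-word token.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    pos: str  # universal POS tag, assigned by hand in this toy example


@dataclass
class Token:
    text: str
    words: List[Word] = field(default_factory=list)


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

    @property
    def words(self) -> List[Word]:
        # Flatten tokens into syntactic words, preserving surface order.
        return [w for t in self.tokens for w in t.words]


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)


# "am" is a single token that corresponds to two syntactic words.
sentence = Sentence(tokens=[
    Token("Er", [Word("Er", "PRON")]),
    Token("steht", [Word("steht", "VERB")]),
    Token("am", [Word("an", "ADP"), Word("dem", "DET")]),
    Token("Fenster", [Word("Fenster", "NOUN")]),
])
doc = Document(sentences=[sentence])

# Same traversal pattern as the snippet above: sentences, then words.
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.pos)
```

With this layout a sentence has four tokens but five words, which is why downstream annotations such as POS tags attach to words rather than to tokens.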

Stanza is designed to be run on different hardware devices. By default, CUDA devices will be used whenever they are visible to the pipeline; otherwise CPUs will be used. However, users can force all computation to be run on CPUs by setting use_gpu=False at initialization time.
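
The default device behaviour described above amounts to a decision over two inputs. The sketch below restates that rule in plain Python. Both `select_device` and `cuda_visible` are hypothetical names invented for this illustration; they are not part of Stanza's API, which exposes only the `use_gpu` argument mentioned in the text.

```python
def select_device(cuda_visible: bool, use_gpu: bool = True) -> str:
    """Return "cuda" only when a GPU is both requested and visible.

    Hypothetical helper for illustration; not Stanza's implementation.
    """
    return "cuda" if (use_gpu and cuda_visible) else "cpu"


# Enumerate the four cases of the rule.
for cuda_visible in (True, False):
    for use_gpu in (True, False):
        print(cuda_visible, use_gpu, select_device(cuda_visible, use_gpu))
```

Only the combination of a visible CUDA device and the default `use_gpu=True` selects the GPU; every other case falls back to the CPU.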

3.2 CoreNLP Client Interface

The CoreNLP client interface is designed in a way that the actual communication with the backend CoreNLP server is transparent to the user. To annotate an input text with the CoreNLP client, a CoreNLPClient instance needs to be initialized, with an optional list of CoreNLP annotators. After the annotation is complete, results will be accessible as native Python objects.

This code snippet shows how to establish a CoreNLP client and obtain the NER and coreference annotations of an English sentence:

    from stanza.server import CoreNLPClient
    # start a CoreNLP client
    with CoreNLPClient(
            annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'coref']) as client:
        # run annotation over input
        ann = client.annotate('Emily said that she liked the movie.')
        # access all entities
        for sent in ann.sentence:
            print(sent.mentions)
        # access coreference annotations
        print(ann.corefChain)

With the client interface, users can annotate text in 6 languages as supported by CoreNLP.

3.3 Interactive Web-based Demo

To help visualize documents and their annotations generated by Stanza, we build an interactive web demo that runs the pipeline interactively. For all languages and all annotations Stanza provides in those languages, we generate predictions from the models trained on the largest treebank/NER dataset, and visualize the result with the Brat rapid annotation tool.⁴ This demo runs in a client/server architecture, and annotation is performed on the server side. We make one instance of this demo publicly available at http://stanza.run/. It can also be run locally with proper Python libraries installed. An example of running Stanza on a German sentence can be found in Figure 3.

⁴ https://brat.nlplab.org/

Figure 3: Stanza annotates a German sentence, as visualized by our interactive demo. Note am is expanded into syntactic words an and dem before downstream analyses are performed.

3.4 Training Pipeline Models

For all neural processors, Stanza provides command-line interfaces for users to train their own customized models. To do this, users need to prepare the training and development data in compatible formats (i.e., CoNLL-U format for the Universal Dependencies pipeline and BIO format column files for the NER model). The following command trains a neural dependency parser with user-specified training and development data:

    $ python -m stanza.models.parser \
        --train_file train.conllu \
        --eval_file dev.conllu \
        --gold_file dev.conllu \
        --output_file output.conllu

4 Performance Evaluation

To establish benchmark results and compare with other popular toolkits, we trained and evaluated Stanza on a total of 112 datasets. All pretrained models are publicly downloadable.

Datasets. We train and evaluate Stanza's tokenizer/sentence splitter, MWT expander, POS/UFeats tagger, lemmatizer, and dependency parser with the Universal Dependencies v2.5 treebanks (Zeman et al., 2019). For training we use 100 treebanks from this release that have non-copyrighted training data, and for treebanks that do not include development data, we randomly split out 20% of
Treebank                 System  Tokens  Sents.  Words  UPOS   XPOS   UFeats  Lemmas  UAS    LAS
Overall (100 treebanks)  Stanza  99.09   86.05   98.63  92.49  91.80  89.93   92.78   80.45  75.68
Arabic-PADT              Stanza  99.98   80.43   97.88  94.89  91.75  91.86   93.27   83.27  79.33
Arabic-PADT              UDPipe  99.98   82.09   94.58  90.36  84.00  84.16   88.46   72.67  68.14
Chinese-GSD              Stanza  92.83   98.80   92.83  89.12  88.93  92.11   92.83   72.88  69.82
Chinese-GSD              UDPipe  90.27   99.10   90.27  84.13  84.04  89.05   90.26   61.60  57.81
English-EWT              Stanza  99.01   81.13   99.01  95.40  95.12  96.11   97.21   86.22  83.59
English-EWT              UDPipe  98.90   77.40   98.90  93.26  92.75  94.23   95.45   80.22  77.03
English-EWT              spaCy   97.30   61.19   97.30  86.72  90.83  –       87.05   –      –
French-GSD               Stanza  99.68   94.92   99.48  97.30  –      96.72   97.64   91.38  89.05
French-GSD               UDPipe  99.68   93.59   98.81  95.85  –      95.55   96.61   87.14  84.26
French-GSD               spaCy   98.34   77.30   94.15  86.82  –      –       87.29   67.46  60.60
Spanish-AnCora           Stanza  99.98   99.07   99.98  98.78  98.67  98.59   99.19   92.21  90.01
Spanish-AnCora           UDPipe  99.97   98.32   99.95  98.32  98.13  98.13   98.48   88.22  85.10
Spanish-AnCora           spaCy   99.47   97.59   98.95  94.04  –      –       79.63   86.63  84.13

Table 2: Neural pipeline performance comparisons on the Universal Dependencies (v2.5) test treebanks. For our system we show macro-averaged results over all 100 treebanks. We also compare our system against UDPipe and spaCy on treebanks of five major languages where the corresponding pretrained models are publicly available. All results are F1 scores produced by the 2018 UD Shared Task official evaluation script.
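
The "Overall" row of Table 2 is a macro average, meaning that every treebank contributes equally regardless of its size. A minimal sketch of that computation follows. The helper name `macro_average` is an assumption made for this example, and for illustration it averages only the five per-language Stanza LAS scores listed in the table, so the printed value is not the 100-treebank figure reported in the Overall row.

```python
from statistics import mean

# Stanza LAS scores copied from the five per-language rows of Table 2.
stanza_las = {
    "Arabic-PADT": 79.33,
    "Chinese-GSD": 69.82,
    "English-EWT": 83.59,
    "French-GSD": 89.05,
    "Spanish-AnCora": 90.01,
}


def macro_average(scores_by_treebank):
    """Unweighted mean over treebanks: each treebank counts exactly once."""
    return mean(list(scores_by_treebank.values()))


print(round(macro_average(stanza_las), 2))
```

A size-weighted (micro) average would instead let large treebanks dominate the result, which is why a macro average is the more natural summary when the goal is to show consistency across many languages.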

the training data as development data. These treebanks represent 66 languages, mostly European languages, but spanning a diversity of language families, including Indo-European, Afro-Asiatic, Uralic, Turkic, Sino-Tibetan, etc. For NER, we train and evaluate Stanza with 12 publicly available datasets covering 8 major languages as shown in Table 3 (Nothman et al., 2013; Tjong Kim Sang and De Meulder, 2003; Tjong Kim Sang, 2002; Benikova et al., 2014; Mohit et al., 2012; Taulé et al., 2008; Weischedel et al., 2013). For the WikiNER corpora, as canonical splits are not available, we randomly split them into 70% training, 15% dev and 15% test splits. For all other corpora we used their canonical splits.

Training. On the Universal Dependencies treebanks, we tuned all hyper-parameters on several large treebanks and applied them to all other treebanks. We used the word2vec embeddings released as part of the 2018 UD Shared Task (Zeman et al., 2018), or the fastText embeddings (Bojanowski et al., 2017) whenever word2vec is not available. For the character-level language models in the NER component, we pretrained them on a mix of the Common Crawl and Wikipedia dumps, and the news corpora released by the WMT19 Shared Task (Barrault et al., 2019), except for English and Chinese, for which we pretrained on the Google One Billion Word (Chelba et al., 2013) and the Chinese Gigaword corpora⁵, respectively. We again applied the same hyper-parameters to models for all languages.

⁵ https://catalog.ldc.upenn.edu/LDC2011T13

Universal Dependencies Results. For performance on UD treebanks, we compared Stanza (v1.0) against UDPipe (v1.2) and spaCy (v2.2) on treebanks of 5 major languages whenever a pretrained model is available. As shown in Table 2, Stanza achieved the best performance on most scores reported. Notably, we find that Stanza's language-agnostic architecture is able to adapt to datasets of different languages and genres. This is also shown by Stanza's high macro-averaged scores over 100 treebanks covering 66 languages.

⁶ https://spacy.io/usage/training#ner Note that, following this public tutorial, we did not use pretrained word embeddings when training spaCy NER models, although using pretrained word embeddings may potentially improve the NER results.

NER Results. For performance of the NER component, we compared Stanza (v1.0) against FLAIR (v0.4.5) and spaCy (v2.2). For spaCy we reported results from its publicly available pretrained model whenever one trained on the same dataset can be found, otherwise we retrained its model on our datasets with default hyper-parameters, following the publicly available tutorial.⁶ For FLAIR, since their downloadable models were pretrained
on dataset versions different from canonical ones, we retrained all models on our own dataset splits with their best reported hyper-parameters. All test results are shown in Table 3. We find that on all datasets Stanza achieved either higher or close F1 scores when compared against FLAIR. When compared to spaCy, Stanza's NER performance is much better. It is worth noting that Stanza's high performance is achieved with much smaller models compared with FLAIR (up to 75% smaller), as we intentionally compressed the models for memory efficiency and ease of distribution.

Language  Corpus      # Types  Stanza  FLAIR  spaCy
Arabic    AQMAR       4        74.3    74.0   –
Chinese   OntoNotes   18       79.2    –      –
Dutch     CoNLL02     4        89.2    90.3   73.8
Dutch     WikiNER     4        94.8    94.8   90.9
English   CoNLL03     4        92.1    92.7   81.0
English   OntoNotes   18       88.8    89.0   85.4*
French    WikiNER     4        92.9    92.5   88.8*
German    CoNLL03     4        81.9    82.5   63.9
German    GermEval14  4        85.2    85.4   68.4
Russian   WikiNER     4        92.9    –      –
Spanish   CoNLL02     4        88.1    87.3   77.5
Spanish   AnCora      4        88.6    88.4   76.1

Table 3: NER performance across different languages and corpora. All scores reported are entity micro-averaged test F1. For each corpus we also list the number of entity types. * marks results from publicly available pretrained models on the same dataset, while others are from models retrained on our datasets.

Speed comparison. We compare Stanza against existing toolkits to evaluate the time it takes to annotate text (see Table 4). For GPU tests we use a single NVIDIA Titan RTX card. Unsurprisingly, Stanza's extensive use of accurate neural models makes it take significantly longer than spaCy to annotate text, but it is still competitive when compared against toolkits of similar accuracy, especially with the help of GPU acceleration.

Task  Stanza CPU  Stanza GPU  UDPipe CPU  FLAIR CPU  FLAIR GPU
UD    10.3×       3.22×       4.30×       –          –
NER   17.7×       1.08×       –           51.8×      1.17×

Table 4: Annotation runtime of various toolkits relative to spaCy (CPU) on the English EWT treebank and OntoNotes NER test sets. For reference, on the compared UD and NER tasks, spaCy is able to process 8140 and 5912 tokens per second, respectively.

5 Conclusion and Future Work

We introduced Stanza, a Python natural language processing toolkit supporting many human languages. We have shown that Stanza's neural pipeline not only has wide coverage of human languages, but also is accurate on all tasks, thanks to its language-agnostic, fully neural architectural design. Simultaneously, Stanza's CoreNLP client extends its functionality with additional NLP tools.

For future work, we consider the following areas of improvement in the near term:

• Models downloadable in Stanza are largely trained on a single dataset. To make models robust to many different genres of text, we would like to investigate the possibility of pooling various sources of compatible data to train "default" models for each language;

• The amount of computation and resources available to us is limited. We would therefore like to build an open "model zoo" for Stanza, so that researchers from outside our group can also contribute their models and benefit from models released by others;

• Stanza was designed to optimize for accuracy of its predictions, but this sometimes comes at the cost of computational efficiency and limits the toolkit's use. We would like to further investigate reducing model sizes and speeding up computation in the toolkit, while still maintaining the same level of accuracy.

• We would also like to expand Stanza's functionality by adding other processors such as neural coreference resolution or relation extraction for richer text analytics.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments, Arun Chaganty for his early contribution to this toolkit, Tim Dozat for his design of the original architectures of the tagger and parser models, Matthew Honnibal and Ines Montani for their help with spaCy integration and helpful comments on the draft, Ranting Guo for the logo design, and John Bauer and the community contributors for their help with maintaining and improving this toolkit. This research is funded in part by Samsung Electronics Co., Ltd. and in part by the SAIL-JD Research Initiative.
References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Association for Computational Linguistics.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A Smith. 2012. Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC'20).

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R Curran. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). European Language Resources Association (ELRA).

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0. Linguistic Data Consortium.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Noëmi Aepli, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, Colin
Batchelor, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson, Bamba Dione, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne Eckhoff, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec, Aline Etienne, Wograine Evelyn, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta González Saavedra, Bernadeta Griciūtė, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Johannes Heinecke, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, An-

Munro, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Adédayọ̀ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Angelika Peljak-Łapińska, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Daria Petrova, Slav Petrov, Jason Phelan, Jussi Piitulainen, Tommi A Pirinen, Emily Pitler, Barbara Plank, Thierry Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Ivan Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Rosca, Olga Rudina, Jack Rueter, Shoval Sadde, Benoît Sagot, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Dage Särg, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Muh Shohibussirri, Dmitry Sichinava, Aline Silveira, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Isabela Soares-Bastos, Carolyn Spadine, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Shingo Suzuki, Zsolt Szántó, Dima Taji, Yuta Takahashi, Fabio Tamburini, Takaaki Tanaka, Isabelle Tellier, Guillaume Thomas, Li-
dre Kaasen, Nadezhda Kabaeva, Sylvain Kahane, isi Torga, Trond Trosterud, Anna Trukhina, Reut
Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka
Tolga Kayadelen, Jessica Kenney, Václava Ket- Urešová, Larraitz Uria, Hans Uszkoreit, Andrius
tnerová, Jesse Kirchner, Elena Klementieva, Arne Utka, Sowmya Vajjala, Daniel van Niekerk, Gert-
Köhn, Kamil Kopacewicz, Natalia Kotsyba, Jolanta jan van Noord, Viktor Varga, Eric Villemonte de la
Kovalevskaitė, Simon Krek, Sookyoung Kwak, Clergerie, Veronika Vincze, Lars Wallin, Abigail
Veronika Laippala, Lorenzo Lambertino, Lucia Walsh, Jing Xian Wang, Jonathan North Washing-
Lam, Tatiana Lando, Septina Dian Larasati, Alexei ton, Maximilan Wendt, Seyi Williams, Mats Wirén,
Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Christian Wittern, Tsegay Woldemariam, Tak-sum
Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Wong, Alina Wróblewska, Mary Yako, Naoki Ya-
Li, Josie Li, Keying Li, KyungTae Lim, Maria Li- mazaki, Chunxiao Yan, Koichi Yasuoka, Marat M.
ovina, Yuan Li, Nikola Ljubešić, Olga Loginova, Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir
Olga Lyashevskaya, Teresa Lynn, Vivien Macke- Zeldes, Manying Zhang, and Hanzhi Zhu. 2019.
tanz, Aibek Makazhanov, Michael Mandl, Christo- Universal Dependencies 2.5. LINDAT/CLARIAH-
pher Manning, Ruli Manurung, Cătălina Mărăn- CZ digital library at the Institute of Formal and Ap-
duc, David Mareček, Katrin Marheinecke, Héc- plied Linguistics (ÚFAL), Faculty of Mathematics
tor Martínez Alonso, André Martins, Jan Mašek, and Physics, Charles University.
Yuji Matsumoto, Ryan McDonald, Sarah McGuin-
ness, Gustavo Mendonça, Niko Miekka, Mar-
garita Misirpashayeva, Anna Missilä, Cătălin Mi-
titelu, Maria Mitrofan, Yusuke Miyao, Simonetta
Montemagni, Amir More, Laura Moreno Romero,
Keiko Sophie Mori, Tomohiko Morioka, Shin-
suke Mori, Shigeki Moro, Bjartur Mortensen,
Bohdan Moskalevskyi, Kadri Muischnek, Robert
