Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
arXiv:2003.07082v2 [cs.CL] 23 Apr 2020
CoreNLP   6   Java    ✓ ✓
FLAIR     12  Python  ✓ ✓ ✓
spaCy     10  Python  ✓ ✓
UDPipe    61  C++     ✓ ✓ ✓
Stanza    66  Python  ✓ ✓ ✓ ✓
Table 1: Feature comparisons of Stanza against other popular natural language processing toolkits.
Table 2: Neural pipeline performance comparisons on the Universal Dependencies (v2.5) test treebanks. For our
system we show macro-averaged results over all 100 treebanks. We also compare our system against UDPipe and
spaCy on treebanks of five major languages where the corresponding pretrained models are publicly available. All
results are F1 scores produced by the 2018 UD Shared Task official evaluation script.
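As a minimal sketch of what "macro-averaged" means in the caption above: each treebank contributes one score, and the scores are averaged with equal weight regardless of treebank size. The function names and the example scores below are hypothetical placeholders, not the official evaluation script and not results from the paper.

```python
from statistics import mean
from typing import Mapping


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores_by_treebank: Mapping[str, float]) -> float:
    """Unweighted mean: every treebank counts equally, whatever its size."""
    return mean(scores_by_treebank.values())


# Hypothetical per-treebank F1 scores (placeholders, not figures from Table 2).
example_scores = {"treebank_a": 90.0, "treebank_b": 80.0, "treebank_c": 70.0}
macro_f1 = macro_average(example_scores)
```

A macro-average treats a small treebank the same as a large one, so a high value indicates consistent performance across treebanks rather than strength on a few large ones.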
the training data as development data. These treebanks represent 66 languages, mostly European languages, but spanning a diversity of language families, including Indo-European, Afro-Asiatic, Uralic, Turkic, Sino-Tibetan, etc. For NER, we train and evaluate Stanza with 12 publicly available datasets covering 8 major languages as shown in Table 3 (Nothman et al., 2013; Tjong Kim Sang and De Meulder, 2003; Tjong Kim Sang, 2002; Benikova et al., 2014; Mohit et al., 2012; Taulé et al., 2008; Weischedel et al., 2013). For the WikiNER corpora, as canonical splits are not available, we randomly split them into 70% training, 15% dev and 15% test splits. For all other corpora we used their canonical splits.

nese Gigaword corpora⁵, respectively. We again applied the same hyper-parameters to models for all languages.

Universal Dependencies Results. For performance on UD treebanks, we compared Stanza (v1.0) against UDPipe (v1.2) and spaCy (v2.2) on treebanks of 5 major languages whenever a pretrained model is available. As shown in Table 2, Stanza achieved the best performance on most scores reported. Notably, we find that Stanza’s language-agnostic architecture is able to adapt to datasets of different languages and genres. This is also shown by Stanza’s high macro-averaged scores over 100 treebanks covering 66 languages.
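The 70% / 15% / 15% random split described for the WikiNER corpora can be sketched as follows. This is only an illustration under stated assumptions: the unit being split (sentences here), the fixed seed, and the helper name random_split are all hypothetical, since the text does not specify them.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T],
    train_frac: float = 0.70,
    dev_frac: float = 0.15,
    seed: int = 0,  # assumption: a fixed seed, chosen here only for reproducibility
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of the items and cut it into train/dev/test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder, about 15% by default
    return train, dev, test


# Placeholder data standing in for a corpus without canonical splits.
sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = random_split(sentences)
```

Giving the test portion whatever remains after the train and dev cuts keeps the three parts disjoint and exhaustive even when the corpus size is not a multiple of 20.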
Speed comparison. We compare Stanza against existing toolkits to evaluate the time it takes to annotate text (see Table 4). For GPU tests we use a single NVIDIA Titan RTX card. Unsurprisingly, Stanza’s extensive use of accurate neural models makes it take significantly longer than spaCy to annotate text, but it is still competitive when compared against toolkits of similar accuracy, especially with the help of GPU acceleration.

5 Conclusion and Future Work

We introduced Stanza, a Python natural language processing toolkit supporting many human languages. We have shown that Stanza’s neural pipeline not only has wide coverage of human languages but is also accurate on all tasks, thanks to its language-agnostic, fully neural architectural design. In addition, Stanza’s CoreNLP client extends its functionality with additional NLP tools.

• We would also like to expand Stanza’s functionality by adding other processors such as neural coreference resolution or relation extraction for richer text analytics.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments, Arun Chaganty for his early contribution to this toolkit, Tim Dozat for his design of the original architectures of the tagger and parser models, Matthew Honnibal and Ines Montani for their help with spaCy integration and helpful comments on the draft, Ranting Guo for the logo design, and John Bauer and the community contributors for their help with maintaining and improving this toolkit. This research is funded in part by Samsung Electronics Co., Ltd. and in part by the SAIL-JD Research Initiative.
References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Association for Computational Linguistics.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A Smith. 2012. Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC’20).

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R Curran. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). European Language Resources Association (ELRA).

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0. Linguistic Data Consortium.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Noëmi Aepli, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, Colin Batchelor, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson, Bamba Dione, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne Eckhoff, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec, Aline Etienne, Wograine Evelyn, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta González Saavedra, Bernadeta Griciūtė, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Johannes Heinecke, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, Andre Kaasen, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jessica Kenney, Václava Kettnerová, Jesse Kirchner, Elena Klementieva, Arne Köhn, Kamil Kopacewicz, Natalia Kotsyba, Jolanta Kovalevskaitė, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Maria Liovina, Yuan Li, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Sarah McGuinness, Gustavo Mendonça, Niko Miekka, Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, Simonetta Montemagni, Amir More, Laura Moreno Romero, Keiko Sophie Mori, Tomohiko Morioka, Shinsuke Mori, Shigeki Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Adédayọ̀ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Angelika Peljak-Łapińska, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Daria Petrova, Slav Petrov, Jason Phelan, Jussi Piitulainen, Tommi A Pirinen, Emily Pitler, Barbara Plank, Thierry Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Ivan Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Rosca, Olga Rudina, Jack Rueter, Shoval Sadde, Benoît Sagot, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Dage Särg, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Muh Shohibussirri, Dmitry Sichinava, Aline Silveira, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Isabela Soares-Bastos, Carolyn Spadine, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Shingo Suzuki, Zsolt Szántó, Dima Taji, Yuta Takahashi, Fabio Tamburini, Takaaki Tanaka, Isabelle Tellier, Guillaume Thomas, Liisi Torga, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius Utka, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Lars Wallin, Abigail Walsh, Jing Xian Wang, Jonathan North Washington, Maximilan Wendt, Seyi Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina Wróblewska, Mary Yako, Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Manying Zhang, and Hanzhi Zhu. 2019. Universal Dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.