Improving Basic Natural Language Processing Tools for the Ainu Language
Abstract
1. Introduction
2. The Ainu Language
(A) | ku-kamuy-panakte [18] | |
“I was punished by the gods” | ||
(B) | kamuy | en-panakte [18] |
“The gods punished me” |
/akon | nispa/ |
a-kor | nispa |
“Our chief” |
Current Situation
3. History of Ainu Language Research
Natural Language Processing for Ainu
POST-AL—Natural Language Processing Tool for the Ainu Language
- Transcription normalization: modification of parts of text that do not conform to modern rules of transcription (e.g., kamui→kamuy);
- Word segmentation (tokenization): a process in which the text is divided into basic meaningful units (referred to as tokens). For writing systems using explicit word delimiters, tokenization is relatively simple. However, in some languages (such as Chinese) word boundaries are not indicated in the surface form, or orthographic words are too coarse-grained and need to be further analyzed—which is the case for many texts written in Ainu;
- Part-of-speech tagging: assigning a part-of-speech marker to each token;
- Morphological analysis (see Ptaszynski et al. [51]);
- Word-to-word translation (into Japanese).
- Modifying character representations of certain sounds, such as ‘ch’→‘c’, ‘sh’→‘s’, ‘ui’→‘uy’, ‘au’→‘aw’;
- Correcting word segmentation. Authors of older transcriptions tended to use fewer word delimiters (texts were divided according to poetic recitation rules, into chunks that often contained multiple syntactic words). For our purposes, this increases the proportion of forms not covered by dictionaries, so some orthographic words have to be split into smaller units.
4. System Description
4.1. Transcription Normalization Algorithm
4.2. Tokenization Algorithm
4.2.1. Input
- With transcription normalization: if the text has to be corrected in terms of transcription, then the word segmentation algorithm takes a list of all possible transcriptions of a given input string, which has been generated in the previous stage;
- Without transcription normalization: if there is no need for transcription normalization, the input only includes one string (the one that has been provided by the user).
4.2.2. Word Segmentation Process
4.3. Part-of-Speech Tagger
- With n-gram based POS disambiguation (as in the original POST-AL system);
- With TF (term frequency) based POS disambiguation;
- N-grams + TF (TF based disambiguation is only applied to cases where n-gram based disambiguation is insufficient).
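The third variant (n-grams first, term frequency only as a fallback) can be illustrated with a minimal sketch. It is not the POST-AL implementation: the function name `disambiguate`, the shape of the two count tables, and the rule that n-gram evidence is "insufficient" when no candidate has a unique highest non-zero count are all assumptions made for illustration.

```python
def disambiguate(prev_token, token, candidates, ngram_counts, tf_counts):
    """Pick one POS tag for `token` out of `candidates`.

    ngram_counts: hypothetical table {(prev_token, token, tag): count}
    tf_counts:    hypothetical table {(token, tag): count}
    """
    # Step 1: score every candidate tag by its n-gram (here: bigram) evidence.
    scores = {tag: ngram_counts.get((prev_token, token, tag), 0)
              for tag in candidates}
    best = max(scores.values())
    winners = [tag for tag in candidates if scores[tag] == best]

    # N-gram evidence is treated as sufficient only if exactly one
    # candidate has the highest, non-zero count (an assumption).
    if best > 0 and len(winners) == 1:
        return winners[0]

    # Step 2: otherwise fall back to term frequency among the tied tags.
    return max(winners, key=lambda tag: tf_counts.get((token, tag), 0))


# Toy counts, invented for this example only.
candidates = ["personal affix", "noun"]
ngram_counts = {("iyosno", "ku", "personal affix"): 2}
tf_counts = {("ku", "noun"): 5, ("ku", "personal affix"): 3}

print(disambiguate("iyosno", "ku", candidates, ngram_counts, tf_counts))
print(disambiguate("wa", "ku", candidates, ngram_counts, tf_counts))
```

In the first call the bigram table settles the choice. In the second call no bigram evidence exists, so the more frequent tag in the toy term-frequency table is returned.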
5. Dictionaries
5.1. Ainu shin-yōshū jiten
5.2. Ainu Conversational Dictionary
5.3. Combined Dictionary
6. Test Data and Gold Standard
- Yukar epics: Five of the thirteen yukar stories (no. 9–13) from the Ainu shin-yōshū [37]. Apart from the original version by Chiri, we also used the variants edited by Kirikae, who manually corrected their transcription and word segmentation according to modern linguistic conventions and included them in the Ainu shin-yōshū jiten [33]. The modernized version comprises a total of 1608 tokens. Later we refer to this dataset as “Y9–13”. In the POS tagging experiment, we used only a subset of this data, namely story no. 10: Pon Okikirmuy yayeyukar “kutnisa kutunkutun” [“Kutnisa kutunkutun”, a song Pon Okikirmuy sang], which has 189 tokens; later we refer to it as “Y10”. A fragment is presented in Table 5.
- Samples from the Ainu Conversational Dictionary: Sixty-two sentences (428 tokens) from A Talking Dictionary of Ainu: A New Version of Kanazawa’s Ainu Conversational Dictionary, which were excluded from the training data (see Section 5.2). Apart from the original text by Jinbō and Kanazawa [23], we also used the modernized version by Bugaeva et al. [52]. Later we refer to this dataset as “JK samples”. A fragment is shown in Table 6.
- Shibatani’s colloquial text samples: Both datasets mentioned above are either obtained directly from one of the dictionaries applied as the training data for our system (Ainu Conversational Dictionary) or from the compilation of yukar stories on which one of these dictionaries was based (Ainu shin-yōshū). To investigate the performance with texts unrelated to the system’s training data, we decided to apply other datasets as well. As the first one we used a colloquial text sample included in The Languages of Japan [17], namely a fragment (154 tokens) of Kura Sunasawa’s memoirs written in the Ainu language, Ku sukup oruspe (“My life story”) [59], transcribed according to modern linguistic rules. Later we refer to this dataset as “Shib.”;
- Mukawa dialect samples: We also used a sample (11 sentences, 87 tokens) from the Japanese–Ainu Dictionary for the Mukawa Dialect of Ainu [40]. The dictionary, which contains 6284 entries, is a transcribed version of audio materials that Tatsumine Katayama recorded between 1996 and 2002 with two native speakers of the Mukawa dialect of Ainu, Seino Araida and Fuyuko Yoshimura. Later we refer to this dataset as “Muk.”
6.1. Test Data for Transcription Normalization
- Original (“O”): Original texts by Chiri or Jinbō and Kanazawa, without any modifications;
- Original, with spaces removed (“O-SR”): Original texts by Chiri or Jinbō and Kanazawa, preprocessed by removing any word segmentation (whitespaces) from each line.
6.2. Test Data for Tokenization
- Modern transcription, spaces removed (“M-SR”): The first variant includes all four datasets. In the case of Y9–13 and JK samples, modernized versions by Kirikae and Bugaeva et al. were used. Each line of text was preprocessed by removing whitespaces;
- Original spaces, modernized transcription (“O/M”): In this variant, only two datasets were used: the Y9–13 and JK samples. We retained the word segmentation (usage of whitespaces or lack thereof) of the original texts [23,37]. However, in order to prevent differences between transcription rules applied in original and modernized texts from affecting word segmentation experiment results, the texts were preprocessed by unifying their transcription with the modern (gold standard) versions. Table 7 shows to what extent the word segmentation of original texts is consistent with modernized versions by Kirikae and Bugaeva et al. (the evaluation method is explained in Section 7).
- Original transcription: unnukar awa kor wenpuri enantui ka;
- Modern transcription (gold standard): un nukar a wa kor wen puri enan tuyka;
- Modern transcription, spaces removed: unnukarawakorwenpurienantuyka;
- Modern transcription, original word segmentation: unnukar awa kor wenpuri enantuy ka;
- Meaning: “When she found me, her face [took] the color of anger.”
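The “spaces removed” variants in the example above are produced by deleting all whitespace from a line. A minimal sketch of that preprocessing step follows; the helper name `remove_spaces` is hypothetical and is not taken from the system described here.

```python
def remove_spaces(line: str) -> str:
    """Delete every whitespace character, leaving one unsegmented string."""
    return "".join(line.split())


gold = "un nukar a wa kor wen puri enan tuyka"   # modern transcription (gold standard)
m_sr = remove_spaces(gold)                        # "M-SR" style input for the tokenizer

print(m_sr)               # unnukarawakorwenpurienantuyka
print(len(gold.split()))  # number of gold-standard tokens in this line
print(len(m_sr))          # characters excluding spaces
```

The same function applied to an original-transcription line yields the “O-SR” style input used for transcription normalization.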
6.3. Test Data for POS Tagging
7. Evaluation Methods
7.1. Balanced F-Score
7.2. Evaluation of Transcription Normalization
7.3. Evaluation of Tokenization
7.4. Evaluation of Part-of-Speech Tagging
8. Results and Discussion
8.1. Transcription Normalization
8.2. Tokenization
8.3. Part-of-Speech Tagging
9. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- UNESCO Ad Hoc Expert Group on Endangered Languages. Language Vitality and Endangerment. Available online: http://www.unesco.org/culture/ich/doc/src/00120-EN.pdf (accessed on 23 December 2017).
- Kazeminejad, G.; Cowell, A.; Hulden, M. Creating lexical resources for polysynthetic languages: The case of Arapaho. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, Honolulu, HI, USA, 2–5 March 2017; Association for Computational Linguistics: Honolulu, HI, USA, 2017; pp. 10–18.
- Anastasopoulos, A.; Lekakou, M.; Quer, J.; Zimianiti, E.; DeBenedetto, J.; Chiang, D. Part-of-Speech Tagging on an Endangered Language: A Parallel Griko-Italian Resource. arXiv 2018, arXiv:1806.03757.
- Besacier, L.; Barnard, E.; Karpov, A.; Schultz, T. Automatic speech recognition for under-resourced languages: A survey. Speech Commun. 2014, 56, 85–100.
- Ruokolainen, T.; Kohonen, O.; Virpioja, S.; Kurimo, M. Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, 8–9 August 2013; Association for Computational Linguistics: Sofia, Bulgaria, 2013; pp. 29–37.
- Abney, S.; Bird, S. The Human Language Project: Building a Universal Corpus of the World’s Languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; Association for Computational Linguistics: Uppsala, Sweden, 2010; pp. 88–97.
- Bird, S.; Chiang, D. Machine Translation for Language Preservation. In Proceedings of the COLING 2012: Posters, Mumbai, India, 8–15 December 2012; The COLING 2012 Organizing Committee: Mumbai, India, 2012; pp. 125–134.
- Blokland, R.; Fedina, M.; Gerstenberger, C.; Partanen, N.; Rießler, M.; Wilbur, J.D. Language Documentation meets Language Technology. In Proceedings of the 1st International Workshop in Computational Linguistics for Uralic Languages (IWCLUL 2015), Tromsø, Norway, 16 January 2015.
- Blokland, R.; Partanen, N.; Rießler, M.; Wilbur, J. Using Computational Approaches to Integrate Endangered Language Legacy Data into Documentation Corpora: Past Experiences and Challenges Ahead. In Proceedings of the 3rd Workshop on Computational Methods for Endangered Languages, Honolulu, HI, USA, 26–27 February 2019; Volume 2, pp. 24–30.
- Gerstenberger, C.; Partanen, N.; Rießler, M.; Wilbur, J. Utilizing Language Technology in the Documentation of Endangered Uralic Languages. North. Eur. J. Lang. Technol. 2016, 4, 29–47.
- Gerstenberger, C.; Partanen, N.; Rießler, M.; Wilbur, J. Instant Annotations: Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora. In Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages; Association for Computational Linguistics: St. Petersburg, Russia, 2017; pp. 25–36.
- Berger, K.C.; Hernaiz, A.G.; Baroni, P.; Hicks, D.; Kruse, E.; Quochi, V.; Russo, I.; Salonen, T.; Sarhimaa, A.; Soria, C. Digital Language Survival Kit: The DLDP Recommendations to Improve Digital Vitality; 2018; Available online: http://wp.dldp.eu/wp-content/uploads/2018/09/Digital-Language-Survival-Kit.pdf (accessed on 19 October 2019).
- Lewis, M.; Simons, G.; Fennig, C. (Eds.) Ethnologue: Languages of the World, 19th ed.; SIL International: Dallas, TX, USA, 2016.
- Ptaszynski, M.; Momouchi, Y. Part-of-Speech Tagger for Ainu Language Based on Higher Order Hidden Markov Model. Expert Syst. Appl. 2012, 39, 11576–11582.
- Majewicz, A. Ajnu. Lud, jego język i tradycja ustna; [Ainu. The people, its language and oral tradition]; Wydawnictwo Naukowe UAM: Poznań, Poland, 1984.
- Bugaeva, A. Southern Hokkaido Ainu. In The languages of Japan and Korea; Tranter, N., Ed.; Routledge: London, UK, 2012; pp. 461–509.
- Shibatani, M. The languages of Japan; Cambridge University Press: London, UK, 1990.
- Satō, T. Ainugo Chitose hōgen ni okeru meishi-hōgō: Sono shurui to kanren sho-kisoku [Noun incorporation in the Chitose dialect of Ainu: Types and related rules]. Hokkaidō-Ritsu Ainu Minzoku Bunka Kenkyū Sentā Kenkyū Kiyō/Bulletin of the Hokkaido Ainu Culture Research Center 2012, 18, 1–32.
- Hattori, S. Ainugo hōgen jiten; [Dictionary of Ainu dialects]; Iwanami Shoten: Tōkyō, Japan, 1964.
- Refsing, K. The Ainu Language. The Morphology and Syntax of the Shizunai Dialect; Aarhus University Press: Aarhus, Denmark, 1986.
- Hokkaidō Government, Environment and Lifestyle Section. Hokkaidō Ainu seikatsu jittai chōsa hōkokusho; [Report of the Survey on the Hokkaidō Ainu actual living conditions]; 2013. Available online: http://www.pref.hokkaido.lg.jp/ks/ass/ainu_living_conditions_survey.pdf (accessed on 23 October 2019).
- Uehara, K.; Abe, C. Ezo hōgen moshiogusa; [Ezo dialect dictionary]; 1804.
- Jinbō, K.; Kanazawa, S. Ainugo kaiwa jiten; [Ainu conversational dictionary]; Kinkōdō Shoseki: Tōkyō, Japan, 1898.
- Dobrotvorskij, M. Ainsko-russkij Slovar; [Ainu-Russian Dictionary]; V Universitetskoj tipografii: Kazan, Russia, 1875.
- Batchelor, J. E-wa-ei santsui jisho. An Ainu–English–Japanese Dictionary and Grammar; Hokkaidō-chō: Sapporo, Japan, 1889.
- Radliński, I. Słownik narzecza Ainów zamieszkujących wyspę Szumszu w łańcuchu Kurylskim przy Kamczatce, ze zbiórow Prof. B. Dybowskiego [A dictionary of the dialect of Ainu inhabiting the Shumshu Island in the Kurile Archipelago near Kamchatka, collected by prof. B.Dybowski]. In Słowniki Narzeczy ludów Kamczackich; Nakładem Akademii Umiejętności: Cracow, Poland, 1891; Volume 1.
- Chiri, M. Bunrui Ainu-go jiten. Dai-ikkan: Shokubutsu-hen; [Dictionary of Ainu, vol. I: Plants]; Nihon Jōmin Bunka Kenkyūsho: Tōkyō, Japan, 1953.
- Chiri, M. Bunrui Ainu-go jiten. Dai-sankan: Ningen-hen; [Dictionary of Ainu, vol. III: People]; Nihon Jōmin Bunka Kenkyūsho: Tōkyō, Japan, 1954.
- Chiri, M. Bunrui Ainu-go jiten. Dai-nikan: Dōbutsu-hen; [Dictionary of Ainu, vol. II: Animals]; Nihon Jōmin Bunka Kenkyūsho: Tōkyō, Japan, 1962.
- Nakagawa, H. Ainugo Chitose Hōgen Jiten; [Dictionary of the Chitose Dialect of Ainu]; Sōfūkan: Tōkyō, Japan, 1995.
- Tamura, S. Ainugo jiten: Saru hōgen. The Ainu-Japanese Dictionary: Saru dialect; Sōfūkan: Tōkyō, Japan, 1996.
- Kayano, S. Kayano Shigeru no Ainugo jiten; [Shigeru Kayano’s Ainu dictionary]; Sanseidō: Tōkyō, Japan, 1996.
- Kirikae, H. Ainu shin-yōshū jiten: Tekisuto, bumpō kaisetsu tsuki; [Lexicon to Yukie Chiri’s Ainu Shin-yōshū with text and grammatical notes]; Daigaku Shorin: Tōkyō, Japan, 2003.
- Majewicz, A. (Ed.) The Collected Works of Bronisław Piłsudski; Mouton de Gruyter: Berlin, Germany, 1998–2004; Volumes 1–3.
- Kindaichi, K. Ainu Jojishi, Yūkara no Kenkyū; [Studies of yukar, the Ainu epics]; Tōyō Bunko: Tōkyō, Japan, 1931.
- Kindaichi, K.; Kannari, M. Ainu Jojishi Yūkara-shū; [Collection of yukar, the Ainu epics], vols. 1–8; Sanseidō: Tōkyō, Japan, 1959–1968.
- Chiri, Y. Ainu shin-yōshū; [Collection of Ainu mythic epics]; Kyōdo Kenkyūsha: Tōkyō, Japan, 1923.
- Kayano, S. Kayano Shigeru no Ainu shinwa shūsei; [A collection of Ainu myths by Shigeru Kayano] (vols. 1–10); Heibonsha: Tōkyō, Japan, 1998.
- National Institute for Japanese Language and Linguistics. A Topical Dictionary of Conversational Ainu. Available online: http://ainutopic.ninjal.ac.jp (accessed on 25 August 2017).
- Chiba University Graduate School of Humanities and Social Sciences. Ainugo Mukawa Hōgen Nihongo–Ainugo Jiten [Japanese–Ainu Dictionary for the Mukawa Dialect of Ainu]. Available online: http://cas-chiba.net/Ainu-archives/index.html (accessed on 25 February 2017).
- The Ainu Museum. Ainu-go Ākaibu [Ainu Language Archive]. Available online: http://ainugo.ainu-museum.or.jp/ (accessed on 25 August 2018).
- Katō, D.; Echizen’ya, H.; Araki, K.; Momouchi, Y.; Tochinai, K. Automatic Construction of the Bilingual Words Dictionary for Ainu-to-Japanese Using Recursive Chain-link-type Learning. In Proceedings of the 1st Forum on Information Technology, Tokyo, Japan, 25–28 September 2002; pp. 179–180.
- Momouchi, Y. Incremental Direct Translation of Noun Phrases of the Ainu Language to Japanese. IPSJ SIG Tech. Rep. 2002, 162, 79–86.
- Echizen’ya, H.; Araki, K.; Momouchi, Y. Automatic extraction of bilingual word pairs using Local Focus-based Learning from an Ainu-Japanese parallel corpus. Bull. Fac. Eng. Hokkai-Gakuen Univ. 2005, 32, 41–63.
- Azumi, Y.; Momouchi, Y. Development of analysis tool for hierarchical Ainu-Japanese translation data. Bull. Fac. Eng. Hokkai-Gakuen Univ. 2009, 36, 175–193.
- Azumi, Y.; Momouchi, Y. Development of tools for retrieving and analyzing Ainu-Japanese translation data and their applications to Ainu-Japanese machine translation system. Eng. Res. Bull. Grad. Sch. Eng. Hokkai-Gakuen Univ. 2009, 9, 37–58.
- Momouchi, Y.; Azumi, Y.; Kadoya, Y. Research note: Construction and utilization of electronic data for “Ainu shin-yōsyū”. Bull. Fac. Eng. Hokkai-Gakuen Univ. 2008, 35, 159–171.
- Momouchi, Y.; Kobayashi, R. Dictionaries and analysis tools for the componential analysis of ainu place name. Eng. Res. Bull. Grad. Sch. Eng. Hokkai-Gakuen Univ. 2010, 10, 39–49.
- Senuma, H.; Aizawa, A. Toward Universal Dependencies for Ainu. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), Gothenburg, Sweden, 22 May 2017; pp. 133–139.
- Senuma, H.; Aizawa, A. Universal Dependencies for Ainu. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; pp. 2354–2358.
- Ptaszynski, M.; Mukaichi, K.; Momouchi, Y. NLP for Endangered Languages: Morphology Analysis, Translation Support and Shallow Parsing of Ainu Language. In Proceedings of the 19th Annual Meeting of The Association for Natural Language Processing, Nagoya, Japan, 12–15 March 2013; pp. 418–421.
- Bugaeva, A.; Endō, S.; Kurokawa, S.; Nathan, D. A Talking Dictionary of Ainu: A New Version of Kanazawa’s Ainu Conversational Dictionary. Available online: http://lah.soas.ac.uk/projects/ainu/ (accessed on 25 November 2015).
- Kirikae, H. Ainu ni yoru Ainugo hyōki [transcription of the Ainu language by Ainu people]. Koku-bungaku kaishaku to kanshō 1997, 62, 99–107.
- Nakagawa, H. Ainu-jin ni yoru Ainugo hyōki e no torikumi [efforts to transcribe the Ainu language by Ainu people]. In Hyōki no shūkan no nai gengo no hyōki = Writing Unwritten Languages; Shiohara, A., Kodama, S., Eds.; Tōkyō gaikoku-go daigaku, Ajia/Afurika gengo bunka kenkyūjo [The Research Institute for Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies]: Tōkyō, Japan, 2006; pp. 1–44.
- Endō, S. Nabesawa Motozō ni yoru Ainugo no kana hyōki taikei: Kokuritsu Minzoku-gaku Hakubutsukan shozō hitsu-roku nōto kara [Ainu language notation method used by Motozō Nabesawa: From the written notes held by the National Museum of Ethnology]. Kokuritsu Minzoku-Gaku Hakubutsukan Chōsa Hōkoku 2016, 134, 41–66.
- Ptaszynski, M.; Ito, Y.; Nowakowski, K.; Honma, H.; Nakajima, Y.; Masui, F. Combining Multiple Dictionaries to Improve Tokenization of Ainu Language. In Proceedings of the 31st Annual Conference of the Japanese Society for Artificial Intelligence, Nagoya City, Japan, 23–26 May 2017.
- Ptaszynski, M.; Nowakowski, K.; Momouchi, Y.; Masui, F. Comparing Multiple Dictionaries to Improve Part-of-Speech Tagging of Ainu Language. In Proceedings of the 22nd Annual Meeting of The Association for Natural Language Processing, Sendai, Japan, 7–11 March 2016; pp. 973–976.
- Bugaeva, A. Internet applications for endangered languages: A talking dictionary of Ainu. Waseda Inst. Adv. Study Res. Bull. 2011, 52, 73–81.
- Sunasawa, K. Ku Sukup Oruspe; [My life story]; Miyama Shobō: Sapporo, Japan, 1983.
- Peterson, B. Project Okikirmui. The Complete Ainu Legends of Chiri Yukie, in English. Available online: http://www.okikirmui.com/ (accessed on 17 September 2017).
- Ringger, E.; McClanahan, P.; Haertel, R.; Busby, G.; Carmen, M.; Carroll, J.; Seppi, K.; Lonsdale, D. Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation. In Proceedings of the Linguistic Annotation Workshop; Association for Computational Linguistics: Prague, Czech Republic, 28–29 June 2007; pp. 101–108.
- Giménez, J.; Márquez, L. SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), European Language Resources Association (ELRA), Lisbon, Portugal, 26–28 May 2004.
- Toutanova, K.; Klein, D.; Manning, C.; Singer, Y. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), Edmonton, Canada, 27 May–1 June 2003; pp. 252–259.
Original Transcription | ch | sh(i) | ai | ui | ei | oi | au | iu | eu | ou | mb | b | g | d |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Modern Transcription | c | s | ay | uy | ey | oy | aw | iw | ew | ow | np | p | k | t |
Input String | List of Output Strings | Meaning |
---|---|---|
chepshuttuye | cepsuttuye | “to exterminate fish” |
 | cepshuttuye | |
 | chepsuttuye | |
 | chepshuttuye | |
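The table above lists every string that results from applying, or not applying, each substitution rule at each place where it matches. A minimal sketch of such candidate generation follows. The rule set is copied from the transcription table in this paper, with “sh(i)” simplified to “sh” as an assumption; the function name and the recursive enumeration are illustrative and are not claimed to match the actual implementation.

```python
# Substitution rules (original -> modern), taken from the transcription table.
# Treating "sh(i)" as plain "sh" is a simplifying assumption.
RULES = {
    "ch": "c", "sh": "s",
    "ai": "ay", "ui": "uy", "ei": "ey", "oi": "oy",
    "au": "aw", "iu": "iw", "eu": "ew", "ou": "ow",
    "mb": "np", "b": "p", "g": "k", "d": "t",
}


def candidate_transcriptions(text):
    """Return all strings obtained by optionally applying each matching rule."""
    results = set()

    def expand(pos, prefix):
        if pos == len(text):
            results.add(prefix)
            return
        # Option 1: keep the current character unchanged.
        expand(pos + 1, prefix + text[pos])
        # Option 2: apply any rule whose left-hand side starts at this position.
        for old, new in RULES.items():
            if text.startswith(old, pos):
                expand(pos + len(old), prefix + new)

    expand(0, "")
    return sorted(results)


print(candidate_transcriptions("chepshuttuye"))
# ['cepshuttuye', 'cepsuttuye', 'chepshuttuye', 'chepsuttuye']
```

The number of candidates grows exponentially with the number of positions where a rule matches, which is acceptable for short orthographic words but would need pruning for long unsegmented lines.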
Input | Shortest Sequence of Tokens to Match |
---|---|
cepsuttuye | cep sut tuye |
cepshuttuye | |
chepsuttuye | |
chepshuttuye |
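The table above pairs a list of candidate transcriptions with the shortest sequence of dictionary tokens that matches one of them. A minimal sketch of this idea uses dynamic programming over a word list. The small `LEXICON` set is a hypothetical stand-in for the dictionaries used in the paper, and both function names are assumptions made for illustration.

```python
def shortest_segmentation(text, lexicon):
    """Split `text` into the fewest lexicon words, or return None if impossible."""
    max_len = max(len(word) for word in lexicon)
    best = [None] * (len(text) + 1)   # best[i]: fewest-token split of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is None or piece not in lexicon:
                continue
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[-1]


def tokenize(candidates, lexicon):
    """Segment every candidate string and keep the shortest token sequence."""
    found = [shortest_segmentation(c, lexicon) for c in candidates]
    found = [tokens for tokens in found if tokens is not None]
    return min(found, key=len) if found else None


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"cep", "sut", "tuye", "keke", "hetak", "ci", "ki", "ciki", "kusne", "na"}

candidates = ["cepsuttuye", "cepshuttuye", "chepsuttuye", "chepshuttuye"]
print(tokenize(candidates, LEXICON))
# ['cep', 'sut', 'tuye']
```

Preferring the fewest tokens also means that a longer dictionary form wins over a sequence of shorter ones: with this toy lexicon, "ciki" is kept as one token instead of being split into "ci" and "ki".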
Input Token | Strings Generated by the Transcription Normalization Algorithm | Gold Standard Transcription and Word Segmentation | Meaning |
---|---|---|---|
setautar | setawtar | seta utar | “dogs” |
 | setautar | | |
Shineantota petetok un shinotash kushu payeash awa |
Sine an to ta petetok un sinot as kusu paye as a wa |
Meaning: “One day when I went for a trip up the river” |
Tambe makanak an chiki pirika? |
Tanpe makanak an ciki pirka? |
Meaning: “What should I do about it?” |
Metric | Y9–13 | JK Samples | Overall |
---|---|---|---|
Precision | 0.998 | 0.949 | 0.983 |
Recall | 0.609 | 0.918 | 0.674 |
F-score | 0.756 | 0.933 | 0.800 |
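The F-score row in the table above is consistent with the balanced F-score, that is, the harmonic mean of precision and recall. The short check below recomputes it from the precision and recall values reported in the table; the function name `f_score` is hypothetical.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table above.
reported = {
    "Y9-13": (0.998, 0.609),
    "JK samples": (0.949, 0.918),
    "Overall": (0.983, 0.674),
}

for name, (p, r) in reported.items():
    print(name, round(f_score(p, r), 3))
```

Rounded to three decimals, the results agree with the reported values of 0.756, 0.933 and 0.800.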
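Data | Variant | Characters (Excluding Spaces) | Tokens |
---|---|---|---|
Ainu shin-yōshū, Y9–13 | O* | 6883 | 1076 |
Ainu shin-yōshū, Y9–13 | O-SR** | 6883 | N/A |
Ainu shin-yōshū, Y9–13 | M-SR*** | 6501 | N/A |
Ainu shin-yōshū, Y9–13 | O/M**** | 6501 | 1076 |
Ainu shin-yōshū, Y9–13 | Kirikae [33] | 6501 | 1608 |
Ainu shin-yōshū, Y10 | Kirikae [33] | 822 | 189 |
Ainugo Kaiwa Jiten / A Talking Dictionary of Ainu... (JK samples) | O | 1742 | 418 |
Ainugo Kaiwa Jiten / A Talking Dictionary of Ainu... (JK samples) | O-SR | 1742 | N/A |
Ainugo Kaiwa Jiten / A Talking Dictionary of Ainu... (JK samples) | M-SR | 1617 | N/A |
Ainugo Kaiwa Jiten / A Talking Dictionary of Ainu... (JK samples) | O/M | 1617 | 416 |
Ainugo Kaiwa Jiten / A Talking Dictionary of Ainu... (JK samples) | Bugaeva et al. [52] | 1617 | 428 |
Shibatani’s colloquial text samples (Shib.) | Shibatani [17] | 583 | 154 |
Mukawa dialect samples (Muk.) | Chiba University... [40] | 341 | 87 |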
JK | KK | Tamura (1996) | Nakagawa (1995) | |
---|---|---|---|---|
(complete verb) | (complete verb) | (complete verb) | → | (complete verb) |
(intransitive verb) | (intransitive verb) | (intransitive verb) | → | (intransitive verb) |
(transitive verb) | (transitive verb) | (transitive verb) | → | (transitive verb) |
(ditransitive verb) | (ditransitive verb) | (ditransitive verb) | → | (ditransitive verb) |
(personal pronoun) | (personal pronoun) | → | (pronoun) | |
(demonstrative pronoun) | → | (pronoun) | ||
(interrogative indefinite pronoun) | (interrogative pronoun) | → | (interrogative) | |
(interrogative indefinite adverb) | (interrogative adverb) | → | (interrogative) | |
(demonstrative adverb) | → | (adverb) | ||
(postpositive adverb) | (postpositive adverb) | (postpositive adverb) | → | (adverb) |
(demonstrative prenoun adjectival) | → | (prenoun adjectival) | ||
(postposition) | → | (case particle) | ||
(nominal particle) | → | (noun) |
Nakagawa (1995) | Simplified Standard | |
---|---|---|
/ / / (complete verb / intransitive verb / transitive verb / ditransitive verb) | → | (verb) |
/ / / (proper noun / pronoun / locative noun / expletive noun) | → | (noun) |
/ / / (case particle / conjunctive particle / adverbial particle / final particle) | → | (particle) |
/ / (personal affix / prefix / suffix) | → | (affix) |
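The reduction to the simplified standard shown in the table above amounts to a lookup table. The sketch below uses the English category names as keys, because the original tag symbols did not survive in this copy of the table; leaving unlisted categories unchanged is an assumption.

```python
# Mapping from Nakagawa (1995) categories to the simplified standard,
# keyed by the English glosses given in the table above.
SIMPLIFIED = {
    "complete verb": "verb",
    "intransitive verb": "verb",
    "transitive verb": "verb",
    "ditransitive verb": "verb",
    "proper noun": "noun",
    "pronoun": "noun",
    "locative noun": "noun",
    "expletive noun": "noun",
    "case particle": "particle",
    "conjunctive particle": "particle",
    "adverbial particle": "particle",
    "final particle": "particle",
    "personal affix": "affix",
    "prefix": "affix",
    "suffix": "affix",
}


def simplify(tag: str) -> str:
    """Map a category to the simplified standard; unlisted tags pass through."""
    return SIMPLIFIED.get(tag, tag)


tags = ["adverb", "personal affix", "transitive verb", "final particle"]
print([simplify(t) for t in tags])
# ['adverb', 'affix', 'verb', 'particle']
```

Scoring against the simplified standard then only requires applying `simplify` to both the system output and the gold-standard tags before comparing them.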
Dictionary | Metric | Y9–13 | JK Samples | Overall | Input Text Version |
---|---|---|---|---|---|
JK | Precision | 0.871 | 0.942 | 0.885 | O-SR |
JK | Recall | 0.897 | 0.658 | 0.833 | O-SR |
JK | F-score | 0.884 | 0.775 | 0.859 | O-SR |
JK | Precision | 0.890 | 0.956 | 0.903 | O |
JK | Recall | 0.897 | 0.658 | 0.833 | O |
JK | F-score | 0.893 | 0.780 | 0.867 | O |
KK | Precision | 0.967 | 0.899 | 0.954 | O-SR |
KK | Recall | 0.966 | 0.628 | 0.876 | O-SR |
KK | F-score | 0.966 | 0.740 | 0.913 | O-SR |
KK | Precision | 0.980 | 0.926 | 0.969 | O |
KK | Recall | 0.958 | 0.628 | 0.871 | O |
KK | F-score | 0.969 | 0.749 | 0.917 | O |
JK+KK | Precision | 0.953 | 0.942 | 0.951 | O-SR |
JK+KK | Recall | 0.964 | 0.658 | 0.883 | O-SR |
JK+KK | F-score | 0.958 | 0.775 | 0.916 | O-SR |
JK+KK | Precision | 0.971 | 0.956 | 0.968 | O |
JK+KK | Recall | 0.958 | 0.658 | 0.879 | O |
JK+KK | F-score | 0.964 | 0.780 | 0.921 | O |
Dictionary | Metric | Y9–13 | JK Samples | Y9–13 + JK Samples | Shib. + Muk. | Overall | Input Text Version |
---|---|---|---|---|---|---|---|
JK | Precision | 0.575 | 0.935 | 0.634 | 0.742 | 0.644 | M-SR |
JK | Recall | 0.772 | 0.907 | 0.801 | 0.808 | 0.801 | M-SR |
JK | F-score | 0.659 | 0.921 | 0.708 | 0.774 | 0.714 | M-SR |
JK | Precision | 0.652 | 0.933 | 0.700 | n/a | n/a | O/M |
JK | Recall | 0.894 | 0.984 | 0.913 | n/a | n/a | O/M |
JK | F-score | 0.754 | 0.957 | 0.792 | n/a | n/a | O/M |
KK | Precision | 0.921 | 0.703 | 0.867 | 0.649 | 0.838 | M-SR |
KK | Recall | 0.889 | 0.842 | 0.879 | 0.822 | 0.873 | M-SR |
KK | F-score | 0.905 | 0.766 | 0.873 | 0.726 | 0.855 | M-SR |
KK | Precision | 0.950 | 0.772 | 0.904 | n/a | n/a | O/M |
KK | Recall | 0.944 | 0.981 | 0.952 | n/a | n/a | O/M |
KK | F-score | 0.947 | 0.864 | 0.928 | n/a | n/a | O/M |
JK+KK | Precision | 0.905 | 0.943 | 0.913 | 0.776 | 0.896 | M-SR |
JK+KK | Recall | 0.854 | 0.896 | 0.863 | 0.860 | 0.863 | M-SR |
JK+KK | F-score | 0.879 | 0.919 | 0.887 | 0.816 | 0.879 | M-SR |
JK+KK | Precision | 0.939 | 0.932 | 0.937 | n/a | n/a | O/M |
JK+KK | Recall | 0.919 | 0.975 | 0.931 | n/a | n/a | O/M |
JK+KK | F-score | 0.929 | 0.953 | 0.934 | n/a | n/a | O/M |
Input: | kekehetakcepsuttuyecikikusnena |
Tokenizer output: | keke hetak cep sut tuye ciki kusne na |
Gold standard: | keke hetak cep sut tuye ci ki kusne na |
Meaning: | “Now I’m going to show you how to make fish extinct” [60] |
Dictionary | Metric | Y10 (Nakagawa) | Y10 (Simplified) | JK Samples (Nakagawa) | JK Samples (Simplified) | Average (Nakagawa) | Average (Simplified) | N-Grams | Term Frequency |
---|---|---|---|---|---|---|---|---|---|
JK | Precision | 0.771 | 0.786 | 0.965 | 0.974 | 0.868 | 0.880 | NO | YES |
JK | Recall | 0.540 | 0.551 | 0.965 | 0.974 | 0.753 | 0.763 | NO | YES |
JK | F-score | 0.635 | 0.648 | 0.965 | 0.974 | 0.800 | 0.811 | NO | YES |
JK | Precision | 0.702 | 0.718 | 0.967 | 0.972 | 0.835 | 0.845 | YES | NO |
JK | Recall | 0.492 | 0.503 | 0.967 | 0.972 | 0.730 | 0.738 | YES | NO |
JK | F-score | 0.579 | 0.592 | 0.967 | 0.972 | 0.773 | 0.782 | YES | NO |
JK | Precision | 0.794 | 0.809 | 0.977 | 0.981 | 0.886 | 0.895 | YES | YES |
JK | Recall | 0.556 | 0.567 | 0.977 | 0.981 | 0.767 | 0.774 | YES | YES |
JK | F-score | 0.654 | 0.667 | 0.977 | 0.981 | 0.816 | 0.824 | YES | YES |
KK | Precision | 0.821 | 0.859 | 0.713 | 0.763 | 0.767 | 0.811 | NO | YES |
KK | Recall | 0.807 | 0.845 | 0.563 | 0.603 | 0.685 | 0.724 | NO | YES |
KK | F-score | 0.814 | 0.852 | 0.629 | 0.674 | 0.722 | 0.763 | NO | YES |
KK | Precision | 0.853 | 0.886 | 0.666 | 0.737 | 0.760 | 0.812 | YES | NO |
KK | Recall | 0.840 | 0.872 | 0.526 | 0.582 | 0.683 | 0.727 | YES | NO |
KK | F-score | 0.847 | 0.879 | 0.588 | 0.650 | 0.717 | 0.765 | YES | NO |
KK | Precision | 0.859 | 0.891 | 0.728 | 0.790 | 0.794 | 0.841 | YES | YES |
KK | Recall | 0.845 | 0.877 | 0.575 | 0.624 | 0.710 | 0.751 | YES | YES |
KK | F-score | 0.852 | 0.884 | 0.643 | 0.697 | 0.747 | 0.791 | YES | YES |
JK+KK | Precision | 0.855 | 0.876 | 0.960 | 0.970 | 0.908 | 0.923 | NO | YES |
JK+KK | Recall | 0.850 | 0.872 | 0.960 | 0.970 | 0.905 | 0.921 | NO | YES |
JK+KK | F-score | 0.853 | 0.874 | 0.960 | 0.970 | 0.906 | 0.922 | NO | YES |
JK+KK | Precision | 0.866 | 0.892 | 0.942 | 0.949 | 0.904 | 0.921 | YES | NO |
JK+KK | Recall | 0.861 | 0.888 | 0.942 | 0.949 | 0.902 | 0.919 | YES | NO |
JK+KK | F-score | 0.864 | 0.890 | 0.942 | 0.949 | 0.903 | 0.920 | YES | NO |
JK+KK | Precision | 0.882 | 0.903 | 0.977 | 0.981 | 0.930 | 0.942 | YES | YES |
JK+KK | Recall | 0.877 | 0.898 | 0.977 | 0.981 | 0.927 | 0.940 | YES | YES |
JK+KK | F-score | 0.880 | 0.901 | 0.977 | 0.981 | 0.928 | 0.941 | YES | YES |
POST-AL output: | iyosno ku hosipire kusne na |
[Adverb Personal affix Transitive verb Aux. verb Final particle] | |
[‘the end’/‘later’ ‘I’/‘my’ ‘return’ ‘intend’ EMPHASIS] | |
Meaning: | “I’ll return it later” |
Type of Error | Count |
---|---|
Tagger (disambiguation error) | 8 (35%) |
Dictionary (out-of-vocabulary item) | 1 (4%) |
POS classification (the same word, but different tag) | 14 (61%) |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Nowakowski, K.; Ptaszynski, M.; Masui, F.; Momouchi, Y. Improving Basic Natural Language Processing Tools for the Ainu Language. Information 2019, 10, 329. https://doi.org/10.3390/info10110329