2020 Wildre-1 8
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC
Abstract
Treebanks are an essential resource for syntactic parsing. The available Paninian dependency treebank(s) for Telugu are annotated only
with inter-chunk dependency relations, so not all words of a sentence are part of the parse tree. In this paper, we automatically annotate
the intra-chunk dependencies in the treebank using a shift-reduce parser based on Context Free Grammar rules for Telugu chunks.
We also propose a few additional intra-chunk dependency relations for Telugu apart from the ones used in the Hindi treebank. Annotating
intra-chunk dependencies provides a complete parse tree for every sentence in the treebank. Having a fully expanded treebank is
crucial for developing end-to-end parsers which produce complete trees. We present a fully expanded dependency treebank for Telugu
consisting of 3220 sentences. We also convert the treebank from the Anncorra part-of-speech tagset to the latest
BIS tagset. The BIS tagset is a hierarchical tagset adopted as a unified part-of-speech standard across all Indian languages. The final
treebank is made publicly available.
and many Anncorra tags diverge into finer-grained BIS categories. This makes the conversion task challenging.
The rest of the paper is organised as follows. Section 2 describes the Telugu Dependency Treebank, Section 3 describes the part-of-speech conversion from the Anncorra to the BIS standard, Section 4 describes the annotation of intra-chunk dependency relations for Telugu, and Section 5 concludes the paper.
2. Telugu Treebank
An initial Telugu treebank consisting of around 1600 sentences is made available in ICON 2009 tools contest. This treebank is combined with HCU Telugu treebank containing approximately 2000 sentences similarly annotated and another 200 sentences annotated at IIIT Hyderabad. We clean up the treebank by removing sentences with wrong format or incomplete parse trees etc. The final treebank consists of 3220 sentences. Details about the treebank are listed in Table 1.

No. of sentences             3222
Avg. sent length             5.5 words
Avg. no of chunks in sent    4.2
Avg. length of a chunk       1.3 words

Table 1: Telugu treebank stats

Figure 1: Inter-chunk dependency annotation in SSF format

Figure 2: Inter-chunk dependency tree.

inter-chunk annotation alone does not provide a fully constructed parse tree for the sentence. Hence it is important to determine and annotate intra-chunk relations accurately. In this paper, we expand the Telugu treebank by annotating the intra-chunk dependency relations.
are a few tags in Anncorra which diverge into many fine-grained BIS categories. Those tags are shown in Table 2. It should be noted that the one-to-many mapping exists only with fine-grained tags. There is still a one-to-one mapping between the Anncorra tag and the corresponding parent BIS tag in all cases except question words.

Anncorra POS tag       BIS POS tag
PRP (Pronoun)          PR_PRP, PR_PRF, PR_PRL, PR_PRC, PR_PRQ
DEM (Demonstrative)    DM_DMD, DM_DMR, DM_DMQ
VM (Main verb)         V_VM_VF, V_VM_VNF, V_VM_VINF, V_VM_VNG, N_NNV
CC (Conjunct)          CC_CCD, CC_CCS
WQ (Question word)     DM_DMQ, PR_PRQ
SYM (Symbol)           RD_SYM, RD_PUNC
RDP (Reduplicative)    -
*C (Compound)          -

Table 2: Fine-grained BIS tags corresponding to Anncorra tags.

During conversion, we aim to annotate with the most fine-grained BIS tag. When the fine-grained tag cannot be determined, we fall back to the parent tag. We use a tagset converter that maps the tags in the Anncorra schema to the tags in the BIS schema. For tags with multiple possibilities, a list-based approach is used. Most Anncorra tags diverging into fine-grained BIS tags are for function words, which are limited in number. Separate lists consisting of words belonging to the fine-grained BIS categories are created. A word is annotated with a fine-grained BIS tag if it is present in the corresponding tag word list; otherwise it is annotated with the parent tag.

Pronouns: One of the main distinctions between the two tagsets is in the annotation of pronouns. In Anncorra, all pronouns are annotated with a single tag, PRP. The BIS schema contains separate tags for annotating personal (PR_PRP), reflexive (PR_PRF), relative (PR_PRL) and reciprocal (PR_PRC) pronouns and question words (PR_PRQ). Pronouns in a language are generally limited in number. In Telugu, however, pronouns can be inflected with case markers and there can be a huge number of them. When a pronoun is not found in any word list, it is annotated with the parent tag PR.

Demonstratives: In Anncorra, there is a single tag for annotating demonstratives, whereas the BIS tagset distinguishes between deictic, relative and question-word demonstratives. Demonstratives are limited in number and the same list-based approach used for pronouns is applied here.

Symbols: Symbols are separated into symbols and punctuation.

Question words: They are separated into pronoun question words and demonstrative question words in the BIS tagset. Demonstrative question words are always followed by a noun. While resolving question words (WQ), if the word is followed by a noun it is marked as DM_DMQ, else it is marked as PR_PRQ.

Verbs: Another distinction between the two tagsets lies in the annotation of verb finiteness. In Anncorra, it is annotated only at chunk level. In the BIS schema, finiteness can be annotated at word level. While resolving verbs (V_VM), we look at the verb chunk. There is a one-to-one mapping between Anncorra chunk types and the fine-grained BIS verb categories.

Compounds and reduplicatives: In the Anncorra schema, there are separate tags for identifying reduplicatives (RDP) and parts of compounds (*C). For example, a noun compound consisting of two words is tagged as NNC and NN. Examples of reduplicative and noun compound constructions in Telugu are shown below.

Anncorra: maMci (good) JJ maMci (good) RDP cIralu (sarees) NN
BIS: maMci JJ maMci JJ cIralu N_NN

Anncorra: boVppAyi (papaya) NNC kAya (fruit) NN
BIS: boVppAyi N_NN kAya N_NN

These two tags are done away with in the BIS schema. Reduplicatives (RDP) are marked with the POS tag of the preceding word and compounds (*C) are marked with the POS tag of the following word.

4. Annotating Intra-chunk Dependencies
The intra-chunk annotation in SSF format for the sentence in Figure 1 is shown in Figure 4 and the fully expanded dependency tree is shown in Figure 3.

Figure 3: Intra-chunk dependency tree.

It can be seen that, in this case, unlike in Figure 2, cAlA (many) is attached to its chunk head, xeSAllo (countries-in), and I (this) is attached to its chunk head, parisWiwi (situation). The parse tree for the sentence is now complete. Complete parse trees are useful for creating end-to-end parsers which do not require intermediate pipeline tools like POS taggers, morphological analyzers and shallow parsers. This is a huge advantage, especially for low-resource languages like Telugu.

Kosaraju et al. (2012) first proposed the guidelines for annotating intra-chunk dependency relations in SSF format for Hindi. They propose a total of 12 intra-chunk dependency labels mentioned in Table 2. lwg refers to local word group and pof refers to part of.

They also propose two approaches, one rule-based and another statistical, for automatically annotating intra-chunk dependencies in Hindi. In the rule-based approach, several rules are created, constrained on the POS, the chunk name or type and the position of the chunk head with respect to the child node. The intra-chunk dependencies are marked based on these rules. In the statistical approach, MaltParser (Nivre et al., 2006) is used to identify the intra-chunk dependencies. A model is trained on a few manually annotated chunks with MaltParser and the same model is used to predict the intra-chunk dependencies for the rest of the treebank.

Figure 4: Intra-chunk dependency annotation in SSF format.

intf: Intensifiers (RP_INTF) can modify both adjectives and adverbs. So we replace jjmod__intf with intf and use the same dependency label when an intensifier modifies an adverb or adjective.
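The intensifier relabelling described above can be pictured as a small lookup on the POS tag of the head. The following is a minimal sketch under stated assumptions, not the rule format of any existing tool: the function name `intensifier_label` is hypothetical, the head tags `JJ` (adjective) and `RB` (adverb) are assumed, and labels are written with underscores. What the Hindi-style scheme assigns for adverb heads is not stated in the text, so the sketch returns `None` in that case.

```python
def intensifier_label(head_pos, merged=True):
    """Label for an intensifier (RP_INTF) child attached to a head with POS `head_pos`.

    merged=True  -> single `intf` label for adjective and adverb heads.
    merged=False -> Hindi-style `jjmod__intf`, which names only the adjective case.
    """
    if head_pos not in ("JJ", "RB"):  # assumed tags for adjective and adverb heads
        return None
    if merged:
        return "intf"
    # The adverb case is left unspecified here because the text does not state it.
    return "jjmod__intf" if head_pos == "JJ" else None


print(intensifier_label("JJ"), intensifier_label("RB"))
```

With the merged label, a grammar needs only one label name for both head types, which keeps the rule set for intensifiers uniform.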
lwg__psp. Sometimes, spatio-temporal nouns (N_NST) also act as post-positions when occurring alongside nouns. In these cases, they are annotated as lwg__psp.

In this paper, we follow the approach proposed by Bhat (2017) that makes use of a Context Free Grammar (CFG) and a shift-reduce parser for automatically annotating intra-chunk dependencies. We use the treebank expander code made available by Bhat (2017)³ and write the Context Free Grammar for Telugu. The Context Free Grammar is generated using the POS tags and creates a mapping between head and child POS tags and dependency labels.

The intra-chunk annotation is done using a shift-reduce parser which internally uses the Arc-Standard (Nivre, 2004) transition system. The parser predicts a sequence of transitions starting from an initial configuration to a terminal configuration, and annotates the chunk dependencies in the process. A configuration consists of a stack, a buffer, and a set of dependency arcs. In the initial configuration, the stack is empty, the buffer contains all the words in the chunk and the set of intra-chunk dependencies is empty. In the terminal configuration, the buffer is empty and the stack contains only one element, the chunk head, and the chunk sub-tree is given by the set of dependency arcs. The next transition is predicted based on the Context Free Grammar and the current configuration.

4.1.1. Results
We evaluate intra-chunk dependency relations annotated by the parser for 106 sentences. The test set evaluation results are shown in Table 4.

Test sentences    LAS     UAS
106               93.7    95.8

Table 4: Intra-chunk dependency annotation accuracies.

Almost all of the wrongly annotated chunks are due to POS errors or chunk boundary errors. Since the Context Free Grammar rules are written using POS tags, errors in the annotation of POS tags automatically lead to errors in intra-chunk dependency annotation. The dependency relations are annotated within the chunk boundaries, so any errors in chunk boundary identification also lead to errors in intra-chunk dependency annotation.

Telugu is an agglutinative language and the chunk size rarely exceeds three words. The CFG-based approach works accurately provided there are no errors in POS or chunk annotation.

³ https://github.com/ltrc/Shift-Reduce-Chunk-Expander

5. Conclusion
In this paper, we automatically annotate the Telugu dependency treebank with intra-chunk dependency relations, thus providing complete parse trees for every sentence in the treebank. We also convert the Telugu treebank from the AnnCorra part-of-speech tagset to the latest BIS tagset. We make the fully expanded Telugu treebank publicly available to facilitate further research.

6. Acknowledgements
We would like to thank Himanshu Sharma for making the Hindi tagset converter code available and Parameshwari Krishnamurthy and Pruthwik Mishra for providing relevant input. We also thank all the reviewers for their insightful comments.

7. Bibliographical References
Begum, R., Husain, S., Dhwaj, A., Sharma, D. M., Bai, L., and Sangal, R. (2008). Dependency annotation scheme for Indian languages. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
Bharati, A., Chaitanya, V., Sangal, R., and Ramakrishnamacharyulu, K. (1995). Natural language processing: a Paninian perspective. Prentice-Hall of India New Delhi.
Bharati, A., Sangal, R., and Sharma, D. M. (2007). SSF: Shakti standard format guide. Language Technologies Research Centre, International Institute of Information Technology, Hyderabad, India, pages 1–25.
Bharati, A., Sharma, D. M., Bai, L., and Sangal, R. (2009a). Anncorra: Annotating corpora guidelines for POS and chunk annotation for Indian languages. LTRC, IIIT Hyderabad.
Bharati, A., Sharma, D. M., Husain, S., Bai, L., Begam, R., and Sangal, R. (2009b). Anncorra: Treebanks for Indian languages, guidelines for annotating Hindi treebank. LTRC, IIIT Hyderabad.
Bhat, R. A. (2017). Exploiting linguistic knowledge to address representation and sparsity issues in dependency parsing of Indian languages. PhD thesis, IIIT Hyderabad.
Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D., and Xia, F. (2009). A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), pages 186–189, Suntec, Singapore, August. Association for Computational Linguistics.
Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, New York City, June. Association for Computational Linguistics.
Hajičová, E. (1998). Prague dependency treebank: From analytic to tectogrammatical annotations. Proceedings of 2nd TST, Brno, Springer-Verlag Berlin Heidelberg New York, pages 45–50.
Kosaraju, P., Ambati, B. R., Husain, S., Sharma, D. M., and Sangal, R. (2012). Intra-chunk dependency annotation: Expanding Hindi inter-chunk annotated treebank. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 49–56, Jeju, Republic of Korea, July. Association for Computational Linguistics.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Nivre, J., Hall, J., and Nilsson, J. (2006). MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, May. European Language Resources Association (ELRA).
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. (2016). Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portorož, Slovenia, May. European Language Resources Association (ELRA).
Nivre, J. (2004). Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain, July. Association for Computational Linguistics.
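As a supplementary illustration of the list-based tag conversion in Section 3, the following minimal Python sketch applies the stated rules: word-list lookup with fallback to the parent tag, the noun-follows test for question words, and neighbour-based tagging for reduplicatives and compounds. It is not the converter used for the treebank. The names `convert`, `PARENT`, `ONE_TO_ONE` and `NOUN_TAGS`, the contents of the word lists, and the set of Anncorra tags counted as nouns are all assumptions made for the example; BIS tags are written with underscores, and verb finiteness is left out because it needs chunk information.

```python
# Coarse BIS parents for Anncorra tags that split into several fine-grained tags.
PARENT = {"PRP": "PR", "DEM": "DM", "CC": "CC", "SYM": "RD"}
# One-to-one mappings; only the two seen in the paper's examples are listed.
ONE_TO_ONE = {"NN": "N_NN", "JJ": "JJ"}
# Anncorra tags treated as nouns for the question-word rule (an assumption).
NOUN_TAGS = {"NN", "NNP", "NNC", "NNPC"}


def is_compound(tag):
    # *C tags such as NNC; the conjunct tag CC is not a compound tag.
    return tag.endswith("C") and tag != "CC"


def convert(tokens, word_lists):
    """tokens: list of (word, anncorra_tag).
    word_lists: {anncorra_tag: {fine_bis_tag: set_of_words}} (hypothetical lists).
    Returns one BIS tag per token."""
    n = len(tokens)
    bis = [None] * n
    for i, (word, tag) in enumerate(tokens):
        if tag == "WQ":
            # Followed by a noun -> demonstrative question word, else pronoun.
            next_tag = tokens[i + 1][1] if i + 1 < n else None
            bis[i] = "DM_DMQ" if next_tag in NOUN_TAGS else "PR_PRQ"
        elif tag in PARENT:
            fine = [t for t, words in word_lists.get(tag, {}).items() if word in words]
            bis[i] = fine[0] if fine else PARENT[tag]  # fall back to the parent tag
        elif tag == "RDP" or is_compound(tag):
            continue  # resolved from a neighbouring word below
        else:
            bis[i] = ONE_TO_ONE.get(tag, tag)  # unknown tags pass through unchanged
    # Compounds take the tag of the following word; right-to-left handles chains.
    for i in range(n - 2, -1, -1):
        if bis[i] is None and is_compound(tokens[i][1]):
            bis[i] = bis[i + 1]
    # Reduplicatives take the tag of the preceding word.
    for i in range(1, n):
        if bis[i] is None and tokens[i][1] == "RDP":
            bis[i] = bis[i - 1]
    return bis


print(convert([("maMci", "JJ"), ("maMci", "RDP"), ("cIralu", "NN")], {}))
print(convert([("boVppAyi", "NNC"), ("kAya", "NN")], {}))
```

The two calls reproduce the reduplicative and compound examples from Section 3. Passing a list such as `{"DEM": {"DM_DMD": {"I"}}}` (treating I, "this", as a deictic demonstrative, which is an assumption) shows the word-list lookup, and any demonstrative missing from the list receives the parent tag `DM`.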
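As a supplementary illustration of the shift-reduce annotation in Section 4, the sketch below runs an arc-standard transition loop over a single chunk, choosing each transition from a table that maps (head POS, child POS) to a dependency label. It is a simplified stand-in written for this illustration, not the treebank expander code of Bhat (2017). The function `expand_chunk`, the `RULES` table and its three entries, the `PSP` tag for postpositions, the fixed preference order (LEFT-ARC, then SHIFT, then RIGHT-ARC) and the placeholder words `w1` to `w4` are all assumptions.

```python
# Hypothetical grammar table: (head POS, child POS) -> dependency label.
# These entries are illustrative assumptions, not the authors' rule file.
RULES = {
    ("JJ", "RP_INTF"): "intf",
    ("N_NN", "JJ"): "nmod__adj",
    ("N_NN", "PSP"): "lwg__psp",
}


def expand_chunk(tokens, rules=RULES):
    """Parse one chunk given as a list of (word, pos) pairs.

    Returns (head_index, arcs); arcs holds (head_index, child_index, label).
    """
    if not tokens:
        raise ValueError("empty chunk")
    pos = [p for _, p in tokens]
    # Initial configuration: empty stack, all words in the buffer, no arcs.
    stack, buffer, arcs = [], list(range(len(tokens))), []
    # Terminal configuration: empty buffer and a single item on the stack.
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            label = rules.get((pos[top], pos[below]))
            if label is not None:
                # LEFT-ARC: the top of the stack heads the item below it.
                arcs.append((top, below, label))
                del stack[-2]
                continue
        if buffer:
            # SHIFT: move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
            continue
        # Buffer is empty and at least two items remain on the stack.
        below, top = stack[-2], stack[-1]
        label = rules.get((pos[below], pos[top]))
        if label is None:
            raise ValueError("no rule licenses an arc; check POS tags or chunk boundary")
        # RIGHT-ARC: the item below heads the top of the stack.
        arcs.append((below, top, label))
        stack.pop()
    return stack[0], arcs


chunk = [("w1", "RP_INTF"), ("w2", "JJ"), ("w3", "N_NN"), ("w4", "PSP")]
print(expand_chunk(chunk))
```

The remaining stack item is returned as the chunk head, matching the terminal configuration described in Section 4. When no rule fits the tags in a chunk, the sketch raises an error instead of guessing, which mirrors the observation in Section 4.1.1 that POS or chunk boundary errors propagate to the intra-chunk annotation.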