The Parallel-TUT: a multilingual and multiformat treebank

Cristina Bosco, Manuela Sanguinetti, Leonardo Lesmo

Dipartimento di Informatica, Università di Torino
Corso Svizzera, 195, 10149, Torino (Italy)
bosco,msanguin,[email protected]

Abstract

The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French, i.e. ParallelTUT, or simply ParTUT. For the development of this resource, both the dependency and constituency-based formats of the Italian Turin University Treebank (TUT) have been applied to a preliminary dataset, which includes the whole text of the Universal Declaration of Human Rights, sentences from the JRC-Acquis Multilingual Parallel Corpus and the Creative Commons licence. The focus of the project is mainly on the quality of the annotation and on the investigation of some issues related to the alignment of data allowed by the TUT formats, also taking into account the availability of conversion tools for displaying data in standard formats such as TigerXML and CoNLL. It is, in fact, our belief that increasing the portability of our treebank could give us the opportunity to access resources and tools provided by other research groups, especially at this stage of the project, where no tool compatible with the TUT format is available to tackle the alignment problems.

Keywords: parallel treebanks, annotation formats, alignment

Introduction

Parallel multilingual corpora can be considered crucial resources for several tasks, e.g. Machine Translation (MT) and Computer-Assisted Translation (CAT), language learning and terminology extraction, but also the projection of annotation onto new (less-resourced) languages (Buch-Kromann, 2007; Ahrenberg et al., 2010). Their usefulness, as in the case of single-language resources, increases when they are annotated and their annotations allow forms of alignment at various levels of linguistic knowledge, see e.g. (Ahrenberg et al., 2010; Grimes et al., 2010; Rios et al., 2009). In particular, research in data-driven methods for MT has greatly benefitted from the increasing availability of parallel aligned treebanks for the training of statistical systems.

But the development of such resources raises several unresolved practical and theoretical issues. First, as is usual for monolingual resources, parallel treebanks are usually developed semi-automatically through a very time-consuming and error-prone process. Second, several levels of alignment, e.g. sentences, words or other syntactic components, can in principle be of interest for extracting information relevant to translation and other tasks, but the development of alignment tools is currently limited to particular levels of linguistic knowledge and annotation formats. Because of this, on the one hand, only a few statistical MT models have recently begun to really take advantage of higher-level linguistic structures as annotated in treebanks; on the other hand, only a few parallel treebanks aligned at some level exist, while none of them is of sufficient use in any statistical MT application, see e.g. (Ahrenberg, 2007), (Volk et al., 2010), (Čmejrek et al., 2004) and (Megyesi et al., 2008).

This paper introduces the ongoing project of a new parallel treebank for Italian, English and French, henceforth ParallelTUT (or, more simply, ParTUT), featuring both a pure dependency format (as described in (Sanguinetti and Bosco, 2011)) and a constituency-based annotation like that of the Penn Treebank (PTB), i.e. TUTPenn. Even though the project concerns a resource missing for Italian, the development of a new treebank large enough for the training of statistical systems is currently beyond our scope. The focus of the paper is therefore mainly on the features and quality of the annotation, and on the investigation of some issues related to the alignment of data allowed by the formats applied in ParTUT. We describe both the dependency-based annotation, called native TUT, and the conversion from this format to others, which is useful in a cross-paradigm perspective and increases the portability of the data (e.g. Penn), or simply makes the data in native TUT compliant with different standards for display and analysis (e.g. TigerXML or CoNLL).

For the development of ParTUT, we applied to English and French the same tools designed for Italian and applied within the TUT project [1]. In particular, we used the parser TULE and the TUTtoPENNconverter [2], respectively for applying the dependency-based annotation to the raw texts and for converting the resulting data, annotated in TUT, to the TUTPenn format. On the one hand, the application of existing formats to other languages has often been reported in the literature, see e.g. the application of the Prague Dependency Treebank (PDT) format to Arabic (Hajič and Zemánek, 2003), or of the PTB format to Chinese [3] and Arabic [4].
This allowed the improvement and extension, in a multilingual perspective, of approaches originally developed for single languages, also increasing the portability of NLP tools and the availability of data useful for their comparison and study. As suggested in (Paulussen and Macken, 2010), the use of the same annotation tools and formats for each monolingual corpus may also have a positive impact on the subsequent exploitation and processing of the resulting parallel corpora. On the other hand, the availability of multi-format annotations for parallel treebanks, like that described in (Francom and Hulden, 2008), can be of some help in analysing the adequacy of specific formats for particular languages and phenomena.

The next section describes the formats of the parallel treebank. The following section is devoted to the description of the data collected in order to build the corpus. The final section discusses issues related to the alignment of the parallel annotation of Italian, English and French allowed by TUT formats.

[1] http://www.di.unito.it/tutreeb
[2] http://www.di.unito.it/tutreeb/TUTtoPENNconverter/
[3] See http://www.cis.upenn.edu/chinese/
[4] See http://www.ircs.upenn.edu/arabic/

(a)
has
  VERB-SUBJ  Everyone
  VERB-OBJ  the
    DEF+DEF-ARG  right
      PREP-RMOD  to
        PREP-ARG  life
          COORD+BASE  ,
            COORD2ND+BASE  liberty
              COORD+BASE  and
                COORD2ND+BASE  the
                  DEF+DEF-ARG  security
                    PREP-RMOD  of
                      PREP-ARG  person
  END  .

(b)
(NP-SBJ (PRO~PE Everyone)) (VP (VMA~RE has) (NP (NP (ART~DE the) (NOU~CS right)) (PP (PREP to) (NP (NP (NOU~CS life)) (, ,) (NP (NP (NOU~CS liberty)) (CONJ and) (NP (NP (ART~DE the) (NOU~CS security)) (PP (PREP of) (NP (NOU~CS person))))))))) (. .))

Figure 1: The English sentence HUMAN-RIGHTS-21, as annotated in TUT (a) and TUTPenn (b).
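The bracketed TUTPenn annotation in Figure 1 (b) is a nested structure that is straightforward to read programmatically. The following is a minimal sketch and not part of the TUT tool chain: the function names are hypothetical, and the string is the one printed in Figure 1 (b) wrapped in an assumed top-level S node so that the brackets balance.

```python
def parse_brackets(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word form
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unbalanced brackets")
    return tree


def leaves(tree):
    """Return (word, tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((child, label))
    return pairs


# Figure 1 (b); the outer "(S" is an assumption, not printed in the figure.
FIG_1B = (
    "(S (NP-SBJ (PRO~PE Everyone)) "
    "(VP (VMA~RE has) "
    "(NP (NP (ART~DE the) (NOU~CS right)) "
    "(PP (PREP to) "
    "(NP (NP (NOU~CS life)) (, ,) "
    "(NP (NP (NOU~CS liberty)) (CONJ and) "
    "(NP (NP (ART~DE the) (NOU~CS security)) "
    "(PP (PREP of) (NP (NOU~CS person" + ")" * 9 + " "
    "(. .))"
)

tree = parse_brackets(FIG_1B)
print(" ".join(word for word, _tag in leaves(tree)))
```

Keeping the tree as plain tuples makes it easy to compare the constituency reading with the dependency reading of the same sentence, which is the kind of cross-format inspection the two TUT formats are meant to support.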

TUT formats

TUT is a resource developed by the Natural Language Processing group of the University of Turin (http://www.di.unito.it/tutreeb/) which currently consists of more than 102,000 annotated tokens (around 3,500 sentences) extracted from texts ranging from newspapers to legal texts to Wikipedia. The development of TUT includes two steps: the first, which is devoted to the dependency-based native annotation of data, is the application of an annotation system to raw texts; the second, which outputs the data in a constituency-based format, consists in a conversion applied to the data in the dependency format produced by the first step. In the current phase of development, both steps require checking and a limited amount of correction, applied semi-automatically with tools intended for this purpose. The core of the first step is the Turin University Linguistic Environment (henceforth TULE [5]) (Lesmo, 2007; Lesmo, 2009), which includes a rule-based parser developed in parallel with TUT and the modules needed for tokenization, PoS tagging and morphological analysis. The second step, which is the annotation in the TUTPenn format, i.e. the constituency-based Penn-like format designed for TUT, includes the application of conversion tools (Bosco, 2007) [6] to the data in TUT native format.

[5] http://www.tule.di.unito.it/
[6] The conversion tools can be freely downloaded from the TUT web site.

2.1. Dependency: TUT native format

As far as the native annotation schema is concerned, a typical TUT tree (see Figure 1 (a)) shows a pure dependency format centered upon the notion of argument structure and is based on the principles of the Word Grammar theoretical framework (Hudson, 1984). This is mirrored, for instance, in the annotation of Determiners and Prepositions, which are represented in TUT trees as complementizers of Nouns or Verbs. See, for instance, in Figure 1 (a) the Determiner the, which is the head of the Noun security, and the Preposition of, which is the head of the Noun person.

As for the dependency relations that label the tree edges, TUT exploits a rich set of grammatical items designed to represent a variety of linguistic information according to three different perspectives, i.e. morphology, functional syntax and semantics. The main idea is that a single layer, the one describing the relations between words, can represent linguistic knowledge that is proximate to semantics and underlies syntax and morphology, i.e. the predicate-argument structure of events and states, which has proven essential for efficient processing of human language. Therefore, each relation label can in principle include three components, i.e. morpho-syntactic, functional-syntactic and syntactic-semantic, but can be made more or less specialized, including from only one (i.e. the functional-syntactic) to all three of them (see e.g. (Bosco and Lavelli, 2010; Alicante et al., 2012) for more details). For instance, the relation used for the annotation of the Prepositional modifiers in Figure 1, i.e. PREP-RMOD-REASONCAUSE (which includes all three components), can be reduced to PREP-RMOD (which includes only the first two components) or to RMOD (which includes only the functional-syntactic component). In Figure 1 several relations involving two components are shown: e.g. VERB-SUBJ for the subject of a Verb, PREP-RMOD for the restrictive modifier introduced by a Preposition and PREP-ARG for the argument of a Preposition. This variable degree of specificity is a useful means for the human annotator, in that it can reflect his/her degree of confidence about a given relation. Moreover, it can also be applied in particular tasks in order to increase the comparability of TUT with other existing resources, by exploiting the amount of linguistic information most adequate for the comparison, e.g. in terms of number of relations.

Last but not least, as Italian requires, the TUT format provides an extended morphological tag set including all the categories and features needed to describe morphologically rich languages. This tag set therefore allowed an accurate description both of French, whose morphological richness resembles that of Italian, and of English, which is morphologically poorer.

Moreover, unlike most dependency-based annotations, in order to deal with pro-drop and equi, long distance dependencies and elliptical structures, the native TUT also exploits null elements. In most cases, null elements are coindexed with some word of the sentence (e.g. for gapping or equi phenomena). Non-coindexed null elements are instead used e.g. for the representation of elliptical constructions, pro-drop subjects or other dropped complements playing some role in the argument structure of Verbs. Exploiting null elements permits dependency trees to avoid crossing edges and to be projective. In practice, null elements are useful in giving an explicit representation also of those parts of the argument structure that could be missing but are sometimes crucial for some task. For instance, the exploitation of null elements can make the alignment easier in all cases where the source language, e.g. Italian, allows a dropped subject and the target language, such as English or French, does not. Finally, as described in (Chung and Gildea, 2010), adding some empty elements can help in building machine translation systems, which benefit from training on corpora with annotated empty elements, even when empty element prediction falls somewhat short of what would conventionally be considered robust.

2.2. Constituency: TUTPenn format

As far as the constituency-based annotation is concerned, the annotation in TUTPenn (see Figure 1 (b)) is structurally the same as in the Penn Treebank, but it differs from this model in having a richer morphological tag set and an extended inventory of functional relations. As for morphology, the size of the PoS tag set of TUTPenn, compared with that exploited in the English PTB, clearly reflects the fact that Italian is morphologically richer than English, in particular with respect to the inflection of Verbs. Beyond the information that the PTB tag set makes explicit, TUTPenn takes into account a richer variety of features for Verbs, Adjectives and Pronouns, apart from a few cases of English morphological features which do not exist in Italian (e.g. possessive ending) or do not correspond with Italian forms (e.g. comparative Adjective and Adverb). As for functional relations, in order to deal with phenomena related to the flexibility of Italian word order, some labels have been added to the small PTB inventory. One example is the label EXTPSBJ, which is used for the annotation of subjects in post-verbal position. The standard PTB inventory of null elements is also adopted in TUTPenn, but while for English null elements are mainly traces denoting constituent movements, in TUTPenn they can play different roles: zero Pronouns, reduction of relative clauses, elliptical Verbs and also the duplication of Subjects which are positioned after Verbs.

Data and development of ParTUT

The parallel treebank currently comprises a preliminary set of sample texts, which have been annotated in order to assess our methodology. The corpus consists of 50 sentences extracted from the JRC-Acquis multilingual parallel corpus [7] (Steinberger et al., 2006) and the entire text (about 100 sentences) of the Universal Declaration of Human Rights [8]. More recently this preliminary set has been enlarged with an additional corpus extracted from the open licence Creative Commons [9], composed of around 100 sentences. All the data gathered in ParTUT so far (including raw texts) can be consulted and downloaded from the ParTUT web page [10]. These texts are represented in the ParTUT corpus in Italian, English and French, and the exact amounts in terms of both sentences and tokens can be seen in Table 1 [11]. The full corpus currently consists of fewer than 23,000 annotated tokens and represents only very specific text genres.

[7] See http://langtech.jrc.it/JRC-Acquis.html, http://optima.jrc.it/Acquis/
[8] See http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx
[9] See http://creativecommons.org/licenses/by-nc-sa/2.0
[10] http://www.di.unito.it/tutreeb/partut.html
[11] JRCAcquis indicates the JRC-Acquis multilingual subcorpus of ParTUT, UDHR indicates the Universal Declaration of Human Rights and CC the Creative Commons licence texts.

Corpus        sentences   tokens
JRCAcquisIt          50    2,205
JRCAcquisFr          52    2,297
JRCAcquisEn          50    1,895
UDHRIt               76    2,387
UDHRFr               77    2,537
UDHREn               77    2,293
CCIt                 96    3,141
CCFr                102    3,624
CCEn                 88    2,507
total               688   22,886

Table 1: Corpus overview.

The further development of ParTUT, planned for the future, includes the annotation of a larger set of data that will be collected by taking into account the issues related to text genre too. It is in fact crucial to enlarge the corpus, in order both to address a larger and more meaningful set of linguistic phenomena and to obtain more reliable analyses not affected by sparseness, as e.g. in (Ahrenberg, 2010). Nevertheless, however unbalanced the treebank may be at the moment, the choice of the texts of this collection was not fortuitous, and several criteria were considered in their selection: above all, practical reasons of easy availability from the web and the absence of Intellectual Property Rights problems, which allow us to process the data freely and release them under an open licence. Moreover, by choosing texts from legal documents, we benefitted from the expertise in the field of legal language processing acquired within the TUT project by the group of the University of Turin. The data included in our corpus are representative of the unannotated parallel corpora developed in recent decades, in particular by the European Community. Finally, these texts include raw materials which are in a translation relation to each other, and this should be relevant for studies of human and machine translation.

The output produced by our annotation tool, however, was somewhat affected by this bias, by virtue of the high number of long sentences [12], subordinate clauses, parentheticals and coordinated structures (constituting, by themselves, a well-known problem for automatic tools), which are all typical features of normative texts. Therefore, as stated above, and for the reasons we have just explained, we plan in the immediate future to extend the treebank so as to make our resource less biased and more complete for any further application and research.

[12] A high percentage of sentences reaches a length of 70 to 100 tokens.

As for the application of the annotation in particular, although the TULE parser in principle supports linguistic analysis in several languages (English in particular, but also French, Spanish, Catalan and Hindi), its output quality is currently satisfactory mostly for Italian, since it has been extensively tested in the development of the Italian TUT. That is to say, since TULE is a rule-based parser, in the current phase of development of ParTUT it needs rule insertion and enrichment of the lexical knowledge for English and French, e.g. the insertion of new lexical entries including, in particular, proper nouns, named entities, compounds and locutions, and new disambiguation rules for previously unseen linguistic phenomena. The tools developed for the conversion of native TUT into the TUTPenn format have also required some limited updates of the tables containing the linguistic knowledge they exploit for English and French.

In general, we observed that applying to ParTUT in native TUT format the tools for conversion to TUTPenn, TigerXML [13] (see an example in Figure 2) and CoNLL has been a very useful practice for error detection and consequently for improving the quality of the annotated data. It is our belief that the availability of several formats, in particular those compliant with known standards, increases the portability of our treebank and could give us the opportunity to access resources and tools provided by other research groups, especially at this stage of the project, where no tool compatible with the TUT format is available to face some typical problems in parallel treebanking. This is particularly true for the alignment phase, which is currently one of the aspects on which we are focusing our attention and whose problems we attempt to describe in the next section.
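The remark that format conversion doubles as error detection can be illustrated with a small sketch. It is not the TUT conversion tool: the head indices below are one reading of the dependency tree in Figure 1 (a), the root label ROOT and the function names are hypothetical, and the output follows the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with unknown columns left as underscores.

```python
# (id, form, head, relation); head 0 marks the root.
# Heads are an assumed reading of Figure 1 (a); "ROOT" is a hypothetical label.
FIG_1A = [
    (1, "Everyone", 2, "VERB-SUBJ"),
    (2, "has", 0, "ROOT"),
    (3, "the", 2, "VERB-OBJ"),
    (4, "right", 3, "DEF+DEF-ARG"),
    (5, "to", 4, "PREP-RMOD"),
    (6, "life", 5, "PREP-ARG"),
    (7, ",", 6, "COORD+BASE"),
    (8, "liberty", 7, "COORD2ND+BASE"),
    (9, "and", 8, "COORD+BASE"),
    (10, "the", 9, "COORD2ND+BASE"),
    (11, "security", 10, "DEF+DEF-ARG"),
    (12, "of", 11, "PREP-RMOD"),
    (13, "person", 12, "PREP-ARG"),
    (14, ".", 2, "END"),
]


def check_tree(rows):
    """Return a list of structural problems; an empty list means a well-formed tree."""
    problems = []
    heads = {tid: head for tid, _form, head, _rel in rows}
    roots = [tid for tid, head in heads.items() if head == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for tid, head in heads.items():
        if head != 0 and head not in heads:
            problems.append(f"token {tid}: head {head} does not exist")
    for tid in heads:
        seen = set()
        node = tid
        while node in heads:  # stops at the root (head 0) or at a missing head
            if node in seen:
                problems.append(f"token {tid}: cycle through {node}")
                break
            seen.add(node)
            node = heads[node]
    return problems


def to_conll(rows):
    """Serialize rows as tab-separated CoNLL-X lines."""
    lines = []
    for tid, form, head, rel in rows:
        fields = [str(tid), form, "_", "_", "_", "_", str(head), rel, "_", "_"]
        lines.append("\t".join(fields))
    return "\n".join(lines)


if not check_tree(FIG_1A):
    print(to_conll(FIG_1A))
```

Running the checks before serializing mirrors the practice described above: a tree that cannot be exported cleanly usually points to an annotation error worth correcting by hand.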

Aligning ParTUT

Because of the correspondences between the information encoded in the same sentences in different languages, processing the same text in two languages yields useful information on how words and structures are translated from a source to a target language. The ParTUT project is oriented to the development of a data set on which this hypothesis can be tested. ParTUT, adopting the annotation typical of TUT, features a rich annotation oriented to the representation of the predicate-argument structure, a kind of information which we hypothesize can be useful as a pivot for alignment in translation. As observed above, both the dependency core and the inventory of null elements introduced in the annotation schema of TUT contribute to a more accurate representation in this respect. Moreover, ParTUT makes available a set of data in different annotation formats, both dependency and constituency-based, that allows for the comparison of alignments based on these two paradigms. This kind of comparison has been developed e.g. in (Gildea, 2004), showing that, for the Chinese-English case, constituent-based alignment significantly outperforms dependency-based alignment. Since ParTUT features formats belonging to two different paradigms and linguistic theories, it could be exploited as a testbed for similar comparisons with reference to Italian, English and French. So far, the issues related to alignment at the sentence, word and syntactic levels have been taken into account in ParTUT, but mainly by applying tools not specifically implemented for our formats and using empirical methods in order to develop guidelines that can drive the development of suitable tools. For instance, at the sentence level, the alignment was performed with Omega Aligner [14], a simple Python script which produces files conforming to the Translation Memory eXchange (TMX) standard.
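To make the sentence-level output concrete, the sketch below builds a small TMX document with Python's standard library. It does not reproduce Omega Aligner: the function name and the header attribute values are placeholders, the English segment is the sentence shown in Figure 1, and the Italian and French segments are illustrative stand-ins for the aligned corpus sentences.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(units, srclang="en"):
    """Build a TMX tree from a list of {language code: sentence} dictionaries."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "partut-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for unit in units:
        tu = ET.SubElement(body, "tu")      # one translation unit per aligned sentence
        for lang, text in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Illustrative aligned triple; replace with sentences taken from the corpus.
UNITS = [{
    "it": "Ogni individuo ha diritto alla vita, alla libertà ed alla sicurezza della propria persona.",
    "en": "Everyone has the right to life, liberty and the security of person.",
    "fr": "Tout individu a droit à la vie, à la liberté et à la sûreté de sa personne.",
}]

print(ET.tostring(build_tmx(UNITS), encoding="unicode"))
```

Each `tu` element groups the variants of one sentence, so a trilingual alignment needs no pairwise files: the three `tuv` children share a single unit.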
For word alignment as well, we tested a number of freely available resources and took into account the suggestions proposed in several guidelines (such as those in (Melamed, 1998) or (Lambert et al., 2005)) and in other similar works (see, for example, (Graça et al., 2008) or (Simov et al., 2011), to name just a few). In particular, we found a useful resource in WordAligner [15], a web-based interface which allows for manual editing and browsing of alignments. The tool represents each pair of sentences as a grid of squares (see Figure 3), which is a more useful representation device than that of systems where alignments are drawn as lines, especially in cases of multiple alignment links.

Figure 3: Example of a bi-sentence Italian-English aligned with WordAligner.

Furthermore, WordAligner supports two types of alignment links, which are defined as sure and possible. According to our previous definition criteria, we opted for the notions of exact and fuzzy matches: the former is used to identify complete and minimal semantic translation units, and the latter to indicate valid translation pairs (including all cases of translation shifts). Given that the notions of sure and possible links do not differ from those we devised at the previous stage, for the sake of consistency we decided to keep the terms exact and fuzzy, while applying to them all the cases suggested by the literature respectively as sure and possible alignment links.

As pointed out above, however, these two steps (sentence and word alignment) are only preliminary and entirely experimental stages of the deeper level of alignment we are interested in: our goal is, in fact, to create a mapping between the tree pairs where information about the syntax-semantics interface is included. The major aim of our project for the development of ParTUT is to build a parallel treebank where alignment principles are not only lexically, but also syntactically motivated, and where the data mapped in the corpus can be of use in cross-linguistic research and applications, most notably in MT. This paper describes a first step in this direction, which consists in the creation of a golden collection of parallel parse trees where such alignment principles are investigated and tested mainly by hand. Although we hypothesize that the features of the TUT annotation schemes can be of some help for the alignment, in particular at the syntactic level and with respect to the argument structure, these features and the richness of the annotation schema of ParTUT are currently the major limits in the application of standard alignment tools. This is, among others, one of the reasons why we decided to make our resource available in other exchange formats as well, such as TigerXML and CoNLL.

[13] Texts and their encoding in the TigerXML format preserve the original dependency representation, in a fashion similar to that recommended by the Nordic Treebank Network (Hall and Nilsson, 2005).
[14] http://www.omegat.org/en/resources.html
[15] http://www.bultreebank.bas.bg/aligner/index.php

Figure 2: Three versions of the same sentence from the Creative Commons licence, represented in Tiger-XML.

Conclusions

The paper describes an ongoing project for the development of a multilingual parallel aligned treebank, i.e. ParTUT, which features two annotation formats based respectively on the dependency (native TUT) and the constituency paradigm (TUTPenn). The focus of the paper is therefore mainly on the features of these annotations and the methodology adopted for their application to the data included in ParTUT. In fact, the development of the resource is based on tools implemented for an existing treebank for Italian, namely TUT, which have been adapted to English and French as well. Furthermore, preliminary issues related to the alignment of data allowed by the applied formats are presented, taking into account that the main goal of the ParTUT project consists in creating a mapping between the tree pairs where information about the syntax-semantics interface is included. Therefore, as for the future development of this work, a number of issues must be further pursued, in particular the development and integration of suitable tools for alignment at the syntactic level, which are currently missing.

References

L. Ahrenberg, J. Tiedemann, and M. Volk, editors. 2010. Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) 2010. Tartu.

L. Ahrenberg. 2007. LinEs: an English-Swedish Parallel Treebank. In Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA '07), Tartu.

L. Ahrenberg. 2010. Clause restructuring in English-Swedish translation. In Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) 2010, Tartu.

A. Alicante, C. Bosco, A. Corazza, and A. Lavelli. 2012. A treebank-based study on the influence of Italian word order on parsing performance. In Proceedings of the Language Resources and Evaluation Conference (LREC'12), Istanbul.

C. Bosco and A. Lavelli. 2010. Annotation schema-oriented validation for dependency parsing evaluation. In Proceedings of the 9th Workshop on Treebanks and Linguistic Theories (TLT-9), Tartu.

C. Bosco. 2007. Multiple-step treebank conversion: from dependency to Penn format. In Proceedings of the Linguistic Annotation Workshop (LAW) 2007, Prague.

M. Buch-Kromann. 2007. Computing translation units and quantifying parallelism in parallel dependency treebanks. In Proceedings of the Linguistic Annotation Workshop (LAW) 2007, Prague.

T. Chung and D. Gildea. 2010. Effects of empty categories on machine translation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP'10), Boston.

J. Francom and M. Hulden. 2008. Parallel multi-theory annotation of syntactic structure. In Proceedings of the Language Resources and Evaluation Conference (LREC'08), Marrakech.

D. Gildea. 2004. Dependencies vs constituents for tree-based alignment. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP'04), Barcelona.

J. Graça, J. P. Pardal, L. Coheur, and D. Caseiro. 2008. Building a golden collection of parallel multi-language word alignment. In Proceedings of the Language Resources and Evaluation Conference (LREC'08), Marrakech.

S. Grimes, X. Li, A. Bies, S. Kulick, X. Ma, and S. Strassel. 2010. Creating Arabic-English parallel word-aligned treebank corpora at LDC. In Proceedings of the Language Resources and Evaluation Conference (LREC'10), Malta.

J. Hajič and P. Zemánek. 2003. Prague Arabic Dependency Treebank: Development in data and tools. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools, Cairo.

J. Hall and J. Nilsson. 2005. Converting Dependency Treebanks to MALT-XML. Technical report, Växjö University, School of Mathematics and Engineering.

R. Hudson. 1984. Word Grammar. Basil Blackwell, Oxford and New York.

P. Lambert, A. de Gispert, R. E. Banchs, and J. B. Mariño. 2005. Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation, 39(4).

L. Lesmo. 2007. The rule-based parser of the NLP group of the University of Torino. Intelligenza artificiale, 2(IV).

L. Lesmo. 2009. The Turin University Parser at Evalita 2009. In Proceedings of Evalita'09, Reggio Emilia.

B. Megyesi, B. Dahlqvist, E. Pettersson, and J. Nivre. 2008. Swedish-Turkish Parallel Treebank. In Proceedings of the Language Resources and Evaluation Conference (LREC'08), Marrakech.

D. Melamed. 1998. Manual annotation of translational equivalence: The Blinker project. Technical report, University of Pennsylvania.

H. Paulussen and L. Macken. 2010. Annotating the Dutch Parallel Corpus. In Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC), Tartu.

A. Rios, A. Göhring, and M. Volk. 2009. A Quechua-Spanish parallel treebank. In Proceedings of the 7th Workshop on Treebanks and Linguistic Theories (TLT-7), Groningen.

M. Sanguinetti and C. Bosco. 2011. Building the multilingual TUT parallel treebank. In Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) 2011, Hissar.

K. Simov, P. Osenova, L. Laskova, A. Savkov, and S. Kancheva. 2011. Bulgarian-English parallel treebank: Word and semantic level alignment. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, and D. Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the Language Resources and Evaluation Conference (LREC'06), Genova.

M. Čmejrek, J. Hajič, and V. Kuboň. 2004. Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. In Proceedings of the EAMT 10th Annual Conference, Budapest.

M. Volk, A. Göhring, T. Marek, and Y. Samuelsson. 2010. SMULTRON (version 3.0) - The Stockholm MULtilingual parallel TReebank. An English-French-German-Spanish-Swedish parallel treebank with sub-sentential alignments.

Tools Corpora and CAT S NMT Lelandem
No ratings yet
Tools Corpora and CAT S NMT Lelandem
35 pages
Morpho-Syntactic Regularities in Ud - Romanian Nonstandard Parsing
No ratings yet
Morpho-Syntactic Regularities in Ud - Romanian Nonstandard Parsing
10 pages
SLoSP 2007 1
No ratings yet
SLoSP 2007 1
42 pages
Tacl A 00109
No ratings yet
Tacl A 00109
14 pages
Corpus-Based Studies of Legal Language For Translation Purposes: Methodological and Practical Potential
No ratings yet
Corpus-Based Studies of Legal Language For Translation Purposes: Methodological and Practical Potential
15 pages
Volk - Graen - Callegaro - 2014-Innovations Parallel Corpus Tools
No ratings yet
Volk - Graen - Callegaro - 2014-Innovations Parallel Corpus Tools
7 pages
A Novel Dependency Framework For Enhancing DiscourSE
No ratings yet
A Novel Dependency Framework For Enhancing DiscourSE
29 pages
4
No ratings yet
4
3 pages
Unit 2
No ratings yet
Unit 2
15 pages
Phrase Tagset Mapping For French and English Treebanks
No ratings yet
Phrase Tagset Mapping For French and English Treebanks
13 pages
AR - B A A J - S S C C: ULE Ased Pproach For Ligning Apanese Panish Entences From A Omparable Orpora
No ratings yet
AR - B A A J - S S C C: ULE Ased Pproach For Ligning Apanese Panish Entences From A Omparable Orpora
8 pages
Unit 3-2
No ratings yet
Unit 3-2
26 pages
7
No ratings yet
7
4 pages
Dependencies vs. Constituents For Tree-Based Alignment: Daniel Gildea
No ratings yet
Dependencies vs. Constituents For Tree-Based Alignment: Daniel Gildea
8 pages
Unit 2 New One
No ratings yet
Unit 2 New One
12 pages
Structure in Linguistics
No ratings yet
Structure in Linguistics
6 pages
Film Discourse: Corpus Analysis and Synchronic Perspective
No ratings yet
Film Discourse: Corpus Analysis and Synchronic Perspective
5 pages
Agic2015 Universal
No ratings yet
Agic2015 Universal
8 pages
Semanti Roles PDF
No ratings yet
Semanti Roles PDF
105 pages
Unit II
No ratings yet
Unit II
61 pages
NLP Unit-Ii
No ratings yet
NLP Unit-Ii
118 pages
Demos 035
No ratings yet
Demos 035
12 pages
00 General Handout
No ratings yet
00 General Handout
24 pages
The Syntactic Categories (Parts of Speech)
No ratings yet
The Syntactic Categories (Parts of Speech)
5 pages
Natural Language Processing
No ratings yet
Natural Language Processing
32 pages
Some Good Projects in NLP
No ratings yet
Some Good Projects in NLP
3 pages
4 Natural Language Processing-Text Normalization
No ratings yet
4 Natural Language Processing-Text Normalization
10 pages
Any-Language Frame-Semantic Parsing
No ratings yet
Any-Language Frame-Semantic Parsing
5 pages
Linked Data For Language-Learning Applications
No ratings yet
Linked Data For Language-Learning Applications
8 pages
Why Forensic Linguistics
No ratings yet
Why Forensic Linguistics
15 pages
Séquence 4 NEW PPDDFF
No ratings yet
Séquence 4 NEW PPDDFF
6 pages
NLP Unit II Notes
75% (8)
NLP Unit II Notes
18 pages
CAT: The CELCT Annotation Tool: Valentina Bartalesi Lenzi, Giovanni Moretti, Rachele Sprugnoli
No ratings yet
CAT: The CELCT Annotation Tool: Valentina Bartalesi Lenzi, Giovanni Moretti, Rachele Sprugnoli
6 pages
245 Paper
No ratings yet
245 Paper
8 pages
The Open Lexical Infrastructure of SPR Akbanken: Lars Borin, Markus Forsberg, Leif-J Oran Olsson and Jonatan Uppstr Om
No ratings yet
The Open Lexical Infrastructure of SPR Akbanken: Lars Borin, Markus Forsberg, Leif-J Oran Olsson and Jonatan Uppstr Om
5 pages
240 Paper
No ratings yet
240 Paper
6 pages
Towards A Richer Wordnet Representation of Properties: Sanni Nimb, Bolette Sandford Pedersen
No ratings yet
Towards A Richer Wordnet Representation of Properties: Sanni Nimb, Bolette Sandford Pedersen
5 pages
Korp - The Corpus Infrastructure of Språkbanken: Lars Borin, Markus Forsberg, and Johan Roxendal
No ratings yet
Korp - The Corpus Infrastructure of Språkbanken: Lars Borin, Markus Forsberg, and Johan Roxendal
5 pages
Turk Bootstrap Word Sense Inventory 2.0: A Large-Scale Resource For Lexical Substitution
No ratings yet
Turk Bootstrap Word Sense Inventory 2.0: A Large-Scale Resource For Lexical Substitution
5 pages
Predicting Phrase Breaks in Classical and Modern Standard Arabic Text
No ratings yet
Predicting Phrase Breaks in Classical and Modern Standard Arabic Text
5 pages
233 Paper
No ratings yet
233 Paper
6 pages
230 Paper
No ratings yet
230 Paper
6 pages
232 Paper
No ratings yet
232 Paper
7 pages
A Classification of Adjectives For Polarity Lexicons Enhancement
No ratings yet
A Classification of Adjectives For Polarity Lexicons Enhancement
5 pages
Conandoyle-Neg: Annotation of Negation in Conan Doyle Stories
No ratings yet
Conandoyle-Neg: Annotation of Negation in Conan Doyle Stories
6 pages
LIE: Leadership, Influence and Expertise: R. Catizone, L. Guthrie, A.J. Thomas, and Y. Wilks
No ratings yet
LIE: Leadership, Influence and Expertise: R. Catizone, L. Guthrie, A.J. Thomas, and Y. Wilks
5 pages
228 Paper
No ratings yet
228 Paper
8 pages
211 Paper
No ratings yet
211 Paper
5 pages
ROMBAC: The Romanian Balanced Annotated Corpus: Radu Ion, Elena Irimia, Dan Ştefănescu, Dan Tufiș
No ratings yet
ROMBAC: The Romanian Balanced Annotated Corpus: Radu Ion, Elena Irimia, Dan Ştefănescu, Dan Tufiș
6 pages
Automatic Classification of German An Particle Verbs: Sylvia Springorum, Sabine Schulte Im Walde, Antje Roßdeutscher
No ratings yet
Automatic Classification of German An Particle Verbs: Sylvia Springorum, Sabine Schulte Im Walde, Antje Roßdeutscher
8 pages
Detecting Reduplication in Videos of American Sign Language: Zoya Gavrilov, Stan Sclaroff, Carol Neidle, Sven Dickinson
No ratings yet
Detecting Reduplication in Videos of American Sign Language: Zoya Gavrilov, Stan Sclaroff, Carol Neidle, Sven Dickinson
7 pages
210 Paper
No ratings yet
210 Paper
5 pages
208 Paper
No ratings yet
208 Paper
7 pages
A Bilingual Bimodal Reading and Writing Tool For Sign Language Users
No ratings yet
A Bilingual Bimodal Reading and Writing Tool For Sign Language Users
5 pages
204 Paper
No ratings yet
204 Paper
8 pages
Al Mutlaq & Al Muqayyad
No ratings yet
Al Mutlaq & Al Muqayyad
4 pages
SVA Lesson
No ratings yet
SVA Lesson
29 pages
Pre Assessment in MTB Mle 3 Kapampangan
No ratings yet
Pre Assessment in MTB Mle 3 Kapampangan
8 pages
Affirmative, Negative and Interrogative Sentences & Personal, Subject, Objetct and Possessive Pronouns
No ratings yet
Affirmative, Negative and Interrogative Sentences & Personal, Subject, Objetct and Possessive Pronouns
13 pages
No Meaning - Google Search
No ratings yet
No Meaning - Google Search
1 page
КТП 4 Кл 2022-2023 Рогова В.В.
No ratings yet
КТП 4 Кл 2022-2023 Рогова В.В.
9 pages
38-C-100188-Shikha Practice Sheet Full
No ratings yet
38-C-100188-Shikha Practice Sheet Full
10 pages
CEFR B2 Learning Outcomes
100% (2)
CEFR B2 Learning Outcomes
14 pages
Test Unit 5-6: Intermediate B1+ Exam Units 5 - 6
No ratings yet
Test Unit 5-6: Intermediate B1+ Exam Units 5 - 6
20 pages
Anggi Anggraini Nasution Mini Riset Syntax
No ratings yet
Anggi Anggraini Nasution Mini Riset Syntax
7 pages
A B C D A B C D A B C D A B C D A B C D A B C D A B C D A B C D A B C D
No ratings yet
A B C D A B C D A B C D A B C D A B C D A B C D A B C D A B C D A B C D
10 pages
Present Simple A. Write The Verb To Sing. Affirmative Negative Interrogative
No ratings yet
Present Simple A. Write The Verb To Sing. Affirmative Negative Interrogative
3 pages
Informal - Letter or Email
No ratings yet
Informal - Letter or Email
7 pages
A3 Irregular Verbs 2
No ratings yet
A3 Irregular Verbs 2
1 page
Part of Speech
No ratings yet
Part of Speech
14 pages
Alexandru Mardale CESCL
No ratings yet
Alexandru Mardale CESCL
12 pages
Class 7 Cbse English Syllabus 2012-13
No ratings yet
Class 7 Cbse English Syllabus 2012-13
3 pages
Present Perfect Tense
No ratings yet
Present Perfect Tense
2 pages
Material Masa Cuti Form 4 2019
No ratings yet
Material Masa Cuti Form 4 2019
11 pages
Booklet Chapter 1
No ratings yet
Booklet Chapter 1
21 pages
4 Types of Sentences Grade 4 English Resources Printable Worksheets w6
No ratings yet
4 Types of Sentences Grade 4 English Resources Printable Worksheets w6
2 pages
Lesson n13 - Comparative and Superlative
No ratings yet
Lesson n13 - Comparative and Superlative
14 pages
English
No ratings yet
English
73 pages
Hello Muddah Hello Fuddah
No ratings yet
Hello Muddah Hello Fuddah
3 pages
VCOP Pyramids Individual PDF
No ratings yet
VCOP Pyramids Individual PDF
5 pages
Touchstone 2 2nd Edition SB
No ratings yet
Touchstone 2 2nd Edition SB
84 pages
IRREGULAR VERB CHART - Learn English, Irregular, Verbs, Charts
No ratings yet
IRREGULAR VERB CHART - Learn English, Irregular, Verbs, Charts
2 pages
Understanding Sentence and Its Parts Towards Clear and Effective Communication
No ratings yet
Understanding Sentence and Its Parts Towards Clear and Effective Communication
44 pages
RWN Independent and Dependent Clauses P 65 AK
No ratings yet
RWN Independent and Dependent Clauses P 65 AK
2 pages