Towards a comprehensive open repository of Polish language resources

Maciej Ogrodniczuk¹, Piotr Pęzik², Adam Przepiórkowski¹

¹ Institute of Computer Science, Polish Academy of Sciences
² University of Łódź

Abstract
The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland website containing an exhaustive collection of Polish LRTs. Current work is focused on the creation of new LRTs and, especially, the enhancement of existing LRTs, such as parallel corpora, annotated corpora of written and spoken Polish and morphological dictionaries, to be made available via the META-SHARE repository. Efforts are made to ensure a high level of reusability of the LRTs by adhering to widely accepted annotation and interoperability standards. Last but not least, since the great majority of the Polish CESAR resources are released under open licenses, special work is required to clarify their Intellectual Property Rights status.

Keywords: META-SHARE, parallel corpora, morphological dictionaries, annotated corpora, spoken corpora

Introduction

There are significant gaps in the distribution of key language resources and tools (LRTs) across different European languages. One of the main aims of the CESAR project, part of the META-NET consortium (http://www.meta-net.eu/projects/cesar), started in February 2011, is to make existing LRTs for Central and East European languages more readily available and more widely used. This aim is being achieved via three main means:

1. increasing the awareness of existing LRTs,
2. increasing their availability, especially via standard and liberal licensing,
3. increasing their reusability, especially via ensuring adherence to standards and by increasing their quality.

The aim of this paper is to report on some early successes and plans with respect to these points. The outline of the paper is as follows: Section 2 presents the idea of the open repository of Polish LRTs, Sections 3–5 briefly describe some of the main Polish resources enhanced within CESAR, while Section 6 covers one of them, parallel corpora, in more detail. Subsequently, Section 7 mentions other resources under consideration in the project and, finally, Section 8 concludes the paper.

Towards an open repository

To fulfil the need for increasing the awareness of existing LRTs for Polish and reinforcing relations between the key players in Polish natural language processing (NLP), a new Web portal, Computational Linguistics in Poland (CLIP, http://clip.ipipan.waw.pl), was established in mid-April 2011, following the idea of the page operated between 2000 and 2004 at the Institute of Computer Science, Polish Academy of Sciences. The site aims at containing exhaustive information about LRTs, research centres, projects and linguistic engineering courses related to Polish. Furthermore, it intends to bring language-related initiatives, institutions and people from research, government and industry communities together, offering them comprehensive information on available language technology.

One of the main design principles of the site was to maintain a wiki-like mode of operation, allowing the authorised representatives of all LRT groups in Poland to edit the content directly. This approach proved very fruitful and several modifications and additions have already been made by external editors. To the best of our knowledge, the site is currently the largest repository of references to publicly available Polish LRTs.

Along with creating synergies within the national language community, Polish CESAR partners play an active role in META-SHARE, the open language resource exchange infrastructure created by META-NET and operated at the European level. Its main function is sustainable sharing and dissemination of LRTs on a global scale. The operational level of META-SHARE is a network of distributed repositories providing a multi-layer infrastructure for OAI-PMH¹-enabled exchange of LRTs and related metadata, as well as interfaces for remote indexing of LRs. This initiative goes far beyond the repository setup: it promotes the use of widely acceptable LR standards ensuring their maximum interoperability and sustainability, and advertises its own CC-based licensing models and IPR provisions, offering legal and organisational support in the form of licensing templates, language resource sharing forms, ready-to-use agreement declarations and various other LR-related recommendations.

Regarding its technological impact, CESAR targets specific Polish language processing resources with a view to improving their availability, interoperability and representativeness. In the remaining part of this paper we introduce a number of key resources whose availability and interoperability will be improved within the CESAR project. We start with morphological dictionaries and annotated corpora, which are a basic prerequisite for most NLP solutions. Due to technical difficulties, spoken discourse corpora and speech databases are sparsely distributed across different languages. A separate subtask of the CESAR project is thus concerned with the release of a corpus of casual spoken Polish including a subset of time-aligned transcriptions, as well as a speech database of telephone conversations. A later section of this paper outlines CESAR's contribution to improving the availability of cross-linguistic NLP resources through the acquisition of new and existing parallel corpora annotated in widely accepted text encoding and translation memory formats such as TEI (Text Encoding Initiative; Burnard & Bauman 2008) and XLiFF². These language resources, together with a Polish Wordnet, dictionaries of named entities and a treebank of Polish, will be gradually released as part of the open repository in three batches planned for November 2011, June 2012 and January 2013 respectively.

¹ Open Archives Initiative Protocol for Metadata Harvesting, see http://www.openarchives.org/OAI/openarchivesprotocol.html.

² http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html

Dictionaries

Morphological dictionaries are among the most basic language resources, and most NLP tasks require their existence and availability. Until recently, there was only one morphological dictionary available for Polish under an open source licence (LGPL and Creative Commons), namely, Morfologik (http://morfologik.blogspot.com/; not to be confused with the Hungarian NLP company Morphologic). Another morphological analyser, Morfeusz (http://sgjp.pl/morfeusz/; Woliński 2006, Saloni et al. 2007), whose quality is widely believed to be higher than that of Morfologik, was available under a closed, albeit free for non-commercial applications, licence. These two tools seem to be the most widely used morphological analysers for Polish; in fact, both are used in the National Corpus of Polish (http://nkjp.pl/; Przepiórkowski et al. 2010, 2012).

Largely due to the efforts at the very initial stages of CESAR, the owners of the data of both dictionaries agreed to release them on a very liberal open source licence (the FreeBSD licence, also known as the 2-clause BSD licence). Moreover, again within CESAR, cooperation between the maintainers of the dictionaries has been initiated, leading to the creation of PoliMorf (Woliński et al. 2012), a single large morphological dictionary for Polish, comprising and extending both Morfologik and Morfeusz. A dedicated tool for extending the dictionary with new lexemes is currently in the final stages of development. The tool allows linguists to add lexemes and their morphological specification in a distributed fashion, over the Internet. Various quality control mechanisms have been implemented to minimise errors in the resulting dictionary. The first version of the new morphological dictionary resulting from the automatic merger of Morfeusz and Morfologik was made available in November 2011; the complete and supplemented version will be compiled by January 2013.

Annotated Corpora

Manually annotated corpora are important resources, used for training various language processing tools. Among the most basic such tools are morphological taggers, used for disambiguating the results of morphological analysers. The most comprehensive resource of this kind for Polish is the 1-million-word subcorpus of the National Corpus of Polish (Pol. Narodowy Korpus Języka Polskiego; NKJP), manually annotated at various linguistic levels, including the morphosyntactic level. However, for a morphologically rich language, 1 million words is not sufficient to attain the same tagging accuracy as, for example, for English (over 97%); in fact, current Polish taggers perform at the level of 92–93% (Piasecki 2007, Karwańska & Przepiórkowski 2011, Acedański 2010).

In order to improve these results, two kinds of activities are undertaken in CESAR. First, although a very careful annotation procedure was adopted in NKJP, annotation errors may readily be found in the corpus, so known issues are corrected manually and semi-automatically within CESAR. Additionally, statistical methods are employed to discover unknown errors. Second, an additional corpus of 500 thousand words is annotated within CESAR, with the aim of creating a high-quality 1.5-million-word training corpus. In order to minimise costs, an existing corpus is used for this purpose, namely, the Polish language of the 1960s corpus (http://clip.ipipan.waw.pl/PL196x; Ogrodniczuk 2003). The corpus was originally manually annotated with a much more limited tagset than that currently used for Polish, so the work consists in the semi-automatic conversion of the annotation of that corpus to the current standards and, most importantly, in its independent re-annotation. These two annotations are compared and any differences are sent for adjudication, thus increasing the annotation quality.
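The comparison of two independent annotations described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the procedure actually used in CESAR: the names `Disagreement`, `find_disagreements` and `agreement_rate`, the (token, tag) layout and the sample tags are all hypothetical and chosen for the example.

```python
from typing import List, NamedTuple, Tuple


class Disagreement(NamedTuple):
    """One token on which the two annotation passes differ (hypothetical type)."""
    index: int
    token: str
    tag_a: str
    tag_b: str


def find_disagreements(
    annotation_a: List[Tuple[str, str]],
    annotation_b: List[Tuple[str, str]],
) -> List[Disagreement]:
    """Compare two (token, tag) sequences covering the same text.

    Returns the positions where the tags differ; these are the cases
    that would be passed on to an adjudicator.
    """
    if len(annotation_a) != len(annotation_b):
        raise ValueError("annotations must cover the same tokens")
    disagreements = []
    pairs = zip(annotation_a, annotation_b)
    for i, ((tok_a, tag_a), (tok_b, tag_b)) in enumerate(pairs):
        if tok_a != tok_b:
            raise ValueError(f"token mismatch at position {i}: {tok_a!r} vs {tok_b!r}")
        if tag_a != tag_b:
            disagreements.append(Disagreement(i, tok_a, tag_a, tag_b))
    return disagreements


def agreement_rate(total_tokens: int, disagreements: List[Disagreement]) -> float:
    """Fraction of tokens on which both annotations agree."""
    if total_tokens == 0:
        return 1.0
    return 1.0 - len(disagreements) / total_tokens


# Illustrative data only: tags are loosely modelled on positional
# morphosyntactic tags and are not taken from the actual corpus.
converted = [
    ("Ala", "subst:sg:nom:f"),
    ("ma", "fin:sg:ter:imperf"),
    ("kota", "subst:sg:acc:m2"),
]
reannotated = [
    ("Ala", "subst:sg:nom:f"),
    ("ma", "fin:sg:ter:imperf"),
    ("kota", "subst:sg:gen:m2"),
]

to_adjudicate = find_disagreements(converted, reannotated)
```

In this sketch only the third token would be sent for adjudication, since the two passes assign it different case values. A real workflow would also have to handle differences in tokenisation, which the sketch simply rejects.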

Spoken Corpora

Corpora of casual spoken discourse are a rather rare resource for many languages. The largest collection of transcriptions of naturally occurring conversational Polish has been compiled by the PELCRA team³ at the University of Łódź since 2000, initially as part of the PELCRA Reference Corpus and later within the National Corpus of Polish (Pęzik 2012). In total, the corpus contains almost 2 million words of transcriptions of conversations recorded in an informal setting, often without some of the speakers knowing they were being taped (although they had been informed about and agreed to the possibility of being recorded, and later granted their permission to transcribe the recordings).

Figure 1: A sample of the time-aligned corpus of conversational Polish.

So far this data has only been available through online search interfaces, but within CESAR a subset of this data will be made available in the TEI P5 format following some privacy considerations. Furthermore, a selection of the transcriptions is being time-aligned with the original recordings at the level of utterances and made available under the GPL license through the META-SHARE repository.

Another multimedia speech corpus planned for inclusion in the META-SHARE repository is the TEI-encoded corpus of transliterated complex spontaneous human-human telephone conversations acquired in the course of the LUNA project (Spoken Language UNderstanding in multilinguAl communication systems; http://www.ist-luna.eu; Marciniak 2010). The source data were collected at the call centre of the Public Transport Authority of Warsaw and annotated in terms of semantic constituents and semantic structures (Mykowiecka & Waszczuk 2008).

The University of Łódź is also currently cooperating with industry partners to provide a specialized Spelling and Number Voice recognition corpus (SNUV) which could improve the performance of Polish voice recognition systems. The corpus will contain transcribed and time-aligned recordings of hundreds of Polish speakers reading numbers and spelling words, as these two areas are often key in real-life voice recognition systems. We are planning to release this corpus as part of the last batch of CESAR resources in January 2013.

³ See http://pelcra.pl.

Parallel Corpora

The Polish branch of CESAR makes an important contribution to the availability and interoperability of parallel corpora of Polish, as illustrated in Table 1. Some of the resources listed in the table are completely new and aligned manually (i.e. Polish Academy of Sciences Academia or Centre for Eastern Studies Corpus). Others have previously been available as parallel corpora in rather minimalistic formats (CORDIS & RAPID) or lacked some bibliographic metadata (JRC Acquis Communautaire) which we considered to be important in advanced parallel corpus search applications.

Collection                 Lang. pairs  Alignment            Original format  Documents
PAS Academia               1            Sentence, manual     PDF, DOC         500
CORDIS                     5            Sentence, automatic  HTML             10 000
RAPID                      1 (21)       Sentence, automatic  HTML             4 900
JRC Acquis Communautaire   1 (21)       Sentence, automatic  TEI              26 000

Table 1: The first batch of Polish parallel corpora.

The process of converting, processing and exporting parallel resources encoded in a variety of formats (ranging from HTML and PDF to TEI) is facilitated by the use of a central relational database system (named Paralela) to which text collections are imported in the first phase of the acquisition process. The Paralela database is used to store bibliographic, structural and alignment information, and it has been designed to handle multiple alignments of the same collection. Once the variously encoded collections are converted and normalised in the database, they can be processed and exported into more uniform and standard formats used for the exchange of parallel corpora and translation memories.

We have decided to provide the parallel data in two main formats, namely TEI and XLiFF. The first format is a widely recognised standard of annotating corpus data with good support for encoding structural, bibliographic and alignment annotation. The XLiFF format, on the other hand, although much less expressive, is supported by all major CAT environments as an increasingly popular way of exchanging translation memories. Any subset of the parallel collections can thus be used directly as a translation memory in a modern CAT environment. The general workflow of the process of conversion is shown in Fig. 2.
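The import-and-export workflow described above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions rather than a description of Paralela itself: the table and column names, the functions `build_demo_db` and `export_xliff`, and the sample sentences are hypothetical. The XLIFF 1.2 element names and namespace follow the OASIS specification, but the attribute choices are only one plausible minimal set.

```python
import sqlite3
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"

# Hypothetical schema, NOT the actual Paralela layout. Keeping alignments
# in their own table lets one collection carry several alignments.
SCHEMA = """
CREATE TABLE segment (
    id INTEGER PRIMARY KEY,
    collection TEXT NOT NULL,
    lang TEXT NOT NULL,
    text TEXT NOT NULL
);
CREATE TABLE alignment (
    id INTEGER PRIMARY KEY,
    collection TEXT NOT NULL,
    method TEXT NOT NULL
);
CREATE TABLE alignment_link (
    alignment_id INTEGER NOT NULL REFERENCES alignment(id),
    source_id INTEGER NOT NULL REFERENCES segment(id),
    target_id INTEGER NOT NULL REFERENCES segment(id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two aligned sentence pairs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO segment (id, collection, lang, text) VALUES (?, ?, ?, ?)",
        [
            (1, "demo", "pl", "Dzień dobry."),
            (2, "demo", "en", "Good morning."),
            (3, "demo", "pl", "Dziękuję."),
            (4, "demo", "en", "Thank you."),
        ],
    )
    conn.execute(
        "INSERT INTO alignment (id, collection, method) VALUES (1, 'demo', 'manual')"
    )
    conn.executemany(
        "INSERT INTO alignment_link (alignment_id, source_id, target_id) VALUES (?, ?, ?)",
        [(1, 1, 2), (1, 3, 4)],
    )
    return conn


def export_xliff(conn: sqlite3.Connection, alignment_id: int,
                 source_lang: str, target_lang: str) -> str:
    """Serialise one alignment as a minimal XLIFF 1.2 document."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": f"alignment-{alignment_id}",
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    rows = conn.execute(
        """
        SELECT src.id, src.text, tgt.text
        FROM alignment_link AS link
        JOIN segment AS src ON src.id = link.source_id
        JOIN segment AS tgt ON tgt.id = link.target_id
        WHERE link.alignment_id = ?
        ORDER BY src.id
        """,
        (alignment_id,),
    )
    for seg_id, src_text, tgt_text in rows:
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(seg_id)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = src_text
        ET.SubElement(unit, f"{{{XLIFF_NS}}}target").text = tgt_text
    return ET.tostring(root, encoding="unicode")


conn = build_demo_db()
xliff_document = export_xliff(conn, alignment_id=1, source_lang="pl", target_lang="en")
```

Because each alignment is a separate row with its own links, a second alignment of the same collection (for example an automatic one) could be stored alongside the manual one and exported independently. A TEI export would follow the same pattern with a different serialiser.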

Figure 2: Source texts are imported into a relational database and exported in XLiFF and TEI formats. Some of the source formats (e.g. HTML, PDF) required manual and/or automatic alignment.

More Polish parallel corpora are being prepared for the next two batches of resources, covering more than twenty different language pairs. Apart from automatically aligned collections, Polish-English and Polish-Russian corpora are being made available with manual alignment annotation for non-trivial cases of segment correspondence, which we believe can be used to evaluate and improve the performance of statistical translation aligners.

Other Resources

Apart from the above-mentioned core resources, the processing of which seems the most time-consuming and labour-intensive, another set of equally important resources will be made available through META-SHARE channels. The most prominent of them is the Polish Wordnet (Piasecki et al. 2009), still actively developed and therefore planned to be issued in all three CESAR batch editions. Another important resource, scheduled for January 2013, is the merger of existing dictionaries of Polish Named Entities. Various resources are planned to be gathered (e.g. from Tours, Poznań, Warszawa and Wrocław) and standardised within this task by encoding them in the LMF (Lexical Markup Framework; ISO:24613 2008) format.

The last batch of Polish CESAR resources will also include Polish and English dictionaries of collocations containing some 2 million potential collocations each, extracted from the National Corpus of Polish and the British National Corpus. For each potential collocation, a number of association and dispersion measures are computed and recorded in the dictionary along with annotations of the part-of-speech patterns in which they were found. The dictionaries will be available in the form of relational databases and they will hopefully be used to complement paradigmatically oriented lexical databases with syntagmatic information about the phraseological potential of word patterns.

Finally, Składnica (Woliński et al. 2011), a treebank of Polish constructed semi-manually on the basis of automatic syntactic analysis, will be made available in mid-2012.

Conclusion

Being part of META-NET creates a good opportunity to work out the long-term sustainability plan for the important LRTs, which must be extended, linked, preferably multilingually aligned, but first of all upgraded to recommended representation standards. Starting with technical interoperability provided by Unicode and XML, it is vital to maintain the standardisation principle also at the syntactic and semantic levels. The latter problem still remains open, even though it is tackled by several ongoing initiatives such as the ISOcat Data Category Registry⁴ and its instantiations (such as the one described in Patejuk & Przepiórkowski 2010); keeping the resource-structure layer standardised seems a much more straightforward task. For Polish resources the recommendations of FLaReNet and CLARIN are being followed, including LMF for the representation of dictionaries, XLIFF for parallel corpora and TEI for various textual resources. The conversion and maintenance of resources scheduled for META-SHARE inclusion in these formats is an important mission of the META-NET / CESAR project.

⁴ See http://www.isocat.org/interface/index.html.

Acknowledgements
Research funded in 2010–2013 within CESAR (CEntral and South-east europeAn Resources; http://www.meta-net.eu/projects/cesar), a European (CIP ICT-PSP) project (grant agreement 271022), part of META-NET.

References
Acedański, S. (2010). A morphosyntactic Brill tagger for inflectional languages. In H. Loftsson, E. Rögnvaldsson, and S. Helgadóttir, editors, Advances in Natural Language Processing: Proceedings of the 7th International Conference on Natural Language Processing, IceTAL 2010, Reykjavík, Iceland, volume 6233 of Lecture Notes in Artificial Intelligence, pages 3–14, Heidelberg. Springer-Verlag.
Burnard, L. & Bauman, S., editors (2008). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Oxford. http://www.tei-c.org/Guidelines/P5/.
ISO:24613 (2008). Language resource management – lexical markup framework (LMF). ISO/FDIS 24613, ISO TC 37/SC 4 document N 45 of 2008-03-21.

Karwańska, D. & Przepiórkowski, A. (2011). On the evaluation of two Polish taggers. In S. Goźdź-Roszkowski, editor, Explorations across Languages and Corpora: PALC 2009, pages 105–113, Frankfurt am Main. Peter Lang.
Marciniak, M., editor (2010). Anotowany korpus dialogów telefonicznych. Akademicka Oficyna Wydawnicza EXIT, Warsaw.
Mykowiecka, A. & Waszczuk, J. (2008). Semantic annotation of city transportation information dialogues using CRF method. In P. Sojka, A. Horák, I. Kopeček, and K. Pala, editors, Text, Speech and Dialogue: 12th International Conference, TSD 2009, Pilsen, Czech Republic, September 2009, volume 5729 of Lecture Notes in Artificial Intelligence, pages 411–419, Berlin. Springer-Verlag.
Ogrodniczuk, M. (2003). Nowa edycja wzbogaconego korpusu słownika frekwencyjnego. In S. Gajda, editor, Językoznawstwo w Polsce. Stan i perspektywy, pages 181–190. Komitet Językoznawstwa, Polska Akademia Nauk and Instytut Filologii Polskiej, Uniwersytet Opolski, Opole. http://www.mimuw.edu.pl/~jsbien/MO/JwP03/.
Patejuk, A. & Przepiórkowski, A. (2010). ISOcat definition of the National Corpus of Polish tagset. In LREC 2010 Workshop on LRT Standards, Valletta, Malta. ELRA.
Piasecki, M. (2007). Polish tagger TaKIPI: Rule based construction and optimisation. Task Quarterly, 11(1–2), 151–167.
Piasecki, M., Szpakowicz, S., & Broda, B. (2009). A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej, Wrocław.
Przepiórkowski, A., Górski, R. L., Łaziński, M., & Pęzik, P. (2010). Recent developments in the National Corpus of Polish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta. ELRA.
Przepiórkowski, A., Bańko, M., Górski, R. L., & Lewandowska-Tomaszczyk, B., editors (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warsaw. Forthcoming.
Pęzik, P. (2012). Język mówiony w NKJP. In Przepiórkowski et al. (2012). Forthcoming.
Saloni, Z., Gruszczyński, W., Woliński, M., & Wołosz, R. (2007). Słownik gramatyczny języka polskiego. Wiedza Powszechna, Warsaw.
Woliński, M. (2006). Morfeusz – a practical tool for the morphological analysis of Polish. In M. A. Kłopotek, S. T. Wierzchoń, and K. Trojanowski, editors, Intelligent Information Processing and Web Mining, Advances in Soft Computing, pages 503–512. Springer-Verlag, Berlin.
Woliński, M., Głowińska, K., & Świdziński, M. (2011). A preliminary version of Składnica – a treebank of Polish. In Z. Vetulani, editor, Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 299–303, Poznań, Poland.
Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., & Szałkiewicz, Ł. (2012). PoliMorf: a (not so) new open morphological dictionary for Polish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey. ELRA. These proceedings.

No ratings yet
Acceso Abierto y Propiedad Intelectual Octubre 2011
14 pages
Film Discourse: Corpus Analysis and Synchronic Perspective
No ratings yet
Film Discourse: Corpus Analysis and Synchronic Perspective
5 pages
Corpus
No ratings yet
Corpus
5 pages
On-Line Resources For Translation
No ratings yet
On-Line Resources For Translation
26 pages
Romanian Language Technology - Academic Perspective Dan Tufis
No ratings yet
Romanian Language Technology - Academic Perspective Dan Tufis
16 pages
UNL Based Machine Translation System For Punjabi Language (PDFDrive)
No ratings yet
UNL Based Machine Translation System For Punjabi Language (PDFDrive)
291 pages
Ontology On Ergonomic - 2
No ratings yet
Ontology On Ergonomic - 2
1 page
On-Line and Off-Line Translation Aids For Non-Nati
No ratings yet
On-Line and Off-Line Translation Aids For Non-Nati
6 pages
Lan & Meng 2023
No ratings yet
Lan & Meng 2023
23 pages
Natural Language User Interface - NLP Search
No ratings yet
Natural Language User Interface - NLP Search
6 pages
Corpus Linguistics
No ratings yet
Corpus Linguistics
25 pages
7
No ratings yet
7
4 pages
Kavgic
No ratings yet
Kavgic
15 pages
Features and Differences of The Parallel Corpus of English and Uzbek Languages. Jamshid Norov
No ratings yet
Features and Differences of The Parallel Corpus of English and Uzbek Languages. Jamshid Norov
5 pages
An Ensemble Approach For Comprehensive English Proficiency Evaluation Support: Grammatical Error Correction, Tense Prediction, and CEFR Grading
No ratings yet
An Ensemble Approach For Comprehensive English Proficiency Evaluation Support: Grammatical Error Correction, Tense Prediction, and CEFR Grading
13 pages
Coli 35 3 469
No ratings yet
Coli 35 3 469
7 pages
Will Lamb Inaugeral Lecture December 2024
No ratings yet
Will Lamb Inaugeral Lecture December 2024
15 pages
Graph Databases For Diachronic Language Data Modelling
No ratings yet
Graph Databases For Diachronic Language Data Modelling
11 pages
79 Flynn en
No ratings yet
79 Flynn en
5 pages
Tools Corpora and CAT S NMT Lelandem
No ratings yet
Tools Corpora and CAT S NMT Lelandem
35 pages
Chakraborty 2020
No ratings yet
Chakraborty 2020
6 pages
The Influence of English On The Pol
No ratings yet
The Influence of English On The Pol
15 pages