
A Processing Platform Relating Data and Tools for Romanian Language

Vasile Păiș, Radu Ion, Dan Tufiș

Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy
CASA ACADEMIEI, 13 “Calea 13 Septembrie”, Bucharest 050711, ROMANIA
{vasile, radu, tufis}@racai.ro

Abstract
This paper presents RELATE (http://relate.racai.ro), a high-performance natural language platform designed for the Romanian language. It is meant both for demonstrating the available services, from text-span annotations to syntactic dependency trees as well as playing or automatically synthesizing Romanian words, and for developing new, annotated corpora. It also incorporates the search engines for the large CoRoLa reference corpus of contemporary Romanian and for the Romanian wordnet. It integrates multiple text and speech processing modules and exposes their functionality through a web interface designed for the linguist researcher. It makes use of a scheduler-runner architecture, allowing processing to be distributed across multiple computing nodes. A series of input/output converters allows large corpora to be loaded, processed and exported according to user preferences.

Keywords: natural language processing, web platform, Romanian language processing

1. Introduction

Today's natural language processing challenges require the use of very complex pipelines applied to huge datasets. In this context, existing pipelines must be integrated and adapted for use inside high-performance environments such as clusters, grids or even the cloud. The entire flow needs to be supervised, and resume mechanisms must be in place in order to recover processing in case of unforeseen hardware or software errors.

Even though existing Romanian language resources are an order of magnitude fewer than those available for English, several new large data sets become available each year. For each new project that we are involved in, we are faced with processing hundreds of thousands of text files, in the several gigabytes range. Due to the large sizes involved, combined with the pipeline's complexity, this usually implies many days of processing time. Thus, the ability to distribute processing across multiple computing nodes becomes a necessity in order to reduce the required processing time. Furthermore, in order to allow scientists to focus on their research and not on technical issues, a user-friendly interface was needed, allowing easy interaction with the system.

RELATE is a Romanian language technology platform developed at the Institute for Artificial Intelligence of the Romanian Academy, integrating different state-of-the-art tools and algorithms for processing the Romanian language, developed either in-house or by our partners in different research projects. It evolved from our previous TEPROLIN platform (Ion, 2018), a demonstrative, single-file, multi-level processing pipeline, into a more complex platform allowing for user-friendly interaction with Romanian language technologies as well as storage, processing, visualization and downloading of large sets of annotated data. It was constructed using a task-based approach, where the user can load a corpus (usually as an archive), then start a number of annotation tasks and finally export the resulting data. The platform hides the complexities of distributing the load across the available processing nodes, waiting for data to be processed, error recovery and final gathering of results. Instead, the user is presented with an easy-to-use web interface where she/he can interact with the already annotated files and see the status of the entire annotation process. RELATE was constructed with the goal of making it accessible to at least two types of researchers: 1) theoretical linguists, Romanian language teachers and anyone interested in studying the Romanian language, by providing a visualization of the automatic analysis of any Romanian sentence, and 2) NLP researchers wishing either to have access to off-the-shelf Romanian annotators or to evaluate Romanian language technologies.

2. Related work

Among language resource inventories and search engines, META-SHARE [1] (Federmann et al., 2012) and CLARIN [2] are the biggest publicly available European websites for research and development in the field. ELRC-SHARE [3] (European Language Resource Coordination Share) is another website dedicated to European language resources, specifically for machine translation. Both ELRC-SHARE and META-SHARE offer search boxes through which one can easily find various language resources (language tools, annotated text or audio corpora, etc.) for any (European) language. Besides language resources for Romanian, our language of interest, there are complex processing pipelines such as NLP-Cube (Boroș et al., 2018) or TTL (Ion, 2007) that are able to do tokenization, POS tagging, lemmatization, chunking and dependency parsing. To use them, one has to be tech savvy, know Python 3 or Perl programming and be comfortable installing the required open-source libraries (this is the story of any open-source language technology tool, which limits its use to those who possess the knowledge to take the required steps).

[1] http://www.meta-share.org
[2] https://www.clarin.eu
[3] https://elrc-share.eu/

To make the composition of language processing chains more user-friendly, GATE (Cunningham, 2002) and TextFlows (Perovšek et al., 2016) allow for dragging and dropping text processing widgets into a graphical processing workflow to create the processing pipelines that the likes of NLP-Cube and TTL require computer programming to achieve. While graphically composing language processing chains is a big step towards the usability of the respective language technologies, their output is not enhanced with specialized visualization tools that allow access into the computational resources used for annotation.

Proceedings of the 1st International Workshop on Language Technology Platforms (IWLTP 2020), pages 81–88
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

RELATE aims specifically at doing automatic text processing, with annotations at multiple levels, along with annotation visualization and expansion into the corresponding linguistic computational resources. Compared to other platforms, such as (Wanxiang et al., 2010), our platform does not focus on exposing APIs, even though such text processing APIs do exist, either directly from the different components or as an indirect result of integrating several components. Instead, RELATE is designed to be an integrated environment accessible via the web interface. In some ways it is similar to the work of Morton and LaCivita (2003), with the addition of the web interface and parallel processing capabilities. Currently, the RELATE platform does not yet contain any functionality for automatic training of new models, as found in more recent platforms such as (Gardner et al., 2018). Furthermore, compared to the WebLicht (Hinrichs et al., 2010) platform, developed within the CLARIN project, RELATE is focused on Romanian language tools. Moreover, besides integrating tagging capabilities, the platform also integrates other tools, such as WordNet, translation, speech recognition and synthesis.

The processing workflow is guided via the addition of tasks which, by design, can work with the internal format produced by any other task. Thus, no workflow editor, such as the one used in (Perovšek et al., 2016), has been needed so far. Tasks can be chained together, one after another, without the need for complex "wiring".

3. Platform Architecture

RELATE has two main areas (see Figure 1): a public area and a private area. The public area allows running most of the annotation tasks as well as exploring other platform features without any data storage facilities. Therefore, this area is intended either for familiarizing a user with the platform or for small-scale annotations (like single sentences or small files which do not require long-term storage in the platform). The private area requires a user name and password [4] to be provided for user authentication and allows access to all platform features, including annotation of large corpora and storage of both raw and annotated data.

Figure 1: RELATE public and private areas

[4] The credentials are provided free of charge by request sent to one of the authors.

3.1 Platform components

The RELATE platform was constructed using an approach based on multiple interconnected layers. From the user's perspective, the first layer is the web front-end. It is in charge of displaying data to the user and employs visualizations such as: text views (for displaying raw text files as well as annotated files if the user opts for a text-like visualization), data grids (for visualizing tabular information, such as annotation results in different formats), tree views (useful for displaying dependency parsing information), and an integration of the Brat rapid annotation tool (Stenetorp et al., 2012) for named entity visualization. Furthermore, the visualization layer interconnects with visualizations made available by other projects, such as the interrogation tools of the Reference Corpus for Contemporary Romanian Language (CoRoLa) (Mititelu et al., 2018).

The second layer of the platform is the back-end layer. This is in charge of orchestrating user requests between the various integrated modules. In turn, this happens either via an ephemeral flow, with results communicated directly to the web front-end, or via the task system, with final storage in the platform's file system. The multi-layer architecture is presented in Figure 2.

Figure 2: RELATE multiple layers architecture

Components integrated in the RELATE platform are written in different programming languages, such as C/C++, Java and Python, as well as scripting languages (bash, PHP). Furthermore, most of them did not expose any web API, and the few that had such an API available used completely different invocation flows. This created serious integration challenges, as described in more detail in (Păiș et al., 2019). Basically, we had to either create a web API wrapper for the tools or execute them as separate processes and collect the produced temporary files. In order to guarantee uniform interrogation of multiple, related modules, we used the TEPROLIN web service, which integrates modules written in Python and other programming languages and exposes them through the same web API. This is in turn consumed by the back-end layer modules. Different modules are integrated either for textual annotation, as detailed below in the "Available annotations" sub-section, or only for enhancing the user visualization experience and allowing the researcher to make additional enquiries. Such is the case for the integration of the Romanian WordNet aligned with the English WordNet, which allows the user to explore the various senses of annotated words cross-lingually.
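One of the two integration routes mentioned in Section 3.1, executing a tool as a separate process and collecting the temporary files it produces, can be illustrated with a minimal Python sketch. This is not the RELATE implementation: the function name run_external_annotator, the calling convention (input and output file paths appended to the command line) and the toy tokenizer standing in for a real annotator are all assumptions made for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_annotator(command, text):
    """Run a command-line tool on `text` and return the content of its output file.

    `command` is a list of arguments; the input and output file paths are
    appended to it. This convention is hypothetical, chosen for the sketch.
    """
    with tempfile.TemporaryDirectory() as workdir:
        in_path = Path(workdir) / "input.txt"
        out_path = Path(workdir) / "output.txt"
        in_path.write_text(text, encoding="utf-8")
        # Execute the tool as a separate process; raise if it fails or hangs.
        subprocess.run(command + [str(in_path), str(out_path)], check=True, timeout=60)
        # Collect the temporary file before the working directory is deleted.
        return out_path.read_text(encoding="utf-8")


# Stand-in for an external annotator: writes one whitespace-separated token per line.
TOY_TOKENIZER = (
    "import sys;"
    "src, dst = sys.argv[1], sys.argv[2];"
    "tokens = open(src, encoding='utf-8').read().split();"
    "open(dst, 'w', encoding='utf-8').write('\\n'.join(tokens))"
)

tokens = run_external_annotator([sys.executable, "-c", TOY_TOKENIZER], "Ana are mere .")
print(tokens.splitlines())
```

Keeping each call inside its own temporary directory means that concurrent invocations cannot overwrite each other's files, which matters once several runner processes call the same tool in parallel.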

3.2 Available annotations

TEPROLIN, with its web service [5] interface, is a text preprocessing platform for Romanian (Ion, 2018) that currently offers 15 types of text transformations/annotations, from text-span annotations to syntactic dependency trees. Below is a brief account of these modules and the corresponding annotations:
1. Text normalization: removal of multiple consecutive spaces and normalization of Romanian diacritic codes;
2. Diacritics restoration: automatic detection of texts lacking Romanian diacritics and automatic diacritic insertion;
3. Word hyphenation (Stan et al., 2011);
4. Word stressed syllable identification (Stan et al., 2011);
5. Word phonetic transcription (Stan et al., 2011) using the SAMPA phonemes for Romanian [6];
6. Numeral rewriting (Stan et al., 2011): automatic transformation of numbers to their written form, useful in text-to-speech synthesis (e.g. 93 → "ninety-three");
7. Abbreviation rewriting (Stan et al., 2011): automatic expansion of abbreviations or acronyms to their full form, also useful for text-to-speech synthesis (e.g. art. → "article" or AI → "Artificial Intelligence");
8. Sentence splitting (Ion, 2007; Boroș et al., 2018);
9. Tokenization (Ion, 2007; Boroș et al., 2018);
10. POS tagging (Ion, 2007; Boroș et al., 2018) using the Morpho-Syntactic Descriptors (MSD) tag set for Romanian [7];
11. Lemmatization (Ion, 2007; Boroș et al., 2018);
12. Named entity recognition (NER) with four labels: person (PER), location (LOC), organization (ORG) and time (TIME) (Păiș, 2019);
13. Biomedical NER (Boroș et al., 2018) with four labels: disorder (DISO), anatomical part (ANAT), medical procedure (PROC) and chemical (CHEM). The sequence labeler was trained on the MoNERo corpus (Mitrofan et al., 2018), (Carp (Mitrofan), 2019);
14. Chunking (Ion, 2007) with four types of non-recursive syntactic phrases: noun (Np), verb (Vp), adjectival/adverbial (Ap) and prepositional (Pp);
15. Dependency parsing (Boroș et al., 2018) with the Romanian Universal Dependencies label set [8].

Each module was adapted and made available for integration as part of the ReTeRom project [9]. Development of the individual modules was carried out by the ReTeRom partners, as indicated in the references and on the project's website.

[5] http://relate.racai.ro:5000
[6] https://www.phon.ucl.ac.uk/home/sampa/rom-uni.htm
[7] http://nl.ijs.si/ME/V4/msd/html/msd-ro.html
[8] http://universaldependencies.org/ro/index.html
[9] http://www.racai.ro/p/reterom/index_en.html

TEPROLIN is a Python 3 module that integrates various NLP applications by requiring them to implement the TEPROLIN application programming interface:
• Resource loading, which usually takes from tens of seconds to minutes when the NLP application starts, is only allowed inside a specialized method which is called once when the implementing object is instantiated;
• If the NLP application is not written in Python 3, TEPROLIN expects the application to run on the same machine as the platform; communication with the resident process is done via an established inter-process communication mechanism (e.g. sockets or named pipes).

When adding a new NLP application, the software engineer has to insert its name and operations in the TEPROLIN operation graph. Using this graph, TEPROLIN is able to automatically resolve the requirements of the new operation (e.g. before doing POS tagging, the text has to be tokenized first).

Pushing the "DEMO" button in the TEPROLIN Web Service/Complete Flow menu entry will run the full processing chain (all 15 operations) on two sample Romanian sentences. These two sentences were chosen such that every annotation that TEPROLIN is able to give is present and can be visualized. The output of this run can be visualized in computer-readable formats (JSON, CoNLL-U [10], CoNLL-X, XML) as well as graphically: in "Tree" mode (the most informative) and in "Entities" mode, where NER annotations are displayed.

[10] https://universaldependencies.org/format.html

3.3 Task-based processing

In order to achieve better performance by harnessing the CPU resources available on different servers, the RELATE platform uses a task-based scheduler engine which in turn distributes the load across the available computing nodes. Since we targeted a mixed environment, with computing nodes of different sizes and performances, as well as a mixture of operating systems, we decided to develop our own task engine for the purposes of the platform. It has two components: the scheduler, which is the first to receive a new task and decides where it should be executed, and the task runners, which take care of actually running the task and storing the final files on the file system.

Each task runner process keeps track of the files already processed so that it can resume processing in case of a system failure. Furthermore, the process is activated via a cron job, which ensures automatic restart in case the task runner itself encounters a fatal error. In addition, logging is performed at operating system level, ensuring that all relevant messages are recorded and available for investigation. However, these logs are not displayed to the end user, being considered highly technical information, useful for platform developers. Entire processing pipelines are kept in memory and accessed by task runners via URL endpoints. This makes it possible to distribute the tasks on any computing nodes, regardless of their location: the same local area network (similar to a cluster environment), multiple networks (a grid environment) or across the Internet (a cloud environment). Of course, the location of the computing resources can influence the overall processing time due to the differences in transfer speeds. Nevertheless, in case of large corpora, we
consider the parallelization outweighs transfer times, thus reducing the total time required to process the files.

In order to avoid the costly synchronization issues that usually occur in distributed systems, the RELATE platform does not make use of any shared resources. The scheduler process allocates disjoint slices of the corpus to each of the task runners. This allows for parallel computation throughout the pipelines without the need to synchronize with other processes.

Finally, the last runner that finishes work related to any particular task will also be in charge of composing the final result, if needed. However, most tasks do not require a final assembly of data, since each annotation happens on a separate file. The scheduler and runners architecture is presented in Figure 3.

Figure 3: Task execution inside the RELATE platform

3.4 File formats

The platform was designed to allow corpora to be uploaded in the user's format, then processed and annotated in an internal platform format and finally exported to another user-specified format. For this purpose, the platform has an import/export interface which can be extended with new functionality to export different user-specified formats. Currently, the input format is either raw text files or comma-separated files (CSV). In the second case, the user can specify in the interface the column or columns containing text data.

The internal format used throughout the platform is the CoNLL-U Plus [11] format. This is a tab-separated set of columns, usually considered to be an extension of the CoNLL-U format. In order to allow for greater compatibility with CoNLL-U aware applications (and users), we have decided to keep the first 10 columns in the order of the CoNLL-U specification and extend them with additional information available in the platform, such as named entities, IATE and EUROVOC annotations (Coman et al., 2019).

[11] https://universaldependencies.org/ext-format.html

For output, the platform allows for a number of formats to be used, including: JSON, CoNLL-U (with limited annotations), CoNLL-U Plus variations, XML. In the case of CoNLL-U Plus, one possibility is to export the internal format, containing all the produced annotations, or a subset of those as required for different projects. The actual annotations available in the output file depend on the annotation tasks that were executed.

Since different modules in the pipeline require additional internal formats, other converters are available internally inside the platform, but are not exposed as input/output options.

3.5 Available visualizations

Apart from the annotation options described in section 3.2 above, RELATE integrates several visualization components, allowing the researcher to better interact with the data. These components can be accessed either directly, via the proper links present in the platform's main menu, or via action buttons made available when interacting with the annotated data.

The "Tree" visualization mode is the most comprehensive of all, displaying generated annotations as well as on-the-fly query results from other Romanian computational resources for the selected word. In other words, the user can relate (hence the name of the portal) the output of the automatic language processing chain with information stored in the associated Romanian computational resources, thus seeing whether the resource contains the relevant information and whether this information is useful when studying Romanian or how it could inform other automated Romanian processing algorithms. The "Tree" visualization mode has the dependency tree of a sentence in the center of the frame (one can see individual sentences using the arrows on the left/right of the current sentence). Dependency label names can be seen on the relations. If the user clicks a node in the tree, a panel of information about that word is opened to the right of the dependency tree: search in the CoRoLa corpus, search in the Romanian WordNet, listen to the native pronunciation of the word (if it is stored in the corpus) or to a synthesized version (if it does not exist in the speech corpus), produced using the SSLA Text-to-Speech module [12] (Boroș et al., 2018b).

[12] http://slp.racai.ro/index.php/ssla/

Besides linking other Romanian computational language resources and language tools, token annotations can also be inspected in the "Tree" view (e.g. POS tag, lemma, chunk membership, etc.). "Similar Words" will display up to the 10 most similar words to the clicked word, computed using word embeddings extracted from the CoRoLa corpus (Păiș and Tufiș, 2018). A lemma with POS version of the similar words list is also available.

The Romanian wordnet, RoWordNet, as described in (Tufiș and Mititelu, 2014), is made available for interrogation in the platform, either by itself or aligned with the English wordnet (Miller, 1995). The second option involves searching for a Romanian lemma in the wordnet, viewing the identified synsets and, based on the synset id, displaying the corresponding English information.

CoRoLa (Mititelu et al., 2018), which was constructed as a priority project of the Romanian Academy between 2014 and 2017, contains both written texts and oral recordings. For each of these components, dedicated query interfaces were made available. These were also integrated in the RELATE platform, allowing words to be researched for occurrences in CoRoLa. In the case of written data, interrogation is performed by integration of the KorAP corpus management platform, developed at the Institute for German Language (Leibniz-Institut für Deutsche
Sprache) in Mannheim (Bański et al., 2014; Diewald et al., 2016). Similarly, for interrogation of audio transcriptions aligned with voice recordings, the Oral Corpus Query platform (OCQP) (Boroș et al., 2018b) developed for CoRoLa was integrated, allowing the user to listen to the pronunciation of different words.

Since only a fraction of Romanian words are available in the audio component of the CoRoLa corpus, two speech synthesis components were integrated in the platform, allowing the user to listen to the pronunciation of other words as well. One such system is the Speech Synthesis for Lightweight Applications (SSLA), described in (Boroș and Dumitrescu, 2015). Another, more recent development is a system derived from our ROBIN project. Furthermore, the ROBIN project also produced an automatic speech recognition component, which was likewise integrated in the RELATE platform.

Text automatically recognized from speech can be processed through the RELATE platform text annotation components, even though at this moment we lack the integration of an automatic capitalization and punctuation restoration component. Therefore, this particular integration is currently useful only in the case of short sentences.

A machine translation component is also available for interrogation within the RELATE platform. This is derived from the project "CEF Automated Translation toolkit for the Rotating Presidency of the Council of the EU", TENtec no. 28144308, led by TILDE, a linguistic technology company specializing in neural automatic translation. As part of this project, the translation system (Ro-En and En-Ro) [13] was developed in partnership with the Research Institute for Artificial Intelligence "Mihai Drăgănescu" and is available for short translations within the RELATE platform.

[13] https://ro.presidencymt.eu/#/text

Apart from the dedicated components, the platform makes use of advanced data grids whenever such a display option makes sense. For this purpose, we integrated the PqGrid [14] component, which allows for features like: maximized view of the data grid, column reordering, sorting, searching and integration with JSON-based APIs. Furthermore, dependency-parsed sentences are displayed in a tree-like visualization which is enhanced with action buttons allowing exploration of words within the other visualization components, as detailed above.

[14] https://paramquery.com/

3.6 Statistics

For each corpus, a dedicated task can be started for computing corpus statistics. These are computed at various levels: entire corpus, word form, lemma. After being computed, they can be visualized in the RELATE interface or downloaded as CSV files. Similar to other tasks, the statistics task makes use of the parallel runners in order to reduce the overall time required.

Corpus-level statistics include: number of raw documents, number of annotated documents, number of sentences, number of tokens, number of "words" (strings separated by space characters), number of lines, number of characters. For each named entity type, the number of identified entities of that type is computed. Similarly, for each universal part-of-speech tag, the corresponding number of occurrences is computed.

Word form (token) statistics include the number of tokens, the number of unique tokens and, for each unique word form, the total number of occurrences as well as the total number of files containing that word form. Furthermore, the statistics task computes the number of words occurring only once in the entire corpus (also known as "hapax legomena"), the words occurring only two times and the words occurring only three times.

Lemma statistics include the number of unique lemmas as well as the number of occurrences of each lemma.

4. Case Study: Annotation of Romanian Legal Corpus

Within the "Multilingual Resources for CEF.AT in the legal domain" (MARCELL) [15] project, the seven participating teams cooperated in order to produce a comparable corpus aligned at the top-level domains identified by EUROVOC descriptors [16]. For the Romanian language, the legal database created includes more than 140K legislative documents issued starting with 1881. These were gathered from the Romanian legislative portal [17] and converted from HTML to raw text format. This resulted in 2.7 GB of raw text. During the conversion process certain metadata was also retrieved from within the HTML pages, but only the information required for the project's use cases was stored (such as the publication year of the document).

[15] https://marcell-project.eu/
[16] https://eur-lex.europa.eu/browse/eurovoc.html
[17] http://legislatie.just.ro/

For upload in the RELATE platform, the raw text was compressed into a zip archive with a size of 550 MB. After uploading to the platform, it was automatically decompressed by a task runner and its content was made available through the interface. Following a quick visual inspection to ensure the files were properly imported, an annotation task was launched. Given the large size of the corpus, the annotation process took about one month on the two physical servers which were made available for the project's purposes. Allocation of text files to pipeline components was orchestrated by the RELATE platform using the scheduler-runners approach described in Section 3.3 above. During this time, one server restart occurred due to a power outage, which demonstrated the platform's ability to recover from unexpected errors and resume annotation. Furthermore, while the task was running, annotated files became available in the interface as they were finished. This allowed the researchers involved in the project to look at the produced annotations and identify potential issues.

Once the basic annotation task ended, a separate, dedicated task was started for IATE [18] and EUROVOC annotations, using the method described in (Coman et al., 2019). This was again orchestrated by the RELATE platform and split across 10 processes, which managed to process the entire corpus in less than half an hour. Similar to the previous step, annotations were made available in the RELATE web interface and were consulted by the project's team. Figure 5 shows a data grid visualization of one of the annotated files. This visualization uses the CoNLL-U Plus format.

[18] https://iate.europa.eu/home

The large difference in the required time for the two annotation tasks is due to the number of processes
involved and their respective complexity. The IATE/EUROVOC annotator used the already tokenized and annotated documents from the previous step. More importantly, the Aho-Corasick algorithm (Aho and Corasick, 1975), used for detecting the corpus occurrences of the terms stored in the trie dictionary built from IATE Romanian terms, runs in linear time (see details in (Coman et al., 2019)).

Following the two annotation stages, a statistics task was executed in order to compute the overall statistics on the corpus, useful for reporting purposes. This was executed using 13 processes orchestrated by the platform and took about an hour and a half to compute the statistical indicators described in Section 3.6 above. Table 1 presents some of the computed statistics.

  Number of documents                144,131
  Number of tokens               456,079,723
  Unique tokens                    1,528,228
  Unique lemmas                    1,195,484
  Tokens occurring only once         772,141

Table 1: Statistics from the Romanian legal corpus obtained using the RELATE platform

Finally, a MARCELL-specific preparation task was executed, producing the output format agreed within the project. This is also a CoNLL-U Plus based format. Each document begins with a line describing the columns, followed by a "newdoc" marker holding the file id (# newdoc id = ro.legal). Each sentence in a document is labelled by a unique ID (example: "# sent id = ro legal.4"), followed by the text of the respective sentence (# text = ...). This is followed by a tab-separated list of 14 columns, according to the first descriptor line in the file. It contains the word id, word form, lemma, universal part-of-speech tag, language-specific part-of-speech tag, list of morphological features, head of the current word, universal dependency relation, underscore in columns nine and ten (since we don't use any enhanced dependency graph features or miscellaneous features), named entities in BIO format, NP chunk information, IATE and EUROVOC annotations.

The entire annotated corpus has a size of 29 GB and was archived using an archiving task, resulting in a zip archive of 4.3 GB, which is downloadable through the platform and was later stored in the MARCELL repository.

5. Conclusion

This paper presented an integrated, high-performance platform for the Romanian language, called RELATE. It allows researchers to upload a large corpus and perform annotations as well as complex analysis on the data. To achieve parallelization of time-consuming annotation operations, the platform uses a scheduler-runners mechanism. This allows CPU-intensive operations to be distributed across multiple processing nodes across a network or even across the Internet.

By integrating current state-of-the-art modules for processing the Romanian language, developed by different research partners, the RELATE platform strives to become a national reference portal.

Multiple input and output file formats are supported, while the internal format used by the platform is the CoNLL-U Plus format. Large archives can be uploaded, processed and finally downloaded in a standard annotated format.

The platform is loosely coupled with the processing pipelines, by means of URLs accessed by the task runner processes, thus complying with a micro-services architecture. Therefore, one of the key future developments envisaged for the platform is its containerization in the form of multiple Docker containers: one for the interface and one for the processing pipeline. This would allow for quick deployment on new processing nodes as well as increased durability when faced with operating system updates or changes in external libraries.

RELATE will be further enhanced with new Romanian language technologies/computational resources as they become available. While we do not aim at standardizing language technologies interoperation or annotation
visualization, thus admitting supplementary programming effort for each new addition, our focus is to keep thinking on how to best visualize and link automatically generated annotations with their supporting computational resources in such a way that the widest interested audience is best served doing their work.

In the spirit of European Language Grid, as National Center of Competence for Romania, we will try to persuade all the developers of technologies and resources for Romanian to adhere and contribute to the RELATE portal with new tools and data-sets.

6. Acknowledgements

Part of this work was conducted in the context of the ReTeRom project. Part of this work was conducted in the context of the Marcell project. Speech recognition and synthesis components referenced in this work were developed within the ROBIN project. Machine translation components referenced in this work were developed within the “CEF Automated Translation toolkit for the EU Council Presidency”, TENtec no. 28144308.

7. Bibliographical References

Aho, A. and Corasick, M. (1975). Efficient string matching: An aid to bibliographic search. Commun. ACM. 18:6, 333-340.
Bański, P., Diewald, N., Hanl, M., Kupietz, M. and Witt, A. (2014). Access Control by Query Rewriting. The Case of KorAP. In Proceedings of the Ninth Conference on International Language Resources and Evaluation (LREC’14). Reykjavik, European Language Resources Association (ELRA).
Boroș, T., Dumitrescu, D.Ș. and Burtică, R. (2018). NLP-Cube: End-to-End Raw Text Processing With Neural Networks. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, pages 171-179.
Boroș, T., Dumitrescu, D.Ș. and Păiș, V. (2018b). “Tools and resources for Romanian text-to-speech and speech-to-text applications”, in Proceedings of the International Conference on Human-Computer Interaction – RoCHI 2018, pp 46-53.
Boroș, T. and Dumitrescu, D.Ș. (2015). Robust deep learning models for text-to-speech synthesis support on embedded devices. In Proceedings of the 7th International Conference on Management of computational and collective IntElligence in Digital EcoSystems (MEDES'15), Caraguatatuba, Brasil.
Carp (Mitrofan) M. (2019). Extragere de cunoștințe din texte în limba română și date structurate cu aplicații în domeniu medical. PhD Thesis, Romanian Academy, 144 pages.
Che, W., Li, Z. and Liu, T. (2010). LTP: a Chinese Language Technology Platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations (COLING '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 13-16.
Coman, A., Mitrofan, M. and Tufis, D. (2019). Automatic identification and classification of legal terms in Romanian law texts, In Proceedings of ConsILR 2019, Cluj, România, pp 39-49.
Cunningham, H. (2002). GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36(2):223-254.
Diewald, N., Hanl, M., Margaretha, E., Bingel, J., Kupietz, M., Bański, P. and Witt, A. (2016). KorAP Architecture – Diving in the Deep Sea of Corpus Data. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portoroz, European Language Resources Association (ELRA).
Federmann, C., Giannopoulou, I., Girardi, C., Hamon, O., Mavroeidis, D., Minutoli, S. and Schröder, M. (2012). META-SHARE v2: An Open Network of Repositories for Language Resources including Data and Tools. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Turkey, pages 3300-3303.
Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., and Zettlemoyer, L. (2018). AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv:1803.07640.
Hinrichs, M., Zastrow, T., and Hinrichs, E. W. (2010). WebLicht: Web-based LRT Services in a Distributed eScience Infrastructure. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, pp 489-493.
Ion, R. (2007). Word Sense Disambiguation Methods Applied to English and Romanian. PhD Thesis, Romanian Academy, 148 pages (in Romanian).
Ion, R. (2018). TEPROLIN: An Extensible, Online Text Preprocessing Platform for Romanian. In Proceedings of the International Conference on Linguistic Resources and Tools for Processing Romanian Language (ConsILR 2018), November 22-23, 2018, Iași, Romania.
Miller, G.A. (1995). WordNet: A Lexical Database for English, Communications of the ACM, Vol. 38, No. 11:39-41.
Mititelu, B.V., Tufiș, D. and Irimia, E. (2018). The Reference Corpus of Contemporary Romanian Language (CoRoLa). In Proceedings of the 11th Language Resources and Evaluation Conference – LREC’18, Miyazaki, Japan, European Language Resources Association (ELRA).
Morton, T. and LaCivita, J. (2003). WordFreak: an open tool for linguistic annotation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Demonstrations - Volume 4 (NAACL-Demonstrations '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 17-18. DOI: https://doi.org/10.3115/1073427.1073436.
Păiș, V. (2019). Contributions to semantic processing of texts; Identification of entities and relations between textual units; Case study on Romanian language. PhD Thesis, Romanian Academy, 114 pages.
Păiș, V., Tufiș, D. and Ion, R. (2019). Integration of Romanian NLP tools into the RELATE platform. In Proceedings of the International Conference on Linguistic Resources and Tools for Processing Romanian Language – CONSILR 2019, pages 181-192.
Perovšek, M., Kranjc, J., Erjavec, T., Cestnik, B. and Lavrač, N. (2016). TextFlows: A visual programming platform for text mining and natural language processing, Science of Computer Programming, Volume 121, Pages 128-152.
Stan, A., Yamagishi, J., King, S. and Aylett, M. (2011). The Romanian Speech Synthesis (RSS) corpus: building a high quality HMM-based speech synthesis system using a high sampling rate. Speech Communication 53(3):442-450.
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S. and Tsujii, J. (2012). brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012.
Tufiș, D. and Mititelu, B.V. (2014). The Lexical Ontology for Romanian. In Nuria Gala, Reinhard Rapp, Gemma Bel-Enguix (eds) Recent Advances in Language Production, Cognition and the Lexicon, pages 491–504, Springer, 2014.
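
Illustrative note on the MARCELL output format: the 14-column CoNLL-U Plus layout described for the MARCELL preparation task can be sketched with a small reader. This is a minimal sketch under stated assumptions, not the code used by the platform. The `# global.columns` header line and the ten standard column names follow the CoNLL-U Plus convention; the four `EX:`-prefixed column names, the `sent_id` spelling and value, and the sample sentence with all its annotation values are hypothetical and serve only to make the example runnable.

```python
from typing import Dict, List, Tuple

# Ten standard CoNLL-U columns, followed by four extra annotation columns.
# The "EX:" names are placeholders (assumption), not the project's real names.
COLUMNS = ("ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
           "EX:NE EX:NP EX:IATE EX:EUROVOC")

# Hypothetical token rows: every value below is made up for illustration.
ROWS = [
    ["1", "Guvernul", "guvern", "NOUN", "_", "_", "2", "nsubj", "_", "_", "B-ORG", "1", "_", "_"],
    ["2", "adoptă", "adopta", "VERB", "_", "_", "0", "root", "_", "_", "O", "_", "_", "_"],
    ["3", "legea", "lege", "NOUN", "_", "_", "2", "obj", "_", "_", "O", "2", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_", "O", "_", "_", "_"],
]

SAMPLE = "\n".join(
    [
        "# global.columns = " + COLUMNS,   # descriptor line naming the columns
        "# newdoc id = ro.legal",          # document marker
        "# sent_id = ro.legal.1",          # sentence identifier (assumed form)
        "# text = Guvernul adoptă legea.",
    ]
    + ["\t".join(row) for row in ROWS]
)


def parse_conllup(text: str) -> Tuple[List[str], List[Dict]]:
    """Return the declared column names and a list of parsed sentences."""
    columns: List[str] = []
    sentences: List[Dict] = []
    current: Dict = {"comments": {}, "tokens": []}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current["tokens"]:
                sentences.append(current)
                current = {"comments": {}, "tokens": []}
            continue
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            key, value = key.strip(), value.strip()
            if key == "global.columns":
                columns = value.split()
            else:
                current["comments"][key] = value
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} columns, found {len(fields)}")
        current["tokens"].append(dict(zip(columns, fields)))
    if current["tokens"]:
        # Flush the last sentence when the input has no trailing blank line.
        sentences.append(current)
    return columns, sentences
```

Calling `parse_conllup(SAMPLE)` yields the column list and one sentence whose tokens are dictionaries keyed by column name, so that the unused ninth and tenth columns (`DEPS`, `MISC`) appear as underscores and the extra annotation layers can be read by name. Reading the column names from the descriptor line, rather than hard-coding positions, keeps the reader usable if further annotation columns are appended.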