
Mining Filipino-English Corpora from the Web

Joel P. Ilao* and Rowena Cristina L. Guevara†


Digital Signal Processing Laboratory, Electrical and Electronics Engineering Institute
University of the Philippines – Diliman, Philippines
Tel: +632-981-8500 local 3370
E-mails: *[email protected], †[email protected]

Abstract— This paper details the design and implementation of a system that automatically builds text corpora for the Filipino and English languages, intended for use as training, development and test sets of a Filipino-English Statistical Machine Translation (SMT) system. The developed SMT is based on the Moses SMT system [1]. This work presumes that a larger data set of well-formed sentences leads to a better model of any written language; hence, the existing Filipino-English text corpora on hand used for training the SMT system were significantly enlarged. Textual language data were mined from the World Wide Web (WWW) using a query-based scheme through the Yahoo! search engine, with Odds-Ratio used for search term selection as suggested by Ghani et al. in [2]. Initial monolingual and bilingual linguistic data were collected from various sources, such as the University of the Philippines – Sentro ng Wikang Filipino (UP-SWF), the De La Salle University – Center for Language Technologies (CeLT), and Adarna Publishing House. The corpora collection efforts resulted in a Filipino monolingual corpus of 9.104 million words and a Filipino-English bilingual corpus of 14.6K sentence pairs with around 580.2K words. The SMT system trained using these corpora scored 29.24 in the BLEU metric.

I. INTRODUCTION

A parallel corpus, sometimes called bitext, is a collection of original texts translated to another language, where the texts, paragraphs, and sentences down to the word level are typically linked to each other. A monolingual corpus, on the other hand, is a collection of phrases and sentences written in a single language. It is now a well-recognized fact that a corpus is more than just a collection of electronic texts: corpus data have to be selected with care with respect to the intended applications [3]. Some applications where carefully-selected corpora can be of great use include grammar checking, language recognition, transcription aids, spell checking and lexicography [4].

Machine Translation (MT) is a sub-field of computational linguistics that solicits the aid of computer software to translate text or speech expressed in one natural language into another. There are two types of MT: rule-based and corpus-based. Rule-based MT builds a database of rules for language representation and translation from linguists and other experts, while corpus-based MT automatically learns such information from sample text translations. An example of corpus-based MT is SMT, where statistical methods are used in modeling the source language, the target language, and the actual translation phenomenon. As such, it can be argued that a comprehensive language corpus used for training an SMT will lead to better translation accuracy [5].

The multilingual content of the Web is an ever-expanding resource [6]. It has proven to be a fertile source of corpora for linguistic study. The web has also been proven to be an adequate source of under-represented (also called minority) languages, such as Slovenian [2], Tagalog [6],[7], and Turkish [8].

Different efforts have been made in collecting corpora in different languages. Examples of reported parallel corpora are English-Spanish and English-French [9], Turkish-English [8], and the English-Indonesian (BPPT-PANL and BTEC-ATR) corpus [10]. Europarl [5] is a collection of material in 11 European languages taken from the European Parliament. Roxas et al. [3] were able to collect a Tagalog-English parallel corpus containing 207K words.

Efforts in building monolingual corpora started earlier than those for parallel corpora, and thus form a more extensive and comprehensive set. Some notable examples of such corpora are the Brown Corpus [11], composed of typical American English and consisting of 1 million words, and the Lancaster-Oslo-Bergen (LOB) Corpus [12] of 1 million British English words. The Penn Treebank [13] has 4.5 million syntactically annotated American English sentences and has been extensively used in various linguistic studies, whereas the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), is a syntactically annotated corpus of Middle English prose text samples [14] consisting of 1.2 million words.

The Philippine-based languages, including Tagalog, which is the basis of Filipino, the national language of the Philippines, are still classified as under-resourced (also referred to as minority) languages, since their online presence is small in comparison to the world's major languages, such as English. However, various efforts have been started in building electronic databases of Filipino corpora. For example, the Tagalog language has been farmed from the web using a query-based scheme [2]. Dimalen and Roxas [7] have also mined the web for Philippine-based languages, and incorporated a language classifier to differentiate closely-related languages (Tagalog, Cebuano and Bicolano). Roxas et al. [6] have collected and made available online various
literary and religious corpora of Tagalog, Cebuano, Ilocano and Hiligaynon of 250K words each, as well as 7K signs in video based on Filipino Sign Language (FSL). The De La Salle University – Center for Language Technologies (CeLT) has built a Filipino-English parallel corpus for use in training a multi-engine hybrid machine translation system. The MT efforts of the DLSU-CeLT, however, have so far only been tested in modules, and the corpus-based MT component still does not perform satisfactorily due to a lack of training data. The existing hybrid MT system has also not been tested using the BLEU metric, which is the standard metric used by the MT research community.

II. MATERIAL AND METHODS

A two-step approach was employed in building a Filipino-English parallel text corpus: (1) collection of a monolingual corpus and identification of websites containing a significant amount of bilingual text, and (2) focused crawling of fertile source websites for more extensive mining of parallel corpora. Using the collected data, a Filipino-English SMT system was trained and compared with existing available Filipino-English MT systems using standard performance metrics.

A. Scope and Limitation

This project only encompasses domains which can be freely accessed on the web, such as news, science, and informal discourse through blogs and fora. It is assumed that the true language model will be satisfactorily approximated through the corpus used to train the SMT. A query-based approach using a search engine (i.e. Yahoo!) was used to collect pertinent documents. A language filter was employed, based on a vocabulary file generated from a seed Filipino document fed to the SMT. The SMT system is based on the Moses SMT system, an open-source SMT that can be trained given any language pair [1].

B. Preliminary Data Collection

Linguistic data from various sources were collected in the initial phase of this project. The UP – Sentro ng Wikang Filipino provided a number of electronic copies of popular literary works which were translated to Filipino. UP-SWF also provided a small set of news articles taken from various Filipino news websites (i.e. Abante Tonite) from its currently ongoing BantayWika project. The DLSU-CeLT group, on the other hand, having undertaken research on MT development, contributed a large collection of Filipino and English corpora to the project. Finally, Adarna Publishing House, which publishes books for children, provided electronic copies of some of its books which contain both English and Filipino translations. These raw data, however, need to be cleaned and appropriately formatted before they can be fed into the developed Filipino-English SMT system.

C. Pre-processing of Text Data

The initially collected data, having been provided by different sources, came in different file formats (i.e. text file, MS Word document, HTML, RTF, and PDF). For training and development, the SMT system requires a pair of text files corresponding to two different languages, with each file containing line-by-line entries of single sentences aligned with the sentences in the second file. Hence, it was necessary to format the initially collected data files manually. Modest automation efforts were also made in converting RTF and MS Word files to text [15], cleaning the data by removing non-UTF-8 characters that cannot be displayed in plain text, tokenizing at the word and sentence level, and aligning sentences. For HTML files, it was also necessary to remove markup tags and embedded scripts from the source code.
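The snippet below is a minimal sketch of this kind of clean-up, assuming plain HTML input and naive punctuation-based sentence splitting; the actual scripts used in the study are not reproduced here.

import re

SCRIPT_RE = re.compile(r"<(script|style)[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def decode_plain_text(raw_bytes):
    # Drop byte sequences that are not valid UTF-8 and cannot be shown as plain text.
    return raw_bytes.decode("utf-8", errors="ignore")

def strip_markup(html_text):
    # Remove embedded scripts/styles first, then any remaining markup tags.
    return TAG_RE.sub(" ", SCRIPT_RE.sub(" ", html_text))

def split_sentences(text):
    # Rough sentence-level tokenization on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize_words(sentence):
    # Separate punctuation marks from words, as required before training.
    return re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)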
D. Language Modeling

Statistical n-gram language modeling was done using the SRI Language Modeling (SRILM) toolkit [16], both for the Filipino-English SMT and for the Corpus Web Miner developed for this study. For the Filipino-English SMT, the n-gram order used is 4, with modified Kneser-Ney smoothing up to order 7 [17]. The Corpus Web Miner, on the other hand, needs only the word prior probabilities for the sets of relevant and non-relevant documents; hence, only a 1-gram model with Katz backoff smoothing was used.
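The exact SRILM invocations are not given in the paper; as a rough illustration, models of the two kinds described above could be built with the toolkit's ngram-count tool along the following lines (file names are placeholders):

import subprocess

# Illustrative only: 4-gram LM with modified Kneser-Ney smoothing for the SMT.
subprocess.run(["ngram-count", "-order", "4", "-kndiscount", "-interpolate",
                "-text", "filipino_train.txt", "-lm", "filipino.4gram.lm"], check=True)

# Illustrative only: 1-gram model for the Corpus Web Miner (word prior probabilities).
subprocess.run(["ngram-count", "-order", "1",
                "-text", "relevant_seed.txt", "-lm", "relevant.1gram.lm"], check=True)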
E. Query Generation using Odds-Ratio

The Odds-Ratio (OR) of a word w, based on a given language model, is computed using (1) below:

    OR(w) = [P(w|L) / (1 - P(w|L))] / [P(w|OL) / (1 - P(w|OL))]        (1)

where P(w|L) is the word probability in a language L, and P(w|OL) is the word probability in the other languages OL. OR measures the uniqueness of a given word to a particular language, with highly unique words registering higher OR values.

Ghani et al. reported from an empirical study [2] on query-based web mining of two minority languages (particularly Slovenian and Tagalog) that the query term selection method using Odds-Ratio, selecting the highest k = 4 inclusive and exclusive terms in generating a search query, gave the best performance in terms of precision metrics among the query-based techniques studied. Thus, only the OR method is used as the query generator in this study.
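A minimal sketch of this selection step is shown below, assuming that the unigram probabilities have already been read from the two language models into the dictionaries p_L and p_OL (the names and the query format are illustrative, not taken from the paper):

def odds_ratio(p_in_l, p_in_ol, eps=1e-9):
    # Odds-Ratio of a word from its probability in L and in OL, as in (1).
    p_in_l = min(max(p_in_l, eps), 1.0 - eps)
    p_in_ol = min(max(p_in_ol, eps), 1.0 - eps)
    return (p_in_l / (1.0 - p_in_l)) / (p_in_ol / (1.0 - p_in_ol))

def build_query(p_L, p_OL, k=4):
    # k inclusion terms: highest OR with respect to L; k exclusion terms: highest OR with respect to OL.
    vocab = set(p_L) | set(p_OL)
    or_l = {w: odds_ratio(p_L.get(w, 0.0), p_OL.get(w, 0.0)) for w in vocab}
    or_ol = {w: odds_ratio(p_OL.get(w, 0.0), p_L.get(w, 0.0)) for w in vocab}
    inclusion = sorted(or_l, key=or_l.get, reverse=True)[:k]
    exclusion = sorted(or_ol, key=or_ol.get, reverse=True)[:k]
    return " ".join(inclusion + ["-" + w for w in exclusion])

Here the "-" prefix marks terms to be excluded from the search results, mirroring common search-engine query syntax.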
F. The Query-Based Corpus Web Miner

Two seed documents, manually pre-classified as Relevant and Non-Relevant, serve as input to the system. The Relevant seed document is a set of Tagalog-classified sentences, whereas the Non-Relevant seed document contains sentences classified as other languages, in this case prevalently English. The system also uses van Noord's implementation of TextCat for language classification [18]. The general algorithm of the Corpus Web Miner is shown below; a code sketch of the loop follows the listing. Note that the set of Relevant documents is referred to as L, while the set of Non-Relevant documents is referred to as OL.

1. Compute the word prior probabilities for L and OL.
2. Generate a query based on Odds-Ratio, and perform a web search.
3. Download the first hit not previously downloaded, noting the rank number of the downloaded document.
4. If the hit rank of the downloaded document is greater than 10, alter the search query, then repeat step 3 (see Section II.G).
5. If no document can be added, stop downloading.
6. Clean the contents of the downloaded web document.
7. Classify each sentence in the cleaned document, and append it to set L or OL depending on the language classification.
8. Repeat step 1.
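A condensed sketch of this loop is given below. All helper callables (the unigram prior estimation of Section II.D, the query builder of Section II.E, the search-engine interface, the cleaning step of Section II.C, the TextCat classifier, and the query-alteration routine of Section II.G) are passed in as parameters and are stand-ins rather than the actual implementation.

def mine_corpus(L, OL, unigram_priors, build_query, search, fetch,
                clean_and_split, classify_language, alter_query, max_rank=10):
    # Query-based mining loop (steps 1-8); L and OL are lists of sentences.
    downloaded = set()  # master list of URLs fetched so far
    while True:
        p_L, p_OL = unigram_priors(L), unigram_priors(OL)                            # step 1
        query = build_query(p_L, p_OL)                                               # step 2
        hit = next((h for h in search(query) if h["url"] not in downloaded), None)   # step 3
        while hit is not None and hit["rank"] > max_rank:                            # step 4
            query = alter_query(query)                                               #   (Section II.G)
            hit = next((h for h in search(query) if h["url"] not in downloaded), None)
        if hit is None:                                                              # step 5
            break
        downloaded.add(hit["url"])
        for sentence in clean_and_split(fetch(hit["url"])):                          # step 6
            lang = classify_language(sentence)                                       # step 7 (TextCat)
            (L if lang == "tagalog" else OL).append(sentence)
        # step 8: repeat from step 1 on the next iteration of the loop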
As noted in the previous section, word prior probabilities were computed from the 1-gram language models generated by the SRILM toolkit for the sets L and OL. The word probabilities were then used to compute the Odds-Ratio of each word found in the L and OL sets. The inclusion terms correspond to the k = 4 terms with the highest OR values in set L, while the exclusion terms are the k terms with the highest OR values in set OL.

Cleaning of downloaded web documents is done by first removing markup tags and scripts from the raw document. Tokenization is then performed by separating punctuation marks from actual words and lining up sentences one after the other.

Note that the process of downloading web documents is done iteratively, with the stopping criterion being the case when no new document can be added to the sets L and OL. This scenario would occur when a significant portion of all search-engine-indexed Tagalog web documents has already been downloaded.

G. Recovery from empty query results

As the document set of a particular language grows larger, adding a new document to the whole set has less effect on the document statistics, possibly resulting in the same generated search query in the next iteration. To recover from a situation where no document can be added because the search query is consistently equal to that of the previous iteration, we utilized a counter i that successively increments, first through the inclusion terms, then through the exclusion terms, taking the ith through the (i+k)th terms until a query that returns a URL is found.
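One way to read this recovery scheme is as a sliding window over the OR-ranked term lists; the sketch below is an interpretation, not the paper's code, and shift_query_window stands in for the query-alteration step of the mining loop above.

def shift_query_window(inclusion_ranked, exclusion_ranked, i, k=4):
    # Slide a window of k terms by i positions: first over the ranked inclusion
    # list, then over the ranked exclusion list once the inclusion list is exhausted.
    n = len(inclusion_ranked)
    if i + k <= n:
        inclusion = inclusion_ranked[i:i + k]
        exclusion = exclusion_ranked[:k]
    else:
        j = i + k - n  # continue sliding over the exclusion terms
        inclusion = inclusion_ranked[:k]
        exclusion = exclusion_ranked[j:j + k]
    return " ".join(inclusion + ["-" + w for w in exclusion])

The caller increments i and retries the search until a query returns a previously unseen URL.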
H. Identification of fertile web sources of good-quality Filipino documents, and multi-threaded downloading

The Yahoo! search engine allows its users to restrict search hits to a particular website. This feature is invoked simply by including in the query the keyword "site:" followed by a URL stem, as in "site:gmanews.tv" for filtering in hits coming only from the website *.gmanews.tv. Thus, it is possible to maintain the quality of documents being downloaded by specifying source URLs known to contain good-quality Filipino-written documents.

The initial list of source URLs was expanded by first letting the Corpus Web Miner run without any website filtering, and then inspecting the top web sources of downloaded documents by frequency count. The quality of documents coming from these top web sources was manually verified before the corresponding URLs were included in the list. TABLE 1 lists these top URL-stems.

TABLE 1
LIST OF URL-STEMS WITH GOOD-QUALITY FILIPINO DOCUMENTS

URL-stem
*.abante-tonite.com
*.philstar.com
*.gmanews.tv
*.abante.com.ph
*.abs-cbnnews.com
*.inquirer.net
*.sagad-bugso.tripod.com

In order to speed up the rate of document downloading, several instances of the Corpus Web Miner were run simultaneously, with each application thread downloading documents from one URL identified as a source of good-quality Filipino documents. A separate application thread which downloads freely from the web is also run, to allow moderate variety in the type of downloaded documents. However, a master list of the URLs downloaded so far is shared by all running threads in order to prevent duplication of downloaded documents. We have assumed that the document domain (i.e. blog, news, scientific) is loosely based on its source website; hence, each application thread builds its own L and OL sets during the downloading process, making it more convenient to retrieve sentences from a particular document class in the downloaded set.

Finally, since subsequent query terms heavily rely on the composition of the current L and OL sets, our Corpus Web Miner is prone to becoming locked into downloading just a particular document genre, especially since the terms with the highest Odds-Ratio can include words specific to a particular document category. For example, the words "Diyos" (God) and "Hesukristo" (Jesus Christ) have inordinately high frequency counts in Filipino Biblical passages, and hence would rank among the terms with the highest OR values once a long Filipino Biblical passage has been downloaded by the system. Thus, the OR-based query generation method can artificially influence the type of document that will later be searched by the system. The URL-specific multi-threaded downloading approach described above prevents this scenario from happening. A sketch of that threading scheme follows.
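The sketch below illustrates the shared master URL list and the per-site threads; mine_site stands in for one Corpus Web Miner instance restricted to a single URL-stem via the "site:" keyword, and all names are illustrative rather than taken from the actual system.

import threading

URL_STEMS = ["abante-tonite.com", "philstar.com", "gmanews.tv", "abante.com.ph",
             "abs-cbnnews.com", "inquirer.net", "sagad-bugso.tripod.com"]

downloaded = set()            # master list of URLs, shared by all threads
lock = threading.Lock()

def claim_url(url):
    # Atomically check the shared master list and record the URL if it is new.
    with lock:
        if url in downloaded:
            return False
        downloaded.add(url)
        return True

def run_miner(stem, mine_site):
    # One miner instance per site; a stem of None crawls the open web for variety.
    site_filter = ("site:" + stem) if stem else ""
    mine_site(site_filter, claim_url)   # each instance keeps its own L and OL sets

def launch_all(mine_site):
    threads = [threading.Thread(target=run_miner, args=(stem, mine_site))
               for stem in URL_STEMS + [None]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()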
I. Building the Filipino-English Parallel News Corpus

The translation accuracy of an SMT system is very much dependent on the amount and quality of sentence pairs in its training and development data sets. Thus, great care was taken in building the Filipino-English parallel corpora. The parallel news corpus was downloaded from the GMA News website (url: http://gmanews.tv). Not all news articles from this website have corresponding translations, hence the need to exhaustively search the site for existing parallel documents. Fortunately, each news article on this website can be accessed via a unique numeric identifier appended to the URL-stem, and most of the webpages adhere to a template layout. Hence, all possible news webpages were first downloaded and processed by a custom-tailored cleaning script. A language classifier was again used to separate the English from the Filipino news articles. Parallel documents were then identified by matching pairs of documents with the most statistically similar usage of proper names. The similarity threshold for each candidate document pair was set at 85%.
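The paper does not spell out the similarity measure; the sketch below uses a simple Jaccard overlap of capitalized tokens as a stand-in for "statistically similar usage of proper names", with the 0.85 threshold taken from the text.

import re

def proper_names(text):
    # Crude proper-name extraction: capitalized tokens (a naive stand-in for NER).
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def name_similarity(doc_a, doc_b):
    # Jaccard overlap of the proper-name sets of two documents.
    a, b = proper_names(doc_a), proper_names(doc_b)
    return len(a & b) / len(a | b) if a and b else 0.0

def match_parallel(filipino_docs, english_docs, threshold=0.85):
    # Pair each Filipino article with its most similar English article, if above threshold.
    pairs = []
    for fid, ftext in filipino_docs.items():
        scored = [(name_similarity(ftext, etext), eid) for eid, etext in english_docs.items()]
        if scored:
            best_score, best_eid = max(scored)
            if best_score >= threshold:
                pairs.append((fid, best_eid))
    return pairs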
J. The Filipino-English Statistical Machine Translation System

This study employs the Moses toolkit as its baseline system. Moses uses the GIZA++ architecture for training and word alignment, and makes use of a phrase-based decoder in generating candidate translations (also known as hypotheses). Minimum Error Rate Training (MERT) was used to optimize the weights used in generating the translation model.

III. RESULTS AND DISCUSSION

TABLE 2 shows the extent of the monolingual and bilingual corpora collected so far. It is evident that the corpora span different topic areas, and hence constitute a good collection for investigating usages of Filipino in different domains.
TABLE 2
TEXT CORPORA COLLECTED IN THE STUDY

MONOLINGUAL TEXT (FILIPINO)
Source                                              Number of Sentences (Words)
DLSU – Short Stories                                50.2K (1.19M)
DLSU – Short Stories, third corpus                  80.3K (1.09M)
DLSU – News Corpus                                  4.8K (133.7K)
UP Sentro ng Wikang Filipino (BantayWika Project)   14.6K (290.3K)
Web-mined articles                                  256K (6.4M)

BILINGUAL (PARALLEL) TEXT
Source                                              Number of Sentence Pairs (Words)
DLSU – Bible Passages                               3.1K (86.1K)
DLSU – El Filibusterismo                            2.3K (83.98K)
DLSU – The Little Prince                            667 (14.81K)
Adarna Publishing House                             740 (6.9K)
News articles (downloaded from GMANews.tv)          7.8K (388.5K)
The corpora collection efforts resulted in a Filipino monolingual corpus of 9.104 million words, of which a total of 6.4 million words came from articles downloaded by our query-based Corpus Web Miner. The Filipino-English bilingual corpus, on the other hand, consisted of 14.6K sentence pairs with around 580.2K words.

The web-mined monolingual Filipino corpora, however, are not completely clean. This can be attributed to the fact that our system captured not just the news articles but the reader comments as well, which explains the moderate amount of incorrect letter casing, orthographic errors, grammatical errors, and code switching in the sentences. News banners were also captured, causing sentence duplication problems in our data set. The duplication problem is further aggravated by news article repostings on other websites and by the existence of numerous Bible websites containing very similar content. In our current web-mined Filipino corpora, sentence duplications occur at a rate of 13.2%. Creative ways of addressing these issues in downloaded corpora can be the focus of a further study.

The SMT system trained using these corpora scored 29.24 in the BLEU metric. This BLEU score indicates that the developed SMT is good enough for simple conversational translation tasks. Future work can further extend the size of the Filipino-English parallel corpora, and investigate inherent linguistic phenomena observable in the Filipino language, particularly in the translation task.
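The paper does not state which BLEU implementation produced this score; for reference, a corpus-level BLEU of this kind can be computed with off-the-shelf tools, for example NLTK, assuming tokenized hypothesis and reference files with one sentence per line:

from nltk.translate.bleu_score import corpus_bleu

def corpus_bleu_from_files(hyp_path, ref_path):
    # One reference per hypothesis; both files are tokenized, one sentence per line.
    with open(hyp_path, encoding="utf-8") as hyp, open(ref_path, encoding="utf-8") as ref:
        hypotheses = [line.split() for line in hyp]
        references = [[line.split()] for line in ref]
    return 100.0 * corpus_bleu(references, hypotheses)  # 0-100 scale, as reported above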
ACKNOWLEDGMENT

The authors would like to acknowledge the University of the Philippines – Sentro ng Wikang Filipino, the De La Salle University – Manila Center for Language Technologies (CeLT), and Adarna Publishing House for providing the Filipino and English text corpora used as initial language databases for this project. The first author wishes to thank the Department of Science and Technology for the Engineering Research and Development for Technology (ERDT) scholarship grant given to him.

REFERENCES

[1] The Moses Statistical Machine Translation System. [Online]. http://www.statmt.org/moses/
[2] R. Ghani, R. Jones, and D. Mladenic, "Mining the Web to Create Minority Language Corpora," in 10th International Conference on Information and Knowledge Management (CIKM-2001), 2001, pp. 279-286.
[3] R.E. Roxas et al., "Building Language Resources for a Multi-Engine English-Filipino Machine Translation System," in Language Resource and Evaluation, 2008.
[4] J. Tiedemann, "Uplug: A Modular Corpus Tool for Parallel Corpora," in Parallel Corpora, Parallel Worlds: Selected Papers from a Symposium on Parallel and Comparable Corpora at Uppsala University, 2002.
[5] P. Koehn, Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Information Sciences Institute, University of Southern California, 2002.
[6] R.E. Roxas et al., "Online Corpora of Philippine Languages," in 2nd DLSU Arts Congress, De La Salle University – Manila, 2009.
[7] D. Dimalen and R. Roxas, "AutoCor: A Query Based Automatic Acquisition of Corpora of Closely-related Languages," in 21st Pacific Asia Conference on Language, Information and Computation, 2007.
[8] B. Megyesi, A. Hein, and E. Johanson, "Building a Swedish-Turkish Parallel Corpus," in 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 2006.
[9] P. Resnik, "Mining the Web for Bilingual Text," in 34th Annual Meeting of the Association for Computational Linguistics, Maryland, 1999.
[10] H.R. Budiono, H. Rizza, and C. Hakim, "Resource Report: Building Parallel Text Corpora for Multi-Domain Translation System," in 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009, Suntec, Singapore, 2009, pp. 92–95.
[11] W.N. Francis and H. Kučera, Manual of Information to accompany A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers. Providence, Rhode Island: Department of Linguistics, Brown University, 1964; revised 1971; revised and amplified 1979.
[12] S. Johansson, E. Atwell, R. Garside, and G. Leech, Manual of Information to accompany The Tagged LOB Corpus, 1986. [Online]. http://khnt.hit.uib.no/icame/manuals/lobman/
[13] M. Marcus, B. Santorini, and M. Marcinkiewicz, "Building a Large Annotated Corpus of English: The Penn Treebank," 1993.
[14] A. Kroch and A. Taylor, Penn-Helsinki Parsed Corpus of Middle English, 2nd ed., 2000. [Online]. http://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-2/
[15] Document Converter: a Python Program. [Online]. http://www.artofsolving.com/opensource/pyodconverter
[16] A. Stolcke, "SRILM: An Extensible Language Modeling Toolkit," in Intl. Conf. on Spoken Language Processing, Denver, Colorado, 2002.
[17] S.F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling," Center for Research in Computing Technology, Harvard University, Technical Report TR-10-98, 1998.
[18] G. van Noord. TextCat. [Online]. http://www.let.rug.nl/~vannoord/TextCat/
[19] (2005) Google Translator: The Universal Language. [Online]. http://blogoscoped.com/archive/2005-05-22-n83.html
[20] R. Ghani and R. Jones, "Learning a Monolingual Language Model from a Multilingual Text Database," in Ninth International Conference on Information and Knowledge Management (CIKM 2000), 2000.
[21] K.P. Scannell. Corpus Building for Minority Languages. [Online]. http://borel.slu.edu/crubadan/index.html
