
1254 Paper

This paper outlines the selection criteria for low resource language programs in the context of Human Language Technology research in the U.S. It discusses various programs, their motivations, and the types of languages chosen for study, emphasizing the importance of demographic factors and resource availability. The authors aim to provide insights for future program managers to adapt and prioritize selection criteria based on their specific goals and contexts.
Selection Criteria for Low Resource Language Programs

Christopher Cieri○, Mike Maxwell◻, Stephanie Strassel○, Jennifer Tracey○

○ Linguistic Data Consortium, University of Pennsylvania
3600 Market Street, Suite 810, Philadelphia, PA 19104, USA
{ccieri, strassel, garjen} AT ldc.upenn.edu

◻ University of Maryland
College Park, MD 20742, USA
mmaxwell AT umd.edu

Abstract
This paper documents and describes the criteria used to select languages for study within programs that include low resource
languages, whether given that label or a similar one. It focuses on five US common task Human Language Technology research
and development programs in which the authors have provided information or consulting related to the choice of language. The paper
does not describe the actual selection process, which is the responsibility of program management and highly specific to a program’s
individual goals and context. Instead, it concentrates on the data and criteria that have been considered relevant previously, with the
thought that future program managers and their consultants may adapt these and apply them with different prioritization to future
programs.

Keywords: language resources, low resource languages, common task programs

1. Introduction

The past 10 years have seen significant growth in work on resource-poor languages within the Human Language Technology (HLT) research community. Whether one sees this growth as the natural outcome of successful HLT development in well-resourced languages or as an opportunity to test the generality of HLT, the shift in focus is undeniable. Within the United States alone, the TIDES, REFLEX LCTL, Babel and LORELEI programs have all focused on developing language resources and technologies for low resource languages. However, differences in the terminology, available information and goals of low resource language efforts lead to variability and some obscurity in the language selection process.

Moving beyond the four US programs named above, there is an even greater range of motives for studying low resource languages. For example, while the LORELEI program, described in greater detail below, seeks technologies to facilitate situational awareness in the event of a disaster, the EU-funded METANET (2010) program asserts that “The majority of European languages are severely under-resourced” and proposes that a “coordinated, large-scale effort has to be made in Europe to create the missing technologies and transfer this technology to the languages faced with digital extinction”. The motivations for the proposed effort include quality of life, information access and the ability to collaborate across multilingual Europe. The US National Science Foundation’s Documenting Endangered Languages program (2014) gives very different motivations: “Most of what is known about human communication and cognition is based on less than 10 percent of the world's 7,000 languages. We must do our best to document living endangered languages and their associated cultural and scientific information before they disappear.” Such differences in motivation clearly lead to very different languages being studied, and differences in the languages studied affect opportunities for collaboration across programs.

In addition to the programs’ commitments of time and finances, the new language resources (LRs) they create are critical for bringing HLTs to new languages, a matter of great importance for their speakers. Given the size of the program investment and the potential to impact speakers’ lives, we believe that selection criteria constitute a topic worthy of study. This paper, in an attempt to begin a dialog about how the community decides which languages to study, surveys the selection criteria used and available for use by low resource language research. The discussion herein focuses on several US programs for which the authors have provided information about the characteristics deemed relevant to the choice of languages. Importantly, our intent is not to sketch the actual decision making process, which was the responsibility of program management and, we believe, highly specific to the programs’ needs and contexts. Instead we will detail the kinds of information requested by program management and suggested by consultants as relevant to the decision making process, expecting that future decision makers will assign different priorities to the same kinds of data.

2. Definitions of Low Resource Language and Related Terms

Before describing low resourced language selection criteria, it will be useful to try to define terms and
understand the relations among them. In the past decade of work on creating language resources for languages that lack them we have seen terms such as low density, less commonly taught, under-resourced, less resourced and low resource. Related fields speak of critical and endangered languages. Distinguishing these will help justify the specific selection criteria used.

Endangered refers to languages that are at risk of losing their native speakers through a combination of death and shift to other languages. The term typically fits within classifications of languages according to risk of intergenerational disruption that may distinguish safe from multiple levels of endangerment and moribundity and, rarely, revitalization (Krauss 1992). In addition to reduction in speakers, both decline in domains of use and structural changes characterize endangered languages (Dorian 1980). Notably, the absence of a writing system increases risk of language death (Fishman 1991). We will have little else to say about endangered languages in this paper if only because they have not been the principal focus of the HLT projects we surveyed. To illustrate this point, Figure 2 charts the number of languages selected by each of the programs surveyed according to the languages’ Extended Graded Intergenerational Disruption Scale (EGIDS) rating. EGIDS scores are a measure of endangerment ranging from 1 to 13 with higher numbers indicating greater threat (Lewis and Simons 2010). As we see, nearly all of the languages have EGIDS scores of 1 or 2, which refer, respectively, to official national and provincial languages. An EGIDS score of 3 marks a language of broader communication lacking official status while 4 and 5 indicate languages in vigorous use with standardization, literatures and, in the case of 4, the support of educational institutions. None are described as threatened (EGIDS=6b).

Figure 2: EGIDS scores of languages selected by the programs surveyed

In contrast, critical has typically referred to languages that suffer an undesirable ratio of supply to demand, typically of teachers and translators. In the US, one sees the term used in government programs that sponsor language and cultural immersion. It is more difficult to find explicit definitions than enumerations of the critical languages. The US State Department funded Critical Language Scholarship Program for 2015 listed: Arabic, Azerbaijani, Bangla, Chinese, Hindi, Indonesian, Japanese, Korean, Persian, Punjabi, Russian, Swahili, Turkish and Urdu. If we assume these labels refer to standard languages spoken in their homelands then none are endangered. In fact Ethnologue (Lewis, Simons, Fennig 2015) lists all as statutory national languages except Punjabi and Swahili, which it lists as a statutory provincial language and de facto national language, respectively. However, almost half have appeared in one or more of the HLT programs mentioned above. One might also note that the actions of these programs and others have greatly increased the resources available for Standard Arabic and Mandarin Chinese though they remain critical languages.

Within the HLT literature, low density refers to languages “for which few online resources exist” (Megerdoomian, Parvaz 2008) or “for which few computational data resources exist” (Hogan 1999). The terms under-resourced or low resource seem to have similar semantics. However, as Hammarström (2009) explains, it is unclear whether this is measured in absolute terms or relative to some other language. If the former, then simply creating resources in the language could cause its classification to change, while if the latter, changes to the resources available for other languages could affect the classification. Hammarström also introduces the alternative low-affluence, defined via the metric of Gross Language Product (GLP), which is the product of the number of native speakers of the language in any country and the country’s per capita Gross National Product. We add to this discussion the possibility that ‘low’ is, like ‘critical’, relative to some expectation based on the importance of the language. Hammarström’s low-affluence has the advantage of a clear definition; however, its correlation with resource availability is imperfect, as Figure 1 shows.

Figure 1: Gross Language Product correlates moderately well with MetaNet's Estimate of Language Resource Support for European Languages

The MetaNet White paper series classifies European languages into five categories of language resource availability: Excellent, Good, Moderate, Fragmentary, Weak/No. Correlating the two measures we find several
problems: for example, Portuguese has a higher GLP than Dutch, Swedish, Polish, Czech and Hungarian but fewer LRs, and Lithuanian has a higher GLP but fewer resources than Serbian, Basque, and Estonian.

The term less commonly taught seems to have been borrowed by the HLT community from the second language teaching community, where it refers to instruction within a specific target market, be it the United States, the Western Hemisphere or perhaps outside of the region where the language is official. The US National Council of Less Commonly Taught Languages (LCTL) described its focus as languages “critically important to our national interest in the 21st century” but not the “French, German, Italian, or Spanish” studied by 91% of US college students. The discussion names Arabic, Chinese, Japanese, Yoruba, Russian, Swahili as specific LCTLs. That reference to teaching in a specific market operates within US HLT programs, where Hindi and Bengali were selected though they are the native language and/or the language of instruction for millions in India.

We also see the term surprise language used in relation to low resourced languages within several DARPA and IARPA HLT programs. Here ‘surprise’ does not refer to any inherent characteristic of the language, though surprise languages have usually been low resource. Instead the term refers to a specific HLT research management technique to determine the extent to which systems are portable, and to estimate the time required to port to the language from a standing start as might be necessary in the event, for example, of a natural disaster.

For the remainder of this paper we will use the term low resource languages (LRL) by default to refer to those that have fewer technologies and especially data sets relative to some measure of their international importance.

3. Programs

The programs surveyed for this paper differed in goals and thus the languages studied. In this section, as background to the discussion of selection criteria, we sketch each.

3.1. TIDES

DARPA TIDES (Translingual Information Detection, Extraction and Summarization) was originally conceived and presented as an intensively multilingual program with multiple technology development goals. The program manager’s brief from 1999 envisions at least query translation for 30 relevant languages but also machine translation, information retrieval and extraction and summarization for a subset. After some turn-over in project management, TIDES focused the bulk of its attention on English, (Mandarin) Chinese and (Modern Standard) Arabic but planned for Surprise Language Exercises. Of course, it is critical to remember that Chinese and Arabic were terribly under-resourced at the turn of the millennium and that it was the attention of programs like TIDES that increased the number, size and quality of available resources, with the end result that they are now among the more richly resourced. The TIDES Surprise Language exercises were intended to evaluate the HLT community’s ability to rapidly develop technologies for a low resource language with no prior warning, as would be necessary as a response to a natural disaster. Performers, including LDC as the data provider, were given one month from the date the surprise language – Hindi as it happened – was announced to create best-of-breed TIDES technologies. In preparation for the exercise, LDC supplied program management with a table of language characteristics as described below and managed a dry run of the data collection activities focused on another low resource language, Cebuano. Again it is important to note that, as with Mandarin Chinese and Modern Standard Arabic, the number of resources for Hindi has grown over the intervening years.

3.2. REFLEX LCTL

The US government sponsored the REFLEX (Research on English and Foreign Language Exploitation) LCTL program, which sought to create basic technologies in a number of low resource languages. Simpson et al. (2008) characterize the selected languages: “Some of the languages (Thai, Urdu) were chosen to exercise a resource collection paradigm in which raw text is available digitally in sufficient quantity; others (Amazigh, Guarani, Maguindanao) were chosen to force the program to deal with cases in which it certainly is not. The cluster of Indic languages (Bengali, Punjabi, Urdu) was chosen to give researchers the opportunity to experiment with bootstrapping systems from material in related languages. Amazigh, Hungarian, Pashto, Tamil, and Yoruba were chosen to take advantage of existing collaborations in order to reduce costs. Finally there was a general desire to select languages that are quite different from each other and from well-resourced languages in order to maximize the generality of our methods. As a group, the LCTL languages are linguistically and geographically diverse …”

3.3. NIST LRE

The US National Institute of Standards and Technology (NIST) has organized Language Recognition (LRE) technology evaluations (http://www.nist.gov/itl/iad/mig/lre.cfm) since 1996, for which LDC has often provided data. LRE does not explicitly seek to work on low resource languages. However, since LRE’s goal is to develop robust technologies that perform well even as the number of linguistic varieties increases, and since the number of well-resourced varieties is relatively small, it is inevitable that LRE would include low resource varieties. We use the term linguistic varieties because LRE requires performers to also distinguish confusable varieties including closely related languages and mutually intelligible dialects. The LRE selection process begins with a set of candidate varieties proposed by the US government sponsor from which the data provider selects
a subset based on two types of criteria: confusability and a series of factors related to the probability of success in data collection. Data are typically segments of broadcast and telephone conversations audited for the linguistic variety spoken, speaker number and sex, and sound quality. Thus, the ‘success’ criteria deal with the availability, in the variety of interest, of the desired data types and native speakers capable of the annotation. The 2011 campaign included the following potentially confusable sets: Iraqi, Levantine, Maghreb and Modern Standard Arabic; American and Indian English; Czech, Polish, Russian, Slovak and Ukrainian; Dari and Persian; Bengali, Hindi, Punjabi and Urdu; and Thai and Lao plus Mandarin, Pashto, Spanish, Tamil, Turkish. The 2015 evaluation used the same selection procedures but added Egyptian to the Arabic cluster, British to the English cluster and created three new clusters: Chinese (Mandarin, Cantonese, Min Nan, Wu); Spanish-Portuguese (Brazilian Portuguese, Caribbean Spanish, European Spanish, Latin American Spanish) and French (Haitian Creole, West African French). It also reduced the Slavic cluster to Russian and Polish.

3.4. IARPA Babel

IARPA Babel (http://www.iarpa.gov/index.php/research-programs/babel) sought to escape what the program described as an English bias present in existing speech recognition technologies. Babel systems should be capable of building a keyword search system for audio in essentially any language in a very short timeline. Program challenges included multilingual speech recognition and keyword search under difficult conditions, including resource scarcity and noisy environments, with the capability to rapidly adapt to new languages and environments. Selection criteria included estimates of language importance or interest, linguistic factors including diversity, and factors related to the ability to collect data within the designated schedule.

3.5. DARPA LORELEI

The LORELEI program (http://www.darpa.mil/program/low-resource-languages-for-emergent-incidents) seeks to advance the state of the art in human language technologies to allow rapid porting to low resource languages for purposes of information awareness in the event of a disaster. To accomplish those goals LORELEI technologies include speech recognition, machine translation and the extraction of information including topics, entities and their relations to each other, events and sentiment. The program is creating language resources for 23 representative and 12 incident languages, the latter to be used for estimating system performance in the event of a disaster. For each of these, the program will create language packs, the composition of which differs for representative and incident languages and also depending on whether the latter have been chosen for evaluation. The range of language resources in a pack could include monolingual and parallel text; found dictionaries, grammars, gazetteers and primers; entity, morphological, syntactic and semantic annotations; morpheme level alignment of source and translation; text processing tools and entity taggers; lexicons and grammatical sketches; and test data including parallel text with entity and topic annotation for a portion of the documents.

4. Selection Criteria

Because US LRL programs generally work from a presumption that resource availability should be in proportion to some measure of language importance, many of the selection criteria deal with demographic factors and the current resource supply. As Simpson and colleagues (2008) reported for the REFLEX LCTL program: “All meet the basic criteria of being significant in terms of the number of native speakers but poorly represented in terms of available language resources.” Another major concern for these projects is the probability of success, which is reflected partially in the former criteria types but also in the language typology and the availability of raw data in digital form.

4.1. Demographic

The population of native speakers as a raw number, rank or class (e.g. >1 million), either in the homeland or worldwide, may stand as a proxy for the language’s influence though the correlation is imperfect. English for example is 3rd behind Mandarin and Spanish in native speakers but probably greater in influence.

Hammarström’s GLP tries to correct for languages with many native speakers having less economic power. In 2009, he listed English as having the highest GLP with Spanish third and Mandarin seventh. GLP could be included among the selection criteria of future projects, with the caveats given in Section 2; however, none of the programs surveyed used GLP explicitly. If we consider, retrospectively, the GLPs of languages included in the US HLT programs we see the expected variation in profile. Figure 3 charts the number of languages studied in several US HLT programs in categories according to their GLP. For purposes of comparison, official EU languages and 2015 critical languages are similarly plotted. Categories of GLP are on the log-scaled x-axis and the number of languages in each category on the vertical. The leftmost column shows the number of languages for which Hammarström’s list of the 140 most affluent provides no GLP. As we see, only one language, Welsh, has a GLP greater than 1 but less than 10 billion. One other, English, has a GLP greater than 10 trillion. Among the remaining GLP categories the critical languages list is evenly distributed, as are the LRE languages. TIDES focused most of its attention on the very affluent English, Mandarin and Modern Standard Arabic while the TIDES Surprise Languages were significantly less so. LCTL, Babel and LORELEI, like the languages of the EU, all tend toward the less affluent end of the scale, at least for languages whose GLP we know.
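
The GLP-based view discussed here can be illustrated with a short Python sketch that computes a Gross Language Product and assigns it to an order-of-magnitude category of the kind used on a log-scaled axis. This is a minimal sketch and not Hammarström's procedure: summing the product of native speakers and per capita GNP across countries is one reading of the definition given in Section 2, the function names are hypothetical, and the countries, languages and figures are invented purely for the example.

```python
import math
from collections import Counter


def gross_language_product(speakers_by_country, gnp_per_capita):
    """Native speakers x per-capita GNP, summed over countries.

    Summing across countries is an assumption of this sketch.
    """
    return sum(
        speakers * gnp_per_capita[country]
        for country, speakers in speakers_by_country.items()
    )


def glp_category(glp):
    """Order-of-magnitude bin, e.g. 9 for a GLP of a few billion.

    Returns None when no GLP is known for the language.
    """
    if glp is None or glp <= 0:
        return None
    return math.floor(math.log10(glp))


def languages_per_category(glp_by_language):
    """Count languages in each GLP bin, as for a histogram over log-scaled bins."""
    return Counter(glp_category(glp) for glp in glp_by_language.values())


# Invented figures for two hypothetical countries and three hypothetical languages.
gnp = {"CountryA": 30_000, "CountryB": 10_000}
glps = {
    "LangX": gross_language_product({"CountryA": 2_000_000, "CountryB": 500_000}, gnp),
    "LangY": gross_language_product({"CountryB": 300_000}, gnp),
    "LangZ": None,  # no GLP available for this language
}
counts = languages_per_category(glps)
```

In this sketch the `None` bin plays the role of a separate column for languages with no known GLP, so that they are counted rather than silently dropped.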

Figure 3: Languages Selected by US HLT programs by GLP

Notwithstanding the number of native speakers, if most of those also speak another language with an even greater population or prominence, that could reduce the language’s importance to projects whose goal is to develop news understanding technologies. In preparation for the TIDES Surprise Language project we discussed this characteristic with program management and provided a table showing, for each language, whether a significant portion of its native speakers also spoke a language with a greater population of speakers. The rationale was that news transcription, translation and summarization technologies would do the most good when processing the languages in which the world’s news is likely to appear. On the Italian peninsula, Napoletano-Calabrese, Sicilian, Piemontese, Venetian, Emiliano-Romagnolo and Ligurian are among Hammarström’s 60 most affluent languages, scoring higher than Urdu, Vietnamese, Indonesian, multiple varieties of Arabic, Tagalog, Afrikaans, Yoruba and Latvian but lower than Italian by an order of magnitude. The very large percentage of native speakers who also speak Italian and the probability that events of international importance taking place in Italy would likely appear in the Italian language press seem to have contributed to the slow rate of technology development for the other languages of Italy.

In some cases, the most telling determinant of a language’s importance is almost certainly the population of second language speakers or the total number of speakers. For example, Swahili is spoken by far more second language speakers than by first language speakers, and its importance as a regional language therefore outweighs its importance as a first language.

Finally, if a significant number of its speakers are currently involved in some international event, such as a natural disaster, that naturally increases the language’s priority. The case of Haitian Creole comes immediately to mind.

4.2. Linguistic

The Ethnologue classification according to the family tree model provides information about the historical connections between languages that could prove useful in migrating HLTs. For example English and Frisian are both classified as: Indo-European, Germanic, West, sharing a closer relation than either does to, say, Danish which is: Indo-European, Germanic, North, East Scandinavian. Given the time and cost required to develop HLTs, numerous researchers have focused on the challenges of porting or migrating specific HLTs from one language to another or on developing HLTs that are intended to be general, requiring only training data in the target language in order to process that language. For example Vergyri et al. (2005) report “We found that most of the techniques developed for English or ECA ASR could be ported to the development of a LCA system.” The abbreviations ECA and LCA refer to Egyptian and Levantine Colloquial Arabic, respectively. Beyerlein et al. (1999) reported on experiments to create a speech recognition system for a low resource language, Czech, by augmenting its acoustic model with resources borrowed from other languages. Although the authors do not emphasize this point, the improvements gained by augmenting with acoustic models from a single language are greater for closely related Russian than for more distantly related English and Spanish and least for Mandarin. Elmahdy and colleagues (2014), working with more closely related varieties, report: “Due to the limitation of dialectal speech resources, by utilizing MSA data, cross-dialectal phone mapping, data pooling, acoustic model adaptation and system combination methods, has achieved 21.3% and 28.9% relative WER reduction on QA development set and evaluation set respectively.” The REFLEX LCTL program sought to encourage research into technology development across multiple closely related varieties by including several Indo-Aryan languages in the program: Bengali, Punjabi and Urdu. When data on mutual intelligibility is absent, LRE has also used the family tree as a way to locate potentially confusable varieties. Of course, the family tree model by itself does not consider language contact phenomena, such as the many borrowings from French into English, that can also affect mutual intelligibility and comparability.

A number of other factors can affect the effort required to create certain LRs for the language, for example whether the language is generally written by native speakers, whether its orthography is standardized, whether words and sentences are delimited in writing, and the ease with which one may map written words to their pronunciations. Similarly the nature of the morphology affects HLT development, not only whether the language tends toward an analytic or synthetic morphology but also such factors as the number of morphological classes and the degree of irregularity and syncretism present.

Some LRL programs strive to develop general computational methods applicable to a variety of other languages. For such programs, it is therefore necessary to choose languages with typological diversity in phonology, morphology, syntax, etc., so that the computational
methods developed on the chosen languages can later be applied more broadly. Both REFLEX LCTL and IARPA Babel explicitly sought linguistic diversity.

4.3. Resource

There is a challenge in low resource language selection that involves actual resource availability. If the language has too few resources, the project could mire in LR creation. On the other hand, if the language were too well resourced, the experience might not represent other low resource languages. It is therefore important before embarking on a project to pre-screen potential languages for the desired level of resource availability. All of the programs surveyed considered the range of raw data and existing LRs available: TIDES, LCTL, LRE, Babel and LORELEI.

In addition to the number of available LRs, programs might also look for specific types: news broadcasts from Voice of America, which is public domain in the US, translations of religious texts such as the Bible, Qur’an or Book of Mormon, other commonly translated texts such as the Universal Declaration of Human Rights or indeed any translations. LRL programs may also pre-screen to determine if there exist newspapers, radio and TV broadcasts in the language, or more recently: webcasts, user contributed videos such as YouTube, informal user-generated content including blogs and discussion forums, micro-blogs as in Twitter or other social media. Beyond the raw resources they may look for dictionaries, gazetteers and grammatical descriptions.

In terms of human resources, the project may try to find a local speaker community, preferably literate, including students and especially an expert. Alternately, the project may seek partners in country or other conditions favorable to a successful remote collection such as pre-requisite infrastructure and incentives appropriate to the native speaker population.

Additional desired LRs might include a standard digital encoding, and supplies of news text, parallel text, translation dictionaries, name taggers, segmenters, and morph analyzers.

Finally LRL programs have different goals so that the criteria used and the weight given to each will vary. Nevertheless, sharing information about the criteria used in those programs will benefit the community.

5. Implementation Challenges

The selection criteria we have briefly described form a kind of superset of those we have seen used in US HLT programs focused on LRL. Not all were used in all programs and the criteria have also evolved over time. Even if selection criteria were identified explicitly, other challenges await. Classifications differ in how they determine what constitutes a separate language. Languages have multiple names, some ambiguous (e.g. “He is speaking Creole/ Patois/ Dialect.”) and some overlapping.

The data on demographics, linguistic features and resource availability are difficult to collect and to weight. Furthermore demographics change, sometimes abruptly, as with the number of Syrian Arabic speakers in Europe. Resource availability also changes. Fifteen years ago, Quechua had virtually no web presence beyond a few sites with some Bible passages. Today the Quechuan languages have a fairly substantial web presence including audio newscasts on YouTube. During the TIDES Surprise Language program, Quechua was not a viable option but some varieties might be today. Although the spread of Internet access has proven helpful in documenting some languages, others have died during the web era and sites that host language data have disappeared.

6. Conclusions

We have sketched the criteria (demographic, linguistic and resource related) that have been considered in the process of selecting linguistic varieties for study in several US low resource language, common task HLT programs. The inventory of factors to be considered has varied by program as, apparently, has the weighting given to each. Nonetheless we see that programs balance resource availability against some measure of the languages’ importance or interest. They may also consider linguistic factors, especially those that permit the selection of either highly confusable or typologically diverse languages or both. By documenting these criteria we hope to open discussion concerning selection criteria in low resource language programs so that future projects may build on the early work surveyed here.

7. Acknowledgements

Several low resource language programs (TIDES, REFLEX LCTL, Babel and LORELEI) provided the opportunity to develop much of the information summarized here. For LORELEI, this material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or any sponsoring agency.

8. References

Beyerlein, P., W. Byrne, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, W. Wang. 1999. Towards Language Independent Acoustic Modeling. IEEE Workshop on Automatic Speech Recognition and Understanding, December 12 - 15, Keystone, Colorado, U.S.A.
Dorian, Nancy C. 1980. Language shift in community and individual: The phenomenon of the laggard semi-speaker. International Journal of the Sociology of Language 25.85-94.
Elmahdy, Mohamed, Mark Hasegawa-Johnson, and Eiman Mustafawi, “Development of a tv broadcasts
speech recognition system for Qatari Arabic,” in The 9th edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, 2014.
Fishman, Joshua A. 1991. Reversing language shift: Theoretical and empirical foundations of assistance to threatened languages. Clevedon, UK: Multilingual Matters.
Hammarström, H. 2009. A survey of computational morphological resources for low-density languages. Journal of the NEALT.
Hogan, Christopher. 1999. “OCR for Minority Languages” pp. 235-244 in David Doermann (ed.) “Proceedings SDIUT 1999: The 1999 Symposium on Document Image Understanding Technology.” University of Maryland Institute for Advanced Computer Studies, College Park MD.
Krauss, Michael. 1992. The world's languages in crisis. Language 68(1).1-42.
Lewis, M. Paul and Gary F. Simons. 2010. Assessing Endangerment: Expanding Fishman’s GIDS. Revue Roumaine de Linguistique 55(2):103–120. http://www.lingv.ro/RRL 2 2010 art01Lewis.pdf. Accessed March 21, 2011.
Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2015. Ethnologue: Languages of the World, Eighteenth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
Maxwell, Mike, Baden Hughes. 2006. Frontiers in Linguistic Annotation for Lower-Density Languages in Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, Sydney, Australia, Association for Computational Linguistics, pp. 29-37, URL: http://www.aclweb.org/anthology/W/W06/W06-0605
Megerdoomian, Karine, Dan Parvaz, 2008. Low-density language bootstrapping: The case of Tajiki Persian. In Proceedings of LREC 2008. Marrakech, Morocco, May 2008.
METANET. 2010. META-NET White Paper Series: Press Release, http://www.meta-net.eu/whitepapers/press-release-en, accessed March 16, 2016.
National Science Foundation Documenting Endangered Languages Program. 2014. Press Release 14-098: Federal agencies provide new opportunities for dying languages, August 15, 2014, http://www.nsf.gov/news/news_summ.jsp?cntn_id=132370, accessed March 16, 2016.
Rehm, Georg, Hans Uszkoreit, eds. 2012. META-NET White Paper Series: Europe's Languages in the Digital Age, URL: www.meta-net.eu/whitepapers
Simpson, Heather, Christopher Cieri, Kazuaki Maeda, Kathryn Baker, Boyan Onyshkevych. 2008. Human Language Technology Resources for Less Commonly Taught Languages: Lessons Learned Toward Creation of Basic Language Resources, paper presented at the SALTMIL Workshop: Free/Open-Source Language Resources for the Machine Translation of Less-Resourced Languages satellite to the 7th International Conference on Language Resources and Evaluation, Marrakesh, May 28-30
Vergyri, D., K. Kirchhoff, R. Gadde, A. Stolcke, J. Zheng. 2005. Development Of A Conversational Telephone Speech Recognizer For Levantine Arabic. Proceedings of Interspeech, Lisboa, Portugal.