Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base

Brant N. Kay
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
[email protected]

Brian C. Rineer
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
[email protected]

Abstract

This paper discusses a hybrid approach to transliterating and matching Arabic names, as implemented in the DataFlux Quality Knowledge Base (QKB), a knowledge base used by data management software systems from SAS Institute, Inc. The approach to transliteration relies on a lexicon of names with their corresponding transliterations as its primary method, and falls back on PERL regular expression rules to transliterate any names that do not exist in the lexicon. Transliteration in the QKB is bi-directional; the technology transliterates Arabic names written in the Arabic script to the Latin script, and transliterates Arabic names written in the Latin script to Arabic. Arabic name matching takes a similar approach and relies on a lexicon of Arabic names and their corresponding transliterations, falling back on phonetic transliteration rules to transliterate names into the Latin script. All names are ultimately rendered in the Latin script before matching takes place. Thus, the technology is capable of matching names across the Arabic and Latin scripts, as well as within the Arabic script or within the Latin script. The goal of the authors of this paper was to build a software system capable of transliterating and matching Arabic names across scripts with an accuracy deemed to be acceptable according to internal software quality standards.

1 Introduction

The challenges inherent to transliterating Arabic names from the Latin script to the Arabic script lie in the fact that there are many seemingly arbitrary ways to spell Arabic names using Latin characters. Halpern (2007) attributes this arbitrariness to the fact that certain Arabic consonant sounds simply do not exist in English, so they are represented in different ways using the Latin script. He also notes that dialectal differences in vowel pronunciation contribute to the variety of Latin spellings. Because there are often several Latin variants of a single Arabic name, it is difficult to successfully transliterate them from Latin to Arabic using a rule-based approach. Take, for example, the name محمد (Latin: Mohammed). The single Arabic representation of this name, محمد, can be spelled in several ways using the Latin script. Alternatives include:

Mohamad
Mohamed
Muhamad
Muhamed
Muhammet
Mohammad
Mohammed
Muhammad
Muhammed

Given the variety of spellings in these alternatives, it becomes clear why a lexicon-based approach is
necessary to transliterate such names from Latin to Arabic: rules cannot capture the arbitrary nature of Arabic name orthography as it is rendered using Latin characters. To illustrate this assertion, let’s focus on only the two variants Muhammet and Muhammed. These variants are a minimal pair differing only by their final consonant (‘T’ or ‘D’). The sounds for both ‘T’ and ‘D’ are rendered in Arabic as د at the end of the name محمد. One might therefore deduce that a rule can be devised to transform ‘T’ and ‘D’ to د at the end of a word. However, mapping both ‘T’ and ‘D’ to the Arabic character د is not always appropriate in the word-final context. For instance, the name Falahat in Arabic is فلاحت. Mapping the final ‘T’ in Falahat to د would produce فلاحد, which is not a valid transliteration of Falahat. To allow for such idiosyncrasies, a list must be built of all known Latin variants of Arabic names, along with their accompanying Arabic transliterations.

There are similar challenges inherent to transliterating Arabic names in the opposite direction, from the Arabic script to the Latin script. Take, for example, the name Ruwaida (Arabic: رويده). The single Latin representation of this name, Ruwaida, can be spelled in several ways using the Arabic script. Alternatives include:

رويده
رويدا
رويضه

Focusing specifically on the first two variants, it becomes clear why a rule-based approach will not reliably produce the Latin transliteration Ruwaida. رويده and رويدا are a minimal pair differing only by their final character (ه or ا). The sounds for both ه and ا are rendered in Latin as ‘A’ at the end of the name Ruwaida. One might therefore deduce that a rule can be generated to transform ه and ا to ‘A’ at the end of a word. However, mapping both ه and ا to the Latin character ‘A’ is not always appropriate in the word-final context. For instance, the name وجیه in Latin is Wajee. Mapping the final ه in وجیه to ‘A’ would produce Waja, which is not a valid transliteration for the name وجیه. To allow for this orthographical idiosyncrasy, a list must be built of all known Arabic variants of Arabic names, along with their accompanying Latin transliterations.

There is yet another orthographical complication in Arabic. Arabic is written without short vowels. Halpern (2007) refers to the omission of short vowels as the greatest challenge to achieving accuracy in transliterating Arabic to English. In the absence of information about vowel sounds, there could be several possible transliterations of a single name written in Arabic. Take, for example, فرغل (Latin: Farghal). Possible transliterations of this name might include:

Ferghal
Farghal
Firghul
Farghel
Farghil

One must have knowledge of the lexical item فرغل to know that Farghal is the proper way to render فرغل using Latin characters. There are no rules that would simply insert short vowels to produce the correct Latin transliteration. To illustrate this assertion we can examine the Arabic name فردوسی, which is properly transliterated to Latin as Firdausi. Both فرغل (Latin: Farghal) and فردوسی (Latin: Firdausi) begin with the same two Arabic letters ف (Latin: ‘F’) and ر (Latin: ‘R’). Yet in فرغل we would have to insert an ‘A’ between these two letters, whereas in فردوسی we would have to insert an ‘I’ between these two letters to generate each respective Latin transliteration. Consequently, no single vowel-insertion rule can suffice. Knowledge of each lexical item as a whole is necessary for generating the correct Latin transliteration.

The fact that Arabic is not written with short vowels also presents challenges for matching names across scripts when a rule-based approach is employed. Given the absence of vowel information from input in the Arabic script, we must ignore all vowels from input in the Latin script entirely when attempting to compare names across scripts. As a result, certain false matches occur, as seen in the following cluster of names:

Cluster:
خالد
Khaled
خلود
Kholoud

This cluster results from the fact that خالد is transliterated to Khaled, whose vowels are then removed via rules to produce the string KHLD.
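
The vowel-removal collision can be reproduced with a minimal sketch. The vowel set (A, E, I, O, U) and the function names below are assumptions made for illustration; they are not the QKB's actual rules.

```python
import re
from collections import defaultdict

# Assumed vowel set for illustration; the QKB's real rule set is not published here.
VOWELS = re.compile(r"[AEIOU]")


def consonant_skeleton(name):
    """Uppercase a Latin-script name and delete every vowel letter."""
    return VOWELS.sub("", name.upper())


def cluster_by_skeleton(names):
    """Group names whose vowel-less skeletons are identical."""
    clusters = defaultdict(list)
    for name in names:
        clusters[consonant_skeleton(name)].append(name)
    return dict(clusters)


if __name__ == "__main__":
    # Khaled and Kholoud land in one cluster because both reduce to KHLD.
    print(cluster_by_skeleton(["Khaled", "Kholoud", "Farghal"]))
```

Under these assumptions, two distinct names share one key, which is the false match described in the text.
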
Likewise, خلود is transliterated to Kholoud, whose vowels are then removed via rules to produce the string KHLD. The two Latin input strings Khaled and Kholoud likewise have their vowels removed via rules, producing the string KHLD in both cases, and all four strings match. Of course, if we consider using placeholders for vowels we could render Khaled and Kholoud as KH*L*D and KH*L**D, thereby preventing these two Latin renderings from falsely matching. But since Arabic does not contain short vowels, using a placeholder character prevents us from matching Arabic with Latin. There can be no placeholder in Arabic because there are no short vowels to hold on to.

A lexicon-based approach would help eliminate this problem of false matches. A list of all known Latin variants and all known Arabic variants of a single name could be mapped to a single canonical Latin representation. خالد and Khaled (along with all variants of this name in both scripts) could be mapped to Khaled. خلود and Kholoud (along with all variants of this name in both scripts) could be mapped to Kholoud. The resultant match behavior would produce these two clusters:

Cluster 1:
خالد
Khaled

Cluster 2:
خلود
Kholoud

Hence the problem of false matches can be reduced by using a comprehensive list of names and their variants. A system cannot produce these separate clusters by relying solely on a rule-based approach with a step that removes vowels.

Statistical machine translation-based approaches, such as that described in Hermjakob et al. (2008), have been successful at overcoming many of these challenges. However, the software discussed in this paper relies purely on a deterministic approach to transliteration and matching. The technologies employed in a machine-learning environment were simply not available in the QKB. The QKB is part of a generic system used to analyze and transform data in many languages across different data domains. It is not built to solve any one particular language problem, such as transliterating names between two scripts. Its components are kept simple to enable business users to customize language processing rules to solve a variety of linguistic problems. Therefore, the statistical methods required for training on a particular natural language task are not built into its architecture.

2 Method

This section describes the development and testing procedure of the Arabic name transliteration and matching technology, as implemented in the DataFlux Quality Knowledge Base (QKB).

2.1 Arabic to Latin Transliteration

A lexicon of approximately 55,000 Arabic name variants written in the Arabic script, and their accompanying Latin transliterations, was compiled using data acquired from the CJK Dictionary Institute [1]. In addition, an Egyptian subject matter expert manually created a lexicon of approximately 10,000 Arabic name variants written in the Arabic script along with their accompanying preferred Latin transliteration. Since the technology was implemented as part of an Egyptian Arabic software localization project, precedence was given to Egyptian conventions for spelling and spacing within Arabic names written in Latin as the standard for transliterated names. The list of preferred Egyptian transliterations was applied first, followed by the general list of transliterations acquired from the CJK Dictionary Institute. Together these two lexicons served as the primary source for transliteration. Prior to the application of the transliteration lexicons, basic cleansing operations, such as punctuation and diacritics removal, were applied. As a fallback, rules were designed after the Buckwalter Arabic transliteration scheme [2] to transliterate any names that were not found in either of the two lexicons. Some additional context-sensitive rules were added. For example, the ه character transliterates to the A character at a word boundary; elsewhere it becomes H. Three other characters that do not exist in the Buckwalter scheme (ئ, ء, and ؤ) were added as well because they were found in the Egyptian Arabic data that were used to test the system.

[1] http://www.cjk.org/cjk/index.htm
[2] http://open.xerox.com/Services/arabic-morphology/Pages/translit-chart

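
The order of operations in this section (cleansing, then the Egyptian lexicon, then the general lexicon, then fallback rules) can be sketched as follows. This is only an illustration under stated assumptions: the lexicon entries are hypothetical single-token stand-ins, the letter map is a toy table and not the Buckwalter scheme, and "word boundary" for ه is read here as word-final.

```python
import unicodedata

# Hypothetical miniature lexicons; the real ones hold thousands of entries.
EGYPTIAN_LEXICON = {"طارق": "Tareq"}
GENERAL_LEXICON = {"طارق": "Tariq", "زيتون": "Zeitoun"}

# Toy letter-for-letter map (an assumption, not the Buckwalter table).
LETTER_MAP = {"ا": "A", "ن": "N", "س": "S", "ت": "T", "و": "U", "ر": "R"}


def cleanse(token):
    """Drop combining marks (Arabic diacritics) and punctuation."""
    return "".join(
        ch for ch in token.strip()
        if unicodedata.category(ch) != "Mn"
        and not unicodedata.category(ch).startswith("P")
    )


def fallback(token):
    """Letter-for-letter rules with one context-sensitive case for ه."""
    out = []
    for i, ch in enumerate(token):
        if ch == "ه":
            # Assumed reading of the rule: A when word-final, H elsewhere.
            out.append("A" if i == len(token) - 1 else "H")
        else:
            out.append(LETTER_MAP.get(ch, "?"))  # "?" marks an unmapped letter
    return "".join(out).capitalize()


def transliterate(token):
    """Cleanse, try the Egyptian lexicon, then the general one, then rules."""
    token = cleanse(token)
    for lexicon in (EGYPTIAN_LEXICON, GENERAL_LEXICON):
        if token in lexicon:
            return lexicon[token]
    return fallback(token)
```

Because the Egyptian lexicon is consulted first, its preferred spelling wins whenever both lexicons contain a name, and the rules run only for names found in neither.
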
A sample of 500 full Arabic names was randomly drawn from a population of approximately 9000 full Arabic names written in the Arabic script, taken from a regional banking company’s customer database. The 500 names were then transliterated to the Latin script using the QKB. The results were sent to an Egyptian subject matter expert for review. Any transliteration errors were noted in the test results, and the correct transliteration was added to the Egyptian transliteration lexicon. Transliterations were judged as errors if either the lexicon or the fallback rules rendered an unacceptable transliteration according to the subject matter expert. This regression testing process was repeated until the number of errors was deemed to be acceptable according to internal software quality standards.

Example 1: Transliteration via Egyptian transliteration scheme
طارق جعفر ابوالعینین → Tareq Jafar AboAlEnein

Example 2: Transliteration via CJK Dictionary Institute lexicon
كاين محرج زيتون → Kayan Muharrij Zeitoun

Example 3: Transliteration via PERL regular expression rules
انا نستور مالاخیاس → Ana Nstur Malakhyas

2.2 Latin to Arabic Transliteration

A lexicon of approximately 863,282 Arabic name variants written in the Latin script, and their accompanying Arabic transliterations, was compiled using data acquired from the CJK Dictionary Institute. Additionally, an Egyptian subject matter expert manually created a lexicon of approximately 10,000 Arabic name variants written in the Latin script along with their accompanying preferred Arabic transliteration. As stated earlier, precedence was given to Egyptian conventions for spelling and spacing, so the list of preferred Egyptian transliterations was applied before the general CJK Dictionary Institute lexicon. Prior to the application of the transliteration lexicons, basic cleansing operations, such as punctuation and diacritics removal, were applied. As a fallback, rules were put in place after the transliteration lists. These rules performed basic letter-for-letter Latin to Arabic transliteration, with some additional context-sensitive rules provided by the Egyptian subject matter expert. For example, the Latin characters ‘Y’ and ‘I’ are transliterated to the Arabic character ى at word boundaries; elsewhere they become ي. The character ‘U’ is transliterated to و if it occurs after ‘O’; elsewhere it becomes ع.

A sample of 500 full Arabic names was randomly drawn from a population of approximately 8000 full Arabic names written in the Latin script, taken from a regional banking company’s customer database. The 500 names were then transliterated to the Arabic script using the QKB. The results were sent to an Egyptian subject matter expert for review. Any transliteration errors were noted in the test results, and the correct transliteration was added to the Egyptian transliteration lexicon. Transliterations were judged as errors if either the CJK Dictionary Institute lexicon or the fallback rules rendered an unacceptable transliteration according to the subject matter expert. This regression testing process was repeated until the number of errors was deemed to be acceptable according to internal software quality standards.

Example 1: Transliteration via Egyptian transliteration scheme
Mohamed Samir AbdElSalam → محمد سمیر عبدالسلام

Example 2: Transliteration via CJK Dictionary Institute lexicon
Makhtouf Nesra Abd Elwakel → مقطوف نصراء عبدالوكیل

Example 3: Transliteration via PERL regular expression rules
Anham Enshrah Shaghata → انهام انشراه شاغاته

2.3 Matching

Matching of Arabic names in the QKB is closely related to the Arabic to Latin Transliteration method described above. All names written in the Arabic script are transliterated to Latin in order to match the same, or similar, names across the two scripts.

Prior to applying transliteration lexicons, basic cleansing operations such as punctuation and diacritics removal are applied. As a supplementary step, Arabic name particles in both scripts (e.g.,
Abdel, Al, El, Abu, عبد, ال, ابو) are removed from the input to reduce the input string to a basic canonical representation before final matching. Names in the Arabic script are then transliterated using a lexicon of Arabic names and their Latin counterparts. A second transliteration lexicon, consisting of names in the Arabic script stripped of their particles, is applied. For example, when عبدالرازق (Latin: AbdelRazek) is stripped of the particle عبدال (Latin: Abdel) in the step above, the name becomes رازق (Latin: Razek). The second scheme then transliterates رازق to Razek. For any names in the Arabic script that are not in either of the two lexicons, Arabic to Latin phonetic transliteration rules are then applied on a letter-for-letter basis. These rules are similar to the Buckwalter transliterations, but are simpler in that there are fewer Arabic-to-Latin character mappings. That is, there are more Arabic characters that map to a single Latin character in the phonetic rules than there are in the Buckwalter transliteration scheme. This allows the system to match more names that are similar in pronunciation. After the phonetic transliteration step, all Arabic input is rendered in the Latin script, and further phonetic reductions (e.g., geminate consonant reduction, vowel transformations) take place before final matching.

A sample of approximately 8000 full Arabic names was randomly drawn from a population of approximately 17,000 full Arabic names, half written in Arabic, half in Latin, taken from a regional banking company’s customer database. The 8000 names were sent through a cluster analysis test using the matching technology described above. The results were sent to an Egyptian subject matter expert for review. Any false matches or missed matches were noted in the test results, and either the transliteration lexicon or the phonetic transcription rules were updated to yield more accurate match results. This regression testing process was repeated until the number of errors was deemed to be acceptable according to internal software quality standards.

Examples: Clusters of similar names, identified by the matching software system.

Example 1:
فاطمه عباس عبدالرازق
Fatma Abbas Abdel Razek
Fatima Abas Abdel Razik

Example 2:
Ahmed Malawi Abdel-Aaty
احمد معلاوى عبدالعاطى
احمد معلوى عبدالعاطي

3 Results

This section describes the results of the testing procedure of the Arabic name transliteration and matching technology, as implemented in the DataFlux Quality Knowledge Base (QKB).

3.1 Arabic to Latin Transliteration

After twelve iterations of regression testing, the QKB transliterated Arabic names written in the Arabic script to the Latin script with an accuracy of 92%. Testing was halted after twelve iterations because an 8% error rate was deemed acceptable according to internal software quality standards. Once the accuracy reached 92%, returns on further testing iterations diminished. Customers seeking increased transliteration accuracy for their particular data have the ability to add more names to the existing transliteration schemes. Perfect accuracy was neither necessary nor expected, and thus the product was considered ready to go to market. See above for sample transliterations.

3.2 Latin to Arabic Transliteration

After fourteen iterations of regression testing, the QKB transliterated Arabic names written in the Latin script to the Arabic script with an accuracy of 93.9%. Testing was halted after fourteen iterations because a 6.1% error rate was deemed acceptable according to internal software quality standards. Once the accuracy reached 93.9%, returns on further testing iterations diminished. Customers seeking increased transliteration accuracy for their particular data have the ability to add more names to the existing transliteration schemes. Perfect accuracy was neither necessary nor expected, and thus the product was considered ready to go to market. See above for sample transliterations.

3.3 Matching

After six iterations of regression testing, the QKB matched names across the Latin and Arabic scripts with an accuracy of 99.6% with respect to false
matches. That is, 0.4% of the matches generated by the QKB were false positives. The accuracy with respect to missed matches was 99.98%; a mere 0.025% of the data were missed matches, i.e., false negatives. Testing was halted after six iterations because the aforementioned error rates were quite acceptable according to internal software quality standards. See above for sample clusters of similar names.

4 Conclusion

Transliterating and matching Arabic names presents a challenge. Transliterating from Latin to Arabic proves difficult because there are so many Latin variants of a single Arabic name. This variety cannot be readily captured using rules, so a lexicon of Latin to Arabic transliterations must supplement such rules. Transliterating from Arabic to Latin is likewise a challenge for this very same reason. The variety of known Latin transliterations for a single Arabic name means no single transliteration is canonically correct. A list of preferred Latin transliterations for the Arabic-speaking country or region in question determines the correct transliteration. Rule schemes such as the Buckwalter Arabic transliteration scheme cannot capture regional orthographic conventions. Finally, the absence of short vowels in the Arabic script means there can be several possible Latin transliterations of a single Arabic name if rules are used. The absence of short vowels in Arabic also accounts for the insufficiency of using rules to match names across scripts. Without vowel information in the Arabic script, we must remove all vowels from the Latin script, and certain false matches occur. The use of a comprehensive lexicon to map all Latin and Arabic variants to a single Latin representation would help solve this problem.

The hybrid approach to transliterating and matching Arabic names, as implemented in the DataFlux Quality Knowledge Base (QKB), performed well in transliterating names across scripts. It should be noted that this paper is reporting on research in progress, as the QKB is continually undergoing updates. As the transliteration lexicons grow over time, transliteration accuracy will improve. Likewise, any additional contextual rules that may be added to the PERL regular expression rules, and/or the phonetic transliteration rules, will likewise contribute to better transliteration accuracy in both directions. The match results were excellent, most likely due to the significant phonetic reductions, including vowel transformations, which take place after transliteration. On the other hand, we permitted a high tolerance for false positives when evaluating the test results. At the time of development of the QKB’s name matching technology, the CJK Dictionary Institute lexicons were not available. In the future, matching will rely less on rules and will leverage the CJK Dictionary Institute lexicons to produce fewer false positives. Further research will involve testing the QKB on more comprehensive data from various sources, followed by subsequent improvements and updates to handle the varying conventions for data formats across different Arabic-speaking regions.

References

Jack Halpern. 2007. The Challenges and Pitfalls of Arabic Romanization and Arabization. In Proceedings of the Second Workshop on Computational Approaches to Arabic Script-based Languages. Palo Alto, CA.

U. Hermjakob, K. Knight, and H. Daumé III. 2008. Name Translation in Statistical Machine Translation - Learning when to Transliterate. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 389–397, Columbus, Ohio, June.
