Approaches To Arabic Name Transliteration and Matching in The Dataflux Quality Knowledge Base
necessary to transliterate such names from Latin to Arabic -- rules cannot capture the arbitrary nature of Arabic name orthography as it is rendered using Latin characters. To illustrate this assertion, let’s focus on only the two variants Muhammet and Muhammed. These variants are a minimal pair differing only by their final consonant (‘T’ or ‘D’). The sounds for both ‘T’ and ‘D’ are rendered in Arabic as د at the end of the name محمد. One might therefore deduce that a rule can be devised to transform ‘T’ and ‘D’ to د at the end of a word. However, mapping both ‘T’ and ‘D’ to the Arabic character د is not always appropriate in the word-final context. For instance, the name Falahat in Arabic is فلاحت. Mapping the final ‘T’ in Falahat to د would produce فلاحد, which is not a valid transliteration of Falahat. To allow for such idiosyncrasies, a list must be built of all known Latin variants of Arabic names, along with their accompanying Arabic transliterations.

There are similar challenges inherent to transliterating Arabic names in the opposite direction -- from the Arabic script to the Latin script. Take, for example, the name Ruwaida (Arabic: رويده). The single Latin representation of this name, Ruwaida, can be spelled in several ways using the Arabic script. Alternatives include:

رويده
رويدا
رويضه

Focusing specifically on the first two variants, it becomes clear why a rule-based approach will not produce the Latin transliteration Ruwaida. رويده and رويدا are a minimal pair differing only by their final character (ه or ا). The sounds for both ه and ا are rendered in Latin as ‘A’ at the end of the name Ruwaida. One might therefore deduce that a rule can be generated to transform ه and ا to ‘A’ at the end of a word. However, mapping both ه and ا to the Latin character ‘A’ is not always appropriate in the word-final context. For instance, the name وجیه in Latin is Wajee. Mapping the final ه in وجیه to ‘A’ would produce Waja, which is not a valid transliteration for the name وجیه. To allow for this orthographical idiosyncrasy, a list must be built of all known Arabic variants of Arabic names, along with their accompanying Latin transliterations.

There is yet another orthographical complication in Arabic. Arabic is written without short vowels. Halpern (2007) refers to the omission of short vowels as the greatest challenge to achieving accuracy in transliterating Arabic to English. In the absence of information about vowel sounds, there could be several possible transliterations of a single name written in Arabic. Take, for example, فرغل (Latin: Farghal). Possible transliterations of this name might include:

Ferghal
Farghal
Firghul
Farghel
Farghil

One must have knowledge of the lexical item فرغل to know that Farghal is the proper way to render فرغل using Latin characters. There are no rules that would simply insert short vowels to produce the correct Latin transliteration. To illustrate this assertion we can examine the Arabic name فردوسی, which is properly transliterated to Latin as Firdausi. Both فرغل (Latin: Farghal) and فردوسی (Latin: Firdausi) begin with the same two Arabic letters, ف (Latin: ‘F’) and ر (Latin: ‘R’). Yet in فرغل we would have to insert an ‘A’ between these two letters, whereas in فردوسی we would have to insert an ‘I’ between these two letters to generate each respective Latin transliteration. By definition, no vowel insertion rule can suffice. Knowledge of each lexical item as a whole is necessary for generating the correct Latin transliteration.

The fact that Arabic is not written with short vowels also presents challenges for matching names across scripts when a rule-based approach is employed. Given the absence of vowel information from input in the Arabic script, we must ignore all vowels from input in the Latin script entirely when attempting to compare names across scripts. As a result, certain false matches occur, as seen in the following cluster of names:

Cluster:
خالد
Khaled
خلود
Kholoud

This cluster results from the fact that خالد is transliterated to Khaled, whose vowels are then removed via rules to produce the string KHLD.
Likewise, خلود is transliterated to Kholoud, whose vowels are then removed via rules to produce the string KHLD. The two Latin input strings Khaled and Kholoud likewise have their vowels removed via rules, producing the string KHLD in both cases, and all four strings match. Of course, if we consider using placeholders for vowels we could render Khaled and Kholoud as KH*L*D and KH*L**D, thereby preventing these two Latin renderings from falsely matching. But since Arabic does not contain short vowels, using a placeholder character prevents us from matching Arabic with Latin. There can be no placeholder in Arabic because there are no short vowels to hold on to.

A lexical-based approach would help eliminate this problem of false matches. A list of all known Latin variants and all known Arabic variants of a single name could be mapped to a single canonical Latin representation. خالد and Khaled (along with all variants of this name in both scripts) could be mapped to Khaled. خلود and Kholoud (along with all variants of this name in both scripts) could be mapped to Kholoud. The resultant match behavior would produce these two clusters:

Cluster 1:
خالد
Khaled
Cluster 2:
خلود
Kholoud

Hence the problem of false matches can be reduced by using a comprehensive list of names and their variants. A system cannot produce these separate clusters by relying solely on a rule-based approach with a step that removes vowels.

Statistical machine translation-based approaches, such as that described in Hermjakob et al. (2008), have been successful at overcoming many of these challenges. However, the software discussed in this paper relies purely on a deterministic approach to transliteration and matching. The technologies employed in a machine-learning environment were simply not available in the QKB. The QKB is part of a generic system used to analyze and transform data in many languages across different data domains. It is not built to solve any one particular language problem, such as transliterating names between two scripts. Its components are kept simple to enable business users to customize language processing rules to solve a variety of linguistic problems. Therefore the statistical methods required for training on a particular natural language task are not built into its architecture.

2 Method

This section describes the development and testing procedure of the Arabic name transliteration and matching technology, as implemented in the DataFlux Quality Knowledge Base (QKB).

2.1 Arabic to Latin Transliteration

A lexicon of approximately 55,000 Arabic name variants written in the Arabic script, and their accompanying Latin transliterations, was compiled using data acquired from the CJK Dictionary Institute.¹ In addition, an Egyptian subject matter expert manually created a lexicon of approximately 10,000 Arabic name variants written in the Arabic script along with their accompanying preferred Latin transliteration. Since the technology was implemented as part of an Egyptian Arabic software localization project, precedence was given to Egyptian conventions for spelling and spacing within Arabic names written in Latin as the standard for transliterated names. The list of preferred Egyptian transliterations was applied first, followed by the general list of transliterations acquired from the CJK Dictionary Institute. Together these two lexicons served as the primary source for transliteration. Prior to the application of the transliteration lexicons, basic cleansing operations, such as punctuation and diacritics removal, were first applied. As a fallback, rules were designed after the Buckwalter Arabic transliteration scheme² to transliterate any names that were not found in either of the two lexicons. Some additional context-sensitive rules were added. For example, the ه character transliterates to the A character at a word boundary; elsewhere it becomes H. Three other characters that do not exist in the Buckwalter scheme (ئ, ء, and ؤ) were added as well because they were found in the Egyptian Arabic data that were used to test the system.

¹ http://www.cjk.org/cjk/index.htm
² http://open.xerox.com/Services/arabic-morphology/Pages/translit-chart
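
The lookup order described in Section 2.1 (cleansing, then the Egyptian lexicon, then the general lexicon, then fallback rules) can be illustrated with a minimal Python sketch. It is not the QKB implementation: the two miniature lexicons, the abridged letter table, and all function names are hypothetical, the entry "Tariq" is invented only to show precedence, and "word boundary" is assumed to mean word-final position for the ه rule.

```python
import unicodedata

# Hypothetical miniature lexicons; the real ones hold thousands of entries.
EGYPTIAN_LEXICON = {"طارق": "Tareq"}
GENERAL_LEXICON = {"طارق": "Tariq", "زيتون": "Zeitoun"}

# Hypothetical, heavily abridged letter table for the rule-based fallback.
FALLBACK_LETTERS = {
    "ا": "a", "ب": "b", "ت": "t", "د": "d", "ر": "r",
    "س": "s", "ل": "l", "م": "m", "ن": "n", "و": "u",
}


def cleanse(token: str) -> str:
    """Drop combining marks (Arabic diacritics) and punctuation."""
    return "".join(
        ch for ch in token
        if not unicodedata.combining(ch)
        and not unicodedata.category(ch).startswith("P")
    )


def fallback(token: str) -> str:
    """Letter-by-letter rules, including the context-sensitive rule for ه."""
    out = []
    for i, ch in enumerate(token):
        if ch == "ه":
            # Assumption: "word boundary" is read as word-final position.
            out.append("a" if i == len(token) - 1 else "h")
        else:
            # "?" marks letters missing from this abridged table.
            out.append(FALLBACK_LETTERS.get(ch, "?"))
    return "".join(out).capitalize()


def transliterate_name(full_name: str) -> str:
    """Egyptian lexicon first, then the general lexicon, then fallback rules."""
    parts = []
    for raw in full_name.split():
        token = cleanse(raw)
        if not token:
            continue
        latin = EGYPTIAN_LEXICON.get(token) or GENERAL_LEXICON.get(token)
        parts.append(latin if latin else fallback(token))
    return " ".join(parts)
```

With these toy tables, `transliterate_name("طارق")` returns the Egyptian-preferred `Tareq` rather than the general-lexicon spelling, while a name found in neither lexicon, such as `انا نستور`, falls through to the letter rules and comes out as `Ana Nstur`.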
A sample of 500 full Arabic names was randomly drawn from a population of approximately 9000 full Arabic names written in the Arabic script, taken from a regional banking company’s customer database. The 500 names were then transliterated to the Latin script using the QKB. The results were sent to an Egyptian subject matter expert for review. Any transliteration errors were noted in the test results, and the correct transliteration was added to the Egyptian transliteration lexicon. Transliterations were judged as errors if either the lexicon or the fallback rules rendered an unacceptable transliteration according to the subject matter expert. This regression testing process was repeated until the number of errors was deemed to be acceptable according to internal software quality standards.

Example 1: Transliteration via Egyptian transliteration scheme
طارق جعفر ابوالعینین  Tareq Jafar AboAlEnein

Example 2: Transliteration via CJK Dictionary Institute lexicon
كاين محرج زيتون  Kayan Muharrij Zeitoun

Example 3: Transliteration via PERL regular expression rules
انا نستور مالاخیاس  Ana Nstur Malakhyas

sensitive rules provided by the Egyptian subject matter expert. For example, the Latin characters ‘Y’ and ‘I’ are transliterated to the Arabic character ى at word boundaries; elsewhere they become ي. The character ‘U’ is transliterated to و if it occurs after ‘O’; elsewhere it becomes ع.

A sample of 500 full Arabic names was randomly drawn from a population of approximately 8000 full Arabic names written in the Latin script, taken from a regional banking company’s customer database. The 500 names were then transliterated to the Arabic script using the QKB. The results were sent to an Egyptian subject matter expert for review. Any transliteration errors were noted in the test results, and the correct transliteration was added to the Egyptian transliteration lexicon. Transliterations were judged as errors if either the CJK Dictionary Institute lexicon or the fallback rules rendered an unacceptable transliteration according to the subject matter expert. This regression testing process was repeated until the number of errors was deemed to be acceptable according to internal software quality standards.

Example 1: Transliteration via Egyptian transliteration scheme
Mohamed Samir AbdElSalam  محمد سمیر عبدالسلام
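
The two context-sensitive Latin-to-Arabic rules quoted above (for ‘Y’/‘I’ and for ‘U’) can be sketched in Python as follows. This is only an illustration under stated assumptions: "word boundaries" is read here as word-final position, the function name `arabic_for` is hypothetical, and every other Latin letter is left to lexicons and rules that are not shown.

```python
from typing import Optional

ALEF_MAQSURA = "\u0649"  # ى
YEH = "\u064a"           # ي
WAW = "\u0648"           # و
AIN = "\u0639"           # ع


def arabic_for(word: str, i: int) -> Optional[str]:
    """Return the Arabic letter for word[i] under the two quoted rules.

    Returns None for letters that these two rules do not cover.
    """
    ch = word[i].upper()
    if ch in ("Y", "I"):
        # Assumption: "word boundary" is read as word-final position.
        return ALEF_MAQSURA if i == len(word) - 1 else YEH
    if ch == "U":
        follows_o = i > 0 and word[i - 1].upper() == "O"
        return WAW if follows_o else AIN
    return None
```

For instance, `arabic_for("Kholoud", 5)` selects و because the ‘U’ follows an ‘O’, and `arabic_for("Samir", 3)` selects ي because the ‘I’ is not word-final. Since these are fallback rules, they would apply only to names that the lexicons do not cover.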
Abdel, Al, El, Abu, ابو, ال, عبد) are removed from the input to reduce the input string to a basic canonical representation before final matching. Names in the Arabic script are then transliterated using a lexicon of Arabic names and their Latin counterparts. A second transliteration lexicon, consisting of names in the Arabic script stripped of their particles, is applied. For example, when عبدالرازق (Latin: AbdelRazek) is stripped of the particle عبدال (Latin: Abdel) in the step above, the name becomes رازق (Latin: Razek). The second scheme then transliterates رازق to Razek. For any names in the Arabic script that are not in either of the two lexicons, Arabic to Latin phonetic transliteration rules are then applied on a letter-for-letter basis. These rules are similar to the Buckwalter transliterations, but are simpler in that there are fewer Arabic-to-Latin character mappings. That is, there are more Arabic characters that map to a single Latin character in the phonetic rules than there are in the Buckwalter transliteration scheme. This allows the system to match more names that are similar in pronunciation. After the phonetic transliteration step, all Arabic input has been rendered in the Latin script, and further phonetic reductions (e.g., geminate consonant reduction, vowel transformations) take place before final matching.

A sample of approximately 8000 full Arabic names was randomly drawn from a population of approximately 17,000 full Arabic names, half written in Arabic, half in Latin, taken from a regional banking company’s customer database. The 8000 names were sent through a cluster analysis test using the matching technology described above. The results were sent to an Egyptian subject matter expert for review. Any false matches or missed matches were noted in the test results, and either the transliteration lexicon or the phonetic transcription rules were updated to yield more accurate match results. This regression testing process was repeated until the number of errors was deemed to be acceptable according to internal software quality standards.

Examples: Clusters of similar names, identified by the matching software system.

Example 1:
فاطمه عباس عبدالرازق
Fatma Abbas Abdel Razek
Fatima Abas Abdel Razik

Example 2:
Ahmed Malawi Abdel-Aaty
احمد معلاوى عبدالعاطى
احمد معلوى عبدالعاطي

3 Results

This section describes the results of the testing procedure of the Arabic name transliteration and matching technology, as implemented in the DataFlux Quality Knowledge Base (QKB).

3.1 Arabic to Latin Transliteration

After twelve iterations of regression testing, the QKB transliterated Arabic names written in the Arabic script to the Latin script with an accuracy of 92%. Testing was halted after twelve iterations because an 8% error rate was deemed acceptable according to internal software quality standards. Once the accuracy reached 92%, returns on further testing iterations diminished. Customers seeking increased transliteration accuracy for their particular data have the ability to add more names to the existing transliteration schemes. Perfect accuracy was neither necessary nor expected, and thus the product was considered ready to go to market. See above for sample transliterations.

3.2 Latin to Arabic Transliteration

After fourteen iterations of regression testing, the QKB transliterated Arabic names written in the Latin script to the Arabic script with an accuracy of 93.9%. Testing was halted after fourteen iterations because a 6.1% error rate was deemed acceptable according to internal software quality standards. Once the accuracy reached 93.9%, returns on further testing iterations diminished. Customers seeking increased transliteration accuracy for their particular data have the ability to add more names to the existing transliteration schemes. Perfect accuracy was neither necessary nor expected, and thus the product was considered ready to go to market. See above for sample transliterations.

3.3 Matching

After six iterations of regression testing, the QKB matched names across the Latin and Arabic scripts with an accuracy of 99.6% with respect to false
matches. That is, 0.4% of the matches generated by the QKB were false positives. The accuracy with respect to missed matches was 99.98%; a mere 0.025% of the data were missed matches, i.e., false negatives. Testing was halted after six iterations because the aforementioned error rates were quite acceptable according to internal software quality standards. See above for sample clusters of similar names.

4 Conclusion

Transliterating and matching Arabic names presents a challenge. Transliterating from Latin to Arabic proves difficult because there are so many Latin variants of a single Arabic name. This variety cannot be readily captured using rules, so a lexicon of Latin to Arabic transliterations must supplement such rules. Transliterating from Arabic to Latin is likewise a challenge for this very same reason. The variety of known Latin transliterations for a single Arabic name means no single transliteration is canonically correct. A list of preferred Latin transliterations for the Arabic-speaking country or region in question determines the correct transliteration. Rule schemes such as the Buckwalter Arabic transliteration scheme cannot capture regional orthographic conventions. Finally, the absence of short vowels in the Arabic script means there can be several possible Latin transliterations of a single Arabic name if rules are used. The absence of short vowels in Arabic also accounts for the insufficiency of using rules to match names across scripts. Without vowel information in the Arabic script, we must remove all vowels from the Latin script, and certain false matches occur. The use of a comprehensive lexicon to map all Latin and Arabic variants to a single Latin representation would help solve this problem.

The hybrid approach to transliterating and matching Arabic names, as implemented in the DataFlux Quality Knowledge Base (QKB), performed well in transliterating names across scripts. It should be noted that this paper is reporting on research in progress, as the QKB is continually undergoing updates. As the transliteration lexicons grow over time, transliteration accuracy will improve. Likewise, any additional contextual rules that may be added to the PERL regular expression rules, and/or the phonetic transliteration rules, will likewise contribute to better transliteration accuracy in both directions. The match results were excellent, most likely due to the significant phonetic reductions, including vowel transformations, which take place after transliteration. On the other hand, we permitted a high tolerance for false positives when evaluating the test results. At the time of development of the QKB’s name matching technology, the CJK Dictionary Institute lexicons were not available. In the future, matching will rely less on rules and will leverage the CJK Dictionary Institute lexicons to produce fewer false positives. Further research will involve testing the QKB on more comprehensive data from various sources, followed by subsequent improvements and updates to handle the varying conventions for data formats across different Arabic-speaking regions.

References

Jack Halpern. 2007. The Challenges and Pitfalls of Arabic Romanization and Arabization. In Proceedings of the Second Workshop on Computational Approaches to Arabic Script-based Languages. Palo Alto, CA.

U. Hermjakob, K. Knight, and H. Daumé III. 2008. Name Translation in Statistical Machine Translation - Learning when to Transliterate. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 389–397, Columbus, Ohio, June.