Characteristics of Malay Translated Hadith Corpus
Article info

Article history:
Received 23 February 2020
Revised 7 July 2020
Accepted 27 July 2020
Available online 31 July 2020

Keywords:
Malay language
Linguistic analysis
Malay translated hadith corpus
Natural language processing
Corpus linguistic

Abstract

Annotated corpora can greatly assist the natural language processing field. For example, computers can understand more of the document context, and indexing and clustering in information retrieval can be done precisely, with little or no word ambiguity. However, there are only a few annotated corpora in the Malay language, and they are not publicly shared. In this paper, we analyse and annotate Malay translated hadith documents in terms of tagging and entities. There are three phases: manual filtering and cleaning, analysing the corpus, and creating the benchmark. As a result, an analysis and a benchmark of the Malay translated hadith corpus were produced in terms of part-of-speech and named entity tags, which follow a Zipf’s law distribution.

https://doi.org/10.1016/j.jksuci.2020.07.011
1319-1578/© 2020 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
2152 S.S. Sazali et al. / Journal of King Saud University – Computer and Information Sciences 34 (2022) 2151–2160
Table 1. Existing Research using Malay Corpus.
Table 2. Existing Malay Corpora.
3. Manual filtering and cleaning

The collection first needs to undergo a normalization process to eliminate noise and errors. For this collection, the process includes address code addition, tokenization, word normalization, symbol and number removal and, lastly, human-made error correction. Fig. 4 illustrates the flow of the manual filtering and cleaning for the collection.

The process starts with the documents retrieved from the Mutiara Hadis UiTM, named Raw Malay Translated Hadith Documents (RMTHD). The collection first undergoes the address code addition process, followed by tokenization. After tokenization, the collection undergoes the remaining steps listed above: word normalization, symbol and number removal and, lastly, human-made error correction.
Table 5. Special Characters and Numbers Removal.
Table 6. Human-made Errors.
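The cleaning steps named in Section 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `clean_document`, the `CORRECTIONS` table and its entries, the use of lowercasing as word normalization, and the treatment of the address code as a plain document identifier are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical correction table for human-made errors; these entries are
# illustrative and are not taken from the paper.
CORRECTIONS: Dict[str, str] = {"yg": "yang", "dgn": "dengan"}


def clean_document(address_code: str, text: str) -> Tuple[str, List[str]]:
    """Tokenize, normalize and clean one raw document (assumed steps)."""
    cleaned: List[str] = []
    # Lowercase the text, then tokenize on whitespace.
    for token in text.lower().split():
        # Drop digits and symbols, keeping letters and hyphens so that
        # reduplicated words such as 'kanak-kanak' survive.
        token = re.sub(r"[^a-z-]", "", token).strip("-")
        if token:
            # Replace known misspellings; other tokens pass through unchanged.
            cleaned.append(CORRECTIONS.get(token, token))
    # The address code is kept beside the tokens as a document identifier.
    return address_code, cleaned
```

A call such as `clean_document("H0001", "Nabi SAW bersabda: itu 123 yg baik.")` would return the identifier together with the cleaned token list. A real pipeline would need a curated correction list and a decision on how to treat transliterated Arabic terms, which this sketch does not attempt.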
Table 7. Word Counts of Malay Translated Hadith Document Corpus.

                        Documents   Total Word Count   Unique Words
Before Pre-processing   2028        146,654            18,622
After Pre-processing    2028        146,154            10,487

Table 8. Part of Speech Tags Used.

Part-of-speech   Tags in Tatabahasa Dewan   Corresponding Penn Treebank Tag   Additional Tags
Adjective        Adjektif                   Adjective (JJ)                    -
Verb             Kata kerja                 Verb (VB)                         -
Noun             Kata nama                  Noun (NN)                         -
                 Kata ganti nama            Pronoun (PRP)                     -
                 Kata nama tunjuk           -                                 TJK
                 Kata nama khas             Proper noun (NNP)                 -
Function Words   Adverba                    Adverb (RB)                       -
                 Kata arah                  -                                 ARH
                 Kata bantu                 Auxiliary (AUX)                   -
                 Kata bilangan              Cardinal number (CD)              -
                 Kata hubung                Coordinating conjunction (CC)     -
                 Kata nafi                  -                                 NEG
                 Kata pangkal ayat          -                                 PKL
                 Kata pembenar              -                                 PBR
                 Kata pembenda              Possessive Ending (POS)           -
                 Kata pemeri                -                                 PMR
                 Kata penegas               Particle (RP)                     -
                 Kata penekan               -                                 PNK
                 Kata penguat               -                                 PGT
                 Kata perintah              -                                 PRH
                 Kata sendi nama            Preposition (IN)                  -
                 Kata seru                  Interjection (UH)                 -
                 Kata tanya                 Wh-pronoun (WP)                   -
Other            Foreign word               Foreign Word (FW)                 -

or thing. For example, ‘itu’, ‘ini’, ‘sana’ and ‘sini’. ‘Kata arah’ is used to indicate direction, such as ‘utara’ (north), ‘sisi’ (side) and ‘luar’ (outside). ‘Kata nafi’ is a collection of words used for negation, for example ‘tak’ (no) and ‘bukan’ (not). ‘Kata pangkal ayat’ are words that appear at the beginning of a sentence, seldomly used to indicate continuity from the previous sentence. They are mostly used in the classical Malay language, for example ‘kalakian’, ‘hatta’ and ‘alkisah’. ‘Kata pembenar’ is a word that expresses agreement or validation in a sentence, such as ‘ya’ and ‘benar’. ‘Kata pemeri’ is a word that links a subject to its predicate adjective or predicate noun. In the Malay language, there are two ‘kata pemeri’, which are ‘ialah’ and ‘adalah’. ‘Kata penekan’ is used to highlight the importance of the word that it is combined with. A word becomes a ‘kata penekan’ when it combines with the suffix ‘-nya’, for example ‘sesungguhnya’ and ‘nampaknya’. ‘Kata penguat’ is used to strengthen the adjective used in the sentence, for example the word ‘paling’ (most). ‘Kata perintah’ is used with the intention of commanding or giving instruction, such as ‘jangan’ (don’t) and ‘tolong’ (please).

A. Word Forms

In the Malay language, words are divided into four main forms: single word form, affixed word form, compound word form and reduplication form (Nik Safiah et al., 2015). Word forms play an important role in NLP tasks such as stemming and part-of-speech tagging. In this collection, only three of the four forms are analysed. Compound word forms are excluded from the analysis because they require more language expertise and because of time constraints. The single word form is a form in which a word can stand on its own, without any affixes and without being combined with any other word. For example, a sentence can be formed using only the word ‘saya’ (me), and it is considered a valid sentence. The next form is the affixed form, in which the base word undergoes the affixing process. For example, the base word ‘makan’ (eat) can be changed into the affixed form by adding the suffix (-an), producing the word ‘makanan’ (food). The next form is the reduplication form. As in the English language, a word in the Malay language can undergo full reduplication or half reduplication, with or without affixes. Words such as ‘kanak-kanak’ (children) and ‘lelaki’ (man) are examples of reduplication in the Malay language.

Fig. 6 shows the distribution of unique words in the MTHD Corpus, whereas Fig. 7 illustrates the distribution of total words in the collection. For unique words, the affixed word form makes up the highest percentage (62%), with a total of 5230 words. It is followed by the single word form, which represents 33% of the total unique word count, with a total of 2776 words. The reduplication form consists of only 453 words, which represent only 5% of the total unique word count.

However, in the total word distribution of MTHD, the difference between single word form and affixed word form is tiny (3%

Fig. 6. Unique Words Distribution of MTHD Corpus.
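The unique-word percentages quoted for the three analysed word forms can be reproduced from the reported counts. Note that 5230 + 2776 + 453 = 8459, which is fewer than the 10,487 unique words in Table 7, so the shares appear to be taken over the three analysed forms only; that reading is an assumption, not something the text states. The helper names below (`form_shares`, `is_full_reduplication`) are hypothetical, and the reduplication check is a deliberately naive heuristic that catches only hyphenated full reduplication.

```python
from typing import Dict

# Unique-word counts per word form, as reported in the text.
UNIQUE_COUNTS: Dict[str, int] = {
    "affixed": 5230,
    "single": 2776,
    "reduplication": 453,
}


def form_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each form's share of the summed counts, in percent."""
    total = sum(counts.values())
    return {form: round(100 * n / total, 1) for form, n in counts.items()}


def is_full_reduplication(word: str) -> bool:
    """Naive check: two identical, non-empty halves joined by a hyphen."""
    parts = word.lower().split("-")
    return len(parts) == 2 and parts[0] != "" and parts[0] == parts[1]


shares = form_shares(UNIQUE_COUNTS)
```

Rounded to whole numbers, the computed shares agree with the 62%, 33% and 5% given in the text. The heuristic accepts ‘kanak-kanak’ but not ‘lelaki’, because half reduplication and affixed reduplication need morphological analysis that a string comparison cannot provide.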
Table 10. Part-of-speech Distribution of MTHD.
Table 11. Named Entities in Malay Translated Hadith Document Corpus.

group, which has 23 occurrences for organization, such as ‘Quraizhah’ and ‘Syanuah’, and 180 occurrences for location, such as ‘Baitullah’ and ‘Damsyiq’. The person group, however, can be further divided into five named entity groups: person names with 563 occurrences (such as ‘Ansari’ and ‘Dhiman’), family names with 18 occurrences (such as ‘Ghiffar’ and ‘Luai’), tribe names with 16 occurrences (such as ‘Khatsaam’ and ‘Rabiaah’), nicknames with four occurrences (such as ‘Muhajjalin’ and ‘Shabi’) and, lastly, person’s race with only one occurrence, the word ‘Khawarij’.

In addition, there are 25 more named entity types in this collection. There are 23 surah names such as ‘Al-fatihah’ and ‘An-nisa’, 13 names of months such as ‘Dzulhijjah’ and ‘Muharram’, 13 names of days such as ‘Tasyriq’ and ‘Aqabah’, 10 names of wars such as ‘Badr’ and ‘Khandaq’, nine names of prayers such as ‘Isyak’ and ‘Maghrib’, five religion names such as ‘Yahudi’ and ‘Nasrani’, and four occurrences each for event (such as ‘Isra’ and ‘Jamratul’), entity (such as ‘Dajjal’ and ‘Yakjuj’) and animal names (such as ‘Adba’ and ‘Jazah’). Next, with three occurrences each, there are five named entity groups: tree (such as ‘Dauhah’ and ‘Gabah’), trade (such as ‘Habalah’ and ‘Muzabanah’), paradise (such as ‘Firdaus’ and ‘Aden’), idolatry (such as ‘Hubal’ and ‘Yamaniah’) and book names (such as ‘Al-Quran’ and ‘Injil’). There are four groups of entities with two occurrences each: object such as ‘Hajar’ and ‘Manaat’, hajj such as ‘Wadak’ and ‘Wida’, fruit such as ‘Ajwah’ and ‘Barni’, and doa (invocation) names such as ‘Amin’ and ‘Salaam’. Lastly, there is one occurrence each for year ‘Hijriyah’, water ‘Majannah’, nationality ‘Arabiy’, language ‘Ibrani’, door ‘Ar’, building ‘Kabah’ and angle ‘Yamani’.

From the table, it is observed that more than half of the entities belong to the person group. This is caused by the nature of the hadith itself, which contains two parts: ‘sanad’ (chains of narrators) and ‘matn’ (the content). There are also proper names that do not belong to any group; these are placed in the entity group. For example, ‘Dajjal’ is portrayed as an evil figure in Islamic history, and ‘Yakjuj’ and ‘Makjuj’ (Gog and Magog in the Hebrew Bible) are referred to as a group that inherits power as the destroyer of life on earth. To date, it is unclear whether the term refers to a person or not. At the bottom of Table 11, there are several named entity groups that have only one occurrence. They are kept that way to accommodate future research that analyses other types of hadiths, and to allow for the possibility of detecting signal words for the entity.

5. Conclusion

Corpus analysis can reveal the attributes of a corpus. In this experiment, nouns represent the largest share of the corpus with more than 33%, which is more than twice the share of the second largest group, verbs, which represent 16% of the total corpus. For named entities, more than half of the proper names in the MTHD corpus belong to the person group, because a hadith consists of the narrator chain and the content. In a hadith document, the narrator chain is important for determining the authenticity of the hadith. In this experiment, the most important phase is the manual filtering and cleaning. Without this phase, the analysis could differ greatly, as the number of unique words was reduced by 8135 after the filtering and cleaning. Affixed forms represent the highest percentage of unique words, but the single form represents the highest percentage of total words. The ten most frequent words in MTHD, ‘yang’, ‘dan’, ‘itu’, ‘saw’, ‘dari’, ‘beliau’, ‘nabi’, ‘di’, ‘ra’ and ‘saya’, are all single form, which tallies with the total word percentages.

In conclusion, corpus analysis is a crucial step before conducting research. With corpus analysis, many NLP-related tasks can be carried out, such as part-of-speech tagging, information extraction, summarization, question answering and many others. The analysed corpus provides a basis for more hadith-related natural language processing research and for improving existing work in fields such as information extraction and information retrieval. For future work, automated taggers and entity recognizers can be evaluated against the benchmark, and topic distributions for the hadiths can be analysed, as hadiths can be grouped into sections or similar groups based on human relevance. With the analysis, the information retrieval field will benefit greatly, since very common and uncommon words can be eliminated with Zipf’s law, and clustering and indexing techniques can benefit from the annotated corpus.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was fully supported and funded by the Malaysian Government under the Fundamental Research Grant Scheme (FRGS/1/2015/ICT01/UITM/03/1) and Universiti Teknologi MARA BESTARI Grant 600-IRMI/DANA 5/3/BESTARI (112/2018). The authors would also like to thank Mrs. Ros Silawati Ahmad@Abdullah, Mr. Sazali Saidin and Mrs. Rokiah Karim from Institut Perguruan Kampus Darul Aman and Kampus Perlis for sharing their language expertise so that we could analyse the corpus. We would also like to thank the anonymous reviewers for their insights.

References

Abdul Rahman, N., Abu Bakar, Z., & Tengku Sembok, T. M. (2010). Query expansion using Thesaurus in Improving Malay Hadith Retrieval system. In 2010 International Symposium on Information Technology (pp. 1404–1409). Kuala Lumpur, Malaysia: IEEE. https://doi.org/10.1109/ITSIM.2010.5561518
Abdul Rahman, N., Kamal Ismail, N., Abu Bakar, Z., & Tengku Sembok, T. M. (2006). MUTIARA HADIS: Malay Hadith Retrieval System. In Proceedings of Seminar IT. Kuala Terengganu, Malaysia. Retrieved from http://sigir.uitm.edu.my/webhadis/
Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay named entity recognition based on rule-based approach. Int. J. Mach. Learn. Comput., 4(3), 300–306. https://doi.org/10.7763/IJMLC.2014.V4.428
Alfred, R., Leong, L. C., On, C. K., Anthony, P., Fun, T. S., Razali, M. N., & Ahmad Hijazi, M. H. (2013). A Rule-Based Named-Entity Recognition for Malay Articles. In H. Motoda, Z. Wu, L. Cao, O. Zaiane, M. Yao, & W. Wang (Eds.), Advanced Data Mining and Applications (Vol. 8346, pp. 288–299). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-53914-5_25
Alfred, R., Mujat, A., & Obit, J. H. (2013). A Ruled-Based Part of Speech (RPOS) tagger for Malay text articles. In A. Selamat, N. T. Nguyen, & H. Haron (Eds.), Intelligent Information and Database Systems (Vol. 7803, pp. 50–59). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36543-0_6
Alias, N., Rahman, N. A., Ismail, N. K., Nor, Z. M., & Alias, M. N. (2016). Searching Algorithm of Authentic Chain of Narrators’ in Shahih Bukhari Book. In 2016 International Conference on Applied Computing, Mathematical Sciences and Engineering (ACME2016) (pp. 60–66). Johor, Malaysia. https://doi.org/10.1109/INFRKM.2016.7806336
Alias, S., Mohammad, S. K., Hoon, G. K., & Ping, T. T. (2016). A Malay Text Summarizer using Pattern-Growth method with Sentence Compression Rules. In 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP) (pp. 7–12). Bandar Hilir, Malaysia.
Azizan, A., Abu Bakar, Z., & Abdul Rahman, N. (2019). Construction of durian dataset from web collection for query reformulation research. Int. J. Recent Technol. Eng., 8(2811), 630–634. https://doi.org/10.35940/ijrte.B1098.0982S1119
Ha, L. Q., Sicilia-Garcia, E. I., Ming, J., & Smith, F. J. (2002). Extension of Zipf’s law to Words and Phrases. In COLING ’02 Proceedings of the 19th international conference on Computational linguistics (pp. 1–6). Taipei, Taiwan. https://doi.org/10.3115/1072228.1072345
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and
Speech Recognition (2nd., Vol. 21). Upper Saddle River, NJ, USA: Prentice-Hall, Inc. https://doi.org/10.1162/089120100750105975
Liddy, E. D. (2001). Natural Language Processing. In Encyclopedia of Library and Information Science (2nd.). New York: Marcel Decker, Inc. https://doi.org/10.1017/S0267190500001446
Mahmood, A., Khan, H. U., Alarfaj, F., Ramzan, M., & Ilyas, M. (2018). A Multilingual Datasets Repository of the Hadith Content. International Journal of Advanced Computer Science and Applications, 9(2), 165–172. https://doi.org/10.14569/IJACSA.2018.090224
Manning, C. D. (2011). Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In A. F. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing (pp. 171–189). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_14
Manning, C. D., & Schütze, H. (2000). Foundations of Natural Language Processing. In Proceedings of the 20th International Conference on Computer Processing of Oriental Languages.
Mohamed Hanum, H., Abu Bakar, Z., Abdul Rahman, N., Mohd Rosli, M., & Musa, N. (2014). Using topic analysis for querying halal information on malay documents. Proc. – Soc. Behavioral Sci., 121, 214–222. https://doi.org/10.1016/j.sbspro.2014.01.1122
Mohd Noor, N., Sulaiman, J., & Noah, S. A. (2016). Malay name entity recognition using limited resources. Adv. Sci. Lett., 22(10), 2968–2971. https://doi.org/10.1166/asl.2016.7124
Nik Safiah, K., Farid, M. O., Hashim, H. M., & Abdul Hamid, M. (2015). Tatabahasa Dewan (3rd.). Kuala Lumpur.
Noor Ariffin, S. N. A., & Tiun, S. (2018). Part-of-speech tagger for malay social media texts. GEMA Online® J. Language Stud., 18(4), 124–141.
Sazali, S. S., Abdul Rahman, N., & Abu Bakar, Z. (2016). Information Extraction: Evaluating Named Entity Recognition From Classical Malay Documents. In 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP) (pp. 48–53). Bandar Hilir, Malaysia: IEEE. https://doi.org/10.1109/INFRKM.2016.7806333
Sulaiman, S., Abdul Wahid, R., Sarkawi, S., & Omar, N. (2017). Using Stanford NER and Illinois NER to Detect Malay Named Entity Recognition. International Journal of Computer Theory and Engineering, 9(2), 147–150. https://doi.org/10.7763/IJCTE.2017.V9.1128
Talvensaari, T. (2008). Comparable Corpora in Cross-Language Information Retrieval. University of Tampere.
Toutanova, K., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In EMNLP ’00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (Vol. 13, pp. 63–70). Hong Kong, China. https://doi.org/10.3115/1117794.1117802
Tuah Mohamad Rahim, N. N. A., Mabni, Z., Mohamed Hanum, H., & Abdul Rahman, N. (2016). A Malay Hadith translated document retrieval using parallel Latent Semantic Indexing (LSI). In 2016 3rd International Conference on Information Retrieval and Knowledge Management, CAMP 2016 - Conference Proceedings (pp. 118–123). Bandar Hilir, Malaysia. https://doi.org/10.1109/INFRKM.2016.7806346
Ulanganathan, T., Ebrahimi, A., Xian, B. C. M., Bouzekri, K., Mahmud, R., & Hoe, O. H. (2017). Benchmarking Mi-NER: Malay Entity Recognition Engine. The Ninth International Conference on Information, Process and Knowledge Management (EKNOW 2017), 52–58.
Xian, B. C. M., Lubani, M., Ping, L. K., Bouzekri, K., Mahmud, R., & Lukose, D. (2016). Benchmarking Mi-POS: Malay Part-of-Speech Tagger. Int. J. Knowledge Eng., 2(3), 115–121. https://doi.org/10.18178/ijke.2016.2.3.064
Zamin, N., & Abu Bakar, Z. (2015). Name Entity Recognition for Malay Texts Using Cross-Lingual Annotation Projection Approach. In O. Gervasi, B. Murgante, S. Misra, M. L. Gavrilova, A. M. A. C. Rocha, C. Torre, . . . B. O. Apduhan (Eds.), Computational Science and Its Applications – ICCSA 2015 (pp. 242–256). Cham: Springer International Publishing.
Zamin, N., Oxley, A., Abu Bakar, Z., & Farhan, S. A. (2012a). A Lazy Man’s Way to Part-of-Speech Tagging. In D. Richards & H. B. Kang (Eds.), Knowledge Management and Acquisition for Intelligent Systems (pp. 106–117). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-32541-0
Zamin, N., Oxley, A., Abu Bakar, Z., & Farhan, S. A. (2012b). Characteristics of a Malay Journalistic Corpus. In 2012 IEEE Conference on Control, Systems and Industrial Informatics (ICCSII) (pp. 214–218). Bandung, Indonesia. https://doi.org/10.1109/CCSII.2012.6470503
Zulkefli, N. S. S., Abdul Rahman, N., & Abu Bakar, Z. (2016). Analyzing Search Retrieval Results on Malay Translated Hadith Text Documents. In 2016 International Conference on Applied Computing, Mathematical Sciences and Engineering (ACME). Johor, Malaysia.
Zulkefli, N. S. S., Rahman, N. A., Bakar, Z. A., & Alam, S. (2015). Representation of Search Retrieval Results on Digital Hadith Online Browser. Universiti Kebangsaan Malaysia International Colloquium of Graduates Islamic Studies 2015, (December).