Characteristics of Malay Translated Hadith Corpus

Journal of King Saud University – Computer and Information Sciences 34 (2022) 2151–2160


Siti Syakirah Sazali a,*, Nurazzah Abdul Rahman a, Zainab Abu Bakar b
a Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
b Faculty of Computer and Information Technology, Al-Madinah International University, Kuala Lumpur, Malaysia

Article info

Article history:
Received 23 February 2020
Revised 7 July 2020
Accepted 27 July 2020
Available online 31 July 2020

Keywords:
Malay language
Linguistic analysis
Malay translated hadith corpus
Natural language processing
Corpus linguistic

Abstract

Annotated corpora can greatly assist the natural language processing field. For example, computers can understand more of the document context, and indexing and clustering in information retrieval can be done precisely with less or no ambiguity of words. However, there are only a few annotated corpora in Malay language, and they are not publicly shared. In this paper, we delve into analysing and annotating Malay translated hadith documents in terms of tagging and entities. There are three phases: manual filtering and cleaning, analysing the corpus, and creating the benchmark. As a result, an analysis and benchmark of the Malay translated hadith corpus were produced in terms of part-of-speech and named entity tags, and the corpus follows the Zipf’s law distribution.

© 2020 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license.

Corresponding author: S.S. Sazali.

1. Introduction

Natural Language Processing (NLP) is a branch of Artificial Intelligence which aims to give the computer the ability to process human language (Jurafsky, 2009; Liddy, 2001; Zamin et al., 2012b). Liddy (2001) states that NLP aims to make the computer process language like a human does across a range of tasks and applications. NLP applications include Information Retrieval (IR), Information Extraction (IE), Question and Answering (Q&A), Summarization, Machine Translation (MT) and Dialogue Systems (DS).

To date, there exist many novel applications which were developed using state-of-the-art techniques. There are also many existing off-the-shelf resources available for use, such as named entity recognizers, part-of-speech taggers, corpora, stemmers, and parsers. Astoundingly, some areas of NLP research are being considered as solved problems, where the performance of the application surpasses that of human processing (Manning, 2011). For example, Toutanova and Manning (2000) developed an unsupervised part-of-speech tagger that achieved 98.86% accuracy by using the Wall Street Journal (WSJ) section of the Penn Treebank for English language. However, most of these off-the-shelf applications and research efforts focus on resource-rich languages, such as English, French, and German. By contrast, there is less research focusing on resource-poor languages such as Hebrew, Urdu, Bengali, and Malay.

This paper delves into relevant NLP research involving Malay, one of the resource-poor languages. It is a language under the Austronesian language tree and the most spoken language in South East Asia, with over 300 million native speakers (Alfred et al., 2014; Zamin et al., 2012b). Existing NLP research that uses Malay language as the data is summarized in Table 1.

Malay language related research has increased significantly since the early 2000s. Table 1 lists only the recent research in the field, to show that research related to Malay language is still in the spotlight. There has also been a spike in hadith-related research in the Information Retrieval (IR) field, where techniques such as clustering and theme-related techniques are being tested to improve the retrieval results. Only in 2013 did the first research in Information Extraction (IE) appear. It started with rule-based named entity recognition (NER), followed by a few statistical methods for Malay language. There is also research in text summarization and machine translation for Malay language. For part-of-speech tagging, Malay language started with a stochastic part-of-speech tagger in 2012, followed by rule-based part-of-speech taggers.

Annotated corpus can help a lot in natural language processing field. For example, computers can understand more of the

document context using part-of-speech tagging and named entity recognition in the information extraction field, and indexing and clustering in information retrieval can be done precisely with less or no ambiguity of words. However, there are only a few annotated corpora in Malay language, and they are not publicly shared. As of today, there are only two published research papers that produce a language analysis for a corpus collection, which are based on a journalistic corpus and a durian dataset. There is a total of three phases in this work: manual filtering and cleaning, analysing the corpus, and creating the benchmark.

In this paper, Section 2 describes the corpus in detail, including the first observation, followed by the first phase, manual filtering and cleaning, in Section 3. Then, the second phase, the analysis of the corpus, is presented in Section 4 in the form of word forms and Zipf’s law, followed by the third phase, the benchmark of part-of-speech tags and named entities, and Section 5 concludes the paper.

Table 1
Existing Research using Malay Corpus.

Area | Research
Information Retrieval | Malay Hadith Retrieval Systems (Mohamed Hanum et al., 2014; Rahim et al., 2016; Zulkefli et al., 2016, 2015)
Information Extraction | Named Entity Recognizers (Alfred et al., 2014; Mohd Noor, Sulaiman, & Noah, 2016; Sazali, Abdul Rahman, & Abu Bakar, 2016; Sulaiman, Abdul Wahid, Sarkawi, & Omar, 2017; Ulanganathan et al., 2017)
Summarization | Malay Text Summarizer (Alias et al., 2016b)
Machine Translation | Lazy-man Part-of-speech Tagger (Zamin et al., 2012a); Named Entity Recognition using Cross-lingual Annotation (Zamin & Abu Bakar, 2015)
Language Analysis, Corpus Construction | Malay Journalistic Corpus (Zamin et al., 2012b); Durian Dataset (Azizan et al., 2019)
Part-of-speech Tagger | Rule-based Part-of-speech (Alfred et al., 2013b; Noor Ariffin & Tiun, 2018); Stochastic Part-of-speech (Xian et al., 2016; Zamin et al., 2012a)

2. The corpus

A corpus plays a vital role in NLP text-based research. Commonly, corpora (plural for corpus) are unstructured. In terms of availability, only a few are made available publicly, while the others are mostly developed for academic use and are not publicly available. For example, the Institute of Language and Literature (Dewan Bahasa dan Pustaka) has recently made its corpus available for researchers on its website (http://). They consist of newspaper articles, books, magazines, literature texts and paper works. Table 2 illustrates some of the existing corpora that can be used for natural language processing tasks, based on their availability.

Table 2
Existing Malay Corpora.

Researchers | Domain | Publicly available? | Annotated?
Institute of Language and Literature | Multiple domains | Yes | No
Mutiara Hadith UiTM (Abdul Rahman et al., 2006) | Malay Translated Quran and Hadith documents | Yes | No
(Zamin et al., 2012b) | Terrorism-related journalistic articles | No | Yes
(Alfred et al., 2013a) | News articles, Biomedical articles | No | Yes
(Noor Ariffin & Tiun, 2018) | Malay Twitter Excerpt | No | Yes
(Azizan et al., 2019) | Durian-related documents | No | No

There exist some corpora that are publicly available, such as those from the Institute of Language and Literature, which cover multiple domains such as newspaper excerpts, magazines, novels and many others. However, these corpora are not annotated. Similarly, Mutiara Hadis UiTM also provides translated Quran and Hadith documents publicly, but they are also not annotated. There are a few existing corpora that are annotated but are not publicly available, for example the terrorism-related corpus, the news and biomedical articles, and the Twitter excerpts. These corpora are built for experiments in natural language processing.

In the religion of Islam, there are two main sources of reference: the Quran and hadiths. The Quran is a book collecting the word of Allah revealed to the Prophet Muhammad PBUH. Hadith refers to the record of actions, words, and silent approvals of Prophet Muhammad PBUH. There are two main components in a hadith: the ‘isnad’ and the ‘matn’. ‘Isnad’ refers to the chain of narrators who have transmitted the hadith, and ‘matn’ refers to the context or the body of the hadith. Muslim clerics classify hadiths into three categories, which are ‘sahih’ (authentic), ‘hasan’ (good) or ‘da’if’ (weak). Hadiths are written in Arabic but are then translated into many other languages across the world such as English, Indonesian, and Malay.

In this paper, Malay translated hadith documents are selected as the collection. The Malay Translated Hadith Document (MTHD) corpus was gathered from Malay translated hadith documents from Mutiara Hadis UiTM. Hadiths were chosen because there are recent spikes in research that uses hadiths as the data in the Malay natural language processing field (Abdul Rahman et al., 2010; Alias et al., 2016a; Mahmood et al., 2018; Mohamed Hanum et al., 2014; Zulkefli et al., 2016, 2015).

Mutiara Hadis UiTM is one of the websites that offers Malay translated hadiths for public reference. Fig. 1 illustrates the homepage interface for Mutiara Hadis UiTM. The hadiths available are Sahih Bukhari (2028 hadiths), Sahih Muslim (2521 hadiths), Sunan At-Termizi (4179 hadiths), Sunan An-Nasai (5700 hadiths), Sunan Ibnu Majjah (4340 hadiths), and Sunan Abu Daud (1002 hadiths). For this research experiment, the book Sahih Bukhari is chosen to be the test collection because of its authenticity and popularity among Malay speakers. The original collection consists of 6961 sentences, 146,654 total words and 18,622 unique words. An example of a hadith document is shown in Fig. 2, followed by an English translation of the hadith.

English translation for Fig. 2: From Umar bin al-Khattab (peace and blessings of Allah be upon him) said, I heard the Messenger of Allah (peace be upon him), say: “Actions are according to intentions, and everyone will get what was intended. Whoever migrates with an intention for Allah and His messenger, the migration will be for the sake of Allah and his Messenger. And whoever migrates for worldly gain or to marry a woman, then his migration will be for the sake of whatever he migrated for.”
(Sahih Bukhari Volume Number 1, Hadith Number 1)

Fig. 3 illustrates the Malay Translated Hadith Document (MTHD) raw file. After the initial observation and tokenization, errors were found in the collection. Errors such as typos badly need correction, as they affect the overall performance of some NLP techniques, such as part-of-speech tagging. Hence, the first step was to clean the collection. We call this step manual filtering and cleaning.
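The total and unique word counts quoted for the raw collection come from tokenizing the documents. The following is a minimal sketch of such a count, not the authors' program: it assumes whitespace tokenization and case-insensitive comparison (the paper does not state its exact counting rules), and the function name `count_words` and the second toy document are illustrative only.

```python
from collections import Counter


def count_words(documents):
    """Return (total_words, unique_words, frequency_table) for a list of texts."""
    freq = Counter()
    for text in documents:
        # Assumption: tokens are whitespace-separated and lowercased before counting.
        freq.update(token.lower() for token in text.split())
    return sum(freq.values()), len(freq), freq


docs = [
    # Excerpt of the first hadith as quoted in the paper.
    "Dari Umar bin Khathab r.a., katanya dia mendengar Rasulullah saw. bersabda",
    # Toy second document, not taken from the corpus.
    "Rasulullah saw. bersabda",
]

total, unique, freq = count_words(docs)
print(total, unique)
# Sorting the unique tokens exposes noisy variants such as 'r.a.,' and 'saw.'.
print(sorted(freq))
```

On this toy input the count is 14 total and 11 unique tokens. The sorted list still contains tokens with attached punctuation, which is the kind of noise the cleaning phase targets.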

Fig. 1. Mutiara Hadis UiTM Homepage.

Fig. 2. Example of Hadith Excerpt.

Fig. 3. Malay Translated Hadith Document Raw File.

3. Manual filtering and cleaning

The collection first needs to undergo a normalization process to eliminate noise and errors. For this collection, the process includes address code addition, tokenization, word normalization, symbol and number removal and, lastly, human-made error correction.

Fig. 4 below illustrates the flow of the manual filtering and cleaning for the collection. The process starts with the documents retrieved from Mutiara Hadis UiTM, named Raw Malay Translated Hadith Documents (RMTHD). The collection first undergoes the address code addition process, followed by tokenization. After tokenization, the collection undergoes

word normalization, special characters and numbers removal, and human-made error correction. These processes are looped until the data in the collection are cleaned and saved in a collection named Malay Translated Hadith Document (MTHD). All the processes in manual filtering and cleaning are semi-supervised, which means the processes are done by using a program, or regular expression, and then supervised and validated by humans.

Fig. 4. Process Flow for Manual Filtering and Cleaning.

A. Address Code Addition and Tokenization

When the Raw Malay Translated Hadith Documents (RMTHD) were first compiled, there were different files for each of the hadiths, indicated by a number for each hadith collection, from 1 to 2028, grouped by the volume number. For illustration, taken from the hadith example above, the hadith indicated by number ‘1’:

“1, Dari Umar bin Khathab r.a., katanya dia mendengar Rasulullah saw. bersabda: . . .”

To differentiate between the hadith number and the actual numbers existing in the collection, an address code was introduced. It includes the hadith number and the hadith volume. The address code was made by adding ‘H’ in front, followed by a 5-digit number which represents the volume number (0–4) and the hadith number (1–2028).

Table 3 illustrates some examples of the address code. For example, hadith number 1399 is written as H31399. Adding the address code is a crucial step for tracking purposes that can be used in information extraction and information retrieval.

Table 3
Address Code Used.

Volume | Hadith Number | Example
1 | 1–525 | H10001, H10342
2 | 526–1125 | H20678, H21105
3 | 1126–1581 | H31146, H31487
4 | 1582–2028 | H41640, H42000

After adding the address code, the collection is then tokenized by detecting the word boundaries, and the unique words are sorted to check for errors. Tokenization is the task of separating out words from the running text. Like English, Malay words are often separated from each other by blanks (whitespace, commonly known as space). Sorting by unique word revealed that the collection needs word normalization, symbol and number removal, and typo correction. These processes are looped until there are no more errors and noise to be cleaned.

B. Word Normalization

Word normalization in this experiment refers to the collection of abbreviations that refer to the same word but are written in slightly different forms. This process is done to protect the abbreviations from symbol removal and to reduce the ambiguity of the abbreviations. Examples of these words are shown in Table 4 with their corresponding occurrences throughout the collection.

There are twelve written forms that share four meanings and can be grouped together. The first abbreviation is ‘r.a.’, which in Arabic stands for RadhiAllahu’anhu. There are four extracted patterns, which are ‘r.a.,’ with 1234 occurrences, ‘r.a’ with 15 occurrences, ‘r.a.’ with 593 occurrences, and ‘r a/r(space)a’ with 114 occurrences. All these abbreviations are then normalized to the word ‘ra’.

The next abbreviation is ‘s.a.w.’, which in Arabic stands for SallallahuAlayhiWaSalaam. A total of six patterns were observed: ‘saw.’ with 1804 occurrences, ‘s.a.w.’ with 593 occurrences, ‘saw.,’ with 293 occurrences, ‘s.a.w.,’ with 48 occurrences, ‘s a w/s(space)a(space)w’ with 59 occurrences and ‘s aw/s(space)aw’ with one occurrence. The normalized word representing the abbreviation ‘s.a.w.’ is the word ‘saw’.

The next abbreviation is ‘s.w.t.’, which in Arabic refers to SubhanahuWaTa’ala. There is only one extra pattern for this abbreviation, which is ‘swt.’ with 38 occurrences. The word ‘swt’ is used to represent this abbreviation. The last abbreviation is ‘a.s.’, which in Arabic refers to AlayhisSalaam and is written as ‘a s/a(space)s’ with 13 occurrences. For this abbreviation, the word ‘as’ is used to represent it.

C. Special Characters and Numbers Removal

After normalization, the next step was to remove special characters and numbers. The removal of special characters and numbers is done to reduce the non-alphabet characters.

Table 4
Normalized Word with Replacement.

Abbreviation | Word | Replaced with | Occurrences
r.a. | r.a., | ra | 1234
r.a. | r.a | ra | 15
r.a. | r.a. | ra | 593
r.a. | r(space)a | ra | 114
s.a.w. | saw. | saw | 1804
s.a.w. | s.a.w. | saw | 593
s.a.w. | s.a.w., | saw | 48
s.a.w. | saw., | saw | 293
s.a.w. | s(space)a(space)w | saw | 59
s.a.w. | s(space)aw | saw | 1
s.w.t | swt. | swt | 38
a.s. | a(space)s | as | 13
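The address code of Table 3 and the replacements of Table 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the volume is looked up from the ranges in Table 3, the hadith number is assumed to be zero-padded to four digits (as the examples H10001 and H10342 suggest), and the names `address_code`, `VARIANTS` and `normalize_abbreviations` are hypothetical. The regular expressions are built directly from the variants listed in Table 4.

```python
import re

# Volume ranges copied from Table 3: volume -> (first, last) hadith number.
VOLUME_RANGES = {1: (1, 525), 2: (526, 1125), 3: (1126, 1581), 4: (1582, 2028)}


def address_code(hadith_number):
    """Return 'H' + volume digit + hadith number zero-padded to four digits."""
    for volume, (first, last) in VOLUME_RANGES.items():
        if first <= hadith_number <= last:
            return f"H{volume}{hadith_number:04d}"
    raise ValueError(f"hadith number out of range: {hadith_number}")


# Variants from Table 4, longest first so that 'r.a.,' is tried before 'r.a'.
VARIANTS = {
    "ra": ["r.a.,", "r.a.", "r.a", "r a"],
    "saw": ["s.a.w.,", "s.a.w.", "saw.,", "saw.", "s a w", "s aw"],
    "swt": ["swt."],
    "as": ["a s"],
}


def _compile(variants):
    # A space inside a variant stands for any run of whitespace.
    alternatives = [
        r"\s+".join(re.escape(part) for part in variant.split(" "))
        for variant in variants
    ]
    # Lookarounds keep a match from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)")


PATTERNS = [(_compile(variants), word) for word, variants in VARIANTS.items()]


def normalize_abbreviations(text):
    """Replace every Table 4 variant with its normalized word."""
    for pattern, word in PATTERNS:
        text = pattern.sub(word, text)
    return text


print(address_code(342))
print(normalize_abbreviations(
    "Dari Umar bin Khathab r.a., katanya dia mendengar Rasulullah saw. bersabda"
))
```

Running the sketch prints `H10342`, matching the Table 3 example, and the excerpt with ‘r.a.,’ and ‘saw.’ reduced to ‘ra’ and ‘saw’. Normalizing before any symbol removal is what keeps the abbreviations from being broken apart when the dots are stripped later.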

The process revealed that there are a few typos of words which are combined with numbers. This process is done by using regular expression matching. The symbols and typos included are tabulated in Table 5 with their corresponding occurrences.

Table 5
Special Characters and Numbers Removal.

Description | Example | Occurrences
Special characters | Exclamation mark (!), open/close brackets ( ( / ) ), semi colon (;), stand-alone dash (-), quote (‘) | 37,547
Single characters | Number (0–9), single alphabet (a-z) | 259
Spelling errors | syarika1ahu, Al1ah, diha1angi | 5

There were 37,547 occurrences in total for all the special characters. The special characters included are exclamation mark (!) with 1715 occurrences, double quotes (“ and ”) with 9496 occurrences, comma (,) with 8736 occurrences, colon (:) with 4277 occurrences, open bracket ‘(’ and closing bracket ‘)’ with 1545 and 1515 occurrences respectively, question mark (?) with 879 occurrences, two single quotes (‘ and ‘) with 30 occurrences, period or dot (.) with 9027 occurrences, semi colon (;) with 160 occurrences, single quote (‘ and ’) with 38 occurrences and, lastly, the single dash (-) symbol with 36 occurrences. Among the dashes, only the stand-alone dashes are removed, because in Malay language some special characters are attached to a word. One type of word, called reduplication, uses a dash within the word; it is discussed in the next section.

While removing the dash symbol, there were also some reduplicated words that were written separately. Hence, for this kind of symbol, the words were glued back together. For example, the word ‘-perempuan’ was detected in H10023. It was combined back with the other half to become ‘perempuan-perempuan’. The same goes for the word ‘al-’ with one occurrence. Other than that, symbols that were included in a word were removed. The dash symbol was spotted in the word ‘7-’, and quotes were spotted in the words ‘aisyah, ‘ra and ‘oAsar, each with one occurrence.

Next, the single characters are removed. A total of 197 numbers are removed by recognizing all the digits 0 to 9. Similarly, there were also some words that are combined with a number. For cases like ‘ke-9’ and ‘ke-52’, the numbers are translated into words rather than digits. Additionally, there were also cases of typos. As Table 5 shows, there were typos where the letter L is written as the digit 1, with 5 total occurrences. This is probably the result of image translation in the previous research. For this case, the digits in the words ‘syarika1ahu’, ‘Al1ah’, ‘diha1angi’, ‘penyesa1annya’, and ‘keci1’ were translated into the correct letter. Lastly, single alphabet characters were removed as they were a result of typos, did not have any meaning, and were not legitimate words. A total of 60 single alphabets were removed as they were results of mistranslation from the raw collection.

D. Human-made Errors Correction

Human-made errors in this process consist of errors that have been overlooked by humans. Table 6 illustrates some of the human-made errors.

Table 6
Human-made Errors.

Word | Correct Word
ahdullah | abdullah
ai | al, air
ai-khudri | al-khudri
kedil | kecil
alhamdu lillah | alhamdulillah
Mahi | Maha
Persi | Parsi
Khaththab | Khathab
jalan-jalan-jalan | jalan-jalan
karena | kerana

Most of the errors in Table 6 are the result of image-to-text translation errors. However, they are called human-made errors in this research because they result from humans failing to notice the errors.

The first four rows in the table show examples where a single letter was mistakenly translated: the word ‘ahdullah’ in the first row, where ‘b’ was mistaken for ‘h’; the second and third rows, which share the same error, where the letter ‘l’ was mistaken for ‘i’; and the word ‘kedil’ in the fourth row, where ‘d’ was the result of ‘c’ and ‘i’ being too close together.

The next six rows were the result of human errors. The errors were in the form of an additional letter in the word (fifth row), spacing (sixth row), a wrong letter or typo (seventh row), an extra repeated pattern (eighth row), and an extra word in the reduplication (ninth row). The last row (tenth row) is a translation error. The word ‘karena’ is Indonesian, whereas in Malay the word is ‘kerana’. Both words mean ‘because’. This step is crucial to ensure that all the words in the collection belong to the Malay language, so that any NLP-related technique gives accurate results. Next, the cleaned corpus will be analysed.

4. Analysis of Malay Translated Hadith Document

In this section, the Malay Translated Hadith Document (MTHD) is analysed based on the word distribution, Zipf’s law, part of speech and named entities. All the analyses are further discussed in their sub-sections below. Before the pre-processing, the word count for the collection is 146,654 total words with 18,622 unique words. The result of MTHD after the pre-processing is illustrated in Fig. 5 and the word count is tabulated in Table 7.

The number of documents remained the same, at 2028 documents, while the total word count was reduced by 500, from 146,654 to 146,154. The unique word count shows a big difference, with a reduction of 8135 from 18,622 to 10,487.

It is essential for a corpus to be semantically annotated to serve more NLP tasks in the future. Therefore, all the 146,154 words are sent to language experts to be annotated with part-of-speech tags. The notation of part-of-speech tags used for this collection is created based on Tatabahasa Dewan (Nik Safiah, Farid, Hashim, & Abdul Hamid, 2015), a reference book for Malay grammar published by Malaysia’s Institute of Language and Literature (Dewan Bahasa dan Pustaka). The tags are then matched with the Penn Treebank (PTB) tags. Table 8 shows the part-of-speech tags used in this collection.

In Tatabahasa Dewan, the tags are divided into four main groups: adjectives, verbs, nouns, and the rest of the words, which are grouped based on their heterogeneous property (17 small tags) under one big category, function words. However, in this collection, the small groups are counted as their own tags rather than under one big tag. In total, there are 24 tags used in the collection, including one tag for foreign words. Of the 24 tags, 15 have a corresponding Penn Treebank tag, whereas the remaining nine do not. There is kata nama tunjuk which is used to indicate point towards something such as human, animal

or thing, for example ‘itu’, ‘ini’, ‘sana’ and ‘sini’. Kata arah is used to indicate direction, such as ‘utara’ (north), ‘sisi’ (side) and ‘luar’ (outside). Kata nafi is a collection of words used for negation, for example ‘tak’ (no) and ‘bukan’ (not). Kata pangkal ayat are words that appear at the beginning of a sentence, seldom used to indicate continuity from the previous sentence. They are mostly used in the classical Malay language, for example the words ‘kalakian’, ‘hatta’ and ‘alkisah’. ‘Kata pembenar’ is a word that expresses agreement or validation in a sentence, such as ‘ya’ and ‘benar’. ‘Kata pemeri’ is a word that links a subject to its predicate adjective or predicate noun. In Malay language, there are two ‘kata pemeri’, which are ‘ialah’ and ‘adalah’. ‘Kata penekan’ is used to highlight the importance of the word that it is combined with. A word becomes ‘kata penekan’ when it combines with the suffix ‘-nya’, for example ‘sesungguhnya’ and ‘nampaknya’. ‘Kata penguat’ is used to strengthen the adjective used in the sentence, for example the word ‘paling’ (most). ‘Kata perintah’ is used with the intention of commanding or giving instruction, such as ‘jangan’ (don’t) and ‘tolong’ (please).

Fig. 5. Malay Translated Hadith Document Input File.

Table 7
Word Counts of Malay Translated Hadith Document Corpus.

Stage | Documents | Total Word Count | Unique Words
Before Pre-processing | 2028 | 146,654 | 18,622
After Pre-processing | 2028 | 146,154 | 10,487

Table 8
Part of Speech Tags Used.

Part-of-speech | Tags in Tatabahasa Dewan | Corresponding Penn Treebank Tag | Additional Tags
Adjective | Adjektif | Adjective (JJ) | -
Verb | Kata kerja | Verb (VB) | -
Noun | Kata nama | Noun (NN) | -
Noun | Kata ganti nama | Pronoun (PRP) | -
Noun | Kata nama tunjuk | - | TJK
Noun | Kata nama khas | Proper noun (NNP) | -
Function Words | Adverba | Adverb (RB) | -
Function Words | Kata arah | - | ARH
Function Words | Kata bantu | Auxiliary (AUX) | -
Function Words | Kata bilangan | Cardinal number (CD) | -
Function Words | Kata hubung | Coordinating conjunction (CC) | -
Function Words | Kata nafi | - | NEG
Function Words | Kata pangkal ayat | - | PKL
Function Words | Kata pembenar | - | PBR
Function Words | Kata pembenda | Possessive Ending (POS) | -
Function Words | Kata pemeri | - | PMR
Function Words | Kata penegas | Particle (RP) | -
Function Words | Kata penekan | - | PNK
Function Words | Kata penguat | - | PGT
Function Words | Kata perintah | - | PRH
Function Words | Kata sendi nama | Preposition (IN) | -
Function Words | Kata seru | Interjection (UH) | -
Function Words | Kata tanya | Wh-pronoun (WP) | -
Other | Foreign word | Foreign Word (FW) | -

A. Word Forms

In Malay language, word forms are divided into four main forms: single word form, affixed word form, compound word form and reduplication form (Nik Safiah et al., 2015). Word forms play an important role in NLP tasks such as stemming and part-of-speech tagging. In this collection, only three of the four forms are analysed. Compound word forms are excluded from the analysis because they require more language expertise and due to time constraints. Single word form is a form in which a word can stand on its own, without any kind of affix and without being mixed with any other word, for example the word ‘saya’ (me). A sentence can be formed by using only the word ‘saya’ and it is considered a valid sentence. The next form is the affixed form, in which the base word undergoes the affixing process. For example, the base word ‘makan’ (eat) can be changed into the affixed form by adding the suffix ‘-an’, producing the word ‘makanan’ (food). The next form is the reduplication form. Like English language, a word in Malay language can undergo a reduplication process, either full or half reduplication, and with or without affixes. Words such as ‘kanak-kanak’ (children) and ‘lelaki’ (man) are examples of reduplication in Malay language.

Fig. 6 shows the distribution of unique words of the MTHD Corpus, whereas Fig. 7 illustrates the distribution of total words in the collection. For the unique words, the highest percentage is made up of the affixed word form (62%), with a total of 5230 words, followed by the single word form, which represents 33% of the unique word count, with a total of 2776 words. The reduplication form consists of only 453 words, which represent 5% of the unique word count.

However, in the total words distribution of MTHD, the difference between single word form and affixed word form is tiny (3% difference).

Fig. 6. Unique Words Distribution of MTHD Corpus.
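The unique-word percentages reported for the three analysed word forms can be reproduced from the counts given in the text. The sketch below is only a worked check: it assumes each share is taken over the sum of the three analysed forms (8,459 words) rather than over the 10,487 unique words of Table 7, an assumption that matches the reported 62%, 33% and 5%. The names `unique_counts` and `shares` are illustrative.

```python
# Unique-word counts per analysed word form, as reported in the text.
unique_counts = {"affixed": 5230, "single": 2776, "reduplication": 453}


def shares(counts):
    """Return each count as a percentage of the sum of all counts."""
    total = sum(counts.values())
    return {form: 100 * n / total for form, n in counts.items()}


analysed_total = sum(unique_counts.values())
unique_shares = shares(unique_counts)

print("analysed unique words:", analysed_total)
for form, pct in unique_shares.items():
    print(f"{form:14s} {unique_counts[form]:5d} {pct:5.1f}%")
```

Taking the shares over all 10,487 unique words instead would put the affixed form at roughly 50%, so the denominator matters when comparing these figures with Table 7.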

With 73,651 counts, the single word form represents the highest percentage of the total word distribution (51%), followed by the affixed form (47%) with 68,311 counts. Reduplication remains the smallest share of the total word distribution with 2144 counts, representing 2% of the total word count.

Fig. 7. Total Words Distribution of MTHD Corpus.

B. Zipf’s Law

Zipf’s Law is a statistically formulated law that can be used to evaluate the distribution of words in a natural language corpus, where various types of data studied in the physical and social sciences follow a Zipfian distribution (Manning & Schütze, 2000; Zamin et al., 2012b). It is named after a Harvard linguistics professor, George Kingsley Zipf (1902–1950). The law indicates that, given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. Hence, the word at rank n has a frequency proportional to 1/n. For example, when the law was applied to the Brown corpus, a frequently used English-language corpus containing roughly one million words, it was discovered that the word “the” is the most frequent word in the corpus, appearing 69,971 times. The law uses the frequency of a word in a corpus (f), a constant (k), and the rank of the word (r) (Ha et al., 2002; Zamin et al., 2012b). Eq. (1) expresses Zipf’s Law:

$$ f = \frac{k}{r} \qquad (1) $$

Table 9 below shows the top 10 words in MTHD Corpus based on their frequency. The word ‘yang’ has the highest frequency of 3838, followed by the word ‘dan’ at the second rank with a frequency of 3628. The third rank is the word ‘itu’ with a frequency of 2896, followed by the word ‘saw’ with a frequency of 2890. The word ‘dari’ is at the fifth rank with a frequency of 2887, followed by the word ‘beliau’ with a frequency of 2243. The word ‘nabi’ is at the seventh rank with a frequency of 2215, followed by the word ‘di’ with a frequency of 2016. At the ninth rank is the word ‘ra’ with a frequency of 1963 and at the tenth rank is the word ‘saya’ with a frequency of 1961.

Table 9
Highest Frequency Words of Malay Translated Hadith Document Corpus.

Word | Frequency | Rank | POS Tags
yang | 3838 | 1 | CC
dan | 3628 | 2 | CC
itu | 2896 | 3 | TJK
saw | 2890 | 4 | NN
dari | 2887 | 5 | IN
beliau | 2243 | 6 | NN
nabi | 2215 | 7 | NN
di | 2016 | 8 | IN

In the field of Information Retrieval (IR), words that are extremely common and extremely uncommon are not significant for indexing. These words can be detected by plotting the data gained from Zipf’s Law as a graph. Consequently, computational resources can be reduced by eliminating both extremely common and uncommon words (Talvensaari, 2008; Zamin et al., 2012b). The graph is commonly known as Zipf’s Curve.

Fig. 8 illustrates the distribution of word frequency in MTHD Corpus. It is observable that MTHD Corpus follows the Zipf’s Law pattern observed in most English corpora. The frequency of word occurrence is approximately an inverse power law function of its rank. However, both extremely common and uncommon words were not removed from this collection, as these words are crucial in aiding other NLP-related tasks such as part-of-speech tagging and relation extraction in Information Extraction.

C. Part-of-speech

After the word form and frequency distribution, the next thing to analyse was the part-of-speech. Part-of-speech, more commonly known as POS, is a term used to represent the lexical items in a collection. The term POS is interchangeable with “word category” and “word class”. Despite being different terms, they are all used to describe the grammar of the language. In English, there are eight commonly used parts of speech: adjective, adverb, verb, noun, pronoun, preposition, conjunction, and interjection. Some languages are harder to analyse than others. There are languages that divide words based on gender, masculine and feminine in French for example. There are also languages that classify words into singular, dual, and plural, such as Arabic. Malay language is one of the easiest languages to learn, as this language is tense-free, number-free, and gender-free. The part-of-speech tags were discussed with Table 8; hence in this section, only the result is discussed.

Table 10 illustrates the distribution of part-of-speech tags in the Malay Translated Hadith Document corpus. In the collection, nouns have the highest frequency with 48,710, representing 33.80% of the total collection. The three most frequent nouns are ‘saw’ with 2890 counts, ‘beliau’ with 2243 counts, and ‘nabi’ with 2215 counts. Verbs are the second highest, representing 16.34% of the collection with a total frequency of 23,553. The verb ‘bersabda’ appeared the most with 1093 counts, followed by the word ‘solat’ with 1024 counts, and the third most frequent verb is the word ‘berkata’ with 768 counts. After verbs, the third highest frequency is conjunctions, which represent 9.87% with a frequency of 14,220. Prepositions are the fourth highest, with 7.78% and a frequency of 11,216, followed by proper nouns with 8616 (5.98%), adverbs with 4893 (3.39%), adjectives with 4015 (2.79%), ‘kata tunjuk’ with 3940 (2.73%), pronouns with 3847 (2.67%), possessive ending with 3488 (2.42%), auxiliary with 3222 (2.24%), cardinal number with 2566 (1.78%), ‘kata perintah’ with 2166 (1.50%), ‘kata arah’ with 1704 (1.18%), ‘kata nafi’ with 1530 (1.06%), wh-pronouns with 1436 (1.00%), foreign word with 1418 (0.98%), particle with 954 (0.66%), ‘kata pangkal ayat’ with 837 (0.58%), ‘kata pembenar’ with 521 (0.36%), ‘kata penguat’ with 518 (0.36%), interjection with 408 (0.28%), ‘kata pemeri’ with 299 (0.21%), and lastly, at the 24th rank, ‘kata penekan’ with a frequency of 49, representing 0.03% of the collection.

D. Named entities

In part-of-speech distributions, there were 920 proper nouns
ra 1963 9 NN that were tagged. The proper nouns were further analysed to
saya 1961 10 NN
include the named entities for each of the proper nouns. This anal-

Fig. 8. Zipf’s Curve of Malay Translated Hadith Document Corpus.
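As an illustration of Eq. (1), the rank-frequency relation behind a curve such as Fig. 8 can be sketched in a few lines of Python. This is a minimal sketch and not the authors' procedure: the function names `rank_frequency` and `fit_zipf` are hypothetical, and the fit uses the generalised form f = k / r^s (an assumption; Eq. (1) is the special case s = 1), estimated by ordinary least squares on the log-log values. The only data taken from the paper are the ten frequencies listed in Table 9.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


def fit_zipf(frequencies):
    """Fit log f = log k - s * log r by least squares; return (k, s).

    `frequencies` must already be sorted by rank (rank 1 first).
    Assumption: the generalised form f = k / r**s, where Eq. (1)
    corresponds to s = 1.
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Frequencies of the ten highest-ranked words as reported in Table 9.
TABLE_9 = [3838, 3628, 2896, 2890, 2887, 2243, 2215, 2016, 1963, 1961]

if __name__ == "__main__":
    k, s = fit_zipf(TABLE_9)
    print(f"k = {k:.0f}, exponent s = {s:.2f}")
```

Because only the ten most frequent words are used here, the fitted values say little about the corpus as a whole; the complete frequency list would be needed to reproduce the full curve.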

Table 10
Part-of-speech Distribution of MTHD.

Tags                       Frequency
Noun (NN)                  48,710 (33.80%)
Verb (VB)                  23,553 (16.34%)
Conjunction (CC)           14,220 (9.87%)
Preposition (IN)           11,216 (7.78%)
Proper Nouns (NNP)         8616 (5.98%)
Adverb (RB)                4893 (3.39%)
Adjective (JJ)             4015 (2.79%)
Kata Tunjuk (TJK)          3940 (2.73%)
Pronoun (PRP)              3847 (2.67%)
Possessive Ending (POS)    3488 (2.42%)
Auxiliary (AUX)            3222 (2.24%)
Cardinal Number (CD)       2566 (1.78%)
Kata Perintah (PRH)        2166 (1.50%)
Kata Arah (ARH)            1704 (1.18%)
Kata Nafi (NEG)            1530 (1.06%)
Wh-pronoun (WP)            1436 (1.00%)
Foreign Word (FW)          1418 (0.98%)
Particle (RP)              954 (0.66%)
Kata Pangkal Ayat (PKL)    837 (0.58%)
Kata Pembenar (PBR)        521 (0.36%)
Kata Penguat (PGT)         518 (0.36%)
Interjection (UH)          408 (0.28%)
Kata Pemeri (PMR)          299 (0.21%)
Kata Penekan (PNK)         49 (0.03%)
Grand Total                144,126 (100.00%)

ysis is crucial in Information Extraction (IE), where the first step is to recognize named entities. Speech and Language Processing (Jurafsky and Martin, 2009) gives a common list of named entity types: person, organization, location, geo-political entity, facility and vehicle. However, named entity groups are not limited to these and also depend on the corpus itself. For this corpus, the named entities are sorted following the generic list, and each category is then broken down further. Table 11 below shows all the named entities found in the MTHD corpus.

Table 11
Named Entities in Malay Translated Hadith Document Corpus.

Generic Entities   Named Entities     Occurrences   Example
Person             Person             563           Ansari, Dhiman
                   Family             18            Ghiffar, Luai
                   Tribe              16            Khatsaam, Rabiaah
                   Nickname           4             Muhajjalin, Shabi
                   Race               1             Khawarij
Organization       Organization       23            Quraizhah, Syanuah
Location           Location           180           Baitullah, Damsyiq
Other              Surah              23            Al-Fatihah, An-Nisa
                   Month              13            Dzulhijjah, Muharram
                   Day                13            Tasyriq, Aqabah
                   War                10            Badr, Khandaq
                   Prayer             9             Isyak, Maghrib
                   Religion           5             Yahudi, Nasrani
                   Event              4             Isra, Jamratul
                   Entity             4             Dajjal, Yakjuj
                   Animal             4             Adba, Jazah
                   Tree               3             Dauhah, Gabah
                   Trade              3             Habalah, Muzabanah
                   Paradise           3             Firdaus, Aden
                   Idolatry           3             Hubal, Yamaniah
                   Book               3             Al-Quran, Injil
                   Object             2             Hajar, Manaat
                   Hajj               2             Wadak, Wida
                   Fruit              2             Ajwah, Barni
                   Doa (Invocation)   2             Amin, Salaam
                   Year               1             Hijriyah
                   Water              1             Majannah
                   Nationality        1             Arabiy
                   Language           1             Ibrani
                   Door               1             Ar
                   Building           1             Kabah
                   Angle              1             Yamani
Grand Total                           920

Only three of the six generic entities exist in the Malay Translated Hadith Document corpus: person, organization, and location. Organization and location each consist of a single named entities

group, with 23 occurrences for organization, such as ‘Quraizhah’ and ‘Syanuah’, and 180 occurrences for location, such as ‘Baitullah’ and ‘Damsyiq’. The person group, however, can be broken down further into five named entity groups: person names with 563 occurrences (such as ‘Ansari’ and ‘Dhiman’), family names with 18 occurrences (such as ‘Ghiffar’ and ‘Luai’), tribe names with 16 occurrences (such as ‘Khatsaam’ and ‘Rabiaah’), nicknames with four occurrences (such as ‘Muhajjalin’ and ‘Shabi’), and lastly race with only one occurrence, the word ‘Khawarij’.

In addition, there are 25 more named entity types in this collection. There are 23 surah names such as ‘Al-Fatihah’ and ‘An-Nisa’, 13 month names such as ‘Dzulhijjah’ and ‘Muharram’, 13 day names such as ‘Tasyriq’ and ‘Aqabah’, 10 names of wars such as ‘Badr’ and ‘Khandaq’, nine prayer names such as ‘Isyak’ and ‘Maghrib’, and five religion names such as ‘Yahudi’ and ‘Nasrani’. Three groups have four occurrences each: event (such as ‘Isra’ and ‘Jamratul’), entity (such as ‘Dajjal’ and ‘Yakjuj’), and animal names (such as ‘Adba’ and ‘Jazah’). Next, five named entity groups have three occurrences each: tree (such as ‘Dauhah’ and ‘Gabah’), trade (such as ‘Habalah’ and ‘Muzabanah’), paradise (such as ‘Firdaus’ and ‘Aden’), idolatry (such as ‘Hubal’ and ‘Yamaniah’), and book names (such as ‘Al-Quran’ and ‘Injil’). Four groups have two occurrences each: object (such as ‘Hajar’ and ‘Manaat’), hajj (such as ‘Wadak’ and ‘Wida’), fruit (such as ‘Ajwah’ and ‘Barni’), and doa (invocation) names (such as ‘Amin’ and ‘Salaam’). Lastly, there is one occurrence each for year (‘Hijriyah’), water (‘Majannah’), nationality (‘Arabiy’), language (‘Ibrani’), door (‘Ar’), building (‘Kabah’) and angle (‘Yamani’).

From the table, it is observed that more than half of the entities belong to the person group. This follows from the nature of the hadith itself, which contains two parts: ‘sanad’ (the chain of narrators) and ‘matn’ (the content). There are also proper names that do not fit any group, and these are placed in the entity group. For example, ‘Dajjal’ is portrayed as an evil figure in Islamic history, and ‘Yakjuj’ and ‘Makjuj’ (Gog and Magog in the Hebrew Bible) are referred to as a group that inherits power as the destroyer of life on earth. It remains unclear to date whether the term belongs to a person or not. At the bottom of Table 11, there are several named entity groups that have only one occurrence. They are kept that way to accommodate future research that analyses other types of hadiths, and the possibility of detecting some signal words for the entity.

5. Conclusion

Corpus analysis can reveal the attributes of a corpus. In this experiment, nouns represent the largest share of the corpus at more than 33%, which is more than twice the share of the second largest group, verbs, at 16% of the total corpus. For named entities in the MTHD corpus, more than half of the proper names belong to the person group, because the hadith structure consists of the narrator chain and the content. In a hadith document, the narrator chain is important for determining the authenticity of the hadith. In this experiment, the most important phase is the manual filtering and cleaning. Without this phase, the analysis can differ greatly, as the number of unique words was reduced by 8135 words after the filtering and cleaning. It is observable that affixes represent the highest percentage in terms of unique words, but the single form represents the highest percentage of the total words. The ten most frequent words in MTHD, ‘yang’, ‘dan’, ‘itu’, ‘saw’, ‘dari’, ‘beliau’, ‘nabi’, ‘di’, ‘ra’, and ‘saya’, are all single form, which tallies with the representation of the total word percentage.

In conclusion, corpus analysis is one of the crucial steps to carry out before research. With corpus analysis, many NLP-related tasks can be done, such as part-of-speech tagging, Information Extraction, summarization, question answering and many others. The analysed corpus provides a way for more hadith-related natural language processing research and for improving existing work, such as in the information extraction and information retrieval fields. For future work, automated taggers and entity recognizers can be evaluated against the benchmark, and topic distributions for the hadiths can be analysed, as hadiths can be grouped into sections or similar groups based on human relevance. With the analysis, the information retrieval field will benefit greatly, as the very common and uncommon words can be eliminated with Zipf’s law, and clustering and indexing techniques can benefit from the annotated corpus.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was fully supported and funded by the Malaysian Government under the Fundamental Research Grant Scheme (FRGS/1/2015/ICT01/UITM/03/1) and Universiti Teknologi MARA BESTARI Grant 600-IRMI/DANA 5/3/BESTARI (112/2018). The authors would also like to thank Mrs. Ros Silawati Ahmad@Abdullah, Mr. Sazali Saidin and Mrs. Rokiah Karim from Institut Perguruan Kampus Darul Aman and Kampus Perlis for assisting with their expertise in language so that we could analyse the corpus. We would also like to thank the anonymous reviewers for their insights.

References

Abdul Rahman, N., Abu Bakar, Z., & Tengku Sembok, T. M. (2010). Query expansion using Thesaurus in Improving Malay Hadith Retrieval system. In 2010 International Symposium on Information Technology (pp. 1404–1409). Kuala Lumpur, Malaysia: IEEE.
Abdul Rahman, N., Kamal Ismail, N., Abu Bakar, Z., & Tengku Sembok, T. M. (2006). MUTIARA HADIS: Malay Hadith Retrieval System. In Proceedings of Seminar IT. Kuala Terengganu, Malaysia. Retrieved from webhadis/
Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay named entity recognition based on rule-based approach. Int. J. Mach. Learn. Comput., 4(3), 300–306.
Alfred, R., Leong, L. C., On, C. K., Anthony, P., Fun, T. S., Razali, M. N., & Ahmad Hijazi, M. H. (2013). A Rule-Based Named-Entity Recognition for Malay Articles. In H. Motoda, Z. Wu, L. Cao, O. Zaiane, M. Yao, & W. Wang (Eds.), Advanced Data Mining and Applications (Vol. 8346, pp. 288–299). Berlin, Heidelberg: Springer Berlin Heidelberg.
Alfred, R., Mujat, A., & Obit, J. H. (2013). A Ruled-Based Part of Speech (RPOS) tagger for Malay text articles. In A. Selamat, N. T. Nguyen, & H. Haron (Eds.), Intelligent Information and Database Systems (Vol. 7803, pp. 50–59). Berlin, Heidelberg: Springer Berlin Heidelberg.
Alias, N., Rahman, N. A., Ismail, N. K., Nor, Z. M., & Alias, M. N. (2016). Searching Algorithm of Authentic Chain of Narrators’ in Shahih Bukhari Book. In 2016 International Conference on Applied Computing, Mathematical Sciences and Engineering (ACME2016) (pp. 60–66). Johor, Malaysia. INFRKM.2016.7806336.
Alias, S., Mohammad, S. K., Hoon, G. K., & Ping, T. T. (2016). A Malay Text Summarizer using Pattern-Growth method with Sentence Compression Rules. In 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP) (pp. 7–12). Bandar Hilir, Malaysia.
Azizan, A., Abu Bakar, Z., & Abdul Rahman, N. (2019). Construction of durian dataset from web collection for query reformulation research. Int. J. Recent Technol. Eng., 8(2811), 630–634.
Ha, L. Q., Sicilia-Garcia, E. I., Ming, J., & Smith, F. J. (2002). Extension of Zipf’s law to Words and Phrases. In COLING ’02 Proceedings of the 19th international conference on Computational linguistics (pp. 1–6). Taipei, Taiwan.
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and

Speech Recognition (2nd ed., Vol. 21). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Liddy, E. D. (2001). Natural Language Processing. In Encyclopedia of Library and Information Science (2nd ed.). New York: Marcel Dekker, Inc. 10.1017/S0267190500001446
Mahmood, A., Khan, H. U., Alarfaj, F., Ramzan, M., & Ilyas, M. (2018). A Multilingual Datasets Repository of the Hadith Content. International Journal of Advanced Computer Science and Applications, 9(2), 165–172. IJACSA.2018.090224
Manning, C. D. (2011). Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In A. F. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing (pp. 171–189). Berlin, Heidelberg: Springer Berlin Heidelberg.
Manning, C. D., & Schütze, H. (2000). Foundations of Natural Language Processing. In Proceedings of the 20th International Conference on Computer Processing of Oriental Languages.
Mohamed Hanum, H., Abu Bakar, Z., Abdul Rahman, N., Mohd Rosli, M., & Musa, N. (2014). Using topic analysis for querying halal information on malay documents. Proc. – Soc. Behavioral Sci., 121, 214–222. sbspro.2014.01.1122.
Mohd Noor, N., Sulaiman, J., & Noah, S. A. (2016). Malay name entity recognition using limited resources. Adv. Sci. Lett., 22(10), 2968–2971. asl.2016.7124.
Nik Safiah, K., Farid, M. O., Hashim, H. M., & Abdul Hamid, M. (2015). Tatabahasa Dewan (3rd ed.). Kuala Lumpur.
Noor Ariffin, S. N. A., & Tiun, S. (2018). Part-of-speech tagger for malay social media texts. GEMA Online® J. Language Stud., 18(4), 124–141.
Sazali, S. S., Abdul Rahman, N., & Abu Bakar, Z. (2016). Information Extraction: Evaluating Named Entity Recognition From Classical Malay Documents. In 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP) (pp. 48–53). Bandar Hilir, Malaysia: IEEE. org/10.1109/INFRKM.2016.7806333
Sulaiman, S., Abdul Wahid, R., Sarkawi, S., & Omar, N. (2017). Using Stanford NER and Illinois NER to Detect Malay Named Entity Recognition. International Journal of Computer Theory and Engineering, 9(2), 147–150. 10.7763/IJCTE.2017.V9.1128
Talvensaari, T. (2008). Comparable Corpora in Cross-Language Information Retrieval. University of Tampere.
Toutanova, K., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In EMNLP ’00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (Vol. 13, pp. 63–70). Hong Kong, China.
Tuah Mohamad Rahim, N. N. A., Mabni, Z., Mohamed Hanum, H., & Abdul Rahman, N. (2016). A Malay Hadith translated document retrieval using parallel Latent Semantic Indexing (LSI). In 2016 3rd International Conference on Information Retrieval and Knowledge Management, CAMP 2016 - Conference Proceedings (pp. 118–123). Bandar Hilir, Malaysia. INFRKM.2016.7806346.
Ulanganathan, T., Ebrahimi, A., Xian, B. C. M., Bouzekri, K., Mahmud, R., & Hoe, O. H. (2017). Benchmarking Mi-NER: Malay Entity Recognition Engine. The Ninth International Conference on Information, Process and Knowledge Management (EKNOW 2017), 52–58.
Xian, B. C. M., Lubani, M., Ping, L. K., Bouzekri, K., Mahmud, R., & Lukose, D. (2016). Benchmarking Mi-POS: Malay Part-of-Speech Tagger. Int. J. Knowledge Eng., 2(3), 115–121.
Zamin, N., & Abu Bakar, Z. (2015). Name Entity Recognition for Malay Texts Using Cross-Lingual Annotation Projection Approach. In O. Gervasi, B. Murgante, S. Misra, M. L. Gavrilova, A. M. A. C. Rocha, C. Torre, . . . B. O. Apduhan (Eds.), Computational Science and Its Applications – ICCSA 2015 (pp. 242–256). Cham: Springer International Publishing.
Zamin, N., Oxley, A., Abu Bakar, Z., & Farhan, S. A. (2012a). A Lazy Man’s Way to Part-of-Speech Tagging. In D. Richards & H. B. Kang (Eds.), Knowledge Management and Acquisition for Intelligent Systems (pp. 106–117). Berlin, Heidelberg: Springer Berlin Heidelberg.
Zamin, N., Oxley, A., Abu Bakar, Z., & Farhan, S. A. (2012b). Characteristics of a Malay Journalistic Corpus. In 2012 IEEE Conference on Control, Systems and Industrial Informatics (ICCSII) (pp. 214–218). Bandung, Indonesia. 10.1109/CCSII.2012.6470503.
Zulkefli, N. S. S., Abdul Rahman, N., & Abu Bakar, Z. (2016). Analyzing Search Retrieval Results on Malay Translated Hadith Text Documents. In 2016 International Conference on Applied Computing, Mathematical Sciences and Engineering (ACME). Johor, Malaysia.
Zulkefli, N. S. S., Rahman, N. A., Bakar, Z. A., & Alam, S. (2015). Representation of Search Retrieval Results on Digital Hadith Online Browser. Universiti Kebangsaan Malaysia International Colloquium of Graduates Islamic Studies 2015, (December).

