An Experiment in Natural Language Processing
At first glance, discovering the citation of canons in works of positive law does not seem too
difficult. In fact, one possible solution would be to simply search for the canon in already
digitized collections, such as those provided in al-Maktaba al-Shāmila, Maktaba Ahl al-Bayt,
or al-Jāmiʿ al-Kabīr. But, searching for each canon in potentially hundreds of volumes, one text at a time, through what are often clunky search interfaces, is time-consuming. This can be automated to a certain extent because of the recent standardization and digital publication of
Islamic texts by the OpenITI initiative. One may simply write a script that searches all sets of
texts in question and delivers the results to be evaluated in a more organized and potentially
efficient way by scholars. But, even this more automated approach has the fundamental
drawback of returning the results of only exact matches. If one were to search for the doubt
canon, “tudraʾ al-ḥudūd bi al-shubuhāt”, only the results that matched exactly that phrase would
be discovered. But, this canon has been phrased in a variety of ways by different authors through the long history of Islamic law. Moreover, something akin to its meaning, even when not encapsulated in a pithy formula, is probably present in a variety of texts, especially early ones that were authored before the canon acquired a maxim-like formulation.
Advances in NLP and ML allow us to search for texts that are semantically similar and not just
identical to the one queried. Many of the algorithms developed to solve this problem are now
readily available in easily deployable modules, especially in the Python programming language.
Although these modules could initially only handle English or Chinese texts, more recently
technology firms and open-source researchers have started adding the ability to search for
semantically similar phrases to other languages, including Arabic.
Google has recently introduced an experimental cloud service that allows one to quantify the
extent to which texts are similar to each other and can perform the operation on Arabic texts.
Because the service is, at the moment, experimental, it is offered at no cost, on the condition that
one’s proposed project is accepted.[1] When I saw a news article describing the service a couple
of months ago, I decided to try my luck and applied to the program. Google got back to me
within a couple of days of receiving my application and approved it.
At a basic level, the service receives two texts and returns a score representing the extent of the
semantic similarity between them – the higher the number, the more similar the texts are to each
other. Potentially, we can do this with a given canon and a corpus of fiqh texts, appropriately
divided up into small, manageable chunks. After we receive a similarity score for the canon and
each chunk of the fiqh corpus, we can do a sort on the similarity score, and the texts most similar
to the canon will show up at the top. As a test, I initially sent the Google Semantic Similarity Service (GSSS) the following texts (the "input" would be the canon, while the "candidate" would be the text searched against):[2]
In the example above, I chose the intentions canon "inna-mā al-aʿmāl bi al-niyyāt (actions are judged by intentions)" and sent a random assortment of phrases, some of which I thought were semantically more similar to it than others. The results were promising. In the first batch it correctly rated the phrase "al-umūr bi-maqāṣidihā (matters are to be considered according to their purposes)" as more semantically similar (≈0.45) to the intentions canon versus the basmala (≈0.15)[3] and the doubt canon (≈0.22). The same thing can be noticed in the second batch. Broadly speaking, GSSS is able to separate semantically similar phrases from those that are not. GSSS thought that the phrase "niyyat al-muʾmin khayr min ʿamalihi (the believer's intention is better than his action)" was most similar to the intentions canon, then "al-umūr bi-maqāṣidihā," and then the phrase I invented, "inna al-afʿāl bi-maqāṣidihā." Intuitively, though, I would have thought the last phrase would be most semantically similar to the intentions canon. Regardless, I thought
the results were promising enough to pursue a more substantial experiment.
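Since GSSS itself is available only to approved applicants, a rough sense of the score-and-sort idea can be conveyed with an open-source stand-in. The sketch below uses the sentence-transformers library with one multilingual model that handles Arabic; the model choice is my assumption, and its scores will of course differ from GSSS's:

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual embedding model that supports Arabic; a stand-in for
# GSSS, not the service actually used in the experiment.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

canon = "إنما الأعمال بالنيات"          # the intentions canon (the "input")
candidates = [                          # the "candidate" texts
    "الأمور بمقاصدها",                  # the purposes canon
    "بسم الله الرحمن الرحيم",           # the basmala
    "تدرأ الحدود بالشبهات",             # the doubt canon
]

canon_emb = model.encode(canon, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(canon_emb, cand_embs)[0]

# Sort the candidates by similarity, most similar first.
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {text}")
```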
For the experiment proper, we needed a work of positive law to search. We chose al-Qarāfī's (d. 684/1285) encyclopedic compendium of Mālikī law, al-Dhakhīra (the Treasure).[4] Qarāfī also composed a separate work devoted to legal canons, al-Furūq, where he explains the relationship between the two works:
I wrote this book specifically to cover legal canons, adding many canons that are not
in the Dhakhīra, and including all those that appear in the Dhakhīra in simplified and
clarified terms. In the Dhakhīra, I was attempting to draw on the abundance of
foundational-source references (kathrat al-naql) for substantive law rulings (furūʿ),
following the mode of writing substantive law [works]. I preferred not to combine
[that form] with the simpler forms of works outlining legal permissions (mubāḥāt)
and legal canons (qawāʿid). So, I wrote a separate book, given that [understanding]
the canons from [the Dhakhīra alone] would otherwise be difficult.[5]
Since OpenITI has mined, cleaned, and structured thousands of Islamic texts found in other
digital repositories and formatted them so that they can be easily manipulated in programming
languages, such as Python, I used the two versions of the Treasure available there.[6] One was
drawn from the version found in the popular Islamic text repository al-Maktaba al-Shāmila,
and the other from the Islamic digital library al-Jāmiʿ al-Kabīr. I had initially thought these
versions of the Treasure were similar enough for our purposes that an investigation of their
differences was unwarranted. This turned out to be incorrect. I initially decided to work with the
version drawn from al-Maktaba al-Shāmila, simply because it seems like it is the largest
contributor of texts to OpenITI to begin with and because I myself had much more familiarity
with their texts than the ones found in al-Jāmiʿ al-Kabīr. I also decided to use Python in the Jupyter notebook environment to prepare the texts, manage the process of sending them to GSSS, and receive the scores back.[7] The
entire process of figuring out how all of this worked together involved many hours of trial-and-
error, as is the nature of these projects, not to mention my own relative unfamiliarity with
working in Python.
The most difficult part of the experiment consisted of preparing the list of pairs I would send to GSSS. How would I divide up the Treasure so that I could send GSSS a list of pairs, each consisting of two texts: the doubt canon, and a segment from the Treasure that might contain it? The instructions from GSSS were not at all clear. First, they did not numerically define how short or long the candidate texts needed to be, other than noting that they can range from short phrases to no longer than a paragraph, without more precisely quantifying what they considered too long. Second, they suggested not sending the service more than 1,000 pairs at a time, cautioning that sending more than that could time out the service.
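The chunking itself is straightforward. Here is a minimal sketch; the sending step is left abstract, since GSSS's own client interface is documented only for approved users, and `all_pairs` and `send_to_gsss` are placeholder names:

```python
def batches(pairs, size=1000):
    """Yield successive chunks of at most `size` (canon, fragment) pairs."""
    for i in range(0, len(pairs), size):
        yield pairs[i:i + size]

# Usage sketch; `all_pairs` is the prepared pair list and `send_to_gsss`
# stands in for whatever call actually submits a batch to the service.
# for batch in batches(all_pairs):
#     scores = send_to_gsss(batch)
```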
A second ambiguity in how to divide up the Treasure stemmed not from GSSS, but from the
difficulty in selecting a non-arbitrary way to partition the text into smaller components. Would
those smaller components be phrases, sentences, or paragraphs? Ideally, we would want to
partition it into fragments that are as small as possible under two constraints: they should not be
smaller than the canon itself and they should not lose semantic integrity, by which I mean their
ability to convey something like a single idea. Sub-sentence phrases would probably be ideal, but there was no symbol (like a comma) that we could easily rely upon to identify them in the Treasure. If I were dealing with an English-language text, sentences would be the next optimal
candidate for partition. They retain semantic integrity (they convey one idea), are usually
relatively short, and they can be easily located in a text because the period delimits their ending.
But, the period as a convention to demarcate sentences was not one that premodern Islamic
scholars used. The introduction of periods into modern edited versions of classical Islamic texts
varied from one editor to the next and the editor of the Treasure seemed to have used them
sparingly. I therefore decided to partition the Treasure by paragraphs, which thankfully were formally indicated in OpenITI's versions by the use of the hashtag sign (#).[8] Here's a
screenshot of what the beginning of a typical OpenITI text file looks like:
I ended up dividing the whole of the Treasure into 9,678 paragraphs.[9] However, I noticed the
length of each paragraph varied widely. Some could be just 4-5 characters long, others could be
as long as 52,120 characters.[10]
In addition to dividing the Treasure into paragraphs, each paragraph needed to be stripped of any
characters that would detract from the core goal of capturing the semantic similarity between two
texts, the main determinant of which are words in a particular sequence. This meant I had to get
rid of any punctuation (e.g. ‘.’, ’!’, ’:’, etc.) and any characters OpenITI introduced in their own
notation of the texts, such as pound signs and tildes. This step, often referred to as “cleaning a
text”, though relatively easy to perform programmatically, is immensely important for many
natural language processing tasks.
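A minimal sketch of these two steps (paragraph partitioning and cleaning) for an OpenITI file follows. It assumes the standard mARkdown conventions: a '#META#Header#End#' line closing the metadata block, 'PageV..P..' page markers, and '~~' line-wrapping; a real run would need to accommodate the inconsistencies described in the notes below:

```python
import re

def openiti_paragraphs(path):
    """Read an OpenITI mARkdown file and return cleaned paragraphs."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Drop the metadata header that precedes the text proper.
    if "#META#Header#End#" in text:
        text = text.split("#META#Header#End#", 1)[1]
    text = text.replace("~~", " ")              # rejoin artificially wrapped lines
    text = re.sub(r"PageV\d+P\d+", " ", text)   # strip page/volume markers
    paragraphs = []
    for chunk in text.split("#"):
        # Keep only Arabic letters and spaces; this also strips punctuation,
        # diacritics, and any leftover OpenITI notation such as pipes.
        cleaned = re.sub(r"[^\u0621-\u064A ]+", " ", chunk)
        cleaned = re.sub(r" +", " ", cleaned).strip()
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs
```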
After cleaning the paragraphs, I had 9,678 paragraphs that consisted only of words composed of
Arabic characters, but of widely varying lengths. I was skeptical that this could work, especially
because of the presence of the large paragraphs. I was also curious about how long it would take
GSSS to return scores for all the paragraphs. Nevertheless, I persisted. I recited the basmala and attempted to send paragraphs, 1,000 at a time, to GSSS to see if it could process the texts and return similarity scores in a timely manner. The service timed out. I kept reducing the number of
pairs I sent the service until I was sending just 2 pairs. Most times I would receive a response; sometimes I would not. Each response took about 20-30 seconds to arrive. Given that sometimes I could not get a response, and that even when I did, I got one only after 20-30 seconds, I determined it was impractical to send the entirety of the Treasure to the service, even one paragraph at a time, and I decided that the paragraphs needed to be broken down further.[11]
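One generic way to cope with such intermittent responses is a retry wrapper with exponential backoff. A sketch, where `send_pair` is a hypothetical stand-in for whatever call actually submits one pair to the service:

```python
import time

def score_with_retry(pair, send_pair, max_tries=5):
    """Retry a scoring call with exponential backoff on timeouts.

    `send_pair` is a hypothetical stand-in for the function that submits
    one (canon, fragment) pair and returns a similarity score.
    """
    delay = 1
    for _ in range(max_tries):
        try:
            return send_pair(pair)
        except TimeoutError:
            time.sleep(delay)
            delay *= 2
    return None
```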
Dr. Rabb suggested that, instead of sending the entirety of the Treasure, we pick one or two chapters, since all we were trying to do was test out the idea. She suggested the chapter devoted to
testimony (kitāb al-shahādāt) for the evidence canon and the chapter on criminal law (kitāb al-
jināyāt) for the doubt canon. When we eliminated the burden of sending the entirety of the
Treasure, we had a much more manageable task before us, though this introduced the complexity
of finding the relevant chapters. OpenITI does mark the sections in their texts, presumably by
reproducing the chapter and section structure of the repositories that they mined.[12] This
presented a problem in the al-Maktaba al-Shāmila version of the Treasure: its headings did not include the titles of chapters, and hence finding the chapters on criminal law and testimony would have required manual effort.[13] On a lark, I decided to see whether the version OpenITI mined from al-Jāmiʿ al-Kabīr would have the chapter titles. Alḥamdulillah (all praise is God's), it did! Using this version had an additional boon: I noticed that OpenITI used the pipe symbol "|" in a way that perhaps represented the period, in the sense that each group of words separated by it seemed to represent a semantic unit.[14] Partitioning along the pipe symbol would result in
fragments larger than sentences, but ones that would be smaller than paragraphs. I decided to
start with the chapter on testimony and search for the evidence canon. The testimony chapter
consisted of 330 paragraphs,[15] and, after further partitioning along the pipe symbol, of 945 fragments. The 945 fragments still varied widely in length – from 6 characters to as long as 3,929 – with an average length of 231 characters.
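A sketch of this further partitioning, splitting each paragraph on the pipe before the final character cleanup (so the pipes are still present), then checking the length statistics; `raw_paragraphs` is an assumed name for the testimony chapter's paragraphs with OpenITI's markup intact:

```python
import re

fragments = []
for para in raw_paragraphs:            # paragraphs with the pipes still present
    for piece in para.split("|"):
        cleaned = re.sub(r"[^\u0621-\u064A ]+", " ", piece)
        cleaned = re.sub(r" +", " ", cleaned).strip()
        if cleaned:
            fragments.append(cleaned)

lengths = [len(f) for f in fragments]
print(f"{len(fragments)} fragments, min {min(lengths)}, "
      f"max {max(lengths)}, mean {sum(lengths) / len(lengths):.0f} characters")
```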
I recited the basmala and attempted to send the 945 fragments to GSSS. It still didn't work.
Although GSSS noted that its service could handle up to 1000 pairs of text, I found that sending
even one pair at a time took too long for it to be practicable. Even by limiting the number of
pairs I sent, I would often not get responses at all. I had one last trick up my sleeve, and so,
despite the failure, I persisted.
I suspected that some of the fragments were still entirely too long, perhaps outside the bounds of
what GSSS considered a “short paragraph.” I needed to find a way to divide up the fragments
even more. From a programmatic perspective, dividing up a paragraph into fragments of an equal
number of words is pretty easy. But doing so ran the following risk: what if I divided up the text
precisely in a place that contained something similar to a canon? Fortunately, NLP provides a
solution to this problem in the form of what is called an n-gram, a term of art in that field.
Partitioning a text using the n-gram approach ensures that there is always an overlap of words between adjacent fragments. The 'n' in the term n-gram represents the number of words
each fragment of a text will contain. The image below gives a sense of how n-grams accomplish
this:
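In code, the sliding window looks like the following minimal sketch. I assume a step of one word, which is what guarantees the overlap:

```python
def word_ngrams(text, n):
    """Return overlapping n-word fragments, sliding the window one word at a time."""
    words = text.split()
    if len(words) <= n:
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("one two three four five", 3))
# ['one two three', 'two three four', 'three four five']
```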
Using n-grams ensures that at least one fragment will contain the semantically similar canon. The downside of using n-grams is, of course, the proliferation of texts you feed the Google service, most of which will be redundant. In addition, depending on how large an n-gram you use to partition your text, the most similar texts will probably refer to the same exact sentence or passage. These difficulties can be overcome, and they are immaterial for the purposes of this very small experiment, which merely wishes to test the viability of using semantic-similarity-based search to identify canons.
Notes:
[1] Note that the reader will not be able to use the Google semantic similarity service unless they apply for access and are accepted. Only once access has been given does Google make the technical
documentation on how to use the service available. This requires, amongst other things, signing
up for their cloud services.
[2] I performed this initial test by logging onto GSSS and manually typing in the “Input” and
“Candidate” texts.
[3] The basmala is shorthand for the phrase bi-smi llāh al-raḥmān al-raḥīm (in the name of God, most Gracious, most Merciful).
[4] For an analysis of Qarāfī’s work on canons and its relationship to the work of his Shāfiʿī
teacher, Ibn ʿAbd al-Salām, see Mariam Sheibani, “Innovation, Influence, and Borrowing in
Mamluk-Era Legal Maxim Collections: The Case of Ibn ʿAbd al-Salām and al-Qarāfī," Journal of the American Oriental Society (forthcoming, 2020).
[5] I owe this point to Intisar Rabb, and the translation of the text above is hers. Aḥmad b. Idrīs
Qarāfī, al-Furūq [Anwār al-burūq fī anwāʿ al-Furūq], ed. Khalīl Manṣūr (Beirut: Dār al-Kutub
al-ʿIlmiyya, 1998), 9.
[6] For an easy-to-use catalog of Islamic texts that OpenITI has rendered machine readable
according to a largely uniform structure, see: https://kitab-corpus-metadata.azurewebsites.net/.
[7] Jupyter has created a notebook interface especially useful for researchers interested in data
exploration and the quick, ad hoc development of scripts.
[8] For a description of OpenITI's tagging scheme, see here. The way OpenITI used the hashtag in the two versions of the Treasure was not consistent. In the al-Maktaba al-Shāmila version, OpenITI used the hashtag to indicate not only paragraphs but also the page and volume numbers of the original edited edition. In order to get only the paragraphs of the Treasure, while excluding the information indicating page numbers, I had to filter the latter out first. A second difficulty was that, for purposes of readability, OpenITI did not confine one paragraph to a single line. Rather, they artificially divided paragraphs up using two tilde symbols ('~~'). So I needed to delete these as well before partitioning into paragraphs. Once these two issues were taken care of, one could simply partition based on the newline symbol ('\n').
[9] This number includes two kinds of paragraphs that are not part of the text proper: the metadata OpenITI places at the beginning of the file, documenting details such as the author's information, the original text repository they mined, the published edition that repository relied on, etc.; and the paragraphs containing the editor's introduction. I deleted these paragraphs from the list I sent to GSSS.
[10] The mean length was ~678 characters.
[11] I have a fairly strong intuition that many of the paragraphs I sent to GSSS were too long and
that was the reason for the time-out. But, there is one other possibility that I should rule out but
haven’t gotten a chance to – that I had not yet learned how to log in to Google’s cloud service
and present the security credentials in the proper manner to get consistent responses from the
service.
[12] Here there is some inconsistency between the way OpenITI says it identified chapter, section, and sub-section headings and the way it actually did so for the texts that I examined. OpenITI claims to differentiate between chapter, section, and sub-section headings by the use of pipe symbols, '|': the more pipe symbols that follow a hashtag, the lower the level of sub-section it is supposed to represent. This scheme was not followed in either one of the versions of the Treasure. Chapter titles, sections, and sub-sections were all uniformly indicated by a single hashtag followed by a single pipe symbol. For their description of how they tagged section headers, see the "—Section headers" section of their description of their tagging scheme. You can find it here.
[13] Finding the beginning and end of two chapters in a single book, while frustrating, would not have been very time-consuming. Now imagine trying to scale up the procedure to cover the
entire corpus of Islamic legal texts.
[14] I found no place in which OpenITI described this use of the pipe symbol.
[15] The smallest relevant paragraph consisted of 11 characters, and the largest 30,135, with a mean of ~697 characters.
The results are promising, and all 38,683 scores can be found here. Note that a standard textual search would actually not have found any texts, because the text searched for, "البينة على المدعي واليمين على من أنكر", does not exist in exactly this form in Qarāfī's work. You can also see that rows 1-4, which have the highest similarity scores, may refer to the exact same passage in Qarāfī's work. What is unclear is why row 5, with a similarity score of 0.7152, would rank higher than row 8.[1] These results need further interpretation, but I think they were promising enough to move forward with the search for the doubt canon in the criminal law chapter without drastic modification of my technique.
When I saw the results on the evidence canon, I decided that the 7-gram fragments may have been too small. I divided up the criminal law book into 20-gram fragments, of which there were 43,341. I found that I could regularly send only about 300 texts at a time without the service timing out, perhaps because the fragments were larger. Ultimately, it took about 2 hours to get similarity
scores for all, and again the results are more promising than in the previous case:
The results for all 43,341 fragments can be found here. The form of the canon that I asked GSSS to return similarity scores for was tudraʾ al-ḥudūd bi al-shubuhāt. The top five results definitely contain this idea. What inspires great confidence in the method is the fact that the top results include fragments in which Qarāfī uses the verb yusqaṭ as a synonym entirely replacing the verb tudraʾ! Reading through the rest of the results, one definitely gets the sense that GSSS is returning fragments that in some sense talk about exceptional circumstances that either lessen or suspend punishment, even if doubt is not the driving consideration. For example, results 11, 13, 16, 18, and 19 (all probably from the same passage) talk about how individuals compelled by necessity or coercion are not to be given the ḥudūd punishments.
The other thing that may be noticed, in comparing the absolute similarity values across the two canons, is that the results for the second canon score about 30 points lower than those for the first. Why is this the case? I suspect that it must be a function of the larger n-grams supplied in the case of the second canon compared to the first (20 vs. 7).[2] For our purposes, though, the absolute values do not matter. We just need some type of relative values that will provide a ranking of similarity across all of the searched-against texts we feed GSSS. The few that rise to the top should definitely contain the canon we are looking for, and the rest of the fragments should lack it.
In order for a search based on semantic similarity scores to be viable, we must be confident that the highest scores do in fact cluster together and are dramatically different from the rest of the scores, on the assumption that, of the tens of thousands of fragments that make up each chapter, only a handful come anywhere close to being semantically similar in a meaningful way to the canon. It is difficult to investigate that by looking at just 19 rows of results. But we can plot a histogram showing the distribution of the similarity scores; if the assumption holds, the bins containing the top scores should hold very few values in comparison with the bins covering all the other scores. This is what we find:
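The histogram itself takes only a few lines of matplotlib. A minimal sketch, assuming `scores` holds the list of similarity scores returned for one canon against every fragment of a chapter:

```python
import matplotlib.pyplot as plt

# `scores` is assumed: the similarity scores GSSS returned for one canon
# against every fragment of a chapter.
plt.hist(scores, bins=30)
plt.xlabel("semantic similarity score")
plt.ylabel("number of fragments")
plt.title("Distribution of similarity scores")
plt.show()
```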
The 19 top results for the evidence canon, ranging from 0.78 to 0.61, are mostly contained in the last three bins (far right) of the histogram depicting the distribution of its scores. In contrast, the bin that received the largest number of values (scores between 0.22 and 0.16) contains 9,757 values. A similar picture emerges when looking at the distribution of the similarity values
of the doubt canon. The histogram of the distribution of the semantic similarity values confirms
that the top fragments with the highest similarity scores are few in number and clustered together
within a range that is quite distant from all other semantic scores.
Future Exploration
At a minimum, the results suggest that adopting a semantic similarity approach is at least a
fruitful avenue for further efforts in developing a tool that automatically discovers the use of
explicit canons in the fiqh corpus. But, I’m not sure using the Google service, at least in its
current form, is the way to go due to issues related to its performance, potential costs, and the
accuracy of results. We supplied 38,683 and 43,341 fragments from two chapters of a single work of positive law and sought to discover two canons. Given that there are hundreds of works of Islamic law, the number of fragments they would generate in totality would be orders of magnitude greater than what we handled in this experiment. Let's assume that there are 10 million 20-gram
fragments in Islamic legal literature. One author of a modern encyclopedia of canons has identified just over 900 of them. Given that we would ask GSSS to compare each canon against all 10 million 20-gram fragments, we would want it to give us similarity scores for 9 billion text-pairs. This number is large, but not insurmountable, especially because once similarity scores have been computed, they need not be computed again. GSSS right now is experimental, and for
that reason it was free. I would also surmise that they did not devote many computational
resources to a product that they are currently just testing out. Presumably, in the near future it
will be offered as a commercial service. Thus one may just purchase more computing power to
generate the scores in a smaller amount of time. But, I have no idea how much that would
ultimately cost.
While GSSS is certainly one method through which we may generate similarity scores, it is not the only method, nor is it necessarily even the most accurate one. First, we don't quite know
how Google is generating the similarity scores—their documentation on this issue was virtually
non-existent or extremely difficult to find. More specifically, we don’t know how they generate
the numerical representations of texts (called embeddings in NLP and ML), which they surely
rely on in order to compute the similarity between the texts. Numerical representations of texts allow us to perform mathematical operations on them, such as assigning phrases scores that capture the extent of their similarity to some other phrase. Generating embeddings involves many parameters and decisions, and the accuracy of embeddings can be experimentally tested to determine the best set of parameters for a given task. While there has been much research on which choices regarding those parameters yield better and worse embeddings in English, this is not so much the case for other languages. In fact, much recent research in Arabic NLP has emerged to tackle this precise problem, and it has shown that, given the structural differences between Arabic and English, embedding techniques that take cognizance of these differences and modify their algorithms accordingly perform better.[3] We don't know the extent to
which Google’s service relies on this recent research, so we don’t know if we could get better
results if we just generated the word embeddings ourselves.
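Generating the embeddings ourselves is not difficult in principle. A minimal sketch with gensim's Word2Vec, using averaged word vectors as a crude phrase embedding; `corpus` is an assumed name for a list of cleaned Arabic fragments, and a serious attempt would train on far more text and evaluate the parameter choices:

```python
import numpy as np
from gensim.models import Word2Vec

# `corpus` is assumed: a list of cleaned Arabic fragments (strings).
sentences = [fragment.split() for fragment in corpus]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, epochs=10)

def phrase_vector(phrase):
    """Average the vectors of the phrase's in-vocabulary words."""
    vecs = [model.wv[w] for w in phrase.split() if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Rank every fragment by its similarity to the doubt canon.
canon_vec = phrase_vector("تدرأ الحدود بالشبهات")
ranked = sorted(corpus, key=lambda f: -cosine(canon_vec, phrase_vector(f)))
```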
There is another potential problem with even the word embeddings generated by recent Arabic NLP approaches. They rely largely on Modern Standard Arabic (either Arabic websites or Arabic Wikipedia) or Arabic Twitter. Of course, embeddings generated by the former are much more appropriate for our texts, but I wonder whether generating word embeddings that give greater weight to premodern classical Arabic texts might produce better results in tasks such as semantic similarity. Furthermore, given that Islamic legal literature is a technical discourse, in which words often mean something very different from, and more precise than, their conventional uses outside of it, generating word embeddings that give greater weight to Islamic legal texts might also lead to better results for our research purposes.
The question about which method to use to locate canons in Islamic legal literature is one that
can be answered through collaborative, experimental research. We can test to see which method
gives us the best results in the most efficient manner possible for the research purposes at hand.
Of course, nothing that I have said is limited to canons or Islamic law, per se. The applications
can be extended to other systematic Islamic discourses, such as theology, philosophy, literature,
poetry, or mysticism. Nor, at a general level, is the technique usable only on Islamic material in Arabic. Nor must it be limited to simply discovering semantically similar
phrases in a corpus. Any time a scholar wonders whether there exists an idea similar to the one
she is considering in some other text, semantic similarity search is a potential technique for
discovering the answer. The recent past has seen two promising developments: an increase in
research on Arabic NLP/ML and the steady introduction of computational techniques to the
broader public at relatively low cost. The Islamic studies community can capitalize on these two
developments and build tools that may be used to ask entirely new research questions, and just
maybe even settle some enduring ones as well. But, given the complexity of the task and the experimental nature of this project, it will have to be undertaken through collaboration among many people with different skill sets: Arabic NLP and machine learning experts will need to work hand in hand with those who have in-depth knowledge of the different domains of Islamic studies. If you are interested in such a project, reach out to me. Again, you can find the Jupyter Python notebook here, which shows how I obtained and cleaned the texts, managed the process of sending GSSS the canon and the fragments, and recorded the results. For the accompanying video walk-through, see this.
Notes:
[1] I suspect that, for whatever reason, GSSS is giving too much importance to the word 'قال/qāla' in row eight. Given that in most cases it is akin to an opening quotation mark in premodern Arabic, and is therefore ubiquitous, it should not be given much weight in making determinations of semantic similarity, especially for the specific purpose of discovering the explicit citation of canons, which are often going to be preceded by the word 'قال/qāla'.
[2] This suspicion is partly confirmed by a further experiment. I re-ran the evidence canon against the Evidence/Testimony chapter using 20-grams instead of 7-grams, and GSSS returned much lower absolute values.
[3] See MADAMIRA and Farasa, two Arabic NLP packages that perform much of the necessary preparatory work on a corpus before embeddings may be obtained from it. For a pre-trained set of embeddings obtained from Arabic Wikipedia, tweets, and Arabic web pages, see AraVec.