Performance Comparison of Django Querysets and Elasticsearch
SCIENTIFIC PUBLICATION
By:
ZAHID MUJADDID
L 200 134 010
PAGE OF VERIFICATION
By:
ZAHID MUJADDID
L 200 134 010
Board of Examiners:
Dean, Faculty of Communication and Information
Head, Study Program of Informatics
STATEMENT OF VOW
I hereby declare that this scientific publication is my own original work and that, to the
best of my knowledge, no similar work has previously been published by anyone else to obtain
a degree at a college or institution, except for the works cited in this manuscript and
listed in the bibliography.
Should the statement above be proven untrue, I shall be held fully accountable.
ZAHID MUJADDID
L 200 134 010
CERTIFICATE OF PLAGIARISM
NO.
PERFORMANCE COMPARISON OF SEARCH FEATURE WITH DJANGO
QUERYSETS AND ELASTICSEARCH IN A WEB APPLICATION
Abstract
A search engine is an important tool that lets users find relevant information quickly and
easily. It is an especially essential feature for applications that manage a massive influx
of data and information on their servers. Search engines are implemented in many forms,
ranging from well-known web search engines such as Google and shopping sites such as Amazon
to social media such as Facebook. The purpose of this research is to implement search engine
technology in Arsip dan Dokumen UMS, a web application developed with the Django web
framework that handles every archive at the University of Muhammadiyah Surakarta. The
research focuses on comparing the performance of a third-party search engine, Elasticsearch,
with a search built on Django Querysets, which is the default implementation in this
application. Each search engine performed 12 search queries against sample text stored in
the archive database fields. Each query was repeated ten times to obtain the best possible
performance measurement in seconds. The archive database fields were populated with 1001
text samples extracted randomly from various Indonesian Wikipedia pages. The results show
how useful Elasticsearch is as a search engine and what its drawbacks are.
Index Terms: search engine, web application, Django, performance comparison, quantitative
analysis
1. INTRODUCTION
Following the creation of the website in 1990 and its availability to everyone after the
announcement from CERN (Cailliau, 1995), the use of search engines began to grow in
popularity. Beginning with the creation of the first commercialized search engine, Yahoo!
Search, in 1994 (Oppitz & Tomsu, 2017), search engines gained traction among the growing
community of web users as a tool for getting relevant information quickly and easily.
Search engines are not limited to web search services such as Google and Bing. They are
also found in many websites that maintain a large user base and database, such as the
shopping sites Amazon and Alibaba and social media such as Facebook and Twitter.
For this research, the sample data used for the comparison was a collection of content
extracted from various Indonesian Wikipedia pages, which was used to populate fields in
the Arsip dan Dokumen UMS application database. A series of search queries against the
database was performed to gather the result count and the time taken for each query.
2. METHODOLOGY
Hardware                                  Software
Laptop Acer Aspire A515-41G-13JX          Windows 10 Enterprise LTSC 64-bit
AMD Quad-Core Processor up to 3.60 GHz    Docker Desktop
AMD Radeon™ RX 540 with 2 GB VRAM         Cmder Console Emulator
8 GB DDR4 Memory, 1000 GB HDD             VS Code Editor
                                          Sublime Text 3
                                          Chromium Edge Browser
Figure 1. Use case diagram Arsip dan Dokumen UMS
2.2.3. Testing
This research adopted unit testing and black box testing. Unit testing was
performed on the code inside the application itself by writing test code to
determine whether the archive CRUD operations work as intended.
Black box testing was performed after the unit tests, directly from the web
User Interface (UI), to confirm that the application really works as intended
and to discover any unexpected bugs or errors.
2.2.4. Deployment
2.3. Material
The search engine needs to search text against the archive database fields. The Django
web framework includes an object-relational mapping (ORM) layer that can be used to
interact with application data from various relational databases such as SQLite,
PostgreSQL and MySQL. The database table is generated by first creating an archive model.
A model is Django's representation of a table in the database, and each model maps to a
single database table. Table 2 depicts the archive model attributes created in this
application, where each attribute represents a field in the database.
Table 2. Archive model database representation in Django
Field Name        Model Type   Attribute
publik_mulai      DateField    null=True
publik_berakhir   DateField    null=True
file_media        FileField    max_length=200
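To make the model-to-table mapping concrete, the following is a minimal sketch that uses Python's built-in sqlite3 module rather than Django itself. Only the field names (nama, deskripsi, sifat, publik_mulai, publik_berakhir, file_media) come from this paper. The table name, the SQL column types, the id column and the sample row are assumptions made for illustration and are not the schema that Django actually generates.

```python
import sqlite3

# Assumed table name and column types; only the field names come from the paper.
SCHEMA = """
CREATE TABLE archive (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    nama            TEXT NOT NULL,
    deskripsi       TEXT NOT NULL,
    sifat           INTEGER NOT NULL DEFAULT 0,
    publik_mulai    DATE NULL,       -- DateField, null=True
    publik_berakhir DATE NULL,       -- DateField, null=True
    file_media      VARCHAR(200)     -- FileField, max_length=200
)
"""


def create_archive_table(conn):
    """Create the single table that the archive model would map to."""
    conn.execute(SCHEMA)


def column_names(conn):
    """Return the column names of the archive table in declaration order."""
    return [row[1] for row in conn.execute("PRAGMA table_info(archive)")]


conn = sqlite3.connect(":memory:")
create_archive_table(conn)
# One row of the table corresponds to one model instance.
conn.execute(
    "INSERT INTO archive (nama, deskripsi, sifat) VALUES (?, ?, ?)",
    ("Contoh arsip", "Deskripsi contoh", 0),
)
row = conn.execute("SELECT nama, publik_mulai FROM archive").fetchone()
print(column_names(conn))
print(row)
```

The nullable date columns mirror the null=True attribute in Table 2: a row can be stored without a publication window.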
To search text within the database fields, they need to be populated first.
This research uses an archive database, but the search feature implemented in
this application does not depend on whether the text in the database is truly
an archive or just randomly generated text. The search only needs to match the
user's search query well. This also allows separation of concerns should the
application later implement a feature requiring database content to be an
archive, detected by certain patterns that an archive should have.
For this research, the archive database was populated with random text
extracted from the Indonesian Wikipedia using a technique called web scraping,
with the help of the third-party Python package wikipedia
(pypi.org/project/wikipedia/). As the name implies, the package is used to
extract content from Wikipedia pages. In this research it extracted the page
title and summary of each Wikipedia topic to populate the fields nama and
deskripsi respectively in the archive database. The research collected one
thousand and one (1001) Wikipedia topics as samples to be searched.
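The collection step described above can be sketched as a small loop that keeps drawing random topics until the target number of unique samples is reached. This is an illustration under stated assumptions, not the script used in this research: the function name collect_samples, its parameters and the stand-in fetchers are hypothetical, and the two callables are injected so the sketch runs without network access.

```python
from typing import Callable, Dict, List


def collect_samples(random_title: Callable[[], str],
                    fetch_summary: Callable[[str], str],
                    target: int,
                    max_attempts: int = 10_000) -> List[Dict[str, str]]:
    """Collect `target` unique topics as {"nama": title, "deskripsi": summary}."""
    samples: Dict[str, str] = {}
    attempts = 0
    while len(samples) < target and attempts < max_attempts:
        attempts += 1
        title = random_title()
        if title in samples:
            continue  # skip topics that were already collected
        try:
            summary = fetch_summary(title)
        except Exception:
            continue  # e.g. an ambiguous or missing page; try another topic
        if summary.strip():
            samples[title] = summary
    return [{"nama": t, "deskripsi": s} for t, s in samples.items()]


# Stand-in fetchers so the sketch runs offline (the duplicate is skipped).
_titles = iter(["Jakarta", "Surakarta", "Jakarta", "Ekonomi"])
demo = collect_samples(lambda: next(_titles), lambda t: f"Ringkasan {t}", target=3)
print([d["nama"] for d in demo])
```

With the wikipedia package installed, one way to wire this up would be to call wikipedia.set_lang("id") and then pass wikipedia.random and wikipedia.summary as the two callables with target=1001. Whether the original script used exactly these calls is not stated in the paper.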
2.3.2. Search Engine
The application uses two different search implementations, which can be switched
as needed to conduct the performance comparison. The first search, implemented
with Django Querysets, is simply an application of the object-relational mapping
(ORM) layer used to interact with application data in the database. The search
with Querysets performs a field lookup, which the Django ORM translates into an
SQL WHERE clause, and returns a new QuerySet containing the objects that match
the specified arguments.
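As a rough illustration of what such a field lookup becomes at the database level, the sketch below runs the SQL directly with Python's sqlite3 module. It assumes a case-insensitive substring lookup (what Django calls icontains) on the fields nama and deskripsi. The paper does not state which lookup the application uses, and the table layout and sample rows are made up.

```python
import sqlite3


def search_archives(conn, keyword):
    """Return names of rows whose nama or deskripsi contains `keyword`."""
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = (keyword.replace("\\", "\\\\")
                      .replace("%", "\\%")
                      .replace("_", "\\_"))
    pattern = f"%{escaped}%"
    sql = ("SELECT nama FROM archive "
           "WHERE nama LIKE ? ESCAPE '\\' OR deskripsi LIKE ? ESCAPE '\\' "
           "ORDER BY id")
    return [row[0] for row in conn.execute(sql, (pattern, pattern))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (id INTEGER PRIMARY KEY, nama TEXT, deskripsi TEXT)")
conn.executemany(
    "INSERT INTO archive (nama, deskripsi) VALUES (?, ?)",
    [
        ("Kabupaten Sukoharjo", "Sebuah kabupaten di Jawa Tengah"),
        ("Ekonomi", "Ilmu tentang produksi dan konsumsi"),
        ("Industri", "Kegiatan ekonomi yang mengolah bahan mentah"),
    ],
)
print(search_archives(conn, "ekonomi"))
```

Because the pattern is wrapped in wildcards on both sides, the keyword matches anywhere inside a field, including in the middle of a longer word.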
The second search, implemented with Elasticsearch, requires an additional
package, django-elasticsearch-dsl, which allows an Elasticsearch instance
running on port 9200 to index Django database models. Elasticsearch is a
document-oriented tool built around the concept of indexing. Once a document is
added, it can be searched within about a second (Kalyani & Mehta, 2017).
Elasticsearch must run search indexing against the application database at
least once. After that, every time a user performs a CRUD operation on the
archive model, Elasticsearch automatically updates its index to reflect the
change in the application database.
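The indexing concept can be illustrated with a toy inverted index that maps each token to the documents containing it and is kept in step with add, update and delete operations. This is a deliberately simplified sketch and not how Elasticsearch is implemented: the class name TinyIndex, its methods and the whitespace-style tokenizer are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class TinyIndex:
    """Toy inverted index: token -> ids of the documents that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._docs = {}

    def add(self, doc_id, text):
        self.remove(doc_id)  # re-adding an existing id acts as an update
        self._docs[doc_id] = text
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def remove(self, doc_id):
        text = self._docs.pop(doc_id, None)
        if text is None:
            return
        for token in tokenize(text):
            self._postings[token].discard(doc_id)

    def search(self, query):
        """Return ids of documents that contain every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = TinyIndex()
index.add(1, "Ibu kota Indonesia")
index.add(2, "Kota Surakarta di Jawa Tengah")
print(index.search("kota"), index.search("ibu kota"))
```

A lookup only touches the posting sets of the query tokens, which is why an index has to be built once up front and then updated on every create, update or delete.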
The search also applies additional filtering determined by the value of
the field sifat, as depicted in Table 2, and by whether or not the user is
authenticated. The field sifat is an IntegerField that stores an integer from
0 to 3, where each number is the key for a trait an archive can have. Table 3
outlines each key-value pair in the field sifat.
Table 3. Field sifat key-value pairs in the archive database
Key Value
0 Publik
1 Publik (temp)
2 Internal
3 Pribadi
Archives with the value Publik or Publik (temp) are always visible in
search results whether or not the user is authenticated. The difference is that
a Publik archive remains publicly available for as long as it exists in the
database, while a Publik (temp) archive is available only for a limited time.
The value Internal means the archive appears in search results for
authenticated users only. An archive with the trait Pribadi is a private
archive that only its owner can access.
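The visibility rules above can be sketched as a single predicate. This is a minimal illustration, not the application's code. The owner attribute, the function name is_visible and the use of publik_mulai and publik_berakhir as an inclusive window for Publik (temp) are assumptions, since the paper does not spell out how the time limit is stored or checked.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Keys of the field sifat, as listed in Table 3.
PUBLIK, PUBLIK_TEMP, INTERNAL, PRIBADI = 0, 1, 2, 3


@dataclass
class Archive:
    nama: str
    sifat: int
    owner: str  # hypothetical attribute identifying the uploader
    publik_mulai: Optional[date] = None
    publik_berakhir: Optional[date] = None


def is_visible(archive: Archive, user: Optional[str], today: date) -> bool:
    """Decide whether `archive` may appear in results for `user` (None = anonymous)."""
    if archive.sifat == PUBLIK:
        return True
    if archive.sifat == PUBLIK_TEMP:
        # Assumption: inclusive window; an unset bound means no limit on that side.
        started = archive.publik_mulai is None or archive.publik_mulai <= today
        not_ended = archive.publik_berakhir is None or today <= archive.publik_berakhir
        return started and not_ended
    if archive.sifat == INTERNAL:
        return user is not None
    if archive.sifat == PRIBADI:
        return user is not None and user == archive.owner
    return False  # unknown key: hide by default


today = date(2020, 6, 1)
temp = Archive("Pengumuman", PUBLIK_TEMP, "zahid", date(2020, 5, 1), date(2020, 5, 31))
print(is_visible(temp, None, today))
```

Applying the predicate to every candidate result, whichever backend produced it, keeps the access rules identical for both search implementations.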
The total target is 12 search queries: 10 queries whose result counts are similar or
differ only slightly, and 2 others for the extreme case with a wide disparity in
search results.
The next step is to measure search time using the Python module timeit, which
provides a way to time small bits of code (Python Software Foundation, 2020).
This was achieved by setting a start timer variable before the search algorithm
begins and an end timer variable after it ends, for both the Django Querysets
search and the Elasticsearch search. The search time in seconds was calculated
by subtracting the start timer from the end timer and was then displayed on the
search page.
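The start and end timer approach can be sketched as follows, using timeit.default_timer as the clock. The helper name timed_search and the stand-in search function are hypothetical. In the application, the wrapped call would be the Querysets search or the Elasticsearch search.

```python
from timeit import default_timer


def timed_search(search_fn, keyword):
    """Run one search and return (results, elapsed_seconds)."""
    start = default_timer()        # start timer, set just before the search
    results = search_fn(keyword)
    end = default_timer()          # end timer, set right after the search
    return results, end - start    # end minus start gives the elapsed seconds


# Hypothetical stand-in for either search backend.
SAMPLE_NAMES = ["Ekonomi", "Industri", "Kabupaten Sukoharjo"]


def fake_search(keyword):
    return [name for name in SAMPLE_NAMES if keyword.lower() in name.lower()]


results, seconds = timed_search(fake_search, "kabupaten")
print(f"{len(results)} result(s) in {seconds:.6f} s")
```

Wrapping both backends with the same helper ensures that the measured interval covers the same part of the request in each case.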
The search was conducted by manually typing search keywords into the input box
of the search page. This was done ten times for each search keyword to obtain
the best time that could be measured.
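The ten repetitions were carried out by hand in this research. The sketch below shows how the same best-of-ten idea could be automated with timeit.repeat, taking the minimum as the best measurable time. The helper name best_of and the stand-in search function are hypothetical.

```python
import timeit


def best_of(search_fn, keyword, repeats=10):
    """Time `search_fn(keyword)` `repeats` times; return (best, all_timings)."""
    # number=1 means each timing covers exactly one search call.
    timings = timeit.repeat(lambda: search_fn(keyword), repeat=repeats, number=1)
    return min(timings), timings


def fake_search(keyword):
    # Hypothetical stand-in for a real search backend.
    return [word for word in ("ibu kota", "kota", "pusat") if keyword in word]


best, timings = best_of(fake_search, "kota")
print(f"best of {len(timings)} runs: {best:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of unrelated activity on the machine, such as other processes competing for the CPU during a run.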
Figure 2. Homepage
There are two important sections within the homepage. The top half is an
introduction section and the bottom half is an about section. The user can log in
from the Login navigation item shown at the top right of the header, which redirects
to the login page. The login page consists of one simple login form, as depicted in
Figure 3. From the homepage the user can also enter a search query in the input box,
which redirects to the search page depicted in Figure 4.
After a user logs in, the Login navigation item at the top right of the header is
replaced with a user icon and username which, when clicked, display a drop-down menu
as depicted in Figure 5.
The menu has two navigation items: My Profile, which redirects to the user profile
page, and Library, which redirects to the list of archives uploaded by users as
depicted in Figure 6.
The library page also contains the links for performing CRUD operations on
archives, such as the New button at the top right of the table, which redirects the
user to the archive create page as depicted in Figure 7.
The user can also view the details of each archive in the list by clicking its
name, which redirects to the page depicted in Figure 8.
Table 4 depicts each of the unit tests performed within the application.
Table 4. Unit test performed
Test                                  Status
Test home page displayed              OK
Test archive model works              OK
Test add an archive works             OK
Test update an archive works          OK
Test remove an archive works          OK
Test get the list of all archives     OK

Test                                  Status
User can login                        OK
User can logout                       OK
User can create an archive            OK
User can update an archive            OK
User can delete an archive            OK
User can search the archive           OK
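As an illustration of what the CRUD unit tests in Table 4 might look like, the sketch below uses Python's unittest module against a hypothetical in-memory store. The class ArchiveStore stands in for the archive model and is not part of the application. The real tests would exercise the Django model and views instead.

```python
import io
import unittest


class ArchiveStore:
    """Hypothetical in-memory stand-in for the archive model."""

    def __init__(self):
        self._rows = {}
        self._next_id = 1

    def add(self, nama, deskripsi):
        archive_id = self._next_id
        self._rows[archive_id] = {"nama": nama, "deskripsi": deskripsi}
        self._next_id += 1
        return archive_id

    def update(self, archive_id, **fields):
        self._rows[archive_id].update(fields)

    def remove(self, archive_id):
        del self._rows[archive_id]

    def all(self):
        return list(self._rows.values())


class ArchiveCrudTest(unittest.TestCase):
    def setUp(self):
        self.store = ArchiveStore()

    def test_add_an_archive(self):
        self.store.add("Ekonomi", "Ringkasan")
        self.assertEqual(len(self.store.all()), 1)

    def test_update_an_archive(self):
        archive_id = self.store.add("Ekonomi", "Ringkasan")
        self.store.update(archive_id, nama="Industri")
        self.assertEqual(self.store.all()[0]["nama"], "Industri")

    def test_remove_an_archive(self):
        archive_id = self.store.add("Ekonomi", "Ringkasan")
        self.store.remove(archive_id)
        self.assertEqual(self.store.all(), [])

    def test_get_the_list_of_all_archives(self):
        self.store.add("Ekonomi", "a")
        self.store.add("Industri", "b")
        self.assertEqual([r["nama"] for r in self.store.all()], ["Ekonomi", "Industri"])


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArchiveCrudTest)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


result = run_suite()
print("OK" if result.wasSuccessful() else "FAILED")
```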
across different search results, which is best attributed to its being a built-in
function of the Django ORM implementation. The next stage is the comparison of search
queries with different result counts, as outlined in Table 7 and Table 8.
Table 7. Performance comparison of both search engines with small differences in result count
Search Query   Django Querysets      Elasticsearch
               Result   Time (s)     Result   Time (s)
Ekonomi        9        0.05         8        0.07
Ibu Kota       23       0.04         16       0.07
Industri       11       0.07         8        0.08
Kabupaten      65       0.05         64       0.13
pusat          23       0.06         17       0.1
Table 8. Performance comparison of both search engines with massive differences in result count
Search Query   Django Querysets      Elasticsearch
               Result   Time (s)     Result   Time (s)
Dewa           41       0.06         7        0.09
Ad             797      0.09         3        0.08
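The figures in Table 7 and Table 8 can be summarized with a short calculation. The sketch below only restates the tabulated values and computes the mean time per engine and the queries where Django Querysets was faster. The function name summarize and the row layout are choices made for this illustration.

```python
from statistics import mean

# (query, django_results, django_seconds, es_results, es_seconds),
# copied from Table 7 and Table 8.
ROWS = [
    ("Ekonomi", 9, 0.05, 8, 0.07),
    ("Ibu Kota", 23, 0.04, 16, 0.07),
    ("Industri", 11, 0.07, 8, 0.08),
    ("Kabupaten", 65, 0.05, 64, 0.13),
    ("pusat", 23, 0.06, 17, 0.10),
    ("Dewa", 41, 0.06, 7, 0.09),
    ("Ad", 797, 0.09, 3, 0.08),
]


def summarize(rows):
    """Return mean times per engine and the queries where Django was faster."""
    return {
        "django_mean": mean(r[2] for r in rows),
        "es_mean": mean(r[4] for r in rows),
        "django_faster": [r[0] for r in rows if r[2] < r[4]],
    }


summary = summarize(ROWS)
print(f"Django mean: {summary['django_mean']:.3f} s")
print(f"Elasticsearch mean: {summary['es_mean']:.3f} s")
print("Django faster for:", ", ".join(summary["django_faster"]))
```

Over these seven queries the mean time is 0.06 s for Django Querysets and roughly 0.089 s for Elasticsearch. Django Querysets is faster for six of the seven queries, the exception being "Ad", where it returned 797 results.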
In Table 7 and Table 8, Elasticsearch also consistently returns fewer results
than the search with Django Querysets. This is attributed to its being built on the
Lucene library, where the relevance of each search result is calculated using a
practical scoring function (Bhandarkar & B. N., 2020). This is best depicted by the
search queries in Table 8, where Elasticsearch managed to find the meaning of “Dewa”,
which means God, and returned seven such relevant results.
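One plausible contributor to the gap in result counts, offered here as an assumption rather than a finding of this research, is the difference between substring matching and whole-token matching. A substring lookup finds a keyword anywhere inside a longer word, while an analyzed full-text index typically matches whole tokens. The toy comparison below uses made-up sentences and hypothetical function names.

```python
import re

# Made-up sample sentences, not taken from the research data.
DOCS = [
    "Dewa adalah makhluk ilahi dalam mitologi",
    "Orang dewasa membayar pajak",
    "Dewan kota mengadakan rapat",
    "Ad hoc berarti untuk tujuan khusus",
    "Pada tahun itu kabupaten berkembang",
]


def substring_matches(keyword, docs):
    """Match the keyword anywhere in the text, even inside a longer word."""
    needle = keyword.lower()
    return [d for d in docs if needle in d.lower()]


def token_matches(keyword, docs):
    """Match the keyword only when it appears as a whole word."""
    needle = keyword.lower()
    return [d for d in docs if needle in re.findall(r"\w+", d.lower())]


for keyword in ("Dewa", "Ad"):
    print(keyword, len(substring_matches(keyword, DOCS)), len(token_matches(keyword, DOCS)))
```

In this toy data a short keyword such as "Ad" appears inside many unrelated words, so the substring count grows much faster than the whole-token count, which is the same direction as the disparity in Table 8.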
to set up because QuerySet is a built-in API provided by Django and is the standard
way of working with a database in Django.
Search with Elasticsearch, on the other hand, requires a more complicated setup
and additional search backend configuration within Django. Elasticsearch also needs
to be running on a separate port before the application can connect to and use it.
Writing the search code for Elasticsearch is also error-prone without a careful
reading of the documentation.
4. CONCLUSIONS
Quantitative analysis of search requires proper knowledge of the available search engines
and their applicability to specific types of application, because the choice of how to
build a search feature depends entirely on the complexity of the data that the application
will handle.
Comparing the two search implementations, Django Querysets is generally faster and
easier to work with because it is built into Django as its object-relational mapping (ORM)
implementation and has become the standard way of interacting with a database in Django.
Search with Elasticsearch is slower by only a small margin and returns fewer but more
relevant results because it uses the Lucene library to calculate the score of each search
result. Elasticsearch is also considerably harder to implement since it requires an
external package, additional configuration and more code to make use of its rich features.
Further studies are still necessary, especially regarding search performance and
precision when multiple search queries are issued at once, to fully understand the
capability of Elasticsearch and its performance advantages.
REFERENCES
Bendechache, M., Svorobej, S., Endo, P. T., Mario, M. N., Ares, M. E., Byrne, J., & Lynn, T.
Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.
Cailliau, R. (1995). History: W3C. Retrieved from A Little History of the World Wide Web:
https://www.w3.org/History.html
Docker, Inc. (n.d.). Docker Desktop: The fastest way to containerize applications on your
Kalyani, D., & Mehta, D. (2017). Paper on Searching and Indexing Using Elasticsearch.
doi:10.18535/ijecs/v6i6.44
Oppitz, M., & Tomsu, P. (2017). Inventing the Cloud Century: How Cloudiness Keeps
Python Software Foundation. (2020, April 20). timeit — Measure execution time of small
https://docs.python.org/3/library/timeit.html
https://www.heroku.com/about
Thamrin, H., Triyono, A., & Fadlilah, U. (2015). Penggunaan Kamus Sinonim dan Hiponim
sebagai Sumber Ekspansi Kueri dalam Sistem Temu Kembali Informasi Berbahasa
http://hdl.handle.net/11617/5109