A Bangla Text Search Engine Using Pointwise Approach of Learn to Rank (LtR) Algorithm
Abstract—A search engine is the prime source of information in this modern time. While English search engines are reigning over the world, a good number of monolingual and bilingual search engines are also emerging. The Bangla language falls behind in the fast-growing area of information retrieval (IR) in the native language. Currently, Bangladesh has no active search engine of its own, as the first ever Bangla search engine, Pipilika, has not been providing service for a long time. To fill the absence of such a platform, we aim to initiate the development of a Bangla search engine through this work. We have developed a framework for the Bangla search engine that provides search results by performing syntax analysis of a given query. The information retrieval process of Pipilika was carried out using the score generated by a ranking function called BM25, but modern IR systems are excelling with the aid of machine learning. Learn to Rank (LtR) is a class of machine learning techniques that deals with the ranking problem in IR. Search engines employ one of the three approaches of LtR (pointwise, pairwise, listwise) to show the result on the Search Engine Result Page (SERP), and we opted for the pointwise approach. In the pointwise approach of LtR, a numerical score (bid) is assigned to each keyword-document pair in the training set, so our ranking problem can be treated as a regression problem. We achieved our pointwise approach by employing a Random Forest Regressor, which yields an accuracy of 70.14%. The dataset was prepared with the aid of feature extraction following the bid-based ranking approach of Google, where we treat the BM25 score as one of the features. In the bid-based approach, not only does the BM25 score act as one of the ranking factors, but other factors such as the TF-IDF score, keyword density, and total words in a document also contribute to generating the result. Our approach is capable of providing search results based on the ranking score generated by combining all the features.

Keywords—Learn to Rank, Pointwise approach, Search Engine Result Page, BM25, Machine learning, TF-IDF score.
I. Introduction

In this era of information and technology, search engines are considered a boon to mankind, as they facilitate the information retrieval process from the bulk of documents and resources available on the web in a systematic way. A significant number of search engines have evolved over the course of time, where some of them are multilingual, some are bilingual, and the rest are monolingual. Among these web search engines, some soared high by providing the best search results, e.g., Google, Bing, Yahoo, DuckDuckGo, Ask.com. From a comparative analysis, it is observed that Google and DuckDuckGo offer some impressive features, setting the bar very high [1]. However, these commonly used search engines are multilingual ones that support the Bangla language. This paper aims to develop a monolingual Bangla search engine.

Developing a Bangla search engine has a lot of challenges of its own. The ideal sources of information to develop a Bangla search engine are Bangla Wikipedia, Bangla national dailies, local and online newspapers of our country, blog posts, and various government sites. Though the design choices of internet search services have changed over time by adopting different techniques in the intermediary steps, the major operating steps are identical in every search engine [2]: crawling, indexing, and querying. We will develop a Bangla search engine where the crawler initiates the process by crawling some Bangla newspapers to build Bangla corpus data. The indexed corpus data will be stored in a database, and a machine learning approach will be applied to it as soon as a query is requested. The information retrieval process is constantly changing to provide better search results. The number of ranking factors, the prioritization of these factors, and the application of a series of ranking algorithms on these factors keep changing. Ranking each webpage based on outbound links has a great impact on improving results on the SERP [3].

The main focus of this work is to build a prototype of a Bangla search engine and make sure that the platform can provide search results in Bangla, delivering relevant information. Currently there is no existing work on a keyword-based Bangla search engine. Pipilika [4] was the only existing Bangla search engine to which we can compare our work. In Pipilika, documents collected from corpus data have been ranked using the BM25 document ranking method. BM25 (Best Matching) is a term-based document retrieval method, also known as a ranking function, that estimates the relevance of documents to a given search query. It assigns a score to a set of documents based on the query terms, irrespective of their inter-relationship within a document.
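For reference, the standard Okapi BM25 scoring function has the following form (k_1 and b are the usual free parameters; the specific parameter values used in Pipilika are not assumed here):

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of D in words, and avgdl is the average document length in the corpus.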
We aimed to work on this information retrieval process by incorporating an alternative approach. A class of techniques known as Learning to Rank (LtR) uses supervised machine learning to address ranking issues. We applied such a machine learning approach to resolve the matter of ranking on the SERP (Search Engine Result Page). To build a machine learning model, some features were required to be extracted from the dataset of data scraped from the web. The new document-query pair dataset contains some relevant features, and the BM25 score can be treated as one of those features.
The pointwise approach of LtR was adopted to train the model, where the BM25 score along with other features was combined to generate a score to get the final rating. In our proposed approach, the rating can be obtained not only based on one factor (such as BM25 in Pipilika) but also based on other related features which influence the search result. In this way, we can emphasize other relevant factors that contribute to ranking the search result. Later, the ranking quality of our system was compared to other ranking systems by evaluating a ranking metric, which ensures a good enough ranking quality of the proposed system. The main contributions of this paper are:
• To build a web crawler to collect data from Bengali websites.
• To preprocess and store the crawled data in a database after indexing.
• To apply a machine learning approach to the indexed data to obtain the keyword-based search result.
• To develop a web-based graphical user interface to use the search engine.

The rest of the document is outlined as follows: Section II explores the literature review of related works we have studied to develop our idea. Section III illustrates the details of our methodology. Section IV analyzes the performance of the applied algorithms on the chosen dataset. At last, Section V finishes the paper with a summary.

II. Literature Review

In recent times, various techniques for information retrieval in search engines have been proposed. We have gone through some related works, and a brief overview of these works is presented in this section.

In [2], Lycos creator Michael Mauldin gives a quick overview of web search services and explains how search engines like Lycos carry out their functions. According to [2], the way all major web search engines work is the same: a gathering program searches the hyperlinked documents on the web and indexes the gathered information. These documents are accumulated by keeping them in a database or repositories. The retrieval program then compiles a list of links to web documents that correspond to the words, phrases, or concepts in the user query. This process is facilitated by the database combined with a retrieval program. For example, the Lycos search engine comprises the Lycos Catalog of the Internet and the Pursuit retrieval program.

A comparative study in [5] has been done between two leading search engines, Yandex and Google. Yandex dominates the Russian search engine industry with 60.4% of the market share, followed by Google.ru with 26.2%. Internet marketers believe that Yandex is the most widely used search engine in Russia because it produces better results than Google. The comparison between these two search engines is divided into three main tasks: description of web information retrieval, comparison of the PageRank and MatrixNet algorithms, and comparison of the quality of the retrieved results for selected queries.

S. K. Roy et al. [6] describe the design and implementation of an effective NLP-based Bangla interfaced search engine. Applications of natural language processing run computation over big corpora collected from the web. In the information retrieval technique, a probabilistic model tries to estimate the probability of relevance based on the user query. The model assumes that the query and the document representations have an impact on this probability of relevance.

M. R. Islam et al. [7] mention that the BM25 document ranking algorithm was used in Pipilika to score the collected documents from corpus data. This paper proposes a query expansion approach that selects additional words from documents that are highly related to the search term syntagmatically. It searches for keywords using the Word2Vec CBoW model, which has a greater semantic similarity with the original words, and suggests similar terms for each subsequent word. This proposed approach may enable the search engine to better understand user needs and so produce better search results.

A. Sohail et al. [8] mention that the work's primary objective is to conduct research on the most recent SEO strategies for content management-based websites. It discusses the factors that contribute to the ranking of web pages on a search engine result page. Features related to on-page optimization and off-page optimization are clearly defined to show their impact on search results. These features act as search engine ranking factors for further analysis through ranking functions or machine learning approaches.

Many Information Retrieval (IR) problems are inherently ranking problems, and learning-to-rank approaches have the potential to improve many IR technologies. [9] discussed three types of existing learning-to-rank algorithms, the pointwise, pairwise, and listwise techniques, and reviewed these categories. The benefits and drawbacks of each approach are examined, and the connections between these approaches' loss functions and IR evaluation metrics are highlighted. The statistical ranking theory is provided at the end, which can be utilized to assess the query-level generalization properties of various learning-to-rank methods.

M. Kantrowitz et al. [10] investigated how retrieval is affected by stemming performance. Previously reported work on stemming and IR is extended by utilizing a unique, dictionary-based "perfect" stemmer that can be parameterized for various accuracy and coverage levels. As a result, changes in stemming performance can be evaluated for each given data set, and the corresponding changes in IR performance can be assessed. An empirical evaluation of stemming accuracy is provided for three stemming algorithms, including the commonly used Porter method, to put this study in context.
R. Jin et al. [11] are especially interested in developing a better ranking function employing two complementing sources of information: the ranking information provided by the current ranking function and that obtained from user input. Since the base ranker's information is frequently erroneous and the training data's information is frequently noisy, merging the ranking information from the base ranker and the labeled instances is the key challenge. The paper describes an error-tolerant boosting technique for ranking refinement. The empirical analysis demonstrates that the suggested algorithm is effective for ranking refinement and that it also significantly outperforms baseline techniques that include the outputs from the base ranker.

A. Ahmad et al. [12] show the findings and analysis of two separate experiments carried out by employing the model, namely a context-aware spell checker and trending topic detection. It introduces a large-scale Bengali N-gram model trained on an online newspaper corpus. The paper also highlights the challenges of the methods when working with such vast amounts of data. An important feature of the model is that it includes data on N-gram incidence per day over eight years, spanning the years 2009 to 2017.

M. Ibrahim et al. [13] focus on RF-based learn-to-rank algorithms by analyzing their three approaches after applying them on 8 different datasets (TD2004, HP2004, NP2004, Ohsumed, MQ2008, MQ2007, MSLR-WEB10K, Yahoo). The datasets are collected from various web search platforms or recommendation systems. Both the Random Forest Classifier and the Random Forest Regressor were analyzed from various aspects. The ranking quality metrics such as NDCG@10, MAP, and ERR were measured for each approach with the aid of the mentioned datasets.

After studying the related works, we got the idea of which ranking factors to consider, and we combined them by bid-based ranking [14]. Bid-based ranking is a popular approach of Google to combine the features to get the final score, and to the best of the authors' knowledge, none of the aforementioned papers went for such an approach. Though many established search engines use the LtR approach of machine learning for ranking, papers related to previous work on Bangla search engines describe a heuristic function (BM25). We learned about the three approaches of LtR and carried out the ranking process by the pointwise approach of LtR, in keeping with modern ranking systems, to achieve the final ranking.

III. Methodology

The proposed methodology for a Bangla search engine can be viewed from two ends (a frontend part and a backend part), and both ends are connected through the database. The front end is initiated by the user by receiving a query, then preprocessing the query and converting it to tokens, and passing them to the database. On the other hand, the backend comprises several processes such as crawling through web pages, scraping their data, preprocessing the extracted data, getting tokens out of it, and vectorizing the tokens and storing them through indexing.

Fig. 1. Overall workflow for the Bangla search engine

The following sub-sections go into greater detail of our workflow mentioned in Fig. 1.

A. Crawling

Web crawling is an automated process of visiting websites and downloading their information. It assists in collecting information about the websites and the links associated with them, as well as evaluating the HTML code and linkages. A web crawler is also known as a spider or bot. A set of seed URLs denotes the starting point of the crawling process. Depending on what seeds we choose, we should be able to visit almost every possible Bangla website we aimed to crawl.
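As a minimal sketch of this step (the requests and BeautifulSoup libraries and the seed list are assumptions made for illustration only), a breadth-first crawler over seed URLs could look like:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical seed URL; the actual Bangla news sites to be crawled are configured separately.
SEEDS = ["https://example-bangla-news.com/"]

def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from seed URLs, returning raw HTML per visited URL."""
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = html
        # Enqueue outgoing links found on the page.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```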
B. Indexing

Indexing refers to a function that looks through the visible text portion of the webpage and organizes the contents residing inside the CSS and HTML tags. An index is stored in order to improve the efficiency and effectiveness of identifying relevant documents for a search query. Without an index, the search engine would have to spend a lot of time and resources scanning every document in the corpus. To facilitate fast information retrieval, the indexer preprocesses the data before storing it in the database.

1) Data Extraction and Preprocessing: A web page is retrieved by crawling and data is extracted by scraping.
The data on a page can be copied into a spreadsheet or imported into a database after it has been cleaned, processed, and reformatted. To obtain the contents of a web page, all the HTML and CSS tags are removed in the first place. Then the URL and body of the webpage are stored in a CSV file.

After the removal of the HTML and CSS tags, the following steps are performed to preprocess the extracted Bangla data for further analysis (a minimal sketch of this pipeline follows the list):
• Remove the punctuation from the text.
• Break the raw text into words through tokenization.
• Remove the Bangla stopwords that play no major role in searching.
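A minimal sketch of these three steps, assuming whitespace tokenization and an abbreviated, illustrative stopword list (a full Bangla stopword list would normally be loaded from a file):

```python
import re

# Hypothetical, abbreviated stopword list for illustration only.
BANGLA_STOPWORDS = {"এবং", "ও", "কি", "না", "এই", "যে"}

# Punctuation covers common ASCII marks plus the Bangla danda (।) sentence terminator.
PUNCTUATION = re.compile(r"[।,;:!?\"'()\[\]{}.]")

def preprocess(text):
    """Remove punctuation, tokenize on whitespace, and drop Bangla stopwords."""
    text = PUNCTUATION.sub(" ", text)
    tokens = text.split()
    return [t for t in tokens if t not in BANGLA_STOPWORDS]
```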
2) Vectorization: In the vectorization process, words or phrases are mapped to corresponding vectors or numerical values, which facilitates finding word similarity and making word predictions. Most of the popular search engines use TF-IDF as a vectorization method [15]. We have also used this vectorization technique to create an index data structure called a term-document matrix and push it into the database. A term-document matrix is a two-dimensional sparse matrix where the rows correspond to the terms and the columns refer to the documents.
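A minimal sketch of building such a matrix with scikit-learn's TfidfVectorizer, reusing the preprocess function above (passing it as the analyzer is an illustrative choice, not a prescribed configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_term_document_matrix(documents):
    """Return (terms, sparse TF-IDF matrix) with rows = terms and columns = documents."""
    vectorizer = TfidfVectorizer(analyzer=preprocess)  # token pipeline defined earlier
    doc_term = vectorizer.fit_transform(documents)     # shape: (n_documents, n_terms)
    term_doc = doc_term.T                              # transpose so rows = terms, columns = documents
    return vectorizer.get_feature_names_out(), term_doc
```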
C. Storing in the Database

While working with an enormous amount of data, the indexed data will be stored in a database. The term-document matrix will be fed to a relational database, MySQL. As we have emphasized the methodology of our work rather than the volume of data, MySQL will be sufficient to store the indexed data for further implementation.
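As an illustrative sketch (the table schema and connection settings below are assumptions, not a prescribed design), the non-zero entries of the term-document matrix could be pushed to MySQL with mysql-connector-python roughly as follows:

```python
import mysql.connector

def store_matrix(terms, term_doc, doc_ids):
    """Store non-zero term-document TF-IDF entries in a MySQL table (assumed schema)."""
    conn = mysql.connector.connect(host="localhost", user="root",
                                   password="secret", database="bangla_search")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS term_document (
                       term VARCHAR(255), doc_id INT, tfidf DOUBLE)""")
    coo = term_doc.tocoo()  # iterate only over the non-zero entries of the sparse matrix
    rows = [(terms[i], doc_ids[j], float(v)) for i, j, v in zip(coo.row, coo.col, coo.data)]
    cur.executemany("INSERT INTO term_document (term, doc_id, tfidf) VALUES (%s, %s, %s)", rows)
    conn.commit()
    conn.close()
```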
D. Query on Search Engine Interface

A search engine interface is to be introduced through which the user will interact with the system. The user initiates the search process in the front end by writing the keyword to be queried in the search bar.

E. Query Parsing

Query parsing follows the same steps as data preprocessing. As soon as the user enters the query, the parser performs the following steps (a sketch of the parsing and matching step follows the list):
• All the punctuation in the query is removed. When punctuation is removed, the words "data" and "data!!!", for instance, are regarded as the same.
• Tokenization is the main part of query parsing, as each token is individually matched to get the corresponding entry of the term-document matrix located in the database.
• Stop words are removed, as they have too little impact on generating the search result.
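A minimal sketch of parsing a query with the same preprocess pipeline and fetching matching rows from the term_document table assumed above:

```python
def match_query(query, conn):
    """Parse the query and fetch (doc_id, tfidf) rows for each surviving token."""
    tokens = preprocess(query)  # same punctuation/tokenization/stopword pipeline as indexing
    cur = conn.cursor()
    matches = {}
    for token in tokens:
        cur.execute("SELECT doc_id, tfidf FROM term_document WHERE term = %s", (token,))
        matches[token] = cur.fetchall()  # empty list if the term is not indexed
    return matches
```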
F. Feature Extraction and Data Preparation

The accuracy and speed of the learning algorithm can be improved by feature extraction as a form of data preprocessing. The technique of turning raw data into numerical features that can be handled while keeping the information in the original data set is known as feature extraction. Compared to using machine learning on the raw data directly, it produces better outcomes. The preprocessed data is used for feature extraction, and the chosen features are document id, keyword, TF-IDF score, keyword density, document length, BM25 score, and Bid (total rating). Here the Bid is calculated by combining the other features in a feature vector to obtain the total rating.
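As a sketch of how one such row could be assembled (the multiplicative combination follows the description in Section IV; the field names are illustrative):

```python
def make_feature_row(doc_id, keyword, tfidf, keyword_density, doc_length, bm25):
    """Build one keyword-document training row; the Bid is the product of the feature scores."""
    bid = tfidf * keyword_density * doc_length * bm25
    return {"doc_id": doc_id, "keyword": keyword, "tfidf": tfidf,
            "keyword_density": keyword_density, "doc_length": doc_length,
            "bm25": bm25, "bid": bid}
```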
G. Applying Ranking Algorithm

The tokenized keywords are ready to be matched with the indexed data stored in the database, which is the initial step for obtaining the search result. The keyword-document pairs whose entry contains a non-zero TF-IDF score are fetched to feed as input to a machine learning model. Several factors corresponding to the keywords need to be taken into consideration for further processing. The fetched keyword-document pairs are initially unordered and need to be ranked by using a ranking algorithm. To deal with the ranking issue, a branch of machine learning, LtR, comes into play. There are three approaches to LtR, and we have selected the pointwise method, which can be implemented by any regression algorithm as soon as the dataset containing the feature vector is prepared. Finally, the machine learning algorithm is applied to obtain the final rating that yields the search result and ranks the web pages on the SERP.

Fig. 2. Building ML model and predicting result

IV. Result Analysis

This section contains the outcomes and the analysis of the result.
We started with the aim of building a Bangla search engine that can manipulate Bangla data and perform a query to provide the search results. From this perspective, we generated our search result by the use of an LtR-based machine learning approach. To deal with the ranking of search results, an LtR-based IR system follows a two-stage procedure. In the first stage, k documents are retrieved from the database by using the value obtained from the base rankers (they can be TF-IDF, BM25, count vectors, etc.). We used the TF-IDF vectorizer to retrieve the required documents and applied the LtR-based ML model to get the final score, which finalizes the order of those documents on the SERP.
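A minimal sketch of this two-stage procedure, assuming a trained regressor (see the next subsection) and feature rows shaped like the make_feature_row dictionaries above:

```python
def rank_documents(candidate_rows, model, k=10):
    """Stage 1: keep the top-k candidates by TF-IDF; stage 2: re-rank them by the predicted bid."""
    top_k = sorted(candidate_rows, key=lambda r: r["tfidf"], reverse=True)[:k]
    features = [[r["tfidf"], r["keyword_density"], r["doc_length"], r["bm25"]] for r in top_k]
    scores = model.predict(features)  # pointwise LtR: one predicted score per document
    reranked = sorted(zip(top_k, scores), key=lambda pair: pair[1], reverse=True)
    return [row["doc_id"] for row, _ in reranked]
```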
A. Output of LtR-based ML Models

Not all the extracted features will be fed to the machine learning model. In the Learn to Rank approach, a model is trained by learning the corresponding rating of a keyword-document pair. So, to build the model we only take the three necessary columns, which are the document id, the keyword, and the bid. The bid for each row is calculated by multiplying all the scores in the feature vector. We applied the Decision Tree Regressor, Random Forest Regressor, and K Nearest Neighbour Regressor and observed which one of them performs the best.
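A minimal scikit-learn sketch of this comparison, assuming the keyword-document features are already encoded numerically and that X and y come from the prepared dataset (the split ratio and hyperparameters are illustrative defaults):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

def compare_regressors(X, y):
    """Train the three pointwise LtR regressors on (features, bid) and report R^2 on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    models = {
        "decision_tree": DecisionTreeRegressor(random_state=42),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "knn": KNeighborsRegressor(n_neighbors=5),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # coefficient of determination on the test split
    return scores
```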
If we rank the retrieved documents based on the BM25 score, by sorting that score in descending order we can see the result in the following sequence: Doc 5 (1st), Doc 9 (2nd), Doc 12 (3rd). But if we focus on the bid according to our approach, the sequence is a bit different: by sorting the bid in descending order, we can see the result in the following sequence: Doc 9 (1st), Doc 5 (2nd), Doc 12 (3rd). The third position is ranked in the same manner in both cases, but in bid-based ranking Doc 9 is given a higher rank in spite of having a lower BM25 score than Doc 5. If we take a closer look at the other features, we can see that Doc 9 has relatively more words in it than Doc 5 (according to the feature WordsinDoc). In this way, we can say that the machine learning model trained on a dataset prepared based on the bid considers more features, and it will yield more relevant results than following traditional heuristic functions such as the TF-IDF score or the BM25 score.

C. Stepwise Result on SERP

We investigate what happens in the backend to yield the search result and explain it with the help of the output in the terminal.