
A Bangla Text Search Engine Using Pointwise Approach of Learn to Rank (LtR) Algorithm

Abstract—A search engine is the prime source of information in this modern time. While English search engines reign over the world, a good number of monolingual and bilingual search engines are also emerging. The Bangla language falls behind in the fast-growing area of information retrieval (IR) in the native language. Currently, Bangladesh has no active search engine of its own, as the first ever Bangla search engine, Pipilika, has not been providing service for a long time. To fill the absence of such a platform, we aim to initiate the development of a Bangla search engine through this work. We have developed a framework for a Bangla search engine that provides search results by performing syntax analysis of a given query. The information retrieval process of Pipilika was carried out using the score generated by a ranking function called BM25, but modern IR systems are advancing with the aid of machine learning. Learn to Rank (LtR) is a class of machine learning techniques for the ranking problem in IR. Search engines employ one of the three approaches of LtR (pointwise, pairwise, listwise) to show the result on the Search Engine Result Page (SERP), and we opted for the pointwise approach. In the pointwise approach of LtR, a numerical score (bid) is assigned to each keyword-document pair in the training set; the ranking problem can then be treated as a regression problem. We realized our pointwise approach by employing a Random Forest Regressor, which yields an accuracy of 70.14%. The dataset was prepared with the aid of feature extraction following the bid-based ranking approach of Google, where we treat the BM25 score as one of the features. In the bid-based approach, not only does the BM25 score act as one of the ranking factors, but other ranking factors such as the TF-IDF score, keyword density, and total words in a document also contribute to generating the result. Our approach is capable of providing search results based on the ranking score generated by combining all the features.

Keywords—Learn to Rank, Pointwise approach, Search Engine Result Page, BM25, Machine learning, TF-IDF score.

I. Introduction

In this era of information and technology, search engines are considered a boon to mankind, as they facilitate the information retrieval process from the bulk of documents and resources available on the web in a systematic way. A significant number of search engines have evolved over the course of time: some of them are multilingual, some are bilingual, and the rest are monolingual. Among these web search engines, some soared high by providing the best search results, e.g., Google, Bing, Yahoo, DuckDuckGo, Ask.com. From a comparative analysis, it is observed that Google and DuckDuckGo offer some impressive features, setting the bar high [1]. However, these commonly used search engines are multilingual engines that merely support the Bangla language among many others. This paper aims to develop a monolingual Bangla search engine.

Developing a Bangla search engine has many challenges of its own. The ideal sources of information for developing a Bangla search engine are Bangla Wikipedia, Bangla national dailies, local and online newspapers of our country, blog posts, and various government sites. Though the design choices of internet search services have changed over time by adopting different techniques in the intermediary steps, the major operating steps are identical in every search engine [2]. They are: crawling, indexing, and querying. We will develop a Bangla search engine where the crawler initiates the process by crawling some Bangla newspapers to build Bangla corpus data. The indexed corpus data will be stored in a database, and a machine learning approach will be applied to them as soon as a query is requested. The information retrieval process is constantly changing to provide better search results. The number of ranking factors, the prioritization of these factors, and the application of a series of ranking algorithms on these factors keep changing. Ranking each webpage based on outbound links has a great impact on improving results on the SERP [3].

The main focus of this work is to build a prototype of a Bangla search engine and make sure that the platform can provide search results in Bangla, delivering relevant information. Currently, there is no existing work on a keyword-based Bangla search engine. Pipilika [4] was the only existing Bangla search engine to which we can compare our work. In Pipilika, the documents collected from corpus data have been ranked using the BM25 document ranking method. BM25 (Best Matching) is a term-based document retrieval method, also known as a ranking function, that estimates the relevance of documents to a given search query. It assigns a score to a set of documents based on the query terms, irrespective of their inter-relationship within a document.
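As a concrete illustration of the scoring step described above, the following is a minimal sketch of the standard Okapi BM25 formula (with the usual k1 and b parameters); the tiny corpus, the tokenization, and the parameter values are illustrative assumptions, not the exact setup used in this work:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            tf = d.count(t)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["bangla", "search", "engine"],
        ["search", "engine", "ranking", "function"],
        ["bangla", "newspaper", "corpus"]]
print(bm25_scores(["bangla", "search"], docs))
```

Note that each query term contributes independently, which is exactly the "irrespective of their inter-relationship" property mentioned above.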
We aimed to work on this information retrieval process by incorporating an alternative approach. A class of techniques known as Learning to Rank (LtR) uses supervised machine learning to address ranking issues. We applied such a machine learning approach to resolve the matter of ranking on the SERP (Search Engine Result Page). To build a machine learning model, some features were required to be extracted from the dataset of data scraped from the web. The new document-query pair dataset contains some relevant features, and the BM25 score can be treated as one of those features. The pointwise approach of LtR was adopted to train the model, where the BM25 score along with the other features was combined to generate a score and obtain the final rating. In our proposed approach, the rating is obtained not only from one factor (such as BM25 in Pipilika) but also from other related features that influence the search result. In this way, we can emphasize other relative factors that contribute to ranking the search result. Later, the ranking quality of our system was compared to other ranking systems by evaluating a ranking metric, which confirms a good enough ranking quality for the proposed system. The main contributions of this paper are:

• To build a web crawler to collect data from Bengali websites.
• To preprocess and store the crawled data in a database after indexing.
• To apply a machine learning approach to the indexed data to obtain the keyword-based search result.
• To develop a web-based graphical user interface to use the search engine.

The rest of the document is outlined as follows: Section II explores the literature review of related works we have studied to develop our idea. Section III illustrates the details of our methodology. Section IV analyzes the performance of the applied algorithms on the chosen dataset. At last, Section V finishes the paper with a summary.

II. Literature Review

In recent times, various techniques for information retrieval in search engines have been proposed. We have gone through some related works, and a brief overview of these works is presented in this section.

In [2], Lycos creator Michael Mauldin gives a quick overview of web search services and explains how search engines like Lycos carry out their functions. According to [2], the way all major web search engines work is the same: a gathering program searches the hyperlinked documents on the web and indexes the gathered information. These documents are accumulated by keeping them in a database or repositories. The retrieval program then compiles a list of links to web documents that correspond to the words, phrases, or concepts in the user query. This process is facilitated by the database combined with a retrieval program. For example, the Lycos search engine comprises the Lycos Catalog of the Internet and the Pursuit retrieval program.

A comparative study in [5] has been done between two leading search engines, Yandex and Google. Yandex dominates the Russian search engine industry with 60.4% of the market share, followed by Google.ru with 26.2%. Internet marketers believe that Yandex is the most widely used search engine in Russia because it produces better results than Google. The comparison between these two search engines is divided into three main tasks: description of web information retrieval, comparison of the PageRank and MatrixNet algorithms, and comparison of the quality of the retrieved results for selected queries.

S. K. Roy et al. [6] describe the design and implementation of an effective NLP-based Bangla-interfaced search engine. Applications of natural language processing run computation over big corpora collected from the web. In the information retrieval technique, a probabilistic model tries to estimate the probability of relevance based on the user query. The model assumes that the query and the document representations have an impact on this probability of relevance.

M. R. Islam et al. [7] mention that the BM25 document ranking algorithm was used in Pipilika to score the documents collected from corpus data. The paper proposes a query expansion approach that selects additional words from documents that are syntagmatically highly related to the search term. It searches for keywords using the Word2Vec CBoW model, which has a greater semantic similarity with the original words, and suggests similar terms for each subsequent word. This proposed approach may enable the search engine to better understand user needs and so produce better search results.

A. Sohail et al. [8] state that the work's primary objective is to conduct research on the most recent SEO strategies for content-management-based websites. It discusses the factors that contribute to the ranking of web pages on a search engine result page. Features related to on-page optimization and off-page optimization are clearly defined to show their impact on search results. These features act as search engine ranking factors for further analysis through ranking functions or machine learning approaches.

Many Information Retrieval (IR) problems are inherently ranking problems, and learning-to-rank approaches have the potential to improve many IR technologies. [9] discussed and reviewed three types of existing learning-to-rank algorithms: the pointwise, pairwise, and listwise techniques. The benefits and drawbacks of each approach are examined, and the connections between these approaches' loss functions and IR evaluation metrics are highlighted. The statistical ranking theory is provided at the end, which can be utilized to assess the query-level generalization properties of various learning-to-rank methods.

M. Kantrowitz et al. [10] investigated how retrieval is affected by stemming performance. Previously reported work on stemming and IR is extended by utilizing a unique, dictionary-based "perfect" stemmer that can be parameterized for various accuracy and coverage levels. As a result, changes in stemming performance are evaluated for each given data set, as are the corresponding changes in IR performance. An empirical evaluation of stemming accuracy is provided for three stemming algorithms, including the commonly used Porter method, to put this study in context.
R. Jin et al. [11] are especially interested in developing a better ranking function employing two complementary sources of information: the ranking information provided by the current ranking function and that obtained from user input. Since the base ranker's information is frequently erroneous and the training data's information is frequently noisy, merging the ranking information from the base ranker and the labeled instances is the key challenge. The paper describes an error-tolerant boosting technique for ranking refinement. The empirical analysis demonstrates that the suggested algorithm is effective for ranking refinement and that it also significantly outperforms baseline techniques that include the outputs from the base ranker.

A. Ahmad et al. [12] show the findings and analysis of two separate experiments carried out by employing the model, namely a context-aware spell checker and trending topic detection. It introduces a large-scale Bengali N-gram model, trained on an online newspaper corpus. The paper also highlights the challenges of the methods when working with such vast amounts of data. An important feature of the model is that it includes data on N-gram incidence per day over eight years, spanning 2009 to 2017.

M. Ibrahim [13] focuses on RF-based learn-to-rank algorithms by analyzing their three approaches after applying them to 8 different datasets (TD2004, HP2004, NP2004, Ohsumed, MQ2008, MQ2007, MSLR-WEB10K, Yahoo). The datasets are collected from various web search platforms or recommendation systems. Both the Random Forest Classifier and the Random Forest Regressor were analyzed from various aspects. The ranking quality metrics NDCG@10, MAP, and ERR were measured for each approach with the aid of the mentioned datasets.

After studying the related works, we got the idea of which ranking factors to consider, and we combined them by bid-based ranking [14]. Bid-based ranking is a popular approach of Google to combine the features to get the final score, and to the best of the authors' knowledge, none of the aforementioned papers went for such an approach. Though many established search engines use the LtR approach of machine learning for ranking, papers related to the previous work on Bangla search engines describe a heuristic function (BM25). We learned about the three approaches of LtR and carried out the ranking process with the pointwise approach of LtR to keep up with modern ranking systems.

III. Methodology

The proposed methodology for a Bangla search engine can be viewed from two ends (a frontend part and a backend part), and both ends are connected through the database. The front end is initiated by the user: it receives a query, preprocesses the query and converts it to tokens, and passes the tokens to the database. On the other hand, the backend comprises several processes such as crawling through web pages, scraping their data, preprocessing the extracted data, getting tokens out of it, and vectorizing the tokens and storing them through indexing.

Fig. 1. Overall workflow for the Bangla search engine

The following sub-sections go into greater detail on our workflow mentioned in Fig. 1.

A. Crawling

Web crawling is an automated process of visiting websites and downloading their information. It assists in collecting information about the websites and the links associated with them, as well as evaluating the HTML code and linkages. A web crawler is also known as a spider or bot. A set of seed URLs denotes the starting point of the crawling process. Depending on what seeds we choose, we should be able to visit almost every possible Bangla website we aimed to crawl.

B. Indexing

Indexing refers to a function that looks through the visible text portion of the webpage and organizes the contents residing inside the CSS and HTML tags. An index is stored in order to improve the efficiency and effectiveness of identifying relevant documents for a search query. Without an index, the search engine would have to spend a lot of time and resources scanning every document in the corpus. To facilitate fast information retrieval, the indexer preprocesses the data before storing it in the database.
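To make the role of the index concrete, here is a minimal sketch of an inverted index of the kind described above; the tiny corpus and its tokenization are illustrative assumptions:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = [["bangla", "search", "engine"],
        ["bangla", "newspaper"],
        ["search", "engine", "ranking"]]
index = build_inverted_index(docs)
print(index["search"])  # ids of the documents that can answer a "search" query
```

With such an index in place, answering a single-term query becomes a dictionary lookup instead of a scan over every document in the corpus.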
1) Data Extraction and Preprocessing: A web page is retrieved by crawling, and its data is extracted by scraping. The data on a page can be copied into a spreadsheet or imported into a database after it has been cleaned, processed, and reformatted. To obtain the contents of a web page, all the HTML and CSS tags are removed in the first place. Then the URL and body of the webpage are stored in a CSV file.

After the removal of the HTML and CSS tags, the following steps are performed to preprocess the Bangla data for further analysis:

• Remove the punctuation from the text.
• Break the raw text into words through tokenization.
• Remove the Bangla stopwords that play no major role in searching.

2) Vectorization: In the vectorization process, words or phrases are mapped to corresponding vectors or numerical values, which facilitates finding word similarity and word predictions. Most of the popular search engines use TF-IDF as a vectorization method [15]. We have also used this vectorization technique to create an index data structure called a term-document matrix and push it into the database. A term-document matrix is a two-dimensional sparse matrix where the rows correspond to the terms and the columns refer to the documents.

C. Storing in the Database

While working with an enormous amount of data, the indexed data will be stored in a database. The term-document matrix will be fed to a relational database, MySQL. As we have emphasized the methodology of our work rather than the volume of data, MySQL will be adequate to store the indexed data for further implementation.

D. Query on Search Engine Interface

A search engine interface is to be introduced through which the user will interact with the system. The user initiates the search process in the front end by writing the keyword to be queried in the search bar.

E. Query Parsing

Query parsing follows the same steps as data preprocessing. As soon as the user enters the query, the parser performs the following steps:

• All the punctuation in the query is removed. When punctuation is removed, the words data and data!!!, for instance, are regarded similarly.
• Tokenization is the main part of query parsing, as all the tokens are individually matched to get the entries of the term-document matrix located in the database.
• Stop words are removed, as they have too little impact on generating the search result.

F. Feature Extraction and Data Preparation

The accuracy and speed of the learning algorithm can be improved by feature extraction as a form of data preprocessing. The technique of turning raw data into numerical features that can be handled while keeping the information in the original data set is known as feature extraction. Compared to using machine learning on the raw data directly, it produces better outcomes. The preprocessed data is used for feature extraction, and the opted features are document id, keyword, TF-IDF score, keyword density, document length, BM25 score, and Bid (total rating). Here, the Bid is calculated by combining the other features in a feature vector to obtain the total rating.

G. Applying Ranking Algorithm

The tokenized keywords are ready to be matched with the indexed data stored in the database, which is the initial step for obtaining the search result. The keyword-document pairs whose entry contains a non-zero TF-IDF score are fetched and fed as input to a machine learning model. Several factors corresponding to the keywords need to be taken into consideration for further processing. The fetched keyword-document pairs are initially unordered and need to be ranked by using a ranking algorithm. To deal with the ranking issues, a branch of machine learning, LtR, comes into play. There are three approaches to LtR, and we have selected the pointwise method, which can be implemented by any regression algorithm as soon as the dataset containing the feature vector is prepared. Finally, the machine learning algorithm will be applied to obtain the final rating that yields the search result and ranks the web pages on the SERP.

Fig. 2. Building ML model and predicting result
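The pointwise recipe above — learn a regression from a feature vector to a single relevance score, then sort candidates by the predicted score — can be sketched as follows. The system trains a Random Forest Regressor from a standard library; to keep this sketch dependency-free, a deliberately simple k-nearest-neighbour regressor (one of the alternatives also evaluated later) stands in for it, and the feature vectors, bids, and document names are made-up values:

```python
def knn_predict_bid(train, query_features, k=2):
    """Pointwise LtR: regress a bid from a feature vector.
    train: list of (feature_vector, bid) pairs, where a feature vector
    is e.g. (tf_idf, keyword_density, bm25) for a keyword-document pair."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda fb: dist(fb[0], query_features))[:k]
    return sum(bid for _, bid in nearest) / k

# toy training set: (tf_idf, keyword_density, bm25) -> bid
train = [((0.9, 0.10, 2.1), 0.8),
         ((0.8, 0.09, 1.9), 0.7),
         ((0.1, 0.01, 0.2), 0.1),
         ((0.2, 0.02, 0.3), 0.2)]

# unordered keyword-document pairs fetched from the index
candidates = {"doc5": (0.85, 0.09, 2.0), "doc12": (0.15, 0.01, 0.25)}
ranked = sorted(candidates,
                key=lambda d: knn_predict_bid(train, candidates[d]),
                reverse=True)
print(ranked)  # → ['doc5', 'doc12']
```

Any regressor with a fit/predict interface can be dropped into the same pipeline, which is why the pointwise approach reduces ranking to an ordinary regression problem.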
IV. Result Analysis

This section contains the outcomes and the analysis of the results. We started with the aim of building a Bangla search engine that can manipulate Bangla data and perform a query to provide the search results. From this perspective, we generated our search result by the use of an LtR-based machine learning approach. To deal with the ranking of search results, an LtR-based IR system follows a two-stage procedure. In the first stage, k documents are retrieved from the database by using the value obtained from the base rankers (they can be TF-IDF, BM25, Countvector, etc.). We used the TF-IDF vectorizer to retrieve the required documents and applied the LtR-based ML model to get the final score and finalize the order of those documents on the SERP.

A. Output of LtR-based ML Models

Not all the extracted features will be fed to the machine learning model. In the Learn to Rank approach, a model is trained by learning the corresponding rating of a keyword-document pair. So, to build the model, we only take the three necessary columns: the document id, keyword, and bid. The bid for each row is calculated by multiplying all the scores in a feature vector. We applied the Decision Tree Regressor, Random Forest Regressor, and K Nearest Neighbours Regressor and observed which one of them performs the best.

TABLE I. Outcome of the Machine Learning Models

Applied Algorithm                 Accuracy
Decision Tree Regressor           57.19%
Random Forest Regressor           70.14%
K Nearest Neighbours Regressor    54.79%

Table I shows the achieved accuracy of these machine learning models. We can observe that the Random Forest Regressor yields 70.14% accuracy, which is the best result among all. For pointwise ranking, the Random Forest Regressor is one of the most used and popular algorithms [13]. In LtR it is also known as the RF-based pointwise LtR algorithm. There are some other existing works using the RF-based pointwise approach, and we have compared our work with them in the comparison section.

B. Analysis of Bid-based Ranking

An example from our dataset is provided to shed more light on this matter. In Figure 3 we can see three responses to the query. If we follow the traditional ranking method that Pipilika followed, the ranking will take place only based on the BM25 score.

Fig. 3. Analysis of bid-based ranking

By sorting the BM25 score in descending order, we can see the result in the following sequence: Doc 5 (1st), Doc 9 (2nd), Doc 12 (3rd). But if we focus on the bid according to our approach, the sequence is a bit different. By sorting the bid in descending order, we can see the result in the following sequence: Doc 9 (1st), Doc 5 (2nd), Doc 12 (3rd). The third position is ranked in the same manner in both cases, but in bid-based ranking Doc 9 is given a higher rank in spite of having a lower BM25 score than Doc 5. If we take a closer look at the other features, we can see that Doc 9 has relatively more words in it than Doc 5 (according to the feature WordsinDoc). In such a way, we can say that the machine learning model trained on a dataset prepared based on the bid considers more features, and it will yield more relevant results than following traditional heuristic functions such as the TF-IDF score or BM25 score.

C. Stepwise Result on SERP

We investigate what happens in the backend to yield the search result and explain it with the help of the output in the terminal.

Fig. 4. Result of each step during information retrieval

The queried keyword will be preprocessed and sent to the database in the form of a token to find the match. The retrieved keyword-document pairs will act as a list of inputs to the machine learning model. The model will predict the bid for each keyword-document pair. Then the keyword-document pairs are associated with their corresponding bid, and DocID, Keyword, and Bid lists are created for each pair. Now the lists are sorted with respect to their bids in descending order.
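The sorting and de-duplication performed at this stage can be sketched as follows; the document ids and bid values are made-up:

```python
def final_document_list(doc_ids, bids):
    """Sort documents by predicted bid (descending), then drop
    duplicates from multi-token queries, keeping relative order."""
    ranked = sorted(zip(doc_ids, bids), key=lambda p: p[1], reverse=True)
    seen, final = set(), []
    for doc_id, _ in ranked:
        if doc_id not in seen:  # keep only the first (highest-bid) occurrence
            seen.add(doc_id)
            final.append(doc_id)
    return final

# two query tokens both matched document 9, so it appears twice
doc_ids = [5, 9, 12, 9]
bids = [0.61, 0.74, 0.20, 0.55]
print(final_document_list(doc_ids, bids))  # → [9, 5, 12]
```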
By then the documents also get sorted automatically in this step. These sorted documents are gathered into the final document list. For multiple tokens, the final document list may contain the same document more than once. In that case, the redundant documents are removed, but the relative ordering of the documents is preserved.

D. Comparing Ranking Quality with Other Existing Systems

The evaluation of a ranking system cannot be measured by the accuracy of the machine learning model alone, because LtR solves the ranking problem by providing a list of documents ordered according to their total score. The idea of LtR is to come up with the relative ordering of the documents, as LtR does not care about predicting the exact score for each document. So, ranking systems built with the idea of LtR need to be evaluated by examining their ranking quality. One of the most used metrics to measure ranking quality is the NDCG value. The NDCG@10 value compares the predicted rank and the actual rank of the top 10 documents for a particular query. We have calculated the NDCG@10 value of our system for a set of queries, where the obtained average value is 0.7144.

TABLE II. Comparison Among Ranking Systems

Name of the Dataset   Application          Query-doc pairs   NDCG@10
TD2004                Topic Distillation   75000             0.3509
HP2004                Homepage Finding     75000             0.7578
NP2004                Namepage Finding     75000             0.7624
Ohsumed               Medical Docs         16000             0.4168
MQ2008                Web Search           15211             0.2227
Yahoo                 Web Search           709877            0.7554
MSLR-WEB10K           Web Search           1200192           0.4512
MQ2007                Web Search           69623             0.4345
Our Dataset           Web Search           246266            0.7144

The author in [13] elaborately discusses RF-based pointwise learn-to-rank algorithms, where both RF classification and RF regression were applied and analyzed from different points of view. For the LtR pointwise approach, RF classification and RF regression were applied on 8 datasets. Table II shows the comparative analysis of the ranking quality of different ranking systems, where the NDCG@10 values show to what extent the systems can deliver a relevant result. Our system yields an NDCG@10 value of 0.7144, which can be regarded as a nearly standard value. After the analysis, we can infer that the ranking quality of our keyword-based search system is adequate.
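As a reference for how the metric above is computed, this is a minimal sketch of NDCG@10 from its standard definition (DCG with a logarithmic position discount, normalized by the DCG of the ideal ordering); the graded relevance labels are made-up values:

```python
import math

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of the returned documents,
    listed in the order the system ranked them."""
    dcg = sum(rel / math.log2(i + 2)  # positions are 1-based: log2(pos + 1)
              for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# a ranking that places a mildly relevant document above a better one
print(round(ndcg_at_k([3, 1, 2, 0]), 4))  # → 0.9725
```

A perfect ordering scores exactly 1.0, so per-query NDCG@10 values can be averaged across a query set, as is done for the 0.7144 figure above.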
V. Conclusion

Building the framework of a Bangla search engine was a hard task to pull off with the limitations of hardware and resources, but we developed a framework that can provide search results through syntax analysis of data. By employing a pointwise ranking algorithm in our information retrieval phase, we added a new dimension to Bangla search engine characteristics. To obtain the desired form of the dataset, we followed bid-based ranking, which is quite a popular approach of the leading search engine Google. We built our machine learning model using the Random Forest Regressor and further checked the ranking quality by calculating the NDCG@10 value. After comparing the NDCG@10 value of our system with other existing systems, we can conclude that the new Bangla search engine provides a good ranking quality.

References

[1] V. S. Parsania, F. Kalyani, and K. Kamani, "A comparative analysis: Duckduckgo vs. google search engine," GRD Journals - Global Research and Development Journal for Engineering, vol. 2, no. 1, pp. 12–17, 2016.
[2] G. R. Notess, "Searching the world-wide web: Lycos, webcrawler and more," Online, vol. 19, no. 4, pp. 48–52, 1995.
[3] P. Patel, "Research of page ranking algorithm on search engine using damping factor," IJAERD, February 2014.
[4] T. from Shahjalal University of Science Technology (SUST). (2013). [Online]. Available: https://pipilika.com/
[5] A. Paananen, "Comparative analysis of yandex and google search engines," 2012.
[6] S. K. Roy, K. I. Khalilullah, M. I. A. Khan, and M. Hasan, "Design and implementation of an efficient search engine with bangla interface using natural language processing (nlp)."
[7] M. R. Islam, J. Rahman, M. R. Talha, and F. Chowdhury, "Query expansion for bangla search engine pipilika," in 2020 IEEE Region 10 Symposium (TENSYMP). IEEE, 2020, pp. 1367–1370.
[8] A. Sohail et al., "Search engine optimization methods & search engine indexing for cms applications," 2012.
[9] T.-Y. Liu et al., "Learning to rank for information retrieval," Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
[10] M. Kantrowitz, B. Mohit, and V. Mittal, "Stemming and its effects on tfidf ranking," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 357–359.
[11] R. Jin, H. Valizadegan, and H. Li, "Ranking refinement and its application to information retrieval," in Proceedings of the 17th International Conference on World Wide Web, 2008, pp. 397–406.
[12] A. Ahmad, M. R. Talha, M. R. Amin, and F. Chowdhury, "Pipilika n-gram viewer: an efficient large scale n-gram model for bengali," in 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, 2018, pp. 1–5.
[13] M. Ibrahim, "An empirical comparison of random forest-based and other learning-to-rank algorithms," Pattern Analysis and Applications, vol. 23, no. 3, pp. 1133–1155, 2020.
[14] J. Barnard. (2022) Search Engine Journal. [Online]. Available: https://www.searchenginejournal.com/how-google-search-ranking-works/307591/?fbclid=IwAR2vi2mG-UjmAm0KSZLAcV_759DEUMfEOSviQcX6Ie9xjuIHoNLDo5MmKH4#close
[15] Y. Jin, Z. Lin, and H. Lin, "The research of search engine based on semantic web," in 2008 International Symposium on Intelligent Information Technology Application Workshops. IEEE, 2008, pp. 360–363.