Bachelor of Technology
in
Information Technology
Submitted by
2019 - 2023
CMR COLLEGE OF ENGINEERING & TECHNOLOGY
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD – 501401
CERTIFICATE
This is to certify that the Major Project report entitled "Building Search Engine
Using Machine Learning Techniques" being submitted by S. Harish Reddy
(19H51A1259), R. Anvesh (20H55A1203), S. Vandana (20H55A1204) in partial
fulfillment for the award of Bachelor of Technology in Information Technology
is a record of bonafide work carried out by them under my guidance and
supervision.
The results embodied in this project report have not been submitted to any
other University or Institute for the award of any Degree.
With great pleasure we take this opportunity to express our heartfelt gratitude to all
the people who helped in making this project work a grand success.
We are grateful to Ms. K. Venkateswara Rao, Assistant Professor, Department of
Information Technology, for her valuable technical suggestions and guidance during the execution
of this project work.
We would like to thank Dr. K L S Soujanya, Head of the Department of Information
Technology, CMR College of Engineering and Technology, who was one of the major driving forces
behind the successful completion of our project work.
We are very grateful to Dr. Vijaya Kumar Koppula, Dean-Academic, CMR College of
Engineering and Technology, for his constant support and motivation in carrying out the project
work successfully.
We are highly indebted to Dr. V A Narayana, Principal, CMR College of Engineering and
Technology, for giving permission to carry out this project in a successful and fruitful way.
We would like to thank the teaching and non-teaching staff of the Department of Information
Technology for their co-operation.
We express our sincere thanks to Mr. Ch. Gopal Reddy, Secretary, CMR Group of
Institutions, for his continuous care.
Finally, we extend thanks to our parents who stood behind us at different stages of this
project. We sincerely acknowledge and thank all those who supported us directly and indirectly in
the completion of this project work.
TABLE OF CONTENTS
CHAPTER NO.    TITLE    PAGE NO.
TABLE OF CONTENTS i
LIST OF FIGURES ii
ABSTRACT iii
1 INTRODUCTION 1
1.1 Problem Statement 2
1.2 Research Objective 2
1.3 Project Scope and Limitations 3
2 BACKGROUND WORK 4
2.1. Literature Survey 5-6
2.2 Google Search Engine 7
2.2.1. Introduction 7
2.2.2. Merits, Demerits and Challenges 8
2.2.3. Implementation of Google Search Engine 8
2.3 Search Engine using BERT & TensorFlow 9
2.3.1. Introduction 9
2.3.2. Merits, Demerits and challenges 9
2.3.3. Implementation of Search Engine 11
2.4 Search Engine using Elastic Search & Kube Flow 12
2.4.1. Introduction 12
2.4.2. Merits, Demerits and Challenges 12
2.4.3. Implementation of Search Engine 13
5 CONCLUSION 61
5.1 Conclusion and Future Enhancement 62
6 REFERENCES 64-65
GitHub Link
LIST OF FIGURES
FIGURE NO.    TITLE    PAGE NO.
1 Structure of the Google Search Engine 7
2 Structure of Search Engine using BERT 10
3 BERT with Modifications 11
4 ElasticSearch with index 13
5 Performance standards 14
6 Search Response time of elastic search 15
7 Index response time of elasticsearch 15
8 Retrieval performance of BERT with Tensorflow 16
9 Working of SVM 20-22
10 Working of XG BOOST 25
11 Data flow diagram 28
12 UML Diagram 30
13 Class Diagram 31
14 Sequence Diagram 32
15 Activity Diagram 33
16 Result 50-60
ABSTRACT
The web is the largest and richest source of information, and search engines are
commonly used to retrieve information from the World Wide Web. A search engine provides a
simple interface for entering a user query and displays the results as the web addresses of the
relevant web pages, but with traditional search engines it has become very challenging to obtain
suitable information. We propose a search engine that uses machine learning techniques to return
more relevant web pages at the top of the results for user queries.
CHAPTER 1
INTRODUCTION
The World Wide Web is a web of individual systems and servers connected through
different technologies and methods. Every site consists of many web pages that are created and
deployed on a server. If a user needs something, he or she types a keyword, which is a set of words
extracted from the user's search input. The search input given by a user may be syntactically
incorrect, and this is where search engines are actually needed. Search engines provide a simple
interface for entering user queries and displaying the results.
1.1 Problem Statement
The world's population is increasing rapidly, and almost every person uses the
internet and smart technology. Because of the enormous number of web pages available today,
retrieving information from the internet presents a significant challenge: the complexity of
obtaining good results is increasing, and maintaining and understanding the data becomes very
complex. The accuracy of results is low because of the limitations of the ranking algorithms used.
A traditional search engine provides a simple interface for entering a user query and displaying
results as the web addresses of the relevant web pages, but it has become very challenging to
obtain suitable information this way. So, to overcome this problem, we are building a search engine
using machine learning.
1.2 Research Objective
Numerous efforts have been made by data experts and researchers in the field of
search engines. Dutta and Bansal [1] discuss various types of search engines and conclude that the
crawler-based search engine is the best among them; Google also uses this approach. A web crawler
is a program that navigates the web by following its constantly changing, dense, and distributed
hyperlinked structure, thereafter storing the downloaded pages in a vast database that is later
indexed for efficient execution of user queries. In [2], the authors conclude that the major benefit
of a keyword-focused web crawler over a traditional web crawler is that it works intelligently and
efficiently. The search engine uses a page-ranking algorithm to place the most relevant web pages
at the top of the results, according to the user's need.
Initially, only a simple idea was developed: as users were facing problems in searching
for data, a simple algorithm based on link structure was introduced. As the web kept expanding,
further modifications followed, and Weighted PageRank and HITS came into the picture. In [3], the
authors compare various PageRank algorithms and, among them all, the Weighted PageRank algorithm
is best suited for our system. Michael Chau and Hsinchun Chen [4] proposed a system based on a
machine learning approach for web page filtering. The machine learning results were compared with
a traditional algorithm and found to be more useful. The proposed approach is also effective for
building a search engine.
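As a rough illustration of the link-structure idea behind these ranking algorithms, the following
short sketch computes plain (unweighted) PageRank by power iteration. It is only a toy example
under assumed values (damping factor, iteration count, toy link graph) and is neither the Weighted
PageRank variant recommended in [3] nor the code used in this project.
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """adj[i][j] = 1 if page i links to page j (toy adjacency matrix)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Each page splits its "vote" equally among its out-links.
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0
    transition = adj / out
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Random-surfer update: teleport with probability (1 - damping).
        rank = (1 - damping) / n + damping * transition.T.dot(rank)
    return rank

# Three pages: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))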
1.3 Project Scope and Limitations
The primary limitation of this work is that the keywords and associated webpages came from
only one company and industry. In addition, the chosen language (Finnish) might affect the results. In
general, the gift industry can be considered as a highly competitive online industry with a lot of SEO
activity taking place. Although the range of keywords was relatively large in the context of that
company, as was the number of webpages, this research would need to be replicated using data on
other companies, industries, and languages in order to claim generalizability of the findings. Even
though we mitigate the impact of potential personalization by using an anonymous browser, there are
other factors that impact the search results, such as click logs, ranking information from past SERPs,
and so on. These factors make search results structurally unstable and make it more difficult to
replicate research in this domain. Moreover, as the ranking algorithms of the major search engines
undergo periodic changes, any research in the SEO field is subject to expiration. Even with the
mentioned limitations, the results are indicative of the impact of content and link features on search
rankings. Acquiring more data would allow for the use of more features (e.g., utilizing unsupervised
methods such as topic modeling) and more learning examples to further improve the algorithm. In
addition, more features about the actual content of the sites would provide more distinct information
about each site. Apart from obtaining data from other contexts, future research could focus on specific
website elements. In particular, the relatively high correlation of H3 and rankings is an interesting
finding. One reason for this can be that the use of H3 tags is rarer than the use of H1 and H2 tags and,
therefore, websites using H3 tags are applying more advanced SEO and content marketing strategies.
This proposition should be explored in future research.
CHAPTER 2
BACKGROUND WORK
Introduction:
Google Search is a fully automated search engine that uses software known as web
crawlers to explore the web regularly and find pages to add to its index. In fact, the vast
majority of pages listed in Google's results are not manually submitted for inclusion, but are
found and added automatically when the crawlers explore the web.
Merits:
Easy to use.
Most Accurate Results
Demerits:
Privacy concerns
Smart Track utilizes GS1-standard barcodes containing a unique serialized product
identifier, lot, and production and expiration dates. The information contained in the GS1 barcode
is captured across various supply chain processes and used to maintain a continuous log of
ownership transfers. As each stakeholder records possession of the product, an end user
(patient) can verify authenticity through a central data repository maintained as the Global Data
Synchronization Network (GDSN) by using a smartphone app. In the downstream supply
chain, warehouse, pharmacy, and hospital units can scan the barcode to verify the product
and its characteristics.
Introduction
Demerits:
Unlike Apache Solr, Elasticsearch does not have multi-language support for
handling request and response data.
Elasticsearch is not as good a data store as other options such as MongoDB or
Hadoop. It performs well for small use cases, but when streaming terabytes of data per
day it can either choke or lose data.
Raw data flows into Elasticsearch from a variety of sources, including logs, system
metrics, and web applications. Data ingestion is the process by which this raw data is parsed,
normalized, and enriched before it is indexed in Elasticsearch. Once indexed in Elasticsearch,
users can run complex queries against their data and use aggregations to retrieve complex
summaries of their data. From Kibana, users can create powerful visualizations of their data,
share dashboards, and manage the Elastic Stack.
An Elasticsearch index is a collection of documents that are related to each other.
Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names
of fields or properties) with their corresponding values (strings, numbers, Booleans, dates,
arrays of values, geolocations, or other types of data).
Elasticsearch uses a data structure called an inverted index, which is designed to allow
very fast full-text searches. An inverted index lists every unique word that appears in any
document and identifies all of the documents each word occurs in.
During the indexing process, Elasticsearch stores documents and builds an inverted
index to make the document data searchable in near real-time. Indexing is initiated with the
index API, through which you can add or update a JSON document in a specific index.
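To make the inverted-index idea concrete, the following minimal Python sketch builds one over a few
toy documents. It only illustrates the concept; the documents and query are made up, and this is not
how Elasticsearch is implemented internally.
from collections import defaultdict

docs = {
    1: "machine learning improves search",
    2: "search engines crawl and index the web",
    3: "machine learning models rank web pages",
}

# Map each unique word to the set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

# A full-text lookup becomes a fast dictionary access plus a set intersection.
query = ["machine", "search"]
result = set.intersection(*(inverted_index[w] for w in query))
print(sorted(result))   # -> [1]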
Google Search Console is a great tool to be able to get a better understanding of the
types of queries your website is ranking for, what your top traffic driving pages from search
are, and other valuable information that you can use to grow the presence of your website on
the internet and generate new business for your technology company.
CHAPTER 3
PROPOSED SYSTEM
3.1 OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into
a computer-based system. This design is important to avoid errors in the data-input process
and to show the management the correct direction for getting correct information from the
computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large
volumes of data. The goal of designing input is to make data entry easier and free from
errors. The data-entry screen is designed in such a way that all required data manipulations can be
performed. It also provides record-viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the
help of screens, and appropriate messages are provided when needed so that the user is never
left confused. Thus the objective of input design is to create an input layout that is
easy to follow.
INPUT DESIGN:
The input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation and the steps necessary
to put transaction data into a usable form for processing, which can be achieved either by having
the computer read data from a written or printed document or by having people key the data
directly into the system. The design of input focuses on controlling the amount of input required,
controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input
is designed in such a way that it provides security and ease of use while retaining privacy.
Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
OUTPUT DESIGN:
A quality output is one that meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to users and to
other systems through outputs. In output design, it is determined how the information is to be
displayed for immediate need, as well as the hard-copy output. It is the most important and direct
source of information for the user. Efficient and intelligent output design improves the system's
relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well-thought-out
manner; the right output must be developed while ensuring that each output element is
designed so that people find the system easy and effective to use. When analysts
design computer output, they should identify the specific output that is needed to meet the
requirements.
2. Select methods for presenting information.
3.Create document, report, or other formats that contain information produced by the
system.
The output form of an information system should accomplish one or more of the
following objectives.
Convey information about past activities, current status, or projections of the
future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.
Support Vector Machine (SVM) is one of the most popular supervised learning
algorithms; it is a supervised machine learning algorithm used for both classification and
regression.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point into the
correct category in the future. This best decision boundary is called a hyperplane.
Although SVM can handle regression problems as well, it is best suited for classification. The
objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that
distinctly classifies the data points. The dimension of the hyperplane depends upon the
number of features: if the number of input features is two, then the hyperplane is just a line;
if the number of input features is three, then the hyperplane becomes a 2-D plane. It
becomes difficult to imagine when the number of features exceeds three.
Working of the SVM algorithm:
One reasonable choice for the best hyperplane is the one that represents the largest
separation, or margin, between the two classes.
So we choose the hyperplane whose distance to the nearest data point on each
side is maximized. If such a hyperplane exists, it is known as the maximum-margin
hyperplane, or hard margin. So from the above figure, we choose L2. Let's consider a scenario
like the one shown below.
Here we have one blue ball within the boundary of the red balls. So how does SVM
classify the data? It's simple: the blue ball within the boundary of the red ones is an outlier of
the blue balls. The SVM algorithm has the ability to ignore outliers and find the best
hyperplane that maximizes the margin; SVM is robust to outliers.
For this type of data point, SVM finds the maximum margin as it did with the
previous data sets, and in addition it adds a penalty each time a point crosses the margin. The
margins in such cases are called soft margins. When there is a soft margin, the SVM tries to
minimize (1/margin) + λ·(Σ penalty). Hinge loss is a commonly used penalty: if there are no
violations there is no hinge loss, and if there are violations the hinge loss is proportional to the
distance of the violation.
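Written out formally, the soft-margin trade-off sketched above corresponds to the standard textbook
objective below (shown here only for clarity; C plays the role of the penalty weight λ, the slack
variables ξ_i measure how far each point violates the margin, and minimizing ||w||² is equivalent to
maximizing the margin 2/||w||):
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} \;+\; C \sum_{i=1}^{n} \xi_{i}
\quad\text{subject to}\quad y_{i}\,(w \cdot x_{i} + b) \;\ge\; 1 - \xi_{i}, \qquad \xi_{i} \ge 0 .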
Say our data is as shown in the figure above. SVM solves this by creating a new variable using
a kernel. We take a point x_i on the line and create a new variable y_i as a function of its
distance from the origin o; if we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates such a new variable is referred to as a kernel.
Advantages of SVM
Example: SVM can be understood with the example that we used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs; if we want a
model that can accurately identify whether it is a cat or a dog, such a model can be created by
using the SVM algorithm. We first train our model with lots of images of cats and dogs so
that it can learn their different features, and then we test it with this strange creature. Since the
support vectors create a decision boundary between these two classes (cat and dog) and choose the
extreme cases (the support vectors), the model examines the extreme cases of cat and dog and, on
the basis of the support vectors, classifies the creature as a cat. Consider the diagram below:
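As a small illustration of how such a classifier might be trained in practice, the sketch below uses
scikit-learn's SVC on synthetic data. The dataset, kernel, and parameter values are assumptions for
demonstration only and are not the training code used in this project.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for the cat/dog example above.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'rbf' is a non-linear kernel; C controls the soft-margin penalty discussed earlier.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))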
2. XGBOOST ALGORITHM
XG Boost is an optimized distributed gradient boosting library designed for efficient
and scalable training of machine learning models.
XG Boost stands for “Extreme Gradient Boosting” and it has become one of the
most popular and widely used machine learning algorithms due to its ability to handle large
datasets and its ability to achieve state-of-the-art performance in many machine learning
tasks such as classification and regression.
One of the key features of XG Boost is its efficient handling of missing values,
which allows it to handle real-world data with missing values without requiring significant
pre-processing. Additionally, XG Boost has built-in support for parallel processing, making
it possible to train models on large datasets in a reasonable amount of time.
XG Boost stands for Extreme Gradient Boosting, which was proposed by the
researchers at the University of Washington. It is a library written in C++ which optimizes
the training for Gradient Boosting.
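A short sketch of training an XGBoost classifier through its scikit-learn-style API is given below.
The synthetic data and parameter values are assumptions chosen for illustration, not the project's
exact setup (the project's Train view uses XGBClassifier with default parameters).
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees: n_estimators trees built sequentially, each correcting the last.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))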
Random Forest:
Every decision tree has high variance, but when we combine all of them together in
parallel then the resultant variance is low as each decision tree gets perfectly trained on that
particular sample data and hence the output doesn’t depend on one decision tree but multiple
decision trees. In the case of a classification problem, the final output is taken by using the
majority voting classifier. In the case of a regression problem, the final output is the mean of
all the outputs. This part is Aggregation. The basic idea behind this is to combine multiple
decision trees in determining the final output rather than relying on individual decision trees.
Random Forest has multiple decision trees as base learning models. We randomly
perform row sampling and feature sampling from the dataset forming sample datasets for
every model. This part is called Bootstrap.
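The bagging idea described above (bootstrap sampling plus aggregation by majority vote) can be
illustrated with scikit-learn's RandomForestClassifier; the data and parameters below are purely
illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=12, random_state=1)

# Each tree is trained on a bootstrap sample with a random subset of features;
# the forest's prediction is the majority vote over all trees.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features="sqrt", random_state=1)
forest.fit(X, y)
print(forest.predict(X[:5]), y[:5])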
Boosting:
Gradient Boosting:
There is a technique called the Gradient Boosted Trees whose base learner is CART
(Classification and Regression Trees).
XG Boost:
In this algorithm, decision trees are created sequentially. Weights play an
important role in XG Boost: weights are assigned to all the independent variables, which are
then fed into the decision tree that predicts results. The weights of variables predicted
wrongly by the tree are increased, and these variables are then fed to the second decision tree.
These individual classifiers/predictors are then ensembled to give a strong and more precise
model. It can work on regression, classification, ranking, and user-defined prediction
problems.
Advantages:
3.3 DESIGNING
DATA FLOW DIAGRAM:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can
be used to represent a system in terms of the input data to the system, the various processing
carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used
to model the system components: the system processes, the data used by the processes, the
external entities that interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified
by a series of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
4. The DFD is also known as a bubble chart. A DFD may be used to represent a system at
any level of abstraction, and it may be partitioned into levels that represent increasing
information flow and functional detail.
UML DIAGRAMS:
The Unified Modeling Language (UML) is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of software systems, as well as for business modeling
and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems. The UML is a very important part of
developing object-oriented software and of the software development process. The UML uses
mostly graphical notations to express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher level development concepts such as collaborations, frameworks,
patterns and components.
7. Integrate best practices.
CLASS DIAGRAM:
SEQUENCE DIAGRAM:
ACTIVITY DIAGRAM:
Urls.py:
from django.urls import path
from . import views
urlpatterns = [path("index.html", views.index, name="index"),
Views.py:
import pandas as pd
from sklearn.model_selection import train_test_split
from string import punctuation
from nltk.corpus import stopwords
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from numpy import dot
from numpy.linalg import norm
# Additional imports required by the views below (omitted in the original listing):
from django.shortcuts import render
from django.core.files.storage import FileSystemStorage
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pymysql

global uname
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def cleanNews(doc):
    # Tokenize, strip punctuation, drop non-alphabetic words, stop words and
    # single characters, then lemmatize and re-join into a cleaned string.
    tokens = doc.split()
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    tokens = [word for word in tokens if len(word) > 1]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = ' '.join(tokens)
    return tokens

X = np.load("model/X.npy")
Y = np.load("model/Y.npy")
URLS = np.load("model/URLS.npy")
def Train(request):
    if request.method == 'GET':
        output = ''
        font = '<font size="" color="black">'
        arr = ['Algorithm Name', 'Accuracy', 'Precision', 'Recall', 'FSCORE']
        output += '<table border="1" align="center"><tr>'
        for i in range(len(arr)):
            output += '<th><font size="" color="black">'+arr[i]+'</th>'
        output += "</tr>"
        # Train/test split (assumed here; the split is not shown in the original listing).
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
        xgb_cls = XGBClassifier()
        xgb_cls.fit(X, Y)
        predict = xgb_cls.predict(X_test)
        p = precision_score(y_test, predict, average='macro') * 100
        r = recall_score(y_test, predict, average='macro') * 100
        f = f1_score(y_test, predict, average='macro') * 100
        a = accuracy_score(y_test, predict) * 100
        output += '<tr><td><font size="" color="black">XGBoost</td><td><font size="" color="black">'+str(a)+'</td><td><font size="" color="black">'+str(p)+'</td><td><font size="" color="black">'+str(r)+'</td><td><font size="" color="black">'+str(f)+'</td></tr>'
        context = {'data': output}
        return render(request, 'ViewUsers.html', context)
def VerifyUser(request):
    if request.method == 'GET':
        global uname
        username = request.GET['t1']
        db_connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        db_cursor = db_connection.cursor()
        student_sql_query = "update signup set status='Accepted' where username='"+username+"'"
        db_cursor.execute(student_sql_query)
        db_connection.commit()
        print(db_cursor.rowcount, "Record Inserted")
        output = ''  # default in case the update did not affect any row
        if db_cursor.rowcount == 1:
            output = username+" account activated"
        context = {'data': output}
        return render(request, 'AdminScreen.html', context)
def SearchQueryAction(request):
    if request.method == 'POST':
        query = request.POST.get('t1', False)
        qry = query
        output = '<table border=1 align=center width=100%>'
        font = '<font size="" color="black">'
        arr = ['Query', 'Search URL', 'Rating']
        output += "<tr>"
        for i in range(len(arr)):
            output += "<th>"+font+arr[i]+"</th>"
        query = query.strip().lower()
        query = cleanNews(query)
        # tfidf_vectorizer is assumed to be fitted on the dataset at server start-up
        # (that step is not shown in this listing).
        vector = tfidf_vectorizer.transform([query]).toarray()
        vector = vector.ravel()
        for i in range(len(X)):
            # Cosine similarity between the query vector and each document vector.
            score = dot(X[i], vector)/(norm(X[i])*norm(vector))
            if score > 0.2:
                output += "<tr><td>"+font+qry+"</td>"
                output += '<td><a href="'+URLS[i]+'" target="_blank">'+font+URLS[i]+"</td>"
                output += "<td>"+font+str(score)+"</td>"
        context = {'data': output}
        return render(request, 'ViewOutput.html', context)
def ViewUsers(request):
    if request.method == 'GET':
        global uname
        output = '<table border=1 align=center width=100%>'
        font = '<font size="" color="black">'
        arr = ['Username', 'Password', 'Contact No', 'Gender', 'Email Address', 'Address', 'Status']
        output += "<tr>"
        for i in range(len(arr)):
            output += "<th>"+font+arr[i]+"</th>"
        con = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        with con:
            cur = con.cursor()
            cur.execute("select * FROM signup")
            rows = cur.fetchall()
            for row in rows:
                username = row[0]
                password = row[1]
                contact = row[2]
                gender = row[3]
                email = row[4]
                address = row[5]
                status = row[6]
                output += "<tr><td>"+font+str(username)+"</td>"
                output += "<td>"+font+password+"</td>"
                output += "<td>"+font+contact+"</td>"
                output += "<td>"+font+gender+"</td>"
                output += "<td>"+font+email+"</td>"
                output += "<td>"+font+address+"</td>"
                if status == 'Pending':
                    output += '<td><a href="VerifyUser?t1='+username+'">Click Here</a></td>'
                else:
                    output += "<td>"+font+status+"</td>"
        context = {'data': output}
        return render(request, 'ViewUsers.html', context)
def UploadDatasetAction(request):
    if request.method == 'POST':
        global uname
        dataset = request.FILES['t1']
        dataset_name = request.FILES['t1'].name
        fs = FileSystemStorage()
        fs.save('SearchEngineApp/static/files/'+dataset_name, dataset)
        output = dataset_name+' saved in database'
        context = {'data': output}
        return render(request, 'UploadDataset.html', context)

def UploadDataset(request):
    if request.method == 'GET':
        return render(request, 'UploadDataset.html', {})

def SearchQuery(request):
    if request.method == 'GET':
        return render(request, 'SearchQuery.html', {})

def UserLogin(request):
    if request.method == 'GET':
        return render(request, 'UserLogin.html', {})

def index(request):
    if request.method == 'GET':
        return render(request, 'index.html', {})

def AdminLogin(request):
    if request.method == 'GET':
        return render(request, 'AdminLogin.html', {})

def ManagerLogin(request):
    if request.method == 'GET':
        return render(request, 'ManagerLogin.html', {})

def Signup(request):
    if request.method == 'GET':
        return render(request, 'Signup.html', {})
def AdminLoginAction(request):
    global uname
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        if username == 'admin' and password == 'admin':
            uname = username
            context = {'data': 'welcome '+username}
            return render(request, 'AdminScreen.html', context)
        else:
            context = {'data': 'login failed'}
            return render(request, 'AdminLogin.html', context)

def ManagerLoginAction(request):
    global uname
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        if username == 'Manager' and password == 'Manager':
            uname = username
            context = {'data': 'welcome '+uname}
            return render(request, 'ManagerScreen.html', context)
        else:
            context = {'data': 'login failed'}
            return render(request, 'ManagerLogin.html', context)
def UserLoginAction(request):
    global uname
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        index = 0
        con = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        with con:
            cur = con.cursor()
            cur.execute("select username,password, status FROM signup")
            rows = cur.fetchall()
            for row in rows:
                if row[0] == username and password == row[1] and row[2] == "Accepted":
                    uname = username
                    index = 1
                    break
        if index == 1:
            context = {'data': 'welcome '+uname}
            return render(request, 'UserScreen.html', context)
        else:
            context = {'data': 'login failed or account not activated by admin'}
            return render(request, 'UserLogin.html', context)
def SignupAction(request):
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        contact = request.POST.get('t3', False)
        gender = request.POST.get('t4', False)
        email = request.POST.get('t5', False)
        address = request.POST.get('t6', False)
        output = "none"
        con = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        with con:
            cur = con.cursor()
            cur.execute("select username FROM signup")
            rows = cur.fetchall()
            for row in rows:
                if row[0] == username:
                    output = username+" Username already exists"
                    break
        if output == 'none':
            db_connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
            db_cursor = db_connection.cursor()
            student_sql_query = "INSERT INTO signup(username,password,contact_no,gender,email,address,status) VALUES('"+username+"','"+password+"','"+contact+"','"+gender+"','"+email+"','"+address+"','Pending')"
            db_cursor.execute(student_sql_query)
            db_connection.commit()
            print(db_cursor.rowcount, "Record Inserted")
            if db_cursor.rowcount == 1:
                output = 'Signup Process Completed'
        context = {'data': output}
        return render(request, 'Signup.html', context)
models.py:
from django.db import models

class userModel(models.Model):
    name = models.CharField(max_length=50)
    email = models.EmailField()
    passwd = models.CharField(max_length=40)
    cwpasswd = models.CharField(max_length=40)
    mobileno = models.CharField(max_length=50, default="", editable=True)
    status = models.CharField(max_length=40, default="", editable=True)

    def __str__(self):
        return self.email

    class Meta:
        db_table = 'userregister'

class weightmodel(models.Model):
    filename = models.CharField(max_length=100)
    file = models.FileField(upload_to='files/pdfs/')
    weight = models.CharField(max_length=100)
    rank = models.CharField(max_length=100, default="", editable=False)
    label = models.CharField(max_length=100, default="", editable=False)

    def __str__(self):
        return self.filename

    class Meta:
        db_table = 'weight'
forms.py:
from django import forms
from user.models import *
from django.core import validators

class userForm(forms.ModelForm):
    name = forms.CharField(widget=forms.TextInput(), required=True, max_length=100)
    passwd = forms.CharField(widget=forms.PasswordInput(), required=True, max_length=100)
    cwpasswd = forms.CharField(widget=forms.PasswordInput(), required=True, max_length=100)
    email = forms.CharField(widget=forms.TextInput(), required=True)
    mobileno = forms.CharField(widget=forms.TextInput(), required=True, max_length=10,
                               validators=[validators.MaxLengthValidator(10), validators.MinLengthValidator(10)])
    status = forms.CharField(widget=forms.HiddenInput(), initial='waiting', max_length=100)

    def __str__(self):
        return self.email

    class Meta:
        model = userModel
        fields = ['name', 'passwd', 'cwpasswd', 'email', 'mobileno', 'status']
index.html:
{% load static %}
<html>
<head>
<title>Building Search Engine Using Machine Learning Technique</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link href="{% static 'style.css' %}" rel="stylesheet" type="text/css" />
</head>
<body>
<div class="main">
  <div class="main_resize">
    <div class="header">
      <div class="logo">
        <h1><span>Building Search Engine Using Machine Learning Technique</span><small></small></h1>
      </div>
    </div>
    <div class="content">
      <div class="content_bg">
        <div class="menu_nav">
          <ul>
            <li><a href="{% url 'index' %}">Home</a></li>
            <li><a href="{% url 'AdminLogin' %}">Admin Login</a></li>
            <li><a href="{% url 'ManagerLogin' %}">Manager Login</a></li>
            <li><a href="{% url 'UserLogin' %}">User Login</a></li>
            <li><a href="{% url 'Signup' %}">New User Signup Here</a></li>
          </ul>
        </div>
        <div class="hbg"><img src="{% static 'images/header_images.jpg' %}" width="915" height="286" alt="" /></div>{{ data }}
        <p align="justify"><font size="3" style="font-family: Comic Sans MS" color="black">Abstract - Building Search Engine Using Machine Learning Technique</font></p>
      </div>
    </div>
  </div>
</div>
</body>
</html>
Sample Test Cases:

S.No | Test Case | Expected Result | Result | Remarks (If Fails)
1 | User Register | User registration completes successfully. | Pass | If the user email already exists, then it fails.
2 | User Login | If the username and password are correct, then a valid page is shown. | Pass | Unknown/unregistered users will not be logged in.
3 | Manager Login | If the manager name and password are correct, then a valid page is shown. | Pass | Unregistered managers will not be logged in.
4 | Admin can activate the registered manager | Admin can activate the registered manager id. | Pass | If the manager is not found, then it won't log in.
5 | Admin Login | Admin can log in with his login credentials; on success he gets the home page. | Pass | Invalid login details are not allowed here.
6 | Admin can activate the registered users | Admin can activate the registered user id. | Pass | If the user is not found, then it won't log in.
7 | Admin can get the SVM results | By clicking SVM, it will display the SVM prediction. | Pass | Otherwise the SVM prediction won't be shown.
8 | Admin can get the XGBoost results | By clicking XGBoost, it will display the XGBoost prediction. | Pass | Otherwise the XGBoost prediction won't be shown.
9 | User login page | The user can search the weight of a particular document. | Pass | Otherwise we won't get the weight of the document.
10 | | | Pass |
CHAPTER 4
RESULTS & DISCUSSIONS
In this project, machine learning algorithms, namely SVM and XGBoost, are used to
predict the search results for a given query and to build a search engine on top of them. To
train these algorithms, website data is used, and this data is converted into numeric vectors
called TF-IDF (term frequency–inverse document frequency) vectors. A TF-IDF vector contains the
weighted frequency of each word.
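A small sketch of this TF-IDF representation, combined with the cosine-similarity matching used in
the SearchQueryAction view earlier, is given below. The documents and query are toy assumptions,
not the project's dataset.
import numpy as np
from numpy import dot
from numpy.linalg import norm
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "latest news on cyber security threats",
    "football match results and scores",
    "machine learning improves web search ranking",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()   # one TF-IDF vector per document

# Transform the query into the same vector space and score each document by cosine similarity.
query_vec = vectorizer.transform(["news on security"]).toarray().ravel()
for i, doc_vec in enumerate(X):
    score = dot(doc_vec, query_vec) / (norm(doc_vec) * norm(query_vec))
    print(round(float(score), 3), docs[i])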
In this project, the following modules have been implemented:
Admin module: the admin can log in to the application using the username and password "admin",
accept or activate new user registrations, and then train the SVM and XGBoost algorithms.
Manager module: the manager can log in to the application using the username and password
"Manager" and then upload the dataset to the application.
New User Signup: using this module, a new user can sign up with the application.
User Login: a user can log in to the application and then perform a search by giving a query.
To run the project, install MySQL and Python 3.7, then copy the contents of the DB.txt file
and paste them into MySQL to create the database.
Now double-click on the 'run.bat' file to start the Python Django server and get the screen
below.
In the above screen the server has started and built a vector from the dataset, where the first row
shows the words and the remaining rows contain the TF-IDF word frequencies. Now open a browser,
enter the URL http://127.0.0.1:8000/index.html, and press Enter to get the page below.
SCREENSHOTS:
In the above screen, click on the 'New User Signup Here' link to get the screen below.
In the above screen, the user signs up and then presses the button to get the output below.
In the above screen, the user signup process has completed; now click on 'User Login' to get the
screen below.
In the above screen, we gave a correct login but the account has not been activated by the admin,
so click on the 'Admin Login' link to log in as admin and then activate the user.
In the above screen, the admin logs in, and after login will get the screen below.
In the above screen, the admin can click on the 'View Users' link to view all users.
In the above screen, the admin can click on the 'Click Here' link to activate that user account.
In the above screen, we can see the admin has activated the user account for kumar; now the admin
can click on the 'Train SVM & XGBOOST' link to train the SVM and XGBoost machine learning
algorithms and get the output below.
In the above screen, we can see the SVM and XGBoost accuracy; of the two algorithms, XGBoost
obtained the higher accuracy. Now log out and log in as Manager.
In the above screen, the manager logs in, and after login will get the screen below.
In the above screen, the manager can click on the 'Upload Dataset' link to upload the dataset or
documents.
In the above screen, the manager browses and uploads the dataset (this file can be found inside the
'Dataset' folder); now press the button to save the dataset to the server database.
In the above screen, the dataset file has been saved in the database; now log out and log in as a
user to perform a search.
In the above screen, the user logs in, and after login will get the output below.
In the above screen, the user can click on the 'Search with Page Rank' link to search for any data.
In the above screen, I entered the query 'news on security' and pressed the button to get the
search result below.
In the above screen, the machine learning algorithm retrieved two URLs for the given query, and the
user can click on those URLs to visit the pages.
In the above screen, by clicking on a URL link the user can visit and view the page. Similarly, the
user can give any query, and if the query is available in the dataset then he will get output.
CHAPTER 5
CONCLUSION
A search engine is very useful for finding the most relevant URLs for a given keyword,
which reduces the time a user spends searching for the relevant web page. For privacy and other
reasons, we wanted to build our own search engine. The project we have built provides faster
retrieval of information using a search engine implemented with machine learning algorithms. A
traditional search engine provides a simple interface for entering a user query and displaying
results as the web addresses of relevant web pages, but it has become very challenging to obtain
suitable information that way.
For this, accuracy is a very important factor. From the above observations, it can be
concluded that XGBoost is better in terms of accuracy than SVM and ANN. Thus, a search
engine built using the XGBoost and PageRank algorithms will give better accuracy.
CHAPTER 6
REFERENCES
[1] Manika Dutta, K. L. Bansal, “A Review Paper on Various Search Engines (Google,
Yahoo, Altavista, Ask and Bing)”, International Journal on Recent and Innovation Trends in
Computing and Communication, 2016.
[2] Gunjan H. Agre, Nikita V. Mahajan, "Keyword Focused Web Crawler", International
Conference on Electronic and Communication Systems, IEEE, 2015.
[3] Tuhena Sen, Dev Kumar Chaudhary, “Contrastive Study of Simple PageRank, HITS and
Weighted PageRank Algorithms: Review”, International Conference on Cloud Computing,
Data Science & Engineering, IEEE, 2017.
[4] Michael Chau, Hsinchun Chen, "A machine learning approach to web page filtering using
content and structure analysis", Decision Support Systems, vol. 44, pp. 482–494, ScienceDirect,
2008.
[5] Taruna Kumari, Ashlesha Gupta, Ashutosh Dixit, “Comparative Study of Page Rank and
Weighted Page Rank Algorithm”, International Journal of Innovative Research in Computer
and Communication Engineering, February 2014.
[6] K. R. Srinath, "Page Ranking Algorithms – A Comparison", International Research
Journal of Engineering and Technology (IRJET), Dec 2017.
[7] S. Prabha, K. Duraiswamy, J. Indhumathi, “Comparative Analysis of Different Page
Ranking Algorithms”, International Journal of Computer and Information Engineering, 2014.
[8] Dilip Kumar Sharma, A. K. Sharma, “A Comparative Analysis of Web Page Ranking
Algorithms”, International Journal on Computer Science and Engineering, 2010.
[9] Vijay Chauhan, Arunima Jaiswal, Junaid Khalid Khan, “Web Page Ranking Using
Machine Learning Approach”, International Conference on Advanced Computing
Communication Technologies, 2015.
[10] Amanjot Kaur Sandhu, Tiewei S. Liu, "Wikipedia Search Engine: Interactive Information
Retrieval Interface Design", International Conference on Industrial and Information Systems,
2014.
[11] Neha Sharma, Rashi Agarwal, Narendra Kohli, “Review of features and machine learning
techniques for web searching”, International Conference on Advanced Computing
Communication Technologies, 2016.
[12] Sweah Liang Yong, Markus Hagenbuchner, Ah Chung Tsoi, “Ranking Web Pages using
Machine Learning Approaches”, International Conference on Web Intelligence and Intelligent
Agent Technology, 2008.
[13] B. Jaganathan, Kalyani Desikan,“Weighted Page Rank Algorithm based on In-Out
Weight of Webpages”, Indian Journal of Science and Technology, Dec-2015.