Bachelor of Technology
in
Information Technology
Submitted by
2019 - 2023
CMR COLLEGE OF ENGINEERING & TECHNOLOGY
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD – 501401
CERTIFICATE
This is to certify that the Major Project report entitled "Building Search Engine
Using Machine Learning Techniques" being submitted by S. Harish Reddy
(19H51A1259), R. Anvesh (20H55A1203), S. Vandana (20H55A1204) in partial
fulfillment for the award of Bachelor of Technology in Information Technology
is a record of bonafide work carried out by them under my guidance and
supervision.
The results embodied in this project report have not been submitted to any
other University or Institute for the award of any Degree.
With great pleasure we take this opportunity to express our heartfelt gratitude to all
the people who helped in making this project work a grand success.
We are grateful to Ms. K. Venkateswara Rao, Assistant Professor, Department of
Information Technology, for her valuable technical suggestions and guidance during the execution
of this project work.
We would like to thank Dr. K L S Soujanya, Head of the Department of Information
Technology, CMR College of Engineering and Technology, who was one of the major driving forces
behind the successful completion of our project work.
We are very grateful to Dr. Vijaya Kumar Koppula, Dean-Academic, CMR College of
Engineering and Technology, for his constant support and motivation in carrying out the project
work successfully.
We are highly indebted to Dr. V A Narayana, Principal, CMR College of Engineering and
Technology, for giving permission to carry out this project in a successful and fruitful way.
We would like to thank the teaching and non-teaching staff of the Department of Information
Technology for their co-operation.
We express our sincere thanks to Mr. Ch. Gopal Reddy, Secretary, CMR Group of
Institutions, for his continuous care.
Finally, we extend thanks to our parents who stood behind us at different stages of this
project. We sincerely acknowledge and thank all those who supported us directly and indirectly in
the completion of this project work.
TABLE OF CONTENTS
CHAPTER NO.    TITLE    PAGE NO.
TABLE OF CONTENTS i
LIST OF FIGURES ii
ABSTRACT iii
1 INTRODUCTION 1
1.1 Problem Statement 2
1.2 Research Objective 2
1.3 Project Scope and Limitations 3
2 BACKGROUND WORK 4
2.1. Literature Survey 5-6
2.2 Google Search Engine 7
2.2.1. Introduction 7
2.2.2. Merits, Demerits and Challenges 8
2.2.3. Implementation of Google Search Engine 8
2.3 Search Engine using BERT & TensorFlow 9
2.3.1. Introduction 9
2.3.2. Merits, Demerits and challenges 9
2.3.3. Implementation of Search Engine 11
2.4 Search Engine using Elastic Search & Kube Flow 12
2.4.1. Introduction 12
2.4.2. Merits, Demerits and Challenges 12
2.4.3. Implementation of Search Engine 13
5 CONCLUSION 61
5.1 Conclusion and Future Enhancement 62
6 REFERENCES 64-65
GitHub Link
LIST OF FIGURES
FIGURE NO.    TITLE    PAGE NO.
1 Structure of the Google Search Engine 7
2 Structure of Search Engine using BERT 10
3 BERT with Modifications 11
4 ElasticSearch with index 13
5 Performance standards 14
6 Search Response time of elastic search 15
7 Index response time of elasticsearch 15
8 Retrieval performance of BERT with Tensorflow 16
9 Working of SVM 20-22
10 Working of XG BOOST 25
11 Data flow diagram 28
12 UML Diagram 30
13 Class Diagram 31
14 Sequence Diagram 32
15 Activity Diagram 33
16 Result 50-60
ABSTRACT
The web is the largest and richest source of information, and search engines are
commonly used to retrieve information from the World Wide Web. A search engine provides a
simple interface for entering a user query and displays the results as the web addresses of the
relevant web pages, but with traditional search engines it has become very challenging to obtain
suitable information. We propose a search engine that uses machine learning techniques to return
more relevant web pages at the top of the results for user queries.
CHAPTER 1
INTRODUCTION
The World Wide Web is a web of individual systems and servers connected through
different technologies and methods. Every site consists of many web pages that are created and
deployed on a server. If a user needs something, he or she types a keyword, which is a set of words
extracted from the user's search input. The search input given by a user may be syntactically
incorrect, and this is where search engines are actually needed. Search engines provide a simple
interface for entering user queries and displaying the results.
1.1 Problem Statement
The world's population is increasing rapidly, and almost every person uses the
internet and smart technology. Because of the enormous number of web pages available today,
retrieving information from the internet presents a significant challenge: the complexity of
obtaining good results is increasing, and maintaining and understanding the data becomes very
complex. The accuracy of results is low because of the limitations of the ranking algorithms used.
A traditional search engine provides a simple interface for entering a user query and displaying
results as the web addresses of the relevant web pages, but it has become very challenging to
obtain suitable information this way. So, to overcome this problem, we are building a search engine
using machine learning.
1.2 Research Objective
Numerous efforts have been made by data experts and researchers in the field of
search engines. Dutta and Bansal [1] discuss various types of search engines and conclude that the
crawler-based search engine is the best among them; Google also uses this approach. A web crawler
is a program that navigates the web by following its constantly changing, dense, and distributed
hyperlinked structure, thereafter storing the downloaded pages in a vast database that is later
indexed for efficient execution of user queries. In [2], the authors conclude that the major benefit
of a keyword-focused web crawler over a traditional web crawler is that it works intelligently and
efficiently. The search engine uses a page-ranking algorithm to place the most relevant web pages
at the top of the results, according to the user's need.
Initially, only a simple idea was developed: as users were facing problems in searching
for data, a simple algorithm based on link structure was introduced. As the web kept expanding,
further modifications followed, and Weighted PageRank and HITS came into the picture. In [3], the
authors compare various PageRank algorithms and, among them all, the Weighted PageRank algorithm
is best suited for our system. Michael Chau and Hsinchun Chen [4] proposed a system based on a
machine learning approach for web page filtering. The machine learning results were compared with
a traditional algorithm and found to be more useful. The proposed approach is also effective for
building a search engine.
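As a rough illustration of the link-structure idea behind these ranking algorithms, the following
short sketch computes plain (unweighted) PageRank by power iteration. It is only a toy example
under assumed values (damping factor, iteration count, toy link graph) and is neither the Weighted
PageRank variant recommended in [3] nor the code used in this project.
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """adj[i][j] = 1 if page i links to page j (toy adjacency matrix)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Each page splits its "vote" equally among its out-links.
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0
    transition = adj / out
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Random-surfer update: teleport with probability (1 - damping).
        rank = (1 - damping) / n + damping * transition.T.dot(rank)
    return rank

# Three pages: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))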
1.3 Project Scope and Limitations
The primary limitation of this work is that the keywords and associated webpages came from
only one company and industry. In addition, the chosen language (Finnish) might affect the results. In
general, the gift industry can be considered as a highly competitive online industry with a lot of SEO
activity taking place. Although the range of keywords was relatively large in the context of that
company, as was the number of webpages, this research would need to be replicated using data on
other companies, industries, and languages in order to claim generalizability of the findings. Even
though we mitigate the impact of potential personalization by using an anonymous browser, there are
other factors that impact the search results, such as click logs, ranking information from past SERPs,
and so on. These factors make search results structurally unstable and make it more difficult to
replicate research in this domain. Moreover, as the ranking algorithms of the major search engines
undergo periodic changes, any research in the SEO field is subject to expiration. Even with the
mentioned limitations, the results are indicative of the impact of content and link features on search
rankings. Acquiring more data would allow for the use of more features (e.g., utilizing unsupervised
methods such as topic modeling) and more learning examples to further improve the algorithm. In
addition, more features about the actual content of the sites would provide more distinct information
about each site. Apart from obtaining data from other contexts, future research could focus on specific
website elements. In particular, the relatively high correlation of H3 and rankings is an interesting
finding. One reason for this can be that the use of H3 tags is rarer than the use of H1 and H2 tags and,
therefore, websites using H3 tags are applying more advanced SEO and content marketing strategies.
This proposition should be explored in future research.
CHAPTER 2
BACKGROUND WORK
Introduction:
Google Search is a fully automated search engine that uses software known as web
crawlers to explore the web regularly and find pages to add to its index. In fact, the vast
majority of pages listed in Google's results are not manually submitted for inclusion, but are
found and added automatically when the crawlers explore the web.
Merits:
Easy to use.
Most Accurate Results
Demerits:
Privacy concerns
Smart Track utilizes GS1-standard barcodes containing a unique serialized product
identifier, lot, and production and expiration dates. The information contained in the GS1 barcode
is captured across various supply chain processes and used to maintain a continuous log of
ownership transfers. As each stakeholder records possession of the product, an end user
(patient) can verify authenticity through a central data repository maintained as the Global Data
Synchronization Network (GDSN) by using a smartphone app. In the downstream supply
chain, warehouse, pharmacy, and hospital units can scan the barcode to verify the product
and its characteristics.
Introduction
Demerits:
Unlike Apache Solr, Elasticsearch does not have multi-language support for
handling request and response data.
Elasticsearch is not as good a data store as other options such as MongoDB or
Hadoop. It performs well for small use cases, but when streaming terabytes of data per
day it can either choke or lose data.
Raw data flows into Elasticsearch from a variety of sources, including logs, system
metrics, and web applications. Data ingestion is the process by which this raw data is parsed,
normalized, and enriched before it is indexed in Elasticsearch. Once indexed in Elasticsearch,
users can run complex queries against their data and use aggregations to retrieve complex
summaries of their data. From Kibana, users can create powerful visualizations of their data,
share dashboards, and manage the Elastic Stack.
An Elasticsearch index is a collection of documents that are related to each other.
Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names
of fields or properties) with their corresponding values (strings, numbers, Booleans, dates,
arrays of values, geolocations, or other types of data).
Elasticsearch uses a data structure called an inverted index, which is designed to allow
very fast full-text searches. An inverted index lists every unique word that appears in any
document and identifies all of the documents each word occurs in.
During the indexing process, Elasticsearch stores documents and builds an inverted
index to make the document data searchable in near real-time. Indexing is initiated with the
index API, through which you can add or update a JSON document in a specific index.
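To make the inverted-index idea concrete, the following minimal Python sketch builds one over a few
toy documents. It only illustrates the concept; the documents and query are made up, and this is not
how Elasticsearch is implemented internally.
from collections import defaultdict

docs = {
    1: "machine learning improves search",
    2: "search engines crawl and index the web",
    3: "machine learning models rank web pages",
}

# Map each unique word to the set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

# A full-text lookup becomes a fast dictionary access plus a set intersection.
query = ["machine", "search"]
result = set.intersection(*(inverted_index[w] for w in query))
print(sorted(result))   # -> [1]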
Google Search Console is a great tool to be able to get a better understanding of the
types of queries your website is ranking for, what your top traffic driving pages from search
are, and other valuable information that you can use to grow the presence of your website on
the internet and generate new business for your technology company.
CHAPTER 3
PROPOSED SYSTEM
3.1 OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into
a computer-based system. This design is important to avoid errors in the data-input process
and to show the management the correct direction for getting correct information from the
computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large
volumes of data. The goal of designing input is to make data entry easier and free from
errors. The data-entry screen is designed in such a way that all required data manipulations can be
performed. It also provides record-viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the
help of screens, and appropriate messages are provided when needed so that the user is never
left confused. Thus the objective of input design is to create an input layout that is
easy to follow.
INPUT DESIGN:
The input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation and the steps necessary
to put transaction data into a usable form for processing, which can be achieved either by having
the computer read data from a written or printed document or by having people key the data
directly into the system. The design of input focuses on controlling the amount of input required,
controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input
is designed in such a way that it provides security and ease of use while retaining privacy.
Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
OUTPUT DESIGN:
A quality output is one that meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to users and to
other systems through outputs. In output design, it is determined how the information is to be
displayed for immediate need, as well as the hard-copy output. It is the most important and direct
source of information for the user. Efficient and intelligent output design improves the system's
relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well-thought-out
manner; the right output must be developed while ensuring that each output element is
designed so that people find the system easy and effective to use. When analysts
design computer output, they should identify the specific output that is needed to meet the
requirements.
2. Select methods for presenting information.
3.Create document, report, or other formats that contain information produced by the
system.
The output form of an information system should accomplish one or more of the
following objectives.
Convey information about past activities, current status, or projections of the
future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.
Support Vector Machine (SVM) is one of the most popular supervised learning
algorithms; it is a supervised machine learning algorithm used for both classification and
regression.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point into the
correct category in the future. This best decision boundary is called a hyperplane.
Although SVM can handle regression problems as well, it is best suited for classification. The
objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that
distinctly classifies the data points. The dimension of the hyperplane depends upon the
number of features: if the number of input features is two, then the hyperplane is just a line;
if the number of input features is three, then the hyperplane becomes a 2-D plane. It
becomes difficult to imagine when the number of features exceeds three.
Working of the SVM algorithm:
One reasonable choice for the best hyperplane is the one that represents the largest
separation, or margin, between the two classes.
So we choose the hyperplane whose distance to the nearest data point on each
side is maximized. If such a hyperplane exists, it is known as the maximum-margin
hyperplane, or hard margin. So from the above figure, we choose L2. Let's consider a scenario
like the one shown below.
Here we have one blue ball within the boundary of the red balls. So how does SVM
classify the data? It's simple: the blue ball within the boundary of the red ones is an outlier of
the blue balls. The SVM algorithm has the ability to ignore outliers and find the best
hyperplane that maximizes the margin; SVM is robust to outliers.
For this type of data point, SVM finds the maximum margin as it did with the
previous data sets, and in addition it adds a penalty each time a point crosses the margin. The
margins in such cases are called soft margins. When there is a soft margin, the SVM tries to
minimize (1/margin) + λ·(Σ penalty). Hinge loss is a commonly used penalty: if there are no
violations there is no hinge loss, and if there are violations the hinge loss is proportional to the
distance of the violation.
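Written out formally, the soft-margin trade-off sketched above corresponds to the standard textbook
objective below (shown here only for clarity; C plays the role of the penalty weight λ, the slack
variables ξ_i measure how far each point violates the margin, and minimizing ||w||² is equivalent to
maximizing the margin 2/||w||):
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} \;+\; C \sum_{i=1}^{n} \xi_{i}
\quad\text{subject to}\quad y_{i}\,(w \cdot x_{i} + b) \;\ge\; 1 - \xi_{i}, \qquad \xi_{i} \ge 0 .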
Say our data is as shown in the figure above. SVM solves this by creating a new variable using
a kernel. We take a point x_i on the line and create a new variable y_i as a function of its
distance from the origin o; if we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates such a new variable is referred to as a kernel.
Advantages of SVM
Example: SVM can be understood with the example that we used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs; if we want a
model that can accurately identify whether it is a cat or a dog, such a model can be created by
using the SVM algorithm. We first train our model with lots of images of cats and dogs so
that it can learn their different features, and then we test it with this strange creature. Since the
support vectors create a decision boundary between these two classes (cat and dog) and choose the
extreme cases (the support vectors), the model examines the extreme cases of cat and dog and, on
the basis of the support vectors, classifies the creature as a cat. Consider the diagram below:
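As a small illustration of how such a classifier might be trained in practice, the sketch below uses
scikit-learn's SVC on synthetic data. The dataset, kernel, and parameter values are assumptions for
demonstration only and are not the training code used in this project.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for the cat/dog example above.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'rbf' is a non-linear kernel; C controls the soft-margin penalty discussed earlier.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))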
2. XGBOOST ALGORITHM
XG Boost is an optimized distributed gradient boosting library designed for efficient
and scalable training of machine learning models.
XG Boost stands for “Extreme Gradient Boosting” and it has become one of the
most popular and widely used machine learning algorithms due to its ability to handle large
datasets and its ability to achieve state-of-the-art performance in many machine learning
tasks such as classification and regression.
One of the key features of XG Boost is its efficient handling of missing values,
which allows it to handle real-world data with missing values without requiring significant
pre-processing. Additionally, XG Boost has built-in support for parallel processing, making
it possible to train models on large datasets in a reasonable amount of time.
XG Boost stands for Extreme Gradient Boosting, which was proposed by the
researchers at the University of Washington. It is a library written in C++ which optimizes
the training for Gradient Boosting.
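A short sketch of training an XGBoost classifier through its scikit-learn-style API is given below.
The synthetic data and parameter values are assumptions chosen for illustration, not the project's
exact setup (the project's Train view uses XGBClassifier with default parameters).
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees: n_estimators trees built sequentially, each correcting the last.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))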
Random Forest:
Every decision tree has high variance, but when we combine all of them together in
parallel then the resultant variance is low as each decision tree gets perfectly trained on that
particular sample data and hence the output doesn’t depend on one decision tree but multiple
decision trees. In the case of a classification problem, the final output is taken by using the
majority voting classifier. In the case of a regression problem, the final output is the mean of
all the outputs. This part is Aggregation. The basic idea behind this is to combine multiple
decision trees in determining the final output rather than relying on individual decision trees.
Random Forest has multiple decision trees as base learning models. We randomly
perform row sampling and feature sampling from the dataset forming sample datasets for
every model. This part is called Bootstrap.
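The bagging idea described above (bootstrap sampling plus aggregation by majority vote) can be
illustrated with scikit-learn's RandomForestClassifier; the data and parameters below are purely
illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=12, random_state=1)

# Each tree is trained on a bootstrap sample with a random subset of features;
# the forest's prediction is the majority vote over all trees.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features="sqrt", random_state=1)
forest.fit(X, y)
print(forest.predict(X[:5]), y[:5])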
Boosting:
Gradient Boosting:
There is a technique called the Gradient Boosted Trees whose base learner is CART
(Classification and Regression Trees).
XG Boost:
In this algorithm, decision trees are created sequentially. Weights play an
important role in XG Boost: weights are assigned to all the independent variables, which are
then fed into the decision tree that predicts results. The weights of variables predicted
wrongly by the tree are increased, and these variables are then fed to the second decision tree.
These individual classifiers/predictors are then ensembled to give a strong and more precise
model. It can work on regression, classification, ranking, and user-defined prediction
problems.
Advantages:
3.3 DESIGNING
DATA FLOW DIAGRAM:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can
be used to represent a system in terms of the input data to the system, the various processing
carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used
to model the system components: the system processes, the data used by the processes, the
external entities that interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified
by a series of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
4. The DFD is also known as a bubble chart. A DFD may be used to represent a system at
any level of abstraction, and it may be partitioned into levels that represent increasing
information flow and functional detail.
UML DIAGRAMS:
The Unified Modeling Language (UML) is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of software systems, as well as for business modeling
and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems. The UML is a very important part of
developing object-oriented software and of the software development process. The UML uses
mostly graphical notations to express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher level development concepts such as collaborations, frameworks,
patterns and components.
7. Integrate best practices.
CLASS DIAGRAM:
SEQUENCE DIAGRAM:
ACTIVITY DIAGRAM:
Urls.py:
from django.urls import path
from . import views
urlpatterns = [path("index.html", views.index, name="index"),
Views.py:
import pandas as pd
from sklearn.model_selection import train_test_split
from string import punctuation
from nltk.corpus import stopwords
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from numpy import dot
from numpy.linalg import norm
# Additional imports required by the views below (omitted in the original listing):
from django.shortcuts import render
from django.core.files.storage import FileSystemStorage
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pymysql

global uname
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def cleanNews(doc):
    # Tokenize, strip punctuation, drop non-alphabetic words, stop words and
    # single characters, then lemmatize and re-join into a cleaned string.
    tokens = doc.split()
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    tokens = [word for word in tokens if len(word) > 1]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = ' '.join(tokens)
    return tokens

X = np.load("model/X.npy")
Y = np.load("model/Y.npy")
URLS = np.load("model/URLS.npy")
def Train(request):
    if request.method == 'GET':
        output = ''
        font = '<font size="" color="black">'
        arr = ['Algorithm Name', 'Accuracy', 'Precision', 'Recall', 'FSCORE']
        output += '<table border="1" align="center"><tr>'
        for i in range(len(arr)):
            output += '<th><font size="" color="black">'+arr[i]+'</th>'
        output += "</tr>"
        # Train/test split (assumed here; the split is not shown in the original listing).
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
        xgb_cls = XGBClassifier()
        xgb_cls.fit(X, Y)
        predict = xgb_cls.predict(X_test)
        p = precision_score(y_test, predict, average='macro') * 100
        r = recall_score(y_test, predict, average='macro') * 100
        f = f1_score(y_test, predict, average='macro') * 100
        a = accuracy_score(y_test, predict) * 100
        output += '<tr><td><font size="" color="black">XGBoost</td><td><font size="" color="black">'+str(a)+'</td><td><font size="" color="black">'+str(p)+'</td><td><font size="" color="black">'+str(r)+'</td><td><font size="" color="black">'+str(f)+'</td></tr>'
        context = {'data': output}
        return render(request, 'ViewUsers.html', context)
def VerifyUser(request):
    if request.method == 'GET':
        global uname
        username = request.GET['t1']
        db_connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        db_cursor = db_connection.cursor()
        student_sql_query = "update signup set status='Accepted' where username='"+username+"'"
        db_cursor.execute(student_sql_query)
        db_connection.commit()
        print(db_cursor.rowcount, "Record Inserted")
        output = ''  # default in case the update did not affect any row
        if db_cursor.rowcount == 1:
            output = username+" account activated"
        context = {'data': output}
        return render(request, 'AdminScreen.html', context)
def SearchQueryAction(request):
    if request.method == 'POST':
        query = request.POST.get('t1', False)
        qry = query
        output = '<table border=1 align=center width=100%>'
        font = '<font size="" color="black">'
        arr = ['Query', 'Search URL', 'Rating']
        output += "<tr>"
        for i in range(len(arr)):
            output += "<th>"+font+arr[i]+"</th>"
        query = query.strip().lower()
        query = cleanNews(query)
        # tfidf_vectorizer is assumed to be fitted on the dataset at server start-up
        # (that step is not shown in this listing).
        vector = tfidf_vectorizer.transform([query]).toarray()
        vector = vector.ravel()
        for i in range(len(X)):
            # Cosine similarity between the query vector and each document vector.
            score = dot(X[i], vector)/(norm(X[i])*norm(vector))
            if score > 0.2:
                output += "<tr><td>"+font+qry+"</td>"
                output += '<td><a href="'+URLS[i]+'" target="_blank">'+font+URLS[i]+"</td>"
                output += "<td>"+font+str(score)+"</td>"
        context = {'data': output}
        return render(request, 'ViewOutput.html', context)
def ViewUsers(request):
    if request.method == 'GET':
        global uname
        output = '<table border=1 align=center width=100%>'
        font = '<font size="" color="black">'
        arr = ['Username', 'Password', 'Contact No', 'Gender', 'Email Address', 'Address', 'Status']
        output += "<tr>"
        for i in range(len(arr)):
            output += "<th>"+font+arr[i]+"</th>"
        con = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        with con:
            cur = con.cursor()
            cur.execute("select * FROM signup")
            rows = cur.fetchall()
            for row in rows:
                username = row[0]
                password = row[1]
                contact = row[2]
                gender = row[3]
                email = row[4]
                address = row[5]
                status = row[6]
                output += "<tr><td>"+font+str(username)+"</td>"
                output += "<td>"+font+password+"</td>"
                output += "<td>"+font+contact+"</td>"
                output += "<td>"+font+gender+"</td>"
                output += "<td>"+font+email+"</td>"
                output += "<td>"+font+address+"</td>"
                if status == 'Pending':
                    output += '<td><a href="VerifyUser?t1='+username+'">Click Here</a></td>'
                else:
                    output += "<td>"+font+status+"</td>"
        context = {'data': output}
        return render(request, 'ViewUsers.html', context)
def UploadDatasetAction(request):
    if request.method == 'POST':
        global uname
        dataset = request.FILES['t1']
        dataset_name = request.FILES['t1'].name
        fs = FileSystemStorage()
        fs.save('SearchEngineApp/static/files/'+dataset_name, dataset)
        output = dataset_name+' saved in database'
        context = {'data': output}
        return render(request, 'UploadDataset.html', context)

def UploadDataset(request):
    if request.method == 'GET':
        return render(request, 'UploadDataset.html', {})

def SearchQuery(request):
    if request.method == 'GET':
        return render(request, 'SearchQuery.html', {})

def UserLogin(request):
    if request.method == 'GET':
        return render(request, 'UserLogin.html', {})

def index(request):
    if request.method == 'GET':
        return render(request, 'index.html', {})

def AdminLogin(request):
    if request.method == 'GET':
        return render(request, 'AdminLogin.html', {})

def ManagerLogin(request):
    if request.method == 'GET':
        return render(request, 'ManagerLogin.html', {})

def Signup(request):
    if request.method == 'GET':
        return render(request, 'Signup.html', {})
def AdminLoginAction(request):
    global uname
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        if username == 'admin' and password == 'admin':
            uname = username
            context = {'data': 'welcome '+username}
            return render(request, 'AdminScreen.html', context)
        else:
            context = {'data': 'login failed'}
            return render(request, 'AdminLogin.html', context)

def ManagerLoginAction(request):
    global uname
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        if username == 'Manager' and password == 'Manager':
            uname = username
            context = {'data': 'welcome '+uname}
            return render(request, 'ManagerScreen.html', context)
        else:
            context = {'data': 'login failed'}
            return render(request, 'ManagerLogin.html', context)
def UserLoginAction(request):
    global uname
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        index = 0
        con = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        with con:
            cur = con.cursor()
            cur.execute("select username,password, status FROM signup")
            rows = cur.fetchall()
            for row in rows:
                if row[0] == username and password == row[1] and row[2] == "Accepted":
                    uname = username
                    index = 1
                    break
        if index == 1:
            context = {'data': 'welcome '+uname}
            return render(request, 'UserScreen.html', context)
        else:
            context = {'data': 'login failed or account not activated by admin'}
            return render(request, 'UserLogin.html', context)
def SignupAction(request):
    if request.method == 'POST':
        username = request.POST.get('t1', False)
        password = request.POST.get('t2', False)
        contact = request.POST.get('t3', False)
        gender = request.POST.get('t4', False)
        email = request.POST.get('t5', False)
        address = request.POST.get('t6', False)
        output = "none"
        con = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
        with con:
            cur = con.cursor()
            cur.execute("select username FROM signup")
            rows = cur.fetchall()
            for row in rows:
                if row[0] == username:
                    output = username+" Username already exists"
                    break
        if output == 'none':
            db_connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', database='searchengine', charset='utf8')
            db_cursor = db_connection.cursor()
            student_sql_query = "INSERT INTO signup(username,password,contact_no,gender,email,address,status) VALUES('"+username+"','"+password+"','"+contact+"','"+gender+"','"+email+"','"+address+"','Pending')"
            db_cursor.execute(student_sql_query)
            db_connection.commit()
            print(db_cursor.rowcount, "Record Inserted")
            if db_cursor.rowcount == 1:
                output = 'Signup Process Completed'
        context = {'data': output}
        return render(request, 'Signup.html', context)
models.py:
from django.db import models

class userModel(models.Model):
    name = models.CharField(max_length=50)
    email = models.EmailField()
    passwd = models.CharField(max_length=40)
    cwpasswd = models.CharField(max_length=40)
    mobileno = models.CharField(max_length=50, default="", editable=True)
    status = models.CharField(max_length=40, default="", editable=True)

    def __str__(self):
        return self.email

    class Meta:
        db_table = 'userregister'

class weightmodel(models.Model):
    filename = models.CharField(max_length=100)
    file = models.FileField(upload_to='files/pdfs/')
    weight = models.CharField(max_length=100)
    rank = models.CharField(max_length=100, default="", editable=False)
    label = models.CharField(max_length=100, default="", editable=False)

    def __str__(self):
        return self.filename

    class Meta:
        db_table = 'weight'
forms.py:
from django import forms
from user.models import *
from django.core import validators

class userForm(forms.ModelForm):
    name = forms.CharField(widget=forms.TextInput(), required=True, max_length=100)
    passwd = forms.CharField(widget=forms.PasswordInput(), required=True, max_length=100)
    cwpasswd = forms.CharField(widget=forms.PasswordInput(), required=True, max_length=100)
    email = forms.CharField(widget=forms.TextInput(), required=True)
    mobileno = forms.CharField(widget=forms.TextInput(), required=True, max_length=10,
                               validators=[validators.MaxLengthValidator(10), validators.MinLengthValidator(10)])
    status = forms.CharField(widget=forms.HiddenInput(), initial='waiting', max_length=100)

    def __str__(self):
        return self.email

    class Meta:
        model = userModel
        fields = ['name', 'passwd', 'cwpasswd', 'email', 'mobileno', 'status']
index.html:
{% load static %}
<html>
<head>
<title>Building Search Engine Using Machine Learning Technique</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link href="{% static 'style.css' %}" rel="stylesheet" type="text/css" />
</head>
<body>
<div class="main">
  <div class="main_resize">
    <div class="header">
      <div class="logo">
        <h1><span>Building Search Engine Using Machine Learning Technique</span><small></small></h1>
      </div>
    </div>
    <div class="content">
      <div class="content_bg">
        <div class="menu_nav">
          <ul>
            <li><a href="{% url 'index' %}">Home</a></li>
            <li><a href="{% url 'AdminLogin' %}">Admin Login</a></li>
            <li><a href="{% url 'ManagerLogin' %}">Manager Login</a></li>
            <li><a href="{% url 'UserLogin' %}">User Login</a></li>
            <li><a href="{% url 'Signup' %}">New User Signup Here</a></li>
          </ul>
        </div>
        <div class="hbg"><img src="{% static 'images/header_images.jpg' %}" width="915" height="286" alt="" /></div>{{ data }}
        <p align="justify"><font size="3" style="font-family: Comic Sans MS" color="black">Abstract - Building Search Engine Using Machine Learning Technique</font></p>
      </div>
    </div>
  </div>
</div>
</body>
</html>
Sample Test Cases:

S.No | Test Case | Expected Result | Result | Remarks (If Fails)
1 | User Register | User registration completes successfully. | Pass | If the user email already exists, then it fails.
2 | User Login | If the username and password are correct, then a valid page is shown. | Pass | Unknown/unregistered users will not be logged in.
3 | Manager Login | If the manager name and password are correct, then a valid page is shown. | Pass | Unregistered managers will not be logged in.
4 | Admin can activate the registered manager | Admin can activate the registered manager id. | Pass | If the manager is not found, then it won't log in.
5 | Admin Login | Admin can log in with his login credentials; on success he gets the home page. | Pass | Invalid login details are not allowed here.
6 | Admin can activate the registered users | Admin can activate the registered user id. | Pass | If the user is not found, then it won't log in.
7 | Admin can get the SVM results | By clicking SVM, it will display the SVM prediction. | Pass | Otherwise the SVM prediction won't be shown.
8 | Admin can get the XGBoost results | By clicking XGBoost, it will display the XGBoost prediction. | Pass | Otherwise the XGBoost prediction won't be shown.
9 | User login page | The user can search the weight of a particular document. | Pass | Otherwise we won't get the weight of the document.
10 | | | Pass |
CHAPTER 4
RESULTS & DISCUSSIONS
In this project, machine learning algorithms, namely SVM and XGBoost, are used to
predict the search results for a given query and to build a search engine on top of them. To
train these algorithms, website data is used, and this data is converted into numeric vectors
called TF-IDF (term frequency–inverse document frequency) vectors. A TF-IDF vector contains the
weighted frequency of each word.
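A small sketch of this TF-IDF representation, combined with the cosine-similarity matching used in
the SearchQueryAction view earlier, is given below. The documents and query are toy assumptions,
not the project's dataset.
import numpy as np
from numpy import dot
from numpy.linalg import norm
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "latest news on cyber security threats",
    "football match results and scores",
    "machine learning improves web search ranking",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()   # one TF-IDF vector per document

# Transform the query into the same vector space and score each document by cosine similarity.
query_vec = vectorizer.transform(["news on security"]).toarray().ravel()
for i, doc_vec in enumerate(X):
    score = dot(doc_vec, query_vec) / (norm(doc_vec) * norm(query_vec))
    print(round(float(score), 3), docs[i])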
In this project, the following modules have been implemented:
Admin module: the admin can log in to the application using the username and password "admin",
accept or activate new user registrations, and then train the SVM and XGBoost algorithms.
Manager module: the manager can log in to the application using the username and password
"Manager" and then upload the dataset to the application.
New User Signup: using this module, a new user can sign up with the application.
User Login: a user can log in to the application and then perform a search by giving a query.
To run the project, install MySQL and Python 3.7, then copy the contents of the DB.txt file
and paste them into MySQL to create the database.
Now double-click on the 'run.bat' file to start the Python Django server and get the screen
below.
In the above screen the server has started and built a vector from the dataset, where the first row
shows the words and the remaining rows contain the TF-IDF word frequencies. Now open a browser,
enter the URL http://127.0.0.1:8000/index.html, and press Enter to get the page below.
SCREENSHOTS:
In the above screen, click on the 'New User Signup Here' link to get the screen below.
In the above screen, the user signs up and then presses the button to get the output below.
In the above screen, the user signup process has completed; now click on 'User Login' to get the
screen below.
In the above screen, we gave a correct login but the account has not been activated by the admin,
so click on the 'Admin Login' link to log in as admin and then activate the user.
In the above screen, the admin logs in, and after login will get the screen below.
In the above screen, the admin can click on the 'View Users' link to view all users.
In the above screen, the admin can click on the 'Click Here' link to activate that user account.
In the above screen, we can see the admin has activated the user account for kumar; now the admin
can click on the 'Train SVM & XGBOOST' link to train the SVM and XGBoost machine learning
algorithms and get the output below.
In the above screen, we can see the SVM and XGBoost accuracy; of the two algorithms, XGBoost
obtained the higher accuracy. Now log out and log in as Manager.
In the above screen, the manager logs in, and after login will get the screen below.
In the above screen, the manager can click on the 'Upload Dataset' link to upload the dataset or
documents.
In the above screen, the manager browses and uploads the dataset (this file can be found inside the
'Dataset' folder); now press the button to save the dataset to the server database.
In the above screen, the dataset file has been saved in the database; now log out and log in as a
user to perform a search.
In the above screen, the user logs in, and after login will get the output below.
In the above screen, the user can click on the 'Search with Page Rank' link to search for any data.
In the above screen, I entered the query 'news on security' and pressed the button to get the
search result below.
In the above screen, the machine learning algorithm retrieved two URLs for the given query, and the
user can click on those URLs to visit the pages.
In the above screen, by clicking on a URL link the user can visit and view the page. Similarly, the
user can give any query, and if the query is available in the dataset then he will get output.
CHAPTER 5
CONCLUSION
A search engine is very useful for finding the most relevant URLs for a given keyword,
which reduces the time a user spends searching for the relevant web page. For privacy and other
reasons, we wanted to build our own search engine. The project we have built provides faster
retrieval of information using a search engine implemented with machine learning algorithms. A
traditional search engine provides a simple interface for entering a user query and displaying
results as the web addresses of relevant web pages, but it has become very challenging to obtain
suitable information that way.
For this, accuracy is a very important factor. From the above observations, it can be
concluded that XGBoost is better in terms of accuracy than SVM and ANN. Thus, a search
engine built using the XGBoost and PageRank algorithms will give better accuracy.
CHAPTER 6
REFERENCES
[1] Manika Dutta, K. L. Bansal, “A Review Paper on Various Search Engines (Google,
Yahoo, Altavista, Ask and Bing)”, International Journal on Recent and Innovation Trends in
Computing and Communication, 2016.
[2] Gunjan H. Agre, Nikita V. Mahajan, "Keyword Focused Web Crawler", International
Conference on Electronic and Communication Systems, IEEE, 2015.
[3] Tuhena Sen, Dev Kumar Chaudhary, “Contrastive Study of Simple PageRank, HITS and
Weighted PageRank Algorithms: Review”, International Conference on Cloud Computing,
Data Science & Engineering, IEEE, 2017.
[4] Michael Chau, Hsinchun Chen, "A machine learning approach to web page filtering using
content and structure analysis", Decision Support Systems, vol. 44, pp. 482–494, ScienceDirect,
2008.
[5] Taruna Kumari, Ashlesha Gupta, Ashutosh Dixit, “Comparative Study of Page Rank and
Weighted Page Rank Algorithm”, International Journal of Innovative Research in Computer
and Communication Engineering, February 2014.
[6] K. R. Srinath, "Page Ranking Algorithms – A Comparison", International Research
Journal of Engineering and Technology (IRJET), Dec 2017.
[7] S. Prabha, K. Duraiswamy, J. Indhumathi, “Comparative Analysis of Different Page
Ranking Algorithms”, International Journal of Computer and Information Engineering, 2014.
[8] Dilip Kumar Sharma, A. K. Sharma, “A Comparative Analysis of Web Page Ranking
Algorithms”, International Journal on Computer Science and Engineering, 2010.
[9] Vijay Chauhan, Arunima Jaiswal, Junaid Khalid Khan, “Web Page Ranking Using
Machine Learning Approach”, International Conference on Advanced Computing
Communication Technologies, 2015.
[10] Amanjot Kaur Sandhu, Tiewei S. Liu, "Wikipedia Search Engine: Interactive Information
Retrieval Interface Design", International Conference on Industrial and Information Systems,
2014.
[11] Neha Sharma, Rashi Agarwal, Narendra Kohli, “Review of features and machine learning
techniques for web searching”, International Conference on Advanced Computing
Communication Technologies, 2016.
[12] Sweah Liang Yong, Markus Hagenbuchner, Ah Chung Tsoi, “Ranking Web Pages using
Machine Learning Approaches”, International Conference on Web Intelligence and Intelligent
Agent Technology, 2008.
[13] B. Jaganathan, Kalyani Desikan,“Weighted Page Rank Algorithm based on In-Out
Weight of Webpages”, Indian Journal of Science and Technology, Dec-2015.