Java Web Crawler
I. INTRODUCTION
The World Wide Web currently contains nearly 2 billion indexed web pages [1], a figure that does not count un-indexed pages, whose number is far larger. The primary method of indexing and searching these sites is the web crawler: an application that
systematically browses the web, going from link to link, and
indexing the sites it comes upon. These sites can then be stored in memory (for smaller, single-use crawlers) or in a database for later traversal. One well-known web crawler is Googlebot, which Google uses to crawl and index the sites that later appear in its search engine [2]. There are multiple theories about how to achieve the best crawl performance; in practice, however, the situation at hand determines the most appropriate and efficient crawling method.
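To make the link-to-link traversal concrete, a minimal breadth-first crawl in Java might look like the following sketch. It uses only the JDK's java.net.http client (Java 11+) and a crude regular expression for link extraction; the seed URL and page limit are illustrative assumptions, not details taken from this paper.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal breadth-first crawler: fetch a page, extract its links,
// and enqueue any URL that has not been seen before.
public class SimpleCrawler {
    // Crude link extraction; a real crawler would use an HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

    public static void main(String[] args) {
        String seed = "https://example.com"; // hypothetical starting point
        int maxPages = 50;                   // bound the crawl for this sketch

        HttpClient client = HttpClient.newHttpClient();
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>(); // in-memory record of seen URLs
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // already indexed, skip it
            try {
                HttpRequest request =
                    HttpRequest.newBuilder(URI.create(url)).GET().build();
                String body = client
                    .send(request, HttpResponse.BodyHandlers.ofString())
                    .body();
                System.out.println("Indexed: " + url);

                // Follow every absolute link found on the page.
                Matcher m = HREF.matcher(body);
                while (m.find()) {
                    frontier.add(m.group(1));
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }
}

Here the visited set plays the role of the in-memory storage mentioned above; swapping it and the frontier for database-backed equivalents turns the sketch into the persistent variant.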
With many web crawlers, the process is to visit a top-level
domain, search for links within the page, and follow those
links. Meanwhile, the crawler caches these links in a database of some sort and passes control of that database to a searching application, potentially a search engine, to use for searching. This approach is highly efficient if the goal is to build a database that will be used many times over and the storage space is available. However, it can be excessive for a simple, one-time query over a specific domain.
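The paper does not specify a storage schema, but the caching step could be sketched as follows, assuming a SQLite database reached through JDBC (the sqlite-jdbc driver on the classpath, and a hypothetical links table):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Persists discovered links so a separate searching application
// can traverse or query them later without re-crawling.
public class LinkStore implements AutoCloseable {
    private final Connection conn;

    public LinkStore(String jdbcUrl) throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl);
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS links ("
                     + "url TEXT PRIMARY KEY, "       // one row per unique URL
                     + "source TEXT, "                // page the link was found on
                     + "crawled INTEGER DEFAULT 0)"); // 0 = queued, 1 = visited
        }
    }

    // Insert a newly discovered link; duplicates are ignored
    // because url is the primary key (SQLite's INSERT OR IGNORE).
    public void save(String url, String source) throws SQLException {
        String sql = "INSERT OR IGNORE INTO links(url, source) VALUES(?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, url);
            ps.setString(2, source);
            ps.executeUpdate();
        }
    }

    @Override
    public void close() throws SQLException {
        conn.close();
    }
}

A search application handed such a database can read the rows where crawled = 0 to continue the traversal, or query the full table repeatedly without fetching any page again.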
The crawler discussed in this paper systematically crawls
and matches query strings based on a user's input on a case-by-case basis. The user inputs their targeted domain, the query they are looking for, and their choices for searching.
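As an illustration of that interaction, the sketch below reads a target domain and a query string from the console and reports whether the query occurs on the fetched page; the class name, prompts, and case-insensitive matching are assumptions made for the example, not the paper's actual implementation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Scanner;

// Prompts the user for a domain and a query string, then checks
// whether the query appears on the fetched page (case-insensitive).
public class QueryMatcher {
    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(System.in);
        System.out.print("Target domain: ");
        String domain = in.nextLine().trim(); // e.g. https://example.com
        System.out.print("Query string: ");
        String query = in.nextLine().trim();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request =
            HttpRequest.newBuilder(URI.create(domain)).GET().build();
        String body = client
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();

        boolean match = body.toLowerCase().contains(query.toLowerCase());
        System.out.println(match
            ? "Match found on " + domain
            : "No match on " + domain);
    }
}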