How To Create A Simple Web Crawler in PHP
A web crawler is a program that crawls through sites on the Web and indexes their URLs.
Search engines use crawlers to index URLs on the Web. Google uses a crawler written in
Python, and other search engines use different types of crawlers.
In this post I’m going to tell you how to create a simple Web Crawler in PHP.
The code shown here was written by me. It took me two days to create this simple crawler, so
imagine how much time it would take to create a perfect one. Creating a crawler is a hard task;
it's like building a robot. Let's start building a crawler.
To parse the web page at a URL, we are going to use the Simple HTML DOM class, which can be
downloaded from SourceForge. Include the file "simple_html_dom.php" and declare the variables we
are going to use:
include "simple_html_dom.php";
$crawled_urls = array(); // URLs that have already been crawled
$found_urls = array();   // URLs discovered while crawling
Next, add the helper functions we are going to use. The following function converts relative URLs
to absolute URLs:
function rel2abs($rel, $base) {
    // Already an absolute URL? Return it unchanged.
    if (parse_url($rel, PHP_URL_SCHEME) != '') {
        return $rel;
    }
    // Queries and fragments are simply appended to the base URL.
    if ($rel[0] == '#' || $rel[0] == '?') {
        return $base . $rel;
    }
    // Pull $scheme, $host and $path out of the base URL.
    extract(parse_url($base));
    // Strip the file name from the base path.
    $path = preg_replace('#/[^/]*$#', '', $path);
    // A leading slash means the link is relative to the root.
    if ($rel[0] == '/') {
        $path = '';
    }
    $abs = "$host$path/$rel";
    // Collapse "./" and "dir/../" segments.
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
    $abs = str_replace('../', '', $abs);
    return $scheme . '://' . $abs;
}
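For example (the URLs below are made-up values, only there to illustrate what rel2abs() returns):
// Hypothetical inputs, not part of the original post.
echo rel2abs('../about.html', 'http://example.com/blog/post.html') . PHP_EOL;
// prints: http://example.com/about.html
echo rel2abs('?page=2', 'http://example.com/archive') . PHP_EOL;
// prints: http://example.com/archive?page=2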
The following function turns the URLs found while crawling into proper absolute URLs:
function perfect_url($u, $b) {
    $bp = parse_url($b);
    $bpath = isset($bp['path']) ? $bp['path'] : '';
    // Reduce the base URL to "scheme://host/", defaulting the scheme to http.
    if (($bpath != '/' && $bpath != '') || $bpath == '') {
        if (empty($bp['scheme'])) {
            $scheme = 'http';
        } else {
            $scheme = $bp['scheme'];
        }
        $b = $scheme . '://' . $bp['host'] . '/';
    }
    // Protocol-relative URLs ("//example.com/...") get an explicit scheme.
    if (substr($u, 0, 2) == '//') {
        $u = 'http:' . $u;
    }
    // Anything else that is not absolute is resolved against the base URL.
    if (substr($u, 0, 4) != 'http') {
        $u = rel2abs($u, $b);
    }
    return $u;
}
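These two helpers are used by a crawl_site() function that fetches a page, pulls out every link,
normalizes it and prints it. The post doesn't list that function, so here is only a minimal
sketch of what it could look like, assuming Simple HTML DOM's file_get_html() loader and the two
arrays declared at the top; the actual code may differ:
// Minimal sketch of crawl_site() (assumed structure, not the original code).
function crawl_site($u) {
    global $crawled_urls, $found_urls;
    $html = file_get_html($u); // download and parse the page
    if ($html === false) {
        return; // skip pages that cannot be fetched
    }
    $crawled_urls[] = $u;
    foreach ($html->find('a') as $anchor) {
        $url = perfect_url($anchor->href, $u); // make the link absolute
        if (!in_array($url, $found_urls)) {
            $found_urls[] = $url;
            echo $url . PHP_EOL;
        }
    }
}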
Finally, we will call the crawl_site function to crawl a URL. I’m going to use
http://subinsb.com for crawling.
crawl_site("http://subinsb.com");
When you run the crawler now, you will get all the URLs in the page. You can crawl those found
URLs again to discover even more URLs, but you would need a fast server and a high-speed
internet connection.
A supercomputer and a 10 GB/s internet connection would be perfect for that. If you think your
computer is fast enough to crawl many URLs, change the following line in the code:
echo $url . PHP_EOL;
to :
crawl_site($url);
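If you do switch to recursive crawling, it helps to put a cap on the depth so the crawler
eventually stops. This is not part of the original code, just one possible way to limit the
recursion:
// Optional variation (not in the original post): stop after a few levels.
function crawl_site($u, $depth = 0) {
    global $crawled_urls, $found_urls;
    if ($depth > 2 || in_array($u, $crawled_urls)) {
        return; // too deep, or this URL was already crawled
    }
    $crawled_urls[] = $u;
    $html = file_get_html($u);
    if ($html === false) {
        return;
    }
    foreach ($html->find('a') as $anchor) {
        $url = perfect_url($anchor->href, $u);
        if (!in_array($url, $found_urls)) {
            $found_urls[] = $url;
            crawl_site($url, $depth + 1);
        }
    }
}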
Note: the code isn't perfect, and there may be errors when crawling some URLs. I don't
recommend crawling the found URLs again unless you have a supercomputer and a high-speed
internet connection. Feel free to make the crawler better, more awesome and faster on GitHub.
If you have any problems, suggestions or feedback, echo it in the comments. Your feedback is my
happiness.