How To Create A Simple Web Crawler in PHP
A web crawler is a program that crawls through sites on the Web and indexes their URLs.
Search engines use crawlers to index URLs on the Web. Google uses a crawler written in
Python, and other search engines use different types of crawlers.
In this post I’m going to tell you how to create a simple Web Crawler in PHP.
The code shown here was written by me. It took me two days to create this simple crawler, so
imagine how much time it would take to create a perfect one. Creating a crawler is a hard task;
it's like building a robot. Let's start building a crawler.
To parse the web page at a URL, we are going to use the Simple HTML DOM class, which can be
downloaded from SourceForge. Include the file "simple_html_dom.php" and declare the variables we
are going to use:
include "simple_html_dom.php";
$crawled_urls = array(); // URLs that have already been crawled
$found_urls = array();   // URLs discovered while crawling
Next, add the helper functions we are going to use. The following function converts relative URLs
to absolute URLs:
function rel2abs($rel, $base) {
    // Already an absolute URL? Return it unchanged.
    if (parse_url($rel, PHP_URL_SCHEME) != '') {
        return $rel;
    }
    // Queries and fragments are simply appended to the base URL.
    if ($rel[0] == '#' || $rel[0] == '?') {
        return $base . $rel;
    }
    // Pull $scheme, $host and $path out of the base URL.
    extract(parse_url($base));
    // Strip the file name from the base path.
    $path = preg_replace('#/[^/]*$#', '', $path);
    // A leading slash means the link is relative to the root.
    if ($rel[0] == '/') {
        $path = '';
    }
    $abs = "$host$path/$rel";
    // Collapse "./" and "dir/../" segments.
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
    $abs = str_replace('../', '', $abs);
    return $scheme . '://' . $abs;
}
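For example (the URLs below are made-up values, only there to illustrate what rel2abs() returns):
// Hypothetical inputs, not part of the original post.
echo rel2abs('../about.html', 'http://example.com/blog/post.html') . PHP_EOL;
// prints: http://example.com/about.html
echo rel2abs('?page=2', 'http://example.com/archive') . PHP_EOL;
// prints: http://example.com/archive?page=2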
The following function turns the URLs found while crawling into proper absolute URLs:
function perfect_url($u, $b) {
    $bp = parse_url($b);
    $bpath = isset($bp['path']) ? $bp['path'] : '';
    // Reduce the base URL to "scheme://host/", defaulting the scheme to http.
    if (($bpath != '/' && $bpath != '') || $bpath == '') {
        if (empty($bp['scheme'])) {
            $scheme = 'http';
        } else {
            $scheme = $bp['scheme'];
        }
        $b = $scheme . '://' . $bp['host'] . '/';
    }
    // Protocol-relative URLs ("//example.com/...") get an explicit scheme.
    if (substr($u, 0, 2) == '//') {
        $u = 'http:' . $u;
    }
    // Anything else that is not absolute is resolved against the base URL.
    if (substr($u, 0, 4) != 'http') {
        $u = rel2abs($u, $b);
    }
    return $u;
}
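These two helpers are used by a crawl_site() function that fetches a page, pulls out every link,
normalizes it and prints it. The post doesn't list that function, so here is only a minimal
sketch of what it could look like, assuming Simple HTML DOM's file_get_html() loader and the two
arrays declared at the top; the actual code may differ:
// Minimal sketch of crawl_site() (assumed structure, not the original code).
function crawl_site($u) {
    global $crawled_urls, $found_urls;
    $html = file_get_html($u); // download and parse the page
    if ($html === false) {
        return; // skip pages that cannot be fetched
    }
    $crawled_urls[] = $u;
    foreach ($html->find('a') as $anchor) {
        $url = perfect_url($anchor->href, $u); // make the link absolute
        if (!in_array($url, $found_urls)) {
            $found_urls[] = $url;
            echo $url . PHP_EOL;
        }
    }
}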
Finally, we will call the crawl_site function to crawl a URL. I’m going to use
http://subinsb.com for crawling.
crawl_site("http://subinsb.com");
When you run the crawler now, you will get all the URLs in the page. You can crawl those found
URLs again to discover even more URLs, but you would need a fast server and a high-speed
internet connection.
A supercomputer and a 10 GB/s internet connection would be perfect for that. If you think your
computer is fast enough to crawl many URLs, change the following line in the code:
echo $url . PHP_EOL;
to :
crawl_site($url);
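If you do switch to recursive crawling, it helps to put a cap on the depth so the crawler
eventually stops. This is not part of the original code, just one possible way to limit the
recursion:
// Optional variation (not in the original post): stop after a few levels.
function crawl_site($u, $depth = 0) {
    global $crawled_urls, $found_urls;
    if ($depth > 2 || in_array($u, $crawled_urls)) {
        return; // too deep, or this URL was already crawled
    }
    $crawled_urls[] = $u;
    $html = file_get_html($u);
    if ($html === false) {
        return;
    }
    foreach ($html->find('a') as $anchor) {
        $url = perfect_url($anchor->href, $u);
        if (!in_array($url, $found_urls)) {
            $found_urls[] = $url;
            crawl_site($url, $depth + 1);
        }
    }
}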
Note: the code isn't perfect, and there may be errors when crawling some URLs. I don't
recommend crawling the found URLs again unless you have a supercomputer and a high-speed
internet connection. Feel free to make the crawler better, more awesome and faster on GitHub.
If you have any problems, suggestions or feedback, echo it in the comments. Your feedback is my
happiness.