How To Create A Simple Web Crawler in PHP

This document discusses how to create a simple web crawler in PHP. It explains that a web crawler indexes URLs on the web by crawling through sites. It provides code to parse web pages using Simple HTML Dom, convert relative URLs to absolute URLs, and a core crawling function. The crawling function recursively calls itself to crawl additional URLs found on pages. The summary notes that while the code provides a starting point, a perfect crawler would require significant computing resources to crawl many URLs.

Uploaded by

Gabriel Eivazian

How To Create A Simple Web Crawler in PHP

A web crawler is a program that crawls through the sites on the Web and indexes their URLs.
Search engines use crawlers to index URLs on the Web; Google uses a crawler written in
Python, and other search engines use different types of crawlers.
In this post I'm going to show you how to create a simple web crawler in PHP.
The code shown here was created by me. It took me two days to create this simple crawler, so how
much time would it take to create a perfect one? Creating a crawler is a very hard task; it's like
building a robot. Let's start building a crawler.
Download Demo
To parse the web page at a URL, we are going to use the Simple HTML DOM class, which can be
downloaded from SourceForge. Include the file "simple_html_dom.php" and declare the variables we
are going to use:
include "simple_html_dom.php";
$crawled_urls = array(); // URLs already crawled, keyed by urlencoded URL, with timestamps
$found_urls = array();   // URLs discovered so far, keyed by urlencoded URL

Then, add the functions we are going to use. The following function converts relative URLs to
absolute URLs:
function rel2abs($rel, $base) {
    // Already an absolute URL? Return it unchanged.
    if (parse_url($rel, PHP_URL_SCHEME) != '') {
        return $rel;
    }
    // Fragment or query string: just append it to the base URL.
    if ($rel[0] == '#' || $rel[0] == '?') {
        return $base . $rel;
    }
    // Pull $scheme, $host and $path out of the base URL.
    extract(parse_url($base));
    // Drop the file name from the base path.
    $path = preg_replace('#/[^/]*$#', '', $path);
    if ($rel[0] == '/') {
        $path = '';
    }
    $abs = "$host$path/$rel";
    // Collapse "./" and "dir/../" segments.
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
    $abs = str_replace('../', '', $abs);
    return $scheme . '://' . $abs;
}
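The rel2abs() function leans on PHP's built-in parse_url() via extract(). As a quick illustration of the pieces it works with (the URL is a made-up example):

```php
<?php
// parse_url() splits a URL into components; rel2abs() above pulls
// $scheme, $host and $path out of this array with extract().
$parts = parse_url('http://example.com/blog/post.html');
print_r($parts);
// $parts contains: scheme => http, host => example.com, path => /blog/post.html
```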

The following function converts the URLs found while crawling into real, absolute URLs:
function perfect_url($u, $b) {
    $bp = parse_url($b);
    // Reduce the base URL to "scheme://host/".
    if (($bp['path'] != '/' && $bp['path'] != '') || $bp['path'] == '') {
        if ($bp['scheme'] == '') {
            $scheme = 'http';
        } else {
            $scheme = $bp['scheme'];
        }
        $b = $scheme . '://' . $bp['host'] . '/';
    }
    // Protocol-relative URL ("//host/path"): default it to http.
    if (substr($u, 0, 2) == '//') {
        $u = 'http:' . $u;
    }
    // Relative URL: resolve it against the base.
    if (substr($u, 0, 4) != 'http') {
        $u = rel2abs($u, $b);
    }
    return $u;
}
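The protocol-relative case is the easiest to see in isolation. Here is a minimal standalone demo of that branch (the CDN URL is hypothetical):

```php
<?php
// "//host/path" URLs keep the host but inherit the page's scheme;
// like perfect_url() above, we simply default them to http.
$u = '//cdn.example.com/lib.js';
if (substr($u, 0, 2) == '//') {
    $u = 'http:' . $u;
}
echo $u . PHP_EOL; // http://cdn.example.com/lib.js
```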

This code is the core of the crawler:


function crawl_site($u) {
    global $crawled_urls, $found_urls;
    $uen = urlencode($u);
    // Crawl the URL only if it hasn't been crawled in the last 25 seconds.
    if (array_key_exists($uen, $crawled_urls) == 0 ||
        $crawled_urls[$uen] < date('YmdHis', strtotime('-25 seconds', time()))) {
        $html = file_get_html($u);
        $crawled_urls[$uen] = date('YmdHis');
        foreach ($html->find('a') as $li) {
            $url = perfect_url($li->href, $u);
            $enurl = urlencode($url);
            // Skip empty links, "mailto:"/"javascript:" links and duplicates.
            if ($url != '' && substr($url, 0, 4) != 'mail' &&
                substr($url, 0, 4) != 'java' &&
                array_key_exists($enurl, $found_urls) == 0) {
                $found_urls[$enurl] = 1;
                echo $url . PHP_EOL;
            }
        }
    }
}
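If you just want to experiment without downloading Simple HTML DOM, the find('a') step can be approximated with a regular expression. This is a rough stand-in, not a real HTML parser, and the sample markup below is made up:

```php
<?php
// Simplified link extraction: pull href values out of an HTML string.
// Good enough for experiments; a real parser handles far more edge cases.
function extract_links($html) {
    preg_match_all('#<a[^>]+href=["\']([^"\']+)["\']#i', $html, $m);
    return $m[1];
}

$page = '<p><a href="/about">About</a> <a href="http://example.com/">Home</a></p>';
$links = extract_links($page);
print_r($links); // the two href values: /about and http://example.com/
```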

Finally, we will call the crawl_site function to crawl a URL. I’m going to use
http://subinsb.com for crawling.
crawl_site("http://subinsb.com");

When you run the PHP crawler now, you will get all the URLs on the page. You can crawl those
found URLs again to find more URLs, but you would need a fast server and a high-speed internet
connection; a supercomputer and a 10 GB/second connection would be perfect for that. If you
think your computer is fast and can crawl many URLs, then change the following line in the
code:
echo $url . PHP_EOL;

to :
crawl_site($url);
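If you do flip that line, note that the recursion has no depth limit, so it will run until memory or bandwidth gives out. One safer variation (my own sketch, not part of the original code) replaces recursion with a queue and an explicit depth cap; the page fetcher is injected so the example runs offline:

```php
<?php
// Iterative, depth-limited crawl. A queue of (url, depth) pairs replaces
// the recursive call, and $max_depth caps how far we follow links.
// $fetch maps a URL to its outgoing links; swap in a function built on
// file_get_html() for real crawling. All URLs below are hypothetical.
function crawl_bfs($start, $fetch, $max_depth = 2) {
    $seen  = array($start => true);
    $queue = array(array($start, 0));
    $order = array();
    while ($queue) {
        list($u, $d) = array_shift($queue);
        $order[] = $u;
        if ($d >= $max_depth) {
            continue; // deep enough: record the URL but don't expand it
        }
        foreach ($fetch($u) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = array($link, $d + 1);
            }
        }
    }
    return $order;
}

// Tiny in-memory "web" standing in for real pages:
$web = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/c'),
);
$fetch = function ($u) use ($web) {
    return isset($web[$u]) ? $web[$u] : array();
};
$crawled = crawl_bfs('http://example.com/', $fetch);
print_r($crawled);
```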

Note: the code isn't perfect; there may be errors when crawling some URLs. I don't
recommend crawling the found URLs again unless you have a supercomputer and a high-speed
internet connection. Feel free to make the crawler better, awesome and fast on GitHub.
If you have any problems, suggestions or feedback, echo it in the comments. Your feedback is my
happiness.
