A Practical Guide To Web Scraping
Sameer Borate

UNDERSTANDING WEB SCRAPING
1.1 Web Scraping Defined
<html>
<head>
<title>Hello HTML</title>
</head>
<body>
<p>Hello World!</p>
</body>
</html>
Hello World!
If you are scraping for the sole purpose of using someone else's
intellectual property on your own website, then you are clearly
violating copyright laws; this is a no-brainer. Likewise, if you are
scraping data from a competitor's site and republishing it on your
own site, that is clearly illegal.
Also, even if you are not using the scraper for any illegal data
gathering, your scraper may load the target server with so many
requests that it impairs the server, which violates the site's terms
of service. So make sure your scraper does not in any way degrade
the performance of the target server.
Now, with the legal issues out of the way (but still in sight),
we are ready to move on to the coding part.
HTTP: A QUICK OVERVIEW
2.1 HTTP Overview
GET /index.php?page=3&limit=10 HTTP/1.1
Host: example.com
<scheme>://<user>:<pass>@<host>:<port>/<path>?<query>#<frag>
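PHP's built-in parse_url function splits a url into exactly these components; a quick sketch:

<?php
/* Break a url into the components named above */
print_r(parse_url('http://user:pass@example.com:8080/path/index.php?page=3#top'));
?>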
Cookies are the most common way to identify individual users and
allow persistent sessions. Cookies were first developed by Netscape
but are now supported by all major browsers.
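With curl, the usual way to hold on to cookies between scraping requests is the cookie-jar pattern; a minimal sketch (the url is illustrative):

<?php
$s = curl_init('http://example.com/');
curl_setopt($s, CURLOPT_COOKIEJAR, 'cookie.txt');    /* write received cookies here on close */
curl_setopt($s, CURLOPT_COOKIEFILE, 'cookie.txt');   /* send stored cookies back on request */
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($s);
curl_close($s);
?>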
SCRAPING TOOLBOX
3.1 Your Scraping Toolbox
As with any other task, starting with a good set of tools helps you
work efficiently, and web scraping is no different. Scraping can be
done programmatically, using scripting languages like PHP or Ruby,
or with tools such as wget or curl. Although the latter do not
provide the flexibility of a scripting language, they are useful
tools that will come in handy, and each can be used independently
or in combination to accomplish certain scraping tasks. Our primary
goal in this book is to use PHP to retrieve web pages and scrape
them for the content we are interested in.
Scraping data from web pages with raw PHP is not an easy task and
requires a good knowledge of Regular Expressions. However, some
excellent libraries make it possible to parse data from a web page
without any knowledge of Regular Expressions. This does not mean
that Regular Expressions are not required: a working knowledge of
them is an essential skill for a programmer and can help you
immensely with your scraping work.
http://simplehtmldom.sourceforge.net/
$ch = curl_init('http://example.com');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://example.com');
<?php
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec($ch);   /* $res now holds the page html */
curl_close($ch);
?>
More information about curl and its various options is given in
Appendix A.
Let us start with a real-world example to get a feel for the
library. The following is a straightforward program that searches
Google for the keyword 'flower' and prints all the links on the
page.
<?php
require_once('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.google.com/search?q=flower');
?>
Once the above line is executed, the $html variable holds the
simplehtmldom object containing the HTML content for the given url.
Once we have our page DOM, we are ready to query it with the
'find' method of the library. In our example we are searching
for the <a> link element.
$links = $html->find('a');
foreach($links as $element)
{
    echo $element->href . '<br />';
}

Each element also carries its attributes. Printing the attribute
names of a link element, for example with
print_r(array_keys($element->attr)), may return the following.

Array
(
    [0] => href
    [1] => rel
    [2] => title
)
echo $element->title;
Many times we do not need the attributes but the actual text within
the DOM element, for example the text within an h3 tag. For this we
can use the 'plaintext' or 'innertext' properties. 'innertext'
returns the raw html content within the specified element, whereas
'plaintext' returns the plain text without any html. There is one
other property, 'outertext', which returns the node's content along
with the element's own tag.
The titles on the results page are all within h3 tags, so we will
need to search for those; we can use the code given below. Notice
how easy it is to change the search element within the find method.
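A minimal sketch of that search, assuming the titles sit in h3 tags as described:

foreach($html->find('h3') as $h3)
{
    echo $h3->plaintext . '<br />';
}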
If you are using curl you can use the following instead.
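A sketch of that curl version, which also saves the page locally (the filename is illustrative):

<?php
$ch = curl_init('http://www.google.com/search?q=flower');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
file_put_contents('index.html', $html);   /* keep a local copy for testing */
?>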
Now that you have a local copy of the index page, you can use
it in your scraping code.
Debugging cURL
Seldom does new code work correctly the first time, and curl code
is no exception. Many times it may just not work correctly or may
return an error. The curl_getinfo function enables you to view the
requests being sent out by cURL, which can be quite useful when
debugging. Below is an example of this feature in action.
$ch = curl_init('https://www.google.com/search?q=flower&start=0');
curl_setopt($ch, CURLINFO_HEADER_OUT, true);   /* record the outgoing request headers */
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$request = curl_getinfo($ch, CURLINFO_HEADER_OUT);
curl_close($ch);
The $request variable will now hold the request headers sent out by
curl. You can check whether these are correct and modify the code
accordingly.
EXPLORING SIMPLEHTMLDOM
4.1 Simplehtmldom in detail
http://localhost/scrape/template/index.html
str_get_html($str,
$lowercase=true, // Force Lowercase for tag names
$forceTagsClosed=true,
$target_charset = DEFAULT_TARGET_CHARSET,
$stripRN=true, // Strip NewLine characters
$defaultBRText=DEFAULT_BR_TEXT)
file_get_html($url,
$use_include_path = false,
$context=null,
$offset = -1,
$maxLen=-1,
$lowercase = true,
$forceTagsClosed=true,
$target_charset = DEFAULT_TARGET_CHARSET,
$stripRN=true,
$defaultBRText=DEFAULT_BR_TEXT)
The last few parameters are the same as for the str_get_html
function, while the first five parameters are the same as for the
PHP function file_get_contents, as file_get_html calls
this function internally.
$template = 'http://localhost/scrape/template/index.html';
$html = file_get_html($template);
echo $html->plaintext;
This will print all the page content without the html tags, i.e.
the plain text of the page. A shortcut way would be as follows.
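One plausible form, chaining the call directly:

echo file_get_html($template)->plaintext;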
If you need plain-text content from some external site page you
can instead use the following.
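For example (the site here is illustrative):

echo file_get_html('http://www.example.com/')->plaintext;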
Say, for example, that you want to get all the link titles in the
sidebar of our example template. Using Firebug we find that the
links are all under a ul tag with the class sidemenu, so we can ask
the find method to search for links under that element.
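A sketch of that search, assuming the markup described above:

$links = $html->find('.sidemenu a');

foreach($links as $link)
{
    echo $link->title . '<br />';
}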
print_r(array_keys($link->attr));
Array
(
[0] => title
[1] => href
)
print_r($link->getAllAttributes());
Will print:
Array
(
[title] => side menu 5
[href] => #
)
if($link->hasAttribute('title'))
{
echo $link->getAttribute('title');
}
We could also have used the following line to get to the links,
but the ul is redundant in this case.
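One plausible form of that line, spelling out the tag as well:

$links = $html->find('ul.sidemenu a');   /* the ul is redundant; the class alone finds the menu */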
Let us say you want to iterate over all the div elements in our
template page which have the class ct, and print their contents.
The code will be as follows.

$findDivs = $html->find('div[class=ct]');
foreach($findDivs as $findDiv)
{
    echo $findDiv->plaintext . '<br />';
}
For example, if you want to search for all the <p> tags within
<div> tags, we can write it as follows.

$findDivs = $html->find('div');
foreach($findDivs as $findDiv)
{
    foreach($findDiv->find('p') as $p)
    {
        echo $p->plaintext . '<br />';
    }
}
The best way to get all the table content is to find and iterate
over each <tr> element and then search for the <td> element.
The following shows the entire code.
$table = $html->find('table[id=mytable]');
Note that in the above example, if there is only one table with the
id 'mytable', you may also write the code as follows. Notice how we
are indexing the $table variable through find's second parameter.
$table = $html->find('table[id=mytable]', 0);
$table_array = array();
$row = 0;
$col = 0;
$table = $html->find('table[id=mytable]');
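A plausible continuation of the listing, iterating the rows and
using a for loop over the columns as described below:

foreach($table[0]->find('tr') as $tr)
{
    $tds = $tr->find('td');
    for($col = 0; $col < count($tds); $col++)
    {
        $table_array[$row][$col] = $tds[$col]->plaintext;
    }
    $row++;
}

print_r($table_array);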
Notice how we have used a for loop to iterate over the table
columns. You can also access individual columns or rows using
an index instead of using a for loop.
$td = $tr->find('td');
echo $td[1]; // echo the second column of the table
<!DOCTYPE html>
<html>
<head>
<title>Web Scraping Template</title>
</head>
<body>
<div id="container" class="clearfix">
<div id="menucont">
<ul>
<li><a title="" href="#" class="active">Home</a></li>
<li><a title="" href="#">About Us</a></li>
<li><a title="" href="#">Blog</a></li>
You may notice that the div tag with the id 'container' is the
child of the body tag. If we use the following code on our
template, it will return the body tag as the parent of the
'container' div. In this way we can walk the DOM tree using the
parent method.
$findDivs = $html->find('div[id=container]');
foreach($findDivs as $findDiv)
{
echo $findDiv->parent()->tag; // Prints 'body'
}
Let us say we want to find all the p tags, but only those that are
the immediate children of a div element with a class named ct.
We can use the parent method here. The HTML snippet is given below.
<div class="ct">
<p>Div 1. … pellentesque…</p>
</div>
$findDivs = $html->find('p');
foreach($findDivs as $findDiv)
{
if ($findDiv->parent()->class == 'ct' &&
$findDiv->parent()->tag == 'div')
{
echo $findDiv->plaintext;
}
}
$findDivs = $html->find('div[class=ct]');
foreach($findDivs as $findDiv)
{
    foreach($findDiv->children() as $child)
    {
        echo $child->tag;
    }
}
// Prints 'p'
$findDivs = $html->find('div[id=container]');
foreach($findDivs as $findDiv) {
echo $findDiv->children(1)->children(0);
}
$images = $html->find('img');
foreach($images as $image)
{
    echo $image->src;
}
Note that the image src values are relative paths, which we need to
convert to absolute urls before downloading the images. Once we are
able to correctly get the src of each image, we can download it
using the PHP function file_get_contents().
$fp = fopen('image_urls.txt', 'w');

foreach($images as $image)
{
    /* Get the image source */
    $image_src = $image->src;

    /* Prefix the site url to make the path absolute and save it */
    fwrite($fp, $url . $image_src . PHP_EOL);
}

fclose($fp);
In the above example we have stored all the image urls in the
'image_urls.txt' file. Later we can use this file to download the
images with another script, either using file_get_contents() or
using curl. We have already seen an example using
file_get_contents; below is an example using curl.
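A minimal sketch of the curl version, assuming image_urls.txt holds one absolute url per line (the filenames are illustrative):

<?php
$urls = file('image_urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach($urls as $image_url)
{
    $ch = curl_init($image_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    curl_close($ch);

    /* Save each image under its base filename */
    file_put_contents(basename($image_url), $data);
}
?>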
Saving all the image urls in a file has another major advantage:
you can use a utility like Wget to retrieve the images.
4.1.9 WGET
http://gnuwin32.sourceforge.net/packages/wget.htm
Once you have installed Wget, you can ask it to read the urls from
the image_urls.txt file. Wget will open the file, retrieve the urls
and save each image to a specified directory, or to the current
directory.
d:\wget>wget -i image_urls.txt
Wget is a powerful and flexible tool that can help you in your
scraping efforts. There are many options, and it will help you
immensely to read the documentation carefully and get familiar
with its innumerable features. The following is the general format
of Wget.

wget [option]... [URL]...
SCRAPING AUTHENTICATED PAGES
5.1 Authenticated sites
The server issues this challenge using the HTTP 401 Unauthorized
response code together with a WWW-Authenticate HTTP header.
echo base64_encode('jonny:whitecollar');
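The output is the string that follows the word 'Basic' in the Authorization header; with the credentials above, the request header reads:

Authorization: Basic am9ubnk6d2hpdGVjb2xsYXI=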
<?php
$username = "admin";
$password = "adminauth";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/index.php');
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_USERPWD, "$username:$password");
curl_setopt($s, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
$output = curl_exec($s);
curl_close($s);
echo $output;
?>
<?php
$username = "admin";
$password = "adminauth";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/');
curl_setopt($s, CURLOPT_HEADER, 1);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_USERPWD, "$username:$password");
curl_setopt($s, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$page_data = curl_exec($s);
echo $page_data;
?>
<?php
$blog_url = "http://www.wordpress-blog.com/";
$blog_name = "wordpress-blog.com";
$username = "admin";
$password = "your-admin-password";
$post_data = "log={$username}&pwd={$password}&wp-
submit=Log+In&redirect_to=http%3A%2F%2F{$blog_name}%2Fwp-
admin%2F&testcookie=1";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-login.php');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$downloaded_page = curl_exec($s);
?>
/* Point the same session at the 'All Posts' page (wp-admin/edit.php)
   and reuse the login cookie saved above */
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-admin/edit.php');
curl_setopt($s, CURLOPT_COOKIEFILE, 'cookie.txt');
$html = curl_exec($s);
/* $html now contains data for the 'edit.php' page */
echo $html;
<span class="pending-count">6</span>
We now use the 'find' method to get the comment count, returning
the first node that matches the search. This is specified using the
second parameter of 'find', 0 in our example.
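That line, as a sketch:

$data = $html->find('span[class=pending-count]', 0);
echo $data->plaintext;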
You can also see the POST details and other headers, such as the
Content-Length of the login request, by clicking on the
'wp-login.php' entry.
SCRAPING WITH REGULAR EXPRESSIONS
6.1 Regular Expressions: A Quick Introduction
The following short example will get the character encoding for
a particular web page. This makes use of the PHP RegEx functions
to search for the required content. Here we are looking for the
'charset' attribute, which includes the encoding information.
<?php
/* Fetch the page to examine (the url here is illustrative) */
$html = file_get_contents('http://www.example.com/');
preg_match('/charset\=.*?(\'|\"|\s)/i', $html, $matches);
echo strtoupper($matches[0]);
?>
The important part of the code is the Regex to search for the
'charset' pattern.
'/charset\=.*?(\'|\"|\s)/i'
You may have noticed the character 'i' in the regular expression
after the closing delimiter. This makes the match case-insensitive.
The following code enables you to get a list of all the images in
a web page along with their attributes, such as src, height,
width, alt etc. For this example we will separate the regular
expression search from the main program, as it is a bit long and
is better expressed as a function. We call this function
'parseTag', and it also extracts the tag attributes of the image.
The complete function is given below; the attribute regex is
written so that capture group 1 holds the attribute names and
group 4 their values, which is exactly what the array_combine
call expects.
function parseTag($html, $tag)
{
    $raw = array();

    /* Grab every opening tag of the given type with its attribute string */
    preg_match_all('/<' . $tag . '\s+([^>]+)>/i', $html, $tags);

    foreach($tags[1] as $attributes)
    {
        /* Group 1 captures attribute names, group 4 the quoted values */
        preg_match_all('/([\w\-]+)\s*=\s*((\'|")(.*?)\3)/', $attributes, $pairs);

        if(count($pairs[1]) > 0) {
            $raw[] = array_combine($pairs[1], $pairs[4]);
        }
    }

    return $raw;
}
<?php
/* curl version */
$url = 'http://localhost/scrape/template/index.html';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$links = parseTag($html, 'img');
print_r($links);
?>
<?php
/* file_get_contents version; parseTag expects a raw html string */
$url = 'http://localhost/scrape/template/index.html';
$html = file_get_contents($url);

$links = parseTag($html, 'img');
print_r($links);
?>
Array
(
[0] => Array
(
[src] => images/flower1.jpg
[class] => imageb
[alt] => flower 1
)
You can also print the src attribute of each image within a loop.
foreach($links as $link) {
echo $link['src'] . '<br>';
}
SCRAPING AJAX CONTENT
7.1 JavaScript and the rise of Ajax
As you can see there are many requests, GET as well as POST, and it
is your task to locate the url that retrieves the Ajax content you
are interested in. For example, a GET request may take the form of
the following url, where the page parameter varies with each
request. The url below could be for retrieving a list of books in
the category 'science'. There may be a total of 600 books, with 20
books returned per request, which brings the total page count to 30.
http://some-site.com/request/get-book-list.php?page=6&cat=10
<?php
/* Walk through all 30 pages of the Ajax endpoint */
for($page = 1; $page <= 30; $page++)
{
    $ch = curl_init("http://some-site.com/request/get-book-list.php?page={$page}&cat=10");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $page_data = curl_exec($ch);
    curl_close($ch);
    /* Do something with the data */
}
?>
<?php
$url = "http://example/login.php?login=1";
$username = "jamest";
$password = "your-password";
$post_data = "username={$username}&password={$password}";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
$page_data = curl_exec($s);
curl_close($s);
?>
Decoding Ajax calls and extracting the correct urls is a skill you
will acquire over time. It is an art, and requires some patience
and trial and error to get right, but it is worth the effort.
7.2 PhantomJS
http://phantomjs.org/download.html
console.log('Hello, World!');
phantom.exit();
c:\>phantomjs test.js
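Beyond hello world, the point of PhantomJS for scraping is that it executes a page's JavaScript before you read the DOM; a minimal sketch (the url is illustrative):

var page = require('webpage').create();

page.open('http://example.com/', function(status) {
    if (status === 'success') {
        console.log(page.content);   // the html after JavaScript has run
    }
    phantom.exit();
});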
COOKBOOK
8.1 Get character encoding for a web page
Get character encoding data for a web page such as UTF-8
The Code
The following short example will get the character encoding for
a particular web page. This makes use of the powerful PHP RegEx
functions to search for the required content. Here we are looking
for the 'charset' attribute, which includes the encoding
information.
<?php
/**
 * Recipe 1
 * Get the encoding for a particular page.
 */
/* Fetch the page to examine (the url here is illustrative) */
$html = file_get_contents('http://www.example.com/');
preg_match('/charset\=.*?(\'|\"|\s)/i', $html, $matches);

if(strtoupper($matches[0]) == '') {
    echo 'auto';
} else {
    echo strtoupper($matches[0]);
}
?>
The Code
The code uses the simplehtmldom file_get_html function to get the
web page content, which is then searched for the favicon link. If
remote url fetching is disabled on your server, you can instead use
curl to download the page. The favicon link is then passed to curl,
which does the actual task of downloading the icon and saving it to
a file. The important curl option used here is CURLOPT_FILE; we
pass a file handle to this option, into which the icon will be
saved. In the following example we have used the 'favicon.png'
filename.
<?php
/**
 * Recipe 2
 * Get website Favicons and save to your local drive.
 */
require_once('simplehtmldom/simple_html_dom.php');
/* Find the favicon link in the page head (the site here is illustrative) */
$html = file_get_html('http://www.example.com/');
$favicon_url = '';
$link = $html->find('link[rel*=icon]', 0);
if($link) { $favicon_url = $link->href; }
/* Download the icon into 'favicon.png' via CURLOPT_FILE */
$fp = fopen('favicon.png', 'w');
$ch = curl_init($favicon_url);
curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
One of the useful and important tasks you can accomplish using
scraping is that of grabbing Google search results urls. This can
be useful for crawling search links one after another for a
particular keyword.
The Code
The code uses curl to get the Google search page and a regex to
parse the raw html and retrieve the urls. Note that we can search
for any result page by adding the required offset in the url. The
general search format is shown below.
www.google.com/search?q=keyword&start=offset
<?php
/**
 * Recipe 3
 * Scrape Google search results.
 */
$ch = curl_init('https://www.google.com/search?q=pluto&start=0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

/* The pattern below is illustrative; Google's result markup changes often */
preg_match_all('/<a href="\/url\?q=(http[^&"]+)/i', $html, $matches);

foreach($matches[1] as $url)
{
    echo $url . "\n";
}
?>
Although many people do not take the Alexa site rank seriously, it
still provides a good representation of your website's traffic rank
compared to others on a global scale.
The Code
This searches for the div element having the class 'data up'
and returns the content within that div.
<?php
/**
 * Recipe 4
 * Get Alexa global website rank.
 */
require_once('simplehtmldom/simple_html_dom.php');
$website = 'codediesel.com';
/* Fetch the Alexa info page for the site (the url layout is an assumption) */
$html = file_get_contents('http://www.alexa.com/siteinfo/' . $website);
$phtml = str_get_html($html);
$data = @$phtml->find('div[class=data up]');
echo @$data[0]->nodes[2]->_[4];
?>
The Code
<?php
/**
* Recipe 5
* Scraping a page using HTTP basic authentication
*/
$username = "admin";
$password = "adminauth";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/');
curl_setopt($s, CURLOPT_HEADER, 1);
curl_setopt($s, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt($s, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$page_data = curl_exec($s);
echo $page_data;
?>
$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://some-site.com/inner-page.php');
curl_setopt($s, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$page_data = curl_exec($s);
curl_close($s);
The Code
<span class="pending-count">6</span>
$data = @$html->find('span[class=pending-count]');
<?php
/**
 * Recipe 6.1
 * Login to WordPress admin and get the total comments pending
 */
include_once('simplehtmldom/simple_html_dom.php');
$blog_url = "http://www.wordpress-blog.com/";
$blog_name = "wordpress-blog.com";
$username = "admin";
$password = "your-admin-password";
$post_data = "log={$username}&pwd={$password}&wp-submit=Log+In" .
    "&redirect_to=http%3A%2F%2F{$blog_name}%2Fwp-admin%2F&testcookie=1";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-login.php');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$downloaded_page = curl_exec($s);
curl_close($s);
$html = str_get_html($downloaded_page);
$data = @$html->find('span[class=pending-count]');
echo @$data[0]->plaintext;   /* the pending comment count */
?>
<?php
/**
 * Recipe 6.2
 * Login to WordPress admin and get the page content for the
 * 'All Posts' link
 */
include_once('simplehtmldom/simple_html_dom.php');
$blog_url = "http://www.wordpress-blog.com/";
$blog_name = "wordpress-blog.com";
$username = "admin";
$password = "your-admin-password";
$post_data = "log={$username}&pwd={$password}&wp-submit=Log+In" .
    "&redirect_to=http%3A%2F%2F{$blog_name}%2Fwp-admin%2F&testcookie=1";
$s = curl_init();
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-login.php');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
curl_exec($s);   /* log in; the session cookie is written to cookie.txt */

/* Now fetch the 'All Posts' admin page with the saved cookie */
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-admin/edit.php');
curl_setopt($s, CURLOPT_POST, false);
curl_setopt($s, CURLOPT_COOKIEFILE, 'cookie.txt');
$html = curl_exec($s);
/* $html now contains data for the 'edit.php' page */
echo $html;
curl_close($s);
?>
The Code
The following code enables you to get a list of all the images in
a web page along with the attributes – such as src, height, width,
alt etc. We use a generic ‘parseTag’ function to grab all the
‘img’ tags from a page. The ‘parseTag’ function also extracts
the tag attributes.
<?php
/**
 * Recipe 7
 * Get all the images for a page, along with all the attributes.
 */
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://some-site.com/');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page_data = curl_exec($ch);
curl_close($ch);

$links = parseTag($page_data, 'img');   /* parseTag as defined in Chapter 6 */
print_r($links);
?>
In the last recipe we saw a way to get all the image urls from a
page, but we did not save the images locally. In this recipe we
will further modify the previous code so that we can save the
images to our local system.
The Code
<?php
/**
 * Recipe 8
 * Download all the images on a page and save them locally.
 */
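/* A minimal sketch of the recipe body, reusing parseTag from
   Recipe 7; the target url and filenames are illustrative. */
$page_url = 'http://some-site.com/';
$page_data = file_get_contents($page_url);
$images = parseTag($page_data, 'img');

foreach($images as $image)
{
    /* Resolve a possibly relative src against the page url */
    $src = $image['src'];
    $image_url = (strpos($src, 'http') === 0) ? $src : $page_url . $src;

    /* Save the image under its base filename */
    file_put_contents(basename($image_url), file_get_contents($image_url));
}
?>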
CURL OPTIONS
A.1 A simple CURL session
Before you use CURL, you must initiate a session with the
curl_init() function. Initialization creates a session variable,
which identifies options and data belonging to a specific
session. Once you create a session, you may use it as many
times as you need to.
<?php
/* A minimal session (the url is illustrative) */
$s = curl_init();                                 /* create the session */
curl_setopt($s, CURLOPT_URL, 'http://www.example.com/');
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
$downloaded_page = curl_exec($s);                 /* execute the session */
curl_close($s);                                   /* close the session */
?>
CURLOPT_URL
This option sets the url that the session will fetch.
CURLOPT_RETURNTRANSFER
Set this option to TRUE if you want curl_exec to return the
downloaded page as a string instead of writing it directly to
the output.
CURLOPT_REFERER
This option sets the Referer header sent with the request, making
it appear as if the request came from another page.
curl_setopt($s, CURLOPT_REFERER,
"http://www.myexample.com/index.php");
CURLOPT_FOLLOWLOCATION
This option tells CURL that you want it to follow every page
redirection it finds. It is important to understand that CURL only
honors header redirections, and not redirections set with a refresh
meta tag or with JavaScript, as CURL is not a browser.
CURLOPT_HEADER AND CURLOPT_NOBODY
These options tell CURL to return either the web page's header or
body. By default CURL always returns the body, but not the header.
This explains why setting CURLOPT_NOBODY to TRUE excludes the body,
and setting CURLOPT_HEADER to TRUE includes the header.
CURLOPT_TIMEOUT
If you don't limit how long CURL waits for a response from a
server, it may wait forever, especially if the url you're fetching
is on a busy server or you're trying to connect to a nonexistent or
inactive IP address or a dead link. Setting a time-out value causes
CURL to end the session if the download takes longer than the
time-out value.
CURLOPT_USERAGENT
Use this option to define the name of your user agent. The user
agent name is recorded in server access log files and is available
to server-side scripts in the $_SERVER['HTTP_USER_AGENT']
variable. Some servers require a specific user agent in order to
validate requests, for example RETS services. Many websites will
not even serve pages correctly if your user agent name is something
other than a standard web browser's.
$agent = "webbot";
curl_setopt($s, CURLOPT_USERAGENT, $agent);
CURLOPT_HTTPHEADER
This option lets you send custom HTTP headers with the request,
passed as an array of header strings.
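A short sketch, with illustrative header values:

curl_setopt($s, CURLOPT_HTTPHEADER, array(
    'Accept-Language: en-US',
    'X-Requested-With: XMLHttpRequest'
));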
CURLOPT_SSL_VERIFYPEER
You only need to use this option if the target website uses SSL
encryption and the protocol in CURLOPT_URL is https:.
CURLOPT_POST AND CURLOPT_POSTFIELDS
Notice that the POST data looks like a standard query string sent
with a GET request. Incidentally, to send form information with the
GET method, simply attach the query string to the target URL.
CURLOPT_PORT
Use this option to connect to a port other than the default port
for the protocol.
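A one-line sketch, assuming a server listening on port 8080:

curl_setopt($s, CURLOPT_PORT, 8080);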
HTTP STATUS CODES
A.2 HTTP Status Codes
The table below is a quick reference for the status codes defined
in the HTTP/1.1 specification, with a brief summary of each.

Status code | Reason | Meaning
100 | Continue | An initial part of the request was received, and the client should continue.
101 | Switching Protocols | The server is changing protocols, as specified by the client, to one listed in the Upgrade header.
200 | OK | The request is okay.
201 | Created | The resource was created (for requests that create server objects).
202 | Accepted | The request was accepted, but the server has not yet performed any action with it.
203 | Non-Authoritative Information | The transaction was okay, except the information contained in the entity headers was not from the origin server, but from a copy of the resource.
204 | No Content | The response message contains headers and a status line, but no entity body.
205 | Reset Content | Another code primarily for browsers; means that the browser should clear any HTML form elements on the current page.
206 | Partial Content | A partial request was successful.
300 | Multiple Choices | A client has requested a URL that actually refers to multiple resources. This code is returned along with a list of options; the user can then select which one he wants.
301 | Moved Permanently | The requested URL has been moved. The response should contain a Location URL indicating where the resource now resides.
302 | Found | Like the 301 status code, but the move is temporary. The client should use the URL given in the Location header to locate the resource temporarily.
303 | See Other | Tells the client that the resource should be fetched using a different URL. This new URL is in the Location header of the response message.