
"a quick guide for PHP programmers"

Sameer Borate

A PRACTICAL GUIDE
TO
WEB SCRAPING
A PRACTICAL GUIDE TO WEB SCRAPING 1
Sameer Borate

A PRACTICAL GUIDE
TO
WEB SCRAPING

A PRACTICAL GUIDE TO WEB SCRAPING 2


CONTENTS

1.1 WEB SCRAPING DEFINED
1.2 REASONS TO SCRAPE
1.3 LEGALITY OF WEB SCRAPING
2.1 HTTP OVERVIEW
    2.1.1 GET REQUESTS
    2.1.2 QUERY STRINGS
    2.1.3 POST REQUESTS
    2.1.4 URL SYNTAX
    2.1.5 HTTP STATE MANAGEMENT AND COOKIES
3.1 YOUR SCRAPING TOOLBOX
    3.1.1 PHP CURL
    3.1.2 SIMPLEHTMLDOM: A QUICK OVERVIEW
    3.1.3 CACHING DOWNLOADED PAGES
4.1 SIMPLEHTMLDOM IN DETAIL
    4.1.1 BUILDING THE INITIAL DOM
    4.1.2 GRABBING THE TEXT CONTENT OF A PAGE
    4.1.3 FINDING ELEMENTS ON A PAGE
    4.1.4 ITERATING OVER NESTED ELEMENTS
    4.1.5 SCRAPING HTML TABLES
    4.1.6 DOM PARENTS AND CHILDREN
    4.1.7 CHAINING METHODS
    4.1.8 DOWNLOADING IMAGES
    4.1.9 WGET
5.1 AUTHENTICATED SITES
    5.1.1 HTTP BASIC AUTHENTICATION
    5.1.2 HTTP BASIC AUTHENTICATION WITH CURL
    5.1.3 STORING AND SENDING COOKIES
    5.1.4 SESSION AUTHENTICATION
    5.1.5 LOGGING TO A WORDPRESS ADMIN SITE WITH CURL
6.1 REGULAR EXPRESSIONS: A QUICK INTRODUCTION
    6.1.1 GETTING THE CHARACTER ENCODING FOR A WEB PAGE
    6.1.2 GRABBING IMAGES FROM A WEB PAGE
7.1 JAVASCRIPT AND THE RISE OF AJAX
    7.1.1 FIREBUG TO THE RESCUE
7.2 PHANTOMJS
8.1 GET CHARACTER ENCODING FOR A WEB PAGE
8.2 GRABBING WEBSITE FAVICONS
8.3 SCRAPE GOOGLE SEARCH RESULTS
8.4 GET ALEXA GLOBAL SITE RANK
8.5 SCRAPING A PAGE WITH HTTP AUTHENTICATION
8.6 LOGGING TO WORDPRESS ADMIN AND GRABBING CONTENT
8.7 GETTING ALL THE IMAGE URLS FROM A PAGE
8.8 SAVING ALL THE IMAGES FROM A PAGE TO A DIRECTORY
A.1 A SIMPLE CURL SESSION
    A.1.1 SETTING CURL OPTIONS
        CURLOPT_URL
        CURLOPT_RETURNTRANSFER
        CURLOPT_REFERER
        CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
        CURLOPT_NOBODY and CURLOPT_HEADER
        CURLOPT_TIMEOUT
        CURLOPT_USERAGENT
        CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
        CURLOPT_HTTPHEADER
        CURLOPT_SSL_VERIFYPEER
        CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
        CURLOPT_POST and CURLOPT_POSTFIELDS
        CURLOPT_VERBOSE
        CURLOPT_PORT
A.2 HTTP STATUS CODES


1
UNDERSTANDING WEB SCRAPING

1.1 Web Scraping Defined

The Internet today is a huge repository of information, the larger portion of it accessible via the World Wide Web, and the web browser is the standard tool for accessing it. Although the browser is a great tool, it is limited to human users. With such a large amount of information available online, it would be helpful if machines could automatically grab the content in some way, whether for repurposing data, analysis or creating mashups.

Many of the large websites today provide access to their content with the help of an API, either using REST or SOAP protocols. Users can retrieve the content from the site using the API and repurpose it however they wish (of course while adhering to the terms and conditions of the site). Unfortunately, for the most part websites do not provide an API and the only way to get to the data is via web scraping, also known as spidering or screen scraping.

Web scraping is the automatic retrieval of semi-structured data from web pages by programs. A web page today is commonly built in a markup language such as HTML or XHTML, as shown below.

<html>
<head>
<title>Hello HTML</title>
</head>
<body>
<p>Hello World!</p>
</body>
</html>

As you can see in the code above, the information here is the
string 'Hello World!', while most of the page content is
HTML markup, which the browser renders for the user.

Hello World!

If you need to scrape the above 'Hello World!' string using a computer program, then you will need to download the page and parse the content suitably, eliminating all the superfluous HTML markup to get to the pure text content. Of course the markup example given above is a very simple one. In reality web pages are very complex, harboring various HTML elements in a variety of combinations. Some of the HTML may be ill formed, missing tags or nested incorrectly. Modern browsers usually ignore these problems and try to auto-correct the inconsistencies before displaying a page. However, when you are writing a web scraper you have to take all of these factors into consideration. This can make parsing web pages a difficult task. Fortunately there are various libraries and tools available to make that undertaking easier for us.
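
As a small taste of what is involved, the following minimal sketch (an illustration only, not the approach used later in the book) downloads a page with PHP's built-in file_get_contents() function and uses strip_tags() to throw away the markup, leaving only the text content; example.com stands in for a real page.

<?php

/* Download the raw HTML of the page */
$html = file_get_contents('http://example.com/');

/* Strip the HTML markup, leaving only the text content */
$text = trim(strip_tags($html));

echo $text;

?>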

1.2 Reasons to scrape

Now that we have defined scraping and had a cursory look at the idea, we need to answer the question: why? Why bother to scrape? Of course, if you have bought this book it is surely for some scraping purpose. Still, providing some answers to the above question will help you extend your scraping skills to a wide variety of domains.

a. Aggregate and search specific kinds of data

Although different websites provide different types of data, most of them are semantically connected in some way. For example, if you are interested in blogs related to science and you have around 100 blog feeds in your reader, it would be difficult to go through all of them on a regular basis and find items of particular interest. However, you can write a scraper to collect all the blog feeds and search for a particular keyword of interest, thus transferring the drudgery of data filtering to a machine.

b. Gaining automated access to web resources

If you need to regularly check the price of some product on an ecommerce store to see if any discount is available, you could visit the site yourself to check on it. However, that would be time consuming and tedious. A better way would be to write a small scraper program that would regularly visit the site, get the price, and email you if some price change is found. You could also regularly check for fresh images and download them to your computer.

c. Combine information and present it in an alternate format

This method, one of the most common uses of scraping and also known as creating 'mashups', allows you to gather different kinds of information from various sites and combine it in some interesting way that is valuable to the end user.

1.3 Legality of Web Scraping

Having seen some of the uses of scraping, we need to look into an important topic: the legality of web scraping. This is a rather complex question, largely due to copyright and intellectual property laws. Unfortunately, there is no easy and completely clear-cut answer, particularly because these laws can vary between countries. However, there are a few universal points to examine when reviewing a potential web scraping target.

First, web sites often have documents known as Terms of Service (TOS), Terms or Conditions of Use, or User Agreements. These are generally linked in the site footer or in the site's help section, and are more common on larger and more well-known web sites. Read these and understand any terms they set against automated content scraping by scripts.

If you are scraping for the sole purpose of using someone else's intellectual property on your own website, then you are clearly violating copyright laws; this is a no-brainer. Likewise, if you are scraping data from a competitor's site and using it on your own site, that is clearly illegal.

Also, even if you are not using the scraper for any illegal data gathering, if your scraper loads the target server with lots of requests and thereby impairs it, you are violating the terms of the site. So make sure your scraper does not in any way degrade the performance of the target server.

Now with all the legal issues out of the way (but still in sight),
we are ready to get ahead with the coding part.

2
HTTP: A QUICK OVERVIEW

2.1 HTTP Overview

The first task of a web scraping application is that of retrieving the documents containing the information to be extracted. For websites consisting of multiple pages, or requiring session information or authentication to be preserved between requests, some level of reverse engineering is often required to develop a corresponding web scraping application. This sort of exploration requires a good working knowledge of the Hypertext Transfer Protocol, or HTTP, the protocol that powers the web.

HTTP is a request/response protocol intended as a communications protocol between web clients and web servers. Clients are programs or scripts that send requests to servers; examples of clients include web browsers such as Firefox and Internet Explorer, crawlers such as those used by Yahoo! and Google, and web scrapers.

Whenever your web browser fetches a file (a page, a picture, etc.) from a web server, it does so using HTTP. The request which your computer sends to the web server contains all sorts of interesting information. HTTP defines methods (sometimes referred to as verbs) to indicate the desired action to be performed by the web server on the requested resource. What this resource represents, whether pre-existing data or data that is generated dynamically, depends on the implementation of the server. Often, the resource corresponds to a file or the output of an executable residing on the server. An HTTP request consists of a request method, a request URL, header fields, and a body. HTTP 1.1 defines the following request methods, of which only two will be of interest to us – GET and POST.

GET: Retrieves the resource identified by the request URL
HEAD: Returns the headers identified by the request URL
POST: Sends data of unlimited length to the Web server
PUT: Stores a resource under the request URL
DELETE: Removes the resource identified by the request URL
OPTIONS: Returns the HTTP methods the server supports
TRACE: Returns the header fields sent with the TRACE request

2.1.1 GET Requests

Let us start with a very simple HTTP GET request, one to retrieve the index page of a fictitious site, example.com. The following is the request your browser sends to the remote server to retrieve the index file.

GET /index.html HTTP/1.1


Host: www.example.com

The individual components of the above request are detailed below:

• GET is the method or operation to be applied on the server resource. Think of it as a verb in a sentence, an action that you want to perform on something.

• /index.html is the Uniform Resource Identifier or URI. It provides a unique point of reference for the resource, the object or target of the operation.

• HTTP/1.1 specifies the HTTP protocol version in use by the client.

• The method, URI, and HTTP version together make up the request line.

• A single header, Host, and its associated value www.example.com follow the request line. More header-value pairs may follow.

• Based on the resource, the value of the Host header and the protocol in use, http://www.example.com/index.html is the resulting full URL of the requested resource.
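
If you are curious, you can reproduce this exchange yourself. The short sketch below (an illustration only, not taken from the book) opens a raw socket with PHP's fsockopen() and sends exactly the request line and Host header shown above, then prints the server's raw response.

<?php

/* Open a plain TCP connection to the web server on port 80 */
$fp = fsockopen('www.example.com', 80, $errno, $errstr, 30);
if ($fp === false) {
    die("Connection failed: $errstr ($errno)");
}

/* Build the request by hand: request line, Host header, blank line */
$request  = "GET /index.html HTTP/1.1\r\n";
$request .= "Host: www.example.com\r\n";
$request .= "Connection: close\r\n\r\n";

fwrite($fp, $request);

/* Print the raw HTTP response, headers included */
while (!feof($fp)) {
    echo fgets($fp, 1024);
}
fclose($fp);

?>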

Most request headers for sites are, however, more complex, containing many additional fields. You can use tools like Firebug or the Live HTTP Headers Firefox plugin to check the headers sent by the browser and the corresponding responses received from the server. A sample header captured by Live HTTP Headers for the site yahoo.com is shown below. As you can see, there are many additional fields along with the GET action.

GET /?p=us HTTP/1.1
Host: in.yahoo.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: B=1mjov258ke&b=4&d=g2uJ6p15tCRXz. (truncated)
Cache-Control: max-age=0

2.1.2 Query Strings

Another provision of URLs is a mechanism called the query string, which is used to pass request parameters to web applications. Below is a GET request that includes a query string and is used to request a certain page from example.com.

GET /index.php?page=3&limit=10 HTTP/1.1
Host: example.com

There are a few notable points about this URL.

• A question mark denotes the end of the resource path and the beginning of the query string.

• The query string is composed of key-value pairs, where each pair is separated by an ampersand (&).

• Keys and values are separated by an equal sign.
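
When building such URLs in PHP you do not have to assemble the key-value pairs by hand. A short sketch (an assumption for illustration, not from the book) using PHP's http_build_query() function is shown below; it also takes care of URL-encoding the values.

<?php

$params = array(
    'page'  => 3,
    'limit' => 10,
);

/* Builds the query string "page=3&limit=10" */
$query = http_build_query($params);

echo 'http://example.com/index.php?' . $query;

?>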

Query strings are not specific to GET operations and can be used in other operations as well, like POST, which we look into next.

2.1.3 POST Requests

The addition of the POST method is perhaps one of the largest improvements to the HTTP specification to date. This method can be credited with transitioning the Web into a truly interactive application development platform. When using a web browser as a client, this is most often done by means of an HTML form. If an HTML form specifies a method of POST, the browser will send the data from the form fields in a POST request rather than a GET request. One major difference between a GET request and a POST request is that the POST includes a body following the request headers to contain the data to be submitted. POST is intended to add to or alter data exposed by the application, a potential result of which is that a new resource is created or an existing resource is changed.
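
A hedged sketch of what such a request looks like from a scraper's point of view is given below; the URL and field names are placeholders, and the cURL functions used here are covered in detail in the next chapter and in Appendix A.

<?php

$ch = curl_init('http://example.com/process.php');

/* The form fields that will make up the POST body */
$fields = array(
    'name'  => 'John',
    'email' => 'john@example.com',
);

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

echo $response;

?>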

2.1.4 URL Syntax

URLs provide a means of locating any resource on the Internet, but these resources can be accessed by different schemes (e.g., HTTP, FTP, SMTP), and URL syntax varies from scheme to scheme. Despite the differences between various schemes, URLs adhere to a general URL syntax, and there is significant overlap in the style and syntax between different URL schemes.

All HTTP messages, with the possible exception of the content section of the message, use the ISO-8859-1 (ISO-Latin) character set. An HTTP request may include an Accept-Charset request header that identifies alternative character encodings that are acceptable for the content in the HTTP response.

Most URL schemes base their URL syntax on this nine-part general format:

<scheme>://<user>:<pass>@<host>:<port>/<path>?<query>#<frag>

Almost no URLs contain all these components. The three most important parts of a URL are the scheme, host, and path. The various parts are explained below.

scheme - Which protocol to use when accessing a server to get a resource. Default: none.

user - The username some schemes require to access a resource. Default: anonymous.

pass - The password that may be included after the username, separated by a colon (:). Default: <Email address>.

host - The hostname or dotted IP address of the server hosting the resource. Default: none.

port - The port number on which the server hosting the resource is listening. Many schemes have default port numbers (the default port number for HTTP is 80). Default: scheme-specific.

path - The local name for the resource on the server, separated from the previous URL components by a slash (/). The syntax of the path component is server- and scheme-specific. Default: none.

query - Used by some schemes to pass parameters to active applications (such as databases, bulletin boards, search engines, and other Internet gateways). There is no common format for the contents of the query component. It is separated from the rest of the URL by the "?" character. Default: none.

frag - A name for a piece or part of the resource. The frag field is not passed to the server when referencing the object; it is used internally by the client. It is separated from the rest of the URL by the "#" character. Default: none.

2.1.5 HTTP State Management and Cookies

A Web server's primary task is to fulfill each HTTP request received from the client. Everything that the Web server considers when generating the HTTP response is included in the request. If two completely different Web clients send identical HTTP requests to the same Web server, the Web server will use the same method to generate the response, which may include executing some server-side application.

Modern web sites, however, need to provide a personal touch to clients. They want to know more about the users on the other ends of the connections and be able to keep track of those users as they browse. HTTP transactions are stateless: each request/response happens in isolation. Many web sites want to build up incremental state as you interact with the site (for example, filling an online shopping cart). To do this, web sites need a way to distinguish one HTTP transaction from another.

Cookies are the best way to identify different users and allow
persistent sessions. Cookies were first developed by Netscape
but now are supported by all major browsers.

You can classify cookies generally into two types: session cookies and persistent cookies. A session cookie is a temporary cookie that keeps track of settings and preferences as a user navigates a site; it is deleted when the user closes the browser. Persistent cookies can live longer; they are stored on disk and survive browser exits and even computer restarts. Persistent cookies are often used to retain a configuration profile or login name for a site that a user visits periodically. The only difference between session cookies and persistent cookies is when they expire.
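
For a scraper this matters because many sites will only serve useful content once a session cookie has been established. The sketch below (an assumption for illustration; the cURL cookie options are covered further in Chapter 5 and Appendix A) shows how a cURL-based scraper can persist cookies between requests using a local file, here assumed to be named cookies.txt.

<?php

$cookie_file = 'cookies.txt';

$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

/* Write any cookies the server sets to the cookie jar file */
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);

/* Send back cookies stored from earlier requests */
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);

$html = curl_exec($ch);
curl_close($ch);

?>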

Now that we have had a brief overview of HTTP, we can proceed with the main subject of this book – web scraping.

3
SCRAPING TOOLBOX

3.1 Your Scraping Toolbox

As with any other task, starting with a good set of tools, and understanding them, makes the work efficient. Web scraping is no different. Web scraping can be done programmatically, using scripting languages like PHP or Ruby, or with the help of tools such as Wget or cURL. Although the latter do not provide the flexibility of a scripting language, they are useful tools that will come in handy. Each can be used independently or in combination to accomplish certain scraping tasks. Our primary goal in this book is to use PHP to retrieve web pages and scrape them for the content we are interested in.

Working with PHP to scrape data from web pages is not an easy task and traditionally requires a good knowledge of Regular Expressions. However, some excellent libraries make it easier to parse data from a web page without any knowledge of Regular Expressions. This does not mean that Regular Expressions are not required; having a working knowledge of them is an essential skill for a programmer and can help you immensely with your scraping work.

In this book we will work exclusively with the wonderful PHP DOM library SimpleHTMLDOM. This is an extremely small and versatile library and is easy to integrate within your application. It is an HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way. It supports invalid HTML and allows you to find tags on an HTML page with selectors just like jQuery. You can download it from the link below.

http://simplehtmldom.sourceforge.net/

The best way to get a feel for using the library is to get a couple of simple examples running. Once we are through with these examples, we will look deeper into the library. Although there are many PHP libraries out there that can help you in your scraping efforts, focusing solely on a single library will help you attain proficiency in it and make it easier to debug and write efficient scrapers. Also, once you get to know a library you can easily branch out to other libraries. SimpleHTMLDOM has been around for quite some time and is stable and widely used. Before we start with simplehtmldom we will look at another important tool, cURL.

3.1.1 PHP cURL:

While PHP in itself is able to download remote files and web pages, most real-life scraping applications require additional functionality to handle advanced issues like form submission, authentication, redirection, and so on. These functions are difficult to facilitate with PHP's built-in functions alone. Fortunately for us, nearly every PHP installation includes the cURL library (PHP/CURL), which takes care of these advanced features. Most of this book's examples make use of cURL's ability to download files and web pages.

Unlike the built-in PHP network functions, cURL supports multiple transfer protocols - FTP, FTPS, HTTP, HTTPS, Gopher, Telnet, and LDAP. Of these protocols, the most important for our purpose are HTTP and HTTPS. HTTPS allows our web scrapers to download from encrypted websites that employ the Secure Sockets Layer (SSL) protocol.

As stated above, curl allows transfer of data across a wide variety of protocols. It is widely used as a way to send content between websites, including things like APIs. curl is unrestricted in what it can do, from a basic HTTP request to a more complex FTP upload or interaction with a secure HTTPS site.

Before we can do anything with a curl request, we first need to create an instance of a curl resource by calling the curl_init() function. This function takes one parameter, the URL that you want to send the request to, and returns a curl resource.

$ch = curl_init('http://example.com');

We can also initialize curl in a slightly different way.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://example.com');

Take a simple example:

<?php

$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec($ch);
curl_close($ch);

?>

The details of each line are given below.

• curl_init is called and passed 'http://example.com' as the URL for the request.

• curl_setopt is called to set the configuration setting represented by the CURLOPT_RETURNTRANSFER parameter to a value of true. This setting will cause curl_exec to return the HTTP response body as a string rather than outputting it directly to the browser or console, the latter being the default behavior.

• curl_exec is called to execute the request and return the response body, which is stored in the $res variable.

• curl_close is called to explicitly close the curl session handle.

More information about curl and its various options is given in Appendix A.

3.1.2 SimpleHTMLDOM: A quick overview

Let us start with a real-world example to get a feel for the library. The following is a straightforward program that will search Google for the 'flower' keyword and print all the links on the page.

<?php

/* Include the simplehtmldom library */
require_once('simplehtmldom/simple_html_dom.php');

/* Get the page content for Google search */
$html = file_get_html('http://www.google.com/search?q=flower');

/* Find all the link '<a>' elements on the page */
$links = $html->find('a');

/* Loop through all the links and print them */
foreach($links as $element)
{
    echo $element->href . '<br>';
}

?>

The code is somewhat self-explanatory. We initially include the 'simplehtmldom' library into our program. Make sure that the path to the library is set correctly.

require_once('simplehtmldom/simple_html_dom.php');

Next we use the library's file_get_html() function to get the raw content from the given URL. The $html variable will now contain the complete DOM structure of the retrieved web page.

$html = file_get_html('http://www.google.com/search?q=flower');

Note that the file_get_html() function uses the PHP file_get_contents() function internally. So if file_get_contents() is disabled on your system by your provider for security reasons, you will need to enable it in your php.ini file or instead use curl to get the initial page content.

Once the above line is executed, the $html variable will hold the simplehtmldom object containing the HTML content for the given URL.

Once we have our page DOM, we are ready to query it with the
'find' method of the library. In our example we are searching
for the <a> link element.

$links = $html->find('a');

This will return an array of <a> element objects, which we then iterate over, printing the href attributes. The above example using curl is given below.

<?php

/* Include the simplehtmldom library */
require_once('simplehtmldom/simple_html_dom.php');

/* Get the page content for Google search */
$ch = curl_init('http://www.google.com/search?q=flower');
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$html_data = curl_exec($ch);
curl_close($ch);

/* Create a DOM object from the raw page data */
$html = str_get_html($html_data);

/* Find all the link '<a>' elements on the page */
$links = $html->find('a');

/* Loop through all the links and print them */
foreach($links as $element) {
    echo $element->href . '<br>';
}

?>

In the above example we have used the following line to print the href attribute.

echo $element->href;

This is the general flow of any web scraper. Of course, for complex projects there will be many variations, which we will explore later, but the basic logic and flow will remain the same.

Whenever you search for any DOM elements, various attributes are returned with the search. In the above example we have used href, but there could be others. You can find out what other attribute data are available by using the following.

print_r( array_keys ( $element->attr ));

This should return something like the following.

Array
(
[0] => href
[1] => rel
[2] => title
)

So if there is a title attribute available we can print that instead of the href attribute.

echo $element->title;

Whenever you specify an attribute to print, like title in the above example, simplehtmldom first searches the attributes list to see if an attribute with that name is available; if it finds one it returns it, otherwise it throws an error. We will look into attributes in detail in the next chapter.

Many times we do not need the attributes, but the actual text
within the DOM element, for example the text within an h3 tag.
For this we can use the 'plaintext' or 'innertext' methods.
'innertext' returns the raw html content within the specified
element, whereas 'plaintext' returns the plain text without
any html. There is one other method, 'outertext', which
returns the DOM node's outer text along with the tag.

The following example shows what the various methods return for a sample string.

$html = str_get_html("<div>Hello <b>World!</b></div>");

$ret = $html->find("div", 0);

echo $ret->tag;       // "div"
echo $ret->outertext; // "<div>Hello <b>World!</b></div>"
echo $ret->innertext; // "Hello <b>World!</b>"
echo $ret->plaintext; // "Hello World!"

Suppose we want to list all the Google search result titles instead of the links.

We can then use the code given below. The titles are all within
the h3 tag, so we will need to search for the same. Notice how
easy it is to change the search element within the find method.

/* Find all the title '<h3>' elements on the page */
$titles = $html->find('h3');

/* Loop through all the titles and print them */
foreach($titles as $title) {
    echo $title->plaintext . '<br>';
}

We have used the 'plaintext' method here, rather than attribute names.

3.1.3 Caching downloaded pages

During your initial web scraping projects, when you are in a learning phase, it will help you immensely if you save a local copy of the page you are scraping. This will save your bandwidth and also that of the site you are scraping. If you hit the remote web server on a frequent basis, they may ban your IP. Most web page data does not change that frequently, save for news or other related sites. So once you have saved the page locally, you can comfortably test your scraping code against the local copy rather than going out over the web each time.

So if you are scraping the index page of yahoo.com, you can save it to a local file initially and use that later for scraping. This will help keep your traffic low and increase your test speed, as you are accessing a local page rather than a page on the web. The code to save a local copy of a page is shown below.

/* Get the page content for the yahoo.com index */
$homepage = file_get_contents('http://www.yahoo.com/');

/* Save it to a local file */
file_put_contents("cache/index.html", $homepage);

If you are using curl you can use the following instead.

/* Get the page content for Yahoo.com */
$ch = curl_init('http://yahoo.com');
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$html_data = curl_exec($ch);
curl_close($ch);

/* Save it to a local file */
file_put_contents("cache/index.html", $html_data);

Now that you have a local copy of the index page, you can use
it in your scraping code.

<?php

/* Include the simplehtmldom library */
require_once('simplehtmldom/simple_html_dom.php');

/* Create a DOM object from a local file */
$html = file_get_html('cache/index.html');

/* Find all the link '<a>' elements on the page */
$links = $html->find('a');

/* Loop through all the links and print them */
foreach($links as $element) {
    echo $element->href . '<br>';
}

?>
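
The download and the cache lookup can also be rolled into one small helper. The following is a hedged sketch (not from the book); the get_cached_page() function name and the cache/ directory are assumptions for illustration.

<?php

function get_cached_page($url, $cache_dir = 'cache/')
{
    $cache_file = $cache_dir . md5($url) . '.html';

    /* Serve the local copy if we already have one */
    if (file_exists($cache_file)) {
        return file_get_contents($cache_file);
    }

    /* Otherwise download the page and cache it for next time */
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html_data = curl_exec($ch);
    curl_close($ch);

    file_put_contents($cache_file, $html_data);
    return $html_data;
}

/* Usage (assuming simple_html_dom.php is already included as before) */
$html = str_get_html(get_cached_page('http://www.yahoo.com/'));

?>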
Debugging cURL

Seldom does new code work correctly the first time, and cURL code is no exception. Many times it may just not work correctly or return some error. The curl_getinfo function enables you to view the requests being sent out by cURL. This can be quite useful when debugging requests. Below is an example of this feature in action.

$ch = curl_init('https://www.google.com/search?q=flower&start=0');

curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);

$html = curl_exec($ch);
$request = curl_getinfo($ch, CURLINFO_HEADER_OUT);
curl_close($ch);

The $request variable will now hold the request content sent out by curl. You can see if this is correct and modify the code accordingly.

In the next chapter we will be looking into the details of the simplehtmldom library. So is simplehtmldom enough for scraping purposes? It depends. I've successfully used it along with the PHP/cURL library for more than a dozen projects. I've used it to scrape product prices from e-commerce sites to build a comparison engine, and to retrieve Google search results to analyze listing rank for SEO, among other things. Although there are a few more PHP libraries available, in this book we primarily focus on simplehtmldom. The advantage of using a single library and mastering it is that you develop a deeper understanding of the library. This enables you to quickly develop scraping solutions to client problems. simplehtmldom is also object-oriented, so you can easily extend it to add your custom methods and functions to the core class. Also, using a single library keeps your code-base organized, allowing you to transfer domain solutions from one scraping project to another with ease.

4
EXPLORING SIMPLEHTMLDOM

4.1 Simplehtmldom in detail

Now that we have seen a couple of examples, we will look into the details of this wonderful library. For all of the examples below we will use the template provided with this eBook. The reason we do not use a live site in many examples is that the actual structure of live HTML pages changes frequently, which can render the examples unworkable.

4.1.1 Building the initial DOM

As we will be working with the template provided, you first need to install it to your local directory. For the examples we will assume that you have installed it at the local URL shown below, but yours could be different, so you need to adjust the examples accordingly.

http://localhost/scrape/template/index.html

Working with simplehtmldom requires that we first create the DOM structure for the HTML page we are interested in. All the functions provided by simplehtmldom work with this DOM object and not with the plain page text, so the DOM object is what we need to create first. The DOM object can be created in three ways.

/* Create a DOM object from a string */
$html = str_get_html('<html><body>Hello!</body></html>');

/* Create a DOM object from a URL */
$html = file_get_html('http://www.google.com/');

/* Create a DOM object from an HTML file */
$html = file_get_html('index.html');

All of the above return a DOM object which is stored in the
$html variable. str_get_html() and file_get_html() are
simplehtmldom functions.

As noted earlier, the file_get_html() function uses the PHP file_get_contents() function internally. So if this function is disabled on your system for whatever reason, you will need to use curl to get the initial page content. Once you get the content using curl, you can use the str_get_html() function to create a DOM object. If you are inclined more towards the object-oriented style of programming, you can use the following instead to create the DOM.

/* Create a DOM object */
$html = new simple_html_dom();

/* Load HTML from a string */
$html->load('<html><body>Hello!</body></html>');

/* Load HTML from a URL */
$html->load_file('http://www.google.com/');

/* Load HTML from an HTML file */
$html->load_file('index.html');

Although not mentioned earlier, str_get_html and file_get_html take a variety of additional parameters along with the HTML string.

str_get_html($str,
$lowercase=true, // Force Lowercase for tag names
$forceTagsClosed=true,
$target_charset = DEFAULT_TARGET_CHARSET,
$stripRN=true, // Strip NewLine characters
$defaultBRText=DEFAULT_BR_TEXT)

All parameters have a default setting, which will do for most applications. However, you need to be careful. The fifth parameter, $stripRN, indicates that all linebreak characters should be stripped from the string. This can sometimes cause problems when you are using the string elsewhere and the string is expected to have linebreak characters. Also note that the last parameter, $defaultBRText, defines the text that stands in for <br> tags, which by default is set to "\r\n" (Windows style). If you are working with *NIX style linebreaks, you may need to set it to "\n".

The third parameter, $forceTagsClosed, is also important. Forcing tags to be closed implies that we don't trust the HTML, which can be useful in cases of bad HTML markup. With validated HTML markup it may be better to set this option to false.

The following is the parameter list for the file_get_html function.

file_get_html($url,
$use_include_path = false,
$context=null,
$offset = -1,
$maxLen=-1,
$lowercase = true,
$forceTagsClosed=true,
$target_charset = DEFAULT_TARGET_CHARSET,
$stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

The last few parameters are the same as for the str_get_html
function, while the first five parameters are the same as for the
PHP function file_get_contents, as file_get_html calls
this function internally.

4.1.2 Grabbing the text content of a page

Let us start by scraping the plaintext content of the index.html page of our template.

/* Include the simplehtmldom library */
require_once('simplehtmldom/simple_html_dom.php');

$template = 'http://localhost/scrape/template/index.html';
$html = file_get_html($template);
echo $html->plaintext;

This will print all the page content without the HTML tags, i.e. the plain text of the page. A shortcut would be the following.

/* Dump contents (without tags) from HTML */
echo file_get_html($template)->plaintext;

If you need plain-text content from some external site page you
can instead use the following.

/* Dump contents (without tags) from HTML */
echo file_get_html('http://www.yahoo.com')->plaintext;

This can be useful if you need to do some text processing on the complete page text, like calculating word frequency to create a tag cloud, finding important keywords, etc.

4.1.3 Finding elements on a page

The whole point of downloading a web page is to retrieve some content we are interested in. The find method of simplehtmldom is the primary method for this purpose. The function takes a selector as a parameter and returns an array of element objects found. A few examples are shown below.

/* Search for the 'a' html tag */
$ret = $html->find('a');

/* Find all <div> tags with the id attribute set */
$ret = $html->find('div[id]');

/* Find all <div> tags which have id=comment */
$ret = $html->find('div[id=comment]');

/* Find all elements which have id=comment */
$ret = $html->find('#comment');

/* Find all elements which have class=comment */
$ret = $html->find('.comment');

/* Find all elements that have the id attribute set */
$ret = $html->find('*[id]');

/* Find all anchors and images */
$ret = $html->find('a, img');

Say, for example, that you want to get all the link titles in the sidebar of our example template.

Our template sidebar: the links are what we are interested in.

Using Firebug we find that the links are all under a ul tag with the class sidemenu. So we can ask the find method to search for all the links that are below the ul tag with the given class name.

$links = $html->find('ul[class=sidemenu] a');

/* Loop through all the links and print them */
foreach($links as $link)
{
echo $link->plaintext . '<br>';
}

We can further check what other attributes are available for us to use.

print_r(array_keys($link->attr));

This will return the following in our case.

Array
(
[0] => title
[1] => href
)

So we can print the titles instead of the links.

$links = $html->find('ul[class=sidemenu] a');

/* Loop through all the links and print the 'title' */
foreach($links as $link)
{
echo $link->title . '<br>';
}

Simplehtmldom also has a few methods to work with attributes. If you need to see all the attributes that an element has, you can use the getAllAttributes() method. So for the above example, the following will print all the attributes of the $link element.

print_r($link->getAllAttributes());

Will print:

Array
(
[title] => side menu 5
[href] => #
)

We can check if an element has a particular attribute set and return it.

if($link->hasAttribute('title'))
{
echo $link->getAttribute('title');
}

Another example: suppose we want to find the footer links on our template page. We can see using Firebug that they are under a div with the id footerleft.

Finding the DOM element using Firebug.

Once we know the id name we can use it with find.

$links = $html->find('div[id=footerleft] a');

We could also have used the following line to get to the links,
but the ul is redundant in this case.

$links = $html->find('div[id=footerleft] ul a');

There is no one right way to reach an element you are interested in. This is a skill you will learn over time, once you get to understand the library and have some scraping experience under your belt.

The above are all examples of descendant selectors, a few more of which are shown below.

/* Find all <li> in <ul> */
$ret = $html->find('ul li');

/* Find nested <div> tags */
$ret = $html->find('div div div');

/* Find all <td> in a <table> with class=salary */
$ret = $html->find('table.salary td');

/* Find all <td> tags with attribute align=center in a <table> tag */
$ret = $html->find('table td[align=center]');

4.1.4 Iterating over nested elements

Let us say you want to iterate over all the div elements in our template page which have a class of ct and print their contents. The code is as follows.

$findDivs = $html->find('div[class=ct]');

foreach($findDivs as $findDiv)
{
echo $findDiv->plaintext . '<br />';
}

For example, if you want to search for all the <p> tags within <div> tags, we can write it as follows.

foreach($findDivs as $findDiv)
{
foreach($findDiv->find('p') as $p)
{
echo $p->plaintext . '<br />';
}
}

Of course this is equivalent to the following. Which style to use will depend on your particular problem. The above is more flexible if you want to search for multiple tags within the div tag and want to process the tags further.

$findDivs = $html->find('div p');

foreach($findDivs as $findDiv)
{
echo $findDiv->plaintext . '<br />';
}

4.1.5 Scraping HTML tables

HTML tables are one of the important elements of a page and can require some effort to scrape using Regular Expressions. But simplehtmldom makes it easy to iterate over all the rows and columns of a table. Our sample template has a 3-row table which is used in the example below.

<table id="mytable">
<thead>
<tr>
<th width="150">Column 1</th>
<th width="150">Column 2</th>
<th width="150">Column 3</th>
<th width="150">Column 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Col-1 Row-1 Content</td>
<td>Col-2 Row-1 Content</td>
<td>Col-3 Row-1 Content</td>
<td>Col-4 Row-1 Content</td>
</tr>
<tr>
.
.

The best way to get all the table content is to find and iterate
over each <tr> element and then search for the <td> element.
The following shows the entire code.

$table = $html->find('table[id=mytable]');

foreach ($table as $t)
{
foreach ($t->find('tr') as $tr)
{
foreach ($tr->find('td') as $td)
{
echo $td->innertext . " | " ;
}
echo "<br>";
}
}

This will output something like the following:

Col-1 Row-1 Content | Col-2 Row-1 Content | Col-3 Row-1 Content | Col-4 Row-1 Content |
Col-1 Row-2 Content | Col-2 Row-2 Content | Col-3 Row-2 Content | Col-4 Row-2 Content |
Col-1 Row-3 Content | Col-2 Row-3 Content | Col-3 Row-3 Content | Col-4 Row-3 Content |

Note that in the above example if there is only one table with
the id 'mytable' you may also write the code as follows.
Notice how we are indexing the $table variable.

$table = $html->find('table[id=mytable]');

foreach ($table[0]->find('tr') as $tr)
{
foreach ($tr->find('td') as $td)
{
echo $td->innertext . " | ";
}
echo "<br>";
}

If there were two tables and we wanted content for the second table, the index for the $table variable would be different.

foreach ($table[1]->find('tr') as $tr)

Sometimes there may be a need to store the table contents in a two-dimensional array for further processing instead of echoing it to the screen. Maybe you want to save it to a database or export it to a CSV file. Whatever the case, we can easily add a few additional lines to save the table content to an array.

$table_array = array();
$row = 0;
$col = 0;

$table = $html->find('table[id=mytable]');

foreach ($table as $t)
{
foreach ($t->find('tr') as $tr)
{
foreach ($tr->find('td') as $td)
{
$table_array[$row][$col] = $td->plaintext;
$col++;
}
$row++;
}
}

print_r($table_array); // Test the array

Notice how we have used nested foreach loops to iterate over the table rows and columns. You can also access individual columns or rows using an index instead of a loop.

$td = $tr->find('td') ;
echo $td[1]; // echo the second column of the table
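
As an example of such further processing, the short sketch below (an assumption for illustration, not from the book) writes the $table_array built above to a CSV file named table.csv using PHP's fputcsv() function.

$fp = fopen('table.csv', 'w');

foreach ($table_array as $row)
{
    /* Each $row is an array of column values for one table row */
    fputcsv($fp, $row);
}

fclose($fp);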

4.1.6 DOM Parents and Children

An HTML page structure is always hierarchical. All the DOM elements have parents and children. A small part of our template is given below.

<!DOCTYPE html>
<html>
<head>
<title>Web Scraping Template</title>
</head>
<body>
<div id="container" class="clearfix">
<div id="menucont">
<ul>
<li><a title="" href="#" class="active">Home</a></li>
<li><a title="" href="#">About Us</a></li>
<li><a title="" href="#">Blog</a></li>

A PRACTICAL GUIDE TO WEB SCRAPING 43


<li><a title="" href="#">Contact Us</a></li>
</ul>
</div>

You may notice that the div tag with the id 'container' is the child of the body tag. If we use the following code on our template, it will return the parent of the 'container' div as the body tag. In this way we can walk the DOM tree using the parent method.

$findDivs = $html->find('div[id=container]');

foreach($findDivs as $findDiv)
{
echo $findDiv->parent()->tag; // Prints 'body'
}

Let us say we want to find all the p tags, but only those that are the immediate children of a div element with a class named ct. We can use the parent method here. The HTML snippet is given below.

<div class="ct">
<p>Div 1. … pellentesque…</p>
</div>

The code to parse the above is shown next.

$findDivs = $html->find('p');

foreach($findDivs as $findDiv)
{
if ($findDiv->parent()->class == 'ct' &&
$findDiv->parent()->tag == 'div')
{
echo $findDiv->plaintext;
}
}

Of course we could have written it differently by specifying a
more accurate query to the find method.

$findDivs = $html->find('div[class=ct] p');

foreach($findDivs as $findDiv)
{
echo $findDiv->plaintext;
}

So why do we need a parent method? Many times the HTML structure changes frequently and we need to be more flexible in our parsing efforts. The parent method allows us that flexibility while writing code, so any page structure changes can be easily accommodated.

Another method to complement the parent method is children(n). This returns the Nth child object if the index is set, otherwise it returns an array of children. So, for example, the following will print the tag names of the children of the div with the ct class. Note again that the examples here are taken from the sample template provided.

$findDivs = $html->find('div[class=ct]');

foreach($findDivs as $findDiv)
{
foreach($findDiv->children() as $child)
{
echo $child->tag;
}
}
// Prints 'p'

Other methods to use along with children are given below.

$e->first_child()   Returns the first child of the element, or null if not found.
$e->last_child()    Returns the last child of the element, or null if not found.
$e->next_sibling()  Returns the next sibling of the element, or null.
$e->prev_sibling()  Returns the previous sibling of the element, or null.
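
As a small illustrative sketch (an assumption based on our sample template, not an example from the book), the following walks the sibling <li> elements of the first menu item using next_sibling().

/* Get the first <li> of the menu in our template */
$item = $html->find('div[id=menucont] ul li', 0);

/* Walk forward through its siblings until there are none left */
while ($item !== null)
{
    echo $item->plaintext . '<br>';
    $item = $item->next_sibling();
}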

4.1.7 Chaining methods

Chaining of methods is an easier way to drill down the hierarchy of DOM elements. The following, for example, will allow us to access the nested children of the div tag, a p element.

$findDivs = $html->find('div[id=container]');

foreach($findDivs as $findDiv) {
echo $findDiv->children(1)->children(0);
}

Which will print the following.

<p>Event Horizon Website Template</p>

4.1.8 Downloading Images

Downloading images is one of the other common web scraping tasks a developer encounters. With the help of simplehtmldom and curl we can accomplish this easily. Let us take our sample template again. This has three images which we wish to download to our local directory. Using Firebug we can see that the images are all in a p tag within a div with the class images.

From the Firebug DOM layout we can easily get the image URLs with the following code.

$images = $html->find('div[class=images] img');

foreach($images as $image)
{
echo $image->src;
}

Note that the image src attributes contain relative paths, which we need to convert to absolute URLs to download the images. Once we are able to correctly get the src of the images we can download them using the PHP function file_get_contents().

$images = $html->find('div[class=images] img');

/* Adjust this to your correct template path */
$url = 'http://localhost/scrape/template/';

foreach($images as $image)
{
    /* Get the image source */
    $image_src = $image->src;

    /* The src attribute also contains the 'images/' directory name.
       We need to get rid of that to get a plain image file name. */
    $file_name = str_replace('images/', '', $image_src);

    /* We now download the image data and save it to a file */
    $file = file_get_contents($url . $image_src);
    $fp = fopen($file_name, 'w');
    fwrite($fp, $file);
    fclose($fp);
}

In the above scenario we searched for the images and simultaneously downloaded them to our local directory. However, many times it is better to just save the image URLs to a file and later download the images with some other script.

$fp = fopen('image_urls.txt', 'w');

foreach($images as $image)
{
/* Get the image source */
$image_src = $image->src;
fwrite($fp,$url . $image_src . PHP_EOL);
}

fclose($fp);

In the above example we have stored all the image URLs in the 'image_urls.txt' file. Later we can use this to download the images with another script, either using file_get_contents() or using curl. We have seen an example using file_get_contents; below is an example using curl.

/* Read the whole file into an array of lines */
$lines = file('image_urls.txt');

foreach ($lines as $line)
{
    /* Remove any end of line characters */
    $url = trim($line);

    /* Get only the image filename from the url */
    $filename = basename($url);

    $fp = fopen($filename, 'w');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 6);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
}

Normally, whenever a curl session is executed, the retrieved content is echoed to the console or stored in a variable. There is however an important curl option, CURLOPT_FILE, which allows you to specify a file resource to which the retrieved content is saved, as in our example above.

Saving all the image URLs in a file has another major advantage: you can use a utility like Wget to retrieve the images.

4.1.9 WGET

Wget is a powerful command-line program that retrieves content from web servers, and is part of the GNU Project. Its name is derived from World Wide Web and get. It supports downloading via the HTTP, HTTPS, and FTP protocols. Among other things its features include recursive download, conversion of links for offline viewing of local HTML, and support for proxies. Written in portable C, Wget can be easily installed on any Unix-like system and has been ported to many environments, including Microsoft Windows and Mac OS X. If you are using Windows you can download a version from the link below.

http://gnuwin32.sourceforge.net/packages/wget.htm

Once you have installed Wget, you can ask it to get the images from the image_urls.txt file. Wget will open the image_urls.txt file, retrieve the URLs and save the images to the specified directory, or to the current directory if none is specified.

d:\wget>wget -i image_urls.txt

Wget is a powerful and flexible tool which can help you in your scraping efforts. There are many options, and it would help you immensely if you read the documentation carefully and got familiar with its innumerable features. The following is the general format of a Wget command.

wget [option]... [URL]...

A few examples of Wget are given below.

# Download the title page of example.com to a file


# named "index.html".
wget http://www.example.com/

# Download the entire contents of example.com


wget -r -l 0 http://www.example.com/

# Download the title page of example.com, along
# with the images and style sheets needed to
# display the page, and convert the URLs inside
# it to refer to locally available content.

wget -p -k http://www.example.com/

5

[SCRAPING AUTHENTICATED PAGES]

5.1 Authenticated sites

Although the majority of web pages are unsecured and can be


directly accessed using curl, many important ones are secured
using various authentication techniques. One of the most
common ways to secure a page is using HTTP Basic
Authentication. To access the content that is authenticated we
need to use the PHP curl extension. But before that a few words
about Basic Authentication.

5.1.1 HTTP Basic Authentication

HTTP Basic authentication implementation is one of the easiest


and fastest ways to secure web pages because it doesn't require
cookies, session handling, or the creation of login pages.
Instead, HTTP Basic authentication uses static HTTP headers,
which mean that no handshakes are necessary between clients
and servers.

Whenever the server wants the browser to authenticate itself


using Basic Authentication it can send a request for
authentication. The headers for the same are shown below.

HTTP/1.1 401 Access Denied


WWW-Authenticate: Basic realm="My Server"
Content-Length: 0

This response is sent using the HTTP 401 Unauthorized status
code along with a WWW-Authenticate HTTP header.

Most web browsers will display a login dialog when this


response is received, allowing the user to enter a username and
password. An example is shown below.



When the user agent such as a browser wants to send the server
authentication credentials it may use the Authorization header.

The Authorization header is constructed as follows:

1. Username and password are combined into a string
"username:password".
2. The resulting string literal is then encoded using Base64.
3. The authorization method and a space i.e. "Basic " is then put
before the encoded string.

For example, if the browser uses 'jonny' as the username and


'whitecollar' as the password and wants to access some secure
file on the example.com/securefiles/ server, the headers
sent by the browser will be like the following.

GET /securefiles/ HTTP/1.1


Host: www.example.com
Authorization: Basic am9ubnk6d2hpdGVjb2xsYXI=

In PHP we can encode the username and password string to Base64
using the following.

echo base64_encode('jonny:whitecollar');
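
The same call can be used to assemble the complete header value;
a minimal sketch that reproduces the Authorization header shown
above:

<?php
$credentials = 'jonny:whitecollar';

/* Prefix the Base64-encoded credentials with the
   authorization method "Basic " */
$auth_header = 'Authorization: Basic ' . base64_encode($credentials);

echo $auth_header;
?>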



5.1.2 HTTP Basic Authentication with cURL

Now that we are familiar with Basic Authentication we will see


how we can submit the authentication information using curl.
The following example allows us to login to a HTTP-Basic
authenticated site using curl. This uses the CURLOPT_USERPWD
option to pass the username and password to the remote server
for authentication. Note that curl handles all the base64
encoding issues for us. We only have to specify the appropriate
options to curl and it handles the rest for us. The main options
in this example are CURLOPT_HTTPAUTH, CURLAUTH_BASIC.

<?php
$username = "admin";
$password = "adminauth";

$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/index.php');
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_USERPWD, "$username:$password");
curl_setopt($s, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
$output = curl_exec($s);
curl_close($s);
echo $output;
?>

Once the favored type of authentication, Basic authentication is
now losing out to other techniques because of its weaknesses.
For example, with basic authentication, there is no way
to log out without closing your browser. There is also no way to
change the appearance of the authentication form because the
browser creates it and the user has no control over that.
Importantly, Basic authentication is not very secure, as the
browser sends the login credentials to the server in clear-text.



5.1.3 Storing and sending cookies

Most servers after authentication return some cookie


information which is required for correctly working with the
rest of the site once we are successfully past the authentication.
We need to store this cookie data somewhere so that we can use
it later with other pages. Curl makes this job easier for us with
the CURLOPT_COOKIEJAR option. This option takes a filename
into which the site cookie information is stored; we have named
it ‘cookie.txt’ in our example below. We can then use the
cookie data later as required for working with other pages.

<?php

$username = "admin";
$password = "adminauth";

$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/');
curl_setopt($s, CURLOPT_HEADER, 1);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_USERPWD, "$username:$password");
curl_setopt($s, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$page_data = curl_exec($s);
echo $page_data;

?>

Later when we want to access a page that requires the cookies


set earlier during authentication, we can use the curl option
CURLOPT_COOKIEFILE to load the cookie data, which will then
be sent to the server when requesting a page. An example code
is given below.



$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/page3.php');
curl_setopt($s, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$page_data = curl_exec($s);

Besides HTTP-Basic authentication, there is also the standard


username / password form authentication used on most sites.
Authentication data is usually posted using the HTTP POST
method. In the next example we will see how to use curl to
work with these kind of authentications.

5.1.4 Session Authentication

Unlike basic authentication, in which login credentials are sent


each time a page is requested, session authentication validates
users once and creates a session value that represents that
authentication. The standard username/password login forms
seen on web pages use session authentication. Once a user is
correctly authenticated a session is created with the help of
cookies and the session values are passed to each subsequent
page request to indicate that the user is authenticated. There are
two basic methods for employing session authentication—with
cookies and with query strings. In the next example we will see
how we can use curl to login to a WordPress admin site using
session authentication.

5.1.5 Logging in to a WordPress Admin site with curl

Currently WordPress is one of the most installed CMS in the


world, and there are millions of sites that run on this versatile
platform. The below example shows how you can login to a
remote WordPress admin section using curl and grab any
content of interest. For example if you want to automatically



check at regular intervals the number of comments posted to
your blog, you can use the following code to accomplish that.

The following example uses curl to post login credentials to


your WordPress admin page. This is accomplished using the
CURLOPT_POSTFIELDS option along with some other required
options. After correct login WordPress returns a set of cookies
which are stored to a local file using CURLOPT_COOKIEJAR. This
cookie data will be used later by our code while requesting
other admin pages.

<?php
$blog_url = "http://www.wordpress-blog.com/";
$blog_name = "wordpress-blog.com";
$username = "admin";
$password = "your-admin-password";
$post_data = "log={$username}&pwd={$password}&wp-
submit=Log+In&redirect_to=http%3A%2F%2F{$blog_name}%2Fwp-
admin%2F&testcookie=1";

$s = curl_init();
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-login.php');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$downloaded_page = curl_exec($s);
?>

Once we correctly login to admin, we can then get any admin


page required and scrape the content. This will however require
that you set the correct options with curl. The following is
another example which will let you login to WordPress admin
and let you get the page content for the ‘All Posts’ link.



An important change to note here is the CURLOPT_COOKIEFILE
option. This allows you to continue the curl session initiated
after login. Curl uses the cookie data in this file while
requesting the new page.

/* Now that we have correctly logged in to admin, we can
request a new page */
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-admin/edit.php');
curl_setopt($s, CURLOPT_COOKIEFILE, "cookie.txt");

$html = curl_exec($s);
/* $html now contains data for the ‘edit.php’ page */
echo $html;

Once we have the page contents, we can then use
simplehtmldom to find the relevant DOM element and return
the total number of comments. The count of the pending
comments is stored in the following span element in the DOM.
Note that this could vary depending on the version of
WordPress, so make sure you are checking the correct DOM
element.

<span class="pending-count">6</span>

We now use the 'find' method to get the comment count. This
is accomplished by the following line. We return the first node
found matching the search parameter. This is specified using the
second parameter for 'find', '0' in our example.

$data = @$html->find('span[class=pending-count]', 0);

We can also specify the search parameter as below.

$data = $html->find('span.pending-count', 0);

Once the relevant element is found we return the appropriate


data from the object.



echo @$data->plaintext;

In the above example you may be wondering how we got the


data for the $post_data variable. The best way is to open the
FireBug Net panel and submit the WordPress form. You will
now see all the requests in the Net panel. A typical
FireBug capture of the WordPress admin urls after a login form
submission is shown below. You can easily see that the login form data was
posted to the 'wp-login.php' file using the POST method.

You can also see the POST details and other headers by clicking
on the 'wp-login.php' link.



A partial sample request captured using Firebug is shown
below. The post content is displayed after the 'Content-
Length' field. You only need to copy that and replace the
username/password with the correct one.

POST /wp-login.php HTTP/1.1


Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:16.0)
Gecko/20100101 Firefox/16.0
Accept:
text/html,application/xhtml+xml,application/xml;q=0
.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Referer: http://www.example.com/wp-login.php
Cookie: __utma=101293715.944295703.1353556559;
Content-Type: application/x-www-form-urlencoded

Content-Length: 102



log=admin&pwd=test&wp-
submit=Log+In&redirect_to=http%3A%2F%example.com%2F
wp-admin%2F&testcookie=1
HTTP/1.1 200 OK
Date: Fri, 08 Mar 2013 04:32:18 GMT
Server: Apache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
x-frame-options: SAMEORIGIN
Set-Cookie: wordpress_test_cookie=WP+Cookie+check;
path=/
Last-Modified: Fri, 08 Mar 2013 04:32:19 GMT
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

With these two authentication types you should be able to get


access to most of the sites on the Web. The only other
authentication architecture we need to explore is that which uses
Ajax to validate. This will be explored in chapter 7.

6

[SCRAPING WITH REGULAR EXPRESSIONS]

6.1 Regular Expressions: A quick Introduction

Regular Expressions or Regex are the Swiss army knife of text


processing and are an important part of web scraping. Once you
are comfortable with simplehtmldom you can now embark on
using Regular Expressions in your scraping projects. This is a
complex topic and I assume you have at least a rudimentary
knowledge of them. We do not cover them in detail here as it is
a large subject and you can find many good books on the
subject matter elsewhere. A cursory overview of Regular
Expressions is however presented in this chapter.

Regular Expressions are an amazingly powerful tool available


in most of today’s programming languages. Think of Regular
Expressions as a sophisticated system of text pattern matching.
You specify a pattern and then use one of PHP’s built-in
functions to apply the pattern to a text string to see if it matches.

PHP supports two types of regular expression standards, POSIX


Extended and Perl-Compatible (PCRE). PCRE is currently the
preferred type to use in PHP; it tends to be faster than the
POSIX option, and it uses the same regex syntax as Perl. You
can identify these functions because, in PHP, they start with the
prefix preg. Examples of PCRE regular expression functions in
PHP are preg_replace(), preg_split(), preg_match(), and
preg_match_all().

We will describe only the PCRE regular expressions in this


book, and for simplicity, we will also limit our discussion to the
most frequently used functions within PHP.



Let us look at string matching in PHP first. The PHP function to
use for regular expression matching is preg_match().
Consider the string:

“Colorless green ideas sleep furiously.”

We will use the above to do some pattern matching. Here is the


first example:

$string = "Colorless green ideas sleep furiously";


echo preg_match('/green/', $string) ; // returns 1

Here we are searching for the pattern 'green' anywhere within


the provided string.

Whenever you are looking for a certain string or pattern within


a given string, you have to first delimit the pattern. You
generally do this with the forward slash character (/), but you
can use any other non-alphanumeric character. So, if you are
looking for a string pattern of 'green', you could set it up as
/green/ or #green#, as long as you use the same characters on both
ends of the pattern and they are not among the string pattern for
which you are looking. For example, the following are
equivalent:

preg_match('/Hello/', 'Hello World!'); // returns 1


preg_match('#Hello#', 'Hello World!'); // returns 1

The general format of the preg_match() function is as below.

int preg_match ( string $pattern , string $subject [, array
&$matches [, int $flags = 0 [, int $offset = 0 ]]] )

We could have accomplished the above with some of the more


basic string functions that PHP provides, but the preg_match()
function also comes with some pattern quantifiers which



increase the power of regular expressions; a few selected ones are
shown below.

Table 1. Pattern quantifiers for preg_match expressions

^    Test for the pattern at the beginning of the string.
$    Test for the pattern at the end of the string.
.    Match for any single character (wildcard).
\    Escape character, used when searching for other quantifiers as literal strings.
[ ]  Range of valid characters: [0-6] means "between and including 0 and 6."
{ }  How many characters are allowed within the previously defined pattern rule.

With these quantifiers, we can be much more specific in what


we are looking for and where we are looking for it. Here are
some examples:

echo preg_match('/^green/', $string) ; // returns 0


echo preg_match('/^Colorless/', $string) ; // returns 1
echo preg_match('/^the/', $string) ; // returns 0
echo preg_match('/furiously$/', $string) ; // returns 1
echo preg_match('/gre.n/', $string) ; // returns 1

If $matches is provided, then it is filled with the results of the


search. $matches[0] will contain the text that matched the full
pattern, $matches[1] will have the text that matched the first
captured parenthesized sub pattern, and so on. Let us work
through a complete example below.
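
Before the full example, here is a minimal illustration of how
$matches is populated, using the earlier wildcard pattern with a
capturing group added:

$string = "Colorless green ideas sleep furiously";

/* The parentheses capture whatever the '.' wildcard matched */
preg_match('/gre(.)n/', $string, $matches);

print_r($matches);
// Array ( [0] => green [1] => e )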

6.1.1 Getting the character encoding for a web page

Every web page has a particular encoding to select what ranges


of characters are displayed by the browser. There are several
ways to specify which character encoding is used in the
document. First, the web server can include the character
encoding or 'charset' in the Hypertext Transfer Protocol



(HTTP) Content-Type header, which would typically look like
the following:

Content-Type: text/html; charset=ISO-8859-1

For HTML it is possible to include this information inside the


head element near the top of the document:

<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">

The following short example will get the character encoding for
a particular web page. This makes use of the PHP RegEx
functions to search for the required content. Here we are
looking for the 'charset' attribute which include the encoding
information.

<?php

/* Get the page content for a site */


$html = file_get_contents('http://www.microsoft.com/');

//Find the charset meta attribute


preg_match('/charset\=.*?(\'|\"|\s)/i', $html, $matches);

//Trim out everything we don't need


$matches = preg_replace('/(charset|\=|\'|\"|\s)/', '',
$matches[0]);

echo strtoupper($matches);

?>

The important part of the code is the Regex to search for the
'charset' pattern.

'/charset\=.*?(\'|\"|\s)/i'



The preg_match function is used to find the first matching
string for the above pattern. Once that is found the
preg_replace function only keeps the actual charset code and
returns that.

You may have noticed the character 'i' in the regular expression
after the closing delimiter. This instructs that the match should
be case insensitive.

The preg_match() function only matches the first found match


and then halts. The preg_match_all() function however
repeatedly matches from where the last match ended, until no
more matches can be made. preg_match_all() works exactly
the same as preg_match(); however, the resulting array is a
multi-dimensional one. Each entry in this array is an array of
matches as it would have been returned by preg_match().
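
A small sketch of the difference in the returned structure:

$string = "Colorless green ideas sleep furiously";

/* Find every word that ends in 's' */
preg_match_all('/\w+s\b/', $string, $matches);

print_r($matches[0]);
// Array ( [0] => Colorless [1] => ideas )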

6.1.2 Grabbing images from a web page

Take another example. One of the common tasks in web


scraping is to grab the images from a web page. This can be
easily accomplished using some regular expression power.

The following code enables you to get a list of all the images in
a web page along with their attributes – such as src, height,
width, alt etc. For this example we will separate the regular
expression search part from the main program as this is a bit
long and would be better if we make it as a function. We call
this function 'parseTag'. The 'parseTag' function also
extracts the tag attributes of the image. The complete function is
given below.



function parseTag($tag, $content)
{
$regex = "/<{$tag}\s+([^>]+)>/i";
preg_match_all($regex, $content, $matches);
$raw = array();
//We also want attributes of the tag
foreach($matches[1] as $str)
{
$regex = '/([a-z]([a-z0-9]*)?)=("|\')(.*?)("|\')/is';
preg_match_all($regex, $str, $pairs);

if(count($pairs[1]) > 0) {
$raw[] = array_combine($pairs[1],$pairs[4]);
}
}
return $raw;
}

The workhorse of the above code is the preg_match_all()


function, which grabs all the tags specified in the parameter
from a web page. The complete code to grab the image links
from our template using the parseTag function is shown below.
We have used curl below, but you could also use
file_get_html(). Both versions are shown here.

<?php

/* curl version */
$url = 'http://localhost/scrape/template/index.html';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$html = curl_exec($ch);

/* This will grab all the <img> tag data */


$links = parseTag('img', $html);
print_r($links);
?>



The file_get_html version is shown below. As you can see it
is more concise than the curl version.

<?php

/* file_get_html version */
$url = 'http://localhost/scrape/template/index.html';

$html = file_get_html($url);

/* This will grab all the <img> tag data */


$links = parseTag('img', $html);
print_r($links);

?>

This will return the following array. Note the additional


attributes returned including the class names of the images. This
can be helpful when you only need to find images with a certain
class name or alt tag.

Array
(
[0] => Array
(
[src] => images/flower1.jpg
[class] => imageb
[alt] => flower 1
)

[1] => Array


(
[src] => images/flower2.jpg
[class] => imageb
[alt] => flower 2
)

[2] => Array


(
[src] => images/flower3.jpg
[class] => imageb

[alt] => flower 3
)
)

You can also print the src attribute of each image within a for
loop.

foreach($links as $link) {
echo $link['src'] . '<br>';
}

As you can see Regular Expressions along with simplehtmldom


can help you scrape any web content you desire. Of course,
Regular Expressions are more complex than using a library like
simplehtmldom, but offer more flexibility and power if used
correctly. But the major downside is the learning curve it
demands. If designed incorrectly Regular Expressions can be a
source of subtle bugs. For these reasons I prefer to use
simplehtmldom wherever possible and only resort to Regular
Expressions in rare cases.

7

[SCRAPING AJAX CONTENT]

7.1 JavaScript and the rise of Ajax

A few years back most web page content, whether dynamic or


static was generated on the server and pushed to the client.
However, with the introduction of Ajax, content now could be
dynamically called after a web page has been loaded. This
provided a new challenge to web scraping.

With the rise of JavaScript many websites now use Ajax to


retrieve fresh content on a web page. This can however be
problematic to our scraping efforts, since content retrieved by
JavaScript cannot be parsed easily by PHP as it is dynamically
inserted into the DOM by the browser. Just using CURL to get
the page contents will return the data without the Ajax content.
So how do we get to the dynamic content generated by
JavaScript?

One solution is to use a browser tool like Selenium. However,


Selenium is a testing tool and cannot be easily integrated with
PHP to retrieve dynamic content. The other solution is to use
'phantom.js', a desktop program which enables you to download
web pages and manipulate the DOM using Javascript, enabling
you to retrieve Ajax content. phantomjs actually is a browser
without the normal window, something called a headless
browser. phantomjs uses JavaScript as its scripting language to
process downloaded pages. We will look a little into phantomjs
in this chapter, however our primary objective is to use PHP to
scrape web pages.

Every Ajax page request entails a call to a server url which


returns back some data to the browser. This data is used by the
browser to update relevant content on the page. The important



part to know when scraping Ajax content is to find the request
url. So how do we get to the correct Ajax request url?

7.1.1 FireBug to the rescue

This is where the FireBug addon will come in handy. On the


'Net' panel of the FireBug window you can see all the GET/POST
requests made by the web page. A sample screenshot of various
JavaScript requests for a website is shown below.

As you can see there are many requests, GET as well as POST. It
is your task to locate the correct url that is used to retrieve the
Ajax content you are interested in. For example a GET request
may take the form of the following url, where the page
parameter will vary with each request. The url below could be
for retrieving a list of books for the category 'science'. There
may be a total of 600 books, with 20 books shown per request,
which brings the total page count to 30 pages.

http://some-site.com/request/get-book-list.php?page=6&cat=10

Of course many times this url may be a little more complex than
shown here. But for example's sake let us go with the above one.
Once we know the correct url then we can use it in our CURL



request to get to the actual content. In the example given below
we get all the Ajax content for pages 0-20. With each iteration
of the 'for' loop we modify the page parameter and send a
new request. We have not added any error handling, but you
should check if the request returned any valid data or is empty.

<?php

$ch = curl_init();

for($i=0; $i<=20; $i++)


{
$url = "http://some-site.com/request/get-book-
list.php?page={$i}&limit=10";

curl_setopt($ch, CURLOPT_URL, $url);


curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$page_data = curl_exec($ch);
/* Do something with the data */
}

?>

For Ajax content with many pages, as in the above example, I


usually prefer to cache the retrieved content to a local file,
which makes the scraping part easier as I can repeatedly test
different code on the cached pages.
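
A minimal caching sketch along these lines, using
file_get_contents for brevity; the 'cache' directory and the file
naming are placeholders, and the directory is assumed to exist:

<?php
for($i=0; $i<=20; $i++)
{
    $cache_file = "cache/book-list-page-{$i}.html";

    if(file_exists($cache_file)) {
        /* Reuse the locally cached copy */
        $page_data = file_get_contents($cache_file);
    } else {
        /* Download once and cache it for later runs */
        $url = "http://some-site.com/request/get-book-list.php?page={$i}&limit=10";
        $page_data = file_get_contents($url);
        file_put_contents($cache_file, $page_data);
    }

    /* Do something with the data */
}
?>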

This, in a nutshell, is how you retrieve content for Ajax


requests. Some JavaScript requests though may use POST
instead of GET. However this does not change anything. You
can get the additional details of the POST request from the
FireBug as before. In the screenshot shown below we can see
that JavaScript has posted to the 'login.php' url with two



parameters – username, password. Sometimes this request
could be more complex, with dozens of parameters. However,
the logic remains the same. The response sent by the server to
the request is available in the 'Response' tab. The response can
be returned in plain HTML, plain text, JSON or in XML format.
The task of decoding the response into the appropriate format
falls to the web scraper.
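
For example, if the 'Response' tab shows JSON, decoding it in PHP
is a one-line job; the field names below are purely hypothetical:

<?php
/* $page_data holds the raw Ajax response fetched with curl */
$response = json_decode($page_data, true);

if($response !== null) {
    foreach($response['books'] as $book) {
        echo $book['title'] . "\n";
    }
}
?>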

The following example shows how to use CURL to post to this


url along with the two parameters.

<?php

$url = "http://example/login.php?login=1";
$username = "jamest";
$password = "your-password";
$post_data = "username={$username}&password={$password};

$s = curl_init();
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);

curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$downloaded_page = curl_exec($s);
?>

Of course FireBug is not the only tool to watch network traffic.


Google Chrome and Safari have excellent developer tools built
into the browser, which enables you to analyze Ajax calls. It all
depends on which browser you are comfortable with.

Decoding Ajax calls and extracting the correct urls is a skill you
will acquire over time. It is an art, and requires some patience
and trial and error to get it right, but is worth the effort.

7.2 PhantomJS

As described earlier in this chapter, PhantomJS is a headless


WebKit scriptable with a JavaScript API. It has fast and native
support for various web standards: DOM handling, CSS
selector, JSON, Canvas, and SVG. You can use PhantomJS to
programmatically capture web contents, including SVG and
Canvas, create web site screenshots, access and manipulate
webpages with the standard DOM API, or use it alongside with
libraries like jQuery. Because PhantomJS can load and
manipulate a web page, it is perfect to carry out various page
automations including web scraping, our primary purpose.

You can download PhantomJS from the below link:

http://phantomjs.org/download.html



PhantomJS runs from the command line, so make sure the
executable is available in your PATH. To test if your
installation is running correctly copy the following lines into a
blank text file and save it as 'test.js'.

console.log('Hello, World!');
phantom.exit();

You can now execute these commands by passing the file to


PhantomJS.

c:\>phantomjs test.js

This will print the string 'Hello, World!' in the console


window.

Below is a simple example that will download a web page and


print the content to the console.

var page = require('webpage').create();


page.open('http://phantomjs.org', function(status) {
if ( status === "success" ) {
console.log(page.content);
}
phantom.exit();
});

PhantomJS is a relatively big topic with a decent number of
functions and methods which we cannot cover in this book. The
website has decent documentation which you can use while
coding. Note that PhantomJS assumes that you have a very
good working knowledge of JavaScript.
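
If you prefer to keep the rest of your scraper in PHP, one common
pattern is to run PhantomJS as an external command and capture
whatever the script prints. A rough sketch only, assuming
phantomjs is on your PATH and 'dump_page.js' is a hypothetical
script like the one above that reads the url from its command
line and prints page.content:

<?php
$url = 'http://phantomjs.org/';

/* Run PhantomJS and collect everything it writes to stdout */
$command = 'phantomjs dump_page.js ' . escapeshellarg($url);
$html = shell_exec($command);

/* $html can now be parsed with simplehtmldom as usual */
echo $html;
?>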

8

[COOKBOOK]

8.1 Get character encoding for a web page
Get character encoding data for a web page such as UTF-8

Every web page has a particular encoding to select what ranges


of characters are displayed by the browser. There are several
ways to specify which character encoding is used in the
document. First, the web server can include the character
encoding or "charset" in the Hypertext Transfer Protocol
(HTTP) Content-Type header, which would typically look like
the following:

Content-Type: text/html; charset=ISO-8859-1

For HTML it is possible to include this information inside the


head element near the top of the document:

<meta http-equiv="Content-Type" content="text/html;


charset=utf-8">

The Code

The following short example will get the character encoding for
a particular web page. This makes use of the powerful PHP
RegEx functions to search for the required content. Here we are
looking for the ‘charset’ attribute which include the encoding
information.

<?php
/**
* Recipe 1
* Get the encoding for a particular page.
*/

/* Get the page content for a site */


$html = file_get_contents('http://www.microsoft.com/');
$favicon_url = '';

//Find the charset meta attribute

preg_match_all('~charset\=.*?(\'|\"|\s)~i', $html, $matches);

//Trim out everything we don't need


$matches = preg_replace('/(charset|\=|\'|\"|\s)/', '',
$matches[0]);

if(strtoupper($matches[0]) == '') {
echo 'auto';
} else {
echo strtoupper($matches[0]);
}

?>

8.2 Grabbing website favicons


Grab Favicons from websites and save it to your local drive.

Almost all websites today display a favicon - a small icon that
you see in the browser url field. This recipe grabs the
favicon from a given url and saves it to the local drive.

The Code

The code uses the simplehtmldom file_get_html function to get the web
page content, which is then searched for the favicon link. If
file_get_html is disabled on your server you can instead use
curl to download the page. This link is then further passed to the
curl function which does the actual task of downloading the
icon and saving it to a file. The important curl option used here
is CURLOPT_FILE. We need to pass a file handle to this option in
which the icon will be saved. In the following example we have
used the 'favicon.png' filename.



<?php

/**
* Recipe 2
* Get website Favicons and save to your local drive.
*/
require_once('simplehtmldom/simple_html_dom.php');

/* The site of which favicon we want */


$html = file_get_html('http://www.example.com/');

$favicon_url = '';

/* Get all the links from the page */


foreach($html->find('link') as $element)
{
/* Search for the favicon link */
if($element->rel == "shortcut icon" || $element->rel ==
"icon")
{
/* If found save it to a file */
$favicon_url = $element->href;

$fp = fopen('favicon.png', 'w+');

$ch = curl_init($favicon_url);
curl_setopt($ch, CURLOPT_TIMEOUT, 6);

/* Save the returned data to a file */


curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);
}
}

?>



8.3 Scrape Google search results
Get Google search results urls.

One of the useful and important tasks you can accomplish using
scraping is that of grabbing Google search results urls. This can
be useful for crawling search links one after another for a
particular keyword.

The Code

The code uses curl to get the Google search page and a regex to
parse the raw html and retrieve the urls. Note that we can search
for any result page by adding the required offset in the url. The
general search format is shown below.

www.google.com/search?q=keyword&start=offset

q is the search keyword and start is the page offset to search


from. Setting start to ‘0’ will return the first page, while an
offset of ‘10’ will return the second page, an offset of ‘20’ will
return the third and so on.

<?php

/**
* Recipe 3
* Scrape Google search results.
*/
require_once('simplehtmldom/simple_html_dom.php');

/* Get results for the first page, for additional pages
set the 'start' parameter in the url below to 10 for page
2, 20 for page 3 and so on.
*/
$ch = curl_init('http://www.google.com/search?q=pluto&start=0');
curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$html = curl_exec($ch);
curl_close($ch);

preg_match_all('#<h3.*?><a href="(.*?)".*?</h3>#', $html,
$matches);

foreach($matches[1] as $url)
{
echo $url . "\n";
}

?>

You can also connect to Google search using ‘https’.

curl_init('https://www.google.com/search?q=pluto&start=0');

This will require you to add appropriate curl options to the


above code as given below.

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
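
If you need more than one result page, the start offset can be
computed from the page number in a small loop; a sketch only,
since Google's markup and rate limiting change frequently:

/* Fetch the first three result pages. Assumes a curl handle $ch
   already set up with the options shown above. */
for($page = 1; $page <= 3; $page++)
{
    $start = ($page - 1) * 10;

    curl_setopt($ch, CURLOPT_URL,
        "http://www.google.com/search?q=pluto&start={$start}");

    $html = curl_exec($ch);

    /* Run the preg_match_all() shown above on $html here */

    sleep(2); /* be polite between requests */
}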



8.4 Get Alexa global site rank
Get Alexa global rank for a website

Although many people would not take Alexa site rank seriously,
it still provides a good representation of your website traffic rank
compared to others on a global scale.

The Code

The code uses the simplehtmldom library to search for the DOM
element corresponding to the site rank. The process of how to
select the relevant DOM element and extract the data from
the resulting simplehtmldom object has been explained in the
previous chapters. The important piece of code that searches for
the site rank is the following.

$data = @$phtml->find('div[class=data up]');

This searches for the div element having the class 'data up'
and returns the content within that div.

<?php

/**
* Recipe 4
* Get Alexa global website rank.
*/

require_once('simplehtmldom/simple_html_dom.php');
$website = 'codediesel.com';

/* Get Alexa search page */


$html = file_get_html("http://www.alexa.com/siteinfo/{$website}#");

$phtml = str_get_html($html);
$data = @$phtml->find('div[class=data up]');
echo @$data[0]->nodes[2]->_[4];
?>



8.5 Scraping a page with HTTP authentication
Scrape a web page that is secured by HTTP Basic Authentication

Although the majority of web pages are unsecured and can be


directly accessed using curl, many important ones are secured
using various techniques. One of the most common ways to
secure a page is using HTTP Basic Authentication. The
following recipe show how one could access a page that is
secured using this type of authentication.

The Code

The recipe uses the CURLOPT_USERPWD curl option to pass the


username and password to the remote server for authentication.
Most servers return some cookie information which is required
for correctly working with the rest of the site once we pass the
authentication. This cookie data is stored in the ‘cookie.txt’ file
using the CURLOPT_COOKIEJAR option. We can then use the
cookie data later as required for working with other pages.

<?php
/**
* Recipe 5
* Scraping a page using HTTP basic authentication
*/
$username = "admin";
$password = "adminauth";

$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://example.com/');
curl_setopt($s, CURLOPT_HEADER, 1);
curl_setopt($s, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt($s, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$page_data = curl_exec($s);
echo $page_data;
?>



Later when we want to access a page that requires the cookies
set earlier during authentication, we can use the curl option,
CURLOPT_COOKIEFILE to load the cookie data, which will then
be sent to the server when requesting a page. An example code
is given below.

$s = curl_init();
curl_setopt($s, CURLOPT_URL, 'http://some-site.com/inner-page.php');
curl_setopt($s, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);

$page_data = curl_exec($s);



8.6 Logging in to WordPress admin and grabbing content
Login to a remote WordPress admin site using curl

Currently WordPress is one of the most installed CMS in the


world, and there are millions of sites that run on this versatile
platform. This recipe shows how you can log in to a remote
WordPress admin section using curl and grab any content of
interest. For example if you want to automatically check at
regular intervals the number of comments posted to your blog,
you can use the following recipe to accomplish that.

The Code

The following example uses curl to post login credentials to


your WordPress admin page. This is accomplished using the
CURLOPT_POSTFIELDS option along with some other required options.
After correct login WordPress returns a set of cookies which are
stored to a local file using the CURLOPT_COOKIEJAR option. This
cookie data will be used later by our code while requesting
other admin pages.

We use simplehtmldom to find the relevant DOM element and


return the total number of comments. The count of the pending
comments is stored in a span element in the DOM.

<span class="pending-count">6</span>

We use simplehtmldom to find that element and return the


count. This is accomplished by the following line.

$data = @$html->find('span[class=pending-count]');

Once that is found we return the appropriate data from the


object using the following line.

echo @$data[0]->nodes[0]->_[4];

<?php

/**
* Recipe 6.1
* Login to WordPress admin and get the total comments pending
*/

include_once('simplehtmldom/simple_html_dom.php');

$blog_url = "http://www. wordpress-blog.com/";


$blog_name = "wordpress-blog.com";
$username = "admin";
$password = "your-admin-password";

$post_data = "log={$username}&pwd={$password}&wp-
submit=Log+In&redirect_to=http%3A%2F%2F{$blog_name}%2Fwp-
admin%2F&testcookie=1";

$s = curl_init();
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-login.php');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U;
Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725
Firefox/2.0.0.6");
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);

$downloaded_page = curl_exec($s);

$html = str_get_html($downloaded_page);

$data = @$html->find('span[class=pending-count]');

/* Display pending comments count */


echo @$data[0]->nodes[0]->_[4];

?>



Once we correctly login to admin, we can then get any admin
page required and scrape the content. This will however require
that you set the correct options with curl. The following is
another example which will let you login to WordPress admin
and let you get the page content for the ‘All Posts’ link.

An important change to note here is the CURLOPT_COOKIEFILE


option. This allows you to continue the curl session initiated
after login. Curl uses the cookie data in this file while
requesting the new page.

<?php

/**
* Recipe 6.2
* Login to WordPress admin and get the page content for the
* ‘All Posts’ link
*/

include_once('simplehtmldom/simple_html_dom.php');

$blog_url = "http://www.wordpress-blog.com/";
$blog_name = "wordpress-blog.com";
$username = "admin";
$password = " your-admin-password ";

$post_data = "log={$username}&pwd={$password}&wp-
submit=Log+In&redirect_to=http%3A%2F%2F{$blog_name}%2Fwp-
admin%2F&testcookie=1";

$s = curl_init();
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-login.php');
curl_setopt($s, CURLOPT_TIMEOUT, 30);
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($s, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U;
Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725
Firefox/2.0.0.6");
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);

curl_setopt($s, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);

curl_exec($s);

/* Now that we have correctly logged in to admin, we can
request a new page */
curl_setopt($s, CURLOPT_URL, $blog_url . 'wp-admin/edit.php');
curl_setopt($s, CURLOPT_COOKIEFILE, "cookie.txt");

$html = curl_exec($s);
/* $html now contains data for the ‘edit.php’ page */
echo $html;

curl_close($s);

?>



8.7 Getting all the image urls from a page
Get all the image links for a page along with the attributes

One of the common tasks in web scraping is to grab the images


from a web page. This can be easily accomplished using some
regular expression tricks.

The Code

The following code enables you to get a list of all the images in
a web page along with the attributes – such as src, height, width,
alt etc. We use a generic ‘parseTag’ function to grab all the
‘img’ tags from a page. The ‘parseTag’ function also extracts
the tag attributes.

The workhorse of the below code is the preg_match_all


function, which grabs all the image links from a web page.

<?php

/**
* Recipe 7
* Get all the images for a page, along with all the
attributes.
*/

function parseTag($tag, $content)


{
preg_match_all('~<' . $tag . '\s+([^>]+)>~i', $content,
$matches);
$raw = array();
//We also want attributes of the tag
foreach($matches[1] as $str)
{
preg_match_all('~([a-z]([a-z0-
9]*)?)=("|\')(.*?)("|\')~is', $str, $pairs);
if(count($pairs[1]) > 0)
{
$raw[] = array_combine($pairs[1],$pairs[4]);

}
}
return $raw;
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://some-site.com/');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page_data = curl_exec($ch);

/* We pass the tag name we are interested, ‘img’ here */


$links = parseTag('img', $page_data);

print_r($links);

?>



8.8 Saving all the images from a page to a directory
Save all the images from a web page

In the last recipe we saw a way to get all the image urls from a
page but we did not save it locally. In this recipe we will further
modify the previous code so that we will be able to save the
images to our local system.

The Code

The code is a duplicate of the previous recipe with additional
lines added to copy the files from the remote server to your
local machine.

<?php

/**
* Recipe 8
* Get all the images from a page and save them to the local
* machine.
*/

function parseTag($tag, $content)


{
preg_match_all('~<' . $tag . '\s+([^>]+)>~i', $content,
$matches);
$raw = array();
//We also want attributes of the tag
foreach($matches[1] as $str)
{
preg_match_all('~([a-z]([a-z0-
9]*)?)=("|\')(.*?)("|\')~is', $str, $pairs);
if(count($pairs[1]) > 0)
{
$raw[] = array_combine($pairs[1],$pairs[4]);
}
}
return $raw;
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://example.com/');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page_data = curl_exec($ch);

/* We pass the tag name we are interested, ‘img’ here */


$links = parseTag('img', $page_data);

/* Now copy the remote images to the local machine; we are
assuming that $link['src'] contains an absolute image url
and not a relative one.
*/
foreach($links as $link)
{
copy($link['src'], "./" . basename($link['src']));
}
?>

The $link['src'] should contain absolute image urls. If they
are relative, then we need to make them absolute by adding a
prefix string to the resource.

/* If the src url is relative then make it absolute */


foreach($links as $link)
{
$image_url = 'http://example.com/' . $link['src'];
copy($image_url, "./" . basename($image_url));
}

A.1

[CURL OPTIONS]

A.1 A simple CURL session

A CURL session is similar to a PHP file I/O session. Both


create a session (or file handle) to reference an external file
(URL in the case of a CURL). CURL however differs from
standard file I/O in that it requires a series of options to be set
that define the nature of the data. These options are set
individually; order does not make any difference. The following
shows the minimal options required to create a CURL session
that will put a downloaded web page into a variable.

Before you use CURL, you must initiate a session with the
curl_init() function. Initialization creates a session variable,
which identifies options and data belonging to a specific
session. Once you create a session, you may use it as many
times as you need to.

<?php

// Open a CURL session


$ch = curl_init();

// Configure the CURL command


curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

// Execute the cURL command and


// save the contents of target web page to string.
$downloaded_page = curl_exec($ch);

// Close CURL session


curl_close($ch);

?>



The rest of this section details the various CURL options.

Note: Options settings must be capitalized, as shown in the example


above. This is because the option names are predefined PHP
constants. Therefore, your code will fail if you specify and option as
curlopt_url instead of CURLOPT_URL.

A.1.1 Setting CURL Options

The CURL session is initialized using the curl_setopt()


function. Each individual configuration option is set with a
separate call to this function. There are over 90 separate
configuration options available within CURL, making the
CURL interface very flexible. The average PHP user, however,
uses only a small subset of the available options. The following
sections describe the CURL options you are most likely to use.

CURLOPT_URL

Used to define the target URL for your CURL session.

curl_setopt($s, CURLOPT_URL, "http://www.example.com/");

You should use a fully formed URL describing the protocol,


domain, and file in every CURL file request.

CURLOPT_RETURNTRANSFER

This option must be set to TRUE, if you want the result to be


returned in a string. If you don't set this option to TRUE, CURL
echoes the result to the terminal.

curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);



CURLOPT_REFERER

This option allows your scraper to spoof a hyperlink reference


that was clicked to initiate the request for the target file. The
following example tells the target server that someone clicked a
link on http://www.myexample.com/index.php to request the
target web page.

curl_setopt($s, CURLOPT_REFERER,
"http://www.myexample.com/index.php");

CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS

This option tells CURL that you want it to follow every page
redirection it finds. It's important to understand that CURL only
honors header redirections and not redirections set with a
refresh meta tag or with JavaScript as CURL is not a browser.

// Example of redirection that CURL will follow


header("Location: http://www.google.com");

<!-- Examples of redirections that cURL will not follow-->


<meta http-equiv="Refresh"
content="0;url=http://www.google.com">
<script>document.location="http://www.google.com"</script>

Whenever you use CURLOPT_FOLLOWLOCATION, set


CURLOPT_MAXREDIRS to the maximum number of redirections
you care to follow. This helps keep your scraper from entering an
infinite loop.

curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);


curl_setopt($s, CURLOPT_MAXREDIRS, 3);



CURLOPT_NOBODY and CURLOPT_HEADER

These options tell CURL to return either the web page's header
or body. By default CURL always returns the body, but not the
header. This explains why setting CURLOPT_NOBODY to TRUE
excludes the body, and setting CURLOPT_HEADER to TRUE includes
the header.

curl_setopt($s, CURLOPT_HEADER, TRUE);


curl_setopt($s, CURLOPT_NOBODY, TRUE);

CURLOPT_TIMEOUT

If you don't limit how long CURL waits for a response from a
server, it may wait forever—especially if the url you're
fetching is on a busy server or you're trying to connect to a
nonexistent or inactive IP address or a dead link. Setting a time-
out value causes CURL to end the session if the download takes
longer than the time-out value.

curl_setopt($s, CURLOPT_TIMEOUT, 20); // In seconds

CURLOPT_USERAGENT

Use this option to define the name of your user agent. The user
agent name is recorded in server access log files and is available
to server-side scripts in the $_SERVER['HTTP_USER_AGENT']
variable. Some servers require a special user-agent to be passed
to validate, for example with RETS services. Many websites
will not even dole out pages correctly if your user agent name is
something other than a standard web browser user-agent.

$agent = "webbot";
curl_setopt($s, CURLOPT_USERAGENT, $agent);



CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR

One of the powerful features of CURL is the ability to manage


cookies sent to and received from a website. Use this option to
define the file where previously stored cookies exist. At the end
of the session CURL writes new cookies to the file indicated by
CURLOPT_COOKIEJAR. We have already seen the example in the
text previously.

// Read the cookie file


curl_setopt($s, CURLOPT_COOKIEFILE, "cookies.txt");
// Write the cookie file
curl_setopt($s, CURLOPT_COOKIEJAR, "cookies.txt");

When specifying the location of a cookie file, always use the


complete path of the file, and not relative paths.
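
For example, a sketch using PHP's __DIR__ constant, which
resolves to the directory of the running script:

curl_setopt($s, CURLOPT_COOKIEJAR, __DIR__ . '/cookies.txt');
curl_setopt($s, CURLOPT_COOKIEFILE, __DIR__ . '/cookies.txt');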

CURLOPT_HTTPHEADER

The CURLOPT_HTTPHEADER option allows a CURL session to


send header messages to the server. CURLOPT_HTTPHEADER
expects to receive data in an array.

$header[] = "Mime-Version: 1.0";


$header[] = "Content-type: text/html; charset=iso-8859-1";
$header[] = "Accept-Encoding: compress, gzip";
curl_setopt($s, CURLOPT_HTTPHEADER, $header);

CURLOPT_SSL_VERIFYPEER

You only need to use this option if the target website uses SSL
encryption and the protocol in CURLOPT_URL is https:.

curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);

Depending on the version of CURL installed on your system,


this option may be required; if you don't use it, the target server

will attempt to download a client certificate, which is
unnecessary in all but rare cases.

CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH

The CURLOPT_USERPWD option is required to work with a valid


username and password to access websites that use Basic
Authentication. Note that, in contrast to using a browser, you
will have to submit the username and password to every page
accessed within the basic authentication domain.

curl_setopt($s, CURLOPT_USERPWD, "username:password");


curl_setopt($s, CURLOPT_UNRESTRICTED_AUTH, TRUE);

If you use this option along with CURLOPT_FOLLOWLOCATION,


you should also set the CURLOPT_UNRESTRICTED_AUTH option,
which will ensure that the username and password are sent to all
pages you're redirected to, providing they are part of the same
domain.

CURLOPT_POST and CURLOPT_POSTFIELDS

The CURLOPT_POST and CURLOPT_POSTFIELDS options


configure CURL to emulate forms with the POST method. Since
the default method is GET, you must first tell CURL to use the
POST method. Then you must specify the POST data that you
want to be sent to the target server.

// Use the HTTP POST method


curl_setopt($s, CURLOPT_POST, TRUE);
// Define POST data
$post_data = "para1=1&para2=2&para3=3";
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);

Notice that the POST data looks like a standard query string sent
in a GET method. Incidentally, to send form information with the
GET method, simply attach the query string to the target URL.
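
For example, a sketch of the equivalent GET request; the script
name 'form.php' here is purely hypothetical:

// Send the same fields with the GET method
curl_setopt($s, CURLOPT_URL,
    "http://www.example.com/form.php?para1=1&para2=2&para3=3");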



CURLOPT_VERBOSE

The CURLOPT_VERBOSE option controls the quantity of status


messages created during a data transfer. You may find this
helpful during debugging, but it is best to turn off this option
during the production phase, because it produces many entries
in your server log file. If you're in verbose mode on a busy
server, you'll create very large log files.

curl_setopt($s, CURLOPT_VERBOSE, FALSE);

CURLOPT_PORT

By default CURL uses port 80 for all HTTP sessions, unless


you are connecting to an SSL encrypted server, in which case
port 443 is used. These are the standard port numbers for HTTP
and HTTPS protocols, respectively. If you're connecting to a
custom protocol or wish to connect to a non-web protocol, use
CURLOPT_PORT to set the desired port number.

curl_setopt($s, CURLOPT_PORT, 234);

A.2

[HTTP STATUS CODES]

A.2 HTTP Status Codes

The below table is a quick reference for all the status codes
defined in the HTTP/1.1 specification, providing a brief
summary of each.

100 Continue: An initial part of the request was received, and the client should continue.

101 Switching Protocols: The server is changing protocols, as specified by the client, to one listed in the Upgrade header.

200 OK: The request is okay.

201 Created: The resource was created (for requests that create server objects).

202 Accepted: The request was accepted, but the server has not yet performed any action with it.

203 Non-Authoritative Information: The transaction was okay, except the information contained in the entity headers was not from the origin server, but from a copy of the resource.

204 No Content: The response message contains headers and a status line, but no entity body.

205 Reset Content: Another code primarily for browsers; means that the browser should clear any HTML form elements on the current page.

206 Partial Content: A partial request was successful.

300 Multiple Choices: A client has requested a URL that actually refers to multiple resources. This code is returned along with a list of options; the user can then select which one he wants.

301 Moved Permanently: The requested URL has been moved. The response should contain a Location URL indicating where the resource now resides.

302 Found: Like the 301 status code, but the move is temporary. The client should use the URL given in the Location header to locate the resource temporarily.

303 See Other: Tells the client that the resource should be fetched using a different URL. This new URL is in the Location header of the response message.

304 Not Modified: Clients can make their requests conditional by the request headers they include. This code indicates that the resource has not changed.

305 Use Proxy: The resource must be accessed through a proxy; the location of the proxy is given in the Location header.

306 (Unused): This status code currently is not used.

307 Temporary Redirect: Like the 301 status code; however, the client should use the URL given in the Location header to locate the resource temporarily.

400 Bad Request: Tells the client that it sent a malformed request.

401 Unauthorized: Returned along with appropriate headers that ask the client to authenticate itself before it can gain access to the resource.

402 Payment Required: Currently this status code is not used, but it has been set aside for future use.

403 Forbidden: The request was refused by the server.

404 Not Found: The server cannot find the requested URL.

405 Method Not Allowed: A request was made with a method that is not supported for the requested URL. The Allow header should be included in the response to tell the client what methods are allowed on the requested resource.

406 Not Acceptable: Clients can specify parameters about what types of entities they are willing to accept. This code is used when the server has no resource matching the URL that is acceptable for the client.

407 Proxy Authentication Required: Like the 401 status code, but used for proxy servers that require authentication for a resource.

408 Request Timeout: If a client takes too long to complete its request, a server can send back this status code and close down the connection.

409 Conflict: The request is causing some conflict on a resource.

410 Gone: Like the 404 status code, except that the server once held the resource.

411 Length Required: Servers use this code when they require a Content-Length header in the request message. The server will not accept requests for the resource without the Content-Length header.

412 Precondition Failed: If a client makes a conditional request and one of the conditions fails, this response code is returned.

413 Request Entity Too Large: The client sent an entity body that is larger than the server can or wants to process.

414 Request URI Too Long: The client sent a request with a request URL that is larger than what the server can or wants to process.

415 Unsupported Media Type: The client sent an entity of a content type that the server does not understand or support.

416 Requested Range Not Satisfiable: The request message requested a range of a given resource, and that range either was invalid or could not be met.

417 Expectation Failed: The request contained an expectation in the Expect request header that could not be satisfied by the server.

500 Internal Server Error: The server encountered an error that prevented it from servicing the request.

501 Not Implemented: The client made a request that is beyond the server's capabilities.

502 Bad Gateway: A server acting as a proxy or gateway encountered a bogus response from the next link in the request response chain.

503 Service Unavailable: The server cannot currently service the request but will be able to in the future.

504 Gateway Timeout: Similar to the 408 status code, except that the response is coming from a gateway or proxy that has timed out waiting for a response to its request from another server.

505 HTTP Version Not Supported: The server received a request in a version of the protocol that it can't or won't support.
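
In a scraping script you will usually want to check the status
code of a response before trying to parse it. The following
sketch (the URL is a placeholder) reads the code of the last
transfer with curl_getinfo():

// NOTE: the URL below is a placeholder
$s = curl_init('http://www.example.com/page.html');
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($s);
// Read the HTTP status code of the last transfer
$status = curl_getinfo($s, CURLINFO_HTTP_CODE);
curl_close($s);

if($status == 200) {
    // Safe to parse $html
} elseif($status == 404) {
    echo "Page not found";
}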

