What Are HTTP Headers?: All About PHP in Interaction With HTTP & Dom
Example
When you type a URL in your address bar, your browser sends an HTTP request, which may look like this:
First line is the Request Line which contains some basic info on the request. And the rest are the HTTP
headers.
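To make the split between the request line and the headers concrete, here is a minimal sketch that takes a raw request apart. The request text is a made-up example (its header values are assumptions), and split_request() is a hypothetical helper, not a PHP built-in.

```php
<?php
// Hypothetical raw request; the header values are illustrative only.
$rawRequest = "GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1\r\n"
            . "Host: net.tutsplus.com\r\n"
            . "Accept-Encoding: gzip,deflate\r\n"
            . "\r\n";

// Hypothetical helper: separates the request line from the Name: Value headers.
function split_request(string $raw): array
{
    // The header section ends at the first blank line.
    $head = explode("\r\n\r\n", $raw, 2)[0];
    $lines = explode("\r\n", $head);
    $requestLine = array_shift($lines);

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not a Name: Value pair
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[trim($name)] = trim($value);
    }
    return ['request_line' => $requestLine, 'headers' => $headers];
}

$parsed = split_request($rawRequest);
echo $parsed['request_line'], PHP_EOL;
print_r($parsed['headers']);
```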
After that request, your browser receives an HTTP response that may look like this:
HTTP/1.x 200 OK
Transfer-Encoding: chunked
Date: Sat, 28 Nov 2009 04:36:25 GMT
Server: LiteSpeed
Connection: close
X-Powered-By: W3 Total Cache/0.8
Pragma: public
Expires: Sat, 28 Nov 2009 05:36:25 GMT
Etag: "pub1259380237;gz"
Cache-Control: max-age=3600, public
Content-Type: text/html; charset=UTF-8
Last-Modified: Sat, 28 Nov 2009 03:50:37 GMT
X-Pingback: http://net.tutsplus.com/xmlrpc.php
Content-Encoding: gzip
Vary: Accept-Encoding, Cookie, User-Agent
The first line is the Status Line, followed by HTTP headers, until the blank line. After that, the
content starts (in this case, an HTML output).
When you look at the source code of a web page in your browser, you will only see the HTML portion and
not the HTTP headers, even though they actually have been transmitted together as you see above.
These HTTP requests are also sent and received for other things, such as images, CSS files, JavaScript
files etc. That is why I said earlier that your browser has sent at least 40 or more HTTP requests as you
loaded just this article page.
Now, let's start reviewing the structure in more detail.
In PHP:
getallheaders() gets the request headers. You can also use the $_SERVER array.
headers_list() gets the response headers.
Further in the article, we will see some code examples in PHP.
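Since getallheaders() is not available under every server API, one common fallback is to rebuild the header names from the HTTP_* keys of $_SERVER. The sketch below assumes a helper named headers_from_server() (a made-up name); it takes the array as a parameter so it can be tried outside a web request.

```php
<?php
// Hypothetical helper: turns HTTP_* entries of a $_SERVER-style array into headers.
function headers_from_server(array $server): array
{
    $headers = [];
    foreach ($server as $key => $value) {
        if (strncmp((string) $key, 'HTTP_', 5) === 0) {
            // HTTP_ACCEPT_ENCODING becomes Accept-Encoding
            $words = strtolower(str_replace('_', ' ', substr($key, 5)));
            $headers[str_replace(' ', '-', ucwords($words))] = $value;
        }
    }
    return $headers;
}

// Sample input standing in for $_SERVER.
$sample = [
    'HTTP_HOST'       => 'example.com',
    'HTTP_USER_AGENT' => 'ExampleBrowser/1.0',
    'REQUEST_METHOD'  => 'GET',
];
print_r(headers_from_server($sample));
```

In a real request you would call headers_from_server($_SERVER).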
The first line of the HTTP request is called the request line and consists of 3 parts:
The method indicates what kind of request this is. Most common methods are GET, POST and HEAD.
The path is generally the part of the URL that comes after the host (domain). For example, when
requesting http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/ , the path portion is
/tutorials/other/top-20-mysql-best-practices/.
The protocol part contains HTTP and the version, which is usually 1.1 in modern browsers.
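The three parts described above can be pulled out of a request line with a few lines of PHP. This is only a sketch; parse_request_line() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: splits a request line into method, path and protocol.
function parse_request_line(string $line): array
{
    // The three parts are separated by single spaces.
    $parts = explode(' ', trim($line), 3);
    if (count($parts) !== 3) {
        throw new InvalidArgumentException("Malformed request line: $line");
    }
    [$method, $path, $protocol] = $parts;
    return ['method' => $method, 'path' => $path, 'protocol' => $protocol];
}

$requestParts = parse_request_line('GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1');
print_r($requestParts);
```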
The remainder of the request contains HTTP headers as Name: Value pairs on each line. These contain
various information about the HTTP request and your browser. For example, the User-Agent line
provides information on the browser version and the Operating System you are using. Accept-Encoding
tells the server if your browser can accept compressed output like gzip.
You may have noticed that the cookie data is also transmitted inside an HTTP header. And if there was a
referring url, that would have been in the header too.
Most of these headers are optional. This HTTP request could have been as small as this:
And you would still get a valid response from the web server.
Request Methods
The three most commonly used request methods are: GET, POST and HEAD. You're probably already
familiar with the first two, from writing HTML forms.
Web forms can be set to use the method GET. Here is an example.
When that form is submitted, the HTTP request begins like this:
You can see that each form input was added into the query string.
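As a rough illustration of how form inputs end up in the query string, the sketch below builds one with http_build_query(). The field names, values and the /foo.php path are assumptions made up for this example.

```php
<?php
// Hypothetical form fields, as a GET form might submit them.
$fields = ['first_name' => 'John', 'last_name' => 'Doe', 'action' => 'Submit'];

// Each input becomes a name=value pair, joined with "&".
$query = http_build_query($fields);
echo 'GET /foo.php?' . $query . ' HTTP/1.1', PHP_EOL;

// parse_str() reverses the process, much like PHP does when filling $_GET.
parse_str($query, $decoded);
print_r($decoded);
```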
In a POST request, Content-Type and Content-Length headers are added, which provide information about the data
being sent.
All the data is now sent after the headers, in the same format as the query string.
POST method requests can also be made via AJAX, applications, cURL, etc. And all file upload forms are
required to use the POST method.
When you send a HEAD request, it means that you are only interested in the
response code and the HTTP headers, not the document itself.
With this method the browser can check if a document has been modified, for caching purposes. It can also
check if the document exists at all.
For example, if you have a lot of links on your website, you can periodically send HEAD requests to all of
them to check for broken links. This will work much faster than using GET.
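A link checker along these lines could be sketched as follows. The helper names are hypothetical; the code relies on get_headers() with a stream context (PHP 7.1 or later) and on allow_url_fopen being enabled.

```php
<?php
// Hypothetical helper: extracts the status code from a status line
// such as "HTTP/1.1 404 Not Found".
function status_code_from_line(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: sends a HEAD request and returns the final status code,
// or null if the request failed.
function head_status(string $url): ?int
{
    $context = stream_context_create(['http' => ['method' => 'HEAD', 'timeout' => 5]]);
    $headers = @get_headers($url, false, $context);
    if ($headers === false) {
        return null;
    }
    // When redirects are followed, several status lines appear; keep the last one.
    $code = null;
    foreach ($headers as $header) {
        $candidate = status_code_from_line($header);
        if ($candidate !== null) {
            $code = $candidate;
        }
    }
    return $code;
}

// Usage (needs network access):
// if (head_status('http://example.com/some/page') === 404) { echo "Broken link"; }
```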
The first piece of data is the protocol. This is again usually HTTP/1.x or HTTP/1.1 on modern servers.
The next part is the status code followed by a short message. Code 200 means that our GET request was
successful and the server will return the contents of the requested document, right after the headers.
We have all seen "404" pages. This number actually comes from the status code part of the HTTP
response. If a GET request is made for a path that the server cannot find, it responds with a
404 instead of 200.
The rest of the response contains headers just like the HTTP request. These values can contain information
about the server software, when the page/file was last modified, the MIME type, etc.
Again, most of those headers are actually optional.
200 OK
As mentioned before, this status code is sent in response to a successful request.
404 Not Found
When the requested page or file was not found, a 404 response code is sent by the server.
401 Unauthorized
Password protected web pages send this code. If you don't enter a login correctly, you may see the
following in your browser.
Note that this only applies to HTTP password protected pages, that pop up login prompts like this:
403 Forbidden
If you are not allowed to access a page, this code may be sent to your browser. This often happens when
you try to open a URL for a folder that contains no index page. If the server settings do not allow the display
of the folder contents, you will get a 403 error.
For example, on my local server I created an images folder. Inside this folder I put an .htaccess file with
this line: "Options -Indexes". Now when I try to open http://localhost/images/ I see this:
There are other ways in which access can be blocked, and 403 can be sent. For example, you can block by
IP address, with the help of some .htaccess directives:
order allow,deny
deny from 192.168.44.201
deny from 224.39.163.12
deny from 172.16.7.92
allow from all
500 Internal Server Error
This code is usually seen when a web script crashes. Most CGI scripts do not output errors directly to the
browser, unlike PHP. If there are any fatal errors, they will just send a 500 status code, and the programmer
then needs to search the server error logs to find the error messages.
Complete List
You can find the complete list of HTTP status codes with their explanations here.
Host
An HTTP request is sent to a specific IP address. But since most servers are capable of hosting multiple
websites under the same IP, they must know which domain name the browser is looking for.
Host: net.tutsplus.com
This is basically the host name, including the domain and the subdomain.
In PHP, it can be found as $_SERVER['HTTP_HOST'] or $_SERVER['SERVER_NAME'].
User-Agent
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
This header can carry several pieces of information, such as the browser name and version, the operating
system name and version, and the default language.
This is how websites can collect certain general information about their surfers' systems. For example,
they can detect if the surfer is using a cell phone browser and redirect them to a mobile version of their
website which works better with low resolutions.
In PHP, it can be found with: $_SERVER['HTTP_USER_AGENT'].
if ( strstr($_SERVER['HTTP_USER_AGENT'],'MSIE 6') ) {
echo "Please stop using IE6!";
}
Accept-Language
Accept-Language: en-us,en;q=0.5
This header displays the default language setting of the user. If a website has different language versions, it
can redirect a new surfer based on this data.
It can carry multiple languages, separated by commas. The first one is the preferred language, and each
other listed language can carry a "q" value, which is an estimate of the user's preference for the language
(min. 0 max. 1).
In PHP, it can be found as: $_SERVER["HTTP_ACCEPT_LANGUAGE"].
if (substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2) == 'fr') {
    header('Location: http://french.mydomain.com');
    exit; // stop the script so nothing else is sent after the redirect
}
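The check above only looks at the first two characters. To honor the q values described earlier, the header can be parsed into a list ordered by preference. This is a minimal sketch; parse_accept_language() is a hypothetical helper name, and a missing q is assumed to mean 1.

```php
<?php
// Hypothetical helper: returns language => q pairs, highest preference first.
function parse_accept_language(string $header): array
{
    $prefs = [];
    foreach (explode(',', $header) as $part) {
        $pieces = explode(';', trim($part));
        $lang = strtolower(trim($pieces[0]));
        if ($lang === '') {
            continue;
        }
        $q = 1.0; // a language listed without a q value is fully preferred
        foreach (array_slice($pieces, 1) as $param) {
            $param = trim($param);
            if (strncasecmp($param, 'q=', 2) === 0) {
                $q = (float) substr($param, 2);
            }
        }
        $prefs[$lang] = $q;
    }
    arsort($prefs); // sort by q, descending, keeping the language keys
    return $prefs;
}

$prefs = parse_accept_language('en-us,en;q=0.5');
print_r($prefs);
```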
Accept-Encoding
Accept-Encoding: gzip,deflate
Most modern browsers support gzip, and will send this in the header. The web server then can send the
HTML output in a compressed format. This can reduce the size by up to 80% to save bandwidth and time.
In PHP, it can be found as: $_SERVER["HTTP_ACCEPT_ENCODING"]. However, when you use
the ob_gzhandler() callback function, it will check this value automatically, so you don't need to.
If-Modified-Since
If a web document is already cached in your browser, and you visit it again, your browser can check if the
document has been updated by sending this:
If-Modified-Since: Sat, 28 Nov 2009 03:50:37 GMT
If it was not modified since that date, the server will send a "304 Not Modified" response code and no
content, and the browser will load the content from its cache.
In PHP, it can be found as: $_SERVER['HTTP_IF_MODIFIED_SINCE'].
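One way to act on this header is sketched below. The helper is_not_modified() is a hypothetical name, and the commented usage assumes a $file variable holding the path of the document being served.

```php
<?php
// Hypothetical helper: true when the client's cached copy is still current.
function is_not_modified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false; // the browser sent no date, so send the full document
    }
    $since = strtotime($ifModifiedSince);
    if ($since === false) {
        return false; // unparseable date
    }
    return $lastModified <= $since;
}

// Usage in a web script (assumes $file exists):
// $mtime = filemtime($file);
// if (is_not_modified($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null, $mtime)) {
//     header('HTTP/1.1 304 Not Modified');
//     exit;
// }
// header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
// readfile($file);
```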
There is also an HTTP header named Etag, which can be used to make sure the cache is current. We'll talk
about this shortly.
Cookie
As the name suggests, this sends the cookies stored in your browser for that domain.
These are name=value pairs separated by semicolons. Cookies can also contain the session id.
In PHP, individual cookies can be accessed with the $_COOKIE array. You can directly access the session
variables using the $_SESSION array, and if you need the session id, you can use the session_id() function
instead of the cookie.
echo $_COOKIE['foo'];
// output: bar
echo $_COOKIE['PHPSESSID'];
// output: r2t5uvjq435r4q7ib3vtdjq120
session_start();
echo session_id();
// output: r2t5uvjq435r4q7ib3vtdjq120
Referer
As the name suggests, this HTTP header contains the referring url.
For example, if I visit the Nettuts+ homepage and click on an article link, my browser sends this header:
Referer: http://net.tutsplus.com/
In PHP, it can be found as $_SERVER['HTTP_REFERER'].
if (isset($_SERVER['HTTP_REFERER'])) {
    $url_info = parse_url($_SERVER['HTTP_REFERER']);
    // $url_info['host'] now holds the host name of the referring page
}
You may have noticed that the word "referrer" is misspelled as "referer". Unfortunately, it made it into the
official HTTP specifications like that and got stuck.
Authorization
When a web page asks for authorization, the browser opens a login window. When you enter a username
and password in this window, the browser sends another HTTP request, but this time it contains this
header.
Authorization: Basic bXl1c2VyOm15cGFzcw==
The data inside the header is base64 encoded. For example, base64_decode('bXl1c2VyOm15cGFzcw==')
would return 'myuser:mypass'.
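The decoding step can be tried directly. The sketch below assumes a Basic scheme header value and uses a hypothetical helper, decode_basic_auth(); in a normal PHP setup you would simply read PHP_AUTH_USER and PHP_AUTH_PW instead.

```php
<?php
// Hypothetical helper: decodes the value of a Basic Authorization header.
function decode_basic_auth(string $headerValue): ?array
{
    if (stripos($headerValue, 'Basic ') !== 0) {
        return null; // not the Basic scheme
    }
    $decoded = base64_decode(substr($headerValue, 6), true);
    if ($decoded === false || strpos($decoded, ':') === false) {
        return null; // not valid base64, or no user:password separator
    }
    // Only the first colon separates user and password.
    [$user, $pass] = explode(':', $decoded, 2);
    return ['user' => $user, 'pass' => $pass];
}

$credentials = decode_basic_auth('Basic bXl1c2VyOm15cGFzcw==');
print_r($credentials);
```

Note that base64 is an encoding, not encryption, so these credentials are only protected when the connection itself is encrypted.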
In PHP, these values can be found as $_SERVER['PHP_AUTH_USER'] and
$_SERVER['PHP_AUTH_PW'].
More on this when we talk about the WWW-Authenticate header.
Cache-Control
Definition from w3.org: "The Cache-Control general-header field is used to specify directives which
MUST be obeyed by all caching mechanisms along the request/response chain." These "caching
mechanisms" include gateways and proxies that your ISP may be using.
Example:
Cache-Control: max-age=3600, public
public means that the response may be cached by anyone. max-age indicates how many seconds the
cache is valid for. Allowing your website to be cached can reduce server load and bandwidth, and also
improve load times at the browser.
Caching can also be prevented by using the no-cache directive.
Cache-Control: no-cache
Content-Type
This header indicates the MIME type of the document. The browser then decides how to interpret the
contents based on this. For example, an HTML page (or a PHP script with HTML output) may return this:
Content-Type: text/html; charset=UTF-8
"text" is the type and "html" is the subtype of the document. The header can also contain more info such as
charset.
For a GIF image, this may be sent:
Content-Type: image/gif
The browser can decide to use an external application or browser extension based on the MIME type. For
example, this will cause Adobe Reader to be loaded:
Content-Type: application/pdf
When loading directly, Apache can usually detect the MIME type of a document and send the appropriate
header. Also, most browsers have some amount of fault tolerance and auto-detection of MIME types, in
case the headers are wrong or not present.
You can find a list of common mime types here.
In PHP, you can use the finfo_file() function to detect the mime type of a file.
Content-Disposition
This header instructs the browser to open a file download box, instead of trying to parse the content.
Example:
Content-Disposition: attachment; filename="download.zip"
Note that the appropriate Content-Type header should also be sent along with this:
Content-Type: application/zip
Content-Disposition: attachment; filename="download.zip"
Content-Length
When content is going to be transmitted to the browser, the server can indicate the size of it (in bytes)
using this header.
Content-Length: 89123
This is especially useful for file downloads. That's how the browser can determine the progress of the
download.
For example, here is a dummy script I wrote, which simulates a slow download.
Without a Content-Length header, the browser can only tell you how many bytes have been downloaded; it does
not know the total amount, so the progress bar cannot show the progress.
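Putting the download-related headers together, a script might assemble them as in this sketch. download_headers() is a hypothetical helper, the application/zip type is taken from the earlier example, and the file name is assumed to contain no quote characters.

```php
<?php
// Hypothetical helper: builds the header lines for a file download.
function download_headers(string $downloadName, int $sizeInBytes): array
{
    return [
        'Content-Type: application/zip',
        'Content-Disposition: attachment; filename="' . $downloadName . '"',
        // The size lets the browser show real download progress.
        'Content-Length: ' . $sizeInBytes,
    ];
}

print_r(download_headers('download.zip', 89123));

// Usage in a web script (assumes $path points to an existing file):
// foreach (download_headers('download.zip', filesize($path)) as $line) {
//     header($line);
// }
// readfile($path);
```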
Etag
This is another header that is used for caching purposes. It looks like this:
Etag: "pub1259380237;gz"
The web server may send this header with every document it serves. The value can be based on the last
modification date, file size or even the checksum value of a file. The browser then saves this value as it caches
the document. Next time the browser requests the same file, it sends this in the HTTP request:
If-None-Match: "pub1259380237;gz"
If the Etag value of the document matches that, the server will send a 304 code instead of 200, and no
content. The browser will load the contents from its cache.
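A checksum-based Etag comparison could look roughly like this. Both helper names are hypothetical, md5 is just one possible way to derive the value, and weak validators (the W/ prefix) are ignored for simplicity.

```php
<?php
// Hypothetical helper: derives a quoted Etag value from the content checksum.
function etag_for(string $content): string
{
    return '"' . md5($content) . '"';
}

// Hypothetical helper: true when the If-None-Match value matches the current Etag.
function etag_matches(?string $ifNoneMatch, string $etag): bool
{
    if ($ifNoneMatch === null) {
        return false;
    }
    // The header may list several Etags separated by commas, or "*".
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '*' || $candidate === $etag) {
            return true;
        }
    }
    return false;
}

// Usage in a web script (assumes $content holds the response body):
// $etag = etag_for($content);
// if (etag_matches($_SERVER['HTTP_IF_NONE_MATCH'] ?? null, $etag)) {
//     header('HTTP/1.1 304 Not Modified');
//     exit;
// }
// header('Etag: ' . $etag);
// echo $content;
```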
Last-Modified
As the name suggests, this header indicates the last modification date of the document, in GMT format:
Last-Modified: Sat, 28 Nov 2009 03:50:37 GMT
In PHP, it can be sent like this:
$modify_time = filemtime($file);
header("Last-Modified: " . gmdate("D, d M Y H:i:s", $modify_time) . " GMT");
It offers another way for the browser to cache a document. The browser may send this value back in the
If-Modified-Since header of a later HTTP request.
Location
This header is used for redirections. If the response code is 301 or 302, the server must also send this
header. For example, when you go to http://www.nettuts.com your browser receives a redirect response
that carries a Location header.
In PHP, you can redirect a surfer like this:
header('Location: http://net.tutsplus.com/');
By default, that will send a 302 response code. If you want to send 301 instead:
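PHP's header() function accepts the response code as its third argument, which is one way to send a 301. The wrapper below is a hypothetical helper added for illustration; it only allows the two codes discussed here.

```php
<?php
// Hypothetical helper: sends a Location header with an explicit status code.
function redirect(string $url, int $status = 302): void
{
    if (!in_array($status, [301, 302], true)) {
        throw new InvalidArgumentException("Unsupported redirect status: $status");
    }
    // Second argument: replace a previous header of the same name.
    // Third argument: force the HTTP response code.
    header('Location: ' . $url, true, $status);
}

// Usage in a web script, before any output is sent:
// redirect('http://net.tutsplus.com/', 301);
// exit;
```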
Set-Cookie
When a website wants to set or update a cookie in your browser, it will use this header.
In PHP, you can set cookies with the setcookie() function:
setcookie("TestCookie", "foobar");
This causes the following header to be sent:
Set-Cookie: TestCookie=foobar
If the expiration date is not specified, the cookie is deleted when the browser window is closed.
WWW-Authenticate
A website may send this header to authenticate a user through HTTP. When the browser sees this header, it
will open up a login dialogue window.
There is a section in the PHP manual that has code samples on how to do this in PHP.
if (!isset($_SERVER['PHP_AUTH_USER'])) {
header('WWW-Authenticate: Basic realm="My Realm"');
header('HTTP/1.0 401 Unauthorized');
echo 'Text to send if user hits Cancel button';
exit;
} else {
echo "<p>Hello {$_SERVER['PHP_AUTH_USER']}.</p>";
echo "<p>You entered {$_SERVER['PHP_AUTH_PW']} as your password.</p>";
}
Content-Encoding
This header is usually set when the returned content is compressed.
Content-Encoding: gzip
In PHP, if you use the ob_gzhandler() callback function, it will be set automatically for you.
Conclusion
Thanks for reading. I hope this article was a good starting point to learn about HTTP Headers. Please leave
your comments and questions below, and I will try to respond as much as I can.
Domdocument
The DOMDocument class of PHP is a very handy one that can be used for a number of tasks like
parsing XML and HTML, and creating XML. It is documented here.
In this tutorial we are going to see how to use this class to parse HTML content. The need to parse
HTML arises when you are, for example, writing scrapers or similar data extraction scripts.
Sample html
The following is the sample html file that we are going to use with DomDocument.
<html>
<body>
<div id="mango">
This is the mango div. It has some text and a form too.
<form>
<input type="text" name="first_name" value="Yahoo" />
<input type="text" name="last_name" value="Bingo" />
</form>
<table class="inner">
<tr><td>Happy</td><td>Sky</td></tr>
</table>
</div>
<table id="data" class="outer">
<tr><td>Happy</td><td>Sky</td></tr>
<tr><td>Happy</td><td>Sky</td></tr>
<tr><td>Happy</td><td>Sky</td></tr>
<tr><td>Happy</td><td>Sky</td></tr>
<tr><td>Happy</td><td>Sky</td></tr>
</table>
</body>
</html>
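Loading markup like this into DOMDocument could be sketched as follows. The inline string is an abbreviated stand-in for the sample file (an assumption made to keep the example self-contained), and libxml warnings about imperfect HTML are silenced.

```php
<?php
// Abbreviated stand-in for the sample html above.
$html = '<html><body>'
      . '<div id="mango">This is the mango div.</div>'
      . '<table id="data" class="outer"><tr><td>Happy</td><td>Sky</td></tr></table>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // use loadHTMLFile() to read from a file instead
libxml_clear_errors();

echo $dom->getElementsByTagName('table')->length, PHP_EOL;
echo $dom->getElementById('mango')->nodeValue, PHP_EOL;
```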
Done. The $dom object has loaded the html content and can be used to extract contents from
the whole html structure, just like it's done in JavaScript. The most common functions are
getElementsByTagName and getElementById.
Now that the html is loaded, it's time to see how nodes and child elements can be accessed.
//get element by id
$mango_div = $dom->getElementById('mango');
if(!$mango_div)
{
    die("Element not found");
}
echo "element found";
//print the text content of the element
echo $mango_div->nodeValue;
The second method is to use the saveHTML function, that gets out the exact html inside that
particular node.
echo $dom->saveHTML($mango_div);
inner html
To get just the inner html take the following approach. It adds up the html of all of the child
nodes.
$tables = $dom->getElementsByTagName('table');
echo get_inner_html($tables->item(0));
function get_inner_html( $node )
{
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child)
{
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
The function get_inner_html gets the inner html of the html element. Note that we used the
saveXML function instead of the saveHTML function. The property "childNodes" provides the
child nodes of an element. These are the direct children.
$tables = $dom->getElementsByTagName('table');
foreach($tables as $table)
{
echo $dom->saveHTML($table);
}
$tables = $dom->getElementsByTagName('table');
echo "Found : ".$tables->length. " items";
$i = 0;
while($table = $tables->item($i++))
{
echo $dom->saveHTML($table);
}
The item function takes the index of the item to be fetched. The length property of the
DOMNodeList gives the number of objects found.
$tables = $dom->getElementsByTagName('table');
$i = 0;
while($table = $tables->item($i++))
{
$class_node = $table->attributes->getNamedItem('class');
if($class_node)
{
echo "Class is : " . $table->attributes->getNamedItem('class')->value . PHP_EOL;
}
}
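DOMElement also offers hasAttribute() and getAttribute(), which can be a shorter way to read the same value. The markup in this sketch is a made-up two-table document, not the sample file.

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Hypothetical markup: one table with a class attribute, one without.
$doc->loadHTML('<html><body><table class="inner"></table><table></table></body></html>');
libxml_clear_errors();

$classes = [];
foreach ($doc->getElementsByTagName('table') as $table) {
    // hasAttribute() avoids reading a class that is not there.
    if ($table->hasAttribute('class')) {
        $classes[] = $table->getAttribute('class');
    }
}
print_r($classes);
```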
Children of a node
A DomNode has the following properties that provide access to its children
1. childNodes
2. firstChild
3. lastChild
$tables = $dom->getElementsByTagName('table');

$table = $tables->item(1);

//get the number of rows in the 2nd table
echo $table->childNodes->length;
//content of each child
foreach($table->childNodes as $child)
{
    echo $child->ownerDocument->saveHTML($child);
}
Checking if child nodes exist
The hasChildNodes function can be used to check if a node has any children at all.
Quick example
if( $table->hasChildNodes() )
{
    //print content of children
    foreach($table->childNodes as $child)
    {
        echo $child->ownerDocument->saveHTML($child);
    }
}
$tables = $dom->getElementsByTagName('table');
$table = $tables->item(1);
$table2 = $dom->getElementById('data');
var_dump($table->isSameNode($table2));
The var_dump would show true, indicating that the tables in both $table and $table2 are the
same.
Conclusion
The above examples showed how Domdocument can be used to access elements in an html
document in an object oriented manner. Domdocument can not only parse html but also
create/modify html and xml. In later articles we shall see how to do that.
Step 1. Preparation
The first thing you'll need to do is download a copy of the simpleHTMLdom library, freely available
from SourceForge.
There are several files in the download, but the only one you need is the simple_html_dom.php file; the
rest are examples and documentation.
Loading HTML
You can create your initial object either by loading HTML from a string, or from a file. Loading a file can
be done either via URL, or via your local file system.
A note of caution: The load_file() method delegates its job to PHP's file_get_contents. If allow_url_fopen
is not enabled in your php.ini file, you may not be able to open a remote file this way. You could always
fall back on the cURL library to load remote pages in this case, then read them in with the load() method.
Accessing Information
Once you have your DOM object, you can start to work with it by using find() and creating collections. A
collection is a group of objects found via a selector; the syntax is quite similar to jQuery.
<html>
<body>
<p>Hello World!</p>
<p>We're Here.</p>
</body>
</html>
In this example HTML, we're going to take a look at how to access the information in the second
paragraph, change it, and then output the results.
$element[1]->class = "class_name";
echo $html->save();
<html>
<body>
<p>Hello World!</p>
<p class="class_name">We're here and we're here to stay.</p>
</body>
</html>
Other Selectors
Here are some other examples of selectors. If you've used jQuery, these will seem very familiar.
The first example isn't entirely intuitive: all queries return collections by default, even an ID query, which
should only return a single result. However, by specifying the second parameter, we are saying "only
return the first item of this collection".
This means $single is a single element, rather than an array of elements with one item.
Documentation
Complete documentation on the library can be found at the project documentation page.
include('simple_html_dom.php');
$articles = array();
getArticles('http://net.tutsplus.com/page/76/');
<div class="preview">
<!-- Post Taxonomies -->
<div class="post_taxonomy"> ... </div>
<!-- Post Title -->
<h1 class="post_title"><a>Title</a></h1>
<!-- Post Meta -->
<div class="post_meta"> ... </div>
<div class="text"><p>Description</p></div>
</div>
This represents a basic post format on the site, including source code comments. Why are the comments
important? They count as nodes to the parser.
function getArticles($page) {
global $articles;
$html = new simple_html_dom();
$html->load_file($page);
// ... more ...
}
We begin very simply by claiming our global, creating a new simple_html_dom object, then loading the
page we want to parse. This function is going to be calling itself later, so we're setting it up to accept the
URL as a parameter.
$items = $html->find('div[class=preview]');
foreach($items as $post) {
# remember comments count as nodes
$articles[] = array($post->children(3)->outertext,
$post->children(6)->first_child()->outertext);
}
This is the meat of the getArticles function. It takes a closer look to really understand what's
happening.
The find() call: Creates an array of elements: divs with the class of preview. We now have a collection of articles
stored in $items.
The first array entry: $post now refers to a single div of class preview. If we look at the original HTML, we can see that
the child at index 3 is the H1 containing the article title. We take that and assign it to $articles[index][0].
Remember to start at 0 and to count comments when trying to determine the proper index of a child node.
The second array entry: The child at index 6 of $post is <div class="text">. We want the description text from within, so we
grab the first child's outertext; this will include the paragraph tag. A single record in articles now looks
like this:
Step 6, Pagination
The first thing we do is determine how to find our next page. On Nettuts+, the URLs are easy to figure out,
but we're going to pretend they aren't, and get the next link via parsing.
If there is a next page (and there won't always be), we'll find an anchor with the class of "nextpostslink".
Now that information can be put to use.
On the first line, we see if we can find an anchor with the class nextpostslink. Take special notice of the
second parameter for find(). This specifies we only want the first element (index 0) of the found collection
returned. $next will only be holding a single element, rather than a group of elements.
Next, we assign the link's HREF to the variable $URL. This is important because we're about to destroy
the HTML object. Due to a PHP 5 circular references memory leak, the current simple_html_dom object
must be cleared and unset before another one is created. Failure to do so could cause you to eat up all your
available memory.
Finally, we call getArticles with the URL of the next page. This recursion ends when there are no more
pages to parse.
#main {
margin:80px auto;
width:500px;
}
h1 {
font:bold 40px/38px helvetica, verdana, sans-serif;
margin:0;
}
h1 a {
color:#600;
text-decoration:none;
}
p{
background: #ECECEC;
font:10px/14px verdana, sans-serif;
margin:8px 0 15px;
border: 1px #CCC solid;
padding: 15px;
}
.item {
padding:10px;
}
Next we're going to put a small bit of PHP in the page to output the previously stored information.
<?php
foreach($articles as $item) {
echo "<div class='item'>";
echo $item[0];
echo $item[1];
echo "</div>";
}
?>
The final result is a single HTML page listing all the articles, starting on the page indicated by the first
getArticles() call.
Step 8 Conclusion
If you're parsing a great many pages (say, the entire site), it may take longer than the max execution time
allowed by your server. For example, running from my local machine it takes about one second per page
(including time to fetch).
On a site like Nettuts, with a current 78 pages of tutorials, this would run over one minute.
This tutorial should get you started with HTML parsing. There are other methods to work with the DOM,
including PHP's built-in one, which lets you work with powerful XPath selectors to find elements. For ease
of use, and quick starts, I find this library to be one of the best. As a closing note, always remember to
obtain permission before scraping a site; this is important. Thanks for reading!
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
// Dump the page contents without HTML tags
echo file_get_html('http://www.google.com/')->plaintext;
Scraping Slashdot!
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
Quick way
// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');
// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');
// Create a DOM object from an HTML file
$html = file_get_html('test.htm');
Object-oriented way
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');
// Load HTML from a URL
$html->load_file('http://www.google.com/');
// Load HTML from an HTML file
$html->load_file('test.htm');
Basics
// Find all anchors, returns an array of element objects
$ret = $html->find('a');
// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);
// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');
// Find all <div> elements whose id attribute is foo
$ret = $html->find('div[id=foo]');
Advanced
// Find all elements whose id is foo
$ret = $html->find('#foo');
// Find all elements whose class is foo
$ret = $html->find('.foo');
// Find all elements that have an id attribute
$ret = $html->find('*[id]');
Descendant selectors
// Find all <li> in <ul>
$es = $html->find('ul li');
Nested selectors
// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
    foreach($ul->find('li') as $li)
    {
        // do something...
    }
}
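Attribute Filters
Filter                Description
[attribute]           Matches elements that have the specified attribute.
[!attribute]          Matches elements that don't have the specified attribute.
[attribute=value]     Matches elements that have the specified attribute with a certain value.
[attribute!=value]    Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]    Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]    Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]    Matches elements that have the specified attribute and it contains a certain value.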
// Find all text blocks
$es = $html->find('text');
// Find all comment (<!--...-->) blocks
$es = $html->find('comment');
// Get an attribute
$value = $e->href;
// Set an attribute (if the attribute is a non-value attribute such as checked or selected, set its value to true or false)
$e->href = 'my link';
// Remove an attribute by setting its value to null
$e->href = null;
// Determine whether an attribute exists
if(isset($e->href))
    echo 'href exist!';
Magic attributes
Attribute Name    Usage
$e->tag           Read or write the tag name of element.
$e->outertext     Read or write the outer HTML text of element.
$e->innertext     Read or write the inner HTML text of element.
$e->plaintext     Read or write the plain text of element.
Tips
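// Extract contents from HTML
echo $html->plaintext;
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
// Remove an element by setting its outertext to an empty string
$e->outertext = '';
// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';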
Background Knowledge
// If you are not so familiar with HTML DOM, check this link to learn more...
// Example
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');
Method                                 Description
mixed $e->children ( [int $index] )    Returns the Nth child object if index is set, otherwise return an array of children.
element $e->parent ()                  Returns the parent of element.
element $e->first_child ()             Returns the first child of element, or null if not found.
element $e->last_child ()              Returns the last child of element, or null if not found.
element $e->next_sibling ()            Returns the next sibling of element, or null if not found.
element $e->prev_sibling ()            Returns the previous sibling of element, or null if not found.
Quick way
// Dumps the internal DOM tree back into string
$str = $html;
// Print it!
echo $html;
Object-oriented way
// Dumps the internal DOM tree back into string
$str = $html->save();
// Dumps the internal DOM tree into a file
$html->save('result.htm');
Callback function
function my_callback($element) {
    // Hide all <b> tags
    if ($element->tag=='b')
        $element->outertext = '';
}
// Register the callback function by its name
$html->set_callback('my_callback');
// Callback function will be invoked while dumping
echo $html;
By: Tim Smith | Posted: June 25, 2012 | Intermediate, PHP Tutorials
In a recent article I discussed PHP's implementation of the DOM and introduced various functions to pull data from and
manipulate an XML structure. I also briefly mentioned XPath, but didn't have much space to discuss it. In this article, we'll
look closer at XPath, how it functions, and how it is implemented in PHP. You'll find that XPath can greatly reduce the
amount of code you have to write to query and filter XML data, and will often yield better performance as well.
I'll use the same DTD and XML from the previous article to demonstrate the PHP DOM XPath functionality. To quickly
refresh your memory, here's what the DTD and XML look like:
<library>
  <book isbn="isbn1234">
    <title>A Book</title>
    <author>An Author</author>
    <genre>Horror</genre>
    <chapter position="first">
      <chaptitle>chapter one</chaptitle>
      <text><![CDATA[Lorem Ipsum...]]></text>
    </chapter>
  </book>
  <book isbn="isbn1235">
    <title>Another Book</title>
    <author>Another Author</author>
    <genre>Science Fiction</genre>
    <chapter position="first">
      <chaptitle>chapter one</chaptitle>
    </chapter>
  </book>
</library>
//library/book
That's it. The leading double slash tells XPath to look for library elements anywhere in the document (here library is the
root element), and the single slash indicates that book is a child of library. It's pretty straightforward, no?
But what if you want to specify a particular book? Let's say you want to return any books written by "An Author". The
XPath for that would be:
//library/book/author[text() = "An Author"]/..
You can use text() here in square brackets to perform a comparison against the value of a node, and the trailing /..
indicates we want the parent element (i.e. move back up the tree one node).
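A minimal sketch of running both kinds of path (all books, and only the books by a given author) with DOMXPath::query(). The inline $xml string is an assumption made to keep the example self-contained; it is a cut-down copy of the library data rather than the file and DTD used in the article.

```php
<?php
// Cut-down library document, inlined for illustration (assumption).
$xml = <<<XML
<library>
  <book isbn="isbn1234"><title>A Book</title><author>An Author</author></book>
  <book isbn="isbn1235"><title>Another Book</title><author>Another Author</author></book>
</library>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Every book element that is a child of a library element.
$allBooks = $xpath->query('//library/book');

// Match the author by its text, then step back up to the parent book with "/..".
$byAuthor = $xpath->query('//library/book/author[text() = "An Author"]/..');

echo $allBooks->length, " books in total\n";
foreach ($byAuthor as $book) {
    echo $book->getAttribute('isbn'), "\n";
}
```

Both calls return a DOMNodeList, so the results can be counted with length or walked with foreach.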
XPath queries can be executed using one of two functions: query() and evaluate(). Both perform the query, but the
difference lies in the type of result they return. query() will always return a DOMNodeList whereas evaluate() will
return a typed result if possible. For example, if your XPath query is to return the number of books written by a certain
author rather than the actual books themselves, then query() will return an empty DOMNodeList. evaluate() will
simply return the number so you can use it immediately instead of having to pull the data from a node.
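The difference between the two methods can be sketched as follows. The inline document and the variable names are assumptions for illustration only; the point is that evaluate() hands back a PHP scalar when the expression produces a number, string or boolean, and a DOMNodeList when it produces nodes.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book><title>A Book</title><author>An Author</author></book>'
    . '<book><title>Another Book</title><author>Another Author</author></book>'
    . '</library>'
);
$xpath = new DOMXPath($doc);

// A node-set expression: both methods return a DOMNodeList.
$nodes = $xpath->query('//library/book/title');
$sameNodes = $xpath->evaluate('//library/book/title');

// Scalar expressions: evaluate() returns a typed PHP value.
$count = $xpath->evaluate('count(//library/book)');                 // XPath number
$firstTitle = $xpath->evaluate('string(//library/book[1]/title)');  // XPath string
$hasGenre = $xpath->evaluate('boolean(//library/book/genre)');      // XPath boolean

var_dump($nodes->length, $count, $firstTitle, $hasGenre);
```

XPath numbers come back as floats, so cast the count to int if you need an integer.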
<?php
public function getNumberOfBooksByAuthor($author) {
    $total = 0;
    $elements = $this->domDocument->getElementsByTagName("author");
    // compare the text of every author element in PHP
    foreach ($elements as $element) {
        if ($element->nodeValue == $author) {
            $total++;
        }
    }
    return $total;
}
The next method achieves the same result, but uses XPath to select just those books that are written by a specific author:
<?php
public function getNumberOfBooksByAuthor($author) {
    $query = "//library/book/author[text() = '$author']/..";
    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);
    return $result->length;
}
Notice how this time we have removed the need for PHP to test against the value of the author. But we can go one step
further still and use the XPath function count() to count the occurrences of this path.
<?php
public function getNumberOfBooksByAuthor($author) {
    $query = "count(//library/book/author[text() = '$author']/..)";
    $xpath = new DOMXPath($this->domDocument);
    return $xpath->evaluate($query);
}
We're able to retrieve the information we needed with only one line of XPath, and there is no need to perform laborious
filtering with PHP. Indeed, this is a much simpler and more succinct way to write this functionality!
Notice that evaluate() was used in the last example. This is because the function count() returns a typed result.
Using query() will return a DOMNodeList, but you will find that it is an empty list.
Not only does this make your code cleaner, but it also comes with speed benefits. I found that version 1 was 30% faster on
average than version 2 but version 3 was about 10 percent faster than version 2 (about 15% faster than version 1). While
these measurements will vary depending on your server and query, using XPath in its purest form will generally yield a
considerable speed benefit as well as making your code easier to read and maintain.
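To check such measurements on your own setup, the three versions can be rewritten as standalone functions and timed side by side. This is only a rough sketch: the function names, the synthetic 1,000-book document and the single-run timing are all assumptions, and the numbers it prints will differ from machine to machine.

```php
<?php
// Version 1: walk every <author> element in PHP.
function countWithLoop(DOMDocument $doc, string $author): int {
    $total = 0;
    foreach ($doc->getElementsByTagName('author') as $element) {
        if ($element->nodeValue === $author) {
            $total++;
        }
    }
    return $total;
}

// Version 2: let XPath select the matching books, count the list in PHP.
// Note: $author is interpolated as-is, so it must not contain a single quote.
function countWithQuery(DOMDocument $doc, string $author): int {
    $xpath = new DOMXPath($doc);
    return $xpath->query("//library/book/author[text() = '$author']/..")->length;
}

// Version 3: let XPath do the counting as well.
function countWithEvaluate(DOMDocument $doc, string $author): int {
    $xpath = new DOMXPath($doc);
    return (int) $xpath->evaluate("count(//library/book/author[text() = '$author']/..)");
}

// Synthetic library: every tenth book is by "An Author" (invented test data).
$doc = new DOMDocument();
$library = $doc->appendChild($doc->createElement('library'));
for ($i = 0; $i < 1000; $i++) {
    $book = $library->appendChild($doc->createElement('book'));
    $name = ($i % 10 === 0) ? 'An Author' : 'Another Author';
    $book->appendChild($doc->createElement('author', $name));
}

foreach (['countWithLoop', 'countWithQuery', 'countWithEvaluate'] as $fn) {
    $start = microtime(true);
    $count = $fn($doc, 'An Author');
    printf("%-18s %d books, %.3f ms\n", $fn, $count, (microtime(true) - $start) * 1000);
}
```

For a fairer comparison, run each function many times and average the results.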
XPath Functions
There are quite a few functions that can be used with XPath and there are many excellent resources which detail what
functions are available. If you find that you are iterating over DOMNodeLists or comparing nodeValues, you will
probably find an XPath function that can eliminate a lot of the PHP coding.
You've already seen how count() functions. Let's use the id() function to return the titles of the books with the given
ISBNs. The XPath expression you will need to use is:
id("isbn1234 isbn1235")/title
Notice here that the values you are searching for are enclosed within quotes and delimited with a space; there is no need for a
comma to delimit the terms.
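A small sketch of the id() lookup in isolation. XPath's id() only finds attributes that the document knows to be IDs; the article relies on the DTD for that, whereas this example flags the isbn attribute by hand with DOMElement::setIdAttribute(), which is an assumption made so the code runs without an external DTD file.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title></book>'
    . '<book isbn="isbn1235"><title>Another Book</title></book>'
    . '<book isbn="isbn1236"><title>A Third Book</title></book>'
    . '</library>'
);

// Assumption: mark isbn as an ID attribute manually instead of via the DTD.
foreach ($doc->getElementsByTagName('book') as $book) {
    $book->setIdAttribute('isbn', true);
}

$xpath = new DOMXPath($doc);
$titles = [];
// Space-separated IDs inside one quoted string, no commas.
foreach ($xpath->query('id("isbn1234 isbn1235")/title') as $title) {
    $titles[] = $title->nodeValue;
}
print_r($titles);
```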
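<?php
public function findBooksByISBNs(array $isbns) {
    // id() expects the ISBNs as one space-delimited string
    $ids = implode(" ", $isbns);
    $query = "id('$ids')/title";

    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);

    $books = array();
    foreach ($result as $node) {
        $book = array("title" => $node->nodeValue);
        $books[] = $book;
    }

    return $books;
}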
Executing complex functions in XPath is relatively simple; the trick is to become familiar with the functions that are
available.
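<?php
public function getNumberOfWords($isbn) {
    $query = "//library/book[@isbn = '$isbn']";
    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);

    // take the title of the first matching book
    $title = $result->item(0)->getElementsByTagName("title")
        ->item(0)->nodeValue;

    return str_word_count($title);
}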
But we can also incorporate the function str_word_count() directly into the XPath query. There are a few steps that
need to be completed to do this. First of all, we have to register a namespace with the XPath object. PHP functions in XPath
queries are preceded by php:functionString, and the name of the function you want to use is then enclosed in
parentheses. Also, the namespace to be defined is http://php.net/xpath. The namespace must be set to this; any other
value will result in errors. We then need to call registerPHPFunctions(), which tells PHP that whenever it comes
across a function namespaced with php:, it is PHP that should handle it.
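The actual syntax for calling the function can be seen in the query string of the following method:
<?php
public function getNumberOfWords($isbn) {
    $xpath = new DOMXPath($this->domDocument);

    // register the php namespace
    $xpath->registerNamespace("php", "http://php.net/xpath");

    // allow PHP functions to be called from XPath
    $xpath->registerPHPFunctions();

    $query = "php:functionString('str_word_count',(//library/book[@isbn = '$isbn']/title))";

    return $xpath->evaluate($query);
}
functionString passes the string value of the selected node to the PHP function, so the title does not need to be converted
first. The following form, with a text() predicate on the title, is just as valid:
php:functionString('str_word_count',(//library/book[@isbn = '$isbn']/title[text()]))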
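The same steps can be tried outside a class with a short script. This is a sketch under a few assumptions: the inline document and the $isbn value are invented for illustration, and registerPHPFunctions() is given the single function name so that only str_word_count() is callable from the query.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title></book>'
    . '<book isbn="isbn1235"><title>Another Book About XPath</title></book>'
    . '</library>'
);

$xpath = new DOMXPath($doc);
// The prefix is a free choice, but the namespace URI must be exactly this one.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Restrict the callable PHP functions to the one this sketch needs.
$xpath->registerPHPFunctions('str_word_count');

$isbn = 'isbn1235';
$words = $xpath->evaluate(
    "php:functionString('str_word_count', //library/book[@isbn = '$isbn']/title)"
);
echo $words, "\n";
```

Cast the result to int before doing arithmetic with it, since the value travels back through XPath rather than being returned by str_word_count() directly.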
Registering PHP functions is not restricted to the functions that come with PHP. You can define your own functions and
provide those within the XPath. The only difference here is that when calling the function in the query, you use php:function
rather than php:functionString. Also, it is only possible to provide either functions on their own or static methods. Calling
instance methods is not supported.
Let's use a regular function that is outside the scope of the class to demonstrate the basic functionality. The function we will
use will return only books by George Orwell. It must return true for every node you wish to include in the query.
<?php
function compare($node) {
    // $node is an array of DOMElements; test the first one
    return $node[0]->nodeValue == "George Orwell";
}
The argument passed to the function is an array of DOMElements. It is up to the function to iterate through the array and
determine whether the node being tested should be returned in the DOMNodeList. In this example, the node being tested
is /book and we are using /author to make the determination.
Now we can create the method getGeorgeOrwellBooks() :
<?php
public function getGeorgeOrwellBooks() {
    $xpath = new DOMXPath($this->domDocument);
    $xpath->registerNamespace("php", "http://php.net/xpath");
    $xpath->registerPHPFunctions();

    // keep only the books for which compare() returns true
    $query = "//library/book[php:function('compare', author)]";
    $result = $xpath->query($query);

    $books = array();
    foreach($result as $node) {
        $books[] = $node->getElementsByTagName("title")
            ->item(0)->nodeValue;
    }

    return $books;
}
If compare() were a static method of the Library class, the query would instead read:
//library/book[php:function('Library::compare', author)]
In truth, all of this functionality can be easily coded up with just XPath, but the example shows how you can extend XPath
queries to become more complex.
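A self-contained sketch of a custom callback used as an XPath predicate, followed by the plain XPath equivalent mentioned above. The function name is_orwell() and the sample titles are hypothetical; the article's own callback is called compare().

```php
<?php
// Callback for php:function: receives the matched nodes as an array of DOMElement.
function is_orwell(array $nodes): bool {
    return isset($nodes[0]) && $nodes[0]->nodeValue === 'George Orwell';
}

$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book><title>Nineteen Eighty-Four</title><author>George Orwell</author></book>'
    . '<book><title>A Book</title><author>An Author</author></book>'
    . '<book><title>Animal Farm</title><author>George Orwell</author></book>'
    . '</library>'
);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('is_orwell');

$titles = [];
foreach ($xpath->query("//library/book[php:function('is_orwell', author)]") as $book) {
    $titles[] = $book->getElementsByTagName('title')->item(0)->nodeValue;
}

// The same filter written in plain XPath, with no PHP callback involved.
$plain = $xpath->query("//library/book[author = 'George Orwell']");

print_r($titles);
```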
Calling an object method is not possible within XPath. If you find you need to access some object properties or methods to
complete the XPath query, the best solution would be to do what you can with XPath and then work on the
resulting DOMNodeList with any object methods or properties as necessary.
Summary
XPath is a great way of cutting down the amount of code you have to write and to speed up the execution of the code when
working with XML data. Although not part of the official DOM specification, the additional functionality that the PHP
DOM provides allows you to extend the normal XPath functions with custom functionality. This is a very powerful feature
and as your familiarity with XPath functions increases you may find that you come to rely on this less and less.
The content of the XML is built in the $xmlDoc variable (a DOM object) by defining and
adding each element one by one.
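Building a document element by element might look like the sketch below. The element and attribute names (library, book, isbn, title) are assumptions chosen for illustration and are not taken from the lesson's own listing.

```php
<?php
$xmlDoc = new DOMDocument('1.0', 'utf-8');
$xmlDoc->formatOutput = true;

// Create the root element and add it to the document.
$root = $xmlDoc->createElement('library');
$xmlDoc->appendChild($root);

// Create a child element, set an attribute on it, and add it to the root.
$book = $xmlDoc->createElement('book');
$book->setAttribute('isbn', 'isbn1234');
$root->appendChild($book);

// Create an element with text content and add it to the book.
$title = $xmlDoc->createElement('title', 'A Book');
$book->appendChild($title);

echo $xmlDoc->saveXML();
```

To write the result to disk instead of printing it, call $xmlDoc->save() with a file name.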
// create a new XML object and load the content of the XML file
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml_file);
$root = $xmlDoc->documentElement;           // the document element (the root)
$elms = $root->getElementsByTagName("*");   // all elements of root

// create a new XML object and load the content of the XML file
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml_file);
$root = $xmlDoc->documentElement;           // the document element (the root)
$elms = $root->getElementsByTagName("*");   // all elements in root
$nr_elms = $elms->length;                   // the number of elements
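As a runnable illustration of these calls, the sketch below loads a small document from a string (an assumption; the lesson loads $xml_file from disk) and lists the name of every element under the root.

```php
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML('<library><book><title>A Book</title><author>An Author</author></book></library>');

$root = $xmlDoc->documentElement;            // the root element, <library>
$elms = $root->getElementsByTagName('*');    // every element below the root
$nr_elms = $elms->length;                    // how many there are

// Collect the tag name of each element, in document order.
$names = [];
for ($i = 0; $i < $nr_elms; $i++) {
    $names[] = $elms->item($i)->nodeName;
}

echo $root->nodeName, ' contains ', $nr_elms, ' elements: ', implode(', ', $names), "\n";
```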
The DOM has many properties and methods for working with XML; some of them are used in the examples of this lesson.
For the complete list of PHP DOM functions, see the Document Object Model
What Is SmartDOMDocument?
What Is DOMDocument?
DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write
HTML and XML.
Instead of using hacky regexes that are prone to breaking as soon as something you haven't
thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object
Model), just like your browser, and creates an easily manipulatable object in memory.
DOMDocument can actually validate and normalize your HTML/XML.
DOMDocument supports namespaces.
saveHTMLExact()
DOMDocument has an extremely badly designed "feature" where if the HTML code you are
loading does not contain <html> and <body> tags, it adds them automatically (yup, there are
no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body>
and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a
similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly
what you would want: it saves HTML without adding the extra garbage that
DOMDocument does.
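The wrapping behavior is easy to reproduce with plain DOMDocument. The sketch below also shows one possible workaround, serializing only the children of <body>; this is an illustration of the idea and is not claimed to be how saveHTMLExact() is implemented. The sample fragment is an assumption.

```php
<?php
$fragment = "<div class='class1'><p>Some Text</p></div>";

$doc = new DOMDocument();
$doc->loadHTML($fragment);

// Saving the whole document includes the DOCTYPE, <html> and <body> wrappers.
$full = $doc->saveHTML();

// Workaround sketch: serialize only the nodes inside <body>.
$body = $doc->getElementsByTagName('body')->item(0);
$exact = '';
foreach ($body->childNodes as $child) {
    $exact .= $doc->saveHTML($child);
}

echo $full, "\n---\n", $exact, "\n";
```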
Encoding Fix
DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles
the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal
with encoding correctly. This behavior is transparent to you: just use loadHTML() as you
would normally.
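With plain DOMDocument, a commonly used workaround is to tell the HTML parser which encoding the input uses by prepending a charset declaration. The sketch below shows that general technique under the assumption that the input is UTF-8; it is not a description of what SmartDOMDocument does internally.

```php
<?php
// UTF-8 source text ("café"), written with escapes so the file encoding does not matter.
$html = "<p>caf\xC3\xA9</p>";

$doc = new DOMDocument();
// The prepended meta tag declares the input encoding to the HTML parser.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

// DOM strings are returned as UTF-8.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```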
Example
This example loads sample HTML using SmartDOMDocument, uses
getElementsByTagName() to find the first <img> tag and removeChild() to remove it, then
prints the HTML before and after the removal along with the HTML of the removed image.
$content = <<<CONTENT
<div class='class1'>
  <img src='http://www.google.com/favicon.ico' />
  Some Text
  <p></p>
</div>
CONTENT;

print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";

$content_doc = new SmartDOMDocument();
$content_doc->loadHTML($content);
$image = ''; // stays empty if the content has no <img> tag

try {
  $first_image = $content_doc->getElementsByTagName("img")->item(0);
  if ($first_image) {
    // detach the image from the content document
    $first_image->parentNode->removeChild($first_image);
    $content = $content_doc->saveHTMLExact();

    // import the removed node into its own document to get its HTML
    $image_doc = new SmartDOMDocument();
    $image_doc->appendChild($image_doc->importNode($first_image, true));
    $image = $image_doc->saveHTMLExact();
  }
} catch(Exception $e) { }

print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
print "The image is: " . htmlspecialchars($image);
This is no longer a requirement; any version of PHP 5 that has DOMDocument should work
now.
DOMDocument: this should be a built-in class, but I've seen instances of it missing for some
reason. My guess is that 99.9% of the time you will already have it.
I highly recommend using SVN (Subversion) because you can easily update to the latest
version by running svn up.
Use as "svn:externals"
If you have an existing project in SVN and you would like to use SmartDOMDocument, you
can set up this library as svn:externals.
svn:externals is kind of like a symlink to another repository from your existing SVN project.
That way, you can still benefit from using SVN commands such as svn up without having to
maintain a local copy of the external code.
You can read more about setting svn:externals here.
Here's how you would do this:
cd YOUR_PROJ_DIR;
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up