Oxford University CH9 - Using Development Tools To Examine Webpages
Skills you will learn: For this tutorial, we will use the developer tools in Firefox. However, these
are quite similar to the developer tools found in just about every modern web browser, including
Google’s Chrome and Microsoft’s Edge.
We aren’t covering everything Firefox’s developer tools can do, but instead are focusing on aspects
that are of value for web scraping and simple development.
Note that the prior version of this tutorial focused on the Firebug plugin, which has since been inte-
grated into Firefox’s toolset.
Summary: Hidden behind the public face of modern web browsers is a suite of tools that gives you
extraordinary power to inspect the source code of web pages, and observe network requests and re-
sponses. These tools are useful both in web scraping, discussed in chapter 9, and web development,
discussed in chapter 10.
Introduction:
All web browsers allow the user to look at the source HTML for the page, and this is perfectly use-
ful for looking at the overall HTML structure of a page, for seeing tags in their full context, for
searching for elements, and so on. But sometimes you want to drill down and look more closely at
specific elements on the page, or you want to be able to see the network traffic being passed to and
from the remote web server. For these tasks, development tools can be the perfect solution.
Using the tools
The developer tools are an integrated part of Firefox.
To start using them, go to Tools>Web Developer>Toggle Tools. We’ll use the Mac version in this
tutorial, but the functionality is essentially identical on a Windows PC.
Examining Elements in a webpage
We’re going to have a look at the source code for the Public Works and Government Services Can-
ada (now renamed Public Services and Procurement Canada) proactive disclosure page that we first
saw in chapter 9 of The Data Journalist. While the government is moving the proactive disclosure in-
formation to its open data portal, the page you saw in Chapter 9 is archived at this address:
https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q1.txt
If you simply open the page source (Tools>Web Developer>Page Source), you will see all of the
HTML for the page.
Looking at the HTML code this way is a useful way to put it all in context, especially when you al-
ready know what elements you want to look at or when you are trying to figure out how you will
instruct a Python library such as Beautiful Soup to pinpoint the elements you want (Chapter 9 intro-
duces Beautiful Soup).
But when you are first exploring a page to see how it works, the element inspector can make the job
a lot easier.
If you open Developer Tools while the page is open, you will see a panel like the one below appear
at the bottom of the browser screen.
If you look at the top left corner of the panel, you can see the element inspector icon.
Clicking on the icon makes the inspector active. With it we can drill down into the HTML for any
part of the page that we want to examine. Simply click on a part of the webpage in the upper win-
dow containing the web page as rendered by Firefox, and the inspector will show you the underlying
HTML and CSS code. We’ll use the inspector to click on the HTML table value IGF Vigilance Inc.
We then see the HTML for that part of the page, in the lower panel.
The small expansion arrows to the left of individual HTML elements indicate that more detail is
available to be viewed. If we click on that, we see more code, and yet another expansion arrow indi-
cating we can drill down even further.
So we’ll click on that to reveal the last layer of detail.
We can now see all of the HTML code associated with the IGF Vigilance entry. First off, it is con-
tained in a <td> or table cell tag, which itself is enclosed in a <tr> or table row tag, which in turn is
enclosed in a <tbody> or table body tag, and so on, in the hierarchy of tags that makes up the
webpage.
Within the <td> tag is an <a> or hyperlink tag. The title attribute that follows causes its contents to
appear as a tooltip when a user hovers the mouse over the link. This is then followed by the href
attribute, which indicates the destination of the link, in this case another Public Works web page that
provides detail on this particular contract. This is followed by a <span> element that has a CSS class
that hides it on the page, and finally the actual text seen by the browser user. This is followed by the
closing </a> tag for the hyperlink.
If we were scraping data from the page, we would probably be most interested in the text for the
name of the contract vendor, IGF Vigilance Inc. By using the element inspector, we have identified
that we would be looking for the visible display text within the hyperlink within the <td> tag. We
can use a parser such as Beautiful Soup, or a regular expression, to isolate the desired text element,
and grab it for further processing (Chapter 9 walks through how to scrape this page using a copy of
the data saved so it will remain available).
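For example, here is a minimal sketch of that approach in Python with Beautiful Soup, assuming the page has been saved locally as disclosure.html (the file name is an assumption) and that the vendor names sit in <a> tags inside <td> cells, as the inspector showed:

from bs4 import BeautifulSoup

# Parse a locally saved copy of the page. The file name is an assumption.
with open("disclosure.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Look inside each table cell for a hyperlink, drop any hidden <span>,
# and keep only the text a browser user would actually see.
for cell in soup.find_all("td"):
    link = cell.find("a")
    if link is None:
        continue
    for span in link.find_all("span"):
        span.decompose()
    vendor = link.get_text(strip=True)
    if vendor:
        print(vendor)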
The network monitor
Another part of the developer tools is the network monitor, which lists every request the browser
makes as a page loads, along with the response to each. This is useful for examining files that are
included as part of a response, such as JavaScript files, CSS style files, images, and other files.
The Status column reports on the status of the HTTP request. The code 200 means the request was
successful. 304 indicates the file has not changed since the browser last fetched it, so the cached
copy was used. 404 means the resource could not be found on the server. A scraping script can check
these codes too, as sketched below. You can read about all of the different possible status codes here:
https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
The Domain column indicates which Internet domain responded to the request.
The Size column indicates the size of the request and the returned response, in bytes.
The time scale on the far right tells you how long the request took in milliseconds.
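Those same status codes matter when you write a scraping script. Here is a hedged, minimal sketch in Python using the third-party requests library to fetch the archived disclosure page and check the code before doing anything with the response:

import requests

url = ("https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/"
       "cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q1.txt")
response = requests.get(url)

# 200 means success; anything else means the page should not be parsed blindly.
if response.status_code == 200:
    print("Fetched", len(response.text), "characters of HTML")
elif response.status_code == 404:
    print("The page could not be found on the server")
else:
    print("Unexpected status code:", response.status_code)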
Clicking on one of the request rows will open a request details panel to the right (it can also be
opened with the icon). The panel has a number of tabs of its own. There’s a lot of technical stuff here,
but we’ll make note of a few things.
Under the Headers tab, the Remote Address entry indicates the IP address of the server that sent
the response. This can also be copied and pasted into the address bar of a web browser. We can also
see the HTTP request and response headers. HTTP headers contain information used by the client
and server computers. They are hidden from users in the normal day-to-day use of a browser.
Let’s look at the request headers first.
The Host is the domain of the server. The referrer is the page from which the request came (we
clicked on a link). Most important for our purposes is the User Agent. This is a character string that
represents the type of browser, including the version, that made the request. Some web servers will
ignore requests that don’t come from actual browsers. It is possible to set the header string sent by a
script so it will appear to be a real browser. You can copy the user agent string from this panel, and
paste it into a script. Of course, doing so raises potential ethical issues, and these are discussed in
Chapter 9.
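As a hedged illustration, here is how a Python script using the requests library might send a User-Agent header copied from this panel. The string and URL below are only placeholders; you would paste in whatever your own browser reports and the address you actually want to fetch:

import requests

# Placeholder user agent string; replace it with the one copied from the
# Headers tab in the developer tools.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0"
}
response = requests.get("https://www.example.com/", headers=headers)
print(response.status_code)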
The response headers tell us the type of content in the response, in this case an HTML page, the
character encoding, in this case ISO-8859-1, the time and date of the response given in Coordinated
Universal Time (displayed as GMT), and other technical details.
The Params tab contains the parameters that were passed to the external web server as part of the
URL for the request. This can be extremely useful for understanding what you might need to send
to the server in a scraping script.
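For instance, here is a minimal sketch of how parameters seen in the Params tab translate into a script; the URL and the parameter names ("lang" and "report") are purely hypothetical placeholders:

import requests

# Hypothetical parameters, standing in for whatever the Params tab shows.
params = {"lang": "eng", "report": "2015-16"}
response = requests.get("https://www.example.com/search", params=params)

# requests appends the parameters to the URL for you.
print(response.url)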
The Response tab contains the actual content of the response, in this case the HTML page that was
sent back.
This may seem superfluous when the response is a plain HTML page, but sometimes you will want
to see the response made to an XHR (XMLHttpRequest) request when AJAX is being used, the data
sent back in response to a search request, or the contents of a JavaScript or CSS file.
The Timings tab provides information on how long the request and response took.
When a page isn’t completely reloaded
As discussed in Chapter 9, not all web sites request a whole new page each time a user interacts with
the page. Instead, data may be requested and passed back to the browser without having to reload
the page. This can be done in different ways. For example, a request may be made for an external
html resource that is then placed in the main page by way of an iframe. Or a site may use a technol-
ogy called AJAX. When these kinds of methods are being used, opening and looking at the HTML
source for a page will only show you the original HTML that was sent by the server. Any changes to
the DOM (document object model), which could be thought of as the current state of the web page
the browser has stored in memory, will not be reflected. The network panel, however, can allow you
to peek under the hood, and see everything that is happening.
Let’s say you wanted to have a look at flight arrivals and departures at John F. Kennedy Airport in
New York. You could go to the website maintained by the Port Authority of New York and New
Jersey at https://www.panynj.gov/airports/flight-status.html
You may notice that the actual flight status information, which comes from an outside provider, is
slightly delayed in loading compared to the rest of the page. This is because you are being provided
with the latest information, and it is being added after the main page loads. Here is what the fully
loaded page, with the embedded flight information, looks like:
If we use the element inspector to examine the area of the page that contains the flight information,
we can see that contained within the <div> tag with the id of “viewport” there is an iframe that has
a different URL as its source. It’s a relative URL, so we can’t see the domain, but we can see it by
using the network monitor.
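A script can recover the same information. The sketch below, which assumes the page structure is still as the inspector showed it (a <div> with the id "viewport" containing the iframe), finds the iframe’s src attribute and resolves the relative URL against the page’s own address:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://www.panynj.gov/airports/flight-status.html"
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

# Find the iframe inside the <div id="viewport"> and resolve its relative src.
viewport = soup.find("div", id="viewport")
iframe = viewport.find("iframe") if viewport else None
if iframe is not None and iframe.get("src"):
    print(urljoin(page_url, iframe["src"]))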
We’ll switch to the network tab in developer tools, then choose the HTML filter (if you go straight
to the network tab after opening developer tools, you may need to reload the page in the browser
before the network panel populates). We’re using the HTML filter because an iframe is used to em-
bed an HTML document.
We can see that four files were transferred, including one that is almost a megabyte in size. We’ll
guess that might be the file we want, because there’s quite a bit of arrivals data.
If we now open the request details panel by clicking on the row for the request, we can see that the
request was made to a URL that does not appear in the address bar of our web browser.
This is the full URL:
https://tracker.flightview.com/FVAccess2/tools/fids/fidsDefault.asp?accCustId=PANYNJ&fidsId=20001&fidsInit=arrivals&fidsApt=JFK
Now, if we click on the Response tab, we can see the HTML file that was transferred.
If we like, we can copy and paste the HTML into a separate file to retain it for future reference.
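Alternatively, a short script can request the iframe’s document directly and keep a copy. This is a hedged sketch using the requests library; the output file name is an assumption:

import requests

# The full URL recovered from the network monitor, requested directly.
url = ("https://tracker.flightview.com/FVAccess2/tools/fids/fidsDefault.asp"
       "?accCustId=PANYNJ&fidsId=20001&fidsInit=arrivals&fidsApt=JFK")
response = requests.get(url)

# Save the returned HTML for later reference or parsing.
with open("jfk_arrivals.html", "w", encoding="utf-8") as f:
    f.write(response.text)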
Looking at AJAX
As we discussed above, another way to insert information into an existing web page is to use a
technique called AJAX.
We’ll take a look at a site used by the Sudbury District Health Unit in the northern part of the Cana-
dian province of Ontario to provide information on restaurant inspections to the public. The page
now contains archived information, and will be replaced by a new site, according to the municipality.
The site showed the restaurants in the city, and whether each was in compliance. You could also access
a more detailed inspection history for each premises.
If you clicked on the Inspection History link for a premises, the page was populated with individual
inspections for that facility, complete with more links to delve down to more precise detail.
An examination of the page source, however, shows that the detail listed is not present. The DOM
of the page is being manipulated using JavaScript that runs based on user interactions, such as click-
ing on an Inspection History link. New data is being added to the page dynamically.
We can have a closer look at what is going on using the Network monitor in Firefox’s developer
tools.
When we first load the page, we can see that five requests were made to the server, totalling 529 kb
of data sent and received.
The site used to serve up a great many more resources, but has been reduced in scope as it only
archives data up to January 2017.
As you can see, two of the files transferred have XHR under Type. An XHR request is another way of
saying an AJAX request.
If we switch to the XHR tab, we can see just these two, and we can also see that one of the files is
an HTML file and the other a JSON file. JSON stands for JavaScript Object Notation, which as discussed in
Chapter 2 is a plain-text data format often used when the data is intended to be machine read.
If we want to see the whole response, we can <ctrl> click (right click on a Windows computer) on
the row for the JSON file in the network panel, and choose Copy>Response.
We can then paste the resulting JSON into a separate file to examine further. You can then use a
Python script or an online converter to convert the JSON to a CSV file for import into a spread-
sheet or database program. In this way, you are using the web development tool as a rudimentary
web scraping tool. For a few pages, it could be all you need.
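As a minimal sketch of that last step, the following Python script converts a copied JSON response into a CSV file using only the standard library. The file names are assumptions, and it assumes the JSON is a list of objects that all share the same keys:

import csv
import json

# Load the JSON copied out of the network panel. The file name is an assumption.
with open("inspections.json", encoding="utf-8") as f:
    records = json.load(f)

# Write one CSV row per JSON object, using the first object's keys as headers.
with open("inspections.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)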
If we like, we can also look at the response in the request details window. This allows us to examine
each individual object in the JSON file.
An object in this context is a data construct that holds all of the information for one row of data,
that is, one facility that was inspected.
If we click on the expansion triangle for object 0, we can see the data contained in that object, neatly
organized.