Oxford University CH9 - Using Development Tools To Examine Webpages
Skills you will learn: For this tutorial, we will use the developer tools in Firefox. However, these
are quite similar to the developer tools found in just about every modern web browser, including
Google’s Chrome and Microsoft’s Edge.
We aren’t covering everything Firefox’s developer tools can do, but instead are focusing on aspects
that are of value for web scraping and simple development.
Note that the prior version of this tutorial focused on the Firebug plugin, which has since been inte-
grated into Firefox’s toolset.
Summary: Hidden behind the public face of modern web browsers is a suite of tools that gives you
extraordinary power to inspect the source code of web pages, and observe network requests and re-
sponses. These tools are useful both in web scraping, discussed in chapter 9, and web development,
discussed in chapter 10.
Introduction:
All web browsers allow the user to look at the source HTML for the page, and this is perfectly use-
ful for looking at the overall HTML structure of a page, for seeing tags in their full context, for
searching for elements, and so on. But sometimes you want to drill down and look more closely at
specific elements on the page, or you want to be able to see the network traffic being passed to and
from the remote web server. For these tasks, development tools can be the perfect solution.
Using the tools
The developer tools are an integrated part of Firefox.
To start using them, go to Tools>Web Developer>Toggle Tools. We’ll use the Mac version in this
tutorial, but the functionality is essentially identical on a Windows PC.
Examining Elements in a webpage
We’re going to have a look at the source code for the Public Works and Government Services Can-
ada (now renamed Public Services and Procurement Canada) proactive disclosure page that we first
saw in chapter 9 of The Data Journalist. While the government is moving the proactive disclosure in-
formation to its open data portal, the page you saw in Chapter 9 is archived at this address:
https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q1.txt
If you simply open the page source (Tools>Web Developer>Page Source), you will see all of the
HTML for the page.
Looking at the HTML code this way is a useful way to put it all in context, especially when you al-
ready know what elements you want to look at or when you are trying to figure out how you will
instruct a Python library such as Beautiful Soup to pinpoint the elements you want (Chapter 9 intro-
duces Beautiful Soup).
But when you are first exploring a page to see how it works, the element inspector can make the job
a lot easier.
If you open Developer Tools while the page is open, you will see a panel like the one below appear
at the bottom of the browser screen.
If you look at the top left corner of the panel, you can see the element inspector icon.
Clicking on the icon makes the inspector active. With it we can drill down into the HTML for any
part of the page that we want to examine. Simply click on a part of the webpage in the upper win-
dow containing the web page as rendered by Firefox, and the inspector will show you the underlying
HTML and CSS code. We’ll use the inspector to click on the HTML table value IGF Vigilance Inc.
We then see the HTML for that part of the page, in the lower panel.
The small expansion arrows to the left of individual HTML elements indicate that more detail is
available to be viewed. If we click on that, we see more code, and yet another expansion arrow indi-
cating we can drill down even further.
So we’ll click on that to reveal the last layer of detail.
We can now see all of the HTML code associated with the IGF Vigilance entry. First off, it is con-
tained in a <td> or table cell tag, which itself is enclosed in a <tr> or table row tag, which in turn is
enclosed in a <tbody> or table body tag, and so on, in the hierarchy of tags that makes up the
webpage.
Within the <td> tag is an <a> or hyperlink tag. The title attribute that follows causes its contents to
appear as a tooltip when a user hovers the mouse over the link. This is then followed by the href
attribute, which indicates the destination of the link, in this case another Public Works web page that
provides detail on this particular contract. This is followed by a <span> element that has a CSS class
that hides it on the page, and finally the actual text seen by the browser user. This is followed by the
closing </a> tag for the hyperlink.
If we were scraping data from the page, we would probably be most interested in the text for the
name of the contract vendor, IGF Vigilance Inc. By using the element inspector, we have identified
that we would be looking for the visible display text within the hyperlink within the <td> tag. We
can use a parser such as Beautiful Soup, or a regular expression, to isolate the desired text element,
and grab it for further processing (Chapter 9 walks through how to scrape this page using a copy of
the data saved so it will remain available).
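For example, here is a minimal sketch of that approach in Python with Beautiful Soup, assuming the page has been saved locally as disclosure.html (the file name is an assumption) and that the vendor names sit in <a> tags inside <td> cells, as the inspector showed:

from bs4 import BeautifulSoup

# Parse a locally saved copy of the page. The file name is an assumption.
with open("disclosure.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Look inside each table cell for a hyperlink, drop any hidden <span>,
# and keep only the text a browser user would actually see.
for cell in soup.find_all("td"):
    link = cell.find("a")
    if link is None:
        continue
    for span in link.find_all("span"):
        span.decompose()
    vendor = link.get_text(strip=True)
    if vendor:
        print(vendor)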
The network monitor
Another part of the developer tools is the network monitor, which lists every request the browser
makes as a page loads, along with the response to each. This is useful for examining files that are
included as part of a response, such as JavaScript files, CSS style files, images, and other files.
The Status column reports on the status of the HTTP request. The code 200 means the request was
successful. 304 indicates the file has not changed since the browser last fetched it, so the cached
copy was used. 404 means the resource could not be found on the server. A scraping script can check
these codes too, as sketched below. You can read about all of the different possible status codes here:
https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
The Domain column indicates which Internet domain responded to the request.
The Size column indicates the size of the request and the returned response, in bytes.
The time scale on the far right tells you how long the request took in milliseconds.
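Those same status codes matter when you write a scraping script. Here is a hedged, minimal sketch in Python using the third-party requests library to fetch the archived disclosure page and check the code before doing anything with the response:

import requests

url = ("https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/"
       "cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q1.txt")
response = requests.get(url)

# 200 means success; anything else means the page should not be parsed blindly.
if response.status_code == 200:
    print("Fetched", len(response.text), "characters of HTML")
elif response.status_code == 404:
    print("The page could not be found on the server")
else:
    print("Unexpected status code:", response.status_code)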
Clicking on one of the request rows will open a request details panel to the right (it can also be
opened with the icon). The panel has a number of tabs of its own. There’s a lot of technical stuff here,
but we’ll make note of a few things.
Under the Headers tab, the Remote Address entry indicates the IP address of the server that sent
the response. This can also be copied and pasted into the address bar of a web browser. We can also
see the HTTP request and response headers. HTTP headers contain information used by the client
and server computers. They are hidden from users in the normal day-to-day use of a browser.
Let’s look at the request headers first.
The Host is the domain of the server. The referrer is the page from which the request came (we
clicked on a link). Most important for our purposes is the User Agent. This is a character string that
represents the type of browser, including the version, that made the request. Some web servers will
ignore requests that don’t come from actual browsers. It is possible to set the header string sent by a
script so it will appear to be a real browser. You can copy the user agent string from this panel, and
paste it into a script. Of course, doing so raises potential ethical issues, and these are discussed in
Chapter 9.
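As a hedged illustration, here is how a Python script using the requests library might send a User-Agent header copied from this panel. The string and URL below are only placeholders; you would paste in whatever your own browser reports and the address you actually want to fetch:

import requests

# Placeholder user agent string; replace it with the one copied from the
# Headers tab in the developer tools.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0"
}
response = requests.get("https://www.example.com/", headers=headers)
print(response.status_code)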
The response headers tell us the type of content in the response, in this case an HTML page, the
character encoding, in this case ISO-8859-1, the time and date of the response given in Coordinated
Universal Time (displayed as GMT), and other technical details.
The Params tab contains the parameters that were passed to the external web server as part of the
URL for the request. This can be extremely useful for understanding what you might need to send
to the server in a scraping script.
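For instance, here is a minimal sketch of how parameters seen in the Params tab translate into a script; the URL and the parameter names ("lang" and "report") are purely hypothetical placeholders:

import requests

# Hypothetical parameters, standing in for whatever the Params tab shows.
params = {"lang": "eng", "report": "2015-16"}
response = requests.get("https://www.example.com/search", params=params)

# requests appends the parameters to the URL for you.
print(response.url)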
The Response tab contains the actual content of the response, in this case the HTML page that was
sent back.
This may seem superfluous when the response is a plain HTML page, but sometimes you will want
to see the response made to an XHR (XMLHttpRequest) request when AJAX is being used, the data
sent back in response to a search request, or the contents of a JavaScript or CSS file.
The Timings tab provides information on how long the request and response took.
When a page isn’t completely reloaded
As discussed in Chapter 9, not all web sites request a whole new page each time a user interacts with
the page. Instead, data may be requested and passed back to the browser without having to reload
the page. This can be done in different ways. For example, a request may be made for an external
html resource that is then placed in the main page by way of an iframe. Or a site may use a technol-
ogy called AJAX. When these kinds of methods are being used, opening and looking at the HTML
source for a page will only show you the original HTML that was sent by the server. Any changes to
the DOM (document object model), which could be thought of as the current state of the web page
the browser has stored in memory, will not be reflected. The network panel, however, can allow you
to peek under the hood, and see everything that is happening.
Let’s say you wanted to have a look at flight arrivals and departures at John F. Kennedy Airport in
New York. You could go to the website maintained by the Port Authority of New York and New
Jersey at https://www.panynj.gov/airports/flight-status.html
You may notice that the actual flight status information, which comes from an outside provider, is
slightly delayed in loading compared to the rest of the page. This is because you are being provided
with the latest information, and it is being added after the main page loads. Here is what the fully
loaded page, with the embedded flight information, looks like:
If we use the element inspector to examine the area of the page that contains the flight information,
we can see that contained within the <div> tag with the id of “viewport” there is an iframe that has
a different URL as its source. It’s a relative URL, so we can’t see the domain, but we can see it by
using the network monitor.
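A script can recover the same information. The sketch below, which assumes the page structure is still as the inspector showed it (a <div> with the id "viewport" containing the iframe), finds the iframe’s src attribute and resolves the relative URL against the page’s own address:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://www.panynj.gov/airports/flight-status.html"
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

# Find the iframe inside the <div id="viewport"> and resolve its relative src.
viewport = soup.find("div", id="viewport")
iframe = viewport.find("iframe") if viewport else None
if iframe is not None and iframe.get("src"):
    print(urljoin(page_url, iframe["src"]))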
We’ll switch to the network tab in developer tools, then choose the HTML filter (if you go straight
to the network tab after opening developer tools, you may need to reload the page in the browser
before the network panel populates). We’re using the HTML filter because an iframe is used to em-
bed an HTML document.
We can see that four files were transferred, including one that is almost a megabyte in size. We’ll
guess that might be the file we want, because there’s quite a bit of arrivals data.
If we now open the request details panel by clicking on the row for the request, we can see that the
request was made to a URL that does not appear in the address bar of our web browser.
This is the full URL:
https://tracker.flightview.com/FVAccess2/tools/fids/fidsDefault.asp?accCustId=PANYNJ&fidsId=20001&fidsInit=arrivals&fidsApt=JFK
Now, if we click on the Response tab, we can see the HTML file that was transferred.
If we like, we can copy and paste the HTML into a separate file to retain it for future reference.
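Alternatively, a short script can request the iframe’s document directly and keep a copy. This is a hedged sketch using the requests library; the output file name is an assumption:

import requests

# The full URL recovered from the network monitor, requested directly.
url = ("https://tracker.flightview.com/FVAccess2/tools/fids/fidsDefault.asp"
       "?accCustId=PANYNJ&fidsId=20001&fidsInit=arrivals&fidsApt=JFK")
response = requests.get(url)

# Save the returned HTML for later reference or parsing.
with open("jfk_arrivals.html", "w", encoding="utf-8") as f:
    f.write(response.text)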
Looking at AJAX
As we discussed above, another way to insert information into an existing web page is to use a
technique called AJAX.
We’ll take a look at a site used by the Sudbury District Health Unit in the northern part of the Cana-
dian province of Ontario to provide information on restaurant inspections to the public. The page
now contains archived information, and will be replaced by a new site, according to the municipality.
The site showed the restaurants in the city, and whether each was in compliance. You could also access
a more detailed inspection history for each premises.
If you clicked on the Inspection History link for a premises, the page was populated with individual
inspections for that facility, complete with more links to delve down to more precise detail.
An examination of the page source, however, shows that the detail listed is not present. The DOM
of the page is being manipulated using JavaScript that runs based on user interactions, such as click-
ing on an Inspection History link. New data is being added to the page dynamically.
We can have a closer look at what is going on using the Network monitor in Firefox’s developer
tools.
When we first load the page, we can see that five requests were made to the server, totalling 529 kb
of data sent and received.
The site used to serve up a great many more resources, but has been reduced in scope as it only
archives data up to January 2017.
As you can see, two of the files transferred have XHR under Type. An XHR request is another way of
saying an AJAX request.
If we switch to the XHR tab, we can see just these two, and we can also see that one of the files is
an HTML file and the other a JSON file. JSON stands for JavaScript Object Notation, which as discussed in
Chapter 2 is a plain-text data format often used when the data is intended to be machine read.
If we want to see the whole response, we can <ctrl> click (right click on a Windows computer) on
the row for the JSON file in the network panel, and choose Copy>Response.
We can then paste the resulting JSON into a separate file to examine further. You can then use a
Python script or an online converter to convert the JSON to a CSV file for import into a spread-
sheet or database program. In this way, you are using the web development tool as a rudimentary
web scraping tool. For a few pages, it could be all you need.
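As a minimal sketch of that last step, the following Python script converts a copied JSON response into a CSV file using only the standard library. The file names are assumptions, and it assumes the JSON is a list of objects that all share the same keys:

import csv
import json

# Load the JSON copied out of the network panel. The file name is an assumption.
with open("inspections.json", encoding="utf-8") as f:
    records = json.load(f)

# Write one CSV row per JSON object, using the first object's keys as headers.
with open("inspections.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)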
If we like, we can also look at the response in the request details window. This allows us to examine
each individual object in the JSON file.
An object in this context is a data construct that holds all of the information for one row of data,
that is, one facility that was inspected.
If we click on the expansion triangle for object 0, we can see the data contained in that object, neatly
organized.