21AD71 Module 5 Textbook


MODULE 5

Chapter 12

Networked programs

While many of the examples in this book have focused on reading files and looking
for data in those files, there are many different sources of information when one
considers the Internet.
In this chapter we will pretend to be a web browser and retrieve web pages using
the Hypertext Transfer Protocol (HTTP). Then we will read through the web page
data and parse it.

12.1 Hypertext Transfer Protocol - HTTP


The network protocol that powers the web is actually quite simple and there is
built-in support in Python called socket which makes it very easy to make network
connections and retrieve data over those sockets in a Python program.
A socket is much like a file, except that a single socket provides a two-way
connection between two programs. You can both read from and write to the same socket.
If you write something to a socket, it is sent to the application at the other end
of the socket. If you read from the socket, you are given the data which the other
application has sent.
But if you try to read from a socket when the program on the other end of the socket
has not sent any data, you just sit and wait. If the programs on both ends of
the socket simply wait for some data without sending anything, they will wait for
a very long time, so an important part of programs that communicate over the
Internet is to have some sort of protocol.
A protocol is a set of precise rules that determine who is to go first, what they are
to do, and then what the responses are to that message, and who sends next, and
so on. In a sense the two applications at either end of the socket are doing a dance
and making sure not to step on each other's toes. (If you want to learn more about
sockets, protocols, or how web servers are developed, you can explore the course at
https://www.dj4e.com.)
There are many documents that describe these network protocols. The Hypertext
Transfer Protocol is described in the following document:

https://www.w3.org/Protocols/rfc2616/rfc2616.txt

This is a long and complex 176-page document with a lot of detail. If you find
it interesting, feel free to read it all. But if you take a look around page 36 of
RFC2616 you will find the syntax for the GET request. To request a document
from a web server, we make a connection, e.g. to the data.pr4e.org server on port
80, and then send a line of the form

GET http://data.pr4e.org/romeo.txt HTTP/1.0

where the second parameter is the web page we are requesting, and then we also
send a blank line. The web server will respond with some header information about
the document and a blank line followed by the document content.

12.2 The world’s simplest web browser


Perhaps the easiest way to show how the HTTP protocol works is to write a very
simple Python program that makes a connection to a web server and follows the
rules of the HTTP protocol to request a document and display what the server
sends back.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

# Code: https://www.py4e.com/code3/socket1.py

First the program makes a connection to port 80 on the server data.pr4e.org.


Since our program is playing the role of the “web browser”, the HTTP protocol
says we must send the GET command followed by a blank line. \r\n signifies
an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences.
That is the equivalent of a blank line.
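
To make that concrete, here is what the command string from the program above looks
like before and after encode() (a small illustration added here, not part of socket1.py):

cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
print(repr(cmd))           # the request line followed by two EOL sequences
print(repr(cmd.encode()))  # the same text as a bytes object, ready for send()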
Once we send that blank line, we write a loop that receives data in 512-character
chunks from the socket and prints the data out until there is no more data to read
(i.e., the recv() returns an empty string).
The program produces the following output:

HTTP/1.1 200 OK

Figure 12.1: A Socket Connection

Date: Wed, 11 Apr 2018 18:52:55 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

The output starts with headers which the web server sends to describe the
document. For example, the Content-Type header indicates that the document is a
plain text document (text/plain).
After the server sends us the headers, it adds a blank line to indicate the end of
the headers, and then sends the actual data of the file romeo.txt.
This example shows how to make a low-level network connection with sockets.
Sockets can be used to communicate with a web server or with a mail server or
many other kinds of servers. All that is needed is to find the document which
describes the protocol and write the code to send and receive the data according
to the protocol.
However, since the protocol that we use most commonly is the HTTP web protocol,
Python has a special library specifically designed to support the HTTP protocol
for the retrieval of documents and data over the web.
One of the requirements for using the HTTP protocol is the need to send and
receive data as bytes objects, instead of strings. In the preceding example, the
encode() and decode() methods convert strings into bytes objects and back again.

The next example uses b'' notation to specify that a variable should be stored as
a bytes object. encode() and b'' are equivalent.

>>> b'Hello world'
b'Hello world'
>>> 'Hello world'.encode()
b'Hello world'

12.3 Retrieving an image over HTTP


In the above example, we retrieved a plain text file which had newlines in the file
and we simply copied the data to the screen as the program ran. We can use a
similar program to retrieve an image using HTTP. Instead of copying the data
to the screen as the program runs, we accumulate the data in a bytes variable,
trim off the headers, and then save the image data to a file as follows:

import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

# Code: https://www.py4e.com/code3/urljpeg.py

When the program runs, it produces the following output:

$ python urljpeg.py
5120 5120
5120 10240
4240 14480
5120 19600
...
5120 214000
3200 217200
5120 222320
5120 227440
3167 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:54:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

You can see that for this url, the Content-Type header indicates that the body of the
document is an image (image/jpeg). Once the program completes, you can view
the image data by opening the file stuff.jpg in an image viewer.
As the program runs, you can see that we don't get 5120 characters each time
we call the recv() method. We get as many characters as have been transferred
across the network to us by the web server at the moment we call recv(). In this
example, we get as few as 3200 characters in a single call even though we request
up to 5120 characters of data.
Your results may be different depending on your network speed. Also note that on
the last call to recv() we get 3167 bytes, which is the end of the stream, and in
the next call to recv() we get a zero-length string that tells us that the server has
called close() on its end of the socket and there is no more data forthcoming.
We can slow down our successive recv() calls by uncommenting the call to
time.sleep(). This way, we wait a quarter of a second after each call so that
the server can "get ahead" of us and send more data to us before we call recv()
again. With the delay in place, the program executes as follows:

$ python urljpeg.py
5120 5120
5120 10240
5120 15360
...
5120 225280

5120 230400
207 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 21:42:08 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

Now, other than the first and last calls to recv(), we get 5120 characters each
time we ask for new data.
There is a buffer between the server making send() requests and our application
making recv() requests. When we run the program with the delay in place, at
some point the server might fill up the buffer in the socket and be forced to pause
until our program starts to empty the buffer. The pausing of either the sending
application or the receiving application is called “flow control.”

12.4 Retrieving web pages with urllib


While we can manually send and receive data over HTTP using the socket library,
there is a much simpler way to perform this common task in Python by using the
urllib library.
Using urllib, you can treat a web page much like a file. You simply indicate
which web page you would like to retrieve and urllib handles all of the HTTP
protocol and header details.
The equivalent code to read the romeo.txt file from the web using urllib is as
follows:

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

# Code: https://www.py4e.com/code3/urllib1.py

Once the web page has been opened with urllib.request.urlopen, we can treat
it like a file and read through it using a for loop.
When the program runs, we only see the output of the contents of the file. The
headers are still sent, but the urllib code consumes the headers and only returns
the data to us.

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
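
Although the urllib code consumes the headers, they are not thrown away; the object
returned by urlopen keeps them. A minimal sketch (an addition to the book's example,
using the same romeo.txt URL):

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

# The response object still carries the headers that urllib read for us
print(fhand.getheader('Content-Type'))
for name, value in fhand.getheaders():
    print(name + ':', value)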

As an example, we can write a program to retrieve the data for romeo.txt and
compute the frequency of each word in the file as follows:

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

# Code: https://www.py4e.com/code3/urlwords.py

Again, once we have opened the web page, we can read it like a local file.
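
For instance, a small extension of the word-count program above sorts the dictionary
and prints the most common words; this is a sketch added here for illustration, not
part of urlwords.py:

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, largest first, and show the top five
top = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
for word, count in top[:5]:
    print(word, count)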

12.5 Reading binary files using urllib


Sometimes you want to retrieve a non-text (or binary) file such as an image or
video file. The data in these files is generally not useful to print out, but you can
easily make a copy of a URL to a local file on your hard disk using urllib.
The pattern is to open the URL and use read to download the entire contents of
the document into a variable (img), then write that information to a local
file as follows:

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

# Code: https://www.py4e.com/code3/curl1.py

This program reads all of the data in at once across the network and stores it in
the variable img in the main memory of your computer, then opens the file cover3.jpg
and writes the data out to your disk. The wb argument for open() opens a binary
file for writing only. This program will work if the size of the file is less than the
size of the memory of your computer.
However, if this is a large audio or video file, this program may crash or at least
run extremely slowly when your computer runs out of memory. In order to avoid
running out of memory, we retrieve the data in blocks (or buffers) and then write
each block to your disk before retrieving the next block. This way the program can
read any size file without using up all of the memory you have in your computer.

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

# Code: https://www.py4e.com/code3/curl2.py

In this example, we read only 100,000 characters at a time and then write those
characters to the cover3.jpg file before retrieving the next 100,000 characters of
data from the web.
This program runs as follows:

python curl2.py
230210 characters copied.
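
As a side note, the same buffered-copy idea can be written with the standard
library's shutil module, which copies from one file-like object to another in
fixed-size blocks. This is a sketch of an alternative, not the book's curl2.py:

import shutil
import urllib.request

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
with open('cover3.jpg', 'wb') as fhand:
    # copyfileobj reads and writes 100,000 bytes at a time, so the whole
    # file never has to fit in memory at once
    shutil.copyfileobj(img, fhand, 100000)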

12.6 Parsing HTML and scraping the web


One of the common uses of the urllib capability in Python is to scrape the web.
Web scraping is when we write a program that pretends to be a web browser and
retrieves pages, then examines the data in those pages looking for patterns.
As an example, a search engine such as Google will look at the source of one web
page and extract the links to other pages and retrieve those pages, extracting links,
and so on. Using this technique, Google spiders its way through nearly all of the
pages on the web.
Google also uses the frequency of links from pages it finds to a particular page as
one measure of how “important” a page is and how high the page should appear
in its search results.

12.7 Parsing HTML using regular expressions


One simple way to parse HTML is to use regular expressions to repeatedly search
for and extract substrings that match a particular pattern.
Here is a simple web page:

<h1>The First Page</h1>


<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link
values from the above text as follows:

href="http[s]?://.+?"

Our regular expression looks for strings that start with "href="http://" or
"href="https://", followed by one or more characters (.+?), followed by another
double quote. The question mark in [s]? indicates that we match the string
"http" followed by zero or one "s".
The question mark added to the .+? indicates that the match is to be done in
a “non-greedy” fashion instead of a “greedy” fashion. A non-greedy match tries
to find the smallest possible matching string and a greedy match tries to find the
largest possible matching string.
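
The difference is easy to see on a small piece of text that contains two links; this
short experiment (an illustration added here, not from the original program) compares
the greedy .+ with the non-greedy .+?:

import re

text = ('<a href="http://www.dr-chuck.com/page1.htm">First</a> '
        '<a href="http://www.dr-chuck.com/page2.htm">Second</a>')

# Greedy: .+ grabs as much as possible, so both links collapse into one match
print(re.findall('href="(.+)"', text))

# Non-greedy: .+? stops at the first closing quote, so each link matches separately
print(re.findall('href="(.+?)"', text))
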
We add parentheses to our regular expression to indicate which part of our matched
string we would like to extract, and produce the following program:

# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

# Code: https://www.py4e.com/code3/urlregex.py

The ssl library allows this program to access web sites that strictly enforce HTTPS.
The read method returns HTML source code as a bytes object instead of returning
an HTTPResponse object. The findall regular expression method will give us a
list of all of the strings that match our regular expression, returning only the link
text between the double quotes.
When we run the program and input a URL, we get the following output:

Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/

Regular expressions work very nicely when your HTML is well formatted and
predictable. But since there are a lot of “broken” HTML pages out there, a solution
only using regular expressions might either miss some valid links or end up with
bad data.
This can be solved by using a robust HTML parsing library.

12.8 Parsing HTML using BeautifulSoup


Even though HTML looks like XML (the XML format is described in the next
chapter) and some pages are carefully constructed to be XML, most HTML is
generally broken in ways that cause an XML parser to reject the entire page of
HTML as improperly formed.
There are a number of Python libraries which can help you parse HTML and
extract data from the pages. Each of the libraries has its strengths and weaknesses
and you can pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using
the BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still
lets you easily extract the data you need. You can download and install the
BeautifulSoup code from:
https://pypi.python.org/pypi/beautifulsoup4
Information on installing BeautifulSoup with the Python Package Index tool pip
is available at:
https://packaging.python.org/tutorials/installing-packages/
We will use urllib to read the page and then use BeautifulSoup to extract the
href attributes from the anchor (a) tags.

# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: https://www.py4e.com/code3/urllinks.py

The program prompts for a web address, then opens the web page, reads the data
and passes the data to the BeautifulSoup parser, and then retrieves all of the
anchor tags and prints out the href attribute for each tag.
When the program runs, it produces the following output:

Enter - https://docs.python.org
genindex.html
py-modindex.html
https://www.python.org/
#
whatsnew/3.6.html
whatsnew/index.html
tutorial/index.html
library/index.html
reference/index.html
using/index.html
howto/index.html
installing/index.html
distributing/index.html
extending/index.html
c-api/index.html
faq/index.html
py-modindex.html
genindex.html
glossary.html
search.html
contents.html
bugs.html
about.html
license.html
copyright.html
download.html
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
genindex.html
py-modindex.html
https://www.python.org/
#
copyright.html
https://www.python.org/psf/donations/
bugs.html
http://sphinx.pocoo.org/

This list is much longer because some HTML anchor tags are relative paths (e.g.,
tutorial/index.html) or in-page references (e.g., ‘#’) that do not include “http://”
or “https://”, which was a requirement in our regular expression.
You can also use BeautifulSoup to pull out various parts of each tag:

# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

# Code: https://www.py4e.com/code3/urllink2.py

python urllink2.py
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Content: ['\nSecond Page']
Attrs: [('href', 'http://www.dr-chuck.com/page2.htm')]

html.parser is the HTML parser included in the standard Python 3 library.
Information on other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
These examples only begin to show the power of BeautifulSoup when it comes to
parsing HTML.
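
For example, a parsed page also gives you easy access to pieces that are awkward to
pull out with regular expressions, such as the page title or the text with all of the
tags removed. A small sketch using a string of HTML (an addition, not one of the
book's sample programs):

from bs4 import BeautifulSoup

html = ('<html><head><title>The First Page</title></head>'
        '<body><p>Hello <b>there</b></p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)   # the text inside the <title> tag
print(soup.get_text())     # all of the text with the tags stripped out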

12.9 Bonus section for Unix / Linux users

If you have a Linux, Unix, or Macintosh computer, you probably have commands
built into your operating system that retrieve both plain text and binary files
using the HTTP or File Transfer (FTP) protocols. One of these commands is
curl:

$ curl -O http://www.py4e.com/cover.jpg

The command curl is short for "copy URL" and so the two examples listed earlier
to retrieve binary files with urllib are cleverly named curl1.py and curl2.py
on www.py4e.com/code3 as they implement similar functionality to the curl
command. There is also a curl3.py sample program that does this task a little more
effectively, in case you actually want to use this pattern in a program you are
writing.
A second command that functions very similarly is wget:

$ wget http://www.py4e.com/cover.jpg

Both of these commands make retrieving webpages and remote files a simple task.

12.10 Glossary

BeautifulSoup A Python library for parsing HTML documents and extracting
data from HTML documents that compensates for most of the imperfections
in the HTML that browsers generally ignore. You can download the
BeautifulSoup code from www.crummy.com.
port A number that generally indicates which application you are contacting when
you make a socket connection to a server. As an example, web traffic usually
uses port 80 while email traffic uses port 25.
scrape When a program pretends to be a web browser and retrieves a web page,
then looks at the web page content. Often programs are following the links
in one page to find the next page so they can traverse a network of pages or
a social network.
socket A network connection between two applications where the applications can
send and receive data in either direction.
spider The act of a web search engine retrieving a page and then all the pages
linked from a page and so on until they have nearly all of the pages on the
Internet which they use to build their search index.

12.11 Exercises
Exercise 1: Change the socket program socket1.py to prompt the user for the
URL so it can read any web page.
You can use split('/') to break the URL into its component parts so you can
extract the host name for the socket connect call. Add error checking using try
and except to handle the condition where the user enters an improperly formatted
or non-existent URL.
Exercise 2: Change your socket program so that it counts the number of characters
it has received and stops displaying any text after it has shown 3000 characters.
The program should retrieve the entire document and count the total number of
characters and display the count of the number of characters at the end of the
document.
Exercise 3: Use urllib to replicate the previous exercise of (1) retrieving the
document from a URL, (2) displaying up to 3000 characters, and (3) counting the
overall number of characters in the document. Don’t worry about the headers for
this exercise, simply show the first 3000 characters of the document contents.
Exercise 4: Change the urllinks.py program to extract and count paragraph (p)
tags from the retrieved HTML document and display the count of the paragraphs
as the output of your program. Do not display the paragraph text, only count
them. Test your program on several small web pages as well as some larger web
pages.
Exercise 5: (Advanced) Change the socket program so that it only shows data
after the headers and a blank line have been received. Remember that recv receives
characters (newlines and all), not lines.
Chapter 13

Using Web Services

Once it became easy to retrieve documents and parse documents over HTTP using
programs, it did not take long to develop an approach where we started producing
documents that were specifically designed to be consumed by other programs (i.e.,
not HTML to be displayed in a browser).
There are two common formats that we use when exchanging data across the web.
eXtensible Markup Language (XML) has been in use for a very long time and
is best suited for exchanging document-style data. When programs just want to
exchange dictionaries, lists, or other internal information with each other, they
use JavaScript Object Notation (JSON) (see www.json.org). We will look at both
formats.

13.1 eXtensible Markup Language - XML


XML looks very similar to HTML, but XML is more structured than HTML. Here
is a sample of an XML document:

<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>

Each pair of opening (e.g., <person>) and closing tags (e.g., </person>) represents
an element or node with the same name as the tag (e.g., person). Each element
can have some text, some attributes (e.g., hide), and other nested elements. If
an XML element is empty (i.e., has no content), then it may be depicted by a
self-closing tag (e.g., <email />).
Often it is helpful to think of an XML document as a tree structure where there is
a top element (here: person), and other tags (e.g., phone) are drawn as children
of their parent elements.

Figure 13.1: A Tree Representation of XML

13.2 Parsing XML


Here is a simple application that parses some XML and extracts some data elements
from the XML:

import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

# Code: https://www.py4e.com/code3/xml1.py

The triple single quote ('''), as well as the triple double quote ("""), allow for
the creation of strings that span multiple lines.
Calling fromstring converts the string representation of the XML into a “tree” of
XML elements. When the XML is in a tree, we have a series of methods we can
call to extract portions of data from the XML string. The find function searches
through the XML tree and retrieves the element that matches the specified tag.

Name: Chuck
Attr: yes

Using an XML parser such as ElementTree has the advantage that while the
XML in this example is quite simple, it turns out there are many rules regarding
valid XML, and using ElementTree allows us to extract data from XML without
worrying about the rules of XML syntax.
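
The same find and get calls work for the other elements in the tree. For instance,
this small extension of the example above (added here for illustration) pulls out the
phone number and its type attribute:

import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
phone = tree.find('phone')
print('Phone:', phone.text.strip())   # strip() removes the surrounding whitespace
print('Type:', phone.get('type'))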

13.3 Looping through nodes

Often the XML has multiple nodes and we need to write a loop to process all of
the nodes. In the following program, we loop through all of the user nodes:

import xml.etree.ElementTree as ET

input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))

# Code: https://www.py4e.com/code3/xml2.py

The findall method retrieves a Python list of subtrees that represent the user
structures in the XML tree. Then we can write a for loop that looks at each of
the user nodes, and prints the name and id text elements as well as the x attribute
from the user node.

User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7

It is important to include all parent level elements in the findall statement except
for the top level element (e.g., users/user). Otherwise, Python will not find any
desired nodes.

import xml.etree.ElementTree as ET

input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)

lst = stuff.findall('users/user')
print('User count:', len(lst))

lst2 = stuff.findall('user')
print('User count:', len(lst2))

lst stores all user elements that are nested within their users parent. lst2 looks
for user elements that are direct children of the top-level stuff element, and there
are none.

User count: 2
User count: 0
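
If you do not want to spell out the full path, ElementTree also accepts the './/'
prefix, which searches for matching elements at any depth below the element it is
called on. A brief sketch of that option (an addition to the example above):

import xml.etree.ElementTree as ET

stuff = ET.fromstring(
    '<stuff><users><user x="2"><id>001</id></user>'
    '<user x="7"><id>009</id></user></users></stuff>')

# './/user' finds user elements anywhere in the tree, however deeply nested
print('User count:', len(stuff.findall('.//user')))   # prints: User count: 2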

13.4 JavaScript Object Notation - JSON


The JSON format was inspired by the object and array format used in the
JavaScript language. But since Python was invented before JavaScript, Python’s
syntax for dictionaries and lists influenced the syntax of JSON. So the format of
JSON is nearly identical to a combination of Python lists and dictionaries.
Here is a JSON encoding that is roughly equivalent to the simple XML from above:

{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}

You will notice some differences. First, in XML, we can add attributes like “intl”
to the “phone” tag. In JSON, we simply have key-value pairs. Also the XML
“person” tag is gone, replaced by a set of outer curly braces.
In general, JSON structures are simpler than XML because JSON has fewer
capabilities than XML. But JSON has the advantage that it maps directly to some
combination of dictionaries and lists. And since nearly all programming languages
have something equivalent to Python's dictionaries and lists, JSON is a very natural
format to have two cooperating programs exchange data.
JSON is quickly becoming the format of choice for nearly all data exchange between
applications because of its relative simplicity compared to XML.

13.5 Parsing JSON


We construct our JSON by nesting dictionaries and lists as needed. In this example,
we represent a list of users where each user is a set of key-value pairs (i.e., a
dictionary). So we have a list of dictionaries.
In the following program, we use the built-in json library to parse the JSON and
read through the data. Compare this closely to the equivalent XML data and code
above. The JSON has less detail, so we must know in advance that we are getting a
list and that the list is of users and each user is a set of key-value pairs. The JSON
is more succinct (an advantage) but also is less self-describing (a disadvantage).

import json

data = '''
[
  { "id" : "001",
    "x" : "2",
    "name" : "Chuck"
  } ,
  { "id" : "009",
    "x" : "7",
    "name" : "Brent"
  }
]'''

info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

# Code: https://www.py4e.com/code3/json2.py

If you compare the code to extract data from the parsed JSON and XML you will
see that what we get from json.loads() is a Python list which we traverse with
a for loop, and each item within that list is a Python dictionary. Once the JSON
has been parsed, we can use the Python index operator to extract the various bits
of data for each user. We don’t have to use the JSON library to dig through the
parsed JSON, since the returned data is simply native Python structures.
The output of this program is exactly the same as the XML version above.

User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
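
The json library also works in the other direction: json.dumps takes a Python
structure built from dictionaries and lists and produces a JSON string, which is
what you would do before sending data to another program. A short sketch added
here for illustration:

import json

users = [
    {'id': '001', 'x': '2', 'name': 'Chuck'},
    {'id': '009', 'x': '7', 'name': 'Brent'}
]

# Serialize the Python list of dictionaries into a JSON string
print(json.dumps(users, indent=2))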

In general, there is an industry trend away from XML and towards JSON for web
services. Because the JSON is simpler and more directly maps to native data
structures we already have in programming languages, the parsing and data extraction
code is usually simpler and more direct when using JSON. But XML is more
self-descriptive than JSON and so there are some applications where XML retains an
advantage. For example, most word processors store documents internally using
XML rather than JSON.

13.6 Application Programming Interfaces


We now have the ability to exchange data between applications using the Hypertext
Transfer Protocol (HTTP) and a way to represent complex data that we are sending
back and forth between these applications using eXtensible Markup Language
(XML) or JavaScript Object Notation (JSON).
The next step is to begin to define and document "contracts" between applications
using these techniques. The general name for these application-to-application
contracts is Application Program Interfaces (APIs). When we use an API, generally
one program makes a set of services available for use by other applications and
publishes the APIs (i.e., the "rules") that must be followed to access the services
provided by the program.
When we begin to build our programs where the functionality of our program
includes access to services provided by other programs, we call the approach a
Service-oriented architecture (SOA). An SOA approach is one where our overall
application makes use of the services of other applications. A non-SOA approach
is where the application is a single standalone application which contains all of the
code necessary to implement the application.
Figure 13.2: Service-oriented architecture

We see many examples of SOA when we use the web. We can go to a single web
site and book air travel, hotels, and automobiles all from a single site. The data
for hotels is not stored on the airline computers. Instead, the airline computers
contact the services on the hotel computers and retrieve the hotel data and present
it to the user. When the user agrees to make a hotel reservation using the airline
site, the airline site uses another web service on the hotel systems to actually make
the reservation. And when it comes time to charge your credit card for the whole
transaction, still other computers become involved in the process.
A Service-oriented architecture has many advantages, including: (1) we always
maintain only one copy of data (this is particularly important for things like hotel
reservations where we do not want to over-commit) and (2) the owners of the data
can set the rules about the use of their data. With these advantages, an SOA
system must be carefully designed to have good performance and meet the user’s
needs.
When an application makes a set of services in its API available over the web, we
call these web services.

13.7 Security and API usage


It is quite common that you need an API key to make use of a vendor’s API. The
general idea is that they want to know who is using their services and how much
each user is using. Perhaps they have free and pay tiers of their services or have a
policy that limits the number of requests that a single individual can make during
a particular time period.
Sometimes once you get your API key, you simply include the key as part of POST
data or perhaps as a parameter on the URL when calling the API.
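
For the key-as-a-parameter case, urllib.parse.urlencode is a convenient way to build
the URL safely. The service address and parameter names below are made up purely for
illustration; check the documentation of the API you are actually calling:

import urllib.parse

# Hypothetical endpoint and key, just to show the pattern
serviceurl = 'https://api.example.com/lookup?'
params = {'address': 'Ann Arbor, MI', 'key': 'YOUR_API_KEY'}

url = serviceurl + urllib.parse.urlencode(params)
print(url)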

Other times, the vendor wants increased assurance of the source of the requests
and so they expect you to send cryptographically signed messages using shared
keys and secrets. A very common technology that is used to sign requests over
the Internet is called OAuth. You can read more about the OAuth protocol at
www.oauth.net.
Thankfully there are a number of convenient and free OAuth libraries so you can
avoid writing an OAuth implementation from scratch by reading the specification.
These libraries are of varying complexity and have varying degrees of richness. The
OAuth web site has information about various OAuth libraries.

13.8 Glossary
API Application Program Interface - A contract between applications that defines
the patterns of interaction between two application components.
ElementTree A built-in Python library used to parse XML data.
JSON JavaScript Object Notation - A format that allows for the markup of
structured data based on the syntax of JavaScript Objects.
SOA Service-Oriented Architecture - When an application is made of components
connected across a network.
XML eXtensible Markup Language - A format that allows for the markup of
structured data.
