Python Module5 Notes

Python Programming (21EC643) C Prathibha, Asst. Prof., EC Dept, KIT, Tiptur

Module 5
12.1 Hypertext Transfer Protocol - HTTP
The network protocol that powers the web is actually quite simple and there is built-in support in
Python called socket which makes it very easy to make network connections and retrieve data
over those sockets in a Python program.
A socket is much like a file, except that a single socket provides a two-way connection between
two programs. You can both read from and write to the same socket. If you write something to a
socket, it is sent to the application at the other end of the socket. If you read from the socket,
you are given the data which the other application has sent.
But if you try to read from a socket when the program on the other end of the socket has not sent any
data, you just sit and wait. If the programs on both ends of the socket simply wait for some
data without sending anything, they will wait for a very long time, so an important part of
programs that communicate over the Internet is to have some sort of protocol.
A protocol is a set of precise rules that determine who is to go first, what they are to do, and then
what the responses are to that message, and who sends next, and so on. In a sense the two
applications at either end of the socket are doing a dance and making sure not to step on each
other’s toes.
There are many documents that describe these network protocols. The Hypertext Transfer Protocol
is described in the following document:

https://www.w3.org/Protocols/rfc2616/rfc2616.txt
This is a long and complex 176-page document with a lot of detail. If you find it
interesting, feel free to read it all. But if you take a look around page 36 of RFC 2616
you will find the syntax for the GET request. To request a document from a web
server, we make a connection, e.g. to the data.pr4e.org server on port 80, and then
send a line of the form
GET http://data.pr4e.org/romeo.txt HTTP/1.0
where the second parameter is the web page we are requesting, and then we also send a
blank line. The web server will respond with some header information about the
document and a blank line followed by the document content.

12.2 The world’s simplest web browser


Perhaps the easiest way to show how the HTTP protocol works is to write a very
simple Python program that makes a connection to a web server and follows the rules
of the HTTP protocol to request a document and display what the server sends back.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)


mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()
First the program makes a connection to port 80 on the server data.pr4e.org. Since
our program is playing the role of the “web browser”, the HTTP protocol says we
must send the GET command followed by a blank line. \r\n signifies an EOL
(end of line), so \r\n\r\n signifies nothing between two EOL sequences. That is the
equivalent of a blank line.
Once we send that blank line, we write a loop that receives data in 512-character chunks
from the socket and prints the data out until there is no more data to read (i.e., the recv()
returns an empty string).
The program produces the following output. (Figure 12.1, "A Socket Connection", shows the program connected through a socket to port 80 on the web server.)

HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:52:55 GMT


Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks


It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

The output starts with headers which the web server sends to describe the document. For example,
the Content-Type header indicates that the document is a plain text document (text/plain).
After the server sends us the headers, it adds a blank line to indicate the end of the headers,

and then sends the actual data of the file romeo.txt.


This example shows how to make a low-level network connection with sockets. Sockets can be
used to communicate with a web server or with a mail server or many other kinds of servers. All
that is needed is to find the document which describes the protocol and write the code to send and
receive the data according to the protocol.
However, since the protocol that we use most commonly is the HTTP web protocol, Python has a
special library specifically designed to support the HTTP protocol for the retrieval of
documents and data over the web.
One of the requirements for using the HTTP protocol is the need to send and receive data as
bytes objects, instead of strings. In the preceding example, the encode() and decode() methods
convert strings into bytes objects and back again.
The next example uses b'' notation to specify that a literal should be stored as a
bytes object. For text containing only ASCII characters, encode() and b'' produce the same bytes.

>>> b'Hello world'


b'Hello world'
>>> 'Hello world'.encode()
b'Hello world'
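A quick round trip shows the equivalence, and the reverse operation:

```python
text = 'Hello world'
data = text.encode()            # str -> bytes (UTF-8 by default)
print(data == b'Hello world')   # the two notations produce the same bytes
print(data.decode())            # bytes -> str recovers the original string
```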

12.3 Retrieving an image over HTTP


In the above example, we retrieved a plain text file which had newlines in the file and
we simply copied the data to the screen as the program ran. We can use a similar
program to retrieve an image across the network using HTTP. Instead of copying the data to the
screen as the program runs, we accumulate the data in a string, trim off the headers,
and then save the image data to a file as follows:
import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)


pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())


# Skip past the header and save the picture data


picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

# Code: https://www.py4e.com/code3/urljpeg.py

When the program runs, it produces the following output:

$ python urljpeg.py
5120 5120


5120 10240
4240 14480
5120 19600
...
5120 214000
3200 217200
5120 222320
5120 227440
3167 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:54:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

You can see that for this URL, the Content-Type header indicates that the body of the document is an
image (image/jpeg). Once the program completes, you can view the image data by opening the
file stuff.jpg in an image viewer.
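The header-splitting step in the program (find the first blank line, then slice past it) can be tried offline on a small fabricated response; the response bytes below are made up for illustration:

```python
# A fabricated HTTP response: headers, a blank line, then the body
response = b'HTTP/1.1 200 OK\r\nContent-Type: image/jpeg\r\n\r\nJPEGDATA'

pos = response.find(b'\r\n\r\n')   # locate the blank line (two CRLFs)
header = response[:pos].decode()   # everything before it is header text
body = response[pos + 4:]          # skip the 4 delimiter bytes

print(header)
print(body)   # b'JPEGDATA'
```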
As the program runs, you can see that we don’t get 5120 characters each time we call the
recv() method. We get as many characters as have been transferred across the network to us by
the web server at the moment we call recv(). In this example, some calls return as few as
3200 characters even though we request up to 5120 characters of data.
Your results may be different depending on your network speed. Also note that on the last call to
recv() we get 3167 bytes, which is the end of the stream, and in the next call to recv() we get
a zero-length string that tells us that the server has called close() on its end of the socket and
there is no more data forthcoming.
We can slow down our successive recv() calls by uncommenting the call to time.sleep(). This
way, we wait a quarter of a second after each call so that the server can “get ahead” of us and
send more data to us before we call recv() again. With the delay in place, the program executes
as follows:

$ python urljpeg.py
5120 5120


5120 10240

5120 15360
...
5120 225280

5120 230400
207 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 21:42:08 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

Now, other than the first and last calls to recv(), we get 5120 characters each time
we ask for new data.
There is a buffer between the server making send() requests and our application
making recv() requests. When we run the program with the delay in place, at some
point the server might fill up the buffer in the socket and be forced to pause until our
program starts to empty the buffer. The pausing of either the sending application or
the receiving application is called “flow control.”

12.4 Retrieving web pages with urllib


While we can manually send and receive data over HTTP using the socket library,
there is a much simpler way to perform this common task in Python by using the
urllib library.
Using urllib, you can treat a web page much like a file. You simply indicate
which web page you would like to retrieve and urllib handles all of the HTTP
protocol and header details.
The equivalent code to read the romeo.txt file from the web using urllib is as
follows:

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

# Code: https://www.py4e.com/code3/urllib1.py

Once the web page has been opened with urllib.request.urlopen, we can treat it
like a file and read through it using a for loop.
When the program runs, we only see the output of the contents of the file. The headers


are still sent, but the urllib code consumes the headers and only returns the data to
us.
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

As an example, we can write a program to retrieve the data for romeo.txt and compute the
frequency of each word in the file as follows:

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

# Code: https://www.py4e.com/code3/urlwords.py

Again, once we have opened the web page, we can read it like a local file.
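Because the opened page behaves like a file of bytes lines, the same count can also be written with collections.Counter. The sketch below substitutes an in-memory io.BytesIO (with made-up text) for the network response so it runs offline:

```python
import io
from collections import Counter

# io.BytesIO stands in here for the object returned by urlopen
fhand = io.BytesIO(b'But soft what light\nBut soft\n')

counts = Counter()
for line in fhand:
    counts.update(line.decode().split())

print(counts['But'])    # 2
print(counts['light'])  # 1
```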

12.5 Reading binary files using urllib


Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The
data in these files is generally not useful to print out, but you can easily make a copy of a URL
to a local file on your hard disk using urllib.
The pattern is to open the URL and use read to download the entire contents of the document
into a string variable (img) then write that information to a local file as follows:

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

# Code: https://www.py4e.com/code3/curl1.py

This program reads all of the data in at once across the network and stores it in the variable img
in the main memory of your computer, then opens the file cover3.jpg and writes the data out to
your disk. The wb argument for open() opens a binary file for writing only. This program will
work if the size of the file is less than the size of the memory of your computer.
However if this is a large audio or video file, this program may crash or at least run extremely
slowly when your computer runs out of memory. In order to avoid running out of memory, we
retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the
next block. This way the program can read any size file without using up all of the memory you
have in your computer.

import urllib.request, urllib.parse, urllib.error



img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

# Code: https://www.py4e.com/code3/curl2.py

In this example, we read only 100,000 characters at a time and then write those
characters to the cover3.jpg file before retrieving the next 100,000 characters of data
from the web.
This program runs as follows:

python curl2.py
230210 characters copied.
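The read-a-block, write-a-block loop is essentially what the standard library's shutil.copyfileobj does. Here is an equivalent sketch, using io.BytesIO objects in place of the network response and the output file so it runs offline:

```python
import io
import shutil

src = io.BytesIO(b'x' * 230210)  # stands in for the urlopen response
dst = io.BytesIO()               # stands in for the open output file

# Copy in 100,000-byte blocks, never holding the whole file in memory at once
shutil.copyfileobj(src, dst, length=100000)

print(dst.getbuffer().nbytes, 'characters copied.')
```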

12.6 Parsing HTML and scraping the web


One of the common uses of the urllib capability in Python is to scrape the web.
Web scraping is when we write a program that pretends to be a web browser and
retrieves pages, then examines the data in those pages looking for patterns.
As an example, a search engine such as Google will look at the source of one web page
and extract the links to other pages and retrieve those pages, extracting links, and so on.
Using this technique, Google spiders its way through nearly all of the pages on the web.
Google also uses the frequency of links from pages it finds to a particular page as one
measure of how “important” a page is and how high the page should appear in its
search results.

12.7 Parsing HTML using regular expressions


One simple way to parse HTML is to use regular expressions to repeatedly search
for and extract substrings that match a particular pattern.
Here is a simple web page:
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm"> Second
Page</a>.
</p>
We can construct a well-formed regular expression to match and extract the link values from the
above text as follows:
href="http[s]?://.+?"

Our regular expression looks for strings that start with “href="http://” or “href="https://”,


followed by one or more characters (.+?), followed by another double quote. The question
mark behind the [s]? indicates to search for the string “http” followed by zero or one “s”.
The question mark added to the .+? indicates that the match is to be done in a “non-greedy”
fashion instead of a “greedy” fashion. A non-greedy match tries to find the smallest possible
matching string and a greedy match tries to find the largest possible matching string.
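The difference is easy to see on a small string containing two links (the URLs here are made up for illustration):

```python
import re

text = 'href="http://example.com/a" and href="https://example.com/b"'

# Non-greedy: stops at the first closing quote, so each link matches separately
print(re.findall('href="http[s]?://.+?"', text))

# Greedy: runs to the last closing quote, swallowing both links in one match
print(re.findall('href="http[s]?://.+"', text))
```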
We add parentheses to our regular expression to indicate which part of our matched string we would
like to extract, and produce the following program:

# Search for link values within URL input


import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

# Code: https://www.py4e.com/code3/urlregex.py

The ssl library allows this program to access web sites that strictly enforce HTTPS. The read
method returns HTML source code as a bytes object instead of returning an HTTPResponse
object. The findall regular expression method will give us a list of all of the strings that
match our regular expression, returning only the link text between the double quotes.
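The effect of findall with a capturing group can be checked against a literal bytes string; the HTML fragment below is fabricated for illustration:

```python
import re

# A fabricated fragment of HTML as a bytes object, like read() returns
html = b'<p><a href="http://www.dr-chuck.com/page2.htm">Second Page</a></p>'

# The parentheses mean findall returns only the text inside the quotes
links = re.findall(b'href="(http[s]?://.*?)"', html)
print([link.decode() for link in links])
```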
When we run the program and input a URL, we get the following output:

Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/

Regular expressions work very nicely when your HTML is well formatted and
predictable. But since there are a lot of “broken” HTML pages out there, a solution only
using regular expressions might either miss some valid links or end up with bad
data.


This can be solved by using a robust HTML parsing library.

12.8 Parsing HTML using BeautifulSoup


Even though HTML looks like XML and some pages are carefully constructed to
be XML, most HTML is generally broken in ways that cause an XML parser to
reject the entire page of HTML as improperly formed.
There are a number of Python libraries which can help you parse HTML and extract
data from the pages. Each of the libraries has its strengths and weaknesses and you can
pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using the
BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still lets
you easily extract the data you need. You can download and install the
BeautifulSoup code from:
https://pypi.python.org/pypi/beautifulsoup4
Information on installing BeautifulSoup with the Python Package Index tool pip
is available at:
https://packaging.python.org/tutorials/installing-packages/
We will use urllib to read the page and then use BeautifulSoup to extract the
href attributes from the anchor (a) tags.

# To run this, download the BeautifulSoup zip file


# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error


from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags


tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: https://www.py4e.com/code3/urllinks.py

The program prompts for a web address, then opens the web page, reads the data and passes the
data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the
href attribute for each tag.
When the program runs, it produces the following output:

Enter - https://docs.python.org
genindex.html
py-modindex.html
https://www.python.org/
#
whatsnew/3.6.html
whatsnew/index.html
tutorial/index.html
library/index.html
reference/index.html
using/index.html
howto/index.html
installing/index.html
distributing/index.html
extending/index.html
c-api/index.html
faq/index.html
py-modindex.html
genindex.html
glossary.html
search.html
contents.html
bugs.html
about.html
license.html
copyright.html
download.html
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
genindex.html
py-modindex.html
https://www.python.org/
#
copyright.html
https://www.python.org/psf/donations/
bugs.html
http://sphinx.pocoo.org/

This list is much longer because some HTML anchor tags are relative paths (e.g.,
tutorial/index.html) or in-page references (e.g., ‘#’) that do not include “http://” or
“https://”, which was a requirement in our regular expression.

You can also use BeautifulSoup to pull out various parts of each tag:

# To run this, download the BeautifulSoup zip file


# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen


from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags


tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

# Code: https://www.py4e.com/code3/urllink2.py

python urllink2.py

Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents: ['\nSecond Page']
Attrs: [('href', 'http://www.dr-chuck.com/page2.htm')]

html.parser is the HTML parser included in the standard Python 3 library. Information on
other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML.
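When installing BeautifulSoup is not an option, the standard library's html.parser module can handle basic link extraction on its own. A minimal sketch, in which the class name and sample page are invented for illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = '''<h1>The First Page</h1>
<p><a href="http://www.dr-chuck.com/page2.htm">Second Page</a></p>'''

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```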

12.9 Bonus section for Unix / Linux users


If you have a Linux, Unix, or Macintosh computer, you probably have commands built in to
your operating system that retrieve both plain text and binary files using the HTTP or File
Transfer (FTP) protocols. One of these commands is curl:

$ curl -O http://www.py4e.com/cover.jpg

The command curl is short for “copy URL” and so the two examples listed earlier to retrieve
binary files with urllib are cleverly named curl1.py and curl2.py on www.py4e.com/code3
as they implement similar functionality to the curl command. There is also a curl3.py sample
program that does this task a little more effectively, in case you actually want to use this pattern
in a program you are writing.
A second command that functions very similarly is wget:

$ wget http://www.py4e.com/cover.jpg

Both of these commands make retrieving webpages and remote files a simple task.

Using Web Services


Once it became easy to retrieve documents and parse documents over HTTP using programs, it
did not take long to develop an approach where we started producing documents that were
specifically designed to be consumed by other programs (i.e., not HTML to be displayed in a
browser).
There are two common formats that we use when exchanging data across the web. eXtensible
Markup Language (XML) has been in use for a very long time and is best suited for
exchanging document-style data. When programs just want to exchange dictionaries, lists, or
other internal information with each other, they use JavaScript Object Notation (JSON) (see
www.json.org). We will look at both formats.

13.1 eXtensible Markup Language - XML


XML looks very similar to HTML, but XML is more structured than HTML.
Here is a sample of an XML document:

<person>
<name>Chuck</name>
<phone type="intl">
+1 734 303 4456
</phone>
<email hide="yes" />
</person>

Each pair of opening (e.g., <person>) and closing tags (e.g., </person>) represents an element or node
with the same name as the tag (e.g., person). Each element can have some text, some attributes
(e.g., hide), and other nested elements. If an XML element is empty (i.e., has no content), then
it may be depicted by a self-closing tag (e.g., <email />).
Often it is helpful to think of an XML document as a tree structure where there is a top element
(here: person), and other tags (e.g., phone) are drawn as children of their parent elements.


13.2 Parsing XML


Here is a simple application that parses some XML and extracts some data elements
from the XML:

import xml.etree.ElementTree as ET

data = '''
<person>
<name>Chuck</name>
<phone type="intl">
+1 734 303 4456
</phone>
<email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

# Code: https://www.py4e.com/code3/xml1.py

The triple single quote ('''), as well as the triple double quote ("""), allow for the
creation of strings that span multiple lines.
Calling fromstring converts the string representation of the XML into a “tree” of
XML elements. When the XML is in a tree, we have a series of methods we can
call to extract portions of data from the XML string. The find function searches
through the XML tree and retrieves the element that matches the specified tag.

Name: Chuck
Attr: yes

Using an XML parser such as ElementTree has the advantage that while the
XML in this example is quite simple, it turns out there are many rules regarding
valid XML, and using ElementTree allows us to extract data from XML without worrying


about the rules of XML syntax.

13.3 Looping through nodes

Often the XML has multiple nodes and we need to write a loop to process all of the nodes. In
the following program, we loop through all of the user nodes:

import xml.etree.ElementTree as ET
input = '''
<stuff>
<users>
<user x="2">
<id>001</id>
<name>Chuck</name>
</user>
<user x="7">
<id>009</id>
<name>Brent</name>
</user>
</users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))

# Code: https://www.py4e.com/code3/xml2.py

The findall method retrieves a Python list of subtrees that represent the user structures in the
XML tree. Then we can write a for loop that looks at each of the user nodes, and prints the name
and id text elements as well as the x attribute from the user node.
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7

It is important to include all parent level elements in the findall statement except for the top level
element (e.g., users/user). Otherwise, Python will not find any desired nodes.
import xml.etree.ElementTree as ET

input = '''
<stuff>
<users>
<user x="2">
<id>001</id>
<name>Chuck</name>


</user>
<user x="7">
<id>009</id>
<name>Brent</name>
</user>
</users>
</stuff>'''

stuff = ET.fromstring(input)

lst = stuff.findall('users/user')
print('User count:', len(lst))

lst2 = stuff.findall('user')
print('User count:', len(lst2))

lst stores all user elements that are nested within their users parent. lst2 looks for
user elements that sit directly under the top-level stuff element, and there
are none.

User count: 2
User count: 0
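ElementTree can also walk matching nodes at any depth with iter(), which does not need the parent path. A small sketch using the same XML as above:

```python
import xml.etree.ElementTree as ET

data = '''<stuff>
  <users>
    <user x="2"><id>001</id><name>Chuck</name></user>
    <user x="7"><id>009</id><name>Brent</name></user>
  </users>
</stuff>'''

tree = ET.fromstring(data)

# iter() visits every matching descendant, regardless of nesting depth
names = [node.text for node in tree.iter('name')]
print(names)   # ['Chuck', 'Brent']
```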

13.4 JavaScript Object Notation - JSON


The JSON format was inspired by the object and array format used in the JavaScript
language. But since Python was invented before JavaScript, Python’s syntax for
dictionaries and lists influenced the syntax of JSON. So the format of JSON is nearly
identical to a combination of Python lists and dictionaries.
Here is a JSON encoding that is roughly equivalent to the simple XML from above:

{
"name" : "Chuck",
"phone" : {
"type" : "intl",
"number" : "+1 734 303 4456"
},
"email" : {
"hide" : "yes"
}
}

You will notice some differences. First, in XML, we can add attributes like “intl” to the “phone”
tag. In JSON, we simply have key-value pairs. Also the XML “person” tag is gone, replaced
by a set of outer curly braces.
In general, JSON structures are simpler than XML because JSON has fewer capabilities than
XML. But JSON has the advantage that it maps directly to some combination of dictionaries and
lists. And since nearly all programming languages have something equivalent to Python’s
dictionaries and lists, JSON is a very natural format to have two cooperating programs
exchange data.
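Parsing the JSON above with the built-in json library yields nested dictionaries that we can index directly; a short sketch:

```python
import json

data = '''{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}'''

info = json.loads(data)          # a dictionary of nested dictionaries
print(info['name'])              # Chuck
print(info['phone']['type'])     # intl
print(info['email']['hide'])     # yes
```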
JSON is quickly becoming the format of choice for nearly all data exchange between applications

because of its relative simplicity compared to XML.

13.5 Parsing JSON


We construct our JSON by nesting dictionaries and lists as needed. In this example, we represent
a list of users where each user is a set of key-value pairs (i.e., a dictionary). So we have a list of
dictionaries.
In the following program, we use the built-in json library to parse the JSON and read through
the data. Compare this closely to the equivalent XML data and code above. The JSON has less
detail, so we must know in advance that we are getting a list and that the list is of users and each
user is a set of key-value pairs. The JSON is more succinct (an advantage) but also is less
self-describing (a disadvantage).

import json

data = ''' [
{ "id" : "001",
"x" : "2",
"name" : "Chuck"
} ,
{ "id" : "009",
"x" : "7",
"name" : "Brent"
}
]'''

info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

# Code: https://www.py4e.com/code3/json2.py

If you compare the code to extract data from the parsed JSON and XML you will see that what
we get from json.loads() is a Python list which we traverse with a for loop, and each item
within that list is a Python dictionary. Once the JSON has been parsed, we can use the Python
index operator to extract the various bits of data for each user. We don’t have to use the JSON
library to dig through the parsed JSON, since the returned data is simply native Python
structures.
The output of this program is exactly the same as the XML version above.
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
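Going the other direction, json.dumps() serializes native Python structures into a JSON string. A minimal sketch (not from the book, but using only the standard json library):

```python
import json

# Serialize a list of dictionaries into a JSON string (the reverse of
# json.loads). indent=2 pretty-prints the output.
users = [
    {'id': '001', 'x': '2', 'name': 'Chuck'},
    {'id': '009', 'x': '7', 'name': 'Brent'},
]
text = json.dumps(users, indent=2)
print(text)

# Parsing the string back yields an equal Python structure
print(json.loads(text) == users)   # True
```

Because JSON maps so directly onto lists and dictionaries, a round trip through dumps() and loads() gives back an equal structure.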

In general, there is an industry trend away from XML and towards JSON for web services.
Because the JSON is simpler and more directly maps to native data structures we already have in
programming languages, the parsing and data extraction code is usually simpler and more direct
when using JSON. But XML is more self-descriptive than JSON and so there are some
applications where XML retains an advantage. For example, most word processors store documents
internally using XML rather than JSON.

13.6 Application Programming Interfaces


We now have the ability to exchange data between applications using the Hypertext Transfer
Protocol (HTTP) and a way to represent complex data that we are sending back and forth between
these applications using eXtensible Markup Language (XML) or JavaScript Object Notation
(JSON).
The next step is to begin to define and document “contracts” between applications using these
techniques. The general name for these application-to-application contracts is Application
Program Interfaces (APIs). When we use an API, generally one program makes a set of services
available for use by other applications and publishes the APIs (i.e., the “rules”) that must be
followed to access the services provided by the program.
When we begin to build our programs where the functionality of our program includes access
to services provided by other programs, we call the approach a Service-oriented architecture
(SOA). An SOA approach is one where our overall application makes use of the services of
other applications. A non-SOA approach is where the application is a single standalone application
which contains all of the code necessary to implement the application.

Figure 13.2: Service-oriented architecture (a Travel Application using the APIs of Auto Rental, Hotel, and Airline services)

We see many examples of SOA when we use the web. We can go to a single web site and book
air travel, hotels, and automobiles all from a single site. The data for hotels is not stored on the
airline computers. Instead, the airline computers contact the services on the hotel computers and
retrieve the hotel data and present it to the user. When the user agrees to make a hotel reservation
using the airline site, the airline site uses another web service on the hotel systems to actually make
the reservation. And when it comes time to charge your credit card for the whole transaction, still
other computers become involved in the process.
A Service-oriented architecture has many advantages, including: (1) we always maintain only one
copy of data (this is particularly important for things like hotel reservations where we do not want
to over-commit) and (2) the owners of the data can set the rules about the use of their data. Along
with these advantages, an SOA system must be carefully designed to have good performance and meet
the user’s needs.
When an application makes a set of services in its API available over the web, we call these web
services.

13.7 Security and API usage


It is quite common that you need an API key to make use of a vendor’s API. The general idea
is that they want to know who is using their services and how much each user is using. Perhaps
they have free and pay tiers of their services or have a policy that limits the number of requests
that a single individual can make during a particular time period.
Sometimes once you get your API key, you simply include the key as part of POST data or
perhaps as a parameter on the URL when calling the API.
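As a sketch of the URL-parameter style, we can use urllib.parse.urlencode to attach a key safely. The endpoint, parameter names, and key here are hypothetical, for illustration only:

```python
import urllib.parse

# Hypothetical endpoint and API key, for illustration only
base_url = 'http://api.example.com/geocode'
params = {'address': 'Ann Arbor, MI', 'key': 'MY-API-KEY'}

# urlencode handles the escaping of spaces and punctuation for us
url = base_url + '?' + urllib.parse.urlencode(params)
print(url)
# http://api.example.com/geocode?address=Ann+Arbor%2C+MI&key=MY-API-KEY
```

Letting urlencode build the query string avoids broken URLs when parameter values contain spaces or punctuation.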

Other times, the vendor wants increased assurance of the source of the requests and so they
expect you to send cryptographically signed messages using shared keys and secrets. A
very common technology that is used to sign requests over the Internet is called OAuth.
You can read more about the OAuth protocol at www.oauth.net.
Thankfully there are a number of convenient and free OAuth libraries so you can avoid writing
an OAuth implementation from scratch by reading the specification. These libraries are of
varying complexity and have varying degrees of richness. The OAuth web site has
information about various OAuth libraries.

Database
15.1 What is a database?
A database is a file that is organized for storing data. Most databases are organized like a dictionary
in the sense that they map from keys to values. The biggest difference is that the database is on
disk (or other permanent storage), so it persists after the program ends. Because a database is stored
on permanent storage, it can store far more data than a dictionary, which is limited to the size of
the memory in the computer.
Like a dictionary, database software is designed to keep the inserting and accessing of data very fast,
even for large amounts of data. Database software maintains its performance by building indexes
as data is added to the database to allow the computer to jump quickly to a particular entry.
There are many different database systems which are used for a wide variety of purposes including:
Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite. We focus on SQLite in this
book because it is a very common database and is already built into Python. SQLite is designed
to be embedded into other applications to provide database support within the application. For
example, the Firefox browser also uses the SQLite database internally as do many other
products.
http://sqlite.org/
SQLite is well suited to some of the data manipulation problems that we see in Informatics.

15.2 Database concepts


When you first look at a database it looks like a spreadsheet with multiple sheets. The primary data
structures in a database are: tables, rows, and columns.
In technical descriptions of relational databases the concepts of table, row, and column are more
formally referred to as relation, tuple, and attribute, respectively. We will use the less formal
terms in this chapter.

Figure 15.1: Relational Databases (a table is formally a relation, a row is a tuple, and a column is an attribute)

15.3 Database Browser for SQLite


While this chapter will focus on using Python to work with data in SQLite database files, many
operations can be done more conveniently using software called the Database Browser for
SQLite which is freely available from:
http://sqlitebrowser.org/
Using the browser you can easily create tables, insert data, edit data, or run simple SQL queries
on the data in the database.
In a sense, the database browser is similar to a text editor when working with text files. When
you want to do one or very few operations on a text file, you can just open it in a text editor and
make the changes you want. When you have many changes that you need to do to a text file,
often you will write a simple Python program. You will find the same pattern when working
with databases. You will do simple operations in the database manager and more complex
operations will be most conveniently done in Python.

15.4 Creating a database table


Databases require more defined structure than Python lists or dictionaries.
When we create a database table we must tell the database in advance the names of each of the
columns in the table and the type of data which we are planning to store in each column. When
the database software knows the type of data in each column, it can choose the most efficient
way to store and look up the data based on the type of data.
You can look at the various data types supported by SQLite at the following url:
http://www.sqlite.org/datatypes.html
Defining structure for your data up front may seem inconvenient at the beginning, but the payoff
is fast access to your data even when the database contains a large amount of data.

Figure 15.2: A Database Cursor

The code to create a database file and a table named Track with two columns in the database
is as follows:

import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Track')


cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

conn.close()

# Code: https://www.py4e.com/code3/db1.py

The connect operation makes a “connection” to the database stored in the file music.sqlite in
the current directory. If the file does not exist, it will be created. The reason this is called a
“connection” is that sometimes the database is stored on a separate “database server” from the
server on which we are running our application. In our simple examples the database will just
be a local file in the same directory as the Python code we are running.
A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with text
files.
Once we have the cursor, we can begin to execute commands on the contents of the database
using the execute() method.
Database commands are expressed in a special language that has been standardized across many
different database vendors to allow us to learn a single database language. The database language
is called Structured Query Language or SQL for short.
http://en.wikipedia.org/wiki/SQL
In our example, we are executing two SQL commands in our database. As a convention,
we will show the SQL keywords in uppercase and the parts of the command that we are
adding (such as the table and column names) will be shown in lowercase.
The first SQL command removes the Track table from the database if it exists. This
pattern is simply to allow us to run the same program to create the Track table over and
over again without causing an error. Note that the DROP TABLE command deletes the table
and all of its contents from the database (i.e., there is no “undo”).

cur.execute('DROP TABLE IF EXISTS Track ')

The second command creates a table named Track with a text column named
title and an integer column named plays.
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

Now that we have created a table named Track, we can put some data into that table using the SQL
INSERT operation. Again, we begin by making a connection to the database and obtaining the
cursor. We can then execute SQL commands using the cursor.
The SQL INSERT command indicates which table we are using and then defines a new row by
listing the fields we want to include (title, plays) followed by the VALUES we want placed in
the new row. We specify the values as question marks (?, ?) to indicate that the actual values are
passed in as a tuple ( 'My Way',15 ) as the second parameter to the execute() call.

import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
            ('Thunderstruck', 20))
cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
            ('My Way', 15))
conn.commit()

print('Track:')
cur.execute('SELECT title, plays FROM Track')
for row in cur:
    print(row)

cur.execute('DELETE FROM Track WHERE plays < 100')
conn.commit()

cur.close()

# Code: https://www.py4e.com/code3/db2.py

Tracks
title          plays
Thunderstruck  20
My Way         15

Figure 15.3: Rows in a Table

First we INSERT two rows into our table and use commit() to force the data to be written to the
database file.
Then we use the SELECT command to retrieve the rows we just inserted from the table. On the
SELECT command, we indicate which columns we would like (title, plays) and indicate
which table we want to retrieve the data from. After we execute the SELECT statement, the
cursor is something we can loop through in a for statement. For efficiency, the cursor does not
read all of the data from the database when we execute the SELECT statement. Instead, the data
is read on demand as we loop through the rows in the for statement.
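If you prefer to pull rows explicitly rather than looping over the cursor, fetchone() and fetchall() do that. A small sketch (an assumed example using an in-memory database, not from the book):

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')
cur.executemany('INSERT INTO Track (title, plays) VALUES (?, ?)',
                [('Thunderstruck', 20), ('My Way', 15)])

cur.execute('SELECT title, plays FROM Track ORDER BY plays DESC')
print(cur.fetchone())   # first row: ('Thunderstruck', 20)
print(cur.fetchall())   # remaining rows: [('My Way', 15)]
conn.close()
```

executemany() is a convenience that runs the same parameterized INSERT once per tuple in the list.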

The output of the program is as follows:
Track:
('Thunderstruck', 20)
('My Way', 15)

Our for loop finds two rows, and each row is a Python tuple with the first value as the title
and the second value as the number of plays.
At the very end of the program, we execute an SQL command to DELETE the rows we
have just created so we can run the program over and over. The DELETE command shows the use
of a WHERE clause that allows us to express a selection criterion so that we can ask the database
to apply the command to only the rows that match the criterion. In this example the criterion
happens to apply to all the rows so we empty the table out so we can run the program repeatedly.
After the DELETE is performed, we also call commit() to force the data to be removed from the
database.

15.5 Structured Query Language summary


So far, we have been using the Structured Query Language in our Python examples and have
covered many of the basics of the SQL commands. In this section, we look at the SQL
language in particular and give an overview of SQL syntax.
Since there are so many different database vendors, the Structured Query Language (SQL) was
standardized so we could communicate in a portable manner to database systems from multiple
vendors.

A relational database is made up of tables, rows, and columns. The columns generally have a type
such as text, numeric, or date data. When we create a table, we indicate the names and types of
the columns:

CREATE TABLE Track (title TEXT, plays INTEGER)


To insert a row into a table, we use the SQL INSERT command:
INSERT INTO Track (title, plays) VALUES ('My Way', 15)

The INSERT statement specifies the table name, then a list of the fields/columns that you would
like to set in the new row, and then the keyword VALUES and a list of corresponding values for each
of the fields.
The SQL SELECT command is used to retrieve rows and columns from a database. The SELECT
statement lets you specify which columns you would like to retrieve as well as a WHERE clause
to select which rows you would like to see. It also allows an optional ORDER BY clause to control
the sorting of the returned rows.

SELECT * FROM Track WHERE title = 'My Way'


Using * indicates that you want the database to return all of the columns for each row that matches
the WHERE clause.
Note that, unlike in Python, in an SQL WHERE clause we use a single equal sign to indicate a test for
equality rather than a double equal sign. Other logical operations allowed in a WHERE clause
include <, >, <=, >=, !=, as well as AND and OR and parentheses to build your logical expressions.
You can request that the returned rows be sorted by one of the fields as follows:

SELECT title,plays FROM Track ORDER BY title

It is possible to UPDATE a column or columns within one or more rows in a table using
the SQL UPDATE statement as follows:
UPDATE Track SET plays = 16 WHERE title = 'My Way'

The UPDATE statement specifies a table and then a list of fields and values to change after the SET
keyword and then an optional WHERE clause to select the rows that are to be updated. A single
UPDATE statement will change all of the rows that match the WHERE clause. If a WHERE clause
is not specified, it performs the UPDATE on all of the rows in the table.
To remove a row, you need a WHERE clause on an SQL DELETE statement. The
WHERE clause determines which rows are to be deleted:

DELETE FROM Track WHERE title = 'My Way'

These four basic SQL commands (INSERT, SELECT, UPDATE, and DELETE) allow
the four basic operations needed to create and maintain data. We use “CRUD” (Create, Read,
Update, and Delete) to capture all these concepts in a single term.
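The four commands map directly onto Python calls. Here is a minimal sketch (an assumed example against an in-memory database, not from the book) that performs one of each:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

# Create, Update, Read, Delete
cur.execute("INSERT INTO Track (title, plays) VALUES ('My Way', 15)")
cur.execute("UPDATE Track SET plays = 16 WHERE title = 'My Way'")
cur.execute("SELECT title, plays FROM Track WHERE title = 'My Way'")
print(cur.fetchall())   # [('My Way', 16)]
cur.execute("DELETE FROM Track WHERE title = 'My Way'")
conn.close()
```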

15.6 Multiple tables and basic data modeling


The real power of a relational database is when we create multiple tables and make links between
those tables. The act of deciding how to break up your application data into multiple tables and
establishing the relationships between the tables is called data modeling. The design document
that shows the tables and their relationships is called a data model.
Data modeling is a relatively sophisticated skill and we will only introduce the most basic
concepts of relational data modeling in this section. For more detail on data modeling you can
start with:
http://en.wikipedia.org/wiki/Relational_model
Let’s say for our tracks database we wanted to track the name of the artist for each track in
addition to the title and number of plays. A simple approach might be to add another
column to the database called artist and put the name of the artist in that column
as follows:

DROP TABLE IF EXISTS Track;


CREATE TABLE Track (title TEXT, plays INTEGER, artist TEXT);

Then we could insert a few tracks into our table.

INSERT INTO Track (title, plays, artist)
    VALUES ('My Way', 15, 'Frank Sinatra');
INSERT INTO Track (title, plays, artist)
    VALUES ('New York', 25, 'Frank Sinatra');

If we were to look at our data with a SELECT * FROM Track statement, it looks like we have
done a fine job.

sqlite> SELECT * FROM Track;
My Way|15|Frank Sinatra
New York|25|Frank Sinatra
sqlite>

We have made a very bad error in our data modeling. We have violated the rules of database
normalization.
https://en.wikipedia.org/wiki/Database_normalization

While database normalization seems very complex on the surface and contains a lot of
mathematical justification, for now we can reduce it all to one simple rule that we will follow:
never put the same string data in a column more than once, especially when the multiple
entries refer to the same object. If we need the data more than once, we create a numeric key
for the data and reference the actual data using that key.
To demonstrate the slippery slope we are going down by adding string columns to our database
model, think about how we would change the data model if we wanted to keep track of the eye color
of our artists. Would we do this?

DROP TABLE IF EXISTS Track;
CREATE TABLE Track (title TEXT, plays INTEGER,
    artist TEXT, eyes TEXT);
INSERT INTO Track (title, plays, artist, eyes)
    VALUES ('My Way', 15, 'Frank Sinatra', 'Blue');
INSERT INTO Track (title, plays, artist, eyes)
    VALUES ('New York', 25, 'Frank Sinatra', 'Blue');

Since Frank Sinatra recorded over 1200 songs, are we really going to put the string ‘Blue’ in 1200
rows of our Track table? And what would happen if we decided his eye color was ‘Light Blue’?
Something just does not feel right.
The correct solution is to create a table for each artist and store all the data about the artist
in that table. And then we need to make a connection between a row in the Track table
and a row in the Artist table. We could call this “link” between two tables a
“relationship”, and that is exactly what database experts decided to call these
links.
Let’s make an Artist table as follows:

DROP TABLE IF EXISTS Artist;
CREATE TABLE Artist (name TEXT, eyes TEXT);
INSERT INTO Artist (name, eyes)
    VALUES ('Frank Sinatra', 'blue');

Now we have two tables, but we need a way to link rows in the two tables. To do this, we need what
we call “keys”. These keys are just integer numbers that we can use to look up a row in a
different table. If we are going to make links to rows inside of a table, we need to add a primary
key to the rows in the table. By convention we usually name the primary key column ‘id’. So
our Artist table looks as follows:

DROP TABLE IF EXISTS Artist;
CREATE TABLE Artist (id INTEGER, name TEXT, eyes TEXT);
INSERT INTO Artist (id, name, eyes)
    VALUES (42, 'Frank Sinatra', 'blue');

Now we have a row in the table for ‘Frank Sinatra’ (and his eye color) and a primary key of
‘42’ to use to link our tracks to him. So we alter our Track table as follows:

DROP TABLE IF EXISTS Track;
CREATE TABLE Track (title TEXT, plays INTEGER,
    artist_id INTEGER);
INSERT INTO Track (title, plays, artist_id)
    VALUES ('My Way', 15, 42);
INSERT INTO Track (title, plays, artist_id)
    VALUES ('New York', 25, 42);

The artist_id column is an integer, and by naming convention is a foreign key pointing at
a primary key in the Artist table. We call it a foreign key because it is pointing to a row in
a different table.
Now we are following the rules of database normalization, but when we want to get data out of our
database, we don’t want to see the 42, we want to see the name and eye color of the artist. To do
this we use the JOIN keyword in our SELECT statement.

SELECT title, plays, name, eyes
    FROM Track JOIN Artist
    ON Track.artist_id = Artist.id;

The JOIN clause includes an ON condition that defines how the rows are to be connected:
for each row in Track, add the data from the Artist row where the artist_id in the Track table
matches the id in the Artist table.
The output would be:

My Way|15|Frank Sinatra|blue
New York|25|Frank Sinatra|blue

While it might seem a little clunky and your instincts might tell you that it would be faster just
to keep the data in one table, it turns out that the limit on database performance is how much data
needs to be scanned when retrieving a query. While the details are very complex, integers are a lot
smaller than strings (especially Unicode) and far quicker to move and compare.
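The same JOIN can be run from Python. A sketch (assuming the Track and Artist schema above, against an in-memory database):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER, name TEXT, eyes TEXT)')
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER, artist_id INTEGER)')
cur.execute("INSERT INTO Artist VALUES (42, 'Frank Sinatra', 'blue')")
cur.execute("INSERT INTO Track VALUES ('My Way', 15, 42)")
cur.execute("INSERT INTO Track VALUES ('New York', 25, 42)")

# Follow the foreign key from each Track row to its Artist row
cur.execute('''SELECT title, plays, name, eyes
               FROM Track JOIN Artist
               ON Track.artist_id = Artist.id
               ORDER BY Track.plays''')
for row in cur:
    print(row)
# ('My Way', 15, 'Frank Sinatra', 'blue')
# ('New York', 25, 'Frank Sinatra', 'blue')
conn.close()
```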

15.7 Data model diagrams


While our Track and Artist database design is simple with just two tables and a single
one-to-many relationship, these data models can get complicated quickly and are easier to
understand if we can make a graphical representation of our data model.
While there are many graphical representations of data models, we will use one of the
“classic” approaches, called “Crow’s Foot Diagrams” as shown in Figure 15.4. Each table is
shown as a box with the name of the table and its columns. Then where there is a relationship
between two tables a line is drawn connecting the tables with a notation added to the end of
each line indicating the nature of the relationship.

Figure 15.4: A Verbose One-to-Many Data Model (the Track table with title and artist_id on the “many” end, the Artist table with id and name on the “one” end)

Figure 15.5: A Succinct One-to-Many Data Model (Track: title; Artist: name)

https://en.wikipedia.org/wiki/Entity-relationship_model
In this case, “many” tracks can be associated with each artist. The track end is shown with
the crow’s foot spread out, indicating it is the “many” end. The artist end is shown with a
vertical line that indicates “one”. There will be “many” artists in general, but the important
aspect is that for each artist there will be many tracks.
You will note that the column that holds the foreign key (like artist_id) is on the “many” end
and the primary key is at the “one” end.
Since the pattern of foreign and primary key placement is so consistent and follows the “many”
and “one” ends of the lines, we never include either the primary or foreign key columns in our
diagram of the data model, as shown in Figure 15.5. The columns are thought of as
“implementation detail” that captures the nature of the relationship and is not an essential
part of the data being modeled.

15.8 Automatically creating primary keys


In the above example, we arbitrarily assigned Frank the primary key of 42. However, when we are
inserting millions of rows, it is nice to have the database automatically generate the values for the
id column. We do this by declaring the id column as a PRIMARY KEY and leaving out the
id value when inserting the row:

DROP TABLE IF EXISTS Artist;
CREATE TABLE Artist (id INTEGER PRIMARY KEY,
    name TEXT, eyes TEXT);
INSERT INTO Artist (name, eyes)
    VALUES ('Frank Sinatra', 'blue');

Now we have instructed the database to auto-assign a unique value to the Frank Sinatra row.
But we then need a way to have the database tell us the id value for the recently inserted row.
One way is to use a SELECT statement to retrieve data from an SQLite built-in function called
last_insert_rowid().
sqlite> DROP TABLE IF EXISTS Artist;
sqlite> CREATE TABLE Artist (id INTEGER PRIMARY KEY,
   ...> name TEXT, eyes TEXT);
sqlite> INSERT INTO Artist (name, eyes)
   ...> VALUES ('Frank Sinatra', 'blue');
sqlite> SELECT last_insert_rowid();
1
sqlite> SELECT * FROM Artist;
1|Frank Sinatra|blue
sqlite>

Once we know the id of our ‘Frank Sinatra’ row, we can use it when we INSERT the tracks
into the Track table. As a general strategy, we add these id columns to any table we create:

sqlite> DROP TABLE IF EXISTS Track;
sqlite> CREATE TABLE Track (id INTEGER PRIMARY KEY,
   ...> title TEXT, plays INTEGER, artist_id INTEGER);

Note that the artist_id value is the auto-assigned id of the new row in the Artist table, and that while
we added an INTEGER PRIMARY KEY to the Track table, we did not include id in the
list of fields on the INSERT statements into the Track table. Again this tells the database to
choose a unique value for us for the id column.

sqlite> INSERT INTO Track (title, plays, artist_id)
   ...> VALUES ('My Way', 15, 1);
sqlite> SELECT last_insert_rowid();
1
sqlite> INSERT INTO Track (title, plays, artist_id)
   ...> VALUES ('New York', 25, 1);
sqlite> SELECT last_insert_rowid();
2
sqlite>

You can call SELECT last_insert_rowid(); after each of the inserts to retrieve the value
that the database assigned to the id of each newly created row. Later when we are coding in
Python, we can ask for the id value in our code and store it in a variable for later use.
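In Python there is an even more direct route: the cursor’s lastrowid attribute holds the auto-assigned primary key of the most recent INSERT. A sketch (an assumed example against an in-memory database):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Artist
               (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)''')

cur.execute("INSERT INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
artist_id = cur.lastrowid   # the id the database just assigned
print(artist_id)            # 1 for the first row in a fresh table
conn.close()
```

Storing lastrowid in a variable right after the INSERT saves a round trip compared to a separate SELECT last_insert_rowid() query.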

15.9 Logical keys for fast lookup


Say we have a table full of tracks, each with a foreign key link to a row in a table full of
artists, and we want to list all the tracks that were sung by ‘Frank Sinatra’ as
follows:

SELECT title, plays, name, eyes
    FROM Track JOIN Artist
    ON Track.artist_id = Artist.id
    WHERE Artist.name = 'Frank Sinatra';

Since we have two tables and a foreign key between the two tables, our data is well-modeled,
but if we are going to have millions of records in the Artist table and going to do a lot of
lookups by artist name, we would benefit if we gave the database a hint about our intended use
of the name column.
We do this by adding an “index” to a text column that we intend to use in WHERE
clauses:

CREATE INDEX artist_name ON Artist(name);


When the database has been told that an index is needed on a column in a table, it stores extra
information to make it possible to look up a row more quickly using the indexed field (name in
this example). Once you request that an index be created, there is nothing special that is needed
in the SQL to access the table. The database keeps the index up to date as data is inserted,
deleted, and updated, and uses it automatically if it will increase the performance of a
database query.
These text columns that are used to find rows based on some information in the “real world” like
the name of an artist are called Logical keys.

15.10 Adding constraints to the database


We can also use an index to enforce a constraint (i.e., a rule) on our database operations. The
most common constraint is a uniqueness constraint, which insists that all of the values in a
column are unique. We can add the optional UNIQUE keyword to the CREATE INDEX
statement to tell the database that we would like it to enforce the constraint. We can
drop and re-create the artist_name index with a UNIQUE constraint as follows.

DROP INDEX artist_name;
CREATE UNIQUE INDEX artist_name ON Artist(name);

If we try to insert ‘Frank Sinatra’ a second time, it will fail with an error.

sqlite> SELECT * FROM Artist;
1|Frank Sinatra|blue
sqlite> INSERT INTO Artist (name, eyes)
   ...> VALUES ('Frank Sinatra', 'blue');
Runtime error: UNIQUE constraint failed: Artist.name (19)
sqlite>
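From Python, the same failed INSERT raises sqlite3.IntegrityError, which we can catch and handle. A sketch (an assumed example against an in-memory database):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Artist
               (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)''')
cur.execute('CREATE UNIQUE INDEX artist_name ON Artist(name)')

cur.execute("INSERT INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
try:
    # This second INSERT violates the UNIQUE constraint on name
    cur.execute("INSERT INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
except sqlite3.IntegrityError as e:
    print('Duplicate rejected:', e)
conn.close()
```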

We can tell the database to ignore any duplicate key errors by adding the IGNORE
keyword to the INSERT statement as follows:

sqlite> INSERT OR IGNORE INTO Artist (name, eyes)


...> VALUES ('Frank Sinatra', 'blue');
sqlite> SELECT id FROM Artist WHERE name='Frank Sinatra';
1
sqlite>

Figure 15.6: Tracks, Albums, and Artists (a Track, with its title, len, rating, and count, belongs to an Album, and an Album belongs to an Artist)


By combining an INSERT OR IGNORE and a SELECT we can insert a new record if the
name is not already there and, whether or not the record was already there, retrieve the primary key
of the record.
sqlite> INSERT OR IGNORE INTO Artist (name, eyes)
...> VALUES ('Elvis', 'blue');
sqlite> SELECT id FROM Artist WHERE name='Elvis';
2
sqlite> SELECT * FROM Artist;
1|Frank Sinatra|blue
2|Elvis|blue
sqlite>

Since we have not added a uniqueness constraint to the eyes column, there is no problem
having multiple ‘blue’ values in that column.

15.11 Sample multi-table application


A sample application called tracks_csv.py shows how these ideas can be combined to parse textual
data and load it into several tables using a proper data model with relational connections between
the tables.
This application reads and parses a comma-separated file tracks.csv based on an export from
Dr. Chuck’s iTunes library.

Another One Bites The Dust,Queen,Greatest Hits,55,100,217103
Asche Zu Asche,Rammstein,Herzeleid,79,100,231810
Beauty School Dropout,Various,Grease,48,100,239960
Black Dog,Led Zeppelin,IV,109,100,296620
...

The columns in this file are: title, artist, album, number of plays, rating (0-100) and length in
milliseconds.
Our data model is shown in Figure 15.6 and described in SQL as follows:

DROP TABLE IF EXISTS Artist;
DROP TABLE IF EXISTS Album;
DROP TABLE IF EXISTS Track;

CREATE TABLE Artist (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE Album (
    id INTEGER PRIMARY KEY,
    artist_id INTEGER,
    title TEXT UNIQUE
);

CREATE TABLE Track (
    id INTEGER PRIMARY KEY,
    title TEXT UNIQUE,
    album_id INTEGER,
    len INTEGER, rating INTEGER, count INTEGER
);

We are adding the UNIQUE keyword to the TEXT columns on which we would like a uniqueness
constraint, which we will then use with INSERT OR IGNORE statements. This is more succinct than
separate CREATE INDEX statements but has the same effect.
With these tables in place, we write the following code tracks_csv.py to parse the data
and insert it into the tables:
import sqlite3

conn = sqlite3.connect('trackdb.sqlite')
cur = conn.cursor()

handle = open('tracks.csv')

for line in handle:
    line = line.strip()
    pieces = line.split(',')
    if len(pieces) != 6: continue

    name = pieces[0]
    artist = pieces[1]
    album = pieces[2]
    count = pieces[3]
    rating = pieces[4]
    length = pieces[5]

    print(name, artist, album, count, rating, length)

    cur.execute('''INSERT OR IGNORE INTO Artist (name)
        VALUES ( ? )''', (artist, ))
    cur.execute('SELECT id FROM Artist WHERE name = ? ', (artist, ))
    artist_id = cur.fetchone()[0]

    cur.execute('''INSERT OR IGNORE INTO Album (title, artist_id)
        VALUES ( ?, ? )''', (album, artist_id))
    cur.execute('SELECT id FROM Album WHERE title = ? ', (album, ))
    album_id = cur.fetchone()[0]

    cur.execute('''INSERT OR REPLACE INTO Track
        (title, album_id, len, rating, count)
        VALUES ( ?, ?, ?, ?, ? )''',
        (name, album_id, length, rating, count))

conn.commit()

You can see that we are repeating the pattern of INSERT OR IGNORE followed by a SELECT
to get the appropriate artist_id and album_id for use in later INSERT statements. We start
from Artist because we need artist_id to insert the Album and need the album_id to insert the
Track.
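The INSERT OR IGNORE followed by SELECT pattern can be factored into a small helper function. This is a sketch of the idea, not part of the original program; the name get_or_create_id is our own. Note that the table and column names are pasted into the SQL string directly, because ? placeholders only work for values, so this style is safe only with trusted, hard-coded identifiers:

```python
import sqlite3

def get_or_create_id(cur, table, column, value):
    # Insert the value if it is new (the UNIQUE constraint makes the
    # duplicate case a no-op), then look up its primary key either way.
    cur.execute('INSERT OR IGNORE INTO %s (%s) VALUES ( ? )' % (table, column),
                (value,))
    cur.execute('SELECT id FROM %s WHERE %s = ?' % (table, column), (value,))
    return cur.fetchone()[0]

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

first = get_or_create_id(cur, 'Artist', 'name', 'Queen')
second = get_or_create_id(cur, 'Artist', 'name', 'Queen')
print(first, second)  # the same id both times
```

Calling the helper twice with the same value returns the same primary key, which is exactly why the pattern works whether or not the row already exists.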
If we look at the Album table, we can see that the entries were added and assigned a primary key
as necessary as the data was parsed. We can also see the foreign key pointing to a row in the
Artist table for each Album row.

sqlite> .mode column
sqlite> SELECT * FROM Album LIMIT 5;
id  artist_id  title
1   1          Greatest Hits
2   2          Herzeleid
3   3          Grease
4   4          IV
5   5          The Wall [Disc 2]

We can reconstruct all of the Track data, following all the relations using JOIN / ON clauses.
You can see both ends of each of the two relational connections in every row of the output below:

sqlite> .mode line
sqlite> SELECT * FROM Track
...> JOIN Album ON Track.album_id = Album.id
...> JOIN Artist ON Album.artist_id = Artist.id
...> LIMIT 2;
id = 1
title = Another One Bites The Dust
album_id = 1
len = 217103
rating = 100
count = 55
id = 1
artist_id = 1
title = Greatest Hits
id = 1
name = Queen

Figure 15.7: A Many to Many Relationship [diagram: a Course table (title) and a User
table (name) connected directly, with the “many” indicator on both ends]

id = 2
title = Asche Zu Asche
album_id = 2
len = 231810
rating = 100
count = 79
id = 2
artist_id = 2
title = Herzeleid
id = 2
name = Rammstein

This example shows three tables and two one-to-many relationships between the tables. It
also shows how to use indexes and uniqueness constraints to programmatically construct
the tables and their relationships.
https://en.wikipedia.org/wiki/One-to-many_(data_model)
Up next we will look at the many-to-many relationships in data models.

15.12 Many to many relationships in databases

Some data relationships cannot be modeled by a simple one-to-many relationship. For
example, let's say we are going to build a data model for a course management system. There
will be courses, users, and rosters. A user can be on the roster for many courses and a course
will have many users on its roster.
It is pretty simple to draw a many-to-many relationship as shown in Figure 15.7. We simply
draw two tables and connect them with a line that has the “many” indicator on both ends of
the line. The problem is how to implement the relationship using primary keys and foreign
keys.
Before we explore how to implement many-to-many relationships, let's see if we could hack
something up by extending a one-to-many relationship.
If SQL supported the notion of arrays, we might try to define this:

CREATE TABLE Course (
    id          INTEGER PRIMARY KEY,
    title       TEXT UNIQUE,
    student_ids ARRAY OF INTEGER
);

Figure 15.8: A Many to Many Connector Table [diagram: a Member table (course_id,
user_id) sitting between Course (title) and User (name), with a one-to-many
relationship to each]

Sadly, while this is a tempting idea, SQL does not support arrays.
Or we could just concatenate all the User primary keys into one long string
separated by commas.

CREATE TABLE Course (
    id          INTEGER PRIMARY KEY,
    title       TEXT UNIQUE,
    student_ids TEXT
);

INSERT INTO Course (title, student_ids)
    VALUES ( 'si311', '1,3,4,5,6,9,14' );

This would be very inefficient because as the course roster grows in size and the number of courses
increases it becomes quite expensive to figure out which courses have student 14 on their roster.
Instead of either of these approaches, we model a many-to-many relationship using an additional
table that we call a “junction table”, “through table”, “connector table”, or “join table” as shown
in Figure 15.8. The purpose of this table is to capture the connection between a course and a
student.
In a sense the table sits between the Course and User tables and has a one-to-many
relationship to both tables. By using an intermediate table we break a many-to-many
relationship into two one-to-many relationships. Databases are very good at modeling
and processing one-to-many relationships.
An example Member table would be as follows:

CREATE TABLE User (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE Course (
    id    INTEGER PRIMARY KEY,
    title TEXT UNIQUE
);

CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    PRIMARY KEY (user_id, course_id)
);

Following our naming convention, Member.user_id and Member.course_id are foreign keys
pointing at the corresponding rows in the User and Course tables. Each entry in the member
table links a row in the User table to a row in the Course table by going through the Member
table.
We indicate that the combination of user_id and course_id is the PRIMARY KEY for the
Member table, which also creates a uniqueness constraint for each user_id / course_id combination.
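A quick sketch of the composite key in action (an in-memory database, separate from rosterdb.sqlite): inserting the same user_id / course_id pair twice leaves only one row, while a different pair adds a second row.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    PRIMARY KEY (user_id, course_id) )''')

cur.execute('INSERT OR REPLACE INTO Member VALUES (1, 1)')
cur.execute('INSERT OR REPLACE INTO Member VALUES (1, 1)')  # duplicate pair
cur.execute('INSERT OR REPLACE INTO Member VALUES (1, 2)')  # new pair

cur.execute('SELECT COUNT(*) FROM Member')
count = cur.fetchone()[0]
print(count)  # 2
```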
Now let's say we need to insert a number of students into the rosters of a number of courses. Let's
assume the data comes to us in a JSON-formatted file with records like this:

[
[ "Charley", "si110"],
[ "Mea", "si110"],
[ "Hattie", "si110"],
[ "Keziah", "si110"],
[ "Rosa", "si106"],
[ "Mea", "si106"],
[ "Mairin", "si106"],
[ "Zendel", "si106"],
[ "Honie", "si106"],
[ "Rosa", "si106"],
...
]

We could write the following code to read the JSON file and insert the members of each
course roster into the database:

import json
import sqlite3

conn = sqlite3.connect('rosterdb.sqlite')
cur = conn.cursor()

str_data = open('roster_data_sample.json').read()
json_data = json.loads(str_data)

for entry in json_data:
    name = entry[0]
    title = entry[1]

    print((name, title))

    cur.execute('''INSERT OR IGNORE INTO User (name)
        VALUES ( ? )''', ( name, ) )
    cur.execute('SELECT id FROM User WHERE name = ? ', (name, ))
    user_id = cur.fetchone()[0]

    cur.execute('''INSERT OR IGNORE INTO Course (title)
        VALUES ( ? )''', ( title, ) )
    cur.execute('SELECT id FROM Course WHERE title = ? ', (title, ))
    course_id = cur.fetchone()[0]

    cur.execute('''INSERT OR REPLACE INTO Member (user_id, course_id)
        VALUES ( ?, ? )''', ( user_id, course_id ) )

conn.commit()

As in the previous example, we first make sure that we have an entry in the User table and
know its primary key, and an entry in the Course table and know its primary key. We use the
INSERT OR IGNORE and SELECT pattern so our code works regardless of whether the record is
already in the table.
Our insert into the Member table simply inserts the two integers as a new or existing row,
relying on the uniqueness constraint to make sure we do not end up with duplicate entries
in the Member table for a particular user_id / course_id combination.
To reconstruct our data across all three tables, we again use JOIN / ON to construct a SELECT
query:

sqlite> SELECT * FROM Course
   ...> JOIN Member ON Course.id = Member.course_id
   ...> JOIN User ON Member.user_id = User.id;
+----+-------+---------+-----------+----+---------+
| id | title | user_id | course_id | id | name |
+----+-------+---------+-----------+----+---------+
| 1 | si110 | 1 | 1 | 1 | Charley |
| 1 | si110 | 2 | 1 | 2 | Mea |
| 1 | si110 | 3 | 1 | 3 | Hattie |
| 1 | si110 | 4 | 1 | 4 | Lyena |
| 1 | si110 | 5 | 1 | 5 | Keziah |
| 1 | si110 | 6 | 1 | 6 | Ellyce |
| 1 | si110 | 7 | 1 | 7 | Thalia |
| 1 | si110 | 8 | 1 | 8 | Meabh |
| 2 | si106 | 2 | 2 | 2 | Mea |
| 2 | si106 | 10 | 2 | 10 | Mairin |
| 2 | si106 | 11 | 2 | 11 | Zendel |
| 2 | si106 | 12 | 2 | 12 | Honie |
| 2 | si106 | 9 | 2 | 9 | Rosa |
+----+-------+---------+-----------+----+---------+
sqlite>

You can see the three tables from left to right - Course, Member, and User and you can see the
connections between the primary keys and foreign keys in each row of output.
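The same three-way JOIN can of course be issued from Python rather than the sqlite> shell. A minimal self-contained sketch (an in-memory database with one user, one course, and one Member row standing in for the full roster data):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE User   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member (user_id INTEGER, course_id INTEGER,
                     PRIMARY KEY (user_id, course_id));
INSERT INTO User (name) VALUES ('Charley');
INSERT INTO Course (title) VALUES ('si110');
INSERT INTO Member (user_id, course_id) VALUES (1, 1);
''')

# Walk both foreign keys in the Member table to pair titles with names.
cur.execute('''SELECT Course.title, User.name FROM Course
               JOIN Member ON Course.id = Member.course_id
               JOIN User ON Member.user_id = User.id''')
rows = cur.fetchall()
print(rows)  # [('si110', 'Charley')]
```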


15.13 Modeling data at the many-to-many connection
While we have presented the “join table” as having two foreign keys making a connection
between rows in two tables, this is the simplest form of a join table. It is quite common to want
to add some data to the connection itself.
Continuing with our example of users, courses, and rosters to model a simple learning
management system, we will also need to understand the role that each user is assigned in each
course.
If we first try to solve this by adding an “instructor” flag to the User table, we will find that
this does not work because a user can be an instructor in one course and a student in another
course. If we add an instructor_id column to the Course table, it will not work because a course
can have multiple instructors. And no one-to-many hack can deal with the fact that the set of
roles will expand to include roles like Teaching Assistant or Parent.
But if we simply add a role column to the Member table, we can represent a wide range of
roles and role combinations.
Let's change our Member table as follows:

DROP TABLE Member;

CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    role      INTEGER,
    PRIMARY KEY (user_id, course_id)
);

For simplicity, we will decide that zero in the role column means “student” and one means
“instructor”. Let's assume our JSON data is augmented with the role as follows:

[
[ "Charley", "si110", 1],
[ "Mea", "si110", 0],
[ "Hattie", "si110", 0],
[ "Keziah", "si110", 0],
[ "Rosa", "si106", 0],
[ "Mea", "si106", 1],
[ "Mairin", "si106", 0],
[ "Zendel", "si106", 0],
[ "Honie", "si106", 0],
[ "Rosa", "si106", 0],
...
]

We could alter the roster.py program above to incorporate role as follows:


for entry in json_data:
    name = entry[0]
    title = entry[1]
    role = entry[2]

    ...

    cur.execute('''INSERT OR REPLACE INTO Member
        (user_id, course_id, role) VALUES ( ?, ?, ? )''',
        ( user_id, course_id, role ) )

In a real system, we would probably build a Role table and make the role column in
Member a foreign key into the Role table as follows:

DROP TABLE Member;

CREATE TABLE Member (
    user_id   INTEGER,
    course_id INTEGER,
    role_id   INTEGER,
    PRIMARY KEY (user_id, course_id, role_id)
);

CREATE TABLE Role (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);

INSERT INTO Role (id, name) VALUES (0, 'Student');
INSERT INTO Role (id, name) VALUES (1, 'Instructor');

Notice that because we declared the id column in the Role table as a PRIMARY KEY, we could omit
it in the INSERT statement. But we can also choose the id value ourselves, as long as the value is
not already in the id column and does not violate the implied UNIQUE constraint on primary keys.
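The following sketch illustrates the point with an in-memory database: an explicit id can be supplied, and when it is omitted SQLite assigns the next unused value for an INTEGER PRIMARY KEY column.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Role (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

# Choose the id explicitly ...
cur.execute("INSERT INTO Role (id, name) VALUES (0, 'Student')")
# ... or omit it and let SQLite assign the next available value.
cur.execute("INSERT INTO Role (name) VALUES ('Instructor')")

cur.execute('SELECT id, name FROM Role ORDER BY id')
roles = cur.fetchall()
print(roles)  # [(0, 'Student'), (1, 'Instructor')]
```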
