Python Module 5 Notes
Module 5
12.1 Hypertext Transfer Protocol - HTTP
The network protocol that powers the web is actually quite simple, and Python has built-in support for it in the socket module, which makes it easy to make network connections and retrieve data over those sockets in a Python program.
A socket is much like a file, except that a single socket provides a two-way connection between
two programs. You can both read from and write to the same socket. If you write something to a
socket, it is sent to the application at the other end of the socket. If you read from the socket,
you are given the data which the other application has sent.
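For example, here is a minimal sketch of two-way communication using the standard library's socket.socketpair(), which creates both ends of a connected socket inside one program so no network is needed:

```python
import socket

# socket.socketpair() returns two connected sockets; data written to one
# end becomes readable at the other end, in both directions.
a, b = socket.socketpair()

a.sendall(b'Hello from A')        # write to one end of the socket...
msg_at_b = b.recv(1024)           # ...and read it at the other end
print(msg_at_b.decode())

b.sendall(b'Hello from B')        # the connection works the same way in reverse
msg_at_a = a.recv(1024)
print(msg_at_a.decode())

a.close()
b.close()
```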
But if you try to read from a socket when the program on the other end of the socket has not sent any data, you just sit and wait. If the programs on both ends of the socket simply wait for some data without sending anything, they will wait for a very long time, so an important part of programs that communicate over the Internet is to have some sort of protocol.
A protocol is a set of precise rules that determine who is to go first, what they are to do, and then
what the responses are to that message, and who sends next, and so on. In a sense the two
applications at either end of the socket are doing a dance and making sure not to step on each
other’s toes.
There are many documents that describe these network protocols. The Hypertext Transfer Protocol
is described in the following document:
https://www.w3.org/Protocols/rfc2616/rfc2616.txt
This is a long and complex 176-page document with a lot of detail. If you find it
interesting, feel free to read it all. But if you take a look around page 36 of RFC2616
you will find the syntax for the GET request. To request a document from a web
server, we make a connection, e.g. to the data.pr4e.org server on port 80, and then
send a line of the form
GET http://data.pr4e.org/romeo.txt HTTP/1.0
where the second parameter is the web page we are requesting, and then we also send a
blank line. The web server will respond with some header information about the
document and a blank line followed by the document content.
Python Programming(21EC643) C Prathibha, Asst.Prof.,EC Dept,KIT, Tiptur

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
First the program makes a connection to port 80 on the server data.pr4e.org. Since
our program is playing the role of the “web browser”, the HTTP protocol says we
must send the GET command followed by a blank line. \r\n signifies an EOL
(end of line), so \r\n\r\n signifies nothing between two EOL sequences. That is the
equivalent of a blank line.
Once we send that blank line, we write a loop that receives data in 512-character chunks
from the socket and prints the data out until there is no more data to read (i.e., the recv()
returns an empty string).
The program produces the following output:
HTTP/1.1 200 OK
...

[Figure: our program calls connect, send, and recv on a socket connected to the web server at www.py4e.com on port 80.]
The output starts with headers which the web server sends to describe the document. For example,
the Content-Type header indicates that the document is a plain text document (text/plain).
After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the document.

Next we can write a program that retrieves an image across HTTP, accumulating the data as it arrives:
import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()
# Code: https://www.py4e.com/code3/urljpeg.py
You can see that for this URL, the Content-Type header indicates that the body of the document is an
image (image/jpeg). Once the program completes, you can view the image data by opening the
file stuff.jpg in an image viewer.
As the program runs, you can see that we don’t get 5120 characters each time we call the
recv() method. We get as many characters as have been transferred across the network to us by
the web server at the moment we call recv(). In this example, we may get as few as 3200
characters each time we request up to 5120 characters of data.
Your results may be different depending on your network speed. Also note that on the last call to
recv() we get 3167 bytes, which is the end of the stream, and in the next call to recv() we get
a zero-length string that tells us that the server has called close() on its end of the socket and
there is no more data forthcoming.
We can slow down our successive recv() calls by uncommenting the call to time.sleep(). This
way, we wait a quarter of a second after each call so that the server can “get ahead” of us and
send more data to us before we call recv() again. With the delay in place, the program executes
as follows:
5120 15360
...
5120 225280
5120 230400
207 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 21:42:08 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg
Now, other than the first and last calls to recv(), we get 5120 characters each time
we ask for new data.
There is a buffer between the server making send() requests and our application
making recv() requests. When we run the program with the delay in place, at some
point the server might fill up the buffer in the socket and be forced to pause until our
program starts to empty the buffer. The pausing of either the sending application or
the receiving application is called “flow control.”
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
# Code: https://www.py4e.com/code3/urllib1.py
Once the web page has been opened with urllib.request.urlopen, we can treat it
like a file and read through it using a for loop.
When the program runs, we only see the output of the contents of the file. The headers
are still sent, but the urllib code consumes the headers and only returns the data to
us.
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
As an example, we can write a program to retrieve the data for romeo.txt and compute the
frequency of each word in the file as follows:
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
# Code: https://www.py4e.com/code3/urlwords.py
Again, once we have opened the web page, we can read it like a local file.
# Code: https://www.py4e.com/code3/curl1.py
This program reads all of the data in at once across the network and stores it in the variable img
in the main memory of your computer, then opens the file cover.jpg and writes the data out to
your disk. The wb argument for open() opens a binary file for writing only. This program will
work if the size of the file is less than the size of the memory of your computer.
However if this is a large audio or video file, this program may crash or at least run extremely
slowly when your computer runs out of memory. In order to avoid running out of memory, we
retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the
next block. This way the program can read any size file without using up all of the memory you
have in your computer.
# Code: https://www.py4e.com/code3/curl2.py
In this example, we read only 100,000 characters at a time and then write those
characters to the cover3.jpg file before retrieving the next 100,000 characters of data
from the web.
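The block-based pattern can be sketched as follows. Here an io.BytesIO object stands in for the network stream (the object returned by urllib.request.urlopen() supports .read(n) in the same way), so this sketch runs without a network connection:

```python
import io

# Stand-in for the response object returned by urllib.request.urlopen();
# a real response object supports .read(n) in exactly the same way.
img = io.BytesIO(b'\xff' * 230210)   # pretend this is the image data
fhand = io.BytesIO()                 # stand-in for open('cover3.jpg', 'wb')

size = 0
while True:
    info = img.read(100000)          # read at most 100,000 characters per call
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)                # write this block before reading the next

print(size, 'characters copied.')    # prints: 230210 characters copied.
```

Because each block is written out before the next is read, memory use stays bounded by the block size no matter how large the file is.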
This program runs as follows:
python curl2.py
230210 characters copied.
Our regular expression looks for strings that start with “href="http://” or “href="https://”,
followed by one or more characters (.+?), followed by another double quote. The question
mark behind the [s]? indicates to search for the string “http” followed by zero or one “s”.
The question mark added to the .+? indicates that the match is to be done in a “non-greedy”
fashion instead of a “greedy” fashion. A non-greedy match tries to find the smallest possible
matching string and a greedy match tries to find the largest possible matching string.
We add parentheses to our regular expression to indicate which part of our matched string we would
like to extract, and produce the following program:
# Code: https://www.py4e.com/code3/urlregex.py
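A minimal sketch of the matching itself is shown below; the HTML fragment is made up for illustration, while the full program reads the page with urllib (and ssl) as described:

```python
import re

# A made-up HTML fragment standing in for a downloaded page
html = b'<a href="http://www.example.com/page1">One</a> <a href="https://www.example.com/page2">Two</a>'

# Non-greedy match: capture everything between href="http(s):// and the next quote
links = re.findall(b'href="(http[s]?://.+?)"', html)
for link in links:
    print(link.decode())
```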
The ssl library allows this program to access web sites that strictly enforce HTTPS. The read
method returns HTML source code as a bytes object instead of returning an HTTPResponse
object. The findall regular expression method will give us a list of all of the strings that
match our regular expression, returning only the link text between the double quotes.
When we run the program and input a URL, we get the following output:
Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/
Regular expressions work very nicely when your HTML is well formatted and
predictable. But since there are a lot of “broken” HTML pages out there, a solution only
using regular expressions might either miss some valid links or end up with bad
data.
# Code: https://www.py4e.com/code3/urllinks.py
The program prompts for a web address, then opens the web page, reads the data and passes the
data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the
href attribute for each tag.
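A minimal sketch of that flow, using an inline HTML string in place of the downloaded page (BeautifulSoup is a third-party package installed separately with pip; the page contents here are made up):

```python
from bs4 import BeautifulSoup

# A small made-up page standing in for data read with urllib
html = '''<html><body>
<a href="genindex.html">Index</a>
<a href="https://www.python.org/">Python</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags and print the href attribute of each
tags = soup('a')
hrefs = [tag.get('href', None) for tag in tags]
for href in hrefs:
    print(href)
```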
When the program runs, it produces the following output:
Enter - https://docs.python.org
genindex.html
py-modindex.html
https://www.python.org/
#
whatsnew/3.6.html
whatsnew/index.html
tutorial/index.html
library/index.html
reference/index.html
using/index.html
howto/index.html
installing/index.html
distributing/index.html
extending/index.html
c-api/index.html
faq/index.html
py-modindex.html
genindex.html
glossary.html
search.html
contents.html
bugs.html
about.html
license.html
copyright.html
download.html
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
genindex.html
py-modindex.html
https://www.python.org/
#
copyright.html
https://www.python.org/psf/donations/
bugs.html
http://sphinx.pocoo.org/
This list is much longer because some HTML anchor tags are relative paths (e.g.,
tutorial/index.html) or in-page references (e.g., ‘#’) that do not include “http://” or
“https://”, which was a requirement in our regular expression.
You can also use BeautifulSoup to pull out various parts of each tag:
# Code: https://www.py4e.com/code3/urllink2.py
python urllink2.py
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Content: ['\nSecond Page']
Attrs: [('href', 'http://www.dr-chuck.com/page2.htm')]
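A sketch of how those parts come out of each tag; the page is inlined here rather than downloaded, and note that current BeautifulSoup 4 versions report the attributes as a dictionary rather than the list of tuples shown above:

```python
from bs4 import BeautifulSoup

# The anchor tag from the output above, inlined as a string
html = '<a href="http://www.dr-chuck.com/page2.htm">\nSecond Page</a>'
soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')
for tag in tags:
    print('TAG:', tag)                       # the whole tag
    print('URL:', tag.get('href', None))     # one attribute, by name
    print('Content:', tag.contents)          # the text inside the tag
    print('Attrs:', tag.attrs)               # all attributes, as a dictionary
```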
html.parser is the HTML parser included in the standard Python 3 library. Information on
other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML.
$ curl -O http://www.py4e.com/cover.jpg
The command curl is short for “copy URL” and so the two examples listed earlier to retrieve
binary files with urllib are cleverly named curl1.py and curl2.py on www.py4e.com/code3
as they implement similar functionality to the curl command. There is also a curl3.py sample
program that does this task a little more effectively, in case you actually want to use this pattern
in a program you are writing.
A second command that functions very similarly is wget:
$ wget http://www.py4e.com/cover.jpg
Both of these commands make retrieving webpages and remote files a simple task.
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>
Each pair of opening (e.g., <person>) and closing tags (e.g., </person>) represents an element or node
with the same name as the tag (e.g., person). Each element can have some text, some attributes
(e.g., hide), and other nested elements. If an XML element is empty (i.e., has no content), then
it may be depicted by a self-closing tag (e.g., <email />).
Often it is helpful to think of an XML document as a tree structure where there is a top element
(here: person), and other tags (e.g., phone) are drawn as children of their parent elements.
import xml.etree.ElementTree as ET
data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''
tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))
# Code: https://www.py4e.com/code3/xml1.py
The triple single quote ('''), as well as the triple double quote ("""), allow for the
creation of strings that span multiple lines.
Calling fromstring converts the string representation of the XML into a “tree” of
XML elements. When the XML is in a tree, we have a series of methods we can
call to extract portions of data from the XML string. The find function searches
through the XML tree and retrieves the element that matches the specified tag.
Name: Chuck
Attr: yes
Using an XML parser such as ElementTree has the advantage that while the
XML in this example is quite simple, it turns out there are many rules regarding
valid XML, and using ElementTree allows us to extract data from XML without worrying
about the rules of XML syntax.
Often the XML has multiple nodes and we need to write a loop to process all of the nodes. In
the following program, we loop through all of the user nodes:
import xml.etree.ElementTree as ET
input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))
# Code: https://www.py4e.com/code3/xml2.py
The findall method retrieves a Python list of subtrees that represent the user structures in the
XML tree. Then we can write a for loop that looks at each of the user nodes, and prints the name
and id text elements as well as the x attribute from the user node.
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
It is important to include all parent level elements in the findall statement except for the top level
element (e.g., users/user). Otherwise, Python will not find any desired nodes.
import xml.etree.ElementTree as ET

input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)

lst = stuff.findall('users/user')
print('User count:', len(lst))

lst2 = stuff.findall('user')
print('User count:', len(lst2))
lst stores all user elements that are nested within their users parent. lst2 looks for
user elements that are direct children of the top-level stuff element, and there are
none.
User count: 2
User count: 0
{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}
You will notice some differences. First, in XML, we can add attributes like “intl” to the “phone”
tag. In JSON, we simply have key-value pairs. Also the XML “person” tag is gone, replaced
by a set of outer curly braces.
In general, JSON structures are simpler than XML because JSON has fewer capabilities than
XML. But JSON has the advantage that it maps directly to some combination of dictionaries and
lists. And since nearly all programming languages have something equivalent to Python’s
dictionaries and lists, JSON is a very natural format to have two cooperating programs
exchange data.
JSON is quickly becoming the format of choice for nearly all data exchange between applications
because of its relative simplicity compared to XML.
import json
data = ''' [
{ "id" : "001",
"x" : "2",
"name" : "Chuck"
} ,
{ "id" : "009",
"x" : "7",
"name" : "Brent"
}
]'''
info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])
# Code: https://www.py4e.com/code3/json2.py
If you compare the code to extract data from the parsed JSON and XML you will see that what
we get from json.loads() is a Python list which we traverse with a for loop, and each item
within that list is a Python dictionary. Once the JSON has been parsed, we can use the Python
index operator to extract the various bits of data for each user. We don’t have to use the JSON
library to dig through the parsed JSON, since the returned data is simply native Python
structures.
The output of this program is exactly the same as the XML version above.
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
In general, there is an industry trend away from XML and towards JSON for web services.
Because JSON is simpler and maps more directly to native data structures we already have in
programming languages, the parsing and data extraction code is usually simpler and more direct
when using JSON. But XML is more self-descriptive than JSON and so there are some
applications where XML retains an advantage. For example, most word processors store documents
internally using XML rather than JSON.
[Figure 13.2: Service-oriented architecture — a Travel Application drawing on multiple services, each accessed through its API.]
We see many examples of SOA when we use the web. We can go to a single web site and book
air travel, hotels, and automobiles all from a single site. The data for hotels is not stored on the
airline computers. Instead, the airline computers contact the services on the hotel computers and
retrieve the hotel data and present it to the user. When the user agrees to make a hotel reservation
using the airline site, the airline site uses another web service on the hotel systems to actually make
the reservation. And when it comes time to charge your credit card for the whole transaction, still
other computers become involved in the process.
A Service-oriented architecture has many advantages, including: (1) we always maintain only one
copy of data (this is particularly important for things like hotel reservations where we do not want
to over-commit) and (2) the owners of the data can set the rules about the use of their data. To
achieve these advantages, an SOA system must be carefully designed to have good performance and
meet the user’s needs.
When an application makes a set of services in its API available over the web, we call these web
services.
Other times, the vendor wants increased assurance of the source of the requests and so they
expect you to send cryptographically signed messages using shared keys and secrets. A
very common technology that is used to sign requests over the Internet is called OAuth.
You can read more about the OAuth protocol at www.oauth.net.
Thankfully there are a number of convenient and free OAuth libraries so you can avoid writing
an OAuth implementation from scratch by reading the specification. These libraries are of
varying complexity and have varying degrees of richness. The OAuth web site has
information about various OAuth libraries.
Database
15.1 What is a database?
A database is a file that is organized for storing data. Most databases are organized like a dictionary
in the sense that they map from keys to values. The biggest difference is that the database is on
disk (or other permanent storage), so it persists after the program ends. Because a database is stored
on permanent storage, it can store far more data than a dictionary, which is limited to the size of
the memory in the computer.
Like a dictionary, database software is designed to keep the inserting and accessing of data very fast,
even for large amounts of data. Database software maintains its performance by building indexes
as data is added to the database to allow the computer to jump quickly to a particular entry.
There are many different database systems which are used for a wide variety of purposes including:
Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite. We focus on SQLite in this
book because it is a very common database and is already built into Python. SQLite is designed
to be embedded into other applications to provide database support within the application. For
example, the Firefox browser uses the SQLite database internally, as do many other
products.
http://sqlite.org/
SQLite is well suited to some of the data manipulation problems that we see in Informatics.
[Table: database terminology — a Table is also called a Relation; a column is also called an attribute.]
The code to create a database file and a table named Track with two columns in the database
is as follows:
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Track')
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

conn.close()
# Code: https://www.py4e.com/code3/db1.py
The connect operation makes a “connection” to the database stored in the file music.sqlite in
the current directory. If the file does not exist, it will be created. The reason this is called a
“connection” is that sometimes the database is stored on a separate “database server” from the
server on which we are running our application. In our simple examples the database will just
be a local file in the same directory as the Python code we are running.
A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with text
files.
Once we have the cursor, we can begin to execute commands on the contents of the database
using the execute() method.
Database commands are expressed in a special language that has been standardized across many
different database vendors to allow us to learn a single database language. The database language
is called Structured Query Language or SQL for short.
http://en.wikipedia.org/wiki/SQL
In our example, we are executing two SQL commands in our database. As a convention,
we will show the SQL keywords in uppercase and the parts of the command that we are
adding (such as the table and column names) will be shown in lowercase.
The first SQL command removes the Track table from the database if it exists. This
pattern is simply to allow us to run the same program to create the Track table over and
over again without causing an error. Note that the DROP TABLE command deletes the table
and all of its contents from the database (i.e., there is no “undo”).
cur.execute('DROP TABLE IF EXISTS Track ')
The second command creates a table named Track with a text column named
title and an integer column named plays.
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')
Now that we have created a table named Track, we can put some data into that table using the SQL
INSERT operation. Again, we begin by making a connection to the database and obtaining the
cursor. We can then execute SQL commands using the cursor.
The SQL INSERT command indicates which table we are using and then defines a new row by
listing the fields we want to include (title, plays) followed by the VALUES we want placed in
the new row. We specify the values as question marks (?, ?) to indicate that the actual values are
passed in as a tuple ( 'My Way',15 ) as the second parameter to the execute() call.
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
            ('Thunderstruck', 20))
cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
            ('My Way', 15))
conn.commit()

print('Track:')
cur.execute('SELECT title, plays FROM Track')
for row in cur:
    print(row)

cur.execute('DELETE FROM Track WHERE plays < 100')
conn.commit()

cur.close()
# Code: https://www.py4e.com/code3/db2.py
Tracks
title plays
Thunderstruck 20
My Way 15
First we INSERT two rows into our table and use commit() to force the data to be written to the
database file.
Then we use the SELECT command to retrieve the rows we just inserted from the table. On the
SELECT command, we indicate which columns we would like (title, plays) and indicate
which table we want to retrieve the data from. After we execute the SELECT statement, the
cursor is something we can loop through in a for statement. For efficiency, the cursor does not
read all of the data from the database when we execute the SELECT statement. Instead, the data
is read on demand as we loop through the rows in the for statement.
The output of the program is as follows:
Track:
('Thunderstruck', 20)
('My Way', 15)
Our for loop finds two rows, and each row is a Python tuple with the first value as the title
and the second value as the number of plays.
At the very end of the program, we execute an SQL command to DELETE the rows we
have just created so we can run the program over and over. The DELETE command shows the use
of a WHERE clause that allows us to express a selection criterion so that we can ask the database
to apply the command to only the rows that match the criterion. In this example the criterion
happens to apply to all the rows so we empty the table out so we can run the program repeatedly.
After the DELETE is performed, we also call commit() to force the data to be removed from the
database.
A relational database is made up of tables, rows, and columns. The columns generally have a type
such as text, numeric, or date data. When we create a table, we indicate the names and types of
the columns:

CREATE TABLE Track (title TEXT, plays INTEGER)
The INSERT statement specifies the table name, then a list of the fields/columns that you would
like to set in the new row, and then the keyword VALUES and a list of corresponding values for each
of the fields.
The SQL SELECT command is used to retrieve rows and columns from a database. The SELECT
statement lets you specify which columns you would like to retrieve as well as a WHERE clause
to select which rows you would like to see. It also allows an optional ORDER BY clause to control
the sorting of the returned rows.
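Those clauses can be sketched with a runnable example using an in-memory database (the table and rows are borrowed from the Track examples above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Track (title, plays) VALUES ('Thunderstruck', 20)")
cur.execute("INSERT INTO Track (title, plays) VALUES ('My Way', 15)")

# WHERE picks the rows, ORDER BY controls the sort of the result
cur.execute('SELECT title, plays FROM Track WHERE plays > 10 ORDER BY plays')
rows = cur.fetchall()
for row in rows:
    print(row)
conn.close()
```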
It is possible to UPDATE a column or columns within one or more rows in a table using
the SQL UPDATE statement as follows:
UPDATE Track SET plays = 16 WHERE title = 'My Way'
The UPDATE statement specifies a table and then a list of fields and values to change after the SET
keyword and then an optional WHERE clause to select the rows that are to be updated. A single
UPDATE statement will change all of the rows that match the WHERE clause. If a WHERE clause
is not specified, it performs the UPDATE on all of the rows in the table.
To remove a row, you need a WHERE clause on an SQL DELETE statement. The
WHERE clause determines which rows are to be deleted:

DELETE FROM Track WHERE plays < 100
These four basic SQL commands (INSERT, SELECT, UPDATE, and DELETE) allow
the four basic operations needed to create and maintain data. We use “CRUD” (Create, Read,
Update, and Delete) to capture all these concepts in a single term.
If we were to look at our data with a SELECT * FROM Track statement, it looks like we have
done a fine job.
We have made a very bad error in our data modeling. We have violated the rules of database
normalization.
https://en.wikipedia.org/wiki/Database_normalization
While database normalization seems very complex on the surface and contains a lot of
mathematical justifications, for now we can reduce it all into one simple rule that we will follow.
We should never put the same string data in a column more than once. If we need the data more
than once, we create a numeric key for the data and reference the actual data using this key,
especially when the multiple entries refer to the same object.
To demonstrate the slippery slope we start down by assigning string columns to our database
model, think about how we would change the data model if we wanted to keep track of the eye color
of our artists. Would we do this?
Since Frank Sinatra recorded over 1200 songs, are we really going to put the string ‘Blue’ in 1200
rows in our Track table? And what would happen if we decided his eye color was ‘Light Blue’?
Something just does not feel right.
The correct solution is to create a table for each Artist and store all the data about the artist
in that table. Then we need some way to make a connection between a row in the Track table
and a row in the Artist table. Perhaps we could call this “link” between two “tables” a
“relationship” between the tables. And that is exactly what database experts decided to call these
links.
Let’s make an Artist table to hold the data about each artist.
Now we have two tables, but we need a way to link rows between them. To do this, we need what
we call 'keys'. These keys are just integer numbers that we can use to look up a row in a
different table. If we are going to make links to rows inside of a table, we need to add a primary
key to the rows in the table. By convention we usually name the primary key column 'id'. So
our Artist table looks as follows:
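Sketched in sqlite3, with the illustrative key value 42 chosen explicitly to match the narrative (normally we would let the database pick it):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
# Artist table with an integer primary key named 'id', per convention
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')
# Choosing 42 explicitly for the sake of the example
cur.execute("INSERT INTO Artist (id, name, eyes) VALUES (42, 'Frank Sinatra', 'blue')")
```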
Now we have a row in the table for ‘Frank Sinatra’ (and his eye color) and a primary key of
‘42’ to use to link our tracks to him. So we alter our Track table as follows:
DROP TABLE IF EXISTS Track;
CREATE TABLE Track (title TEXT, plays INTEGER, artist_id INTEGER);
INSERT INTO Track (title, plays, artist_id) VALUES ('My Way', 15, 42);
INSERT INTO Track (title, plays, artist_id) VALUES ('New York', 25, 42);
The artist_id column is an integer, and by naming convention is a foreign key pointing at
a primary key in the Artist table. We call it a foreign key because it is pointing to a row in
a different table.
Now we are following the rules of database normalization, but when we want to get data out of our
database, we don’t want to see the 42, we want to see the name and eye color of the artist. To do
this we use the JOIN keyword in our SELECT statement.
The JOIN clause includes an ON condition that defines how the rows are to be connected:
for each row in Track, add the data from Artist taken from the row where the artist_id in the
Track table matches the id in the Artist table.
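A sketch of such a query in Python's sqlite3, rebuilding the two tables first so the example is self-contained:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')
cur.execute("INSERT INTO Artist VALUES (42, 'Frank Sinatra', 'blue')")
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER, artist_id INTEGER)')
cur.execute("INSERT INTO Track VALUES ('My Way', 15, 42)")
cur.execute("INSERT INTO Track VALUES ('New York', 25, 42)")

# JOIN Track to Artist; the ON clause matches foreign key to primary key
cur.execute('''SELECT Track.title, Track.plays, Artist.name, Artist.eyes
               FROM Track JOIN Artist ON Track.artist_id = Artist.id''')
for row in cur.fetchall():
    print('|'.join(str(col) for col in row))
```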
The output would be:
My Way|15|Frank Sinatra|blue
New York|25|Frank Sinatra|blue
While it might seem a little clunky, and your instincts might tell you that it would be faster just
to keep the data in one table, it turns out that the limit on database performance is how much data
needs to be scanned when retrieving a query. While the details are very complex, integers are a lot
smaller than strings (especially Unicode) and far quicker to move and compare.
[Figure: the Track (title) and Artist (name) tables connected by a relationship line, with a crow's foot at the Track end and a single bar at the Artist end]
https://en.wikipedia.org/wiki/Entity-relationship_model
In this case, "many" tracks can be associated with each artist. The track end of the line is drawn
with the crow's foot spread out, indicating the "many" end, and the artist end is drawn with a
vertical bar, indicating the "one" end. There will be many artists in general, but the important
aspect is that each artist may be associated with many tracks.
You will note that the column holding the foreign key (artist_id) is at the "many" end,
and the primary key is at the "one" end.
Since the placement of primary and foreign keys follows the "many" and "one" ends of the lines
so consistently, we never include the primary or foreign key columns in our diagram of the data
model, as shown in the second diagram in Figure 15.5. The columns are thought of as
"implementation detail" that captures the nature of the relationship and is not an essential part
of the data being modeled.
Now we have instructed the database to auto-assign a unique id value to the Frank Sinatra row.
But we then need a way to have the database tell us the id value for the recently inserted row.
One way is to use a SELECT statement to retrieve the value from an SQLite built-in function called
last_insert_rowid().
sqlite> DROP TABLE IF EXISTS Artist;
sqlite> CREATE TABLE Artist (id INTEGER PRIMARY KEY,
   ...> name TEXT, eyes TEXT);
sqlite> INSERT INTO Artist (name, eyes)
   ...> VALUES ('Frank Sinatra', 'blue');
sqlite> SELECT last_insert_rowid();
1
sqlite> SELECT * FROM Artist;
1|Frank Sinatra|blue
sqlite>
Once we know the id of our ‘Frank Sinatra’ row, we can use it when we INSERT the tracks
into the Track table. As a general strategy, we add these id columns to any table we create:
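A sketch of the general strategy in Python's sqlite3, where cur.lastrowid plays the role of last_insert_rowid():

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')
cur.execute('''CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT,
                                   plays INTEGER, artist_id INTEGER)''')

# Insert the artist without an id; SQLite auto-assigns one
cur.execute("INSERT INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
artist_id = cur.lastrowid   # the same value last_insert_rowid() would return

# Use the retrieved key when inserting the tracks (again omitting id)
cur.execute("INSERT INTO Track (title, plays, artist_id) VALUES ('My Way', 15, ?)",
            (artist_id,))
cur.execute("INSERT INTO Track (title, plays, artist_id) VALUES ('New York', 25, ?)",
            (artist_id,))
```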
Note that the artist_id value is the id that was auto-assigned to the new row in the Artist table,
and that while we added an INTEGER PRIMARY KEY to the Track table, we did not include id in the
list of fields in the INSERT statements into the Track table. Again, this tells the database to
choose a unique value for the id column for us.
You can call SELECT last_insert_rowid(); after each of the inserts to retrieve the value
that the database assigned to the id of each newly created row. Later when we are coding in
Python, we can ask for the id value in our code and store it in a variable for later use.
Since we have two tables and a foreign key between the two tables, our data is well-modeled,
but if we are going to have millions of records in the Artist table and going to do a lot of
lookups by artist name, we would benefit if we gave the database a hint about our intended use
of the name column.
We do this by adding an “index” to a text column that we intend to use in WHERE
clauses:
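For example, a UNIQUE index on the name column, sketched in sqlite3 (the index name is an assumption):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')
# A UNIQUE index both speeds up WHERE name = ... lookups and enforces
# that no two rows may share the same name
cur.execute('CREATE UNIQUE INDEX artist_name ON Artist (name)')
cur.execute("INSERT INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
```

A second insert of the same name now raises an error, which is the failure described next.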
If we try to insert ‘Frank Sinatra’ a second time, it will fail with an error.
We can tell the database to ignore any duplicate key errors by adding the IGNORE
keyword to the INSERT statement as follows:
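In SQLite the spelling is INSERT OR IGNORE; a sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE, eyes TEXT)')
# OR IGNORE turns a duplicate-key error into a silent no-op
cur.execute("INSERT OR IGNORE INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
cur.execute("INSERT OR IGNORE INTO Artist (name, eyes) VALUES ('Frank Sinatra', 'blue')")
```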
[Figure 15.6: Track (title, len, rating, count) linked to Album (title), which is linked to Artist (name); both links are one-to-many]
Since we have not added a uniqueness constraint to the eye color column, there is no problem
having multiple ‘Blue’ values in the eye column.
The columns in this file are: title, artist, album, number of plays, rating (0-100) and length in
milliseconds.
Our data model is shown in Figure 15.6 and described in SQL as follows:
We are adding the UNIQUE keyword to the TEXT columns on which we would like a uniqueness
constraint, which we will then rely on in our INSERT OR IGNORE statements. This is more succinct
than separate CREATE INDEX statements but has the same effect.
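A sketch of the schema, with column names taken from Figure 15.6 and the query output later in this section:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE Artist (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);
CREATE TABLE Album (
    id INTEGER PRIMARY KEY,
    artist_id INTEGER,
    title TEXT UNIQUE
);
CREATE TABLE Track (
    id INTEGER PRIMARY KEY,
    title TEXT UNIQUE,
    album_id INTEGER,
    len INTEGER, rating INTEGER, count INTEGER
);
''')
```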
With these tables in place, we write the following code tracks_csv.py to parse the data
and insert it into the tables:
import sqlite3

conn = sqlite3.connect('trackdb.sqlite')
cur = conn.cursor()

handle = open('tracks.csv')
for line in handle:
    line = line.strip()
    pieces = line.split(',')
    if len(pieces) < 6:
        continue
    name = pieces[0]
    artist = pieces[1]
    album = pieces[2]
    count = pieces[3]
    rating = pieces[4]
    length = pieces[5]
You can see that we are repeating the pattern of INSERT OR IGNORE followed by a SELECT
to get the appropriate artist_id and album_id for use in later INSERT statements. We start
from Artist because we need artist_id to insert the Album and need the album_id to insert the
Track.
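One round of the pattern, in isolation, might be sketched like this (a single hard-coded artist stands in for the parsing loop):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

artist = 'Queen'
# First time through, the INSERT creates the row; on any later pass
# the OR IGNORE makes it a no-op
cur.execute('INSERT OR IGNORE INTO Artist (name) VALUES (?)', (artist,))
# Either way, the SELECT retrieves the row's primary key
cur.execute('SELECT id FROM Artist WHERE name = ?', (artist,))
artist_id = cur.fetchone()[0]
```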
If we look at the Album table, we can see that the entries were added and assigned a primary key
as necessary as the data was parsed. We can also see the foreign key pointing to a row in the
Artist table for each Album row.
1 1 Greatest Hits
2 2 Herzeleid
3 3 Grease
4 4 IV
5 5 The Wall [Disc 2]
We can reconstruct all of the Track data, following all the relations using JOIN / ON clauses.
You can see both ends of each of the two relational connections in every row of the output below:
sqlite> .mode line
sqlite> SELECT * FROM Track
...> JOIN Album ON Track.album_id = Album.id
...> JOIN Artist ON Album.artist_id = Artist.id
...> LIMIT 2;
id = 1
title = Another One Bites The Dust
album_id = 1
len = 217103
rating = 100
count = 55
id = 1
artist_id = 1
title = Greatest Hits
id = 1
name = Queen
id = 2
title = Asche Zu Asche
album_id = 2
len = 231810
rating = 100
count = 79
id = 2
artist_id = 2
title = Herzeleid
id = 2
name = Rammstein
This example shows three tables and two one-to-many relationships between the tables. It
also shows how to use indexes and uniqueness constraints to programmatically construct
the tables and their relationships.
https://en.wikipedia.org/wiki/One-to-many_(data_model)
Up next we will look at the many-to-many relationships in data models.
Before we explore how we implement many-to-many relationships, let's see if we could hack
something up by extending a one-to-many relationship.
If SQL supported the notion of arrays, we might try to give the Course table an array column
holding the primary key of each enrolled user.
Sadly, while this is a tempting idea, SQL does not support arrays.3
Or we could just concatenate all the User primary keys into one long comma-separated string.
This would be very inefficient: as course rosters grow and the number of courses increases,
it becomes quite expensive to figure out which courses have student 14 on their roster.
Instead of either of these approaches, we model a many-to-many relationship using an additional
table that we call a “junction table”, “through table”, “connector table”, or “join table” as shown
in Figure 15.8. The purpose of this table is to capture the connection between a course and a
student.
In a sense the Member table sits between the Course and User tables and has a one-to-many
relationship to each of them. By using an intermediate table we break a many-to-many relationship
into two one-to-many relationships. Databases are very good at modeling and processing
one-to-many relationships.
An example set of tables, with Member linking User and Course, would be as follows:
CREATE TABLE Course (
    id INTEGER PRIMARY KEY,
    title TEXT UNIQUE
);
Following our naming convention, Member.user_id and Member.course_id are foreign keys
pointing at the corresponding rows in the User and Course tables. Each entry in the member
table links a row in the User table to a row in the Course table by going through the Member
table.
We indicate that the combination of course_id and user_id is the PRIMARY KEY for the
Member table, which also creates a uniqueness constraint on each course_id / user_id combination.
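Put together as a runnable sketch (table and column names follow the text):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE User (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);
CREATE TABLE Course (
    id INTEGER PRIMARY KEY,
    title TEXT UNIQUE
);
CREATE TABLE Member (
    user_id INTEGER,
    course_id INTEGER,
    PRIMARY KEY (user_id, course_id)
);
''')
# The composite primary key doubles as a uniqueness constraint,
# so a duplicate membership is silently ignored
cur.execute('INSERT OR IGNORE INTO Member (user_id, course_id) VALUES (1, 1)')
cur.execute('INSERT OR IGNORE INTO Member (user_id, course_id) VALUES (1, 1)')
```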
Now let's say we need to insert a number of students into the rosters of a number of courses.
Let's assume the data comes to us in a JSON-formatted file with records like this:
[
[ "Charley", "si110"],
[ "Mea", "si110"],
[ "Hattie", "si110"],
[ "Keziah", "si110"],
[ "Rosa", "si106"],
[ "Mea", "si106"],
[ "Mairin", "si106"],
[ "Zendel", "si106"],
[ "Honie", "si106"],
[ "Rosa", "si106"],
...
]
We could write the following code to read the JSON file and insert the members of each
course roster into the database:
import json
import sqlite3

conn = sqlite3.connect('rosterdb.sqlite')
cur = conn.cursor()

str_data = open('roster_data_sample.json').read()
json_data = json.loads(str_data)

for entry in json_data:
    name = entry[0]
    title = entry[1]
    print((name, title))
conn.commit()
As in a previous example, we first make sure that we have an entry in the User table and
know its primary key, and an entry in the Course table and know its primary key. We use the
INSERT OR IGNORE and SELECT pattern so our code works regardless of whether the record is
already in the table.

Our insert into the Member table simply inserts the two integers as a new or existing row,
depending on the constraint, making sure we do not end up with duplicate entries in the Member
table for a particular user_id / course_id combination.
To reconstruct our data across all three tables, we again use JOIN / ON to construct a SELECT
query:
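A sketch of such a query, with one roster entry inserted so the example is self-contained:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member (user_id INTEGER, course_id INTEGER,
                     PRIMARY KEY (user_id, course_id));
INSERT INTO User (name) VALUES ('Charley');
INSERT INTO Course (title) VALUES ('si110');
INSERT INTO Member (user_id, course_id) VALUES (1, 1);
''')
# Walk both one-to-many relationships: User <- Member -> Course
cur.execute('''SELECT User.name, Member.user_id, Member.course_id, Course.title
               FROM User JOIN Member JOIN Course
               ON Member.user_id = User.id AND Member.course_id = Course.id''')
print(cur.fetchall())
```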
You can see the three tables from left to right - Course, Member, and User and you can see the
connections between the primary keys and foreign keys in each row of output.
For simplicity, we will decide that zero in the role column means "student" and one means
"instructor". Let's assume our JSON data is augmented with the role as follows:
[
[ "Charley", "si110", 1],
[ "Mea", "si110", 0],
[ "Hattie", "si110", 0],
[ "Keziah", "si110", 0],
[ "Rosa", "si106", 0],
[ "Mea", "si106", 1],
[ "Mairin", "si106", 0],
[ "Zendel", "si106", 0],
[ "Honie", "si106", 0],
[ "Rosa", "si106", 0],
...
]
name = entry[0]
title = entry[1]
role = entry[2]
...
In a real system, we would probably build a Role table and make the role column in
Member a foreign key into the Role table as follows:
CREATE TABLE Role (
    id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE TABLE Member (
    user_id INTEGER,
    course_id INTEGER,
    role_id INTEGER,
    PRIMARY KEY (user_id, course_id, role_id)
);
Notice that because we declared the id column in the Role table as a PRIMARY KEY, we could omit
it in the INSERT statement. But we can also choose the id value ourselves, as long as the value
is not already in the id column and does not violate the implied UNIQUE constraint on primary keys.
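A sketch of both styles of INSERT against a hypothetical Role table (the role names and the explicit id are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Role (id INTEGER PRIMARY KEY, name TEXT)')
# Omit id: SQLite chooses the next available value
cur.execute("INSERT INTO Role (name) VALUES ('student')")
# Or choose the id ourselves, as long as it is not already taken
cur.execute("INSERT INTO Role (id, name) VALUES (5000, 'instructor')")
```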