CTP-MD5 ch3
Chapter 03
What is a database?
A database is a file that is organized for storing data.
Database software maintains its performance by building indexes as data is added to the
database to allow the computer to jump quickly to a particular entry.
SQLite is well suited to some of the data manipulation problems that we see in
Informatics, such as the Twitter spidering application that we describe in this
chapter.
Database concepts
When you first look at a database it looks like a spreadsheet with multiple sheets. The
primary data structures in a database are: tables, rows, and columns.
In technical descriptions of relational databases the concepts of table, row, and column are
more formally referred to as relation, tuple, and attribute, respectively.
[Figure: a table (relation) with its columns (attributes) and rows (tuples)]
import sqlite3
INTEGER)')
conn.close()
A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with
text files.
Database commands are expressed in a special language that has been standardized across
many different database vendors to allow us to learn a single database language. The
database language is called Structured Query Language, or SQL for short.
The first SQL command removes the Tracks table from the database if it exists. This
pattern simply allows us to run the same program to create the Tracks table over
and over again without causing an error.
cur.execute('DROP TABLE IF EXISTS Tracks')
The second command creates a table named Tracks with a text column named
title and an integer column named plays.
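The two commands described above can be sketched as one small program. This is only an illustration: the in-memory database (':memory:') is an assumption made so the sketch leaves no file behind, and the PRAGMA query at the end is added just to show the resulting columns.

```python
import sqlite3

# Assumption: an in-memory database stands in for a database file on disk.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Safe to run repeatedly: the table is dropped only if it already exists.
cur.execute('DROP TABLE IF EXISTS Tracks')
# A text column named title and an integer column named plays.
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# List the name and declared type of each column in the new table.
columns = [(row[1], row[2]) for row in cur.execute('PRAGMA table_info(Tracks)')]
print(columns)

conn.close()
```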
Now that we have created a table named Tracks, we can put some data into that table using
the SQL INSERT operation. Again, we begin by making a connection to the database
and obtaining the cursor. We can then execute SQL commands using the cursor.
The SQL INSERT command indicates which table we are using and then defines a new
row by listing the fields we want to include (title, plays) followed by the VALUES we
want placed in the new row.
We specify the values as question marks (?, ?) to indicate that the actual values are
passed in as a tuple ('My Way', 15) as the second parameter to the execute() call.
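As a minimal sketch of the placeholder mechanism, the following inserts the two rows shown in this chapter's Tracks table. The in-memory database and the final fetchall() check are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# Each ? is filled, in order, from the tuple passed as the second argument.
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)',
            ('Thunderstruck', 20))
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)',
            ('My Way', 15))
conn.commit()

rows = cur.execute('SELECT title, plays FROM Tracks').fetchall()
print(rows)
conn.close()
```

Passing the values separately from the SQL text means we never have to build the statement by joining strings ourselves.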
import sqlite3
print('Tracks:')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)
cur.close()
Tracks
title          plays
Thunderstruck  20
My Way         15
Tracks:
('Thunderstruck', 20)
('My Way', 15)
The DELETE command shows the use of a WHERE clause that allows us to express a
selection criterion so that we can ask the database to apply the command to only the rows
that match the criterion.
In this example the criterion happens to apply to all the rows so we empty the
table out so we can run the program repeatedly. After the DELETE is performed, we
also call commit() to force the data to be removed from the database.
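A sketch of a DELETE whose criterion happens to match every row might look like this. The criterion plays < 100 is an assumption chosen for illustration, as is the in-memory database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.executemany('INSERT INTO Tracks (title, plays) VALUES (?, ?)',
                [('Thunderstruck', 20), ('My Way', 15)])

# Hypothetical criterion: both rows have fewer than 100 plays, so both match.
cur.execute('DELETE FROM Tracks WHERE plays < 100')
deleted = cur.rowcount
conn.commit()

remaining = cur.execute('SELECT COUNT(*) FROM Tracks').fetchone()[0]
print(deleted, remaining)
conn.close()
```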
Structured Query Language summary
So far, we have been using the Structured Query Language in our Python examples and have
covered many of the basics of the SQL commands. In this section, we look at the SQL
language in particular and give an overview of SQL syntax.
Since there are so many different database vendors, the Structured Query Language (SQL)
was standardized so we could communicate in a portable manner with database systems from
multiple vendors.
A relational database is made up of tables, rows, and columns. The columns
generally have a type such as text, numeric, or date data. When we create a table,
we indicate the names and types of the columns:
To remove a row, you need a WHERE clause on an SQL DELETE statement. The
WHERE clause determines which rows are to be deleted:
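To see the core statements side by side, here is a short sketch run through Python. The table contents and the specific WHERE conditions are made up for illustration; only the statement forms matter.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()

# CREATE TABLE names each column and gives it a type.
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('My Way', 15)")
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 20)")

# UPDATE changes columns in the rows selected by WHERE.
cur.execute("UPDATE Tracks SET plays = 16 WHERE title = 'My Way'")
# DELETE removes only the rows selected by WHERE.
cur.execute("DELETE FROM Tracks WHERE title = 'Thunderstruck'")
conn.commit()

summary_rows = cur.execute('SELECT title, plays FROM Tracks').fetchall()
print(summary_rows)
conn.close()
```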
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

while True:
    acct = input('Enter a Twitter account, or quit: ')
    if (acct == 'quit'): break
    if (len(acct) < 1):
        cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
        try:
            acct = cur.fetchone()[0]
        except:
            print('No unretrieved Twitter accounts found')
            continue

    # ... (retrieval of the friend list into js omitted) ...

    countnew = 0
    countold = 0
    for u in js['users']:
        friend = u['screen_name']
        print(friend)
        cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
                    (friend, ))
        try:
            count = cur.fetchone()[0]
            cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                        (count+1, friend))
            countold = countold + 1
        except:
            cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
                        VALUES (?, 0, 1)''', (friend, ))
            countnew = countnew + 1
    print('New accounts=', countnew, ' revisited=', countold)
    conn.commit()

cur.close()
Once we retrieve the list of friends and statuses, we loop through all of the user items in
the returned JSON and retrieve the screen_name for each user. Then we use the SELECT
statement to see if we have already stored this particular screen_name in the database and
retrieve the friend count (friends) if the record exists.
countnew = 0
countold = 0
for u in js['users']:
    friend = u['screen_name']
    print(friend)
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
                (friend, ))
    try:
        count = cur.fetchone()[0]
        cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                    (count+1, friend))
        countold = countold + 1
    except:
        cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
                    VALUES (?, 0, 1)''', (friend, ))
        countnew = countnew + 1
print('New accounts=', countnew, ' revisited=', countold)
conn.commit()
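The loop above can be tried without calling Twitter at all. In the sketch below, the js dictionary is a hypothetical stand-in for the parsed JSON, the column types of the Twitter table are assumed, and an explicit check for None replaces the try/except used in the program.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
# Assumed column types; the names match the statements used in the program.
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
# Pretend one account was already stored by an earlier run.
cur.execute("INSERT INTO Twitter (name, retrieved, friends) VALUES ('lhawthorn', 0, 1)")

# Hypothetical stand-in for the JSON returned by the Twitter API.
js = {'users': [{'screen_name': 'lhawthorn'}, {'screen_name': 'opencontent'}]}

countnew = 0
countold = 0
for u in js['users']:
    friend = u['screen_name']
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1', (friend, ))
    row = cur.fetchone()
    if row is not None:
        # Already stored: add one to the friend count.
        cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                    (row[0] + 1, friend))
        countold = countold + 1
    else:
        # First time we see this account: store it as not yet retrieved.
        cur.execute('INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)',
                    (friend, ))
        countnew = countnew + 1
conn.commit()
print('New accounts=', countnew, ' revisited=', countold)

stored = dict(cur.execute('SELECT name, friends FROM Twitter'))
conn.close()
```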
So the first time the program runs and we enter a Twitter account, the program runs as
follows:
This program simply opens the database and selects all of the columns of all of the
rows in the table Twitter, then loops through the rows and prints out each row.
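A dumping program of this kind could be sketched as follows. The sample rows are made up, and an in-memory database is assumed in place of the spider's database file.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
# Made-up sample rows so that there is something to print.
cur.executemany('INSERT INTO Twitter VALUES (?, ?, ?)',
                [('drchuck', 1, 0), ('opencontent', 0, 1)])

# Select every column of every row, then loop over the cursor.
cur.execute('SELECT * FROM Twitter')
count = 0
for row in cur:
    print(row)
    count = count + 1
print(count, 'rows.')
cur.close()
```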
if (len(acct) < 1):
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    try:
        acct = cur.fetchone()[0]
    except:
        print('No unretrieved Twitter accounts found')
        continue
We use the SQL SELECT statement to retrieve the name of the first (LIMIT 1) user who
still has their “have we retrieved this user” value set to zero. We also use the fetchone()[0]
pattern within a try/except block to either extract a screen_name from the retrieved data or
put out an error message and loop back up.
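The reason the try/except is needed is that fetchone() returns None when no row matches, and indexing None raises a TypeError. A small sketch of the pattern, wrapped in a hypothetical helper named next_unretrieved and catching only that exception, might be:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')

def next_unretrieved(cur):
    """Return the name of one unretrieved account, or None if there is none."""
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    try:
        return cur.fetchone()[0]
    except TypeError:
        # fetchone() returned None because no row matched.
        return None

first = next_unretrieved(cur)   # the table is still empty
cur.execute("INSERT INTO Twitter VALUES ('drchuck', 0, 0)")
second = next_unretrieved(cur)  # now one unretrieved account exists
print(first, second)
conn.close()
```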
If we successfully retrieved an unprocessed screen_name, we retrieve their data as
follows:
If we run the friend program and press enter twice to retrieve the next unvisited
friend’s friends, then run the dumping program, it will give us the following output:
Basic data modeling
The real power of a relational database is when we create multiple tables and
make links between those tables.
The act of deciding how to break up your application data into multiple tables
and establishing the relationships between the tables is called data modeling.
The design document that shows the tables and their relationships is called a
data model.
We can create a new table that keeps track of pairs of friends. The following is a simple
way of making such a table:
Each time we encounter a person who drchuck is following, we would insert a row of
the form:
As we are processing the 20 friends from the drchuck Twitter feed, we will insert
20 records with “drchuck” as the first parameter so we will end up duplicating the
string many times in the database.
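The duplication is easy to see in a small sketch. The column names from_friend and to_friend are hypothetical, and only three of the friends are inserted here.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
# Column names from_friend and to_friend are hypothetical.
cur.execute('CREATE TABLE Pals (from_friend TEXT, to_friend TEXT)')

for friend in ['lhawthorn', 'opencontent', 'steve_coppin']:
    cur.execute('INSERT INTO Pals (from_friend, to_friend) VALUES (?, ?)',
                ('drchuck', friend))

# The string 'drchuck' is now stored once per relationship.
copies = cur.execute(
    "SELECT COUNT(*) FROM Pals WHERE from_friend = 'drchuck'").fetchone()[0]
print(copies)
conn.close()
```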
The People table has an additional column to store the numeric key associated with the
row for this Twitter user. SQLite has a feature that automatically adds the key value for
any row we insert into a table using a special type of data column (INTEGER PRIMARY
KEY).
We can create the People table with this additional id column as follows:
Now instead of creating the table Pals above, we create a table called Follows with
two integer columns from_id and to_id and a constraint on the table that the combination of
from_id and to_id must be unique in this table (i.e., we cannot insert duplicate rows)
in our database.
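One way to write the two table definitions described above is sketched here. The exact statements are an assumption reconstructed from the description (an INTEGER PRIMARY KEY id, a unique name, and a unique pair of from_id and to_id).

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

# We do not supply id; SQLite assigns it automatically.
cur.execute("INSERT INTO People (name, retrieved) VALUES ('drchuck', 0)")
cur.execute("INSERT INTO People (name, retrieved) VALUES ('opencontent', 0)")
people = cur.execute('SELECT id, name FROM People ORDER BY id').fetchall()
print(people)

# A plain INSERT of the same pair a second time violates the constraint.
cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (1, 2)')
try:
    cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (1, 2)')
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
conn.close()
```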
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    acct = input('Enter a Twitter account, or quit: ')
    if (acct == 'quit'): break
    if (len(acct) < 1):
        cur.execute('SELECT id, name FROM People WHERE retrieved = 0 LIMIT 1')
        try:
            (id, acct) = cur.fetchone()
        except:
            print('No unretrieved Twitter accounts found')
            continue
    else:
        cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1',
                    (acct, ))
        try:
            id = cur.fetchone()[0]
        except:
            cur.execute('''INSERT OR IGNORE INTO People
                        (name, retrieved) VALUES (?, 0)''', (acct, ))
            conn.commit()
            if cur.rowcount != 1:
                print('Error inserting account:', acct)
                continue
            id = cur.lastrowid

    url = twurl.augment(TWITTER_URL, {'screen_name': acct, 'count': '100'})
    print('Retrieving account', acct)
    try:
        connection = urllib.request.urlopen(url, context=ctx)
    except Exception as err:
        print('Failed to Retrieve', err)
        break

    data = connection.read().decode()
    headers = dict(connection.getheaders())

    try:
        js = json.loads(data)
    except:
        print('Unable to parse json')
        print(data)
        break

    # Debugging
    # print(json.dumps(js, indent=4))

    countnew = 0
    countold = 0
    for u in js['users']:
        friend = u['screen_name']
        print(friend)
        cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1',
                    (friend, ))
        try:
            friend_id = cur.fetchone()[0]
            countold = countold + 1
        except:
            cur.execute('''INSERT OR IGNORE INTO People (name, retrieved)
                        VALUES (?, 0)''', (friend, ))
            conn.commit()
            if cur.rowcount != 1:
                print('Error inserting account:', friend)
                continue
            friend_id = cur.lastrowid
            countnew = countnew + 1
        cur.execute('''INSERT OR IGNORE INTO Follows (from_id, to_id)
                    VALUES (?, ?)''', (id, friend_id))
    print('New accounts=', countnew, ' revisited=', countold)
    print('Remaining', headers['x-rate-limit-remaining'])
    conn.commit()

cur.close()
# Code: http://www.py4e.com/code3/twfriends.py
This program is starting to get a bit complicated, but it illustrates the patterns
that we need to use when we are using integer keys to link tables. The basic patterns
are:
We add the OR IGNORE clause to our INSERT statement to indicate that if this
particular INSERT would cause a violation of the “name must be unique” rule, the
database system is allowed to ignore the INSERT. We are using the database constraint
as a safety net to make sure we don’t inadvertently do something incorrect.
Similarly, the following code ensures that we don’t add the exact same Follows
relationship twice.
Again, we simply tell the database to ignore our attempted INSERT if it would
violate the uniqueness constraint that we specified for the Follows rows.
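The effect of OR IGNORE on both tables can be checked with a short sketch. The table definitions are assumed from the description in this chapter, and each INSERT is deliberately attempted twice.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

rowcounts = []
for _ in range(2):
    cur.execute('INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)',
                ('drchuck', ))
    rowcounts.append(cur.rowcount)  # 1 when inserted, 0 when ignored

for _ in range(2):
    cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (?, ?)',
                (1, 2))
conn.commit()

follows = cur.execute('SELECT COUNT(*) FROM Follows').fetchone()[0]
print(rowcounts, follows)
conn.close()
```

Neither repeated INSERT raises an error; the second attempt is simply skipped, which is why checking cur.rowcount is the way to tell whether a row was actually added.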
2 In general, when a sentence starts with “if all goes well” you will find that the code needs
to use try/except.
friend = u['screen_name']
cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1',
            (friend, ))
try:
    friend_id = cur.fetchone()[0]
    countold = countold + 1
except:
    cur.execute('''INSERT OR IGNORE INTO People (name, retrieved)
                VALUES (?, 0)''', (friend, ))
    conn.commit()
    if cur.rowcount != 1:
        print('Error inserting account:', friend)
        continue
    friend_id = cur.lastrowid
    countnew = countnew + 1
If we end up in the except code, it simply means that the row was not found, so we
must insert the row. We use INSERT OR IGNORE just to avoid errors and then
call commit() to force the database to really be updated. After the write is done, we
can check the cur.rowcount to see how many rows were affected. Since we are
attempting to insert a single row, if the number of affected rows is something other
than 1, it is an error.
If the INSERT is successful, we can look at cur.lastrowid to find out what
value the database assigned to the id column in our newly created row.
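The look-up-or-insert steps can be collected into one function. This is a sketch only: the helper name person_id is hypothetical, it uses an explicit None check rather than try/except, and it raises an exception where the program prints a message and continues.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')

def person_id(cur, name):
    """Return the id for name, inserting a new People row if needed."""
    cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1', (name, ))
    row = cur.fetchone()
    if row is not None:
        return row[0]
    cur.execute('INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)',
                (name, ))
    if cur.rowcount != 1:
        raise RuntimeError('Error inserting account: ' + name)
    # The key that the database assigned to the row we just inserted.
    return cur.lastrowid

first_id = person_id(cur, 'drchuck')       # inserted
second_id = person_id(cur, 'opencontent')  # inserted
again_id = person_id(cur, 'drchuck')       # found, not inserted again
print(first_id, second_id, again_id)
conn.commit()
conn.close()
```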
Once we know the key value for both the Twitter user and the friend in the JSON, it
is a simple matter to insert the two numbers into the Follows table with the
following code:
Notice that we let the database take care of keeping us from “double-inserting” a
relationship by creating the table with a uniqueness constraint and then adding OR
IGNORE to our INSERT statement.
Here is a sample execution of this program:
People:
(1, 'drchuck', 1)
(2, 'opencontent', 1)
(3, 'lhawthorn', 1)
(4, 'steve_coppin', 0)
(5, 'davidkocher', 0)
55 rows.
Follows:
(1, 2)
(1, 3)
(1, 4)
(1, 5)
(1, 6)
60 rows.
You can see the id, name, and retrieved fields in the People table and you see
the numbers of both ends of the relationship in the Follows table. In the People
table, we can see that the first three people have been visited and their data has been
retrieved. The data in the Follows table indicates that drchuck (user 1) is a friend
to all of the people shown in the first five rows. This makes sense because the first
data we retrieved and stored was the Twitter friends of drchuck. If you were to
print more rows from the Follows table, you would see the friends of users 2 and 3
as well.
We are using a naming convention of always calling the primary key field name id
and appending the suffix _id to any field name that is a foreign key.
Using JOIN to retrieve data
Now that we have followed the rules of database normalization and have data
separated into two tables, linked together using primary and foreign keys, we need to
be able to build a SELECT that reassembles the data across the tables.
SQL uses the JOIN clause to reconnect these tables. In the JOIN clause you specify
the fields that are used to reconnect the rows between the tables.
The following is an example of a SELECT with a JOIN clause:
The JOIN clause indicates that the fields we are selecting cross both the Follows
and People tables. The ON clause indicates how the two tables are to be joined:
Take the rows from Follows and append the row from People where the field
from_id in Follows is the same as the id value in the People table.
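A JOIN of this form can be sketched on a few made-up rows. The table definitions, the sample data, and the WHERE and ORDER BY clauses are assumptions added so that the result is small and predictable.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')
# Made-up sample data.
cur.executemany('INSERT INTO People (id, name, retrieved) VALUES (?, ?, ?)',
                [(1, 'drchuck', 1), (2, 'opencontent', 1), (3, 'lhawthorn', 1)])
cur.executemany('INSERT INTO Follows (from_id, to_id) VALUES (?, ?)',
                [(1, 2), (1, 3), (2, 1)])

# Append the People row whose id equals from_id to each Follows row.
cur.execute('''SELECT * FROM Follows JOIN People
               ON Follows.from_id = People.id
               WHERE People.id = 1
               ORDER BY Follows.to_id''')
joined = cur.fetchall()
for row in joined:
    print(row)
conn.close()
```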
[Figure: the People and Follows tables; rows of the People table:]
id  name          retrieved
1   drchuck       1
2   opencontent   1
3   lhawthorn     1
4   steve_coppin  0
...
import sqlite3
cur.execute('SELECT * FROM People')
count = 0
print('People:')
for row in cur:
    if count < 5: print(row)
    count = count + 1
print(count, 'rows.')
cur.execute('SELECT * FROM Follows')
count = 0
print('Follows:')
for row in cur:
    if count < 5: print(row)
    count = count + 1
print(count, 'rows.')
cur.close()
# Code: http://www.py4e.com/code3/twjoin.py
In this program, we first dump out the People and Follows and then dump out
a subset of the data in the tables joined together.
Here is the output of the program:
python twjoin.py
People:
(1, 'drchuck', 1)
(2, 'opencontent', 1)
(3, 'lhawthorn', 1)
(4, 'steve_coppin', 0)
(5, 'davidkocher', 0)
55 rows.
Follows:
(1, 2)
(1, 3)
(1, 4)
(1, 5)
(1, 6)
60 rows.
Connections for id=2:
(2, 1, 1, 'drchuck', 1)
(2, 28, 28, 'cnxorg', 0)
(2, 30, 30, 'kthanos', 0)
(2, 102, 102, 'SomethingGirl', 0)
(2, 103, 103, 'ja_Pac', 0)
20 rows.
You see the columns from the People and Follows tables and the last set of rows
is the result of the SELECT with the JOIN clause.
In the last select, we are looking for accounts that are friends of “opencontent”
(i.e., People.id=2).
In each of the “metarows” in the last select, the first two columns are from the
Follows table followed by columns three through five from the People table. You
can also see that the second column (Follows.to_id) matches the third column
(People.id) in each of the joined-up “metarows”.
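The column layout of these “metarows” can be reproduced with a sketch that joins on to_id and names the five columns explicitly. The query is an assumption based on the output shown above, and only three of the sample rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: in-memory database
cur = conn.cursor()
cur.execute('''CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')
cur.executemany('INSERT INTO People (id, name, retrieved) VALUES (?, ?, ?)',
                [(1, 'drchuck', 1), (2, 'opencontent', 1),
                 (28, 'cnxorg', 0), (30, 'kthanos', 0)])
cur.executemany('INSERT INTO Follows (from_id, to_id) VALUES (?, ?)',
                [(1, 2), (2, 1), (2, 28), (2, 30)])

# Columns one and two come from Follows, columns three to five from People.
cur.execute('''SELECT Follows.from_id, Follows.to_id,
                      People.id, People.name, People.retrieved
               FROM Follows JOIN People ON Follows.to_id = People.id
               WHERE Follows.from_id = 2
               ORDER BY Follows.to_id''')
metarows = cur.fetchall()
for row in metarows:
    print(row)
conn.close()
```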