CTP-MD5 ch3

The document provides an overview of databases and SQL, explaining what a database is and how to use SQLite for data manipulation. It covers basic SQL commands such as CREATE, INSERT, SELECT, UPDATE, and DELETE, along with examples of how to implement them in Python. Additionally, it describes a Twitter spidering application that retrieves and stores Twitter account data in a database using SQL commands.

Uploaded by

sindhud
Copyright
© © All Rights Reserved
MODULE - 05

Chapter 03

Using Databases and SQL

What is a database?
 A database is a file that is organized for storing data.
 Database software maintains its performance by building indexes as data is added to the
database to allow the computer to jump quickly to a particular entry.
 SQLite is well suited to some of the data manipulation problems that we see in
Informatics, such as the Twitter spidering application that we describe in this
chapter.

Database concepts
 When you first look at a database it looks like a spreadsheet with multiple sheets. The
primary data structures in a database are: tables, rows, and columns.
 In technical descriptions of relational databases the concepts of table, row, and column are
more formally referred to as relation, tuple, and attribute, respectively.

Figure 1: Relational Databases (table = relation, row = tuple, column = attribute)

Database Browser for SQLite


 Although we can use Python to work with data in SQLite database files, many operations can be done more
conveniently using software called the Database Browser for SQLite, which is freely
available from:
http://sqlitebrowser.org/
 Using the browser, you can easily create tables, insert data, edit data, or run simple SQL
queries on the data in the database.
Creating a database table
 The code to create a database file and a table named Tracks with two columns in the
database is as follows:

import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Tracks')
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

conn.close()

 A cursor is like a file handle that we can use to perform operations on the data stored in the
database. Calling cursor() is very similar conceptually to calling open() when dealing with
text files.

Figure 2: A Database Cursor

 Database commands are expressed in a special language that has been standardized across
many different database vendors to allow us to learn a single database language. The
database language is called Structured Query Language or SQL for short.
 The first SQL command removes the Tracks table from the database if it exists. This
pattern simply allows us to run the same program to create the Tracks table over
and over again without causing an error.

cur.execute('DROP TABLE IF EXISTS Tracks')

 The second command creates a table named Tracks with a text column named
title and an integer column named plays.

cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
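
The drop-and-create pattern can be tried without writing a file. The following is a minimal sketch, assuming an in-memory database (the ':memory:' name is used only for illustration; the notes use music.sqlite). It runs the two commands twice to show that repeating them causes no error, then asks SQLite which columns the table has.

```python
import sqlite3

# ':memory:' gives a throwaway database; the notes use 'music.sqlite' instead.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Running the same two commands twice must not raise an error.
for attempt in range(2):
    cur.execute('DROP TABLE IF EXISTS Tracks')
    cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# PRAGMA table_info returns one row per column: (cid, name, type, ...).
cur.execute('PRAGMA table_info(Tracks)')
columns = [(row[1], row[2]) for row in cur]
print(columns)

conn.close()
```

Without the DROP TABLE IF EXISTS line, the second CREATE TABLE would fail because the table already exists.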

 Now that we have created a table named Tracks, we can put some data into that table using
the SQL INSERT operation. Again, we begin by making a connection to the database
and obtaining the cursor. We can then execute SQL commands using the cursor.
 The SQL INSERT command indicates which table we are using and then defines a new
row by listing the fields we want to include (title, plays) followed by the VALUES we
want placed in the new row.
 We specify the values as question marks (?, ?) to indicate that the actual values are
passed in as a tuple ('My Way', 15) as the second parameter to the execute() call.

import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)',
    ('Thunderstruck', 20))
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)',
    ('My Way', 15))
conn.commit()

print('Tracks:')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)

cur.execute('DELETE FROM Tracks WHERE plays < 100')
conn.commit()

cur.close()
Tracks
title plays
Thunderstruck 20
My Way 15

Figure 3: Rows in a Table


 First we INSERT two rows into our table and use commit() to force the data to be
written to the database file.
 Then we use the SELECT command to retrieve the rows we just inserted from the table.
On the SELECT command, we indicate which columns we would like (title, plays)
and indicate which table we want to retrieve the data from.
 After we execute the SELECT statement, the cursor is something we can loop
through in a for statement. For efficiency, the cursor does not read all of the data
from the database when we execute the SELECT statement. Instead, the data is read on
demand as we loop through the rows in the for statement.
The output of the program is as follows:

Tracks:
('Thunderstruck', 20)
('My Way', 15)

 The DELETE command shows the use of a WHERE clause that allows us to express a
selection criterion so that we can ask the database to apply the command to only the rows
that match the criterion.
 In this example the criterion happens to apply to all the rows so we empty the
table out so we can run the program repeatedly. After the DELETE is performed, we
also call commit() to force the data to be removed from the database.
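
To see a WHERE clause that matches only some of the rows, here is a minimal sketch under stated assumptions: it uses an in-memory database rather than music.sqlite, and the threshold of 20 is chosen for illustration so that only one of the two rows is deleted.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# The ? placeholders are filled from the tuple passed as the second argument.
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)', ('Thunderstruck', 20))
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)', ('My Way', 15))
conn.commit()

# Looping over the cursor reads the rows on demand, one tuple per row.
cur.execute('SELECT title, plays FROM Tracks')
rows = [row for row in cur]

# The WHERE clause limits the DELETE to matching rows; only 'My Way' has plays < 20.
cur.execute('DELETE FROM Tracks WHERE plays < 20')
deleted = cur.rowcount  # number of rows the DELETE removed
conn.commit()

cur.execute('SELECT title FROM Tracks')
remaining = cur.fetchall()
print(rows, deleted, remaining)
conn.close()
```

The rowcount attribute is a convenient way to confirm how many rows a DELETE or UPDATE actually touched.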

Structured Query Language summary
 So far, we have been using the Structured Query Language in our Python examples and have
covered many of the basics of the SQL commands. In this section, we look at the SQL
language in particular and give an overview of SQL syntax.
 Since there are so many different database vendors, the Structured Query Language (SQL)
was standardized so we could communicate in a portable manner to database systems from
multiple vendors.
 A relational database is made up of tables, rows, and columns. The columns
generally have a type such as text, numeric, or date data. When we create a table,
we indicate the names and types of the columns:

CREATE TABLE Tracks (title TEXT, plays INTEGER)

 To insert a row into a table, we use the SQL INSERT command:

INSERT INTO Tracks (title, plays) VALUES ('My Way', 15)

 The INSERT statement specifies the table name, then a list of the
fields/columns that you would like to set in the new row, and then the keyword
VALUES and a list of corresponding values for each of the fields.
 The SQL SELECT command is used to retrieve rows and columns from a
database. The SELECT statement lets you specify which columns you would
like to retrieve as well as a WHERE clause to select which rows you would like
to see. It also allows an optional ORDER BY clause to control the sorting of the
returned rows.

SELECT * FROM Tracks WHERE title = 'My Way'

 Using * indicates that you want the database to return all of the columns for each
row that matches the WHERE clause.
You can request that the returned rows be sorted by one of the fields as follows:

SELECT title, plays FROM Tracks ORDER BY title

 To remove a row, you need a WHERE clause on an SQL DELETE statement. The
WHERE clause determines which rows are to be deleted:

DELETE FROM Tracks WHERE title = 'My Way'

It is possible to UPDATE a column or columns within one or more rows in a table
using the SQL UPDATE statement as follows:

UPDATE Tracks SET plays = 16 WHERE title = 'My Way'

 The UPDATE statement specifies a table and then a list of fields and values to
change after the SET keyword and then an optional WHERE clause to select the
rows that are to be updated. A single UPDATE statement will change all of the
rows that match the WHERE clause. If a WHERE clause is not specified, it
performs the UPDATE on all of the rows in the table.
 These four basic SQL commands (INSERT, SELECT, UPDATE, and
DELETE) allow the four basic operations needed to create and maintain
data.
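
The four commands can be exercised together from Python. This is a minimal sketch, assuming an in-memory database; executemany() is a standard sqlite3 cursor method that is not used elsewhere in these notes and appears here only to shorten the inserts.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# INSERT: executemany runs the same statement once per tuple.
cur.executemany('INSERT INTO Tracks (title, plays) VALUES (?, ?)',
                [('Thunderstruck', 20), ('My Way', 15)])

# UPDATE: only rows matching the WHERE clause change.
cur.execute('UPDATE Tracks SET plays = 16 WHERE title = ?', ('My Way',))

# SELECT with ORDER BY: rows come back sorted by title.
cur.execute('SELECT title, plays FROM Tracks ORDER BY title')
after_update = cur.fetchall()

# DELETE: remove a single row by its title.
cur.execute('DELETE FROM Tracks WHERE title = ?', ('My Way',))
cur.execute('SELECT title, plays FROM Tracks ORDER BY title')
after_delete = cur.fetchall()

conn.commit()
conn.close()
print(after_update)
print(after_delete)
```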
Spidering Twitter using a database
 In this section, we will create a simple spidering program that will go through Twitter accounts and
build a database of them.
Note: Be very careful when running this program. You do not want to pull
too much data or run the program for too long and end up having your
Twitter access shut off.
Here is the source code for our Twitter spidering application:

from urllib.request import urlopen
import urllib.error
import twurl
import json
import sqlite3
import ssl

TWITTER_URL = 'https://api.twitter.com/1.1/friends/list.json'

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

cur.execute('''
    CREATE TABLE IF NOT EXISTS Twitter
    (name TEXT, retrieved INTEGER, friends INTEGER)''')

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    acct = input('Enter a Twitter account, or quit: ')
    if (acct == 'quit'): break
    if (len(acct) < 1):
        # No name typed: pick one account we have not retrieved yet
        cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
        try:
            acct = cur.fetchone()[0]
        except:
            print('No unretrieved Twitter accounts found')
            continue

    url = twurl.augment(TWITTER_URL, {'screen_name': acct, 'count': '5'})
    print('Retrieving', url)
    connection = urlopen(url, context=ctx)
    data = connection.read().decode()
    headers = dict(connection.getheaders())

    print('Remaining', headers['x-rate-limit-remaining'])
    js = json.loads(data)
    # Debugging
    # print(json.dumps(js, indent=4))

    cur.execute('UPDATE Twitter SET retrieved=1 WHERE name = ?', (acct, ))

    countnew = 0
    countold = 0
    for u in js['users']:
        friend = u['screen_name']
        print(friend)
        cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
            (friend, ))
        try:
            count = cur.fetchone()[0]
            cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                (count+1, friend))
            countold = countold + 1
        except:
            cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
                VALUES (?, 0, 1)''', (friend, ))
            countnew = countnew + 1
    print('New accounts=', countnew, ' revisited=', countold)
    conn.commit()

cur.close()

 Once we retrieve the list of friends and statuses, we loop through all of the user items in
the returned JSON and retrieve the screen_name for each user. Then we use the SELECT
statement to see if we already have stored this particular screen_name in the database and
retrieve the friend count (friends) if the record exists.

countnew = 0
countold = 0
for u in js['users']:
    friend = u['screen_name']
    print(friend)
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
        (friend, ))
    try:
        count = cur.fetchone()[0]
        cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
            (count+1, friend))
        countold = countold + 1
    except:
        cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
            VALUES (?, 0, 1)''', (friend, ))
        countnew = countnew + 1
print('New accounts=', countnew, ' revisited=', countold)
conn.commit()
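
The update-or-insert step can be tried without calling Twitter. The sketch below is an illustration under stated assumptions: record_friend is a hypothetical helper (not part of the program above), the screen names are made up, and it tests explicitly for a missing row (fetchone() returns None) instead of relying on a bare except.

```python
import sqlite3

def record_friend(cur, friend):
    """Add 1 to the friend's count, or insert the account with a count of 1.

    Returns True when a new row was inserted. (Hypothetical helper.)
    """
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1', (friend,))
    row = cur.fetchone()  # None when no row matches
    if row is None:
        cur.execute('INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)',
                    (friend,))
        return True
    cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?', (row[0] + 1, friend))
    return False

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')

# Made-up screen names standing in for the 'users' list in the JSON.
results = [record_friend(cur, name) for name in ['alice', 'bob', 'alice']]
conn.commit()

cur.execute('SELECT name, friends FROM Twitter ORDER BY name')
table = cur.fetchall()
print(results, table)
conn.close()
```

Checking for None keeps unrelated errors (for example a misspelled column name) from being silently treated as "row not found".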

So the first time the program runs and we enter a Twitter account, the program runs as
follows:

Enter a Twitter account, or quit: drchuck
Retrieving http://api.twitter.com/1.1/friends ...
New accounts= 20 revisited= 0
Enter a Twitter account, or quit: quit

To see what was stored, we can write a simple database dumper:

import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
cur.execute('SELECT * FROM Twitter')
count = 0
for row in cur:
    print(row)
    count = count + 1
print(count, 'rows.')
cur.close()

This program simply opens the database and selects all of the columns of all of the
rows in the table Twitter, then loops through the rows and prints out each row.

Enter a Twitter account, or quit:
Retrieving http://api.twitter.com/1.1/friends ...
New accounts= 18 revisited= 2
Enter a Twitter account, or quit:
Retrieving http://api.twitter.com/1.1/friends ...
New accounts= 17 revisited= 3
Enter a Twitter account, or quit: quit

Since we pressed enter (i.e., we did not specify a Twitter account), the following code is executed:

if (len(acct) < 1):
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    try:
        acct = cur.fetchone()[0]
    except:
        print('No unretrieved twitter accounts found')
        continue

 We use the SQL SELECT statement to retrieve the name of the first (LIMIT 1) user who
still has their “have we retrieved this user” value set to zero. We also use the fetchone()[0]
pattern within a try/except block to either extract a screen_name from the retrieved data or
put out an error message and loop back up.
If we successfully retrieved an unprocessed screen_name, we retrieve their data as
follows:

url = twurl.augment(TWITTER_URL, {'screen_name': acct, 'count': '20'})
print('Retrieving', url)
connection = urlopen(url, context=ctx)
data = connection.read().decode()
js = json.loads(data)

cur.execute('UPDATE Twitter SET retrieved=1 WHERE name = ?', (acct, ))
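
The retrieved column effectively turns the table into a work queue. The following sketch shows that idea in isolation, assuming an in-memory database and made-up account names; next_unretrieved is a hypothetical helper, and the API call is replaced by a comment because no network access is attempted here.

```python
import sqlite3

def next_unretrieved(cur):
    """Return the name of one account with retrieved = 0, or None. (Hypothetical helper.)"""
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    row = cur.fetchone()
    return None if row is None else row[0]

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
cur.executemany('INSERT INTO Twitter (name, retrieved, friends) VALUES (?, ?, 1)',
                [('alice', 1), ('bob', 0), ('carol', 0)])

visited = []
while True:
    acct = next_unretrieved(cur)
    if acct is None:
        break  # nothing left to process
    # A real spider would call the API here; this sketch only marks the account done.
    cur.execute('UPDATE Twitter SET retrieved = 1 WHERE name = ?', (acct,))
    visited.append(acct)
conn.commit()
print(visited)
conn.close()
```

Because progress is stored in the database, the program can be stopped and restarted without revisiting accounts it has already processed.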

If we run the friend program and press enter twice to retrieve the next unvisited
friend’s friends, then run the dumping program, it will give us the following output:
Basic data modeling
 The real power of a relational database is when we create multiple tables and
make links between those tables.
 The act of deciding how to break up your application data into multiple tables
and establishing the relationships between the tables is called data modeling.

 The design document that shows the tables and their relationships is called a
data model.
 To record who follows whom, we can create a new table that keeps track of pairs of friends. The following is a simple
way of making such a table:

CREATE TABLE Pals (from_friend TEXT, to_friend TEXT)

Each time we encounter a person who drchuck is following, we would insert a row of
the form:

INSERT INTO Pals (from_friend, to_friend) VALUES ('drchuck', 'lhawthorn')

As we are processing the 20 friends from the drchuck Twitter feed, we will insert
20 records with “drchuck” as the first parameter so we will end up duplicating the
string many times in the database.
 To avoid this duplication, we store each account name only once, in a table called People, and
refer to it elsewhere by a numeric key. The People table has an additional column to store the numeric key associated with the
row for this Twitter user. SQLite has a feature that automatically adds the key value for
any row we insert into a table using a special type of data column (INTEGER PRIMARY
KEY).
We can create the People table with this additional id column as follows:

CREATE TABLE People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)

 Now instead of creating the table Pals above, we create a table called Follows with
two integer columns from_id and to_id and a constraint on the table that the combination of
from_id and to_id must be unique in this table (i.e., we cannot insert duplicate rows)
in our database.

CREATE TABLE Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))
 When we add UNIQUE clauses to our tables, we are communicating a set of rules that we
are asking the database to enforce when we attempt to insert records.
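
What "enforce" means in practice can be seen by breaking the rules on purpose. This is a minimal sketch, assuming an in-memory database: with a plain INSERT, Python's sqlite3 module raises sqlite3.IntegrityError when a UNIQUE rule is violated.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)')
cur.execute('CREATE TABLE Follows (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))')

# The id column is filled in automatically; lastrowid reports the assigned key.
cur.execute('INSERT INTO People (name, retrieved) VALUES (?, 0)', ('drchuck',))
first_id = cur.lastrowid

# A second row with the same name violates the UNIQUE rule on name.
try:
    cur.execute('INSERT INTO People (name, retrieved) VALUES (?, 0)', ('drchuck',))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The same pair twice violates UNIQUE(from_id, to_id); the reversed pair is a different row.
cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (1, 2)')
cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (2, 1)')
try:
    cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (1, 2)')
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

print(first_id, duplicate_rejected, pair_rejected)
conn.close()
```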
Programming with multiple tables

Figure 4: Relationships Between Tables

import urllib.request, urllib.parse, urllib.error
import twurl
import json
import sqlite3
import ssl

TWITTER_URL = 'https://api.twitter.com/1.1/friends/list.json'

conn = sqlite3.connect('friends.sqlite')
cur = conn.cursor()

cur.execute('''CREATE TABLE IF NOT EXISTS People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE IF NOT EXISTS Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    acct = input('Enter a Twitter account, or quit: ')
    if (acct == 'quit'): break
    if (len(acct) < 1):
        cur.execute('SELECT id, name FROM People WHERE retrieved = 0 LIMIT 1')
        try:
            (id, acct) = cur.fetchone()
        except:
            print('No unretrieved Twitter accounts found')
            continue
    else:
        cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1',
            (acct, ))
        try:
            id = cur.fetchone()[0]
        except:
            cur.execute('''INSERT OR IGNORE INTO People
                (name, retrieved) VALUES (?, 0)''', (acct, ))
            conn.commit()
            if cur.rowcount != 1:
                print('Error inserting account:', acct)
                continue
            id = cur.lastrowid

    url = twurl.augment(TWITTER_URL, {'screen_name': acct, 'count': '100'})
    print('Retrieving account', acct)
    try:
        connection = urllib.request.urlopen(url, context=ctx)
    except Exception as err:
        print('Failed to Retrieve', err)
        break

    data = connection.read().decode()
    headers = dict(connection.getheaders())

    print('Remaining', headers['x-rate-limit-remaining'])

    try:
        js = json.loads(data)
    except:
        print('Unable to parse json')
        print(data)
        break

    # Debugging
    # print(json.dumps(js, indent=4))

    if 'users' not in js:
        print('Incorrect JSON received')
        print(json.dumps(js, indent=4))
        continue

    cur.execute('UPDATE People SET retrieved=1 WHERE name = ?', (acct, ))

    countnew = 0
    countold = 0
    for u in js['users']:
        friend = u['screen_name']
        print(friend)
        cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1',
            (friend, ))
        try:
            friend_id = cur.fetchone()[0]
            countold = countold + 1
        except:
            cur.execute('''INSERT OR IGNORE INTO People (name, retrieved)
                VALUES (?, 0)''', (friend, ))
            conn.commit()
            if cur.rowcount != 1:
                print('Error inserting account:', friend)
                continue
            friend_id = cur.lastrowid
            countnew = countnew + 1
        cur.execute('''INSERT OR IGNORE INTO Follows (from_id, to_id)
            VALUES (?, ?)''', (id, friend_id))
    print('New accounts=', countnew, ' revisited=', countold)
    print('Remaining', headers['x-rate-limit-remaining'])
    conn.commit()
cur.close()

# Code: http://www.py4e.com/code3/twfriends.py

This program is starting to get a bit complicated, but it illustrates the patterns
that we need to use when we are using integer keys to link tables. The basic patterns
are:

1. Create tables with primary keys and constraints.
2. When we have a logical key for a person (i.e., account name) and we need the
id value for the person, depending on whether or not the person is
already in the People table we either need to: (1) look up the person in
the People table and retrieve the id value for the person or (2) add the
person to the People table and get the id value for the newly
added row.
3. Insert the row that captures the “follows” relationship.

Constraints in database tables


 As we design our table structures, we can tell the database system that we would
like it to enforce a few rules on us. These rules help keep us from making mistakes and
introducing incorrect data into our tables. When we create our tables:

cur.execute('''CREATE TABLE IF NOT EXISTS People
    (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)''')
cur.execute('''CREATE TABLE IF NOT EXISTS Follows
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

We indicate that the name column in the People table must be UNIQUE. We also
indicate that the combination of the two numbers in each row of the Follows table
must be unique. These constraints keep us from making mistakes such as adding the
same relationship more than once.
We can take advantage of these constraints in the following code:

cur.execute('''INSERT OR IGNORE INTO People (name, retrieved)
    VALUES (?, 0)''', (friend, ))

We add the OR IGNORE clause to our INSERT statement to indicate that if this
particular INSERT would cause a violation of the “name must be unique” rule, the
database system is allowed to ignore the INSERT. We are using the database
constraint as a safety net to make sure we don’t inadvertently do something incorrect.
Similarly, the following code ensures that we don’t add the exact same Follows
relationship twice.

cur.execute('''INSERT OR IGNORE INTO Follows
    (from_id, to_id) VALUES (?, ?)''', (id, friend_id))

Again, we simply tell the database to ignore our attempted INSERT if it would
violate the uniqueness constraint that we specified for the Follows rows.
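
The effect of OR IGNORE is easy to observe through cur.rowcount. Here is a minimal sketch, assuming an in-memory database: the same name is inserted twice, and the second attempt is skipped without raising an error.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)')

rowcounts = []
for name in ['drchuck', 'drchuck']:
    # The second INSERT would break the UNIQUE rule, so SQLite skips it.
    cur.execute('INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)', (name,))
    rowcounts.append(cur.rowcount)  # rows actually inserted by this statement

cur.execute('SELECT COUNT(*) FROM People')
total = cur.fetchone()[0]
print(rowcounts, total)
conn.close()
```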

Retrieve and/or insert a record


When we prompt the user for a Twitter account, if the account exists, we must
look up its id value. If the account does not yet exist in the People table, we must
insert the record and get the id value from the inserted row.
This is a very common pattern and is done twice in the program above. This code
shows how we look up the id for a friend’s account when we have extracted a
screen_name from a user node in the retrieved Twitter JSON.
Since over time it will be increasingly likely that the account will already be in
the database, we first check to see if the People record exists using a SELECT
statement.
If all goes well[2] inside the try section, we retrieve the record using fetchone()
and then retrieve the first (and only) element of the returned tuple and store it in
friend_id.
If the SELECT finds no matching row, fetchone() returns None, so the fetchone()[0]
code will fail and control will transfer into the except section.

friend = u['screen_name']
cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1',
    (friend, ))
try:
    friend_id = cur.fetchone()[0]
    countold = countold + 1
except:
    cur.execute('''INSERT OR IGNORE INTO People (name, retrieved)
        VALUES (?, 0)''', (friend, ))
    conn.commit()
    if cur.rowcount != 1:
        print('Error inserting account:', friend)
        continue
    friend_id = cur.lastrowid
    countnew = countnew + 1

[2] In general, when a sentence starts with “if all goes well” you will find that the code needs
to use try/except.

If we end up in the except code, it simply means that the row was not found, so we
must insert the row. We use INSERT OR IGNORE just to avoid errors and then
call commit() to force the database to really be updated. After the write is done, we
can check the cur.rowcount to see how many rows were affected. Since we are
attempting to insert a single row, if the number of affected rows is something other
than 1, it is an error.
If the INSERT is successful, we can look at cur.lastrowid to find out what
value the database assigned to the id column in our newly created row.
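
Because this look-up-or-insert pattern appears twice in the program, it could be wrapped in a function. The sketch below is one possible shape, not the program's own code: get_or_create_id is a hypothetical helper, it runs against an in-memory database, and it tests for None rather than using try/except.

```python
import sqlite3

def get_or_create_id(cur, name):
    """Return the id for name, inserting a new People row if needed. (Hypothetical helper.)"""
    cur.execute('SELECT id FROM People WHERE name = ? LIMIT 1', (name,))
    row = cur.fetchone()
    if row is not None:
        return row[0]  # the account is already in the table
    cur.execute('INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)', (name,))
    if cur.rowcount != 1:
        raise RuntimeError('Error inserting account: ' + name)
    return cur.lastrowid  # key assigned to the newly inserted row

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)')

ids = [get_or_create_id(cur, n) for n in ['drchuck', 'opencontent', 'drchuck']]
conn.commit()
print(ids)
conn.close()
```

The third call returns the key assigned on the first call, so the same account never receives two rows.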

Storing the friend relationship

Once we know the key value for both the Twitter user and the friend in the JSON, it
is a simple matter to insert the two numbers into the Follows table with the
following code:

cur.execute('INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (?, ?)',
    (id, friend_id))

Notice that we let the database take care of keeping us from “double-inserting” a
relationship by creating the table with a uniqueness constraint and then adding OR
IGNORE to our INSERT statement.
Here is a sample execution of this program:

Enter a Twitter account, or quit:
No unretrieved Twitter accounts found
Enter a Twitter account, or quit: drchuck
Retrieving http://api.twitter.com/1.1/friends ...
New accounts= 20 revisited= 0
Enter a Twitter account, or quit:
Retrieving http://api.twitter.com/1.1/friends ...
New accounts= 17 revisited= 3
Enter a Twitter account, or quit:
Retrieving http://api.twitter.com/1.1/friends ...
New accounts= 17 revisited= 3
Enter a Twitter account, or quit: quit

We started with the drchuck account and then let the program automatically pick the
next two accounts to retrieve and add to our database.
The following are the first few rows in the People and Follows tables after this run
is completed:

People:
(1, 'drchuck', 1)
(2, 'opencontent', 1)
(3, 'lhawthorn', 1)
(4, 'steve_coppin', 0)
(5, 'davidkocher', 0)
55 rows.
Follows:
(1, 2)
(1, 3)
(1, 4)
(1, 5)
(1, 6)
60 rows.

You can see the id, name, and retrieved fields in the People table and you see
the numbers of both ends of the relationship in the Follows table. In the People
table, we can see that the first three people have been visited and their data has been
retrieved. The data in the Follows table indicates that drchuck (user 1) is a friend
to all of the people shown in the first five rows. This makes sense because the first
data we retrieved and stored was the Twitter friends of drchuck. If you were to
print more rows from the Follows table, you would see the friends of users 2 and 3
as well.

Three kinds of keys


Now that we have started building a data model putting our data into multiple linked
tables and linking the rows in those tables using keys, we need to look at some
terminology around keys. There are generally three kinds of keys used in a database
model.
• A logical key is a key that the “real world” might use to look up a row. In
our example data model, the name field is a logical key. It is the screen
name for the user and we indeed look up a user’s row several times in the
program using the name field. You will often find that it makes sense
to add a UNIQUE constraint to a logical key. Since the logical key is how we
look up a row from the outside world, it makes little sense to allow multiple
rows with the same value in the table.
• A primary key is usually a number that is assigned automatically by the
database. It generally has no meaning outside the program and is only
used to link rows from different tables together. When we want to
look up a row in a table, usually searching for the row using
the primary key is the fastest way to find the row. Since primary keys are
integer numbers, they take up very little storage and can be
compared or sorted very quickly. In our data model, the id
field is an example of a primary key.
• A foreign key is usually a number that points to the primary key of an
associated row in a different table. An example of a foreign key in
our data model is the from_id.

We are using a naming convention of always calling the primary key field name id
and appending the suffix _id to any field name that is a foreign key.
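
The three kinds of keys can be seen side by side in a small sketch. Note the assumptions: the notes' Follows table does not declare its foreign keys to the database, so the REFERENCES clauses and the PRAGMA foreign_keys setting below are additions made for illustration (SQLite only enforces foreign keys when that pragma is switched on), and the database is in-memory.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
# SQLite only enforces REFERENCES clauses when this pragma is switched on.
cur.execute('PRAGMA foreign_keys = ON')

# id is the primary key; name is the logical key (UNIQUE).
cur.execute('CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)')
# REFERENCES is an addition for this sketch; the notes' Follows table omits it.
cur.execute('''CREATE TABLE Follows
    (from_id INTEGER REFERENCES People(id),
     to_id INTEGER REFERENCES People(id),
     UNIQUE(from_id, to_id))''')

cur.executemany('INSERT INTO People (name, retrieved) VALUES (?, 0)',
                [('drchuck',), ('opencontent',)])

# Logical key -> primary key: look the row up by name, then work with its id.
cur.execute('SELECT id FROM People WHERE name = ?', ('drchuck',))
drchuck_id = cur.fetchone()[0]

# Foreign keys: both numbers point at existing People rows.
cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (?, ?)', (drchuck_id, 2))

# A foreign key that points at no People row is rejected.
try:
    cur.execute('INSERT INTO Follows (from_id, to_id) VALUES (?, ?)', (drchuck_id, 99))
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

print(drchuck_id, dangling_rejected)
conn.close()
```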

Using JOIN to retrieve data
Now that we have followed the rules of database normalization and have data
separated into two tables, linked together using primary and foreign keys, we need to
be able to build a SELECT that reassembles the data across the tables.
SQL uses the JOIN clause to reconnect these tables. In the JOIN clause you specify
the fields that are used to reconnect the rows between the tables.
The following is an example of a SELECT with a JOIN clause:

SELECT * FROM Follows JOIN People
    ON Follows.from_id = People.id WHERE People.id = 1

The JOIN clause indicates that the fields we are selecting cross both the Follows
and People tables. The ON clause indicates how the two tables are to be joined:
Take the rows from Follows and append the row from People where the field
from_id in Follows is the same as the id value in the People table.

People
id   name          retrieved
1    drchuck       1
2    opencontent   1
3    lhawthorn     1
4    steve_coppin  0
...

Joined rows (People, Follows, People)
name     id   from_id  to_id  name
drchuck  1    1        2      opencontent
drchuck  1    1        3      lhawthorn
drchuck  1    1        4      steve_coppin

Figure 5: Connecting Tables Using JOIN


The result of the JOIN is to create extra-long “metarows” which have both the
fields from People and the matching fields from Follows. Where there is more
than one match between the id field from People and the from_id from Follows,
then JOIN creates a metarow for each of the matching pairs of rows, duplicating
data as needed.
The following code demonstrates the data that we will have in the database after the
multi-table Twitter spider program (above) has been run several times.

import sqlite3

conn = sqlite3.connect('friends.sqlite')
cur = conn.cursor()

cur.execute('SELECT * FROM People')
count = 0
print('People:')
for row in cur:
    if count < 5: print(row)
    count = count + 1
print(count, 'rows.')

cur.execute('SELECT * FROM Follows')
count = 0
print('Follows:')
for row in cur:
    if count < 5: print(row)
    count = count + 1
print(count, 'rows.')

cur.execute('''SELECT * FROM Follows JOIN People
    ON Follows.to_id = People.id
    WHERE Follows.from_id = 2''')
count = 0
print('Connections for id=2:')
for row in cur:
    if count < 5: print(row)
    count = count + 1
print(count, 'rows.')

cur.close()

# Code: http://www.py4e.com/code3/twjoin.py

In this program, we first dump out the People and Follows and then dump out
a subset of the data in the tables joined together.
Here is the output of the program:

python twjoin.py
People:
(1, 'drchuck', 1)
(2, 'opencontent', 1)
(3, 'lhawthorn', 1)
(4, 'steve_coppin', 0)
(5, 'davidkocher', 0)
55 rows.
Follows:
(1, 2)
(1, 3)
(1, 4)
(1, 5)
(1, 6)
60 rows.
Connections for id=2:
(2, 1, 1, 'drchuck', 1)
(2, 28, 28, 'cnxorg', 0)
(2, 30, 30, 'kthanos', 0)
(2, 102, 102, 'SomethingGirl', 0)
(2, 103, 103, 'ja_Pac', 0)
20 rows.

You see the columns from the People and Follows tables and the last set of rows
is the result of the SELECT with the JOIN clause.
In the last select, we are looking for accounts that are friends of “opencontent”
(i.e., Follows.from_id = 2).
In each of the “metarows” in the last select, the first two columns are from the
Follows table followed by columns three through five from the People table. You
can also see that the second column (Follows.to_id) matches the third column
(People.id) in each of the joined-up “metarows”.
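
Since SELECT * repeats the matching key in every metarow, it is often clearer to name the columns we want. The following is a minimal sketch, assuming an in-memory database loaded with a few of the rows from the sample output above; the explicit column list and the ORDER BY clause are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)')
cur.execute('CREATE TABLE Follows (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))')

# A few rows copied from the sample output: three people and three relationships.
cur.executemany('INSERT INTO People (id, name, retrieved) VALUES (?, ?, ?)',
                [(1, 'drchuck', 1), (2, 'opencontent', 1), (3, 'lhawthorn', 1)])
cur.executemany('INSERT INTO Follows (from_id, to_id) VALUES (?, ?)',
                [(1, 2), (1, 3), (2, 1)])

# Join each Follows row to the People row of the account being followed.
cur.execute('''SELECT Follows.from_id, Follows.to_id, People.name
               FROM Follows JOIN People ON Follows.to_id = People.id
               WHERE Follows.from_id = ?
               ORDER BY Follows.to_id''', (1,))
followed_by_1 = cur.fetchall()
print(followed_by_1)
conn.close()
```

Each result row now carries the two key values plus the screen name that to_id refers to, without the duplicated People.id column.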
