0% found this document useful (0 votes)
40 views4 pages

03 - Worked Example Twfriends Py Chapter 15.en

The document discusses a Python script that crawls Twitter to build a social network graph by collecting a user's friends and storing the data in a SQLite database with two tables in a many-to-many relationship. The script checks if Twitter accounts exist in the database, inserts new ones, retrieves friends and inserts those, and records the relationships between accounts.

Uploaded by

Box Box
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
40 views4 pages

03 - Worked Example Twfriends Py Chapter 15.en

The document discusses a Python script that crawls Twitter to build a social network graph by collecting a user's friends and storing the data in a SQLite database with two tables in a many-to-many relationship. The script checks if Twitter accounts exist in the database, inserts new ones, retrieves friends and inserts those, and records the relationships between accounts.

Uploaded by

Box Box
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
You are on page 1/ 4

Hello everybody. Welcome to Python for Everybody.

We are doing some code walk-


throughs. If you want to follow along with the code, you can download the source
code from Python for Everybody website. So, the code we're playing with today is
twfriends.py. This is a step beyond the simple twspider. It is a restartable
spider. But we're going to date the model things a little bit differently. We're
going to have two tables and we're going to have a many-to-many relationship
accepted. It's a many-to-many relationship between the same table, which is okay.
Twitter friends are a directional relationship. So, we start out here in
twfriends.py. Remember, that the file hidden.py, I'll show it to you, but I'm not
going to open it because I've got my keys and secrets in it. So, this hidden.py
file, you've got to edit that, and you got to go to apps.twitter.com and get your
keys and put them in there. Otherwise, these things won't work. But if you have
Twitter and you set your API keys up, and you put them in the hidden.py, then all
these things will work. It's fun actually, and impressive. Not hard to do actually.
So, the Twitter URL, that's my library, that reads hidden.py and augments the URL
does all the OAuth stuff, JSON and SSL because Python doesn't accept any
certificates even if they're good certificates. So, we crush that. Here's our
friends list that we're going to hit. We're going to make a database
friends.sqlite. Now, here we're doing a create table if not exists. So, what this
really is saying is, I want this to be a restartable process and I don't want to
lose the data. We're starting out, we do not have any SQLite files. So, this is
going to create the database and create these tables. But the second time we run
it, we're not going to recreate the tables. We're not going to be able to restart
this, because we're going to run out of rate limit before we finish this. So, we
just have to wait, however long the time it takes to reset, and we'll watch the
rate limit go down. So, we're going to have a people table, and we're going to have
a primary key and the name. The name is going to be unique, and whether or not
we've retrieved it. That's from a previous one. Then there's the who follows who,
the from ID to, to ID. So, this is a direction and we're going to put a uniqueness
constraint in, just like we do in many-to-many, that basically says, the
combination of from ID and to ID has got to be unique. We don't allow ourselves to
put duplicates of the combination. So, from ID can be one in many records, and to
ID can be one in many records, but one, one is only allowed once. This is the crud
we have to do to convince Python to accept the Twitter certificate. So, this is
similar to some of the other stuff that we've done. We're going to enter a Twitter
account or quit. If we enter by itself, then we will actually go and retrieve a
record that was not yet retrieved. Now, we're actually pulling out two values, ID
and name. So, we will grab, fetch one, is going to give us a two tuple basically.
We're going to store that in ID, in account. Of course, that's like, this is coming
back with a two tuple, first of which is the ID from the database. Limit one, means
we're only going get one of these or zero of these. If there's zero these, that
means there are no unretrieved Twitter accounts. Retrieved equals zero, well,
you'll see in a second, that all the new accounts we put in, are the ones for which
we haven't retrieved. Again, given that our rate limit, we want to know which ones
we've retrieved. Okay? So, what we're going do next is, we're going to check to see
if the person that we just checked, which means the length of the account is
greater than what we just entered. We're going to check to see if they're already
there. Okay? We're going to select ID from people where name equals. So, that's the
one we just entered. We're going to fetch one and grab the first thing, because we
only got one thing in the select statement here. If this person that we just asked
to see is not in the table, that means this is going to fail, we're going to do an
insert or ignore. This our ignore is redundant because we just checked to see if it
was there. But we'll put that in just to be safe. We're going to put the name in as
the new account that we're looking at, indicating that retrieved is zero. So, that
we will know that we haven't retrieved it yet. You'll see that we'll update that in
a second. We commit it, so that later selects we'll see this. So, you got to do the
commit. This later select wouldn't see the one we just inserted. We're going to ask
how many rows were affected, and if it's not equal to one, then, we're going to
complain about- we inserted it, and we're going to do this thing. We are going to
ask, "Hey remember there was an ID up there?" Right here, ID integer, primary key,
and we did not insert this here, but we want to know what that ID is. Every time I
was showing you that in lectures, I was saying it's really easy in Python to do
this. That's what we're saying. This cursor did the insert, but one of the things
happens is after the insert, we're going to grab the last row ID, which is the
primary key that was assigned by SQL. Okay? So, that means, that one way or
another, coming through this code here in line 45, one way or another, we're either
going to know the ID of the user that was there before, or we just inserted one and
so we're going to know the primary key of the current user. You'll see why we need
that. So, ID, is the primary key of the current user that we entered right here.
Okay? Now, what we're going to do is do the Twitter URL augment with OAuth, and all
the keys, and the secrets in hidden.py. Instead, we're going go through let's count
1,000. Let's go count, what the heck. Let's go 200. Up to 200 friends. No, let's do
100. We'll keep it that way. Then, we're going to retrieve it and we're retrieving
the account. We're not going to print the nasty URL out, we could. Then we're going
open the URL with a connection, and then we're going to read that, and we're going
get the UTF-8 data from this, and then we're going to decode that, and we're going
to have the unicode data. So, the data in string is a internal Python string, with
all that data representing all the wonderful characters. Of course, we're going to
ask URL open to give us back the headers as a dictionary, using this call. We can
see how many we have left for the remaining. Right? What's the remaining rate limit
that we have? So, then we're going to parse the data with JSON load S. Wait, I need
to continue in here. Continue. Okay. Save. If we're going to parse this data, we'll
print it out, right? So, that means that this died, which means it's not
syntactically correct, json basically. Who knows if we're ever going to see that,
but at least, when it blows up it'll print this data out. We'll have to catch it,
and then it'll continue. Actually, I'll make this a break, because if that's blown
up that bad, we should quit. Now, I don't yet know what happens when this rate
limit says you can't have it, and so but I do know that I expect when it's
successful, that there will be a key of users in this outer dictionary that we're
going to get. If this outer dictionary, that if users is not in the parsed
dictionary, then I'm going to dump out this data so that at least I can debug what
happens when I've got some broken json. So, the difference between this code, this
code is going to fail when the json is syntactically bad, meaning a curly brace
isn't right or whatever. This code will trigger when I get good json, but I don't
have a user's key in it, okay? So, then, once we've retrieved it, we will be pretty
happy with it. We're going to update for our account that we're retrieving. We're
going to set this is one of our retrieved accounts, okay? Then, what we're going to
do is write a loop that goes through all the friends of this particular user that
we're asking, and gets their screen name, prints it out, and then we're going to
check to see if this one is already in our people database, because this is a
spider, we're grabbing accounts. So, we'll do a friend ID, and do a fetch one, grab
the subzero thing. If that works, if this person not in there, this fetch one is
going to blow up which means we're going to drop down to the except code. But if it
does work, we have friend ID, that if they're there, and they're already in our
database. They just weren't retrieved. Okay. So now, if the friend ID wasn't there,
we're going to do an insert into, setting retrieved to zero, and then we're going
to commit. Now, remember, row count is how many rows were affected by this last
transaction code at row count, and we're going to die if that insert doesn't work.
This is unlikely, unless somehow we've ran out a disc drive or something, and we're
going to grab the friend ID as the key, the last row that was inserted. We're only
going insert one row, so it's basically the primary key of the row that we just
inserted. So, if you look at this code right here, it comes out the bottom, one way
or another with friend ID successful. Friend ID is either they are already in our
database or they're not, and if we insert them, then we have it. So now, this count
new and count old is just so I can print out a nice print out. Now, we are going to
insert into the friend's table which is called the follows table in this case, from
ID and to ID. Those are the two outward pointing foreign keys, and we have the ID
of the account that we're retrieving the friends of, and then this particular
friend. So, we're inserting the connection from this person to that person then we
commit it. We want to commit these again so that later selects when the loop goes
back up, later selects get all of that data that's going on. So, we do want to
commit from time to time, and then we close the
cursor at the very end. So, let's run this and see what happens. So, python
twfriends.py. Of course, I am a refugee from python two, so I always forget to type
python three. So, we're going to start. If we take a look right now, I'm going to
start another tab over here, and ls-l *sqlite. Now, that sqlite file is there, and
it's actually made the tables. If you go up here, it ran all this stuff, create the
tables, yada, yada, and we're sitting right here at this line. As a matter of fact,
I think without causing too much trouble, I can open that database, and get into
this database right here. There is no data in the Follows table, and there is no
data in the People table. It's completely empty. So, we're waiting for the first
one, and I'll go with mine, Doctor Chuck. So, it's retrieving the 100 friends, and
they all were brand new. They are all inserted. So now, if I hit Refresh, we will
see that Doctor Chuck is retrieved. Who follows? So, these are all the people I
follow, for one follows two. So, if we look at here, we see that Doctor Chuck
follows Stephanie Teasley. Because we grabbed the followers of Dr. Chuck, we're
going to have a record in all of the follows for all the ones that I did, right?
So, these are all the people I followed, and we put them in. So, we can go back and
we can, let's see grab somebody. Let's go grab Stephanie Teasley, and let's pull
out her friends. So, we grabbed 100 of her folks. I got 14 left. That's my x-rate
limit. I did Stephanie Teasley, so let's go back here. So, you'll notice there's
101, that row is going to be 182. That's interesting. So, we've retrieved Doctor
Chuck and Stephanie Teasley, and let's go take a look in the Friends table, the
Follows table. So, we have all the people I follow, now all the people Stephanie
follows. So, there we go. Let's go ahead and do somebody else. Let's see here. I
think we both follow Tim McKay. Where is Tim McKay? Let's follow Tim McKay. Let's
see who to Tim follows, If we can get like an overlap. We revisited some. Let's see
if we can see this in the Follows. Let's see People. So, we've got Dr. Chuck
retrieved. Tim McKay's somewhere down here. It might take us a while before we get
any really good overlaps. Let's see. Let's do a database called. Let's see. Let's
do a database SQL. Select, count. Yeah. Okay. So, let's just run this some more.
It's clearly working. Now, one thing I can do here is I can hit "Enter", and it
will just pick one randomly. So, it grabbed Liveedutv. Let's see how many I got
left. We got 12 left. Now, I can hit "Enter" again. It picks another one. That was
the next one. It's picking them in order. Is it picking them in order? Let's go to
"People". Yeah, it's picking these. So, we can see that it's going to just do the
first unretrieved person, who's Nancy. Let it retrieve Nancy. So, it grabbed Nancy,
new. So, we're finding some and this table's getting really big. So, if we look at
the "People" table, we now have 455 people and we have 467 following records. So,
there we go. Hit "Enter", does another one, and away we go. So, you get the idea. I
can type quit to finish. Just to give you a little interesting bit of code to show
you how to do selects, I'm going to do this twjoin. Now, you'll notice that we're
not talking, let's show you one thing, ls minus l star sqlite. So, this database
has it. So, I can restart this process and run it again. The database is still
there. So, we just grab, swear_trek. So, we can keep doing this. So, this data, it
keeps extending. So, this is a restartable process. I can run it. Then, tell it to
grab the next unretrieved one. So, away we go. So, that's part of it. I've got
eight left. How many do I have left, really? Let's keep going. How many do I got
left? I got five left. Okay. Wait, I guess we'll just run it out. So, I got four
left. You know what I should do? I can't change the code. Yes, I can change the
code. I can stop the code and I can quit the code. So, what I'm going to do is I'm
going to change this code a little bit really quick and I'm going to print the
headers of rate limiting at the beginning and at the end. So, now I can run it
again. I changed the code. Hopefully, I didn't make a Python error. Tell it to go
get another one, Ana Navarro. So, I got three left. We'll see what happens when I
run out of rate limit. Run out of rate limit. So, we have one left. Hit "Enter" and
control K, OpenSource.org. So, we have zero left, that worked. Now, let's see what
happens. I don't know what happens next. We blew up. Too many requests. So, we got
a HTTP Error 429. So, that means that going from Marc Cuban, that was in line 48,
so the right thing to do would be in line 48. We should really put this in a "try
except" block. "try except" block because it gives us an error. Print,
fiddlesticks. How do I print the exception message? I always am forgetting. Print
"Failed to Retrieve". So, we'll put that in. Now, if I run it. Then, I have to put
a "break" here because that's not good. Break. Fail to Retrieve. See, I never know
how to print out the error message. Yeah. So, see that's the weird thing about
stuff is, that I don't ever remember enough. I don't remember the syntax, what I
say here, to print the error message out. So, I'm going to go to Google and I'm
going to say, "Print out the exception message in Python." Python 3, hello. So,
let's go find it here in the documentation. Except. Is this it? Is this what I say?
I just want to print out the message. That's it. Except. Let's try this. So, this
is part of Python programming is, for me at least, because I'm just not a genius
expert at this stuff. This is one thing I like about Python is you can guess stuff
and sometimes you guess right. So, there we go. We got the error, we got the nice
little error message, and we see error 429: Too Many Requests. So, that cleans that
up nicely. So, we have run out of requests. On that, it is a good time to say
thanks for listening. I hope that you found this valuable.

You might also like