Generator Tricks For Systems Programmers Back Story
Generator Tricks
For Systems Programmers
David Beazley
http://www.dabeaz.com
Presented at PyCon'2008
An Introduction
A Complaint
The coverage of generators in most Python
books is lame (mine included)
About Me
This Tutorial
Some more practical uses of generators
Focus is "systems programming"
Which loosely includes files, file systems,
My Story
Support Files
Disclaimer
Performance Details
There are some later performance numbers
Python 2.5.1 on OS X 10.4.11
All tests were conducted on the following:
Mac Pro 2x2.66 GHz Dual-Core Xeon
3 Gbytes RAM
WDC WD2500JS-41SGB0 Disk (250G)
Timings are 3-run average of 'time' command
Part I
Iteration
Consuming Iterables
Constructors
list(s), tuple(s), set(s), dict(s)
in operator
item in s
Iteration Example
Iteration Protocol
The reason why you can iterate over different
objects is that there is a specific protocol
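A minimal sketch of that protocol, written for Python 3 (the slides use Python 2.5, where the method is spelled .next() rather than the built-in next()). The variable names are illustrative only. This is roughly what a for-loop does on your behalf:

```python
# Python 3 sketch of the iteration protocol behind a for-loop.
items = [5, 4, 3]
it = iter(items)              # ask the object for an iterator
collected = []
while True:
    try:
        x = next(it)          # fetch the next item
    except StopIteration:     # raised when there are no more items
        break
    collected.append(x)
```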
Example use:
>>> c = countdown(5)
>>> for i in c:
...     print i,
...
5 4 3 2 1
>>>
Iteration Protocol
Iteration Commentary
Supporting Iteration
Generators
A generator is a function that produces a
Supporting Iteration
Generators
Behavior is quite different than normal func
Calling a generator function creates an
Sample implementation
class countdown(object):
    def __init__(self,start):
        self.count = start
    def __iter__(self):
        return self
    def next(self):
        if self.count <= 0:
            raise StopIteration
        r = self.count
        self.count -= 1
        return r
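For readers on Python 3, the same class can be sketched as follows. The only change assumed here is the method name: Python 3 looks for __next__() where Python 2 looked for next(). The class is renamed Countdown purely for illustration.

```python
# Python 3 rendition of the sample implementation above.
class Countdown:
    def __init__(self, start):
        self.count = start

    def __iter__(self):
        return self

    def __next__(self):            # spelled next() in Python 2
        if self.count <= 0:
            raise StopIteration
        r = self.count
        self.count -= 1
        return r

result = list(Countdown(5))
```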
def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>>
Notice that no output was produced
Generator Functions
Generator Expressions
A generated version of a list comprehension
Function starts
executing here
>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8
>>>
Generator Functions
Generator Expressions
Important differences from a list comp.
Example:
>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
>>> c = (2*x for x in a)
>>> c
<generator object at 0x58760>
>>>
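A short Python 3 sketch of two differences that are easy to check: a generator expression computes nothing until it is iterated, and it can be consumed only once. The variable names first_pass and second_pass are illustrative.

```python
a = [1, 2, 3, 4]
b = [2 * x for x in a]      # list: built immediately, can be reused
c = (2 * x for x in a)      # generator: nothing computed yet

first_pass = list(c)        # iterating consumes the generator
second_pass = list(c)       # it is now exhausted
```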
Generator Functions
Generator Expressions
General syntax
(expression for i in s if cond1
            for j in t if cond2
            ...
            if condfinal)
What it means
for i in s:
    if cond1:
        for j in t:
            if cond2:
                ...
                if condfinal: yield expression
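To make the expansion concrete, here is a small Python 3 sketch with made-up data and conditions. The generator expression and the hand-written generator function are meant to produce the same items.

```python
s = [1, 2, 3]
t = [10, 20]

# Generator expression with two for-clauses and two conditions
gen = (i * j for i in s if i != 2
             for j in t if j > 10)

# The equivalent nested loops, written as a generator function
def expanded():
    for i in s:
        if i != 2:
            for j in t:
                if j > 10:
                    yield i * j

from_expression = list(gen)
from_function = list(expanded())
```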
It just works
Copyright (C) 2008, http://www.dabeaz.com
A Note on Syntax
Example:
sum(x*x for x in s)
Generator expression
Interlude
A Non-Generator Soln
def countdown(n):
    while n > 0:
        yield n
        n -= 1
Generator expressions
A Generator Solution
Let's use some generator expressions
wwwlog     = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
Part 2
Programming Problem
Generators as a Pipeline
To understand the solution, think of it as a data
processing pipeline
[Figure: sample access-log lines -> bytecolumn -> bytes -> sum() -> total]
wwwlog     = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
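The same pipeline can be tried without a real log file. The Python 3 sketch below substitutes a list of made-up log lines for the open file (sample_lines and its contents are invented for illustration) and renames bytes to byte_counts to avoid shadowing the Python 3 built-in.

```python
# sample_lines stands in for open("access-log"); the data is invented.
sample_lines = [
    '81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET /a.html HTTP/1.1" 200 7587',
    '81.107.39.38 - - [24/Feb/2008:00:09:01 -0600] "GET /b.html HTTP/1.1" 304 -',
    '66.249.72.134 - - [24/Feb/2008:00:09:05 -0600] "GET /c.html HTTP/1.1" 200 1000',
]

# Each stage is a generator; no work happens until sum() pulls items through.
bytecolumn = (line.rsplit(None, 1)[1] for line in sample_lines)
byte_counts = (int(x) for x in bytecolumn if x != '-')
total = sum(byte_counts)
```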
Being Declarative
At each step of the pipeline, we declare an
wwwlog
access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total
bytestr = line.rsplit(None,1)[1]
if bytestr != '-':
    bytes = int(bytestr)
Being Declarative
Commentary
Not only was it not slow, it was 5% faster
And it was less code
And it was relatively easy to read
And frankly, I like it a whole lot better...
Performance Contest
wwwlog     = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
Time
25.96
Performance
37.33
% ls -l big-access-log
-rw-r--r-- beazley 1303238000 Feb 29 08:06 big-access-log
Performance Contest
wwwlog = open("big-access-log")
total = 0
for line in wwwlog:
    bytestr = line.rsplit(None,1)[1]
    if bytestr != '-':
        total += int(bytestr)
Time
More Thoughts
The generator solution was based on the
Time
27.20
wwwlog     = open("big-access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
Time
25.96
find
Generate all filenames in a directory tree
that match a given filename pattern
import os
import fnmatch
def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat):
            yield os.path.join(path,name)
Examples
pyfiles = gen_find("*.py","/")
logs    = gen_find("access-log*","/usr/www/")
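A self-contained Python 3 sketch that exercises gen_find on a throwaway directory tree instead of "/". The temporary directory and the file names in it are invented for illustration.

```python
import fnmatch
import os
import tempfile

def gen_find(filepat, top):
    # Walk the tree and yield every path whose file name matches filepat
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))
    for rel in ("a.py", "b.txt", os.path.join("sub", "c.py")):
        open(os.path.join(top, rel), "w").close()
    found = sorted(os.path.relpath(p, top) for p in gen_find("*.py", top))
```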
Performance Contest
pyfiles = gen_find("*.py","/")
for name in pyfiles:
    print name
559s
Part 3
Fun with Files and Directories
468s
Performed on a 750GB file system
containing about 140000 .py files
Programming Problem
A File Opener
Open a sequence of filenames
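One way such an opener might look, sketched in Python 3. The body of gen_open is not shown in this text, so the version below is an assumption: it picks the opener from the file extension, which fits a log directory that mixes .gz, .bz2, and uncompressed files. The demo file names are invented.

```python
import bz2
import gzip
import os
import tempfile

def gen_open(filenames):
    # Assumed behavior: choose the opener from the file extension
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name, "rt")
        elif name.endswith(".bz2"):
            yield bz2.open(name, "rt")
        else:
            yield open(name, "r")

with tempfile.TemporaryDirectory() as d:
    plain = os.path.join(d, "access-log")
    zipped = os.path.join(d, "access-log.gz")
    with open(plain, "w") as f:
        f.write("plain line\n")
    with gzip.open(zipped, "wt") as f:
        f.write("gzip line\n")
    contents = []
    for f in gen_open([plain, zipped]):
        with f:                      # close each file once it has been read
            contents.append(f.read())
```

Note that each file is opened only when the consumer asks for it, so a long list of names does not open everything at once.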
foo/
    access-log-012007.gz
    access-log-022007.gz
    access-log-032007.gz
    ...
    access-log-012008
bar/
    access-log-092007.bz2
    ...
    access-log-022008
cat
os.walk()
A very useful function for searching the
file system
import os
def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item
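A quick Python 3 sketch of gen_cat on in-memory sequences (the sample data is invented). For comparison it also shows itertools.chain.from_iterable from the standard library, which concatenates iterables in the same way.

```python
import itertools

def gen_cat(sources):
    # Concatenate items from a sequence of iterables into one stream
    for s in sources:
        for item in s:
            yield item

sources = [["a1", "a2"], ("b1",), iter(["c1", "c2"])]
joined = list(gen_cat(sources))

# Standard-library equivalent
chained = list(itertools.chain.from_iterable([["a1", "a2"], ("b1",)]))
```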
Example:
Programming Problem
grep
Generate a sequence of lines that contain
Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat, loglines)
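The body of gen_grep does not appear in this text. A plausible Python 3 sketch, assuming it takes a regular expression and yields only the lines that match it somewhere, is shown here with invented sample lines:

```python
import re

def gen_grep(pat, lines):
    # Assumed behavior: yield the lines in which the regex pat is found
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line

loglines = ["GET /ply/ply.html", "GET /robots.txt", "GET /ply/index.html"]
patlines = list(gen_grep(r"ply", loglines))
```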
Example
pat        = r"somepattern"
logdir     = "/some/dir/"
filenames  = gen_find("access-log*",logdir)
logfiles   = gen_open(filenames)
loglines   = gen_cat(logfiles)
patlines   = gen_grep(pat,loglines)
bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
bytes      = (int(x) for x in bytecolumn if x != '-')
Important Concept
colnames = ('host','referrer','user','datetime',
            'method','request','proto','status','bytes')
log
{ 'status'  : '200',
  'proto'   : 'HTTP/1.1',
  'referrer': '-',
  'request' : '/ply/ply.html',
  'bytes'   : '97238',
  'datetime': '24/Feb/2008:00:08:59 -0600',
  'host'    : '140.180.132.213',
  'user'    : '-',
  'method'  : 'GET'}
groups
tuples
Tuples to Dictionaries
Field Conversion
Map specific dictionary fields through a function
def field_map(dictseq,name,func):
    for d in dictseq:
        d[name] = func(d[name])
        yield d
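A Python 3 sketch of how field_map might be chained to convert string fields into numbers. The sample records and the choice to map a '-' byte count to 0 are assumptions made for illustration.

```python
def field_map(dictseq, name, func):
    # Pass each dictionary through, replacing one field with func(field)
    for d in dictseq:
        d[name] = func(d[name])
        yield d

records = [{'status': '200', 'bytes': '97238'},
           {'status': '304', 'bytes': '-'}]

log = field_map(records, 'status', int)
# Assumption: treat a missing byte count ('-') as 0
log = field_map(log, 'bytes', lambda s: int(s) if s != '-' else 0)
converted = list(log)
```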
Part 4
Field Conversion
Example Use
It's easy
conversion
lines = lines_from_dir("access-log*","www")
log   = apache_log(lines)
for r in log:
    print r
A Query Language
Now that we have our log, let's do some queries
Find the set of all documents that 404
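A minimal Python 3 sketch of such a query. It assumes log is a sequence of dictionaries whose 'status' field has already been converted to an integer; the records here are invented.

```python
# Invented records standing in for the parsed log
log = [
    {'status': 404, 'request': '/missing.html'},
    {'status': 200, 'request': '/ply/ply.html'},
    {'status': 404, 'request': '/missing.html'},
    {'status': 404, 'request': '/gone.html'},
]

# A set removes duplicates, so each missing document appears once
missing = set(r['request'] for r in log if r['status'] == 404)
```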
lognames = gen_find("access-log*","www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
groups   = (logpat.match(line) for line in loglines)
tuples   = (g.groups() for g in groups if g)
colnames = ('host','referrer','user','datetime','method',
            'request','proto','status','bytes')
log
log
log
Packaging
Packaging
A Query Language
def apache_log(lines):
    groups   = (logpat.match(line) for line in lines)
    tuples   = (g.groups() for g in groups if g)
    colnames = ('host','referrer','user','datetime','method',
                'request','proto','status','bytes')
log
log
log
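The remaining steps of apache_log are cut off in this text. The Python 3 sketch below fills them in under stated assumptions: the regular expression logpat is a guess at a pattern for Common Log Format lines, and the tuples are turned into dictionaries with dict(zip(...)). Field conversion is left out.

```python
import re

# Assumed pattern for Common Log Format lines; not taken from this text.
logpat = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) (\S+)" (\S+) (\S+)')

colnames = ('host', 'referrer', 'user', 'datetime', 'method',
            'request', 'proto', 'status', 'bytes')

def apache_log(lines):
    groups = (logpat.match(line) for line in lines)
    tuples = (g.groups() for g in groups if g)       # skip unparsable lines
    log = (dict(zip(colnames, t)) for t in tuples)   # tuples to dictionaries
    return log

sample = ['140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] '
          '"GET /ply/ply.html HTTP/1.1" 200 97238',
          'not a log line']
records = list(apache_log(sample))
```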
A Query Language
import socket
for addr in addrs:
    try:
        print socket.gethostbyaddr(addr)[0]
    except socket.herror:
        print addr
return log
Performance Study
Infinite Sequences
lines = lines_from_dir("big-access-log",".")
lines = (line for line in lines if 'robots.txt' in line)
log   = apache_log(lines)
addrs = set(r['host'] for r in log)
Some Thoughts
Tailing a File
A Python version of 'tail -f'
import time
def follow(thefile):
    thefile.seek(0,2)      # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)    # Sleep briefly
            continue
        yield line
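Because follow() never terminates on its own, a runnable demonstration needs something to append to the file and a way to stop. The Python 3 sketch below uses a background thread as a stand-in writer (append_lines is a hypothetical helper, not part of the slides) and itertools.islice to take just two lines.

```python
import itertools
import os
import tempfile
import threading
import time

def follow(thefile):
    thefile.seek(0, os.SEEK_END)       # go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)            # sleep briefly
            continue
        yield line

def append_lines(path, lines):
    # Hypothetical writer standing in for a server appending to its log
    time.sleep(0.3)
    with open(path, "a") as out:
        for line in lines:
            out.write(line)
            out.flush()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "access-log")
    with open(path, "w") as f:
        f.write("old line\n")          # existing content is skipped
    with open(path) as logfile:
        loglines = follow(logfile)
        writer = threading.Thread(target=append_lines,
                                  args=(path, ["new 1\n", "new 2\n"]))
        writer.start()
        tail = list(itertools.islice(loglines, 2))
        writer.join()
```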
Example
Using our follow function
Part 5
logfile = open("access-log")
loglines = follow(logfile)
for line in loglines:
    print line,
Question
Example
% tail -f logfile
...
... lines of output ...
...
logfile  = open("access-log")
loglines = follow(logfile)
log      = apache_log(loglines)
Commentary
Generating Connections
Generate a sequence of TCP connections
import socket
def receive_connections(addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
    s.bind(addr)
    s.listen(5)
    while True:
        client = s.accept()
        yield client
Example:
Thoughts
Generating Messages
Receive a sequence of UDP messages
import socket
def receive_messages(addr,maxsize):
    s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
    s.bind(addr)
    while True:
        msg = s.recvfrom(maxsize)
        yield msg
Example:
for msg, addr in receive_messages(("",10000),1024):
    print msg, "from", addr
I/O Multiplexing
Generating I/O events on a set of sockets
import select
def gen_events(socks):
    while True:
        rdr,wrt,err = select.select(socks,socks,socks,0.1)
        for r in rdr:
            yield "read",r
        for w in wrt:
            yield "write",w
        for e in err:
            yield "error",e
Part 6
Feeding the Pipeline
Feeding Generators
I/O Multiplexing
import threading
clientset = []
def acceptor(sockset,addr):
    for c,a in receive_connections(addr):
        sockset.append(c)
acc_thr = threading.Thread(target=acceptor,
                           args=(clientset,("",12000)))
acc_thr.setDaemon(True)
acc_thr.start()
for evt,s in gen_events(clientset):
    if evt == 'read':
        data = s.recv(1024)
        if not data:
            print "Closing", s
            s.close()
            clientset.remove(s)
        else:
            print s,data
lines = open(filename)
Tailing a file
lines = follow(open(filename))
Consuming a Queue
Pickler/Unpickler
Turn a generated sequence into pickled objects
def gen_pickle(source):
    for item in source:
        yield pickle.dumps(item)
def consume_queue(thequeue):
    while True:
        item = thequeue.get()
        if item is StopIteration: break
        yield item
def gen_unpickle(infile):
    while True:
        try:
            item = pickle.load(infile)
            yield item
        except EOFError:
            return
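A Python 3 round trip through gen_pickle and gen_unpickle, using an in-memory io.BytesIO buffer in place of a socket or file. The buffer and the sample items are stand-ins chosen for illustration.

```python
import io
import pickle

def gen_pickle(source):
    # Turn each item into a pickled byte string
    for item in source:
        yield pickle.dumps(item)

def gen_unpickle(infile):
    # Read pickled objects back until the stream runs out
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return

buf = io.BytesIO()
for pitem in gen_pickle([{'status': 200}, ('a', 1), 42]):
    buf.write(pitem)
buf.seek(0)
restored = list(gen_unpickle(buf))
```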
Consuming a Queue
Sender/Receiver
Example: Sender
Example:
def sendto(source,addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(addr)
    for pitem in gen_pickle(source):
        s.sendall(pitem)
    s.close()
Example: Receiver
in_q = Queue.Queue()
con_thr = threading.Thread(target=consumer,args=(in_q,))
con_thr.start()
def receivefrom(addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
    s.bind(addr)
    s.listen(5)
    c,a = s.accept()
    for item in gen_unpickle(c.makefile()):
        yield item
    c.close()
for i in xrange(100):
    in_q.put(i)
in_q.put(StopIteration)
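The queue example can be run end to end with a small Python 3 sketch. The consumer function used as the thread target is not shown in this text, so the one below is a hypothetical stand-in that simply records what it receives. In Python 3 the module is named queue rather than Queue.

```python
import queue
import threading

def consume_queue(thequeue):
    # Yield items from the queue until the StopIteration sentinel arrives
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item

results = []

def consumer(q):
    # Hypothetical consumer: records everything it pulls off the queue
    for item in consume_queue(q):
        results.append(item)

in_q = queue.Queue()
con_thr = threading.Thread(target=consumer, args=(in_q,))
con_thr.start()

for i in range(100):
    in_q.put(i)
in_q.put(StopIteration)      # tell the consumer there is nothing more
con_thr.join()
```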
Example Use
Example: Read log lines and parse into records
# netprod.py
Part 7
lines = follow(open("access-log"))
log   = apache_log(lines)
sendto(log,("",15000))
Multiple Processes
Fanning Out
consumers?
[Figure: process 1, process 2]
[Figure: generator -> consumer1, consumer2, consumer3]
Broadcasting
Network Consumer
Example Usage:
class Stat404(NetConsumer):
    def send(self,item):
        if item['status'] == 404:
            NetConsumer.send(self,item)
lines = follow(open("access-log"))
log   = apache_log(lines)
stat404 = Stat404(("somehost",15000))
broadcast(log, [stat404])
processing
Consumers
Consumer Thread
Example:
class Consumer(object):
    def send(self,item):
        print self, "got", item
Example:
c1 = Consumer()
c2 = Consumer()
c3 = Consumer()
lines = follow(open("access-log"))
broadcast(lines,[c1,c2,c3])
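The broadcast function itself is not shown in this text. A minimal Python 3 sketch, assuming it simply calls .send(item) on every consumer for each item, could look like this. The Consumer here collects items in a list instead of printing so the result can be inspected.

```python
def broadcast(source, consumers):
    # Assumed behavior: hand every item to every consumer in turn
    for item in source:
        for c in consumers:
            c.send(item)

class Consumer:
    def __init__(self):
        self.received = []

    def send(self, item):
        self.received.append(item)

c1, c2, c3 = Consumer(), Consumer(), Consumer()
broadcast(["line 1\n", "line 2\n"], [c1, c2, c3])
```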
Consumers
Consumer Thread
def bytes_transferred(log):
    total = 0
    for r in log:
        total += r['bytes']
    print "Total bytes", total
c1 = ConsumerThread(find_404)
c1.start()
c2 = ConsumerThread(bytes_transferred)
c2.start()
Network Consumer
Multiple Sources
Example:
import socket,pickle
class NetConsumer(object):
    def __init__(self,addr):
        self.s = socket.socket(socket.AF_INET,
                               socket.SOCK_STREAM)
        self.s.connect(addr)
    def send(self,item):
        pitem = pickle.dumps(item)
        self.s.sendall(pitem)
    def close(self):
        self.s.close()
source2
source3
Concatenation
Multiplexing Generators
def gen_multiplex(genlist):
    item_q = Queue.Queue()
    def run_one(source):
        for item in source: item_q.put(item)
    def run_all():
        thrlist = []
        for source in genlist:
            t = threading.Thread(target=run_one,args=(source,))
            t.start()
            thrlist.append(t)
        for t in thrlist: t.join()
        item_q.put(StopIteration)
    threading.Thread(target=run_all).start()
    while True:
        item = item_q.get()
        if item is StopIteration: return
        yield item
Each generator runs in a thread and drops items onto a queue
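A Python 3 sketch of the same function on two finite sources (the ranges are invented test data). The order in which items arrive depends on thread scheduling, so the result is sorted before it is inspected.

```python
import queue
import threading

def gen_multiplex(genlist):
    item_q = queue.Queue()

    def run_one(source):
        # Each source runs in its own thread and drops items on the queue
        for item in source:
            item_q.put(item)

    def run_all():
        thrlist = []
        for source in genlist:
            t = threading.Thread(target=run_one, args=(source,))
            t.start()
            thrlist.append(t)
        for t in thrlist:
            t.join()
        item_q.put(StopIteration)     # sentinel: every source is finished

    threading.Thread(target=run_all).start()
    while True:
        item = item_q.get()           # pull items off the queue
        if item is StopIteration:
            return
        yield item

merged = sorted(gen_multiplex([iter(range(3)), iter(range(10, 13))]))
```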
Parallel Iteration
z = itertools.izip(s1,s2,s3)
Multiplexing Generators
Pull items off the queue
Multiplexing
Multiplexing Generators
Run all of the generators, wait for them to terminate, then put a
sentinel on the queue (StopIteration)
Example use
log1 = follow(open("foo/access-log"))
log2 = follow(open("bar/access-log"))
lines = gen_multiplex([log1,log2])
Part 8
Various Programming Tricks (And Debugging)
Shutting Down
Generators can be shut down using .close()
import time
def follow(thefile):
    thefile.seek(0,2)      # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)    # Sleep briefly
            continue
        yield line
Example:
lines = follow(open("access-log"))
for i,line in enumerate(lines):
    print line,
    if i == 10: lines.close()
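The effect of .close() can be seen without a log file. In this Python 3 sketch an endless counter (count_up, a hypothetical stand-in for follow) is closed from inside the loop, after which the for-loop ends normally.

```python
def count_up():
    n = 0
    while True:          # never finishes on its own, like follow()
        yield n
        n += 1

numbers = count_up()
seen = []
for i, n in enumerate(numbers):
    seen.append(n)
    if i == 10:
        numbers.close()  # the next iteration finds the generator finished
```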
Creating Generators
Shutting Down
import time
def follow(thefile):
    thefile.seek(0,2)      # Go to the end of the file
    try:
        while True:
            line = thefile.readline()
            if not line:
                time.sleep(0.1)    # Sleep briefly
                continue
            yield line
    except GeneratorExit:
        print "Follow: Shutting down"
def generate(func):
    def gen_func(s):
        for item in s:
            yield func(item)
    return gen_func
Example:
import math
gen_sqrt = generate(math.sqrt)
for x in gen_sqrt(xrange(100)):
    print x
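The same helper in Python 3, applied to a short list so the result is easy to check (the input values are arbitrary):

```python
import math

def generate(func):
    # Turn an ordinary one-argument function into a generator function
    def gen_func(s):
        for item in s:
            yield func(item)
    return gen_func

gen_sqrt = generate(math.sqrt)
roots = list(gen_sqrt([0, 1, 4, 9]))
```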
Debug Tracing
import time
def follow(thefile):
    thefile.seek(0,2)      # Go to the end of the file
    while True:
        try:
            line = thefile.readline()
            if not line:
                time.sleep(0.1)    # Sleep briefly
                continue
            yield line
        except GeneratorExit:
            print "Forget about it"
def trace(source):
    for item in source:
        print item
        yield item
Ignoring Shutdown
r404
class storelast(object):
    def __init__(self,source):
        self.source = source
    def next(self):
        item = self.source.next()
        self.last = item
        return item
    def __iter__(self):
        return self
lines = follow(open("foo/test.log"))
def sleep_and_close(s):
    time.sleep(s)
    lines.close()
threading.Thread(target=sleep_and_close,args=(30,)).start()
lines = storelast(follow(open("access-log")))
log   = apache_log(lines)
for r in log:
    print r
    print lines.last
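A Python 3 sketch of storelast, with __next__ in place of the Python 2 next method. The usage shown, recovering the raw input that made a later pipeline stage fail, is one illustrative application and the sample data is invented.

```python
class storelast:
    def __init__(self, source):
        self.source = iter(source)   # accept any iterable

    def __next__(self):
        item = next(self.source)
        self.last = item             # remember the most recent item
        return item

    def __iter__(self):
        return self

lines = storelast(["10", "oops", "30"])
numbers = (int(x) for x in lines)    # a later stage that may fail
try:
    for n in numbers:
        pass
except ValueError:
    failed_on = lines.last           # the raw input behind the failure
```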
Shutdown
Example:
import threading,signal
shutdown = threading.Event()
def sigusr1(signo,frame):
    print "Closing it down"
    shutdown.set()
signal.signal(signal.SIGUSR1,sigusr1)
lines = follow(open("access-log"),shutdown)
for line in lines:
    print line,
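The two-argument follow used here is not defined in this text. A Python 3 sketch, assuming it polls the event on every pass through the loop, is given below. A threading.Timer stands in for the signal handler so the example runs on any platform.

```python
import os
import tempfile
import threading
import time

def follow(thefile, shutdown=None):
    thefile.seek(0, os.SEEK_END)
    while True:
        # Assumed behavior: stop as soon as the shutdown event is set
        if shutdown is not None and shutdown.is_set():
            break
        line = thefile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

shutdown = threading.Event()
threading.Timer(0.3, shutdown.set).start()   # stands in for the signal handler

with tempfile.TemporaryFile("w+") as logfile:
    logfile.write("old line\n")
    logfile.flush()
    lines = list(follow(logfile, shutdown))  # returns once the event is set
```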
Part 9
signal.signal(signal.SIGUSR1,sigusr1)
Co-routines
lines = follow(open("access-log"))
for line in lines:
    print line,
def recv_count():
    try:
        while True:
            n = (yield)      # Yield expression
            print "T-minus", n
    except GeneratorExit:
        print "Kaboom!"
Sigh.
Shutdown
Example Use
Using a receiver
>>> r = recv_count()
>>> r.next()
>>> for i in range(5,0,-1):
...     r.send(i)
...
T-minus 5
T-minus 4
T-minus 3
T-minus 2
T-minus 1
>>> r.close()
Kaboom!
>>>
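The same session as a Python 3 script. Messages are appended to a list (messages, an illustrative name) instead of being printed, and r.next() becomes next(r).

```python
messages = []

def recv_count():
    try:
        while True:
            n = (yield)                       # receive a value from send()
            messages.append("T-minus %d" % n)
    except GeneratorExit:                     # raised inside by close()
        messages.append("Kaboom!")

r = recv_count()
next(r)                      # advance to the first yield before sending
for i in range(5, 0, -1):
    r.send(i)
r.close()
```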
Co-routines
Coroutine Pipelines
.send()
.send()
Setting up a Coroutine
Broadcasting (Reprise)
def recv_count():
    try:
        while True:
            n = (yield)      # Yield expression
            print "T-minus", n
    except GeneratorExit:
        print "Kaboom!"
Example:
r = recv_count()
r.next()
@consumer decorator
Example
@consumer
def find_404():
    while True:
        r = (yield)
        if r['status'] == 404:
            print r['status'],r['datetime'],r['request']
@consumer
def bytes_transferred():
    total = 0
    while True:
        r = (yield)
        total += r['bytes']
        print "Total bytes", total
Example:
@consumer
def recv_count():
    try:
        while True:
            n = (yield)      # Yield expression
            print "T-minus", n
    except GeneratorExit:
        print "Kaboom!"
lines = follow(open("access-log"))
log   = apache_log(lines)
broadcast(log,[find_404(),bytes_transferred()])
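The @consumer decorator is used above but its definition does not appear in this text. A Python 3 sketch, assuming its job is to create the coroutine and advance it to its first yield so that .send() works immediately, follows. The broadcast function and the sample records are also assumptions made for illustration, and results are collected in lists rather than printed.

```python
import functools

def consumer(func):
    # Assumed behavior: create the coroutine and prime it
    @functools.wraps(func)
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)              # advance to the first (yield)
        return c
    return start

def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)

found = []
totals = []

@consumer
def find_404():
    while True:
        r = (yield)
        if r['status'] == 404:
            found.append(r['request'])

@consumer
def bytes_transferred():
    total = 0
    while True:
        r = (yield)
        total += r['bytes']
        totals.append(total)

log = [{'status': 200, 'request': '/a', 'bytes': 100},
       {'status': 404, 'request': '/b', 'bytes': 0},
       {'status': 200, 'request': '/c', 'bytes': 50}]
broadcast(log, [find_404(), bytes_transferred()])
```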
Discussion
>>> r = recv_count()
>>> for i in range(5,0,-1):
...     r.send(i)
...
T-minus 5
T-minus 4
T-minus 3
T-minus 2
T-minus 1
>>> r.close()
Kaboom!
>>>
@consumer decorator
Pitfalls
I don't think many programmers really
understand generators yet
Wrap Up
Thanks!
http://www.dabeaz.com
Code Reuse
I like the way that code gets reused with
generators
Example
SocketServer Module (Strategy Pattern)
import SocketServer
class HelloHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        self.request.sendall("Hello World\n")
serv = SocketServer.TCPServer(("",8000),HelloHandler)
serv.serve_forever()
My generator version
for c,a in receive_connections(("",8000)):
    c.send("Hello World\n")
    c.close()