Generator Tricks For Systems Programmers
David Beazley
Presented at PyCon'2008
An Introduction
My Story
My addiction to generators started innocently
enough. I was just a happy Python
programmer working away in my secret lair
when I got "the call." A call to sort through
1.5 Terabytes of C++ source code (~800
weekly snapshots of a million line application).
That's when I discovered the os.walk()
function. I knew this wasn't going to end well...
A Complaint
• The coverage of generators in most Python
books is lame (mine included)
• Look at all of these cool examples!
• Fibonacci Numbers
• Squaring a list of numbers
• Randomized sequences
• Wow! Blow me over!
This Tutorial
• Some more practical uses of generators
• Focus is "systems programming"
• Which loosely includes files, file systems,
parsing, networking, threads, etc.
• My goal: to provide some more compelling
examples of using generators
• Planting some seeds
Performance Details
• Performance numbers appear later in these notes
• All tests were conducted on the following setup:
• Python 2.5.1 on OS X 10.4.11
• Mac Pro 2x2.66 Ghz Dual-Core Xeon
• 3 Gbytes RAM
• WDC WD2500JS-41SGB0 Disk (250G)
• Timings are a 3-run average of the 'time' command
Part I
Introduction to Iterators and Generators
Iteration
• As you know, Python has a "for" statement
• You use it to loop over a collection of items
>>> for x in [1,4,5,10]:
... print x,
...
1 4 5 10
>>>
Consuming Iterables
• Many functions consume an "iterable" object
• Reductions:
sum(s), min(s), max(s)
• Constructors
list(s), tuple(s), set(s), dict(s)
• in operator
item in s
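• For example, any of these accept a list, a file, or a generator:
>>> sum([1,4,5,10])
20
>>> 10 in [1,4,5,10]
True
>>>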
Iteration Protocol
• An inside look at the for statement
for x in obj:
# statements
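• Under the covers, the for statement is roughly equivalent to this (Python 2, where the protocol method is named next(); _iter is just an illustrative name):
_iter = iter(obj)         # Get an iterator object
while True:
    try:
        x = _iter.next()  # Get the next item
    except StopIteration: # No more items
        break
    # statements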
Supporting Iteration
• Sample implementation
class countdown(object):
def __init__(self,start):
self.count = start
def __iter__(self):
return self
def next(self):
if self.count <= 0:
raise StopIteration
r = self.count
self.count -= 1
return r
• Example use:
>>> c = countdown(5)
>>> for i in c:
... print i,
...
5 4 3 2 1
>>>
Generators
• Behavior is quite different from a normal function
• Calling a generator function creates a
generator object; it does not start
running the function.
def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1

>>> x = countdown(10)      # Notice that no output was produced
>>> x
<generator object at 0x58490>
>>>
Generator Functions
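• The function only starts executing on next(); it runs to the yield, emits a value, and suspends. Continuing the session above:
>>> x.next()
Counting down from 10
10
>>> x.next()
9
>>>
• Each next() call resumes the function where it left off; when it returns, StopIteration is raised.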
Generator Expressions
• Important differences from a list comp.
• Does not construct a list.
• Example:
>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
>>> c = (2*x for x in a)
>>> c
<generator object at 0x58760>
>>>
A Note on Syntax
• The parens on a generator expression can be
dropped if used as a single function argument
• Example:
sum(x*x for x in s)
• A generator expression can also be assigned
to a variable for later use:
squares = (x*x for x in s)
Part 2
Processing Data Files
A Generator Solution
• Task: find the total bytes transferred by summing
up the last column of an Apache server log
• Let's use some generator expressions:
wwwlog = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
Being Declarative
• At each step of the pipeline, we declare an
operation that will be applied to the entire
input stream
access-log → wwwlog → bytecolumn → bytes → sum() → total
Performance Contest

wwwlog = open("big-access-log")
total = 0
for line in wwwlog:
    bytestr = line.rsplit(None,1)[1]
    if bytestr != '-':
        total += int(bytestr)
print "Total", total

Time: 27.20 seconds

wwwlog = open("big-access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
Performance Contest
• The same pipeline versus an awk one-liner (the exact
command isn't shown here; a standard reconstruction
that sums the last column):

% awk '{ total += $NF } END { print "Total:", total }' big-access-log

Time: 37.33 seconds
Note: extracting the last column may not be
awk's strong point
More Thoughts
• The generator solution was based on the
concept of pipelining data between
different components
• What if you had more advanced kinds of
components to work with?
• Perhaps you could perform different kinds
of processing by just plugging various
pipeline components together
Part 3
Fun with Files and Directories
• A directory of compressed and uncompressed log files:
foo/
    access-log-012007.gz
    access-log-022007.gz
    access-log-032007.gz
    ...
    access-log-012008
bar/
    access-log-092007.bz2
    ...
    access-log-022008
os.walk()
• A very useful function for searching the
file system
import os, fnmatch
def gen_find(filepat,top):
for path, dirlist, filelist in os.walk(top):
for name in fnmatch.filter(filelist,filepat):
yield os.path.join(path,name)
• Examples
pyfiles = gen_find("*.py","/")
logs = gen_find("access-log*","/usr/www/")
Performance Contest

pyfiles = gen_find("*.py","/")
for name in pyfiles:
    print name

Wall Clock Time: 559s
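gen_open
• The pipelines below use a gen_open() helper that isn't defined in this extraction; a minimal sketch consistent with the compressed logs above (open each filename, decompressing by extension):
import gzip, bz2
def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name)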
cat
• Concatenate items from one or more
sources into a single sequence of items
def gen_cat(sources):
for s in sources:
for item in s:
yield item
• Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
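grep
• The next example filters lines with a gen_grep() helper, also not defined here; a minimal sketch (yield the lines that match a regex pattern):
import re
def gen_grep(pat, lines):
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line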
• Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat, loglines)
Example
• Find out how many bytes transferred for a
specific pattern in a whole directory of logs
pat = r"somepattern"
logdir = "/some/dir/"
filenames = gen_find("access-log*",logdir)
logfiles = gen_open(filenames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat,loglines)
bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
Part 4
Parsing and Processing Data
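Log Parsing
• The records are pulled apart with a regex; the logpats pattern isn't shown in this extraction, so this is a sketch using the standard Apache combined-log pattern (nine groups, matching the nine field names used below):
import re
logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) (\S+)" (\S+) (\S+)'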
logpat = re.compile(logpats)
Field Conversion
• Map specific dictionary fields through a function
def field_map(dictseq,name,func):
for d in dictseq:
d[name] = func(d[name])
yield d
• Converting the parsed tuples into dictionaries of named fields:
colnames = ('host','referrer','user','datetime','method',
            'request','proto','status','bytes')
log = (dict(zip(colnames,t)) for t in tuples)
Packaging
• Parse an Apache log
def apache_log(lines):
    groups = (logpat.match(line) for line in lines)
    tuples = (g.groups() for g in groups if g)
    colnames = ('host','referrer','user','datetime','method',
                'request','proto','status','bytes')
    log = (dict(zip(colnames,t)) for t in tuples)
    log = field_map(log,"bytes",
                    lambda s: int(s) if s != '-' else 0)
    log = field_map(log,"status",int)
    return log
• Example use:
log = apache_log(open("access-log"))
for r in log:
    print r
A Query Language
• Now that we have our log, let's do some queries
• Find the set of all documents that 404
stat404 = set(r['request'] for r in log
if r['status'] == 404)
• Find all requests that transfer over a megabyte:
large = (r for r in log
         if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']
A Query Language
• Find out who has been hitting robots.txt
addrs = set(r['host'] for r in log
if 'robots.txt' in r['request'])
import socket
for addr in addrs:
try:
print socket.gethostbyaddr(addr)[0]
except socket.herror:
print addr
Some Thoughts
• I like the idea of using generator expressions as a
pipeline query language
• You can write simple filters, extract data, etc.
• If you pass dictionaries/objects through the
pipeline, it becomes quite powerful
• Feels similar to writing SQL queries
Question
• Have you ever used 'tail -f' in Unix?
% tail -f logfile
...
... lines of output ...
...
Tailing a File
• A Python version of 'tail -f'
import time
def follow(thefile):
thefile.seek(0,2) # Go to the end of the file
while True:
line = thefile.readline()
if not line:
time.sleep(0.1) # Sleep briefly
continue
yield line
Example
• Turn the real-time log file into records
logfile = open("access-log")
loglines = follow(logfile)
log = apache_log(loglines)
Feeding Generators
• In order to feed a generator processing
pipeline, you need to have an input source
• So far, we have looked at two file-based inputs
• Reading a file
lines = open(filename)
• Tailing a file
lines = follow(open(filename))
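Generating Connections
• Later examples call a receive_connections() generator that isn't defined in this extraction; a minimal sketch (accept TCP connections forever, yielding (client, address) pairs):
import socket
def receive_connections(addr):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
    s.bind(addr)
    s.listen(5)
    while True:
        client = s.accept()   # (client socket, address)
        yield client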
• Example:
for c,a in receive_connections(("",9000)):
c.send("Hello World\n")
c.close()
Generating Messages
• Receive a sequence of UDP messages
import socket
def receive_messages(addr,maxsize):
s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
s.bind(addr)
while True:
msg = s.recvfrom(maxsize)
yield msg
• Example:
for msg, addr in receive_messages(("",10000),1024):
print msg, "from", addr
I/O Multiplexing
• Example: accepting connections in a separate thread
and collecting the client sockets:
import threading
clientset = []
def acceptor(sockset,addr):
    for c,a in receive_connections(addr):
        sockset.append(c)

acc_thr = threading.Thread(target=acceptor,
                           args=(clientset,("",12000)))
acc_thr.setDaemon(True)
acc_thr.start()
Consuming a Queue
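• The consumer below iterates with a consume_queue() generator that isn't defined here; a minimal sketch (yield queued items until a StopIteration sentinel shows up):
def consume_queue(thequeue):
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item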
• Example:
import Queue, threading
def consumer(q):
for item in consume_queue(q):
print "Consumed", item
print "Done"
in_q = Queue.Queue()
con_thr = threading.Thread(target=consumer,args=(in_q,))
con_thr.start()
for i in xrange(100):
in_q.put(i)
in_q.put(StopIteration)
Multiple Processes
• Can you extend a processing pipeline across
processes and machines?
[diagram: process 1 → pipe/socket → process 2]
• Unpickle a sequence of items from a file-like object:
import pickle
def gen_unpickle(infile):
    while True:
        try:
            item = pickle.load(infile)
            yield item
        except EOFError:
            return
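• The sender below also needs a gen_pickle() companion, not defined in this extraction; a minimal sketch (serialize each generated item to a byte string):
def gen_pickle(source):
    for item in source:
        yield pickle.dumps(item)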
Sender/Receiver
• Example: Sender
def sendto(source,addr):
s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
s.connect(addr)
for pitem in gen_pickle(source):
s.sendall(pitem)
s.close()
• Example: Receiver
def receivefrom(addr):
s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
s.bind(addr)
s.listen(5)
c,a = s.accept()
for item in gen_unpickle(c.makefile()):
yield item
c.close()
Example Use
• Example: Read log lines and parse into records
# netprod.py
lines = follow(open("access-log"))
log = apache_log(lines)
sendto(log,("",15000))
Fanning Out
• In all of our examples, the processing pipeline is
driven by a single consumer
for item in gen:
# Consume item
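Broadcasting
• Consume a generator and send each item to a collection of consumers; the broadcast() function used below is reprised with this definition in Part 9:
def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)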
Consumers
• To create a consumer, define an object with a
send method on it
class Consumer(object):
def send(self,item):
print self, "got", item
• Example:
c1 = Consumer()
c2 = Consumer()
c3 = Consumer()
lines = follow(open("access-log"))
broadcast(lines,[c1,c2,c3])
Network Consumer
• Example:
import socket,pickle
class NetConsumer(object):
def __init__(self,addr):
self.s = socket.socket(socket.AF_INET,
socket.SOCK_STREAM)
self.s.connect(addr)
def send(self,item):
pitem = pickle.dumps(item)
self.s.sendall(pitem)
def close(self):
self.s.close()
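• The example below filters with a Stat404 consumer that isn't defined here; a sketch, assuming it subclasses NetConsumer and forwards only 404 records:
class Stat404(NetConsumer):
    def send(self,item):
        if item['status'] == 404:
            NetConsumer.send(self,item)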
lines = follow(open("access-log"))
log = apache_log(lines)
stat404 = Stat404(("somehost",15000))
broadcast(log, [stat404])
Consumer Thread
• Example:
import Queue, threading
class ConsumerThread(threading.Thread):
def __init__(self,target):
threading.Thread.__init__(self)
self.setDaemon(True)
self.in_queue = Queue.Queue()
self.target = target
def send(self,item):
self.in_queue.put(item)
def generate(self):
while True:
item = self.in_queue.get()
yield item
def run(self):
self.target(self.generate())
def find_404(log):
    for r in log:
        if r['status'] == 404:
            print r['status'],r['datetime'],r['request']

def bytes_transferred(log):
    total = 0
    for r in log:
        total += r['bytes']
    print "Total bytes", total

c1 = ConsumerThread(find_404)
c1.start()
c2 = ConsumerThread(bytes_transferred)
c2.start()

lines = follow(open("access-log"))
log = apache_log(lines)
broadcast(log,[c1,c2])
Multiple Sources
• In all of our examples, the processing pipeline is
being fed by a single source
• But, what if you had multiple sources?
source1 source2 source3
Parallel Iteration
• Zipping multiple generators together
import itertools
z = itertools.izip(s1,s2,s3)
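• A quick illustration (izip pairs items in lockstep and stops with the shortest input):
>>> import itertools
>>> list(itertools.izip([1,2,3],'abc'))
[(1, 'a'), (2, 'b'), (3, 'c')]
>>>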
Multiplexing
• Feeding a pipeline with items from multiple sources at once:
lines = gen_multiplex([log1,log2])
Multiplexing Generators

import Queue, threading

def gen_multiplex(genlist):
    item_q = Queue.Queue()
    def run_one(source):
        # Each generator runs in a thread and drops items onto a queue
        for item in source:
            item_q.put(item)
    def run_all():
        thrlist = []
        for source in genlist:
            t = threading.Thread(target=run_one,args=(source,))
            t.start()
            thrlist.append(t)
        # When every source is exhausted, signal the consumer to stop
        for t in thrlist: t.join()
        item_q.put(StopIteration)
    threading.Thread(target=run_all).start()
    # Pull items off the queue and yield them
    while True:
        item = item_q.get()
        if item is StopIteration: return
        yield item
Part 8
Various Programming Tricks (And Debugging)
Creating Generators
• Any single-argument function is easy to turn
into a generator function
def generate(func):
def gen_func(s):
for item in s:
yield func(item)
return gen_func
• Example:
import math
gen_sqrt = generate(math.sqrt)
for x in gen_sqrt(xrange(100)):
    print x
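Storing the Last Item
• The example below reads lines.last, which relies on a wrapper that remembers the most recent item it produced; that wrapper isn't defined in this extraction, so this is a minimal sketch:
class storelast(object):
    def __init__(self,source):
        self.source = source
    def next(self):
        item = self.source.next()
        self.last = item      # Remember the most recent item
        return item
    def __iter__(self):
        return self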
• Example:
lines = storelast(follow(open("access-log")))
log = apache_log(lines)
for r in log:
    print r
    print lines.last
Shutting Down
• Generators can be shut down using .close()
import time
def follow(thefile):
thefile.seek(0,2) # Go to the end of the file
while True:
line = thefile.readline()
if not line:
time.sleep(0.1) # Sleep briefly
continue
yield line
• Example:
lines = follow(open("access-log"))
for i,line in enumerate(lines):
print line,
if i == 10: lines.close()
Shutting Down
• In the generator, GeneratorExit is raised
import time
def follow(thefile):
thefile.seek(0,2) # Go to the end of the file
try:
while True:
line = thefile.readline()
if not line:
time.sleep(0.1) # Sleep briefly
continue
yield line
except GeneratorExit:
print "Follow: Shutting down"
Shutdown Problems
• Closing the generator from a separate thread, or
from a signal handler, doesn't work:

def sleep_and_close(s):
    time.sleep(s)
    lines.close()
threading.Thread(target=sleep_and_close,args=(30,)).start()

import signal
def sigusr1(signo,frame):
    lines.close()
signal.signal(signal.SIGUSR1,sigusr1)

lines = follow(open("access-log"))
for line in lines:
    print line,

• Either way, close() gets called while the generator is
busy executing, and Python raises an exception.
• Sigh.
Shutdown
• The only way to shut down a generator
externally is to instrument it with a flag or
some kind of check
def follow(thefile,shutdown=None):
thefile.seek(0,2)
while True:
if shutdown and shutdown.isSet(): break
line = thefile.readline()
if not line:
time.sleep(0.1)
continue
yield line
shutdown = threading.Event()
def sigusr1(signo,frame):
print "Closing it down"
shutdown.set()
signal.signal(signal.SIGUSR1,sigusr1)
lines = follow(open("access-log"),shutdown)
for line in lines:
print line,
Part 9
Co-routines
Example Use
• Using a receiver
>>> r = recv_count()
>>> r.next()                # Note: must call .next() here
>>> for i in range(5,0,-1):
... r.send(i)
...
T-minus 5
T-minus 4
T-minus 3
T-minus 2
T-minus 1
>>> r.close()
Kaboom!
>>>
Setting up a Coroutine
• To get a co-routine to run properly, you have to
ping it with a .next() operation first
def recv_count():
try:
while True:
n = (yield) # Yield expression
print "T-minus", n
except GeneratorExit:
print "Kaboom!"
• Example:
r = recv_count()
r.next()
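• The next example uses a consumer decorator that does the priming .next() call automatically; its definition isn't shown in this extraction, so this is a sketch:
def consumer(func):
    def start(*args,**kwargs):
        c = func(*args,**kwargs)
        c.next()      # Prime the coroutine
        return c
    return start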
• Example:
@consumer
def recv_count():
try:
while True:
n = (yield) # Yield expression
print "T-minus", n
except GeneratorExit:
print "Kaboom!"
@consumer decorator
• Using the decorated version
>>> r = recv_count()
>>> for i in range(5,0,-1):
... r.send(i)
...
T-minus 5
T-minus 4
T-minus 3
T-minus 2
T-minus 1
>>> r.close()
Kaboom!
>>>
Broadcasting (Reprise)
• Consume a generator and send items to a set
of consumers
def broadcast(source, consumers):
for item in source:
for c in consumers:
c.send(item)
@consumer
def bytes_transferred():
total = 0
while True:
r = (yield)
total += r['bytes']
print "Total bytes", total
lines = follow(open("access-log"))
log = apache_log(lines)
broadcast(log,[find_404(),bytes_transferred()])
Example
• SocketServer Module (Strategy Pattern)
import SocketServer
class HelloHandler(SocketServer.BaseRequestHandler):
def handle(self):
self.request.sendall("Hello World\n")
serv = SocketServer.TCPServer(("",8000),HelloHandler)
serv.serve_forever()
• My generator version
for c,a in receive_connections(("",8000)):
c.send("Hello World\n")
c.close()
Thanks!