Introduction To Python Generators

by Real Python

Table of Contents

• Generator Functions
• Generator Expressions
• Use Cases
  ◦ Example 1
  ◦ Example 2
• Conclusion

Generators are functions that can be paused and resumed on the fly, returning an
object that can be iterated over. Unlike lists, they are lazy: they produce items one
at a time and only when asked, which makes them much more memory efficient when
dealing with large datasets. This article details how to create generator functions
and expressions, as well as why you would want to use them in the first place.

Generator Functions
To create a generator, you define a function as you normally would but use the yield
statement instead of return, indicating to the interpreter that this is a generator
function, whose result should be treated as an iterator:

Python

def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1

The yield statement pauses the function and saves the local state so that it can be
resumed right where it left off.
What happens when you call this function?

Python >>>

>>> def countdown(num):
...     print('Starting')
...     while num > 0:
...         yield num
...         num -= 1
...
>>> val = countdown(5)
>>> val
<generator object countdown at 0x10213aee8>

Calling the function does not execute it. We know this because the string Starting
did not print. Instead, the function returns a generator object which is used to
control execution.

Generator objects execute when next() is called:

Python >>>

>>> next(val)
Starting
5

 

When next() is called for the first time, execution begins at the start of the function
body and continues until the first yield statement, where the value to the right of
yield is returned to the caller. Each subsequent call to next() resumes on the line
after that yield, runs to the end of the loop body, returns to the top of the while
loop, and continues until yield is reached again. If the function finishes without
reaching another yield (in our case, because num <= 0 and the while loop exits), a
StopIteration exception is raised:

Python >>>

>>> next(val)
4
>>> next(val)
3
>>> next(val)
2
>>> next(val)
1
>>> next(val)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
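
In practice you rarely call next() by hand or catch StopIteration yourself. As a
minimal sketch (reusing the countdown function from above, minus the print call), a
for loop or list() drives the generator and stops cleanly once it is exhausted, and
next() accepts an optional default that is returned instead of raising:

```python
def countdown(num):
    # Same generator as above, without the 'Starting' message.
    while num > 0:
        yield num
        num -= 1


# A for loop calls next() behind the scenes and handles StopIteration.
collected = []
for value in countdown(3):
    collected.append(value)
print(collected)  # [3, 2, 1]

# list() consumes a fresh generator in the same way.
as_list = list(countdown(3))

# The second argument to next() is returned once the generator is
# exhausted, instead of StopIteration being raised.
gen = countdown(1)
first = next(gen)
after_end = next(gen, None)
print(first, after_end)
```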

 

Generator Expressions
Generators can also be written as expressions, using the same syntax as list
comprehensions, except that they are wrapped in parentheses and return a generator
object rather than a list:

Python >>>

>>> my_list = ['a', 'b', 'c', 'd']
>>> gen_obj = (x for x in my_list)
>>> for val in gen_obj:
...     print(val)
...
a
b
c
d

 

Take note of the parens on either side of the second line denoting a generator
expression, which, for the most part, does the same thing that a list comprehension
does, but does it lazily:

Python >>>

>>> import sys
>>> g = (i * 2 for i in range(10000) if i % 3 == 0 or i % 5 == 0)
>>> print(sys.getsizeof(g))
72
>>> l = [i * 2 for i in range(10000) if i % 3 == 0 or i % 5 == 0]
>>> print(sys.getsizeof(l))
38216
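
The exact byte counts depend on the Python version and platform, so treat the figures
above as indicative. The sketch below (variable names are illustrative) checks the
relationship rather than the numbers: the generator object stays small because it
stores only its own state, while the list stores a reference to every element. Note
that sys.getsizeof reports the size of the list container itself, not of the integers
it refers to.

```python
import sys

gen = (i * 2 for i in range(10000) if i % 3 == 0 or i % 5 == 0)
lst = [i * 2 for i in range(10000) if i % 3 == 0 or i % 5 == 0]

gen_size = sys.getsizeof(gen)
lst_size = sys.getsizeof(lst)
print(gen_size, lst_size)

# The generator has produced nothing yet; consuming it yields the same items.
same_items = list(gen) == lst
print(same_items)  # True
```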

 

Be careful not to mix up the syntax of a list comprehension with that of a generator
expression ([] vs ()), since generator expressions can run slower than list
comprehensions (unless you run out of memory, of course):

Python >>>

>>> import cProfile
>>> cProfile.run('sum((i * 2 for i in range(10000000) if i % 3 == 0 or i % 5 == 0))')
         4666672 function calls in 3.531 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  4666668    2.936    0.000    2.936    0.000 <string>:1(<genexpr>)
        1    0.001    0.001    3.529    3.529 <string>:1(<module>)
        1    0.002    0.002    3.531    3.531 {built-in method exec}
        1    0.592    0.592    3.528    3.528 {built-in method sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

>>> cProfile.run('sum([i * 2 for i in range(10000000) if i % 3 == 0 or i % 5 == 0])')
         5 function calls in 3.054 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.725    2.725    2.725    2.725 <string>:1(<listcomp>)
        1    0.078    0.078    3.054    3.054 <string>:1(<module>)
        1    0.000    0.000    3.054    3.054 {built-in method exec}
        1    0.251    0.251    0.251    0.251 {built-in method sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Mixing up the two forms is particularly easy (even for senior developers) in the above
example, since both produce exactly the same result in the end.

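NOTE: Keep in mind that generator expressions become the faster option, often
drastically so, once the data no longer fits in the available memory, because the
equivalent list would force the system to swap or fail outright with a MemoryError.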
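
Timings like the profiles above vary with the machine and the Python version, so it
can be worth measuring on your own setup. A minimal sketch using timeit follows; the
smaller range and the repeat count of 10 are choices made here to keep it quick, not
values from the article:

```python
import timeit

GEN_STMT = "sum(i * 2 for i in range(100000) if i % 3 == 0 or i % 5 == 0)"
LIST_STMT = "sum([i * 2 for i in range(100000) if i % 3 == 0 or i % 5 == 0])"

# Both statements compute the same total; only the intermediate storage differs.
gen_total = eval(GEN_STMT)
list_total = eval(LIST_STMT)

# Run each statement 10 times and report the elapsed seconds.
gen_seconds = timeit.timeit(GEN_STMT, number=10)
list_seconds = timeit.timeit(LIST_STMT, number=10)
print(f"generator expression: {gen_seconds:.3f}s")
print(f"list comprehension:   {list_seconds:.3f}s")
```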

Use Cases
Generators are perfect for reading a large number of large files since they yield out
data a single chunk at a time irrespective of the size of the input stream. They can
also result in cleaner code by decoupling the iteration process into smaller
components.

Example 1
Python

import os

def emit_lines(pattern=None):
    # pattern must be a string; the default of None raises TypeError
    # in the "pattern in line" test below
    lines = []
    for dir_path, dir_names, file_names in os.walk('test/'):
        for file_name in file_names:
            if file_name.endswith('.py'):
                for line in open(os.path.join(dir_path, file_name)):
                    if pattern in line:
                        lines.append(line)
    return lines

This function loops through a set of files in the specified directory. It opens each file
and then loops through each line to test for the pattern match.

This works fine with a small number of small files. But what if we’re dealing with
extremely large files? And what if there are a lot of them? Fortunately, Python’s
open() function is efficient and doesn’t load the entire file into memory. But what if
our matches list far exceeds the available memory on our machine?

So, instead of running out of space (large lists) or time (a nearly infinite data
stream) when processing large amounts of data, generators are the ideal tool, as they
yield data one item at a time (instead of creating intermediate lists).

Let’s look at the generator version of the above problem and try to understand why
generators are apt for such use cases using processing pipelines.

We divided our whole process into three different components:

• Generating the set of filenames
• Generating all lines from all files
• Filtering out lines on the basis of pattern matching

Python

import os

def generate_filenames():
    """
    generates a sequence of opened files
    matching a specific extension
    """
    for dir_path, dir_names, file_names in os.walk('test/'):
        for file_name in file_names:
            if file_name.endswith('.py'):
                yield open(os.path.join(dir_path, file_name))

def cat_files(files):
    """
    takes in an iterable of opened files
    """
    for fname in files:
        for line in fname:
            yield line

def grep_files(lines, pattern=None):
    """
    takes in an iterable of lines
    """
    for line in lines:
        if pattern in line:
            yield line

py_files = generate_filenames()
py_file = cat_files(py_files)
lines = grep_files(py_file, 'python')
for line in lines:
    print(line)

In the above snippet, we do not build an intermediate list of lines. Instead, we
create a pipeline whose components feed one another through iteration, one item at a
time. grep_files takes in a generator object that produces all the lines of the *.py
files. Similarly, cat_files takes in a generator object that produces the opened *.py
files found under the directory. This is how the whole pipeline is glued together via
iteration.
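
One detail the pipeline above leaves open is that the files yielded by
generate_filenames are never explicitly closed. A possible variant, sketched here
under the assumption that yielding paths rather than open files is acceptable (the
names generate_paths, root, and extension are introduced for illustration), opens
each file in a with block inside cat_files so it is closed as soon as its lines have
been consumed:

```python
import os
import tempfile


def generate_paths(root, extension='.py'):
    """Yield the path of every file under root with the given extension."""
    for dir_path, dir_names, file_names in os.walk(root):
        for file_name in sorted(file_names):
            if file_name.endswith(extension):
                yield os.path.join(dir_path, file_name)


def cat_files(paths):
    """Yield every line of every file, closing each file when done."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                yield line


def grep_files(lines, pattern):
    """Yield only the lines that contain pattern."""
    for line in lines:
        if pattern in line:
            yield line


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'a.py'), 'w') as f:
        f.write('import os\n# python rocks\n')
    with open(os.path.join(root, 'b.txt'), 'w') as f:
        f.write('python in a text file\n')

    # Consume the pipeline while the temporary files still exist.
    matches = list(grep_files(cat_files(generate_paths(root)), 'python'))

print(matches)  # ['# python rocks\n']
```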

Example 2
Generators work great for web scraping and crawling recursively:

Python

import requests
import re

def get_pages(link):
    # queue of links still to be fetched, starting with the given one
    links_to_visit = []
    links_to_visit.append(link)
    while links_to_visit:
        current_link = links_to_visit.pop(0)
        page = requests.get(current_link)
        for url in re.findall('<a href="([^"]+)">', str(page.content)):
            # turn a site-relative link into an absolute one
            if url[0] == '/':
                url = current_link + url[1:]
            pattern = re.compile('https?')
            if pattern.match(url):
                links_to_visit.append(url)
        # hand the fetched link back to the caller before fetching the next
        yield current_link

webpage = get_pages('http://sample.com')
for result in webpage:
    print(result)

Here, we simply fetch a single page at a time and then perform some sort of action
on the page when execution occurs. What would this look like without a generator?
Either the fetching and processing would have to happen within the same function
(resulting in highly coupled code that’s hard to test) or we’d have to fetch all the
links before processing a single page.
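
To make the decoupling concrete, here is a hedged sketch of the same breadth-first
idea that can be exercised without network access. The fetch parameter, the seen set,
the use of urllib.parse.urljoin, and the FAKE_SITE dictionary are all assumptions
added here for illustration, not part of the original get_pages:

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    'http://sample.com/': '<a href="/about">About</a> <a href="/blog">Blog</a>',
    'http://sample.com/about': '<a href="/">Home</a>',
    'http://sample.com/blog': '<a href="/about">About</a>',
}

fetched = []  # records which pages were actually requested


def fake_fetch(url):
    """Stand-in for an HTTP request; returns '' for unknown pages."""
    fetched.append(url)
    return FAKE_SITE.get(url, '')


def crawl(start, fetch):
    """Yield each reachable page once, fetching it only when asked."""
    to_visit = deque([start])
    seen = {start}
    while to_visit:
        current = to_visit.popleft()
        html = fetch(current)
        for href in re.findall(r'<a href="([^"]+)">', html):
            url = urljoin(current, href)  # resolves relative links
            if url.startswith('http') and url not in seen:
                seen.add(url)
                to_visit.append(url)
        yield current


pages = crawl('http://sample.com/', fake_fetch)
first_page = next(pages)             # only one page has been fetched so far
fetched_after_first = list(fetched)
remaining = list(pages)              # drives the crawl to completion
print(first_page, remaining)
```

Because the crawler takes its fetch function as an argument, the consumer decides how
many pages to pull and the traversal logic can be tested in isolation.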

Conclusion
Generators allow us to ask for values as and when we need them, making our
applications more memory efficient and perfect for infinite streams of data. They
can also be used to refactor out the processing from loops resulting in cleaner,
decoupled code. If you’d like to see more examples, check out Generator Tricks for
Systems Programmers and Iterator Chains as Pythonic Data Processing Pipelines.
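
As an illustration of the "infinite streams" point, the following sketch (the evens
function is an example added here) defines a generator that never finishes on its own
and uses itertools.islice to take only as many values as needed:

```python
from itertools import islice


def evens(start=0):
    """Yield even numbers forever, beginning at the first even >= start."""
    n = start if start % 2 == 0 else start + 1
    while True:
        yield n
        n += 2


# islice stops after five items, so the infinite loop is never a problem.
first_five = list(islice(evens(), 5))
print(first_five)  # [0, 2, 4, 6, 8]

from_seven = list(islice(evens(7), 3))
print(from_seven)  # [8, 10, 12]
```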

How have you used generators in your own projects?

What Do You Think?

Ohad Gazit • 2 years ago


Thank you,
I always prefer to call the next yield with val.next() instead of next(val)

Monika Zarini • 14 days ago


It seems to me in "http://sample.com" should be a slash at the end:
"http://sample.com/". (Example 2)

John • 3 months ago


Thanks.
Can you also add to the article how to use generator.send() ?

Swati • 4 months ago


x = [1, 2, 3, 4]
iter_X = iter(x)
for i in iter_X:
    print(i)

Then again,
for i in iter_X:  ### second iteration on iter_X
    print(i)  ### will not print values

The second for loop will not print anything. This behaviour of an iterator is the same
as that of a generator, that is, a generator can be iterated over only once.

Is it true that the only difference between an iterator and a generator is syntax?

Gene Ricky Shaw • 2 years ago


There is a minor error in the instructions. The following line:

py_file = cat_file(py_files)

Should read:

py_file = cat_files(py_files)



© 2012–2018 Real Python

