Python Generator

This article aims to provide a gentle introduction to Python generators, and hopefully will inspire the reader to find interesting uses for them.

It is a somewhat complicated topic with a lot of small details, so extra information is provided in asides along the way. Feel free to skip over these, as they are not necessary for understanding, but they should give some more insight for the curious.

What is a generator?

A generator is a powerful programming tool that allows for efficient iteration over large or computationally expensive data. It also boasts simpler code and better performance when compared to other methods of iteration.

The primary methods for implementing python generators are:

  • Generator functions (added in Python 2.2)
  • Generator expressions (added in Python 2.4)

Some core concepts are essential to understanding first, so let’s dive into those now.

Iteration and Iterables

As you are probably aware, _iteration_ is just a more formal term for the repetition of lines of code, such as in while and for loops.

Python has a particularly flexible method for iteration built into the language, based on the concept of _iterable_ objects, which is most often used by for loops. Whenever we have talked about for loops previously, they have always operated on iterable objects.

Iterables are generally sequence types, like lists, ranges, tuples or sets, and in Python, being an iterable means that an object can be iterated over by something like for.

# These are all iterable objects that 'for' is operating on

my_list = [1,2,3,"Python",4]
my_tuple = (5,6,7,"Rocks")
my_set = {8,9,10}

for item in my_list:
    print(item)

for item in my_tuple:
    print(item)

for item in my_set:
    print(item)

for x in range(5):
    print(x)

Behind the scenes, the fact that we refer to an object as ‘iterable’ means that it exposes a method called `__iter__()`, which returns an _iterator_ for the object.
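We can see this in action by asking for an iterator directly with the built-in `iter()` function, which calls `__iter__()` for us (a minimal sketch):

```python
# Ask a list for its iterator via the built-in iter() function
my_list = [1, 2, 3]
my_iterator = iter(my_list)   # calls my_list.__iter__() behind the scenes

print(type(my_iterator))      # a list_iterator object
print(next(my_iterator))      # 1
print(next(my_iterator))      # 2
```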

Iterators

An _iterator_ is an object that controls looping behavior over an iterable object. Its purpose is to keep track of where the iteration has got to so far.

When “for” and related functions are run on an iterable object, what they actually do first is to request an iterator from the object. If that fails, “for” will throw an exception, otherwise, that iterator is then used repeatedly to grab the next item in the sequence, until the sequence is exhausted.

In practice, this means that “for” can loop over any object that can supply an iterator, but it can’t loop over anything else. Objects like int or float will not work with for, because they don’t implement the right methods.
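The mechanism “for” uses can be sketched by hand with the built-ins `iter()` and `next()`. This is a rough equivalent, not the exact implementation:

```python
# A rough hand-written equivalent of: for item in my_list: print(item)
my_list = [3, 5, 7, 9]

iterator = iter(my_list)       # request an iterator from the iterable
while True:
    try:
        item = next(iterator)  # grab the next item in the sequence
    except StopIteration:      # sequence exhausted - leave the loop
        break
    print(item)
```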

# Example: for statement works over a list, but not over a float

my_list = [3,5,7,9]
my_float = 1.1

for item in my_list:
    print(item)

for item in my_float:
    print(item)

Output

3
5
7
9
Traceback (most recent call last):
  File ".\examples\iterator_demo.py", line 23, in <module>
    for item in my_float:
TypeError: 'float' object is not iterable

It is up to the iterator to keep track of where the program has got to in a loop and to pull in the next item when requested.

The iterator must:

  • set itself up for a loop when created
  • implement a `__next__()` method that returns the next item
  • raise the `StopIteration` exception from `__next__()` when the loop should finish

Typically the iterator object keeps track of where it is using a loop counter or similar, stored in one of its properties. Here is an example of how it would be used in practice:

  • An iterator is created: set the loop counter to zero.
  • Iterator’s `__next__()` is called: check the loop counter
    • if we are finished, raise the `StopIteration` exception
    • if we are not finished, increment the loop counter and return the next item
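The steps above can be sketched as a small hand-written iterator class (a hypothetical `CountUpTo` that counts from zero up to a limit):

```python
class CountUpTo:
    """Iterator that counts from 0 up to (but not including) limit."""

    def __init__(self, limit):
        self.limit = limit
        self.counter = 0          # set the loop counter to zero on creation

    def __iter__(self):
        return self               # an iterator returns itself

    def __next__(self):
        if self.counter >= self.limit:
            raise StopIteration   # raised, not returned, when finished
        value = self.counter
        self.counter += 1         # increment the loop counter
        return value

for number in CountUpTo(3):
    print(number)                 # prints 0, 1, 2
```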

Generators in detail

A generator can be looked at as a much-improved version of an iterator, designed to be written more intuitively and with much less code. There are subtle differences between a generator and an iterator:

  • Generators are a special type of iterator
  • They define function-like behavior
  • They automatically preserve their own internal state (e.g. local variables) between iterations, which a hand-written iterator must manage explicitly

To expand on that last point, a hand-written iterator starts from scratch each time around the loop – every call to `__next__()` is a fresh function call, needing its own setup and creating its own state.

A generator doesn’t need to perform that setup more than once – it simply resumes with the state left over from the previous call. This becomes much more efficient when running the same code many thousands of times.
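A tiny generator makes this visible: its local variables survive between calls to `next()` (a minimal sketch):

```python
def counter():
    total = 0          # set up once, on the first call only
    while True:
        total += 1     # 'total' survives between yields
        yield total

gen = counter()
print(next(gen))   # 1
print(next(gen))   # 2 - 'total' was remembered, not re-initialised
print(next(gen))   # 3
```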

In the following sections, we will cover how we can implement generators, why we benefit from using them, and a few examples of usable code.

Further reading: Advanced topics which will not be covered in this document but are closely related include:

  • Sending Values to Generators (`send()` method)
  • Connecting Generators (`yield from` expression)
  • Concurrency/Parallel processing/Coroutines

They are covered in some of the references at the end of this article, but for a good introduction to all of these concepts, Dave Beazley’s presentation in particular is highly recommended.

Implementing generators

Generator functions

Generator functions are possibly the easiest way to implement generators in Python, but they do still carry a slightly higher learning curve than regular functions and loops.

Simply put, a generator function is a particular kind of function that can yield its results one by one as it runs, instead of waiting to complete and returning them all at once.

They are easy to spot, as you will notice that values are returned using the keyword “yield”. The usual “return” can be used here, but only to exit the function.

If a generator hits a “return” before it has managed to “yield” any data, iteration simply ends with no items produced – draining it with `list()` would give back an empty list (`[]`).
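This behavior is easy to verify with a minimal sketch – a generator that returns before yielding produces no items at all:

```python
def give_up_early(items):
    return         # bail out before any yield is reached
    yield items    # never executed, but its presence makes this a generator

print(list(give_up_early([1, 2, 3])))   # []
```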

Here are some examples of the syntax that is commonly used to create a generator function:

# Form 1 - Loop through a collection (an iterable) and apply some processing to each item

def generator_function(collection):
    #setup statements here if needed
    for item in collection:
        #do some processing
        return_value = apply_some_processing_to(item)
        yield return_value


# Form 2 - Set up an arbitrary loop and return items that are generated by that - similar to range() function

def generator_function(start,stop,step):
    #setup statements here
    loop_counter = <initial value based on start>
    loop_limit = <final value based on stop>
        #might need to add one to limit to be inclusive of final value
    loop_step = <value to increment counter by, based on step>
        #step could be negative, to run backwards
    while loop_counter != loop_limit:
        #do some processing
        return_value = generate_item_based_on(loop_counter)
        #increment the counter
        loop_counter += loop_step
        yield return_value


# Form 3 - Illustrates return mechanism - imagine the processing we're doing requires some kind of setup beforehand, perhaps connecting to a network resource

def generator_function(collection):
    #setup statements here if needed
    setup_succeeded = do_some_setup()
    if setup_succeeded:
        for item in collection:
            #do some processing
            return_value = apply_some_processing_to(item)
            yield return_value
    else:
        return

As noted in the comments, the first form steps through an iterable collection of items and applies some processing to each item before `yield`ing it back. Not very exciting, and certainly very doable with plain old iteration, but its power comes from being able to form part of a chain of other generators. As more processing is needed per item, generator-based code quickly becomes much easier to maintain.

The second form is an example that could be adapted to generate sequences of values of arbitrary length, support infinite sequences, or infinite cycles of sequences, and so on. The mechanism is very useful when dealing with mathematical or scientific problems, for instance.
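As a concrete instance of the second form, here is a hypothetical `countdown()` generator that runs backwards using a negative step:

```python
def countdown(start, stop, step=-1):
    """Yield values from start down to stop (inclusive), like a reversed range()."""
    loop_counter = start
    loop_limit = stop + step          # step one past stop so it is included
    while loop_counter != loop_limit:
        yield loop_counter
        loop_counter += step

for value in countdown(5, 1):
    print(value)                      # 5, 4, 3, 2, 1
```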

Here is some example syntax for calling the generator function:

#Form 1 - more readable

my_generator = generator_function(arguments)
for result in my_generator:
    output(result)


#Form 2 - more concise

for result in generator_function(arguments):
    output(result)

In the first form, the generator is set up in the first statement, and a reference to it is stored in a variable. It is then used (or consumed) by the following “for” loop.

In the second form, the generator is not stored but immediately consumed by the “for” loop.

Generator functions in practice

Here is an example that illustrates creating and using a simple generator function. It filters a text file down to only lines containing a particular string. It also shows three slightly different ways to call the generator; see the comments for details:

#Example: Generator Functions

def filtered_text(text_lines,wanted_text):
    """ Compares each line in text_lines to wanted_text
        Yields the line if it matches """
    for line in text_lines:
        if wanted_text in line:
            yield line


#slow method - read whole file into memory, then use the generator to filter text
#need to wait for the whole file to load before anything else can begin
#uses more memory
#not much benefit here!

with open("Programming_Books_List.txt",'r') as file_obj:
    lots_of_text = file_obj.readlines()

matches = filtered_text(lots_of_text,"Python")
for match in matches:
    print(match)


#faster method - use the file object as an iterator, filter it with the generator
#only needs to keep current line in memory
#current line is only read directly before use
#outputs each match directly after it is found (before the file has finished reading)

with open("Programming_Books_List.txt",'r') as file_obj:
    matches = filtered_text(file_obj,"Python")
    for match in matches:
        print(match)


#sleeker method - this is doing the same as the faster method above, but in fewer lines of code
#instead of storing the generator object in a variable, it is immediately used in a for loop
#this is perhaps less readable, so it can be harder to debug

with open("Programming_Books_List.txt",'r') as file_obj:
    for match in filtered_text(file_obj,"Python"):
        print(match)

Generator expressions

A generator expression is another way to create a simple generator. These tend to be much more concise, often resulting in one-liners, but are not always as readable as generator functions.

Their main drawback is they are not as flexible as generator functions; it can be difficult to implement anything particularly complex in a generator expression, as you are restricted to what can be easily written in a single expression.

Some example syntax might be helpful:

#Form 1: basic form - iterate over all items, run some processing on each

new_generator = (apply_processing_to(item) for item in iterable)


#Form 2: filter - rejects items if the condition is not true

new_generator = (item for item in iterable if condition)


#Form 3: combination of forms 1 and 2

new_generator = (apply_processing_to(item) for item in iterable if condition)

They appear similar to a list comprehension, but in practice the list comprehension builds its entire output as a list in memory before returning, whereas the generator expression returns its output one item at a time.
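The difference is easy to observe with `sys.getsizeof()`. Exact sizes vary by Python version and platform, but the gap is dramatic:

```python
import sys

squares_list = [n * n for n in range(1_000_000)]   # builds the whole list now
squares_gen = (n * n for n in range(1_000_000))    # builds nothing yet

print(sys.getsizeof(squares_list))   # several megabytes
print(sys.getsizeof(squares_gen))    # a few hundred bytes, regardless of range size
```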

Once it has been created, the generator can then be used (or consumed) in much the same way as when using a generator function. Here are a few examples:

#Form 1 - more readable

my_generator = (apply_processing_to(item) for item in iterable)
for result in my_generator:
    output(result)


#Form 2 - more concise

for result in (apply_processing_to(item) for item in iterable):
    output(result)

Generator expressions in practice

Here is the previous book list example, rewritten using form 2:

#Example: Generator Expressions

with open("Programming_Books_List.txt",'r') as file_obj:
    for match in (line for line in file_obj if "Python" in line):
        print(match)

Note that only these three lines are required, whereas the minimum needed to do the same with a generator function is seven. This is far more concise and leads to elegant code, but cannot be used in all situations.

Why use generators?

Generators are especially useful when:

  1. performing repetitive tasks over large amounts of data, where the original data is **only needed once**
  2. performing computations on long sequences of data (that may or may not fit in memory – or may even be infinite!)
  3. generating sequences of data where each item should be calculated only when needed (lazy evaluation)
  4. performing a sequence of the same operations on multiple items in a stream of data (in a pipeline, similar to Unix pipes)

In the first case, generators are very effective if the data itself does not need to be stored in memory or referred to again. They allow the programmer to operate on smaller chunks of data in a piecemeal fashion, and yield up the results one by one. At no point is the program required to hold onto any data from previous iterations.

The benefits generators bring are:

  • Reduced memory usage
  • Better speed and less overhead than other iteration methods
  • They allow elegant construction of pipelines

In the previous book list searching example, there is not truly a great benefit to performance or resource usage by using the generator, as it’s an extremely simple use case, and the source data is not very large. Also, the processing required is so minimal that it would be easy to implement using other methods.

But what if the processing required on each line is much more complicated? Perhaps some kind of text analysis, Natural Language Processing or checking words against a dictionary?

Let’s say in that example, we also want to take each book title, search for it in ten different online bookstores, then return the cheapest price available. Then let’s expand the source data: say we replace the book list with a copy of every book Amazon has available.

At that scale, the problem becomes so large that traditional iteration would demand significant resources, and would be comparatively much more difficult to code with any efficiency.

In this case, using generators would simplify the code a great deal and would mean processing can begin immediately after the first book title is found. Additionally, there is little overhead, even when dealing with very large source files.

Basic usage

Infinite sequences

The by now rather clichéd example of producing the Fibonacci sequence using generators is an old favorite in the Computer Science teaching world, but it’s still worth taking a look at.

Here is the code:

#Example: Fibonacci sequence using generators

def fibonacci(limit):
    """ Generate the fibonacci sequence, stop when
        we reach the specified limit """
    current = 0
    previous1 = 0
    previous2 = 0
    while current <= limit:
        return_value = current
        previous2 = previous1
        previous1 = current
        if current == 0:
            current = 1
        else:
            current = previous1 + previous2
        yield return_value

for term in fibonacci(144):
    print(term)

Its output is as follows:

0
1
1
2
3
5
8
13
21
34
55
89
144

It’s a fairly trivial use case, but it does happen to illustrate the fact that the generator is handling storage of the two previous values in its local variables, and that these are not lost between iterations. No other data is stored, so the function will use pretty much the same amount of memory throughout its life, from iteration one to iteration one hundred thousand.
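Dropping the limit gives a genuinely infinite sequence, which the consumer can slice lazily with `itertools.islice`. This is a sketch building on the example above, with the state collapsed into a tuple assignment:

```python
from itertools import islice

def fibonacci_infinite():
    """Yield Fibonacci numbers forever - the caller decides when to stop."""
    previous, current = 0, 1
    while True:
        yield previous
        previous, current = current, previous + current

# Take just the first ten terms of the infinite sequence
print(list(islice(fibonacci_infinite(), 10)))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```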

Advanced usage

Using generators as pipelines

Pipelines are where the real power of generators can be seen. They can be implemented by simply chaining generators together so that the output from one passes into the input of the next. They are very useful when multiple operations need to be applied in turn to one set of data.
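A minimal sketch of the idea: three tiny hypothetical stages chained together, so each item flows through all of them one at a time:

```python
def produce(limit):
    """Stage 1: generate raw numbers."""
    for n in range(limit):
        yield n

def square(numbers):
    """Stage 2: transform each item."""
    for n in numbers:
        yield n * n

def keep_even(numbers):
    """Stage 3: filter items."""
    for n in numbers:
        if n % 2 == 0:
            yield n

# Chain the stages: the output of one becomes the input of the next
pipeline = keep_even(square(produce(10)))
print(list(pipeline))   # [0, 4, 16, 36, 64]
```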

As noted in Dave Beazley’s excellent presentation (see references), generators can be used to great effect by the systems programmer. Rather than produce arbitrary mathematical sequences, Beazley demonstrates useful techniques like parsing web server logs, monitoring files and network ports, iterating over files and filesystems, and more.

Below is my own example that uses a mix of generator functions and expressions to perform a text search on multiple files. I’ve set it up to search for the string “# TODO:” in all Python files in the current directory.

Whenever I spot a problem in my code, or I have an idea I’d like to implement later, I like to insert a to-do note as close as possible to where it is needed, using that notation.

It often comes in handy, but these notes can get lost when working on big projects with lots of large files!

This example is a little convoluted and could be greatly improved with the use of regular expressions (and quite likely other library or OS functions), but as a pure Python demo, it should illustrate a little of what can be achieved using generators:

# pipeline_demo.py
#Example: Search for "# TODO:" at start of lines in Python
# files, to pick up what I need to work on next

import os

def print_filenames(filenames):
    """Prints out each filename, and returns it back to the pipeline"""
    for filename in filenames:
        print(filename)
        yield filename

def file_read_lines(filenames):
    """Read every line from every file"""
    for filename in filenames:
        with open(filename,'r') as file_obj:
            for line in file_obj:
                yield line

#get a list of all python files in this directory
filenames_list = os.listdir(".")

#turn it into a generator
filenames = (filename for filename in filenames_list)

#filter to only Python files (*.py)
filenames = (filename for filename in filenames if filename.lower().endswith(".py"))

#print out current file name, then pop it back into the pipeline
filenames = print_filenames(filenames)

#pass the filenames into the file reader, get back the file contents
file_lines = file_read_lines(filenames)

#strip out leading spaces and tabs from the lines
file_lines = (line.lstrip(" \t") for line in file_lines)

#filter to just lines starting with "# TODO:"
filtered = (line for line in file_lines if line.startswith("# TODO:"))

#strip out trailing spaces, tabs and newlines
filtered = (line.rstrip() for line in filtered)

#display output
for item in filtered:
    print(item)

# TODO: Write generator example
# TODO: Test on current folder
    # TODO: Test on a line indented with spaces
            # TODO: Test on a line indented with tabs
# TODO: Add more TODOs

Output:

test-TODOs.py
# TODO: Test finding a TODO in another file
test-noTODOs.py
pipeline_demo.py
# TODO: Write generator example
# TODO: Test on current folder
# TODO: Test on a line indented with spaces
# TODO: Test on a line indented with tabs
# TODO: Add more TODOs

This idea can be taken much further – as mentioned above, there are endless use cases in the systems programming and system administration spaces. If you ever have to manage log files on servers, this kind of technique will be extremely valuable.

Please see the references below for further information on this topic and some excellent examples.

You can get all the code related to this article on JBTAdmin Github.

References/Further Reading

  • Python Wiki – Generators. Multiple authors. Python Wiki.
  • Python Wiki – Iterator. Multiple authors. Python Wiki.
  • Python Generators. Scott Robinson. Stack Abuse website.
  • Python Practice Book – Chapter 5. Iterators & Generators. Anand Chitipothu. Python Practice Book website.
  • Generator Tricks for Systems Programmers – Version 2.0. David M. Beazley. David Beazley’s website.
  • 2 great benefits of Python generators (and how they changed me forever). Aaron Maxwell. O’Reilly website.
