Python Generators: Memory Efficient Data Processing

posted 11 min read

The real challenge in data processing lies in writing code that uses memory efficiently. Python provides iterators and generators for writing memory-conscious code that consumes less memory while delivering the same results. Generators are a special kind of function that produce a sequence of values on demand, unlike regular functions, which return a single value. This article covers the concept of generators and the syntax for using them, with relevant examples including calculating an average and generating an infinite stream of random numbers, and then moves on to practical uses of generators along with tips and best practices for handling large datasets with them.

Understanding Iterators

1. Explanation of Iterators and Their Role in Python

An iterator lets a program loop through a stream of values one by one, in an orderly manner, so that each value can be accessed and processed according to the requirements of the program.
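Under the hood, a `for` loop obtains an iterator from the iterable and pulls values from it one at a time. As a rough sketch, the loop is equivalent to calling the built-in iter() and next() functions manually:

```python
marks = [30, 40, 10]

iterator = iter(marks)  # obtain an iterator from the iterable
print(next(iterator))   # first value: 30
print(next(iterator))   # second value: 40
print(next(iterator))   # last value: 10; one more next() would raise StopIteration
```

This is the protocol a `for` loop follows for you, including catching the final StopIteration to end the loop.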

2. Illustration of Iterating Over a List Using a `for` Loop

Let's take an example of iterating over a list of marks to calculate the average of marks.

student_marks = [30, 40, 10, 20, 20]
sum_marks = 0

# iterate through the student marks and calculate sum of marks
for marks in student_marks:
    sum_marks += marks

average = sum_marks/len(student_marks)
print("Average of Marks is: ", average)
  1. The program first declares a list holding the set of marks.
  2. Then, using the for-in loop, the program iterates through each value one by one and adds it to the sum_marks variable.
  3. Finally, the program calculates the average of the marks and prints the result on the console.


3. Highlighting the Limitations of Working With Large Datasets Using Conventional Lists

Although iterating over a list is fine for small sets of values that do not require complex processing, it can prove inefficient for the large datasets that need processing in today's world. A list loads every value into memory at once, which is costly in time, cost, and memory, and plain iteration is sequential and single-threaded. Single-threaded execution means the work cannot be spread across multiple processors, which may not prove efficient for complex processing.

Introducing Generators

1. Definition of Generators and How They Differ From Regular Functions

To avoid unnecessary processing and preserve memory, generators are a natural fit. Unlike regular functions, which return one value and then terminate, generators can return a sequence of values with the help of the yield keyword.

2. Comparison Between Generator Functions and Regular Functions
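A minimal sketch of the contrast: a regular function builds and returns its whole result at once, while a generator function suspends at each yield and produces one value per request. The function names here are illustrative, not from a library.

```python
def squares_list(limit):
    """Regular function: builds and returns the full list in memory."""
    result = []
    for i in range(1, limit + 1):
        result.append(i * i)
    return result

def squares_generator(limit):
    """Generator function: yields one square at a time, on demand."""
    for i in range(1, limit + 1):
        yield i * i

print(squares_list(3))             # [1, 4, 9] -- all values exist at once
print(list(squares_generator(3))) # [1, 4, 9] -- same values, produced lazily
```

Calling `squares_generator(3)` does not run the body at all; it returns a generator object, and the body executes only as values are consumed.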

3. Example of a Simple Generator Function That Yields Values

Below is an example of a generator function that yields the squares of numbers up to the specified limit:

# define a generator named sequence_generator
def sequence_generator(limit):
    i = 1
    while i <= limit:
        yield i*i # generate the value with the yield keyword after squaring
        i += 1

values = sequence_generator(10)

for value in values:
    print(value)

Working with Generator Expressions

1. Introduction to Generator Expressions as a Concise Way to Create Generators

Generator expressions provide an inline, concise way of implementing generators. Generator functions must be properly defined with the def and yield keywords, but generator expressions need neither. In this way they offer convenience to coders, who can write and use a generator expression right where it is required.

2. Illustration of Creating a Generator Expression for Generating Fibonacci Numbers

# generator function for fibonacci series
def fibonacci_series():
    number, next_number = 0, 1
    while True:
        yield number
        number, next_number = next_number, number + next_number

# generator expression for fibonacci series
fibonacci_gen_exp = (num for num in fibonacci_series())

# generating the first eight fibonacci numbers
for curr in range(8):
    print("Fibonacci number", curr, "is", next(fibonacci_gen_exp))


3. Advantages of Using Generator Expressions Over List Comprehensions

Following are three major advantages of generator expressions over list comprehensions:

  • Efficient processing with large datasets.
  • Generating values on demand.
  • Loads and reserves memory only for the data which needs processing.
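The memory advantage is easy to check with sys.getsizeof. A rough sketch (the exact byte counts vary across Python versions and platforms):

```python
import sys

# list comprehension: all values are materialized in memory at once
squares_list = [n * n for n in range(100_000)]

# generator expression: values are produced on demand, nothing is stored
squares_gen = (n * n for n in range(100_000))

print(sys.getsizeof(squares_list))  # hundreds of kilobytes
print(sys.getsizeof(squares_gen))   # a couple hundred bytes, regardless of range size
```

Note that sys.getsizeof reports a shallow size, but the point stands: the generator object's size does not grow with the length of the sequence it represents.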

Always prefer using generators over iterators for large datasets.

Lazy Evaluation and Memory Efficiency

1. Explanation of Lazy Evaluation and Its Significance in the Context of Generators

To keep it simple, lazy evaluation means processing data on demand rather than processing it all at once and reserving unnecessary memory, which matters especially for large datasets containing thousands of parameters and millions of values. Generators produce values on demand with the help of the next() function, generating each value only when it is required and avoiding memory reservation.
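One way to see the laziness directly is to put a print statement in the generator body; nothing runs until a value is requested. The function name here is illustrative:

```python
def lazy_values():
    print("computing first value")
    yield 1
    print("computing second value")
    yield 2

gen = lazy_values()  # nothing is printed yet -- the body has not started running
print(next(gen))     # runs up to the first yield, then prints 1
print(next(gen))     # resumes after the first yield, then prints 2
```

Each next() call resumes the function exactly where it left off, so work is done strictly on demand.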

2. Comparison of Memory Usage Between Generators and Lists

Let's take the example of generating the values 1 to 1000000 and adding 10 to each. A naive approach is to first build a list of 1000000 values and then update each one by adding 10. Using generators, we can instead define a generator function or use a generator expression that produces values only when required and reserves no unnecessary memory. Reserving memory for millions of values may hit the memory limit and stop the program with an abnormal exit, so generators are quite useful for avoiding these problems through lazy evaluation and memory efficiency.
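As a sketch of that scenario, the two approaches look like this; both yield the same values, but only the list approach holds the full sequence in memory:

```python
# Naive approach: materialize all one million values, then use them.
values_list = [n + 10 for n in range(1, 1_000_001)]

# Generator approach: no storage; each value is produced as it is consumed.
values_gen = (n + 10 for n in range(1, 1_000_001))

# Both can be aggregated, but the generator never holds the whole sequence.
print(sum(values_gen))
```

A consumed generator is exhausted, so `values_gen` can only be iterated once; create a fresh one if the values are needed again.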

3. Example Demonstrating the Memory Efficiency of Generators When Dealing With Large Datasets

import sys

# build the cubes of 1 to 99 up front in a list
cubic_numbers = [i**3 for i in range(1, 100)]

# define a generator to produce the same values on demand
def generator_efficiency():
    i = 1
    while i < 100:
        cube_value = i**3
        yield cube_value
        i += 1

gen_obj = generator_efficiency()

size_of_list = sys.getsizeof(cubic_numbers)
size_of_generator = sys.getsizeof(gen_obj)

print("Size of List in Bytes: ", size_of_list)
print("Size of Generator in Bytes: ", size_of_generator)


The code snippet above takes the cubes of the numbers 1 to 99 using both a list and a generator. It is clear from the example that the list consumes much more memory (920 bytes) than the generator (192 bytes); the exact figures vary by Python version, but the gap only widens as the dataset grows.

Practical Use Cases of Generators

1. Illustration of Using Generators for Processing Large Files Line by Line

Loading the entire content of a large file at once can consume a lot of memory and rarely leads to efficient data processing in practice. This creates the need for generators when processing large datasets and files of large sizes. Generators can process a file line by line instead of loading the whole content at once, giving a memory-efficient way to run complex computations over large and complex data.
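A minimal sketch of the line-by-line pattern. The file name `sample.txt` is a stand-in created here so the example is self-contained; in practice the file would already exist and be far larger:

```python
# Create a small sample file so the example runs on its own
# (a real use case would point at an existing large file).
with open("sample.txt", "w") as f:
    f.write("first line\nsecond line\n")

def read_lines(path):
    """Yield one stripped line at a time instead of loading the whole file."""
    with open(path) as file:
        for line in file:
            yield line.rstrip("\n")

for line in read_lines("sample.txt"):
    print(line)
```

Because the generator holds only the current line, memory usage stays flat no matter how large the file is.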

2. Example of Generating Infinite Sequences (Random Number Generator)

Another use of generators is producing an infinite sequence of random numbers on demand. Normally a program generates a fixed batch of random numbers within a certain range, but a generator provides a random number only when needed and is therefore capable of generating infinitely many of them.

import random

def random_num():
    while True:
        yield random.randint(1, 10) # yield a random number when required

gen_obj = random_num() # generator object

random_number = next(gen_obj) # next is used to generate the next value on demand

while random_number != 5: # keep generating random values and stop when the value is 5
    print("New Random Number: ", random_number)
    random_number = next(gen_obj)


3. Application of Generators In Combination With Other Python Features (Itertools)

Combining generators with a Python feature like itertools can help further with advanced data processing. itertools provides a variety of functions (count, cycle, repeat, chain, compress, etc.) that can be combined with generators for next-level processing and computation, fitting today's requirements where data is generated continuously.

import itertools

limit = 5

repeat_word = itertools.repeat('Coderlegion', limit) # repeat the word Coderlegion up to the specified limit

increment = itertools.count(start=10) # counts upward, adding 1 to the previous value

for _ in range(limit):
    print(next(repeat_word), next(increment))


Advanced Generator Features

1. Introduction to Generator Delegation and The "yield from" Statement

Generator delegation is a way of sub-dividing the responsibilities of a generator and forming a composition between generators. The main generator delegates a task to a sub-generator with the help of the yield from statement.
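A minimal sketch of delegation (the generator names are illustrative): values from the sub-generator pass straight through the main generator as if the main generator had yielded them itself.

```python
def sub_generator():
    yield 1
    yield 2

def main_generator():
    yield 0
    yield from sub_generator()  # delegate: the sub-generator's values pass through
    yield 3

print(list(main_generator()))  # [0, 1, 2, 3]
```

Without `yield from`, the main generator would need an explicit `for value in sub_generator(): yield value` loop; the statement is both shorter and forwards sent values and return values correctly.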

2. Explanation of Generator Pipelines and How To Chain Generators Together

We can chain generators and form a pipeline by calling a sub-generator within the main generator. The sub-generator yields its value back to the main generator for further processing, and a sub-generator is capable of calling another sub-generator in turn. By chaining generators into a pipeline of generators and sub-generators, it becomes easier to perform operations on complex data, such as sorting, filtering on certain conditions, removing duplicate entries, and other required modifications.

3. Example Demonstrating the Composition of Generators for Complex Data Processing Tasks

def capitalize_word(hashed_word):
    # Yield the word without hash
    yield hashed_word.capitalize()

def hyphen_remover(hashed_word):
    # Pass to Capitalize Generator
    yield from capitalize_word(hashed_word.replace('-', ' '))

def hash_remover(hashed_word):
    # Pass to Hyphen Remover Generator
    yield from hyphen_remover(hashed_word.replace('#', ''))

def hashtag_converter(hashed_word):
    # Pass to Hash Remover Generator
    converted_word = yield from hash_remover(hashed_word)
    yield converted_word

for hashed_word in ["#coderlegion", "#one-community", "#more-codes"]:
    converted_word = next(hashtag_converter(hashed_word))
    print(converted_word)


Best Practices and Tips

1. Recommendations for Writing Efficient and Readable Generator Code

The following tips may help you write generator code that fulfills its responsibility in a better and more optimized way:
  • Note down the operations to be performed on the data.
  • Chain the generators, and name them so that each generator is responsible for one and only one operation; this achieves modularity and makes the code much clearer and more readable.
  • Keep loops and conditions regular while writing generators, avoiding unnecessary looping and condition checks with the help of break and continue statements.

As generators generate values on-the-fly, they do not support indexing or slicing.
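Because indexing and slicing are unsupported, slicing an infinite generator like `gen[2:5]` raises a TypeError. itertools.islice gives the same effect lazily; the `naturals` generator below is an illustrative example:

```python
import itertools

def naturals():
    """Infinite generator of the natural numbers 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

# naturals()[2:5] would raise TypeError; islice takes the same slice lazily.
first_slice = itertools.islice(naturals(), 2, 5)
print(list(first_slice))  # [3, 4, 5]
```

islice consumes only as many values as the slice needs, so it is safe even on infinite generators.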

2. Tips for Optimizing Generator Performance

For processing large datasets, it is good practice to chain generators and delegate to sub-generators only when necessary, as unnecessary generator delegation can compromise the performance of generator code and use more memory.

3. Common Pitfalls to Avoid When Working With Generators

Following points should be kept in mind while working with generators:
  • Don't forget to handle the StopIteration exception properly.
  • Unnecessary chaining and pipelining of generators can make code difficult to maintain, so chain generators only as the requirement demands.
  • Apply conditions to terminate generators once the necessary processing is done; otherwise they may lead to memory leaks.
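On the first point, a sketch of the two idiomatic ways to handle exhaustion (the generator name is illustrative): catch StopIteration explicitly, or pass a default value to next().

```python
def two_values():
    yield "a"
    yield "b"

gen = two_values()
print(next(gen))  # "a"
print(next(gen))  # "b"

# A bare next() on an exhausted generator raises StopIteration;
# either handle it explicitly...
try:
    print(next(gen))
except StopIteration:
    print("generator exhausted")

# ...or supply a default to next(), which avoids the exception entirely.
print(next(gen, "no more values"))
```

A plain `for` loop handles StopIteration for you; explicit handling matters only when you drive a generator manually with next().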

Q: What does the term "Lazy Evaluation" mean?
A: Lazy evaluation means loading data into memory on demand, only when it is required for an operation or for processing.
Q: Are generator functions and generator expressions the same thing?
A: No, generator functions and generator expressions are related yet different. A generator expression is a concise, inline way of implementing a generator.
Q: Should I prefer generators over iterators?
A: Plain iteration over lists is fine for small datasets, but for large, detailed datasets and complex computations, generators are highly recommended.
Q: Why do I need to chain generators?
A: Chaining generators improves the modularity and clarity of code, which ultimately helps in applying complex computations to large datasets.

Wrapping Up

The article began with a brief discussion of iterators and then gave a detailed view of generators, moving from basic to advanced concepts: generator functions, generator expressions, chaining and delegating generators, and the practices to follow for writing efficient, readable, and optimized generator code. It included basic and real-world examples to help you gain a detailed understanding of generators. You are encouraged to work through some more generator examples to deepen your knowledge and to apply these concepts in your own Python projects. Not Only Code, Code Well!
