Understanding Performance Bottlenecks

Before you can optimize your code, you must first identify where the performance bottlenecks lie. Tools like cProfile and timeit are invaluable for profiling and benchmarking Python code. Once you’ve identified slow sections of your code, you can apply optimization strategies such as using built-in data types, avoiding unnecessary computations, and leveraging compiled code with tools like Cython or Numba.

Here’s a simple example using cProfile to profile a function:

import cProfile

def slow_function(n):
    return sum([i**2 for i in range(n)])

cProfile.run('slow_function(10000)')

This will output a detailed breakdown of where the function spends most of its time.
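The timeit module mentioned earlier complements cProfile for benchmarking a small snippet in isolation. A minimal sketch (slow_function is redefined here so the snippet stands alone):

```python
import timeit

def slow_function(n):
    return sum(i**2 for i in range(n))

# Run the call 100 times and report the total elapsed seconds
elapsed = timeit.timeit(lambda: slow_function(10000), number=100)
print(f"{elapsed:.4f} seconds for 100 runs")
```

Passing a callable avoids the string-and-setup form of timeit.timeit, which is convenient when the code under test is already importable.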

Using Built-in Functions and Libraries

Built-in functions and standard-library modules are typically implemented in C and are much faster than equivalent hand-written Python. For example, map() and the itertools module move the loop machinery into C. The gain is largest when the mapped function is itself a C built-in; with a Python-level lambda the improvement over an explicit loop is modest, and a list comprehension is often just as fast.

# Explicit loop: bytecode dispatch and a method lookup per iteration
result = []
for i in range(1000000):
    result.append(i * 2)

# map() runs the loop itself in C
result = list(map(lambda x: x * 2, range(1000000)))

Additionally, consider using libraries like NumPy for numerical computations. NumPy operations are implemented in optimized C code, which makes them significantly faster than native Python loops.

import numpy as np

# Native Python
result = [x * 2 for x in range(1000000)]

# NumPy
result = np.arange(1000000) * 2

Avoiding Global Lookups in Loops

Accessing global variables inside loops can be slow because Python has to look up the variable in the global namespace every time. To mitigate this, assign the variable to a local variable before the loop.

# Slow: math.sqrt is resolved via a global (and attribute) lookup
# on every iteration
import math

def slow_roots(values):
    result = []
    for v in values:
        result.append(math.sqrt(v))
    return result

# Faster: bind the function and the method to local names once
def fast_roots(values):
    result = []
    sqrt = math.sqrt        # one lookup, reused on every iteration
    append = result.append  # avoids a repeated attribute lookup too
    for v in values:
        append(sqrt(v))
    return result

In the faster version, sqrt and append are local variables, so Python resolves each name once before the loop instead of performing a global or attribute lookup on every iteration.

Using List Comprehensions and Generator Expressions

List comprehensions and generator expressions are often more efficient than traditional for-loops in Python. They compile to specialized bytecode and avoid the per-iteration overhead of looking up and calling list.append.

# Traditional loop
squares = []
for i in range(1000000):
    squares.append(i * i)

# List comprehension
squares = [i * i for i in range(1000000)]

# Generator expression for lazy evaluation
squares_gen = (i * i for i in range(1000000))

Use list comprehensions for cases where you need the full list, and generator expressions for lazy iteration over large datasets.
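When the values are consumed immediately by a reduction such as sum(), a generator expression avoids materializing the intermediate list entirely; a small sketch:

```python
# Builds a million-element list in memory before summing
total_from_list = sum([i * i for i in range(1000000)])

# Streams values into sum() one at a time; memory use stays constant
total_from_gen = sum(i * i for i in range(1000000))

print(total_from_list == total_from_gen)  # same result, less memory
```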

Leveraging the functools.lru_cache Decorator

The functools.lru_cache decorator is a powerful tool for memoizing results of function calls. This is especially useful for recursive functions or functions with expensive computations.

from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Call the function
print(fibonacci(100))

Without lru_cache, this function would be prohibitively slow for large n. The decorator caches results, reducing redundant computations.
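The decorator also attaches cache_info() and cache_clear() to the wrapped function, which is a quick way to confirm the cache is actually being hit:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

fibonacci(100)
info = fibonacci.cache_info()
print(info)              # hits, misses, maxsize, currsize
fibonacci.cache_clear()  # drop all cached entries if needed
```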

Using Multiprocessing for CPU-bound Tasks

Python’s Global Interpreter Lock (GIL) limits the performance of multi-threaded CPU-bound tasks. For CPU-bound workloads, using the multiprocessing module can bypass the GIL and utilize multiple CPU cores.

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as p:
        result = p.map(square, range(1000000))

This example uses four processes to compute squares in parallel. The if __name__ == "__main__": guard is required so that child processes, which re-import the module under the spawn start method (the default on Windows and macOS), do not recursively create new pools.
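Rather than hard-coding the worker count, the pool can be sized to the machine: Pool() with no argument defaults to os.cpu_count() processes, and the optional chunksize argument to map() batches small work items to reduce inter-process communication overhead. A sketch (the chunksize value here is an illustrative guess, not a tuned setting):

```python
import os
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Pool() defaults to os.cpu_count() worker processes
    with Pool() as p:
        # chunksize batches the items sent to each worker,
        # cutting per-item IPC overhead for cheap functions
        result = p.map(square, range(1000000), chunksize=10000)
    print(result[:3])  # → [0, 1, 4]
```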

Summary of Optimization Techniques

Technique             | Use Case                               | Performance Impact
----------------------|----------------------------------------|-------------------
Built-in functions    | Replacing custom logic with built-ins  | High
NumPy                 | Numerical and array operations         | Very High
List comprehensions   | Fast list creation                     | Medium
Local variable lookup | Avoiding global lookups in loops       | Medium
lru_cache             | Repeated expensive function calls      | High
multiprocessing       | CPU-bound parallel tasks               | Very High
