We can make Python run faster on multi-core CPUs, but one does not simply make use of multiple cores. We must design our programs carefully to leverage all available cores in a given machine effectively.
In this guide, you will learn the difference between regular (sequential) and parallel processing. You'll also learn about different techniques for running multiple processes simultaneously, effectively side-stepping the limitations of the Python interpreter.
By the end of the guide, you'll be able to write programs that run in parallel on all processors available and also as background processes.
Table of Contents
- Introduction
- CPU-Bound vs. IO-Bound Tasks
- Race Conditions and Pitfalls
- Processes
- Process Pool
- ProcessPoolExecutor
- Coordination Between Processes
- Sharing State Between Processes
- When to use Processes
- Why Not Use Threads
- Why Not Use Asyncio
- Recap: Python Multiprocessing Best Practices
Introduction
Since the release of the first multi-core CPUs back in 2001, processors have kept growing in performance and capability. Modern smartphones and laptop computers include as many as 8-core chips, and we can find high-end CPUs with up to 64 cores at affordable prices.
Most programs run in a single processor thread by default, so multi-core processors can run many of those programs simultaneously, using their different cores and threads. The OS coordinates with the CPU to run the programs on various cores as needed.
We can leverage multiple cores in Python and run parts of our programs simultaneously. But to be successful, we need to use the right strategy to run our tasks efficiently, depending on the task. This is known as multiprocessing and is the key to unlocking the full performance of our CPU.
Let’s look at examples of particular tasks that may benefit from multiprocessing.
CPU-bound vs. IO-bound tasks
Some multiprocessing applications are complex systems that must perform multiple tasks, often in the background while something more important takes place.
Databases use multiple processes to write new data while remaining able to answer read queries. Since database writes must be durable, we have to wait for the write to disk to finish before we know the update was completed. This IO-bound task can only run as fast as the disk can write, regardless of the kind of CPU.
Similarly, reading from a large filesystem also depends on disk read speeds. Just like reading from a network server depends on the speed of our network connection.
On the other hand, running computations in memory is a CPU-bound task, as the processor executes those instructions. For example, rendering graphics, compressing data, running scientific computations, and encoding video are all jobs where the processor does the bulk of the work.
Race conditions and pitfalls
In a regular (aka sequential) program, the instructions are executed in order from start to finish. Every operation must finish before we can proceed with the next. The order of operations is straightforward, linear, and easy to reason about.
In a multiprocessing scenario, we can quickly run into trouble. Consider a program that saves its result to a file. If we run this program on multiple cores concurrently, we can't tell which instance will be the last to write to the file.
This is called a race condition. Generally, if we have some shared resource that multiple processes need access to, we will need a way to coordinate between them, so they don’t conflict. After all, the correctness of our program should not depend on accidents of timing.
One way of preventing race conditions is using locks (or mutexes), which grant exclusive access to the resource to a single process for a limited time, preventing others from accessing the resource until it’s released. Something to keep in mind is that we must ensure the holder releases any acquired locks. Otherwise, other processes may wait indefinitely for access, and our program may never finish. This is called a deadlock.
Processes
What is a process
A process is an instance of a program being executed by the CPU. A program is just a series of instructions typically stored on a disk file, and a process is the execution of that program that happens once the OS loads that program in memory and hands it off to the CPU.
Any given process contains one or more threads, which the CPU can execute concurrently if it supports doing so.
In a multi-core processor, each core executes a single task at a time. However, a core can switch between tasks, typically while a process is waiting on an input/output operation, or whenever the OS scheduler decides to switch, for example after a time slice expires or depending on how many processes are waiting.
We can run multiple processes associated with the same program; for example, opening up several instances of the same program often results in more than one process being executed. However, we can only run as many processes simultaneously as the number of (logical) cores in our processor. The rest will queue up and will have to wait their turn.
Python can detect the number of logical cores in our system like this:
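A minimal sketch using the standard library (os.cpu_count() reports the same information):

import multiprocessing

# number of logical cores available on this machine
print(multiprocessing.cpu_count())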
Running subprocesses
So how can we run multiple processes associated with the same program? One example is running multiple instances of the same program, but we can also launch instances of different programs at will, using different methods.
One of these methods is called "spawning."
Spawning new processes is very straightforward and can be done by constructing a new Process instance and then calling start(). The start() function will return right away, and the OS will execute the target function in a separate process as soon as possible.
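A minimal sketch, where the task function and its message are just placeholders:

import multiprocessing

def task():
    # placeholder work executed in the child process
    print('Hello from a child process')

if __name__ == '__main__':
    # construct the process, then start it; start() returns immediately
    process = multiprocessing.Process(target=task)
    process.start()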
Naming a process
We can give our processes a name for easier debugging and logging. By default, they will get a name assigned by the Python interpreter, but we can make their names more user-friendly!
Let's see this in practice:
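A minimal sketch, where the task function and the name 'worker-1' are placeholders:

import multiprocessing

def task():
    # report the name of the process running this function
    print(multiprocessing.current_process().name)

if __name__ == '__main__':
    # assign a friendly name when constructing the process
    process = multiprocessing.Process(target=task, name='worker-1')
    process.start()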
Waiting for processes
When we start a new process in Python, the start() function does not block the parent process. That means it creates a new process and returns immediately, regardless of how long the subprocess may take to run. We can explicitly wait for another process by calling the join() method.
This will block our main program and wait for the child process to terminate. Sometimes we don't want to wait for the process forever, though, and we may need to terminate it manually.
Here's an example:
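A sketch of both waiting and manual termination; the slow_task function and the 5-second timeout are placeholders:

import time
import multiprocessing

def slow_task():
    # placeholder long-running work
    time.sleep(60)

if __name__ == '__main__':
    process = multiprocessing.Process(target=slow_task)
    process.start()
    # block for at most 5 seconds while the child runs
    process.join(timeout=5)
    if process.is_alive():
        # still running after the timeout, so stop it manually
        process.terminate()
        process.join()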
Fork vs. Spawn
When starting a new process, the Python interpreter will hand off the process to the OS for execution. We should take into account how this handoff happens.
Spawning a new process means starting a fresh Python interpreter instance to execute our new process. Python then has to transfer (by pickling) the target function and its arguments from the parent program so the new process can run them concurrently. There is a performance hit during this initialization step.
On the other hand, forking is faster because it does not copy from the parent process; instead, the necessary resources are inherited and copied on write. This makes it trickier to coordinate resources between the parent and child process.
Forking is the default start method on Linux implementations, while macOS and Windows use spawn by default. However, because forking can lead to varying behavior, recent Python docs mention that the fork start method should be considered unsafe as it can lead to subprocesses crashing.
We can tell Python always to use spawn with this snippet:
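A minimal sketch; the call should happen only once, guarded by the main-module check:

import multiprocessing

if __name__ == '__main__':
    # force the spawn start method for every process created via multiprocessing
    multiprocessing.set_start_method('spawn')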
Security concerns when starting processes
When starting a subprocess on a typical Unix system, fork() is called, and the host process is copied. Modern Unix systems have a copy-on-write implementation of that syscall, meaning that the operation does not copy all the memory of the host process over. Forking is (almost) immediately followed by calling execve() in the child process, which transforms the calling process into a new process.
While Python hides that away from us, we may need to call some OS functions directly and manage the creation of processes through the OS. For example, by using system() calls or invoking shell commands. The problem here is that system() calls expect arguments as strings and will use the shell to evaluate these arguments. This can be exploited by using malicious arguments to execute arbitrary commands or even get full shell access and hack the host system.
A secure way to interact with the OS is by using the subprocess module and setting the keyword shell=False.
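For example, a sketch that runs a directory listing by passing the command and its arguments as a list; the command itself is just an illustration:

import subprocess

# the command is a list of arguments; with shell=False (the default),
# no shell is invoked, so the arguments are never shell-evaluated
result = subprocess.run(['ls', '-l', '/tmp'], capture_output=True, text=True, shell=False)
print(result.stdout)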
Running processes in the background
We may need to start a process to execute sporadic, periodic, or long-running tasks in the background. These are called "daemon" processes, after the daemons or spirits of ancient Greek mythology conjured to perform tasks in the background.
A process may be configured as a daemon or not, and most processes in concurrent programming, including the main parent process, are non-daemon processes (not background processes) by default.
Some use cases for daemon processes are:
- Background logging to a file or database
- Retrieving or querying data in the background
- Storing data on disk or in a database
The main parent process will terminate once all non-daemon processes finish, even if daemon processes are still running. Therefore, daemon code must be robust to arbitrary termination; external resources like streams and files may not be flushed and closed correctly.
Daemon processes do not need to be waited for or manually terminated, as they will be closed when all non-daemon processes terminate. Even so, it is probably good practice to explicitly join all the processes we start.
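A minimal sketch of a background heartbeat-style daemon; the function and the sleep intervals are placeholders:

import time
import multiprocessing

def background_task():
    # placeholder periodic work running in the background
    while True:
        print('heartbeat')
        time.sleep(1)

if __name__ == '__main__':
    # daemon=True marks this as a background (daemon) process
    process = multiprocessing.Process(target=background_task, daemon=True)
    process.start()
    # the daemon is killed automatically once the main process exits
    time.sleep(3)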
Process Pool
Now, let's switch gears and learn how to leverage all the cores in our system. We can do this in very few lines of code using multiprocessing Pools.
Starting a processing pool
This is a great way to improve the parallelism of our program by just asking Python to distribute a task to all cores in the system. Let’s see an example:
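A sketch along these lines; the square function and the input range are placeholders:

import multiprocessing

def square(x):
    # report which worker process handled this item
    print(multiprocessing.current_process().name, x * x)
    return x * x

if __name__ == '__main__':
    # request the spawn start method and distribute the work across all cores
    with multiprocessing.get_context('spawn').Pool() as pool:
        results = pool.map(square, range(10))
    print(results)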
As we can see, the square function runs in separate processes. Pay special attention to our with statement, where we ask for the spawn start method. This can save some of the headaches discussed before in Fork vs. Spawn and should be standard practice.
Spawn from pools
Also, note that while set_start_method('spawn') changes the start method globally, it's better to use get_context("spawn") in our pools, avoiding any side effects.
ProcessPoolExecutor
The ProcessPoolExecutor class is a convenient wrapper for managing multiprocessing pools. It also provides better resource utilization by reusing a fixed set of worker processes across many tasks.
When using a ProcessPoolExecutor, we have more control over how many processes run at once, when new processes are created, and how they are created. Note that if max_workers is not set, the pool will default to the number of CPU cores on the machine.
The ProcessPoolExecutor class defines three methods used to control our process pool: submit(), map(), and shutdown().
- submit(): Schedule a function to be executed and returns a Future object.
- map(): Apply a function to an iterable of elements and returns an iterator.
- shutdown(): Shut down the executor.
Running tasks with map()
The map() function returns an iterator immediately. This iterator can be used to access the results from the target task function as they become available, in the order the tasks were submitted.
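A minimal sketch; the square function and the worker count are placeholders:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=4)
    # results are yielded in the order the inputs were submitted
    for result in executor.map(square, range(10)):
        print(result)
    # release the worker processes
    executor.shutdown()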
Submitting tasks with submit()
This method takes a function to call and all arguments to the function, then returns a Future object immediately.
The Future object is a promise to return the results from the task (if any) and provides a way to determine whether or not a specific task has been completed.
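A minimal sketch, reusing the placeholder square function:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    executor = ProcessPoolExecutor()
    # submit() schedules the call and returns a Future immediately
    future = executor.submit(square, 7)
    # result() blocks until the task has completed
    print(future.result())
    executor.shutdown()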
The ProcessPoolExecutor is started when the class is created and must be shut down explicitly by calling shutdown(), which will release any resources.
Note that if we don't close the process pool, it will be closed automatically when we exit the main thread. If we forget to close the pool and there are still tasks, the main process will only exit once all the queued tasks have been executed.
We can also shut down the ProcessPoolExecutor automatically, using a context manager for the pool. This is actually the preferred way:
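A minimal sketch, again with the placeholder square function:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    # the executor is shut down automatically when the with block exits
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(10)))
    print(results)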
Coordination Between Processes
A mutual exclusion lock or mutex lock is used to protect critical sections of code from concurrent execution and to prevent a race condition.
A mutex lock can be used to ensure that only a single process executes a critical section of code at a time. All other processes trying to execute the same code must wait until the currently executing process is finished with the critical section and releases the lock.
We have a mutual exclusion (mutex) lock in the multiprocessing.Lock class.
Each process must attempt to acquire the lock at the beginning of the critical section. If the lock is not currently held, the process acquires it; otherwise it must wait until the process holding the lock releases it.
# create a lock
lock = multiprocessing.Lock()
# acquire the lock
lock.acquire()
...
# release the lock
lock.release()
Only one process can hold the lock at any time. If a process does not release an acquired lock, no other process can acquire it. A process attempting to acquire a lock that is already held will block until the current holder releases it.
We can also use the lock with a context manager, allowing the lock to be released automatically once the critical section is completed.
# create a lock
lock = multiprocessing.Lock()
# acquire the lock
with lock:
    # critical section; only the holder of the lock can execute
    ...
A limitation of a (non-reentrant) mutex lock is that if a process has acquired the lock, it cannot acquire it again. In fact, this situation will result in a deadlock as it will wait forever for the lock to be released so that it can be acquired, but it holds the lock and will not release it.
A reentrant lock will allow a process to acquire the same lock again if it has already been acquired.
There are many reasons why a process may need to acquire the same lock more than once. Imagine critical sections spread across different functions, each protected by the same lock. A process may call into one critical section from another critical section.
To create a reentrant lock, we use the multiprocessing.RLock class instead of a regular Lock.
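A sketch of the nested-critical-section case described above; the function names and data are placeholders:

import multiprocessing

def update_item(lock, item):
    # inner critical section, protected by the same lock
    with lock:
        print('updating', item)

def update_all(lock, items):
    # outer critical section that calls into another critical section
    with lock:
        for item in items:
            update_item(lock, item)

if __name__ == '__main__':
    # an RLock can be re-acquired by the process that already holds it
    lock = multiprocessing.RLock()
    process = multiprocessing.Process(target=update_all, args=(lock, ['a', 'b']))
    process.start()
    process.join()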
Sharing State Between Processes
When dealing with multiple processes, we typically need to share data and state between processes. We should wrap the shared data using the following classes:
- multiprocessing.Value: manage a shared value.
- multiprocessing.Array: manage an array of shared values.
For example, a shared ctype value can be defined in a parent process, then shared with multiple child processes. All child and parent processes can then safely read and modify the data within the shared value.
This can be useful in many cases, such as:
- A counter shared among multiple processes.
- Returning data from a child process to a parent process.
- Sharing results of computation among processes.
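A sketch of the shared-counter case; the type code 'i' declares a signed integer, and the function name and counts are placeholders:

import multiprocessing

def increment(counter):
    # get_lock() returns the lock that protects the shared value
    for _ in range(1000):
        with counter.get_lock():
            counter.value += 1

if __name__ == '__main__':
    # a shared signed integer, initialized to 0
    counter = multiprocessing.Value('i', 0)
    processes = [multiprocessing.Process(target=increment, args=(counter,)) for _ in range(4)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    print(counter.value)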
Python also provides a process-safe queue in the multiprocessing.Queue class.
This shared job queue implementation allows queued items to be processed in parallel by multiple workers. This type of queue can store and transfer any pickle-able object across process boundaries.
We can add items to a queue for later processing using the put() method and retrieve them in subprocesses by calling the get() method.
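A minimal producer/consumer sketch; the worker function and the None sentinel convention are placeholders:

import multiprocessing

def worker(queue):
    # pull items off the queue until the sentinel None arrives
    while True:
        item = queue.get()
        if item is None:
            break
        print('processing', item)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=worker, args=(queue,))
    process.start()
    # add items for the worker to process
    for item in range(5):
        queue.put(item)
    # signal the worker to stop
    queue.put(None)
    process.join()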
We can also use a Pipe, which acts like a simple two-ended queue for sending data back and forth between two processes.
For example:
# create a one-way pipe
rec, send = multiprocessing.Pipe(duplex=False)
With duplex=False, the first connection (rec) can only be used to receive data and the second connection (send) can only be used to send data.
We can make the connection objects bidirectional by setting the duplex argument of the constructor to True (this is actually the default).
# create a duplex pipe
conn1, conn2 = multiprocessing.Pipe(duplex=True)
In this case, we can use both connections to send and receive data.
Objects can be shared between processes using the Pipe.
We can use the connection.send() method to send objects from one process to another.
# send an object
conn2.send('Hello world')
We can then use the connection.recv() method to receive objects in one process sent by another.
# will block until an object is received
obj = conn1.recv()
We can check the status of the pipe using the connection.poll() method. This will return a boolean indicating whether there is data to be received and read from the pipe.
# returns True if there is data to read from the pipe.
if conn1.poll():
...
We can also pass a timeout argument, in seconds. The call will then block until data is available, returning False if the timeout elapses before anything arrives.
# waits for 5 seconds
if conn1.poll(timeout=5):
...
When to use Processes
Python only executes a single thread at a time because the Global Interpreter Lock (GIL) requires each thread to acquire a lock on the interpreter before executing, preventing all other threads from executing simultaneously.
Although we may have tens, hundreds, or even thousands of concurrent threads in our Python application, only one thread may execute at any given time.
Process-based concurrency is not limited in the same way as thread-based concurrency.
Both threads and processes can execute concurrently (out of order), but only Python processes can execute in parallel (simultaneously); Python threads cannot (with some caveats).
This means that if we want our Python code to run on all CPU cores and make the best use of our system hardware, we should use process-based concurrency.
If process-based concurrency offers true parallelism in Python, why not always use processes?
First, only one thread can run at a time within a Python process in most situations, except:
- When we are using a Python interpreter that does not use a GIL.
- When we are performing IO-bound tasks that release the GIL.
- When we are performing CPU-bound tasks that release the GIL.
For example, when we are reading or writing from a file or socket, the GIL is released, allowing multiple threads to run in parallel.
Common examples include:
- Hard disk drive: Reading, writing, appending, renaming, and deleting files, etc.
- Peripherals: mouse, keyboard, screen, printer, serial, camera, etc.
- Internet: Downloading and uploading files, getting a webpage, querying RSS, etc.
- Database: Select, update, delete, etc. SQL queries.
- Email: Send mail, receive mail, query inbox, etc.
Additionally, many CPU-bound tasks that are known to not rely on the state of the Python interpreter will release the GIL, such as C-code in third-party libraries called from Python.
For example:
- When performing compression operations, e.g. in zlib.
- When calculating cryptographic hashes, e.g. in hashlib.
- When encoding/decoding images and video, e.g. in OpenCV.
- When performing matrix operations, e.g. in NumPy and SciPy.
- And many more.
Second, processes and process-based concurrency also have limitations compared to threads.
For example:
- We may have thousands of threads, but perhaps only tens of processes.
- Threads are small and fast, but processes are large and slow to create and start.
- Threads can share data quickly and directly, whereas processes must pickle and transmit data to each other.
Why Not Use Threads
Threads and processes are quite different, and choosing one over the other must be quite intentional.
A Python program is a process that has a main thread. You can create many additional threads in a Python process. You can also fork or spawn many Python processes, each of which will have one main thread and may spawn additional threads.
Threads are lightweight and can share memory (data and variables) within a process, whereas processes are heavyweight and require more overhead and impose more limits on sharing memory (data and variables).
Because of the GIL, only a single thread can execute at a time, which makes threads a good candidate for workloads that spend a good amount of time waiting, like IO-bound tasks.
CPU-bound tasks won't benefit from multiple threads and may actually slow down, because of the overhead of switching between tasks that can never run at the same time.
You can learn more about when to use threads in our guide:
Why Not Use Asyncio
Asyncio is a newer alternative to using a threading.Thread, but it's probably not a good replacement for the multiprocessing.Process class.
Asyncio is designed to support large numbers of IO operations, perhaps thousands to tens of thousands, all within a single Thread.
When using the multiprocessing.Process class, we're typically executing CPU-bound tasks, which do not benefit from the asyncio paradigm, so asyncio is not a good alternative for CPU-intensive workloads.
You can learn more about when to use asyncio in Python in our guide:
Recap: Python Multiprocessing Best Practices
Now that we know how processes work and how to use them, let's review some best practices to consider when bringing multiprocessing into our Python programs.
Best practices to keep in mind when using processes in Python:
- Always Check for Main Module Before Starting Processes
- Use Context Managers
- Use Timeouts When Waiting
- Use Locks to Prevent Race Conditions
- Use Values, Pipes, and Queues for Shared State
These best practices will help you avoid many concurrency pitfalls like race conditions and deadlocks.
By now, you should know how multiprocessing works in Python in great detail and how to best utilize processes in your programs.
References
- multiprocessing module
- Python Multiprocessing: The Complete Guide
- Why your multiprocessing Pool is stuck
- Effective Python: 90 Specific Ways to Write Better Python
- Python in a Nutshell
- Python Cookbook
Did you find this guide helpful?
If so, I'd love to know. Please share what you learned in the comments below.
How are you using Processes in your programs?
I’d love to hear about it, too. Please let me know in the comments.
Do you have any questions?
Leave your question in a comment below, and I will answer it with my best advice.