ℹī¸ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://docs.pola.rs/user-guide/misc/multiprocessing/ |
| Last Crawled | 2026-03-31 03:27:25 (6 days ago) |
| First Indexed | 2023-12-28 15:49:45 (2 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Multiprocessing - Polars user guide |
| Meta Description | null |
| Meta Canonical | null |
| Boilerpipe Text | TLDR: if you find that using Python's built-in `multiprocessing` module together with Polars results in a Polars error about multiprocessing methods, you should make sure you are using `spawn`, not `fork`, as the start method:

```python
from multiprocessing import get_context

def my_fun(s):
    print(s)

with get_context("spawn").Pool() as pool:
    pool.map(my_fun, ["input1", "input2", ...])
```
## When not to use multiprocessing

Before we dive into the details, it is important to emphasize that Polars has been built from the start to use all your CPU cores. It does this by executing computations that can be done in parallel in separate threads. For example, requesting two expressions in a `select` statement can be done in parallel, with the results only being combined at the end. Another example is aggregating a value within groups using `group_by().agg(<expr>)`; each group can be evaluated separately. It is very unlikely that the `multiprocessing` module can improve your code performance in these cases. If you're using the GPU engine with Polars, you should also avoid manual multiprocessing: when used simultaneously, they can compete for system memory and processing power, leading to reduced performance.

See [the optimizations section](https://docs.pola.rs/user-guide/lazy/optimizations/) for more information.
## When to use multiprocessing

Although Polars is multithreaded, other libraries may be single-threaded. When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to gain a speedup.
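The pattern above can be sketched as follows. This is a minimal illustration, not Polars-specific: `slow_transform` is a hypothetical stand-in for a CPU-bound call into a single-threaded third-party library, and a `spawn` pool distributes the inputs across processes.

```python
from multiprocessing import get_context

def slow_transform(x: int) -> int:
    # Hypothetical stand-in for a CPU-bound call into a
    # single-threaded third-party library.
    return sum(i * i for i in range(x))

def run_parallel(inputs):
    # A "spawn" pool sidesteps the fork-related issues described below.
    with get_context("spawn").Pool() as pool:
        return pool.map(slow_transform, inputs)

if __name__ == "__main__":
    print(run_parallel([100, 200, 300]))
```

Because `spawn` re-imports the module in each worker, both the target function and the pool setup must be importable, which is why `run_parallel` is only invoked under the `__main__` guard.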
## The problem with the default multiprocessing config

### Summary

The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods to create a process pool:

1. spawn
2. fork
3. forkserver

The description of fork is (as of 2022-10-15):

> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
>
> Available on Unix only. The default on Unix.
The short summary is: Polars is multithreaded so as to provide strong performance out of the box. Thus, it cannot be combined with `fork`. If you are on Unix (Linux, BSD, etc.), you are using `fork` unless you explicitly override it.

The reason you may not have encountered this before is that pure Python code, and most Python libraries, are (mostly) single-threaded. Alternatively, you are on Windows, where `fork` is not available at all, or on macOS, where `fork` stopped being the default in Python 3.8.

Thus one should use `spawn` or `forkserver` instead. `spawn` is available on all platforms and is the safest choice, and hence the recommended method.
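Concretely, there are two standard ways to request `spawn`. The sketch below shows both; which one fits depends on whether you control the program's entry point:

```python
import multiprocessing

# Option 1: set the start method globally, once, at program entry.
# Calling this more than once per process raises RuntimeError, so it
# belongs in application code, not library code:
#
#     multiprocessing.set_start_method("spawn")

# Option 2 (safer for library code): request a local context object,
# leaving the process-wide default untouched.
ctx = multiprocessing.get_context("spawn")
print(ctx.get_start_method())                   # "spawn"
print(multiprocessing.get_all_start_methods())  # platform-dependent list
```

`ctx` exposes the same API as the `multiprocessing` module (`Pool`, `Process`, `Queue`, ...), so code written against a context works unchanged with any start method.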
### Example

The problem with `fork` lies in the copying of the parent process. Consider the example below, a slightly modified version of an example posted on the [Polars issue tracker](https://github.com/pola-rs/polars/issues/3144):
```python
import multiprocessing
import polars as pl

def test_sub_process(df: pl.DataFrame, job_id):
    df_filtered = df.filter(pl.col("a") > 0)
    print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")

def create_dataset():
    return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})

def setup():
    # some setup work
    df = create_dataset()
    df.write_parquet("/tmp/test.parquet")

def main():
    test_df = pl.read_parquet("/tmp/test.parquet")
    for i in range(0, 5):
        proc = multiprocessing.get_context("spawn").Process(
            target=test_sub_process, args=(test_df, i)
        )
        proc.start()
        proc.join()
        print(f"Executed sub process {i}")

if __name__ == "__main__":
    setup()
    main()
```
Using `fork` as the method, instead of `spawn`, will cause a deadlock.

The fork method is equivalent to calling `os.fork()`, which is a system call as defined in [the POSIX standard](https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html):

> A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

In contrast, `spawn` will create a completely fresh Python interpreter, which does not inherit the state of mutexes.
So what happens in the code example? To read the file with `pl.read_parquet`, the file has to be locked. Then `os.fork()` is called, copying the state of the parent process, including its mutexes. Thus all child processes copy the file lock in its acquired state, leaving them hanging indefinitely, waiting for a release that never comes.

What makes debugging these issues tricky is that `fork` can work. Change the example so that it no longer calls `pl.read_parquet`:
```python
import multiprocessing
import polars as pl

def test_sub_process(df: pl.DataFrame, job_id):
    df_filtered = df.filter(pl.col("a") > 0)
    print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")

def create_dataset():
    return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})

def main():
    test_df = create_dataset()
    for i in range(0, 5):
        proc = multiprocessing.get_context("fork").Process(
            target=test_sub_process, args=(test_df, i)
        )
        proc.start()
        proc.join()
        print(f"Executed sub process {i}")

if __name__ == "__main__":
    main()
```
This works fine. Debugging these issues in larger code bases (that is, beyond the small toy examples here) can therefore be a real pain, as a seemingly unrelated change can break your multiprocessing code. In general, one should never use the `fork` start method with multithreaded libraries unless there are very specific requirements that cannot be met otherwise.
### Pros and cons of fork

Based on the example, you may wonder why `fork` is available in Python at all.

First, probably for historical reasons: `spawn` was added to Python in version 3.4, whilst `fork` has been part of Python since the 2.x series.

Second, there are several limitations for `spawn` and `forkserver` that do not apply to `fork`; in particular, all arguments should be pickleable. See the [Python multiprocessing docs](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) for more information.
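The pickleability requirement can be demonstrated directly with the `pickle` module, which `spawn` and `forkserver` use to ship arguments to the worker process. This is a sketch of the underlying mechanism, not the exact error multiprocessing would surface: a module-level function round-trips, while a lambda does not.

```python
import pickle

def module_level(x):
    return x + 1

# Module-level functions pickle by reference to their qualified name,
# so they survive the round trip to a worker process.
restored = pickle.loads(pickle.dumps(module_level))
print(restored(1))  # 2

# Lambdas (and other locally defined callables) cannot be pickled,
# so they cannot be passed as targets or arguments under spawn/forkserver.
err = None
try:
    pickle.dumps(lambda x: x + 1)
except Exception as exc:
    err = exc
print(type(err).__name__)
```

Under `fork`, none of this applies: the child inherits the parent's memory, lambdas and all, which is one of its genuine conveniences.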
Third, `fork` is faster at creating new processes than `spawn`, as `spawn` is effectively `fork` plus the creation of a brand new Python process, without the inherited locks, by calling [execv](https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html). Hence the warning in the Python docs that `spawn` is slower: there is more overhead. However, in almost all cases one wants multiple processes to speed up computations that take minutes or even hours, so the overhead is negligible in the grand scheme of things. And, more importantly, `spawn` actually works in combination with multithreaded libraries.
Fourth, `spawn` starts a new process and therefore requires the code it runs to be importable, in contrast to `fork`. In particular, when using `spawn` the relevant code should not live in the global scope, as it does in Jupyter notebooks or in plain scripts. Hence, in the examples above, we define functions and run them from within a `__main__` clause. This is not an issue for typical projects, but it can fail during quick experimentation in notebooks.
## References

1. https://docs.python.org/3/library/multiprocessing.html
2. https://pythonspeed.com/articles/python-multiprocessing/
3. https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
4. https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html |
| Shard | 194 (laksa) |
| Root Hash | 15785171083524017994 |
| Unparsed URL | rs,pola!docs,/user-guide/misc/multiprocessing/ s443 |