🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 194 (from laksa177)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (6 days ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
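
Read as code, the Condition column is a set of boolean predicates over one crawl record. The following is a minimal Python sketch, assuming a hypothetical `record` dict whose keys mirror the column names in the table. The `fh_dont_index` value, the 183-day stand-in for `6 MONTH`, and the `now` timestamp are assumptions rather than values reported by the inspector, and treating "all filters pass" as indexable is only a guess at how the INDEXABLE badge is derived:

```python
from datetime import datetime, timedelta

# Hypothetical crawl record; keys mirror the names used in the Condition column.
record = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 3, 31, 3, 27, 25),
    "history_drop_reason": None,
    "fh_dont_index": 0,  # assumption: not shown in the Details column
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "rs,pola!docs,/user-guide/misc/multiprocessing/ s443",
}


def evaluate_filters(rec: dict, now: datetime) -> dict:
    """Return PASS/FAIL per filter, mirroring the conditions in the table."""
    six_months = timedelta(days=183)  # assumption: "6 MONTH" approximated in days
    checks = {
        "HTTP status": rec["download_http_code"] == 200,
        "Age cutoff": rec["download_stamp"] > now - six_months,
        "History drop": rec["history_drop_reason"] is None,
        "Spam/ban": rec["fh_dont_index"] != 1 and rec["ml_spam_score"] == 0,
        "Canonical": rec["meta_canonical"] in (None, "", rec["src_unparsed"]),
    }
    return {name: "PASS" if ok else "FAIL" for name, ok in checks.items()}


# Assumption: "now" is roughly six days after the last crawl.
results = evaluate_filters(record, now=datetime(2026, 4, 6, 12, 0, 0))
indexable = all(status == "PASS" for status in results.values())
print(results, indexable)
```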

Page Details

| Property | Value |
|---|---|
| URL | https://docs.pola.rs/user-guide/misc/multiprocessing/ |
| Last Crawled | 2026-03-31 03:27:25 (6 days ago) |
| First Indexed | 2023-12-28 15:49:45 (2 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Multiprocessing - Polars user guide |
| Meta Description | null |
| Meta Canonical | null |
Boilerpipe Text
TLDR: if using Python's built-in multiprocessing module together with Polars results in a Polars error about multiprocessing methods, make sure you are using spawn, not fork, as the start method:

    from multiprocessing import get_context


    def my_fun(s):
        print(s)


    # spawn re-imports this module in each child, so the pool must be
    # created behind the __main__ guard.
    if __name__ == "__main__":
        with get_context("spawn").Pool() as pool:
            pool.map(my_fun, ["input1", "input2", ...])

When not to use multiprocessing

Before we dive into the details, it is important to emphasize that Polars has been built from the start to use all your CPU cores. It does this by executing computations that can be done in parallel in separate threads. For example, requesting two expressions in a select statement can be done in parallel, with the results only being combined at the end. Another example is aggregating a value within groups using group_by().agg(<expr>): each group can be evaluated separately. It is very unlikely that the multiprocessing module can improve your code performance in these cases. If you're using the GPU Engine with Polars, you should also avoid manual multiprocessing. When used simultaneously, they can compete for system memory and processing power, leading to reduced performance.

See the optimizations section for more optimizations.

When to use multiprocessing

Although Polars is multithreaded, other libraries may be single-threaded. When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to gain a speedup.

The problem with the default multiprocessing config

Summary

The Python multiprocessing documentation lists three start methods for creating processes:

1. spawn
2. fork
3. forkserver

The description of fork is (as of 2022-10-15):

> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
>
> Available on Unix only. The default on Unix.

The short summary is: Polars is multithreaded in order to provide strong performance out of the box. Thus, it cannot safely be combined with fork. If you are on Unix (Linux, BSD, etc.) and your Python version still defaults to fork, as described in the quote above, you are using fork unless you explicitly override it. (Python 3.14 changed the default on these platforms to forkserver.)

The reason you may not have encountered this before is that pure Python code, and most Python libraries, are (mostly) single-threaded. Alternatively, you are on Windows, where fork is not available at all, or on macOS, where spawn has been the default since Python 3.8 (fork was the default up to Python 3.7).

Thus one should use spawn, or forkserver, instead. spawn is available on all platforms and is the safest choice, and hence the recommended method.

Example

The problem with fork lies in copying the parent process. Consider the example below, which is a slightly modified version of an example posted on the Polars issue tracker:

    import multiprocessing

    import polars as pl


    def test_sub_process(df: pl.DataFrame, job_id):
        df_filtered = df.filter(pl.col("a") > 0)
        print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")


    def create_dataset():
        return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})


    def setup():
        # some setup work
        df = create_dataset()
        df.write_parquet("/tmp/test.parquet")


    def main():
        test_df = pl.read_parquet("/tmp/test.parquet")

        for i in range(0, 5):
            proc = multiprocessing.get_context("spawn").Process(
                target=test_sub_process, args=(test_df, i)
            )
            proc.start()
            proc.join()

            print(f"Executed sub process {i}")


    if __name__ == "__main__":
        setup()
        main()

Using fork as the method, instead of spawn, will cause a deadlock.

The fork method is equivalent to calling os.fork(), which is a system call as defined in the POSIX standard:

> A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

In contrast, spawn will create a completely fresh Python interpreter, and not inherit the state of mutexes.

So what happens in the code example? To read the file with pl.read_parquet, the file has to be locked. Then os.fork() is called, copying the state of the parent process, including mutexes. Thus all child processes copy the file lock in an acquired state, leaving them hanging indefinitely waiting for the file lock to be released, which never happens.

What makes debugging these issues tricky is that fork can work. Change the example so that it no longer calls pl.read_parquet:

    import multiprocessing

    import polars as pl


    def test_sub_process(df: pl.DataFrame, job_id):
        df_filtered = df.filter(pl.col("a") > 0)
        print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")


    def create_dataset():
        return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})


    def main():
        test_df = create_dataset()

        for i in range(0, 5):
            proc = multiprocessing.get_context("fork").Process(
                target=test_sub_process, args=(test_df, i)
            )
            proc.start()
            proc.join()

            print(f"Executed sub process {i}")


    if __name__ == "__main__":
        main()

This works fine. Therefore debugging these issues in larger code bases, i.e. not the small toy examples here, can be a real pain, as a seemingly unrelated change can break your multiprocessing code. In general, one should therefore never use the fork start method with multithreaded libraries unless there are very specific requirements that cannot be met otherwise.

Pros and cons of fork

Based on the example, you may wonder why fork is available in Python to start with.

First, probably because of historical reasons: spawn was added to Python in version 3.4, whilst fork has been part of Python from the 2.x series.

Second, there are several limitations for spawn and forkserver that do not apply to fork; in particular, all arguments must be picklable. See the Python multiprocessing docs for more information.

Third, because creating new processes with fork is faster than with spawn, as spawn is effectively fork + creating a brand new Python process without the locks by calling execv. Hence the warning in the Python docs that it is slower: there is more overhead to spawn. However, in almost all cases, one would like to use multiple processes to speed up computations that take multiple minutes or even hours, meaning the overhead is negligible in the grand scheme of things. And more importantly, it actually works in combination with multithreaded libraries.

Fourth, spawn starts a new process, and therefore it requires code to be importable, in contrast to fork. In particular, this means that when using spawn the relevant code should not be in the global scope, such as in Jupyter notebooks or in plain scripts. Hence in the examples above, we define functions in which we spawn, and run those functions from a __main__ clause. This is not an issue for typical projects, but during quick experimentation in notebooks it could fail.

References

1. https://docs.python.org/3/library/multiprocessing.html
2. https://pythonspeed.com/articles/python-multiprocessing/
3. https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
4. https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html
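
The picklability limitation mentioned under the second point can be checked up front, before handing anything to a spawn-based process. The following is a minimal sketch using only the standard library; `is_picklable` is a hypothetical helper name, not part of Polars or `multiprocessing`:

```python
import math
import pickle


def is_picklable(obj) -> bool:
    """Return True if obj can be pickled, as spawn and forkserver require."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A function that can be imported by name pickles as a reference to that name.
print(is_picklable(math.sqrt))
# A lambda has no importable name, so it cannot be sent to a spawned process.
print(is_picklable(lambda x: x + 1))
# Plain data such as strings and numbers pickles without trouble.
print(is_picklable(("input1", 42)))
```

Running such a check on targets and arguments in the parent process tends to give a clearer error than a failure inside a worker.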
Markdown
# Multiprocessing

TLDR: if using Python's built-in `multiprocessing` module together with Polars results in a Polars error about multiprocessing methods, make sure you are using `spawn`, not `fork`, as the start method:

```python
from multiprocessing import get_context


def my_fun(s):
    print(s)


# spawn re-imports this module in each child, so the pool must be
# created behind the __main__ guard.
if __name__ == "__main__":
    with get_context("spawn").Pool() as pool:
        pool.map(my_fun, ["input1", "input2", ...])
```

## When not to use multiprocessing

Before we dive into the details, it is important to emphasize that Polars has been built from the start to use all your CPU cores. It does this by executing computations that can be done in parallel in separate threads. For example, requesting two expressions in a `select` statement can be done in parallel, with the results only being combined at the end. Another example is aggregating a value within groups using `group_by().agg(<expr>)`: each group can be evaluated separately. It is very unlikely that the `multiprocessing` module can improve your code performance in these cases. If you're using the GPU Engine with Polars, you should also avoid manual multiprocessing. When used simultaneously, they can compete for system memory and processing power, leading to reduced performance.

See [the optimizations section](https://docs.pola.rs/user-guide/lazy/optimizations/) for more optimizations.

## When to use multiprocessing

Although Polars is multithreaded, other libraries may be single-threaded. When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to gain a speedup.
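
As a sketch of that case, the following runs a single-threaded function over several inputs in a `spawn`-based pool. `zlib.crc32` only stands in for a call into a single-threaded library, and `checksum_all` is a hypothetical helper name chosen for illustration:

```python
import multiprocessing as mp
import zlib


def checksum_all(chunks, processes=2):
    """Apply a single-threaded function to every chunk in separate processes."""
    # Ask for "spawn" explicitly instead of relying on the platform default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(zlib.crc32, chunks)


if __name__ == "__main__":
    data = [b"input1", b"input2", b"input3"]
    parallel = checksum_all(data)
    # The parallel result should match a plain serial loop.
    assert parallel == [zlib.crc32(chunk) for chunk in data]
    print(parallel)
```

The worker is an importable function, which is what `spawn` needs in order to pass it to the children, and the pool is created under the `__main__` guard because each child imports the main module again.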
## The problem with the default multiprocessing config

### Summary

The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists three start methods for creating processes:

1. spawn
2. fork
3. forkserver

The description of fork is (as of 2022-10-15):

> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
>
> Available on Unix only. The default on Unix.

The short summary is: Polars is multithreaded in order to provide strong performance out of the box. Thus, it cannot safely be combined with `fork`. If you are on Unix (Linux, BSD, etc.) and your Python version still defaults to `fork`, as described in the quote above, you are using `fork` unless you explicitly override it. (Python 3.14 changed the default on these platforms to `forkserver`.)

The reason you may not have encountered this before is that pure Python code, and most Python libraries, are (mostly) single-threaded. Alternatively, you are on Windows, where `fork` is not available at all, or on macOS, where `spawn` has been the default since Python 3.8 (`fork` was the default up to Python 3.7).

Thus one should use `spawn`, or `forkserver`, instead. `spawn` is available on all platforms and is the safest choice, and hence the recommended method.

### Example

The problem with `fork` lies in copying the parent process. Consider the example below, which is a slightly modified version of an example posted on the [Polars issue tracker](https://github.com/pola-rs/polars/issues/3144):

```python
import multiprocessing

import polars as pl


def test_sub_process(df: pl.DataFrame, job_id):
    df_filtered = df.filter(pl.col("a") > 0)
    print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")


def create_dataset():
    return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})


def setup():
    # some setup work
    df = create_dataset()
    df.write_parquet("/tmp/test.parquet")


def main():
    test_df = pl.read_parquet("/tmp/test.parquet")

    for i in range(0, 5):
        proc = multiprocessing.get_context("spawn").Process(
            target=test_sub_process, args=(test_df, i)
        )
        proc.start()
        proc.join()

        print(f"Executed sub process {i}")


if __name__ == "__main__":
    setup()
    main()
```

Using `fork` as the method, instead of `spawn`, will cause a deadlock.

The fork method is equivalent to calling `os.fork()`, which is a system call as defined in [the POSIX standard](https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html):

> A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

In contrast, `spawn` will create a completely fresh Python interpreter, and not inherit the state of mutexes.

So what happens in the code example? To read the file with `pl.read_parquet`, the file has to be locked. Then `os.fork()` is called, copying the state of the parent process, including mutexes. Thus all child processes copy the file lock in an acquired state, leaving them hanging indefinitely waiting for the file lock to be released, which never happens.

What makes debugging these issues tricky is that `fork` can work. Change the example so that it no longer calls `pl.read_parquet`:

```python
import multiprocessing

import polars as pl


def test_sub_process(df: pl.DataFrame, job_id):
    df_filtered = df.filter(pl.col("a") > 0)
    print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")


def create_dataset():
    return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})


def main():
    test_df = create_dataset()

    for i in range(0, 5):
        proc = multiprocessing.get_context("fork").Process(
            target=test_sub_process, args=(test_df, i)
        )
        proc.start()
        proc.join()

        print(f"Executed sub process {i}")


if __name__ == "__main__":
    main()
```

This works fine. Therefore debugging these issues in larger code bases, i.e. not the small toy examples here, can be a real pain, as a seemingly unrelated change can break your multiprocessing code. In general, one should therefore never use the `fork` start method with multithreaded libraries unless there are very specific requirements that cannot be met otherwise.

### Pros and cons of fork

Based on the example, you may wonder why `fork` is available in Python to start with.

First, probably because of historical reasons: `spawn` was added to Python in version 3.4, whilst `fork` has been part of Python from the 2.x series.

Second, there are several limitations for `spawn` and `forkserver` that do not apply to `fork`; in particular, all arguments must be picklable. See the [Python multiprocessing docs](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) for more information.

Third, because creating new processes with `fork` is faster than with `spawn`, as `spawn` is effectively `fork` + creating a brand new Python process without the locks by calling [execv](https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html). Hence the warning in the Python docs that it is slower: there is more overhead to `spawn`. However, in almost all cases, one would like to use multiple processes to speed up computations that take multiple minutes or even hours, meaning the overhead is negligible in the grand scheme of things. And more importantly, it actually works in combination with multithreaded libraries.

Fourth, `spawn` starts a new process, and therefore it requires code to be importable, in contrast to `fork`. In particular, this means that when using `spawn` the relevant code should not be in the global scope, such as in Jupyter notebooks or in plain scripts. Hence in the examples above, we define functions in which we spawn, and run those functions from a `__main__` clause. This is not an issue for typical projects, but during quick experimentation in notebooks it could fail.

## References

1. https://docs.python.org/3/library/multiprocessing.html
2. https://pythonspeed.com/articles/python-multiprocessing/
3. https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
4. https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html
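
The inherited-lock behaviour described in the Example section can be reproduced without Polars. The following is a minimal, Unix-only sketch using a plain `threading.Lock` as a stand-in for whatever lock the library holds internally; `child_sees_held_lock` is a hypothetical helper name, and the sketch assumes CPython on a platform that provides `os.fork`:

```python
import os
import threading


def child_sees_held_lock(timeout: float = 0.2) -> bool:
    """Fork while a second thread holds a lock; True if the child finds it stuck."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder() -> None:
        with lock:
            held.set()
            release.wait()  # keep holding the lock while the parent forks

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()  # from here on, the lock is held by the worker thread

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was replicated, so no thread here
        # can ever release the copied lock.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()  # let the worker thread in the parent finish
    worker.join()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__" and hasattr(os, "fork"):
    print("child found the lock stuck:", child_sees_held_lock())
```

The child uses a timeout only so that the demonstration terminates; code that waits on such a lock without a timeout would hang, which matches the deadlock described above. Recent CPython versions may also emit a `DeprecationWarning` when `os.fork()` is called from a multithreaded process.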
Shard: 194 (laksa)
Root Hash: 15785171083524017994
Unparsed URL: rs,pola!docs,/user-guide/misc/multiprocessing/ s443