Optimize XIData pickle fallback: up to 17x slower than ProcessPoolExecutor for mutable containers #148072
Description
Feature or enhancement
Proposal:
Motivation
The XIData pickle fallback path used by InterpreterPoolExecutor to transfer non-shareable objects (list, dict, large tuple) is 2–17x slower than ProcessPoolExecutor's direct pickle over IPC. The native path for shareable types works great — this targets only the fallback path.
I'm aware that mutable object sharing is being explored separately (PEP 797); this proposal is limited to optimizing the existing pickle codepath. I have a draft PR ready — if PEP 797 is expected to land in 3.15 and would make this moot, I'm happy to close this. Otherwise, the optimization may be worth having in the interim.
Cause
The main bottleneck is PyImport_ImportModuleAttrString("pickle", "dumps"/"loads") called on every transfer (crossinterp.c:578, crossinterp.c:639), while ProcessPoolExecutor caches the reference at startup.
Approach
I have a prototype that caches these per-interpreter, showing 1.7–3.3x speedup across all tested payloads. I'd like to submit this as a PR.
"""Reproduction: pickle fallback vs direct pickle"""
import time
from concurrent.futures import InterpreterPoolExecutor, ProcessPoolExecutor
def identity(x):
return x
def bench(cls, data, n=1000):
with cls(max_workers=1) as pool:
t0 = time.perf_counter()
for _ in range(n):
pool.submit(identity, data).result()
return (time.perf_counter() - t0) / n * 1000 # ms/call
for label, data, n in [
("list(10000)", list(range(10_000)), 1000),
("list[10x100KB]", [b"x" * 100_000] * 10, 500),
("tuple(1MB bytes)", (b"x" * 1_000_000,), 500),
]:
ti = bench(InterpreterPoolExecutor, data, n)
tp = bench(ProcessPoolExecutor, data, n)
print(f"{label:20s} Interp {ti:.2f}ms Process {tp:.2f}ms ratio {ti/tp:.1f}x")Output (Python 3.14.3, macOS Apple M2, max_workers=1):
list(10000) Interp 0.90ms Process 0.58ms ratio 1.6x
list[10x100KB] Interp 4.87ms Process 0.28ms ratio 17.4x
tuple(1MB bytes) Interp 0.13ms Process 0.78ms ratio 0.2x
The third case confirms the native path works as designed — tuple(bytes) is 6x faster than Process. But wrapping the same bytes in a list triggers the pickle fallback and becomes 17x slower: a 37x swing depending only on the outer container type.
Bottleneck: pickle module lookup on every call
_PyPickle_Dumps() (line 578) and _PyPickle_Loads() (line 639) call PyImport_ImportModuleAttrString("pickle", "dumps"/"loads") per invocation. I understand this is intentional for interpreter isolation (each interpreter has its own sys.modules), but the reference could be cached per-interpreter and invalidated on finalization.
```c
// Python/crossinterp.c:578
PyObject *dumps = PyImport_ImportModuleAttrString("pickle", "dumps");
```

Additional bottlenecks (lower priority)
Exception creation in the always-fail lookup path
For non-shareable types, _PyObject_GetXIData() (line 539) tries the type registry first, which always fails for list/dict. The failure path creates an exception object (_set_xid_lookup_failure, line 411–426), captures it (_PyErr_GetRaisedException, line 542), then discards it when pickle succeeds (line 552). This cycle runs on every transfer.
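This create/capture/discard pattern can be illustrated at the Python level (a hypothetical dict stands in for the C-level type registry; the real code paths are in crossinterp.c and crossinterp_data_lookup.h):

```python
"""Sketch of the exception churn on the always-fail registry path.

A hypothetical REGISTRY dict stands in for the C type registry; the
contents and names here are illustrative, not CPython internals.
"""
import pickle

REGISTRY = {int: "native-int", tuple: "native-tuple"}  # illustrative only


def get_xidata_current(obj):
    # Current shape: the registry lookup raises for unshareable types,
    # an exception object is materialized and captured, then thrown
    # away as soon as the pickle fallback succeeds.
    try:
        return REGISTRY[type(obj)]
    except KeyError as exc:
        captured = exc            # analogue of _PyErr_GetRaisedException()
        result = pickle.dumps(obj)
        del captured              # discarded once pickle succeeds
        return result


def get_xidata_check_first(obj):
    # Cheaper shape: probe the registry without raising at all.
    entry = REGISTRY.get(type(obj))
    if entry is not None:
        return entry
    return pickle.dumps(obj)


assert get_xidata_current([1, 2]) == get_xidata_check_first([1, 2])
```

For list/dict payloads the registry miss is guaranteed, so every transfer pays for an exception object that is never raised to the caller.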
Per-element recursive dispatch for large tuples
_tuple_shared() (crossinterp_data_lookup.h, line 642–689) calls _PyObject_GetXIData() per element, each involving a _PyXIData_New() heap allocation, recursion guards, and full type registry lookup. For tuple(10000) of ints, this means 10,000 C-level round-trips vs. one pickle.dumps() call in ProcessPoolExecutor.
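The call-count disparity can be approximated in pure Python by serializing a large tuple element-by-element versus with one pickle.dumps() call (this mimics only the number of round-trips, not the C-level _PyXIData_New() allocations or recursion guards):

```python
"""Approximate the per-element vs. one-shot serialization disparity."""
import pickle
import time

t = tuple(range(10_000))

start = time.perf_counter()
per_element = [pickle.dumps(x) for x in t]   # 10,000 separate calls
elapsed_each = time.perf_counter() - start

start = time.perf_counter()
one_shot = pickle.dumps(t)                   # single call, as ProcessPoolExecutor does
elapsed_once = time.perf_counter() - start

# Both routes preserve the data; the one-shot route makes 1 call, not 10,000.
assert tuple(pickle.loads(p) for p in per_element) == t
assert pickle.loads(one_shot) == t
```

The per-element route also loses pickle's memoization across elements, so shared structure is re-encoded rather than back-referenced.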
Full benchmark data
Container objects (1,000 iterations):
| Data | Thread | Interpreter | Process | Winner |
|---|---|---|---|---|
| int | 0.01ms | 0.05ms | 0.18ms | Interp (3.3x faster) |
| tuple(100) | 0.01ms | 0.06ms | 0.19ms | Interp (3.1x faster) |
| list(100) | 0.01ms | 0.07ms | 0.19ms | Interp (2.6x faster) |
| dict(100) | 0.01ms | 0.09ms | 0.20ms | Interp (2.2x faster) |
| tuple(10000) | 0.01ms | 0.90ms | 0.55ms | Process (1.6x faster) |
| list(10000) | 0.01ms | 0.90ms | 0.58ms | Process (1.6x faster) |
| dict(10000) | 0.01ms | 3.69ms | 2.49ms | Process (1.5x faster) |
Large payloads (500 iterations):
| Data | Interpreter | Process | Winner |
|---|---|---|---|
| list [1 x 1KB bytes] | 0.09ms | 0.22ms | Interp (2.5x faster) |
| list [1 x 100KB bytes] | 0.57ms | 0.27ms | Process (2.1x faster) |
| list [1 x 1MB bytes] | 5.49ms | 0.76ms | Process (7.2x faster) |
| tuple (1 x 1MB bytes) | 0.13ms | 0.78ms | Interp (6x faster) |
| list [10 x 100KB bytes] | 4.87ms | 0.28ms | Process (17x faster) |
| dict {10 x 100KB bytes} | 5.58ms | 0.77ms | Process (7.2x faster) |
Prototype implementation
I have a working implementation at c39755e that caches pickle.dumps/pickle.loads references in _PyXI_state_t (per-interpreter, lazily initialized, cleared on finalization). The change is +47/−6 lines in crossinterp.c and pycore_crossinterp.h.
Before/after results (identity(x) round-trip, max_workers=1, Apple M2):
| Data | Before | After | Speedup |
|---|---|---|---|
| int | 0.12ms | 0.05ms | 2.3x |
| list(100) | 0.14ms | 0.07ms | 2.0x |
| dict(100) | 0.22ms | 0.10ms | 2.2x |
| list(10000) | 3.59ms | 1.09ms | 3.3x |
| dict(10000) | 11.09ms | 4.51ms | 2.5x |
| list [10 x 100KB] | 10.27ms | 4.69ms | 2.2x |
Full before/after data
Mutable type transfer (1,000 iterations):
| Data | Before | After | Speedup |
|---|---|---|---|
| int | 0.12ms | 0.05ms | 2.3x |
| tuple(100) | 0.12ms | 0.07ms | 1.7x |
| list(100) | 0.14ms | 0.07ms | 2.0x |
| dict(100) | 0.22ms | 0.10ms | 2.2x |
| tuple(10000) | 2.39ms | 0.97ms | 2.5x |
| list(10000) | 3.59ms | 1.09ms | 3.3x |
| dict(10000) | 11.09ms | 4.51ms | 2.5x |
Large payload transfer (500 iterations):
| Data | Before | After | Speedup |
|---|---|---|---|
| list [1 x 1KB] | 0.19ms | 0.09ms | 2.1x |
| list [1 x 100KB] | 1.25ms | 0.71ms | 1.8x |
| list [1 x 1MB] | 10.58ms | 4.77ms | 2.2x |
| tuple (1 x 1MB bytes) | 0.35ms | 0.13ms | 2.7x |
| list [10 x 100KB] | 10.27ms | 4.69ms | 2.2x |
| dict {10 x 100KB} | 10.65ms | 4.89ms | 2.2x |
Tests passed: test_interpreters (172 tests), test_concurrent_futures (380 tests including test_interpreter_pool).
Related
- #124694 — InterpreterPoolExecutor addition (mentions "more efficient sharing scheme" as future work)
- #143017 — Queue polling delay (related but separate)
Tested on Python 3.14.3 (macOS Apple M2). Source references verified against main branch as of 2026-04-04.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
(none)