Optimize XIData pickle fallback: up to 17x slower than ProcessPoolExecutor for mutable containers #148072

@jrfk

Description

Feature or enhancement

Proposal:

Motivation

The XIData pickle fallback path used by InterpreterPoolExecutor to transfer non-shareable objects (list, dict, large tuple) is 1.5–17x slower than ProcessPoolExecutor's direct pickle over IPC. The native path for shareable types works great — this targets only the fallback path.

I'm aware that mutable object sharing is being explored separately (PEP 797); this proposal is limited to optimizing the existing pickle codepath. I have a draft PR ready — if PEP 797 is expected to land in 3.15 and would make this moot, I'm happy to close this. Otherwise, the optimization may be worth having in the interim.

Cause

The main bottleneck is PyImport_ImportModuleAttrString("pickle", "dumps"/"loads") called on every transfer (crossinterp.c:578, crossinterp.c:639), while ProcessPoolExecutor caches the reference at startup.

Approach

I have a prototype that caches these per-interpreter, showing 1.7–3.3x speedup across all tested payloads. I'd like to submit this as a PR.

"""Reproduction: pickle fallback vs direct pickle"""
import time
from concurrent.futures import InterpreterPoolExecutor, ProcessPoolExecutor

def identity(x):
    return x

def bench(cls, data, n=1000):
    with cls(max_workers=1) as pool:
        t0 = time.perf_counter()
        for _ in range(n):
            pool.submit(identity, data).result()
        return (time.perf_counter() - t0) / n * 1000  # ms/call

if __name__ == "__main__":  # required: ProcessPoolExecutor spawns workers that re-import this module
    for label, data, n in [
        ("list(10000)",       list(range(10_000)),    1000),
        ("list[10x100KB]",    [b"x" * 100_000] * 10,   500),
        ("tuple(1MB bytes)",  (b"x" * 1_000_000,),     500),
    ]:
        ti = bench(InterpreterPoolExecutor, data, n)
        tp = bench(ProcessPoolExecutor, data, n)
        print(f"{label:20s}  Interp {ti:.2f}ms  Process {tp:.2f}ms  ratio {ti/tp:.1f}x")

Output (Python 3.14.3, macOS Apple M2, max_workers=1):

list(10000)           Interp 0.90ms  Process 0.58ms  ratio 1.6x
list[10x100KB]        Interp 4.87ms  Process 0.28ms  ratio 17.4x
tuple(1MB bytes)      Interp 0.13ms  Process 0.78ms  ratio 0.2x

The third case confirms the native path works as designed — tuple(bytes) is 6x faster than Process. But wrapping roughly the same bytes in a list triggers the pickle fallback and becomes 17x slower than Process: a 37x swing in InterpreterPoolExecutor transfer time (0.13ms vs 4.87ms) depending only on the outer container type.
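Until the fallback is faster, callers who control their payloads can sometimes sidestep it: per the numbers above, a tuple of bytes travels on the native path while the equivalent list does not. A minimal sketch of that workaround (converting the outer container is an assumption about the caller's data, not a general fix):

```python
# The payload from the 17x case: ten 100KB bytes objects in a list.
payload = [b"x" * 100_000] * 10

# Converting only the outer container to a tuple keeps identical contents;
# per the issue's benchmark, this keeps the transfer on the fast native path.
as_tuple = tuple(payload)

assert list(as_tuple) == payload
assert isinstance(as_tuple, tuple)
```

This only helps when the receiving code does not need to mutate the container in place.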

Bottleneck: pickle module lookup on every call

_PyPickle_Dumps() (line 578) and _PyPickle_Loads() (line 639) call PyImport_ImportModuleAttrString("pickle", "dumps"/"loads") per invocation. I understand this is intentional for interpreter isolation (each interpreter has its own sys.modules), but the reference could be cached per-interpreter and invalidated on finalization.

// Python/crossinterp.c:578
PyObject *dumps = PyImport_ImportModuleAttrString("pickle", "dumps");
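A pure-Python analogue of the two patterns may make the cost clearer (the helper names below are illustrative, not CPython API): resolving pickle.dumps through the import machinery on every call, versus holding a cached reference.

```python
import importlib
import pickle

def dumps_uncached(obj):
    # Analogue of the current _PyPickle_Dumps flow: resolve the module
    # attribute on every call. importlib.import_module consults sys.modules
    # each time, and the attribute lookup is repeated per transfer.
    dumps = getattr(importlib.import_module("pickle"), "dumps")
    return dumps(obj)

_CACHED_DUMPS = pickle.dumps  # analogue of the proposed per-interpreter cache

def dumps_cached(obj):
    # The cached reference skips the import-system round-trip entirely.
    return _CACHED_DUMPS(obj)

data = list(range(100))
assert dumps_uncached(data) == dumps_cached(data)  # identical output either way
```

The output is byte-for-byte identical; only the per-call overhead differs.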

Additional bottlenecks (lower priority)

Exception creation in the always-fail lookup path

For non-shareable types, _PyObject_GetXIData() (line 539) tries the type registry first, which always fails for list/dict. The failure path creates an exception object (_set_xid_lookup_failure, line 411–426), captures it (_PyErr_GetRaisedException, line 542), then discards it when pickle succeeds (line 552). This cycle runs on every transfer.
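The cost pattern can be sketched in Python (the registry dict and function names here are hypothetical stand-ins, not the C internals): an EAFP-style lookup materializes and discards an exception object on every miss, whereas a probe-style lookup does not.

```python
# Hypothetical stand-in for the XIData type registry; list/dict never register.
_registry = {}

def lookup_eafp(obj):
    # Mirrors the current flow: the miss raises KeyError, so an exception
    # object is created, captured, and later discarded once pickle succeeds.
    try:
        return _registry[type(obj)]
    except KeyError:
        return None  # fall through to the pickle fallback

def lookup_probe(obj):
    # A miss returns None without ever constructing an exception object.
    return _registry.get(type(obj))

assert lookup_eafp([1, 2]) is None
assert lookup_probe([1, 2]) is None
```

Both return the same result for unregistered types; only the first pays the exception-construction cost per transfer.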

Per-element recursive dispatch for large tuples

_tuple_shared() (crossinterp_data_lookup.h, line 642–689) calls _PyObject_GetXIData() per element, each involving a _PyXIData_New() heap allocation, recursion guards, and full type registry lookup. For tuple(10000) of ints, this means 10,000 C-level round-trips vs. one pickle.dumps() call in ProcessPoolExecutor.
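The difference in call counts is easy to see from Python (a sketch of the shape of the work, not the C internals): serializing a 10,000-element tuple element-by-element performs 10,000 separate operations, while ProcessPoolExecutor's path does one.

```python
import pickle

data = tuple(range(10_000))

# ProcessPoolExecutor-style: one serialization of the whole container.
one_shot = pickle.dumps(data)

# _tuple_shared-style: one dispatch per element. Here each dispatch is
# modeled as a pickle.dumps call; the real code instead does a registry
# lookup, a _PyXIData_New heap allocation, and a recursion check per element.
per_element = [pickle.dumps(item) for item in data]

assert len(per_element) == 10_000
# Reassembling the per-element results recovers the same tuple.
assert tuple(pickle.loads(b) for b in per_element) == data
```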

Full benchmark data

Container objects (1,000 iterations):

Data           Thread   Interpreter  Process  Winner
int            0.01ms   0.05ms       0.18ms   Interp (3.3x faster)
tuple(100)     0.01ms   0.06ms       0.19ms   Interp (3.1x faster)
list(100)      0.01ms   0.07ms       0.19ms   Interp (2.6x faster)
dict(100)      0.01ms   0.09ms       0.20ms   Interp (2.2x faster)
tuple(10000)   0.01ms   0.90ms       0.55ms   Process (1.6x faster)
list(10000)    0.01ms   0.90ms       0.58ms   Process (1.6x faster)
dict(10000)    0.01ms   3.69ms       2.49ms   Process (1.5x faster)

Large payloads (500 iterations):

Data                     Interpreter  Process  Winner
list [1 x 1KB bytes]     0.09ms       0.22ms   Interp (2.5x faster)
list [1 x 100KB bytes]   0.57ms       0.27ms   Process (2.1x faster)
list [1 x 1MB bytes]     5.49ms       0.76ms   Process (7.2x faster)
tuple (1 x 1MB bytes)    0.13ms       0.78ms   Interp (6x faster)
list [10 x 100KB bytes]  4.87ms       0.28ms   Process (17x faster)
dict {10 x 100KB bytes}  5.58ms       0.77ms   Process (7.2x faster)

Prototype implementation

I have a working implementation at c39755e that caches pickle.dumps/pickle.loads references in _PyXI_state_t (per-interpreter, lazily initialized, cleared on finalization). The change is +47/−6 lines in crossinterp.c and pycore_crossinterp.h.
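A Python-level sketch of the caching scheme (the class and method names are illustrative; the real change lives in `_PyXI_state_t` in C): resolve `pickle.dumps`/`pickle.loads` lazily on first use, reuse the references on later transfers, and drop them at finalization.

```python
import pickle

class XIDataPickleCache:
    """Illustrative model of the per-interpreter cache (not CPython API)."""

    def __init__(self):
        self._dumps = None
        self._loads = None

    def dumps(self, obj):
        if self._dumps is None:        # lazy init on first fallback transfer
            self._dumps = pickle.dumps
        return self._dumps(obj)

    def loads(self, data):
        if self._loads is None:
            self._loads = pickle.loads
        return self._loads(data)

    def clear(self):
        # Mirrors invalidation at interpreter finalization.
        self._dumps = None
        self._loads = None

cache = XIDataPickleCache()
assert cache.loads(cache.dumps([1, 2, 3])) == [1, 2, 3]
cache.clear()
assert cache._dumps is None
```

In the C version each interpreter resolves the references against its own sys.modules, so isolation is preserved; only the repeated lookup is skipped.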

Before/after results (identity(x) round-trip, max_workers=1, Apple M2):

Data               Before   After   Speedup
int                0.12ms   0.05ms  2.3x
list(100)          0.14ms   0.07ms  2.0x
dict(100)          0.22ms   0.10ms  2.2x
list(10000)        3.59ms   1.09ms  3.3x
dict(10000)        11.09ms  4.51ms  2.5x
list [10 x 100KB]  10.27ms  4.69ms  2.2x

Full before/after data

Mutable type transfer (1,000 iterations):

Data           Before   After   Speedup
int            0.12ms   0.05ms  2.3x
tuple(100)     0.12ms   0.07ms  1.7x
list(100)      0.14ms   0.07ms  2.0x
dict(100)      0.22ms   0.10ms  2.2x
tuple(10000)   2.39ms   0.97ms  2.5x
list(10000)    3.59ms   1.09ms  3.3x
dict(10000)    11.09ms  4.51ms  2.5x

Large payload transfer (500 iterations):

Data                   Before   After   Speedup
list [1 x 1KB]         0.19ms   0.09ms  2.1x
list [1 x 100KB]       1.25ms   0.71ms  1.8x
list [1 x 1MB]         10.58ms  4.77ms  2.2x
tuple (1 x 1MB bytes)  0.35ms   0.13ms  2.7x
list [10 x 100KB]      10.27ms  4.69ms  2.2x
dict {10 x 100KB}      10.65ms  4.89ms  2.2x

Tests passed: test_interpreters (172 tests), test_concurrent_futures (380 tests including test_interpreter_pool).

Related

  • #124694 — InterpreterPoolExecutor addition (mentions "more efficient sharing scheme" as future work)
  • #143017 — Queue polling delay (related but separate)

Tested on Python 3.14.3 (macOS Apple M2). Source references verified against main branch as of 2026-04-04.


Has this already been discussed elsewhere?

This is a minor optimization that does not need prior discussion elsewhere.

Links to previous discussion of this feature:

(none)

Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)
