Optimize XIData pickle fallback: up to 17x slower than ProcessPoolExecutor for mutable containers #148072
Description
Feature or enhancement
Proposal:
Motivation
The XIData pickle fallback path used by InterpreterPoolExecutor to transfer non-shareable objects (list, dict, large tuple) is 2–17x slower than ProcessPoolExecutor's direct pickle over IPC. The native path for shareable types works great — this targets only the fallback path.
I'm aware that mutable object sharing is being explored separately (PEP 797); this proposal is limited to optimizing the existing pickle codepath. I have a draft PR ready — if PEP 797 is expected to land in 3.15 and would make this moot, I'm happy to close this. Otherwise, the optimization may be worth having in the interim.
Cause
The main bottleneck is PyImport_ImportModuleAttrString("pickle", "dumps"/"loads") called on every transfer (crossinterp.c:578, crossinterp.c:639), while ProcessPoolExecutor caches the reference at startup.
Approach
I have a prototype that caches these per-interpreter, showing 1.7–3.3x speedup across all tested payloads. I'd like to submit this as a PR.
"""Reproduction: pickle fallback vs direct pickle"""
import time
from concurrent.futures import InterpreterPoolExecutor, ProcessPoolExecutor
def identity(x):
return x
def bench(cls, data, n=1000):
with cls(max_workers=1) as pool:
t0 = time.perf_counter()
for _ in range(n):
pool.submit(identity, data).result()
return (time.perf_counter() - t0) / n * 1000 # ms/call
for label, data, n in [
("list(10000)", list(range(10_000)), 1000),
("list[10x100KB]", [b"x" * 100_000] * 10, 500),
("tuple(1MB bytes)", (b"x" * 1_000_000,), 500),
]:
ti = bench(InterpreterPoolExecutor, data, n)
tp = bench(ProcessPoolExecutor, data, n)
print(f"{label:20s} Interp {ti:.2f}ms Process {tp:.2f}ms ratio {ti/tp:.1f}x")Output (Python 3.14.3, macOS Apple M2, max_workers=1):
list(10000) Interp 0.90ms Process 0.58ms ratio 1.6x
list[10x100KB] Interp 4.87ms Process 0.28ms ratio 17.4x
tuple(1MB bytes) Interp 0.13ms Process 0.78ms ratio 0.2x
The third case confirms the native path works as designed — tuple(bytes) is 6x faster than Process. But wrapping the same bytes in a list triggers the pickle fallback and becomes 17x slower: a 37x swing depending only on the outer container type.
Bottleneck: pickle module lookup on every call
_PyPickle_Dumps() (line 578) and _PyPickle_Loads() (line 639) call PyImport_ImportModuleAttrString("pickle", "dumps"/"loads") per invocation. I understand this is intentional for interpreter isolation (each interpreter has its own sys.modules), but the reference could be cached per-interpreter and invalidated on finalization.
```c
// Python/crossinterp.c:578
PyObject *dumps = PyImport_ImportModuleAttrString("pickle", "dumps");
```

Additional bottlenecks (lower priority)
Exception creation in the always-fail lookup path
For non-shareable types, _PyObject_GetXIData() (line 539) tries the type registry first, which always fails for list/dict. The failure path creates an exception object (_set_xid_lookup_failure, line 411–426), captures it (_PyErr_GetRaisedException, line 542), then discards it when pickle succeeds (line 552). This cycle runs on every transfer.
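This create/capture/discard pattern can be illustrated at the Python level (a hypothetical dict stands in for the C-level type registry; the real code paths are in crossinterp.c and crossinterp_data_lookup.h):

```python
"""Sketch of the exception churn on the always-fail registry path.

A hypothetical REGISTRY dict stands in for the C type registry; the
contents and names here are illustrative, not CPython internals.
"""
import pickle

REGISTRY = {int: "native-int", tuple: "native-tuple"}  # illustrative only


def get_xidata_current(obj):
    # Current shape: the registry lookup raises for unshareable types,
    # an exception object is materialized and captured, then thrown
    # away as soon as the pickle fallback succeeds.
    try:
        return REGISTRY[type(obj)]
    except KeyError as exc:
        captured = exc            # analogue of _PyErr_GetRaisedException()
        result = pickle.dumps(obj)
        del captured              # discarded once pickle succeeds
        return result


def get_xidata_check_first(obj):
    # Cheaper shape: probe the registry without raising at all.
    entry = REGISTRY.get(type(obj))
    if entry is not None:
        return entry
    return pickle.dumps(obj)


assert get_xidata_current([1, 2]) == get_xidata_check_first([1, 2])
```

For list/dict payloads the registry miss is guaranteed, so every transfer pays for an exception object that is never raised to the caller.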
Per-element recursive dispatch for large tuples
_tuple_shared() (crossinterp_data_lookup.h, line 642–689) calls _PyObject_GetXIData() per element, each involving a _PyXIData_New() heap allocation, recursion guards, and full type registry lookup. For tuple(10000) of ints, this means 10,000 C-level round-trips vs. one pickle.dumps() call in ProcessPoolExecutor.
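The call-count disparity can be approximated in pure Python by serializing a large tuple element-by-element versus with one pickle.dumps() call (this mimics only the number of round-trips, not the C-level _PyXIData_New() allocations or recursion guards):

```python
"""Approximate the per-element vs. one-shot serialization disparity."""
import pickle
import time

t = tuple(range(10_000))

start = time.perf_counter()
per_element = [pickle.dumps(x) for x in t]   # 10,000 separate calls
elapsed_each = time.perf_counter() - start

start = time.perf_counter()
one_shot = pickle.dumps(t)                   # single call, as ProcessPoolExecutor does
elapsed_once = time.perf_counter() - start

# Both routes preserve the data; the one-shot route makes 1 call, not 10,000.
assert tuple(pickle.loads(p) for p in per_element) == t
assert pickle.loads(one_shot) == t
```

The per-element route also loses pickle's memoization across elements, so shared structure is re-encoded rather than back-referenced.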
Full benchmark data
Container objects (1,000 iterations):
| Data | Thread | Interpreter | Process | Winner |
|---|---|---|---|---|
| int | 0.01ms | 0.05ms | 0.18ms | Interp (3.3x faster) |
| tuple(100) | 0.01ms | 0.06ms | 0.19ms | Interp (3.1x faster) |
| list(100) | 0.01ms | 0.07ms | 0.19ms | Interp (2.6x faster) |
| dict(100) | 0.01ms | 0.09ms | 0.20ms | Interp (2.2x faster) |
| tuple(10000) | 0.01ms | 0.90ms | 0.55ms | Process (1.6x faster) |
| list(10000) | 0.01ms | 0.90ms | 0.58ms | Process (1.6x faster) |
| dict(10000) | 0.01ms | 3.69ms | 2.49ms | Process (1.5x faster) |
Large payloads (500 iterations):
| Data | Interpreter | Process | Winner |
|---|---|---|---|
| list [1 x 1KB bytes] | 0.09ms | 0.22ms | Interp (2.5x faster) |
| list [1 x 100KB bytes] | 0.57ms | 0.27ms | Process (2.1x faster) |
| list [1 x 1MB bytes] | 5.49ms | 0.76ms | Process (7.2x faster) |
| tuple (1 x 1MB bytes) | 0.13ms | 0.78ms | Interp (6x faster) |
| list [10 x 100KB bytes] | 4.87ms | 0.28ms | Process (17x faster) |
| dict {10 x 100KB bytes} | 5.58ms | 0.77ms | Process (7.2x faster) |
Prototype implementation
I have a working implementation at c39755e that caches pickle.dumps/pickle.loads references in _PyXI_state_t (per-interpreter, lazily initialized, cleared on finalization). The change is +47/−6 lines in crossinterp.c and pycore_crossinterp.h.
Before/after results (identity(x) round-trip, max_workers=1, Apple M2):
| Data | Before | After | Speedup |
|---|---|---|---|
| int | 0.12ms | 0.05ms | 2.3x |
| list(100) | 0.14ms | 0.07ms | 2.0x |
| dict(100) | 0.22ms | 0.10ms | 2.2x |
| list(10000) | 3.59ms | 1.09ms | 3.3x |
| dict(10000) | 11.09ms | 4.51ms | 2.5x |
| list [10 x 100KB] | 10.27ms | 4.69ms | 2.2x |
Full before/after data
Mutable type transfer (1,000 iterations):
| Data | Before | After | Speedup |
|---|---|---|---|
| int | 0.12ms | 0.05ms | 2.3x |
| tuple(100) | 0.12ms | 0.07ms | 1.7x |
| list(100) | 0.14ms | 0.07ms | 2.0x |
| dict(100) | 0.22ms | 0.10ms | 2.2x |
| tuple(10000) | 2.39ms | 0.97ms | 2.5x |
| list(10000) | 3.59ms | 1.09ms | 3.3x |
| dict(10000) | 11.09ms | 4.51ms | 2.5x |
Large payload transfer (500 iterations):
| Data | Before | After | Speedup |
|---|---|---|---|
| list [1 x 1KB] | 0.19ms | 0.09ms | 2.1x |
| list [1 x 100KB] | 1.25ms | 0.71ms | 1.8x |
| list [1 x 1MB] | 10.58ms | 4.77ms | 2.2x |
| tuple (1 x 1MB bytes) | 0.35ms | 0.13ms | 2.7x |
| list [10 x 100KB] | 10.27ms | 4.69ms | 2.2x |
| dict {10 x 100KB} | 10.65ms | 4.89ms | 2.2x |
Tests passed: test_interpreters (172 tests), test_concurrent_futures (380 tests including test_interpreter_pool).
Related
- #124694 — InterpreterPoolExecutor addition (mentions "more efficient sharing scheme" as future work)
- #143017 — Queue polling delay (related but separate)
Tested on Python 3.14.3 (macOS Apple M2). Source references verified against main branch as of 2026-04-04.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
(none)