[WIP] Replace chunked FLA with recurrent gated delta rule for T=1 decode #18667
Conversation
The chunked FLA pipeline (6 Triton kernels) is overkill for T=1 decode. Replace it with plain PyTorch einsum ops that Inductor can fuse:

- FLA GPU time: 1.085 ms → 0.344 ms/step (-68%)
- Total GPU time: 12.0 ms → 9.0 ms/step (-25%)
- Export changed to static T=1 with `enable_dynamic_shape=False`
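For reference, a minimal sketch of what a single-step (T=1) gated delta rule update looks like as einsum ops Inductor can fuse. The helper name `gated_delta_step`, the shapes, and the `[B, H, K, V]` state layout are assumptions for illustration, not the PR's actual code:

```python
import torch

def gated_delta_step(q, k, v, g, beta, S):
    """One decode step of the gated delta rule (illustrative sketch).

    q, k : [B, H, K]     query/key for the new token
    v    : [B, H, V]     value for the new token
    g    : [B, H]        log decay gate (state is scaled by exp(g))
    beta : [B, H]        write strength
    S    : [B, H, K, V]  recurrent key->value state
    """
    S = S * g.exp()[..., None, None]                  # decay the old state
    pred = torch.einsum("bhk,bhkv->bhv", k, S)        # what S currently stores for k
    delta = beta[..., None] * (v - pred)              # gated prediction error
    S = S + torch.einsum("bhk,bhv->bhkv", k, delta)   # rank-1 delta update
    o = torch.einsum("bhk,bhkv->bhv", q, S)           # read out with the query
    return o, S
```

With `g = 0`, `beta = 1`, and a unit-norm key, the updated state stores `v` exactly under `k`, which is the delta rule's defining property.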
❌ 3 New Failures, 2 Unrelated Failures as of commit 7dd4280 with merge base 300e368.

NEW FAILURES: the following jobs have failed.
BROKEN TRUNK: the following jobs failed but were also present on the merge base; rebase onto the `viable/strict` branch to avoid these failures.
examples/models/qwen3_5_moe/model.py (Outdated)

```python
q, k, v, g, beta, self.recurrent_state[:B]
# Recurrent gated delta rule — single-step update.
# The model is exported with static T=1 and the C++ runner does
# token-by-token prefill (enable_dynamic_shape=False), so T is
```
Any impact on prefill performance?
It roughly halves prefill performance. An ongoing fix will make prefill use the chunked implementation while decode uses the recurrent one.
Move decode/prefill dispatch inside the `chunk_gated_delta_rule` triton_op instead of using `torch.cond` at model level. This follows the same pattern as the SDPA triton_op (pow2/non-pow2 dispatch) and avoids the incompatibility between `torch.cond` and AOTI's FunctionalTensor pipeline.

Changes:
- `chunk_gated_delta_rule.py`: add a fused recurrent Triton kernel for T=1, refactor the chunked pipeline into `_launch_chunked()`, and dispatch via a Python `if` inside the `@triton_op` wrapper
- `model.py`: remove `torch.cond` from `GatedDeltaNet.forward()` and call the triton_op directly (dispatch is internal)
- `export.py`: single-method export with a dynamic seq_len dim
- `main.cpp`: fix the `create_text_llm_runner` API signature
Only `chunk_gated_delta_rule.py` needs modification: the dispatch logic is internal to the triton_op, so no model/export/runner changes are needed.
- test_recurrent_t1: verify T=1 recurrent kernel against FLA naive
reference across all FLA test configs
- test_dispatch_multiple_seq_lengths: verify correctness for
T in {1, 2, 32, 63, 64, 65, 128, 256}, covering both dispatch
paths and chunk boundary edge cases
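A sketch of the chunk-boundary property these tests exercise, using a naive sequential recurrence as its own reference: running a full sequence at once must match running it split at a boundary with the state carried over. The function names, shapes, and tolerances here are assumptions, not the PR's actual test code:

```python
import torch

def naive_gated_delta(q, k, v, g, beta, S):
    # Naive sequential reference: one gated delta rule step per token.
    # q, k: [B,H,T,K]; v: [B,H,T,V]; g, beta: [B,H,T]; S: [B,H,K,V]
    outs = []
    for t in range(q.shape[2]):
        S = S * g[:, :, t].exp()[..., None, None]
        err = beta[:, :, t, None] * (
            v[:, :, t] - torch.einsum("bhk,bhkv->bhv", k[:, :, t], S))
        S = S + torch.einsum("bhk,bhv->bhkv", k[:, :, t], err)
        outs.append(torch.einsum("bhk,bhkv->bhv", q[:, :, t], S))
    return torch.stack(outs, dim=2), S

def check_split(T, split, B=2, H=2, K=4, V=8):
    # Full-sequence run must equal two runs split at `split` with carried state.
    torch.manual_seed(0)
    q, k = torch.randn(B, H, T, K), torch.randn(B, H, T, K)
    v = torch.randn(B, H, T, V)
    g, beta = -torch.rand(B, H, T), torch.rand(B, H, T)
    S0 = torch.zeros(B, H, K, V)
    o_full, S_full = naive_gated_delta(q, k, v, g, beta, S0)
    o_a, S_mid = naive_gated_delta(q[:, :, :split], k[:, :, :split],
                                   v[:, :, :split], g[:, :, :split],
                                   beta[:, :, :split], S0)
    o_b, S_end = naive_gated_delta(q[:, :, split:], k[:, :, split:],
                                   v[:, :, split:], g[:, :, split:],
                                   beta[:, :, split:], S_mid)
    assert torch.allclose(torch.cat([o_a, o_b], dim=2), o_full, atol=1e-4)
    assert torch.allclose(S_end, S_full, atol=1e-4)

for T in (2, 32, 63, 64, 65, 128):
    check_split(T, split=T // 2)
```

Sweeping T across values just below, at, and above 64 is what catches off-by-one errors at the chunk boundary, since both dispatch paths must agree on the carried state.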
- Grid changed from `(B*H,)` to `(V//BV, B*H)`: 4x more blocks, better SM occupancy (128 blocks vs 32 on A100)
- BV reduced from 128 to 32: lower register pressure, no spilling
- Removed unnecessary `.contiguous()` copies on squeezed inputs
- Removed debug print from triton_op dispatch
- GPU kernel time: 6 us (3.47x faster than Inductor-fused native ops)
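The occupancy arithmetic behind the grid change, as plain Python. The concrete `B`, `H`, `V`, and `BV` values are assumptions chosen to reproduce the 32-vs-128 block counts quoted above:

```python
import math

B, H = 8, 4      # assumed batch/head split; only B*H = 32 is implied above
V, BV = 128, 32  # assumed value head dim and the new per-program tile along V

old_grid = (B * H,)          # one program per (batch, head) pair
new_grid = (V // BV, B * H)  # also parallelize over tiles of the V dimension

print(math.prod(old_grid), "->", math.prod(new_grid))  # 32 -> 128 blocks
```

With only 32 blocks, most of an A100's 108 SMs sit idle; splitting the V dimension into tiles multiplies the block count without changing the total work.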
The chunked FLA pipeline (6 Triton kernels) is overkill for T=1 decode. Replaced it with plain PyTorch einsum ops that Inductor can fuse. Local benchmark: decode improves from 77.7 token/s to 89.9 token/s, but prefill performance regresses; a fix for the prefill regression is still in progress.