Fix WebGPU Windows CI timeout by removing redundant test and sharding provider tests by tianleiwu · Pull Request #27953 · microsoft/onnxruntime

tianleiwu · 2026-04-02T16:21:23Z

The webgpu_build_x64_RelWithDebInfo CI job has been failing for about a week, with the runner becoming unresponsive during the verbose-logging "Run tests" step (likely around ActivationOpTest.Elu). The probable cause is GPU resource exhaustion from hundreds of tests sharing a single WebGPU context, which crashes the runner or makes it unable to communicate with the CI job controller. This PR removes the redundant onnxruntime_test_all.exe invocation from that step (already covered by CTest in "Build and Test"), and shards onnxruntime_provider_test.exe across 4 GTest shards to prevent Dawn/WebGPU GPU resource exhaustion.

Summary of Changes

CI Workflow (`windows_webgpu.yml`)

| # | Change |
|---|--------|
| 1 | Removed `onnxruntime_test_all.exe` from the verbose-logging step; it was redundant with CTest in the preceding "Build and Test" step. |
| 2 | Split `onnxruntime_provider_test.exe` into 4 GTest shards (`GTEST_TOTAL_SHARDS` / `GTEST_SHARD_INDEX`) so each shard runs in its own process with an isolated WebGPU context, preventing GPU staging-buffer accumulation. |
| 3 | Added `timeout-minutes: 60` safety net on the verbose-logging step to prevent future hangs from blocking the entire job. |
| 4 | Added per-shard exit-code checking so a failed shard fails the step immediately. |
| 5 | Renamed step to clarify its purpose: shader key validation via verbose logging. |
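
The sharding and per-shard exit-code check in rows 2 and 4 can be sketched roughly as follows. This is a minimal Python illustration, not the script used in `windows_webgpu.yml` (the workflow contents are not shown here), and `run_shards` and its parameters are hypothetical names. It relies only on the `GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX` environment variables that GoogleTest reads.

```python
import os
import subprocess
import sys


def run_shards(command, total_shards):
    """Run `command` once per shard, each in a fresh process.

    Returns the index of the first failing shard, or None if all shards pass.
    (Hypothetical helper for illustration only.)
    """
    for index in range(total_shards):
        env = dict(os.environ)
        # GoogleTest uses these two variables to pick the subset of tests to run.
        env["GTEST_TOTAL_SHARDS"] = str(total_shards)
        env["GTEST_SHARD_INDEX"] = str(index)
        # A new process per shard means GPU resources are released between shards.
        result = subprocess.run(command, env=env)
        if result.returncode != 0:
            # Stop at the first failing shard so the step fails immediately.
            print(f"shard {index} failed with exit code {result.returncode}",
                  file=sys.stderr)
            return index
    return None
```

Under these assumptions, a call such as `run_shards(["onnxruntime_provider_test.exe"], 4)` would launch the test binary four times in sequence, and the caller could exit non-zero whenever the return value is not `None`.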

Testing

No functional test coverage is lost: onnxruntime_test_all.exe continues to run via CTest in the "Build and Test" step.
Shader key validation is preserved: onnxruntime_provider_test.exe still runs with ORT_UNIT_TEST_MAIN_LOG_LEVEL=0, producing the same verbose WebGPU shader compilation logs (stderr) consumed by the subsequent "Check log file" step.
Verification: Confirm the webgpu_build_x64_RelWithDebInfo job completes successfully and the shader key validation step passes.
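
To see why sharding should not drop any provider tests, it may help to look at how GoogleTest splits a run. To my understanding, it numbers the tests selected to run and gives test `i` to the shard whose index equals `i % GTEST_TOTAL_SHARDS`. The sketch below mimics that rule in Python. `shard_tests` is a hypothetical helper, and the test names are illustrative (only `ActivationOpTest.Elu` is named verbatim in this PR).

```python
def shard_tests(tests, total_shards, shard_index):
    """Return the tests one shard would run under a modulo assignment."""
    return [name for i, name in enumerate(tests)
            if i % total_shards == shard_index]


# Illustrative names only; the real suite contains hundreds of tests.
tests = [
    "ActivationOpTest.Sigmoid",
    "ActivationOpTest.HardSigmoid",
    "ActivationOpTest.Tanh",
    "ActivationOpTest.Relu",
    "ActivationOpTest.Elu",
]

shards = [shard_tests(tests, 4, k) for k in range(4)]

# Every test lands in exactly one shard, so the union of shards is the full list.
assert sorted(name for shard in shards for name in shard) == sorted(tests)
```

Because the shards are disjoint and together cover the whole list, concatenating the verbose logs from all shards should contain the same shader compilation entries as a single unsharded run, assuming each test logs the same way in either mode.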

Motivation and Context

PR #26907 (commit 99c5dd8839, Mar 24) introduced a separate "Run tests" step to capture verbose WebGPU shader compilation logs for shader key validation. However, it unnecessarily ran the full onnxruntime_test_all.exe in the same step as onnxruntime_provider_test.exe. Within a single test process, hundreds of tests share one WebGPU context. After ~10 activation test groups (Sigmoid, HardSigmoid, Tanh, Relu, etc.), the accumulated GPU resources exhausted the GPU staging buffers used by Dawn's MapAsync, likely around ActivationOpTest.Elu (100,000-element input). This resource exhaustion appears to crash the runner or make it unresponsive, causing it to lose communication with the CI job controller rather than simply timing out. Sharding runs each shard in its own process, so GPU resources are released between shards.

edgchen1 · 2026-04-02T16:29:11Z

> When onnxruntime_test_all.exe runs all tests sequentially in one process, hundreds of tests share a single WebGPU context. Each test creates sessions, compiles shaders, and allocates GPU staging buffers for output verification (MapAsync). After ~10 activation test groups (Sigmoid, HardSigmoid, Tanh, Relu, etc.), accumulated GPU resources cause Dawn's MapAsync to hang when processing ActivationOpTest.Elu with its 100,000-element input vector, leading to the job exceeding its 300-minute timeout.

I don't think the failures have actually been exceeding a 300-minute (5-hour) timeout. Rather, something else is going on that causes the connection to the runner to be lost.

But let's see if this change makes it better.

update tests in webgpu CI

db23057

tianleiwu requested review from edgchen1 and fs-eire April 2, 2026 16:21

4 shards

ef8eb85

tianleiwu changed the title ~~Fix WebGPU Windows CI timeout by removing redundant onnxruntime_test_all.exe from verbose logging step~~ Fix WebGPU Windows CI timeout by removing redundant test and sharding provider tests Apr 2, 2026

2 shards

5d0ce28

tianleiwu wants to merge 3 commits into main from tlwu/20260402/fix_webgpu_ci_timeout
