Fix WebGPU Windows CI timeout by removing redundant test and sharding provider tests #27953

Open
tianleiwu wants to merge 3 commits into main from tlwu/20260402/fix_webgpu_ci_timeout

@tianleiwu tianleiwu commented Apr 2, 2026

The webgpu_build_x64_RelWithDebInfo CI job has been failing for about a week, with the runner becoming unresponsive during the verbose-logging "Run tests" step (likely around ActivationOpTest.Elu). The probable cause is GPU resource exhaustion from hundreds of tests sharing a single WebGPU context, which crashes the runner or makes it unable to communicate with the CI job controller. This PR removes the redundant onnxruntime_test_all.exe invocation from that step (already covered by CTest in "Build and Test"), and shards onnxruntime_provider_test.exe across 4 GTest shards to prevent Dawn/WebGPU GPU resource exhaustion.

Summary of Changes

CI Workflow (windows_webgpu.yml)

| # | Change |
|---|--------|
| 1 | Removed onnxruntime_test_all.exe from the verbose-logging step; it was redundant with CTest in the preceding "Build and Test" step. |
| 2 | Split onnxruntime_provider_test.exe into 4 GTest shards (GTEST_TOTAL_SHARDS / GTEST_SHARD_INDEX) so each shard runs in its own process with an isolated WebGPU context, preventing GPU staging-buffer accumulation. |
| 3 | Added timeout-minutes: 60 safety net on the verbose-logging step to prevent future hangs from blocking the entire job. |
| 4 | Added per-shard exit-code checking so a failed shard fails the step immediately. |
| 5 | Renamed step to clarify its purpose: shader key validation via verbose logging. |
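
For readers unfamiliar with GoogleTest sharding, changes 2 and 4 can be sketched roughly as follows. This is a minimal Python illustration, not the workflow script from this PR (which lives in windows_webgpu.yml and is not reproduced here). The function name `run_sharded` and its return convention are hypothetical; only the `GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX` environment variables are standard GoogleTest.

```python
import os
import subprocess


def run_sharded(cmd, total_shards=4, extra_env=None):
    """Run cmd once per GTest shard, each in a fresh process.

    Returns (failed_shard_index, exit_code) for the first failing shard,
    or (None, 0) if every shard exits with code 0.
    Hypothetical helper: the name and return shape are illustrative only.
    """
    for index in range(total_shards):
        env = dict(os.environ)
        env.update(extra_env or {})
        # GoogleTest reads these two variables to select a subset of tests.
        env["GTEST_TOTAL_SHARDS"] = str(total_shards)
        env["GTEST_SHARD_INDEX"] = str(index)
        # A new process per shard means process-level state (here, the
        # WebGPU context) is torn down before the next shard starts.
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            # Stop at the first failing shard so the step fails immediately.
            return index, result.returncode
    return None, 0
```

A caller could then do something like `run_sharded(["onnxruntime_provider_test.exe"], 4, {"ORT_UNIT_TEST_MAIN_LOG_LEVEL": "0"})` and exit with the returned code when it is nonzero.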

Testing

  • No functional test coverage is lost: onnxruntime_test_all.exe continues to run via CTest in the "Build and Test" step.
  • Shader key validation is preserved: onnxruntime_provider_test.exe still runs with ORT_UNIT_TEST_MAIN_LOG_LEVEL=0, producing the same verbose WebGPU shader compilation logs (stderr) consumed by the subsequent "Check log file" step.
  • Verification: Confirm the webgpu_build_x64_RelWithDebInfo job completes successfully and the shader key validation step passes.
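
The second bullet depends on the verbose stderr output of every shard ending up where the "Check log file" step can read it. A rough sketch of that idea, assuming the shards append to one shared log file, is shown below. The helper name `run_shard_logged` and the log path are hypothetical; the actual redirection in the workflow may differ.

```python
import os
import subprocess


def run_shard_logged(cmd, shard_index, total_shards, log_path):
    """Run one GTest shard and append its stderr to log_path.

    Returns the shard's exit code. Hypothetical helper for illustration.
    """
    env = dict(os.environ)
    env["GTEST_TOTAL_SHARDS"] = str(total_shards)
    env["GTEST_SHARD_INDEX"] = str(shard_index)
    # Log level 0 is the verbose setting named in the PR description.
    env["ORT_UNIT_TEST_MAIN_LOG_LEVEL"] = "0"
    # Open in append mode so output from earlier shards is preserved.
    with open(log_path, "ab") as log:
        return subprocess.run(cmd, env=env, stderr=log).returncode
```

Appending rather than overwriting keeps the logs from all shards in one file, so a later check sees the same kind of output a single unsharded run would have produced.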

Motivation and Context

PR #26907 (commit 99c5dd8839, Mar 24) introduced a separate "Run tests" step to capture verbose WebGPU shader compilation logs for shader key validation. However, that step also ran the full onnxruntime_test_all.exe alongside onnxruntime_provider_test.exe, which was unnecessary. After ~10 activation test groups (Sigmoid, HardSigmoid, Tanh, Relu, etc.), GPU resources accumulated by hundreds of tests sharing a single WebGPU context appear to exhaust the GPU staging buffers used by Dawn's MapAsync, likely around ActivationOpTest.Elu (100,000-element input). This resource exhaustion appears to crash the runner or make it unresponsive, so it loses communication with the CI job controller rather than simply timing out. Sharding runs each shard in its own process, so GPU resources are released between shards.

@tianleiwu tianleiwu requested review from edgchen1 and fs-eire April 2, 2026 16:21

edgchen1 commented Apr 2, 2026

> When onnxruntime_test_all.exe runs all tests sequentially in one process, hundreds of tests share a single WebGPU context. Each test creates sessions, compiles shaders, and allocates GPU staging buffers for output verification (MapAsync). After ~10 activation test groups (Sigmoid, HardSigmoid, Tanh, Relu, etc.), accumulated GPU resources cause Dawn's MapAsync to hang when processing ActivationOpTest.Elu with its 100,000-element input vector, leading to the job exceeding its 300-minute timeout.

I think the failures have not actually been exceeding a 300 minute (5 hour) timeout. Rather, something else is going on that causes the connection to the runner to be lost.

But let's see if this change makes it better.

@tianleiwu tianleiwu changed the title from "Fix WebGPU Windows CI timeout by removing redundant onnxruntime_test_all.exe from verbose logging step" to "Fix WebGPU Windows CI timeout by removing redundant test and sharding provider tests" on Apr 2, 2026