Fix WebGPU Windows CI timeout by removing redundant test and sharding provider tests #27953
Open
Conversation
Contributor
I think the failures have not actually been exceeding a 300-minute (5-hour) timeout. Rather, something else is going on that causes the connection to the runner to be lost. But let's see if this change makes it better.
The `webgpu_build_x64_RelWithDebInfo` CI job has been failing for about a week, with the runner becoming unresponsive during the verbose-logging "Run tests" step (likely around `ActivationOpTest.Elu`). The probable cause is GPU resource exhaustion from hundreds of tests sharing a single WebGPU context, which crashes the runner or makes it unable to communicate with the CI job controller. This PR removes the redundant `onnxruntime_test_all.exe` invocation from that step (it is already covered by CTest in "Build and Test") and shards `onnxruntime_provider_test.exe` across 4 GTest shards to prevent Dawn/WebGPU GPU resource exhaustion.

Summary of Changes
CI Workflow (`windows_webgpu.yml`):
- Removed `onnxruntime_test_all.exe` from the verbose-logging step; it was redundant with CTest in the preceding "Build and Test" step.
- Split `onnxruntime_provider_test.exe` into 4 GTest shards (`GTEST_TOTAL_SHARDS`/`GTEST_SHARD_INDEX`) so each shard runs in its own process with an isolated WebGPU context, preventing GPU staging-buffer accumulation.
- Added a `timeout-minutes: 60` safety net on the verbose-logging step to prevent future hangs from blocking the entire job.

Testing
- `onnxruntime_test_all.exe` continues to run via CTest in the "Build and Test" step.
- `onnxruntime_provider_test.exe` still runs with `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0`, producing the same verbose WebGPU shader compilation logs (stderr) consumed by the subsequent "Check log file" step.
- The `webgpu_build_x64_RelWithDebInfo` job completes successfully and the shader key validation step passes.

Motivation and Context
PR #26907 (commit `99c5dd8839`, Mar 24) introduced a separate "Run tests" step to capture verbose WebGPU shader compilation logs for shader key validation. However, it unnecessarily ran the full `onnxruntime_test_all.exe` in the same process as `onnxruntime_provider_test.exe`. After ~10 activation test groups (Sigmoid, HardSigmoid, Tanh, Relu, etc.), accumulated GPU resources from hundreds of tests sharing a single WebGPU context exhausted GPU staging buffers (Dawn's `MapAsync`), likely around `ActivationOpTest.Elu` (100,000-element input). This resource exhaustion appears to crash the runner or make it unresponsive, causing it to lose communication with the CI job controller rather than simply timing out. Sharding isolates each shard in its own process, releasing GPU resources between shards.
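The per-process sharding described above can be sketched in POSIX shell. This is a minimal illustration, not the workflow code: the real step runs on Windows (PowerShell), and `run_provider_tests` and the `webgpu_test.log` file name are stand-ins for `onnxruntime_provider_test.exe` and whatever log file the "Check log file" step actually reads.

```shell
#!/bin/sh
# Sketch: run a GTest binary in 4 shards, each in its own process, so the
# WebGPU context (and Dawn's staging buffers) is released between shards.
# `run_provider_tests` is a stand-in for onnxruntime_provider_test.exe; a real
# GTest binary reads GTEST_TOTAL_SHARDS / GTEST_SHARD_INDEX from the
# environment and runs only its slice of the test list.
run_provider_tests() {
  echo "shard $GTEST_SHARD_INDEX/$GTEST_TOTAL_SHARDS" >&2
}

TOTAL_SHARDS=4
i=0
while [ "$i" -lt "$TOTAL_SHARDS" ]; do
  # ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 keeps the verbose logging behavior; stderr
  # is appended to the log file consumed by the later "Check log file" step.
  GTEST_TOTAL_SHARDS=$TOTAL_SHARDS GTEST_SHARD_INDEX=$i \
    ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 \
    run_provider_tests 2>> webgpu_test.log
  i=$((i + 1))
done
```

Because each shard is a separate process that exits before the next one starts, GPU resources held by one shard's WebGPU context cannot accumulate across the whole test suite, at the cost of re-initializing the context four times.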