Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a checkpoint hang scenario in Tsavorite/Garnet by preventing orphaned semaphores in the state machine waiting list and by ensuring snapshot flush completion signaling is not skipped when exceptions occur.
Changes:
- Prevent multiple
lastVersionTransactionsDonesemaphores from being enqueued for the same version; add typed waiting-list entries and trace logging to improve diagnosability. - Update checkpoint SM tasks to tag semaphores with origin/type to make hangs easier to identify.
- Improve a cluster replication test by using
Task.WhenAll(...).Wait(timeout)and reporting task statuses on timeout.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Improves checkpoint/replica-attach wait logic and timeout diagnostics in a replication cleanup test. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs | Tags index checkpoint semaphores when added to the state machine waiting list; removes unused async completion helper. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Prevents duplicate semaphores per version; replaces raw semaphore waiting list with typed entries and adds trace logs during waits. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SnapshotCheckpointSMTask.cs | Tags snapshot flush semaphore when enqueued. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/IncrementalSnapshotCheckpointSMTask.cs | Tags incremental snapshot flush semaphore when enqueued and simplifies phase handling. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/FoldOverSMTask.cs | Tags fold-over flush semaphore when enqueued. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SemaphoreWaiterMonitor.cs | Introduces StateMachineSemaphoreType and a wrapper struct to track semaphore provenance in the waiting list. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Wraps snapshot flush runner in try/catch and ensures the completion semaphore is released on exception; adjusts flush callback release placement. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (destOffset > 0) | ||
| deltaLog.Seal(destOffset); | ||
| _completedSemaphore.Release(); | ||
| try |
There was a problem hiding this comment.
why is only this covered by try/finally? is there reason to believe only this will throw exception and we need to anyway release semaphore, and that does not apply to the code that exists earlier up?
There was a problem hiding this comment.
I added a catch to the above to ensure that the semaphore is released.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs
Outdated
Show resolved
Hide resolved
d0e05d5 to
c1a7fdd
Compare
This PR attempts to fix checkpoint hanging due to uncaught exception during AsyncFlushPagesForSnapshot.