Skip to content

Ensure CompletedSemaphore is Always Released#1666

Open
vazois wants to merge 13 commits intodevfrom
vazois/dev-ct-fix
Open

Ensure CompletedSemaphore is Always Released#1666
vazois wants to merge 13 commits intodevfrom
vazois/dev-ct-fix

Conversation

@vazois
Copy link
Copy Markdown
Contributor

@vazois vazois commented Apr 2, 2026

This PR attempts to fix checkpoint hanging due to uncaught exception during AsyncFlushPagesForSnapshot.

Copilot AI review requested due to automatic review settings April 2, 2026 21:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a checkpoint hang scenario in Tsavorite/Garnet by preventing orphaned semaphores in the state machine waiting list and by ensuring snapshot flush completion signaling is not skipped when exceptions occur.

Changes:

  • Prevent multiple lastVersionTransactionsDone semaphores from being enqueued for the same version; add typed waiting-list entries and trace logging to improve diagnosability.
  • Update checkpoint SM tasks to tag semaphores with origin/type to make hangs easier to identify.
  • Improve a cluster replication test by using Task.WhenAll(...).Wait(timeout) and reporting task statuses on timeout.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs Improves checkpoint/replica-attach wait logic and timeout diagnostics in a replication cleanup test.
libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs Tags index checkpoint semaphores when added to the state machine waiting list; removes unused async completion helper.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs Prevents duplicate semaphores per version; replaces raw semaphore waiting list with typed entries and adds trace logs during waits.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SnapshotCheckpointSMTask.cs Tags snapshot flush semaphore when enqueued.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/IncrementalSnapshotCheckpointSMTask.cs Tags incremental snapshot flush semaphore when enqueued and simplifies phase handling.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/FoldOverSMTask.cs Tags fold-over flush semaphore when enqueued.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SemaphoreWaiterMonitor.cs Introduces StateMachineSemaphoreType and a wrapper struct to track semaphore provenance in the waiting list.
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Wraps snapshot flush runner in try/catch and ensures the completion semaphore is released on exception; adjusts flush callback release placement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@vazois vazois changed the title Ensure CompletedSemaphore Release is Always Released Ensure CompletedSemaphore is Always Released Apr 3, 2026
if (destOffset > 0)
deltaLog.Seal(destOffset);
_completedSemaphore.Release();
try
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is only this covered by try/finally? is there reason to believe only this will throw exception and we need to anyway release semaphore, and that does not apply to the code that exists earlier up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a catch to the above to ensure that the semaphore is released.

@vazois vazois force-pushed the vazois/dev-ct-fix branch from d0e05d5 to c1a7fdd Compare April 3, 2026 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants