Fix heap size double-subtraction causing server crash under low-memory object operations#1663
Draft
Fix heap size double-subtraction causing server crash under low-memory object operations#1663
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Updates ListPushPopStressTest to be more resilient under thread pool starvation by avoiding blocking Redis calls and switching to an async wait pattern.
Changes:
- Convert
ListPushPopStressTestto anasync TaskNUnit test. - Replace synchronous
ListRightPop/Task.WaitAllwithListRightPopAsync/Task.WhenAll. - Replace tight
Thread.Yield()spin with a small async delay between empty-pop retries.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ap objects Under lowMemory settings (LogMemorySize=2k, PageSize=512) with heavy object operations, the server would livelock in TryAllocateRetryNow because: 1. IsBeyondSizeLimitAndCanEvict(addingPage:true) returned true for soft memory limits (heap objects >> target), blocking page allocations via NeedToWaitForClose. Evicting pages cannot reduce heap memory from objects that are immediately cloned back via CopyUpdate, creating an infinite RETRY_NOW spin. 2. IssueShiftAddress skipped HeadAddress advancement when IsBeyondSizeLimit was true, even at the hard MaxAllocatedPageCount limit. This prevented the TryAllocateRetryNow spin from resolving because no thread advanced HeadAddress. 3. TryAllocateRetryNow had no backoff, consuming 70%+ CPU in a tight spin loop when allocation kept returning RETRY_NOW. Fixes: - IsBeyondSizeLimitAndCanEvict: when addingPage=true, only check the hard page limit (MaxAllocatedPageCount). Soft limits are handled asynchronously by the resizer task, signaled via NeedToWaitForClose. - NeedToWaitForClose: separate hard limit (block) from soft limit (signal only). - IssueShiftAddress: always advance HeadAddress at MaxAllocatedPageCount, regardless of IsBeyondSizeLimit state. - TryAllocateRetryNow: progressive backoff (Thread.Yield -> Thread.Sleep(1)) to reduce CPU usage during unavoidable spins. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CircularDiskReadBuffer.OnBeginRecord compared recordFilePosition.word against bufferFilePosition.word using the full ulong word, but this word includes the piggybacked ObjectSizeHighByte field (bits 56-63) which differs between records and is unrelated to file position. This caused spurious assertion failures in Debug mode under heavy object eviction with storage tiering enabled. Additionally, use saturating subtraction via CurrentAddress (segment + offset only) to compute the increment, avoiding ulong underflow when positions have small misalignments from sector-alignment padding at partial flush boundaries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pdate Root cause: CacheSizeTracker used SetLogSizeTracker() which only sets the logSizeTracker field but does NOT subscribe the tracker as onEvictionObserver. Without the eviction observer, OnNext is never called when pages are evicted. Instead, ResizeIfNeeded pre-computes heapTrimmedSize by scanning records BEFORE eviction, then subtracts it AFTER ShiftAddresses completes. But between the scan and subtraction, concurrent CopyUpdate operations on other threads can clear the source record's value object AND decrement heapSize via AddHeapSize(sizeAdjustment) — causing the same heap memory to be subtracted twice, driving heapSize negative (e.g. -80) and triggering a DebugAssertException that crashes the server session. Fix: Use SubscribeEvictions() instead of SetLogSizeTracker(). This sets both onEvictionObserver AND logSizeTracker, enabling the OnNext callback to subtract each record's CURRENT HeapMemorySize at actual eviction time. Records already cleared by concurrent CopyUpdate will have HeapMemorySize=0, avoiding the double-subtraction. Remove the stale pre-computed heapSize.Increment(-heapTrimmedSize) from ResizeIfNeeded since OnNext now handles it correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ec4b48c to
5b7322a
Compare
…heapTrimmedSize) Remove unnecessary changes that were workarounds for symptoms of the heap accounting bug, not separate issues: - AllocatorBase.cs: revert NeedToWaitForClose, IssueShiftAddress, and TryAllocateRetryNow changes (the livelock was caused by heapSize growing unbounded due to double-subtraction, not by the soft limit logic itself) - CircularDiskReadBuffer.cs: revert OnBeginRecord assertion change (the assertion only fired when heapSize was wrong, causing incorrect eviction decisions that led to corrupted object log positions) - LogSizeTracker.cs: revert IsBeyondSizeLimitAndCanEvict and IncrementSize clamping changes (no longer needed with correct accounting) The actual fix is two lines in two files: 1. CacheSizeTracker.cs: SubscribeEvictions() instead of SetLogSizeTracker() 2. LogSizeTracker.cs: remove heapSize.Increment(-heapTrimmedSize) from ResizeIfNeeded (OnNext callback now handles it at eviction time) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
68b2c6c to
8c52658
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Under
lowMemory: truesettings (LogMemorySize="2k",PageSize="512") with concurrent object operations (LPUSH/RPOP), the server crashes withConnectionReset— killing the client connection while commands are in-flight. TheListPushPopStressTestfails intermittently (~90% of the time in Debug mode).Root Cause
CacheSizeTrackercalledSetLogSizeTracker()to register theLogSizeTracker, but this only sets thelogSizeTrackerfield — it does not subscribe the tracker asonEvictionObserver. Without the eviction observer, theOnNextcallback is never invoked when pages are evicted.To compensate,
ResizeIfNeededpre-computed aheapTrimmedSizeby scanning records before eviction, then subtracted it fromheapSizeafterShiftAddressescompleted. This introduced a race condition: between the scan and the subtraction, concurrentCopyUpdateoperations on other threads could clear the source record's value object (viaClearSourceValueObject) and decrementheapSizeviaAddHeapSize(sizeAdjustment). The resizer then subtractedheapTrimmedSize— which still included the already-decremented record — causing a double-subtraction that droveheapSizeto -80, triggeringDebugAssertException→ session crash →ConnectionReset.Fix
CacheSizeTracker.cs: UseSubscribeEvictions()instead ofSetLogSizeTracker(). This sets bothonEvictionObserverandlogSizeTracker, enabling theOnNextcallback to subtract each record's currentHeapMemorySizeat actual eviction time. Records already cleared by concurrentCopyUpdatehaveHeapMemorySize=0, so they correctly contribute nothing.LogSizeTracker.cs: Remove the staleheapSize.Increment(-heapTrimmedSize)fromResizeIfNeeded. This is no longer needed becauseOnNextnow handles heap subtraction duringShiftAddresses→OnPagesClosed→MemoryPageScan. TheheapTrimmedSizefromDetermineEvictionRangeis still computed and used for the eviction range calculation (how far to advanceHeadAddress), just not for the actual heap counter adjustment.RespListTests.cs: ConvertListPushPopStressTestto async to avoid thread pool starvation from blocking pop retries.Validation
[Repeat(10)]= 100 total stress iterations (was crashing ~90%)