Skip to content

Fix heap size double-subtraction causing server crash under low-memory object operations#1663

Draft
badrishc wants to merge 6 commits intodevfrom
badrishc/fix-ListPushPopStressTest
Draft

Fix heap size double-subtraction causing server crash under low-memory object operations#1663
badrishc wants to merge 6 commits intodevfrom
badrishc/fix-ListPushPopStressTest

Conversation

@badrishc
Copy link
Copy Markdown
Collaborator

@badrishc badrishc commented Apr 2, 2026

Problem

Under lowMemory: true settings (LogMemorySize="2k", PageSize="512") with concurrent object operations (LPUSH/RPOP), the server crashes with ConnectionReset — killing the client connection while commands are in-flight. The ListPushPopStressTest fails intermittently (~90% of the time in Debug mode).

Root Cause

CacheSizeTracker called SetLogSizeTracker() to register the LogSizeTracker, but this only sets the logSizeTracker field — it does not subscribe the tracker as onEvictionObserver. Without the eviction observer, the OnNext callback is never invoked when pages are evicted.

To compensate, ResizeIfNeeded pre-computed a heapTrimmedSize by scanning records before eviction, then subtracted it from heapSize after ShiftAddresses completed. This introduced a race condition: between the scan and the subtraction, concurrent CopyUpdate operations on other threads could clear the source record's value object (via ClearSourceValueObject) and decrement heapSize via AddHeapSize(sizeAdjustment). The resizer then subtracted heapTrimmedSize — which still included the already-decremented record — causing a double-subtraction that drove heapSize to -80, triggering DebugAssertException → session crash → ConnectionReset.

Fix

  1. CacheSizeTracker.cs: Use SubscribeEvictions() instead of SetLogSizeTracker(). This sets both onEvictionObserver and logSizeTracker, enabling the OnNext callback to subtract each record's current HeapMemorySize at actual eviction time. Records already cleared by concurrent CopyUpdate have HeapMemorySize=0, so they correctly contribute nothing.

  2. LogSizeTracker.cs: Remove the stale heapSize.Increment(-heapTrimmedSize) from ResizeIfNeeded. This is no longer needed because OnNext now handles heap subtraction during ShiftAddressesOnPagesClosedMemoryPageScan. The heapTrimmedSize from DetermineEvictionRange is still computed and used for the eviction range calculation (how far to advance HeadAddress), just not for the actual heap counter adjustment.

  3. RespListTests.cs: Convert ListPushPopStressTest to async to avoid thread pool starvation from blocking pop retries.

Validation

  • ListPushPopStressTest (100 keys × 500 items, 200 concurrent tasks, Debug mode): 10/10 passes with [Repeat(10)] = 100 total stress iterations (was crashing ~90%)
  • Full test suite: 3359 total, 0 failed, 3326 passed, 33 skipped

Copilot AI review requested due to automatic review settings April 2, 2026 01:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates ListPushPopStressTest to be more resilient under thread pool starvation by avoiding blocking Redis calls and switching to an async wait pattern.

Changes:

  • Convert ListPushPopStressTest to an async Task NUnit test.
  • Replace synchronous ListRightPop/Task.WaitAll with ListRightPopAsync/Task.WhenAll.
  • Replace tight Thread.Yield() spin with a small async delay between empty-pop retries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

badrishc and others added 2 commits April 1, 2026 19:58
…ap objects

Under lowMemory settings (LogMemorySize=2k, PageSize=512) with heavy object
operations, the server would livelock in TryAllocateRetryNow because:

1. IsBeyondSizeLimitAndCanEvict(addingPage:true) returned true for soft memory
   limits (heap objects >> target), blocking page allocations via NeedToWaitForClose.
   Evicting pages cannot reduce heap memory from objects that are immediately
   cloned back via CopyUpdate, creating an infinite RETRY_NOW spin.

2. IssueShiftAddress skipped HeadAddress advancement when IsBeyondSizeLimit was
   true, even at the hard MaxAllocatedPageCount limit. This prevented the
   TryAllocateRetryNow spin from resolving because no thread advanced HeadAddress.

3. TryAllocateRetryNow had no backoff, consuming 70%+ CPU in a tight spin loop
   when allocation kept returning RETRY_NOW.

Fixes:
- IsBeyondSizeLimitAndCanEvict: when addingPage=true, only check the hard page
  limit (MaxAllocatedPageCount). Soft limits are handled asynchronously by the
  resizer task, signaled via NeedToWaitForClose.
- NeedToWaitForClose: separate hard limit (block) from soft limit (signal only).
- IssueShiftAddress: always advance HeadAddress at MaxAllocatedPageCount,
  regardless of IsBeyondSizeLimit state.
- TryAllocateRetryNow: progressive backoff (Thread.Yield -> Thread.Sleep(1))
  to reduce CPU usage during unavoidable spins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CircularDiskReadBuffer.OnBeginRecord compared recordFilePosition.word against
bufferFilePosition.word using the full ulong word, but this word includes the
piggybacked ObjectSizeHighByte field (bits 56-63) which differs between records
and is unrelated to file position. This caused spurious assertion failures in
Debug mode under heavy object eviction with storage tiering enabled.

Additionally, use saturating subtraction via CurrentAddress (segment + offset
only) to compute the increment, avoiding ulong underflow when positions have
small misalignments from sector-alignment padding at partial flush boundaries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc marked this pull request as draft April 2, 2026 17:03
badrishc and others added 2 commits April 2, 2026 10:18
…pdate

Root cause: CacheSizeTracker used SetLogSizeTracker() which only sets the
logSizeTracker field but does NOT subscribe the tracker as onEvictionObserver.
Without the eviction observer, OnNext is never called when pages are evicted.
Instead, ResizeIfNeeded pre-computes heapTrimmedSize by scanning records BEFORE
eviction, then subtracts it AFTER ShiftAddresses completes. But between the
scan and subtraction, concurrent CopyUpdate operations on other threads can
clear the source record's value object AND decrement heapSize via
AddHeapSize(sizeAdjustment) — causing the same heap memory to be subtracted
twice, driving heapSize negative (e.g. -80) and triggering a DebugAssertException
that crashes the server session.

Fix: Use SubscribeEvictions() instead of SetLogSizeTracker(). This sets both
onEvictionObserver AND logSizeTracker, enabling the OnNext callback to subtract
each record's CURRENT HeapMemorySize at actual eviction time. Records already
cleared by concurrent CopyUpdate will have HeapMemorySize=0, avoiding the
double-subtraction. Remove the stale pre-computed heapSize.Increment(-heapTrimmedSize)
from ResizeIfNeeded since OnNext now handles it correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/fix-ListPushPopStressTest branch from ec4b48c to 5b7322a Compare April 2, 2026 19:17
@badrishc badrishc changed the title Fix ListPushPopStressTest Fix heap size double-subtraction causing server crash under low-memory object operations Apr 3, 2026
…heapTrimmedSize)

Remove unnecessary changes that were workarounds for symptoms of the heap
accounting bug, not separate issues:
- AllocatorBase.cs: revert NeedToWaitForClose, IssueShiftAddress, and
  TryAllocateRetryNow changes (the livelock was caused by heapSize growing
  unbounded due to double-subtraction, not by the soft limit logic itself)
- CircularDiskReadBuffer.cs: revert OnBeginRecord assertion change (the
  assertion only fired when heapSize was wrong, causing incorrect eviction
  decisions that led to corrupted object log positions)
- LogSizeTracker.cs: revert IsBeyondSizeLimitAndCanEvict and IncrementSize
  clamping changes (no longer needed with correct accounting)

The actual fix is two lines in two files:
1. CacheSizeTracker.cs: SubscribeEvictions() instead of SetLogSizeTracker()
2. LogSizeTracker.cs: remove heapSize.Increment(-heapTrimmedSize) from
   ResizeIfNeeded (OnNext callback now handles it at eviction time)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/fix-ListPushPopStressTest branch from 68b2c6c to 8c52658 Compare April 3, 2026 04:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants