Skip to content

fix: prevent zombie MCP processes via handshake timeout + deferred init#89

Merged
PatrickSys merged 1 commit intomasterfrom
fix/zombie-process-deterministic
Apr 5, 2026
Merged

fix: prevent zombie MCP processes via handshake timeout + deferred init#89
PatrickSys merged 1 commit intomasterfrom
fix/zombie-process-deterministic

Conversation

@PatrickSys
Copy link
Copy Markdown
Owner

Adds a 30-second handshake timer that exits the process if no MCP client
sends an initialize message within the timeout window. Defers CPU-heavy
indexing and file-watcher setup to the oninitialized callback so a misfire
consumes near-zero resources before exiting.

Closes the recurring zombie process bug where Codex CLI spawns

px codebase-context \ without a subcommand, starting the MCP server
with no client to connect. The 4 previous guards (ppid poller, stdin close,
server.onclose, SIGHUP) all failed because cmd.exe holds the pipe open
while waiting for the child to produce output.

The fix is tool-agnostic: it does not rely on who is calling it or how.
Any invocation that does not result in an MCP initialize handshake within
30 seconds exits cleanly with code 1.

What / Why

Explain the change briefly.

Release notes

This repo uses an automated release flow.

  • Please use a Conventional-Commits style PR title (we squash-merge): feat: ..., fix: ..., docs: ..., chore: ..., refactor: ...
  • If this should not appear in release notes, say so here

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 4, 2026

Greptile Summary

This PR fixes the recurring zombie MCP process bug by adding a 30-second handshake timeout that exits the process if no MCP initialize message arrives, and defers all CPU-heavy indexing and file-watcher setup to the oninitialized callback. It also ships a new discovery evaluation harness (src/eval/discovery-harness.ts) with frozen fixtures, gate logic, and comprehensive tests.

Key changes:

  • src/index.ts: Handshake timer (HANDSHAKE_TIMEOUT_MS, env-configurable) with unref() so it doesn't block natural exits; server.oninitialized now owns all project initialization (indexing, watcher, root refresh)
  • tests/zombie-guard.test.ts: Real child-process integration tests validating exit code 1, timeout message, deferred init ([DEBUG] Server ready absent before handshake), and custom timeout env var
  • src/eval/discovery-harness.ts + src/eval/types.ts: New discovery benchmark harness with surface runners, per-task scoring (usefulnessScore, firstRelevantHit, bestExampleUseful), gate evaluation with baseline + comparator lanes, and report formatting
  • tests/fixtures/discovery-*.json: Three new frozen fixtures (angular-spotify, excalidraw, benchmark protocol) pinned to concrete repository refs
  • scripts/run-eval.mjs + docs/capabilities.md: CLI and documentation updated for --mode retrieval|discovery

Confidence Score: 5/5

Safe to merge — the zombie guard logic is correct and well-tested; all findings are P2 style/maintenance suggestions

The core handshake-timeout implementation is sound: unref() is used correctly, oninitialized defers all heavy work, and four integration tests exercise the real binary. The discovery harness is comprehensive and its gate logic is correct for the current frozen protocol. Three P2 findings (lowerIsBetter inconsistency, missing undefined-runner guard, and build-dependency test guard) are non-blocking quality improvements that do not affect correctness at the current protocol version.

src/eval/discovery-harness.ts deserves a follow-up to align lowerIsBetter across compareMetric and compareMetricWithinTolerance, and to add a guard for unregistered surface runners

Important Files Changed

Filename Overview
src/index.ts Adds handshake timeout (30 s default, env-configurable) and defers all project init to oninitialized callback to prevent zombie MCP processes
tests/zombie-guard.test.ts Integration tests spawning real child processes to verify exit-code-1 behavior, timeout message, deferred init, and custom timeout env var
src/eval/discovery-harness.ts New discovery harness with surface runners, task scoring, gate evaluation, and report formatting; minor lowerIsBetter inconsistency with compareMetric
src/eval/types.ts New discovery eval types: DiscoveryTask, DiscoveryFixture, DiscoverySummary, gate types, and benchmark protocol interfaces
tests/discovery-harness.test.ts Comprehensive unit tests for discovery harness scoring, fixture freezing, report formatting, and gate evaluation edge cases
tests/eval-harness.test.ts Imports countUtf8Bytes and estimateTokenCountFromBytes as shared exports; existing harness tests unchanged
scripts/run-eval.mjs Extends eval CLI with --mode flag for retrieval
src/eval/harness.ts Extracts countUtf8Bytes and estimateTokenCountFromBytes as public exports for reuse by the discovery harness
docs/capabilities.md Updates eval docs to reflect discovery mode, new fixtures, metrics, and gate semantics
tests/fixtures/discovery-benchmark-protocol.json Frozen benchmark protocol with ship gate, 15% comparator tolerance, and named comparator list
tests/fixtures/discovery-angular-spotify.json Frozen discovery fixture for angular-spotify: 12 tasks, balanced 4/4/4 map/find/search split, pinned to commit SHA
tests/fixtures/discovery-excalidraw.json Frozen discovery fixture for excalidraw: 12 tasks, balanced 4/4/4 map/find/search split, pinned to commit SHA
src/resources/codebase-intelligence.ts No functional changes; serves as the surface runner for codebase://context discovery tasks
tests/fixtures/README.md Documentation update describing the new discovery fixtures and their frozen-eval semantics

Sequence Diagram

sequenceDiagram
    participant CLI as npx codebase-context
    participant Main as main() in index.ts
    participant Timer as HandshakeTimer (30s, unref'd)
    participant MCP as MCP Client
    participant Init as Project Init

    CLI->>Main: start (STDIO mode)
    Main->>Timer: setTimeout(exit(1), 30s).unref()
    Main->>MCP: server.connect(StdioTransport)

    alt Zombie scenario — no client connects
        Timer-->>Main: fires after 30s
        Main->>CLI: process.exit(1) + stderr message
    else Real MCP client connects
        MCP->>Main: initialize message
        Main->>Timer: clearTimeout(handshakeTimer)
        Main->>Init: oninitialized() — index + file-watcher
        Init-->>MCP: ready
        MCP->>Main: (normal lifecycle)
        Main->>CLI: process.exit(0) on close/SIGTERM
    end
Loading

Reviews (1): Last reviewed commit: "fix: prevent zombie MCP processes via ha..." | Re-trigger Greptile

metric: DiscoveryMetricName,
tolerancePercent: number
): DiscoveryMetricComparison {
const lowerIsBetter = metric === 'averageFirstRelevantHit';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 lowerIsBetter inconsistency with compareMetric

In compareMetric (line 243), both averageEstimatedTokens and averageFirstRelevantHit are treated as lower-is-better:

const lowerIsBetter = metric === 'averageEstimatedTokens' || metric === 'averageFirstRelevantHit';

But in compareMetricWithinTolerance only averageFirstRelevantHit is:

const lowerIsBetter = metric === 'averageFirstRelevantHit';

This is safe today because the comparator gate's usefulnessMetrics in the frozen protocol only contains averageUsefulness, averageFirstRelevantHit, and bestExampleUsefulnessRate — never averageEstimatedTokens. However, since usefulnessMetrics is typed as DiscoveryMetricName[], nothing prevents a future protocol update from adding averageEstimatedTokens to the comparator gate, which would silently evaluate it as higher-is-better and produce an incorrect gate result.

Consider aligning the two functions:

Suggested change
const lowerIsBetter = metric === 'averageFirstRelevantHit';
const lowerIsBetter = metric === 'averageEstimatedTokens' || metric === 'averageFirstRelevantHit';

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. now includes and in its check, consistent with . This was already addressed in the rebase by taking the metric-alignment commit from master (21c3e43fix: align discovery protocol metrics), which added averagePayloadBytes to both functions and to DiscoveryMetricName.

Comment on lines +424 to +427
for (const task of fixture.tasks) {
const runner = runners[task.surface];
const payload = await runner(task, rootPath);
results.push(evaluateDiscoveryTask(task, payload));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No guard for undefined surface runner

If task.surface is a value not present in the runners map (e.g., a typo in a fixture file or a newly added DiscoverySurface value without a corresponding runner), runner is undefined and calling await runner(task, rootPath) throws a confusing TypeError: runner is not a function rather than a descriptive error.

Consider adding an explicit check:

for (const task of fixture.tasks) {
  const runner = runners[task.surface];
  if (!runner) {
    throw new Error(`No runner registered for surface: ${task.surface}`);
  }
  const payload = await runner(task, rootPath);
  results.push(evaluateDiscoveryTask(task, payload));
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added an explicit guard before calling the runner:

const runner = runners[task.surface];
if (!runner) {
  throw new Error(`No runner registered for surface: ${task.surface}`);
}

This surfaces fixture typos or unregistered surfaces with a clear message instead of a cryptic TypeError.

import { fileURLToPath } from 'node:url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const ENTRY_POINT = path.resolve(__dirname, '..', 'dist', 'index.js');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Tests depend on a pre-built dist/index.js

The test resolves the entry point from dist/index.js, so it will fail with a confusing rejection (from child.on('error', reject)) if the project hasn't been built before running tests. A descriptive beforeAll guard would make the failure more actionable:

import { existsSync } from 'node:fs';

beforeAll(() => {
  if (!existsSync(ENTRY_POINT)) {
    throw new Error(
      `dist/index.js not found — run \`npm run build\` before the zombie-guard tests.`
    );
  }
});

This is especially relevant for contributors running vitest without a prior build step.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added a beforeAll guard at the top of the describe block:

beforeAll(() => {
  if (!existsSync(ENTRY_POINT)) {
    throw new Error(
      `dist/index.js not found - run \`npm run build\` before the zombie-guard tests.`
    );
  }
});

Throws early with an actionable message rather than letting the test fail with a cryptic spawn rejection.

 Adds a 30-second handshake timer that exits the process if no MCP client
 sends an initialize message within the timeout window. Defers CPU-heavy
 indexing and file-watcher setup to the oninitialized callback so a misfire
 consumes near-zero resources before exiting.

 Closes the recurring zombie process bug where Codex CLI spawns
 \
px codebase-context <path>\ without a subcommand, starting the MCP server
 with no client to connect. The 4 previous guards (ppid poller, stdin close,
 server.onclose, SIGHUP) all failed because cmd.exe holds the pipe open
 while waiting for the child to produce output.

 The fix is tool-agnostic: it does not rely on who is calling it or how.
 Any invocation that does not result in an MCP initialize handshake within
 30 seconds exits cleanly with code 1.
@PatrickSys PatrickSys force-pushed the fix/zombie-process-deterministic branch from 0344cc1 to dfcbbce Compare April 5, 2026 10:08
@PatrickSys PatrickSys merged commit 37fd4b9 into master Apr 5, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant