feat: freeze discovery benchmark contract#88
Conversation
Greptile SummaryThis PR freezes the discovery benchmark contract by pinning two public fixtures ( Confidence Score: 5/5Safe to merge — all findings are P2 style/maintenance items with no present runtime impact The PR introduces a well-typed, well-tested discovery benchmark contract. The gate logic is covered end-to-end in tests. The three remaining issues are: (1) a latent
|
| Filename | Overview |
|---|---|
| src/eval/discovery-harness.ts | New 473-line discovery harness; gate logic is sound, but compareMetricWithinTolerance omits averageEstimatedTokens from lowerIsBetter, diverging from compareMetric |
| scripts/run-eval.mjs | Discovery mode wired in cleanly; passesAllGates is undefined in discovery mode (dead code), and fs imports are split across two statements |
| src/eval/types.ts | Discovery types added alongside existing retrieval types; DiscoveryBenchmarkProtocol.metrics field type doesn't match JSON values |
| tests/discovery-harness.test.ts | Good coverage: fixture freeze invariants, deterministic scoring, and gate pass/fail/pending_evidence paths all tested |
| tests/fixtures/discovery-benchmark-protocol.json | Frozen protocol with 4 comparators and 15% tolerance; metrics field uses task-level names instead of DiscoveryMetricName values |
| src/resources/codebase-intelligence.ts | Clean extraction of generateCodebaseIntelligence into a standalone module; no logic changes |
| tests/eval-harness.test.ts | Tests updated to cover new countUtf8Bytes and estimateTokenCountFromBytes exports; no issues |
| tests/fixtures/discovery-angular-spotify.json | 12-task frozen fixture with pinned 40-char SHA and balanced 4/4/4 job distribution |
| tests/fixtures/discovery-excalidraw.json | 12-task frozen fixture with pinned 40-char SHA and balanced 4/4/4 job distribution |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[run-eval.mjs --mode=discovery] --> B[runSingleEvaluation: codebaseA]
A --> C{codebaseB provided?}
B --> D[evaluateDiscoveryFixture]
C -- yes --> E[runSingleEvaluation: codebaseB]
E --> F[evaluateDiscoveryFixture]
D & F --> G[combineDiscoverySummaries]
C -- no --> G
G --> H[evaluateDiscoveryGate]
H --> I{gate.status}
I -- pending_evidence --> J[exit 0]
I -- passed --> J
I -- failed --> K[exit 1]
subgraph evaluateDiscoveryGate
L[evaluateBaselineGate
vs raw Claude Code] --> M{payload cost lower
AND 1+ usefulness win?}
N[evaluateComparatorGate x3
GrepAI / jCodeMunch / codebase-memory-mcp] --> O{all metrics within
15% tolerance?}
P{suiteComplete?
both fixtures run} --> Q[build missingEvidence list]
M & O & Q --> R[passed / failed / pending_evidence]
end
Comments Outside Diff (2)
-
scripts/run-eval.mjs, line 288 (link)passesAllGatesisundefinedin discovery modeIn discovery mode,
runSingleEvaluationreturns aDiscoverySummary, which has nopassesGateproperty. SopassesAllGatesis initialized asundefinedhere. The discovery branch exits the process at line 320 before this value is ever read at line 324, so there is no runtime impact today. However, it is a silent dead-state assignment that becomes a latent footgun if the branching is ever refactored. -
scripts/run-eval.mjs, line 6-7 (link)Duplicate
fsimport statementsTwo separate named imports from
'fs'should be merged.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "fix: format discovery benchmark sources" | Re-trigger Greptile
|
Follow-up on Greptile notes: fixed the CI lint failure by renaming the unused legacy helper to _generateCodebaseContext (it is not used for the MCP resource; ReadResource(codebase://context) still calls generateCodebaseIntelligence). Also normalized the retry/status messages to ASCII '-' and kept discovery protocol metrics/lowerIsBetter logic aligned. No behavior change to the codebase://context resource. |
Summary
Testing
Notes