Evaluate SWE, TB2, Tau, and SAE through a unified `kbench` CLI plus benchmark-specific adapters. Benchmark orchestration stays outside the SDK repo, while harness execution is normalized into one runtime contract.
- Unified CLI entrypoint: `kbench benchmark`, `kbench harness`, `kbench run`
- Built-in harnesses: `kode-agent-sdk`, `codex`, `claude-code`, `gemini-cli`, `custom-adapter`
- Supports custom adapters plus generated adapter bootstraps for external CLI or SDK harnesses
- Runs `swebench-verified` and `terminal-bench@2.0` through Harbor
- Runs official `tau-bench`
- Runs Kaggle Standardized Agent Exams (SAE) through the public HTTP exam API
- Supports GitHub Actions smoke tests and larger benchmark runs
- Produces standardized per-instance artifacts under `instances/<id>/`
```
              +---------------------------+
              | GitHub Actions / Scripts  |
              |   SWE / TB2 / Tau / SAE   |
              +-------------+-------------+
                            |
              +-------------v-------------+
              | benchmark-specific bridge |
              |     Harbor / Tau / SAE    |
              +-------------+-------------+
                            |
              +-------------v-------------+
              |         kbench CLI        |
              | benchmark / harness / run |
              +------+------+------+------+
                     |      |      |
       +-------------+      |      +------------+
       |                    |                   |
+------v---------+  +-------v-------+ +---------v---------+
| kode-agent-sdk |  | CLI harnesses | |  custom-adapter   |
| task + session |  | codex/claude/ | | manifest + runner |
+------+---------+  |  gemini-cli   | +---------+---------+
       |            +-------+-------+           |
+------v----------------+   |                   |
| @shareai-lab/kode-sdk |   |                   |
|   provider adapters   |   |                   |
+-----------------------+   |                   |
                    +-------v------+            |
                    |  local CLI   |            |
                    | login/config |            |
                    +-------+------+            |
                            ^                   |
                            |                   |
                    +-------+----------+        |
                    | adapter generate |--------+
                    |  heuristic repo  |
                    |    inspection    |
                    +------------------+
```
`adapter generate` is part of the current runtime path for bootstrapping dynamic custom-adapter integrations.
Today this generator is heuristic and repository-inspection-based; this repo does not yet ship a built-in remote LLM adapter generator.
- Node.js `24.x` is the recommended local and CI runtime for this repository
- npm `11.x` is the expected package manager line
- `npm ci` is the canonical clean-install path and is enforced by the PR quality gate
- Benchmark workflows still install their own extra runtime dependencies such as Python/Harbor/tau-bench as needed
This repository is primarily intended to run benchmarks through GitHub Actions.
- Configure repository secrets and variables:
  - `OPENAI_API_KEY`
  - `OPENAI_BASE_URL`
  - `OPENAI_API` if your OpenAI endpoint needs it
  - `ANTHROPIC_API_KEY`
  - `ANTHROPIC_BASE_URL` if needed
  - `GEMINI_API_KEY`
  - `GEMINI_BASE_URL` if needed
  - `GOOGLE_GEMINI_BASE_URL` is also accepted for compatibility
  - `KAGGLE_AGENT_ID` if you want to reuse a fixed SAE agent identity in Actions
  - `KAGGLE_AGENT_API_KEY` if you want to reuse a fixed SAE agent identity in Actions
- Manually dispatch one of these workflows:
  - `.github/workflows/eval-swe.yml`
  - `.github/workflows/eval-tb2.yml`
  - `.github/workflows/eval-tau.yml`
  - `.github/workflows/eval-sae.yml`
- In the workflow form:
  - keep `harness=kode-agent-sdk` for the current GitHub Actions path
  - choose exactly one `provider`: `openai`, `anthropic`, or `gemini`
  - fill `model` for that provider
  - optionally override `base_url` for the selected provider
Provider API keys stay in repository secrets. The dispatch form does not accept secret inputs safely, so keys are not entered manually at runtime.
The workflow run step now invokes `kbench benchmark run ...` directly instead of calling benchmark shell scripts as the top-level entrypoint.
Example:

```
provider=openai
model=gpt-4.1-mini
```
Recommended first SWE smoke test:

```
provider=openai
model=gpt-4.1-mini
task_limit=2
shard_count=2
max_parallel_shards=1
n_concurrent=1
n_attempts=1
```
Build the bundle first:

```bash
npm ci
npm run typecheck
npm run bench:bundle:cli
npm link
```

After `npm link`, `kbench` is available as a local shell command from this repo package.
Useful commands:

```bash
kbench benchmark list
kbench benchmark run --benchmark swe --harness kode-agent-sdk --model-name glm/glm-5
kbench benchmark run --benchmark sae --harness kode-agent-sdk --model-name glm/glm-5
kbench harness list
kbench harness probe --harness codex
kbench harness validate --harness codex --benchmark swe
kbench adapter profiles
kbench adapter init --type node --name my-runner --out /tmp/my-runner
kbench adapter generate --repo ../my-agent --out /tmp/my-agent-adapter
kbench adapter generate --repo ../my-cli-agent --hint cli-harness --out /tmp/my-cli-agent-adapter
kbench adapter validate --adapter /tmp/my-runner
kbench run --benchmark swe --harness custom-adapter --adapter /tmp/my-runner --instruction "Fix the bug"
kbench run --benchmark swe --harness kode-agent-sdk --model-name glm/glm-5 --instruction "Fix the failing test"
kbench run --benchmark swe --harness codex --model-name openai/gpt-5.3-codex --base-url https://apikey.soxio.me/openai --api-key-env OPENAI_API_KEY --config-mode inherit --instruction "Fix the bug"
```

Top-level help is now a full CLI reference:
```bash
kbench --help
```

The top-level help page currently includes:
- command tree and benchmark list
- built-in harness matrix
- required and optional parameters for `run`, `benchmark run`, `harness *`, and `adapter *`
- parameter types, defaults, and applicability notes
- SAE-specific options
- environment-variable based configuration such as provider credentials and generated CLI-adapter bridge vars
- timeout policy and end-to-end examples
Current built-in harness behavior:
- `kode-agent-sdk`: task mode for `swe`, `tb2`, and `sae`; session mode for `tau`
- `codex`: task mode only
- `claude-code`: task mode only
- `gemini-cli`: experimental task mode only
- `custom-adapter`: task or session mode, depending on the supplied adapter manifest
Current benchmark bridges:
- `swe`/`tb2`: Harbor-based benchmark runners
- `tau`: official tau-bench runner
- `sae`: native Kaggle exam runner inside `kbench`
`sae` benchmark usage:

- requires `--harness kode-agent-sdk`
- requires local credential files by default:
  - `~/.kaggle-agent-id`
  - `~/.kaggle-agent-api-key`
- optional auto-registration:
  - `--sae-register-if-missing true`
  - `--sae-agent-name ...`
  - `--sae-agent-description ...`
  - `--sae-agent-version ...`
  - `--sae-agent-type ...`
- optional API overrides:
  - `--sae-api-base https://www.kaggle.com/api/v1`
  - `--sae-agent-id-file ...`
  - `--sae-api-key-file ...`
  - `--sae-poll-interval-ms 2000`
  - `--sae-timeout-ms 1800000`
Example:
```bash
kbench benchmark run \
  --benchmark sae \
  --harness kode-agent-sdk \
  --model-name glm/glm-5 \
  --sae-register-if-missing false
```

`harness list` currently prints:
- harness id
- driver kind
- supported benchmarks
- short description
`harness probe` now validates key machine-readable features for built-in CLI harnesses:

- `codex`: `exec`, `--json`, sandbox bypass flag
- `claude-code`: `--print`, `--output-format`, `stream-json`, `--permission-mode`
- `gemini-cli`: `--prompt`, `--output-format`, `stream-json`, `--yolo`
Recommended `gemini-cli` invocation:

- prefer `--config-mode isolated`
- use `--api-key-env GEMINI_API_KEY`
- when Gemini requires a local proxy, pass `--proxy-url http://127.0.0.1:7897` or the explicit `--http-proxy`/`--https-proxy`/`--all-proxy` flags
- use `--base-url` only when you explicitly need a custom Gemini-compatible endpoint
`harness validate` checks whether a harness/benchmark combination is structurally valid before execution:

- supported benchmark declared by the harness
- required run mode for that benchmark (`task` or `session`)
`adapter init` currently supports:

- `--type cli`
- `--type python`
- `--type node`
Each generated adapter contains:
- `adapter.manifest.json`
- a stub runner entrypoint
- `README.md`
- `adapter.validate.json` when generated through `adapter generate`
`adapter generate` adds one more experimental bootstrap path:

- input: `--repo <local-path-or-url>`
- optional override: `--hint <profile>`, where profiles come from `kbench adapter profiles`
- output: a valid stub adapter plus `adapter.generate.json` and an automatic `adapter.validate.json`
- purpose: infer a starting `kind`, `runModes`, supported benchmarks, and likely command candidates
- report fields now include `commandDetails`, `recommendedCommand`, and `candidateEntrypoints` to make manual bridge wiring faster
Current built-in generate profiles:
- `generic`: conservative fallback bootstrap
- `cli-harness`: bootstrap for external CLI harnesses such as Codex, Claude Code, Gemini CLI, or similar command-based agents
- `codex`
- `claude-code`
- `gemini-cli`
- `langchain-runner`
- `kode-agent-sdk`
`cli-harness` is intended for repositories that expose a local or globally installed command instead of a stable SDK surface.
The generated adapter validates successfully by default because its `runner.sh` starts in stub mode and only bridges to the real CLI when environment overrides are provided.
This is the current dynamic-adapter path in-repo: repo inspection -> generated custom-adapter scaffold -> manual wiring or local refinement.
`cli-harness` bridge environment variables:

- `KBENCH_CLI_COMMAND`: full command used to invoke the target CLI harness
- `KBENCH_CLI_PROMPT_FLAG`: prompt flag name, for example `--prompt`
- `KBENCH_CLI_MODEL_FLAG`: model flag name, for example `--model`
- `KBENCH_CLI_OUTPUT_FLAG`: output-format flag name, for example `--output-format`
- `KBENCH_CLI_OUTPUT_VALUE`: output-format value passed together with the output flag, default `text`
- `KBENCH_CLI_EXTRA_ARGS`: extra raw arguments appended after the inferred flags
Example:
```bash
KBENCH_CLI_COMMAND="my-agent" \
KBENCH_CLI_PROMPT_FLAG="--prompt" \
KBENCH_CLI_MODEL_FLAG="--model" \
KBENCH_CLI_OUTPUT_FLAG="--output-format" \
KBENCH_CLI_OUTPUT_VALUE="json" \
KBENCH_CLI_EXTRA_ARGS="--dangerously-skip-permissions" \
kbench run --benchmark swe --harness custom-adapter --adapter /tmp/my-cli-agent-adapter --model-name openai/gpt-5.3-codex --instruction "Fix the bug"
```

Current limitations of `adapter generate`:
- remote URLs are not fetched automatically; URL inputs are treated as lightweight identifiers unless the repo is available locally
- output is a bootstrap scaffold, not a production-ready adapter
- generated command candidates must still be reviewed manually
`adapter validate` currently checks:

- manifest schema
- entry existence and executability when required
- fixture execution for declared `runModes`
- runner output shape against the kbench adapter protocol
`custom-adapter` runtime usage:

- pass `--harness custom-adapter`
- pass `--adapter <path-to-adapter-dir-or-manifest>`
- for task benchmarks, provide `--instruction`
- for session benchmarks such as `tau`, provide `--messages-file` and `--tools-file`
Parameter precedence for CLI harnesses, highest to lowest:

1. explicit CLI flags such as `--base-url`, `--api-key-env`, `--config-mode`, `--proxy-url`
2. environment variables such as `OPENAI_BASE_URL` or `ANTHROPIC_BASE_URL`
3. local harness config or login state when running in `inherit` mode
4. driver defaults
Default timeouts:
- `swe`/`tb2` with `codex` or `claude-code`: 30 min
- `swe`/`tb2` with `kode-agent-sdk`: 20 min
- `tau`: 5 min
Every kbench run writes a standardized run directory:
- `run.json`
- `summary.json`
- `output.jsonl`
- `output_errors.jsonl`
- `instances/<instance-id>/result.json`
- `instances/<instance-id>/native_result.json`
- `instances/<instance-id>/trace/normalized/trajectory.json` when available
- `instances/<instance-id>/trace/native/*` when available
- `instances/<instance-id>/artifacts/*`
- `instances/<instance-id>/artifacts/manifest.json`
`kbench benchmark run --benchmark sae` also writes run-level exam artifacts:

- `artifacts/sae/start_submission.json`
- `artifacts/sae/questions.json`
- `artifacts/sae/answers.json`
- `artifacts/sae/submit_answers.json`
- `artifacts/sae/final_submission.json`
- `artifacts/sae/agent_profile.json`
- `artifacts/sae/registration.json` when auto-registration is used
- `artifacts/sae/benchmark_error.json` on benchmark-level failures
Harbor shard merging also reads the standardized per-instance result path inside each trial artifact:
`agent/instances/<trial-name>/result.json`
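As a rough illustration (not kbench source), a merge step reading that per-instance path amounts to globbing each trial artifact and concatenating the result files. The `demo/` layout below is fabricated for the example; only the `agent/instances/<trial-name>/result.json` shape comes from the text above.

```shell
# Fabricate one trial artifact with a standardized per-instance result.
mkdir -p demo/trial-a/agent/instances/trial-a
printf '{"resolved":true}' > demo/trial-a/agent/instances/trial-a/result.json

# Collect every per-instance result across trial artifacts.
merged=""
for f in demo/*/agent/instances/*/result.json; do
  merged="$merged$(cat "$f")"
done
echo "$merged"
rm -rf demo
```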
The generated custom adapter runner should follow the manifest, protocol, and fixture contracts defined in these in-repo sources:

- `src/harness/sdk/manifest.ts`
- `src/harness/sdk/protocol.ts`
- `src/harness/sdk/fixtures.ts`
- `src/harness/templates/cli/`
- `src/harness/templates/python_runner/`
- `src/harness/templates/node_runner/`
Runtime transport contract:

- read one JSON input payload from `$KBENCH_ADAPTER_INPUT` when present
- otherwise read one JSON payload from stdin
- write one JSON output payload to stdout
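A minimal runner honoring this contract can be sketched as follows. Only the transport behavior (env-var file reference, stdin fallback, single JSON document on stdout) comes from the contract above; the payload fields (`status`, `echo`) are made-up placeholders, not the documented adapter protocol.

```shell
# Sketch of a custom-adapter runner's transport loop.
run_adapter() {
  if [ -n "${KBENCH_ADAPTER_INPUT:-}" ]; then
    input=$(cat "$KBENCH_ADAPTER_INPUT")   # payload delivered via file reference
  else
    input=$(cat)                           # payload delivered via stdin
  fi
  # Real agent work would happen here; this stub echoes the input back.
  printf '{"status":"ok","echo":%s}\n' "$input"
}

# Demo: exercise the file-based input path.
tmp=$(mktemp)
printf '{"instruction":"Fix the bug"}' > "$tmp"
export KBENCH_ADAPTER_INPUT="$tmp"
out=$(run_adapter)
unset KBENCH_ADAPTER_INPUT
rm -f "$tmp"
echo "$out"
```

The same function covers the stdin path when `KBENCH_ADAPTER_INPUT` is unset, e.g. `printf '{"a":1}' | run_adapter`.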
Optional adapter output fields now include:
- `trace.normalized`: inline normalized trace events written to `trace/normalized/trajectory.json`
- `trace.native`: file references copied into `trace/native/*`
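For orientation, an adapter output payload carrying both fields might look like the fragment below. The event and file-reference shapes (and the `status` field) are illustrative assumptions, not the documented schema; only the `trace.normalized` / `trace.native` field names come from the list above.

```json
{
  "status": "ok",
  "trace": {
    "normalized": [
      { "type": "assistant_message", "content": "Patched the failing test." }
    ],
    "native": [
      { "path": "native-trace.jsonl" }
    ]
  }
}
```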
The current Phase 3 scope now covers:
- `adapter init`
- `adapter validate`
- `run --harness custom-adapter --adapter ...`
An experimental bootstrap path also exists:

- `adapter generate`
See `README.zh-CN.md`.