Use benchmark evidence as an API contract
TokenRoute API output is designed to explain why a model is recommended for a workload: quality checks, contract failures, cache provenance, cost, latency, and confidence. Use the guide menu to jump between compact sections instead of scanning one long reference page.
Start here
Four steps before using route decisions.
Create a project API key
Open Settings, create a project key, and store it in your shell as TOKENROUTE_API_KEY. The full key is shown once.
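If you script against the API, a minimal sketch of loading that key in Python looks like the following; the Bearer header shape is an assumption, not something this reference confirms.
import os

# Read the project key exported in your shell (TOKENROUTE_API_KEY).
api_key = os.environ["TOKENROUTE_API_KEY"]

# Assumed header shape; confirm the auth scheme in your project settings.
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}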
Run or reuse a benchmark
Use the dashboard or POST /api/v1/benchmark-runs with prompts, model_ids, output contracts, expected keywords, and optional JSON schema.
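A sketch of creating a run over HTTP with Python and the requests library; the base URL is a placeholder, the Bearer header is an assumption, and the payload mirrors the fields listed above and in the JSON example later on this page.
import os
import requests  # assumed HTTP client; any client works

BASE_URL = "https://api.tokenroute.example"  # placeholder host for your deployment
headers = {"Authorization": f"Bearer {os.environ['TOKENROUTE_API_KEY']}"}

payload = {
    "model_ids": ["anthropic/claude-haiku-4-5-20251001"],
    "max_output_tokens": 256,
    "prompts": [{
        "id": "support-triage-json",
        "text": "Classify this customer message into billing, technical, or account.",
        "output_contract": "Return valid compact JSON with category and reason only.",
        "expected_keywords": ["billing", "reason"],
    }],
}

resp = requests.post(f"{BASE_URL}/api/v1/benchmark-runs", json=payload, headers=headers, timeout=60)
resp.raise_for_status()
run = resp.json()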
Read the evidence pack
Inspect score_breakdown for evaluator pack, prompt classification, deterministic checks, cache state, cost, latency, and score adjustments.
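A sketch of pulling the headline evidence out of a single score_breakdown dict; the field names follow the JSON example later on this page, and breakdown is assumed to be one model's entry from the run response.
def summarize(breakdown: dict) -> None:
    # breakdown is assumed to be the score_breakdown dict for one model/prompt pair.
    profile = breakdown.get("evaluator_quality_profile", {})
    print("evaluator pack:", breakdown.get("evaluator_pack", {}).get("id"))
    print("workload:", breakdown.get("prompt_classification", {}).get("workload_type"))
    print("evidence status:", profile.get("status"), profile.get("warnings"))
    print("validator cache:", breakdown.get("validator_cache", {}).get("status"))
    print("raw score:", breakdown.get("raw_score"))
    for adj in breakdown.get("score_adjustments", []):
        print("adjustment:", adj.get("policy"), "capped at", adj.get("cap"))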
Export a decision artifact
Use JSON, CSV, routing-policy, LiteLLM export, or route-decision endpoints to move evidence into agents, apps, CI, or gateways.
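One way to turn that evidence into a CI-friendly artifact without guessing at export endpoint paths is to write the fields you care about to a local JSON file; the file name and field selection here are illustrative only.
import json

def write_decision_artifact(breakdown: dict, path: str = "route-decision.json") -> None:
    # Keep only the evidence a reviewer or CI gate needs; raw outputs stay in TokenRoute.
    artifact = {
        "scoring_policy_version": breakdown.get("scoring_policy_version"),
        "evaluator_status": breakdown.get("evaluator_quality_profile", {}).get("status"),
        "raw_score": breakdown.get("raw_score"),
        "score_adjustments": breakdown.get("score_adjustments", []),
    }
    with open(path, "w") as f:
        json.dump(artifact, f, indent=2)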
Core endpoints and workflow
The API surface most users need first.
Decision quality
Use this before trusting a benchmark recommendation.
Good signal: 3 to 5 prompts for one workload; more for launch decisions.
Weak signal: one-prompt smoke runs are thin evidence.
Good signal: an output contract plus expected keywords; a JSON schema for structured output.
Weak signal: a free-form prompt only, with no schema and no objective anchors.
Good signal: `evaluator_quality_profile.status` is `strong` or `usable`.
Weak signal: `thin`, or warnings about a missing contract, schema, or required validators.
Good signal: no repeated provider failures; failure_policy is empty or clearly transient.
Weak signal: auth/config errors, circuit-open evidence, empty outputs, or repeated timeouts.
Good signal: the estimate is within per-run and project-month caps; latency fits the workload.
Weak signal: unknown pricing, a project budget breach, or latency outside the SLA.
What each `evaluator_quality_profile.status` value means:
`strong`: schema-backed or otherwise well-anchored evaluator evidence. Use it for comparison, subject to sample count and failure rate.
`usable`: enough contract or keyword anchors to compare models, but not ideal. Use it for exploration; add a schema, a stricter contract, or more prompts before relying on it.
`thin`: the benchmark lacks objective anchors or has weak evaluator coverage. Do not treat the recommendation as route-ready; tighten the prompt contract first.
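A sketch of applying this checklist as a code gate before acting on a recommendation; the prompt-count threshold is illustrative and the field names follow the score_breakdown example later on this page.
def ready_for_routing(breakdown: dict, prompt_count: int) -> bool:
    profile = breakdown.get("evaluator_quality_profile", {})
    failure = breakdown.get("failure_policy", {})
    # Evidence depth: strong or usable evaluator profile with no open warnings.
    if profile.get("status") not in {"strong", "usable"} or profile.get("warnings"):
        return False
    # Sample depth: one-prompt smoke runs are thin evidence (3 is an illustrative floor).
    if prompt_count < 3:
        return False
    # Reliability: any recorded recovery action (circuit open, repeated failures) makes the run suspect.
    if failure.get("action"):
        return False
    return True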
Scoring criteria
What each validator means in the output.
Checks whether the output includes your expected scoring hints.
Use for classification labels, must-mention concepts, and smokeable contract hints.
Prevents empty or whitespace-only model responses from passing.
Always keep this on; empty outputs are provider/model failures for benchmark purposes.
Warns when outputs are too short or too long for the launch scoring window.
Use max_output_tokens and output contracts to reduce accidental truncation.
Scores latency against the current benchmark SLA thresholds.
Use to compare candidates after contract quality is acceptable.
Validates raw JSON against the prompt schema. In scoring-policy v2, failed required contracts cap the score below route-ready.
Use for extraction, API output, classification JSON, and strict downstream automation.
Shows whether estimated provider cost is below the prompt threshold.
Use as a guardrail when comparing expensive models or broad datasets.
Reports contract depth, schema/keyword anchors, required-validator coverage, and judge-boundary evidence without changing score.
Use to decide whether a benchmark is ready for decision-making or needs stronger contracts, schemas, or sample depth.
Reports obvious output-contract signals such as JSON-only, no markdown fence, one-category, and two-sentence shape without changing score.
Use to debug why an output may look partially compliant before adding stricter JSON schema or exact-match contracts.
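A sketch of explaining a capped score by cross-checking score_adjustments against the evaluator pack's required_validators; field names follow the score_breakdown example later on this page.
def explain_cap(breakdown: dict) -> None:
    required = set(breakdown.get("evaluator_pack", {}).get("required_validators", []))
    for adj in breakdown.get("score_adjustments", []):
        failed = [v for v in adj.get("failed_validators", []) if v in required]
        if failed:
            # A failed required contract caps the score below route-ready in scoring-policy v2.
            print(f"capped at {adj.get('cap')} by {adj.get('policy')}: failed {failed}")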
Evidence fields and trust boundaries
Fields to read before trusting a recommendation.
Benchmark intelligence is tenant/project-private by default. Raw prompts and raw outputs are not included in aggregate intelligence.
Generation and validator cache keys include evaluator and scoring policy identity. A scoring-policy change should produce new cache entries.
Evaluator quality profile is evidence-only. It explains whether the prompt has enough contract, schema, keyword, and judge-boundary depth before you act on a recommendation.
Semantic cache is tenant/project-private and advisory. It exposes nearest prior prompt evidence and confidence, but still runs the selected model.
LLM judge output is stored as provenance. Single-judge output does not alter deterministic ranking, and multi-judge consensus remains a separate approval boundary.
Provider recovery evidence is run-scoped. It reduces repeated calls during a failing benchmark, but does not permanently suppress a provider across future runs.
Route-decision output is advisory. Production hot-path routing remains intentionally out of scope until evidence density and governance improve.
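A sketch of defensive checks that encode these boundaries, so downstream code fails loudly if the evidence ever looks like something it should not be; field names follow the score_breakdown example below.
def assert_trust_boundaries(breakdown: dict) -> None:
    semantic = breakdown.get("semantic_cache", {})
    judge = breakdown.get("llm_judge", {})
    # A semantic-cache hit is advisory only and never skips the provider call.
    if semantic.get("status") == "hit":
        assert semantic.get("mode") == "advisory_only"
        assert semantic.get("execution") == "provider_call_not_skipped"
    # Single-judge output is provenance and stays out of the composite score.
    if judge:
        assert judge.get("trust_policy", {}).get("scoring_impact") == "not_in_composite_score"
    # Validator cache evidence stays tenant/project-private.
    assert breakdown.get("validator_cache", {}).get("scope") == "tenant_project_private"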
JSON examples
Request and evidence shapes for integration work.
Create benchmark run
POST /api/v1/benchmark-runs
{
"model_ids": ["anthropic/claude-haiku-4-5-20251001"],
"max_output_tokens": 256,
"enable_llm_judge": false,
"prompts": [
{
"id": "support-triage-json",
"text": "Classify this customer message into billing, technical, or account.",
"output_contract": "Return valid compact JSON with category and reason only.",
"expected_keywords": ["billing", "reason"],
"json_schema": {
"type": "object",
"required": ["category", "reason"],
"properties": {
"category": { "type": "string" },
"reason": { "type": "string" }
}
},
"max_cost": "0.005"
}
]
}
Read score breakdown
score_breakdown fields
{
"scoring_policy_version": "tokenroute.scoring.deterministic.v2",
"evaluator_pack": {
"id": "tokenroute.eval.structured_extraction.v1",
"required_validators": ["non_empty_output", "json_format", "json_schema_contract"]
},
"prompt_classification": {
"workload_type": "structured_extraction",
"domain": "customer_support",
"risk_level": "high"
},
"raw_score": "100.00",
"score_adjustments": [
{
"policy": "required_contract_cap_v1",
"cap": "69.00",
"failed_validators": ["json_schema_contract"]
}
],
"evaluator_quality_profile": {
"version": "tokenroute.evaluator_quality_profile.v1",
"status": "strong",
"score_impact": "evidence_only",
"contract_strength": "schema_backed",
"warnings": []
},
"failure_policy": {
"version": "tokenroute-provider-recovery-v1",
"retryable": true,
"action": "failed_circuit_opened"
},
"validator_cache": { "status": "hit", "scope": "tenant_project_private" },
"semantic_cache": {
"status": "hit",
"mode": "advisory_only",
"execution": "provider_call_not_skipped",
"nearest_evidence": { "prompt_id": "prior-support-case", "similarity": "0.86" }
},
"llm_judge": {
"status": "completed",
"trust_policy": {
"mode": "single_judge_provenance",
"scoring_impact": "not_in_composite_score",
"consensus_status": "not_enabled"
}
}
}
Route decision request
POST /api/v1/route-decisions
{
"workload": "classification",
"constraints": {
"min_quality_score": 80,
"max_latency_ms": 3000,
"excluded_models": []
}
}
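A sketch of sending that request with Python and treating the result as advisory evidence; the base URL and Bearer header are assumptions, and the response is printed rather than parsed because this page does not document its fields.
import os
import requests  # assumed HTTP client

BASE_URL = "https://api.tokenroute.example"  # placeholder host for your deployment
headers = {"Authorization": f"Bearer {os.environ['TOKENROUTE_API_KEY']}"}

request_body = {
    "workload": "classification",
    "constraints": {"min_quality_score": 80, "max_latency_ms": 3000, "excluded_models": []},
}
resp = requests.post(f"{BASE_URL}/api/v1/route-decisions", json=request_body, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())  # advisory evidence: review it, do not wire it into the hot path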