TokenRoute API help

Use benchmark evidence as an API contract

TokenRoute API output is designed to explain why a model is recommended for a workload: quality checks, contract failures, cache provenance, cost, latency, and confidence. Use the guide menu to jump between compact sections instead of scanning one long reference page.


Start here

Four steps before using route decisions.

Create a project API key

Open Settings, create a project key, and store it in your shell as TOKENROUTE_API_KEY. The full key is shown once.
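
A minimal setup sketch in Python. The base URL and the bearer-token Authorization scheme are assumptions, not confirmed by this reference; only the TOKENROUTE_API_KEY environment variable comes from the step above.

import os

import requests

# Assumption: illustrative base URL and bearer-token auth scheme.
BASE_URL = "https://api.tokenroute.example/api/v1"
API_KEY = os.environ["TOKENROUTE_API_KEY"]  # the full key is shown only once at creation

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_KEY}"})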

Run or reuse a benchmark

Use the dashboard or POST /api/v1/benchmark-runs with prompts, model_ids, output contracts, expected keywords, and optional JSON schema.

Read the evidence pack

Inspect score_breakdown for evaluator pack, prompt classification, deterministic checks, cache state, cost, latency, and score adjustments.

Export a decision artifact

Use JSON, CSV, routing-policy, LiteLLM export, or route-decision endpoints to move evidence into agents, apps, CI, or gateways.
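
A sketch of steps 2 through 4 in one pass, reusing the session and BASE_URL from the setup sketch above. The polling loop, the "running" status value, and the "id" field in the create response are assumptions; this page does not document the run lifecycle schema.

import time

# Step 2: create an async benchmark run (the full request shape is shown
# in the JSON examples section below).
run = session.post(f"{BASE_URL}/benchmark-runs", json={
    "model_ids": ["anthropic/claude-haiku-4-5-20251001"],
    "max_output_tokens": 256,
    "prompts": [{
        "id": "support-triage-json",
        "text": "Classify this customer message into billing, technical, or account.",
        "output_contract": "Return valid compact JSON with category and reason only.",
        "expected_keywords": ["billing", "reason"],
    }],
}).json()
run_id = run["id"]  # assumption: the create response includes a run id

# Step 3: poll until the run settles, then read the evidence pack.
while True:
    result = session.get(f"{BASE_URL}/benchmark-runs/{run_id}").json()
    if result.get("status") != "running":  # assumption: lifecycle state names
        break
    time.sleep(5)

# Step 4: export the decision artifact for CI, agents, or gateways.
export = session.get(
    f"{BASE_URL}/benchmark-runs/{run_id}/export", params={"format": "json"}
)
export.raise_for_status()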


Core endpoints and workflow

The API surface most users need first.

GET /api/v1/model-registry
List available BYOK/local models with pricing, health, sample counts, and route-decision readiness.

POST /api/v1/benchmark-runs/estimate
Estimate run cost, per-run cap, and project-month cap before queueing model calls.

POST /api/v1/benchmark-runs
Create an async benchmark run across selected models and prompt contracts.

GET /api/v1/benchmark-runs/{run_id}
Fetch run status, model results, recommendation, score breakdown, and cache provenance.

GET /api/v1/benchmark-runs/{run_id}/export?format=json
Export the full benchmark pack for review, storage, or automation.

GET /api/v1/benchmark-runs/{run_id}/routing-policy
Export an advisory routing-policy artifact from benchmark evidence.

POST /api/v1/route-decisions
Ask for the current advisory model decision for a workload and constraints.

GET /api/v1/intelligence/benchmarks
Fetch tenant-private aggregate intelligence without raw prompts or raw outputs.
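
The estimate endpoint lets you gate a run on cost before queueing provider calls. A hedged sketch, reusing the session and BASE_URL from the setup sketch above: the response keys (estimated_cost, per_run_cap) are assumptions drawn from the endpoint description, not a documented schema.

run_request = {
    "model_ids": ["anthropic/claude-haiku-4-5-20251001"],
    "prompts": [{"id": "smoke-1", "text": "Classify: my invoice is wrong."}],
}

# Assumption: the estimate endpoint accepts the same body as the create
# endpoint and returns cost figures as decimal strings, matching the
# string-number style used elsewhere in this reference.
estimate = session.post(f"{BASE_URL}/benchmark-runs/estimate", json=run_request).json()
cost = float(estimate.get("estimated_cost", "0"))
cap = float(estimate.get("per_run_cap", "inf"))
if cost > cap:
    raise RuntimeError("Run would exceed the per-run cost cap; trim models or prompts.")
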
1. Use one workload per benchmark pack. Do not mix support triage, summarization, code generation, and extraction in one dataset unless you are testing general behavior.
2. Define an output contract that matches your downstream consumer. If the consumer needs JSON, add a JSON schema.
3. Start with one cheap model and one prompt, then widen to 3 to 5 models and a realistic dataset.
4. Treat a single sample as thin evidence. Prefer at least 3 to 5 prompts per workload before acting on a recommendation.
5. Use route-decision and policy exports as advisory artifacts. Do not treat them as a production hot-path router yet.
6. Watch cache fields. Exact cache hits save provider calls; semantic cache hits are advisory nearest-prompt evidence only.
7. Treat llm_judge as provenance unless a future consensus policy is approved. The deterministic score remains the route-ranking input.
8. Read failure_policy on failed results before blaming a model. It separates retryable provider outages from permanent auth, config, or invalid-request failures (see the sketch after this list).
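
Item 8 can be automated. A minimal triage sketch, assuming failure_policy carries the retryable flag shown in the score-breakdown example later on this page.

def triage_failure(result: dict) -> str:
    """Separate retryable provider outages from terminal auth/config failures.

    Field names follow the failure_policy example in the JSON section below.
    """
    policy = result.get("failure_policy") or {}
    if not policy:
        return "no_failure"
    if policy.get("retryable"):
        return "retry_later"      # provider outage or circuit-open evidence
    return "fix_before_rerun"     # auth, config, or invalid-request failure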

Decision quality

Use this before trusting a benchmark recommendation.

Sample depth
Stronger evidence: 3 to 5 prompts for one workload; more for launch decisions.
Needs work: one-prompt smoke runs are thin evidence.

Contract anchors
Stronger evidence: output contract plus expected keywords; JSON schema for structured output.
Needs work: free-form prompt only, no schema, no objective anchors.

Evaluator fit
Stronger evidence: `evaluator_quality_profile.status` is `strong` or `usable`.
Needs work: `thin`, or warnings about missing contract, schema, or required validators.

Failures
Stronger evidence: no repeated provider failures; failure_policy is empty or clearly transient.
Needs work: auth/config errors, circuit-open evidence, empty outputs, or repeated timeouts.

Cost and latency
Stronger evidence: estimate is within per-run and project-month caps; latency fits the workload.
Needs work: unknown pricing, project budget breach, or latency outside the SLA.

strong
Schema-backed or otherwise well-anchored evaluator evidence. Use for comparison, subject to sample count and failure rate.

usable
Enough contract or keyword anchors to compare models, but not ideal. Use for exploration; add a schema, a stricter contract, or more prompts before relying on it.

thin
The benchmark lacks objective anchors or has weak evaluator coverage. Do not treat the recommendation as route-ready; tighten the prompt contract first.
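
A sketch that turns this checklist into a gate before acting on a recommendation. Field names come from this page; the prompt-count threshold mirrors the sample-depth guidance above, and treating any failure_policy as disqualifying is a deliberately conservative assumption.

ROUTE_READY_STATUSES = {"strong", "usable"}

def evidence_is_actionable(breakdown: dict, prompt_count: int) -> bool:
    """Apply the decision-quality checklist to one score_breakdown."""
    profile = breakdown.get("evaluator_quality_profile") or {}
    if profile.get("status") not in ROUTE_READY_STATUSES:
        return False  # "thin" evidence: tighten the prompt contract first
    if prompt_count < 3:
        return False  # one-prompt smoke runs are thin evidence
    if breakdown.get("failure_policy"):
        return False  # unresolved provider, auth, or config failures
    return True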


Scoring criteria

What each validator means in the output.

contains_expected_keywords (deterministic, weight 0.45)
Checks whether the output includes your expected scoring hints.
Use for classification labels, must-mention concepts, and smoke-testable contract hints.

non_empty_output (deterministic, weight 0.25)
Prevents empty or whitespace-only model responses from passing.
Always keep this on; for benchmark purposes, empty outputs are provider/model failures.

length_window (deterministic, weight 0.20)
Warns when outputs are too short or too long for the launch scoring window.
Use max_output_tokens and output contracts to reduce accidental truncation.

latency_sla (performance, weight 0.10)
Scores latency against the current benchmark SLA thresholds.
Use to compare candidates after contract quality is acceptable.

json_schema_contract (contract, score cap)
Validates raw JSON against the prompt schema. In scoring-policy v2, failed required contracts cap the score below route-ready.
Use for extraction, API output, classification JSON, and strict downstream automation.

cost_threshold (cost, evidence-only)
Shows whether estimated provider cost is below the prompt threshold.
Use as a guardrail when comparing expensive models or broad datasets.

evaluator_quality_profile (evaluator_evidence, evidence-only)
Reports contract depth, schema/keyword anchors, required-validator coverage, and judge-boundary evidence without changing the score.
Use to decide whether a benchmark is ready for decision-making or needs stronger contracts, schemas, or sample depth.

output_contract_diagnostics (contract_evidence, evidence-only)
Reports obvious output-contract signals, such as JSON-only, no markdown fence, one-category, and two-sentence shape, without changing the score.
Use to debug why an output may look only partially compliant before adding stricter JSON schema or exact-match contracts.
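
The four deterministic weights above sum to 1.00, and scoring-policy v2 caps the score when a required contract fails. A sketch of how those pieces could combine; the exact composition rule is an assumption, and the 69.00 cap mirrors the required_contract_cap_v1 example in the JSON section below.

# Deterministic validator weights from this section.
WEIGHTS = {
    "contains_expected_keywords": 0.45,
    "non_empty_output": 0.25,
    "length_window": 0.20,
    "latency_sla": 0.10,
}

def composite_score(check_results: dict[str, float], required_contract_failed: bool) -> float:
    """Weighted deterministic score on a 0-100 scale, then a required-contract cap.

    check_results maps validator name to a 0.0-1.0 pass fraction; the
    combination rule here is illustrative, not the documented formula.
    """
    raw = 100.0 * sum(w * check_results.get(name, 0.0) for name, w in WEIGHTS.items())
    return min(raw, 69.0) if required_contract_failed else raw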


Evidence fields and trust boundaries

Fields to read before trusting a recommendation.

prompt_classification
Workload, domain, output shape, risk level, and classifier reasons.
evaluator_pack
The versioned validator pack selected for this prompt type.
evaluator_quality_profile
Evidence-only profile showing whether the selected evaluator has strong, usable, or thin prompt-contract depth.
scoring_policy_version
The scoring rules used. Policy changes invalidate cache identity.
deterministic_checks
Objective validators such as keyword, JSON schema, output-contract diagnostics, length, latency, and cost checks.
score_adjustments
Caps or penalties applied after raw scoring, such as required contract failure caps.
generation_cache
Whether the model output was reused from tenant/project-private exact generation cache.
validator_cache
Whether deterministic validation was reused for the same prompt/output/policy identity.
semantic_cache
Advisory tenant/project-private nearest-prompt evidence. It does not skip model calls.
llm_judge
Optional qualitative judge provenance when judge scoring is enabled, including trust_policy boundaries.
failure_policy
Retryability, terminal action, attempt count, and provider circuit-open evidence for failed results.
recommendation.confidence
Thin-sample, route-ready, regression-risk, or candidate confidence evidence.
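
A sketch that collects these trust-relevant fields in one pass before a recommendation is accepted, assuming the score_breakdown shape shown in the JSON examples below.

TRUST_FIELDS = (
    "prompt_classification", "evaluator_pack", "evaluator_quality_profile",
    "scoring_policy_version", "deterministic_checks", "score_adjustments",
    "generation_cache", "validator_cache", "semantic_cache",
    "llm_judge", "failure_policy",
)

def provenance_summary(breakdown: dict) -> dict:
    """Pull out every field worth reading before trusting a recommendation.

    Missing fields simply come back as None rather than raising.
    """
    return {field: breakdown.get(field) for field in TRUST_FIELDS}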

Benchmark intelligence is tenant/project-private by default. Raw prompts and raw outputs are not included in aggregate intelligence.

Generation and validator cache keys include evaluator and scoring policy identity. A scoring-policy change should produce new cache entries.
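
A sketch of that cache-identity property: the key covers prompt, output, scoring-policy version, and evaluator pack, so any policy change produces a new key. The real derivation is not documented here; this illustrates the stated behavior, not the service's actual scheme.

import hashlib
import json

def validator_cache_key(prompt: str, output: str,
                        policy_version: str, evaluator_pack_id: str) -> str:
    """Illustrative identity: any policy or evaluator change yields a new key."""
    material = json.dumps(
        [prompt, output, policy_version, evaluator_pack_id],
        separators=(",", ":"),
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()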

Evaluator quality profile is evidence-only. It explains whether the prompt has enough contract, schema, keyword, and judge-boundary depth before you act on a recommendation.

Semantic cache is tenant/project-private and advisory. It exposes nearest prior prompt evidence and confidence, but still runs the selected model.
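
A sketch of how to surface semantic-cache evidence without misusing it: log the nearest-prompt match, never skip the provider call. Field names follow the semantic_cache example in the JSON section below.

def log_semantic_evidence(breakdown: dict) -> None:
    """Report advisory nearest-prompt evidence; it never replaces a model call."""
    cache = breakdown.get("semantic_cache") or {}
    if cache.get("status") == "hit" and cache.get("mode") == "advisory_only":
        nearest = cache.get("nearest_evidence") or {}
        print(
            f"advisory: nearest prior prompt {nearest.get('prompt_id')} "
            f"(similarity {nearest.get('similarity')}); provider call still ran"
        )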

LLM judge output is stored as provenance. Single-judge output does not alter deterministic ranking, and multi-judge consensus remains a separate approval boundary.

Provider recovery evidence is run-scoped. It reduces repeated calls during a failing benchmark, but does not permanently suppress a provider across future runs.

Route-decision output is advisory. Production hot-path routing remains intentionally out of scope until evidence density and governance improve.


JSON examples

Request and evidence shapes for integration work.

Create benchmark run (POST /api/v1/benchmark-runs)

{
  "model_ids": ["anthropic/claude-haiku-4-5-20251001"],
  "max_output_tokens": 256,
  "enable_llm_judge": false,
  "prompts": [
    {
      "id": "support-triage-json",
      "text": "Classify this customer message into billing, technical, or account.",
      "output_contract": "Return valid compact JSON with category and reason only.",
      "expected_keywords": ["billing", "reason"],
      "json_schema": {
        "type": "object",
        "required": ["category", "reason"],
        "properties": {
          "category": { "type": "string" },
          "reason": { "type": "string" }
        }
      },
      "max_cost": "0.005"
    }
  ]
}
Read score breakdown (score_breakdown fields)

{
  "scoring_policy_version": "tokenroute.scoring.deterministic.v2",
  "evaluator_pack": {
    "id": "tokenroute.eval.structured_extraction.v1",
    "required_validators": ["non_empty_output", "json_format", "json_schema_contract"]
  },
  "prompt_classification": {
    "workload_type": "structured_extraction",
    "domain": "customer_support",
    "risk_level": "high"
  },
  "raw_score": "100.00",
  "score_adjustments": [
    {
      "policy": "required_contract_cap_v1",
      "cap": "69.00",
      "failed_validators": ["json_schema_contract"]
    }
  ],
  "evaluator_quality_profile": {
    "version": "tokenroute.evaluator_quality_profile.v1",
    "status": "strong",
    "score_impact": "evidence_only",
    "contract_strength": "schema_backed",
    "warnings": []
  },
  "failure_policy": {
    "version": "tokenroute-provider-recovery-v1",
    "retryable": true,
    "action": "failed_circuit_opened"
  },
  "validator_cache": { "status": "hit", "scope": "tenant_project_private" },
  "semantic_cache": {
    "status": "hit",
    "mode": "advisory_only",
    "execution": "provider_call_not_skipped",
    "nearest_evidence": { "prompt_id": "prior-support-case", "similarity": "0.86" }
  },
  "llm_judge": {
    "status": "completed",
    "trust_policy": {
      "mode": "single_judge_provenance",
      "scoring_impact": "not_in_composite_score",
      "consensus_status": "not_enabled"
    }
  }
}
Route decision request (POST /api/v1/route-decisions)

{
  "workload": "classification",
  "constraints": {
    "min_quality_score": 80,
    "max_latency_ms": 3000,
    "excluded_models": []
  }
}
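
A sketch that sends the request above and keeps the result advisory, reusing the session and BASE_URL from the setup sketch. The recommended_model response key is an assumption; this page does not document the response schema.

decision = session.post(f"{BASE_URL}/route-decisions", json={
    "workload": "classification",
    "constraints": {
        "min_quality_score": 80,
        "max_latency_ms": 3000,
        "excluded_models": [],
    },
}).json()

# Treat the output as advisory input to deployment review, not as a
# production hot-path router.
print("advisory recommendation:", decision.get("recommended_model"))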