Data-driven model selection

Ship the best model, not just any model

A/B test any variable — model, temperature, system prompt, or guardrails — with live production traffic. Get statistically grounded results, then promote the winner in one click.

Traffic splitting

Assign a traffic weight to each variant. Route 10% to GPT-4o and 90% to Claude, or use any other split.
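To make the idea of a weighted split concrete, here is a minimal sketch of one common way to do it: hash the session ID to a stable point and walk the cumulative weights. This is an illustration only, not Niitaka's actual routing logic; the function name `assign_variant` and the variant labels are assumptions made for the example.

```python
import hashlib


def assign_variant(session_id: str, weights: dict[str, float]) -> str:
    """Map a session to a variant in proportion to its weight (hypothetical helper)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Hash the session ID into a stable point in [0, 1), so the same
    # session always lands on the same variant.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge


# Example: a 10/90 split between two illustrative variant labels.
weights = {"gpt-4o": 10, "claude": 90}
print(assign_variant("session-123", weights))
```

Hashing rather than drawing a random number keeps the assignment repeatable for a given session, which makes results easier to audit.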

Live metrics comparison

Cost, latency (p50/p95), steps, error rate, and retry rate — updated in real time as sessions complete.
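For readers who want to see what these numbers mean, the sketch below computes the same kinds of metrics from a list of completed sessions. The `Session` fields, the nearest-rank percentile method, and the definition of retry rate (share of sessions with at least one retry) are assumptions for illustration; the dashboard may define them differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical per-session record; field names are illustrative.
    cost_usd: float
    latency_ms: float
    steps: int
    errored: bool
    retries: int


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(sessions):
    """Aggregate one variant's completed sessions into summary metrics."""
    n = len(sessions)
    if n == 0:
        raise ValueError("need at least one completed session")
    latencies = [s.latency_ms for s in sessions]
    return {
        "sessions": n,
        "avg_cost_usd": sum(s.cost_usd for s in sessions) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "avg_steps": sum(s.steps for s in sessions) / n,
        "error_rate": sum(1 for s in sessions if s.errored) / n,
        # Assumption: retry rate = share of sessions with at least one retry.
        "retry_rate": sum(1 for s in sessions if s.retries > 0) / n,
    }
```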

Baseline comparison

Set a baseline variant and see percentage deltas for every metric across all other variants.
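A percentage delta is simply the relative change from the baseline. The sketch below shows the arithmetic; `percent_deltas` and the metric names are hypothetical, and the choice to report an undefined delta as `None` when the baseline is zero is an assumption.

```python
from typing import Optional


def percent_deltas(
    baseline: dict[str, float], variant: dict[str, float]
) -> dict[str, Optional[float]]:
    """Relative change of each shared metric versus the baseline, in percent."""
    deltas: dict[str, Optional[float]] = {}
    for metric, base_value in baseline.items():
        if metric not in variant:
            continue  # only compare metrics both variants report
        if base_value == 0:
            deltas[metric] = None  # relative change is undefined against zero
        else:
            deltas[metric] = (variant[metric] - base_value) / base_value * 100
    return deltas


baseline = {"avg_cost_usd": 0.04, "latency_p95_ms": 2000.0}
candidate = {"avg_cost_usd": 0.03, "latency_p95_ms": 2500.0}
print(percent_deltas(baseline, candidate))
```

In this example the candidate is about 25% cheaper but 25% slower at p95, which is exactly the kind of trade-off a baseline view is meant to surface.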

Automatic verdict

Niitaka computes the verdict for you (ship · revert · mixed · insufficient_data) and reports a confidence level alongside it.
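Niitaka's exact decision rule is not documented here, so the following is only a sketch of how a four-way verdict of this kind could be derived. It assumes every metric is "lower is better", uses a normal approximation to a two-sample test, and treats "no significant change on any metric" as insufficient_data. The function names, the 50-session minimum, and the 0.95 threshold are all illustrative assumptions.

```python
import math
from statistics import mean, variance


def confidence_of_difference(baseline, variant):
    """Two-sided confidence (0..1) that the means differ, via a normal approximation."""
    se = math.sqrt(variance(baseline) / len(baseline) + variance(variant) / len(variant))
    if se == 0:
        return 0.0 if mean(baseline) == mean(variant) else 1.0
    z = abs(mean(variant) - mean(baseline)) / se
    return math.erf(z / math.sqrt(2))  # P(|Z| < z) for a standard normal


def verdict(baseline, variant, min_sessions=50, threshold=0.95):
    """baseline/variant map metric name -> per-session values; lower is better."""
    improved = regressed = 0
    for metric, base_values in baseline.items():
        var_values = variant[metric]
        if min(len(base_values), len(var_values)) < min_sessions:
            return "insufficient_data"
        if confidence_of_difference(base_values, var_values) < threshold:
            continue  # no clear change on this metric
        if mean(var_values) < mean(base_values):
            improved += 1
        else:
            regressed += 1
    if improved and regressed:
        return "mixed"
    if improved:
        return "ship"
    if regressed:
        return "revert"
    return "insufficient_data"  # assumption: nothing significant means keep collecting
```

Under these assumptions, a variant that is clearly cheaper but clearly slower comes out as mixed, which leaves the trade-off to you rather than hiding it.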

Promote to version

The winning variant becomes a new AgentVersion in one click, with its model, prompt, and guardrails all captured.

Per-variant guardrails

Test different cost limits and retry strategies per variant without touching the Policies table.

At a glance

  • Weighted traffic split across unlimited variants
  • Test model, temperature, system prompt, or guardrails
  • Live cost / latency / error metrics per variant
  • Statistical verdict: ship · revert · mixed · insufficient_data
  • One-click promote winner to production version

Common questions

Do I need to deploy new code to run an A/B test?

No. You create the experiment and configure its variants entirely in the dashboard. The SDK fetches the active variant assignment at session start via runtime config — your agent code does not change. To switch from variant A to variant B, you adjust the traffic weights in the dashboard.

How many sessions do I need for a statistically valid result?

It depends on the effect size you expect. Niitaka shows a confidence level alongside each verdict and flags a result as insufficient_data when the sample is too small. As a rough guide, 50+ sessions per variant usually gives a meaningful signal for cost and latency comparisons when the difference between variants is large. For smaller effect sizes you may need several hundred sessions per variant.
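If you want a rough estimate before starting an experiment, the textbook sample-size formula for comparing two means can be sketched as below. It assumes a two-sided test at 5% significance with 80% power and expresses the effect as a standardized size (difference in means divided by the standard deviation). This is a generic statistical approximation, not the calculation Niitaka runs; `sessions_per_variant` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def sessions_per_variant(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sessions needed per variant: n = 2 * (z_alpha/2 + z_power)^2 / d^2."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


print(sessions_per_variant(0.5))  # medium effect → 63
print(sessions_per_variant(0.2))  # small effect → 393
```

Those two figures line up with the rule of thumb above: a medium-sized difference needs on the order of 50 to 60 sessions per variant, while a small one needs several hundred.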


Ready to get started?

Connect your first agent in under 5 minutes. Free to start, no credit card required.
