Experimentation
What is an Experiment?
An experiment is a randomised, controlled comparison of two or more agent configurations, running simultaneously in production, with statistical analysis to determine which performs better on the metrics that matter to you.
Instead of guessing whether switching from GPT-4o to Gemini Flash will hurt your goal completion rate, you run both in parallel, collect real production data, and let the stats engine tell you, with confidence intervals and a p-value, whether the difference is real or just noise.
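To make "confidence intervals and a p-value" concrete, here is a minimal sketch of one common way to compare goal completion rates between two variants: a two-proportion z-test. This page does not specify which test the Niitaka stats engine uses, so treat the function name, its parameters, and the choice of test as illustrative assumptions.

```python
import math


def two_proportion_test(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Compare completion rates of variant A (baseline) and variant B.

    Hypothetical helper for illustration; not part of niitaka-sdk.
    z_crit=1.96 corresponds to a 95% confidence interval.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the hypothesis test (H0: p_a == p_b).
    p_pool = (success_a + success_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool if se_pool > 0 else 0.0

    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error: used for the confidence interval on the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


# Example with made-up numbers: 140/200 completions for A, 120/200 for B.
result = two_proportion_test(140, 200, 120, 200)
print(result)
```

If the confidence interval for the difference excludes zero and the p-value falls below your chosen threshold, the gap between variants is unlikely to be noise alone.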
Anatomy of an experiment
Every experiment has four main ingredients:
Experiment lifecycle
An experiment moves through three states. Transitions are one-way.
Draft
- Add / remove variants
- Set baseline & weights
- No sessions yet
Running
- Sessions assigned to variants
- Metrics collected in real time
- Variant structure locked
Completed
- Results final
- Promote winning variant
- Experiment archived
Draft is the only state in which you can add or remove variants, change traffic weights, or swap the baseline. Once a session is recorded, the experiment locks into Running and its variant structure can no longer change. You complete the experiment manually when you have enough data.
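The lifecycle rules above can be sketched as a small state machine. This is an illustrative model only: the `Experiment` class and its method names are hypothetical, not the niitaka-sdk API, and two behaviours are assumptions rather than documented facts (completion is allowed only from Running, and a Completed experiment rejects new sessions).

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    COMPLETED = "completed"


class Experiment:
    """Hypothetical model of the one-way Draft -> Running -> Completed flow."""

    def __init__(self):
        self.state = State.DRAFT
        self.weights = {}      # variant name -> traffic weight
        self.baseline = None

    def _require_draft(self):
        if self.state is not State.DRAFT:
            raise RuntimeError("variant structure is locked outside Draft")

    def add_variant(self, name, weight):
        self._require_draft()
        self.weights[name] = weight

    def remove_variant(self, name):
        self._require_draft()
        self.weights.pop(name, None)
        if self.baseline == name:
            self.baseline = None

    def set_baseline(self, name):
        self._require_draft()
        if name not in self.weights:
            raise KeyError(name)
        self.baseline = name

    def record_session(self, variant):
        # Assumption: a Completed experiment accepts no further sessions.
        if self.state is State.COMPLETED:
            raise RuntimeError("completed experiments accept no sessions")
        if variant not in self.weights:
            raise KeyError(variant)
        # The first recorded session locks the experiment into Running.
        self.state = State.RUNNING

    def complete(self):
        # Assumption: only a Running experiment can be completed.
        if self.state is not State.RUNNING:
            raise RuntimeError("only a Running experiment can be completed")
        self.state = State.COMPLETED
```

There is no method that moves the state backwards, which is what makes the transitions one-way.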
How experiments fit into Niitaka
Experiments sit on top of the session and event infrastructure. Your instrumented agent reports to Niitaka as normal; the experiment layer attaches variant metadata to each session and the stats engine aggregates them automatically.
- Your Agent: instrumented with niitaka-sdk
- Sessions + Events: raw telemetry stored per variant
- Stats Engine: aggregates metrics, runs hypothesis tests
- Verdict: Ship / Revert / Mixed / Insufficient data
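As a rough illustration of the flow described above, the sketch below tags each session with a variant according to traffic weights and then groups sessions by that tag to compute a per-variant metric. Everything here is hypothetical: the function names, the session dictionary shape, and the `goal_completed` field are assumptions for the example, not the niitaka-sdk API or Niitaka's storage format.

```python
import random
from collections import defaultdict


def assign_variant(weights, rng):
    """Pick a variant name at random, proportionally to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def aggregate(sessions):
    """Group sessions by their variant tag and compute a completion rate."""
    totals = defaultdict(lambda: {"sessions": 0, "completed": 0})
    for session in sessions:
        bucket = totals[session["variant"]]
        bucket["sessions"] += 1
        bucket["completed"] += int(session["goal_completed"])
    return {
        variant: {**t, "completion_rate": t["completed"] / t["sessions"]}
        for variant, t in totals.items()
    }


# Simulated traffic: the outcome values are random placeholders, not real data.
rng = random.Random(42)
weights = {"baseline": 0.5, "challenger": 0.5}
sessions = [
    {"variant": assign_variant(weights, rng), "goal_completed": rng.random() < 0.6}
    for _ in range(400)
]
print(aggregate(sessions))
```

The per-variant counts produced by a step like this are the inputs a hypothesis test needs before a verdict can be issued.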
Why run an experiment?
- Model evaluation — Compare GPT-4o vs Gemini 2.5 Flash on your actual workload, not on benchmarks that may not reflect your use case.
- Prompt engineering — Measure whether a new system prompt actually improves goal completion, rather than relying on vibes from manual testing.
- Guardrail tuning — Find the cost-limit threshold that reduces spend without meaningfully reducing quality.
- Multi-provider failover — Quantify the quality difference of your fallback provider before traffic hits it in an emergency.
When not to use an experiment
- Bug fixes — If you're fixing a broken prompt, just fix it. You don't need statistical evidence that the bug was bad.
- Infrastructure changes — Switching from one embedding provider to another for cost reasons, where both produce equivalent results, doesn't need an experiment.
- Very low traffic — Experiments need volume to reach statistical power. If your agent runs fewer than ~20 sessions per day, reaching 200 sessions per variant takes weeks. Consider synthetic load testing instead.
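The low-traffic arithmetic can be checked with a short calculation. This sketch assumes traffic is split evenly across variants and uses the 200-sessions-per-variant figure mentioned above as the default target; the function name is hypothetical.

```python
import math


def days_to_target(sessions_per_day, n_variants, target_per_variant=200):
    """Days needed for every variant to reach the target, assuming an even split."""
    total_needed = target_per_variant * n_variants
    return math.ceil(total_needed / sessions_per_day)


print(days_to_target(20, 2))  # 20 days, roughly three weeks
print(days_to_target(20, 3))  # 30 days
```

Uneven traffic weights lengthen the wait, because the variant with the smallest share is the last to reach the target.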
Next steps
- Setting up variants — create variants, set traffic weights, and designate a baseline.
- Running experiments — attach sessions to your experiment and report outcome metrics.
- Interpreting results — read the stats panel, understand the verdict, and act on the data.