Experimentation

Interpreting results

The experiment detail page has two tabs — Overview and Stats. Overview gives you a quick verdict and per-variant summary. Stats gives you the full statistical analysis. This page explains what every number means and how to act on it.

Overview tab

The Overview tab shows the stats engine's verdict at the top, followed by the Outcome metrics panel and the Variant comparison table.

Verdict banner: Ship it / Revert / Mixed results / Insufficient data, derived from the stats engine. Identical to the Stats tab verdict.

Outcome metrics: Per-variant cards showing coverage (sessions that reported metrics), goal-completion rate, and average score. Scroll right to compare all variants.

Variant comparison: A tabular view of every metric across all variants, with delta % and n for each cell.

Stats tab

Statistical Power card

Shows how far you are from the estimated target session count. The bar turns green at 100%.

Statistical Power

Low power

n = 168 sessions, target ≥ 400

Target estimated for 5% MDE at 80% power (α=5%), derived from baseline goal-completion rate.

current_min_n — the smallest session count across all variants. The experiment is only as powerful as its smallest arm.
target_n — estimated sessions per variant for 80% power to detect a 5% relative change in goal-completion rate at α=0.05.
Low power — fewer than 30 sessions in the smallest arm. Results are unreliable.
Insufficient — fewer than 10 sessions. No stats are computed.
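As a rough illustration of how a target like target_n can be estimated, the sketch below uses the textbook normal-approximation formula for a two-sided two-proportion test. This is an assumption, not the product's documented formula, so its output will not necessarily match the target shown on the card. The function names `sessions_per_variant` and `power_progress` are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_variant(baseline_rate, relative_mde=0.05, alpha=0.05, power=0.80):
    """Approximate sessions per arm for a two-sided two-proportion z-test.

    Textbook normal approximation (an assumption, not the product's formula).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate after the relative change
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and shifted rates must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


def power_progress(arm_counts, target_n):
    """Return (current_min_n, fraction of target reached, capped at 1.0)."""
    current_min_n = min(arm_counts)  # the smallest arm limits the experiment
    return current_min_n, min(1.0, current_min_n / target_n)


# Hypothetical arm counts; 168 and 400 are the values from the example card.
print(power_progress([168, 175], target_n=400))
```

Because the required n grows with the inverse square of the detectable difference, halving the MDE roughly quadruples the target.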

Statistical verdict

The verdict is computed from primary metrics only (goal completion and score). Cost, error rate, retry rate, and custom metrics appear in the forest plot for context but do not change the verdict: a better model that costs more can still be worth shipping.

Ship it: At least one primary metric significantly improved, none significantly worse.

Revert: At least one primary metric significantly worsened, none improved.

Mixed results: Some primary metrics improved significantly, others worsened significantly.

Insufficient data: Fewer than 10 sessions, no primary metrics reported, or no significant result in either direction.

Note: The verdict is a statistical recommendation, not an automatic deployment. Weigh cost, latency, and business constraints before making the final call.
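The verdict rules above can be sketched as a small decision function. This is an illustrative reading of the rules, not the stats engine's actual code: the `MetricResult` type and `verdict` function are hypothetical, and the sketch assumes the 10-session threshold applies to the smallest arm.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    significant: bool  # p-value below the (possibly corrected) alpha
    improved: bool     # direction of the effect, already adjusted for the metric


def verdict(min_sessions, primary_results):
    """Map primary-metric results to one of the four verdict labels."""
    # Assumption: "fewer than 10 sessions" refers to the smallest arm.
    if min_sessions < 10 or not primary_results:
        return "Insufficient data"
    better = any(r.significant and r.improved for r in primary_results)
    worse = any(r.significant and not r.improved for r in primary_results)
    if better and worse:
        return "Mixed results"
    if better:
        return "Ship it"
    if worse:
        return "Revert"
    return "Insufficient data"  # nothing significant in either direction


results = [
    MetricResult("goal_completed", significant=True, improved=True),
    MetricResult("score", significant=False, improved=True),
]
print(verdict(168, results))  # Ship it
```

Guardrail metrics such as cost are deliberately left out of the input, mirroring the rule that only primary metrics drive the verdict.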

Forest plot (ComparisonPanel)

Each treatment variant gets a forest plot showing the estimated effect of switching from baseline to that treatment, per metric. In a multi-variant experiment, use the pill selector above the plot to choose which treatment to inspect.

Annotated forest plot

Metric	Baseline → Treatment	Effect (95% CI)	Lift	Sig.
Primary
Goal completed	68.2% → 74.5%	[CI bar]	+9.2%	★
Avg score	0.712 → 0.724	[CI bar]	+1.7%	—
Guardrails
Avg cost	$0.0042 → $0.0061	[CI bar]	+45.2%	★

Legend: bar color marks significant improvement, significant worsening, or not significant. ★ p < α, ~ trending.

Baseline → Treatment: The raw values for the baseline and the selected treatment arm.

CI bar: The horizontal bar is the 95% confidence interval for the relative lift. If it does not cross the center line (0), the result is statistically significant.

Lift %: Point estimate of the relative change from baseline. A positive value means the treatment is better for higher-is-better metrics and worse for lower-is-better metrics such as cost.

Sig.: ★ = significant (p < α). ~ = trending (α ≤ p < 2α). Blank = not significant.
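A minimal sketch of how the Lift and Sig. columns relate to the raw values, assuming lift is the plain relative change from baseline. The helper names `relative_lift` and `sig_marker` are hypothetical; the input numbers are the ones from the annotated example.

```python
def relative_lift(baseline, treatment):
    """Relative change from baseline, as a fraction (0.092 means +9.2%)."""
    return (treatment - baseline) / baseline


def sig_marker(p_value, alpha=0.05):
    """Return the Sig. column marker for a p-value."""
    if p_value < alpha:
        return "★"   # significant
    if p_value < 2 * alpha:
        return "~"   # trending
    return ""        # not significant


for name, base, treat in [
    ("Goal completed", 68.2, 74.5),
    ("Avg score", 0.712, 0.724),
    ("Avg cost", 0.0042, 0.0061),
]:
    print(f"{name}: {relative_lift(base, treat) * 100:+.1f}%")
```

Note that the +45.2% on average cost is a significant worsening even though the number is positive, because cost is a lower-is-better metric.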

Ranking table (multi-variant)

When there are two or more treatment variants, the Stats tab shows a ranking table above the forest plot. Variants are sorted by goal-completion delta descending — the top row is the best-performing treatment. A ▲ mark highlights the current leader.

Note: The ranking table uses Bonferroni-corrected significance (see below), so a metric marked "ns" (not significant) in the ranking table may look significant at the nominal α=0.05 in isolation. This is intentional: the correction controls false positives across all comparisons.

Statistical methodology

Understanding the tests behind the numbers helps you know when to trust a verdict and when to collect more data.

Hypothesis tests

Proportion Z-test: Used for binary metrics: goal_completed and error_rate. Tests whether two observed proportions (e.g. 68% vs 74% goal completion) differ more than chance would predict. Uses pooled SE for the p-value and unpooled SE for the CI.

Welch's t-test: Used for continuous metrics: score, avg_cost, avg_steps, and all custom numeric metrics. Does not assume equal variance between arms (unlike Student's t-test), making it more robust when one arm has higher variance than another.

Chi-squared (SRM): Checks whether the observed session split matches the configured traffic weights. A significant result (p < 0.01) suggests that randomisation is broken, so treat results as unreliable until the cause is found, regardless of metric significance.
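The two metric tests can be sketched with the standard library alone. This is an illustration of the textbook formulas, not the stats engine's code. Two caveats are assumptions of the sketch: the confidence interval here is on the absolute difference in proportions (how the product converts it to a relative lift is not described on this page), and the Welch function returns only the t statistic and the Welch-Satterthwaite degrees of freedom, since a t-distribution p-value needs a library such as SciPy. The function names and input counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def proportion_z_test(x_base, n_base, x_treat, n_treat, confidence=0.95):
    """Two-sided z-test: pooled SE for the p-value, unpooled SE for the CI."""
    p_b, p_t = x_base / n_base, x_treat / n_treat
    diff = p_t - p_b
    pooled = (x_base + x_treat) / (n_base + n_treat)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_treat))
    z = diff / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se_unpooled = sqrt(p_b * (1 - p_b) / n_base + p_t * (1 - p_t) / n_treat)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


def welch_t(base, treat):
    """Welch's t statistic and degrees of freedom (needs 2+ values per arm)."""
    a = variance(base) / len(base)    # squared standard error, baseline arm
    b = variance(treat) / len(treat)  # squared standard error, treatment arm
    t = (mean(treat) - mean(base)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (len(base) - 1) + b ** 2 / (len(treat) - 1))
    return t, df


# Hypothetical counts: 50/100 goal completions at baseline, 60/100 in treatment.
z, p, ci = proportion_z_test(50, 100, 60, 100)
print(f"z={z:.3f}, p={p:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

In this hypothetical case a 10-point absolute difference is not significant at 100 sessions per arm, and the interval crosses 0, which is the situation the Statistical Power card is meant to flag.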

Significance level (α)

The default significance threshold is α = 0.05 (5% false-positive rate). A metric is marked significant when its p-value falls below α. "Trending" means p is between α and 2α — a real effect may exist but more data is needed.

Bonferroni correction (multi-variant)

When you have k treatment variants, you run k simultaneous hypothesis tests. If each test has a 5% false-positive rate and you run 7 tests, the probability of at least one false positive is ~30%. Bonferroni correction controls this by dividing α by k:

Treatments (k)	Corrected α per test	Notes
1	0.0500	Standard two-variant experiment
2	0.0250
3	0.0167
4	0.0125
5	0.0100
7	0.0071	7 treatments — Bonferroni applies

This means that with 7 treatments you need a much stronger signal (p < 0.0071) for any individual comparison to count as significant. Strong effects will still be detected, but borderline effects will appear as not significant.
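The arithmetic behind the ~30% figure and the corrected thresholds in the table can be checked with a short sketch. The family-wise formula below assumes the k tests are independent, which is a simplification; the function names are hypothetical.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(k, alpha=0.05):
    """Per-test threshold after Bonferroni correction."""
    return alpha / k


for k in (1, 2, 3, 4, 5, 7):
    corrected = bonferroni_alpha(k)
    print(
        f"k={k}: uncorrected FWER={familywise_error_rate(k):.3f}, "
        f"corrected alpha={corrected:.4f}, "
        f"FWER after correction={familywise_error_rate(k, corrected):.3f}"
    )
```

With the corrected threshold, the family-wise rate stays at or below the nominal 5% for every k, which is the guarantee the correction is designed to give.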

Tip: If Bonferroni correction is causing all your results to appear non-significant, consider reducing the number of variants or collecting more sessions. The ranking table still shows the direction of each effect even when significance is not reached.

Sample Ratio Mismatch (SRM)

An SRM is flagged when the observed session counts deviate significantly from the configured traffic weights. For example, if you configured a 50/50 split but observe 60/40, the randomisation unit may be broken.

A red banner at the top of the Stats tab warns you when SRM is detected.
SRM does not automatically invalidate results — but you should investigate the cause before shipping.
Common causes: sessions not being created for every agent run, bugs in the variant assignment logic, or the experiment ID being wrong for some sessions.
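A minimal sketch of an SRM check, assuming a chi-squared goodness-of-fit test against the configured weights. The statistic works for any number of arms; the p-value helper covers only the two-arm case (one degree of freedom), where it has a closed form in the standard library. The function names are hypothetical, and the 0.01 threshold is the one quoted in the methodology section.

```python
from math import erfc, sqrt


def srm_chi_squared(observed, weights):
    """Goodness-of-fit statistic for session counts vs. traffic weights."""
    total = sum(observed)
    weight_sum = sum(weights)
    expected = [total * w / weight_sum for w in weights]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def srm_p_value_two_arms(observed, weights):
    """p-value for exactly two arms (chi-squared with 1 degree of freedom)."""
    if len(observed) != 2 or len(weights) != 2:
        raise ValueError("this closed form only covers two arms")
    chi2 = srm_chi_squared(observed, weights)
    return erfc(sqrt(chi2 / 2))  # survival function of chi-squared, df = 1


# A configured 50/50 split that was observed as 60/40 over 1,000 sessions.
p = srm_p_value_two_arms([600, 400], [50, 50])
print(f"p={p:.2e}, SRM flagged: {p < 0.01}")
```

A small imbalance such as 505/495 gives a large p-value and would not be flagged, so the check is sensitive to the total session count as well as the ratio.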

Acting on results

Ship it — at least one primary metric significantly improved, none significantly worsened. Promote the winning variant. See Shipping a winner.
Revert — at least one primary metric significantly worsened, none improved. Stick with the baseline.
Mixed results — some primary metrics improved, others worsened. Requires a judgment call. Look at effect sizes, not just significance.
Insufficient data — not enough sessions to run tests, or no primary metrics reported. Keep running.
