Limitations & constraints
Knowing the constraints upfront saves you from running into them mid-experiment. This page documents every hard limit, soft limit, and design decision that affects how you set up and run experiments.
Quick reference
| Constraint | Limit | Notes |
|---|---|---|
| Maximum variants per experiment | 8 | Enforced at creation time. Driven by Bonferroni correction: with more than 7 treatment-vs-baseline tests, the corrected α becomes too small to be practical. |
| Maximum custom metrics | 10 | Per session. goal_completed and score are reserved and do not count toward the 10. |
| Variant lock | After first session | Cannot add, delete, or rename variants once any session is recorded for the experiment. |
| Metric reporting deadline | Within the session | Metrics must be reported before the session context manager exits. Metrics reported after session close are silently dropped. |
| Minimum sessions for stats | 10 per variant | Fewer than 10 sessions returns insufficient_data — no hypothesis tests are run. |
| SRM detection threshold | p < 0.01 | Chi-squared goodness-of-fit. Flagged in the Stats tab but does not automatically block results. |
| Concurrent experiments per session | 1 | A session can belong to at most one experiment. Passing two experiment IDs is not supported. |
Variant lock after first session
Once the first session is recorded for an experiment, the variant list is permanently locked. You cannot:
- Add a new variant arm
- Delete an existing variant
- Rename a variant
- Change which variant is the baseline
Traffic weights can still be updated while the experiment is running — but large weight changes after data collection has started can cause a Sample Ratio Mismatch (SRM).
The UI enforces this by hiding the Add variant button and disabling the Delete button on all variant cards once the experiment has sessions.
8-variant maximum
You can have at most 8 variants per experiment (1 baseline + up to 7 treatments). The limit exists for two reasons:
- Statistical correctness: with k simultaneous pairwise tests (one per treatment against the baseline), Bonferroni correction sets the per-test α to 0.05/k. At k = 7 treatments, each test requires p < 0.05/7 ≈ 0.007 to count as significant. Beyond this, the corrected α becomes so small that you would need thousands of sessions per variant to detect realistic effect sizes.
- Practical traffic dilution: 8 equal-weight variants each receive only 12.5% of traffic. At 100 sessions per day, that is about 12 sessions per variant per day, so reaching 400 sessions per variant takes 32 days.
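The arithmetic behind both reasons can be checked with a short sketch. The helper names `bonferroni_alpha` and `days_to_target` are illustrative only and are not part of any SDK; the second function assumes equal traffic weights and a constant daily session volume.

```python
import math

ALPHA = 0.05  # family-wise significance level used on this page


def bonferroni_alpha(num_treatments: int, alpha: float = ALPHA) -> float:
    """Per-test threshold when each treatment is tested against the baseline."""
    if num_treatments < 1:
        raise ValueError("need at least one treatment")
    return alpha / num_treatments


def days_to_target(num_variants: int, sessions_per_day: float,
                   target_per_variant: int) -> int:
    """Whole days until every equal-weight variant reaches the target count."""
    per_variant_per_day = sessions_per_day / num_variants
    return math.ceil(target_per_variant / per_variant_per_day)


print(round(bonferroni_alpha(7), 4))  # 0.0071
print(days_to_target(8, 100, 400))    # 32
```

Running the same functions with fewer variants shows how quickly both numbers improve: 3 treatments give a per-test α of about 0.0167, and 4 variants reach 400 sessions each in 16 days at the same traffic.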
10 custom metric limit
Each session can report up to 10 custom numeric metrics in addition to the two reserved keys (goal_completed and score). Metrics beyond 10 are silently ignored.
Custom metrics must be reported as floats. Boolean values are accepted and stored as 1.0 / 0.0. There is no enforced range, but values in 0–1 produce the most readable forest plots.
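Because extra metrics are dropped without an error, it can help to check a metrics dictionary on the client before sending it. The following is a minimal sketch of such a pre-flight check. `split_metrics` is a hypothetical helper, not an SDK function, and it assumes that overflow is decided by insertion order, which this page does not specify.

```python
RESERVED_KEYS = {"goal_completed", "score"}  # reserved keys named on this page
MAX_CUSTOM_METRICS = 10


def split_metrics(metrics: dict) -> tuple[dict, list]:
    """Coerce values to float and separate metrics that fit from the overflow.

    Assumption: the first 10 custom metrics in insertion order are the ones kept.
    """
    kept: dict = {}
    overflow: list = []
    custom_count = 0
    for name, value in metrics.items():
        number = float(value)  # True -> 1.0, False -> 0.0
        if name in RESERVED_KEYS:
            kept[name] = number  # reserved keys do not count toward the limit
        elif custom_count < MAX_CUSTOM_METRICS:
            kept[name] = number
            custom_count += 1
        else:
            overflow.append(name)
    return kept, overflow


metrics = {"goal_completed": True, **{f"m{i}": i / 10 for i in range(12)}}
kept, overflow = split_metrics(metrics)
print(overflow)  # ['m10', 'm11']
```

Logging or raising on a non-empty `overflow` list turns a silent loss into a visible one.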
Peeking invalidates statistical guarantees
Looking at experiment results before reaching the target session count and making a decision based on what you see is called peeking. It inflates the false-positive rate far above α=0.05.
[Figure: False-positive rate vs. number of peeks at α = 0.05. Source: Johari et al. (2015), "Peeking at A/B Tests". Approximate values for illustration.]
Niitaka displays a peeking warning at the bottom of the Stats tab. The correct approach is:
- Decide on your target session count before starting the experiment.
- Only interpret results once the power bar reaches 100%.
- If you must check early (e.g. one variant is clearly harmful), use the result to stop, not to ship.
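The inflation is easy to reproduce with a small A/A simulation, in which both variants share the same true success rate so every "significant" result is a false positive. The sketch below uses a two-proportion z-test with a fixed 1.96 cutoff as a stand-in; it is not necessarily the test Niitaka runs, and the session count, success rate, and trial count are arbitrary choices for illustration.

```python
import math
import random


def z_stat(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-proportion z statistic with a pooled standard error."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (success_a / n_a - success_b / n_b) / se


def false_positive_rate(num_peeks: int, sessions_per_variant: int = 500,
                        rate: float = 0.3, trials: int = 1000,
                        seed: int = 0) -> float:
    """Share of A/A experiments declared significant at any of the peeks."""
    rng = random.Random(seed)
    # Evenly spaced looks; the last one is always the full sample.
    checkpoints = {sessions_per_variant * (i + 1) // num_peeks
                   for i in range(num_peeks)}
    hits = 0
    for _ in range(trials):
        a = b = 0
        for n in range(1, sessions_per_variant + 1):
            a += rng.random() < rate
            b += rng.random() < rate
            if n in checkpoints and abs(z_stat(a, n, b, n)) > 1.96:
                hits += 1  # stopped early on a spurious difference
                break
    return hits / trials


if __name__ == "__main__":
    for peeks in (1, 5, 10):
        print(peeks, false_positive_rate(peeks))
```

With a single look at the end, the rate should land near the nominal 5%; with more looks it climbs well above that, which is the effect the warning in the Stats tab refers to.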
Sample Ratio Mismatch
An SRM is detected when the observed session distribution deviates significantly from the configured traffic weights (chi-squared test, p < 0.01). It means sessions are not being assigned to variants as expected.
Common causes:
- Sessions failing silently before the experiment ID is passed to the SDK — those failures are not counted but would have gone to a specific variant.
- The agent code routing some sessions away from the experiment (e.g. a cached response that skips start_session).
- Large traffic weight changes after data collection started.
- Multiple deployments of the agent code, some with the experiment ID and some without.
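To check for SRM yourself, for example from exported session counts, a chi-squared goodness-of-fit test can be sketched with the standard library alone. `chi2_sf` and `srm_check` are illustrative names, not SDK functions; the p-value uses the closed-form series for integer degrees of freedom, and the 0.01 threshold is the one documented on this page.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term = total = 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x)
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    pdf = math.exp(-half) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * pdf * total


def srm_check(observed, weights, threshold=0.01):
    """Return (statistic, p_value, flagged) for observed counts vs. weights."""
    total = sum(observed)
    weight_sum = sum(weights)
    expected = [total * w / weight_sum for w in weights]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = chi2_sf(stat, df=len(observed) - 1)
    return stat, p_value, p_value < threshold


print(srm_check([500, 500], [0.5, 0.5])[2])  # False
print(srm_check([600, 400], [0.5, 0.5])[2])  # True
```

A 600/400 split under 50/50 weights gives a statistic of 40, far past the threshold, while small imbalances such as 340/330/330 under equal weights are not flagged.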
Metric reporting deadline
Outcome metrics must be reported before the start_session() context manager exits. Once the session is closed, any subsequent report_metrics() calls are silently dropped on the backend.
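The pattern that avoids dropped metrics is to report inside the `with` block, including on the failure path. The sketch below uses a stub in place of the SDK so that it runs on its own: `StubSession`, the `start_session(experiment_id)` signature, the boolean return value of `report_metrics`, and `run_agent` are all assumptions made for illustration. Only the context-manager shape and the drop-after-close behaviour come from this page.

```python
from contextlib import contextmanager


class StubSession:
    """Stand-in for an SDK session object; NOT the real Niitaka API."""

    def __init__(self):
        self.closed = False
        self.recorded = {}

    def report_metrics(self, metrics: dict) -> bool:
        # Mirrors the documented behaviour: no error after close, just a drop.
        # The return value exists only so the demo can show what happened.
        if self.closed:
            return False
        self.recorded.update(metrics)
        return True


@contextmanager
def start_session(experiment_id: str):
    """Hypothetical signature for illustration."""
    session = StubSession()
    try:
        yield session
    finally:
        session.closed = True  # the reporting window ends here


def run_agent(task: str) -> dict:
    """Placeholder for your agent call."""
    return {"goal_completed": True, "score": 0.8}


with start_session("exp-123") as session:
    try:
        outcome = run_agent("demo task")
    except Exception:
        outcome = {"goal_completed": False}
        raise
    finally:
        session.report_metrics(outcome)  # still inside the session

late = session.report_metrics({"latency_s": 1.2})  # after exit: dropped
print(session.recorded, late)
```

Placing the report in a `finally` clause inside the `with` block means a failed run still records an outcome before the session closes.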