ab-test-setup


Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.


# A/B Test Setup

## 1️⃣ Purpose & Scope

Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.

- Prevents "peeking"
- Enforces statistical power
- Blocks invalid hypotheses

---

## 2️⃣ Pre-Requisites

You must have:

- A clear user problem
- Access to an analytics source
- Roughly estimated traffic volume

### Hypothesis Quality Checklist

A valid hypothesis includes:

- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria
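
For teams that keep hypotheses in code or config, a minimal record might look like the sketch below. The field names and example values are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Illustrative hypothesis record mirroring the checklist above."""

    evidence: str            # observation or evidence that motivated the test
    change: str              # the single, specific change being tested
    expected_direction: str  # e.g. "increase" or "decrease"
    audience: str            # defined audience
    primary_metric: str      # measurable success criterion
    mde: float               # minimum detectable effect, absolute


# Hypothetical example values
hypothesis = Hypothesis(
    evidence="Drop-off is highest on the payment step",
    change="Show accepted payment methods above the fold",
    expected_direction="increase",
    audience="New visitors on mobile web",
    primary_metric="checkout_completion_rate",
    mde=0.02,
)
```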

---

## 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the final hypothesis
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

Do NOT proceed until confirmed.

---

## 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

- Warn the user
- Recommend delaying or redesigning the test
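
One way to spot-check the randomization-quality assumption once traffic flows is a sample ratio mismatch (SRM) check. The sketch below assumes a 50/50 design and that SciPy is available; the counts are placeholders.

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for control and variant under a 50/50 design
observed = [50_321, 49_579]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print("Possible sample ratio mismatch: investigate randomization before trusting results")
```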

---

## 5️⃣ Test Type Selection

Choose the simplest valid test:

- A/B Test – single change, two variants
- A/B/n Test – multiple variants, higher traffic required
- Multivariate Test (MVT) – interaction effects, very high traffic
- Split URL Test – major structural changes

Default to A/B unless there is a clear reason otherwise.
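
If you prefer to encode that default rather than leave it to judgment, a small helper can capture the same rules. The inputs and branching below are assumptions for illustration, not a prescribed API.

```python
def choose_test_type(n_variants: int, n_changed_elements: int, structural_redesign: bool) -> str:
    """Pick the simplest valid test type, mirroring the list above (illustrative only)."""
    if structural_redesign:
        return "Split URL Test"           # major structural changes
    if n_changed_elements > 1:
        return "Multivariate Test (MVT)"  # interaction effects, very high traffic
    if n_variants > 2:
        return "A/B/n Test"               # multiple variants, higher traffic required
    return "A/B Test"                     # default: single change, two variants


print(choose_test_type(n_variants=2, n_changed_elements=1, structural_redesign=False))  # A/B Test
```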

---

## 6️⃣ Metrics Definition

#### Primary Metric (Mandatory)

- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

#### Secondary Metrics

- Provide context
- Explain _why_ results occurred
- Must not override the primary metric

#### Guardrail Metrics

- Metrics that must not degrade
- Used to prevent harmful "wins"
- Trigger a test stop if significantly negative
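
One way to freeze the metric plan is to commit it as a small config next to the test definition. The shape below is a sketch; metric names and thresholds are hypothetical.

```python
# Hypothetical metric plan, frozen before launch
metric_plan = {
    "primary": "checkout_completion_rate",                    # single success metric
    "secondary": ["add_to_cart_rate", "time_to_purchase_s"],  # context only, never override primary
    "guardrails": {
        # metrics that must not degrade; breaching a threshold stops the test
        "page_load_p95_ms": {"max_relative_increase": 0.05},
        "refund_rate": {"max_relative_increase": 0.00},
    },
}
```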

---

## 7️⃣ Sample Size & Duration

Define upfront:

- Baseline rate
- MDE
- Significance level (typically 5%, i.e. 95% confidence)
- Statistical power (typically 80%)

Estimate:

- Required sample size per variant
- Expected test duration

Do NOT proceed without a realistic sample size estimate.
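
A standard two-proportion power calculation produces that estimate. The sketch below assumes statsmodels is available and uses placeholder numbers for the baseline, MDE, and daily traffic.

```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10       # assumed baseline conversion rate
mde_abs = 0.02        # minimum detectable effect, absolute (10% -> 12%)
alpha = 0.05          # significance level (95% confidence)
power = 0.80          # statistical power
daily_traffic = 500   # assumed eligible users per day across both variants

# Cohen's h effect size for two proportions, then the required n per variant
effect_size = proportion_effectsize(baseline + mde_abs, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)

print(f"Required sample size per variant: {math.ceil(n_per_variant)}")
print(f"Expected duration: {math.ceil(2 * n_per_variant / daily_traffic)} days")
```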

---

## 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation only if all are true:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.
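
If the gate lives in tooling rather than a checklist document, it can be as blunt as the sketch below; the flags are assumptions you would feed from your own verification of each item.

```python
# Hypothetical readiness flags; every one must be verified before implementation starts
readiness_gate = {
    "hypothesis_locked": True,
    "primary_metric_frozen": True,
    "sample_size_calculated": True,
    "test_duration_defined": True,
    "guardrails_set": True,
    "tracking_verified": False,
}

missing = [item for item, done in readiness_gate.items() if not done]
if missing:
    raise RuntimeError(f"Execution readiness gate failed; resolve: {', '.join(missing)}")
```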

---

## Running the Test

### During the Test

DO:

- Monitor technical health
- Document external factors

DO NOT:

- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria
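
One simple way to make "no peeking" operational is to refuse to compute significance until every variant has reached its pre-registered sample size, as in this sketch (the counts are placeholders).

```python
def ready_for_analysis(observed_per_variant: list[int], planned_per_variant: int) -> bool:
    """Only analyze once every variant has reached the pre-registered sample size."""
    return min(observed_per_variant) >= planned_per_variant


# Hypothetical mid-test progress check
if not ready_for_analysis([6_200, 6_150], planned_per_variant=9_604):
    print("Below planned sample size: keep the test running and do not look at significance")
```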

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result               | Action                                 |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout                       |
| Significant negative | Reject variant, document learning      |
| Inconclusive         | Consider more traffic or bolder change |
| Guardrail failure    | Do not ship, even if primary wins      |
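
For a conversion-style primary metric, the comparison itself can be a two-proportion z-test once the planned sample size is reached. The sketch below assumes statsmodels is available and uses made-up counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results, ordered [control, variant]
conversions = [480, 540]
exposures = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
control_rate = conversions[0] / exposures[0]
variant_rate = conversions[1] / exposures[1]

print(f"control={control_rate:.2%}, variant={variant_rate:.2%}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant: check guardrails and apply business judgment before rollout")
else:
    print("Inconclusive: consider more traffic or a bolder change")
```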

---

## Documentation & Learning

### Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Planned vs. achieved sample size
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.
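
The record can be as lightweight as one structured document per test. The template below is a sketch with hypothetical field values, not a required schema.

```python
# Hypothetical test record; store one per experiment in the shared location
test_record = {
    "hypothesis": "Showing payment methods above the fold increases checkout completion",
    "variants": {"control": "current checkout", "treatment": "payment methods above the fold"},
    "metrics": {"primary": "checkout_completion_rate", "secondary": [], "guardrails": []},
    "sample_size": {"planned_per_variant": 9_604, "achieved_per_variant": 10_114},
    "results": {"p_value": 0.21, "observed_lift": 0.004},
    "decision": "Inconclusive; do not ship",
    "learnings": "Any effect is smaller than the MDE we powered for",
    "follow_up_ideas": ["Test a bolder change to the payment step"],
}
```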

---

## Refusal Conditions (Safety)

Refuse to proceed if:

- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated

Explain why and recommend next steps.

---

## Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first

---

## Final Reminder

A/B testing is not about proving ideas right. It is about learning the truth with confidence.

If you feel tempted to rush, simplify, or “just try it”, that is the signal to slow down and re-check the design.
