A/B/n testing is ultimately an elaborate analysis task. In our domain, which I like to refer to generically as internet user activity, it’s considerably more complex because it’s impossible to conduct a truly controlled experiment. In other words, there’s no way to gather sufficiently large and homogeneous groups of users to experiment on, each having a neat and constant variance (for whatever random variable you intend to rely on) over time.

On top of that, there are ubiquitous external factors that are extremely difficult to account for, such as changing fashions, worldwide economic trends, and the reliability of the user acquisition platform used for the test, to name a few.

## The “classical” approach

The common approach to this conundrum is to gather “a lot” of data and run statistical tests (ANOVA, proportion difference test, etc.). This approach works nicely for well-designed and controlled experiments aimed at discovering natural phenomena, such as the efficacy of some drug vs. placebo. In other cases, such as the one described above, it fails spectacularly. The reason, unsurprisingly, is that there is no tight control over A/B/n testing in the internet user activity domain: there is no way to guarantee that the population we sample from is homogeneous in the characteristics we are interested in. This brings forth the very relevant issue of the required sample size.

Discussing sample size requires a deep understanding of the internal mechanisms of statistical inference. Concisely, statistical inference requires observing a few rules of conduct. One of them is to perform inference only after all of the experimental data are gathered. But that still leaves open the question of how much data are required, and the more subtle but arguably much more important question: required for which purpose? Herein lies the crux of the matter. Continuing with the common approach, which will be referred to from this point on as frequentist, a second crucial requirement is to state the experiment hypotheses before any data are gathered.

This is a relatively simple task for exact science domains, such as testing drug efficacy. Generally, new drugs are supposed to either outperform currently used (reference) drugs or provide enhanced safety compared to them. Their development should be driven by developments in the chemistry domain, i.e. logically applying chemical phenomena to enhance performance. The experiment hypothesis is then reduced to inferring whether the new (and supposedly enhanced) drug is more effective than its reference drug. Very simple indeed, because it stems from logic and science. The drug development business is full of regulations, so I won’t discuss it any further. Disregarding any regulatory requirements, the next steps would be to gather test subjects, randomly divide them into treatment and reference groups, apply each treatment, record the results, calculate a test statistic, and compute the p-value.
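The last two steps of that procedure (calculate a test statistic and compute the p-value) can be sketched as follows. This is a minimal illustration that assumes a binary outcome per subject and a one-sided two-proportion z-test; the function name and the counts are hypothetical and are not taken from any real trial.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_t, n_t, success_r, n_r):
    """One-sided z-test of H1: treatment success rate > reference success rate."""
    p_t = success_t / n_t
    p_r = success_r / n_r
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (success_t + success_r) / (n_t + n_r)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_r))
    z = (p_t - p_r) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts, invented for illustration only.
z, p_value = two_proportion_z_test(success_t=130, n_t=200, success_r=110, n_r=200)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

The p-value is only meaningful here because the hypothesis, the sample size, and the stopping point were all fixed before the data were collected.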

Now that we know the hypothesis, at least for this example, we can use a statistical model to calculate the minimum sample size. This model obviously relies on further assumptions, most notably by how much the treatment will outperform the reference. This last assumption is critical for the procedure and shows clearly why this approach fits here. This type of experiment should only be conducted if the experimenters believe that the treatment effect over the reference is substantial. Otherwise, significant resources go to waste. When relying on relevant previous scientific research, such as how specific chemicals interact, researchers may get a sense of the enhanced efficacy, thereby hypothesizing by how much the treatment should enhance the effect of the reference. When conducted honestly, this stage leads to a relevant sample size calculation for the experiment, as it accounts for this perceived difference in effects.
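To make the dependence on the assumed effect concrete, here is a rough sketch of a textbook normal-approximation sample size formula for comparing two proportions with a one-sided test. The significance level, the power, and the success rates are assumptions chosen for illustration; `sample_size_per_group` is a hypothetical helper, not a reference implementation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_ref, p_treat, alpha=0.05, power=0.8):
    """Approximate per-group n for a one-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)      # quantile for the desired power
    variance = p_ref * (1 - p_ref) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_treat - p_ref) ** 2)


# Hypothetical reference rate of 55% and three assumed treatment rates.
for assumed_treat in (0.65, 0.60, 0.57):
    n = sample_size_per_group(0.55, assumed_treat)
    print(f"assumed treatment rate {assumed_treat:.2f}: {n} subjects per group")
```

The smaller the assumed improvement, the faster the required sample grows, which is why an honest estimate of the effect matters so much at this stage.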

Note that the calculated size doesn’t guarantee that the experiment will succeed in terms of discovering something new. It could very well turn out that given the calculated sample size, based on the experimental hypotheses and assumptions, the results will still be inconclusive. The reason for this stems from statistical philosophy, which I will not digress into here. If this seems confusing, just keep in mind that the reason we’re experimenting is that we don’t know the outcome in advance: the data may support the hypothesis, contradict it, or leave the question unresolved.

To conclude this elaborate example, for the frequentist approach to be valid it requires:

- A hypothesis to test, which makes logical sense as objectively as possible given the analysis subject.
- A statistical model that fits the experiment design and assumptions.
- A large enough sample size.
- A predefined experiment termination rule that is completely independent of gathered data values.
- A predefined rule for experiment “success” or alternatively, inconclusiveness.

It’s important to note that even if all of these requirements are meticulously adhered to, the inference is still invalid unless the test results are shown to be replicable. After all, since we are dealing with “randomness” here, the results of any single experiment could be a fluke. By fluke I mean, of course, that external factors which cannot be controlled, however hard we try, may have influenced the results.

## Pitfalls in the internet user activity domain

The experiment setting described above completely collapses in the internet user activity domain for two reasons:

- Hypotheses and assumptions are made for design and marketing choices. With all due respect, design and marketing are not exact science domains. Cause and effect relationships are much less conspicuous here than in the drug example above. As noted above, this makes logical hypothesizing nigh impossible, which means that there is no obvious statistical black box we can use to calculate minimum sample size and p-values. Out of the basic requirements outlined above, this conflicts with the first three.
- User acquisition is expensive. Admittedly this has no theoretical bearing on scientific requirements, but practically, this makes all the difference. We already saw that there is no reliable way to define the required sample size in advance for this case. Practitioners are therefore left with two options: guess a small enough sample size to keep expenses in check and risk running an “unreliable” experiment; or observe intermediary experiment results while it’s still running. Since most of us would like to avoid seeming unreliable, the latter approach is unfortunately a common practice, thereby violating the fourth basic requirement above and rendering the fifth meaningless.

Let’s take a moment to appreciate why observing intermediary experiment results is so disastrous for the frequentist setting. Suppose that some anonymous practitioner (let’s call him Pete) runs a basic A/B test for a day and observes the results. Say variation A had a larger CVR than B and the p-value was 0.06. Pete could feel tempted to terminate the experiment because the significance level that he (maybe forgot to) set in advance is 0.1. Suppose Pete was suddenly preoccupied with other urgent matters and forgot to terminate the test. The next day Pete sees that variation B’s CVR has become larger than A’s. The resulting p-value is admittedly a bit higher, 0.09, but the result is still “significant”. Feeling pleased with the apparent finality of this solution, Pete immediately moves on to terminate the test so that no more traffic is wasted there. Needless to say, Pete didn’t contemplate what could have happened the day after had the test remained active. The moral of this totally fictitious but completely plausible story is that intermediary p-values are unreliable for statistical inference. In other words, a p-value is a tool that needs to be used properly in the appropriate scenarios.
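Pete’s story can be reproduced with a small simulation. The sketch below runs many A/A tests, in which both variations share the same true CVR and every “significant” result is therefore a false positive, and compares stopping at the first significant daily peek against a single planned look at the end. All parameters (traffic, CVR, number of days, the 0.1 significance level) are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test with equal group sizes n."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_a / n - conv_b / n) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_tests=500, days=10, daily_visitors=200, cvr=0.1, alpha=0.1, seed=7):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        peek_hit = False
        for day in range(1, days + 1):
            # Both variations convert at the same true rate (an A/A test).
            conv_a += sum(rng.random() < cvr for _ in range(daily_visitors))
            conv_b += sum(rng.random() < cvr for _ in range(daily_visitors))
            p = two_sided_p_value(conv_a, conv_b, day * daily_visitors)
            peek_hit = peek_hit or p < alpha
        peeking_hits += peek_hit     # "significant" on at least one day
        final_hits += (p < alpha)    # "significant" on the last day only
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate()
print(f"false positive rate with daily peeking:  {peeking_rate:.2f}")
print(f"false positive rate with one final look: {final_rate:.2f}")
```

With a single planned look, the false positive rate stays close to the chosen significance level. Acting on the first daily peek that crosses the threshold pushes it well above that level, even though nothing distinguishes the two variations.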

In conclusion, A/B/n testing in the internet user activity domain has two very unscientific desiderata:

- Experimenting without first articulating any objectively logic-driven hypotheses.
- Minimizing sample costs by observing intermediary results.

These desiderata stand in complete opposition to the frequentist experimentation rules described above, which necessitates a different approach.

## When the IQ kicks in

Storemaven’s approach to A/B/n testing is designed to tackle the problems described above, i.e. striving for an accurate test setting while economizing on sample size. Fair warning: this section includes technical terms that require basic knowledge of statistical theory to fully comprehend.

Storemaven’s StoreIQ can be called Bayesian, as it’s mainly focused on estimating and then comparing the distribution of each variation’s CVR. If we regard the number of installers (given the number of samples) as the prominent random variable in the frequentist approach, then its distribution parameter is the proportion of installers (CVR). Frequentist approaches would usually estimate this parameter as a single value (e.g. using MLE) and move on to perform inference. StoreIQ instead models the CVRs themselves as random variables, each with its own distribution.
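To illustrate the difference between a point estimate and a modeled distribution, here is a minimal sketch that uses a Beta-Binomial model. This is a common textbook choice and an assumption made for illustration; the text does not say which model StoreIQ actually uses, and the counts and the uniform prior are hypothetical.

```python
import random
from statistics import mean, quantiles


def posterior_cvr_samples(installs, visitors, n_draws=20_000, seed=0,
                          prior_a=1.0, prior_b=1.0):
    """Draw from the Beta posterior of a CVR under a Beta(prior_a, prior_b) prior."""
    rng = random.Random(seed)
    a = prior_a + installs               # prior pseudo-installs plus observed installs
    b = prior_b + visitors - installs    # prior pseudo-misses plus observed misses
    return [rng.betavariate(a, b) for _ in range(n_draws)]


installs, visitors = 30, 1000            # hypothetical counts
mle = installs / visitors                # frequentist point estimate of the CVR
draws = posterior_cvr_samples(installs, visitors)

cuts = quantiles(draws, n=40)            # cut points at 2.5%, 5%, ..., 97.5%
low, high = cuts[0], cuts[-1]            # central 95% credible interval

print(f"MLE point estimate: {mle:.4f}")
print(f"posterior mean:     {mean(draws):.4f}")
print(f"95% credible interval: ({low:.4f}, {high:.4f})")
```

The point estimate is a single number, while the posterior describes how plausible each CVR value is given the data, which is the object that gets compared across variations.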

In order to economize on user traffic, StoreIQ terminates underperforming variations according to the modeled distribution of their CVR once enough data are gathered. From this point on, traffic is diverted to the remaining active variations, which increases the momentum towards the experiment’s conclusion. When sufficient data are gathered to conclude that a single distribution is superior (in terms of location and scale) to all others, or that the performance of all active variations is nearly identical, StoreIQ concludes the test.
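One way such a pruning step could work is sketched below: estimate each variation’s probability of having the highest CVR by sampling from its posterior, then drop the variations whose probability falls below a cutoff. The Beta posteriors, the uniform prior, the 2% cutoff, and the counts are all hypothetical; StoreIQ’s actual decision rules are not described at this level of detail here.

```python
import random


def prob_best(data, n_draws=5_000, seed=1):
    """Monte Carlo estimate of P(variation has the highest CVR).

    data maps a variation name to (installs, visitors).
    """
    rng = random.Random(seed)
    wins = dict.fromkeys(data, 0)
    for _ in range(n_draws):
        # One joint draw: a plausible CVR for every variation.
        draws = {name: rng.betavariate(1 + k, 1 + n - k)
                 for name, (k, n) in data.items()}
        wins[max(draws, key=draws.get)] += 1
    return {name: count / n_draws for name, count in wins.items()}


def prune(data, drop_below=0.02):
    """Keep only variations that still have a realistic chance of being best."""
    probs = prob_best(data)
    active = {name: counts for name, counts in data.items()
              if probs[name] >= drop_below}
    return active, probs


# Hypothetical (installs, visitors) per variation.
observed = {"A": (30, 1000), "B": (55, 1000), "C": (50, 1000)}
active, probs = prune(observed)
print("probability of being best:", probs)
print("still active:", sorted(active))
```

In this example variation A is dropped while B and C stay active, so subsequent traffic would be split between the two remaining candidates.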

Storemaven’s StoreIQ has a few mechanisms in place to help keep the experiment stability in check:

- Each experiment starts with a “warm up” period during which a preliminary number of samples is gathered and statistics are not yet calculated. This helps us avoid making inferences on unstable data.
- Model updates and statistical calculations are performed at constant intervals (when not in the warm up stage) to control for seasonality.
- The model parameters include regularization factors that help avoid early test termination. These factors are relaxed as the test progresses.

The StoreIQ methodology isn’t limited by the hypothesis setting and statistical modeling requirements of the frequentist approach. It has one mission – to compare parameter distributions and take action when a clear distinction between them can be made. Like the frequentist approach, it’s a tool, but unlike it, this tool is designed for A/B/n testing in the internet user activity domain.