⚠️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://towardsdatascience.com/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499/ |
| Last Crawled | 2026-04-06 13:52:50 (11 hours ago) |
| First Indexed | 2025-02-11 06:39:55 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | A/B Testing - A complete guide to statistical testing | Towards Data Science |
| Meta Description | null |
| Meta Canonical | null |
| Markdown | [](https://towardsdatascience.com/)
Publish AI, ML & data-science insights to a global community of data professionals.
[Sign in]()
[Submit an Article](https://contributor.insightmediagroup.io/)
- [Latest](https://towardsdatascience.com/latest/)
- [Editorâs Picks](https://towardsdatascience.com/tag/editors-pick/)
- [Deep Dives](https://towardsdatascience.com/tag/deep-dives/)
- [Newsletter](https://towardsdatascience.com/tag/the-variable/)
- [Write For TDS](https://towardsdatascience.com/submissions/)
Toggle Mobile Navigation
- [LinkedIn](https://www.linkedin.com/company/towards-data-science/?originalSubdomain=ca)
- [X](https://x.com/TDataScience)
Toggle Search
[Data Science](https://towardsdatascience.com/category/data-science/)
# A/B Testing – A complete guide to statistical testing
Optimizing web marketing strategies through statistical testing
[Francesco Casalegno](https://towardsdatascience.com/author/francesco-casalegno/)
Feb 17, 2021
10 min read
### [Getting Started](https://towardsdatascience.com/tagged/getting-started)
### For marketers and data scientists alike, it's crucial to set up the right test.

Photo by [John McArthur](https://unsplash.com/@snowjam?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/red-and-blue?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)
## What is A/B testing?
**A/B testing** is one of the most popular controlled experiments used to optimize web marketing strategies. It allows decision makers to choose the best design for a website by looking at the analytics results obtained with two possible alternatives A and B.
In this article we'll see how different statistical methods can be used to make A/B testing successful. I also recommend having a look at **this notebook**, where you can play with the examples discussed in this article.
To understand what A/B testing is about, let's consider two alternative designs: A and B. Visitors to a website are randomly served one of the two. Then, data about their activity is collected by web analytics. Given this data, one can apply statistical tests to determine whether one of the two designs has better efficacy.
Now, different kinds of metrics can be used to measure a website's efficacy. With **discrete metrics**, also called **binomial metrics**, only the two values **0** and **1** are possible. The following are examples of popular discrete metrics.
- [Click-through rate](https://en.wikipedia.org/wiki/Click-through_rate) – if a user is shown an advertisement, do they click on it?
- [Conversion rate](https://en.wikipedia.org/wiki/Conversion_rate_optimization) – if a user is shown an advertisement, do they convert into a customer?
- [Bounce rate](https://en.wikipedia.org/wiki/Bounce_rate) – if a user visits a website, is the next page they visit on the same website?

Discrete metrics: click-through rate (image by author)
With **continuous metrics**, also called **non-binomial metrics**, the metric may take continuous values that are not limited to two discrete states. The following are examples of popular continuous metrics.
- [Average revenue per user](https://en.wikipedia.org/wiki/Average_revenue_per_user) – how much revenue does a user generate in a month?
- [Average session duration](https://en.wikipedia.org/wiki/Session_\(web_analytics\)) – for how long does a user stay on a website in a session?
- [Average order value](https://www.optimizely.com/optimization-glossary/average-order-value/) – what is the total value of a user's order?

Continuous metrics: average order value (image by author)
We are going to see in detail how discrete and continuous metrics require different statistical tests. But first, let's quickly review some fundamental concepts of statistics.
## Statistical significance
With the data we collected from the activity of users of our website, we can compare the efficacy of the two designs A and B. Simply comparing mean values wouldn't be very meaningful, as we would fail to assess the **statistical significance** of our observations. It is indeed fundamental to determine how likely it is that the observed discrepancy between the two samples originates from chance.
In order to do that, we will use a [two-sample hypothesis test](https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing). Our **null hypothesis H0** is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, or average revenue per user, etc. The statistical significance is then measured by the **p-value**, i.e. the probability of observing a discrepancy between our samples at least as strong as the one that we actually observed.

P-value (image by author)
Now, some care has to be applied to properly choose the **alternative hypothesis Ha**. This choice corresponds to the choice between [one- and two-tailed tests](https://en.wikipedia.org/wiki/One-_and_two-tailed_tests).
A **two-tailed test** is preferable in our case, since we have no reason to know a priori whether the discrepancy between the results of A and B will be in favor of A or B. This means that we take as alternative hypothesis **Ha** that A and B have different efficacy.

One- and Two-tailed tests (image by author)
The **p-value** is therefore computed as the area under the two tails of the probability density function **p(x)** of a chosen test statistic, over all **x′** s.t. **p(x′) ≤ p(our observation)**. The computation of such a p-value clearly depends on the data distribution. So we will first see how to compute it for discrete metrics, and then for continuous metrics.
## Discrete metrics
Let's first consider a discrete metric such as the click-through rate. We randomly show visitors one of two possible designs of an advertisement, and we keep track of how many of them click on it.
Let's say that we collected the following information.
- **nX = 15** visitors saw the advertisement A, and **7** of them clicked on it.
- **nY = 19** visitors saw the advertisement B, and **15** of them clicked on it.

Click-through ratios: contingency table (image by author)
At first glance, it looks like version B was more effective, but how statistically significant is this discrepancy?
## Fisher's exact test
Using the 2×2 [contingency table shown above](https://en.wikipedia.org/wiki/Contingency_table) we can use [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) to compute an exact p-value and test our hypothesis. To understand how this test works, let us start by noticing that if we fix the margins of the table (i.e. the four sums of each row and column), then only a few different outcomes are possible.

Click-through ratios: possible outcomes (image by author)
Now, the key observation is that, under the null hypothesis H0 that A and B have the same efficacy, the probability of observing any of these possible outcomes is given by the [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution).

Hypergeometric distribution of possible outcomes (image by author)
Using this formula we obtain that:
- the probability of seeing our actual observations is **~4.5%**;
- the probability of seeing even more unlikely observations in favor of B is **~1.0%** (left tail);
- the probability of seeing even more unlikely observations in favor of A is **~2.0%** (right tail).

Click-through ratios: tails and p-value (image by author)
So Fisher's exact test gives **p-value ≈ 7.5%**.
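As a quick check of these numbers, the test is easy to reproduce with SciPy. This is a sketch under the assumption that `scipy` is available; it is not code from the article's notebook.

```python
# Hedged sketch: Fisher's exact test for the click-through data above.
from scipy.stats import fisher_exact, hypergeom

#           clicked  not clicked
table = [[7, 8],    # design A (nX = 15)
         [15, 4]]   # design B (nY = 19)

# Probability of the observed table under H0: hypergeometric with
# M = 34 visitors in total, n = 22 clicks in total, N = 15 draws for A.
p_observed = hypergeom(M=34, n=22, N=15).pmf(7)            # the ~4.5% above

# Two-sided exact p-value: sums all tables at least as unlikely as ours.
_, p_value = fisher_exact(table, alternative="two-sided")  # the ~7.5% above
print(round(p_observed, 3), round(p_value, 3))
```

The two-sided p-value is exactly the sum of the three quantities listed above: the observed table plus both tails.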
## Pearson's chi-squared test
Fisher's exact test has the important advantage of computing exact p-values. But if we have a large sample size, it may be computationally inefficient. In this case, we can use [Pearson's chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test) to compute an approximate p-value.
Let us call **Oij** the observed value of the contingency table at row **i** and column **j**. Under the null hypothesis of independence of rows and columns, i.e. assuming that A and B have the same efficacy, we can easily compute the corresponding expected values **Eij**. Moreover, if the observations are normally distributed, then the **χ²** statistic follows exactly a [chi-square distribution](https://en.wikipedia.org/wiki/Chi-square_distribution) with 1 degree of freedom.

Pearson's chi-squared test (image by author)
In fact, this test can also be used with non-normal observations if the sample size is large enough, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem).
In our example, using Pearson's chi-squared test we obtain **χ² ≈ 3.825**, which gives **p-value ≈ 5.1%**.
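The same contingency table can be fed to SciPy's implementation. A sketch, not from the article; Yates' continuity correction is turned off so the statistic matches the plain χ² formula above.

```python
# Hedged sketch: Pearson's chi-squared test on the same contingency table.
from scipy.stats import chi2_contingency

table = [[7, 8], [15, 4]]
# correction=False disables Yates' continuity correction.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), round(p_value, 3), dof)   # chi2 ≈ 3.825, p ≈ 0.051, dof = 1
```

Note that `expected` holds the **Eij** values (row total × column total / grand total), so the function also saves us that step.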
## Continuous metrics
Let's now consider the case of a continuous metric such as the average revenue per user. We randomly show visitors one of two possible layouts of our website, and based on how much revenue each user generates in a month we want to determine whether one of the two layouts is more effective.
Let's consider the following case.
- **nX = 17** users saw the layout A, and then made the following purchases: 200$, 150$, 250$, 350$, 150$, 150$, 350$, 250$, 150$, 250$, 150$, 150$, 200$, 0$, 0$, 100$, 50$.
- **nY = 14** users saw the layout B, and then made the following purchases: 300$, 150$, 150$, 400$, 250$, 250$, 150$, 200$, 250$, 150$, 300$, 200$, 250$, 200$.

Average revenue per user: samples distribution (image by author)
Again, at first glance it looks like version B was more effective. But how statistically significant is this discrepancy?
## Z-test
The [Z-test](https://en.wikipedia.org/wiki/Z-test) can be applied under the following assumptions.
- The observations are normally distributed (or the sample size is large).
- The sampling distributions have known standard deviations **σX** and **σY**.
Under the above assumptions, the Z-test exploits the fact that the following **Z statistic** has a standard normal distribution.

Z-test (image by author)
Unfortunately, in most real applications the standard deviations are unknown and must be estimated, so a t-test is preferable, as we will see later. Anyway, if in our case we knew the true values **σX = 100** and **σY = 90**, then we would obtain **z ≈ -1.697**, which corresponds to a **p-value ≈ 9%**.
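SciPy has no dedicated two-sample Z-test helper, but the statistic is simple enough to compute by hand. A sketch using the revenue samples above and the assumed known σX = 100, σY = 90:

```python
# Hedged sketch: two-sample Z-test with known standard deviations.
import numpy as np
from scipy.stats import norm

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])         # layout A
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])              # layout B
sigma_x, sigma_y = 100, 90   # assumed known for this example

# Z statistic: difference of means over its (known) standard error.
z = (x.mean() - y.mean()) / np.sqrt(sigma_x**2 / len(x) + sigma_y**2 / len(y))
p_value = 2 * norm.sf(abs(z))   # two-tailed
print(round(z, 3), round(p_value, 3))   # z ≈ -1.697, p ≈ 0.09
```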
## Student's t-test
In most cases, the variances of the sampling distributions are unknown, so we need to estimate them. [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) can then be applied under the following assumptions.
- The observations are normally distributed (or the sample size is large).
- The sampling distributions have "similar" variances **σX ≈ σY**.
Under the above assumptions, Student's t-test relies on the observation that the following **t statistic** has a Student's t distribution.

Student's t-test (image by author)
Here **SP** is the [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_variance) obtained from the sample variances **SX** and **SY**, which are computed using the unbiased formula that applies [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
In our example, using Student's t-test we obtain **t ≈ -1.789** and **ν = 29**, which give **p-value ≈ 8.4%**.
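This result can be reproduced with `scipy.stats.ttest_ind`; a sketch, where `equal_var=True` selects the pooled-variance Student's version:

```python
# Hedged sketch: Student's t-test (pooled variance) on the revenue samples.
from scipy.stats import ttest_ind

x = [200, 150, 250, 350, 150, 150, 350, 250, 150,
     250, 150, 150, 200, 0, 0, 100, 50]          # layout A, nX = 17
y = [300, 150, 150, 400, 250, 250, 150, 200,
     250, 150, 300, 200, 250, 200]               # layout B, nY = 14

# equal_var=True -> pooled variance, nu = nX + nY - 2 = 29 degrees of freedom
t, p_value = ttest_ind(x, y, equal_var=True)
print(round(t, 3), round(p_value, 3))            # t ≈ -1.789, p ≈ 0.084
```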
## Welch's t-test
In most cases Student's t-test can be effectively applied with good results. However, it may occasionally happen that its second assumption (similar variances of the sampling distributions) is violated. In that case, we cannot compute a pooled variance, and rather than Student's t-test we should use [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test).
This test operates under the same assumptions as Student's t-test but removes the requirement of similar variances. We can then use a slightly different **t statistic**, which also has a Student's t distribution, but with a different number of degrees of freedom **ν**.

Welch's t-test (image by author)
The complex formula for **ν** comes from the [Welch–Satterthwaite equation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation).
In our example, using Welch's t-test we obtain **t ≈ -1.848** and **ν ≈ 28.51**, which give **p-value ≈ 7.5%**.
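With SciPy, the only change from the Student's test call is `equal_var=False`; a sketch, where the Welch–Satterthwaite ν is handled internally:

```python
# Hedged sketch: Welch's t-test, i.e. no pooled variance is assumed.
from scipy.stats import ttest_ind

x = [200, 150, 250, 350, 150, 150, 350, 250, 150,
     250, 150, 150, 200, 0, 0, 100, 50]          # layout A, nX = 17
y = [300, 150, 150, 400, 250, 250, 150, 200,
     250, 150, 300, 200, 250, 200]               # layout B, nY = 14

# equal_var=False -> Welch's test, Welch-Satterthwaite nu ≈ 28.5
t, p_value = ttest_ind(x, y, equal_var=False)
print(round(t, 3), round(p_value, 3))            # t ≈ -1.848, p ≈ 0.075
```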
## Continuous non-normal metrics
In the previous section on continuous metrics, we assumed that our observations came from normal distributions. But non-normal distributions are extremely common when dealing with quantities such as per-user monthly revenue. There are several ways in which normality is often violated:
- [zero-inflated distributions](https://en.wikipedia.org/wiki/Zero-inflated_model) – most users don't buy anything at all, so there are lots of zero observations;
- [multimodal distributions](https://en.wikipedia.org/wiki/Multimodal_distribution) – one market segment tends to purchase cheap products, while another segment purchases more expensive ones.

Continuous non-normal distribution (image by author)
However, if we have enough samples, tests derived under normality assumptions, such as the Z-test, Student's t-test, and Welch's t-test, can still be applied to observations that significantly deviate from normality. Indeed, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the distribution of the test statistics tends to normality as the sample size increases. In the zero-inflated and multimodal example we are considering, even a sample size of 40 produces a distribution that is well approximated by a normal distribution.

Convergence to normality of a non-normal distribution (image by author)
But if the sample size is still too small to assume normality, we have no choice but to use a non-parametric approach such as the Mann-Whitney U test.
## Mann–Whitney U test
This test makes no assumption on the nature of the sampling distributions, so it is fully non-parametric. The idea of the [Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is to compute the following **U statistic**.

Mann-Whitney U test (image by author)
The values of this test statistic are tabulated, as its distribution can be computed under the null hypothesis that, for random samples **X** and **Y** from the two populations, the probability **P(X < Y)** is the same as **P(X > Y)**.
In our example, using the Mann-Whitney U test we obtain **u = 76**, which gives **p-value ≈ 8.0%**.
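SciPy's `mannwhitneyu` reproduces this. A sketch, not from the article; note that with tied observations SciPy falls back to a tie-corrected normal approximation, so the p-value is approximate rather than exact.

```python
# Hedged sketch: Mann-Whitney U test on the revenue samples.
from scipy.stats import mannwhitneyu

x = [200, 150, 250, 350, 150, 150, 350, 250, 150,
     250, 150, 150, 200, 0, 0, 100, 50]          # layout A, nX = 17
y = [300, 150, 150, 400, 250, 250, 150, 200,
     250, 150, 300, 200, 250, 200]               # layout B, nY = 14

# Returns the U statistic for x; with ties present, the p-value comes
# from a tie-corrected normal approximation.
u, p_value = mannwhitneyu(x, y, alternative="two-sided")
print(u, round(p_value, 3))   # u = 76, p ≈ 0.08
```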
## Conclusion
In this article we have seen that different kinds of metrics, sample sizes, and sampling distributions require different statistical tests for computing the significance of A/B tests. We can summarize all these possibilities in the form of a decision tree.

Summary of the statistical tests to be used for A/B testing (image by author)
If you want to know more, you can start by playing with **this notebook**, where you can see all the examples discussed in this article!
***
Written By
Francesco Casalegno
[See all from Francesco Casalegno](https://towardsdatascience.com/author/francesco-casalegno/)
[A B Testing](https://towardsdatascience.com/tag/a-b-testing/), [Data Science](https://towardsdatascience.com/tag/data-science/), [Getting Started](https://towardsdatascience.com/tag/getting-started/), [Machine Learning](https://towardsdatascience.com/tag/machine-learning/), [Statistics](https://towardsdatascience.com/tag/statistics/)
|
### [Getting Started](https://towardsdatascience.com/tagged/getting-started)
### For marketers and data scientists alike, it's crucial to set up the right test.

Photo by [John McArthur](https://unsplash.com/@snowjam?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/red-and-blue?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)
## What is A/B testing?
**A/B testing** is one of the most popular controlled experiments used to optimize web marketing strategies. It allows decision makers to choose the best design for a website by looking at the analytics results obtained with two possible alternatives A and B.
In this article we'll see how different statistical methods can be used to make A/B testing successful. I also recommend having a look at **this notebook**, where you can play with the examples discussed in this article.
To understand what A/B testing is about, let's consider two alternative designs: A and B. Visitors of a website are randomly served one of the two. Then, data about their activity is collected by web analytics. Given this data, one can apply statistical tests to determine whether one of the two designs has better efficacy.
Now, different kinds of metrics can be used to measure a website's efficacy. With **discrete metrics**, also called **binomial metrics**, only the two values **0** and **1** are possible. The following are examples of popular discrete metrics.
- [Click-through rate](https://en.wikipedia.org/wiki/Click-through_rate) – if a user is shown an advertisement, do they click on it?
- [Conversion rate](https://en.wikipedia.org/wiki/Conversion_rate_optimization) – if a user is shown an advertisement, do they convert into a customer?
- [Bounce rate](https://en.wikipedia.org/wiki/Bounce_rate) – if a user visits a website, is the next page they visit on the same website?

Discrete metrics: click-through rate (image by author)
With **continuous metrics**, also called **non-binomial metrics**, the metric may take continuous values that are not limited to two discrete states. The following are examples of popular continuous metrics.
- [Average revenue per user](https://en.wikipedia.org/wiki/Average_revenue_per_user) – how much revenue does a user generate in a month?
- [Average session duration](https://en.wikipedia.org/wiki/Session_\(web_analytics\)) – for how long does a user stay on a website in a session?
- [Average order value](https://www.optimizely.com/optimization-glossary/average-order-value/) – what is the total value of the order of a user?

Continuous metrics: average order value (image by author)
We are going to see in detail how discrete and continuous metrics require different statistical tests. But first, let's quickly review some fundamental concepts of statistics.
## Statistical significance
With the data we collected from the activity of users of our website, we can compare the efficacy of the two designs A and B. Simply comparing mean values wouldn't be very meaningful, as we would fail to assess the **statistical significance** of our observations. It is indeed fundamental to determine how likely it is that the observed discrepancy between the two samples originates from chance.
In order to do that, we will use a [two-sample hypothesis test](https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing). Our **null hypothesis H0** is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, or average revenue per user, etc. The statistical significance is then measured by the **p-value**, i.e. the probability of observing a discrepancy between our samples at least as strong as the one that we actually observed.

P-value (image by author)
Now, some care has to be taken to properly choose the **alternative hypothesis Ha**. This choice corresponds to the choice between [one- and two-tailed tests](https://en.wikipedia.org/wiki/One-_and_two-tailed_tests).
A **two-tailed test** is preferable in our case, since we have no reason to know a priori whether the discrepancy between the results of A and B will be in favor of A or B. This means that our alternative hypothesis **Ha** is that A and B have different efficacy.

One- and Two-tailed tests (image by author)
The **p-value** is therefore computed as the area under the two tails of the probability density function **p(x)** of a chosen test statistic, over all **x** such that **p(x) ≤ p(our observation)**. The computation of such a p-value clearly depends on the data distribution. So we will first see how to compute it for discrete metrics, and then for continuous metrics.
## Discrete metrics
Let's first consider a discrete metric such as the click-through rate. We randomly show visitors one of two possible designs of an advertisement, and we keep track of how many of them click on it.
Let's say we collected the following information.
- **nX = 15** visitors saw the advertisement A, and **7** of them clicked on it.
- **nY = 19** visitors saw the advertisement B, and **15** of them clicked on it.

Click-through ratios: contingency table (image by author)
At first glance, it looks like version B was more effective, but how statistically significant is this discrepancy?
## Fisher's exact test
Using the 2×2 [contingency table shown above](https://en.wikipedia.org/wiki/Contingency_table) we can use [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) to compute an exact p-value and test our hypothesis. To understand how this test works, let us start by noticing that if we fix the margins of the table (i.e. the four sums of each row and column), only a few different outcomes are possible.

Click-through ratios: possible outcomes (image by author)
Now, the key observation is that, under the null hypothesis H0 that A and B have the same efficacy, the probability of observing any of these possible outcomes is given by the [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution).

Hypergeometric distribution of possible outcomes (image by author)
Using this formula we obtain that:
- the probability of seeing our actual observations is **~4.5%**;
- the probability of seeing even more unlikely observations in favor of B is **~1.0%** (left tail);
- the probability of seeing even more unlikely observations in favor of A is **~2.0%** (right tail).

Click-through ratios: tails and p-value (image by author)
So Fisher's exact test gives **p-value ≈ 7.5%**.
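The numbers above can be reproduced with SciPy (a sketch; the article's own notebook may organize this differently):

```python
from scipy.stats import fisher_exact, hypergeom

# Contingency table: rows = designs A and B, columns = click / no click
table = [[7, 8],    # A: 7 of 15 visitors clicked
         [15, 4]]   # B: 15 of 19 visitors clicked

# Probability of the observed table under H0 (hypergeometric distribution):
# population N=34, K=22 total clicks, n=15 visitors shown design A, k=7 clicks
p_observed = hypergeom.pmf(7, 34, 22, 15)

# Two-sided Fisher's exact test sums the probabilities of all tables
# at most as likely as the observed one
_, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_observed, 3), round(p_value, 3))
```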
## Pearson's chi-squared test
Fisher's exact test has the important advantage of computing exact p-values. But if we have a large sample size, it may be computationally inefficient. In this case, we can use [Pearson's chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test) to compute an approximate p-value.
Let us call **Oij** the observed value of the contingency table at row **i** and column **j**. Under the null hypothesis of independence of rows and columns, i.e. assuming that A and B have the same efficacy, we can easily compute the corresponding expected values **Eij**. Moreover, if the observations are normally distributed, then the **χ²** statistic follows exactly a [chi-square distribution](https://en.wikipedia.org/wiki/Chi-square_distribution) with 1 degree of freedom.

Pearson's chi-squared test (image by author)
In fact, this test can also be used with non-normal observations if the sample size is large enough, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem).
In our example, using Pearson's chi-squared test we obtain **χ² ≈ 3.825**, which gives **p-value ≈ 5.1%**.
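The same numbers can be checked with SciPy's `chi2_contingency` (a sketch; note that Yates' continuity correction must be disabled to match the plain χ² statistic):

```python
from scipy.stats import chi2_contingency

table = [[7, 8],    # A: clicks / no clicks
         [15, 4]]   # B: clicks / no clicks

# correction=False disables Yates' continuity correction,
# giving the plain Pearson chi-squared statistic
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), round(p_value, 3), dof)
```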
## Continuous metrics
Let's now consider the case of a continuous metric such as the average revenue per user. We randomly show visitors one of two possible layouts of our website, and based on how much revenue each user generates in a month we want to determine whether one of the two layouts is more effective.
Let's consider the following case.
- **nX = 17** users saw layout A, and then made the following purchases: 200$, 150$, 250$, 350$, 150$, 150$, 350$, 250$, 150$, 250$, 150$, 150$, 200$, 0$, 0$, 100$, 50$.
- **nY = 14** users saw layout B, and then made the following purchases: 300$, 150$, 150$, 400$, 250$, 250$, 150$, 200$, 250$, 150$, 300$, 200$, 250$, 200$.

Average revenue per user: samples distribution (image by author)
Again, at first glance, it looks like version B was more effective. But how statistically significant is this discrepancy?
## Z-test
The [Z-test](https://en.wikipedia.org/wiki/Z-test) can be applied under the following assumptions.
- The observations are normally distributed (or the sample size is large).
- The sampling distributions have known standard deviations **σX** and **σY**.
Under the above assumptions, the Z-test exploits the fact that the following **Z statistic** has a standard normal distribution.

Z-test (image by author)
Unfortunately, in most real applications the standard deviations are unknown and must be estimated, so a t-test is preferable, as we will see later. Anyway, if in our case we knew the true values **σX = 100** and **σY = 90**, then we would obtain **z ≈ -1.697**, which corresponds to a **p-value ≈ 9%**.
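Assuming those σ values were known, the Z statistic can be computed directly from its definition (a sketch with NumPy/SciPy):

```python
import numpy as np
from scipy.stats import norm

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])     # layout A revenues
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])          # layout B revenues

sigma_x, sigma_y = 100, 90  # standard deviations, assumed known

# Z statistic: difference of sample means over its (known) standard error
z = (x.mean() - y.mean()) / np.sqrt(sigma_x**2 / len(x) + sigma_y**2 / len(y))
p_value = 2 * norm.cdf(-abs(z))  # two-tailed p-value
print(round(z, 3), round(p_value, 3))
```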
## Student's t-test
In most cases, the variances of the sampling distributions are unknown, so we need to estimate them. [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) can then be applied under the following assumptions.
- The observations are normally distributed (or the sample size is large).
- The sampling distributions have "similar" variances **σX ≈ σY**.
Under the above assumptions, Student's t-test relies on the observation that the following **t statistic** has a Student's t distribution.

Student's t-test
Here **SP** is the [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_variance) obtained from the sample variances **SX** and **SY**, which are computed using the unbiased formula that applies [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
In our example, using Student's t-test we obtain **t ≈ -1.789** and **ν = 29**, which give **p-value ≈ 8.4%**.
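With SciPy, the pooled-variance t-test is the default behavior of `ttest_ind` (a sketch on the same revenue data):

```python
import numpy as np
from scipy.stats import ttest_ind

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])     # layout A revenues
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])          # layout B revenues

# equal_var=True (the default) pools the two sample variances
t_stat, p_value = ttest_ind(x, y, equal_var=True)
nu = len(x) + len(y) - 2  # degrees of freedom
print(round(t_stat, 3), nu, round(p_value, 3))
```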
## Welch's t-test
In most cases Student's t-test can be effectively applied with good results. However, it may occasionally happen that its second assumption (similar variances of the sampling distributions) is violated. In that case we cannot compute a pooled variance, and rather than Student's t-test we should use [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test).
This test operates under the same assumptions as Student's t-test, but removes the requirement of similar variances. We then use a slightly different **t statistic**, which also has a Student's t distribution, but with a different number of degrees of freedom **ν**.

Welch's t-test
The complex formula for **ν** comes from the [Welch–Satterthwaite equation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation).
In our example, using Welch's t-test we obtain **t ≈ -1.848** and **ν ≈ 28.51**, which give **p-value ≈ 7.5%**.
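In SciPy, switching from Student's to Welch's t-test is a single flag (a sketch on the same data):

```python
import numpy as np
from scipy.stats import ttest_ind

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])     # layout A revenues
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])          # layout B revenues

# equal_var=False switches to Welch's t-test (no pooled variance;
# degrees of freedom via the Welch-Satterthwaite equation)
t_stat, p_value = ttest_ind(x, y, equal_var=False)
print(round(t_stat, 3), round(p_value, 3))
```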
## Continuous non-normal metrics
In the previous section on continuous metrics, we assumed that our observations came from normal distributions. But non-normal distributions are extremely common when dealing with metrics like per-user monthly revenue. There are several ways in which normality is often violated:
- [zero-inflated distributions](https://en.wikipedia.org/wiki/Zero-inflated_model) – most users don't buy anything at all, so there are lots of zero observations;
- [multimodal distributions](https://en.wikipedia.org/wiki/Multimodal_distribution) – one market segment tends to purchase cheap products, while another segment purchases more expensive ones.

Continuous non-normal distribution (image by author)
However, if we have enough samples, tests derived under normality assumptions like the Z-test, Student's t-test, and Welch's t-test can still be applied to observations that significantly deviate from normality. Indeed, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the distribution of the test statistic tends to normality as the sample size increases. In the zero-inflated and multimodal example we are considering, even a sample size of 40 produces a distribution of the sample mean that is well approximated by a normal distribution.

Convergence to normality of a non-normal distribution (image by author)
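A quick simulation illustrates this convergence. The distribution below is a hypothetical zero-inflated, bimodal revenue model of my own (not the article's figure), but the effect is the same: the sample mean is far closer to normal than the raw data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

def revenue(size):
    """Hypothetical monthly revenue: 60% of users spend nothing;
    buyers purchase either a cheap or an expensive product."""
    buys = rng.random(size) < 0.4
    price = rng.choice([50.0, 300.0], size=size)
    return buys * price

population = revenue(100_000)                       # raw, strongly skewed data
sample_means = revenue((10_000, 40)).mean(axis=1)   # means of samples of size 40

# The distribution of the sample mean is far less skewed than the raw data
print(round(skew(population), 2), round(skew(sample_means), 2))
```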
But if the sample size is still too small to assume normality, we have no choice but to use a non-parametric approach such as the Mann–Whitney U test.
## Mann–Whitney U test
This test makes no assumption about the nature of the sampling distributions, so it is fully nonparametric. The idea of the [Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is to compute the following **U statistic**.

Mann-Whitney U test (image by author)
The values of this test statistic are tabulated, as the distribution can be computed under the null hypothesis that, for random samples **X** and **Y** from the two populations, the probability **P(X \< Y)** is the same as **P(X \> Y)**.
In our example, using the Mann–Whitney U test we obtain **u = 76**, which gives **p-value ≈ 8.0%**.
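This result can be reproduced with SciPy's `mannwhitneyu` (a sketch; with ties present, as here, SciPy falls back to the tie-corrected normal approximation):

```python
import numpy as np
from scipy.stats import mannwhitneyu

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])     # layout A revenues
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])          # layout B revenues

# Two-sided Mann-Whitney U test; the statistic counts pairs (x, y)
# with x > y, counting ties as 1/2
u_stat, p_value = mannwhitneyu(x, y, alternative="two-sided")
print(u_stat, round(p_value, 3))
```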
## Conclusion
In this article we have seen how different kinds of metrics, sample sizes, and sampling distributions require different statistical tests for computing the significance of A/B tests. We can summarize all these possibilities in the form of a decision tree.

Summary of the statistical tests to be used for A/B testing (image by author)
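The decision tree can be sketched as a small helper function. This is a rough paraphrase of the choices discussed above (the parameter names are mine, not the article's; "large_sample" stands in for "normal or large enough to invoke the central limit theorem"):

```python
def choose_test(metric: str, large_sample: bool,
                known_variance: bool = False,
                similar_variance: bool = True) -> str:
    """Rough sketch of the A/B-test decision tree discussed above."""
    if metric == "discrete":
        # exact p-values for small samples, chi-squared approximation otherwise
        return "Pearson's chi-squared test" if large_sample else "Fisher's exact test"
    # continuous metric: small non-normal samples call for a nonparametric test
    if not large_sample:
        return "Mann-Whitney U test"
    if known_variance:
        return "Z-test"
    return "Student's t-test" if similar_variance else "Welch's t-test"

print(choose_test("discrete", large_sample=False))   # Fisher's exact test
print(choose_test("continuous", large_sample=True))  # Student's t-test
```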
If you want to know more, you can start by playing with **this notebook**, where you can see all the examples discussed in this article!