🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 79 (from laksa134)
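The shard shown above is presumably derived deterministically from the URL, so the same page always maps to the same backend. The sketch below is purely hypothetical: the hash function, the shard count, and the mapping to hosts like laksa134 are all assumptions, not the tool's actual scheme.

```python
import hashlib

def shard_for_url(url: str, num_shards: int = 512) -> int:
    # Hypothetical: hash the URL and reduce modulo the shard count,
    # so the same URL always lands on the same shard/host.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

url = "https://towardsdatascience.com/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499/"
print(shard_for_url(url))  # a stable value in [0, 512)
```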

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (11 hours ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
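Read together, the five conditions amount to a single indexability predicate. The following is a hypothetical Python re-statement: the field names come from the table, the record below mirrors the page details shown on this page, and the 183-day cutoff is an approximation of 6 MONTH.

```python
from datetime import datetime, timedelta

def passes_page_info_filters(page: dict, now: datetime) -> bool:
    """Re-state the table's five PASS conditions as one boolean predicate."""
    return (
        page["download_http_code"] == 200                               # HTTP status
        and page["download_stamp"] > now - timedelta(days=183)          # age cutoff
        and page["history_drop_reason"] is None                         # history drop
        and page["fh_dont_index"] != 1 and page["ml_spam_score"] == 0   # spam/ban
        and page["meta_canonical"] in (None, "", page["src_unparsed"])  # canonical
    )

page = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 4, 6, 13, 52, 50),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "https://towardsdatascience.com/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499/",
}
print(passes_page_info_filters(page, datetime(2026, 4, 7)))  # True
```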

Page Details

| Property | Value |
|---|---|
| URL | https://towardsdatascience.com/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499/ |
| Last Crawled | 2026-04-06 13:52:50 (11 hours ago) |
| First Indexed | 2025-02-11 06:39:55 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | A/B Testing - A complete guide to statistical testing \| Towards Data Science |
| Meta Description | null |
| Meta Canonical | null |
Boilerpipe Text
Getting Started

For marketers and data scientists alike, it's crucial to set up the right test.

Photo by John McArthur on Unsplash

What is A/B testing?

A/B testing is one of the most popular controlled experiments used to optimize web marketing strategies. It allows decision makers to choose the best design for a website by looking at the analytics results obtained with two possible alternatives, A and B. In this article we'll see how different statistical methods can be used to make A/B testing successful. I also recommend having a look at this notebook, where you can play with the examples discussed in this article.

To understand what A/B testing is about, let's consider two alternative designs, A and B. Visitors to the website are randomly served one of the two. Data about their activity is then collected by web analytics. Given this data, one can apply statistical tests to determine whether one of the two designs has better efficacy.

Different kinds of metrics can be used to measure a website's efficacy. With discrete metrics, also called binomial metrics, only the two values 0 and 1 are possible. The following are examples of popular discrete metrics.

- Click-through rate – if a user is shown an advertisement, do they click on it?
- Conversion rate – if a user is shown an advertisement, do they convert into a customer?
- Bounce rate – if a user visits a website, is the next page they visit on the same website?

Discrete metrics: click-through rate (image by author)

With continuous metrics, also called non-binomial metrics, the metric may take continuous values that are not limited to two discrete states. The following are examples of popular continuous metrics.

- Average revenue per user – how much revenue does a user generate in a month?
- Average session duration – how long does a user stay on a website in a session?
- Average order value – what is the total value of a user's order?

Continuous metrics: average order value (image by author)

We are going to see in detail how discrete and continuous metrics require different statistical tests. But first, let's quickly review some fundamental concepts of statistics.

Statistical significance

With the data collected from the activity of our website's users, we can compare the efficacy of the two designs A and B. Simply comparing mean values wouldn't be very meaningful, as we would fail to assess the statistical significance of our observations. It is fundamental to determine how likely it is that the observed discrepancy between the two samples originates from chance. To do that, we will use a two-sample hypothesis test. Our null hypothesis H0 is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, average revenue per user, etc. The statistical significance is then measured by the p-value, i.e. the probability of observing a discrepancy between our samples at least as strong as the one we actually observed.

P-value (image by author)

Now, some care has to be taken to properly choose the alternative hypothesis Ha. This choice corresponds to the choice between one- and two-tailed tests. A two-tailed test is preferable in our case, since we have no way to know a priori whether the discrepancy between the results of A and B will be in favor of A or of B. This means that our alternative hypothesis Ha is that A and B have different efficacy.

One- and two-tailed tests (image by author)

The p-value is therefore computed as the area under the two tails of the probability density function p(x) of a chosen test statistic, over all x' such that p(x') <= p(our observation). The computation of this p-value clearly depends on the data distribution, so we will first see how to compute it for discrete metrics, and then for continuous metrics.

Discrete metrics

Let's first consider a discrete metric such as the click-through rate. We randomly show visitors one of two possible designs of an advertisement, and we keep track of how many of them click on it. Let's say that we collected the following information.

- nX = 15 visitors saw advertisement A, and 7 of them clicked on it.
- nY = 19 visitors saw advertisement B, and 15 of them clicked on it.

Click-through ratios: contingency table (image by author)

At first glance, it looks like version B was more effective, but how statistically significant is this discrepancy?

Fisher's exact test

Using the 2×2 contingency table shown above, we can use Fisher's exact test to compute an exact p-value and test our hypothesis. To understand how this test works, let us start by noticing that if we fix the margins of the table (i.e. the four sums of each row and column), then only a few different outcomes are possible.

Click-through ratios: possible outcomes (image by author)

Now, the key observation is that, under the null hypothesis H0 that A and B have the same efficacy, the probability of observing any of these possible outcomes is given by the hypergeometric distribution.

Hypergeometric distribution of possible outcomes (image by author)

Using this formula we obtain that:

- the probability of seeing our actual observations is ~4.5%;
- the probability of seeing even more unlikely observations in favor of B is ~1.0% (left tail);
- the probability of seeing even more unlikely observations in favor of A is ~2.0% (right tail).

Click-through ratios: tails and p-value (image by author)

So Fisher's exact test gives p-value ≈ 7.5%.

Pearson's chi-squared test

Fisher's exact test has the important advantage of computing exact p-values. But if we have a large sample size, it may be computationally inefficient. In this case, we can use Pearson's chi-squared test to compute an approximate p-value.
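Both contingency-table tests can be reproduced in a few lines. The following is a minimal sketch using only the Python standard library (the notebook the author refers to is not shown here, so this is an independent re-computation):

```python
import math

# Margins of the 2x2 table: 15 saw A (7 clicked), 19 saw B (15 clicked)
n_a, n_b = 15, 19
clicks, total = 7 + 15, 15 + 19

def p_table(k):
    """Hypergeometric probability that exactly k of A's visitors clicked,
    given that all four margins of the table are fixed."""
    return math.comb(n_a, k) * math.comb(n_b, clicks - k) / math.comb(total, clicks)

# Fisher's exact test: sum the probabilities of every outcome at most as
# likely as the one observed (k = 7), i.e. both tails.
p_obs = p_table(7)
p_fisher = sum(p_table(k)
               for k in range(max(0, clicks - n_b), min(n_a, clicks) + 1)
               if p_table(k) <= p_obs * (1 + 1e-9))
print(f"Fisher: p ≈ {p_fisher:.3f}")                    # ≈ 0.075

# Pearson's chi-squared statistic, observed vs expected counts
observed = [7, 8, 15, 4]
expected = [n_a * clicks / total, n_a * (total - clicks) / total,
            n_b * clicks / total, n_b * (total - clicks) / total]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_chi2 = math.erfc(math.sqrt(chi2 / 2))                 # chi2 survival function, 1 dof
print(f"Pearson: chi2 ≈ {chi2:.3f}, p ≈ {p_chi2:.3f}")  # ≈ 3.825, ≈ 0.051
```

Note that no continuity (Yates) correction is applied to the chi-squared statistic, which is why it matches the value quoted in the text.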
Let us call Oij the observed value of the contingency table at row i and column j. Under the null hypothesis of independence of rows and columns, i.e. assuming that A and B have the same efficacy, we can easily compute the corresponding expected values Eij. Moreover, if the observations are normally distributed, then the χ2 statistic follows exactly a chi-square distribution with 1 degree of freedom.

Pearson's chi-squared test (image by author)

In fact, this test can also be used with non-normal observations if the sample size is large enough, thanks to the central limit theorem. In our example, using Pearson's chi-squared test we obtain χ2 ≈ 3.825, which gives p-value ≈ 5.1%.

Continuous metrics

Let's now consider the case of a continuous metric such as the average revenue per user. We randomly show visitors one of two possible layouts of our website, and based on how much revenue each user generates in a month, we want to determine whether one of the two layouts is more efficient. Let's consider the following case.

- nX = 17 users saw layout A, and then made the following purchases: 200$, 150$, 250$, 350$, 150$, 150$, 350$, 250$, 150$, 250$, 150$, 150$, 200$, 0$, 0$, 100$, 50$.
- nY = 14 users saw layout B, and then made the following purchases: 300$, 150$, 150$, 400$, 250$, 250$, 150$, 200$, 250$, 150$, 300$, 200$, 250$, 200$.

Average revenue per user: samples distribution (image by author)

Again, at first glance it looks like version B was more effective. But how statistically significant is this discrepancy?

Z-test

The Z-test can be applied under the following assumptions.

- The observations are normally distributed (or the sample size is large).
- The sampling distributions have known variances σX and σY.

Under the above assumptions, the Z-test exploits the fact that the following Z statistic has a standard normal distribution.

Z-test (image by author)

Unfortunately, in most real applications the standard deviations are unknown and must be estimated, so a t-test is preferable, as we will see later. Anyway, if in our case we knew the true values σX = 100 and σY = 90, we would obtain z ≈ -1.697, which corresponds to a p-value ≈ 9%.

Student's t-test

In most cases the variances of the sampling distributions are unknown, so we need to estimate them. Student's t-test can then be applied under the following assumptions.

- The observations are normally distributed (or the sample size is large).
- The sampling distributions have "similar" variances, σX ≈ σY.

Under the above assumptions, Student's t-test relies on the observation that the following t statistic has a Student's t distribution.

Student's t-test

Here SP is the pooled standard deviation obtained from the sample variances SX and SY, which are computed using the unbiased formula that applies Bessel's correction. In our example, using Student's t-test we obtain t ≈ -1.789 and ν = 29, which give p-value ≈ 8.4%.

Welch's t-test

In most cases Student's t-test can be effectively applied with good results. However, it may occasionally happen that its second assumption (similar variances of the sampling distributions) is violated. In that case we cannot compute a pooled variance, and instead of Student's t-test we should use Welch's t-test. This test operates under the same assumptions as Student's t-test but removes the requirement of similar variances. We then use a slightly different t statistic, which also has a Student's t distribution, but with a different number of degrees of freedom ν.

Welch's t-test

The somewhat complex formula for ν comes from the Welch–Satterthwaite equation. In our example, using Welch's t-test we obtain t ≈ -1.848 and ν ≈ 28.51, which give p-value ≈ 7.5%.

Continuous non-normal metrics

In the previous section on continuous metrics, we assumed that our observations came from normal distributions.
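The z- and t-statistics quoted above can be reproduced without any scientific libraries. A sketch with the standard library, where σX = 100 and σY = 90 are the known standard deviations assumed in the Z-test paragraph:

```python
import math
import statistics as st

a = [200, 150, 250, 350, 150, 150, 350, 250, 150, 250, 150, 150, 200, 0, 0, 100, 50]
b = [300, 150, 150, 400, 250, 250, 150, 200, 250, 150, 300, 200, 250, 200]
nx, ny = len(a), len(b)
diff = st.mean(a) - st.mean(b)

# Z-test: population standard deviations assumed known (100 and 90)
z = diff / math.sqrt(100**2 / nx + 90**2 / ny)
p_z = math.erfc(abs(z) / math.sqrt(2))             # two-tailed normal p-value
print(f"z ≈ {z:.3f}, p ≈ {p_z:.3f}")               # ≈ -1.697, ≈ 0.090

# Student's t-test: pool the Bessel-corrected sample variances
vx, vy = st.variance(a), st.variance(b)
sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
t_student = diff / math.sqrt(sp2 * (1 / nx + 1 / ny))
print(f"t ≈ {t_student:.3f}, df = {nx + ny - 2}")  # ≈ -1.789, df = 29

# Welch's t-test: no pooling; Welch–Satterthwaite degrees of freedom
se2 = vx / nx + vy / ny
t_welch = diff / math.sqrt(se2)
nu = se2**2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
print(f"t ≈ {t_welch:.3f}, nu ≈ {nu:.1f}")         # ≈ -1.848, nu ≈ 28.5
```

The t-based p-values need the Student's t CDF, which the standard library does not provide; they are omitted here, and the article's quoted values (≈ 8.4% and ≈ 7.5%) apply.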
But non-normal distributions are extremely common when dealing with quantities such as per-user monthly revenue. There are several ways in which normality is often violated:

- zero-inflated distributions – most users don't buy anything at all, so there are lots of zero observations;
- multimodal distributions – one market segment tends to purchase cheap products, while another segment purchases more expensive products.

Continuous non-normal distribution (image by author)

However, if we have enough samples, tests derived under normality assumptions, like the Z-test, Student's t-test, and Welch's t-test, can still be applied to observations that significantly deviate from normality. Indeed, thanks to the central limit theorem, the distribution of the test statistic tends to normality as the sample size increases. In the zero-inflated and multimodal example we are considering, even a sample size of 40 produces a distribution that is well approximated by a normal distribution.

Convergence to normality of a non-normal distribution (image by author)

But if the sample size is still too small to assume normality, we have no choice but to use a non-parametric approach such as the Mann–Whitney U test.

Mann–Whitney U test

This test makes no assumption on the nature of the sampling distributions, so it is fully non-parametric. The idea of the Mann–Whitney U test is to compute the following U statistic.

Mann–Whitney U test (image by author)

The values of this test statistic are tabulated, as its distribution can be computed under the null hypothesis that, for random samples X and Y from the two populations, the probability P(X < Y) is the same as P(X > Y). In our example, using the Mann–Whitney U test we obtain u = 76, which gives p-value ≈ 8.0%.

Conclusion

In this article we have seen that different kinds of metrics, sample sizes, and sampling distributions require different statistical tests for computing the significance of A/B tests.
We can summarize all these possibilities in the form of a decision tree.

Summary of the statistical tests to be used for A/B testing (image by author)

If you want to know more, you can start by playing with this notebook, where you can see all the examples discussed in this article!
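As a starting point, the Mann–Whitney numbers quoted above (u = 76, p ≈ 8.0%) can be reproduced with the standard library alone. A sketch using the same two purchase samples:

```python
import math
from collections import Counter

a = [200, 150, 250, 350, 150, 150, 350, 250, 150, 250, 150, 150, 200, 0, 0, 100, 50]
b = [300, 150, 150, 400, 250, 250, 150, 200, 250, 150, 300, 200, 250, 200]

# U statistic for sample A: count pairwise wins over B, ties scoring 1/2
u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
print(f"u = {u}")      # 76.0

# Normal approximation with tie-corrected variance (no continuity correction)
n1, n2 = len(a), len(b)
n = n1 + n2
ties = sum(t**3 - t for t in Counter(a + b).values())
mu = n1 * n2 / 2
sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))
z = (u - mu) / sigma
p = math.erfc(abs(z) / math.sqrt(2))
print(f"p ≈ {p:.3f}")  # ≈ 0.080
```

For samples this small, exact tabulated values are preferable; the normal approximation is used here only because the standard library has no Mann–Whitney tables.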
Markdown
[![Towards Data Science](https://towardsdatascience.com/wp-content/uploads/2025/02/TDS-Vector-Logo.svg)](https://towardsdatascience.com/) Publish AI, ML & data-science insights to a global community of data professionals. [Sign in]() [Submit an Article](https://contributor.insightmediagroup.io/) - [Latest](https://towardsdatascience.com/latest/) - [Editor’s Picks](https://towardsdatascience.com/tag/editors-pick/) - [Deep Dives](https://towardsdatascience.com/tag/deep-dives/) - [Newsletter](https://towardsdatascience.com/tag/the-variable/) - [Write For TDS](https://towardsdatascience.com/submissions/) Toggle Mobile Navigation - [LinkedIn](https://www.linkedin.com/company/towards-data-science/?originalSubdomain=ca) - [X](https://x.com/TDataScience) Toggle Search [Data Science](https://towardsdatascience.com/category/data-science/) # A/B Testing – A complete guide to statistical testing Optimizing web marketing strategies through statistical testing [Francesco Casalegno](https://towardsdatascience.com/author/francesco-casalegno/) Feb 17, 2021 10 min read Share ### [Getting Started](https://towardsdatascience.com/tagged/getting-started) ### For marketers and data scientists alike, it’s crucial to set up the right test. ![Photo by John McArthur on Unsplash](https://towardsdatascience.com/wp-content/uploads/2021/02/1cQOmpvcO_hKWHoB1sQNXYg-scaled.jpeg) Photo by [John McArthur](https://unsplash.com/@snowjam?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/red-and-blue?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) ## What is A/B testing? **A/B testing** is one of the most popular controlled experiments used to optimize web marketing strategies. It allows decision makers to choose the best design for a website by looking at the analytics results obtained with two possible alternatives A and B. 
In this article we’ll see how different statistical methods can be used to make A/B testing successful. I recommend you to also have a look at **this notebook** where you can play with the examples discussed in this article. To understand what A/B testing is about, let’s consider two alternative designs: A and B. Visitors of a website are randomly served with one of the two. Then, data about their activity is collected by web analytics. Given this data, one can apply statistical tests to determine whether one of the two designs has better efficacy. Now, different kinds of metrics can be used to measure a website efficacy. With **discrete metrics**, also called **binomial metrics**, only the two values **0** and **1** are possible. The following are examples of popular discrete metrics. - [Click-through rate](https://en.wikipedia.org/wiki/Click-through_rate) – if a user is shown an advertisement, do they click on it? - [Conversion rate](https://en.wikipedia.org/wiki/Conversion_rate_optimization) – if a user is shown an advertisement, do they convert into customers? - [Bounce rate](https://en.wikipedia.org/wiki/Bounce_rate) – if a user is visits a website, is the following visited page on the same website? ![Discrete metrics: click-through rate (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/11RGVtvjKCvoVc6m1H7LWIQ.png) Discrete metrics: click-through rate (image by author) With **continuous metrics**, also called **non-binomial metrics**,, the metric may take continuous values that are not limited to a set two discrete states. The following are examples of popular continuous metrics. - [Average revenue per user](https://en.wikipedia.org/wiki/Average_revenue_per_user) – how much revenue does a user generate in a month? - [Average session duration](https://en.wikipedia.org/wiki/Session_\(web_analytics\)) – for how long does a user stay on a website in a session? 
- [Average order value](https://www.optimizely.com/optimization-glossary/average-order-value/) – what is the total value of the order of a user? ![Continuous metrics: average order value (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1B9Q4djagVefVBiwADoxmwg.png) Continuous metrics: average order value (image by author) We are going to see in detail how discrete and continuous metrics require different statistical test. But first, let’s quickly review some fundamental concepts of statistics. ## Statistical significance With the data we collected from the activity of users of our website, we can compare the efficacy of the two designs A and B. Simply comparing mean values wouldn’t be very meaningful, as we would fail to assess the **statistical significance** of our observations. It is indeed fundamental to determine how likely it is that the observed discrepancy between the two samples originates from chance. In order to do that, we will use a [two-sample hypothesis test](https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing). Our **null hypothesis H0** is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, or average revenue per user, etc. The statistical significance is then measured by the **p-value**, i.e. the probability of observing a discrepancy between our samples at least as strong as the one that we actually observed. ![P-value (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1AUjqShmAQlwZkOUQ7Grq0Q.png) P-value (image by author) Now, some care has to be applied to properly choose the **alternative hypothesis Ha**. This choice corresponds to the choice between [one- and two- tailed tests](https://en.wikipedia.org/wiki/One-_and_two-tailed_tests) . A **two-tailed test** is preferable in our case, since we have no reason to know a priori whether the discrepancy between the results of A and B will be in favor of A or B. 
This means that we consider the alternative hypothesis **Ha** the hypothesis that A and B have different efficacy. ![One- and Two-tailed tests (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/13kWMH_uFHqMBg1QVj77DgQ.png) One- and Two-tailed tests (image by author) The **p-value** is therefore computed as the area under the the two tails of the probability density function **p(x)** of a chosen test statistic on all **x’** s.t. **p(x’) \<= p(our observation)**. The computation of such p-value clearly depends on the data distribution. So we will first see how to compute it for discrete metrics, and then for continuous metrics. ## Discrete metrics Let’s first consider a discrete metric such as the click-though rate. We randomly show visitors one of two possible designs of an advertisement, and we keep track of how many of them click on it. Let’s say that from we collected the following information. - **nX = 15** visitors saw the advertisement A, and **7** of them clicked on it. - **nY = 19** visitors saw the advertisement B, and **15** of them clicked on it. ![Click-through ratios: contingency table (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1Un_gx38ClZ4EwQGHj5PW3g.png) Click-through ratios: contingency table (image by author) At a first glance, it looks like version B was more effective, but how statistically significant is this discrepancy? ## Fisher’s exact test Using the 2×2 [contingency table shown above](https://en.wikipedia.org/wiki/Contingency_table) we can use [Fisher’s exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) to compute an exact p-value and test our hypothesis. To understand how this test works, let us start by noticing that if we fix the margins of the tables (i.e. the four sums of each row and column), then only few different outcomes are possible. 
![Click-through ratios: possible outcomes (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1JkSqL1HLPR03c1vkWa1iSQ.png) Click-through ratios: possible outcomes (image by author) Now, the key observation is that, under the null hypothesis H0 that A and B have same efficacy, the probability of observing any of these possible outcomes is given by the [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution) . ![Hypergeometric distribution of possible outcomes (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1jeoM9FBdMofqOvq67b2Xrg.png) Hypergeometric distribution of possible outcomes (image by author) Using this formula we obtain that: - the probability of seeing our actual observations is **~4.5%** - the probability of seeing even more unlikely observations in favor if B is **~1.0%** (left tail); - the probability of seeing observations even more unlikely observations in favor if A is **~2.0%** (right tail). ![Click-through ratios: tails and p-value (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1mehHfKLpJc8b5R-U8IJ_gQ.png) Click-through ratios: tails and p-value (image by author) So Fisher’s exact test gives **p-value ≈ 7.5%**. ## Pearson’s chi-squared test Fisher’s exact test has the important advantage of computing exact p-values. But if we have a large sample size, it may be computationally inefficient. In this case, we can use [Pearson’s chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test) to compute an approximate p-value. Let us call **Oij** the observed value of the contingency table at row **i** and column **j**. Under the null hypothesis of independence of rows and columns, i.e. assuming that A and B have same efficacy, we can easily compute corresponding expected values **Eij**. 
Moreover, if the observations are normally distributed, then the χ2 statistic follows exactly a [chi-square distribution](https://en.wikipedia.org/wiki/Chi-square_distribution) with 1 degree of freedom. ![Pearson's chi-squared test (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1okSzrUwypIAyrowTglzQpw.png) Pearson’s chi-squared test (image by author) In fact, this test can also be used with non-normal observations if the sample size is large enough, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem). In our example, using Pearson’s chi-square test we obtain **χ2 ≈ 3.825**, which gives **p-value ≈ 5.1%**. ## Continuous metrics Let’s now consider the case of a continuous metric such as the average revenue per user. We randomly show visitors one of two possible layouts of our website, and based on how much revenue each user generates in a month we want to determine if one of the two layouts is more efficient. Let’s consider the following case. - **nX = 17** users saw the layout A, and then made the following purchases: 200\$, 150\$, 250\$, 350\$, 150\$, 150\$, 350\$, 250\$, 150\$, 250\$, 150\$, 150\$, 200\$, 0\$, 0\$, 100\$, 50\$. - **nX = 14** users saw the layout B, and then made the following purchases: 300\$, 150\$, 150\$, 400\$, 250\$, 250\$, 150\$, 200\$, 250\$, 150\$, 300\$, 200\$, 250\$, 200\$. ![Average revenue per user: samples distribution (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1doVw74dcT3QriWpLrE322w.png) Average revenue per user: samples distribution (image by author) Again, at a first glance, it looks like version B was more effective. But how statistically significant is this discrepancy? ## Z-test The [Z-test](https://en.wikipedia.org/wiki/Z-test) can be applied under the following assumptions. - The observations are normally distributed (or the sample size is large). - The sampling distributions have known variance **σX** and **σY**. 
Under the above assumptions, the Z-test exploits the fact that the following **Z statistic** has a standard normal distribution. ![Z-test (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1st_79maYP3YyIjT7AB1RSQ.png) Z-test (image by author) Unfortunately in most real applications the standard deviations are unknown and must be estimated, so a t-test is preferable, as we will see later. Anyway, if in our case we knew the true value of **σX=100** and **σX=90**, then we would obtain **z ≈ -1.697**, which corresponds to a **p-value ≈ 9%**. ## Student’s t-test In most cases, the variances of the sampling distributions are unknown, so that we need to estimate them. [Student’s t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) can then be applied under the following assumptions. - The observations are normally distributed (or the sample size is large). - The sampling distributions have "similar" variances **σX ≈ σY**. Under the above assumptions, Student’s t-test relies on the observation that the following **t statistic** has a Student’s t distribution. ![Student's t-test](https://towardsdatascience.com/wp-content/uploads/2021/02/1UOfI_SzCTNA8AQngeRn34A.png) Student’s t-test Here **SP** is the [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_variance) obtained from the sample variances **SX** and **S Y**, which are computed using the unbiased formula that applies [Bessel’s correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) ). In our example, using Student’s t-test we obtain **t ≈ -1.789** and **ν = 29**, which give **p-value ≈ 8.4%**. ## Welch’s t-test In most cases Student’s t test can be effectively applied with good results. However, it may rarely happen that its second assumption (similar variance of the sampling distributions) is violated. In that case, we cannot compute a pooled variance and rather than Student’s t test we should use [Welch’s t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test). 
This test operates under the same assumptions of Student’s t-test but removes the requirement on the similar variances. Then, we can use a slightly different **t statistic**, which also has a Student’s t distribution, but with a different number of degrees of freedom **ν**. ![Welch's t-test](https://towardsdatascience.com/wp-content/uploads/2021/02/1ke1Ln8YN1L1jUkFJ0R1TNA.png) Welch’s t-test The complex formula for **ν** comes from [Welch–Satterthwaite equation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation) . In our example, using Welch’s t-test we obtain **t ≈ -1.848** and **ν ≈ 28.51**, which give **p-value ≈ 7.5%**. ## Continuous non-normal metrics In the previous section on continuous metrics, we assumed that our observations came from normal distributions. But non-normal distributions are extremely common when dealing with per-user monthly revenues etc. There are several ways in which normality is often violated: - [zero-inflated distributions](https://en.wikipedia.org/wiki/Zero-inflated_model) — most user don’t buy anything at all, so lots of zero observations; - [multimodal distributions](https://en.wikipedia.org/wiki/Multimodal_distribution) – a market segment tends purchases cheap products, while another segment purchases more expensive products. ![Continuous non-normal distribution (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1oPGk4tS_w_hqApwXst3zTQ.png) Continuous non-normal distribution (image by author) However, if we have enough samples, tests derived under normality assumptions like Z-test, Student’s t-test, and Welch’s t-test can still be applied for observations that signficantly deviate from normality. Indeed, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the distribution of the test statistics tends to normality as the sample size increases. 
In the zero-inflated and multimodal example we are considering, even a sample size of 40 produces a distribution that is well approximated by a normal distribution. ![Convergence to normality of a non-normal distribution (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1eNSBJ1kkIqGleVD3DgmfPw.png) Convergence to normality of a non-normal distribution (image by author) But if the sample size is still too small to assume normality, we have no other choice than using a non-parametric approach such as the Mann-Whitney U test. ## Mann–Whitney U test This test makes no assumption on the nature of the sampling distributions, so it is fully nonparametric. The idea of [Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is to compute the following **U statistic**. ![Mann-Whitney U test (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1Urx1TmPRxG1Le1WlCnXsPg.png) Mann-Whitney U test (image by author) The values of this test statistic are tabulated, as the distribution can be computed under the null hypothesis that, for random samples **X** and **Y** from the two populations, the probability **P(X \< Y)** is the same as **P(X \> Y)**. In our example, using Mann-Whitney U test we obtain **u = 76** which gives **p-value ≈ 8.0%**. ## Conclusion In this article we have seen that different kinds of metrics, sample size, and sampling distributions require different kinds of statistical tests for computing the the significance of A/B tests. We can summarize all these possibilities in the form of a decision tree. ![Summary of the statistical tests to be used for A/B testing (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1Vzkwzrs4DOmBBa1LymW-PQ.png) Summary of the statistical tests to be used for A/B testing (image by author) If you want to know more, you can start by playing with **this notebook** where you can see all the examples discussed in this article\! 
*** Written By Francesco Casalegno [See all from Francesco Casalegno](https://towardsdatascience.com/author/francesco-casalegno/) [A B Testing](https://towardsdatascience.com/tag/a-b-testing/), [Data Science](https://towardsdatascience.com/tag/data-science/), [Getting Started](https://towardsdatascience.com/tag/getting-started/), [Machine Learning](https://towardsdatascience.com/tag/machine-learning/), [Statistics](https://towardsdatascience.com/tag/statistics/) Share This Article - [Share on Facebook](https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Ftowardsdatascience.com%2Fa-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499%2F&title=A%2FB%20Testing%20%E2%80%93%20A%20complete%20guide%20to%20statistical%20testing) - [Share on LinkedIn](https://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Ftowardsdatascience.com%2Fa-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499%2F&title=A%2FB%20Testing%20%E2%80%93%20A%20complete%20guide%20to%20statistical%20testing) - [Share on X](https://x.com/share?url=https%3A%2F%2Ftowardsdatascience.com%2Fa-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499%2F&text=A%2FB%20Testing%20%E2%80%93%20A%20complete%20guide%20to%20statistical%20testing) Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program. 
### [Getting Started](https://towardsdatascience.com/tagged/getting-started) ### For marketers and data scientists alike, it’s crucial to set up the right test. ![Photo by John McArthur on Unsplash](https://towardsdatascience.com/wp-content/uploads/2021/02/1cQOmpvcO_hKWHoB1sQNXYg-scaled.jpeg) Photo by [John McArthur](https://unsplash.com/@snowjam?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/red-and-blue?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) ## What is A/B testing? **A/B testing** is one of the most popular controlled experiments used to optimize web marketing strategies. It allows decision makers to choose the best design for a website by looking at the analytics results obtained with two possible alternatives A and B. In this article we’ll see how different statistical methods can be used to make A/B testing successful. I recommend also having a look at **this notebook**, where you can play with the examples discussed in this article. To understand what A/B testing is about, let’s consider two alternative designs: A and B. Visitors of a website are randomly served one of the two. Then, data about their activity is collected by web analytics. Given this data, one can apply statistical tests to determine whether one of the two designs has better efficacy. Now, different kinds of metrics can be used to measure a website’s efficacy. With **discrete metrics**, also called **binomial metrics**, only the two values **0** and **1** are possible. The following are examples of popular discrete metrics. - [Click-through rate](https://en.wikipedia.org/wiki/Click-through_rate) – if a user is shown an advertisement, do they click on it? - [Conversion rate](https://en.wikipedia.org/wiki/Conversion_rate_optimization) – if a user is shown an advertisement, do they convert into customers? 
- [Bounce rate](https://en.wikipedia.org/wiki/Bounce_rate) – if a user visits a website, is the following visited page on the same website? ![Discrete metrics: click-through rate (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/11RGVtvjKCvoVc6m1H7LWIQ.png) Discrete metrics: click-through rate (image by author) With **continuous metrics**, also called **non-binomial metrics**, the metric may take continuous values that are not limited to two discrete states. The following are examples of popular continuous metrics. - [Average revenue per user](https://en.wikipedia.org/wiki/Average_revenue_per_user) – how much revenue does a user generate in a month? - [Average session duration](https://en.wikipedia.org/wiki/Session_\(web_analytics\)) – for how long does a user stay on a website in a session? - [Average order value](https://www.optimizely.com/optimization-glossary/average-order-value/) – what is the total value of the order of a user? ![Continuous metrics: average order value (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1B9Q4djagVefVBiwADoxmwg.png) Continuous metrics: average order value (image by author) We are going to see in detail how discrete and continuous metrics require different statistical tests. But first, let’s quickly review some fundamental concepts of statistics. ## Statistical significance With the data we collected from the activity of users of our website, we can compare the efficacy of the two designs A and B. Simply comparing mean values wouldn’t be very meaningful, as we would fail to assess the **statistical significance** of our observations. It is indeed fundamental to determine how likely it is that the observed discrepancy between the two samples originates from chance. In order to do that, we will use a [two-sample hypothesis test](https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing). 
Our **null hypothesis H0** is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, or average revenue per user, etc. The statistical significance is then measured by the **p-value**, i.e. the probability of observing a discrepancy between our samples at least as strong as the one that we actually observed. ![P-value (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1AUjqShmAQlwZkOUQ7Grq0Q.png) P-value (image by author) Now, some care has to be applied to properly choose the **alternative hypothesis Ha**. This choice corresponds to the choice between [one- and two-tailed tests](https://en.wikipedia.org/wiki/One-_and_two-tailed_tests). A **two-tailed test** is preferable in our case, since we have no reason to know a priori whether the discrepancy between the results of A and B will be in favor of A or B. This means that we take as alternative hypothesis **Ha** the hypothesis that A and B have different efficacy. ![One- and Two-tailed tests (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/13kWMH_uFHqMBg1QVj77DgQ.png) One- and Two-tailed tests (image by author) The **p-value** is therefore computed as the area under the two tails of the probability density function **p(x)** of a chosen test statistic, over all **x’** s.t. **p(x’) \<= p(our observation)**. The computation of such a p-value clearly depends on the data distribution. So we will first see how to compute it for discrete metrics, and then for continuous metrics. ## Discrete metrics Let’s first consider a discrete metric such as the click-through rate. We randomly show visitors one of two possible designs of an advertisement, and we keep track of how many of them click on it. Let’s say that we collected the following information. - **nX = 15** visitors saw the advertisement A, and **7** of them clicked on it. - **nY = 19** visitors saw the advertisement B, and **15** of them clicked on it. 
![Click-through ratios: contingency table (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1Un_gx38ClZ4EwQGHj5PW3g.png) Click-through ratios: contingency table (image by author) At first glance, it looks like version B was more effective, but how statistically significant is this discrepancy? ## Fisher’s exact test Using the 2×2 [contingency table shown above](https://en.wikipedia.org/wiki/Contingency_table) we can use [Fisher’s exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) to compute an exact p-value and test our hypothesis. To understand how this test works, let us start by noticing that if we fix the margins of the table (i.e. the four sums of each row and column), then only a few different outcomes are possible. ![Click-through ratios: possible outcomes (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1JkSqL1HLPR03c1vkWa1iSQ.png) Click-through ratios: possible outcomes (image by author) Now, the key observation is that, under the null hypothesis H0 that A and B have the same efficacy, the probability of observing any of these possible outcomes is given by the [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution). ![Hypergeometric distribution of possible outcomes (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1jeoM9FBdMofqOvq67b2Xrg.png) Hypergeometric distribution of possible outcomes (image by author) Using this formula we obtain that: - the probability of seeing our actual observations is **~4.5%**; - the probability of seeing even more unlikely observations in favor of B is **~1.0%** (left tail); - the probability of seeing even more unlikely observations in favor of A is **~2.0%** (right tail). 
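A minimal sketch of these computations with `scipy`, using the counts from the contingency table above (the chi-squared approximation discussed in the next section is included for comparison):

```python
import numpy as np
from scipy import stats

# Contingency table: rows = designs A and B, columns = click / no click
table = np.array([[7, 8],     # A: 7 of 15 visitors clicked
                  [15, 4]])   # B: 15 of 19 visitors clicked

# Under H0, with all margins fixed (34 visitors, 22 clicks in total,
# 15 visitors shown A), the clicks in group A follow a hypergeometric law.
rv = stats.hypergeom(M=34, n=22, N=15)
p_obs = rv.pmf(7)                                # our observation, ~4.5%
p_left = sum(rv.pmf(k) for k in range(3, 7))     # more extreme in favor of B, ~1.0%
p_right = sum(rv.pmf(k) for k in range(13, 16))  # more extreme in favor of A, ~2.0%

# Fisher's exact test sums all outcomes at most as likely as the observed one
_, p_fisher = stats.fisher_exact(table, alternative="two-sided")

# Pearson's chi-squared approximation (no Yates continuity correction)
chi2, p_chi2, _, _ = stats.chi2_contingency(table, correction=False)

print(f"Fisher: p ≈ {p_fisher:.3f}")                       # ≈ 0.075
print(f"Chi-squared: chi2 ≈ {chi2:.3f}, p ≈ {p_chi2:.3f}")  # ≈ 3.825, p ≈ 0.051
```

Note that the two-sided Fisher p-value is exactly the sum of the three probabilities computed above: `p_obs + p_left + p_right ≈ 7.5%`.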
![Click-through ratios: tails and p-value (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1mehHfKLpJc8b5R-U8IJ_gQ.png) Click-through ratios: tails and p-value (image by author) So Fisher’s exact test gives **p-value ≈ 7.5%**. ## Pearson’s chi-squared test Fisher’s exact test has the important advantage of computing exact p-values. But if we have a large sample size, it may be computationally inefficient. In this case, we can use [Pearson’s chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test) to compute an approximate p-value. Let us call **Oij** the observed value of the contingency table at row **i** and column **j**. Under the null hypothesis of independence of rows and columns, i.e. assuming that A and B have the same efficacy, we can easily compute the corresponding expected values **Eij**. Moreover, if the observations are normally distributed, then the χ2 statistic follows exactly a [chi-square distribution](https://en.wikipedia.org/wiki/Chi-square_distribution) with 1 degree of freedom. ![Pearson's chi-squared test (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1okSzrUwypIAyrowTglzQpw.png) Pearson’s chi-squared test (image by author) In fact, this test can also be used with non-normal observations if the sample size is large enough, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem). In our example, using Pearson’s chi-squared test we obtain **χ2 ≈ 3.825**, which gives **p-value ≈ 5.1%**. ## Continuous metrics Let’s now consider the case of a continuous metric such as the average revenue per user. We randomly show visitors one of two possible layouts of our website, and based on how much revenue each user generates in a month we want to determine if one of the two layouts is more efficient. Let’s consider the following case. 
- **nX = 17** users saw the layout A, and then made the following purchases: 200\$, 150\$, 250\$, 350\$, 150\$, 150\$, 350\$, 250\$, 150\$, 250\$, 150\$, 150\$, 200\$, 0\$, 0\$, 100\$, 50\$. - **nY = 14** users saw the layout B, and then made the following purchases: 300\$, 150\$, 150\$, 400\$, 250\$, 250\$, 150\$, 200\$, 250\$, 150\$, 300\$, 200\$, 250\$, 200\$. ![Average revenue per user: samples distribution (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1doVw74dcT3QriWpLrE322w.png) Average revenue per user: samples distribution (image by author) Again, at first glance it looks like version B was more effective. But how statistically significant is this discrepancy? ## Z-test The [Z-test](https://en.wikipedia.org/wiki/Z-test) can be applied under the following assumptions. - The observations are normally distributed (or the sample size is large). - The sampling distributions have known variances **σX** and **σY**. Under the above assumptions, the Z-test exploits the fact that the following **Z statistic** has a standard normal distribution. ![Z-test (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1st_79maYP3YyIjT7AB1RSQ.png) Z-test (image by author) Unfortunately, in most real applications the standard deviations are unknown and must be estimated, so a t-test is preferable, as we will see later. Anyway, if in our case we knew the true values **σX = 100** and **σY = 90**, then we would obtain **z ≈ -1.697**, which corresponds to a **p-value ≈ 9%**. ## Student’s t-test In most cases, the variances of the sampling distributions are unknown, so we need to estimate them. [Student’s t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) can then be applied under the following assumptions. - The observations are normally distributed (or the sample size is large). - The sampling distributions have "similar" variances **σX ≈ σY**. 
Under the above assumptions, Student’s t-test relies on the observation that the following **t statistic** has a Student’s t distribution. ![Student's t-test](https://towardsdatascience.com/wp-content/uploads/2021/02/1UOfI_SzCTNA8AQngeRn34A.png) Student’s t-test Here **SP** is the [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_variance) obtained from the sample variances **SX** and **SY**, which are computed using the unbiased formula that applies [Bessel’s correction](https://en.wikipedia.org/wiki/Bessel%27s_correction). In our example, using Student’s t-test we obtain **t ≈ -1.789** and **ν = 29**, which give **p-value ≈ 8.4%**. ## Welch’s t-test In most cases Student’s t-test can be effectively applied with good results. However, it may happen that its second assumption (similar variance of the sampling distributions) is violated. In that case, we cannot compute a pooled variance, and rather than Student’s t-test we should use [Welch’s t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test). This test operates under the same assumptions as Student’s t-test but removes the requirement of similar variances. We then use a slightly different **t statistic**, which also has a Student’s t distribution, but with a different number of degrees of freedom **ν**. ![Welch's t-test](https://towardsdatascience.com/wp-content/uploads/2021/02/1ke1Ln8YN1L1jUkFJ0R1TNA.png) Welch’s t-test The complex formula for **ν** comes from the [Welch–Satterthwaite equation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation). In our example, using Welch’s t-test we obtain **t ≈ -1.848** and **ν ≈ 28.51**, which give **p-value ≈ 7.5%**. ## Continuous non-normal metrics In the previous section on continuous metrics, we assumed that our observations came from normal distributions. But non-normal distributions are extremely common when dealing with per-user monthly revenues etc. 
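The three tests above can be reproduced on the revenue samples with a short `scipy` sketch; the Z-test is computed by hand from its formula, using the hypothetically known **σX = 100** and **σY = 90** from the example:

```python
import numpy as np
from scipy import stats

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])          # layout A (nX = 17)
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])               # layout B (nY = 14)

# Z-test with (hypothetically) known standard deviations sigma_x, sigma_y
z = (x.mean() - y.mean()) / np.sqrt(100**2 / len(x) + 90**2 / len(y))
p_z = 2 * stats.norm.sf(abs(z))

# Student's t-test (pooled variance) and Welch's t-test (unequal variances)
t_student, p_student = stats.ttest_ind(x, y, equal_var=True)
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)

print(f"Z:       z ≈ {z:.3f}, p ≈ {p_z:.3f}")                # ≈ -1.697, p ≈ 0.090
print(f"Student: t ≈ {t_student:.3f}, p ≈ {p_student:.3f}")  # ≈ -1.789, p ≈ 0.084
print(f"Welch:   t ≈ {t_welch:.3f}, p ≈ {p_welch:.3f}")      # ≈ -1.848, p ≈ 0.075
```

The only difference between the two `ttest_ind` calls is the `equal_var` flag, which switches between the pooled-variance t statistic and the Welch–Satterthwaite version.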
There are several ways in which normality is often violated: - [zero-inflated distributions](https://en.wikipedia.org/wiki/Zero-inflated_model) – most users don’t buy anything at all, so there are lots of zero observations; - [multimodal distributions](https://en.wikipedia.org/wiki/Multimodal_distribution) – one market segment tends to purchase cheap products, while another segment purchases more expensive products. ![Continuous non-normal distribution (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1oPGk4tS_w_hqApwXst3zTQ.png) Continuous non-normal distribution (image by author) However, if we have enough samples, tests derived under normality assumptions like the Z-test, Student’s t-test, and Welch’s t-test can still be applied to observations that significantly deviate from normality. Indeed, thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the distribution of the test statistics tends to normality as the sample size increases. In the zero-inflated and multimodal example we are considering, even a sample size of 40 produces a distribution of the sample mean that is well approximated by a normal distribution. ![Convergence to normality of a non-normal distribution (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1eNSBJ1kkIqGleVD3DgmfPw.png) Convergence to normality of a non-normal distribution (image by author) But if the sample size is still too small to assume normality, we have no choice but to use a non-parametric approach such as the Mann-Whitney U test. ## Mann–Whitney U test This test makes no assumption on the nature of the sampling distributions, so it is fully nonparametric. The idea of the [Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is to compute the following **U statistic**. 
![Mann-Whitney U test (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1Urx1TmPRxG1Le1WlCnXsPg.png) Mann-Whitney U test (image by author) The values of this test statistic are tabulated, as its distribution can be computed under the null hypothesis that, for random samples **X** and **Y** from the two populations, the probability **P(X \< Y)** is the same as **P(X \> Y)**. In our example, using the Mann-Whitney U test we obtain **u = 76**, which gives **p-value ≈ 8.0%**. ## Conclusion In this article we have seen that different kinds of metrics, sample sizes, and sampling distributions require different kinds of statistical tests for computing the significance of A/B tests. We can summarize all these possibilities in the form of a decision tree. ![Summary of the statistical tests to be used for A/B testing (image by author)](https://towardsdatascience.com/wp-content/uploads/2021/02/1Vzkwzrs4DOmBBa1LymW-PQ.png) Summary of the statistical tests to be used for A/B testing (image by author) If you want to know more, you can start by playing with **this notebook**, where you can see all the examples discussed in this article!
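For completeness, here is a sketch of the Mann-Whitney U test from the section above with `scipy`; note that `mannwhitneyu` reports the U statistic of its first argument, so the article’s **u = 76** corresponds to the smaller of U and nX·nY − U:

```python
import numpy as np
from scipy import stats

x = np.array([200, 150, 250, 350, 150, 150, 350, 250, 150,
              250, 150, 150, 200, 0, 0, 100, 50])          # layout A (nX = 17)
y = np.array([300, 150, 150, 400, 250, 250, 150, 200,
              250, 150, 300, 200, 250, 200])               # layout B (nY = 14)

# Two-sided Mann-Whitney U test; with ties present, scipy falls back to the
# normal approximation with tie correction.
u, p_u = stats.mannwhitneyu(x, y, alternative="two-sided")
u_min = min(u, len(x) * len(y) - u)   # convention used in the article

print(f"u = {u_min}, p ≈ {p_u:.3f}")
```

This should match the article’s **u = 76** and **p-value ≈ 8.0%**.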
- Shard: 79 (laksa)
- Root Hash: 12035788063718406279
- Unparsed URL: com,towardsdatascience!/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499/ s443