A Simple Guide to A/B Testing in Python
A/B testing is a crucial data science skill. It’s often used to test the effectiveness of Website A vs. Website B or Drug A vs. Drug B, or any two variations on one idea with the same primary motivation, whether that’s sales, drug efficacy, or customer retention. It’s one of those statistical concepts prone to an extra layer of confusion, because hypothesis testing alone requires understanding the Normal distribution, z-scores, p-values, and careful framing of the null hypothesis; with A/B testing, we have two samples to deal with. However, A/B testing is still just hypothesis testing at heart! This guide is a walkthrough from simple, one-sample hypothesis testing up to two-sample hypothesis testing, or A/B testing.
The Simple One-Sample Hypothesis Test
Let’s consider the domain of apple farmers and apple sizes. We know that historically, red apples in a particular set of orchards have a mean width of 3.5 inches, with a standard deviation of .2 inches. Therefore:

μ = 3.5, σ = 0.2
But farmer McIntosh claims to have a special new type of red apple from his orchard, which is extra delicious and larger than other farmers’ apples. To frame the null and alternative hypotheses, H0 and H1 respectively:

H0: μ ≤ 3.5 (McIntosh’s apples are no wider than the historical average)
H1: μ > 3.5 (McIntosh’s apples are wider than the historical average)
A significance level of 5% is chosen for alpha, which is the standard level of significance for single-sample hypothesis testing, as well as a somewhat arbitrary convention among statisticians and researchers. Let’s say that we take a sample of 40 apples from farmer McIntosh’s orchard, and they average 4 inches in width with a sample standard deviation of .5:

n = 40, x̄ = 4, s = 0.5
We now have everything we need to carry out the test. To refresh the logic behind any hypothesis test, take a look at the following equation for obtaining the test statistic:

z = (x̄ − μ) / (s / √n)
The idea is that we’re taking x̄, our sample mean, and finding its difference from the population mean that the sample is presumably drawn from, in this case 3.5. We then divide by the standard error to give us our test statistic:

z = (4 − 3.5) / (0.5 / √40) ≈ 6.32
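To sanity-check that arithmetic, here is the same computation in Python (a small sketch; the variable names are my own, not from the original notebook):

```python
import numpy as np

mu = 3.5      # historical population mean width (inches)
x_bar = 4.0   # sample mean of the 40 McIntosh apples
s = 0.5       # sample standard deviation
n = 40        # sample size

standard_error = s / np.sqrt(n)
z = (x_bar - mu) / standard_error
print(round(z, 2))  # → 6.32
```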
As a reminder, the test statistic is an indicator of how likely we are to obtain this sample by pure chance, assuming the null hypothesis is true. The standard normal table, or z-score table, can be used to find the corresponding area under the Normal curve, which is the probability of obtaining that result by chance. In Python, scipy.stats is an incredibly useful set of built-in functions for such purposes, and scipy.stats.norm.cdf and scipy.stats.norm.sf give the area under the normal distribution to the left and right of a test statistic, respectively. In this case, 6.32 is an incredibly high test statistic, corresponding to an extremely low probability of obtaining these apple sizes by chance:
Because our alpha was set at .05, we can use st.norm.isf(.05) to determine the critical value which, if our test statistic exceeds it, allows us to reject the null hypothesis. The critical value in this case evaluates to 1.64, a common and recognizable number, since it’s always the one-tailed critical value for an alpha of .05 under an assumption of Normality. Our test statistic was 6.32, which far exceeds the critical value, so we can safely reject the null hypothesis and say that farmer McIntosh certainly produces larger-than-average apples.
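The critical-value check can be sketched as follows (st is scipy.stats, as above):

```python
import scipy.stats as st

alpha = 0.05
crit = st.norm.isf(alpha)     # inverse survival function: one-tailed critical value
print(round(crit, 2))         # → 1.64

test_statistic = 6.32
print(test_statistic > crit)  # True: reject the null hypothesis
```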
A/B Testing With Simulated Data
Now that we’ve gone through the process for a single-sample hypothesis test, let’s look at some simulated data to get a feel for a two-sample test, keeping in mind that A/B testing will still, at the end of the day, be testing a null hypothesis vs. an alternative hypothesis. The Jupyter notebooks and data can be found on my GitHub, and I encourage you to actually run some of the functions yourself in order to get a hands-on feel for the data, even if you just change the random seed to witness how the sampling changes.
Here are one thousand randomly generated, Normally distributed points, created using sklearn.datasets.make_gaussian_quantiles. We will call this our population:
```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_gaussian_quantiles

pop, ignore_classes = make_gaussian_quantiles(n_samples=1000, n_features=2,
                                              cov=1, n_classes=1, random_state=0)

plt.figure(figsize=(15, 10))
plt.scatter(pop[:, 0], pop[:, 1], s=5, color='cornflowerblue')
plt.show()
```
For the sake of simplicity, let’s limit the scope of our tests to the x-axis. This means that these visualizations could have been on a number line, but two dimensions allows us to more easily visualize the distribution of random samples.
Now let’s take two random samples of size 30 from this population:
```python
import numpy as np

# Draw two random sets of 30 indices (without replacement) into the population
rand1 = np.random.choice(range(1000), 30, replace=False)
rand2 = np.random.choice(range(1000), 30, replace=False)

sample1_x = pop[:, 0][rand1]
sample2_x = pop[:, 0][rand2]
```
Our first sample mean, in red, is -.23 with a sample standard deviation of .93, and our second sample mean, in green, is -.13 with a sample standard deviation of .99. Since we know the true population mean and standard deviation (because the original one-thousand points were generated from a Normal distribution) are μ=0 and σ=1.0, we can safely say these samples are pretty representative of the whole population. But how representative are they? It looks like both means are a little under the real thing. At which point would we get suspicious, and say that perhaps these samples come from different populations altogether? This question is what A/B testing tries to answer.
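The quoted statistics can be reproduced along these lines (a sketch using numpy directly as a stand-in for the notebook’s population; the exact numbers depend on the random seed, so yours may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
pop_x = rng.normal(0, 1, 1000)   # stand-in for the population's x-coordinates

sample1_x = rng.choice(pop_x, 30, replace=False)
sample2_x = rng.choice(pop_x, 30, replace=False)

# Sample means and (ddof=1) sample standard deviations
print(np.mean(sample1_x), np.std(sample1_x, ddof=1))
print(np.mean(sample2_x), np.std(sample2_x, ddof=1))
```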
Just like with one-sample hypothesis testing, we have a formula for the test statistic with two samples. Even though it looks far more complex, it’s still just the observed difference between the sample means, minus the difference hypothesized under the null, divided by the standard error:
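The formula image from the original post isn’t reproduced here; the standard two-sample z-statistic it refers to has the form:

```latex
z = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}
         {\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
```

where, under the null hypothesis, the hypothesized difference μ_A − μ_B is zero.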
Essentially, even though we’re dealing with two samples, we’re still assuming that both samples are drawn from a Normal distribution, and so under the null hypothesis in a two-sample test, we assume the difference between sample means will be zero. This effectively reduces a two-sample test to a single-sample hypothesis test. The process of picking a level of significance (an alpha) and finding a critical value is nearly identical.
The following code snippet demonstrates a Python function for an A/B test, with an option for setting different alphas:
```python
import numpy as np
import scipy.stats as st

def ab_test(sample_A, sample_B, alpha=.05):
    mean_A = np.mean(sample_A)
    mean_B = np.mean(sample_B)
    std_A = np.std(sample_A)
    std_B = np.std(sample_B)
    standard_error = np.sqrt((std_A**2)/len(sample_A) + (std_B**2)/len(sample_B))
    difference = mean_A - mean_B
    test_statistic = difference / standard_error
    crit = st.norm.ppf(alpha/2) * -1   # two-tailed critical value for the given alpha
    reject_status = ('Reject Null Hypothesis' if abs(test_statistic) > crit
                     else 'Fail to Reject Null Hypothesis')
    return 'Test Statistic: ', test_statistic, 'Critical Value: ', crit, reject_status
```
Let’s see how likely it was that we obtained these sample means (which we know should have a difference of zero) due to pure chance:
Our test statistic for these two samples is very close to zero, as it should be, and so we fail to reject the null hypothesis, since both samples come from the same population and the true difference between their means is exactly zero.
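As a sketch of how such a test behaves on two samples from the same population, here is a compact variant of the function above, returning numbers instead of labeled strings (repeated so the snippet is self-contained; the seed and names are my own):

```python
import numpy as np
import scipy.stats as st

def ab_test(sample_A, sample_B, alpha=.05):
    # Same logic as the ab_test function above
    standard_error = np.sqrt(np.std(sample_A)**2 / len(sample_A)
                             + np.std(sample_B)**2 / len(sample_B))
    test_statistic = (np.mean(sample_A) - np.mean(sample_B)) / standard_error
    crit = st.norm.isf(alpha / 2)   # two-tailed critical value (1.96 at alpha=.05)
    reject = abs(test_statistic) > crit
    return test_statistic, crit, reject

rng = np.random.default_rng(0)
pop_x = rng.normal(0, 1, 1000)
a = rng.choice(pop_x, 30, replace=False)
b = rng.choice(pop_x, 30, replace=False)
print(ab_test(a, b))   # most sample pairs land well inside the critical value
```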
However, let’s continuously draw random sample pairs until we obtain two samples unlikely enough, with alpha set to .0001, to allow us to reject the null hypothesis:
It took 26,193 random samplings to produce sample means unlikely enough to reject the null hypothesis! The likelihood of this happening by pure chance with just one pair of random samples is .002%. Let’s visualize these samples:
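The resampling loop can be sketched like this (the function and variable names are mine, and the attempt count you get depends entirely on the seed):

```python
import numpy as np
import scipy.stats as st

def resample_until_reject(pop_x, rng, alpha=0.0001, n=30, max_tries=500_000):
    """Draw sample pairs until their difference is significant at the given alpha."""
    crit = st.norm.isf(alpha / 2)
    for attempts in range(1, max_tries + 1):
        s1 = rng.choice(pop_x, n, replace=False)
        s2 = rng.choice(pop_x, n, replace=False)
        se = np.sqrt(np.std(s1)**2 / n + np.std(s2)**2 / n)
        z = (np.mean(s1) - np.mean(s2)) / se
        if abs(z) > crit:
            return attempts, z
    return max_tries, z

rng = np.random.default_rng(42)
pop_x = rng.normal(0, 1, 1000)
attempts, z = resample_until_reject(pop_x, rng)
print(attempts, z)
```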
We can see that most of the red dots are to the right of zero, and most of the green dots are to the left, yet we obtained these two samples by pure chance (after 26,193 contrived samples)!
The point here is that with real data, where we don’t know if two samples come from different populations, we must rely on A/B testing to tell us if the samples are really different enough to confidently make claims about their populations.
This is crucial information for companies testing out different website versions or drug efficacies, and with a higher significance level such as .05, it’s actually relatively common to obtain samples which allow us to wrongly reject the null hypothesis based on pure chance (also known as Type I error). Because of this, it’s common to introduce corrections such as Bonferroni’s Correction to reduce the error that arises from increased parameter space, which is known as the multiple comparison problem or the look-elsewhere effect.
As always, it’s important to understand your data inside and out before making strong claims about it, even when wielding statistical tests and performing A/B testing, because apparent-but-false statistical significance is rampant in many fields. It’s surprisingly easy to obtain seemingly unlikely sample data, despite the contrived effort it took in the example above: even changing the random seed in this Jupyter notebook to 42 allows us, through sheer ‘luck’, to obtain two sample means that reject the null hypothesis at an alpha level of .0001 with only a few hundred samplings. Try it.
Messing with the data yourself is truly helpful for understanding the statistical intuitions behind testing, especially A/B testing, which has so many moving parts. But all hypothesis testing is virtually the same procedure. So generate some data, try it yourself, and thanks for reading!
Originally published at: https://towardsdatascience.com/a-simple-guide-to-a-b-testing-in-python-5235289dae57